### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
response = requests.get("https://oracle.com")
assert response.status_code==200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

In [53]:
import pandas as pd

# Bank Marketing

## 1) Business Understanding
### Objective:
#### The classification goal is to predict whether a client will subscribe to a term deposit

### Key Questions:
#### What is the target market segment for the Portuguese banking institution, and what should its customer retention policies be?

### Success Criteria:
#### Provide options to the Portuguese banking institution to optimize its marketing strategies, reduce costs, and increase the conversion rates of their marketing campaigns, thereby improving overall profitability

## 2) Data Understanding

## 2.1 Data Collection

The data relates to direct marketing campaigns of a Portuguese banking institution.
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required
in order to assess whether the product (a bank term deposit) would be subscribed to or not.

In [81]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
  
# variable information 
bank_marketing.variables

Unnamed: 0,name,role,type,demographic,description,units,missing_values
0,age,Feature,Integer,Age,,,no
1,job,Feature,Categorical,Occupation,"type of job (categorical: 'admin.','blue-colla...",,no
2,marital,Feature,Categorical,Marital Status,"marital status (categorical: 'divorced','marri...",,no
3,education,Feature,Categorical,Education Level,"(categorical: 'basic.4y','basic.6y','basic.9y'...",,no
4,default,Feature,Binary,,has credit in default?,,no
5,balance,Feature,Integer,,average yearly balance,euros,no
6,housing,Feature,Binary,,has housing loan?,,no
7,loan,Feature,Binary,,has personal loan?,,no
8,contact,Feature,Categorical,,contact communication type (categorical: 'cell...,,yes
9,day_of_week,Feature,Date,,last contact day of the week,,no


In [30]:
# metadata 
bank_marketing.metadata

{'uci_id': 222,
 'name': 'Bank Marketing',
 'repository_url': 'https://archive.ics.uci.edu/dataset/222/bank+marketing',
 'data_url': 'https://archive.ics.uci.edu/static/public/222/data.csv',
 'abstract': 'The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit (variable y).',
 'area': 'Business',
 'tasks': ['Classification'],
 'characteristics': ['Multivariate'],
 'num_instances': 45211,
 'num_features': 16,
 'feature_types': ['Categorical', 'Integer'],
 'demographics': ['Age', 'Occupation', 'Marital Status', 'Education Level'],
 'target_col': ['y'],
 'index_col': None,
 'has_missing_values': 'yes',
 'missing_values_symbol': 'NaN',
 'year_of_dataset_creation': 2014,
 'last_updated': 'Fri Aug 18 2023',
 'dataset_doi': '10.24432/C5K306',
 'creators': ['S. Moro', 'P. Rita', 'P. Cortez'],
 'intro_paper': {'title': 'A data-driven approach to predict the success of

In [82]:

X.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day_of_week,month,duration,campaign,pdays,previous,poutcome
0,58,management,married,tertiary,no,2143,yes,no,,5,may,261,1,-1,0,
1,44,technician,single,secondary,no,29,yes,no,,5,may,151,1,-1,0,
2,33,entrepreneur,married,secondary,no,2,yes,yes,,5,may,76,1,-1,0,
3,47,blue-collar,married,,no,1506,yes,no,,5,may,92,1,-1,0,
4,33,,single,,no,1,no,no,,5,may,198,1,-1,0,


In [83]:
X.describe()

Unnamed: 0,age,balance,day_of_week,duration,campaign,pdays,previous
count,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0,45211.0
mean,40.93621,1362.272058,15.806419,258.16308,2.763841,40.197828,0.580323
std,10.618762,3044.765829,8.322476,257.527812,3.098021,100.128746,2.303441
min,18.0,-8019.0,1.0,0.0,1.0,-1.0,0.0
25%,33.0,72.0,8.0,103.0,1.0,-1.0,0.0
50%,39.0,448.0,16.0,180.0,2.0,-1.0,0.0
75%,48.0,1428.0,21.0,319.0,3.0,-1.0,0.0
max,95.0,102127.0,31.0,4918.0,63.0,871.0,275.0


Total number of instances: 45,211

Total number of features: 16

Target variable: has the client subscribed to a term deposit? (binary: "yes", "no")
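Because the target is binary, it can help to check how balanced the two classes are before modelling. The following is a minimal sketch on a small hypothetical stand-in (`y_demo`, with made-up values); in the notebook the same calls would apply to the `y` column of the targets DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the target DataFrame; values are illustrative only
y_demo = pd.DataFrame({"y": ["no", "no", "no", "yes", "no", "yes", "no", "no"]})

# Absolute counts and relative frequencies of each class
class_counts = y_demo["y"].value_counts()
class_share = y_demo["y"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

A strongly skewed share would suggest looking beyond plain accuracy when evaluating models.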

## 2.2 Exploratory Data Analysis (EDA)

In [84]:
# To make data exploration easier for the first iteration, I am loading just 10,000 random rows into a dataframe

customer_sample = X.sample(n=10000)

# Explore the dataset
print(customer_sample.head())
print(customer_sample.info())
print(customer_sample.describe())


       age         job   marital  education default  balance housing loan  \
42696   64     retired   married  secondary      no      588      no   no   
36570   37  technician   married   tertiary      no     1733     yes   no   
18685   45  technician  divorced  secondary      no       72     yes   no   
18528   36  management   married   tertiary      no     3770     yes   no   
19937   44  management   married   tertiary      no     5581      no   no   

        contact  day_of_week month  duration  campaign  pdays  previous  \
42696  cellular           18   jan       366         1     91         2   
36570  cellular           12   may       524         3    326         1   
18685  cellular           31   jul       115         5     -1         0   
18528  cellular           31   jul       150         4     -1         0   
19937  cellular            8   aug       202         2     -1         0   

      poutcome  
42696  failure  
36570    other  
18685      NaN  
18528      NaN  
1
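The metadata reports missing values, and the rows above show `NaN` in several categorical columns. One way to quantify this is sketched below on a hypothetical toy frame (`X_demo`, with invented rows); the same two lines should work on `X` or `customer_sample`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in rows mimicking the kind of gaps seen in the sample output
X_demo = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "job": ["management", "technician", "entrepreneur", np.nan],
    "contact": [np.nan, np.nan, "cellular", np.nan],
    "poutcome": [np.nan, np.nan, np.nan, "failure"],
})

# Count and share of missing values per column, largest first
missing = X_demo.isna().sum().sort_values(ascending=False)
missing_share = (missing / len(X_demo)).round(2)

print(pd.DataFrame({"missing": missing, "share": missing_share}))
```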

## 2.3 Handle categorical variables

In [86]:
# Convert categorical variables into numerical values using one-hot encoding

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Define the categorical features
categorical_features = X.select_dtypes(include=['object']).columns

# One-hot encode categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features)
    ],
    remainder='passthrough'
)

# Fit and transform the data
X_encoded = preprocessor.fit_transform(X)

# Convert to DataFrame for easy interpretation (optional)
encoded_df = pd.DataFrame(X_encoded, columns=preprocessor.get_feature_names_out())
encoded_df['target'] = y

encoded_df

Unnamed: 0,cat__job_admin.,cat__job_blue-collar,cat__job_entrepreneur,cat__job_housemaid,cat__job_management,cat__job_retired,cat__job_self-employed,cat__job_services,cat__job_student,cat__job_technician,...,cat__poutcome_success,cat__poutcome_nan,remainder__age,remainder__balance,remainder__day_of_week,remainder__duration,remainder__campaign,remainder__pdays,remainder__previous,target
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,58.0,2143.0,5.0,261.0,1.0,-1.0,0.0,no
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,44.0,29.0,5.0,151.0,1.0,-1.0,0.0,no
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,33.0,2.0,5.0,76.0,1.0,-1.0,0.0,no
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,47.0,1506.0,5.0,92.0,1.0,-1.0,0.0,no
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,33.0,1.0,5.0,198.0,1.0,-1.0,0.0,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,51.0,825.0,17.0,977.0,3.0,-1.0,0.0,yes
45207,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,1.0,71.0,1729.0,17.0,456.0,2.0,-1.0,0.0,yes
45208,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,72.0,5715.0,17.0,1127.0,5.0,184.0,3.0,yes
45209,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,57.0,668.0,17.0,508.0,4.0,-1.0,0.0,no


### Split the data:

In [89]:
# Split the data
X = encoded_df.drop('target', axis=1)
y = encoded_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
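When the classes are imbalanced, a plain random split can leave the two partitions with slightly different yes/no ratios. A possible refinement, sketched here on hypothetical toy data, is to pass `stratify` to `train_test_split` so both partitions keep the same class proportions. This is an optional variation, not what the cell above does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 "no" and 20 "yes"
X_demo = pd.DataFrame({"feature": range(100)})
y_demo = pd.Series(["no"] * 80 + ["yes"] * 20)

# stratify=y_demo keeps the yes/no ratio the same in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(y_tr.value_counts(normalize=True))
print(y_te.value_counts(normalize=True))
```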


### Scale the features:

In [95]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)



array([[-0.35938299,  1.90613718, -0.18348505, ..., -0.56588599,
        -0.41136376, -0.24477164],
       [-0.35938299, -0.52462121, -0.18348505, ..., -0.24538858,
        -0.41136376, -0.24477164],
       [-0.35938299, -0.52462121, -0.18348505, ..., -0.56588599,
        -0.41136376, -0.24477164],
       ...,
       [ 2.78254687, -0.52462121, -0.18348505, ..., -0.56588599,
        -0.41136376, -0.24477164],
       [ 2.78254687, -0.52462121, -0.18348505, ..., -0.24538858,
        -0.41136376, -0.24477164],
       [-0.35938299, -0.52462121, -0.18348505, ..., -0.24538858,
        -0.41136376, -0.24477164]])
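The scaler is fitted on the training set only and then reused on the test set, so no test-set statistics leak into training. A small sketch with hypothetical toy arrays makes the mechanics visible:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrices
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[4.0]])

scaler_demo = StandardScaler()
X_train_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean and std from train only
X_test_scaled = scaler_demo.transform(X_test_demo)        # reuses the train statistics

print(scaler_demo.mean_, scaler_demo.scale_)
print(X_test_scaled)
```

The test value is standardized with the training mean and standard deviation, which is why it can fall well outside the range of the scaled training values.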

## 3) Train and Evaluate Models

### Train each model and evaluate their performance using accuracy, precision, recall, F1-score, and ROC-AUC.
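The five metrics named above can be computed with `sklearn.metrics`. The sketch below uses hypothetical labels and predicted probabilities (not results from this dataset) and assumes "yes" is the positive class with a 0.5 decision threshold.

```python
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
)

# Hypothetical labels and predicted probabilities for the positive class "yes"
y_true = ["no", "no", "yes", "yes", "no", "yes"]
y_prob = [0.1, 0.6, 0.8, 0.4, 0.2, 0.9]
y_pred = ["yes" if p >= 0.5 else "no" for p in y_prob]

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, pos_label="yes"),
    "recall": recall_score(y_true, y_pred, pos_label="yes"),
    "f1": f1_score(y_true, y_pred, pos_label="yes"),
    # ROC-AUC is computed from the probabilities, not the thresholded predictions
    "roc_auc": roc_auc_score([int(t == "yes") for t in y_true], y_prob),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a fitted classifier, the probabilities would typically come from `predict_proba`, taking the column that corresponds to the "yes" class.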

In [101]:
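# KNeighborsClassifier was not imported in the earlier cells
from sklearn.neighbors import KNeighborsClassifier

# Create a pipeline that includes both the preprocessor and the KNN model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier(n_neighbors=5))
])

# The preprocessor selects the original categorical columns by name, so the pipeline
# must be given the raw features. Passing the already-encoded encoded_df raises
# "ValueError: A given column is not a column of the dataframe".
X_raw = bank_marketing.data.features
y_raw = bank_marketing.data.targets['y']  # 'y' is the target column named in the metadata

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.2, random_state=42)

# Fit the pipeline to the training data
pipeline.fit(X_train, y_train)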