# Homework #3

## Dataset

In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score, accuracy_score, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
# Assumes bank-full.csv has already been extracted to bank+marketing/bank/
data = pd.read_csv('bank+marketing/bank/bank-full.csv', delimiter=';')
data.shape

FileNotFoundError: [Errno 2] No such file or directory: 'bank+marketing/bank/bank-full.csv'

In [None]:
data.info()

In [None]:
data.head()

## Features

For the rest of the homework, you'll need to use only these columns:

* `age`,
* `job`,
* `marital`,
* `education`,
* `balance`,
* `housing`,
* `contact`,
* `day`,
* `month`,
* `duration`,
* `campaign`,
* `pdays`,
* `previous`,
* `poutcome`,
* `y`

Select only them.

In [None]:
features = [
    'age', 'job', 'marital', 'education', 'balance', 'housing', 'contact', 
    'day', 'month', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y'
]

In [None]:
data = data[features]

In [None]:
data.info()

In [None]:
data.nunique()

In [None]:
data.isna().sum()

In [None]:
data

## Question 1

What is the most frequent observation (mode) for the column `education`?

In [None]:
data.describe(include=["O"])

In [None]:
data['education'].value_counts()

## Question 2

* Create the correlation matrix for the numerical features of your dataset
* In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset
* What are the two features that have the biggest correlation in this dataset?

In [None]:
# Drop the categorical columns and the target to keep only the numerical features
data_numeric = data.drop(
    ['job', 'marital', 'education', 'housing', 'contact', 'month', 'poutcome', 'y'], axis=1
)
data_numeric.describe()

In [None]:
data_numeric.corr()

In [None]:
plt.figure(figsize=(9, 6))
sns.heatmap(data_numeric.corr(), annot=True, linewidths=.5, cmap="Blues")
plt.title('Heatmap showing correlations between numerical data')
plt.show()

In [None]:
data_numeric.corr().unstack().sort_values(ascending=False)

The two features with the biggest correlation are `pdays` and `previous`.
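
Sorting the unstacked correlation matrix also lists every feature's correlation with itself (always 1.0) and shows each pair twice. Below is a minimal sketch of one way to rank only distinct pairs. The toy frame and its column names `a`, `b`, `c` are hypothetical and are not taken from the bank data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the technique
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 3, 4, 1, 2],
})

corr = toy.corr()

# Keep the strict upper triangle: each pair appears once
# and the diagonal of self-correlations is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank by absolute value so strong negative correlations count too.
top_pair = pairs.abs().idxmax()
print(top_pair, round(float(pairs[top_pair]), 3))
```

The same masking could be applied to `data_numeric.corr()` in place of `toy.corr()`.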

## Target encoding

* Now we want to encode the `y` variable
* Let's replace the values `yes`/`no` with `1`/`0`

In [None]:
data.y = (data.y == 'yes').astype(int)
data

## Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`
* Make sure that the target value `y` is not in your dataframe

In [None]:
SEED = 42

In [None]:
df_full_train, df_test = train_test_split(data, test_size=0.2, random_state=SEED)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=SEED)

assert len(data) == (len(df_train) + len(df_val) + len(df_test))

In [None]:
len(df_train), len(df_val), len(df_test)
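
The second split uses `test_size=0.25` rather than `0.2` because it is applied to the 80% that remains after the test set is removed. A small sketch of that arithmetic (the variable names are illustrative only):

```python
# Fractions of the full dataset produced by two successive splits.
test_fraction = 0.2                        # first split: 20% held out for test
full_train_fraction = 1 - test_fraction    # 80% remains

val_share_of_full_train = 0.25             # second split: 25% of the remaining 80%
val_fraction = full_train_fraction * val_share_of_full_train
train_fraction = full_train_fraction - val_fraction

print(round(train_fraction, 2), round(val_fraction, 2), round(test_fraction, 2))
```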

In [None]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [None]:
y_train = df_train.y.values
y_val = df_val.y.values
y_test = df_test.y.values

## Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only
* Round the scores to 2 decimals using `round(score, 2)`
* Which of these variables has the biggest score?

In [None]:
def calculate_mi(series):
    return mutual_info_score(series, df_train.y)

In [None]:
cat = ['job', 'marital', 'education', 'housing', 'contact', 'month', 'poutcome']

In [None]:
df_mi = df_train[cat].apply(calculate_mi)
df_mi = df_mi.sort_values(ascending=False).to_frame(name='MI')
df_mi

`poutcome` has the biggest score.
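
To make the score easier to interpret, here is a minimal from-scratch sketch of mutual information between two label sequences, using the natural logarithm (the same base that `mutual_info_score` is documented to use). The helper `mutual_information` and the toy labels are hypothetical and only for illustration.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), expressed with raw counts
        mi += p_xy * log(count * n / (px[x] * py[y]))
    return mi


# Hypothetical toy labels
contact = ["cell", "cell", "phone", "phone"]
target_dependent = [1, 1, 0, 0]      # fully determined by `contact`
target_independent = [1, 0, 1, 0]    # unrelated to `contact`

mi_dependent = mutual_information(contact, target_dependent)
mi_independent = mutual_information(contact, target_independent)
print(round(mi_dependent, 2), round(mi_independent, 2))
```

A score of zero means the variable tells us nothing about the target. A larger score means that knowing the variable reduces more of the uncertainty about the target.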

In [None]:
df_train = df_train.drop('y', axis=1)
df_val = df_val.drop('y', axis=1)
df_test = df_test.drop('y', axis=1)

assert 'y' not in df_train.columns
assert 'y' not in df_val.columns
assert 'y' not in df_test.columns

## Question 4

* Now let's train a logistic regression
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding
* Fit the model on the training dataset:
    * To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    * `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits

In [None]:
dv = DictVectorizer(sparse=False)
train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

In [None]:
model = LogisticRegression(solver='liblinear', max_iter=1000, C=1.0, random_state=SEED)
model.fit(X_train, y_train)

In [None]:
val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

y_pred = model.predict(X_val)

In [None]:
original_score = accuracy_score(y_val, y_pred)
original_score

In [None]:
y_pred = model.predict_proba(X_val)[:, 1]
term_decision = (y_pred >= 0.5)
accuracy = (y_val == term_decision).mean()
accuracy

In [None]:
accuracy = np.round(original_score, 2)
print(f'Accuracy = {accuracy}')

## Question 5

* Let's find the least useful feature using the _feature elimination_ technique
* Train a model with all these features (using the same parameters as in Q4)
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature
* Which of the following features has the smallest difference?
    - `age`
    - `balance`
    - `marital`
    - `previous`
    
> **note:** the difference doesn't have to be positive

In [None]:
features = df_train.columns.to_list()
features

In [None]:
scores = pd.DataFrame(columns=['eliminated_feature', 'accuracy', 'difference'])
for feature in features:
    subset = features.copy()
    subset.remove(feature)
    
    dv = DictVectorizer(sparse=False)
    train_dict = df_train[subset].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)

    model = LogisticRegression(solver='liblinear', max_iter=1000, C=1.0, random_state=SEED)
    model.fit(X_train, y_train)
    
    val_dict = df_val[subset].to_dict(orient='records')
    X_val = dv.transform(val_dict)
    
    y_pred = model.predict(X_val)
    score = accuracy_score(y_val, y_pred)
    
    scores.loc[len(scores)] = [feature, score, original_score - score]

In [None]:
scores

In [None]:
# Cast to float first: rows appended via .loc can leave the column with object dtype
scores.loc[[scores['difference'].astype(float).abs().idxmin()]]

The `age` and `balance` features are the least important in our case, so either of them can be chosen.

> **note:** you can get other answers in Google Colab. Therefore, we decided to mark all the answers as correct.

## Question 6

* Now let's train a regularized logistic regression
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`
* Train models using all the features as in Q4
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits
* Which of these `C` leads to the best accuracy on the validation set?
> **note:** If there are multiple options, select the smallest `C`.
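
In Scikit-Learn, `C` is the inverse of the regularization strength, so smaller values shrink the coefficients more. A minimal sketch on synthetic data (the data and the names `X_demo`, `y_demo`, `norms` are assumptions made for illustration, not part of the homework):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, noisy binary problem (hypothetical data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

norms = {}
for C in [0.01, 1, 100]:
    clf = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    clf.fit(X_demo, y_demo)
    # L2 norm of the fitted coefficients: smaller C should give a smaller norm
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C = {C}: coefficient norm = {norms[C]:.3f}")
```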

In [None]:
y_train.shape, y_val.shape

In [None]:
dv = DictVectorizer(sparse=False)
train_dict = df_train.to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val.to_dict(orient='records')
X_val = dv.transform(val_dict)

In [None]:
scores = {}
for C in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver='liblinear', max_iter=1000, C=C, random_state=SEED)
    model.fit(X_train, y_train)
    
    y_pred = model.predict(X_val)
    
    score = accuracy_score(y_val, y_pred)
    scores[C] = round(score, 3)
    print(f'C = {C}:\t Accuracy = {score}')

In [None]:
scores

In [None]:
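# max() returns the first key with the highest rounded accuracy; the keys were
# inserted in ascending order, so ties resolve to the smallest C.
print(f'The best `C` (smallest among ties) is {max(scores, key=scores.get)}.')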

`C = 0.1` is also a valid answer.