# MACHINE LEARNING ZOOMCAMP - COHORT 2024

## HOMEWORK 3 - Classification
### Angole Daniel
Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.

## Dataset
In this homework, we will use the Bank Marketing dataset. Download it from here.

Or you can do it with wget:

wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip

We need the bank/bank-full.csv file from the downloaded zip file.
In this dataset, the target for the classification task is the y variable: whether the client has subscribed to a term deposit or not.

In [1]:
import pandas as pd


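# Note: this reads bank.csv rather than the bank/bank-full.csv file named above,
# so every output below reflects bank.csv.
df = pd.read_csv("bank.csv", sep=";")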
df

| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
| 1 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
| 2 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
| 3 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
| 4 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4516 | 33 | services | married | secondary | no | -333 | yes | no | cellular | 30 | jul | 329 | 5 | -1 | 0 | unknown | no |
| 4517 | 57 | self-employed | married | tertiary | yes | -3313 | yes | yes | unknown | 9 | may | 153 | 1 | -1 | 0 | unknown | no |
| 4518 | 57 | technician | married | secondary | no | 295 | no | no | cellular | 19 | aug | 151 | 11 | -1 | 0 | unknown | no |
| 4519 | 28 | blue-collar | married | secondary | no | 1137 | no | no | cellular | 6 | feb | 129 | 4 | 211 | 3 | other | no |


## Features
For the rest of the homework, you'll need to use only these columns:

- age,
- job,
- marital,
- education,
- balance,
- housing,
- contact,
- day,
- month,
- duration,
- campaign,
- pdays,
- previous,
- poutcome,
- y
## Data preparation
- Select only the features from above.
- Check whether the features contain missing values.

In [2]:
df = df[[
    "age",
    "job",
    "marital",
    "education",
    "balance",
    "housing",
    "contact",
    "day",
    "month",
    "duration",
    "campaign",
    "pdays",
    "previous",
    "poutcome",
    "y",
]]

In [3]:
numerical = df.select_dtypes("number").columns.to_list()
categorical = df.select_dtypes("object").columns.to_list()
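
The missing-value check from the data preparation list is not shown in the cells above. A minimal sketch of one way to do it, using a small made-up frame (`toy` and its values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy frame standing in for df; the values are made up for illustration.
toy = pd.DataFrame({
    "age": [30, 33, None],
    "job": ["services", None, "management"],
    "balance": [1787, 4789, 1350],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real data the same two calls would be applied to `df`. Note that the rows shown earlier mark unknown values with the string `unknown`, which `isnull()` does not count as missing.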

## Question 1
What is the most frequent observation (mode) for the column education?

In [4]:
df.education.mode()

0    secondary
Name: education, dtype: object

## Question 2
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.
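
By default, pandas' `DataFrame.corr` computes the Pearson coefficient. As a rough sketch of what that number means for one pair of columns, here is a from-scratch version on made-up values (`pearson`, `xs`, `ys`, and `zs` are illustrative names, not part of the homework):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]   # rises linearly with xs
zs = [10, 8, 6, 4, 2]   # falls linearly with xs

print(pearson(xs, ys))
print(pearson(xs, zs))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.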


In [5]:
correlation_matrix = df[numerical].corr()
correlation_matrix

| | age | balance | day | duration | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|
| age | 1.0 | 0.08382 | -0.017853 | -0.002367 | -0.005148 | -0.008894 | -0.003511 |
| balance | 0.08382 | 1.0 | -0.008677 | -0.01595 | -0.009976 | 0.009437 | 0.026196 |
| day | -0.017853 | -0.008677 | 1.0 | -0.024629 | 0.160706 | -0.094352 | -0.059114 |
| duration | -0.002367 | -0.01595 | -0.024629 | 1.0 | -0.068382 | 0.01038 | 0.01808 |
| campaign | -0.005148 | -0.009976 | 0.160706 | -0.068382 | 1.0 | -0.093137 | -0.067833 |
| pdays | -0.008894 | 0.009437 | -0.094352 | 0.01038 | -0.093137 | 1.0 | 0.577562 |
| previous | -0.003511 | 0.026196 | -0.059114 | 0.01808 | -0.067833 | 0.577562 | 1.0 |


#### What are the two features that have the biggest correlation?

In [6]:
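# Flatten the matrix to (feature, feature) pairs and take absolute values.
abs_corr = correlation_matrix.unstack().abs()
# Drop the diagonal (self-correlation of 1) and pick the strongest remaining pair.
abs_corr[abs_corr.lt(1)].idxmax()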

('pdays', 'previous')

### Q2. Ans: 
'pdays' & 'previous'

## Target encoding
- Now we want to encode the y variable.
- Let's replace the values yes/no with 1/0.
### Split the data
- Split your data in train/val/test sets with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value y is not in your dataframe.

In [7]:
df = (
    df
    .assign(y=(df.y == "yes").astype(int))
)

In [8]:
from sklearn.model_selection import train_test_split

df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

y_full_train = df_full_train.y.values
y_train = df_train.y.values
y_test = df_test.y.values
y_val = df_val.y.values

df_full_train = df_full_train.drop(columns="y")
df_train = df_train.drop(columns="y")
df_test = df_test.drop(columns="y")
df_val = df_val.drop(columns="y")
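
The second split uses `test_size=0.25` because it is applied to the remaining 80% of the data: 0.25 × 0.8 = 0.2, which gives the 60%/20%/20% distribution. A quick check of that arithmetic on a hypothetical 1,000-row dataset (the row count is an assumption chosen to divide evenly):

```python
n_rows = 1000  # hypothetical dataset size

n_test = round(n_rows * 0.2)           # first split: 20% held out for test
n_full_train = n_rows - n_test         # remaining 80%
n_val = round(n_full_train * 0.25)     # second split: 25% of the 80%
n_train = n_full_train - n_val

print(n_train, n_val, n_test)
print(n_train / n_rows, n_val / n_rows, n_test / n_rows)
```

This only illustrates the proportions. When the sizes do not divide evenly, the row counts produced by `train_test_split` may differ by a row because of its own rounding.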

## Question 3
- Calculate the mutual information score between y and other categorical variables in the dataset. Use the training set only.
- Round the scores to 2 decimals using round(score, 2).

In [9]:
from sklearn.metrics import mutual_info_score

for col in ["contact", "education", "housing", "poutcome"]:
    print(round(mutual_info_score(y_train, df_train[col]), 2))

0.01
0.0
0.01
0.03


Which of these variables has the biggest mutual information score?
- contact
- education
- housing
- poutcome

### Ans: 
poutcome
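
To make the score less of a black box, here is a from-scratch sketch of mutual information between two discrete variables, using the natural logarithm as `mutual_info_score` does. The function name `mutual_info` and the toy sequences are made up for illustration:

```python
from collections import Counter
from math import log


def mutual_info(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    score = 0.0
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    for (x, y), n_xy in count_xy.items():
        p_xy = n_xy / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        score += p_xy * log(p_xy / (p_x * p_y))
    return score


# Made-up data: the target follows `informative` exactly and ignores `noise`.
target = [1, 1, 0, 0]
informative = ["success", "success", "failure", "failure"]
noise = ["yes", "no", "yes", "no"]

print(round(mutual_info(target, informative), 2))
print(round(mutual_info(target, noise), 2))
```

A variable that fully determines the target scores high, and one that is independent of it scores zero, which is why the largest score is read as the most informative variable.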


## Question 4
- Now let's train a logistic regression.
- Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
- Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
What accuracy did you get?
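
The solution below performs the one-hot encoding with `DictVectorizer`. A small sketch of what it does to a record, on made-up rows (`records`, `dv_demo`, and `X_demo` are illustrative names):

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up records with one categorical and one numerical feature.
records = [
    {"job": "services", "age": 33},
    {"job": "management", "age": 35},
]

dv_demo = DictVectorizer(sparse=False)
X_demo = dv_demo.fit_transform(records)

# String values become one column per category; numbers pass through as-is.
print(dv_demo.feature_names_)
print(X_demo)

# A category not seen during fit is ignored: all of its one-hot columns are 0.
X_unseen = dv_demo.transform([{"job": "student", "age": 20}])
print(X_unseen)
```

This is why the vectorizer is fitted on the training records only and then reused to transform the validation records.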

In [11]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


dicts_full_train = df_full_train.to_dict(orient="records")
dicts_train = df_train.to_dict(orient="records")
dicts_test = df_test.to_dict(orient="records")
dicts_val = df_val.to_dict(orient="records")

dv = DictVectorizer(sparse=False)
dv.fit(dicts_train)

X_train = dv.transform(dicts_train)
X_val = dv.transform(dicts_val)

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

y_pred_val = model.predict(X_val)

(y_pred_val == y_val).mean().round(2)


0.89

In [12]:
accuracy_all = (y_pred_val == y_val).mean()

## Question 5
- Let's find the least useful feature using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?
- age
- balance
- marital
- previous

Note: The difference doesn't have to be positive.

In [14]:
results = []

for feature_to_exclude in df_train.columns:
    
    dicts_train = df_train.drop(columns=feature_to_exclude).to_dict(orient="records")
    dicts_val = df_val.drop(columns=feature_to_exclude).to_dict(orient="records")

    dv = DictVectorizer(sparse=False)
    dv.fit(dicts_train)

    X_train = dv.transform(dicts_train)
    X_val = dv.transform(dicts_val)

    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred_val = model.predict(X_val)
    accuracy = (y_pred_val == y_val).mean()
    difference = abs(accuracy - accuracy_all)
    results.append((feature_to_exclude, accuracy, difference))

    # print(f"Excluded feature '{feature_to_exclude}', Accuracy: {accuracy}, Accuracy difference with baseline: {(accuracy - accuracy_all).round(2)}")

# Build the summary table once, after the loop has finished.
df_results = pd.DataFrame(data=results, columns=["excluded feature", "accuracy", "difference"])

In [15]:
df_results.sort_values(by="difference")

| | excluded feature | accuracy | difference |
|---|---|---|---|
| 0 | age | 0.887168 | 0.0 |
| 5 | housing | 0.887168 | 0.0 |
| 10 | campaign | 0.887168 | 0.0 |
| 12 | previous | 0.887168 | 0.0 |
| 4 | balance | 0.886062 | 0.001106 |
| 11 | pdays | 0.888274 | 0.001106 |
| 6 | contact | 0.884956 | 0.002212 |
| 7 | day | 0.884956 | 0.002212 |
| 8 | month | 0.884956 | 0.002212 |
| 3 | education | 0.889381 | 0.002212 |


### Ans:
age
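
The note in the question says the difference doesn't have to be positive, while the loop above stores `abs(accuracy - accuracy_all)`. A small sketch of the signed version, reusing a few accuracies from the results table. The baseline of 0.887168 is inferred from the rows whose difference is 0.0, and the names `baseline`, `accuracy_without`, and `signed_diff` are illustrative:

```python
baseline = 0.887168  # inferred: rows with difference 0.0 have this accuracy

# Accuracy after dropping each feature, copied from the results table.
accuracy_without = {
    "age": 0.887168,
    "balance": 0.886062,
    "pdays": 0.888274,
    "education": 0.889381,
}

# Positive: removing the feature hurt accuracy; negative: removing it helped.
signed_diff = {
    feature: round(baseline - acc, 6)
    for feature, acc in accuracy_without.items()
}

for feature, diff in signed_diff.items():
    print(f"{feature:>10}  {diff:+.6f}")
```

With signed values, `balance` and `pdays` no longer look identical: dropping one lowers the validation accuracy slightly and dropping the other raises it.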

## Question 6
- Now let's train a regularized logistic regression.
- Let's try the following values of the parameter C: [0.01, 0.1, 1, 10, 100].
- Train models using all the features as in Q4.
- Calculate the accuracy on the validation dataset and round it to 3 decimal digits.
- Which of these C leads to the best accuracy on the validation set?

Note: If there are multiple options, select the smallest C.

In [16]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


dicts_full_train = df_full_train.to_dict(orient="records")
dicts_train = df_train.to_dict(orient="records")
dicts_test = df_test.to_dict(orient="records")
dicts_val = df_val.to_dict(orient="records")

dv = DictVectorizer(sparse=False)
dv.fit(dicts_train)

X_train = dv.transform(dicts_train)
X_val = dv.transform(dicts_val)

results = []

for c in [0.01, 0.1, 1, 10, 100]:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)

    y_pred_val = model.predict(X_val)

    accuracy = (y_pred_val == y_val).mean().round(3)

    results.append((c, accuracy))

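# Build the summary table once, after the loop has finished.
df_results = pd.DataFrame(data=results, columns=["C", "accuracy"])

# idxmax returns the label of the first maximum; the C values are tried in
# ascending order, so ties resolve to the smallest C.
df_results.loc[df_results["accuracy"].idxmax()]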

C           1.000
accuracy    0.887
Name: 2, dtype: float64
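
In scikit-learn, `C` is the inverse of the regularization strength, so smaller values regularize more strongly. The selection above relies on `idxmax` returning the first maximum, which is the smallest C only because the candidates are tried in ascending order. A sketch of a tie-break that does not depend on that order, using hypothetical accuracies (`results_demo` and its values are made up):

```python
# Hypothetical (C, accuracy) pairs, deliberately unordered and with a tie.
results_demo = [
    (10, 0.887),
    (0.01, 0.880),
    (1, 0.887),
    (100, 0.885),
    (0.1, 0.884),
]

# Highest accuracy first; among equal accuracies, the smallest C wins.
best_c, best_accuracy = min(results_demo, key=lambda pair: (-pair[1], pair[0]))

print(best_c, best_accuracy)
```

Sorting on the pair `(-accuracy, C)` makes the rule from the note explicit instead of leaving it to the iteration order.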