# Homework 2

- [homework](https://github.com/alexkolo/ml-zoomcamp-2024/blob/main/cohorts/2024/03-classification/homework.md)
- [submit here](https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw03)
    - Due date: 15 October 2024 01:00 (local time)
    - [link to notebook](https://github.com/alexkolo/ml-zoomcamp-2024/blob/main/cohorts/2024/03-classification/hw03_my_answers.ipynb)

### Dataset

In this homework, we will use the Bank Marketing dataset. Download it from [here](https://archive.ics.uci.edu/static/public/222/bank+marketing.zip).

Or you can do it with `wget`:

```bash
wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip
```

We need the `bank/bank-full.csv` file from the downloaded zip file.  
In this dataset, the target for the classification task is the `y` variable: whether the client has subscribed to a term deposit or not.

In [None]:
# !wget https://archive.ics.uci.edu/static/public/222/bank+marketing.zip

### Features

For the rest of the homework, you'll need to use only these columns:

* `age`,
* `job`,
* `marital`,
* `education`,
* `balance`,
* `housing`,
* `contact`,
* `day`,
* `month`,
* `duration`,
* `campaign`,
* `pdays`,
* `previous`,
* `poutcome`,
* `y`

### Data preparation

* Select only the features from above.
* Check whether the features contain missing values.

In [2]:
import pandas as pd

df: pd.DataFrame = pd.read_csv(filepath_or_buffer="bank-full.csv", sep=";")
cols: list[str] = [
    "age",
    "job",
    "marital",
    "education",
    "balance",
    "housing",
    "contact",
    "day",
    "month",
    "duration",
    "campaign",
    "pdays",
    "previous",
    "poutcome",
    "y",
]
df = df[cols]

In [3]:
df.isna().sum()

age          0
job          0
marital      0
education    0
balance      0
housing      0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

### Question 1

What is the most frequent observation (mode) for the column `education`?

- `unknown`
- `primary`
- `secondary`
- `tertiary`

#### Answer

`secondary`

In [4]:
df.mode()

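| | age | job | marital | education | balance | housing | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32 | blue-collar | married | secondary | 0 | yes | cellular | 20 | may | 124 | 1 | -1 | 0 | unknown | no |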


### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.
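
As a minimal sketch of what each cell of such a matrix holds, the Pearson coefficient can be computed by hand on a small made-up table. The `toy` columns and the `pearson` helper are illustrative assumptions and are not part of the homework; pandas' `DataFrame.corr()` computes the same coefficient by default.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (the covariance up to a 1/n factor).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy columns, not taken from the bank dataset.
toy: dict[str, list[float]] = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, 3.0, 2.0, 4.0],
}

# Correlation matrix: one coefficient for every pair of columns.
corr_matrix = {
    left: {right: pearson(toy[left], toy[right]) for right in toy}
    for left in toy
}

for name, row in corr_matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

The diagonal is always 1 and the matrix is symmetric, which is why each pair shows up twice once the matrix is unstacked.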

What are the two features that have the biggest correlation?

- `age` and `balance`
- `day` and `campaign`
- `day` and `pdays`
- `pdays` and `previous`

#### Answer

"`pdays` and `previous`"

In [5]:
numerical: list[str] = df.select_dtypes(include="number").columns.to_list()

In [6]:
# compute correlation matrix
corr: pd.DataFrame = df.select_dtypes(include="number").corr()

In [7]:
# unstacked dataframe
corr_unstacked: pd.Series = corr.unstack()  # type: ignore
# sort values from biggest to smallest
corr_unstacked = corr_unstacked.sort_values(ascending=False)
# ignore diagonal
corr_unstacked = corr_unstacked[corr_unstacked.index.get_level_values(0) != corr_unstacked.index.get_level_values(1)]
print(corr_unstacked.head(n=4))

pdays     previous    0.45482
previous  pdays       0.45482
day       campaign    0.16249
campaign  day         0.16249
dtype: float64


### Target encoding

* Now we want to encode the `y` variable.
* Let's replace the values `yes`/`no` with `1`/`0`.


In [8]:
df["y"] = df["y"].map(arg={"yes": 1, "no": 0})


### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.
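
One detail of a two-step split is easy to miss: the second `train_test_split` call only sees the 80% left after the test set is removed, so a validation share of 20% of the whole dataset corresponds to `test_size=0.25` in that second call (0.2 / 0.8). A minimal sketch on 100 made-up rows; the `toy` frame and the variable names are assumptions for illustration only:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with 100 rows, standing in for the real dataset.
toy = pd.DataFrame({"x": range(100)})

# Step 1: hold out 20% of all rows as the test set.
toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# Step 2: 0.25 of the remaining 80% equals 20% of the original rows.
toy_train, toy_val = train_test_split(toy_full_train, test_size=0.25, random_state=42)
print(len(toy_train), len(toy_val), len(toy_test))

# For comparison: reusing 0.2 in step 2 takes 20% of the 80% remainder.
alt_train, alt_val = train_test_split(toy_full_train, test_size=0.2, random_state=42)
print(len(alt_train), len(alt_val), len(toy_test))
```

The first variant gives 60/20/20 rows, the second 64/16/20.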

In [9]:
from sklearn.model_selection import train_test_split

seed = 42
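f_test = 0.2
f_val = 0.2

In [10]:
# Note: `f_val` is applied to the 80% left after removing the test set, so this
# produces a 64%/16%/20% split; `test_size=f_val / (1 - f_test)` (i.e. 0.25)
# would give 60%/20%/20%. The outputs recorded in this notebook use the split below.
df_full_train, df_test = train_test_split(df.drop(columns="y"), test_size=f_test, random_state=seed)
df_train, df_val = train_test_split(df_full_train, test_size=f_val, random_state=seed)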

### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.
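
To make the score itself less of a black box, here is a hand-rolled sketch of mutual information between two discrete variables, using the natural logarithm as `sklearn.metrics.mutual_info_score` does. The `mutual_information` helper and the toy lists are assumptions for illustration and are not taken from the dataset.

```python
import math
from collections import Counter


def mutual_information(xs: list[str], ys: list[int]) -> float:
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    score = 0.0
    for (x, y), n_xy in count_xy.items():
        p_xy = n_xy / n
        # p(x, y) / (p(x) * p(y)), written with raw counts.
        ratio = n_xy * n / (count_x[x] * count_y[y])
        score += p_xy * math.log(ratio)
    return score


# Hypothetical toy data: the first feature determines the target,
# the second one is independent of it.
target = [1, 1, 0, 0, 1, 1, 0, 0]
informative = ["a", "a", "b", "b", "a", "a", "b", "b"]
uninformative = ["a", "b", "a", "b", "a", "b", "a", "b"]

print(round(mutual_information(informative, target), 2))
print(round(mutual_information(uninformative, target), 2))
```

A feature that fully determines a balanced binary target scores ln 2 (about 0.69), while an independent feature scores 0; real features fall in between.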

Which of these variables has the biggest mutual information score?
  
- `contact`
- `education`
- `housing`
- `poutcome`

#### Answer

`poutcome`

In [11]:
from sklearn.metrics import mutual_info_score

labels_pred = df.y[df_train.index]


def mutual_info_y_score(series: pd.Series) -> float:
    # mutual information between one categorical column and the target `y`
    return mutual_info_score(labels_true=series, labels_pred=labels_pred)


df_train: pd.DataFrame
categorical: list[str] = df_train.select_dtypes(exclude="number").columns.to_list()
mi = df_train[categorical].apply(mutual_info_y_score)
mi.sort_values(ascending=False)

poutcome     0.029389
month        0.024972
contact      0.013437
housing      0.010465
job          0.007172
education    0.002777
marital      0.002019
dtype: float64

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.6
- 0.7
- 0.8
- 0.9

#### Answer

`0.90`

#### One-hot encoding
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.

In [12]:
# one-hot encoding of categorical variables
# df_train_1hot: pd.DataFrame = pd.get_dummies(data=df_train, columns=categorical, drop_first=True)
# df_val_1hot: pd.DataFrame = pd.get_dummies(data=df_val, columns=categorical, drop_first=True)
# df_test_1hot: pd.DataFrame = pd.get_dummies(data=df_test, columns=categorical, drop_first=True)

In [13]:
from sklearn.feature_extraction import DictVectorizer
import numpy as np

In [14]:
numerical: list[str] = df_train.select_dtypes(include="number").columns.to_list()
categorical: list[str] = df_train.select_dtypes(exclude="number").columns.to_list()
features: list[str] = categorical + numerical

In [15]:
dv = DictVectorizer(sparse=False)

train_dict: dict = df_train[features].to_dict(orient="records")
X_train: np.ndarray = dv.fit_transform(X=train_dict)

val_dict: dict = df_val[features].to_dict(orient="records")
X_val: np.ndarray = dv.transform(X=val_dict)

test_dict: dict = df_test[features].to_dict(orient="records")
X_test: np.ndarray = dv.transform(X=test_dict)

#### Fit the model on the training dataset.

- To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
- `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`

In [16]:
y_train: np.ndarray = df.y[df_train.index].to_numpy()
y_val: np.ndarray = df.y[df_val.index].to_numpy()
y_test: np.ndarray = df.y[df_test.index].to_numpy()

In [17]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)

In [18]:
model.fit(X=X_train, y=y_train)

#### Accuracy
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
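
As a quick illustration with made-up numbers (the arrays and the name `toy_accuracy` are assumptions, not notebook results), accuracy is the share of rows where the thresholded prediction matches the label:

```python
import numpy as np

# Hypothetical predicted probabilities and true labels for five rows.
proba = np.array([0.91, 0.40, 0.65, 0.08, 0.55])
actual = np.array([1, 0, 0, 0, 1])

# Threshold at 0.5 to turn probabilities into class labels.
predicted = (proba >= 0.5).astype(int)

# Accuracy is the fraction of rows where prediction and label agree.
toy_accuracy = (predicted == actual).mean()
print(round(toy_accuracy, 2))
```

Four of the five toy predictions are correct, so the accuracy is 0.8. `sklearn.metrics.accuracy_score(actual, predicted)` should return the same value.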

In [19]:
y_pred: np.ndarray = model.predict_proba(X=X_val)[:, 1]

In [20]:
pred_class: np.ndarray = y_pred >= 0.5

In [21]:
res_val = pd.DataFrame()
res_val["probability"] = y_pred
res_val["prediction"] = pred_class.astype(dtype=int)
res_val["actual"] = y_val
res_val["correct"] = res_val["prediction"] == res_val["actual"]

In [22]:
# accuracy
print(round(number=res_val["correct"].mean(), ndigits=2))

0.9


In [23]:
accuracy: float = res_val["correct"].mean()
accuracy

np.float64(0.9008847110865358)

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model with all these features (using the same parameters as in Q4).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 
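
The steps above can be sketched independently of any particular model. In this sketch, `elimination_differences` and `fake_score` are hypothetical helpers, the feature names `f1` to `f3` are placeholders, and the accuracies are made up; in practice the scoring function would train a model on the given features and return its validation accuracy.

```python
from typing import Callable


def elimination_differences(
    features: list[str],
    score_fn: Callable[[list[str]], float],
) -> dict[str, float]:
    """Return baseline accuracy minus accuracy without each feature."""
    baseline = score_fn(features)
    differences: dict[str, float] = {}
    for feature in features:
        reduced = [name for name in features if name != feature]
        differences[feature] = baseline - score_fn(reduced)
    return differences


# Made-up accuracies per feature subset, for illustration only.
made_up_accuracy = {
    frozenset({"f1", "f2", "f3"}): 0.900,
    frozenset({"f2", "f3"}): 0.898,
    frozenset({"f1", "f3"}): 0.901,
    frozenset({"f1", "f2"}): 0.895,
}


def fake_score(feature_subset: list[str]) -> float:
    # Stand-in for "fit on these features, return validation accuracy".
    return made_up_accuracy[frozenset(feature_subset)]


diffs = elimination_differences(["f1", "f2", "f3"], fake_score)
# Rank by absolute difference: the smallest change marks the least useful feature.
least_useful = min(diffs, key=lambda name: abs(diffs[name]))
print(diffs, least_useful)
```

Ranking by absolute value is one reading of "smallest difference"; a negative difference means the model scored slightly better without that feature.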

Which of the following features has the smallest difference?

- `age`
- `balance`
- `marital`
- `previous`

> **Note**: The difference doesn't have to be positive.

#### Answer

`balance`

In [24]:
numerical: list[str] = df_train.select_dtypes(include="number").columns.to_list()
categorical: list[str] = df_train.select_dtypes(exclude="number").columns.to_list()
features: list[str] = categorical + numerical

In [25]:
res: dict[str, float] = {}

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
dv = DictVectorizer(sparse=False)
for feature in ["age", "balance", "marital", "previous"]:  # features:
    print(feature)
    tmp_features: list[str] = features.copy()
    tmp_features.remove(feature)
    train_dict: dict = df_train[tmp_features].to_dict(orient="records")
    X_train: np.ndarray = dv.fit_transform(X=train_dict)
    y_train: np.ndarray = df.y[df_train.index].to_numpy()

    val_dict: dict = df_val[tmp_features].to_dict(orient="records")
    X_val: np.ndarray = dv.transform(X=val_dict)
    y_val: np.ndarray = df.y[df_val.index].to_numpy()

    model.fit(X=X_train, y=y_train)
    y_pred: np.ndarray = model.predict_proba(X=X_val)[:, 1]
    res[feature] = ((y_pred >= 0.5).astype(dtype=int) == y_val).astype(dtype=int).mean()


age
balance
marital
previous


In [26]:
(pd.Series(data=res) - accuracy).abs().sort_values().head(n=1)

balance    0.000276
dtype: float64

### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.
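
In scikit-learn, `C` is the inverse of the regularization strength, so smaller values penalize large weights more heavily. A minimal sketch on synthetic data shows the effect on the fitted weights; `X_demo`, `y_demo` and `weight_norms` are made-up names and the data is random, not the bank dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one informative feature plus one noise feature.
rng = np.random.default_rng(seed=0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

weight_norms: dict[float, float] = {}
for c_value in [0.01, 1, 100]:
    clf = LogisticRegression(solver="liblinear", C=c_value, max_iter=1000, random_state=42)
    clf.fit(X_demo, y_demo)
    # liblinear also penalizes the intercept, so include it in the norm.
    weights = np.append(clf.coef_.ravel(), clf.intercept_)
    weight_norms[c_value] = float(np.linalg.norm(weights))

# Smaller C means a stronger penalty and therefore smaller weights.
print(weight_norms)
```

Whether stronger or weaker regularization gives better validation accuracy depends on the data, which is why several values are tried.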

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.
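
The rounding and tie-breaking rule can be made explicit with a small helper. This is only a sketch: `best_c` is a hypothetical function and the accuracies in `example` are made up to produce a tie, not results from this notebook.

```python
def best_c(accuracies: dict[float, float], ndigits: int = 3) -> float:
    """Pick the C with the highest rounded accuracy, smallest C on ties."""
    rounded = {c: round(acc, ndigits) for c, acc in accuracies.items()}
    top = max(rounded.values())
    return min(c for c, acc in rounded.items() if acc == top)


# Made-up accuracies for illustration only; C=1 and C=10 tie after rounding.
example = {0.01: 0.8802, 0.1: 0.8911, 1: 0.8934, 10: 0.8929, 100: 0.8917}
print(best_c(example))
```

Sorting the unrounded accuracies and taking the first row can pick a different `C` than this rule when two values only differ beyond the third decimal.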

#### Answer

`0.1`

In [27]:
dv = DictVectorizer(sparse=False)
train_dict: dict = df_train[features].to_dict(orient="records")
X_train: np.ndarray = dv.fit_transform(X=train_dict)
y_train: np.ndarray = df.y[df_train.index].to_numpy()
val_dict: dict = df_val[features].to_dict(orient="records")
X_val: np.ndarray = dv.transform(X=val_dict)
y_val: np.ndarray = df.y[df_val.index].to_numpy()


res_reg: dict[float, float] = {}
for C in [0.01, 0.1, 1, 10, 100]:
    print(C)
    model = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    model.fit(X=X_train, y=y_train)
    y_pred: np.ndarray = model.predict_proba(X=X_val)[:, 1]
    res_reg[C] = ((y_pred >= 0.5).astype(dtype=int) == y_val).astype(dtype=int).mean()

0.01
0.1
1
10
100


In [28]:
pd.Series(data=res_reg).sort_values(ascending=False).head(n=1)

0.1    0.902129
dtype: float64