### Dataset

Lead scoring dataset. [Download](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

With `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
```
> I downloaded it from the terminal into my `../data` folder.

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

### Data preparation

* Check whether the features contain missing values.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace them with 0.0

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

df = pd.read_csv('../data/course_lead_scoring.csv')

In [3]:
df.head()
#df.shape

#data already looks neat with lowercase, underscores, etc
#checking for missing values
df.isnull().sum()


lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [4]:
#split categorical and numerical columns for different NA processing

df.dtypes

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
numerical_columns = list(df.dtypes[df.dtypes != 'object'].index)

df[categorical_columns].isnull().sum()
df[numerical_columns].isnull().sum()    

number_of_courses_viewed      0
annual_income               181
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [7]:
df[categorical_columns] = df[categorical_columns].fillna('NA')
df[numerical_columns] = df[numerical_columns].fillna(0)
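
The fill rule can also be sketched on a toy frame. This is a minimal sketch, not the notebook's exact code: `fill_missing` is a hypothetical helper and the toy values are made up. It picks numerical columns with `select_dtypes(include="number")` and treats every other column as categorical.

```python
import numpy as np
import pandas as pd


def fill_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with 'NA' in categorical gaps and 0.0 in numerical gaps."""
    out = df.copy()
    numerical = list(out.select_dtypes(include="number").columns)
    categorical = [c for c in out.columns if c not in numerical]
    if categorical:
        out[categorical] = out[categorical].fillna("NA")
    if numerical:
        out[numerical] = out[numerical].fillna(0.0)
    return out


# Toy data (made up for illustration only)
toy = pd.DataFrame({
    "industry": ["retail", None, "finance"],
    "annual_income": [50000.0, np.nan, 72000.0],
})
filled = fill_missing(toy)
print(filled)
```

Working on a copy keeps the raw frame available for comparison, which helps when checking how many values were actually filled.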

### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `NA`
- `technology`
- `healthcare`
- `retail` <---

In [11]:
df['industry'].value_counts()

industry
retail           203
finance          200
other            198
healthcare       187
education        187
technology       179
manufacturing    174
NA               134
Name: count, dtype: int64
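
`value_counts()` sorts by frequency in descending order, so the mode is its first index entry; `Series.mode()` gives the same answer directly. A minimal sketch on made-up values (not the dataset):

```python
import pandas as pd

# Toy column (made up); 'NA' stands for a filled-in missing value
industry = pd.Series(["retail", "finance", "retail", "NA", "other"])

most_frequent_via_counts = industry.value_counts().index[0]
# mode() returns a Series because ties would all be listed
most_frequent_via_mode = industry.mode()[0]
print(most_frequent_via_counts, most_frequent_via_mode)
```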

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset.
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `interaction_count`  <---

Only consider the pairs above when answering this question.
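
One way to compare only the listed pairs is to look each one up in the correlation matrix and keep the largest value. The sketch below uses made-up columns `a`, `b`, `c` (not the dataset's features) and ranks by absolute value, which is an assumption about what "biggest" means here.

```python
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],  # perfectly correlated with a
    "c": [1, 0, 1, 0],
})
corr = toy.corr()  # Pearson correlation by default

candidate_pairs = [("a", "b"), ("a", "c"), ("b", "c")]
# Pick the candidate pair with the largest absolute correlation
best_pair = max(candidate_pairs, key=lambda p: abs(corr.loc[p[0], p[1]]))
print(best_pair, round(corr.loc[best_pair[0], best_pair[1]], 3))
```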

In [22]:
# Calculate the correlation matrix for numerical features
correlation_matrix = df[numerical_columns].corr()

# Display the correlation matrix
print(correlation_matrix)

# Extract the correlation values for the specified pairs
pairs = {
    "interaction_count and lead_score": correlation_matrix.loc["interaction_count", "lead_score"],
    "number_of_courses_viewed and lead_score": correlation_matrix.loc["number_of_courses_viewed", "lead_score"],
    "number_of_courses_viewed and interaction_count": correlation_matrix.loc["number_of_courses_viewed", "interaction_count"],
    "annual_income and interaction_count": correlation_matrix.loc["annual_income", "interaction_count"]
}

for pair, corr_value in pairs.items():
    print(f"Correlation between {pair}: {corr_value:.3f}")

                          number_of_courses_viewed  annual_income  \
number_of_courses_viewed                  1.000000       0.009770   
annual_income                             0.009770       1.000000   
interaction_count                        -0.023565       0.027036   
lead_score                               -0.004879       0.015610   
converted                                 0.435914       0.053131   

                          interaction_count  lead_score  converted  
number_of_courses_viewed          -0.023565   -0.004879   0.435914  
annual_income                      0.027036    0.015610   0.053131  
interaction_count                  1.000000    0.009888   0.374573  
lead_score                         0.009888    1.000000   0.193673  
converted                          0.374573    0.193673   1.000000  
Correlation between interaction_count and lead_score: 0.010
Correlation between number_of_courses_viewed and lead_score: -0.005
Correlation between number_of_courses_viewed and interaction_count: -0.024
Correlation between annual_income and interaction_count: 0.027

### Split the data

- Split your data into train/val/test sets with a 60%/20%/20% distribution.
- Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
- Make sure that the target value `converted` is not in your dataframe.

In [35]:
from sklearn.model_selection import train_test_split

# separate features and target
X = df.drop(columns=['converted'])
y = df['converted']

# Split the data into train (60%) and temp (40%)
X_train_df, X_temp_df, y_train_df, y_temp_df = train_test_split(X, y, test_size=0.4, random_state=42)

# Split the temp data into validation (20%) and test (20%)
X_val_df, X_test_df, y_val_df, y_test_df = train_test_split(X_temp_df, y_temp_df, test_size=0.5, random_state=42)

# Print the sizes of the splits
print(f"Train set: {X_train_df.shape}, Validation set: {X_val_df.shape}, Test set: {X_test_df.shape}")

Train set: (877, 8), Validation set: (292, 8), Test set: (293, 8)
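
The 60/20/20 proportions follow from splitting twice: 40% is held out first, then that part is halved. A minimal sketch on a made-up 100-row frame (the column `x` is illustrative only) that checks the resulting sizes:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

toy = pd.DataFrame({"x": range(100), "converted": [0, 1] * 50})
X = toy.drop(columns=["converted"])
y = toy["converted"]

# First hold out 40%, then halve it into validation and test
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))
```

When the row count does not divide evenly, the parts can differ by one row, as the 292/293 split above shows.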


### Question 3

- Calculate the mutual information score between `converted` and other categorical variables in the dataset. Use the training set only.
- Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?

- `industry`
- `location`
- `lead_source` <---
- `employment_status`

In [36]:
from sklearn.metrics import mutual_info_score

# Calculate mutual information scores for categorical variables using the training set
categorical_columns = ['industry', 'location', 'lead_source', 'employment_status']  # Only the relevant columns
mutual_info_scores = {}

for col in categorical_columns:
    score = mutual_info_score(y_train_df, X_train_df[col])
    mutual_info_scores[col] = round(score, 2)
    print(f"Mutual Information between {col} and converted: {round(score, 2)}")

Mutual Information between industry and converted: 0.02
Mutual Information between location and converted: 0.0
Mutual Information between lead_source and converted: 0.03
Mutual Information between employment_status and converted: 0.02
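
For intuition about the scale of these scores: `mutual_info_score` returns 0 when the two label vectors are independent and, since it uses natural logarithms, reaches the entropy of the target when one vector fully determines the other. A minimal sketch with made-up labels:

```python
import math
from sklearn.metrics import mutual_info_score

target = [0, 0, 1, 1]
informative = ["a", "a", "b", "b"]    # determines the target exactly
uninformative = ["a", "b", "a", "b"]  # independent of the target

mi_informative = mutual_info_score(target, informative)
mi_uninformative = mutual_info_score(target, uninformative)
print(round(mi_informative, 4), round(mi_uninformative, 4))
```

For a balanced binary target the upper bound is `ln 2`, roughly 0.69, so scores of 0.02 to 0.03 indicate weak dependence.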


### Question 4

- Now let's train a logistic regression.
- Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
- Fit the model on the training dataset.
  - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
  - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.64
- 0.74 <---
- 0.84
- 0.94

In [38]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train_dict = X_train_df.to_dict(orient='records')
X_val_dict = X_val_df.to_dict(orient='records')

# Apply DictVectorizer
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(X_train_dict)
X_val = dv.transform(X_val_dict)

# Train the model
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train_df)

# Make predictions on the validation
y_pred = model.predict(X_val)

# Calculate accuracy
accuracy = accuracy_score(y_val_df, y_pred)
print(f"Val Accuracy: {accuracy:.2f}")

Val Accuracy: 0.74
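
To see what the one-hot step produces, here is a minimal sketch on two made-up records. `DictVectorizer` expands each string-valued key into `key=value` indicator columns and passes numeric values through unchanged.

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up records, for illustration only
records = [
    {"industry": "retail", "lead_score": 0.5},
    {"industry": "finance", "lead_score": 0.1},
]

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(records)

print(dv.feature_names_)  # column order of the matrix below
print(X)
```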


### Question 5

- Let's find the least useful feature using the _feature elimination_ technique.
- Train a model using the same features and parameters as in Q4 (without rounding).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

- `'industry'`
- `'employment_status'`
- `'lead_score'`

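> *All???? The loop below drops one vectorized (one-hot) column at a time rather than a whole original feature, so it does not directly answer the question for `industry` or `employment_status`.*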

In [48]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X_train_dict = X_train_df.to_dict(orient='records')
X_val_dict = X_val_df.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(X_train_dict)
X_val = dv.transform(X_val_dict)

# Train the model with all features
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train_df)
y_pred = model.predict(X_val)
original_accuracy = accuracy_score(y_val_df, y_pred)
print(f"Original Accuracy: {original_accuracy:.4f}")

# Feature elimination
feature_accuracies = {}
for feature in dv.feature_names_:
    # Drop one vectorized column by index (a single one-hot column or a numerical feature);
    # X_train and X_val are plain numpy arrays after DictVectorizer
    keep = [i for i, f in enumerate(dv.feature_names_) if f != feature]
    reduced_X_train = X_train[:, keep]
    reduced_X_val = X_val[:, keep]
    
    # Train the model without the feature
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(reduced_X_train, y_train_df)
    y_pred = model.predict(reduced_X_val)
    accuracy = accuracy_score(y_val_df, y_pred)
    
    # Record the accuracy difference (in absolute terms)
    feature_accuracies[feature] = abs(accuracy - original_accuracy)
    #print(f"Accuracy without '{feature}': {accuracy} (Difference: {accuracy - original_accuracy})")


# Sort the feature_accuracies dictionary by values
sorted_features = sorted(feature_accuracies.items(), key=lambda x: x[1])

# Print the sorted features neatly
print("Feature accuracies difference:")
for feature, diff in sorted_features:
    print(f"{feature}: {diff}")

Original Accuracy: 0.7432
Feature accuracies difference:
employment_status=NA: 0.0
employment_status=employed: 0.0
employment_status=student: 0.0
industry=NA: 0.0
industry=finance: 0.0
industry=healthcare: 0.0
industry=manufacturing: 0.0
industry=other: 0.0
industry=retail: 0.0
industry=technology: 0.0
lead_score: 0.0
lead_source=NA: 0.0
lead_source=events: 0.0
lead_source=organic_search: 0.0
lead_source=referral: 0.0
lead_source=social_media: 0.0
location=NA: 0.0
location=africa: 0.0
location=asia: 0.0
location=australia: 0.0
location=europe: 0.0
location=middle_east: 0.0
location=north_america: 0.0
location=south_america: 0.0
employment_status=self_employed: 0.003424657534246589
employment_status=unemployed: 0.003424657534246589
industry=education: 0.003424657534246589
lead_source=paid_ads: 0.003424657534246589
number_of_courses_viewed: 0.06506849315068486
interaction_count: 0.06849315068493145
annual_income: 0.113013698630137
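
The loop above removes one vectorized column at a time, so a categorical feature such as `industry` is only partly removed in each run. A minimal sketch of a feature-level variant, where the whole original column is dropped before vectorizing, could look like this. `fit_and_score` and `feature_elimination` are hypothetical helpers, and the toy data is randomly generated, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def fit_and_score(train_df, y_train, val_df, y_val):
    """Vectorize, fit a logistic regression, return validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(train_df.to_dict(orient="records"))
    X_val = dv.transform(val_df.to_dict(orient="records"))
    model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    return accuracy_score(y_val, model.predict(X_val))


def feature_elimination(train_df, y_train, val_df, y_val, features):
    """Return {feature: baseline accuracy - accuracy without that feature}."""
    baseline = fit_and_score(train_df, y_train, val_df, y_val)
    return {
        f: baseline - fit_and_score(train_df.drop(columns=[f]), y_train,
                                    val_df.drop(columns=[f]), y_val)
        for f in features
    }


# Toy data (randomly generated, for illustration only)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "industry": rng.choice(["retail", "finance"], size=n),
    "lead_score": rng.random(n),
    "interaction_count": rng.integers(0, 10, size=n),
})
y = (toy["interaction_count"] > 4).astype(int)
train_df, val_df = toy.iloc[:150], toy.iloc[150:]
y_train, y_val = y.iloc[:150], y.iloc[150:]

diffs = feature_elimination(train_df, y_train, val_df, y_val, list(toy.columns))
for name, diff in diffs.items():
    print(f"{name}: {diff:.4f}")
```

Dropping the column from the dataframe means all of its one-hot columns disappear together, which matches the question's wording of excluding each feature.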


### Question 6

- Now let's train a regularized logistic regression.
- Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
- Train models using all the features as in Q4.
- Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` values leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.

> *Answer*: It's all of them (???)
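
The tie-breaking rule from the note can be sketched in plain Python: round each accuracy to 3 decimals, find the best rounded value, and take the smallest `C` that reaches it. The accuracies below are made-up placeholders, not results from this dataset.

```python
# Hypothetical accuracies per C (placeholders for illustration only)
accuracies = {0.01: 0.7004, 0.1: 0.7101, 1: 0.7099, 10: 0.7102, 100: 0.6950}

# Round first, so that values equal to 3 decimals count as a tie
rounded = {C: round(acc, 3) for C, acc in accuracies.items()}
best_accuracy = max(rounded.values())
best_C = min(C for C, acc in rounded.items() if acc == best_accuracy)
print(best_C, best_accuracy)
```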

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

C_values = [0.01, 0.1, 1, 10, 100]

accuracies = {}

for C in C_values:
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train_df)
    
    y_pred = model.predict(X_val)
    
    accuracy = accuracy_score(y_val_df, y_pred)
    accuracies[C] = accuracy
    print(f"C={C}: Validation Accuracy = {accuracy:.3f}")
    print(f"C={C}, Coefficients: {model.coef_}")

print("\nAccuracies for different values of C:")
for C, acc in accuracies.items():
    print(f"C={C}: {acc}")

C=0.01: Validation Accuracy = 0.743
C=0.01, Coefficients: [[-1.54986308e-05 -1.38818400e-02  2.84473331e-02  1.65496841e-02
   9.81712858e-03 -1.07635095e-01 -2.19118936e-02  5.30517813e-02
  -2.55297090e-02 -2.24924091e-02 -1.22293827e-02 -5.50949851e-03
  -1.69774363e-02 -1.51042417e-02  2.55041816e-01  4.30605564e-02
   7.95353967e-03 -1.56486967e-02 -1.54810535e-02 -9.65766117e-02
   6.94316344e-02 -1.63816017e-02  4.09622438e-03 -1.50863999e-02
  -1.24810038e-02 -1.09627449e-02  3.26015378e-03  4.89975207e-03
  -2.22195074e-02 -1.82092638e-02  4.16038439e-01]]
C=0.1: Validation Accuracy = 0.743
C=0.1, Coefficients: [[-1.82895893e-05 -1.35593994e-02  2.97880515e-02  1.80547407e-02
   1.20330232e-02 -1.06401372e-01 -2.11179445e-02  5.40284235e-02
  -2.50509542e-02 -2.16462414e-02 -1.14414537e-02 -4.19817699e-03
  -1.61889333e-02 -1.44696750e-02  2.94243475e-01  4.66717056e-02
   8.63902051e-03 -1.45953931e-02 -1.39296136e-02 -9.60066592e-02
   7.10159078e-02 -1.52082181e-02  4.46134