### Data preparation

* Check whether the features contain missing values.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace them with 0.0

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('course_lead_scoring.csv')
df

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,NaN,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.80,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,NaN,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1
...,...,...,...,...,...,...,...,...,...
1457,referral,manufacturing,1,NaN,self_employed,north_america,4,0.53,1
1458,referral,technology,3,65259.0,student,europe,2,0.24,1
1459,paid_ads,technology,1,45688.0,student,north_america,3,0.02,1
1460,referral,NaN,5,71016.0,self_employed,north_america,0,0.25,1


In [3]:
df.isnull().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

In [4]:
df_clean = df.copy()
for col in df_clean.columns:
    if df_clean[col].dtype == 'object':
        df_clean[col] = df_clean[col].fillna('NA')
    else:
        df_clean[col] = df_clean[col].fillna(0.0)

df_clean.isnull().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64
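
The loop above relies on text columns having the `object` dtype. Depending on the pandas version and configuration, text columns may instead use a dedicated string dtype, in which case that check would not match. A minimal sketch of a dtype-agnostic variant, shown on a hypothetical two-column toy frame (`toy`, `numeric_cols` and `other_cols` are illustrative names, not part of the homework):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
toy = pd.DataFrame({
    "industry": ["retail", None, "finance"],
    "annual_income": [79450.0, np.nan, 46992.0],
})

# Select numeric columns by kind instead of comparing against 'object'
numeric_cols = toy.select_dtypes(include="number").columns
other_cols = toy.columns.difference(numeric_cols)

toy_clean = toy.copy()
toy_clean[numeric_cols] = toy_clean[numeric_cols].fillna(0.0)
toy_clean[other_cols] = toy_clean[other_cols].fillna("NA")

print(toy_clean)
```

This treats every non-numeric column as categorical, which is an assumption that happens to hold for this dataset.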

In [5]:
df_clean

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,NA,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.80,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,NA,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1
...,...,...,...,...,...,...,...,...,...
1457,referral,manufacturing,1,0.0,self_employed,north_america,4,0.53,1
1458,referral,technology,3,65259.0,student,europe,2,0.24,1
1459,paid_ads,technology,1,45688.0,student,north_america,3,0.02,1
1460,referral,NA,5,71016.0,self_employed,north_america,0,0.25,1


In [6]:
df_clean.head().T

Unnamed: 0,0,1,2,3,4
lead_source,paid_ads,social_media,events,paid_ads,referral
industry,NA,retail,healthcare,retail,education
number_of_courses_viewed,1,1,5,2,3
annual_income,79450.0,46992.0,78796.0,83843.0,85012.0
employment_status,unemployed,employed,unemployed,NA,self_employed
location,south_america,south_america,australia,australia,europe
interaction_count,4,1,3,1,3
lead_score,0.94,0.8,0.69,0.87,0.62
converted,1,0,1,0,1


### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `retail`

In [7]:
df_clean['industry'].value_counts()

industry
retail           203
finance          200
other            198
healthcare       187
education        187
technology       179
manufacturing    174
NA               134
Name: count, dtype: int64

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `annual_income` and `interaction_count`

In [8]:

print("\n--- Correlation values for the specified pairs ---")
corr1 = df_clean['interaction_count'].corr(df_clean['lead_score'])
corr2 = df_clean['number_of_courses_viewed'].corr(df_clean['lead_score'])
corr3 = df_clean['number_of_courses_viewed'].corr(df_clean['interaction_count'])
corr4 = df_clean['annual_income'].corr(df_clean['interaction_count'])

print(f"'interaction_count' and 'lead_score': {corr1:.4f}")
print(f"'number_of_courses_viewed' and 'lead_score': {corr2:.4f}")
print(f"'number_of_courses_viewed' and 'interaction_count': {corr3:.4f}")
print(f"'annual_income' and 'interaction_count': {corr4:.4f}")


--- Correlation values for the specified pairs ---
'interaction_count' and 'lead_score': 0.0099
'number_of_courses_viewed' and 'lead_score': -0.0049
'number_of_courses_viewed' and 'interaction_count': -0.0236
'annual_income' and 'interaction_count': 0.0270
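
The cell above compares only the four listed pairs. As a complementary sketch, the full correlation matrix can be built with `DataFrame.corr()` and scanned for the strongest pair. The toy data below is synthetic, and ranking by absolute value is an assumption about what "biggest" means here:

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic numeric data: 'b' is built to track 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

# Pearson correlation between every pair of columns
corr = toy.corr()

# Visit each unordered pair once and keep the one with the largest |r|
pairs = {(x, y): corr.loc[x, y] for x, y in combinations(corr.columns, 2)}
best_pair = max(pairs, key=lambda p: abs(pairs[p]))

print(corr.round(3))
print(best_pair)
```

On the homework data, the same pattern would apply to `df_clean[numerical_cols]` in place of `toy`.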


### Split the data

* Split your data into train/val/test sets with a 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.


In [9]:
from sklearn.model_selection import train_test_split

X = df_clean.drop(columns=['converted'])
y = df_clean['converted']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(f"Training set size: {len(X_train)}")
print(f"Validation set size: {len(X_val)}")
print(f"Test set size: {len(X_test)}")

Training set size: 877
Validation set size: 292
Test set size: 293
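
A quick sanity check of the two-step split: holding out 40% and then halving that hold-out should give roughly 60%/20%/20%. The sketch below reproduces the arithmetic on a plain index array of the same length as the printed sizes (877 + 292 + 293 = 1462), so no CSV is needed:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1462  # total implied by the sizes printed above
idx = np.arange(n_rows)

# Step 1: 60% train, 40% held out
train_idx, temp_idx = train_test_split(idx, test_size=0.4, random_state=42)
# Step 2: split the held-out 40% in half -> 20% val, 20% test
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

sizes = {"train": len(train_idx), "val": len(val_idx), "test": len(test_idx)}
shares = {name: round(size / n_rows, 3) for name, size in sizes.items()}

print(sizes)
print(shares)
```

Because 1462 is not divisible by 5, the validation and test sets differ by one row.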


### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `lead_source`

In [10]:
from sklearn.feature_selection import mutual_info_classif

categorical_cols = ['lead_source', 'industry', 'employment_status', 'location']

# Work on a copy so the training frame keeps its original string values
X_train_cat = X_train[categorical_cols].copy()

# mutual_info_classif needs numeric input, so map each category to an integer code
for col in X_train_cat.columns:
    X_train_cat[col] = X_train_cat[col].astype('category').cat.codes

# discrete_features=True treats the integer codes as categories, not magnitudes
mi_scores = mutual_info_classif(X_train_cat, y_train, discrete_features=True, random_state=42)

mi_series = pd.Series(mi_scores, index=categorical_cols)

print(mi_series.round(2))

lead_source          0.03
industry             0.02
employment_status    0.02
location             0.00
dtype: float64
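
For intuition about what these scores mean, here is a small sketch using `mutual_info_score` from `sklearn.metrics`, which accepts string labels directly. The data is synthetic: one feature determines the target exactly, the other is independent of it. The variable names are hypothetical.

```python
from sklearn.metrics import mutual_info_score

# Balanced binary target
y_toy = [0, 1] * 50

# Fully determined by the target -> mutual information equals the target's entropy
informative = ["paid_ads" if v == 1 else "events" for v in y_toy]

# Each category is split evenly between the two target values -> no information
uninformative = ["europe", "europe", "australia", "australia"] * 25

scores = {
    "informative": round(mutual_info_score(y_toy, informative), 2),
    "uninformative": round(mutual_info_score(y_toy, uninformative), 2),
}
print(scores)
```

Scores are in nats, so a perfectly informative feature for a balanced binary target tops out at ln 2, about 0.69. Against that scale, the values of 0.00 to 0.03 above indicate that each categorical feature on its own says little about `converted`.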


### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.74


In [11]:
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

categorical_cols = ['lead_source', 'industry', 'employment_status', 'location']
numerical_cols = ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']

dv = DictVectorizer(sparse=False)

train_dict = X_train[categorical_cols + numerical_cols].to_dict(orient='records')
val_dict = X_val[categorical_cols + numerical_cols].to_dict(orient='records')

X_train_encoded = dv.fit_transform(train_dict)

X_val_encoded = dv.transform(val_dict)

model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train_encoded, y_train)

y_pred = model.predict(X_val_encoded)
accuracy = accuracy_score(y_val, y_pred)

print(f"Validation Accuracy: {accuracy:.2f}")

Validation Accuracy: 0.74


### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `'industry'`

> **Note**: The difference doesn't have to be positive.

In [12]:
original_accuracy = accuracy_score(y_val, model.predict(X_val_encoded))
print(f"Original Model Accuracy: {original_accuracy}\n")

features = categorical_cols + numerical_cols
accuracy_diffs = {}

for feature_to_exclude in features:
    current_features = [f for f in features if f != feature_to_exclude]
    
    dv_temp = DictVectorizer(sparse=False)
    train_dict_temp = X_train[current_features].to_dict(orient='records')
    X_train_temp_encoded = dv_temp.fit_transform(train_dict_temp)
    
    val_dict_temp = X_val[current_features].to_dict(orient='records')
    X_val_temp_encoded = dv_temp.transform(val_dict_temp)
    
    model_temp = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model_temp.fit(X_train_temp_encoded, y_train)
    
    accuracy_without_feature = accuracy_score(y_val, model_temp.predict(X_val_temp_encoded))
    
    accuracy_diffs[feature_to_exclude] = original_accuracy - accuracy_without_feature
    print(f"Accuracy without '{feature_to_exclude}': {accuracy_without_feature:.4f} | Difference: {accuracy_diffs[feature_to_exclude]:.4f}")

# Compare absolute differences; on ties, min() returns the first feature in iteration order
least_useful_feature = min(accuracy_diffs, key=lambda k: abs(accuracy_diffs[k]))
print(f"\nFeature with the smallest difference: '{least_useful_feature}'")

Original Model Accuracy: 0.7431506849315068

Accuracy without 'lead_source': 0.7295 | Difference: 0.0137
Accuracy without 'industry': 0.7432 | Difference: 0.0000
Accuracy without 'employment_status': 0.7466 | Difference: -0.0034
Accuracy without 'location': 0.7432 | Difference: 0.0000
Accuracy without 'number_of_courses_viewed': 0.6781 | Difference: 0.0651
Accuracy without 'annual_income': 0.8562 | Difference: -0.1130
Accuracy without 'interaction_count': 0.6747 | Difference: 0.0685
Accuracy without 'lead_score': 0.7432 | Difference: 0.0000

Feature with the smallest difference: 'industry'
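
The reported answer depends on two choices: comparing absolute differences, and how ties are broken. The sketch below re-uses the differences printed above (rounded to four decimals, so this is an approximation of the in-memory values) to list every tied feature and to show what a signed comparison would select instead:

```python
# Differences copied from the output above (rounded to 4 decimals)
accuracy_diffs = {
    "lead_source": 0.0137,
    "industry": 0.0000,
    "employment_status": -0.0034,
    "location": 0.0000,
    "number_of_courses_viewed": 0.0651,
    "annual_income": -0.1130,
    "interaction_count": 0.0685,
    "lead_score": 0.0000,
}

# All features sharing the smallest absolute difference
smallest_abs = min(abs(d) for d in accuracy_diffs.values())
tied = [f for f, d in accuracy_diffs.items() if abs(d) == smallest_abs]

# Alternative reading: smallest signed difference
smallest_signed = min(accuracy_diffs, key=accuracy_diffs.get)

print(tied)
print(smallest_signed)
```

Three features tie at zero, and `industry` is reported only because it comes first in the feature list. Under the signed reading, `annual_income` would be selected, since removing it raises validation accuracy.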


### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01


> **Note**: If there are multiple options, select the smallest `C`.

In [13]:
c_values = [0.01, 0.1, 1, 10, 100]
accuracies = {}

for c in c_values:
    model = LogisticRegression(solver='liblinear', C=c, max_iter=1000, random_state=42)
    model.fit(X_train_encoded, y_train)
    
    y_pred = model.predict(X_val_encoded)
    acc = accuracy_score(y_val, y_pred)
    accuracies[c] = round(acc, 3)
    
    print(f"C = {c}: Validation Accuracy = {acc:.3f}")

# On ties, max() returns the first key in insertion order, i.e. the smallest C here
best_c = max(accuracies, key=accuracies.get)

print(f"\nThe C value that leads to the best accuracy is: {best_c}")

C = 0.01: Validation Accuracy = 0.743
C = 0.1: Validation Accuracy = 0.743
C = 1: Validation Accuracy = 0.743
C = 10: Validation Accuracy = 0.743
C = 100: Validation Accuracy = 0.743

The C value that leads to the best accuracy is: 0.01
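
Since all five values of `C` give the same rounded accuracy, the answer comes entirely from the tie-breaking rule in the note. A minimal sketch that makes the rule explicit rather than relying on dictionary order, using the accuracies printed above:

```python
# Rounded validation accuracies copied from the output above
accuracies = {0.01: 0.743, 0.1: 0.743, 1: 0.743, 10: 0.743, 100: 0.743}

# Find the best accuracy, then pick the smallest C among those that reach it
best_acc = max(accuracies.values())
best_c = min(c for c, acc in accuracies.items() if acc == best_acc)

print(best_c)
```

This version gives the same answer regardless of the order in which the `C` values are tried.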
