## Homework

> Note: sometimes your answer doesn't match one of the options exactly. 
> That's fine. 
> Select the option that's closest to your solution.


### Dataset

In this homework, we will use the lead scoring dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

Or you can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
```

In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

### Data preparation

* Check whether the features contain missing values.
* If there are missing values:
    * For categorical features, replace them with 'NA'
    * For numerical features, replace them with 0.0


In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mutual_info_score

In [65]:
data = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv"

In [66]:
!wget $data

--2025-10-13 17:58:07--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: ‘course_lead_scoring.csv.2’


2025-10-13 17:58:07 (1.32 MB/s) - ‘course_lead_scoring.csv.2’ saved [80876/80876]



In [67]:
df = pd.read_csv('course_lead_scoring.csv')

In [68]:
# --- Data Preparation ---

In [69]:
df.head()

| | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | NaN | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.8 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |
| 3 | paid_ads | retail | 2 | 83843.0 | NaN | australia | 1 | 0.87 | 0 |
| 4 | referral | education | 3 | 85012.0 | self_employed | europe | 3 | 0.62 | 1 |


In [70]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| lead_source | paid_ads | social_media | events | paid_ads | referral |
| industry | NaN | retail | healthcare | retail | education |
| number_of_courses_viewed | 1 | 1 | 5 | 2 | 3 |
| annual_income | 79450.0 | 46992.0 | 78796.0 | 83843.0 | 85012.0 |
| employment_status | unemployed | employed | unemployed | NaN | self_employed |
| location | south_america | south_america | australia | australia | europe |
| interaction_count | 4 | 1 | 3 | 1 | 3 |
| lead_score | 0.94 | 0.8 | 0.69 | 0.87 | 0.62 |
| converted | 1 | 0 | 1 | 0 | 1 |


In [71]:
# Identify feature types
numerical_features = df.select_dtypes(include=np.number).columns.tolist()
categorical_features = df.select_dtypes(include=['object']).columns.tolist()


In [72]:
# Remove the target variable from features list if present
target_variable = 'converted'
if target_variable in numerical_features:
    numerical_features.remove(target_variable)

In [73]:
# Check for missing values and impute
print("Missing values before imputation:")
print(df.isnull().sum()[df.isnull().sum() > 0])

Missing values before imputation:
lead_source          128
industry             134
annual_income        181
employment_status    100
location              63
dtype: int64


In [74]:
for col in categorical_features:
    if df[col].isnull().any():
        df[col] = df[col].fillna('NA')

for col in numerical_features:
    if df[col].isnull().any():
        df[col] = df[col].fillna(0.0)

In [75]:

print("\nMissing values after imputation (should be 0 for processed columns):")
print(df.isnull().sum().sum())


Missing values after imputation (should be 0 for processed columns):
0


### Question 1

What is the most frequent observation (mode) for the column `industry`?

- NA

- technology

- healthcare

- retail


In [76]:
# Q1
mode_industry = df['industry'].mode()[0]
print(f"Mode for industry: {mode_industry}")

Mode for industry: retail


### Question 2

Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- interaction_count and lead_score

- number_of_courses_viewed and lead_score

- number_of_courses_viewed and interaction_count

- annual_income and interaction_count

Only consider the pairs above when answering this question.
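As a side note, the strongest pair can also be read off a correlation matrix without an explicit loop. The sketch below is only an illustration under stated assumptions: the frame `toy` and its columns `a`, `b`, `c` are synthetic and are not part of the homework dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data, for illustration only: b is nearly a copy of a, c is noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs_abs = corr.where(mask).stack().abs()

# idxmax returns the (row, column) labels of the largest absolute correlation.
top_pair = pairs_abs.idxmax()
print(top_pair, round(pairs_abs.max(), 3))
```

Taking the absolute value matters because a strong negative correlation is still a strong relationship.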


In [77]:
# Identify numerical features
numerical_features = ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']

# Create correlation matrix
corr_matrix = df[numerical_features].corr()
print("Correlation Matrix:")
print(corr_matrix)


Correlation Matrix:
                          number_of_courses_viewed  annual_income  \
number_of_courses_viewed                  1.000000       0.009770   
annual_income                             0.009770       1.000000   
interaction_count                        -0.023565       0.027036   
lead_score                               -0.004879       0.015610   

                          interaction_count  lead_score  
number_of_courses_viewed          -0.023565   -0.004879  
annual_income                      0.027036    0.015610  
interaction_count                  1.000000    0.009888  
lead_score                         0.009888    1.000000  


In [78]:
# Check the specific pairs mentioned in the question
pairs = [
    ('interaction_count', 'lead_score'),
    ('number_of_courses_viewed', 'lead_score'),
    ('number_of_courses_viewed', 'interaction_count'),
    ('annual_income', 'interaction_count')
]


In [79]:
print("Correlation for specific pairs:")
for feature1, feature2 in pairs:
    correlation = corr_matrix.loc[feature1, feature2]
    print(f"{feature1} and {feature2}: {correlation:.6f}")

Correlation for specific pairs:
interaction_count and lead_score: 0.009888
number_of_courses_viewed and lead_score: -0.004879
number_of_courses_viewed and interaction_count: -0.023565
annual_income and interaction_count: 0.027036


In [80]:
# Find the pair with the highest absolute correlation
max_corr = 0
max_pair = None
for feature1, feature2 in pairs:
    correlation = abs(corr_matrix.loc[feature1, feature2])
    if correlation > max_corr:
        max_corr = correlation
        max_pair = (feature1, feature2)

print(f"\nThe pair with the biggest correlation: {max_pair[0]} and {max_pair[1]} ({corr_matrix.loc[max_pair[0], max_pair[1]]:.6f})")

### Split the data

- Split your data in train/val/test sets with 60%/20%/20% distribution.

- Use Scikit-Learn for that (the `train_test_split` function) and set the seed to 42.

- Make sure that the target value y is not in your dataframe.
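There is more than one way to reach a 60%/20%/20% split with two calls to `train_test_split`. The sketch below shows one possible ordering on a stand-in array (`rows` is hypothetical, not the homework dataframe): hold out 20% for test first, then take 25% of the remaining 80% for validation, since 0.25 × 0.8 = 0.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for 1000 dataframe rows

# Step 1: hold out 20% of all rows as the test set.
full_train, test = train_test_split(rows, test_size=0.2, random_state=42)

# Step 2: 25% of the remaining 80% is 20% of the original data.
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```

Different orderings of the two splits yield the same proportions but generally assign different rows to each set, so accuracies computed downstream can differ between them.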


In [81]:
# Split data: 60% train, 20% val, 20% test with seed 42
# First split: 60% train, 40% temp (val + test)
df_train, df_temp = train_test_split(df, test_size=0.4, random_state=42)

In [82]:
# Second split: split the 40% into 50/50 (20% val, 20% test)
df_val, df_test = train_test_split(df_temp, test_size=0.5, random_state=42)

In [83]:
# Reset indices
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [84]:
print(f"Train size: {len(df_train)} ({len(df_train)/len(df)*100:.1f}%)")
print(f"Val size: {len(df_val)} ({len(df_val)/len(df)*100:.1f}%)")
print(f"Test size: {len(df_test)} ({len(df_test)/len(df)*100:.1f}%)")

Train size: 877 (60.0%)
Val size: 292 (20.0%)
Test size: 293 (20.0%)


In [85]:
# Extract target variable y
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

# Remove target variable from dataframes
del df_train['converted']
del df_val['converted']
del df_test['converted']

In [86]:
print("Target distribution:")
print(f"Train: {y_train.sum()} converted out of {len(y_train)} ({y_train.mean()*100:.1f}%)")
print(f"Val: {y_val.sum()} converted out of {len(y_val)} ({y_val.mean()*100:.1f}%)")
print(f"Test: {y_test.sum()} converted out of {len(y_test)} ({y_test.mean()*100:.1f}%)")

Target distribution:
Train: 535 converted out of 877 (61.0%)
Val: 197 converted out of 292 (67.5%)
Test: 173 converted out of 293 (59.0%)


### Question 4

Now let's train a logistic regression.

Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.

Fit the model on the training dataset.

To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:

`model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`

Calculate the accuracy on the validation dataset and round it to 2 decimal digits.
What accuracy did you get?

- 0.64
- 0.74
- 0.84
- 0.94

In [87]:
# Define features
numerical = ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']
categorical = ['lead_source', 'industry', 'employment_status', 'location']

# One-hot encoding using DictVectorizer
dv = DictVectorizer(sparse=False)

In [88]:
# Prepare training data
train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

In [89]:
# Train logistic regression model with specified parameters
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)


In [90]:
# Prepare validation data
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [91]:
# Calculate accuracy on validation set
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)

In [92]:
print(f"Validation accuracy: {accuracy}")
print(f"Validation accuracy (rounded to 2 decimals): {round(accuracy, 2)}")

round(accuracy, 2)

Validation accuracy: 0.7431506849315068
Validation accuracy (rounded to 2 decimals): 0.74


0.74

### Question 5

Let's find the least useful feature using the feature elimination technique.

Train a model using the same features and parameters as in Q4 (without rounding).

Now exclude each feature from this set and train a model without it. Record the accuracy for each model.

For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?
- 'industry'
- 'employment_status'
- 'lead_score'
  
> Note: The difference doesn't have to be positive.


In [93]:
# Define all features
numerical = ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']
categorical = ['lead_source', 'industry', 'employment_status', 'location']
all_features = categorical + numerical

In [94]:
# Train baseline model with all features (same as Q4)
dv = DictVectorizer(sparse=False)
train_dict = df_train[all_features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

In [95]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [96]:
val_dict = df_val[all_features].to_dict(orient='records')
X_val = dv.transform(val_dict)
y_pred = model.predict(X_val)
baseline_accuracy = accuracy_score(y_val, y_pred)


In [97]:
print(f"Baseline accuracy (with all features): {baseline_accuracy}")

Baseline accuracy (with all features): 0.7431506849315068


In [99]:
# Train models excluding each feature one at a time
accuracies_without = {}
differences = {}

for feature in all_features:
    # Create subset excluding current feature
    subset = [f for f in all_features if f != feature]
    
    # Train model without this feature
    dv = DictVectorizer(sparse=False)
    train_dict = df_train[subset].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)
    
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    
    val_dict = df_val[subset].to_dict(orient='records')
    X_val = dv.transform(val_dict)
    y_pred = model.predict(X_val)
    
    accuracy_without = accuracy_score(y_val, y_pred)
    accuracies_without[feature] = accuracy_without
    
    # Calculate difference: original - without
    diff = baseline_accuracy - accuracy_without
    differences[feature] = diff
    
    print(f"Without '{feature}': accuracy={accuracy_without:.6f}, diff={diff:.6f}")


Without 'lead_source': accuracy=0.729452, diff=0.013699
Without 'industry': accuracy=0.743151, diff=0.000000
Without 'employment_status': accuracy=0.746575, diff=-0.003425
Without 'location': accuracy=0.743151, diff=0.000000
Without 'number_of_courses_viewed': accuracy=0.678082, diff=0.065068
Without 'annual_income': accuracy=0.856164, diff=-0.113014
Without 'interaction_count': accuracy=0.674658, diff=0.068493
Without 'lead_score': accuracy=0.743151, diff=0.000000


In [100]:
print("Focus on the three specific features:")
for feature in ['industry', 'employment_status', 'lead_score']:
    print(f"'{feature}': difference = {differences[feature]:.6f}")

Focus on the three specific features:
'industry': difference = 0.000000
'employment_status': difference = -0.003425
'lead_score': difference = 0.000000


In [102]:
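# Find the feature with the smallest absolute difference among the three.
# min() returns the first minimal element, so the tie between 'industry'
# and 'lead_score' (both 0.0) is resolved by the order of target_features.
target_features = ['industry', 'employment_status', 'lead_score']
min_diff_feature = min(target_features, key=lambda x: abs(differences[x]))
print(f"Feature with smallest difference: '{min_diff_feature}' ({differences[min_diff_feature]:.6f})")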

Feature with smallest difference: 'industry' (0.000000)
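The phrase "smallest difference" can be read in two ways, and the note that the difference need not be positive makes the choice matter. The sketch below compares both readings using the three differences printed above; the dictionary name `diffs` is introduced here only for illustration.

```python
# Differences copied from the printed output above
# (baseline accuracy minus accuracy without the feature).
diffs = {
    "industry": 0.000000,
    "employment_status": -0.003425,
    "lead_score": 0.000000,
}

# Reading 1: smallest signed difference (the most negative value wins).
smallest_signed = min(diffs, key=diffs.get)

# Reading 2: smallest absolute difference; min() keeps the first key on ties.
smallest_abs = min(diffs, key=lambda f: abs(diffs[f]))

print(smallest_signed)  # employment_status
print(smallest_abs)     # industry (tied with lead_score at 0.0)
```

Under the absolute reading, `industry` and `lead_score` are tied, so the reported answer depends on iteration order rather than on the data.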


### Question 6

Now let's train a regularized logistic regression.

Let's try the following values of the parameter C: [0.01, 0.1, 1, 10, 100].

Train models using all the features as in Q4.

Calculate the accuracy on the validation dataset and round it to 3 decimal digits.
    
Which of these C leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> Note: If there are multiple options, select the smallest C.
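In Scikit-Learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization. The sketch below illustrates the effect on the size of the learned coefficients. It assumes a synthetic dataset from `make_classification`; the names `X_toy`, `y_toy`, and `coef_norms` are hypothetical and unrelated to the homework data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

coef_norms = {}
for C in [0.01, 1, 100]:
    clf = LogisticRegression(solver="liblinear", C=C, max_iter=1000, random_state=42)
    clf.fit(X_toy, y_toy)
    # The L2 norm of the weight vector summarizes how large the coefficients are.
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: ||w|| = {norm:.3f}")
```

A strongly regularized model (small `C`) is pushed toward smaller weights, which may or may not change validation accuracy on a given dataset.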

In [None]:
# Define features (same as Q4)
numerical = ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']
categorical = ['lead_source', 'industry', 'employment_status', 'location']
all_features = categorical + numerical

In [105]:
# Prepare data with one-hot encoding
dv = DictVectorizer(sparse=False)
train_dict = df_train[all_features].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)


In [106]:
val_dict = df_val[all_features].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [107]:
# Try different C values
C_values = [0.01, 0.1, 1, 10, 100]
results = {}

print("Testing different C values:")
print("="*60)

for C in C_values:
    # Train model with current C value
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    
    # Predict and calculate accuracy
    y_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_pred)
    
    # Round to 3 decimal digits
    accuracy_rounded = round(accuracy, 3)
    results[C] = accuracy_rounded
    
    print(f"C={C:>6}: accuracy = {accuracy:.6f}, rounded = {accuracy_rounded}")

Testing different C values:
C=  0.01: accuracy = 0.743151, rounded = 0.743
C=   0.1: accuracy = 0.743151, rounded = 0.743
C=     1: accuracy = 0.743151, rounded = 0.743
C=    10: accuracy = 0.743151, rounded = 0.743
C=   100: accuracy = 0.743151, rounded = 0.743


In [108]:
# Find the best C (highest accuracy, smallest C if tied)
max_accuracy = max(results.values())
best_C = min([c for c, acc in results.items() if acc == max_accuracy])

print(f"\nBest accuracy: {max_accuracy}")
print(f"Best C: {best_C}")

best_C


Best accuracy: 0.743
Best C: 0.01


0.01