### Dataset

In [106]:
!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv

--2025-10-07 12:06:46--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: ‘course_lead_scoring.csv’


2025-10-07 12:06:46 (40.1 MB/s) - ‘course_lead_scoring.csv’ saved [80876/80876]



### Data Preparation
- Check whether the features contain missing values.
- If there are missing values:
    - For categorical features, replace them with 'NA'
    - For numerical features, replace them with 0.0
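The check-then-fill steps above can be sketched on a small frame. This is a minimal illustration, not the notebook's own cell: the `toy` DataFrame and its values are made up, and `pd.api.types.is_numeric_dtype` is used here as one possible way to tell numerical columns from categorical ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    'industry': ['retail', None, 'finance'],
    'annual_income': [50000.0, np.nan, 42000.0],
})

# Step 1: count missing values per column before changing anything
missing_before = toy.isnull().sum()
print(missing_before[missing_before > 0])

# Step 2: fill numerical columns with 0.0 and everything else with 'NA'
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(0.0)
    else:
        toy[col] = toy[col].fillna('NA')

# Step 3: confirm nothing is left unfilled
missing_after = int(toy.isnull().sum().sum())
print("Missing values after filling:", missing_after)
```

Counting the gaps first shows which columns are affected and how many rows the fill will touch.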

In [107]:
import pandas as pd
import numpy as np

In [108]:
df = pd.read_csv('course_lead_scoring.csv')

In [109]:
df.head()

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.8,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1


In [110]:
df.tail()

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
1457,referral,manufacturing,1,,self_employed,north_america,4,0.53,1
1458,referral,technology,3,65259.0,student,europe,2,0.24,1
1459,paid_ads,technology,1,45688.0,student,north_america,3,0.02,1
1460,referral,,5,71016.0,self_employed,north_america,0,0.25,1
1461,organic_search,finance,3,92855.0,student,north_america,3,0.41,1


In [111]:
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

In [112]:
target = 'converted'
features = ['lead_source', 'industry', 'number_of_courses_viewed', 'annual_income', 
            'employment_status', 'location', 'interaction_count', 'lead_score']

In [113]:
for col in features:
    if df[col].dtype == 'object':
        df[col] = df[col].fillna('NA')
    else:
        df[col] = df[col].fillna(0.0)

In [114]:
df.head()

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.8,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1


In [115]:
df.tail()

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
1457,referral,manufacturing,1,0.0,self_employed,north_america,4,0.53,1
1458,referral,technology,3,65259.0,student,europe,2,0.24,1
1459,paid_ads,technology,1,45688.0,student,north_america,3,0.02,1
1460,referral,,5,71016.0,self_employed,north_america,0,0.25,1
1461,organic_search,finance,3,92855.0,student,north_america,3,0.41,1


In [116]:
df.isnull().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
converted                   0
dtype: int64

### Q1. 
What is the most frequent observation (mode) for the column `industry`?

In [117]:
df['industry'].describe()

count       1462
unique         8
top       retail
freq         203
Name: industry, dtype: object

In [118]:
df['industry'].mode()

0    retail
Name: industry, dtype: object

### Q2.
Create the correlation matrix for the numerical features of your dataset. In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?
- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `lead_score`

In [119]:
print("interaction_count & lead_score:", df['interaction_count'].corr(df['lead_score']))
print("number_of_courses_viewed & lead_score:", df['number_of_courses_viewed'].corr(df['lead_score']))
print("number_of_courses_viewed & interaction_count:", df['number_of_courses_viewed'].corr(df['interaction_count']))
print("annual_income & lead_score:", df['annual_income'].corr(df['lead_score']))

interaction_count & lead_score: 0.009888182496913084
number_of_courses_viewed & lead_score: -0.004878998354681257
number_of_courses_viewed & interaction_count: -0.023565222882888117
annual_income & lead_score: 0.015609546050138909


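A coefficient of +1 denotes perfect positive correlation, 0 denotes no linear correlation, and -1 denotes perfect negative correlation. Here all four coefficients are very close to 0, so there is no strong linear relationship between any of these pairs. Among the listed options, `annual_income` and `lead_score` have the largest coefficient (about 0.016); by absolute value, `number_of_courses_viewed` and `interaction_count` are the largest (about -0.024).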
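The question asks for a full correlation matrix, while the cell above computes only the four candidate pairs. A minimal sketch of the matrix approach follows, assuming a toy frame with hypothetical columns `a`, `b`, and `c`. It also shows why the "biggest" pair depends on whether the sign is taken into account.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: b rises with a, c falls exactly as a rises
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 4, 3, 2, 1],
})

# Pairwise Pearson correlation between every pair of numerical columns
corr = toy.corr(numeric_only=True)

# Keep each pair once: upper triangle, excluding the diagonal of 1.0 values
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

strongest_signed = pairs.idxmax()    # largest coefficient, sign included
strongest_abs = pairs.abs().idxmax()  # largest magnitude, sign ignored

print(pairs)
print("Largest signed coefficient:", strongest_signed)
print("Largest absolute coefficient:", strongest_abs)
```

On the real data, the same pattern would be applied to the numerical columns of `df`.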

### Data Splitting

In [120]:
from sklearn.model_selection import train_test_split

In [121]:
len(df)

1462

In [122]:
X = df[features]
y = df[target]

In [123]:
# Split into train (60%) and temp (40%)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
# Split temp into validation (20%) and test (20%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [124]:
len(X_train) + len(X_val) + len(X_test)

1462
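The total matches, but it does not show how the rows were divided. As a rough check, the same two-step split can be replayed on plain row indices to see the size and share of each part. This sketch assumes only the row count of 1462 reported above; the names `train_idx`, `val_idx`, and `test_idx` are illustrative.

```python
from sklearn.model_selection import train_test_split

n_rows = 1462  # row count reported by len(df)
indices = list(range(n_rows))

# Same two-step split as in the notebook, applied to indices only
train_idx, temp_idx = train_test_split(indices, test_size=0.4, random_state=42)
val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=42)

sizes = {'train': len(train_idx), 'val': len(val_idx), 'test': len(test_idx)}
shares = {name: round(size / n_rows, 3) for name, size in sizes.items()}

print(sizes)
print(shares)
```

Because 1462 is not divisible by 5, the three parts cannot be exactly 60%, 20%, and 20%; the shares come out close to those targets.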

In [125]:
X_train.head()

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score
442,referral,finance,1,61705.0,unemployed,south_america,4,0.65
319,referral,education,1,55199.0,employed,south_america,4,0.09
767,referral,retail,1,40841.0,self_employed,africa,4,0.61
756,referral,other,1,28242.0,employed,middle_east,3,0.84
424,events,retail,0,64775.0,self_employed,south_america,3,0.7


### Q3.
- Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
- Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?

In [126]:
from sklearn.feature_selection import mutual_info_classif
from sklearn.preprocessing import LabelEncoder

In [127]:
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns

In [128]:
X_train_categorical = X_train[categorical_cols].copy()

In [129]:
for col in categorical_cols:
    le = LabelEncoder()
    X_train_categorical[col] = le.fit_transform(X_train_categorical[col].astype(str))

In [130]:
mi_scores = mutual_info_classif(X_train_categorical, y_train, discrete_features=True, random_state=42)

In [131]:
for col, score in zip(categorical_cols, mi_scores):
    print(f"{col}: {round(score, 2)}")

lead_source: 0.03
industry: 0.02
employment_status: 0.02
location: 0.0


All scores are close to 0, so no categorical feature on its own tells us much about the target. `lead_source` has the highest score (0.03).
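To get a feel for the scale of these scores, it can help to look at two extreme cases. The sketch below uses `sklearn.metrics.mutual_info_score`, which works directly on string labels, on a made-up frame: one column determines the target completely and the other is independent of it. The column names and values are hypothetical.

```python
import math

import pandas as pd
from sklearn.metrics import mutual_info_score

# Hypothetical toy data
toy = pd.DataFrame({
    'informative': ['x', 'x', 'y', 'y'],    # fully determines the target
    'uninformative': ['p', 'q', 'p', 'q'],  # independent of the target
    'target': [0, 0, 1, 1],
})

scores = {
    col: mutual_info_score(toy['target'], toy[col])
    for col in ['informative', 'uninformative']
}

for col, score in scores.items():
    print(f"{col}: {round(score, 2)}")

# For a balanced binary target, the upper bound is ln(2)
print("ln(2) =", round(math.log(2), 2))
```

Against an upper bound of about 0.69 for a balanced binary target, scores of 0.02 to 0.03 indicate a weak relationship.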

### Q4. 
- Now let's train a logistic regression.
- Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
- Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

In [132]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

In [146]:
def build_pipeline(data, C=1.0):
    # Accept either a DataFrame or a list of column names
    if isinstance(data, list):
        data = X_train[data]  # Relies on the global X_train to look up dtypes
        
    # Split columns by dtype so each group gets its own transformer
    categorical_cols = data.select_dtypes(include=['object', 'category']).columns
    numerical_cols = data.select_dtypes(include='number').columns

    preprocessor = ColumnTransformer(
        transformers=[
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
            ('num', 'passthrough', numerical_cols)
        ])
    
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42))
    ])

    return model

In [134]:
model = build_pipeline(X_train)
model.fit(X_train, y_train)
y_val_pred = model.predict(X_val)

In [135]:
val_accuracy = accuracy_score(y_val, y_val_pred)
print("Validation Accuracy:", round(val_accuracy, 2))

Validation Accuracy: 0.74
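The pipeline relies on `OneHotEncoder(handle_unknown='ignore')`, so a category that appears only in the validation data does not raise an error. A minimal sketch of that behaviour, on a hypothetical single-column frame where `'billboard'` is an invented category never seen during fitting:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training column with three known categories
train_col = pd.DataFrame({'lead_source': ['paid_ads', 'referral', 'events']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_col)

# A category seen during fit gets a single 1 in its own column
seen = encoder.transform(pd.DataFrame({'lead_source': ['referral']})).toarray()

# A category not seen during fit is encoded as all zeros
unseen = encoder.transform(pd.DataFrame({'lead_source': ['billboard']})).toarray()

print(encoder.categories_[0])
print("seen:  ", seen)
print("unseen:", unseen)
```

With the default `handle_unknown='error'`, the second `transform` call would raise instead.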


### Q5.
- Let's find the least useful feature using the feature elimination technique.
- Train a model using the same features and parameters as in Q4 (without rounding).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.

Which of the following features has the smallest difference?

In [141]:
all_features = X_train.columns.tolist()

In [142]:
all_features

['lead_source',
 'industry',
 'number_of_courses_viewed',
 'annual_income',
 'employment_status',
 'location',
 'interaction_count',
 'lead_score']

In [143]:
pipeline_all = build_pipeline(X_train[all_features])
pipeline_all.fit(X_train, y_train)
y_val_pred = pipeline_all.predict(X_val)

baseline_acc = accuracy_score(y_val, y_val_pred)
print(f"Baseline Accuracy (all features): {round(baseline_acc, 2)}")

Baseline Accuracy (all features): 0.74


In [144]:
diffs = {}

# Feature ablation
for feature in all_features:
    features_subset = [f for f in all_features if f != feature]
    pipeline = build_pipeline(features_subset)
    pipeline.fit(X_train[features_subset], y_train)
    y_pred = pipeline.predict(X_val[features_subset])
    acc = accuracy_score(y_val, y_pred)
    diff = baseline_acc - acc  # No rounding in calculation
    diffs[feature] = diff
    print(f"Removed '{feature}': accuracy = {acc:.4f}, diff = {diff:.4f}")

least_useful_feature = min(diffs, key=lambda k: abs(diffs[k]))
print(f"\nLeast useful feature (smallest absolute diff in accuracy): {least_useful_feature} (diff = {diffs[least_useful_feature]:.4f})")

Removed 'lead_source': accuracy = 0.7295, diff = 0.0137
Removed 'industry': accuracy = 0.7432, diff = 0.0000
Removed 'number_of_courses_viewed': accuracy = 0.6781, diff = 0.0651
Removed 'annual_income': accuracy = 0.8562, diff = -0.1130
Removed 'employment_status': accuracy = 0.7466, diff = -0.0034
Removed 'location': accuracy = 0.7432, diff = 0.0000
Removed 'interaction_count': accuracy = 0.6747, diff = 0.0685
Removed 'lead_score': accuracy = 0.7432, diff = 0.0000

Least useful feature (smallest absolute diff in accuracy): industry (diff = 0.0000)


Observations on the accuracy changes:
- Removing `industry`, `location`, or `lead_score` does not change accuracy at all. These features either carry no signal the model uses or are redundant with other features. All three tie at a difference of 0; `min` reports `industry` only because it comes first.
- Removing `lead_source` lowers accuracy slightly (by 0.0137), so it contributes a little.
- Removing `number_of_courses_viewed` or `interaction_count` lowers accuracy the most (by 0.0651 and 0.0685). These are the most important features.

Interesting cases:
- Removing `annual_income` raises accuracy from 0.7432 to 0.8562. The feature may be noisy or misleading for this model.
- Removing `employment_status` raises accuracy slightly (by 0.0034).
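Since `min` returns only the first of several equal candidates, it can be useful to list every feature that shares the smallest absolute difference. The sketch below does this with the differences copied from the printed output above (rounded to four decimals, so they are approximations of the values held in the notebook's `diffs`).

```python
import math

# Differences copied from the printed ablation output (rounded to 4 decimals)
diffs = {
    'lead_source': 0.0137,
    'industry': 0.0000,
    'number_of_courses_viewed': 0.0651,
    'annual_income': -0.1130,
    'employment_status': -0.0034,
    'location': 0.0000,
    'interaction_count': 0.0685,
    'lead_score': 0.0000,
}

# Smallest change in accuracy, ignoring its direction
smallest = min(abs(d) for d in diffs.values())

# Every feature whose absolute difference matches that minimum
tied = [name for name, d in diffs.items()
        if math.isclose(abs(d), smallest, abs_tol=1e-12)]

print("Smallest absolute difference:", smallest)
print("Features sharing it:", tied)
```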

### Q6.
- Now let's train a regularized logistic regression.
- Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
- Train models using all the features as in Q4.
- Calculate the accuracy on the validation dataset and round it to 3 decimal digits.
- Which of these C leads to the best accuracy on the validation set?

In [None]:
C_values = [0.01, 0.1, 1, 10, 100]

In [147]:
best_acc = 0
best_C = None

for C in C_values:
    pipeline = build_pipeline(X_train, C=C)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_val)
    acc = accuracy_score(y_val, y_pred)
    acc_rounded = round(acc, 3)
    print(f"C={C}: accuracy = {acc_rounded}")
    if acc > best_acc:
        best_acc = acc
        best_C = C

print(f"\nBest C value: {best_C} with accuracy = {round(best_acc, 3)}")

C=0.01: accuracy = 0.743
C=0.1: accuracy = 0.743
C=1: accuracy = 0.743
C=10: accuracy = 0.743
C=100: accuracy = 0.743

Best C value: 0.01 with accuracy = 0.743


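Smaller values of `C` mean stronger regularization (a larger penalty on the weights), while larger values mean weaker regularization (the model can fit the training data more closely). Here all five values give the same validation accuracy of 0.743, so either the model is not sensitive to regularization in this range or validation accuracy has reached its ceiling for these features. `0.01` is reported as the best only because it is tried first and the loop updates the best value only on a strictly higher accuracy.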
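Identical accuracy does not mean `C` has no effect on the fitted model. A minimal sketch, assuming synthetic data from `make_classification` rather than the lead scoring dataset, compares the size of the learned weights for a small and a large `C`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

norms = {}
for C in [0.01, 1, 100]:
    clf = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    clf.fit(X_demo, y_demo)
    # L2 norm of the coefficient vector: smaller means more heavily shrunk weights
    norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

Stronger regularization shrinks the weights toward zero, but predictions change only when a weight change moves an example across the 0.5 decision threshold, which is one way accuracy can stay flat across very different values of `C`.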