## Homework

> Note: sometimes your answer doesn't match one of the options exactly. 
> That's fine. 
> Select the option that's closest to your solution.

### Dataset

In this homework, we will use the lead scoring dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv).

Or you can do it with `wget`:

In [1]:
data = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv'
!wget $data -O course_lead_scoring.csv

--2025-10-13 08:34:20--  https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 80876 (79K) [text/plain]
Saving to: 'course_lead_scoring.csv'

     0K .......... .......... .......... .......... .......... 63% 1,31M 0s
    50K .......... .......... ........                        100% 2,00M=0,05s

2025-10-13 08:34:21 (1,50 MB/s) - 'course_lead_scoring.csv' saved [80876/80876]




In this dataset, the target for the classification task is the `converted` variable: whether the client has signed up to the platform or not.

In [2]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

### Data preparation

Check whether the features contain missing values.

If there are missing values:

* For categorical features, replace them with 'NA'
* For numerical features, replace them with 0.0

In [3]:
df = pd.read_csv('course_lead_scoring.csv')
df.head()

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,paid_ads,,1,79450.0,unemployed,south_america,4,0.94,1
1,social_media,retail,1,46992.0,employed,south_america,1,0.8,0
2,events,healthcare,5,78796.0,unemployed,australia,3,0.69,1
3,paid_ads,retail,2,83843.0,,australia,1,0.87,0
4,referral,education,3,85012.0,self_employed,europe,3,0.62,1


Normalizing the categorical columns:

In [4]:
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

In [5]:
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

Column names:

In [6]:
df.columns

Index(['lead_source', 'industry', 'number_of_courses_viewed', 'annual_income',
       'employment_status', 'location', 'interaction_count', 'lead_score',
       'converted'],
      dtype='object')

The numerical variables (including the target `converted`) are:

In [7]:
df.dtypes[df.dtypes != 'object'].index

Index(['number_of_courses_viewed', 'annual_income', 'interaction_count',
       'lead_score', 'converted'],
      dtype='object')

One way to see the unique values and the dtype of each column:

In [8]:
for c in df.columns:
    print(f'For Column: {c}')
    print(f'The unique values are: {df[c].unique()}')
    print(f'dtypes is: {df[c].dtypes}')
    print('----')

For Column: lead_source
The unique values are: ['paid_ads' 'social_media' 'events' 'referral' 'organic_search' nan]
dtypes is: object
----
For Column: industry
The unique values are: [nan 'retail' 'healthcare' 'education' 'manufacturing' 'technology'
 'other' 'finance']
dtypes is: object
----
For Column: number_of_courses_viewed
The unique values are: [1 5 2 3 0 4 6 8 7 9]
dtypes is: int64
----
For Column: annual_income
The unique values are: [79450. 46992. 78796. ... 45688. 71016. 92855.]
dtypes is: float64
----
For Column: employment_status
The unique values are: ['unemployed' 'employed' nan 'self_employed' 'student']
dtypes is: object
----
For Column: location
The unique values are: ['south_america' 'australia' 'europe' 'africa' 'middle_east' nan
 'north_america' 'asia']
dtypes is: object
----
For Column: interaction_count
The unique values are: [ 4  1  3  6  2  0  5  7  9  8 10 11]
dtypes is: int64
----
For Column: lead_score
The unique values are: [0.94 0.8  0.69 0.87 0.62 0.83 0.

Looking for null/NaN values:

In [9]:
df.isnull().sum()

lead_source                 128
industry                    134
number_of_courses_viewed      0
annual_income               181
employment_status           100
location                     63
interaction_count             0
lead_score                    0
converted                     0
dtype: int64

Details of the NaN values (this indexing masks element-wise, so every non-missing cell is also shown as NaN):

In [10]:
df[df.isnull()]

Unnamed: 0,lead_source,industry,number_of_courses_viewed,annual_income,employment_status,location,interaction_count,lead_score,converted
0,,,,,,,,,
1,,,,,,,,,
2,,,,,,,,,
3,,,,,,,,,
4,,,,,,,,,
...,...,...,...,...,...,...,...,...,...
1457,,,,,,,,,
1458,,,,,,,,,
1459,,,,,,,,,
1460,,,,,,,,,
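
To list only the rows that actually contain a missing value, a row-wise filter with `isnull().any(axis=1)` is one option. A minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    'industry': ['retail', np.nan, 'finance'],
    'annual_income': [50000.0, 42000.0, np.nan],
    'converted': [1, 0, 1],
})

# Keep only the rows where at least one column is missing.
rows_with_nan = toy[toy.isnull().any(axis=1)]
print(rows_with_nan)

# Count the missing cells in each row.
missing_per_row = toy.isnull().sum(axis=1)
print(missing_per_row)
```

The same filter applied to `df` would show the original values next to the gaps, which makes it easier to see which columns are affected together.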


If there are missing values:

* For categorical features, replace them with 'NA'
* For numerical features, replace them with 0.0

In [11]:
df.converted.isnull().sum()

np.int64(0)

In [12]:
df.dtypes

lead_source                  object
industry                     object
number_of_courses_viewed      int64
annual_income               float64
employment_status            object
location                     object
interaction_count             int64
lead_score                  float64
converted                     int64
dtype: object

Filling the NaN/null values:

In [13]:
numerical = ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']
categorical = ['lead_source',  'industry', 'employment_status', 'location']
df[numerical].isnull().sum()


number_of_courses_viewed      0
annual_income               181
interaction_count             0
lead_score                    0
dtype: int64

In [14]:
df[categorical].isnull().sum()

lead_source          128
industry             134
employment_status    100
location              63
dtype: int64

In [15]:
df['annual_income'] = df['annual_income'].fillna(0.0)
df[categorical] = df[categorical].fillna('NA')
# verify that the NaN values are filled
df.isnull().sum(), df[categorical].isnull().sum(), df[numerical].isnull().sum()

(lead_source                 0
 industry                    0
 number_of_courses_viewed    0
 annual_income               0
 employment_status           0
 location                    0
 interaction_count           0
 lead_score                  0
 converted                   0
 dtype: int64,
 lead_source          0
 industry             0
 employment_status    0
 location             0
 dtype: int64,
 number_of_courses_viewed    0
 annual_income               0
 interaction_count           0
 lead_score                  0
 dtype: int64)

There are no more NaN or null values.


### Question 1

What is the most frequent observation (mode) for the column `industry`?

- `NA`
- `technology`
- `healthcare`
- `retail`

In [16]:
df.industry.mode()

0    retail
Name: industry, dtype: object

Q1 - Response:
* the mode of the `industry` column is `retail`

### Question 2

Create the [correlation matrix](https://www.google.com/search?q=correlation+matrix) for the numerical features of your dataset. 
In a correlation matrix, you compute the correlation coefficient between every pair of features.

What are the two features that have the biggest correlation?

- `interaction_count` and `lead_score`
- `number_of_courses_viewed` and `lead_score`
- `number_of_courses_viewed` and `interaction_count`
- `annual_income` and `interaction_count`

Only consider the pairs above when answering this question.


In [17]:
df[numerical].nunique()

number_of_courses_viewed      10
annual_income               1268
interaction_count             12
lead_score                   101
dtype: int64

In [18]:
df[numerical].corrwith(df.converted).sort_values(ascending=False)

number_of_courses_viewed    0.435914
interaction_count           0.374573
lead_score                  0.193673
annual_income               0.053131
dtype: float64

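Q2 - Response:
* Selected option: `number_of_courses_viewed` and `interaction_count`
* Note: `corrwith(df.converted)` above ranks each feature by its correlation with the target, not the correlation between feature pairs that the question asks about, so this answer still needs to be checked against `df[numerical].corr()`.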
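
Pairwise correlations between features come from `DataFrame.corr()`. A minimal sketch of comparing the four candidate pairs, using a small made-up frame in place of `df[numerical]` (the values are illustrative, so the printed pair says nothing about the real dataset):

```python
import pandas as pd

# Hypothetical stand-in for df[numerical]; the values are made up.
toy = pd.DataFrame({
    'number_of_courses_viewed': [1, 5, 2, 3, 0, 4],
    'annual_income': [79450.0, 46992.0, 78796.0, 83843.0, 85012.0, 0.0],
    'interaction_count': [4, 1, 3, 1, 3, 2],
    'lead_score': [0.94, 0.80, 0.69, 0.87, 0.62, 0.83],
})

corr = toy.corr()  # Pearson correlation between every pair of columns

pairs = [
    ('interaction_count', 'lead_score'),
    ('number_of_courses_viewed', 'lead_score'),
    ('number_of_courses_viewed', 'interaction_count'),
    ('annual_income', 'interaction_count'),
]

# Look up each candidate pair and pick the largest coefficient.
pair_corr = {pair: corr.loc[pair[0], pair[1]] for pair in pairs}
best_pair = max(pair_corr, key=pair_corr.get)
print(best_pair, round(pair_corr[best_pair], 3))
```

If a strong negative correlation should count as "big" too, compare `abs()` of the coefficients instead.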

### Split the data

* Split your data in train/val/test sets with 60%/20%/20% distribution.
* Use Scikit-Learn for that (the `train_test_split` function) and set the seed to `42`.
* Make sure that the target value `y` is not in your dataframe.


In [19]:
from sklearn.model_selection import train_test_split

In [20]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)

In [21]:
len(df_train), len(df_val), len(df_test)

(876, 293, 293)

In [22]:
len(df)

1462
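
The second call uses `test_size=0.25` because it splits the remaining 80% of the data: 0.25 × 0.8 = 0.2 of the original rows. A small sketch that checks the resulting sizes on a placeholder array of the same length (`rows` is a stand-in, not the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1462)  # placeholder with the same number of rows as df

full_train, test = train_test_split(rows, test_size=0.2, random_state=42)
train, val = train_test_split(full_train, test_size=0.25, random_state=42)

sizes = (len(train), len(val), len(test))
print(sizes)
print([round(s / len(rows), 3) for s in sizes])
```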

In [23]:
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
df_full_train = df_full_train.reset_index(drop=True)

In [24]:
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values

del df_train['converted']
del df_val['converted']
del df_test['converted']

### Question 3

* Calculate the mutual information score between `y` and other categorical variables in the dataset. Use the training set only.
* Round the scores to 2 decimals using `round(score, 2)`.

Which of these variables has the biggest mutual information score?
  
- `industry`
- `location`
- `lead_source`
- `employment_status`

In [25]:
categorical

['lead_source', 'industry', 'employment_status', 'location']

In [26]:
df_train.isnull().sum()

lead_source                 0
industry                    0
number_of_courses_viewed    0
annual_income               0
employment_status           0
location                    0
interaction_count           0
lead_score                  0
dtype: int64

In [27]:
from sklearn.metrics import mutual_info_score

In [28]:
def mutual_info_converted_score(series):
    # mutual information between one categorical column and the target
    return mutual_info_score(series, y_train)

In [29]:
mi = df_train[categorical].apply(mutual_info_converted_score)
round(mi.sort_values(ascending=False), 2)

lead_source          0.04
employment_status    0.01
industry             0.01
location             0.00
dtype: float64

Q3 - Response:

* the variable that has the biggest mutual information score is:

- `lead_source`
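
For intuition about what the score measures, here is a minimal sketch on made-up labels. A feature that determines the target exactly gets the maximum score (ln 2 for a balanced binary target, since `mutual_info_score` uses natural logarithms), while a feature that is independent of the target scores 0.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Made-up binary target and two toy categorical features.
y = [0, 0, 1, 1]
informative = ['a', 'a', 'b', 'b']    # determines y exactly
uninformative = ['a', 'b', 'a', 'b']  # independent of y

mi_informative = mutual_info_score(informative, y)
mi_uninformative = mutual_info_score(uninformative, y)

print(round(mi_informative, 2), round(mi_uninformative, 2))
```

Scores such as 0.04 for `lead_source` are therefore small in absolute terms; they are mainly useful for ranking the categorical features against each other.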

### Question 4

* Now let's train a logistic regression.
* Remember that we have several categorical variables in the dataset. Include them using one-hot encoding.
* Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - `model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)`
* Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

What accuracy did you get?

- 0.64
- 0.74
- 0.84
- 0.94

In [30]:
from sklearn.feature_extraction import DictVectorizer

In [31]:
categorical

['lead_source', 'industry', 'employment_status', 'location']

In [32]:
numerical

['number_of_courses_viewed',
 'annual_income',
 'interaction_count',
 'lead_score']

Convert the categorical and numerical columns into a list of dictionaries, create a new `DictVectorizer` instance, and fit it on those dictionaries:

In [33]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

X_train.shape, X_val.shape

((876, 31), (293, 31))

In [34]:
dv.get_feature_names_out()

array(['annual_income', 'employment_status=NA',
       'employment_status=employed', 'employment_status=self_employed',
       'employment_status=student', 'employment_status=unemployed',
       'industry=NA', 'industry=education', 'industry=finance',
       'industry=healthcare', 'industry=manufacturing', 'industry=other',
       'industry=retail', 'industry=technology', 'interaction_count',
       'lead_score', 'lead_source=NA', 'lead_source=events',
       'lead_source=organic_search', 'lead_source=paid_ads',
       'lead_source=referral', 'lead_source=social_media', 'location=NA',
       'location=africa', 'location=asia', 'location=australia',
       'location=europe', 'location=middle_east',
       'location=north_america', 'location=south_america',
       'number_of_courses_viewed'], dtype=object)

In [35]:
dv.transform(train_dict[3:4])

array([[7.4956e+04, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 3.0000e+00,
        3.4000e-01, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
        1.0000e+00]])

In [36]:
train_dict[3]

{'lead_source': 'NA',
 'industry': 'technology',
 'employment_status': 'employed',
 'location': 'europe',
 'number_of_courses_viewed': 1,
 'annual_income': 74956.0,
 'interaction_count': 3,
 'lead_score': 0.34}

In [37]:
X_val[1]

array([5.9656e+04, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 4.0000e+00,
       6.5000e-01, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00,
       3.0000e+00])

In [38]:
from sklearn.linear_model import LogisticRegression

In [39]:
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,42
,solver,'liblinear'
,max_iter,1000


`coef_` is a two-dimensional array with just one row:

In [40]:
model.coef_

array([[-1.77843877e-05, -1.47154423e-02,  3.39095225e-02,
         2.66248432e-03,  1.15238518e-02, -1.02527697e-01,
        -2.48510995e-02,  4.93604222e-02, -2.01258344e-02,
        -1.34214865e-02, -3.00232200e-03, -9.25991830e-03,
        -3.17957304e-02, -1.60513114e-02,  3.11339155e-01,
         5.12012528e-02,  2.01511698e-02, -1.20346284e-02,
        -1.16021521e-02, -1.15251880e-01,  7.95303436e-02,
        -2.99401329e-02,  3.95843295e-03, -1.14296944e-02,
        -1.12457415e-02, -5.59987025e-03,  8.26402635e-03,
         5.58598769e-03, -3.33967159e-02, -2.52837052e-02,
         4.53752887e-01]])

Get just the weights `w`, rounded to 3 decimals:

In [41]:
model.coef_[0].round(3)

array([-0.   , -0.015,  0.034,  0.003,  0.012, -0.103, -0.025,  0.049,
       -0.02 , -0.013, -0.003, -0.009, -0.032, -0.016,  0.311,  0.051,
        0.02 , -0.012, -0.012, -0.115,  0.08 , -0.03 ,  0.004, -0.011,
       -0.011, -0.006,  0.008,  0.006, -0.033, -0.025,  0.454])

Get the bias term:

In [42]:
model.intercept_[0]

np.float64(-0.06914728027824993)
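
The weights and the bias are all that is needed to reproduce the model's probabilities: logistic regression computes `sigmoid(X @ w + b)`. A minimal sketch that checks this against `predict_proba` on randomly generated data (the data and the names `X_toy`, `y_toy`, `toy_model` are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 3))
# Made-up target that depends on the first two columns plus noise.
noise = rng.normal(scale=0.5, size=200)
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] + noise > 0).astype(int)

toy_model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
toy_model.fit(X_toy, y_toy)

w = toy_model.coef_[0]       # one weight per feature
b = toy_model.intercept_[0]  # bias term


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


manual = sigmoid(X_toy @ w + b)
from_model = toy_model.predict_proba(X_toy)[:, 1]
print(np.allclose(manual, from_model))
```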

Use the model to get the hard predictions:

In [43]:
model.predict(X_train)

array([1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,

Getting the soft predictions; the second column is the probability that a client converts:

In [44]:
model.predict_proba(X_train)

array([[0.42085658, 0.57914342],
       [0.12716509, 0.87283491],
       [0.41183895, 0.58816105],
       ...,
       [0.25265786, 0.74734214],
       [0.3302157 , 0.6697843 ],
       [0.14407824, 0.85592176]])

We need the second column, which holds the predicted probabilities, this time for the validation set:

In [45]:
y_pred = model.predict_proba(X_val)[:, 1]
y_pred

array([0.61192162, 0.79982616, 0.53021342, 0.47131479, 0.5706613 ,
       0.44227166, 0.87127669, 0.84883114, 0.83290037, 0.614978  ,
       0.54968025, 0.78153087, 0.69039784, 0.77017121, 0.52659438,
       0.91706424, 0.53170633, 0.42123047, 0.30146454, 0.84881582,
       0.79488652, 0.73670373, 0.44527209, 0.64838383, 0.41768818,
       0.75393417, 0.90166115, 0.33903047, 0.43181429, 0.9680681 ,
       0.92018714, 0.37487987, 0.65230099, 0.90650057, 0.75164115,
       0.6420212 , 0.82250074, 0.83375553, 0.65911599, 0.30978853,
       0.78942264, 0.35546364, 0.96517758, 0.63389304, 0.51274194,
       0.53230532, 0.82287784, 0.744074  , 0.73452312, 0.68955216,
       0.46964441, 0.84539251, 0.55635242, 0.92637871, 0.65258021,
       0.61526271, 0.63816994, 0.28304016, 0.48049823, 0.57890616,
       0.35497341, 0.62175049, 0.38960775, 0.61156055, 0.85304277,
       0.75430135, 0.89185953, 0.71946457, 0.95387623, 0.89209517,
       0.75277086, 0.33850137, 0.61376593, 0.51622273, 0.64088

Clients with a predicted probability of at least 0.5 are treated as converted:

In [46]:
converted_ones = (y_pred >= 0.5)
converted_ones 

array([ True,  True,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True, False,
       False,  True,  True,  True, False,  True, False,  True,  True,
       False, False,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True, False,  True, False,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True,  True, False, False,  True, False,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True, False,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
       False,  True,  True,  True, False,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True,  True,  True,  True,
        True, False, False,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False,  True, False,  True,
       False,  True,  True, False,  True,  True, False,  True,  True,
       False,  True,

Accuracy of the predictions:

In [47]:
y_val

array([0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 0, 1, 0])

In [48]:
converted_ones.astype(int)

array([1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
       0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 1, 0])

In [49]:
y_val == converted_ones

array([False,  True, False,  True, False,  True,  True,  True,  True,
       False, False,  True,  True, False, False,  True, False, False,
        True,  True,  True,  True,  True, False,  True, False,  True,
        True,  True,  True,  True,  True, False,  True,  True,  True,
        True,  True, False,  True, False, False,  True, False, False,
       False,  True,  True,  True,  True,  True,  True, False,  True,
       False,  True,  True,  True, False,  True,  True,  True, False,
       False,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False,  True,  True,  True,  True,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True, False,  True,  True,  True,  True,  True, False,
       False,  True,  True, False, False,  True,  True,  True, False,
        True, False,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,

In [50]:
round((y_val == converted_ones).mean(), 2)

np.float64(0.7)
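
The same number can also be obtained with `accuracy_score` from `sklearn.metrics`. A minimal sketch on made-up arrays showing that both approaches agree:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predicted probabilities.
y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.3, 0.8, 0.4, 0.6, 0.9])

y_hat = (y_prob >= 0.5).astype(int)  # hard predictions at the 0.5 threshold

manual_acc = (y_true == y_hat).mean()
sklearn_acc = accuracy_score(y_true, y_hat)
print(manual_acc, sklearn_acc)
```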

In [51]:
global_score_accuracy = (y_val == converted_ones).mean()
global_score_accuracy

np.float64(0.6996587030716723)

Q4 - Response:
* the validation accuracy is 0.70, so the closest option is 0.74

### Question 5 

* Let's find the least useful feature using the *feature elimination* technique.
* Train a model using the same features and parameters as in Q4 (without rounding).
* Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
* For each feature, calculate the difference between the original accuracy and the accuracy without the feature. 

Which of the following features has the smallest difference?

- `'industry'`
- `'employment_status'`
- `'lead_score'`

> **Note**: The difference doesn't have to be positive.

In [52]:
features = categorical + numerical
original_accuracy = global_score_accuracy
differences = {}

for feature in features:
    # Remove the feature from the list
    features_subset = [f for f in features if f != feature]

    # Prepare the data
    dv = DictVectorizer(sparse=False)
    train_dict = df_train[features_subset].to_dict(orient='records')
    X_train = dv.fit_transform(train_dict)
    val_dict = df_val[features_subset].to_dict(orient='records')
    X_val = dv.transform(val_dict)

    # Train the model
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1]
    converted_ones = (y_pred >= 0.5)
    score = (y_val == converted_ones).mean()

    # Calculate the difference
    differences[feature] = original_accuracy - score

print(differences)

{'lead_source': np.float64(-0.0034129692832765013), 'industry': np.float64(0.0), 'employment_status': np.float64(0.0034129692832763903), 'location': np.float64(-0.010238907849829393), 'number_of_courses_viewed': np.float64(0.14334470989761094), 'annual_income': np.float64(-0.15358361774744034), 'interaction_count': np.float64(0.14334470989761094), 'lead_score': np.float64(-0.0068259385665528916)}


In [53]:
df_differences = pd.DataFrame.from_dict(differences, orient='index', columns=['accuracy_difference'])
df_differences = df_differences.sort_values(by='accuracy_difference', ascending=False)  
df_differences

Unnamed: 0,accuracy_difference
interaction_count,0.143345
number_of_courses_viewed,0.143345
employment_status,0.003413
industry,0.0
lead_source,-0.003413
lead_score,-0.006826
location,-0.010239
annual_income,-0.153584
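
The loop above reassigns `dv`, `X_train`, `X_val` and `model`, so after it finishes those names refer to the last reduced feature set rather than the full one. One way to avoid that is to wrap training and evaluation in a function with local variables. A minimal sketch on made-up data (`toy`, `toy_train`, `toy_val` and `train_and_score` are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression


def train_and_score(df_tr, y_tr, df_va, y_va, features):
    """Fit on the given features and return the validation accuracy."""
    dv = DictVectorizer(sparse=False)
    X_tr = dv.fit_transform(df_tr[features].to_dict(orient='records'))
    X_va = dv.transform(df_va[features].to_dict(orient='records'))
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    y_pred = model.predict_proba(X_va)[:, 1]
    return ((y_pred >= 0.5).astype(int) == y_va).mean()


# Made-up data: the target depends on `views` only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'views': rng.integers(0, 10, size=300),
    'source': rng.choice(['ads', 'events', 'referral'], size=300),
})
y = (toy['views'] >= 5).astype(int).values
toy_train, toy_val = toy.iloc[:200], toy.iloc[200:]
y_tr, y_va = y[:200], y[200:]

features = ['views', 'source']
baseline = train_and_score(toy_train, y_tr, toy_val, y_va, features)

# Accuracy drop when each feature is left out in turn.
differences = {}
for feature in features:
    subset = [f for f in features if f != feature]
    differences[feature] = baseline - train_and_score(toy_train, y_tr, toy_val, y_va, subset)

print(baseline, differences)
```

Because everything inside `train_and_score` is local, any full-feature `X_train` and `X_val` defined earlier stay untouched for later cells.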



Which of the following features has the smallest difference?

- `'industry'`
- `'employment_status'`
- `'lead_score'`

Q5 - Response:
- `'industry'`

### Question 6

* Now let's train a regularized logistic regression.
* Let's try the following values of the parameter `C`: `[0.01, 0.1, 1, 10, 100]`.
* Train models using all the features as in Q4.
* Calculate the accuracy on the validation dataset and round it to 3 decimal digits.

Which of these `C` leads to the best accuracy on the validation set?

- 0.01
- 0.1
- 1
- 10
- 100

> **Note**: If there are multiple options, select the smallest `C`.

In [54]:
from sklearn.linear_model import Ridge

In [55]:
C_values = [0.01, 0.1, 1, 10, 100]
accuracies = []

for c in C_values:
    model = Ridge(alpha=c, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_val)
    converted_ones = (y_pred >= 0.5)
    acc = (y_val == converted_ones).mean()
    accuracies.append(round(acc, 3))
    print(f"C={c}: accuracy={round(acc, 3)}")

best_c = C_values[np.argmax(accuracies)]
print(f"Best C: {best_c} with accuracy {max(accuracies)}")

C=0.01: accuracy=0.802
C=0.1: accuracy=0.802
C=1: accuracy=0.802
C=10: accuracy=0.809
C=100: accuracy=0.805
Best C: 10 with accuracy 0.809


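Q6 - Response:
* Selected option: `0.01`
* Note: the cell above fits `Ridge`, a linear regression whose `alpha` grows with regularization strength (the inverse of `C`), and it reuses the `X_train`/`X_val` left by the last iteration of the Q5 loop (all features except `lead_score`). Its printed best value (`10`) therefore does not answer the question as posed, which calls for `LogisticRegression` with each `C` on all features.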
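
A minimal sketch of the loop as the question describes it, with `LogisticRegression` and the given `C` values (in scikit-learn, a smaller `C` means stronger regularization). It runs on randomly generated data standing in for the full-feature training and validation matrices; all data and the `_toy` names are made up. Because `np.argmax` returns the first maximum and `C_values` is sorted in ascending order, ties resolve to the smallest `C`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(400, 5))
# Made-up target that depends on the first two columns plus noise.
noise = rng.normal(scale=0.5, size=400)
y_toy = (X_toy[:, 0] - X_toy[:, 1] + noise > 0).astype(int)
X_tr, X_va = X_toy[:300], X_toy[300:]
y_tr, y_va = y_toy[:300], y_toy[300:]

C_values = [0.01, 0.1, 1, 10, 100]
accuracies = []

for C in C_values:
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_tr, y_tr)
    y_pred = model.predict_proba(X_va)[:, 1]
    acc = ((y_pred >= 0.5).astype(int) == y_va).mean()
    accuracies.append(round(acc, 3))
    print(f"C={C}: accuracy={round(acc, 3)}")

# First maximum in ascending C order, i.e. the smallest C among ties.
best_C = C_values[int(np.argmax(accuracies))]
print(f"Best C: {best_C}")
```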

## Submit the results

* Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw03
* If your answer doesn't match options exactly, select the closest one