## Reminder about churn prediction accuracy

In [9]:
import pandas as pd
import numpy as np
np.set_printoptions(legacy='1.25')

import matplotlib.pyplot as plt

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

In [3]:
df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

df.columns = df.columns.str.lower().str.replace(' ', '_')

categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')

df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)

df.churn = (df.churn == 'yes').astype(int)

In [5]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

y_train = df_train.churn.values
y_val = df_val.churn.values
y_test = df_test.churn.values

del df_train['churn']
del df_val['churn']
del df_test['churn']

In [6]:
numerical = ['tenure', 'monthlycharges', 'totalcharges']

categorical = [
    'gender',
    'seniorcitizen',
    'partner',
    'dependents',
    'phoneservice',
    'multiplelines',
    'internetservice',
    'onlinesecurity',
    'onlinebackup',
    'deviceprotection',
    'techsupport',
    'streamingtv',
    'streamingmovies',
    'contract',
    'paperlessbilling',
    'paymentmethod',
]

In [7]:
dv = DictVectorizer(sparse=False)

train_dict = df_train[categorical + numerical].to_dict(orient='records')
X_train = dv.fit_transform(train_dict)

model = LogisticRegression()
model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [10]:
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

y_pred = model.predict_proba(X_val)[:, 1]
churn_decision = (y_pred >= 0.5)
(y_val == churn_decision).mean()

0.8034066713981547

## Fixing the model by feature scaling

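Numerical features on very different scales can slow down iterative solvers such as "lbfgs". The warning above says the solver reached its iteration limit, and it suggests either increasing `max_iter` or scaling the data.  
We can fix this model by using a scaler. You can read more about scalers [here](https://scikit-learn.org/stable/modules/preprocessing.html).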
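One way to see whether the solver stopped at the limit is to inspect the fitted model's `n_iter_` attribute. The sketch below is only an illustration: the data is synthetic (made up to resemble `tenure`, `monthlycharges` and `totalcharges`), the target is hypothetical, and the iteration counts it prints will differ from those of the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the numerical churn features (not the real data)
rng = np.random.default_rng(1)
n = 500
tenure = rng.integers(0, 73, size=n).astype(float)
monthlycharges = rng.uniform(20, 120, size=n)
totalcharges = tenure * monthlycharges
X_raw = np.column_stack([tenure, monthlycharges, totalcharges])

# Hypothetical target: shorter tenure means a higher chance of churn
y = (rng.uniform(size=n) < 1 / (1 + np.exp((tenure - 20) / 10))).astype(int)

# Fit on raw features; this may emit a ConvergenceWarning
model_raw = LogisticRegression(max_iter=100)
model_raw.fit(X_raw, y)

# Fit on standardized features
X_scaled = StandardScaler().fit_transform(X_raw)
model_scaled = LogisticRegression(max_iter=100)
model_scaled.fit(X_scaled, y)

# n_iter_ holds the number of iterations the solver actually used
raw_iters = int(model_raw.n_iter_[0])
scaled_iters = int(model_scaled.n_iter_[0])
print(raw_iters, scaled_iters)
```

If `n_iter_` equals `max_iter`, the solver stopped because of the limit rather than because it converged.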

In [11]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

First, we prepare the numerical variables. We'll use the scaler for that and write the results to `X_train_num`:

In [16]:
X_train_num = df_train[numerical].values

scaler = StandardScaler()
#scaler = MinMaxScaler()

X_train_num = scaler.fit_transform(X_train_num)

`StandardScaler` centers each numerical column at mean 0 and rescales it to unit variance. Compare the unscaled version of tenure with the scaled one:

In [14]:
df_train.tenure.values

array([72, 10,  5, ...,  2, 27,  9])

In [17]:
X_train_num[:, 0]

array([ 1.6217693 , -0.900001  , -1.10336957, ..., -1.22539072,
       -0.20854785, -0.94067471])
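To make the transformation concrete, here is a minimal sketch on a handful of made-up tenure values (not taken from the dataset). It checks that `StandardScaler` matches the manual formula `(x - mean) / std`, where the standard deviation is the population one (`ddof=0`, NumPy's default).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy tenure values, chosen only for illustration
tenure_demo = np.array([[72.0], [10.0], [5.0], [2.0], [27.0], [9.0]])

scaler_demo = StandardScaler()
scaled_demo = scaler_demo.fit_transform(tenure_demo)

# Same computation by hand: subtract the column mean, divide by the column std
manual_demo = (tenure_demo - tenure_demo.mean(axis=0)) / tenure_demo.std(axis=0)

print(np.allclose(scaled_demo, manual_demo))
print(scaled_demo.mean(axis=0), scaled_demo.std(axis=0))
```

The mean and standard deviation are learned in `fit_transform` and stored on the scaler, which is why the validation data later only needs `transform`.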

## Using OneHotEncoder instead of DictVectorizer

There are other ways to implement one-hot encoding, e.g. using the `OneHotEncoder` class.  
Let's process the categorical features using `OneHotEncoder`. We'll write the results to `X_train_cat`:

In [20]:
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

In [21]:
X_train_cat = ohe.fit_transform(df_train[categorical].values)

In [23]:
ohe.get_feature_names_out()

array(['x0_female', 'x0_male', 'x1_0', 'x1_1', 'x2_no', 'x2_yes', 'x3_no',
       'x3_yes', 'x4_no', 'x4_yes', 'x5_no', 'x5_no_phone_service',
       'x5_yes', 'x6_dsl', 'x6_fiber_optic', 'x6_no', 'x7_no',
       'x7_no_internet_service', 'x7_yes', 'x8_no',
       'x8_no_internet_service', 'x8_yes', 'x9_no',
       'x9_no_internet_service', 'x9_yes', 'x10_no',
       'x10_no_internet_service', 'x10_yes', 'x11_no',
       'x11_no_internet_service', 'x11_yes', 'x12_no',
       'x12_no_internet_service', 'x12_yes', 'x13_month-to-month',
       'x13_one_year', 'x13_two_year', 'x14_no', 'x14_yes',
       'x15_bank_transfer_(automatic)', 'x15_credit_card_(automatic)',
       'x15_electronic_check', 'x15_mailed_check'], dtype=object)

Now we need to combine the two matrices into one, `X_train`, with the numerical columns first and the one-hot columns after them:

In [24]:
X_train = np.column_stack([X_train_num, X_train_cat])
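As a quick illustration with made-up numbers, `np.column_stack` places the arrays side by side: the number of rows stays the same and the column counts add up.

```python
import numpy as np

# Two rows of scaled numerical features (toy values)
num_demo = np.array([[0.5, -1.2],
                     [1.0,  0.3]])

# Two rows of one-hot encoded features (toy values)
cat_demo = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

combined_demo = np.column_stack([num_demo, cat_demo])
print(combined_demo.shape)
print(combined_demo)
```

Both inputs must have the same number of rows, and the rows must be in the same order, since each row of the result describes one customer.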

And now let's train the model:

In [25]:
model = LogisticRegression(solver='lbfgs', C=1.0, random_state=42)
model.fit(X_train, y_train)

And check its accuracy:

In [28]:
from sklearn.metrics import accuracy_score

In [26]:
X_val_num = df_val[numerical].values
X_val_num = scaler.transform(X_val_num)

X_val_cat = ohe.transform(df_val[categorical].values)

X_val = np.column_stack([X_val_num, X_val_cat])

In [29]:
y_pred = model.predict_proba(X_val)[:, 1]
accuracy_score(y_val, y_pred >= 0.5)

0.8062455642299503

At about 0.806 versus 0.803, it's a little bit better than the version without scaled features.
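The manual steps in this notebook (scale the numerical columns, encode the categorical ones, stack them, fit the model) can also be bundled into one object. The following is a sketch of that alternative using scikit-learn's `ColumnTransformer` and `make_pipeline`; it is not what the notebook does, and the data frame and target are synthetic, made up only so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame with two numerical and one categorical column
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    'tenure': rng.integers(0, 73, size=n),
    'monthlycharges': rng.uniform(20, 120, size=n),
    'contract': rng.choice(['month-to-month', 'one_year', 'two_year'], size=n),
})
# Hypothetical target, used only to have something to fit
y_toy = (toy.contract == 'month-to-month').astype(int).values

# Scale numerical columns and one-hot encode categorical ones in one step
preprocess = ColumnTransformer([
    ('num', StandardScaler(), ['tenure', 'monthlycharges']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['contract']),
])

pipe = make_pipeline(preprocess, LogisticRegression())
pipe.fit(toy, y_toy)

# The pipeline applies the fitted scaler and encoder before predicting
proba_toy = pipe.predict_proba(toy)[:, 1]
print(proba_toy[:5])
```

The benefit of this design is that the same fitted preprocessing is applied automatically at prediction time, so the validation and test frames can be passed to `pipe` directly.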