## Imports and data prep

In [2]:
import pandas as pd
import numpy as np
 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
 
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [4]:
# This snippet prepares the data: it reads the CSV file, normalizes the column names
# and string values (lowercase, underscores instead of spaces), converts totalcharges
# to numbers, and encodes churn as 0/1.

# Data preparation
data_url = 'https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-03-churn-prediction/WA_Fn-UseC_-Telco-Customer-Churn.csv'

df = pd.read_csv(data_url)
 
df.columns = df.columns.str.lower().str.replace(' ', '_')
 
categorical_columns = list(df.dtypes[df.dtypes == 'object'].index)
 
for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(' ', '_')
 
df.totalcharges = pd.to_numeric(df.totalcharges, errors='coerce')
df.totalcharges = df.totalcharges.fillna(0)
 
df.churn = (df.churn == 'yes').astype(int)
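
To make the cleaning steps above easier to follow, here is a minimal sketch on a hypothetical three-row frame (`toy` and its values are invented for illustration and are not taken from the Telco dataset). It assumes the raw file stores a missing total charge as a single space, which the string normalization turns into `'_'` before `pd.to_numeric(..., errors='coerce')` maps it to `NaN`.

```python
import pandas as pd

# Hypothetical toy frame that mimics the quirks of the raw CSV
toy = pd.DataFrame({
    'Customer ID': ['0001-A', '0002-B', '0003-C'],
    'Payment Method': ['Electronic check', 'Mailed check', 'Credit card'],
    'TotalCharges': ['29.85', ' ', '108.15'],
    'Churn': ['No', 'Yes', 'No'],
})

# Same normalization as above: lowercase, spaces replaced by underscores
toy.columns = toy.columns.str.lower().str.replace(' ', '_')
for c in ['payment_method', 'totalcharges', 'churn']:
    toy[c] = toy[c].str.lower().str.replace(' ', '_')

# Unparseable entries become NaN, which we then replace with 0
toy['totalcharges'] = pd.to_numeric(toy['totalcharges'], errors='coerce').fillna(0)

# Binary target: 1 for 'yes', 0 otherwise
toy['churn'] = (toy['churn'] == 'yes').astype(int)

print(toy)
```

Filling with 0 is the choice made in this notebook; whether 0 is a sensible stand-in for a missing total depends on the data.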

In [5]:
# Our label: the churn column (1 = churned, 0 = stayed), taken here from the full dataset before splitting

y_train = df.churn

In [6]:
# This snippet splits the data. Again we use the train_test_split function
# to divide the dataset into df_full_train (80%) and df_test (20%).

# Data splitting
 
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
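
As a rough illustration of what this split produces, the sketch below runs `train_test_split` on a hypothetical 100-row frame (`toy`, `x`, and the alternating `churn` values are made up). It shows the 80/20 sizes, that a fixed `random_state` reproduces the same split, and one way to take labels from the split frame so they stay aligned with its rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with an alternating 0/1 target
toy = pd.DataFrame({'x': np.arange(100), 'churn': np.arange(100) % 2})

toy_full_train, toy_test = train_test_split(toy, test_size=0.2, random_state=1)
print(len(toy_full_train), len(toy_test))

# The two parts do not share any rows
overlap = set(toy_full_train.index) & set(toy_test.index)

# The same random_state yields the same split on a second call
_, toy_test_again = train_test_split(toy, test_size=0.2, random_state=1)
same_split = toy_test.index.equals(toy_test_again.index)

# Labels taken from the split frame match its rows one to one
y_full_train = toy_full_train['churn'].values
```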


## Train the model

In [7]:
# Define which columns are numerical and which are categorical.

numerical = ['tenure', 'monthlycharges', 'totalcharges']
 
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']

In [8]:
# The train function takes three arguments: the training dataframe, the target values y_train,
# and C, the inverse regularization strength passed to LogisticRegression.
# First we turn the rows into dictionaries, using both the categorical and the numerical columns.
# Then we create a DictVectorizer and call fit_transform on the dictionaries to get X_train;
# it one-hot encodes the string values and passes the numerical values through unchanged.
# Next we create a logistic regression model and fit it on the training data (X_train and y_train).
# To apply the model later, we return the DictVectorizer together with the model.

def train(df_train, y_train, C=1.0):
    dicts = df_train[categorical + numerical].to_dict(orient='records')
 
    dv = DictVectorizer(sparse=False)
    X_train = dv.fit_transform(dicts)
 
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
 
    return dv, model

In [None]:
# To use the model we also need the DictVectorizer, so both are arguments of the predict function
# shown in the next snippet. The third argument is the dataframe we want predictions for.
# The first step is the same as in the train function: we convert the rows to dictionaries.
# The DictVectorizer then transforms them into the matrix X that we make predictions on.
# The function returns the predicted probability of churning.