# Logistic Regression for Classification
## Churn Prediction

**Objective: Identify clients that want to leave the company**

* Data: https://www.kaggle.com/blastchar/telco-customer-churn
* Target: 1, for people who leave, 0 for people who stay
* Use Logistic Regression for churn prediction

## Setup

In [33]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

## Read Data and initial Preparation
* Look at the data
* Make column names and values uniform
* Check if all columns read correctly
* Check if "churn" column needs preparation

In [6]:
df = pd.read_csv("../data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

**Alternative: download the data**
```
# NOTE: the Kaggle page may require a login; if it does, point `data` at a direct CSV link
data = "https://www.kaggle.com/blastchar/telco-customer-churn"
!wget $data -O ../data/WA_Fn-UseC_-Telco-Customer-Churn.csv
```

In [7]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


* The dataframe has many columns, so `.head()` cannot display all of them

In [10]:
# Transpose dataframe to see all columns
df.head().T

Unnamed: 0,0,1,2,3,4
customerID,7590-VHVEG,5575-GNVDE,3668-QPYBK,7795-CFOCW,9237-HQITU
gender,Female,Male,Male,Male,Female
SeniorCitizen,0,0,0,0,0
Partner,Yes,No,No,No,No
Dependents,No,No,No,No,No
tenure,1,34,2,45,2
PhoneService,No,Yes,Yes,No,Yes
MultipleLines,No phone service,No,No,No phone service,No
InternetService,DSL,DSL,DSL,DSL,Fiber optic
OnlineSecurity,No,Yes,Yes,Yes,No


In [11]:
# Make columns consistent
df.columns = df.columns.str.lower().str.replace(" ", "_")
df.columns

Index(['customerid', 'gender', 'seniorcitizen', 'partner', 'dependents',
       'tenure', 'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod', 'monthlycharges', 'totalcharges', 'churn'],
      dtype='object')

In [15]:
categorical_columns = list(df.dtypes[df.dtypes == object].index)
categorical_columns

for c in categorical_columns:
    df[c] = df[c].str.lower().str.replace(" ", "_")

In [16]:
df.head().T

Unnamed: 0,0,1,2,3,4
customerid,7590-vhveg,5575-gnvde,3668-qpybk,7795-cfocw,9237-hqitu
gender,female,male,male,male,female
seniorcitizen,0,0,0,0,0
partner,yes,no,no,no,no
dependents,no,no,no,no,no
tenure,1,34,2,45,2
phoneservice,no,yes,yes,no,yes
multiplelines,no_phone_service,no,no,no_phone_service,no
internetservice,dsl,dsl,dsl,dsl,fiber_optic
onlinesecurity,no,yes,yes,yes,no


* Have a look at the datatypes

In [17]:
df.dtypes

customerid           object
gender               object
seniorcitizen         int64
partner              object
dependents           object
tenure                int64
phoneservice         object
multiplelines        object
internetservice      object
onlinesecurity       object
onlinebackup         object
deviceprotection     object
techsupport          object
streamingtv          object
streamingmovies      object
contract             object
paperlessbilling     object
paymentmethod        object
monthlycharges      float64
totalcharges         object
churn                object
dtype: object

**Notes**
* "seniorcitizen" is an int
* "totalcharges" is an object - should be a number

In [18]:
df.totalcharges

0         29.85
1        1889.5
2        108.15
3       1840.75
4        151.65
         ...   
7038     1990.5
7039     7362.9
7040     346.45
7041      306.6
7042     6844.5
Name: totalcharges, Length: 7043, dtype: object

In [19]:
# convert it to a number
pd.to_numeric(df["totalcharges"])

ValueError: Unable to parse string "_" at position 488

**Notes**
* This error occurs because some values are "_", which pandas cannot convert to a number
* These values appeared because we replaced every " " with "_" in the string columns
* Passing ```errors="coerce"``` to ```to_numeric``` converts every value that cannot be parsed to NaN

In [21]:
# convert it to a number
tc = pd.to_numeric(df["totalcharges"], errors="coerce")
tc

0         29.85
1       1889.50
2        108.15
3       1840.75
4        151.65
         ...   
7038    1990.50
7039    7362.90
7040     346.45
7041     306.60
7042    6844.50
Name: totalcharges, Length: 7043, dtype: float64

In [24]:
tc.isnull().sum()

11

In [25]:
df["totalcharges"] = tc

* Replace missing values with 0

In [26]:
df["totalcharges"] = df["totalcharges"].fillna(0)

* Look at the "churn" variable

In [28]:
df["churn"].head()

0     no
1     no
2    yes
3     no
4    yes
Name: churn, dtype: object

* Replace yes / no with 1 / 0

In [32]:
df["churn"] = (df["churn"] == "yes").astype(int)
df["churn"].head()

0    0
1    0
2    1
3    0
4    1
Name: churn, dtype: int64

## Setting up the Validation Framework
* Perform the train / val / test split using scikit-learn

In [36]:
# 80% train + val = train_full, 20% test
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=1)

In [38]:
print(f"train_full length {len(df_train_full)}, test length {len(df_test)}")

train_full length 5634, test length 1409


In [41]:
# 75% train, 25% val out of train_full 
# 60% train, 20% val, 20% test out of df
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=1)

In [42]:
print(f"train length {len(df_train)}, val length {len(df_val)}")

train length 4225, val length 1409


In [43]:
# reset index
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

In [44]:
y_train = df_train["churn"]
y_val = df_val["churn"]
y_test = df_test["churn"]

In [45]:
# delete "churn" from df_train, df_val, df_test (not from df)
del df_train["churn"]
del df_val["churn"]
del df_test["churn"]

## EDA
* Check missing values
* Look at the target variable "churn"
* Look at numerical and categorical variables

In [47]:
# For the EDA we will use df_train_full
df_train_full = df_train_full.reset_index(drop=True)

In [48]:
# check missing values
df_train_full.isnull().sum()

customerid          0
gender              0
seniorcitizen       0
partner             0
dependents          0
tenure              0
phoneservice        0
multiplelines       0
internetservice     0
onlinesecurity      0
onlinebackup        0
deviceprotection    0
techsupport         0
streamingtv         0
streamingmovies     0
contract            0
paperlessbilling    0
paymentmethod       0
monthlycharges      0
totalcharges        0
churn               0
dtype: int64

* No missing values

In [49]:
# look at the target variable
df_train_full["churn"].value_counts()

0    4113
1    1521
Name: churn, dtype: int64

In [51]:
# look at the target variable
df_train_full["churn"].value_counts(normalize=True)

0    0.730032
1    0.269968
Name: churn, dtype: float64

* The number of churned users (1521) is a bit more than 1/3 of the number of non-churned users (4113)
* "Churn rate" = 0.269968: Fraction of users that churned (~27%)

In [53]:
# churn rate can also be calculated using mean
df_train_full["churn"].mean()

0.26996805111821087

In [57]:
global_churn_rate = df_train_full["churn"].mean()
round(global_churn_rate, 2)

0.27

* Look at other Variables
* Numerical variables of interest: "tenure", "monthlycharges", "totalcharges"
* Categorical variables 

In [59]:
numerical = ["tenure", "monthlycharges", "totalcharges"]

In [63]:
categorical = ['gender', 'seniorcitizen', 'partner', 'dependents',
       'phoneservice', 'multiplelines', 'internetservice',
       'onlinesecurity', 'onlinebackup', 'deviceprotection', 'techsupport',
       'streamingtv', 'streamingmovies', 'contract', 'paperlessbilling',
       'paymentmethod']

In [65]:
df_train_full[categorical].nunique()

gender              2
seniorcitizen       2
partner             2
dependents          2
phoneservice        2
multiplelines       3
internetservice     3
onlinesecurity      3
onlinebackup        3
deviceprotection    3
techsupport         3
streamingtv         3
streamingmovies     3
contract            3
paperlessbilling    2
paymentmethod       4
dtype: int64

**Notes**
* A lot of categorical variables are binary

## Feature Importance: Churn Rate and Risk Ratio
Feature importance analysis (part of EDA) - identify which features affect our target variable
* Churn rate
* Risk ratio
* Mutual information - later
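
As a minimal sketch of the first two measures, the example below computes the churn rate per group and compares it with the global churn rate, once as a difference and once as a ratio (the risk ratio). The toy dataframe and the names `toy`, `global_rate`, and `groups` are illustrative assumptions, not the Telco data.

```python
import pandas as pd

# Hypothetical toy data: 8 customers, 2 contract types
toy = pd.DataFrame({
    "contract": ["month-to-month"] * 4 + ["two_year"] * 4,
    "churn": [1, 1, 0, 0, 0, 0, 0, 1],
})

# Global churn rate: fraction of all customers that churned
global_rate = toy["churn"].mean()

# Churn rate and group size for each contract type
groups = toy.groupby("contract")["churn"].agg(["mean", "count"])

# Difference: group rate minus global rate (> 0 means the group churns more)
groups["diff"] = groups["mean"] - global_rate

# Risk ratio: group rate divided by global rate (> 1 means the group churns more)
groups["risk"] = groups["mean"] / global_rate

print(groups)
```

A group whose risk ratio is well above 1 churns more often than average, and one well below 1 churns less often. Features whose groups all sit close to 1 probably carry little information about churn.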

## Feature Importance: Mutual Information
Mutual information is a concept from information theory: it tells us how much we can learn about one variable if we know the value of another

* https://en.wikipedia.org/wiki/Mutual_information
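
One possible way to compute this is `mutual_info_score` from `sklearn.metrics`, sketched below on made-up data. The lists `churn`, `contract`, and `gender` are hypothetical: `contract` is constructed to determine churn exactly, while `gender` is constructed to be independent of it.

```python
import math
from sklearn.metrics import mutual_info_score

# Hypothetical toy data (not the Telco dataset)
churn    = [1, 1, 0, 0, 1, 1, 0, 0]
contract = ["monthly", "monthly", "yearly", "yearly",
            "monthly", "monthly", "yearly", "yearly"]
gender   = ["f", "m", "f", "m", "f", "m", "f", "m"]

# contract determines churn exactly, so the score equals the entropy of churn
mi_contract = mutual_info_score(contract, churn)

# gender is independent of churn here, so the score is (numerically) zero
mi_gender = mutual_info_score(gender, churn)

print(f"contract: {mi_contract:.4f}, gender: {mi_gender:.4f}")
print(f"entropy of churn in nats: {math.log(2):.4f}")
```

The score is symmetric in its two arguments and is reported in nats. Only the relative ranking of the features matters when it is used for feature importance.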

## Feature Importance: Correlation
What about numerical columns?
* Correlation Coefficient
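
A minimal sketch of this idea, assuming the Pearson correlation coefficient and using `DataFrame.corrwith` from pandas. The toy values are invented for illustration: short tenure and high monthly charges are made to coincide with churn.

```python
import pandas as pd

# Hypothetical toy data (not the Telco dataset)
toy = pd.DataFrame({
    "tenure":         [1, 2, 3, 40, 50, 60],
    "monthlycharges": [90.0, 85.0, 80.0, 30.0, 25.0, 20.0],
    "churn":          [1, 1, 1, 0, 0, 0],
})

numerical = ["tenure", "monthlycharges"]

# Pearson correlation of each numerical column with the 0/1 target
corr = toy[numerical].corrwith(toy["churn"])

print(corr)
```

A negative coefficient means that larger values of the feature go together with less churn, and a positive one means the opposite. The absolute value indicates how strong the linear relationship is.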

## One-Hot Encoding
* Use scikit-learn to encode categorical features

## Logistic Regression
* Binary Classification
* Linear vs Logistic Regression
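
The difference between the two models can be sketched in a few lines: both compute the same linear score, but logistic regression passes it through the sigmoid function so that the output lies between 0 and 1. The weights `w0` and `w` and the feature vector `x` below are made-up numbers, not fitted values.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(x, w0, w):
    """Linear regression output: bias plus weighted sum of the features."""
    return w0 + sum(wj * xj for wj, xj in zip(w, x))

# Hypothetical bias, weights and feature vector
w0 = -1.0
w = [0.5, -0.1]
x = [2.0, 10.0]

z = linear_score(x, w0, w)  # what linear regression would output
p = sigmoid(z)              # what logistic regression would output

print(f"score: {z:.2f}, probability: {p:.4f}")
```

A score of 0 maps to a probability of 0.5, positive scores map to values above 0.5, and negative scores to values below it.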

## Training Logistic Regression with Scikit-Learn
* Train a Model with Scikit-Learn
* Apply it to the Validation Dataset
* Calculate the Accuracy
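
The three steps above might look roughly like the sketch below. It runs on synthetic data generated with NumPy rather than on the Telco dataframe, so the feature, the noise model, and the 0.5 decision threshold are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: customers with short tenure tend to churn
rng = np.random.default_rng(1)
tenure = rng.integers(0, 72, size=400)
noise = rng.normal(0, 5, size=400)
y = (tenure + noise < 20).astype(int)
X = tenure.reshape(-1, 1).astype(float)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# 1. Train the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# 2. Apply it to the validation data: column 1 holds P(churn = 1)
y_pred_proba = model.predict_proba(X_val)[:, 1]

# 3. Turn probabilities into decisions and compute the accuracy
churn_decision = y_pred_proba >= 0.5
accuracy = (churn_decision == y_val).mean()

print(f"validation accuracy: {accuracy:.3f}")
```

`predict_proba` returns one column per class, which is why the second column is selected. The threshold of 0.5 is a common default, not a requirement.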

## Model Interpretation
* Look at the Coefficients
* Train a smaller Model with fewer Features

## Using the Model

## Summary
* Feature importance - risk, mutual information, correlation
* One-hot encoding can be implemented with ```DictVectorizer```
* Logistic Regression - linear model like linear regression
* Output of logistic regression - probability
* Interpretation of weights is similar to linear regression

## Explore more