# Telco Exploration

In this notebook we explore the train data set and form hypotheses based on our findings. My initial hypotheses are:

* $H\alpha$ - There is a relationship between fiber internet customers and customers who churn
* $H0$ - There is no relationship between fiber internet customers and customers who churn

In [43]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import prepare
from scipy import stats
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
print("Success")

Success


In [44]:
# Import the data from the prepare file and assign each data frame to the proper split
train, validate, test = prepare.prep_telco()
# List the column names to make sure they reflect what we expect
train.columns

Index(['customer_id', 'gender', 'senior_citizen', 'partner', 'dependents',
       'tenure', 'phone_service', 'multiple_lines', 'internet_service_type_id',
       'online_security', 'online_backup', 'device_protection', 'tech_support',
       'streaming_tv', 'streaming_movies', 'contract_type_id',
       'paperless_billing', 'payment_type_id', 'monthly_charges',
       'total_charges', 'churn', 'internet_service_type', 'payment_type',
       'contract_type', 'is_alone', 'tenure_year', 'additional_services',
       'has_streaming', 'has_internet', 'auto_pay', 'has_fiber'],
      dtype='object')

My initial hypothesis asks whether there is a relationship between customers who churn and customers who have fiber internet. Both variables are categorical, so we should use a **chi-squared test** of independence. Below we run the test and determine whether or not the two categories are related.

In [45]:
a = .05
observed = pd.crosstab(train.churn, (train['internet_service_type'] == 'Fiber optic'))
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

if p < a:
    print('Reject H0')
else:
    print("Fail to reject H0")



Observed

[[2011 1086]
 [ 345  776]]
---
Expected

[[1729.85585586 1367.14414414]
 [ 626.14414414  494.85585586]]
---

chi^2 = 388.0877
p     = 0.0000
Reject H0
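
To make the test above less of a black box, here is a minimal sketch that rebuilds the expected counts and the chi^2 statistic by hand from the observed table printed above. It assumes the 2x2 case, where `stats.chi2_contingency` applies Yates' continuity correction by default; the helper name `chi2_2x2` is hypothetical.

```python
def chi2_2x2(observed):
    """Expected counts and Yates-corrected chi^2 for a 2x2 table (illustrative helper)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Under independence, expected count = row total * column total / grand total
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    chi2 = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for o, e in zip(obs_row, exp_row):
            # Yates' correction: shrink each |observed - expected| by 0.5, never below zero
            diff = max(abs(o - e) - 0.5, 0.0)
            chi2 += diff ** 2 / e
    return expected, chi2


# Observed counts copied from the output above (rows: churn 0/1, columns: fiber False/True)
observed = [[2011, 1086], [345, 776]]
expected, chi2 = chi2_2x2(observed)
print(expected)
print(f"chi^2 = {chi2:.4f}")
```

The printed values should land close to the expected counts and chi^2 reported by the cell above.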


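We can reject the null hypothesis because the p-value is very small. A large chi^2 statistic means the observed counts differ substantially from the counts expected under independence; it does not indicate a weak relationship. Because the statistic also grows with sample size, it is not by itself a measure of how strong the relationship is, so an effect-size measure such as Cramér's V is a better guide. We should still look for other features to train our model on. Our next hypothesis will be: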

* $H\alpha$ - There is a relationship between internet customers and customers who churn
* $H0$ - There is no relationship between internet customers and customers who churn

This will also use a chi^2 test.

In [46]:
observed = pd.crosstab(train.churn, train.has_internet)
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

if p < a:
    print('Reject H0')
else:
    print("Fail to reject H0")



Observed

[[ 869 2228]
 [  62 1059]]
---
Expected

[[ 683.57207207 2413.42792793]
 [ 247.42792793  873.57207207]]
---

chi^2 = 241.5620
p     = 0.0000
Reject H0
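
Both tests reject $H0$, but a p-value does not say how strong a relationship is. One common effect-size measure for a contingency table is Cramér's V, $\sqrt{\chi^2 / (n \cdot \min(r-1, c-1))}$. The sketch below plugs in the chi^2 values printed by the two cells above; because those values include the continuity correction, treat the results as approximate. The helper name `cramers_v` is hypothetical.

```python
import math


def cramers_v(chi2, n, n_rows, n_cols):
    """Cramér's V: 0 means no association, 1 means perfect association."""
    return math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))


n = 2011 + 1086 + 345 + 776  # total customers in the train crosstab

v_fiber = cramers_v(388.0877, n, 2, 2)     # churn vs. fiber internet
v_internet = cramers_v(241.5620, n, 2, 2)  # churn vs. has_internet

print(f"fiber:    V = {v_fiber:.3f}")
print(f"internet: V = {v_internet:.3f}")
```

On these numbers the fiber relationship comes out somewhat stronger than the has_internet one.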


Again there is a relationship between internet customers and customers who churn, and the chi^2 statistic is again large. Let's test another hypothesis.

* $H\alpha$ - The average monthly charge of customers who churn is greater than the average monthly charge of all customers
* $H0$ - The average monthly charge of customers who churn is equal to the average monthly charge of all customers

Since we are comparing a continuous variable (monthly charges) against a categorical variable (churn), and we only care about one tail, we will use a **one-sample, one-tailed t-test**.

In [47]:
churn_sample = train[train.churn==1].monthly_charges
overall_mean = train.monthly_charges.mean()

t, p = stats.ttest_1samp(churn_sample, overall_mean)

print(t, p/2, a)

if p/2 > a:
    print("We fail to reject H0")
elif t < 0:
    print("We fail to reject H0")
else:
    print("We reject H0")

13.100067668543078 6.921203124772269e-37 0.05
We reject H0
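
`stats.ttest_1samp` returns a two-sided p-value by default, which is why the cell above halves `p` and also checks the sign of `t`. A minimal sketch of that decision rule as a reusable function (the name `reject_greater` is hypothetical):

```python
def reject_greater(t, p_two_sided, alpha=0.05):
    """One-tailed 'greater than' decision from a two-sided t-test result."""
    # Halving the two-sided p-value gives the one-tailed p-value only when t
    # points in the hypothesized direction, so a negative t can never reject H0.
    return t > 0 and (p_two_sided / 2) < alpha


# t and the halved p-value are copied from the output above
t = 13.100067668543078
p_two_sided = 2 * 6.921203124772269e-37

print(reject_greater(t, p_two_sided))   # same t, hypothesized direction
print(reject_greater(-t, p_two_sided))  # same magnitude, wrong direction
```

SciPy 1.6 and later also accept `alternative='greater'` in `ttest_1samp`, which returns the one-tailed p-value directly.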


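Customers who churn appear to pay more per month than the overall average. Next, let's compare customers who churn directly against customers who stay.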

* $H\alpha$ - The average monthly charge of customers who churn != the average monthly charge of customers who stay
* $H0$ - The average monthly charge of customers who churn = the average monthly charge of customers who stay

Since we are comparing a continuous variable across the two groups of a categorical variable (churn), and we are looking at both tails, we will use a **two-tailed, two-sample t-test**.

In [48]:
churn_sample = train[train.churn==1].monthly_charges
stay_sample = train[train.churn==0].monthly_charges

t, p = stats.ttest_ind(churn_sample, stay_sample, equal_var=False)

print(f"t={t}, p={p}")

# Two-tailed test: compare the full p-value to alpha; the sign of t does not matter
if p < a:
    print("We reject H0")
else:
    print("We fail to reject H0")

t=14.210657645266926, p=4.0171796958564946e-44
We reject H0
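
Passing `equal_var=False` asks SciPy for Welch's t-test, which does not assume the two groups share a variance. A minimal sketch of the statistic and its approximate degrees of freedom, using made-up toy numbers rather than the Telco data (the helper name `welch_t` is hypothetical):

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Per-group squared standard errors, using the sample (n - 1) variance
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b

    t = (statistics.mean(sample_a) - statistics.mean(sample_b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Toy monthly charges, invented for illustration only
churned = [70.0, 85.5, 99.0, 74.5, 101.0]
stayed = [20.0, 55.0, 65.5, 89.0, 45.0, 30.5]

t, df = welch_t(churned, stayed)
print(f"t = {t:.3f}, df = {df:.2f}")
```

A positive `t` here means the first group's mean is higher than the second's, matching the sign convention of `stats.ttest_ind(churn_sample, stay_sample, ...)`.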


Let's also test whether having a streaming service is related to churn.

* $H\alpha$ - There exists a relationship between customers who have a streaming service and churn
* $H0$ - There is no relationship between customers who have a streaming service and churn


This will also use a chi^2 test.

In [49]:
observed = pd.crosstab(train.churn, train['has_streaming'])
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

if p < a:
    print('Reject H0')
else:
    print("Fail to reject H0")

Observed

[[1627 1470]
 [ 476  645]]
---
Expected

[[1544.09459459 1552.90540541]
 [ 558.90540541  562.09459459]]
---

chi^2 = 33.0016
p     = 0.0000
Reject H0


Let's also test whether having additional services is related to churn.

* $H\alpha$ - There exists a relationship between customers who have additional services and churn
* $H0$ - There is no relationship between customers who have additional services and churn


This will also use a chi^2 test.

In [50]:
observed = pd.crosstab(train.churn, train.additional_services)
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

if p < a:
    print('Reject H0')
else:
    print("Fail to reject H0")

Observed

[[1197 1900]
 [ 496  625]]
---
Expected

[[1243.05855856 1853.94144144]
 [ 449.94144144  671.05855856]]
---

chi^2 = 10.4953
p     = 0.0012
Reject H0


Let's also test to see if auto pay and churn are dependent.

* $H\alpha$ - There exists a relationship between auto pay and churn
* $H0$ - There is no relationship between auto pay and churn

In [51]:
observed = pd.crosstab(train.churn, train.auto_pay)
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

if p < a:
    print('Reject H0')
else:
    print("Fail to reject H0")

Observed

[[1551 1546]
 [ 833  288]]
---
Expected

[[1750.41441441 1346.58558559]
 [ 633.58558559  487.41441441]]
---

chi^2 = 195.6140
p     = 0.0000
Reject H0


From the above statistical tests we know the following:
* There is a relationship between having internet and churn
* There is a relationship between having fiber internet and churn
* Customers who churn spend more each month than customers who don't churn
* Customers who churn spend more than the overall average
* There is a relationship between has_streaming and churn
* There is a relationship between additional services and churn
* There is a relationship between auto pay and churn

Let's begin to build a model.

In [95]:
# Baseline model: always predict the majority class (0 = did not churn)
baseline_prediction = 0
base_accuracy = (train['churn'] == baseline_prediction).mean()
base_accuracy

0.7342342342342343
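
The baseline above always predicts the most common class. A minimal sketch of the same idea on plain Python lists, using the class counts that appear in the crosstabs earlier in this notebook (3097 customers who stayed, 1121 who churned); the helper name `majority_baseline` is hypothetical:

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Rebuild the train labels from the class counts: 0 = stayed, 1 = churned
labels = [0] * 3097 + [1] * 1121

label, accuracy = majority_baseline(labels)
print(label, accuracy)
```

Any model worth keeping should beat this accuracy on the same data.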

In [96]:
train.columns

Index(['customer_id', 'gender', 'senior_citizen', 'partner', 'dependents',
       'tenure', 'phone_service', 'multiple_lines', 'internet_service_type_id',
       'online_security', 'online_backup', 'device_protection', 'tech_support',
       'streaming_tv', 'streaming_movies', 'contract_type_id',
       'paperless_billing', 'payment_type_id', 'monthly_charges',
       'total_charges', 'churn', 'internet_service_type', 'payment_type',
       'contract_type', 'is_alone', 'tenure_year', 'additional_services',
       'has_streaming', 'has_internet', 'auto_pay', 'has_fiber'],
      dtype='object')

# Modeling

1) Decision Tree

I will build a decision tree based on the above statistically significant categorical features to try to predict churn.

In [97]:
from sklearn.tree import DecisionTreeClassifier

X = train[['has_streaming', 'auto_pay', 'additional_services', 'has_internet']]
y = train[['churn']]

clf = DecisionTreeClassifier(max_depth=4, random_state=333)
clf.fit(X, y)
y_pred = clf.predict(X)
y_pred_proba = clf.predict_proba(X)


print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X, y)))
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))

Accuracy of Decision Tree classifier on training set: 0.76
[[2831  266]
 [ 726  395]]
              precision    recall  f1-score   support

           0       0.80      0.91      0.85      3097
           1       0.60      0.35      0.44      1121

    accuracy                           0.76      4218
   macro avg       0.70      0.63      0.65      4218
weighted avg       0.74      0.76      0.74      4218
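
The numbers in the classification report follow directly from the confusion matrix. A minimal sketch for the churn class (label 1), using the matrix printed above, where rows are actual classes and columns are predicted classes:

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]]
(tn, fp), (fn, tp) = [[2831, 266], [726, 395]]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the customers predicted to churn, how many did
recall = tp / (tp + fn)     # of the customers who churned, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A recall of 0.35 means this tree misses roughly two out of three customers who actually churn, so accuracy alone may overstate how useful it is for catching churn.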



2) Logistic Regression

In [104]:
from sklearn.linear_model import LogisticRegression

logit1 = LogisticRegression()

X = train[['tenure', 'monthly_charges', 'total_charges']]
y = train.churn

logit1 = logit1.fit(X, y)
y_pred = logit1.predict(X)
print('Accuracy of Logit1 classifier on training set: {:.2f}'
     .format(logit1.score(X, y)))
print(classification_report(y, y_pred))

Accuracy of Logit1 classifier on training set: 0.78
              precision    recall  f1-score   support

           0       0.82      0.91      0.86      3097
           1       0.63      0.43      0.51      1121

    accuracy                           0.78      4218
   macro avg       0.72      0.67      0.69      4218
weighted avg       0.77      0.78      0.77      4218



In [85]:
from sklearn.tree import DecisionTreeClassifier

X = train[['has_streaming', 'auto_pay', 'additional_services', 'has_fiber']]
y = train[['churn']]

clf = DecisionTreeClassifier(max_depth=4, random_state=333)
clf.fit(X, y)
y_pred = clf.predict(X)
y_pred_proba = clf.predict_proba(X)


print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X, y)))
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))

Accuracy of Decision Tree classifier on training set: 0.75
[[2973  124]
 [ 914  207]]
              precision    recall  f1-score   support

           0       0.76      0.96      0.85      3097
           1       0.63      0.18      0.29      1121

    accuracy                           0.75      4218
   macro avg       0.70      0.57      0.57      4218
weighted avg       0.73      0.75      0.70      4218



In [99]:
from sklearn.tree import DecisionTreeClassifier

X = train[['online_security', 'online_backup', 'auto_pay', 'has_internet']]
y = train[['churn']]

clf = DecisionTreeClassifier(max_depth=3, random_state=333)
clf.fit(X, y)
y_pred = clf.predict(X)
y_pred_proba = clf.predict_proba(X)


print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X, y)))
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))

Accuracy of Decision Tree classifier on training set: 0.74
[[2445  652]
 [ 445  676]]
              precision    recall  f1-score   support

           0       0.85      0.79      0.82      3097
           1       0.51      0.60      0.55      1121

    accuracy                           0.74      4218
   macro avg       0.68      0.70      0.68      4218
weighted avg       0.76      0.74      0.75      4218



In [107]:
X = validate[['tenure', 'monthly_charges', 'total_charges']]
y = validate.churn

y_pred = logit1.predict(X)
print('Accuracy of Logit1 classifier on validate set: {:.2f}'
     .format(logit1.score(X, y)))
print(classification_report(y, y_pred))

Accuracy of Logit1 classifier on validate set: 0.78
              precision    recall  f1-score   support

           0       0.81      0.91      0.86      1033
           1       0.63      0.42      0.51       374

    accuracy                           0.78      1407
   macro avg       0.72      0.67      0.68      1407
weighted avg       0.76      0.78      0.76      1407



In [108]:
X = test[['tenure', 'monthly_charges', 'total_charges']]
y = test.churn

y_pred = logit1.predict(X)
print('Accuracy of Logit1 classifier on test set: {:.2f}'
     .format(logit1.score(X, y)))
print(classification_report(y, y_pred))

Accuracy of Logit1 classifier on test set: 0.80
              precision    recall  f1-score   support

           0       0.82      0.92      0.87      1033
           1       0.67      0.46      0.54       374

    accuracy                           0.80      1407
   macro avg       0.75      0.69      0.71      1407
weighted avg       0.78      0.80      0.78      1407

