# Telco Exploration

In this notebook we will explore the data in our train data set and form hypotheses based on our findings. My initial hypothesis is:

* $H_a$ - There is a relationship between fiber internet customers and customers who churn
* $H_0$ - There is no relationship between fiber internet customers and customers who churn

In [64]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import prepare
from scipy import stats
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
print("Success")

Success


In [65]:
# Load the prepared data and unpack the train, validate, and test splits plus the full telco data frame
train, validate, test, telco = prepare.prep_telco()
# Count missing values per column to confirm there are none
telco.isnull().sum()


customer_id              0
gender                   0
senior_citizen           0
partner                  0
dependents               0
tenure                   0
phone_service            0
multiple_lines           0
online_security          0
online_backup            0
device_protection        0
tech_support             0
streaming_tv             0
streaming_movies         0
paperless_billing        0
monthly_charges          0
total_charges            0
churn                    0
internet_service_type    0
payment_type             0
contract_type            0
tenure_year              0
single_no_dependents     0
multiple_phone_lines     0
streaming                0
backedup_and_secured     0
has_internet             0
monthly_75+              0
dtype: int64

My initial hypothesis asks whether there is a relationship between customers who churn and customers who have fiber internet. Both variables are categorical, so we should use a **Chi Squared Test**. Below we run the test and determine whether or not the two categories are related.

In [3]:
a = .05
observed = pd.crosstab(telco.churn, (telco['internet_service_type'] == 'Fiber optic'))
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

if p < a:
    print('Reject H0')
else:
    print("Fail to reject H0")



Observed

[[3364 1799]
 [ 572 1297]]
---
Expected

[[2889.87030717 2273.12969283]
 [1046.12969283  822.87030717]]
---

chi^2 = 663.3565
p     = 0.0000
Reject H0
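To see where the expected counts come from, the sketch below rebuilds them by hand from the observed table printed above: under independence, each expected cell is its row total times its column total, divided by the grand total. The variable names are my own, and the counts are copied from the output rather than read from `telco`.

```python
import numpy as np

# Observed counts copied from the output above
# rows: churn = 0, 1; columns: fiber = False, True
observed = np.array([[3364, 1799],
                     [572, 1297]])

row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
n = observed.sum()                                # grand total

# Expected count under independence: (row total * column total) / n
expected_by_hand = row_totals * col_totals / n
print(expected_by_hand)
```

The result should agree with the `expected` array returned by `stats.chi2_contingency`.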


We can reject the null hypothesis because the p-value is very small. Keep in mind that the chi^2 statistic grows with sample size, so a large value signals a clear departure from independence but does not by itself say how strong the relationship is. We should also look at other features to train our model on. Our next hypothesis will be:

* $H_a$ - There is a relationship between internet customers and customers who churn
* $H_0$ - There is no relationship between internet customers and customers who churn

This will also use a chi^2 test.

In [4]:
observed = pd.crosstab(telco.churn, telco.has_internet)
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

if p < a:
    print('Reject H0')
else:
    print("Fail to reject H0")



Observed

[[1407 3756]
 [ 113 1756]]
---
Expected

[[1116.00682594 4046.99317406]
 [ 403.99317406 1465.00682594]]
---

chi^2 = 362.9478
p     = 0.0000
Reject H0
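Because the chi^2 statistic depends on the number of observations, it is hard to compare across tests or to read as a measure of strength. One common effect size for contingency tables is Cramér's V, which rescales the statistic to the range 0 to 1. The sketch below computes it for the two tables tested so far, using counts copied from the outputs above. `cramers_v` is a helper name I made up, and it reuses SciPy's default Yates continuity correction for 2x2 tables, so treat the values as approximate.

```python
import numpy as np
from scipy import stats

def cramers_v(observed):
    """Cramér's V for a contingency table (hypothetical helper, not from prepare.py)."""
    observed = np.asarray(observed)
    # For 2x2 tables chi2_contingency applies Yates' correction by default
    chi2, p, degf, expected = stats.chi2_contingency(observed)
    n = observed.sum()
    min_dim = min(observed.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))

# Counts copied from the outputs above (rows: churn = 0, 1)
fiber_table = [[3364, 1799], [572, 1297]]
internet_table = [[1407, 3756], [113, 1756]]

v_fiber = cramers_v(fiber_table)
v_internet = cramers_v(internet_table)
print(f"fiber:    V = {v_fiber:.3f}")
print(f"internet: V = {v_internet:.3f}")
```

A larger V indicates a stronger association, which gives a sample-size-independent way to rank candidate features.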


There is again a statistically significant relationship between internet customers and customers who churn, and the chi^2 statistic is again large. Let's test another hypothesis.

* $H_a$ - The average monthly charge of customers who churn is greater than the average monthly charge of all customers
* $H_0$ - The average monthly charge of customers who churn is equal to the average monthly charge of all customers

Since we are comparing a continuous variable (monthly charges) for one category of a categorical variable (churn) against a fixed mean, and we are looking at only one tail, we will use a **One Tail T-test**

In [5]:
churn_sample = telco[telco.churn==1].monthly_charges
overall_mean = telco.monthly_charges.mean()

t, p = stats.ttest_1samp(churn_sample, overall_mean)

print(t, p/2, a)

if p/2 > a:
    print("We fail to reject $H_{0}$")
elif t < 0:
    print("We fail to reject $H_{0}$")
else:
    print("We reject $H_{0}$")

16.685281538217694 1.1366940575678459e-58 0.05
We reject $H_{0}$
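Halving the two-sided p-value and checking the sign of `t` is one way to run a one-tailed test. As an alternative sketch, recent SciPy versions (1.6 and later) accept an `alternative` argument that requests the one-sided p-value directly. The data below is synthetic and the reference mean is made up; it is only meant to show that the two approaches agree when `t` is positive.

```python
import math
import numpy as np
from scipy import stats

# Synthetic monthly charges, NOT the telco data
rng = np.random.default_rng(42)
sample = rng.normal(loc=75, scale=25, size=200)
reference_mean = 65.0  # hypothetical overall mean

# Two-sided test; halving p by hand is only valid in the direction of t
t_two, p_two = stats.ttest_1samp(sample, reference_mean)

# One-sided test requested directly (requires SciPy >= 1.6)
t_one, p_one = stats.ttest_1samp(sample, reference_mean, alternative='greater')

print(t_two, p_two / 2)
print(t_one, p_one)
print(math.isclose(p_one, p_two / 2, rel_tol=1e-6))
```

Using `alternative='greater'` removes the need for the separate `t < 0` branch, since a negative `t` simply yields a large p-value.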


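Customers who churn appear to pay more per month than the overall average. Next, let's compare them directly with the customers who stay: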

* $H_a$ - The average monthly charge of customers who churn != the average monthly charge of customers who stay
* $H_0$ - The average monthly charge of customers who churn = the average monthly charge of customers who stay

Since we are comparing a continuous variable (monthly charges) between the two groups of a categorical variable (churn) and we are looking at both tails, we will use a **Two Tail T-test**

In [6]:
churn_sample = telco[telco.churn==1].monthly_charges
stay_sample = telco[telco.churn==0].monthly_charges

t, p = stats.ttest_ind(churn_sample, stay_sample, equal_var=False)

print(f"t={t}, p={p}")

# Two-tailed test: compare the two-sided p-value directly against alpha
if p < a:
    print("We reject $H_{0}$")
else:
    print("We fail to reject $H_{0}$")

t=18.174510616491737, p=4.754563808455743e-71
We reject $H_{0}$


Let's also test whether customers who spend $75/mo or more are more likely to churn than those who do not.

* $H_a$ - There is a relationship between customers who spend 75 or more and customers who churn
* $H_0$ - There is no relationship between customers who spend 75 or more and customers who churn


This will also use a chi^2 test.

In [11]:
observed = pd.crosstab(telco.churn, telco['monthly_75+'])
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

if p < a:
    print('Reject H0')
else:
    print("Fail to reject H0")

Observed

[[3122 2041]
 [ 790 1079]]
---
Expected

[[2872.24914676 2290.75085324]
 [1039.75085324  829.24914676]]
---

chi^2 = 183.4193
p     = 0.0000
Reject H0


Let's also test whether there is a relationship between customers who stream and customers who churn.

* $H_a$ - There is a relationship between customers who stream and customers who churn
* $H_0$ - There is no relationship between customers who stream and customers who churn


This will also use a chi^2 test.

In [8]:
observed = pd.crosstab(telco.churn, telco.streaming)
chi2, p, degf, expected = stats.chi2_contingency(observed)

print('Observed\n')
print(observed.values)
print('---\nExpected\n')
print(expected)
print('---\n')
print(f'chi^2 = {chi2:.4f}')
print(f'p     = {p:.4f}')

if p < a:
    print('Reject H0')
else:
    print("Fail to reject H0")

Observed

[[2729 2434]
 [ 808 1061]]
---
Expected

[[2596.91851536 2566.08148464]
 [ 940.08148464  928.91851536]]
---

chi^2 = 50.4699
p     = 0.0000
Reject H0
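The same chi-squared cell has now been repeated several times with only the crosstab changing. One way to cut the repetition is a small helper like the sketch below. `chi2_report` is a hypothetical name of my own, and the example feeds it the streaming counts copied from the output above instead of `pd.crosstab(telco.churn, telco.streaming)`.

```python
import numpy as np
from scipy import stats

def chi2_report(observed, alpha=0.05):
    """Run a chi-squared test of independence and collect the key results.

    Hypothetical helper for this notebook; `observed` is a 2-D table of counts.
    """
    observed = np.asarray(observed)
    chi2, p, degf, expected = stats.chi2_contingency(observed)
    return {
        "chi2": chi2,
        "p": p,
        "degf": degf,
        "expected": expected,
        "reject_h0": bool(p < alpha),
    }

# Streaming counts copied from the output above (rows: churn = 0, 1)
streaming_counts = [[2729, 2434], [808, 1061]]
result = chi2_report(streaming_counts)
print(f"chi^2 = {result['chi2']:.4f}")
print(f"p     = {result['p']:.4f}")
print("Reject H0" if result["reject_h0"] else "Fail to reject H0")
```

Returning a dictionary instead of printing inside the function keeps the results available for later comparison across features.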


From the above statistical tests we know the following:
* Whether a customer has internet service is related to churn
* Whether a customer has fiber is related to churn
* Customers who churn spend more each month than customers who don't churn
* Customers who churn spend more than the overall average
* There is a relationship between customers who spend 75 or more and those who churn
* There is a relationship between customers who stream and those who churn

Let's begin to build a model.

In [9]:
train['baseline'] = 0
base_accuracy = (train['churn'] == train['baseline']).mean()
base_accuracy

0.7342342342342343
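The baseline above always predicts "no churn", so its accuracy is just the share of the majority class. As a cross-check, the sketch below reproduces the same number with scikit-learn's `DummyClassifier`. The class counts (3097 stayed, 1121 churned) are copied from the support column of the training classification reports in this notebook. The placeholder feature matrix is my own, since the dummy model ignores its features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Training labels rebuilt from the reported class counts: 3097 stayed, 1121 churned
y = np.array([0] * 3097 + [1] * 1121)
X = np.zeros((len(y), 1))  # placeholder features; DummyClassifier ignores them

# Always predict the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)

baseline_accuracy = dummy.score(X, y)
print(baseline_accuracy)
```

Any model worth keeping should beat this accuracy, and ideally also recover some of the churners that the baseline misses entirely.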

# Model 1 - Decision Tree

I will build a decision tree based on the above statistically significant categorical features to try to predict churn.

In [66]:
from sklearn.tree import DecisionTreeClassifier

X = train[['streaming', 'has_internet', 'monthly_75+']]
y = train[['churn']]

clf = DecisionTreeClassifier(max_depth=5, random_state=333)
clf.fit(X, y)
y_pred = clf.predict(X)
y_pred_proba = clf.predict_proba(X)


print('Accuracy of Decision Tree classifier on training set: {:.2f}'
     .format(clf.score(X, y)))
print(confusion_matrix(y, y_pred))
print(classification_report(y, y_pred))

Accuracy of Decision Tree classifier on training set: 0.73
[[3097    0]
 [1121    0]]
              precision    recall  f1-score   support

           0       0.73      1.00      0.85      3097
           1       0.00      0.00      0.00      1121

    accuracy                           0.73      4218
   macro avg       0.37      0.50      0.42      4218
weighted avg       0.54      0.73      0.62      4218





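Model 1 is a bust because it predicts "no churn" for every customer: the confusion matrix has no predictions in the churn column, recall for churn is 0, and the 0.73 accuracy only matches the baseline.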
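One possible explanation for a tree that never predicts churn is that each leaf predicts its majority class, so if no combination of the features has a churn rate above 50%, every leaf says "no churn". The sketch below illustrates this on a tiny synthetic data set (not the telco data) and shows `class_weight="balanced"` as one option that changes the outcome. This is an assumption about what may help here, not something tested in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, NOT the telco data: one binary feature
# feature = 0: 90 stay, 10 churn; feature = 1: 60 stay, 40 churn
X = np.array([[0]] * 100 + [[1]] * 100)
y = np.array([0] * 90 + [1] * 10 + [0] * 60 + [1] * 40)

# Unweighted tree: both leaves have a "stay" majority, so churn is never predicted
plain = DecisionTreeClassifier(max_depth=5, random_state=333).fit(X, y)

# Balanced class weights up-weight the minority class when leaves vote
balanced = DecisionTreeClassifier(
    max_depth=5, random_state=333, class_weight="balanced"
).fit(X, y)

plain_churn_predictions = int(plain.predict(X).sum())
balanced_churn_predictions = int(balanced.predict(X).sum())
print(plain_churn_predictions, balanced_churn_predictions)
```

Reweighting trades accuracy on the majority class for recall on churners, so the result should be judged on recall and precision instead of accuracy alone.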

In [67]:
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(fit_intercept=True, random_state=333)

logit = logit.fit(train[['tenure']], train['churn'])
# Predict from the same feature the model was fit on (tenure), not the target column
y_pred = logit.predict(train[['tenure']])
y_pred_proba = logit.predict_proba(train[['tenure']])
print(classification_report(train.churn, y_pred))
confusion_matrix(train.churn, y_pred)

              precision    recall  f1-score   support

           0       0.73      1.00      0.85      3097
           1       0.00      0.00      0.00      1121

    accuracy                           0.73      4218
   macro avg       0.37      0.50      0.42      4218
weighted avg       0.54      0.73      0.62      4218





array([[3097,    0],
       [1121,    0]], dtype=int64)