<a href="https://colab.research.google.com/github/eolson615/SpringboardDSCareerTrack/blob/master/Capstone2_featureselection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#In this notebook I check whether any features can be removed to help improve the accuracy of the model.

##I use the chi2_contingency test to obtain a chi-squared p-value for each categorical feature. For the features that pass that test, I rerun chi2_contingency on each individual category, using a Bonferroni-adjusted p-value threshold.

##The two features flagged in this notebook, 'gender' and 'PhoneService', are removed, and the resulting data set is modeled in the notebook Capstone2_LogReg_and_RFC_usingselectedfeatures.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import mutual_info_classif

##I import the data and separate out the variable we are hoping to predict ('Churn'). Then I perform a train/test split with the same random state (56) used in the other notebooks.

In [None]:
df_url = 'https://raw.githubusercontent.com/eolson615/SpringboardDSCareerTrack/master/Capstone2/Data/telcodata_posteda.csv'
df = pd.read_csv(df_url, index_col=[0])
df.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [None]:
X = df.drop(columns='Churn')
y = df.Churn
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=56)

In [None]:
print('Xtrain', type(Xtrain), Xtrain.shape, Xtrain)
print('\n')
print("Xtest", type(Xtest), Xtest.shape, Xtest)
print('\n')
print('ytrain', type(ytrain), ytrain.shape, ytrain)
print('\n')
print("ytest", type(ytest), ytest.shape, ytest)

Xtrain <class 'pandas.core.frame.DataFrame'> (5282, 19)       gender  SeniorCitizen  ... MonthlyCharges TotalCharges
2500    Male              0  ...          75.50      4025.60
2456    Male              0  ...          24.55      1160.45
1705  Female              1  ...          90.45      5044.80
3011  Female              0  ...          19.75       210.65
5444    Male              0  ...         107.75      4882.80
...      ...            ...  ...            ...          ...
1259  Female              1  ...          95.25      4424.20
5538  Female              0  ...          81.10        81.10
3264  Female              0  ...          91.10       964.35
399   Female              0  ...          20.05       415.10
2532  Female              1  ...         101.10      4016.20

[5282 rows x 19 columns]


Xtest <class 'pandas.core.frame.DataFrame'> (1761, 19)       gender  SeniorCitizen  ... MonthlyCharges TotalCharges
6046  Female              0  ...          85.30       781.40
6494  

##Here I run the chi-squared test on the relationship between the target ('Churn') and each of the other categorical features. I then compile the results into a data frame for ease of comparison.
<br>

##For this test:
##**Null hypothesis** = there is no relationship between the feature and the variable we are looking to predict <br>
##**Alternative hypothesis** = there is a relationship between the feature and the target, so the feature should be kept when training and testing models.

##A feature is flagged when its p-value is 0.05 or higher. Based on the results, 'gender' and 'PhoneService' are flagged, so I should run a model with those two features removed to see whether that adds predictive power.
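
##As a rough illustration of how this test separates an informative feature from an uninformative one, here is a minimal sketch on made-up data. The `toy` frame, its 'Contract' and 'Coin' columns, and the `chi2_pvalue` helper are all hypothetical and are not part of the telco data set or this notebook's pipeline.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data (not the telco data): 'Contract' is built to be
# related to churn, while 'Coin' is built to be independent of it.
toy = pd.DataFrame({
    'Churn':    ['Yes'] * 40 + ['No'] * 60 + ['Yes'] * 10 + ['No'] * 90,
    'Contract': ['Monthly'] * 100 + ['Yearly'] * 100,
    'Coin':     ['Heads', 'Tails'] * 100,
})

def chi2_pvalue(target, feature):
    """Return the chi-squared p-value for the target/feature contingency table."""
    table = pd.crosstab(target, feature)
    # The result unpacks as (statistic, p-value, degrees of freedom, expected counts).
    return chi2_contingency(table)[1]

p_contract = chi2_pvalue(toy['Churn'], toy['Contract'])
p_coin = chi2_pvalue(toy['Churn'], toy['Coin'])

print('Contract p-value:', p_contract)  # small: reject the null hypothesis
print('Coin p-value:', p_coin)          # large: fail to reject the null hypothesis
```

##In this sketch 'Contract' would be kept and 'Coin' would be flagged, which is the same decision rule applied to the real features in the next cell.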

In [None]:
# Test each categorical (object-dtype) feature against the target.
cat_features = Xtrain.select_dtypes(include=['object']).columns
chi2_check = []
chi2_pvalue_list = []
for i in cat_features:
    # Index 1 of the chi2_contingency result is the p-value.
    chi2_pvalue = chi2_contingency(pd.crosstab(ytrain, Xtrain[i]))[1]
    chi2_pvalue_list.append(chi2_pvalue)
    if chi2_pvalue < 0.05:
        chi2_check.append('Reject Null Hypothesis- this is a potential relationship ')
    else:
        chi2_check.append('Fail to Reject Null Hypothesis- this category might have more noise than information')
# Collect the results in one data frame, one row per feature.
chi2_testresults = pd.DataFrame(data = [cat_features, chi2_check, chi2_pvalue_list] 
             ).T 
chi2_testresults.columns = ['Column', 'Hypothesis', 'Chi2_pvalue']
chi2_testresults

Unnamed: 0,Column,Hypothesis,Chi2_pvalue
0,gender,Fail to Reject Null Hypothesis- this category ...,0.51253
1,Partner,Reject Null Hypothesis- this is a potential re...,7.5075e-32
2,Dependents,Reject Null Hypothesis- this is a potential re...,1.45241e-35
3,PhoneService,Fail to Reject Null Hypothesis- this category ...,0.239679
4,MultipleLines,Reject Null Hypothesis- this is a potential re...,0.0126135
5,InternetService,Reject Null Hypothesis- this is a potential re...,1.6039399999999998e-123
6,OnlineSecurity,Reject Null Hypothesis- this is a potential re...,1.36957e-136
7,OnlineBackup,Reject Null Hypothesis- this is a potential re...,1.65753e-95
8,DeviceProtection,Reject Null Hypothesis- this is a potential re...,7.983219999999999e-92
9,TechSupport,Reject Null Hypothesis- this is a potential re...,2.93115e-135


##Next I break each categorical feature that passed the first test into its individual categories using dummy variables, and run each of those dummy columns through the chi2 test as well.

##Here only 2 categories (MultipleLines-No and MultipleLines-No phone service) fail to meet the p-value threshold set with the Bonferroni-adjusted method (0.05/N, where N = the number of unique category values in the original feature). Both belong to the MultipleLines feature, which has 3 categories. The 3rd category (MultipleLines-Yes) is significant, so the MultipleLines feature as a whole should not be dropped.
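
##To make the per-category check concrete, here is a minimal sketch on made-up data. The 'Plan' feature, its levels 'A', 'B' and 'C', and the churn counts are hypothetical; they are chosen so that one level matches the overall churn rate while the other two do not, similar to the MultipleLines situation described above.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical three-level feature with churn rates of 30% (A), 50% (B), 70% (C).
feature = pd.Series(['A'] * 100 + ['B'] * 100 + ['C'] * 100, name='Plan')
target = pd.Series(
    ['Yes'] * 30 + ['No'] * 70      # level A
    + ['Yes'] * 50 + ['No'] * 50    # level B
    + ['Yes'] * 70 + ['No'] * 30,   # level C
    name='Churn',
)

alpha = 0.05
# Bonferroni adjustment: divide alpha by the number of categories tested.
threshold = alpha / feature.nunique()

dummies = pd.get_dummies(feature)
results = {}
for level in dummies.columns:
    # One-vs-rest test: this level against all other levels combined.
    p = chi2_contingency(pd.crosstab(target, dummies[level]))[1]
    results[level] = bool(p < threshold)

print('threshold:', threshold)
print(results)
```

##In this sketch level 'B' has the same churn rate as the other two levels combined, so it fails the test on its own even though the feature as a whole is informative.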

In [None]:
chi2_check_dummies = {}
# Only features that were significant in the first test are broken down.
for i in chi2_testresults[chi2_testresults['Chi2_pvalue'] < 0.05]['Column']:
    dummies = pd.get_dummies(Xtrain[i])
    # Bonferroni-adjusted threshold: 0.05 divided by the number of categories.
    bon_p_value = 0.05/Xtrain[i].nunique()
    for series in dummies:
        # Test each category against all of the feature's other categories combined.
        if chi2_contingency(pd.crosstab(ytrain, dummies[series]))[1] < bon_p_value:
            chi2_check_dummies['{}-{}'.format(i, series)] = 'Reject Null Hypothesis'
        else:
            chi2_check_dummies['{}-{}'.format(i, series)] = 'Fail to Reject Null Hypothesis'
# One row per feature-category pair.
chi2_testresults_dummies = pd.DataFrame(data = [chi2_check_dummies.keys(), chi2_check_dummies.values()]).T
chi2_testresults_dummies.columns = ['Pair', 'Hypothesis']
chi2_testresults_dummies

Unnamed: 0,Pair,Hypothesis
0,Partner-No,Reject Null Hypothesis
1,Partner-Yes,Reject Null Hypothesis
2,Dependents-No,Reject Null Hypothesis
3,Dependents-Yes,Reject Null Hypothesis
4,MultipleLines-No,Fail to Reject Null Hypothesis
5,MultipleLines-No phone service,Fail to Reject Null Hypothesis
6,MultipleLines-Yes,Reject Null Hypothesis
7,InternetService-DSL,Reject Null Hypothesis
8,InternetService-Fiber optic,Reject Null Hypothesis
9,InternetService-No,Reject Null Hypothesis


##The results of this feature selection notebook are tested in the notebook Capstone2_LogReg_and_RFC_usingselectedfeatures.
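
##For reference, removing the two flagged features could look roughly like the sketch below. The `X_example` frame is a hypothetical stand-in with a few columns and values borrowed from the first two rows shown earlier; the follow-up notebook may do this step differently.

```python
import pandas as pd

# Hypothetical stand-in for the feature matrix; only a few columns are shown.
X_example = pd.DataFrame({
    'gender': ['Female', 'Male'],
    'PhoneService': ['No', 'Yes'],
    'Contract': ['Month-to-month', 'One year'],
    'MonthlyCharges': [29.85, 56.95],
})

# Features flagged by the chi-squared test in this notebook.
flagged = ['gender', 'PhoneService']

# drop() returns a new frame and leaves X_example unchanged.
X_selected = X_example.drop(columns=flagged)
print(list(X_selected.columns))
```

##The same column list would need to be dropped from both the training and test sets so that they keep matching columns.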