# D208 - Task 2 Logistic Regression for Predictive Modeling

## Part I: Research Question

### A.  Describe the purpose of this data analysis by doing the following:

#### 1.  Summarize one research question that is relevant to a real-world organizational situation captured in the data set you have selected and that you will answer using logistic regression.

#### 2.  Define the objectives or goals of the data analysis. Ensure that your objectives or goals are reasonable within the scope of the data dictionary and are represented in the available data.



## Part II: Method Justification

### B.  Describe logistic regression methods by doing the following:

#### 1.  Summarize the assumptions of a logistic regression model.

#### 2.  Describe the benefits of using the tool(s) you have chosen (i.e., Python, R, or both) in support of various phases of the analysis.

#### 3.  Explain why logistic regression is an appropriate technique to analyze the research question summarized in Part I.



## Part III: Data Preparation

### C.  Summarize the data preparation process for logistic regression by doing the following:

#### 1.  Describe your data preparation goals and the data manipulations that will be used to achieve the goals.

#### 2.  Discuss the summary statistics, including the target variable and all predictor variables that you will need to gather from the data set to answer the research question.
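
As a minimal sketch of the summary statistics this step calls for, the example below summarizes a small made-up frame. The values and the name `demo` are assumptions for illustration only; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical miniature of the churn data (values are made up for illustration)
demo = pd.DataFrame({
    'Tenure': [1.0, 5.0, 60.0, 70.0],
    'Contract': ['Month-to-month', 'Month-to-month', 'Two Year', 'One year'],
    'Churn': ['Yes', 'Yes', 'No', 'No'],
})

# count, mean, std, min, quartiles, and max for the numeric predictors
numeric_summary = demo.describe()

# share of each class in the target and in a categorical predictor
churn_share = demo['Churn'].value_counts(normalize=True)
contract_share = demo['Contract'].value_counts(normalize=True)

print(numeric_summary)
print(churn_share)
print(contract_share)
```

Checking the class shares of the target early helps when reading accuracy later, since a heavily imbalanced target makes raw accuracy look better than the model is.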

#### 3.  Explain the steps used to prepare the data for the analysis, including the annotated code.


In [41]:
# import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [42]:
# import data from csv
df = pd.read_csv('churn_clean.csv')

# set it so we can see all columns
pd.set_option('display.max_columns', None)

In [43]:
# create a dictionary of current column names mapping to desired column names
survey_dict = {'Item1':'timely_responses', 
               'Item2':'timely_fixes', 
               'Item3':'timely_replacements', 
               'Item4':'reliability', 
               'Item5':'options', 
               'Item6':'respectful_response', 
               'Item7':'courteous_exchange', 
               'Item8':'evidence_of_active_listening'}

# rename the column names based on survey_dict
df = df.rename(columns=survey_dict)

In [44]:
# change the dataframe columns to more appropriate data types
df = df.astype({'Population':int, 
                'Area':'category',
                'Children':int, 
                'Age':int,
                'Income':float, 
                'Marital':'category', 
                'Gender':'category', 
                'Churn':'string',
                'Outage_sec_perweek':float, 
                'Email':int, 
                'Contacts':int, 
                'Yearly_equip_failure':int,
                'Techie':'category', 
                'Contract':'category', 
                'Port_modem':'category', 
                'Tablet':'category', 
                'InternetService':'category',
                'Phone':'category', 
                'Multiple':'category', 
                'OnlineSecurity':'category', 
                'OnlineBackup':'category',
                'DeviceProtection':'category', 
                'TechSupport':'category', 
                'StreamingTV':'category', 
                'StreamingMovies':'category',
                'PaperlessBilling':'category', 
                'PaymentMethod':'category', 
                'Tenure':float, 
                'MonthlyCharge':float,
                'Bandwidth_GB_Year':float, 
                'timely_responses':int, 
                'timely_fixes':int, 
                'timely_replacements':int, 
                'reliability':int, 
                'options':int,
                'respectful_response':int, 
                'courteous_exchange':int, 
                'evidence_of_active_listening':int}, copy=False)

In [45]:
# keep only the columns used in the analysis, with the target (Churn) last
df = df[['Population', 'Area', 'Age', 'Gender', 'Children', 'Marital',
         'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure',
         'Techie', 'Contract', 'Port_modem', 'Tablet', 'InternetService',
         'Phone', 'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
         'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 
         'PaymentMethod', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year',
         'timely_responses', 'timely_fixes', 'timely_replacements', 'reliability',
         'options', 'respectful_response', 'courteous_exchange', 
         'evidence_of_active_listening', 'Churn']]

In [46]:
# show dataframe head
df.head()

| | Population | Area | Age | Gender | Children | Marital | Outage_sec_perweek | Email | Contacts | Yearly_equip_failure | Techie | Contract | Port_modem | Tablet | InternetService | Phone | Multiple | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | PaperlessBilling | PaymentMethod | Tenure | MonthlyCharge | Bandwidth_GB_Year | timely_responses | timely_fixes | timely_replacements | reliability | options | respectful_response | courteous_exchange | evidence_of_active_listening | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38 | Urban | 68 | Male | 0 | Widowed | 7.978323 | 10 | 0 | 1 | No | One year | Yes | Yes | Fiber Optic | Yes | No | Yes | Yes | No | No | No | Yes | Yes | Credit Card (automatic) | 6.795513 | 172.455519 | 904.53611 | 5 | 5 | 5 | 3 | 4 | 4 | 3 | 4 | No |
| 1 | 10446 | Urban | 27 | Female | 1 | Married | 11.69908 | 12 | 0 | 1 | Yes | Month-to-month | No | Yes | Fiber Optic | Yes | Yes | Yes | No | No | No | Yes | Yes | Yes | Bank Transfer(automatic) | 1.156681 | 242.632554 | 800.982766 | 3 | 4 | 3 | 3 | 4 | 3 | 4 | 4 | Yes |
| 2 | 3735 | Urban | 50 | Female | 4 | Widowed | 10.7528 | 9 | 0 | 1 | Yes | Two Year | Yes | No | DSL | Yes | Yes | No | No | No | No | No | Yes | Yes | Credit Card (automatic) | 15.754144 | 159.947583 | 2054.706961 | 4 | 4 | 2 | 4 | 4 | 3 | 3 | 3 | No |
| 3 | 13863 | Suburban | 48 | Male | 1 | Married | 14.91354 | 15 | 2 | 0 | Yes | Two Year | No | No | DSL | Yes | No | Yes | No | No | No | Yes | No | Yes | Mailed Check | 17.087227 | 119.95684 | 2164.579412 | 4 | 4 | 4 | 2 | 5 | 4 | 3 | 3 | No |
| 4 | 11352 | Suburban | 83 | Male | 0 | Separated | 8.147417 | 16 | 2 | 1 | No | Month-to-month | Yes | No | Fiber Optic | No | No | No | No | No | Yes | Yes | No | No | Mailed Check | 1.670972 | 149.948316 | 271.493436 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 5 | Yes |


#### 4.  Generate univariate and bivariate visualizations of the distributions of variables in the cleaned data set. Include the target variable in your bivariate visualizations.
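
The sketch below shows one possible univariate plot and one bivariate plot against the target, using matplotlib only. The frame `viz` and its values are made up for illustration; with the prepared data the same pattern would be repeated for each predictor in `df`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample standing in for the prepared data
viz = pd.DataFrame({
    'Tenure': [1.2, 3.5, 2.0, 65.0, 70.1, 58.3, 4.4, 61.0],
    'Churn': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No'],
})

fig, (ax_uni, ax_bi) = plt.subplots(1, 2, figsize=(10, 4))

# Univariate: distribution of a single predictor
ax_uni.hist(viz['Tenure'], bins=5)
ax_uni.set_title('Tenure (univariate)')
ax_uni.set_xlabel('Tenure')
ax_uni.set_ylabel('Count')

# Bivariate: the same predictor split by the target variable
labels = ['No', 'Yes']
groups = [viz.loc[viz['Churn'] == label, 'Tenure'].to_numpy() for label in labels]
ax_bi.boxplot(groups)
ax_bi.set_xticks([1, 2])
ax_bi.set_xticklabels(labels)
ax_bi.set_title('Tenure by Churn (bivariate)')
ax_bi.set_xlabel('Churn')
ax_bi.set_ylabel('Tenure')

fig.tight_layout()
```

For categorical predictors, a bar chart of counts per level (univariate) and a grouped bar chart of counts per level and `Churn` value (bivariate) follow the same layout.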

#### 5.  Provide a copy of the prepared data set.

## Part IV: Model Comparison and Analysis

### D.  Compare an initial and a reduced logistic regression model by doing the following:

#### 1.  Construct an initial logistic regression model from all predictors that were identified in Part C2.

In [47]:
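from sklearn.feature_selection import SelectKBest, chi2

def get_kbest(x_train, y_train):
    """Return (chi-squared score, feature name) pairs, highest score first."""
    # k='all' keeps every feature; the selector is used only to score them
    selector = SelectKBest(chi2, k='all')
    # np.ravel flattens a one-column target DataFrame into a 1-D array
    selector.fit(x_train, np.ravel(y_train))
    mask = selector.get_support()
    new_features = x_train.columns[mask]
    return pd.DataFrame(sorted(zip(selector.scores_, new_features), reverse=True))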
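
To make the ranking behind `SelectKBest(chi2, ...)` easier to read, here is a small sketch on made-up data (`X_toy`, `y_toy`, and the column names are hypothetical). It also illustrates one caveat worth keeping in mind: `chi2` treats features as non-negative counts or frequencies, so the raw score grows with the scale of the column.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: 'signal' tracks the target, 'noise' does not
X_toy = pd.DataFrame({
    'signal': [0, 0, 0, 0, 9, 9, 9, 9],
    'noise': [3, 4, 3, 4, 3, 4, 3, 4],
})
y_toy = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])

# Score every feature and sort from most to least associated with the target
selector = SelectKBest(chi2, k='all').fit(X_toy, y_toy)
toy_scores = pd.Series(selector.scores_, index=X_toy.columns).sort_values(ascending=False)
print(toy_scores)

# Same information, larger units: the chi-squared score scales with the values
X_scaled = X_toy.assign(signal=X_toy['signal'] * 10)
scaled_score = chi2(X_scaled, y_toy)[0][0]
print(scaled_score)
```

This suggests that very large scores for wide-ranging columns (for example, yearly bandwidth in GB versus a 0/1 dummy) partly reflect units rather than predictive strength alone, so the ranking is best compared among features on similar scales.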

In [48]:
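# encode the target explicitly as integers (Yes -> 1, No -> 0) so that
# get_dummies below leaves it as a single numeric column
num_churn = df.assign(Churn=df['Churn'].map({'Yes': 1, 'No': 0}).astype(int))

In [49]:
# one-hot encode every categorical predictor
dummy_df = pd.get_dummies(num_churn)

In [50]:
# move the target to the last column so it can be split off by position
ordered_cols = [col for col in dummy_df.columns if col != 'Churn'] + ['Churn']
dummy_df = dummy_df[ordered_cols]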

In [51]:
from sklearn.model_selection import train_test_split

# split into train/test sets
train, test = train_test_split(dummy_df, test_size=0.2, random_state=42)

In [52]:
# predictors are every column except the last; the target (Churn) is the last column
X_train, y_train = train.iloc[:, :-1], train.iloc[:, -1:]
X_test, y_test = test.iloc[:, :-1], test.iloc[:, -1:]

In [53]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

In [54]:
# numerical values in classification df
X_nums = ['Population', 'Age', 'Children', 'Outage_sec_perweek', 'Email', 
          'Contacts', 'Yearly_equip_failure', 'Tenure', 'MonthlyCharge',
          'Bandwidth_GB_Year', 'timely_responses', 'timely_fixes',
          'timely_replacements', 'reliability', 'options', 'respectful_response',
          'courteous_exchange', 'evidence_of_active_listening']

# scaler for numbers in classification df
transformer = ColumnTransformer(
    [('scaler', StandardScaler(), X_nums)],
    remainder='passthrough'
)

In [55]:
from sklearn.linear_model import LogisticRegression

In [56]:
from sklearn.pipeline import make_pipeline
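
The `ColumnTransformer` and `make_pipeline` imports above are not combined anywhere in this notebook, so the models below are fit on unscaled data. As a hedged sketch of how the pieces could fit together, the example below chains a scaler and a logistic regression on a made-up frame; `toy`, its values, and the `toy_*` names are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the dummy-encoded churn data
toy = pd.DataFrame({
    'Tenure': [1.0, 2.0, 3.0, 4.0, 60.0, 61.0, 62.0, 63.0],
    'MonthlyCharge': [250.0, 240.0, 230.0, 245.0, 120.0, 110.0, 115.0, 125.0],
    'Contract_Two Year': [0, 0, 0, 0, 1, 1, 1, 1],
    'Churn': [1, 1, 1, 1, 0, 0, 0, 0],
})
toy_X, toy_y = toy.drop(columns='Churn'), toy['Churn']

# Scale only the numeric columns; dummy columns pass through unchanged
toy_transformer = ColumnTransformer(
    [('scaler', StandardScaler(), ['Tenure', 'MonthlyCharge'])],
    remainder='passthrough',
)

# The pipeline fits the scaler on the training data only, then the classifier
toy_pipe = make_pipeline(toy_transformer, LogisticRegression(max_iter=400))
toy_pipe.fit(toy_X, toy_y)
toy_acc = toy_pipe.score(toy_X, toy_y)
print(toy_acc)
```

Putting the scaler inside the pipeline keeps test data out of the scaling statistics. Note that standardized features can be negative, so a scaled matrix would not be valid input for the `chi2` scoring used for feature selection.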


In [57]:
# fit the initial model on all predictors and report accuracy on the held-out test set
clf = LogisticRegression(random_state=42, solver='lbfgs', max_iter=400).fit(X_train, y_train.values.ravel())
clf.score(X_test, y_test)

0.8935

#### 2.  Justify a statistically based variable selection procedure and a model evaluation metric to reduce the initial model in a way that aligns with the research question.

In [58]:
# get kbest score for each feature
kbest = pd.DataFrame(get_kbest(X_train, y_train))
kbest[0:15]

| | 0 | 1 |
|---|---|---|
| 0 | 2154370.0 | Bandwidth_GB_Year |
| 1 | 37427.79 | Tenure |
| 2 | 25332.05 | Population |
| 3 | 11900.46 | MonthlyCharge |
| 4 | 339.4282 | StreamingMovies_Yes |
| 5 | 324.1656 | StreamingMovies_No |
| 6 | 271.8987 | Contract_Month-to-month |
| 7 | 224.4141 | StreamingTV_Yes |
| 8 | 219.7504 | StreamingTV_No |
| 9 | 191.7482 | Contract_Two Year |


In [59]:
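# top-ranked features by chi-squared score; every dummy level of each
# selected categorical variable is kept
kbest_feats = ['Bandwidth_GB_Year', 'Tenure', 'Population', 'MonthlyCharge',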
               'StreamingMovies_Yes', 'StreamingMovies_No', 'StreamingTV_Yes', 'StreamingTV_No', 
               'Contract_Month-to-month', 'Contract_One year', 'Contract_Two Year']

In [60]:
# restrict both splits to the selected features
X_train, X_test = X_train[kbest_feats], X_test[kbest_feats]


#### 3.  Provide a reduced logistic regression model.



Note: The output should include a screenshot of each model.


In [61]:
# refit on the reduced feature set and report accuracy on the same test set
clf = LogisticRegression(random_state=42, solver='lbfgs', max_iter=400).fit(X_train, y_train.values.ravel())
clf.score(X_test, y_test)

0.895

### E.  Analyze the data set using your reduced logistic regression model by doing the following:

#### 1.  Explain your data analysis process by comparing the initial and reduced logistic regression models, including the following elements:

* the logic of the variable selection technique

* the model evaluation metric

#### 2.  Provide the output and any calculations of the analysis you performed, including a confusion matrix.



Note: The output should include the predictions from the refined model you used to perform the analysis. 
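
A minimal sketch of the confusion matrix this step asks for is shown below, using made-up labels and predictions (`y_true_demo` and `y_pred_demo` are hypothetical). With the reduced model, the inputs would instead be `y_test` and `clf.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions (1 = churned, 0 = stayed)
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of correct predictions, the same metric as clf.score
accuracy = (tp + tn) / cm.sum()
print(cm)
print(accuracy)
```

Reading the four cells separately shows whether errors are mostly missed churners (false negatives) or false alarms (false positives), which accuracy alone does not reveal.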

#### 3.  Provide the code used to support the implementation of the logistic regression models.

## Part V: Data Summary and Implications

### F.  Summarize your findings and assumptions by doing the following:

#### 1.  Discuss the results of your data analysis, including the following elements:

* a regression equation for the reduced model

* an interpretation of coefficients of the statistically significant variables of the model

* the statistical and practical significance of the model

* the limitations of the data analysis
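
The regression equation and coefficient interpretation listed above can be read off a fitted scikit-learn model roughly as sketched below. The data and the `demo_*` names are made up for illustration; for the reduced model, `clf` and `kbest_feats` would take their place. One caveat: `LogisticRegression` in scikit-learn applies L2 regularization by default and does not report p-values, so judging statistical significance would need a separate tool.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data (values are made up for illustration)
demo_X = pd.DataFrame({
    'Tenure': [1, 3, 5, 8, 40, 55, 60, 70],
    'MonthlyCharge': [250, 200, 240, 210, 120, 150, 110, 130],
})
demo_y = [1, 1, 1, 0, 0, 1, 0, 0]

demo_clf = LogisticRegression(max_iter=1000).fit(demo_X, demo_y)

# exp(coefficient) is the multiplicative change in the odds of churn
# for a one-unit increase in that predictor, holding the others fixed
coef_table = pd.DataFrame({
    'coefficient': demo_clf.coef_[0],
    'odds_ratio': np.exp(demo_clf.coef_[0]),
}, index=demo_X.columns)

# Assemble the log-odds form of the regression equation as text
terms = ' + '.join(f'({b:.4f})*{name}' for name, b in zip(demo_X.columns, demo_clf.coef_[0]))
equation = f'logit(p) = {demo_clf.intercept_[0]:.4f} + {terms}'

print(coef_table)
print(equation)
```

An odds ratio below 1 means the odds of churn fall as the predictor rises, and a ratio above 1 means they rise.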

#### 2.  Recommend a course of action based on your results.



## Part VI: Demonstration

### G.  Provide a Panopto video recording that includes all of the following elements:

* a demonstration of the functionality of the code used for the analysis

* an identification of the version of the programming environment

* a comparison of the two logistic regression models you used in your analysis

* an interpretation of the coefficients
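
For the environment-version item above, one simple option is to print the interpreter and library versions from within the notebook. The choice of libraries listed here is an assumption based on the imports used in this analysis.

```python
import platform

import pandas as pd
import sklearn

# Collect the versions of the interpreter and the main libraries used
versions = {
    'python': platform.python_version(),
    'pandas': pd.__version__,
    'scikit-learn': sklearn.__version__,
}

for name, version in versions.items():
    print(f'{name}: {version}')
```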







### H.  List the web sources used to acquire data or segments of third-party code to support the application. Ensure the web sources are reliable.



### I.  Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized