## D208 Predictive Modeling PA
##### Submitted By Edwin Perry
### Table of Contents
<ol>
    <li><a href="#A">Research Question</a></li>
    <li><a href="#B">Method Justification</a></li>
    <li><a href="#C">Data Preparation</a></li>
    <li><a href="#D">Initial and Reduced Logistic Regression Model</a></li>
    <li><a href="#E">Analyzing Dataset</a></li>
    <li><a href="#F">Web Sources</a></li>
    <li><a href="#G">Sources</a></li>
</ol>

<a id="A"></a>
#### A: Research Question
##### 1. Providing the question
The research question I elected to examine is as follows: "What factors are most closely associated with a customer leaving the business within the most recent month?" This knowledge would be essential for the business, as understanding the customers' reasons for discontinuing the service can help the business take actions in the future to retain customers for longer periods of time. As acquiring customers costs money, retaining customers is essential for the business to maximize profits.


##### 2. Goals of Analysis
The goal of this analysis is to use a multiple logistic regression model to determine which factors in the telecommunications dataset (independent/explanatory variables) are associated with customers leaving the business (the dependent/target variable). If the analysis can identify these factors, customers can be retained over longer periods of time, thus reducing the necessity of acquiring new customers or allowing for a higher number of concurrent customers, resulting in higher profitability for the business.

<a id="B"></a>
#### B: Method Justification:
##### 1: Assumptions of a Multiple Logistic Regression Model
There are multiple assumptions made when one uses a multiple logistic regression model for data analysis. 
<ol>
    <li>The first of these assumptions is that the dependent variable is binary. If there are more than two possible values for the dependent variable, then binary logistic regression is not appropriate for answering the research question.</li>
    <li>Another assumption is that the independent variables are not highly correlated with each other. When they are, their individual effects on the dependent variable cannot be separated reliably. This problem is referred to as multicollinearity.</li>
    <li>A multiple logistic regression model also assumes that each observation (each row in the csv file) is independent of the others. If observations influenced one another, some of the apparent association would stem from that dependence rather than from the explanatory variables.</li>
    <li>Furthermore, multiple logistic regression models assume that the model has access to a sufficiently large sample size. Otherwise, factors that are not related may show a false relationship due to mere coincidence.</li>
</ol>
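
The multicollinearity assumption can be checked numerically with the variance inflation factor (VIF). The following is a minimal sketch on synthetic data, not the churn dataset; the helper name `vif_scores` and the generated columns are assumptions made for illustration. Unlike calling statsmodels' `variance_inflation_factor` on raw columns, this sketch adds an intercept to each auxiliary regression.

```python
import numpy as np

def vif_scores(X):
    """Return the variance inflation factor of each column of a 2-D array.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    (with an intercept) on all of the remaining columns.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    scores = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        scores.append(1.0 / (1.0 - r_squared))
    return scores

# Hypothetical predictors: `c` is nearly a copy of `a`, `b` is unrelated
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)

scores = vif_scores(np.column_stack([a, b, c]))
print([round(s, 1) for s in scores])
```

In this sketch the two nearly identical columns receive very large scores while the unrelated column stays close to 1. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.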

##### 2: Benefits of Python

The programming language I will be using for this analysis will be Python. There are a number of benefits to using Python for this analysis:
<ol>
    <li>Python's simple, easy-to-understand syntax will make the steps of this analysis easy to develop and follow</li>
    <li>Python has a number of libraries designed specifically for this type of data analysis, such as scikit-learn, pandas, NumPy, and SciPy</li>
    <li>Jupyter Notebooks make it easy to display the data, separated into sections with explanations</li>
</ol>
For these reasons, I have decided that Python is an ideal tool for this data analysis.

##### 3: Explanation/Justification of Multiple Logistic Regression as a Technique

Multiple logistic regression is a method used in data analysis to analyze the impact of multiple independent variables on one dependent variable with two possible values. This corresponds well to the research question. The dataset contains many explanatory variables that can be tested to find their impact on the dependent variable, and the dependent variable has only two possibilities: either the customer has cancelled service within the past month or they haven't. As such, multiple logistic regression is a well-suited method to answer the research question.
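
For reference, the model form behind this technique can be written as follows. The notation is the standard textbook one rather than something taken from the dataset: $p$ is the probability of the outcome coded as 1, $x_1, \dots, x_k$ are the explanatory variables, and $\beta_0, \dots, \beta_k$ are the fitted coefficients.

$$
\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\quad\Longleftrightarrow\quad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
$$

Each coefficient $\beta_j$ is the change in the log-odds of the outcome for a one-unit increase in $x_j$, holding the other variables fixed.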

<a id="C"></a>
#### C: Data Preparation
##### 1: Goals and Steps

There are a number of issues within the existing dataset. One is that zip code is stored as a float rather than a string, which leads to the loss of leading zeroes. Furthermore, many columns with binary values are stored as strings, which is less efficient to store and analyze than booleans. Finally, outliers exist within the dataset, which may mislead the analysis. The steps of the data preparation/cleaning will be as follows:
<ol>
    <li>Remove columns unnecessary to answering the research question</li>
    <li>Convert columns with binary values to booleans</li>
    <li>Remove any duplicated customer IDs in the dataset, to ensure no customer is double counted</li>
    <li>Remove any rows missing a value in a categorical column</li>
    <li>For any row missing a value in a quantitative column, replace the missing value with the median value of the column</li>
    <li>Filter out any rows whose z-score in a quantitative column is more than 3 or less than -3</li>
</ol>
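
The cleaning steps above can be sketched on a small synthetic table. This is an illustration under stated assumptions only: the `toy` DataFrame, its values, and the `clean` variable are hypothetical and stand in for the real file. The z-score uses the population standard deviation (`ddof=0`), which matches the default of `scipy.stats.zscore`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data, with one problem of each kind
rng = np.random.default_rng(1)
n = 30
toy = pd.DataFrame({
    "Customer_id": [f"C{i}" for i in range(n)],
    "Income": rng.normal(50_000, 5_000, size=n),
    "Area": rng.choice(["Rural", "Urban", "Suburban"], size=n),
})
toy.loc[0, "Income"] = 1_000_000                          # extreme outlier
toy.loc[1, "Income"] = np.nan                             # missing quantitative value
toy.loc[2, "Area"] = None                                 # missing categorical value
toy = pd.concat([toy, toy.iloc[[3]]], ignore_index=True)  # duplicated customer

# 1. Remove duplicated customer IDs
clean = toy.drop_duplicates(subset=["Customer_id"])
# 2. Remove rows missing a categorical value
clean = clean.dropna(subset=["Area"]).copy()
# 3. Replace missing quantitative values with the column median
clean["Income"] = clean["Income"].fillna(clean["Income"].median())
# 4. Keep only rows whose z-score lies strictly between -3 and 3
z = (clean["Income"] - clean["Income"].mean()) / clean["Income"].std(ddof=0)
clean = clean[z.abs() < 3]

print(len(toy), "rows before cleaning,", len(clean), "rows after")
```

The median is used for imputation because, unlike the mean, it is barely affected by the extreme value that is only filtered out in the last step.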

The goal of this data cleaning is a dataset with no outliers in any quantitative field, no null values in any field, no unnecessary columns, and all data formatted correctly and stored efficiently.

##### 2: Dependent and Independent Variables

All of the variables are summarized and explained below:

<b>Children</b>

The number of children the customer has, stored as an integer. This will be useful in determining whether those with larger sized families are likely to remain as long-term customers.


In [1]:
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

In [None]:
data = pd.read_csv("/home/edwinp/Downloads/d9rkejv84kd9rk30fi2l/churn_clean.csv")
pd.set_option("display.max_columns", None)

In [None]:
data.Children.value_counts().sort_index()

<b>Income</b>

The annual income of the customer, as reported at sign up. Stored as a float data type. Used as an independent variable

In [None]:
data.Income.describe()

<b>Area</b>

The residential area type that the customer lives in. Can be classified as rural, urban, or suburban. Stored as a string data type. Used as an independent variable

In [None]:
data.Area.value_counts().sort_index()

<b>Age</b>

The age of the customer as reported at sign up. Stored as an integer. Used as an independent variable

In [None]:
data.Age.describe()

<b>Outage</b>

The average number of seconds per week that the customer experiences an outage. Stored as a float. This will be used as an independent variable.

In [None]:
data.Outage_sec_perweek.describe()

<b>Contract</b>

The contract term of the customer, with the options being Month-to-month, One year, and Two year. Stored as a string. This will be used as an independent variable

In [None]:
data.Contract.value_counts().sort_index()

<b>Monthly Charge</b>

The amount, in dollars, that the customer is charged each month. Stored as a float data type. This is an independent variable

In [None]:
data.MonthlyCharge.describe()

<b>Marital</b>

Marital status of the customer, as reported at sign-up. Stored as a string data type, with 5 possible values: Divorced, Married, Never Married, Separated, and Widowed. Used as an independent variable

In [None]:
data.Marital.value_counts().sort_index()

<b>Gender</b>

The gender of the customer, as reported at sign-up. Stored as a string containing 3 possible values: Female, Male, and Nonbinary. Used as an independent variable to determine if gender may influence the churn of the customer.

In [None]:
data.Gender.value_counts().sort_index()

<b>Techie</b>

Whether or not the customer describes themselves as technically inclined, as reported at sign-up. Can have two possible values: Yes and No. Currently stored as a string, but will be converted to a numeric indicator. Used as an independent variable to determine whether technical inclination is associated with customer churn.

In [None]:
data.Techie.value_counts().sort_index()

<b>Tenure</b>

The number of months the customer has stayed with the provider. Stored as a float. Used as an independent variable potentially associated with churn

In [None]:
data.Tenure.describe()

In [None]:
data.Bandwidth_GB_Year.describe()

In [None]:
data.PaymentMethod.value_counts()

In [None]:
data.Churn.value_counts()

In [None]:
# Impute missing values in the quantitative columns with the column median
numeric_cols = ['Income', 'Tenure', 'MonthlyCharge', 'Outage_sec_perweek', 'Age', 'Children']
for col in numeric_cols:
    data[col] = data[col].fillna(data[col].median())
# Drop rows still missing any value, then drop duplicated customers
# (both calls return a new DataFrame, so the result must be reassigned)
data = data.dropna(how='any')
data = data.drop_duplicates(subset=['Customer_id'])
# Keep only rows whose z-score lies strictly between -3 and 3 in every quantitative column
zscores = np.abs(stats.zscore(data[numeric_cols].to_numpy(dtype=float), axis=0))
data = data[(zscores < 3).all(axis=1)]

# Drop the columns that are not needed to answer the research question
data = data.drop(['CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 'State', 'County', 'Zip', 'Lat', 'Lng',
                  'TimeZone', 'Job', 'Port_modem', 'Tablet', 'InternetService', 'Phone', 'Multiple', 'OnlineSecurity',
                  'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Email',
                  'Contacts', 'Yearly_equip_failure', 'Population', 'PaperlessBilling',
                  'Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8'], axis=1)
data.info()

In [None]:
display(data.head())

##### 3: Univariate and Bivariate Statistics

In [None]:
plt.figure(figsize= [15,5])
plt.title("Distribution of Customer Churn")
churn = data["Churn"].value_counts()
plt.pie(churn, labels=churn.index, autopct='%1.1f%%')

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Income")
plt.subplot(1,2,1)
plt.title("Distribution of Income Level")
bins = np.arange(0,275000,10000)
plt.hist(data=data, x="Income", bins=bins)
plt.xlabel("Income")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Income Level vs Churn")
sns.violinplot(data=data, x="Income", y="Churn", orient="h")
plt.xlabel("Income")
plt.ylabel("Churn")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Yearly Bandwidth (GB)")
plt.subplot(1,2,1)
plt.title("Distribution of Yearly Bandwidth (GB)")
bins = np.arange(100,7500,100)
plt.hist(data=data, x="Bandwidth_GB_Year", bins=bins)
plt.xlabel("Yearly Bandwidth (GB)")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Yearly Bandwidth (GB) vs Churn")
sns.violinplot(data=data, x="Bandwidth_GB_Year", y="Churn", orient="h")
plt.xlabel("Yearly Bandwidth (GB)")
plt.ylabel("Churn")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Monthly Charge")
plt.subplot(1,2,1)
plt.title("Distribution of Monthly Charge Level")
bins = np.arange(50,300,10)
plt.hist(data=data, x="MonthlyCharge", bins=bins)
plt.xlabel("Monthly Charge")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Monthly Charge vs Churn")
sns.violinplot(data=data, x="MonthlyCharge", y="Churn", orient = "h")
plt.xlabel("Monthly Charge")
plt.ylabel("Churn")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Age")
plt.subplot(1,2,1)
plt.title("Distribution of Age")
bins = np.arange(15,90,1)
plt.hist(data=data, x="Age", bins=bins)
plt.xlabel("Age")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Age vs Churn")
sns.violinplot(data=data, x="Age", y="Churn", orient = "h")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Outage Seconds Per Week")
plt.subplot(1,2,1)
plt.title("Distribution of Outage Seconds Per Week")
bins = np.arange(0,25,1)
plt.hist(data=data, x="Outage_sec_perweek", bins=bins)
plt.xlabel("Outage Seconds Per Week")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Outage Seconds Per Week vs Churn")
sns.violinplot(data=data, x="Outage_sec_perweek", y="Churn", orient = "h")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Children")
plt.subplot(1,2,1)
plt.title("Distribution of Children")
bins = np.arange(0,10,1)
plt.hist(data=data, x="Children", bins=bins)
plt.xlabel("Children")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Children vs Churn")
sns.countplot(data=data, x="Children", hue="Churn")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Marital Status")
plt.subplot(1,2,1)
plt.title("Distribution of Marital Status")
marital_status = data["Marital"].value_counts()
plt.pie(marital_status, labels=marital_status.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Marital Status and Churn")
sns.countplot(data = data, x="Marital", hue="Churn")
plt.xlabel("Marital Status")
plt.ylabel("Number of Customers")

In [None]:
plt.figure(figsize=[20,5])
plt.suptitle("Investigation into Payment Methods")
plt.subplot(1,2,1)
plt.title("Distribution of Payment Methods")
payment_method = data["PaymentMethod"].value_counts()
plt.pie(payment_method, labels=payment_method.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Payment Methods and Churn")
sns.countplot(data = data, x="PaymentMethod", hue="Churn")
plt.xlabel("Payment Method")
plt.ylabel("Number of Customers")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Residential Areas")
plt.subplot(1,2,1)
plt.title("Distribution of Area Types")
area_type = data["Area"].value_counts()
plt.pie(area_type, labels=area_type.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Area Type and Churn")
sns.countplot(data = data, x="Area", hue="Churn")
plt.xlabel("Area Type")
plt.ylabel("Number of Customers")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Contract Types")
plt.subplot(1,2,1)
plt.title("Distribution of Contract Types")
contract_type = data["Contract"].value_counts()
plt.pie(contract_type, labels=contract_type.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Contract Type and Churn")
sns.countplot(data = data, x="Contract", hue="Churn")
plt.xlabel("Contract Type")
plt.ylabel("Number of Customers")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Gender")
plt.subplot(1,2,1)
plt.title("Distribution of Gender")
client_gender = data["Gender"].value_counts()
plt.pie(client_gender, labels=client_gender.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Gender and Churn")
sns.countplot(data = data, x="Gender", hue="Churn")
plt.xlabel("Gender")
plt.ylabel("Number of Customers")

In [None]:
plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Technical Inclination")
plt.subplot(1,2,1)
plt.title("Distribution of Technical Inclination")
client_tech = data["Techie"].value_counts()
plt.pie(client_tech, labels=client_tech.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Technical Inclination and Churn")
sns.countplot(data = data, x="Techie", hue="Churn")
plt.xlabel("Technical Inclination")
plt.ylabel("Number of Customers")

##### 4: Data transformation goals
All columns containing categorical data still require transformation. Techie only needs its values converted from strings to 1 or 0. Every other categorical column needs to be split into multiple indicator columns, one per category. To take one example, the Contract column will be converted into a Contract_One year column, a Contract_Two Year column, and a Contract_Month-to-month column, each containing only 1 and 0 as values, for true and false.
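
As a minimal sketch of this kind of split, the snippet below one-hot encodes a tiny hypothetical `contracts` table (not the real data) with `pd.get_dummies`. It also shows the `drop_first=True` option, which keeps one fewer indicator column per category; this is one common way to avoid indicator columns that always sum to 1, and is shown only as an alternative rather than as the approach taken in this notebook.

```python
import pandas as pd

# Hypothetical four-row example using the three contract terms
contracts = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two Year", "One year"]}
)

# One indicator column per category, holding 1 where the row has that value
encoded = pd.get_dummies(contracts, columns=["Contract"], dtype=int)
print(encoded.columns.tolist())

# Optional: keep k-1 indicators; the dropped category becomes the baseline
reduced = pd.get_dummies(contracts, columns=["Contract"], dtype=int, drop_first=True)
print(reduced.columns.tolist())
```

Because every row belongs to exactly one category, the full set of indicator columns always sums to 1 across a row, which is why one column per category is redundant once the model includes a constant.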

In [None]:
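# Encode Techie as an integer. Note the direction: "Yes" becomes 0 and "No" becomes 1,
# so after one-hot encoding Techie_0 == 1 marks a customer who answered "Yes"
data["Techie"] = data["Techie"].apply(lambda x: 0 if x == "Yes" else 1)
# Mark the remaining categorical columns as categories before one-hot encoding
data['Area'] = data['Area'].astype("category")
data['Marital'] = data.Marital.astype('category')
data.Gender = data.Gender.astype('category')
data.Contract = data.Contract.astype('category')
data.PaymentMethod = data.PaymentMethod.astype('category')
# One-hot encode every categorical column into 0/1 indicator columns
data = pd.get_dummies(data, columns=['Gender', 'Area', 'Marital', 'Contract', 'PaymentMethod', 'Techie'], dtype=int)
# The target uses the same direction: Churn == 0 means the customer churned ("Yes")
data["Churn"] = data["Churn"].apply(lambda x: 0 if x == "Yes" else 1)

data.info()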

##### 5: CSV Export
See attached csv for the cleaned and transformed data

In [None]:
data.to_csv('./D208CleanedData.csv')

In [None]:
data.info()

<a id="D"></a>
#### D: Initial and Reduced Logistic Regression Model
##### 1: Initial Multiple Logistic Regression Model

In [None]:
x = data[['Gender_Male', 'Gender_Female', 'Gender_Nonbinary', 'Area_Urban', 'Area_Suburban', 'Area_Rural', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Marital_Married', 'Contract_One year', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Credit Card (automatic)', 'PaymentMethod_Bank Transfer(automatic)', 'PaymentMethod_Mailed Check','Techie_0', 'Techie_1',  'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']]
vif = pd.DataFrame()
vif["Factor"] = x.columns
vif["vif"] = [variance_inflation_factor(x.values, i) for i in range(len(x.columns))]
print(vif)

In [None]:
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Area_Suburban',  'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'PaymentMethod_Mailed Check','Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']]
vif = pd.DataFrame()
vif["Factor"] = x.columns
vif["vif"] = [variance_inflation_factor(x.values, i) for i in range(len(x.columns))]
print(vif)

In [None]:
y = data['Churn']
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Area_Suburban',  'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'PaymentMethod_Mailed Check','Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']]
x = sm.add_constant(x)
model = sm.Logit(y, x).fit()
print(model.summary())

##### 2: Justifying a Statistically Based Feature Selection Procedure
As previously established, multicollinearity can cause issues, which is why the VIF screening above dropped one indicator column from each categorical variable. To reduce the model further, we will use backward stepwise elimination: the variable with the highest p-value above a threshold of 0.05 is removed and the model is refit. This is repeated until every remaining variable has a p-value below 0.05.

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Area_Suburban',  'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'PaymentMethod_Mailed Check','Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Area_Suburban',  'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Marital_Separated',  'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Marital_Separated',  'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Marital_Separated',  'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children', 'Age', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children', 'Age', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children','MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children','MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
y = data.Churn
x = data[['Gender_Male', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

This last model is our final model for the logistic regression, as every remaining p-value is reported as 0.000, beneath the threshold of 0.05. The reduced model is far simpler at the cost of only a negligible loss of fit, as can be seen from the change in the pseudo R-squ value from 0.4874 to 0.4856.
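
For context, the pseudo R-squ value that statsmodels reports for a `Logit` fit is McFadden's pseudo R-squared. It compares the log-likelihood of the fitted model, $\ln \hat{L}_{\text{model}}$, with that of an intercept-only model, $\ln \hat{L}_{\text{null}}$:

$$
R^2_{\text{McFadden}} = 1 - \frac{\ln \hat{L}_{\text{model}}}{\ln \hat{L}_{\text{null}}}
$$

Removing predictors can never raise the fitted log-likelihood, so this value can only stay the same or fall as variables are eliminated. A small drop therefore indicates that the removed variables contributed little to the fit.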
##### 3: Reduced Logistic Regression Model
A large number of our variables were removed entirely. The removed variables are as follows:
<ul>
    <li>Area</li>
    <li>Marital</li>
    <li>Outage_sec_perweek</li>
    <li>Income</li>
    <li>Age</li>
    <li>Children</li>
</ul>
The remaining independent variables are as follows:
<ul>
    <li>Gender_Male</li>
    <li>Contract_Month-to-month</li>
    <li>PaymentMethod_Electronic Check</li>
    <li>Techie_0</li>
    <li>MonthlyCharge</li>
    <li>Bandwidth_GB_Year</li>
</ul>
All of these are associated with Churn, the only dependent variable.

In [None]:
data = data[['Churn', 'Gender_Male', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'MonthlyCharge', 'Bandwidth_GB_Year']] 
y = data.Churn
x = data[['Gender_Male', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

<a id="E"></a>
#### E: Analyzing Dataset
##### 1: Explanation of Data Analysis Procedure
The initial multiple logistic regression model had a large number of variables, the majority of which added no value to the model. After the VIF screening and the removal of variables with p-values above 0.05, the number of independent variables fell from 26 to 6. With these, we see an association with churn while limiting the risk that multicollinearity among the variables influences or biases the results.

##### 2: Output and Calculations

In [None]:
sns.violinplot(x='MonthlyCharge', y="Churn", data=data, orient = "h")

In [None]:
sns.violinplot(x='Bandwidth_GB_Year', y="Churn", orient="h", data=data)

In [None]:
plt.title("Relationship of Techie and Churn")
sns.countplot(data = data, x="Techie_0", hue="Churn")
plt.xlabel("Techie")
plt.ylabel("Count")

In [None]:
plt.title("Relationship of Payment Method and Churn")
sns.countplot(data = data, x='PaymentMethod_Electronic Check', hue="Churn")
plt.xlabel("Payment Method")
plt.ylabel("Count")

In [None]:
plt.title("Relationship of Contract and Churn")
sns.countplot(data = data, x='Contract_Month-to-month', hue="Churn")
plt.xlabel("Month-to-month")
plt.ylabel("Count")

In [None]:
plt.title("Relationship of Gender and Churn")
sns.countplot(data = data, x='Gender_Male', hue="Churn")
plt.xlabel("Male")
plt.ylabel("Count")

A copy of the code used, without markdown cells, can be found below.

In [None]:
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

data = pd.read_csv("/home/edwinp/Downloads/d9rkejv84kd9rk30fi2l/churn_clean.csv")
pd.set_option("display.max_columns", None)

# Impute missing values in the quantitative columns with the column median
numeric_cols = ['Income', 'Tenure', 'MonthlyCharge', 'Outage_sec_perweek', 'Age', 'Children']
for col in numeric_cols:
    data[col] = data[col].fillna(data[col].median())
# Drop rows still missing any value, then drop duplicated customers
# (both calls return a new DataFrame, so the result must be reassigned)
data = data.dropna(how='any')
data = data.drop_duplicates(subset=['Customer_id'])
# Keep only rows whose z-score lies strictly between -3 and 3 in every quantitative column
zscores = np.abs(stats.zscore(data[numeric_cols].to_numpy(dtype=float), axis=0))
data = data[(zscores < 3).all(axis=1)]

# Drop the columns that are not needed to answer the research question
data = data.drop(['CaseOrder', 'Customer_id', 'Interaction', 'UID', 'City', 'State', 'County', 'Zip', 'Lat', 'Lng',
                  'TimeZone', 'Job', 'Port_modem', 'Tablet', 'InternetService', 'Phone', 'Multiple', 'OnlineSecurity',
                  'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Email',
                  'Contacts', 'Yearly_equip_failure', 'Population', 'PaperlessBilling',
                  'Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8'], axis=1)
data.info()

display(data.head())

plt.figure(figsize= [15,5])
plt.title("Distribution of Customer Churn")
churn = data["Churn"].value_counts()
plt.pie(churn, labels=churn.index, autopct='%1.1f%%')

plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Income")
plt.subplot(1,2,1)
plt.title("Distribution of Income Level")
bins = np.arange(0,275000,10000)
plt.hist(data=data, x="Income", bins=bins)
plt.xlabel("Income")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Income Level vs Churn")
sns.violinplot(data=data, x="Income", y="Churn", orient="h")
plt.xlabel("Income")
plt.ylabel("Churn")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Yearly Bandwidth (GB)")
plt.subplot(1,2,1)
plt.title("Distribution of Yearly Bandwidth (GB)")
bins = np.arange(100,7500,100)
plt.hist(data=data, x="Bandwidth_GB_Year", bins=bins)
plt.xlabel("Yearly Bandwidth (GB)")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Yearly Bandwidth (GB) vs Churn")
sns.violinplot(data=data, x="Bandwidth_GB_Year", y="Churn", orient="h")
plt.xlabel("Yearly Bandwidth (GB)")
plt.ylabel("Churn")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Monthly Charge")
plt.subplot(1,2,1)
plt.title("Distribution of Monthly Charge Level")
bins = np.arange(50,300,10)
plt.hist(data=data, x="MonthlyCharge", bins=bins)
plt.xlabel("Monthly Charge")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Monthly Charge vs Churn")
sns.violinplot(data=data, x="MonthlyCharge", y="Churn", orient="h")
plt.xlabel("Monthly Charge")
plt.ylabel("Churn")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Age")
plt.subplot(1,2,1)
plt.title("Distribution of Age")
bins = np.arange(15,90,1)
plt.hist(data=data, x="Age", bins=bins)
plt.xlabel("Age")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Age vs Churn")
sns.violinplot(data=data, x="Age", y="Churn", orient = "h")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Outage Seconds Per Week")
plt.subplot(1,2,1)
plt.title("Distribution of Outage Seconds Per Week")
bins = np.arange(0,25,1)
plt.hist(data=data, x="Outage_sec_perweek", bins=bins)
plt.xlabel("Outage Seconds Per Week")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Outage Seconds Per Week vs Churn")
sns.violinplot(data=data, x="Outage_sec_perweek", y="Churn", orient = "h")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation of Children")
plt.subplot(1,2,1)
plt.title("Distribution of Children")
bins = np.arange(0,10,1)
plt.hist(data=data, x="Children", bins=bins)
plt.xlabel("Children")
plt.ylabel("Number of Customers")
plt.subplot(1,2,2)
plt.title("Children vs Churn")
sns.countplot(data=data, x="Children", hue="Churn")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Marital Status")
plt.subplot(1,2,1)
plt.title("Distribution of Marital Status")
marital_status = data["Marital"].value_counts()
plt.pie(marital_status, labels=marital_status.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Marital Status and Churn")
sns.countplot(data=data, x="Marital", hue="Churn")
plt.xlabel("Marital Status")
plt.ylabel("Number of Customers")

plt.figure(figsize=[20,5])
plt.suptitle("Investigation into Payment Methods")
plt.subplot(1,2,1)
plt.title("Distribution of Payment Methods")
payment_method = data["PaymentMethod"].value_counts()
plt.pie(payment_method, labels=payment_method.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Payment Methods and Churn")
sns.countplot(data=data, x="PaymentMethod", hue="Churn")
plt.xlabel("Payment Method")
plt.ylabel("Number of Customers")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Residential Areas")
plt.subplot(1,2,1)
plt.title("Distribution of Area Types")
area_type = data["Area"].value_counts()
plt.pie(area_type, labels=area_type.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Area Type and Churn")
sns.countplot(data=data, x="Area", hue="Churn")
plt.xlabel("Area Type")
plt.ylabel("Number of Customers")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Contract Types")
plt.subplot(1,2,1)
plt.title("Distribution of Contract Types")
contract_type = data["Contract"].value_counts()
plt.pie(contract_type, labels=contract_type.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Contract Type and Churn")
sns.countplot(data=data, x="Contract", hue="Churn")
plt.xlabel("Contract Type")
plt.ylabel("Number of Customers")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Gender")
plt.subplot(1,2,1)
plt.title("Distribution of Gender")
client_gender = data["Gender"].value_counts()
plt.pie(client_gender, labels=client_gender.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Gender and Churn")
sns.countplot(data=data, x="Gender", hue="Churn")
plt.xlabel("Gender")
plt.ylabel("Number of Customers")

plt.figure(figsize=[15,5])
plt.suptitle("Investigation into Technical Inclination")
plt.subplot(1,2,1)
plt.title("Distribution of Technical Inclination")
client_tech = data["Techie"].value_counts()
plt.pie(client_tech, labels=client_tech.index, autopct='%1.1f%%')
plt.subplot(1,2,2)
plt.title("Relationship of Technical Inclination and Churn")
sns.countplot(data=data, x="Techie", hue="Churn")
plt.xlabel("Technical Inclination")
plt.ylabel("Number of Customers")

# Note the encoding: Techie "Yes" maps to 0, so the Techie_0 dummy created below marks techies
data["Techie"] = data["Techie"].apply(lambda x: 0 if x == "Yes" else 1)
data['Area'] = data['Area'].astype("category")
data['Marital'] = data.Marital.astype('category')
data.Gender = data.Gender.astype('category')
data.Contract = data.Contract.astype('category')
data.PaymentMethod = data.PaymentMethod.astype('category')
data = pd.get_dummies(data, columns=['Gender', 'Area', 'Marital', 'Contract', 'PaymentMethod', 'Techie'], dtype=int)
# Churn "Yes" maps to 0, so the models below estimate the odds of a customer staying
data["Churn"] = data["Churn"].apply(lambda x: 0 if x == "Yes" else 1)

y = data['Churn']
x = data[['Gender_Male', 'Gender_Female', 'Gender_Nonbinary', 'Area_Urban', 'Area_Suburban', 'Area_Rural', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Marital_Married', 'Contract_One year', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Credit Card (automatic)', 'PaymentMethod_Bank Transfer(automatic)', 'PaymentMethod_Mailed Check','Techie_0', 'Techie_1',  'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']]
x = sm.add_constant(x)
model = sm.Logit(y, x).fit()
print(model.summary())

x = data[['Gender_Male', 'Gender_Female', 'Gender_Nonbinary', 'Area_Urban', 'Area_Suburban', 'Area_Rural', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Marital_Married', 'Contract_One year', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Credit Card (automatic)', 'PaymentMethod_Bank Transfer(automatic)', 'PaymentMethod_Mailed Check','Techie_0', 'Techie_1',  'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']]
vif = pd.DataFrame()
vif["Factor"] = x.columns
vif["vif"] = [variance_inflation_factor(x.values, i) for i in range(len(x.columns))]
print(vif)



y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Area_Suburban',  'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'PaymentMethod_Mailed Check','Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Area_Suburban',  'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Marital_Never Married', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'Outage_sec_perweek', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Area_Urban', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Marital_Divorced', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Gender_Female', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Marital_Separated', 'Contract_Two Year', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Marital_Separated',  'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'PaymentMethod_Bank Transfer(automatic)', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Marital_Separated',  'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children', 'Age', 'Income', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Marital_Separated',  'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children', 'Age', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children', 'Age', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Marital_Widowed', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children','MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'Children','MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

y = data.Churn
x = data[['Gender_Male', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

This last model is our final logistic regression model, as every remaining p-value is reported as 0.000, beneath the threshold of 0.05. The reduced model is considerably simpler, although the pseudo R-squ value fell from 0.487 to 0.3662, so some explanatory power was traded for a smaller set of significant predictors.
##### 3: Reduced Logistic Regression Model
Backward elimination removed a large number of our variables. The variables removed entirely were as follows:
<ul>
    <li>Area</li>
    <li>Marital</li>
    <li>Outage_sec_perweek</li>
    <li>Income</li>
    <li>Age</li>
    <li>Children</li>
</ul>
The remaining independent variables are as follows:
<ul>
    <li>Gender_Male</li>
    <li>Contract_Month-to-month</li>
    <li>PaymentMethod_Electronic Check</li>
    <li>Techie_0</li>
    <li>MonthlyCharge</li>
    <li>Bandwidth_GB_Year</li>
</ul>
All of these are used to predict the single dependent variable, Churn.

In [None]:
data = data[['Churn', 'Gender_Male', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'MonthlyCharge', 'Bandwidth_GB_Year']] 
y = data.Churn
x = data[['Gender_Male', 'Contract_Month-to-month', 'PaymentMethod_Electronic Check', 'Techie_0', 'MonthlyCharge', 'Bandwidth_GB_Year']].assign(const=1)
model = sm.Logit(y, x)
results = model.fit()
print(results.summary())

In [None]:
# Convert each fitted coefficient (log-odds) into an odds ratio
for name, coef in results.params.drop("const").items():
    print(f'The odds ratio for {name} is {round(np.exp(coef), 4)}')

<a id="F"></a>
#### F: Summarizing Findings
##### 1: Results of Analysis
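This multiple logistic regression analysis yielded the following equation, where p is the probability that a customer stays with the company (Churn was recoded so that 1 means the customer did not churn):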

$$
\ln\left(\frac{p}{1-p}\right) = 8.0615 - 2.4606(\text{Contract Month-to-month}) - 0.7659(\text{Techie 0}) - 0.3607(\text{PaymentMethod Electronic Check}) - 0.2861(\text{Gender Male}) - 0.0416(\text{MonthlyCharge}) + 0.0009(\text{Bandwidth GB Year})
$$

This can be used to conclude the following:

<ul>
    <li>With everything else constant, a month-to-month contract multiplies the odds of staying with the company by about 0.0854, a decrease of roughly 91.5%</li>
    <li>With everything else constant, the customer being a techie (Techie_0 = 1) multiplies the odds of staying with the company by about 0.4649, a decrease of roughly 53.5%</li>
    <li>With everything else constant, the customer using electronic checks for payment multiplies the odds of staying with the company by about 0.6972, a decrease of roughly 30.3%</li>
    <li>With everything else constant, the customer being male multiplies the odds of staying with the company by about 0.7512, a decrease of roughly 24.9%</li>
    <li>With everything else constant, an increase of one unit to Bandwidth_GB_Year multiplies the odds of staying with the company by about 1.0009, an increase of roughly 0.09%</li>
    <li>With everything else constant, an increase of one unit to MonthlyCharge multiplies the odds of staying with the company by about 0.9593, a decrease of roughly 4.1%</li>
</ul>
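
The arithmetic behind these interpretations can be sketched as follows. This is an illustration only: the coefficients are copied from the equation above, while the helper `probability_of_staying` and the example customer values are hypothetical.

```python
import math

# Coefficients copied from the fitted equation (log-odds of staying).
INTERCEPT = 8.0615
coefficients = {
    "Contract_Month-to-month": -2.4606,
    "Techie_0": -0.7659,
    "PaymentMethod_Electronic Check": -0.3607,
    "Gender_Male": -0.2861,
    "MonthlyCharge": -0.0416,
    "Bandwidth_GB_Year": 0.0009,
}

# An odds ratio is exp(coefficient); (ratio - 1) * 100 is the percent
# change in the odds of staying for a one-unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, ratio in odds_ratios.items():
    change = (ratio - 1) * 100
    print(f"{name}: odds ratio {ratio:.4f} ({change:+.2f}% change in odds)")


def probability_of_staying(customer):
    """Plug a customer's values into the equation and invert the logit."""
    log_odds = INTERCEPT + sum(coefficients[k] * v for k, v in customer.items())
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical customer, invented for illustration only.
example_customer = {
    "Contract_Month-to-month": 1,
    "Techie_0": 0,
    "PaymentMethod_Electronic Check": 1,
    "Gender_Male": 1,
    "MonthlyCharge": 170.0,
    "Bandwidth_GB_Year": 3000.0,
}
p_stay = probability_of_staying(example_customer)
print(f"Estimated probability of staying: {p_stay:.3f}")
```

An odds ratio below 1 lowers the odds of staying and one above 1 raises them. The change applies to the odds, not directly to the probability.
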
The LLR p-value is reported as 0.000, which indicates that the model as a whole fits the data significantly better than an intercept-only model. As such, this model can be used, to some extent, to drive decision making. However, there are still some concerns that ought to be noted:
<ul>
    <li>Limited size of cleaned dataset: There are only 7584 entries in the dataset once it has been cleaned. Ideally, there would be a larger dataset to analyze, to ensure that the data is reliable</li>
    <li>Limited depth of analysis: for the categorical variables we began with, we were only able to evaluate one of the resultant columns, due to multicollinearity. Unfortunately, this limits the extent to which data can be analyzed. For example, we can compare month-to-month contracts with all others, but we can't compare month-to-month vs one year vs two year to see how each one influences the data</li>
</ul>
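
One common way to ease the second limitation in a future analysis is reference (baseline) coding, where one dummy column per categorical variable is dropped so that the remaining levels are each compared against it. A minimal sketch, assuming a tiny hypothetical sample whose labels mirror the Contract column:

```python
import pandas as pd

# Hypothetical mini-sample; only the category labels come from the dataset.
sample = pd.DataFrame(
    {"Contract": ["Month-to-month", "One year", "Two Year", "Month-to-month"]}
)

# drop_first=True removes the first level in sorted order (here
# "Month-to-month"), which becomes the baseline for the other two.
encoded = pd.get_dummies(sample, columns=["Contract"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

With this coding, a coefficient on `Contract_One year` or `Contract_Two Year` would be read relative to month-to-month customers, and the full set of dummies is no longer perfectly collinear with the constant.
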

##### 2: Recommended Course of Action
The fitted model points to several actions for retaining the maximum number of customers. Month-to-month contracts show by far the strongest association with leaving, so encouraging customers to move to longer-term contracts is likely to be the most effective step. Self-described techies, customers who pay by electronic check, and customers with higher monthly charges are also less likely to stay, so targeted retention offers, promotion of other payment methods, and a review of pricing on high-charge plans are worth considering. Higher yearly bandwidth usage is associated with slightly better retention. However, because the per-unit effects of the continuous variables are small and these results describe associations rather than proven causes, it would also be recommended to engage in deeper analysis with more variables that might influence the churn rate more.

<a id="G"></a>
#### G: Panopto Video
A Panopto video recording of my code in action can be found at the following link:
https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=2cc3360a-7574-4af8-b858-b189016ac623


<a id="H"></a>
#### H: Code References
<a href="https://www.w3resource.com/python-exercises/pandas/missing-values/python-pandas-missing-values-exercise-14.php">Pandas information</a> used for cleaning the data and excluding outliers.
<a href="https://towardsdatascience.com/feature-selection-techniques-in-regression-model-26878fe0e24e">Feature Selection Techniques in Regression Model: Ashutosh Tripathi</a> used for the backwards stepwise elimination.

<a id="I"></a>
#### I: Source References
<a href="https://search.ebscohost.com/login.aspx?direct=true&db=nlebk&AN=2091371&site=eds-live&scope=site&authtype=sso&custid=ns017578&ebv=EB&ppid=pp_9">Chantal D. Larose, Daniel T. Larose: Data Science Using Python and R</a> used to understand Variance Inflation Factor and how to analyze/improve the model.