## D212 Data Mining 2 PA 2
##### Submitted By Edwin Perry
### Table of Contents
<ol>
    <li><a href="#A">Research Question</a></li>
    <li><a href="#B">Technique Justification</a></li>
    <li><a href="#C">Data Preparation</a></li>
    <li><a href="#D">Analysis</a></li>
    <li><a href="#E">Data Summary and Implications</a></li>
    <li><a href="#F">Panopto Video</a></li>
</ol>


<h4 id="A">Research Question</h4>
<h5>Question</h5>
<p>The research question for this analysis is "Can customer churn be predicted by a decision tree after Principal Component Analysis (PCA) is used to reduce dimensionality?" This question helps to drive the success of the business, as it enables the business to understand customer sentiment and likelihood of losing a customer, potentially helping the business to reduce customers lost and prepare earlier for new customer acquisition</p>
<h5>Goal of Analysis</h5>
<p>The goal of the analysis is to create a model that can accurately predict customer churn. The telecommunications industry has a high customer acquisition cost, which means identifying customers that are likely to churn in the future allow the business to take actions to maximize revenue, ideally by finding a way to keep customers, as the cost to acquire a new customer is usually 10 times the cost to retain a customer.</p>
<h4 id="B">Technique Justification</h4>
<h5>Method Explanation</h5>
Principal Component Analysis is a useful technique in data analysis. Specifically, PCA allows an analyst to reduce the number of dimensions being considered by a model, allowing the analyst to extract composite features. These composite features allow for different dimensions to be considered as part of one larger dimension that can be more useful at deriving insights within the analysis. My anticipated outcome of PCA would be a significant reduction to the number of dimensions being considered, using the explained_variance_ratio attribute, which would allow me to have a more streamlined and effective decision tree than the analysis would otherwise be able to obtain</p>
<h5>Assumptions of PCA</h5>
<p>One of the assumptions that I must make for PCA to be valid is that there are no outliers (or that the impact of outliers in minimal). PCA uses eigenvectors to determine principal components, and as such, a single or handful of extreme values can lead to misleading results. As such, we will be filtering outliers in our analysis, to ensure that the PCA can be validly performed.</p>
<h4 id="C">Data Preparation</h4>
<h5>Continuous Variable Identification</h5>
<p>The PCA dimension reduction will be performed specifically on continuous variables, though any quantifiable data is valid, as the rubric specifically states to perform this analysis on continuous variables. The ones relevant for the analysis are as follows:
<ul>
<li>Tenure: The number of months that the customer has been a customer of the telecommunications company</li>
<li>Income: The annual income in dollars of the customer</li>
<li>Bandwidth_gb_year: The amount of data a customer uses in a year</li>
<li>Outage_sec_perweek: The average number of seconds the customer's neighborhood experiences on a weekly basis</li>
</ul>
</p>
<h5>Standardization</h5>
<p>Before the data can be used in the analysis, there are certain steps required to prepare the data, including standardizing the values from the continuous columns. The following code is the entirety of the process used to prepare this data:</p>

In [3]:
import pandas as pd
from pandas.api.types import CategoricalDtype
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from scipy import stats


In [4]:
data = pd.read_csv("./Data Source/churn_clean.csv")
pd.set_option("display.max_columns", None)
print(data.head())
data['Income'].fillna(data['Income'].median(), inplace=True)
data["Tenure"].fillna(data["Tenure"].median(), inplace=True)
data["MonthlyCharge"].fillna(data["MonthlyCharge"].median(), inplace=True)
data["Email"].fillna(data["Email"].median(), inplace=True)
data["Contacts"].fillna(data["Contacts"].median(), inplace=True)
data["Age"].fillna(data["Age"].median(), inplace=True)
data["Bandwidth_GB_Year"].fillna(data["Bandwidth_GB_Year"].median(), inplace=True)
data.dropna(how='any',inplace=True)
zscores = stats.zscore(data['Income'])
data['IncomeZScore'] = zscores
zscores = stats.zscore(data['Tenure'])
data['TenureZScore'] = zscores
zscores = stats.zscore(data['MonthlyCharge'])
data['MonthlyChargeZScore'] = zscores
zscores = stats.zscore(data['Email'])
data['EmailZScore'] = zscores
zscores = stats.zscore(data['Contacts'])
data['ContactsZScore'] = zscores
zscores = stats.zscore(data['Age'])
data['AgeZScore'] = zscores
zscores = stats.zscore(data['Bandwidth_GB_Year'])
data['Bandwidth_GB_YearZScore'] = zscores
data = data.query("(IncomeZScore < 3 & IncomeZScore > -3) & (TenureZScore < 3 & TenureZScore > -3) & (MonthlyChargeZScore < 3 & MonthlyChargeZScore > -3) & (Bandwidth_GB_YearZScore < 3 & Bandwidth_GB_YearZScore > -3) & (EmailZScore < 3 & EmailZScore > -3) &  (AgeZScore < 3 & AgeZScore > -3) & (ContactsZScore < 3 & ContactsZScore > -3)")
data = data.drop(columns=['IncomeZScore', 'TenureZScore', 'MonthlyChargeZScore', 'EmailZScore', 'ContactsZScore', 'AgeZScore', 'Bandwidth_GB_YearZScore', 'Customer_id'])


   CaseOrder Customer_id                           Interaction  \
0          1     K409198  aa90260b-4141-4a24-8e36-b04ce1f4f77b   
1          2     S120509  fb76459f-c047-4a9d-8af9-e0f7d4ac2524   
2          3     K191035  344d114c-3736-4be5-98f7-c72c281e2d35   
3          4      D90850  abfa2b40-2d43-4994-b15a-989b8c79e311   
4          5     K662701  68a861fd-0d20-4e51-a587-8a90407ee574   

                                UID         City State                 County  \
0  e885b299883d4f9fb18e39c75155d990  Point Baker    AK  Prince of Wales-Hyder   
1  f2de8bef964785f41a2959829830fb8a  West Branch    MI                 Ogemaw   
2  f1784cfa9f6d92ae816197eb175d3c71      Yamhill    OR                Yamhill   
3  dc8a365077241bb5cd5ccd305136b05e      Del Mar    CA              San Diego   
4  aabb64a116e83fdc4befc1fbab1663f9    Needville    TX              Fort Bend   

     Zip       Lat        Lng  Population      Area             TimeZone  \
0  99927  56.25100 -133.37571          3

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data['Income'].fillna(data['Income'].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  data["Tenure"].fillna(data["Tenure"].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on