#### Chelsey De Dios

# D209 Task 1 Classification Analysis

## Part I: Research Question

### A.  Describe the purpose of this data mining report by doing the following:

#### 1.  Propose one question relevant to a real-world organizational situation that you will answer using one of the following classification methods:

* k-nearest neighbor (KNN)

* Naive Bayes

The question we will be posing is whether we can correctly classify customers as either churn or no churn (leaving the company or not) based on other data about the customer, using k-nearest neighbors as the classifier.

#### 2.  Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.

One goal of this analysis will be to use k-nearest neighbors to correctly classify most customers as either churning or not.
 

## Part II: Method Justification

### B.  Explain the reasons for your chosen classification method from part A1 by doing the following:

#### 1.  Explain how the classification method you chose analyzes the selected data set. Include expected outcomes.

K-nearest neighbors starts with a set of data points whose classes are already known, in this case Yes and No for churn. When given a new, unclassified data point, the algorithm measures its distance to the previously classified points and finds the k closest ones, which are its nearest 'neighbors'. The value of k is the number of neighbors passed to the algorithm. For example, with 3 neighbors, it finds the 3 labeled data points closest to the new one and assigns the new point the class held by the majority of those 3 neighbors.

The expected outcome would be for the classifier to be able to identify a new customer entry as being either a churn or no churn customer based on its proximity to other similar customers.
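
The neighbor vote described above can be sketched in a few lines. This is only a minimal illustration, assuming Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy points are hypothetical and are not part of the analysis, which uses scikit-learn's `KNeighborsClassifier`.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, new_point, k=3):
    """Classify new_point by majority vote among its k nearest training points."""
    # Euclidean distance from the new point to every labeled point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # keep the k closest labeled points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # the most common label among those neighbors wins
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# toy data (made up): two features per customer
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ['No', 'No', 'No', 'Yes', 'Yes', 'Yes']

print(knn_predict(points, labels, (1.1, 1.0), k=3))
print(knn_predict(points, labels, (4.9, 5.0), k=3))
```

With these toy points, the first query sits inside the 'No' cluster and the second inside the 'Yes' cluster, so each takes the label of the cluster around it.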

#### 2.  Summarize one assumption of the chosen classification method.

K-nearest neighbors assumes that data points belonging to the same category lie close together in the feature space, so that a new point's nearest neighbors are likely to share its category.

#### 3.  List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

a. Pandas allows us to work with the data through dataframes which allow for various simple data manipulations.

b. Numpy allows us to work with arrays of data, and is needed for some Pandas manipulations.

c. seaborn and matplotlib pyplot allow us to create visualizations easily so we can look at our data graphically.

d. sklearn allows us to use their machine learning algorithms in a black box manner, and to transform our data to work with our model.

## Part III: Data Preparation

### C.  Perform data preparation for the chosen data set by doing the following:

#### 1.  Describe one data preprocessing goal relevant to the classification method from part A1.

Categorical data will be encoded as dummy variables so that every feature is numeric, which the distance-based algorithm requires.

#### 2.  Identify the initial data set variables that you will use to perform the analysis for the classification question from part A1, and classify each variable as continuous or categorical.

This is performed below.

#### 3.  Explain each of the steps used to prepare the data for the analysis. Identify the code segment for each step.


In [2]:
# import data from csv
df = pd.read_csv('churn_clean.csv')

# set it so we can see all columns
pd.set_option('display.max_columns', None)

##### a. Change Column Names

It is useful to rename the columns that have non-descriptive names so that the variables are easier to identify.

In [3]:
# create a dictionary of current column names mapping to desired column names
survey_dict = {'Item1':'timely_responses', 
               'Item2':'timely_fixes', 
               'Item3':'timely_replacements', 
               'Item4':'reliability', 
               'Item5':'options', 
               'Item6':'respectful_response', 
               'Item7':'courteous_exchange', 
               'Item8':'evidence_of_active_listening'}

# rename the column names based on survey_dict
df = df.rename(columns=survey_dict)

##### b. Change Data Types

Now we will change the datatypes of our columns by passing a dictionary to df.astype mapping our column names to their new typing. We will do this because models will recognize the variable's datatype and deal with data appropriately.

In [4]:
# change the dataframe columns to more appropriate data types
df = df.astype({'Population':int, 
                'Area':'category',
                'Children':int, 
                'Age':int,
                'Income':float, 
                'Marital':'category', 
                'Gender':'category', 
                'Churn':'category',
                'Outage_sec_perweek':float, 
                'Email':int, 
                'Contacts':int, 
                'Yearly_equip_failure':int,
                'Techie':'category', 
                'Contract':'category', 
                'Port_modem':'category', 
                'Tablet':'category', 
                'InternetService':'category',
                'Phone':'category', 
                'Multiple':'category', 
                'OnlineSecurity':'category', 
                'OnlineBackup':'category',
                'DeviceProtection':'category', 
                'TechSupport':'category', 
                'StreamingTV':'category', 
                'StreamingMovies':'category',
                'PaperlessBilling':'category', 
                'PaymentMethod':'category', 
                'Tenure':float, 
                'MonthlyCharge':float,
                'Bandwidth_GB_Year':float, 
                'timely_responses':int, 
                'timely_fixes':int, 
                'timely_replacements':int, 
                'reliability':int, 
                'options':int,
                'respectful_response':int, 
                'courteous_exchange':int, 
                'evidence_of_active_listening':int}, copy=False)

In [5]:
# subset the dataframe to relevant variables
df = df[['Population', 'Area', 'Age', 'Gender', 'Children', 'Marital', 'Income',
         'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure',
         'Techie', 'Contract', 'Port_modem', 'Tablet', 'InternetService',
         'Phone', 'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
         'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 
         'PaymentMethod', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year',
         'timely_responses', 'timely_fixes', 'timely_replacements', 'reliability',
         'options', 'respectful_response', 'courteous_exchange', 
         'evidence_of_active_listening', 'Churn']]

In [6]:
# create dataframe of variables with classification of categorical or numeric
print('Variable and DataType')
# is_numeric_dtype covers both int and float columns
types = pd.DataFrame(['numeric' if pd.api.types.is_numeric_dtype(df[i])
                      else 'categorical' for i in df.columns], df.columns, columns=['DataType'])
types

Variable and DataType

Variable,DataType
Population,numeric
Area,categorical
Age,numeric
Gender,categorical
Children,numeric
Marital,categorical
Income,numeric
Outage_sec_perweek,numeric
Email,numeric
Contacts,numeric


The table above lists the variables used in this analysis and their data types.

##### c. Get dummy variables for categorical data

Here we will first replace all binary Yes/No values with 1's and 0's. Then, using pd.get_dummies, we will create dummy (one-hot encoded) variables so that our remaining categorical data is numeric and can work with our model.

In [7]:
# get a list of columns with churn at the end
ordered_cols = [i for i in df.columns if i != 'Churn'] + ['Churn']

# reorder columns to get target variable last
ordered_df = df[ordered_cols]

In [8]:
# replace all Yes/No values with 1 and 0 in all columns except target
dummy_df = ordered_df[ordered_df.columns[:-1]].replace({'Yes':1, 'No':0})

In [9]:
# get dummy values for dataframe
dummy_df = pd.get_dummies(dummy_df)

# append churn to dummy_df
dummy_df['Churn'] = ordered_df['Churn']
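
As a small illustration of what the two encoding steps above produce, the sketch below applies them to a made-up three-row frame. The rows are invented for the example; only the column names `Techie` and `Contract` are borrowed from the data set.

```python
import pandas as pd

# toy frame (made up) with one binary and one multi-level categorical column
toy = pd.DataFrame({
    'Techie': ['Yes', 'No', 'Yes'],
    'Contract': ['Month-to-month', 'One year', 'Two Year'],
})

# map the binary column to 1/0, then one-hot encode what is still text
toy['Techie'] = toy['Techie'].map({'Yes': 1, 'No': 0})
encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())
```

The binary column stays a single 1/0 column, while the three-level column becomes three indicator columns named after its levels.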

In [10]:
# get numeric columns list
num_cols = set(df.select_dtypes(include='number').columns)

# get categorical column list and remove target
cat_cols = set(df.columns) - num_cols
cat_cols.remove('Churn')

# get categorical values in dummy_df
dummy_cats = list(set(dummy_df.columns) - num_cols)

In [11]:
# change categorical data back to category
dummy_df[dummy_cats] = dummy_df[dummy_cats].astype('category')

##### d. Scale Numerical Data

Now we will use sklearn's StandardScaler to scale our numeric data so nothing is improperly weighted.

In [12]:
# scale numeric data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

dummy_df[list(num_cols)] = scaler.fit_transform(dummy_df[list(num_cols)])
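
Scaling matters for k-nearest neighbors because distances would otherwise be dominated by the features with the largest ranges. The sketch below, on made-up values, shows that `StandardScaler` is equivalent to subtracting each column's mean and dividing by its (population) standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy columns (made up): income in dollars and number of children
X = np.array([[20000.0, 0.0],
              [40000.0, 2.0],
              [60000.0, 4.0]])

scaled = StandardScaler().fit_transform(X)

# same result by hand: subtract the column mean, divide by the column std
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, both toy columns have mean 0 and standard deviation 1, so neither outweighs the other in a distance calculation.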

##### e. Reorder Columns for Target Variable

Next we will reorder our columns to put our target variable 'Churn' at the end.

In [13]:
# order columns to get target variable to the end
ordered_cols = [i for i in dummy_df.columns if i != 'Churn'] + ['Churn']
dummy_df = dummy_df[ordered_cols]

#### 4.  Provide a copy of the cleaned data set.

In [14]:
# export cleaned data to csv
df.to_csv('t1_data')

## Part IV: Analysis

### D.  Perform the data analysis and report on the results by doing the following:

#### 1.  Split the data into training and test data sets and provide the file(s).

In [15]:
from sklearn.model_selection import train_test_split

# split into train/test sets
train, test = train_test_split(dummy_df, test_size=0.3, random_state=42)
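
To make the split proportions concrete, here is a toy version of the same call on a made-up 10-row frame. With `test_size=0.3`, 30% of the rows go to the test set, which for the 10,000-row data set corresponds to 7,000 training rows and 3,000 test rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# toy frame (made up) with 10 rows
toy = pd.DataFrame({'feature': range(10), 'Churn': ['No'] * 7 + ['Yes'] * 3})

# same arguments as the split used in the analysis
train_toy, test_toy = train_test_split(toy, test_size=0.3, random_state=42)

print(len(train_toy), len(test_toy))
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell yields the same rows in each set.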

In [16]:
# export training data to csv
train.to_csv('train.csv')

In [17]:
# export test data to csv
test.to_csv('test.csv')

In [18]:
# split data into explanatory and target variables
X_train, y_train, X_test, y_test = train.iloc[:,0:-1], train.iloc[:,-1:], test.iloc[:,0:-1], test.iloc[:,-1:]

#### 2.  Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you performed.

In [20]:
# fit data to algorithm
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X_train, y_train.values.ravel())

# get prediction for x test
pred = neigh.predict(X_test)

# get accuracy score
neigh.score(X_test, y_test)

0.8123333333333334

The score for this classification is around .812, which is the accuracy score for our test set. The accuracy is the fraction of correct predictions out of the total predictions made by the model.
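
As a quick check of that definition, the accuracy can be reproduced by hand from the four counts in the confusion matrix reported later in this section: the two diagonal counts are the correct predictions and all four counts together are the total.

```python
# counts taken from the confusion matrix reported for this model
correct = 1928 + 509                # diagonal entries: predictions that match the true label
total = 1928 + 228 + 335 + 509      # all test-set predictions

accuracy = correct / total
print(round(accuracy, 4))
```

This gives 2437 correct out of 3000, which agrees with the score returned by `neigh.score`.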

In [21]:
# get roc_auc_score
from sklearn.metrics import roc_auc_score
neigh_prob = neigh.predict_proba(X_test)
roc_auc_score(y_test, neigh_prob[:,1])

0.8303101561606978

Our ROC AUC score above is around .83.

In [22]:
# create a confusion matrix (rows are true labels, columns are predictions)
from sklearn.metrics import confusion_matrix
con_matrix = pd.DataFrame(confusion_matrix(y_test, pred, labels=['No', 'Yes']),
                          index=['Churn_No_True', 'Churn_Yes_True'],
                          columns=['Churn_No_Pred', 'Churn_Yes_Pred'])
con_matrix

,Churn_No_Pred,Churn_Yes_Pred
Churn_No_True,1928,228
Churn_Yes_True,335,509


In the above confusion matrix, treating churn (Yes) as the positive class, we can see the results of our classification: 1928 true negatives, 228 false positives, 335 false negatives and 509 true positives.
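
Because the orientation of a confusion matrix is easy to mix up, the sketch below uses a handful of made-up labels to show scikit-learn's convention: rows are the true classes, columns are the predicted classes, and classes appear in the order given by `labels`.

```python
from sklearn.metrics import confusion_matrix

# toy labels (made up)
y_true = ['No', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes']
y_pred = ['No', 'No', 'No', 'Yes', 'No', 'No', 'Yes']

# rows = true class, columns = predicted class, in the order No, Yes
cm = confusion_matrix(y_true, y_pred, labels=['No', 'Yes'])

# with 'Yes' as the positive class, ravel() reads off tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```

Here the single 'No' customer predicted as 'Yes' is the false positive, and the two 'Yes' customers predicted as 'No' are the false negatives.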

#### 3.  Provide the code used to perform the classification analysis from part D2.

The code for the classification is above.

## Part V: Data Summary and Implications

### E.  Summarize your data analysis by doing the following:

#### 1.  Explain the accuracy and the area under the curve (AUC) of your classification model.

The accuracy score for the knn classification was just over .812. The accuracy is the fraction of correct predictions vs the total predictions for the classification model.

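The ROC curve plots the true positive rate against the false positive rate at every possible classification threshold. The AUC is the area under that curve, and it summarizes how well the model ranks churning customers above non-churning customers: 0.5 is no better than chance and 1.0 is perfect separation. Our ROC AUC is about .83, which means that a randomly chosen churning customer receives a higher predicted churn probability than a randomly chosen non-churning customer about 83% of the time (counting ties as half).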
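
One way to build intuition for the AUC is its ranking interpretation: it equals the fraction of (churner, non-churner) pairs in which the churner receives the higher score, with ties counted as half. The sketch below checks this on made-up labels and scores against `roc_auc_score`; the values are invented for illustration only.

```python
from itertools import product
from sklearn.metrics import roc_auc_score

# toy labels and predicted churn probabilities (made up)
y_true = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.7, 0.35, 0.8]

positives = [s for s, y in zip(scores, y_true) if y == 1]
negatives = [s for s, y in zip(scores, y_true) if y == 0]

# fraction of (positive, negative) pairs where the positive scores higher;
# ties count as half
pairs = list(product(positives, negatives))
pairwise_auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

print(pairwise_auc, roc_auc_score(y_true, scores))
```

In this toy case 4 of the 6 pairs are ranked correctly, and both calculations agree.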

#### 2.  Discuss the results and implications of your classification analysis.

The results of this analysis suggest that it is possible to classify customers as churning or not from the inputs in this data with over 80% accuracy, making this a potentially useful model. This also suggests that this data is useful for predicting customer churn.

#### 3.  Discuss one limitation of your data analysis.

One limitation of this analysis is that there are only 10,000 entries of customer data, and we do not know the true size of the customer population, so we cannot judge how representative the sample is or whether this model is truly as valuable as the scoring suggests.

#### 4.  Recommend a course of action for the real-world organizational situation from part A1 based on your results and implications discussed in part E2.

It would be good to use this model to identify customers who have not yet churned but whom the model classifies as likely to churn. After these customers are identified, it would be beneficial to focus retention efforts on them.
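
A hedged sketch of how such a retention list could be produced: fit the classifier, score the customers who have not churned, and sort them by estimated churn probability. Everything here is made up for illustration (the toy values and the customer ids `cust_a`, `cust_b`, `cust_c`); in practice the fitted model and the prepared feature columns from this analysis would take their place.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# toy training data (made up): two scaled features and a churn label
X_toy = pd.DataFrame({'Tenure': [-1.0, -0.8, -0.9, 1.0, 0.9, 1.1],
                      'MonthlyCharge': [1.0, 0.9, 1.2, -1.0, -0.8, -1.1]})
y_toy = ['Yes', 'Yes', 'Yes', 'No', 'No', 'No']

model = KNeighborsClassifier(n_neighbors=3).fit(X_toy, y_toy)

# current customers (made up) who have not churned yet
current = pd.DataFrame({'Tenure': [-0.7, 1.2, 0.6],
                        'MonthlyCharge': [0.8, -0.9, -0.5]},
                       index=['cust_a', 'cust_b', 'cust_c'])

# column order of predict_proba follows model.classes_
yes_col = list(model.classes_).index('Yes')
current['churn_prob'] = model.predict_proba(current)[:, yes_col]

# highest estimated churn risk first
print(current.sort_values('churn_prob', ascending=False))
```

Sorting by probability rather than using only the Yes/No prediction lets a retention team work from the highest-risk customers downward.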
 

## Part VI: Demonstration

### F.  Provide a Panopto video recording that includes a demonstration of the functionality of the code used for the analysis and a summary of the programming environment.

https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=3757504a-78d6-49bf-98bf-addb016dfeee

### G.  Record the web sources used to acquire data or segments of third-party code to support the analysis. Ensure the web sources are reliable.

N/A

### H.  Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.

N/A