#### Chelsey De Dios

# D209 Task 2: Predictive Analysis

## Part I: Research Question

### A.  Describe the purpose of this data mining report by doing the following:

#### 1.  Propose one question relevant to a real-world organizational situation that you will answer using one of the following prediction methods:

* decision trees

* random forests

* advanced regression (i.e., lasso or ridge regression)

In this analysis we ask: can a customer's tenure be predicted from other data about that customer using random forest regression?

#### 2.  Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.

The goal of this analysis is to predict customer tenure based on other data points regarding the customer, such as demographic and service-related data.
 

## Part II: Method Justification

### B.  Explain the reasons for your chosen prediction method from part A1 by doing the following:

#### 1.  Explain how the prediction method you chose analyzes the selected data set. Include expected outcomes.

Random forest regression builds many decision trees, each fit to a bootstrap sample of the training data. Within a tree, each node considers candidate splits on the explanatory variables, calculates the sum of squared errors between the actual target values and the predicted values on each side of the split, and keeps the split with the lowest sum of squared errors. This process repeats until the tree is fully grown, and the forest's prediction is the average of the predictions of its trees.

The expected outcome in this case is that the algorithm predicts a customer's tenure from the other variables based on the results of these calculations.
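
The split search inside a single tree can be illustrated with a minimal sketch. It assumes one explanatory variable and made-up toy values; `best_split` is a hypothetical helper written for illustration, not part of sklearn, and a real forest repeats this search over many variables, nodes, and trees.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the single split on x with the lowest total SSE."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    # candidate thresholds are midpoints between consecutive distinct x values
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue
        left, right = y_sorted[:i], y_sorted[i:]
        # each side predicts its own mean, so SSE is measured around that mean
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold = (x_sorted[i] + x_sorted[i - 1]) / 2
            best_sse = sse
    return best_threshold, best_sse

# toy data (hypothetical values): the target jumps once x passes 3
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([10.0, 11.0, 10.5, 30.0, 31.0, 29.5])

threshold, sse = best_split(x, y)
print(threshold, sse)
```

On this toy data the chosen threshold falls between 3 and 4, where the target values change level.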

#### 2.  Summarize one assumption of the chosen prediction method.

This model assumes that for each explanatory variable there is a 'best split' that will eventually lead to the correct identification of the target variable.

#### 3.  List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.

In [1]:
# import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

a. Pandas allows us to work with the data through dataframes which allow for various simple data manipulations.

b. NumPy allows us to work with arrays of data, and is needed for some Pandas manipulations.

c. seaborn and matplotlib pyplot allow us to create visualizations easily so we can look at our data graphically.

d. sklearn (imported in later cells, where it is used) provides machine learning algorithms that we can use in a black box manner, as well as tools to transform our data to work with our model.

## Part III: Data Preparation

### C.  Perform data preparation for the chosen data set by doing the following:

#### 1.  Describe one data preprocessing goal relevant to the prediction method from part A1.

Categorical data will be encoded as dummy variables so that every input is numeric, which the algorithm requires.
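
As a small illustration of this goal, the sketch below encodes a toy frame. The column names mirror the churn data, but the rows are made up for the example.

```python
import pandas as pd

# hypothetical toy frame; the rows are invented for illustration
toy = pd.DataFrame({
    'Contract': ['Month-to-month', 'One year', 'Two Year'],
    'Techie': ['Yes', 'No', 'Yes'],
    'Age': [34, 51, 27],
})

# map the binary Yes/No column to 1/0
toy['Techie'] = toy['Techie'].map({'Yes': 1, 'No': 0})

# one-hot encode the remaining text column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['Contract'])
print(encoded.columns.tolist())
```

Each level of `Contract` becomes its own indicator column, so the model only ever sees numbers.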

#### 2.  Identify the initial data set variables that you will use to perform the analysis for the prediction question from part A1, and group each variable as continuous or categorical. 

This is performed below.

#### 3.  Explain the steps used to prepare the data for the analysis. Identify the code segment for each step.



In [2]:
# import data from csv
df = pd.read_csv('churn_clean.csv')

# set it so we can see all columns
pd.set_option('display.max_columns', None)

##### a. Change Column Names

It is useful to rename columns whose names are not descriptive, such as the survey items Item1 through Item8.

In [3]:
# create a dictionary of current column names mapping to desired column names
survey_dict = {'Item1':'timely_responses', 
               'Item2':'timely_fixes', 
               'Item3':'timely_replacements', 
               'Item4':'reliability', 
               'Item5':'options', 
               'Item6':'respectful_response', 
               'Item7':'courteous_exchange', 
               'Item8':'evidence_of_active_listening'}

# rename the column names based on survey_dict
df = df.rename(columns=survey_dict)

##### b. Change Data Types

Now we will change the datatypes of our columns by passing a dictionary to df.astype mapping our column names to their new typing. We will do this because models will recognize the variable's datatype and deal with data appropriately.

In [4]:
# change the dataframe columns to more appropriate data types
df = df.astype({'Population':int, 
                'Area':'category',
                'Children':int, 
                'Age':int,
                'Income':float, 
                'Marital':'category', 
                'Gender':'category', 
                'Churn':'category',
                'Outage_sec_perweek':float, 
                'Email':int, 
                'Contacts':int, 
                'Yearly_equip_failure':int,
                'Techie':'category', 
                'Contract':'category', 
                'Port_modem':'category', 
                'Tablet':'category', 
                'InternetService':'category',
                'Phone':'category', 
                'Multiple':'category', 
                'OnlineSecurity':'category', 
                'OnlineBackup':'category',
                'DeviceProtection':'category', 
                'TechSupport':'category', 
                'StreamingTV':'category', 
                'StreamingMovies':'category',
                'PaperlessBilling':'category', 
                'PaymentMethod':'category', 
                'Tenure':float, 
                'MonthlyCharge':float,
                'Bandwidth_GB_Year':float, 
                'timely_responses':int, 
                'timely_fixes':int, 
                'timely_replacements':int, 
                'reliability':int, 
                'options':int,
                'respectful_response':int, 
                'courteous_exchange':int, 
                'evidence_of_active_listening':int}, copy=False)

In [5]:
# subset the dataframe to relevant variables
df = df[['Population', 'Area', 'Age', 'Gender', 'Children', 'Marital', 'Income',
         'Outage_sec_perweek', 'Email', 'Contacts', 'Yearly_equip_failure',
         'Techie', 'Contract', 'Port_modem', 'Tablet', 'InternetService',
         'Phone', 'Multiple', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
         'TechSupport', 'StreamingTV', 'StreamingMovies', 'PaperlessBilling', 
         'PaymentMethod', 'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year',
         'timely_responses', 'timely_fixes', 'timely_replacements', 'reliability',
         'options', 'respectful_response', 'courteous_exchange', 
         'evidence_of_active_listening', 'Churn']]

In [6]:
# create dataframe of variables with classification of categorical or numeric
# note: `(int or float)` evaluates to just `int`, so float columns would be
# mislabeled as categorical; is_numeric_dtype covers both int and float
print('Variable and DataType')
types = pd.DataFrame(['numeric' if pd.api.types.is_numeric_dtype(df[i])
                      else 'categorical' for i in df.columns], df.columns, columns=['DataType'])
types

Variable and DataType


Variable,DataType
Population,numeric
Area,categorical
Age,numeric
Gender,categorical
Children,numeric
Marital,categorical
Income,numeric
Outage_sec_perweek,numeric
Email,numeric
Contacts,numeric


The variables used in this analysis and their data types are listed above.

##### c. Get dummy variables for categorical data

Here we will first replace all binary Yes/No values with 1's and 0's. Then, using pd.get_dummies, we will create dummy (one-hot encoded) variables to make our categorical data numeric so it works with our model.

In [7]:
# get a list of columns with target at the end
ordered_cols = [i for i in df.columns if i != 'Tenure'] + ['Tenure']

# reorder columns to get target variable last
ordered_df = df[ordered_cols]

In [8]:
# replace all Yes/No values with 1 and 0 in all columns except the target
dummy_df = ordered_df[ordered_df.columns[:-1]].replace({'Yes':1, 'No':0})

In [9]:
# get dummy values for dataframe
dummy_df = pd.get_dummies(dummy_df)

# append the target, Tenure, to dummy_df
dummy_df['Tenure'] = ordered_df['Tenure']

In [10]:
# get the set of numeric datatype columns (this includes the target, Tenure)
num_cols = set(df.select_dtypes(include='number').columns)

# get the set of categorical datatype columns, excluding Churn
cat_cols = set(df.columns) - num_cols
cat_cols.remove('Churn')

# get categorical dtype columns in dummy_df
dummy_cats = list(set(dummy_df.columns) - num_cols)

In [11]:
# change categorical data back to category
dummy_df[dummy_cats] = dummy_df[dummy_cats].astype('category')

##### d. Scale Numerical Data

Now we will use sklearn's StandardScaler to scale our numeric data so nothing is improperly weighted. Because Tenure is one of the numeric columns, the target is standardized as well.

In [12]:
# scale numeric data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

dummy_df[list(num_cols)] = scaler.fit_transform(dummy_df[list(num_cols)])
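
To make concrete what StandardScaler does to each numeric column, here is a minimal sketch on made-up values. It compares the scaler's output with the same calculation done by hand; the input numbers are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical income-like values in a single column
values = np.array([[20000.0], [40000.0], [60000.0], [80000.0]])

scaled = StandardScaler().fit_transform(values)

# the same transformation by hand: subtract the column mean,
# then divide by the population standard deviation (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

After scaling, the column has a mean of about 0 and a standard deviation of about 1, so any error measured on a scaled column is expressed in standard deviations rather than in the original units.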

##### e. Reorder Columns for Target Variable

Next we will reorder our columns to put our target variable 'Tenure' at the end.

In [13]:
# order columns to get target variable to the end
ordered_cols = [i for i in dummy_df.columns if i != 'Tenure'] + ['Tenure']
dummy_df = dummy_df[ordered_cols]

#### 4.  Provide a copy of the cleaned data set.

In [14]:
df.to_csv('t2_data.csv')

## Part IV: Analysis

### D.  Perform the data analysis and report on the results by doing the following:

#### 1.  Split the data into training and test data sets and provide the file(s).

In [15]:
from sklearn.model_selection import train_test_split

# split into train/test sets
train, test = train_test_split(dummy_df, test_size=0.3, random_state=42)

In [16]:
# export training data to csv
train.to_csv('train.csv')

In [17]:
# export test data to csv
test.to_csv('test.csv')

In [18]:
# split data into explanatory and target variables
X_train, y_train, X_test, y_test = train.iloc[:,0:-1], train.iloc[:,-1:], test.iloc[:,0:-1], test.iloc[:,-1:]

#### 2.  Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you performed.

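The first calculation is the model's score. For a regressor such as RandomForestRegressor, score returns the coefficient of determination (R²), which is the proportion of the variance in the target that the model explains, rather than a fraction of correctly identified records. Here it is calculated on the training data.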

In [19]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(random_state=42).fit(X_train, y_train.values.ravel())
rfr.score(X_train, y_train)

0.9997191402856261

Our score (R²) on the training data is about 0.9997, very nearly 1.
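
For a scikit-learn regressor, `score` returns the coefficient of determination R². A minimal sketch of that calculation on made-up numbers, checked against sklearn's `r2_score`, may help in reading the value above; the arrays are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score

# hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

# residual sum of squares: squared distance between truth and prediction
ss_res = ((y_true - y_hat) ** 2).sum()
# total sum of squares: squared distance between truth and its mean
ss_tot = ((y_true - y_true.mean()) ** 2).sum()

r2 = 1 - ss_res / ss_tot
print(r2, r2_score(y_true, y_hat))
```

A value near 1 means the predictions leave very little of the target's variance unexplained on the data being scored.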

The second calculation will be the mean squared error, which is the average of the squared errors, where each error is how far the model's prediction is from the true value. Here it is calculated on the test data.

In [20]:
from sklearn.metrics import mean_squared_error
y_pred = rfr.predict(X_test)
mean_squared_error(y_test, y_pred)

0.0019252786024908436

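The mean squared error on the test data is about 0.0019. Because Tenure was standardized during data preparation, this value is in standardized units rather than in the original units of Tenure.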
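
The sketch below shows the mean squared error calculated by hand and one way to express an error on a standardized target in original units. The arrays and `target_std` are hypothetical placeholders, not values taken from this dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# hypothetical standardized targets and predictions
y_true = np.array([-1.0, 0.0, 0.5, 1.5])
y_hat = np.array([-0.9, 0.1, 0.4, 1.7])

# mean of the squared differences; matches sklearn's mean_squared_error
mse = np.mean((y_true - y_hat) ** 2)
print(mse, mean_squared_error(y_true, y_hat))

# assumed standard deviation of the unscaled target (placeholder value);
# the root of the MSE times this value gives a typical error in original units
target_std = 10.0
rmse_original_units = np.sqrt(mse) * target_std
print(rmse_original_units)
```

In practice the standard deviation would come from the fitted scaler or from the unscaled target column.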

#### 3.  Provide the code used to perform the prediction analysis from part D2.

The code used for the prediction analysis is provided in cells In [19] and In [20] above.

## Part V: Data Summary and Implications

### E.  Summarize your data analysis by doing the following:

#### 1.  Explain the accuracy and the mean squared error (MSE) of your prediction model.

The accuracy score of the model, which for a regressor is the R² returned by score, is about 0.9997 on the training data, which is almost perfect. This could be due to the relationship between Tenure and Bandwidth_GB_Year that we discovered in previous exercises. The MSE on the test data is about 0.0019, which is very low and consistent with the high score. Because Tenure was standardized during data preparation, the MSE is in standardized units. A score this high means that the average of the squared errors, or how far our predictions deviate from the true values, is very small.
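
One way to check whether a single variable such as Bandwidth_GB_Year drives the result is to inspect the fitted forest's `feature_importances_`. The sketch below uses synthetic stand-in data, not the churn dataset; the variable names, the coefficient 80, and the noise levels are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 500

# synthetic stand-ins (assumed relationships, not taken from the churn data)
tenure = rng.uniform(1, 72, n)
bandwidth = 80 * tenure + rng.normal(0, 50, n)   # strongly tied to the target
income = rng.normal(40000, 10000, n)             # unrelated to the target

X = np.column_stack([bandwidth, income])
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, tenure)

# importances sum to 1; a dominant feature takes most of that total
for name, importance in zip(['bandwidth', 'income'], model.feature_importances_):
    print(name, round(importance, 3))
```

On the real data, the same attribute on the fitted `rfr` paired with `X_train.columns` would show whether one column accounts for most of the model's predictive power.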

#### 2.  Discuss the results and implications of your prediction analysis.

The result of this analysis is a near-perfect model of customer tenure based on other datapoints. If this were a real dataset, it would mean that this model would almost always correctly identify a customer's tenure based on the other data collected about them located in this dataset. 

#### 3.  Discuss one limitation of your data analysis.

One limitation of this analysis is that there are only 10,000 entries of customer data, and we do not know the true size of the customer population, so we cannot judge whether this model is truly as valuable as the scoring suggests.

#### 4.  Recommend a course of action for the real-world organizational situation from part A1 based on your results and implications discussed in part E2.

Based on the findings of this predictive analysis, it is possible to estimate a customer's tenure from the other information about them in this dataset. It would be good to use this model to examine what customers with a higher tenure have in common, and then to figure out what causes some customers to stay and others to go.

## Part VI: Demonstration

### F.  Provide a Panopto video recording that includes a demonstration of the functionality of the code used for the analysis and a summary of the programming environment.

https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=a9e9b773-a451-451d-8d31-addb01800d9b

### G.  Record the web sources used to acquire data or segments of third-party code to support the analysis. Ensure the web sources are reliable.
 
N/A

### H.  Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.

N/A