# D209 - Data Mining 1 Performance Assessment Task 2
Aidan Soares, 012042436, Western Governors University

### A1: Research Question
For this task, my research question is "Can I effectively predict customer Tenure via decision trees using the data available?". I will be using the Churn dataset for this assessment. Tenure is a foundation of sustainable, long-term financial success for telecommunication companies. The longer customers remain subscribed to one company, the more likely it is that the company has developed a strong service model and high customer satisfaction, and that it will grow naturally through online user reviews and word-of-mouth recommendations. This growth is ideal in an industry where competition is fierce and companies undercut one another with loss-leading service packages, since customers can easily be persuaded to drop one company for another with a simple phone call.

By predicting customer tenure from the collected data, these companies can assess the longevity of their customer relationships based on existing features and identify whether current service packages need improvement to raise customer satisfaction. This should ultimately support effective resource allocation by indicating when the company may be in danger of customers ending their tenure.

### A2: Goal
The question I pose is similar in nature to the one used in D208's Task 1, where I attempted to develop a regression model that could accurately establish relationships between the independent variables and the dependent variable Tenure. I found that the multiple linear regression model I created did not conform to the assumptions necessary to deem it acceptable and applicable in a corporate environment. My goal for this assessment is to try again, this time using decision trees to build a machine learning model that can help the company predict its customers' tenure and generate useful insights into the variables that are negatively impacting tenure.

### B1: Classification Method
The method I have chosen is a decision tree. Because Tenure is a continuous variable, I use the regression form of the algorithm (scikit-learn's DecisionTreeRegressor). A decision tree consists of nodes arranged hierarchically that sort data points as they are passed down the tree. Each node tests some attribute, such as "Does the customer have online protection?" or "How many children does the customer have?", and splits the data into two or more sub-nodes so that each sub-node is as homogeneous (similar) as possible. The algorithm repeats these tests recursively until the stopping criteria have been met (Chauhan, 2022). The final node on each branch pathway is referred to as a leaf node, and it holds the tenure predicted for the customers who reach it. Using this method, I am aiming to build a model that can take a customer's collected data and estimate their tenure; the company can then use these predictions in cost-benefit analyses of how to increase customer retention.
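To make the idea of a homogeneity-based split concrete, here is a minimal sketch of how a regression tree could choose a single threshold on one feature by minimizing the weighted variance of the two resulting sub-nodes. The `best_split` helper and the toy `bandwidth`/`tenure` values are hypothetical illustrations, not part of the notebook or of scikit-learn's internals.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_variance) of the best single split on x.

    Hypothetical helper: tries the midpoint between each pair of adjacent
    sorted feature values and keeps the one whose two sub-nodes have the
    lowest size-weighted variance of the target.
    """
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate identical feature values
        left, right = y_sorted[:i], y_sorted[i:]
        score = (len(left) * left.var() + len(right) * right.var()) / len(y_sorted)
        if score < best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_score = score
    return best_threshold, best_score

# Toy data (made up): yearly bandwidth in GB and tenure in months
bandwidth = np.array([300.0, 500.0, 900.0, 4500.0, 4800.0, 5200.0])
tenure = np.array([2.0, 3.0, 6.0, 50.0, 55.0, 60.0])

threshold, score = best_split(bandwidth, tenure)
print(threshold, score)
```

In this toy case the chosen threshold separates the three short-tenure customers from the three long-tenure customers. A full tree repeats the same search over every feature and then recursively inside each sub-node.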

### B2: Decision Tree Assumption
The major assumption of decision trees is that, as a non-parametric algorithm, they do not require the data to follow any specific distribution. This means the data used with this algorithm is not restricted to any one form: a decision tree can handle linear and non-linear relationships as well as categorical and ordinal data (Datacamp, 2023). This flexibility allows model building to begin with minimal cleaning, and it lets the tree handle a wide variety of data with modest computational requirements and solid accuracy.

### B3: Packages/Libraries
For this assessment I will be using Python. This language was chosen due to its ability to handle a wide variety of data analysis processes and large sets of data, and for its ease of use in data transformation, cleaning, and visualization. Many Python packages are tailored to specific tasks with intuitive naming conventions, which makes my analysis more efficient.

The packages I have chosen to apply for my assessment, as well as their purpose are as follows:
- Pandas: for importing my dataset into a dataframe and manipulating its data, columns, and datatypes.
- NumPy: for conducting mathematical operations and array manipulation.
- Matplotlib & Seaborn: for visualizing distributions, as they offer multiple chart types.
- scikit-learn: provides several tools necessary to build my decision tree model, elaborated on below.
- sklearn train_test_split and GridSearchCV: train_test_split is used to split the dataset between training and testing data, and GridSearchCV is used to identify the hyperparameter values that maximize performance.
- sklearn DecisionTreeRegressor: used to generate the decision tree.
- several sklearn metrics functions, such as r2_score and mean_squared_error, that are used to evaluate my decision tree's performance. accuracy_score is also imported, but it applies to classification models and is not used for this regression tree.

In [145]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import r2_score, accuracy_score, mean_squared_error
from sklearn.tree import DecisionTreeRegressor, plot_tree

#importing dataset into a dataframe
df = pd.read_csv('churn_clean.csv', index_col=0)
df.head()

| CaseOrder | Customer_id | Interaction | UID | City | State | County | Zip | Lat | Lng | Population | ... | MonthlyCharge | Bandwidth_GB_Year | Item1 | Item2 | Item3 | Item4 | Item5 | Item6 | Item7 | Item8 |
| :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 1 | K409198 | aa90260b-4141-4a24-8e36-b04ce1f4f77b | e885b299883d4f9fb18e39c75155d990 | Point Baker | AK | Prince of Wales-Hyder | 99927 | 56.251 | -133.37571 | 38 | ... | 172.455519 | 904.53611 | 5 | 5 | 5 | 3 | 4 | 4 | 3 | 4 |
| 2 | S120509 | fb76459f-c047-4a9d-8af9-e0f7d4ac2524 | f2de8bef964785f41a2959829830fb8a | West Branch | MI | Ogemaw | 48661 | 44.32893 | -84.2408 | 10446 | ... | 242.632554 | 800.982766 | 3 | 4 | 3 | 3 | 4 | 3 | 4 | 4 |
| 3 | K191035 | 344d114c-3736-4be5-98f7-c72c281e2d35 | f1784cfa9f6d92ae816197eb175d3c71 | Yamhill | OR | Yamhill | 97148 | 45.35589 | -123.24657 | 3735 | ... | 159.947583 | 2054.706961 | 4 | 4 | 2 | 4 | 4 | 3 | 3 | 3 |
| 4 | D90850 | abfa2b40-2d43-4994-b15a-989b8c79e311 | dc8a365077241bb5cd5ccd305136b05e | Del Mar | CA | San Diego | 92014 | 32.96687 | -117.24798 | 13863 | ... | 119.95684 | 2164.579412 | 4 | 4 | 4 | 2 | 5 | 4 | 3 | 3 |
| 5 | K662701 | 68a861fd-0d20-4e51-a587-8a90407ee574 | aabb64a116e83fdc4befc1fbab1663f9 | Needville | TX | Fort Bend | 77461 | 29.38012 | -95.80673 | 11352 | ... | 149.948316 | 271.493436 | 4 | 4 | 4 | 3 | 4 | 4 | 4 | 5 |


### C1: Data Preprocessing
While I stated that decision tree algorithms can deal with many kinds of data, scikit-learn's implementation still requires numeric inputs, so my categorical independent variables must be encoded. Thus, to reiterate from my previous task's paper, categorical data will be re-expressed as 1 and 0 so that it may be used within my analysis. For binary categorical variables, I will re-express their 'Yes' and 'No' answers as 1 and 0. For categorical variables with more than 2 responses, I will perform one-hot encoding to generate dummy columns in the same 1/0 format, dropping the first column of each to prevent any potential problems with multicollinearity from arising. I will largely be re-using the independent variables that remained in the final multiple linear regression model from my D208 Task 1 paper, as the variables isolated during that analysis showed the most promise of a statistically significant relationship with my dependent variable.

### C2: Dataset Variables
| Variable Name | Numeric/Categorical | Dependent/Independent |
| :- | :- | :- |
| Tenure | Numeric | *Dependent* |
| Children | Numeric | Independent |
| Age | Numeric | Independent |
| Outage | Numeric | Independent |
| Monthly Charge | Numeric | Independent |
| Bandwidth | Numeric | Independent |
| Contract | Categorical | Independent |
| Internet Service | Categorical | Independent |
| Online Security | Categorical | Independent |

### C3: Analysis Steps
First, as I do with all my assessments, I will use the .info() and .duplicated() methods to identify any null values or duplicates that need to be imputed or removed.

In [146]:
#checking for null values in dataframe
print(df.info())

#checking for duplicate values in dataframe
print(df.duplicated().value_counts())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 1 to 10000
Data columns (total 49 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Customer_id           10000 non-null  object 
 1   Interaction           10000 non-null  object 
 2   UID                   10000 non-null  object 
 3   City                  10000 non-null  object 
 4   State                 10000 non-null  object 
 5   County                10000 non-null  object 
 6   Zip                   10000 non-null  int64  
 7   Lat                   10000 non-null  float64
 8   Lng                   10000 non-null  float64
 9   Population            10000 non-null  int64  
 10  Area                  10000 non-null  object 
 11  TimeZone              10000 non-null  object 
 12  Job                   10000 non-null  object 
 13  Children              10000 non-null  int64  
 14  Age                   10000 non-null  int64  
 15  Income             

As there are no null values or duplicates within my dataset, no imputations will be required for any entry.

Next comes the re-expression of all my categorical variables into numeric data: converting binary answers into 1 and 0 format, and converting multi-category variables into dummy variables, making sure to drop the first column of each to prevent any issues with perfect multicollinearity.

In [147]:
#creating dataframe to store data from C2, taking an explicit copy so later edits do not touch df
clean_df = df[["Tenure", "Children", "Age", "Outage_sec_perweek", "Contract", "InternetService", "MonthlyCharge", "Bandwidth_GB_Year"]].copy()

#inserting online security column after re-expressing its Yes/No answers as 1/0
clean_df.insert(6, "OnlineSecurity", df["OnlineSecurity"].replace({"Yes": 1, "No": 0}))

#creating dummy variables as 0/1 integers, dropping the first column from each
clean_df = pd.get_dummies(clean_df, columns=["Contract", "InternetService"], drop_first=True, dtype=int)

#printing dataframe to see if information has been updated
clean_df.head()

| CaseOrder | Tenure | Children | Age | Outage_sec_perweek | OnlineSecurity | MonthlyCharge | Bandwidth_GB_Year | Contract_One year | Contract_Two Year | InternetService_Fiber Optic | InternetService_None |
| :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 1 | 6.795513 | 0 | 68 | 7.978323 | 1 | 172.455519 | 904.53611 | 1 | 0 | 1 | 0 |
| 2 | 1.156681 | 1 | 27 | 11.69908 | 1 | 242.632554 | 800.982766 | 0 | 0 | 1 | 0 |
| 3 | 15.754144 | 4 | 50 | 10.7528 | 0 | 159.947583 | 2054.706961 | 0 | 1 | 0 | 0 |
| 4 | 17.087227 | 1 | 48 | 14.91354 | 1 | 119.95684 | 2164.579412 | 0 | 1 | 0 | 0 |
| 5 | 1.670972 | 0 | 83 | 8.147417 | 0 | 149.948316 | 271.493436 | 0 | 0 | 1 | 0 |


Note that because a decision tree splits on thresholds of one feature at a time, its results do not depend on the scale of each feature, so none of the data needs to be scaled or normalized as in other assessments.
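The point about scaling can be checked with a small sketch. It fits the same tree twice on synthetic data, once with a feature in its raw units and once with that feature multiplied by 1024, and compares the predictions. The data and variable names are made up for illustration and are not the Churn columns.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in features (assumption: integer-valued for simplicity)
X_raw = rng.integers(0, 100, size=(200, 2)).astype(float)
y_demo = 3.0 * X_raw[:, 0] + rng.normal(0, 1, size=200)

# Same data, but the first feature is expressed in much larger units
X_rescaled = X_raw.copy()
X_rescaled[:, 0] *= 1024

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=2).fit(X_raw, y_demo)
tree_rescaled = DecisionTreeRegressor(max_depth=4, random_state=2).fit(X_rescaled, y_demo)

# The split thresholds differ, but each sample lands in the same leaf
same_predictions = np.allclose(tree_raw.predict(X_raw), tree_rescaled.predict(X_rescaled))
print(same_predictions)
```

Rescaling moves the thresholds but not the ordering of the samples, so the partitions, and therefore the predictions, stay the same.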

In [148]:
#printing summary statistics for the dependent variable Tenure to view its range for later assessment of mse/rmse
df["Tenure"].describe()

count    10000.000000
mean        34.526188
std         26.443063
min          1.000259
25%          7.917694
50%         35.430507
75%         61.479795
max         71.999280
Name: Tenure, dtype: float64

### C4: Cleaned Dataset
Below is the code that exports the cleaned dataset to a csv, which is submitted alongside my notebook, test/training data, and Panopto video.

In [149]:
#exporting the dataset to csv file
clean_df.to_csv('dectree_dataset.csv', index=False)

### D1: Splitting the Data

In [150]:
#splitting data from my cleaned dataframe into training and testing data
X = clean_df.drop(columns=["Tenure"])
y = clean_df["Tenure"]

#splitting the data into training and testing sets, with an 80% train, 20% test split,
#using the random state 1 to maintain the exact split across multiple runs of this notebook (Boorman, n.d.)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(8000, 10)
(2000, 10)
(8000,)
(2000,)


From the above, we can see that the data has been split 80/20, with 8000 entries in each training set and 2000 entries in each testing set. The X sets correctly show the 10 feature columns produced from the 8 independent variables chosen for analysis once the categorical variables were encoded.

In [151]:
#exporting each set of training/testing data to csv
X_train.to_csv('t2_X_train.csv', index=False)
X_test.to_csv('t2_X_test.csv', index=False)
y_train.to_csv('t2_y_train.csv', index=False)
y_test.to_csv('t2_y_test.csv', index=False)

### D2: Output/Intermediate Calculations
The decision tree regression model learns from the training data fed to it. Once I fit the training data generated above, the model develops its node tests from the variables, relationships, and outcomes present in that data. In assessing my decision tree, I will print the resulting tree's mean squared error (MSE), the average squared difference between my model's predicted and actual target values. The lower my model's MSE, the better its performance, as this indicates predictions closer to the actual values.
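As a small worked example of the metric, the sketch below computes the MSE and RMSE by hand on three made-up values and compares the result with scikit-learn's `mean_squared_error`. The numbers are illustrative only and do not come from the Churn data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted tenures, in months
y_actual = np.array([3.0, 5.0, 7.0])
y_predicted = np.array([2.0, 5.0, 9.0])

# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
mse_manual = np.mean((y_actual - y_predicted) ** 2)  # (1 + 0 + 4) / 3
rmse_manual = np.sqrt(mse_manual)                    # back in months

mse_sklearn = mean_squared_error(y_actual, y_predicted)
print(mse_manual, rmse_manual, mse_sklearn)
```

Taking the square root returns the error to the units of the target, which is why the RMSE is easier to interpret than the MSE.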

The number of levels a decision tree creates impacts both the processing speed and the resulting MSE, as trees using default settings can overfit by adapting to noise within the data, resulting in poor performance on new data. To combat this, I will use the hyperparameter tuning tool GridSearchCV to identify the best parameters for my model, namely the max depth of the tree and the minimum samples needed for a leaf node. The steps for this tuning can be found in the cell below via commented notes, but as a general overview: I start by creating a basic decision tree regressor model and a grid of candidate values for each parameter, then let GridSearchCV score every combination with 10-fold cross-validation on the training data.

In [152]:
#creating dt variable to store the original model, using all default parameters to see how the model performs
#though I am using a random_state of 2 to maintain the shape of my decision tree across all my notebook runs
dt = DecisionTreeRegressor(random_state = 2)

#creating a dictionary of arrays for assessment of ideal parameters 
param_list = {
    'max_depth': range(1, 10),
    'min_samples_leaf': range(1,30),
}

#utilizing the GridSearchCV function, using the determination of ideal score set to R squared value (Saini, 2020)
grid = GridSearchCV(estimator = dt, param_grid = param_list, scoring = 'r2', cv=10, n_jobs=-1)

#fitting my training data to the optimal parameter grid
grid.fit(X_train, y_train)

#printing the best parameters from my gridsearch
print(grid.best_estimator_)

DecisionTreeRegressor(max_depth=9, min_samples_leaf=5, random_state=2)


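Thus, the ideal parameters found for this decision tree are a max depth of 9 and a minimum of 5 samples per leaf, along with my assigned random state of 2 to retain the shape across multiple runs of my notebook. Note that a max depth of 9 is the largest value in the searched range (range(1, 10)), so deeper trees were not evaluated.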
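For a sense of how much work the grid search above performs, this short sketch counts the parameter combinations in the same grid using scikit-learn's `ParameterGrid` and multiplies by the 10 cross-validation folds. The variable names `n_candidates` and `n_fits` are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above
param_list = {
    'max_depth': range(1, 10),        # 9 candidate depths: 1 through 9
    'min_samples_leaf': range(1, 30), # 29 candidate leaf sizes: 1 through 29
}

n_candidates = len(ParameterGrid(param_list))  # every depth/leaf-size pairing
n_fits = n_candidates * 10                     # one fit per fold with cv=10
print(n_candidates, n_fits)
```

GridSearchCV then refits the best combination once more on the full training set, which is the model returned by `best_estimator_`.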

In [153]:
#creating my ideal decision tree regressor model using the parameters found from my grid search
ideal_dt = grid.best_estimator_

#creating predictor for test set
y_pred = ideal_dt.predict(X_test)

#calculating and printing the mse/rmse value
mse = mean_squared_error(y_test, y_pred)
rmse = mse**(1/2)
print("The mean squared error for the model is: " + str(mse))
print("The root mean squared error for the model is: " + str(rmse))

#printing the r-squared value of my test data fitted against my ideal model
rsq = r2_score(y_test, y_pred)
print("R-squared value for test data against ideal decision tree model: " + str(rsq))

The mean squared error for the model is: 2.9003243317362446
The root mean squared error for the model is: 1.7030338610069515
R-squared value for test data against ideal decision tree model: 0.9958806522197612


The last thing I would like to do, even though it is not necessary for my research question, is identify which variables are weighted most heavily in my model. I will do so using the feature_importances_ attribute, which I learned about from a tutorial video made by Ryan Nolan (Nolan, 2023).

In [154]:
#creating and storing the array for my feature weighting 
features = pd.DataFrame(ideal_dt.feature_importances_, index = X.columns)
features

| Feature | Importance |
| :- | :- |
| Children | 0.000182 |
| Age | 0.000479 |
| Outage_sec_perweek | 1.8e-05 |
| OnlineSecurity | 1.8e-05 |
| MonthlyCharge | 0.003854 |
| Bandwidth_GB_Year | 0.990287 |
| Contract_One year | 2e-06 |
| Contract_Two Year | 0.0 |
| InternetService_Fiber Optic | 0.002467 |
| InternetService_None | 0.002693 |


As we can see from the above, Bandwidth_GB_Year accounts for about 99% of the total feature importance (0.990), making it the only feature that demonstrates a significant impact on the determination of Tenure.
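To illustrate how these importance values behave, the sketch below fits a tree on synthetic data in which one feature drives almost all of the target, then reads `feature_importances_` into a sorted pandas Series. The column names and data are hypothetical; only the tree parameters mirror the tuned model.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Synthetic features (made up): one strong driver, one weak driver, one unrelated
X_demo = pd.DataFrame({
    "strong_signal": rng.uniform(0, 100, 500),
    "weak_signal": rng.uniform(0, 100, 500),
    "pure_noise": rng.uniform(0, 100, 500),
})
y_demo = 5.0 * X_demo["strong_signal"] + 0.2 * X_demo["weak_signal"] + rng.normal(0, 1, 500)

demo_tree = DecisionTreeRegressor(max_depth=9, min_samples_leaf=5, random_state=2)
demo_tree.fit(X_demo, y_demo)

# Importances are normalized, so they sum to 1 across all features
importances = pd.Series(demo_tree.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Because the values are shares of the total error reduction achieved by the tree's splits, a single dominant feature pushes every other feature's share close to zero, much like the pattern in the table above.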

### D3: Code
All code used to perform prediction analysis can be found within section D2 above.

### E1: Accuracy & Mean Squared Error
Using the tuned parameters of a max depth of 9 and a minimum leaf size of 5 samples, the testing set's mean squared error (MSE) was determined to be 2.9. This value is the average squared difference between the predicted and actual tenure. To give a better idea of what this means for our model, the root mean squared error (RMSE) was determined to be 1.7, which represents a typical prediction error of about 1.7 months of tenure. In the context of this assessment, this is a very low value given that tenure ranges from 1 month to 72 months. Overall, the model demonstrates a strong ability to predict a customer's total tenure to within roughly 1 to 2 months. Ultimately, I cannot say for certain whether this RMSE is low enough for the telecommunications industry, as I lack the domain knowledge to judge, but for my assessment I believe this is an appropriate result.
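The scale of the error can be put in context with some quick arithmetic on the figures already printed in this notebook. The sketch below is only approximate: the standard deviation, minimum, and maximum come from the full dataset's describe() output, while the MSE and RMSE come from the test set.

```python
# Values copied from the notebook output above
rmse = 1.7030338610069515
mse = 2.9003243317362446
tenure_std = 26.443063                        # from df["Tenure"].describe()
tenure_min, tenure_max = 1.000259, 71.999280

# RMSE relative to the spread of Tenure
rmse_vs_std = rmse / tenure_std
rmse_vs_range = rmse / (tenure_max - tenure_min)

# Rough R-squared implied by the MSE, assuming the test set's variance
# is close to the full dataset's variance
approx_r2 = 1 - mse / tenure_std ** 2

print(f"{rmse_vs_std:.3f} {rmse_vs_range:.3f} {approx_r2:.4f}")
```

A model that always predicted the mean tenure would have an RMSE near the standard deviation of about 26 months, so an error of about 1.7 months is a small fraction of that baseline, and the implied R-squared is close to the one reported for the test data.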

### E2: Results
In regards to the R-squared value of my decision tree regression model, I calculated a result of 0.996 on the test set. This value indicates that my decision tree model explains approximately 99.6% of the variance in my target variable, meaning that the model can very accurately account for Tenure's variability. This is further supported by the very low RMSE; together, these results strongly indicate that my decision tree model is useful in predicting a customer's tenure. However, no model is perfect, and it is possible that this one has a hidden weakness through potential overfitting.

### E3: Limitation
As stated above, it is possible that my decision tree model suffers from overfitting. Overfitting means that the model has fit the training data so closely that it is influenced by noise and patterns specific to that data, so that it predicts well on this data but poorly on anything new or unknown. If so, then while my model is very good at predicting the target variable on my current dataset, it may not capture relationships that hold for other data.
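One common way to probe this concern is to compare a model's R-squared on its training data with its R-squared on held-out data; a large gap suggests overfitting. The sketch below shows the idea on synthetic data, contrasting an unconstrained tree with a depth-limited one. The data, the `train_test_gap` helper, and the chosen limits are assumptions for illustration, not results from the Churn dataset.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)

# Synthetic noisy data: a smooth signal plus random noise
X_syn = rng.uniform(0, 10, size=(500, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(0, 0.5, size=500)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=1)

def train_test_gap(model):
    """Fit the model and return (train R2, test R2, difference)."""
    model.fit(X_tr, y_tr)
    train_r2 = r2_score(y_tr, model.predict(X_tr))
    test_r2 = r2_score(y_te, model.predict(X_te))
    return train_r2, test_r2, train_r2 - test_r2

# Default settings let the tree grow until it memorizes the training noise
deep = train_test_gap(DecisionTreeRegressor(random_state=2))
# Limiting depth and leaf size forces the tree to generalize
limited = train_test_gap(DecisionTreeRegressor(max_depth=4, min_samples_leaf=5, random_state=2))

print("unconstrained:", deep)
print("limited:", limited)
```

Running the same comparison on the tuned model with X_train, y_train, X_test, and y_test from this notebook would show whether its training score is much higher than the test score reported above.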

### E4: Recommendations
It was mentioned above that the decision tree regression model may have overfit to the data currently available. Even so, the model has proven extremely good at assessing the likely tenure of our telecommunications company's existing customers. Thus, I can justifiably say that this model would be appropriate for deployment in assessing the tenure of current subscribers, allowing management to identify accurately the moments at which a customer may require additional service offerings to tempt them to remain a subscriber.

However, overfitting comes with the consequence of difficulty adapting to new or unknown data, because the model is only really good at assessing *currently available data*. As such, while it is suitable for assessing current customers, what if the telecom corporation would like to assess potential tenure for new clientele? Say, for example, the company would like to target an entirely new market base or a new demographic, or aims to poach customers from another telecom company. It would then likely want to estimate the expected revenue to weigh against the cost of customer acquisition in order to establish forecasted budgets and determine feasibility. This model may not be applicable for that, because it only demonstrates predictive accuracy on the customer data available *within* the company, and new customers may have entirely different mindsets, values, or attraction to the brand. I would strongly recommend that before this model is used for external assessment, it be validated against either industry standards or, if possible, another company's dataset. That way, the company can be certain whether this model is applicable to new, unknown data rather than just the dataset it has been given.

### F: Panopto
My Panopto video can be found here: https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=103af793-6bcf-4c31-ad05-b1600177b8fc

### G: Code Sources
Boorman, George. (n.d.). *Supervised Learning with scikit-learn [MOOC].* Datacamp. https://app.datacamp.com/learn/courses/supervised-learning-with-scikit-learn

Saini, Bhanwar. (2020, September 29). *Hyperparameter Tuning of Decision Tree Classifier Using GridSearchCV*. PlainEnglish. https://plainenglish.io/blog/hyperparameter-tuning-of-decision-tree-classifier-using-gridsearchcv-2a6ebcaffeda

Nolan, Ryan. (2023, August 17). *How to Build Your First Decision Tree in Python (scikit-learn)*. YouTube. https://www.youtube.com/watch?v=YkYpGhsCx4c&t=444s

### H: Sources
Chauhan, Nagesh Singh. (2022, February 9). *Decision Tree Algorithm, Explained*. KDNuggets. https://www.kdnuggets.com/2020/01/decision-tree-algorithm-explained.html

Datacamp. (2023, February). *Decision Tree Classification in Python Tutorial*. Datacamp. https://www.datacamp.com/tutorial/decision-tree-classification-python