#### D209 - Predictive Analysis
azaheer@wgu.edu

### Part I: Research Question
#### A.  Describe the purpose of this data mining report by doing the following:
1.  Propose one question relevant to a real-world organizational situation that you will answer using one of the following prediction methods:
    - random forests
    
    A. Which customers are at high risk of churn? I will answer this question with a random forest that uses the following predictors:
        • Children
        • Income
        • Tenure
        • Bandwidth_GB_Year
        • Age

2.  Define one goal of the data analysis. Ensure that your goal is reasonable within the scope of the scenario and is represented in the available data.

    A. Stakeholders can review the results of the analysis and create incentives to retain the customers who are likely to terminate their contracts with the company. This will lead to a lower churn rate.

### Part II: Method Justification
#### B.  Explain the reasons for your chosen prediction method from part A1 by doing the following:
1.  Explain how the prediction method you chose analyzes the selected data set. Include expected outcomes.

    A. The random forest algorithm improves on a single decision tree by building many decision trees, each trained on a random subset of the data. When splitting a node, each tree searches for the best feature among a random subset of the features rather than among all of them. This randomness makes the trees diverse, and combining their predictions generally leads to a better model. It is one of the most accurate learning algorithms available (Richmond, 2016). I expect the outcome to be higher accuracy and less overfitting than a single decision tree would give.
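
The idea described above can be sketched by hand: draw a bootstrap sample of rows for each tree, let each tree consider a random subset of features at every split, and combine the trees' outputs. This is only an illustration under stated assumptions. The data is synthetic (`make_classification`), the tree count of 25 is arbitrary, and the hard majority vote is a simplification (scikit-learn's own forest classes combine trees by averaging).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn data (assumption: 5 numeric predictors, binary target).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=21)

rng = np.random.default_rng(1)
trees = []
for i in range(25):
    # Bootstrap sample: draw row indices with replacement.
    rows = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" makes each split consider a random subset of features.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[rows], y[rows])
    trees.append(tree)

# Combine the trees: each tree votes 0 or 1, and the majority wins.
votes = np.stack([tree.predict(X) for tree in trees])   # shape (25, 500)
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)

training_accuracy = (forest_pred == y).mean()
print(f"Training accuracy of the hand-built forest: {training_accuracy:.2f}")
```
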
    

2.  Summarize one assumption of the chosen prediction method.

    A. Random forests are non-parametric and make no formal distributional assumptions. They can handle skewed and multi-modal data, as well as categorical data that are ordinal or non-ordinal (Richmond, 2016).

3.  List the packages or libraries you have chosen for Python or R, and justify how each item on the list supports the analysis.

     A. I will use Python because of my previous experience with it and its Pandas and Sklearn modules. Additionally, I will use Jupyter Notebook as the IDE because it provides a user-friendly experience. Pandas is an excellent package for working with data sets because it makes it easy to load data and manipulate columns and rows, for example to replace null values. I will use the ‘train_test_split’ function to split the data into training and test sets, with ‘test_size’ set to ‘.20’ for an 80/20 split and ‘random_state’ set to ‘21’ so that the shuffling before the split is reproducible. RandomForestRegressor will then be used to instantiate the regression algorithm. ‘mean_squared_error’ and ‘accuracy_score’ are used to produce the error score and the accuracy of the model, respectively.
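
As a small illustration of how ‘test_size’ controls the split, the sketch below uses a made-up 100-row array rather than the churn data. It contrasts `test_size=0.20` (the 80/20 split described above) with `test_size=0.02`, which holds out only 2% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, and a binary target.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# 80/20 split: 20% of the rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=21)

# For comparison, 0.02 holds out only 2% of the rows.
X_train_small, X_test_small, _, _ = train_test_split(
    X, y, test_size=0.02, random_state=21)

print(len(X_train), len(X_test))
print(len(X_train_small), len(X_test_small))
```

Because ‘random_state’ is fixed, rerunning the split returns the same rows each time.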

### Part III: Data Preparation
#### C.  Perform data preparation for the chosen data set by doing the following:
1.  Describe one data preprocessing goal relevant to the prediction method from part A1.

    A. Select the appropriate independent variables and hyper-parameters for the analysis.

2.  Identify the initial data set variables that you will use to perform the analysis for the prediction question from part A1, and group each variable as continuous or categorical.
    
    A. I will be using the following variables to predict customer churn:

     ##### Categorical (target variable):
     • Churn
    
     ##### Continuous Predictors:
     • Children
     • Income
     • Tenure
     • Bandwidth_GB_Year
     • Age

3.  Explain the steps used to prepare the data for the analysis. Identify the code segment for each step.

    1. Use Pandas to import the CSV file into a data frame.
    2. Examine the columns and ensure their data types are consistent.
    3. Identify and resolve spelling mistakes in column headers or row-level data.
    4. Validate that there are no null values; if there are, remove them.
    5. Run the regression on the prepared data set.
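
The cleaning steps listed above can be sketched on a tiny, hypothetical data frame. The misspelled header, the text-typed income column, and the null value are invented for illustration and are not taken from `churn_clean.csv`.

```python
import pandas as pd

# Hypothetical mini data set with deliberate problems.
raw = pd.DataFrame({
    "Childern": [0, 1, 4, 1],                               # misspelled header
    "Income": ["28561.99", "21704.77", None, "18925.23"],   # text values and one null
    "Churn": ["No", "Yes", "No", "No"],
})

# Resolve the spelling mistake in the column header.
clean = raw.rename(columns={"Childern": "Children"})

# Make the data type consistent: convert text to numbers.
clean["Income"] = pd.to_numeric(clean["Income"])

# Count nulls per column, then remove the rows that contain them.
null_counts = clean.isnull().sum()
clean = clean.dropna().reset_index(drop=True)

print(clean.dtypes)
print(null_counts)
```
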

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

#Show all Columns and Rows
pd.options.display.max_columns = None
pd.options.display.max_rows = None

# Load data set
df = pd.read_csv('churn_clean.csv')

# Amend columns with no names
df = df.rename(columns=({ 'Item1': 'Timely_Response', 'Item2':'Timely_Fixes', 'Item3':'Timely_Replacements', 
                         'Item4':'Reliability', 'Item5':'Options', 'Item6':'Respectful_Response',
                         'Item7':'Courteous_Exchange', 'Item8':'Evidence_of_active_listening'}))

In [2]:
# Drop columns that are not relevant to the analysis
df = df.drop(columns=["CaseOrder", "Customer_id", "Interaction", "Outage_sec_perweek", "UID", "City", "State",
                      "Techie", "PaperlessBilling", "Yearly_equip_failure", "County", "Zip", "Lat", "Lng", "Population",
                      "Area", "TimeZone", "Job", "PaymentMethod", "DeviceProtection",
                      "OnlineBackup", "Port_modem", "OnlineSecurity",
                      "Multiple", "Phone", "TechSupport", "Contract", "Tablet", "InternetService", "StreamingTV", "StreamingMovies",
                      "Timely_Response", "Timely_Fixes", "Timely_Replacements", "Reliability", "Options", "Gender",
                      "Marital", "Respectful_Response", "Courteous_Exchange", "Evidence_of_active_listening", "Email", "Contacts"
                     ])

In [3]:
# display data set with all the columns
df.head(n=3)

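|   | Children | Age | Income | Churn | Tenure | MonthlyCharge | Bandwidth_GB_Year |
|---|----------|-----|--------|-------|--------|---------------|-------------------|
| 0 | 0 | 68 | 28561.99 | No | 6.795513 | 172.455519 | 904.53611 |
| 1 | 1 | 27 | 21704.77 | Yes | 1.156681 | 242.632554 | 800.982766 |
| 2 | 4 | 50 | 9609.57 | No | 15.754144 | 159.947583 | 2054.706961 |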


In [4]:
# Validate there are no nulls
df.isnull().sum()

Children             0
Age                  0
Income               0
Churn                0
Tenure               0
MonthlyCharge        0
Bandwidth_GB_Year    0
dtype: int64

4. Provide a copy of the prepared data set.

In [5]:
# Prepared dataset in the root folder 'prepared_dataset.csv'
df.to_csv('prepared_dataset.csv')

### Part IV: Analysis
#### D.  Perform the data analysis and report on the results by doing the following:
1.  Split the data into training and test data sets and provide the file(s).

In [6]:
# Data conversion
df['Churn'] = [1 if v == 'Yes' else 0 for v in df['Churn']]

# Feature Selection
X = df.drop('Churn', axis=1).values
y = df['Churn']

# Data split (note: test_size=.02 holds out 2% of the rows for testing; an 80/20 split would use test_size=.20)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.02, random_state=21)

# Save training and test data set to csv file
pd.DataFrame(X_train).to_csv('X_train.csv')
pd.DataFrame(X_test).to_csv('X_test.csv')
pd.DataFrame(y_train).to_csv('y_train.csv')
pd.DataFrame(y_test).to_csv('y_test.csv')

In [19]:
df.head()

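|   | Children | Age | Income | Churn | Tenure | MonthlyCharge | Bandwidth_GB_Year |
|---|----------|-----|--------|-------|--------|---------------|-------------------|
| 0 | 0 | 68 | 28561.99 | 0 | 6.795513 | 172.455519 | 904.53611 |
| 1 | 1 | 27 | 21704.77 | 1 | 1.156681 | 242.632554 | 800.982766 |
| 2 | 4 | 50 | 9609.57 | 0 | 15.754144 | 159.947583 | 2054.706961 |
| 3 | 1 | 48 | 18925.23 | 0 | 17.087227 | 119.95684 | 2164.579412 |
| 4 | 0 | 83 | 40074.19 | 1 | 1.670972 | 149.948316 | 271.493436 |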


2.  Describe the analysis technique you used to appropriately analyze the data. Include screenshots of the intermediate calculations you performed.

I used the Random Forest algorithm, which builds many decision trees and then averages their predictions to get a more accurate and stable result. The algorithm adds randomness as it grows each tree by searching for the best feature among a random subset of features when splitting a node. The resulting diversity among the trees generally leads to a better model.

The diagram below shows the algorithm flow.

In [7]:
from IPython.display import Image
Image(url= "two-tree-random-forest.png", width=400, height=400)

In [8]:
# Train the model
rdf = RandomForestRegressor(n_estimators=1000, random_state=1)
rdf.fit(X_train, y_train)

RandomForestRegressor(n_estimators=1000, random_state=1)
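
The way the forest combines its trees can be checked directly: for a `RandomForestRegressor`, the forest's prediction is the mean of the individual trees' predictions. The sketch below demonstrates this on synthetic data (`make_regression`) with 50 trees. Both choices are assumptions made for illustration and differ from the model trained above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (assumption: not the churn data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=21)

forest = RandomForestRegressor(n_estimators=50, random_state=1)
forest.fit(X, y)

# Predict the first five rows with every individual tree.
per_tree = np.stack([tree.predict(X[:5]) for tree in forest.estimators_])

# Averaging the trees by hand reproduces the forest's own prediction.
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X[:5])
print(np.allclose(manual_average, forest_prediction))
```
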

3.  Provide the code used to perform the prediction analysis from part D2.

In [9]:
# Prediction Train
y_pred_train = rdf.predict(X_train)

In [10]:
# Prediction Test
y_pred_test = rdf.predict(X_test)

### Part V: Data Summary and Implications
#### E.  Summarize your data analysis by doing the following:
1.  Explain the accuracy and the mean squared error (MSE) of your prediction model.

In [11]:
# MSE
print('MSE: {0:.2f}'.format(mean_squared_error(y_test, y_pred_test)))

MSE: 0.14
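
The notebook imports ‘accuracy_score’ but reports only the MSE. One possible way to obtain an accuracy figure from a regressor trained on a 0/1 target is to threshold its continuous output, because ‘accuracy_score’ expects class labels. The sketch below uses synthetic data, and the 0.5 cut-off is an assumption, not a step taken in this report.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (binary 0/1 target).
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           n_redundant=0, random_state=21)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=21)

model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# The regressor returns scores between 0 and 1, not class labels.
scores = model.predict(X_test)

# Assumed cut-off: scores of 0.5 or more are treated as "will churn".
labels = (scores >= 0.5).astype(int)

mse = mean_squared_error(y_test, scores)
acc = accuracy_score(y_test, labels)
print('MSE: {0:.2f}'.format(mse))
print('Accuracy: {0:.2f}'.format(acc))
```
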


2.  Discuss the results and implications of your prediction analysis.

The model's churn predictions have a low MSE of .14 on the test set, which suggests that the model predicts reasonably well which customers will churn. However, because Mean Squared Error is the average of the squared errors, a single large error raises the mean considerably. This makes MSE vulnerable to large outliers.

The model adequately predicts churn but could perform better if further analysis is performed on the independent variables.
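
The sensitivity of MSE to a single large error can be seen in a small worked example. The four predictions below are invented for illustration. Mean absolute error (MAE) is included only as a point of comparison.

```python
import numpy as np

# Four customers who did not churn (true value 0), with invented predictions.
y_true = np.array([0.0, 0.0, 0.0, 0.0])
small_errors = np.array([0.1, 0.1, 0.1, 0.1])
one_large_error = np.array([0.1, 0.1, 0.1, 0.9])

# MSE is the average of the squared errors.
mse_small = np.mean((y_true - small_errors) ** 2)
mse_large = np.mean((y_true - one_large_error) ** 2)

# MAE is the average of the absolute errors.
mae_small = np.mean(np.abs(y_true - small_errors))
mae_large = np.mean(np.abs(y_true - one_large_error))

print('MSE: {0:.2f} -> {1:.2f}'.format(mse_small, mse_large))
print('MAE: {0:.2f} -> {1:.2f}'.format(mae_small, mae_large))
```

In this toy case one bad prediction raises the MSE from 0.01 to 0.21, a 21-fold increase, while the MAE only triples from 0.1 to 0.3.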

3.  Discuss one limitation of your data analysis.

There are a couple of limitations:
    
    1. The algorithm requires more trees to produce an accurate prediction, which results in a slower model.
    2. It cannot predict beyond the range of the target values in the training data, and it can overfit a noisy data set (Thompson, 2021).
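
The second limitation can be demonstrated with a minimal sketch. The data is an invented straight line, y = 2x, chosen only because the correct answer outside the training range is obvious.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Invented training data: y = 2x for x from 0 to 49, so y never exceeds 98.
X_train = np.arange(0, 50, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X_train, y_train)

# Ask for a prediction far outside the training range.
# The true value on the line would be 200.
prediction = model.predict(np.array([[100.0]]))[0]

print('Largest training target: {0:.1f}'.format(y_train.max()))
print('Prediction at x=100: {0:.1f}'.format(prediction))
```

Each tree predicts an average of training targets that fall in a leaf, so the forest cannot return a value above the largest target it saw during training.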

4.  Recommend a course of action for the real-world organizational situation from part A1 based on your results and implications discussed in part E2.

The stakeholders need to analyze which features are common among the customers who have left the company in the past. This will help them identify which customers are at higher risk of churn and offer those customers additional relevant services at a competitive price. This will show the customers that the company is a one-stop shop for all their needs, which will encourage them to stay with the company for a long time.
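
One simple way to start comparing customers who churned with those who stayed is to average each feature by churn status. The sketch below uses only the five rows shown in the `df.head()` output in Part IV, so it illustrates the technique and supports no conclusions about the full data set.

```python
import pandas as pd

# The five rows from the df.head() output in Part IV (Churn already encoded as 0/1).
sample = pd.DataFrame({
    "Children": [0, 1, 4, 1, 0],
    "Age": [68, 27, 50, 48, 83],
    "Income": [28561.99, 21704.77, 9609.57, 18925.23, 40074.19],
    "Churn": [0, 1, 0, 0, 1],
    "Tenure": [6.795513, 1.156681, 15.754144, 17.087227, 1.670972],
    "MonthlyCharge": [172.455519, 242.632554, 159.947583, 119.95684, 149.948316],
    "Bandwidth_GB_Year": [904.53611, 800.982766, 2054.706961, 2164.579412, 271.493436],
})

# Average every feature separately for customers who stayed (0) and who churned (1).
profile = sample.groupby("Churn").mean()
print(profile[["Tenure", "MonthlyCharge", "Bandwidth_GB_Year"]])
```

Running the same comparison on the full prepared data set would give stakeholders a first view of how the two groups differ.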

### Part VI: Demonstration
#### F.  Provide a Panopto video recording that includes a demonstration of the functionality of the code used for the analysis and a summary of the programming environment.
 
Note: The audiovisual recording should feature you visibly presenting the material (i.e., not in voiceover or embedded video) and should simultaneously capture both you and your multimedia presentation.
 
Note: For instructions on how to access and use Panopto, use the "Panopto How-To Videos" web link provided below. To access Panopto's website, navigate to the web link titled "Panopto Access," and then choose to log in using the “WGU” option. If prompted, log in using your WGU student portal credentials, and then it will forward you to Panopto’s website.
 
To submit your recording, upload it to the Panopto drop box titled “Data Mining I – NVM2.” Once the recording has been uploaded and processed in Panopto's system, retrieve the URL of the recording from Panopto and copy and paste it into the Links option. Upload the remaining task requirements using the Attachments option.
 
Panopto: https://wgu.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=0d1c8144-b02f-44b4-a2f5-ae2c014ef5e3

#### G.  Record the web sources used to acquire data or segments of third-party code to support the analysis. Ensure the web sources are reliable.

```{bibliography}
Pandas. (2021). Pandas DataFrames. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html

Get started with references. (2021). Jupyterbook. https://jupyterbook.org/tutorials/references.html#tutorials-references

Marques, A. M. (2020, March 11). How to show all columns / rows of a Pandas Dataframe? Towards Data Science. https://towardsdatascience.com/how-to-show-all-columns-rows-of-a-pandas-dataframe-c49d4507fcf

Starmer, J. (2018, March 5). StatQuest: Logistic Regression. YouTube. https://www.youtube.com/watch?v=yIYKR4sgzI8&t=121s

V. (2019, July 21). Pandas: Apply a function to single or selected columns or rows in Dataframe. ThisPointer. https://thispointer.com/pandas-apply-a-function-to-single-or-selected-columns-or-rows-in-dataframe/

Wijaya, C. Y. (2021, December 15). 5 Must-Know Dimensionality Reduction Techniques via Prince. Medium. https://towardsdatascience.com/5-must-know-dimensionality-reduction-techniques-via-prince-e6ffb27e55d1

Thompson, B. (2021, December 13). A limitation of Random Forest Regression - Towards Data Science. Medium. https://towardsdatascience.com/a-limitation-of-random-forest-regression-db8ed7419e9f

Great Learning Team. (2022, January 17). Mean Squared Error - Explained | What is Mean Square Error? GreatLearning Blog: Free Resources What Matters to Shape Your Career! https://www.mygreatlearning.com/blog/mean-square-error-explained/
```

#### H.  Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.

```` 
Richmond, S. (2016, March 21). Algorithms Exposed: Random Forest | BCCVL. Bccvl.Org.Au. Retrieved 10 January 2022, from https://bccvl.org.au/algorithms-exposed-random-forest/

Chantal D. Larose, & Daniel T. Larose. (2019). Data Science Using Python and R. Wiley.

Donges, N. (2021, September 17). A Complete Guide to the Random Forest Algorithm. Built In. https://builtin.com/data-science/random-forest-algorithm

````