# Classification trees - Level 1

Fill the empty spaces `- - -` with code.
Check out the messages after the `#` symbol for extra help

# 0. Problem
You have been hired by GAP as data analysts! Your first task is to predict how many of the most loyal customers will buy a limited-edition jumper. To do that, you and your team surveyed 701 loyal customers and collected some valuable data: age, gender, salary, how much money the customer spent at the store that day and over the last month, whether the customer has used the online shop, and how many jumpers the customer bought in the last year. The last question of the survey asked whether or not the customer will buy the limited-edition jumper. Unfortunately, the answer to that question was not recorded for all the interviewed people. You want to know how many of the 701 interviewed customers will buy the jumper. If more than 70% of them are likely to buy it, the limited-edition jumper will be launched; if the percentage is lower, unfortunately, the limited-edition jumper will not see the light of day. To find out, we have to use a classification model!

# 1. Overview

This notebook uses decision trees to predict whether a customer will buy the new jumper, based on their age, gender, salary, how much money they spent in GAP today and over the last month, and how many jumpers they bought in the last year.

# 2. Import the following packages

Import `pandas` as `pd` and `numpy` as `np` <br>
From `sklearn` import `tree` and `metrics` <br>
From `sklearn.model_selection` import `train_test_split` <br>
Import `seaborn` as `sns` and `matplotlib.pyplot` as `plt` <br>
Import `StringIO` from `io` and `Image` from `IPython.display` <br>
Import `pydotplus`

In [None]:
import pandas as pd
import numpy as np
from sklearn import tree, metrics
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt
# sklearn.externals.six is no longer available in recent scikit-learn releases;
# the standard-library StringIO works the same way here
from io import StringIO
from IPython.display import Image
import pydotplus

# Data from http://decd.co/classification-link

# 3. Load data 

Download `WholeDataset.csv` from [GitHub](https://github.com/DecodedCo/Classification) and save it to your working directory as `WholeDataset.csv`.

Import the CSV file into Python.


In [None]:
# To read the dataset use the function read_csv from pandas
data = pd.read_csv("WholeDataset.csv")

# 4. Explore the data

In this part of the notebook we need to:
1. Check the first 8 observations
2. Check the dimensions of the dataset
3. Print the information of `data` including the index dtype and column dtypes, non-null values and memory usage
4. Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values

## Check the first 8 observations

In [None]:
# Check the first 8 rows of the dataframe
# To do that use the function head
data.head(8)

## Check the dimensions of the dataset

In [None]:
# Get the dimensions of the dataframe
# Use the attribute .shape
data.shape

## Print the information of `data` including the index dtype and column dtypes, non-null values and memory usage

In [None]:
# Get high-level information on the columns
# use the .info() function
data.info()

## Generate descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values

In [None]:
# use the function .describe()
data.describe(include = "all")

# 5. Clean the data

In this section of the notebook, we need to clean the data.
We need to:
1. Let's change the column names to something more meaningful
2. Tidy the factors of the column gender - avoid redundancy
3. Replace 1 and 0 with "YES" and "NO" in the `Decision` column

## Let's change the column names to something more meaningful
Change the name of the columns `spent` and `salaRy` to `spent_month` and `salary`, respectively

In [None]:
# Check out the column names
# use the attribute .columns
data.columns

In [None]:
# Change the name of the columns spent and salaRy using the function .rename()
# use the parameter columns to define the old and new names
# use inplace = True
data.rename(columns = {"spent":"spent_month", "salaRy":"salary"},
            inplace = True)

In [None]:
# Check out the column names
data.columns

##  Tidy the factors of the column `gender` - avoid redundancy

In [None]:
# Let's have a look at the column "gender"
# Use the function .describe()
data["gender"].describe()

In [None]:
# Let's check the unique values of the column "gender"
data["gender"].unique()

Replace the redundant values of the column `gender` using only `Female` and `Male`

In [None]:
# Use the function .replace() on the column "gender"; replace female with Female
data["gender"] = data["gender"].replace("female", "Female")

In [None]:
# Let's check the unique values of the column "gender"
data["gender"].unique()

In [None]:
# Use the function .replace() on the column "gender"; replace "male", "M", "m" with "Male"
data["gender"] = data["gender"].replace(["male", "M", "m"], "Male")

In [None]:
# Let's check the unique values of the column "gender"
data["gender"].unique()
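
The step-by-step replacements above can also be written as a single lookup. The sketch below is only an illustration on a made-up column: the `gender` values and the `canonical` mapping are assumptions, not the real contents of `WholeDataset.csv`. Any label missing from the mapping becomes `NaN`, so the mapping would need extending to cover whatever `.unique()` actually shows.

```python
import pandas as pd

# Toy column with the same kind of redundant labels (made-up values)
gender = pd.Series(["Female", "female", "Male", "male", "M", "m"])

# Hypothetical lookup from lower-cased label to canonical label
canonical = {"female": "Female", "male": "Male", "m": "Male"}

# Strip whitespace, lower-case, then map every label to its canonical form
cleaned = gender.str.strip().str.lower().map(canonical)
print(cleaned.unique())
```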

## Replace 1 and 0 with "YES" and "NO" in the `Decision` column

In [None]:
# use the function .replace() on the column 'Decision'; replace 1 and 0 with "YES" and "NO"
data["Decision"] = data["Decision"].replace(1, "YES")
data["Decision"] = data["Decision"].replace(0,"NO")
data.info()
data["Decision"].unique()

# 6. Splitting the dataset into NOPredict and Predict
In this section of the notebook we need to:
1. Drop all the empty values of the `Decision` column and save it as NOPredict
2. Explore the data using boxplots and scatter plots, with several variables on the y-axis and the decision on the x-axis
3. Use the subset with all empty values in the column `Decision` and save it as Predict
4. Divide the NOPredict subset into X and y, and then into train and test subsets for X and y
5. Create the dummy variables to deal with categorical inputs

## Drop all the empty values of the `Decision` column and save it as NOPredict

In [None]:
# NOPredict is the dataset with the known values for the decision
# Use the function .dropna()
# Note: with no arguments, .dropna() drops every row that has a missing value in any column
NOPredict = data.dropna()
NOPredict["Decision"].describe()
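
To see what `.dropna()` does with and without the `subset` parameter, here is a minimal sketch on a made-up dataframe (the `toy` frame and its values are assumptions for illustration only). If only the `Decision` column has gaps in the real data, both calls give the same result.

```python
import numpy as np
import pandas as pd

# Made-up frame: one gap in a feature column, one gap in Decision
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "Decision": ["YES", "NO", np.nan, "NO"],
})

# Drops rows with a missing value in ANY column
any_missing_dropped = toy.dropna()

# Drops only rows where Decision itself is missing
decision_known = toy.dropna(subset=["Decision"])

print(len(any_missing_dropped), len(decision_known))
```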

## Explore the data using boxplots and scatter plots, with several variables on the y-axis and the decision on the x-axis

In [None]:
# Exploring the NOPredict
# Select for the y axis any variable and compare it with the decision column
# Can you find a single variable that will help us to classify the decision column?
sns.boxplot(y='spent_month', x='Decision', data=NOPredict)
plt.show()

# sns.boxplot(y='No_jumpers_per_year', x='Decision', data=NOPredict)

sns.boxplot(y='Distance', x='Decision', data=NOPredict)
plt.show()

In [None]:
# Exploring the NOPredict
# Select for the x and y axis any variable and compare them with the decision column using the parameter hue = "Decision"
# Can you find a single variable that will help us to classify the decision column?
sns.scatterplot(y='spent_month', x= 'Distance', hue = "Decision", data =NOPredict)

In [None]:
sns.pairplot(NOPredict , hue='Decision')

## Use the subset with all empty values in the column `Decision` and save it as Predict

In [None]:
# use the function pd.isnull to subset the data with only null values
Predict = data[pd.isnull(data["Decision"])]
Predict.head()

In [None]:
# use .describe to see a summary of Predict
Predict.describe()

In [None]:
# Let's check the names of the columns first
NOPredict.columns

## Divide the NOPredict subset into X and y

In [None]:
# Feature selection 
feature_cols = ["age", "gender", "No_jumpers_per_year", "spent_today", "spent_month",
       "salary", "Distance", "Online"]
X = NOPredict[feature_cols]
y = NOPredict.Decision


In [None]:
print(type(y))

## Subset X and y into X_train, X_test, y_train, y_test

In [None]:
# Subset X and y using the function train_test_split
# call the results X_train, X_test, y_train, y_test
# use 75% for the train size
# set the random seed to 123
X_train, X_test, y_train, y_test=train_test_split(X, y, 
                                                  test_size =.25,
                                                  random_state = 123)
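
As a quick check of what a 75/25 split produces, the sketch below runs `train_test_split` on made-up data (`X_toy` and `y_toy` are assumptions). It also passes the optional `stratify` parameter, which the notebook does not use, to keep the YES/NO proportions the same in both subsets.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 60 "YES" and 40 "NO"
X_toy = pd.DataFrame({"age": range(20, 120)})
y_toy = pd.Series(["YES"] * 60 + ["NO"] * 40)

# test_size=0.25 leaves 75% of the rows for training;
# stratify=y_toy keeps the class balance in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_te.value_counts(normalize=True))
```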


## Create the dummy variables to deal with categorical inputs

In [None]:
# One-hot encoding all features in training set (X_train)
X_train = pd.get_dummies(X_train )

# One-hot encoding all features in testing set (X_test)
X_test = pd.get_dummies(X_test)
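
Encoding the train and test sets separately works as long as every category appears in both. If a category were missing from one of them, the two frames would end up with different columns. A minimal sketch of one way to guard against that, using made-up frames (`train` and `test` are assumptions), is to reindex the test columns against the training columns:

```python
import pandas as pd

# Made-up frames: "Male" never appears in the test rows
train = pd.DataFrame({"age": [30, 45, 52], "gender": ["Female", "Male", "Female"]})
test = pd.DataFrame({"age": [28, 61], "gender": ["Female", "Female"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)  # has no gender_Male column at this point

# Give the test set exactly the training columns, filling missing dummies with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```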

In [None]:
y_train.shape

# 7. Running the model 
Let's check the scikit-learn documentation on decision trees: https://scikit-learn.org/stable/modules/tree.html <br>
In particular, check out section `1.10.5. Tips on practical use`

Your facilitator will be walking through this section.

## Entropy model - no max_depth

In [None]:
clf_entropy = tree.DecisionTreeClassifier(criterion="entropy", random_state = 1234)
clf_entropy.fit(X_train, y_train)
y_pred = clf_entropy.predict(X_test)
y_pred = pd.Series(y_pred)
clf_entropy

In [None]:
dot_data = StringIO()
tree.export_graphviz(clf_entropy, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names=X_train.columns,class_names = ["NO", "YES"])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())
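
If Graphviz or `pydotplus` is not available, the same tree can be inspected as plain text with `tree.export_text`. This is a sketch on made-up data (`X_toy` and `y_toy` are assumptions), not a replacement for the picture above.

```python
import pandas as pd
from sklearn import tree

# Made-up training data with two of the notebook's feature names
X_toy = pd.DataFrame({
    "spent_month": [10, 20, 200, 250],
    "Distance": [9.0, 8.0, 1.0, 2.0],
})
y_toy = ["NO", "NO", "YES", "YES"]

clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=1234)
clf.fit(X_toy, y_toy)

# Text rendering of the fitted tree: one line per split or leaf
rules = tree.export_text(clf, feature_names=list(X_toy.columns))
print(rules)
```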

In [None]:
print("Model Entropy - no max depth")
print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
# print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_pred))
print('Precision score (YES):', metrics.precision_score(y_test,y_pred, pos_label = "YES"))
print('Recall score (NO):', metrics.recall_score(y_test,y_pred, pos_label = "NO"))
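
The cell above reports precision with `pos_label = "YES"` and recall with `pos_label = "NO"`, so the two numbers describe different classes. The sketch below uses made-up labels (`y_true` and `y_hat` are assumptions) to show how each score relates to the confusion matrix.

```python
from sklearn import metrics

# Made-up true labels and predictions
y_true = ["YES", "YES", "YES", "NO", "NO", "NO", "YES", "NO"]
y_hat = ["YES", "YES", "YES", "NO", "YES", "NO", "YES", "NO"]

# Rows are true classes, columns are predicted classes, in the order NO, YES
cm = metrics.confusion_matrix(y_true, y_hat, labels=["NO", "YES"])

# Of everything predicted YES, how much really was YES?
precision_yes = metrics.precision_score(y_true, y_hat, pos_label="YES")
# Of everything that really was YES, how much did we find?
recall_yes = metrics.recall_score(y_true, y_hat, pos_label="YES")
# Of everything that really was NO, how much did we find?
recall_no = metrics.recall_score(y_true, y_hat, pos_label="NO")

print(cm)
print(precision_yes, recall_yes, recall_no)
```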

## Gini impurity model - no max_depth

In [None]:
clf_gini = tree.DecisionTreeClassifier(criterion='gini', random_state = 1234)
clf_gini.fit(X_train, y_train)
y_pred = clf_gini.predict(X_test)
y_pred = pd.Series(y_pred)
clf_gini

In [None]:
dot_data = StringIO()
tree.export_graphviz(clf_gini, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names=X_train.columns,class_names = ["NO", "YES"])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

In [None]:
print("Model Gini impurity - no max depth")
print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
#print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_pred))
print('Precision score (YES):', metrics.precision_score(y_test,y_pred, pos_label = "YES"))
print('Recall score (NO):', metrics.recall_score(y_test,y_pred, pos_label = "NO"))

## Entropy model - max_depth 3

In [None]:
clf_entropy_3 = tree.DecisionTreeClassifier(criterion='entropy', max_depth = 3, random_state = 1234)
clf_entropy_3.fit(X_train, y_train)
y_pred = clf_entropy_3.predict(X_test)
y_pred = pd.Series(y_pred)
clf_entropy_3

In [None]:
dot_data = StringIO()
tree.export_graphviz(clf_entropy_3, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names=X_train.columns,class_names = ["NO", "YES"])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

In [None]:
print("Model Entropy - max depth 3")
print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
#print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_pred))
print('Precision score (YES):', metrics.precision_score(y_test,y_pred, pos_label = "YES"))
print('Recall score (NO):', metrics.recall_score(y_test,y_pred, pos_label = "NO"))

## Gini impurity model - max_depth 3

In [None]:
clf_gini_3 = tree.DecisionTreeClassifier(criterion='gini', random_state = 1234, max_depth = 3)
clf_gini_3.fit(X_train, y_train)
y_pred = clf_gini_3.predict(X_test)
y_pred = pd.Series(y_pred)
clf_gini_3

In [None]:
dot_data = StringIO()
tree.export_graphviz(clf_gini_3, out_file=dot_data,  
                filled=True, rounded=True,
                special_characters=True, feature_names=X_train.columns,class_names = ["NO", "YES"])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
Image(graph.create_png())

In [None]:
print("Model Gini impurity - max depth 3")
print("Accuracy:", metrics.accuracy_score(y_test,y_pred))
#print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_pred))
print('Precision score (YES):', metrics.precision_score(y_test,y_pred, pos_label = "YES"))
print('Recall score (NO):', metrics.recall_score(y_test,y_pred, pos_label = "NO"))

### Which model are you going to use?
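
One way to compare the four candidates on more than a single train/test split is cross-validation. The sketch below is an illustration on synthetic data from `make_classification`, not on the survey data, and the names `X_syn`, `y_syn`, `candidates` and `mean_accuracy` are assumptions. With the real data you would pass the one-hot encoded features and the `Decision` column instead.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the survey data (8 numeric features, 2 classes)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

# The same four configurations used in this section
candidates = {
    "entropy, no max depth": tree.DecisionTreeClassifier(criterion="entropy", random_state=1234),
    "gini, no max depth": tree.DecisionTreeClassifier(criterion="gini", random_state=1234),
    "entropy, max depth 3": tree.DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1234),
    "gini, max depth 3": tree.DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=1234),
}

# Mean accuracy over 5 folds for each configuration
mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=5, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f}")
```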

## Now it is time to count how many loyal customers are going to buy the jumper
1. Let's calculate from the original dataset how many loyal customers explicitly said that they will purchase the limited-edition jumper

In [None]:
data["Decision"].value_counts()

2. Let's calculate the number of people who, according to the model, will be willing to purchase the jumper <br>
a. Subset the Predict dataset into `new_X` considering all the variables except `Decision` <br>
b. Use that dataset to predict a new variable called `potential_buyers`

In [None]:
# Feature selection 
feature_cols = ["age", "gender", "No_jumpers_per_year", "spent_today", "spent_month",
       "salary", "Distance", "Online"]
new_X = Predict[feature_cols]

In [None]:
# One-hot encoding all features in the prediction set (new_X)
new_X = pd.get_dummies(new_X)
potential_buyers = clf_gini.predict(new_X)

In [None]:
np.unique(potential_buyers, return_counts=True)

The total number of potential buyers is 302 + 177 = 479

In [None]:
print("The total number of interviewed people was", data.salary.count())

In [None]:
# Let's calculate the proportion of buyers
buyer=479/701

In [None]:
print(f"Only {round(buyer * 100, 2)}% of people want to buy the limited-edition jumper")

## Conclusion
As the number of people who would buy is less than 70%, the product will not be launched.
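
The arithmetic behind this conclusion can be written out as a short check. The figures 302, 177, 701 and the 70% threshold are taken from the text above; the variable names are assumptions for illustration.

```python
# The two buyer counts and the survey size quoted in this section
buyer_counts = [302, 177]
interviewed = 701
launch_threshold = 0.70  # launch only if MORE than 70% are likely buyers

total_buyers = sum(buyer_counts)
share = total_buyers / interviewed
launch = share > launch_threshold

print(f"{total_buyers} of {interviewed} customers ({share:.2%})")
print("Launch the jumper" if launch else "Do not launch the jumper")
```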

In [None]:
# Use a random forest to find the best-fit model
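
As a starting point for this last exercise, here is a minimal random forest sketch. It runs on synthetic data from `make_classification` rather than the survey data, and the names `X_syn`, `y_syn` and `forest` as well as the choice of 100 trees are assumptions, not settings from this notebook.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded survey features and the Decision column
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=123
)

# A forest averages the votes of many decision trees
forest = RandomForestClassifier(n_estimators=100, random_state=1234)
forest.fit(X_tr, y_tr)

forest_accuracy = metrics.accuracy_score(y_te, forest.predict(X_te))
print("Accuracy:", forest_accuracy)
```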