# Day 6 Lab, IS 4487

The purpose of this lab is to prepare you to complete today's project quiz. Here are the questions you need to be able to answer.

- Understand model accuracy, including why it is a performance metric for classification and not regression.
- Calculate accuracy for a simple majority class model (this is the same as calculating the proportion of the majority class in a binary variable).
- Fit a tree model of the target with just one predictor variable.
- Calculate accuracy for the tree model.
- Explain how the classification tree algorithm chooses which variable to split on and where to split.

Additionally, we will talk about cross-validation and overfitting.

## Load Libraries



In [None]:
import pandas as pd

# Import packages needed for the classification tree
from sklearn.tree import plot_tree
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier, export_graphviz # Import Decision Tree Classifier


## Get Data


In [None]:
mtc = pd.read_csv("https://raw.githubusercontent.com/jefftwebb/is_4487_base/dd870389117d5b24eee7417d5378d80496555130/Labs/DataSets/megatelco_leave_survey.csv")

## Clean the data

Perform the cleaning from the previous labs:
- Remove negative values of income and house
- Remove absurdly large value of handset_price
- Remove NAs
- Change `college` to `yes` and `no`
- Make string variables into categorical variables. We've been using `astype("category")` to do that, but there is a better way when we also need to define an order among multiple levels.  This is important for plotting and for accurate modelling.

Start by making `mtc_clean` an explicit copy.  This will avoid the warning you've been getting: "A value is trying to be set on a copy of a slice from a DataFrame."


In [None]:
# Make copy
mtc_clean = mtc.copy()

In [None]:
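# filter rows of the copy (not the original mtc); .copy() keeps the result an independent DataFrame
mtc_clean = mtc_clean[(mtc_clean['house'] > 0) & (mtc_clean['income'] > 0) & (mtc_clean['handset_price'] < 1000)].copy()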


In [None]:
# remove NAs
mtc_clean = mtc_clean.dropna()

In [None]:
# Recode college
mtc_clean['college'] = mtc_clean['college'].replace({'one': 'yes', 'zero': 'no'})

We'll use the `pd.Categorical()` function to simultaneously make the string variables categorical and set the levels.  The syntax is:

`data['column'] = pd.Categorical(data['column'], categories = list, ordered = True)`

where "list" is the list of levels in order, such as:  `['small', 'medium', 'large']`. Without explicitly setting this order the default would be alphabetic:  large, medium, small.

Here's an example:

In [None]:
# Convert string to categorical variable
mtc_clean['leave'] = pd.Categorical(mtc_clean['leave'],
                                    categories = ['LEAVE', 'STAY'],
                                    ordered = True)


In [None]:
mtc_clean['leave'].dtype

Looks good.  

Go ahead and do a similar transformation on the remainder of categorical variables:

- `reported_satisfaction`
- `reported_usage_level`
- `considering_change_of_plan`
- `college`

In [None]:
# Your code goes here

## Calculate the proportion of the majority class  
What is the proportion of customers who churned? Note that `len(data)` (where "data" is your data frame) returns a count of the number of observations.


In [None]:
# Your code goes here


Why should we care?

The majority class in the target variable will serve as an important benchmark for model performance. Predicting the majority class is the simplest possible classifier. We'll call it the "majority class classifier." It represents the best predictive guess you can make, in the absence of other information.  The accuracy of the majority class classifier is simply the proportion of the majority class in the data.

Why is this?

*Accuracy is defined as the proportion of correctly predicted labels. It is a commonly used performance metric for evaluating classifiers.*

**Whatever later model we develop should have better accuracy than this performance benchmark.**
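To see why the benchmark accuracy equals the majority class proportion, here is a minimal sketch on a made-up five-row target (the `toy_target` values are hypothetical and are not the MegaTelCo data). It predicts the majority class for every row and compares the resulting accuracy to the class proportion.

```python
import pandas as pd

# Hypothetical toy target, NOT the MegaTelCo data
toy_target = pd.Series(['STAY', 'LEAVE', 'STAY', 'STAY', 'LEAVE'])

# Proportion of each class in the target
class_props = toy_target.value_counts(normalize=True)
majority_class = class_props.idxmax()   # most frequent label
majority_prop = class_props.max()       # its share of the rows

# Majority class classifier: predict the same label for every row
benchmark_pred = pd.Series([majority_class] * len(toy_target))

# Accuracy = proportion of predictions that match the actual labels
benchmark_accuracy = (benchmark_pred == toy_target).mean()

print(majority_class, majority_prop, benchmark_accuracy)
```

Because every prediction is the majority label, the predictions are correct exactly on the majority class rows, so the two numbers agree.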

## Fit a basic tree model

Use just one variable, `income`. This is a very simple tree we'll call the "money tree."



In [None]:
# Step 1:  Initialize model, specifying
# 1. split criterion is entropy
# 2. max_depth = 2

money_tree = DecisionTreeClassifier(criterion = "entropy", max_depth = 2)

Explanation of code:

- `DecisionTreeClassifier()`: Creates an instance of the decision tree classifier from scikit-learn
- `criterion="entropy"`: Specifies the function used to measure the quality of a split. Entropy measures the impurity (how mixed the classes are) of a node; the tree picks the split that reduces entropy the most
- `max_depth=2`: Limits the tree to a maximum depth of 2 levels, which controls complexity and helps prevent overfitting
- Output: Returns a configured decision tree classifier object, ready to be fitted with data
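The idea behind the entropy criterion can be sketched by hand. The helper functions `entropy()` and `information_gain()` below are hypothetical illustrations written for this lab, not scikit-learn functions, and the labels are made up. Roughly speaking, the tree evaluates candidate thresholds for each predictor and keeps the split with the largest drop in impurity.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two child nodes."""
    n = len(parent)
    weighted_child = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child

# Hypothetical parent node: an even mix of the two classes (maximum impurity)
parent = ['LEAVE', 'LEAVE', 'STAY', 'STAY']

# A split that separates the classes perfectly
gain_perfect = information_gain(parent, ['LEAVE', 'LEAVE'], ['STAY', 'STAY'])

# A split that leaves both children as mixed as the parent
gain_useless = information_gain(parent, ['LEAVE', 'STAY'], ['LEAVE', 'STAY'])

print(entropy(parent), gain_perfect, gain_useless)
```

A split with a higher gain leaves purer child nodes, which is why the first split would be preferred over the second.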


In [None]:
# Step 2: Fit the decision tree classifier, specifying
# 1. X (the predictor set) as income
# 2. y (the target) as leave

money_tree = money_tree.fit(X = mtc_clean[['income']],
                            y = mtc_clean['leave'])


Explanation of code:

- `money_tree.fit()`: Trains the decision tree classifier on the provided data
- `X = mtc_clean[['income']]`: Input feature (predictor), selecting only the `income` column *as a DataFrame* (note the double square brackets)
- `y = mtc_clean['leave']`: Target variable, the `leave` column containing the class labels
- Output: Returns the fitted decision tree model, now trained on the income data to predict customer churn

Gemini prompt:  "what is the difference between double and single square brackets in Pandas for slicing?"
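As a quick illustration of the bracket question, here is a small sketch on a made-up data frame (the `toy` values are hypothetical). Single brackets return a one-dimensional Series, while double brackets return a two-dimensional DataFrame, which is the shape scikit-learn expects for `X`.

```python
import pandas as pd

# Hypothetical toy data, NOT the MegaTelCo data
toy = pd.DataFrame({'income': [30000, 85000, 120000],
                    'house': [200000, 450000, 610000]})

single = toy['income']     # single brackets: a Series (1-D)
double = toy[['income']]   # double brackets: a DataFrame with one column (2-D)

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```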

In [None]:
# Step 3: Visualize the money tree model
plot_tree(money_tree,
          feature_names=['income'],         # flat list with one name per predictor
          class_names=['LEAVE', 'STAY'],    # same order as money_tree.classes_
          filled=True)


Explanation of code:

- `plot_tree()`: Function from scikit-learn to visualize the decision tree
- `money_tree`: The fitted decision tree model to be visualized
- `feature_names=['income']`: Labels the feature as `income` in the tree diagram (a flat list of names, one per predictor)
- `class_names=['LEAVE', 'STAY']`: Specifies the names for the target classes in the visualization. The names must be listed in the same sorted order as `money_tree.classes_`, otherwise the node labels are swapped
- `filled=True`: Colorizes the nodes based on the majority class at each node


This plot is a bit confusing!  Here is an interpretive guide.

1. **Root Node (Top)**:
   - Split: income <= 99993.0
   - Samples: 4994
   - Initial prediction: LEAVE

2. **Second Level**:
   - Left Branch (income <= 99993.0), next split: income <= 20181.0
     - Samples: 3303
     - Prediction: LEAVE
   - Right Branch (income > 99993.0), next split: income <= 159576.0
     - Samples: 1691
     - Prediction: STAY

3. **Third Level (Leaf Nodes)**:
   - Far Left (income <= 20181.0):
     - Samples: 8
     - Prediction: LEAVE (high certainty, entropy = 0.0)
   - Middle Left (20181.0 < income <= 99993.0):
     - Samples: 3295
     - Prediction: LEAVE (with uncertainty, entropy = 0.991)
   - Middle Right (99993.0 < income <= 159576.0):
     - Samples: 1680
     - Prediction: STAY (with uncertainty, entropy = 0.977)
   - Far Right (income > 159576.0):
     - Samples: 11
     - Prediction: STAY (high certainty, entropy = 0.0)



## Check Accuracy

What is the accuracy of the money_tree?


In [None]:
# 1. Generate predictions from the model for the training data
pred = money_tree.predict(X = mtc_clean[['income']])


Explanation of code:

- `money_tree.predict()`: Method to make predictions using the trained decision tree model
- `X = mtc_clean[['income']]`: Input data for prediction, using only the `income` column from the DataFrame
- Output: Returns an array of predicted class labels ('STAY' or 'LEAVE') for each row in the input data
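Accuracy is just the share of predictions that match the actual labels. The sketch below shows the arithmetic on two made-up arrays (`toy_actual` and `toy_predicted` are hypothetical, not model output from this lab), and checks it against scikit-learn's `accuracy_score()`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical actual and predicted labels, NOT from the MegaTelCo model
toy_actual = np.array(['STAY', 'LEAVE', 'STAY', 'LEAVE', 'STAY'])
toy_predicted = np.array(['STAY', 'STAY', 'STAY', 'LEAVE', 'LEAVE'])

# Element-wise comparison gives True/False; the mean of booleans is the proportion of True
manual_accuracy = (toy_predicted == toy_actual).mean()

# The same calculation using scikit-learn
sklearn_accuracy = accuracy_score(toy_actual, toy_predicted)

print(manual_accuracy, sklearn_accuracy)
```

Three of the five predictions match, so both calculations give 0.6.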


In [None]:
# 2. Calculate accuracy as the proportion of correct predictions

# Your code goes here


So, this is better than the accuracy of the majority class classifier, which was our benchmark.  Success!

Would a more complicated model have better performance measured in terms of accuracy?

## Overfitting

Refit the tree, only this time leave out the `max_depth` argument.  This will allow the tree to fit as complicated a model as possible.

In [None]:
# Your code goes here


Whoa!

What is the accuracy of this model?

In [None]:
# Your code goes here

Killing it!

## Cross-validation

Or:  maybe not.  Will you get promoted or ... fired?

This model is excessively complicated.  It has figured out how to classify the target perfectly *in the training data*. It has essentially just memorized the training data.  However, when it encounters new data **it will suck**. That's because much of what it memorized is random noise specific to the training sample (this is the nature of sampling); new data will not repeat that noise, and the overfit model will get badly confused.

To demonstrate this we will use cross-validation.

The simplest version of cross-validation uses a training set and a testing or validation set. (This is called the validation set method.) Simply, we divide the data that we have into two parts: 80% goes into the training set and 20% into the testing set. (80/20 is a common choice.) That division is done randomly using the `sample()` function.



In [None]:
# divide mtc_clean into train and test
train = mtc_clean.sample(frac=0.8, random_state=200) # 80% of data for training
test = mtc_clean.drop(train.index) # the remaining 20%


`random_state` is a seed that ensures the same split when set to 200 (an arbitrary choice). This will make our results comparable.
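Here is a small sketch of the same split logic on a made-up ten-row data frame (`toy` is hypothetical). It checks that the two parts do not overlap and that the same seed reproduces the same split. Note that `drop(train.index)` relies on the index labels being unique, which is assumed here.

```python
import pandas as pd

# Hypothetical toy data with a default (unique) index
toy = pd.DataFrame({'income': range(10)})

toy_train = toy.sample(frac=0.8, random_state=200)   # 8 randomly chosen rows
toy_test = toy.drop(toy_train.index)                 # the 2 rows not chosen

# Using the same seed again reproduces the same rows
toy_train_again = toy.sample(frac=0.8, random_state=200)

print(len(toy_train), len(toy_test))
print(toy_train.index.equals(toy_train_again.index))
```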

In [None]:
train.head()

In [None]:
test.head()

The cross-validation procedure is to create the model using the train data and then evaluate it--that is, get predictions--using the test data. Accuracy of the model is therefore calculated using the test data. This way, we get a realistic picture of how the model will perform in the wild, with new data.

Make sure to leave out the `max_depth` argument.

In [None]:
# Initialize the classifier.  Leave out max_depth
# Your code goes here


In [None]:
# Fit the model using train
# Your code goes here


In [None]:
# Predict using test
# Your code goes here


In [None]:
# Calculate model accuracy on the test set
# Your code goes here


Better or worse than the simpler model?  

Let's explore this further.  More complicated models are good--up to a point.  Choose a `max_depth` argument greater than 2 and see if you can improve on the simple model's accuracy without getting too complex and overfitting.
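To illustrate the trade-off without giving away the answer for the MegaTelCo data, here is a sketch on synthetic data. Everything in it is an assumption made for illustration: the income range, the made-up rule that labels depend on a 100,000 threshold, and the 30% of labels flipped at random to act as noise. It fits trees of several depths and reports train and test accuracy for each.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical): one real pattern plus random label noise
rng = np.random.default_rng(0)
n = 5000
income = rng.uniform(20_000, 160_000, size=n)
base = np.where(income > 100_000, 'LEAVE', 'STAY')   # made-up underlying rule
flip = rng.random(n) < 0.3                           # 30% of labels are flipped
label = np.where(flip, np.where(base == 'LEAVE', 'STAY', 'LEAVE'), base)
toy = pd.DataFrame({'income': income, 'leave': label})

# Same 80/20 split approach as above
toy_train = toy.sample(frac=0.8, random_state=200)
toy_test = toy.drop(toy_train.index)

results = {}
for depth in [1, 2, 4, 8, None]:   # None means no depth limit
    model = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    model.fit(toy_train[['income']], toy_train['leave'])
    results[depth] = {
        'train': (model.predict(toy_train[['income']]) == toy_train['leave']).mean(),
        'test': (model.predict(toy_test[['income']]) == toy_test['leave']).mean(),
    }

for depth, acc in results.items():
    print(depth, round(acc['train'], 3), round(acc['test'], 3))
```

On data like this, the unrestricted tree memorizes the noisy training labels, so its training accuracy is far higher than its test accuracy, while shallow trees show a much smaller gap.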

In [None]:
# Initialize the classifier--choose a max_depth > 2
# Your code goes here


In [None]:
# Fit the model using train
# Your code goes here


In [None]:
# Predict using test
# Your code goes here

In [None]:
# Calculate model accuracy on the test set
# Your code goes here


## Additional predictors

Pick one additional variable to use as a predictor--one that you think is a driver of churn at MegaTelCo, based on your EDA.

Refit the model using your best performing `max_depth` setting, with `income` and your chosen second predictor.

How does this model perform on the test set?

Adding a predictor makes for a more complicated model.  But complicated is good -- as long as it does not tip over into overfitting.
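One practical point if your second predictor is categorical: scikit-learn trees need numeric inputs, so string or categorical columns have to be converted first. The sketch below shows one possible approach for an ordered categorical, using its integer codes. The data frame, the `satisfaction` column, and its levels are all hypothetical stand-ins, not the actual MegaTelCo columns or levels.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, NOT the MegaTelCo data
toy = pd.DataFrame({
    'income': [30000, 45000, 90000, 120000, 150000, 60000],
    'satisfaction': ['low', 'high', 'medium', 'low', 'low', 'high'],
    'leave': ['STAY', 'STAY', 'STAY', 'LEAVE', 'LEAVE', 'STAY'],
})

# Ordered categorical, as in the cleaning section
toy['satisfaction'] = pd.Categorical(toy['satisfaction'],
                                     categories=['low', 'medium', 'high'],
                                     ordered=True)

# Integer codes follow the category order: low = 0, medium = 1, high = 2
toy['satisfaction_code'] = toy['satisfaction'].cat.codes

# Fit a tree with two predictors (X is a DataFrame with two columns)
two_var_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
two_var_tree.fit(toy[['income', 'satisfaction_code']], toy['leave'])

two_var_pred = two_var_tree.predict(toy[['income', 'satisfaction_code']])
print(two_var_pred)
```

Integer codes only make sense when the levels have a natural order; for unordered categories, indicator (dummy) columns are a common alternative.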

In [None]:
# Initialize the classifier--use your best performing max_depth
# Your code goes here


In [None]:
# Fit the model using train
# Your code goes here


In [None]:
# Predict using test
# Your code goes here

In [None]:
# Calculate model accuracy on the test set
# Your code goes here
