In [None]:
#Run the following code to print multiple outputs from a cell
get_ipython().ast_node_interactivity = 'all'

# Predictive Modeling

## Importing & Profiling the Data
The file we'll be using today is "creditCardDefaultReduced.csv". It contains information about Taiwanese credit card customers and includes the following fields:

* Payment: Paid or Missed
* Card: Type of card
* Education: Education level
* Marriage: Marital status
* Age: Age of customer
* Bill_Amt1: Amount of last bill
* Pay_Amt1: Amount of last payment
* Pay_0: Number of months delayed
* Limit_Bal: Balance of the customer's credit card limit

Import the file, "creditCardDefaultReduced.csv", and save it in a variable called `df`. Then do some initial profiling of the data.

In [None]:
import pandas as pd
df = pd.read_csv("creditCardDefaultReduced.csv")
df

In [None]:
df.dtypes
df.isna().sum()

In [None]:
df.describe()

In [None]:
pd.plotting.scatter_matrix(df)

In [None]:
df["Payment"].value_counts(normalize = True)
df["Education"].value_counts(normalize = True)
df["Marriage"].value_counts(normalize = True)

We want to explain the "Payment" column, so let's try to make a histogram to see if it's normally distributed:

In [None]:
df["Payment"].hist()

This is not continuous data, so we'll use a classification model to predict whether someone will make or miss their payment.

## Outcome and Features

We need to create two inputs for machine learning models:
* Outcome: A single column of data that the model will try to explain
* Features: One or more columns of data that explain the outcome

We'll experiment with different models today, but all require an outcome and set of features.

### Outcome

Outcome is relatively simple since it's a single column:

In [None]:
outcome = df["Payment"]
outcome

### Features

We'll handle 2 types of data differently -- numeric and factors (categorical variables).

#### Numeric Features
First, the data has numeric variables like `Limit_Bal` and `Age`:

In [None]:
numericFeatures = df[["Limit_Bal", "Bill_Amt1", "Pay_Amt1", "Age"]]
numericFeatures

#### Categorical Features
Second, the data has categorical columns.
* Despite all the discussion about new kinds of data (image, audio, video, etc.), models only work with numbers, so the data needs to be numeric
* But we really, really, **really** want to use non-numeric data
* Some examples:
  - Colors? Red, green, blue
  - Day of the week? Sunday, Monday, Tuesday, ..., Saturday
  - School? Boston College, Notre Dame, Boston University
  - ...and many others
* What do we do?

Some data, like `Marriage`, is a set of discrete values:

In [None]:
df["Marriage"].value_counts()

`get_dummies` converts this data to multiple columns of true (1) and false (0) values:

In [None]:
dummiesMarriage = pd.get_dummies(df["Marriage"], prefix = "Marriage", drop_first = True)
dummiesMarriage

Now you try: create dummy variables for the `Card` and `Pay_0` variables, saving them as `dummiesCard` and `dummiesPay_0`:
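One possible answer, sketched on a tiny made-up frame. The `toy` data and its category values are hypothetical stand-ins; in the notebook you would pass `df["Card"]` and `df["Pay_0"]` instead. It follows the same pattern as the `Marriage` example, and treats `Pay_0` as a category even though it is stored as a number.

```python
import pandas as pd

# Hypothetical stand-in for df; the real values come from the CSV file
toy = pd.DataFrame({
    "Card": ["Gold", "Silver", "Gold", "Platinum"],
    "Pay_0": [0, 2, 1, 0],
})

# Same pattern as the Marriage example: one true/false column per category,
# with the first category dropped because it is implied by the others
dummiesCard = pd.get_dummies(toy["Card"], prefix="Card", drop_first=True)
dummiesPay_0 = pd.get_dummies(toy["Pay_0"], prefix="Pay_0", drop_first=True)

print(dummiesCard.columns.tolist())
print(dummiesPay_0.columns.tolist())
```

With `drop_first = True`, a row that is false in every remaining column belongs to the dropped category.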

Now we need to combine all our features into a single variable using the `concat` command:

In [None]:
features = pd.concat([numericFeatures, dummiesMarriage, dummiesPay_0, dummiesCard], axis = 1)
features

*The `axis = 1` parameter tells the `concat` function to combine by columns instead of by rows.*
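To see what `axis` changes, here is a minimal sketch with two made-up one-column frames (`left` and `right` are hypothetical names, not part of the credit card data):

```python
import pandas as pd

# Two tiny hypothetical frames with different column names
left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})

# axis = 1 places the frames side by side, matching rows by index
side_by_side = pd.concat([left, right], axis=1)

# axis = 0 stacks the frames on top of each other;
# cells with no matching column are filled with NaN
stacked = pd.concat([left, right], axis=0)

print(side_by_side.shape)
print(stacked.shape)
```

Because `axis = 1` aligns rows by index, the pieces being combined should all share the same index, as they do when they all come from `df`.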

Now, we have 2 variables:
* `outcome` contains the single column of data that we will try to explain
* `features` contains multiple columns of data that we believe explain the `outcome` variable

## Partitioning the Data
When we have sufficient data, we want to use only part of the data to train the model. By keeping part of the data as test data, we can better evaluate how well the model works.

![image.png](attachment:f6e49498-488d-4ef2-95a1-9a27809a0402.png)

Why split the data?

What if you were trying to create a model to separate the dots?

![image.png](attachment:5c2849f1-95e7-428e-8fe6-e68c2757c945.png)

This model would be perfect on red, but not so great on blue:

![image.png](attachment:935a74b9-42e1-4771-ae45-3e7b18f7f677.png)

And this model would be perfect on blue, but not on red:

![image.png](attachment:4237130e-5a81-4515-9e4c-cf8d56644490.png)

This, of course, is the perfect model, but it's perfect only for these specific dots (overfitting):

![image.png](attachment:5d876b0d-ede4-4085-8d9d-864ebe57a042.png)

Risk of overfitting:
* Models can "memorize" the right answer.
* Inference difficult.
* Encourages overconfidence in model.

This is ultimately a better balance between fitting these dots and fitting another set of dots:

![image.png](attachment:2070d869-4654-4a29-9ebf-39fb6e01211c.png)

This balance is what we are trying to strike when we build our models.

We can use the `sklearn.model_selection.train_test_split` function to randomly split the data into subsets. Unusually, it returns four values: the training features, test features, training outcome, and test outcome, in that order. The `test_size` parameter indicates the portion to save for testing.

In [None]:
from sklearn.model_selection import train_test_split
featuresTrain, featuresTest, outcomeTrain, outcomeTest = train_test_split(features, 
                                                                          outcome, 
                                                                          test_size = 0.33, 
                                                                          random_state = 42)
featuresTrain.shape
featuresTest.shape
outcomeTrain.shape
outcomeTest.shape

Notice the parameter `random_state = 42`. If we didn't include this parameter, each of us would get a different split, resulting in different models. Setting the `random_state` ensures that "randomness" works exactly the same for each of us. It also ensures that your model results will be replicable each time you run your code. (FYI, you can set the `random_state` to whatever number you'd like...42 is my favorite number.)
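A small sketch of that reproducibility claim, using made-up arrays (`X` and `y` here are hypothetical, not the credit card data). Two calls with the same `random_state` should produce the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Split twice with the same seed
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=42)

# True when both calls selected exactly the same test rows
same_split = np.array_equal(a_test, b_test)
print(same_split, a_train.shape, a_test.shape)
```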

## Building a Decision Tree

![image.png](attachment:11f8e071-8b01-478d-8ee5-cfe706cd1f9e.png)

* A flowchart-like structure
* Each internal node represents a test on an attribute
* Each branch represents the outcome of the test
* Each leaf node represents a class label
* The paths from root to leaf represent classification rules.

Pros and cons?
* Simple
* White-box
* Sensitive to data ('high variance')
* May require many levels
* Generally low accuracy

Let's build the model. There are generally 4 steps:
```
1. Set up the model
2. Fit the model using the training data
3. Use the fitted model to predict outcomes
4. Assess the fit
```

**1. Set up the model:**

In [None]:
import sklearn.tree
modelTree = sklearn.tree.DecisionTreeClassifier(random_state = 42)

**2. Fit the model using the training data:**

In [None]:
resultTree = modelTree.fit(featuresTrain, outcomeTrain)

To actually see the tree, you can run this code:

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(25,20))
_ = sklearn.tree.plot_tree(resultTree, feature_names = list(features), filled = True,
                           class_names = modelTree.classes_.tolist(), max_depth = 2)

# To save the tree as a png file:
# fig.savefig("decision_tree.png")

You can also get a text printout of the entire tree using this command:

In [None]:
tree_rules = sklearn.tree.export_text(modelTree, feature_names=list(features))
print(tree_rules)

**3. Use the fitted model to predict outcomes:**

In [None]:
predTreeTrain = modelTree.predict(featuresTrain)
predTreeTrain

**4. Assess the fit:**

In [None]:
import sklearn.metrics
sklearn.metrics.r2_score(outcomeTrain, predTreeTrain) # intentionally wrong: R^2 is a regression metric

What happened?

Metrics like $R^2$, mean squared error, mean absolute error, etc., make sense for regression, but not classification.

How can we tell if a model is good or bad?
* Depends on the reference
  - "Better or worse" more than "good or bad"
* Depends on the dependent variable
* Depends on the kind of model
* Depends on the goal

What outcomes are we trying to predict?

In [None]:
outcomeTrain.value_counts()

What did the model predict?

In [None]:
pd.Series(predTreeTrain).value_counts()

So, we know the prediction isn't perfect. But what did it miss? How imperfect?

### Confusion Matrix

Check the confusion matrix to assess the model's goodness of fit:

In [None]:
sklearn.metrics.accuracy_score(outcomeTrain, predTreeTrain)
cmTreeTrain = sklearn.metrics.confusion_matrix(outcomeTrain, predTreeTrain)
sklearn.metrics.ConfusionMatrixDisplay(cmTreeTrain, display_labels = modelTree.classes_).plot()

We can quickly look at many metrics with the `classification_report`:

In [None]:
print(sklearn.metrics.classification_report(outcomeTrain, predTreeTrain))

* Precision for Missed: out of all the people that the model predicted Missed, 99% actually Missed
* Recall for Missed: out of all the people that actually Missed, the model correctly identified 100% of them
* Precision for Paid: out of all the people that the model predicted Paid, 100% actually Paid
* Recall for Paid: out of all the people that actually Paid, the model correctly identified 100% of them
* Support tells us how many of each case are in the data
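The report's numbers come straight from the confusion matrix. The sketch below recomputes precision and recall for the Missed class by hand, using a short made-up list of labels (`y_true` and `y_pred` are hypothetical, not model output), and compares them with scikit-learn's own functions:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical actual and predicted labels
y_true = ["Missed", "Missed", "Missed", "Paid", "Paid", "Paid", "Paid", "Paid"]
y_pred = ["Missed", "Missed", "Paid", "Missed", "Paid", "Paid", "Paid", "Paid"]

# Rows are actual classes, columns are predicted classes, in this label order
cm = confusion_matrix(y_true, y_pred, labels=["Missed", "Paid"])

tp = cm[0, 0]  # actually Missed, predicted Missed
fn = cm[0, 1]  # actually Missed, predicted Paid
fp = cm[1, 0]  # actually Paid, predicted Missed

# Precision: of everything predicted Missed, the share that really Missed
precision_missed = tp / (tp + fp)
# Recall: of everything that really Missed, the share the model caught
recall_missed = tp / (tp + fn)

print(precision_missed, precision_score(y_true, y_pred, pos_label="Missed"))
print(recall_missed, recall_score(y_true, y_pred, pos_label="Missed"))
```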

It's no surprise that the model does well on the training data that was used to build it. 

The real "test" (forgive the pun) is to see how the model fits the test data to assess overfitting:

In [None]:
# First create the predictions on the test data
predTreeTest = modelTree.predict(featuresTest)

# Then assess the fit
cmTreeTest = sklearn.metrics.confusion_matrix(outcomeTest, predTreeTest)
sklearn.metrics.ConfusionMatrixDisplay(cmTreeTest, display_labels = modelTree.classes_).plot()
print(sklearn.metrics.classification_report(outcomeTest, predTreeTest))

How did the model do at predicting new data?
* Bad overfitting
* Decision trees, while nice to look at, are very prone to overfitting
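One way to put a number on overfitting is to compare accuracy on the training data with accuracy on the test data. The sketch below does this on synthetic data from `make_classification` with some label noise. That is an assumption made so the example runs on its own; it is not the credit card file, and the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data; flip_y randomly flips some labels to add noise
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# An unrestricted tree keeps splitting until it fits the training data
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))

# A large gap between the two suggests the tree memorized the training data
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```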

## Random Forests

The steps are the same for random forests as they are for decision trees, with the exception that you use `sklearn.ensemble.RandomForestClassifier` to build the model:

In [None]:
# 1. Set up the model
import sklearn.ensemble
modelForest = sklearn.ensemble.RandomForestClassifier(random_state = 42)

# 2. Fit the model using the training data
resultForest = modelForest.fit(featuresTrain, outcomeTrain)

# 3. Predict outcomes from the training and testing data
predForestTrain = modelForest.predict(featuresTrain)
predForestTest = modelForest.predict(featuresTest)

# 4. Assess the fit
print(sklearn.metrics.classification_report(outcomeTrain, predForestTrain))
cmForestTest = sklearn.metrics.confusion_matrix(outcomeTest, predForestTest)
sklearn.metrics.ConfusionMatrixDisplay(cmForestTest, display_labels = modelForest.classes_).plot()
print(sklearn.metrics.classification_report(outcomeTest, predForestTest))

## Support Vector Machines

Now that you've seen how to build Decision Tree and Random Forest models, see if you can figure out how to build a Support Vector Machine model. You'll need to import `sklearn.svm` and the function is called `SVC()` (which stands for support vector classifier). Once again, use a `random_state` of 42.

Is your model better or worse than the previous 2 models?
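If you get stuck, here is a minimal sketch of the same four steps with `SVC`. It runs on synthetic data from `make_classification` so that it is self-contained. The data and the names `modelSVM`, `predSVMTrain`, and `predSVMTest` are illustrative assumptions; in the notebook you would use `featuresTrain`, `featuresTest`, `outcomeTrain`, and `outcomeTest`.

```python
import sklearn.metrics
import sklearn.svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and outcome
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# 1. Set up the model
modelSVM = sklearn.svm.SVC(random_state=42)

# 2. Fit the model using the training data
modelSVM.fit(X_train, y_train)

# 3. Predict outcomes from the training and testing data
predSVMTrain = modelSVM.predict(X_train)
predSVMTest = modelSVM.predict(X_test)

# 4. Assess the fit
print(sklearn.metrics.classification_report(y_test, predSVMTest))
```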

## Scaling Your Features

You likely got warning messages when running your Support Vector Machine model because the model had difficulty predicting a missed payment. Part of the reason is that the features are on very different scales: for example, `Limit_Bal` ranges from 10,000 to 1,000,000, whereas `Age` ranges from 21 to 79.

One way to fix this is to normalize your features so that they are all on the same scale. A very common way to do this is to convert all features to values between 0 and 1 with the following formula:

$$
X_{\text{scaled}} = \frac{X - X_{\text{min}}}{X_{\text{max}} - X_{\text{min}}}
$$

We can use the `MinMaxScaler()` in `sklearn.preprocessing` to do this:

In [None]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
featuresTrain_norm = scaler.fit_transform(featuresTrain) #you fit to the training features
featuresTest_norm = scaler.transform(featuresTest)       #you only transform the test features
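To connect the formula to the scaler, the sketch below applies the min-max formula by hand and compares the result with `MinMaxScaler`. The small `train` and `test` arrays are made up for illustration. Note that the minimum and maximum come from the training data only, so test values outside the training range can land outside 0 to 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical rows of [Limit_Bal, Age]
train = np.array([[10000.0, 21.0], [500000.0, 50.0], [1000000.0, 79.0]])
test = np.array([[250000.0, 30.0], [1200000.0, 85.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max from the training rows
test_scaled = scaler.transform(test)        # reuses the training min and max

# The same formula applied by hand, column by column
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual_train = (train - col_min) / (col_max - col_min)
manual_test = (test - col_min) / (col_max - col_min)

print(np.allclose(train_scaled, manual_train))
print(np.allclose(test_scaled, manual_test))
print(test_scaled.max())  # above 1 because this test row exceeds the training maximum
```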

Now, we can rerun the model using `featuresTrain_norm` instead of `featuresTrain` and we can get predictions using the normalized training and test data. 

Copy/paste your SVM model from above and edit it to use the normalized features. Did you get better results?

Note: Decision trees and random forests split on one feature at a time, comparing it to a threshold rather than to other features, so rescaling generally does not change their splits. It is generally preferred to leave their features unscaled, in their original state.
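A quick way to check this is to fit the same tree on raw and min-max scaled features and compare the predictions. The sketch below uses synthetic data from `make_classification`, with one column inflated to mimic a large-scale feature like `Limit_Bal`. The data and the names `treeRaw` and `treeNorm` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, with one feature put on a much larger scale
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X[:, 0] *= 100000
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Fit the scaler on the training features, then transform both sets
scaler = MinMaxScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)

# Same model settings, trained once on raw and once on scaled features
treeRaw = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
treeNorm = DecisionTreeClassifier(random_state=42).fit(X_train_norm, y_train)

# Share of test rows where the two trees make the same prediction
agreement = (treeRaw.predict(X_test) == treeNorm.predict(X_test_norm)).mean()
print(f"Share of identical test predictions: {agreement:.2f}")
```

Expect the two trees to agree on nearly every row, since min-max scaling preserves the ordering of values within each feature.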

## Try Another...

Instead of repeating the class exercise, use the 3 models we covered in class to try to classify the Species of flower in the data set, "Iris.csv":