<a href="https://colab.research.google.com/github/connordouglas10/DS-Tech-2024spring-ADMIN/blob/main/Module3_Fitting_CrossVal/Fitting_and_overfitting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#If opening in colab run this cell
!git clone https://github.com/connordouglas10/DS-Tech-2024spring-ADMIN.git
%cd DS-Tech-2024spring-ADMIN/Module3_Fitting_CrossVal/

# Fitting models and overfitting


2024 Spring - Instructors: Foster Provost and Connor Douglas

Teaching Assistant: Connor Douglas
***

## Packages

In [None]:
# Import the libraries we will be using

import os
import numpy as np
import pandas as pd
import math
import matplotlib.pylab as plt
import seaborn as sns

%matplotlib inline
sns.set(style='ticks', palette='Set2')

# some custom libraries!
import sys
sys.path.append("..")
from ds_utils.decision_surface import *

Notice that we're importing library code that we've developed just for this class. In the future, new common code will continue to be added to the `ds_utils` folder. 

## Motivational example

To look more carefully at predictive modeling, imagine "our data" are some noisy observations from a nonlinear function. We're going to approximate that function by fitting a polynomial to the observations. 

In [None]:
#Create the data and show them
#In order that we can plot the data points and the function, we will just have one feature (x1)
num_samples = 50
# Set randomness so that we all get the same answer
np.random.seed(42)

def true_function(X):
    return np.sin(1.5 * np.pi * X)

def plot_example(X, Y, functions):
    # Get some X values to plot the functions
    X_test = pd.DataFrame(np.linspace(0, 1, 100), columns=['x1'])
    # Plot data and true function
    for key in functions:
        plt.plot(X_test, functions[key](X_test), label=key)
    plt.scatter(X, Y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")

# Add X in the range of [0, 1]
X = pd.DataFrame(np.sort(np.random.rand(num_samples)), columns=['x1'])
# Add some random noise to the observations
Y = true_function(X.x1) + np.random.randn(num_samples) * 0.5
# Plot stuff
functions = {"True function": true_function}
plot_example(X, Y, functions)
plt.show()

Let's assume that we don't know the true function.  We choose to model our noisy observations using linear regression.  (Recall that we had a sneak peek at building linear regression models in Python at the end of the Class #1 notebook; compare with the fitting of models for binary target variables from last class.)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit linear model
model = LinearRegression()
model.fit(X, Y)
# Evaluate model with mean squared error; just as an example!
mse = mean_squared_error(Y, model.predict(X))
# Plot results
functions["Model"] = model.predict
plot_example(X, Y, functions)
#Note how you can customize your plots
plt.title("Linear Model\n MSE: %.2f" % mse)
plt.show()

Does the linear regression fit our data well? 
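
The MSE shown in the plot title is simply the average of the squared residuals. As a minimal sketch, here is the same computation by hand; the arrays `y_true` and `y_pred` are made-up numbers for illustration, not the notebook's data.

```python
import numpy as np

# Hypothetical observations and predictions, for illustration only
y_true = np.array([1.0, 0.0, -1.0, 0.5])
y_pred = np.array([0.5, 0.0, -0.5, 1.5])

# Mean squared error: average of the squared residuals
residuals = y_true - y_pred
mse_by_hand = np.mean(residuals ** 2)
print(mse_by_hand)  # 0.375
```

A lower MSE on the training data means the curve passes closer to the sampled points; by itself it says nothing about how close the curve is to the true function.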

Rather than trying a linear regression, let's make our functional form more complex.  We will fit polynomial regressions. How do different degree polynomials fit the data? Recall that a polynomial on a single variable looks like:

$$ a_1 + a_2 x + a_3 x^2 + ... $$
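
To make the formula concrete, the sketch below builds the powers of $x$ by hand and estimates the coefficients with ordinary least squares. It uses NumPy only; the sample points and the helper name `design_matrix` are assumptions made for illustration and are not part of the class code.

```python
import numpy as np

def design_matrix(x, degree):
    # Columns are x^0, x^1, ..., x^degree (hypothetical helper)
    return np.vander(x, N=degree + 1, increasing=True)

# Points generated exactly from 1 + 2x + 3x^2 (no noise)
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = 1 + 2 * x + 3 * x ** 2

# The least-squares fit recovers the coefficients a_1, a_2, a_3
coefs, *_ = np.linalg.lstsq(design_matrix(x, degree=2), y, rcond=None)
print(np.round(coefs, 6))
```

The model is still linear in the coefficients; only the features are non-linear in $x$. That is why a linear regression learner can fit it.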

In [None]:
#Create function to fit different degree polynomials
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

def fit_polynomial(X, Y, degree):
    # create different powers of X
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features), ("linear_regression", linear_regression)])
    pipeline.fit(X, Y)
    return pipeline

In [None]:
#Now let's see what we have learned

def plot_poly(X, Y, degree):
    # Fit polynomial model
    model = fit_polynomial(X, Y, degree)
    # Evaluate model
    mse = mean_squared_error(Y, model.predict(X))
    # Plot results
    functions["Model"] = model.predict
    plt.title("Degree %d\n MSE: %.2f" % (degree, mse))
    plot_example(X, Y, functions)
    
plot_poly(X, Y, degree=2)
plt.show()

This seems to fit our data better than the purely linear model. What if we use polynomials with higher degrees?

(Remember -- the learning algorithm never sees the green line, the true function!)

In [None]:
plt.figure(figsize=(14, 5))
# degrees of the polynomial
degrees = [1, 2, 3]
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    plot_poly(X, Y, degrees[i])
plt.show()

What do you see there as the effect of allowing more complexity in the modeling process? Take a look at what happens when we use a regression tree on data generated from the *true function*.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

Expected_Y = true_function(X.x1)
plt.figure(figsize=(14, 5))
# Fit Regression Trees
depths = [1]
#depths = [1,2]
for i, depth in enumerate(depths):
    ax = plt.subplot(1, len(depths), i + 1)
    plt.setp(ax, xticks=(), yticks=())    
    model = DecisionTreeRegressor(max_depth=depth)
    model.fit(X, Expected_Y)
    functions = {"True function": true_function, "Tree (depth {})".format(depth): model.predict}
    plot_example(X, Expected_Y, functions)
plt.show()

## Predicting wine quality

_"All wines should be tasted; some should only be sipped, but with others, drink the whole bottle."_ - Paulo Coelho, Brida

We will use a data set related to the red variant of the Portuguese "Vinho Verde" wine. We will predict the "sensory" output based on physicochemical inputs.  (Here there is no data about grape types, wine brand, wine selling price, etc.). Our goal is to use machine learning to detect above-average wines (perhaps to send these wines later to professional tasters?).

Let's start by loading the data.

In [None]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
wine_df = pd.read_csv(url, delimiter=";").dropna()
wine_df.head(15)


In [None]:
wine_df.describe()

In [None]:
# Now, let's change the label to reflect our decision problem, namely, to identify above-average wines.
avg_quality = wine_df.quality.mean()
wine_df["is_good"] = wine_df.quality > avg_quality
#Note above the "Pandas" way of doing things: process all the instances simultaneously
#   computing the mean in one swoop; assigning the new column to the instances all at once.

#Now we will get rid of the old feature, quality.
#  Ask yourself: what would have happened if we had used quality in predicting the new target?
#    (Hint: leakage!  But make sure you understand exactly why.)
wine_df = wine_df.drop("quality", axis="columns")
# Replace white spaces with underscores in column names
wine_df.columns = [c.replace(' ', '_') for c in wine_df.columns]
wine_df.head(5)

In [None]:
#Now let's set ourselves up for predictive modeling
# Get column names and predictor columns
column_names = wine_df.columns
predictor_columns = column_names[:-1]

Let's see if any of the features seem to be very predictive by themselves.

In [None]:
rows = 4
cols = 3
fig, axs = plt.subplots(ncols=cols, nrows=rows, figsize=(5*cols, 6*rows))
axs = axs.flatten()
for i in range(len(predictor_columns)):
    wine_df.boxplot(predictor_columns[i], by="is_good", grid=False, ax=axs[i], sym='k.')
plt.tight_layout()

There's no single feature that can separate the data perfectly. Alcohol and total sulfur dioxide look somewhat predictive though. 

## Tree-structured models
Let's now re-explore the modeling technique we introduced last class -- tree-structured models.  And in particular, classification trees, since our target is to predict (the probability of) whether the wine is good or not -- binary classification (class probability estimation).

For illustration, we will increase the complexity of the tree using the maximum depth allowed. (Note that using max_depth is for illustration -- I recommend using the minimum number of instances at a leaf or at a split in practice; we can talk about that.)
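
As a sketch of the alternative mentioned above, scikit-learn's `DecisionTreeClassifier` accepts `min_samples_leaf` (and `min_samples_split`) to limit complexity. The synthetic data and the threshold of 20 instances per leaf are arbitrary choices for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 2)
# Noisy labels: mostly determined by the first feature
y_demo = (X_demo[:, 0] + 0.3 * rng.randn(200)) > 0.5

# Unrestricted tree vs. a tree requiring at least 20 instances per leaf
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
small_tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X_demo, y_demo)

print("leaves (unrestricted):", full_tree.get_n_leaves())
print("leaves (min_samples_leaf=20):", small_tree.get_n_leaves())
```

With 200 instances and at least 20 per leaf, the restricted tree can have at most 10 leaves, no matter how deep it is allowed to grow.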

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
X = wine_df[predictor_columns]
Y = wine_df.is_good

def training_accy(X, y, model):  #to add accuracy metric to plots - as an example!
    y_hat = model.fit(X, y).predict(X)
    return accuracy_score(y, [1 if ty >= 0.5 else 0 for ty in y_hat])

def plot_trees(X, Y, col1, col2, depths, show_probs=False, show_acc=False):
    ncol = 3
    nrows = int(np.ceil(len(depths) / ncol))
    plt.figure(figsize=[15, 7*nrows])

    for i in range(len(depths)):
        depth = depths[i]
        # Plot
        plt.subplot(nrows, ncol, 1+i)
        model = DecisionTreeClassifier(max_depth=depth, criterion="entropy")
        Decision_Surface(X, col1, col2, Y, model, sample=0.1, gridsize=100, probabilities=show_probs)
        if show_acc:
            # Training accuracy of a tree fit on the same two plotted features
            acc = training_accy(X[[col1, col2]], Y, model)
            plt.title(f"Decision Tree Classifier (max depth={depth}, acc = {acc:.2f})")
        else:
            plt.title(f"Decision Tree Classifier (max depth={depth})")
        
    plt.tight_layout()
    plt.show()
    
plot_trees(X, Y, "alcohol", "total_sulfur_dioxide", depths=[1,2,3], show_probs=False, show_acc=False)

### Trees can represent any function of the input to arbitrary precision

If you experiment with the tree depth, you will see that you can fit the data better and better. Deeper trees chop the instance space into smaller and smaller pieces.  Check it out below with the `depths` variable. (Will this finer and finer segmentation go on forever?)

**Thought exercise**: Tree learning can fit the data up to the theoretical limit of accuracy.  Is this 100%?  If not, why not?

**Extra:** Can you visualize the actual tree-structured model?  Hint: there's a function to do it in last week's notebook.  [Caveat: Visualizing huge trees isn't so effective.]

In [None]:
plot_trees(X, Y, "alcohol", "total_sulfur_dioxide", depths=[1,2,3,4,5,6,10,20,30])

## Linear discriminant models

Chapter 4 introduces linear models.  We've built one already -- a linear regression.  Let's try building a linear model for this prediction problem. 

Looking at the data (see scatterplots above), can you estimate by eye where a good **linear discriminant** would be?  (What's a linear discriminant again?)

If you remember, *linear regression* looks like this:

$$ y = b + a_1 x_1 + a_2 x_2 + a_3 x_3 + ... $$

If you are estimating the probability of one of two different classes, traditional linear regression won't work as well as some people hope. Probabilities need to be bounded between zero and one. To solve this problem, one of the most common machine learning tools is **logistic regression**.  Chapter 4 describes it. You can also find logistic regression modeling in the sklearn package.
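
Logistic regression keeps the linear score $b + a_1 x_1 + \dots$ but passes it through the logistic (sigmoid) function, which maps any real number into the interval between zero and one. A minimal NumPy sketch follows; the intercept `b`, coefficient `a1`, and inputs `x1` are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficient, chosen only for illustration
b, a1 = -10.0, 1.0
x1 = np.array([5.0, 10.0, 15.0])

linear_score = b + a1 * x1           # unbounded: can fall outside [0, 1]
probability = sigmoid(linear_score)  # always strictly between 0 and 1

print(linear_score)
print(probability)
```

Here the raw linear scores are -5, 0, and 5, which are not valid probabilities, while the transformed values stay between zero and one, and a score of 0 maps to exactly 0.5.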

Let's plot both linear regression and logistic regression together to compare them.

In [None]:
from sklearn.linear_model import LogisticRegression

def plot_linear(X, Y, col, model_type, ymin=-0.1, ymax=1.1, sample=1):
    if model_type == "Linear Regression":
        model = LinearRegression()
        predict_fn = model.predict
    else:
        model = LogisticRegression()
        predict_fn = lambda obs: model.predict_proba(obs)[:, 1]
    title = model_type  # model_type already ends in "Regression"
    # Fit model
    col_min = X[col].min()
    col_max = X[col].max()
    col_df = pd.DataFrame(X[col], columns=[col])
    model.fit(col_df, Y)
    # Evaluate predictions
    Y_pred = predict_fn(col_df)
    mse = mean_squared_error(Y, Y_pred)
    # Plot prediction line
    col_line = pd.DataFrame(np.linspace(col_min, col_max, 100), columns=[col])
    plt.plot(col_line, predict_fn(col_line))
    # Plot sample
    indices = np.random.permutation(range(len(Y)))[:int(sample*len(Y))].tolist()
    plt.scatter(col_df[col][indices], Y[indices], edgecolor='b')
    plt.xlabel(col)
    plt.ylabel("Good?")
    plt.xlim((col_min, col_max))
    plt.ylim((ymin, ymax))
    plt.title("%s, MSE %0.3f" % (title, mse))
    
def linear_predict(model, X):
    return model.predict(X)

def logistic_predict(model, X):
    return model.predict_proba(X)[:, 1]

In [None]:
plt.figure(figsize=[15,7])

plt.subplot(1, 2, 1)
plot_linear(X, Y, "alcohol", "Linear Regression")

#plt.subplot(1,2,2)
#plot_linear(X, Y, "alcohol", "Logistic Regression")

And, of course, we can look at the decision surface produced by logistic regression:

In [None]:
plt.figure(figsize=[7,7])

#plt.subplot(1, 2, 1)
#Decision_Surface(X, "alcohol", "total_sulfur_dioxide", Y, LinearRegression(), sample=0.1, probabilities=False)
#lin_accy = training_accy(X[["alcohol", "total_sulfur_dioxide"]], Y, LinearRegression())
#plt.title("Linear Regression, Accy: %0.3f" % lin_accy)

plt.subplot(1, 1, 1)
Decision_Surface(X, "alcohol", "total_sulfur_dioxide", Y, LogisticRegression(), sample=0.1, probabilities=False)
lr_accy = training_accy(X[["alcohol", "total_sulfur_dioxide"]], Y, LogisticRegression())
plt.title("Logistic Regression, Accy: %0.3f" % lr_accy)

plt.tight_layout()
plt.show()

### Estimating Probabilities


For many business problems, we don't need just to estimate the categorical target variable, but we want to estimate the probability that a particular value will be taken. Just about every classification model can also tell you the estimated probability of class membership.

Intuitively, how would you generate probabilities from a classification tree? From a linear discriminant?

Let's look at the probabilities estimated by these models. As shown below, you can visualize the probabilities both for the linear model and the tree-structured model. Note that the native `LinearRegression` class in sklearn doesn't have probability estimation capability (why do you think that is?). We can only perform this operation with logistic regression.
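
For trees, one common approach is to use the class frequencies among the training instances that fall in each leaf. The sketch below, on a tiny made-up data set, checks that scikit-learn's `predict_proba` for a depth-1 tree matches those leaf frequencies. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up data set: one feature, binary label
X_demo = np.array([[1.0], [2.0], [3.0], [4.0], [7.0], [8.0], [9.0], [10.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_demo, y_demo)

# Leaf id for every training instance, then the fraction of positives per leaf
leaf_ids = stump.apply(X_demo)
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    frac_positive = y_demo[in_leaf].mean()
    predicted = stump.predict_proba(X_demo[in_leaf])[0, 1]
    print(f"leaf {leaf}: fraction positive = {frac_positive:.2f}, predict_proba = {predicted:.2f}")
```

This also hints at why very deep trees give extreme estimates: a leaf holding a single training instance can only report a frequency of 0 or 1.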

In [None]:
plt.figure(figsize=[15,7])

plt.subplot(1, 2, 1)
depth=5
model = DecisionTreeClassifier(max_depth=depth, criterion="entropy")
Decision_Surface(X, "alcohol", "total_sulfur_dioxide", Y, model, sample=0.1, probabilities=True)
plt.title("Decision tree with depth " + str(depth))

plt.subplot(1, 2, 2)
model = LogisticRegression()
Decision_Surface(X, "alcohol", "total_sulfur_dioxide", Y, model, sample=0.1, probabilities=True)
plt.title("Logistic regression")
plt.show()

Let's revisit the deeper and deeper trees from above, but this time visualizing the probabilities.  

(Do the probabilities for the last trees look odd? )


In [None]:
plot_trees(X, Y, "alcohol", "total_sulfur_dioxide", depths=[1,2,3,4,5,6,10,20,30], show_probs=True)

### Non-linear numeric models

Tree-structured models are non-linear and can fit the data very well. A plain linear model, it seems, cannot. Can we use the mechanism of fitting linear models to generate non-linear boundaries with logistic regression?

Yes! We did this already for numeric regression.  Here we also add non-linear features, such as  $ x^2 $  or  $ x^3 $ for any feature $ x $. We can even include a full set of polynomial feature interactions: given input features $x_1$ and $x_2$, we can, for instance, build models that predict from the terms $x_1$, $x_2$, $x_1^2$, $x_2^2$, and $x_1x_2$.

This is one of the most common ways of introducing non-linearity into numeric function modeling: use a linear function learner, but introduce non-linear features.
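
To see what the feature expansion produces, here is a small sketch with scikit-learn's `PolynomialFeatures` applied to a single made-up instance with two features. The values 2 and 3 are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One made-up instance with x1 = 2 and x2 = 3
X_demo = np.array([[2.0, 3.0]])

# Degree-2 expansion without the constant column:
# x1, x2, x1^2, x1*x2, x2^2
expanded = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_demo)
print(expanded)  # [[2. 3. 4. 6. 9.]]
```

The learner downstream still fits one coefficient per column, so two original features become five at degree 2. The number of columns grows quickly with the degree and with the number of input features.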

In [None]:
def polynomial_model(model=None, degree=1):
    # Create a fresh estimator per call rather than sharing one default instance
    if model is None:
        model = LogisticRegression()
    polynomial_features = PolynomialFeatures(degree=degree, include_bias=True)
    pipeline = Pipeline([("polynomial_features", polynomial_features), ("model", model)])
    return pipeline

In [None]:
plt.figure(figsize=[15,7])
from sklearn.linear_model import LogisticRegression
degrees = [1,2,3]
for i in range(len(degrees)):
    model = polynomial_model(LogisticRegression(solver='liblinear',max_iter=1000), degrees[i])
    plt.subplot(1, len(degrees), i+1) 
    Decision_Surface(X, "alcohol", "total_sulfur_dioxide", Y, model, probabilities=True, sample=0.1)
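    # Report training accuracy on the same two features shown in the plot
    accy = training_accy(X[["alcohol", "total_sulfur_dioxide"]], Y, model)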
    plt.title("Degree %d, Accy: %0.3f" % (degrees[i], accy))
plt.show()

Which model is better in this case? Look at the **accuracy** of each one. Accuracy is simply the count of correct decisions divided by the total number of decisions. Here we are computing the accuracy of the model when it makes predictions on the training set, that is, on examples the model "already knows the answer to".

[From sklearn documentation on sklearn.metrics.accuracy_score: "In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true."  [More about the accuracy measure..](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)]
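
As a quick check of the definition, accuracy can be computed by hand in a couple of lines. The label arrays below are made up for illustration.

```python
import numpy as np

# Hypothetical true labels and model decisions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy = number of correct decisions / total number of decisions
n_correct = np.sum(y_true == y_pred)
accuracy = n_correct / len(y_true)
print(n_correct, accuracy)  # 6 0.75
```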

## Generalization

Our evaluation above actually was not what we really want.

What we want are models that **generalize** to data that were not used to build them! In other words, we want this model to be able to predict the target for new data instances! Do we know how well our models generalize? Why is this important?

Let's apply this concept to our data. Now, before we fit our models, we set aside some data to be used later for testing ('holdout data').  This allows us to assess whether the model simply fit the training dataset well, or whether it truly found generalizable regularities. 

Let's use sklearn to set aside some randomly selected holdout data.

In [None]:
from sklearn.model_selection import train_test_split

# Set randomness so that we all get the same answer
np.random.seed(42)
#np.random.seed(43)
shuffled_df = wine_df.sample(frac=1)
X = shuffled_df[predictor_columns]
Y = shuffled_df.is_good
# Split the data into train and test pieces for both X and Y
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

model = DecisionTreeClassifier(max_depth=10)
model.fit(X_train, Y_train)

print("Accuracy on training = %.4f" % accuracy_score(Y_train, model.predict(X_train)))
print("Accuracy on test = %.4f" % accuracy_score(Y_test, model.predict(X_test)))

Accuracy on the training set is better than on the test set! Why is this? What can we do to make things better? What happens if our tree gets even deeper? 

In [None]:
from sklearn.model_selection import cross_val_score  # needed for the "X-Val" curve

def plot_fitting_curve(datasets, maxdepth=15):
    # Initialize accuracies
    accuracies = {}
    for key in datasets:
        accuracies[key] = []
    # Initialize depths
    depths = range(1, maxdepth+1)
    # Fit model for each specific depth
    for md in depths:
        model = DecisionTreeClassifier(max_depth=md, random_state=42)
        # Record accuracies
        for key in datasets:
            X = datasets[key]['X']
            Y = datasets[key]['Y']
            if key == "X-Val":
                accuracies[key].append(cross_val_score(model, X, Y, scoring="accuracy", cv=5).mean())
            else:
                model.fit(datasets['Train']['X'], datasets['Train']['Y'])
                accuracies[key].append(accuracy_score(model.predict(X), Y))
    # Plot each curve
    plt.figure(figsize=[10,7])
    for key in datasets:
        plt.plot(depths, accuracies[key], label=key)
    # Plot details
    plt.title("Performance on train and test data")
    plt.xlabel("Max depth")
    plt.ylabel("Accuracy")
    # Find minimum accuracy in all runs
    min_acc = np.array(list(accuracies.values())).min()
    plt.ylim([min_acc, 1.0])
    plt.xlim([1, maxdepth])
    plt.legend()
    plt.grid()
    plt.show()
    
datasets = {"Train": {"X": X_train, "Y": Y_train}, "Test": {"X": X_test, "Y": Y_test}}
plot_fitting_curve(datasets)

## Cross validation

Above, we made a single train/test split. We set aside 20% of our data and *never* used it for training. We also never used the 80% of the data set aside for training to test generalizability.  Although this is far better than testing on the training data, which does not measure generalization performance at all, there are three potential problems with the simple holdout approach.

1) Perhaps the random split was particularly bad (or good).  Do we have any confidence in our accuracy estimate?

2) We are using only 20% of the data for testing.  Could we possibly use the data more fully for testing?

3) Often we want to know something about the distribution of our evaluation metrics. A simple train/test split only allows a single "point estimate".

Instead of only making the split once, let's use **cross-validation** -- every record will contribute to testing as well as to training.
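
A minimal sketch of the idea, using scikit-learn's `KFold` on ten made-up record indices: each record is held out for testing in exactly one fold and used for training in all the others. Note that `cross_val_score` with an integer `cv` uses stratified folds for classifiers; plain `KFold` is used here only to keep the illustration simple.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up records, identified by their row index
records = np.arange(10)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
times_tested = np.zeros(len(records), dtype=int)

for fold, (train_idx, test_idx) in enumerate(kfold.split(records)):
    times_tested[test_idx] += 1
    print(f"fold {fold}: train on {len(train_idx)} records, test on {sorted(test_idx.tolist())}")

# Every record lands in a test fold exactly once
print(times_tested)
```

Because each fold yields its own score, cross-validation gives a set of estimates (here five) rather than a single point estimate, so we can look at their mean and spread.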


<img src="https://github.com/pearl-yu/foster_2022fall/blob/2022-master/Module3_Fitting_CrossVal/images/cross.png?raw=1" alt="Drawing" style="width: 600px;"/>

In [None]:
from sklearn.model_selection import cross_val_score

model = DecisionTreeClassifier(max_depth=10)
scores = cross_val_score(model, X, Y, scoring="accuracy", cv=10)

print("Cross Validated Accuracy: %0.3f +/- %0.3f" % (scores.mean(), scores.std()))

We can add this cross-validated accuracy to our plot above

In [None]:
datasets["X-Val"] = {"X": X, "Y": Y}
plot_fitting_curve(datasets)

In this particular example, the performance on the test set does not drop as the trees get deeper.

The book shows this:

<img src="https://github.com/pearl-yu/foster_2022fall/blob/2022-master/Module3_Fitting_CrossVal/images/generalization.png?raw=1" alt="Drawing" style="width: 600px;"/>


So... take a look at the Homework example:

In [None]:
# Load data
path = "./data/data-hw1.csv"
np.random.seed(42)
df = pd.read_csv(path)
# (No explicit shuffle here; train_test_split below shuffles the rows by default)
# Get features and label
columns = ["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR", "CGPA", "Research"] 
X = df[columns]
Y = df["Chance of Admit"] > 0.5
# Split the data into train and test pieces for both X and Y
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
# Test for overfitting
datasets = {"Train": {"X": X_train, "Y": Y_train}, 
            "Test": {"X": X_test, "Y": Y_test},
            "X-Val": {"X": X, "Y": Y}}
plot_fitting_curve(datasets, maxdepth=10)