---

### 🎓 **Professor**: Apostolos Filippas

### 📘 **Class**: Web Analytics

### 📋 **Topic**: Evaluating Models

### 🔗 **Link**: https://bit.ly/WA_LEC8_EVAL

### 🛢️ **Data**: https://bit.ly/mailingData 

🚫 **Note**: You are not allowed to share the contents of this notebook with anyone outside this class without written permission by the professor.

---

# 1. Starting out

We start off as we usually do. Let's import some things that will be useful.

In [None]:
# Import pandas to read in data
import pandas as pd
import numpy as np

# Import matplotlib for plotting
import matplotlib.pyplot as plt
%matplotlib inline

# Import decision trees
from sklearn.tree import DecisionTreeClassifier

# Import train, test, and evaluation functions
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# 2. Data
We're going to use a mail response data set from a real direct marketing campaign located in `files/mailing.csv`. 

You can download the files from [https://bit.ly/mailingData](https://bit.ly/mailingData).


Each record represents an individual who was targeted with a direct marketing offer.  The offer was a solicitation to make a charitable donation. 

The columns (features) are:

```
Col.  Name      Description
----- --------- ----------------------------------------------------------------
1     income    household income
2     Firstdate date assoc. with the first gift by this individual
3     Lastdate  date associated with the most recent gift 
4     Amount    average amount by this individual over all periods (incl. zeros)
5     rfaf2     frequency code
6     rfaa2     donation amount code
7     pepstrfl  flag indicating a star donator
8     glast     amount of last gift
9     gavr      amount of average gift
10    class     one if they gave in this campaign and zero otherwise.
```

Our goal is to build a model to predict if people will give during the current campaign (this is the attribute called `"class"`).

Let's read our data in and put the target variable in `Y` and all the other features in `X`.

In [None]:
# Read data using pandas
data = pd.read_csv("files/mailing.csv")

# Split into X and Y
X = data.drop(columns=['class'])
Y = data['class']

In [None]:
len(data)

In [None]:
X

In [None]:
data.head()

In [None]:
X.head()

In [None]:
Y.head()

# 3. Overfitting

Let's first create a classification algorithm called "decision tree," fit a model (learn the model from the data), and use it to get predictions on all of our data.

In [None]:
# Create an empty, unlearned tree
tree = DecisionTreeClassifier(criterion="entropy")

# Fit/train the tree
tree.fit(X, Y)

# Get a prediction
Y_predicted = tree.predict(X)

# Get the accuracy of this prediction
accuracy = accuracy_score(Y, Y_predicted)

# Print the accuracy
print(f"The accuracy is {100*accuracy:.2f}")

That looks like a pretty high accuracy, but is it? Let's check the base rate.

In [None]:
Y.mean()

In [None]:
1 - Y.mean()

About 95% of the people do not donate, so a model that always predicts "no donation" would already be 95% accurate. An accuracy of 99.5% is still pretty good, since it is well above that base rate.
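
To make the base rate concrete, here is a minimal sketch of a "model" that ignores the features and always predicts the majority class. The `labels` array is hypothetical (1,000 people, 5% donors); it only mimics the imbalance described above and is not the mailing data.

```python
import numpy as np

# Hypothetical labels: 50 donors (1) and 950 non-donors (0)
labels = np.array([1] * 50 + [0] * 950)

# A "model" that ignores the features and always predicts the majority class
majority_class = int(labels.mean() >= 0.5)
always_majority = np.full_like(labels, majority_class)

# Its accuracy is the share of the majority class, i.e. the base rate
base_rate = max(labels.mean(), 1 - labels.mean())
baseline_accuracy = (always_majority == labels).mean()

print(f"Base rate: {base_rate:.3f}")
print(f"Accuracy of always predicting {majority_class}: {baseline_accuracy:.3f}")
```

Any real model should be judged against this number rather than against 0%.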

However, we might be overfitting our data. The model might have "memorized" where all the points are. This does not lead to models that will generalize well.

We can create training and testing sets very easily. Here we will create train and test sets of `X` and `Y` where we assign 70% of our data to training.

In [None]:
# Split X and Y into training and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.70)

Now, let's look at the same decision tree but fit it with our training data and test it on our testing data.

In [None]:
# Create an empty, unlearned tree
tree = DecisionTreeClassifier(criterion="entropy")

# Fit/train the tree on the training data
tree.fit(X_train, Y_train)

# Get a prediction from the tree on the test data
Y_test_predicted = tree.predict(X_test)

# Get the accuracy of this prediction
accuracy = accuracy_score(Y_test, Y_test_predicted)

# Print the accuracy
print(f"The accuracy is {100*accuracy:.2f}")

Let's also use cross validation with 5 folds to see how well our model performs.

In [None]:
# Create an empty, unlearned tree
tree = DecisionTreeClassifier(criterion="entropy")
    
# This will get us 5-fold cross validation accuracy with our tree and our data
# We can do this in one line!
cross_fold_accuracies = cross_val_score(tree, X, Y, scoring="accuracy", cv=5)
    
# Average accuracy
average_cross_fold_accuracy = np.mean(cross_fold_accuracies)

In [None]:
for fold in cross_fold_accuracies:
    print(fold)

In [None]:
print(average_cross_fold_accuracy)

That's a pretty big difference! Which accuracy do you "trust" more? Why?
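
To see what the one-line `cross_val_score` call is doing, here is a rough sketch that runs the five folds by hand. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the mailing data), and the names `X_demo`, `Y_demo`, and `tree_demo` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the mailing data)
X_demo, Y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# For a classifier, cv=5 corresponds to five stratified folds without shuffling
folds = StratifiedKFold(n_splits=5)
fold_accuracies = []

for train_index, test_index in folds.split(X_demo, Y_demo):
    # Train a fresh tree on four folds
    tree_demo = DecisionTreeClassifier(criterion="entropy", random_state=0)
    tree_demo.fit(X_demo[train_index], Y_demo[train_index])

    # Evaluate it on the one fold it has never seen
    predictions = tree_demo.predict(X_demo[test_index])
    fold_accuracies.append(accuracy_score(Y_demo[test_index], predictions))

print(fold_accuracies)
print(np.mean(fold_accuracies))
```

Every row is used for testing exactly once, and never by a tree that saw it during training.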

# 4. Creating a simple learning curve

In [None]:
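# Seed NumPy's global random generator, which train_test_split uses by default
np.random.seed(9001)

# Do a 70/30 split of the data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.70)

# Here are some percentages to get you started. Feel free to try more!
# The small offset keeps the last value below 1.0, as a float train_size must be strictly between 0 and 1
training_percentages = [i*0.1 -0.0001 for i in range(1, 11)]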
accuracies = []

for training_percentage in training_percentages:
    X_temp_train, X_temp_test, Y_temp_train, Y_temp_test = train_test_split(X_train, Y_train, train_size=training_percentage)

    # This will create an empty, unlearned decision tree
    tree = DecisionTreeClassifier(criterion="entropy")

    # This will fit/train your tree on the reduced training set
    tree.fit(X_temp_train, Y_temp_train)

    # This will get predictions on the held-out test set
    Y_test_predicted = tree.predict(X_test)

    # With these predictions we can get an accuracy. Where should we store this accuracy?
    acc = accuracy_score(Y_test, Y_test_predicted)
    accuracies.append(acc)

# We want to plot our results. What list should we use for the x-axis? What about the y-axis?
plt.plot(training_percentages, accuracies)
plt.show()
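
scikit-learn also ships a `learning_curve` helper that combines this loop with cross validation. The sketch below is one possible way to use it, on synthetic stand-in data (an assumption, not the mailing data). The names `X_demo` and `Y_demo` are made up for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the mailing data)
X_demo, Y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# Train on 10%, 20%, ..., 100% of each training fold and score on the held-out fold
train_sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    X_demo,
    Y_demo,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5,
    scoring="accuracy",
)

# Each row of test_scores holds the five fold scores for one training-set size
mean_test_accuracy = test_scores.mean(axis=1)
for size, acc in zip(train_sizes, mean_test_accuracy):
    print(f"{int(size)} training rows -> mean CV accuracy {acc:.3f}")
```

Averaging over folds makes the curve less sensitive to one lucky or unlucky split than the single train/test split used above.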

# 5. Creating a simple fitting curve


In [None]:
# Let's fix our training data size at 80%
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.80)

# Let's try different max depths for a decision tree
max_depths = range(1, 100)
accuracies = []
accuracies_train = []

for max_depth in max_depths:
    # This will create an empty decision tree at a specified max depth
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=max_depth)
    
    # This will fit/train your tree
    tree.fit(X_train, Y_train)
    
    # This will get the test and train accuracy and keep track of both
    Y_test_predicted  = tree.predict(X_test)
    Y_train_predicted = tree.predict(X_train)
    accuracies.append(accuracy_score(Y_test, Y_test_predicted))
    accuracies_train.append(accuracy_score(Y_train, Y_train_predicted))

# We want to plot our results
plt.plot(max_depths, accuracies)
plt.ylabel("Accuracy")
plt.xlabel("Max depth (model complexity)")
plt.show()



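In [None]:
# Plot train and test accuracy together to see where the two start to diverge
plt.plot(max_depths, accuracies_train, label="Train")
plt.plot(max_depths, accuracies, label="Test")
plt.legend()
plt.ylabel("Accuracy")
plt.xlabel("Max depth (model complexity)")
plt.show()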
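
The same kind of fitting curve can be produced with scikit-learn's `validation_curve`, which cross-validates each value of a hyperparameter. This is a sketch on synthetic stand-in data (an assumption, not the mailing data), with a short, arbitrary list of depths chosen for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the mailing data)
X_demo, Y_demo = make_classification(n_samples=600, n_features=8, random_state=0)

# A few illustrative depths; any list of positive integers would do
depths = [1, 2, 4, 8, 16]

train_scores, test_scores = validation_curve(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    X_demo,
    Y_demo,
    param_name="max_depth",
    param_range=depths,
    cv=5,
    scoring="accuracy",
)

# Average over the five folds: one train and one test accuracy per depth
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
for depth, train_acc, test_acc in zip(depths, mean_train, mean_test):
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  test={test_acc:.3f}")
```

A growing gap between the train and test columns as depth increases is the usual sign of overfitting.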

# Bonus: The ROC curve

 The ROC (Receiver Operating Characteristic) curve is a staple in predictive modeling, especially for binary classification problems. It provides us with a graphical representation of a model's true positive rate vs. its false positive rate, over various threshold values.

## A short logistic regression introduction
Logistic regression is a statistical method for binary classification.
- In linear regression, we were trying to fit a line
$$y = b_0 + b_1 x_1 + \ldots + b_n x_n$$
- In logistic regression, we are trying to fit a **sigmoid function**
$$y = \frac{1}{1 + e^{-(b_0 + b_1 x_1 + \ldots + b_n x_n)}}$$
- Similar to linear regression, our goal during fitting is to estimate the "best" parameters $b_0, b_1, \ldots, b_n$.

Note that the prediction of the logistic regression will always be a value between 0 and 1. This allows us to interpret the prediction as the probability that an instance belongs to a target class.
 
 

In [None]:
# Plot a simple sigmoid function
x = np.linspace(-10, 10, 1000)
y = 1 / (1 + np.exp(-x))
plt.plot(x, y)

In [None]:
# Now plot several sigmoid functions with different slopes
x = np.linspace(-10, 10, 1000)
y1 = 1 / (1 + np.exp(-x))
y2 = 1 / (1 + np.exp(-2*x))
y3 = 1 / (1 + np.exp(-0.5*x))
plt.plot(x, y1)
plt.plot(x, y2)
plt.plot(x, y3)

## Fitting a logistic regression model

Let's first create some data.

In [None]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# Sample Dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Convert training data to DataFrame
df_train = pd.DataFrame(X_train, columns=[f'Feature_{i+1}' for i in range(X_train.shape[1])])
df_train['True_Label'] = y_train

# Convert test data to DataFrame
df_test = pd.DataFrame(X_test, columns=[f'Feature_{i+1}' for i in range(X_test.shape[1])])
df_test['True_Label'] = y_test

df_train.head()

Now let's fit the model to the training data and get the predicted probabilities for the test data.

In [None]:
# initialize the model
clf = LogisticRegression()
# fit the model
clf.fit(X_train, y_train)
# get predictions on train and test data
train_probabilities = clf.predict_proba(X_train)[:,1]
test_probabilities = clf.predict_proba(X_test)[:,1]
train_predictions = clf.predict(X_train)
test_predictions = clf.predict(X_test)


The "best" coefficients that we learnt are:

In [None]:
# The best coefficients we've learnt
clf.coef_
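
To connect these coefficients back to the sigmoid formula, here is a minimal sketch that rebuilds the predicted probabilities by hand. It fits its own small model on synthetic data so that it runs on its own. The names `X_demo`, `y_demo`, and `model` are made up for this illustration, and the model uses default `LogisticRegression` settings for a binary problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem, separate from the data used in this notebook
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X_demo, y_demo)

# Linear score b_0 + b_1 x_1 + ... + b_n x_n for every row
linear_score = model.intercept_[0] + X_demo @ model.coef_[0]

# Pass the score through the sigmoid to turn it into a probability
manual_probability = 1 / (1 + np.exp(-linear_score))

# Compare with the probability of class 1 reported by scikit-learn
sklearn_probability = model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_probability, sklearn_probability))
```

In other words, `intercept_` plays the role of $b_0$ and `coef_` holds $b_1, \ldots, b_n$.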

And here are the predicted labels, if we use a threshold of 0.5 (the default)

In [None]:
df_train['Predicted_Label'] = train_predictions
df_train['Predicted_Probability'] = train_probabilities
df_test['Predicted_Label'] = test_predictions
df_test['Predicted_Probability'] = test_probabilities

df_test.head(20)

In [None]:
# print the base rate of the test set (here, the share of class 0)
print(f'Base Rate: {np.round(1 - df_test["True_Label"].mean(), 4)}')


# get accuracy on train and test data
train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f'Train Accuracy: {train_accuracy}')
print(f'Test Accuracy: {test_accuracy}')

## Picking the optimal threshold

Even though we used 0.5 as the threshold, we can use any threshold we want. In fact, it might be better to use a different threshold depending on the problem we are trying to solve, i.e., on how much we care about true positives vs. false positives vs. true negatives vs. false negatives.
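
As a small worked example of this trade-off, the sketch below applies three thresholds to eight hypothetical instances and reports the true positive rate (TPR) and false positive rate (FPR) for each. The labels and probabilities are made up for the illustration, and `rates_at_threshold` is a helper defined here, not a library function.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities for eight instances
true_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
probabilities = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])

def rates_at_threshold(threshold):
    """Return (TPR, FPR) when predicting 1 for probabilities >= threshold."""
    predicted = (probabilities >= threshold).astype(int)
    true_positives = np.sum((predicted == 1) & (true_labels == 1))
    false_positives = np.sum((predicted == 1) & (true_labels == 0))
    tpr = true_positives / np.sum(true_labels == 1)
    fpr = false_positives / np.sum(true_labels == 0)
    return tpr, fpr

for threshold in [0.75, 0.5, 0.25]:
    tpr, fpr = rates_at_threshold(threshold)
    print(f"threshold={threshold:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Lowering the threshold catches more of the true positives, but at the cost of more false positives.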

The ROC curve helps us visualize what would have happened had we picked different thresholds.

In [None]:
# Compute ROC curve
fpr, tpr, thresholds = roc_curve(y_test, test_probabilities)

# Selected thresholds
selected_thresholds = [1, 0.75, 0.5, 0.25, 0]
selected_indices = [np.argmin(np.abs(thresholds-t)) for t in selected_thresholds]

# Visualization adjustments
plt.figure(figsize=(10,8))
plt.plot(fpr, tpr, color='darkorange', label='ROC curve')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')

# Scatter plot for selected thresholds
plt.scatter(fpr[selected_indices], tpr[selected_indices], color='black', s=50, label='Selected thresholds')
for ind in selected_indices:
    plt.annotate(f"{thresholds[ind]:.2f}", (fpr[ind], tpr[ind]), textcoords="offset points", xytext=(-10,-10), ha='center')

# Adjust axis limits for better visibility
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])

plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve with Selected Thresholds')
plt.legend(loc="lower right")
plt.show()

- The 45-degree line represents a model that guesses a probability at random and then classifies based on that guess and the corresponding threshold. For example, with a threshold of 1 we would classify everything as negative; with a threshold of 0 we would classify everything as positive; and so on.
- The curve represents the performance of our model at different thresholds. The higher the curve (the closer it gets to the top-left corner), the better the model.
- The area under the curve (AUC) summarizes how good the model is in a single number: 0.5 corresponds to random guessing and 1 to a perfect model. The higher the AUC, the better the model.
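
As a sketch of how the AUC can be computed, the example below uses eight hypothetical instances (made up for the illustration) and gets the same number in three ways: from the ROC curve points, directly with `roc_auc_score`, and as the share of (positive, negative) pairs in which the positive instance gets the higher predicted probability.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for eight instances
true_labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
probabilities = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])

# 1) Area under the ROC curve points (trapezoidal rule)
fpr_demo, tpr_demo, _ = roc_curve(true_labels, probabilities)
area_from_curve = auc(fpr_demo, tpr_demo)

# 2) The same quantity computed directly
area_direct = roc_auc_score(true_labels, probabilities)

# 3) Ranking view: how often does a positive outrank a negative?
positives = probabilities[true_labels == 1]
negatives = probabilities[true_labels == 0]
area_from_pairs = np.mean(positives[:, None] > negatives[None, :])

print(area_from_curve, area_direct, area_from_pairs)
```

The ranking view (which is exact here because there are no tied probabilities) explains why the AUC does not depend on any single threshold.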
  
More: https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc