# An Introduction to Regression and Classification with Scikit-Learn

## The data

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Side note:
* I will be posting data files and shared notebooks in a GitHub repository: https://github.com/benjum/UCLA-24W-DH150
* If you're a bit rusty on Python, do not worry.  I will be distributing Python-relevant tutorial materials later this week.
    * If you later review the tutorial materials and the following still makes no sense, you may want to consult with me about how challenging this course will be for you.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/benjum/UCLA-24W-DH150/main/Data/gdp-vs-lifesatisfaction.csv')

In [None]:
data = data.to_numpy()

In [None]:
data

In [None]:
# Assign the appropriate elements to x and y
x = data[:,1]
y = data[:,2]

In [None]:
# Make a scatter plot

plt.plot(x,y,'ko')
plt.show()

## Scikit-learn

* Scikit-Learn docs: https://scikit-learn.org/stable/index.html
* Linear Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
* K-Nearest Neighbors Regressor: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
* K-Nearest Neighbors Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
* Logistic Regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

## Linear Regression

In [None]:
# sklearn is our library and 
# linear_model is a package within the library that contains various linear models to use

import sklearn.linear_model

In [None]:
# To use a machine learning algorithm in scikit-learn
# we use the relevant "Estimator" object
# Here that estimator object is "LinearRegression"

model = sklearn.linear_model.LinearRegression()

Technical note: sklearn usually wants to work with feature arrays x that are 2D arrays of shape (n_samples, n_features).  Therefore, if your x is a 1D numpy array, you'll need to reshape it to be like (n_samples, 1).  You can achieve this with `x.reshape(-1,1)`, with the `-1` meaning that the n_samples value will be inferred from `x`.
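
As a quick check of what that reshape does, here is a minimal sketch on a made-up array (the name `x_demo` and its values are only for illustration and are not part of our data set):

```python
import numpy as np

# Hypothetical 1D feature array (values made up for illustration)
x_demo = np.array([10000.0, 20000.0, 30000.0])
print(x_demo.shape)        # 1D: (n_samples,)

# reshape(-1, 1): exactly one column, with the number of rows inferred
x_demo_2d = x_demo.reshape(-1, 1)
print(x_demo_2d.shape)     # 2D: (n_samples, 1)
```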

In [None]:
# Reshape x (https://numpy.org/doc/stable/reference/generated/numpy.reshape.html)
# The "-1" means that the length is inferred.
# Here the first dimension is the number of elements

x = x.reshape(-1,1)

In [None]:
# Train the model
model.fit(x,y)

In [None]:
# Make a prediction
x_test = [[25000]]
model.predict(x_test)

In [None]:
# Visualize what the predictions are for this model

plt.plot(x,y,'ko')

x_new = np.linspace(8000,58000,2)
x_new = x_new.reshape(-1,1)
y_pred = model.predict(x_new)
plt.plot(x_new, y_pred)

plt.show()

In [None]:
# For this model, we can retrieve parameters for our model equation
print(model.coef_, model.intercept_)

In [None]:
# Visualize what the predictions are for this model

plt.plot(x,y,'ko')

x_new = np.linspace(8000,58000,2)
x_new = x_new.reshape(-1,1)

# the predicted y values are now from a model equation, 
# not from results of calling the predict function
y_pred = model.intercept_ + model.coef_ * x_new

plt.plot(x_new, y_pred)

plt.show()

In [None]:
# If we had test data, we could also try to ascertain an error
# Here we'll calculate an error simply on the same set of data
# that is, we'll compare the actual y values with the model's predicted y values

from sklearn.metrics import mean_squared_error, r2_score

print('MSE = ', mean_squared_error(y, model.predict(x)))
print('R^2 = ', r2_score(y, model.predict(x)))
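
If you are curious what those two functions compute, the sketch below reproduces them by hand.  The arrays `y_true` and `y_hat` are made-up values for illustration, not results from our model.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, only for illustration
y_true = np.array([5.0, 6.0, 7.0, 8.0])
y_hat = np.array([5.5, 5.5, 7.5, 7.5])

# MSE: the mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Each pair of numbers should agree
print('MSE = ', mse_manual, mean_squared_error(y_true, y_hat))
print('R^2 = ', r2_manual, r2_score(y_true, y_hat))
```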

## K-Nearest Neighbors

Replacing the Linear Regression model with the K-Nearest Neighbors model in the previous code is as simple as replacing these two lines:

```python
import sklearn.linear_model
model = sklearn.linear_model.LinearRegression()
```

with these two:

```python
import sklearn.neighbors
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
```

Do note, however, that changing the model like this changes what is getting trained and what is being "learned":
* K-Nearest Neighbors does not learn the optimum number of neighbors; that is a hyperparameter that we choose
* K-Nearest Neighbors does not learn the optimal value for any model coefficients
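
To make that concrete, here is a minimal sketch of what a K-Nearest Neighbors regressor does at prediction time with its default settings (uniform weights): it averages the y values of the k closest training points.  The data and the helper `knn_predict` are made up for illustration; this is not how scikit-learn implements the search internally.

```python
import numpy as np
import sklearn.neighbors

# Made-up training data, only for illustration
x_train = np.array([1.0, 2.0, 3.0, 6.0, 7.0]).reshape(-1, 1)
y_train = np.array([1.0, 2.0, 3.0, 10.0, 12.0])

def knn_predict(x_query, k=3):
    """Hypothetical helper: average y over the k training points closest to x_query."""
    distances = np.abs(x_train[:, 0] - x_query)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

knn_model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
knn_model.fit(x_train, y_train)

# The hand-rolled average and the sklearn prediction should agree
manual = knn_predict(2.2)
from_sklearn = knn_model.predict([[2.2]])[0]
print(manual, from_sklearn)
```

Nothing like a slope or intercept is computed during `fit`; the training data itself is what gets stored and consulted.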

In [None]:
import sklearn.neighbors

In [None]:
# Our estimator object is now KNeighborsRegressor
# and we initialize one of its parameters, 
# namely the number of nearest neighbors to use in the algorithm

model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)

Technical note: sklearn usually wants to work with feature arrays x that are 2D arrays of shape (n_samples, n_features).  Therefore, if your x is a 1D numpy array, you'll need to reshape it to be like (n_samples, 1).  You can achieve this with `x.reshape(-1,1)`, with the `-1` meaning that the n_samples value will be inferred from `x`.

In [None]:
# Reshape x (https://het.as.utexas.edu/HET/Software/Numpy/reference/generated/numpy.reshape.html)
# The "-1" means that the length is inferred.
# Here the first dimension is the number of elements

x = x.reshape(-1,1)

In [None]:
# Train the model
model.fit(x,y)

In [None]:
# Make a prediction
x_test = [[25000]]
model.predict(x_test)

In [None]:
# Visualize what the predictions are for this model

plt.plot(x,y,'ko')

# NOTE!! We include a lot of points in our x array here for plotting the predictions
# Linear regression was ok with only 2 points because a line is determined by 2 points
# Here there can be very sharp jumps in value along the range in x
x_new = np.linspace(8000,58000,100000)
x_new = x_new.reshape(-1,1)
y_pred = model.predict(x_new)
plt.plot(x_new, y_pred)

plt.show()

In [None]:
# If we had test data, we could also try to ascertain an error
# Here we'll calculate an error simply on the same set of data
# that is, we'll compare the actual y values with the model's predicted y values

from sklearn.metrics import mean_squared_error, r2_score

print('MSE = ', mean_squared_error(y, model.predict(x)))
print('R^2 = ', r2_score(y, model.predict(x)))
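
Scoring on the training data can be misleading for KNN.  The sketch below uses synthetic data (a made-up linear trend plus noise, not our GDP data) and scikit-learn's `train_test_split` to hold out some points.  With `n_neighbors=1`, each training point is its own nearest neighbor, so the training MSE is zero even though the error on held-out points is not.

```python
import numpy as np
import sklearn.neighbors
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data, made up for illustration: a linear trend plus noise
rng = np.random.default_rng(0)
x_all = np.linspace(8000, 58000, 40).reshape(-1, 1)
y_all = 5.0 + 4e-5 * x_all[:, 0] + rng.normal(scale=0.4, size=40)

# Hold out 25% of the points; the model never sees them during training
x_train, x_test, y_train, y_test = train_test_split(
    x_all, y_all, test_size=0.25, random_state=0
)

# With n_neighbors=1, each training point is its own nearest neighbor
knn_model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=1)
knn_model.fit(x_train, y_train)

train_mse = mean_squared_error(y_train, knn_model.predict(x_train))
test_mse = mean_squared_error(y_test, knn_model.predict(x_test))
print('Training MSE = ', train_mse)
print('Test MSE     = ', test_mse)
```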

### Visualizing the algorithm's output as a function of n_neighbors

I will occasionally use the Python library `ipywidgets` to make interactive visualizations.  These can provide useful insights into the scaling of functions with respect to input parameters.
* https://ipywidgets.readthedocs.io/en/stable/

Here we'll use it to make an interactive visualization that allows us to see how the prediction curve of KNN varies as a function of n_neighbors for this data set.

In [None]:
import ipywidgets

In [None]:
def knn(n=3):

    model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=n)
    model.fit(x,y)
    
    plt.plot(x,y,'ko')

    x_new = np.linspace(8000,58000,100000)
    x_new = x_new.reshape(-1,1)
    y_pred = model.predict(x_new)
    plt.plot(x_new, y_pred)

    plt.show()
    
ipywidgets.interact(knn,n=(1,29))

# Classification

Classification is simple to do here as well.  However, it is not just a matter of switching the algorithm.  Classification is applied for different types of data, and it has different applicable algorithms as well as different performance metrics.

We are going to use the same data, but now convert it into 0 and 1 values.  Classification is appropriate for a discrete set of values like this.

In [None]:
y

In [None]:
y >= 6.5

In [None]:
y = (y >= 6.5)

In [None]:
import sklearn.neighbors

In [None]:
# Before for regression:
# model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)

# Classifier
model = sklearn.neighbors.KNeighborsClassifier(n_neighbors=3)

In [None]:
# Train the model
model.fit(x,y)

In [None]:
# Make a prediction
x_test = [[25000]]
model.predict(x_test)

In [None]:
# Visualize what the predictions are for this model

plt.plot(x, y, 'ko')

x_new = np.linspace(8000,58000,100000)
x_new = x_new.reshape(-1,1)
y_pred = model.predict(x_new)
plt.plot(x_new, y_pred)

plt.show()

## Classification's performance metrics

In [None]:
model.score(x, y)

In [None]:
# If the model correctly classifies i points and misclassifies j points out of k total
# the score should be i/k
28/29

The above is termed the accuracy.  By default, each algorithm's "score" method may be a little different:
* [`score` for KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.score)

In [None]:
# Note that when calculating the precision and recall here, if your classes are not 0/1
# you will need to specify what class is positive vs negative (the "pos_label")

print(f"Accuracy: {sklearn.metrics.accuracy_score(y, model.predict(x)):.2%}")
print(f"Precision: {sklearn.metrics.precision_score(y, model.predict(x), pos_label=1):.2%}")
print(f"Recall: {sklearn.metrics.recall_score(y, model.predict(x), pos_label=1):.2%}")

You can get more information on the model performance with a confusion matrix. 

In the case of binary classification, the confusion matrix shows true negatives, false positives, false negatives, and true positives.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(y, model.predict(x))

In [None]:
confmat = confusion_matrix(y, model.predict(x))

fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(confmat)

# the below just sets the axis labels, tick marks, and text inside the boxes
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
for i in range(2):
    for j in range(2):
        ax.text(j, i, confmat[i,j], ha='center', va='center', color='red')

plt.show()
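
The four counts in a binary confusion matrix are enough to recompute accuracy, precision, and recall by hand.  Here is a minimal sketch with made-up labels (`y_actual` and `y_predicted` are illustrative, not our model's output):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels and predictions, only for illustration
y_actual = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_predicted = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])

# For binary 0/1 labels, sklearn orders the matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted 1s, the fraction that are actual 1s
recall = tp / (tp + fn)      # of the actual 1s, the fraction that were predicted as 1

# Each pair of numbers should agree
print(accuracy, accuracy_score(y_actual, y_predicted))
print(precision, precision_score(y_actual, y_predicted))
print(recall, recall_score(y_actual, y_predicted))
```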

And with the classification report, we can see the precision and recall (and other scores) broken down according to class.

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y, model.predict(x)))

We discussed precision and recall for binary classification, but for multi-class classification problems, these metrics can be computed in slightly different ways depending on how one does averaging. 

The three common averaging schemes are:
* A macro-average computes the metric independently for each class and then takes the plain average, hence treating all classes equally.
* A weighted average computes the metric independently for each class but weights each class by its support (the number of actual samples in that class) when calculating the overall average.
* A micro-average aggregates the contributions of all classes (the pooled true positives, false positives, and false negatives) to compute a single overall metric.

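In a multi-class classification setup, the choice matters most when there is class imbalance (i.e. you have many more examples of one class than of other classes).  A micro-average counts every sample equally, so the most frequent classes dominate it; a macro-average counts every class equally, so it is more sensitive to poor performance on rare classes.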
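
Here is a small sketch of the averaging options using the `average` parameter of `precision_score`.  The three-class labels are made up for illustration, with class 0 deliberately more common than the others.

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up 3-class labels with imbalance: class 0 is much more common
y_actual = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_predicted = np.array([0, 0, 0, 0, 0, 0, 1, 2, 2, 0])

# average=None returns one precision value per class
per_class = precision_score(y_actual, y_predicted, average=None)
macro = precision_score(y_actual, y_predicted, average='macro')
weighted = precision_score(y_actual, y_predicted, average='weighted')
micro = precision_score(y_actual, y_predicted, average='micro')

# Support: the number of actual samples in each class
support = np.bincount(y_actual)

print('Per class:', per_class)
print('Macro:    ', macro, 'vs', per_class.mean())
print('Weighted: ', weighted, 'vs', np.average(per_class, weights=support))
print('Micro:    ', micro)
```

For single-label problems like this one, the micro-averaged precision is the same number as the accuracy.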

# Classification based on Logistic Regression

In [None]:
import sklearn.linear_model

In [None]:
# Before with linear regression
# model = sklearn.linear_model.LinearRegression()

# Classifier with logistic regression
model = sklearn.linear_model.LogisticRegression()

In [None]:
# Train the model
model.fit(x,y)

In [None]:
# Visualize what the predictions are for this model

plt.plot(x, y, 'ko')

x_new = np.linspace(8000,58000,100000)
x_new = x_new.reshape(-1,1)
y_pred = model.predict(x_new)
plt.plot(x_new, y_pred)

plt.show()

Note what is different about the above prediction curve relative to that for KNN Classification.  Can you make them more similar?

## Learned model

With logistic regression, the optimal values for coefficients of a specific model equation have been learned.

In [None]:
# For this model, we can actually retrieve parameters for our model equation
print(model.coef_, model.intercept_)

In [None]:
model.classes_

What's the `intercept_` and `coef_` for a logistic model?

$$f(x) = \frac{1}{1+e^{-(a_0 + a_1 x)}}$$

In [None]:
# Visualize what the predictions are for this model

plt.plot(x,y,'ko')

x_new = np.linspace(8000,58000,100000)
x_new = x_new.reshape(-1,1)

# the predicted y values are now from a model equation, 
# not from results of calling the predict function
y_model = 1 / (1 + np.exp(-(model.intercept_ + model.coef_ * x_new)))

plt.plot(x_new, y_model)

plt.show()

The learned model $f(x)$ gives us a probability of belonging to the "positive" class, and we can take $f(x) > 0.5$, for example, to be a threshold for classifying as one class vs another. 
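
As a sanity check on that statement, the sketch below compares three things on a small made-up data set (not our GDP data): the probabilities from `predict_proba`, the model equation $f(x)$ evaluated by hand, and the labels from `predict`.  The names `x_demo`, `y_demo`, and `clf` are only for illustration.

```python
import numpy as np
import sklearn.linear_model

# Made-up, small-valued data for illustration (not the GDP data)
x_demo = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]).reshape(-1, 1)
y_demo = np.array([0, 0, 0, 1, 0, 1, 1, 1])

clf = sklearn.linear_model.LogisticRegression()
clf.fit(x_demo, y_demo)

# predict_proba returns one column per class, ordered as in clf.classes_
p_positive = clf.predict_proba(x_demo)[:, 1]

# The same probabilities from the model equation f(x)
f_x = 1 / (1 + np.exp(-(clf.intercept_ + clf.coef_[0, 0] * x_demo[:, 0])))

# Thresholding the probability at 0.5 reproduces what predict returns
labels_from_threshold = (p_positive > 0.5).astype(int)

print(np.allclose(p_positive, f_x))
print(np.array_equal(labels_from_threshold, clf.predict(x_demo)))
```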

In [None]:
# Visualize what the predictions are for this model

plt.plot(x,y,'ko')

x_new = np.linspace(8000,58000,100000)
x_new = x_new.reshape(-1,1)

# the predicted y values are now from a model equation, 
# not from results of calling the predict function
y_model = 1 / (1 + np.exp(-(model.intercept_ + model.coef_ * x_new)))

plt.plot(x_new, y_model)

plt.axhline(0.5,color='r',linestyle='--')

x_new = np.linspace(8000,58000,100000)
x_new = x_new.reshape(-1,1)
y_pred = model.predict(x_new)
plt.plot(x_new, y_pred)

plt.show()

## Classification's performance metrics

In [None]:
model.score(x, y)

In [None]:
# If the model correctly classifies i points and misclassifies j points out of k total
# the score should be i/k
27/29

The above is termed the accuracy.  By default, each algorithm's "score" method may be a little different:
* [`score` for LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.score)

In [None]:
# Note that when calculating the precision and recall here, if your classes are not 0/1
# you will need to specify what class is positive vs negative (the "pos_label")

print(f"Accuracy: {sklearn.metrics.accuracy_score(y, model.predict(x)):.2%}")
print(f"Precision: {sklearn.metrics.precision_score(y, model.predict(x), pos_label=1):.2%}")
print(f"Recall: {sklearn.metrics.recall_score(y, model.predict(x), pos_label=1):.2%}")

You can get more information on the model performance with a confusion matrix. 

In the case of binary classification, the confusion matrix shows true negatives, false positives, false negatives, and true positives.

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(y, model.predict(x))

In [None]:
confmat = confusion_matrix(y, model.predict(x))

fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(confmat)

# the below just sets the axis labels, tick marks, and text inside the boxes
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
for i in range(2):
    for j in range(2):
        ax.text(j, i, confmat[i,j], ha='center', va='center', color='red')

plt.show()

And with the classification report, we can see the precision and recall (and other scores) broken down according to class.

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y, model.predict(x)))