# Machine Learning

Machine learning is the science (and art) of programming computers so they can learn from data. 

For example:
- Ranking web search results (Google)
- Recommendation systems (Amazon)
- AlphaGo
- ChatGPT and other large language models (LLMs)
- Spam filter
- Image classification
- Detecting credit card fraud
- ...

In general, we use previously collected data to train a model (an algorithm), and then use the trained model to make predictions on new data.

- Example 1: We can use emails (data) to build a spam filter (algorithm, model), and then use the spam filter to predict whether new emails (new data) are spam.

- Example 2: We can use your current friends list (data) to build a friend recommendation system (algorithm, model), and then use this model to find people (new data) you may know.

- ...

So, our goal is to find a model that works well on **unseen data**.


## Machine Learning pipeline:

1. Look at the Big Picture:
    
    - You should understand your task, e.g., regression (predicting housing prices) or classification (recognizing handwritten digits).
    - Select a performance measure
    - ...

2. Get the Data

    - Download the data from internet
    - ...

3. Discover your data and visualize your data

    - Pandas and Matplotlib
    - ...

4. Prepare your data for Machine Learning Algorithm

    - Algorithm only accepts numeric values
    - Data cleaning (fillna, dropna)
    - Text and categorical attributes to numerics
    - ...

5. Select and train your model


6. Fine-tune your model

    - Cross-validation
    - Try different models and ensemble them
    - Evaluate your model on the test set

7. Launch, monitor, and maintain your model
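
Step 4 of the pipeline above (data cleaning and converting categorical attributes to numerics) can be sketched with Pandas as follows. The table, its column names, and the choice of filling with the median are hypothetical and only serve as an illustration; other strategies such as `dropna` may suit your data better.

```python
import pandas as pd

# Hypothetical toy table: a numeric column with a missing value
# and a categorical (text) column
df = pd.DataFrame({
    "area": [50.0, None, 80.0, 65.0],
    "city": ["Paris", "Lyon", "Paris", "Nice"],
})

# Data cleaning: fill the missing numeric value with the column median
df["area"] = df["area"].fillna(df["area"].median())

# Categorical attribute to numerics: one-hot encode the city column
df_num = pd.get_dummies(df, columns=["city"], dtype=float)

print(df_num)
```

After these two steps every column is numeric and contains no missing values, so the table can be passed to a scikit-learn model.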



# Supervised learning

Supervised learning means that you can use both input $x$ and output $y$ to train a model. The output $y$ is also called *labels*.

If the labels are "continuous" real numbers (e.g., stock price, house price, insurance fee, height, weight, etc.), then it is a **regression problem**.

If the labels are discrete categories (e.g., dog vs. cat, True vs. False, handwritten digits (0-9), 0 vs. 1, or -1 vs. 1, etc.), then it is a **classification problem**.


We will also discuss unsupervised learning and semi-supervised learning problems.


## Why do we do train/test split?

- Remember that our goal is to create a model that generalizes to unseen data.

- It is easy to design trivial models that perform well on the training data set but fail on new data points. This phenomenon is called **overfitting**.
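
As a small illustration of overfitting, the sketch below uses a 1-nearest-neighbor classifier, which essentially memorizes the training set. The synthetic dataset and all parameter values (sample size, label noise `flip_y`, random seeds) are assumptions chosen for the demonstration. Without a held-out test set, the perfect training accuracy would give a misleading picture.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 10% label noise (hypothetical settings)
X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=5, flip_y=0.1,
                                     random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# With k = 1, every training point is its own nearest neighbor,
# so the model reproduces the training labels exactly
memorizer = KNeighborsClassifier(n_neighbors=1).fit(Xd_train, yd_train)

train_acc = memorizer.score(Xd_train, yd_train)
test_acc = memorizer.score(Xd_test, yd_test)
print(f'Training accuracy is {train_acc:.2f}')
print(f'Test accuracy is {test_acc:.2f}')
```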


## No Free Lunch (NFL) theorem:

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. To decide what data to discard and what data to keep, you must make *assumptions*. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between instances and the straight line is just noise, which can safely be ignored.

In a famous 1996 paper, see [here](https://direct.mit.edu/neco/article/8/7/1341/6016/The-Lack-of-A-Priori-Distinctions-Between-Learning), David Wolpert demonstrated that if you make no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. There is no model that is a priori guaranteed to work better. The only way to know for sure which model is best is to evaluate them all. 

Since this is not possible, in practice you make some reasonable assumptions (based on visualization or data preprocessing) and evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization, and for a complex problem you may evaluate various neural networks.
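
A minimal sketch of "evaluating a few reasonable models" is shown below: linear models with three levels of regularization are compared by cross-validation. The use of `Ridge` (a linear model with an L2 penalty), the specific `alpha` values, and the synthetic dataset are assumptions made for illustration, not a recommended search grid.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic regression data
X_nfl, y_nfl = make_regression(n_samples=500, n_features=20,
                               n_informative=5, noise=5.0, random_state=0)

# Compare linear models with different regularization strengths
results = {}
for alpha in (0.01, 1.0, 100.0):
    # scikit-learn reports negated MSE so that larger is always better
    scores = cross_val_score(Ridge(alpha=alpha), X_nfl, y_nfl, cv=5,
                             scoring='neg_mean_squared_error')
    results[alpha] = -scores.mean()
    print(f'alpha = {alpha}: cross-validated MSE is {results[alpha]:.2f}')

# Keep the candidate with the lowest cross-validated error
best_alpha = min(results, key=results.get)
print(f'Best alpha is {best_alpha}')
```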

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Regression example

In this part, we make a fake regression dataset and then train a linear regression model. The main goal is to review the basic machine learning process.

Generating fake dataset:

    sklearn.datasets.make_regression(n_samples=100, n_features=100, *, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)

In [2]:
from sklearn.datasets import make_regression
X, Y = make_regression(n_samples = 20000, n_features = 100, noise = 1, n_informative = 10)
print(X.shape, Y.shape)

(20000, 100) (20000,)


Train test split

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2)

Usually, we need to normalize the training data, as we did in PCA. I skip this step here because the data is randomly generated and all columns are drawn from the same distribution, so they are on the same scale. 
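
For data where the columns are on different scales, the normalization step might look like the sketch below. It uses `StandardScaler` on a hypothetical two-column dataset (the variable names and scales are made up for illustration). The key point is that the mean and standard deviation are estimated on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales
Xs_train = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 200.0], size=(200, 2))
Xs_test = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 200.0], size=(50, 2))

# Estimate mean and standard deviation from the training data only
scaler = StandardScaler().fit(Xs_train)

# Apply the same transformation to both training and test data
Xs_train_scaled = scaler.transform(Xs_train)
Xs_test_scaled = scaler.transform(Xs_test)

print(Xs_train_scaled.mean(axis=0), Xs_train_scaled.std(axis=0))
```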

After normalization, we can train the model and evaluate it on the test data.

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# fit model 
reg = LinearRegression().fit(X_train, y_train)

# evaluate on the test data
y_pred = reg.predict(X_test)

# Since it is a regression problem, mean squared error is a common metric to evaluate the performance
# you can use the built-in command to compute the error
mse1 = mean_squared_error(y_test, y_pred)
print(f'MSE is {mse1:.5f}')

# or you can use your own code
mse2 = np.linalg.norm(y_pred - y_test)**2 / np.size(y_test)
print(f'MSE is {mse2:.5f}')

MSE is 0.99523
MSE is 0.99523
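
Both computations above evaluate the same quantity. Writing $n$ for the number of test points, $y_i$ for the true labels, and $\hat{y}_i$ for the predictions (notation introduced here for illustration), the mean squared error is

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2 = \frac{\lVert \hat{y} - y \rVert_2^2}{n}.
$$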


In [5]:
reg.coef_

array([ 5.25900957e-03,  1.26206471e-02,  7.92951221e-03,  1.01083233e-02,
       -1.87240543e-03, -1.25996529e-02, -9.92589950e-03, -6.35133876e-03,
       -1.32279044e-02,  4.59877729e-03, -7.23562740e-03, -1.18436903e-03,
       -7.75593597e-04, -2.38016302e-02,  6.83823076e-03, -5.43839086e-03,
        6.64618074e-03,  1.04601593e-03,  2.28782510e-03,  4.37496975e+01,
        1.48472242e+01, -9.13927802e-03,  7.36322324e+00, -1.17209486e-02,
        8.13902420e-03,  6.63583962e-03,  8.90309040e-03, -1.12265377e-03,
       -6.02611083e-03,  6.92299701e-03,  1.20132060e-03, -8.97781761e-03,
        1.02341201e-02, -5.34153162e-03,  2.36666721e-03, -6.85268370e-03,
       -2.00303316e-03, -1.02342395e-03,  9.90335935e+01,  2.29086277e-04,
       -1.36800265e-02,  1.71452511e-02,  8.69351526e-03, -3.45400517e-03,
       -8.10230599e-03, -5.06575363e-03,  5.60991530e-04, -3.54729976e-03,
        1.78668564e-03,  9.35637469e+01,  8.13531819e-03,  4.79125350e-03,
       -1.20329866e-02,  

# Feature selection

Notice that there are ten informative features. Let's try to select them.

In [6]:
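# count the number of informative features
# (use the absolute value so that large negative coefficients would also count)
print(np.sum(np.abs(reg.coef_) >= 0.1))

# select the indices of those features
idx = np.flatnonzero(np.abs(reg.coef_) >= 0.1)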

# define a new training and test dataset
X_train_reduced = X_train[:, idx]
X_test_reduced = X_test[:, idx]

# fit model 
reg1 = LinearRegression().fit(X_train_reduced, y_train)

# evaluate on the test data
y_pred = reg1.predict(X_test_reduced)

# compute error
mse1 = mean_squared_error(y_test, y_pred)
print(f'MSE is {mse1:.5f}')

# or you can use your own code
mse2 = np.linalg.norm(y_pred-y_test)**2/np.size(y_test)
print(f'MSE is {mse2:.5f}')

10
MSE is 0.98571
MSE is 0.98571


# Classification example

We can use the `make_classification` command to generate synthetic data samples.

    sklearn.datasets.make_classification(n_samples=100, n_features=20, *, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

In [7]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, Y = make_classification(n_samples = 10000, n_features = 30, n_informative = 6, n_classes = 4)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)

## Model 1: K-nearest neighbors

#### Algorithm explanation:

An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

#### Regression:
The K-nearest neighbors algorithm can also be used for regression problems. In k-NN regression, the output is the property value for the object. This value is the average of the values of the k nearest neighbors. If k = 1, then the output is simply the value of that single nearest neighbor.
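
The averaging rule can be checked on a tiny hand-made example. The sketch below uses `KNeighborsRegressor` with a hypothetical one-dimensional dataset; the query point 1.2 is closest to x = 1 and second closest to x = 2.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: four points on a line with values 0, 10, 20, 30
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 10.0, 20.0, 30.0])
query = np.array([[1.2]])

# k = 1: the prediction is the value of the single nearest neighbor (x = 1)
knn1 = KNeighborsRegressor(n_neighbors=1).fit(X_toy, y_toy)
pred_k1 = knn1.predict(query)[0]

# k = 2: the prediction is the average over the two nearest neighbors (x = 1, x = 2)
knn2 = KNeighborsRegressor(n_neighbors=2).fit(X_toy, y_toy)
pred_k2 = knn2.predict(query)[0]

print(pred_k1)  # 10.0
print(pred_k2)  # 15.0
```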

#### Documentation:
Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

In [8]:
from sklearn.neighbors import KNeighborsClassifier

# train the model
neigh = KNeighborsClassifier(n_neighbors = 10)
neigh.fit(X_train, y_train)

# test accuracy
y_pred = neigh.predict(X_test)
acc = neigh.score(X_test, y_test)
print(f'Test accuracy is {acc:.2f}')

# compute test accuracy without using score
acc1 = np.sum(y_pred.reshape(y_test.shape) == y_test) / np.size(y_test)
print(f'Test accuracy is {acc1:.2f}')

# return probability
prob = neigh.predict_proba(X_test)
print(prob)

Test accuracy is 0.73
Test accuracy is 0.73
[[0.  0.2 0.7 0.1]
 [0.1 0.9 0.  0. ]
 [0.3 0.6 0.  0.1]
 ...
 [0.5 0.  0.  0.5]
 [0.  0.8 0.2 0. ]
 [0.1 0.8 0.  0.1]]


You can use cross-validation to select the best number of neighbors.
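
One possible way to do this is with `GridSearchCV`, as in the sketch below. The candidate values of `n_neighbors`, the 5-fold setting, and the synthetic dataset are assumptions chosen for illustration. Note that the search only uses the training data; the test set is kept for the final evaluation.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic classification data
X_cv, y_cv = make_classification(n_samples=1000, n_features=10,
                                 n_informative=4, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cv, y_cv, test_size=0.2, random_state=0)

# Try several values of k with 5-fold cross-validation on the training set
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={'n_neighbors': [1, 3, 5, 10, 20]},
                      cv=5)
search.fit(Xc_train, yc_train)

best_k = search.best_params_['n_neighbors']
print(f'Best number of neighbors is {best_k}')

# The search object refits the best model on the whole training set,
# so it can be evaluated directly on the test set
cv_test_acc = search.score(Xc_test, yc_test)
print(f'Test accuracy is {cv_test_acc:.2f}')
```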

## Logistic Regression

See Wikipedia for an explanation of the model: https://en.wikipedia.org/wiki/Logistic_regression

**Logistic regression is designed for classification problems only. In other words, you cannot use this model for regression problems.**

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
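
To make the model a little more concrete, here is a minimal sketch for the two-class case, assuming the default `LogisticRegression` settings. The predicted probability of class 1 is the sigmoid function applied to a linear score $z = w^T x + b$ (the symbols $w$, $b$, $z$ are notation introduced here). The dataset is synthetic and hypothetical; with more than two classes, as in the example used in this notebook, the same idea is generalized with more than one score per sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical two-class dataset
X_bin, y_bin = make_classification(n_samples=500, n_features=5,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)
bin_clf = LogisticRegression().fit(X_bin, y_bin)

# Linear score z = w^T x + b from the fitted coefficients
z = X_bin @ bin_clf.coef_.ravel() + bin_clf.intercept_[0]

# Sigmoid maps the score to a probability between 0 and 1
p_manual = 1.0 / (1.0 + np.exp(-z))

# Compare with the probability of class 1 returned by scikit-learn
p_sklearn = bin_clf.predict_proba(X_bin)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```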

In [9]:
# train the model
from sklearn.linear_model import LogisticRegression

Log = LogisticRegression().fit(X_train, y_train)

# test accuracy
y_pred = Log.predict(X_test)
acc = Log.score(X_test, y_test)
print(f'Test accuracy is {acc:.2f}')

# compute test accuracy without using score
acc1 = np.sum(y_pred.reshape(y_test.shape) == y_test) / np.size(y_test)
print(f'Test accuracy is {acc1:.2f}')

# return probability
prob = Log.predict_proba(X_test)
print(prob)

Test accuracy is 0.61
Test accuracy is 0.61
[[0.033916   0.13411484 0.79754229 0.03442688]
 [0.0495778  0.93354834 0.00899332 0.00788055]
 [0.21904958 0.54030867 0.02962937 0.21101239]
 ...
 [0.54630145 0.0078421  0.02151895 0.4243375 ]
 [0.01094789 0.5367038  0.43217756 0.02017076]
 [0.33666718 0.47591573 0.04511426 0.14230284]]


## Tree-based model

A decision tree looks like this: ![](https://scikit-learn.org/stable/_images/sphx_glr_plot_iris_dtc_002.png)


A decision tree is a flowchart-like structure in which each internal node represents a "test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.
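
The "tests" and classification rules can be printed as text with `export_text`. The sketch below fits a small tree on the iris dataset; the choice of dataset and `max_depth = 2` are assumptions made to keep the printed rules short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree, so that the rules stay readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
small_tree.fit(iris.data, iris.target)

# Each indented line is a test on one attribute; each leaf shows the predicted class
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
```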

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

In [10]:
from sklearn.tree import DecisionTreeClassifier

Tree = DecisionTreeClassifier(max_depth = 10).fit(X_train, y_train)

# test accuracy
y_pred = Tree.predict(X_test)
acc = Tree.score(X_test, y_test)
print(f'Test accuracy is {acc:.2f}')

# compute test accuracy without using score
acc1 = np.sum(y_pred.reshape(y_test.shape) == y_test) / np.size(y_test)
print(f'Test accuracy is {acc1:.2f}')

# return probability
prob = Tree.predict_proba(X_test)
print(prob)

Test accuracy is 0.74
Test accuracy is 0.74
[[0.         0.         1.         0.        ]
 [0.03246753 0.94805195 0.01298701 0.00649351]
 [0.14285714 0.53571429 0.14285714 0.17857143]
 ...
 [0.525      0.025      0.         0.45      ]
 [0.         0.98717949 0.         0.01282051]
 [0.03246753 0.94805195 0.01298701 0.00649351]]


## Ensemble Learning

To better understand the random forest model, we should first understand **Ensemble Learning**, which is an important technique in Machine Learning.

#### Motivation:

Suppose that you pose a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the *wisdom of the crowd*. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus this technique is called Ensemble Learning.
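
The wisdom of the crowd can be simulated in a few lines. The sketch below assumes that every voter answers independently and is correct with probability 0.55; both the independence and the numbers are idealized assumptions. Real classifiers trained on the same data make correlated errors, so the gain is usually smaller in practice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setting: 1001 independent voters, 2000 yes/no questions,
# each voter is correct with probability 0.55
n_voters, n_questions, p_correct = 1001, 2000, 0.55

# correct[i, j] is True when voter j answers question i correctly
correct = rng.random((n_questions, n_voters)) < p_correct

# Accuracy of a typical individual voter
individual_acc = correct.mean()

# The majority is right when more than half of the voters are right
majority_acc = (correct.sum(axis=1) > n_voters / 2).mean()

print(f'Individual accuracy is {individual_acc:.3f}')
print(f'Majority vote accuracy is {majority_acc:.3f}')
```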

#### When do we use ensemble model?

Usually, an ensemble model works better than a single model, but there is no guarantee. In my opinion, since we must try several individual models anyway because of the No Free Lunch theorem, it does not hurt to ensemble all the models you have tried and look at the performance on the test dataset.

#### Example:

Let's ensemble the K-nearest neighbors, Logistic Regression, and Decision Tree models together.

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html

In [11]:
from sklearn.ensemble import VotingClassifier

# define classifiers you want to use
kn_clf = KNeighborsClassifier()
log_clf = LogisticRegression()
tree_clf = DecisionTreeClassifier()

# define max vote classifier
voting_clf = VotingClassifier(estimators = [('kn', kn_clf),
                                            ('lr', log_clf),
                                            ('tree', tree_clf)],
                             voting = 'hard')
# train max vote classifier
voting_clf.fit(X_train, y_train)

# let's look at each classifier's accuracy on the test set:
for clf in (kn_clf, log_clf, tree_clf, voting_clf):
    
    clf.fit(X_train, y_train)
    
    acc = clf.score(X_test, y_test)
    
    print(clf.__class__.__name__, '\n\n', acc, '\n\n')

KNeighborsClassifier 

 0.716 


LogisticRegression 

 0.609 


DecisionTreeClassifier 

 0.7135 


VotingClassifier 

 0.738 




## Random Forest

In short, random forest is an ensemble of decision trees. 

Each Decision Tree classifier in the group is trained on a different random subset of the training set. Then, you use max vote (ensemble technique) to obtain the prediction. 
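
To illustrate the idea, here is a hand-written sketch of such an ensemble: each tree is trained on a bootstrap sample (rows drawn with replacement), and the predictions are combined by majority vote. This is a simplified illustration and not the actual implementation of `RandomForestClassifier`; in particular, scikit-learn's version combines the trees by averaging their predicted probabilities rather than by hard voting. The dataset, the number of trees, and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic dataset with three classes
X_bag, y_bag = make_classification(n_samples=2000, n_features=20,
                                   n_informative=6, n_classes=3,
                                   random_state=0)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_bag, y_bag, test_size=0.2, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25
votes = np.empty((n_trees, len(yb_test)), dtype=int)

for t in range(n_trees):
    # random subset of the training set: sample rows with replacement
    rows = rng.integers(0, len(yb_train), size=len(yb_train))
    # max_features='sqrt' also randomizes the features tried at each split
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=t)
    tree.fit(Xb_train[rows], yb_train[rows])
    votes[t] = tree.predict(Xb_test)

# max vote: count the votes per class for every test point (one column each)
n_classes = len(np.unique(y_bag))
counts = np.apply_along_axis(np.bincount, 0, votes, minlength=n_classes)
bagged_pred = counts.argmax(axis=0)

bagged_acc = np.mean(bagged_pred == yb_test)
print(f'Test accuracy of the hand-made forest is {bagged_acc:.2f}')
```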


Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html


In [13]:
from sklearn.ensemble import RandomForestClassifier

# train random forest model
rf = RandomForestClassifier().fit(X_train, y_train)

# accuracy
print( f'Test accuracy is {rf.score(X_test, y_test)}', '\n\n')

# return probability
prob = rf.predict_proba(X_test)
print(prob)

Test accuracy is 0.798 

[[0.04 0.09 0.77 0.1 ]
 [0.08 0.81 0.06 0.05]
 [0.21 0.46 0.13 0.2 ]
 ...
 [0.46 0.01 0.03 0.5 ]
 [0.   0.83 0.11 0.06]
 [0.21 0.71 0.   0.08]]
