# Machine Learning

Machine learning is the science (and art) of programming computers so they can learn from data. 

For example:
- Ranking your web search results (Google)
- Recommendation systems (Amazon)
- AlphaGo
- ChatGPT and other large language models (LLMs)
- Spam filter
- Image classification
- Detecting credit card fraud
- ...

In general, we hope to use previous data to design (train) an algorithm (model), and then use the trained model to make predictions when we receive new data.

- Example 1: We can use emails (data) to construct a spam filter (algorithm, model), and then use the spam filter to predict whether a new email (new data) is spam or not.

- Example 2: We can use your current friends list (data) to construct a friend recommendation system (algorithm, model), and then use this model to find people (new data) whom you may know.

- ...

So, our goal is to find a model that can be used on **unseen data**.


## Machine Learning pipeline:

1. Look at the Big Picture:
    
    - You should understand your task, e.g., regression (predicting housing prices) or classification (recognizing handwritten digits).
    - Select a performance measure
    - ...

2. Get the Data

    - Download the data from the internet
    - ...

3. Discover and visualize your data

    - Pandas and Matplotlib
    - ...

4. Prepare your data for machine learning algorithms

    - Most algorithms accept only numeric values
    - Data cleaning (`fillna`, `dropna`)
    - Convert text and categorical attributes to numeric values
    - ...

5. Select and train your model


6. Fine-tune your model

    - Cross-validation
    - Try different models and ensemble them
    - Evaluate your model on the test set

7. Launch, monitor, and maintain your model
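
As a rough illustration of steps 4 to 6, the sketch below cleans a small table, converts a categorical column to numbers, and then trains and evaluates a model. The toy housing data, the column names (`area`, `city`, `price`), and the choice of median imputation are all assumptions made up for this example; a real project would load its own data in step 2.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical toy housing data (made up for illustration only).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "area": rng.uniform(50, 200, size=n),
    "city": rng.choice(["A", "B", "C"], size=n),
})
df["price"] = 3.0 * df["area"] + rng.normal(0, 10, size=n)
df.loc[::20, "area"] = np.nan  # inject a few missing values

# Step 4: data cleaning, then categorical -> numeric (one-hot encoding).
df["area"] = df["area"].fillna(df["area"].median())
X = pd.get_dummies(df[["area", "city"]], columns=["city"], dtype=float)
y = df["price"]

# Steps 5-6: train, cross-validate on the training set, then check the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
model = LinearRegression()
cv_scores = cross_val_score(model, X_train, y_train, cv=5)  # R^2 by default
model.fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print(f"CV R^2: {cv_scores.mean():.3f}, test R^2: {test_r2:.3f}")
```

The test set is touched only once, at the very end, so that it remains an honest estimate of performance on unseen data.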



# Supervised learning

Supervised learning means that you can use both input $x$ and output $y$ to train a model. The output $y$ is also called *labels*.

If the labels are continuous real numbers (e.g., stock price, house price, insurance fee, height, weight, etc.), then it is a **regression problem**.

If the labels are discrete categories (e.g., dog vs. cat, True vs. False, handwritten digits (0-9), 0 vs. 1, or -1 vs. 1, etc.), then it is a **classification problem**.


We will also discuss unsupervised learning and semi-supervised learning problems.


## Why do we do train/test split?

- Remember that our goal is to create a model that generalizes to unseen data.

- It is easy to design trivial models that perform well on the training data set but fail on new data points. This phenomenon is called **overfitting**. Evaluating on a held-out test set, which the model never sees during training, reveals it.
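
One way to see overfitting is with a model that simply memorizes the training set. The sketch below uses a 1-nearest-neighbor classifier as that "trivial" model; the synthetic dataset, its size, and the amount of label noise (`flip_y=0.2`) are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with noisy labels (parameters are illustrative assumptions).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# With one neighbor, every training point is its own nearest neighbor,
# so the model reproduces the training labels exactly.
memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_acc = memorizer.score(X_train, y_train)
test_acc = memorizer.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```

The training accuracy alone would suggest a perfect model; only the held-out test accuracy shows how much of that is memorization.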


## No Free Lunch (NFL) theorem:

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. To decide what data to discard and what data to keep, you must make *assumptions*. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between instances and the straight line is just noise, which can safely be ignored.

In a famous 1996 paper, see [here](https://direct.mit.edu/neco/article/8/7/1341/6016/The-Lack-of-A-Priori-Distinctions-Between-Learning), David Wolpert demonstrated that if you make no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. No model is guaranteed a priori to work better. The only way to know for sure which model is best is to evaluate them all.

Since this is not possible, in practice you make some reasonable assumptions (based on visualization or data preprocessing) and evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization, and for a complex problem you may evaluate various neural networks.
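
As a small sketch of "evaluating a few reasonable models", the code below compares ridge regression (a linear model with L2 regularization) at several regularization strengths using 5-fold cross-validation. The dataset parameters and the candidate `alpha` values are assumptions picked for illustration, not recommended defaults.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data (parameters are illustrative assumptions).
X, y = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=10.0, random_state=0
)

alphas = [0.01, 1.0, 100.0]  # candidate regularization strengths
cv_mse = {}
for alpha in alphas:
    # scikit-learn maximizes scores, so MSE is reported with a negative sign.
    scores = cross_val_score(
        Ridge(alpha=alpha), X, y, cv=5, scoring="neg_mean_squared_error"
    )
    cv_mse[alpha] = -scores.mean()

best_alpha = min(cv_mse, key=cv_mse.get)
for alpha, mse in cv_mse.items():
    print(f"alpha={alpha:>6}: CV MSE = {mse:.2f}")
print("best alpha:", best_alpha)
```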

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Regression example

In this part, we make a fake regression dataset and then train a linear regression model. The main goal is to review the basic machine learning process.

Generating a fake dataset:

    sklearn.datasets.make_regression(n_samples=100, n_features=100, *, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)

In [2]:
from sklearn.datasets import make_regression

X, y = make_regression(n_samples = 20000, n_features = 100, n_informative = 10, noise = 1)

print(X.shape, y.shape)

(20000, 100) (20000,)


Train/test split:

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

Usually, we need to normalize the training data, as we did in PCA. I skip this step here because the data is randomly generated and all columns come from the same distribution, so they are already on the same scale.

After normalization, we can train the model and evaluate it on the test data.
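
For data whose columns are on different scales, the normalization step could look like the sketch below. It uses `StandardScaler`, which is one common choice and not necessarily the method used in the PCA lecture; the two-feature toy data is made up for illustration. The scaler is fitted on the training set only and then applied to both sets, so no information from the test set leaks into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on very different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 200.0], size=(800, 2))
X_test = rng.normal(loc=[0.0, 1000.0], scale=[1.0, 200.0], size=(200, 2))

# Learn the column means and standard deviations from the training set only.
scaler = StandardScaler().fit(X_train)

# Apply the same transformation to both sets.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(3), X_train_scaled.std(axis=0).round(3))
```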

In [4]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# fit model

reg = LinearRegression().fit(X_train, y_train)

# evaluate on the test data

y_pred = reg.predict(X_test)

# test error: y_pred and y_test

mse1 = np.linalg.norm(y_pred - y_test) ** 2 / np.size(y_test)

mse2 = mean_squared_error(y_test, y_pred)

print(f'MSE is {mse1:.4f}')
print(f'MSE is {mse2:.4f}')

MSE is 0.9994
MSE is 0.9994
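
The two numbers agree because both lines compute the same quantity. Writing $n$ for the number of test samples and $\hat{y}_i$ for the prediction on the $i$-th test sample (notation introduced here for illustration), the mean squared error is

$$
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{\lVert y - \hat{y} \rVert_2^2}{n}.
$$

Since the data was generated with `noise = 1`, i.e. Gaussian noise with standard deviation 1, a test MSE close to 1 is about as good as any model can do here.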


In [5]:
# reg.coef_

### Feature selection

Notice that there are ten informative features. Let's try to select them.

In [6]:
# count the number of informative features
# (compare absolute values so that negative coefficients are not missed)

print(np.sum(np.abs(reg.coef_) >= 0.1))

# select the index

idx = np.argwhere(np.abs(reg.coef_) >= 0.1).squeeze()

print(idx)

# redefine the new training and test data

X_train_reduced = X_train[:, idx]

X_test_reduced = X_test[:, idx]

# fit model

reg1 = LinearRegression().fit(X_train_reduced, y_train)

# evaluate on the test dataset

y_pred1 = reg1.predict(X_test_reduced)

# test error: y_pred and y_test

mse = mean_squared_error(y_test, y_pred1)

print(f'MSE is {mse:.4f}')

10
[46 49 53 57 58 66 73 78 90 93]
MSE is 0.9912


# Classification example

We can use the `make_classification` function to generate synthetic data samples.

    sklearn.datasets.make_classification(n_samples=100, n_features=20, *, n_informative=2, n_redundant=2, n_repeated=0, n_classes=2, n_clusters_per_class=2, weights=None, flip_y=0.01, class_sep=1.0, hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)

In [7]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples = 20000, n_features = 30, n_informative = 6, n_classes = 4)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

## Model 1: K-nearest neighbors

Model explanation is given during the lecture.

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

You can use cross-validation to select the best number of neighbors.
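
A minimal sketch of that selection, using `GridSearchCV` with 5-fold cross-validation. To keep it fast, it regenerates a smaller dataset (2000 samples instead of 20000); the candidate values of `n_neighbors` and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Smaller version of the dataset above (size reduced for speed).
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Try several neighborhood sizes; each is scored by 5-fold cross-validation
# on the training set only.
param_grid = {"n_neighbors": [1, 5, 15, 30]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)

best_k = search.best_params_["n_neighbors"]
knn_test_acc = search.score(X_test, y_test)  # uses the refitted best model
print(f"best k: {best_k}, test accuracy: {knn_test_acc:.3f}")
```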

## Logistic Regression

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
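
A short sketch of fitting a logistic regression classifier on the same kind of data. The reduced sample size, the fixed `random_state`, and `max_iter=1000` (raised so that the solver has room to converge) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Smaller version of the dataset above (size reduced for speed).
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba returns one probability per class; each row sums to 1.
proba = logreg.predict_proba(X_test[:3])
logreg_test_acc = logreg.score(X_test, y_test)

print(np.round(proba, 3))
print(f"test accuracy: {logreg_test_acc:.3f}")
```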

## Tree based model

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
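
A short sketch of fitting a decision tree on the same kind of data. The depth limit `max_depth=8` is an assumption for illustration; an unrestricted tree tends to keep splitting until it fits the training set perfectly, which is a typical source of overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Smaller version of the dataset above (size reduced for speed).
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Limit the depth to keep the tree from memorizing the training set.
tree = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
print(f"depth: {tree.get_depth()}")
print(f"train accuracy: {tree_train_acc:.3f}, test accuracy: {tree_test_acc:.3f}")
```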

## Ensemble Learning

To better understand the random forest model, we should first understand **Ensemble Learning**, which is an important technique in machine learning.

#### Motivation:

Suppose that you pose a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the *wisdom of the crowd*. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus this technique is called Ensemble Learning.

Let's ensemble K-nearest neighbors, logistic regression, and a decision tree model together.

Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
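
A minimal sketch of such an ensemble with `VotingClassifier`. The hyperparameters of the three base models, the reduced sample size, and the use of hard (majority) voting are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Smaller version of the dataset above (size reduced for speed).
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# voting="hard": each model casts one vote and the majority class wins.
voting = VotingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=15)),
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(max_depth=8, random_state=0)),
    ],
    voting="hard",
)
voting.fit(X_train, y_train)

# Compare each fitted base model with the ensemble on the test set.
for name, est in voting.named_estimators_.items():
    print(f"{name}: {est.score(X_test, y_test):.3f}")
voting_test_acc = voting.score(X_test, y_test)
print(f"ensemble: {voting_test_acc:.3f}")
```

The ensemble is not guaranteed to beat every base model on a given dataset; it tends to help most when the individual models make different kinds of errors.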

## Random Forest

In short, a random forest is an ensemble of decision trees.

A group of decision tree classifiers is trained, each on a different random subset of the training set. Then, you use a majority vote (an ensemble technique) to obtain the prediction.


Python documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
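
A short sketch of training a random forest on the same kind of data. The reduced sample size, `n_estimators=100`, and the fixed `random_state` are assumptions for illustration. Note that scikit-learn's implementation combines the trees by averaging their predicted class probabilities rather than counting one hard vote per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Smaller version of the dataset above (size reduced for speed).
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=6, n_classes=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Each of the 100 trees is trained on a bootstrap sample of the training set.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

forest_test_acc = forest.score(X_test, y_test)
print(f"number of trees: {len(forest.estimators_)}")
print(f"test accuracy: {forest_test_acc:.3f}")
```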
