# Machine Learning with scikit-learn

This notebook walks through an example of building a logistic regression model on the Iris dataset.

## 1. Module Imports

### Pandas

We use `pandas` for data manipulation. Standard practice is to alias it as `pd`.

### Matplotlib

We use `matplotlib` for data visualization. Standard practice is to alias its `pyplot` module as `plt`.

### Scikit-learn

We use `scikit-learn` for machine learning. We are using 3 functions/classes from scikit-learn.

1. `load_iris`: Function for loading the dataset we'll be using
2. `train_test_split`: Function for splitting the data into a training set and test set
3. `LogisticRegression`: Class for the Logistic Regression classification model

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

## 2. Loading the data

We are working with the Iris dataset, which has measurements of different Iris flowers. There are three different types of Irises.

The `DESCR` attribute gives information about the dataset.

In [None]:
iris = load_iris()
print(iris.DESCR)
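
Besides `DESCR`, the object returned by `load_iris` carries the data and its labels as attributes. A minimal sketch of inspecting them follows; the variable name `bunch` is only for illustration and is not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

bunch = load_iris()

# Feature matrix: one row per flower, one column per measurement
print(bunch.data.shape)
# Names of the four measurement columns
print(bunch.feature_names)
# Class labels 0, 1 and 2 map to these species names, in order
print(list(bunch.target_names))
```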

## 3. Pandas DataFrame

We create a Pandas DataFrame with our data. This is a datatype that is used for storing a table of data.

We filter the data to keep only classes 1 and 2 (dropping class 0, Iris-Setosa), which makes the classification problem easier.

The `head` method returns the first 5 rows.

In [None]:
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df = iris_df[iris_df['target'] >= 1] # use only 2 classes
iris_df.head()

## 4. Graph the Data

We use matplotlib to graph the data. We create a scatter plot where class 1 (Iris-Versicolour) is drawn as blue triangles and class 2 (Iris-Virginica) as orange circles. We use just two of the features (petal length and petal width) so that the data can be plotted in two dimensions.

In [None]:
plt.figure(figsize=(10, 8))
xlabel, ylabel = "petal length (cm)", "petal width (cm)"
plt.xlabel(xlabel)
plt.ylabel(ylabel)
class1 = iris_df[iris_df["target"] == 1]
class2 = iris_df[iris_df["target"] == 2]
plt.scatter(class1[xlabel], class1[ylabel], marker='^', label="Iris-Versicolour")
plt.scatter(class2[xlabel], class2[ylabel], label="Iris-Virginica")
plt.legend()

## 5. Create feature matrix and target array

We create a feature matrix `X` that has the values for all the datapoints and the two features that we're using.

We create a target array `y` that holds the target values (either 1 for Iris-Versicolour or 2 for Iris-Virginica).

In [None]:
X = iris_df[[xlabel, ylabel]].values
y = iris_df["target"].values
print(X)
print(y)

## 6. Split the data into training set and test set

We use the `train_test_split` function from scikit-learn to split the dataset into a training set and a test set.

We can see the shape (size) of the original and the new feature matrices and target arrays.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
print("feature matrix size:", X.shape)
print("training feature matrix size:", X_train.shape)
print("test feature matrix size:", X_test.shape)
print("target array size:", y.shape)
print("training target array size:", y_train.shape)
print("test target array size:", y_test.shape)
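
By default, `train_test_split` shuffles the rows and holds out 25% of them for the test set, so every run produces a different split. The sketch below shows one way to make the split repeatable and keep both classes equally represented. The value `random_state=0` and the names `X_demo`, `y_demo`, `split_a` and `split_b` are arbitrary choices for this illustration, not part of the notebook above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Rebuild the two-class, two-feature data used in this notebook
iris = load_iris()
mask = iris.target >= 1
X_demo = iris.data[mask][:, 2:4]  # columns 2 and 3: petal length, petal width
y_demo = iris.target[mask]

# random_state fixes the shuffle; stratify keeps the class ratio in both sets
split_a = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, random_state=0, stratify=y_demo)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)

# Calling the function twice with the same random_state gives identical arrays
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("same split both times:", same)
```

A fixed `random_state` is also what makes it possible to compare several models fairly on exactly the same rows.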

## 7. Build the model

Create an instance of the `LogisticRegression` class and use the `fit` method to build the model.

In [None]:
iris_model = LogisticRegression().fit(X_train, y_train)

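Older scikit-learn releases (before 0.22) emit a `FutureWarning` here because the default solver was about to change. To make the warning go away, we can specify the solver explicitly. On newer releases `lbfgs` is already the default, so the cell below behaves the same as the one above.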

In [None]:
iris_model = LogisticRegression(solver='lbfgs').fit(X_train, y_train)

This determines the coefficients, which we can print out.

In [None]:
print(iris_model.coef_, iris_model.intercept_)
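
The coefficients and intercept define a linear score for each datapoint: each feature is multiplied by its coefficient, and the intercept is added. The model predicts the second class when that score is positive, and passing the score through the sigmoid function gives the estimated probability of the second class. The sketch below recomputes both by hand and checks them against scikit-learn. It fits its own model (`demo_model`, a name chosen only for this example) on all 100 rows so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Rebuild the two-class, two-feature data used in this notebook
iris = load_iris()
mask = iris.target >= 1
X_demo = iris.data[mask][:, 2:4]  # petal length, petal width
y_demo = iris.target[mask]

demo_model = LogisticRegression().fit(X_demo, y_demo)

# Linear score for every row: coefficient-weighted features plus intercept
z = X_demo @ demo_model.coef_.ravel() + demo_model.intercept_[0]
# Sigmoid turns the score into a probability for the second class (class 2)
prob_class2 = 1.0 / (1.0 + np.exp(-z))
# Positive score -> second entry of classes_, otherwise the first
manual_pred = demo_model.classes_[(z > 0).astype(int)]

scores_match = np.allclose(z, demo_model.decision_function(X_demo))
probs_match = np.allclose(prob_class2, demo_model.predict_proba(X_demo)[:, 1])
preds_match = np.array_equal(manual_pred, demo_model.predict(X_demo))
print(scores_match, probs_match, preds_match)
```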

## 8. Make predictions

We use the `predict` method to make predictions. First, let's make a single prediction for the first datapoint in `X_test` and compare it to the target.

In [None]:
print("datapoint:", X_test[0])
print("target:", y_test[0])
print("prediction:", iris_model.predict([X_test[0]]))

Now let's make predictions for all the datapoints in `X_test`. We create a DataFrame of the feature values, target and the prediction so we can easily print it out.

In [None]:
y_predict = iris_model.predict(X_test)
pd.DataFrame({
    xlabel: X_test[:, 0],
    ylabel: X_test[:, 1],
    "target": y_test,
    "prediction": y_predict,
    "correct?": y_test == y_predict})

## 9. Calculate the score

We can count how many of the predictions were correct.

In [None]:
num_correct = (y_test == y_predict).sum()
num_datapoints = y_test.shape[0]
print(f"accuracy = {num_correct} / {num_datapoints} = {num_correct / num_datapoints}")

More simply, use the `score` method to calculate the accuracy score on the test set.

In [None]:
print(iris_model.score(X_test, y_test))

# Your Turn!

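Feel free to start with any exercise.

Here are the general steps to follow when building a model:

1. Load the data.
2. Create the feature matrix `X` and target array `y`.
3. Split the data into a training set and a test set.
4. Create a model and use the `fit` method with the training set.
5. Use the `score` method with the test set.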

## Exercise 1

Use all four features (instead of just petal length and petal width) in the model for the Iris dataset.

1. Create an alternative feature matrix `X1` that has all four features.
2. Use the `train_test_split` function to create training and test sets.
3. Create a `LogisticRegression` model and use the `fit` method with the training set.
4. Use the `score` method with the test set.

## Exercise 2

Import another model from scikit-learn and compare its score with that of logistic regression on the Iris dataset.

**Don't run `train_test_split` again. Use the same train and test sets for all the algorithms.**

Here are the docs & import statements for the other classification models:

* [Decision tree for classification](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html): `from sklearn.tree import DecisionTreeClassifier`
* [Random forest for classification](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html): `from sklearn.ensemble import RandomForestClassifier`
* [Neural network for classification](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html): `from sklearn.neural_network import MLPClassifier`

Note that the `fit`, `predict` and `score` methods are the same for all the models.

1. Start with the training and test sets.
2. Create a `DecisionTreeClassifier` model and use the `fit` method with the training set.
3. Use the `score` method with the test set.

Which model has the highest accuracy score?

## Exercise 3

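Try building a model for the Boston housing regression dataset.

Here's the import statement: `from sklearn.datasets import load_boston`. Note that `load_boston` was removed in scikit-learn 1.2, so on newer releases this import fails; another built-in regression dataset such as `load_diabetes` can be substituted there.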

Follow the same syntax as with the `load_iris` function.

Here are the docs & import statements for the regression models:

* [Linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html): `from sklearn.linear_model import LinearRegression`
* [Decision tree for regression](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html): `from sklearn.tree import DecisionTreeRegressor`
* [Random forest for regression](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html): `from sklearn.ensemble import RandomForestRegressor`
* [Neural network for regression](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor): `from sklearn.neural_network import MLPRegressor`

**Make sure to just run `train_test_split` once and not with each model.**

1. Load the data with the `load_boston` function.
2. Create feature matrix `X` and target array `y`.
3. Run `train_test_split` to create training and test sets.
4. For each model:
    * Create a model and use the `fit` method with the training set.
    * Use the `score` method with the test set.

Which model has the highest R2 score?
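
The "split once, then loop over models" pattern from step 4 can be sketched as follows. Two assumptions are made here: because `load_boston` is no longer available from scikit-learn 1.2 onward, the sketch substitutes `load_diabetes`, another built-in regression dataset; and `random_state=0` is an arbitrary value used to keep runs repeatable. Only two of the four models are shown, and the variable names are illustrative.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in regression dataset (assumption: used in place of load_boston)
data = load_diabetes()
X_reg, y_reg = data.data, data.target

# Split once so every model is trained and scored on the same rows
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, random_state=0
)

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(Xr_train, yr_train)
    # For regressors, score returns R^2 on the given data
    scores[name] = model.score(Xr_test, yr_test)
    print(f"{name}: {scores[name]:.3f}")
```

Adding the remaining models is a matter of extending the `models` dictionary; the loop itself stays unchanged.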

## Even more

Try building models for other [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) or some [real world datasets](https://scikit-learn.org/stable/datasets/real_world.html). Or find a fun challenge on [kaggle.com](https://www.kaggle.com/).