## Machine Learning with scikit-learn

We walk through an example of building a logistic regression model on the iris dataset.

The dataset ships with scikit-learn and is loaded with `load_iris`, so there is nothing to download.

We will be using `pandas` for data manipulation, `sklearn` for machine learning models and `matplotlib` for visualization.

In [None]:
import pandas as pd
from sklearn.datasets import load_iris

In [None]:
iris = load_iris()
print(iris.DESCR)

In [None]:
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df['target'] = iris.target
iris_df = iris_df[iris_df['target'] >= 1] # use only 2 classes
iris_df.head()

In [None]:
from matplotlib import pyplot as plt

In [None]:
# GRAPH THE DATA
plt.figure(figsize=(10, 8))
xlabel, ylabel = "petal length (cm)", "petal width (cm)"
plt.xlabel(xlabel)
plt.ylabel(ylabel)
class0 = iris_df[iris_df["target"] == 1]
class1 = iris_df[iris_df["target"] == 2]
plt.scatter(class0[xlabel], class0[ylabel], marker='^', label="Iris-Versicolour")
plt.scatter(class1[xlabel], class1[ylabel], label="Iris-Virginica")
plt.legend()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [None]:
# BUILD MODEL
X = iris_df[[xlabel, ylabel]].values
y = iris_df["target"].values
X_train, X_test, y_train, y_test = train_test_split(X, y)
iris_model = LogisticRegression(solver='lbfgs').fit(X_train, y_train)

In [None]:
print(iris_model.coef_, iris_model.intercept_)

In [None]:
# MAKE SINGLE PREDICTION
print("datapoint:", X_test[0])
print("target:", y_test[0])
print("prediction:", iris_model.predict([X_test[0]]))

In [None]:
# PREDICT ON ENTIRE TEST SET
y_predict = iris_model.predict(X_test)
pd.DataFrame({xlabel: X_test[:, 0], ylabel: X_test[:, 1], "target": y_test, "prediction": y_predict})

In [None]:
# CALCULATE SCORE ON ENTIRE TEST SET
print(iris_model.score(X_test, y_test))
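
For a classifier, `score` returns the mean accuracy, i.e. the fraction of test points that were predicted correctly. Here is a minimal self-contained sketch that checks this by hand. The `random_state=0` value is an arbitrary choice added here so the split is repeatable; it is not part of the cells above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
mask = iris.target >= 1            # keep only classes 1 and 2, as in the cells above
X = iris.data[mask][:, 2:4]        # columns 2 and 3: petal length, petal width
y = iris.target[mask]

# random_state is an assumption made for repeatability
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(solver="lbfgs").fit(X_train, y_train)

# Fraction of correct predictions, computed manually
manual_accuracy = np.mean(model.predict(X_test) == y_test)
print(manual_accuracy, model.score(X_test, y_test))
```

Both printed numbers should agree, since `score` is just this fraction computed for you.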

# Your Turn!

Feel free to start with any exercise.


Here are the general steps to follow when building a model:

1. Load data, build DataFrame
2. Use `train_test_split` to create the training and test sets
3. For each model:
    - Build model with the train set (`fit` method)
    - Score with the test set (`score` method)
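
The steps above can be sketched roughly as follows. This is only one possible layout: the `models` dictionary, the `random_state=0` seed, and the choice of `KNeighborsClassifier` as the second model are assumptions made for illustration, not part of the walkthrough.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 1. Load data, build DataFrame
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target
df = df[df["target"] >= 1]  # use only 2 classes, as in the walkthrough

# 2. Split once; every model below reuses the same train and test sets
features = ["petal length (cm)", "petal width (cm)"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features].values, df["target"].values, random_state=0
)

# 3. For each model: fit on the train set, score on the test set
models = {
    "logistic regression": LogisticRegression(solver="lbfgs"),
    "k-nearest neighbors": KNeighborsClassifier(),  # illustrative second model
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, scores[name])
```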

## Exercise 1

Use all four features (instead of just petal length and petal width) in the model for the iris dataset.

## Exercise 2

Load another model from sklearn and compare its score with that of logistic regression on the iris dataset.

**Don't run `train_test_split` again. Use the same train and test sets for all the algorithms.**
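
The reason is that `train_test_split` shuffles the data before splitting, so a new call generally produces a different split and the scores are no longer comparable. A small sketch of this behavior, where the seeds 42 and 7 are arbitrary values chosen for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same seed: the two calls produce the identical test set
_, X_test_a, _, _ = train_test_split(X, y, random_state=42)
_, X_test_b, _, _ = train_test_split(X, y, random_state=42)
same_seed_equal = np.array_equal(X_test_a, X_test_b)

# Different seed: the rows are shuffled differently, so the test set changes
_, X_test_c, _, _ = train_test_split(X, y, random_state=7)
different_seed_equal = np.array_equal(X_test_a, X_test_c)

print(same_seed_equal, different_seed_equal)
```

If you do need to re-run the split, passing a fixed `random_state` is one way to get the same train and test sets back.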

Here are the docs & import statements for the other classification models:

* [Decision tree for classification](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html): `from sklearn.tree import DecisionTreeClassifier`
* [Random forest for classification](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html): `from sklearn.ensemble import RandomForestClassifier`
* [Neural network for classification](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier): `from sklearn.neural_network import MLPClassifier`

Follow the syntax of the `LogisticRegression` model. The `fit`, `predict` and `score` methods are the same!

## Exercise 3

Try building a model for the boston regression dataset.

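Here's the import statement: `from sklearn.datasets import load_boston`. Note that `load_boston` was removed in scikit-learn 1.2; on newer versions, `load_diabetes` is a regression dataset that loads the same way.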

Follow the same syntax as with the `load_iris` function.

Here are the docs & import statements for the regression models:

* [Linear regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html): `from sklearn.linear_model import LinearRegression`
* [Decision tree for regression](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html): `from sklearn.tree import DecisionTreeRegressor`
* [Random forest for regression](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html): `from sklearn.ensemble import RandomForestRegressor`
* [Neural network for regression](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html#sklearn.neural_network.MLPRegressor): `from sklearn.neural_network import MLPRegressor`
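
One difference from the classifiers: for regressors, `score` returns the coefficient of determination (R²) rather than accuracy, so 1.0 is the best possible value and the score can be negative. A minimal sketch of this, assuming `load_diabetes` as a stand-in regression dataset and an arbitrary `random_state=0` (neither comes from the exercise itself):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Stand-in regression dataset (an assumption; swap in your own data)
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = LinearRegression().fit(X_train, y_train)

# score() on a regressor is R^2, the same value r2_score computes
score = reg.score(X_test, y_test)
manual_r2 = r2_score(y_test, reg.predict(X_test))
print(score, manual_r2)
```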

## Even more

Try building models for other [toy datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html) or some [real world datasets](https://scikit-learn.org/stable/datasets/real_world.html). Or find a fun challenge on [kaggle.com](https://www.kaggle.com/).