# Introduction to sklearn: `fit`, `predict`, and `score`

*Scikit-learn* (imported as `sklearn`) is a Python library for machine learning. It implements a wealth of well-tested, highly tuned learning algorithms behind a common API.

Every supervised learning algorithm in the library thus implements a `fit` method to train the model, a `predict` method that applies the trained model to new data, and a `score` method to measure how well the model does on data for which you have gold labels.
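
As a minimal sketch of this shared API, the cell below runs the three calls on a tiny made-up dataset. The numbers and the names `X_toy`, `y_toy` and `clf` are invented for illustration and have nothing to do with the data used later in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented toy data: one feature, two well-separated classes
X_toy = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_toy, y_toy)                              # train the model
toy_pred = clf.predict(np.array([[1.5], [9.5]]))   # apply it to unseen points
toy_acc = clf.score(X_toy, y_toy)                  # mean accuracy on labelled data
print(toy_pred, toy_acc)
```

Any other sklearn classifier could be swapped in for `LogisticRegression` here without changing the three calls.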

In [None]:
%matplotlib inline
import sklearn
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import codecs
import json

### Task: Professional athlete classification

To introduce the `sklearn` API, we consider a simple two-way (or binary) classification problem, the classic sumo wrestler vs baseball player:

![Sumo vs. Major League Baseball](sumo-vs-mlb.jpg)

In other words, from a pool of professional athletes, can we predict who is a sumo wrestler and who is a Major League Baseball player? The features that we'll use to make this decision are the *height* and the *weight* of the athlete.

### Data sources

The data for the task comes from two separate data sets, which we ask you to harmonize and combine.

#### Sumos

Data on sumo wrestlers was obtained by issuing this [query](http://tinyurl.com/m5k2ej8) on Freebase and saved to the file `sumos.json`. It's in JSON format, a widely used data interchange format that looks similar to Python syntax.



In [None]:
# Use a context manager so the file is closed after reading
with open("sumos.json") as f:
    sumo_json = json.load(f)
sumo_json

Below we create a pandas `DataFrame` from the sumo wrestler dataset.

In [None]:
sumo = pd.DataFrame(sumo_json['result'])
sumo.tail()

#### Baseball players

The dataset with height and weight for players in Major League Baseball (MLB) was downloaded from this [HTML page](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights#SOCR_Data_-_1035_Records_of_Heights_.28in.29_and_Weights_.28lbs.29_of_Major_League_Baseball_Players), copied into a spreadsheet and exported as a CSV file.

In [None]:
mlb = pd.read_csv("mlb_heights.csv", encoding='utf-8')
mlb.tail()

### Exercise: harmonize weight and height

The MLB and sumo datasets use different units for the weights and heights of the athletes. For the baseball players we have heights in inches and weights in pounds, whereas for the sumo wrestlers the registered heights are in meters and the weights in kilos.

Create (up to) two new columns in each dataset so that both datasets use the same units:

- **weight_kg** weight in kilos
- **height_cm** height in centimeters
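
The conversion factors involved can be sketched on a small invented frame. The input column names `height_in`, `weight_lb` and `height_m` below are hypothetical; check the actual column names in `mlb` and `sumo` before adapting this to the exercise.

```python
import pandas as pd

INCH_TO_CM = 2.54          # exact by definition
POUND_TO_KG = 0.45359237   # exact by definition

# Hypothetical toy frame in imperial units (values invented for illustration)
toy = pd.DataFrame({"height_in": [70.0, 74.0], "weight_lb": [180.0, 215.0]})
toy["height_cm"] = toy["height_in"] * INCH_TO_CM
toy["weight_kg"] = toy["weight_lb"] * POUND_TO_KG

# Hypothetical toy frame with heights in meters: multiply by 100 for centimeters
toy_m = pd.DataFrame({"height_m": [1.85]})
toy_m["height_cm"] = toy_m["height_m"] * 100

print(toy)
print(toy_m)
```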


In [None]:
# Your code here

#### Combining MLB players and sumo wrestlers

Below we combine the two datasets in a common `DataFrame`.

Note that not every data point of the MLB dataset is used. Can you think of a reason for doing this?


In [None]:
sumo_vs_mlb = pd.concat([sumo[['height_cm', 'weight_kg']], 
                         mlb.loc[100:200, ['height_cm', 'weight_kg']]])
sumo_vs_mlb.tail()

Create a `numpy` array `is_sumo` of labels with the same length as the `sumo_vs_mlb` dataset. A value of one in this array should indicate that the corresponding row in `sumo_vs_mlb` is a sumo wrestler. In the same manner, zero means that the corresponding row is not a sumo wrestler (i.e. is an MLB player).

In [None]:
is_sumo = np.ones(len(sumo_vs_mlb), dtype=bool)
# Your code here

We can now visualize the distribution of the weights and heights in our dataset in 2D space.

In [None]:
fig, ax = plt.subplots()
sumo_vs_mlb.loc[is_sumo].plot(kind='scatter', x='weight_kg', y='height_cm',
                              color='blue', label='Sumo', ax=ax)
sumo_vs_mlb.loc[~is_sumo].plot(kind='scatter', x='weight_kg', y='height_cm',
                               color='red', label='Baseball', ax=ax);

Do you think it's possible to learn a classifier that can perfectly separate the sumo and the baseball class?

### Learning a model

Step through the code below, reading the code and executing each cell. Make sure that you inspect any variable that you are curious about by **writing it in a new cell and executing that cell**. Also check the documentation on the sklearn website.

#### Converting dataset from `pandas` to `numpy`

In [None]:
X = sumo_vs_mlb[['weight_kg', 'height_cm']].values

#### Creating a fixed train and test set

We divide the dataset into two parts. One part will be used for training, while the other part is set aside for estimating how well our model generalizes. The parts are referred to as the training set and the test set. It's important not to mix these two up, for example by using your test data to train your model.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, is_sumo, random_state=100)
print("Train shape", X_train.shape, "Test shape", X_test.shape)

### Fitting the classifier



In [None]:
from sklearn.linear_model import Perceptron
perceptron = Perceptron()
perceptron

In [None]:
perceptron.fit(X_train, y_train);

In [None]:
perceptron.predict(X_test)

### Evaluation

In [None]:
y_pred = perceptron.predict(X_test)
y_pred

In [None]:
n_correct = (y_pred == y_test).sum()
print("Accuracy", n_correct / float(y_test.shape[0]))

In [None]:
perceptron.score(X_test, y_test)

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

print("Precision", precision_score(y_test, y_pred))
print("Recall", recall_score(y_test, y_pred))
print("F1 (balanced)", f1_score(y_test, y_pred))
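
To make these three measures concrete, here is a small sketch that computes them by hand from true positive, false positive and false negative counts and compares the result with sklearn. The label arrays `y_true_toy` and `y_pred_toy` are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Invented gold labels and predictions (1 = positive class)
y_true_toy = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1])

tp = np.sum((y_pred_toy == 1) & (y_true_toy == 1))  # predicted positive, is positive
fp = np.sum((y_pred_toy == 1) & (y_true_toy == 0))  # predicted positive, is negative
fn = np.sum((y_pred_toy == 0) & (y_true_toy == 1))  # predicted negative, is positive

precision = tp / (tp + fp)   # how many predicted positives are correct
recall = tp / (tp + fn)      # how many actual positives were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
print(f1, f1_score(y_true_toy, y_pred_toy))
```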

In [None]:
print(classification_report(y_test, y_pred))

### Estimated parameters of the model

Here we peek into the parameters of the model. This is an advanced section, so you shouldn't worry if you don't understand everything. We'll get back to this in Lecture 6. 

Do make sure that you see the decision boundary plot in the last cell.

In [None]:
print("shape", perceptron.coef_.shape)
perceptron.coef_

In [None]:
print("shape", perceptron.intercept_.shape)
perceptron.intercept_
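
As a rough sketch of how these two parameters turn into predictions: a linear classifier such as the perceptron scores each row as a weighted sum of its features plus the intercept, and predicts the positive class when the score is above zero. The toy data and the names `X_toy`, `y_toy` and `toy_model` are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Perceptron

# Invented toy data in (weight_kg, height_cm) style; not the notebook's dataset
X_toy = np.array([[80., 180.], [85., 185.], [90., 190.],
                  [150., 180.], [160., 185.], [170., 178.]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

toy_model = Perceptron(random_state=0).fit(X_toy, y_toy)

# Score = w . x + b for every row
scores = X_toy @ toy_model.coef_[0] + toy_model.intercept_[0]
# Positive score -> class 1, otherwise class 0
manual_pred = (scores > 0).astype(int)

print(manual_pred)
print(toy_model.predict(X_toy))
```

The manually computed scores should agree with `toy_model.decision_function(X_toy)`.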

#### Plotting the decision boundary

In [None]:
def decision_boundary(w, bias, dist=0, x_start=0, x_end=300):
    y_start = -(x_start * w[0] + bias - dist) / w[1]
    y_end = -(x_end * w[0] + bias - dist) / w[1]
    return [x_start, x_end], [y_start, y_end]
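
The helper above can be read as solving the linear score for the second feature. Writing $x$ for `weight_kg`, $y$ for `height_cm`, $w_0, w_1$ for the two entries of `w`, $b$ for `bias` and $d$ for `dist` (this notation is ours, not sklearn's), and assuming $w_1 \neq 0$:

$$
w_0 x + w_1 y + b = d
\quad\Longrightarrow\quad
y = -\frac{w_0 x + b - d}{w_1}
$$

With the default $d = 0$ this is the line on which the score is exactly zero, i.e. the decision boundary. The function evaluates it at `x_start` and `x_end` and returns the two endpoints.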

In [None]:
xx, yy = decision_boundary(perceptron.coef_[0], perceptron.intercept_[0])
fig, ax = plt.subplots()

sumo_vs_mlb[is_sumo].plot(kind='scatter', x='weight_kg', y='height_cm',
                         color='blue', label='Sumo', ax=ax)
sumo_vs_mlb[~is_sumo].plot(kind='scatter', x='weight_kg', y='height_cm',
                         color='red', label='MLB', ax=ax)
ax.set_xlim(60, 300)
ax.set_ylim(160, 210)
ax.plot(xx, yy);

### Optional exercise: Fit the data using a different classifier

Import the `LogisticRegression` classifier and use it to fit a new model. Compute the same performance measures as above and plot the new decision boundary.