In [1]:
import pandas as pd
import numpy as np
from io import StringIO
import numpy.linalg as la

# Machine Learning

For this activity, you will explore the basics of machine learning. Machine learning describes a class of methods for automatically building mathematical models based on training data. We will work with a dataset of Pokemon.

In this activity, you will:
* explore the data using visualization tools   
* split your data into training and test sets  
* create a model to predict whether a Pokemon is legendary or not based on the Pokemon properties.

Load the dataset

In [2]:
# data source: https://www.kaggle.com/abcsds/pokemon/downloads/pokemon.zip/2
df = pd.read_csv("Pokemon.csv")

In the dataset, each row represents a Pokemon. How many Pokemon are in our dataset? How many features are in this dataset? Take a look at the shape:

In [3]:
print(df.shape)

(800, 13)


You can inspect the first few lines of your data using <mark>df.head()</mark>.

In [4]:
df.head()

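| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
| 1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
| 2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
| 3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
| 4 | 4 | Charmander | Fire | | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |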


Define the 1d numpy array `y`, such that it contains whether a given Pokemon is legendary or not. The $i$th entry of y denotes whether the $i$th Pokemon is legendary (`True`) or not (`False`). We will later use a classification algorithm to help predict if a Pokemon is legendary.

In [5]:
#grade_clear
y = df['Legendary'].values
print(y.shape)
print(y.dtype)
print(type(y))

(800,)
bool
<class 'numpy.ndarray'>


Not every classifier can work with string or boolean types. Instead of keeping the array `y` as booleans, we can replace `True` with 1 and `False` with 0. Store the result in the 1d numpy array `yb` with type `int64`.

In [6]:
#grade_clear
yb = y.astype('int64')
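
As a small illustration of this conversion, the sketch below uses a made-up boolean array rather than the Pokemon data (`y_demo` and `yb_demo` are hypothetical names). It shows that `astype('int64')` maps `True` to 1 and `False` to 0, so the sum counts the positive class and the mean gives its fraction.

```python
import numpy as np

# Hypothetical boolean labels standing in for the 'Legendary' column
y_demo = np.array([False, True, False, False, True])

# True -> 1, False -> 0
yb_demo = y_demo.astype('int64')

print(yb_demo)         # [0 1 0 0 1]
print(yb_demo.dtype)   # int64
print(yb_demo.sum())   # number of positive labels: 2
print(yb_demo.mean())  # fraction of positive labels: 0.4
```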

What are the features in our data that can be used to determine the legendary status of a Pokemon?

Save these features as a 1d numpy array of strings named `labels`. Hint: there are 7 features.

In [16]:
#grade_clear
labels = np.array(['Total','HP', 'Attack', 'Defense','Sp. Atk', 'Sp. Def', 'Speed'])

Create a new dataframe, copying from df, including only the features described in `labels`.  Name it `x`.

In [17]:
#grade_clear
x = df[labels].copy()
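
The column selection can be sketched on a tiny made-up frame (`df_demo`, `labels_demo`, and `x_demo` are hypothetical names, and the values are invented). Indexing a dataframe with an array of column names keeps only those columns, and `.copy()` makes explicit that later edits to the new frame do not touch the original.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration
df_demo = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "HP": [45, 60, 80],
    "Attack": [49, 62, 82],
    "Legendary": [False, False, True],
})

labels_demo = np.array(["HP", "Attack"])

# Indexing with an array of column names keeps only those columns
x_demo = df_demo[labels_demo].copy()
print(x_demo.shape)          # (3, 2)

# x_demo is an independent copy, so editing it leaves df_demo untouched
x_demo.loc[0, "HP"] = 999
print(df_demo.loc[0, "HP"])  # 45
```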

# Splitting the dataset

To assess the model’s performance later, we divide the dataset into two parts: a training set and a test set. The first is used to train the system, while the second is used to evaluate the learned or trained model.  
    
We are going to use <mark>sklearn.model_selection.train_test_split</mark> to split the dataset

In [9]:
from sklearn.model_selection import train_test_split

A common splitting choice is to take 2/3 of your original dataset as the training set, while the remaining 1/3 composes the test set. You should select this proportion by assigning the variable <mark>s</mark> and setting the argument <mark>test_size = s</mark> in <mark>sklearn.model_selection.train_test_split</mark>.

In [10]:
s = 0.33

We will fix the seed for the random number generator in order to get reproducible results.

In [11]:
seed = 41

Split the arrays `x` and `yb` into training data (`X_train`, `Y_train`) and test data (`X_test`, `Y_test`) using [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html):

```
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=s, random_state=seed)
```


In [27]:
#grade_clear
X_train, X_test, Y_train, Y_test = train_test_split(x, yb, test_size=s, random_state=seed)
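
To see what the `test_size` and `random_state` arguments do, here is a minimal sketch on made-up data (the arrays and the `_demo`, `split_a`, `split_b` names are hypothetical). It checks that every sample lands in exactly one of the two sets and that repeating the call with the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 12 samples, 2 features, binary labels
X_demo = np.arange(24).reshape(12, 2)
y_demo = np.array([0, 1] * 6)

# Two splits with the same random_state
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=41)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=41)

X_train_a, X_test_a, y_train_a, y_test_a = split_a

# 25% of 12 samples go to the test set, the rest to the training set
print(len(X_train_a), len(X_test_a))   # 9 3

# The same seed reproduces the same split
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)                            # True
```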

### Logistic regression

Now that we have a dataset to train our model and a dataset to validate our model, we need to construct a model.

We will begin with a logistic regression model. It is used for classification tasks in which each data point belongs to exactly one class; here the two classes are legendary and not legendary. The model's parameters can be fit with a modified version of least squares (iteratively reweighted least squares) or with Newton's method.
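
For intuition only, the following is a rough sketch of Newton's method applied to logistic regression on synthetic data. It is not how sklearn fits the model in this activity (we use the `lbfgs` solver below); the function names, the tiny `ridge` term added to keep the linear solve well conditioned, and the data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping real scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25, ridge=1e-6):
    """Hypothetical helper: Newton iterations for the logistic log-loss."""
    A = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        p = sigmoid(A @ w)                      # predicted probabilities
        grad = A.T @ (p - y)                    # gradient of the log-loss
        weights = p * (1.0 - p)                 # per-sample curvature
        hess = A.T @ (A * weights[:, None]) + ridge * np.eye(A.shape[1])
        w -= np.linalg.solve(hess, grad)        # Newton step
    return w

# Synthetic, non-separable data drawn from a known logistic model
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
true_w = np.array([-0.5, 2.0, -1.0])            # intercept, then two slopes
p_true = sigmoid(true_w[0] + X_demo @ true_w[1:])
y_demo = (rng.random(200) < p_true).astype(float)

w_hat = fit_logistic_newton(X_demo, y_demo)
print(w_hat)   # estimates should be roughly near true_w
```

Each step solves a weighted least-squares-like linear system, which is why the same iteration is also described as a modified version of least squares.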

In [28]:
from sklearn.linear_model import LogisticRegression

Using the [LogisticRegression class](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), make an instance of the model. We will use the default parameters for now, with `lbfgs` as the solver:

In [29]:
model = LogisticRegression(solver="lbfgs")

Using this instance of the model, let's use the training data to train the model.
Use `model.fit(X_train, Y_train)` to train the model.

In [30]:
model.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

## Model Prediction

We now have a trained model and we can begin using it to make predictions. Recall that we want to use our model to predict whether a Pokemon is legendary or not.

Use the model to predict whether the Pokemon in the test dataset `X_test` (the set not used during training) are legendary. You can make predictions with the model's `predict` method:
```
model.predict(X_test)
```

Store the prediction as `Ypredict`.

In [25]:
#clear
Ypredict = model.predict(X_test)
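
As a side note on what `predict` returns, the sketch below fits a logistic regression on synthetic data (all `_demo` names and the data are made up) and compares the hard labels from `predict` with the class probabilities from `predict_proba`. For a binary logistic regression, the predicted label corresponds to thresholding the probability of class 1 at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-feature data with noisy binary labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

model_demo = LogisticRegression(solver="lbfgs").fit(X_demo, y_demo)

proba = model_demo.predict_proba(X_demo)   # columns: P(class 0), P(class 1)
pred = model_demo.predict(X_demo)          # hard 0/1 labels

print(proba.shape)                         # (100, 2)

# Thresholding P(class 1) at 0.5 gives the same labels as predict
thresholded = (proba[:, 1] > 0.5).astype(int)
agree = np.array_equal(pred, thresholded)
print(agree)
```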

How many Pokemon are predicted as legendary? Store this as `n_predict_legendary`.

How many legendary Pokemon are actually in the test set? Store this as `n_actual_legendary`.

In [26]:
#clear
n_predict_legendary = Ypredict.sum()
n_predict_legendary

14

In [24]:
#clear
n_actual_legendary = Y_test.sum()
n_actual_legendary

19

Unfortunately, these counts alone do not tell you how good your model is: they do not show which individual predictions were correct.
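
A toy example makes the point concrete. The arrays below are invented for illustration: two sets of predictions flag the same number of positives, yet one is entirely correct and the other gets every positive wrong.

```python
import numpy as np

# Made-up ground truth and two hypothetical sets of predictions
y_true    = np.array([1, 1, 0, 0, 0, 0])
pred_good = np.array([1, 1, 0, 0, 0, 0])
pred_bad  = np.array([0, 0, 1, 1, 0, 0])

# Both predict the same number of positives...
print(pred_good.sum(), pred_bad.sum())   # 2 2

# ...but the number of correct predictions differs
print((pred_good == y_true).sum())       # 6
print((pred_bad == y_true).sum())        # 2
```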

#### How do we know how good our prediction is? How can we measure how good the model is?

One way of determining the performance of our model is using a confusion matrix. A confusion matrix describes the performance of the classification model on a set of test data for which the true values are known. A confusion matrix stores the true positives, false positives, false negatives, and true negatives for our test data.

In [None]:
from sklearn.metrics import confusion_matrix

Let's use the `confusion_matrix` function in sklearn to construct a confusion matrix for our dataset.

In [None]:
cmat = confusion_matrix(Y_test,Ypredict)

print("confusion matrix:\n",cmat)

TN, FP, FN, TP = cmat.ravel()

$$ \text{Confusion matrix} = \left[ \begin{array}{cc} TN & FP \\ FN & TP \end{array} \right] $$

TN: Predicted no (not legendary), and the Pokemon is not legendary. (How many non-legendary Pokemon are correctly identified?)

FP: Predicted yes (legendary), but the Pokemon is not legendary. (How many non-legendary Pokemon are identified as legendary?)


FN: Predicted no (not legendary), but the Pokemon is actually legendary. (How many legendary Pokemon are missed?)

TP: Predicted yes (legendary), and the Pokemon is legendary. (How many legendary Pokemon are correctly identified?)
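
The four definitions above can be checked by hand with boolean masks. This is a minimal sketch on made-up labels (the arrays and the lowercase names `tn`, `fp`, `fn`, `tp`, `cmat_demo` are hypothetical), arranged in the same `[[TN, FP], [FN, TP]]` layout as the matrix shown above.

```python
import numpy as np

# Made-up true labels and predictions (1 = legendary, 0 = not legendary)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0, 1, 0])

tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # correctly predicted "no"
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted "yes", actually "no"
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted "no", actually "yes"
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # correctly predicted "yes"

cmat_demo = np.array([[tn, fp],
                      [fn, tp]])
print(cmat_demo)
# [[4 1]
#  [1 2]]
```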




#### There are different "scores" to quantify how good the model is. Here are some of them:

1) Accuracy: the fraction of classifications that are correct. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(Y_test, Ypredict)

which can also be computed as:

In [None]:
(Y_test==Ypredict).sum()/len(Y_test)

2) Precision: when it predicts yes (legendary), how often is the prediction correct?
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html

In [None]:
from sklearn.metrics import precision_score

precision_score(Y_test, Ypredict)

which can also be computed as:

In [None]:
TP/(TP+FP)

3) Recall: when the Pokemon is actually legendary (yes), how often is the prediction correct? https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html

In [None]:
from sklearn.metrics import recall_score

recall_score(Y_test, Ypredict)

which can also be computed as:

In [None]:
TP/(TP+FN)
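
These scores can disagree sharply when one class is rare. The sketch below uses an invented test set (100 samples, 7 positives; all names and numbers are hypothetical) and a trivial baseline that always predicts "not legendary". Its accuracy is high while its recall is zero, which is why accuracy alone can be misleading here.

```python
import numpy as np

# Hypothetical imbalanced test set: 100 samples, 7 of them positive
y_true = np.zeros(100, dtype=int)
y_true[:7] = 1

# Baseline that always predicts the majority class (not legendary)
y_pred = np.zeros(100, dtype=int)

tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(accuracy)   # 0.93
print(recall)     # 0.0
# Precision, tp / (tp + fp), is undefined here: no positives were predicted
```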

## Different Models:

Starting with an initial dataset, we learned how to prepare the data, split it, construct a model, and then use the model to make and evaluate predictions, all with sklearn.

Let's repeat this experiment with different models. Below are 5 additional classifiers (models) found in sklearn, compared alongside logistic regression. Compare your results for each of the classifiers. Which works best for the task of determining the legendary status of a Pokemon?

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [None]:
dict_classifiers = {
    "Nearest Neighbors": KNeighborsClassifier(),
    # SVC() uses an RBF kernel by default, so it is not a linear SVM
    "SVM (RBF kernel)": SVC(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(solver="lbfgs")
}

In [None]:
# Header row for the comparison table
print('%30s %16s %16s %16s' % ("Classifier", "recall", "precision", "accuracy"))
for name, clf in dict_classifiers.items():
    # Train each classifier on the training set, then predict on the test set
    clf.fit(X_train, Y_train)
    y_result = clf.predict(X_test)
    # Score the predictions against the true test labels
    recall = recall_score(Y_test, y_result)
    precision = precision_score(Y_test, y_result)
    acc = accuracy_score(Y_test, y_result)
    print('%30s %16f %16f %16f' % (name, recall, precision, acc))
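
If you would rather sort or filter the scores than read them off printed lines, one option is to collect them in a dataframe. The sketch below is an assumption-laden illustration: it uses synthetic data and only two classifiers, and every name in it (`X_demo`, `rows`, `results`, and so on) is hypothetical. The `zero_division=0` argument tells `precision_score` to return 0 instead of warning when a classifier predicts no positives.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data with a minority positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = (X_demo.sum(axis=1) + rng.normal(size=300) > 1.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=41)

classifiers_demo = {
    "Logistic Regression": LogisticRegression(solver="lbfgs"),
    "Naive Bayes": GaussianNB(),
}

# One row of scores per classifier
rows = []
for name, clf in classifiers_demo.items():
    y_hat = clf.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "classifier": name,
        "recall": recall_score(y_te, y_hat),
        "precision": precision_score(y_te, y_hat, zero_division=0),
        "accuracy": accuracy_score(y_te, y_hat),
    })

# Sort so the classifier with the highest recall comes first
results = pd.DataFrame(rows).sort_values("recall", ascending=False)
print(results.to_string(index=False))
```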