# ML1 : Machine Learning with Pokemon

In this workshop we will go through the ML engineering process with a real dataset. We will implement each of the machine learning models that we discussed in ML0 and evaluate their performance.

## Our Data
We will be looking at Pokemon data meant to resemble a Pokedex. It has not been updated to include the most recent Pokemon, but it has data for 800 different ones.

#### Importing Data

## Data Cleaning
First, let's look at our data by taking a sample of it.

The first column is not needed, so we will get rid of it.
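
A minimal sketch of these two steps, assuming the data is loaded into a pandas DataFrame. The inline CSV, its column names, and its values are made-up stand-ins (the real file name is not given here); in the workshop you would pass the path of the CSV file to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: a few made-up rows with an unnamed,
# index-like first column (this layout is an assumption).
csv_text = """,Name,HP,Attack,Defense
0,Aaa,45,49,49
1,Bbb,39,52,43
2,Ccc,44,48,65
3,Ddd,60,62,63
"""

df = pd.read_csv(io.StringIO(csv_text))

# Look at a random sample of rows (random_state only makes it repeatable)
print(df.sample(3, random_state=0))

# Drop the first column by position, since it carries no information
df = df.drop(columns=df.columns[0])
print(df.columns.tolist())
```

Dropping by position avoids depending on whatever name pandas assigns to an unnamed column.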

This data has already been preprocessed, so we don't have much to do here.

## Exploratory Data Analysis

Now that we have clean data, let's do some EDA before we get into the ML.

#### Importing Seaborn

In [57]:
import seaborn as sns
import matplotlib.pyplot as plt

# Setting style preferences for seaborn
sns.set(style = 'darkgrid', color_codes = True)

def setplt(x = 13, y = 9, a = 1, b = 1):
    f, ax = plt.subplots(a,b,figsize = (x,y))
    sns.despine(f, left = True, bottom = True)
    return f, ax

We can make scatterplots comparing all the features

Above, we can see clear distinctions in the rank when it is related to the other features. We can also check whether these features have any effect on other classes, like discipline and sex, and see that there is no relation.

### Our Objective

Our goal for this workshop is to come up with the best-performing model to predict the rank, i.e. Associate Professor, Assistant Professor, or Professor (tenured). We are given stats for each professor, such as the years they have worked, their salary, and the years they have had their PhD.

<br>

## Preparing our Data

We now have to prepare our features and labels. We will then also have to create our training and testing sets. But first we need to replace some of the features with numerical values, since our ML models will not take in strings. One example is our discipline feature.

<br>

####  Encoding Labels
ML models can't take in string values, so they must be converted to numerical values.<br>
<br>
We need to map the two type columns together so that the same type gets the same label in both columns.
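
A sketch of one way to share a single mapping across both columns. The column names `Type 1` and `Type 2` and the rows are assumptions for illustration, and `type_to_id` is a hypothetical helper name.

```python
import pandas as pd

# Made-up stand-in rows; the column names are assumptions
demo = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water", "Flying"],
    "Type 2": ["Poison", "Flying", "Ground", "Dragon"],
})

# Build one mapping from the union of both columns, so a type gets the
# same integer no matter which column it appears in
all_types = sorted(set(demo["Type 1"]) | set(demo["Type 2"]))
type_to_id = {name: i for i, name in enumerate(all_types)}

demo["Type 1"] = demo["Type 1"].map(type_to_id)
demo["Type 2"] = demo["Type 2"].map(type_to_id)
print(type_to_id)
print(demo)
```

Fitting a separate encoder per column would instead give "Flying" one number in the first column and a different one in the second.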

#### Creating our Features

#### Creating our Labels

### Train-Test-Split

In [2]:
# Using the train_test_split function from sklearn.model_selection


print('Shape of training features : \t' + str(X_train.shape))
print('Shape of training labels : \t' + str(y_train.shape))
print('Shape of testing features : \t' + str(X_test.shape))
print('Shape of testing labels : \t' + str(y_test.shape))
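
A sketch of how the split could look, run on made-up stand-in data so that it works on its own. The 80/20 ratio, the `random_state`, and the use of `stratify` are assumptions, not settings taken from the workshop.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 100 rows, 5 features, 3 classes
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 5))
targets = rng.integers(0, 3, size=100)

# Hold out 20% for testing; stratify keeps the class proportions
# similar in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    features, targets, test_size=0.2, random_state=42, stratify=targets
)
print(X_train.shape, X_test.shape)
```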

## Finding the Best Model

Now we will go through the ML models we discussed in ML0. We will create a classifier using each of these algorithms and evaluate them to see which one gives us the best performance. <br>
<br>- K-Nearest-Neighbors
<br>- Random Forest
<br>- Support Vector Machine
<br>- Neural Network

### K-Nearest-Neighbors Classification

First we will try the KNN algorithm to create a KNN Classifier
<br><br>
Documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier.kneighbors

In [3]:
# Importing Classifier from Sci-Kit Learn

# Initializing the KNN Classifier
 
# Training model

# Evaluating Method


##### Tuning Parameters

We want to find the value of n_neighbors that gives us the best accuracy.
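
One possible way to search for that value is to score each candidate with cross-validation. This is a sketch on synthetic data; the range of 1 to 20 neighbors and the use of 5-fold cross-validation (rather than the test-set accuracy) are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data: 5 features, 3 classes
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)

# Score each candidate k with 5-fold cross-validation
scores = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(f"best n_neighbors = {best_k} (mean CV accuracy {scores[best_k]:.3f})")
```

Scoring on held-out folds rather than on the test set keeps the test set untouched for the final evaluation.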

##### Predictions

### Random Forest Classification

Now that we have an accuracy for our KNN model, let's see if we can get better performance with our RF algorithm, since 0.775 is not considered very accurate.

In [5]:
# Import RF Classifier

# Initialize the Random Forest
        # Using the Gini impurity to measure the quality of each split

# Train Model

# Test
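
A sketch of how this cell might be filled in, again on synthetic stand-in data. `n_estimators=100` and `criterion="gini"` are the scikit-learn defaults; the `random_state` is an assumption added only to make the result repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, 3 classes
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Initialize the forest, using Gini impurity to choose the splits
rfc = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)

# Train, then test
rfc.fit(X_train, y_train)
rf_accuracy = rfc.score(X_test, y_test)
print(f"Random Forest test accuracy: {rf_accuracy:.3f}")
```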


##### Tuning Parameters

Here we will have to tune our n_estimators parameter. There are others, such as the criterion (above), bootstrap, max_features, etc. In our case we can simply use the default values that sklearn's RFC gives us and mainly worry about the number of trees. Hence, we will tune this parameter the same way we did with KNN.

<b>NOTE : </b> Unlike KNN, building a random forest is randomized, so the testing accuracy will differ every time we train: different bootstrap samples are used to create the forest on each run. If you keep running the block above, the accuracy will keep changing.
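
If repeatable numbers are wanted while tuning, one option is to fix `random_state`. The sketch below uses synthetic data; the helper `forest_accuracy` and the candidate forest sizes are hypothetical choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 features, 3 classes
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

def forest_accuracy(n_estimators, seed):
    """Train a forest of the given size and return its test accuracy."""
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)

# With the seed held fixed, the same forest is built every time
print(forest_accuracy(50, seed=1) == forest_accuracy(50, seed=1))

# Compare a few forest sizes under the same seed
results = {n: forest_accuracy(n, seed=1) for n in (10, 50, 100, 200)}
best_n = max(results, key=results.get)
print(results, best_n)
```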


#### Why did the RFC do better?
Our RFC model did pretty well compared to the KNN model. This is because this data can be easily categorized. Notice in the pairplots we made above that the data can be separated visually, but there is a lot of overlap between the classes. Because of this, KNN might perform worse near the edges of each class. Random Forest lets us make more reasonable distinctions using the feature values.

### Support Vector Machine Classification

Now we will try to use the SVC algorithm to create a classification model


In [6]:
# Importing SVC

# Initialize the SVC

# Train Model

# Test Model
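
A sketch of these steps on synthetic stand-in data. The values `C=1.0` and `gamma="scale"` are the scikit-learn defaults, written out here only because they are the two parameters tuned next.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 5 features, 3 classes
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Initialize an SVC with the RBF kernel
svc = SVC(kernel="rbf", C=1.0, gamma="scale")

# Train, then test
svc.fit(X_train, y_train)
svc_accuracy = svc.score(X_test, y_test)
print(f"SVC test accuracy: {svc_accuracy:.3f}")
```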


##### Tuning Parameters

In this case we want to find the best combination of parameters, since we will be changing more than one parameter. Let's define the ones we want to change. Since we are working with the RBF kernel (a reasonable default here, since with more than three features we cannot easily check by eye whether the classes are linearly separable), we will have to optimize the C and gamma parameters.

Sci-Kit Learn has a very useful tool for this called GridSearchCV. It makes finding the best parameters simpler, but it is essentially still trial and error: it tries every combination in the grid and scores each one with cross-validation.

In [7]:
# Importing GridSearch

# Initialize the GridSearch 

# Try all combinations

# Observe best parameters
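
A sketch of the grid search on synthetic stand-in data. The candidate values for `C` and `gamma` and the 5-fold cross-validation are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data: 5 features, 3 classes
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Every combination of these values is tried: 4 x 4 = 16 candidates
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [0.001, 0.01, 0.1, 1],
}

# Each candidate is scored with 5-fold cross-validation on the training set
grid = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
grid.fit(X_train, y_train)

# Observe the best parameters and their mean cross-validated accuracy
print(grid.best_params_, round(grid.best_score_, 3))

# By default the best model is refit on the full training set
print(f"test accuracy: {grid.score(X_test, y_test):.3f}")
```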


<b>NOTE : </b> Grid search can be used to optimize any of the classical machine learning models we have discussed.

### Multi-Layer Perceptron
We will now attempt to use a neural network to create a deep learning model. We will be using the Keras library that ships with TensorFlow.

#### Design

First we will need to design our MLP given the features we have and what we need from the output.

#### Sizes of vital layers

<b>Input Layer : </b> The number of features we are training on, so in this case it will be <code>len(features.columns) = 5</code>
<br>
<b>Hidden Layer : </b> We will experiment with the hidden layers, but for now we will just use a single hidden layer with 4 nodes <br>
<b>Output Layer : </b> The number of unique labels we will classify data points as: <code>len(targets.unique()) = 3</code>
<br><br>
<img src='nn.png' width='550'>

Now that we have the architecture, let's implement our network.
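
As a quick check of the layer sizes before building the network in Keras, here is a plain NumPy forward pass through a 5-4-3 network. It is a stand-in for illustration, not the Keras model: the weights are random and untrained, and the ReLU and softmax activations are assumptions.

```python
import numpy as np

n_inputs, n_hidden, n_outputs = 5, 4, 3

rng = np.random.default_rng(0)
W1 = rng.normal(size=(n_inputs, n_hidden))   # input -> hidden weights
b1 = np.zeros(n_hidden)                      # hidden biases
W2 = rng.normal(size=(n_hidden, n_outputs))  # hidden -> output weights
b2 = np.zeros(n_outputs)                     # output biases

def forward(x):
    """Forward pass: ReLU hidden layer, then softmax over the 3 classes."""
    hidden = np.maximum(0, x @ W1 + b1)
    logits = hidden @ W2 + b2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

batch = rng.normal(size=(2, n_inputs))  # two made-up data points
probs = forward(batch)
print(probs.shape)  # (2, 3): one probability per class for each point

# Trainable parameters: 5*4 + 4 + 4*3 + 3
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 39
```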

In [79]:
from tensorflow.keras.models import Sequential                       # Feed-Forward Model
from tensorflow.keras.layers import Dense, Dropout, Activation       # Layers and Activation Functions
from tensorflow.keras.optimizers import SGD                          # Stochastic Gradient Descent

##### Encoding the Labels

In [8]:







# FOR ONE-HOT ENCODING (not used for now)

# Legendary = [1,0]
# Not-Legendary = [0,1]

# from sklearn.preprocessing import OneHotEncoder
# onehotencoder = OneHotEncoder(sparse_output = False)   # named 'sparse' before scikit-learn 1.2
# y_train = y_train.reshape(len(y_train),1)
# y_test = y_test.reshape(len(y_test),1)
# y_train = onehotencoder.fit_transform(y_train).astype(int)
# y_test = onehotencoder.transform(y_test).astype(int)   # reuse the encoder fitted on y_train
# y_train[:5]


##### Initializing the MLP

##### Performance

Notice how the neural network did not perform as well as the other models. This is mainly because MLPs work best with data that has a lot of features. For data with a small number of features, a simpler method like RF will often work better.

## Thank You

Thanks for attending our first run of the ML series. Let us know if you enjoyed it or what you think could be improved; we are always looking for feedback on how to make our workshops better in the future.

## Interested in DSI?

We are currently looking for new members to serve on our Junior Executive Board. If you are interested, please come up and talk to us so we can give you information about the application and interview process.