# Classification using scikit-learn

### Imports

Import pandas for dataframe functionality and scikit-learn (sklearn) for the classification algorithms.

In [None]:
import pandas as pd
import numpy as np
import math
import os
import random
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict, StratifiedKFold, train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

### Loading and cleaning data

Use pandas to read the CSV data file into a dataframe, passing the column names as a list.

In [None]:
df = pd.read_csv('flag.data', names=['name', 'landmass', 'zone', 'area', 'population', 'language', 'religion', 
                                     'bars', 'stripes', 'colours', 'red', 'green', 'blue', 'gold', 'white', 'black',
                                     'orange', 'mainhue', 'circles', 'crosses', 'saltires', 'quarters', 'sunstars', 
                                     'crescent', 'triangle', 'icon', 'animate', 'text', 'topleft', 'botright'])

In [None]:
df.columns

In [None]:
df

Drop the 'name' column, which is not useful for what we're doing.

In [None]:
df = df.drop('name', axis=1)

The columns 'mainhue', 'topleft', and 'botright' each hold a color name as a string. The classifiers cannot work with string-valued features, so we have to map each color to an int value. Since 'mainhue' is the column we're predicting in this example, we can leave it as it is. For 'topleft' and 'botright', we can use pandas to do the mapping.

In [None]:
# change type from string to category
df.topleft = df.topleft.astype('category')
df.botright = df.botright.astype('category')

# get columns from dataframe that are category type
cat_columns = df.select_dtypes(['category']).columns

# apply category codes to category columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
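
To make the mapping concrete, here is a minimal sketch on a made-up column (the values below are illustrative and not taken from the flag data). It shows that pandas sorts the distinct strings and that each code is the position of the value in that sorted list.

```python
import pandas as pd

# Hypothetical toy column standing in for 'topleft'; the values are made up.
colours = pd.Series(["red", "green", "blue", "green", "red"])

# Converting to 'category' collects the distinct values in sorted order ...
as_category = colours.astype("category")
categories = list(as_category.cat.categories)

# ... and .cat.codes replaces each value with its position in that list.
codes = as_category.cat.codes.tolist()

print(categories)  # ['blue', 'green', 'red']
print(codes)       # [2, 1, 0, 1, 2]
```

Keep in mind that these codes impose an ordering (blue < green < red) that the colors don't really have. Tree-based models such as random forests usually cope with this, while distance-based models may be better served by one-hot encoding, for example with pd.get_dummies.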

In [None]:
df

## Personal models (10 fold cross validation)

http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html

Let's make some predictions! First up is what we refer to as personal models. For this, we'll use scikit-learn's built-in cross_val_predict function. The full list of parameters can be seen by following the documentation linked above. The parameters we use below are the algorithm (RandomForestClassifier()), the data used to fit the model (the entire dataframe excluding the column we want to predict, so everything but 'mainhue'), the target data (the column we are trying to predict, 'mainhue'), and the number of cross-validation folds (set to 10).

In [None]:
predicted = cross_val_predict(RandomForestClassifier(), df.drop('mainhue', axis=1), df['mainhue'], cv=10)
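
To see roughly what cross_val_predict does for a classifier when cv is an integer, here is a sketch of the same loop written out by hand. It runs on synthetic data from make_classification rather than the flag data, and the names X_demo, y_demo and manual_pred are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (not the flag dataset): 100 rows, 8 features, 2 classes.
X_demo, y_demo = make_classification(n_samples=100, n_features=8, random_state=0)

# Split the rows into 10 folds that each keep roughly the same class proportions.
folds = StratifiedKFold(n_splits=10)
manual_pred = np.empty_like(y_demo)

for train_idx, test_idx in folds.split(X_demo, y_demo):
    # Train a fresh model on 9 folds, then predict the 1 fold that was held out.
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_pred[test_idx] = model.predict(X_demo[test_idx])

# Every row now has exactly one prediction, made by a model that never saw it.
print(manual_pred.shape)
```

This is why the predictions can be compared against the full target column afterwards: each row was predicted while it was held out of training.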

This gives us the accuracy: the fraction of labels that were classified correctly, as a value between 0 and 1.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

In [None]:
accuracy = accuracy_score(df['mainhue'], predicted)

print(accuracy)
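
For intuition, the default accuracy_score is just the number of matching labels divided by the total number of labels. A tiny sketch with made-up labels (not the flag data):

```python
# Made-up labels, for illustration only.
true_labels = ["red", "blue", "red", "green", "blue"]
guessed = ["red", "blue", "green", "green", "red"]

# Count the positions where the guess equals the true label.
matches = sum(t == p for t, p in zip(true_labels, guessed))
manual_accuracy = matches / len(true_labels)

print(matches)          # 3
print(manual_accuracy)  # 0.6
```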

We can view our results in greater detail with a confusion matrix, which counts how often each true label was predicted as each possible label.

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

### Side example on confusion matrices for the unfamiliar

In [None]:
y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
test_cm = confusion_matrix(y_true, y_pred, labels=["ant", "bird", "cat"])

print(test_cm)

In [None]:
test_cm_df = pd.DataFrame(test_cm, columns=["ant", "bird", "cat"], index=["ant", "bird", "cat"])
print(test_cm_df)
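
If the layout is still unclear, the same matrix can be tallied by hand. In this sketch, rows are the true labels and columns are the predicted labels, which is the convention confusion_matrix uses; the names pair_counts and manual_cm are made up for the illustration.

```python
from collections import Counter

y_true = ["cat", "ant", "cat", "cat", "ant", "bird"]
y_pred = ["ant", "ant", "cat", "cat", "ant", "cat"]
names = ["ant", "bird", "cat"]

# Count how often each (true label, predicted label) pair occurs.
pair_counts = Counter(zip(y_true, y_pred))

# Row = true label, column = predicted label.
manual_cm = [[pair_counts[(t, p)] for p in names] for t in names]

for name, row in zip(names, manual_cm):
    print(name, row)
# ant [2, 0, 0]
# bird [0, 0, 1]
# cat [1, 0, 2]
```

Reading the last row: of the three true cats, one was predicted as an ant and two were predicted correctly. The diagonal holds the correct predictions.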

### Back to our results and confusion matrix

In [None]:
# use the same label order for the matrix and for the dataframe's rows and columns
labels = df['mainhue'].unique()
cm = confusion_matrix(df['mainhue'], predicted, labels=labels)
cm_df = pd.DataFrame(cm, columns=labels, index=labels)

In [None]:
cm_df

In [None]:
df['mainhue'].value_counts()

In [None]:
# divide each row by its total (axis=0 matches the row sums to the row labels)
df_cm_norm = cm_df.div(cm_df.sum(axis=1), axis=0)
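
Row normalisation means dividing each row by its own total, so every row sums to 1 and shows how the true members of a class were distributed over the predictions. A small sketch on the toy ant/bird/cat matrix, using DataFrame.div with axis=0 (a plain `/` between a DataFrame and a Series would line the Series up with the column labels instead):

```python
import pandas as pd

names = ["ant", "bird", "cat"]
toy_cm = pd.DataFrame([[2, 0, 0], [0, 0, 1], [1, 0, 2]], index=names, columns=names)

# Total number of true samples per class (one value per row).
row_totals = toy_cm.sum(axis=1)

# axis=0 aligns row_totals with the row labels, so each row is divided by its own total.
row_normalised = toy_cm.div(row_totals, axis=0)

print(row_normalised)
```

In the result, the 'cat' row reads one third 'ant' and two thirds 'cat', matching the three true cats in the toy example.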

In [None]:
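def plot_confusion_matrix(df_confusion, title='Confusion matrix', cmap=plt.cm.gray_r):
    # draw the matrix as an image: rows are true labels, columns are predicted labels
    plt.matshow(df_confusion, cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(df_confusion.columns))
    plt.xticks(tick_marks, df_confusion.columns, rotation=45)
    plt.yticks(tick_marks, df_confusion.index)
    # fall back to descriptive axis labels when the index and columns are unnamed
    plt.ylabel(df_confusion.index.name or 'True label')
    plt.xlabel(df_confusion.columns.name or 'Predicted label')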

In [None]:
plot_confusion_matrix(df_cm_norm)
plt.show()

### Let's try to use what we learned to predict the dominant religion in a country using the flag data

Don't forget that 'mainhue' must now be mapped to ints, because it becomes one of the features used to build the model.

## Impersonal models

Impersonal models split the data up into training and testing sets to train and evaluate the model.

Our target variable will once again be 'mainhue'. We'll call this y, and X will be the rest of the dataframe.

In [None]:
y = df.mainhue

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('mainhue', axis=1), y, test_size=0.2)
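
Because train_test_split shuffles the rows randomly, the split (and therefore the accuracy) changes from run to run. The sketch below, on synthetic data rather than the flag data, shows two optional parameters that may help: random_state makes the split repeatable, and stratify keeps the class proportions the same in both parts. The names X_demo, y_demo and the value 42 are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 50 rows, 3 features, two classes in a 40/10 ratio.
X_demo = np.arange(150).reshape(50, 3)
y_demo = np.array([0] * 40 + [1] * 10)

# test_size=0.2 holds out 20% of the rows; stratify keeps the 4:1 class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (40, 3) (10, 3)
print(np.bincount(y_te))       # [8 2]
```

Note that stratify needs at least two rows of every class, so it may not be usable when a class is very rare.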

Next we fit the model on the training data. Once again we'll use RandomForestClassifier.

In [None]:
random_forest = RandomForestClassifier()

In [None]:
random_forest.fit(X_train, y_train)

In [None]:
y_predict = random_forest.predict(X_test)
accuracy_score(y_test, y_predict)

In [None]:
# use the same label order for the matrix and for the dataframe's rows and columns
labels = df['mainhue'].unique()
impersonal_cm = confusion_matrix(y_test, y_predict, labels=labels)
impersonal_cm_df = pd.DataFrame(impersonal_cm, columns=labels, index=labels)

In [None]:
impersonal_cm_df

In [None]:
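# divide each row by its total; a class with no samples in the test set gives a row of NaN
df_cm_im_norm = impersonal_cm_df.div(impersonal_cm_df.sum(axis=1), axis=0)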

In [None]:
plot_confusion_matrix(df_cm_im_norm)
plt.show()