# Sunday challenge

There are several parts to this exercise. In the first section, we'll explore logistic regression on two different classification problems. In the second section, we'll play with some other methods and see if we can improve our performance.

In [None]:
# Importing all the libraries we'll use
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

import sklearn
from sklearn import model_selection, datasets, linear_model

In [None]:
%matplotlib inline

## Part 1 - Loading the data

### Loading the data

#### Dataset 1 - Digits

Alex talked about digit classification being one of the 'hello world' problems of machine learning. So, we may as well have a go at it!

There is a digits dataset built into sklearn, so we can load it easily with sklearn.datasets.load_digits().

In [None]:
digits = datasets.load_digits()
X_digits = digits.data
y_digits = digits.target

Each row of X holds the features for one sample, and y holds the corresponding class. In this case, the features are the 64 pixel values of an 8x8 image of a hand-written number. Let's look at the first digit:

In [None]:
X_digits[0]

In [None]:
y_digits[0]

Apparently, the first digit is a zero. Let's plot it and see what it looks like:

In [None]:
plt.imshow(X_digits[0].reshape(8, 8), cmap='gray')

#### Task 1: plot 3 more images. Do the tags (y values) match what you think the numbers look like?

In [None]:
# Answer task 1 here

We could use the arrays of X and y as they are for model building, but we may as well put them into a pandas dataframe to get in some more practice!

In [None]:
digits_df = pd.DataFrame(X_digits)
digits_df['class'] = y_digits
digits_df.head() # 64 pixel values then the class

As a final step, we'll split the data into train and test sets. The model is fitted on the train set, and the test set is used to evaluate its performance on data it has not seen.

In [None]:
digits_train, digits_test = model_selection.train_test_split(digits_df)
# If we wanted to keep working with arrays we'd do the following:
#X_digits_train, X_digits_test, y_digits_train, y_digits_test = model_selection.train_test_split(X_digits, y_digits)
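
When called with no size arguments, `train_test_split` holds out 25% of the rows for testing and shuffles them differently on each run. The sketch below checks those proportions and then shows one possible variation: a fixed seed and a stratified split. The `random_state=0` value, the `stratify` argument, and the `train_strat`/`test_strat` names are illustrative assumptions, not part of the original exercise.

```python
import pandas as pd
from sklearn import datasets, model_selection

# Rebuild the digits dataframe so this cell runs on its own
digits = datasets.load_digits()
digits_df = pd.DataFrame(digits.data)
digits_df['class'] = digits.target

# Default split: test_size falls back to 0.25 when no size is given
train_default, test_default = model_selection.train_test_split(digits_df)
print(len(train_default), len(test_default))

# Assumed variation: fixed seed for repeatability, stratified by class
train_strat, test_strat = model_selection.train_test_split(
    digits_df, random_state=0, stratify=digits_df['class'])
print(test_strat['class'].value_counts().sort_index())
```

Stratifying keeps the share of each digit roughly equal in both sets, which can make scores a little more comparable between runs.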

#### Dataset 2 - Robot collisions

In [None]:
data = pd.read_csv('collisions.csv')

In [None]:
data.head()

The classes are mapped to integers as follows: classes = {'normal':0, 'collision':1, 'obstruction':2, 'fr_collision':3}
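
If you want readable names when inspecting model output, the mapping above can be inverted. This is a small sketch; the `predicted` values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Mapping quoted in the text above
classes = {'normal': 0, 'collision': 1, 'obstruction': 2, 'fr_collision': 3}

# Invert it so integer labels can be turned back into names
class_names = {code: name for name, code in classes.items()}

# Hypothetical predictions, for illustration only
predicted = pd.Series([0, 3, 1, 2, 0])
named = predicted.map(class_names).tolist()
print(named)
```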

You can read about the dataset from which these values were derived here: https://archive.ics.uci.edu/ml/datasets/Robot+Execution+Failures. 

In [None]:
# Looking at how the classes differ. 
data.hist(by='class name', column='t3') # Try f1, f2, f3, t1, t2, t3
# Add sharex=True to the function above - does this make the differences clearer?

Again, the final step in getting this data ready is to split it into training data and test data.

In [None]:
robot_train, robot_test = model_selection.train_test_split(data)

In [None]:
robot_train.head()  # 75% of the data is in this df (the default split is 75/25)

In [None]:
robot_test.head() # This will be used for model scoring

## Part 2 - Logistic regression

Now that the data is all ready, we can try out our classification skills!

In [None]:
model = linear_model.LogisticRegression()

In [None]:
X = robot_train[['f1', 'f2', 'f3', 't1', 't2', 't3']] # The input columns
y = robot_train['class'] # The target column (the integer class label)
model.fit(X, y)

In [None]:
model.score(X, y) # Score on the TRAINING data

In [None]:
# And to score on the test data
XT = robot_test[['f1', 'f2', 'f3', 't1', 't2', 't3']]
yT = robot_test['class']
model.score(XT, yT)

In [None]:
# Again, but tighter code - you can copy and paste this for quick tests
model = linear_model.LogisticRegression()
model.fit(robot_train[robot_train.columns[:-2]], robot_train['class'])
print("Training score: "+str(model.score(robot_train[robot_train.columns[:-2]], robot_train['class'])))
print("Test score:     "+str(model.score(robot_test[robot_test.columns[:-2]], robot_test['class'])))

In [None]:
# And for digits:
model = linear_model.LogisticRegression()
model.fit(digits_train[digits_train.columns[:-1]], digits_train['class'])
print("Training score: "+str(model.score(digits_train[digits_train.columns[:-1]], digits_train['class'])))
print("Test score:     "+str(model.score(digits_test[digits_test.columns[:-1]], digits_test['class'])))
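
For classifiers, `score` returns the mean accuracy: the fraction of samples whose predicted class matches the true class. The sketch below checks that by hand on the digits data and also prints a confusion matrix, which shows which digits get mistaken for which. It works on the raw arrays rather than the dataframe, and `random_state=0` and `max_iter=5000` are assumed values chosen so the example is repeatable and the solver has room to converge.

```python
import numpy as np
from sklearn import datasets, linear_model, model_selection
from sklearn.metrics import confusion_matrix

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=0)

# max_iter is raised from the default (an assumption, see above)
model = linear_model.LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Accuracy computed by hand should match model.score
predictions = model.predict(X_test)
manual_accuracy = np.mean(predictions == y_test)
print(manual_accuracy, model.score(X_test, y_test))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, predictions)
print(cm)
```

Off-diagonal entries in the matrix are the misclassifications, so they are a good place to look when deciding what to try next.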

## Part 3 - Trying different models

We've got data, and we've seen how to fit a model to that data and score it on a test dataset. Now for the fun part, and the main section of this exercise - trying out different models!

Check out how many different supervised learning models are available in scikit-learn: http://scikit-learn.org/stable/supervised_learning.html

We're going to use our datasets that we've prepared to try out different models. I'll list some common classifiers for you to try, and hint at what parameters you could change.

In [1]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
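from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF  # A kernel to pass to GaussianProcessClassifier, not a classifier itself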
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
#..... Look at the docs and see if you can find one extra one!
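
One way to compare several models is to put them in a dictionary and run the same fit-and-score steps on each. This is only a starting sketch on the digits arrays: the four classifiers, their parameter values, and `random_state=0` are arbitrary choices for illustration, and you will want to vary them.

```python
from sklearn import datasets, model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, random_state=0)

# Illustrative candidates; the parameter values are guesses to tune
candidates = {
    'k-nearest neighbours (k=3)': KNeighborsClassifier(n_neighbors=3),
    'decision tree (max_depth=10)': DecisionTreeClassifier(max_depth=10, random_state=0),
    'random forest (100 trees)': RandomForestClassifier(n_estimators=100, random_state=0),
    'Gaussian naive Bayes': GaussianNB(),
}

# Fit each model and record (training score, test score)
results = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    results[name] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

# Print the models ordered by test score, best first
for name, (train_score, test_score) in sorted(
        results.items(), key=lambda kv: kv[1][1], reverse=True):
    print(f"{name:30s} train={train_score:.3f} test={test_score:.3f}")
```

A large gap between the training and test scores suggests the model is overfitting, so it is worth printing both.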

### Task: Try out at least ten different models, and pick your best!