In [1]:
"""The digits recognition dataset
Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with the MNIST digits recognition dataset, which has 10 classes, the digits 0 through 9! A reduced version of the MNIST dataset is one of scikit-learn's included datasets, and that is the one we will use in this exercise.

Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type Bunch, which are dictionary-like objects. Helpfully for the MNIST dataset, scikit-learn provides an 'images' key in addition to the 'data' and 'target' keys that you have seen with the Iris data. Because it stores each sample as its own 8x8 2D array, this 'images' key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see Chapter 2 of DataCamp's course on Data Visualization with Python). On the other hand, the 'data' key contains the feature array - that is, each image flattened into a row of 64 pixels.

Notice that you can access the keys of these Bunch objects in two different ways: by using the . notation, as in digits.images, or the [] notation, as in digits['images'].

For more on the MNIST data, check out this exercise in Part 1 of DataCamp's Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model."""
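
The two access styles and the relationship between 'images' and 'data' can be checked directly. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; it uses scikit-learn's load_digits loader, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Bunch objects support both attribute and dictionary-style access
images_attr = digits.images
images_item = digits['images']
same_object = images_attr is images_item
print(same_object)

# 'images' keeps each sample as an 8x8 grid; 'data' flattens it to 64 features
print(digits.images.shape)
print(digits.data.shape)

# Flattening every image row by row reproduces the feature array
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))

# Pixel intensities run from 0 to 16; the targets are the digits 0 through 9
print(digits.data.min(), digits.data.max())
print(np.unique(digits.target))
```

Use 'images' when plotting a sample and 'data' when fitting a model, since estimators expect one flat row of features per sample.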

In [None]:
#Data Preprocessing
#!/usr/bin/env python
import numpy as np
# np.nanmean replaces scipy.stats.nanmean, which newer SciPy releases no longer provide

def fill_missing_values(X):
    """Impute missing values (NaN) with the column mean before building a learner."""
    mean=np.nanmean(X,axis=0)
    for row in range(len(X)):
        for col in range(len(X[row])):
            if np.isnan(X[row][col]):
                X[row][col]=mean[col]
    return X
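
The same column-mean imputation can also be written without explicit loops. The sketch below is a hypothetical variant (the name fill_missing_values_vectorized is not part of the original code) and, unlike the function above, it returns a new array instead of modifying X in place.

```python
import numpy as np

def fill_missing_values_vectorized(X):
    """Hypothetical vectorized variant: replace each NaN with its column mean."""
    X = np.asarray(X, dtype=float)
    col_mean = np.nanmean(X, axis=0)  # per-column mean, ignoring NaNs
    # np.where broadcasts col_mean across the rows of X
    return np.where(np.isnan(X), col_mean, X)

# Small made-up example with one missing value in each column
X = [[1.0, np.nan],
     [3.0, 4.0],
     [np.nan, 8.0]]
filled = fill_missing_values_vectorized(X)
print(filled)
```

Note that a column containing only NaN has no mean to fall back on, so its entries would stay NaN in both versions.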


In [1]:
#Functions
#!/usr/bin/env python
import json
import numpy as np

subjects={"English":0,"Physics":1,"Chemistry":2, "ComputerScience":3,"Biology":4,\
        "PhysicalEducation":5, "Economics":6,"Accountancy":7,"BusinessStudies":8,\
        "Mathematics":9,"serial":10}

def get_x(data):
    x=[np.nan]*9    # one slot per input subject (indices 0-8); NaN marks a missing grade
    for key in data.keys():
        if subjects[key]<=8:
            x[subjects[key]]=data[key]
    return x

def get_y(data):
    # The target is the Mathematics grade (index 9 in subjects).
    # A direct lookup fails with a clear KeyError if the record has no such grade.
    return data["Mathematics"]

def load_training_data(filename):
    X=[];Y=[]
    # The first line holds the record count; each following line is one JSON record
    with open(filename,"r") as f:
        nline=int(f.readline())
        for i in range(nline):
            data=json.loads(f.readline())
            X.append(get_x(data))
            Y.append(get_y(data))
    return (X,Y)

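def load_test_data(xfilename,yfilename):
    X=[];Y=[]
    # Features come from xfilename (count line, then JSON records);
    # labels come from yfilename, one integer per line
    with open(xfilename,"r") as fx, open(yfilename,"r") as fy:
        nline=int(fx.readline())
        for i in range(nline):
            data=json.loads(fx.readline())
            X.append(get_x(data))
            Y.append(int(fy.readline()))
    return (X,Y)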

def load_input():
    # Read the record count and then one JSON record per line from standard input
    X=[]
    n=int(input())
    for i in range(n):
        data=json.loads(input())
        X.append(get_x(data))
    return X
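
The loaders above imply an input layout of a count line followed by one JSON object per line. The sketch below illustrates that layout with made-up records held in an in-memory buffer; the grade values are hypothetical, and the subjects mapping and get_x are repeated only so the example runs on its own.

```python
import io
import json
import numpy as np

subjects = {"English": 0, "Physics": 1, "Chemistry": 2, "ComputerScience": 3,
            "Biology": 4, "PhysicalEducation": 5, "Economics": 6,
            "Accountancy": 7, "BusinessStudies": 8, "Mathematics": 9,
            "serial": 10}

def get_x(data):
    # One slot per input subject; subjects missing from the record stay NaN
    x = [np.nan] * 9
    for key in data:
        if subjects[key] <= 8:
            x[subjects[key]] = data[key]
    return x

# Made-up file contents: a record count, then one JSON object per line
sample = io.StringIO(
    '2\n'
    '{"serial": 1, "English": 4, "Physics": 6, "Mathematics": 5}\n'
    '{"serial": 2, "English": 2, "Economics": 3, "Mathematics": 3}\n'
)

n = int(sample.readline())
X, Y = [], []
for _ in range(n):
    record = json.loads(sample.readline())
    X.append(get_x(record))
    Y.append(record["Mathematics"])

print(X[0])
print(Y)
```

Each feature row has nine entries, and the NaN placeholders are what fill_missing_values later replaces with column means.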

In [None]:
#Decision tree
#!/usr/bin/env python
from datasets import load_training_data,load_test_data
from preprocessing import fill_missing_values

""" get training data """
Xtr,Ytr=load_training_data("data/training.json")
Xtr=fill_missing_values(Xtr)

""" get test data """
Xte,Yte=load_test_data("data/sample-test.in.json","data/sample-test.out.json")
Xte=fill_missing_values(Xte)

"""training"""
from sklearn.tree import DecisionTreeClassifier
learner = DecisionTreeClassifier(max_depth=7,random_state=0).fit(Xtr, Ytr)

"""predicting"""
print(learner.score(Xtr,Ytr))   # accuracy on the training set
print(learner.score(Xte,Yte))   # accuracy on the held-out test set
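
Because the grade files are not included here, the same train-then-score pattern can be tried on scikit-learn's digits data as a stand-in. This is a sketch under that assumption, not the original experiment: the 80/20 split and the variable names are illustrative choices, and only max_depth=7 and random_state=0 are taken from the cell above.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

# Hold out 20% of the samples, keeping the class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.2, random_state=0, stratify=digits.target)

# Same estimator settings as the cell above
learner = DecisionTreeClassifier(max_depth=7, random_state=0).fit(X_train, y_train)

# score() returns mean accuracy; comparing the two hints at overfitting
train_acc = learner.score(X_train, y_train)
test_acc = learner.score(X_test, y_test)
print(train_acc)
print(test_acc)
```

A training score well above the test score suggests the tree is memorizing the training data, in which case a smaller max_depth may generalize better.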