# Lecture 10: Classification Part 1

### Classification using scikit-learn (with pandas)

Classification Algorithms covered:
1. k-Nearest Neighbors
2. Decision Trees / Random Forest
3. Logistic Regression

Notebook created by Jennifer Widom, modified by Lisa Wang.

In [70]:
import csv
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [71]:
# Read Cities.csv into dataframe, add column for temperature category
# Note: For a dataframe D and row label i, D.loc[i] is the row labeled i
# (with the default integer index, that is the i-th row of D)
cities = pd.read_csv('Cities.csv')
cats = []
for i in range(len(cities)):
    if cities.loc[i, 'temperature'] < 5:
        cats.append('cold')
    elif cities.loc[i, 'temperature'] < 9:
        cats.append('cool')
    elif cities.loc[i, 'temperature'] < 15:
        cats.append('warm')
    else:
        cats.append('hot')
cities['category'] = cats
print('cold: %d' % len(cities[cities.category == 'cold']))
print('cool: %d' % len(cities[cities.category == 'cool']))
print('warm: %d' % len(cities[cities.category == 'warm']))
print('hot: %d' % len(cities[cities.category == 'hot']))

cold: 17
cool: 92
warm: 79
hot: 25
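
The if/elif chain above maps each temperature to one of four bins. As a minimal sketch of the same rule without a dataframe, the thresholds can be kept in a sorted list and looked up with the standard library's `bisect_right`. The names `THRESHOLDS`, `LABELS` and `temperature_category` are illustrative and not part of the notebook.

```python
from bisect import bisect_right

# Exclusive upper bounds for 'cold', 'cool' and 'warm'; anything else is 'hot'.
THRESHOLDS = [5, 9, 15]
LABELS = ['cold', 'cool', 'warm', 'hot']

def temperature_category(temperature):
    """Return the category label for a temperature, using the cell's thresholds."""
    # bisect_right counts how many thresholds are <= temperature,
    # which is the index of the matching label.
    return LABELS[bisect_right(THRESHOLDS, temperature)]

# Temperatures of four cities from Cities.csv
for name, temp in [('Aalborg', 7.52), ('Abisko', 0.20), ('Adana', 18.67), ('Albacete', 12.62)]:
    print('%s: %s' % (name, temperature_category(temp)))
```

A temperature exactly on a threshold falls into the warmer bin, which matches the strict `<` comparisons in the cell.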


In [72]:
# Create training and test sets for cities data
num_items = len(cities)
percent_train = 0.85
num_train = int(num_items*percent_train)
num_test = num_items - num_train
print('Training set %d items' % num_train)
print('Test set %d items' % num_test)
citiesTrain = cities[0:num_train]
citiesTest = cities[num_train:]

Training set 181 items
Test set 32 items
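
The split above takes rows in file order, so which cities land in the test set depends on how the file happens to be sorted. A minimal sketch of a shuffled split is shown here on plain row numbers rather than on the dataframe. The helper name `shuffled_split` and the seed value are assumptions for illustration.

```python
import random

def shuffled_split(rows, percent_train, seed=0):
    """Shuffle a copy of rows, then split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    num_train = int(len(shuffled) * percent_train)
    return shuffled[:num_train], shuffled[num_train:]

row_ids = range(213)  # stand-in for the 213 city rows
train, test = shuffled_split(row_ids, 0.85)
print('Training set %d items' % len(train))
print('Test set %d items' % len(test))
```

The set sizes are the same as in the cell above; only the membership of each set changes.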


In [73]:
print(citiesTrain[:10])

        city         country  latitude  longitude  temperature category
0    Aalborg         Denmark     57.03       9.92         7.52     cool
1   Aberdeen  United Kingdom     57.17      -2.08         8.10     cool
2     Abisko          Sweden     63.35      18.83         0.20     cold
3      Adana          Turkey     36.99      35.32        18.67      hot
4   Albacete           Spain     39.00      -1.87        12.62     warm
5  Algeciras           Spain     36.13      -5.47        17.38      hot
6     Amiens          France     49.90       2.30        10.17     warm
7  Amsterdam     Netherlands     52.35       4.92         8.93     cool
8     Ancona           Italy     43.60      13.50        13.52     warm
9    Andorra         Andorra     42.50       1.52         9.60     warm


Pandas: Note that you can access an individual row by its index label, e.g.:

In [74]:
citiesTrain.loc[0]

city           Aalborg
country        Denmark
latitude         57.03
longitude         9.92
temperature       7.52
category          cool
Name: 0, dtype: object

In [75]:
print(citiesTest[:10])

             city         country  latitude  longitude  temperature category
181         Sivas          Turkey     39.75      37.03         8.05     cool
182        Skopje       Macedonia     42.00      21.43         9.36     warm
183         Split         Croatia     43.52      16.47        12.46     warm
184  Stara Zagora        Bulgaria     42.42      25.62        10.90     warm
185     Stavanger          Norway     58.97       5.68         5.53     cool
186     Stockholm          Sweden     59.35      18.10         6.26     cool
187          Sumy         Ukraine     50.92      34.78         6.28     cool
188       Swansea  United Kingdom     51.63      -3.95         9.73     warm
189        Szeged         Hungary     46.25      20.15        10.34     warm
190       Tallinn         Estonia     59.43      24.73         4.82     cold


### K-nearest-neighbors classification

In [76]:
# Predict temperature category from other features
features = ['longitude', 'latitude']

# Create classifier
neighbors = 7 # Number of neighbors to consider for k nearest neighbor classification
classifier = KNeighborsClassifier(n_neighbors=neighbors)

# Train the classifier on training data
classifier.fit(citiesTrain[features], citiesTrain['category'])

# Make predictions on training data
train_predictions = classifier.predict(citiesTrain[features])

# Make predictions on test data
test_predictions = classifier.predict(citiesTest[features])

num_train = len(citiesTrain)
num_test = len(citiesTest)
# Calculate training accuracy
train_correct = 0
for i in range(num_train):
    print('Predicted: %s  Actual: %s' % (train_predictions[i], citiesTrain.loc[i, 'category']))
    if train_predictions[i] == citiesTrain.loc[i, 'category']:
        train_correct += 1
print('Training Accuracy: %s' % (float(train_correct)/float(num_train)))
print("")

# Calculate test accuracy (citiesTest keeps its original row labels, starting at num_train)
test_correct = 0
for i in range(num_test):
    print('Predicted: %s  Actual: %s' % (test_predictions[i], citiesTest.loc[num_train + i, 'category']))
    if test_predictions[i] == citiesTest.loc[num_train + i, 'category']:
        test_correct += 1
print('Test Accuracy: %s' % (float(test_correct)/float(num_test)))
# Comment out the print calls, try other values for neighbors and other features

Predicted: cool  Actual: cool
Predicted: cool  Actual: cool
Predicted: cold  Actual: cold
Predicted: warm  Actual: hot
Predicted: hot  Actual: warm
Predicted: hot  Actual: hot
Predicted: warm  Actual: warm
Predicted: cool  Actual: cool
Predicted: warm  Actual: warm
Predicted: warm  Actual: warm
Predicted: warm  Actual: warm
Predicted: warm  Actual: warm
Predicted: warm  Actual: warm
Predicted: warm  Actual: warm
Predicted: hot  Actual: hot
Predicted: cool  Actual: cold
Predicted: cool  Actual: cool
Predicted: hot  Actual: hot
Predicted: cool  Actual: cool
Predicted: cool  Actual: cool
Predicted: warm  Actual: hot
Predicted: hot  Actual: hot
Predicted: cool  Actual: cool
Predicted: warm  Actual: warm
Predicted: cool  Actual: cool
Predicted: warm  Actual: warm
Predicted: cool  Actual: warm
Predicted: cool  Actual: cold
Predicted: cool  Actual: cool
Predicted: cool  Actual: cool
Predicted: cool  Actual: cool
Predicted: cool  Actual: cool
Predicted: warm  Actual: warm
Predicted: warm  Actu
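
To make the prediction step concrete, here is a minimal from-scratch sketch of k-nearest-neighbors on the first ten training rows shown earlier. It assumes plain Euclidean distance on (longitude, latitude) and one unweighted vote per neighbor. It is not scikit-learn's implementation, and the names `TRAIN` and `knn_predict` are illustrative.

```python
from collections import Counter

# (longitude, latitude, category) for the first ten training rows
TRAIN = [
    (9.92, 57.03, 'cool'),    # Aalborg
    (-2.08, 57.17, 'cool'),   # Aberdeen
    (18.83, 63.35, 'cold'),   # Abisko
    (35.32, 36.99, 'hot'),    # Adana
    (-1.87, 39.00, 'warm'),   # Albacete
    (-5.47, 36.13, 'hot'),    # Algeciras
    (2.30, 49.90, 'warm'),    # Amiens
    (4.92, 52.35, 'cool'),    # Amsterdam
    (13.50, 43.60, 'warm'),   # Ancona
    (1.52, 42.50, 'warm'),    # Andorra
]

def knn_predict(train, point, k):
    """Predict a category by majority vote among the k closest training rows."""
    def squared_distance(row):
        # Squared Euclidean distance gives the same ordering as the distance itself
        return (row[0] - point[0]) ** 2 + (row[1] - point[1]) ** 2
    nearest = sorted(train, key=squared_distance)[:k]
    votes = Counter(row[2] for row in nearest)
    return votes.most_common(1)[0][0]

# Stavanger (longitude 5.68, latitude 58.97) is labeled 'cool' in the test set
print(knn_predict(TRAIN, (5.68, 58.97), 3))
```

With k=3 the nearest rows are Aalborg, Amsterdam and Aberdeen, all labeled cool, so the sketch predicts cool for Stavanger.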

### <font color="green">Your Turn: K-nearest-neighbors on World Cup Data</font>

In [77]:
# Predict position from one or more of minutes, shots, passes, tackles, saves.
# This cell does all the set-up, including reordering the data to avoid team bias.
players = pd.read_csv('Players.csv')
players = players.sort_values(by='surname')
players = players.reset_index(drop=True)
num_items = len(players)
percent_train = 0.95
num_train = int(num_items*percent_train)
num_test = num_items - num_train
print('Training set %d items' % num_train)
print('Test set %d items' % num_test)
playersTrain = players[0:num_train]
playersTest = players[num_train:]

Training set 565 items
Test set 30 items


In [78]:
features = ['minutes', 'shots', 'passes', 'tackles', 'saves']
# Predict a player's position (playersTrain['position'])

# Create classifier
neighbors = 11
classifier = KNeighborsClassifier(n_neighbors=neighbors)

# Train the classifier on training data
classifier.fit(playersTrain[features], playersTrain['position'])

# Make predictions on training data
train_predictions = classifier.predict(playersTrain[features])

# Make predictions on test data
test_predictions = classifier.predict(playersTest[features])

num_train = len(playersTrain)
num_test = len(playersTest)
# Calculate training accuracy
train_correct = 0
for i in range(num_train):
#     print('Predicted: %s  Actual: %s' % (train_predictions[i], playersTrain.loc[i, 'position']))
    if train_predictions[i] == playersTrain.loc[i, 'position']:
        train_correct += 1
print('Training Accuracy: %s' % (float(train_correct)/float(num_train)))
print("")

# Calculate test accuracy (playersTest keeps its original row labels, starting at num_train)
test_correct = 0
for i in range(num_test):
    print('Predicted: %s  Actual: %s' % (test_predictions[i], playersTest.loc[num_train + i, 'position']))
    if test_predictions[i] == playersTest.loc[num_train + i, 'position']:
        test_correct += 1
print('Test Accuracy: %s' % (float(test_correct)/float(num_test)))


Training Accuracy: 0.623008849558

Predicted: forward  Actual: defender
Predicted: midfielder  Actual: forward
Predicted: forward  Actual: forward
Predicted: defender  Actual: midfielder
Predicted: forward  Actual: forward
Predicted: defender  Actual: midfielder
Predicted: defender  Actual: midfielder
Predicted: forward  Actual: forward
Predicted: defender  Actual: midfielder
Predicted: midfielder  Actual: midfielder
Predicted: defender  Actual: defender
Predicted: defender  Actual: midfielder
Predicted: forward  Actual: forward
Predicted: forward  Actual: forward
Predicted: midfielder  Actual: midfielder
Predicted: midfielder  Actual: midfielder
Predicted: midfielder  Actual: midfielder
Predicted: forward  Actual: defender
Predicted: defender  Actual: defender
Predicted: midfielder  Actual: defender
Predicted: midfielder  Actual: midfielder
Predicted: defender  Actual: defender
Predicted: forward  Actual: forward
Predicted: defender  Actual: midfielder
Predicted: defender  Actual: mid
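
A single accuracy number hides which positions get confused with which. The sketch below computes accuracy and (actual, predicted) pair counts for the first eight test predictions printed above. The helper names `accuracy` and `confusion_counts` are illustrative, and eight rows are far too few to draw conclusions from.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return float(correct) / len(actual)

def confusion_counts(predicted, actual):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

# First eight test predictions from the output above
predicted = ['forward', 'midfielder', 'forward', 'defender',
             'forward', 'defender', 'defender', 'forward']
actual = ['defender', 'forward', 'forward', 'midfielder',
          'forward', 'midfielder', 'midfielder', 'forward']

print('Accuracy on these rows: %s' % accuracy(predicted, actual))
for (a, p), n in sorted(confusion_counts(predicted, actual).items()):
    print('actual %s, predicted %s: %d' % (a, p, n))
```

In this small sample, every midfielder is predicted as a defender, which the accuracy alone would not reveal.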

### Decision tree classification

In [79]:
# Predict temperature category from other features
features = ['longitude','latitude']

# Create classifier
split = 10
dt = DecisionTreeClassifier(min_samples_split=split) # parameter is optional

# Train the classifier on training data
dt.fit(citiesTrain[features], citiesTrain['category'])

# Make predictions on training data
train_predictions = dt.predict(citiesTrain[features])

# Make predictions on test data
test_predictions = dt.predict(citiesTest[features])

num_train = len(citiesTrain)
num_test = len(citiesTest)
# Calculate training accuracy
train_correct = 0
for i in range(num_train):
#     print('Predicted: %s  Actual: %s' % (train_predictions[i], citiesTrain.loc[i, 'category']))
    if train_predictions[i] == citiesTrain.loc[i, 'category']:
        train_correct += 1
print('Training Accuracy: %s' % (float(train_correct)/float(num_train)))
print("")

# Calculate test accuracy (citiesTest keeps its original row labels, starting at num_train)
test_correct = 0
for i in range(num_test):
#     print('Predicted: %s  Actual: %s' % (test_predictions[i], citiesTest.loc[num_train + i, 'category']))
    if test_predictions[i] == citiesTest.loc[num_train + i, 'category']:
        test_correct += 1
print('Test Accuracy: %s' % (float(test_correct)/float(num_test)))

Training Accuracy: 0.933701657459

Test Accuracy: 0.75
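
A decision tree grows by choosing splits that make the groups on each side purer, and `min_samples_split` stops a node from being split once it holds fewer samples than that value. As a rough illustration of "purer", the sketch below computes Gini impurity, which is scikit-learn's default split criterion, before and after one hypothetical split. The helper names are illustrative, and this is not how scikit-learn searches for splits internally.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: chance that two labels drawn at random (with replacement) differ."""
    total = float(len(labels))
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two sides of a split."""
    total = float(len(left) + len(right))
    return len(left) / total * gini(left) + len(right) / total * gini(right)

# Category counts printed earlier for the full cities data
labels = ['cold'] * 17 + ['cool'] * 92 + ['warm'] * 79 + ['hot'] * 25
print('Impurity before any split: %.3f' % gini(labels))

# Hypothetical split that puts every cold and cool city on one side
left = ['cold'] * 17 + ['cool'] * 92
right = ['warm'] * 79 + ['hot'] * 25
print('Impurity after the split: %.3f' % split_impurity(left, right))
```

A lower value after the split means the two groups are each more uniform than the data as a whole.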


### "Forest" of decision trees

In [80]:
# Predict temperature category from other features
features = ['longitude', 'latitude']

# Create classifier
trees = 10 # Try other values for trees
rf = RandomForestClassifier(n_estimators=trees)

# Train the classifier on training data
rf.fit(citiesTrain[features], citiesTrain['category'])

# Make predictions on training data
train_predictions = rf.predict(citiesTrain[features])

# Make predictions on test data
test_predictions = rf.predict(citiesTest[features])

num_train = len(citiesTrain)
num_test = len(citiesTest)
# Calculate training accuracy
train_correct = 0
for i in range(num_train):
#     print('Predicted: %s  Actual: %s' % (train_predictions[i], citiesTrain.loc[i, 'category']))
    if train_predictions[i] == citiesTrain.loc[i, 'category']:
        train_correct += 1
print('Training Accuracy: %s' % (float(train_correct)/float(num_train)))
print("")

# Calculate test accuracy (citiesTest keeps its original row labels, starting at num_train)
test_correct = 0
for i in range(num_test):
#     print('Predicted: %s  Actual: %s' % (test_predictions[i], citiesTest.loc[num_train + i, 'category']))
    if test_predictions[i] == citiesTest.loc[num_train + i, 'category']:
        test_correct += 1
print('Test Accuracy: %s' % (float(test_correct)/float(num_test)))

Training Accuracy: 0.983425414365

Test Accuracy: 0.75
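
A forest combines many trees, each trained on a different random resample of the training rows. The sketch below shows the two ingredients in simplified form: drawing a bootstrap sample and taking a majority vote. This is a simplification. To my understanding, scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes. The helper names are illustrative.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one tree's training set."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))  # stand-in for ten training rows
sample = bootstrap_sample(rows, rng)
print('Bootstrap sample: %s' % sample)

# Hypothetical predictions from five trees for one city
print(majority_vote(['warm', 'cool', 'warm', 'hot', 'warm']))
```

Because sampling is with replacement, some rows usually appear more than once in a sample and others not at all, which is what makes the trees differ from one another.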


### <font color="green">Your Turn: Decision tree and forest of trees on World Cup Data</font>

In [81]:
# SINGLE TREE
# Predict position from one or more of minutes, shots, passes, tackles, saves.
# Try different features and different values for min_samples_split.
# What's the highest accuracy you can get?
features = ['minutes', 'shots', 'passes', 'tackles', 'saves']

# Create classifier
split = 10
dt = DecisionTreeClassifier(min_samples_split=split) # parameter is optional

# Train the classifier on training data
dt.fit(playersTrain[features], playersTrain['position'])

# Make predictions on training data
train_predictions = dt.predict(playersTrain[features])

# Make predictions on test data
test_predictions = dt.predict(playersTest[features])

num_train = len(playersTrain)
num_test = len(playersTest)
# Calculate training accuracy
train_correct = 0
for i in range(num_train):
#     print('Predicted: %s  Actual: %s' % (train_predictions[i], playersTrain.loc[i, 'position']))
    if train_predictions[i] == playersTrain.loc[i, 'position']:
        train_correct += 1
print('Training Accuracy: %s' % (float(train_correct)/float(num_train)))
print("")

# Calculate test accuracy (playersTest keeps its original row labels, starting at num_train)
test_correct = 0
for i in range(num_test):
#     print('Predicted: %s  Actual: %s' % (test_predictions[i], playersTest.loc[num_train + i, 'position']))
    if test_predictions[i] == playersTest.loc[num_train + i, 'position']:
        test_correct += 1
print('Test Accuracy: %s' % (float(test_correct)/float(num_test)))

Training Accuracy: 0.846017699115

Test Accuracy: 0.466666666667


In [82]:
# FOREST OF TREES
# Predict position from one or more of minutes, shots, passes, tackles, saves.
# Try different values for n_estimators.
# What's the highest accuracy you can get?
features = ['minutes', 'shots', 'passes', 'tackles', 'saves']

# Create classifier
trees = 10
rf = RandomForestClassifier(n_estimators=trees)

# Train the classifier on training data
rf.fit(playersTrain[features], playersTrain['position'])

# Make predictions on training data
train_predictions = rf.predict(playersTrain[features])

# Make predictions on test data
test_predictions = rf.predict(playersTest[features])

num_train = len(playersTrain)
num_test = len(playersTest)
# Calculate training accuracy
train_correct = 0
for i in range(num_train):
#     print('Predicted: %s  Actual: %s' % (train_predictions[i], playersTrain.loc[i, 'position']))
    if train_predictions[i] == playersTrain.loc[i, 'position']:
        train_correct += 1
print('Training Accuracy: %s' % (float(train_correct)/float(num_train)))
print("")

# Calculate test accuracy (playersTest keeps its original row labels, starting at num_train)
test_correct = 0
for i in range(num_test):
#     print('Predicted: %s  Actual: %s' % (test_predictions[i], playersTest.loc[num_train + i, 'position']))
    if test_predictions[i] == playersTest.loc[num_train + i, 'position']:
        test_correct += 1
print('Test Accuracy: %s' % (float(test_correct)/float(num_test)))

Training Accuracy: 0.978761061947

Test Accuracy: 0.5


### Logistic Regression Classification

In [83]:
# Predict temperature category from other features
features = ['longitude', 'latitude']

# Create classifier
lg = LogisticRegression()

# Train the classifier on training data
lg.fit(citiesTrain[features], citiesTrain['category'])

# Make predictions on training data
train_predictions = lg.predict(citiesTrain[features])

# Make predictions on test data
test_predictions = lg.predict(citiesTest[features])

num_train = len(citiesTrain)
num_test = len(citiesTest)
# Calculate training accuracy
train_correct = 0
for i in range(num_train):
#     print('Predicted: %s  Actual: %s' % (train_predictions[i], citiesTrain.loc[i, 'category']))
    if train_predictions[i] == citiesTrain.loc[i, 'category']:
        train_correct += 1
print('Training Accuracy: %s' % (float(train_correct)/float(num_train)))
print("")

# Calculate test accuracy (citiesTest keeps its original row labels, starting at num_train)
test_correct = 0
for i in range(num_test):
#     print('Predicted: %s  Actual: %s' % (test_predictions[i], citiesTest.loc[num_train + i, 'category']))
    if test_predictions[i] == citiesTest.loc[num_train + i, 'category']:
        test_correct += 1
print('Test Accuracy: %s' % (float(test_correct)/float(num_test)))

Training Accuracy: 0.541436464088

Test Accuracy: 0.4375
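
Logistic regression scores each class with a linear function of the features and turns the scores into probabilities. The sketch below uses the softmax formulation, which is one common way to handle more than two classes. Whether scikit-learn uses this or a one-vs-rest scheme depends on its version and settings. The weights and intercepts are made up for illustration and are not fitted to the cities data.

```python
import math

def softmax(scores):
    """Turn one score per class into probabilities that sum to 1."""
    biggest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(features, weights, intercepts, classes):
    """Pick the class whose linear score w . x + b is largest."""
    scores = [sum(w * x for w, x in zip(ws, features)) + b
              for ws, b in zip(weights, intercepts)]
    probabilities = softmax(scores)
    best = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best], probabilities

# Hypothetical weights for (longitude, latitude), one row per class
classes = ['cool', 'warm']
weights = [[0.0, 0.2], [0.0, -0.2]]
intercepts = [-9.0, 9.0]

label, probs = predict([9.92, 57.03], weights, intercepts, classes)
print('%s %s' % (label, probs))
```

Because each score is linear in the features, the boundary between any two classes is a straight line in the (longitude, latitude) plane.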


### <font color="green">Your Turn: Logistic Regression on World Cup Data</font>

In [84]:
# LOGISTIC REGRESSION
# Predict position from one or more of minutes, shots, passes, tackles, saves.
# Try different features.
# What's the highest accuracy you can get?
features = ['minutes', 'shots', 'passes', 'tackles', 'saves']

# Create classifier
lg = LogisticRegression()

# Train the classifier on training data
lg.fit(playersTrain[features], playersTrain['position'])

# Make predictions on training data
train_predictions = lg.predict(playersTrain[features])

# Make predictions on test data
test_predictions = lg.predict(playersTest[features])

num_train = len(playersTrain)
num_test = len(playersTest)
# Calculate training accuracy
train_correct = 0
for i in range(num_train):
#     print('Predicted: %s  Actual: %s' % (train_predictions[i], playersTrain.loc[i, 'position']))
    if train_predictions[i] == playersTrain.loc[i, 'position']:
        train_correct += 1
print('Training Accuracy: %s' % (float(train_correct)/float(num_train)))
print("")

# Calculate test accuracy (playersTest keeps its original row labels, starting at num_train)
test_correct = 0
for i in range(num_test):
#     print('Predicted: %s  Actual: %s' % (test_predictions[i], playersTest.loc[num_train + i, 'position']))
    if test_predictions[i] == playersTest.loc[num_train + i, 'position']:
        test_correct += 1
print('Test Accuracy: %s' % (float(test_correct)/float(num_test)))

Training Accuracy: 0.647787610619

Test Accuracy: 0.733333333333
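
When comparing the test accuracies reported in this notebook, it helps to remember how small the test sets are (32 cities, 30 players). The sketch below converts each reported test accuracy back into a count of correct predictions. The helper name `correct_count` is illustrative.

```python
def correct_count(accuracy, num_test):
    """Recover how many test rows were classified correctly."""
    return int(round(accuracy * num_test))

# (classifier, data set, reported test accuracy, test set size)
results = [
    ('decision tree', 'cities', 0.75, 32),
    ('random forest', 'cities', 0.75, 32),
    ('logistic regression', 'cities', 0.4375, 32),
    ('decision tree', 'players', 0.466666666667, 30),
    ('random forest', 'players', 0.5, 30),
    ('logistic regression', 'players', 0.733333333333, 30),
]
for name, data, acc, n in results:
    print('%-20s %-8s %2d of %d correct' % (name, data, correct_count(acc, n), n))
print('One test row is worth %.1f%% accuracy on the players test set' % (100.0 / 30))
```

With so few test rows, a difference of one or two correct predictions moves the accuracy by several percentage points, so small gaps between classifiers should be read with caution.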
