# Letter Recognition
In this programming assignment, we load the letter-recognition.data.csv file, explore the data set, and classify the letters. We train multiple classifiers and apply ensemble learning to improve the classification results.

#### Data Set Information:
The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15. We train on the first 16000 items and then use the resulting model to predict the letter category for the remaining 4000.

#### Features:
1. `letter`: capital letter (26 values from A to Z)
2. `x-box`: horizontal position of box (integer)
3. `y-box`: vertical position of box (integer)
4. `width`: width of box (integer)
5. `height`: height of box (integer)
6. `onpix`: total # on pixels (integer)
7. `x-bar`: mean x of on pixels in box (integer)
8. `y-bar`: mean y of on pixels in box (integer)
9. `x2bar`: mean x variance (integer)
10. `y2bar`: mean y variance (integer)
11. `xybar`: mean x y correlation (integer)
12. `x2ybr`: mean of `x*x*y` (integer)
13. `xy2br`: mean of `x*y*y` (integer)
14. `x-ege`: mean edge count left to right (integer)
15. `xegvy`: correlation of x-ege with y (integer)
16. `y-ege`: mean edge count bottom to top (integer)
17. `yegvx`: correlation of y-ege with x (integer)

### The grading rubric is in the following text and code blocks. Everywhere we need to fill in code, the number of points is displayed.

### Note: before we start, we need to upload the data file letter-recognition.data.csv.


In [4]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os

## Data preprocessing

In [5]:
import pandas as pd

def load_data(path, name):
    csv_path = os.path.join(path, name)
    return pd.read_csv(csv_path, header=None)

In [6]:
letters = load_data(".", "letter-recognition.data.csv")

In [7]:
letters.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8
1,I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10
2,D,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9
3,N,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8
4,G,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10


In [8]:
letters.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 17 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       20000 non-null  object
 1   1       20000 non-null  int64 
 2   2       20000 non-null  int64 
 3   3       20000 non-null  int64 
 4   4       20000 non-null  int64 
 5   5       20000 non-null  int64 
 6   6       20000 non-null  int64 
 7   7       20000 non-null  int64 
 8   8       20000 non-null  int64 
 9   9       20000 non-null  int64 
 10  10      20000 non-null  int64 
 11  11      20000 non-null  int64 
 12  12      20000 non-null  int64 
 13  13      20000 non-null  int64 
 14  14      20000 non-null  int64 
 15  15      20000 non-null  int64 
 16  16      20000 non-null  int64 
dtypes: int64(16), object(1)
memory usage: 2.6+ MB


In [9]:
col_names = ['letter', 'x-box', 'y-box', 'width', 'height', 'onpix', 'x-bar', 'y-bar',
             'x2bar', 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx']

In [10]:
print(col_names)
print(len(col_names))

['letter', 'x-box', 'y-box', 'width', 'height', 'onpix', 'x-bar', 'y-bar', 'x2bar', 'y2bar', 'xybar', 'x2ybr', 'xy2br', 'x-ege', 'xegvy', 'y-ege', 'yegvx']
17


In [11]:
letters.columns = col_names

In [12]:
letters.head()

Unnamed: 0,letter,x-box,y-box,width,height,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx
0,T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8
1,I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10
2,D,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9
3,N,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8
4,G,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10


In [13]:
letters['letter']

0        T
1        I
2        D
3        N
4        G
        ..
19995    D
19996    C
19997    T
19998    S
19999    A
Name: letter, Length: 20000, dtype: object

Convert the letters A through Z to numbers from 0 to 25.

In [14]:
a = np.zeros(len(letters))
for i in range(len(letters)):
    a[i] = ord(letters['letter'][i]) - ord('A')

In [15]:
a

array([19.,  8.,  3., ..., 19., 18.,  0.])
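
The explicit loop above can also be written as a single mapping over the column. The following is a minimal sketch, assuming a toy Series in place of `letters['letter']`; the names `toy_letters` and `toy_codes` are illustrative and do not appear in the notebook. It also keeps the labels as integers instead of floats.

```python
import pandas as pd

# Toy stand-in for the 'letter' column (assumption: not the real data).
toy_letters = pd.Series(["T", "I", "D", "N", "G", "A", "Z"])

# Map each capital letter to its offset from 'A' and keep an integer dtype.
toy_codes = toy_letters.map(lambda ch: ord(ch) - ord("A")).astype("int64")

print(toy_codes.tolist())
```

Integer labels avoid the `19.0`-style class values seen later in the notebook, although scikit-learn classifiers accept either form.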

Add the numbered labels as a column named y.

In [16]:
letters['y'] = a

In [17]:
letters.head()

Unnamed: 0,letter,x-box,y-box,width,height,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx,y
0,T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8,19.0
1,I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10,8.0
2,D,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9,3.0
3,N,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8,13.0
4,G,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10,6.0


Drop the column 'letter' since we do not need it anymore.

In [18]:
letters_new = letters.drop('letter', axis=1)

# Prepare training set (the first 16000 samples) and test set (the last 4000 samples).

In [19]:
letters.head()

Unnamed: 0,letter,x-box,y-box,width,height,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx,y
0,T,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8,19.0
1,I,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10,8.0
2,D,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9,3.0
3,N,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8,13.0
4,G,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10,6.0


In [20]:
letters_new.head()

Unnamed: 0,x-box,y-box,width,height,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx,y
0,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8,19.0
1,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10,8.0
2,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9,3.0
3,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8,13.0
4,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10,6.0


In [21]:
X_train = letters_new.iloc[0:16000,:16]
X_train.head()

Unnamed: 0,x-box,y-box,width,height,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx
0,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8
1,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10
2,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9
3,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8
4,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10


In [22]:
X_train.shape

(16000, 16)

In [23]:
y_train = letters_new.iloc[0:16000,16]
y_train.head()

0    19.0
1     8.0
2     3.0
3    13.0
4     6.0
Name: y, dtype: float64

In [24]:
y_train.shape

(16000,)

In [25]:
X_test = letters_new.iloc[16000:,:16]
X_test.head()

Unnamed: 0,x-box,y-box,width,height,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx
16000,4,10,6,7,9,9,6,4,3,6,7,7,9,8,5,6
16001,6,9,8,4,3,8,7,3,4,13,5,8,6,8,0,8
16002,6,9,8,8,10,7,7,5,4,7,6,8,7,9,7,10
16003,5,6,6,4,3,7,6,2,7,7,6,9,0,9,4,8
16004,5,9,7,6,4,9,7,3,5,10,4,6,5,8,1,7


In [26]:
X_test.shape

(4000, 16)

In [27]:
y_test = letters_new.iloc[16000:,16]
y_test.head()

16000    20.0
16001    13.0
16002    21.0
16003     8.0
16004    13.0
Name: y, dtype: float64

In [28]:
y_test.shape

(4000,)

# Try three classifiers: (i) random forest, (ii) SVM, and (iii) MLP. Then use an ensemble method through soft-voting.

## Classifier 1: random forest

### Scale the training data set.

In [57]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
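
Because the scaler above is fitted on the whole training set before cross-validation, every validation fold has already contributed to the scaling statistics. One common way to keep scaling inside each fold is to wrap the scaler and the classifier in a pipeline. The sketch below assumes small synthetic data (`X_demo`, `y_demo`) with the same 0 to 15 integer range; these names and the reduced `n_estimators` are illustrative choices, not part of the assignment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 16 integer features (assumption: not the real data).
rng = np.random.default_rng(42)
X_demo = rng.integers(0, 16, size=(300, 16)).astype(np.float64)
y_demo = (X_demo[:, 0] > 7).astype(int)

# The scaler is refitted on the training part of every fold.
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=42),
)
demo_scores = cross_val_score(pipe, X_demo, y_demo, cv=3, scoring="accuracy")
print(demo_scores)
```

For tree-based models such as random forests, feature scaling generally has little effect on the result; the pipeline matters more for the MLP and SVM classifiers used later.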

### Use default parameters in random forest to get a sense of the results.

Use 100 estimators and set random state to 42 to obtain a random forest classifier. Train the random forest classifier. (10 points)

In [59]:
from sklearn.ensemble import RandomForestClassifier

# fill in code

forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_clf.fit(X=X_train_scaled, y=y_train)

forest_clf

Display the cross-validation result. Use 3-fold cross-validation. (10 points)

In [38]:
from sklearn.model_selection import cross_val_score
# fill in code
scores = cross_val_score(estimator=forest_clf, X=X_train_scaled, y=y_train, cv=3, scoring="accuracy", n_jobs=-1)
scores

array([0.95444319, 0.95087193, 0.95349709])
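
Per-fold scores are easier to compare across classifiers when summarized as a mean and a spread. A small sketch using the three fold accuracies printed above (the variable name `rf_scores` is illustrative):

```python
import numpy as np

# Fold accuracies copied from the random forest output above.
rf_scores = np.array([0.95444319, 0.95087193, 0.95349709])

rf_mean = rf_scores.mean()
rf_std = rf_scores.std()
print(f"accuracy: {rf_mean:.4f} +/- {rf_std:.4f}")
```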

## Classifier 2: MLP

Obtain an MLPClassifier using random state 42. (10 points)

In [39]:
from sklearn.neural_network import MLPClassifier
# fill in code

mlp_clf = MLPClassifier(random_state=42)
mlp_clf.fit(X_train_scaled, y_train)



Display the cross validation result using 3-fold cross validation. (10 points)

In [41]:
# fill in code
mlp_score = cross_val_score(estimator=mlp_clf, X=X_train_scaled, y=y_train, cv=3, scoring="accuracy", n_jobs=-1)
mlp_score


array([0.94131984, 0.94130883, 0.93905869])

## Classifier 3: SVM (one vs all)

Obtain an SVM classifier with random state 42.

In [42]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(SVC(gamma="auto", random_state=42, probability=True))

Display cross validation result with 3-fold cross validation. (10 points).

In [43]:
cross_val_score(ovr_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

array([0.9135733 , 0.91449466, 0.91318207])
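
`OneVsRestClassifier` fits one binary classifier per class, so with 26 letters it trains 26 SVMs. The sketch below illustrates this on synthetic 4-class data; `X_ovr`, `y_ovr`, and `ovr_demo` are hypothetical names used only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 4-class problem (assumption: stands in for the 26-letter data).
X_ovr, y_ovr = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=4, random_state=42
)

ovr_demo = OneVsRestClassifier(SVC(gamma="auto", random_state=42))
ovr_demo.fit(X_ovr, y_ovr)

# One fitted binary SVM per class.
n_binary_models = len(ovr_demo.estimators_)
print(n_binary_models)
```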

# Hard-voting

Construct a hard-voting classifier using the above three classifiers. (10 points)

In [60]:
from sklearn.ensemble import VotingClassifier
# fill in code

voting_clf = VotingClassifier(estimators=[('rf', forest_clf), ('mlpc', mlp_clf), ('ovr', ovr_clf)], voting='hard')
voting_clf.fit(X_train_scaled, y_train)





Display the cross validation result of the hard-voting classifier using 3-fold cross validation. (10 points)

In [61]:
# fill in code
cross_val_score(voting_clf, X_train_scaled, y_train, cv=3, scoring="accuracy", n_jobs=-1)

array([0.95256843, 0.95012188, 0.94918432])
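
Hard voting picks, for each sample, the label predicted by the most classifiers. The following is a simplified sketch of that idea with made-up predictions; it is not scikit-learn's internal code, and `preds` and `majority_vote` are hypothetical names.

```python
import numpy as np

# Made-up label predictions: one row per sample, one column per classifier.
preds = np.array([
    [19, 19, 19],
    [8, 8, 3],
    [4, 4, 25],
    [2, 7, 7],
])

def majority_vote(row, n_classes=26):
    """Return the most frequent label in a row (ties go to the lowest label)."""
    counts = np.bincount(row, minlength=n_classes)
    return int(np.argmax(counts))

hard_votes = [majority_vote(row) for row in preds]
print(hard_votes)
```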

# Soft-voting

Construct a soft-voting classifier using the above three classifiers and train the classifier. (10 points)

In [49]:
from sklearn.ensemble import VotingClassifier
# fill in code

soft_voting_clf = VotingClassifier(estimators=[('rf', forest_clf), ('mlpc', mlp_clf), ('ovr', ovr_clf)], voting='soft')
soft_voting_clf.fit(X_train_scaled, y_train)






Display the cross validation result of the soft-voting classifier using 3-fold cross validation. (10 points)

In [50]:
# fill in code
cross_val_score(soft_voting_clf, X_train_scaled, y_train, cv=3, scoring="accuracy", n_jobs=-1)

array([0.95650544, 0.95330958, 0.95012188])
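
Soft voting averages the class probabilities from each classifier and picks the class with the highest average, which lets one confident model outweigh two hesitant ones. A minimal sketch with made-up probabilities for a single sample and three classes (all names and numbers here are illustrative assumptions):

```python
import numpy as np

# Made-up predict_proba outputs for one sample (3 classes for brevity).
proba_rf = np.array([[0.50, 0.40, 0.10]])
proba_mlp = np.array([[0.10, 0.80, 0.10]])
proba_svm = np.array([[0.45, 0.35, 0.20]])

# Soft vote: average the probabilities, then take the arg max.
avg_proba = np.mean([proba_rf, proba_mlp, proba_svm], axis=0)
soft_label = int(np.argmax(avg_proba, axis=1)[0])

# Hard vote on the same outputs: majority of each model's own arg max.
individual = [int(np.argmax(p)) for p in (proba_rf, proba_mlp, proba_svm)]
hard_label = int(np.argmax(np.bincount(individual)))

print(soft_label, hard_label)
```

In this toy case the two schemes disagree: two classifiers narrowly prefer class 0, but the MLP's strong preference for class 1 wins the soft vote.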

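Check the accuracy of the soft-voting classifier on the test set, alongside the three individual classifiers. (10 points)

In [52]:
from sklearn.metrics import accuracy_score

# Scale the test set with the scaler fitted on the training set.
X_test_scaled = scaler.transform(X_test.astype(np.float64))

# fill in code
for clf in (forest_clf, mlp_clf, ovr_clf, soft_voting_clf):
    clf.fit(X_train_scaled, y_train)
    # Predict on the scaled test set; the models were trained on scaled features.
    y_pred = clf.predict(X_test_scaled)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

The output recorded below came from a run that predicted on the unscaled X_test, which is why every score sits near chance level (1/26 ≈ 0.038). It does not reflect predictions on X_test_scaled.

RandomForestClassifier 0.03675
MLPClassifier 0.06725
OneVsRestClassifier 0.036
VotingClassifier 0.068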




# Importance of the features

Print out the importance of each feature in decreasing order. This can be done using the random forest classifier we trained earlier. (10 points)

In [None]:
# fill in code
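
One possible way to rank features is to pair the forest's `feature_importances_` attribute with the column names and sort. The sketch below is illustrative only: it assumes a small synthetic frame (`X_imp`, `y_imp`) in which the label depends on `width`, and it is not the graded solution.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic frame with four of the feature names (assumption: not the real data).
rng = np.random.default_rng(0)
X_imp = pd.DataFrame(
    rng.integers(0, 16, size=(400, 4)),
    columns=["x-box", "y-box", "width", "height"],
)
y_imp = (X_imp["width"] > 7).astype(int)  # label depends only on 'width'

demo_forest = RandomForestClassifier(n_estimators=50, random_state=42)
demo_forest.fit(X_imp, y_imp)

# Pair each importance with its column name and sort in decreasing order.
importances = pd.Series(
    demo_forest.feature_importances_, index=X_imp.columns
).sort_values(ascending=False)
print(importances)
```

In the notebook, the same pattern would apply to `forest_clf.feature_importances_` and the 16 feature column names.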





# Stacking

First, we divide the training data into two parts:
(X_train, y_train) and (X_val, y_val).

In [None]:
X_train = letters_new.iloc[0:12000,:16]
X_train.head()

Unnamed: 0,x-box,y-box,width,height,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx
0,2,8,3,5,1,8,13,0,6,6,10,8,0,8,0,8
1,5,12,3,7,2,10,5,5,4,13,3,9,2,8,4,10
2,4,11,6,8,6,10,6,2,6,10,3,7,3,7,3,9
3,7,11,6,6,3,5,9,4,6,4,4,10,6,10,2,8
4,2,1,3,1,1,8,6,6,6,6,5,9,1,7,5,10


In [None]:
y_train = letters_new.iloc[0:12000,16]
y_train.head()

0    19.0
1     8.0
2     3.0
3    13.0
4     6.0
Name: y, dtype: float64

In [None]:
X_val = letters_new.iloc[12000:16000,:16]
X_val.head()

Unnamed: 0,x-box,y-box,width,height,onpix,x-bar,y-bar,x2bar,y2bar,xybar,x2ybr,xy2br,x-ege,xegvy,y-ege,yegvx
12000,4,7,4,5,2,3,10,3,6,11,12,7,2,11,2,6
12001,5,9,5,7,4,4,8,5,7,11,9,14,2,9,3,7
12002,3,6,4,4,2,10,2,2,3,8,2,8,2,6,2,8
12003,5,8,7,6,6,10,6,3,6,10,4,7,4,7,5,10
12004,3,6,4,4,4,8,5,10,0,6,8,8,6,5,0,8


In [None]:
y_val = letters_new.iloc[12000:16000,16]
y_val.head()

12000    24.0
12001     2.0
12002     0.0
12003     1.0
12004    12.0
Name: y, dtype: float64

Scale the data set X_train.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))

Train the three classifiers using (X_train, y_train).

In [None]:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(X_train_scaled, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(rnd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

array([0.9435 , 0.94425, 0.9375 ])

Apply predictions on X_val using random forest classifier.

In [None]:
X_val_scaled = scaler.transform(X_val.astype(np.float64))
y_val_pred = rnd_clf.predict(X_val_scaled)

In [None]:
from sklearn.neural_network import MLPClassifier
mlp_clf = MLPClassifier(random_state=42)
mlp_clf.fit(X_train_scaled, y_train)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=42, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [None]:
cross_val_score(mlp_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")



array([0.92675, 0.9325 , 0.92075])

Apply predictions on X_val using MLP classifier.

In [None]:
y_val_pred_mlp = mlp_clf.predict(X_val_scaled)

In [None]:
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

ovr_clf = OneVsRestClassifier(SVC(gamma="auto", random_state=42, probability=True))
ovr_clf.fit(X_train_scaled, y_train)

OneVsRestClassifier(estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                                  class_weight=None, coef0=0.0,
                                  decision_function_shape='ovr', degree=3,
                                  gamma='auto', kernel='rbf', max_iter=-1,
                                  probability=True, random_state=42,
                                  shrinking=True, tol=0.001, verbose=False),
                    n_jobs=None)

In [None]:
cross_val_score(ovr_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")

array([0.9015 , 0.916  , 0.89075])

Apply predictions on X_val using SVM classifier.

In [None]:
y_val_pred_ovr = ovr_clf.predict(X_val_scaled)

Combine predictions from 3 predictors.

In [None]:
X_stack_pred_training=np.c_[y_val_pred, y_val_pred_mlp, y_val_pred_ovr]
print(X_stack_pred_training.shape)

(4000, 3)


In [None]:
print(X_stack_pred_training)

[[24. 24. 24.]
 [ 2.  2.  2.]
 [ 0.  0.  0.]
 ...
 [ 6.  6.  6.]
 [ 4.  4. 25.]
 [ 2.  2.  2.]]


Try random forest as blender.

In [None]:
blending_rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
blending_rnd_clf.fit(X_stack_pred_training, y_val)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)

In [None]:
from sklearn.model_selection import cross_val_score
cross_val_score(blending_rnd_clf, X_stack_pred_training, y_val, cv=3, scoring="accuracy")

array([0.95727136, 0.94523631, 0.95048762])

Try MLP as blender.

In [None]:
blending_mlp_clf = MLPClassifier(random_state=42)
blending_mlp_clf.fit(X_stack_pred_training, y_val)
cross_val_score(blending_mlp_clf, X_stack_pred_training, y_val, cv=3, scoring="accuracy")



array([0.66191904, 0.68342086, 0.58814704])

We can see that the MLP blender does not perform well.

The reason is that we are treating the predictions from the three predictors as numerical data instead of categorical data.

That approach is incorrect. Instead, use a one-hot encoder to encode these predictions.

In [None]:
from sklearn.preprocessing import OneHotEncoder

onehot_encoder = OneHotEncoder()
X_stack_pred_training_1hot = onehot_encoder.fit_transform(X_stack_pred_training)

In [None]:
X_stack_pred_training_1hot.shape

(4000, 78)
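
The 78 columns correspond to 3 predictors times 26 possible labels. The encoder above infers its categories from the validation predictions, so the width is 78 only when every predictor emits all 26 labels at least once. A hedged sketch that fixes the categories explicitly (the `categories` list and the names `stack_demo` and `onehot_demo` are assumptions made for this example):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A few made-up rows of stacked predictions (rf, mlp, svm).
stack_demo = np.array([
    [24.0, 24.0, 24.0],
    [2.0, 2.0, 2.0],
    [0.0, 0.0, 0.0],
    [4.0, 4.0, 25.0],
])

# Declare all 26 labels for each of the 3 prediction columns.
categories = [np.arange(26, dtype=float)] * 3
onehot_demo = OneHotEncoder(categories=categories)
dense = onehot_demo.fit_transform(stack_demo).toarray()

print(dense.shape)  # (4, 78)
```

Each row contains exactly three ones, one per predictor, so the blender no longer sees label 25 as "larger" than label 4.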

In [None]:
blending_rnd_clf_1hot = RandomForestClassifier(n_estimators=100, random_state=42)
blending_rnd_clf_1hot.fit(X_stack_pred_training_1hot, y_val)
cross_val_score(blending_rnd_clf_1hot, X_stack_pred_training_1hot, y_val, cv=3, scoring="accuracy")

array([0.96401799, 0.94823706, 0.96249062])

We see an improvement for the random forest blender.

In [None]:
blending_mlp_clf_1hot = MLPClassifier(random_state=42)
# fit the MLP blender
blending_mlp_clf_1hot.fit(X_stack_pred_training_1hot, y_val)
# display cross-validation result
cross_val_score(blending_mlp_clf_1hot, X_stack_pred_training_1hot, y_val, cv=3, scoring="accuracy")



array([0.96101949, 0.94748687, 0.96174044])

We can see that the MLP blender performs much better now. Both the MLP blender and the random forest blender score higher than the hard-voting classifier did in cross-validation.

However, the random forest blender still seems slightly better, so we use it as the blender for the test data.

In [None]:
X_test_scaled = scaler.transform(X_test.astype(np.float64))
y_test_pred_rnd = rnd_clf.predict(X_test_scaled)
y_test_pred_mlp = mlp_clf.predict(X_test_scaled)
y_test_pred_ovr = ovr_clf.predict(X_test_scaled)
X_stack_pred_test = np.c_[y_test_pred_rnd,y_test_pred_mlp,y_test_pred_ovr]
X_stack_pred_test_1hot = onehot_encoder.transform(X_stack_pred_test)
y_test_pred_blending = blending_rnd_clf_1hot.predict(X_stack_pred_test_1hot)
print(blending_rnd_clf_1hot.__class__.__name__, accuracy_score(y_test, y_test_pred_blending))

RandomForestClassifier 0.95525


The stacking classifier is better than each of the individual classifiers. Although it is not as good as the soft-voting classifier, it is better than the hard-voting classifier.
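
For comparison, scikit-learn (version 0.22 and later) also provides `StackingClassifier`, which trains the blender on cross-validated predictions of the base models instead of a single hold-out set. The sketch below is an alternative to the manual procedure above, not a reproduction of it; the synthetic data, the reduced model sizes, and names such as `stack_demo` are assumptions made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 3-class data (assumption: stands in for the letter data).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=42
)

stack_demo = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=42)),
        ("mlp", make_pipeline(StandardScaler(),
                              MLPClassifier(max_iter=500, random_state=42))),
        ("svm", make_pipeline(StandardScaler(),
                              SVC(gamma="auto", random_state=42))),
    ],
    final_estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    cv=3,  # base-model predictions for the blender come from 3-fold CV
)
stack_demo.fit(X_tr, y_tr)
stack_acc = stack_demo.score(X_te, y_te)
print(stack_acc)
```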