This was an assignment in my data science course at university.

The task was to recognise digits drawn with a digital pen. Each drawing has been reduced to 8 two-dimensional points.
(When the points are connected in order, the digits can be recognised quite well by eye.)

In digits.csv, the first 7494 lines are the training set; the rest are the test set.

The coordinates of the 8 points are in the first 16 columns, {(𝑥1,𝑦1),...,(𝑥8,𝑦8)} in order. The 17th column is the digit that the drawing is supposed to show.
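To make the column layout concrete, here is a minimal sketch that splits one 17-value row into its eight (x, y) points and the label. The helper name `row_to_points` is hypothetical and not part of the assignment; the sample row is the first row of the table displayed further down in this notebook.

```python
def row_to_points(row):
    """Split a 17-value row into eight (x, y) points and the digit label."""
    coords, label = row[:16], row[16]
    # Columns alternate x1, y1, x2, y2, ..., so step through them in pairs.
    points = [(coords[i], coords[i + 1]) for i in range(0, 16, 2)]
    return points, label


# First row of the displayed table (its label is 8).
first_row = [88, 92, 2, 99, 16, 66, 94, 37, 70, 0, 0, 24, 42, 65, 100, 100, 8]
points, label = row_to_points(first_row)
print(points)
print(label)
```

Connecting the points in the listed order (for example with a line plot) is what makes the digit recognisable by eye.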

In [29]:
# Used libraries

from google.colab import files
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

In [30]:
# upload the csv file to colab

uploaded = files.upload()

In [31]:
df = pd.read_csv('digits.csv', header=None)
df = df.rename(columns = {0:'x1', 1:'y1', 2:'x2', 3:'y2', 4:'x3', 5:'y3', 6:'x4', 7:'y4', 8:'x5', 9:'y5',
                          10: 'x6', 11:'y6', 12:'x7', 13:'y7', 14:'x8', 15:'y8', 16:'number'})

In [32]:
df

Unnamed: 0,x1,y1,x2,y2,x3,y3,x4,y4,x5,y5,x6,y6,x7,y7,x8,y8,number
0,88,92,2,99,16,66,94,37,70,0,0,24,42,65,100,100,8
1,80,100,18,98,60,66,100,29,42,0,0,23,42,61,56,98,8
2,0,94,9,57,20,19,7,0,20,36,70,68,100,100,18,92,8
3,95,82,71,100,27,77,77,73,100,80,93,42,56,13,0,0,9
4,68,100,6,88,47,75,87,82,85,56,100,29,75,6,0,0,9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10987,0,82,9,59,56,34,41,0,10,30,3,67,42,96,100,100,5
10988,49,100,0,70,24,56,100,65,86,85,44,77,21,38,6,0,4
10989,100,98,60,100,24,87,3,58,35,51,58,26,36,0,0,5,5
10990,59,65,91,100,84,96,72,50,51,8,0,0,45,1,100,0,1


In [33]:
df_train = df[:7494]
df_test = df[7494:]

In [34]:
df_test

Unnamed: 0,x1,y1,x2,y2,x3,y3,x4,y4,x5,y5,x6,y6,x7,y7,x8,y8,number
7494,22,100,69,97,73,61,56,25,32,0,0,30,46,39,100,40,7
7495,20,71,6,34,44,0,95,19,100,62,69,100,15,92,0,51,0
7496,48,89,8,67,48,33,43,0,0,12,6,66,41,100,100,88,5
7497,30,92,57,71,81,97,0,100,25,77,100,69,77,26,31,0,9
7498,0,91,35,100,67,72,61,35,45,0,21,13,40,27,100,29,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10987,0,82,9,59,56,34,41,0,10,30,3,67,42,96,100,100,5
10988,49,100,0,70,24,56,100,65,86,85,44,77,21,38,6,0,4
10989,100,98,60,100,24,87,3,58,35,51,58,26,36,0,0,5,5
10990,59,65,91,100,84,96,72,50,51,8,0,0,45,1,100,0,1


In [None]:
# checking that the outcome variable really contains only the values 0-9

df['number'].unique()

array([8, 9, 1, 4, 7, 0, 2, 5, 3, 6])

In [None]:
# In the training set, every possible value of the outcome variable occurs roughly equally often

df_train['number'].value_counts()

1    782
2    780
4    779
0    778
7    763
8    729
5    728
3    728
9    715
6    712
Name: number, dtype: int64

In [None]:
# In the test set, too, every possible value of the outcome variable occurs roughly equally often
# So if an algorithm makes more mistakes with one of the digits, it is not because of class imbalance

df_test['number'].value_counts()

7    379
0    365
4    365
2    364
1    361
6    344
9    340
5    327
3    327
8    326
Name: number, dtype: int64

In [None]:
# defining the predictor variables and the outcome variable for the train and test sets

train_coordinates = df_train.iloc[:, :16].values
train_numbers = df_train.iloc[:, 16].values
test_coordinates = df_test.iloc[:, :16].values
test_numbers = df_test.iloc[:, 16].values

In [None]:
# scaling

scaler = MinMaxScaler()
train_coordinates = scaler.fit_transform(train_coordinates)
test_coordinates = scaler.transform(test_coordinates)
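As a rough illustration of what the scaling step does, the sketch below reimplements min-max scaling in plain Python on toy data. The function names and the toy values are assumptions made for this example; the notebook itself uses `MinMaxScaler`. The point it shows is that the minimum and maximum are learned from the training data only and then reused for the test data, so test values can fall outside [0, 1].

```python
def fit_min_max(rows):
    """Learn the per-column minimum and maximum from the training rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform_min_max(rows, mins, maxs):
    """Scale each value to (v - min) / (max - min) using the learned bounds."""
    return [
        # A constant column has zero range; map it to 0.0 (an assumption of this sketch).
        [(v - lo) / (hi - lo) if hi != lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


toy_train = [[0, 50], [100, 100]]
toy_test = [[50, 25]]

mins, maxs = fit_min_max(toy_train)            # bounds come from the training data only
scaled_train = transform_min_max(toy_train, mins, maxs)
scaled_test = transform_min_max(toy_test, mins, maxs)
print(scaled_train)
print(scaled_test)
```

Because the pen coordinates in this dataset already lie between 0 and 100, the scaling here amounts to little more than a division by 100, provided every column actually reaches both 0 and 100 in the training set.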

In [None]:
# Building a Support Vector Classifier


param_grid = {
    'C': [0.1, 1.0, 10.0],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
}

svm = SVC(random_state=2023)

grid_search_svm = GridSearchCV(svm, param_grid, cv=5)
grid_search_svm.fit(train_coordinates, train_numbers)

# Print the best parameters found by the grid search
print("Best Parameters:", grid_search_svm.best_params_)

# Evaluate the best model on the test set
best_model_svm = grid_search_svm.best_estimator_
predictions_svm = best_model_svm.predict(test_coordinates)
accuracy_svm = accuracy_score(test_numbers, predictions_svm)
print("Accuracy on Test Set:", accuracy_svm)

Best Parameters: {'C': 10.0, 'gamma': 'scale', 'kernel': 'rbf'}
Accuracy on Test Set: 0.9942824471126358
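To get a feel for how much work the SVM grid search does, this sketch enumerates the same parameter grid with `itertools.product` and counts the model fits for `cv=5`. It only counts combinations and does not train anything. The `distinct` set reflects the fact that `gamma` has no effect with the linear kernel, so some grid points are effectively duplicates.

```python
from itertools import product

param_grid = {
    'C': [0.1, 1.0, 10.0],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto'],
}

# Every combination of the listed values is one candidate model.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

# Each candidate is fitted once per cross-validation fold.
n_folds = 5
n_fits = len(combos) * n_folds

# gamma is ignored by the linear kernel, so collapse it there.
distinct = {
    (c['C'], c['kernel'], c['gamma'] if c['kernel'] != 'linear' else None)
    for c in combos
}

print(len(combos), n_fits, len(distinct))
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the whole training set, which is the model returned as `best_estimator_`.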


In [None]:
# Building a Random Forest

rf = RandomForestClassifier(random_state=2023)
param_grid = {
    'n_estimators': [200, 300, 400, 500],
    'max_depth': [None, 5, 10],
    'max_features': ['sqrt', 'log2']
}

# Perform grid search for hyperparameter optimization
grid_search_rf = GridSearchCV(rf, param_grid, cv=5)
grid_search_rf.fit(train_coordinates, train_numbers)

# Print the best parameters found by the grid search
print("Best Parameters:", grid_search_rf.best_params_)

# Evaluate the best model on the test set
best_model_rf = grid_search_rf.best_estimator_
predictions_rf = best_model_rf.predict(test_coordinates)
accuracy_rf = accuracy_score(test_numbers, predictions_rf)
print("Accuracy on Test Set:", accuracy_rf)

Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'n_estimators': 400}
Accuracy on Test Set: 0.9908519153802172


In [None]:
# Building a Multilayer Perceptron


param_grid = {
    'alpha': [0.00001, 0.0001, 0.001, 0.01],
    'learning_rate': ['constant', 'adaptive']
}

mlp = MLPClassifier(random_state=2023, max_iter=5000)

# Perform grid search for hyperparameter optimization
grid_search_mlp = GridSearchCV(mlp, param_grid, cv=5)
grid_search_mlp.fit(train_coordinates, train_numbers)

# Print the best parameters found by the grid search
print("Best Parameters:", grid_search_mlp.best_params_)

# Evaluate the best model on the test set
best_model_mlp = grid_search_mlp.best_estimator_
predictions_mlp = best_model_mlp.predict(test_coordinates)
accuracy_mlp = accuracy_score(test_numbers, predictions_mlp)
print("Accuracy on Test Set:", accuracy_mlp)

Best Parameters: {'alpha': 1e-05, 'learning_rate': 'constant'}
Accuracy on Test Set: 0.9905660377358491
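The three accuracies are easier to compare as absolute error counts. The following sketch converts them back, taking the test set size as the sum of the test-set `value_counts()` output shown earlier. The dictionary keys are just labels chosen for this example.

```python
# Test-set class counts as printed by df_test['number'].value_counts()
test_counts = [379, 365, 365, 364, 361, 344, 340, 327, 327, 326]
n_test = sum(test_counts)

accuracies = {
    'svm': 0.9942824471126358,
    'rf': 0.9908519153802172,
    'mlp': 0.9905660377358491,
}

# errors = (1 - accuracy) * number of test rows, rounded to the nearest integer
errors = {name: round((1 - acc) * n_test) for name, acc in accuracies.items()}
print(n_test, errors)
```

The differences between the models therefore come down to roughly a dozen test rows, which is worth keeping in mind when ranking them.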


In [None]:
y_pred_svm = grid_search_svm.predict(test_coordinates)

In [None]:
y_pred_rf = grid_search_rf.predict(test_coordinates)

In [None]:
y_pred_mlp = grid_search_mlp.predict(test_coordinates)

In [None]:
# Getting the points, the true number and the predicted values to one dataframe

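# work on an explicit copy of the test set, so the new columns are not written
# to a slice of df and the chained-assignment warning does not need to be silenced
df_test_2 = df_test.copy()
df_test_2['y_pred_svm'] = y_pred_svm
df_test_2['y_pred_rf'] = y_pred_rf
df_test_2['y_pred_mlp'] = y_pred_mlp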

df_test_2

Unnamed: 0,x1,y1,x2,y2,x3,y3,x4,y4,x5,y5,x6,y6,x7,y7,x8,y8,number,y_pred_svm,y_pred_rf,y_pred_mlp
7494,22,100,69,97,73,61,56,25,32,0,0,30,46,39,100,40,7,7,7,7
7495,20,71,6,34,44,0,95,19,100,62,69,100,15,92,0,51,0,0,0,0
7496,48,89,8,67,48,33,43,0,0,12,6,66,41,100,100,88,5,5,5,5
7497,30,92,57,71,81,97,0,100,25,77,100,69,77,26,31,0,9,9,9,9
7498,0,91,35,100,67,72,61,35,45,0,21,13,40,27,100,29,7,7,7,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10987,0,82,9,59,56,34,41,0,10,30,3,67,42,96,100,100,5,5,5,5
10988,49,100,0,70,24,56,100,65,86,85,44,77,21,38,6,0,4,4,4,4
10989,100,98,60,100,24,87,3,58,35,51,58,26,36,0,0,5,5,5,5,5
10990,59,65,91,100,84,96,72,50,51,8,0,0,45,1,100,0,1,1,1,1


Let's have a look at which rows were falsely classified by each method, which numbers were falsely predicted, and what were the most frequent wrong predictions. I will sum up the results of the three methods at the end.

In [None]:
# Falsely predicted numbers by the SVC

df_svm_wrong = df_test_2.loc[~(df_test_2['number'] == df_test_2['y_pred_svm'])]
df_svm_wrong

Unnamed: 0,x1,y1,x2,y2,x3,y3,x4,y4,x5,y5,x6,y6,x7,y7,x8,y8,number,y_pred_svm,y_pred_rf,y_pred_mlp
7853,34,91,0,98,42,100,83,94,100,71,86,46,62,23,46,0,7,9,7,9
7893,68,82,100,100,99,91,68,75,95,48,80,21,41,7,0,0,5,3,3,3
7944,95,31,44,2,0,26,11,69,60,100,100,73,84,30,40,0,0,8,9,0
7985,0,87,37,96,98,100,100,87,80,65,63,44,52,22,62,0,7,1,7,7
8014,39,100,16,74,40,57,14,80,32,35,0,7,18,0,100,8,1,2,1,2
8580,57,100,0,83,48,72,100,93,89,74,67,47,48,21,56,0,9,4,9,9
8639,55,100,16,79,0,39,5,0,61,1,100,31,67,61,5,67,0,6,0,6
8642,60,91,0,85,19,37,82,27,31,0,3,30,40,74,100,100,5,8,8,8
8705,16,100,0,57,55,55,44,0,12,17,6,77,45,98,100,100,6,5,5,5
8949,66,98,55,70,98,68,100,100,57,92,33,62,16,31,0,0,0,4,1,1
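The same kind of tally that `value_counts()` produces can be sketched with `collections.Counter`. The pairs below are the (true digit, SVM prediction) values of the ten rows displayed above only, not the full list of SVM errors, so the counts are illustrative rather than the complete result.

```python
from collections import Counter

# (true digit, SVM prediction) for the ten displayed misclassified rows
pairs = [(7, 9), (5, 3), (0, 8), (7, 1), (1, 2),
         (9, 4), (0, 6), (5, 8), (6, 5), (0, 4)]

true_counts = Counter(true for true, _ in pairs)   # which digits were missed
pred_counts = Counter(pred for _, pred in pairs)   # what they were mistaken for
pair_counts = Counter(pairs)                       # full (true, predicted) confusions

print(true_counts)
print(pred_counts)
```

On the full prediction table, a cross-tabulation of the true and predicted columns (for example with `pd.crosstab`) would give the complete confusion matrix in one step.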


In [None]:
# Which true digits the SVM misclassified

# 1s and 0s seem to have been the hardest for the SVM to predict

df_svm_wrong['number'].value_counts()

1    5
0    4
7    3
5    3
9    1
6    1
8    1
3    1
2    1
Name: number, dtype: int64

In [None]:
df_svm_wrong['y_pred_svm'].value_counts()

9    3
3    3
1    3
4    3
8    2
2    2
7    2
6    1
5    1
Name: y_pred_svm, dtype: int64

In [None]:
df_rf_wrong = df_test_2.loc[~(df_test_2['number'] == df_test_2['y_pred_rf'])]
df_rf_wrong

Unnamed: 0,x1,y1,x2,y2,x3,y3,x4,y4,x5,y5,x6,y6,x7,y7,x8,y8,number,y_pred_svm,y_pred_rf,y_pred_mlp
7576,52,100,18,76,0,46,9,17,61,19,100,38,54,22,18,0,6,6,4,6
7810,21,52,0,48,39,50,100,51,68,0,26,29,8,83,85,100,5,5,8,9
7893,68,82,100,100,99,91,68,75,95,48,80,21,41,7,0,0,5,3,3,3
7944,95,31,44,2,0,26,11,69,60,100,100,73,84,30,40,0,0,8,9,0
8000,61,100,100,84,84,62,37,53,71,39,87,15,47,0,0,4,3,3,9,3
8008,95,100,16,90,0,62,84,68,100,90,80,59,56,29,58,0,9,9,4,9
8250,0,86,49,96,65,100,53,56,53,11,100,2,53,0,8,9,1,1,3,1
8445,100,100,0,87,0,73,17,58,33,44,33,29,50,15,67,0,1,1,4,1
8642,60,91,0,85,19,37,82,27,31,0,3,30,40,74,100,100,5,8,8,8
8705,16,100,0,57,55,55,44,0,12,17,6,77,45,98,100,100,6,5,5,5


In [None]:
df_rf_wrong['number'].value_counts()

1    8
5    6
0    6
6    2
3    2
9    2
8    2
2    2
7    2
Name: number, dtype: int64

In [None]:
df_rf_wrong['y_pred_rf'].value_counts()

4    7
8    6
9    5
3    4
2    4
1    3
7    2
5    1
Name: y_pred_rf, dtype: int64

In [None]:
df_mlp_wrong = df_test_2.loc[~(df_test_2['number'] == df_test_2['y_pred_mlp'])]
df_mlp_wrong

Unnamed: 0,x1,y1,x2,y2,x3,y3,x4,y4,x5,y5,x6,y6,x7,y7,x8,y8,number,y_pred_svm,y_pred_rf,y_pred_mlp
7651,59,78,60,100,7,94,0,65,53,57,100,42,80,14,32,0,9,9,9,5
7652,12,77,36,90,44,100,37,63,46,27,100,21,47,8,0,0,1,1,1,5
7810,21,52,0,48,39,50,100,51,68,0,26,29,8,83,85,100,5,5,8,9
7853,34,91,0,98,42,100,83,94,100,71,86,46,62,23,46,0,7,9,7,9
7893,68,82,100,100,99,91,68,75,95,48,80,21,41,7,0,0,5,3,3,3
7948,1,55,7,100,30,92,20,45,0,4,23,0,61,10,100,12,2,2,2,1
8014,39,100,16,74,40,57,14,80,32,35,0,7,18,0,100,8,1,2,1,2
8066,100,100,0,86,0,72,0,57,0,43,0,28,50,14,50,0,1,1,1,6
8229,0,0,35,8,69,31,93,62,100,100,69,100,57,64,64,26,9,9,9,1
8353,67,100,67,85,100,71,67,57,33,42,0,28,0,14,0,0,1,1,1,2


In [None]:
df_mlp_wrong['number'].value_counts()

1    10
5     6
7     4
0     4
9     3
2     3
6     2
8     1
Name: number, dtype: int64

In [None]:
df_mlp_wrong['y_pred_mlp'].value_counts()

2    7
1    5
8    5
9    4
5    3
6    3
4    2
7    2
3    1
0    1
Name: y_pred_mlp, dtype: int64

All three models misclassify 1s most often. 0s, 5s and 7s are also frequently misclassified: these four digits are among the four most frequent error sources for every model. For the Random Forest, however, 7 only ties for fourth place with several other digits, so there it is really 0, 1 and 5 that stand out.

The 5s that the Support Vector Machine misclassified were also misclassified by the Random Forest and the MLP. Of these three 5s, one was classified as a 3, one as an 8 and the third as a 9 by all three models. The other 5s that the Random Forest and the MLP got wrong were also classified as 8s or 9s.
The digits 1, 2 and 7 are typically confused with one another, and which of them is mistaken for which varies between the models. This is not surprising: a 7 can look like a 1 when rotated, and vice versa, and a 2 is not very different from a 1 or a 7 either.
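One way to check how far the models' errors overlap is to look, row by row, at which models got it wrong. The sketch below does this for the ten rows displayed in the SVM error table (true digit followed by the SVM, Random Forest and MLP predictions). It covers only those displayed rows, so it illustrates the method rather than the full overlap.

```python
# index: (true digit, SVM prediction, RF prediction, MLP prediction)
rows = {
    7853: (7, 9, 7, 9),
    7893: (5, 3, 3, 3),
    7944: (0, 8, 9, 0),
    7985: (7, 1, 7, 7),
    8014: (1, 2, 1, 2),
    8580: (9, 4, 9, 9),
    8639: (0, 6, 0, 6),
    8642: (5, 8, 8, 8),
    8705: (6, 5, 5, 5),
    8949: (0, 4, 1, 1),
}
models = ('svm', 'rf', 'mlp')

# For every row, list the models whose prediction differs from the true digit.
wrong_by = {
    idx: [m for m, pred in zip(models, preds) if pred != true]
    for idx, (true, *preds) in rows.items()
}

all_three = sorted(idx for idx, ms in wrong_by.items() if len(ms) == 3)
svm_only = sorted(idx for idx, ms in wrong_by.items() if ms == ['svm'])
print(all_three)
print(svm_only)
```

Rows that every model gets wrong are good candidates for a closer look at the drawing itself, since the problem may lie in the data rather than in any one model.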


No model ever misclassified a 4, and there were few errors for 3s (1 for the SVM, 2 for the Random Forest, 0 for the MLP). Conversely, there is a tendency to classify digits that are not 4s as 4s. This is clearest for the Random Forest, where 4 is the most frequent wrong prediction (7 of its 32 errors). For the SVM, 4 is tied with 9, 3 and 1 as the most frequent wrong prediction (3 of its 20 errors each), while the MLP wrongly predicted a 4 only twice. Which digits are mistakenly "believed" to be 4s varies: most often 9s (the two shapes are indeed somewhat similar), but 6s, 0s and 1s were also misclassified as 4s.
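To put the "predicted as 4" counts in proportion, this sketch divides them by each model's total number of errors. All numbers are read off the `value_counts()` outputs above: 4 appears 3 times for the SVM, 7 times for the Random Forest and 2 times for the MLP, and the totals are the sums of those outputs.

```python
# How often each model wrongly predicted a 4 (from the y_pred_* value counts)
predicted_as_4 = {'svm': 3, 'rf': 7, 'mlp': 2}

# Total misclassified test rows per model (sums of the value counts)
total_errors = {'svm': 20, 'rf': 32, 'mlp': 33}

# Share of each model's errors in which the wrong answer was a 4
share = {m: predicted_as_4[m] / total_errors[m] for m in predicted_as_4}
most_prone = max(share, key=share.get)

for model, value in share.items():
    print(f"{model}: {value:.1%}")
print(most_prone)
```

By this measure the pull towards 4 is mainly a Random Forest effect, with the SVM showing it more weakly and the MLP hardly at all.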