In [1]:
import pandas as pd
df = pd.read_csv('./train.csv')
print("\n\nMETADATA")
print("--------------------")

print(f"\n{len(df.index)} samples")
print(f"{len(df.columns)} columns")

print(f"\nnull count: \n{df.isnull().sum()}")

class_values = df.label.unique()
class_values.sort()
print(f"\nclasses:\n{class_values}")
print(df['label'].value_counts())


# Mean number of samples per class (the original loop shadowed the built-in sum()).
avg = df['label'].value_counts().mean()
print(f"\nAverage class distribution: {avg}")

class_distribution = df.groupby("label").size()
print(f"\nclass distribution:\n{class_distribution}")





METADATA
--------------------

50000 samples
2 columns

null count: 
im_name    0
label      0
dtype: int64

classes:
[0 1 2 3 4 5 6 7 8 9]
label
2    5038
8    5020
1    5012
0    5010
3    5007
7    5000
4    4995
5    4993
9    4970
6    4955
Name: count, dtype: int64

Average class distribution: 5000.0

class distribution:
label
0    5010
1    5012
2    5038
3    5007
4    4995
5    4993
6    4955
7    5000
8    5020
9    4970
dtype: int64


Comments on dataframe:

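- The dataset is large: 50,000 labelled samples, so training and validation time will matter.
- The dataset is clean: neither column contains null values, so there is no need to interpolate missing values or remove columns/rows.
- The class distribution is very even: every class has between 4,955 and 5,038 samples, within 1% of the mean of 5,000.
- There are 10 labels (0 to 9), so this is a multiclass problem.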
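
To put a number on "very even", the per-class counts printed above can be reduced to two simple balance measures. This is a minimal sketch: the counts are copied by hand from the value_counts() output, and the variable names (counts, max_rel_dev, imbalance_ratio) are illustrative, not part of the original notebook.

```python
import pandas as pd

# Per-class counts copied from the value_counts() output above.
counts = pd.Series(
    [5010, 5012, 5038, 5007, 4995, 4993, 4955, 5000, 5020, 4970],
    index=range(10),
    name="count",
)

mean_count = counts.mean()
# Largest relative deviation of any class from the mean class size.
max_rel_dev = (counts - mean_count).abs().max() / mean_count
# Ratio between the largest and the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(f"mean class size: {mean_count:.1f}")
print(f"max relative deviation: {max_rel_dev:.2%}")
print(f"largest/smallest class ratio: {imbalance_ratio:.3f}")
```

In the notebook itself the same measures could be computed directly from df['label'].value_counts() instead of a hand-typed list.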


In [31]:
import matplotlib.pyplot as plt
import matplotlib as mpl
from PIL import Image
import numpy as np
import math 

mpl.rcParams["figure.dpi"] = 150

first_sample = df.iloc[[0]]
print(f"FIRST SAMPLE:\n{first_sample}")
sample_image_name = first_sample.im_name.values[0]
image_path = f"./train_ims/{sample_image_name}"
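# Read the size from the image itself instead of assuming it is square.
with Image.open(image_path) as im:
    width, height = im.size  # PIL reports size as (width, height)
    pixels = np.array(list(im.getdata()))  # one [R, G, B] row per pixel
# Note: len(pixels) counts pixels; each pixel carries 3 channel values.
print(f"\n\ndata dimension:\nheight: {height}\nwidth: {width}\nfeatures: {len(pixels)}\n\n")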
print(pixels)

FIRST SAMPLE:
       im_name  label
0  00016cd.jpg      6


data dimension:
height: 32
width: 32
features: 1024


[[237 242 246]
 [238 246 249]
 [228 239 241]
 ...
 [112 150 153]
 [104 144 135]
 [ 96 139 122]]


Comments on data:
- The number of features is very large: 32 x 32 = 1,024 pixels, each with 3 colour channels (3,072 values per image).
- Each pixel is an RGB triple rather than a single value. How do we convert RGB into a single value per pixel?
- Splitting the image into 3 separate colour groups (R, G, B) is not an option, because the relationships between the colours would be lost.
- Images of the same class vary a lot (high variance within image classes).
- For instance: different angles, different colours, different distances from the camera.
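
On the question of turning RGB into a single value, two common options are sketched below under stated assumptions: a weighted grayscale conversion (the ITU-R BT.601 luma weights, which are also the weights Pillow documents for convert("L")), or keeping all three channels and flattening them into one long vector. The image here is a random synthetic stand-in, not one of the training images, and neither option is claimed to be the right choice for this dataset.

```python
import numpy as np

# Synthetic stand-in for one 32x32 RGB image (values 0-255); not real data.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Option 1: collapse RGB to one luminance value per pixel (BT.601 weights).
luma_weights = np.array([0.299, 0.587, 0.114])
gray = image.astype(np.float64) @ luma_weights  # shape (32, 32)
gray_features = gray.reshape(-1)                # 1024 features

# Option 2: keep every channel and flatten into a single vector.
rgb_features = image.reshape(-1).astype(np.float64)  # 3072 features

print(gray_features.shape, rgb_features.shape)
```

Option 1 cuts the feature count to a third but discards colour, while option 2 keeps the colour relationships within a single feature vector at the cost of 3,072 features.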



Initial Ideas:

Large dataset:
-> use a smaller k in k-fold cross-validation to keep training time manageable (fewer folds means fewer models to fit).
-> perhaps k = 5.
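
A minimal sketch of 5-fold cross-validation splitting, assuming scikit-learn is available. StratifiedKFold is used here as one reasonable choice because it preserves the class ratio in every fold; the data is a small synthetic stand-in (1,000 samples, 10 balanced classes), not the real image features.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in: 1,000 samples, 10 balanced classes, 8 dummy features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = np.repeat(np.arange(10), 100)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
per_class_ok = []
for train_idx, val_idx in skf.split(X, y):
    # Each validation fold holds 1/5 of the data.
    fold_sizes.append((len(train_idx), len(val_idx)))
    # Stratification keeps 20 samples of every class in each validation fold.
    per_class_ok.append(bool(np.all(np.bincount(y[val_idx], minlength=10) == 20)))

print(fold_sizes)
print(per_class_ok)
```

With k = 5 on 50,000 samples, each model would train on 40,000 samples and validate on 10,000; k = 10 would train on 45,000 each time but require twice as many fits.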

Even class distribution:
-> no need for class-balancing techniques such as penalty (class) weighting.

Multiclass:
-> will need an ensemble of SVMs: one binary SVM per class (one-vs-rest).
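
The one-SVM-per-class idea can be sketched with scikit-learn's OneVsRestClassifier wrapper, assuming scikit-learn is available. The data is a synthetic stand-in generated with make_blobs (10 clusters, 5 features), and the linear kernel is an arbitrary illustrative choice. Note that scikit-learn's SVC also accepts multiclass labels directly, in which case it uses a one-vs-one scheme internally.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in: 10 clusters, one per class; not the real image data.
X, y = make_blobs(n_samples=500, centers=10, n_features=5, random_state=0)

# One binary SVM is fitted per class; prediction picks the class whose
# SVM gives the highest decision score.
ovr = OneVsRestClassifier(SVC(kernel="linear"))
ovr.fit(X, y)

n_binary_svms = len(ovr.estimators_)
predictions = ovr.predict(X[:5])
print(n_binary_svms, predictions.shape)
```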

Features:
-> should NOT use feature selection. The features (pixels) are heavily related to one another, and the variance between images is too great.
-> feature reduction (dimensionality reduction) may be an idea, but the images are already very low resolution (32 x 32). Compressing them further may lose detail.
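
One way to check how much detail feature reduction would lose is to look at the variance retained by a projection such as PCA. This is a hedged sketch assuming scikit-learn is available: PCA is only one possible reduction technique, the 50-component setting is arbitrary, and the data is random noise standing in for flattened 32 x 32 x 3 images, so the retained-variance figure it prints says nothing about the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for 200 flattened 32x32x3 images; not real data.
rng = np.random.default_rng(0)
X = rng.random((200, 3072))

# Keep 50 components (an arbitrary illustrative choice).
pca = PCA(n_components=50, random_state=0)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance that the kept components retain.
retained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, f"{retained:.2%}")
```

On real images, neighbouring pixels are strongly correlated, so the retained variance for a given number of components would be expected to be much higher than on random noise; running this on the actual training features would be the way to find out.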

