<a href="https://www.kaggle.com/code/fabinahian/eda-of-fashion-mnist-using-cnn?scriptVersionId=155537549" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# ⚡️ OBJECTIVES 

- EDA of Fashion MNIST
- Identification & Classification of Sustainable Apparel Products

# ⚡️ DATASET: [Fashion MNIST](https://www.kaggle.com/datasets/zalando-research/fashionmnist/data) (👕👖🧥👗🥼👡👔👟👜👢) 

**Content**

Each image is 28 pixels high and 28 pixels wide, for a total of 784 pixels. Each pixel has a single pixel value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel value is an integer between 0 and 255. The training and test data sets each have 785 columns. The first column holds the class label (see the Data Dictionary below) and identifies the article of clothing. The remaining columns contain the pixel values of the associated image.

- Each row is a separate image
- Column 1 is the class label.
- Remaining columns are pixel numbers (784 total).
- Each value is the darkness of the pixel (0 to 255)
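
The row layout described above can be illustrated with a small NumPy sketch. The row here is synthetic (random values rather than real dataset content), and the variable names are illustrative only.

```python
import numpy as np

# Synthetic stand-in for one CSV row: 1 label followed by 784 pixel values.
rng = np.random.default_rng(0)
row = np.concatenate(([3], rng.integers(0, 256, size=784)))

label = int(row[0])              # column 1 is the class label
image = row[1:].reshape(28, 28)  # remaining 784 columns form a 28x28 image

print(label, image.shape, image.min() >= 0, image.max() <= 255)
```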

**Data Dictionary**

Each training and test example is assigned to one of the following labels:

- 0 : T-shirt/top
- 1 : Trouser
- 2 : Pullover
- 3 : Dress
- 4 : Coat
- 5 : Sandal
- 6 : Shirt
- 7 : Sneaker
- 8 : Bag
- 9 : Ankle boot


# ⚡️ IMPORTS

In [None]:
import math
import pickle

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow as tf
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Use the public tf.keras API throughout (not tensorflow.python or standalone keras)
from tensorflow.keras.layers import Conv2D, Dense, Flatten
from tensorflow.keras.utils import plot_model

# ⚡️ VARIABLES

In [None]:
test_size = 0.2 
random_state = 23
epochs = 100
batch_size = 32

# ⚡️ DATA ANALYSIS

**Reading the Data**

In [None]:
#dataframe for the train dataset
df_train = pd.read_csv('/kaggle/input/fashionmnist/fashion-mnist_train.csv')

#dataframe for the test dataset
df_test  = pd.read_csv('/kaggle/input/fashionmnist/fashion-mnist_test.csv')

**Sneak Peek of the Data**

In [None]:
# Observing the data head of the train data

df_train.head()

In [None]:
# Observing the data head of the test data

df_test.head()

**Observation**: Both the train data & test data have 785 columns: the 1st column is the label & the remaining 784 are the input features (pixel values).

**Shape of the Data**

In [None]:
print("Train Data\n")
print("Rows: ", df_train.shape[0])
print("Columns: ", df_train.shape[1])
print("\n")

print("Test Data\n")
print("Rows: ", df_test.shape[0])
print("Columns: ", df_test.shape[1])

**Observation**: The number of columns is the same for both train data & test data; however, the number of rows differs: there are 60,000 training samples & 10,000 testing samples.

**Data Info**

In [None]:
df_train.info()

In [None]:
df_test.info()

**Observation**: Both training samples & testing samples consist of integer values. So, the labels are also numerical values.

**Finding Unique Labels**

In [None]:
#the unique labels that exist in all 60000 training samples

unique_labels_train = df_train.label.unique()
unique_labels_train.sort() #we can sort this because according to data info, the labels are integer values.
print(unique_labels_train)

**Observation**: All labels in the training data fall within the expected range of 0 to 9. That's a good thing! ✅

In [None]:
#the unique labels that exist in all 10000 testing samples

unique_labels_test = df_test.label.unique()
unique_labels_test.sort() #we can sort this because according to data info, the labels are integer values.
print(unique_labels_test)

**Observation**: All labels in the testing data fall within the expected range of 0 to 9. That's a good thing! ✅
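
The range check above was done by eye on the printed arrays. A minimal sketch of doing the same check programmatically follows; the label arrays are made-up stand-ins for `df_train.label` and `df_test.label`, and the variable names are illustrative.

```python
import numpy as np

# Hypothetical label columns; in the notebook these would be
# df_train.label and df_test.label.
train_labels = np.array([0, 9, 3, 3, 5, 1, 2, 4, 6, 7, 8])
test_labels = np.array([2, 2, 0, 9, 1])

valid = np.arange(10)  # the ten classes in the data dictionary

# np.isin flags each label that belongs to the valid set.
train_ok = bool(np.isin(train_labels, valid).all())
test_ok = bool(np.isin(test_labels, valid).all())

# Classes that never occur in a split (empty means every class is present).
missing_in_test = np.setdiff1d(valid, test_labels)

print(train_ok, test_ok, missing_in_test)
```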

**NaN Values**

In [None]:
#checking for missing values
df_train.isnull().any().sum()

In [None]:
#checking for missing values
df_test.isnull().any().sum()

**Observation**: There is no missing value or null value in the training samples or the testing samples. That's a good thing! ✅
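
Note that `isnull().any().sum()` counts the *columns* that contain at least one missing value, not the number of missing cells. The difference is shown below on a small synthetic frame (not the Fashion MNIST data); either expression returning 0 means the data is complete.

```python
import numpy as np
import pandas as pd

# Small synthetic frame with two missing cells in one column.
df = pd.DataFrame({
    "label": [0, 1, 2],
    "pixel1": [0.0, np.nan, np.nan],
    "pixel2": [5.0, 7.0, 9.0],
})

cols_with_nan = df.isnull().any().sum()   # columns containing at least one NaN
cells_with_nan = df.isnull().sum().sum()  # total number of NaN cells

print(cols_with_nan, cells_with_nan)
```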

**Data Count per Label/Category**

In [None]:
#checking how many samples we have for each category
df_train['label'].value_counts()

In [None]:
df_train['label'].value_counts().plot(kind = 'bar', figsize = (4,2), color = 'purple')

**Observation**: Each label has 6,000 training samples. So, there's no imbalance in the training data. That's a good thing! ✅

In [None]:
#checking how many samples we have for each category
df_test['label'].value_counts()

In [None]:
df_test['label'].value_counts().plot(kind = 'bar', figsize = (4,2), color = 'purple')

**Observation**: Each label has 1,000 testing samples. So, there's no imbalance in the testing data. That's a good thing! ✅
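
The balance check can also be expressed as a single boolean rather than read off a bar chart. A minimal sketch on synthetic labels (the `labels` series and `is_balanced` name are illustrative, not part of the notebook):

```python
import pandas as pd

# Synthetic labels: three classes with four samples each.
labels = pd.Series([0, 1, 2] * 4, name="label")

counts = labels.value_counts()
is_balanced = counts.nunique() == 1  # True when every class has the same count

print(counts.to_dict(), is_balanced)
```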

# ⚡️ DATA VISUALIZATION

In [None]:
# Mapping classes to their respective labels

apparel_items = {0 : 'T-shirt/top',
            1 : 'Trouser',
            2 : 'Pullover',
            3 : 'Dress',
            4 : 'Coat',
            5 : 'Sandal',
            6 : 'Shirt',
            7 : 'Sneaker',
            8 : 'Bag',
            9 : 'Ankle boot'}

**Visualizing Training Data**

In [None]:
# Separate the pixel columns once instead of on every loop iteration
pixels = df_train.drop('label', axis=1).values

fig, axes = plt.subplots(3, 4, figsize = (5,5))
for row in axes:
    for axe in row:
        index = np.random.randint(len(df_train)) # pick a random training sample
        img = pixels[index].reshape(28,28)
        train_item = df_train['label'][index]
        axe.imshow(img)
        axe.set_title(apparel_items[train_item])
        axe.set_axis_off()

# ⚡️ MODEL DEVELOPMENT

**Data Pre-Processing**

In [None]:
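def data_preprocessing(raw):
    # One-hot encode the 10 class labels
    label = tf.keras.utils.to_categorical(raw.label, 10)
    num_images = raw.shape[0]
    # Column 0 is the label; the remaining 784 columns are pixel values
    x_as_array = raw.values[:,1:]
    # Reshape each row to height x width x channels (single grayscale channel)
    x_shaped_array = x_as_array.reshape(num_images, 28, 28, 1)
    image = x_shaped_array / 255 # scale pixel values from [0, 255] to [0, 1]
    return image, label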

X, y = data_preprocessing(df_train)
X_test, y_test = data_preprocessing(df_test)
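
To make the two transformations in `data_preprocessing` concrete, here is a NumPy-only sketch on toy values. The identity-matrix indexing is an equivalent way to build one-hot vectors for illustration; it is not a claim about how `to_categorical` works internally.

```python
import numpy as np

labels = np.array([0, 2, 9])            # three example class labels
pixels = np.array([[0, 51, 255, 102]])  # one tiny "image" with 4 pixels

# One-hot encoding: row i of the identity matrix is the vector for class i.
one_hot = np.eye(10, dtype="float32")[labels]

# Scaling for 8-bit pixels: divide by the maximum value, 255.
scaled = pixels / 255

print(one_hot.shape, scaled)
```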

**Data Splitting**

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=test_size, random_state=random_state)
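
With `test_size = 0.2`, the 60,000 training rows are divided into 48,000 rows for training and 12,000 for validation. The sketch below shows the idea of a shuffled hold-out split using NumPy only; `holdout_split` is a hypothetical helper for illustration and does not reproduce the exact shuffling of `train_test_split`.

```python
import numpy as np

def holdout_split(n_samples, test_size, seed):
    """Return shuffled (train_idx, val_idx) index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)          # shuffle all row indices
    n_val = int(round(n_samples * test_size)) # size of the held-out part
    return idx[n_val:], idx[:n_val]

train_idx, val_idx = holdout_split(60000, test_size=0.2, seed=23)
print(len(train_idx), len(val_idx))
```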

**CNN Model**

In [None]:
model = tf.keras.Sequential()

# First block: 3x3 convolution (no activation is specified, so its output is linear) followed by 2x2 max pooling
model.add(Conv2D(32, (3,3), padding='same', input_shape=(28,28, 1)))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))

# Second block: 3x3 convolution with ReLU activation followed by 2x2 max pooling
model.add(Conv2D(64, (3,3), padding='same', activation=tf.nn.relu))
model.add(tf.keras.layers.MaxPooling2D(pool_size=(2, 2)))

# Fully connected layer with ReLU activation function 
model.add(Flatten())
model.add(Dense(128, activation=tf.nn.relu))

# Output layer with softmax activation function
model.add(Dense(10, activation=tf.nn.softmax))

**Model Summary**

In [None]:
model.summary()
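
The parameter counts reported by the summary can be cross-checked by hand. The sketch below recomputes them from the layer definitions above, assuming `'same'` padding keeps the spatial size and each 2x2 max pool halves it; the helper names are illustrative.

```python
def conv2d_params(kernel, in_ch, out_ch):
    # One kernel x kernel x in_ch filter per output channel, plus one bias each.
    return kernel * kernel * in_ch * out_ch + out_ch

def dense_params(in_units, out_units):
    # One weight per input-output pair, plus one bias per output unit.
    return in_units * out_units + out_units

# 'same' padding keeps 28x28; each 2x2 max pool halves height and width.
side = 28
conv1 = conv2d_params(3, 1, 32)
side //= 2                      # 14x14x32 after the first pool
conv2 = conv2d_params(3, 32, 64)
side //= 2                      # 7x7x64 after the second pool
flat = side * side * 64         # length of the flattened vector
dense1 = dense_params(flat, 128)
dense2 = dense_params(128, 10)

total = conv1 + conv2 + dense1 + dense2
print(conv1, conv2, flat, dense1, dense2, total)
```

Almost all of the weights sit in the first dense layer, which is typical for a small CNN that flattens directly into a fully connected layer.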

**Model Plot**

In [None]:
plot_model(model, to_file='model.png')

**Compiling the Model**

In [None]:
# Optimizer specified here is adam, loss is categorical cross-entropy and metric is accuracy
model.compile(optimizer='adam',
              loss=tf.keras.losses.categorical_crossentropy,
              metrics=['accuracy'])
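
For reference, categorical cross-entropy for one-hot targets is the mean over the batch of `-sum(y_true * log(y_prob))`. A minimal NumPy sketch follows; the clipping constant `eps` is an assumption added for numerical safety and is not meant to mirror the Keras implementation exactly.

```python
import numpy as np

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -sum(y_true * log(y_prob)) over the batch."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_prob), axis=1)))

y_true = np.array([[0, 1, 0], [1, 0, 0]], dtype=float)
y_prob = np.array([[0.1, 0.8, 0.1], [0.5, 0.3, 0.2]])

loss = categorical_crossentropy(y_true, y_prob)
print(round(loss, 4))
```

Only the probability assigned to the true class contributes to each sample's loss, so here the result is the average of `-log(0.8)` and `-log(0.5)`.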

**Fitting the Model**

In [None]:
history = model.fit(X_train, y_train,
                  batch_size=batch_size,
                  epochs=epochs,
                  verbose=1,
                  validation_data=(X_val, y_val))

**Loss & Accuracy**

In [None]:
# Evaluate on the held-out test set, using the same batch size as in training
score = model.evaluate(X_test, y_test, batch_size=batch_size)
# checking the test loss and test accuracy
print('Test loss:', score[0])
print('Test accuracy:', score[1])
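
As a quick sanity check on the progress bars printed during training and evaluation, the number of batches per pass is the sample count divided by the batch size, rounded up. The sketch assumes the values from the Variables section (`batch_size = 32`, `test_size = 0.2`); `num_batches` is an illustrative helper.

```python
import math

def num_batches(n_samples, batch_size):
    # The last batch may be smaller, so round up.
    return math.ceil(n_samples / batch_size)

train_batches = num_batches(48000, 32)  # training split: 80% of 60,000 rows
test_batches = num_batches(10000, 32)   # test set

print(train_batches, test_batches)
```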

# ⚡️ SAVING THE MODEL

In [None]:
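# Save with Keras' own save method; whether a model object can be pickled depends on the installed version
# (older TensorFlow releases may need the '.h5' extension instead of '.keras')
model.save('fashion_mnist.keras')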

# ⚡️ TRAINING & VALIDATION CURVE

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']

loss = history.history['loss']
val_loss = history.history['val_loss']


# PLOTTING THE DATA [ACCURACY]

plt.figure(figsize=(40,5))
plt.subplot(1,2,1)

plt.plot(range(epochs),acc, "--",label='Training Accuracy',linewidth=2.0)
plt.plot(range(epochs),val_acc, "-.",label='Validation Accuracy',linewidth=2.0)

plt.legend(loc='lower right')

plt.title('Training and Validation Accuracy')

plt.savefig('Training and Validation Accuracy.svg',bbox_inches='tight')


# PLOTTING THE DATA [LOSS]

plt.figure(figsize=(40,5))
plt.subplot(1,2,1)

plt.plot(range(epochs),loss,"--",label='Training Loss',linewidth=2.0)
plt.plot(range(epochs),val_loss,"-.",label='Validation Loss',linewidth=2.0)

plt.legend(loc='upper right')

plt.title('Training and Validation Loss')

plt.savefig('Training and Validation Loss.svg',bbox_inches='tight')

# ⚡️ PRECISION, RECALL, F1 SCORE

In [None]:
class_names = ['T-shirt/top',
            'Trouser',
            'Pullover',
            'Dress',
            'Coat',
            'Sandal',
            'Shirt',
            'Sneaker',
            'Bag',
            'Ankle boot']

In [None]:
y_pred = model.predict(X_test)
y_pred_classes = np.argmax(y_pred, axis = 1)
y_true = np.argmax(y_test, axis = 1)
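
The `argmax` step converts each row of class probabilities (or each one-hot vector) back to an integer class label. A small sketch with made-up values for two samples and three classes:

```python
import numpy as np

# Hypothetical softmax outputs for two samples over three classes.
y_prob = np.array([[0.1, 0.7, 0.2],
                   [0.6, 0.3, 0.1]])
y_onehot = np.array([[0, 1, 0],
                     [0, 0, 1]])

pred_classes = np.argmax(y_prob, axis=1)    # most probable class per row
true_classes = np.argmax(y_onehot, axis=1)  # position of the 1 in each row
accuracy = float(np.mean(pred_classes == true_classes))

print(pred_classes, true_classes, accuracy)
```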

In [None]:
report = classification_report(y_true,y_pred_classes, target_names= class_names)
print(report)
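
Each row of the report treats one class as "positive" and all others as "negative". The sketch below shows how the three metrics are derived for a single class on toy labels; `prf_for_class` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
import numpy as np

def prf_for_class(y_true, y_pred, cls):
    """Precision, recall and F1 for one class, treated as one-vs-rest."""
    tp = np.sum((y_pred == cls) & (y_true == cls))  # correctly predicted cls
    fp = np.sum((y_pred == cls) & (y_true != cls))  # predicted cls, was not
    fn = np.sum((y_pred != cls) & (y_true == cls))  # was cls, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return float(precision), float(recall), float(f1)

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])

precision, recall, f1 = prf_for_class(y_true, y_pred, cls=1)
print(precision, recall, f1)
```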

# ⚡️ CONFUSION MATRIX

In [None]:
confusion_mtx = confusion_matrix(y_true, y_pred_classes) 

f,ax = plt.subplots(figsize = (12,12))
sns.heatmap(confusion_mtx, annot=True, linewidths=0.1, cmap = "Greens", linecolor="white", fmt='.0f', ax=ax)
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix")
plt.show()
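
In the heatmap, rows are true labels and columns are predicted labels, so the diagonal holds the correct predictions. A NumPy-only sketch of how such a matrix is built, and how per-class recall falls out of it, is given below on toy labels; `confusion` is an illustrative helper.

```python
import numpy as np

def confusion(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # count each (true, predicted) pair
    return cm

y_true = np.array([0, 0, 1, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 0, 2])

cm = confusion(y_true, y_pred, n_classes=3)
per_class_recall = cm.diagonal() / cm.sum(axis=1)  # correct / all true samples

print(cm)
print(per_class_recall)
```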