# Traffic Sign Detection
This notebook implements a CNN model that classifies traffic signs from the [GTSRB Dataset](https://www.kaggle.com/datasets/meowmeowmeowmeowmeow/gtsrb-german-traffic-sign), which consists of 43 classes and around 50,000 images (train + test).

In [None]:
%matplotlib inline
import os, glob
import tqdm
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2
from PIL import Image
import warnings
warnings.filterwarnings('ignore')
import tensorflow as tf 
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dense, Flatten, Dropout

## Exploratory Data Analysis
We will read three CSV files, `Meta.csv`, `Train.csv` and `Test.csv`, and explore each of them.

In [None]:
# Root folder of the extracted dataset; adjust it to your local setup.
# Later cells rely on data_path, train_path and test_path being defined.
data_path = 'data/'
train_path = data_path + 'Train/'
test_path = data_path + 'Test/'
df_meta = pd.read_csv(data_path + 'Meta.csv')
df_train = pd.read_csv(data_path + 'Train.csv')
df_test = pd.read_csv(data_path + 'Test.csv')

### Exploring Meta Dataframe

In [None]:
df_meta.head()

There are five columns in `df_meta`: `Path`, `ClassId`, `ShapeId`, `ColorId` and `SignId`.

In [None]:
print("Min. Class Label: {}".format(df_meta.ClassId.min()))
print("Max Class Label: {}".format(df_meta.ClassId.max()))
print("Total Class Labels: {}".format(len(df_meta.ClassId.unique())))

**Let us visualize all 43 class types** using the column `ClassId`

In [None]:
num_classes = len(df_meta.ClassId.unique())
class_labels = list(range(num_classes))
# Classes 0-8: speed limits
speed_class = ['Speed Limit ' + item for item in [speed + ' kmph' for speed in ['20', '30', '50', '60', '70', '80']]]\
            + ['End of Speed Limit 80 kmph']
speed_class+= ['Speed Limit ' + item for item in [speed + ' kmph' for speed in ['100', '120']]]
# Classes 9-10: no passing
no_pass = ['No Passing' + item for item in ['', ' vehicle over 3.5 ton']]
# Classes 11-42: remaining signs
rest = ['Right-of-way at intersection', 'Priority road', 'Yield', 'Stop', 'No vehicles', 'Veh > 3.5 tons prohibited',\
            'No entry', 'General caution', 'Dangerous curve left', 'Dangerous curve right', 'Double curve', 'Bumpy road',
            'Slippery road', 'Road narrows on the right', 'Road work', 'Traffic signals', 'Pedestrians', 'Children crossing',
            'Bicycles crossing', 'Beware of ice/snow','Wild animals crossing', 'End speed + passing limits', 'Turn right ahead',
            'Turn left ahead', 'Ahead only', 'Go straight or right', 'Go straight or left', 'Keep right', 'Keep left',
            'Roundabout mandatory', 'End of no passing', 'End no passing vehicle > 3.5 tons']
class_values = speed_class + no_pass + rest
class_dict = {keys:values for keys,values in zip(class_labels, class_values)}

In [None]:
sortFunction = lambda x: int(os.path.basename(x)[:-4])
plt.figure(figsize = (25, 25))
for i, imagename in enumerate(sorted(glob.glob(data_path + 'Meta/' + '*.*'), key = sortFunction)):
    plt.subplot(7, 7, i + 1)
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
    plt.xlabel(class_dict[i])
    image = cv2.imread(imagename)
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
plt.show()
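
The `sortFunction` key above matters because file names sort as strings by default. The following is a minimal sketch of the difference; the file names and the helper `class_id_from_path` are hypothetical and only mimic a folder of per-class icons.

```python
import os

# Hypothetical file names, standing in for the icons in the Meta folder.
toy_filenames = ['Meta/10.png', 'Meta/2.png', 'Meta/0.png', 'Meta/1.png']

def class_id_from_path(path):
    """Return the integer stem of a file name such as 'Meta/10.png'."""
    stem, _extension = os.path.splitext(os.path.basename(path))
    return int(stem)

# Plain string sorting puts '10.png' before '2.png'.
lexicographic = sorted(toy_filenames)
# Sorting on the integer stem restores the class order.
numeric = sorted(toy_filenames, key=class_id_from_path)

print(lexicographic)
print(numeric)
```

With the numeric key, the position `i` in the loop matches the class id, which is what the lookup `class_dict[i]` relies on.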

**Let us visualize the shapes and colors of the sign** using the columns `ShapeId` and `ColorId`

In [None]:
shape_dict = {0: 'Triangle', 1: 'Circle', 2: 'Diamond', 3: 'Hexagon', 4: 'Inverse Triangle'}
df_meta.ShapeId.value_counts()

In [None]:
def visualize_shape(shape = 0):
    """
    Plots up to 10 random samples of a particular shape from shape_dict
    """
    rows = df_meta[df_meta.ShapeId==shape]
    # Some shapes have fewer than 10 signs, so cap the sample size
    filenames = rows.sample(min(10, len(rows))).Path
    plt.figure(figsize = (25, 25))
    for i, filename in enumerate(data_path + filenames):
        image = cv2.imread(filename)
        plt.subplot(11, 4, i+1)
        plt.grid(False)
        plt.xticks([])
        plt.yticks([])
        plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.show()

In [None]:
# Visualize Triangular signs, ShapeId=0
visualize_shape(0)

In [None]:
color_dict = {0:'Red', 1:'Blue', 2:'Yellow', 3:'White'}
df_meta.ColorId.value_counts()

In [None]:
def visualize_color(color = 0):
    """
    Plots up to 5 random samples of a particular color from color_dict
    """
    rows = df_meta[df_meta.ColorId==color]
    # Some colors have fewer than 5 signs, so cap the sample size
    filenames = rows.sample(min(5, len(rows))).Path
    plt.figure(figsize = (20, 20))
    for i, filename in enumerate(data_path + filenames):
        image = cv2.imread(filename)
        plt.subplot(1, 6, i+1)
        plt.grid(False)
        plt.xticks([])
        plt.yticks([])
        plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    plt.show()

In [None]:
# Visualize Blue Colored Traffic Signs, color = 1
visualize_color(color = 1)

### Exploring Train Dataframe
1. Check the shape of the train dataframe
2. Check the description of all features
3. Create a dictionary `train_dict` with labels as keys and value_counts as values
4. Plot the Class Distribution of Training data
5. Check if the folder directory information is consistent and create the same dictionary `train_sample_dict`

In [None]:
df_train.shape

In [None]:
df_train.head()

In [None]:
df_train.describe()

In [None]:
df_train.ClassId.value_counts()

In [None]:
# Create a dictionary with the labels as keys and the number of samples as values.
# sort_index() orders the entries by label; value_counts() alone orders them by count.
train_dict = {int(k): int(v) for k, v in df_train.ClassId.value_counts().sort_index().items()}

In [None]:
plt.figure(figsize = (21 ,11))
plt.bar(train_dict.keys(), train_dict.values())
plt.title('Class Distribution for Training data')

In [None]:
# Check the dataframe information with folder directory
train_folders = os.listdir(train_path)
# Create a dict with keys as label names and the number of images present inside each label folder as values
sample_dict = {}
for folders in train_folders:
    images = os.listdir(train_path + folders)
    sample_dict[folders] = len(images)
train_sample_dict = {int(k):v for k,v in zip(sample_dict.keys(), sample_dict.values())}
train_dict==train_sample_dict

### Exploring Test Dataframe
1. Check the shape of the test dataframe
2. Check the description of all features
3. Create a dictionary `test_dict` with labels as keys and value_counts as values
4. Plot the Class Distribution of Test data

In [None]:
df_test.shape

In [None]:
df_test.head()

In [None]:
df_test.describe()

In [None]:
df_test.ClassId.value_counts()

In [None]:
# Create a test_dict with the labels as keys and the number of samples as values.
# sort_index() orders the entries by label; value_counts() alone orders them by count.
test_dict = {int(k): int(v) for k, v in df_test.ClassId.value_counts().sort_index().items()}

In [None]:
plt.figure(figsize = (21, 11))
plt.bar(test_dict.keys(), test_dict.values())
plt.title("Class Distribution of Test Data")

### Check Data Balance
Viewed individually, the training and test class distributions can look imbalanced. To check whether both sets follow the same distribution, we measure the train-test ratio for each label. This matters for training: when we split the training data into `train` and `validation`, we want to retain the same per-label ratio in both sets.
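
As a minimal sketch of the label-wise ratio, the example below uses toy label lists (assumed values, not taken from the dataset) in place of `df_train.ClassId` and `df_test.ClassId`. The class sizes differ, yet the train share per label is identical.

```python
from collections import Counter

# Toy labels (assumed values) standing in for the two ClassId columns.
toy_train_labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2]
toy_test_labels = [0, 1, 1, 2]

toy_train_counts = Counter(toy_train_labels)
toy_test_counts = Counter(toy_test_labels)

# Look up both counts by label so that each ratio refers to a single class.
toy_train_ratio = {}
for label in sorted(toy_train_counts):
    total = toy_train_counts[label] + toy_test_counts[label]
    toy_train_ratio[label] = toy_train_counts[label] / total

print(toy_train_ratio)
```

Class 1 has twice as many images as the other two classes, but every label keeps a 0.75 train share, which is the property this section checks.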

In [None]:
df_balance = pd.DataFrame()
df_balance['labels'] = list(range(43))
# Look up the counts by label so that every row refers to a single class
df_balance['train'] = df_balance['labels'].map(train_dict)
df_balance['test'] = df_balance['labels'].map(test_dict)
df_balance['total'] = df_balance['train'] + df_balance['test']
df_balance['train_ratio'] = df_balance['train']/df_balance['total']
df_balance['test_ratio'] = df_balance['test']/df_balance['total']

**Check train test ratio for the first 10 classes**

In [None]:
df_balance.head(10)

**Visualize the train test ratio for each label**

In [None]:
df_balance.plot(x = 'labels', y = ['train_ratio', 'test_ratio'], kind = 'bar', figsize = (21, 11), title = "Train Test Ratio for each class")

**The train-test ratio is roughly the same for every label, so no rebalancing between the two sets is required**

### Check Duplicate Entries

In [None]:
df_train.Path.duplicated().unique()

In [None]:
df_test.Path.duplicated().unique()

**No duplicate entries found in train and test csv files**

## Data Preprocessing
After an extensive EDA, we will prepare our dataset from `data/Train` and `data/Test`. The `Train` folder consists of 43 folders named `0` to `42`, each containing the images of one class. We will prepare our training data by iterating over these folders.

The `Test` folder contains only images, and the ground truth is given in the dataframe `df_test`. We need to predict the label for each of these images.

Before training the model, we will split the `Train` dataset into `train` and `val` using an 80-20 stratified split to retain the class ratios. The `Train` and `Test` datasets are already in an approximate 75-25 split. This leads to an overall split of:

- `Train` 60%
- `Val` 15%
- `Test` 25%
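
The percentages above follow from applying the 80-20 split to the 75% share of the `Train` folder. A short worked calculation, using only the approximate figures quoted in the text:

```python
# Approximate shares quoted in the text above.
train_folder_share = 0.75   # share of all images in the Train folder
test_share = 0.25           # share of all images in the Test folder
val_fraction = 0.20         # part of the Train folder held out for validation

train_share = train_folder_share * (1 - val_fraction)
val_share = train_folder_share * val_fraction

print(round(train_share, 2), round(val_share, 2), test_share)
```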

### Training Data
1. Iterate over all folders to get images and labels
2. Store the data in `train_data` and labels in `train_labels`
3. Check whether the length of both arrays is equal to the information provided in `df_train`
4. Split the training data into 2 sets `train` and `val` for training using `stratified train-test split`

In [None]:
train_data = []
train_labels = []
for folders in tqdm.tqdm(train_folders):
    imagefiles = os.listdir(train_path + folders)
    for imagefile in imagefiles:
        path = os.path.join(train_path, folders, imagefile)
        image = Image.open(path)
        image = image.resize((32, 32))
        image = np.array(image)
        train_data.append(image)
        train_labels.append(int(folders))
    

In [None]:
train_data = np.array(train_data)
train_labels = np.array(train_labels)

In [None]:
# Check the length of both arrays
len(df_train), len(train_data), len(train_labels)

**The number of images in the train folders matches the number of samples given in the dataframe.**

In [None]:
print("There are {} images in train dataset".format(len(train_data)))
print("Each image has a dimension of : {}".format(train_data[0].shape))

### Train-Val Split
Use a stratified train-test split, which retains the class distribution after splitting in an 80-20 ratio.
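
To illustrate what stratification means, here is a simplified sketch in plain Python. The function `stratified_split` is hypothetical and is not how scikit-learn implements `train_test_split`; it only shows the idea of splitting each class separately so that every class keeps about the same validation share.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with the split applied per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out the same fraction of every class.
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels with three classes of very different sizes (assumed values).
toy_labels = [0] * 50 + [1] * 100 + [2] * 10
toy_train_idx, toy_val_idx = stratified_split(toy_labels)
toy_val_counts = Counter(toy_labels[i] for i in toy_val_idx)
print(toy_val_counts)
```

Each class contributes 20% of its own images to the validation set, so the smallest class is still represented there.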

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(train_data, train_labels, test_size = 0.2, stratify = train_labels, random_state = 42)
train_unique, y_train_count = np.unique(y_train, return_counts = True)
val_unique, y_val_count = np.unique(y_val, return_counts = True)
y_train, y_val = to_categorical(y_train, num_classes), to_categorical(y_val, num_classes)
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)

In [None]:
plt.figure(figsize = (21, 11))
plt.bar(train_unique, y_train_count)
plt.bar(val_unique, y_val_count)
plt.legend(['Train Split', 'Val Split'], loc = 'upper right')
plt.title("After Splitting into Train and Val")

### Test Data
1. Iterate over all images in the test directory
2. Store the images in `test_data`
3. Ground-truth labels are given in `df_test`

In [None]:
test_data = []
test_ground_truth = df_test.ClassId.tolist()
test_filenames = (data_path + df_test.Path).tolist()
for test_filename in tqdm.tqdm(test_filenames):
    image_filename = Image.open(test_filename)
    image = image_filename.resize((32, 32))
    image = np.array(image)
    test_data.append(image)

In [None]:
X_test = np.array(test_data)
y_test = np.array(test_ground_truth)

In [None]:
test_unique, y_test_count = np.unique(y_test, return_counts = True)

### Train-Val-Test Split
Visualize the distribution of training, validation and testing data after splitting.

In [None]:
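df_balance['val'] = y_val_count
df_balance['val_ratio'] = df_balance['val']/df_balance['total']
df_balance['train_ratio'] = df_balance['train_ratio'] - df_balance['val_ratio']
# 'train' still counts every image in the Train folder, so store the post-split count separately
df_balance['train_split'] = y_train_count
df_balance.head()

In [None]:
# Stack the post-split counts so that validation images are not counted twice
df_balance.plot.bar(x = 'labels', y = ['train_split', 'val', 'test'], figsize = (25, 25), stacked = True, title = "Train Val Test Split")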

## Building Model
We will now build a CNN model for training using Keras.

In [None]:
def get_compiled_model():
    model = Sequential()
    model.add(Conv2D(filters=32, kernel_size=(5,5), activation='relu', input_shape=X_train.shape[1:]))
    model.add(Conv2D(filters=32, kernel_size=(5,5), activation='relu'))
    model.add(MaxPool2D(pool_size=(2, 2)))
    model.add(Dropout(rate=0.2))
    model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
    model.add(Conv2D(filters=64, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPool2D(pool_size=(2, 2)))
    model.add(Dropout(rate=0.2))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(rate=0.5))
    model.add(Dense(43, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
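
The layer stack above can be checked with a little shape arithmetic. This is a sketch that assumes 32x32 inputs (as produced by the resize step) and the Keras defaults the model does not override: stride 1 with `'valid'` padding for `Conv2D`, and a stride equal to the pool size for `MaxPool2D`. The helpers `conv_out` and `pool_out` are hypothetical names.

```python
def conv_out(size, kernel):
    """Spatial size after a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

def pool_out(size, pool):
    """Spatial size after non-overlapping max pooling."""
    return size // pool

size = 32                  # assumed input height and width
size = conv_out(size, 5)   # first 5x5 convolution
size = conv_out(size, 5)   # second 5x5 convolution
size = pool_out(size, 2)   # first 2x2 max pooling
size = conv_out(size, 3)   # first 3x3 convolution
size = conv_out(size, 3)   # second 3x3 convolution
size = pool_out(size, 2)   # second 2x2 max pooling

# The last convolution has 64 filters, so Flatten yields size * size * 64 values.
flattened = size * size * 64
# Weights plus biases of the Dense(256) layer that follows Flatten.
dense_params = flattened * 256 + 256

print(size, flattened, dense_params)
```

These numbers can be compared against the output of `model.summary()`.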

### Fit Model
Fit the model using training and validation data

In [None]:
X_train, X_val, X_test = X_train/255., X_val/255., X_test/255.

In [None]:
model = get_compiled_model()

In [None]:
model.summary()

In [None]:
history = model.fit(X_train, y_train, batch_size=32, epochs=20, validation_data=(X_val, y_val), verbose = True)

## Visualize Results

### Accuracy Plot

In [None]:
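# The history key is 'accuracy' in TensorFlow 2.x and 'acc' in older releases
acc_key = 'accuracy' if 'accuracy' in history.history else 'acc'
plt.figure()
plt.plot(history.history[acc_key], label = "Training Accuracy")
plt.plot(history.history['val_' + acc_key], label = "Validation Accuracy")
plt.title("Accuracy Plot")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.legend()
plt.show()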

### Loss Plot

In [None]:
plt.figure()
plt.plot(history.history['loss'], label = "Training Loss")
plt.plot(history.history['val_loss'], label = "Validation Loss")
plt.title("Loss Plot")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.legend()
plt.show()

## Inference

In [None]:
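# predict_classes was removed in newer TensorFlow releases; take the argmax of the class probabilities instead
predictions = np.argmax(model.predict(X_test), axis=-1)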

In [None]:
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix
cfm = confusion_matrix(y_test, predictions)

In [None]:
accuracy_score(y_test, predictions)

In [None]:
df_cfm = pd.DataFrame(cfm, index = range(num_classes), columns = range(num_classes))
plt.figure(figsize = (25, 25))
# fmt='d' prints the counts as plain integers instead of scientific notation
sns.heatmap(df_cfm, annot=True, fmt='d', cmap=sns.cubehelix_palette(as_cmap=True))
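
To make the heatmap easier to read, the sketch below rebuilds the two metrics by hand on toy data. The probabilities and labels are hypothetical values, not model output. Rows correspond to the actual class and columns to the predicted class, the same layout that `confusion_matrix` uses.

```python
# Hypothetical softmax outputs for four images and three classes.
toy_probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
]
toy_ground_truth = [0, 1, 2, 2]
toy_num_classes = 3

# The predicted class is the index of the largest probability in each row.
toy_predicted = [row.index(max(row)) for row in toy_probabilities]

# Rows are actual classes, columns are predicted classes.
toy_matrix = [[0] * toy_num_classes for _ in range(toy_num_classes)]
for actual, guess in zip(toy_ground_truth, toy_predicted):
    toy_matrix[actual][guess] += 1

# Correct predictions lie on the diagonal.
toy_accuracy = sum(toy_matrix[i][i] for i in range(toy_num_classes)) / len(toy_ground_truth)

print(toy_predicted)
print(toy_matrix)
print(toy_accuracy)
```

Off-diagonal entries show which classes get confused: here one image of class 2 is predicted as class 1.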

### Visualize Predictions

In [None]:
plt.figure(figsize = (30, 30))
start_index = 36
for i in range(30):
    plt.subplot(10, 3, i + 1)
    plt.grid(False)
    plt.xticks([])
    plt.yticks([])
    prediction = predictions[start_index + i]
    ground_truth = y_test[start_index + i]
    col = 'g'
    if prediction != ground_truth:
        col = 'r'
    plt.xlabel('Actual Class {} , Predicted Class {}'.format(class_dict[ground_truth], class_dict[prediction]), color = col, weight = 'bold')
    plt.imshow(X_test[start_index + i])
plt.show()

In [None]:
# Save the model for further implementation
os.makedirs('models', exist_ok=True)  # does not fail if the folder already exists
model.save('models/traffic_sign_detection_gtsrb.h5')

## Future Work
1. Use this model with a custom dataset to run inference on video using an object detection framework
2. Implement OCR capability for non-English-speaking countries
3. Create an interactive dashboard with labelling
4. Deploy this model on Streamlit