<h1>Bohemian Alligators</h1><br>
The group formed in 2018, and though their one and only motive is to create a deep learning masterpiece, their name truly sounds like that of a funky alternative-rock band. They are well known for their passion for everything they get into, and for their love of good old Belgian chocolate ice cream.
<br>
The Members Are:<br>
Beáta Csilla Kovács - the one who always stays calm (she would probably be the singer)<br>
Csenge Kilián - the one who has only one first name (she would probably be the bass guitarist)

You can read more on the nature of alligators here:
https://en.wikipedia.org/wiki/Alligator

In [7]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from os import listdir
from os.path import isfile, join
from PIL import Image
from sklearn.preprocessing import StandardScaler
import matplotlib.colors as mcolors

<h1> About The Data </h1><br>
After considering other datasets as well, we chose the dataset collected and annotated by UC Berkeley.
It's huge, so we decided to initially work with only the first 100 samples of segmented data; later on we plan to include more.
You can download the whole dataset from the following link:
http://bdd-data.berkeley.edu/
The licence is included at the end of the README of our GitHub repo.

Some of the reasons we chose to work with <b>UC Berkeley's dataset</b> are:
1. It covers a wide range of driving conditions, in terms of both time of day and weather
2. There are over 10,000 samples and corresponding <b>pixel-level annotations for both class-level and instance-level segmentation</b> 
3. After a quick registration, they provide an easy way to download the data through Google Drive

<h1>Reading and Exploring The Dataset </h1><br>
To make cloning the repository and running our code easier we included the first 100 samples in the repository.
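Since the raw images and their annotations are stored in separate folders, it can help to pair them by filename before loading anything. The following is only a minimal sketch: the helper name `pair_by_stem`, the `_train_color` suffix and the example filenames are assumptions made for illustration, not something taken from the dataset's documentation.

```python
from pathlib import Path


def pair_by_stem(raw_names, annot_names, annot_suffix="_train_color"):
    """Match raw image names to annotation names that share the same stem.

    annot_suffix is a hypothetical suffix that annotation files carry
    in addition to the raw file's stem; pass "" if there is none.
    """
    annot_by_stem = {}
    for name in annot_names:
        stem = Path(name).stem
        if annot_suffix and stem.endswith(annot_suffix):
            stem = stem[: -len(annot_suffix)]
        annot_by_stem[stem] = name

    pairs, missing = [], []
    for raw in sorted(raw_names):
        annotation = annot_by_stem.get(Path(raw).stem)
        if annotation is None:
            missing.append(raw)  # raw image without an annotation
        else:
            pairs.append((raw, annotation))
    return pairs, missing


# Made-up filenames, only to show the behaviour
raw_names = ["b.jpg", "a.jpg", "c.jpg"]
annot_names = ["a_train_color.png", "b_train_color.png"]
pairs, missing = pair_by_stem(raw_names, annot_names)
print(pairs)
print(missing)
```

Pairing by name, rather than by position in two directory listings, also makes it explicit which raw images have no annotation at all.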

In [None]:
#Reading data
print("Reading data...")
data_filenames = []
for root, dirs, files in os.walk('data/raw_images/'):  
    for filename in files:
        data_filenames.append(filename)
#os.walk gives no guaranteed file order; sorting both lists keeps the raw images and
#annotations aligned, assuming matching files sort into the same position
data_filenames.sort()

data = [np.array(Image.open('data/raw_images/' + filename)) for filename in data_filenames[:100]]
print("Number of raw images: \t", end="")
print(len(data))

print("\nReading annotated images of segmentation...")
annot_filenames = []
for root, dirs, files in os.walk('data/class_color/'):  
    for filename in files:
        annot_filenames.append(filename)
annot_filenames.sort()
        
annot = [np.array(Image.open('data/class_color/' + filename)) for filename in annot_filenames[:100]]
print("Number of annotated images: \t", end="")
print(len(annot))
if len(data)==len(annot):
    print("\nAll raw images are annotated.\n")

print("An example of raw data and its annotation: ")

image_raw = data[0]
fig, ax = plt.subplots(1,2, figsize=(15,40))
ax[0].imshow(image_raw)
image_ann = annot[0]
ax[1].imshow(image_ann)
plt.show()

Reading data...


<h1>Preparation of Data</h1> <br>
As preparation, we split the data into train, validation and test samples, separate the training images by RGB channel, and standardize each channel with the help of the StandardScaler.<br>
On a side note: shuffling the data was also intended, but we sadly failed. We certainly plan to solve this later, as we find it necessary: some consecutive samples were taken from the same video and therefore look quite alike, both in terms of the environment and the position of the vehicles.
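One possible way to shuffle without losing the pairing between raw images and annotations is to apply a single shared permutation to both lists. This is a minimal sketch, not the method used in this notebook: the helper name `shuffle_in_unison` and the tiny stand-in arrays are assumptions for illustration.

```python
import numpy as np


def shuffle_in_unison(images, annotations, seed=0):
    """Shuffle two equally long lists with the same random permutation."""
    if len(images) != len(annotations):
        raise ValueError("images and annotations must have the same length")
    rng = np.random.default_rng(seed)  # fixed seed keeps the split reproducible
    order = rng.permutation(len(images))
    return [images[i] for i in order], [annotations[i] for i in order]


# Stand-in data: image i is filled with i, its annotation with 10 * i
images = [np.full((2, 2, 3), i, dtype=np.uint8) for i in range(5)]
annotations = [np.full((2, 2, 3), 10 * i, dtype=np.uint8) for i in range(5)]

shuffled_images, shuffled_annotations = shuffle_in_unison(images, annotations)
for image, annotation in zip(shuffled_images, shuffled_annotations):
    print(int(image[0, 0, 0]), int(annotation[0, 0, 0]))
```

Shuffling would have to happen before the train-validation-test split for it to have any effect on which samples land in which part.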

The .csv file contains metadata about the annotation, such as which categories of objects the segmentation differentiates and which color represents each category. 
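To illustrate how such a color table can be used, a color-coded annotation can be turned into a mask of class indices. This is a minimal sketch under stated assumptions: the helper name `color_to_index`, the two-entry palette and the value 255 for unlisted colors are all hypothetical; the real colors would come from the .csv file.

```python
import numpy as np


def color_to_index(annotation, palette, unknown=255):
    """Convert an (H, W, 3) color annotation into an (H, W) mask of class indices.

    palette maps (R, G, B) tuples to class indices; pixels whose color is
    not in the palette are set to `unknown`.
    """
    mask = np.full(annotation.shape[:2], unknown, dtype=np.uint8)
    for color, index in palette.items():
        matches = np.all(annotation == np.array(color, dtype=annotation.dtype), axis=-1)
        mask[matches] = index
    return mask


# Hypothetical palette with two classes
palette = {(128, 64, 128): 0, (0, 0, 142): 1}

# A 2x2 toy annotation; the last pixel has a color the palette does not list
annotation = np.array(
    [[[128, 64, 128], [0, 0, 142]],
     [[0, 0, 142], [7, 7, 7]]],
    dtype=np.uint8,
)
class_mask = color_to_index(annotation, palette)
print(class_mask)
```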

In [None]:
#Splitting data into train-validation-test parts with ratios 70-20-10
print("Splitting data into training data, validation data, test data")
nb_samples = len(data) #only the samples that were actually loaded (at most 100) are split
#Splitting ratios:
valid_split = 0.2
test_split = 0.1
train_split = 0.7
print("The ratios are: ")
print("\t train:\t", train_split )
print("\t validation:\t",valid_split )
print("\t test:\t",test_split)
    
#Splitting
data_train = np.array(data[0:int(nb_samples*(1-valid_split-test_split))])
annot_train = np.array(annot[0:int(nb_samples*(1-valid_split-test_split))])
data_valid = data[int(nb_samples*(1-valid_split-test_split)):int(nb_samples*(1-test_split))]
annot_valid = annot[int(nb_samples*(1-valid_split-test_split)):int(nb_samples*(1-test_split))]
data_test  = data[int(nb_samples*(1-test_split)):]
annot_test  = annot[int(nb_samples*(1-test_split)):]

#Separation of axes
red_train = []
green_train = []
blue_train = []
for img in data_train:
    image = np.array(img.ravel(), dtype='float64')
    red_train.append(image[0::3])
    green_train.append(image[1::3])
    blue_train.append(image[2::3])


#Standardizing
#One mean and variance is computed per color channel over all training pixels, so each
#channel is flattened to a single column for the scaler and reshaped back afterwards
scaler = StandardScaler()
    
print("\nStandardized data:\nRed:")
red_std = scaler.fit_transform(np.reshape(red_train, (-1, 1))).reshape(len(red_train), -1)
print(red_std)

print("\nGreen:")
green_std = scaler.fit_transform(np.reshape(green_train, (-1, 1))).reshape(len(green_train), -1)
print(green_std)
    
print("\nBlue:")
blue_std = scaler.fit_transform(np.reshape(blue_train, (-1, 1))).reshape(len(blue_train), -1)
print(blue_std)

print("\nNumber of training samples:\t", len(data_train))
print("Number of validation samples:\t", len(data_valid))
print("Number of test samples:\t", len(data_test))
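
The cell above standardizes the training images only. One possible way to reuse the training statistics for the validation and test images is sketched here with plain NumPy. The helper names `channel_stats` and `standardize` are hypothetical, the random arrays stand in for real images, and all images are assumed to share one (H, W, 3) shape.

```python
import numpy as np


def channel_stats(images):
    """Return the per-channel mean and standard deviation over all pixels."""
    stacked = np.stack(images).astype(np.float64)  # shape (N, H, W, 3)
    mean = stacked.mean(axis=(0, 1, 2))
    std = stacked.std(axis=(0, 1, 2))  # population std (ddof=0), as StandardScaler uses
    return mean, std


def standardize(images, mean, std):
    """Apply previously computed per-channel statistics to a list of images."""
    return (np.stack(images).astype(np.float64) - mean) / std


# Random stand-in images instead of the real dataset
rng = np.random.default_rng(0)
train = [rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8) for _ in range(7)]
valid = [rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8) for _ in range(2)]

mean, std = channel_stats(train)            # statistics come from the training set only
train_std = standardize(train, mean, std)
valid_std = standardize(valid, mean, std)   # validation reuses the training statistics
print(train_std.mean(axis=(0, 1, 2)), train_std.std(axis=(0, 1, 2)))
```

Computing the statistics on the training set alone keeps information from the validation and test images out of the preprocessing step.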

In [None]:
#Reading .csv file containing metadata about the segmentation
print("Reading file containing metadata about the segmentation...")
metadf = pd.read_csv('data/categories.csv', sep=',')
print("Sneak peek:")
metadf.head()

<h1> Further Analyzing The Data </h1><br>
Our aim was to gain some insight into the quality of the data, such as how many of the pictures contain vehicles and how many do not. We also took a quick look at the whole dataset to check whether it truly has samples from a wide range of driving conditions, and we found that it was perfect for our needs. Thus we decided that data augmentation wasn't necessary. <br>
(Even so, we found an article with great tips and tricks on how to do that here: https://medium.freecodecamp.org/image-augmentation-make-it-rain-make-it-snow-how-to-modify-a-photo-with-machine-learning-163c0cb3843f)
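The question of how many pictures contain a given kind of object can be answered directly from the color-coded annotations. This is a minimal sketch, assuming each annotation is an array with the RGB values in its first three channels; the helper name `images_containing` and the toy colors are hypothetical and not taken from the metadata file.

```python
import numpy as np


def images_containing(annotations, color):
    """Count the annotation images in which at least one pixel has the given RGB color."""
    target = np.asarray(color)
    count = 0
    for annotation in annotations:
        rgb = annotation[..., :3]  # ignore an alpha channel if there is one
        if np.any(np.all(rgb == target, axis=-1)):
            count += 1
    return count


# Hypothetical colors for two subcategories
car = (0, 0, 142)
road = (128, 64, 128)

# Three 2x2 toy annotations: the first and third contain "car" pixels
toy_annotations = [
    np.array([[road, car], [road, road]], dtype=np.uint8),
    np.array([[road, road], [road, road]], dtype=np.uint8),
    np.array([[car, car], [road, car]], dtype=np.uint8),
]
car_count = images_containing(toy_annotations, car)
road_count = images_containing(toy_annotations, road)
print(car_count, road_count)
```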

In [None]:
#Organizing subcategories into an array, and counting subcategories
subcat = []
no_subcat = 0
for row in metadf.name:
    subcat.append(row)
no_subcat = len(subcat)

#Organizing categories into an array
cat = []
for row in metadf.category:
    cat.append(row)

#Organizing category Ids into an array
catid = []
for row in metadf.catId:
    catid.append(row)
#Counting categories
no_cat = 1
act = catid[0]
categories = [] #array containing categories without duplication
categories.append(cat[0])
for i in range(len(catid)):
    if catid[i]!=act:
        categories.append(cat[i])
        no_cat+=1
        act=catid[i]

#Organizing subcategory RGB colors into an array
col = []
for row in metadf.color:
    c = row.replace(" ", "").split(',')
    rgb = []
    for i in c:
        rgb.append(int(i))
    col.append(rgb)


print('Number of segmentation subcategories:', no_subcat)
print('Number of segmentation categories:', no_cat, "\n")
print("Subcategories and their representational colors [R, G, B]: \n")
for i in range(len(subcat)):
    print("%30s \t" % subcat[i], end ="")
    print(col[i])

colorstodisplay = []    
for c in col:
    colorstodisplay.append([c[0]/255,c[1]/255,c[2]/255])
my_cmap = mcolors.ListedColormap(colorstodisplay)
plt.figure(figsize=(20, 0.5))
plt.title('The Color Map (in order of subcategory)')
plt.pcolormesh(np.arange(my_cmap.N).reshape(1, -1), cmap=my_cmap)
plt.gca().yaxis.set_visible(False)
plt.gca().set_xlim(0, my_cmap.N)
plt.show()
    
print("\nSubcategories by their categories: \n")
act = cat[0]
print(cat[0] + ":")
for i in range(no_subcat):
    if cat[i] != act:
        print("\n" + cat[i] + ":")
        act=cat[i]
    print("\t\t"+subcat[i])

In [None]:
colors = []
for img in annot:
    im = Image.fromarray(img)
    im_rgb = im.convert('RGB')
    colors.append(im_rgb.getcolors()) #.getcolors() returns a list of (number of occurrences, (R, G, B)) tuples, or None if the image has more than 256 colors

#counting category and subcategory occurrences 
count_subcategories = np.zeros(len(subcat), dtype=int) #subcategory occurrence
count_categories = np.zeros(no_cat, dtype=int) #category occurrence
catcounter = np.zeros(no_cat, dtype=int) #helper for counting category occurrence
 
subcat_col = [] #concatenating subcategory and color arrays
for i in range(len(subcat)):
    subcat_col.append([subcat[i], col[i]])

for imcol in colors: #iterating over images
    catcounter = np.zeros(no_cat, dtype=int)
    for j in range(len(imcol)): #iterating over colors of one image   
        for i in range(len(subcat_col)): #iterating over subcategory colors
            if(tuple(subcat_col[i][1]) == imcol[j][1]): #if found
                count_subcategories[i] += 1 #increasing subcategory counter by 1
                catcounter[catid[i]] = 1 #helper for counting categories
    for c in range(len(catcounter)): #increasing the counter of each category found in the colors of one image
        count_categories[c] += catcounter[c]
        
print("An example of segmentation")
print("\tWhere (RGB) depicts subcategory: ")
for row in colors[0]:
    print("\n\t\t", row[1], " depicts:  ", end="")
    for i in range(len(subcat_col)): #iterating over subcategory colors
            if(tuple(subcat_col[i][1]) == row[1]): #if found
                print("  ", subcat_col[i][0], end="")
                
image = annot_train[0]
plt.imshow(image)
plt.show()


print("Categories and the number of pictures they occur in (out of ", len(annot) , " samples):")
for i in range(no_cat):
    print(categories[i] + ": ", end='')
    print(count_categories[i])
print("Conclusion: most of the images contain some kind of vehicle, which is good, as vehicles are our main region of interest.")
        
print("\nSubcategories and the number of pictures they occur in (out of ", len(annot) , "samples):")
for i in range(no_subcat):
    print(subcat[i] + ": ", end='')
    print(count_subcategories[i])
    
print("hello darkness my old friend")

print("\n\nSome fancy exploding pie charts visualizing some of the earlier gained statistical data:")
labels = 'Pictures with trucks on them', 'No trucks :('
sizes = [count_subcategories[-1],len(annot)-count_subcategories[-1]]
explode = (0.1, 0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.

plt.show()

labels = 'Pictures with person object on them', 'Pictures with no person object on them'
index=0
for sc in range(len(subcat)):
    if subcat[sc] == "person":
        index = sc
        break
sizes = [count_subcategories[index],len(annot)-count_subcategories[index]]
explode = (0.1, 0)
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
#...and then we realised these pie charts aren't as informative as we wished they would be, so we stopped making them