# Vehicle silhouettes

## Objective
To classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles.

## Description

The features were extracted from the silhouettes by the HIPS (Hierarchical Image Processing System) extension BINATTS, which extracts a combination of scale-independent features. It uses both classical moment-based measures, such as scaled variance, skewness and kurtosis about the major/minor axes, and heuristic measures, such as hollows, circularity, rectangularity and compactness.

Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that the two cars would be harder to tell apart. In the CSV used in this notebook the `class` column holds three labels (`van`, `car`, `bus`), which suggests the two cars are grouped under `car`.

## Useful links
https://machinelearningmastery.com/implement-random-forest-scratch-python/

https://towardsdatascience.com/random-forests-and-decision-trees-from-scratch-in-python-3e4fa5ae4249

https://www.analyticsvidhya.com/blog/2018/12/building-a-random-forest-from-scratch-understanding-real-world-data-products-ml-for-programmers-part-3/

# Part 1: Random Forest from scratch

Random forests are an ensemble learning method for classification and regression. They build multiple decision trees at training time and output either the mode of the classes predicted by the individual trees (classification) or the mean of their predictions (regression).

## Importing Data

In [2]:
# Load the libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from scipy.spatial.distance import euclidean as euc
# from visualize import generate_moons_df, preprocess, plot_boundaries

# Sklearn processing
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, r2_score

from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from collections import Counter
np.random.seed(0)

In [3]:
# Load the dataset 
data = pd.read_csv('Data/vehicle.csv')

In [4]:
data

Unnamed: 0,compactness,circularity,distance_circularity,radius_ratio,pr.axis_aspect_ratio,max.length_aspect_ratio,scatter_ratio,elongatedness,pr.axis_rectangularity,max.length_rectangularity,scaled_variance,scaled_variance.1,scaled_radius_of_gyration,scaled_radius_of_gyration.1,skewness_about,skewness_about.1,skewness_about.2,hollows_ratio,class
0,95,48.0,83.0,178.0,72.0,10,162.0,42.0,20.0,159,176.0,379.0,184.0,70.0,6.0,16.0,187.0,197,van
1,91,41.0,84.0,141.0,57.0,9,149.0,45.0,19.0,143,170.0,330.0,158.0,72.0,9.0,14.0,189.0,199,van
2,104,50.0,106.0,209.0,66.0,10,207.0,32.0,23.0,158,223.0,635.0,220.0,73.0,14.0,9.0,188.0,196,car
3,93,41.0,82.0,159.0,63.0,9,144.0,46.0,19.0,143,160.0,309.0,127.0,63.0,6.0,10.0,199.0,207,van
4,85,44.0,70.0,205.0,103.0,52,149.0,45.0,19.0,144,241.0,325.0,188.0,127.0,9.0,11.0,180.0,183,bus
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
841,93,39.0,87.0,183.0,64.0,8,169.0,40.0,20.0,134,200.0,422.0,149.0,72.0,7.0,25.0,188.0,195,car
842,89,46.0,84.0,163.0,66.0,11,159.0,43.0,20.0,159,173.0,368.0,176.0,72.0,1.0,20.0,186.0,197,van
843,106,54.0,101.0,222.0,67.0,12,222.0,30.0,25.0,173,228.0,721.0,200.0,70.0,3.0,4.0,187.0,201,car
844,86,36.0,78.0,146.0,58.0,7,135.0,50.0,18.0,124,155.0,270.0,148.0,66.0,0.0,25.0,190.0,195,car


**Preprocessing**

In [6]:
# Checking the data types of each column
for i in data:
    print(i, ":", type(data[i][0]))

compactness : <class 'numpy.int64'>
circularity : <class 'numpy.float64'>
distance_circularity : <class 'numpy.float64'>
radius_ratio : <class 'numpy.float64'>
pr.axis_aspect_ratio : <class 'numpy.float64'>
max.length_aspect_ratio : <class 'numpy.int64'>
scatter_ratio : <class 'numpy.float64'>
elongatedness : <class 'numpy.float64'>
pr.axis_rectangularity : <class 'numpy.float64'>
max.length_rectangularity : <class 'numpy.int64'>
scaled_variance : <class 'numpy.float64'>
scaled_variance.1 : <class 'numpy.float64'>
scaled_radius_of_gyration : <class 'numpy.float64'>
scaled_radius_of_gyration.1 : <class 'numpy.float64'>
skewness_about : <class 'numpy.float64'>
skewness_about.1 : <class 'numpy.float64'>
skewness_about.2 : <class 'numpy.float64'>
hollows_ratio : <class 'numpy.int64'>
class : <class 'str'>


In [7]:
# Check for null values
data.isnull().sum()

compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64

In [8]:
# Print the number of unique values in each column

for i in data.columns:
    print(i, ":", len(data[i].unique()))

compactness : 44
circularity : 28
distance_circularity : 64
radius_ratio : 135
pr.axis_aspect_ratio : 38
max.length_aspect_ratio : 21
scatter_ratio : 132
elongatedness : 36
pr.axis_rectangularity : 14
max.length_rectangularity : 66
scaled_variance : 129
scaled_variance.1 : 423
scaled_radius_of_gyration : 144
scaled_radius_of_gyration.1 : 40
skewness_about : 24
skewness_about.1 : 42
skewness_about.2 : 31
hollows_ratio : 31
class : 3


**Normalizing**

In [None]:
# Preprocessing
# Encoding categorical variables (if any)
# Feature Scaling
# Filling missing values (if any)
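One possible way to cover the three preprocessing steps listed above is sketched here. The choices are assumptions, not requirements of the assignment: median imputation for the missing values, `MinMaxScaler` for scaling, and `LabelEncoder` for the `class` column. The helper name `preprocess` and the tiny `toy` frame are hypothetical stand-ins for the real `data`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler


def preprocess(df, target="class"):
    """Return (X, y, encoder): imputed, scaled features and integer labels."""
    features = df.drop(columns=[target])
    # Assumption: fill each missing value with the median of its column.
    features = features.fillna(features.median())
    # Rescale every feature to the [0, 1] range.
    X = pd.DataFrame(
        MinMaxScaler().fit_transform(features),
        columns=features.columns,
        index=features.index,
    )
    # Encode the string labels as integers (classes are sorted alphabetically).
    encoder = LabelEncoder()
    y = pd.Series(encoder.fit_transform(df[target]), index=df.index, name=target)
    return X, y, encoder


# Hypothetical miniature frame standing in for `data`.
toy = pd.DataFrame({
    "compactness": [95, 91, 104, 85],
    "circularity": [48.0, np.nan, 50.0, 44.0],
    "class": ["van", "van", "car", "bus"],
})
X, y, encoder = preprocess(toy)
print(X)
print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
```

Decision trees do not depend on feature scale, so the scaling step mainly matters if the same features are later fed to scale-sensitive models. Fitting the scaler on the training rows only, after the split, would avoid leaking test-set statistics.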



In [None]:
# Divide the dataset to training and testing set
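A minimal sketch of the split, using `train_test_split` with stratification so that each class keeps roughly the same share in both sets. The synthetic data from `make_classification` (300 rows, 18 features, 3 classes) is only a stand-in for the preprocessed vehicle data, and the 80/20 ratio is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and labels.
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)

# Hold out 20% of the rows; `stratify=y` preserves the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```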



In [None]:
# Randomly choose features from the training set and build a decision tree
# Random feature subsets give a different decision tree every time
# Enforce a minimum number of random features so that each tree has enough features to learn from
# Note: You can use scikit-learn's built-in decision tree for training
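The step described above could look roughly like this. The helper `fit_random_feature_tree` and its parameter `min_features` are hypothetical names, and the data is synthetic. The function draws a random number of columns (never fewer than `min_features`), fits a scikit-learn `DecisionTreeClassifier` on them, and returns the chosen column indices so they can be reused at prediction time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_random_feature_tree(X, y, min_features, rng):
    """Fit one tree on a random column subset; return (tree, column indices)."""
    n_total = X.shape[1]
    # Number of columns to use: between min_features and n_total, inclusive.
    k = rng.integers(min_features, n_total + 1)
    # Pick k distinct column indices.
    cols = np.sort(rng.choice(n_total, size=k, replace=False))
    tree = DecisionTreeClassifier(random_state=0).fit(X[:, cols], y)
    return tree, cols


# Synthetic stand-in for the training set.
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)
rng = np.random.default_rng(0)
tree, cols = fit_random_feature_tree(X, y, min_features=5, rng=rng)
print(len(cols), cols)
```

This follows the notebook's per-tree feature subsets. The classic random forest algorithm differs in two ways: it also trains each tree on a bootstrap sample of the rows, and it samples candidate features at every split rather than once per tree.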




In [None]:
# Train N decision trees using the random feature selection strategy
# The number of trees N can be user input
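A sketch of training N trees under the same assumptions: synthetic data, and a hypothetical function `train_forest` whose `n_trees` argument plays the role of N. Each entry of the returned list pairs a fitted tree with the column indices it was trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def train_forest(X, y, n_trees, min_features, seed=0):
    """Train `n_trees` trees, each on its own random feature subset."""
    rng = np.random.default_rng(seed)
    n_total = X.shape[1]
    forest = []
    for _ in range(n_trees):
        k = rng.integers(min_features, n_total + 1)
        cols = np.sort(rng.choice(n_total, size=k, replace=False))
        tree = DecisionTreeClassifier(random_state=0).fit(X[:, cols], y)
        forest.append((tree, cols))  # keep the columns for prediction time
    return forest


# Synthetic stand-in for the training set.
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)
forest = train_forest(X, y, n_trees=10, min_features=5)
print([len(cols) for _, cols in forest])
```

In an interactive notebook, `n_trees` could be read with something like `int(input("Number of trees: "))` before calling the function.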




In [None]:
# Apply different voting mechanisms such as
# max voting / average voting / weighted average voting (using each tree's accuracy as its weight)
# Perform the ensembling for the training set.
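The three voting schemes can be sketched with plain NumPy. The function names are hypothetical. `preds` is assumed to hold integer class labels with shape `(n_trees, n_samples)`, and `probas` is assumed to hold per-tree class probabilities (for example from `predict_proba`) with shape `(n_trees, n_samples, n_classes)`. The numbers in the small example are made up.

```python
import numpy as np


def max_vote(preds, n_classes):
    """Majority vote over integer labels of shape (n_trees, n_samples)."""
    # counts[c, j] = number of trees that predicted class c for sample j.
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return counts.argmax(axis=0)  # ties go to the lowest class index


def average_vote(probas):
    """Average the class probabilities over trees, then pick the best class."""
    return probas.mean(axis=0).argmax(axis=1)


def weighted_vote(probas, weights):
    """Like average_vote, but each tree counts in proportion to its weight."""
    return np.average(probas, axis=0, weights=weights).argmax(axis=1)


# Three trees, three samples, three classes (made-up labels).
preds = np.array([[0, 1, 2],
                  [0, 2, 2],
                  [1, 1, 2]])
print(max_vote(preds, n_classes=3))  # [0 1 2]

# Three trees, two samples, two classes (made-up probabilities).
probas = np.array([[[0.9, 0.1], [0.4, 0.6]],
                   [[0.2, 0.8], [0.4, 0.6]],
                   [[0.3, 0.7], [0.9, 0.1]]])
accuracies = [0.9, 0.5, 0.5]  # hypothetical per-tree accuracies used as weights
print(average_vote(probas))               # [1 0]
print(weighted_vote(probas, accuracies))  # [0 0]
```

In the example the first sample flips from class 1 to class 0 once the most accurate tree is given more weight, which is the effect weighted voting is meant to capture.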


In [None]:
# Apply the individual trained trees to the testing set
# Note: You should have saved the feature sets used for training the individual trees,
# so that the same features can be chosen in the testing set

# Get predictions on testing set
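A sketch of applying each tree to the testing set with the feature subset saved at training time. It assumes the forest is stored as a list of `(tree, cols)` pairs, which is a hypothetical layout. The data is synthetic, and the stand-in forest uses five trees with eight random columns each.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def predict_each_tree(forest, X):
    """Return an (n_trees, n_samples) array; each tree sees only its own columns."""
    return np.array([tree.predict(X[:, cols]) for tree, cols in forest])


# Synthetic stand-in for the preprocessed data.
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stand-in forest: five trees, each fitted on eight randomly chosen columns.
rng = np.random.default_rng(0)
forest = []
for _ in range(5):
    cols = rng.choice(X.shape[1], size=8, replace=False)
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[:, cols], y_train)
    forest.append((tree, cols))

test_preds = predict_each_tree(forest, X_test)
print(test_preds.shape)  # (5, 60)
```

The resulting array can be passed to whichever voting scheme is being used to get one prediction per test sample.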

In [None]:
# Evaluate the results using accuracy, precision, recall and f-measure
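The four metrics can be computed with `sklearn.metrics`. The labels below are made up for illustration, and macro averaging (an unweighted mean over classes) is an assumption. `average="weighted"` is an alternative when the classes are imbalanced.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_recall_fscore_support)

# Made-up true and predicted labels for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")

# Per-class breakdown of the same metrics.
print(classification_report(y_true, y_pred))
```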



In [None]:
# Compare different voting mechanisms and their accuracies



In [None]:
# Compare the Random forest models with different number of trees N



In [None]:
# Compare different values for minimum number of features needed for individual trees




# Part 2: Random Forest using Sklearn

In [None]:
# Use the preprocessed dataset here



In [None]:
# Train the Random Forest model using scikit-learn's built-in RandomForestClassifier
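A minimal training sketch with `RandomForestClassifier`. The synthetic data stands in for the preprocessed vehicle features, and `n_estimators=100` is simply the library default written out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data.
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fit a forest of 100 trees; random_state makes the run repeatable.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(len(model.estimators_), "trees trained")
```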



In [None]:
# Test the model with testing set and print the accuracy, precision, recall and f-measure
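Evaluation on the held-out set could look like this, again on synthetic stand-in data. `classification_report` prints precision, recall and F-measure per class, and `accuracy_score` gives the overall accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data.
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Predict on the testing set and summarise the metrics.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f"accuracy: {accuracy:.3f}")
print(report)
```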



In [None]:
# Play with parameters such as
# number of decision trees
# Criterion for splitting
# Max depth
# Minimum samples per split and leaf
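One way to explore these parameters systematically is a small grid search with `GridSearchCV`. The grid values below are arbitrary starting points rather than recommendations, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data.
X, y = make_classification(n_samples=300, n_features=18, n_informative=8,
                           n_classes=3, random_state=0)

# Arbitrary example values for the parameters listed above.
param_grid = {
    "n_estimators": [20, 50],         # number of decision trees
    "criterion": ["gini", "entropy"], # criterion for splitting
    "max_depth": [None, 5],           # None lets the trees grow fully
    "min_samples_split": [2, 10],     # minimum samples needed to split a node
    "min_samples_leaf": [1, 5],       # minimum samples required in a leaf
}

# 3-fold cross-validated search, scored by accuracy.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=3, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

The search fits one model per parameter combination and fold, so the grid is best kept small at first and widened around the best values.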