# Data Science for the Automotive Industry: First practical session - ML

In this session, we will dive into an example of classic machine learning models, trying to extract the most out of a simple and small dataset using simple but powerful methods.

We will explore a dataset of car sales for different brands and models in the USA. We will follow the order below:
1. Loading the dataset from Google Drive
2. Exploratory data analysis
3. Unsupervised learning
4. Supervised learning 

<!-- This data has been downloaded from [kaggle.com](https://www.kaggle.com/) and it can be found using [this link](https://www.kaggle.com/gagandeep16/car-sales) for car_sales,[this link](https://www.kaggle.com/smritisingh1997/car-salescsv) for car_sales_2 and [this link](https://www.kaggle.com/sachinsachin/car-sales?select=Car+Sales.xlsx) for car_sales_3. -->

Developed by Nicolas Gutierrez in December 2021.


## Importing required libraries
It is good practice to import the required libraries at the top of the code. Doing so also gives hints about what the code below will do, just from the kinds of libraries imported.

In [None]:
### Do not modify this cell, not an exercise

# Files
import glob

# Data loading and manipulation
import pandas as pd

# Numeric operations
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics, svm, tree
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.cluster import KMeans

# Representation
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.gridspec import GridSpec

## Loading the dataset from Google Drive
A convenient way to work with a dataset in Google Colab is to load it from the Google Drive folder where the notebook is stored. The following cell mounts your Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Once Google Drive is mounted and access is granted, we can use the glob library to check the directory.

In [None]:
### Exercise 1: Locate the folder with the datasets in your Google Drive

# Modify the following line
list_of_files = glob.glob('/content/drive/MyDrive/.../*')
#

print(list_of_files)
## The result of this print should be 3 paths ending in */car_sales_3.csv, */car_sales_2.csv, */car_sales.csv

From the previous list, we can use pandas to load the CSV file car_sales.csv into memory.

In [None]:
### Exercise 2: Load the data from the file car_sales.csv using pandas. Find the pandas function that loads a CSV file

# Modify the following line
car_sales = 
#

print(f"The number of rows in the file is: {len(car_sales)}")
## The result should be 157

## Exploratory Data Analysis
In this section, we will check the data to see which features, models and brands are included. Additionally, we will check the distribution of values and get a feeling for the statistical parameters, the extremes and the relationships between variables.

In [None]:
### Exercise 3: Find a way of showing the first or last rows of a pandas dataframe

# Insert the command here

#

## The result should be a table in which the first column is Manufacturer

In [None]:
### Exercise 4: Find a method of your dataframe car_sales that shows a summary or description of the content of its columns. Use the option of the method that describes all the data

# Insert the command here

#

## The result should be a table with Manufacturer in the first column and some handy statistical values in the rows.

In [None]:
### Exercise 5: Let's describe the data we have in a bit more detail

# Using your car sales dataframe complete the following lines
number_of_data_rows = 
number_of_data_columns = 
data_columns_as_a_list = 

manufacturers_list = 
models_list = 
vehicle_type_list = 

total_car_sales = 

most_successful_brand = 
percentage_of_sales_of_most_successful_brand = 
#

## 157 rows and 16 columns
print(f"The dataset has {number_of_data_rows} rows and {number_of_data_columns} columns\n")
## ['Manufacturer', 'Model', 'Sales_in_thousands', '__year_resale_value', ...
print(f"Column names are {data_columns_as_a_list}\n")
## 30 and 156
print(f"{len(manufacturers_list)} different brands and {len(models_list)} models are considered\n")
## ['Passenger' 'Car']
print(f"Following vehicle types are included: {vehicle_type_list}\n")
## 8320698.0
print(f"Total amount of sales: {total_car_sales}\n")
## Ford with 24.31% of the sales
print(f"Most successful brand (by sales): {most_successful_brand}, "
f"with {percentage_of_sales_of_most_successful_brand:.2f}% of the sales")

### What variables do you see as possible "inputs" and "outputs"?

In [None]:
### Exercise 6: Plot the car sales of every manufacturer so we can have a view of the most successful manufacturers

# Don't call matplotlib directly for this; use the plotting interface of pandas. This should be done in one line

#
plt.title("Sales by manufacturer")
plt.ylabel("Sales [1000s of units]")

## The output should be a bar plot ordered from lowest to highest with the manufacturers on the x axis and the sales on the y axis.

In [None]:
### Exercise 7: Plot the car sales, but now based on the models.

# Use the option figsize=(20, 5) to stretch the graph and see it better

#

plt.title("Sales by Model")
plt.ylabel("Sales [1000s of units]")

## The output should be a bar plot ordered from lowest to highest with the models on the x axis and the sales on the y axis

In [None]:
### Do not modify this cell, not an exercise

# Let's plot histograms of all variables as this will help us understand how the values are distributed
def plot_histograms(vars, xlabels, title=None):
  ncols = len(vars)
  fig, ax = plt.subplots(ncols=ncols)
  plt.ylabel('Frequency [-]')
  if title:
    plt.suptitle(title)
  for i in range(len(vars)):
    average = car_sales[vars[i]].mean()
    median_value = car_sales[vars[i]].median()
    car_sales[vars[i]].plot.hist(ax=ax[i], sharey=True, figsize=(18,5))
    ax[i].set_title("Distribution of \n" + vars[i])
    ax[i].set_xlabel(xlabels[i])
    ax[i].axvline(x=average, color='red', zorder=1)
    ax[i].axvline(x=median_value, color='black', zorder=2)

In [None]:
### Exercise 8: Use the function above to plot the 4 most relevant variables

# Modify the following lines
variables_to_plot = 
xlabel = 
title = 
plot_histograms(variables_to_plot, xlabel, title)
#

### What do you think are the most relevant variables?
### What do the red and black lines mean?
### What can you infer from them being close together or far apart?

In [None]:
### Exercise 9: Use the plot_histograms function to plot the physical characteristics of the cars

# Include your lines here

#

## Feature engineering
New features can be created as combinations of existing ones when this enriches the model. For example, in this dataset we have width and length. Do you believe having the size will make a difference?

In [None]:
### Exercise 10: Feature engineering

# Create a column called Size as a relevant combination of Width and Length
car_sales['Size'] = 
#

## Correlation matrix
The Pearson correlation coefficient is a widely used indicator of the linear relationship
between two variables. It is very common in data science,
as knowing the relationship between different variables is extremely useful
for creating models. The higher the absolute value, the stronger the
linear relationship. The Pearson coefficient is bounded between -1 and 1.
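
As a quick illustration of the definition, the sketch below computes the Pearson coefficient by hand with NumPy on made-up numbers. The arrays and the helper name `pearson_r` are hypothetical and unrelated to the car sales data.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson coefficient: covariance divided by the product of the standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    return (a_centered * b_centered).sum() / np.sqrt(
        (a_centered ** 2).sum() * (b_centered ** 2).sum()
    )

# Made-up toy data, not taken from car_sales
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_linear = 2.0 * x + 1.0      # perfectly linear, increasing
y_inverse = -3.0 * x + 10.0   # perfectly linear, decreasing
y_curved = (x - 3.0) ** 2     # symmetric parabola: depends on x, but not linearly

r_linear = pearson_r(x, y_linear)
r_inverse = pearson_r(x, y_inverse)
r_curved = pearson_r(x, y_curved)
print(r_linear, r_inverse, r_curved)
```

The parabola case is worth noting: its coefficient is essentially zero even though the variable is fully determined by `x`, so a low Pearson value only rules out a linear relationship.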

In [None]:
### Exercise 10b: Pearson correlation matrix

# Calculate the corr_matrix in the following line
corr_matrix = 
#

matfig = plt.figure(figsize=2*np.array([6.4, 4.8]))
plt.matshow(np.abs(corr_matrix), cmap=cm.RdYlGn, fignum=matfig.number)
plt.xticks(np.arange(0, len(corr_matrix.columns)), corr_matrix.columns.to_list(), rotation=90)
plt.yticks(np.arange(0, len(corr_matrix.columns)), corr_matrix.columns.to_list())
plt.colorbar()

## Check the output of this cell. Is there anything that catches your attention?

## Unsupervised learning
In this section we will try to check if we can find any hidden relationship in the data by means of clustering. We will use [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and the famous [elbow method](https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/) to detect the optimal number of clusters.

We will try to look for an example like [this](https://datatofish.com/k-means-clustering-python/), where clusters are clearly defined. 

In [None]:
### Do not modify this cell, not an exercise
def plot_couple_of_variables(dataset, x_name, y_name):
  x_variable = dataset[x_name]
  y_variable = dataset[y_name]

  fig = plt.figure()
  gs = GridSpec(4, 4)

  ax_scatter = fig.add_subplot(gs[1:4, 0:3])
  ax_hist_x = fig.add_subplot(gs[0,0:3])
  ax_hist_y = fig.add_subplot(gs[1:4, 3])

  ax_scatter.scatter(x_variable, y_variable)
  ax_scatter.set_xlabel(x_name)
  ax_scatter.set_ylabel(y_name)

  ax_hist_x.hist(x_variable)
  ax_hist_y.hist(y_variable, orientation = 'horizontal')

In [None]:
### Exercise 11: Select two relevant variables from the car_sales dataset

# Modify the following lines
x_name = ""
y_name = ""
#

plot_couple_of_variables(car_sales, x_name, y_name)

In [None]:
### Exercise 12: Prepare the dataset

# First, remove the NaN values from the dataset in the following line
car_sales_nandropped = 
#

X = np.array(list(zip(car_sales_nandropped[x_name], 
                      car_sales_nandropped[y_name])))

In [None]:
### Exercise 13: Fit a KMeans as an unsupervised learning method

# Modify the number of clusters and check the results
number_of_clusters = 

kmeanModel = 
#

In [None]:
### Exercise 14: Predict the data in X using kmeanModel object

# Modify the following line
results = 
#

In [None]:
### Do not modify this cell, not an exercise
print(results)
for i in range(number_of_clusters):
  points_belonging_to_i_cluster = results == i
  plt.scatter(car_sales_nandropped[x_name].to_numpy()[points_belonging_to_i_cluster], 
              car_sales_nandropped[y_name].to_numpy()[points_belonging_to_i_cluster], 
              label=f"Cluster {i}")
plt.legend()
plt.xlabel(x_name)
plt.ylabel(y_name)

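As you may have seen already, the right number of clusters is not obvious, and choosing it by eye is somewhat subjective. Data science tries not to rely on opinions, and a common heuristic for this case is the elbow method: plot a measure of cluster spread against the number of clusters and look for the point where adding more clusters stops giving a clear improvement.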
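
To make the idea concrete, here is a minimal NumPy sketch of the quantity the elbow method tracks: the average distance from each point to its nearest cluster center. The toy points and the hand-picked centers are assumptions made for illustration; in practice the centers come from fitting a clustering model for each candidate number of clusters.

```python
import numpy as np

def distortion(points, centers):
    """Average Euclidean distance from each point to its nearest center."""
    # Pairwise distances, shape (n_points, n_centers)
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.min(axis=1).mean()

# Toy data: two tight groups of points around (0, 0) and (10, 10)
group_a = np.array([[0.0, 1.0], [1.0, 0.0], [0.0, -1.0], [-1.0, 0.0]])
group_b = group_a + 10.0
points = np.vstack([group_a, group_b])

# Hand-picked centers standing in for the result of a clustering fit
centers_k1 = points.mean(axis=0, keepdims=True)     # one center for everything
centers_k2 = np.array([[0.0, 0.0], [10.0, 10.0]])   # one center per group

distortion_k1 = distortion(points, centers_k1)
distortion_k2 = distortion(points, centers_k2)
print(distortion_k1, distortion_k2)
```

Going from one center to two cuts the distortion sharply here. Once every real group has its own center, extra centers bring only small further reductions, which is what produces the "elbow" in the curve.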

In [None]:
### Exercise 15: Create a function to plot a graph for the elbow method

def elbow_method(X):
  distortions = []
  K = range(1, 10)
  
  for k in K:
      # Complete the following lines to define the KMeans method knowing that 
      # in every iteration we want to use a k number of clusters and then fit
      # to X
      kmeanModel = 
      
      #

      ## The distortion is the average distance from every point to its
      ## nearest cluster center
      distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                          'euclidean'), axis=1)) / X.shape[0])
    
  plt.plot(K, distortions, 'bx-')
  plt.xlabel('Values of K')
  plt.ylabel('Distortion')
  plt.title('The Elbow Method using Distortion')
  plt.show()

In [None]:
### Exercise 16: Use the previous function to plot a figure with the elbow method

# Fill the following line

#

## Supervised Learning I: Price prediction

In this section we will try to create a model that helps us understand the behaviour of customers based on previous data.

Wouldn't it be cool if we could predict how much someone would pay for a car with certain characteristics?

### Features/labels selection
Let's first select which variables will be the inputs (features) and outputs (labels) of our supervised learning problem.

In [None]:
### Exercise 17: Prepare the data to be used in a supervised learning method

# Select the input and output variable names, input is a list, output is a single one
input_variables = ['']
output_variable = ''
#

# Preparation of the data
input_variables.append(output_variable)

X = car_sales[input_variables].dropna(axis=0)

y = X[output_variable]
X.drop([output_variable], axis=1, inplace=True)

# Showing the result to the user
print(f"{len(X.columns)} input variables: {X.columns}")
print(f"Output variable is: {y.name}")

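### Normalization/Standardization of variables
Data is usually normalized or standardised before being fed to a machine learning model. The purpose is to prevent the variables with larger magnitudes from dominating the model. This matters most for scale-sensitive methods such as support vector machines, while decision trees are largely unaffected.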
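
The two rescalings can be written out by hand in a few lines of NumPy. This is only a sketch on a made-up matrix (`X_toy` and the other names are hypothetical), meant to show the arithmetic behind each option.

```python
import numpy as np

# Made-up feature matrix: column 0 in the thousands, column 1 close to one
X_toy = np.array([[1000.0, 0.5],
                  [2000.0, 1.5],
                  [4000.0, 1.0]])

# Normalization (min-max): rescale every column to the range [0, 1]
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_toy_minmax = (X_toy - col_min) / (col_max - col_min)

# Standardization (z-score): zero mean and unit standard deviation per column
col_mean = X_toy.mean(axis=0)
col_std = X_toy.std(axis=0)   # population standard deviation (ddof=0)
X_toy_zscore = (X_toy - col_mean) / col_std

print(X_toy_minmax)
print(X_toy_zscore)
```

After either transformation both columns live on a comparable scale, so the column measured in thousands no longer outweighs the other one purely because of its units.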

In [None]:
### Exercise 18: Normalization

# Find a scaler to normalise the data based on min and max (Have a look at sklearn.preprocessing)
scaler_x = 
scaler_y = 
#

scaler_x.fit(X)
scaler_y.fit(y.to_numpy().reshape(-1,1))

X_norm = scaler_x.transform(X)
y_norm = scaler_y.transform(y.to_numpy().reshape(-1,1))

In [None]:
### Exercise 19: Standardization

# Find a scaler to standardise the data based on average and standard deviation
standard_x = 
standard_y = 
#

standard_x.fit(X)
standard_y.fit(y.to_numpy().reshape(-1,1))

X_standard = standard_x.transform(X)
y_standard = standard_y.transform(y.to_numpy().reshape(-1,1))

### Train/test split
The next step is splitting the data into train and test datasets. This is done to evaluate the models fairly. The models are trained on the train data, where they will normally perform very well. The test dataset is used to evaluate how the model performs on data it has never seen before, that is, how well it generalises.
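
Conceptually, a split is just a random shuffle of the row indices followed by a cut. The sketch below does this by hand with NumPy on made-up arrays; all names are hypothetical, and a library routine would normally do this for you.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # fixed seed so the split is reproducible

# Made-up data: 10 samples with 2 features each and one label per sample
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

test_fraction = 0.3
n_test = int(round(test_fraction * len(X_toy)))

# Shuffle the row indices, then cut them into two disjoint groups
shuffled = rng.permutation(len(X_toy))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

X_toy_train, y_toy_train = X_toy[train_idx], y_toy[train_idx]
X_toy_test, y_toy_test = X_toy[test_idx], y_toy[test_idx]

print(X_toy_train.shape, X_toy_test.shape)
```

The important properties are that features and labels are indexed with the same indices, so they stay aligned, and that no row appears in both groups.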

In [None]:
### Exercise 20: Find a method to split into train and test

# Modify the method name in the following lines
# Non processed data
X_np_train, X_np_test, y_np_train, y_np_test = \
METHODNAME(X.to_numpy(), y.to_numpy().reshape(-1,1), test_size=0.3, random_state=1)
#

# Modify the method name in the following lines
# Normalization
X_norm_train, X_norm_test, y_norm_train, y_norm_test = \
METHODNAME(X_norm, y_norm, test_size=0.3, random_state=1)
#

# Modify the method name in the following lines
# Standardization
X_standard_train, X_standard_test, y_standard_train, y_standard_test = \
METHODNAME(X_standard, y_standard, test_size=0.3, random_state=1)
#

### Fitting Models

In [None]:
### Exercise 21: Fit models to the data

def fit_models(features_train, labels_train):
  # Look for a linear regressor
  reg = 
  #
  reg.fit(features_train, labels_train)
  
  # Look for Support vector machines, use the 'linear' kernel
  svm_linear = 
  #
  svm_linear.fit(features_train, labels_train.ravel())
  

  # Look for Support vector machines, use the 'poly' kernel
  svm_poly = 
  #
  svm_poly.fit(features_train, labels_train.ravel())
  

  # Look for Support vector machines, use the 'rbf' kernel
  svm_rbf = 
  #
  svm_rbf.fit(features_train, labels_train.ravel())
  
  # Look for regressor based on decision trees
  clf = 
  #
  clf.fit(features_train, labels_train.ravel())
  
  return reg, svm_linear, svm_poly, svm_rbf, clf

In [None]:
### Modify this cell if you want to use normalised or standardised features

features_train = X_np_train
labels_train = y_np_train
features_test = X_np_test
labels_test = y_np_test

reg, svm_linear, svm_poly, svm_rbf, clf = fit_models(features_train, labels_train)

### Evaluation of the models

In [None]:
### Exercise 21b: Evaluate the models

def evaluation_of_methods(methods_vector, methods_name, 
                          features_train, features_test, 
                          labels_train, labels_test):
  evaluation = pd.DataFrame(columns=['method', 'r2_train', 'r2_test', 
                                     'MSE_train', 'MSE_test', 
                                     'MAE_train', 'MAE_test'])
  for i in range(len(methods_vector)):
    print(f"method name:{methods_name[i]}")
    results_dict = dict()
    results_dict['method'] = methods_name[i]
    
    y_hat_train = methods_vector[i].predict(features_train)
    y_hat_test = methods_vector[i].predict(features_test)

    # Look for suitable methods to calculate the following metrics
    results_dict['r2_train'] = XXXX(labels_train, y_hat_train)
    results_dict['r2_test'] = XXXX(labels_test, y_hat_test)

    results_dict['MSE_train'] = XXXX(labels_train, y_hat_train)
    results_dict['MSE_test'] = XXXX(labels_test, y_hat_test)

    results_dict['MAE_train'] = XXXX(labels_train, y_hat_train)
    results_dict['MAE_test'] = XXXX(labels_test, y_hat_test)
    #

    method_results = pd.DataFrame([results_dict])
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    evaluation = pd.concat([evaluation, method_results], ignore_index=True)
    
  return evaluation

In [None]:
### Do not modify this cell, not an exercise

methods_vector = [reg, svm_linear, svm_poly, svm_rbf, clf]
methods_name = ['Linear Regressor', 'SVM Linear', 'SVM Poly', 'SVM RBF', 'Decision Tree']
evaluation = evaluation_of_methods(methods_vector, methods_name, 
                                   features_train, features_test,
                                   labels_train, labels_test)

In [None]:
### Do not modify this cell, not an exercise

evaluation

In [None]:
### Do not modify this cell, not an exercise

def plotting_methods(methods_vector, features_test, labels_test):
  nrows = len(methods_vector)
  fig, ax = plt.subplots(ncols=2, 
                        nrows=nrows, 
                        figsize=2.5*np.array([6.4, 4.8]), 
                        sharex='col')
  for i in range(nrows):
    # Prediction
    y_hat = methods_vector[i].predict(features_test).ravel()
    # Prediction vs real
    ax[i, 0].scatter(labels_test, y_hat)
    ax[i, 0].scatter(labels_test, labels_test)
    ax[i, 0].set_ylabel(f'Predictions {methods_name[i]}')
    ax[i, 0].set_xlabel('Real values')
    ax[i, 1].hist(labels_test.ravel()-y_hat)
    ax[i, 1].set_xlabel('Error')

In [None]:
### Do not modify this cell, not an exercise

plotting_methods(methods_vector, features_test, labels_test)

## Have you tried fitting the models with normalised or standardised features?

## Supervised Learning II: Sales Prediction

The marketing department of a new brand has contacted our data science group with an interest in sales prediction. Based on the dataset we have, we will try to create a model that can predict the number of sales of a car model given its characteristics.

### Features/labels preparation
The manufacturer is likely to be an informative variable in this study, so we need a way of including it in the model. The problem is that the manufacturer is a string. How can we include a string of this type in a data science model?

One technique is one hot encoding: we create one column per manufacturer and fill it with a 1 when the row belongs to that manufacturer and a 0 otherwise.
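
The encoding itself is easy to build by hand, which helps to see what the resulting columns contain. The sketch below uses NumPy and made-up brand names (hypothetical, not taken from the dataset).

```python
import numpy as np

# Made-up brand names, not taken from car_sales
brands = np.array(["Alfa", "Beta", "Alfa", "Gamma"])

# categories: sorted unique names; codes: index of the category of each row
categories, codes = np.unique(brands, return_inverse=True)

# Row i of the identity matrix is the one hot vector for category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, so summing all the entries gives the number of rows. This is the same sanity check used in the exercise that follows.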

In [None]:
### Exercise 21c: One hot encoding for the manufacturers

# Create the manufacturers one hot encoding using a method from pandas 
manufacturer_one_hot = 
#

## Quick check: The result should be 157
print(np.sum(manufacturer_one_hot.sum(axis=1)))
print(len(manufacturer_one_hot))

In [None]:
### Exercise 22: Clean the data frame by removing some columns
### Remove columns ['Manufacturer', 'Model', '__year_resale_value', 'Vehicle_type', 'Latest_Launch']

# Find a way in pandas to remove columns from a dataframe
car_sales_for_sales = 
#

## Quick check with the remaining columns
print(car_sales_for_sales.columns)

In [None]:
### Exercise 23: Include the manufacturers into the car_sales_for_sales data frame

# Find a way in pandas to include these new columns into the dataframe

#

## Quick check, you should see now columns with the name of the manufacturer and the one hot encoding
car_sales_for_sales.head(3)

In [None]:
### Exercise 24: Remove any entry that contains a NaN (not a number)

# Find a way in pandas to remove rows that contain NaN values
car_sales_ii_ready = 
#

### Features/labels selection

In [None]:
### Exercise 25: Select the inputs and outputs for this problem

# Include relevant input variables as list (don't forget to include the manufacturer names)
input_variables = []
#

# Include the name of the output variable
output_variable = 
#

In [None]:
### Do not modify this cell, not an exercise

input_variables.append(output_variable)

X_ii = car_sales_ii_ready[input_variables].dropna(axis=0)

y_ii = X_ii[output_variable]
X_ii.drop([output_variable], axis=1, inplace=True)

## Quick check the data
print(f"{len(X_ii.columns)} input variables: {X_ii.columns}")
print(f"Output variable is: {y_ii.name}")

### Standardization of variables

In [None]:
### Exercise 26: Standardise the data

# Instantiate the standardiser
standard_x_ii = 
standard_y_ii = 
#

standard_x_ii.fit(X_ii)
standard_y_ii.fit(y_ii.to_numpy().reshape(-1,1))
X_standard_ii = standard_x_ii.transform(X_ii)
y_standard_ii = standard_y_ii.transform(y_ii.to_numpy().reshape(-1,1))

### Train/test split

In [None]:
### Exercise 27: Split the dataset into training/test

# Modify the method name in the following lines
X_standard_train_ii, X_standard_test_ii, y_standard_train_ii, y_standard_test_ii = \
METHOD_NAME(X_standard_ii, y_standard_ii, test_size=0.3, random_state=1)
#

### PCA
PCA (principal component analysis) is a well-known dimensionality reduction technique that reduces the size of the problem by compressing the information in the features, at the cost of losing some of that information. The more PCA components we keep, the less information we lose.

PCA is useful when we have problems with many features and not a huge amount of data.
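
To see where the "information kept" figure comes from, the sketch below computes the explained variance ratio by hand with a singular value decomposition in NumPy. The data is randomly generated for illustration and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up data: 200 samples, 3 features, where feature 2 is almost a copy of feature 0
base = rng.normal(size=(200, 2))
X_toy = np.column_stack([base[:, 0],
                         base[:, 1],
                         base[:, 0] + 0.05 * rng.normal(size=200)])

# Centre the data, then take the singular values of the centred matrix
X_centered = X_toy - X_toy.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Variance captured by each principal component, as a fraction of the total
explained_variance_ratio = singular_values ** 2 / np.sum(singular_values ** 2)
cumulative = np.cumsum(explained_variance_ratio)

print(explained_variance_ratio)
print(cumulative)
```

Because the third feature is nearly redundant, two components already retain almost all of the variance. A plateau of this kind in the cumulative curve is a reasonable guide when choosing the number of components.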

In [None]:
### Exercise 27b: Plot the PCA explained variance ratio as a function of the number of components

var_exp = []
for i in range(1, len(input_variables)):
  # Find the pca method and use i as number of components and fit it to X_standard_train_ii
  pca = 

  #

  # Find the variance ratio and append to the var_exp list the sum.
  var_exp.append()
  #

plt.figure()
plt.plot(np.arange(1,len(input_variables)), var_exp)

In [None]:
### Exercise 28: Select a suitable number of components and fit the PCA

# Based on the previous results select a number of components and fit the pca to X_standard_train_ii
pca = 

#

features_train = pca.transform(X_standard_train_ii)
labels_train = y_standard_train_ii
features_test = pca.transform(X_standard_test_ii)
labels_test = y_standard_test_ii

### Fitting models

In [None]:
### Exercise 29: Fit models to the data

# Check the previous section and fit a linear regressor, support vector machines and a decision tree to the data

#

### Evaluation of models

In [None]:
### Do not modify this cell, not an exercise
methods_vector = [reg, svm_linear, svm_poly, svm_rbf, clf]
methods_name = ['Linear Regressor', 'SVM Linear', 'SVM Poly', 'SVM RBF', 'Decision Tree']
evaluation = evaluation_of_methods(methods_vector, methods_name, 
                                   features_train, features_test,
                                   labels_train, labels_test)

In [None]:
### Do not modify this cell, not an exercise

evaluation

In [None]:
### Do not modify this cell, not an exercise

plotting_methods(methods_vector, features_test, labels_test)

In [None]:
### Do not modify this cell, not an exercise

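# Convert the standardised errors back to the original units, assuming standard_y_ii
# is a StandardScaler. Errors are differences, so they are only rescaled by the
# standard deviation; inverse_transform would also add the mean back.
real_errors = (labels_test.ravel() - svm_poly.predict(features_test).ravel()) * standard_y_ii.scale_[0]
print(f"Real errors {real_errors}")
print(f"Real error average  {np.mean(real_errors)}")
print(f"Real error standard deviation  {np.std(real_errors)}")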

## Conclusions and takeaways
- Even with a small dataset, many insights can be unveiled.
- The bigger and the richer the data set, the greater the findings.
- Often, a simple exploratory data analysis alone gives us a lot of information.
- With unsupervised learning we can detect hidden patterns in the data.
- With supervised learning we can create models to predict variables for new entries.