# Data Science for the Automotive Industry - 1st and 2nd practical session - ML

In this session, we will work through an example of classic machine learning, trying to get the most out of a small, simple dataset using simple but powerful methods.

We will explore a dataset with car sales of different brands and models in the USA. We will follow the order below:
1. Import of required libraries
2. Loading the dataset from Google Drive
3. Exploratory data analysis
4. Unsupervised learning
5. Supervised learning
6. Conclusions and takeaways

<!-- This data has been downloaded from [kaggle.com](https://www.kaggle.com/) and it can be found using [this link](https://www.kaggle.com/gagandeep16/car-sales) for car_sales,[this link](https://www.kaggle.com/smritisingh1997/car-salescsv) for car_sales_2 and [this link](https://www.kaggle.com/sachinsachin/car-sales?select=Car+Sales.xlsx) for car_sales_3. -->

Developed by Nicolas Gutierrez.


## 1 - Importing required libraries
It is good practice to load the libraries required by the code at the start of it. Doing so also gives some hints about what the code below will do, just by checking which libraries are imported.

In [None]:
### Do not modify this cell, not an exercise

# Files
import glob
# Data loading and manipulation
import pandas as pd
# Numeric operations
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics, svm, tree
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.cluster import KMeans
# Representation
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.gridspec import GridSpec

## 2 - Loading the dataset from Google Drive
A convenient way to work with a dataset in Google Colab is to load it from Google Drive, from the same folder where the notebook is stored. The following cell mounts the drive.

In [None]:
### Do not modify this cell, not an exercise

from google.colab import drive
drive.mount('/content/drive')

Once Google Drive is mounted and access is granted, we can use the glob library to list the contents of the directory.

#### Ex 1: Locate the data

In [None]:
# Modify/Complete the following line
list_of_files = glob.glob('/content/drive/MyDrive/*')
#

print(list_of_files)
## The result of this print should be 3 paths ending in */car_sales_3.xlsx, */car_sales_2.csv, */car_sales.csv

From the previous list, we can use pandas to load the CSV file car_sales.csv into memory.

#### Ex 2: Load the data

In [None]:
#### Exercise 2: Load the data from the file car_sales.csv using pandas, find the function that loads a csv with pandas

# Modify the following line
car_sales = 
#

print(f"The number of rows of the file are: {len(car_sales)}")
## The result should be 157

## 3 - Exploratory Data Analysis
In this section, we will check the data to see which features, models and brands are included. Additionally, we will check the distribution of values and get a feeling for the statistical parameters, the extremes and the relationships between variables.

### DataFrame description
Pandas has several convenient functions for describing the data in the DataFrame in a very simple way.
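
As a minimal sketch of some of these helpers, the following snippet uses a small made-up table. The `toy` DataFrame and its column names are assumptions for illustration only and are not part of the car sales data.

```python
import pandas as pd

# Hypothetical toy table, used only to illustrate the pandas helpers
toy = pd.DataFrame({
    "brand": ["A", "A", "B", "C"],
    "price": [10.0, 12.5, 30.0, 22.0],
    "doors": [3, 5, 5, 4],
})

first_rows = toy.head(2)               # first two rows of the table
summary = toy.describe(include="all")  # statistics for numeric and text columns
n_rows, n_cols = toy.shape             # size of the table
n_brands = toy["brand"].nunique()      # number of distinct brands

print(first_rows)
print(summary)
print(f"{n_rows} rows, {n_cols} columns, {n_brands} brands")
```

The same calls should work on any DataFrame, including the one loaded in the previous section.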

#### Ex 3: Check first rows

In [None]:
### Exercise 3: Find a way of showing the first or last rows of a pandas dataframe

# Print 5 first values of the pandas dataframe

#

# * Curb_weight is the weight of the vehicle including a full tank of fuel and all standard equipment
## The result should be a table in which the first column is Manufacturer

#### Ex 4: Show a summary

In [None]:
### Exercise 4: Find a method in your data frame car_sales that shows you a summary or a 
###   description of the content of the columns of the dataframe, Use the option of the 
###   method to describe all the data

# Insert the command here

#

## The result should be a table with Manufacturer in the first column and some handy statistical values in the rows.

#### Ex 5: Describe the data further

In [None]:
### Exercise 5: Let's describe a bit more the data we have

# Check the size of the data set as how many rows and columns it has
print(f"The dataset has {} rows and {} columns\n")

# Check how many and which ones are the columns of the dataset
print(f"Column names are {}\n")

# How many different brands and models are included in the study
print(f"{len()} different brands and {len()} models are considered\n")

# How many types of car
print(f"Following vehicle types are included: {}\n")

# What is the amount of car sales according to this dataset? Are those values sensible? How would you know?
print(f"Total amount of sales: {}\n")

# What is the most successful brand of all and what is the market share of this brand?
print(f"Most successful brand (by sales): {}, "
f"with {}% of the sales")

# What variables do you see as possible "inputs" and "outputs"?

### Bar and distribution plots
Pandas allows plotting directly from a DataFrame, without calling a dedicated plotting library explicitly. This feature is very convenient for producing simple plots.

#### Ex 6: Bar plot by Manufacturer

In [None]:
### Exercise 6: Plot the car sales of every manufacturer so we can have a view of the most successful manufacturers

# Don't call matplotlib for this; plot directly from pandas. This should be done in one line
car_sales.
#

plt.title("Sales by manufacturer")
plt.ylabel("Sales [1000s of units]")

#### Ex 7: Bar plot by Model

In [None]:
### Exercise 7: Plot the car sales, but now based on the models.

# Use the option figsize=(20, 5) to stretch the graph and see it better
car_sales
#

plt.title("Sales by Model")
plt.ylabel("Sales [1000s of units]")

#### Ex 8: Plot Distribution of relevant variables

In [None]:
### Do not modify this cell, not an exercise

# Let's plot histograms of all variables as this will help us understand how the values are distributed
def plot_histograms(vars, xlabels, title=None):
  ncols = len(vars)
  fig, ax = plt.subplots(ncols=ncols)
  plt.ylabel('Frequency [-]')
  if title:
    plt.suptitle(title)
  for i in range(len(vars)):
    average = car_sales[vars[i]].mean()
    median_value = car_sales[vars[i]].median()
    car_sales[vars[i]].plot.hist(ax=ax[i], sharey=True, figsize=(18,5))
    ax[i].set_title("Distribution of \n" + vars[i])
    ax[i].set_xlabel(xlabels[i])
    ax[i].axvline(x=average, color='red', zorder=1)
    ax[i].axvline(x=median_value, color='black', zorder=2)

In [None]:
### Exercise 8: Use the function above to plot the 4 most relevant variables

# Modify the following lines
variables_to_plot = ['', '', '', '']
xlabel = ['', '', '', '']
title = ""
plot_histograms(variables_to_plot, xlabel, title)
#

### What do you think are the most relevant variables? 
### What do the red and black lines mean? 
### What can you extract from them being close together or separated?

#### Ex 9: Plot Physical characteristics

In [None]:
### Exercise 9: Use the plot_histograms function to plot the physical characteristics of the cars

# Include your lines here
variables_to_plot = ['', '', '', '']
xlabel = ["", "", "", ""]
title = ""
plot_histograms(variables_to_plot, xlabel, title)
#

### Feature Engineering
We will see a toy example of feature engineering.

New features can be created as a combination of existing ones when this enriches the model. For example, in this dataset we have width and length. Do you believe that having the size will make a difference?
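
To illustrate the idea on unrelated data, here is a minimal sketch that derives a new column from two existing ones. The `toy` table, its column names and the power-to-mass ratio are assumptions chosen for illustration; they are not columns of the car sales file.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "power_kw": [60.0, 90.0, 150.0],
    "mass_kg": [1000.0, 1200.0, 1500.0],
})

# New feature built from two existing columns: power-to-mass ratio
toy["power_to_mass"] = toy["power_kw"] / toy["mass_kg"]

print(toy)
```

Whether a derived feature helps depends on the model and on how strongly it relates to the target, which is something the correlation matrix can hint at.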

#### Ex 10: Create new features (if needed)

In [None]:
### Exercise 10: Feature engineering

# Create a column called Size as a relevant combination of Width and Length
car_sales['Size'] = 
#


### Correlation matrix
__Correlation__ is a statistical measure that indicates the extent to which two variables fluctuate in relation to each other. Correlation values lie between -1 and 1. __The higher the absolute value of the correlation, the stronger the linear relationship between the variables__. 
For a linear regression with a single input variable, the square of the Pearson correlation equals the r2 coefficient of the regression.

The correlation matrix should always be one of the first steps in a machine learning project, as it makes it easy to check which variables are related and which are not.
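
The relationship between the Pearson correlation and the r2 of a one-variable linear regression can be checked numerically. The sketch below uses synthetic data (an assumption for illustration, not the car sales data) and compares the squared correlation with the r2 score of the fitted regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative assumption): y depends linearly on x plus noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = -2.0 * x + rng.normal(scale=0.5, size=200)

# Pearson correlation between x and y (negative here, because of the slope)
r = np.corrcoef(x, y)[0, 1]

# r2 of a one-variable linear regression of y on x
model = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = model.score(x.reshape(-1, 1), y)

print(f"Pearson r = {r:.3f}, r squared = {r**2:.3f}, regression r2 = {r2:.3f}")
```

Note that the sign of the correlation is lost when it is squared, which is why the absolute value is the relevant quantity when comparing it with r2.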

#### Ex 11: Correlation matrix

In [None]:
### Exercise 11: Pearson correlation matrix

# Calculate the corr_matrix in the following line
corr_matrix = 
#

matfig = plt.figure(figsize=2*np.array([6.4, 4.8]))
plt.matshow(np.abs(corr_matrix), cmap=cm.RdYlGn, fignum=matfig.number)
plt.xticks(np.arange(0, len(corr_matrix.columns)), corr_matrix.columns.to_list(), rotation=90)
plt.yticks(np.arange(0, len(corr_matrix.columns)), corr_matrix.columns.to_list())
plt.colorbar()

## Check the output of this cell. Is there anything that catches your attention?

## 4 - Unsupervised learning
In this section we will try to check if we can find any hidden relationship in the data by means of clustering. We will use [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and the famous [elbow method](https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/) to detect the optimal number of clusters.

We will try to look for an example like [this](https://datatofish.com/k-means-clustering-python/), where clusters are clearly defined. 

### Plot pair of Variables

#### Ex 12: Scatter of pair of variables

In [None]:
### Do not modify this cell, not an exercise

def plot_couple_of_variables(dataframe, x_name, y_name):
  x_variable = dataframe[x_name]
  y_variable = dataframe[y_name]

  fig = plt.figure()
  gs = GridSpec(4, 4)

  ax_scatter = fig.add_subplot(gs[1:4, 0:3])
  ax_hist_x = fig.add_subplot(gs[0,0:3])
  ax_hist_y = fig.add_subplot(gs[1:4, 3])

  ax_scatter.scatter(x_variable, y_variable)
  ax_scatter.set_xlabel(x_name)
  ax_scatter.set_ylabel(y_name)

  ax_hist_x.hist(x_variable)
  ax_hist_y.hist(y_variable, orientation = 'horizontal')

In [None]:
### Exercise 12: Select two relevant variables from the car_sales dataset

# Modify the following lines
x_name = ""
y_name = ""
#

plot_couple_of_variables(car_sales, x_name, y_name)

### KMeans
[KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) is a clustering method based on an iterative refinement technique that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean ([read more](https://en.wikipedia.org/wiki/K-means_clustering#:~:text=k-means%20clustering%20is%20a,a%20prototype%20of%20the%20cluster.)).
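
As a rough sketch of how the method behaves, the snippet below runs KMeans on synthetic points that form two well separated groups. The data, the variable names and the choice of two clusters are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points (illustrative assumption): two well separated groups in 2D
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
points = np.vstack([group_a, group_b])

# Fit KMeans with two clusters; random_state fixes the initialisation
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
labels = kmeans.predict(points)

print(kmeans.cluster_centers_)  # one centre per cluster
print(labels[:5], labels[-5:])  # cluster index assigned to each point
```

On real data the groups are rarely this clean, so the result usually depends on the chosen number of clusters.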

#### Ex 13: Remove NaNs

In [None]:
### Exercise 13: Prepare the dataset

# First, remove the nan from the data set in the following line
car_sales_nandropped = 
#

X = np.array(list(zip(car_sales_nandropped[x_name], 
                      car_sales_nandropped[y_name])))

#### Ex 14: Fit KMeans

In [None]:
### Exercise 14: Fit a KMeans as an unsupervised learning method

# Modify the number of clusters and check the results
number_of_clusters = 
kmeanModel_fitted = 
#

#### Ex 15: Use KMeans

In [None]:
### Exercise 15: Predict the data in X using kmeanModel object

# Modify the following line
results = kmeanModel_fitted.
#

In [None]:
### Do not modify this cell, not an exercise

print(results)
for i in range(number_of_clusters):
  points_belonging_to_i_cluster = results == i
  plt.scatter(car_sales_nandropped[x_name].to_numpy()[points_belonging_to_i_cluster], 
              car_sales_nandropped[y_name].to_numpy()[points_belonging_to_i_cluster], 
              label=f"Cluster {i}")
plt.legend()
plt.xlabel(x_name)
plt.ylabel(y_name)

### Elbow Method
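The number of clusters passed to the KMeans algorithm is somewhat arbitrary and can be subjective. The elbow method reduces this arbitrariness: the model is fitted for a range of values of k, and the value after which adding more clusters stops improving the fit significantly (the "elbow" of the curve) is selected.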
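
The idea can be sketched on synthetic data with three separated groups (an assumption for illustration). This sketch tracks the `inertia_` attribute of the fitted model, that is, the sum of squared distances to the closest centre, which is a different but related measure to the distortion plotted by the function in the next cell.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (illustrative assumption): three separated groups in 2D
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
points = np.vstack([rng.normal(loc=c, scale=0.4, size=(40, 2)) for c in centres])

# Inertia = sum of squared distances of each point to its closest centre
inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

# Improvement obtained by adding one more cluster
drops = -np.diff(inertias)
for k, drop in zip(range(2, 7), drops):
    print(f"k={k}: inertia reduced by {drop:.1f}")
```

The reduction should be large up to the true number of groups and small afterwards, and that change of slope is the elbow.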

In [None]:
### Do not modify this cell, not an exercise

def elbow_method(X, x_name, y_name, x_variable, y_variable):
  distortions = []
  inertias = []
  mapping1 = {}
  mapping2 = {}
  K = range(1, 10)
  
  for k in K:
      # Building and fitting the model (a single call to fit is enough)
      kmeanModel = KMeans(n_clusters=k, n_init=10).fit(X)
  
      distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                          'euclidean'), axis=1)) / X.shape[0])
      inertias.append(kmeanModel.inertia_)
  
      mapping1[k] = sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                    'euclidean'), axis=1)) / X.shape[0]
      mapping2[k] = kmeanModel.inertia_
    
  plt.plot(K, distortions, 'bx-')
  plt.xlabel('Values of K')
  plt.ylabel('Distortion')
  plt.title('The Elbow Method using Distortion')
  plt.show()

In [None]:
### Do not modify this cell, not an exercise

elbow_method(X, x_name, y_name, car_sales_nandropped[x_name], car_sales_nandropped[y_name])

### PCA
[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) stands for __Principal Component Analysis__ and is commonly used as a technique for dimensionality reduction. PCA finds the principal components (the directions of largest variance) of the linear space defined by the data and transforms the data into that space. The result is a "compression" of the data, with a loss of information that depends on the number of components kept.

In this example we will use PCA as an aid to clustering: projecting the data onto two components lets us spot groups of similar cars and outliers visually.
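
The compression idea can be sketched with synthetic data in which one of three features is almost a combination of the other two. The data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (illustrative assumption): 3 features, where the third is
# almost a linear combination of the first two
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + b + rng.normal(scale=0.05, size=100)
data = np.column_stack([a, b, c])

# Project the 3 features onto the 2 directions of largest variance
pca = PCA(n_components=2)
projected = pca.fit_transform(data)

print(projected.shape)
print(pca.explained_variance_ratio_.sum())  # share of the variance that is kept
```

Because the third feature is nearly redundant here, two components keep almost all of the variance. With real data the loss is usually larger.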

#### Ex 16: Preparation for PCA

In [None]:
### Exercise 16: Preparation of the car_sales dataset for PCA

# Drop the columns that contains strings from the car sales data set
car_sales_string_dropped = car_sales.
# Drop the rows that contain nans (Reset the index as well)
car_sales_string_nandropped = car_sales_string_dropped
#

print(f"Original data frame rows: {len(car_sales)}, new dataframe rows: {len(car_sales_string_nandropped)}")

#### Ex 17: Fitting PCA

In [None]:
### Exercise 17: Fitting PCA

# Instantiate a PCA model with 2 components (it is difficult plotting more than 2)
pca_clustering = 
# Fit your PCA model on car_sales_string_nandropped and transform the data with it
car_sales_pca = pca_clustering.
#

print(f"The explained variance is {pca_clustering.explained_variance_}")
print(f"The explained variance ratio is {pca_clustering.explained_variance_ratio_}")

In [None]:
### Do not modify this cell, not an exercise

for i in range(car_sales_pca.shape[0]):
  plt.scatter(car_sales_pca[i][0], 
              car_sales_pca[i][1], 
              marker = f"${i}$")
plt.xlabel("PCA component 0")
plt.ylabel("PCA component 1")

Which models does the PCA place furthest from the rest? Can you spot why?

## 5I - Supervised Learning: Price prediction

In this section we will try to create a model that helps us understand the behaviour of customers based on previous data.

Wouldn't it be cool if we could predict how much someone would pay for a car with certain characteristics?

### Features/labels selection
Let's select first which ones will be the inputs (features) and outputs (labels) of our supervised learning problem.

#### Ex 18: Select variables

In [None]:
### Exercise 18: Prepare the data to be used in a supervised learning method

# Select the input and output variable names, input is a list, output is a single one
input_variables = ["",""]
output_variable = ''
#

In [None]:
### Do not modify this cell, not an exercise

input_variables.append(output_variable)

X = car_sales[input_variables].dropna(axis=0)

y = X[output_variable]
X.drop([output_variable], axis=1, inplace=True)

print(f"{len(X.columns)} input variables: {X.columns}")
print(f"Output variable is: {y.name}")

### Normalization/Standardization of variables
Data should usually be normalized or standardized before being fed to a machine learning model. The purpose is to avoid biasing the model towards the variables with larger magnitudes.
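
The difference between the two approaches can be seen on a single made-up column. The values below are an assumption for illustration; the scalers are the ones imported at the top of the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column with made-up values (illustrative assumption)
values = np.array([[10.0], [20.0], [30.0], [40.0]])

# Normalization: rescale to the [0, 1] range using the min and max
normalized = MinMaxScaler().fit_transform(values)

# Standardization: subtract the mean and divide by the standard deviation
standardized = StandardScaler().fit_transform(values)

print(normalized.ravel())
print(standardized.ravel())
```

After normalization the values lie between 0 and 1, while after standardization they have zero mean and unit standard deviation.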

#### Ex 19: Normalization

In [None]:
### Exercise 19: Normalization

# Find a scaler to normalise the data based on min and max (Have a look at sklearn.preprocessing)
scaler_x = 
scaler_y = 
#

In [None]:
### Do not modify this cell, not an exercise

scaler_x.fit(X)
scaler_y.fit(y.to_numpy().reshape(-1,1))
X_norm = scaler_x.transform(X)
y_norm = scaler_y.transform(y.to_numpy().reshape(-1,1))

#### Ex 20: Standardization

In [None]:
### Exercise 20: Standardization

# Find a scaler to standardise the data based on average and standard deviation
standard_x = 
standard_y = 
#

In [None]:
### Do not modify this cell, not an exercise

standard_x.fit(X)
standard_y.fit(y.to_numpy().reshape(-1,1))
X_standard = standard_x.transform(X)
y_standard = standard_y.transform(y.to_numpy().reshape(-1,1))

### Train/test split
The next step is splitting the data into train and test datasets. This is done to evaluate the models fairly. The models are trained with the train data, on which they will normally perform very well. The test dataset is used to evaluate how the model performs on data it has never seen before, that is, how well it generalises.
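
A minimal sketch of such a split on toy arrays is shown below. The arrays and variable names are assumptions for illustration; the `test_size` and `random_state` values are simply example settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative assumption): 10 samples with 2 features each
features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

# Hold out 30% of the samples; random_state makes the split reproducible
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.3, random_state=1
)

print(f_train.shape, f_test.shape)
```

Every sample ends up in exactly one of the two sets, and fixing `random_state` gives the same split on every run.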

#### Ex 21: Train/Test Split

In [None]:
### Exercise 21: Split into Train/Test datasets

# Look for a function to split into train and test
# Select Test size 0.3 and random state 1
# (It is the same for every block)
## Non preprocessed
X_np_train, X_np_test, y_np_train, y_np_test = \
(X.to_numpy(), y.to_numpy().reshape(-1,1), ..., ...)

## Normalization
X_norm_train, X_norm_test, y_norm_train, y_norm_test = \
(X_norm, y_norm, ..., ...)

## Standardization
X_standard_train, X_standard_test, y_standard_train, y_standard_test = \
(X_standard, y_standard, ..., ...)

### Fitting Models

#### Ex 22: Fit Models

In [None]:
### Exercise 22: Find the routines to instantiate and fit the models indicated below

def fit_models(features_train, labels_train):
  # Linear Regression
  lin_regress = 
  lin_regress.
  # print(lin_regress.coef_)
  # print(lin_regress.intercept_)

  # Support vector machines
  # Linear (Pay attention to the Linear detail)
  svm_linear = 
  svm_linear.

  # Poly (Pay attention to the Poly detail)
  svm_poly = 
  svm_poly.

  # RBF (Pay attention to the RBF detail)
  svm_rbf = 
  svm_rbf.

  # Decision tree regressor
  clf = tree.DecisionTreeRegressor()
  clf
  return lin_regress, svm_linear, svm_poly, svm_rbf, clf

In [None]:
### Do not modify this cell, not an exercise

features_train = X_np_train
labels_train = y_np_train
features_test = X_np_test
labels_test = y_np_test

In [None]:
### Do not modify this cell, not an exercise

reg, svm_linear, svm_poly, svm_rbf, clf = fit_models(features_train, labels_train)

### Evaluation of the models

In [None]:
### Do not modify this cell, not an exercise

# Function for evaluating the methods
def evaluation_of_methods(methods_vector, methods_name, 
                          features_train, features_test, 
                          labels_train, labels_test):
  evaluation = pd.DataFrame(columns=['method', 'r2_train', 'r2_test', 
                                     'MSE_train', 'MSE_test', 
                                     'MAE_train', 'MAE_test'])
  for i in range(len(methods_vector)):
    print(f"method name:{methods_name[i]}")
    results_dict = dict()
    results_dict['method'] = methods_name[i]
    
    y_hat_train = methods_vector[i].predict(features_train)
    y_hat_test = methods_vector[i].predict(features_test)

    results_dict['r2_train'] = r2_score(labels_train, y_hat_train)
    results_dict['r2_test'] = r2_score(labels_test, y_hat_test)

    results_dict['MSE_train'] = mean_squared_error(labels_train, y_hat_train)
    results_dict['MSE_test'] = mean_squared_error(labels_test, y_hat_test)

    results_dict['MAE_train'] = mean_absolute_error(labels_train, y_hat_train)
    results_dict['MAE_test'] = mean_absolute_error(labels_test, y_hat_test)

    method_results = pd.DataFrame.from_dict([results_dict])
    evaluation = pd.concat([evaluation,method_results], ignore_index = True)
  return evaluation

In [None]:
### Do not modify this cell, not an exercise

# Function for plotting the results
def plotting_methods(methods_vector, features_test, labels_test):
  nrows = len(methods_vector)
  fig, ax = plt.subplots(ncols=2, 
                        nrows=nrows, 
                        figsize=2.5*np.array([6.4, 4.8]), 
                        sharex='col')
  for i in range(nrows):
    # Prediction
    y_hat = methods_vector[i].predict(features_test).ravel()
    # Prediction vs real
    ax[i, 0].scatter(labels_test, y_hat, label="Predictions")
    ax[i, 0].scatter(labels_test, labels_test, label="Real values")
    ax[i, 0].set_ylabel(f'Predictions {methods_name[i]}')
    ax[i, 0].set_xlabel('Real values')
    ax[i, 0].legend()
    ax[i, 1].hist(labels_test.ravel()-y_hat)
    ax[i, 1].set_xlabel('Error')

In [None]:
### Do not modify this cell, not an exercise

methods_vector = [reg, svm_linear, svm_poly, svm_rbf, clf]
methods_name = ['Linear Regressor', 'SVM Linear', 'SVM Poly', 'SVM RBF', 'Decision Tree']
evaluation = evaluation_of_methods(methods_vector, methods_name, 
                                   features_train, features_test,
                                   labels_train, labels_test)

In [None]:
### Do not modify this cell, not an exercise

evaluation

In [None]:
### Do not modify this cell, not an exercise

plotting_methods(methods_vector, features_test, labels_test)

## 5II - Supervised Learning: Sales Prediction

The marketing department of a new brand has contacted our data science group with an interest in sales prediction. Based on the dataset we have, we will try to create a model that can predict the number of sales of a model given its characteristics.

### Features/labels preparation

#### Ex 23: One hot encoding

In [None]:
### Ex 23: One hot encoding of the Manufacturer

# Look for a function in pandas to one-hot encode the manufacturer
# TIP: It has to do with dummies
manufacturer_one_hot = pd.(car_sales['Manufacturer'], dtype=float)
#

# Quick check:
print(np.sum(manufacturer_one_hot.sum(axis=1)))
print(len(manufacturer_one_hot))

In [None]:
### Ex 23 (continued): Drop columns

# Drop columns 'Manufacturer', 'Model', '__year_resale_value', 'Vehicle_type', 'Latest_Launch'
car_sales_for_sales = car_sales.([], axis=1)
#

print(car_sales_for_sales.columns)

In [None]:
### Do not modify this cell, not an exercise

# Assignment of the one hot encoding to the dataframe
car_sales_for_sales[manufacturer_one_hot.columns] = manufacturer_one_hot

In [None]:
### Do not modify this cell, not an exercise

# Check the first ten entries of the dataframe
car_sales_for_sales.head(10)

In [None]:
### Do not modify this cell, not an exercise

# Remove the nans from the dataframe
car_sales_ii_ready = car_sales_for_sales.dropna(axis=0)

In [None]:
### Do not modify this cell, not an exercise

# Show the dataframe
car_sales_ii_ready

### Features/labels selection

In [None]:
### Do not modify this cell, not an exercise

# Total amount of columns
print(f"The list of columns is {car_sales_ii_ready.columns}")

#### Ex 24: Features and labels selection

In [None]:
### Ex 24: Features and labels selection

# Select Price, Engine size, horsepower, wheelbase, width, length, curb_weight
# fuel capacity, fuel efficiency, power perf factor, size and all the brand
# names as input variables.
input_variables = []
#

# Select sales as output variable
output_variable = ''
#

In [None]:
### Do not modify this cell, not an exercise

# Preparation of the data
input_variables.append(output_variable)

X_ii = car_sales_ii_ready[input_variables].dropna(axis=0)

y_ii = X_ii[output_variable]
X_ii.drop([output_variable], axis=1, inplace=True)

print(f"{len(X_ii.columns)} input variables: {X_ii.columns}")
print(f"Output variable is: {y_ii.name}")

### Normalization/Standardization

#### Ex 25: Standardise the data

In [None]:
### Ex 25: Instantiate standardisers
# Modify the following line
standard_x_ii = 
standard_y_ii = 
#
standard_x_ii.fit(X_ii)
standard_y_ii.fit(y_ii.to_numpy().reshape(-1,1))
X_standard_ii = standard_x_ii.transform(X_ii)
y_standard_ii = standard_y_ii.transform(y_ii.to_numpy().reshape(-1,1))

### Train/test split

#### Ex 26: Train/test Split

In [None]:
### Ex 26: Split the dataset into train and test
# Look for the function to split the dataset and use a test size of 0.3 and 
# random state 1.
X_standard_train_ii, X_standard_test_ii, y_standard_train_ii, y_standard_test_ii = \
(X_standard_ii, y_standard_ii, ..., ...)
#

### PCA
So far in this problem, we have many features (inputs) but not many examples. PCA is a good method for reducing the dimensionality of a problem.
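
One possible way to see how many components are worth keeping is to look at the cumulative explained variance. The sketch below fits a full PCA once on synthetic data and accumulates the ratios. The data (six features driven by two hidden factors) and the variable names are assumptions for illustration, and this is a different route from the loop used in the exercise.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (illustrative assumption): 6 features driven by 2 hidden factors
rng = np.random.default_rng(0)
factors = rng.normal(size=(80, 2))
mixing = rng.normal(size=(2, 6))
data = factors @ mixing + rng.normal(scale=0.05, size=(80, 6))

# Fit a full PCA once and accumulate the explained variance ratio
pca_full = PCA().fit(data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

for n, share in enumerate(cumulative, start=1):
    print(f"{n} components explain {share:.3f} of the variance")
```

In this synthetic case two components are enough, because only two factors generate the data. On real data the curve usually rises more gradually and the cut-off is a judgment call.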

#### Ex 27: PCA

In [None]:
### Ex 27: Complete the PCA call to do a sweep with the number of components

var_exp = []
for i in range(1, len(input_variables)):
  # Modify the following line
  pca = PCA()
  #
  pca.fit(X_standard_train_ii)
  var_exp.append(np.sum(pca.explained_variance_ratio_))

plt.figure()
plt.plot(np.arange(1, len(input_variables)), var_exp)

#### Ex 28: Select PCA components

In [None]:
### Ex 28: Select a suitable number of components

#
pca = PCA(n_components=...)
#
pca.fit(X_standard_train_ii)

features_train = pca.transform(X_standard_train_ii)
labels_train = y_standard_train_ii
features_test = pca.transform(X_standard_test_ii)
labels_test = y_standard_test_ii

# features_train = X_standard_train_ii
# labels_train = y_standard_train_ii
# features_test = X_standard_test_ii
# labels_test = y_standard_test_ii

### Fitting models

In [None]:
### Do not modify this cell, not an exercise

reg, svm_linear, svm_poly, svm_rbf, clf = fit_models(features_train, labels_train)

### Evaluation of models

In [None]:
### Do not modify this cell, not an exercise

methods_vector = [reg, svm_linear, svm_poly, svm_rbf, clf]
methods_name = ['Linear Regressor', 'SVM Linear', 'SVM Poly', 'SVM RBF', 'Decision Tree']
evaluation = evaluation_of_methods(methods_vector, methods_name, 
                                   features_train, features_test,
                                   labels_train, labels_test)

In [None]:
### Do not modify this cell, not an exercise

evaluation

In [None]:
### Do not modify this cell, not an exercise

plotting_methods(methods_vector, features_test, labels_test)

In [None]:
### Do not modify this cell, not an exercise

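# Bring labels and predictions back to the original units before subtracting.
# Applying inverse_transform to the difference would add the mean of the target to every error.
real_labels = standard_y_ii.inverse_transform(labels_test.reshape(-1,1))
real_predictions = standard_y_ii.inverse_transform(svm_poly.predict(features_test).reshape(-1,1))
real_errors = real_labels - real_predictions
print(f"Real errors {real_errors.ravel()}")
print(f"Average real error {np.mean(real_errors)}")
print(f"Standard deviation real error {np.std(real_errors)}")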

## 6 - Conclusions and takeaways
1. Every data science project is different and requires a different approach and different methods to reach its objective successfully.
2. The first step should always be an "Exploratory Data Analysis", in which the statistical distributions of the data and the correlations between the variables are checked.
3. Unsupervised learning (targets/labels are not defined) is used to detect hidden patterns or relationships in the data. Common tasks are clustering, association and anomaly detection.
4. Normalization/Standardization is required most of the time to bring all the data to the same scale, so that models are influenced by the importance of a variable and not by its magnitude.
5. Supervised learning (features and targets are defined) is used to model some variables based on others.