# Step 1: Gathering Data

We start by importing the required libraries.

# Answer the questions in the text fields marked with bullet points.
# Double-click a text field to edit it.


In [None]:
# 
# Import Packages
# These packages are libraries. They contain many useful functions that
# streamline our code.

import numpy as np                # helps to crunch numbers
import matplotlib.pyplot as plt   # helps to plot simple bar graphs
import seaborn as sns             # helps to plot scatter plots
import pandas as pd               # will be used to create our DataFrame
%matplotlib inline

In [None]:
# First we create a list of strings. Can you guess what the strings represent?

# These are the names of our columns, or Features. The names have been shortened.
# Length(m), Wheel_Base(m), Doors, Weight(100kg), Seats, Engine_Size(1000cc), Sun_Roof, Price
column_features = ['Len(m)', 'Wb(m)', 'Door', 'W(100kg)', 'Seats','E_S(1000cc)','S_F','Price'] # As per the cars dataset information




In [None]:
# Load the data
# Next we load the cars data into a DataFrame called 'df'.
# A DataFrame is just a table of information. Our table will have Features and Class Labels.

df = pd.read_csv('cars_dl5.csv', names=column_features)  # Using Pandas (as pd), open the data file 'cars_dl5.csv'
# and name the columns using the strings stored in 'column_features'

# Step 2: Preparing Data

In [None]:
df.head(5)  # the head() function shows the first 5 rows of data

# Try changing the parameter '5' to '10' or even '-10'



* What are the rows that are displayed?

* Is the table correct?

* Was the correct data imported?

* Are the names of the columns correctly assigned?


In [None]:
# Some basic statistical analysis of the data
# Before we start applying Machine Learning, we as data engineers should investigate
# some details about our data.
# Common things to check are the count and the mean of each column.

# Use the describe() function to get a snapshot of our data's statistics
df.describe()




* What does count refer to?

* How many rows of data does df contain?

* Is there enough data to make a good analysis?

* What is mean? 

* How useful is it to look at the mean values of each column for this dataset?
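To make the questions above concrete, here is a small sketch of what `count` and `mean` report. The table `toy` and its values are made up for illustration; they are not rows from the cars file.

```python
import pandas as pd

# Hypothetical mini table (made-up values, not taken from the cars data)
toy = pd.DataFrame({
    "Len(m)": [3.5, 4.0, 4.5, 5.0],
    "Door":   [2, 4, 4, 2],
})

summary = toy.describe()

# 'count' is the number of non-missing values in each column
print(summary.loc["count"])

# 'mean' is the sum of the values in a column divided by its count
print(summary.loc["mean"])
```

A mean is easy to interpret for a measurement such as length, but less so for a column such as `Door` that only takes a few whole-number values.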

In [None]:
# The describe-table above does not help us to understand the data much.
# It was useful in showing us the min and max values,
# but mean and std were not useful.

# Let's dig deeper! Let's find out how many rows of data exist for each of the
# 3 classes: Expensive, Moderate, Cheap


# count_of_cars is a variable that holds the number of rows where the class is
# "Expensive", "Moderate" or "Cheap"
# Change the word 'Expensive' to the name of the class you want to check.
count_of_cars = len(df[df["Price"]=="Expensive"])

print(count_of_cars)

* How many cars are there in the Expensive class?

* How many cars are there in the Moderate class?

* How many cars are there in the Cheap class?

* Are the classes equal in size? Will that affect our results?
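Instead of changing the class name by hand three times, all classes can be counted in one step. The sketch below uses a made-up table called `toy`; with the real data you would apply the same calls to `df["Price"]`.

```python
import pandas as pd

# Hypothetical class labels (made up for illustration)
toy = pd.DataFrame(
    {"Price": ["Cheap", "Expensive", "Cheap", "Moderate", "Cheap", "Moderate"]}
)

# Number of rows in each class, largest class first
counts = toy["Price"].value_counts()
print(counts)

# Share of each class as a fraction of all rows
shares = toy["Price"].value_counts(normalize=True)
print(shares.round(2))
```

If one class holds a much larger share than the others, a model can look accurate simply by favouring that class, so it is worth checking the balance before training.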

In [None]:
# Visualize the whole dataset

sns.pairplot(df, hue='Price')   # Using the Seaborn library, we can plot comparative charts of our data

# The Seaborn library (as sns) quickly plots each pair of Features against each other.
# How does looking at the comparison of Len(m) VS Wb(m) help us?
# How does looking at the comparison of W(100kg) VS E_S(1000cc) help us?

# The comparisons help us to infer whether the classes are separated or overlap too closely.

* Look at the chart for Len(m) VS Wb(m). Is there a clear separation of the 3 classes?

* Look at Len(m) VS E_S(1000cc). Is there a clear separation of the 3 classes?

* Look at Wb(m) VS Wb(m). This chart compares only the Wheel_Base(m) of all 3 classes. Are the datapoints significantly overlapping with each other?

* Look at E_S(1000cc) VS E_S(1000cc). Are the datapoints significantly overlapping with each other?

* Which graphs show us a separation of the datapoints of the 3 classes?

* Overall, do you think the data of the 3 classes are separated enough that our Machine Learning model can make good predictions?
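Overlap can also be checked with numbers rather than by eye. The sketch below is one simple way to do it, assuming a made-up table `toy` and a hypothetical helper `ranges_overlap`; it compares the smallest and largest value of one Feature in each class.

```python
import pandas as pd

# Hypothetical values (made up for illustration)
toy = pd.DataFrame({
    "E_S(1000cc)": [1.2, 1.6, 2.0, 2.4, 3.5, 4.1],
    "Price": ["Cheap", "Cheap", "Moderate", "Moderate", "Expensive", "Expensive"],
})

# Smallest and largest value of the Feature inside each class
ranges = toy.groupby("Price")["E_S(1000cc)"].agg(["min", "max"])
print(ranges)

def ranges_overlap(a, b):
    """Return True if the value ranges of classes a and b share any values."""
    return bool(
        ranges.loc[a, "min"] <= ranges.loc[b, "max"]
        and ranges.loc[b, "min"] <= ranges.loc[a, "max"]
    )

print(ranges_overlap("Cheap", "Moderate"))
```

Ranges are a rough check only: a single unusual car can stretch a range, so the pair plot remains the better guide.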

In [None]:
# Separate features and target

# Before we begin Machine Learning, we need to split up the table into 2 Arrays
data = df.values    # Take the values from the table and store them in a new Array called 'data'
X = data[:,0:7]     # Take the first 7 columns (Features) and store them in X
Y = data[:,7]       # Take the Class column (Price) and store it in Y

In [None]:
# Calculate the average of each feature for every class

# Using the X and Y arrays created in the previous cell, we are going to plot the averages of each column. 
# In this cell, we use some array calculations to help us find the averages. 
Y_Data = np.array([np.average(X[:, i][Y==j].astype('float32')) for i in range (X.shape[1]) for j in (np.unique(Y))])
Y_Data_reshaped = Y_Data.reshape(7,3)
Y_Data_reshaped = np.swapaxes(Y_Data_reshaped, 0, 1)
X_axis = np.arange(len(column_features)-1)
width = 0.25

In [None]:
# Plot the averages
# Using the MatPlotLib library (as plt) we can plot the averages of each Price class of car based on its features.
# np.unique(Y) sorts the class names alphabetically (Cheap, Expensive, Moderate),
# so row 1 of Y_Data_reshaped holds Expensive and row 2 holds Moderate.
plt.bar(X_axis, Y_Data_reshaped[0], width, label = 'Cheap')
plt.bar(X_axis+width, Y_Data_reshaped[2], width, label = 'Moderate')
plt.bar(X_axis+width*2, Y_Data_reshaped[1], width, label = 'Expensive')
plt.xticks(X_axis, column_features[:7])
plt.xlabel("Features")
plt.ylabel("Values")
plt.legend(bbox_to_anchor=(1.3,1))
plt.show()

# Study the bar chart. On the x-axis we have the Features, on the y-axis we have the Value.
# Each bar represents the average of that Feature for one Price class of car.
# For example, looking only at Len(m), I can see that the average Len(m) of each type of car is
#   between 4 and 5 meters.



* What does the bar chart above show you?

* Can you determine which feature has about the same averages per Price of car?

* Can you determine which feature has varying averages per Price of car?

Great job!

So far we have loaded the cars data into a DataFrame.
We have plotted it in different graphs to find similarities and differences.
What do you think: are the Features of the 3 Price classes separate enough that
we can use the data in our Machine Learning model?

Or do you think that the data for each class is so similar that there is no way our Machine Learning model can correctly differentiate between the cars?
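The per-class averages can also be computed directly with Pandas, which helps when checking a chart against the numbers. This is a sketch on a made-up table `toy`, not the notebook's own calculation. It also shows that `np.unique` returns the class names in sorted (alphabetical) order, which matters whenever rows of averages are matched to class labels.

```python
import numpy as np
import pandas as pd

# Hypothetical values (made up for illustration)
toy = pd.DataFrame({
    "Len(m)": [3.5, 3.7, 4.2, 4.4, 4.8, 5.0],
    "Seats":  [4, 4, 5, 5, 2, 2],
    "Price":  ["Cheap", "Cheap", "Moderate", "Moderate", "Expensive", "Expensive"],
})

# np.unique returns the distinct class names in sorted order
classes = np.unique(toy["Price"].to_numpy())
print(classes)

# One row per class and one column per Feature; groupby sorts the classes the same way
class_means = toy.groupby("Price").mean()
print(class_means)
```

With the real data, `df.groupby("Price").mean()` should give the same kind of table, provided every Feature column is numeric.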

# Step 3: Choosing a Model

In [None]:
# Split the data into train and test datasets.
# Before we can apply Machine Learning, we need to split our dataset into Training and Testing data
# Training data is the data used to train our Machine
# Testing data will be used later to validate whether our Machine is well trained or not.

# We will use the sklearn library to split our data into Training and Testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3)   # split based on a 70:30 ratio

# 70% data for Training
# 30% data for Testing
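
The split is random, so the accuracy can change every time the cell is run. A possible variation, sketched here on made-up arrays `X_toy` and `y_toy`, fixes the random seed with `random_state` and uses `stratify` so that each class keeps roughly the same share in both parts. The seed value 0 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 rows, 2 Features, 3 unequal classes (made up for illustration)
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array(["Cheap"] * 10 + ["Moderate"] * 6 + ["Expensive"] * 4)

# random_state makes the split repeatable; stratify keeps the class shares similar
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

print(len(X_tr), len(X_te))   # 14 rows for Training, 6 rows for Testing
print(np.unique(y_te))        # classes present in the Testing part
```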

In [None]:
# Support vector machine algorithm
from sklearn.svm import SVC   # using the SKLearn library to call the Support Vector Machine algorithms
svn = SVC()
svn.fit(X_train, y_train)

In [None]:
# Predict from the test dataset
predictions = svn.predict(X_test) # Now that our Training is complete, it is time to test our model.

# Using the 30% of data separated for Testing earlier, it is time for us
# to check how well we can identify the Price class of cars given their features.


In [None]:
# Calculate the accuracy
# Think of accuracy as shooting arrows at a target. We don't want the arrows
# to be far from the bullseye!

# So the closer the Accuracy is to 1 (1 means 100% accurate), the better our test went.

from sklearn.metrics import accuracy_score
accuracy_score(y_test, predictions)

Hoorah! Our Machine Learning model has been created!

What were your Accuracy results?
Were they close to 1 (100%)?
Are you satisfied with your results?

In the real world, an accuracy of more than 0.9 is usually considered really good!
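Accuracy is simply the fraction of predictions that match the true class. The sketch below works it out by hand on made-up labels and compares the result with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true classes and predictions (made up for illustration)
y_true = np.array(["Cheap", "Cheap", "Moderate", "Expensive", "Moderate"])
y_pred = np.array(["Cheap", "Moderate", "Moderate", "Expensive", "Moderate"])

# 4 of the 5 predictions match, so the accuracy is 4 / 5 = 0.8
manual = np.mean(y_true == y_pred)
library = accuracy_score(y_true, y_pred)

print(manual, library)
```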

# Step 4: Evaluation

In [None]:
# A detailed classification report

# Time to tabulate the results from our tests and predictions
# We only need to look at Precision and Recall

# What are the Precision and Recall of the 3 Price classes?
# Which class has the lowest Precision and Recall? What does that mean?

from sklearn.metrics import classification_report
print(classification_report(y_test, predictions))
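
To read the report it helps to compute Precision and Recall once by hand. The sketch below does this for the class `Cheap` on made-up labels and checks the result against scikit-learn. Precision asks "of the cars predicted Cheap, how many really are Cheap?", while Recall asks "of the cars that really are Cheap, how many did we find?".

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true classes and predictions (made up for illustration)
y_true = np.array(["Cheap", "Cheap", "Cheap", "Moderate", "Moderate", "Expensive"])
y_pred = np.array(["Cheap", "Cheap", "Moderate", "Cheap", "Cheap", "Expensive"])

cls = "Cheap"
true_positives = np.sum((y_pred == cls) & (y_true == cls))  # predicted Cheap and really Cheap
predicted_as_cls = np.sum(y_pred == cls)                    # everything predicted Cheap
actually_cls = np.sum(y_true == cls)                        # everything that really is Cheap

precision = true_positives / predicted_as_cls   # 2 / 4 = 0.5
recall = true_positives / actually_cls          # 2 / 3

# The same numbers from scikit-learn, restricted to the one class
sk_precision = precision_score(y_true, y_pred, labels=[cls], average=None)[0]
sk_recall = recall_score(y_true, y_pred, labels=[cls], average=None)[0]

print(precision, recall)
print(sk_precision, sk_recall)
```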

# Step 5: Deployment

If I gave you data on new cars, could you predict their class (Price)?

In [None]:
# Let's create an Array of data for 3 cars.
# Each row of the Array has 7 columns; each column represents
# one feature of the car.

# Run the cell and let's see how well our Machine predicts the Price class of these cars.

X_new = np.array([[4.331,1.958,2,9,2,4.1,1],[3.617,1.484,4,11.52,4,2.9,0],[3.546,1.675,4,15.79,8,1.6,0]])


# Predict the Price class from the input vectors
prediction = svn.predict(X_new)
print("Prediction of Price: {}".format(prediction))

Prediction of Price: ['Expensive' 'Moderate' 'Cheap']


So what does the prediction show?

Use the ten rows of data given. Input 3 rows at a time in the X_new array.
What are the results of your predictions? Are they correct? Check your answers with your teacher.


# Ignore this part

In [None]:
# Save the model
import pickle
with open('SVM.pickle', 'wb') as f:
    pickle.dump(svn, f)

In [None]:
# Load the model
with open('SVM.pickle', 'rb') as f:
    model = pickle.load(f)

In [None]:
model.predict(X_new)