# Part IV: Perform Machine Learning
After collecting your data (Part I), feature engineering (Part II), and data visualization + a bit of cleaning (Part III), you're now ready to train models and get this party started.

![MachineLearningProcess.png](attachment:MachineLearningProcess.png)

Generally, the machine learning process has five parts:
1. <strong>Split your data into train and test sets</strong>
2. <strong>Model creation</strong>
<br>
Import your models from sklearn and instantiate them (assign the model object to a variable)
3. <strong>Model fitting</strong>
<br>
Fit the model on your training data and train, train, train
4. <strong>Model prediction</strong>
<br>
Make a set of predictions using your test data
5. <strong>Model assessment</strong>
<br>
Compare your predictions with the ground truth in the test data

Highly recommended readings:
1. [Important] https://scipy-lectures.org/packages/scikit-learn/index.html
2. https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
3. https://scikit-learn.org/stable/tutorial/basic/tutorial.html

### Step 1: Import your libraries
We will be using models from sklearn, a popular machine learning library. However, we won't import everything from sklearn; we'll take just what we need in Step 4. For now, start with the general-purpose libraries.

Import the following:
1. pandas
2. matplotlib.pyplot as plt
3. seaborn as sns
4. numpy

In [None]:
# Step 1: Import the needed libraries

### Step 2: Read your cleaned CSV from Part III
We'll be reading the CSV from Part III into a DataFrame. 

To recap, the difference between the CSV from Part II and Part III is that we:
1. removed the anomalous rows, i.e. rows with zero available taxis in the entire country
2. removed sectors 3, 6, 7, and 9 because these sectors are completely empty

We don't need the date to be the index now. 

In [None]:
# Step 2: Read your CSV
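If you're unsure what the loading step looks like, here is a minimal sketch. It reads from an in-memory buffer so it runs on its own. The sample rows and the `time` column name are made up for illustration; in your notebook you would pass the path of your own Part III CSV to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned CSV from Part III (values are made up).
csv_text = """time,day_of_week,minute,hour,sector_1,sector_2,sector_4,sector_5,sector_8
2020-01-01 00:00:00,2,0,0,310,120,95,400,210
2020-01-01 00:05:00,2,5,0,305,118,97,395,205
"""

# No index_col here: the date stays an ordinary column, not the index.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```

Because no index column is set, the DataFrame keeps pandas' default integer index.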

### Step 3: Prepare your independent and dependent variables
At this stage, we'll be using machine learning to predict the taxi availability in the five remaining sectors: 1, 2, 4, 5, and 8.

From the loaded CSV, we'll be using only:
1. day_of_week
2. minute
3. hour

Declare a variable and store the DataFrame containing these columns only.

And we'll be preparing these dependent variables separately:
1. sector_1
2. sector_2
3. sector_4
4. sector_5
5. sector_8

Declare five separate variables, each containing one column's values.

![IndependentDependentVariables.png](attachment:IndependentDependentVariables.png)

The independent variables look deceptively simple, but they contain enough information to make a reliable prediction for the dependent variables, i.e. sector_1, sector_2, sector_4, sector_5, and sector_8.

In [None]:
# Step 3: Prepare the independent and dependent variables. 
# Note: make sure you have 1 variable containing the independent variables, and 5 variables each containing the individual dependent variable
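One possible way to lay out the variables is sketched below. The tiny DataFrame is made up so the snippet runs by itself, and the variable names (`X`, `y_sector_1`, ...) are just suggestions, not required names.

```python
import pandas as pd

# Tiny made-up frame standing in for the DataFrame loaded in Step 2.
df = pd.DataFrame({
    "day_of_week": [2, 2, 3],
    "minute": [0, 5, 10],
    "hour": [0, 0, 1],
    "sector_1": [310, 305, 298],
    "sector_2": [120, 118, 121],
    "sector_4": [95, 97, 90],
    "sector_5": [400, 395, 402],
    "sector_8": [210, 205, 215],
})

# Independent variables: one DataFrame holding only the three time features.
X = df[["day_of_week", "minute", "hour"]]

# Dependent variables: one Series per sector.
y_sector_1 = df["sector_1"]
y_sector_2 = df["sector_2"]
y_sector_4 = df["sector_4"]
y_sector_5 = df["sector_5"]
y_sector_8 = df["sector_8"]

print(X.shape, y_sector_1.shape)
```

Double square brackets return a DataFrame (2-D), while single square brackets return a Series (1-D), which is the shape sklearn expects for a single target.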

### Step 4: Import machine learning libraries
Time to import other libraries. We hope you've taken a look at the three articles (especially the first one) at the start of this notebook because they'll be useful. 

Import the following functions and classes (listed as name - module):
1. train_test_split - sklearn.model_selection
2. DummyRegressor - sklearn.dummy
3. LinearRegression - sklearn.linear_model
4. DecisionTreeRegressor - sklearn.tree
5. RandomForestRegressor - sklearn.ensemble
6. mean_squared_error - sklearn.metrics

In [None]:
# Step 4: Import the libraries that you need
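For reference, the list above maps onto `from ... import ...` statements like these:

```python
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
```

Importing names individually like this keeps the notebook light and makes it obvious which parts of sklearn you actually use.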

### Step 5: Split your independent and dependent variables into train and test sets
We'll be using an 80/20 split for the train and test sets respectively, using the train_test_split function.

For ease, try splitting your independent variables and <strong>sector_1</strong> first and use that for the rest of the training.

In [None]:
# Step 5: Split your data into train and test
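Here is a rough sketch of an 80/20 split. The data is randomly generated so the snippet is self-contained, and `random_state=42` is our own choice to make the split repeatable, not something this project requires.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data standing in for the Step 3 variables.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "day_of_week": rng.integers(0, 7, n),
    "minute": rng.integers(0, 60, n),
    "hour": rng.integers(0, 24, n),
})
y_sector_1 = (300 + 5 * X["hour"] + rng.normal(0, 10, n)).rename("sector_1")

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_sector_1, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Note the order of the four return values: train and test for the independent variables first, then train and test for the dependent variable.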

### Step 6: Train your machine learning model
Once you've split your data, machine learning begins. 

This is what you'll need to do:
1. Start with a model
2. Declare a variable, and store your model in it (don't forget the brackets, e.g. `DummyRegressor()`)
3. Fit the instantiated model on your training data
4. Declare a variable that contains predictions from the model you just trained, using the test dataset (X_test)

We recommend starting with DummyRegressor to establish a baseline for your predictions. 

Also, the recommended readings above will be very helpful.

In [None]:
# Step 6a: Declare a variable to store the model

# Step 6b: Fit your train dataset

# Step 6c: Declare a variable and store your predictions that you make with your model using X test data
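The three sub-steps could look roughly like this for the baseline model. The data is made up so the snippet runs on its own, and the variable names (`dummy_model`, `dummy_pred`) are only suggestions.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split

# Made-up data standing in for the Step 3 variables and the Step 5 split.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "day_of_week": rng.integers(0, 7, n),
    "minute": rng.integers(0, 60, n),
    "hour": rng.integers(0, 24, n),
})
y_sector_1 = (300 + 5 * X["hour"] + rng.normal(0, 10, n)).rename("sector_1")
X_train, X_test, y_train, y_test = train_test_split(
    X, y_sector_1, test_size=0.2, random_state=42
)

# Step 6a: instantiate the model (the brackets create the model object).
dummy_model = DummyRegressor()  # default strategy: always predict the training mean

# Step 6b: fit the model on the training data.
dummy_model.fit(X_train, y_train)

# Step 6c: predict on the test features.
dummy_pred = dummy_model.predict(X_test)

print(dummy_pred[:5])
```

With its default settings, DummyRegressor ignores the features and predicts the mean of `y_train` for every row, which is exactly why it makes a useful baseline: any real model should beat it.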

### Step 7: Repeat Step 6 with other models 
After DummyRegressor, give LinearRegression, DecisionTreeRegressor, and RandomForestRegressor a try.

Don't forget to use different variables - we'll be comparing their performances later.

In [None]:
# Step 7: Repeat Step 6 with other models
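A compact way to repeat Step 6 is sketched below. Instead of separate variables, it keeps each model's predictions in a dictionary keyed by model name, which is our own variation; separate variables work just as well. The data, `n_estimators=50`, and `random_state=42` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data standing in for the Step 3 variables and the Step 5 split.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "day_of_week": rng.integers(0, 7, n),
    "minute": rng.integers(0, 60, n),
    "hour": rng.integers(0, 24, n),
})
y_sector_1 = (300 + 5 * X["hour"] + rng.normal(0, 10, n)).rename("sector_1")
X_train, X_test, y_train, y_test = train_test_split(
    X, y_sector_1, test_size=0.2, random_state=42
)

# One entry per model; random_state makes the tree-based models repeatable.
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=42),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# Fit each model and store its test-set predictions under its own key.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)

print({name: pred[:3] for name, pred in predictions.items()})
```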

### Step 8: Assess your model performance
We'll be using two ways to assess our model.

1. Scatter plot comparing the actual values of the dependent variable and the predictions
2. The root-mean-squared error (RMSE) score

Do this for all of your models - you'll get to compare how well they predict your taxi availability. Ideally, your points should lie on a line where y = x, i.e. your predictions are exactly the same as actual values.

![RandomForestSector1.png](attachment:RandomForestSector1.png)

This is an example of the result of plotting test y data against the predictions made for Sector 1. This is pretty good since most of the points cluster around the y = x line.

<strong>Hint: For RMSE, you'll square root the mean_squared_error results - you'll need something from numpy</strong>
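To make the hint concrete, here is a small worked example with made-up numbers. It takes the square root of sklearn's `mean_squared_error` using `np.sqrt`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual values and predictions, just to show the arithmetic.
y_test = np.array([300.0, 320.0, 310.0, 290.0])
y_pred = np.array([310.0, 310.0, 310.0, 310.0])

# Squared errors are 100, 100, 0, and 400, so MSE = 600 / 4 = 150.
mse = mean_squared_error(y_test, y_pred)

# RMSE is the square root of the MSE, back in the same units as the data.
rmse = np.sqrt(mse)

print(mse, rmse)
```

Taking the square root is what puts the score back into "number of taxis", which makes it much easier to judge than the raw MSE.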

In [None]:
# Step 8a: Print the RMSE between the y test and the prediction

# Step 8b: Plot a scatter plot test dependent variables vs predictions
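For the scatter plot, something along these lines should work. The actual values and predictions here are randomly generated stand-ins, and the dashed y = x reference line is our own addition to make the comparison easier to read.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so this runs as a script; not needed in Jupyter

import matplotlib.pyplot as plt
import numpy as np

# Made-up actual values and noisy predictions standing in for y_test and a model's output.
rng = np.random.default_rng(0)
y_test = rng.uniform(200, 500, 50)
y_pred = y_test + rng.normal(0, 20, 50)

fig, ax = plt.subplots(figsize=(6, 6))

# Each point is one test row: actual value on x, predicted value on y.
ax.scatter(y_test, y_pred, alpha=0.6)

# Reference line y = x: perfect predictions would sit exactly on it.
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, color="red", linestyle="--", label="y = x")

ax.set_xlabel("Actual taxi availability")
ax.set_ylabel("Predicted taxi availability")
ax.legend()
```

The further the points drift from the dashed line, the worse the predictions are for that range of values.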

### Repeat Steps 5 - 8 with the other sector data
After you are done with 'sector_1', give the other sector data a try. This way, you can see which model does best in predicting the taxi availability for all sectors.

In [None]:
# Try Step 5-8 with other sector data
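If copying and pasting Steps 5 - 8 five times feels tedious, a loop is one option. The sketch below uses made-up data and only two of the four models to stay short; the results table layout is our own choice.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data standing in for the cleaned DataFrame.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "day_of_week": rng.integers(0, 7, n),
    "minute": rng.integers(0, 60, n),
    "hour": rng.integers(0, 24, n),
})
sectors = ["sector_1", "sector_2", "sector_4", "sector_5", "sector_8"]
for i, sector in enumerate(sectors):
    df[sector] = 200 + 50 * i + 4 * df["hour"] + rng.normal(0, 10, n)

X = df[["day_of_week", "minute", "hour"]]

rows = []
for sector in sectors:
    # Step 5: split for this sector.
    X_train, X_test, y_train, y_test = train_test_split(
        X, df[sector], test_size=0.2, random_state=42
    )
    # Steps 6 - 8: fit fresh models, predict, and score.
    for name, model in {"dummy": DummyRegressor(), "linear": LinearRegression()}.items():
        model.fit(X_train, y_train)
        rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
        rows.append({"sector": sector, "model": name, "rmse": rmse})

# One row per sector, one column per model.
results = pd.DataFrame(rows).pivot(index="sector", columns="model", values="rmse")
print(results.round(2))
```

Creating new model objects inside the loop matters: each sector gets its own freshly fitted model rather than reusing one trained on another sector.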

How do you tell whether an RMSE is good? Generally, the lower the better. As a rule of thumb, the RMSE should be about one order of magnitude lower than the values themselves. For example, if my values are between 200 and 500, a good RMSE would be around 10% of those values, i.e. roughly 20 to 50.

Visually inspecting the predicted values against the actual values in a plot is also useful. 


Once you've tried all that, congratulations! You have reached...

# The end
And that's the end! To recap, you've:
1. Independently collected data from an official website (a government one, no less!)
2. Performed data cleaning
3. Engineered new features
4. Trained a machine learning model to predict taxi availability 

Go on, give yourself a pat on the back. We hope this project series has given you more confidence in coding and machine learning. 

You have successfully implemented machine learning to predict taxi availability in different sectors in Singapore. However, as you may have noticed, there is still a lot of room to improve the model. The RMSE is good, but we think that it can be better. 

For example, you could try:
1. collecting more data over the months (Feb - Dec)
2. collecting more data over the years (2019 and before)
3. tightening the boxes, i.e. instead of nine sectors, dividing the map into even smaller sectors for a more granular analysis

That is the fate of a data scientist: to pursue better models that can help model the world out there.  

Whatever you learn here is but the tip of the iceberg, and a launchpad for bigger and better things to come. Come join us in our Telegram community over at https://bit.ly/UpLevelSG and our Facebook page at https://fb.com/UpLevelSG

<strong>P.S. Watch out for DLCs for this notebook. We may release project extensions that utilize new techniques and/or feature engineering to improve your dataset and come up with a more amazing model.</strong>