<a href="https://colab.research.google.com/github/dipesh2108/AI_Notes/blob/main/ML_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This Notebook is created exclusively for Short ML training Workshops and Bootcamps by **Head - Training & Content Development**  - <font color='darkgreen'><b>Mr. Rocky Jagtiani</b></font> (<small> https://linkedin.com/in/rocky-jagtiani-3b390649/ </small>)



ML Process flow or ML Pipeline
--
![basic_ML_process_flow](https://drive.google.com/uc?id=1mf5Rpq68x9AzyHMAo6xsQ9fVNDbuI4Q3 'basic_ML_process_flow')

`Recall, this is where we stopped in our first NB` <br>
1. **Data Collection**: Collect the data that the algorithm will learn from.

2. **Data Preparation**: Format and engineer the data into the optimal format, extracting important features and performing dimensionality reduction. (*Dimensionality reduction means that when there are too many features, i.e. too many columns in our dataset, we keep only a few of them.*)

3. **Training**: Also known as the fitting stage, this is where the Machine Learning algorithm actually learns from the data that has been collected and prepared.

4. **Evaluation**: Test the model to see how well it performs.

5. **Tuning**: Fine-tune the model to maximise its performance.


<font color='green'> <b>
An ML pipeline is nothing but the steps you follow to clean, pre-process and scale (or normalise) the data before training and testing a model on it.
</b></font>

**`Experienced ML Engineers`** in fact build the ML pipeline right from preparing the data to deploying the ML model (see the diagram below).

![Board_ML_Pipeline](https://drive.google.com/uc?id=1zEz0CfrYofJrYqD-MjGlwj6tzgyhxjVm 'Board_ML_Pipeline')

Here we dive into solving an end-to-end ML problem - **Student Grant Recommendation**; of course a very simple problem, so that you get the feel of an **`ML pipeline`.**
<br><br>
**`Please Note`** : Later in the `Machine Learning - Intermediate` <font color='green'><b>course</b></font> you will learn how to **`automate`** the pipelining process by using the **sklearn.pipeline.Pipeline class.**

# Objective : Student Grant Recommendation

You have historical student performance data and their grant recommendation outcomes in the form of a comma separated value file named student_records.csv. Each data sample consists of the following attributes.

• Name (the student name) <br>
• OverallGrade (overall grade obtained) <br>
• Obedient (whether they were diligent during their course of stay) <br>
• ResearchScore (marks obtained in their research work) <br>
• ProjectScore (marks obtained in the project) <br>
• Recommend (whether they got the grant recommendation) <br>

Your main objective is to build a predictive model based on this data such that you can predict for any future student whether they will be recommended for the grant based on their performance attributes.

`Note` : This is a <u>toy dataset</u>.

**`Step 1: Data Retrieval`** <br>
Here, we will leverage the pandas framework to retrieve the data from the CSV file. The following snippet shows us how to retrieve the data and view it.

In [None]:
# download the dataset from this link https://drive.google.com/open?id=1viCNZx1e3Egi7zsh72zwGjrA_W8dcpul

# then upload the .csv file to your Colab session by uncommenting and running the code below and selecting the file.

# loading the dataset into this Colab NB
#from google.colab import files
#files.upload()

Saving student_records.csv to student_records.csv


{'student_records.csv': b'Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend\nHenry,A,Y,90,85,Yes\nJohn,C,N,85,51,Yes\nDavid,F,N,10,17,No\nHolmes,B,Y,75,71,No\nMarvin,E,N,20,30,No\nSimon,A,Y,92,79,Yes\nRobert,B,Y,60,59,No\nTrent,C,Y,75,33,No\n'}

In [None]:
#--get data into a DataFrame

import pandas as pd

df = pd.read_csv('student_records.csv')
df

     Name OverallGrade Obedient  ResearchScore  ProjectScore Recommend
0   Henry            A        Y             90            85       Yes
1    John            C        N             85            51       Yes
2   David            F        N             10            17        No
3  Holmes            B        Y             75            71        No
4  Marvin            E        N             20            30        No
5   Simon            A        Y             92            79       Yes
6  Robert            B        Y             60            59        No
7   Trent            C        Y             75            33        No
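
If the download link is not reachable, the same eight rows can be rebuilt from the upload output shown earlier. This is only a sketch: it assumes you are happy to read the CSV text from an in-memory string instead of a file, and `csv_text` is a name introduced here for illustration.

```python
import io

import pandas as pd

# CSV content copied from the upload output shown above
csv_text = """Name,OverallGrade,Obedient,ResearchScore,ProjectScore,Recommend
Henry,A,Y,90,85,Yes
John,C,N,85,51,Yes
David,F,N,10,17,No
Holmes,B,Y,75,71,No
Marvin,E,N,20,30,No
Simon,A,Y,92,79,Yes
Robert,B,Y,60,59,No
Trent,C,Y,75,33,No
"""

# read_csv accepts any file-like object, so a StringIO works like a file
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```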


**`Step 2: Data Preparation`**<br>
Based on the dataset (above), we do not have any data errors or missing values, hence we will mainly focus on feature engineering and scaling in this section.

<h3>If you wish to see <b>some "un-clean data" examples</b>, then watch this video</h3>

<a href="https://drive.google.com/open?id=1NERqIE0PnmaiMInd8BUjswQNC2T1JZsn" download="Introduction_to_ML">
  <img src="https://drive.google.com/uc?id=14OOsd0HaKoMJjqu5YT5n7-HsvE6UVV7z" alt="SuvenML_Intro_to_ML_video" width="130" height="70">
</a>



**`Step 3 : Feature Extraction and Engineering`** <br>
Let’s start by extracting the existing features and the outcomes from the dataset into separate variables.

`Note 1` : Features are input variables on which the ML model is trained. They are conventionally represented as X.

`Note 2` : Apart from identifier columns such as Name, the only column that is not in the set of features is the outcome, or label. When the outcome or label is available, it helps the ML model to map features to outcomes; that is why this is called `Supervised Learning`.

It's always advisable to start learning ML with Supervised ML, as its concepts are easier and quicker to understand.



In [None]:
#--Type your code here
#--get features and corresponding outcomes








  OverallGrade Obedient  ResearchScore  ProjectScore
0            A        Y             90            85
1            C        N             85            51
2            F        N             10            17
3            B        Y             75            71
4            E        N             20            30
5            A        Y             92            79
6            B        Y             60            59
7            C        Y             75            33
----------------
  Recommend
0       Yes
1       Yes
2        No
3        No
4        No
5       Yes
6        No
7        No
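
One possible way to fill in the exercise cell above, shown as a sketch rather than the only answer. The names `feature_names`, `training_features` and `outcome_labels` are the ones the later cells rely on; `outcome_name` is a name assumed here for illustration, and the DataFrame is rebuilt inline only so that the snippet runs on its own.

```python
import pandas as pd

# Same rows as student_records.csv, rebuilt inline so the snippet is self-contained
df = pd.DataFrame({
    'Name': ['Henry', 'John', 'David', 'Holmes', 'Marvin', 'Simon', 'Robert', 'Trent'],
    'OverallGrade': ['A', 'C', 'F', 'B', 'E', 'A', 'B', 'C'],
    'Obedient': ['Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y'],
    'ResearchScore': [90, 85, 10, 75, 20, 92, 60, 75],
    'ProjectScore': [85, 51, 17, 71, 30, 79, 59, 33],
    'Recommend': ['Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'No', 'No'],
})

# Features (X): every column except the student name and the outcome
feature_names = ['OverallGrade', 'Obedient', 'ResearchScore', 'ProjectScore']
training_features = df[feature_names].copy()

# Outcome / label (y): kept as a one-column DataFrame (outcome_name is an assumed name)
outcome_name = ['Recommend']
outcome_labels = df[outcome_name].copy()

print(training_features)
print('----------------')
print(outcome_labels)
```

Taking `.copy()` keeps later in-place changes (such as scaling) from touching the original `df`.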


> <font color ='green'> <b> I am sure you have understood that `Features` are what we want to observe.

> `Labels` are what we want to predict.</b> </font>

<h3>If you wish to see <b>What are Features and Labels</b>, then watch this video</h3>

<a href="https://drive.google.com/open?id=1Kfky3-TkJtGvJE77m2uy0VqAaD57NWQO">
  <img src="https://drive.google.com/uc?id=14OOsd0HaKoMJjqu5YT5n7-HsvE6UVV7z" alt="Features and Labels_rec_in_hindhi" width="130" height="70">
</a>

<small> Note : This video is recorded in Hindi and keeps things very simple. </small>

Now that we have extracted our initial available features from the data and their corresponding outcome labels, let’s separate out our available features based on their type (**`numerical`** and **`categorical`**).

In [None]:
#--list down features based on type
numeric_feature_names = ['ResearchScore', 'ProjectScore']
categoricial_feature_names = ['OverallGrade', 'Obedient']

To know <b>Types of Data : Categorical vs. Numerical</b> - watch this Video

<a href="https://drive.google.com/open?id=1fGUOOYV2ash-WL6-al8FvBf9z7twT7dQ">
  <img src="https://drive.google.com/uc?id=14OOsd0HaKoMJjqu5YT5n7-HsvE6UVV7z" alt="Categorical and Numerical DataTypes" width="110" height="60">
</a>

<small> Credits : This video is recorded by 365 DataScience Team.</small>

We will now use the `StandardScaler` from `scikit-learn` to **scale** or **normalize** our two numeric score-based attributes using the following code.

`Note 1 :` **Feature Scaling** is a technique to bring the independent features present in the data onto a comparable scale. It is performed during data pre-processing.

*For example, suppose you have a `youtube_Video_counts` dataset. Some videos have a very small count, say 50, and some a very high count, say 500000. If the ML model is trained on this data as-is, it would be biased towards videos with a view_count like 500000, so when we use the model to predict the count of a video it would mostly predict a high value.*

`Note 2 :` There are three commonly used scaling techniques or algos: namely **Standard Scaler**, **Min-Max Scaler** and **Robust Scaler**.

`Note 3 :` These scaling algos are defined in the **sklearn.preprocessing** module.

`Note 4 :` You will learn about each scaling technique in a later NB. Have patience. And yes: in this notebook **Scaling** and **Normalization** are used to mean the same thing, although strictly speaking normalization usually refers to rescaling into a fixed range such as 0 to 1.
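
To get a first feel for how the three scalers named in Note 2 differ, here is a small sketch that applies each of them to the ResearchScore column. The `results` dictionary is a name introduced here for illustration; the scores are the ones from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# ResearchScore values from the dataset, as a single-column 2D array
scores = np.array([[90], [85], [10], [75], [20], [92], [60], [75]], dtype=float)

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler()):
    name = type(scaler).__name__
    scaled = scaler.fit_transform(scores)   # learn the statistics, then rescale
    results[name] = scaled
    print(name, 'min:', round(float(scaled.min()), 3), 'max:', round(float(scaled.max()), 3))
```

With default settings, StandardScaler centres the column on its mean, MinMaxScaler squeezes it into 0 to 1, and RobustScaler centres it on the median and scales by the interquartile range.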

In [None]:
# to suppress any unwanted warnings
#--turn off warning messages
pd.options.mode.chained_assignment = None  # default='warn'
#----------------------------------------------------------------------

#--scale or normalize our two numeric score-based attributes
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()   # we are making the object of StandardScaler

# fit scaler on numeric features
ss.fit(training_features[numeric_feature_names])  # fit method learns the mean and standard deviation of each column

# scale numeric features now
training_features[numeric_feature_names] = ss.transform(training_features[numeric_feature_names])
# transform method rescales the data using those statistics. See the o/p: the values are now centred around 0 with a much smaller spread.

# view updated feature-set
print(training_features)

  OverallGrade Obedient  ResearchScore  ProjectScore
0            A        Y       0.899583      1.376650
1            C        N       0.730648     -0.091777
2            F        N      -1.803390     -1.560203
3            B        Y       0.392776      0.772004
4            E        N      -1.465519     -0.998746
5            A        Y       0.967158      1.117516
6            B        Y      -0.114032      0.253735
7            C        Y       0.392776     -0.869179


<b><font color='red'> Did you notice? </font></b> <br>
`Before Transformation` the range of ResearchScore was 10 to 92. <br>
`After Transformation` the range of ResearchScore is -1.803390 to 0.967158. The least value, 10, got scaled to -1.803390 and the max value, 92, got scaled to 0.967158, because the Standard Scaler subtracts the column mean from each value and divides by the column standard deviation.
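
The same numbers can be worked out by hand, which shows what the scaler is doing. A minimal sketch, assuming the population standard deviation (NumPy's default, `ddof=0`), which is what `StandardScaler` uses:

```python
import numpy as np

# ResearchScore column from the dataset
research = np.array([90, 85, 10, 75, 20, 92, 60, 75], dtype=float)

mean = research.mean()
std = research.std()          # population standard deviation (ddof=0)

# z-score: subtract the mean, divide by the standard deviation
z = (research - mean) / std

print('mean:', mean, 'std:', round(float(std), 4))
print(z.round(6))
```

The smallest and largest entries of `z` correspond to the scores 10 and 92, matching the scaled ResearchScore column printed above.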


Now that we have successfully scaled our numeric features, let’s handle our categorical features and carry out the necessary feature engineering. Here we convert the **`Categorical Data`** into **`Numeric values`**. <font color='green'> This is because the ML model does not understand string data. It only understands numeric inputs. </font>

There are many ways to do Feature Engineering over the **`Categorical Data`**. Here we use one of the most popular techniques: **One Hot Encoding**.

In [None]:
#--Engineering Categorical Features


# view new engineering features, where the categorical features are coded as binary


# We have converted our categorical data into numeric.
# or we can say we have done feature engineering over categorical data.

   ResearchScore  ProjectScore  ...  Obedient_N  Obedient_Y
0       0.899583      1.376650  ...           0           1
1       0.730648     -0.091777  ...           1           0
2      -1.803390     -1.560203  ...           1           0
3       0.392776      0.772004  ...           0           1
4      -1.465519     -0.998746  ...           1           0
5       0.967158      1.117516  ...           0           1
6      -0.114032      0.253735  ...           0           1
7       0.392776     -0.869179  ...           0           1

[8 rows x 9 columns]
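
A sketch of how the empty cell above could be filled in using `pandas.get_dummies`, the same function this notebook applies to the new data later on. Two things here are assumptions: `dtype=int` is added so that recent pandas versions print 0/1 instead of True/False, and the raw (unscaled) scores are used only to keep the snippet short and self-contained.

```python
import pandas as pd

# Feature columns from the dataset (scores left unscaled for brevity)
training_features = pd.DataFrame({
    'OverallGrade': ['A', 'C', 'F', 'B', 'E', 'A', 'B', 'C'],
    'Obedient': ['Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y'],
    'ResearchScore': [90, 85, 10, 75, 20, 92, 60, 75],
    'ProjectScore': [85, 51, 17, 71, 30, 79, 59, 33],
})
categoricial_feature_names = ['OverallGrade', 'Obedient']  # spelling as used in this notebook

# One Hot Encoding: every category value becomes its own 0/1 column
training_features = pd.get_dummies(
    training_features, columns=categoricial_feature_names, dtype=int
)

# view the engineered features
print(training_features)
print(training_features.columns.tolist())
```

Each student now has exactly one 1 among the `OverallGrade_*` columns and exactly one 1 among the `Obedient_*` columns.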


Are you confused about <b>why we do One Hot Encoding?</b> - watch this video

<a href="https://drive.google.com/open?id=11HGEF45Cz0co6ntKQ4P_N9XS8Um1Jgqc">
  <img src="https://drive.google.com/uc?id=14OOsd0HaKoMJjqu5YT5n7-HsvE6UVV7z" alt="One hot encoding" width="110" height="60">
</a>

<small> Credits : This video is recorded by TechDose Team.</small>

In [None]:
#--get list of new categorical features




['OverallGrade_F', 'OverallGrade_A', 'OverallGrade_B', 'OverallGrade_C', 'OverallGrade_E', 'Obedient_Y', 'Obedient_N']
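
One way the list above could be produced, given as a sketch: keep every column of the encoded frame that is not a numeric feature. The name `categorical_engineered_features` is an assumption made here, and the order of the result may differ from the output shown above.

```python
import pandas as pd

numeric_feature_names = ['ResearchScore', 'ProjectScore']

# Rebuild a small encoded frame so the snippet runs on its own
features = pd.DataFrame({
    'OverallGrade': ['A', 'C', 'F', 'B', 'E', 'A', 'B', 'C'],
    'Obedient': ['Y', 'N', 'N', 'Y', 'N', 'Y', 'Y', 'Y'],
    'ResearchScore': [90, 85, 10, 75, 20, 92, 60, 75],
    'ProjectScore': [85, 51, 17, 71, 30, 79, 59, 33],
})
encoded = pd.get_dummies(features, columns=['OverallGrade', 'Obedient'], dtype=int)

# Everything that is not a numeric feature came from One Hot Encoding
categorical_engineered_features = [
    col for col in encoded.columns if col not in numeric_feature_names
]
print(categorical_engineered_features)
```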


**`Step 4 : Modeling`**<br>
We will now build a simple classification (supervised) model based on our feature set by using the logistic regression algorithm. The following code depicts how to build the supervised model.


**Wait**: before moving ahead, I am assuming that you are very clear with the **first NB** on **1_Supervised & Unsupervised ML**.

In [None]:
from sklearn.linear_model import LogisticRegression  # importing the class.
                                                     # LogisticRegression is a common choice for binary classification
import numpy as np
import warnings
warnings.simplefilter('ignore')  # suppress warnings to keep the output short

#--fit the model
lr = LogisticRegression()  # making object of the LogisticRegression class.

model = lr.fit(training_features, np.array(outcome_labels['Recommend']))
# np.array() converts the 'Recommend' column (a pandas Series) into a NumPy array of labels
# Here we give 2 inputs, features and labels, so the model can learn which inputs produce which outputs.
# The model learns this relationship; hence we say it has been trained.
# As we gave both input features and output labels, this is called Supervised Learning.


# The model is ready; it can now be used to predict on new data.

In [None]:
# OK, now I am giving you some new student data; these students want to know whether they will be given the Research Grant or not.

new_data = pd.DataFrame([{'Name': 'Ninad', 'OverallGrade': 'F', 'Obedient': 'N', 'ResearchScore': 10, 'ProjectScore': 20},
                  {'Name': 'Alxis', 'OverallGrade': 'B', 'Obedient': 'Y', 'ResearchScore': 78, 'ProjectScore': 80},
                  {'Name': 'Faiz', 'OverallGrade': 'C', 'Obedient': 'N', 'ResearchScore': 69, 'ProjectScore': 70},
                  {'Name': 'Sejal', 'OverallGrade': 'A', 'Obedient': 'Y', 'ResearchScore': 98, 'ProjectScore': 88},
                  {'Name': 'Vijan', 'OverallGrade': 'E', 'Obedient': 'N', 'ResearchScore': 28, 'ProjectScore': 30}])

print(new_data)

    Name OverallGrade Obedient  ResearchScore  ProjectScore
0  Ninad            F        N             10            20
1  Alxis            B        Y             78            80
2   Faiz            C        N             69            70
3  Sejal            A        Y             98            88
4  Vijan            E        N             28            30


In [None]:
# w.r.t new data
# We will now carry out the tasks relevant to
# data preparation—feature extraction, engineering, and scaling
# in the following code snippet.  Same as what we did over training data.

#--data preparation
prediction_features = new_data[feature_names].copy()  # copy so that scaling does not modify new_data

#--scaling by using the StandardScaler object -> ss (already fitted on the training data)
prediction_features[numeric_feature_names] = ss.transform(prediction_features[numeric_feature_names])

#--engineering categorical variables -> using One Hot Encoding
prediction_features = pd.get_dummies(prediction_features, columns=categoricial_feature_names)

#--view feature set
print(prediction_features)
print("-----------------------------")
print(prediction_features.columns)

   ResearchScore  ProjectScore  ...  Obedient_N  Obedient_Y
0      -1.803390     -1.430636  ...           1           0
1       0.494137      1.160705  ...           0           1
2       0.190053      0.728815  ...           1           0
3       1.169881      1.506217  ...           0           1
4      -1.195221     -0.998746  ...           1           0

[5 rows x 9 columns]
-----------------------------
Index(['ResearchScore', 'ProjectScore', 'OverallGrade_A', 'OverallGrade_B',
       'OverallGrade_C', 'OverallGrade_E', 'OverallGrade_F', 'Obedient_N',
       'Obedient_Y'],
      dtype='object')


**Important :** We are safe, as the number of columns in training_features and prediction_features <font color='green'><b>is the same</b></font>.

**Don't worry**, *we will later come across cases where the test data or real-time data on which we wish to make predictions does not have the same number of features. In such cases we will have to add dummy feature columns. We will talk about this, and code it, some time later.*
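
One common way to handle that situation, sketched here under assumptions: after One Hot Encoding the new data, reindex its columns against the training columns and fill the missing dummy columns with 0. `training_columns` and `new_students` are hypothetical names, and the scores are made-up values standing in for already-scaled features.

```python
import pandas as pd

# Column layout the model was trained on (taken from the output above)
training_columns = ['ResearchScore', 'ProjectScore', 'OverallGrade_A', 'OverallGrade_B',
                    'OverallGrade_C', 'OverallGrade_E', 'OverallGrade_F',
                    'Obedient_N', 'Obedient_Y']

# Hypothetical new students: no grade C, E or F and no Obedient == 'N' appears
new_students = pd.DataFrame({
    'OverallGrade': ['A', 'B'],
    'Obedient': ['Y', 'Y'],
    'ResearchScore': [0.5, 1.1],   # made-up, already-scaled values
    'ProjectScore': [0.2, 0.9],
})

encoded = pd.get_dummies(new_students, columns=['OverallGrade', 'Obedient'], dtype=int)
print(encoded.shape)   # fewer columns than the training data

# Add the missing dummy columns (filled with 0) and match the training column order
aligned = encoded.reindex(columns=training_columns, fill_value=0)
print(aligned)
```

Note that `reindex` also silently drops any column that is not in `training_columns`, for example a grade that never appeared in the training data.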

In [None]:
# We have our complete feature set ready for all the new students.
# Let’s put our model to the test and get the predictions
# with regard to grant recommendations!

predictions = model.predict(prediction_features)

##--display results
new_data['Recommend'] = predictions
print(new_data)

    Name OverallGrade Obedient  ResearchScore  ProjectScore Recommend
0  Ninad            F        N             10            20        No
1  Alxis            B        Y             78            80        No
2   Faiz            C        N             69            70       Yes
3  Sejal            A        Y             98            88       Yes
4  Vijan            E        N             28            30        No


<font color='green'><b>Wow !!! </b></font>

You have done a great job in this NB.

Let me summarise for you :
> You understood that applying ML to some data is a <font color='green'> well defined process called ML pipeline </font>.

> **You first load the data**, like we loaded it from a .csv file. We could also load it from other sources, such as a database, a scraped website, JSON files, or PDF and Word documents.

> You then did the **required** pre-processing, like cleaning the data, **Scaling numeric data** and **Feature engineering on Categorical data**. Here we used <font color='green'>  Standard Scaler over Numeric data and One Hot Encoding over Categorical data.</font>

> You then loaded the required class, like we loaded (i.e. imported) the LogisticRegression class.  <br>
*`from sklearn.linear_model import LogisticRegression`* <br>
Remember, we will use the **sklearn library** a lot. It is one of the most widely used libraries in Machine Learning.

> Then I gave you some dummy student data. You smartly applied your ML model to it to classify **`who would get the Research Grant and who would not`**.

<h3><font color='green'>Trust me, you are doing great !! </font></h3>

<font color='green'><b>Thank you for going through the Notebook. I am sure it was a fruitful learning experience. </b></font>

Please do share your feedback or inputs with me on LinkedIn. ( <small> https://linkedin.com/in/rocky-jagtiani-3b390649/ </small> )
