## Foundations of Machine Learning
### Agenda
* Introduction to Machine Learning
* Supervised vs Unsupervised
* Types of Supervised Learning - Classification, Regression
* Data Ingestion
* Data Wrangling
* Data Preprocessing
* Model Training
* Model Validation
* Deployment

## 1. Introduction to Machine Learning
<hr>
* Machine Learning is a branch of AI in which applications learn automatically from data or experience.
* A traditional system needs explicit programming, whereas a machine learning system derives its logic from data.
* Machine Learning applications can determine whether an email is spam, predict fraud & personalize entertainment.
<img src="https://miro.medium.com/proxy/1*E80yVRL4jO2s6kttekkGzw.png">
<img src="https://miro.medium.com/max/1400/1*0Qasr3tTggxSav1npwifvA.png">

## 2. Supervised/Unsupervised Machine Learning 
<hr>
* Machine Learning applications can be trained on labeled data as well as unlabeled data.
* Labeled data consists of feature data (house details like sqft, num_rooms, num_floors etc.) and target data, or labels (price).
* When a model learns from labeled data, it's supervised learning.
* When a model learns patterns from unlabeled data, it's unsupervised learning.
* Example of unsupervised learning: grouping similar customers together.
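
The difference between the two kinds of data can be sketched in plain Python. This is only an illustration, not a production method: the house and customer values are invented, and `group_into_two` is a hypothetical helper that implements a tiny one-dimensional k-means with two groups.

```python
# Labeled data: every row pairs feature data with a target (label).
# All values below are invented for illustration.
labeled_houses = [
    # (sqft, num_rooms, num_floors), price
    ((1200, 3, 1), 150_000),
    ((1800, 4, 2), 240_000),
    ((2400, 5, 2), 310_000),
]
features = [row[0] for row in labeled_houses]   # input to a supervised model
targets = [row[1] for row in labeled_houses]    # what the model learns to predict

# Unlabeled data: only features, no target column.
customer_spend = [12.0, 15.0, 14.0, 95.0, 102.0, 98.0]


def group_into_two(values, iterations=10):
    """Tiny 1-D k-means with k=2: groups similar values together."""
    centers = [min(values), max(values)]        # simple deterministic start
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearest center.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return groups


low_spenders, high_spenders = group_into_two(customer_spend)
print(low_spenders)
print(high_spenders)
```

No labels were supplied to `group_into_two`; the two groups emerge from the spend values alone, which is what makes it unsupervised.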


## 3. Types of Supervised Learning - Regression & Classification
<hr>
* Regression - target data is continuous in nature:
  - House Price prediction
  - Rainfall prediction
* Classification - target data is categorical in nature:
  - Predicting spam mail
  - Predicting credit defaulter  
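
To make the contrast concrete, here is a minimal sketch that uses the same nearest-neighbor idea for both tasks. The k-nearest-neighbors technique is swapped in purely for illustration, and the data and function names are invented; the only difference between the two predictors is how the neighbors' targets are combined.

```python
from collections import Counter


def nearest_neighbors(train, query, k):
    """Return the targets of the k training points closest to `query` (1-D feature)."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - query))
    return [target for _, target in ranked[:k]]


def predict_regression(train, query, k=2):
    # Continuous target: average the neighbors' values.
    neighbors = nearest_neighbors(train, query, k)
    return sum(neighbors) / len(neighbors)


def predict_classification(train, query, k=3):
    # Categorical target: take the most common label among the neighbors.
    neighbors = nearest_neighbors(train, query, k)
    return Counter(neighbors).most_common(1)[0][0]


# Regression: sqft -> price (continuous).
house_prices = [(1000, 100_000.0), (1500, 160_000.0),
                (2000, 210_000.0), (2500, 270_000.0)]
# Classification: count of suspicious words -> label (categorical).
mail_labels = [(0, "ham"), (1, "ham"), (2, "ham"),
               (7, "spam"), (8, "spam"), (9, "spam")]

print(predict_regression(house_prices, 1800))    # a number
print(predict_classification(mail_labels, 6))    # a category
```
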
<img src="https://github.com/awantik/machine-learning-slides/blob/master/ML-Pipeline-business.png?raw=true">

## 4. Data Ingestion
<hr>
* Data gathering is one of the most important parts of machine learning application development.
* Without data there cannot be machine learning.
* Data can be **batched** as well as streamed.
* Batched data can be read by Spark from CSV, JSON, Hive, databases etc.
* Streamed data can be received from streaming platforms like Kafka.
* **Spark Streaming** can do processing on data received from Kafka.
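
The batch versus stream distinction can be sketched without a cluster. The example below is a stand-in only: an in-memory CSV string plays the role of a file, and a Python generator plays the role of a stream; it does not use Spark or Kafka, and the data and function names are invented.

```python
import csv
import io

# Stand-in for a CSV file on disk; the contents are invented.
CSV_TEXT = "sqft,price\n1200,150000\n1800,240000\n2400,310000\n"


def read_batch(text):
    """Batch ingestion: load the whole dataset at once."""
    return list(csv.DictReader(io.StringIO(text)))


def read_stream(text):
    """Streamed ingestion: yield one record at a time as it 'arrives'."""
    for record in csv.DictReader(io.StringIO(text)):
        yield record


batch = read_batch(CSV_TEXT)
print(len(batch), "records loaded in one go")

running_total = 0
for record in read_stream(CSV_TEXT):
    # Each record is processed as soon as it is available.
    running_total += int(record["price"])
    print("running total:", running_total)
```

In the batch case the full dataset exists before processing starts; in the streamed case results such as the running total are updated record by record.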

## 5. Data Wrangling
<hr>
* Data wrangling is the transformation of data into a more valuable format suited for downstream applications like Business Intelligence, Analytics etc.
* It involves:
  - Cleaning data
  - Removing incorrect data
  - Aggregating information
* In Python, the pandas library is commonly used for data wrangling tasks.
* PySpark provides **DataFrames** for data wrangling.
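
The three wrangling steps above (cleaning, removing incorrect data, aggregating) can be sketched in plain Python. The records and the `clean` helper are invented for illustration; with pandas or PySpark DataFrames the same steps would be expressed as column operations.

```python
from collections import defaultdict

# Invented raw records; some are dirty on purpose.
raw_sales = [
    {"city": " Pune ", "amount": "120.5"},
    {"city": "pune", "amount": "79.5"},
    {"city": "Delhi", "amount": "-40"},     # incorrect: negative amount
    {"city": "Delhi", "amount": "n/a"},     # incorrect: not a number
    {"city": "DELHI", "amount": "300"},
]


def clean(record):
    """Normalize one record, or return None if it is incorrect."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None
    if amount < 0:
        return None
    # Cleaning: trim whitespace and unify capitalization of the city name.
    return {"city": record["city"].strip().title(), "amount": amount}


# Removing incorrect data: drop records that failed cleaning.
cleaned = [c for c in map(clean, raw_sales) if c is not None]

# Aggregating information: total amount per city.
totals = defaultdict(float)
for row in cleaned:
    totals[row["city"]] += row["amount"]

print(dict(totals))
```
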

## 6. Data Preprocessing
<hr>
* PySpark provides a machine learning library for classification, regression, clustering etc.
* Before data can be fed to machine learning libraries, it needs to be transformed into the format those libraries expect.
* Scaling techniques bring numerical data to the same scale.
* Encoding techniques convert categorical data to numbers.
* Text vectorization techniques convert text to numeric vectors.
* These conversions are needed because ML algorithms only understand numbers.
* Dimensionality reduction techniques reduce noise in the data & improve computation speed.
* These preprocessors are available in the **pyspark.ml.feature** module.
* To use one, create an instance of the preprocessor & pass data to it.
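
The "create an instance, then pass data to it" pattern can be sketched with two toy preprocessors. `TinyMinMaxScaler` and `TinyOneHotEncoder` are hypothetical pure-Python classes written for this example; they are not the classes in `pyspark.ml.feature` and only imitate the general fit-then-transform idea on plain lists.

```python
class TinyMinMaxScaler:
    """Rescales a numeric column to the range [0, 1]."""

    def fit(self, values):
        # Learn the column's range from the data.
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = self.high - self.low
        # Guard against a constant column, where span would be zero.
        return [(v - self.low) / span if span else 0.0 for v in values]


class TinyOneHotEncoder:
    """Converts a categorical column to 0/1 vectors."""

    def fit(self, values):
        # Learn the set of categories, in a fixed (sorted) order.
        self.categories = sorted(set(values))
        return self

    def transform(self, values):
        return [[1 if v == c else 0 for c in self.categories] for v in values]


sqft = [1000, 1500, 2000]          # numerical feature (invented values)
city = ["Pune", "Delhi", "Pune"]   # categorical feature (invented values)

scaled = TinyMinMaxScaler().fit(sqft).transform(sqft)
encoded = TinyOneHotEncoder().fit(city).transform(city)
print(scaled)     # [0.0, 0.5, 1.0]
print(encoded)    # [[0, 1], [1, 0], [0, 1]]
```

Separating `fit` from `transform` lets the range or category list learned on training data be reused unchanged on new data.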

## 7. Model Training
<hr>
* At this stage, data is ready for machine learning.
* Based on the problem statement, we choose either a classification algorithm or a regression algorithm.
* Learning algorithms are part of **pyspark.ml.classification**, **pyspark.ml.regression** etc.
* For supervised learning, pass feature data & target data.
* For unsupervised learning, pass the data.
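
What "passing feature data & target data" to a learning algorithm means can be sketched with a one-feature linear model trained by gradient descent. `LinearModel`, its default settings and the data are assumptions made for this example; the algorithms in `pyspark.ml.regression` operate on DataFrames and are not shown here.

```python
class LinearModel:
    """y = weight * x + bias, trained with batch gradient descent."""

    def __init__(self, learning_rate=0.05, epochs=2000):
        self.learning_rate = learning_rate
        self.epochs = epochs
        self.weight = 0.0
        self.bias = 0.0

    def fit(self, features, targets):
        """Supervised training: needs both features and targets."""
        n = len(features)
        for _ in range(self.epochs):
            errors = [self.predict(x) - y for x, y in zip(features, targets)]
            # Gradients of the mean squared error w.r.t. weight and bias.
            grad_w = 2 * sum(e * x for e, x in zip(errors, features)) / n
            grad_b = 2 * sum(errors) / n
            self.weight -= self.learning_rate * grad_w
            self.bias -= self.learning_rate * grad_b
        return self

    def predict(self, x):
        return self.weight * x + self.bias


features = [0.0, 1.0, 2.0, 3.0]
targets = [1.0, 3.0, 5.0, 7.0]      # generated from y = 2x + 1
model = LinearModel().fit(features, targets)
print(round(model.weight, 3), round(model.bias, 3))
print(round(model.predict(4.0), 3))
```

An unsupervised algorithm would expose a similar `fit` step but take only the data, since there is no target to compare predictions against.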

## 8. Model Validation & Selection
<hr>
* There are multiple models to choose from & multiple ways to configure each model.
* The model with the best balance of bias & variance should be chosen.
* To find it, we need to cross-validate our models.
* This step also includes hyper-parameter tuning.
* PySpark provides **pyspark.ml.tuning** for cross-validation & hyper-parameter tuning.
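
Cross-validation and hyper-parameter tuning can be sketched together in plain Python. The example below is an illustration under stated assumptions: the data is invented, the model is a small k-nearest-neighbors regressor, the hyper-parameter being tuned is the number of neighbors `k`, and all function names are hypothetical. In Spark, `pyspark.ml.tuning` provides `CrossValidator` and `ParamGridBuilder` for the same purpose.

```python
def k_fold_splits(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_rows))
    for fold in range(n_folds):
        validation = indices[fold::n_folds]      # every n_folds-th row
        held_out = set(validation)
        train = [i for i in indices if i not in held_out]
        yield train, validation


def knn_predict(train_pairs, query, k):
    """Average the targets of the k nearest training points (1-D feature)."""
    ranked = sorted(train_pairs, key=lambda pair: abs(pair[0] - query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_val_error(data, k, n_folds=3):
    """Mean squared validation error of k-NN, averaged over all folds."""
    fold_errors = []
    for train_idx, val_idx in k_fold_splits(len(data), n_folds):
        train_pairs = [data[i] for i in train_idx]
        squared = [(knn_predict(train_pairs, data[i][0], k) - data[i][1]) ** 2
                   for i in val_idx]
        fold_errors.append(sum(squared) / len(squared))
    return sum(fold_errors) / len(fold_errors)


# Invented data following roughly y = 2x.
noise = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0]
data = [(x, 2.0 * x + n) for x, n in zip(range(9), noise)]

candidates = [1, 2, 6]                           # hyper-parameter grid
scores = {k: cross_val_error(data, k) for k in candidates}
best_k = min(scores, key=scores.get)
print(scores)
print("best k:", best_k)
```

Every row is used for validation exactly once, so each candidate is scored on data it was not trained on. With 3 folds of 9 rows, `k=6` averages all 6 training rows and underfits (high bias), while `k=1` follows single noisy points (high variance).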

## 9. Deployment
<hr>
* Once the model is validated & selected, it needs to be deployed.
* The model is persisted for future reuse & for shipping to customers.
* A persisted model can be re-trained as well.
* A RESTful interface to the model can be exposed for consumption.
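
Persisting and reloading a model can be sketched as writing its learned parameters to disk and reading them back. This is a simplified stand-in: the parameter dictionary, the JSON format, the file name and the helper functions are all assumptions made for this example, not the persistence mechanism of any particular library.

```python
import json
import os
import tempfile


def save_model(params, path):
    """Persist learned parameters as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(params, handle)


def load_model(path):
    """Restore parameters saved by save_model."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def predict(params, x):
    return params["weight"] * x + params["bias"]


trained = {"weight": 2.0, "bias": 1.0}       # stand-in for a trained model

with tempfile.TemporaryDirectory() as folder:
    model_path = os.path.join(folder, "model.json")
    save_model(trained, model_path)
    restored = load_model(model_path)        # e.g. later, in a serving process

print(predict(restored, 4.0))
```

A RESTful service built on this idea would typically load the model once at startup and call a function like `predict` for each incoming request.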