# Machine Learning

Machine learning extracts patterns from data, allowing computers to identify related data and forecast future outcomes, behaviors, and trends.
* application and pattern recognition
* anomaly detection
* forecasting
* recommendation

1. Collect Data
2. Prepare Data
3. Train Model - feature vectorization, scaling, tuning, performance
4. Evaluate Model - from validation set, test and compare
5. Deploy
6. Retrain Model - fresh data



* Numerical: int, float
* Time-series
* Speech
* Categorical: cardinality
* Text
* Image


Machine learning algorithms work with numbers or vectors.
A vector is an array of numbers, or nested arrays of arrays of numbers.

* Continuous numerical features
    * Scale data
    

## Scaling Data

Scaling data means transforming it so that the values fit within some range or scale, such as 0–100 or 0–1. Because every value of a feature is scaled in the same way, the relationships between the values are preserved. Scaling can also speed up the training process, because the algorithm only needs to handle numbers in a small range (for example, less than or equal to 1).


Two common approaches to scaling data include standardization and normalization:
 
* **Standardization** : rescale data to have mean = 0 and variance = 1
 
> $\large \frac {x-\mu}{\sigma}$

* **Normalization** : rescale data into range [0,1]

> $\large \frac {x-x_{min}}{x_{max}-x_{min}}$

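```python
import numpy as np

arr = np.array([-5, 10, 15])

# Standardization: (x - mu) / sigma, with 7 and 10 used as rough values for mu and sigma
(arr - 7) / 10
# array([-1.2,  0.3,  0.8])

# Normalization: (x - x_min) / (x_max - x_min)
(arr - arr.min()) / (arr.max() - arr.min())
# array([0.  , 0.75, 1.  ])
```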
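The two formulas can also be applied with the actual mean and standard deviation of the data rather than the constants 7 and 10. The following is a minimal sketch using only the Python standard library. The helper names `standardize` and `normalize` are assumptions made for this example, and it uses the population standard deviation.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale values to mean 0 and variance 1: (x - mu) / sigma."""
    mu = fmean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption of this sketch)
    return [(x - mu) / sigma for x in values]

def normalize(values):
    """Rescale values into the range [0, 1]: (x - x_min) / (x_max - x_min)."""
    lowest, highest = min(values), max(values)
    return [(x - lowest) / (highest - lowest) for x in values]

data = [-5, 10, 15]
standardized = standardize(data)
normalized = normalize(data)

print(standardized)
print(normalized)  # [0.0, 0.75, 1.0]
```

Both helpers would divide by zero if every value were identical, so a real pipeline would need to handle constant features separately.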

## Encoding Categorical Data

When we have categorical data, we need to encode it in some way so that it is represented numerically.
There are two common approaches for encoding categorical data: ordinal encoding and one-hot encoding.

|SKU|Make|
|---|---|
|908721|Guess|
|456552|TILLYS|
|789921|A&F|
|872266|Guess|



* **Ordinal Encoding**: converts the categorical data into integer codes ranging from 0 to (number of categories – 1). Cons: it implicitly assumes an order across the categories.

|Make|Encoding|
|---|---|
|A&F|0|
|Guess|1|
|TILLYS|2|


|SKU|Make|
|---|---|
|908721|1|
|456552|2|
|789921|0|
|872266|1|

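* **One-Hot Encoding**: turns each category into its own column, with a 1 marking the category of a row and 0 everywhere else. Cons: more columns.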

|SKU|A&F|Guess|TILLYS|
|---|---|---|---|
|908721|0|1|0|
|456552|0|0|1|
|789921|1|0|0|
|872266|0|1|0|
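Both encodings of the Make column can be reproduced in a few lines. This is a minimal sketch in plain Python. Assigning codes in sorted order is an assumption chosen to match the encoding table above, not the only possible convention.

```python
makes = ["Guess", "TILLYS", "A&F", "Guess"]

# Ordinal encoding: number the distinct categories from 0 to (number of categories - 1).
categories = sorted(set(makes))  # ['A&F', 'Guess', 'TILLYS']
codes = {category: index for index, category in enumerate(categories)}
ordinal = [codes[make] for make in makes]

# One-hot encoding: one column per category, 1 where the row has that category.
one_hot = [[1 if make == category else 0 for category in categories] for make in makes]

print(ordinal)  # [1, 2, 0, 1]
for row in one_hot:
    print(row)
```

Each one-hot row contains exactly one 1, so the number of columns grows with the number of distinct categories.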




## Image Data

An image can be encoded numerically:

* In grayscale images, each pixel can be represented by a single number, which typically ranges from 0 to 255

* In colored images, each pixel can be represented by a vector of three numbers (each ranging from 0 to 255) for the three primary color channels: red, green, and blue


For example, an RGB image might have:
* Height = 100 px
* Width = 100 px
* Channels = 3

The number of channels required to represent the color is known as the color depth or simply depth.
* RGB image depth = 3
* Grayscale image depth = 1



A 100 px by 100 px RGB image is therefore encoded with 100 * 100 * 3 = 30,000 numbers.

### Encoding an Image

To reproduce an image, we need to know the following three things about it:

* Horizontal position of each pixel
* Vertical position of each pixel
* Color of each pixel

> We can fully encode an image numerically by using a vector with three dimensions. The size of the vector required for any given image is the height * width * depth of that image.
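As a rough illustration of this encoding, the sketch below builds a hypothetical all-black 100 x 100 RGB image as nested Python lists, colors one pixel, and flattens it into a single vector. The row-major layout (row, then column, then channel) is an assumption of this example.

```python
height, width, depth = 100, 100, 3

# Hypothetical all-black RGB image: image[row][col] is a [red, green, blue] pixel.
image = [[[0, 0, 0] for _ in range(width)] for _ in range(height)]

# Set the pixel at row 10, column 20 to pure red.
image[10][20] = [255, 0, 0]

# Flatten into one long vector of height * width * depth numbers.
flat = [channel for row in image for pixel in row for channel in pixel]
print(len(flat))  # 30000

# Where the red pixel starts in the flat vector (row-major order).
index = (10 * width + 20) * depth
print(flat[index:index + depth])  # [255, 0, 0]
```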



```python
15 * 15
# 225
```

---
# Text Data

## Normalization

> Text normalization is the process of transforming a piece of text into a canonical (official) form


Text often contains different forms of the same word. For example, the verb *to be* may show up as *is*, *am*, *are*, and so on, or a document may contain alternative spellings of a word, such as *behavior* vs. *behaviour*. So one step that you will sometimes conduct in processing text is normalization.

### Lemmatization

Lemmatization is an example of normalization. A lemma is the dictionary form of a word, and lemmatization is the process of reducing multiple inflections to that single dictionary form.

|Original word | Lemmatized word |
|---|---|
|is|be|
|are|be|
|am|be|

### Stop words 
Stop words are high-frequency words that are unnecessary (or unwanted) during the analysis. For example, when you enter a query like ***which cookbook has the best pancake recipe*** into a search engine, the words *which* and *the* are far less relevant than *cookbook*, *pancake*, and *recipe*.

In the example below, the text has been tokenized (i.e., each string of text split into a list of smaller parts, or tokens), stop words have been removed, and the spelling has been standardized.

|Original text|Normalized text|
|---|---|
|The quick fox|[quick, fox]|
|The lazzy dog|[lazy,dog]|
|The rabid hare|[rabid,hare]|
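The normalization steps in the table can be sketched in plain Python. The stop-word set and the spelling lookup table below are tiny hypothetical stand-ins, not a real linguistic resource, and `normalize_text` is a name made up for this example.

```python
STOP_WORDS = {"the", "a", "an", "is", "which", "has"}  # tiny illustrative list
SPELLING_FIXES = {"lazzy": "lazy"}                      # hypothetical lookup table

def normalize_text(text):
    """Tokenize on whitespace, standardize spelling, and drop stop words."""
    tokens = text.lower().split()
    tokens = [SPELLING_FIXES.get(token, token) for token in tokens]
    return [token for token in tokens if token not in STOP_WORDS]

for sentence in ["The quick fox", "The lazzy dog", "The rabid hare"]:
    print(sentence, "->", normalize_text(sentence))
```

Real text usually also needs punctuation handling and lemmatization, which this sketch leaves out.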



## Vectorization

After the text has been normalized, the next step is vectorization: encoding it in a numerical form. The goal here is to identify the particular features of the text that will be relevant to us for the particular task we want to perform, and then get those features extracted in a numerical form that is accessible to the machine learning algorithm.

Common approaches to vectorize a word or a sentence:
* Term Frequency-Inverse Document Frequency (TF-IDF) vectorization
* Word embedding, as done with Word2vec or Global Vectors (GloVe)

### TF-IDF

TF-IDF gives less importance to words that contain less information and are common in documents, such as "the" and "this", and gives higher importance to words that contain relevant information and appear less frequently. Thus TF-IDF assigns weights to words that signify their relevance in the documents.

||quick|fox|lazy|dog|rabid|hare|
|---|---|---|---|---|---|---|
|[quick,fox]|0.32|0.23|0|0|0|0|
|[lazy,dog]|0|0|0.12|0.23|0|0|
|[rabid,hare]|0|0|0|0|0.56|0.12|
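The weights in the table above are illustrative and are not reproduced by the sketch below. It uses one common textbook definition, assumed here: term frequency (count of the term divided by document length) multiplied by inverse document frequency (log of the number of documents divided by the number of documents containing the term). Libraries often use smoothed or normalized variants, so their numbers will differ.

```python
import math

documents = [["quick", "fox"], ["lazy", "dog"], ["rabid", "hare"]]
vocabulary = ["quick", "fox", "lazy", "dog", "rabid", "hare"]

def tf_idf(term, document, corpus):
    """Textbook TF-IDF: (term count / document length) * log(N / document frequency)."""
    tf = document.count(term) / len(document)
    docs_with_term = sum(1 for doc in corpus if term in doc)
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

vectors = [[round(tf_idf(term, doc, documents), 2) for term in vocabulary]
           for doc in documents]

for doc, vector in zip(documents, vectors):
    print(doc, vector)
```

Because every word in this toy corpus appears in exactly one document, all non-zero weights come out equal. Differences in weight only appear when some words are shared across documents.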

### Feature Extraction 

The text in the example can be represented by vectors of length 6, since there are 6 distinct words in total.

```
[quick,fox] as (0.32,0.23,0,0,0,0)
[lazy,dog] as (0,0,0.12,0.23,0,0)
[rabid,hare] as (0,0,0,0,0.56,0.12)
```

Any vectors with the same length can be visualized in the same space. How close one vector is to another can be calculated as a vector distance. If two vectors are close to each other, we can say the texts they represent have a similar meaning or some connection.
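Two common ways to measure closeness are Euclidean distance and cosine similarity. The sketch below applies both to the three example vectors, plus a hypothetical `query` vector for the tokens [quick, hare] whose weights are invented for illustration.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1 = same direction, 0 = orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

quick_fox = (0.32, 0.23, 0, 0, 0, 0)
lazy_dog = (0, 0, 0.12, 0.23, 0, 0)
rabid_hare = (0, 0, 0, 0, 0.56, 0.12)
query = (0.32, 0, 0, 0, 0, 0.12)  # hypothetical vector for [quick, hare]

for name, vector in [("quick_fox", quick_fox), ("lazy_dog", lazy_dog), ("rabid_hare", rabid_hare)]:
    print(name, euclidean_distance(query, vector), cosine_similarity(query, vector))
```

The query shares no words with [lazy, dog], so their cosine similarity is zero, while it overlaps with the other two documents to different degrees.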




A typical pipeline for text data begins by pre-processing or normalizing the text. This step typically includes tasks such as breaking the text into sentence and word tokens, standardizing the spelling of words, and removing overly common words (called stop words).

The next step is feature extraction and vectorization, which creates a numeric representation of the documents.

Last, we feed the vectorized documents and labels into a model and start the training.

### Computer Science and Statistical Perspective

* Computer Science Perspective
```
Output Variable = f(Input Variables)
```

* Statistical Perspective
```
Dependent Variable = f(Independent Variables)
```

### ML

* **Libraries**
    * Scikit-Learn
    * Keras
    * Tensorflow
    * PyTorch
    
* **Dev Env**
    * Jupyter Notebooks
    * Azure Notebooks
    * Azure Databricks
    * VS Code
    * VS
    
* **Cloud Services**
    * MS Azure ML


When you are working with very large amounts of data, or you need a faster processor, it's a better idea to train and test the model remotely using cloud services such as Microsoft Azure. You can use Azure Data Science Virtual Machine, Azure Databricks, Azure Machine Learning Compute, or SQL Server ML Services to train and test models, and Azure Kubernetes Service to deploy models.

### Cloud Services for Machine Learning

Core asset management

|Feature|Description|
|---|---|
|Datasets|Define, version, and monitor datasets used in machine learning runs.|
|Experiments / Runs|Organize machine learning workloads and keep track of each task executed through the service.|
|Pipelines|Structured flows of tasks to model complex machine learning flows.|
|Models|Model registry with support for versioning and deployment to production.|
|Endpoints|Expose real-time endpoints for scoring as well as pipelines for advanced automation.|


Resources Management

|Feature|Description|
|---|---|
|Compute|Manage compute resources used by machine learning tasks.|
|Environments|Templates for standardized environments used to create compute resources.|
|Datastores|Data sources connected to the service environment (e.g. blob stores, file shares, Data Lake stores, databases).|

### Models vs Algorithms

> Models are the specific representations learned from data


> Algorithms are the processes of learning


The objective of machine learning is to learn a model.

We can think of an algorithm as a function: we give the algorithm data and it produces a model.


> **Model = Algorithm(Data)**

* Algorithm

<center>$y = Wx + b$</center>

* Data

|x|y|
|---|---|
|1|1|
|2|2|
|3|3|

* Model

<center>$y = 1*x + 0$</center>

An algorithm can be thought of as a mathematical tool that can usually be represented by an equation as well as implemented in code.


Machine learning models are outputs or specific representations of algorithms that run on data. A model represents what is learned by a machine learning algorithm on the data.
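The idea that Model = Algorithm(Data) can be sketched in code. Here the "algorithm" is a least-squares line fit, which is an assumption of this example (the notes above only give the form y = Wx + b), and `linear_algorithm` is a hypothetical name.

```python
def linear_algorithm(data):
    """Algorithm: least-squares fit of y = W*x + b. Returns the learned (W, b)."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    W = (sum((x - mean_x) * (y - mean_y) for x, y in data)
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - W * mean_x
    return W, b

# Data
data = [(1, 1), (2, 2), (3, 3)]

# Model = Algorithm(Data): the learned parameters define the model.
W, b = linear_algorithm(data)

def model(x):
    return W * x + b

print(W, b)      # 1.0 0.0
print(model(4))  # 4.0
```

The same algorithm run on different data would return different values of W and b, that is, a different model.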





---

# Linear Regression

* Simplifies the target function Y to a line
* The core idea is to obtain the line that best fits the data
* The best-fit line is the one for which the total prediction error over all data points is as small as possible
* Error is the distance between a data point and the regression line
* The coefficients $B_0$ through $B_n$ are learned during training (fitting)

## Simple Linear Regression

$Y = B_0 + B_1*X$

* $B_0$ is the intercept or bias; it determines where the line intercepts the y-axis
* $B_1$ is the slope; it defines the slope of the line


## Multiple Linear Regression

$Y = B_0 + B_1*X_1 + B_2*X_2 + ... + B_n*X_n$


### 1. Linear Assumption

Linear regression assumes that the relationship between your input and output is linear.

### 2. Remove Noise
Clean the data and remove outliers, since noisy data distorts the fitted line.
 
### 3. Remove Collinearity
Linear regression will overfit when you have highly correlated input variables.
Consider calculating pairwise correlations for the input variables and removing the most correlated.

When two variables are collinear, this means they can be modeled by the same line or are at least highly correlated; in other words, one input variable can be accurately predicted by the other

 Having highly correlated input variables will make the model less consistent, so it's important to perform a correlation check among input variables and remove highly correlated input variables.
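A correlation check of this kind can be sketched with the Pearson correlation coefficient. The feature names, values, and the 0.9 threshold below are all hypothetical, chosen only to show one redundant pair of inputs.

```python
import math
from itertools import combinations

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    return covariance / math.sqrt(spread_a * spread_b)

# Hypothetical input variables: the first two measure the same thing in different units.
features = {
    "size_sqft": [800, 1000, 1200, 1500, 2000],
    "size_sqm": [74.3, 92.9, 111.5, 139.4, 185.8],
    "age_years": [30, 5, 18, 42, 11],
}

THRESHOLD = 0.9  # assumed cut-off; choose one that suits the problem
highly_correlated = [
    (a, b)
    for a, b in combinations(features, 2)
    if abs(pearson(features[a], features[b])) > THRESHOLD
]
print(highly_correlated)  # [('size_sqft', 'size_sqm')]
```

Given a flagged pair, one of the two variables would typically be dropped before fitting.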

### 4. Gaussian Distribution

Linear regression makes more reliable predictions when the input and output variables have a Gaussian distribution.

Linear regression assumes that the distance between the predicted outputs and the real data (called the residual) is normally distributed. If this is not the case in the raw data, you will need to first transform the data so that the residual has a normal distribution.


### 5. Rescale inputs

Rescale input variables using standardization or normalization.






## Linear Regression Algorithm

Model representation : $Y = B_0 + B_1 * X$

Estimate Slope $B_1$ : $B_1 = \frac{\sum_{i=1}^n(x_i - mean(x)) * (y_i -mean(y))}{\sum_{i=1}^n(x_i - mean(x))^2}$


Estimate Intercept $B_0$ : $B_0 = mean(y) - B_1 * mean(x)$


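Making predictions: substitute the estimated $B_0$ and $B_1$ into the model to compute a prediction $p_i$ for each input $x_i$.

Estimating error: root mean squared error (RMSE): $RMSE = \sqrt{\frac{\sum_{i=1}^n(p_i - y_i)^2}{n}}$, where $p_i$ is the predicted value and $y_i$ is the actual value.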

> The process of finding the best model is essentially a process of finding the coefficients and bias that minimize this error. 

To calculate this error, we use a cost function. There are many cost functions you can choose from to train a model, and the resulting error will be different depending on which cost function you choose. The most commonly used cost function for linear regression is the **root mean squared error (RMSE)**.

slope = coefficient

intercept = bias

rmse = cost function
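The slope, intercept, and RMSE formulas above translate directly into code. The sketch below uses a small made-up dataset, and the function names are assumptions of this example.

```python
import math

def fit_simple_linear_regression(xs, ys):
    """Estimate the intercept B0 and slope B1 with the formulas above."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    return b0, b1

def rmse(predictions, targets):
    """Root mean squared error between predictions p_i and actual values y_i."""
    squared_errors = [(p - y) ** 2 for p, y in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(targets))

# Hypothetical training data
xs = [1, 2, 3, 4, 5]
ys = [1, 3, 2, 3, 5]

b0, b1 = fit_simple_linear_regression(xs, ys)
predictions = [b0 + b1 * x for x in xs]
error = rmse(predictions, ys)

print(b0, b1)  # roughly 0.4 and 0.8
print(error)   # roughly 0.69
```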

Machine learning algorithms aim to learn a target function $f$ that describes the mapping between data $X$ and output $Y$:

$Y = f(X) + e$

Here $e$ is the irreducible error. Irreducible error is caused by the data collection process, such as when we don't have enough data or don't have enough data features. In contrast, the model error measures how much the prediction made by the model differs from the true output.

# Parametric vs Non-parametric

Machine learning algorithms can be divided into two categories:

* Parametric
    * Simplify the mapping to a known functional form, e.g. linear regression / multiple linear regression / logistic regression
    * Benefits: simpler / faster / less training data
    * Limitations: highly constrained / limited complexity / poor fit
* Non-parametric
    * Make no assumptions regarding the form of the mapping between input and output, e.g. k-nearest neighbors
    * Benefits: high flexibility / power / high performance
    * Limitations: more training data / slower / overfitting the training data


Deep learning is a category of machine learning based on neural networks.

All DL algorithms are machine learning algorithms, but not all ML algorithms are DL algorithms.

Deep learning advantages:

1. Learns arbitrarily complex functions from data
2. Better accuracy
3. Better support for big data
4. Complex features can be learned

Deep learning disadvantages:

1. Difficult to explain trained models
2. Requires significant computational power

Classical ML advantages:

1. More suitable for small data
2. Easier to interpret outcomes
3. Cheaper to perform
4. Can run on low-end machines
5. Does not require large computational power

Classical ML disadvantages:

1. Difficult to learn from large datasets
2. Require feature engineering
3. Difficult to learn complex functions


|Supervised Learning|Unsupervised Learning|Reinforcement Learning|
|---|---|---|
|Learns from data that contains both the inputs and expected outputs|Learns from data that contains only inputs and finds hidden structures in data|Learns how an agent should take actions in an environment to maximize a reward function|
|Classification: outputs are categorical|Clustering: assigns entities to clusters or groups|Markov decision process: does not assume knowledge of an exact mathematical model|
|Regression: outputs are numerical|Feature learning: features are learned from unlabeled data| |
|Similarity learning: learns from examples using a similarity function (ranking / recommender systems)|Anomaly detection: learns from unlabeled data assuming most entities are normal| |
|Feature learning: features are learned using labeled data| | |
|Anomaly detection: learns from data labeled as normal/abnormal| | |

# Bias vs Variance

Bias measures how inaccurate the model prediction is in comparison with the true output. It is due to erroneous assumptions made in the machine learning process to simplify the model and make the target function easier to learn. High model complexity tends to have a low bias.

Variance measures how much the target function will change if different training data is used. Variance can be caused by modeling the random noise in the training data. High model complexity tends to have a high variance.

* Parametric and linear algorithms have high bias and low variance.
* Non-parametric and nonlinear algorithms have low bias and high variance.

The prediction error can be viewed as the sum of model error (error coming from the model) and the irreducible error (coming from data collection).


```
prediction error = bias error + variance error + irreducible error
```

Low bias: fewer assumptions about the target function

High bias: more assumptions about the target function

Low variance: changes in the training data cause small changes in the function estimate

High variance: changes in the training data cause large changes in the function estimate

Goal of ML: low bias, low variance


Low bias means fewer assumptions about the target function. Some examples of algorithms with low bias are KNN and decision trees. Having fewer assumptions can help generalize relevant relations between features and target outputs. In contrast, high bias means more assumptions about the target function. Linear regression would be a good example (e.g., it assumes a linear relationship). Having more assumptions can potentially miss important relations between features and outputs and cause underfitting.

Low variance indicates changes in training data would result in similar target functions. For example, linear regression usually has a low variance. High variance indicates changes in training data would result in very different target functions. For example, support vector machines usually have a high variance. High variance suggests that the algorithm learns the random noise instead of the output and causes overfitting.

Generally, increasing model complexity would decrease bias error since the model has more capacity to learn from the training data. But the variance error would increase if the model complexity increases, as the model may begin to learn from noise in the training data.

The goal of training machine learning models is to achieve low bias and low variance. The optimal model complexity is where bias error crosses with variance error.
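The trade-off can be observed in a small simulation. The sketch below is a toy experiment under stated assumptions: a made-up target function $f(x) = x^2$, Gaussian noise with standard deviation 0.3, 11 evenly spaced training inputs, and two models (a straight-line fit and 1-nearest-neighbor) evaluated at a single point. It retrains each model on many noisy training sets and estimates squared bias and variance of the predictions.

```python
import random

def true_function(x):
    return x ** 2  # assumed target function for this toy experiment

def fit_line(xs, ys):
    """High-bias model: least-squares straight line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_one_nearest_neighbor(xs, ys):
    """High-variance model: predict the output of the closest training input."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict

def bias_and_variance(fit, x0, trials=2000, noise=0.3, seed=0):
    """Estimate squared bias and variance of predictions at x0 over many training sets."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(11)]
    predictions = []
    for _ in range(trials):
        ys = [true_function(x) + rng.gauss(0, noise) for x in xs]
        predictions.append(fit(xs, ys)(x0))
    mean_prediction = sum(predictions) / trials
    bias_squared = (mean_prediction - true_function(x0)) ** 2
    variance = sum((p - mean_prediction) ** 2 for p in predictions) / trials
    return bias_squared, variance

line_bias, line_var = bias_and_variance(fit_line, x0=0.5)
knn_bias, knn_var = bias_and_variance(fit_one_nearest_neighbor, x0=0.5)

print("line: bias^2 =", line_bias, "variance =", line_var)
print("1-NN: bias^2 =", knn_bias, "variance =", knn_var)
```

In this setup the straight line shows the larger squared bias (it cannot follow the curve), while 1-nearest-neighbor shows the larger variance (it reproduces the noise of a single training point).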


# Overfitting vs Underfitting

Overfitting refers to the situation in which models fit the training data very well, but fail to generalize to new data.

Underfitting refers to the situation in which models neither fit the training data nor generalize to new data.

Techniques that help detect or reduce overfitting include:

* Use k-fold cross-validation: split the initial training data into k subsets and train the model k times. In each training run, one subset is used as the testing data and the rest as training data.
* Hold back a validation dataset from the initial training data to estimate how well the model generalizes to new data.
* Simplify the model, for example by using fewer layers or fewer neurons to make a neural network smaller.
* Use more data.
* Reduce the dimensionality of the training data with a technique such as PCA, which projects the training data into a smaller number of dimensions to decrease the model complexity.
* Stop the training early when the performance on the testing dataset has not improved after a number of training iterations.
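The k-fold splitting step can be sketched as an index generator. This is a minimal illustration: `k_fold_splits` is a hypothetical helper, and it does not shuffle the data, which would usually be done first in practice.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

splits = list(k_fold_splits(10, 3))
for train, test in splits:
    print("train:", train, "test:", test)
```

Each sample lands in the testing subset exactly once, so averaging the k evaluation scores uses every data point for both training and testing.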