# Intro to Machine Learning.

- Analytics = AI + ML + DL + tools used for creating value from data.
- AI = Systems and algos that exhibit human-like intelligence.
- ML = Subset of AI that uses statistical algos to extract intelligence from big data.

- ML : It gives computers the ability to learn without being explicitly programmed.
- model = simplified representation of reality created to serve some purpose.
- A prediction model is a formula for estimating the unknown value of interest: **the target**.
- In data science, prediction more generally means to estimate an unknown value.
- Since data mining techniques involve collecting huge amounts of historical data, models are very often built and tested using events from the past.

<pre>

                               transduction
                        data -----------------> prediction
                        |                           ^    
                        |                           |
            induction   |                           |
                        |                           | deduction
                        |                           |
                        V                           |
                        model ----------------------|

</pre>

## Intro:
- AI is a discipline
- ML is a subfield of AI
- DL is a subfield of ML.

## Classification of ML algorithms:
- Supervised
- Unsupervised.

- if y = x -> Unsupervised learning
- if y = {0, 1} -> Supervised **binary classification**
- if y = {0, 1, ...} -> Supervised **multiclass classification**
- if y = (-inf, inf), i.e. any real number -> Supervised **regression**

## Supervised ML algorithms:
- require the knowledge of both the outcome variable (dependent variable) and the features (independent variables).
- usually a **loss function** is required.
- eg: linear regression and logistic regression.
- ps: despite its name, logistic regression is a classification algorithm.

- The number of labelled data points (observed values of y) is very important.
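
To make the idea of a loss function concrete, here is a minimal sketch of two common choices: mean squared error (often used with linear regression) and log loss (often used with logistic regression). The function names and the toy numbers are illustrative assumptions, not part of the course material.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average squared difference; a common loss for regression."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def log_loss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy; a common loss for binary classification."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical targets and predictions
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
ll_good = log_loss([1, 0, 1], [0.9, 0.1, 0.8])  # confident and correct
ll_bad = log_loss([1, 0, 1], [0.2, 0.7, 0.3])   # mostly wrong
print(mse, ll_good, ll_bad)
```

In both cases a lower value means the predictions are closer to the known outcomes, which is what training tries to achieve.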

## Unsupervised ML algorithms:
- No knowledge of outcome variable is given to the algorithms.
- Algorithms must find the possible values of the outcome variable.
- Examples: clustering, principal component analysis, etc.

- principal component analysis = It helps to reduce the number of features.
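
As a rough illustration of how principal component analysis reduces the number of features, the sketch below centers a toy dataset and projects it onto its top principal directions using NumPy's SVD. The helper name `pca_reduce` and the random data are assumptions made for this example; in practice a library implementation such as `sklearn.decomposition.PCA` would normally be used.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

rng = np.random.default_rng(0)
t = rng.normal(size=100)
# Hypothetical data: three features, the first two almost perfectly correlated.
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=100), rng.normal(size=100)])

X_reduced = pca_reduce(X, n_components=2)
print(X.shape, "->", X_reduced.shape)
```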

## ML algorithms:
- For supervised learning:
<pre>
input data -> Model < ----------------
                |                     | Model update
                V                     |
            predict output -------> Error (Loss function)
                |             ^
        Compare |             |
                V             |
            Expected output --|
</pre>

- For unsupervised learning:
<pre>

    input data -> model -> generated example
</pre>



## Why ML?
- It helps in understanding the association between key performance indicators (KPIs).
- Identifying the factors that have a significant impact on the KPIs for effective management.

## Steps in ML:
1. Identify the problem or opportunity for value creation.
2. Identify sources of data and create a data lake.
3. Pre-process the data for issues such as missing and incorrect data.
4. Generate derived variables and transform the data if necessary.
5. Divide the datasets into subsets of training and validation datasets.
6. Build ML models to identify the best model(s) using model performance in validation data.
7. Implement Solution/Decision/Develop Product.

- There are two phases: the first is training and the second is validation.
- In training, we fit the model on the largest section of the available data.
- In validation, we evaluate the trained model on a different section of the data, which is distinct from the training data.
- The purpose of validation is to check that the model generalizes to data it has not seen, i.e. that training has happened properly.
- Instructor gives the following example:
    - Training is like securing marks in internal exam.
    - Validation is like securing marks in final exam.
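
The split into training and validation data can be sketched in a few lines of plain Python. The helper name, the 80/20 ratio, and the fixed seed are assumptions for illustration; scikit-learn's `train_test_split` is the usual tool for this.

```python
import random

def train_validation_split(rows, validation_fraction=0.2, seed=42):
    """Shuffle rows and split them into disjoint training and validation lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

rows = list(range(100))  # stand-in for 100 data records
train_rows, validation_rows = train_validation_split(rows)
print(len(train_rows), len(validation_rows))
```

Shuffling before splitting matters when the records are stored in some meaningful order; otherwise the validation set may not resemble the training set.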



- The main goal is to minimize the loss function.
- Instructor draws a U-shaped (parabola-like) curve with model complexity on the x-axis and loss on the y-axis.
- This curve is called "loss function vs complexity of ML model".
- The left-hand side of the curve is **low variance and high bias** (underfitting).
- The right-hand side of the curve is **low bias and high variance** (overfitting).
- The minimum of the curve corresponds to **moderate bias and moderate variance**.
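
One way to see the bias-variance trade-off is to fit models of increasing complexity to the same data and compare training and validation loss. The sketch below uses polynomial degree as a stand-in for model complexity and a noisy sine curve as a hypothetical target; both are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical noisy target: y = sin(3x) + noise."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_train, y_train = make_data(30)
x_val, y_val = make_data(200)

train_loss, val_loss = {}, {}
for degree in (1, 3, 9):  # low, moderate, and high complexity
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_loss[degree] = mse(y_train, np.polyval(coeffs, x_train))
    val_loss[degree] = mse(y_val, np.polyval(coeffs, x_val))
    print(degree, round(train_loss[degree], 4), round(val_loss[degree], 4))
```

Training loss can only go down as the degree increases, while validation loss typically falls and then rises again, which gives the U-shaped curve. The exact numbers depend on the random seed and the data.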

## Unsupervised machine learning algorithms:
- The objective is to generate labels.
- A key question is how many groups are needed so that the resulting clusters can be labelled.

### K-means clustering algorithm:

- Using distance measures such as Euclidean distance in clustering
- Learn to build clusters using sklearn library in python.

## Introduction to unsupervised learning (ML)
- Training data = X = {x1, x2, ..., xn}, X ⊂ R<sup>d</sup>
- Clustering / segmentation: 
    - f : R<sup>d</sup> ---> {C1, ..., Ck} (set of clusters).

## Introduction to clustering
- Clustering is a divide-and-conquer strategy which divides the dataset into homogeneous groups, which can then be used to prescribe the right strategy for each group.
- In clustering, **the objective** is to ensure that the **variation within a cluster is minimized while the variation between clusters is maximized**.
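
The clustering objective can be checked numerically. The sketch below computes the within-cluster and between-cluster sums of squares for two hypothetical groupings of the same six points; the helper names and the points are assumptions for illustration.

```python
def centroid(points):
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_and_between(clusters):
    """Return (within-cluster, between-cluster) sums of squares."""
    all_points = [p for cluster in clusters for p in cluster]
    overall = centroid(all_points)
    within = sum(squared_distance(p, centroid(c)) for c in clusters for p in c)
    between = sum(len(c) * squared_distance(centroid(c), overall) for c in clusters)
    return within, between

# Same six points, grouped sensibly and then badly
good = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
bad = [[(1, 1), (8, 8), (2, 1)], [(1, 2), (9, 8), (8, 9)]]

good_within, good_between = within_and_between(good)
bad_within, bad_between = within_and_between(bad)
print(good_within, good_between)
print(bad_within, bad_between)
```

Because the total variation of a fixed dataset is constant, lowering the within-cluster variation automatically raises the between-cluster variation.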

## Case study: Do clustering operation on customer data.
- Explore the relationship between age and income using k-means clustering.

- Loading data:
```python
    import pandas as pd
    customers_df = pd.read_csv("customers.csv")
    customers_df.head(5)
```

- Consider grouping as per their income:
    - Low income with low age
    - Medium income with medium age
    - High income with high age etc...
- For this problem statement there can be 4 possibilities for (age,income) pairings:
    - LL, LH, HL, HH

- Visualizing the relationship:
```python
    # Visualize them before going for clustering
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sn
    %matplotlib inline

    # fit_reg=False draws only the scatter points, without a regression line
    sn.lmplot(data=customers_df, x='age', y='income', fit_reg=False)
    plt.title("Fig 1: Customer segments based on income and age")
```

## Finding similarities using distance:
- Clustering techniques assume that there are subsets in the data that are similar or homogeneous.
- One approach for measuring similarity is through distance measured using different metrics.
- A few distance measures used in clustering are discussed in the following sections.

## Euclidean distance
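- D(X1, X2) = sqrt( Summation_over_i( (Xi1 - Xi2)^2 ) ), where Xi1 and Xi2 are the values of feature i for observations X1 and X2, and the sum runs over all features.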
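
A minimal sketch of the Euclidean distance formula in plain Python (the function name is an assumption for this example):

```python
import math

def euclidean_distance(x1, x2):
    """Square root of the summed squared differences across features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

d = euclidean_distance((4, 2), (5, 3))
print(round(d, 3))
```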


## Other methods of distance measurement:
- Minkowski distance
- Jaccard similarity coefficient
- Cosine similarity
- Gower's similarity coefficient
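
Three of these measures are short enough to sketch directly. The function names and sample inputs below are assumptions for illustration; Gower's coefficient, which handles mixed numeric and categorical features, is left out because it needs more setup.

```python
import math

def minkowski_distance(x1, x2, p=3):
    """Generalizes Manhattan (p=1) and Euclidean (p=2) distance."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1 / p)

def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    return len(set_a & set_b) / len(set_a | set_b)

def cosine_similarity(x1, x2):
    """Cosine of the angle between two vectors; ignores their lengths."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)

manhattan = minkowski_distance((0, 0), (3, 4), p=1)
euclidean = minkowski_distance((0, 0), (3, 4), p=2)
jaccard = jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"})
cos_parallel = cosine_similarity((1, 2), (2, 4))
cos_orthogonal = cosine_similarity((1, 0), (0, 1))
print(manhattan, euclidean, jaccard, cos_parallel, cos_orthogonal)
```

Note that Jaccard and cosine are similarities (higher means closer), whereas Minkowski is a distance (lower means closer).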

## Procedure of k-means clustering (all of this happens internally):
1. Decide the value of k.
2. Choose K observations from the data that are likely to be in different clusters, e.g. observations that are farthest from each other.
3. The K observations selected in step 2 are the centroids of those clusters.
4. For remaining observations, find the cluster closest to the centroid. Add the new observation (say observation j) to the cluster with the closest centroid. Adjust the centroid after adding a new observation to the cluster. The closest centroid is chosen based upon an appropriate distance measure.
5. Repeat step 4 until all observations are assigned to a cluster.
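
The five steps above can be sketched as follows. This is only one reading of the procedure: it picks seeds that are far apart, then assigns the remaining observations one at a time, updating the centroid after each assignment. The helper names and toy points are assumptions, and library implementations such as scikit-learn's `KMeans` instead reassign all points in repeated batch passes.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first_seeds(points, k):
    """Start at index 0, then repeatedly pick the point farthest from the chosen seeds."""
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(points)) if i not in seeds]
        seeds.append(max(
            candidates,
            key=lambda i: min(distance(points[i], points[s]) for s in seeds),
        ))
    return seeds

def sequential_kmeans(points, k):
    seeds = farthest_first_seeds(points, k)          # steps 2 and 3
    centroids = [tuple(points[s]) for s in seeds]
    members = [[points[s]] for s in seeds]
    for i, p in enumerate(points):                   # steps 4 and 5
        if i in seeds:
            continue
        nearest = min(range(k), key=lambda c: distance(p, centroids[c]))
        members[nearest].append(p)
        # Adjust the centroid to the mean of the cluster's current members.
        centroids[nearest] = tuple(
            sum(v) / len(members[nearest]) for v in zip(*members[nearest])
        )
    return centroids, members

# Hypothetical points forming two obvious groups
toy_points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, members = sequential_kmeans(toy_points, k=2)
print(centroids)
print(members)
```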

## Practice Question for understanding the iterations of K-means clustering: 
- Given 5 points and 2 centroids C1 = (4,2), C2 = (3,5). (See pic for 5 points)
    1. Find the new two Centroid points after first iteration.
        - Solution: 
        - C1 = [(5,3), (3,1)]
        - C2 = [(2,5), (1,5), (4,4)]
        - C1new = ( (5+3)/2, (3+1)/2 ) = (4,2)
        - C2new = ( (2+1+4)/3, (5+5+4)/3 ) = (2.33, 4.67)
        - This is the end of the first iteration.
        - Note that these iterations continue until no change in the centroid points is seen.

    2. Find the new two centroid points after second iteration.
        - Solution:
        - C1 = [(5,3), (3,1)]
        - C2 = [(2,5), (1,5), (4,4)]
        - C1new = ( (5+3)/2, (3+1)/2 ) = (4,2)
        - C2new = ( (2+1+4)/3, (5+5+4)/3 ) = (2.33, 4.67)
        - This is the end of the second iteration.
        - Note that there was no change in centroid points.
        - This means we stop iterating here.
        - These are the final coordinates of the two centroids:
        - C1 = (4,2)
        - C2 = (2.33, 4.67)
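
The two iterations above can be reproduced with a short script. The five points are read off the cluster memberships listed in the solution (the picture itself is not reproduced here), and the helper names are assumptions for this sketch.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_iteration(points, centroids):
    """Assign each point to its nearest centroid, then return the cluster means.

    Assumes every centroid receives at least one point.
    """
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(
            range(len(centroids)),
            key=lambda c: squared_distance(p, centroids[c]),
        )
        clusters[nearest].append(p)
    new_centroids = [tuple(sum(v) / len(c) for v in zip(*c)) for c in clusters]
    return new_centroids, clusters

points = [(5, 3), (3, 1), (2, 5), (1, 5), (4, 4)]
initial_centroids = [(4, 2), (3, 5)]

first, clusters_1 = kmeans_iteration(points, initial_centroids)
second, clusters_2 = kmeans_iteration(points, first)
print(first, clusters_1)
print(second, clusters_2)
print("converged:", first == second)
```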

## Method of finding the optimal number of clusters:
- The number of clusters is often chosen arbitrarily.
- However, there is a procedure to find the optimal number.
- Eg: the elbow method, which uses WCSS (Within-Cluster Sum of Squares).

## Finding the optimal number of clusters with the elbow method:
```python

# Use the elbow method with WCSS to find the optimal number of clusters.
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans

wcss = []
cluster_range = range(1, 11)

for i in cluster_range:
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(customers_df)
    # kmeans.inertia_ = Sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.
    wcss.append(kmeans.inertia_)

# Plot once, after the loop, with the number of clusters on the x-axis.
plt.plot(cluster_range, wcss, marker='o')
plt.title('The Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
```

## Creating clusters: from the elbow plot above, k can be selected as 3; then add the cluster labels to the data
```python

    # figure shows that k=3
    from sklearn.cluster import KMeans
    clusters = KMeans(3)
    clusters.fit(customers_df)

    # Now create a label for the data
    customers_df["clusterid"] = clusters.labels_

    # display the sample data
    customers_df[0:5]

```

## Plotting Customers with their Segments:
```python
    # Plotting the customers with their segments
    sn.lmplot( data=customers_df, x="age", y="income", hue="clusterid", fit_reg=False )
    plt.title("Fig 2: Customer Segments Based on Income and Age with clusterid")
```

## Repeating the same with normalization of features:
- If we observe the two features income and age, we see a large difference between their respective ranges.
- eg: age range is 18 to 70, whereas income range is 1 lakh to 9 lakh.
- The gap between the two ranges is huge, which may affect the model: distance calculations get dominated by income.
- To treat them equally we need to scale the features.
- Here we do **normalization of features** (the code uses `StandardScaler`, which rescales each feature to zero mean and unit variance).
- By putting both features on the same scale, further operations on them become more convenient.
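
To see why scaling matters for distance-based clustering, the sketch below compares distances between three hypothetical customers before and after standardizing each feature. The numbers and helper names are made up for illustration; `statistics.pstdev` is used because it matches the population standard deviation that `StandardScaler` applies.

```python
import math
import statistics

# Hypothetical customers: (age in years, income in rupees)
ages = [25, 30, 60, 65]
incomes = [200000, 800000, 210000, 790000]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

raw = list(zip(ages, incomes))
scaled = list(zip(standardize(ages), standardize(incomes)))

# Customer 0 vs customer 2: similar income, very different age.
# Customer 0 vs customer 1: similar age, very different income.
raw_d02, raw_d01 = distance(raw[0], raw[2]), distance(raw[0], raw[1])
scaled_d02, scaled_d01 = distance(scaled[0], scaled[2]), distance(scaled[0], scaled[1])
print(raw_d02, raw_d01)
print(scaled_d02, scaled_d01)
```

On the raw data the 35-year age gap barely registers next to the income gap, while on the standardized data both features contribute on a comparable scale.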

## Normalizing the features with StandardScaler:

```python

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    scaled_customers_df = scaler.fit_transform(customers_df[["age", "income"]])
    scaled_customers_df[0:5]

```

## Create Clusters using normalized feature set:
```python

    from sklearn.cluster import KMeans
    clusters_new = KMeans(3, random_state=42)
    clusters_new.fit(scaled_customers_df)
    customers_df["clusterid_new"] = clusters_new.labels_
```

## Scatter plot after normalization:
```python

    # Plotting the customers with their segments:
    sn.lmplot( data=customers_df, x="age", y="income", hue="clusterid_new", fit_reg=False )
    plt.title( "Fig 3: Customer Segments Based on Income and Age with clusterid_new" )

```

- Also see the lab task: "kmeans.ipynb" for all the code and plots.