# Machine Learning

Machine learning is a field of data science concerned with building programs that `learn` patterns from data and use them to make decisions. These decisions can be predicting how profitable the next transaction will be, whether the next transaction will be returned or not, etc.

## Supervised Machine Learning
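These are machine learning algorithms which learn the relationship between the input variables and a target variable from examples where the target is already known. The data scientist chooses the inputs and the target, and the algorithm uses the learned relationship to make decisions on new data.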

It is broadly of two types:
1. Regression: predicting a numeric value
2. Classification: predicting a label value

## Unsupervised Machine Learning
These are machine learning algorithms which work without a known target variable: they identify relationships between the data points themselves and make decisions based on these extracted relations.

- k-Means Clustering is an unsupervised machine learning algorithm that groups data points with similar properties into clusters and predicts which cluster a new data point belongs to based on these properties.
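
The clustering idea can be sketched with NumPy alone. This is a toy version on made-up points, not a production implementation: the function name `k_means`, the choice of the first `k` points as starting centroids, and the fixed number of iterations are all assumptions made for illustration.

```python
import numpy as np

def k_means(points, k, iterations=10):
    """Toy k-means: returns (centroids, labels) for a 2-D array of points."""
    # Assumption: start from the first k points (real libraries pick smarter starts).
    centroids = points[:k].astype(float).copy()
    for _ in range(iterations):
        # Distance from every point to every centroid, shape (n_points, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of its points (leave it alone if it has none).
        for j in range(k):
            if (labels == j).any():
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

# Made-up data: two well-separated groups.
points = np.array([[1.0, 1.0], [9.0, 9.0], [1.5, 2.0],
                   [8.0, 8.5], [1.0, 0.5], [9.5, 8.0]])
centroids, labels = k_means(points, k=2)
print(labels)

# "Predicting" the cluster of a new point means finding its nearest centroid.
new_point = np.array([2.0, 1.0])
predicted_cluster = int(np.linalg.norm(centroids - new_point, axis=1).argmin())
print(predicted_cluster)
```

Points near (1, 1) end up in one cluster and points near (9, 9) in the other, and the new point is assigned to the cluster whose centroid is closest.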

In [None]:
for x in range(1, 11):
    print(f'2 x {x} = {2 * x}')

In [None]:
def double(x):
    return x * 2

In [None]:
double(100)

Sales = 5400
Price = ?

# Correlation

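Correlation is a measure of the strength and direction of the linear relationship between two numeric variables. The coefficient ranges from -1 to 1: its magnitude quantifies how closely the two variables move together, and its sign tells us the direction. A strong correlation does not by itself show that one variable causes the other.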

| Value | Inference |
| --- | --- |
| 1 | y increases as x increases (perfect positive linear relationship) |
| 0 | no linear relationship between x and y |
| -1 | y decreases as x increases (perfect negative linear relationship) |
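
To make the table concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. Pearson is the default method used by pandas' `corr()`. The function name `pearson_r` and the sample arrays are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()  # deviations of x from its mean
    y_dev = y - y.mean()  # deviations of y from its mean
    return (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

x = np.array([1, 2, 3, 4, 5])
rising = pearson_r(x, 2 * x + 1)   # y rises with x
falling = pearson_r(x, -3 * x)     # y falls as x rises
print(rising, falling)
```

The first value comes out at (approximately) 1 and the second at (approximately) -1, matching the first and last rows of the table.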

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
ice_cream_data = pd.read_csv(r"https://raw.githubusercontent.com/puneettrainer/datasets/main/ice_cream.csv")

In [None]:
plt.figure(figsize=(15, 5))
plt.scatter(ice_cream_data['Temperature'], ice_cream_data['Profit'])
plt.title('Profit by Temperature')
plt.xlabel('Temperature')
plt.ylabel('Profit')
plt.show()

In [None]:
ice_cream_data.corr()

In [None]:
def scatter_plot(data, x, y):
    plt.figure(figsize=(15, 5))
    plt.scatter(data[x], data[y])
    plt.title(y + ' by ' + x)  # e.g. 'Profit by Temperature': y-axis field by x-axis field
    plt.xlabel(x)
    plt.ylabel(y)
    plt.show()

In [None]:
scatter_plot(ice_cream_data, 'Temperature', 'Profit')

In [None]:
orders = pd.read_excel(r"E:\data\superstore.xls")

In [None]:
orders.corr()

### `dataframe_obj.corr(numeric_only)`

`corr()` is a dataframe method which calculates the pairwise correlation between the fields in the dataframe.
- if the dataframe contains non-numeric fields, this will throw an error (the default behaviour in pandas 2.0 and later)
- we can exclude non-numeric fields from the `corr()` calculation by setting the `numeric_only` argument to `True`
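
The behaviour is easy to check on a small made-up dataframe. The column names and values below are illustrative only and have nothing to do with the superstore data. Selecting the numeric columns first with `select_dtypes` is shown as an equivalent alternative.

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns.
sample = pd.DataFrame({
    'Region': ['East', 'West', 'East', 'South'],
    'Sales': [100.0, 250.0, 175.0, 300.0],
    'Profit': [10.0, 40.0, 20.0, 55.0],
})

# Skip the text column instead of raising an error.
numeric_corr = sample.corr(numeric_only=True)
print(numeric_corr)

# Alternative: keep only numeric columns, then correlate.
selected_corr = sample.select_dtypes(include='number').corr()
```

Both results contain only `Sales` and `Profit`; the `Region` column is left out.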

In [None]:
orders.corr(numeric_only=True)

In [None]:
orders[['Sales', 'Quantity', 'Discount', 'Profit']].corr()

In [None]:
scatter_plot(orders, 'Sales', 'Profit')

In [None]:
scatter_plot(orders, 'Discount', 'Profit')

In [None]:
scatter_plot(orders, 'Quantity', 'Profit')

In [None]:
scatter_plot(orders.loc[orders['Sales'] <= 5500], 'Sales', 'Profit')

# Linear Regression

Linear Regression is a supervised learning algorithm which is used to predict numeric values. It models the target variable as a weighted sum of the input variables and tries to find the mathematical relation (equation) that best fits the known data, which it then uses to predict future values.

Since it creates a mathematical relation between the inputs and the target, all inputs to a linear regression algorithm must be numeric; categorical inputs have to be encoded as numbers first.

## Linear Regression from Scratch

In [None]:
plt.figure(figsize=(15, 5))
plt.scatter(orders['Sales'], orders['Profit'])
plt.plot([orders['Sales'].min(), orders['Sales'].max()], [orders['Profit'].min(), orders['Profit'].max()], color='orange')
plt.title('Profit by Sales')
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.show()

# $y = m * x + c$

$y \implies$ target value<br>
$m \implies$ slope of the line, weight by which to multiply $x$<br>
$x \implies$ input variable, the known value on the basis of which we want to predict<br>
$c \implies$ intercept, the value of $y$ when $x$ is 0; in case of Linear Regression, $c$ is the bias

[Equation of a Line](https://www.mathsisfun.com/equation_of_line.html)
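
As a sketch of how $m$ and $c$ can be obtained from data, the snippet below uses the ordinary least squares formulas for a single input: $m$ is the covariance of $x$ and $y$ divided by the variance of $x$, and $c = \bar{y} - m\bar{x}$. This is one common approach, not necessarily the one this notebook follows, and the names `fit_line` and `predict` and the sample numbers are made up.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares slope m and intercept c for y = m * x + c."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    m = (x_dev * (y - y.mean())).sum() / (x_dev ** 2).sum()
    c = y.mean() - m * x.mean()
    return m, c

def predict(x, m, c):
    """Apply the line equation to one or more input values."""
    return m * np.asarray(x, dtype=float) + c

# Made-up data that lies exactly on the line y = 0.25 * x + 10.
sales = np.array([100.0, 200.0, 300.0, 400.0])
profit = 0.25 * sales + 10.0

m, c = fit_line(sales, profit)
print(m, c)
print(predict([500.0], m, c))
```

Because the sample points lie exactly on a line, the fit recovers a slope of about 0.25 and an intercept of about 10.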

In [None]:
def prediction(x, weight):
    return x * weight

In [None]:
# Take a copy so that adding a column later does not trigger a SettingWithCopyWarning
linreg_data = orders[['Sales', 'Profit']].copy()
linreg_data

In [None]:
plt.figure(figsize=(15, 5))
plt.scatter(linreg_data['Sales'], linreg_data['Profit'])
plt.title('Profit by Sales')
plt.xlabel('Sales')
plt.ylabel('Profit')

for w in np.arange(0.1, 1.3, 0.2):
    plt.plot(linreg_data['Sales'], prediction(linreg_data['Sales'], w), label=f'{w:.2f}')
plt.legend()
plt.show()

In [None]:
linreg_data['Prediction'] = linreg_data['Sales'] * 0.5
linreg_data

## Evaluation Metrics

Evaluation metrics are scoring methods which quantify how well our model is performing.

### Mean Absolute Error

Mean Absolute Error is an evaluation metric which calculates the average absolute difference between the predicted value and the actual value. It only tells how far the model predicts the target from the actual value, but does not tell whether the prediction is an overestimation or underestimation. A good MAE score is close to zero.

#### Mean Absolute Error, MAE = $\frac{\sum | actual - prediction |}{observations}$

| Advantages | Disadvantages |
| --- | --- |
| easy to understand; has the same unit as that of the target value | does not indicate whether the model is overestimating or underestimating |
| gives the same weight to every error, so a few outliers do not dominate the final score | large errors are not penalized more than small ones, so occasional big misses can be understated |

In [None]:
def mae(actual, prediction):
    return np.abs(actual - prediction).sum() / len(actual)

In [None]:
mae(linreg_data['Profit'], linreg_data['Prediction'])

### Mean Squared Error

Mean Squared Error is another evaluation metric which scores the performance of a model by averaging the squared differences between the actual and predicted values. Squaring gives more weight to larger errors, so big misses are penalized heavily. Since this makes it sensitive to outliers, it can also be misleading (a good model may get a bad MSE score because of a few extreme points). A good MSE score is close to zero.

#### Mean Squared Error, MSE = $\frac{\sum ( actual - prediction ) ^ 2}{observations}$

| Advantages | Disadvantages |
| --- | --- |
| gives more weight to bigger errors | sensitive to outliers; has squared units (e.g. if target is $dollar$, then score is $dollar^2$) |
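
The difference in how MAE and MSE treat large errors shows up clearly on a small example. The arrays below are made up: one set of predictions is slightly off everywhere, the other is exact except for a single large miss.

```python
import numpy as np

def mae(actual, prediction):
    """Average absolute error."""
    return np.abs(actual - prediction).mean()

def mse(actual, prediction):
    """Average squared error."""
    return ((actual - prediction) ** 2).mean()

actual = np.array([10.0, 20.0, 30.0, 40.0])
slightly_off = np.array([12.0, 18.0, 32.0, 38.0])  # every prediction off by 2
one_big_miss = np.array([10.0, 20.0, 30.0, 60.0])  # three exact, one off by 20

print(mae(actual, slightly_off), mse(actual, slightly_off))
print(mae(actual, one_big_miss), mse(actual, one_big_miss))
```

Moving from the first set of predictions to the second, MAE rises from 2 to 5, while MSE jumps from 4 to 100, because the single error of 20 is squared before averaging.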

In [None]:
def mse(actual, prediction):
    return ((actual - prediction) ** 2).sum() / len(actual)

In [None]:
mse(linreg_data['Profit'], linreg_data['Prediction'])

### Root Mean Square Error

Root Mean Square Error is a refinement of the MSE: we simply calculate the square root of the MSE, which brings the score back to the same unit as the target. Like `MSE`, it gives more weight to bigger errors and is sensitive to outliers. A good RMSE score is close to zero.

#### Root Mean Square Error, RMSE = $\sqrt{\frac{\sum ( actual - prediction ) ^ 2}{observations}}$

In [None]:
def rmse(actual, prediction):
    return mse(actual, prediction) ** 0.5

In [None]:
rmse(linreg_data['Profit'], linreg_data['Prediction'])

### Coefficient of Determination $R^2$

$R^2$ is a metric which tells us what proportion of the variation in the target variable is explained by the model's predictions. It is at most 1 and usually falls between 0 and 1, but it becomes negative when the model predicts worse than simply using the average of the actual values. It does not tell us the direction in which an input influences the target.

#### Coefficient of Determination, $R^2$ = 1 - $\frac{\sum{(actual - prediction)}^2}{\sum{(actual - avg(actual))}^2}$

| $R^2$ | Inference |
| --- | --- |
| close to 1 | model is efficiently using the input variables and explains most of the variation in the target |
| close to 0 | model is not efficiently using the input variables; it does little better than always predicting the average of the target |
| negative | model predicts worse than always predicting the average of the target |
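
The three rows of the table can be reproduced on made-up numbers. The arrays below are illustrative only: one set of predictions is perfect, one always predicts the average, and one runs in the wrong direction.

```python
import numpy as np

def r_squared(actual, prediction):
    """1 minus (residual sum of squares / total sum of squares)."""
    residual_sum_squares = ((actual - prediction) ** 2).sum()
    total_sum_squares = ((actual - actual.mean()) ** 2).sum()
    return 1 - residual_sum_squares / total_sum_squares

actual = np.array([10.0, 20.0, 30.0, 40.0])
perfect = actual.copy()                          # predictions equal to the actual values
mean_only = np.full_like(actual, actual.mean())  # always predict the average (25)
reversed_trend = actual[::-1].copy()             # predictions run the wrong way

print(r_squared(actual, perfect))
print(r_squared(actual, mean_only))
print(r_squared(actual, reversed_trend))
```

The perfect predictions score 1, the average-only predictions score 0, and the reversed predictions score -3, since their squared errors (2000) are four times the total variation around the average (500).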

In [None]:
def r_squared(actual, prediction):
    residual_sum_squares = ((actual - prediction) ** 2).sum()
    total_sum_squares = ((actual - actual.mean()) ** 2).sum()
    return 1 - (residual_sum_squares / total_sum_squares)

In [None]:
r_squared(linreg_data['Profit'], linreg_data['Prediction'])

### Adjusted $R^2$

Adjusted $R^2$ is a variant of $R^2$ which accounts for the number of input variables used by the model. For a least-squares fit evaluated on its training data, plain $R^2$ never decreases when we add more input variables, so it may not tell us whether the model improved or worsened because of the added variable. Adjusted $R^2$ overcomes this limitation by penalizing every additional variable, so it rises only when the new variable improves the fit enough to justify it.

#### $Adjusted$ $R^2$, $R^2_{adj}$ = 1 - $\frac{(1- R^2 )(observations-1)}{observations - variables - 1}$
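
A quick worked example of the formula, using made-up values ($R^2 = 0.80$ with 100 observations), shows the penalty at work. The helper name `adjusted_r2_from_r2` is hypothetical; it takes an already computed $R^2$ instead of the actual and predicted values.

```python
def adjusted_r2_from_r2(r2, observations, variables):
    """Apply the adjusted R^2 formula to an already computed R^2."""
    return 1 - ((1 - r2) * (observations - 1)) / (observations - variables - 1)

# Same R^2 and sample size, different numbers of input variables.
one_input = adjusted_r2_from_r2(0.80, observations=100, variables=1)
ten_inputs = adjusted_r2_from_r2(0.80, observations=100, variables=10)
print(round(one_input, 4), round(ten_inputs, 4))
```

With one input the adjusted score is roughly 0.798, and with ten inputs it drops to roughly 0.778: the same $R^2$ is worth less when it took more variables to reach it.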

In [None]:
def adjusted_r2(actual, prediction, inputs):
    return 1 - ((1 - r_squared(actual, prediction)) * (len(actual) - 1)) / (len(actual) - inputs - 1)

In [None]:
adjusted_r2(linreg_data['Profit'], linreg_data['Prediction'], 1)