In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings

warnings.filterwarnings('ignore')

In [None]:
performance = pd.read_csv(r"https://raw.githubusercontent.com/puneettrainer/datasets/main/student_performance.csv")

### Identifying `input` and `target` field

We start off by selecting the `input` and `target` field(s) of the model.

In [None]:
performance.columns

In [None]:
performance.corr(numeric_only=True)

In [None]:
input_field = 'Previous Scores'
target_field = 'Performance Index'

# Predict `Performance Index` based on `Previous Scores`

### Splitting the dataset

When developing a machine learning model, we split the provided dataset into two datasets:
1. `training_data`: dataset which we use to train the model
2. `test_data`: dataset which we use to validate the trained model

We can use any method to split the dataset (such as `loc`, `iloc`, etc.). Another option is the `train_test_split` function provided by the `scikit-learn` library (imported as `sklearn`).

The dataset can be split in any ratio, but ideally we select a bigger portion of the data for the `training_data` dataset. Just like how we prepare for tests by going through multiple mock tests, a machine learning model learns best when it has a lot of data to study from.

The `test_data` dataset should have a lot of records too, but it would be smaller than the `training_data` dataset. `test_data` is used for evaluating the model after it has been trained.

Usually a 70:30 ratio is adequate for `training_data` and `test_data`.

#### `train_test_split(dataset, test_size, random_state)`

This function shuffles the rows of `dataset` and splits them into two dataframes: the first with about (1 - `test_size`) * n rows and the second with about `test_size` * n rows, where n is the number of rows in `dataset`. Setting `random_state` makes the shuffle reproducible, so that whenever the data is split, the same records are assigned to each dataset. This is useful when developing the model, as during development we may run the notebook multiple times.
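
The behaviour described above can be checked on a small dataframe. This is a minimal sketch: the `toy` dataframe and its columns `x` and `y` are made up for illustration and are not part of the student performance dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy dataset with 10 rows, used only to illustrate the split
toy = pd.DataFrame({'x': range(10), 'y': [2 * v for v in range(10)]})

# 30% of the rows go to the test set, the remaining 70% to the training set
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=0)
print(len(toy_train), len(toy_test))

# Splitting again with the same random_state assigns the same rows to each set
toy_train_again, toy_test_again = train_test_split(toy, test_size=0.3, random_state=0)
same_split = toy_test.index.equals(toy_test_again.index)
print(same_split)
```

Changing `random_state` to a different number would most likely shuffle the rows differently, which is why we keep it fixed while developing the model.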

In [None]:
training_data, test_data = train_test_split(performance[[input_field, target_field]]
                                           ,test_size=0.3
                                           ,random_state=0)

### Model Initialization

After the data is split, we instantiate the linear regression model using the `LinearRegression()` class. This algorithm tries to fit a line over the data points such that the line is as close to the data points as possible.
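
To make "as close as possible" concrete: `LinearRegression` uses ordinary least squares, so the fitted line is the one with the smallest sum of squared vertical distances to the points. The sketch below uses made-up data around the line $3x + 2$ (an assumption for illustration) and compares the fitted line with a slightly different one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data scattered around the line y = 3x + 2
rng = np.random.default_rng(0)
x_toy = np.arange(20, dtype=float)
y_toy = 3 * x_toy + 2 + rng.normal(scale=1.0, size=x_toy.size)

# Fit with scikit-learn (the input must be 2-dimensional, hence the reshape)
toy_model = LinearRegression().fit(x_toy.reshape(-1, 1), y_toy)

# Fit the same line with NumPy's least-squares polynomial fit of degree 1
slope, intercept = np.polyfit(x_toy, y_toy, deg=1)

def sum_squared_error(m, c):
    # total squared vertical distance between the points and the line m * x + c
    return float(np.sum((y_toy - (m * x_toy + c)) ** 2))

fitted_error = sum_squared_error(toy_model.coef_[0], toy_model.intercept_)
shifted_error = sum_squared_error(toy_model.coef_[0] + 0.1, toy_model.intercept_)

print(toy_model.coef_[0], slope)
print(fitted_error, shifted_error)
```

Both approaches should give the same slope and intercept, and nudging the slope away from the fitted value increases the squared error.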

In [None]:
# instantiating a Linear Regression model
model = LinearRegression()

### Training the model

To train the model, we pass the `input` and the `target` into the `fit()` method of the `LinearRegression()` class. The `input` is the input field and `target` is the target field in the `training_data` dataset.

#### The `fit` method does not accept `Series` input, so we pass a single column dataframe instead.

| Syntax | Data Type |
| --- | --- |
| `data['Field']` | Series |
| `data[['Field']]` | single-column dataframe |
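
The difference between the two syntaxes in the table can be seen from the shape of the result. A minimal sketch, using a made-up dataframe with a column named `Field`:

```python
import pandas as pd

# Hypothetical dataframe with a single column, for illustration only
data = pd.DataFrame({'Field': [1.0, 2.0, 3.0]})

as_series = data['Field']     # single brackets: one-dimensional Series
as_frame = data[['Field']]    # double brackets: two-dimensional dataframe

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The `input` passed to `fit` needs the two-dimensional form (rows and columns), even when there is only one column.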

In [None]:
model.fit(training_data[[input_field]], training_data[[target_field]])

### Making predictions

To make predictions from the model, we use the `predict()` method of the `LinearRegression()` class. We pass the input field from the `test_data` dataset into this method.

#### The `predict` method does not accept `Series` input, so we pass a single column dataframe instead.

In [None]:
model.predict(test_data[[input_field]])

### Evaluating the model

Now we compare the `Performance Index` predicted by our model to the actual `Performance Index` recorded in the `test_data` dataset.

So,
- `actual` is the actual value of `Performance Index` in the `test_data[target_field]` column
- `prediction` is the output of `model.predict(test_data[[input_field]])`

For evaluating the model, we can use any evaluation method we see fit for our application.

#### When the presence of outliers doesn't matter and we want an easy-to-understand score - MAE

In [None]:
from sklearn.metrics import mean_absolute_error

print(f'Mean Absolute Error: {mean_absolute_error(test_data[target_field], model.predict(test_data[[input_field]]))}')

#### When the presence of outliers makes a difference - MSE

In [None]:
from sklearn.metrics import mean_squared_error

print(f'Mean Squared Error: {mean_squared_error(test_data[target_field], model.predict(test_data[[input_field]]))}')

#### When the presence of outliers makes a difference and we also want the score to come in the same units as the target field - RMSE

In [None]:
from sklearn.metrics import root_mean_squared_error

print(f'Root Mean Squared Error: {root_mean_squared_error(test_data[target_field], model.predict(test_data[[input_field]]))}')

#### When we want to see what proportion of the variation in the target field is explained by the model - $R^2$

In [None]:
from sklearn.metrics import r2_score

print(f'R-Squared Score: {r2_score(test_data[target_field], model.predict(test_data[[input_field]]))}')

From the $R^2$ score, we can see that about $83\%$ of the variation in `Performance Index` is explained by `Previous Scores`. The remaining variation is not captured by this single input field.
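
The four scores used above can also be computed by hand, which makes it easier to see what each one measures. This is a small sketch with made-up `actual` and `prediction` values (not taken from the dataset), checked against the `sklearn.metrics` functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical values, for illustration only
actual = np.array([10.0, 20.0, 30.0, 40.0])
prediction = np.array([12.0, 18.0, 33.0, 37.0])

errors = actual - prediction           # [-2, 2, -3, 3]

mae = np.mean(np.abs(errors))          # average size of the errors
mse = np.mean(errors ** 2)             # squaring makes large errors count more
rmse = np.sqrt(mse)                    # back in the units of the target field

# R-squared: 1 - (squared error of the model / squared error of always predicting the mean)
ss_residual = np.sum(errors ** 2)
ss_total = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_residual / ss_total

print(mae, mean_absolute_error(actual, prediction))
print(mse, mean_squared_error(actual, prediction))
print(rmse)
print(r2, r2_score(actual, prediction))
```

Because MSE and RMSE square the errors, a single large error raises them much more than it raises MAE.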

### Optimizing/improving a machine learning algorithm

There are multiple ways of optimizing a machine learning algorithm, depending on the algorithm that we are using. Some common optimization methods are:
- hyperparameter tuning
- adding or removing input fields
- scaling the inputs
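
As an illustration of the last item, the inputs can be scaled by placing a `StandardScaler` in front of the model. The sketch below uses made-up data (`X_scale`, `y_scale` are hypothetical names). Note that for plain linear regression, scaling does not change the predictions; it mainly matters for algorithms that are sensitive to the size of the inputs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data with two input fields, for illustration only
rng = np.random.default_rng(0)
X_scale = rng.uniform(0, 100, size=(50, 2))
y_scale = 1.5 * X_scale[:, 0] - 0.5 * X_scale[:, 1] + rng.normal(size=50)

# model trained on the raw inputs
plain_model = LinearRegression().fit(X_scale, y_scale)

# model trained on inputs rescaled to mean 0 and standard deviation 1
scaled_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_scale, y_scale)

same_predictions = np.allclose(plain_model.predict(X_scale), scaled_model.predict(X_scale))
print(same_predictions)
```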

#### Hyperparameter Tuning

In machine learning, `parameters` are the values that a model learns from the data during training; for linear regression these are the weights and the intercept. `Hyperparameters` are arguments of a model class that we set ourselves when we instantiate it, to configure how the model is trained.

For example,<br>
```
model = LinearRegression()
model.fit(training_data[[input_field]], training_data[[target_field]])
model.predict(test_data[[input_field]])
```

Here,<br>
`training_data[[input_field]]`, `training_data[[target_field]]` and `test_data[[input_field]]` are the data passed to the `model`. The `parameters` are learned from the training data when `fit()` is called. No `hyperparameters` are set, so the defaults are used.

In [None]:
model_2 = LinearRegression(positive=True)
model_2.fit(training_data[[input_field]], training_data[[target_field]])

print(f'Mean Absolute Error: {mean_absolute_error(test_data[target_field], model_2.predict(test_data[[input_field]]))}')
print(f'Mean Squared Error: {mean_squared_error(test_data[target_field], model_2.predict(test_data[[input_field]]))}')
print(f'Root Mean Squared Error: {root_mean_squared_error(test_data[target_field], model_2.predict(test_data[[input_field]]))}')
print(f'R-Squared Score: {r2_score(test_data[target_field], model_2.predict(test_data[[input_field]]))}')

In [None]:
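# `adjusted_r2` is defined here so that this cell runs when the notebook is executed from top to bottom
def adjusted_r2(actual, prediction, inputs):
    return 1 - ((1 - r2_score(actual, prediction)) * (len(actual) - 1)) / (len(actual) - inputs - 1)

adjusted_r2(test_data[target_field], model_2.predict(test_data[[input_field]]), model_2.n_features_in_)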

```
model_2 = LinearRegression(positive=True)
```

Here,<br>
`positive` is a hyperparameter of the algorithm. When set to `True`, it forces the weights (coefficients) of the relationship to be positive.
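
The effect of `positive` is easiest to see on data where the relationship is clearly negative. This is a minimal sketch with made-up data (`X_neg` and `y_neg` are hypothetical); `get_params()` lists the hyperparameters that a model was created with.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = -2x + 10, for illustration only
X_neg = np.arange(10, dtype=float).reshape(-1, 1)
y_neg = -2 * X_neg.ravel() + 10

unconstrained = LinearRegression().fit(X_neg, y_neg)
constrained = LinearRegression(positive=True).fit(X_neg, y_neg)

# hyperparameters are set when the model is created
print(constrained.get_params())

# the learned weight is negative without the constraint, but not with it
print(unconstrained.coef_)
print(constrained.coef_)
```

With `positive=True` the model cannot use a negative weight, so it fits this data worse than the unconstrained model. A hyperparameter is therefore only helpful when it matches what we know about the data.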

### `LinearRegression()` attributes

Just like any class, `LinearRegression` has some attributes. These are properties generated by the algorithm when the model is trained.

For example, the `pd.DataFrame` class has `attributes` such as `shape`, `columns`, `loc`, etc. Similarly, this class has:
- `coef_`: this is the weight computed for the relation (value of `m` in $m * x + c$)
- `intercept_`: this is the value of `c`
- `n_features_in_`: this is the number of input fields
- `feature_names_in_`: this is the name of input fields (available when the model is trained on a dataframe)

In [None]:
model.coef_

In [None]:
model.intercept_

In [None]:
model.n_features_in_

In [None]:
model.feature_names_in_

### Linear Regression using multiple input fields (`Multiple Linear Regression`)

In [None]:
input_fields = ['Hours Studied', 'Previous Scores', 'Sleep Hours', 'Sample Question Papers Practiced']

training_data, test_data = train_test_split(performance[input_fields + [target_field]]
                                           ,test_size = 0.3
                                           ,random_state=0)

model_3 = LinearRegression()
model_3.fit(training_data[input_fields], training_data[target_field])

print(f'Mean Absolute Error: {mean_absolute_error(test_data[target_field], model_3.predict(test_data[input_fields]))}')
print(f'Mean Squared Error: {mean_squared_error(test_data[target_field], model_3.predict(test_data[input_fields]))}')
print(f'Root Mean Squared Error: {root_mean_squared_error(test_data[target_field], model_3.predict(test_data[input_fields]))}')
print(f'R-Squared Score: {r2_score(test_data[target_field], model_3.predict(test_data[input_fields]))}')

In [None]:
def adjusted_r2(actual, prediction, inputs):
    # adjusts the R-squared score for the number of input fields used by the model
    n = len(actual)
    return 1 - ((1 - r2_score(actual, prediction)) * (n - 1)) / (n - inputs - 1)

In [None]:
adjusted_r2(test_data[target_field], model_3.predict(test_data[input_fields]), model_3.n_features_in_)
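
The `adjusted_r2` function above computes $1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$, where $n$ is the number of records and $p$ is the number of input fields. The idea is that $R^2$ on the training data never goes down when we add an input field, even a useless one, so the adjusted score applies a penalty for every field used. The sketch below illustrates this with made-up data; `adjusted_r2_demo` and the other names are hypothetical and only mirror the function above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def adjusted_r2_demo(actual, prediction, inputs):
    # same formula as adjusted_r2 above, repeated so this example is self-contained
    n = len(actual)
    return 1 - (1 - r2_score(actual, prediction)) * (n - 1) / (n - inputs - 1)

# Hypothetical data: one useful input field and one field of pure noise
rng = np.random.default_rng(0)
useful = rng.uniform(0, 10, size=40)
noise_field = rng.normal(size=40)
y_adj = 2 * useful + rng.normal(size=40)

X_one = useful.reshape(-1, 1)
X_two = np.column_stack([useful, noise_field])

pred_one = LinearRegression().fit(X_one, y_adj).predict(X_one)
pred_two = LinearRegression().fit(X_two, y_adj).predict(X_two)

r2_one, r2_two = r2_score(y_adj, pred_one), r2_score(y_adj, pred_two)
adj_one = adjusted_r2_demo(y_adj, pred_one, 1)
adj_two = adjusted_r2_demo(y_adj, pred_two, 2)

print(r2_one, r2_two)
print(adj_one, adj_two)
```

The adjusted score is never higher than the plain $R^2$ score, and the gap grows with the number of input fields.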

### Saving model for future use

Once we have developed and tested a model, we can save the model for future use by creating a `joblib` file. This file stores the trained model, including the weights it has learned, so that we don't have to train it again.
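
The idea can be checked with a quick round trip: save a trained model, load it back and confirm that it makes the same predictions. This is a minimal sketch with made-up data and a temporary file; `demo_model` and `demo_model.joblib` are hypothetical names.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data following y = 2x + 1, for illustration only
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

# save the trained model to a temporary folder and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.joblib')
    joblib.dump(demo_model, path)
    restored_model = joblib.load(path)

# the restored model carries the same learned weights, so no retraining is needed
same_output = np.allclose(demo_model.predict(X_demo), restored_model.predict(X_demo))
print(same_output)
```

A `joblib` file should only be loaded if it comes from a source we trust, and ideally with the same version of `scikit-learn` that was used to save it.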

In [None]:
import joblib as jb

# save model
linear_regression_model = {'inputs':input_fields
                          ,'target':target_field
                          ,'model': model_3}
jb.dump(linear_regression_model, 'linear_regression.joblib')

### Importing saved model for usage

In [None]:
import joblib as jb
import numpy as np
import pandas as pd

saved_model = jb.load('linear_regression.joblib')

# creating an empty dictionary to store input values
input_values = {}

for feature in ('Hours Studied', 'Previous Scores', 'Sleep Hours', 'Sample Question Papers Practiced'):
    # fetching input values from user
    value = input(f'Enter the {feature.lower()}: ')

    # inserting input values into input_values dictionary
    input_values.update({feature:float(value)})

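# displaying predicted value (computed first, since nesting the same quote type inside an f-string fails before Python 3.12)
prediction = saved_model['model'].predict(pd.DataFrame(input_values, index=[0]))
print(f'Predicted performance for provided values is: {prediction[0]}')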