
In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()

housing

{'DESCR': '.. _california_housing_dataset:\n\nCalifornia Housing dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n    :Number of Instances: 20640\n\n    :Number of Attributes: 8 numeric, predictive attributes and the target\n\n    :Attribute Information:\n        - MedInc        median income in block group\n        - HouseAge      median house age in block group\n        - AveRooms      average number of rooms per household\n        - AveBedrms     average number of bedrooms per household\n        - Population    block group population\n        - AveOccup      average number of household members\n        - Latitude      block group latitude\n        - Longitude     block group longitude\n\n    :Missing Attribute Values: None\n\nThis dataset was obtained from the StatLib repository.\nhttps://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n\nThe target variable is the median house value for California districts,\nexpressed in hundreds of thousands of dollars

In [2]:
df = pd.DataFrame(housing['data'], columns = housing['feature_names'])
df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [3]:
df['target'] = housing['target']
df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [4]:
from sklearn.ensemble import RandomForestRegressor

X = df.drop('target', axis = 1)
y = df['target']


X.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [5]:
y.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: target, dtype: float64

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

model = RandomForestRegressor(n_estimators=100)

model.fit(X_train, y_train)



RandomForestRegressor()

In [7]:
model.score(X_test, y_test)

0.8126054397077004

#### Insights on R^2 (R-squared value)

1. It measures how well the independent variables predict the dependent variable.
2. Here the `target` column is the dependent variable and all other columns are the independent variables.
3. A perfect R-squared score is 1.0.
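
The definition behind the score can be written out by hand. The sketch below is only an illustration: the arrays `y_true` and `y_pred` are made-up values rather than the housing data, and the variable names `ss_res` and `ss_tot` are chosen here for readability. It follows the usual formula, R^2 = 1 - SS_res / SS_tot.

```python
import numpy as np

# Made-up values for illustration only (not taken from the housing data)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

# Variation the predictions fail to explain
ss_res = np.sum((y_true - y_pred) ** 2)

# Total variation of the targets around their own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

score = 1.0 - ss_res / ss_tot
print(score)  # close to 0.98
```

The closer the unexplained variation is to zero, the closer the score is to 1.0.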

In [8]:
from sklearn.metrics import r2_score

# Fill an array with the mean of y_test

y_test_mean = np.full(len(y_test), y_test.mean())

# np.full creates an array of length len(y_test) in which every element is the mean of y_test.
# Predicting the mean for every sample gives an R^2 of 0.

r2_score(y_true = y_test, y_pred = y_test_mean)

0.0

In [9]:
## Using Mean Absolute Error(MAE) metric

# MAE is the average of the absolute difference between predictions and actual values


from sklearn.metrics import mean_absolute_error

y_preds = model.predict(X_test)

mae = mean_absolute_error(y_test, y_preds)
mae


# In this run the MAE is about 0.323, so on average a prediction (y_preds) is about 0.323 away from the true value (y_test)

0.32313081099806223

In [10]:
df2 = pd.DataFrame(data = {'actual value': y_test, 
                           'predicted values' : y_preds})

df2['differences'] = df2['predicted values'] - df2['actual value']

df2.head()

Unnamed: 0,actual value,predicted values,differences
19880,0.555,0.61255,0.05755
17697,1.786,1.71447,-0.07153
2267,0.647,0.68297,0.03597
9927,1.755,1.66468,-0.09032
8024,2.174,2.23309,0.05909


In [11]:
# MAE using formula and differences

np.abs(df2['differences']).mean()

0.32313081099806223

In [12]:
from sklearn.metrics import mean_squared_error

mean_squared_error(y_test,y_preds)

0.24985296815824745
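
As with MAE, MSE can be reproduced from the differences. The sketch below assumes only the five rows of `df2` displayed earlier, so its result will not match the full test-set value printed above. The `rmse` variable is an addition for illustration: taking the square root brings the error back to the units of the target.

```python
import numpy as np

# First five rows of df2 as displayed earlier; the full test set is not reproduced here
actual = np.array([0.555, 1.786, 0.647, 1.755, 2.174])
predicted = np.array([0.61255, 1.71447, 0.68297, 1.66468, 2.23309])

differences = predicted - actual

mse = float(np.mean(differences ** 2))  # mean of the squared differences
rmse = float(np.sqrt(mse))              # square root, back in the target's units

print(mse, rmse)
```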

### Which regression metric should one use ?

* R-squared is similar to accuracy. It gives a quick indication of how well our model has done, but it does not tell us how far off each prediction is.

* MAE gives a better indication of how far off the model's predictions are, on average, from the actual values.

* MSE squares the differences between predicted values and actual values, so it amplifies larger differences.

- Pay more attention to MAE when being 10,000 off is twice as bad as being 5,000 off.

- Pay more attention to MSE when being 10,000 off is more than twice as bad as being 5,000 off.
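
The 5,000 versus 10,000 comparison can be checked with plain arithmetic. This is a small sketch using those two error sizes; the variable names are chosen here for illustration.

```python
small_error = 5_000.0
large_error = 10_000.0

# Penalty ratio under absolute error (what MAE averages)
abs_ratio = abs(large_error) / abs(small_error)

# Penalty ratio under squared error (what MSE averages)
sq_ratio = large_error ** 2 / small_error ** 2

print(abs_ratio)  # 2.0
print(sq_ratio)   # 4.0
```

Doubling an error doubles its contribution to MAE but quadruples its contribution to MSE.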

## Machine Learning Model Evaluation

Evaluating the results of a machine learning model is as important as building one.

But just like how different problems have different machine learning models, different machine learning models have different evaluation metrics.

Below are some of the most important evaluation metrics you'll want to look into for classification and regression models.

### Classification Model Evaluation Metrics/Techniques

* `Accuracy` - The proportion of predictions the model got right, in decimal form. Perfect accuracy is equal to 1.0.

* `Precision` - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.

* `Recall` - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.

* `F1 score` - The harmonic mean of precision and recall. A perfect model achieves an F1 score of 1.0.

* `Confusion matrix` - Compares the predicted values with the true values in a table. If the model is 100% correct, all non-zero values in the matrix lie on the diagonal from top left to bottom right.

* `Cross-validation` - Splits your dataset into multiple parts, trains and tests your model on each part, then evaluates performance as an average.

* `Classification report` - Sklearn has a built-in function called classification_report() which returns some of the main classification metrics such as precision, recall and f1-score.

* `ROC Curve` - The receiver operating characteristic curve is a plot of true positive rate versus false positive rate.

* `Area Under Curve (AUC) Score` - The area underneath the ROC curve. A perfect model achieves an AUC score of 1.0.
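
Several of the metrics above can be derived from the four counts in a binary confusion matrix. The sketch below uses made-up labels and plain NumPy rather than scikit-learn's helpers. The matrix layout (true classes as rows, predicted classes as columns) is an assumption made here for display.

```python
import numpy as np

# Made-up binary labels for illustration (1 = positive class)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

# Rows are true classes, columns are predicted classes (layout assumed here)
confusion = np.array([[tn, fp],
                      [fn, tp]])

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # correct share of predicted positives
recall = tp / (tp + fn)      # recovered share of actual positives
f1 = 2 * precision * recall / (precision + recall)

print(confusion)
print(accuracy, precision, recall, f1)
```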

#### Which classification metric should one use ?

* `Accuracy` is a good measure to start with if all classes are balanced (e.g. the same number of samples labelled 0 and 1).

* `Precision` and `recall` become more important when classes are imbalanced.

* If false-positive predictions are worse than false-negatives, aim for higher precision.

* If false-negative predictions are worse than false-positives, aim for higher recall.

* `F1-score` is the harmonic mean of precision and recall, so it is only high when both are high.

* A `confusion matrix` is always a good way to visualize how a classification model is performing.
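
To see why accuracy alone can mislead on imbalanced classes, consider a hypothetical "model" that always predicts the majority class. The labels below are made up for illustration.

```python
import numpy as np

# Made-up imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A stand-in "model" that always predicts the majority class (0)
y_pred = np.zeros_like(y_true)

accuracy = float(np.mean(y_true == y_pred))

true_positives = int(np.sum((y_true == 1) & (y_pred == 1)))
recall = true_positives / int(np.sum(y_true == 1))

print(accuracy)  # 0.95
print(recall)    # 0.0
```

Accuracy looks strong even though not a single positive sample is found, which is what recall exposes.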

### Regression Model Evaluation Metrics

* `R^2 (pronounced r-squared)` or the coefficient of determination - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.

* `Mean absolute error (MAE)` - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.

* `Mean squared error (MSE)` - The average of the squared differences between predictions and actual values. Squaring removes the sign of each error, so negative and positive errors do not cancel out. It also amplifies outliers (samples which have larger errors).
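
The range of R^2 is easiest to see with three contrasting sets of predictions. This is a sketch on made-up targets; `r_squared` is a hypothetical helper written for this example, not a scikit-learn function.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Hypothetical helper: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up targets for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_squared(y_true, y_true)                                 # exact predictions
mean_only = r_squared(y_true, np.full(len(y_true), y_true.mean()))  # always predict the mean
reversed_order = r_squared(y_true, y_true[::-1])                    # worse than the mean

print(perfect)         # 1.0
print(mean_only)       # 0.0
print(reversed_order)  # -3.0
```

A negative value means the predictions fit the targets worse than simply predicting their mean.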

#### Which regression metric should one use ?

* `R^2` is similar to accuracy. It gives you a quick indication of how well your model might be doing. Generally, the closer your R^2 value is to 1.0, the better the model. But it doesn't tell you how far off each prediction is.

* `MAE` is also referred to as L1 loss. It is less sensitive to large errors and outliers.

* `MSE` is referred to as L2 loss. It is more sensitive to large errors and outliers. `RMSE` is just the square root of `MSE`, which puts it back in the same units as the target.
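
The difference in outlier sensitivity can be illustrated with two small sets of errors that differ in a single value. The numbers and the helper names `mae` and `rmse` are made up for this sketch.

```python
import numpy as np

def mae(errors):
    """Mean absolute error of an array of prediction errors."""
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    """Root mean squared error of an array of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

# Made-up errors: identical except for one large miss
typical = np.array([1.0, -1.0, 1.0, -1.0])
with_outlier = np.array([1.0, -1.0, 1.0, -9.0])

print(mae(typical), rmse(typical))            # 1.0 1.0
print(mae(with_outlier), rmse(with_outlier))  # MAE is 3.0, RMSE is about 4.58
```

One large miss moves RMSE further than MAE, which is the sense in which L2 loss weights outliers more heavily.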
 





