## What is Scikit-Learn?

Scikit-learn (imported as `sklearn`) is a free, open-source machine learning library for Python. It is built to work with the NumPy and SciPy libraries and simplifies data science in Python with built-in support for popular classification, regression, and clustering algorithms.

Sklearn's tools share a consistent interface, so many ML components work together seamlessly. It also gives data scientists a one-stop toolkit to load and preprocess data, train models, and make predictions.


### Installation in a local Python 3 environment

pip install -U scikit-learn

### On Colab it is already installed
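
Either way, the installed version can be confirmed from Python. A minimal check, where `installed_version` is just an illustrative variable name:

```python
import sklearn

# __version__ holds the release string of the installed scikit-learn package
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```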


Now we’ll implement the <font color='red'>linear regression</font> machine learning algorithm using the insurance dataset. As with all ML algorithms, we’ll start by importing our dataset and then train the model on historical data.

From a mathematical point of view, linear regression fits a line to the data by minimizing the sum of squared residuals, where a residual is the difference between an observed value and the value the model predicts for it. In other words, we are minimizing the discrepancy between the data and the estimation model.

As shown in the figure below, the red line is the fitted model, the blue points are the original data, and the vertical distance between each point and the red line is its residual. Our goal is to minimize <font color='red'>the sum of squared residuals</font>.

![here](https://miro.medium.com/max/2400/1*A71zTD6_QqUzLhMKj1Rgiw.png)
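
To make the idea of residuals concrete, here is a minimal sketch on made-up data (not the insurance dataset). The helper `sum_squared_residuals` and the toy arrays are illustrative assumptions. It fits a line with `np.polyfit` and shows that shifting the slope away from the fitted value increases the sum of squared residuals:

```python
import numpy as np

# Toy data, made up for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_squared_residuals(slope, intercept):
    """Sum of squared vertical distances between the points and the line."""
    residuals = y - (slope * x + intercept)
    return float(np.sum(residuals ** 2))

# np.polyfit with deg=1 returns the least-squares slope and intercept
best_slope, best_intercept = np.polyfit(x, y, deg=1)

ssr_best = sum_squared_residuals(best_slope, best_intercept)
ssr_other = sum_squared_residuals(best_slope + 0.5, best_intercept)

print("SSR of the fitted line:", ssr_best)
print("SSR of a perturbed line:", ssr_other)
```

The fitted line has the smaller value, which is what "least squares" refers to.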

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv("insurance.csv")

In [3]:
data

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [4]:
data.describe() # summary statistics for each numeric column in the dataset

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,1338.0,1338.0
mean,39.207025,30.663397,1.094918,13270.422265
std,14.04996,6.098187,1.205493,12110.011237
min,18.0,15.96,0.0,1121.8739
25%,27.0,26.29625,0.0,4740.28715
50%,39.0,30.4,1.0,9382.033
75%,51.0,34.69375,2.0,16639.912515
max,64.0,53.13,5.0,63770.42801


In [5]:
data.head() # first five rows of the dataset

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


## Take a look at categorical columns

In [6]:
print(data.region.unique())
print(data.sex.unique())
print(data.smoker.unique())


['southwest' 'southeast' 'northwest' 'northeast']
['female' 'male']
['yes' 'no']


In [7]:
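data.corr(numeric_only=True) # Pearson correlation by default; values lie between -1 and 1
# numeric_only=True (pandas 1.5+) skips the text columns; older releases dropped them automatically
# 1 means a perfect positive linear relationship, -1 a perfect negative one, 0 no linear relationship
# Reference: https://www.geeksforgeeks.org/python-pandas-dataframe-corr/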

Unnamed: 0,age,bmi,children,charges
age,1.0,0.109272,0.042469,0.299008
bmi,0.109272,1.0,0.012759,0.198341
children,0.042469,0.012759,1.0,0.067998
charges,0.299008,0.198341,0.067998,1.0
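
The Pearson coefficient reported by `corr()` is the covariance of two columns divided by the product of their standard deviations. A minimal sketch on made-up arrays, where `a`, `b`, and `pearson_r` are illustrative names and not part of the notebook:

```python
import numpy as np

# Toy values, made up for illustration only
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

def pearson_r(u, v):
    """Covariance of u and v divided by the product of their standard deviations."""
    u_centered = u - u.mean()
    v_centered = v - v.mean()
    numerator = np.sum(u_centered * v_centered)
    denominator = np.sqrt(np.sum(u_centered ** 2) * np.sum(v_centered ** 2))
    return float(numerator / denominator)

r_manual = pearson_r(a, b)
r_numpy = float(np.corrcoef(a, b)[0, 1])  # NumPy's built-in equivalent

print(r_manual, r_numpy)
```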


In [8]:
data.corr(numeric_only=True)['charges'].sort_values() # correlation of each numeric column with charges, sorted ascending

children    0.067998
bmi         0.198341
age         0.299008
charges     1.000000
Name: charges, dtype: float64

In [9]:
data.dtypes

age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

In [10]:
# one-hot encode the categorical columns (one 0/1 indicator column per category)
encoded_data = pd.get_dummies(data, columns=["sex","smoker","region"])
encoded_data

Unnamed: 0,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.900,0,16884.92400,1,0,0,1,0,0,0,1
1,18,33.770,1,1725.55230,0,1,1,0,0,0,1,0
2,28,33.000,3,4449.46200,0,1,1,0,0,0,1,0
3,33,22.705,0,21984.47061,0,1,1,0,0,1,0,0
4,32,28.880,0,3866.85520,0,1,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50,30.970,3,10600.54830,0,1,1,0,0,1,0,0
1334,18,31.920,0,2205.98080,1,0,1,0,1,0,0,0
1335,18,36.850,0,1629.83350,1,0,1,0,0,0,1,0
1336,21,25.800,0,2007.94500,1,0,1,0,0,0,0,1
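
To see what one-hot encoding does on a smaller scale, here is a sketch on a tiny made-up frame (`toy` and `toy_encoded` are illustrative names). Passing `dtype=int` is an assumption added here so the indicators print as 0/1. Newer pandas releases default to booleans, so the display may otherwise differ from the table above.

```python
import pandas as pd

# Tiny made-up frame with one categorical and one numeric column
toy = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes"],
    "age": [19, 18, 28, 33],
})

# Each category of "smoker" becomes its own indicator column;
# dtype=int keeps the indicators as 0/1 integers
toy_encoded = pd.get_dummies(toy, columns=["smoker"], dtype=int)
print(toy_encoded)
```

Exactly one of the `smoker_*` columns is 1 in every row, which is why the encoding is called "one-hot".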


In [11]:
print(data['sex'].value_counts())
print('\n')
print(data['smoker'].value_counts())
print('\n')
print(data['region'].value_counts())

male      676
female    662
Name: sex, dtype: int64


no     1064
yes     274
Name: smoker, dtype: int64


southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64


# <font color='red'>By default, sklearn's LinearRegression uses ordinary least squares</font>

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Features: every column except the target; target: charges
X = encoded_data.drop(columns=['charges']).to_numpy()
y = encoded_data.charges.to_numpy()

# Split the data into training (75%) and testing (25%) sets.
# No random_state is set, so the split and the scores below vary between runs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# Fit an ordinary least squares linear regression on the training set
regr = LinearRegression()
regr.fit(X_train, y_train)



LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
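
As a rough illustration of what "ordinary least squares" means here, the sketch below solves the same minimization by hand with NumPy and compares the result with `LinearRegression`. The synthetic data and the names `X_demo`, `y_demo`, `A`, and `w` are assumptions made for this example. Scikit-learn's internal solver may differ in detail, but the fitted coefficients should agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumed for illustration): y = 3*x1 - 2*x2 + 5 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0 + rng.normal(scale=0.1, size=100)

# OLS by hand: append a column of ones for the intercept and
# solve min ||A w - y||^2 with NumPy's least-squares solver
A = np.column_stack([X_demo, np.ones(len(X_demo))])
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

model = LinearRegression().fit(X_demo, y_demo)

print("lstsq coefficients:", w[:2], "intercept:", w[2])
print("sklearn coefficients:", model.coef_, "intercept:", model.intercept_)
```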

In [17]:
pred_y = regr.predict(X_test)
print("The first five predictions {}".format(pred_y[:5]))
print('\n')
print("The real first five labels {}".format(y_test[:5]))
print('\n')
print("R Squared Score :-",regr.score(X_test, y_test)) # R squared score on the test set


The first five predictions [ 4159.77935786 12699.04554201  7311.81335195 13030.17184285
  8717.5865048 ]


The real first five labels [ 1263.249  12096.6512  5325.651  12730.9996 10107.2206]


R Squared Score :- 0.7017721494828559
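
The R squared score returned by `score` is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. A minimal sketch with made-up numbers (`y_true` and `y_hat` are illustrative, not the notebook's arrays), checked against `sklearn.metrics.r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_hat) ** 2)          # variation the model leaves unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A score of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.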


# That's the end of today's work

## Resources

1. [Mathematical Intuition of OLS](https://towardsdatascience.com/understanding-the-ols-method-for-simple-linear-regression-e0a4e8f692cc)
2. [R squared](https://www.youtube.com/watch?v=WuuyD3Yr-js)