### In this post, we are going to learn how to implement linear regression on the Boston Housing dataset using scikit-learn



## Boston Housing Dataset
   <img src="boston.jpeg" alt="drawing" style="width:490px;"/>



- The Boston Housing Dataset contains house prices for various areas of Boston.
- There are 506 samples and 13 feature variables in this dataset. The __objective__ is to __predict the price of a house__ using the given features.
- __We can also access this data from the scikit-learn library__ (note that `load_boston` was removed in scikit-learn 1.2, so the code in this post needs an earlier version).

__Steps__
- The data should be partitioned into __training and validation sets__ because we need two sets of data: one to build the model that describes the relationship between the predictor variables and the predicted variable, and another to validate the model's predictive accuracy.
- The training data set is used to build the model. The algorithm 'discovers' the model using this data set.
- The validation data is used to 'validate' the model. In this process, the model (built using the training data set) is used to make predictions on the validation data, that is, data that were not used to fit the model. In this way we get an unbiased estimate of how well the model performs. We compute measures of 'error', which reflect the prediction accuracy.
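
As a minimal sketch of what such 'error' measures can look like, the snippet below computes the mean absolute error, mean squared error, and root mean squared error with NumPy. The target values and predictions are made-up numbers used purely for illustration; they do not come from the Boston data.

```python
import numpy as np

# Hypothetical held-out targets and model predictions (made-up numbers).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, same units as the target

print(f"MAE:  {mae:.4f}")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

Lower values mean the predictions sit closer to the actual values; RMSE penalizes large individual errors more heavily than MAE does.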

![title](https://cdncontribute.geeksforgeeks.org/wp-content/uploads/Untitled-drawing-1-11.png)
<br>
__The dependent variable is MEDV, the median value of owner-occupied homes__, in $1000s.

So let’s get started.

#### First, we will import the required libraries.



In [1]:
import numpy as np
import matplotlib.pyplot as plt 

import pandas as pd  
import seaborn as sns 

In [2]:
from sklearn.datasets import load_boston
boston_dataset = load_boston()

In [3]:
# print(boston_dataset)  # uncomment to inspect the full dataset object
X=boston_dataset.data
Y=boston_dataset.target

In [4]:
print(boston_dataset.keys())
print(X.shape)
print(Y.shape)

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
(506, 13)
(506,)


- data: contains the information for various houses
- target: prices of the house
- feature_names: names of the features
- DESCR: describes the dataset

### The house price, indicated by the variable MEDV, is our target variable; the remaining variables are the features from which we will predict the value of a house

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


In [6]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)  # random 80/20 split; results vary between runs unless random_state is set
print(X_train.shape)
print(X_test.shape)

(404, 13)
(102, 13)
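
Because the split is random, the scores reported further down will change from run to run. One way to make the partition repeatable is to pass `random_state`; the sketch below uses a synthetic array with the same shape as the Boston data, and the names `X_demo`, `y_demo` and the seed `42` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston data (506 samples, 13 features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(506, 13))
y_demo = rng.normal(size=506)

# The same seed gives the same partition on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(np.array_equal(X_tr, split_b[0]))
```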


## Train our Linear Regression Model

In [7]:
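# 1. Create the model object
# (normalize was deprecated in scikit-learn 1.0 and removed in 1.2; drop it on newer versions)
lr = LinearRegression(normalize=True)
# 2. Train on the training set
lr.fit(X_train, Y_train)
# 3. Print the learned parameters
print(lr.coef_)       # one coefficient per feature
print(lr.intercept_)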

[-1.37599200e-01  5.67986712e-02  5.84764486e-03  3.98380452e+00
 -1.33584744e+01  3.94063974e+00 -7.78726190e-03 -1.63280336e+00
  2.94634607e-01 -1.07830014e-02 -7.18869283e-01  1.08037103e-02
 -6.29369660e-01]
30.44412535491947
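
To make the role of these parameters concrete: a prediction is the dot product of a sample's features with `coef_`, plus `intercept_`. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and `true_coef` are illustrative names and values, not part of the Boston example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 100 samples, 3 features, a known linear relationship plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_demo, y_demo)

# A prediction is the dot product of the features with coef_, plus intercept_.
manual_pred = X_demo @ model.coef_ + model.intercept_
print(np.allclose(manual_pred, model.predict(X_demo)))
print(model.coef_, model.intercept_)
```

The fitted coefficients should land close to `true_coef` and the intercept close to 4.0, since the noise is small relative to the signal.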


## Accuracy of Regression:
#### model.score(): for classification or regression problems, most estimators implement a score method. For regressors the score is R², which is at most 1 and can be negative when the model fits worse than always predicting the mean; a larger score indicates a better fit
> Once your model is fitted, you can check whether it works satisfactorily and interpret it. You can obtain the __coefficient of determination (𝑅²)__ by calling `.score()` on the model.<br>
> __The R^2 (or R Squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values__

In [10]:
print("Training Score %.4f"%lr.score(X_train,Y_train))
print("Testing Score %.4f"%lr.score(X_test,Y_test))

Training Score 0.7498
Testing Score 0.6379
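
The value returned by `.score()` for a regressor can be reproduced by hand as R² = 1 - SS_res / SS_tot. The helper `r2_by_hand` below is a hypothetical name introduced for this sketch, and the arrays are toy values chosen so the arithmetic is easy to follow.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r2_by_hand(y_true, y_true)                 # perfect predictions
r2_mean = r2_by_hand(y_true, [2.5, 2.5, 2.5, 2.5])      # always predicting the mean
r2_reversed = r2_by_hand(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than predicting the mean

print(r2_perfect, r2_mean, r2_reversed)  # 1.0 0.0 -3.0
```

The third case shows that R² is not bounded below by 0: predictions that are worse than the mean of the targets give a negative score.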


### The feature variables are also called independent variables, inputs, or predictors.



# Overfitting vs. Underfitting:

![title](https://cdn-images-1.medium.com/max/1000/1*6vPGzBNppqMHllg1o_se8Q.png)

- __Overfitting__: too much reliance on the training data
- __Underfitting__: a failure to learn the relationships in the training data
- __Overfitting and underfitting__ cause poor generalization on the test set

### The trade-off between overfitting and underfitting becomes apparent when we choose the polynomial degree of a model.
### An underfit model will be less flexible and cannot account for the data
- A model that is underfit will have __high training and high testing error__, while an overfit model will have __extremely low training error but a high testing error.__
- Both overfitting and underfitting lead to poor predictions on new data sets.
- Overfitting is often a result of an __excessively complicated model__, and it can be mitigated by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on held-out data.
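
A minimal sketch of cross-validation with scikit-learn is shown below. It uses a synthetic regression problem from `make_regression` as a stand-in for real data; the sample count, feature count, noise level, and `cv=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for real data.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# 5-fold cross-validation; for a regressor the default score is R^2.
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
print(scores)
print("Mean R^2: %.4f" % scores.mean())
```

Each of the five scores comes from a model trained on four folds and evaluated on the remaining one, so comparing the mean score across candidate models is less sensitive to one lucky or unlucky split.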
#### Overfitting refers to a model that models the training data too well.
#### Underfitting refers to a model that can neither model the training data nor generalize to new data.
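
The effect of polynomial degree can be explored with a small experiment. The sketch below fits polynomials of degree 1, 3, and 12 to noisy samples of a sine curve and reports training and test error; the toy data, the helper names `noisy_sine` and `mse`, and the chosen degrees are all assumptions made for illustration. Exact numbers depend on the random seed, but typically the lowest degree has high error on both sets, while the highest degree drives training error down without a matching improvement on the test set.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_sine(n):
    # Hypothetical toy data: one period of a sine wave plus Gaussian noise.
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

x_tr, y_tr = noisy_sine(30)    # small training set
x_te, y_te = noisy_sine(200)   # larger held-out set

results = {}
for degree in (1, 3, 12):
    poly = Polynomial.fit(x_tr, y_tr, deg=degree)  # least-squares polynomial fit
    results[degree] = (mse(y_tr, poly(x_tr)), mse(y_te, poly(x_te)))
    print(f"degree {degree:2d}: train MSE {results[degree][0]:.4f}, "
          f"test MSE {results[degree][1]:.4f}")
```

Training error can only go down as the degree grows, because each higher-degree model contains the lower-degree ones; the test error is what reveals whether the extra flexibility is helping or hurting.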