# Regression with Scikit-Learn

The notebook covers the following algorithms:
- Linear Regression
- Robust Regression
- Ridge Regression
- Lasso Regression
- ElasticNet

The dataset used is [Housing Price in Beijing](https://www.kaggle.com/ruiqurm/lianjia). House prices are continuous, so predicting them is a regression problem, and we are going to predict them using the regression algorithms listed above.

# Inspecting Data

Let us load the data and save it in a DataFrame:

In [None]:
import numpy as np
import pandas as pd
df = pd.read_csv("/kaggle/input/lianjia/new.csv", encoding='iso-8859-1', low_memory = False)
df.head()

`.info()` shows how many non-null values each column has, along with its data type. Here, we can see that many columns have missing values, which we need to fill with sensible values. Handling missing data is important because many machine learning algorithms do not support data with missing values.

In [None]:
df.info()

`.describe()` shows basic summary statistics (count, mean, standard deviation, min, max, and quartiles) for each numeric column of a DataFrame.

In [None]:
df.describe()

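`.corr()` computes the pairwise correlation (Pearson by default) between the columns of the DataFrame. Correlation is only defined for numeric columns, so we select those first.

In [None]:
# Recent pandas versions raise an error if non-numeric columns are passed to corr()
df.select_dtypes(include='number').corr()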

Now we keep the target, `totalPrice`, together with a few features chosen from the correlation matrix for their correlation with it.

In [None]:
df = df[['totalPrice', 'square', 'renovationCondition', 'communityAverage']]
df.head()
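As a rough illustration of how features can be ranked against the target, the sketch below sorts columns by the absolute value of their correlation with `totalPrice`. It uses a small synthetic DataFrame (`toy`) rather than the real data, and the `unrelated` column is a made-up name included only for contrast.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (values are made up for illustration)
rng = np.random.default_rng(0)
square = rng.uniform(30, 150, size=200)
toy = pd.DataFrame({
    "square": square,
    "unrelated": rng.normal(size=200),                         # no link to the price
    "totalPrice": 4.0 * square + rng.normal(0, 20, size=200),  # driven by square
})

# Absolute correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["totalPrice"]
    .drop("totalPrice")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

Features near the top of this ranking are reasonable candidates to keep. Correlation only captures linear relationships, so this is a first filter rather than a complete feature selection method.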

# Dealing with Missing Data

Oops! It seems like we're about to train on a DataFrame that contains NaN (Not a Number) values. Here are some strategies for handling NaN values:

- **Remove rows with NaN values**: This is generally not preferred, as a substantial amount of data can be lost, which can also skew the distribution of the remaining values.
- **Fill NaN values with a scalar**: This is helpful to some extent, but a constant such as 0 can distort the distribution of the column.
- **Fill NaN values with the mean**: This keeps every row and leaves each column's mean unchanged, although it reduces the column's variance. It is a simple and common default.

Let us fill NaN values using mean.

In [None]:
df = df.fillna(df.mean())
df.info()
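To make the trade-offs between these strategies concrete, here is a minimal sketch on a tiny made-up column (the values in `toy` are illustrative, not taken from the dataset). It applies all three strategies and compares the resulting mean and standard deviation.

```python
import numpy as np
import pandas as pd

# Tiny made-up column with two missing values
toy = pd.DataFrame({"square": [50.0, 80.0, np.nan, 120.0, np.nan]})

dropped = toy.dropna()                # strategy 1: remove rows with NaN
filled_scalar = toy.fillna(0)         # strategy 2: fill with a constant
filled_mean = toy.fillna(toy.mean())  # strategy 3: fill with the column mean

print("rows kept after dropna:", len(dropped))
print("mean  original / scalar / mean-filled:",
      toy["square"].mean(), filled_scalar["square"].mean(), filled_mean["square"].mean())
print("std   original / mean-filled:",
      toy["square"].std(), filled_mean["square"].std())
```

Filling with the mean keeps the column mean where it was, while filling with 0 pulls it down. The standard deviation shrinks after mean filling because the imputed rows all sit exactly at the center.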

Let us set `totalPrice` as the target variable.

In [None]:
X = df.drop('totalPrice', axis = 1)
y = df['totalPrice']

# Splitting Data into Training and Test Sets
We split the data into a training set and a test set using `train_test_split`. The model learns from the training set, where the target values are known, so that it can generalize to other data later on. The test set is held out so that we can evaluate the model's predictions on data it did not see during training. Here, 20% of the rows go to the test set, and `random_state` makes the split reproducible.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
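The following sketch shows what the split does on a small made-up array (`X_toy` and `y_toy` are illustrative names): the rows are divided 80/20, each feature row stays paired with its target, and reusing the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 10 made-up samples: row i is [2*i, 2*i + 1] and its target is i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=7)
# Same seed, same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.2, random_state=7)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```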

# Metrics for Testing Model Performance

MAE, MSE, RMSE, and R-squared are the most commonly used metrics for evaluating prediction error and model performance in regression analysis.

- **MAE (Mean Absolute Error)** is the average of the absolute differences between the actual and predicted values.
- **MSE (Mean Squared Error)** is the average of the squared differences between the actual and predicted values.
- **RMSE (Root Mean Squared Error)** is the square root of MSE, which expresses the error in the same units as the target.
- **R-squared (Coefficient of determination)** is the proportion of the variance in the actual values that the model explains. It is at most 1 and is often read as a percentage; it can be negative when the model does worse than always predicting the mean. The higher the value, the better the model.
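As a small worked example, the four metrics can be computed by hand with NumPy. The numbers in `y_true` and `y_pred` are made up so that the arithmetic is easy to follow.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred            # [ 1,  0, -1, -2]

mae = np.mean(np.abs(errors))       # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)          # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                 # square root of MSE, about 1.22

ss_res = np.sum(errors ** 2)                      # residual sum of squares = 6
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares = 20
r2 = 1 - ss_res / ss_tot                          # 1 - 6/20 = 0.7

print(mae, mse, rmse, r2)
```

The functions in `sklearn.metrics` used in this notebook compute the same quantities.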

Here we create helper functions that compute these metrics, together with a 10-fold cross-validation score, and append them to a DataFrame containing the results.

In [None]:
from sklearn import metrics
from sklearn.model_selection import cross_val_score

results_df = pd.DataFrame()
# Column order must match the row built in append_results
columns = ["Model", "MAE", "MSE", "RMSE", "R2", "Cross Val Score"]

def evaluate(true, predicted):
    """Return MAE, MSE, RMSE, and R2 for the given predictions."""
    mae = metrics.mean_absolute_error(true, predicted)
    mse = metrics.mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = metrics.r2_score(true, predicted)
    return mae, mse, rmse, r2_square

def append_results(model_name, model, results_df, y_test, pred):
    # Mean score (R2 by default for regressors) over 10-fold cross-validation on the full data
    cv_score = cross_val_score(model, X, y, cv=10).mean()
    results_append_df = pd.DataFrame(data=[[model_name, *evaluate(y_test, pred), cv_score]], columns=columns)
    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
    results_df = pd.concat([results_df, results_append_df], ignore_index=True)
    return results_df

# Linear Regression

Linear regression is an algorithm that models a dependent variable as a linear combination of one or more independent features. It learns the coefficients that minimize the squared error between the predictions and the observed values.

![Linear Regression](https://cdn-images-1.medium.com/max/1200/1*LEmBCYAttxS6uI6rEyPLMQ.png)

In [None]:
from sklearn.linear_model import LinearRegression

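# The `normalize` parameter was removed in scikit-learn 1.2; feature scaling
# does not change ordinary least squares predictions, so it is not needed here
lin_reg = LinearRegression()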
lin_reg.fit(X_train,y_train)

pred = lin_reg.predict(X_test)

results_df = append_results("Linear Regression",  LinearRegression(), results_df, y_test, pred)
results_df
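To show what fitting a linear regression amounts to, the sketch below fits `LinearRegression` on synthetic data and compares the result with a direct least-squares solve in NumPy. The data, the true coefficients (3, -2, intercept 5), and the names `X_toy`, `y_toy`, and `sk_model` are all made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3*x0 - 2*x1 + 5 plus a little noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + 5.0 + rng.normal(0, 0.1, size=100)

sk_model = LinearRegression().fit(X_toy, y_toy)

# Same problem solved directly: append a column of ones for the intercept
A = np.column_stack([X_toy, np.ones(len(X_toy))])
w, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

print("sklearn:", sk_model.coef_, sk_model.intercept_)
print("lstsq:  ", w[:2], w[2])
```

Both approaches minimize the same sum of squared residuals, so they should agree up to numerical precision and recover coefficients close to the ones used to generate the data.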

# Robust Regression

Robust regression is a form of regression analysis designed to overcome some limitations of traditional parametric and non-parametric methods. A common situation in which robust estimation is used occurs when the data contain outliers. One of the most popular approaches to fitting a model in the presence of outliers is **RANSAC, or Random Sample Consensus**.

Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates.

In [None]:
from sklearn.linear_model import RANSACRegressor

model = RANSACRegressor()
model.fit(X_train, y_train)

pred = model.predict(X_test)
results_df = append_results("Robust Regression",  RANSACRegressor(), results_df, y_test, pred)
results_df
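The effect of RANSAC is easiest to see on data with deliberately corrupted points. The sketch below is a toy setup, not the housing data: it generates points on the line y = 2x + 1, shifts ten of them far upward, and compares ordinary least squares with `RANSACRegressor`. The names `X_toy`, `y_toy`, `ols`, and `ransac` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

# Synthetic line y = 2x + 1 with small noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 1))
y_toy = 2.0 * X_toy[:, 0] + 1.0 + rng.normal(0, 0.2, size=100)
y_toy[:10] += 50.0  # turn the first 10 points into gross outliers

ols = LinearRegression().fit(X_toy, y_toy)
ransac = RANSACRegressor(random_state=0).fit(X_toy, y_toy)

# The fitted line lives in ransac.estimator_; inlier_mask_ flags the points it used
print("OLS    slope/intercept:", ols.coef_[0], ols.intercept_)
print("RANSAC slope/intercept:", ransac.estimator_.coef_[0], ransac.estimator_.intercept_)
print("outliers flagged as inliers:", ransac.inlier_mask_[:10].sum())
```

The least-squares fit is pulled toward the corrupted points, while RANSAC fits the final model only on the points it classifies as inliers, so its line should stay close to the one that generated the clean data.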

# Ridge Regression

Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients. The ridge coefficients minimize a penalized residual sum of squares:

$\min_{w} || X w - y||_2^2 + \alpha ||w||_2^2$

The complexity parameter $\alpha \geq 0$ controls the amount of shrinkage: the larger the value of $\alpha$, the greater the amount of shrinkage, and thus the more robust the coefficients become to collinearity.

In [None]:
from sklearn.linear_model import Ridge

model = Ridge()
model.fit(X_train, y_train)
pred = model.predict(X_test)
results_df = append_results("Ridge Regression",  Ridge(), results_df, y_test, pred)
results_df
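The penalized objective above has a closed-form minimizer when no intercept is fitted, $w = (X^T X + \alpha I)^{-1} X^T y$. The sketch below checks this against `Ridge(fit_intercept=False)` on synthetic data and shows the coefficient norm shrinking as $\alpha$ grows. The data and the chosen `alpha` values are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with known coefficients
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(0, 0.1, size=50)

alpha = 10.0

# Closed-form ridge solution (no intercept): (X^T X + alpha * I)^-1 X^T y
w_closed = np.linalg.solve(X_toy.T @ X_toy + alpha * np.eye(3), X_toy.T @ y_toy)
w_sklearn = Ridge(alpha=alpha, fit_intercept=False).fit(X_toy, y_toy).coef_
print(w_closed, w_sklearn)

# Larger alpha means more shrinkage, so the coefficient norm decreases
norms = [
    np.linalg.norm(Ridge(alpha=a, fit_intercept=False).fit(X_toy, y_toy).coef_)
    for a in (0.1, 10.0, 1000.0)
]
print(norms)
```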

# Lasso Regression

Lasso regression is a type of linear regression that uses shrinkage: an L1 penalty shrinks the coefficient estimates toward zero and can set some of them exactly to zero. The lasso procedure therefore encourages simple, sparse models (i.e. models with fewer non-zero coefficients). This particular type of regression is well suited to data showing high levels of multicollinearity.

In [None]:
from sklearn.linear_model import Lasso

model = Lasso()
model.fit(X_train, y_train)
pred = model.predict(X_test)
results_df = append_results("Lasso Regression",  Lasso(), results_df, y_test, pred)
results_df
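The sparsity that lasso encourages can be seen on synthetic data where only some features matter. In this sketch, the target depends on the first two of five made-up features; `alpha=0.5` is an arbitrary value chosen to make the effect visible, not a recommended setting.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

# Five synthetic features, but only the first two influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0, 0.1, size=200)

ols = LinearRegression().fit(X_toy, y_toy)
lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)

print("OLS coefficients:  ", ols.coef_)
print("Lasso coefficients:", lasso.coef_)
print("non-zero lasso coefficients:", np.count_nonzero(lasso.coef_))
```

Ordinary least squares assigns small but non-zero weights to the irrelevant features, whereas lasso should drive them to exactly zero and shrink the remaining coefficients toward zero.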

# ElasticNet

ElasticNet combines the L1 penalty (lasso) and the L2 penalty (ridge regression) in a single penalized linear regression model. This gives it both sparsity (L1) and robustness (L2) properties.

In [None]:
from sklearn.linear_model import ElasticNet

model = ElasticNet()
model.fit(X_train, y_train)
pred = model.predict(X_test)
results_df = append_results("ElasticNet Regression",  ElasticNet(), results_df, y_test, pred)
results_df
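In scikit-learn, the mix between the two penalties is set by the `l1_ratio` parameter of `ElasticNet`, where `l1_ratio=1.0` corresponds to a pure L1 penalty. The sketch below, on made-up data with arbitrary `alpha` and `l1_ratio` values, checks that this setting reproduces `Lasso` and fits a mixed model for comparison.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Five synthetic features, only the first two influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(0, 0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X_toy, y_toy)
enet_l1 = ElasticNet(alpha=0.5, l1_ratio=1.0).fit(X_toy, y_toy)   # pure L1 penalty
enet_mix = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X_toy, y_toy)  # half L1, half L2

print("Lasso:                ", lasso.coef_)
print("ElasticNet (l1=1.0):  ", enet_l1.coef_)
print("ElasticNet (l1=0.5):  ", enet_mix.coef_)
```

In practice, both `alpha` and `l1_ratio` are usually tuned with cross-validation rather than fixed by hand.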

# Summary

In this notebook, you learned about:

- Inspecting Data
- Dealing with Missing Values
- Splitting Data into Training and Test Sets
- Common Linear Regression Algorithms (Linear, Ridge, Lasso, etc.)
- Comparison of all algorithms
