## Multiple Linear Regression

#### What is Linear Regression?
- Linear Regression is a statistical method for examining the relationship between two or more variables of interest: the independent variables and the dependent variable.
- Linear Regression examines the effects of one or more independent variables on a dependent variable.
- Linear regression is a supervised learning algorithm.
- Applications of Linear Regression include predicting health outcomes in medicine, stock prices in finance, and power usage in high-performance computing, as well as measuring the effect of pricing and promotions on product sales.


#### Format of Linear Regression formula

- In Multiple Linear Regression, multiple independent variables (x1, x2, x3, ..., xn) are used to predict the value of a single dependent variable (y).

![Regression Formula](pics/MLRFormula.png)
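
As a rough illustration, and assuming the formula takes the usual form y = b0 + b1*x1 + ... + bn*xn, the prediction for one observation is the intercept plus the dot product of the coefficients and the feature values. The numbers below are made up for illustration and are not fitted to the HousePrices data.

```python
import numpy as np

# Hypothetical intercept (b0) and coefficients (b1..b4), for illustration only
b0 = 50.0
b = np.array([2.0, 3.0, 10.0, 5.0])

# Hypothetical feature values (x1..x4) for a single observation
x_row = np.array([1.0, 2.0, 3.0, 4.0])

# y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4
y_hat = b0 + np.dot(b, x_row)
print(y_hat)
```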


#### The Goal
The goal of this lab is to predict the sale price of newly built houses.

#### About the "HousePrices" dataset
The House Prices dataset contains 100 observations and 5 different attributes (4 independent variables and 1 dependent variable)

#### Independent Variables
    1. House Sqft – square footage of the property (X1)
    2. Taxes - property tax will be calculated on this value (X2)
    3. Bedrooms – number of bedrooms in the property (X3)
    4. Bathrooms – number of bathrooms in the property (X4)

#### Dependent Variable
    5. Last Sold Price - the price the property was last sold for (Y)

#### Download and Install Python Libraries

In [None]:
#!pip install pandas
#!pip install numpy
#!pip install scikit-learn
#!pip install scipy
#!pip install seaborn
#!pip install matplotlib

#### Import Python Libraries

In [None]:
# Importing some common libraries that are needed for most data science projects
import numpy as np
import pandas as pd
import math
import scipy


# Importing different modules from the sklearn library to build and evaluate the linear regression model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score


# Importing matplotlib and seaborn libraries for data visualisation 
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


# Switching off unnecessary warning messages 
import warnings
warnings.filterwarnings('ignore')





#### Process map
The list below outlines the 14-step process used in this lab.

    1.	Import Data
    2.	Data Quality Checks
    3.	Data Cleansing
    4.	Exploratory Analysis using Aggregations
    5.	Exploratory Analysis using Distributions
    6.	Exploratory Analysis using Correlations
    7.	Visualisations
    8.	Model: Pre-processing
    9.	Model: Train/Test Split
    10.	Model: Build (Train dataset)
    11.	Model: Evaluation (Train dataset)
    12.	Model: Evaluation (Test dataset)
    13.	Model: Predictions
    14.	Model: Save Predictions


#### 1. Import Data

In [None]:
# Reading data from a CSV file and saving that data into a dataframe called "df"

df = pd.read_csv("HousePrices.csv")
df

#### 2. Data Quality Checks

    2.1 Check data
    2.2 Check shape of data
    2.3 Check for duplicates
    2.4 Check for missing values

In [None]:
# 2.1
# Viewing top 5 records
df.head()

# Viewing last 5 records
#df.tail()

# Viewing top 3 records
#df.head(n=3)

# Viewing last 3 records
#df.tail(n=3)

In [None]:
# 2.2
# Looking at the structure of the dataframe

df.shape

In [None]:
# 2.3
# Let’s use duplicated() function to identify how many duplicate records there are in the dataset

df.duplicated().sum()
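
To see what `duplicated()` flags, here is a small sketch on a made-up dataframe (the name `toy` and its values are illustrative assumptions, not part of the lab data). By default, a row is marked as a duplicate only when every column matches an earlier row.

```python
import pandas as pd

# Illustrative data: the third row repeats the first row exactly
toy = pd.DataFrame({
    "HouseSqft": [1000, 1200, 1000],
    "Bedrooms": [2, 3, 2],
})

# Boolean mask: True for rows that repeat an earlier row
dup_mask = toy.duplicated()

# Number of duplicate rows
n_dups = int(dup_mask.sum())
print(dup_mask.tolist(), n_dups)
```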

In [None]:
# 2.4
# info() prints a summary of the dataframe: index, column dtypes, non-null counts and memory usage
# Comparing the non-null counts with the number of rows reveals missing values
# If any are found, they can be filled in during data cleansing (step 3.3)

df.info()
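
Since `info()` reports non-null counts, missing values have to be inferred by comparing those counts with the row count. A more direct per-column count can be sketched with `isna().sum()`. The `toy` dataframe below is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value in each of two columns
toy = pd.DataFrame({
    "HouseSqft": [1000.0, np.nan, 1500.0],
    "Taxes": [2000.0, 2500.0, np.nan],
    "Bedrooms": [2, 3, 3],
})

# Count the missing values in every column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```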

#### 3. Data Cleansing

    3.1 Converting data types
    3.2 Remove duplicates
    3.3 Fill missing values
    3.4 Outlier detection and treatment

In [None]:
# 3.1
# Converting data type of a column using astype() method

df["HouseSqft"] = df.HouseSqft.astype("float64")
df["Taxes"] = df.Taxes.astype("float64")
df["Bedrooms"] = df.Bedrooms.astype("category")
df["Bathrooms"] = df.Bathrooms.astype("category")
df["LastSoldPrice"] = df.LastSoldPrice.astype("int64")
df.info()

In [None]:
# 3.2
# This is how you remove all the duplicates from the dataset using drop_duplicates() function

df = df.drop_duplicates()

In [None]:
# 3.3
# Fill missing values (NaN, Null) with median value of a column

In [None]:
# This is how you calculate the median for all numeric columns in the dataframe
# numeric_only=True skips the category columns, which have no median
df.median(numeric_only=True)

In [None]:
# This is how you calculate median value for a specific column
df.HouseSqft.median()

In [None]:
# This is how you fix missing values for all numeric columns
# df = df.fillna(df.median(numeric_only=True))

In [None]:
# This is how you fix a missing value for a specific column
df.HouseSqft = df.HouseSqft.fillna(df.HouseSqft.median())
df

In [None]:
# The non-null counts reported by info() confirm that the missing values have been replaced with the median
df.info()
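
One reason the median is often preferred to the mean for filling gaps is that it is less affected by extreme values. The sketch below uses made-up square-footage values (an assumption for illustration) to compare the two before filling the missing entry.

```python
import numpy as np
import pandas as pd

# Illustrative values: one very large house and one missing entry
sqft = pd.Series([1000.0, 1200.0, 1100.0, 9000.0, np.nan], name="HouseSqft")

mean_value = sqft.mean()      # pulled upwards by the 9000 entry
median_value = sqft.median()  # middle of the observed values

# Replace the missing entry with the median
filled = sqft.fillna(median_value)
print(mean_value, median_value)
print(filled.tolist())
```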

#### 3.4 Outlier Detection and Treatment
- One of the most important steps in data cleansing is outlier detection and treatment.
- Outliers are defined as data points that are significantly different from the remaining data. Those are points that lie outside the overall pattern of the distribution. Statistical measures such as mean, variance and correlation are very susceptible to outliers.

#### Outlier Detection
- This can be done through visualising the data (Box and whisker plot)


#### Outlier Treatment
- This can be done by replacing an outlier with the mean, the median or another chosen value (this lab uses the IQR limits)

![boxplot](pics/boxplot1.png)

In [None]:
# Outlier detection using boxplot from seaborn library

sns.boxplot(data=df[["HouseSqft","Taxes"]])
plt.show()

In [None]:
# Outlier treatment
# Calculation of Q1, Q3, IQR for "HouseSqft" column:

q1 = np.percentile(df.HouseSqft,[25])[0]
q3 = np.percentile(df.HouseSqft,[75])[0]
iqr = q3-q1

ll = q1 - iqr*1.5 #lower limit
ul = q3 + iqr*1.5 #upper limit

print(ll)
print(ul)
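
The IQR rule can be worked through on a small made-up array to make the limits concrete. This is a sketch with illustrative values, not the lab data. Points below `lower` or above `upper` count as outliers, and `np.clip` caps them at the limits.

```python
import numpy as np

# Illustrative values with one obvious outlier (100)
values = np.array([10, 12, 14, 16, 18, 20, 22, 24, 100], dtype=float)

# First and third quartiles, and the interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Limits at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Detection: points outside either limit
outliers = values[(values < lower) | (values > upper)]

# Treatment: cap every value to the [lower, upper] range
capped = np.clip(values, lower, upper)
print(lower, upper, outliers, capped.max())
```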

In [None]:
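# Option1 - Detecting outliers and capping them at the lower/upper limits
# .loc is used so the assignment modifies df itself rather than a temporary copy

df.loc[df.HouseSqft > ul, "HouseSqft"] = ul
df.loc[df.HouseSqft < ll, "HouseSqft"] = ll
df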

In [None]:
# Please note the dataset contains 100 records after imputing outliers
df.shape

In [None]:
# Option2 - Filtering out outliers from the dataset
# Please note the dataset only contains 95 records after filtering, because 5 records are treated as outliers

# df = df.loc[(df.HouseSqft<=ul) & (df.HouseSqft>=ll) , ["HouseSqft","Taxes", "Bedrooms","Bathrooms","LastSoldPrice"]]

# The code below produces the same output as above
# By omitting the column selection, all columns are returned
# Both conditions must hold, so they are combined with & (not |)
# df.loc[(df.HouseSqft<=ul) & (df.HouseSqft>=ll)]

#### 4. Exploratory Analysis using Aggregations

In [None]:
# Total number of houses and mean house price
df.agg({"LastSoldPrice": ['count', 'mean']})

#### 5. Exploratory Analysis using Distributions

In [None]:
# Mean house price by bedrooms and bathrooms

df.groupby(by=["Bedrooms", "Bathrooms"]).agg({"LastSoldPrice": ['count','mean']}).dropna()

In [None]:
# describe() reports descriptive statistics that summarise the central tendency,
# dispersion and shape of each numeric column's distribution, excluding NaN (Not a Number) values

df.describe()

#### 6. Exploratory Analysis using Correlations

- One of the valuable aspects of regression is that it can deal with some amount of correlation among independent variables. However, too much multicollinearity in the data can be a problem.
- Multicollinearity arises when two variables that measure the same or similar things (e.g., weight and BMI) are both included in a multiple regression model. Because they carry overlapping information, the model cannot reliably separate their individual effects, which makes the coefficient estimates unstable and hard to interpret.
- The main goal: choose independent variables that are highly correlated with the dependent variable (they provide information), but that are not highly correlated with other independent variables (the same information is not repeated)
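
One simple way to spot candidate multicollinearity is to scan the correlation matrix for pairs of independent variables with a high absolute correlation. The sketch below uses made-up data, a hypothetical `Age` column, and an arbitrary threshold of 0.8. All three are assumptions for illustration.

```python
import pandas as pd

# Illustrative data: Taxes is an exact linear function of HouseSqft
toy = pd.DataFrame({
    "HouseSqft": [1000, 1500, 2000, 2500, 3000],
    "Taxes": [2100, 3100, 4100, 5100, 6100],
    "Age": [30, 5, 22, 8, 15],  # hypothetical column, not in the lab dataset
})

corr = toy.corr()
threshold = 0.8  # arbitrary cut-off chosen for this sketch

# Collect each pair of columns once (upper triangle of the matrix)
cols = list(corr.columns)
high_pairs = [
    (a, b)
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```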

In [None]:
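# Creating the correlation matrix
# numeric_only=True restricts it to the numeric columns (Bedrooms and Bathrooms are categories)

df.corr(numeric_only=True)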

In [None]:
# The code below can be used to drop a column that causes multicollinearity
# df = df.drop(["Taxes"], axis=1)
# df.info()

#### 7. Visualisations

In [None]:
# jointplot from the seaborn library can be used to create a scatterplot with a regression line

sns.jointplot(x=df.HouseSqft, y=df.LastSoldPrice, kind="reg")
plt.xlabel('HouseSqft')
plt.ylabel('LastSoldPrice')
plt.show()

#### 8. Model: Pre-processing

Encoding technique converts categorical data into numerical data

![Encode](pics/Encode.png)

In [None]:
# Converting categorical variables into dummy variables (one-hot encoding)

df = pd.get_dummies(data=df, columns=["Bedrooms", "Bathrooms"], drop_first=True)
df
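
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a made-up `Bedrooms` column (the values are illustrative). The first category (2) gets no column of its own and becomes the baseline: a row with all dummy columns equal to 0 represents that category.

```python
import pandas as pd

# Illustrative categorical column with three categories: 2, 3 and 4
toy = pd.DataFrame({"Bedrooms": pd.Categorical([2, 3, 4, 3])})

# One-hot encode, dropping the first category (2) as the baseline
encoded = pd.get_dummies(data=toy, columns=["Bedrooms"], drop_first=True)
print(encoded)
```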

In [None]:
# Check for column names and other info

df.info()

#### 9. Model: Train/Test Split 

**Step1: Split dataset to X and Y variables**

In [None]:
# Separation of independent variables and dependent variable

x = df.loc[:, df.columns != "LastSoldPrice"]
y = df.loc[:, df.columns == "LastSoldPrice"]

In [None]:
# Exploring all independent variables

x.head()

In [None]:
# Exploring the dependent variable

y.head()

In [None]:
# Exploring the shape of x and y datasets - (no of rows, no of columns)

x.shape, y.shape

**Step2: Performing 70:30 Data split**
- After separating the columns into independent and dependent variables (x, y), split them into a training set and a testing set (70:30)


![split data](pics/traintestsplitdata1.png)

In [None]:
# Splitting data into train and test datasets --> 70:30 split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
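
The sketch below shows, on a made-up 10-row array, how `test_size=0.3` divides the rows and how a fixed `random_state` makes the split repeatable. The array names are illustrative assumptions chosen so they do not clash with the lab variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: 10 rows, 2 feature columns
features = np.arange(20).reshape(10, 2)
target = np.arange(10)

# 70:30 split with a fixed seed
a_train, a_test, b_train, b_test = train_test_split(
    features, target, test_size=0.3, random_state=1
)

# Repeating the call with the same seed selects the same rows
a_train2, a_test2, b_train2, b_test2 = train_test_split(
    features, target, test_size=0.3, random_state=1
)
print(a_train.shape, a_test.shape)
```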

In [None]:
# Exploring the dimensions of train datasets

x_train.shape, y_train.shape

In [None]:
# Exploring the dimensions of test datasets

x_test.shape, y_test.shape

#### 10. Model: Build (Train dataset)

In [None]:
# Using the sklearn library to build a Linear Regression Model
# from sklearn.linear_model import LinearRegression --> this code imports the LinearRegression class


# Create a linear regression model using the LinearRegression() class
model = LinearRegression()

In [None]:
# fitting the training data (70%) to the linear regression model
# this will generate the intercept and all the coefficients

model.fit(x_train, y_train)
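
To illustrate what `fit` estimates, the sketch below builds made-up data from a known relationship (y = 3 + 2*x1 + 0.5*x2, with no noise) and checks that the fitted intercept and coefficients recover it. The names `X_demo`, `y_demo` and `demo_model` are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative features: 5 observations, 2 independent variables
X_demo = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 5.0],
    [4.0, 3.0],
    [5.0, 8.0],
])

# Target generated from a known linear relationship, without noise
y_demo = 3.0 + 2.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1]

# Fitting estimates the intercept and one coefficient per feature
demo_model = LinearRegression().fit(X_demo, y_demo)
print(demo_model.intercept_, demo_model.coef_)
```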

#### 11. Model: Evaluation (Train dataset)

In [None]:
# Exploring the intercept

model.intercept_

In [None]:
# Exploring the coefficients

model.coef_

In [None]:
# The raw intercept and coefficients printed above are hard to read
# The dataframe layout below presents the "intercept" more clearly

pd.DataFrame(np.array(model.intercept_), index=["Intercept"], columns=["Intercept"])

In [None]:
# The raw intercept and coefficients printed above are hard to read
# The dataframe layout below presents the "coefficients" more clearly, one row per independent variable

pd.DataFrame(np.array(model.coef_).T, index=x.columns, columns=["Coefficients"])

#### 12. Model: Evaluation (Test dataset)

In [None]:
# Applying the linear regression model to make predictions on the testing dataset (30%)

y_pred = model.predict(x_test)
y_pred

In [None]:
# Evaluating the above predicted results (model performance)

print("Root Mean squared error (RMSE):{}".format(math.sqrt(mean_squared_error(y_test, y_pred))))
print("Coefficient of determination (R^2):{}".format(r2_score(y_test, y_pred)))
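
For reference, both metrics can be computed by hand. The sketch below uses made-up actual and predicted values (an assumption for illustration): RMSE is the square root of the mean squared error, and R^2 is 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Illustrative actual and predicted values
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.0, 5.0, 8.0, 9.0])

# RMSE: square root of the mean of the squared errors
rmse = np.sqrt(np.mean((actual - predicted) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(rmse, r2)
```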

#### 13. Model: Predictions

In [None]:
# Predicting on new data

# Reading data from a CSV file and saving that data as a dataframe
dfp = pd.read_csv("HousePricesPredict.csv")

# Viewing records
dfp

In [None]:
# info() prints a summary of the dataframe: index, column dtypes, non-null counts and memory usage
# Comparing the non-null counts with the number of rows reveals missing values
# If any are found, those rows are removed with dropna() further below

dfp.info()

In [None]:
# Converting data type of a column using astype() method

dfp["HouseSqft"] = dfp.HouseSqft.astype("float64")
dfp["Taxes"] = dfp.Taxes.astype("float64")
dfp["Bedrooms"] = dfp.Bedrooms.astype("category")
dfp["Bathrooms"] = dfp.Bathrooms.astype("category")
dfp.info()

In [None]:
# Let’s remove all the duplicates from the dataset

dfp = dfp.drop_duplicates()

In [None]:
# Let’s remove all the null values from the dataset

dfp = dfp.dropna()

In [None]:
# Let’s convert categorical variables into dummy variables (one-hot encoding)

dfp2 = pd.get_dummies(data=dfp,columns=["Bedrooms", "Bathrooms"], drop_first=True)
dfp2

In [None]:
# Looking at the structure of the Dataframe

dfp2.shape

In [None]:
# This method prints out information about a dataFrame including the index, dtype, columns, non-null values and memory usage

dfp2.info()

In [None]:
# Making new predictions using the "model" that was created in the earlier section

newhouseprices = model.predict(dfp2)
newhouseprices
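
`get_dummies` only creates columns for the categories present in the data it is given, so a prediction file can end up with fewer or differently ordered columns than the training data. One possible safeguard, sketched below with made-up column names and values, is to reindex the new data to the training columns and fill any absent dummy column with 0.

```python
import pandas as pd

# Hypothetical column order used when the model was trained
train_cols = ["HouseSqft", "Taxes", "Bedrooms_3", "Bedrooms_4"]

# Hypothetical new data that has no 4-bedroom house, so Bedrooms_4 is missing
new_data = pd.DataFrame({
    "HouseSqft": [1500.0],
    "Taxes": [3000.0],
    "Bedrooms_3": [1],
})

# Match the training layout: same columns, same order, absent dummies set to 0
aligned_data = new_data.reindex(columns=train_cols, fill_value=0)
print(aligned_data)
```

In this notebook the training columns are available as `x.columns`. Note that with `drop_first=True` the dropped category also depends on the data, so this sketch is not a complete fix for every case.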

In [None]:
# Converting predicted results into a dataframe ("dfr")

dfr = pd.DataFrame(newhouseprices, columns=["newhouseprices"])
dfr

In [None]:
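# Attaching predicted prices to the original dataset, and saving as a new dataframe ("newdf")
# The index of dfp is reset first so its rows line up with dfr, whose index starts at 0
# (dfp can have gaps in its index after drop_duplicates() and dropna())

newdf = dfp.reset_index(drop=True).join(dfr)
newdf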
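
`join` matches rows by index label, not by position. The sketch below uses made-up frames (illustrative names and values) to show what happens when rows have been dropped from one frame, and how resetting the index restores positional alignment.

```python
import pandas as pd

# Illustrative features whose index has a gap, as after dropping a row
house_rows = pd.DataFrame({"HouseSqft": [1000.0, 1500.0, 2000.0]}, index=[0, 2, 3])

# Illustrative predictions with a fresh 0..n-1 index
pred_rows = pd.DataFrame({"newhouseprices": [100.0, 150.0, 200.0]})

# Joining directly pairs rows by label: label 2 gets the third prediction, label 3 gets NaN
misaligned = house_rows.join(pred_rows)

# Resetting the index first pairs rows by position
aligned_rows = house_rows.reset_index(drop=True).join(pred_rows)
print(misaligned)
print(aligned_rows)
```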

#### 14. Model: Save Predictions

In [None]:
# Save the above dataframe ("newdf") as a CSV file

newdf.to_csv("NewHousePricesPredicted.csv")