## Simple Linear Regression

### Agenda
   
    ♦ Problem Description
    ♦ Data Understanding
    ♦ Handling Missing values
    ♦ Exploratory Data Analysis
    ♦ Split the data into Train and Validation sets
    ♦ Build a simple linear regression model
    ♦ Interpret model results


### Problem Description

The given data contains various attributes of cars. Use a linear regression approach to estimate the price of a car from the collected features.


Price - The cost of the car

Country - The country in which the car is up for sale

Reliability - An ordinal metric for understanding the reliability of the car

Mileage - The fuel efficiency of the car

Type - A categorical variable defining the category to which the car belongs.

Weight - The weight of the car

Displacement - Represents the engine displacement of the car

HP - Horsepower of the car, a unit that measures its power


### Data Reading

In [None]:
getwd()

In [None]:
## Load the data
cars_data  <- read.csv(file = "cars.csv")

### Data Understanding

Check the number of observations and attributes.

Identify the independent variables and the dependent variable.

In linear regression, the dependent variable is a continuous variable.

In simple linear regression, we predict the dependent variable from a single independent variable.

For this example, we will use Price as the dependent variable and the Displacement of the car (the `Disp.` column) as the independent variable.


In [None]:
dim(cars_data)

In [None]:
str(cars_data)

In [None]:
head(cars_data)

In [None]:
tail(cars_data)

### Summary Statistics

In [None]:
summary(cars_data)

### Data Type Conversion
Check if any data type conversions have to be done.


In [None]:
#Convert "Reliability" to factor variable
cars_data[, "Reliability"] <- as.factor(as.character(cars_data[, "Reliability"]))

In [None]:
cars_data[, "Country"] <- as.factor(as.character(cars_data[, "Country"]))
cars_data[, "Type"] <- as.factor(as.character(cars_data[, "Type"]))

In [None]:
str(cars_data)

In [None]:
summary(cars_data)

In [None]:
table(cars_data$Country)

### Handling Missing Values

In [None]:
## Look for Missing Values
sum(is.na(cars_data))

In [None]:
colSums(is.na(cars_data))

In [None]:
which(is.na(cars_data$Reliability))

In [None]:
library(DMwR)

In [None]:
## Impute missing values
# centralImputation() fills NAs with the column median (numeric) or mode (factor)
cars_data <- centralImputation(cars_data)

In [None]:
sum(is.na(cars_data))
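
If the `DMwR` package is not available in your environment, the same median/mode strategy can be sketched in base R. The helper name `central_impute()` and the `toy` data frame are assumptions made up for this illustration; they are not part of the original notebook or of any package.

```r
# Hypothetical helper: fill NAs with the median (numeric columns)
# or the most frequent value (factor/character columns).
central_impute <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (!any(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      counts <- table(df[[col]])  # table() ignores NA by default
      df[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}

# Made-up data, for illustration only
toy <- data.frame(
  Mileage = c(20, NA, 30, 25),
  Type = factor(c("Small", "Small", NA, "Van"))
)
toy_filled <- central_impute(toy)
toy_filled
```

The call `cars_data <- central_impute(cars_data)` would then play the role of the `centralImputation()` cell, under the assumptions stated above.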

In [None]:
options(repr.plot.width = 10, repr.plot.height = 10)
plot(cars_data$Mileage, cars_data$Price, xlab = "Mileage",
     ylab = "Price", main = "Mileage vs Price")

### Exploratory Data Analysis

In [None]:
# Plot the dependent variable against each independent variable
# A scatter plot shows the relationship between two continuous variables

options(repr.plot.width = 10, repr.plot.height = 10)
par(mfrow = c(2,2)) # Splits the plotting pane 2*2

plot(cars_data$Weight, cars_data$Price, xlab = "Weight", 
     ylab = "Price", main = "Weight vs Price")

plot(cars_data$Mileage, cars_data$Price, xlab = "Mileage",
     ylab = "Price", main = "Mileage vs Price")

plot(cars_data$Disp., cars_data$Price, xlab = "Displacement",
     ylab = "Price", main = "Displacement vs Price")

plot(cars_data$HP, cars_data$Price, xlab = "Horse Power", 
     ylab = "Price", main = "Horse Power vs Price")


In [None]:
# Correlation can be calculated only for numeric attributes.
cor_data = cor(cars_data[,c("Price","Mileage","Weight","Disp.","HP")])
# Pairwise correlations between the dependent and independent variables
cor_data

# The correlation coefficient ranges from -1 to 1
## Corrplot
library(corrplot)
par(mfrow = c(1,1))

corrplot(cor_data, method = "number")
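
To see what `cor()` computes for a single pair of variables, here is a minimal sketch of the Pearson correlation coefficient worked out by hand. The vectors `x` and `y` are made-up values for illustration, not rows from `cars.csv`.

```r
# Made-up values, for illustration only
x <- c(100, 150, 200, 250, 300)
y <- c(9000, 12500, 15000, 19000, 21000)

# Pearson r = covariance of x and y, scaled by both standard deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

r_builtin <- cor(x, y)  # default method is "pearson"

all.equal(r_manual, r_builtin)
```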

### Split the data into Train and Validation sets

In [None]:
1:100

In [None]:
set.seed(1)
sample(1:100,size=10)

In [None]:
cars_data[c(5,50),]

In [None]:
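## Split row numbers into 2 sets (70% train, 30% validation)
set.seed(1)
# floor() makes the truncation to a whole number of rows explicit
train_rows = sample(1:nrow(cars_data), size = floor(0.7*nrow(cars_data)))
validation_rows = setdiff(1:nrow(cars_data), train_rows)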

In [None]:
train_rows

In [None]:
validation_rows

In [None]:
## Subset into Train and Validation sets
train_data <- cars_data[train_rows,]
validation_data <- cars_data[validation_rows,]

In [None]:
## View the dimensions of the data
dim(cars_data)
dim(train_data)
dim(validation_data)
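
A quick sanity check on any split is that the two index sets do not overlap and together cover every row. The sketch below uses a made-up row count `n` in place of `nrow(cars_data)`, so it runs without the data file.

```r
n <- 100  # assumed row count, for illustration only

set.seed(1)
train_idx <- sample(seq_len(n), size = floor(0.7 * n))
valid_idx <- setdiff(seq_len(n), train_idx)

# No row appears in both sets
no_overlap <- length(intersect(train_idx, valid_idx)) == 0
# Every row appears in exactly one of the two sets
covers_all <- setequal(union(train_idx, valid_idx), seq_len(n))

c(no_overlap = no_overlap, covers_all = covers_all)
```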

### Build A Simple Linear Regression Model

In [None]:
names(train_data)

In [None]:
# lm function is used to fit linear models
LinReg = lm(Price ~ Disp., data = train_data)
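
For a single predictor, the least-squares slope and intercept that `lm()` returns have a closed form. The sketch below computes them by hand on made-up values (`x` and `y` are not taken from `cars.csv`) and compares them with `coef()`.

```r
# Made-up values, for illustration only
x <- c(97, 120, 150, 181, 230)
y <- c(9500, 11200, 13100, 14900, 17800)

# Closed-form ordinary least squares estimates
slope <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
intercept <- mean(y) - slope * mean(x)

fit <- lm(y ~ x)

# Both approaches should agree up to floating-point error
all.equal(unname(coef(fit)), c(intercept, slope))
```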

### Interpret model results

In [None]:
## Summary of the linear model
summary(LinReg)

# Summary displays the following:
# Call - the function call used to fit the regression model.
# Residuals - a quick view of the distribution of the residuals, which have mean zero when the model includes an intercept.
# Coefficients - the estimated regression (beta) coefficients, their test statistics and their statistical significance. Predictors that are significantly associated with the outcome variable are marked with stars.
# Residual Standard Error (RSE)
# Multiple R-squared (generally referred to as R-squared or the coefficient of determination)
# F-statistic - tests the significance of the model as a whole

# The statistical hypotheses for the slope are as follows:
# Null Hypothesis (H0):        the slope coefficient is equal to zero
#                              (i.e., no linear relationship between x and y)
# Alternative Hypothesis (H1): the slope coefficient is not equal to zero
#                              (i.e., there is some linear relationship between x and y)

#### Try answering these questions (Interpreting model output):

Is the Slope significant?

Is the Model significant?

What is the predictive power of the model (R-squared)?

In our example, the p-values for both the intercept and the predictor variable are highly significant, so we reject the null hypothesis in favor of the alternative: there is a significant association between the predictor and the outcome variable.
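
The three questions can also be answered programmatically from the `summary()` object. This is a minimal sketch on a made-up data frame `toy`; with the real model, `fit` would be `LinReg`.

```r
# Made-up values, for illustration only
toy <- data.frame(
  x = c(97, 120, 150, 181, 230, 260),
  y = c(9800, 10900, 13900, 14200, 18500, 19100)
)
fit <- lm(y ~ x, data = toy)
s <- summary(fit)

# Is the slope significant? p-value of the t-test on the slope
slope_p <- coef(s)["x", "Pr(>|t|)"]

# Is the model significant? p-value of the overall F-test
f <- s$fstatistic
model_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

# Predictive power: R-squared = 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((toy$y - mean(toy$y))^2)
r2_manual <- 1 - rss / tss

# Residual standard error = sqrt(RSS / residual degrees of freedom)
rse_manual <- sqrt(rss / df.residual(fit))

c(slope_p = slope_p, model_p = model_p, r2 = r2_manual, rse = rse_manual)
```

With only one predictor, the F-test and the t-test on the slope give the same p-value, so the first two questions have the same answer.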

#### Regression line on top of data

In [None]:
plot(train_data$Disp.,train_data$Price)

abline(LinReg)

#### Regression Equation
PredictedPrice = 5213.12 + 51.71 * Disp.

In [None]:
head(train_data)

In [None]:
5213.12 + 51.71 * 181
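
The hand calculation above can also be done with `predict()`, which uses the unrounded coefficients. The sketch below runs on a made-up data frame `toy` with the same column names; with the real model, `fit` would be `LinReg`.

```r
# Made-up values, for illustration only
toy <- data.frame(
  Disp. = c(97, 120, 150, 181, 230),
  Price = c(9500, 11200, 13100, 14900, 17800)
)
fit <- lm(Price ~ Disp., data = toy)

# Plug the displacement into the fitted equation by hand
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["Disp."]] * 181

# Same prediction via predict(); newdata must use the predictor's column name
new_car <- data.frame(Disp. = 181)
by_predict <- predict(fit, newdata = new_car)

all.equal(unname(by_predict), by_hand)
```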

In [None]:
## Extracting residuals and fitted values
#  Target 
head(train_data$Price)
# To extract the predictions
head(as.numeric(LinReg$fitted.values))
# To extract the residuals:
head(as.numeric(LinReg$residuals))

In [None]:
# Residual = actual price - fitted price
14944 - 14571.823458939
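
The subtraction above is one residual. As a minimal sketch on made-up data (`toy` is not from `cars.csv`), the check below confirms that every residual is the actual value minus the fitted value, and that the residuals of a model with an intercept average to zero.

```r
# Made-up values, for illustration only
toy <- data.frame(
  Disp. = c(97, 120, 150, 181, 230, 260),
  Price = c(9800, 10900, 13900, 14200, 18500, 19100)
)
fit <- lm(Price ~ Disp., data = toy)

# Residual = actual - fitted, for every training row
manual_resid <- toy$Price - fitted(fit)
all.equal(unname(manual_resid), unname(residuals(fit)))

# With an intercept in the model, the residuals average to (numerically) zero
mean(residuals(fit))
```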

#### Now build simple linear regression models using each of the following attributes

1. HP

2. Weight

3. Mileage

Compare the R-squared of each model and select the best attribute for predicting the price of a car.


In [None]:
LinReg_HP = lm(Price ~ HP, data = train_data)
summary(LinReg_HP)

In [None]:
LinReg_Weight = lm(Price ~ Weight, data = train_data)
summary(LinReg_Weight)

In [None]:
LinReg_Mileage = lm(Price ~ Mileage, data = train_data)
summary(LinReg_Mileage)
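
Rather than reading R-squared off three separate summaries, the models can be fitted in a loop and compared side by side. This is a sketch on a made-up data frame `toy`; with the real data, `toy` would be `train_data`.

```r
# Made-up values, for illustration only
toy <- data.frame(
  Price   = c(8900, 9700, 11500, 13200, 15400, 18900, 21000, 24500),
  HP      = c(63, 81, 92, 103, 125, 142, 150, 165),
  Weight  = c(2100, 2350, 2600, 2750, 2900, 3200, 3400, 3700),
  Mileage = c(35, 33, 30, 27, 25, 22, 21, 19)
)

predictors <- c("HP", "Weight", "Mileage")

# Fit Price ~ <predictor> for each candidate and collect R-squared
r2 <- sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "Price"), data = toy)
  summary(fit)$r.squared
})

r2
best <- names(which.max(r2))  # predictor with the highest R-squared
best
```

R-squared on the training data is only one criterion; checking the chosen model on `validation_data` is a reasonable next step.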

In [None]:
LinReg_Mileage$fitted.values

In [None]:
LinReg_Mileage$residuals

In [None]:
#Residual analysis

#options(repr.plot.width = 10, repr.plot.height = 10)

par(mfrow=c(2,2))
plot(LinReg_Mileage,col='darkgreen')

In [None]:
#Residual outliers
residuals = LinReg_Mileage$residuals
outliers <- boxplot(residuals)$out

In [None]:
outliers
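
The values returned in `$out` follow the usual boxplot rule: a point is flagged when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge. The sketch below reproduces that rule by hand on a made-up vector `res` and compares it with `boxplot.stats()`, which applies the same rule without drawing a plot.

```r
# Made-up residuals, for illustration only
res <- c(-300, -120, -50, 0, 40, 90, 150, 2500)

# Lower and upper hinges (2nd and 4th of Tukey's five numbers)
hinges <- fivenum(res)[c(2, 4)]
spread <- diff(hinges)

# Flag points beyond 1.5 * spread from the hinges
lower_fence <- hinges[1] - 1.5 * spread
upper_fence <- hinges[2] + 1.5 * spread
manual_out <- res[res < lower_fence | res > upper_fence]

builtin_out <- boxplot.stats(res)$out

identical(manual_out, builtin_out)
```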