# Linear Regression model to predict the housing prices.

# Importing Dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_excel(r'C:\Users\visha\Desktop\Incinerator.xlsx')
data.head(5)

In [None]:
data.shape

# New Dataset

In [None]:
age = data['age']
price = data['price']

In [None]:
analysis = {'age': age,'price': price} 
slr = pd.DataFrame(analysis, columns= ['age','price'])
slr.head(5)

In [None]:
slr.shape

In [None]:
slr.describe()

# Simple Linear Regression

In [None]:
'''
The next step is to divide the data into "attributes" and "labels".
Attributes are the independent variables, while labels are the dependent variables whose values are to be predicted.
Here the attribute set consists of the "age" column (age of the house), and the label is the "price" column (selling price of the house).
The attributes are stored in the X variable. We use ":-1" as the column range
so that the attribute set contains every column except the last one, which is "price".
Similarly, the Y variable contains the labels. We specify 1 as the column index because "price" is at index 1.
'''

In [None]:
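X = slr.iloc[:, :-1].values  # all columns except the last, i.e. 'age', as a 2-D array
Y = slr.iloc[:, 1].values    # column at index 1, i.e. 'price', as a 1-D array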
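
The slicing described above can be checked on a tiny stand-in frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the housing data.

```python
import pandas as pd

# Hypothetical toy frame with the same two columns as `slr`
toy = pd.DataFrame({'age': [5, 12, 30], 'price': [120000, 95000, 60000]})

X_toy = toy.iloc[:, :-1].values  # every column except the last -> 2-D array
y_toy = toy.iloc[:, 1].values    # column at index 1 ('price') -> 1-D array

print(X_toy.shape)  # (3, 1)
print(y_toy.shape)  # (3,)
```

The attribute array stays two-dimensional, which is the shape scikit-learn estimators expect for their feature input, while the label array is one-dimensional.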

In [None]:
'''
Split this data into training and test sets.
The script below assigns two thirds of the data to the training set and one third to the test set (test_size = 1/3).
'''

In [None]:
from sklearn.model_selection import train_test_split
X_Train, X_Test, Y_Train, Y_Test = train_test_split(X, Y, test_size = 1/3, random_state = 0)
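
To make the effect of `test_size = 1/3` concrete, here is a minimal sketch on placeholder arrays. `X_demo` and `y_demo` are hypothetical and simply hold 30 consecutive integers.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 30 samples with a single feature
X_demo = np.arange(30).reshape(-1, 1)
y_demo = np.arange(30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=1/3, random_state=0
)

print(len(X_tr), len(X_te))  # 20 10
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.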

In [None]:
'''
Fitting simple linear regression to the training set (training the algorithm).
A linear regression model finds the values of the intercept and slope
that give the line that best fits the data.
'''

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_Train, Y_Train)

#To retrieve the intercept:
print(regressor.intercept_)

#For retrieving the slope:
print(regressor.coef_)
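
For a single feature, the least-squares slope and intercept have a closed form. The sketch below, on made-up numbers, compares that formula with what `LinearRegression` reports. The arrays `x` and `y` are hypothetical and unrelated to the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical sample: y falls roughly linearly as x grows
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 8.1, 5.9, 4.2, 2.0])

# Closed-form ordinary least squares for one feature
slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope * x.mean()

# The same fit through scikit-learn (features must be 2-D)
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(slope, model.coef_[0])
print(intercept, model.intercept_)
```

Both pairs of numbers should agree up to floating-point error, which shows that `coef_` and `intercept_` are the ordinary least-squares estimates.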

In [None]:
'''
This means that for every one-unit increase in the age of the house, the predicted price changes by about -561.83 (in the units of the price column, not in percent).
In other words, as the age of the house increases, the predicted selling price decreases by about 561.83 per unit of age.
'''
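
As a small worked example of reading the slope, the sketch below scales the quoted coefficient by a change in age. The helper `predicted_price_change` is hypothetical, and treating one unit of age as one year is an assumption.

```python
# Slope quoted in the text above, in price units per unit of age
# (assumption: age is measured in years)
slope = -561.83

def predicted_price_change(delta_age, slope=slope):
    """Change in the predicted price when age changes by delta_age."""
    return slope * delta_age

print(predicted_price_change(1))   # change for a house one unit older
print(predicted_price_change(10))  # change for a house ten units older
```

The intercept does not appear because it cancels out when two predictions are subtracted.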

In [None]:
'''
Now that we have trained our algorithm, it is time to make some predictions.
To do so, we will use our test data and see how accurately our algorithm predicts the selling price of the house.
'''

In [None]:
# Predicting the Test set result 

Y_Pred = regressor.predict(X_Test)


In [None]:
'''
Compare the actual output values for X_Test with the predicted values.
'''

In [None]:
df = pd.DataFrame({'Actual': Y_Test.flatten(), 'Predicted': Y_Pred.flatten()})
df

In [None]:
'''
Mean Absolute Error (MAE) is the mean of the absolute values of the errors.
Mean Squared Error (MSE) is the mean of the squared errors.
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors.
'''

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_Test, Y_Pred))  
print('Mean Squared Error:', metrics.mean_squared_error(Y_Test, Y_Pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_Test, Y_Pred)))
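
The three metrics can also be computed by hand, which makes their definitions explicit. The sketch below uses made-up `actual` and `predicted` arrays rather than the notebook's test set.

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 230.0])

errors = actual - predicted        # [-10, 10, -30]

mae = np.mean(np.abs(errors))      # mean of absolute errors: 50 / 3
mse = np.mean(errors ** 2)         # mean of squared errors: 1100 / 3
rmse = np.sqrt(mse)                # square root of the MSE

print('MAE:', mae)
print('MSE:', mse)
print('RMSE:', rmse)
```

Because the errors are squared before averaging, MSE and RMSE penalize the single large error (30) more heavily than MAE does.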

In [None]:
'''
You can see that the value of the root mean squared error is 47648.25,
which is roughly half of the mean selling price of all the houses, i.e. 96100.66,
and well above the 10% rule of thumb.
This means that age alone does not predict the selling price very accurately.
'''
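
As a quick arithmetic check, the ratio of the two figures quoted above can be computed directly. Both numbers are copied from the text; nothing is recomputed from the dataset.

```python
# Figures quoted in the text above
rmse = 47648.25
mean_price = 96100.66

ratio = rmse / mean_price
print(f'RMSE is {ratio:.1%} of the mean price')
```

A ratio close to one half means the typical prediction error is about half the size of an average selling price.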

In [None]:
# Visualising the Training set results

plt.scatter(X_Train, Y_Train, color = 'red')
plt.plot(X_Train, regressor.predict(X_Train), color = 'blue')
plt.title('Price vs Age  (Training Set)')
plt.xlabel('Age of the House')
plt.ylabel('Selling Price of the House')
plt.show()

# Visualising the Test set results

plt.scatter(X_Test, Y_Test, color = 'red')
plt.plot(X_Train, regressor.predict(X_Train), color = 'blue')  # same fitted line as above; only the scatter points change
plt.title('Price vs Age  (Test Set)')
plt.xlabel('Age of the House')
plt.ylabel('Selling Price of the House')
plt.show()

# Multiple Linear Regression

In [None]:
'''
Linear regression involving multiple independent variables is called "multiple linear regression".
The fitting steps are the same as for simple linear regression; the difference lies in the evaluation.
You can use it to find out which factors have the most influence on the predicted output
and how the different variables relate to each other.
'''

# New Dataset

In [None]:
price = data['price']
nbh = data['nbh']
land = data['land']
area = data['area']
baths = data['baths']
rooms = data['rooms']
age = data['age']

In [None]:
analysis = {'age': age,'rooms': rooms,'baths': baths,'area': area,'land': land,'nbh': nbh,'price': price} 
mlr = pd.DataFrame(analysis, columns= ['age','rooms','baths','area','land','nbh','price'])
mlr.head(5)

In [None]:
mlr.describe()

# Preparing the Data

In [None]:
'''
Use the column names to create the attribute set (X) and the label (y).
'''

In [None]:
X = mlr[['age','rooms','baths','area','land','nbh']]
y = mlr['price']

In [None]:
'''
Divide the data into training and test sets.
'''

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

In [None]:
'''
Training the Algorithm
'''

In [None]:
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
'''
In multiple linear regression,
the model has to find the optimal coefficients for all the attributes.
'''

In [None]:
coeff_mlr = pd.DataFrame(regressor.coef_, X.columns, columns=['Coefficient'])
coeff_mlr
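
To see how to read such a coefficient table, here is a minimal sketch on synthetic data, not the housing dataset. The column names, sample size, and "true" coefficients are made up for illustration. Because the target is built without noise, the fitted coefficients should match them closely.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical features
features = pd.DataFrame({
    'age': rng.uniform(0, 50, size=200),
    'rooms': rng.integers(2, 9, size=200).astype(float),
})

# Noise-free target: intercept 50000, -300 per unit of age, +8000 per room
target = 50000 - 300 * features['age'] + 8000 * features['rooms']

model = LinearRegression().fit(features, target)

# One row per feature, as in the notebook's coefficient table
coeffs = pd.DataFrame(model.coef_, features.columns, columns=['Coefficient'])
print(coeffs)
print('Intercept:', model.intercept_)
```

Each coefficient is the change in the prediction for a one-unit increase in that feature while the other features are held fixed.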

In [None]:
'''
This means that for a unit increase in "age", with the other features held constant,
the predicted selling price of a house decreases by about 260.82 (in price units, not in percent).
Similarly, a unit increase in "rooms", "baths", or "area" results
in an increase in the predicted selling price of a house.
We can see that "land" has very little effect on the selling price of the house.
We can see that "nbh" (neighborhood) also has very little effect on the selling price of the house.
'''

In [None]:
'''
Make predictions on the test data
'''

In [None]:
y_pred = regressor.predict(X_test)

In [None]:
'''
To compare the actual output values for X_test with the predicted values
'''

In [None]:
df1 = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df1

In [None]:
'''
Evaluate the performance of algorithm
'''

In [None]:
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

In [None]:
'''
Compare the root mean squared error printed above with the mean selling price reported by mlr.describe(). As a rule of thumb, an RMSE greater than about 10% of the mean indicates that the algorithm is not very accurate, although it may still make reasonably good predictions.
'''

In [None]:
'''
Several factors may contribute to inaccuracy, a few of which are listed here:

Need more data: A limited amount of data may not be enough, whereas more observations could improve the accuracy.
Bad assumptions: We assumed that this data has a linear relationship, but that might not be the case. Visualizing the data may help you determine that.
Poor features: The features we used may not be correlated strongly enough with the values we are trying to predict.
'''
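
One way to probe the "poor features" point is to look at how strongly each feature correlates with the target. The sketch below uses synthetic data with hypothetical column names. On the real data, the same `corr()` call could be applied to `mlr`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: price depends on age but not on the 'unrelated' column
age = rng.uniform(0, 50, size=n)
unrelated = rng.normal(size=n)
price = 100000 - 500 * age + rng.normal(scale=2000, size=n)

demo = pd.DataFrame({'age': age, 'unrelated': unrelated, 'price': price})

# Pearson correlation of every feature with the target
corr_with_price = demo.corr()['price'].drop('price')
print(corr_with_price)
```

A correlation near zero suggests a feature adds little to a linear model. Pearson correlation measures only linear association, so a scatter plot remains a useful complement.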