#### Ridge and Lasso Regression

## The objective of this project is to apply regression techniques to identify which features most affect home prices in Melbourne.

Suburb: Suburb

Address: Address

Rooms: Number of rooms

Price: Price in Australian dollars

Method:
S - property sold;
SP - property sold prior;
PI - property passed in;
PN - sold prior not disclosed;
SN - sold not disclosed;
NB - no bid;
VB - vendor bid;
W - withdrawn prior to auction;
SA - sold after auction;
SS - sold after auction, price not disclosed;
N/A - price or highest bid not available.

Type:
br - bedroom(s);
h - house, cottage, villa, semi, terrace;
u - unit, duplex;
t - townhouse;
dev site - development site;
o res - other residential.

SellerG: Real Estate Agent

Date: Date sold

Distance: Distance from CBD in Kilometres

Regionname: General Region (West, North West, North, North east …etc)

Propertycount: Number of properties that exist in the suburb.

Bedroom2: Scraped number of bedrooms (from a different source)

Bathroom: Number of Bathrooms

Car: Number of carspots

Landsize: Land size in square metres

BuildingArea: Building size in square metres

YearBuilt: Year the house was built

CouncilArea: Governing council for the area

Lattitude: Latitude of the property (column name spelled as in the dataset)

Longtitude: Longitude of the property (column name spelled as in the dataset)

In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [5]:
#### load the data:
dataset=pd.read_csv('Melbourne_housing_data.csv')
dataset.head()

,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,68 Studley St,2,h,,SS,Jellis,3/09/2016,2.5,3067.0,...,1.0,1.0,126.0,,,Yarra City Council,-37.8014,144.9958,Northern Metropolitan,4019.0
1,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra City Council,-37.7996,144.9984,Northern Metropolitan,4019.0
2,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra City Council,-37.8079,144.9934,Northern Metropolitan,4019.0
3,Abbotsford,18/659 Victoria St,3,u,,VB,Rounds,4/02/2016,2.5,3067.0,...,2.0,1.0,0.0,,,Yarra City Council,-37.8114,145.0116,Northern Metropolitan,4019.0
4,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra City Council,-37.8093,144.9944,Northern Metropolitan,4019.0


In [6]:
dataset.shape

(34857, 21)

In [7]:
dataset.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [8]:
variables_to_use=['Suburb','Rooms','Type','Price','Method','SellerG','Distance','Bedroom2','Bathroom','Car',
                  'Landsize','BuildingArea','YearBuilt','CouncilArea','Regionname','Propertycount']
dataset=dataset[variables_to_use]
dataset.shape

(34857, 16)

In [9]:
dataset.isna().sum()

Suburb               0
Rooms                0
Type                 0
Price             7610
Method               0
SellerG              0
Distance             1
Bedroom2          8217
Bathroom          8226
Car               8728
Landsize         11810
BuildingArea     21115
YearBuilt        19306
CouncilArea          3
Regionname           3
Propertycount        3
dtype: int64

In [10]:
#### fill missing values in the numeric columns with the column mean:
dataset['Bedroom2']=dataset['Bedroom2'].fillna(dataset.Bedroom2.mean())
dataset['Bathroom']=dataset['Bathroom'].fillna(dataset.Bathroom.mean())
dataset['Car']=dataset['Car'].fillna(dataset.Car.mean())
dataset['Landsize']=dataset['Landsize'].fillna(dataset.Landsize.mean())
dataset['BuildingArea']=dataset['BuildingArea'].fillna(dataset.BuildingArea.mean())
dataset['YearBuilt']=dataset['YearBuilt'].fillna(dataset.YearBuilt.mean())


In [11]:
#### drop the remaining rows with missing values (mostly rows with a missing Price):
dataset.dropna(inplace=True)

In [12]:
dataset.shape

(27244, 16)

In [13]:
dataset.isnull().sum()

Suburb           0
Rooms            0
Type             0
Price            0
Method           0
SellerG          0
Distance         0
Bedroom2         0
Bathroom         0
Car              0
Landsize         0
BuildingArea     0
YearBuilt        0
CouncilArea      0
Regionname       0
Propertycount    0
dtype: int64

In [14]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27244 entries, 1 to 34856
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         27244 non-null  object 
 1   Rooms          27244 non-null  int64  
 2   Type           27244 non-null  object 
 3   Price          27244 non-null  float64
 4   Method         27244 non-null  object 
 5   SellerG        27244 non-null  object 
 6   Distance       27244 non-null  float64
 7   Bedroom2       27244 non-null  float64
 8   Bathroom       27244 non-null  float64
 9   Car            27244 non-null  float64
 10  Landsize       27244 non-null  float64
 11  BuildingArea   27244 non-null  float64
 12  YearBuilt      27244 non-null  float64
 13  CouncilArea    27244 non-null  object 
 14  Regionname     27244 non-null  object 
 15  Propertycount  27244 non-null  float64
dtypes: float64(9), int64(1), object(6)
memory usage: 3.5+ MB


In [15]:
#### encode the categorical variables:
dataset=pd.get_dummies(dataset,drop_first=True,dtype=int)

In [16]:
dataset.head(2)

,Rooms,Price,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Propertycount,...,CouncilArea_Wyndham City Council,CouncilArea_Yarra City Council,CouncilArea_Yarra Ranges Shire Council,Regionname_Eastern Victoria,Regionname_Northern Metropolitan,Regionname_Northern Victoria,Regionname_South-Eastern Metropolitan,Regionname_Southern Metropolitan,Regionname_Western Metropolitan,Regionname_Western Victoria
1,2,1480000.0,2.5,2.0,1.0,1.0,202.0,160.2564,1965.289885,4019.0,...,0,1,0,0,1,0,0,0,0,0
2,2,1035000.0,2.5,2.0,1.0,0.0,156.0,79.0,1900.0,4019.0,...,0,1,0,0,1,0,0,0,0,0


In [17]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 27244 entries, 1 to 34856
Columns: 746 entries, Rooms to Regionname_Western Victoria
dtypes: float64(9), int32(736), int64(1)
memory usage: 78.8 MB
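
The jump from 16 to 746 columns comes from one-hot encoding: every category of every text column becomes its own 0/1 column, and `drop_first=True` drops one category per column to serve as the baseline. A minimal sketch on a made-up three-row frame (the `toy` data is hypothetical, not taken from the Melbourne file) shows the effect:

```python
import pandas as pd

# Hypothetical toy frame: one numeric column and one categorical column.
toy = pd.DataFrame({"Rooms": [2, 3, 3], "Type": ["h", "u", "t"]})

# Same call as in the notebook. Categories are sorted (h, t, u) and the
# first one (h) is dropped, so a row with Type_t = 0 and Type_u = 0 is a house.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Numeric columns pass through unchanged, so only high-cardinality text columns such as `Suburb` and `SellerG` inflate the column count.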


In [18]:
#### create the feature set and the dependent variable:
X=dataset.drop('Price',axis=1)
y=dataset['Price']

In [19]:
#### split into training (70%) and test (30%) sets:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=2)

In [20]:
from sklearn.linear_model import LinearRegression
reg=LinearRegression()

In [21]:
#### building the model:
reg.fit(X_train,y_train)

In [22]:
#### R² score on the training set:
reg.score(X_train,y_train)

0.6873202423834317

In [23]:
reg.score(X_test,y_test)

0.2224442825147307

In [24]:
# The training R² (0.69) is far higher than the test R² (0.22), so the model is overfitting.

In [25]:
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=60) # alpha >= 0 is the regularization strength; larger values shrink the coefficients more
ridge_reg.fit(X_train,y_train)

In [26]:
ridge_reg.score(X_train,y_train)

0.6656227459635936

In [27]:
ridge_reg.score(X_test,y_test)

0.670179274229463

### After one-hot encoding there are 745 features, so the plain linear model overfits and suffers from
### multicollinearity, which drags the test score down. The regularized Ridge model controls both problems
### and raises the test R² from 0.22 to 0.67.
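
To illustrate why the penalty helps with multicollinearity, here is a minimal sketch on synthetic data (the variables `x1`, `x2` and the chosen `alpha=1.0` are assumptions for the demo, not values from this project). Two nearly identical features make the ordinary least-squares coefficients unstable, while Ridge pulls them toward a smaller, more stable solution:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Two almost perfectly collinear features; only x1 drives the target.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(scale=0.5, size=n)

ols = LinearRegression().fit(X_demo, y_demo)
ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)

# The L2 penalty shrinks the coefficient vector relative to plain least squares.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficients:  ", ols.coef_, "norm:", ols_norm)
print("Ridge coefficients:", ridge.coef_, "norm:", ridge_norm)
```

The penalty depends on the scale of each feature, so the right `alpha` on the housing data would normally be chosen by validation rather than fixed in advance.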

In [29]:
from sklearn.linear_model import Lasso

In [30]:
lasso_reg = Lasso(alpha=50)

In [31]:
lasso_reg.fit(X_train,y_train)

In [32]:
lasso_reg.score(X_train,y_train)

0.681318942709281

In [33]:
lasso_reg.score(X_test,y_test)

0.6766485030258043
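
Unlike Ridge, Lasso can set coefficients exactly to zero, which ties back to the stated goal of finding the features that matter most. The sketch below uses synthetic data (the columns `x0` to `x9`, the `alpha=0.1` value, and the `_demo` names are hypothetical) to show one way of listing the surviving features; the same pattern could be applied to `lasso_reg.coef_` and `X_train.columns`:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n = 500

# Ten features on the same scale; only x0 and x1 influence the target.
cols = [f"x{i}" for i in range(10)]
X_demo = pd.DataFrame(rng.normal(size=(n, 10)), columns=cols)
y_demo = 3.0 * X_demo["x0"] - 2.0 * X_demo["x1"] + rng.normal(scale=0.5, size=n)

lasso_demo = Lasso(alpha=0.1).fit(X_demo, y_demo)

# Pair each coefficient with its column name and keep the nonzero ones.
coefs = pd.Series(lasso_demo.coef_, index=X_demo.columns)
nonzero = coefs[coefs != 0]
ranked = nonzero.abs().sort_values(ascending=False)
top_features = list(ranked.index[:2])

print(f"{len(nonzero)} of {len(coefs)} coefficients are nonzero")
print(ranked.head())
```

Ranking by coefficient magnitude is only meaningful when the features are on comparable scales; on the housing data, where `Landsize` and the 0/1 dummy columns differ by orders of magnitude, the features would need to be standardized first for such a ranking to be fair.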