# Information about the dataset

Rooms: Number of rooms

Price: Price in dollars

Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

SellerG: Real Estate Agent

Date: Date sold

Distance: Distance from CBD

Regionname: General Region (West, North West, North, North East, etc.)

Propertycount: Number of properties that exist in the suburb.

Bedroom2 : Scraped # of Bedrooms (from different source)

Bathroom: Number of Bathrooms

Car: Number of carspots

Landsize: Land Size

BuildingArea: Building Size

CouncilArea: Governing council for the area

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

1. Importing the necessary libraries

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Data Visualisation

In [None]:
df=pd.read_csv('/kaggle/input/melbourne-housing-snapshot/melb_data.csv')
df.head()

# Univariate distribution analysis of the sale price

Since the main target is the price, we first check the distribution of the price data

In [None]:
plt.title('Distribution of Price Data')
# histplot replaces the deprecated distplot (seaborn >= 0.11)
sns.histplot(df['Price'],kde=True)


Checking the skewness of the distribution:

In [None]:
print("Skewness: %f" % df['Price'].skew())
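
As a rough check on what `skew()` reports, the sketch below recomputes the sample skewness by hand on a small made-up series, assuming pandas returns the bias-adjusted Fisher-Pearson coefficient. The values are hypothetical and are not taken from melb_data.csv.

```python
import numpy as np
import pandas as pd

# Toy right-skewed sample (hypothetical values, not from the dataset).
prices = pd.Series([300_000, 450_000, 500_000, 650_000, 800_000, 2_500_000], dtype=float)

n = len(prices)
dev = prices - prices.mean()
m2 = (dev ** 2).mean()  # second central moment
m3 = (dev ** 3).mean()  # third central moment

# Adjusted Fisher-Pearson coefficient of skewness.
manual_skew = np.sqrt(n * (n - 1)) / (n - 2) * m3 / m2 ** 1.5

print(f"manual: {manual_skew:.4f}, pandas: {prices.skew():.4f}")
```

A positive value means the right tail is longer, which is what a few very expensive properties would produce.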

Price is positively skewed, so a log transform is applied to reduce the skewness

In [None]:
PLog = np.log(df['Price'])
PLog.skew()
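
To illustrate why a log transform helps with right-skewed data, here is a minimal sketch on synthetic log-normal values. The distribution parameters are assumptions chosen for illustration only; the real Price column may behave differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic log-normal "prices" (an assumption, not the real data).
prices = pd.Series(rng.lognormal(mean=13.5, sigma=0.5, size=5_000))

raw_skew = prices.skew()
log_skew = np.log(prices).skew()

print(f"raw skew: {raw_skew:.2f}, log skew: {log_skew:.2f}")
```

If the column could contain zeros, `np.log1p` would be a safer choice than `np.log`, since `np.log(0)` is negative infinity.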

Plotting the log-transformed price

In [None]:
target=PLog
plt.title('Distribution of the log-transformed price')
sns.histplot(target,kde=True)

Correlation analysis of the data

In [None]:
plt.figure(figsize=(20,10))
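# numeric_only=True (pandas >= 1.5) restricts the correlation to numeric columns
sns.heatmap(df.corr(numeric_only=True),annot=True)

Finding the columns that are most correlated with our target, i.e. Price

In [None]:
p_corr=df.corr(numeric_only=True)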
p_corr['Price'].sort_values(ascending=False)
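
Sorting by the signed correlation puts strong negative correlations at the bottom of the list. A possible alternative, sketched here on a hypothetical toy frame, is to rank by absolute value so that strongly negative features are not overlooked.

```python
import pandas as pd

# Toy frame with one text column (hypothetical values).
toy = pd.DataFrame({
    "Price": [500_000, 750_000, 900_000, 1_200_000, 1_500_000],
    "Rooms": [2, 3, 3, 4, 5],
    "Distance": [15.0, 10.0, 8.0, 5.0, 2.5],
    "Suburb": ["A", "B", "C", "D", "E"],
})

# numeric_only=True skips the text column instead of raising an error.
price_corr = toy.corr(numeric_only=True)["Price"].drop("Price")

# Rank by strength of the relationship, keeping the sign for interpretation.
ranked = price_corr.sort_values(key=abs, ascending=False)
print(ranked)
```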

# Treatment of the missing values

First, let us check the number of missing values in each column

In [None]:
miss=df.isnull().sum()
miss.sort_values(ascending=False)

Visualising the missing values through bar plots

In [None]:
sns.set(font_scale=1)
plt.figure(figsize=(10,10))
miss.plot.barh(title='Missing Values')


Calculating the percentage of missing data

In [None]:
percent_missing=df.isnull().mean()*100
percent_missing.sort_values(ascending=False)

Thus the variables Car, CouncilArea, YearBuilt and BuildingArea have missing values. Even though BuildingArea and YearBuilt are only weakly correlated with the price and have the largest percentages of missing data, they are not dropped because of their importance (customers need to know the size of the building and how old it is, respectively).
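
The count and the percentage of missing values can also be viewed side by side. The following is a minimal sketch on a made-up frame; the column names mirror the dataset but the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns (hypothetical values).
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0, 1.0],
    "YearBuilt": [1990.0, np.nan, np.nan, 2005.0],
    "Price": [500_000, 650_000, 700_000, 900_000],
})

summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
})

# Keep only the columns that actually have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("percent", ascending=False)
print(summary)
```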

1. Dealing with Year Built data and checking for outliers

Drawing a kdeplot and boxplot

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(2,2,1)
plt.title('Distribution of Year Built data')
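# fill=True replaces the deprecated shade=True (seaborn >= 0.11)
sns.kdeplot(df['YearBuilt'],fill=True,color='red')
plt.subplot(2,2,2)
# newer seaborn versions require the column to be passed by keyword
sns.boxplot(x='YearBuilt',data=df,color='green')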

In [None]:
df['YearBuilt'].describe()

Since there is an outlier around 1200 and most values lie in the middle of the distribution, the missing values are replaced with the median, which is robust to outliers

In [None]:
# assign the result back instead of using inplace=True on a column
df['YearBuilt']=df['YearBuilt'].fillna(df['YearBuilt'].median())
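
The choice of the median over the mean can be illustrated with a small sketch. The series below is hypothetical; its lowest entry stands in for the outlier near 1200.

```python
import numpy as np
import pandas as pd

# Hypothetical construction years with one implausible outlier and two gaps.
years = pd.Series([1196.0, 1960.0, 1970.0, 1985.0, 2000.0, np.nan, np.nan])

mean_year = years.mean()      # pulled down by the outlier
median_year = years.median()  # barely affected by it

filled = years.fillna(median_year)
print(f"mean: {mean_year:.1f}, median: {median_year:.1f}")
```

Here a single extreme value drags the mean far below the typical construction year, while the median stays among the bulk of the data.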

2. Dealing with Building Area data

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(2,2,1)
plt.title('Distribution of Building Area data')
sns.kdeplot(df['BuildingArea'],fill=True,color='red')
plt.subplot(2,2,2)
sns.boxplot(x='BuildingArea',data=df,color='blue')

In [None]:
df['BuildingArea'].describe()

The plot shows that the data has a lot of outliers, hence replacing the missing values with the mode.

In [None]:
# mode() returns a Series, so take its first entry as the fill value
df['BuildingArea']=df['BuildingArea'].fillna(df['BuildingArea'].mode()[0])
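
One detail worth noting when filling with the mode: `Series.mode()` returns a Series (there can be ties), not a scalar, so a single value has to be picked from it. A minimal sketch on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical building areas with two gaps.
area = pd.Series([120.0, 120.0, 95.0, 300.0, np.nan, np.nan])

modes = area.mode()         # a Series, even when there is only one mode
fill_value = modes.iloc[0]  # pick the first (smallest) mode as a scalar

filled = area.fillna(fill_value)
print(filled.tolist())
```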

3. Dealing with Car data

In [None]:
plt.figure(figsize=(20,10))
plt.subplot(2,2,1)
plt.title('Distribution of Car data')
sns.histplot(df['Car'],kde=False)
plt.subplot(2,2,2)
sns.boxplot(x='Car',data=df,color='blue')

In [None]:
df['Car'].describe()

From the distribution, we can see that the Car data is discrete rather than continuous. For the missing values, we assume that there are no car spots and fill them with zero

In [None]:
df['Car']=df['Car'].fillna(0)

4. Dealing with Council Area

In [None]:
plt.figure(figsize=(10,10))
ca_count=sns.countplot(x=df['CouncilArea'])
ca_count.set_xticklabels(ca_count.get_xticklabels(),rotation=90);

* Replacing the missing values of CouncilArea with 'Unavailable'

In [None]:
df['CouncilArea']=df['CouncilArea'].fillna('Unavailable')

# Analysis of the data

Distribution of the data

In [None]:
df.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);


From the distributions we can see that Car, Bathroom, Bedroom2 and Rooms are not continuous but discrete, hence converting them into categorical variables

In [None]:
features_to_convert=['Car','Bedroom2','Bathroom','Rooms']
for i in features_to_convert:
    df[i]=df[i].astype(object)

Separating the numeric variables and categoric variables

In [None]:
categoric=df.select_dtypes(include='object')
numeric=df.select_dtypes(exclude=['object'])
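
The effect of the conversion on the dtype-based split can be seen in a small sketch. The frame below is hypothetical; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Toy frame where every column starts out numeric (hypothetical values).
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Car": [1.0, 2.0, 0.0],
    "Price": [500_000.0, 700_000.0, 900_000.0],
})

# Casting the discrete counts to object moves them to the categorical side.
for col in ["Rooms", "Car"]:
    toy[col] = toy[col].astype(object)

categoric_cols = toy.select_dtypes(include="object").columns.tolist()
numeric_cols = toy.select_dtypes(exclude="object").columns.tolist()
print(categoric_cols, numeric_cols)
```

One consequence is that the converted columns no longer appear in numeric summaries such as `corr()`.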


In [None]:
numeric.info()

Relationship between distance from CBD and target

In [None]:
sns.scatterplot(x=df['Distance'],y=target)

It can be seen that the house price tends to be lower the farther the property is from the CBD
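
A scatter plot can hide the trend when many points overlap. One possible complement, sketched here on hypothetical values with arbitrarily chosen band edges, is to compare the median log price per distance band.

```python
import numpy as np
import pandas as pd

# Hypothetical distances (in the dataset's distance unit) and prices.
toy = pd.DataFrame({
    "Distance": [1.5, 3.0, 4.5, 7.0, 9.5, 12.0, 18.0, 25.0, 30.0],
    "Price": [1_600_000, 1_400_000, 1_300_000, 1_100_000, 950_000,
              900_000, 700_000, 600_000, 550_000],
})
toy["log_price"] = np.log(toy["Price"])

# Band edges are an assumption made for this illustration.
bands = pd.cut(toy["Distance"], bins=[0, 5, 10, 20, 50])
median_by_band = toy.groupby(bands, observed=True)["log_price"].median()
print(median_by_band)
```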

Analysis of the numerical variables that are positively correlated with the sale price, i.e. 'Longtitude', 'Postcode', 'BuildingArea', 'Landsize'

In [None]:
sns.scatterplot(x=df['Postcode'],y=target)

The price is higher for the houses in the postcode area 3000-3200

In [None]:
sns.scatterplot(x=df['Landsize'],y=target)
plt.xlim(-1,100)

The plot shows no clear trend of price with respect to the land size, hence we use another variable, the type of the house, to study the relationship between land size and price

Examining the relationship between price and land size with the help of the type of the house

In [None]:
sns.scatterplot(x=df['Landsize'],y=target,hue=df['Type'])
plt.xlim(-1,100000)

We can see that the price is higher for the type h properties with land area, i.e. houses, cottages, villas, semis and terraces.

In [None]:
sns.scatterplot(x=df['BuildingArea'],y=target)
plt.xlim(0,1000)

The (log) price increases roughly linearly with the Building Area

In [None]:
sns.scatterplot(x=df['Longtitude'],y=target)


The price is higher for the houses at longitudes between 145 and 145.2

Analysis of the categoric variables with the target variable


In [None]:
categoric.info()

In [None]:
house_features=df[['Rooms','Bedroom2','Bathroom','Car']]
plt.figure(figsize=(20,10))
n=1
for i in house_features:
    plt.subplot(2,2,n)
    # order the categories by median price and use the same order in both plots
    x=df.groupby([house_features[i]])['Price'].median().sort_values()
    ax=sns.boxplot(x=house_features[i],y='Price',data=df,order=list(x.index),palette='Blues')
    ax=sns.stripplot(x=house_features[i],y='Price',data=df,order=list(x.index),color='red',size=1.5)
    plt.xlabel(i)
    n+=1

Based on the plots of these features,
1. The price increases with the number of rooms
2. Similarly, the price increases with the number of bedrooms and bathrooms
3. There is very little change in the price with respect to the number of car spots
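
Ordering the boxes by median price makes trends like these easier to read. The ordering step on its own can be sketched as follows, using hypothetical values in which the median order differs from the numeric order.

```python
import pandas as pd

# Hypothetical car-spot counts and prices.
toy = pd.DataFrame({
    "Car": [0, 0, 1, 1, 2, 2],
    "Price": [650_000, 750_000, 550_000, 650_000, 800_000, 900_000],
})

# Median price per category, sorted from cheapest to most expensive.
medians = toy.groupby("Car")["Price"].median().sort_values()
order = medians.index.tolist()
print(order)
```

The resulting list can be passed as the `order` argument of seaborn's categorical plots, so that every layer drawn on the same axes uses the same category positions.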

There are too many distinct dates to analyse individually, hence converting the dates into datetime objects and extracting the year of sale

In [None]:
plt.figure(figsize=(10,10))
# dayfirst=True assumes the dates are written as day/month/year
df['Date']=pd.to_datetime(df['Date'],dayfirst=True)
df['year']=df['Date'].dt.year
sns.boxenplot(x='year',y=target,data=df)

The price does not differ much with the year in which the house was sold
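
Date parsing deserves some care, because a string such as 3/12/2016 is ambiguous. The sketch below assumes the Date column is written as day/month/year and passes an explicit format; the sample strings are made up for illustration.

```python
import pandas as pd

# Hypothetical date strings in day/month/year order.
raw = pd.Series(["3/12/2016", "4/02/2016", "17/09/2017"])

# An explicit format avoids any month/day guessing.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

months = parsed.dt.month.tolist()
years = parsed.dt.year.tolist()
print(months, years)
```

With the month and day swapped, the year would still come out right, but any month-level or seasonal analysis would be affected.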

Relationship between Type, Method, Regionname with price

In [None]:
sns.violinplot(x='Type',y=target,data=df)


In [None]:
r_plot=sns.boxplot(x='Regionname',y=target,data=df)
r_plot.set_xticklabels(r_plot.get_xticklabels(),rotation=90);

In [None]:
sns.boxplot(x='Method',y=target,data=df)

* The price is highest for type h, i.e. house, cottage, villa, semi, terrace, followed by type t and type u
* The houses in the Southern Metropolitan region have higher prices compared to other regions, while the houses in the Western Victoria region have comparatively lower prices.
* The price is lower for the properties that were sold prior (SP).

Analysis of Council Area

In [None]:
plt.figure(figsize=(10,10))
sns.boxplot(x=target,y='CouncilArea',data=df)

From the plot we can see that houses under the Boroondara and Bayside councils are quite expensive compared to the others, while houses under Wyndham are the cheapest.

Analysis of the variables with many categories

Some categorical variables have many distinct values, especially Suburb, SellerG and CouncilArea. Since it is difficult to plot all of them, only the most expensive and the least expensive values are plotted

Analysis of suburbs with the costliest price and cheapest price

In [None]:
sns.set()
plt.figure(figsize=(20,10))
plt.subplot(2,2,1)
plt.title('Costliest suburbs')
df.groupby(["Suburb"])['Price'].median().sort_values(ascending=False)[:10].plot.bar()
plt.subplot(2,2,2)
plt.title('Cheapest suburbs')
df.groupby(["Suburb"])['Price'].median().sort_values(ascending=True)[:10].plot.bar()


Kooyong is the suburb with the most expensive houses by median price, while Bacchus Marsh is the suburb with the cheapest houses
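
A median based on only one or two sales can be misleading, so a ranking like this may be easier to trust when suburbs with very few sales are set aside first. A minimal sketch on hypothetical data, with an arbitrary threshold of two sales:

```python
import pandas as pd

# Hypothetical sales; suburb D has a single, very expensive sale.
toy = pd.DataFrame({
    "Suburb": ["A", "A", "B", "B", "C", "C", "D"],
    "Price": [900_000, 1_100_000, 400_000, 500_000, 700_000, 750_000, 2_000_000],
})

stats = toy.groupby("Suburb")["Price"].agg(["median", "count"])

# The minimum number of sales is an assumption chosen for this example.
reliable = stats[stats["count"] >= 2]

costliest = reliable["median"].nlargest(2)
cheapest = reliable["median"].nsmallest(2)
print(costliest, cheapest, sep="\n")
```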

Analysis of SellerG with the costliest price and cheapest price

In [None]:
sns.set()
plt.figure(figsize=(20,10))
plt.subplot(2,2,1)
plt.title('Expensive sellers')
df.groupby(["SellerG"])['Price'].median().sort_values(ascending=False)[:10].plot.bar()
plt.subplot(2,2,2)
plt.title('Cheaper sellers')
df.groupby(["SellerG"])['Price'].median().sort_values(ascending=True)[:10].plot.bar()


Weast is the most expensive seller, while hockingstuart and Advantage are the cheaper sellers

# Conclusion

* The price of the house tends to decrease as the distance from the CBD increases.
* The price is higher for the houses in the postcode area 3000-3200.
* The price is higher for the type h properties with land area, i.e. houses, cottages, villas, semis and terraces.
* The price of the house increases with the building area (roughly linear relationship).
* The price increases with the number of rooms, bedrooms and bathrooms.
* The price is highest for type h, i.e. house, cottage, villa, semi, terrace, followed by type t and type u.
* The houses in the Southern Metropolitan region have higher prices compared to other regions, while the houses in the Western Victoria region have comparatively lower prices.
* The price is lower for the properties that were sold prior (SP).
* Kooyong and Bacchus Marsh are the suburbs with the most expensive and the cheapest houses, respectively.
* Weast is the most expensive seller, while hockingstuart and Advantage are the cheaper sellers.