**The dataset used in this project is already fairly clean. Apart from duplicates, missing values, and outliers, the features contain no structural errors. During preprocessing, the features are given proper names and data types; during cleaning, duplicates are removed, missing values are filled, and outliers are handled.**

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.preprocessing import OneHotEncoder, QuantileTransformer
from sklearn.model_selection import train_test_split

import tensorflow as tf

In [None]:
# setting some styles and options
sns.set_style("whitegrid")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [None]:
#  Importing the dataset
df = pd.read_csv("/kaggle/input/sydney-house-prices/SydneyHousePrices.csv")

In [None]:
df.head()

In [None]:
# dropping Id column as it isn't needed anymore
df = df.drop('Id', axis='columns')

In [None]:
# Moving the target feature 'sellPrice' to the rightmost column of the dataframe
target_feature = df.pop('sellPrice')
df.insert(len(df.columns), 'sellPrice', target_feature)


df.head()

In [None]:
df.shape

In [None]:
df.info()

# Handling Duplicates

In [None]:
# checking for duplicates
df[df.duplicated()].shape

<h3>The dataset has several duplicates!</h3>
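
As a minimal sketch of what the duplicate check and removal do, the example below uses a made-up four-row frame (the name `toy` and its values are hypothetical, not taken from the Sydney dataset). It assumes that "duplicate" means a row identical to an earlier row in every column, which is the default behaviour of `duplicated()`.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the Sydney dataset
toy = pd.DataFrame({
    "suburb": ["Avalon", "Avalon", "Bondi", "Avalon"],
    "sellPrice": [1_200_000, 1_200_000, 950_000, 1_300_000],
})

# duplicated() marks every repeat of an earlier row; the first occurrence stays False
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence; reset_index gives a gap-free index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduped))
```

Only the second row counts as a duplicate here, because the fourth row differs in price.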

In [None]:
# remove duplicates and reset index
df = df.drop_duplicates().reset_index(drop=True)

In [None]:
# checking again for duplicates
df[df.duplicated()].shape

In [None]:
# total entries in dataset after removing duplicates
df.shape

# Re-assigning names and dtypes

In [None]:
list(df.columns)

In [None]:
# Giving features their proper names
df = df.rename(columns={'bed': 'bedrooms', 'bath': 'bathrooms', 'car': 'carParkingSpace', 'propType': 'propertyType'})

In [None]:
list(df.columns)

In [None]:
# converting date feature to datetime64
df['Date'] = pd.to_datetime(df['Date'])

# converting categorical features to category
df['suburb'] = df['suburb'].astype('category')
df['propertyType'] = df['propertyType'].astype('category')
df['postalCode'] = df['postalCode'].astype('category')

In [None]:
df.info()

# Handling missing values

In [None]:
import missingno as msno
msno.matrix(df)

In [None]:
(df.isnull().sum() / df.shape[0]) * 100 # converting missing values into percentage

In [None]:
# Missing values in the 'bedrooms' feature
df[pd.isna(df['bedrooms'])].shape

In [None]:
# Missing values in the 'carParkingSpace' feature
df[pd.isna(df['carParkingSpace'])].shape

* **'carParkingSpace' has 18,151 missing values, about 9.1% of all rows.**

* **'bedrooms' has 154 missing values.**
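
The percentages above can also be obtained in one step, because the mean of a boolean mask is the share of `True` values. A small sketch on made-up data (the frame `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with a few gaps
toy = pd.DataFrame({
    "bedrooms": [3, None, 4, 2],
    "carParkingSpace": [1, None, None, 2],
})

# isna() gives a boolean mask; its column-wise mean is the fraction of missing values
missing_pct = toy.isna().mean() * 100
missing_count = toy.isna().sum()

print(missing_pct)
```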

## Fixing 'bedrooms' feature null values

In [None]:
plt.figure(figsize=(13, 4))
sns.boxplot(data=df, x='bedrooms')
plt.show()

**Since the 'bedrooms' feature is skewed and has extreme values (outliers), the median is a better choice than the mean for replacing the null values.**
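
To illustrate why the median is the safer fill value for a skewed feature, here is a small sketch on made-up bedroom counts (the values are hypothetical). A single extreme value drags the mean upwards but leaves the median where most of the data sits.

```python
import pandas as pd

# Hypothetical bedroom counts with one extreme value and two gaps
bedrooms = pd.Series([2, 3, 3, 4, None, 3, 40, None], dtype="float64")

mean_value = bedrooms.mean()      # pulled upwards by the extreme value 40
median_value = bedrooms.median()  # stays at the centre of the typical values

# Assigning the result back avoids relying on inplace=True
filled = bedrooms.fillna(median_value)

print(mean_value, median_value)
```

Assigning the result of `fillna` back, rather than calling it with `inplace=True` on a selected column, is a design choice that behaves the same way across pandas versions.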

In [None]:
# Replacing null values of the 'bedrooms' feature with its median
bed_median = df['bedrooms'].median()
df['bedrooms'] = df['bedrooms'].fillna(bed_median)

In [None]:
# checking the remaining share of null values in the 'bedrooms' feature
df['bedrooms'].isnull().sum() / df['bedrooms'].shape[0]

**No null values!**

## Fixing 'carParkingSpace' feature null values

In [None]:
plt.figure(figsize=(13, 4))
sns.boxplot(data=df, x='carParkingSpace')
plt.show()

**Since 'carParkingSpace' is also skewed, the median is again the better choice for replacing the null values.**

In [None]:
# Replacing null values of the 'carParkingSpace' feature with its median
car_median = df['carParkingSpace'].median()
df['carParkingSpace'] = df['carParkingSpace'].fillna(car_median)

In [None]:
# checking the remaining share of null values in the 'carParkingSpace' feature
df['carParkingSpace'].isnull().sum() / df['carParkingSpace'].shape[0]

**No null values!**

## Checking data after handling missing values

In [None]:
# Displaying missing values
df.isnull().sum()

**No missing values!**

In [None]:
# Describing numeric features
df.describe()

In [None]:
# Describing non-numeric features
# pandas 2.0 removed the datetime_is_numeric argument; on older versions it can be passed as datetime_is_numeric=True
df.describe(exclude=['int', 'float'])

<p style='color: green'>
    <b>The dataset contains records of 199,231 properties sold from 1 December 2000 to 6 July 2019.</b>
</p>

# Feature Engineering

**The 'Date' feature is in YYYY-MM-DD format. For analysis purposes, it will be split into three new features: year, month, and day.**
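
The split described above can be sketched on two made-up dates. This sketch uses the `.dt.month_name()` accessor as one possible way to get month names; that is an assumption for illustration and not necessarily how the notebook does it.

```python
import pandas as pd

# Hypothetical sale dates in YYYY-MM-DD format
dates = pd.to_datetime(pd.Series(["2000-12-01", "2019-07-06"]))

# Each .dt accessor returns one component of the date as its own column
parts = pd.DataFrame({
    "yearSold": dates.dt.year,
    "monthOfYear": dates.dt.month_name(),  # English month names by default
    "dayOfMonth": dates.dt.day,
})

print(parts)
```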

In [None]:
import calendar
month_names = [calendar.month_name[i] for i in range(1, 13)]

# extracting year, month, and day as separate columns
year = df['Date'].dt.year
month = df['Date'].dt.month.apply(lambda x: month_names[x-1])
day = df['Date'].dt.day

# Inserting new year, month, and day features
df.insert(1, 'dayOfMonth', day)
df.insert(1, 'monthOfYear', month)
df.insert(1, 'yearSold', year)

In [None]:
# Assigning category data type to the month feature
df['monthOfYear'] = df['monthOfYear'].astype('category')

In [None]:
df = df.drop('Date', axis=1) #Removing Date feature

df.head()

# Handling outliers

**Handling outliers is very important, because their presence can significantly affect the analysis, correlations, and modeling results. Outliers are extreme values that are much higher or lower than the other values in the dataset.**
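
One widely used rule of thumb flags values that lie more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. A minimal sketch on made-up prices follows; `iqr_limits` is an illustrative helper written for this example, and the factor 1.5 is a convention rather than a fixed requirement.

```python
import pandas as pd

def iqr_limits(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) limits of the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical prices; the last value is deliberately extreme
prices = pd.Series([500_000, 700_000, 800_000, 900_000,
                    1_000_000, 1_200_000, 50_000_000])

lower, upper = iqr_limits(prices)

# Values outside the limits are flagged as potential outliers
flagged = prices[(prices < lower) | (prices > upper)]

print(lower, upper, flagged.tolist())
```

Flagged values are candidates for inspection, not automatic deletions; whether a point is an entry error still needs a look at the row itself.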

In [None]:
def plotter(plot_name, dataframe, feature, figsize_width, figsize_height):
    '''
    Displays a box plot or a KDE plot of the given feature, depending on plot_name.
    '''
    plt.figure(figsize=(figsize_width, figsize_height))
    if plot_name == 'boxplot':
        sns.boxplot(data=dataframe, x=feature)
    elif plot_name == 'kdeplot':
        sns.kdeplot(dataframe[feature], bw_adjust=0.2)
    plt.show()

def remove_outliers(dataframe, feature, conditional=">", value=0, testing=False):
    '''
    Treats rows where 'feature' is above (conditional=">") or below (conditional="<") 'value' as outliers.
    If testing=False, removes those rows in place and prints how many were removed.
    If testing=True, prints how many outliers were detected and displays a box plot of the data as it would look without them.
    '''
    if conditional == ">":
        # Selecting the outliers above the threshold
        outliers = dataframe.loc[dataframe[feature] > value]
    elif conditional == "<":
        # Selecting the outliers below the threshold
        outliers = dataframe.loc[dataframe[feature] < value]
    else:
        raise ValueError("conditional must be '>' or '<'")
    outliers_index = outliers.index
    if not testing:
        # Removing the outliers from the dataframe in place
        dataframe.drop(outliers_index, inplace=True)
        print(f"{len(outliers_index)} outliers removed.")
    else:
        print(f"{len(outliers_index)} outliers detected. If those outliers are removed, the distribution would be as given below:")
        # Plotting a copy so the original dataframe is left untouched
        new_df = dataframe.drop(outliers_index)
        plotter('boxplot', new_df, feature, 13, 4)
        
def remove_given_indexes(dataframe, indexes=None):
    # Remove the entries whose index labels are in the list (nothing is removed when none are given)
    indexes = [] if indexes is None else list(indexes)
    dataframe.drop(index=indexes, inplace=True)
    print(f"Entries {indexes} removed.")
    
def print_outliers(dataframe, feature, conditional=">", value=0):
    if conditional==">":
        # Filtering out the outliers
        outliers = dataframe.loc[dataframe[feature]>value]
    elif conditional=="<":
        # Filtering out the outliers
        outliers = dataframe.loc[dataframe[feature]<value]
    return outliers

In [None]:
def outliers_checker_based_on_quantiles(feature, dataframe):
    '''
    Reports how many values fall below the 10th percentile or above the 90th percentile,
    and how many fall outside the 1.5 * IQR limits.
    '''
    # Using the 10th and 90th percentiles
    tenth_quantile = dataframe[feature].quantile(0.10)
    ninetieth_quantile = dataframe[feature].quantile(0.90)

    # Filtering on the dataframe that was passed in, not on the global df
    lower_outliers = dataframe.loc[dataframe[feature] < tenth_quantile]
    upper_outliers = dataframe.loc[dataframe[feature] > ninetieth_quantile]

    # Using the interquartile range (IQR)
    Q1 = dataframe[feature].quantile(0.25)
    Q3 = dataframe[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_limit = Q1 - 1.5*IQR
    upper_limit = Q3 + 1.5*IQR

    Iqr_lower_outliers = dataframe.loc[dataframe[feature] < lower_limit]
    Iqr_upper_outliers = dataframe.loc[dataframe[feature] > upper_limit]

    print(f"The feature '{feature}' has {lower_outliers.shape[0]} values below the 10th percentile and {upper_outliers.shape[0]} values above the 90th percentile.")
    print(f"Similarly, according to IQR limits, the feature '{feature}' has {Iqr_lower_outliers.shape[0]} outliers below the lower limit and {Iqr_upper_outliers.shape[0]} outliers above the upper limit.")

## Handling outliers in 'sellPrice'

In [None]:
# Plotting boxplot before making any changes
plotter('boxplot', df, 'sellPrice', 13, 3)

**The most extreme outliers are probably data-entry errors, so they should be removed.**

In [None]:
remove_outliers(df, 'sellPrice', '>', 0.35e9, testing = False)

plotter('boxplot', df, 'sellPrice', 13, 3)

In [None]:
# Checking the remaining outliers for noticeable abnormalities,
# so that their indexes can be noted and the entries removed
remove_outliers(df, 'sellPrice', '>', 0.35e8, testing = True)

print_outliers(df, 'sellPrice', '>', 0.35e8)

<b>Judging by the other features of the entry at index 103736, its price seems abnormal, so it should be removed.</b>

In [None]:
remove_given_indexes(df, [103736])

In [None]:
outliers_checker_based_on_quantiles('sellPrice', df)

## Handling outliers in 'yearSold'

In [None]:
def numeric_histplot_and_boxplot_plotter(feature, fig_width, fig_height):
    # Despite its name, this draws a scatter plot against 'sellPrice' and a box plot of the feature.
    # is_numeric_dtype accepts any integer or float width, not only int64 and float64.
    if pd.api.types.is_numeric_dtype(df[feature]):
        fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(fig_width, fig_height))

        sns.scatterplot(data=df, x=feature, y='sellPrice', ax=ax1).set_title('With respect to Selling Prices')

        sns.boxplot(data=df, x=feature, ax=ax2).set_title('With respect to counts')

        plt.tight_layout()
        plt.show()

    else:
        print('Provide a numeric feature')

In [None]:
df['yearSold'].value_counts()

**The years 2000 to 2003 have very few entries, so it is better to remove them.**

In [None]:
df = df.loc[(df['yearSold'] > 2003)]

df['yearSold'].value_counts()

In [None]:
outliers_checker_based_on_quantiles('yearSold', df)

In [None]:
numeric_histplot_and_boxplot_plotter('yearSold', 13, 8)

**The 'With respect to Selling Prices' plot above clearly shows that some potential outliers remain.**

In [None]:
# filtering out the outliers

outliers_yearSold = df.loc[(df['yearSold'].isin([2009, 2013, 2014, 2017, 2018])) & (df['sellPrice']>0.5e8)]
outliers_yearSold

**These all seem heavily overpriced, so it is better to remove them.**

In [None]:
# removing entries with given index
df = df.drop(outliers_yearSold.index)

In [None]:
numeric_histplot_and_boxplot_plotter('yearSold', 13, 8)

## Handling outliers in 'dayOfMonth'

In [None]:
df['dayOfMonth'].value_counts()

<b>'dayOfMonth' feature looks pretty good as it is</b>

In [None]:
numeric_histplot_and_boxplot_plotter('dayOfMonth', 13, 8)

## Handling outliers in 'bedrooms'

In [None]:
bedroom_dict = dict(df['bedrooms'].value_counts())
bedroom_dict

**Some bedroom counts appear in very few entries. It is better to remove them.**
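
The same "drop values that occur too rarely" step is repeated for several features in this notebook. As a sketch of how it could be written once, the helper below keeps only rows whose value occurs at least `min_count` times. `drop_rare_values` and the toy frame are hypothetical names introduced for this example.

```python
import pandas as pd

def drop_rare_values(frame: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Keep only rows whose value in `column` occurs at least `min_count` times."""
    counts = frame[column].value_counts()
    frequent = counts[counts >= min_count].index
    return frame.loc[frame[column].isin(frequent)]

# Hypothetical bedroom counts; 12 bedrooms occurs only once
toy = pd.DataFrame({"bedrooms": [3, 3, 3, 4, 4, 12]})

kept = drop_rare_values(toy, "bedrooms", min_count=2)

print(kept["bedrooms"].tolist())
```

The threshold is a judgment call per feature; a value that is rare is not necessarily wrong, so it is worth checking how many rows each threshold removes.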

In [None]:
to_remove = []
for key, val in bedroom_dict.items():
    if val < 30: # keeping the threshold at 30, so bedroom counts that occur fewer than 30 times will be removed
        to_remove.append(key)

# Removing the selected entries
df = df.loc[~df['bedrooms'].isin(to_remove)]

df['bedrooms'].value_counts()

In [None]:
numeric_histplot_and_boxplot_plotter('bedrooms', 13, 8)

**There are some obvious outliers that should be removed.**

In [None]:
# filtering out the outliers

outliers_ = df.loc[(df['sellPrice']>4e7)]
outliers_

In [None]:
# removing entries with given index
df = df.drop(outliers_.index)

In [None]:
numeric_histplot_and_boxplot_plotter('bedrooms', 13, 8)

In [None]:
outliers_checker_based_on_quantiles('bedrooms', df)

## Handling outliers in 'bathrooms'

In [None]:
bathrooms_dict = df['bathrooms'].value_counts()
bathrooms_dict

**Some bathroom counts appear in very few entries. It is better to remove them.**


In [None]:
to_remove = []
for key, val in bathrooms_dict.items():
    if val < 14: # keeping the threshold at 14, so bathroom counts that occur fewer than 14 times will be removed
        to_remove.append(key)

# Removing the selected entries
df = df.loc[~df['bathrooms'].isin(to_remove)]


df['bathrooms'].value_counts()

In [None]:
numeric_histplot_and_boxplot_plotter('bathrooms', 13, 8)

In [None]:
outliers_checker_based_on_quantiles('bathrooms', df)

## Handling outliers in 'carParkingSpace'

In [None]:
carParkingSpace_dict = df['carParkingSpace'].value_counts()
carParkingSpace_dict

In [None]:
to_remove = []
for key, val in carParkingSpace_dict.items():
    if val < 18: # keeping the threshold at 18, so carParkingSpace values that occur fewer than 18 times will be removed
        to_remove.append(key)

# Removing the selected entries
df = df.loc[~df['carParkingSpace'].isin(to_remove)]


df['carParkingSpace'].value_counts()

In [None]:
numeric_histplot_and_boxplot_plotter('carParkingSpace', 13, 8)

In [None]:
outliers_checker_based_on_quantiles('carParkingSpace', df)

## Handling outliers in 'postalCode'

In [None]:
postalCode_dict = df['postalCode'].value_counts()

to_remove = []
for key, val in postalCode_dict.items():
    if val < 10: # keeping the threshold at 10, so postal codes that occur fewer than 10 times will be removed
        to_remove.append(key)

# Removing the selected entries
df = df.loc[~df['postalCode'].isin(to_remove)]

## Handling outliers in 'propertyType'

In [None]:
propertyType_dict = df['propertyType'].value_counts()
propertyType_dict

**This feature is alright**

## Handling outliers in 'suburb'

In [None]:
suburb_dict = df['suburb'].value_counts()

to_remove = []
for key, val in suburb_dict.items():
    if val < 10: # keeping the threshold at 10, so suburbs that occur fewer than 10 times will be removed
        to_remove.append(key)

# Removing the selected entries
df = df.loc[~df['suburb'].isin(to_remove)]

## Final look at 'sellPrice'

In [None]:
plotter('boxplot', df, 'sellPrice', 13, 3)

**Removing the rightmost remaining points (prices of 35 million or more)**

In [None]:
df = df.loc[df['sellPrice']<3.5e7]

In [None]:
plotter('boxplot', df, 'sellPrice', 13, 3)

# Exploratory Data Analysis

**In this section, the data is used to generate some insights into the Sydney housing market.**

In [None]:
# Generate a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
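# Restricting to numeric columns, since correlations are undefined for the category features
sns.heatmap(df.select_dtypes(include='number').corr(), cmap='coolwarm', annot=True)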
plt.show()

**The number of bathrooms has the strongest positive correlation with property price, followed by the number of bedrooms and then car parking space.**
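
A ranking like the one above can be read directly from the correlation matrix by taking the target's column and sorting it. The sketch below uses a made-up frame (`toy` and its values are hypothetical) and Pearson correlation, which is the default of `DataFrame.corr()`.

```python
import pandas as pd

# Hypothetical toy data with one non-numeric column
toy = pd.DataFrame({
    "bedrooms": [2, 3, 3, 4, 5],
    "bathrooms": [1, 1, 2, 2, 3],
    "propertyType": ["unit", "house", "house", "house", "house"],
    "sellPrice": [600_000, 800_000, 1_000_000, 1_100_000, 1_500_000],
})

# Keep numeric columns only, then rank features by correlation with the target
numeric = toy.select_dtypes(include="number")
corr_with_price = (
    numeric.corr()["sellPrice"]
    .drop("sellPrice")
    .sort_values(ascending=False)
)

print(corr_with_price)
```

A correlation coefficient measures linear association only; it does not by itself show that adding a bathroom causes a higher price.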

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(13, 8))

# Median selling price for each year
yearSold_dict = {}
for year in range(2004, 2020):
    inner_df = df.loc[df['yearSold'] == year]
    each_year_prices = inner_df['sellPrice']
    yearSold_dict[year] = each_year_prices.median()

# Creating lists for x and y values
x_values = list(yearSold_dict.keys())
y_values = list(yearSold_dict.values())

# Create a line plot with markers
ax1.plot(x_values, y_values, marker='o')
# Set the x and y axis labels
ax1.set_xlabel('yearSold')
ax1.set_ylabel('Median Sell Price')

# Number of transactions per year
sns.histplot(data=df, x='yearSold', ax=ax2)

plt.tight_layout()
plt.show()

In [None]:
df['yearSold'].value_counts()

**The number of transactions in Sydney's housing market rose gradually from the early 2000s to a peak in 2017, then fell sharply into 2019 (the 2019 records only run to 6 July, so that year is incomplete). The market was at its busiest between 2014 and 2018, with more than 20,000 transactions in each of those years.**

**From the early 2000s to 2005, property prices remained fairly stable. They then rose by more than 50% from 2005 to 2008, dipped very slightly in 2009, and rose sharply in 2010, by almost 13% over 2009. After 2010, prices fluctuated and reached an all-time high in 2015. 2016 saw a slight decline, but prices rose again in 2017 to a level similar to 2015. From 2017 onwards, prices declined steeply through 2019.**

**Property prices tend to fall in years when the number of transactions falls, and to rise when it rises.**
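
Year-over-year statements such as "rose by more than 50%" can be checked numerically by grouping on the year and applying `pct_change()` to the yearly medians. A minimal sketch on made-up sales (the frame `toy` and the column name `median_change_pct` are hypothetical):

```python
import pandas as pd

# Hypothetical sales for two years
toy = pd.DataFrame({
    "yearSold": [2005, 2005, 2005, 2008, 2008, 2008],
    "sellPrice": [400_000, 500_000, 600_000, 700_000, 800_000, 900_000],
})

# Median price and number of transactions per year
per_year = toy.groupby("yearSold")["sellPrice"].agg(["median", "count"])

# Percentage change of the median relative to the previous row
per_year["median_change_pct"] = per_year["median"].pct_change() * 100

print(per_year)
```

Note that `pct_change()` compares each row with the row before it, so with gaps between years (as in this toy frame) the change spans the whole gap.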

In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(13, 8))




import calendar
month_names = [calendar.month_name[i] for i in range(1, 13)]

monthOfYear_median_price_dict = {}
monthOfYear_value_counts_dict = {}
for month in month_names:
    inner_df = df.loc[df['monthOfYear'] == month]
    each_month_prices= inner_df['sellPrice']
    
    monthOfYear_median_price_dict[month] = each_month_prices.median()
    
    monthOfYear_value_counts_dict[month] = len(each_month_prices)

x_months = list(monthOfYear_median_price_dict.keys())
y_median_values = list(monthOfYear_median_price_dict.values())

ax1.plot(x_months, y_median_values, marker='o')
ax1.set_xlabel ('monthOfYear')
ax1.set_ylabel ('Median Sell Price')





y_values_counts = list(monthOfYear_value_counts_dict.values())
# The counts are already aggregated per month, so a bar chart shows them directly
ax2.bar(x_months, y_values_counts)
ax2.set_xlabel('monthOfYear')
ax2.set_ylabel('Number of transactions')




plt.tight_layout()
plt.show()

**The three busiest months for property transactions in Sydney seem to be November, May, and March, with March the busiest of all 12 months. In contrast, January seems to have the fewest property transactions.**

**Furthermore, March, September, and November seem to be the months in which the most expensive properties change hands, whereas January, July, and December seem to be the months in which less expensive properties are more likely to sell.**
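
Month names sort alphabetically by default, which scrambles any per-month summary. One way to keep calendar order, sketched here on made-up data, is an ordered categorical built from `calendar.month_name`. The frame `toy` is hypothetical, and English month names are assumed (the default locale).

```python
import calendar
import pandas as pd

# 'January' ... 'December' in calendar order (index 0 of month_name is an empty string)
month_order = list(calendar.month_name)[1:]

# Hypothetical sales
toy = pd.DataFrame({
    "monthOfYear": ["March", "January", "March", "November", "January", "March"],
    "sellPrice": [900_000, 700_000, 1_100_000, 1_000_000, 800_000, 1_000_000],
})
toy["monthOfYear"] = pd.Categorical(toy["monthOfYear"], categories=month_order, ordered=True)

# observed=True keeps only months that actually occur; groups follow category order
per_month = toy.groupby("monthOfYear", observed=True)["sellPrice"].agg(["median", "count"])

print(per_month)
```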

In [None]:
import matplotlib.ticker as ticker

suburbs_name_and_median_val = {}
for key, val in df['suburb'].cat.remove_unused_categories().value_counts().items():
    inner_df = df.loc[df['suburb']==key]
    each_suburb_sellPrice = inner_df['sellPrice']
    
    suburbs_name_and_median_val[key] = each_suburb_sellPrice.median()

# sorting the dict by its value amount in descending order
sorted_suburbs_name_and_median_val = dict(sorted(suburbs_name_and_median_val.items(), key=lambda item: item[1], reverse=True))


top_10_most_expensive_suburbs = dict(list(sorted_suburbs_name_and_median_val.items())[:10])
bottom_10_least_expensive_suburbs = dict(list(sorted_suburbs_name_and_median_val.items())[-10:])



# visualizing data
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(13, 10))

ax1.bar(top_10_most_expensive_suburbs.keys(), top_10_most_expensive_suburbs.values())
ax1.set_xlabel ('Suburbs')
ax1.set_ylabel ('Median Properties Price')
ax1.set_xticks(range(len(top_10_most_expensive_suburbs)))
ax1.set_xticklabels(top_10_most_expensive_suburbs.keys(), rotation=45, ha='right')
ax1.set_title('Top 10 Most Expensive Suburbs in Sydney')
ax1.yaxis.set_major_formatter(ticker.FormatStrFormatter('%.0f'))

ax2.bar(bottom_10_least_expensive_suburbs.keys(), bottom_10_least_expensive_suburbs.values())
ax2.set_xlabel ('Suburbs')
ax2.set_ylabel ('Median Properties Price')
ax2.set_xticks(range(len(bottom_10_least_expensive_suburbs)))
ax2.set_xticklabels(bottom_10_least_expensive_suburbs.keys(), rotation=45, ha='right')
ax2.set_title('10 Least Expensive Suburbs in Sydney')

# plt.tight_layout()
plt.subplots_adjust(hspace=1)
plt.show()

In [None]:
# Describing numeric features
df.describe()

**The typical (median) selling price was 985,000.**

**The middle half of the properties sold for between 720,000 and 1,475,000.**

**Most of the sold properties had either 3 or 4 bedrooms.**

**Most of the sold properties had either 1 or 2 bathrooms.**

**Most of the sold properties had either 1 or 2 car parking spaces.**
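
Summary statements like these are read off the 25%, 50%, and 75% rows of `describe()`. As a sketch of what those rows mean, the example below computes the same three quantiles on made-up prices (the values are hypothetical and unrelated to the dataset).

```python
import pandas as pd

# Hypothetical selling prices
prices = pd.Series([400_000, 500_000, 700_000, 900_000, 2_000_000])

# 25th percentile, median, and 75th percentile in one call
q1, median, q3 = prices.quantile([0.25, 0.50, 0.75])

# Share of values lying between the quartiles (inclusive)
share_in_middle = prices.between(q1, q3).mean()

print(q1, median, q3, share_in_middle)
```

The median is the value that half of the sales fall below, which is not the same as the most frequent price.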

In [None]:
# Describing non-numeric features
df.describe(exclude=['int', 'float'])

**The most commonly sold property type was houses.**

**The suburb with the highest number of properties sold was 'Castle Hill'.**

**The postal code with the highest number of properties sold was 2155.**