<a href="https://www.kaggle.com/code/bhumitdevni/eda-of-australia-housing?scriptVersionId=143484869" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input/australian-housing-data-1000-properties-sampled'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import warnings
warnings.filterwarnings("ignore")

## Import Dataset

In [None]:
df = pd.read_csv("/kaggle/input/australian-housing-data-1000-properties-sampled/RealEstateAU_1000_Samples.csv")
df.head()

## Data Cleaning

In [None]:
df.info()

As we can see, our DataFrame has 26 columns and 1000 rows.

In [None]:
plt.figure(figsize=(10,10))

sns.heatmap(df.isnull(), yticklabels = False, cbar=False, cmap='viridis')

As the figure above shows, there are missing values in our DataFrame.

In [None]:
df.isnull().sum()

Longitude and latitude are missing every value, followed by open date, building size, land size, and preferred size, in that order. Bedroom count, bathroom count, and parking count are missing 33 values each, and address and address_1 are missing 12 values each.

Since latitude and longitude are missing every value, it is best to drop both columns. We will also drop open date, building size, land size, and preferred size, because their high share of missing data could lead to incorrect results.
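
To make "a high number of missing values" concrete, the share of missing values per column can be computed and compared against a cut-off. The sketch below uses a small toy frame rather than the real dataset, and the 50% threshold is an assumption chosen for illustration, not a value taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative).
toy = pd.DataFrame({
    "latitude": [np.nan, np.nan, np.nan, np.nan],
    "land_size": [np.nan, "450m2", np.nan, np.nan],
    "bedroom_count": [3.0, 2.0, np.nan, 4.0],
    "price": ["$500,000", "$420,000", "$610,000", "$380,000"],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cut-off: drop columns that are more than half empty.
threshold = 0.5
cols_to_drop = missing_share[missing_share > threshold].index.tolist()
reduced = toy.drop(columns=cols_to_drop)
print(reduced.columns.tolist())
```

Columns just below the cut-off, such as `bedroom_count` here, are kept and handled row by row instead.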

In [None]:
df.drop(columns = ['latitude' , 'longitude' ,'building_size' , 'land_size' , 'preferred_size' , 'open_date'] , axis = 1, inplace = True)

After looking at the remaining data, we can make a few observations:

1. We do not need the columns TID, breadcrumb, category_name, location_number, phone, and RunDate, so we can drop them to shrink the dataset and keep the output focused.
2. The price column contains mixed values that require cleaning.
3. I suspect price and location_name hold the same values, which is odd and needs investigating.
4. address and address_1 represent the same data: address shows the whole address, while address_1 shows only the street name and unit number. Since we already have city, state, and zip code columns, it is best to drop the address column.
5. Some columns have missing data, so we will drop the rows with missing values.

In [None]:
df.drop(columns = ['TID' , 'RunDate' , 'phone', 'breadcrumb', 'address', 'category_name' , 'location_number'] , axis = 1, inplace = True)

In [None]:
def df_clean(df, column_name):
    df[column_name] = df[column_name].str.replace(r'^.*?\$', '', regex=True)


df_clean(df, 'price')
df_clean(df, 'location_name')

matching_values = (df['price'] == df['location_name']).sum()
matching_values

Every value in price matches location_name, so the column is a duplicate. Our suspicion was right, and we can drop location_name altogether.
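
One caveat when counting matches with `==`: pandas treats a missing value as unequal to another missing value, so rows where both columns are empty are not counted. The sketch below, on made-up series, shows the difference and two NaN-aware alternatives.

```python
import numpy as np
import pandas as pd

# Two illustrative columns with identical content, including one missing value.
a = pd.Series(["500,000", np.nan, "Contact agent"])
b = pd.Series(["500,000", np.nan, "Contact agent"])

# Element-wise == treats NaN as unequal to NaN, so the count falls short.
naive_matches = int((a == b).sum())

# Count a row as matching if the values are equal OR both are missing.
nan_aware_matches = int(((a == b) | (a.isna() & b.isna())).sum())

# Series.equals also treats NaNs in the same position as equal.
identical = a.equals(b)

print(naive_matches, nan_aware_matches, identical)
```

Comparing the match count against `len(df)` is a quick way to confirm that the two columns really agree on every row.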

In [None]:
df = df.drop('location_name' , axis = 1)

Now, let's remove missing values from data set

In [None]:
df.dropna(subset =['bedroom_count' , 'bathroom_count' , 'parking_count' , 'address_1'] ,inplace = True)

In [None]:
df.isnull().sum()

Now that our dataset has no missing values, we can start with feature engineering.

In [None]:
df

Looking at df, we can see that the price column still has inconsistent values, so for better results it is important to clean it.

In [None]:
df['price'] = df['price'].str.replace(',', '', regex=True)
df = df[df['price'].str.isnumeric()]
df.reset_index(drop=True, inplace=True)

In [None]:
df['price'] = df['price'].astype(int)
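
Before discarding non-numeric prices, it can help to look at what is being thrown away. The sketch below is an alternative formulation using `pd.to_numeric` with `errors="coerce"` on made-up values; note that, unlike `str.isnumeric`, it would also accept decimals and negative numbers, so the two approaches are not strictly equivalent.

```python
import pandas as pd

# Illustrative raw values; the real column may contain other patterns.
raw = pd.Series(["500,000", "Contact agent", "1,250,000", "Offers over 400,000", "399000"])

# Strip thousands separators, then parse; unparseable entries become NaN.
cleaned = raw.str.replace(",", "", regex=False)
numeric = pd.to_numeric(cleaned, errors="coerce")

# Rows that could not be parsed, useful for checking what is discarded.
rejected = raw[numeric.isna()]
print(rejected.tolist())

# Keep only the parsed prices as integers.
prices = numeric.dropna().astype(int)
print(prices.tolist())
```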

In [None]:
df

In [None]:
df.drop('index' , axis=1,inplace=True)

location_type is also not providing any useful information, so we'll drop that as well.

In [None]:
df.drop('location_type' , axis=1, inplace=True)

In [None]:
df.info()

In [None]:
df.describe()

## Outlier Detection

In [None]:


plot_cols = df.columns[df.columns.isin(['price', 'zip_code', 'bedroom_count', 'bathroom_count', 'parking_count'])]


n_cols = 3
n_rows = -(-len(plot_cols) // n_cols)  # ceil division


fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5), squeeze=False)

for i, col in enumerate(plot_cols):
    row, col_idx = divmod(i, n_cols)
    sns.boxplot(data=df, x=col, ax=axes[row, col_idx], width=0.5)
    axes[row, col_idx].set_title(f'Box plot of {col}')
    axes[row, col_idx].set_xlabel(col)
    axes[row, col_idx].set_ylabel('')  # Remove y-axis label


for j in range(i + 1, n_rows * n_cols):
    row, col_idx = divmod(j, n_cols)
    fig.delaxes(axes[row][col_idx])

plt.tight_layout()
plt.show()


As the box plot shows, there are outliers that could distort our results, so let's remove them and check again.

In [None]:
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
outlier_threshold = 1.5 * IQR + Q3


df['price'] = df['price'].apply(lambda x: None if x >= outlier_threshold else x)


df.reset_index(drop=True, inplace=True)
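
The rule used above is the usual box-plot upper fence, Q3 + 1.5 × IQR. As a small worked example on made-up numbers, the helper below (`iqr_upper_fence` is a hypothetical name, not part of pandas) computes the fence and filters the rows that fall below it.

```python
import pandas as pd

def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper box-plot fence."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Illustrative prices with one obvious outlier.
prices = pd.Series([300, 350, 400, 450, 500, 5000])

fence = iqr_upper_fence(prices)   # Q1 = 362.5, Q3 = 487.5, IQR = 125
kept = prices[prices < fence]     # drops only the 5000 entry

print(fence, kept.tolist())
```

Filtering rows directly, as in this sketch, avoids leaving missing values in the column for later steps to deal with.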


In [None]:
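# The previous cell replaced outlying prices with NaN. With the default
# nan_policy='propagate' a single NaN makes every z-score NaN, which would empty df.
z_scores = zscore(df['price'], nan_policy='omit')
outliers = df[np.abs(z_scores) > 3]

# Comparisons against NaN are False, so rows with a missing price are dropped here too.
df = df[np.abs(z_scores) <= 3]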
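
Because the IQR step replaces outlying prices with missing values, it is worth checking how z-scores behave when NaN is present. The NumPy-only sketch below, on made-up numbers, contrasts plain statistics with NaN-aware ones. It illustrates the idea and is not the scipy implementation, although `scipy.stats.zscore` exposes a `nan_policy` argument for the same purpose.

```python
import numpy as np

# Illustrative prices with one missing entry.
prices = np.array([300.0, 350.0, np.nan, 450.0, 500.0])

# Plain mean/std propagate the NaN, so every z-score becomes NaN.
naive_z = (prices - prices.mean()) / prices.std()

# NaN-aware statistics ignore the missing entry; only that entry stays NaN.
safe_z = (prices - np.nanmean(prices)) / np.nanstd(prices)

# Comparisons with NaN are False, so this mask drops the missing row as well.
keep = np.abs(safe_z) <= 3

print(naive_z)
print(safe_z)
print(keep)
```

With the naive version, a mask such as `np.abs(naive_z) <= 3` is False everywhere and would filter out every row.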

In [None]:

plot_cols = df.columns[df.columns.isin(['price', 'zip_code', 'bedroom_count', 'bathroom_count', 'parking_count'])]


n_cols = 3
n_rows = -(-len(plot_cols) // n_cols)  # ceil division


fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 5), squeeze=False)


for i, col in enumerate(plot_cols):
    row, col_idx = divmod(i, n_cols)
    sns.boxplot(data=df, x=col, ax=axes[row, col_idx], width=0.5)
    axes[row, col_idx].set_title(f'Box plot of {col}')
    axes[row, col_idx].set_xlabel(col)
    axes[row, col_idx].set_ylabel('')  # Remove y-axis label


for j in range(i + 1, n_rows * n_cols):
    row, col_idx = divmod(j, n_cols)
    fig.delaxes(axes[row][col_idx])

plt.tight_layout()
plt.show()

Now that there are no outliers left in our data, we can start visualizing it to make observations.

## EDA

In [None]:
plt.figure(figsize=(12,5))
ax =sns.countplot(x='property_type' , data= df )
plt.xlabel('Property Type')
plt.xticks(rotation = 90)
plt.ylabel('Count')
plt.title('property market analysis ')

def add_value_labels(ax):
    for p in ax.patches:
        ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom')


add_value_labels(ax)

plt.show()

The property market analysis shows that "House" listings dominate the market, with by far the highest number of properties available for sale. At the other end, "Studio", "Block of Units", "Villa", and "Other" have the fewest listings and only a limited presence in the data.

Among the property types with higher representation, "Apartment" listings are also notable, with 160 properties available for sale. This is closely followed by "Unit" listings, totaling 159 properties.

In contrast, property types such as "Townhouse," "Duplex/Semi-detached," and "Acreage" have a relatively smaller number of listings, with 24, 16 and 4 properties available for sale, respectively.

These observations suggest that supply in this sample is concentrated in houses, which may reflect a preference for larger properties. The low number of "Studio" and "Block of Units" listings is harder to interpret: it could mean that few such properties exist, or that they sell quickly. Listing counts alone cannot separate supply from demand, so any conclusion about shifting buyer preferences or rising prices in the Australian real estate market remains speculative.

In [None]:
max_prices = df.groupby('property_type')['price'].transform('max')
min_prices = df.groupby('property_type')['price'].transform('min')

plt.figure(figsize=(10, 6))
bars = plt.bar(df['property_type'], df['price'])
plt.xlabel('Property Type')
plt.xticks(rotation=90)
plt.ylabel('Price')

for bar, max_val, min_val in zip(bars, max_prices, min_prices):
    if bar.get_height() == max_val:
        plt.annotate(f'Max: {max_val}', xy=(bar.get_x() + bar.get_width() / 2, bar.get_height() + 2),
                     xytext=(0, 5), textcoords='offset points', ha='center', fontsize=8, color='blue')
    elif bar.get_height() == min_val:
        plt.annotate(f'Min: {min_val}', xy=(bar.get_x() + bar.get_width() / 2, bar.get_height() - 10),
                     xytext=(0, -15), textcoords='offset points', ha='center', fontsize=8, color='red')

plt.title('Price by Property Type')
plt.show()
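
The minimum and maximum price per property type can also be read off a small summary table, which is easier to check than annotations on overlapping bars. The sketch below uses made-up rows; the median column is an addition for illustration and is not part of the plot above.

```python
import pandas as pd

# Illustrative rows; the real df has many more listings and property types.
toy = pd.DataFrame({
    "property_type": ["House", "House", "Unit", "Unit", "Apartment"],
    "price": [650000, 480000, 320000, 410000, 390000],
})

# One row per property type with its cheapest, median, and most expensive listing.
summary = (
    toy.groupby("property_type")["price"]
       .agg(["min", "median", "max"])
       .sort_values("max", ascending=False)
)
print(summary)
```

Calling `summary.plot(kind="bar")` on such a table would give one bar group per property type rather than one bar per listing.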

In [None]:
plt.figure(figsize=(10, 20))
ax =sns.countplot(y='listing_agency' , data= df )
plt.xlabel('Count')
plt.ylabel('Listing Agency')
plt.title('property market analysis ')

def add_value_labels(ax):
    for p in ax.patches:
        ax.annotate(f'{int(p.get_width())}', (p.get_width(), p.get_y() + p.get_height() / 2.), ha='left', va='center')


add_value_labels(ax)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(20,10))
ax =sns.countplot(x='city' , data= df)
plt.xlabel('City')
plt.xticks(rotation = 90)
plt.ylabel('count')
plt.title('City vs Count plot')

def add_value_labels(ax):
    for p in ax.patches:
        ax.annotate(f'{p.get_height()}', (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='bottom')

add_value_labels(ax)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(12,5))
ax =sns.countplot(x='zip_code' , data= df)
plt.xlabel('Zip Code')
plt.xticks(rotation = 90)
plt.ylabel('count')
plt.title('Zip Code vs Count')
add_value_labels(ax)
plt.show()

This figure shows the distribution of properties by zip code. Only 13 zip codes appear in the data, yet the **"City vs Count plot"** figure clearly shows more than 13 cities. If each zip code belonged to exactly one city or suburb, this would point to inconsistent data. It may also simply mean that several suburbs share a zip code, so the mapping needs checking before drawing conclusions.
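
One way to investigate is to count how many distinct cities appear under each zip code. The sketch below runs on a toy frame with placeholder suburb names and zip codes. On the real data, the same `groupby` would be applied to `df` (assuming the `city` and `zip_code` columns used in the plots above).

```python
import pandas as pd

# Placeholder values for illustration only.
toy = pd.DataFrame({
    "city": ["Suburb A", "Suburb B", "Suburb C", "Suburb C", "Suburb D"],
    "zip_code": [800, 820, 820, 820, 810],
})

# Number of distinct cities per zip code.
cities_per_zip = toy.groupby("zip_code")["city"].nunique()

# Zip codes shared by more than one city.
shared = cities_per_zip[cities_per_zip > 1]
print(shared)

# The cities behind each shared zip code.
detail = (
    toy[toy["zip_code"].isin(shared.index)]
    .groupby("zip_code")["city"]
    .unique()
)
print(detail)
```

If most zip codes map to several cities, the gap between the two counts is more likely a property of the postcode system than a data-quality problem.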