### 1. Import Required Libraries
We start by importing pandas, which is used for data manipulation and analysis.

In [1]:
import pandas as pd

### 2. Load the Data
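Read the CSV file into a pandas DataFrame. By default pandas parses the file in chunks, which can infer different types for the same column and raise a mixed-type `DtypeWarning`. Passing `low_memory=False` makes pandas infer each column's type from the whole file instead.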

In [2]:
house_data = pd.read_csv("melb_data.csv", low_memory=False)
print("Step 1: Loaded data")
print("Shape:", house_data.shape)
print(house_data.head(), "\n")

Step 1: Loaded data
Shape: (34857, 22)
         Suburb           Address  Rooms Type Method        SellerG      Date  \
0    Abbotsford     68 Studley St      2    h     SS         Jellis  3/9/2016   
1  Airport West     154 Halsey Rd      3    t     PI         Nelson  3/9/2016   
2   Albert Park    105 Kerferd Rd      2    h      S  hockingstuart  3/9/2016   
3   Albert Park  85 Richardson St      2    h      S        Thomson  3/9/2016   
4    Alphington      30 Austin St      3    h     SN        McGrath  3/9/2016   

   Distance  Postcode  Bedroom  ...  Landsize  BuildingArea  YearBuilt  \
0       2.5    3067.0      2.0  ...     126.0           inf        NaN   
1      13.5    3042.0      3.0  ...     303.0           225     2016.0   
2       3.3    3206.0      2.0  ...     120.0            82     1900.0   
3       3.3    3206.0      2.0  ...     159.0           inf        NaN   
4       6.4    3078.0      3.0  ...     174.0           122     2003.0   

                  CouncilArea
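
An alternative to `low_memory=False` is to declare column types explicitly with the `dtype` parameter of `read_csv`. The following is a minimal sketch under stated assumptions: the in-memory CSV is made up for illustration and is not `melb_data.csv`, and reading `Postcode` as text is a hypothetical choice, not something the notebook does.

```python
import io

import pandas as pd

# Hypothetical three-column CSV standing in for the real file.
csv_text = "Suburb,Rooms,Postcode\nAbbotsford,2,3067\nAirport West,3,3042\n"

# Columns listed in `dtype` are parsed with the declared type,
# so pandas does not have to guess a type for them.
sample = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"Rooms": "int64", "Postcode": str},
)

print(sample.dtypes)
print(sample["Postcode"].tolist())
```

Declaring types up front also documents what each column is expected to contain, at the cost of a load error if the file does not match.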

### 3. Check and Remove Duplicate Rows
Count the rows that are exact duplicates of an earlier row across all columns, then drop them.

In [3]:
num_duplicates = house_data.duplicated().sum()
print(f"Step 2: Number of duplicate rows: {num_duplicates}")
house_data = house_data.drop_duplicates()
print("Step 2: After removing duplicates")
print("Shape:", house_data.shape)
print(house_data.head(), "\n")

Step 2: Number of duplicate rows: 0
Step 2: After removing duplicates
Shape: (34857, 22)
         Suburb           Address  Rooms Type Method        SellerG      Date  \
0    Abbotsford     68 Studley St      2    h     SS         Jellis  3/9/2016   
1  Airport West     154 Halsey Rd      3    t     PI         Nelson  3/9/2016   
2   Albert Park    105 Kerferd Rd      2    h      S  hockingstuart  3/9/2016   
3   Albert Park  85 Richardson St      2    h      S        Thomson  3/9/2016   
4    Alphington      30 Austin St      3    h     SN        McGrath  3/9/2016   

   Distance  Postcode  Bedroom  ...  Landsize  BuildingArea  YearBuilt  \
0       2.5    3067.0      2.0  ...     126.0           inf        NaN   
1      13.5    3042.0      3.0  ...     303.0           225     2016.0   
2       3.3    3206.0      2.0  ...     120.0            82     1900.0   
3       3.3    3206.0      2.0  ...     159.0           inf        NaN   
4       6.4    3078.0      3.0  ...     174.0         
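
`duplicated()` with no arguments only flags rows that match in every column, so a count of 0 does not rule out repeated listings that differ in a single field. The sketch below shows the `subset` parameter on a toy frame. The data is made up, and treating `Address` plus `Date` as the key for a sale is an assumption for illustration only.

```python
import pandas as pd

# Toy data: the first two rows share Address and Date but differ in Price.
toy = pd.DataFrame({
    "Address": ["68 Studley St", "68 Studley St", "154 Halsey Rd"],
    "Date": ["3/9/2016", "3/9/2016", "3/9/2016"],
    "Price": [1000000.0, 1000500.0, 840000.0],
})

# Exact duplicates across all columns.
full_row_dupes = int(toy.duplicated().sum())

# Duplicates on the assumed key columns only.
key_dupes = int(toy.duplicated(subset=["Address", "Date"]).sum())

# Keep the first row of each key group.
deduped = toy.drop_duplicates(subset=["Address", "Date"], keep="first")

print(full_row_dupes, key_dupes, deduped.shape)
```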

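### 4. Check and Remove Non-Positive Prices
If the 'Price' column exists, count the rows whose price is zero or negative and keep only rows with a positive price. The filter `Price > 0` also drops rows where 'Price' is missing, because comparisons with NaN evaluate to False. This is why the shape falls from 34857 to 27247 rows even though the count of non-positive prices is 0.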

In [4]:
if 'Price' in house_data.columns:
    num_negative_prices = (house_data['Price'] <= 0).sum()
    print(f"Step 3: Number of rows with negative prices: {num_negative_prices}")
    house_data = house_data[house_data['Price'] > 0].copy()  # copy so later column assignments do not act on a view
    print("Step 3: After removing negative prices")
    print("Shape:", house_data.shape)
    print(house_data.head(), "\n")

Step 3: Number of rows with negative prices: 0
Step 3: After removing negative prices
Shape: (27247, 22)
         Suburb           Address  Rooms Type Method        SellerG      Date  \
1  Airport West     154 Halsey Rd      3    t     PI         Nelson  3/9/2016   
2   Albert Park    105 Kerferd Rd      2    h      S  hockingstuart  3/9/2016   
3   Albert Park  85 Richardson St      2    h      S        Thomson  3/9/2016   
5    Alphington        6 Smith St      4    h      S          Brace  3/9/2016   
6    Alphington   5/6 Yarralea St      3    h      S         Jellis  3/9/2016   

   Distance  Postcode  Bedroom  ...  Landsize  BuildingArea  YearBuilt  \
1      13.5    3042.0      3.0  ...     303.0           225     2016.0   
2       3.3    3206.0      2.0  ...     120.0            82     1900.0   
3       3.3    3206.0      2.0  ...     159.0           inf        NaN   
5       6.4    3078.0      3.0  ...     853.0           263     1930.0   
6       6.4    3078.0      3.0  ...   
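
A boolean filter such as `Price > 0` removes two different kinds of rows: those with a non-positive price and those with a missing price. The sketch below counts the two separately on made-up prices, so the effect of each is visible before filtering.

```python
import numpy as np
import pandas as pd

# Toy prices: two valid, one missing, one negative, one zero.
toy = pd.DataFrame({"Price": [850000.0, np.nan, -1.0, 0.0, 1200000.0]})

# NaN <= 0 is False, so missing prices are not counted here.
non_positive = int((toy["Price"] <= 0).sum())

# Missing prices have to be counted on their own.
missing = int(toy["Price"].isna().sum())

# NaN > 0 is also False, so this filter drops missing prices as well.
kept = toy[toy["Price"] > 0]

print(non_positive, missing, len(kept))
```

Reporting both counts makes it clear how many rows were removed for each reason.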

### 5. Check and Handle Missing Data
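Count the missing values in each column, then fill them: columns with object dtype get their mode (most frequent value), and all other columns get their median. 'BuildingArea' is read as an object column in this dataset, so its missing values are filled with the mode rather than the median.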

In [5]:
missing_counts = house_data.isnull().sum()
print("Step 4: Number of missing values in each column before filling:")
print(missing_counts[missing_counts > 0])
for column in house_data.columns:
    if house_data[column].dtype == 'O':  # 'O' for object
        mode_value = house_data[column].mode()[0]
        house_data[column] = house_data[column].fillna(mode_value)
    else:
        median_value = house_data[column].median()
        house_data[column] = house_data[column].fillna(median_value)
print("Step 4: After handling missing data")
print("Shape:", house_data.shape)
print(house_data.head(), "\n")

Step 4: Number of missing values in each column before filling:
Distance             1
Postcode             1
Bedroom           6441
Bathroom          6447
Car               6824
Landsize          9265
BuildingArea     16577
YearBuilt        15163
CouncilArea          3
Latitude          6254
Longtitude        6254
Propertycount        3
dtype: int64
Step 4: After handling missing data
Shape: (27247, 22)
         Suburb           Address  Rooms Type Method        SellerG      Date  \
1  Airport West     154 Halsey Rd      3    t     PI         Nelson  3/9/2016   
2   Albert Park    105 Kerferd Rd      2    h      S  hockingstuart  3/9/2016   
3   Albert Park  85 Richardson St      2    h      S        Thomson  3/9/2016   
5    Alphington        6 Smith St      4    h      S          Brace  3/9/2016   
6    Alphington   5/6 Yarralea St      3    h      S         Jellis  3/9/2016   

   Distance  Postcode  Bedroom  ...  Landsize  BuildingArea  YearBuilt  \
1      13.5    3042.0      3.0 
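
The same fill logic can be wrapped in a small helper. This is a sketch, not the notebook's code: `fill_missing` is a hypothetical name, the toy values are made up, and the helper adds two assumptions of its own. It works on a copy of the input, and it skips columns that are entirely missing, where no median or mode exists.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_missing(df):
    """Return a copy with NaNs filled by median (numeric) or mode (other)."""
    out = df.copy()
    for col in out.columns:
        if out[col].isna().all():
            # No median or mode can be computed; leave the column as-is.
            continue
        if is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out


toy = pd.DataFrame({
    "Car": [1.0, np.nan, 3.0],
    "CouncilArea": ["Yarra", None, "Yarra"],
})
filled = fill_missing(toy)
print(filled)
```

Working on a copy leaves the input frame unchanged, which makes it easy to compare values before and after filling.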

### 6. Check and Handle Categorical Data
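Find the categorical (object dtype) columns and one-hot encode them. Every object column is encoded, including high-cardinality ones such as 'Address', so the column count grows from 22 to 28246.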

In [6]:
categorical_columns = house_data.select_dtypes(include=['object']).columns
print(f"Step 5: Number of categorical columns: {len(categorical_columns)}")
print("Categorical columns:", list(categorical_columns))
house_data = pd.get_dummies(house_data, columns=categorical_columns, drop_first=True)
print("Step 5: After encoding categorical data")
print("Shape:", house_data.shape)
print(house_data.head(), "\n")

Step 5: Number of categorical columns: 10
Categorical columns: ['Suburb', 'Address', 'Type', 'Method', 'SellerG', 'Date', 'BuildingArea', 'CouncilArea', 'Regionname', 'ParkingArea']
Step 5: After encoding categorical data
Shape: (27247, 28246)
   Rooms  Distance  Postcode  Bedroom  Bathroom  Car  Landsize  YearBuilt  \
1      3      13.5    3042.0      3.0       2.0  1.0     303.0     2016.0   
2      2       3.3    3206.0      2.0       1.0  0.0     120.0     1900.0   
3      2       3.3    3206.0      2.0       1.0  0.0     159.0     1970.0   
5      4       6.4    3078.0      3.0       2.0  4.0     853.0     1930.0   
6      3       6.4    3078.0      3.0       2.0  2.0     208.0     2013.0   

   Latitude  Longtitude  ...  Regionname_Southern Metropolitan  \
1  -37.7180    144.8780  ...                             False   
2  -37.8459    144.9555  ...                              True   
3  -37.8450    144.9538  ...                              True   
5  -37.7707    145.0318  ... 
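
One way to keep the encoded frame smaller is to check how many distinct values each categorical column has before calling `get_dummies`. The sketch below uses made-up data and a hypothetical `max_levels` threshold. Dropping the high-cardinality columns is one possible choice, not what the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    "Address": ["68 Studley St", "154 Halsey Rd", "105 Kerferd Rd", "30 Austin St"],
    "Type": ["h", "t", "h", "h"],
    "Rooms": [2, 3, 2, 3],
})

max_levels = 3  # hypothetical cutoff, chosen only for this example

# Treat every non-numeric column as categorical.
cat_cols = toy.select_dtypes(exclude=["number"]).columns
cardinality = toy[cat_cols].nunique()

low_card = [c for c in cat_cols if cardinality[c] <= max_levels]
high_card = [c for c in cat_cols if cardinality[c] > max_levels]

# Encode only the low-cardinality columns and drop the rest.
encoded = pd.get_dummies(
    toy.drop(columns=high_card), columns=low_card, drop_first=True
)

print(cardinality)
print(list(encoded.columns))
```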