# Data Analysis

## Part 1: Data Pre-Processing

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

### 1.1 Data Loading
Let's now load the data from the saved .csv files and store each year's data in a separate dataframe.

In [18]:
# File path to 2021 sales data csv
file_path_2021 = "data/2021_property_sales_data.csv"
# Load the 2021 sales data into its own dataframe
df_2021 = pd.read_csv(file_path_2021)
#df_2021.head()

In [19]:
# File path to 2022 sales data csv
file_path_2022 = "data/2022_property_sales_data.csv"
# Load the 2022 sales data into its own dataframe
df_2022 = pd.read_csv(file_path_2022)
#df_2022.head()

In [20]:
# File path to 2023 sales data csv
file_path_2023 = "data/2023_property_sales_data.csv"
# Load the 2023 sales data into its own dataframe
df_2023 = pd.read_csv(file_path_2023)
#df_2023.head()

In [21]:
# File path to 2024 sales data csv
file_path_2024 = "data/2024_property_sales_data.csv"
# Load the 2024 sales data into its own dataframe
df_2024 = pd.read_csv(file_path_2024)
#df_2024.head()
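
The four loading cells above follow the same pattern, so they could be collapsed into a loop. The following is a minimal sketch rather than the notebook's actual approach: the function name `load_sales_data`, its `data_dir` parameter and the dictionary keyed by year are assumptions introduced for illustration. Only the file naming pattern is taken from the paths above.

```python
from pathlib import Path

import pandas as pd


def load_sales_data(years, data_dir="data"):
    """Load one CSV per year into a dict of dataframes keyed by year.

    Hypothetical helper: assumes every file is named
    "<year>_property_sales_data.csv" inside data_dir.
    """
    frames = {}
    for year in years:
        file_path = Path(data_dir) / f"{year}_property_sales_data.csv"
        frames[year] = pd.read_csv(file_path)
    return frames
```

Under these assumptions, `load_sales_data(range(2021, 2025))[2021]` would hold the same data as `df_2021`.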

### 1.2 Data Format
Let's now check whether the values in each column share the same format by using frequency tables.
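
A frequency table of a column can be built with `value_counts`. The sketch below is illustrative only: the helper name `frequency_table` and the small example dataframe are assumptions, not part of the original notebook.

```python
import pandas as pd


def frequency_table(frame, column):
    # Count each distinct value; dropna=False keeps missing values as their own row
    return frame[column].value_counts(dropna=False)


# Made-up example data standing in for one of the yearly dataframes
example = pd.DataFrame({"Garden": ["Yes", "No", "???", "Yes", None]})
table = frequency_table(example, "Garden")
print(table)
```

Any unexpected spellings or placeholder values in a column show up as extra rows in the table.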

#### Sale Date
Let's convert the dates stored as text into a numeric format for the four dataframes,
e.g. 15 January 2021 -> 2021-01-15

In [41]:
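# Helper function to convert a date such as "15 January 2021" into "2021-01-15".
# Dates already in yyyy-mm-dd format are returned unchanged; anything that
# cannot be parsed (including missing, non-string values) becomes None.
def convert_date(date_str):
    try:
        # Already in the target format: keep as-is
        datetime.strptime(date_str, "%Y-%m-%d")
        return date_str
    except (ValueError, TypeError):
        pass
    try:
        # Parse "day month-name year" and re-emit as yyyy-mm-dd
        date_obj = datetime.strptime(date_str, "%d %B %Y")
        return date_obj.strftime("%Y-%m-%d")
    except (ValueError, TypeError):
        return None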

Let's clean the date values so that they follow the format "yyyy-mm-dd".

In [42]:
# Remove the text "Sold" from each entry in the "Sale Date" column, then trim any leftover whitespace.
df_2021["Sale Date"] = df_2021["Sale Date"].str.strip().str.replace("Sold", "",regex=False).str.strip()
df_2022["Sale Date"] = df_2022["Sale Date"].str.strip().str.replace("Sold", "",regex=False).str.strip()
df_2023["Sale Date"] = df_2023["Sale Date"].str.strip().str.replace("Sold", "",regex=False).str.strip()
df_2024["Sale Date"] = df_2024["Sale Date"].str.strip().str.replace("Sold", "",regex=False).str.strip()
print("All values have been stripped of the prefix sold")

All values have been stripped of the prefix sold


Now apply the helper function to every entry in the "Sale Date" column of each dataframe.

In [44]:
df_2021["Sale Date"] = df_2021["Sale Date"].apply(convert_date)
df_2022["Sale Date"] = df_2022["Sale Date"].apply(convert_date)
df_2023["Sale Date"] = df_2023["Sale Date"].apply(convert_date)
df_2024["Sale Date"] = df_2024["Sale Date"].apply(convert_date)
print("All values have now been converted to the format yyyy-mm-dd ")

All values have now been converted to the format yyyy-mm-dd 
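
As an alternative to applying a Python function row by row, the same conversion can be sketched with pandas' vectorised `pd.to_datetime`. This is an assumption-laden sketch, not the method used above: the function name `convert_dates_vectorized` and the sample values are hypothetical.

```python
import pandas as pd


def convert_dates_vectorized(series):
    # Parse each supported format separately; unparseable entries become NaT
    iso = pd.to_datetime(series, format="%Y-%m-%d", errors="coerce")
    long_form = pd.to_datetime(series, format="%d %B %Y", errors="coerce")
    # Prefer the ISO parse, fall back to the "15 January 2021" parse
    combined = iso.fillna(long_form)
    # Emit yyyy-mm-dd strings; entries that matched neither format stay missing
    return combined.dt.strftime("%Y-%m-%d")


# Made-up sample covering both formats and one invalid entry
sample = pd.Series(["15 January 2021", "2021-01-15", "not a date"])
converted = convert_dates_vectorized(sample)
```

Unlike `convert_date`, which returns `None` for invalid entries, this version leaves them as missing values (NaN).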


#### Sale Price
This data appears in the following formats:
 - €709,606.00
 - € 553,235

Let's process this data so that an input of "€500,000.00" produces the output "500000.00".

In [48]:
# Helper function to remove the currency sign and thousands separators, then
# format the value as a string with exactly 2 decimal places.

def clean_sale_prices(value):
    # Missing values stay missing rather than becoming the string "nan"
    if pd.isna(value):
        return None

    # Remove the currency sign, thousands separators and surrounding whitespace
    value = str(value).replace("€", "").replace(",", "").strip()

    try:
        # Try to convert the string to a float and round to 2 decimal places
        rounded = round(float(value), 2)
        # Return a string with 2 decimal places
        return f"{rounded:.2f}"
    except ValueError:
        return None

Apply the helper function to all entries in the column.

In [49]:
df_2021["Sale Price"] = df_2021["Sale Price"].apply(clean_sale_prices)
df_2022["Sale Price"] = df_2022["Sale Price"].apply(clean_sale_prices)
df_2023["Sale Price"] = df_2023["Sale Price"].apply(clean_sale_prices)
df_2024["Sale Price"] = df_2024["Sale Price"].apply(clean_sale_prices)

Let's double-check the output of the cleaned column.
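
One possible way to carry out this check is to confirm that every non-missing entry consists of digits followed by exactly two decimal places. The helper name `all_prices_well_formed` and the sample series below are assumptions made for illustration.

```python
import pandas as pd


def all_prices_well_formed(series):
    # Ignore missing values, then require the pattern "digits.two-digits"
    non_null = series.dropna().astype(str)
    return bool(non_null.str.fullmatch(r"\d+\.\d{2}").all())


# Made-up samples: one cleaned column and one that still has a currency sign
cleaned_ok = all_prices_well_formed(pd.Series(["709606.00", "553235.00", None]))
cleaned_bad = all_prices_well_formed(pd.Series(["709606.00", "€500,000.00"]))
```

Running the same check on `df_2021["Sale Price"]` and the other three dataframes would flag any entry that slipped through the cleaning step.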

#### Location
All entries are valid strings, so no pre-processing steps need to be applied to this column.

#### Year Built
This data is currently stored in the following formats:
 - Unknown
 - c1999 (with a prefix character)
 - 1999c (with a suffix character)

Let's process these entries in each dataframe so that every entry is either a valid year or NaN.

In [61]:
# Extract the first run of 4 consecutive digits from anywhere in the string (NaN if there is none),
# and write the result back into the column.
df_2021["Year Built"] = df_2021["Year Built"].str.extract(r"(\d{4})")
df_2022["Year Built"] = df_2022["Year Built"].str.extract(r"(\d{4})")
df_2023["Year Built"] = df_2023["Year Built"].str.extract(r"(\d{4})")
df_2024["Year Built"] = df_2024["Year Built"].str.extract(r"(\d{4})")
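
To illustrate what the extraction does to each of the formats listed above, here is a small sketch on made-up values. The conversion to the nullable `Int64` dtype is an optional extra step that the notebook does not perform, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up sample covering the three formats plus a missing value
sample = pd.Series(["Unknown", "c1999", "1999c", None])

# expand=False returns a Series instead of a one-column DataFrame
years = sample.str.extract(r"(\d{4})", expand=False)

# Optional: store the years as nullable integers rather than strings
years_int = pd.to_numeric(years).astype("Int64")
```

Entries with no four-digit run, such as "Unknown", end up as missing values in both `years` and `years_int`.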

#### Garden
Entries in the Garden column are of the following formats:
 - Yes
 - No
 - ???
   
Let's map the "???" entries to "Unknown".

In [65]:
# Dictionary to map "???" entry to "Unknown"
map1 = {"???":"Unknown"}

In [71]:
# For each entry in the "Garden" column in the 4 dataframes "???" entries are replaced with "Unknown"
df_2021["Garden"] = df_2021["Garden"].replace(map1)
df_2022["Garden"] = df_2022["Garden"].replace(map1)
df_2023["Garden"] = df_2023["Garden"].replace(map1)
df_2024["Garden"] = df_2024["Garden"].replace(map1)

#### Garage
Entries in the Garage column are of the following formats:
 - Yes
 - No
 - ???
   
Let's map the "???" entries to "Unknown", as was done for the Garden column.

In [76]:
# Since the entries take the same values as the "Garden" column, we can reuse the same mapping
# dictionary as before for the four dataframes.
df_2021["Garage"] = df_2021["Garage"].replace(map1)
df_2022["Garage"] = df_2022["Garage"].replace(map1)
df_2023["Garage"] = df_2023["Garage"].replace(map1)
df_2024["Garage"] = df_2024["Garage"].replace(map1)
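
Because the Garden and Garage columns receive identical treatment in all four dataframes, the replacement could also be written as a loop. The sketch below is one possible way to do this: the helper name `replace_placeholders` and the example dataframes are assumptions, and `unknown_map` simply mirrors `map1`.

```python
import pandas as pd

unknown_map = {"???": "Unknown"}  # same mapping as map1 above


def replace_placeholders(frames, columns, mapping):
    # Apply the value mapping to each listed column of each dataframe, in place
    for frame in frames:
        for column in columns:
            frame[column] = frame[column].replace(mapping)


# Made-up stand-ins for the yearly dataframes
example_frames = [
    pd.DataFrame({"Garden": ["Yes", "???"], "Garage": ["???", "No"]}),
    pd.DataFrame({"Garden": ["No", "No"], "Garage": ["Yes", "???"]}),
]
replace_placeholders(example_frames, ["Garden", "Garage"], unknown_map)
```

In the notebook, the equivalent call would pass `[df_2021, df_2022, df_2023, df_2024]` as `frames`.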

#### Type
Entries in the Type column are of the following formats:
 - Detached
 - Bungalow
 - Semi-Detached
 - Duplex
 - End-of-Terrace
 - Terraced
 - Semi-D
 - Det.
   
As you can see, "Semi-Detached" and "Semi-D" should be counted together. The same logic applies to "Detached" and "Det.".

Let's now use a mapping dictionary to give semi-detached and detached houses a single, consistent entry format.

In [87]:
# Define a mapping dictionary to universally format entries for detached and semi-detached houses.
house_type_map = {"Det.":"Detached", "Semi-D":"Semi-Detached"}

In [88]:
# Strip any leading or trailing whitespace from the entries across the 4 dataframes.
df_2021["Type"] = df_2021["Type"].str.strip()
df_2022["Type"] = df_2022["Type"].str.strip()
df_2023["Type"] = df_2023["Type"].str.strip()
df_2024["Type"] = df_2024["Type"].str.strip()

In [90]:
# Apply the mapping dictionary to universally format the incorrect entries.
df_2021["Type"] = df_2021["Type"].replace(house_type_map)
df_2022["Type"] = df_2022["Type"].replace(house_type_map)
df_2023["Type"] = df_2023["Type"].replace(house_type_map)
df_2024["Type"] = df_2024["Type"].replace(house_type_map)
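
The stripping step matters because `replace` with a dictionary matches whole values exactly, so an entry with stray whitespace would not be mapped. The following sketch shows this on made-up values. The sample series is an assumption, and whether the real data contains such whitespace is not stated in the notebook.

```python
import pandas as pd

house_type_map = {"Det.": "Detached", "Semi-D": "Semi-Detached"}

# Made-up sample with stray whitespace around two abbreviated entries
sample = pd.Series(["Detached", " Det.", "Semi-D ", "Semi-Detached", "Bungalow"])

# Without stripping, " Det." and "Semi-D " do not match the dictionary keys
without_strip = sample.replace(house_type_map)

# After stripping, both abbreviations are mapped to their full names
with_strip = sample.str.strip().replace(house_type_map)
counts = with_strip.value_counts()
```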

#### Style
Entries in the Style column are of the following formats:
 - 1 Storey
 - 1.5 Storey
 - 2 Storey

These are all valid entries, so no processing steps are needed.

#### Bedrooms
All entries in the Bedrooms column are valid integer values.

No data cleaning needs to be done on this column.

#### Bathrooms
All entries in the Bathrooms column are valid integer values.

No data cleaning needs to be done on this column.

#### First Time Buyer
Entries in the First Time Buyer column are of the following formats:
 - Yes
 - No
 - NO
 - YES

Let's now convert all entries to lowercase to create a single, consistent entry format.

In [119]:
# Use the .str.lower() function to convert all characters to lowercase
# Then use the str.strip() function to remove all prefix and suffix whitespace.
df_2021["First Time Buyer"] = df_2021["First Time Buyer"].str.lower().str.strip()
df_2022["First Time Buyer"] = df_2022["First Time Buyer"].str.lower().str.strip()
df_2023["First Time Buyer"] = df_2023["First Time Buyer"].str.lower().str.strip()
df_2024["First Time Buyer"] = df_2024["First Time Buyer"].str.lower().str.strip()
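
A quick way to confirm that the normalisation worked is to check that only "yes" and "no" remain. The sketch below uses made-up values, and the variable names are hypothetical.

```python
import pandas as pd

# Made-up sample covering the four spellings, stray whitespace and a missing value
sample = pd.Series(["Yes", "No", "NO", "YES ", None])

# Same normalisation as above: lowercase, then trim whitespace
normalised = sample.str.lower().str.strip()

# Any value other than "yes" or "no" would indicate an entry that needs attention
unexpected = set(normalised.dropna().unique()) - {"yes", "no"}
```

An empty `unexpected` set means every non-missing entry is now either "yes" or "no".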