# Practical Exam: House sales

RealAgents is a real estate company that focuses on selling houses.

RealAgents sells a variety of house types in one metropolitan area.

Some houses sell slowly and sometimes require lowering the price in order to find a buyer.

In order to stay competitive, RealAgents would like to optimize the listing prices of the houses it is trying to sell.

They want to do this by predicting the sale price of a house given its characteristics.

If they can predict the sale price in advance, they can decrease the time to sale.


## Data

The dataset contains records of previous houses sold in the area.

| Column Name | Criteria                                                |
|-------------|---------------------------------------------------------|
| house_id    | Nominal. <br> Unique identifier for houses. <br>Missing values not possible. |
| city        | Nominal. <br>The city in which the house is located. One of 'Silvertown', 'Riverford', 'Teasdale' and 'Poppleton'. <br>Replace missing values with "Unknown". |
| sale_price  | Discrete. <br>The sale price of the house in whole dollars. Values can be any positive number greater than or equal to zero.<br>Remove missing entries. |
| sale_date   | Discrete. <br>The date of the last sale of the house. <br>Replace missing values with 2023-01-01. |
| months_listed  | Continuous. <br>The number of months the house was listed on the market prior to its last sale, rounded to one decimal place. <br>Replace missing values with mean number of months listed, to one decimal place. |
| bedrooms    | Discrete. <br>The number of bedrooms in the house. Any positive values greater than or equal to zero. <br>Replace missing values with the mean number of bedrooms, rounded to the nearest integer. |
| house_type   | Ordinal. <br>One of "Terraced" (two shared walls), "Semi-detached" (one shared wall), or "Detached" (no shared walls). <br>Replace missing values with the most common house type. |
| area      | Continuous. <br>The area of the house in square meters, rounded to one decimal place. <br>Replace missing values with the mean, to one decimal place. |


Review all instructions before starting:

- Use Python to perform each of the tasks.

- Write your solutions in the code cell provided.

- The object you have been asked to create will be graded, not the code.

- Ensure you match any column name or object requirements.

- You must be successful in all tasks to pass this exam.


- The fit of your models will be compared to held back values from the test set provided to you. We will calculate the Root Mean Squared Error of your predictions.

- At least one of your two models must have a Root Mean Squared Error below 30,000 to pass.
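
For reference, Root Mean Squared Error is the square root of the mean squared difference between actual and predicted prices. A minimal sketch in plain Python follows; the function name `rmse` and the sample prices are illustrative assumptions, not the exam's grading code.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices, for illustration only
actual_prices = [100_000, 200_000, 300_000]
predicted_prices = [110_000, 190_000, 330_000]
example_rmse = rmse(actual_prices, predicted_prices)
print(round(example_rmse, 2))
```

Because the errors are squared before averaging, a single large miss (the 30,000 error above) weighs more heavily than several small ones.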

Tests to pass:

- All required data has been created and has the required columns
- Task 1: Identify and replace missing values.
- Task 2: Identify and replace missing values.
- Task 2: Clean categorical and text data by manipulating strings.
- Task 2: Convert values between data types.
- Task 3: Aggregate numeric, categorical variables and dates by groups.
- Task 4 & 5: Implement standard modeling approaches for supervised learning problems.

# Task 1

The team at RealAgents knows that the city that a property is located in makes a difference to the sale price. 

Unfortunately, they believe that this isn't always recorded in the data.

Calculate the number of missing values in the `city` column.

 - You should use the data in the file "house_sales.csv". 

 - Your output should be an object `missing_city`, that contains the number of missing values in this column. 

In [7]:
# Use this cell to write your code for Task 1

# Task 1: Count missing city values
import pandas as pd
import numpy as np

# Load the data
house_sales_df = pd.read_csv('house_sales.csv')

# Count missing values including '--' entries
missing_city = house_sales_df['city'].isna().sum() + (house_sales_df['city'] == '--').sum()
print(f"Missing values in city column is: {missing_city}")

Missing values in city column is: 16
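
The count above treats `'--'` as a missing-value placeholder in addition to real `NaN` entries. One way such a placeholder might be discovered is by listing every distinct value in the column. The sketch below uses a small hypothetical sample in place of `house_sales.csv`; the sample values and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the `city` column of house_sales.csv
sample = pd.DataFrame({
    "city": ["Silvertown", "--", np.nan, "Riverford", "--", "Teasdale"]
})

# value_counts(dropna=False) lists every distinct entry, including NaN,
# which makes placeholder tokens such as '--' easy to spot
city_counts = sample["city"].value_counts(dropna=False)
print(city_counts)

# Treat both real NaN values and the '--' placeholder as missing
is_missing = sample["city"].isna() | sample["city"].eq("--")
missing_in_sample = int(is_missing.sum())
print(missing_in_sample)
```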


# Task 2 

Before you fit any models, you will need to make sure the data is clean. 

The table below shows what the data should look like. 

Create a cleaned version of the dataframe. 

 - You should start with the data in the file "house_sales.csv". 

 - Your output should be a dataframe named `clean_data`. 

 - All column names and values should match the table below.


| Column Name | Criteria                                                |
|-------------|---------------------------------------------------------|
| house_id    | Nominal. <br> Unique identifier for houses. <br>Missing values not possible. |
| city        | Nominal. <br>The city in which the house is located. One of 'Silvertown', 'Riverford', 'Teasdale' and 'Poppleton'. <br>Replace missing values with "Unknown". |
| sale_price  | Discrete. <br>The sale price of the house in whole dollars. Values can be any positive number greater than or equal to zero.<br>Remove missing entries. |
| sale_date   | Discrete. <br>The date of the last sale of the house. <br>Replace missing values with 2023-01-01. |
| months_listed  | Continuous. <br>The number of months the house was listed on the market prior to its last sale, rounded to one decimal place. <br>Replace missing values with mean number of months listed, to one decimal place. |
| bedrooms    | Discrete. <br>The number of bedrooms in the house. Any positive values greater than or equal to zero. <br>Replace missing values with the mean number of bedrooms, rounded to the nearest integer. |
| house_type   | Ordinal. <br>One of "Terraced", "Semi-detached", or "Detached". <br>Replace missing values with the most common house type. |
| area      | Continuous. <br>The area of the house in square meters, rounded to one decimal place. <br>Replace missing values with the mean, to one decimal place. |

In [18]:
# Use this cell to write your code for Task 2

# Task 2: Clean the data
def clean_house_sales_data(df):
    """
    Clean house sales data according to specifications
    """
    clean_df = df.copy()
    
    # Clean city: replace '--' with NaN, then fill with 'Unknown'
    clean_df['city'] = clean_df['city'].replace('--', np.nan).fillna('Unknown')
    
    # Remove rows with missing sale_price
    clean_df = clean_df.dropna(subset=['sale_price'])
    
    # Handle sale_date: convert to datetime, fill missing with 2023-01-01
    clean_df['sale_date'] = pd.to_datetime(clean_df['sale_date'], errors='coerce')
    clean_df['sale_date'] = clean_df['sale_date'].fillna(pd.Timestamp('2023-01-01'))
    
    # Clean months_listed: convert to numeric, fill missing with mean rounded to 1 decimal
    clean_df['months_listed'] = pd.to_numeric(clean_df['months_listed'], errors='coerce')
    mean_months = round(clean_df['months_listed'].mean(), 1)
    clean_df['months_listed'] = clean_df['months_listed'].fillna(mean_months)
    
    # Clean bedrooms: convert to numeric, fill missing with mean rounded to nearest integer
    clean_df['bedrooms'] = pd.to_numeric(clean_df['bedrooms'], errors='coerce')
    mean_bedrooms = round(clean_df['bedrooms'].mean())
    clean_df['bedrooms'] = clean_df['bedrooms'].fillna(mean_bedrooms).astype(int)
    
    # Clean house_type: standardize abbreviations, fill missing with most common type
    clean_df['house_type'] = clean_df['house_type'].replace({
        'Det.': 'Detached',
        'Semi': 'Semi-detached',
        'Terr.': 'Terraced'  # assumed abbreviation; harmless if it never occurs
    })
    most_common_type = clean_df['house_type'].mode()[0]
    clean_df['house_type'] = clean_df['house_type'].fillna(most_common_type)
    
    # Clean area: remove 'sq.m.' suffix, convert to numeric, fill missing with mean
    clean_df['area'] = clean_df['area'].astype(str).str.replace(' sq.m.', '', regex=False)
    clean_df['area'] = pd.to_numeric(clean_df['area'], errors='coerce')
    mean_area = round(clean_df['area'].mean(), 1)
    clean_df['area'] = clean_df['area'].fillna(mean_area)
    
    return clean_df

# create cleaned dataframe
clean_data = clean_house_sales_data(house_sales_df)

# Print the cleaned data and run a quick check for remaining missing values
print("Print to check missing values:\n - data to be fed to the model should have no missing values!\n")
print(clean_data.isna().sum())
display(clean_data.head())

Print to check missing values:
 - data to be fed to the model should have no missing values!

house_id         0
city             0
sale_price       0
sale_date        0
months_listed    0
bedrooms         0
house_type       0
area             0
dtype: int64


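|   | house_id | city | sale_price | sale_date | months_listed | bedrooms | house_type | area |
|---|----------|------|------------|-----------|---------------|----------|------------|------|
| 0 | 1217792 | Silvertown | 55943 | 2021-09-12 | 5.4 | 2 | Semi-detached | 107.8 |
| 1 | 1900913 | Silvertown | 384677 | 2021-01-17 | 6.3 | 5 | Detached | 498.8 |
| 2 | 1174927 | Riverford | 281707 | 2021-11-10 | 6.9 | 6 | Detached | 542.5 |
| 3 | 1773666 | Silvertown | 373251 | 2020-04-13 | 6.1 | 6 | Detached | 528.4 |
| 4 | 1258487 | Silvertown | 328885 | 2020-09-24 | 8.7 | 5 | Detached | 477.1 |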
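
A missing-value count alone does not confirm that categorical values match the criteria table. As a rough sketch, the criteria could also be checked programmatically. The helper `check_clean_data`, the constant names, and the two sample rows below are hypothetical and shown only for illustration; they are not part of the exam's grader.

```python
import pandas as pd

VALID_CITIES = {"Silvertown", "Riverford", "Teasdale", "Poppleton", "Unknown"}
VALID_HOUSE_TYPES = {"Terraced", "Semi-detached", "Detached"}

def check_clean_data(df):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values remain")
    if not df["house_id"].is_unique:
        problems.append("house_id is not unique")
    if not set(df["city"]).issubset(VALID_CITIES):
        problems.append("unexpected city value")
    if not set(df["house_type"]).issubset(VALID_HOUSE_TYPES):
        problems.append("unexpected house_type value")
    if (df["sale_price"] < 0).any() or (df["bedrooms"] < 0).any():
        problems.append("negative sale_price or bedrooms")
    return problems

# Hypothetical rows: the second one still carries an unexpanded abbreviation
sample = pd.DataFrame({
    "house_id": [1, 2],
    "city": ["Silvertown", "Unknown"],
    "sale_price": [250000, 180000],
    "bedrooms": [3, 2],
    "house_type": ["Detached", "Semi"],
})
issues = check_clean_data(sample)
print(issues)
```

Running the same helper on `clean_data` would flag any house type abbreviation that the cleaning step did not map to its full name.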


# Task 3 

The team at RealAgents have told you that they have always believed that the number of bedrooms is the biggest driver of house price. 

Produce a table showing the average sale price by number of bedrooms, along with the variance, to investigate this question for the team.

 - You should start with the data in the file 'house_sales.csv'.

 - Your output should be a data frame named `price_by_rooms`. 

 - It should include the three columns `bedrooms`, `avg_price`, `var_price`. 

 - Your answers should be rounded to 1 decimal place.   

In [9]:
# Use this cell to write your code for Task 3
# Task 3: Analyze price by bedrooms
def analyze_price_by_bedrooms(df):
    """
    Calculate average price and variance by number of bedrooms
    """
    analysis_df = df.copy()
    
    # Ensure bedrooms is numeric
    analysis_df['bedrooms'] = pd.to_numeric(analysis_df['bedrooms'], errors='coerce')
    
    # Group by bedrooms and calculate statistics
    price_stats = analysis_df.groupby('bedrooms')['sale_price'].agg(
        avg_price='mean',
        var_price='var'
    ).reset_index()
    
    # Round to 1 decimal place
    price_stats['avg_price'] = price_stats['avg_price'].round(1)
    price_stats['var_price'] = price_stats['var_price'].round(1)
    
    return price_stats

# Calculate price statistics by bedrooms
price_by_rooms = analyze_price_by_bedrooms(clean_data)
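
Note that the pandas `var` aggregation computes the sample variance (`ddof=1`) by default, not the population variance. The small example below, built on hypothetical prices, shows the shape of the resulting table and makes the variance convention easy to verify by hand.

```python
import pandas as pd

# Hypothetical data, for illustration only
sample = pd.DataFrame({
    "bedrooms": [2, 2, 3, 3, 3],
    "sale_price": [100000, 120000, 200000, 220000, 240000],
})

summary = (
    sample.groupby("bedrooms")["sale_price"]
    .agg(avg_price="mean", var_price="var")  # 'var' uses ddof=1 by default
    .round(1)
    .reset_index()
)
print(summary)
```

For the two-bedroom group the mean is 110000 and the squared deviations sum to 200000000; dividing by n - 1 = 1 gives a variance of 200000000.0, whereas a population variance would be half that.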

# Task 4

Fit a baseline model to predict the sale price of a house.

 1. Fit your model using the data contained in "train.csv".

 2. Use "validation.csv" to predict new values based on your model. You must return a dataframe named `base_result`, that includes `house_id` and `price`. The price column must be your predicted values.

In [None]:
# Use this cell to write your code for Task 4

# Task 4: Baseline model
# Load training and validation data
train_df = pd.read_csv('train.csv')
validation_df = pd.read_csv('validation.csv')

def prepare_baseline_model(train_data, validation_data):
    """
    Simple baseline model using average sale price
    """
    # Calculate average price from training data
    avg_price = train_data['sale_price'].mean()
    
    # Create predictions dataframe
    predictions = pd.DataFrame({
        'house_id': validation_data['house_id'],
        'price': avg_price
    })
    
    return predictions

# Generate baseline predictions
base_result = prepare_baseline_model(train_df, validation_df)
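
A mean-only baseline predicts the same price for every house, so its RMSE on the training data equals the population standard deviation of the training prices. The sketch below illustrates this with hypothetical prices; comparing that spread with the 30,000 threshold gives a quick sense of whether such a baseline could pass on its own.

```python
import numpy as np

# Hypothetical training prices, for illustration only
train_prices = np.array([150_000.0, 200_000.0, 250_000.0, 400_000.0])

# The mean-only baseline predicts this single value for every house
baseline_prediction = train_prices.mean()

# RMSE of the constant prediction on the training data
baseline_rmse = np.sqrt(np.mean((train_prices - baseline_prediction) ** 2))

# The same number is the population standard deviation (ddof=0)
price_std = train_prices.std(ddof=0)
print(baseline_prediction, round(baseline_rmse, 2), round(price_std, 2))
```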

# Task 5

Fit a comparison model to predict the sale price of a house.

 1. Fit your model using the data contained in "train.csv".

 2. Use "validation.csv" to predict new values based on your model. You must return a dataframe named `compare_result`, that includes `house_id` and `price`. The price column must be your predicted values.

In [None]:
# Use this cell to write your code for Task 5

# Task 5: Comparison model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

def prepare_comparison_model(train_data, validation_data):
    """
    More sophisticated model using linear regression with feature engineering
    """
    # Define feature columns
    numeric_features = ['bedrooms', 'area', 'months_listed']
    categorical_features = ['city', 'house_type']
    
    # Create preprocessing pipeline
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', 'passthrough', numeric_features),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
        ])
    
    # Create model pipeline
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])
    
    # Prepare training data
    X_train = train_data[numeric_features + categorical_features]
    y_train = train_data['sale_price']
    
    # Fit model
    model.fit(X_train, y_train)
    
    # Prepare validation data
    X_val = validation_data[numeric_features + categorical_features]
    
    # Make predictions
    predictions = model.predict(X_val)
    
    # Create result dataframe
    result_df = pd.DataFrame({
        'house_id': validation_data['house_id'],
        'price': predictions
    })
    
    return result_df

# Generate comparison model predictions
compare_result = prepare_comparison_model(train_df, validation_df)
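
If `validation.csv` does not include `sale_price` (an assumption here), the RMSE of either model cannot be checked against it directly. One possible approach is to hold out part of the training data and score both models on it. The sketch below uses synthetic data in place of `train.csv`; the column values, the split ratio, and the variable names are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for train.csv (hypothetical values, illustration only)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "area": rng.uniform(50, 500, n).round(1),
    "bedrooms": rng.integers(1, 7, n),
    "house_type": rng.choice(["Terraced", "Semi-detached", "Detached"], n),
})
demo["sale_price"] = (
    500 * demo["area"] + 10_000 * demo["bedrooms"] + rng.normal(0, 5_000, n)
)

# Hold out 25% of the rows to estimate out-of-sample error
features = ["area", "bedrooms", "house_type"]
X_fit, X_hold, y_fit, y_hold = train_test_split(
    demo[features], demo["sale_price"], test_size=0.25, random_state=0
)

model = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(transformers=[
        ("num", "passthrough", ["area", "bedrooms"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["house_type"]),
    ])),
    ("regressor", LinearRegression()),
])
model.fit(X_fit, y_fit)

# RMSE on the held-out rows for the regression and for a mean-only baseline
model_rmse = np.sqrt(mean_squared_error(y_hold, model.predict(X_hold)))
baseline_rmse = np.sqrt(np.mean((y_hold - y_fit.mean()) ** 2))
print(round(model_rmse), round(baseline_rmse))
```

The same pattern could be applied to `train_df` to estimate whether either model is likely to meet the 30,000 threshold before submitting predictions.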