# Project: Improving a Model's Performance Without Changing the Model

## Objective
In this project, your task is to improve the performance of a pre-implemented model without changing the model itself. You will work on cleaning and pre-processing the dataset, extracting meaningful features, performing hyperparameter tuning, and implementing cross-validation.


## Skills You'll Develop
- Data pre-processing (clean-up, normalization, data categorisation, handling missing values)
- Feature engineering and feature impact analysis
- Hyperparameter tuning
- Cross-validation techniques

## Instructions
Throughout the notebook, you'll find empty sections with instructions on what to do. Follow these instructions to enhance the model's performance step by step.

Before starting, run every cell to see the starting performance and enter it in the table below. After you complete each task, rerun all cells and record the new score to track your progress.


### Performance Summary Table
 | Step                                               | Accuracy Score |
 |----------------------------------------------------|--------------------------|
 | Initial Model                                      |                          |
 | After Handling Missing Values                      |                          |
 | After Removing Duplicates                          |                          |
 | After Feature Engineering                          |                          |
 | After Normalization                                |                          |
 | After Hyperparameter Tuning                        |                          |
 | After Hyperparameter Tuning with Cross Validation  |                          |
 | Final Model Evaluation                             |                          |

## Dataset
We are using a property value dataset, which is messy and requires cleaning and preparation before it can be used effectively.

In [None]:
import pandas as pd
df = pd.read_csv('property_data.csv', keep_default_na=False) # Load the dataset
Seed = 42 # Random seed used to ensure reproducibility
best_params = None # Placeholder for best parameters

print(df.head()) # Display the first 5 rows of the dataset

x = df.drop(columns=['SalePrice'])
y = df['SalePrice']

## Step 1: Analyse and visualise the dataset
Before you clean and pre-process the dataset, analyse and visualise it to get a better understanding of what needs to be done.

In [2]:
# Analyze the dataset: look at the kind of data it contains and what changes might need to be made
# Visualize the dataset if need be to understand the data better

## Step 2: Data Cleaning and Pre-processing
The dataset is rather messy, and a lot of the data is in a suboptimal format. In this section, you will clean up the dataset and perform any pre-processing needed.

### Task 2.1: Handle Missing Values
The dataset contains some missing values that need to be addressed. You can either remove rows with missing values or fill them with appropriate values (e.g., mean, median).

At the moment, all missing data and NaNs are replaced with zeros, but this is far from the best solution. Replace the missing data and NaNs with more meaningful values.
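
As one possible starting point, the sketch below fills gaps with the median for a numeric column and the most frequent value for a categorical column. It runs on a small toy frame, not the real dataset, and the column names `LotArea` and `Street` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the property data; column names are hypothetical.
toy = pd.DataFrame({
    "LotArea": ["8450", "", "11250", "9550"],
    "Street": ["Pave", "Pave", "", "Grvl"],
})

# Turn blank strings into real missing values so pandas can detect them.
toy = toy.replace("", np.nan)

# Numeric column: convert to numbers, then fill gaps with the median.
toy["LotArea"] = pd.to_numeric(toy["LotArea"])
toy["LotArea"] = toy["LotArea"].fillna(toy["LotArea"].median())

# Categorical column: fill gaps with the most frequent value.
toy["Street"] = toy["Street"].fillna(toy["Street"].mode()[0])

print(toy)
```

The median is less sensitive to extreme values than the mean, which may matter for skewed columns such as areas or prices. Whether dropping rows or filling them works better depends on how many values are missing.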


In [3]:
# replace this with a better method
for column in x.columns:
    x[column] = x[column].replace('', 0)

### Task 2.2: Remove Duplicates
Identify and remove any duplicate entries in the dataset.
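
A minimal sketch of duplicate removal with pandas, shown on a toy frame with hypothetical values. It removes duplicates before separating features from the target so that the two stay aligned.

```python
import pandas as pd

# Toy frame in which the third row repeats the first; values are made up.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 8450, 11250],
    "SalePrice": [208500, 181500, 208500, 223500],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Keep the first occurrence of each row and rebuild a clean index.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Split into features and target only after removing duplicates.
x_toy = deduped.drop(columns=["SalePrice"])
y_toy = deduped["SalePrice"]
```

In this notebook `x` and `y` are already separate, so if you drop rows from `x` alone, one option is to re-select the target with `y.loc[x.index]` so that both keep the same rows.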

In [4]:
# YOUR CODE HERE

### Task 2.3: Handle NaNs
AI models can't handle NaNs very well, so NaNs need to be dealt with before training.

At the moment all NaNs in numeric features are replaced with zeros, but this is far from the best solution. Replace the NaNs with more meaningful values.
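
The sketch below shows one way to treat the `'NA'` marker in numeric columns: convert with `errors='coerce'` so that `'NA'` becomes a real NaN, then fill with the column median. The toy frame and its column names are hypothetical, and the median is only one of several reasonable choices.

```python
import pandas as pd

# Toy frame; column names and values are hypothetical.
toy = pd.DataFrame({
    "GarageYrBlt": ["2003", "NA", "1976", "2001"],  # numeric with 'NA' gaps
    "Alley": ["NA", "Grvl", "NA", "Pave"],          # genuinely categorical
})

for column in toy.columns:
    # errors='coerce' turns anything non-numeric (including 'NA') into NaN.
    converted = pd.to_numeric(toy[column], errors="coerce")
    was_na = toy[column] == "NA"
    # Treat the column as numeric only if every non-'NA' entry converted.
    if converted[~was_na].notna().all():
        toy[column] = converted.fillna(converted.median())

print(toy)
```

Categorical columns are left untouched here, so their `'NA'` entries can be handled separately, for example as a category of their own.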

In [5]:
# Find a better solution to handle 'NA' values
for column in x.columns:
    non_na = x[column][x[column] != 'NA'] # All values except 'NA'
    try:
        pd.to_numeric(non_na) # Raises ValueError if the column is not numeric
        x[column] = x[column].replace('NA', 0) # Replaces 'NA' with 0
        x[column] = pd.to_numeric(x[column]) # Converts the column to numeric
    except ValueError:
        pass # Non-numeric column: leave it for the categorical step

### Task 2.4: Handle Categorical Data
Some features may contain categorical data and need to be converted to a format that the model can handle.

At the moment they are handled with one-hot encoding, but there are other ways to handle categorical data.
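
One alternative to one-hot encoding is ordinal encoding, which suits categories that have a natural order. The sketch below assumes a hypothetical quality column and an assumed ranking from worst to best; check both against the actual data before using anything like it.

```python
import pandas as pd

# Toy column with ordered quality labels; the name and labels are hypothetical.
toy = pd.DataFrame({"KitchenQual": ["Gd", "TA", "Ex", "Fa", "TA"]})

# Assumed ranking from worst to best.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# An ordered categorical maps each label to its position in the ranking.
ordered = pd.Categorical(toy["KitchenQual"], categories=quality_order, ordered=True)
toy["KitchenQual"] = ordered.codes  # labels outside quality_order become -1

print(toy)
```

Ordinal encoding keeps one column per feature instead of one per category, but it only makes sense when the order is meaningful. For unordered categories such as street names, one-hot encoding is usually the safer choice.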


In [6]:
# Some of the code from the previous cell is repeated here. It converts any necessary columns to numeric
# so that the one-hot encoding used below doesn't categorise numeric data. It's here in case you removed
# or changed the code from the previous cell (which is encouraged)
for column in x.columns:
    try:
        x[column] = pd.to_numeric(x[column])
    except ValueError:
        pass


x = pd.get_dummies(x) # replace this with a better solution

### Task 2.5: Normalize the Data
Features that contain large numbers can negatively affect a model, so normalizing or standardizing them where needed can improve performance.

Determine which columns can and should be normalized and then normalize them.
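
A minimal sketch of standardization with scikit-learn's `StandardScaler`, applied only to a chosen list of columns. The toy frame and the column names are hypothetical; which columns to scale in the real dataset is for you to decide.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame; column names and values are hypothetical.
toy = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
    "HasGarage": [1, 0, 1, 1],  # binary flag, left unscaled
})

# Scale only continuous columns; flags and encoded categories stay as they are.
columns_to_scale = ["LotArea"]

scaler = StandardScaler()  # rescales to zero mean and unit variance
toy[columns_to_scale] = scaler.fit_transform(toy[columns_to_scale])

print(toy)
```

Because decision trees split on thresholds, scaling may change the score very little for this model. It tends to matter more for distance-based or gradient-based models.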

In [7]:
# YOUR CODE HERE

In [None]:
print(x.head()) # Run this to see the first 5 rows of the dataset after any changes made

## Step 3: Feature Engineering
Not all features contribute to model performance. There are two ways to handle this: Feature Selection or Feature Transformation.

Note: Although you will try both approaches, you will only use one of them in the end, so pick whichever one you prefer.

### Task 3.1: Feature Selection
Analyze the dataset and use a feature selection method to decide which features are most relevant for predicting property values.
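
As an illustration, the sketch below uses scikit-learn's `SelectKBest` with the `f_regression` score to keep the two features most linearly related to the target. The data is synthetic and the column names are hypothetical; the target is built so that the `Noise` column carries no signal.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: two informative features and one pure-noise feature.
rng = np.random.default_rng(42)
toy_x = pd.DataFrame({
    "LivingArea": rng.normal(1500, 300, 200),
    "Quality": rng.integers(1, 11, 200).astype(float),
    "Noise": rng.normal(0, 1, 200),
})
toy_y = (100 * toy_x["LivingArea"]
         + 10000 * toy_x["Quality"]
         + rng.normal(0, 5000, 200))

# Score each feature against the target and keep the top k.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(toy_x, toy_y)

selected = toy_x.columns[selector.get_support()].tolist()
print(selected)
```

`f_regression` only captures linear relationships. Other options include mutual information scores or the `feature_importances_` attribute of a fitted tree.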

In [9]:
# YOUR CODE HERE

### Task 3.2: Feature Transformation
Instead of selecting the best features and removing the worst, combine similar features to create new ones.
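
A small sketch of combining related columns into new features. All column names here are hypothetical, and the half-weight given to half bathrooms is an assumption made for illustration.

```python
import pandas as pd

# Toy frame; column names and values are hypothetical.
toy = pd.DataFrame({
    "FirstFloorArea": [856, 1262, 920],
    "SecondFloorArea": [854, 0, 866],
    "FullBath": [2, 2, 2],
    "HalfBath": [1, 0, 1],
})

# Combine the per-floor areas into one total.
toy["TotalFloorArea"] = toy["FirstFloorArea"] + toy["SecondFloorArea"]

# Combine bathroom counts, counting a half bath as 0.5 (an assumption).
toy["TotalBaths"] = toy["FullBath"] + 0.5 * toy["HalfBath"]

# Drop the originals so the combined features replace them.
toy = toy.drop(columns=["FirstFloorArea", "SecondFloorArea", "FullBath", "HalfBath"])

print(toy)
```

Whether to drop the original columns is a judgment call; keeping them gives the model more options but also more redundant features.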

In [10]:
# YOUR CODE HERE

## Step 4: Splitting the Dataset
Split the data into training and testing sets to evaluate the model's performance.

No changes need to be made here.

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=Seed)

## Step 6: Hyperparameter Tuning
A model has many settings that can be fine-tuned to improve it further. These can be tweaked manually, but because of the number of settings and the large number of possible combinations, it is often best to automate the hyperparameter tuning.

### Task 6.1: Simple Tuning
Use a method such as Grid Search or Randomized Search to find the best hyperparameters. It is common practice to further split the training data (perhaps an 80/20 split) into training and validation sets, so that the hyperparameters are evaluated on a separate dataset from the one the model is trained on, which helps avoid overfitting.
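
The sketch below shows a simple grid search over a held-out validation split. It uses synthetic data in place of `X_train` and `y_train`, a small example grid, and the regressor's default R² score to rank candidates; all three are assumptions you would adapt to the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeRegressor

Seed = 42

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=Seed)

# Hold out 20% of the training data for validation.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=Seed)

# Example grid; the values are illustrative, not recommendations.
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}

best_score, best_params = float("-inf"), None
for params in ParameterGrid(param_grid):
    candidate = DecisionTreeRegressor(random_state=Seed, **params)
    candidate.fit(X_fit, y_fit)
    score = candidate.score(X_val, y_val)  # R^2 on the validation split
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

The resulting `best_params` is a plain dictionary, so it can be unpacked into the model with `**best_params`.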

In [12]:
# YOUR CODE HERE
# Name the final parameters 'best_params'

### Task 6.2: Cross-validation Tuning
An alternative form of tuning uses cross-validation. Perform K-fold cross-validation tuning in place of the method you attempted in the cell above.
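
A minimal sketch of K-fold tuning with scikit-learn's `GridSearchCV`. The synthetic data, the example grid, and the choice of five folds are assumptions; by default the search ranks candidates by the regressor's R² score averaged over the folds.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.tree import DecisionTreeRegressor

Seed = 42

# Synthetic stand-in for the training data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=Seed)

# Example grid; the values are illustrative, not recommendations.
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}

# Five shuffled folds: each candidate is trained on four and scored on the fifth.
cv = KFold(n_splits=5, shuffle=True, random_state=Seed)

search = GridSearchCV(DecisionTreeRegressor(random_state=Seed), param_grid, cv=cv)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print(best_params, search.best_score_)
```

Averaging over several folds makes the choice of hyperparameters less dependent on one particular validation split, at the cost of training each candidate several times.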

In [13]:
# YOUR CODE HERE
# Name the final parameters 'best_params'

## Model
Below is an implementation of the decision tree model. It's designed to run on the dataset as initially loaded and on any improvements you make, so you should not need to change the model.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

if best_params is None: # If best_params is not provided, use the default model
    model = DecisionTreeRegressor(random_state=Seed) 
else:
    model = DecisionTreeRegressor(random_state=Seed, **best_params)

model.fit(X_train, y_train) # train the model

# evaluate the model
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions, average='weighted', zero_division=1)
recall = recall_score(y_test, predictions, average='weighted', zero_division=1)

print(f'Accuracy: {accuracy}')
print(f'Precision: {precision}')
print(f'Recall: {recall}')


Display = True # Set to True to display the final tree (this could take a while to load)

if Display:
    plt.figure(figsize=(20,10))
    tree.plot_tree(model, filled=True)
    plt.show()
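
The accuracy, precision, and recall scores above count a prediction as correct only when it matches the true value exactly. As an optional aside, the sketch below computes two regression metrics from scikit-learn, mean absolute error and R², which measure how close the predictions are instead. It uses synthetic data and hypothetical variable names, and it is not part of the model cell.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

Seed = 42

# Synthetic stand-in for the property data.
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0,
                                 random_state=Seed)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=Seed)

demo_model = DecisionTreeRegressor(random_state=Seed).fit(X_tr, y_tr)
demo_predictions = demo_model.predict(X_te)

# Average absolute gap between predicted and true values (lower is better).
mae = mean_absolute_error(y_te, demo_predictions)
# Share of the target's variance explained by the model (1.0 is a perfect fit).
r2 = r2_score(y_te, demo_predictions)

print(f"MAE: {mae:.2f}")
print(f"R^2: {r2:.3f}")
```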

## Summary
Summarize the steps you took to improve the model's performance and reflect on which changes had the most impact.

YOUR SUMMARY HERE

## Extension Task: Implement a different model and compare

Now that you've improved the performance of the Decision Tree model, try implementing a different model to see if you can get better results. You can use any model you like, such as Random Forest, SVM, or Gradient Boosting. Make sure to follow the same steps as above to clean and pre-process the data, tune the hyperparameters, and evaluate the model.

In [15]:
# YOUR NEW MODEL HERE

# Feedback
If you have any feedback about this project at all, feel free to tell us using this form: https://forms.gle/oCWaTdUbmwpjgLxi8