# California House Price Prediction Challenge

**Regression challenge**

Your challenge is to develop a machine learning model for predicting house prices in California using features such as the number of rooms and the age of the house.

This is a great opportunity to experiment with and learn about a number of core concepts in machine learning, using [pandas](https://pandas.pydata.org/), [seaborn](https://seaborn.pydata.org/) and [scikit-learn](https://scikit-learn.org/stable/index.html).

This Jupyter notebook will guide you through the general stages involved in machine learning projects &ndash; including data visualisation, data preprocessing, model selection, model training and model evaluation &ndash; in the context of this challenge. Afterwards, you will be able to submit your test set predictions for evaluation on the [DOXA AI](https://doxaai.com/competition/california-housing) platform.

**Before you get started, make sure to [sign up for an account](https://doxaai.com/sign-up) if you do not already have one and [enrol to take part](https://doxaai.com/competition/california-housing) in the challenge.**

**If you have any questions, feel free to ask them in the [DOXA Community Discord server](https://discord.gg/MUvbQ3UYcf).**


## The machine learning workflow


## Installing and importing useful packages

To get started, we will install and import a few packages we will need.


In [None]:
%pip install numpy pandas matplotlib seaborn scikit-learn
%pip install -U doxa-cli

In [None]:
import os

import pandas as pd
import seaborn as sns


pd.set_option("display.max_colwidth", None)

## Loading the data

The data for this challenge was originally drawn from the [1990 US census](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html).

We can get started by downloading the data if we do not have it already.

- `data/train.csv` is a CSV (comma-separated values) file containing the full training dataset (including both the features and the target `median_house_value` variable)
- `data/test.csv` is a CSV file containing the test set for which your final model will be used to make predictions


In [None]:
# Download the dataset if we do not already have it
if not os.path.exists("data"):
    os.makedirs("data")

    !curl https://raw.githubusercontent.com/DoxaAI/educational-challenges/main/california-housing/data/train.csv --output data/train.csv
    !curl https://raw.githubusercontent.com/DoxaAI/educational-challenges/main/california-housing/data/test.csv --output data/test.csv

# Load the data
train_df = pd.read_csv("data/train.csv")
test_df = pd.read_csv("data/test.csv")

## Exploring the data

Before we dive into training machine learning models, it is incredibly important to first take the time to properly explore the data available and understand its characteristics. As part of this, we will be looking to identify the following: (i) whether the dataset contains any missing, corrupted or otherwise invalid values; (ii) whether there may be any large outliers or anomalies in the dataset (e.g. typos); (iii) how the different feature variables are distributed; and (iv) what relationships there are between the different data variables.

This will guide us as to what we should do next and may even give us clues as to which data preprocessing techniques and model types we may wish to use. After all, if the quality of a dataset is low, you may have to spend some time (or a lot of time 😅) [cleaning](https://en.wikipedia.org/wiki/Data_cleansing) it before it can be used.

In this notebook, we will use some simple statistical methods and common visualisation techniques to explore the data we have.


### The training dataset

Let's get started by taking a look at the training dataset, which contains the following data variables:

- `median_income`: the median income in the block group in thousands of dollars
- `house_age`: the median house age in the block group
- `mean_rooms`: the mean number of rooms per household
- `mean_bedrooms`: the mean number of bedrooms per household
- `population`: the block group population
- `mean_household_size`: the mean household size in the block group
- `latitude`: the latitude of the block group
- `longitude`: the longitude of the block group
- `median_house_value`: the median house value in thousands of dollars


In [None]:
# Let's take a look at the first ten rows
train_df.head(10)

So far, so good &ndash; nothing looks particularly outlandish, so we can continue!

As part of our analysis, it can be useful to find out the following things: what columns we have; what their datatypes are; how many entries there are; and whether there are any missing values in our training dataset we might have to handle (e.g. by dropping the rows they are in or imputing the missing values). One way to do this is to use the `info()` method on our training dataframe `train_df`.


In [None]:
train_df.info()

Fortunately, it looks like the training set is complete! From this, we can see that it is formed of **10,320 entries**, where all of our features are numeric.
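If you prefer an explicit count of missing values over reading the `info()` output, `isna()` can be combined with `sum()`. The sketch below uses a small made-up dataframe (`toy_df` is hypothetical and is not the challenge data) so that it runs on its own; in this notebook, you would call the same methods on `train_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train_df, with one missing value
toy_df = pd.DataFrame(
    {
        "house_age": [12.0, 35.0, np.nan, 41.0],
        "mean_rooms": [5.1, 6.3, 4.8, 5.9],
    }
)

# Count the missing values in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Count the rows that contain at least one missing value
rows_with_missing = int(toy_df.isna().any(axis=1).sum())
print("Rows with missing values:", rows_with_missing)
```

For a complete dataset, every count is zero.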

We can also look at some statistical properties of the numerical data using the `describe()` method on our training dataframe.


In [None]:
train_df.describe()

Comparing the `75%` and `max` rows, we can already see from this table that there are likely a few extreme examples in the dataset we should be aware of, e.g. some block groups with a mean of over 50 rooms per household.

Out of curiosity, let's take a look at those instances:


In [None]:
train_df[train_df["mean_rooms"] > 50]
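The threshold of 50 above was picked by eye. A common rule of thumb (one option among several, and not something this challenge prescribes) is to flag values lying more than 1.5 times the interquartile range beyond the quartiles. Here is a minimal sketch on made-up values; the `mean_rooms` series below is hypothetical and stands in for `train_df["mean_rooms"]`.

```python
import pandas as pd

# Hypothetical toy values standing in for train_df["mean_rooms"]
mean_rooms = pd.Series([4.2, 5.0, 5.3, 5.8, 6.1, 6.4, 7.0, 52.0])

# Interquartile range (IQR) of the column
q1 = mean_rooms.quantile(0.25)
q3 = mean_rooms.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR below Q1 or above Q3
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = mean_rooms[(mean_rooms < lower) | (mean_rooms > upper)]
print(outliers)
```

Whether a flagged value is a genuine anomaly or just an unusual block group is still a judgement call.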

A more visual way to get an intuitive feel for how the data is distributed is to plot a histogram for each variable:


In [None]:
train_df.hist(bins=50, figsize=(10, 10))

Showing the data in this way makes it clear that median house price values have actually been capped at $500,000, which may have consequences depending on the application.
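One way to check for such a cap numerically is to count how many rows share exactly the maximum value. This is a sketch on made-up numbers (the `house_values` series is hypothetical); with the real data, you would use `train_df["median_house_value"]` instead.

```python
import pandas as pd

# Hypothetical toy target values (in thousands of dollars), capped at 500
house_values = pd.Series([120.5, 500.0, 342.1, 500.0, 87.3, 500.0])

# If many rows share exactly the maximum value, the variable was probably capped
cap = house_values.max()
n_capped = int((house_values == cap).sum())
share_capped = n_capped / len(house_values)
print(f"{n_capped} of {len(house_values)} values equal the maximum ({cap})")
```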

Using [seaborn](https://seaborn.pydata.org/), we can also see this with a violin plot (which is like a fancier box plot!):


In [None]:
sns.violinplot(data=train_df, x="median_house_value")

Using [pairplot()](https://seaborn.pydata.org/generated/seaborn.pairplot.html), we can show the distribution of each numerical variable, as well as the relationships between them, all in one diagram: on the diagonal, we will plot a histogram for each numerical variable as before, but for the off-diagonal graphs, we can generate a scatter plot for each pair of numerical variables.


In [None]:
sns.pairplot(
    train_df,
    vars=[
        "median_income",
        "house_age",
        "mean_rooms",
        "mean_bedrooms",
        "population",
        "mean_household_size",
        "median_house_value",
    ],
    plot_kws={
        "alpha": 0.75
    },  # make the points slightly transparent to be easier to see
)
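The linear relationships visible in a pair plot can also be summarised numerically using correlation coefficients. The sketch below uses a small made-up dataframe (`toy_df` and its values are hypothetical); in this notebook, you could call `corr()` on `train_df` in the same way.

```python
import pandas as pd

# Hypothetical toy data standing in for train_df
toy_df = pd.DataFrame(
    {
        "median_income": [2.0, 3.5, 5.0, 6.5, 8.0],
        "house_age": [40.0, 12.0, 25.0, 33.0, 18.0],
        "median_house_value": [110.0, 170.0, 240.0, 310.0, 400.0],
    }
)

# Pearson correlation of every column with the target, strongest first
correlations = toy_df.corr()["median_house_value"].sort_values(ascending=False)
print(correlations)
```

Bear in mind that the Pearson coefficient only captures linear relationships, so a low value does not mean a feature is useless.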

Seaborn also lets us generate other graphs, such as a scatter plot showing the location of all the block groups in the dataset, where each point is coloured according to the median house value in the block group.


In [None]:
sns.scatterplot(
    data=train_df,
    x="longitude",
    y="latitude",
    hue="median_house_value",
    alpha=0.75,
    palette="viridis",
)

Great! We are starting to build up a picture of what our data looks like and develop an idea of what we might want to do next.

**[EXERCISE]** What other statistics and visualisations might you want to look at for the training set?


### The test dataset

Before we move on to preprocessing the data, take a quick look at the test dataset for which we will be making predictions.


In [None]:
test_df

In [None]:
test_df.info()

In [None]:
test_df.describe()

**[EXERCISE]** What other statistics and visualisations might you want to look at for the test set?


## Preprocessing the data

Now it is your turn &ndash; can you preprocess the data to get it ready for training?

When preprocessing and transforming datasets, we generally seek to do the following things where applicable:

- **Handling missing values**

  Most machine learning models cannot be trained on missing values (often encoded as `NULL` or `NaN`), so we need a strategy to deal with them. The simplest such strategy is to drop the rows that contain missing values; however, you can sometimes lose a lot of useful data this way. Another simple strategy is **univariate imputation**, where you replace the missing values with some descriptive statistic computed from the remaining values in that column (e.g. the mean, median or mode) or even a sensible constant value. Some more advanced ways to impute missing values include [multivariate feature imputation](https://scikit-learn.org/1.5/modules/generated/sklearn.impute.IterativeImputer.html#sklearn.impute.IterativeImputer) and [nearest neighbours imputation](https://scikit-learn.org/1.5/modules/generated/sklearn.impute.KNNImputer.html#sklearn.impute.KNNImputer).

  There is a great guide in the [scikit-learn documentation](https://scikit-learn.org/1.5/modules/impute.html) if you want to learn more!

- **Scaling numerical data**

  Many machine learning algorithms (especially those that use gradient descent in optimisation) do not perform as well and may even struggle to converge on a good solution if the input features all have drastically different scales. Two common ways to deal with this are (i) **normalisation** or **min-max scaling** (where data is shifted and rescaled into the range `[0, 1]` typically) and (ii) **standardisation** where the data is shifted to have a mean of zero and rescaled to have unit variance. Some approaches may be more suitable than others, depending on the models you are using and what you want to achieve, but it can often be a good idea to experiment to see what gives you better results!

- **Encoding categorical data**

  We usually need to encode categorical data before we can train a model on it such as by using a [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), which effectively creates a binary column for each discrete category.

- **Feature engineering**

  Sometimes, it is possible to transform the raw input features you have into a set of features that are more useful in training. This is beyond the scope of this tutorial, but you may wish to read into it!
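To make the first three techniques above more concrete, here is a minimal sketch that combines median imputation, standardisation and one-hot encoding in a single column transformer. The dataframe is made up for illustration: `toy_df` and its `region` column are hypothetical, since the challenge data has no categorical columns and no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: "region" is an invented categorical column
toy_df = pd.DataFrame(
    {
        "house_age": [10.0, np.nan, 30.0, 20.0],
        "mean_rooms": [4.0, 6.0, 5.0, 5.0],
        "region": ["coast", "inland", "coast", "valley"],
    }
)

toy_preprocessor = make_column_transformer(
    # Numerical columns: fill gaps with the column median, then standardise
    (
        make_pipeline(SimpleImputer(strategy="median"), StandardScaler()),
        ["house_age", "mean_rooms"],
    ),
    # Categorical column: one binary column per category
    (OneHotEncoder(handle_unknown="ignore"), ["region"]),
    sparse_threshold=0,  # always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy_df)
print(transformed.shape)  # 2 scaled columns + 3 one-hot columns
```

Swapping `StandardScaler` for `MinMaxScaler` in the numerical pipeline would give min-max scaling instead.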

If you are interested in learning more about data preprocessing, check out the [scikit-learn documentation](https://scikit-learn.org/stable/modules/preprocessing.html) on the subject. Just note that you will want to preprocess the test set in the same way, so if you rescale your training data, you will want to rescale your test data in exactly the same way (with the parameters computed from the _training_ data)!

**Beware of data leakage**: while you must preprocess your training, validation and test data the same way, make sure you do not leak information about your validation and test data when doing this (e.g. by standardising your data using a mean and standard deviation calculated from all of the data, rather than just the training data), as this may artificially boost the apparent performance of your model.
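In scikit-learn, avoiding this kind of leakage comes down to calling `fit` (or `fit_transform`) on the training data only and then calling `transform` on everything else. Here is a minimal sketch with made-up numbers; the arrays `X_train_toy` and `X_test_toy` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature values for a training set and a test set
X_train_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
X_test_toy = np.array([[6.0], [7.0]])

# Compute the mean and standard deviation from the training data only...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)

# ...and reuse those same parameters to transform the test data
X_test_scaled = scaler.transform(X_test_toy)

print("Mean learned from the training data:", scaler.mean_)
print("Scaled test data:", X_test_scaled.ravel())
```

Notice that the scaled test values need not have zero mean or unit variance: they are expressed relative to the training distribution, which is exactly what we want.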


### Preprocessing the training data

Given that our data is entirely numerical and there are no missing values, for the time being, we will just select the features we want to use and rescale them.


In [None]:
# We will only use a subset of the features to begin with, but feel free to experiment!
FEATURES = [
    # "median_income",
    "house_age",
    "mean_rooms",
    "mean_bedrooms",
    "population",
    "mean_household_size",
    # "latitude",
    # "longitude",
]

X_train = train_df[FEATURES]
y_train = train_df["median_house_value"]

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler


preprocessor = make_column_transformer(
    # Feel free to experiment with different preprocessing techniques here! 👀
    (
        # [EXERCISE] Is this the best scaler to use in this case given what we have seen?
        MinMaxScaler(),
        FEATURES,
    ),
)

## Training and evaluating models

### A little theory

Our objective is to develop a model with the lowest possible **generalisation error** (or **out-of-sample error**), which is a measure of how well our model performs on unseen data. This is important if we ever want to use the models we train for inference in the real world. Of course, we cannot compute this directly &ndash; the data is unseen! &ndash; so we use the **empirical error** from evaluating the model on a test set, which is not used in training at all, as a proxy. This is what we show on the [competition scoreboard](https://doxaai.com/competition/california-housing/scoreboard).

In the ideal scenario, our model will generalise to perform similarly on both the training and test sets; however, this is not always the case. If our model starts to fit to the noise (or residual variation) in our training dataset, rather than the underlying function we are trying to learn (the signal), we end up **overfitting**; and likewise, if the representation of our model is not rich enough to encode that underlying relationship in the data, our model ends up **underfitting**. Both issues cause our model to perform worse when evaluated out-of-sample.
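To see both effects in a controlled setting, the sketch below fits decision trees of different depths to synthetic data (a noisy sine wave, which has nothing to do with the housing data) and compares the error on the data used for fitting with the error on held-out data. The choice of model and depths here is purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine wave (the signal) plus random noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

results = {}
for depth in [1, 4, None]:  # None lets the tree grow until it fits every point
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_fit, y_fit)
    train_mae = mean_absolute_error(y_fit, tree.predict(X_fit))
    val_mae = mean_absolute_error(y_val, tree.predict(X_val))
    results[depth] = (train_mae, val_mae)
    print(f"max_depth={depth}: train MAE={train_mae:.3f}, validation MAE={val_mae:.3f}")
```

Typically, the shallowest tree has a high error on both sets (underfitting), whereas the unrestricted tree memorises its training data, reaching a training error of essentially zero but a clearly higher validation error (overfitting).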

There are a lot of different models available to us (with different suitabilities for encoding different relationships), each with a range of **hyperparameters** we can tune to affect the learning process; however, if we use the performance of our models on the test set as the basis for updating hyperparameters, we run the risk of leaking information about and overfitting to the test set.

One approach is to further subdivide our training dataset into a training set and a completely separate **validation set**, but training data is precious and in short supply a lot of the time, and we would not want to overfit to that validation set either. An alternative approach here is to perform (**stratified**) **k-fold cross-validation**, where we (randomly) partition the data into `k` different "folds", train `k` models, each using a different fold for validation and the remaining folds for training, and average the results. You can read more about this in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/cross_validation.html).
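As a minimal sketch of k-fold cross-validation in scikit-learn, the example below evaluates a plain linear regression model on synthetic data using five folds. Both the data and the choice of model are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: a linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Randomly partition the data into 5 folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Train 5 models, each validated on a different fold
scores = cross_val_score(
    LinearRegression(), X, y, cv=cv, scoring="neg_mean_absolute_error"
)

# scikit-learn maximises scores, so errors are returned negated
fold_errors = -scores
print("MAE per fold:", fold_errors.round(3))
print(f"Mean MAE: {fold_errors.mean():.3f}")
```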

Putting this all together, we finally just need a strategy for optimising our hyperparameters. One (albeit potentially slow) way is just to perform a **cross-validated grid search** over a grid of hyperparameter values we want to investigate in order to try out a range of different combinations and see what performs well. When we think we have found the model with the best set of hyperparameters, we can retrain it on all of the training data and then evaluate it on the test set! 😎

### Putting everything into practice

We can now try applying what we know by performing a small grid search of our own on a [support vector machine with a linear kernel](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVR.html) (or `LinearSVR`) to find the value of the hyperparameter `C` that gives the best-performing model! `C` controls how strongly the model is regularised: the strength of the regularisation is inversely proportional to `C`.


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVR


parameter_grid = {
    # Try experimenting with other parameters!
    "model__C": [0.5, 1, 2],
}

pipeline = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("model", LinearSVR()),
    ]
)

regressor = GridSearchCV(
    pipeline,
    param_grid=parameter_grid,
    cv=5,  # Use 5 folds
    refit=True,
    scoring="neg_mean_absolute_error",
)
regressor.fit(X_train, y_train)

print("Best parameters:", regressor.best_params_)
print("Best mean absolute error:", -regressor.best_score_)

Neat! We just used [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to perform a cross-validated grid search and automatically retrain the model on all of the training data, which we can now use to make predictions for the test set! 🥳
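Beyond the best parameters and score, a fitted search object also records the results for every combination it tried in its `cv_results_` attribute, which can be loaded into a dataframe for comparison. The sketch below is self-contained, so it uses synthetic data and a `Ridge` model (both are stand-ins chosen for illustration, not the data or model used in this notebook); the same attribute is available on any fitted `GridSearchCV`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: a linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=120)

search = GridSearchCV(
    Pipeline([("scaler", StandardScaler()), ("model", Ridge())]),
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)

# One row per hyperparameter combination tried
columns = ["param_model__alpha", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns]
results["mean_absolute_error"] = -results["mean_test_score"]
print(results.sort_values("rank_test_score"))
```

Looking at the spread of scores across folds (`std_test_score`) can help you judge whether the differences between candidates are meaningful.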

**[EXERCISE] Since we do all our preprocessing as part of the pipeline we have made, it will be repeated every time a model is trained for a particular fold. Why might this be what we want?**

**[EXERCISE] What other models could you use for this regression task?**

**Hint**: take a look at the [scikit-learn documentation](https://scikit-learn.org/1.5/supervised_learning.html).


## Submitting our predictions to the DOXA AI platform

We are now ready to generate house price predictions for the test set and upload our work to the DOXA AI platform for evaluation!

**Make sure to [enrol to take part](https://doxaai.com/competition/california-housing) in the challenge if you have not already done so.**


In [None]:
predictions = regressor.predict(test_df[FEATURES])

predictions

In [None]:
# Prepare our submission

os.makedirs("submission", exist_ok=True)

with open("submission/doxa.yaml", "w") as f:
    f.write(
        "competition: california-housing\nenvironment: cpu\nlanguage: python\nentrypoint: run.py\n"
    )

with open("submission/run.py", "w") as f:
    contents = "\\n".join([str(prediction) for prediction in predictions])
    f.write(
        f"""import os
with open(os.environ["DOXA_STREAMS"] + "/out", "w") as f:
    f.write("{contents}")"""
    )

Next, we need to make sure we are logged in:


In [None]:
!doxa login

Finally, we can submit the predictions for evaluation:


In [None]:
!doxa upload submission

Wooo! 🥳 You have just submitted your predictions to the platform &ndash; well done! Take a moment to see how well you have done on the [scoreboard](https://doxaai.com/competition/california-housing/scoreboard).


## Possible improvements

Congratulations &ndash; you have made it to the end! We hope you have enjoyed learning about and applying machine learning to this challenge. Hopefully, you can now start experimenting with your own ideas to see how you can improve the performance of your model!

Here are a few questions and ideas to get you started:

1. **Data visualisation**:

- What other visualisations can you make using **pandas**, **matplotlib** and **seaborn**?

2. **Data preprocessing**:

- What other features might you want to use? Is there a way to select them automatically?
- How might you scale the data differently to boost performance?
- Would applying the [PCA algorithm](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) help here?

3. **Model selection**:

- What alternatives are there to running a standard grid search? Take a look at [HalvingGridSearchCV](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.HalvingGridSearchCV.html#sklearn.model_selection.HalvingGridSearchCV), [HalvingRandomSearchCV](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.HalvingRandomSearchCV.html#sklearn.model_selection.HalvingRandomSearchCV) and [RandomizedSearchCV](https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.RandomizedSearchCV.html#sklearn.model_selection.RandomizedSearchCV)!
- Have you ever tried using [BayesSearchCV](https://scikit-optimize.github.io/stable/auto_examples/sklearn-gridsearchcv-replacement.html) from the [scikit-optimize](https://scikit-optimize.github.io/stable/) package? It searches the space of hyperparameters using Bayesian optimisation and is a drop-in replacement for [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
- What other models could you use for this regression task? Take a look at the [scikit-learn documentation](https://scikit-learn.org/1.5/supervised_learning.html)!
- Could you improve performance by using an [ensemble of different models](https://scikit-learn.org/stable/modules/ensemble.html)?
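As a possible starting point for the model selection ideas above, here is a sketch of a randomised search, which samples hyperparameter values from a distribution instead of trying every point on a fixed grid. The synthetic data, the `Ridge` model and the search range for `alpha` are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data: a linear signal plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=120)

random_search = RandomizedSearchCV(
    Ridge(),
    # Sample alpha log-uniformly between 0.001 and 100
    param_distributions={"alpha": loguniform(1e-3, 1e2)},
    n_iter=10,  # number of sampled candidates
    cv=5,
    scoring="neg_mean_absolute_error",
    random_state=0,
)
random_search.fit(X, y)

print("Best parameters:", random_search.best_params_)
print("Best mean absolute error:", -random_search.best_score_)
```

The number of candidates (`n_iter`) is fixed in advance, so the cost of the search no longer grows with the number of hyperparameters you include.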

## Closing remarks

We hope that you have found this to be a useful and enjoyable exercise in exploring and gaining exposure to some fascinating ideas and concepts in machine learning. We look forward to seeing what you build! Do continue the conversation on the [DOXA Community Discord server](https://discord.gg/MUvbQ3UYcf). 😎
