# Assignment introduction

**Time to predict some stuff!** <br> For this assignment, we will try to predict house prices based on the features of the house in question. This means we need to do the following:

1. Load in the data (as per usual)
2. Perform some EDA to get a better understanding of the data
3. Clean up the data
4. Perform feature engineering and choose our feature dimensions
5. Create the feature and target matrix (our X's and y's)
6. Create and fit a model, then evaluate its performance
7. Set up a data prediction pipeline

This assignment is designed to give you a lot of creative freedom in how you approach your data engineering and model creation. Creating good supervised machine learning (SML) models takes a lot of thinking, effort and testing. Expect to go back and forth in the notebook a lot to change earlier data engineering steps, so keep your assigned variable names consistent!<br><br>

(**Note:** There is no shame in not creating a super well-performing model, as long as you try out the different methods involved in SML. As the picture below illustrates, sometimes it just goes wrong) <br><br>

The order in which you perform the different steps is more or less up to you, as long as you end up with some sort of trained model that is suitable for this type of prediction. <br>
The layout of this notebook is only meant as a general guide to the overarching concepts: data filtering, data engineering, model fitting, testing and evaluation, and setting up a data pipeline. If it's easier, you're welcome to create a separate notebook instead of working in this one, as long as you cover the tasks included.

![](https://aaubs.github.io/ds-master/media/ML_Daddy.png)

This is not an exhaustive list, but for this assignment you **will** need the following:
1. From scikit-learn:
- Some sort of encoder of your own choice
- train_test_split (unless you want to do it manually for whichever reason)
- Some sort of data scaler
2. A visualization library:
- I recommend matplotlib.pyplot and seaborn, but you can try to use altair if you want to
3. Pandas (not the bamboo eating kind)
- Because, duh, we need them dataframes!
4. Joblib or Pickle
- To save each component of the entire process

# Importing modules and loading in the data

**Link to the Data:** <br> https://www.kaggle.com/datasets/harishkumardatalab/housing-price-prediction?resource=download
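
A minimal sketch of the loading step. The file name `Housing.csv` and the column names are assumptions about what the Kaggle download contains, so check them against your own copy. The inline sample is made-up data that only exists so the cell runs without the file.

```python
import io

import pandas as pd

# Assumption: the Kaggle download is saved next to the notebook as "Housing.csv".
# In your own notebook you would use this line instead of the inline sample:
# df = pd.read_csv("Housing.csv")

# Made-up stand-in rows (hypothetical column names) so the sketch runs anywhere.
sample_csv = io.StringIO(
    "price,area,bedrooms,mainroad,furnishingstatus\n"
    "500000,120,4,yes,furnished\n"
    "420000,95,3,yes,semi-furnished\n"
    "310000,80,2,no,unfurnished\n"
    "250000,60,2,no,unfurnished\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # which columns are numeric and which are text
print(df.head())
```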

# EDA

**First things first** <br> It's always a good idea to start out by checking for missing values, and dealing with them if there are any. <br> Do we have any missing values that we need to take care of? If so, find a way to handle the missing data (i.e. removing or replacing/filling them)
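
One possible way to check for and handle missing values, shown on a small made-up dataframe (the column names are hypothetical). Dropping rows with a missing target and filling the rest with the median or the most frequent value is just one reasonable choice, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up data with a few holes in it.
df = pd.DataFrame({
    "price": [300_000, 450_000, np.nan, 520_000],
    "area": [80.0, np.nan, 95.0, 140.0],
    "furnishingstatus": ["furnished", None, "unfurnished", "furnished"],
})

# Count missing values per column.
print(df.isna().sum())

# Rows without a target cannot be used for supervised learning, so drop them.
df = df.dropna(subset=["price"])

# Fill numeric gaps with the median and categorical gaps with the most frequent value.
df["area"] = df["area"].fillna(df["area"].median())
df["furnishingstatus"] = df["furnishingstatus"].fillna(df["furnishingstatus"].mode()[0])

print(df.isna().sum())
```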

**Let's visualize some of the dimensions to get a better idea of our data. <br> Create a plot that shows the distribution of the price dimension. What can we see?**
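
A sketch of a price histogram with matplotlib. The prices here are synthetic (drawn from a right-skewed distribution purely for illustration), so the shape you see on the real data may differ. The skewness number is a quick numeric companion to the plot.

```python
import matplotlib
matplotlib.use("Agg")  # only needed when running outside a notebook; drop this line in Colab
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic, right-skewed prices standing in for the real "price" column.
rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.lognormal(mean=13, sigma=0.4, size=500)})

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(df["price"], bins=30, edgecolor="black")
ax.axvline(df["price"].median(), color="red", linestyle="--", label="median")
ax.set_xlabel("price")
ax.set_ylabel("number of houses")
ax.set_title("Distribution of house prices")
ax.legend()

# Positive skewness means a long tail of expensive houses.
skewness = df["price"].skew()
print(f"skewness: {skewness:.2f}")
```

If the real prices turn out to be skewed like this, that is worth keeping in mind when you look for outliers later.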

**Correlations tell us how strongly each dimension moves together with the price, and with the other dimensions. Understanding them is a good tool for choosing which dimensions go into the feature matrix. Often they are quite logical once you understand what the data represents. <br> Check the correlations of the dimensions**
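
A minimal sketch of a correlation check on synthetic data (hypothetical column names). Only numeric columns are included, since text columns have to be encoded before they can be correlated.

```python
import numpy as np
import pandas as pd

# Synthetic data where price depends mostly on area and a little on bedrooms.
rng = np.random.default_rng(0)
area = rng.uniform(50, 250, size=200)
bedrooms = rng.integers(1, 6, size=200)
df = pd.DataFrame({
    "area": area,
    "bedrooms": bedrooms,
    "price": 3_000 * area + 20_000 * bedrooms + rng.normal(0, 50_000, size=200),
    "mainroad": rng.choice(["yes", "no"], size=200),
})

# Pearson correlation between all numeric columns.
corr = df.select_dtypes(include="number").corr()

# How strongly each numeric column correlates with the target.
print(corr["price"].sort_values(ascending=False))
```

If you prefer a picture, `seaborn.heatmap(corr, annot=True)` draws the same matrix as a heatmap.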

# Data preprocessing

## Data cleaning

**Do we have any outliers in our data that may affect our prediction? If so, remove them if you think they could cause issues**
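
One common rule of thumb for spotting outliers is the 1.5 × IQR rule, sketched below on made-up prices. The 1.5 multiplier is a convention rather than a law, and whether to remove the flagged rows is a judgment call.

```python
import pandas as pd

# Made-up prices with one obvious outlier at the end.
df = pd.DataFrame({"price": [200, 220, 250, 260, 280, 300, 310, 330, 350, 5_000]})

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows that fall inside the fences.
inside = df["price"].between(lower, upper)
df_clean = df[inside]

print(f"fences: [{lower:.2f}, {upper:.2f}]")
print(f"removed {len(df) - len(df_clean)} row(s)")
```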

**Is there any other cleaning that needs to be done? If you believe so, then perform the remaining cleaning and then proceed to the next step**

## Feature engineering

**So far we've gotten an idea of how our data is correlated, and we may already have an idea of which features we wish to use for our feature matrix. We can, however, do more than simply check correlations between features. We can check what's known as feature importances. In order to do this, we need to engineer the data so that we can feed it to the SML model we choose.**

**BONUS TASK:** <br>
It may be possible to define new features, for example as a combination of two existing ones in order to increase predictability. If you believe you can create new features, this is the time to do it <br>
(Note: This step is not necessarily needed, but more if you're feeling adventurous, or if your model is not performing as well as you had hoped)

**In order to fit a model to our data, we need to encode the categorical values for prediction. <br> Encode the categorical features in the dataframe**

## Creating target and feature + final evaluation of dimensions

**Separate the target feature from the rest**
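
A minimal sketch of the separation, assuming the target column is called `price` (made-up data).

```python
import pandas as pd

df = pd.DataFrame({
    "price": [300_000, 450_000, 380_000, 520_000],
    "area": [80, 120, 95, 140],
    "bedrooms": [2, 3, 2, 4],
})

target = "price"           # assumption: name of the target column
X = df.drop(columns=[target])  # feature matrix
y = df[target]                 # target vector

print(X.shape, y.shape)
```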

**Next up:**<br> I want you to evaluate what's known as the feature importances of the data. (Hint: Most SML models have an attribute for this) <br>
**Reflect:**<br> Do we evaluate feature importances *before* or *after* we do the train_test_split? What are some possible issues/benefits with either approach? <br>
**Task:** Evaluate the feature importances of your data.

**Solution:** I do the train_test_split *BEFORE* evaluating the feature importances, as I wish to increase the model's ability to generalize to new data. If we base our feature selection on importances computed from all the data, information from the test set leaks into our choices, which may lead to an overly optimistic model evaluation.
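
A sketch of this order of operations on synthetic data: split first, then fit a model on the training part only and read its importances. `RandomForestRegressor` is used here because it exposes a `feature_importances_` attribute. The model choice and the column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: price depends strongly on area, weakly on bedrooms, not at all on "noise".
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "area": rng.uniform(50, 250, n),
    "bedrooms": rng.integers(1, 6, n),
    "noise": rng.normal(size=n),
})
y = 3_000 * X["area"] + 20_000 * X["bedrooms"] + rng.normal(0, 20_000, n)

# Split BEFORE looking at importances, so the test set stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Importances sum to 1; higher means the feature was more useful for the splits.
importances = pd.Series(
    forest.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(importances)
```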

**Now that you have selected your features through whichever method you preferred, you need to remove the excess features that you don't need.** <br>
*Note: If you've already performed the train_test_split, remember to remove them from both the test and training data*

# Fitting, testing and evaluating the model

**Okay! Now that we have our feature and target matrix sorted (for now), we finally get to actually create our machine learning model.** <br><br>
**Task:** Select a suitable model for the type of prediction we are trying to make, fit it to your data and evaluate the performance using appropriate metrics. <br><br>
**Reflect**: Is the model performing well? If not, how can we increase the performance?
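
A sketch of the fit-and-evaluate step on synthetic data. Price is a continuous value, so this is a regression problem. `LinearRegression` and the three metrics below are one reasonable starting point, not the required answer. Note that the scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned and encoded feature matrix.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "area": rng.uniform(50, 250, n),
    "bedrooms": rng.integers(1, 6, n),
})
y = 3_000 * X["area"] + 20_000 * X["bedrooms"] + rng.normal(0, 20_000, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

# Regression metrics: errors are in the same unit as the price, R^2 is unitless.
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:,.0f}  RMSE: {rmse:,.0f}  R^2: {r2:.3f}")
```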

## **BONUS TASK - Hyperparameter tuning using GridSearchCV:**

One way to increase model performance is to perform what's known as a grid search of the hyperparameters of the model. Basically, we try out different combinations of these hyperparameters in order to find the best setup based on some form of scoring metric. Give it a try!  <br><br>
**NOTE:** These can take a while to run, so whilst it is a good approach, it ***can*** also cost a lot of time, and as you may have experienced, Colab tends to time out after a while. A way to think of it is the following:<br><br>

$$
\text {Total models tested} = \text {(Number of variations of parameter 1)} \space \times \text {(Number of variations of parameter 2)} \space \times \text {(Number of variations of parameter 3)} \space \text {...}\space \times \text {(Number of variations of parameter n)}
$$ <br>

**So if you're testing out the following param_grid:**
<br>3 variations of parameter 1,
<br>4 variations of parameter 2,
<br>2 variations of parameter 3,
<br>2 variations of parameter 4,
<br>5 variations of parameter 5
<br><br>**You get the following:** <br><br>

$$
\text {Total models tested} = 3 \times 4 \times 2 \times 2 \times 5 = 240
$$ <br>
As you can see, it goes up quickly, as we are already testing 240 versions of the model with different parameters. Whilst testing upwards of even 100 variations doesn't necessarily take that long (which I tried), you should still be careful not to simply increase it to try out every possible combination.
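
The arithmetic above can be checked with scikit-learn's `ParameterGrid`. The sketch below uses a hypothetical `param_grid` for a random forest with the same 3 × 4 × 2 × 2 × 5 layout. Keep in mind that `GridSearchCV` fits every combination once per cross-validation fold, so the number of actual fits is the number of combinations times `cv`. The second half runs a real search on a deliberately tiny grid and synthetic data, so it finishes quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Hypothetical grid with 3, 4, 2, 2 and 5 variations.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10, 20],
    "bootstrap": [True, False],
    "min_samples_leaf": [1, 2],
    "min_samples_split": [2, 4, 6, 8, 10],
}

n_combinations = len(ParameterGrid(param_grid))  # 3 * 4 * 2 * 2 * 5 = 240
cv_folds = 5
n_fits = n_combinations * cv_folds               # 240 * 5 = 1200
print(n_combinations, n_fits)

# A much smaller grid that is cheap enough to actually run here.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, size=120)

small_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    small_grid,
    cv=3,
    scoring="neg_mean_absolute_error",
)
search.fit(X, y)
print(search.best_params_)
```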

## **Save the model components**

**When we have created our model, we want to be able to use it outside of our development notebook. For this purpose, we need to save each component we used for preprocessing, as well as the model itself**<br>
**Task:** <br>
Save the model components (the prediction model itself, the scaler and the label encoder)
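
A sketch of saving and reloading the components with `joblib`. The toy components, the file names and the temporary output folder are assumptions for the example. `OneHotEncoder` stands in for whichever encoder you used. In your own notebook you would save the objects you actually fitted, to a folder of your choice.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy fitted components standing in for the real ones.
train = pd.DataFrame({
    "area": [80, 120, 95, 140],
    "mainroad": ["yes", "no", "yes", "yes"],
    "price": [300_000, 450_000, 380_000, 520_000],
})
scaler = StandardScaler().fit(train[["area"]])
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[["mainroad"]])
model = LinearRegression().fit(scaler.transform(train[["area"]]), train["price"])

# Save each component to its own file (temporary folder used for the example).
out_dir = Path(tempfile.mkdtemp())
components = {"model": model, "scaler": scaler, "encoder": encoder}
for name, component in components.items():
    joblib.dump(component, out_dir / f"{name}.joblib")

# Quick check: the reloaded model gives the same predictions as the original.
loaded_model = joblib.load(out_dir / "model.joblib")
X_scaled = scaler.transform(train[["area"]])
print((loaded_model.predict(X_scaled) == model.predict(X_scaled)).all())
```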

# Creating the data pipeline

**Finally, we want to streamline our preprocessing and prediction methods for new data points. To do this, we create what's known as a data pipeline. It's basically a function that performs the *same* preprocessing steps on a new observation as the ones we performed earlier** <br><br>
**Task 1:** <br> Load in your components, and create a data pipeline function that will perform the preprocessing steps you did earlier all in one. Apply it to a new observation. <br><br>
**Task 2:** <br> Load in your model, and create a prediction function that will predict an outcome based on this new observation
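
A sketch of both tasks on made-up data. The function names `preprocess` and `predict_price`, the column lists and the toy components are all hypothetical. To keep the example self-contained the components are fitted in place, whereas in your notebook they would come from `joblib.load(...)`. The key point is that new data is only ever passed through `transform`, never `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUM_COLS = ["area", "bedrooms"]  # assumption: numeric feature columns
CAT_COLS = ["mainroad"]          # assumption: categorical feature columns


def preprocess(df, scaler, encoder):
    """Apply the already fitted scaler and encoder to a dataframe."""
    num = scaler.transform(df[NUM_COLS])
    cat = encoder.transform(df[CAT_COLS]).toarray()
    return np.hstack([num, cat])


def predict_price(observation, model, scaler, encoder):
    """Predict the price for one observation given as a dict."""
    row = pd.DataFrame([observation])
    return float(model.predict(preprocess(row, scaler, encoder))[0])


# Toy components; in practice: scaler = joblib.load("scaler.joblib"), and so on.
train = pd.DataFrame({
    "area": [80, 120, 95, 140, 60, 200],
    "bedrooms": [2, 3, 2, 4, 1, 5],
    "mainroad": ["yes", "no", "yes", "yes", "no", "yes"],
    "price": [300_000, 380_000, 330_000, 500_000, 200_000, 700_000],
})
scaler = StandardScaler().fit(train[NUM_COLS])
encoder = OneHotEncoder(handle_unknown="ignore").fit(train[CAT_COLS])
model = LinearRegression().fit(preprocess(train, scaler, encoder), train["price"])

# A new, unseen observation goes through exactly the same steps.
new_house = {"area": 110, "bedrooms": 3, "mainroad": "yes"}
prediction = predict_price(new_house, model, scaler, encoder)
print(f"predicted price: {prediction:,.0f}")
```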

# Bonus task: Create an interface to interact with your model

You can either create a simple gradio interface, or alternatively create a streamlit application.