<div class='bar_title'></div>

*Enterprise AI*

# Tutorial 2 - Introduction to Machine Learning

Gunther Gust / Viet Nguyen<br>
Chair of Enterprise AI

Summer Semester 25

<img src="https://github.com/GuntherGust/tds2_data/blob/main/images/d3.png?raw=true" style="width:20%; float:left;" />

In this tutorial, we introduce you to the basics of machine learning through hands-on programming exercises. This tutorial is designed as a practical refresher of the content in the course "Practical Data Science" from the last winter semester. While it may seem challenging for newcomers who have never programmed, you will gradually become comfortable with the concepts through practice. To support your learning, we have also provided free courses on DataCamp that cover both the basics and more advanced topics in Python programming. 

The tutorial includes:
- Data preprocessing: how to clean up and prepare data as input for the machine learning models
- Model development and evaluation: building simple machine learning models and evaluating their performance on a particular task
  

## 1. Data Preprocessing

Before we can build any machine learning model, it is crucial to prepare the data properly. Raw data is often messy, incomplete, or not in the ideal format for modeling. Data preprocessing ensures that the data is clean, well-structured, and suitable for feeding into machine learning algorithms. Skipping or poorly executing this step can lead to inaccurate or misleading models. In this section, we will:

- Load and explore the dataset using the pandas library.
- Separate the dataset into features (input variables) and the target variable (what we want to predict).
- Split the data into training and testing sets for model evaluation.

You can read further on the topic [here](https://www.v7labs.com/blog/data-preprocessing-guide). We will go into the basics in the next cells.

### Importing required libraries

To begin with, we import the `pandas` package, which is a powerful and widely used Python library for data manipulation and analysis.

In [1]:
import pandas as pd

Here, `as pd` defines an alias, a shorthand that lets us refer to the package as `pd` throughout the code instead of typing `pandas` every time.

In this tutorial, we will be working with the housing dataset, which is hosted on our GitHub repository. To load this dataset into a pandas `DataFrame`, we will use the `pd.read_csv()` function. If the dataset is stored locally, you can provide the relative or absolute path to the csv file. For example, if the file is in the current directory, you can use the following command:
```python
housing_data = pd.read_csv("./path/to/folder/housing.csv")
```
The `./` represents the current directory, and you can replace `"./path/to/folder/housing.csv"` with the actual path where your CSV file is stored.

We will use an online dataset, and the command to load the dataset from GitHub would look like this:

In [2]:
housing_data = pd.read_csv("https://raw.githubusercontent.com/GuntherGust/tds2_data/refs/heads/main/data/housing.csv")

Here, you create a variable `housing_data` that stores the result of reading the CSV file `housing.csv`. You can think of `housing_data` as an object, specifically a `DataFrame`, which is a type of object provided by the pandas library. A `DataFrame` is similar to an Excel spreadsheet or a table in a database: it has rows (each one representing a house, in this case) and columns (each one representing a feature of the house, such as area or number of bedrooms). Once we have this object (`housing_data`), we can use the many built-in functions (also called methods) that pandas provides. One very commonly used function is `head()`:

In [3]:
# This is a "comment" in a code cell. It is not executed; it only provides additional information about the code
# Show the first 5 rows
housing_data.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420.0,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960.0,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960.0,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500.0,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420.0,4,1,2,yes,yes,yes,no,yes,2,no,furnished


This function shows the first 5 rows of the data. It gives us a quick overview of what the dataset looks like: how many columns there are, what kind of values are inside, and whether it loaded correctly. This is very helpful when you're starting to explore a dataset, especially if you've never seen it before. Similarly, you can compute summary statistics for the numerical columns of your dataset using `describe()`:

In [4]:
housing_data.describe()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,parking
count,545.0,504.0,545.0,545.0,545.0,545.0
mean,4766729.0,5209.31746,2.965138,1.286239,1.805505,0.693578
std,1870440.0,2188.828118,0.738064,0.50247,0.867492,0.861586
min,1750000.0,1700.0,1.0,1.0,1.0,0.0
25%,3430000.0,3600.0,2.0,1.0,1.0,0.0
50%,4340000.0,4636.0,3.0,1.0,2.0,0.0
75%,5740000.0,6420.0,3.0,2.0,2.0,1.0
max,13300000.0,16200.0,6.0,4.0,4.0,3.0
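
To experiment with `read_csv()`, `head()`, and `describe()` without downloading anything, the following minimal sketch loads a tiny table from an in-memory string. The values and the names `toy_csv` and `toy_data` are invented for illustration and are not taken from the housing dataset.

```python
import io

import pandas as pd

# Hypothetical toy data (not from housing.csv); one area value is left empty
toy_csv = """price,area,bedrooms,furnishingstatus
300000,1200,2,furnished
450000,1500,3,unfurnished
520000,,3,furnished
610000,2100,4,semi-furnished
"""

# read_csv accepts file paths, URLs, and file-like objects such as StringIO
toy_data = pd.read_csv(io.StringIO(toy_csv))

first_rows = toy_data.head(2)  # first two rows only
summary = toy_data.describe()  # statistics for the numerical columns only

print(first_rows)
print(summary)
```

The `count` row of `describe()` skips missing entries. This is why `area` shows a count of 504 in the housing data while the other columns show 545.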


## 2. Data Sampling
Now that we've had a first look at our dataset, it's time to prepare it for modeling. Before we build any machine learning model, we need to test how well it performs on new, unseen data. To do this properly, we split our data into two parts:
- `Training set`: the portion of the data we’ll use to train (or "teach") our model.
- `Testing set`: a separate portion we’ll use later to test how well the model works on data it hasn’t seen before.

Think of it like studying for an exam: you learn from a textbook (training set), and later you’re tested with different questions (testing set) to see how well you understood the material. This idea of separating training from testing is called *data sampling*.

### Splitting Features and Targets

Let’s now split the dataset into two parts:

- Features (inputs we use to make predictions): all the columns except the house price.
- Target (what we want to predict): in our case, it's the price column — the house price.

We can do this with the following commands:

In [5]:
Y = housing_data["price"]  # Extract the price column from the DataFrame
X = housing_data.drop("price", axis=1)  # Keep the remaining columns of the DataFrame

Here’s what’s happening:
- `housing_data["price"]` grabs just the "price" column as a Series (a single column).
- `housing_data.drop("price", axis=1)` removes the price column and returns all the other columns in a new `DataFrame`. Here, `axis` indicates along which dimension of the data frame we want to drop something. A `DataFrame` is 2-dimensional: `axis=0` refers to rows and `axis=1` refers to columns, so `axis=1` means dropping the column "price". You will encounter more complicated examples in the future where data can have more than two dimensions.
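
The effect of `axis` can be tried on a small table. This is only a sketch: the table `toy` and its row labels are made up for illustration.

```python
import pandas as pd

# Hypothetical toy table with a custom row index
toy = pd.DataFrame(
    {"price": [300000, 450000, 520000], "area": [1200, 1500, 1800], "bedrooms": [2, 3, 3]},
    index=["house_a", "house_b", "house_c"],
)

target = toy["price"]                      # a Series: one column
features = toy.drop("price", axis=1)       # axis=1: drop a column by name
fewer_rows = toy.drop("house_a", axis=0)   # axis=0: drop a row by index label

print(type(target).__name__)
print(list(features.columns))
print(list(fewer_rows.index))
print(list(toy.columns))  # drop() returned new objects; toy itself is unchanged
```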

### Train-Test Split

Now we’ll use a function from the `scikit-learn` library called `train_test_split`. This function takes our features (X) and target (Y) and splits them into four parts:
- `X_train` and `Y_train`: data for training the model.
- `X_test` and `Y_test`: data we save for testing the model later.

We also need to tell the function what percentage of data we want to keep for testing. A common choice is `20%` for testing and `80%` for training. Additionally, we set a `random_state` to make sure we get the same split every time we run the code:

In [6]:
from sklearn.model_selection import train_test_split
random_state = 0
test_size = 0.2
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=test_size, random_state=random_state)

When you run `train_test_split`, it randomly divides the data into:
1. Training data: `X_train` (features) and `Y_train` (target):
   - `X_train`: Includes columns like `area` and `bedrooms` for training.
   - `Y_train`: Includes the prices corresponding to those entries.
2. Test data: `X_test` (features) and `Y_test` (target):
   - `X_test`: The same feature columns for a smaller set of rows that the model has never seen.
   - `Y_test`: The corresponding prices for those test entries.

Broad analogy: We want to make sure the model is truly learning and not just memorizing the training data (which is known as [overfitting](https://en.wikipedia.org/wiki/Overfitting)). Think of a student preparing for a test. If the student only practices with the exact same questions they’ll see on the exam, they might memorize the answers instead of actually learning the material. But if the student practices with different questions and then takes an exam with new questions (the test set), they’ll be tested on their ability to apply what they’ve learned, not just recall memorized answers. In the same way, we use the test data to ensure that our model can generalize, which means it can make accurate predictions on new, unseen data, rather than simply memorizing the training data.

In addition, the `random_state` is very important: if someone else runs your code, even on a different machine, they will get the exact same train/test split, assuming the data and environment are the same.
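
The reproducibility claim can be checked on a toy example. The sketch below uses ten made-up rows (`X_toy` and `Y_toy` are illustrative names, not part of the tutorial's data) and splits them twice with the same `random_state`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature and one target
X_toy = pd.DataFrame({"area": range(10)})
Y_toy = pd.Series(range(100, 110), name="price")

# Two calls with the same random_state produce the same split
X_tr1, X_te1, Y_tr1, Y_te1 = train_test_split(X_toy, Y_toy, test_size=0.2, random_state=0)
X_tr2, X_te2, Y_tr2, Y_te2 = train_test_split(X_toy, Y_toy, test_size=0.2, random_state=0)

print(len(X_tr1), len(X_te1))                   # 80% and 20% of the 10 rows
print(list(X_te1.index) == list(X_te2.index))   # same test rows both times
print(list(X_tr1.index) == list(Y_tr1.index))   # features and target stay aligned
```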

## 3. Data Inspection
Now that we've split our data, it's time to get to know it better. From this point forward, we'll only work with our training dataset (`X_train`). This is important because we want to avoid accidentally peeking at the test data during training, which could bias our model and give us a false sense of its performance. Let's repeat the step of reading the first few rows of the training data:

In [7]:
# Show the first seven rows of the X_train DataFrame
X_train.head(n=7)

Unnamed: 0,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
542,3620.0,2,1,1,yes,no,no,no,no,0,no,unfurnished
496,4000.0,2,1,1,yes,no,no,no,no,0,no,unfurnished
484,3040.0,2,1,1,no,no,no,no,no,0,no,unfurnished
507,,2,1,1,yes,no,no,no,no,0,no,unfurnished
252,9860.0,3,1,1,yes,no,no,no,no,0,no,semi-furnished
263,3968.0,3,1,2,no,no,no,no,no,0,no,semi-furnished
240,3840.0,3,1,2,yes,no,no,no,no,1,yes,


The function `head()` has an optional input parameter `n`, which indicates how many rows you want to display. By default, `n=5` if you don't specify it. When you program, a useful function is `help()`, which shows a full description of a function, including the input parameters and the expected output.

In [8]:
# Please run this cell
help(X_train.head)

Help on method head in module pandas.core.generic:

head(n: 'int' = 5) -> 'Self' method of pandas.core.frame.DataFrame instance
    Return the first `n` rows.
    
    This function returns the first `n` rows for the object based
    on position. It is useful for quickly testing if your object
    has the right type of data in it.
    
    For negative values of `n`, this function returns all rows except
    the last `|n|` rows, equivalent to ``df[:n]``.
    
    If n is larger than the number of rows, this function returns all rows.
    
    Parameters
    ----------
    n : int, default 5
        Number of rows to select.
    
    Returns
    -------
    same type as caller
        The first `n` rows of the caller object.
    
    See Also
    --------
    DataFrame.tail: Returns the last `n` rows.
    
    Examples
    --------
    >>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
    ...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
    >

Next, let’s check the shape of our dataset — how many rows and columns it has:

In [9]:
X_shape = X_train.shape
print(X_shape)

(436, 12)


When you run `X_train.shape` by itself in a code cell, Python will immediately display the result, which tells you how many rows and columns your dataset has:
```python
X_train.shape
```
But sometimes, especially when you're writing longer programs, you don’t want Python to just automatically show the result right away. Instead, you might want to save the result somewhere first. In this case, we save it in a variable called `X_shape`. Now the result is stored in this variable, like putting something into a labeled box. If you now want to see what's inside that box, you can use the `print()` function:
```python
# store shape into X_shape 
X_shape = X_train.shape
# display
print(X_shape)
```

`print()` works with almost anything: numbers, text, even entire tables. So by writing `print(X_shape)`, you're asking Python to display the shape of the training dataset on your terms.

Now, we’ll create a summary of the numeric data using the `.describe()` function:

In [10]:
X_train_description = X_train.describe()

This gives us useful statistics such as the mean, standard deviation, minimum, and maximum for each numerical feature. It’s like getting a quick health check-up for your data. Because the result is stored in a variable, nothing is displayed yet; you can print it out in the same way as before:

In [11]:
print(X_train_description)


## 4. Handling Missing Values

Real-world data often has gaps or missing values; maybe someone forgot to enter a value, or maybe the data wasn’t available. We can find these missing values using the `isna()` function. This function returns a data frame with boolean values, where `True` indicates that the value is missing and `False` indicates that the value is present. To count the number of missing values in each column, we can use the `sum()` function:

In [12]:
missing_values = X_train.isna().sum()
print(missing_values)

area                24
bedrooms             0
bathrooms            0
stories              0
mainroad             0
guestroom            0
basement             0
hotwaterheating      0
airconditioning      0
parking              0
prefarea             0
furnishingstatus    29
dtype: int64


Here, it shows there are 24 and 29 missing values in the `area` and `furnishingstatus` columns, respectively. Since most machine learning models can’t handle missing values directly, we need to fix this. We have two main options:
1. Drop the rows with missing data (but this may lose important information).
2. Impute the missing values, which means we fill in the blanks with a reasonable guess.
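
To see how `isna().sum()` counts gaps and what the first option costs, here is a minimal sketch on a made-up table (the name `toy` and all values are illustrative).

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one gap in each column
toy = pd.DataFrame(
    {
        "area": [1200.0, np.nan, 1800.0, 1500.0],
        "furnishingstatus": ["furnished", "unfurnished", None, "furnished"],
    }
)

missing_mask = toy.isna()                # True where a value is missing
missing_per_column = missing_mask.sum()  # True counts as 1, False as 0
rows_without_gaps = toy.dropna()         # option 1: drop every row with a gap

print(missing_per_column)
print(len(toy), "->", len(rows_without_gaps))
```

In this toy case, dropping removes half of the rows even though only two cells are missing, which illustrates why imputation is often preferred.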

To start, we will use the [`SimpleImputer`](https://scikit-learn.org/stable/modules/impute.html#impute) tool from `scikit-learn` to do this. In Python (and especially in libraries like scikit-learn), many tools come in the form of classes. A class is like a blueprint or recipe for creating objects that can perform certain tasks. Here, `SimpleImputer` is one such class. It comes from the `sklearn.impute` module and is used to fill in missing values in your dataset.

In [13]:
from sklearn.impute import SimpleImputer

Next, we will create our imputer. To do this, we must pass a strategy that defines how the missing values are imputed. But what should we use to fill in the blanks?

### Numerical Data: Using Mean
For numerical columns like `area` (e.g., 1200, 1500, 1800), we can use the mean, that is, the average of all existing values in that column. This is a good strategy when the data is fairly balanced and doesn't have extreme outliers (very large or small numbers that could skew the average).

### Categorical Data: Using the Most Frequent
For categorical columns like `furnishingstatus` (e.g., "furnished", "unfurnished", "semi-furnished"), you can't take an average, because the values are not numbers. Instead, we use the most frequent value, the category that appears most often in that column. This is a reasonable assumption when one category is clearly more common than the others.

We can select these strategies with the parameter `strategy` of the class `SimpleImputer`:

In [14]:
numerical_imputer = SimpleImputer(strategy='mean')
categorical_imputer = SimpleImputer(strategy="most_frequent")

You are doing two things:
1. Creating an object (e.g., called `numerical_imputer`) based on the `SimpleImputer` class
2. Telling it how to behave by passing a parameter

Once you have this object, it gives you functions (also called methods) that come from the class. The main ones we use are:
1. `fit()`: this is like the object learning from your data (it figures out the mean or the most common value)
2. `transform()`: this is like the object applying what it learned to fix the data
3. `fit_transform()`: this combines both steps: learn and apply, all at once
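
The three methods can be tried on a tiny made-up table. In this sketch, `toy`, `mean_imputer`, and `mode_imputer` are illustrative names; `statistics_` is the attribute in which a fitted `SimpleImputer` stores the fill value it learned.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy table with one gap in each column
toy = pd.DataFrame(
    {
        "area": [1200.0, np.nan, 1800.0, 1500.0],
        # dtype=object mirrors how the tutorial's text columns are stored
        "furnishingstatus": pd.Series(
            ["furnished", "unfurnished", np.nan, "furnished"], dtype=object
        ),
    }
)

# Numerical column: learn the mean first, then apply it in a second step
mean_imputer = SimpleImputer(strategy="mean")
mean_imputer.fit(toy[["area"]])                       # learns the mean of 1200, 1800, 1500
area_filled = mean_imputer.transform(toy[["area"]])   # fills the gap with that mean

# Categorical column: learn and apply in one step
mode_imputer = SimpleImputer(strategy="most_frequent")
status_filled = mode_imputer.fit_transform(toy[["furnishingstatus"]])

print(mean_imputer.statistics_)  # learned fill value for area
print(area_filled.ravel())
print(status_filled.ravel())
```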

Putting all together: The class gives you a tool. You configure it (with `strategy`), and then you use its functions (e.g., `fit_transform()`) to clean your data:

In [15]:
numeric_imputed_values = pd.DataFrame(numerical_imputer.fit_transform(X_train[["area"]]),index=X_train.index, columns=["area"])
categorical_imputed_values = pd.DataFrame(categorical_imputer.fit_transform(X_train[["furnishingstatus"]]),index=X_train.index, columns=["furnishingstatus"])

After we impute missing values using `fit_transform()`, the result we get is a `NumPy` array (the DataCamp courses cover NumPy arrays in more detail). It's just the data without any column names or row labels. But our original data was a pandas `DataFrame`, which includes both:
- `Columns`: named like `area` or `furnishingstatus`
- `Index`: the row labels (usually numbers like 0, 1, 2, ..., but they can be dates or other labels too)

When we write:
```python
pd.DataFrame(numpy_array, index=X_train.index, columns=["area"])
```

We tell pandas to convert this `NumPy` array back into a `DataFrame`, and use the same row labels (index) as the original `X_train` so everything lines up correctly. This is important because later on, we may want to merge the imputed columns back into the full dataset. If the rows don’t match (e.g., are out of order or mislabeled), we could end up mixing data from different houses, which we definitely don't want.
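
The following sketch illustrates what goes wrong when the index is not carried over. The three rows and the array `imputed_array` are made up; the array merely stands in for the output of `fit_transform()`.

```python
import numpy as np
import pandas as pd

# Hypothetical training rows whose index is shuffled, as after train_test_split
toy_train = pd.DataFrame({"area": [3620.0, np.nan, 3040.0]}, index=[542, 496, 484])

# Stand-in for the NumPy array returned by fit_transform (no labels attached)
imputed_array = np.array([[3620.0], [3330.0], [3040.0]])

# Without index=..., pandas labels the rows 0, 1, 2
wrong = pd.DataFrame(imputed_array, columns=["area"])
# With index=..., the rows keep their original labels
right = pd.DataFrame(imputed_array, index=toy_train.index, columns=["area"])

misaligned = toy_train.copy()
misaligned["area"] = wrong["area"]   # labels 0, 1, 2 do not exist in toy_train
aligned = toy_train.copy()
aligned["area"] = right["area"]      # labels 542, 496, 484 match

print(misaligned["area"].isna().sum())  # number of values lost to misalignment
print(aligned["area"].tolist())
```

pandas matches rows by label when assigning a column, so mismatched labels silently produce missing values instead of raising an error.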

Finally, we replace the original columns in the dataset with their imputed versions:

In [16]:
X_train["area"] = numeric_imputed_values
X_train["furnishingstatus"] = categorical_imputed_values

You can check that there are no missing values left in these two columns:

In [17]:
missing_values_after_imputation = X_train.isna().sum()
print(missing_values_after_imputation)

area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64


## 5. Scaling the Data
Our next preprocessing step is to scale our data. Some machine learning models are sensitive to the scale of numerical data. For example, a model might treat “area in square feet” (which can be in thousands) very differently from “number of bedrooms” (which is a small number), just because of their different scales, not because area is more important.

To fix this, we’ll use something called standardization, which means adjusting the values so they all have a similar scale (mean = 0, standard deviation = 1). Similar to imputation, we have [`StandardScaler`](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling) as the candidate class for standardization purposes:

In [18]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

Before scaling, let’s identify the numeric and categorical columns:

In [19]:
# Identify column types
categorical_columns = X_train.select_dtypes(include="object").columns
numerical_columns = X_train.select_dtypes(exclude="object").columns

The `select_dtypes()` method in pandas is used to select columns of a specific data type from a `DataFrame`.
We're using it to separate:
- categorical columns (e.g., "yes", "no", "furnished", "semi-furnished")
- numerical columns (e.g., area in square feet, number of bedrooms)
  
When a column contains text/string values, its data type is usually "object" (because strings are technically Python objects).
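
As a quick illustration, the sketch below applies `select_dtypes()` to a made-up table (`toy` is an illustrative name). The text column is created with `dtype=object` explicitly, to match the situation described above.

```python
import pandas as pd

# Hypothetical toy table with two numerical columns and one text column
toy = pd.DataFrame(
    {
        "area": [1200.0, 1500.0],
        "bedrooms": [2, 3],
        "mainroad": pd.Series(["yes", "no"], dtype=object),
    }
)

toy_categorical = toy.select_dtypes(include="object").columns
toy_numerical = toy.select_dtypes(exclude="object").columns

print(list(toy_categorical))
print(list(toy_numerical))
```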

In [20]:
print(categorical_columns)
print(numerical_columns)

Index(['mainroad', 'guestroom', 'basement', 'hotwaterheating',
       'airconditioning', 'prefarea', 'furnishingstatus'],
      dtype='object')
Index(['area', 'bedrooms', 'bathrooms', 'stories', 'parking'], dtype='object')


The procedure is similar to imputation: 

In [21]:
scaler = StandardScaler()
# apply the scaler on the numerical columns of the X_train dataFrame and convert the result to a dataFrame
scaled_values = pd.DataFrame(scaler.fit_transform(X_train[numerical_columns]),index=X_train.index,columns=numerical_columns)
# replace the original numerical columns with scaled versions
X_train[numerical_columns] = scaled_values

This puts the numerical features on a comparable scale, which helps models that are sensitive to the scale of their inputs.
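
Standardization replaces each value $x$ by $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the column. The sketch below checks this against `StandardScaler` on made-up numbers; the array `values` is purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical areas and bedroom counts on very different scales
values = np.array([[1000.0, 2.0], [2000.0, 3.0], [3000.0, 4.0]])

toy_scaler = StandardScaler()
scaled = toy_scaler.fit_transform(values)

# Manual computation: subtract the column mean, divide by the column std
# (NumPy's default std uses ddof=0, which matches StandardScaler)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(scaled)
print(np.allclose(scaled, manual))
```

After scaling, both columns contain the same values, even though the raw numbers differed by a factor of several hundred.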

## 6. Handling Categorical Variables

Most machine learning models can’t understand text labels like "furnished" or "semi-furnished". We need to convert these into numbers — but not just any numbers. We use [`One-Hot Encoding`](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features), which turns each category into a new column. Again, `OneHotEncoder` is the candidate class for this task:

In [22]:
from sklearn.preprocessing import OneHotEncoder

# same procedure as imputation and scaling data
one_hot_encoder = OneHotEncoder(sparse_output=False)
encoded_values =  pd.DataFrame(one_hot_encoder.fit_transform(X_train[categorical_columns]),index=X_train.index, columns=one_hot_encoder.get_feature_names_out())

When we create our encoder, we set the parameter `sparse_output` to `False`, so that it returns a NumPy array instead of a sparse matrix. A sparse matrix is a matrix (a grid of numbers) that contains mostly zeros and is stored in a format that saves memory by only keeping the non-zero values, rather than storing all the zeros too. This is useful for one-hot encoding, where we often end up with lots of columns, most of which are zero. Please take a look at the documentation to understand the sparse matrix format better!

Next, we bring the encoded data back into our data frame. To do so, we first drop the categorical columns of the data frame using the `drop()` function:

In [23]:
X_train_no_cat =  X_train.drop(categorical_columns,axis=1)
X_train_no_cat = pd.concat([X_train_no_cat, encoded_values],axis=1)

This function allows us to remove specific columns by passing the `categorical_columns` list and setting `axis=1`, which tells pandas to drop columns (not rows). We do not use `inplace=True` here: it would modify the original `DataFrame` directly, and it’s generally clearer and safer to assign the result to a new variable instead, so we keep control of our data and avoid unexpected changes.

Finally, we concatenate the encoded data back into the `DataFrame` using `pd.concat()`. This function merges the original data (now without the categorical columns) and the encoded values side by side by setting `axis=1`, which combines them column-wise.
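
The drop-then-concatenate pattern can be sketched on two made-up rows. The names `toy`, `toy_encoded`, and `toy_combined` are illustrative, and the encoded columns are written by hand rather than produced by an encoder.

```python
import pandas as pd

# Hypothetical toy table with one numerical and one categorical column
toy = pd.DataFrame(
    {"area": [1200, 1500], "mainroad": ["yes", "no"]},
    index=[542, 496],
)
# Hand-written encoded columns carrying the same row labels as toy
toy_encoded = pd.DataFrame(
    {"mainroad_no": [0.0, 1.0], "mainroad_yes": [1.0, 0.0]},
    index=toy.index,
)

toy_no_cat = toy.drop(["mainroad"], axis=1)                    # returns a new DataFrame
toy_combined = pd.concat([toy_no_cat, toy_encoded], axis=1)   # side by side, matched on index

print("mainroad" in toy.columns)  # the original DataFrame is unchanged
print(toy_combined)
```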

## 7. Model Training

We have now preprocessed our data, and we’re ready to build our machine-learning model! In this case, we will use a [Random Forest Regressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). So, what is a random forest? Think of it as a group of decision trees. A decision tree is a model that predicts an outcome based on certain criteria. For example, it might predict the price of a house by asking questions like: “How many bedrooms does the house have?” or “What is the area?” The tree then branches out to a prediction. A Random Forest takes this idea further. It builds many decision trees and combines their predictions to make a final decision. Imagine a forest with many trees: each tree gives an opinion (a prediction), and the forest as a whole makes a better, more accurate decision by considering all of these opinions. This collective approach reduces the risk of overfitting, when a model becomes too focused on the training data and doesn't perform well on new, unseen data. For the sake of time, we won't cover the algorithm here, but only present the intuition. You are welcome to dive deeper into the model at home. You might be interested in the concepts of [supervised learning and unsupervised learning](https://www.ibm.com/think/topics/supervised-vs-unsupervised-learning) here. Intuitively:

- **Supervised Learning**: A type of machine learning where the model is trained on labeled data, learning to predict the output (target) from input features.
- **Unsupervised Learning**: A type of machine learning where the model works with unlabeled data, aiming to find hidden patterns or groupings within the data.

Random Forest is a supervised learning algorithm because it learns from labeled data. Specifically, it’s a regression algorithm in this case, meaning it’s predicting a continuous value (like a price). Each individual decision tree in a random forest is trained on a random sample of the training data (by default drawn with replacement), and then the model makes predictions by averaging the results from all the decision trees. The procedure is as follows:

1. **Training the Model**: The model learns from labeled training data, where it has both the features (e.g., number of bedrooms, area of the house) and the target (the actual house price).
2. **Prediction**: After training, the model is able to make predictions on new data (test data), even though it hasn’t seen that data before. This is what makes the Random Forest a supervised learning method. It uses known labels during training to make accurate predictions.
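
The averaging idea can be verified directly, because a fitted `RandomForestRegressor` exposes its individual trees in the attribute `estimators_`. The sketch below uses synthetic data generated only for illustration (`X_toy`, `y_toy`, and the chosen coefficients are assumptions, not part of the housing data).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data: the target depends on two features plus noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 2))
y_toy = 3.0 * X_toy[:, 0] + 2.0 * X_toy[:, 1] + rng.normal(0, 1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X_toy, y_toy)

# Each fitted tree is available in estimators_ and can predict on its own
per_tree = np.array([tree.predict(X_toy[:5]) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

print(per_tree.shape)  # one row per tree, one column per sample
print(np.allclose(averaged, forest.predict(X_toy[:5])))
```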


Now, let's first import the model:

In [24]:
from sklearn.ensemble import RandomForestRegressor

### Creating the Model

We now create our RandomForestRegressor object. By setting `random_state=0`, we make sure that every time someone runs this code, they get the same results, which is important for reproducibility.

In [25]:
model = RandomForestRegressor(random_state=0)

This model is now ready for training. The next step is to “teach” it how to predict. This is done by calling the `fit()` function, where we pass in our training data (`X_train_no_cat` and `Y_train`). `X_train_no_cat` contains the features (like the number of bedrooms, area, etc.), and `Y_train` contains the target variable (the price of the house).

In [26]:
model.fit(X_train_no_cat, Y_train)

### In-Sample Predictions

After training the model, we can use it to make predictions on the training data. When we predict using the training dataset, it’s called in-sample prediction. This is because we are using the same data the model has already seen during training. It helps us assess how well the model has learned. For example, here’s how we make predictions on the training data using the `predict()` function:


In [27]:
in_sample_prediction = model.predict(X_train_no_cat)

The next step is to assess the model’s performance by calculating its error using mean absolute error (MAE) and mean absolute percentage error (MAPE). These metrics help us understand how far off our predictions are from the true values:

#### Mean Absolute Error (MAE)
The Mean Absolute Error (MAE) measures the average magnitude of the errors between the predicted and actual values. It calculates the absolute difference between each predicted value ($\hat{y}_i$) and the true value ($y_i$), then averages these differences across all data points. The formula for MAE is:

$$
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$

Where:
- $n$ is the number of data points,
- $y_i$ is the true value,
- $\hat{y}_i$ is the predicted value.

This metric is useful when you want to know the average prediction error in the same units as the target variable.

#### Mean Absolute Percentage Error (MAPE)
The Mean Absolute Percentage Error (MAPE) is similar to MAE, but instead of using the absolute difference, it uses the relative difference expressed as a percentage. This metric shows how large the error is relative to the actual value. The formula for MAPE is:

$$
MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100
$$

Where:
- $n$ is the number of data points,
- $y_i$ is the true value,
- $\hat{y}_i$ is the predicted value.

MAPE is helpful for understanding the error in percentage terms, making it easier to compare model performance across datasets with different scales. However, it can be misleading if the actual values ($y_i$) are close to zero, because dividing by a very small number leads to very high error percentages.
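
As a small worked example, the sketch below computes both metrics by hand for three made-up predictions and compares them with the scikit-learn functions. Note that scikit-learn's `mean_absolute_percentage_error` returns a fraction, so it is multiplied by 100 here to match the formula above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Hypothetical true and predicted values
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

abs_errors = np.abs(y_true - y_pred)         # 10, 20, 0
mae_manual = abs_errors.mean()               # (10 + 20 + 0) / 3 = 10
mape_manual = (abs_errors / y_true).mean()   # (0.10 + 0.10 + 0) / 3, as a fraction

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mape_manual * 100, mean_absolute_percentage_error(y_true, y_pred) * 100)
```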



Now, let's calculate these metrics. `sklearn` provides these metrics as built-in functions `mean_absolute_error` (MAE) and `mean_absolute_percentage_error` (MAPE):

In [28]:
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

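To calculate the scores, we pass the true values and the predicted values to the functions. Note that `mean_absolute_percentage_error` returns the error as a fraction rather than multiplied by 100, so a value of 0.07 corresponds to 7%: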

In [29]:
mae = mean_absolute_error(Y_train, in_sample_prediction)
mape = mean_absolute_percentage_error(Y_train, in_sample_prediction)
print(f"Mean Absolute Error: {mae:.2f}\t Mean Absolute Percentage Error: {mape:.2f}")

Mean Absolute Error: 312757.81	 Mean Absolute Percentage Error: 0.07


### Evaluating the Test Data: Out-of-Sample Predictions

Next, we want to test the model's performance on new, unseen data (called out-of-sample data). This is crucial because a model might perform well on the training data (in-sample), but we want to know how it does with data it hasn’t encountered before.

To evaluate the model on the test data (`X_test`), we must preprocess it in the same way as the training data. This includes imputing missing values, scaling features, and encoding categorical variables, but this time using the `transform()` method, not `fit_transform()`. We use `transform()` because we don’t want to fit new scalers or imputers to the test data, which would introduce bias. Fitting the scaler or imputer to the test data would mean that the model has already had access to information from the test set, which could lead to overly optimistic results and an unfair evaluation of the model’s performance. This ensures that the model is evaluated in a more realistic scenario, where it has never seen the test data during training.
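
The difference between `transform()` and refitting can be seen on a handful of made-up numbers. In this sketch, `train_area` and `test_area` are illustrative arrays, and the refitted scaler is shown for comparison only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical areas; the training mean is 2000
train_area = np.array([[1000.0], [2000.0], [3000.0]])
test_area = np.array([[2000.0], [5000.0]])

toy_scaler = StandardScaler()
toy_scaler.fit(train_area)                      # statistics come from training data only
test_scaled = toy_scaler.transform(test_area)   # reuse them on the test data

# For comparison only: refitting on the test data changes the reference point
leaky_scaled = StandardScaler().fit_transform(test_area)

print(toy_scaler.mean_)       # mean learned from the training data
print(test_scaled.ravel())    # 2000 maps to 0 because it equals the training mean
print(leaky_scaled.ravel())   # depends only on the test data
```

With the refitted scaler, a house of 2000 would receive a different scaled value than the same house in the training data, so the model would see inconsistent inputs.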

In [30]:
numeric_imputed_values =  pd.DataFrame(
    numerical_imputer.transform(X_test[["area"]]),index=X_test.index, columns=["area"]
)
categorical_imputed_values = pd.DataFrame(categorical_imputer.transform(X_test[["furnishingstatus"]]),index=X_test.index,columns=["furnishingstatus"])
X_test["area"] = numeric_imputed_values
X_test["furnishingstatus"] = categorical_imputed_values
# Scale the numerical columns with the scaler that was fitted on the training data
X_test[numerical_columns] = pd.DataFrame(
    scaler.transform(X_test[numerical_columns]), index=X_test.index, columns=numerical_columns
)

We then encode the categorical features with the encoder that was fitted on the training data. As before, we drop the original categorical columns and concatenate the encoded columns with the remaining numerical columns to prepare the test data for prediction:

In [31]:
encoded_values = pd.DataFrame(one_hot_encoder.transform(X_test[categorical_columns]),index=X_test.index,columns=one_hot_encoder.get_feature_names_out())
X_test_no_cat = X_test.drop(categorical_columns, axis=1)
# Concatenate the X_test DataFrame with the encoded values column-wise
X_test_no_cat = pd.concat([X_test_no_cat,encoded_values],axis=1)
X_test_no_cat.head()

Now, with the test data ready, we use our trained model to predict house prices on the test set:

In [32]:
predictions = model.predict(X_test_no_cat)
mape_oos = mean_absolute_percentage_error(Y_test, predictions)
mae_oos = mean_absolute_error(Y_test, predictions)
print(f"Mean Absolute Error: {mae_oos:.2f}\t Mean Absolute Percentage Error: {mape_oos:.2f}")


You may notice that the error on the test set is usually higher than the error on the training set. Some gap is normal because the model has never seen the test data. A small gap suggests that the model generalizes well, whereas a large gap is a sign of overfitting. It’s always a good idea to measure a model’s performance on data it hasn’t been trained on, since this gives us a realistic estimate of how it will perform in real-world scenarios.
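
The gap between in-sample and out-of-sample error can be reproduced on synthetic data. The sketch below uses scikit-learn's `make_regression` to generate a made-up dataset; the sample size, noise level, and variable names are assumptions chosen for illustration and are unrelated to the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic regression problem with noisy targets
X_toy, y_toy = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

forest = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

# Error on data the model was trained on vs. data it has never seen
mae_in = mean_absolute_error(y_tr, forest.predict(X_tr))
mae_out = mean_absolute_error(y_te, forest.predict(X_te))
print(f"in-sample MAE: {mae_in:.2f}\t out-of-sample MAE: {mae_out:.2f}")
```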


## 8. Conclusion

Through this tutorial, we’ve learned how to preprocess data, train a machine learning model, and evaluate its performance. By working with a Random Forest Regressor, we created a model that learns from patterns in data and makes predictions. We also saw the importance of evaluating the model on both the training data (in-sample) and new, unseen data (out-of-sample) to ensure that the model is reliable. 

By following the steps outlined here, you now have the tools to build, evaluate, and refine machine-learning models. This process of data preprocessing, model training, and evaluation is fundamental in any machine learning project.

Assignment 1 will be released next week, and the concepts you've learned here will be really helpful as you tackle it. In the meantime, keep practicing, and happy coding!