# Basic Supervised Learning Model
Learn the basics of building a supervised learning model.

In [1]:
# Install the project libraries
!pip install pandas
!pip install -U scikit-learn
!pip install -U matplotlib



## Medellin properties price prediction
Using the 2023 Medellin properties dataset, predict property prices.

You can download the dataset from: https://www.kaggle.com/datasets/cesaregr/medelln-properties

In [2]:
# Import python libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

### 1. Frame the problem
The goal is to predict the price of Medellin properties from their features: geolocation, area, number of bathrooms, number of rooms, etc.

### 2. Select the Performance Measure
To measure how well the model is doing, we compare the real and predicted values. The performance measure used here is Mean Squared Error (MSE), the average of the squared differences between the real and predicted prices. Later we will look at how it works in more detail.
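
As a first look at what MSE computes, the minimal sketch below averages the squared differences between real and predicted values. The function name `mean_squared_error_manual` and the sample prices are made up for illustration; they do not come from the dataset.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between real and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Hypothetical prices (in millions of pesos), for illustration only
real_prices = [435, 680, 900]
predicted_prices = [445, 660, 930]

mse_example = mean_squared_error_manual(real_prices, predicted_prices)
print(mse_example)
```

Because the errors are squared, large misses weigh much more than small ones, and the result is expressed in squared price units.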

### 3. Download and read the Data
The data is in the file named 'medellin_properties.csv'

### 4. Take a look at the dataset - Analyze the dataset
Review the dataset: understand what it contains, check the data quality, and identify its limitations, risks, and non-relevant data.

In [3]:
# Read medellin_properties file
properties_filepath = 'medellin_properties.csv'
medellin_properties_df = pd.read_csv(properties_filepath)
medellin_properties_df

Unnamed: 0,neighbourhood,latitude,longitude,property_type,price,rooms,baths,area,administration_price,age,garages,stratum
0,Suramerica,6.186203,-75.599437,Apartamento,435000000,3.0,2.0,83.0,354400.0,1.0,1.0,4
1,Escobero,6.162800,-75.573519,Apartamento,680000000,3.0,4.0,124.0,480000.0,2.0,2.0,4
2,Castropol,6.216140,-75.566970,Apartamento,900000000,3.0,3.0,111.0,813000.0,1.0,2.0,6
3,Toledo,6.162762,-75.639307,Casa,650000000,3.0,2.0,127.0,0.0,2.0,1.0,4
4,La pilarica,6.247638,-75.565815,Apartamento,320000000,3.0,2.0,72.0,250000.0,,2.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...
9923,Loma de los Bernal,6.213318,-75.607806,Apartamento,685000000,3.0,3.0,94.0,380000.0,1.0,2.0,5
9924,Conjunto residencial san jose,6.157310,-75.578221,Apartamento,582000000,3.0,2.0,88.0,0.0,2.0,1.0,4
9925,Centro,6.244732,-75.560750,Apartamento,320000000,3.0,2.0,59.0,0.0,,,4
9926,La america,6.254452,-75.611331,Apartamento,450000000,4.0,3.0,147.0,0.0,3.0,1.0,3
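
A few pandas calls can support the review described in this step. The sketch below runs them on a tiny hypothetical sample that reuses some of the dataset's column names; the `sample_df` name and its values are assumptions for illustration, and the same calls could be applied to `medellin_properties_df`.

```python
import pandas as pd

# Tiny hypothetical sample that reuses some of the dataset's column names
sample_df = pd.DataFrame({
    "property_type": ["Apartamento", "Casa", "Apartamento", "Apartamento"],
    "price": [435000000, 650000000, 320000000, 320000000],
    "area": [83.0, 127.0, 72.0, 59.0],
    "age": [1.0, 2.0, None, None],
    "garages": [1.0, 1.0, 2.0, None],
})

# Count missing values per column (a basic data quality check)
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Summary statistics for the numeric columns (helps spot outliers)
print(sample_df.describe())

# How often each category appears
type_counts = sample_df["property_type"].value_counts()
print(type_counts)
```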


### 5. Create a Test set
Create a test set to evaluate the model performance. The training set and the test set MUST NOT share relations or transformations, since that could bias the evaluation.

In [9]:
train_set, test_set = train_test_split(medellin_properties_df, test_size=0.2, random_state=42)
train_set

Unnamed: 0,neighbourhood,latitude,longitude,property_type,price,rooms,baths,area,administration_price,age,garages,stratum
6723,Doce de Octubre,6.299431,-75.584218,Apartamento,130000000,4.0,3.0,72.00,0.0,2.0,,2
2291,Loma de las brujas,6.155710,-75.566498,Apartamento,1550000000,3.0,4.0,181.00,570000.0,1.0,2.0,5
9785,Loma del Indio,6.209064,-75.567698,Apartamento,432000000,3.0,2.0,76.00,358000.0,2.0,1.0,4
4170,Las lomas no.1,6.203481,-75.571659,Apartamento,620000000,4.0,3.0,117.00,480000.0,3.0,1.0,6
5804,Robledo,6.284734,-75.614330,Apartamento,89000000,2.0,1.0,45.95,15000.0,,,1
...,...,...,...,...,...,...,...,...,...,...,...,...
5734,Benedictinos,6.179084,-75.574668,Casa,2350000000,3.0,4.0,330.00,900000.0,2.0,2.0,5
5191,LOMA DEL ESMERALDAL,6.167286,-75.583696,Apartamento,500000000,3.0,2.0,92.00,0.0,,,5
5390,Las Palmas,6.154057,-75.532461,Casa,1890000000,4.0,5.0,220.00,570000.0,,4.0,6
860,Los naranjos,6.187188,-75.659903,Apartamento,800000000,2.0,3.0,120.00,500000.0,1.0,1.0,5
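
One way to keep the two sets independent is to learn any transformation parameter from the training set only and then reuse it on the test set. The sketch below does this for filling missing values with a median. The `age` values and the `fill_missing` helper are hypothetical, and median imputation is just one illustrative transformation, not a step this notebook performs.

```python
from statistics import median

# Hypothetical 'age' values; None marks a missing value
train_age = [2.0, 1.0, 2.0, 3.0, None]
test_age = [1.0, None, 3.0]

# Learn the fill value from the training set ONLY
train_median = median(v for v in train_age if v is not None)

def fill_missing(values, fill_value):
    """Replace None entries with fill_value."""
    return [fill_value if v is None else v for v in values]

# Apply the same training-derived value to both sets
train_age_filled = fill_missing(train_age, train_median)
test_age_filled = fill_missing(test_age, train_median)
print(train_age_filled, test_age_filled)
```

Computing the median over the full dataset instead would let information from the test rows influence the training data.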


### 6. Prepare Data for Machine Learning Algorithms
Apply the transformations needed to prepare the data for training: remove irrelevant features, handle outliers and anomalies, handle missing values, scale features, etc. In this section we only remove the irrelevant features; in following sessions we will apply more transformations to improve the data quality and better prepare the data.

In [10]:
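# Drop the two text columns; reassigning instead of using inplace=True
# avoids a possible SettingWithCopyWarning on the split DataFrames
train_set = train_set.drop(columns=['neighbourhood', 'property_type'])
test_set = test_set.drop(columns=['neighbourhood', 'property_type'])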
train_set

Unnamed: 0,latitude,longitude,price,rooms,baths,area,administration_price,age,garages,stratum
6723,6.299431,-75.584218,130000000,4.0,3.0,72.00,0.0,2.0,,2
2291,6.155710,-75.566498,1550000000,3.0,4.0,181.00,570000.0,1.0,2.0,5
9785,6.209064,-75.567698,432000000,3.0,2.0,76.00,358000.0,2.0,1.0,4
4170,6.203481,-75.571659,620000000,4.0,3.0,117.00,480000.0,3.0,1.0,6
5804,6.284734,-75.614330,89000000,2.0,1.0,45.95,15000.0,,,1
...,...,...,...,...,...,...,...,...,...,...
5734,6.179084,-75.574668,2350000000,3.0,4.0,330.00,900000.0,2.0,2.0,5
5191,6.167286,-75.583696,500000000,3.0,2.0,92.00,0.0,,,5
5390,6.154057,-75.532461,1890000000,4.0,5.0,220.00,570000.0,,4.0,6
860,6.187188,-75.659903,800000000,2.0,3.0,120.00,500000.0,1.0,1.0,5


### 7. Train the model
Select a model and train it on the training dataset. In following sessions we will improve (fine-tune) the model.

The ML model selected is the Random Forest algorithm.

In [11]:
# Build the X (features) and y (target) datasets for training and testing
y_train = train_set['price']
X_train = train_set.drop(columns='price')
y_test = test_set['price']
X_test = test_set.drop(columns='price')

In [12]:
X_train

Unnamed: 0,latitude,longitude,rooms,baths,area,administration_price,age,garages,stratum
6723,6.299431,-75.584218,4.0,3.0,72.00,0.0,2.0,,2
2291,6.155710,-75.566498,3.0,4.0,181.00,570000.0,1.0,2.0,5
9785,6.209064,-75.567698,3.0,2.0,76.00,358000.0,2.0,1.0,4
4170,6.203481,-75.571659,4.0,3.0,117.00,480000.0,3.0,1.0,6
5804,6.284734,-75.614330,2.0,1.0,45.95,15000.0,,,1
...,...,...,...,...,...,...,...,...,...
5734,6.179084,-75.574668,3.0,4.0,330.00,900000.0,2.0,2.0,5
5191,6.167286,-75.583696,3.0,2.0,92.00,0.0,,,5
5390,6.154057,-75.532461,4.0,5.0,220.00,570000.0,,4.0,6
860,6.187188,-75.659903,2.0,3.0,120.00,500000.0,1.0,1.0,5


In [13]:
y_test

2688    890000000
4503    450000000
7135    460000000
7682    530000000
3127    290000000
          ...    
7517    280000000
6426    320000000
810     250000000
3281    260000000
6393    299000000
Name: price, Length: 1986, dtype: int64

##### ML Algorithms
* Linear Regression  
* Logistic Regression  
* Support Vector Machines (SVMs)  
* Decision Trees and Random Forests  
* Neural networks

To understand random forest:

https://www.youtube.com/watch?v=g9c66TUylZ4

https://www.youtube.com/watch?v=J4Wdy0Wc_xQ&t=91s
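
The core idea, training many trees on resampled data and averaging their predictions, can be sketched in plain Python. This is a deliberately simplified illustration, not scikit-learn's implementation: each "tree" is a single-split stump on one feature, there is no random feature selection, and all function names and the area/price values are made up.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single feature.

    Returns (threshold, left_mean, right_mean) minimizing squared error.
    """
    best = None
    # Every unique x except the largest is a candidate split point
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the mean everywhere
        overall = mean(ys)
        return (xs[0], overall, overall)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_bagged_stumps(xs, ys, n_estimators=25, seed=42):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """Average the predictions of all stumps."""
    return mean([predict_stump(s, x) for s in stumps])

# Hypothetical (area in m2, price in millions of pesos) pairs
areas = [45, 59, 72, 83, 111, 127, 181, 220]
prices = [89, 320, 320, 435, 650, 900, 1550, 1890]

stumps = fit_bagged_stumps(areas, prices)
small_area_pred = predict_bagged(stumps, 50)
large_area_pred = predict_bagged(stumps, 200)
print(small_area_pred, large_area_pred)
```

Averaging many models trained on different resamples tends to smooth out the errors of any single one, which is the motivation behind the `n_estimators` parameter of `RandomForestRegressor`.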

In [14]:
# Initialize the RandomForestRegressor
regr = RandomForestRegressor(n_estimators=100, random_state=42, criterion='squared_error')
# Fit the model on the training data
regr.fit(X_train, y_train)

### 8. Evaluate Model performance
Check how well the model is doing using the performance metric (cost function), Mean Squared Error (MSE) in this case.

In [15]:
# Make predictions on the test data
y_pred = regr.predict(X_test)
# Evaluate the model using the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f'MSE: {mse:.2f}')

MSE: 862857707347354583040.00
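
Because MSE is expressed in squared price units, the value above is hard to read. A common follow-up is to take its square root (RMSE), which is in the same units as the price. The sketch below only applies `math.sqrt` to the MSE printed above; the variable names are illustrative.

```python
import math

# MSE value printed by the evaluation cell above
mse_value = 862857707347354583040.00

# RMSE is the square root of MSE, in the same units as the price
rmse_value = math.sqrt(mse_value)
print(f'RMSE: {rmse_value:.2f}')
```

Since squaring amplifies large misses, a handful of extreme prices can dominate this figure.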
