## Basic Decision Tree Regressor ML Algorithm for Temperature Prediction

In this brief notebook, we build a simple decision tree regression model to predict temperature from humidity, wind speed, wind bearing, and visibility.

To do this, we import the data from a CSV file, explore it briefly, and then train a baseline model followed by a somewhat more accurate one.

by Davide Nardini (https://github.com/dnardini16)

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import r2_score,mean_squared_error

In [2]:
## import the data and print the first 5 rows
data = pd.read_csv("weather.csv", index_col=[0])
data.head()

| | Humidity | Wind Speed (km/h) | Wind Bearing (degrees) | Visibility (km) | Temperature (C) |
|---|---|---|---|---|---|
| 0 | 0.89 | 14.1197 | 251.0 | 15.8263 | 9.472222 |
| 1 | 0.86 | 14.2646 | 259.0 | 15.8263 | 9.355556 |
| 2 | 0.89 | 3.9284 | 204.0 | 14.9569 | 9.377778 |
| 3 | 0.83 | 14.1036 | 269.0 | 15.8263 | 8.288889 |
| 4 | 0.83 | 11.0446 | 259.0 | 15.8263 | 8.755556 |


In [3]:
## print statistics of data
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Humidity | 96453.0 | 0.734899 | 0.195473 | 0.0 | 0.6 | 0.78 | 0.89 | 1.0 |
| Wind Speed (km/h) | 96453.0 | 10.81064 | 6.913571 | 0.0 | 5.8282 | 9.9659 | 14.1358 | 63.8526 |
| Wind Bearing (degrees) | 96453.0 | 187.509232 | 107.383428 | 0.0 | 116.0 | 180.0 | 290.0 | 359.0 |
| Visibility (km) | 96453.0 | 10.347325 | 4.192123 | 0.0 | 8.3398 | 10.0464 | 14.812 | 16.1 |
| Temperature (C) | 96453.0 | 11.932678 | 9.551546 | -21.822222 | 4.688889 | 12.0 | 18.838889 | 39.905556 |


In [4]:
## check for null values in data
data.isnull().sum()

Humidity                  0
Wind Speed (km/h)         0
Wind Bearing (degrees)    0
Visibility (km)           0
Temperature (C)           0
dtype: int64

Split the data into training and test sets for X (features) and y (target), holding out 25% of the rows for testing.

In [5]:
y = data["Temperature (C)"].values
data = data.drop(["Temperature (C)"],axis=1)
X = data.values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=16)
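
To make the effect of `test_size=0.25` concrete, here is a minimal sketch on a small toy array. The names `X_toy` and `y_toy` are hypothetical and not part of the weather data; the point is only that 25% of the rows go to the test set and that features and targets stay aligned after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows with 4 features each, plus a target.
# Row i is [4i, 4i+1, 4i+2, 4i+3] and its target is i.
X_toy = np.arange(400).reshape(100, 4)
y_toy = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=16
)

# 75 rows for training, 25 rows held out for testing
print(X_tr.shape, X_te.shape)

# Rows are shuffled, but each feature row still matches its target
print(np.array_equal(X_tr[:, 0] // 4, y_tr))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows end up in the test set on every run.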

In [6]:
## fit first and baseline model
dt = DecisionTreeRegressor(random_state=0)
dt.fit(X_train,y_train)

In [7]:
y_pred = dt.predict(X_test)

In [8]:
## array of predicted values, one for each row of X_test
print(y_pred)

[15.36666667 29.95       18.78888889 ... 13.62222222 16.15
 27.19444444]


In [9]:
## evaluation metrics for the regression task
print('R-squared Score:', r2_score(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

R-squared Score: 0.15158190763222024
Mean Squared Error: 77.31699885888486


R² (R-squared) is a goodness-of-fit metric that is at most 1. A value of 1 means a perfect fit, while 0 means the model does no better than always predicting the mean of the target; negative values mean it does worse than that. A score of 0.15 is poor, so next we tune a hyperparameter to improve it.
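
As a small illustration of how this score behaves, the sketch below computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `r2_score`. The arrays are made-up toy values, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values (hypothetical), chosen only to illustrate the formula
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 8.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))  # both approximately 0.925

# Always predicting the mean of y gives R^2 = 0
y_mean = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, y_mean)
print(r2_mean)

# Predictions worse than the mean give a negative R^2
y_bad = y_true[::-1].copy()
r2_bad = r2_score(y_true, y_bad)
print(r2_bad)
```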

In [10]:
## limit max_depth (a form of pre-pruning) to improve generalization
dt = DecisionTreeRegressor(random_state=0, max_depth=6)

dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)

print('R-squared Score:', r2_score(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))

R-squared Score: 0.5225393634873462
Mean Squared Error: 43.51135816232535
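
The value `max_depth=6` above was set by hand. One possible way to choose the depth more systematically is a cross-validated grid search on the training set. The sketch below is not part of the original notebook: it runs on synthetic data (the names `X_syn` and `y_syn` and the candidate depths are assumptions made for illustration), so its scores say nothing about the weather dataset.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (hypothetical): the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 1, size=(500, 3))
y_syn = 10 * X_syn[:, 0] + rng.normal(0, 2.0, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=16
)

# Try several depths; None means the tree grows without a depth limit
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [2, 4, 6, 8, None]},
    scoring="r2",
    cv=5,
)
search.fit(X_tr, y_tr)

print("Best max_depth:", search.best_params_["max_depth"])
print("Cross-validated R-squared:", search.best_score_)

# The test set is used only once, for the final estimate
test_r2 = search.best_estimator_.score(X_te, y_te)
print("Test R-squared:", test_r2)
```

Selecting the depth with cross-validation on the training data keeps the test set untouched until the end, so the final score is a fairer estimate of performance on unseen rows.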
