## Global Average Temperature One-Step Forecasting Using a Random Forest Ensemble

In [1]:
# Import the required libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error


In this first section, we import the necessary libraries: pandas for data manipulation, RandomForestRegressor from sklearn.ensemble for building the random forest model, and mean_squared_error from sklearn.metrics for evaluating it.

In [2]:
# Load and preprocess the data
data = pd.read_csv('https://raw.githubusercontent.com/ashiquebiniqbal/Global-Climate-Change-Data-forecsat/main/GlobalLandTemperatures.csv', parse_dates=['dt'])


In this section, we load the data from a remote CSV file using pandas' read_csv function. The parse_dates argument converts the date column 'dt' to a datetime type.
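To illustrate what parse_dates does without downloading the file, here is a minimal sketch that reads a small in-memory CSV. The three rows in `csv_text` are a hypothetical stand-in, not the contents of the real file, which may have more columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for the remote CSV; the values are made up.
csv_text = """dt,AverageTemperature
1750-01-01,3.0
1750-02-01,
1750-03-01,5.5
"""

sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["dt"])

# 'dt' is now a datetime column, and the empty field is read as NaN.
print(sample.dtypes)
print(sample)
```

Without parse_dates, the 'dt' column would be read as plain strings and date-based operations such as resampling would not work on it.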

In [3]:
data = data[['dt', 'AverageTemperature']].groupby('dt').mean().resample('MS').mean().reset_index().dropna()
data = data.set_index('dt')


Next, we keep only the 'dt' and 'AverageTemperature' columns, group by date ('dt') and average all readings that share a date, then resample to month-start frequency ('MS') so that each row represents one month. After reset_index, dropna removes months with no temperature value, so the resulting series can contain gaps. Finally, we set the datetime column ('dt') as the index.
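The effect of each step in that chain can be seen on a small example. This is a sketch on hypothetical toy data (two readings per date, with February missing), not on the real dataset.

```python
import pandas as pd

# Hypothetical toy data: two readings on each date, and no February at all.
toy = pd.DataFrame({
    "dt": pd.to_datetime(["2000-01-01", "2000-01-01", "2000-03-01", "2000-03-01"]),
    "AverageTemperature": [1.0, 3.0, 10.0, 12.0],
})

per_date = toy.groupby("dt").mean()        # one row per date: 2.0 and 11.0
monthly = per_date.resample("MS").mean()   # regular monthly index; February is NaN
cleaned = monthly.reset_index().dropna().set_index("dt")  # February row is dropped

print(monthly)
print(cleaned)
```

Note that `cleaned` jumps straight from January to March, which matters for any later step that assumes consecutive rows are consecutive months.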

In [4]:
for i in range(1, 13):
    data[f't_{i}'] = data['AverageTemperature'].shift(i)


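In this loop, we create 12 lagged features (t_1, t_2, ..., t_12) from the target variable 'AverageTemperature' using the shift method. shift(i) moves the values down by i rows, so t_i holds the temperature from i rows earlier. Because months with missing values were dropped in the previous step, i rows correspond to i calendar months only where the series has no gaps.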
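As a small illustration of how shift builds lag columns, the sketch below uses a hypothetical five-value series and only two lags instead of twelve.

```python
import pandas as pd

# Hypothetical toy series standing in for the monthly temperatures.
frame = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], name="AverageTemperature").to_frame()

# Same pattern as the notebook loop, with 2 lags instead of 12.
for i in range(1, 3):
    frame[f"t_{i}"] = frame["AverageTemperature"].shift(i)

print(frame)

# The first rows have no history, so their lag columns are NaN; dropna removes them.
complete = frame.dropna()
print(complete)
```

With 12 lags, the first 12 rows of the real series are lost in the same way, since they lack a full year of history.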

In [5]:
data.dropna(inplace=True)


After creating the lagged features, we remove any rows containing null values.

In [6]:
# Split the data into lagged features and temperature values
lags = [col for col in data.columns if col.startswith('t_')]
X = data[lags]
y = data['AverageTemperature']


Here, we separate the features from the target. 'lags' is the list of lagged column names ('t_1' through 't_12'), 'X' holds those columns as the feature matrix, and 'y' holds the target variable ('AverageTemperature'). The model uses 'X' to predict 'y'.

In [11]:
print(X)

                  t_1        t_2        t_3        t_4        t_5        t_6  \
dt                                                                             
1745-04-01   0.627462  -2.563385  -3.186000  -0.970615   3.977538   8.212923   
1750-01-01   6.661462   0.627462  -2.563385  -3.186000  -0.970615   3.977538   
1750-02-01  -1.912077   6.661462   0.627462  -2.563385  -3.186000  -0.970615   
1750-03-01  -0.215231  -1.912077   6.661462   0.627462  -2.563385  -3.186000   
1750-04-01   3.537692  -0.215231  -1.912077   6.661462   0.627462  -2.563385   
...               ...        ...        ...        ...        ...        ...   
2013-05-01  19.983010  17.339370  14.615210  12.753660  13.571430  16.948740   
2013-06-01  23.405960  19.983010  17.339370  14.615210  12.753660  13.571430   
2013-07-01  24.341760  23.405960  19.983010  17.339370  14.615210  12.753660   
2013-08-01  24.951320  24.341760  23.405960  19.983010  17.339370  14.615210   
2013-09-01  24.770230  24.951320  24.341

Here, we print the lagged feature matrix X, which has 12 columns and 3155 rows.


In [7]:
# Split the data into training and test sets
train_size = int(len(data) * 0.8)
X_train, X_test = X.iloc[:train_size, :], X.iloc[train_size:, :]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]


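We split the data into training and test sets chronologically, without shuffling: the first 80% of the rows form the training set and the remaining 20% form the test set, so the model is always evaluated on dates later than those it was trained on.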
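A quick way to sanity-check a positional split like this one is to confirm that every training date precedes every test date. The sketch below does so on a hypothetical ten-row frame.

```python
import pandas as pd

# Hypothetical toy frame with a monthly index, standing in for `data`.
index = pd.date_range("2000-01-01", periods=10, freq="MS")
toy = pd.DataFrame({"t_1": range(10), "AverageTemperature": range(1, 11)}, index=index)

train_size = int(len(toy) * 0.8)
train, test = toy.iloc[:train_size], toy.iloc[train_size:]

print(len(train), len(test))
# True when the split is strictly chronological.
print(train.index.max() < test.index.min())
```

The check relies on the index being sorted by date, which is the case after resampling.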

In [8]:
# Train the random forest model
n_estimators = 100
rf = RandomForestRegressor(n_estimators=n_estimators)
rf.fit(X_train, y_train)


We initialize the random forest model and train it using the training data set. We set the number of trees (n_estimators) to 100.
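Because no random_state is passed above, the fitted forest, and therefore the exact predictions and error, can differ slightly between runs. The sketch below, on hypothetical synthetic data, shows that fixing random_state makes two fits agree, and that the forest's prediction is the average of its individual trees' predictions. The seed value 42 and the toy data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical synthetic data, not the temperature series.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy[:, 0] + 0.1 * rng.normal(size=200)

# Two forests with the same seed produce the same predictions.
rf_a = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_toy, y_toy)
rf_b = RandomForestRegressor(n_estimators=20, random_state=42).fit(X_toy, y_toy)
same = np.allclose(rf_a.predict(X_toy), rf_b.predict(X_toy))

# The ensemble prediction is the mean over the individual trees.
tree_mean = np.mean([tree.predict(X_toy) for tree in rf_a.estimators_], axis=0)
is_average = np.allclose(tree_mean, rf_a.predict(X_toy))

print(same, is_average)
```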

In [9]:
# Make one-step forecasts on the test set
y_pred = rf.predict(X_test)
print(f"One-step forecast: {y_pred[0:45]}")


One-step forecast: [16.5105164  19.65026973 22.3246089  23.74421974 24.286512   24.0420342
 22.37924293 19.75641757 16.31557291 13.1010977  12.11247746 13.56556758
 16.42043947 19.85971782 22.3916157  23.83613515 24.16884924 23.9707265
 22.3259108  19.5676231  16.2321491  12.9883292  12.16955001 13.42120254
 16.4228962  19.55852627 22.29106567 23.64495926 24.20936348 23.88651964
 22.44562959 19.62690727 16.37518595 13.29055805 12.04594289 13.24450777
 16.34475116 19.76048556 22.2750089  23.7055191  24.17084593 24.04949587
 22.26498739 19.54253063 16.36538993]


Using the trained model, we make one-step forecasts on the test set by calling the predict method. We print the first 45 predicted values.
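"One-step" here means that each prediction is computed from observed lag values, since X_test contains the actual past temperatures. The sketch below contrasts this with a recursive multi-step forecast, in which earlier predictions are fed back in as lags. The `predict_next` function is a hypothetical stand-in for a trained model, and the numbers are made up.

```python
def predict_next(lags):
    """Hypothetical stand-in model: average of the two most recent values."""
    return (lags[0] + lags[1]) / 2


history = [10.0, 12.0]        # observed values before the test period
actual = [14.0, 16.0, 18.0]   # observed values during the test period

# One-step forecasts: every prediction is built from observed values only.
observed = history + actual
one_step = [predict_next([observed[i + 1], observed[i]]) for i in range(len(actual))]

# Recursive forecasts: earlier predictions are fed back in as lags.
window = list(history)
recursive = []
for _ in actual:
    nxt = predict_next([window[-1], window[-2]])
    recursive.append(nxt)
    window.append(nxt)

print(one_step)
print(recursive)
```

One-step errors therefore tend to understate how the model would perform when forecasting many months ahead without new observations.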

In [10]:
# Evaluate the model using mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean squared error: {mse}")

Mean squared error: 1.9703710553431306


In this section, we evaluate the model using mean squared error (MSE), the average of the squared differences between the actual values ('y_test') and the predicted values ('y_pred'), computed with the mean_squared_error function from sklearn.metrics. The lower the MSE, the better the model's predictions. Because MSE is expressed in squared temperature units, its square root (RMSE) is easier to interpret: here it is about 1.40, in the same units as the temperatures.
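An error value is easier to judge next to a simple baseline. The sketch below computes MSE and RMSE by hand for a model and for a naive forecast on hypothetical numbers. In this notebook, one plausible baseline (an assumption, not something the original evaluates) would be the 't_12' column of X_test, that is, the value from 12 rows earlier.

```python
import numpy as np

# Hypothetical values for illustration only.
y_true = np.array([10.0, 12.0, 14.0, 13.0])
y_model = np.array([11.0, 12.0, 13.0, 13.0])
y_naive = np.array([9.0, 14.0, 14.0, 10.0])   # e.g. the value from 12 rows earlier


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return float(np.mean((actual - predicted) ** 2))


mse_model = mse(y_true, y_model)
mse_naive = mse(y_true, y_naive)
rmse_model = mse_model ** 0.5   # back in the original units

print(mse_model, mse_naive, rmse_model)
```

A model is only adding value if its error is clearly below that of such a baseline.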