# Classification with Random Forest Classifier  

After cleaning and analysing the dataset, we proceed with the classification, which we carry out with the random forest algorithm.

![random-forest.jpg](random-forest.jpg)

## Preparing the data 

We import the relevant libraries, in particular the scikit-learn modules we need:

In [None]:
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, roc_curve, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings('ignore')

We import the data from `TWER_grouped_class.csv`:

In [None]:
df = pd.read_csv('../data/processed/TWER_grouped_class.csv').drop('Unnamed: 0', axis=1)

In [None]:
df.head()

Before continuing, we need to remove the last day (2013-12-31), on which we will make the prediction.

In [None]:
# Filter on the full date: `day != 31` would also drop the 31st of any other month
df = df[df['date'] != '2013-12-31']

Let's divide `df` into features and target:

In [None]:
X = df.drop('class', axis=1)
y = df['class']  # target is the multi-class label (High, Medium, Low)

### Preprocessing

We now preprocess the data.
The classifier only accepts numerical input, so the categorical (string) data must be converted to numbers.
Since this conversion only concerns categorical data, we first separate the numerical and categorical features, and then encode the categorical ones with label encoding.
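As a minimal sketch of what label encoding does, the toy example below (the labels are invented for illustration and are not taken from the dataset) shows that `LabelEncoder` sorts the distinct values and numbers them in that order:

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, invented for illustration only
labels = ["Medium", "High", "Low", "High"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# The classes are stored in sorted (alphabetical) order
print(encoder.classes_.tolist())  # ['High', 'Low', 'Medium']
print(codes.tolist())             # [2, 0, 1, 0]
```

This ordering is why the encoded classes 0, 1, 2 correspond to high, low and medium respectively.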

We then prepare for the classification by splitting the data into a training set and a test set. We hold out 20% of the available data for testing (`test_size=0.2`) and fix `random_state=20`.

The `random_state` acts as a seed: it makes the random split (and, when passed to the classifier, the randomness of the forest) identical every time the program is run.

During the tuning phase repeatability is essential, because it is the only way to compare parameter settings reliably.
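The effect of the seed can be checked with a small sketch on toy arrays (the arrays `data` and `target` are placeholders, not the real dataset): two calls with the same `random_state` return exactly the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and target, invented for illustration
data = np.arange(20).reshape(10, 2)
target = np.arange(10)

first = train_test_split(data, target, test_size=0.2, random_state=20)
second = train_test_split(data, target, test_size=0.2, random_state=20)

# Compare X_train, X_test, y_train, y_test of the two calls pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_split)      # True
print(len(first[1]))   # 2 (20% of the 10 rows go to the test set)
```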

Finally, we scale the numerical features, fitting the scaler on the training set only. The scaled values are written back into the corresponding dataframe columns, so numerical and categorical features stay together.
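A minimal sketch of this scaling step, on toy numbers chosen so the arithmetic is easy to follow: the scaler learns the mean and standard deviation from the training values only and reuses them for the test values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values: the training column has mean 1 and standard deviation 1
train = np.array([[0.0], [2.0]])
test = np.array([[4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # statistics are learned here
test_scaled = scaler.transform(test)        # and only applied here

print(train_scaled.ravel().tolist())  # [-1.0, 1.0]
print(test_scaled.ravel().tolist())   # [3.0]
```

Fitting on the training set alone keeps information about the test set out of the preprocessing.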

In [None]:
def preprocess_data(X, y, categorical_features, numerical_features, test_size=0.2, random_state=20):
    # Work on a copy so the caller's dataframe is not modified
    X = X.copy()

    # Label encode the target (classes are numbered in alphabetical order)
    le_target = LabelEncoder()
    y = le_target.fit_transform(y)

    # Label encode each categorical feature with its own encoder
    for feat in categorical_features:
        X[feat] = LabelEncoder().fit_transform(X[feat])

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    # Copy the splits so the assignments below do not act on views of X
    X_train, X_test = X_train.copy(), X_test.copy()

    # Scale numerical features: fit on the training set only, then apply to both
    scaler = StandardScaler()
    X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
    X_test[numerical_features] = scaler.transform(X_test[numerical_features])

    return X_train, X_test, y_train, y_test


We then initialise and train the classifier. 

In [None]:
def train_and_evaluate(X_train, X_test, y_train, y_test, n_estimators, max_depth=None):
    # Initialize Random Forest Classifier
    rf_classifier = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=20)
    
    # Train the classifier
    rf_classifier.fit(X_train, y_train)
    
    # Predictions
    y_pred = rf_classifier.predict(X_test)
    
    # Print results
    print(f"Accuracy with n_estimators={n_estimators} and max_depth={max_depth}: {accuracy_score(y_test, y_pred)}")
    print(classification_report(y_test, y_pred))
    
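    # Confusion matrix (rows: true class, columns: predicted class)
    # labels=[0, 1, 2] keeps the matrix 3x3 even if a class is missing from the test set
    conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
    ax = plt.subplot()
    sns.heatmap(conf_matrix, annot=True, fmt='g', ax=ax)
    ax.set_xlabel('Predicted class')
    ax.set_ylabel('True class')
    # LabelEncoder numbers the classes alphabetically: High, Low, Medium
    ax.xaxis.set_ticklabels(['high', 'low', 'medium'])
    ax.yaxis.set_ticklabels(['high', 'low', 'medium'])
    plt.show()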
    
    return conf_matrix

During the development of our analysis we noticed that it was not obvious whether to use all the available data or only part of it.

We address this question by examining two datasets: one restricted to tweet counts and the ARPA weather data, and the largest possible one, which also includes electrical data and other potentially meaningful features such as `population` and `elevation`. In both cases each record carries its time slot and municipality.

We train on both datasets with a small forest `(n_estimators=4, max_depth=4)` and with a large forest `(n_estimators=100, no max_depth)`.
To do so, we use the functions defined above, which together execute all the steps of our classification for any given parameters.

Specifically, the functions:
1) Split the dataset into training and test data
2) Train the classifier
3) Compute the predictions
4) Print all the relevant metrics
5) Plot the confusion matrix as a heatmap
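The comparison between a small and a large forest can be sketched independently of our data. The example below is only an illustration under stated assumptions: it uses synthetic three-class data from `make_classification` as a stand-in for the real dataset, and the names `X_demo`, `y_demo` and `scores` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=20
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=20)

# The two forest sizes used in this section
configs = {
    "small": dict(n_estimators=4, max_depth=4),
    "large": dict(n_estimators=100, max_depth=None),
}

scores = {}
for name, params in configs.items():
    forest = RandomForestClassifier(random_state=20, **params).fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, forest.predict(X_te))

print(scores)
```

The accuracies obtained on synthetic data say nothing about the real dataset; the sketch only shows the mechanics of running the same pipeline with different parameters.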

We create the two relevant dataframes:

In [None]:
# Dataset 1 (weather data)
X_weather = df.drop('class', axis=1).drop(['curr_cell', 'curr_site', 'population', 'elevation'], axis=1)
y_weather = df['class']

# Categorical and numerical features
categorical_features_weather = ['date', 'municipality.name', 'hour_category', 'month', 'day_of_week']
numerical_features_weather = ['temperature', 'minTemperature', 'maxTemperature', 'precipitation', 'wind_speed', 'wind_dir', 'tweet_count', 'day']

# Preprocess Dataset 1
X_weather_train, X_weather_test, y_weather_train, y_weather_test = preprocess_data(X_weather, y_weather, categorical_features_weather, numerical_features_weather)

In [None]:
# Dataset 2 (weather + other)
X_additional = df.drop('class', axis=1)
y_additional = df['class']

# Categorical and numerical features
categorical_features_additional = ['date', 'municipality.name', 'hour_category', 'month', 'day_of_week']
numerical_features_additional = ['temperature', 'minTemperature', 'maxTemperature', 'precipitation', 'wind_speed', 'wind_dir', 
                                 'curr_cell', 'population', 'elevation', 'curr_site', 'tweet_count', 'day']

# Preprocess Dataset 2
X_additional_train, X_additional_test, y_additional_train, y_additional_test = preprocess_data(X_additional, y_additional, categorical_features_additional, numerical_features_additional)

## RF classification

We hereby present the four scenarios:

In [None]:
print("Scenario 1: Weather Data with n_estimators=4 and max_depth=4")
conf_matrix1 = train_and_evaluate(X_weather_train, X_weather_test, y_weather_train, y_weather_test, n_estimators=4, max_depth=4)

In [None]:
print("Scenario 2: Weather + Additional Data with n_estimators=4 and max_depth=4")
conf_matrix2 = train_and_evaluate(X_additional_train, 
                                  X_additional_test, 
                                  y_additional_train, 
                                  y_additional_test, 
                                  n_estimators=4, 
                                  max_depth=4)

In [None]:
print("Scenario 3: Weather Data with n_estimators=100 and no max_depth")
conf_matrix3 = train_and_evaluate(X_weather_train, 
                                  X_weather_test,
                                  y_weather_train, 
                                  y_weather_test,
                                  n_estimators=100, 
                                  max_depth=None)

In [None]:
print("Scenario 4: Weather + Additional Data with n_estimators=100 and no max_depth")
conf_matrix4 = train_and_evaluate(X_additional_train,
                                  X_additional_test,
                                  y_additional_train,
                                  y_additional_test,
                                  n_estimators=100, 
                                  max_depth=None)

The outcome of this comparison is not obvious a priori: the so-called additional data are not as well correlated with the tweet count as the weather data, so the risk that they are redundant or even irrelevant is high. Indeed, a first trial seemed to point in this direction. With redundant features the risk of overfitting grows, since the model may learn spurious patterns. If the "additional data" introduced noise rather than information, it would be better to remove that part of the dataset entirely.

Nevertheless, we observe a clear improvement when the new data is introduced, which signals that the additional data is indeed informative.
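One possible way to check whether a feature carries information, which is not the approach used in this notebook, is to inspect the impurity-based `feature_importances_` of a fitted forest. The sketch below uses synthetic data in which one feature (`signal`) determines the class and the other (`noise`) is unrelated to it; both names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(20)
n = 500
signal = rng.normal(size=n)   # feature that determines the class
noise = rng.normal(size=n)    # feature unrelated to the class

X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=20).fit(X_demo, y_demo)

# Importances are normalised to sum to 1; the informative feature should dominate
importances = dict(zip(["signal", "noise"], forest.feature_importances_))
print(importances)
```

Impurity-based importances can be misleading for correlated or high-cardinality features, so they are best read as a rough indication only.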

Beyond that, we notice that small forests perform considerably worse than large ones: deeper trees and a sufficient number of estimators are crucial to generalisation.
This does not depend on the dataset but holds quite generally. The main reason to set a low number of estimators and a low `max_depth` is to keep training computationally feasible, which is not a concern for us since the whole process is very fast. If we were to analyse the telecommunication data, low `max_depth` and low `n_estimators` might well be the right choice.

As the accuracy indicates, the classifier performs well in every scenario, even with the small forest.
In the last case (large forest on the larger dataset), we obtain the promising result of just two misclassifications.

We point out that, as emerges from the confusion matrices, the error type is (almost) always the same in the four cases: some municipalities that are predicted to be in the medium tweet-count range actually turn out to be in the high range.
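To make the reading of the confusion matrix explicit, here is a small sketch on invented labels (not our results). In scikit-learn the rows are the true classes and the columns the predicted ones, so a "high" sample predicted as "medium" lands in row 0, column 2 under the encoding 0 = high, 1 = low, 2 = medium.

```python
from sklearn.metrics import confusion_matrix

# Encoded classes: 0 = high, 1 = low, 2 = medium (toy values)
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_hat = [0, 2, 0, 1, 1, 2, 2, 2]

cm = confusion_matrix(y_true, y_hat, labels=[0, 1, 2])

print(cm.tolist())  # [[2, 0, 1], [0, 2, 0], [0, 0, 3]]
print(cm[0, 2])     # 1: one truly "high" sample was predicted as "medium"
```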

This makes sense: while few events outside our dataset can cause a sharp short-term drop in the tweet count, many social events may cause an occasional peak. On 13 and 14 December 2013, the dates with the largest prediction errors, a big event called **Universiadi** was held in the municipality of Trento, where such a peak is registered.

## Prediction

Now that we have established the validity of our model, we can train it on all the data before December 31st, the day of the prediction.

In [None]:
df = pd.read_csv('../data/processed/TWER_grouped_class.csv').drop('Unnamed: 0', axis=1)

In [None]:
df.head()

In [None]:
X = df.drop('class', axis=1)
y = df['class']

In [None]:
categorical_features = ['date', 'municipality.name', 'hour_category', 'month', 'day_of_week']  # Any string or categorical features
numerical_features = ['temperature', 
                      'minTemperature', 
                      'maxTemperature', 
                      'precipitation', 
                      'wind_speed',
                      'wind_dir',
                      'curr_cell',
                      'population',
                      'elevation',
                      'tweet_count',
                      'curr_site',
                      'day']

In [None]:
# Selecting Dec 31st
len31 = len(X[X['date'] == '2013-12-31'])

The train-test split is now done by date: all days before Dec 31st form the training set, and Dec 31st forms the test set.
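A date-based hold-out can also be written with a boolean mask, which does not depend on the row order. The following is a minimal sketch on a toy dataframe (the values and the names `toy`, `train_part` and `test_part` are invented for illustration):

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    "date": ["2013-12-29", "2013-12-30", "2013-12-31", "2013-12-31"],
    "tweet_count": [5, 7, 9, 4],
})

# Rows of the prediction day go to the test set, all others to the training set
is_last_day = toy["date"] == "2013-12-31"
train_part = toy[~is_last_day].copy()
test_part = toy[is_last_day].copy()

print(len(train_part), len(test_part))  # 2 2
```

The mask has to be computed before the `date` column is label encoded, since the comparison is made on the original date strings.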

In [None]:
# Label encode the target with its own encoder, so it can still decode predictions later
le_target = LabelEncoder()
y = le_target.fit_transform(y)

# Encode the categorical features for the entire dataset before splitting
for feat in categorical_features:
    X[feat] = LabelEncoder().fit_transform(X[feat])

# Split after label encoding; this assumes the Dec 31st rows are the last len31 rows
n_train = len(X) - len31
X_train, X_test = X.iloc[:n_train].copy(), X.iloc[n_train:].copy()
y_train, y_test = y[:n_train], y[n_train:]

# Scale numerical features after splitting (fit on the training set only)
scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test[numerical_features] = scaler.transform(X_test[numerical_features])

# Training the model (the classifier and the metrics were imported at the top)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=20)
rf_classifier.fit(X_train, y_train)

# Predicting and evaluating
y_pred = rf_classifier.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
print(classification_report(y_test, y_pred))

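# Confusion matrix (rows: true class, columns: predicted class)
# labels=[0, 1, 2] keeps the matrix 3x3 even if a class does not occur on Dec 31st
conf_matrix = confusion_matrix(y_test, y_pred, labels=[0, 1, 2])
ax = plt.subplot()
sns.heatmap(conf_matrix, annot=True, fmt='g', ax=ax)
ax.set_xlabel('Predicted class')
ax.set_ylabel('True class')
ax.xaxis.set_ticklabels(['high', 'low', 'medium'])
ax.yaxis.set_ticklabels(['high', 'low', 'medium'])
plt.show()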

That's what we wanted.
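If the predictions are needed as readable labels rather than integers, they can be decoded with `inverse_transform`. The sketch below assumes an encoder fitted only on the target classes; `target_encoder` and `encoded_predictions` are hypothetical names, and the values are invented.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Encoder fitted only on the target classes (hypothetical stand-in)
target_encoder = LabelEncoder().fit(["High", "Low", "Medium"])

# Invented encoded predictions
encoded_predictions = np.array([2, 0, 0, 1])

decoded = target_encoder.inverse_transform(encoded_predictions)
print(decoded.tolist())  # ['Medium', 'High', 'High', 'Low']
```

This only works if the encoder used for the target has not been refitted on another column in the meantime.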