# Text Classification: Tree-based Models

Last time, we used a common machine learning algorithm, Logistic Regression, to classify a dataset of COVID-19-related tweets. Now we present another widely used class of models for text classification: tree-based models. We will apply these models to the same data as in the previous notebook.

Let's quickly import the preprocessed data again, so we can start experimenting with the new models!

In [3]:
import pandas as pd
df_train = pd.read_pickle('df_train.pkl')
df_test = pd.read_pickle('df_test.pkl')

Then, we represent the text as vectors using TF-IDF and encode the labels as numbers.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
vectorizer = TfidfVectorizer()

X_train = vectorizer.fit_transform(df_train['ProcessedTweet'])
y_train = label_encoder.fit_transform(df_train['Sentiment'])
X_test = vectorizer.transform(df_test['ProcessedTweet'])
y_test = label_encoder.transform(df_test['Sentiment'])
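
To make these two steps more tangible, here is a minimal sketch on three invented sentences. The texts, the labels, and the `toy_` names are hypothetical and are not part of the tweet dataset; they are used only to show what the vectorizer and the encoder return.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data, standing in for the tweets and their sentiment labels.
toy_texts = ["masks help a lot", "lockdown again", "masks sold out again"]
toy_labels = ["Positive", "Neutral", "Negative"]

# Each text becomes one row of a sparse matrix, each vocabulary word one column.
toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_texts)

# Each distinct label becomes an integer; classes are sorted alphabetically.
toy_encoder = LabelEncoder()
toy_y = toy_encoder.fit_transform(toy_labels)

print("matrix shape (texts, vocabulary size):", toy_X.shape)
print("vocabulary:", sorted(toy_vectorizer.vocabulary_))
print("classes:", list(toy_encoder.classes_), "encoded labels:", list(toy_y))
```

Note that the vectorizer and the encoder are fitted on the training data only and then reused on the test data, exactly as in the cell above.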

# Decision Tree

The first model we encounter is the Decision Tree. What is it? A Decision Tree is one of the most interpretable machine learning algorithms. One can visualize a decision tree as a series of conditions (called nodes) that, starting from the data, lead to a certain label (the leaves). To explain the key concept, consider a scenario in which the task is to predict whether a person is an adult or a child, given that person's height. We could assert that a person taller than 150 cm is predicted to be an adult, and a child otherwise. In this case, the input data is the height of the subject, the node (condition) is $\text{height} > 150\,\text{cm}$, and the labels are "Adult" and "Child". These labels are also the leaves of the tree. During training, the algorithm searches for the conditions that best split the data, so that each instance is classified as accurately as possible.
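
The height example can be reproduced with a small sketch. The heights and labels below are invented for illustration, and the tree is limited to a single condition with `max_depth=1`; on this data, the learned threshold should fall between the tallest child and the shortest adult.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical heights in cm and their labels, invented for illustration only.
heights = [[95], [110], [125], [140], [160], [170], [180], [190]]
labels = ["Child", "Child", "Child", "Child", "Adult", "Adult", "Adult", "Adult"]

# max_depth=1 allows exactly one condition (node) and two leaves.
toy_tree = DecisionTreeClassifier(max_depth=1, random_state=0)
toy_tree.fit(heights, labels)

# Print the learned condition and the label attached to each leaf.
print(export_text(toy_tree, feature_names=["height"]))

# Classify two new, unseen heights.
print(toy_tree.predict([[130], [175]]))
```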

Let's apply the Decision Tree algorithm to our data!

In [5]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print("Test Classification Report: \n", classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Test Classification Report: 
               precision    recall  f1-score   support

    Negative       0.64      0.60      0.62      1633
     Neutral       0.48      0.57      0.52       619
    Positive       0.65      0.65      0.65      1546

    accuracy                           0.61      3798
   macro avg       0.59      0.61      0.60      3798
weighted avg       0.62      0.61      0.62      3798



The results are quite disappointing: the overall accuracy is only 0.61, and the Neutral class is the weakest, with an F1-score of 0.52.

**Bonus**: can we improve the performance? This is a good opportunity to introduce the concept of cross-validation!

Until now, we have applied models with their default hyperparameters (i.e., we didn't pass any argument when instantiating a model). We can instead try different values for these hyperparameters and see whether some combinations work better than others! For the Decision Tree, we can tune these hyperparameters:

- max_depth: This parameter sets the maximum depth of the tree. Limiting the depth of the tree helps to prevent overfitting. If set to None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.

- min_samples_split: This parameter specifies the minimum number of samples required to split an internal node. A higher value can prevent the model from learning overly specific patterns in the training data, thus helping to reduce overfitting.

- min_samples_leaf: This parameter sets the minimum number of samples required to be at a leaf node. A higher value can smooth the model, making it more resistant to noisy data.

- max_features: This parameter specifies the maximum number of features to consider when looking for the best split. Limiting the number of features can introduce randomness into the model, helping to prevent overfitting.
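
As a rough illustration of how these hyperparameters constrain a tree, the sketch below fits a default tree and a constrained one. It uses synthetic data from `make_classification`, not the tweets, and the chosen values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the effect of the hyperparameters.
X_toy, y_toy = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Default settings: the tree grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Constrained settings: at most 3 levels and at least 10 samples per leaf.
small_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_toy, y_toy)

for name, tree in [("default", full_tree), ("constrained", small_tree)]:
    print(
        name,
        "| depth:", tree.get_depth(),
        "| leaves:", tree.get_n_leaves(),
        "| training accuracy:", round(tree.score(X_toy, y_toy), 3),
    )
```

The default tree fits the training data perfectly, which is exactly the behavior that can lead to overfitting; the constrained tree is much smaller and cannot memorize individual samples.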

We can use the `GridSearchCV` object to define some values for the hyperparameters and check which combinations fit better. It performs an exhaustive search over specified hyperparameter values for a model. It uses cross-validation, which means splitting the original training data into multiple training and validation sets to evaluate the model's performance, to find the best hyperparameter combination before evaluating the performance on the test set.

After fitting, the best combination of hyperparameters is available in the attribute `.best_params_` and the corresponding model in the attribute `.best_estimator_`. We can then evaluate that model on the test set.
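
To make the cross-validation step concrete, the sketch below scores two candidate values of `max_depth` with 3-fold cross-validation by hand. This is roughly what `GridSearchCV` does for every candidate in the grid. It assumes synthetic data and an arbitrary pair of candidates, so the numbers say nothing about the tweets.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the procedure.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# Three folds: each fold serves once as validation set, the other two as training set.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

cv_scores = {}
for depth in [None, 3]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # One accuracy value per fold; the test set is never touched here.
    cv_scores[depth] = cross_val_score(model, X_toy, y_toy, cv=cv)
    print(
        "max_depth =", depth,
        "| fold accuracies:", cv_scores[depth].round(3),
        "| mean:", round(cv_scores[depth].mean(), 3),
    )
```

The candidate with the highest mean validation score would be selected, and only then evaluated on the test set.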

In [6]:
# This cell may take some time to be executed...

from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [None, 10, 50],
    'min_samples_split': [2, 10],
    'min_samples_leaf': [1, 3, 7],
    'max_features': [None, 'sqrt']
}

# Create the Decision Tree model
dt = DecisionTreeClassifier()

# Run the hyperparameter search
grid_search = GridSearchCV(estimator=dt, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

# Best parameters found
best_params = grid_search.best_params_
print("Best parameters found: ", best_params)

# Predict on the test data with the best model found
best_dt = grid_search.best_estimator_
y_pred = best_dt.predict(X_test)

# Generate the classification report
report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
print(report)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
Best parameters found:  {'max_depth': None, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2}
              precision    recall  f1-score   support

    Negative       0.63      0.58      0.61      1633
     Neutral       0.49      0.57      0.53       619
    Positive       0.63      0.65      0.64      1546

    accuracy                           0.61      3798
   macro avg       0.58      0.60      0.59      3798
weighted avg       0.61      0.61      0.61      3798



We notice that, in this case, the best combination of hyperparameters among those in the grid is the default one, so the search brings no improvement. The small differences from the previous report most likely come from the randomness in tree construction, since no `random_state` is set. This model seems to have already reached its maximum performance, and we should consider other models for this task.

A single tree seems to be inadequate for the problem. Why don't we use multiple trees? This is the underlying idea of three so-called ensemble algorithms: Bagging, Random Forest, and Boosting.

# Bagging

Short for Bootstrap Aggregating, Bagging trains multiple decision trees, each on a bootstrap sample of the training data (a sample drawn with replacement). The final classification is then the majority class among the trees' predictions. As an example, we define a `BaggingClassifier` composed of 50 Decision Trees.
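
The two ingredients, bootstrap sampling and majority voting, can be sketched by hand as follows. This is a simplified illustration on synthetic data, not a description of how `BaggingClassifier` is implemented internally; the number of trees (25) is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only to illustrate the idea.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X_toy[idx], y_toy[idx]))

# Majority vote: every tree predicts every sample, the most frequent class wins.
all_votes = np.stack([t.predict(X_toy) for t in trees])  # shape (n_trees, n_samples)
majority = np.array([np.bincount(votes).argmax() for votes in all_votes.T])

print("training accuracy of the vote:", (majority == y_toy).mean())
```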

In [7]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

base_classifier = DecisionTreeClassifier()
bagging_classifier = BaggingClassifier(base_classifier, n_estimators=50, random_state=1998)

bagging_classifier.fit(X_train, y_train)

y_pred = bagging_classifier.predict(X_test)

print("Classification Report: \n", classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Classification Report: 
               precision    recall  f1-score   support

    Negative       0.70      0.67      0.69      1633
     Neutral       0.58      0.67      0.62       619
    Positive       0.72      0.70      0.71      1546

    accuracy                           0.69      3798
   macro avg       0.67      0.68      0.67      3798
weighted avg       0.69      0.69      0.69      3798



We notice that the performance has improved: accuracy rises from 0.61 to 0.69, and every class gains in F1-score. United we stand, divided we fall. :-)

# Random Forest

Random Forest creates a 'forest' of trees by randomly sampling both data points and features. As in Bagging, each tree is trained on a different bootstrap sample of the data, and the final prediction is made by majority vote over all the trees. Unlike Bagging, each tree also considers only a random subset of the features at each split. This extra randomness makes the trees less correlated, which further reduces overfitting and typically improves performance, and it is the reason why Random Forest is often preferred to Bagging. As an example, we define a `RandomForestClassifier` composed of 50 trees.
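
The feature subsampling can be inspected on a fitted forest. The sketch below assumes synthetic data with 16 features and sets `max_features="sqrt"` explicitly, so each split should consider 4 candidate features; the forest size of 10 is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, used only to illustrate the idea.
X_toy, y_toy = make_classification(n_samples=300, n_features=16, random_state=0)

# "sqrt" means each split looks at sqrt(n_features) randomly chosen features.
forest = RandomForestClassifier(n_estimators=10, max_features="sqrt", random_state=0)
forest.fit(X_toy, y_toy)

# The fitted trees are stored in estimators_; each one records how many
# features it considers per split in max_features_.
first_tree = forest.estimators_[0]
print("features in the data:", X_toy.shape[1])
print("features tried per split:", first_tree.max_features_)
print("trees in the forest:", len(forest.estimators_))
```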

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf_classifier = RandomForestClassifier(n_estimators=50, random_state=1998)

rf_classifier.fit(X_train, y_train)

y_pred = rf_classifier.predict(X_test)

print("Classification Report: \n", classification_report(y_test, y_pred, target_names=label_encoder.classes_))

Classification Report: 
               precision    recall  f1-score   support

    Negative       0.67      0.69      0.68      1633
     Neutral       0.55      0.50      0.53       619
    Positive       0.67      0.67      0.67      1546

    accuracy                           0.65      3798
   macro avg       0.63      0.62      0.63      3798
weighted avg       0.65      0.65      0.65      3798



In this case, the performance is worse than with Bagging (accuracy 0.65 versus 0.69), mainly because recall on the Neutral class drops from 0.67 to 0.50.

# Boosting

Boosting trains a number of Decision Trees sequentially, where each tree attempts to correct the errors made by the previous ones. In the end, all the trees contribute to the final prediction: in gradient boosting, the variant used here, the outputs of the trees are summed, each scaled by the learning rate. As an example, we define a `GradientBoostingClassifier` composed of 50 trees.
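
The idea of "correcting the errors of the previous trees" is easiest to see on a regression problem, where the errors are simply residuals. The sketch below is a hand-written illustration of gradient boosting with squared error on invented data; it is not the code that `GradientBoostingClassifier` runs, and the data, depth, and learning rate are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Invented 1-D regression data: a noisy sine curve.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())  # start from a constant guess
mse_history = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    # The residuals are the errors made by the ensemble built so far.
    residual = y_toy - prediction
    # The next small tree is trained to predict those errors...
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residual)
    # ...and a fraction of its output is added to the running prediction.
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(np.mean((y_toy - prediction) ** 2))

print("training MSE before boosting:", round(mse_history[0], 4))
print("training MSE after 50 trees: ", round(mse_history[-1], 4))
```

The training error shrinks as trees are added, which shows the mechanism at work; it says nothing about test performance, which must still be measured separately.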

In [8]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

gradient_boosting_classifier = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1, random_state=1998)

gradient_boosting_classifier.fit(X_train, y_train)

y_pred = gradient_boosting_classifier.predict(X_test)

print("Classification Report: \n", classification_report(y_test, y_pred, target_names=label_encoder.classes_))


Classification Report: 
               precision    recall  f1-score   support

    Negative       0.76      0.53      0.62      1633
     Neutral       0.62      0.22      0.32       619
    Positive       0.53      0.84      0.65      1546

    accuracy                           0.60      3798
   macro avg       0.64      0.53      0.53      3798
weighted avg       0.64      0.60      0.59      3798



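Here too, the performance is worse than with Bagging (accuracy 0.60 versus 0.69). The model strongly favors the Positive class (recall 0.84) at the expense of the Neutral class (recall 0.22).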

# Conclusion

We have explored tree-based models as an additional resource beyond Logistic Regression for text classification. Starting with the Decision Tree, we observed its simplicity and interpretability but also its limitations: accuracy stayed at 0.61 even after hyperparameter tuning. To address these, we examined Bagging, which aggregates multiple decision trees trained on different bootstrap samples of the data and gave the best result in this notebook (accuracy 0.69). Next, we looked into Random Forest, an extension of Bagging that introduces random feature selection at each split to obtain less correlated trees. This typically improves performance, but with the settings used here it reached only 0.65. Finally, we explored Boosting, which trains decision trees sequentially, each one focusing on correcting the errors of its predecessors. This method can achieve high predictive accuracy by iteratively refining the model, but with 50 trees it reached only 0.60 on our data.

In future sessions, we will explore more advanced approaches such as embeddings and neural networks, which promise even greater capabilities for text classification tasks. These techniques will help us push beyond what we can achieve with "classical" machine learning in text analytics.