# UCLAIS Tutorial Series Challenge 3

We are proud to present you with our third challenge of the 2022-23 UCLAIS tutorial series: Sentiment Analysis on the Climate Change problem. You will be introduced to another super exciting domain in Machine Learning, which is Natural Language Processing 🙀.

This Jupyter notebook will guide you through the various general stages involved in end-to-end NLP projects, including data visualisation, data preprocessing, model selection, model training, and model evaluation. Finally, you will get the chance to submit your results to [DOXA](https://doxaai.com/).

If you do not already have a DOXA account, please [sign up](https://doxaai.com/sign-up) first before proceeding.


## Background & Motivation




**Background**: 

You might have heard about [people who deny climate change](https://en.wikipedia.org/wiki/Climate_change_denial). How many skeptics are there? Why do they believe so? Let's look at 15,000 tweets and analyse people's beliefs on climate change.

**Objective**:  

Create a model that classifies tweets according to belief in the existence of global warming or climate change. 

**Dataset**:

The labels are "1" if the tweet suggests global warming is occurring, "-1" if the tweet suggests global warming is not occurring, and "0" if the tweet is ambiguous or unrelated to global warming.  

The dataset is aggregated from the links stated below. The data from these links has been processed so that the classes are almost balanced, and all non-ASCII characters have been removed (to keep the text clean).
- https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset
- https://data.world/xprizeai-env/sentiment-of-climate-change/

## Installing and Importing Useful Packages

To get started, we will install a number of common machine learning packages.

In [None]:
%pip install numpy pandas matplotlib seaborn scikit-learn doxa-cli gdown yellowbrick

In [None]:
# Import relevant libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import relevant sklearn classes/functions related to data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.multiclass import OneVsRestClassifier

# For visualising data
from yellowbrick.text import FreqDistVisualizer

# For displaying plots on Jupyter Notebook
%matplotlib inline

## Data Loading
The first step is to gather the data that we will be using. The data can be downloaded directly via [Google Drive](https://drive.google.com/drive/folders/1xct1L1Cyg1JjGQNDT5fXasEdHsb7sl6I) or by running the cell below.

In [None]:
# Let's download the dataset if we don't already have it!
if not os.path.exists("data"):
    os.makedirs("data", exist_ok=True)

    !curl https://raw.githubusercontent.com/UCLAIS/doxa-challenges/main/Challenge-3/data/train.csv --output data/train.csv
    !curl https://raw.githubusercontent.com/UCLAIS/doxa-challenges/main/Challenge-3/data/test.csv --output data/test.csv

In [None]:
# Import the training dataset
data_original = pd.read_csv("./data/train.csv")

# We then make a copy of the dataset that we can manipulate
# and process while leaving the original intact
data = data_original.copy()

## Data Understanding 
Before we start to train our Machine Learning model, it is important to have a look at and understand the dataset that we will be using. This will provide some insight into which models, model hyperparameters, and loss functions are suitable for the problem we are dealing with. 

In [None]:
# Let's have a look at the shape of our dataset
print(f"Shape of the dataset: {data.shape}")

In [None]:
# Let's view the first 15 samples of the dataset
data.head(15)

Alright, from the simple analysis we've done above, we can see that our dataset consists of 15,000 samples (or rather tweets) and our job is to predict the sentiment of these tweets as either -1, 0, or 1.

Nice! Now let's try to see whether we are dealing with a balanced classification problem or not.

In [None]:
data["Sentiment"].value_counts()

There are 4000 data points that correspond to a 'Sentiment' of -1, and 5500 data points that belong to each of 'Sentiment' 0 and 1. The dataset that we have seems a bit imbalanced, but the good thing is that we are not dealing with a heavily imbalanced dataset. This means we can get the ball rolling while not thinking too much about having an imbalanced dataset!
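
To put a number on "a bit imbalanced", it can help to look at class proportions rather than raw counts. The snippet below is a minimal sketch on a stand-in Series that simply reproduces the counts quoted above; in the notebook you would call `value_counts(normalize=True)` on `data["Sentiment"]` instead.

```python
import pandas as pd

# Stand-in labels reproducing the counts quoted above (hypothetical data,
# not the real tweets): 4000 x -1, 5500 x 0, 5500 x 1
labels = pd.Series([-1] * 4000 + [0] * 5500 + [1] * 5500, name="Sentiment")

# normalize=True returns the fraction of samples in each class
proportions = labels.value_counts(normalize=True).sort_index()
print(proportions)

# Ratio between the largest and the smallest class
imbalance_ratio = proportions.max() / proportions.min()
print(f"Largest class is {imbalance_ratio:.2f}x the smallest")
```

A ratio this close to 1 is why we can treat the problem as almost balanced.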

## Data Preprocessing

Now, we get to one of the unique aspects of dealing with a Natural Language Processing (NLP) problem. As you might know (or might not know), computers can only understand numbers, but when it comes to language, we are dealing with text. A lot of text. This type of data is not really useful for the computer. Thus, it is essential for us to transform the text into something that our machines can understand.

And as you might have learned during our tutorial session, we can vectorise our text. So let's vectorise it! We will use the vectors in data visualisation and model training.

Before that, let's split our dataset into training and validation sets.

In [None]:
# Splitting the data into input features and target features (labels)
data_input = data["Tweet"]
data_label = data["Sentiment"]

In [None]:
# Splitting the data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(data_input, data_label, test_size=0.2)

To vectorise our text, we will be using the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) implementation from Scikit-learn. Text preprocessing, tokenization, and stopword filtering are all included in the CountVectorizer, which builds a dictionary of features and transforms documents to feature vectors.

Under the hood, the CountVectorizer implementation can be thought of as a transformation to a **bag of words** representation. A brief outline of the algorithm is given below:

- Assign a fixed integer ID to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).
- For each document #i, count the number of occurrences of each word w and store it in X[i, j] as the value of feature #j, where j is the index of word w in the dictionary.
- The bag of words representation implies that n_features is the number of distinct words in the corpus: this number is typically larger than 100,000.
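
The two steps above can be sketched in plain Python. This is only an illustration on a made-up toy corpus, not how Scikit-learn implements it: the real CountVectorizer also lowercases text, stores the result as a sparse matrix, and by default ignores single-character tokens.

```python
from collections import Counter

# Toy corpus (made-up sentences, not taken from the dataset)
docs = [
    "climate change is real",
    "climate change is a hoax",
    "is it warm today",
]

# Step 1: assign a fixed integer ID to each distinct word
vocabulary = {}
for doc in docs:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))

# Step 2: X[i][j] = number of times word j occurs in document i
X = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        X[i][vocabulary[word]] = count

print(vocabulary)
for row in X:
    print(row)
```

Each row has one entry per distinct word in the corpus, which is why the number of features grows so quickly with real text.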

In [None]:
# Initializing vectorization of climate posts
vectorizer = CountVectorizer()
X_train_vectorised = vectorizer.fit_transform(X_train)
X_valid_vectorised = vectorizer.transform(X_valid)
X_train_vectorised

In [None]:
X_train_vectorised.shape

Our dataset has now been vectorised, and we can see that it contains approximately 28,000 features, each corresponding to the number of times a particular word occurs in a text (tweet).

This is a lot of features considering that they come from tweets, where the length of the text is usually not that large. Do we expect this? 

Indeed, bear in mind that the dataset we are using comes from Twitter, which hosts its own language 🙃. All the different slang, lingo, unscrambled words, and mixed up words. Twitter has all of them.

Next, let's see what the most common words in our dataset are.

In [None]:
# Distribution of most common words
features = vectorizer.get_feature_names_out()
visualizer = FreqDistVisualizer(features=features, orient='v')
visualizer.fit(X_train_vectorised)
visualizer.show()

As expected, because we are dealing with climate change specific tweets, the words that come up the most include: climate, change, global, warming, etc...

## Model Training

For this section, we'll experiment with various models such as a [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), a [Bagging Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html) using the OvR strategy, and a [Gradient Boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html) Classifier using the OvR strategy. Feel free to try other models too!

You're already familiar with Random Forests and Gradient Boosting, as we covered them in [week 5](https://github.com/UCLAIS/ml-tutorials-season-3/tree/main/week-5).

As its name suggests, the [One-vs-the-rest (OvR) classifier](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html) trains one binary classifier per class. Each binary classifier predicts whether the input belongs to its class or not: rather than trying to learn all the classes at once, it focuses on one specific class, which might give it an advantage in some cases. Because each classifier fits only its own class, OvR needs as many classifiers as there are classes in a multiclass problem (three in our case), and the final prediction is the class whose classifier is most confident.
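
To make the one-vs-rest idea concrete, here is a small sketch on made-up one-dimensional data. `LogisticRegression` is used purely as a lightweight stand-in base estimator (an assumption for this example, not one of the models used in this notebook); the point is only to show that one binary estimator is fitted per class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy 1-D data with three classes (hypothetical, for illustration only)
X = np.array([[-3.0], [-2.5], [-2.0], [-0.2], [0.0], [0.2], [2.0], [2.5], [3.0]])
y = np.array([-1, -1, -1, 0, 0, 0, 1, 1, 1])

ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

# One binary "this class vs. everything else" estimator per class
print(ovr.classes_)
print(len(ovr.estimators_))

# The predicted label is the class whose binary estimator is most confident
print(ovr.predict([[-2.8], [2.8]]))
```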

In [None]:
# Fit a Random Forest model
rfClassifier = RandomForestClassifier()
rfClassifier.fit(X_train_vectorised, y_train)

In [None]:
# Fit a bagging OvR classifier
bagClassifier = OneVsRestClassifier(BaggingClassifier())
bagClassifier.fit(X_train_vectorised, y_train)

In [None]:
# Fit a Gradient Boosting OvR classifier
boostClassifier = OneVsRestClassifier(GradientBoostingClassifier())
boostClassifier.fit(X_train_vectorised, y_train)

## Model Evaluation
Now that we have trained our machine learning models, we can test them on our validation set.

Since our dataset is almost balanced, we will evaluate using the simple accuracy metric, which is the percentage of correct predictions out of all predictions.
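
As a quick illustration of what accuracy measures, the sketch below computes it by hand on six made-up labels and predictions (hypothetical values, not model output).

```python
# Hypothetical true labels and predictions for six tweets
y_true = [1, 0, -1, 1, 0, -1]
y_pred = [1, 0, 1, 1, -1, -1]

# Accuracy = (number of correct predictions) / (number of predictions)
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(f"{correct} of {len(y_true)} correct -> accuracy = {accuracy:.3f}")
```

Scikit-learn's `accuracy_score` computes the same quantity.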

In [None]:
# Use the .predict() method to predict output values for our validation set
rf_predicted = rfClassifier.predict(X_valid_vectorised)
bag_predicted = bagClassifier.predict(X_valid_vectorised)
boost_predicted = boostClassifier.predict(X_valid_vectorised)

In [None]:
# We will be using the accuracy_score() implementation from scikit-learn,
# which expects the true labels first and the predictions second
rf_accuracy = accuracy_score(y_valid, rf_predicted)
bag_accuracy = accuracy_score(y_valid, bag_predicted)
boost_accuracy = accuracy_score(y_valid, boost_predicted)

print("Accuracy (Random Forest): ", rf_accuracy)
print("Accuracy (Bagging Classifier with OvR strategy): ", bag_accuracy)
print("Accuracy (Boost Classifier with OvR strategy): ", boost_accuracy)

It seems that our Bagging classifier and Gradient Boosting classifier perform worse than our Random Forest classifier for this data. Does this have something to do with the OvR implementation? 

As practice, try to implement the Bagging Classifier and Gradient Boosting Classifier without the OvR strategy.

## Preparing our DOXA Submission

Once we are confident with the performance of our model, we can start deploying it on the real test dataset for submission to DOXA! 

In [None]:
# First, let's import our test dataset and save it in a variable called data_test
data_test = pd.read_csv("./data/test.csv")          # Change the path accordingly 

Then, we must preprocess the dataset before feeding it into the trained model. Remember that we applied only one preprocessing step to our training data: vectorising it with the CountVectorizer() implementation from Scikit-learn.

In [None]:
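# Vectorise the test set with the vectorizer fitted on the training data
# (assuming test.csv has the same "Tweet" column as train.csv)
X_test_vectorised = vectorizer.transform(data_test["Tweet"])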


Because our Random Forest model performed best, let's use it for inference.

In [None]:
# Inference on testing set
predictions = rfClassifier.predict(X_test_vectorised)


Let's check the size of our predictions and verify that it matches the size of the test set.

In [None]:
len(predictions)

In [None]:
os.makedirs("submission", exist_ok=True)

with open("submission/y.txt", "w") as f:
    f.writelines([f"{prediction}\n" for prediction in predictions])

with open("submission/doxa.yaml", "w") as f:
    f.write("competition: uclais-3\nenvironment: cpu\nlanguage: python\nentrypoint: run.py")

with open("submission/run.py", "w") as f:
    f.write("with open('y.txt', 'r') as f: print(f.read().strip())")


## Submitting to DOXA

Before you can submit to DOXA, you must first ensure that you are enrolled for the challenge on the DOXA website. Visit [the challenge page](https://doxaai.com/competition/uclais-3) and click "Enrol" in the top-right corner.

You can then log in using the DOXA CLI by running the following command:

In [None]:
!doxa login

You can then submit your results to DOXA by running the following command:

In [None]:
!doxa upload submission

Yay! You have (probably) just uploaded your first submission to DOXA! Take a moment to see where you are on the [scoreboard](https://doxaai.com/competition/uclais-3)!

## Possible Improvements

**1. Data Preprocessing**
- If you look more closely at the data (tweets) we have, most of them contain the '@' sign followed by a Twitter username. Let's think for a moment: do we really need this information? Or, more subtly, does this information provide any value to our model?
- Instead of using the CountVectorizer, why not try one of the other vectorizer implementations that Scikit-learn provides, such as the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)? In short, TF-IDF often works better than raw counts because it not only considers how frequently a word appears in a tweet, but also down-weights words that appear in most tweets and therefore carry little information.
- The labels we are using can be categorized as an ordinal encoding (-1, 0, 1). Why not try using [one-hot encoding](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) instead?


**2. Deep Learning Model**
- In this notebook, we've only used basic machine learning models that do not rely on neural networks. As you've learned about RNNs and LSTMs in [Week 7](https://github.com/UCLAIS/ml-tutorials-season-3/blob/main/week-7/AI%20tutorial%207.pdf), you can try using [Keras's LSTM layers](https://keras.io/api/layers/recurrent_layers/lstm/) to build a powerful deep learning model that might outperform sklearn models.
- However, be careful that your model does not overfit, because 15,000 data points is not that many.

**3. Ensemble Model**  
- You can also try an ensemble of different models that can generalise better than a single model.
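
One possible way to build such an ensemble is Scikit-learn's `VotingClassifier`, sketched below on a handful of made-up sentences. The choice of base models (`MultinomialNB` and `LogisticRegression` alongside a Random Forest) is an assumption for illustration; any classifiers that work on the vectorised tweets could be combined this way.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Made-up mini dataset, for illustration only
texts = [
    "climate change is real and urgent",
    "global warming is happening now",
    "climate change is a hoax",
    "global warming is fake news",
    "nice weather today",
    "what is for lunch",
]
labels = [1, 1, -1, -1, 0, 0]

X = CountVectorizer().fit_transform(texts)

# Hard voting: each model casts one vote and the majority class wins
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",
)
ensemble.fit(X, labels)
print(ensemble.predict(X))
```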

**4. Data Augmentation**  
- Our dataset consists of 15,000 tweets for you to play with. Is this enough to generate a model that can understand language? 
- Given a limited volume of data to play with, you can consider augmenting the dataset. Think of adding or removing a random word, or even making a new dataset yourself.
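
As one example of the "removing a random word" idea, the sketch below drops each word with a small probability. The `random_deletion` helper and its parameters are hypothetical names introduced here, and the sentence is made up.

```python
import random

def random_deletion(text, p=0.2, rng=None):
    """Return a copy of `text` with each word dropped with probability p."""
    rng = rng or random.Random()
    words = text.split()
    if len(words) <= 1:
        return text
    kept = [w for w in words if rng.random() > p]
    # Never return an empty tweet: fall back to one random word
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)

rng = random.Random(42)  # Seeded so the example is reproducible
original = "global warming is changing weather patterns around the world"
augmented = [random_deletion(original, p=0.2, rng=rng) for _ in range(3)]
for sentence in augmented:
    print(sentence)
```

Each augmented tweet keeps the label of its original. Bear in mind that deleting a word such as "not" can flip the meaning of a tweet, so augmented samples are worth spot-checking, and augmentation should be applied to the training split only.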