# UCLAIS Tutorial Series Challenge 3

We are proud to present the third challenge of the 2022-23 UCLAIS tutorial series: Sentiment Analysis on Climate Change. You will be introduced to another super exciting domain in Machine Learning: Natural Language Processing 🙀.

This Jupyter notebook will guide you through the various general stages involved in end-to-end NLP projects, including data visualisation, data preprocessing, model selection, model training, and model evaluation. Finally, you will get the chance to submit your results to [DOXA](https://doxaai.com/).

If you do not already have a DOXA account, please [sign up](https://doxaai.com/sign-up) first before proceeding.


## Background & Motivation




**Background**: 

You might have heard about [people who deny climate change](https://en.wikipedia.org/wiki/Climate_change_denial). How many skeptics are there? Why do they hold that view? Let's look at 12,000 tweets and analyse people's beliefs on climate change.

**Objective**:  

Create a model that classifies tweets according to belief in the existence of global warming or climate change. 

**Dataset**:

The labels are "1" if the tweet suggests global warming is occurring, "-1" if the tweet suggests global warming is not occurring, and "0" if the tweet is ambiguous or unrelated to global warming.  

The dataset is aggregated from the links stated below. The data obtained from these links has been processed so that the classes are almost balanced, and all non-ASCII characters have been removed to give cleaner text.
- https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset
- https://data.world/xprizeai-env/sentiment-of-climate-change/

## Installing and Importing Useful Packages

To get started, we will install a number of common machine learning packages.

In [None]:
%pip install numpy pandas matplotlib seaborn scikit-learn doxa-cli gdown yellowbrick

In [None]:
# Import relevant libraries
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Import relevant sklearn classes/functions for data preprocessing, modelling and evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    BaggingClassifier,
)
from sklearn.multiclass import OneVsRestClassifier

# For visualising data
from yellowbrick.text import FreqDistVisualizer

# For displaying plots on Jupyter Notebook
%matplotlib inline

## Data Loading
The first step is to gather the data that we will be using. The data can be downloaded directly via [Google Drive](https://drive.google.com/drive/folders/1xct1L1Cyg1JjGQNDT5fXasEdHsb7sl6I) or simply by running the cell below.

In [None]:
# Let's download the dataset if we don't already have it!
if not os.path.exists("data"):
    os.makedirs("data", exist_ok=True)

    !curl https://raw.githubusercontent.com/UCLAIS/doxa-challenges/main/Challenge-3/data/train.csv --output data/train.csv
    !curl https://raw.githubusercontent.com/UCLAIS/doxa-challenges/main/Challenge-3/data/test.csv --output data/test.csv

In [None]:
# Import the training dataset
data_original = pd.read_csv("./data/train.csv")

# We then make a copy of the dataset that we can manipulate
# and process while leaving the original intact
data = data_original.copy()

## Data Understanding 
Before we start to train our Machine Learning model, it is important to have a look at and understand the dataset that we will be using. This will provide some insight into which models, model hyperparameters, and loss functions are suitable for the problem we are dealing with. 

In [None]:
# Let's have a look at the shape of our training and testing set

In [None]:
# Let's view the first 15 samples of the dataset

In [None]:
# Try to see whether we are dealing with an imbalanced or a balanced dataset
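
As a starting point, the three inspection steps above could look like the sketch below. It uses a tiny made-up DataFrame in place of `data/train.csv`, and the column names `message` and `sentiment` are assumptions rather than the confirmed schema, so check `data.columns` on the real file first.

```python
import pandas as pd

# Toy stand-in for data/train.csv. The column names "message" and
# "sentiment" are ASSUMPTIONS; inspect data.columns on the real file.
data = pd.DataFrame(
    {
        "message": [
            "Global warming is accelerating",
            "Climate change is a hoax",
            "Nice weather today",
            "Sea levels keep rising",
            "There is no warming at all",
            "What time is the match?",
        ],
        "sentiment": [1, -1, 0, 1, -1, 0],
    }
)

# Shape is (number of rows, number of columns)
print(data.shape)

# First rows (use data.head(15) on the real dataset)
print(data.head(3))

# Class balance: absolute count and relative share of each label
label_counts = data["sentiment"].value_counts().sort_index()
label_share = data["sentiment"].value_counts(normalize=True).sort_index()
print(label_counts)
print(label_share.round(3))
```

If the shares are roughly equal, plain accuracy is a reasonable first metric; if they are not, a class-aware metric such as macro F1 is usually more informative.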

## Data Preprocessing

Now, we get to one of the unique aspects of dealing with a Natural Language Processing (NLP) problem. As you might know (or might not know), computers can only understand numbers, but when it comes to language, we are dealing with text. A lot of text. Raw text cannot be fed directly into most machine learning models. Thus, it is essential for us to transform the text into something that our machines can understand.

And as you might have learned during our tutorial session, we can vectorise our text. So let's vectorise it! We will use the vectors in data visualisation and model training.

In [None]:
# Splitting the data into input features and target features (labels)

In [None]:
# Splitting the data into training and validation sets
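
One possible way to do the two splitting steps above is sketched here on a toy DataFrame. The column names `message` and `sentiment`, the 25% validation share, and the random seed are all assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the training data; column names are ASSUMPTIONS
data = pd.DataFrame(
    {
        "message": [
            "Global warming is accelerating",
            "Sea levels keep rising",
            "The planet is getting hotter",
            "Climate change is real",
            "Climate change is a hoax",
            "There is no warming at all",
            "Global warming is a myth",
            "The climate is not changing",
            "Nice weather today",
            "What time is the match?",
            "I love this song",
            "Traffic was terrible this morning",
        ],
        "sentiment": [1, 1, 1, 1, -1, -1, -1, -1, 0, 0, 0, 0],
    }
)

# Input features (the tweet text) and target labels
X = data["message"]
y = data["sentiment"]

# Hold out 25% for validation; stratify keeps the label proportions
# the same in both splits, and random_state makes the split repeatable
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(len(X_train), len(X_val))
```

Splitting before vectorising means the vocabulary is learned from the training portion only, so the validation score is not inflated by information from held-out tweets.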

In [None]:
# Initialise the vectorisation of the tweets using the CountVectorizer() or TfidfVectorizer() implementation from Scikit-learn
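
A minimal sketch of this step with `TfidfVectorizer` follows. The texts are made up, and the default settings are used purely for illustration; options such as `stop_words`, `ngram_range`, or `max_features` are worth experimenting with.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training and validation texts (illustrative only)
train_texts = [
    "global warming is accelerating",
    "climate change is a hoax",
    "nice weather today",
    "sea levels keep rising",
]
val_texts = ["warming is not real say unicorns"]

vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the TRAINING texts only
X_train_vec = vectorizer.fit_transform(train_texts)

# Reuse the fitted vectorizer on the validation texts: transform(), not
# fit_transform(), so both matrices share the same columns
X_val_vec = vectorizer.transform(val_texts)

print(X_train_vec.shape, X_val_vec.shape)
```

Words that never appeared in the training texts (here "unicorns") have no column and are silently ignored at transform time.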

In [None]:
# Check the distribution of most common words

## Model Training

In [None]:
# Feel free to try any model here, ranging from classical Machine Learning (Gradient Boosting, KNN) to neural networks (RNN, Transformer)
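
As one possible baseline (not the only or necessarily the best choice), the sketch below chains a `TfidfVectorizer` and a `RandomForestClassifier` in a Scikit-learn pipeline. The texts, labels, and hyperparameters are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data (illustrative only)
train_texts = [
    "global warming is accelerating",
    "sea levels keep rising",
    "climate change is a hoax",
    "there is no warming at all",
    "nice weather today",
    "what time is the match",
]
train_labels = [1, 1, -1, -1, 0, 0]

# The pipeline vectorises the text and then fits the classifier,
# so raw strings can be passed straight to fit() and predict()
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
model.fit(train_texts, train_labels)

example_prediction = model.predict(["climate change is real"])
print(example_prediction)
```

Bundling the vectorizer with the classifier keeps the preprocessing identical at training and inference time, which removes a common source of bugs.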

## Model Evaluation
Now that we have trained our machine learning models, we can test them on the validation set we created earlier!

In [None]:
# Use the .predict() method to predict output values for our validation set (if you're using a Scikit-learn implementation)

In [None]:
# Check the performance of your model
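
A minimal sketch of the evaluation step, using the `accuracy_score` and `f1_score` functions imported earlier. The true and predicted labels are hard-coded, made-up values; in the notebook they would come from your validation split and your model's `.predict()` output. Which metric the DOXA scoreboard uses is not stated here, so both are shown.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up validation labels and predictions (illustrative only)
y_val = [1, -1, 0, 1, -1, 0]
y_pred = [1, -1, 0, 0, -1, 1]

# Accuracy: fraction of tweets whose label was predicted correctly
accuracy = accuracy_score(y_val, y_pred)

# Macro F1: F1 computed per class, then averaged with equal weight,
# so a rarely predicted class counts as much as a common one
macro_f1 = f1_score(y_val, y_pred, average="macro")

print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.3f}")
```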

## Preparing our DOXA Submission

Once we are confident with the performance of our model, we can run it on the real test dataset and prepare the predictions for submission to DOXA!

In [None]:
# First, let's import our test dataset and save it in a variable called data_test
data_test = pd.read_csv("./data/test.csv")  # Change the path accordingly

In [None]:
# Apply the same preprocessing steps you used on the training set (reuse the fitted vectorizer: call .transform(), not .fit_transform())

In [None]:
# Inference on the testing set

In [None]:
# Check the length of your prediction, make sure it is the same as the length of the test set
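
The three test-set steps above could be sketched as follows. The DataFrames are toy stand-ins, the column names `message` and `sentiment` are assumptions, and the model is just an example; the point is that the vectorizer fitted on the training data is reused unchanged on the test data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the training and test files; column names are ASSUMPTIONS
train = pd.DataFrame(
    {
        "message": [
            "global warming is accelerating",
            "climate change is a hoax",
            "nice weather today",
            "sea levels keep rising",
            "there is no warming at all",
            "what time is the match",
        ],
        "sentiment": [1, -1, 0, 1, -1, 0],
    }
)
data_test = pd.DataFrame(
    {"message": ["warming is a hoax", "sea levels are rising", "great match today"]}
)

# Fit the vectorizer and the model on the training data
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(train["message"])
model = RandomForestClassifier(random_state=42).fit(X_train_vec, train["sentiment"])

# Preprocess the test set with the ALREADY FITTED vectorizer
X_test_vec = vectorizer.transform(data_test["message"])

# Inference on the test set
predictions = model.predict(X_test_vec)

# Sanity check: exactly one prediction per test row
assert len(predictions) == len(data_test)
print(predictions)
```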

In [None]:
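# `predictions` is assumed to hold the predicted labels from the
# inference cell above, one per row of the test set
os.makedirs("submission", exist_ok=True)

# Write one prediction per line
with open("submission/y.txt", "w") as f:
    f.writelines([f"{prediction}\n" for prediction in predictions])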

with open("submission/doxa.yaml", "w") as f:
    f.write(
        "competition: uclais-3\nenvironment: cpu\nlanguage: python\nentrypoint: run.py"
    )

with open("submission/run.py", "w") as f:
    f.write("with open('y.txt', 'r') as f: print(f.read().strip())")

## Submitting to DOXA

Before you can submit to DOXA, you must first ensure that you are enrolled for the challenge on the DOXA website. Visit [the challenge page](https://doxaai.com/competition/uclais-3) and click "Enrol" in the top-right corner.

You can then log in using the DOXA CLI by running the following command:

In [None]:
!doxa login

You can then submit your results to DOXA by running the following command:

In [None]:
!doxa upload submission

Yay! You have (probably) just uploaded your first submission to DOXA! Take a moment to see where you are on the [scoreboard](https://doxaai.com/competition/uclais-3)! 🙌 