### Machine Learning Project - **Alice DEVILDER**

# Twitter Climate Change Sentiment Analysis

This project focuses on the sentiment analysis of tweets related to climate change by building and comparing machine learning models to classify whether a person believes in climate change based on novel tweet data. 
The dataset used for this analysis was collected as part of a project funded by *a Canada Foundation for Innovation JELF Grant* awarded to **Chris Bauch** at the *University of Waterloo*.  
It corresponds to the [Twitter Climate Change Sentiment Dataset](https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset) on Kaggle.

### Some context

Climate change remains a topic of significant global concern, prompting ongoing discussions and debates on social media platforms like Twitter. With the proliferation of climate-related content online, distinguishing between different perspectives and attitudes towards climate change is crucial for understanding public sentiment and informing policy decisions.

Twitter serves as a valuable source of real-time data reflecting diverse viewpoints on climate change, ranging from support for climate action to skepticism or denial of its existence. By leveraging machine learning techniques, this project aims to analyze and classify tweet sentiments to discern whether individuals express belief or disbelief in climate change.

In this dynamic digital landscape, the ability to accurately classify tweet sentiments offers valuable insights into public perceptions and attitudes towards climate change. By building and comparing machine learning models, this project seeks to enhance our understanding of the nuanced discourse surrounding climate change on social media platforms like Twitter.

### About the Dataset

The dataset aggregates tweets pertaining to climate change collected between *April 27, 2015*, and *February 21, 2018*. A total of **43,943 tweets** were annotated for sentiment analysis. Each tweet was independently labelled by three reviewers, and only tweets on which all three reviewers agreed are included in the dataset.

Each tweet is labelled with one of the following classes:

* **2 (News):** The tweet links to factual news about climate change.
* **1 (Pro):** The tweet supports the belief of man-made climate change.
* **0 (Neutral):** The tweet neither supports nor refutes the belief of man-made climate change.
* **-1 (Anti):** The tweet expresses disbelief in man-made climate change.

Moreover, the dataset includes three columns:

**sentiment:** The sentiment label of the tweet.  
**message:** The text content of the tweet.  
**tweetid:** The unique identifier of the tweet.
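
For readable summaries, the numeric codes above can be mapped to their class names. The sketch below is a minimal illustration on a small hand-made DataFrame: the rows are invented for the example, and the name `SENTIMENT_LABELS` is an assumption rather than part of the dataset or of this notebook.

```python
import pandas as pd

# Mapping from the numeric codes described above to readable names.
# SENTIMENT_LABELS is a hypothetical helper name, not from the dataset.
SENTIMENT_LABELS = {2: "News", 1: "Pro", 0: "Neutral", -1: "Anti"}

# Toy rows invented for illustration; they reuse the dataset's column names.
toy = pd.DataFrame({
    "sentiment": [1, 1, 2, 0, -1, 1],
    "message": ["a", "b", "c", "d", "e", "f"],
    "tweetid": [101, 102, 103, 104, 105, 106],
})

# Translate codes to names, then count how many tweets fall in each class.
toy["label"] = toy["sentiment"].map(SENTIMENT_LABELS)
class_counts = toy["label"].value_counts()
print(class_counts)
```

Applied to the real DataFrame, the same `map` and `value_counts` calls would show how balanced the four classes are.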

### Project Objective

The project objective is to build and compare machine learning models to classify whether a person believes in climate change based on novel tweet data. By leveraging the dataset collected as part of the Canada Foundation for Innovation JELF Grant, the aim is to develop robust classification algorithms that accurately distinguish between different perspectives on climate change expressed in tweets. The project seeks to evaluate the performance of various machine learning techniques and identify the most effective model for sentiment analysis of climate change-related tweets on Twitter.

## Import libraries and dataset

In [4]:
import re
import string

import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tqdm.auto import tqdm

# Download the Punkt tokenizer data used by NLTK
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\alice\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


We first import the dataset.

In [20]:
df = pd.read_csv('twitter_sentiment_data.csv')

## Data Cleansing 

Let's see what the dataset looks like using ```df.head(10)```.

In [38]:
df.head(10)

| | sentiment | message | tweetid |
|---|---|---|---|
| 0 | -1 | @tiniebeany climate change is an interesting h... | 792927353886371840 |
| 1 | 1 | RT @NatGeoChannel: Watch #BeforeTheFlood right... | 793124211518832641 |
| 2 | 1 | Fabulous! Leonardo #DiCaprio's film on #climat... | 793124402388832256 |
| 3 | 1 | RT @Mick_Fanning: Just watched this amazing do... | 793124635873275904 |
| 4 | 2 | RT @cnalive: Pranita Biswasi, a Lutheran from ... | 793125156185137153 |
| 5 | 0 | Unamshow awache kujinga na iko global warming ... | 793125429418815489 |
| 6 | 2 | RT @cnalive: Pranita Biswasi, a Lutheran from ... | 793125430236684289 |
| 7 | 2 | RT @CCIRiviera: Presidential Candidate #Donald... | 793126558688878592 |
| 8 | 0 | RT @AmericanIndian8: Leonardo DiCaprio's clima... | 793127097854197761 |
| 9 | 1 | #BeforeTheFlood Watch #BeforeTheFlood right he... | 793127346106753028 |


We can also get some general information about the dataset by calling ```df.info()```: the number of rows and columns, the non-null count per column, and the dtypes.

In [10]:
# Get general information on the dataset (row and column counts, non-null counts, dtypes)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43943 entries, 0 to 43942
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  43943 non-null  int64 
 1   message    43943 non-null  object
 2   tweetid    43943 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 1.0+ MB


Let's now check for NaN values and duplicates.

In [17]:
pd.isnull(df).sum()

sentiment    0
message      0
tweetid      0
dtype: int64

In [18]:
df.duplicated().sum()

0

The dataset is clean: there are no null values and no fully duplicated rows. We can now move on to the preprocessing part.
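
Note that `df.duplicated()` compares entire rows, so two rows with the same text but different `tweetid` values are not counted. As a hedged complement, the sketch below checks for repeated text only. It runs on invented toy rows; whether the real dataset contains such repeats is not established here.

```python
import pandas as pd

# Toy rows invented for illustration: rows 0 and 2 share the same text
# but have different tweet IDs, so they are not full-row duplicates.
toy = pd.DataFrame({
    "sentiment": [2, 1, 2],
    "message": ["RT @user: same text", "another tweet", "RT @user: same text"],
    "tweetid": [1001, 1002, 1003],
})

# Full-row check, as in the cell above: no row is an exact copy of another.
full_row_dupes = toy.duplicated().sum()

# Text-only check: flags later rows whose message has already been seen.
message_dupes = toy.duplicated(subset="message").sum()

print(full_row_dupes, message_dupes)
```

If the text-only count were non-zero on the real data, deciding whether to keep or drop those rows would be a modelling choice, for example to avoid the same retweet appearing in both training and test splits.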

## Preprocessing

In [40]:
df.sample(10)

| | sentiment | message | tweetid |
|---|---|---|---|
| 10054 | 0 | RT @Martin1Williams: Santa discovers undeniabl... | 812447770023841792 |
| 17305 | 2 | RT @UNEP: READ: Greening wood energy is key to... | 845685218023485440 |
| 17382 | 1 | Perfect for any backyard, our residential sola... | 845988798555193344 |
| 41733 | 0 | RT @ErrolNazareth: Yep. And donating to @redcr... | 727703498679324672 |
| 18224 | 1 | RT @MikeBloomberg: Don't let anyone tell you t... | 847767233677803520 |
| 35985 | -1 | @cathmckenna As a meteorologist I can tell you... | 959513301255409664 |
| 27399 | 1 | The 3% of scientific papers that deny climate ... | 911680869383000064 |
| 6354 | 1 | RT @kurteichenwald: China is now explaining th... | 799128194960019456 |
| 35017 | 1 | RT @_sadistt: Everyone needs to stop moving to... | 956960183313367042 |
| 23408 | 1 | RT @SpiloveD: Heat index will likely reach 105... | 877497836014886912 |


Looking at a sample of 10 rows of the data, we can make the following observations and assumptions:

* Some tweets contain Twitter handles (e.g. @IEA), numbers (e.g. the year 1995), hashtags (e.g. #BeforeTheFlood) and retweet markers (RT).
* Some tweets contain names of organisations, continents and countries.
* New lines are represented by '\n' in the tweet string.
* The tweets may contain URLs.
* The tweets may contain percentages, money symbols and emoticons.

Let's check two of these assumptions: the presence of URLs and of newlines.

In [41]:
print(df['message'].str.contains('http').sum())
print(df['message'].str.contains('\n').sum())

25956
3966
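
The remaining assumptions (handles, hashtags, retweets, numbers) could be checked in the same spirit. The sketch below is only an illustration: the regular expressions are simple assumed patterns rather than the notebook's own, the helper `count_matching` is hypothetical, and the messages are invented.

```python
import re

# Illustrative patterns; these are assumptions, not the notebook's own regexes.
PATTERNS = {
    "handle": re.compile(r"@\w+"),         # e.g. @IEA
    "hashtag": re.compile(r"#\w+"),        # e.g. #BeforeTheFlood
    "retweet": re.compile(r"^RT\b"),       # message starts with "RT"
    "number": re.compile(r"\d+"),          # any run of digits
    "url": re.compile(r"https?://\S+"),    # http or https links
}

def count_matching(messages, pattern):
    """Return how many messages contain at least one match of pattern."""
    return sum(1 for text in messages if pattern.search(text))

# Toy messages invented for illustration.
sample = [
    "RT @IEA: emissions rose in 1995 https://example.org/report",
    "Watch #BeforeTheFlood tonight",
    "@bob 50% of #climate funding",
    "Nothing special here",
]

counts = {name: count_matching(sample, pat) for name, pat in PATTERNS.items()}
print(counts)
```

On the real data, the same pattern strings could be passed to `df['message'].str.contains(...)`, as in the cell above.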
