## Climate Change Tweets Sentiment Analysis

The data for this project was taken from Kaggle on March 10th, 2023. You can find it [here](https://www.kaggle.com/datasets/die9origephit/climate-change-tweets?resource=download).

In this project I will be looking at tweets posted between January 17th, 2022 and July 19th, 2022, the date range given in the dataset's filename. The data was scraped with the [Scweet scraper](https://github.com/Altimis/Scweet), searching for the term 'climate change'.

We will use natural language processing (NLP) to understand the sentiment of Twitter users when they talked about climate change during that period.


In [1]:
import pandas as pd
import numpy as np

import regex as re
import string

import seaborn as sns
import matplotlib.pyplot as plt

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords, wordnet

from sklearn.feature_extraction.text import CountVectorizer   

In [2]:
csv_path = 'Climate change_2022-1-17_2022-7-19.csv'
#read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv(csv_path)
df.head(5)

| | UserScreenName | UserName | Timestamp | Text | Embedded_text | Emojis | Comments | Likes | Retweets | Image link | Tweet URL |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lauren Boebert | @laurenboebert | 2022-01-17T23:32:38.000Z | Lauren Boebert\n@laurenboebert\n·\nJan 18 | The only solution I’ve ever heard the Left pro... | | 1683 | 2259 | 11.7K | [] | https://twitter.com/laurenboebert/status/14832... |
| 1 | Catherine | @catherine___c | 2022-01-17T22:54:02.000Z | Catherine\n@catherine___c\n·\nJan 17 | Climate change doesn’t cause volcanic eruption... | | 158 | 64 | 762 | [] | https://twitter.com/catherine___c/status/14832... |
| 2 | king Keith | @KaConfessor | 2022-01-17T23:51:41.000Z | king Keith\n@KaConfessor\n·\nJan 18 | Vaccinated tennis ball boy collapses in the te... | | 24 | 118 | 159 | ['https://pbs.twimg.com/ext_tw_video_thumb/148... | https://twitter.com/KaConfessor/status/1483225... |
| 3 | PETRIFIED CLIMATE PARENT | @climate_parent | 2022-01-17T21:42:04.000Z | PETRIFIED CLIMATE PARENT\n@climate_parent\n·\n... | North America has experienced an average winte... | | 15 | 50 | 158 | [] | https://twitter.com/climate_parent/status/1483... |
| 4 | Thomas Speight | @Thomas_Sp8 | 2022-01-17T21:10:40.000Z | Thomas Speight\n@Thomas_Sp8\n·\nJan 17 | They're gonna do the same with Climate Change ... | 🅾 | 4 | 24 | 127 | ['https://pbs.twimg.com/profile_images/1544171... | https://twitter.com/Thomas_Sp8/status/14831850... |


In [3]:
print(f'We have {len(df.columns)} columns.')

We have 11 columns.


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9050 entries, 0 to 9049
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   UserScreenName  9037 non-null   object
 1   UserName        9050 non-null   object
 2   Timestamp       9050 non-null   object
 3   Text            9050 non-null   object
 4   Embedded_text   9050 non-null   object
 5   Emojis          2026 non-null   object
 6   Comments        6278 non-null   object
 7   Likes           8431 non-null   object
 8   Retweets        8877 non-null   object
 9   Image link      9050 non-null   object
 10  Tweet URL       9050 non-null   object
dtypes: object(11)
memory usage: 777.9+ KB


- **UserScreenName:** the user's displayed name
- **UserName:** the username (twitter handle)
- **Timestamp:** time of tweet
- **Text:** a combination of the above three columns
- **Embedded_text:** text of the tweet, emojis removed
- **Emojis:** any emojis that were part of the original tweet
- **Comments:** the number of comments the tweet received
- **Likes:** the number of likes the tweet received
- **Retweets:** the number of times the tweet was retweeted
- **Image link:** if there was an image attached, a link to that image is provided
- **Tweet URL:** direct link to the tweet

We can see from the information above that there are 9050 rows in total.
- Some columns have null values: Emojis, Comments, Likes and Retweets. This is to be expected, as not all tweets will have or receive these things. UserScreenName is also missing for 13 rows.
- The columns of most importance have no nulls: UserName, Timestamp and Embedded_text.
- The Image link column has no nulls, but we can see from the `df.head()` call earlier that a lot of fields are populated with [] instead of NaN, so this figure is misleading. We'll need to fix that in case we decide to use the column later. It is also worth checking the other non-null columns for the same problem.
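The last point above can be checked by counting placeholder strings that a null count would miss. This is only a minimal, library-free sketch: the helper name `count_placeholders`, the list of placeholder strings and the toy values are assumptions made for illustration, not part of the notebook.

```python
def count_placeholders(values, placeholders=("[]", "", "nan", "None")):
    """Count entries that are not null but carry no real information."""
    return sum(1 for v in values if isinstance(v, str) and v.strip() in placeholders)


# Toy stand-in for the 'Image link' column: two real links, three empty lists
image_links = [
    "[]",
    "['https://example.com/a.jpg']",
    "[]",
    "[]",
    "['https://example.com/b.jpg']",
]

hidden_missing = count_placeholders(image_links)
print(f"{hidden_missing} of {len(image_links)} values are placeholders")
```

On the real DataFrame, an equivalent check for this one column would be `(df['Image link'] == '[]').sum()`.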


## Cleaning
Tasks for cleaning:
- Rename columns to lowercase & snake case for ease of use
- Remove the [] value from Image Link, replace with NaN
- Check other non-null columns don't have the same problem as the Image Link column above
- Convert likes, retweets and comments columns to proper integers
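The renaming task in the list above could also be automated instead of using a hand-written mapping. The helper below is a hypothetical sketch (the name `to_snake_case` is not from the notebook). Note that it would turn `UserName` into `user_name`, whereas the notebook keeps `username`.

```python
import re


def to_snake_case(name):
    """Convert a column label such as 'UserScreenName' or 'Image link' to snake_case."""
    # Insert an underscore wherever a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse runs of spaces and hyphens into a single underscore
    name = re.sub(r"[\s\-]+", "_", name)
    return name.lower()


columns = ["UserScreenName", "Image link", "Tweet URL", "Embedded_text"]
print([to_snake_case(c) for c in columns])
```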

In [5]:
#lowercase columns
df.columns = df.columns.str.lower()

#changing some cols to snake case
snake_cols = {'userscreenname':'user_screen_name', 'image link':'image_link', 'tweet url':'tweet_url'}
df = df.rename(columns=snake_cols)

In [6]:
#replacing [] with NaN values in image_link
df['image_link'] = df['image_link'].replace('[]', np.nan)

In [7]:
#Check no other columns have false NaN's like above. Easiest way seems to be checking string length.
false_nan_1 = df['username'].apply(lambda x: x if len(x) < 4 else 'no issue')
false_nan_2 = df['timestamp'].apply(lambda x: x if len(x) < 4 else 'no issue')
false_nan_3 = df['text'].apply(lambda x: x if len(x) < 4 else 'no issue')
false_nan_4 = df['embedded_text'].apply(lambda x: x if len(x) < 4 else 'no issue')

In [8]:
print(false_nan_1.value_counts())
print(false_nan_2.value_counts())
print(false_nan_3.value_counts())
print(false_nan_4.value_counts())

#Looks good: the only short values are the real handles @BW and @AP

no issue    9043
@BW            4
@AP            3
Name: username, dtype: int64
no issue    9050
Name: timestamp, dtype: int64
no issue    9050
Name: text, dtype: int64
no issue    9050
Name: embedded_text, dtype: int64


In [10]:
#Converting numerical columns to integers.
#Comments and Likes use commas as thousands separators, while Retweets also uses K with a decimal point (e.g. 11.7K)

#Only the commas are stripped; the decimal point has to stay so that '11.7K' becomes 11700 rather than 117000
df['retweets'] = df['retweets'].str.replace(',', '', regex=False)

#Need to make sure any values with K (thousand) are converted to 1000's
def value_to_float(x):
    #numbers (including NaN) pass through unchanged
    if isinstance(x, (int, float)):
        return x
    if 'K' in x:
        if len(x) > 1:
            return round(float(x.replace('K', '')) * 1000)
        return 1000.0
    return float(x)

df['retweets'] = df['retweets'].apply(value_to_float)

df['retweets'] = df['retweets'].fillna(0).astype(int)


In [11]:
#Doing same as above to the Comments and Likes columns

df['comments'] = df['comments'].str.replace(',','')
df['likes'] = df['likes'].str.replace(',','')

df['comments'] = df['comments'].apply(value_to_float)
df['likes'] = df['likes'].apply(value_to_float)


df['comments'] = df['comments'].fillna(0).astype(int)
df['likes'] = df['likes'].fillna(0).astype(int)
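The count conversion can be tested on its own, away from the DataFrame. The function below is a standalone sketch with a hypothetical name (`parse_count`). It assumes counts arrive as strings like `'1,683'` or `'11.7K'`, or as missing values. The decimal point is kept until after the `K` has been handled, because stripping it first would turn `11.7K` into `117K`.

```python
def parse_count(raw):
    """Turn a scraped count such as '1,683', '11.7K' or a missing value into an integer."""
    # None and NaN (the only value not equal to itself) both mean "no count"
    if raw is None or (isinstance(raw, float) and raw != raw):
        return 0
    text = str(raw).strip().replace(",", "")
    if not text:
        return 0
    if text.upper().endswith("K"):
        number = text[:-1]
        # A bare 'K' is treated as one thousand
        return round(float(number) * 1000) if number else 1000
    return round(float(text))


for raw in ["1,683", "11.7K", "762", "2K", None]:
    print(repr(raw), "->", parse_count(raw))
```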

In [12]:
df.head(5)

| | user_screen_name | username | timestamp | text | embedded_text | emojis | comments | likes | retweets | image_link | tweet_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lauren Boebert | @laurenboebert | 2022-01-17T23:32:38.000Z | Lauren Boebert\n@laurenboebert\n·\nJan 18 | The only solution I’ve ever heard the Left pro... | | 1683 | 2259 | 11700 | | https://twitter.com/laurenboebert/status/14832... |
| 1 | Catherine | @catherine___c | 2022-01-17T22:54:02.000Z | Catherine\n@catherine___c\n·\nJan 17 | Climate change doesn’t cause volcanic eruption... | | 158 | 64 | 762 | | https://twitter.com/catherine___c/status/14832... |
| 2 | king Keith | @KaConfessor | 2022-01-17T23:51:41.000Z | king Keith\n@KaConfessor\n·\nJan 18 | Vaccinated tennis ball boy collapses in the te... | | 24 | 118 | 159 | ['https://pbs.twimg.com/ext_tw_video_thumb/148... | https://twitter.com/KaConfessor/status/1483225... |
| 3 | PETRIFIED CLIMATE PARENT | @climate_parent | 2022-01-17T21:42:04.000Z | PETRIFIED CLIMATE PARENT\n@climate_parent\n·\n... | North America has experienced an average winte... | | 15 | 50 | 158 | | https://twitter.com/climate_parent/status/1483... |
| 4 | Thomas Speight | @Thomas_Sp8 | 2022-01-17T21:10:40.000Z | Thomas Speight\n@Thomas_Sp8\n·\nJan 17 | They're gonna do the same with Climate Change ... | 🅾 | 4 | 24 | 127 | ['https://pbs.twimg.com/profile_images/1544171... | https://twitter.com/Thomas_Sp8/status/14831850... |


We'll also clean the tweet text now to help with the analysis later.

In [13]:
def tweet_cleaner(tweet):
    #lowercase all the tweets
    tweet = tweet.lower()
    #remove mentions
    tweet = re.sub(r'@\w*', '', tweet)
    #remove URLs; \S+ stops at the first whitespace, so text after a link is kept
    tweet = re.sub(r'https?://\S+', '', tweet)
    #remove hashtags
    tweet = re.sub(r'#\w*', '', tweet)
    #remove numbers
    tweet = re.sub(r'\d+', '', tweet)
    #remove \n
    tweet = tweet.replace('\n', ' ')
    #remove punctuation
    tweet = re.sub(r"[,.;:@#?!&$-]+\ *", ' ', tweet)
    #remove apostrophes, don't add a space after
    tweet = re.sub(r"['’]+\ *", '', tweet)
    return tweet

In [14]:
df['embedded_text'] = df['embedded_text'].apply(tweet_cleaner)
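URL removal is the cleaning step where the choice of pattern matters most. A greedy `.*` inside a URL pattern keeps matching up to the last `/` on the line, so it can swallow ordinary words that follow a link. The comparison below is a small illustration on a made-up tweet; both the sample text and the constant names are assumptions.

```python
import re

GREEDY_URL = re.compile(r"https?://.*/\w*")  # '.*' runs on to the last '/' in the text
BOUNDED_URL = re.compile(r"https?://\S+")    # stops at the first whitespace

tweet = "read https://example.com/a now and/or later"

greedy_result = GREEDY_URL.sub("", tweet)
bounded_result = BOUNDED_URL.sub("", tweet)

print(repr(greedy_result))   # the words between the link and the later '/' are lost
print(repr(bounded_result))  # only the link itself is removed
```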

In [15]:
df

| | user_screen_name | username | timestamp | text | embedded_text | emojis | comments | likes | retweets | image_link | tweet_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Lauren Boebert | @laurenboebert | 2022-01-17T23:32:38.000Z | Lauren Boebert\n@laurenboebert\n·\nJan 18 | the only solution ive ever heard the left prop... | | 1683 | 2259 | 11700 | | https://twitter.com/laurenboebert/status/14832... |
| 1 | Catherine | @catherine___c | 2022-01-17T22:54:02.000Z | Catherine\n@catherine___c\n·\nJan 17 | climate change doesnt cause volcanic eruptions | | 158 | 64 | 762 | | https://twitter.com/catherine___c/status/14832... |
| 2 | king Keith | @KaConfessor | 2022-01-17T23:51:41.000Z | king Keith\n@KaConfessor\n·\nJan 18 | vaccinated tennis ball boy collapses in the te... | | 24 | 118 | 159 | ['https://pbs.twimg.com/ext_tw_video_thumb/148... | https://twitter.com/KaConfessor/status/1483225... |
| 3 | PETRIFIED CLIMATE PARENT | @climate_parent | 2022-01-17T21:42:04.000Z | PETRIFIED CLIMATE PARENT\n@climate_parent\n·\n... | north america has experienced an average winte... | | 15 | 50 | 158 | | https://twitter.com/climate_parent/status/1483... |
| 4 | Thomas Speight | @Thomas_Sp8 | 2022-01-17T21:10:40.000Z | Thomas Speight\n@Thomas_Sp8\n·\nJan 17 | theyre gonna do the same with climate change w... | 🅾 | 4 | 24 | 127 | ['https://pbs.twimg.com/profile_images/1544171... | https://twitter.com/Thomas_Sp8/status/14831850... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9045 | Dr Srijana Mitra Das | @srijanapiya17 | 2022-07-18T12:08:28.000Z | Dr Srijana Mitra Das\n@srijanapiya17\n·\nJul 18 | is now the greatest story on earth how it wil... | | 2 | 16 | 24 | ['https://pbs.twimg.com/profile_images/5140754... | https://twitter.com/srijanapiya17/status/15490... |
| 9046 | 1%_Better_Every_Day | @jh336405 | 2022-07-18T00:33:20.000Z | 1%_Better_Every_Day\n@jh336405\n·\nJul 18 | replying to  and   others and   stefan rahmst... | 💯 💯 🌏 | 4 | 0 | 0 | ['https://pbs.twimg.com/profile_images/1442412... | https://twitter.com/jh336405/status/1548828230... |
| 9047 | David Schechter | @DavidSchechter | 2022-07-18T21:13:13.000Z | David Schechter\n@DavidSchechter\n·\nJul 18 | while texans are being asked to use less elect... | | 3 | 14 | 23 | ['https://pbs.twimg.com/card_img/1549138950475... | https://twitter.com/DavidSchechter/status/1549... |
| 9048 | Daily Climate | @TheDailyClimate | 2022-07-18T10:15:09.000Z | Daily Climate\n@TheDailyClimate\n·\nJul 18 | sea levels are rising and communities are scra... | | 0 | 3 | 0 | ['https://pbs.twimg.com/card_img/1547862999808... | https://twitter.com/TheDailyClimate/status/154... |


## Preprocessing

To classify the tweets as positive, negative or neutral with machine learning, I will need a labelled training dataset. I will export a sample of 1000 tweets and manually assign a label to each, then use them to train the model. The remainder of the tweets will be the test dataset.

In [16]:
training = df.sample(n=1000, random_state=10)
training.to_csv('training_data.csv')
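The cell above exports the training sample but does not yet build the test set from the remaining rows. The split can be sketched with plain row indices, as below. The function name `split_indices` is hypothetical, and Python's `random` module will not reproduce the rows that `df.sample(random_state=10)` picks.

```python
import random


def split_indices(n_rows, n_train, seed=10):
    """Pick n_train row indices at random; every other row goes to the test set."""
    rng = random.Random(seed)
    train = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train)
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test


train_idx, test_idx = split_indices(n_rows=9050, n_train=1000)
print(len(train_idx), len(test_idx))  # 1000 8050
```

With the DataFrame from this notebook, the remainder could be taken directly with `df.drop(training.index)`.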

## Data Wrangling

Now that we have a clean dataset, we can set ourselves up for analysis using some nltk tools.

We want to do a sentiment analysis of the embedded_text column (the tweets), so we will need to:
- Remove names and stopwords from the tweets
- Lemmatise the words (reduce them to their root form)
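The first step in the list above is a token filter, and it only works if the word list and the text use the same casing. The tweets were lowercased during cleaning, so a list of capitalised names would never match them. The sketch below uses tiny made-up word lists in place of the NLTK corpora, and the helper name `drop_words` is an assumption.

```python
def drop_words(text, unwanted):
    """Remove every whitespace-separated token that appears in `unwanted`."""
    return " ".join(word for word in text.split() if word not in unwanted)


# Hypothetical miniature word lists, standing in for the NLTK corpora
stop_words = {"the", "is", "a", "on"}
names = ["Greta", "David"]

tweet = "greta says the climate is changing"  # already lowercased by the cleaner

without_stops = drop_words(tweet, stop_words)
# Capitalised names do not match lowercased text, so nothing is removed here
unchanged = drop_words(without_stops, set(names))
# Lowercasing the name list first makes the filter take effect
without_names = drop_words(without_stops, {n.lower() for n in names})

print(without_stops)
print(without_names)
```

Using sets also keeps each membership test fast, which helps when filtering thousands of tweets.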

In [17]:
nltk.download(['stopwords','names','punkt'])
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/cbuttle/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package names to /Users/cbuttle/nltk_data...
[nltk_data]   Package names is already up-to-date!
[nltk_data] Downloading package punkt to /Users/cbuttle/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/cbuttle/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/cbuttle/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

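In [18]:
#compile stopwords (as a set for fast lookups; a new name avoids shadowing the imported stopwords module)
stop_words = set(nltk.corpus.stopwords.words("english"))

In [19]:
#remove stopwords
df['tweets'] = df['embedded_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [20]:
#compile names, lowercased to match the cleaned tweets
#note: this can also drop ordinary words that double as names
names = {name.lower() for name in nltk.corpus.names.words(['male.txt', 'female.txt'])}

In [21]:
#remove names, starting from the stopword-free tweets so the previous step isn't overwritten
df['tweets'] = df['tweets'].apply(lambda x: ' '.join([word for word in x.split() if word not in names]))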

In [22]:
lemmatise = WordNetLemmatizer()

In [None]:
#Lemmatising the words

#First we need to tokenise them:
df['tweet_token'] = df['tweets'].apply(word_tokenize)

#Then we need to assign a part of speech (POS) tag to each word, using wordnet
df['part_of_speech'] = df['tweet_token'].apply(nltk.tag.pos_tag)

def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
df['wordnet_pos'] = df['part_of_speech'].apply(
    lambda x: [(word, get_wordnet_pos(part_of_speech)) for (word, part_of_speech) in x])
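The tag conversion can be tried without downloading any NLTK data, because WordNet's part-of-speech constants are plain single letters (`'a'`, `'v'`, `'n'`, `'r'`). The sketch below uses hand-written Penn Treebank tags for illustration, and the names `WORDNET_POS` and `to_wordnet_pos` are hypothetical.

```python
# WordNet's part-of-speech codes: adjective, verb, noun, adverb
WORDNET_POS = {"J": "a", "V": "v", "N": "n", "R": "r"}


def to_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag (e.g. 'VBD') to a WordNet code, defaulting to noun."""
    return WORDNET_POS.get(treebank_tag[:1], "n")


# Hand-written tags for illustration; in the notebook they come from nltk.tag.pos_tag
tagged = [("glaciers", "NNS"), ("melted", "VBD"), ("quickly", "RB"), ("in", "IN"), ("warmer", "JJR")]
converted = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(converted)
```

Defaulting to noun matches the lemmatiser's own default, so words with other tags (prepositions, determiners) are usually left unchanged.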


In [None]:
#We now want to create a new column with the lemmatised words

lemmatise = WordNetLemmatizer()

df['lemm'] = df['wordnet_pos'].apply(lambda x: [lemmatise.lemmatize(word, tag) for word, tag in x])
df['lem_str'] = [' '.join(map(str, l)) for l in df['lemm']]

df.head()

We now need to vectorise (encode) the lemmatised text as numbers so that we can analyse it statistically and apply machine learning methods.

In [None]:
vectorise = CountVectorizer()

In [None]:
#Fit the vocabulary and build the document-term matrix (one row per tweet, one column per word)
X = vectorise.fit_transform(df['lem_str'])

In [None]:
X
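To make the result of count vectorising concrete, here is a small pure-Python sketch of roughly what `CountVectorizer` does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary alphabetically and count occurrences per document. The function `count_vectorise` and the two sample sentences are illustrative assumptions. The real class returns a sparse matrix rather than nested lists.

```python
import re
from collections import Counter


def count_vectorise(documents):
    """Build a sorted vocabulary and a dense document-term count matrix."""
    # Tokens are runs of two or more word characters, lowercased
    tokenised = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenised for token in tokens})
    column = {token: i for i, token in enumerate(vocabulary)}

    matrix = []
    for tokens in tokenised:
        row = [0] * len(vocabulary)
        for token, n in Counter(tokens).items():
            row[column[token]] = n
        matrix.append(row)
    return vocabulary, matrix


docs = ["climate change be real", "change the climate policy change"]
vocab, counts = count_vectorise(docs)
print(vocab)
for row in counts:
    print(row)
```

Each row is one document and each column one vocabulary word, so the second row records that "change" appears twice in the second sentence.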