# The Tide of Climate on Twitter: Twitter Climate Change Sentiment Analysis  
## By Arjun Gandhi
#### Last updated on December 10, 2020

## 1) Data Collection
I am starting off with a data set from Harvard that contains 39.6 million tweets related to climate change. The data set is distributed as tweet IDs (numbers), so I need to retrieve the actual tweet for each tweet ID.

Here is the link to the data set: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5QCCUU

As stated at the link above, the data was collected between September 21, 2017 and May 17, 2019, with a gap in collection from January 7, 2019 to April 17, 2019.

To convert each tweet ID into the actual tweet data I am using this: Hydrator [Computer Software]. Retrieved from https://github.com/docnow/hydrator

From the above repo, I downloaded this version of the app: https://github.com/DocNow/hydrator/releases/tag/v0.0.13

The tweet IDs are separated by file (~10 million tweets/file). I made a Twitter account and connected it to Hydrator. I then uploaded each txt file into Hydrator under "Datasets" in the desktop app.

TALK ABOUT THEIR METHODOLOGY

In [1]:
import pandas as pd

data = pd.read_csv("./data/tweets200k.csv")
data = data.head(5000) # Development-only sample of 5,000 rows; remove this line for the full run

## Data Wrangling
The data set has lots of data that is not needed for this analysis. Since we are looking at sentiment over time and at other factors related to the politics of a state and to events, it is simplest to just drop all non-English tweets.

There are lots of extraneous columns that are not relevant to this project, so I just dropped them. These include user specifics like profile details, and other things like the URL of the tweet or the language, since all remaining tweets will be in English.

I then renamed some columns for ease of use and made the tweet ID the index column.

The date and time are given as a string, so I use regular expressions to convert them to a `datetime.date` object. The hashtag column is given as one string, so I split it into a list of hashtags.

In [2]:
# Remove non-English tweets from the data set
data = data[data["lang"] == "en"]

# Drop all the unneeded columns from the data set
cols_to_delete = ["user_urls", "user_statuses_count", "coordinates", "user_name", "in_reply_to_status_id",
                  "in_reply_to_user_id", "user_time_zone", "urls", "lang", "media", "source",
                  "retweet_screen_name", "retweet_id", "possibly_sensitive", "tweet_url",
                  "user_default_profile_image", "user_friends_count", "user_verified", "user_location",
                  "in_reply_to_screen_name", "user_screen_name.1",
                  "user_favourites_count", "user_listed_count", "user_created_at", "user_description", "place",
                  "user_followers_count"]

data = data.drop(columns=cols_to_delete)

# Rename columns to clarify their meaning (e.g. id -> tweetID), then use the tweet ID
# as the index instead of the default 0...n
data = data.rename(columns={"id": "tweetID", "created_at": "date/time", "user_screen_name": "tweeter"})
data = data.set_index('tweetID')

In [3]:
# Convert the date/time strings into date objects
import re
import datetime

dates = []

# The regex matches timestamps of this form, which appear in every row of the dataframe:
# Mon Jan 22 09:49:35 +0000 2018
regex = re.compile(r"(\w{3}) (\w{3}) (\d\d) (\d\d:\d\d:\d\d) \+(0{4}) (\d{4})")

# Given a month abbreviation, return the corresponding integer, e.g. "Jan" -> 1
# (returns None for an unrecognized abbreviation)
MONTH_NUMBERS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
                 "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

def numerize(month_name):
    return MONTH_NUMBERS.get(month_name.lower())
        
for row in data.iterrows():
    dt = row[1]["date/time"]
    matches = re.search(regex, dt)
    groups = matches.groups()    
    month = numerize(groups[1])
    d = datetime.date(int(groups[5]), month, int(groups[2]))
    dates.append(d)
    
data = data.drop(columns=["date/time"])
data["date_tweeted"] = dates
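
As an alternative to the hand-written regex, the same timestamp layout can be parsed with the standard library. This is only a minimal sketch: `TWITTER_DATE_FORMAT` and `parse_tweet_date` are names I made up for illustration, and `%a`/`%b` assume an English locale for the day and month abbreviations.

```python
from datetime import date, datetime

# Format string for timestamps such as "Mon Jan 22 09:49:35 +0000 2018"
# (hypothetical constant name, not part of the original notebook)
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_tweet_date(created_at: str) -> date:
    """Parse a Twitter timestamp string and keep only the calendar date."""
    return datetime.strptime(created_at, TWITTER_DATE_FORMAT).date()

print(parse_tweet_date("Mon Jan 22 09:49:35 +0000 2018"))  # 2018-01-22
```

If this approach were used in the notebook, the function could be applied to the "date/time" column with `data["date/time"].map(parse_tweet_date)` instead of looping over rows.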

In [4]:
# Make hashtags into a list of strings for each tweet
# nan is of type float and the rest are strings
tags_lsts = []

# For every row, if it's not nan then split the string into a list of strings;
# if it's nan just add nan to the list of lists
for row in data.iterrows():
    tags = row[1]["hashtags"]
    if type(tags) is str:
        lst = tags.split()
        # make all hashtags lowercase to make them easier to compare
        lst = list(map(lambda x : x.lower(), lst))
        tags_lsts.append(lst)
    else:
        tags_lsts.append(float("nan"))
        
# swap out current hashtags column for this list of lists, now df has a column of lists where each row has 
# the list of all hashtags in the tweet
data = data.drop(columns=["hashtags"])
data["hashatg_list"] = tags_lsts
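
The same per-row logic can also be written as a small standalone function, which makes it easy to check on a single value. This is a sketch only; `split_hashtags` is a hypothetical helper name, and it assumes (as in the cell above) that hashtags are whitespace-separated and that missing values arrive as a float nan.

```python
import math

def split_hashtags(tags):
    """Return a lowercase list of hashtags, or nan when the tweet has none."""
    if isinstance(tags, str):
        # Lowercase so that e.g. "ClimateChange" and "climatechange" compare equal
        return [tag.lower() for tag in tags.split()]
    return float("nan")

print(split_hashtags("UN cdnpoli ONpoli"))      # ['un', 'cdnpoli', 'onpoli']
print(math.isnan(split_hashtags(float("nan")))) # True
```

A function like this could be passed to `data["hashtags"].apply(...)` in place of the explicit loop.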

In [5]:
# Combine the number of favorites and retweets for a tweet into a total interactions score
total_interactions = []

for row in data.iterrows():
    tweet = row[1] 
    total = tweet["retweet_count"] + tweet["favorite_count"]
    total_interactions.append(total)

# Swap out the current RT and favorites columns for the total interactions columns
data["total_interactions"] = total_interactions
data = data.drop(columns=["retweet_count", "favorite_count"])

data.head()

| tweetID | text | tweeter | date_tweeted | hashatg_list | total_interactions |
|---|---|---|---|---|---|
| 1072294898588631040 | .@TheRebelTV goes to two different #UN confere... | RebelNewsOnline | 2018-12-11 | [un, cdnpoli, onpoli, abpoli] | 147 |
| 955376892026093569 | @Pontifex Prayers to God the one &amp; only t... | Frank34802901 | 2018-01-22 | NaN | 0 |
| 1025728615399469058 | Red alert in #Spain and #Portugal as Europe ne... | Steven9Hugh | 2018-08-04 | [spain, portugal, climatechange, globalwarming... | 1 |
| 932915956824682496 | Trump /GOP are the swamp #Resist #FakePresiden... | athoughtz | 2017-11-21 | [resist, fakepresident, dontard, gop, nra, war... | 0 |
| 1041547806622797824 | Study: Green Buildings Save $6.7 Billion in #H... | IndiaGreenBldg | 2018-09-17 | [health, climate, greenbuilding, sustainabilit... | 4 |


## Prepare the tweets for sentiment analysis
There are several things that need to be done to the actual text of the tweets before we can do sentiment analysis on them. To start off, I will do some basic things like make all tweet bodies lower case so that words like CLIMATE, climate, and cLiMate are all treated the same by the model I use later on. Next, I am going to remove all links from these tweets because they are irrelevant to the sentiment of the tweet.

Then I will move on to some more linguistic/NLP steps.
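
The two basic cleaning steps described above can be sketched on a single string as follows. This is a minimal illustration, not the notebook's actual cells: `basic_clean` and `LINK_PATTERN` are hypothetical names, and the sample tweet is made up.

```python
import re

# Anything starting with "http" up to the next whitespace is treated as a link
LINK_PATTERN = re.compile(r"http\S+")

def basic_clean(text: str) -> str:
    """Lower-case a tweet and strip any links from it."""
    return LINK_PATTERN.sub("", text.lower())

# Made-up example tweet; the link is removed and the rest is lower-cased
print(repr(basic_clean("Red ALERT in #Spain https://t.co/abc123")))
```

Note that removing a link leaves the surrounding whitespace in place, so a tweet that ended with a link will end with a trailing space.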

### Make all tweets lower case

In [6]:
# Make all tweets lower case so that, when a model is used to look at sentiment, the same word is
# treated the same way regardless of capitalization, i.e. the model won't think Word is not word.

lower_case_tweets = []
for r in data.iterrows(): lower_case_tweets.append(r[1]["text"].lower())
data = data.drop(columns=["text"])
data["text"] = lower_case_tweets

### Remove links from tweets

In [7]:
# remove links from tweets because they are not helpful in analyzing sentiment

# I found this regex here: https://regexr.com/3e6m0
linkless = []
regex = re.compile(r"http\S+")

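# remove all links from each tweet; tweets without links are left unchanged by the substitution
# (no str.find check is needed: find returns -1 when there is no match, which is truthy,
# and 0 when the tweet starts with a link, which is falsy)
for row in data.iterrows():
    txt = row[1]["text"]
    linkless.append(regex.sub("", txt))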

data["text"] = linkless
data.head()

| tweetID | tweeter | date_tweeted | hashatg_list | total_interactions | text |
|---|---|---|---|---|---|
| 1072294898588631040 | RebelNewsOnline | 2018-12-11 | [un, cdnpoli, onpoli, abpoli] | 147 | .@therebeltv goes to two different #un confere... |
| 955376892026093569 | Frank34802901 | 2018-01-22 | NaN | 0 | @pontifex prayers to god the one &amp; only t... |
| 1025728615399469058 | Steven9Hugh | 2018-08-04 | [spain, portugal, climatechange, globalwarming... | 1 | red alert in #spain and #portugal as europe ne... |
| 932915956824682496 | athoughtz | 2017-11-21 | [resist, fakepresident, dontard, gop, nra, war... | 0 | trump /gop are the swamp #resist #fakepresiden... |
| 1041547806622797824 | IndiaGreenBldg | 2018-09-17 | [health, climate, greenbuilding, sustainabilit... | 4 | study: green buildings save $6.7 billion in #h... |


### Stopwords and Lemmatization 
Here I will remove stopwords from the tweet bodies. These are words like "I" and "this" that add little meaning to the tweet but, if left in the text, will give me an inaccurate depiction of the most common words in the tweets. Then I will perform lemmatization on the tweets. This means taking different inflected forms of the same word, like walked and walking, and reducing them to their base form, in this case the word walk. You can read more here: https://en.wikipedia.org/wiki/Lemmatisation. I will be doing this using spaCy (https://spacy.io).

You can find installation instructions for spaCy here: https://spacy.io/usage. 

In [9]:
# I ran these commands to set up spaCy.
# pip install -U spacy
# pip install -U spacy-lookups-data
# python -m spacy download en_core_web_sm
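
Once spaCy and the `en_core_web_sm` model are installed, the stopword removal and lemmatization described above might look roughly like the sketch below. The function names `lemmatize_without_stopwords` and `demo` are hypothetical, and this is one possible approach under those assumptions, not necessarily the exact pipeline used later.

```python
def lemmatize_without_stopwords(text, nlp):
    """Return the lemmas of all tokens that are not stopwords, punctuation, or whitespace.

    `nlp` is expected to be a loaded spaCy pipeline, e.g. spacy.load("en_core_web_sm").
    """
    doc = nlp(text)
    return [
        token.lemma_
        for token in doc
        if not (token.is_stop or token.is_punct or token.is_space)
    ]

def demo():
    """Run the helper on a made-up sentence; requires spaCy and en_core_web_sm."""
    import spacy  # imported here so the helper above can be defined without spaCy installed

    nlp = spacy.load("en_core_web_sm")
    print(lemmatize_without_stopwords("i was walking to the climate march", nlp))
```

Calling `demo()` loads the model once; for a full data set, the pipeline should be loaded a single time and reused for every tweet.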

In [8]:
import spacy

ModuleNotFoundError: No module named 'spacy'