# Are We in the Filter Bubble?: Analyzing Sentiment Towards Climate Change Across Twitter and Time
## By Arjun Gandhi
#### Last updated on November 27, 2020

## 1) Data Collection
I am starting off with a data set from Harvard that contains 39.6 million tweets related to climate change. The data set consists only of tweet IDs (numbers), so I need to retrieve the tweet for each tweet ID.

Here is the link to the data set: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5QCCUU

As stated at the above link, the data was collected between September 21, 2017 and May 17, 2019, with a gap in data collection from January 7, 2019 to April 17, 2019.

To convert each tweet ID into the actual tweet data, I am using Hydrator [Computer Software], retrieved from https://github.com/docnow/hydrator

From the above repo, I downloaded this version of the app: https://github.com/DocNow/hydrator/releases/tag/v0.0.13

The tweet IDs are separated by file (~10 million tweets/file). I made a Twitter account and connected it to Hydrator. I then uploaded each txt file into Hydrator under "Datasets" in the desktop app.

TALK ABOUT THEIR METHODOLOGY

In [1]:
import pandas as pd

data = pd.read_csv("./data/tweets10knongeotagged.csv")
data.head()

Unnamed: 0,coordinates,created_at,hashtags,media,urls,favorite_count,id,in_reply_to_screen_name,in_reply_to_status_id,in_reply_to_user_id,...,user_followers_count,user_friends_count,user_listed_count,user_location,user_name,user_screen_name.1,user_statuses_count,user_time_zone,user_urls,user_verified
0,,Mon Jan 22 09:49:35 +0000 2018,,,,0,955376892026093569,Pontifex,9.551606e+17,500704345.0,...,1,4,0,United States,Frank,Frank34802901,100,,,False
1,,Mon Sep 17 04:42:16 +0000 2018,,,https://truthout.org/articles/national-park-of...,0,1041547863795224576,,,,...,2078,2384,98,USA,OurRevolution,LeftysUnite,49582,,,False
2,,Sat Aug 04 13:02:13 +0000 2018,Spain Portugal climatechange globalwarming Hea...,,https://news.sky.com/story/live-scorching-satu...,1,1025728615399469058,,,,...,23,24,1,,Steven Hugh,Steven9Hugh,1393,,https://stevenhugh.wordpress.com/,False
3,,Tue Nov 21 10:17:51 +0000 2017,Resist FakePresident Dontard GOP NRA War Clima...,,https://twitter.com/mattmfm/status/93272970237...,0,932915956824682496,,,,...,6062,6475,120,Jemez New Mexico,Vote Blue,athoughtz,153914,,https://Twitter.com,False
4,,Mon Sep 17 04:42:02 +0000 2018,Health Climate greenbuilding sustainability cl...,https://twitter.com/IndiaGreenBldg/status/1041...,https://buff.ly/2L5RqvK,3,1041547806622797824,,,,...,196,105,31,,IndiaGreenBldg,IndiaGreenBldg,2435,,http://www.rateitgreen.com/green-building-comm...,False


## 2) Data Wrangling
The data set contains a lot of data that is not needed for this analysis. Since we are looking at sentiment over time and at other factors related to the politics of a state and to events, it is simplest to drop all non-English tweets.

There are many extraneous columns that are not relevant to this project, so I dropped them. These include user specifics, such as profile details, as well as things like the URL of the tweet and the language (since all remaining tweets are in English).

I then renamed some columns for ease of use and made the tweet ID the index column.

The date and time are given as a string, so I use a regular expression to extract the parts and build a `datetime.date` object. The hashtag column is given as a single string, so I split it into a list of hashtags.

In [2]:
# Remove non-English tweets from the data set
data = data[data["lang"] == "en"]

# Drop all the unneeded columns from the data set
cols_to_delete = ["user_urls", "user_statuses_count", "coordinates", "user_name", "in_reply_to_status_id", 
                  "in_reply_to_user_id", "user_time_zone", "urls", "lang", "media", "source", 
                  "retweet_screen_name", "retweet_id", "possibly_sensitive", "tweet_url",
                  "user_default_profile_image", "user_friends_count", "user_verified", "user_location", 
                   "in_reply_to_screen_name", "user_screen_name.1",
                  "user_favourites_count", "user_listed_count", "user_created_at", "user_description", "place", 
                 "user_followers_count"]

data = data.drop(columns=cols_to_delete)

# Rename columns to clarify their meaning (e.g. id -> tweetID), then use the tweet ID
# as the index instead of the default 0...n
data = data.rename(columns={"id": "tweetID", "created_at": "date/time", "user_screen_name": "tweeter"})
data = data.set_index('tweetID')

data.head()

tweetID,date/time,hashtags,favorite_count,retweet_count,text,tweeter
955376892026093569,Mon Jan 22 09:49:35 +0000 2018,,0,0,@Pontifex Prayers to God the one &amp; only t...,Frank34802901
1025728615399469058,Sat Aug 04 13:02:13 +0000 2018,Spain Portugal climatechange globalwarming Hea...,1,0,Red alert in #Spain and #Portugal as Europe ne...,Steven9Hugh
932915956824682496,Tue Nov 21 10:17:51 +0000 2017,Resist FakePresident Dontard GOP NRA War Clima...,0,0,Trump /GOP are the swamp #Resist #FakePresiden...,athoughtz
1041547806622797824,Mon Sep 17 04:42:02 +0000 2018,Health Climate greenbuilding sustainability cl...,3,1,Study: Green Buildings Save $6.7 Billion in #H...,IndiaGreenBldg
932916623404617728,Tue Nov 21 10:20:30 +0000 2017,resist,0,0,Dems get it done. #resist https://t.co/X2oXOEAtPb,berrymaiden


In [3]:
# Convert the date/time strings into datetime.date objects
import re
import datetime

dates = []

# Pattern for the timestamp in every row of the dataframe, which looks like:
# Mon Jan 22 09:49:35 +0000 2018
# Groups: weekday, month, day, time, UTC offset digits, year
regex = re.compile(r"(\w{3}) (\w{3}) (\d\d) (\d\d:\d\d:\d\d) \+(0{4}) (\d{4})")

# Given a three-letter month abbreviation, return the corresponding month number, i.e. Jan == 1
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]

def numerize(month_name):
    # list.index is zero-based, so add 1; raises ValueError for an unknown abbreviation
    return MONTHS.index(month_name.lower()) + 1
        
for row in data.iterrows():
    dt = row[1]["date/time"]
    matches = re.search(regex, dt)
    groups = matches.groups()    
    month = numerize(groups[1])
    d = datetime.date(int(groups[5]), month, int(groups[2]))
    dates.append(d)
    
data = data.drop(columns=["date/time"])
data["date_tweeted"] = dates

data

tweetID,hashtags,favorite_count,retweet_count,text,tweeter,date_tweeted
955376892026093569,,0,0,@Pontifex Prayers to God the one &amp; only t...,Frank34802901,2018-01-22
1025728615399469058,Spain Portugal climatechange globalwarming Hea...,1,0,Red alert in #Spain and #Portugal as Europe ne...,Steven9Hugh,2018-08-04
932915956824682496,Resist FakePresident Dontard GOP NRA War Clima...,0,0,Trump /GOP are the swamp #Resist #FakePresiden...,athoughtz,2017-11-21
1041547806622797824,Health Climate greenbuilding sustainability cl...,3,1,Study: Green Buildings Save $6.7 Billion in #H...,IndiaGreenBldg,2018-09-17
932916623404617728,resist,0,0,Dems get it done. #resist https://t.co/X2oXOEAtPb,berrymaiden,2017-11-21
...,...,...,...,...,...,...
1041756963212648448,ClimateChangeScam,2,2,None. We don't control the climate but pinning...,Zealandian,2018-09-17
1025811954357678080,,1,1,What if the reason Trump and Co aren't pushing...,TallismanRogue,2018-08-04
1041756953226174465,,0,1,INTERACTIVE: Photos and videos of flooding in ...,GullahGeechee,2018-09-17
973650715066302464,,0,0,Yes and no https://t.co/YRe0v2unAB,MichaelFieldNZ,2018-03-13
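
The timestamp layout used in this data set (`Mon Jan 22 09:49:35 +0000 2018`) can also be parsed without a hand-written regex. The following is a minimal sketch, not the method used in this notebook, that relies on `datetime.strptime` from the standard library. The helper name `parse_tweet_date` is hypothetical, and the sketch assumes every timestamp follows this exact layout with English weekday and month abbreviations.

```python
import datetime

# Layout of the created_at strings, e.g. "Mon Jan 22 09:49:35 +0000 2018"
TWEET_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_tweet_date(created_at):
    """Return the calendar date of a created_at string (hypothetical helper)."""
    # strptime raises ValueError if the string does not match the layout
    return datetime.datetime.strptime(created_at, TWEET_TIME_FORMAT).date()

example = parse_tweet_date("Mon Jan 22 09:49:35 +0000 2018")
print(example)
```

Applied to the DataFrame, this could look like `data["date/time"].map(parse_tweet_date)`, which should give the same `datetime.date` values as the loop.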


In [4]:
# Make hashtags into a list of strings for each tweet
# nan is of type float and the rest are strings
tags_lsts = []

# For every row, if the value is not nan, split the string into a list of strings;
# if it is nan, just add nan to the list of lists
for row in data.iterrows():
    tags = row[1]["hashtags"]
    if type(tags) is str:
        tags_lsts.append(tags.split())
    else:
        tags_lsts.append(float("nan"))
        
# Swap out the current hashtags column for this list of lists; the df now has a column
# where each row holds the list of all hashtags in the tweet
data = data.drop(columns=["hashtags"])
data["hashatg_list"] = tags_lsts
data

tweetID,favorite_count,retweet_count,text,tweeter,date_tweeted,hashatg_list
955376892026093569,0,0,@Pontifex Prayers to God the one &amp; only t...,Frank34802901,2018-01-22,
1025728615399469058,1,0,Red alert in #Spain and #Portugal as Europe ne...,Steven9Hugh,2018-08-04,"[Spain, Portugal, climatechange, globalwarming..."
932915956824682496,0,0,Trump /GOP are the swamp #Resist #FakePresiden...,athoughtz,2017-11-21,"[Resist, FakePresident, Dontard, GOP, NRA, War..."
1041547806622797824,3,1,Study: Green Buildings Save $6.7 Billion in #H...,IndiaGreenBldg,2018-09-17,"[Health, Climate, greenbuilding, sustainabilit..."
932916623404617728,0,0,Dems get it done. #resist https://t.co/X2oXOEAtPb,berrymaiden,2017-11-21,[resist]
...,...,...,...,...,...,...
1041756963212648448,2,2,None. We don't control the climate but pinning...,Zealandian,2018-09-17,[ClimateChangeScam]
1025811954357678080,1,1,What if the reason Trump and Co aren't pushing...,TallismanRogue,2018-08-04,
1041756953226174465,0,1,INTERACTIVE: Photos and videos of flooding in ...,GullahGeechee,2018-09-17,
973650715066302464,0,0,Yes and no https://t.co/YRe0v2unAB,MichaelFieldNZ,2018-03-13,
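
Some hashtags in the sample differ only by capitalization (for example `Resist` and `resist`), which matters if the lists are later compared or counted. The sketch below expresses the same split-or-NaN logic as a small function with optional lowercasing. `split_hashtags` and its `lowercase` parameter are hypothetical names, and normalizing case is an assumption about later analysis rather than something this notebook does.

```python
def split_hashtags(tags, lowercase=False):
    """Split a space-separated hashtag string into a list (hypothetical helper)."""
    if not isinstance(tags, str):
        # Tweets without hashtags arrive as float NaN, so pass them through as NaN
        return float("nan")
    parts = tags.split()
    # Optionally fold case so "Resist" and "resist" become the same tag
    return [p.lower() for p in parts] if lowercase else parts

print(split_hashtags("Resist FakePresident GOP"))
print(split_hashtags("Resist FakePresident GOP", lowercase=True))
print(split_hashtags(float("nan")))
```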


In [5]:
# Combine the number of favorites and retweets for a tweet into a total interactions score
total_interactions = []

for row in data.iterrows():
    tweet = row[1] 
    total = tweet["retweet_count"] + tweet["favorite_count"]
    total_interactions.append(total)

# Swap out the current RT and favorites columns for the total interactions columns
data["total_interactions"] = total_interactions
data = data.drop(columns=["retweet_count", "favorite_count"])
data

tweetID,text,tweeter,date_tweeted,hashatg_list,total_interactions
955376892026093569,@Pontifex Prayers to God the one &amp; only t...,Frank34802901,2018-01-22,,0
1025728615399469058,Red alert in #Spain and #Portugal as Europe ne...,Steven9Hugh,2018-08-04,"[Spain, Portugal, climatechange, globalwarming...",1
932915956824682496,Trump /GOP are the swamp #Resist #FakePresiden...,athoughtz,2017-11-21,"[Resist, FakePresident, Dontard, GOP, NRA, War...",0
1041547806622797824,Study: Green Buildings Save $6.7 Billion in #H...,IndiaGreenBldg,2018-09-17,"[Health, Climate, greenbuilding, sustainabilit...",4
932916623404617728,Dems get it done. #resist https://t.co/X2oXOEAtPb,berrymaiden,2017-11-21,[resist],0
...,...,...,...,...,...
1041756963212648448,None. We don't control the climate but pinning...,Zealandian,2018-09-17,[ClimateChangeScam],4
1025811954357678080,What if the reason Trump and Co aren't pushing...,TallismanRogue,2018-08-04,,2
1041756953226174465,INTERACTIVE: Photos and videos of flooding in ...,GullahGeechee,2018-09-17,,1
973650715066302464,Yes and no https://t.co/YRe0v2unAB,MichaelFieldNZ,2018-03-13,,0
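
Because both counts are numeric columns, the same total can be computed without a Python loop. This is a minimal sketch on a tiny hand-made DataFrame (the three rows and their tweet IDs are made-up values, not taken from the data set), using ordinary pandas column arithmetic.

```python
import pandas as pd

# Tiny illustrative frame; the IDs and counts are invented for this example
sample = pd.DataFrame(
    {"retweet_count": [0, 1, 2], "favorite_count": [1, 3, 2]},
    index=pd.Index([101, 102, 103], name="tweetID"),
)

# Adding two columns is element-wise and aligned on the index
sample["total_interactions"] = sample["retweet_count"] + sample["favorite_count"]

# Drop the two source columns, as in the cell that builds total_interactions
sample = sample.drop(columns=["retweet_count", "favorite_count"])
print(sample)
```

On the full data set this avoids iterating row by row, which tends to be noticeably slower in pandas than column-wise operations.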
