<h2>Appendix 2 - Trimming the Dataset</h2>

Of the 300,030 tweets originally scraped, some are duplicates and others were made outside the duration of the speech. This programme trims our corpus to the 274,520 tweets made during the speech.

<h3>Dropping Duplicates</h3>

Because of the way the max id string was set after each API call during scraping, the first tweet returned by each new call was a duplicate of one already collected. The code below removes these 8,557 duplicate entries.
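The mechanism described above can be illustrated with a toy simulation. This is only a sketch, not the scraper that was actually used: `fetch_page`, `scrape` and `ALL_IDS` are hypothetical names, and the sketch assumes the API treats the max id as an inclusive upper bound, so that reusing the last ID seen returns that tweet a second time.

```python
# Hypothetical stand-in for a paginated search endpoint: returns up to
# `count` tweet IDs, newest first, with ID <= max_id (inclusive bound).
ALL_IDS = list(range(100, 0, -1))  # toy IDs 100, 99, ..., 1


def fetch_page(max_id=None, count=10):
    ids = [i for i in ALL_IDS if max_id is None or i <= max_id]
    return ids[:count]


def scrape(reuse_last_id, pages=3):
    collected, max_id = [], None
    for _ in range(pages):
        page = fetch_page(max_id)
        if not page:
            break
        collected.extend(page)
        # Reusing the last ID unchanged fetches that tweet again on the
        # next call; subtracting 1 starts just below it instead.
        max_id = page[-1] if reuse_last_id else page[-1] - 1
    return collected


with_repeat = scrape(reuse_last_id=True)
without_repeat = scrape(reuse_last_id=False)

# One duplicate per call after the first, versus none.
print(len(with_repeat) - len(set(with_repeat)))        # 2
print(len(without_repeat) - len(set(without_repeat)))  # 0
```

Under these assumptions each call after the first contributes exactly one duplicate, which matches the pattern of evenly spaced duplicate rows seen in the output further down.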

In [1]:
import pandas as pd
import numpy as np

In [2]:
tweet_data = pd.read_excel("raw_sotu_tweet_data.xlsx")

In [3]:
tweet_data[tweet_data.duplicated()]

Unnamed: 0,date,description,favorite_count,followers_count,id_str,location,retweet_count,statuses_count,text,user,verified
30,Wed Jan 31 03:32:33 +0000 2018,,0,113,958543497316978688,,0,1754,@NancyPelosi No... No they're not. They're M...,HalJelikakik,False
74,Wed Jan 31 03:32:32 +0000 2018,Vote Trump! If you work or want to work with H...,0,22,958543494951219201,"Seal Beach, CA",0,684,Best #SOTU of my lifetime @realDonaldTrump,gobluejules,False
113,Wed Jan 31 03:32:32 +0000 2018,u cant learn anything until you learn how to c...,2,213,958543492107546625,BHFNico,1,14681,"if you guys ever wondered what puppets do, the...",LoadedAssNigga,False
140,Wed Jan 31 03:32:31 +0000 2018,"writer on race, religion and politics @RDispat...",0,3724,958543489792401409,"Raleigh-Durham, North Carolina",0,168062,Ch.... What is Jake Tapper talking about? He a...,theuppitynegro,False
169,Wed Jan 31 03:32:30 +0000 2018,,0,3,958543487401627649,"Austin, TX",1,181,#SOTU the Democrats are like the Cleveland Bro...,DarkLordLowe,False
197,Wed Jan 31 03:32:30 +0000 2018,Conservative,0,222,958543484109185026,,1,7664,Funniest show on earth: the Democrats at the #...,vrwtp,False
240,Wed Jan 31 03:32:29 +0000 2018,"Mets, Music, Media, and witty banter so hip yo...",0,179,958543481722621952,"Averill Park, NY",0,4313,A #SOTU of mostly shoutouts,GregNugget,False
271,Wed Jan 31 03:32:29 +0000 2018,U. R. Brainwashed #BoycottKoch #WakeUpAmerica...,1,1536,958543479445114881,Brainwashing Machine @ Fox,1,21683,Ok #DotardDonnie when your #SOTU is over it wi...,brainwashedur,False
310,Wed Jan 31 03:32:28 +0000 2018,"Young, dumb, and full of Sugar-Free Red Bull.",0,956,958543477465407494,,0,1144,Apparently this was the third longest #Sotu ev...,OhGarrett,False
347,Wed Jan 31 03:32:27 +0000 2018,"News 19, WLTX is On Your Side with Breaking Ne...",22,114081,958543475049418752,"Columbia, SC",3,312735,"President Trump: ""As long as we have confidenc...",WLTX,True


In [4]:
tweet_data = tweet_data.drop_duplicates()
tweet_data.index = pd.RangeIndex(len(tweet_data.index))  # Reset index
tweet_data

Unnamed: 0,date,description,favorite_count,followers_count,id_str,location,retweet_count,statuses_count,text,user,verified
0,Wed Jan 31 03:32:33 +0000 2018,"SVP, Government Affairs @LCVoters. Environment...",1,677,958543499841982465,,0,1008,Best thing about that divisive disgrace of a #...,T_Sittenfeld,False
1,Wed Jan 31 03:32:33 +0000 2018,"Curious, silly, and obsessive... MA, USA",0,23,958543499749621760,,0,156,All this talk of killings and not one single m...,RandomJulieStuf,False
2,Wed Jan 31 03:32:33 +0000 2018,Overseas Yankee patriot,0,9,958543499686760448,The Banana Republic of Miami,0,297,@realDonaldTrump made it through the speech wi...,kimtorahn,False
3,Wed Jan 31 03:32:33 +0000 2018,"Writer, Comedian, Comic Book Editor | Credits:...",5,8663,958543499565191168,NYC & PHL & CLT,1,70038,He sure finished that speech eventually. #SOTU,Brennanator,False
4,Wed Jan 31 03:32:33 +0000 2018,Human Rights•Gauche•Protestant Inclusif•Europh...,1,1056,958543499535806464,UE / EU,0,39720,Is anyone going to help @FLOTUS escape her cap...,nplm88,False
5,Wed Jan 31 03:32:33 +0000 2018,I may be a potato but everyone loves french fries,2,1177,958543499514798081,Crazy Town Banana Pants aka DC,0,78101,Thank god I didn’t buy any alcohol for this #S...,IceKareemy,False
6,Wed Jan 31 03:32:33 +0000 2018,Military Wife (RET)20 yrs of service USMC. Tha...,0,182,958543499485466624,"The Woodlands, TX",0,1666,Donnie killed it tonight 🇺🇸🇺🇸🇺🇸#MAGA #GodBless...,MRSVB12,False
7,Wed Jan 31 03:32:33 +0000 2018,,1,249,958543499359498240,,0,19301,@WhiteHouse @POTUS #SOTU Our grandfathers and ...,LadyJusticeGA,False
8,Wed Jan 31 03:32:33 +0000 2018,hello human. muon catcher & aero engineer turn...,0,5437,958543499258834944,"San Francisco, CA",0,78713,#SOTU is over - all you boycotters can come ba...,ChuckReynolds,False
9,Wed Jan 31 03:32:33 +0000 2018,"Let's talk about new ideas, and big things.",0,155,958543499183341568,PHX bred/Fan of the wild blue,0,865,I'm angry and I don't know where to put it. #SOTU,Emmalily,False
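`drop_duplicates()` with no arguments compares whole rows, so it only removes a re-scraped tweet if every field is identical in both copies. The sketch below, on a made-up toy frame, shows a possible alternative that compares tweet IDs only. It assumes `id_str` uniquely identifies a tweet; whether it would change the count of 8,557 for this corpus is not something the output above tells us.

```python
import pandas as pd

# Toy frame: the tweet with id_str "3" was scraped twice, and its
# favorite_count changed between the two (hypothetical) API calls.
toy = pd.DataFrame({
    "id_str": ["5", "4", "3", "3", "2"],
    "favorite_count": [0, 1, 2, 3, 0],
})

exact = toy.drop_duplicates()                 # compares whole rows
by_id = toy.drop_duplicates(subset="id_str")  # compares tweet IDs only
by_id = by_id.reset_index(drop=True)          # renumber rows from 0

print(len(exact), len(by_id))  # 5 4
```

By default the first occurrence is kept, so with newest-first data the earlier-scraped copy of each tweet survives.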


<h3>Trimming to the Speech's Duration</h3>

The earliest and latest of our originally scraped tweets were made at 02:05:04am and 03:32:33am UK time respectively. The speech itself ran from 02:10:14am to 03:30:46am (see below). As we wish to analyse only tweets made during the speech itself, we must remove tweets from outside this time range.

<h3>Determining the Start and End Times of the Speech</h3>

From the archived C-Span livestream below, we can see that the speech began at 02:10:14am UK time.

https://www.c-span.org/video/?439496-1/president-trump-delivers-state-union-address

The clock in the original broadcast turns to 9:10pm EST at the 00:05:09 mark of the archived video. The president begins his speech 14 seconds later, at the 00:05:23 mark. This puts the speech's start time at 09:10:14pm EST, or 02:10:14am UK time.

Our time-stamped, subtitled version of the speech shows that it lasted 1:20:32. Therefore, the speech concluded at 03:30:46am UK time.
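The arithmetic above can be checked with the standard library. This is a minimal sketch: the variable names are our own, and EST is modelled as a fixed UTC-5 offset, which holds in January when neither the US nor the UK observes daylight saving.

```python
from datetime import datetime, timedelta, timezone

# Fixed-offset EST (UTC-5); UK time in January is GMT, i.e. UTC.
EST = timezone(timedelta(hours=-5), "EST")

# Broadcast clock turns to 9:10pm EST at video mark 00:05:09.
clock_tick = datetime(2018, 1, 30, 21, 10, 0, tzinfo=EST)

# The speech begins at video mark 00:05:23, i.e. 14 seconds later.
video_offset = timedelta(minutes=5, seconds=23) - timedelta(minutes=5, seconds=9)

start = (clock_tick + video_offset).astimezone(timezone.utc)
end = start + timedelta(hours=1, minutes=20, seconds=32)  # speech length 1:20:32

print(start.strftime("%H:%M:%S"), end.strftime("%H:%M:%S"))  # 02:10:14 03:30:46
```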

In [5]:
np.where(tweet_data["date"] == "Wed Jan 31 02:10:14 +0000 2018")

(array([281104, 281105, 281106, 281107, 281108, 281109, 281110, 281111,
        281112, 281113, 281114, 281115, 281116, 281117, 281118, 281119,
        281120, 281121, 281122, 281123, 281124, 281125, 281126, 281127,
        281128, 281129, 281130, 281131, 281132, 281133, 281134, 281135,
        281136, 281137, 281138, 281139, 281140, 281141, 281142, 281143,
        281144, 281145, 281146, 281147, 281148, 281149, 281150, 281151,
        281152, 281153], dtype=int64),)

In [6]:
np.where(tweet_data["date"] == "Wed Jan 31 03:30:46 +0000 2018")

(array([6634, 6635, 6636, 6637, 6638, 6639, 6640, 6641, 6642, 6643, 6644,
        6645, 6646, 6647, 6648, 6649, 6650, 6651, 6652, 6653, 6654, 6655,
        6656, 6657, 6658, 6659, 6660, 6661, 6662, 6663, 6664, 6665, 6666,
        6667, 6668, 6669, 6670, 6671, 6672, 6673, 6674, 6675, 6676, 6677,
        6678, 6679], dtype=int64),)

In [7]:
tweet_data = tweet_data[6634:281154]  # Trim the corpus to the latest and earliest tweets in the time range of the speech
tweet_data.index = pd.RangeIndex(len(tweet_data.index))  # Reset index
tweet_data

Unnamed: 0,date,description,favorite_count,followers_count,id_str,location,retweet_count,statuses_count,text,user,verified
0,Wed Jan 31 03:30:46 +0000 2018,,0,163,958543051391229953,Freakin' Florida,0,8723,Retweeted ♻️ Christopher Zullo (@ChrisJZullo):...,Likipedia,False
1,Wed Jan 31 03:30:46 +0000 2018,Waiting for FDR,1,1419,958543051248603137,Brooklyn,1,20287,I wish someone would have told me that I could...,CoreyCW,False
2,Wed Jan 31 03:30:46 +0000 2018,ENFP; tweets are my own,1,65,958543051080781824,Boston,0,373,Surprise cameo: Science found its way into the...,craigregis,False
3,Wed Jan 31 03:30:46 +0000 2018,It takes more than embarrassment to shame me t...,2,1370,958543050967568386,,0,35380,That's odd. He said he likes music. But I dist...,kimwithak,False
4,Wed Jan 31 03:30:46 +0000 2018,4 1/2 billion years of evolution produced our ...,0,346,958543050510405632,Texas,0,10621,He doesn't have any standing to comment on fam...,CraneStation,False
5,Wed Jan 31 03:30:46 +0000 2018,Jeremiah 18:1-6 & Revelation 12:11,3,231,958543050472640512,"Fargo, ND",0,2210,"If you didn't watch the State of the Union, yo...",jeremykopp,False
6,Wed Jan 31 03:30:46 +0000 2018,Wife to Jay. Mommy to Zachary Louis. Kitty mam...,1,1814,958543050468483074,Raleigh NC but a NYer at heart,0,26747,Remember that time that Stephen Miller said th...,Colleen84,False
7,Wed Jan 31 03:30:46 +0000 2018,Education is the most important thing in stopp...,0,57,958543050468360192,"Knightstown, IN",0,2990,Ok time to pull the plug (about 30 minut s ago...,wabashcc,False
8,Wed Jan 31 03:30:46 +0000 2018,"Baltimore native. Nerd for pop, politics, snar...",0,1318,958543050338439168,DC,0,10412,#DoD has done far better than domestic program...,SaraLDuBois,False
9,Wed Jan 31 03:30:46 +0000 2018,"https://t.co/FNmWSgyYlM Senior Columnist, husb...",1,9396,958543050300710913,"Merrick, NY",0,17983,Here's my take on the #SOTU https://t.co/QzJt8...,jakejakeny,False
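The positional slice used above relies on the rows being sorted by time and on at least one tweet existing at each boundary second. A possible alternative, sketched here on a made-up five-row frame, parses the `date` strings and filters on the timestamps directly. The format string is an assumption based on the `date` values shown in the output, and the names `toy`, `in_speech` and `trimmed` are illustrative only.

```python
import pandas as pd

# Toy frame, newest first, with one tweet before and one after the speech.
toy = pd.DataFrame({"date": [
    "Wed Jan 31 03:32:33 +0000 2018",
    "Wed Jan 31 03:30:46 +0000 2018",
    "Wed Jan 31 02:45:00 +0000 2018",
    "Wed Jan 31 02:10:14 +0000 2018",
    "Wed Jan 31 02:05:04 +0000 2018",
]})

# Parse the Twitter-style date strings into timezone-aware timestamps.
timestamps = pd.to_datetime(
    toy["date"], format="%a %b %d %H:%M:%S %z %Y", utc=True
)

start = pd.Timestamp("2018-01-31 02:10:14", tz="UTC")
end = pd.Timestamp("2018-01-31 03:30:46", tz="UTC")

# Keep tweets made from the first to the last second of the speech, inclusive.
in_speech = (timestamps >= start) & (timestamps <= end)
trimmed = toy[in_speech].reset_index(drop=True)

print(len(trimmed))  # 3
```

Both approaches keep tweets stamped at the boundary seconds themselves, so on sorted data they should select the same rows.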


In [8]:
with pd.ExcelWriter('sotu_final_corpus.xlsx') as writer:  # Context manager saves and closes the file
    tweet_data.to_excel(writer, sheet_name='Sheet1')