<h2>Appendix 13 - Location Inference, Verifying DC Outlier</h2>

This notebook selects a random sample of tweets that our location algorithm assigned to users based in Washington DC, so that the location inference method can be checked manually. 601 tweets are sampled, representing 5% of all tweets assigned to DC.

Manual reading of the 601 location fields found 55 ambiguous assignments and 1 false assignment. This represents a 90.7% success rate for the tested sample if we reject ambiguous locations, or 99.8% if we reject only strictly false state assignments.

<h3>Ambiguous Location Fields</h3>

The 55 ambiguous location fields can be seen below. These all contain multiple states or cities, including DC, and in some cases include a location outside the United States.

Two factors are at play here. First, DC is a commuter city in which many people from surrounding and north-eastern states work. Second, where users have listed the abbreviations for multiple states, the algorithm picks DC when DC is the first state alphabetically. States such as DC that come early in the alphabet are therefore likely to be assigned a disproportionate number of ambiguous location fields.
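
The alphabetical effect can be illustrated with a minimal sketch. This is not the project's actual location algorithm: the function `pick_state_alphabetically` and the small `STATE_CODES` subset are hypothetical, written only to show how an alphabetical tie-break sends fields that list several states to DC. The example strings are taken from the ambiguous fields listed in this appendix.

```python
# Hypothetical illustration only: NOT the project's actual location algorithm.
# STATE_CODES is a small assumed subset of state abbreviations.
STATE_CODES = {"DC", "GA", "LA", "NY", "PA", "VA"}


def pick_state_alphabetically(location):
    """Return the alphabetically first state code in the field, or None."""
    tokens = location.upper().split()
    matches = sorted(token for token in tokens if token in STATE_CODES)
    return matches[0] if matches else None


examples = [
    "WASHINGTON DC ARLINGTON VA",
    "GREENSBORO GA WASHINGTON DC",
    "LA DC SF BAY",
]
assigned = [pick_state_alphabetically(loc) for loc in examples]
print(assigned)
```

Under this assumed tie-break, every example resolves to DC because "DC" sorts before "GA", "LA" and "VA", regardless of where it appears in the field.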

<h3>Incorrectly Assigned States</h3>

The single incorrectly assigned location field can be seen below. It reads "ANTI DC", which the algorithm has assigned to DC.

<h3>Bearing on DC Tweets-To-Population Outlier</h3>
9.2% of the sampled tweets assigned to DC were in fact ambiguous, containing multiple locations including DC. Earlier testing showed about 0.9% of all tweets were ambiguously assigned. Taking both samples to be representative, the ambiguity rate for tweets assigned to DC is 8.3 percentage points higher than for a typical tweet, or roughly ten times the overall rate. However, DC's tweet-to-population ratio is roughly 50 times higher than the average for other states. Even if every ambiguous tweet were removed from DC's count, the count would fall by only about 9%, leaving the ratio at roughly 45 times the average. The difference is therefore too great to be explained by ambiguous location fields assigned to DC, which suggests that DC is a true outlier in tweets-to-population ratio.
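
The arithmetic behind this comparison can be checked with a short sketch. It uses only the figures quoted in this appendix (55 of 601 sampled tweets ambiguous, about 0.9% ambiguous overall, a ratio about 50 times the average) and makes the worst-case assumption that every ambiguous tweet should be removed from DC's count. The variable names are illustrative.

```python
# Figures quoted in this appendix.
dc_ambiguous_rate = 55 / 601    # share of sampled DC tweets that were ambiguous
overall_ambiguous_rate = 0.009  # about 0.9% of all tweets, from earlier testing
dc_ratio_multiple = 50          # DC's tweets-to-population ratio vs. the average

# Gap between the two rates, in percentage points.
point_gap = (dc_ambiguous_rate - overall_ambiguous_rate) * 100

# How many times more often a DC-assigned tweet is ambiguous.
relative_rate = dc_ambiguous_rate / overall_ambiguous_rate

# Worst-case assumption: drop every ambiguous tweet from DC's count.
adjusted_multiple = dc_ratio_multiple * (1 - dc_ambiguous_rate)

print(round(point_gap, 1), round(relative_rate, 1), round(adjusted_multiple, 1))
```

Even under the worst-case assumption, the adjusted multiple stays far above 1, so ambiguous assignments alone cannot account for the outlier.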

In [1]:
import pandas as pd
import numpy as np

In [2]:
state_data = pd.read_excel("03sotu_with_states.xlsx")

In [3]:
# Select only tweets assigned to DC, permute index
dc = state_data[state_data.state == "DC"].reset_index(drop=True)
new_order = np.random.permutation(len(dc))
dc = dc.take(new_order)
dc

Unnamed: 0,date,description,favorite_count,followers_count,id_str,location,retweet_count,statuses_count,text,user,verified,description_hashtags,description_mentions,text_emojis,text_hashtags,text_hashtags_split,text_mentions,state
123,Wed Jan 31 03:29:53 +0000 2018,NERD GAMER SPORTS NUT GODLESS HEATHEN IDEALIST...,0,588,958542828321288192,WASHINGTON DC,0,55045,AND THE MONUMENTS WHICH YOU HAD TORN DOWN BECA...,MATTJSOUR,False,TEAMALLPIES,,,SOTU,SOTU,,DC
11340,Wed Jan 31 02:16:08 +0000 2018,INDEPENDENT JOURNALIST PRODUCER FILMMAKER AND ...,2,2809,958524265615241216,WASHINGTON DC,1,2959,SO HAPPY TO SEE AT,REMSO101,False,LIBERTYRISING,,,SOTU,SOTU,STEVESCALISE,DC
1615,Wed Jan 31 03:15:54 +0000 2018,SOUTHERN STATES PRESS SECRETARY Y ALL ENTHUSIA...,11,2065,958539308578889728,WASHINGTON DC,0,26076,COULD VE SAVED US ALL 90 MINUTES BY JUST SAYIN...,NRMORROW,False,,HRC SKDKNICK GWTWEETS UTKNOXVILLE,,SOTU,SOTU,,DC
580,Wed Jan 31 03:25:27 +0000 2018,INFORMS CONNECTS AND ENGAGES AUDIENCE ON FATA ...,6,37849,958541711218806784,WASHINGTON DC,5,49653,MOST AFRICAN AMERICAN MEMBERS OF CONGRESS IN T...,VOADEEWA,True,,,,,,,DC
7806,Wed Jan 31 02:36:39 +0000 2018,HOME OF THE TRUTH O METER AND INDEPENDENT FACT...,182,623269,958529430095310848,WASHINGTON DC,214,29867,TRUMP CLAIMS WE HAVE ELIMINATED MORE REGULATIO...,POLITIFACT,True,,,,,,,DC
10578,Wed Jan 31 02:20:54 +0000 2018,TWITTER SUPPORT WON T RETURN MY TEXTS FORMERLY...,4,45,958525467845451776,WASHINGTON DC,0,1117,AFRICAN AMERICAN EMPLOYMENT HAS REACHED AN ALL...,TWIITERLESSJESS,False,,,😂😂😂,SOTU,SOTU,,DC
8325,Wed Jan 31 02:33:44 +0000 2018,HOPIN NOT TO JOIN THE 27 CLUB,0,180,958528695043489792,DC EVERYWHERE YOU WANT TO BE,0,23674,BUT ALSO,CAPRIKOSAURUS,False,,,,SITDURINGTHEANTHEMALL2018 SOTU STFU,SIT DURING THE ANTHEM ALL2018 SOTU STFU,,DC
8877,Wed Jan 31 02:30:37 +0000 2018,LIFELONG DEMOCRAT ANTI TRUMP PROUD MEMBER OF T...,1,83,958527911744729088,WASHINGTON DC,0,6293,WARNED NOT TO YELL OUT OR WALK OUT OF NEED…,COMMONTOADNMD,False,DEPLORABLES,,,SOTU DEMOCRATS STATEOFTHEUNION CONGRESSIONALDE...,SOTU DEMOCRATS STATE OF THE UNION CONGRESSI...,NANCYPELOSI,DC
8056,Wed Jan 31 02:35:11 +0000 2018,MUSIC IS MY 1ST LOVE STATION NETWORK MANAGER O...,0,1633,958529063450284032,BALTIMORE DC,0,18068,THEY HAVE A RIGHT TO STAND OR NOT TOO STAND PE...,DJJONNYBLAZE,False,FLEETDJS,,,SOTU,SOTU,,DC
3597,Wed Jan 31 03:00:42 +0000 2018,JOURNO FIRST AMENDMENT ADVOCATE LIBERTARIAN ON...,0,598,958535482882908160,WASHINGTON DC,0,8175,LIVE SHOT OF DEMS RIGHT NOW,ZACHARYGORELICK,False,,FOOTNOTEFILM,,SOTU,SOTU,,DC


In [4]:
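# Write out the first 5% of the shuffled rows to be manually checked.
# DataFrame.to_excel is used directly because ExcelWriter.save() is not
# available in recent pandas releases.
sample_size = len(dc) // 20
dc[:sample_size].to_excel('dc_sample.xlsx', sheet_name='Sheet1')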
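
The permute-and-slice approach above draws a different sample on every run. As a hedged alternative, the sketch below shows how `DataFrame.sample` with a fixed `random_state` would make a 5% sample reproducible. The toy frame `dc_toy` and the seed value 42 are assumptions for illustration and are not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the DC subset; the real notebook loads it from Excel.
dc_toy = pd.DataFrame({"location": ["WASHINGTON DC"] * 100, "row": range(100)})

# frac=0.05 draws 5% of the rows without replacement.
# random_state fixes the draw; the seed value 42 is arbitrary.
sample_a = dc_toy.sample(frac=0.05, random_state=42)
sample_b = dc_toy.sample(frac=0.05, random_state=42)

print(len(sample_a))
print(sample_a.index.equals(sample_b.index))
```

Recording the seed alongside the manually checked file would let the same rows be regenerated later.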

In [5]:
# Read in manually checked rows
dc_checked = pd.read_excel("dc_checked.xlsx")

In [6]:
# Display ambiguous locations that were assigned to DC
ambiguous = dc_checked[dc_checked.status == "?"]
ambiguous

Unnamed: 0,location,state,status
16,FLORIDA WASHINGTON DC,DC,?
17,WASHINGTON DC ARLINGTON VA,DC,?
18,DC ALSO PHILLY \ VEGAS,DC,?
22,PHILLY AND DC NYC,DC,?
32,DC BMORE PHILLY AND BEYOND,DC,?
47,GREENSBORO GA WASHINGTON DC,DC,?
54,PHILLY  ATL DC,DC,?
61,FLORIDA WASHINGTON DC,DC,?
68,WASHINGTON DC SAN SALVADOR,DC,?
76,LA DC SF BAY,DC,?


In [7]:
# Display location fields that were incorrectly assigned to DC
false_assignment = dc_checked[dc_checked.status == "f"]
false_assignment

Unnamed: 0,location,state,status
176,ANTI DC,DC,f


In [8]:
ambiguous_percent = len(ambiguous)/len(dc_checked)*100
false_percent = len(false_assignment)/len(dc_checked)*100

In [9]:
ambiguous_percent

9.151414309484194

In [10]:
false_percent

0.16638935108153077

In [11]:
100 - (ambiguous_percent+false_percent)

90.68219633943427

In [12]:
100 - (false_percent)

99.83361064891847

In [13]:
len(ambiguous)

55

In [14]:
len(false_assignment)

1