<h2>Appendix 9 - Location Inference Results Testing</h2>

This notebook selects a random sample of tweets so that our location inference method can be checked manually. 1800 tweets are sampled, which represents over 1% of the tweets to which we have assigned a location.
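As a rough illustration of the sampling step, the sketch below draws a fixed-size random sample with `DataFrame.sample`. It is not the code used in this appendix (the cells further down permute the index with NumPy instead); the helper name `draw_sample`, the toy data, and the seed value are all assumptions made for the example.

```python
import pandas as pd

def draw_sample(df, n, seed=0):
    """Drop rows with missing values, then draw n rows at random.

    Hypothetical helper: fixing `seed` makes the draw repeatable.
    """
    return df.dropna().sample(n=n, random_state=seed)

# Toy stand-in for the real spreadsheet of located tweets (assumed columns).
toy_data = pd.DataFrame({
    "location": ["BOSTON MA", "AUSTIN TX", None, "MIAMI FL", "RENO NV"],
    "state": ["MA", "TX", None, "FL", "NV"],
})

toy_sample = draw_sample(toy_data, n=3)
print(toy_sample)
```

Fixing the seed means the same rows are selected on every run, which makes a manually checked sample easier to reproduce later.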

Manual reading of the 1800 sampled locations found 31 location fields that were either assigned an incorrect state (15 fields) or that included multiple states, making assignment ambiguous (16 fields). This represents a 98.3% success rate for the tested sample if ambiguous locations are counted as failures, or 99.2% if only strictly false state assignments are counted as failures.
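The two success rates follow directly from the counts reported in this appendix (1800 sampled, 16 ambiguous, 15 incorrect). A minimal sketch of the arithmetic, with variable names chosen only for this example:

```python
sample_size = 1800
ambiguous = 16   # fields listing more than one plausible state
incorrect = 15   # fields assigned a state that is simply wrong

# Ambiguous fields counted as failures.
strict_rate = (sample_size - ambiguous - incorrect) / sample_size

# Only strictly false assignments counted as failures.
lenient_rate = (sample_size - incorrect) / sample_size

print(f"{strict_rate:.1%}")   # 98.3%
print(f"{lenient_rate:.1%}")  # 99.2%
```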

<h3>Ambiguous Location Fields</h3>

The 16 ambiguous location fields can be seen below. Generally these consist of a list of states or cities. In some cases a location outside of the United States is included with a location within the United States.

The location inference programme searches for matches in the following order: state acronym, full state name, full city name, city abbreviation, term for the United States. In the state categories, it searches for matches in alphabetical order. In the city categories, it searches for matches in order of city population, highest to lowest. In location fields with multiple valid United States locations, the first matching location found by the programme has been assigned as the state.
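The search order described above can be sketched as follows. This is an illustration only, not the programme itself: the lookup tables are hypothetical and heavily truncated, and the matching rules (acronyms and city abbreviations matched as whole words, full names matched as substrings) are assumptions inferred from the sampled results rather than confirmed behaviour.

```python
# Hypothetical, truncated lookup tables (assumptions for illustration).
STATE_ACRONYMS = ["AK", "CA", "CT", "DC", "DE", "MA", "MI",
                  "NY", "RI", "SC", "TX", "VA", "WI"]           # alphabetical
STATE_NAMES = [("CALIFORNIA", "CA"), ("KANSAS", "KS"),
               ("NEW YORK", "NY")]                               # alphabetical
CITY_NAMES = [("NEW YORK", "NY"), ("LOS ANGELES", "CA"),
              ("BOSTON", "MA"), ("DETROIT", "MI"),
              ("COLUMBIA", "SC")]                                # by population
CITY_ABBREVIATIONS = [("NYC", "NY"), ("ATL", "GA")]
US_TERMS = ["UNITED STATES", "USA"]

def infer_state(location):
    """Return the first match found, searching categories in a fixed order."""
    text = location.upper()
    tokens = text.split()
    for acronym in STATE_ACRONYMS:            # 1. state acronym (whole word)
        if acronym in tokens:
            return acronym
    for name, state in STATE_NAMES:           # 2. full state name (substring)
        if name in text:
            return state
    for city, state in CITY_NAMES:            # 3. full city name (substring)
        if city in text:
            return state
    for abbreviation, state in CITY_ABBREVIATIONS:  # 4. city abbreviation
        if abbreviation in tokens:
            return state
    for term in US_TERMS:                     # 5. term for the United States
        if term in text:
            return "USA"
    return None

print(infer_state("TX WI USA CA VA"))            # CA
print(infer_state("VICTORIA BRITISH COLUMBIA"))  # SC
```

Because the first match wins, a field listing several states is always resolved to the alphabetically earliest acronym, and a field with no state at all can still be captured by a later, looser category such as a city name.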

<h3>Incorrectly Assigned States</h3>

The 15 incorrectly assigned location fields can be seen below. These fall into the following types: locations including the Canadian province "British Columbia", which the programme has recognised as Columbia, South Carolina; location fields that state a previous location; and non-English-language location fields with two-letter words falsely matched to state acronyms.

In [29]:
import pandas as pd
import numpy as np

In [30]:
state_data = pd.read_excel("state_data.xlsx")

In [31]:
# Drop null values, permute index order
state_data_clean = state_data.dropna()
new_order = np.random.permutation(len(state_data_clean))
state_sample = state_data_clean.take(new_order)

In [33]:
# Write out 1800 row sample for manual checking
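# The context manager saves and closes the file on exit
# (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('state_sample.xlsx') as writer:
    state_sample[:1800].to_excel(writer, sheet_name='Sheet1')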

In [37]:
# Read in manually checked sample
state_sample_checked = pd.read_excel("state_sample_checked.xlsx")
checked_clean = state_sample_checked.dropna()

In [41]:
# Display ambiguous location fields that have been assigned a single location
print(len(checked_clean[checked_clean["status"] == "?"]))
checked_clean[checked_clean["status"] == "?"]

16


Unnamed: 0,date,location,state,status
278859,Wed Jan 31 02:11:01 +0000 2018,DRC ATL DC,DC,?
148689,Wed Jan 31 02:44:47 +0000 2018,TX WI USA CA VA,CA,?
107905,Wed Jan 31 02:55:33 +0000 2018,BOSTON AND HARLEM,MA,?
56770,Wed Jan 31 03:10:58 +0000 2018,NEW YORK CITYMIAMI,NY,?
197760,Wed Jan 31 02:33:13 +0000 2018,PROVIDENCE RI AND DETROIT MI,MI,?
49940,Wed Jan 31 03:13:49 +0000 2018,WASHINGTON DCBROOKLYN NY,NY,?
124861,Wed Jan 31 02:51:01 +0000 2018,PDX LA NYC,CA,?
134875,Wed Jan 31 02:48:14 +0000 2018,ONLY COASTAL CA TO MA,CA,?
75388,Wed Jan 31 03:04:51 +0000 2018,KANSASON A PLANE,KS,?
147180,Wed Jan 31 02:45:11 +0000 2018,WASHINGTON DC AND HAMPTON VA,DC,?


In [42]:
# Display location fields that have been assigned an incorrect location
print(len(checked_clean[checked_clean["status"] == "f"]))
checked_clean[checked_clean["status"] == "f"]

15


Unnamed: 0,date,location,state,status
129651,Wed Jan 31 02:49:31 +0000 2018,ORIGINALLY FROM BROOKLYN NY,NY,f
122552,Wed Jan 31 02:51:40 +0000 2018,VICTORIA BRITISH COLUMBIA,SC,f
247458,Wed Jan 31 02:20:38 +0000 2018,USUALLY CT SOMETIMES AK,AK,f
40628,Wed Jan 31 03:17:22 +0000 2018,GAUSA CSA INC,USA,f
20897,Wed Jan 31 03:25:28 +0000 2018,VANCOUVER BRITISH COLUMBIA,SC,f
19283,Wed Jan 31 03:26:09 +0000 2018,PUEBLO DE LOS ÁNGELES,DE,f
263146,Wed Jan 31 02:16:28 +0000 2018,DETROIT BORN SEATTLE HOME,MI,f
1473,Wed Jan 31 03:32:09 +0000 2018,TRANSPLANT FROM CA,CA,f
196339,Wed Jan 31 02:33:32 +0000 2018,DETROIT BORN SEATTLE HOME,MI,f
269852,Wed Jan 31 02:14:03 +0000 2018,ORIGINALLY FROM BROOKLYN NY,NY,f
