# LOCATION EXTRACTION 

![Picture1.png](Picture1.png)

## Table of contents

- [Loading packages into the notebook](#Loading-required-packages)
- [Loading and exploring data](#Loading-and-exploring-data)
- [Data preprocessing and cleaning](#Data-cleaning-and-Preprocessing)
- [spaCy: Loading and exploring](#spaCy:-Loading-and-exploring)
- [Extracting location entities](#Extracting-location-entities)
- [Cleaning and combining locations](#Cleaning-and-combining-locations)
- [Geocoding with Nominatim](#Geocoding-with-Nominatim)
- [Visualising data](#Visualising-data)
- [Distance Calculations](#Distance-calculation)
- [Result visualisation](#Result-visualisation)
- [Assignment](#Assignment)
- [Improving results using keywords](#Grammatical-Filtering)

## Loading required packages 

In [1]:
import pandas as pd
import numpy as np
import sys, os

pd.set_option('display.max_colwidth', None)

[Return to Table of Contents](#Table-of-contents)

## Loading and exploring data

In [2]:
# Loading data 
df = pd.read_csv('../tweets/california_tweets.csv')

# shape of the data (rows, columns)
df.shape

(2000, 6)

In [3]:
# Visualising the first (number) rows within the dataset
df.head(5)

Unnamed: 0.1,Unnamed: 0,text,place,src_lang,long,lat
0,0,"I'm at My Home Gym in Pacifica, CA https://t.co/fWgIms86T8",Pacifica; CA,en,-122.500464,37.59365
1,1,"_styledbym.e killed it with this #shadowroot #colormelt on ahlthaaat! Color melting, Balayage,… https://t.co/NUhTgcCim0",,en,-121.989751,38.35584
2,2,Primigi Classic loafers for your boy or girl. Normally $74-$89 We have many sizes and colors to… https://t.co/ntc7OX5zyz,"East Oakdale, CA",en,-120.82996,37.77433
3,3,Warriors single game tickets go on sale at 10 a.m. https://t.co/1Ojk8wtdcg,San Jose; CA,en,-121.891766,37.332484
4,4,"I'm at Hardly Strictly Bluegrass in San Francisco, CA https://t.co/u61jaz0kHZ",San Francisco; CA,en,-122.489542,37.771727


***

Our data is made up of 2000 rows and 5 columns. The text column contains tweet posts sent from different users. The data has been filtered to return only tweets with longitude and latitude values, which will be used later on to verify the accuracy of the location extraction.

***

[Return to Table of Contents](#Table-of-contents)

## Data cleaning and Preprocessing 

Given the noise in the tweet texts, we clean up our tweets before applying NLP. For this exercise we:

- Return only English tweets 
- Remove special characters
- Replace @ with at 
- Remove resulting empty cells 

In [4]:
# Only english tweets 
df['text_en'] = df.text
is_english = df.src_lang == 'en'
df.loc[is_english, 'text_en'] = df.loc[is_english, 'text']
df = df.loc[is_english]

In [5]:
# Remove special characters
def preprocess_tweets(tweets, remove_tokens = ('\n', '\r', '\t', 'RT', r'[^\x00-\x7f]'),
                meta_information_indicators = ('https:', 'http:', 'www.', '//t.co'),
                allowed_punctuation = (',', '.', '.', '!', '?', ' ', ':', '-', ';','@')):
    def keep_token(token):
        return token not in remove_tokens and\
        not any(token.startswith(meta_token) for meta_token in meta_information_indicators)
    
    clean_tweets = tweets.apply(lambda x: ' '.join(filter(keep_token, x.split(' '))))
    
    keep_char = lambda t: t.isalnum() or t in allowed_punctuation
    return clean_tweets.apply(lambda x: ''.join(filter(keep_char, list(x))))
    
clean_tweets = preprocess_tweets(df.text_en)
df['clean_text'] = clean_tweets

In [6]:
# use np.nan for all missing values
df = df.replace('-', np.nan).fillna(np.nan)

# remove empty columns
df = df.dropna(how='all', axis='columns')

# remove rows without text
df = df.dropna(subset=['text'])

#Replace @ with at for spaCy syntax 
df.clean_text = df.clean_text.str.replace("@", "at ")

In [7]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,text,place,src_lang,long,lat,text_en,clean_text
0,0,"I'm at My Home Gym in Pacifica, CA https://t.co/fWgIms86T8",Pacifica; CA,en,-122.500464,37.59365,"I'm at My Home Gym in Pacifica, CA https://t.co/fWgIms86T8","Im at My Home Gym in Pacifica, CA"
1,1,"_styledbym.e killed it with this #shadowroot #colormelt on ahlthaaat! Color melting, Balayage,… https://t.co/NUhTgcCim0",,en,-121.989751,38.35584,"_styledbym.e killed it with this #shadowroot #colormelt on ahlthaaat! Color melting, Balayage,… https://t.co/NUhTgcCim0","styledbym.e killed it with this shadowroot colormelt on ahlthaaat! Color melting, Balayage,"
2,2,Primigi Classic loafers for your boy or girl. Normally $74-$89 We have many sizes and colors to… https://t.co/ntc7OX5zyz,"East Oakdale, CA",en,-120.82996,37.77433,Primigi Classic loafers for your boy or girl. Normally $74-$89 We have many sizes and colors to… https://t.co/ntc7OX5zyz,Primigi Classic loafers for your boy or girl. Normally 74-89 We have many sizes and colors to


In [8]:
# Dropping some columns to reduce data size to necessary columns 

# Make sure to only run this cell once as a duplicate returns error of cells not found
df = df.drop(['Unnamed: 0', 'src_lang', 'text_en'], axis = 1 )
df.head(3)

Unnamed: 0,text,place,long,lat,clean_text
0,"I'm at My Home Gym in Pacifica, CA https://t.co/fWgIms86T8",Pacifica; CA,-122.500464,37.59365,"Im at My Home Gym in Pacifica, CA"
1,"_styledbym.e killed it with this #shadowroot #colormelt on ahlthaaat! Color melting, Balayage,… https://t.co/NUhTgcCim0",,-121.989751,38.35584,"styledbym.e killed it with this shadowroot colormelt on ahlthaaat! Color melting, Balayage,"
2,Primigi Classic loafers for your boy or girl. Normally $74-$89 We have many sizes and colors to… https://t.co/ntc7OX5zyz,"East Oakdale, CA",-120.82996,37.77433,Primigi Classic loafers for your boy or girl. Normally 74-89 We have many sizes and colors to


In [9]:
# Use .loc to define a slice of rows you would like to view from you data

#df1.loc[100:162]

**Exercise**

> What other preprocessing routines could be done or what preprocessing steps could be left out

[Return to Table of Contents](#Table-of-contents)

## spaCy: Loading and exploring

[**spaCy**](https://spacy.io/) is a NLP tool developed by Explosion to extract entities in text. Unlike most NLP packages that rely on a Gazetteer to extract locations, spaCy uses word embedding to determine the entity class of a word within the sentence syntax. The advantage of this approach is that it is able to return locations even with spelling errors. Disadvantage is possible false positives due to different sentence syntaxes.

In [10]:
# Loading spaCy packages 
import spacy
from spacy import displacy # Displacy is used to visualise spaCy tokens

In [11]:
# Loading the spaCy model 
nlp =spacy.load('en_core_web_trf') # model trf higher accuracy, bigger model, slower in exercution

****
Before extracting locations within the dataset, we can first play around with self-made example sentences to explore how spaCy works. You can rewrite your own sentences and explore what kind of results spaCy would return.

****

In [34]:
# Example sentences 

doc = nlp("""Minnesota, is the State of my eye.""")

In [35]:
# Visualising spaCy entities
displacy.render(doc, style = "ent")

In [14]:
# spacy.explain is used to define the entities returned by spaCy
spacy.explain("FAC")

'Buildings, airports, highways, bridges, etc.'

**Exercise**

>Experiment with spaCy writing different sentence structures. Are they any instances where spaCy wrongly detects or omits an entity.

[Return to Table of Contents](#Table-of-contents)

## Extracting location entities 

In [15]:
# Function to get location information 
def filter_location_entities(entities):
    locations = []
    for entity in entities:
        if entity.label_ == 'GPE':
                locations.append(entity)
                
    return locations

def filter_location_entities1(entities):
    locations1 = []
    for entity in entities:
        if entity.label_ == 'FAC':
                locations1.append(entity)
                      
    return locations1

def filter_location_entities2(entities):
    locations2 = []
    for entity in entities:
        if entity.label_ == 'ORG':
                locations2.append(entity)
                      
    return locations2

def filter_location_entities3(entities):
    locations3 = []
    for entity in entities:
        if entity.label_ == 'LOC':
                locations3.append(entity)
                      
    return locations3

In [16]:
#creates a new column, ner_text, with entities extracted from the 'text' column
df['GPE'] = df['clean_text'].astype(str).apply(lambda x: filter_location_entities(nlp(x).ents))
df['FAC'] = df['clean_text'].astype(str).apply(lambda x: filter_location_entities1(nlp(x).ents))
df['ORG'] = df['clean_text'].astype(str).apply(lambda x: filter_location_entities2(nlp(x).ents))
df['LOC'] = df['clean_text'].astype(str).apply(lambda x: filter_location_entities3(nlp(x).ents))
df

KeyboardInterrupt: 

****
Time taken in extracting location entities from a data frame increases with the number of rows present in the data frame. It is often advisable to save the file locally within your PC in case the notebook fails, you will not have to rerun the extraction process again.
****

In [None]:
#Saving dataframe with extracted location entities 
outfilename = ('df_location_entities1.csv')
df.to_csv(outfilename)

[Return to Table of Contents](#Table-of-contents)

## Cleaning and combining locations 

In [42]:
# Loading data with location entities extracted
df = pd.read_csv('../tweets/df_location_entities1.csv')
df.head(3)

Unnamed: 0.1,Unnamed: 0,text,place,long,lat,clean_text,GPE,FAC,ORG,LOC
0,0,"I'm at My Home Gym in Pacifica, CA https://t.co/fWgIms86T8",Pacifica; CA,-122.500464,37.59365,"Im at My Home Gym in Pacifica, CA","[Pacifica, CA]",[],[],[]
1,1,"_styledbym.e killed it with this #shadowroot #colormelt on ahlthaaat! Color melting, Balayage,… https://t.co/NUhTgcCim0",,-121.989751,38.35584,"styledbym.e killed it with this shadowroot colormelt on ahlthaaat! Color melting, Balayage,",[],[],[],[]
2,2,Primigi Classic loafers for your boy or girl. Normally $74-$89 We have many sizes and colors to… https://t.co/ntc7OX5zyz,"East Oakdale, CA",-120.82996,37.77433,Primigi Classic loafers for your boy or girl. Normally 74-89 We have many sizes and colors to,[],[],[Primigi],[]


In [43]:
df.shape

(1902, 10)

In [44]:
# Data cleaning: Remove square brakets from location entities
df['GPE'] =  df['GPE'].apply(lambda x: x.replace('[','').replace(']',''))
df['FAC'] =  df['FAC'].apply(lambda x: x.replace('[','').replace(']',''))
df['ORG'] =  df['ORG'].apply(lambda x: x.replace('[','').replace(']',''))
df['LOC'] =  df['LOC'].apply(lambda x: x.replace('[','').replace(']',''))

df.head(4)

Unnamed: 0.1,Unnamed: 0,text,place,long,lat,clean_text,GPE,FAC,ORG,LOC
0,0,"I'm at My Home Gym in Pacifica, CA https://t.co/fWgIms86T8",Pacifica; CA,-122.500464,37.59365,"Im at My Home Gym in Pacifica, CA","Pacifica, CA",,,
1,1,"_styledbym.e killed it with this #shadowroot #colormelt on ahlthaaat! Color melting, Balayage,… https://t.co/NUhTgcCim0",,-121.989751,38.35584,"styledbym.e killed it with this shadowroot colormelt on ahlthaaat! Color melting, Balayage,",,,,
2,2,Primigi Classic loafers for your boy or girl. Normally $74-$89 We have many sizes and colors to… https://t.co/ntc7OX5zyz,"East Oakdale, CA",-120.82996,37.77433,Primigi Classic loafers for your boy or girl. Normally 74-89 We have many sizes and colors to,,,Primigi,
3,3,Warriors single game tickets go on sale at 10 a.m. https://t.co/1Ojk8wtdcg,San Jose; CA,-121.891766,37.332484,Warriors single game tickets go on sale at 10 a.m.,,,,


In [45]:
#Dropping rows without location entity extracted
# In this case we drop rows where neither of the four location entities have location extracted.

index_names = df[(df['GPE']== '') & (df['FAC']== '') & (df['ORG']== '') & (df['LOC']== '')].index
df.drop(index_names, inplace = True)
df.head(4)

Unnamed: 0.1,Unnamed: 0,text,place,long,lat,clean_text,GPE,FAC,ORG,LOC
0,0,"I'm at My Home Gym in Pacifica, CA https://t.co/fWgIms86T8",Pacifica; CA,-122.500464,37.59365,"Im at My Home Gym in Pacifica, CA","Pacifica, CA",,,
2,2,Primigi Classic loafers for your boy or girl. Normally $74-$89 We have many sizes and colors to… https://t.co/ntc7OX5zyz,"East Oakdale, CA",-120.82996,37.77433,Primigi Classic loafers for your boy or girl. Normally 74-89 We have many sizes and colors to,,,Primigi,
4,4,"I'm at Hardly Strictly Bluegrass in San Francisco, CA https://t.co/u61jaz0kHZ",San Francisco; CA,-122.489542,37.771727,"Im at Hardly Strictly Bluegrass in San Francisco, CA","San Francisco, CA",,,
9,10,"Forgotten, on a side walk. Understood. @ Western Addition, San Francisco https://t.co/pXozNjtGLF",,-122.428,37.7825,"Forgotten, on a side walk. Understood. at Western Addition, San Francisco",San Francisco,Western Addition,,


In [46]:
df.shape

(668, 10)

In [47]:
# Combining locational entities to get finer and more informed place names
# Locations are combined only when both columns are not null

df['FAC_GPE'] = np.where(((df['FAC'] != '') & (df['GPE'] != '')), df['FAC'].str.cat(df['GPE'], sep = ", "), '')
df['ORG_GPE'] = np.where(((df['ORG'] != '') & (df['GPE'] != '')), df['ORG'].str.cat(df['GPE'], sep = ", "), '')
df['LOC_GPE'] = np.where(((df['LOC'] != '') & (df['GPE'] != '')), df['LOC'].str.cat(df['GPE'], sep = ", "), '')
df

Unnamed: 0.1,Unnamed: 0,text,place,long,lat,clean_text,GPE,FAC,ORG,LOC,FAC_GPE,ORG_GPE,LOC_GPE
0,0,"I'm at My Home Gym in Pacifica, CA https://t.co/fWgIms86T8",Pacifica; CA,-122.500464,37.593650,"Im at My Home Gym in Pacifica, CA","Pacifica, CA",,,,,,
2,2,Primigi Classic loafers for your boy or girl. Normally $74-$89 We have many sizes and colors to… https://t.co/ntc7OX5zyz,"East Oakdale, CA",-120.829960,37.774330,Primigi Classic loafers for your boy or girl. Normally 74-89 We have many sizes and colors to,,,Primigi,,,,
4,4,"I'm at Hardly Strictly Bluegrass in San Francisco, CA https://t.co/u61jaz0kHZ",San Francisco; CA,-122.489542,37.771727,"Im at Hardly Strictly Bluegrass in San Francisco, CA","San Francisco, CA",,,,,,
9,10,"Forgotten, on a side walk. Understood. @ Western Addition, San Francisco https://t.co/pXozNjtGLF",,-122.428000,37.782500,"Forgotten, on a side walk. Understood. at Western Addition, San Francisco",San Francisco,Western Addition,,,"Western Addition, San Francisco",,
10,11,@StarlineSC #lightshow @ Starline Social Club https://t.co/kuFUfDR76h,,-122.272507,37.812812,at StarlineSC lightshow at Starline Social Club,,,Starline Social Club,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1885,1982,#sunrise #pier14 #sanfrancisco #ca @ Pier 14 Embarcadero San Francisco California https://t.co/Q6hklEPWLx,South Beach; San Francisco,-122.389897,37.794424,sunrise pier14 sanfrancisco ca at Pier 14 Embarcadero San Francisco California,"San Francisco, California",Pier 14 Embarcadero,,,"Pier 14 Embarcadero, San Francisco, California",,
1893,1991,Drinking an IPA by @lagunitasbeer @ Craig Martin Home Brews — https://t.co/yK9bH3fzRV,"Pittsburg, CA",-121.884000,38.001300,Drinking an IPA by at lagunitasbeer at Craig Martin Home Brews,,,Craig Martin Home Brews,,,,
1894,1992,Warriors win 😀 @ Oracle Arena and Oakland Alameda County Coliseum https://t.co/yyozENrXXy,"Oakland, CA",-122.202223,37.750748,Warriors win at Oracle Arena and Oakland Alameda County Coliseum,,"Oracle Arena, Oakland Alameda County Coliseum",,,,,
1896,1994,@SanJoseSharks WIN 3-2 IN OT #SJSharks https://t.co/WjvBPJuyox,San Jose; CA,-121.900758,37.332135,at SanJoseSharks WIN 3-2 IN OT SJSharks,,,"SanJoseSharks, SJSharks",,,,


****
We combine location entities in a manner that resembles actual addresses starting with a finer place name example **Building name: Stein Hotel** to a more course place reference **city name: Salzburg** to an even coarser place name e.g. **Country: Austria**
****

**Exercise**

Which other location combinations would make sence?

[Return to Table of Contents](#Table-of-contents)

# Geocoding with Nominatim 

In [48]:
# Loading required packages for geocoding with Nominatim
import geopandas as gpd 
import geopy
import matplotlib.pyplot as plt
from functools import partial 
from geopy import distance
from geopy.distance import geodesic
from tqdm import tqdm, tqdm_notebook # progress bar

#initiate 
tqdm.pandas()

In [49]:
# user_agent is used to overide restricts of using Nominatim default user_agent.
locator = geopy.geocoders.Nominatim(user_agent='mygeocoder')

In [50]:
# Set to avoid the error of 'Too many requests 429 error'
from geopy.extra.rate_limiter import RateLimiter
geocode = RateLimiter(locator.geocode, min_delay_seconds=1)

# return locations in english 
geocode = partial(locator.geocode, language = "en", timeout = 30000)

****
[**Nominatim**](https://nominatim.org/release-docs/latest/) has a limit of 1 request per second which translates to 86 400 requests per day without retries. Depending on the number of retries done on an address, the final returned results on a single day will certainly be lower than 86 400. Furthermore, Nominatim blocks users from sending multiple requests of the same location. 

To reduce geocoding time and avoid being blocked, we geocode only the unique locations and map results to our dataset. 
****

In [51]:
#Count unique locations in column FAC_GPE
df['FAC_GPE'].nunique()

62

In [52]:
unique_FAC_GPE = df.groupby('FAC_GPE')['Unnamed: 0'].unique()
outfilename = ('unique_FAC_GPE.csv')
unique_FAC_GPE.to_csv(outfilename)

In [53]:
df1 = pd.read_csv('unique_FAC_GPE.csv')
df1.loc[2:10] # Extracting a slice of the data

Unnamed: 0.1,FAC_GPE,Unnamed: 0
2,"Bay Bridge, San Francisco",[1999]
3,"Berkeley Hills Tunnel, Berkeley, CA",[867]
4,"Berryessa, San Jose, CA",[1754]
5,"California Avenue, Palo Alto, CA",[1059]
6,"Carpool Vacaville, SanFrancisco",[1756]
7,"Carson Falls Trail, mounttam, mounttamalpais",[435]
8,"Chumasero Park, San Francisco, CA",[195]
9,"Cliff House, San Francisco",[134]
10,"D6, SOMA",[558]


In [54]:
# Extracting locations (raw, latitude, longitude)

df1['geocoded_locations'] = df1['FAC_GPE'].progress_apply(geocode)

df1['Lat2'] = df1['geocoded_locations'].apply(lambda x: x.latitude if x else None)
df1['Lon2'] = df1['geocoded_locations'].apply(lambda x: x.longitude if x else None)

df1

100%|██████████| 62/62 [00:31<00:00,  1.95it/s]


Unnamed: 0.1,FAC_GPE,Unnamed: 0,geocoded_locations,Lat2,Lon2
0,,[ 0 2 4 11 12 16 18 20 23 29 30 31 32 43\n 47 49 50 51 52 57 58 65 68 71 78 84 86 90\n 92 93 95 101 102 103 105 107 112 114 119 123 124 126\n 127 129 133 135 138 141 145 156 158 161 163 166 176 177\n 179 182 183 185 188 189 193 194 197 203 204 205 209 210\n 213 218 222 225 226 230 236 237 238 245 248 250 252 257\n 258 260 266 272 279 282 285 290 292 293 302 306 308 314\n 335 339 346 352 358 360 361 362 374 377 381 382 383 385\n 386 388 391 392 393 395 397 400 401 407 408 409 416 419\n 422 424 430 431 433 434 436 439 442 445 449 452 454 456\n 457 463 467 470 477 478 479 484 487 493 494 496 504 505\n 507 508 511 513 517 518 521 528 530 533 534 539 540 545\n 549 560 564 565 567 568 576 588 597 599 600 601 605 608\n 616 620 623 624 632 635 639 640 641 643 647 649 650 652\n 661 666 668 674 676 679 686 689 692 693 694 699 701 703\n 712 714 720 721 723 727 728 729 733 735 738 739 741 742\n 743 745 747 749 751 752 754 755 759 762 763 769 773 776\n 784 786 793 796 803 808 809 812 815 817 820 829 830 842\n 843 844 848 850 853 858 866 870 871 872 875 879 881 882\n 886 889 891 893 896 897 898 907 908 915 918 921 926 927\n 928 929 930 932 935 941 943 945 947 948 950 952 953 954\n 956 958 959 963 964 968 971 972 978 980 981 986 987 988\n 989 992 993 1002 1003 1007 1009 1010 1016 1019 1022 1029 1031 1033\n 1034 1039 1040 1041 1045 1046 1052 1053 1054 1057 1058 1071 1072 1075\n 1076 1077 1078 1080 1081 1083 1085 1089 1096 1097 1101 1102 1109 1110\n 1112 1113 1117 1123 1125 1128 1132 1134 1135 1144 1146 1151 1154 1158\n 1159 1161 1174 1175 1176 1178 1180 1186 1192 1194 1198 1199 1201 1202\n 1203 1205 1206 1213 1214 1215 1216 1218 1220 1223 1225 1229 1230 1234\n 1235 1236 1237 1238 1239 1240 1244 1245 1252 1257 1258 1262 1264 1266\n 1267 1271 1272 1273 1275 1276 1277 1280 1283 1288 1291 1300 1308 1309\n 1310 1311 1324 1325 1332 1334 1343 1348 1352 1353 1364 1369 1372 1374\n 1390 1391 1393 1396 1399 1400 1401 1402 1412 1413 1416 1417 1421 1424\n 1425 1426 1430 1431 1443 1448 1451 1452 1455 1462 1463 1464 1465 1466\n 1467 1468 1479 1481 1482 1485 1487 1489 1490 1500 1507 1509 1512 1513\n 1515 1521 1526 1528 1534 1539 1543 1544 1548 1550 1552 1554 1559 1562\n 1564 1571 1573 1575 1579 1586 1587 1596 1598 1604 1609 1610 1612 1615\n 1619 1627 1629 1638 1643 1644 1645 1646 1651 1664 1669 1672 1678 1681\n 1685 1688 1698 1701 1703 1704 1708 1712 1713 1717 1724 1727 1728 1731\n 1732 1733 1734 1735 1738 1744 1746 1765 1774 1777 1779 1780 1783 1787\n 1798 1804 1805 1812 1816 1828 1829 1832 1835 1836 1837 1843 1847 1849\n 1856 1860 1863 1864 1869 1871 1876 1878 1880 1887 1889 1890 1891 1892\n 1909 1912 1914 1917 1918 1925 1931 1932 1934 1937 1941 1942 1944 1949\n 1950 1951 1955 1956 1958 1959 1960 1962 1964 1966 1969 1970 1971 1972\n 1991 1992 1994],"(Nangarhar Province, Afghanistan, (34.220389, 70.3800314))",34.220389,70.380031
1,"Anza Airport, Burlingame, CA",[235],"(Anza Airport Parking, Airport Boulevard, Burlingame, San Mateo County, California, 94010, United States, (37.589049149999994, -122.34552606805073))",37.589049,-122.345526
2,"Bay Bridge, San Francisco",[1999],"(Chesapeake Bay Bridge, Anne Arundel County, Maryland, 21666, United States, (38.9925372, -76.3794605))",38.992537,-76.379460
3,"Berkeley Hills Tunnel, Berkeley, CA",[867],,,
4,"Berryessa, San Jose, CA",[1754],"(Berryessa/North San José, Berryessa Station Way, Berryessa/North San José Station, Luna Park, San Jose, Santa Clara County, California, 95133-1703, United States, (37.3684396, -121.8746772))",37.368440,-121.874677
...,...,...,...,...,...
57,"gocaltrain, San Francisco, CA",[1354],,,
58,"jackbox, San Jose, CA",[1576],,,
59,"sansrau, San Francisco, CA",[630],,,
60,"sixflagsdk, Vallejo, CA",[1434],,,


In [None]:
# Merging unique geocoded locations to full dataframe 

df2 = pd.merge(df1,df, on='FAC_GPE')
df3 =df2[['clean_text','FAC_GPE', 'long', 'lat', 'Lon2', 'Lat2']]
df3

In [None]:
# Save output file 
outfilename = ('geocoded_FAC_GPE.csv')
df3.to_csv(outfilename)

[Return to Table of Contents](#Table-of-contents)

## Mapping geocoded locations

In [None]:
df_FAC_GPE = pd.read_csv('geocoded_FAC_GPE.csv') 
df_FAC_GPE

In [None]:
# New dataframe with geocoded values returned
df_FAC_GPE_cleaned = df_FAC_GPE[df_FAC_GPE['Lon2'].notna()]
df_FAC_GPE_cleaned.head(5)

In [None]:
df_FAC_GPE_cleaned.shape

****
**Exercise**
>- More than 50% of the extracted FAC_GPE locations where not geocoded. What can be the reason for Nominatim's failure to geocode these locations?
>- HINT: Use another geocoding service for example Google Maps to check the locations availability. Also check the locations on Open Street Map (Nominatim) service on your browser.
****

In [None]:
#Map visualising packages
import folium
from folium.plugins import HeatMap
import statistics

In [None]:
def generateBaseMap(default_location=[37.693943, -122.385880], default_zoom_start=12):
    base_map = folium.Map(location=default_location, control_scale=True, zoom_start=default_zoom_start)
    return base_map

In [None]:
y = statistics.mean(df_FAC_GPE_cleaned['Lat2']) 
x = statistics.mean(df_FAC_GPE_cleaned['Lon2']) 

In [None]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:98% !important; }</style>"))
df_FAC_GPE_cleaned['count'] = 1
base_map = generateBaseMap([y,x],8)
HeatMap(data=df_FAC_GPE_cleaned[['Lat2', 'Lon2', 'count']].groupby(['Lat2', 'Lon2']).sum().reset_index().values.tolist(), radius=8, max_zoom=13).add_to(base_map)
display(base_map)

[Return to Table of Contents](#Table-of-contents)

# Distance calculation 

We compute displacements between the GNSS (Ground truth) and the geocoded locations. We use the displacements to measure the precision of the geocoded locations.  

In [None]:
def distance_calc (row):
    start = (row['lat'], row['long'])
    stop = (row['Lat2'], row['Lon2'])

    return geodesic(start, stop).km

In [None]:
df_FAC_GPE_cleaned['distance'] = df_FAC_GPE_cleaned.apply (lambda row: distance_calc (row),axis=1)
df_FAC_GPE_cleaned.shape

## Visualising displacements

In [None]:
# Defining names of class
displacements = ["1km", "5km", "10km", "Over 10km"]

FAC_GPE_displacements = pd.cut(df_FAC_GPE_cleaned['distance'], [0, 1, 5, 10, 100000.0], labels=displacements) 
pd.value_counts(FAC_GPE_displacements)

In [None]:
df_FAC_GPE_cleaned

****
**Exercise**

- Look at the tweets with distance over 10km. What are the reasons for the big distance?
****

[Return to Table of Contents](#Table-of-contents)

# Assignment

The class is split into two groups. Each group will have 1 unique task, after which results are shared amoungst the groups and an overal task where both groups discuss on the process of the location extraction and geocoding exercise.  

> *Use the location extraction annotation notebook for the assignment.*
>
> *Annotation Manual.pdf contains the labelling instructions for both exercises*

****
**Group 1:**

To get a better understanding of our results, we need to check how well the model performs. [**F-score, F1-score**](https://deepai.org/machine-learning-glossary-and-terms/f-score) is used to evaluate our location extaction model by computing the confusion matrix (False positive, False negative, True positive, True positive) and combining the precision and recall of the model. To compute the F-score we will label our data for the presence or absence of a location and compare the extracted output against the expected output. Essentially we want to reduce the number of false positive and false negative location extractions and increase the number of true positives and true negative location extractions.

>- label data for the presence or absence of a location entity.
>- Compute F-score of the model as [1] a geometric mean (Basic F-score) and [2] by considering either recall or precision as more important (provide arguments for your considerations). 
>- Discuss the obtained F-score values and their implications on the study results. 
>- Are there any instances which struck out when reviewing the results either when a correct location was extracted together with noise (text not part of the location) or when part of a location is extracted for example York instead of New York? What do we do with such instance?
****
****

**Group 2:**

For our case study, we want to extract locations where a user is and not a place reference from the past or future. In the prior steps, we have extracted all location mentions within our tweet dataset regardless of the temporal reference. The unfiltered place references might be one of the reasons for obtaining very high displacements between user location and mentioned location. To avoid computing locations of referenced locations we will label our data with 2 classes; user's present location and other. The aim of this assignment is to filter out referenced locations where a user is not thereby reducing the displacement between the users actual location and the predicted location.  

>- Label time frame of location reference (Present time / other)
>- Compute and compare the displacements between the two groups of labels and discuss the results.  
>- To avoid constantly having to manually label the dataset, discuss strategies that can be used to split temporal references?  
>- Going through the data, are there any instances where the geocoding service (Nominatim) fails to geocode a correct location or geocodes to a wrong location? What can be done on such instances?

****
****
**Overal:**

 Share your findings from the individual group exercises. Discuss the presented method used in the notebook to extract and geocode locations:
>- Preprocessing routins 
>- spaCy entity extraction
>- entity combinations 
>- Nominatim geocoding 

How can this approach be improved 


[Return to Table of Contents](#Table-of-contents)

# Possibly useful codes 

In [None]:
# Defining search area
# By running this line of code before geocoding we restrict the search area to locations within California
# The code then reduces locations like CA from being geocoded to Canada instead of Califonia

#geocode = lambda query: locator.geocode("%s, California, USA" % query, timeout = 30000)

## Grammatical Filtering

In [None]:
# defining keywords for historial mentions 

"""past_keywords = [travelled to", "Last week", 'last night', 'last Monday', 'last tuesday', 
                  'last wednesday', 'last thursday', 'last friday', 'last saturday', 'last month', 
                 'last year', 'last winter', 'last summer', 'last autumn', 'last days'  "yesterday", 
                 "I was in ", 'were in', 'I was at', "I went", "was at ", 'was in ', 'were at ', 'went to ',
                 "landed from", "passed through", "had visited", 'had gone to', "flew from", 'flew in from',
                 'back from', "throwback", "past years", "miss being in", "I miss ", 'years ago ', 'days ago ',
                 'months ago ', 'hours ago ', 'time ago ', 'was leaving in ', 'was staying at', 'was leaving at',
                 'was staying in', 'makes me miss', 'im from', 'are from ', 'originally from', 'grew up in', 'grew up at ']
"""
#past_searched = '|'.join(past_keywords) # for searching keywords within sentence structures

In [None]:
# Filtering the data to return only data with the specific keyword
# case = False makes the search case insensitive 
# na = false means we dont return errors when there are unexpected types in series 

df_past = df[df["clean_text"].str.contains(past_searched, case = False, na = False)]
df_past

In [None]:
# defining keywords for future mentions 
# We add space before and after 'to' to make the text standalone

"""future_keywords = ["going to ",'driving to', "Taking the train to", "taking the car to",
                   "taking the bus on", "headin to",'heading to', 'headed for', "leave for",
                   "leaving for", 'go to', 'travel to', 'trip to','travelling to', 'moving to',
                   'relocating to', 'flying to', "vist ", 'will be going', 'will be at ',
                   'will be in ', 'tomorrow', 'next week', 'next days', 'next Monday',
                   'next Tuesday', 'next Wednesday', 'next Thursday', 'next Friday', 
                   'next Saturday', 'next Sunday', 'tonight at', 'join us for', 'on mondays',
                   'on tuesdays', 'on wednesdays', 'on Thursdays', 'on fridays', 'on saturdays',
                   'on sundays', 'on weekends', 'on weekends', 'weeks from now', 'next stop ',
                   'move to ', 'later on ', 'later this ' ]
""""
#future_searched = '|'.join(future_keywords) # for searching keywords within sentence structures

In [None]:
# filtering data to return only future mentions 

#df_future = df[df["clean_text"].str.contains(future_searched, case = False, na = False)]
#df_future

[Return to Table of Contents](#Table-of-contents)