# Location Inference Strategy with Clustering
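**IDEA**: geocode the location entities extracted from each user's posts, cluster them by coordinates, and use the densest cluster as the user's inferred location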

**Strengths**:
- computationally cheap
- allows the prediction to be treated as a real-valued output (coordinates) rather than a granular classification
- results are comparable to previous studies (e.g. percentage of users located within 100 miles of their true location)
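
To make the "within 100 miles" comparison concrete, here is a minimal sketch of how that metric could be computed from predicted and true coordinates. The function names and the `(lat, lon)` tuple layout are assumptions for illustration, not part of the project code.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

Coord = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_miles(a: Coord, b: Coord) -> float:
    """Great-circle distance between two (lat, lon) points, in miles."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def accuracy_within(predicted: List[Coord], actual: List[Coord],
                    miles: float = 100.0) -> float:
    """Fraction of users whose predicted point lies within `miles` of the true one."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(haversine_miles(p, a) <= miles for p, a in zip(predicted, actual))
    return hits / len(predicted)


# Illustrative coordinates (approximate city centers)
boston = (42.3601, -71.0589)
providence = (41.8240, -71.4128)
new_york = (40.7128, -74.0060)

# Two users who both live in Boston: one predicted in Providence, one in New York
score = accuracy_within([providence, new_york], [boston, boston], miles=100.0)
```

Only the Providence prediction falls inside the 100-mile radius here, so half of the users count as correct.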

**Weaknesses**:
- disregards entities that are not in the gazetteer (doesn't learn a representation of Reddit lingo)
- depends on spaCy (whose NER may not perform well on Reddit text)

**Steps**:
1. Prepare gazetteer
    - Filter locations by population (threshold > 50k?)
    - For international data, include only the country
2. Prepare entities
    - Extract all location entities
    - Keep only the location entities that exist in the gazetteer
3. Cluster entities
    - Group entities per user
    - Geocode every location that each entity could represent
    - Extract the highest-density cluster
4. Group clusters
    - Calculate the coordinates of the cluster center
    - Reverse geocode the cluster center to find the closest real location
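
Steps 3 and 4 can be sketched with plain NumPy. This is one possible reading of "highest density cluster", not the final method: take the point with the most neighbors inside a fixed radius and keep those neighbors. The radius, the function names, and the use of a nearest-known-place lookup as a stand-in for reverse geocoding are all assumptions.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8


def pairwise_haversine(coords: np.ndarray) -> np.ndarray:
    """Pairwise great-circle distances (miles) for an (n, 2) array of (lat, lon) degrees."""
    lat, lon = np.radians(coords[:, 0]), np.radians(coords[:, 1])
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))


def densest_cluster(coords: np.ndarray, radius_miles: float = 75.0) -> np.ndarray:
    """Return the points within `radius_miles` of the point that has the most neighbors."""
    neighbors = pairwise_haversine(coords) <= radius_miles
    seed = int(np.argmax(neighbors.sum(axis=1)))  # first point with the highest count
    return coords[neighbors[seed]]


def spherical_center(coords: np.ndarray) -> np.ndarray:
    """Average the points as 3-D unit vectors, then convert back to (lat, lon) degrees."""
    lat, lon = np.radians(coords[:, 0]), np.radians(coords[:, 1])
    mx = (np.cos(lat) * np.cos(lon)).mean()
    my = (np.cos(lat) * np.sin(lon)).mean()
    mz = np.sin(lat).mean()
    return np.degrees([np.arctan2(mz, np.hypot(mx, my)), np.arctan2(my, mx)])


def nearest_place(center: np.ndarray, names: list, coords: np.ndarray) -> str:
    """Name of the known place closest to `center` (stand-in for reverse geocoding)."""
    stacked = np.vstack([center, coords])
    return names[int(np.argmin(pairwise_haversine(stacked)[0, 1:]))]


# Geocoded candidates for one user (approximate, illustrative coordinates)
names = ["providence", "warwick", "boston", "phoenix", "las vegas"]
candidates = np.array([
    [41.82, -71.41],
    [41.70, -71.42],
    [42.36, -71.06],
    [33.45, -112.07],
    [36.17, -115.14],
])

cluster = densest_cluster(candidates, radius_miles=75.0)  # the three New England points
center = spherical_center(cluster)
closest = nearest_place(center, names, candidates)        # "providence"
```

Averaging unit vectors instead of raw latitude/longitude keeps the center sensible near the antimeridian. A library clusterer with a haversine metric could replace `densest_cluster` without changing the other steps.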

## Load data

In [1]:
# set the project path
%cd ~/projects/drug-pricing

/home/denhart.c/projects/drug-pricing


In [2]:
%load_ext autoreload

In [38]:
%autoreload 2
import pandas as pd
from tqdm import tqdm
import numpy as np
from scipy.special import softmax
from typing import List
import functools as ft
import geocoder

from src.utils import connect_to_mongo, get_nlp
from src.schema import User, Post, SubmissionPost, CommentPost, Location
from src.models.v1 import get_user_spacy, get_ents, DENYLIST
from src.models.v1.filters import BaseFilter, DenylistFilter, LocationFilter

In [19]:
connect_to_mongo()
nlp = get_nlp()
mapbox_key = "pk.eyJ1IjoiY2NjZGVuaGFydCIsImEiOiJjamtzdjNuNHAyMjB4M3B0ZHVoY3l2MndtIn0.jkJIFGPTN7oSkQlHi0xtow"

## Prepare Gazetteer

In [6]:
gazetteer = pd.read_csv("data/locations/grouped-locations.csv")

In [7]:
gazetteer.head()

| Unnamed: 0 | neighborhood | city | county | state | country | metro | state_full |
|---|---|---|---|---|---|---|---|
| 0 | northeast dallas | dallas | dallas | tx | united states of america | dallas-fort worth-arlington | texas |
| 1 | maryvale | phoenix | maricopa | az | united states of america | phoenix-mesa-scottsdale | arizona |
| 2 | paradise | las vegas | clark | nv | united states of america | las vegas-henderson-paradise | nevada |
| 3 | upper west side | new york | new york | ny | united states of america | new york-newark-jersey city | new york |
| 4 | south los angeles | los angeles | los angeles | ca | united states of america | los angeles-long beach-anaheim | california |
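
A single name can match the gazetteer at several levels, which is why each entity may expand into more than one geocoding candidate. The sketch below rebuilds the five rows shown above and counts matches per column. `matching_levels` and `LEVELS` are hypothetical names, and this is not how `LocationFilter` is necessarily implemented.

```python
import pandas as pd

# The five rows shown by gazetteer.head(), rebuilt so the sketch runs on its own
gazetteer_sample = pd.DataFrame({
    "neighborhood": ["northeast dallas", "maryvale", "paradise",
                     "upper west side", "south los angeles"],
    "city": ["dallas", "phoenix", "las vegas", "new york", "los angeles"],
    "county": ["dallas", "maricopa", "clark", "new york", "los angeles"],
    "state": ["tx", "az", "nv", "ny", "ca"],
    "country": ["united states of america"] * 5,
    "metro": ["dallas-fort worth-arlington", "phoenix-mesa-scottsdale",
              "las vegas-henderson-paradise", "new york-newark-jersey city",
              "los angeles-long beach-anaheim"],
    "state_full": ["texas", "arizona", "nevada", "new york", "california"],
})

# Columns that an entity string is compared against (an assumption)
LEVELS = ["neighborhood", "city", "county", "state", "state_full", "metro", "country"]


def matching_levels(entity: str, gazetteer: pd.DataFrame) -> dict:
    """Map each gazetteer column to the number of rows whose value equals the entity."""
    name = entity.strip().lower()
    counts = {level: int((gazetteer[level] == name).sum()) for level in LEVELS}
    return {level: n for level, n in counts.items() if n > 0}


ambiguous = matching_levels("New York", gazetteer_sample)   # city, county and state_full
single = matching_levels("paradise", gazetteer_sample)      # neighborhood only
missing = matching_levels("atlantis", gazetteer_sample)     # no match: would be filtered out
```

An empty result corresponds to an entity the gazetteer filter would drop. A result with several keys is an entity that needs one geocoding candidate per level.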


## Prepare entities

In [13]:
username = "traceyh415"
u = User.objects(username=username).first()

In [28]:
def filter_entities(entities: List[str], filters: List[BaseFilter]) -> List[str]:
    """Keep only the entities that pass every filter, preserving order and duplicates."""
    # run the filters on the distinct entities so each unique string is checked once
    distinct_entities = set(entities)
    possible_entities = ft.reduce(lambda acc, f: f.filter(acc),
                                  filters,
                                  distinct_entities)
    # re-expand to the original list so repeated mentions keep their frequency
    filtered_entities = [entity for entity in entities if entity in possible_entities]
    return filtered_entities
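
A small self-contained run shows what the filter chain does to repeated mentions. `Drop` and `KeepOnly` are hypothetical stand-ins for `DenylistFilter` and `LocationFilter`. The only thing assumed about the real classes is a `filter` method that takes a set and returns a set.

```python
import functools as ft
from typing import Iterable, List, Set


class Drop:
    """Hypothetical stand-in for DenylistFilter: remove denied entities."""

    def __init__(self, denied: Iterable[str]):
        self.denied = set(denied)

    def filter(self, entities: Set[str]) -> Set[str]:
        return entities - self.denied


class KeepOnly:
    """Hypothetical stand-in for LocationFilter: keep entities found in a known set."""

    def __init__(self, known: Iterable[str]):
        self.known = set(known)

    def filter(self, entities: Set[str]) -> Set[str]:
        return entities & self.known


def filter_entities(entities: List[str], filters: list) -> List[str]:
    """Same logic as the notebook cell above, repeated so this sketch runs on its own."""
    distinct_entities = set(entities)
    possible_entities = ft.reduce(lambda acc, f: f.filter(acc), filters, distinct_entities)
    return [entity for entity in entities if entity in possible_entities]


mentions = ["rhode island", "china", "rhode island", "mars", "dallas"]
demo_filters = [Drop({"china"}), KeepOnly({"rhode island", "dallas", "china"})]

kept = filter_entities(mentions, demo_filters)
# kept == ["rhode island", "rhode island", "dallas"]
```

Duplicates survive on purpose: a place mentioned many times contributes many points, which is what makes it dense in the clustering step.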

In [32]:
filters = [DenylistFilter(DENYLIST), LocationFilter(gazetteer)]

In [20]:
user_spacy_docs = get_user_spacy(u, nlp)

In [21]:
user_entities = get_ents(user_spacy_docs, "GPE")

In [33]:
filtered_user_entities = filter_entities(user_entities, filters)

In [35]:
len(filtered_user_entities)

602

## Cluster entities

In [39]:
e = filtered_user_entities[0]

In [40]:
e

'rhode island'

In [45]:
g = geocoder.mapbox(e, key=mapbox_key)

Status code Unknown from https://api.mapbox.com/geocoding/v5/mapbox.places/rhode island.json: ERROR - HTTPSConnectionPool(host='api.mapbox.com', port=443): Max retries exceeded with url: /geocoding/v5/mapbox.places/rhode%20island.json?access_token=pk.eyJ1IjoiY2NjZGVuaGFydCIsImEiOiJjamtzdjNuNHAyMjB4M3B0ZHVoY3l2MndtIn0.jkJIFGPTN7oSkQlHi0xtow (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x2b7bc9f25190>, 'Connection to api.mapbox.com timed out. (connect timeout=5.0)'))
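
The request above timed out while connecting. If the failure is transient, a retry wrapper with exponential backoff is one way to handle it. The sketch below is generic and does not call any geocoding API itself. `with_retries` is a hypothetical helper, and the commented usage line assumes that `geocoder.mapbox` accepts a `timeout` keyword and that its result exposes an `ok` attribute.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(call: Callable[[], T],
                 is_ok: Callable[[T], bool],
                 attempts: int = 3,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call `call` until `is_ok(result)` is true, waiting base_delay * 2**i between tries."""
    for i in range(attempts):
        try:
            result = call()
            if is_ok(result):
                return result
        except Exception:
            pass  # treat any exception as a failed attempt
        if i < attempts - 1:
            sleep(base_delay * 2 ** i)  # 1s, 2s, 4s, ... with the defaults
    return None  # every attempt failed


# Assumed usage in this notebook (not executed here):
# g = with_retries(lambda: geocoder.mapbox(e, key=mapbox_key, timeout=10),
#                  is_ok=lambda result: result.ok)

# Offline demonstration with a call that succeeds on its third try
calls = {"n": 0}
delays = []


def flaky() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


outcome = with_retries(flaky, is_ok=lambda r: r, attempts=5, sleep=delays.append)
```

If the machine has no outbound network access at all, retries will not help, and the geocoding would have to run elsewhere or against cached results.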


In [46]:
import requests

In [None]:
# check whether this machine has outbound internet access at all
!curl www.wikipedia.com