# Annotating tweets for location extraction and geocoding

This notebook is used to annotate tweets for temporal reference, that is, whether the location mentioned in a tweet is likely to be the user's current location.

## Dataset structure

The tweet dataset is not filtered for a particular topic; it covers a range of topics, and the tweets vary in length. All tweets are in English.

**For temporal reference:**
- Decide whether a tweet refers to the user's current location or not.
- If unsure, label the tweet as referring to the present location.

In [9]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import json
import os

## Loading Data

Set the `annotator_name` variable to your annotator ID.

In [10]:
import pandas as pd
import numpy as np
# Read the CSV export and write it back out as JSON, the format the annotation helper loads
df = pd.read_csv('fac_gpe_temporal.csv')
df.to_json('fac_gpe_temporal.json')

In [11]:
df = df.replace(np.nan, '', regex=True)
df

Unnamed: 0,id,text,FAC_GPE,long,lat,distance_nominatim,distance_google
0,516,Lunch with my man before heading out. at @Earl...,"EarlofSandwich San Jose, CA",-121.892270,37.335096,0,0
1,1222,BabyKitten dressed as a ...... sanfrancisco wa...,Ocean Beach sanfrancisco,-122.510734,37.759392,0,0
2,1892,Lookin for love in all the right places. @ Uni...,Union Street San Francisco,-122.430582,37.797726,11123.9,0.000121159
3,1771,"Im at The Original Country Way in Fremont, CA","The Original Country Way Fremont, CA",-122.001637,37.532997,0.000364704,0.000364704
4,278,Viking has a doggydayout @ Baylands Nature Pre...,"Baylands Nature Preserve Palo Alto, CA",-122.107560,37.457495,0.0006237,0.0006237
...,...,...,...,...,...,...,...
1529,1044,Im at MUNI Metro Stop - 30th Dolores in San F...,"MUNI Metro Stop -, 30th Dolores San Francisco...",-122.424079,37.742256,,
1530,1054,Human waste at the SW corner of Sutter and Lag...,"MUNI Sutter, D5",-122.428473,37.786709,,
1531,1144,"Im at Monarch Sofas in Menlo Park, CA","Monarch Sofas Menlo Park, CA",-122.172207,37.483542,,
1532,1321,"Poop Civic Center Tenderloin, D6","Poop Civic Center Tenderloin, D6",-122.420691,37.783628,,
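
When a file written by `DataFrame.to_json` is read back with `json.load`, the default `columns` orientation stores each column as a mapping from index label to value, and JSON object keys are always strings. A minimal sketch with two toy rows (the `toy` frame is illustrative and only mimics two columns of the real dataset) suggests what to expect: the rebuilt frame has string index labels rather than integers.

```python
import json
import pandas as pd

# Toy stand-in for the real dataset; only two of its columns are mimicked here.
toy = pd.DataFrame({
    'text': ['Im at The Original Country Way in Fremont, CA',
             'Im at Monarch Sofas in Menlo Park, CA'],
    'FAC_GPE': ['The Original Country Way Fremont, CA',
                'Monarch Sofas Menlo Park, CA'],
})

# With no path argument, to_json returns the JSON text instead of writing a file.
payload = json.loads(toy.to_json())

# Default orientation: {column -> {index label -> value}}, with string keys.
restored = pd.DataFrame(payload)
print(list(restored.index))
```

Row lookups on a frame rebuilt this way use string labels, for example `restored.loc['1', 'FAC_GPE']`.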


In [12]:
dataset_file_names = ('fac_gpe_temporal.json','fac_gpe_temporal.json')

# Remember to replace annotator_name with your own name
annotator_name = 'Helen_TR'

for fn in dataset_file_names:
    print(fn)
    df = pd.DataFrame(json.load(open(fn)))
    display(df.head()[['text','FAC_GPE']])

fac_gpe_temporal.json


Unnamed: 0,text,FAC_GPE
0,Lunch with my man before heading out. at @Earl...,"EarlofSandwich San Jose, CA"
1,BabyKitten dressed as a ...... sanfrancisco wa...,Ocean Beach sanfrancisco
2,Lookin for love in all the right places. @ Uni...,Union Street San Francisco
3,"Im at The Original Country Way in Fremont, CA","The Original Country Way Fremont, CA"
4,Viking has a doggydayout @ Baylands Nature Pre...,"Baylands Nature Preserve Palo Alto, CA"


fac_gpe_temporal.json


Unnamed: 0,text,FAC_GPE
0,Lunch with my man before heading out. at @Earl...,"EarlofSandwich San Jose, CA"
1,BabyKitten dressed as a ...... sanfrancisco wa...,Ocean Beach sanfrancisco
2,Lookin for love in all the right places. @ Uni...,Union Street San Francisco
3,"Im at The Original Country Way in Fremont, CA","The Original Country Way Fremont, CA"
4,Viking has a doggydayout @ Baylands Nature Pre...,"Baylands Nature Preserve Palo Alto, CA"


## Annotation

### Helper function

This function loads the data (using partially annotated .json files if available) and saves it after every annotation.

This means that annotation can be picked up again whenever desired. Intermediate and final results are saved under the original filename with `_annotated` and the annotator name appended.

Only the specified labels (`0` and `1` by default) are accepted as input; `p` prints a progress bar, `q` quits, and any other key shows a help text.

In [13]:
def annotate_tweet_df(fn, possible_labels=('0', '1')):
    def process_input(user_input):
        if user_input in possible_labels:
            return user_input
        elif user_input.startswith('p'):
            progressbar(compute_annotation_progress(), max_num=len(df))
            vc = df[label_column_name].value_counts()
            print('labels\t',  ', '.join([str(k)+': ' + str(v) for k,v in zip(vc.keys(), vc.values)]))
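        elif user_input.startswith('q'):
            # A bare `raise` with no active exception deliberately triggers a
            # RuntimeError, which stops the annotation loop (labels so far are already saved)
            raise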
        else:
            print(help_text)

        return process_input(input('\t'))

    def compute_annotation_progress():
        if label_column_name not in df.keys():
            return 0
        return len(df) - df[label_column_name].isna().sum()

    def progressbar(it, max_num, size=60):
        finished = int(round((it / max_num * size))) if it > 0 else 0
        rest = size - finished
        print('[' + finished * '|' + rest * '.' + ']\t', it, '/', max_num)

    help_text = '\n'.join(['Possible Commands', str(possible_labels) + '\tpossible labels',
                           'h\tshow this help', 'p\tshow progress', 'q\tquit', ''])

    label_column_name = 'label_' + annotator_name
    annotated_df_fn = fn.split('.json')[0] + '_annotated' + annotator_name + '.json'

    if os.path.exists(annotated_df_fn) and os.path.isfile(annotated_df_fn):
        print(annotated_df_fn, 'already exists, continuing previous annotation process')
        df: pd.DataFrame = pd.DataFrame(json.load(open(annotated_df_fn)))
    else:
        df: pd.DataFrame = pd.DataFrame(json.load(open(fn)))

    nb_annotated_tweets = compute_annotation_progress()
    if label_column_name in df.keys():
        print('Labels from', annotator_name, 'already in data!')
        if compute_annotation_progress() < len(df):
            print('Continuing annotation,', nb_annotated_tweets, 'of', len(df), 'already annotated')
        else:
            return
    else:
        df[label_column_name] = np.nan

    print(help_text)
    print('Starting annotation for', len(df) - nb_annotated_tweets, 'tweets:')
    for index, row in df.iterrows():
        if not pd.isna(row[label_column_name]):
            continue
        print(row.text)
        label = process_input(input('\t'))
        if label is not None:
            df.loc[index, label_column_name] = label
        df.to_json(annotated_df_fn)

    print('Finished!\nSaved results as', annotated_df_fn, '\n')
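
To make the naming and progress logic easier to check in isolation, the sketch below pulls two pieces out of the helper into standalone functions. The names `annotated_path` and `render_progress` are hypothetical and do not exist in the notebook; the expressions inside them follow the ones used in `annotate_tweet_df`, except that the bar is returned as a string instead of printed.

```python
def annotated_path(fn, annotator_name):
    # Same rule as the helper: drop '.json', then append '_annotated' plus the annotator name.
    return fn.split('.json')[0] + '_annotated' + annotator_name + '.json'


def render_progress(done, total, size=60):
    # Number of filled cells is the completed fraction scaled to the bar width.
    finished = int(round(done / total * size)) if done > 0 else 0
    return '[' + finished * '|' + (size - finished) * '.' + ']'


print(annotated_path('fac_gpe_temporal.json', 'Helen_TR'))
# fac_gpe_temporal_annotatedHelen_TR.json
print(render_progress(5, 10, size=10))
```

Because the bar is rounded to whole cells, a small number of annotations out of a large dataset can still render as an empty bar.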

## Annotation Task

Please refer to the annotation guide file for annotation examples. If something is unclear, feel free to ask.

IMPORTANT:   
- For **temporal reference**, we are only interested in locations that are likely to be the user's current location.

The label is `1` if the tweet's location refers to the present and `0` otherwise.

For more information see the annotation guide.

#### To start annotating, run the cell below.  
Press q to pause the annotation (the red error is intended behaviour).  
Press p to show your progress.  
Press h to see all possible commands.
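
The key handling described above can be summarised as a small pure function. This is a non-interactive sketch: `classify_command` is a hypothetical name, and it only reports which branch a given input would take instead of prompting, printing, or raising as the helper does.

```python
def classify_command(user_input, possible_labels=('0', '1')):
    """Return (action, label) for one line of annotator input."""
    if user_input in possible_labels:
        # Exact match against the allowed labels.
        return ('label', user_input)
    if user_input.startswith('p'):
        return ('progress', None)
    if user_input.startswith('q'):
        return ('quit', None)
    # Anything else, including 'h' and empty input, falls through to the help text.
    return ('help', None)


for key in ('1', 'p', 'q', 'h', '2'):
    print(repr(key), classify_command(key))
```

Matching on the first character means that longer inputs such as `quit` behave like `q`, while a label typed with extra characters (for example `10`) is not accepted and shows the help text.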

In [16]:
list(map(annotate_tweet_df, dataset_file_names))

fac_gpe_temporal_annotatedHelen_TR.json already exists, continuing previous annotation process
Labels from Helen_TR already in data!
Continuing annotation, 4 of 1534 already annotated
Possible Commands
('0', '1')	possible labels
h	show this help
p	show progress
q	quit

Starting annotation for 1530 tweets:
Viking has a doggydayout @ Baylands Nature Preserve in Palo Alto, CA
	q


RuntimeError: No active exception to reraise

In [18]:
pd.read_json('fac_gpe_temporal_annotatedHelen_TR.json')

Unnamed: 0,id,text,FAC_GPE,long,lat,distance_nominatim,distance_google,label_Helen_TR
0,516,Lunch with my man before heading out. at @Earl...,"EarlofSandwich San Jose, CA",-121.892270,37.335096,0.000000,0.000000,1.0
1,1222,BabyKitten dressed as a ...... sanfrancisco wa...,Ocean Beach sanfrancisco,-122.510734,37.759392,0.000000,0.000000,1.0
2,1892,Lookin for love in all the right places. @ Uni...,Union Street San Francisco,-122.430582,37.797726,11123.892110,0.000121,1.0
3,1771,"Im at The Original Country Way in Fremont, CA","The Original Country Way Fremont, CA",-122.001637,37.532997,0.000365,0.000365,1.0
4,278,Viking has a doggydayout @ Baylands Nature Pre...,"Baylands Nature Preserve Palo Alto, CA",-122.107560,37.457495,0.000624,0.000624,
...,...,...,...,...,...,...,...,...
1529,1044,Im at MUNI Metro Stop - 30th Dolores in San F...,"MUNI Metro Stop -, 30th Dolores San Francisco...",-122.424079,37.742256,,,
1530,1054,Human waste at the SW corner of Sutter and Lag...,"MUNI Sutter, D5",-122.428472,37.786709,,,
1531,1144,"Im at Monarch Sofas in Menlo Park, CA","Monarch Sofas Menlo Park, CA",-122.172207,37.483542,,,
1532,1321,"Poop Civic Center Tenderloin, D6","Poop Civic Center Tenderloin, D6",-122.420691,37.783628,,,
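
Once the annotated file is loaded, a quick summary of the label column shows how many tweets are done and how the labels are distributed. The sketch below uses a small hand-made frame with invented labels in place of the real file, and assumes the label column is named `label_Helen_TR` as in the output above.

```python
import numpy as np
import pandas as pd

# Toy data with invented labels; the real frame would come from pd.read_json.
annotated = pd.DataFrame({
    'text': ['tweet a', 'tweet b', 'tweet c', 'tweet d'],
    'label_Helen_TR': [1.0, 1.0, 0.0, np.nan],
})
label_column = 'label_Helen_TR'

# Rows with a non-missing label count as annotated.
done = int(annotated[label_column].notna().sum())
remaining = len(annotated) - done

# value_counts ignores missing values by default.
counts = annotated[label_column].value_counts().to_dict()

print(done, 'annotated,', remaining, 'remaining')
print(counts)
```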


In [None]:
df2 = pd.read_json('try_annotatedannotator_01.json')
df2