# Annotating tweets for location extraction and geocoding

This notebook is intended for annotating tweets for computing F-score statistics and splitting temporal references.

## Dataset structure

The tweet dataset is not filtered for a particular topic; it covers a range of topics, and the tweets vary in length. All tweets are in English.

**For F-score test:**  
- The interest is in determining whether a tweet contains a location or not.
- If unsure, label the tweet with the class you are more certain of.

**For temporal splitting:**  
- The interest is in determining whether a tweet refers to the user's current location or not.
- If unsure, label the tweet with the class you are more certain of.

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
import json
import os

## Loading Data

Insert your annotator ID in the `annotator_name` variable.

In [2]:
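# pandas and numpy are already imported in the first cell.
# Convert the CSV of extracted location entities to JSON for the annotation helper.
df = pd.read_csv('df_location_entities1.csv')
df.to_json('ss_california_tweets.json')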

In [7]:
# Replace missing values with empty strings so empty entity columns display cleanly.
df = df.fillna('')
df

Unnamed: 0.1,Unnamed: 0,text,place,long,lat,clean_text,GPE,FAC,ORG,LOC
0,0,"I'm at My Home Gym in Pacifica, CA https://t....",Pacifica; CA,-122.500464,37.593650,"Im at My Home Gym in Pacifica, CA","Pacifica, CA",,,
1,1,_styledbym.e killed it with this #shadowroot #...,,-121.989751,38.355840,styledbym.e killed it with this shadowroot col...,,,,
2,2,Primigi Classic loafers for your boy or girl. ...,"East Oakdale, CA",-120.829960,37.774330,Primigi Classic loafers for your boy or girl. ...,,,Primigi,
3,3,Warriors single game tickets go on sale at 10...,San Jose; CA,-121.891766,37.332484,Warriors single game tickets go on sale at 10...,,,,
4,4,I'm at Hardly Strictly Bluegrass in San Franc...,San Francisco; CA,-122.489542,37.771727,Im at Hardly Strictly Bluegrass in San Franci...,"San Francisco, CA",,,
...,...,...,...,...,...,...,...,...,...,...
1897,1995,how do i change the font on my phone 😐,6fafb06c49df870f,-122.264870,37.998157,how do i change the font on my phone,,,,
1898,1996,@Vanessaah_x I just ran out lol,ee2cfc9feb061a08,-120.845655,37.504062,at Vanessaahx I just ran out lol,,,,
1899,1997,@muhbellsaywhaat loves me. I know she do.,8004d2bebcc13e8c,-122.039433,37.971872,at muhbellsaywhaat loves me. I know she do.,,,,
1900,1998,#ElvistheCorgi #SamtheCorgi #GigitheFrenchie #...,,-122.894090,38.479099,ElvistheCorgi SamtheCorgi GigitheFrenchie dog ...,,,,


In [18]:
dataset_file_names = ('ss_california_tweets.json','ss_california_tweets.json')

# Remember to replace annotator_name with your own name
annotator_name = 'Helen_Mudiwa_fscore'

for fn in dataset_file_names:
    print(fn)
    # Use a context manager so the file handle is closed after loading.
    with open(fn) as f:
        df = pd.DataFrame(json.load(f))
    display(df.head()[['text', 'clean_text']])

ss_california_tweets.json


Unnamed: 0,text,clean_text
0,"I'm at My Home Gym in Pacifica, CA https://t....","Im at My Home Gym in Pacifica, CA"
1,_styledbym.e killed it with this #shadowroot #...,styledbym.e killed it with this shadowroot col...
2,Primigi Classic loafers for your boy or girl. ...,Primigi Classic loafers for your boy or girl. ...
3,Warriors single game tickets go on sale at 10...,Warriors single game tickets go on sale at 10...
4,I'm at Hardly Strictly Bluegrass in San Franc...,Im at Hardly Strictly Bluegrass in San Franci...


ss_california_tweets.json


Unnamed: 0,text,clean_text
0,"I'm at My Home Gym in Pacifica, CA https://t....","Im at My Home Gym in Pacifica, CA"
1,_styledbym.e killed it with this #shadowroot #...,styledbym.e killed it with this shadowroot col...
2,Primigi Classic loafers for your boy or girl. ...,Primigi Classic loafers for your boy or girl. ...
3,Warriors single game tickets go on sale at 10...,Warriors single game tickets go on sale at 10...
4,I'm at Hardly Strictly Bluegrass in San Franc...,Im at Hardly Strictly Bluegrass in San Franci...


## Annotation

### Helper function

This function loads the data (using partially annotated .json files if available) and saves it after every annotation.

This means that annotation can be picked up again whenever desired. Intermediate and final results are saved under the original filename with `_annotated` and the annotator name appended.

Only the specified labels (`0` and `1` by default) are accepted as input. `p` prints a progress bar, `q` quits, and any other key shows the help text.
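
The naming and progress logic described above can be tried in isolation. The sketch below mirrors two small pieces of the helper defined in the next cell; `annotated_filename` and `render_progress` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
def annotated_filename(fn, annotator_name):
    """Build the output path: original name + '_annotated' + annotator name."""
    return fn.split('.json')[0] + '_annotated' + annotator_name + '.json'


def render_progress(done, total, size=60):
    """Return a text progress bar with `size` cells, '|' marking finished work."""
    finished = int(round(done / total * size)) if done > 0 else 0
    return '[' + finished * '|' + (size - finished) * '.' + ']'


out_name = annotated_filename('ss_california_tweets.json', 'Helen_Mudiwa_fscore')
print(out_name)
print(render_progress(16, 1902), 16, '/', 1902)
```

Note that no separator is inserted between `_annotated` and the annotator name, so the annotator name directly follows `_annotated` in the resulting file name.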

In [20]:
def annotate_tweet_df(fn, possible_labels=('0', '1')):
    def process_input(user_input):
        # Keep prompting until a valid label is entered (a loop avoids unbounded recursion).
        while user_input not in possible_labels:
            if user_input.startswith('p'):
                progressbar(compute_annotation_progress(), max_num=len(df))
                vc = df[label_column_name].value_counts()
                print('labels\t', ', '.join(f'{k}: {v}' for k, v in vc.items()))
            elif user_input.startswith('q'):
                # Abort deliberately; labels entered so far are already saved to disk.
                raise KeyboardInterrupt('Annotation paused by user')
            else:
                print(help_text)
            user_input = input('\t')
        return user_input

    def compute_annotation_progress():
        if label_column_name not in df.keys():
            return 0
        return len(df) - df[label_column_name].isna().sum()

    def progressbar(it, max_num, size=60):
        finished = int(round((it / max_num * size))) if it > 0 else 0
        rest = size - finished
        print('[' + finished * '|' + rest * '.' + ']\t', it, '/', max_num)

    help_text = '\n'.join(['Possible Commands', str(possible_labels) + '\tpossible labels',
                           'h\tshow this help', 'p\tshow progress', 'q\tquit', ''])

    label_column_name = 'label_' + annotator_name
    annotated_df_fn = fn.split('.json')[0] + '_annotated' + annotator_name + '.json'

    if os.path.isfile(annotated_df_fn):
        print(annotated_df_fn, 'already exists, continuing previous annotation process')
        source_fn = annotated_df_fn
    else:
        source_fn = fn
    # Use a context manager so the file handle is closed after loading.
    with open(source_fn) as f:
        df = pd.DataFrame(json.load(f))

    nb_annotated_tweets = compute_annotation_progress()
    if label_column_name in df.keys():
        print('Labels from', annotator_name, 'already in data!')
        if compute_annotation_progress() < len(df):
            print('Continuing annotation,', nb_annotated_tweets, 'of', len(df), 'already annotated')
        else:
            return
    else:
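        # Start with an object column: labels are stored as strings, and recent
        # pandas versions warn when strings are written into a float (NaN) column.
        df[label_column_name] = None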

    print(help_text)
    print('Starting annotation for', len(df) - nb_annotated_tweets, 'tweets:')
    for index, row in df.iterrows():
        if not pd.isna(row[label_column_name]):
            continue
        print(row.text)
        label = process_input(input('\t'))
        if label is not None:
            df.loc[index, label_column_name] = label
        df.to_json(annotated_df_fn)

    print('Finished!\nSaved results as', annotated_df_fn, '\n')

## Annotation Task

Please refer to the annotation guide file for annotation examples. If something is not clear, feel free to ask.  

IMPORTANT:   
- For the group looking at the F-score, we are only interested in the presence or absence of a location within the tweet. The context in which the location is mentioned is not important.
- For the group looking at temporal filtering, we are only interested in locations that refer to a place likely to be the user's current location.

The label is `1` if the tweet contains a location (F-score task) or refers to the user's current location (temporal filtering task), and `0` otherwise.  

For more information see the annotation guide.
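
For the F-score group, the binary labels serve as the reference that an extractor's output is compared against. The following is a minimal sketch of such a comparison, not the evaluation code used in this project. It assumes that a tweet counts as a predicted location when any of the `GPE`, `FAC`, or `LOC` fields is non-empty; the four rows, their labels, and the function names are invented for illustration.

```python
def has_predicted_location(row, entity_columns=('GPE', 'FAC', 'LOC')):
    """Assumption: any non-empty entity field means 'location predicted'."""
    return any(str(row.get(col, '')).strip() for col in entity_columns)


def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 for binary labels (1 = has location)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented rows; 'label' stands in for an annotator's label column.
# Labels are strings because the annotation helper stores raw keyboard input.
rows = [
    {'GPE': 'Pacifica, CA', 'FAC': '', 'LOC': '', 'label': '1'},  # true positive
    {'GPE': '', 'FAC': '', 'LOC': '', 'label': '1'},              # missed location
    {'GPE': '', 'FAC': '', 'LOC': '', 'label': '0'},              # true negative
    {'GPE': 'San Jose', 'FAC': '', 'LOC': '', 'label': '0'},      # false positive
]
gold = [int(r['label']) for r in rows]
predicted = [int(has_predicted_location(r)) for r in rows]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

Whether `ORG` entities should also count as locations is a project decision; the sketch leaves them out.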

#### To start annotating, run the cell below.
Press q to pause the annotation (the red error is intended behaviour).  
Press p to show your progress.  
Press h to see all possible commands.

In [None]:
# Annotate each dataset in turn; progress is saved after every label.
for fn in dataset_file_names:
    annotate_tweet_df(fn)

ss_california_tweets_annotatedHelen_Mudiwa_fscore.json already exists, continuing previous annotation process
Labels from Helen_Mudiwa_fscore already in data!
Continuing annotation, 16 of 1902 already annotated
Possible Commands
('0', '1')	possible labels
h	show this help
p	show progress
q	quit

Starting annotation for 1886 tweets:
Right ankle hurts


In [22]:
df

Unnamed: 0.1,Unnamed: 0,text,place,long,lat,clean_text,GPE,FAC,ORG,LOC
0,0,"I'm at My Home Gym in Pacifica, CA https://t....",Pacifica; CA,-122.500464,37.593650,"Im at My Home Gym in Pacifica, CA","Pacifica, CA",,,
1,1,_styledbym.e killed it with this #shadowroot #...,,-121.989751,38.355840,styledbym.e killed it with this shadowroot col...,,,,
2,2,Primigi Classic loafers for your boy or girl. ...,"East Oakdale, CA",-120.829960,37.774330,Primigi Classic loafers for your boy or girl. ...,,,Primigi,
3,3,Warriors single game tickets go on sale at 10...,San Jose; CA,-121.891766,37.332484,Warriors single game tickets go on sale at 10...,,,,
4,4,I'm at Hardly Strictly Bluegrass in San Franc...,San Francisco; CA,-122.489542,37.771727,Im at Hardly Strictly Bluegrass in San Franci...,"San Francisco, CA",,,
...,...,...,...,...,...,...,...,...,...,...
1897,1995,how do i change the font on my phone 😐,6fafb06c49df870f,-122.264870,37.998157,how do i change the font on my phone,,,,
1898,1996,@Vanessaah_x I just ran out lol,ee2cfc9feb061a08,-120.845655,37.504062,at Vanessaahx I just ran out lol,,,,
1899,1997,@muhbellsaywhaat loves me. I know she do.,8004d2bebcc13e8c,-122.039433,37.971872,at muhbellsaywhaat loves me. I know she do.,,,,
1900,1998,#ElvistheCorgi #SamtheCorgi #GigitheFrenchie #...,,-122.894090,38.479099,ElvistheCorgi SamtheCorgi GigitheFrenchie dog ...,,,,


In [None]:
df2 = pd.read_json('try_annotatedannotator_01.json')
df2
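
A quick sanity check on the label distribution can also be done on the raw JSON. This sketch assumes the file was written by `DataFrame.to_json` with its default column-oriented layout (`{column: {row_index: value}}`) and that the label column is named `label_` plus the annotator name, as in the helper above. The function `count_labels`, the column name `label_annotator_01`, and the sample data are hypothetical.

```python
import json
from collections import Counter


def count_labels(columns, label_column):
    """Count labels in column-oriented data; None marks tweets not yet annotated."""
    labels = columns.get(label_column, {})
    counts = Counter('unlabelled' if v is None else str(v) for v in labels.values())
    return dict(counts)


# Invented stand-in for the parsed contents of an annotated file.
sample = json.loads(
    '{"text": {"0": "a", "1": "b", "2": "c"}, '
    '"label_annotator_01": {"0": "1", "1": "0", "2": null}}'
)
summary = count_labels(sample, 'label_annotator_01')
print(summary)
```

Tweets that have not been annotated yet appear as `null` in the JSON, which is why they are counted separately here.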