# Annotation + explanation + code

## Relevant links

### Final annotation

#### [The final annotated corpus in .tsv format](https://github.ubc.ca/mds-cl-2021-22/523_group_9/blob/master/milestone_3/final_annotated_corpus.tsv)

The final annotations, linked above, are stored as a tab-separated (.tsv) file with 4 columns and 500 examples. Two of the four columns hold annotations. Each column is described below:
- `Title`: The title or the headline of the news article.
- `Text`: The body of the news article.
- `central_entity (ANNOTATED)`: The central or the main identified entity in the news article.
- `type (ANNOTATED)`: The type of the central entity, one of `FILM`, `FILM-CREW`, `PERFORMER` or `OTHERS`.
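
The column description above can be turned into a quick sanity check. The following is a minimal sketch, assuming the file's actual headers are `Title`, `Text`, `central_entity` and `type` (as in the dataframe output further down); `check_schema` and the two-row `sample_tsv` are hypothetical stand-ins, not part of the project.

```python
import io
import pandas as pd

# Label set and column names taken from the column description (assumed headers).
ALLOWED_TYPES = {"FILM", "FILM-CREW", "PERFORMER", "OTHERS"}
EXPECTED_COLUMNS = ["Title", "Text", "central_entity", "type"]

# Hypothetical two-row stand-in for final_annotated_corpus.tsv.
sample_tsv = (
    "Title\tText\tcentral_entity\ttype\n"
    "Lapsis review\tThis sensitive but flawed sci-fi...\tLapsis\tFILM\n"
    "Golden Globes boycott\tFilm industry pressure...\tGolden Globes\tOTHERS\n"
)


def check_schema(df):
    """Return a list of problems; an empty list means the frame looks valid."""
    problems = []
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {list(df.columns)}")
    else:
        unknown = set(df["type"].dropna()) - ALLOWED_TYPES
        if unknown:
            problems.append(f"unknown type labels: {sorted(unknown)}")
    return problems


df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
problems = check_schema(df)
print(problems)
```

To run the same check on the real file, `io.StringIO(sample_tsv)` would be replaced by the path to the downloaded .tsv file.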

### Intermediary files

#### - Two CSV files containing separate annotations by [Anshul Kaushal](https://github.ubc.ca/mds-cl-2021-22/523_group_9/blob/master/milestone_3/intermediary_files/anshul_wei_annotated_anshul.csv) and [Wei Dong](https://github.ubc.ca/mds-cl-2021-22/523_group_9/blob/master/milestone_3/intermediary_files/anshul_wei_annotated_wei.csv) for the first 165 examples of the corpus.
#### - Two CSV files containing separate annotations by [Anshul Kaushal](https://github.ubc.ca/mds-cl-2021-22/523_group_9/blob/master/milestone_3/intermediary_files/anshul_chao_annotated_anshul.csv) and [Chao Ding](https://github.ubc.ca/mds-cl-2021-22/523_group_9/blob/master/milestone_3/intermediary_files/anshul_chao_annotated_chao.csv) for the next 170 examples of the corpus.
#### - Two CSV files containing separate annotations by [Chao Ding](https://github.ubc.ca/mds-cl-2021-22/523_group_9/blob/master/milestone_3/intermediary_files/chao_wei_annotated_chao.csv) and [Wei Dong](https://github.ubc.ca/mds-cl-2021-22/523_group_9/blob/master/milestone_3/intermediary_files/chao_wei_annotated_wei.csv) for the last 165 examples of the corpus.

## Preparation of data 

#### - Initially the corpus was downloaded from [the git repository of the group](https://github.ubc.ca/mds-cl-2021-22/523_group_9/blob/master/milestone_2/corpus.tsv).

In [1]:
import pandas as pd

In [2]:
corpus = pd.read_csv('https://raw.github.ubc.ca/mds-cl-2021-22/523_group_9/master/milestone_2/corpus.tsv?token=AAAASZH2BS3F6YSEDGORQ5TCGVYTI', sep='\t', encoding='utf-8')

#### - Further, it was divided into three segments containing 165, 170 and 165 examples respectively and converted into csv files.

In [3]:
anshul_wei_df = corpus[:165]
anshul_chao_df = corpus[165:335]
chao_wei_df = corpus[335:500]

In [4]:
anshul_wei_df.to_csv('anshul_wei.csv', encoding='utf-8', index=False)
anshul_chao_df.to_csv('anshul_chao.csv', encoding='utf-8', index=False)
chao_wei_df.to_csv('chao_wei.csv', encoding='utf-8', index=False)
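
A short check can confirm that the three slices cover the corpus without gaps or overlaps. This is a sketch only: the 500-row `corpus` below is a hypothetical stand-in, and the `bounds` dictionary simply restates the slice boundaries used above.

```python
import pandas as pd

# Hypothetical stand-in for the 500-row corpus.
corpus = pd.DataFrame({"Title": [f"title {i}" for i in range(500)]})

# Same slice boundaries as in the cells above.
bounds = {"anshul_wei": (0, 165), "anshul_chao": (165, 335), "chao_wei": (335, 500)}
segments = {name: corpus.iloc[start:stop] for name, (start, stop) in bounds.items()}

sizes = {name: len(segment) for name, segment in segments.items()}
print(sizes)

# Every row should land in exactly one segment.
rejoined = pd.concat(list(segments.values()))
assert len(rejoined) == len(corpus)
assert rejoined.index.is_unique
```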

#### - Lastly, each file was assigned to two team members to be annotated separately.

## Converting raw annotations into final annotations

#### - All 6 csv files, containing the separate annotations made by each team member on their 2 assigned files, were collected.

#### - The files were converted into Pandas dataframes and cleaned to ensure uniformity in the naming of columns across all the six dataframes.

In [5]:
root_path = '/Users/anshulkaushal/desktop/annotation/'
anshul_165 = pd.read_csv(root_path+'anshul_wei_annotated_anshul.csv', encoding='utf-8')
wei_165 = pd.read_csv(root_path+'anshul_wei_annotated_wei.csv', encoding='utf-8')
anshul_165_335 = pd.read_csv(root_path+'anshul_chao_annotated_anshul.csv', encoding='utf-8')
chao_165_335 = pd.read_csv(root_path+'anshul_chao_annotated_chao.csv', encoding='utf-8')
chao_335_500 = pd.read_csv(root_path+'chao_wei_annotated_chao.csv', encoding='utf-8')
wei_335_500 = pd.read_csv(root_path+'chao_wei_annotated_wei.csv', encoding='utf-8')

In [6]:
df_list = [anshul_165, wei_165, anshul_165_335, chao_165_335, chao_335_500, wei_335_500]

In [7]:
for df in df_list:
    if 'Unnamed: 0' in df.columns:
        df.drop(columns=['Unnamed: 0'], inplace=True)
    if 'Unnamed: 3' in df.columns:
        df.rename(columns={'Unnamed: 3': 'central_entity', 'Unnamed: 4': 'type'}, inplace=True)
    if '    ' in df.columns:
        df.drop(columns=['    '], inplace=True)
    title = 'Title                                                                                         '
    if title in df.columns:
        df.rename(columns={title:'Title'}, inplace=True)
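
The padded `Title` header and the blank column handled above could also be normalized generically rather than by matching exact strings. The sketch below shows one possible approach; `strip_headers` is a hypothetical helper, and the `Unnamed: N` columns would still need the explicit handling shown above because some of them carry annotations.

```python
import pandas as pd


def strip_headers(df):
    """Return a copy with whitespace stripped from column names
    and columns whose name is blank after stripping removed."""
    out = df.copy()
    out.columns = [str(col).strip() for col in out.columns]
    keep = [col for col in out.columns if col != ""]
    return out[keep]


# Hypothetical frame reproducing the header problems seen in the annotated files.
messy = pd.DataFrame({
    "Title          ": ["Lapsis review"],
    "Text": ["This sensitive but flawed..."],
    "    ": [None],
    "central_entity": ["Lapsis"],
    "type": ["FILM"],
})

clean = strip_headers(messy)
print(list(clean.columns))
```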

#### - For each pair of dataframes (the two separate annotations of one of the three original files, 6 dataframes in total), conflicts between the annotators were identified and resolved manually.

In [8]:
def get_conflicts(df_1, df_2):
    conflict_list = []
    col_entity_1 = df_1['central_entity'].to_list()
    col_entity_2 = df_2['central_entity'].to_list()
    col_type_1 = df_1['type'].to_list()
    col_type_2 = df_2['type'].to_list()
    for i, item in enumerate(col_entity_1):
        if item != col_entity_2[i] or col_type_1[i] != col_type_2[i]:
            conflict_list.append(i)
    return conflict_list

In [9]:
anshul_wei_165 = get_conflicts(anshul_165, wei_165)

In [10]:
anshul_chao_335 = get_conflicts(anshul_165_335, chao_165_335)

In [11]:
chao_wei_500 = get_conflicts(chao_335_500, wei_335_500)
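
One caveat of comparing cells with `!=` is that two missing values compare as unequal, so a row that both annotators left empty would be reported as a conflict. The sketch below is an alternative, vectorized comparison that treats two missing values as agreement; `get_conflicts_vectorized` and the sample frames are hypothetical and are not the function used above.

```python
import numpy as np
import pandas as pd


def get_conflicts_vectorized(df_1, df_2, columns=("central_entity", "type")):
    """Return positional row indices where the annotators disagree on any column.

    Two missing values in the same cell count as agreement.
    """
    disagree = np.zeros(len(df_1), dtype=bool)
    for col in columns:
        a = df_1[col].reset_index(drop=True)
        b = df_2[col].reset_index(drop=True)
        both_missing = a.isna() & b.isna()
        disagree |= ((a != b) & ~both_missing).to_numpy()
    return np.flatnonzero(disagree).tolist()


# Hypothetical annotations from two annotators on the same four articles.
annotator_1 = pd.DataFrame({
    "central_entity": ["Lapsis", "Oscars", np.nan, "Cannes film festival"],
    "type": ["FILM", "OTHERS", "OTHERS", "OTHERS"],
})
annotator_2 = pd.DataFrame({
    "central_entity": ["Lapsis", "Golden Globes", np.nan, "Cannes film festival"],
    "type": ["FILM", "OTHERS", "OTHERS", "FILM"],
})

conflicts = get_conflicts_vectorized(annotator_1, annotator_2)
print(conflicts)
```

Row 2 is not reported even though both `central_entity` cells are missing, whereas a plain `!=` comparison would flag it.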

In [12]:
def see_conflicts(conflict_list, df_1, df_2):
    for conflict in conflict_list:
        print(conflict)
        print()
        print(df_1.iloc[conflict]['Title'])
        print()
        print(df_1.iloc[conflict]['Text'])
        print()
        print(df_1.iloc[conflict]['central_entity'], df_1.iloc[conflict]['type'])
        print(df_2.iloc[conflict]['central_entity'], df_2.iloc[conflict]['type'])
        print()

In [13]:
see_conflicts(anshul_wei_165, anshul_165, wei_165)

14

'The most honest person I ever met': Chadwick Boseman's widow pays tribute at Gotham film awards

The first significant awards event of the current film cycle spread its net wide, but the main attention of the 2021 Gotham awards, designed to reward independent film-makers, was focused on a tearful speech given by Simone Ledward as she accepted an award on behalf of her late husband Chadwick Boseman. Boseman had been announced as a recipient of the Gothams’ annual tribute award, along with Viola Davis, Steve McQueen and Ryan Murphy. Ledward, whose marriage to Boseman was made public only after the actor’s death, said in her speech: “He was the most honest person I ever met … He was blessed to live many lives within his concentrated one. He harnessed the power of letting go and letting God’s love shine through. May we not let his conviction be in vain. It is my honour on behalf of my husband.” She added: “Chad … thank you. I love you. I am so proud of you. Keep shining your light on 

In [14]:
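# Assign with .loc[row, column] in one step: chained indexing (.iloc[row][column])
# may write to a temporary copy and leave the dataframe unchanged.
anshul_165.loc[37, 'central_entity'] = 'Cannes film festival'
anshul_165.loc[50, 'central_entity'] = 'Cannes film festival'
anshul_165.loc[151, 'central_entity'] = "Time's Up"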

In [15]:
see_conflicts(anshul_chao_335, anshul_165_335, chao_165_335)

13

Film industry celebrities boycott crisis-hit Golden Globes 

Film industry pressure on the Hollywood Foreign Press Association (HFPA), the body that organises the Golden Globe awards, has increased after more than 100 public relations firms sent a letter telling the organisation they had withdrawn all their celebrity clients from activities with the HFPA until it made “profound and lasting change” to correct what it described as the HFPA’s “longstanding exclusionary ethos … discriminatory behavior [and] ethical impropriety”. According to Variety, the letter, signed by high-profile PR outfits including DDA, Premier, 42West and Rogers &amp; Cowan/PMK, was delivered to the HFPA on Monday, following continuing criticism of the crisis-plagued organisation. The letter reads: “We cannot advocate for our clients to participate in HFPA events or interviews as we await your explicit plans and timeline for transformational change.” “Anything less than transparent, meaningful change that respe

In [16]:
# Single-step .loc assignment (row labels equal positions with the default index).
anshul_165_335.loc[13, 'central_entity'] = 'Golden Globes'
anshul_165_335.loc[93, 'central_entity'] = 'Oscars'
anshul_165_335.loc[112, 'central_entity'] = 'Golden Globes'
anshul_165_335.loc[118, 'central_entity'] = 'Golden Globes'
anshul_165_335.loc[121, 'central_entity'] = 'Oscars'
anshul_165_335.loc[158, 'type'] = 'FILM-CREW'

In [17]:
see_conflicts(chao_wei_500, wei_335_500, chao_335_500)

43

Peter Rabbit 2 tops box office as UK’s reopened cinemas take £2m in three days                

A strong summer at the UK cinema looks like an increasingly realistic prospect following three impressive days at the box office. Cinemas were permitted to open their doors at 50% capacity on Monday, and film-lovers eager for a fix – or appalled by the weather – showed little hesitancy following seven months of smaller screens. Wednesday’s total was estimated to be around £760,000, up 41% from Tuesday, in part because of 120 Cineworld sites reopening that day, having remained closed on Monday and Tuesday. The three-day total is around £2.8m. Topping the chart is Peter Rabbit 2, the sequel to the 2018 hit, whose release has been long delayed because of the pandemic. Other healthy performers include Nomadland and Godzilla vs Kong, with audiences opting to see them on the big screen despite their availability on streaming platforms. Spiral: From the Book of Saw and The Unholy are also in th

In [18]:
# Single-step .loc assignment (row labels equal positions with the default index).
chao_335_500.loc[43, 'central_entity'] = 'UK Cinemas'
chao_335_500.loc[83, 'type'] = 'PERFORMER'
chao_335_500.loc[97, 'central_entity'] = 'Carey Mulligan'
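
Manual resolutions like the ones above can also be recorded as data and applied in one pass, which keeps a reviewable log of every decision. The following is a sketch under assumptions: `apply_resolutions`, the miniature `annotations` frame and the values in `resolutions` are all hypothetical. It writes each cell with `.at[row, column]`, a single-step assignment that always modifies the dataframe itself.

```python
import pandas as pd

# Hypothetical miniature annotation frame with a default RangeIndex.
annotations = pd.DataFrame({
    "central_entity": ["Cannes", "HFPA", "Promising Young Woman"],
    "type": ["OTHERS", "OTHERS", "FILM"],
})

# Each resolution is (row label, column, agreed value); values are illustrative.
resolutions = [
    (0, "central_entity", "Cannes film festival"),
    (1, "central_entity", "Golden Globes"),
    (2, "type", "PERFORMER"),
]


def apply_resolutions(df, resolutions):
    """Write each agreed value into df in place; return the number of changed cells."""
    changed = 0
    for row, column, value in resolutions:
        if df.at[row, column] != value:
            df.at[row, column] = value
            changed += 1
    return changed


n_changed = apply_resolutions(annotations, resolutions)
print(n_changed)
print(annotations)
```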

#### - The three resulting dataframes (one per pair, after resolving its conflicts) were concatenated to produce the final annotated corpus.

In [19]:
final_df = pd.concat([anshul_165, anshul_165_335, chao_335_500], ignore_index=True)

In [20]:
final_df

| | Title | Text | central_entity | type |
|---|---|---|---|---|
| 0 | Bloody Nose, Empty Pockets review – bitterswee... | It’s the “last day in paradise” for Las Vegas ... | Bloody Nose, Empty Pockets | FILM |
| 1 | Sylvie's Love review – Tessa Thompson captivat... | Sylvie (Tessa Thompson) has been taught by her... | Sylvie's Love | FILM |
| 2 | Deliver Us From Evil review – frenzied hit-man... | There’s a throb of menace driving this gonzo a... | Deliver Us From Evil | FILM |
| 3 | The Call review – a phoned-in mix of ghouls, g... | This telephonically themed horror film, set in... | The Call | FILM |
| 4 | This Is Not a Burial, It’s a Resurrection revi... | This is an extraordinary and otherworldly feat... | This Is Not a Burial, It’s a Resurrection | FILM |
| ... | ... | ... | ... | ... |
| 495 | Lapsis review – sci-fi satire targets the gig ... | This sensitive but flawed sci-fi comic dystopi... | Lapsis | FILM |
| 496 | The Place of No Words review – a dying father’... | This film is a charming family affair. Directo... | The Place of No Words | FILM |
| 497 | White on White review – a damning snapshot of ... | Colonisation does not come off well in this sp... | White on White | FILM |
| 498 | Embattled review – oddly compelling and nuance... | Part classic montage-showdown sports movie, pa... | Embattled | FILM |


#### - The resultant dataframe was exported as a tsv file. 

In [21]:
final_df.to_csv('final_annotated_corpus.tsv', sep='\t', encoding='utf-8', index=False)
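
Because article bodies are free text, it can be worth confirming that an exported file reads back unchanged. The following is a minimal round-trip sketch, assuming an in-memory buffer in place of the real file; the two-row `final_df` here is a hypothetical stand-in that includes a tab inside one text field.

```python
import io
import pandas as pd

# Hypothetical stand-in for the final dataframe; one Text value contains a tab.
final_df = pd.DataFrame({
    "Title": ["Lapsis review", "Embattled review"],
    "Text": ["This sensitive but flawed\tsci-fi...", "Part classic sports movie, part..."],
    "central_entity": ["Lapsis", "Embattled"],
    "type": ["FILM", "FILM"],
})

# Write to an in-memory buffer with the same separator as the export above.
buffer = io.StringIO()
final_df.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="\t")

# Compare headers and cell values after the round trip.
same_columns = list(reloaded.columns) == list(final_df.columns)
same_values = reloaded.values.tolist() == final_df.values.tolist()
print(same_columns, same_values)
```

The field containing a tab is quoted on export and unquoted on import, so the row keeps its four columns.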

## The annotation process

### Overview

Overall, setting aside a few small challenges, the process was straightforward. Converting the raw corpus into the final annotated corpus was made easier by Python libraries such as pandas. By adhering to the annotation guidelines, each team member was able to annotate the data efficiently, and the resulting inter-annotator agreement scores, detailed in another document, were high. Because the conflicts between annotators were few, they were resolved manually in a short amount of time, as demonstrated above. The workflow was therefore manageable and yielded promising results.
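
The agreement scores themselves are reported in another document, and the metric used there is not stated here. Purely as an illustration, the sketch below computes Cohen's kappa, one common measure of agreement between two annotators, on hypothetical `type` labels; `cohen_kappa` is not taken from the project.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long, non-empty label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical type labels from two annotators on six articles.
annotator_1 = ["FILM", "FILM", "PERFORMER", "OTHERS", "FILM", "OTHERS"]
annotator_2 = ["FILM", "FILM", "PERFORMER", "FILM", "FILM", "OTHERS"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 5 of 6 items, and kappa discounts the agreement expected by chance from their label frequencies.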

### Challenges

#### Not working with Amazon Mechanical Turk

Initially, the team wanted to work with Amazon Mechanical Turk. To evaluate this option, the team carried out three pilot studies, the first two of which were unsuccessful. After the instructions were carefully rewritten, the third pilot study succeeded. Nonetheless, with an annotation task as complicated as this one, there was still room for error. Considering the size of the corpus, the complexity of the annotation task, and the money and time available, the team eventually decided against Mechanical Turk and annotated the corpus itself. Annotating the corpus did cost the team some time, but overall much time was saved and the quality of the annotations was ensured.

#### Annotation-specific challenges

##### central_entity

- In certain instances, it was hard to identify any tangible central entity at all. In such cases, the team decided to annotate the entity as `None`.
- There were instances where more than one entity could be identified as central to the article. However, to keep the annotations uniform, only one of these entities, chosen at random, was annotated.
- There were also instances where the central entity should technically have been an abstract concept or an irrelevant non-living entity. In such cases, the entity was annotated either as `None` or as another entity in the article that seemed more relevant.

##### type

- For some news articles, identifying the type of the entity was hard. In such cases, the team decided to annotate the type as `OTHERS`.
- There were cases where the type of an entity was well known to the team simply because of the entity's popularity. However, if the type was not made clear in the article itself, it was annotated as `OTHERS`.
- Another challenge was annotating entities that were press associations or award shows. The team had underestimated how often such entities would be the central entity of an article. The team felt that another `type` category, such as `MEDIA`, could have been defined for these instances, but given the limited time, it was decided to group these articles under type `OTHERS` for now.
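
The frequency of such cases can be inspected directly from the annotated columns. The sketch below counts labels per `type` and lists the central entities that recur within `OTHERS`, which is one way candidates for a category such as `MEDIA` might be spotted; the `annotated` frame is a hypothetical sample and the threshold of 2 is an arbitrary assumption.

```python
import pandas as pd

# Hypothetical sample of annotated rows.
annotated = pd.DataFrame({
    "central_entity": [
        "Lapsis", "Golden Globes", "Oscars", "Golden Globes",
        "Carey Mulligan", "Oscars", "Golden Globes",
    ],
    "type": ["FILM", "OTHERS", "OTHERS", "OTHERS", "PERFORMER", "OTHERS", "OTHERS"],
})

# Number of articles per type label.
type_counts = annotated["type"].value_counts()

# Central entities that appear at least twice among the OTHERS rows.
others = annotated[annotated["type"] == "OTHERS"]
entity_counts = others["central_entity"].value_counts()
recurring = entity_counts[entity_counts >= 2]

print(type_counts)
print(recurring)
```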