# Setup

In [1]:
import os
import json

import pandas as pd
import numpy as np

In [2]:
DATA_PATH = r"data/Coronavirus tweets"
SAMPLE_PATH = r"sample_data/sample_text_classification_data.txt"

DATA_PATH_2 = r"data/Stanford question answering"
SAMPLE_PATH_2 = r"sample_data/sample_qa.json"

# Sentiment Analysis Text Labeling

Let's check the [Coronavirus Tweets dataset](https://www.kaggle.com/datatattle/covid-19-nlp-text-classification) used for sentiment analysis.

In [3]:
train_df = pd.read_csv(os.path.join(DATA_PATH, "Corona_NLP_train.csv"))

In [4]:
train_df.head(10)

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative
5,3804,48756,"ÜT: 36.319708,-82.363649",16-03-2020,As news of the regions first confirmed COVID-...,Positive
6,3805,48757,"35.926541,-78.753267",16-03-2020,Cashier at grocery store was sharing his insig...,Positive
7,3806,48758,Austria,16-03-2020,Was at the supermarket today. Didn't buy toile...,Neutral
8,3807,48759,"Atlanta, GA USA",16-03-2020,Due to COVID-19 our retail store and classroom...,Positive
9,3808,48760,"BHAVNAGAR,GUJRAT",16-03-2020,"For corona prevention,we should stop to buy th...",Negative


## Preparing sample data

Suppose we have a CSV file with only a text column.

In [5]:
# take the first six tweets (labels 0 to 5, inclusive with .loc) as a demo sample
sample_df = train_df.loc[:5, 'OriginalTweet'].copy()
sample_df

0    @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...
1    advice Talk to your neighbours family to excha...
2    Coronavirus Australia: Woolworths to give elde...
3    My food stock is not the only one which is emp...
4    Me, ready to go at supermarket during the #COV...
5    As news of the regions first confirmed COVID-...
Name: OriginalTweet, dtype: object

In [6]:
sample_df[3]

"My food stock is not the only one which is empty...\r\r\n\r\r\nPLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. \r\r\nStay calm, stay safe.\r\r\n\r\r\n#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j"

In [7]:
# all newline characters must be removed because Label Studio treats each line of a text file as one data record
sample_df.replace("(\r|\n)", "", regex=True)[3]

"My food stock is not the only one which is empty...PLEASE, don't panic, THERE WILL BE ENOUGH FOOD FOR EVERYONE if you do not take more than you need. Stay calm, stay safe.#COVID19france #COVID_19 #COVID19 #coronavirus #confinement #Confinementotal #ConfinementGeneral https://t.co/zrlG0Z520j"

In [8]:
sample_df.replace("(\r|\n)", "", regex=True, inplace=True)
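The cells above delete line breaks outright, which fuses neighbouring words (for example `safe.#COVID19france` in the output). A minimal sketch of an alternative that replaces each run of line breaks with a single space is shown here; `flatten_newlines` is a hypothetical helper name, not part of pandas or Label Studio.

```python
import re

def flatten_newlines(text: str) -> str:
    """Replace each run of CR/LF characters (and adjacent spaces/tabs) with one space."""
    return re.sub(r"[ \t]*[\r\n]+[ \t]*", " ", text).strip()

tweet = "Stay calm, stay safe.\r\r\n\r\r\n#COVID19 #coronavirus"
print(flatten_newlines(tweet))
# Stay calm, stay safe. #COVID19 #coronavirus
```

On a pandas Series, the same helper could be applied with `sample_df.map(flatten_newlines)`.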

In [9]:
# save the data as a TXT file so that Label Studio imports it as plain text, one record per line
sample_df.to_csv(SAMPLE_PATH, header=False, index=False, sep='\n')
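Because `to_csv` goes through a CSV writer, a tweet that contains a double quote may be wrapped in extra quotes in the output file. A minimal sketch of writing one record per line with plain file I/O follows; `write_one_record_per_line` is a hypothetical helper, and the two example strings are made up for illustration.

```python
from pathlib import Path
import tempfile

def write_one_record_per_line(texts, path):
    """Write each text on its own line; refuse texts that still contain line breaks."""
    lines = []
    for text in texts:
        if "\r" in text or "\n" in text:
            raise ValueError(f"strip line breaks before writing: {text[:30]!r}")
        lines.append(text)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")

# Demo on a temporary file so nothing in the project folder is touched.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "sample.txt"
    write_one_record_per_line(['He said "stay safe"', "Second tweet"], out)
    written = out.read_text(encoding="utf-8")

print(written)
```

In this notebook the call would look like `write_one_record_per_line(sample_df.tolist(), SAMPLE_PATH)`.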

Then upload this text file into Label Studio and choose the "Text Classification" template, and you are good to go.

Once labeling is done, you can export the results in CSV format, which pairs each text with its assigned label.
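As a rough sketch of what can be done with such an export, the snippet below counts labels per class. The column names `text` and `sentiment` are assumptions that depend on the labeling configuration of the project, and the rows are made-up stand-ins for a real export file.

```python
import csv
import io
from collections import Counter

# Hypothetical export content: column names and rows are illustrative only.
exported = io.StringIO(
    "text,sentiment\n"
    '"Was at the supermarket today.",Neutral\n'
    '"Stay calm, stay safe.",Positive\n'
    '"Shelves are empty again.",Negative\n'
    '"Thank you, cashiers!",Positive\n'
)

rows = list(csv.DictReader(exported))
label_counts = Counter(row["sentiment"] for row in rows)
print(label_counts.most_common())
```

With a real export, `io.StringIO(...)` would be replaced by an opened file, and the column names should be checked against the header row first.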

# Question Answering Text Labeling

Let's check how the [Stanford Question Answering dataset](https://www.kaggle.com/stanfordu/stanford-question-answering-dataset) is structured. It is a good reference for how question answering text data is generally organized.

In [10]:
TRAIN_PATH = os.path.join(DATA_PATH_2, "train-v1.1.json")
with open(TRAIN_PATH) as f:
    train_json = json.load(f)
print(train_json.keys())

dict_keys(['data', 'version'])


NOTE: Be careful not to print the entire JSON file, as it is very large and may crash your Jupyter notebook.
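One way to explore a large nested structure without printing it whole is to describe only its shape. The sketch below is a hypothetical helper (`summarize` is not a library function) that walks a few levels deep and inspects only the first element of each list; the toy input reuses the key names seen in this dataset.

```python
def summarize(obj, depth=2, indent=0):
    """Return lines describing the shape of nested JSON data, `depth` levels deep."""
    pad = " " * indent
    if isinstance(obj, dict):
        lines = [f"{pad}dict with keys {list(obj)}"]
        if depth > 0:
            for key, value in obj.items():
                lines.append(f"{pad}  {key}:")
                lines.extend(summarize(value, depth - 1, indent + 4))
        return lines
    if isinstance(obj, list):
        lines = [f"{pad}list of {len(obj)} items"]
        if depth > 0 and obj:
            # Only the first element is inspected, so the output stays small.
            lines.extend(summarize(obj[0], depth - 1, indent + 2))
        return lines
    # Scalars are shown truncated to 60 characters.
    return [f"{pad}{type(obj).__name__}: {repr(obj)[:60]}"]

toy = {"version": "1.1", "data": [{"title": "A", "paragraphs": []}]}
print("\n".join(summarize(toy)))
```

Calling `summarize(train_json, depth=3)` would give a similar overview of the real file.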

In [11]:
type(train_json['data'])

list

In [12]:
len(train_json['data'])

442

In [13]:
train_json['data'][0].keys()

dict_keys(['title', 'paragraphs'])

In [14]:
train_json['data'][0]['title']

'University_of_Notre_Dame'

In [15]:
len(train_json['data'][0]['paragraphs'])

55

In [16]:
train_json['data'][0]['paragraphs'][0].keys()

dict_keys(['context', 'qas'])

In [17]:
train_json['data'][0]['paragraphs'][0]

{'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'qas': [{'answers': [{'answer_start': 515,
     'text': 'Saint Bernadette Soubirous'}],
   'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
   'id': '5733be284776f41900661182'},
  {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ

In [18]:
train_json['data'][0]['paragraphs'][0]['context']

'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.'

In [19]:
# NOTE: there can be many questions for one context
train_json['data'][0]['paragraphs'][0]['qas']

[{'answers': [{'answer_start': 515, 'text': 'Saint Bernadette Soubirous'}],
  'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
  'id': '5733be284776f41900661182'},
 {'answers': [{'answer_start': 188, 'text': 'a copper statue of Christ'}],
  'question': 'What is in front of the Notre Dame Main Building?',
  'id': '5733be284776f4190066117f'},
 {'answers': [{'answer_start': 279, 'text': 'the Main Building'}],
  'question': 'The Basilica of the Sacred heart at Notre Dame is beside to which structure?',
  'id': '5733be284776f41900661180'},
 {'answers': [{'answer_start': 381,
    'text': 'a Marian place of prayer and reflection'}],
  'question': 'What is the Grotto at Notre Dame?',
  'id': '5733be284776f41900661181'},
 {'answers': [{'answer_start': 92,
    'text': 'a golden statue of the Virgin Mary'}],
  'question': 'What sits on top of the Main Building at Notre Dame?',
  'id': '5733be284776f4190066117e'}]
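In this structure, `answer_start` is a character offset into `context`, so the end of the answer is `answer_start + len(text)`. A small sketch for checking that an answer span really matches its context is given below; `answer_span` and `span_matches` are hypothetical helper names, and the short context is a single sentence taken from the paragraph above.

```python
def answer_span(answer):
    """Return (start, end) character offsets of an answer; end is exclusive."""
    start = answer["answer_start"]
    return start, start + len(answer["text"])

def span_matches(context, answer):
    """Check that slicing the context at the stored offsets yields the answer text."""
    start, end = answer_span(answer)
    return context[start:end] == answer["text"]

context = "Next to the Main Building is the Basilica of the Sacred Heart."
text = "the Main Building"
# The offset is computed here rather than typed by hand.
answer = {"answer_start": context.index(text), "text": text}

print(answer_span(answer), span_matches(context, answer))
```

Running `span_matches` over every question in a dataset is a cheap sanity check before training.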

## Create sample data for demo

We want the output JSON to be in this format for Label Studio to accept:
```
[
  {
    "data": {
      "text": ...,
      "question": ...
    }
  },
  {
    "data": {
      "text": ...,
      "question": ...
    }
  }
]
```

Each record needs a "data" key holding exactly one context (Label Studio uses the "text" key for the context) and one question.
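The format described above can be checked before uploading. The following is a minimal sketch, assuming only the "data", "text" and "question" keys listed here; `validate_tasks` is a hypothetical helper, not a Label Studio function.

```python
def validate_tasks(tasks):
    """Return a list of problems found; an empty list means the format looks right."""
    if not isinstance(tasks, list):
        return ["top level must be a list of records"]
    problems = []
    for i, task in enumerate(tasks):
        data = task.get("data") if isinstance(task, dict) else None
        if not isinstance(data, dict):
            problems.append(f"record {i}: missing 'data' object")
            continue
        for key in ("text", "question"):
            value = data.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"record {i}: 'data.{key}' must be a non-empty string")
    return problems

good = [{"data": {"text": "some context", "question": "some question?"}}]
bad = [{"data": {"text": "some context"}}]
print(validate_tasks(good), validate_tasks(bad))
```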

In [20]:
def get_questions(train_json, n_samples=5):
    # collected outside the paragraph loop so records persist across paragraphs
    data_records = []

    # use the article at index 50 for an easier context
    for para in train_json['data'][50]['paragraphs']:
        context = para['context']

        for qas in para['qas']:
            question = qas['question']

            ## let's pretend we don't have answers for now, and we want to label them
            # answer_dict = qas['answers'][0]
            # answer_text = answer_dict['text']
            # answer_start = answer_dict['answer_start']
            # answer_end = answer_start + len(answer_text)

            data_records.append(
                {
                    "data":
                    {
                        'text': context,
                        'question': question,
                    }
                }
            )

            if len(data_records) == n_samples:
                # a handful of samples is enough for the demo
                return pd.DataFrame(data_records)

    # fewer than n_samples questions were available; return what was collected
    return pd.DataFrame(data_records)
        
sample_df = get_questions(train_json)
sample_df

Unnamed: 0,data
0,{'text': 'Sony Music Entertainment Inc. (somet...
1,{'text': 'Sony Music Entertainment Inc. (somet...
2,{'text': 'Sony Music Entertainment Inc. (somet...
3,{'text': 'Sony Music Entertainment Inc. (somet...
4,{'text': 'Sony Music Entertainment Inc. (somet...
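The sampling logic above can also be expressed as a generator combined with `itertools.islice`, which keeps the "one context, one question" pairing separate from the "take the first n" step. This is only a sketch with made-up toy paragraphs; `iter_tasks` is a hypothetical name.

```python
from itertools import islice

def iter_tasks(paragraphs):
    """Yield one Label Studio style record per (context, question) pair."""
    for para in paragraphs:
        for qa in para["qas"]:
            yield {"data": {"text": para["context"], "question": qa["question"]}}

# Toy paragraphs mirroring the 'context' / 'qas' keys of the dataset.
paragraphs = [
    {"context": "c1", "qas": [{"question": "q1"}, {"question": "q2"}]},
    {"context": "c2", "qas": [{"question": "q3"}]},
]

# Take the first three records, even though they span two paragraphs.
tasks = list(islice(iter_tasks(paragraphs), 3))
print(tasks)
```

If fewer records exist than requested, `islice` simply stops early instead of returning nothing.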


In [21]:
result = sample_df.to_json(orient="records")
parsed = json.loads(result)
print(json.dumps(parsed, indent=2))

[
  {
    "data": {
      "text": "Sony Music Entertainment Inc. (sometimes known as Sony Music or by the initials, SME) is an American music corporation managed and operated by Sony Corporation of America (SCA), a subsidiary of Japanese conglomerate Sony Corporation. In 1929, the enterprise was first founded as American Record Corporation (ARC) and, in 1938, was renamed Columbia Recording Corporation, following ARC's acquisition by CBS. In 1966, the company was reorganized to become CBS Records. In 1987, Sony Corporation of Japan bought the company, and in 1991, renamed it SME. It is the world's second largest recorded music company, after Universal Music Group.",
      "question": "What was the first name of Sony Music Entertainment, Inc?"
    }
  },
  {
    "data": {
      "text": "Sony Music Entertainment Inc. (sometimes known as Sony Music or by the initials, SME) is an American music corporation managed and operated by Sony Corporation of America (SCA), a subsidiary of Japanese c

In [22]:
with open(SAMPLE_PATH_2, "w") as f:
    json.dump(parsed, f)

In [23]:
# checking the original answers for faster labels in the demo
for qas in train_json['data'][50]['paragraphs'][0]['qas']:
    print(qas)

{'answers': [{'answer_start': 279, 'text': 'American Record Corporation'}], 'question': 'What was the first name of Sony Music Entertainment, Inc?', 'id': '56df11de3277331400b4d933'}
{'answers': [{'answer_start': 321, 'text': '1938'}], 'question': 'In what year was it renamed Columbia Recording Corporation?', 'id': '56df11de3277331400b4d934'}
{'answers': [{'answer_start': 410, 'text': '1966'}], 'question': 'In what year was it known as CBS Records?', 'id': '56df11de3277331400b4d935'}
{'answers': [{'answer_start': 470, 'text': '1987'}], 'question': 'In what year did it land the name, Sony Music Entertainment?', 'id': '56df11de3277331400b4d936'}
{'answers': [{'answer_start': 614, 'text': 'Universal Music Group.'}], 'question': 'What company is the only group larger than Sony Music Entertainment?', 'id': '56df11de3277331400b4d937'}


## Creating DataFrame from JSON

Once labeling is done, load the exported JSON here for preprocessing.

In [24]:
# use a context manager so the file handle is closed after loading
with open("sample_data/sample_qa_output.json") as f:
    sample_label = json.load(f)

Pro tip: You can open the JSON file directly in JupyterLab to see it nicely formatted.

In [25]:
# try to get a sample record from the JSON output
first_data = sample_label[0]
sample_text = first_data['data']['text']
sample_question = first_data['data']['question']
sample_answer = first_data['annotations'][0]['result'][0]['value']['text']

print(f"Text:\n{sample_text}\n")
print(f"Question: {sample_question}\n")
print(f"Answer: {sample_answer}")

Text:
Sony Music Entertainment Inc. (sometimes known as Sony Music or by the initials, SME) is an American music corporation managed and operated by Sony Corporation of America (SCA), a subsidiary of Japanese conglomerate Sony Corporation. In 1929, the enterprise was first founded as American Record Corporation (ARC) and, in 1938, was renamed Columbia Recording Corporation, following ARC's acquisition by CBS. In 1966, the company was reorganized to become CBS Records. In 1987, Sony Corporation of Japan bought the company, and in 1991, renamed it SME. It is the world's second largest recorded music company, after Universal Music Group.

Question: What was the first name of Sony Music Entertainment, Inc?

Answer: American Record Corporation


In [26]:
def get_qas(json_data):
    qas_list = []
    for label in json_data:
        text = label['data']['text']
        question = label['data']['question']
        answer = label['annotations'][0]['result'][0]['value']['text']
        
        qas_list.append({
            'context': text,
            'question': question,
            'answer': answer,
        })
    return pd.DataFrame(qas_list)

sample_output = get_qas(sample_label)
sample_output

Unnamed: 0,context,question,answer
0,Sony Music Entertainment Inc. (sometimes known...,What was the first name of Sony Music Entertai...,American Record Corporation
1,Sony Music Entertainment Inc. (sometimes known...,In what year was it renamed Columbia Recording...,1938
2,Sony Music Entertainment Inc. (sometimes known...,In what year was it known as CBS Records?,1966
3,Sony Music Entertainment Inc. (sometimes known...,"In what year did it land the name, Sony Music ...",1987
4,Sony Music Entertainment Inc. (sometimes known...,What company is the only group larger than Son...,Universal Music Group
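`get_qas` indexes straight into `annotations[0]['result'][0]`, so it would raise an error for any task that was skipped or left unlabeled. A more defensive sketch, assuming the same nesting as the export loaded above, is shown here; `extract_answer` is a hypothetical helper and the two example tasks are trimmed-down stand-ins for real export records.

```python
def extract_answer(task):
    """Return the first labeled answer text of a task, or None when it is unlabeled."""
    for annotation in task.get("annotations", []):
        for result in annotation.get("result", []):
            text = result.get("value", {}).get("text")
            if text:
                return text
    return None

labeled = {
    "data": {"text": "ctx", "question": "q"},
    "annotations": [{"result": [{"value": {"text": "American Record Corporation"}}]}],
}
skipped = {"data": {"text": "ctx", "question": "q"}, "annotations": [{"result": []}]}

print(extract_answer(labeled), extract_answer(skipped))
```

Tasks for which the helper returns `None` can then be filtered out before building the DataFrame.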


The demo dataset is now ready and can easily be used for training.