# This notebook demonstrates some core functionality of the nlp package

The aim is to show what the package can do and how it is structured, so that it is easier to use and to develop further.

In [1]:
# Import third-party libraries
import nltk

# Import the corpus handler and cleaning utilities from the nlp package
from nlp.corpus_handlers import SQuADCorpusHandler
from nlp.utils import Cleaners

In [2]:
# download nltk files needed for Cleaners class in utils
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sbrm996\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Define the corpus

First, define the corpus. There is a base class BaseCorpusHandler() designed to outline the core functionality of the CorpusHandler classes.

From this, we can derive specific corpus handlers, which abstract away the complexity of working with several datasets that have very different structures.
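
As a rough illustration of this base-class pattern, here is a minimal sketch. It is not the package's actual code: the names `BaseCorpusHandlerSketch` and `ToyCorpusHandler` are hypothetical, and the real `BaseCorpusHandler` may define a different set of methods. The attribute and method names `data_path`, `_default_data_path`, `load_data` and `get_data` are borrowed from the cells in this notebook.

```python
from abc import ABC, abstractmethod


class BaseCorpusHandlerSketch(ABC):
    """Hypothetical stand-in for a base handler; not the package's real class."""

    # Each subclass sets its own default location for the raw data.
    _default_data_path = None

    def __init__(self):
        self.data_path = self._default_data_path
        self.data = None

    @abstractmethod
    def load_data(self):
        """Read the corpus from self.data_path into self.data."""

    def get_data(self):
        """Return whatever load_data() stored."""
        return self.data


class ToyCorpusHandler(BaseCorpusHandlerSketch):
    """Toy subclass that 'loads' a hard-coded corpus instead of reading a file."""

    _default_data_path = "toy.json"  # hypothetical path, never opened here

    def load_data(self):
        self.data = [{"title": "Example", "paragraphs": []}]


handler = ToyCorpusHandler()
handler.load_data()
print(handler.data_path, len(handler.get_data()))
```

Each dataset-specific subclass only has to implement `load_data`, while callers rely on the shared interface.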

In [3]:
# Initialise corpus handler
corpus = SQuADCorpusHandler()

Having a specific corpus handler lets us do a number of useful things, including holding corpus-specific config, such as default data paths:

In [4]:
print(corpus.data_path)

C:\Projects\ml-reaserch\data\train-v2.0.json


Since these are attributes of the object, it's easy to override them as required:

In [5]:
# Override the data path to load from (raw string, so backslashes are not treated as escapes)
corpus.data_path = r'C:\Projects\ml-reaserch\data\some_other_file.json'
print(corpus.data_path)

C:\Projects\ml-reaserch\data\some_other_file.json


We can load the data from disk into the object like this:

In [6]:
corpus.load_data()

AssertionError: Attempting to load from file which does not exist: C:\Projects\ml-reaserch\data\some_other_file.json 
 Please either create this file or point the SQuADCorpusHandler.data_path attribute to a valid SQuAD json file

As you can see, the load fails because this overridden file does not exist. Let's reset to the default path, load and continue.

In [None]:
corpus.data_path = corpus._default_data_path
print(corpus.data_path)
corpus.load_data()
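
The existence check that produced the `AssertionError` earlier could look something like the sketch below. This is an assumption about the behaviour, not the handler's real implementation: `load_squad_json` is a hypothetical helper, and the only facts relied on are that a SQuAD file is JSON with its topics under the top-level `data` key. The example writes a tiny stand-in file so that it runs without the real dataset.

```python
import json
import os
import tempfile


def load_squad_json(path):
    """Load a SQuAD-style JSON file and return its list of topics (hypothetical helper)."""
    # Fail early with a readable message if the file is missing.
    assert os.path.isfile(path), (
        "Attempting to load from file which does not exist: {}".format(path)
    )
    with open(path, encoding="utf-8") as f:
        return json.load(f)["data"]


with tempfile.TemporaryDirectory() as tmp:
    # Tiny stand-in for a real SQuAD file.
    good_path = os.path.join(tmp, "tiny.json")
    with open(good_path, "w", encoding="utf-8") as f:
        json.dump({"version": "v2.0", "data": [{"title": "Example", "paragraphs": []}]}, f)

    topics = load_squad_json(good_path)

    # Loading from a path that does not exist triggers the assertion.
    try:
        load_squad_json(os.path.join(tmp, "missing.json"))
    except AssertionError as err:
        error_message = str(err)

print(len(topics))
print(error_message)
```

Note that `assert` statements are skipped when Python runs with `-O`, so raising `FileNotFoundError` explicitly would be a sturdier choice for this kind of check.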

In [11]:
corpus.data[0]

{'title': 'Beyoncé',
 'paragraphs': [{'qas': [{'question': 'When did Beyonce start becoming popular?',
     'id': '56be85543aeaaa14008c9063',
     'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
     'is_impossible': False},
    {'question': 'What areas did Beyonce compete in when she was growing up?',
     'id': '56be85543aeaaa14008c9065',
     'answers': [{'text': 'singing and dancing', 'answer_start': 207}],
     'is_impossible': False},
    {'question': "When did Beyonce leave Destiny's Child and become a solo singer?",
     'id': '56be85543aeaaa14008c9066',
     'answers': [{'text': '2003', 'answer_start': 526}],
     'is_impossible': False},
    {'question': 'In what city and state did Beyonce  grow up? ',
     'id': '56bf6b0f3aeaaa14008c9601',
     'answers': [{'text': 'Houston, Texas', 'answer_start': 166}],
     'is_impossible': False},
    {'question': 'In which decade did Beyonce become famous?',
     'id': '56bf6b0f3aeaaa14008c9602',
     'answers': [{'text

## Return raw data

Now that we've got the data loaded, let's start with the basics.

We can view all the data in raw form if we want...

In [20]:
data = corpus.get_data()

print(len(data))

print('Topic 0 has title {} and contents: \n {}'.format(data[0]['title'], data[0]['paragraphs']))

442
Topic 0 has title Beyoncé and contents: 


 [{'qas': [{'question': 'When did Beyonce start becoming popular?', 'id': '56be85543aeaaa14008c9063', 'answers': [{'text': 'in the late 1990s', 'answer_start': 269}], 'is_impossible': False}, {'question': 'What areas did Beyonce compete in when she was growing up?', 'id': '56be85543aeaaa14008c9065', 'answers': [{'text': 'singing and dancing', 'answer_start': 207}], 'is_impossible': False}, {'question': "When did Beyonce leave Destiny's Child and become a solo singer?", 'id': '56be85543aeaaa14008c9066', 'answers': [{'text': '2003', 'answer_start': 526}], 'is_impossible': False}, {'question': 'In what city and state did Beyonce  grow up? ', 'id': '56bf6b0f3aeaaa14008c9601', 'answers': [{'text': 'Houston, Texas', 'answer_start': 166}], 'is_impossible': False}, {'question': 'In which decade did Beyonce become famous?', 'id': '56bf6b0f3aeaaa14008c9602', 'answers': [{'text': 'late 1990s', 'answer_start': 276}], 'is_impossible': False}, {'question': 'In what R&B group was she the lead singer?

Or we can use the defined API to return the topics in nested list form...

In [22]:
corpus.get_single_topic(idx=0)

[[['Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'],
  ['When did Beyonce start becoming popular?',
   'What areas did Beyonce compete in when she was growing up?',
   "When did Beyonce leave Destiny's Child and become a solo singer?",
   'In what city and state did Beyonce  grow up? ',
   'In which decade did Beyonce become famous?',
   'In what
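
To make the nested list form more concrete, here is a minimal sketch of how one SQuAD-style topic could be turned into one `[[context], [questions]]` entry per paragraph. It is an illustration only: `topic_to_nested_list` and the toy topic are hypothetical, and because the output above is truncated, anything `get_single_topic` returns after the questions is not reproduced here.

```python
def topic_to_nested_list(topic):
    """Turn one SQuAD-style topic dict into [[context], [questions]] per paragraph."""
    nested = []
    for paragraph in topic["paragraphs"]:
        context = [paragraph["context"]]
        questions = [qa["question"] for qa in paragraph["qas"]]
        nested.append([context, questions])
    return nested


# Toy topic using the same keys as the SQuAD data shown above.
toy_topic = {
    "title": "Example",
    "paragraphs": [
        {
            "context": "The cat sat on the mat. It then fell asleep.",
            "qas": [
                {"question": "Where did the cat sit?"},
                {"question": "What did the cat do next?"},
            ],
        }
    ],
}

nested = topic_to_nested_list(toy_topic)
print(nested[0][0])  # the context, wrapped in a list
print(nested[0][1])  # the questions for that context
```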

## Apply Cleaners, Tokenizers etc.

Using re-wrapped functionality, we can apply cleaners, tokenisers, vectorisers, etc.

In [24]:
# e.g. an out-of-the-box NLTK tokenizer:
tok = nltk.tokenize.regexp.WhitespaceTokenizer()

# and a cleaner defined on the regex list specified for this corpus:
cleaner = Cleaners(regex_list=corpus.regex_list, lowercase=True)
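
To give a feel for what a regex-driven cleaner does, here is a minimal sketch. Everything in it is an assumption made for illustration: the class `RegexCleanerSketch`, its `clean` method, the accent-stripping step and the `(pattern, replacement)` format of `toy_regex_list` are hypothetical, and the real `Cleaners` class and `corpus.regex_list` may work differently.

```python
import re
import unicodedata


class RegexCleanerSketch:
    """Hypothetical cleaner that applies (pattern, replacement) pairs in order."""

    def __init__(self, regex_list, lowercase=False):
        self.rules = [(re.compile(pattern), repl) for pattern, repl in regex_list]
        self.lowercase = lowercase

    def clean(self, text):
        # Strip accents so that 'Beyoncé' becomes 'Beyonce'.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
        if self.lowercase:
            text = text.lower()
        for pattern, repl in self.rules:
            text = pattern.sub(repl, text)
        return text.strip()


# Hypothetical rules, chosen only to illustrate the idea.
toy_regex_list = [
    (r"(?<=\w)-(?=\w)", "_"),  # join hyphenated words with an underscore
    (r"([,.()])", r" \1 "),    # pad punctuation so whitespace splitting isolates it
    (r"\s+", " "),             # collapse runs of whitespace
]

cleaner_sketch = RegexCleanerSketch(toy_regex_list, lowercase=True)
cleaned = cleaner_sketch.clean("Beyoncé Knowles-Carter (born 1981), singer.")
print(cleaned.split())
```

Padding punctuation with spaces is what lets a plain whitespace tokenizer return commas and brackets as separate tokens.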

We can apply these to the topic to get all data returned in this format:

In [27]:
corpus.get_single_topic(idx=0, cleaner=cleaner, sentence_tokenize=True, word_tokenizer=tok)

[[[['beyonce',
    'giselle',
    'knowles_carter',
    '(',
    'bijnse',
    'bee_yon_say',
    ')',
    '(',
    'born',
    'september',
    '4',
    ',',
    '1981',
    ')',
    'is',
    'an',
    'american',
    'singer',
    ',',
    'songwriter',
    ',',
    'record',
    'producer',
    'and',
    'actress',
    '.'],
   ['born',
    'and',
    'raised',
    'in',
    'houston',
    ',',
    'texas',
    ',',
    'she',
    'performed',
    'in',
    'various',
    'singing',
    'and',
    'dancing',
    'competitions',
    'as',
    'a',
    'child',
    ',',
    'and',
    'rose',
    'to',
    'fame',
    'in',
    'the',
    'late',
    '1990s',
    'as',
    'lead',
    'singer',
    'of',
    'rnb',
    'girl_group',
    'destiny',
    'child',
    '.'],
   ['managed',
    'by',
    'her',
    'father',
    ',',
    'mathew',
    'knowles',
    ',',
    'the',
    'group',
    'became',
    'one',
    'of',
    'the',
    'world',
    'best_selling',
    'girl',
    

## Specific CQA Triplets

Or we can apply them in the same way to obtain cleaned CQA triplets.

As a note, in an ideal world we would only expose corpus methods that are common across corpora. It may therefore be desirable to hide the raw data queries above and expose only the CQA methods once we extend to RACE QA, etc.

In [30]:
contexts, questions, answers = corpus.cqa_triplets_for_topic(topic_idx=0, cleaner=cleaner)

print('Contexts: \n{}\n\n'.format(contexts[:3]))
print('Questions: \n{}\n\n'.format(questions[:3]))
print('Answers: \n{}\n\n'.format(answers[:3]))

Contexts: 
[['beyonce giselle knowles_carter ( bijnse bee_yon_say ) ( born september 4 , 1981 ) is an american singer , songwriter , record producer and actress . born and raised in houston , texas , she performed in various singing and dancing competitions as a child , and rose to fame in the late 1990s as lead singer of rnb girl_group destiny child . managed by her father , mathew knowles , the group became one of the world best_selling girl groups of all time . their hiatus saw the release of beyonce debut album , dangerously in love ( 2003 ) , which established her as a solo artist worldwide , earned five grammy awards and featured the billboard hot 100 number_one singles " crazy in love " and " baby boy " . '], ['beyonce giselle knowles_carter ( bijnse bee_yon_say ) ( born september 4 , 1981 ) is an american singer , songwriter , record producer and actress . born and raised in houston , texas , she performed in various singing and dancing competitions as a child , and rose to fam
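
The triplet structure itself can be sketched as follows. This is an illustration under stated assumptions, not the implementation of `cqa_triplets_for_topic`: the function `cqa_triplets` and the toy topic are hypothetical, contexts are kept as plain strings rather than wrapped in lists, only the first listed answer is used, and unanswerable questions are given `None` as their answer.

```python
def cqa_triplets(topic):
    """Flatten one SQuAD-style topic into aligned context/question/answer lists."""
    contexts, questions, answers = [], [], []
    for paragraph in topic["paragraphs"]:
        for qa in paragraph["qas"]:
            # Unanswerable questions have an empty answers list;
            # this sketch stores None for those.
            answer = qa["answers"][0]["text"] if qa["answers"] else None
            contexts.append(paragraph["context"])  # repeated once per question
            questions.append(qa["question"])
            answers.append(answer)
    return contexts, questions, answers


toy_topic = {
    "title": "Example",
    "paragraphs": [
        {
            "context": "The cat sat on the mat.",
            "qas": [
                {
                    "question": "Where did the cat sit?",
                    "answers": [{"text": "on the mat", "answer_start": 12}],
                    "is_impossible": False,
                },
                {
                    "question": "What colour was the cat?",
                    "answers": [],
                    "is_impossible": True,
                },
            ],
        }
    ],
}

toy_contexts, toy_questions, toy_answers = cqa_triplets(toy_topic)
for triplet in zip(toy_contexts, toy_questions, toy_answers):
    print(triplet)
```

Because the three lists are aligned by index, the context is repeated once per question, which is consistent with the repeated context visible in the output above.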