# Natural Language Processing

Natural language processing (NLP) is the application of machine learning to text and language problems. It is amongst the fastest growing areas of the data science domain and has seen a meteroic rise in activity since the invention of word2vec in 2013. While many NLP analysis generally faces the same challenges of more straightforward machine learning models, they also have a set of totally unique problems.

**What are some potential challenges with working with text data?**

### Data Ingestion

Often we're required to extract the data ourselves from a collection of documents.

In [1]:
txt_path = '../data/demo_text.txt'

file1 = open(txt_path,"rb") 
txt_text = file1.read()
file1.close() 

print(txt_text)

b'It is never this easy.'


In [2]:
def docx_to_text(path):
    #import io
    from xml.etree.cElementTree import XML
    import zipfile
    
    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'

    ## Control for Path VS Bytes object from zip    
    #if isinstance(path, io.BytesIO):
        #path.seek(0)

    document = zipfile.ZipFile(path)        
    xml_content = document.read('word/document.xml')
    document.close()
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.getiterator(PARA):
        texts = [node.text
                 for node in paragraph.getiterator(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)

docx_path = '../../pre_work/git_github_ve_instructions/2_Virtual_Environment_Setup/Virtual_Environments_Setup.docx'
docx_to_text(docx_path)

'Virtual Environments\n\nWindows Users: You will interact with conda by opening Anaconda Command Prompt that comes installed with the windows distribution of Anaconda.\n\nMac Users: You will interact with conda directly through your terminal.\n\nThis guide aims to bring a newcomer up to speed with Anaconda and working with virtual environments. Please be sure to read through everything if you have never set up environments before. Also feel free to reference YouTube and google for setting up environments if you are still confused by the end of this guide.\n\nIn this bootcamp we are going to be working with many different libraries and as a best practice, when starting on a new project or even exploring new libraries/packages/algorithms, those libraries may have certain dependencies. If you’re installing all your new libraries under your main installation (also known as your ‘base’ installation) things can and will eventually stop working due to dependency issues or conflicts. This is n

In [3]:
import spacy
import PyPDF2

pdf_path = '../data/_sreo1015pdf.pdf'

In [4]:
#Read binary pdf object into memory and parse with PyPDF2

pdfFileObj = open(pdf_path, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

In [5]:
# Number of pages
pdfReader.numPages

134

In [6]:
pdfReader.getDocumentInfo()

{'/CreationDate': "D:20151001142933-04'00'",
 '/Creator': 'Adobe InDesign CC 2014 (Windows)',
 '/ModDate': "D:20151027121311-04'00'",
 '/Producer': 'Adobe PDF Library 11.0',
 '/Title': 'Regional Economic Outlook: Sub-Saharan Africa; October 2015'}

In [7]:
full_text = []
for page in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(page)
    text = pageObj.extractText().strip().replace('\n',' ')
    full_text.append(text)

In [8]:
full_text[40]

'30growth in the region has not only been driven by commodities.3 Many countries in the region that are not reliant on commodities were also able to achieve rapid growth by creating a virtuous circle of good macroeconomic policies and important structural  reforms that attracted higher aid ˝ows. ˜us,  eight of the 12 fastest growing countries in the region over 1995Œ2010 were nonresource-depen-dent economies. Growth across the region has also bene˚ted from increased private capital ˝ows. ˜e  period since the mid-1990s saw a spurt of ˚nancial innovations that, together with the improved policy environment and debt relief, allowed such ˝ows to  the region to increase very signi˚cantly.While some of these growth drivers will continue to yield dividends, others have run their course.  As noted in Chapter 1, commodity prices have retreated, the ongoing shift in China™s growth model is likely to reduce demand for the region™s  raw materials, and the period of abundant global liquidity is tap

In [9]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

In [10]:
text_pdfminer = convert_pdf_to_txt(pdf_path)

In [11]:
text_pdfminer



### Sometimes the situation is more dire, what about scanned documents?

Optical character recognition can process images and uses computer vision models to find and transcribe the typed print and put letters together to words and words together into "blobs" of text. The utility tesseract and its python wrapper pyTesseract is the primary OCR tool available in Python.


What are some pitfalls of OCR?

### Text preprocessing

There's several steps for preparing our data for use in any NLP activities.

0) Pre-cleaning / data specific noise removal.

1) Tokenization

2) Cleaning

3) Lemmatization **never stemming**

4) Stop word removal (optional)

5) Phrase parsing (optional)

6) Part-of-speech tagging (optional ... kind of)


#### Tokenization

Tokenization is the segmententation of text it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token.

In [12]:
nlp = spacy.load("en_core_web_sm")

In [13]:
import pandas as pd
df = pd.read_csv('../data/hp_data.csv', encoding='cp1252')

In [14]:
df['tokenized'] = df['text'].apply(lambda x: nlp(x, disable = ['ner']))

### Stopwords

Stopwords are words like "the" or "or" or pronouns that add little substantive value to a document. While they can provide clues about relationships of words, for many activities like topic classification or phrase comparison they largely just serve as noise. Removal of stopwords is very common.

In [15]:
from spacy.lang.en.stop_words import STOP_WORDS
print('Example stop words: {}'.format(list(STOP_WORDS)[0:15]))

Example stop words: ['afterwards', 'to', 'namely', 'throughout', 'her', 'moreover', 'doing', 'until', 'anyone', 'by', 'whenever', 'which', "'ll", "n't", 'over']


### Stemming and Lemmatization

From Stanford NLTK group 

"Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma . If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun."

In [16]:
example = df.tokenized[50]
print(example)

“My dear Professor, I’ve never seen a cat sit so stiffly.”


In [17]:
print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} | '.format(
    'TEXT','LEMMA_','POS_','TAG_','DEP_','SHAPE_','IS_ALPHA','IS_STOP'))

# print various SpaCy POS attributes
for token in example:
    print('{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} |'.format(
          token.text, token.lemma_, token.pos_, token.tag_, token.dep_
        , token.shape_, token.is_alpha, token.is_stop))

TEXT            | LEMMA_          | POS_     | TAG_     | DEP_        | SHAPE_   | IS_ALPHA | IS_STOP  | 
“               | "               | PUNCT    | ``       | punct       | “        |        0 |        0 |
My              | -PRON-          | DET      | PRP$     | poss        | Xx       |        1 |        1 |
dear            | dear            | ADJ      | JJ       | amod        | xxxx     |        1 |        0 |
Professor       | Professor       | PROPN    | NNP      | npadvmod    | Xxxxx    |        1 |        0 |
,               | ,               | PUNCT    | ,        | punct       | ,        |        0 |        0 |
I               | -PRON-          | PRON     | PRP      | nsubj       | X        |        1 |        1 |
’ve             | have            | VERB     | VB       | aux         | ’xx      |        0 |        1 |
never           | never           | ADV      | RB       | neg         | xxxx     |        1 |        1 |
seen            | see             | VERB     | VBN    

### N-Grams and Phrase Parsing

The tokenization of single words is a unigram. N-grams is the general class of joined text. 2 and 3 word phrases can offer different meaning than the words individually. For example the bigrams "Union Station" or "Chicago Bulls" have very different meanings than just the two individual unigrams. Finding these in your document and converting unigrams to n-grams can be done by what is called phrase parsing.

In [18]:
from gensim.models.phrases import Phrases, Phraser
from gensim.utils import simple_preprocess

In [19]:
df.text[2369]

'Harry shook his head violently to shut Neville up, but Professor McGonagall had seen. She looked more likely to breathe fire than Norbert as she towered over the three of them.'

In [20]:
print(simple_preprocess(df.text[2369]))

['harry', 'shook', 'his', 'head', 'violently', 'to', 'shut', 'neville', 'up', 'but', 'professor', 'mcgonagall', 'had', 'seen', 'she', 'looked', 'more', 'likely', 'to', 'breathe', 'fire', 'than', 'norbert', 'as', 'she', 'towered', 'over', 'the', 'three', 'of', 'them']


In [21]:
gensim_text = [simple_preprocess(text) for text in df.text]

common_terms = list(STOP_WORDS)

phrases = Phrases(
      gensim_text
    , common_terms=common_terms
    , min_count=10
    , threshold=5
    , scoring='default'
)

phrases

<gensim.models.phrases.Phrases at 0x2164c58b448>

In [22]:
bigram = Phraser(phrases)

def print_phrases(phraser, text_stream, num_underscores=2):
    """ identify phrases from a text stream by searching for terms that
        are separated by underscores and include at least num_underscores
    """
    
    phrases = []
    for terms in phraser[text_stream]:
        for term in terms:
            if term.count('_') >= num_underscores:
                phrases.append(term)
    print(set(phrases))
    
print_phrases(bigram, gensim_text)


{'copy_of_advanced', 'turned_to_look', 'heir_of_slytherin', 'going_to_try', 'keeping_an_eye', 'points_from_gryffindor', 'end_of_the_lesson', 'department_for_the_regulation', 'borgin_and_burkes', 'start_of_term', 'lily_and_james', 'department_of_mysteries', 'red_and_gold', 'weasley_is_our_king', 'matter_of_fact', 'aunt_and_uncle', 'sword_of_gryffindor', 'scar_on_his_forehead', 'boy_who_lived', 'copy_of_the_daily', 'piece_of_parchment', 'year_at_hogwarts', 'want_to_talk', 'fell_to_the_floor', 'closing_the_door', 'tip_of_his_wand', 'slammed_the_door', 'foot_of_the_stairs', 'mother_and_father', 'came_to_halt', 'end_of_the_corridor', 'trying_to_find', 'parvati_and_lavender', 'hair_and_beard', 'bill_and_fleur', 'backward_and_forward', 'crossed_the_room', 'mum_and_dad', 'opened_his_mouth', 'trying_to_catch', 'rest_of_the_school', 'chamber_of_secrets', 'sort_of_way', 'led_the_way', 'george_and_ginny', 'try_and_find', 'flash_of_light', 'history_of_magic', 'going_to_happen', 'bill_and_charlie', 

In [23]:
phrases = Phrases(
      bigram[gensim_text]
    , common_terms=common_terms
    , min_count=3
    , threshold=1
)

trigram = Phraser(phrases)


In [24]:
for doc_num in [2369]:
    print('DOC NUMBER: {}\n'.format(doc_num))
    print('ORIGINAL SENTENT: {}\n'.format(' '.join(gensim_text[doc_num])))
    print('BIGRAM: {}\n'.format(' '.join(bigram[gensim_text[doc_num]])))
    print('TRIGRAM: {}'.format(' '.join(trigram[bigram[gensim_text[doc_num]]])))
    print()

DOC NUMBER: 2369

ORIGINAL SENTENT: harry shook his head violently to shut neville up but professor mcgonagall had seen she looked more likely to breathe fire than norbert as she towered over the three of them

BIGRAM: harry shook_his_head violently to shut neville up but professor_mcgonagall had seen she looked more likely to breathe fire than norbert as she towered over the three of them

TRIGRAM: harry_shook_his_head violently to shut neville up but professor_mcgonagall had seen she looked more likely to breathe fire than norbert as she towered over the three of them



In [27]:
out_text = [' '.join(trigram[bigram[gensim_text[x]]] for x in range(len(df)))]
df['out_text'] = out_text
df[['line', 'text', 'book', 'text2', 'chapter','out_text']].to_pickle('../data/hp_processed.pkl')

TypeError: sequence item 0: expected str instance, list found