# Natural Language Processing

Natural language processing (NLP) is the application of machine learning to text and language problems. It is among the fastest-growing areas of data science and has seen a meteoric rise in activity since the introduction of word2vec in 2013. While NLP analyses generally face the same challenges as more conventional machine learning models, they also bring a set of problems of their own.

**What are some potential challenges with working with text data?**

### Data Ingestion

Often we're required to extract the data ourselves from a collection of documents.

In [1]:
txt_path = '../data/demo_text.txt'

with open(txt_path, "rb") as file1:
    txt_text = file1.read()

print(txt_text)

b'It is never this easy.'
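
The `b'...'` prefix in the output above shows that reading in binary mode returns `bytes` rather than `str`. The following is a minimal sketch of the decoding step, assuming the file is UTF-8 encoded (the encoding is an assumption, and the second sample string is made up for illustration). It also shows what a wrong codec guess can look like.

```python
# Bytes as they come back from open(path, "rb").read().
raw = b'It is never this easy.'

# Decoding turns bytes into str; "utf-8" is an assumption about the file.
text = raw.decode('utf-8')
print(text)

# Illustrative sample: the cp1252 encoding of a curly apostrophe is the byte 0x92.
raw_cp1252 = "It’s".encode('cp1252')

as_cp1252 = raw_cp1252.decode('cp1252')                 # correct codec
as_utf8 = raw_cp1252.decode('utf-8', errors='replace')  # wrong codec
print(as_cp1252, '|', as_utf8)
```

With the wrong codec the apostrophe is replaced by the Unicode replacement character, which is why it helps to know (or detect) the encoding before any further processing.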


In [None]:
def docx_to_text(path):
    #import io
    from xml.etree.ElementTree import XML  # cElementTree was removed in Python 3.9
    import zipfile

    WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
    PARA = WORD_NAMESPACE + 'p'
    TEXT = WORD_NAMESPACE + 't'

    ## Control for Path VS Bytes object from zip
    #if isinstance(path, io.BytesIO):
        #path.seek(0)

    # A .docx file is a zip archive; the body text lives in word/document.xml
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    for paragraph in tree.iter(PARA):  # Element.getiterator() was removed in Python 3.9
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)

docx_path = '../../pre_work/git_github_ve_instructions/2_Virtual_Environment_Setup/Virtual_Environments_Setup.docx'

docx_to_text(docx_path)

In [None]:
import spacy
import PyPDF2

pdf_path = '../data/_sreo1015pdf.pdf'

In [None]:
# Read the binary PDF into memory and parse it with PyPDF2.
# The older PdfFileReader / numPages / getPage / extractText names were removed in PyPDF2 3.0.

pdfFileObj = open(pdf_path, 'rb')
pdfReader = PyPDF2.PdfReader(pdfFileObj)

In [None]:
# Number of pages
len(pdfReader.pages)

In [None]:
pdfReader.metadata

In [None]:
full_text = []
for pageObj in pdfReader.pages:
    text = pageObj.extract_text().strip().replace('\n', ' ')
    full_text.append(text + ' ')

In [None]:
full_text[40]

In [None]:
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

In [None]:
text_pdfminer = convert_pdf_to_txt(pdf_path)

In [None]:
text_pdfminer

### Sometimes the situation is more dire: what about scanned documents?

Optical character recognition (OCR) uses computer vision models to find and transcribe printed characters in an image, assembling letters into words and words into "blobs" of text. The Tesseract engine and its Python wrapper, pytesseract, are among the most widely used OCR tools in the Python ecosystem.


What are some pitfalls of OCR?

### Text preprocessing

There are several steps in preparing data for any NLP task.

0) Pre-cleaning / data specific noise removal.

1) Tokenization

2) Cleaning

3) Lemmatization (**never stemming**)

4) Stop word removal (optional)

5) Phrase parsing (optional)

6) Part-of-speech tagging (optional ... kind of)
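
Step 0 depends heavily on where the text came from. As a rough sketch, assuming text extracted from a PDF, a pre-cleaning pass might rejoin words hyphenated across line breaks, drop lines that hold only a page number, and collapse whitespace. The function name `pre_clean` and the sample string are hypothetical, and real documents usually need rules of their own.

```python
import re

def pre_clean(text):
    """Hypothetical pre-cleaning pass for text extracted from a PDF."""
    # Rejoin words that were hyphenated across a line break ("informa-\ntion").
    text = re.sub(r'(\w)-\n(\w)', r'\1\2', text)
    # Drop lines that contain nothing but a page number.
    text = re.sub(r'(?m)^[ \t]*\d+[ \t]*$', '', text)
    # Collapse every run of whitespace, newlines included, into one space.
    return re.sub(r'\s+', ' ', text).strip()

sample = "Natural language informa-\ntion is  messy.\n\n 12 \nIt needs cleaning."
print(pre_clean(sample))
```

Note that the hyphen rule would also merge genuinely hyphenated words that happen to break across lines, so it is a trade-off rather than a safe default.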


#### Tokenization

Tokenization is the segmentation of text into words, punctuation, and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off, whereas "U.K." should remain one token.
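
To make the rule-based idea concrete, here is a toy regular-expression tokenizer. It is only a sketch of the principle and not how spaCy tokenizes; the pattern and the name `tokenize` are assumptions made for illustration.

```python
import re

# Alternatives are tried left to right, so abbreviations win over plain words.
TOKEN_PATTERN = re.compile(r"""
    (?:[A-Za-z]\.){2,}   # dotted abbreviations such as U.K. or U.S.A.
  | \w+(?:'\w+)?         # words, optionally with an internal apostrophe
  | [^\w\s]              # any other single punctuation character
""", re.VERBOSE)

def tokenize(text):
    # findall returns whole matches because the pattern has no capturing groups.
    return TOKEN_PATTERN.findall(text)

print(tokenize("She moved to the U.K. in 2013, didn't she?"))
```

The abbreviation stays whole while the comma and question mark become their own tokens. A handful of rules like this breaks down quickly on real text, which is the reason to rely on a trained, language-specific tokenizer.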

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
import pandas as pd
df = pd.read_csv('../data/hp_data.csv', encoding='cp1252')

In [None]:
df['tokenized'] = df['text'].apply(lambda x: nlp(x, disable = ['ner']))

### Stopwords

Stopwords are words such as "the" and "or", as well as pronouns, that add little substantive value to a document. While they can provide clues about the relationships between words, for many tasks such as topic classification or phrase comparison they largely serve as noise, so removing them is very common.
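
Removal itself is just a membership test against a stop list. A minimal sketch follows, using a tiny made-up stop list (`TOY_STOP_WORDS` is hypothetical; spaCy's `STOP_WORDS` is far larger).

```python
# A tiny, hypothetical stop list used only for this illustration.
TOY_STOP_WORDS = {"the", "or", "a", "of", "he", "she", "it", "to", "and"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    # Compare in lowercase so "The" is dropped along with "the".
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = ["The", "boy", "waved", "the", "wand", "and", "it", "sparked"]
print(remove_stop_words(tokens))
```

With spaCy documents, the same filter can be written against each token's `is_stop` attribute instead of a hand-built set.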

In [None]:
from spacy.lang.en.stop_words import STOP_WORDS
print('Example stop words: {}'.format(list(STOP_WORDS)[0:15]))

### Stemming and Lemmatization

From the Stanford NLP Group (*Introduction to Information Retrieval*, Manning, Raghavan and Schütze):

"Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma. If confronted with the token saw, stemming might return just s, whereas lemmatization would attempt to return either see or saw depending on whether the use of the token was as a verb or a noun."

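The contrast in the quote can be sketched with two toy functions. Both are illustrations only: `crude_stem` is a made-up suffix chopper (not the Porter stemmer), and `LEMMA_TABLE` is a hypothetical four-entry vocabulary standing in for the dictionary and morphological analysis a real lemmatizer uses.

```python
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    # Chop the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical mini-vocabulary keyed on (token, part of speech).
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("studies", "VERB"): "study",
    ("studies", "NOUN"): "study",
}

def lookup_lemma(word, pos):
    # Fall back to the word itself when it is not in the table.
    return LEMMA_TABLE.get((word, pos), word)

for word, pos in [("saw", "VERB"), ("saw", "NOUN"), ("studies", "NOUN")]:
    print(word, pos, crude_stem(word), lookup_lemma(word, pos))
```

The stemmer produces non-words such as "studi" and cannot tell the two uses of "saw" apart, while the lookup needs the part of speech to choose a lemma. That dependency is one reason part-of-speech tagging is only "kind of" optional.
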
In [None]:
example = df.tokenized[50]
print(example)

In [None]:
row = '{:15} | {:15} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} |'
print(row.format('TEXT', 'LEMMA_', 'POS_', 'TAG_', 'DEP_', 'SHAPE_', 'IS_ALPHA', 'IS_STOP'))

# Print various spaCy token attributes; str() keeps the booleans from being formatted as 1/0
for token in example:
    print(row.format(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                     token.shape_, str(token.is_alpha), str(token.is_stop)))

### N-Grams and Phrase Parsing

A single-word token is a unigram; n-grams are the general class of sequences of n adjacent tokens. Two- and three-word phrases can carry a different meaning than their words do individually. For example, the bigrams "Union Station" and "Chicago Bulls" mean something quite different from their two constituent unigrams. Finding such phrases in your documents and merging the unigrams into n-grams is called phrase parsing.
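
As a rough sketch of the idea, the code below builds n-grams with a sliding window and scores each adjacent pair by how much more often it occurs than its individual word counts would suggest. The formula is loosely modeled on the count-based score from the word2vec phrase paper; the helper names and the toy sentences are assumptions, and gensim's own implementation may differ in its details.

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of width n across the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def score_bigrams(sentences, min_count=1):
    """Score pairs as (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigram_counts = Counter(tok for sent in sentences for tok in sent)
    bigram_counts = Counter(bg for sent in sentences for bg in ngrams(sent, 2))
    vocab_size = len(unigram_counts)
    return {
        (a, b): (count - min_count) * vocab_size / (unigram_counts[a] * unigram_counts[b])
        for (a, b), count in bigram_counts.items()
    }

sentences = [
    ["we", "met", "at", "union", "station"],
    ["union", "station", "was", "busy"],
    ["we", "left", "union", "station", "at", "noon"],
]
print(ngrams(sentences[0], 2))

scores = score_bigrams(sentences)
best = max(scores, key=scores.get)
print(best, scores[best])
```

A phrase parser keeps the pairs whose score clears a threshold and merges each into a single token joined by an underscore, such as `union_station`.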

In [None]:
from gensim.models.phrases import Phrases, Phraser
from gensim.utils import simple_preprocess

In [None]:
df.text[2369]

In [None]:
print(simple_preprocess(df.text[2369]))

In [None]:
gensim_text = [simple_preprocess(text) for text in df.text]

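common_terms = list(STOP_WORDS)

# Note: gensim >= 4.0 renamed the `common_terms` argument (used here and in the trigram cell) to `connector_words`
phrases = Phrases(
      gensim_text
    , common_terms=common_terms
    , min_count=10
    , threshold=5
    , scoring='default'
)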

phrases

In [None]:
bigram = Phraser(phrases)

def print_phrases(phraser, text_stream, num_underscores=2):
    """ identify phrases from a text stream by searching for terms that
        are separated by underscores and include at least num_underscores
    """
    
    phrases = []
    for terms in phraser[text_stream]:
        for term in terms:
            if term.count('_') >= num_underscores:
                phrases.append(term)
    print(set(phrases))
    
print_phrases(bigram, gensim_text)


In [None]:
phrases = Phrases(
      bigram[gensim_text]
    , common_terms=common_terms
    , min_count=3
    , threshold=1
)

trigram = Phraser(phrases)


In [None]:
for doc_num in [2369]:
    print('DOC NUMBER: {}\n'.format(doc_num))
    print('ORIGINAL SENTENCE: {}\n'.format(' '.join(gensim_text[doc_num])))
    print('BIGRAM: {}\n'.format(' '.join(bigram[gensim_text[doc_num]])))
    print('TRIGRAM: {}'.format(' '.join(trigram[bigram[gensim_text[doc_num]]])))
    print()

In [None]:
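# Assumption: `out_text` is not defined in the cells above; here it is taken to be
# the trigram-parsed tokens of each document joined back into one string.
out_text = [' '.join(trigram[bigram[doc]]) for doc in gensim_text]
df['out_text'] = out_text
df[['line', 'text', 'book', 'text2', 'chapter','out_text']].to_pickle('../data/hp_processed.pkl')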