# RCS Python NLP processing

## NLP - Natural Language Processing

* Computers are good at processing structured data, not free-form text
* Humans process language in ways we do not understand well enough to replicate on computers
* The noted linguist Noam Chomsky even posits that we have an innate sense of grammar (the theory is not confirmed)

![Grammars and Languages](img/chomsky_hierarchy.png)
More on the Chomsky hierarchy - https://en.wikipedia.org/wiki/Chomsky_hierarchy

### Where does human language fit in here?

## How do we give structure to our language that computers can understand?


## Some sentences are impossible to parse without context around them.

"We saw her duck"
* Did we see her lowering her body?
* Did we see the duck that belongs to her?

"One morning I shot an elephant in my pajamas. How he got in my pajamas, I don't know."

## Building a pipeline

* Divide and conquer: break the problem into small steps and solve them one at a time

## Sentence Segmentation

* Break the text apart into separate sentences.
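
To make the step concrete, here is a minimal sketch of a naive, rule-based splitter that uses only the standard library. The function name `split_sentences` and the sample text are made up for illustration; real segmenters are considerably smarter. Note how the rule wrongly splits after the abbreviation "Dr.", which is why simple punctuation rules are not enough.

```python
import re

def split_sentences(text):
    """Naively split after '.', '!' or '?' when followed by whitespace and a capital letter."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [part for part in parts if part]

# Made-up sample text; the abbreviation "Dr." is there to show where the rule fails
text = "Riga was founded in 1201. It is the capital of Latvia. Dr. Smith visited in 2014."
sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
```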
    



## Word Tokenization

* Process sentences one at a time
* Split the sentence into words (tokens) wherever there is a space between them
* Punctuation marks become separate tokens because punctuation carries meaning (compare "eats shoots and leaves" with "eats, shoots and leaves")
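
A rough sketch of this idea with a single regular expression is shown below. The helper `tokenize` is hypothetical and only approximates what a real tokenizer does; in particular, keeping contractions such as "don't" in one piece is a simplifying assumption of this sketch.

```python
import re

def tokenize(sentence):
    # \w+(?:'\w+)* keeps a word and any apostrophe suffix together ("don't");
    # [^\w\s] emits every other non-space character (punctuation) as its own token
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)

tokens = tokenize("How he got in my pajamas, I don't know.")
print(tokens)
```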

## Predict Parts of Speech for Each Token
* The model is statistical: the computer guesses each token's part of speech based on its training data
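
The "guess from training data" idea can be illustrated with a toy most-frequent-tag baseline. Everything here is an assumption made for illustration: the three hand-labelled training sentences, the tag names, and the `tag_tokens` helper. Real taggers also use the surrounding words, which this sketch ignores, so it always picks the same tag for "duck".

```python
from collections import Counter, defaultdict

# Tiny hand-labelled training set (hypothetical, for illustration only)
TRAINING = [
    [("we", "PRON"), ("saw", "VERB"), ("her", "PRON"), ("duck", "NOUN")],
    [("the", "DET"), ("duck", "NOUN"), ("swims", "VERB")],
    [("they", "PRON"), ("duck", "VERB"), ("quickly", "ADV")],
]

# Count how often each word was seen with each tag
tag_counts = defaultdict(Counter)
for sentence in TRAINING:
    for word, tag in sentence:
        tag_counts[word.lower()][tag] += 1

def tag_tokens(tokens, default="NOUN"):
    """Give each token its most frequent training tag; unseen words get a default guess."""
    tagged = []
    for token in tokens:
        seen = tag_counts.get(token.lower())
        best = seen.most_common(1)[0][0] if seen else default
        tagged.append((token, best))
    return tagged

tagged = tag_tokens(["We", "saw", "her", "duck"])
print(tagged)
```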

## Text Lemmatization
* Find the base form (lemma) of each word
* Example: "mouse" and "mice" share the lemma "mouse"
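
A minimal sketch of lemmatization follows, assuming a small lookup table for irregular forms plus two crude suffix rules. The table `IRREGULAR` and the function `lemmatize` are hypothetical; real lemmatizers rely on much larger dictionaries and on the part of speech.

```python
# Hypothetical lookup table for irregular forms
IRREGULAR = {"mice": "mouse", "geese": "goose", "was": "be", "were": "be"}

def lemmatize(word):
    """Return a rough base form using a lookup table and two suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # cities -> city
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # ducks -> duck, but glass stays glass
    return w

lemmas = [lemmatize(w) for w in ["Mice", "cities", "ducks", "glass", "was"]]
print(lemmas)
```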


## Identifying Stop Words
* Filler words that carry little meaning on their own ("and", "a", "the" and so on)
* Usually taken from a hardcoded list
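
Filtering against a hardcoded list can be sketched in a few lines. The set `STOP_WORDS` below is a tiny illustrative list, not the list any particular library ships with.

```python
# Tiny illustrative stop word list (an assumption, not a library's real list)
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to"}

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]

tokens = ["Riga", "is", "the", "capital", "and", "largest", "city", "of", "Latvia"]
content_words = remove_stop_words(tokens)
print(content_words)
```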

## Dependency Parsing
* Find out how each word relates to the other words in the sentence
* A complex task; the state of the art in 2017 was Google's Parsey McParseface
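
The sketch below does not parse anything; it only shows one plausible way to represent the result of dependency parsing, where every token points to its head word. The parse is annotated by hand and the relation labels are illustrative assumptions.

```python
# Hand-annotated parse of "Riga is the capital of Latvia" (labels are illustrative).
# Each entry is (token, index of its head, relation); the root points to itself.
parse = [
    ("Riga", 1, "nsubj"),
    ("is", 1, "ROOT"),
    ("the", 3, "det"),
    ("capital", 1, "attr"),
    ("of", 3, "prep"),
    ("Latvia", 4, "pobj"),
]

def children(parse, head_index):
    """Return the tokens whose head is the token at head_index."""
    return [
        token
        for i, (token, head, _) in enumerate(parse)
        if head == head_index and i != head_index
    ]

root_index = next(i for i, (_, _, rel) in enumerate(parse) if rel == "ROOT")
print(parse[root_index][0], "->", children(parse, root_index))
```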

### Choosing the right model underneath if we had to do it ourselves...

![Algorithm for Data Preparation and Model Building](https://developers.google.com/machine-learning/guides/text-classification/images/TextClassificationFlowchart.png)

## Find Noun Phrases
* Identify groups of words that go together
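
One simple way to group words is to collect runs of determiners, adjectives and nouns from part-of-speech tags. The function `noun_phrases` and the hand-tagged input are assumptions for illustration; real chunkers usually work from the dependency parse instead.

```python
NOUN_TAGS = {"NOUN", "PROPN"}
PHRASE_TAGS = {"DET", "ADJ"} | NOUN_TAGS

def noun_phrases(tagged):
    """Group runs of DET/ADJ/NOUN/PROPN tokens, keeping only runs that end in a noun."""
    phrases, current = [], []
    # The sentinel at the end flushes the last run
    for word, tag in tagged + [("", "END")]:
        if tag in PHRASE_TAGS:
            current.append((word, tag))
            continue
        # Drop trailing determiners or adjectives that have no noun after them
        while current and current[-1][1] not in NOUN_TAGS:
            current.pop()
        if current:
            phrases.append(" ".join(w for w, _ in current))
        current = []
    return phrases

# Hand-tagged sentence (tags are assumptions)
tagged = [
    ("Riga", "PROPN"), ("is", "AUX"), ("the", "DET"), ("largest", "ADJ"),
    ("city", "NOUN"), ("of", "ADP"), ("Latvia", "PROPN"),
]
phrases = noun_phrases(tagged)
print(phrases)
```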

## Named Entity Recognition (NER)

Detect and label the real-world entities mentioned in the text, for example:

* People’s names
* Company names
* Geographic locations (Both physical and political)
* Product names
* Dates and times
* Amounts of money
* Names of events
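
As a toy illustration of labelling entities, the sketch below combines a small lookup table with two regular expressions. The table `GAZETTEER`, the patterns, the label names and the sample sentence are all assumptions; real NER models are statistical and use the context of each word rather than fixed rules.

```python
import re

# Hypothetical lookup table of known names
GAZETTEER = {"Riga": "GPE", "Latvia": "GPE", "UNESCO": "ORG"}

# Crude patterns: dollar amounts and four-digit years
PATTERNS = [
    ("MONEY", re.compile(r"\$\d+(?:[.,]\d+)*")),
    ("DATE", re.compile(r"\b(?:1\d{3}|20\d{2})\b")),
]

def find_entities(text):
    """Return (entity text, label) pairs in the order they appear in the text."""
    found = []
    for name, label in GAZETTEER.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((match.start(), match.group(), label))
    for label, pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(text_, label) for _, text_, label in sorted(found)]

# Made-up sample sentence
entities = find_entities("Riga was founded in 1201. A UNESCO tour costs $12.50.")
print(entities)
```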


## Coreference Resolution

"Yesterday I bought a car. It was expensive."

* Coreference resolution works out that "It" refers to "a car"
* It is one of the most difficult steps in our pipeline
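
To show why this is hard, here is a deliberately naive heuristic: link each pronoun to the most recent noun seen before it. The function `resolve_pronouns` and the hand-assigned tags are assumptions for illustration. The heuristic happens to work for the car example, but it fails as soon as another noun appears between the pronoun and the word it refers to.

```python
def resolve_pronouns(tagged, pronouns=("it",)):
    """Link each listed pronoun to the nearest noun that precedes it."""
    links, last_noun = [], None
    for word, tag in tagged:
        if tag == "NOUN":
            last_noun = word
        elif word.lower() in pronouns and last_noun is not None:
            links.append((word, last_noun))
    return links

# Hand-tagged version of the example above (tags are assumptions)
tagged = [
    ("Yesterday", "NOUN"), ("I", "PRON"), ("bought", "VERB"), ("a", "DET"),
    ("car", "NOUN"), (".", "PUNCT"), ("It", "PRON"), ("was", "AUX"),
    ("expensive", "ADJ"), (".", "PUNCT"),
]
links = resolve_pronouns(tagged)
print(links)
```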

In [None]:
# Install through conda, not pip, on Windows!
conda install -c conda-forge spacy

In [None]:
import spacy

### Warnings can be ignored here; they come from spaCy being compiled against an older NumPy than the one installed
### https://stackoverflow.com/questions/40845304/runtimewarning-numpy-dtype-size-changed-may-indicate-binary-incompatibility

In [None]:
conda install -c conda-forge textacy

In [None]:
import textacy

In [None]:
## Download the already trained model data for English
# The leading "!" runs the command in a shell from inside the notebook
!python -m spacy download en_core_web_lg

## About 900MB

In [None]:
# we load our NLP model
nlp = spacy.load('en_core_web_lg')

In [None]:
mydata = '''Riga (/ˈriːɡə/; Latvian: Rīga [ˈriːɡa] (About this sound listen)) is the capital and largest city of Latvia. With 641,481 inhabitants (2016),[4] it is also the largest city in the three Baltic states, home to one third of Latvia's population and one tenth of the three Baltic states' combined population.[8] The city lies on the Gulf of Riga, at the mouth of the Daugava. Riga's territory covers 307.17 square kilometres (118.60 square miles) and lies between one and ten metres (3 feet 3 inches and 32 feet 10 inches) above sea level,[9] on a flat and sandy plain.[9]

Riga was founded in 1201 and is a former Hanseatic League member. Riga's historical centre is a UNESCO World Heritage Site, noted for its Art Nouveau/Jugendstil architecture and 19th century wooden architecture.[10] Riga was the European Capital of Culture during 2014, along with Umeå in Sweden. Riga hosted the 2006 NATO Summit, the Eurovision Song Contest 2003, the 2006 IIHF Men's World Ice Hockey Championships and the 2013 World Women's Curling Championship. It is home to the European Union's office of European Regulators for Electronic Communications (BEREC).

In 2016, Riga received over 1.4 million visitors.[11] It is served by Riga International Airport, the largest and busiest airport in the Baltic states. Riga is a member of Eurocities,[12] the Union of the Baltic Cities (UBC)[13] and Union of Capitals of the European Union (UCEU).[14]'''

In [None]:
## We need to clean our data a bit first
# Regular expressions to the rescue!
import re
# how to construct regex queries https://docs.python.org/3/howto/regex.html#regex-howto

# where do regular expressions belong on Chomsky hierarchy: https://cstheory.stackexchange.com/questions/1047/where-do-most-regex-implementations-fall-on-the-complexity-scale

In [None]:
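# We need to escape [ and ] because they have a special meaning in a regex
# These characters need escaping: . ^ $ * + ? { } [ ] \ | ( )
# A raw string (r'...') keeps the backslashes intact for the regex engine
cleanriga = re.sub(r'\[.*?\]', '', mydata)  # replace square brackets and their contents with ''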
cleanriga
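
The `?` after `.*` in the pattern above matters. A small sketch on a shortened sample string (made up from the Riga text) shows the difference: the non-greedy `.*?` stops at the first closing bracket, while the greedy `.*` runs to the last closing bracket and deletes real text in between.

```python
import re

sample = "Riga has 641,481 inhabitants (2016),[4] one tenth of the Baltic population.[8]"

# Non-greedy: each [..] marker is removed separately
non_greedy = re.sub(r'\[.*?\]', '', sample)

# Greedy: everything from the first '[' to the last ']' is removed
greedy = re.sub(r'\[.*\]', '', sample)

print(non_greedy)
print(greedy)
```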

In [None]:
# Parse the text with spaCy. This runs the entire pipeline we discussed earlier
parsedriga = nlp(cleanriga)

In [None]:
# Print the named entities and entity types detected in our document
for entity in parsedriga.ents:
    print(f"{entity.text} ({entity.label_})")

In [None]:
## Entity types https://spacy.io/usage/linguistic-features#entity-types

In [None]:
# Let's go deeper and find out statements about our text
import textacy.extract

In [None]:
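# Extract semi-structured statements about the entity "Riga"
# Note: this positional call matches older textacy releases; newer releases may
# expect keyword arguments instead, so check the documentation for your version
statements = textacy.extract.semistructured_statements(parsedriga, "Riga")
type(statements)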

In [None]:
print("Here are the things I know about Riga:")

for statement in statements:
    subject, verb, fact = statement
    print(f" {subject} - {verb} {fact}")

In [None]:
# Can we do better?
# We need more text for our NLP pipeline to process.

In [None]:
with open('resources/riga.txt', 'r', encoding="utf8") as f:
    myriga = f.read()

In [None]:
cleanriga = re.sub(r'\[.*?\]', '', myriga)  # strip the [..] reference markers again
cleanriga

In [None]:
parsedrigaful = nlp(cleanriga)

In [None]:
for entity in parsedrigaful.ents:
    print(f"{entity.text} ({entity.label_})")

In [None]:
statements = textacy.extract.semistructured_statements(parsedrigaful, "Riga")
type(statements)

In [None]:
print("Here are the things I know about Riga:")

for statement in statements:
    subject, verb, fact = statement
    print(f" {subject} - {verb} {fact}")

In [None]:
# Extract noun chunks that appear at least 3 times
noun_chunks = textacy.extract.noun_chunks(parsedrigaful, min_freq=3)

# Convert noun chunks to lowercase strings
noun_chunks = map(str, noun_chunks)
noun_chunks = map(str.lower, noun_chunks)

# Print out any noun chunks that are at least 2 words long
for noun_chunk in set(noun_chunks):
    if len(noun_chunk.split(" ")) > 1:
        print(noun_chunk)

In [None]:
## TODO Install the neuralcoref library and add Coreference Resolution to your pipeline.

Sources:

* https://medium.com/@ageitgey/natural-language-processing-is-fun-9a0bff37854e
* https://developers.google.com/machine-learning/guides/text-classification/