In [2]:
import nltk
import spacy

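# spacy.load('en') raises "OSError: Can't find model 'en'" when the 'en' shortcut
# is not available. Loading the model by its package name avoids the shortcut;
# install it first with: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')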

In [None]:
%%python
import nltk
nltk.download()

In [None]:
text = """Each autumn, businesses flock to elite universities like Harvard and Stanford to recruit engineers for their first post-university jobs. Curious students pile into classrooms to hear recruiters deliver their best pitches. These are the first moments when prospective employees size up a company’s culture and assess whether they can see themselves reflected in its future."""

## Tokenization

### Sentence Tokenization

##### Using spaCy

In [None]:
doc = nlp(text)

In [None]:
# doc.sents yields each sentence as a Span (a sequence of tokens, similar to a list)
for index, sent in enumerate(doc.sents):
    print(index, type(sent), sent, '\n')

In [None]:
spacy_sents = list(doc.sents)

##### Using NLTK

In [None]:
from nltk import sent_tokenize

In [None]:
for index, sent in enumerate(sent_tokenize(text)):
    print(index, type(sent), sent, '\n')

In [None]:
nltk_sents = sent_tokenize(text)
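
For contrast, here is a minimal sketch of a naive rule-based splitter built only on `re`. It is not what spaCy or NLTK do internally; the sample string and the helper name `naive_sent_split` are made up for illustration. It shows the kind of mistake (splitting after an abbreviation) that makes a dedicated sentence tokenizer worthwhile.

```python
import re

def naive_sent_split(raw_text):
    """Split after '.', '!' or '?' when followed by whitespace (hypothetical helper)."""
    return re.split(r'(?<=[.!?])\s+', raw_text.strip())

sample = "Dr. Smith arrived late. He sat down."
naive_sents = naive_sent_split(sample)
for index, sent in enumerate(naive_sents):
    print(index, sent)
```

The rule treats the period in "Dr." as a sentence boundary, so the two sentences come back as three pieces.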

### Word Tokenization

##### spaCy

In [None]:
for index, token in enumerate(spacy_sents[0]):
    print(index, type(token), token)

##### NLTK

In [None]:
from nltk import word_tokenize

In [None]:
for index, token in enumerate(word_tokenize(nltk_sents[0])):
    print(index, type(token), token)

In [None]:
nltk_tokens = word_tokenize(nltk_sents[0])

##### Using Regular Expressions (regex)

In [None]:
import re

In [None]:
WORDS_RE = re.compile(r'\W+')

In [None]:
for index, token in enumerate(re.split(WORDS_RE, nltk_sents[0])):
    print(index, type(token), token)

But this drops the punctuation and leaves empty strings in the result.
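
The empty strings appear because `re.split` returns whatever lies on either side of each match, even when that is nothing (for example, after sentence-final punctuation). A small sketch on a made-up string, with one possible way to filter the empties out:

```python
import re

demo = "Hello, world."
parts = re.split(r'\W+', demo)
print(parts)       # the final '.' leaves an empty string at the end

# Keep only the non-empty pieces
non_empty = [p for p in parts if p]
print(non_empty)
```

Filtering removes the empty strings, but the punctuation is still gone.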

In [None]:
WORDS_AND_PUNCT_RE = re.compile(r'\w+|[\,\.\!\?\-]')

In [None]:
for index, token in enumerate(re.findall(WORDS_AND_PUNCT_RE, nltk_sents[0])):
    print(index, type(token), token)

See [this tutorial](https://www.datacamp.com/community/tutorials/python-regular-expression-tutorial) to learn more about regular expressions and visit [pythex.org](https://pythex.org) to experiment with patterns interactively.
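
The pattern above only knows a fixed set of punctuation marks, so characters such as the curly apostrophe in "company’s" are silently dropped. One possible extension is sketched below; the name `WORDS_APOS_PUNCT_RE` and the sample string are assumptions made for this example, not part of the original notebook.

```python
import re

# Assumed extension: keep apostrophes (straight or curly) inside words and
# treat any other character that is neither a word character nor whitespace
# as its own token.
WORDS_APOS_PUNCT_RE = re.compile(r"\w+(?:['’]\w+)*|[^\w\s]")

sample = "A company’s culture, first post-university jobs."
regex_tokens = WORDS_APOS_PUNCT_RE.findall(sample)
for index, token in enumerate(regex_tokens):
    print(index, token)
```

Hyphenated words are still split into three tokens here, which may or may not be what you want.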

# Part of Speech (POS) Tagging

##### spaCy

spaCy runs its tagger as part of the default pipeline, so every token already carries both a coarse-grained label (`pos_`) and a fine-grained tag (`tag_`).

In [None]:
[(t, t.pos_, t.tag_) for t in spacy_sents[0]]

[Meaning of the POS Labels and Tags](https://spacy.io/api/annotation#section-pos-tagging)
See the English section.

##### NLTK

In [None]:
nltk.pos_tag(nltk_tokens)
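
The two libraries report different label sets: NLTK's `pos_tag` returns Penn Treebank style tags, which line up with spaCy's fine-grained `tag_`, while spaCy's `pos_` is a coarser label. The sketch below uses a partial, hand-written lookup table (an assumption for illustration, not a complete or official mapping) to show how the fine-grained tags collapse into coarse ones. The tagged tokens are written by hand rather than produced by a tagger.

```python
# Partial, hand-written mapping from Penn Treebank tags to coarse labels.
# It covers only a handful of tags and is for illustration only.
PTB_TO_COARSE = {
    'NN': 'NOUN', 'NNS': 'NOUN', 'NNP': 'PROPN', 'NNPS': 'PROPN',
    'VB': 'VERB', 'VBD': 'VERB', 'VBP': 'VERB', 'VBZ': 'VERB',
    'JJ': 'ADJ', 'RB': 'ADV', 'DT': 'DET',
}

def coarse_tag(ptb_tag):
    """Return the coarse label, or 'X' when the tag is not in the partial table."""
    return PTB_TO_COARSE.get(ptb_tag, 'X')

# Hand-tagged tokens (not tagger output)
hand_tagged = [('Each', 'DT'), ('autumn', 'NN'), ('businesses', 'NNS'), ('flock', 'VBP')]
for word, tag in hand_tagged:
    print(word, tag, coarse_tag(tag))
```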

# Word Normalization: Stemming and Lemmatization
[Read more](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

##### spaCy

spaCy doesn't support stemming, but lemmatization is built into the default pipeline.

In [None]:
[(t, 'LEMMA:', t.lemma_) for t in spacy_sents[0]]

##### NLTK

NLTK has multiple stemmers that you can use.

In [None]:
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

In [None]:
lancaster_stem = LancasterStemmer()
porter_stem = PorterStemmer()
snowball_stem = SnowballStemmer('english')

In [None]:
for token in nltk_tokens:
    print('token: {} ---  Lancaster: {}  |  Porter: {}  |  Snowball: {}'.format(token, lancaster_stem.stem(token), porter_stem.stem(token), snowball_stem.stem(token)))
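
To make the difference between the two normalization strategies concrete, here is a toy sketch. Neither function is a real NLTK or spaCy algorithm: `toy_stem` strips suffixes by rule, and `toy_lemma` looks words up in a tiny hand-made dictionary. All names and table entries are assumptions made for this example.

```python
# Toy illustration only: not the Porter, Lancaster, or Snowball algorithm.
SUFFIXES = ('ing', 'es', 'ed', 's')   # checked in this order
LEMMA_TABLE = {'universities': 'university', 'businesses': 'business', 'jobs': 'job'}

def toy_stem(word):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemma(word):
    """Look the word up in a tiny hand-made dictionary."""
    return LEMMA_TABLE.get(word, word)

for word in ('universities', 'businesses', 'jobs'):
    print(word, '->', toy_stem(word), '|', toy_lemma(word))
```

The rule-based version can return strings that are not words (`universiti`), while the dictionary-based version returns a real word but only for entries it knows about.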

# Analyzing Sentence Structure

There are two main approaches to analyzing sentence structure: dependency-based and constituency-based.<br>
From [Wikipedia](https://en.wikipedia.org/wiki/Dependency_grammar#Dependency_vs._constituency):<br>
"Dependency is a one-to-one correspondence: for every element (e.g. word or morph) in the sentence, there is exactly one node in the structure of that sentence that corresponds to that element. The result of this one-to-one correspondence is that dependency grammars are word (or morph) grammars. All that exist are the elements and the dependencies that connect the elements into a structure. This situation should be compared with the constituency relation of phrase structure grammars. Constituency is a one-to-one-or-more correspondence, which means that, for every element in a sentence, there are one or more nodes in the structure that correspond to that element. The result of this difference is that dependency structures are minimal compared to their constituency structure counterparts, since they tend to contain much fewer nodes."

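The node-count difference can be seen on a three-word sentence. The sketch below uses hand-built structures for "The dog barked" (not parser output); the tuple layouts and the helper `count_nodes` are assumptions chosen for this example.

```python
# Hand-built structures for "The dog barked" (illustrative, not parser output).

# Dependency: one node per word; each word stores the index of its head (-1 = root).
dependency = [('The', 1), ('dog', 2), ('barked', -1)]

# Constituency: nested (label, children...) tuples; plain strings are the words.
constituency = ('S',
                ('NP', ('DT', 'The'), ('NN', 'dog')),
                ('VP', ('VBD', 'barked')))

def count_nodes(tree):
    """Count every node in a nested-tuple tree, words included."""
    if isinstance(tree, str):
        return 1
    return 1 + sum(count_nodes(child) for child in tree[1:])

print('dependency nodes:  ', len(dependency))
print('constituency nodes:', count_nodes(constituency))
```

The dependency structure has exactly one node per word, while the constituency tree adds phrase and part-of-speech nodes on top of the words.
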
##### spaCy

spaCy supports dependency parsing for analyzing sentence structure.

In [None]:
# already built in
[(t, t.dep_, t.head) for t in spacy_sents[0]]

In [None]:
from spacy import displacy

In [None]:
# you can convert a sentence Span object to a doc like this
sent_span = spacy_sents[0]
print(type(sent_span))
sent_doc = sent_span.as_doc()

In [None]:
displacy.render(sent_doc, style='dep', jupyter=True, options={'distance': 120})

##### NLTK

NLTK can perform both types of parsing, but its built-in parsers require you to define your own grammars, which is outside the scope of this course. NLTK also provides wrappers around Stanford's CoreNLP software for both dependency and constituency parsing. For dependency parsing, however, just stick with spaCy: it is faster, needs no extra setup, and is among the best available.

In [None]:
from nltk.parse.stanford import StanfordDependencyParser, StanfordParser

If the JAVAHOME environment variable is not set on your machine, you'll need to set it from your script.<br>
To determine the path to Java on your machine:<br>
Windows: open the command prompt and type `where java`<br>
OSX/Linux: open a terminal and type `which java`
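
The same lookup can be done from Python. This is a minimal sketch using the standard library's `shutil.which`, which searches the PATH much like the commands above; the variable name `detected_java` is made up for this example, and the sketch leaves any existing JAVAHOME value untouched.

```python
import os
import shutil

# shutil.which searches PATH, much like `which java` / `where java`
detected_java = shutil.which('java')

if detected_java is None:
    print('java was not found on PATH; set JAVAHOME by hand')
elif 'JAVAHOME' not in os.environ:
    os.environ['JAVAHOME'] = detected_java
    print('JAVAHOME set to', detected_java)
else:
    print('JAVAHOME already set to', os.environ['JAVAHOME'])
```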

In [None]:
import os

java_path = '/usr/bin/java'
os.environ['JAVAHOME'] = java_path

# point at the compiled jar rather than the -sources jar, which holds only source files
path_to_jar = "/Users/zacharywentzell/Downloads/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0.jar"
path_to_models = "/Users/zacharywentzell/Downloads/stanford-corenlp-full-2017-06-09/stanford-corenlp-3.8.0-models.jar"

In [None]:
scp = StanfordParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models)

In [None]:
test_sent = "Al Norman has been fighting to keep Walmart and other big-box retailers out of small towns like this one for 25 years."

In [None]:
parse_tree = list(scp.raw_parse(test_sent))[0]

If you want to display the following tree in Jupyter, you need to install Ghostscript.<br>
For Mac:<br>
`brew install ghostscript` or `conda install -c conda-forge ghostscript`<br><br>
For Windows:<br>
You'll need to download and install [Ghostscript from here](https://www.ghostscript.com/download/gsdnld.html).<br>
Then make sure Ghostscript is in your PATH by adding its `bin` folder, e.g. `C:\Program Files\gs\gs9.22\bin` (or something similar).

In [None]:
parse_tree

In [None]:
print(parse_tree)

In [None]:
# number of direct children of the root node: 1, the (S ...) subtree
len(parse_tree)

In [None]:
subtrees = list(parse_tree.subtrees())

In [None]:
subtrees[2]

In [None]:
sdp = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models)

In [None]:
parse_tree = list(sdp.parse(word_tokenize(test_sent)))[0]

In [None]:
parse_tree

# Noun Phrases (Chunks)

##### spaCy

In [None]:
doc = nlp(test_sent)
list(doc.noun_chunks)

##### NLTK

In [None]:
parse_tree = list(scp.raw_parse(test_sent))[0]

In [None]:
[(' '.join(tree.leaves()), len(list(tree.subtrees(filter=lambda t: t.label() == 'NP')))) for tree in parse_tree.subtrees(filter=lambda t: t.label() == 'NP') if len(list(tree.subtrees(filter=lambda t: t.label() == 'NP'))) == 1]
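
The comprehension above keeps only the NP subtrees that contain no other NP. The same logic is unpacked below on a hand-written bracketed parse (not parser output), so the cell runs without CoreNLP. The names `is_np`, `toy_tree`, and `base_nps` are assumptions made for this sketch.

```python
from nltk import Tree

def is_np(t):
    """True for subtrees labeled as noun phrases."""
    return t.label() == 'NP'

# Hand-written parse, used only so the example is self-contained
toy_tree = Tree.fromstring(
    "(S (NP (NP (DT the) (NN dog)) (PP (IN in) (NP (DT the) (NN yard)))) (VP (VBD barked)))"
)

base_nps = []
for np_tree in toy_tree.subtrees(filter=is_np):
    # subtrees() also yields the tree itself, so a count of 1 means
    # "no other NP is nested inside this NP"
    if len(list(np_tree.subtrees(filter=is_np))) == 1:
        base_nps.append(' '.join(np_tree.leaves()))

print(base_nps)
```

The outer NP ("the dog in the yard") is skipped because it contains two smaller NPs.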

# Named Entities
Named entity recognition applies categorical labels to sequences of tokens (such as proper nouns) that represent different types of entities, such as people, places, and organizations. If you build your own Named Entity Recognition (NER) model, the categories can be whatever you want them to be. If you use someone else's model, you are limited to the categories/labels it was trained to recognize. A model is also only as good as the data it was trained on: if you apply a pretrained model to text that is very different from its training text, you may not get very good results.

##### spaCy
[See here for more info on the entity types spaCy's models are trained to recognize](https://spacy.io/api/annotation#section-named-entities)

In [None]:
test_sent = 'FC Bayern was founded in 1900 by eleven football players led by Franz John.'

In [None]:
doc = nlp(test_sent)

In [None]:
[(e, e.label_) for e in doc.ents]

In [None]:
# Also notice that all tokens in a Doc/Span have entity types as well.
[(t, t.ent_type_) for t in doc]

##### NLTK

In [None]:
tokens = word_tokenize('FC Bayern was founded in 1900 by eleven football players led by Franz John.')


In [None]:
tagged_tokens = nltk.pos_tag(tokens)

In [None]:
named_entity_chunks = nltk.ne_chunk(tagged_tokens)
named_entity_chunks

In [None]:
[(ne.label(), ' '.join(leaf[0] for leaf in ne.leaves())) for ne in named_entity_chunks if hasattr(ne, 'label')]

# Extracting Phrases/Chunks (an example)

##### only spaCy

### prepositional phrases

In [None]:
text = """Bayern Munich, or FC Bayern, is a German sports club based in Munich, Bavaria, Germany. It is best known for its professional football team, which plays in the Bundesliga, the top tier of the German football league system, and is the most successful club in German football history, having won a record 26 national titles and 18 national cups. FC Bayern was founded in 1900 by eleven football players led by Franz John. Although Bayern won its first national championship in 1932, the club was not selected for the Bundesliga at its inception in 1963. The club had its period of greatest success in the middle of the 1970s when, under the captaincy of Franz Beckenbauer, it won the European Cup three times in a row (1974-76). Overall, Bayern has reached ten UEFA Champions League finals, most recently winning their fifth title in 2013 as part of a continental treble."""

In [None]:
doc = nlp(text)

In [None]:
prep_objs = [token for token in doc if token.dep_ == 'pobj']
prep_objs[:5]

In [None]:
for prep_obj in prep_objs:
    prep = prep_obj.head
    phrase = doc[prep.i:prep_obj.i + 1]
    print(prep, prep_obj, '---', phrase, '   ', type(phrase))
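
The slicing idea can be seen without a model by using plain tuples as a stand-in for spaCy tokens. The tokens, dependency labels, and head indices below are annotated by hand for illustration; they are not spaCy output.

```python
# Hand-annotated tokens as (text, dependency label, index of head)
toy_tokens = [
    ('founded', 'ROOT', 0),
    ('in', 'prep', 0),
    ('1900', 'pobj', 1),
    ('by', 'agent', 0),
    ('eleven', 'nummod', 6),
    ('football', 'compound', 6),
    ('players', 'pobj', 3),
]

toy_phrases = []
for i, (word, dep, head) in enumerate(toy_tokens):
    if dep == 'pobj':
        # slice from the preposition (the object's head) through the object
        toy_phrases.append(' '.join(t[0] for t in toy_tokens[head:i + 1]))

print(toy_phrases)
```

Because the slice ends at the object, any modifiers that follow it are cut off, and a preposition that comes after its object would produce an empty slice.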

# Other tasks

spaCy can't do everything. Stanford's CoreNLP project can help fill in some of the gaps.

## Coreference Resolution

Figure out which terms in a text refer to the same entity, including across sentence boundaries.<br>
<br>
This is extremely helpful for working out what pronouns refer to when you are trying to extract information from text.

In [None]:
text = """Barack Obama was born in Hawaii.  He is the president. Obama was elected in 2008."""

In [None]:
from pycorenlp import StanfordCoreNLP

In [None]:
# first start the CoreNLP server by running
# java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

corenlp = StanfordCoreNLP('http://localhost:9000')

In [None]:
# the first run, and any run where you change the annotators, is slow
results = corenlp.annotate(text, properties={'annotators': 'ssplit, tokenize, coref',
                                             'coref.algorithm': 'statistical',
                                             'outputFormat': 'json'
                                            })

In [None]:
results.keys()

In [None]:
for coref_id, corefs in results['corefs'].items():
    for coref in corefs:
        print(coref['text'])
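
To turn the chains into something usable, you can group each chain under its representative mention. The sketch below runs on a hand-written sample in the general shape of the `corefs` section (chain id mapped to a list of mentions). The values are made up, real output carries more fields per mention, and the `isRepresentativeMention` and `sentNum` keys are assumptions about the JSON layout that you should check against your own `results`.

```python
# Hand-written sample, not actual CoreNLP output
sample_corefs = {
    '1': [
        {'text': 'Barack Obama', 'sentNum': 1, 'isRepresentativeMention': True},
        {'text': 'He', 'sentNum': 2, 'isRepresentativeMention': False},
        {'text': 'Obama', 'sentNum': 3, 'isRepresentativeMention': False},
    ],
}

chains = {}
for chain_id, mentions in sample_corefs.items():
    # the representative mention names the entity; the rest refer back to it
    representative = next(m['text'] for m in mentions if m['isRepresentativeMention'])
    others = [m['text'] for m in mentions if not m['isRepresentativeMention']]
    chains[representative] = others

print(chains)
```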

# Other Modules to Check Out

- [Textacy](https://github.com/chartbeat-labs/textacy)
- [Textblob](http://textblob.readthedocs.io/en/dev/)
- [Pattern](https://github.com/clips/pattern)
- [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/index.html) via [pycorenlp](https://github.com/smilli/py-corenlp)

# References

Dipanjan Sarkar. 2016. Text Analytics with Python: A Practical Real-World Approach to Gaining Actionable Insights from Your Data (1st ed.). Apress, Berkeley, CA, USA.