# NLP Research Notebook

Nathan Martindale  
Fall 2018


### Resources

[https://www.nltk.org/book/ch01.html](https://www.nltk.org/book/ch01.html)  
[https://www.nltk.org/book/ch03.html](https://www.nltk.org/book/ch03.html)  
[https://www.nltk.org/book/ch05.html](https://www.nltk.org/book/ch05.html)  
[http://ad-publications.informatik.uni-freiburg.de/theses/Bachelor_Jon_Ezeiza_2017.pdf](http://ad-publications.informatik.uni-freiburg.de/theses/Bachelor_Jon_Ezeiza_2017.pdf)

In [4]:
import nltk
#nltk.download() # uncomment and run to download corpus

In [6]:
from nltk.book import *

The above brings in a bunch of test texts, available as `text1` through `text9`.

In [13]:
print(text4)

<Text: Inaugural Address Corpus>


In [16]:
fdist4 = FreqDist(text4)
print(fdist4)
fdist4.most_common(50)

<FreqDist with 9754 samples and 145735 outcomes>


[('the', 9281),
 ('of', 6970),
 (',', 6840),
 ('and', 4991),
 ('.', 4676),
 ('to', 4311),
 ('in', 2527),
 ('a', 2134),
 ('our', 1905),
 ('that', 1688),
 ('be', 1460),
 ('is', 1403),
 ('we', 1141),
 ('for', 1075),
 ('by', 1036),
 ('it', 1011),
 ('which', 1002),
 ('have', 994),
 ('not', 916),
 ('as', 888),
 ('with', 886),
 ('will', 846),
 ('I', 831),
 ('are', 774),
 ('all', 758),
 ('their', 719),
 ('this', 700),
 ('The', 619),
 ('has', 611),
 ('people', 559),
 ('its', 554),
 (';', 544),
 ('or', 537),
 ('from', 521),
 ('on', 496),
 ('We', 483),
 ('been', 482),
 ('but', 479),
 ('can', 457),
 ('us', 455),
 ('my', 449),
 ('no', 406),
 ('an', 377),
 ('--', 363),
 ('upon', 363),
 ('who', 356),
 ('It', 356),
 ('so', 354),
 ('must', 345),
 ('they', 341)]

The NLTK book makes an interesting point: the _long_ words of a text can be more characteristic and informative than its most frequent words.

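The following gets all words longer than 7 characters that appear more than 7 times. Note that the candidate words come from `text5` while the counts come from `fdist4` (built from `text4`), so the result is limited to words that occur in both texts; `set(text4)` was probably intended.

In [20]:
sorted(w for w in set(text5) if len(w) > 7 and fdist4[w] > 7)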

['American',
 'Americans',
 'Constitution',
 'anything',
 'attention',
 'blessings',
 'carrying',
 'children',
 'concerning',
 'condition',
 'considered',
 'construction',
 'depression',
 'destruction',
 'development',
 'devotion',
 'difference',
 'different',
 'discretion',
 'election',
 'elections',
 'emergency',
 'employed',
 'entertain',
 'entitled',
 'especially',
 'everything',
 'excitement',
 'experience',
 'expression',
 'feelings',
 'financial',
 'forgotten',
 'friendly',
 'goodness',
 'humanity',
 'immediate',
 'important',
 'impossible',
 'impressed',
 'individual',
 'information',
 'intelligent',
 'involved',
 'national',
 'neighbor',
 'official',
 'opportunity',
 'original',
 'particular',
 'permanent',
 'personal',
 'physical',
 'platform',
 'political',
 'politics',
 'position',
 'problems',
 'punishment',
 'question',
 'questions',
 'recognize',
 'relation',
 'religion',
 'remember',
 'republican',
 'situation',
 'solution',
 'something',
 'sometimes',
 'speaking',
 'st

**collocation** - a sequence of words that occur together unusually often.  
**bigram** - a word pair
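
To make "unusually often" concrete, here is a minimal sketch that pairs adjacent tokens into bigrams and scores each pair with pointwise mutual information (PMI). PMI is only one common association measure and is an assumption here; it is not necessarily what `text4.collocations()` uses internally. The helper names and the toy sentence are made up for illustration (`nltk.bigrams` does the same pairing as the `bigrams` helper).

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs: [a, b, c] -> [(a, b), (b, c)]."""
    return list(zip(tokens, tokens[1:]))


def pmi_scores(tokens):
    """Score every bigram by PMI: log2(p(w1, w2) / (p(w1) * p(w2)))."""
    word_counts = Counter(tokens)
    pair_counts = Counter(bigrams(tokens))
    n_words = len(tokens)
    n_pairs = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in pair_counts.items():
        p_pair = count / n_pairs
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Toy input (hypothetical), not taken from any NLTK corpus.
toy_tokens = "the united states and the united nations and the states".split()
scores = pmi_scores(toy_tokens)

# Print bigrams from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

A high score means the pair co-occurs more often than the individual word frequencies would predict. On a corpus this small the scores are noisy, so treat them as a demonstration of the idea only.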

In [22]:
text4.collocations()

United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties


In [28]:
#theplot = fdist4.plot() # TODO: investigate options for this function

**stem** - the main part of a word (no suffixes)

### Fancy stuff

In [47]:
from unidecode import unidecode

# read the whole article into a single string
with open('./sample_article1.txt') as f:
    articletext = f.read()
articletext = unidecode(articletext) # we don't need none of your fancyass unicode-quote monkey business here...
print(articletext)


HARARE (Reuters) - Zimbabwe's main opposition leader, Nelson Chamisa, filed a court challenge on Friday against President Emmerson Mnangagwa's election victory, halting Mnangagwa's planned Sunday inauguration.
The first election since Robert Mugabe was forced to resign after a coup in November had been expected to end Zimbabwe's pariah status and launch an economic recovery but post-election unrest has reminded the country of its violent past.
Chamisa's lawyer Thabani Mpofu said he had asked the Constitutional Court to nullify the July 30 vote and that his court application meant Mnangagwa's swearing-in had been halted.
Justice Minister Ziyambi Ziyambi told Reuters Sunday's inauguration "will no longer happen" until the case is finalised.
"On the basis of the evidence we have placed before the court, we seek in the main relief to the effect that the court should declare the proper winner and the proper winner is my client," Mpofu told reporters outside.
"In the alternative, we seek tha

In [50]:
tokens = nltk.word_tokenize(articletext)

# run stemmer
#porter = nltk.PorterStemmer()
#[porter.stem(t) for t in tokens]

tagged = nltk.pos_tag(tokens)

In [57]:
# find all of the proper nouns
propernouns = [word for (word, tag) in tagged if tag == "NNP"]

nltk.FreqDist(propernouns)

FreqDist({'Mnangagwa': 6, 'Chamisa': 4, 'Zimbabwe': 4, 'Mpofu': 3, 'Minister': 3, 'Reuters': 2, 'Friday': 2, 'Sunday': 2, 'Court': 2, 'Ziyambi': 2, ...})


## Process

**tokenization** - "Tokenization is the task of splitting the text into more meaningful character sequence groups, usually words or sentences. It is almost always the first step in any NLP pipeline." (bachelor thesis)
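
As a rough illustration of the definition above, the following sketch splits text into word and sentence tokens using only regular expressions. The patterns and function names are assumptions chosen for brevity. This is not how `nltk.word_tokenize` works, and it will mishandle abbreviations such as "Mr." and contractions such as "Zimbabwe's".

```python
import re

# Assumed pattern: a run of word characters, or any single
# non-space, non-word character (i.e. punctuation).
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_word_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


def simple_sentence_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


print(simple_word_tokenize("The court ruled, finally."))
print(simple_sentence_split("It rained. We stayed in."))
```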

+ **normalization** - Normalization is the exercise of removing undesired variation from text (same source)
+ **stemming** - Removes morphological variation by algorithmic means (for example, stripping plural endings)
+ **lemmatization** - Alternative method that uses dictionaries to extract the correct roots of words
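
The difference between these three steps can be sketched with toy versions of each. Everything below is hypothetical: the suffix list and the mini-dictionary are stand-ins for illustration, whereas a real stemmer (such as NLTK's `PorterStemmer`) or a real lemmatizer relies on far larger rule sets and lexicons.

```python
from collections import Counter

# Hypothetical suffix list, checked in order; real stemmers use many more rules.
TOY_SUFFIXES = ("ing", "ies", "es", "ed", "s")

# Hypothetical mini-dictionary standing in for a real lexicon.
TOY_LEMMAS = {"geese": "goose", "studies": "study", "ran": "run"}


def normalize_case(tokens):
    """Remove one kind of unwanted variation: letter case."""
    return [t.lower() for t in tokens]


def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


# Case normalization merges 'The'/'the' and 'We'/'we' into single entries.
counts = Counter(normalize_case(["The", "the", "We", "we", "the"]))
print(counts["the"], counts["we"])

for word in ["elections", "studies", "geese"]:
    print(word, "| stem:", toy_stem(word), "| lemma:", toy_lemmatize(word))
```

The rule-based stemmer turns "studies" into the non-word "stud" and cannot handle the irregular plural "geese", while the dictionary lookup returns "study" and "goose" but knows nothing about words missing from its dictionary.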

**parsing** - Annotation tasks to identify the role of words in context, allowing higher levels of abstraction.

+ **POS tagging** - Annotating each word with a morphological class (adjective, verb, etc.)
+ **Syntactical analysis** - Determine the role of each word in the sentence and capture dependencies between each component (extracting subject-verb-object triplets is an example)
+ **Abstract meaning representation** - unification of POS tagging and syntactic parsing with real-world information (named entity recognition, word sense disambiguation). <span style="color: red">ONLY PARTIAL SOLUTIONS EXIST</span>
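
To show what a subject-verb-object triplet looks like, here is a deliberately naive sketch that works on tokens that are already POS-tagged. The input is hand-tagged with Penn Treebank style tags (an assumption, matching the `"NNP"` tag used earlier), and the nearest-noun heuristic is a stand-in for illustration. Real syntactic analysis needs a proper parser.

```python
def naive_svo(tagged):
    """Return (subject, verb, object) from (word, tag) pairs, or None.

    Heuristic: take the first verb, the closest noun before it as the
    subject, and the first noun after it as the object.
    """
    verb_index = next(
        (i for i, (_, tag) in enumerate(tagged) if tag.startswith("VB")),
        None,
    )
    if verb_index is None:
        return None
    nouns_before = [w for w, tag in tagged[:verb_index] if tag.startswith("NN")]
    nouns_after = [w for w, tag in tagged[verb_index + 1:] if tag.startswith("NN")]
    if not nouns_before or not nouns_after:
        return None
    return (nouns_before[-1], tagged[verb_index][0], nouns_after[0])


# Hand-tagged example sentence (hypothetical input, not tagger output).
hand_tagged = [("Chamisa", "NNP"), ("filed", "VBD"), ("a", "DT"), ("challenge", "NN")]
print(naive_svo(hand_tagged))
```

This breaks on anything beyond the simplest sentences (passives, compound nouns, subordinate clauses), which is why dependency information is needed in practice.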

Section 2.3 in the bachelor thesis is extremely useful

**Vector Space Model** - a way of representing documents in algebraic form (generally bag of words?). Actual frequencies aren't stored; instead, each term gets a weight that represents a "relevance measure" of the term in the document. The most popular weightings are **tf-idf** and **BM25**.
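
A minimal sketch of the bag-of-words idea, assuming each document is already a list of tokens: every document becomes a vector with one position per vocabulary term. Raw counts are used here only as the simplest starting point. In a weighted model each count would be replaced by a relevance weight. The function names and toy documents are made up for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Collect every distinct term across all documents, in sorted order."""
    return sorted({term for doc in documents for term in doc})


def to_count_vector(doc, vocabulary):
    """Represent one document as term counts aligned with the vocabulary."""
    counts = Counter(doc)
    return [counts[term] for term in vocabulary]


# Toy, pre-tokenized documents (hypothetical).
toy_docs = [
    ["court", "election", "court"],
    ["election", "vote"],
]
vocabulary = build_vocabulary(toy_docs)
vectors = [to_count_vector(doc, vocabulary) for doc in toy_docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Word order is discarded, which is what "bag of words" refers to: only which terms occur, and how strongly, is kept.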

**Tf-idf** - term frequency-inverse document frequency. To avoid treating "stop words" as relevant, a term's weight depends not only on how often it appears in a document but also on how many other documents it appears in: the more documents contain it, the lower its weight. (Good for marking the difference between documents.)
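
The weighting can be sketched as follows, assuming one common textbook variant: tf is the term count divided by document length, and idf is the natural log of (number of documents / number of documents containing the term). Libraries often use smoothed or otherwise adjusted formulas, so the exact numbers here are illustrative. The function name and toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    weight = (count / doc_length) * ln(n_docs / docs_containing_term)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy, pre-tokenized documents (hypothetical).
toy_docs = [
    ["the", "court", "ruled"],
    ["the", "election", "result"],
    ["the", "court", "election"],
]
weights = tf_idf(toy_docs)

for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

Because "the" appears in every document, its idf is ln(1) = 0 and its weight vanishes, while "ruled", which appears in only one document, gets the highest weight in the first document.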

NOTE: (stopped on page 15, section 2.6)