### Tokenization and lower-casing

In [None]:
import nltk

In [None]:
nltk.download('book')

In [None]:
# tokenization with plain python

text = "To be or not to be"

text_words = text.split(" ") # Tokenize sentence
print(text_words)

# We expect 4 unique words, since "to" and "be" each appear twice,
# but we observe 5 because "To" and "to" differ in case.

print("Unique no. of words:", len(set(text_words)))
print(set(text_words))


The set of unique words in our document is known as the **Vocabulary**. Its size is essentially the number of features we have.

In [None]:
# lower-case every token so that "To" and "to" count as the same word
print({i.lower() for i in text_words})

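Now, let's calculate a measure of the lexical richness of the text: the number of distinct words as a percentage of the total number of words. The next cells compute it for Moby Dick (text1), first on the raw tokens and then after lower-casing. The lower-cased figure comes out at roughly 6.6%, which is equivalent to each word being used about 15 times on average (100 / 6.6 ≈ 15).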

In [None]:
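from nltk.book import text1  # Moby Dick, from the 'book' collection downloaded above
# distinct tokens as a percentage of all tokens (case-sensitive)
moby_lex = (len(set(text1))/len(text1))*100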
print(moby_lex)

In [None]:
# case-insensitive vocabulary of Moby Dick
moby_true_lex = {i.lower() for i in text1}

In [None]:
len(moby_true_lex)/len(text1)*100
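
The same calculation can be wrapped in a small helper so it works on any list of tokens. This is only a sketch: the name `lexical_diversity` and its `lowercase` parameter are made up for illustration, and it assumes the token list is not empty.

```python
def lexical_diversity(tokens, lowercase=True):
    """Return (percent of distinct tokens, average uses per distinct token)."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    total = len(tokens)          # number of word tokens
    distinct = len(set(tokens))  # size of the vocabulary
    return distinct / total * 100, total / distinct

tokens = "To be or not to be".split()
percent, avg_uses = lexical_diversity(tokens)
print(round(percent, 1), avg_uses)
```

The two returned numbers are reciprocals of each other (up to the factor of 100), which is why a low percentage means each word is reused many times.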

In [None]:
# Via NLTK
words = nltk.word_tokenize("Python is an awesome language!")
words
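
To see why a dedicated tokenizer helps, compare splitting on spaces with a rough regular-expression tokenizer. The pattern below is an assumption chosen for illustration, not the rule set that `nltk.word_tokenize` uses; it only shows the idea of separating punctuation from words.

```python
import re

sentence = "Python is an awesome language!"

# Splitting on spaces leaves the "!" glued to the last word.
print(sentence.split(" "))

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(tokens)
```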

### Searching for Words

In [None]:
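import re
# Regex matching is case-sensitive by default, so this prints False
re1 = re.compile('python')
print(bool(re1.match('Python')))
# Passing re.IGNORECASE makes the same pattern match, so this prints True
print(bool(re.match('python', 'Python', re.IGNORECASE)))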

There are many ways to examine the context of a text apart from simply reading it. A concordance view shows us every occurrence of a given word, together with some context. 

In [None]:
text1.concordance("monstrous")
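
A concordance is simple enough to sketch in plain Python. The helper below is hypothetical (the name `simple_concordance` and the `width` parameter are not part of NLTK); it lists each occurrence of a word with a few tokens of context on either side, ignoring case.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens on either side."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]     # context before the match
            right = tokens[i + 1:i + 1 + width]    # context after the match
            lines.append(" ".join(left + [token] + right))
    return lines

tokens = "To be or not to be that is the question".split()
for line in simple_concordance(tokens, "be", width=2):
    print(line)
```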

## Computing Statistics with Text Data

How can we automatically identify the words of a text that are most informative about the topic and genre of the text? Imagine how you might go about finding the 50 most frequent words of a book.

A frequency distribution records how many times each vocabulary item occurs in the text. It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items. Since we often need frequency distributions in language processing, NLTK provides built-in support for them. Let's use a FreqDist to find the 50 most frequent words of Moby Dick:

In [None]:

from nltk import FreqDist
f_dist = FreqDist(text1)

In [None]:
print(f_dist)

In [None]:
f_dist.plot(50, cumulative=True)

In [None]:
f_dist.hapaxes()
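
FreqDist is a subclass of Python's `collections.Counter`, so the same ideas can be tried on a toy sentence without any NLTK data. The sentence below is invented for illustration.

```python
from collections import Counter

tokens = "the whale and the sea and the ship".split()
counts = Counter(tokens)

# Most frequent tokens first
print(counts.most_common(2))

# Hapaxes: tokens that occur exactly once
hapaxes = [word for word, n in counts.items() if n == 1]
print(hapaxes)
```

On the real data, `f_dist.most_common(50)` lists the 50 most frequent words of Moby Dick with their counts.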

## Collocations, Unigrams and Bigrams

A collocation is a sequence of words that occur together unusually often. Thus *red wine* is a collocation, whereas *the wine* is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, *maroon wine* sounds definitely odd.

To get a handle on collocations, we start off by extracting from a text a list of word pairs, also known as bigrams. This is easily accomplished with the function bigrams():

In [None]:
from nltk import bigrams
list(bigrams(['more', 'is', 'said', 'than', 'done']))

Collocations are essentially just frequent bigrams, except that we want to pay more attention to the cases that involve rare words. In particular, we want to find bigrams that occur more often than we would expect based on the frequency of the individual words. The collocations() function does this for us.

In [None]:
text1.collocations()
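
The idea of "more often than we would expect" can be sketched with pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. This is a simplified stand-in chosen for illustration, not necessarily the measure collocations() uses internally; the helper `pmi_scores`, its `min_count` parameter, and the toy text are all assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=2):
    """Score each bigram seen at least `min_count` times by its PMI."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue  # pairs seen too rarely give unreliable scores
        p_pair = count / n_bigrams
        p_w1 = word_counts[w1] / n_words
        p_w2 = word_counts[w2] / n_words
        # log of observed pair probability over the probability under independence
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = ("the red wine and the white wine and the red wine "
          "and the old man and the sea").split()
scores = pmi_scores(tokens)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

In this toy text the pair ("red", "wine") ranks first: "red" is rare and nearly always followed by "wine", whereas "the" and "and" are frequent and combine with many words.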

**Unigram**: where a bigram is a sequence of two words, a unigram is a single word.
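
Unigrams and bigrams are the n = 1 and n = 2 cases of the same sliding window. A minimal sketch in plain Python, where the helper name `ngrams` is chosen here for illustration:

```python
def ngrams(tokens, n):
    """Return the list of n-token tuples found by sliding a window over `tokens`."""
    # zip stops at the shortest slice, so only complete windows are kept
    return list(zip(*(tokens[i:] for i in range(n))))

words = ['more', 'is', 'said', 'than', 'done']
print(ngrams(words, 1))  # unigrams
print(ngrams(words, 2))  # bigrams
```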




## Corpus
A corpus (plural: corpora) is a collection of written texts that serves as our dataset.

In [None]:
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
brown_sents = brown.sents(categories='news')

In [None]:
brown_tagged_sents

In [None]:
# The Brown corpus is annotated with the Brown tagset; JJ marks an adjective
nltk.help.brown_tagset('JJ')

## Word Tagging and Models

Given any sentence, you can classify each word as a noun, verb, conjunction, or any other class of words. When there are hundreds of thousands of sentences, even millions, this is obviously a large and tedious task to do by hand, but it is one that can be solved computationally.
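
To make this concrete, one of the simplest possible tagging models gives each word the tag it carried most often in some tagged training data. The sketch below is only a baseline under stated assumptions: the tiny training set is invented for illustration, and the helpers `train_most_frequent_tag` and `tag` are hypothetical names, not NLTK functions.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set, invented for illustration only.
tagged_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("run", "NN"), ("ends", "VBZ")],
    [("dogs", "NNS"), ("run", "VBP")],
    [("they", "PRP"), ("run", "VBP")],
]

def train_most_frequent_tag(tagged_sents):
    """Map each lower-cased word to the tag it carries most often."""
    tag_counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            tag_counts[word.lower()][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

def tag(words, model, default="NN"):
    """Tag each word with its most frequent tag, falling back to `default`."""
    return [(w, model.get(w.lower(), default)) for w in words]

model = train_most_frequent_tag(tagged_sents)
print(tag(["The", "dogs", "run"], model))
```

A model like this ignores context entirely, so it always tags "run" the same way. That limitation is what more capable taggers address.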


### NLTK Parts of Speech Tagger

NLTK is a Python package that provides tools for many text-processing techniques, such as classification, tokenization, stemming, and parsing. The one that matters for this example is tagging.

In [None]:
import nltk 

text = nltk.word_tokenize("Python is an awesome language!")
nltk.pos_tag(text)

In [None]:
# Not sure what tags such as DT mean? Print the full Penn Treebank tagset.

nltk.help.upenn_tagset()



In [None]:
import sys
!conda install --yes --prefix {sys.prefix} spacy

In [None]:
!{sys.executable} -m pip install spacy