# What is NLTK?


<img src="res/NLTK.png"/>
<a href='http://www.nltk.org'> NLTK (Natural Language ToolKit)</a><br><br>
- NLTK is a leading platform for building Python programs to work with human language data.<br>
- It is intended to support research and teaching in NLP and closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning.


## Using NLTK

* NLTK ([Natural Language Toolkit](http://www.nltk.org/)) is an external library, so you must import it before you can use it:

In [None]:
import nltk

### Access a corpus
#### [Corpus](https://en.wikipedia.org/wiki/Text_corpus) = Large and structured set of texts
(NLTK has several built-in corpora for you to test, but you can also use your own.)


In [None]:
# needs the Brown corpus data; if it is missing, run nltk.download('brown') once
nltk.corpus.brown.categories()

Here we will use our own :)<br>


<b>Corpus:</b> Friends TV Show, Season 10<br>
<b>Goal:</b> Identify the most important character of the season

<img src='res/friends.jpg'>

In [None]:
# first, we will access the file with the transcripts
# (the with block closes the file automatically once it has been read)
with open('datasets/friends/season_10.txt', "r", encoding='UTF-8') as file:
    text = file.read()
text

### Preprocessing text = Make text ready to be analyzed

We will start preprocessing by [tokenizing](https://nlp.stanford.edu/IR-book/html/htmledition/tokenization-1.html) the text.


##### Tokenization = cut the text into pieces like sentences or words
<img src="res/token.jpg"/>


In [None]:
words=nltk.tokenize.word_tokenize(text)
words[:10]
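
To get a feel for what a tokenizer does, here is a minimal sketch that uses only Python's `re` module. It is a simplified stand-in, not the algorithm behind `word_tokenize`, which applies more rules. The function name `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration only; it is not NLTK's tokenizer.
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Hi, how are you?"  # made-up sentence, not from the transcripts
tokens = simple_tokenize(sample)
print(tokens)
```

Each punctuation mark becomes its own token, which is why punctuation shows up when the tokens are counted later.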

As a result we have a data structure called a list. <br/>

In [None]:
# word_tokenize gave us a list containing all the words (tokens) in the corpus,
# so we can use the Python function len to count the total number of words.
len(words)

Now, let's see how many unique words we have:

In [None]:
# the Python function set keeps only the unique words, so this is the vocabulary size of the corpus
len(set(words))
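
Lexical diversity is often expressed as a ratio: the number of unique tokens divided by the total number of tokens. A minimal sketch on a made-up token list follows. The helper name `lexical_diversity` is an assumption for illustration; with the notebook's data you would pass `words`.

```python
def lexical_diversity(tokens):
    """Return unique tokens divided by total tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# made-up tokens, not taken from the transcripts
sample_tokens = ["we", "were", "on", "a", "break", "we", "were"]
diversity = lexical_diversity(sample_tokens)
print(round(diversity, 2))
```

A value close to 1 means almost every token is different; a value close to 0 means the same few tokens are repeated many times.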

### Analyse the text: Find your insights!
The text is ready to be analyzed, so let NLTK read and count:

In [None]:
fd=nltk.FreqDist(words)
fd.most_common(10)

:( not helpful?
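
Raw counts tend to be dominated by punctuation and very common function words. One option, sketched here with the standard library's `collections.Counter` (the class that `nltk.FreqDist` extends), is to drop those tokens before counting. The tiny stopword set, the helper name `count_content_words` and the sample tokens are all made up for illustration; a real analysis would use a fuller stopword list.

```python
from collections import Counter

# hypothetical, deliberately tiny stopword list for illustration
STOPWORDS = {"i", "you", "the", "a", "to", "and", "it", "is"}

def count_content_words(tokens):
    """Count lowercased alphabetic tokens that are not stopwords."""
    kept = [t.lower() for t in tokens
            if t.isalpha() and t.lower() not in STOPWORDS]
    return Counter(kept)

# made-up tokens, not taken from the transcripts
sample_tokens = ["You", "and", "I", ",", "the", "coffee", "is", "cold", ".", "Coffee", "!"]
counts = count_content_words(sample_tokens)
print(counts.most_common(1))
```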

<img src='res/idea.jpg'>
Courtesy: <a href='https://www.istockphoto.com/au'>iStock</a>

What if we filter by nouns? NLTK can identify parts of speech such as nouns, verbs and adjectives.

In [None]:
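# tag every token with its part of speech; each item is a (word, tag) pair
tagged_words=nltk.pos_tag(words)
nouns=[]
for word, tag in tagged_words:
    # the noun tags all start with 'NN' (NN, NNS, NNP, NNPS)
    if tag.startswith('NN'):
        # lowercase so that differently capitalized forms are counted together
        nouns.append(word.lower())

len(nouns)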

In [None]:
fd=nltk.FreqDist(nouns)
fd.most_common(6)

In [None]:
fd.plot(6)
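
The noun-counting steps above can be wrapped in a small reusable function. This is a sketch under assumptions: `top_nouns` is a hypothetical helper name, and the hand-written `(word, tag)` pairs stand in for the output of `nltk.pos_tag`, using Penn Treebank tags.

```python
from collections import Counter

def top_nouns(tagged_words, n):
    """Return the n most frequent nouns from (word, tag) pairs.

    Penn Treebank noun tags all begin with 'NN': NN, NNS, NNP, NNPS.
    """
    nouns = [word.lower() for word, tag in tagged_words if tag.startswith("NN")]
    return Counter(nouns).most_common(n)

# hand-tagged example pairs, made up for illustration
sample_tagged = [
    ("Joey", "NNP"), ("eats", "VBZ"), ("pizza", "NN"),
    ("and", "CC"), ("Joey", "NNP"), ("shares", "VBZ"), ("sandwiches", "NNS"),
]
result = top_nouns(sample_tagged, 2)
print(result)
```

With the notebook's data, the equivalent call would be `top_nouns(nltk.pos_tag(words), 6)`.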

# Introducing the challenge


Read all the TED Talks up to September 2018 and investigate the most common words (nouns)<br>
<img src='res/tedtalks.png'>


# What we learned today... 


NLTK in 2 steps

## 1) Access and preprocess your text
We accessed a text and preprocessed it by tokenizing it and filtering by nouns. <br/>

## 2) Analyse and extract information
We interpreted the resulting counts. <br/>

# If you use TEXT in your research, NLTK is relevant to you. 
# Come to the next sessions today and sign up for workshops/meetups to learn more!
https://research.unimelb.edu.au/infrastructure/research-platform-services/training/nltk