## Natural Language Toolkit (NLTK)

**NLTK** is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to [over 50 corpora and lexical resources](http://www.nltk.org/nltk_data/) such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

http://www.nltk.org/

NLTK library documentation (reference) = *Use it to look up how to use a particular NLTK library function*
* https://www.nltk.org/api/nltk.html

---

NLTK wiki (collaboratively edited documentation):
* https://github.com/nltk/nltk/wiki

### Book: Natural Language Processing with Python 

The NLTK book provides a practical introduction to programming for language processing.

Written by the creators of NLTK, it guides the reader through the fundamentals of writing Python programs, working with corpora, categorizing text, analyzing linguistic structure, and more.

Online: http://www.nltk.org/book/

* we will start with Chapter 1: ["Language Processing and Python"](http://www.nltk.org/book/ch01.html)

---

In [2]:
# configuration for the notebook 
%matplotlib notebook

## 1) Getting started

NLTK book: http://www.nltk.org/book/ch01.html#getting-started-with-nltk

* Loading NLTK (Python module)
* Downloading NLTK language resources (corpora, ...)


In [3]:
# In order to use a Python library, we need to import (load) it

import nltk
import pandas as pd # we will use it to read our data


In [4]:
# Let's check what NLTK version we have (for easier troubleshooting and reproducibility)
nltk.__version__

'3.7'

In [5]:
# If your NLTK version is lower than 3.4.3 please update if possible.

# Updating in Anaconda can be done using this command: 
# conda update nltk

### nltk.Text

**`nltk.Text` is a simple NLTK helper for loading and exploring textual content (a sequence of words / string tokens):**

... intended to support initial exploration of texts (via the interactive console). It can perform a variety of analyses on the text’s contexts (e.g., counting, concordancing, collocation discovery), and display the results.

Documentation: [nltk.Text](https://www.nltk.org/api/nltk.html#nltk.text.Text)
* lists what we can do with text once it is loaded into nltk.Text(...)

In [6]:
# Now we can try a simple example:

my_word_list = ["This", "is", "just", "an", "example", "Another", "example", "here"]
my_text = nltk.Text(my_word_list)

my_text

<Text: This is just an example Another example here...>

In [7]:
type(my_text)

nltk.text.Text

In [8]:
# How many times does the word "example" appear?
my_text.count("example")

# Notes:
#  - my_text = our text, processed (loaded) by NLTK
#     - technically: a Python object
#  - my_text.count(...) = requesting the object to perform a .count(...) function and return the result
#     - technically: calling a .count() method

2

In [9]:
# count works on tokens (full words in this case)
my_text.count('exam')

0

In [10]:
'exam' in my_text

False

In [11]:
'example' in my_text

True
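Counting and membership tests compare whole tokens and are case-sensitive. The cell below is a minimal sketch that reuses the word list from above; the names `lowered` and `freq` are only illustrative. It lowercases the tokens first and then uses `vocab()` to get the frequency of every token at once.

```python
import nltk

my_word_list = ["This", "is", "just", "an", "example", "Another", "example", "here"]
my_text = nltk.Text(my_word_list)

# Matching is exact: "Example" with a capital E is a different token
print(my_text.count("Example"))   # 0

# Lowercase every token if case should not matter
lowered = nltk.Text([w.lower() for w in my_word_list])
print(lowered.count("another"))   # 1

# vocab() returns a frequency distribution over all tokens
freq = lowered.vocab()
print(freq.most_common(1))        # [('example', 2)]
```

Whether to lowercase depends on the question: it merges "Another" and "another", but it also hides the difference between proper nouns and common words.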

### Tokenizing

Let's convert a text string into nltk.Text.
First, we need to split it into tokens (to *tokenize* it). 

In [12]:
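# We need to download the Punkt tokenizer models before we can tokenize
# (recent NLTK releases look for 'punkt_tab' instead; the error message names the package to download)
import nltk # already imported above, repeated so that this cell can run on its own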
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/captsolo/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [13]:
# Splitting text into tokens (words, ...) = tokenizing

from nltk.tokenize import word_tokenize

excerpt = "NLTK has been called “a wonderful tool for teaching, and working in, computational linguistics using Python,” and “an amazing library to play with natural language.”"
tokens = word_tokenize(excerpt)

tokens[:6]

['NLTK', 'has', 'been', 'called', '“', 'a']

In [14]:
my_text2 = nltk.Text(tokens)

print(my_text2.count("NLTK"))

1
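`word_tokenize` depends on the downloaded Punkt models. For quick experiments NLTK also ships rule-based tokenizers that need no download. The sketch below uses two of them; the sample sentence is made up for illustration, and the results differ from `word_tokenize` in how contractions and punctuation are split.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

sample = "NLTK isn't hard to learn, is it?"

# wordpunct_tokenize separates runs of word characters from runs of punctuation
punct_tokens = wordpunct_tokenize(sample)
print(punct_tokens)
# ['NLTK', 'isn', "'", 't', 'hard', 'to', 'learn', ',', 'is', 'it', '?']

# RegexpTokenizer keeps only what the pattern matches (here: runs of word characters)
words_only = RegexpTokenizer(r"\w+").tokenize(sample)
print(words_only)
# ['NLTK', 'isn', 't', 'hard', 'to', 'learn', 'is', 'it']
```

Either result can be passed to `nltk.Text(...)` in the same way as the output of `word_tokenize`.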


### Downloading NLTK language resources

NLTK also contains many language resources (corpora, ...), but you have to select and download them separately (in order to save disk space and only download what is needed).

Let's download text collections used in the NLTK book: 
* `nltk.download("book")`

Note: you can also download resources interactively:
* `nltk.download()`

In [15]:
# this is a big download of all book packages
nltk.download("book")

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/captsolo/nltk_data...
[nltk_data]    |   Package conll2002 is already up-to-date!
[nltk_data

True
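Calling `nltk.download(...)` every time the notebook runs is harmless but noisy. A possible pattern, sketched below, is to look the resource up first and download only when the lookup fails. The helper name `ensure_resource` and its `downloader` parameter are hypothetical (not part of NLTK); `nltk.data.find` and the `LookupError` it raises are the real NLTK pieces.

```python
import nltk

def ensure_resource(resource_path, package, downloader=nltk.download):
    """Download `package` only if `resource_path` is not installed yet."""
    try:
        nltk.data.find(resource_path)
        return False            # already available, nothing downloaded
    except LookupError:
        downloader(package)
        return True             # a download was attempted

# Typical use (resource path and package name as used earlier in this notebook):
# ensure_resource("tokenizers/punkt", "punkt")
```

The `downloader` argument exists only so the helper can be tried out without network access; in normal use the default `nltk.download` is what you want.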

In [16]:
# After downloading the resources we still need to import them

# Let's import all NLTK book resources (*)
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


## 2) Exploring textual content

In [17]:
# text1, ... resources are of type nltk.Text (same as in the earlier example):

type(text1)

nltk.text.Text

In [18]:
# We can run all methods that nltk.Text has.

# Count words:
print(text1.count("whale"))

906


In [19]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.concordance

# Print concordance view (occurrences of a word, in context):
text1.concordance("discover")

Displaying 7 of 7 matches:
cean , in order , if possible , to discover a passage through it to India , th
 throw at the whales , in order to discover when they were nigh enough to risk
for ever reach new distances , and discover sights more sweet and strange than
gs upon the plain , you will often discover images as of the petrified forms o
 over numberless unknown worlds to discover his one superficial western one ; 
se two heads for hours , and never discover that organ . The ear has no extern
s keener than man ' s ; Ahab could discover no sign in the sea . But suddenly 


In [20]:
text4.concordance("nation")

Displaying 25 of 330 matches:
 to the character of an independent nation seems to have been distinguished by
f Heaven can never be expected on a nation that disregards the eternal rules o
first , the representatives of this nation , then consisting of little more th
, situation , and relations of this nation and country than any which had ever
, prosperity , and happiness of the nation I have acquired an habitual attachm
an be no spectacle presented by any nation more pleasing , more noble , majest
party for its own ends , not of the nation for the national good . If that sol
tures and the people throughout the nation . On this subject it might become m
if a personal esteem for the French nation , formed in a residence of seven ye
f our fellow - citizens by whatever nation , and if success can not be obtaine
y , continue His blessing upon this nation and its Government and give it all 
powers so justly inspire . A rising nation , spread over a wide and fruitful l
ing now decided by the

In [21]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.similar

# Print words that appear in similar context as "nation".
text4.similar("nation")

country people government world union time constitution states
republic land law earth other future party peace strength president
way war


In [22]:
# https://www.nltk.org/api/nltk.html#nltk.text.Text.common_contexts

# Find contexts common to all given words
text1.common_contexts(["day", "night"])


that_, the_, by_, a_, that_; the_previous -_, one_, all_. the_before
every_, by_or of_; of_. the_. this_in all_in after_, the_wore
through_into


### Side note: Python lists

A *list* contains multiple values in an ordered sequence.

More about Python lists:
* https://automatetheboringstuff.com/chapter4/

In [23]:
# nltk.Text also behaves like a list - we can do what we do with lists (access parts of it, ...)

# What's the 1st occurrence of "He" in the text?
#  - note: Python is case sensitive (unless you take care of it - e.g. convert all text to lowercase)

print(text1.index("He"))

42


In [24]:
# The word at position #42
#  - note: list indexes start from 0

print(text1[42])

He


In [25]:
print(text1[42:52])

['He', 'was', 'ever', 'dusting', 'his', 'old', 'lexicons', 'and', 'grammars', ',']
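The note about case sensitivity can be tried on a plain Python list. This is a small sketch with a made-up token list, not data from `text1`.

```python
tokens = ["Call", "me", "Ishmael", ".", "He", "said", "he", "would"]

# index() looks for an exact, case-sensitive match
print(tokens.index("he"))      # 6

# Lowercasing a copy of the list makes the search case-insensitive
lowered = [t.lower() for t in tokens]
print(lowered.index("he"))     # 4

# Slices work the same way as on nltk.Text
print(tokens[4:6])             # ['He', 'said']
```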


## Further exploration

* Dispersion plots (distribution of words throughout the text)
* Generating text (based on example)

### Visualizing the corpus

In [48]:
# Dispersion plot

# source: Inaugural Address Corpus
text4.dispersion_plot(["citizens", "democracy", "duty", "freedom", "America"])

[Figure: lexical dispersion plot of "citizens", "democracy", "duty", "freedom" and "America" in the Inaugural Address Corpus]

In [27]:
help(text4.dispersion_plot)

Help on method dispersion_plot in module nltk.text:

dispersion_plot(words) method of nltk.text.Text instance
    Produce a plot showing the distribution of the words through the text.
    Requires pylab to be installed.
    
    :param words: The words to be plotted
    :type words: list(str)
    :seealso: nltk.draw.dispersion_plot()
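To see what a dispersion plot is built from, it can help to compute the raw positions yourself. The function below is a plain-Python sketch (the name `word_offsets` and the sample sentence are made up for illustration, and this is not NLTK's own implementation): for each target word it collects the token positions at which the word occurs. A dispersion plot draws one mark per position.

```python
def word_offsets(tokens, words):
    """Map each target word to the list of positions where it occurs."""
    targets = set(words)
    offsets = {w: [] for w in words}
    for position, token in enumerate(tokens):
        if token in targets:
            offsets[token].append(position)
    return offsets

sample = "we the people of the nation thank the citizens".split()
result = word_offsets(sample, ["the", "citizens", "freedom"])
print(result)
# {'the': [1, 4, 7], 'citizens': [8], 'freedom': []}
```

Note that the comparison is exact, so "America" and "america" would be counted as different words.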



### Generating text

Note: depending on your version of NLTK, `generate()` may or may not work. It was missing from some older NLTK 3 releases; it works in version 3.7, which is used in this notebook.
* In case it does not work, please see subsection "Saved version of generate() results".



In [28]:
# Generate text (based on example)
# https://www.nltk.org/api/nltk.html#nltk.text.Text.generate

# we need to supply seed words
text1.generate(text_seed = ["Why", "is", "it"])

Building ngram index...


Why is it stripped off from some mountain torrent we had flip ? , so as to
preserve all his might had in former years abounding with them , they
toil with their lances , strange tales of Southern whaling .
conceivable that this fine old Dutch Fishery , a most wealthy example
of the sea - captain orders me to admire the magnanimity of the whole
, and many whalemen , but dumplings ; good white cedar of the ship
casts off her cables ; and chewed it noiselessly ; and though there
are birds called grey albatrosses ; and yet faster


'Why is it stripped off from some mountain torrent we had flip ? , so as to\npreserve all his might had in former years abounding with them , they\ntoil with their lances , strange tales of Southern whaling .\nconceivable that this fine old Dutch Fishery , a most wealthy example\nof the sea - captain orders me to admire the magnanimity of the whole\n, and many whalemen , but dumplings ; good white cedar of the ship\ncasts off her cables ; and chewed it noiselessly ; and though there\nare birds called grey albatrosses ; and yet faster'

---

**NLTK `generate()` builds a [trigram] language model from the supplied text** (words are generated based on previous two words).

For more information see nltk.lm: https://www.nltk.org/api/nltk.lm.html
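The idea behind a trigram model can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not NLTK's implementation: the function names and the toy sentence are made up, and the next word is simply picked at random from the words that followed the same two-word context in the training text.

```python
import random
from collections import defaultdict

def build_trigram_index(tokens):
    """Map each pair of consecutive tokens to the tokens seen right after it."""
    index = defaultdict(list)
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        index[(first, second)].append(third)
    return index

def generate_words(index, seed, length=10, random_seed=42):
    """Extend `seed` (at least two tokens) one word at a time."""
    rng = random.Random(random_seed)
    output = list(seed)
    while len(output) < length:
        candidates = index.get(tuple(output[-2:]))
        if not candidates:      # unseen context: stop early
            break
        output.append(rng.choice(candidates))
    return output

toy_tokens = "the whale is white and the whale is large and the sea is deep".split()
toy_index = build_trigram_index(toy_tokens)
generated = generate_words(toy_index, ["the", "whale"], length=8)
print(" ".join(generated))
```

Because a word that followed a context several times appears several times in the candidate list, frequent continuations are picked more often, which is the basic intuition behind the probabilities in `nltk.lm`.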

**Saved version of `generate()` results:**
    
`text1.generate(text_seed = ["Why", "is", "it"])`

*Building ngram index...*

```
Why is it stripped off from some mountain torrent we had flip ? , so as to
preserve all his might had in former years abounding with them , they
toil with their lances , strange tales of Southern whaling .
conceivable that this fine old Dutch Fishery , a most wealthy example
of the sea - captain orders me to admire the magnanimity of the whole
, and many whalemen , but dumplings ; good white cedar of the ship
casts off her cables ; and chewed it noiselessly ; and though there
are birds called grey albatrosses ; and yet faster
```


In [29]:
help(text1.generate)

Help on method generate in module nltk.text:

generate(length=100, text_seed=None, random_seed=42) method of nltk.text.Text instance
    Print random text, generated using a trigram language model.
    See also `help(nltk.lm)`.
    
    :param length: The length of text to generate (default=100)
    :type length: int
    
    :param text_seed: Generation can be conditioned on preceding context.
    :type text_seed: list(str)
    
    :param random_seed: A random seed or an instance of `random.Random`. If provided,
        makes the random sampling part of generation reproducible. (default=42)
    :type random_seed: int



## Converting Our Corpora into an NLTK Text

In [30]:
# As we saw previously we can read data from any publicly accessible source
url = "https://github.com/ValRCS/BSSDH_22/raw/main/corpora/lv_old_newspapers_5k.tsv"

df = pd.read_csv(url, sep="\t")
df.shape

(4999, 4)

In [31]:
df.head(10)

| | Language | Source | Date | Text |
|---|---|---|---|---|
| 0 | Latvian | rekurzeme.lv | 2008/09/04 | "Viņa pirmsnāves zīmītē bija rakstīts vienīgi ... |
| 1 | Latvian | diena.lv | 2012/01/10 | info@zurnalistiem.lv |
| 2 | Latvian | bauskasdzive.lv | 2007/12/27 | Bhuto, kas Pakistānā no trimdas atgriezās tika... |
| 3 | Latvian | bauskasdzive.lv | 2008/10/08 | Plkst. 4.00 Samoilovs / Pļaviņš (pludmales vol... |
| 4 | Latvian | diena.lv | 2011/10/05 | CVK bija vērsusies Skaburska, lūdzot izskaidro... |
| 5 | Latvian | zz.lv | 2011/05/16 | Apbalvojumus piešķir piemiņas zīmes valde Saei... |
| 6 | Latvian | ntz.lv | 2010/08/13 | - Amerikā biju uzaicināts viesoties ar visu ģi... |
| 7 | Latvian | rv.lv | 2011/01/22 | Mūrniece gan saka, ka Lužkova bitēm Latvijas p... |
| 8 | Latvian | diena.lv | 2011/11/26 | PĒDĒJĀ, kontrolēja PĀRDAUGAVAS telpu, izņemot ... |
| 9 | Latvian | la.lv | 2011/04/30 | Ar Ivaru tikāmies viņa dzimtajos "Lazdiņos" Za... |


In [32]:
# Let us sort by Date. It is a string column, but YYYY/MM/DD strings sort in chronological order
df = df.sort_values(by="Date", ascending=True)  # ascending=True is the default; use ascending=False for descending order
df.head(10)

| | Language | Source | Date | Text |
|---|---|---|---|---|
| 1137 | Latvian | la.lv | 2005/04/27 | Kad maz naudas veselības aprūpei |
| 1977 | Latvian | diena.lv | 2006/09/21 | Idejas, kā palīdzēt skolai, uzņēmējiem rodotie... |
| 808 | Latvian | zz.lv | 2007/01/07 | Ja samazinās mazuta cenas, samazinās arī tarif... |
| 17 | Latvian | zz.lv | 2007/01/07 | 2008.gadā gāzes piegādes cena pieaugs vēl par ... |
| 2818 | Latvian | dzirkstele.lv | 2007/01/08 | Kā norādīja Šlesers, komersanti, kas Latvijas ... |
| 4839 | Latvian | zz.lv | 2007/01/09 | Vispirms pasākuma dalībniekiem Rīgas Skolēnu p... |
| 4222 | Latvian | dzirkstele.lv | 2007/01/10 | Sestdien pulksten 11.00 Gulbenes kultūras cent... |
| 1572 | Latvian | dzirkstele.lv | 2007/01/10 | Informāciju, ka būtu interese par VP "Wimm-Bil... |
| 3927 | Latvian | zz.lv | 2007/01/11 | Piektdien, 2. novembrī, I.Minusai un I.Jursone... |
| 3167 | Latvian | bauskasdzive.lv | 2007/01/12 | Republikāņu partijas biedrs, jurists Larsons, ... |

<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

The following section is adapted by Valdis Saulespurens from a notebook by [Nathan Kelber](http://nkelber.com) under the [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />


# Concordance and Collocation

**Description:** This notebook section describes how to create a concordance and collocation starting from text files and from your own CSV file.

**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion Time:** 45 minutes

**Knowledge Required:** 
* Python Day 1 Series ([Python Introduction](https://github.com/ValRCS/BSSDH_22/blob/main/notebooks/Python%20Introduction.ipynb))

**Knowledge Recommended:** None

**Data Format:** Text, CSV file from previous lessons 

**Libraries Used:** NLTK

**Research Pipeline:** None
___

## Concordance

The concordance has a long history in humanities study, and Roberto Busa's concordance Index Thomisticus, started in 1946, is arguably the first digital humanities project. Before computers were common, concordances were printed in large volumes such as John Bartlett's reference book A Complete Concordance to Shakespeare (first published in 1894), which runs to more than 1,900 pages!

A concordance gives the context of a given word or phrase in a body of texts. For example, a literary scholar might ask: how often and in what context does Shakespeare use the phrase "honest Iago" in Othello? A historian might examine a particular politician's speeches, looking for examples of a particular "dog whistle".

<font color="red">Read more</font>

* Geoffrey Rockwell and Stéfan Sinclair. [Tremendous Mechanical Labor: Father Busa's Algorithm](http://www.digitalhumanities.org/dhq/vol/14/3/000456/000456.html) (2020)
* Julianne Nyhan and Marco Passarotti, eds. [One Origin of Digital Humanities: Fr Roberto Busa in His Own Words](https://www.amazon.com/One-Origin-Digital-Humanities-Roberto/dp/3030183114/) (2019)
* Julianne Nyhan and Melissa Terras. [Uncovering 'hidden' contributions to the history of Digital Humanities: the Index Thomisticus' female keypunch operators](https://discovery.ucl.ac.uk/id/eprint/10052279/9/Nyhan_DH2017.redacted.pdf) (2017)
* Steven E. Jones. [Roberto Busa, S.J., and the Emergence of Humanities Computing](https://www.routledge.com/Roberto-Busa-S-J-and-the-Emergence-of-Humanities-Computing-The-Priest/Jones/p/book/9781138587250) (2016)
___

### Extracting Text from dataframe

In [33]:
### Extracting text from dataframe
documents = list(df.Text) # df["Text"].tolist() would do the same
len(documents)


4999

In [34]:
documents[:3]  # first three documents

['Kad maz naudas veselības aprūpei',
 'Idejas, kā palīdzēt skolai, uzņēmējiem rodoties vēl un vēl, tagad kopīgi ar pagasta vadību prātojot, kā skolai, 100 gadu jubileju sagaidot, uzlikt jaunu jumtu. Tā kā internāta audzēkņiem nav kur nomazgāties, viesnīca Radi un draugi uzbūvējuši viņiem jauku pirtiņu ar saunu. Bērniem gan nākas bez tās iztikt vēl joprojām, jo, kā apgalvojot rajona sanitāri epidemioloģiskās stacijas ierēdņi, pirts ģērbtuves izmēri īsti neatbilstot noteiktajām prasībām. "Mēs esam tā kā saauguši kopā, ik nedēļas sazvanāmies, dzīvojam līdzi skolas priekiem un bēdām, pārdzīvojam, lai tikai to neaizslēgtu. Šiem Latgales pagasta cilvēkiem ir tāds siltums un mīļums, viņi nekad neko mums neprasa, bet prot ar pateicību saņemt dāvāto," Dienai saka Aigars Bērziņš, kurš Malnavā pie vecāsmātes pavadījis bērnības vasaras.',
 'Ja samazinās mazuta cenas, samazinās arī tarifi, un otrādi.']

In [35]:
# for the purpose of this analysis we will join all the documents together 
# this is not always appropriate depending on your needs
all_docs = "\n".join(documents)
len(all_docs)

1415137

In [36]:
all_docs[:100] # so now all documents are in one big string 
# notice the \n indicating newlines

'Kad maz naudas veselības aprūpei\nIdejas, kā palīdzēt skolai, uzņēmējiem rodoties vēl un vēl, tagad k'

Next, we lowercase our text and use the Natural Language Toolkit (NLTK) to tokenize it. Tokenizing breaks up the document into individual words. Finally, we use our tokens to create an NLTK Text object.

In [37]:
# Lowercase and tokenize the combined documents
import nltk  # not needed if you already imported it
nltk.download('punkt')  # not needed if you already downloaded punkt
file_contents = all_docs.lower()
tokens = nltk.word_tokenize(file_contents)
text = nltk.Text(tokens)

[nltk_data] Downloading package punkt to /Users/captsolo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [38]:
# Verify that we have created an NLTK Text object
type(text)

nltk.text.Text

In [39]:
# Create a concordance for the given word
text.concordance('Rīga')  # notice it gives you matches in lowercase

Displaying 25 of 33 matches:
tāji , ceturto - basketbola skolas „ rīga '' komanda , piekto - igaunijas spor
ajai lielupes tilta šuvei uz šosejas rīga - jelgava . tur darbi risināsies būv
ēc budapeštas , krakovas un berlīnes rīga kļūs par ceturto pilsētu aiz čehijas
? es nesaku , ka tie nav vajadzīgi . rīga ir rīga — latvijas galvaspilsēta . t
aku , ka tie nav vajadzīgi . rīga ir rīga — latvijas galvaspilsēta . tikai ner
ēlē profesionālajā pūtēju orķestrī « rīga » , jānis retenais aktīvi muzicē un 
spēlējis hokeja komandā « dinamo » ( rīga ) un latvijas valsts izlasē ; jaunie
tikās lielajā pasākumā pie `` arēnas rīga '' . `` protams , visiem gribas izju
kcija „ mēs par velo '' 18. jūlijs : rīga - slampe xxiv vispārējo latviešu dzi
dzelzceļa vēstures muzejā , `` arēnā rīga '' , latvijas universitātē , starpta
limits - 8 , - rēzekne — varakļāni — rīga ( 17.20 un 4.40 ( darba dienās ) . l
erhofa ( roja ) , viola puškarjova ( rīga ) , diāna ančikovska ( aizkraukle ) 
ga ( ventspils ) , kris

By default, the first 25 matches are printed, and each printed line is about 80 characters wide in total (the default `width` is 79), with the matched word in the middle. We can show more matches or wider lines using the `lines` and `width` arguments, which accept integers.

In [40]:
# Create a concordance for the given word
# Increasing lines shown and number of characters
text.concordance('Latvija', lines=50, width=100)

Displaying 50 of 56 matches:
rsvars 20:17. tomēr līdz ceturtdaļas izskaņai latvija spēja panākt 24:17. kā aģentūra leta noskaidr
 nesēju latvijas valstī , pasludināja , ka `` latvija , apvienota etnogrāfiskās robežās ( kurzeme ,
ēstis ” sazinājās pa telefonu ar sia “ maxima latvija ” preses sekretāru , lai uzzinātu par kompāni
ficiālais nosaukums ir latvijas republika jeb latvija . `` provizoriski izvērtējot transportlīdzekļ
ncē bronzas godalgu izcīnīja latvieši – “ ddb latvija ” tekstu autors raimonds platacis un “ cube -
pājas tramvajs ” , projektētājs – ps „ epkb – latvija ” ; iesildoties ieteicams izmantot nūjas , sā
tbilisi . mišu sargāt no krievu lāča : varenā latvija grib karot visās debesu pusēs . kamēr eiropas
, kaut kā no šīs bedres reiz izķepurosies arī latvija , tai atkal vajadzēs celtniekus , oficiantus 
nīca . dome atgādina , ka nedz rēzekne , nedz latvija un a/s “ latvijas gāze ” nevar ietekmēt dabas
iāliem pamatskolas līmenī jābūt bez maksas un latvija šai konvencijai i

If we want to supply a bigram, trigram, or longer construction, they are supplied as individual strings within a Python list. (If you try to supply a string with a space in the middle, there will be no results.)

In [41]:
# Create a concordance for a sequence of words
text.concordance(['Latvija', 'ir'])

Displaying 8 of 8 matches:
ksa par sen pieņemtiem lēmumiem - latvija ir apsolījusi pildīt direktīvu , kas
ļās enerģijas īpatsvara pieaugumā latvija ir eiropas rekordiste tūlīt aiz zvie
atonā , kur viņš skrēja 10 km . — latvija ir silta zeme ar simpātiskiem , laip
tas , un dombrovskis atklāja , ka latvija ir ieinteresēta piesaistīt asv ekspe
. valsts prezidents atzīmēja , ka latvija ir pasaulē varenākās militārās alian
u gūstā . brīvā tirgus ēnas puses latvija ir viens neliels asv satelīts , bet 
ada problēmas tūkstošiem ģimeņu . latvija ir pirmajā vietā eiropā , kur ģimenē
iljonus latu ) , taču jau pašlaik latvija ir attiekusies tos pieņemt , ņemot v


This method works well for a quick preview of the lines, but if we want to save this concordance for later analysis we can use the `.concordance_list()` method. The `.concordance_list()` method outputs a list, but the elements of that list *are not* simple strings. They are ConcordanceLine objects.

In [42]:
# Output the concordance data
output_list = text.concordance_list(['Latvija', 'ir'], width=200, lines=50)  # we do not have 50 matches in our dataset

In [43]:
type(output_list[0])

nltk.text.ConcordanceLine

In [44]:
# We can view individual lines by using a Python list index followed by .line.
output_list[0].line

'raud pieaugt vēl vairāk , nekā solīts uz 1.aprīli . tā būs maksa par sen pieņemtiem lēmumiem - latvija ir apsolījusi pildīt direktīvu , kas paredz elektrības galapatēriņā 40 % saražot no atjaunojamaji'
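Besides `.line`, each `ConcordanceLine` carries the match in a more structured form. The sketch below builds a tiny `nltk.Text` (the names `toy_text` and `toy_lines` are illustrative) and prints the `offset`, `left`, `query` and `right` fields, which in the NLTK versions I am aware of hold the token position, the tokens before the match, the matched text and the tokens after it.

```python
import nltk

toy_text = nltk.Text(["This", "is", "just", "an", "example", "Another", "example", "here"])
toy_lines = toy_text.concordance_list("example", width=40, lines=5)

for entry in toy_lines:
    # left and right are lists of tokens; offset is the position of the match in the text
    print(entry.offset, entry.left[-2:], entry.query, entry.right[:2])
```

Having the context as token lists makes later analysis easier than re-splitting the formatted `.line` string.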

If we want to save our concordance, we can write to a file line-by-line.

In [45]:
# Writing the concordance to a text file

# encoding="utf-8" is very important for languages using symbols outside regular English characters
with open('my_concordance.txt', mode='w',encoding="utf-8") as f:
    for row in output_list:
        f.write(row.line)
        f.write('\n')
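A plain text file is easy to read but awkward to analyze later. One alternative, sketched here with Python's `csv` module, is to save each match as a row with separate columns. The column layout, the toy sentence and the file name `toy_concordance.csv` are assumptions chosen for the illustration, and the file is written to the system's temporary directory.

```python
import csv
import os
import tempfile

import nltk

toy_text = nltk.Text("rīga ir rīga — latvijas galvaspilsēta .".split())
rows = toy_text.concordance_list("rīga", width=60)

csv_path = os.path.join(tempfile.gettempdir(), "toy_concordance.csv")  # hypothetical file name
with open(csv_path, mode="w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["offset", "left", "query", "right"])
    for row in rows:
        writer.writerow([row.offset, " ".join(row.left), row.query, " ".join(row.right)])

# Read the file back to check what was saved
with open(csv_path, encoding="utf-8", newline="") as f:
    saved = list(csv.reader(f))
print(saved[1])
```

A file like this can be loaded again with `pd.read_csv(...)`, in the same way as the newspaper data earlier.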

### Download your files from cloud instances before finishing work!
If you are working in a local Anaconda installation, the file `my_concordance.txt` is already in the same directory as this notebook (NLTK_Introductions.ipynb).

In Google Colab the file resides in the virtual machine provided by Google, so you need to download it to your computer:

* Open the little Folder symbol to expand the file view
* Click on the three dots to the right of my_concordance.txt
* Choose Download

### Dispersion plot

Lastly, NLTK can create a dispersion plot that helps visualize where tokens occur in the document. This can reveal the way words are used in the document over time.

In [69]:
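# The tokens were lowercased earlier and dispersion_plot compares them exactly, so the words must be lowercase too
text.dispersion_plot(['latvija', 'basketbols', 'futbols', 'vēlēšanas'])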

[Figure: lexical dispersion plot of the four words in the newspaper corpus]

### Normalizing text

"basketbols" appears in the dispersion plot just a few times. Let's see how often it is mentioned in the text.

In [72]:
for item in text:
    if item == "basketbols":
        print(item)

basketbols
basketbols
basketbols


In [71]:
import regex

for item in text:
    if regex.match(r"basket.*", item):
        print(item)

basketbola
basketbola
basketbola
basketbola
basketbola
basketbolistes
basketbola
basketbolisti
basketbolisti
basketbolistu
basketbola
basketbolu
basketbolisti
basketbols
basketbola
basketbola
basketbola
basketbolam
basketbols
basketbols
basketbolista
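Printing every match gets long quickly. To answer "how often" directly, the matches can be counted per word form. This sketch uses the standard library's `re` and `collections.Counter` on a short made-up token list (`toy_tokens`); with the real data you would loop over `text` instead.

```python
import re
from collections import Counter

toy_tokens = ["basketbola", "basketbols", "futbols", "basketbolisti", "basketbola", "basketbolu"]

# re.match anchors at the start of the token, so this selects tokens beginning with "basket"
pattern = re.compile(r"basket")
form_counts = Counter(t for t in toy_tokens if pattern.match(t))

print(form_counts.most_common())
# [('basketbola', 2), ('basketbols', 1), ('basketbolisti', 1), ('basketbolu', 1)]
print(sum(form_counts.values()))   # 5
```

The many different endings in the result are exactly the problem that normalization tries to solve.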


**Normalizing text is covered in NLTK Book section 3.6:**
https://www.nltk.org/book/ch03.html#sec-normalizing-text

We already normalized / converted text to lowercase but that's not enough. 

We may want to go further and strip off any affixes, a task known as *stemming*. A further step is to make sure that the resulting form is a known word in a dictionary, a task known as *lemmatization*. 

NLTK offers some stemming and lemmatization functions but they are limited to just some languages (e.g. Latvian is not one of them).
 
https://www.nltk.org/api/nltk.stem.html

In [75]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

tokens = word_tokenize(raw)

porter = nltk.PorterStemmer()

for t in tokens:
    print(porter.stem(t), end=" ")

denni : listen , strang women lie in pond distribut sword is no basi for a system of govern . suprem execut power deriv from a mandat from the mass , not from some farcic aquat ceremoni . 

In [76]:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()

for t in tokens:
    print(wnl.lemmatize(t), end=" ")

DENNIS : Listen , strange woman lying in pond distributing sword is no basis for a system of government . Supreme executive power derives from a mandate from the mass , not from some farcical aquatic ceremony . 
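The language limitation can be checked directly. As a small sketch: `SnowballStemmer.languages` lists the languages covered by the Snowball stemmers bundled with NLTK, and Latvian is not among them. The second part repeats a few words from the Porter example to show that a stem need not be a dictionary word.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Languages covered by the Snowball family of stemmers bundled with NLTK
print(SnowballStemmer.languages)
print("latvian" in SnowballStemmer.languages)   # False

# Stemmers only strip suffixes, so the result need not be a dictionary word
porter = PorterStemmer()
stems = [porter.stem(w) for w in ["swords", "ponds", "distributing"]]
print(stems)
# ['sword', 'pond', 'distribut']
```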

If the language of our text is not supported by NLTK we can use another library: 

simplemma: https://pypi.org/project/simplemma/

More information: [A simple multilingual lemmatizer for Python](https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html)


In [78]:
# we may need to install simplemma first (uncomment the following line to do that)
# !pip install simplemma

In [80]:
import simplemma as sl

In [82]:
for t in tokens:
    print(sl.lemmatize(t, lang="en"), end=" ")

DENNIS : listen , strange woman lie in pond distribute sword be no basis for a system of government . supreme executive power derive from a mandate from the masse , not from some farcical aquatic ceremony . 

In [84]:
buf = []

for item in text:
    buf.append(sl.lemmatize(item, lang="lv"))
               
buf[:20]

['kad',
 'mazs',
 'nauda',
 'veselība',
 'aprūpe',
 'ideja',
 ',',
 'kā',
 'palīdzēt',
 'skola',
 ',',
 'uzņēmējiem',
 'rasties',
 'vēl',
 'un',
 'vēl',
 ',',
 'tagad',
 'kopīgi',
 'ar']

In [85]:
text_new = nltk.Text(buf)

In [86]:
text_new.dispersion_plot(['Latvija', 'basketbols', 'futbols', 'vēlēšanas'])

[Figure: lexical dispersion plot of the same four words after lemmatization]

---

## Your turn!

Choose some text and **explore it using NLTK** (following the examples in this notebook).

**Write code in notebook cells below**.
* add more cells (use "+" icon) if necessary

You may use NLTK text corpora or load your own text.