# In this notebook

1. Working with a Corpus (multiple text files)
2. Concordance List for a text collection 

# Working with a Corpus (multiple text files)

#### Questions & Objectives:

- How can I load a text collection made up of multiple text files and tokenise them?

#### Key Points

- To read an entire collection of text files you can use the `PlaintextCorpusReader` class provided by NLTK and its `words()` method to extract all the words from the texts in the collection.

In [1]:
# run this cell now. It's the usual imports of text mining libraries
import nltk
import numpy
import string
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\andre\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Text data comes in different forms. You might want to analyse a document in one file or an entire collection of documents (a corpus) stored in multiple files. You already know how to load a single document, and now you will see how to load an entire folder of text documents (a corpus).

### Medical History of British India

We will use the Medical History of British India collection provided by the National Library of Scotland as an example. You can read more about it at the link below, but for this course we have already prepared it for you. You will find it in the sub-folder inside your data folder.

https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/

This dataset forms the first half of the Medical History of British India collection, which itself is part of the broader India Papers collection held by the Library. A Medical History of British India consists of official publications, varying from short reports to multi-volume histories, related to disease, public health and medical research from circa 1850 to 1950. These are historical sources for a period which witnessed the transition from a humoral to a biochemical tradition based on laboratory science, and they document the important breakthroughs in bacteriology and parasitology and the development of vaccines in a colonial context.

This collection has been made available as part of NLS’s DataFoundry platform, which provides access to a number of their digitised collections. For this course we are only interested in the text of the Medical History of British India collection.

### Recap: Loading a single file

First, just to see one of the files, let's load one individual file and print a few of its tokens:

In [2]:
# open the file, read its whole content, and close it again automatically
with open('./Medical_History_of_British_India/Medical_History_of_British_India/74457530.txt') as file:
    india_raw = file.read()

india_tokens = word_tokenize(india_raw)
lower_india_tokens = [word.lower() for word in india_tokens]
print(lower_india_tokens[0:50])

['no', '.', '1111', '(', 'sanitary', ')', ',', 'dated', 'ootacamund', ',', 'the', '6th', 'october', '1876', '.', 'from-the', 'honourable', 'w.', 'hudleston', ',', 'chief', 'secretary', 'to', 'the', 'govern-', 'ment', 'of', 'madras', '.', 'to-the', 'offg', '.', 'secretary', 'to', 'the', 'government', 'of', 'india', '.', 'resolution', 'of', 'government', 'of', 'india', 'no', '.', '1-137', ',', 'dated', '5th']


### Loading multiple files (a corpus) into a PlaintextCorpusReader object

We can do the same for an entire collection of documents (a corpus). Instead of pointing to an individual file, we point to a directory/folder with many text documents in it. We will use the entire Medical History of British India collection as our dataset, which consists of almost 500 text documents stored in a folder.

To read the text files in this collection we can use the `PlaintextCorpusReader` Python class provided in the `corpus` package of NLTK, which we import in the code cell below.

You need to specify:

- the collection directory name 
- a pattern describing which files in the directory to read. NLTK treats it as a regular expression (e.g. `.*` for all files, `.*\.txt` for all text files, or `74.*\.txt` for all text files starting with `74`)
- the text encoding of the files (in this case `latin1`), which tells Python how to turn the bytes stored in the files into characters.

When you use `PlaintextCorpusReader`, it will look like this (note that the encoding has to be passed by name):

`my_corpus_reader = PlaintextCorpusReader(folder_location, file_pattern, encoding=text_encoding)`

which in our example will look like this:

`corpus_reader = PlaintextCorpusReader("./Medical_History_of_British_India", '.*', encoding='latin1') `
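
The file pattern given to `PlaintextCorpusReader` is a regular expression rather than a shell-style wildcard, which is why the example uses `.*` and not `*`. The sketch below uses only Python's `re` module to show which names a few patterns would select. The file names and the helper `select_files` are made up for this illustration, and it assumes that the pattern has to match the whole file name.

```python
import re

# Hypothetical file names, made up for this illustration
file_names = ["74457530.txt", "74457800.txt", "75000001.txt", "notes.md"]

def select_files(pattern, names):
    """Return the names that the regular expression matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]

all_files = select_files(r".*", file_names)          # every file
text_files = select_files(r".*\.txt", file_names)    # only names ending in .txt
files_74 = select_files(r"74.*\.txt", file_names)    # .txt names starting with 74

print(all_files)
print(text_files)
print(files_74)
```

Trying a pattern on a handful of names like this is a cheap way to check it before pointing the reader at a folder with hundreds of files.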

In [3]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = "./Medical_History_of_British_India/Medical_History_of_British_India"
corpus_reader = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1') 

print(corpus_reader)
# the output will make little sense, but do not worry, keep reading

<PlaintextCorpusReader in 'C:\\Users\\andre\\OneDrive - University of Edinburgh\\Teaching\\PGT\\ADM_2022\\Notebooks\\Previous edition 2020\\Week 1\\Medical_History_of_British_India\\Medical_History_of_British_India'>


When we tried to print our corpus, we saw something like `<PlaintextCorpusReader ... >` instead of a lot of words! Why? That's because the corpus we have loaded into the variable `corpus_reader` is more than just a bunch of text.

What you see when you print `corpus_reader` is a brief description of a `PlaintextCorpusReader` object.

**An OBJECT is a type of variable that combines some storage and some functionality.**

For example, the `PlaintextCorpusReader` object gives access to the tokens of every file (think of a list of files, where each file is a list of words), along with some additional information and methods we can use. The one we will use the most is the `.words()` method, which pulls out all the tokens from the corpus.

`all_tokens_in_corpus = my_corpus_reader.words()`

or in our example:

`corpus_tokens = corpus_reader.words() `

In [4]:
corpus_tokens = corpus_reader.words() 
print(corpus_tokens[400:450]) 

['I', 'have', 'been', 'unable', 'myself', 'to', 'institute', 'any', 'personal', 'enquiry', 'into', 'the', 'matter', ',', 'and', 'can', 'now', 'only', 'forward', 'a', 'summary', 'of', 'the', 'reports', 'of', 'the', 'different', 'Civil', 'Surgeons', 'which', 'have', 'been', 'forwarded', 'to', 'me', 'by', 'Government', '.', 'The', 'accompanying', 'table', 'shews', 'the', 'population', 'of', 'each', 'Registration', 'District', ',', 'and']
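
To see what such a reader object offers, here is a minimal sketch on a tiny made-up corpus. The two file names and their contents are invented for the example and are written to a temporary folder; only `fileids()` and `words()` are used, so no extra NLTK data should be needed.

```python
import os
import tempfile
from nltk.corpus import PlaintextCorpusReader

# Two tiny made-up documents, only for illustration
demo_files = {
    "a.txt": "The first report.",
    "b.txt": "A second report, dated 1876.",
}

with tempfile.TemporaryDirectory() as folder:
    for name, content in demo_files.items():
        with open(os.path.join(folder, name), "w", encoding="latin1") as f:
            f.write(content)

    demo_reader = PlaintextCorpusReader(folder, r".*\.txt", encoding="latin1")
    demo_fileids = demo_reader.fileids()                    # which files were found
    demo_all_tokens = list(demo_reader.words())             # tokens of the whole corpus
    demo_first_file_tokens = list(demo_reader.words("a.txt"))  # tokens of one file only

print(demo_fileids)
print(demo_all_tokens)
print(demo_first_file_tokens)
```

Passing a file name to `words()` restricts the result to that file, which is handy when you want to inspect one document from a large collection.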


#### 🐛Minitask


- Simulate opening the book in a few places and reading a sentence: print a subsection (`your_list[start:end]`) of this corpus' tokens. Just make up an index to start reading from, like we did above (we made up the number `400`; it has no significance).

E.g. from the `corpus_tokens` list, print 50 tokens starting at the 1000th or the 9568th. Notice that this is not the most efficient way to 'eyeball' a text corpus.

In [None]:
# write your answer here:





<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

    ### BEGIN SOLUTION
    print(corpus_tokens[1000:1050]) 
    print(corpus_tokens[9568:9618])
    print(corpus_tokens[20000:20050])
    ### END SOLUTION

</details>









### Preparing your tokens for analysis (eg. making all words lowercase)
### A few words about overwhelming your computer with processing: DO NOT PANIC!

Note that this dataset is quite large: it contains almost 500 files, 30,000,000 words and 130,000,000 characters. That is 120 MB of data.

Still, loading all the files and turning text into tokens takes about 1-2 seconds!

But when we lowercase all of the words in this corpus, we run a piece of code that turns every one of the 130,000,000 characters into its lowercase equivalent:

`lower_corpus_tokens = [word.lower() for word in corpus_tokens]` 

This is not going to happen instantly. Run the code cell below now, and then keep reading.

You will need to be patient: it might take a minute or more to run. It may look as if your notebook has FROZEN (it might stop responding), but it is just busy at work. You can tell that the cell is running because there will be an `In [*]` at the top left of the busy cell, and (in some browsers) the icon on the browser tab will turn into an hourglass.

If you have not done so yet, run the cell below now and see what happens.

In [5]:
lower_corpus_tokens = [word.lower() for word in corpus_tokens] 
print(lower_corpus_tokens[400:450]) 

['i', 'have', 'been', 'unable', 'myself', 'to', 'institute', 'any', 'personal', 'enquiry', 'into', 'the', 'matter', ',', 'and', 'can', 'now', 'only', 'forward', 'a', 'summary', 'of', 'the', 'reports', 'of', 'the', 'different', 'civil', 'surgeons', 'which', 'have', 'been', 'forwarded', 'to', 'me', 'by', 'government', '.', 'the', 'accompanying', 'table', 'shews', 'the', 'population', 'of', 'each', 'registration', 'district', ',', 'and']


Here is what you can do to avoid waiting for your code to finish all the time:

**We try to do the hardest processing only once, and store the result in a variable, which we use later** (instead of re-doing the processing all the time).

This is exactly what we did above: once that line of code has finished running, the variable `lower_corpus_tokens` contains all of your tokens in lowercase.

Accessing the tokens in this variable will take a very short time now, because all the time consuming processing (making things lowercase) is already done, and all the tokens are lowercase already - now we just want to see them.

In [6]:
# this will execute almost immediately now, because all the processing 
# (changing millions of characters) is done, and we are just requesting a sub-set of a (very large) list:

print(lower_corpus_tokens[400:450]) 

['i', 'have', 'been', 'unable', 'myself', 'to', 'institute', 'any', 'personal', 'enquiry', 'into', 'the', 'matter', ',', 'and', 'can', 'now', 'only', 'forward', 'a', 'summary', 'of', 'the', 'reports', 'of', 'the', 'different', 'civil', 'surgeons', 'which', 'have', 'been', 'forwarded', 'to', 'me', 'by', 'government', '.', 'the', 'accompanying', 'table', 'shews', 'the', 'population', 'of', 'each', 'registration', 'district', ',', 'and']


Overall, when processing needs to be done, it needs to be done. So don't be scared of it, and when you see the `In [*]` indicator that a cell is processing, just get a cup of tea :)
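
The "do the heavy processing once, then reuse the result" advice can be sketched with a timer. The token list below is made up to stand in for a large corpus, and the exact timings will differ from machine to machine.

```python
import time

# A made-up token list standing in for a large corpus (1,000,000 tokens)
demo_tokens = ["The", "Government", "of", "India"] * 250_000

# Heavy step: lowercase every token once and keep the result in a variable
start = time.perf_counter()
demo_lower_tokens = [token.lower() for token in demo_tokens]
build_seconds = time.perf_counter() - start

# Cheap step: look at a small part of the stored result
start = time.perf_counter()
first_tokens = demo_lower_tokens[:4]
slice_seconds = time.perf_counter() - start

print(f"lowercasing took {build_seconds:.4f} s")
print(f"slicing the stored list took {slice_seconds:.6f} s")
print(first_tokens)
```

On most machines the slicing step should take a tiny fraction of the time needed for lowercasing, which is why it pays to keep the processed list in a variable.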

# 2. Concordance List for a text collection (contexts in which tokens appear)

#### Questions & Objectives:

- How can I see the contexts in which a particular token appears in a text collection?

#### Key Points

- A concordance list is a list of all contexts in which a particular token appears in a corpus or text.
- A concordance list can be created using the concordance() method of the `Text` class in NLTK.

Words make a lot of sense in context. A concordance list is a fantastic way to glimpse into the text and see how a particular token is used.
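
The idea behind a concordance list can be sketched in a few lines of plain Python. This is only an illustration of the concept, not how NLTK implements it; the function `simple_concordance`, its parameters and the example sentence are all made up here.

```python
def simple_concordance(tokens, target, window=3, lines=25):
    """Collect up to `lines` contexts of `target`, with `window` tokens on each side."""
    contexts = []
    for position, token in enumerate(tokens):
        if token == target:
            left = tokens[max(0, position - window):position]
            right = tokens[position + 1:position + 1 + window]
            contexts.append(" ".join(left + [token] + right))
    return contexts[:lines]

# A made-up, already lowercased token list
demo_tokens = "a man marries a woman whose father was ill . the woman recovered".split()
demo_contexts = simple_concordance(demo_tokens, "woman", window=2)

for context in demo_contexts:
    print(context)
```

Each printed line shows one occurrence of the token together with its neighbours, which is the kind of view a concordance list gives you for a whole corpus.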

Next, we will display concordances for a particular token, i.e. all contexts a particular token appears in. We can do this using the Text class in NLTK’s text package. We can represent our list of lowercased tokens in the document collection loaded previously using the Text class. The concordance list of a token can be displayed using the `.concordance()` method on this class as shown below.

Note that the process of acquiring concordance data will take about 10 seconds, depending on how busy your machine currently is.

In [7]:
from nltk.text import Text
text_of_the_corpus = Text(lower_corpus_tokens)
# concordance() prints its results itself (and returns None), so it does not need print()
text_of_the_corpus.concordance('woman')

Displaying 25 of 819 matches:
s of age , a sweeper , who married a woman who had leprosy , and at the age of 
e of sitabu , aged 40 , a muhammadan woman . her grand - father and father were
ung man deliberately married a leper woman , and became himself a leper at the 
contrary . in no . 6 a man marries a woman whose grandfather and father had bee
 lepers . in no . 10 a man marries a woman whose father had died of leprosy . i
applies to these cases . in no . 2 a woman marries a man whose father and elder
n in the case of a man who marries a woman of notoriously leper family . in no 
toriously leper family . in no . 5 a woman marries a man whose elder brother wa
d continued to cohabit with a native woman after she had been attacked with lep
isen from intermarriage of a man and woman in both of whom leprosy was heredita
s a leper ; he is now married to the woman , and they both live in the asylum .
een accompanied by a healthy looking woman , and by this means , although all h
editary tr

In the output of the next bit of code, which creates a concordance list for the word “he”, we can see that there are many more results than are displayed on screen (Displaying 25 of 22830 matches). The `concordance()` method prints only the first 25 results by default (or fewer if there are fewer).

In [8]:
text_of_the_corpus.concordance('he')

Displaying 25 of 22830 matches:
leprosy treated by gurjun oil , which he was able to watch for a length of tim
 diminished . during these two months he gained three pounds in weight , which
does not seem much , considering that he did no work and was fairly well fed o
se from jail on the 23rd january 1876 he was again suffering from the sores th
n 5th and died on 20th october 1875 . he was seriously ill when he was brought
ober 1875 . he was seriously ill when he was brought to the hospital , and cou
itted on the 8th september 1875 , and he went home of his own accord on 20th d
is own accord on 20th december 1875 . he was much improved under treat - ment 
evalence of leprosy in the district , he had had but very few opportunities of
even half this number . the natives , he says , call every chronic skin diseas
in the legs , the feet and the ears . he has perfect taste , hearing , sight a
te laboured under it . the leper says he was quite free from leprosy until he 
 he was quite free f

In [9]:
# You can specify the number of lines using
# an additional lines parameter, e.g.:
text_of_the_corpus.concordance('he', lines=170)

# notice that when the result of a cell is too long, it becomes its own little scrollable window

Displaying 170 of 22830 matches:
leprosy treated by gurjun oil , which he was able to watch for a length of tim
 diminished . during these two months he gained three pounds in weight , which
does not seem much , considering that he did no work and was fairly well fed o
se from jail on the 23rd january 1876 he was again suffering from the sores th
n 5th and died on 20th october 1875 . he was seriously ill when he was brought
ober 1875 . he was seriously ill when he was brought to the hospital , and cou
itted on the 8th september 1875 , and he went home of his own accord on 20th d
is own accord on 20th december 1875 . he was much improved under treat - ment 
evalence of leprosy in the district , he had had but very few opportunities of
even half this number . the natives , he says , call every chronic skin diseas
in the legs , the feet and the ears . he has perfect taste , hearing , sight a
te laboured under it . the leper says he was quite free from leprosy until he 
 he was quite free 
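
The `concordance()` method only prints its results. If you would rather keep the matches in a variable, recent versions of NLTK also provide a `concordance_list()` method on `Text`. The sketch below assumes such a version is installed and uses a made-up token list.

```python
from nltk.text import Text

# A made-up, already lowercased token list
demo_tokens = "the man married a woman and the woman was a leper .".split()
demo_text = Text(demo_tokens)

# concordance_list() returns the matches instead of printing them
matches = demo_text.concordance_list("woman", lines=25)
match_positions = [match.offset for match in matches]  # token index of each match
print(len(matches), "matches at token positions", match_positions)

# `lines` caps how many matches come back, just as it does for concordance()
first_match_only = demo_text.concordance_list("woman", lines=1)
print(first_match_only[0].line)
```

Having the matches as a list means you can count them, filter them or save them to a file, rather than only reading them on screen.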

#### 🐛Minitask: combining together what we learned so far

This task will require you to copy-paste and adjust various lines of code from this notebook. We will load and analyse a collection of Barack Obama speeches. 

- load a new corpus: a few selected speeches of Barack Obama located in the folder `./data/barack_obama_speeches`. 
- turn corpus data into tokens (with `.words()`)
- lowercase all the tokens using a list comprehension (`[output for something in list_of_somethings]`) and `.lower()`
- `from nltk.text import Text` and create a Text object from the lowercased tokens with  `Text( your_lowercased_tokens )`
- create concordance lists for some words that you find interesting, eg. 'hope', 'can', 'people'

In [None]:
# write your answer here:




<details><summary style='color:blue'>CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

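    ### BEGIN SOLUTION
    # adjust the path so that it points to where the speeches folder is on your machine
    corpus_root = "./barack_obama_speeches"
    corpus_data = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1')
    corpus_tokens = corpus_data.words()
    lower_corpus_tokens = [word.lower() for word in corpus_tokens]
    corpus_text_object = Text(lower_corpus_tokens)
    print(corpus_text_object[:50])
    # concordance() prints the matches itself, so no print() is needed
    corpus_text_object.concordance('people')
    ### END SOLUTION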

</details>









source: ls_Text and Data Mining Bootcamp