# In this notebook:

1. Working with a **Corpus** (multiple text files)
2. **Concordance** Lists (tokens in context)
3. Searching Text using **Regular Expressions** (wildcards syntax)

In [1]:
# run this cell now. It's the usual imports of text mining libraries

import nltk
import numpy
import string
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/pawel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# 1. Working with a Corpus (multiple text files)

#### Questions & Objectives:

- How can I load a text collection made up of multiple text files and tokenise them?

#### Key Points

- To read an entire collection of text files you can use the PlaintextCorpusReader class/object provided by NLTK and its words() function to extract all the words from the text in the collection.

Text data comes in different forms. You might want to analyse a document in one file or an entire collection of documents (a corpus) stored in multiple files. You already know how to load a single document, and now you will see how to load an entire folder of text documents (a corpus).

### Medical History of British India

We will use the Medical History of British India collection provided by the National Libarry of Scotland as an example. It is located under a link below, if you'd like to read more about it, but for this course we have prepared it for you already. You will find it in the sub0folder inside your data folder.

https://data.nls.uk/data/digitised-collections/a-medical-history-of-british-india/

This dataset forms the first half of the Medical History of British India collection, which itself is part of the broader India Papers collection held by the Library. A Medical History of British India consists of official publications varying from short reports to multi-volume histories related to disease, public health and medical research between circa 1850 to 1950. These are historical sources for a period which witnessed the transition from a humoral to a biochemical tradition, which was based on laboratorial science and document the important breakthroughs in bacteriology, parasitology and the developments of vaccines in a colonial context.

This collection has been made available as part of NLS’s DataFoundry platform which provides access to a number of their digitised collections. We are only interested in the text the Medical History of British India collection for this course.

### Recap: Loading a single file

First, just to see one of the files, let's load one individual file and print a few of its tokens:

In [2]:
file = open('./data/Medical_History_of_British_India/74457530.txt') 

india_raw = file.read() 
india_tokens = word_tokenize(india_raw) 
lower_india_tokens = [word.lower() for word in india_tokens] 
print(lower_india_tokens[0:50] )

['no', '.', '1111', '(', 'sanitary', ')', ',', 'dated', 'ootacamund', ',', 'the', '6th', 'october', '1876', '.', 'from-the', 'honourable', 'w.', 'hudleston', ',', 'chief', 'secretary', 'to', 'the', 'govern-', 'ment', 'of', 'madras', '.', 'to-the', 'offg', '.', 'secretary', 'to', 'the', 'government', 'of', 'india', '.', 'resolution', 'of', 'government', 'of', 'india', 'no', '.', '1-137', ',', 'dated', '5th']


### Loading multiple files (a corpus) into a PlaintextCorpusReader object

We can do the same for an entire collection of documents (a corpus). Instead of pointing to an individual file, we point to a directory/folder with many text documents in it. We will use the entire Medical History of British India collection as our dataset, which consists of almost 500 text documents stored in a folder.

To read the text files in this collection we can use the `PlaintextCorpusReader` python class provided in the corpus package of NLTK that we imported before.

You need to specify:

- the collection directory name 
- a wildcard (generic name) for which files to read in the directory (e.g. `*` for all files, or `*.txt` for all text files `74*.txt` for all text files starting with `74`) 
- a text encoding of the files (in this case `latin1`) to indicate which alphabeth to use.

when you use PlaintextCorpusReader it will look like this:

`list_of_lists_of_tokens = PlaintextCorpusReader(folder_location, file_wildcard, encoding)`

which in our example will look like this:

`corpus_reader = PlaintextCorpusReader("./data/Medical_History_of_British_India", '.*', encoding='latin1') `

In [3]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = "./data/Medical_History_of_British_India"
corpus_reader = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1') 

print(corpus_reader)
# the output will make little sense, but do not worry, keep reading

<PlaintextCorpusReader in '/Users/pawel/Teaching/EFI-TextMining/Text Mining Bootcamp/data/Medical_History_of_British_India'>


When we tried to print our corpus, we saw something like `<PlaintextCorpusReader ... >` instead of a lot of words! Why? That's because the corpus we have loaded into variable `corpus_reader` is more than just a bunch of text variables: 

What you see when you try to print `corpus_reader` is the brief description of a `PlaintextCorpusReader object`

**OBJECT is a type of a variable that combines some storage and some functionality**

For example, the `PlaintextCorpusReader` object contains the list of lists of tokens (list of files, and each file is a list of words), and some additional information, and methods we can use. The one we will use the most is the `.words()` method to pull out all the tokens from the corpus.

`all_tokens_in_corpus = my_corpus_reader.words()`

or in our example:

`corpus_tokens = corpus_reader.words() `

In [4]:
corpus_tokens = corpus_reader.words() 
print(corpus_tokens[400:450]) 

['I', 'have', 'been', 'unable', 'myself', 'to', 'institute', 'any', 'personal', 'enquiry', 'into', 'the', 'matter', ',', 'and', 'can', 'now', 'only', 'forward', 'a', 'summary', 'of', 'the', 'reports', 'of', 'the', 'different', 'Civil', 'Surgeons', 'which', 'have', 'been', 'forwarded', 'to', 'me', 'by', 'Government', '.', 'The', 'accompanying', 'table', 'shews', 'the', 'population', 'of', 'each', 'Registration', 'District', ',', 'and']


#### 🐛Minitask


- simulate opening the book in few places and reading a sentende: print a subsection (`your_list[ start : end ] `) of this corpus' tokens. Just make up an index to start reading it, like we did above (we made up the number `450` it has no signifficance).  

eg. from the `corpus_tokens` list print 50 tokens starting at 1000th, or 9568th. Notice that it is not th emost efficient way to 'eyeball' the text corpus.

In [5]:
# write your answer here:


### BEGIN SOLUTION
print(corpus_tokens[1000:1050]) 
print(corpus_tokens[9568:9590])
print(corpus_tokens[20000:20050])
### END SOLUTION

['will', 'prevail', 'until', 'such', 'an', 'objectionable', 'state', 'of', 'matters', 'is', 'altered', '.', 'With', 'these', 'preliminary', 'remarks', 'I', 'proceed', 'to', 'furnish', 'a', 'summary', 'of', 'the', 'returns', 'received', 'from', 'the', 'various', 'Civil', 'Surgeons', 'as', 'furnished', 'by', 'the', 'Surgeon', 'General', ':', 'KHANDIESH', '.', 'The', 'Civil', 'Surgeon', ',', 'Dhulia', ',', 'records', 'the', 'following', 'notes']
['tian', 'employed', 'in', 'the', 'Mission', 'there', ',', 'who', 'first', 'suffered', 'from', 'the', 'disease', 'at', 'Mussooree', '.', 'And', ',', 'I', 'think', ',', 'it']
['.', 'Mutyal', 'Ditto', 'Ditto', 'Ditto', 'Bag', '(', 'Jammu', ').', 'Ditto', 'Parents', 'died', 'from', 'this', 'disease', '.', 'Ditto', 'Five', 'years', 'ago', '.', 'Appearance', 'of', 'several', 'bull', 'on', 'the', 'arms', '.', 'None', 'Had', 'had', 'small', '-', 'pox', 'and', 'measles', '.', 'Bad', 'Ditto', 'Ditto', 'Ditto', 'Ditto', 'Sound', 'Ditto', 'Present', 'many', 

### Preparing your tokens for analysis (eg. making all words lowercase)
### A few words about overwhelming your computer with processing: DO NOT PANIC!

Note that this dataset is Quite Large: it contains almost 500 files, 30.000.000 words, and 130.000.000 characters! It's 120MB of data.

Still, loading all the files and turning text into tokens takes about 1-2 seconds!

But when we lowercase all of the words in this corpus, we run a piece of code which turns every single character of the 130.000.000 characters into its lower case equivalent.

`lower_corpus_tokens = [word.lower() for word in corpus_tokens]` 

But this is not going to happen instantly. Run the below code cell now, and then keep reading.

You'll need to be patient, it might take even a minute or more to run. It will look as if your notebook has FROZEN (might stop responding), but it's just busy at work. You will know that the cell is running, because there will be an `In [*]` on the left top of that busy cell, and (on some browsers) the icon on the browser tab will turn into an hour glass.

If you have not done it yet, run the below cell now and see what happens.

In [6]:
lower_corpus_tokens = [word.lower() for word in corpus_tokens] 
print(lower_corpus_tokens[400:450]) 

['i', 'have', 'been', 'unable', 'myself', 'to', 'institute', 'any', 'personal', 'enquiry', 'into', 'the', 'matter', ',', 'and', 'can', 'now', 'only', 'forward', 'a', 'summary', 'of', 'the', 'reports', 'of', 'the', 'different', 'civil', 'surgeons', 'which', 'have', 'been', 'forwarded', 'to', 'me', 'by', 'government', '.', 'the', 'accompanying', 'table', 'shews', 'the', 'population', 'of', 'each', 'registration', 'district', ',', 'and']


Here are some things you can do to not have to wait all the time for your code to finish:

**We try to do the hardest processing only once, and store the result in a variable, which we use later** (instead of re-doing the processing all the time).

And this is exactly what we do above: After the above line of code had finished running, the variable `lower_corpus_tokens` contains all of your tokens as lowercase characters.

Accessing the tokens in this variable will take a very short time now, because all the time consuming processing (making things lowercase) is already done, and all the tokens are lowercase already - now we just want to see them.

In [7]:
# this will execute almost immediately now, because all the processing 
# (changing millions of characters) is done, and we are just requesting a sub-set of a (very large) list:

print(lower_corpus_tokens[400:450]) 

['i', 'have', 'been', 'unable', 'myself', 'to', 'institute', 'any', 'personal', 'enquiry', 'into', 'the', 'matter', ',', 'and', 'can', 'now', 'only', 'forward', 'a', 'summary', 'of', 'the', 'reports', 'of', 'the', 'different', 'civil', 'surgeons', 'which', 'have', 'been', 'forwarded', 'to', 'me', 'by', 'government', '.', 'the', 'accompanying', 'table', 'shews', 'the', 'population', 'of', 'each', 'registration', 'district', ',', 'and']


Overall, when processing needs to be done, it needs to be done. So don't be scared of it, and when you see the `In [*]` indicator that a cell is processing, just get a cup of tea :)

# 2. Concordance List for a text collection (contexts in which tokens appear)

#### Questions & Objectives:

- How can I load a text collection made up of multiple text files and tokenise them?

#### Key Points

- A concordance list is a list of all contexts in which a particular token appears in a corpus or text.
- A concordance list can be created using the concordance() method of the Text` class in NLTK.

Words make a lot of sense in context. Concordance list is a fantastic way to glimpse into the text and see how a particular token is used.

Next, we will display concordances for a particular token, i.e. all contexts a particular token appears in. We can do this using the Text class in NLTK’s text package. We can represent our list of lowercased tokens in the document collection loaded previously using the Text class. The concordance list of a token can be displayed using the `.concordance()` method on this class as shown below.

Note that the process of aquiring concordance data will take abotu 10 seconds, depending on the how busy your current machine is.

In [8]:
from nltk.text import Text

text_of_the_corpus = Text(lower_corpus_tokens)
print(text_of_the_corpus.concordance('woman'))

Displaying 25 of 819 matches:
s of age , a sweeper , who married a woman who had leprosy , and at the age of 
e of sitabu , aged 40 , a muhammadan woman . her grand - father and father were
ung man deliberately married a leper woman , and became himself a leper at the 
contrary . in no . 6 a man marries a woman whose grandfather and father had bee
 lepers . in no . 10 a man marries a woman whose father had died of leprosy . i
applies to these cases . in no . 2 a woman marries a man whose father and elder
n in the case of a man who marries a woman of notoriously leper family . in no 
toriously leper family . in no . 5 a woman marries a man whose elder brother wa
d continued to cohabit with a native woman after she had been attacked with lep
isen from intermarriage of a man and woman in both of whom leprosy was heredita
s a leper ; he is now married to the woman , and they both live in the asylum .
een accompanied by a healthy looking woman , and by this means , although all h
editary tr

In the output for the next bit of code which creates a concordance list for the word “he”, we can see that there are many more results in the list than displayed on screen (Displaying 25 of 170 matches). The concordance() method only prints the first 25 results by default (or less if there are less).

In [9]:
text_of_the_corpus = Text(lower_india_tokens)
print(text_of_the_corpus.concordance('he'))

Displaying 25 of 170 matches:
leprosy treated by gurjun oil , which he was able to watch for a length of tim
 diminished . during these two months he gained three pounds in weight , which
does not seem much , considering that he did no work and was fairly well fed o
se from jail on the 23rd january 1876 he was again suffering from the sores th
n 5th and died on 20th october 1875 . he was seriously ill when he was brought
ober 1875 . he was seriously ill when he was brought to the hospital , and cou
itted on the 8th september 1875 , and he went home of his own accord on 20th d
is own accord on 20th december 1875 . he was much improved under treat- ment b
evalence of leprosy in the district , he had had but very few opportunities of
even half this number . the natives , he says , call every chronic skin diseas
in the legs , the feet and the ears . he has perfect taste , hearing , sight a
te laboured under it . the leper says he was quite free from leprosy until he 
 he was quite free fro

In [10]:
# You can specify the number of lines using 
# an additional lines parameter, e.g.:
print(text_of_the_corpus.concordance('he', lines=170))

# notice that when the result of a cell is too long, it will become it's own little scrollable window

Displaying 170 of 170 matches:
leprosy treated by gurjun oil , which he was able to watch for a length of tim
 diminished . during these two months he gained three pounds in weight , which
does not seem much , considering that he did no work and was fairly well fed o
se from jail on the 23rd january 1876 he was again suffering from the sores th
n 5th and died on 20th october 1875 . he was seriously ill when he was brought
ober 1875 . he was seriously ill when he was brought to the hospital , and cou
itted on the 8th september 1875 , and he went home of his own accord on 20th d
is own accord on 20th december 1875 . he was much improved under treat- ment b
evalence of leprosy in the district , he had had but very few opportunities of
even half this number . the natives , he says , call every chronic skin diseas
in the legs , the feet and the ears . he has perfect taste , hearing , sight a
te laboured under it . the leper says he was quite free from leprosy until he 
 he was quite free fr

#### 🐛Minitask: combining together what we learned so far

This task will require you to copy-paste and adjust various lines of code from this notebook. We will load and analyse a collection of Barack Obama speeches. 

- load a new corpus: a few selected speeches of Barack Obama located in the folder `./data/barack_obama_speeches`. 
- turn corpus data into tokens (with `.words()`)
- lowercase all the tokens using list comprehention loop (`[output for something in list_of somethings]`) and `.lower()`
- create a Text object from the lowercased tokens with  `Text( your_lowercased_tokens )`
- create concordance lists for some words that you find interesting, eg. 'hope', 'can', 'people'

In [11]:
# white your answer here:


### BEGIN SOLUTION
corpus_root = "./data/barack_obama_speeches"
corpus_data = PlaintextCorpusReader(corpus_root, '.*', encoding='latin1') 
corpus_tokens = corpus_data.words()
corpus_text_object = Text(lower_india_tokens)
print(corpus_words[:50])
print(corpus_text_object.concordance('people'))
### END SOLUTION

Displaying 25 of 103 matches:
ditions under which the mass of the people live is the only sure method of sta
or 5555 per cent . were found among people inhabiting the sea-shore within one
s in other respects under which the people therein live ; 4th , the circumstan
r places of first at- tack of these people , are widespread throughout the dis
e importance to determine how these people were employed , or what means of li
hunned by their relations and caste people as disgracefully afflicted or unple
e facts it may be gathered that the people of this country , as the result of 
e than 60 years old-a great age for people of this country . of the 50 persons
aracter of the complaint from which people of this country suffer , the histor
r prevalence of leprosy amongst the people of this country , perhaps the most 
ry uncommon disease here , but many people having the disease come from other 
tony of the lives of these wretched people by getting them to take to garden- 
se is not widely diffu

## 3. Searching text using Regular Expressions

#### Questions & Objectives:

- How can I search for tokens in text more flexibly? For example, to find all all mentions of `woman` and `women`, or all words starting with `multi`


#### Key Points

To search for tokens in text using regular expressions you need the `re` module and its search function.

You will learn how to construct regular expressions. E.g. you can use a wildcard * or you can use a range of letters, e.g. [ae] (for a or e), [a-z] (for a to z), or numbers, e.g. [0-9] (for all single digits) etc. Regular expressions can be very powerful if used correctly. To find all mentions of the words woman or women you need to use the following regular expression wom[ae]n.

Regular expressions are a very powerful tool, but we'll just give you a taster and some examples. For a more detailed overview and use of regular expressions, you can later refer to the Programming Historian lesson Understanding Regular Expressions https://programminghistorian.org/en/lessons/understanding-regular-expressions.

**Regular expressions** are ways to be 'a bit vague' about text. (While being increadibly specific at the same time).

For example Let's imagine we want to see all tokens that refer to `women` in text. If we were working with a person (not a computer) they might already assume I mean both singlular `woman` and plural `women`. But computers need us to be very very specific, and so we are provided with a way to describe small acceptable difference. This syntax is called regular expressions (RegEx).

The way we arrive at regular expressions is a process of specifying what we want:

- I could say: give me all occurances of `woman` and `women` and then add them all.
- I could say: give me all occurances of `wom*n` where `*` is `a` or `e` 
- I could use regex to say give me all occurances of `wom[ae]n`
- I could use regex to say give me all occurances of `^wom[ae]n$` which also means that there can be nothing before or after these characters, so `superwomen` and `womenhood` will not be included 

The RegEx we will use is `^wom[ae]n$` and below we explain what it means:

- `^` means: start here
- `wom` and `n` look for these letters in this order
- `[ae]` means: one character from this list, so `[ae]` means one character which is either `a` or `e`
- `$` mean: end here

This way we can look for the word `women` as well a `woman` in a corpus simultaneously, eg. to find out how many times they occur.

Regular Expressions are usually used to define search terms in an 'a bit vague' but also 'very precisely specified' way.

In [12]:
# run this cell now. It imports Regular Expressions module into this notebook

import re

Before we use RegEx on on a whole corpus let's first use it on some example data.

Say I want to know if a given token matches/fits my RegEx. I can try to 'find' the match to that regex in my string.

There are two possible outcomes of searching for a RegEx:

- **Found it**: regex did find a match and returns a `re.Match` object (you can think of is as `True`)
- **Not Found it**: regex did not find a match and returns `None`  (you can think of is as `False`)

Basically, either a particular token fits your regex or it does not.

In [13]:
print(re.search('^wom[ae]n$', "women"))
print(re.search('^wom[ae]n$', "woman"))
print(re.search('^wom[ae]n$', "something")) # no match
print(re.search('^wom[ae]n$', "superwoman")) # not exact match, so no match

<re.Match object; span=(0, 5), match='women'>
<re.Match object; span=(0, 5), match='woman'>
None
None


Regex is case-sensitive and that's why we lowercased our tokens first

In [14]:
print(re.search('^wom[ae]n$', "women"))
print(re.search('^wom[ae]n$', "Women"))
print(re.search('^wom[ae]n$', "WOMEN"))

<re.Match object; span=(0, 5), match='women'>
None
None


### mini code recap: keeping only some elements from a list

We'll use list comprehention's ability to filter list items using `if something_true_or_false`

In [15]:
# print uppercase versions of every fruit in fruits
fruits = ["banana", 'pinapple', 'plums', "kiwi"]
new_fruits = [fruit.upper() for fruit in fruits]
print(new_fruits)

['BANANA', 'PINAPPLE', 'PLUMS', 'KIWI']


In [16]:
# for each fruit in fruits, return that fruit.upper(), 
# but only use items where fruit's first character is 'p'

some_fruits = [fruit.upper() for fruit in fruits if fruit[0] == 'p']
print(some_fruits)

['PINAPPLE', 'PLUMS']


In [17]:
# and to do the same thing, but without upper casing the words
# for each fruit in fruits, return that fruit's name, 
# but only use items where fruit's first character is 'p'

some_fruits = [fruit for fruit in fruits if fruit[0] == 'p']
print(some_fruits)

['pinapple', 'plums']


You can use `if fruit[0] == 'p'` because the comparison `fruit[0] == 'p'` returns `True` or `False`.

### Using RegEx on a List of tokens

Because RegEx also returns something like `True` or `False`, We will now use the same mechanism and the fact that re.search() returns something or nothing:

Here, like in the above example we will:

- filter the items in lower_india_tokens
- keep only those which return `True` if we search for our RegEx in them (they match the RegEx)

`[word 
for word in lower_india_tokens 
if re.search('^wom[ae]n$', word)]`

Even thou it is a bit easier to read when split into 3 lines, traditionally we write it in one line:

`[word for word in lower_india_tokens if re.search('^wom[ae]n$', word)]`

In [18]:
womaen_strings = [word for word in lower_india_tokens if re.search('^wom[ae]n$', word)]
print(womaen_strings)

['women', 'women', 'women', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'women', 'woman', 'women', 'women', 'woman', 'women', 'woman', 'women', 'woman', 'woman', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women']


In [19]:
# if your code becomes too hard to read, you can add some new-lines to make it more readable. eg:
womaen_strings = [word 
                  for word in lower_india_tokens
                  if re.search('^wom[ae]n$', word)]
print(womaen_strings)

['women', 'women', 'women', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'women', 'woman', 'women', 'women', 'woman', 'women', 'woman', 'women', 'woman', 'woman', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women']


Let's see how the search results would change if you remove the `^` and `$` characters from the regular expression.

Now that the results are stored in a list you can count them. We will see how to do that in the next section of the course.

In [20]:
womaen_strings=[w for w in lower_india_tokens if re.search('wom[ae]n', w)]
print(womaen_strings)
# there should be at least one new item, can you see it?

['women', 'women', 'women', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'woman', 'women', 'woman', 'women', 'women', 'woman', 'women', 'woman', 'women', 'washerwoman', 'woman', 'woman', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women', 'women', 'women', 'women', 'woman', 'woman', 'women', 'women', 'women']


### 🖇💬Buddy discussion: What would be some useful ways you imagine RegEx could be used in your work/studies?

#### Ask your buddy now if they reached the **BUDDY TASK**. Once you both did, complete this task:

Can each of you come up with ONE OR TWO EXAMPLES of how the ability to use regular expressions could be useful to you?

Don't spend too much time on this (max 2 mins) but take note of your favourite idea.

### Doing more with Regular Expressions: just a few examples

Regural expressions can be very specific and we will not cover them in detail here but they are very powerful to carry out complex searches, e.g. 

- find all tokens starting with a and are 12 characters long
- find all tokens which are 13 characters long but that do not start with a lower case letter 

Some more RegEx syntax:

- `.` means any character
- `[abcd]` means a character which is either a, b, c or d
- `[a-z]` means a letters between a-z
- `[a-zA-Z]` means a letters between a-z and A-Z
- `[0-9]` means a digit
- `\d` also means a digit


- `*` means zero or more times
- `+` means one or more times
- `?` means zero or one time
- `{5}` means 5 times
- `{3,5}` means 3 to 5 times
- `[^abc]` means anything but a,b or c

Some examples:

A four letter word

- `^[a-z]...$` means a 4 letter word
- `^[a-z]{4}$` also means a 4 letter word

In [21]:
[word for word in lower_india_tokens if re.search('^[a-z]...$', word)]

['ment',
 'offg',
 'from',
 'port',
 'that',
 'agri',
 'dept',
 'offg',
 'home',
 'tion',
 'home',
 'copy',
 'with',
 'this',
 'esq.',
 'with',
 'your',
 'have',
 'that',
 'from',
 'fact',
 'that',
 'past',
 'year',
 'have',
 'been',
 'same',
 'have',
 'been',
 'into',
 'only',
 'have',
 'been',
 'each',
 'that',
 'were',
 'time',
 'were',
 'each',
 'that',
 'very',
 'they',
 'each',
 'puna',
 'city',
 'tana',
 'thur',
 'sind',
 'from',
 'will',
 'seen',
 'that',
 'rate',
 'next',
 'puna',
 'high',
 'less',
 'from',
 'sind',
 'that',
 'hold',
 'that',
 'mass',
 'live',
 'only',
 'sure',
 'with',
 'hope',
 'will',
 'such',
 'with',
 'from',
 'only',
 'case',
 'able',
 'time',
 'this',
 'case',
 'does',
 'cure',
 'rama',
 'aged',
 'with',
 'fair',
 'skin',
 'came',
 'into',
 'jail',
 'with',
 'tips',
 'toes',
 'feet',
 'face',
 'much',
 'also',
 'from',
 'does',
 'seem',
 'much',
 'that',
 'work',
 'well',
 'with',
 'this',
 'from',
 'jail',
 'from',
 'they',
 'were',
 'that',
 'were',
 

In [22]:
# notice that we are returning the result, rather than printing it, because that puts them one under another
# and makes them more readable. If we used print() it would look like this:

print([word for word in lower_india_tokens if re.search('^[a-z]...$', word)])

['ment', 'offg', 'from', 'port', 'that', 'agri', 'dept', 'offg', 'home', 'tion', 'home', 'copy', 'with', 'this', 'esq.', 'with', 'your', 'have', 'that', 'from', 'fact', 'that', 'past', 'year', 'have', 'been', 'same', 'have', 'been', 'into', 'only', 'have', 'been', 'each', 'that', 'were', 'time', 'were', 'each', 'that', 'very', 'they', 'each', 'puna', 'city', 'tana', 'thur', 'sind', 'from', 'will', 'seen', 'that', 'rate', 'next', 'puna', 'high', 'less', 'from', 'sind', 'that', 'hold', 'that', 'mass', 'live', 'only', 'sure', 'with', 'hope', 'will', 'such', 'with', 'from', 'only', 'case', 'able', 'time', 'this', 'case', 'does', 'cure', 'rama', 'aged', 'with', 'fair', 'skin', 'came', 'into', 'jail', 'with', 'tips', 'toes', 'feet', 'face', 'much', 'also', 'from', 'does', 'seem', 'much', 'that', 'work', 'well', 'with', 'this', 'from', 'jail', 'from', 'they', 'were', 'that', 'were', 'from', 'were', 'into', 'some', 'they', 'into', 'died', 'when', 'went', 'home', 'much', 'ment', 'said', 'have',

Another example: any word starting with 'b', ending with 'y'. 

As in, between these letters `b` and `y` we expect any-character `.` to appears zero-or-more times `*` (which we write as `.*`)

`'^b.*y$'`

In [23]:
[word for word in lower_india_tokens if re.search('^b.*y$', word)]
# replace * with a + to look for one or more letters between b and y, not zero or more

['bombay',
 'bombay',
 'by',
 'by',
 'bombay',
 'bombay',
 'bombay',
 'bombay',
 'by',
 'bombay',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'bombay',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'bareilly',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'bareilly',
 'by',
 'bombay',
 'by',
 'by',
 'bareilly',
 'by',
 'by',
 'bareilly',
 'bareilly',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'beggary',
 'by',
 'beggary',
 'body',
 'body',
 'by',
 'buy',
 'by',
 'by',
 'body',
 'body',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'bombay',
 'by',
 'beggary',
 'by',
 'beggary',
 'by',
 'by',
 'by',
 'beggary',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'by',
 'body',
 'by',
 'body',
 'b

### 🐛Minitask: read RegEx with understanding

You will wish you have this when solving crossword puzzles.

In this task you will see some RegEx's and will try to explain what they do:

example, explain RegEx `^[^a-g]..l.ing$`

- find all 8 letter words that
- do not start with a letters from a to c
- and the fourth letter is 'n'
- ends with 'ing'

Run below code to see it:

In [24]:
[word for word in lower_india_tokens if re.search('^[^a-c]..n.ing$', word)]

['standing',
 'standing',
 'granting',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'drinking',
 'standing',
 'standing',
 'spinning',
 'painting',
 'standing',
 'standing']

In [25]:
# Run and explain below code

[word for word in lower_india_tokens if re.search('^m[ae]n$', word)]

['man',
 'man',
 'man',
 'men',
 'men',
 'man',
 'men',
 'men',
 'man',
 'man',
 'men',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'men',
 'man',
 'men',
 'men',
 'men',
 'man',
 'men',
 'man',
 'man',
 'man',
 'men',
 'man',
 'man',
 'men',
 'men',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'men',
 'men',
 'man',
 'men',
 'man',
 'man',
 'men',
 'man',
 'man',
 'man',
 'men',
 'man',
 'men',
 'men',
 'men',
 'man',
 'men',
 'man',
 'men',
 'man',
 'men',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'men',
 'men']

In [26]:
[word for word in lower_india_tokens if re.search('^m[ae]n', word)]

['ment',
 'ment',
 'mentioned',
 'ment',
 'mentioned',
 'maneckji',
 'many',
 'man',
 'man',
 'mentioned',
 'manner',
 'man',
 'manmar',
 'ment',
 'mendicants',
 'men',
 'many',
 'men',
 'many',
 'man',
 'many',
 'many',
 'many',
 'many',
 'many',
 'manglaur',
 'many',
 'ment',
 'men',
 'men',
 'ment',
 'mentioned',
 'many',
 'ment',
 'many',
 'man',
 'man',
 'many',
 'ment',
 'men',
 'mention',
 'many',
 'many',
 'mankind',
 'man',
 'mani',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'man',
 'many',
 'many',
 'mention',
 'man',
 'man',
 'mentioned',
 'men',
 'mankind',
 'man',
 'many',
 'mankind',
 'men',
 'men',
 'management',
 'men',
 'man',
 'mendicants',
 'many',
 'ment',
 'many',
 'many',
 'mensem',
 'mensem',
 'mensem',
 'mensem',
 'many',
 'many',
 'many',
 'many',
 'many',
 'many',
 'many',
 'many',
 'many',
 'many',
 'many',
 'many',
 'many',
 'many',
 'mandera',
 'many',
 'manbhum',
 'many',
 'ment',
 'manners',
 'manner',
 'manbhum',
 'manook',
 'manook',
 'man

In [27]:
[word for word in lower_india_tokens if re.search('^d.*t$', word)]

['dept',
 'department',
 'department',
 'department',
 'different',
 'district',
 'district',
 'district',
 'difficult',
 'district',
 'department',
 'district',
 'district',
 'district',
 'district',
 'diet',
 'diet',
 'distant',
 'district',
 'district',
 'diet',
 'dept',
 'dept',
 'department',
 'department',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'different',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'different',
 'district',
 'district',
 'district',
 'district',
 'different',
 'district',
 'district',
 'district',
 'district',
 'district',
 'district',
 'doubt',
 'development',
 'dacoit',
 'dependent',
 'district',
 'district',
 'district',
 'district',
 'district',
 'doubt',
 'direct',
 'dept',
 'dept',
 'dwelt',
 'district',
 'district',
 'different',
 'district',
 'district',
 'district',
 'district

### 🦋 Extra task (optional): if you have finished everything else already:

Either import a corpus that you would like to analyse youself (create a new folder inside of your `./data/` and put your files there), or use one of the two corpuses we looked at in this notebook 

Then investigate the context of some of the words and use RegEx to look for interesting patterns in it.