# Text manipulation and extraction in Python

This notebook introduces the basic operations on strings that can be done with the Python programming language.
The notebook then focuses on text manipulation and extraction from different sources:
- Text files
- Web
- PDF documents
- OCR scanned PDF documents

For an introduction or recap on Python, refer to the WeBeep page of this course.

## Strings and lists

A 'string' is simply a sequence of characters used to represent a document in a programming language such as Python.
- Let's create a Python variable called 'doc' that contains a short document as a string.
- After defining the variable, we repeat its name so as to print out its content.


In [None]:
doc = 'In a hole in the ground there lived a hobbit. Not a nasty, dirty, wet hole, filled with the ends of worms and an oozy smell'
doc

We can calculate the length of the string (in characters) by using the len() function:

In [None]:
len(doc)

We can divide up the sentence into individual words by splitting it on whitespace (spaces, tabs, etc.). 
- This process is called 'tokenisation':

In [None]:
doc.split()

Note that the output above is in the form of a comma-separated list of strings [s1,s2,...,sn]
- The layout above is vertical, but if you use the print() command you can get a more compact horizontal output.

In [None]:
print(doc.split())

We didn't have to split the sentence on whitespace; we could have split it around any substring.
- For example we could split on the comma ',' character:

In [None]:
doc.split(',')

How many words are there in the document?

In [None]:
words = doc.split()
len(words)

Often in text-processing pipelines we convert all text to lower-case. 
- Since the sentence is almost all in lower-case already, let's convert it to upper-case instead:

In [None]:
doc.upper()
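
Since lower-casing is the more common normalisation step in practice, here is a minimal sketch that chains it with whitespace splitting. The sample sentence and the variable names are made up for illustration, so they do not interfere with the `doc` variable above.

```python
# A short sample sentence (any string would do).
sample = 'In a hole in the ground there lived a Hobbit'

# Lower-case first, then split on whitespace.
sample_tokens = sample.lower().split()
print(sample_tokens)

# Lower-casing merges 'In' and 'in' into a single vocabulary entry,
# so the number of distinct tokens drops from 9 to 8.
print(len(set(sample.split())), len(set(sample_tokens)))
```

Without the lower-casing step, 'In' and 'in' would be counted as two different words.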

## Loading text from a file
Let's now read in a longer document from a text file, 'Alice_Chapter1.txt'.

- Make sure you have downloaded the file "Alice_Chapter1.txt" from the "docs" directory in the WeBeep directory where you found this notebook (I'd suggest you download the entire directory each time to be sure every file is in the right place).
- If you are using Google Colab, you will then need to upload the file by clicking on the Folder icon to the left of the notebook, then clicking on the Upload icon, and finding the file on your drive.

In [None]:
with open("docs/Alice_Chapter1.txt") as f:
    doc2 = f.read()

Print out the text as Python sees it:

In [None]:
doc2

Note all the backslash characters '\\' in the text above.  
- Python stores text as one big string (sequence of characters). 
- Special characters such as newlines and tabs are represented by '\\n' and '\\t' respectively.
- The quote character is used to mark the start and end of the string ('string'), so quote characters that are present in the string are prefixed by a backslash to prevent the string from ending early ('str\\'ing'). 
- Using the print() command, we can output the string in a format that we're more used to seeing it in:

In [None]:
print(doc2)

### Splitting lines and finding words

We can split the text into separate lines using the splitlines() method. 
- Since there are a lot of lines, we'll only print the first 5 of them by appending `[:5]` to the name of the variable containing them

In [None]:
lines = doc2.splitlines()
lines[:5]

Note that: 
- Some of the lines contain no text at all.
- Some of the lines are surrounded by the double quote character " because they contain the single quote character in the text. 

How many lines are there in total in the text?

In [None]:
len(doc2.splitlines())

We can search for a particular word in the text: 
- For example, let's search for the word 'Rabbit'

In [None]:
doc2.find('Rabbit')

The number tells us that the word appears at the 552nd character position. 

We can format the output to state this explicitly:
- We use an f-string (a string literal prefixed with 'f') to build the message, 
- and any expression placed inside curly braces {} is evaluated, converted to a string and inserted into the text.

In [None]:
word = "Rabbit"
mystring = f"The word '{word}' appeared at character position {doc2.find(word)} in the text"
print(mystring)
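
Two details of find() are easy to miss: positions are counted from zero, and only the first match is reported. The sketch below checks both on a made-up sample sentence (not the Alice text), using slicing to recover the word and the count() method to count the matches.

```python
sample_text = 'Alice saw the White Rabbit. The Rabbit was late.'
target = 'Rabbit'

# find() gives the zero-based index of the first match only.
position = sample_text.find(target)
print(position)  # 20

# Slicing from that index recovers the word itself.
print(sample_text[position:position + len(target)])  # Rabbit

# count() gives the number of non-overlapping matches.
print(sample_text.count(target))  # 2
```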

What happens if we search for a string that doesn't exist in the document? 
- Try it... 

In [None]:
# TODO

### Investigating the vocabulary of a document

Now let's find the vocabulary of this text by: 
- first converting the text to lowercase
- then splitting the words on whitespace
- then selecting only distinct words by using the set() function

Python sets behave like sets in mathematics: each element appears at most once, and the elements can be of different types.

In [None]:
lowercase_doc = doc2.lower()
words = lowercase_doc.split()
vocab = set(words)
print(vocab)

To make it easier to read, we could sort the vocabulary alphabetically:

In [None]:
sorted_vocab = sorted(vocab)
print(sorted_vocab)

That looks a bit weird. What are all those bracket '(' characters doing there? 

### Removing punctuation with a regular expression

Notice that many of the vocabulary terms, particularly those at the start of the list, contain punctuation characters like quotes '"', brackets '(' and exclamation marks '!'. We'll now see how to remove these punctuation characters:
- First get a list of punctuation characters from the 'string' library.

In [None]:
import string
string.punctuation

The list is provided as a single string. To convert it to a list of individual characters, just call the list function:

In [None]:
list(string.punctuation)

Notice the double backslash character '\\\\' in the list. This is needed because backslash is used as the escape character. So if we don't put a double backslash, Python will interpret the single backslash as escaping the quote character that follows it.

We can create a regular expression that will match any of those punctuation characters by simply surrounding the string of punctuation characters with square brackets: "[]" 

In [None]:
regex = '[' + string.punctuation + ']'
print(regex)

We can use the new punctuation-matching regular expression with the sub() command in the *re* (regular expression) library to remove the unwanted punctuation.
- Note that the sub() routine actually performs a substitution each time it finds a match, but we will simply replace the punctuation character with an empty string: ''
- Let's print out the first 1000 characters of the text after removing all punctuation:

In [None]:
import re
doc2_nopunctuation = re.sub(regex,'',doc2)
print(doc2_nopunctuation[:1000])

Compare this output with the original text for the first 1000 characters:

In [None]:
print(doc2[:1000])
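
The character class built above by plain concatenation relies on the punctuation characters happening to form a valid pattern. A slightly more defensive variant, sketched here on a made-up sentence, passes the characters through re.escape() first; the names `safe_punct_regex` and `demo` are introduced only for this example.

```python
import re
import string

# re.escape() backslash-escapes the regex metacharacters, so the
# character class stays valid whatever characters it contains.
safe_punct_regex = '[' + re.escape(string.punctuation) + ']'

demo = "Oh dear! (She said.) I shall be late, won't I?"
demo_clean = re.sub(safe_punct_regex, '', demo)
print(demo_clean)  # Oh dear She said I shall be late wont I
```

Note that the apostrophe is removed too, so "won't" becomes "wont"; whether that is acceptable depends on the application.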

Now that we've removed the punctuation, let's generate the sorted vocabulary again, by:
- converting to lowercase
- splitting on whitespace
- selecting only distinct words
- and sorting the words alphabetically

In [None]:
words = doc2_nopunctuation.lower().split()
sorted_vocab = sorted(set(words))
print(sorted_vocab)

### Counting term frequencies

We often represent documents by their vocabulary, and in particular by their most frequently occurring terms, since those words are most likely to describe the topic of the document well.
- We can count the frequency of the terms in the document using the Counter class, which is accessible through the NLTK (Natural Language Tool Kit) library. 
- An online book describing the functionality that the NLTK library provides is available here: http://www.nltk.org/book/


In [None]:
import nltk
counts = nltk.Counter(words)
print(counts)

Note that the words are ordered according to their frequency. 

Let's display them again, but this time only the top 20, using the most_common() method:

In [None]:
counts.most_common(20)
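
To see the counting behaviour in isolation, here is a small sketch using `collections.Counter` from the Python standard library, which offers the same most_common() method. The sample sentence is made up for illustration.

```python
from collections import Counter

demo_words = 'the rabbit saw the hole and the rabbit jumped'.split()
demo_counts = Counter(demo_words)

print(demo_counts['the'])          # 3
print(demo_counts['hobbit'])       # 0: unseen terms count as zero, no KeyError
print(demo_counts.most_common(2))  # [('the', 3), ('rabbit', 2)]
```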

### Filtering Stopwords

The most frequent terms: 'the', 'she', 'to', 'and', 'it', 'was', 'a', 'of', and 'i' aren't very interesting or descriptive of the story.
- They are in fact frequent across *all documents* in the English language, and thus convey very little (if any) information about the topic of the document.
- These terms are referred to as **'stop-words'**, because they can be removed from the description of the document without adversely affecting (indeed usually improving) the performance of a text search engine indexing the document.
- The NLTK library contains lists of stop-words for English, Italian and many other languages. Let's print out the stop-word lists for English and Italian.

Before we can get the stopword lists we need to download them:

In [None]:
nltk.download('stopwords')

In [None]:
from nltk.corpus import stopwords
print('English stopwords:')
print(stopwords.words('english'))
print()
print('Italian stopwords:')
print(stopwords.words('italian'))

Now let's remove the stop-words from the tokenised text before counting the frequency of the words in the document. 
- We can easily remove items from a list using some special syntax in Python: **[x for x in list1 if x not in list2]**

In [None]:
english_stopwords = set(stopwords.words('english'))  # build the set once for fast lookups
words_nostopwords = [w for w in words if w not in english_stopwords]
counts_nostopwords = nltk.Counter(words_nostopwords)
counts_nostopwords.most_common(20)

These words look a little bit better ... 
- The words 'Alice', 'time', 'eat', 'door' and 'rabbit' might be useful for describing the document
- but many of the other words, like 'little', 'like', 'could' and 'get', might not be as useful.


To get an even better list of words for describing the document we would need to make use of information about *how common each word is in general in the English language*, since the more common a particular word is, the less likely it is to be useful for describing the document. 
- More on that later in the course ...
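
The list-comprehension filter used in this section can be tried without downloading anything. The sketch below uses a tiny hand-picked stop-word set, which is an assumption made purely for illustration (the NLTK list is much longer), and stores it as a Python set so that each membership test is fast.

```python
# A tiny hand-picked stop-word set, for illustration only.
demo_stopwords = {'the', 'a', 'of', 'and', 'to', 'was', 'it', 'she'}

demo_tokens = 'she was at the bottom of a deep well and it was dark'.split()

# Keep only the tokens that are not stop-words.
demo_content_words = [w for w in demo_tokens if w not in demo_stopwords]
print(demo_content_words)  # ['at', 'bottom', 'deep', 'well', 'dark']
```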

## Downloading content from the Web

One common source of text documents is the Web. Let's now download an article from Wikipedia, and then extract the text from it.

First download the HTML page using the urllib library and print out just the first 2000 bytes of it:

In [None]:
import urllib.request  
html_doc = urllib.request.urlopen('https://en.wikipedia.org/wiki/Dune_(novel)').read()
html_doc[:2000]

Wow, that looks pretty ugly! 

Let's use another library (called Beautiful Soup) to parse the content of the page. 
- When printing out the parsed document we will use the prettify() method to indent all the HTML tags so that we can see the structure of the HTML document. (This is called 'pretty printing' in HTML/XML.)
- Note that the printed output is very long, so after looking at it, you may want to edit the code to comment out the print line and re-run the cell. 
 - To comment out the last line, simply place a hash character '#' in front of it: #print(parsed_doc.prettify())

In [None]:
import bs4 as bs  
parsed_doc = bs.BeautifulSoup(html_doc,'lxml')
print(parsed_doc.prettify())

### Extracting text from the HTML

Now let's extract the text from the HTML page. 
- First find all paragraph \<p\> ... \</p\> elements within the HTML page.
- The find_all() method returns a list of the elements:

In [None]:
paragraph_elements = parsed_doc.find_all('p')

Now print out the first of the paragraph elements to see what it looks like:
- Note that Python starts counting from zero, not one, so the first element is: paragraph_elements[0]

In [None]:
print(paragraph_elements[0])

Well that was pretty uninteresting. The first paragraph was empty!
- Print out the second paragraph:

In [None]:
print(paragraph_elements[1])

OK, now let's get the text of each paragraph, without all of the HTML markup:
- To do that we'll use the same Python construct we saw before for iterating over the elements of a list.
- This time though, we'll perform an operation on each element (extract the text) before returning the list.

In [None]:
paragraph_texts = [p.text for p in paragraph_elements]

Print out the second paragraph to see how it looks without all of the HTML tags:

In [None]:
print(paragraph_texts[1])

Print out the whole list to see text from the entire document:

In [None]:
print(paragraph_texts)

So there we have it, a list of paragraphs that have been extracted from a webpage.
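
To get a feeling for what happens when the text of each paragraph element is collected, here is a rough sketch that does a similar job with `html.parser.HTMLParser` from the standard library instead of Beautiful Soup. The class `ParagraphTextParser` and the HTML snippet are hypothetical, written only for this example, and the sketch ignores many cases that Beautiful Soup handles (malformed or nested markup, for instance).

```python
from html.parser import HTMLParser

class ParagraphTextParser(HTMLParser):
    """Collect the text found inside each <p> element (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []        # finished paragraph texts
        self._in_paragraph = False  # True while we are inside a <p>
        self._chunks = []           # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == 'p' and self._in_paragraph:
            self.paragraphs.append(''.join(self._chunks))
            self._chunks = []
            self._in_paragraph = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> or <a> still arrives here.
        if self._in_paragraph:
            self._chunks.append(data)

demo_html = ('<html><body><p></p><p><b>Dune</b> is a 1965 novel by '
             '<a href="#">Frank Herbert</a>.</p></body></html>')
demo_parser = ParagraphTextParser()
demo_parser.feed(demo_html)
print(demo_parser.paragraphs)  # ['', 'Dune is a 1965 novel by Frank Herbert.']
```

As in the Wikipedia page above, the first paragraph of the sample is empty, and the tags inside the second one disappear while their text is kept.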

What shall we do with this text? 
- First let's join all the paragraphs together in a single string, separating them with a newline `\n` character:

In [None]:
complete_text = '\n'.join(paragraph_texts)
print(complete_text)

### Searching within extracted text

Now that we have the text in a convenient format, we can start doing some analysis on it. 
- We could search for somebody's name, e.g. the author 'Frank Herbert', by using the `search` command from the regular expression package 're' imported above.


In [None]:
re.search('Frank Herbert', complete_text)

This tells us that the author is first mentioned in between characters 256 and 269.

Let's find out how many times the author has been mentioned in the article. To do that we need to use the `findall()` command rather than the search() command:

In [None]:
name = 'Frank Herbert'
matches = re.findall(name, complete_text)
print(matches)
print(f"The name '{name}' occurs {len(matches)} times")

More than just knowing that the author is being mentioned, we'd like to know what is being said about him. So we'd like to extract the sentences mentioning him. 
- We can do that by changing the regular expression that we are using to be more than just a string of keywords.

The required regular expression is a little complicated, so let's build up to it slowly. 
- First let's write a simple expression to capture the first 10 characters immediately after his name. 
- In regular expressions, the dot character '.' is a wild-card that matches any character
- so to match the next 10 characters, we can simply add ten dots to his name: 


In [None]:
regex = name + '..........'
print(f"Regular expression: '{regex}'")
print("Returns:")
re.findall(regex, complete_text)

That text window is far too short to be useful, and the regular expression is also particularly ugly. 
- Let's simplify the regular expression by using the notation: `{n}` to repeat the previous character n times
- and extend the window out to 100 characters:

In [None]:
regex = name + '.{100}'
print(f"Regular expression: '{regex}'")
print("Returns:")
re.findall(regex, complete_text)

Well the regular expression worked, but we lost one of the results because the required character window was too big. 
- A newline character was encountered less than 100 characters after the author's name, and the dot '.' does not match newlines.
- To fix this, let's change the number of repetitions to be minimum zero, maximum 100:

In [None]:
regex = name + '.{,100}'
re.findall(regex, complete_text)
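
The difference between `{n}` and `{,n}` is easier to see on a short text. The two-line sample below is made up for illustration, with a window of 20 characters instead of 100.

```python
import re

demo_text = 'Frank Herbert wrote Dune.\nFrank Herbert also wrote sequels to it.'
demo_name = 'Frank Herbert'

# Exactly 20 characters must follow the name. The dot never matches '\n',
# so the first mention (only 12 characters before the newline) is lost.
exact_window = re.findall(demo_name + '.{20}', demo_text)
print(exact_window)     # ['Frank Herbert also wrote sequels ']

# Between 0 and 20 characters: both mentions are found.
flexible_window = re.findall(demo_name + '.{,20}', demo_text)
print(flexible_window)  # ['Frank Herbert wrote Dune.', 'Frank Herbert also wrote sequels ']
```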

OK, so that was fun, but what we'd really like to do is get the whole sentence around his name.
- To do that we'll have to find all of the characters both before and after his name that do not include the period '.' character. 
- To choose any character except '.' we can write `[^.]` and to repeat that pattern zero or more times, we simply append '*'

In [None]:
regex = '[^.]*' + name + '[^.]*'
re.findall(regex, complete_text)

Finally, let's clean up the output a little: 
- by stripping off spaces and newline characters at the start and end of each sentence using the strip() method
- and reappending the missing period at the end with `+ '.'`

In [None]:
name = 'Frank Herbert'
regex = '[^.]*' + name + '[^.]*'
matches = re.findall(regex, complete_text)
[m.strip() + '.' for m in matches]
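
The same sentence-extraction pattern can be checked on a short, made-up text where the expected result is easy to verify by eye. Keep in mind that this is only a rough approximation of sentence splitting: any period, including the one in an abbreviation such as 'Dr.', ends a match.

```python
import re

demo_text = ('Dune is a science fiction novel. It was written by Frank Herbert. '
             'Many readers admire it.\nFrank Herbert later wrote five sequels.')
demo_name = 'Frank Herbert'

# Any run of non-period characters around the name.
demo_regex = '[^.]*' + demo_name + '[^.]*'
demo_sentences = [m.strip() + '.' for m in re.findall(demo_regex, demo_text)]
print(demo_sentences)
# ['It was written by Frank Herbert.', 'Frank Herbert later wrote five sequels.']
```

Unlike the dot, the class `[^.]` does match newline characters, which is why the strip() call is needed for the second sentence.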

## Combine data from multiple files

In some cases a data set contains many different kinds of information, and as a result its content is split across several files:
- We can open the required files through Python
- We can load the required information using dictionaries for fast search over the data set
- We can merge the data sets into strings to obtain the final data set

### Loading files

We are going to work with the [Cornell Movie--Dialogs Corpus](https://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html).
You can download a copy of the original version of the corpus from this [link](http://www.mpi-sws.org/~cristian/data/cornell_movie_dialogs_corpus.zip).

Extract the content of the zip archive and put it into a `docs/` directory.

There are two main files in the corpus:
- `movie_lines.txt` is a text file where each row is an utterance in a dialogue; it contains all the lines available in the corpus
- `movie_conversations.txt` is a text file where each row contains the list with the identifiers of the lines composing a dialogue.

First we load the utterances

In [None]:
with open('docs/cornell movie-dialogs corpus/movie_lines.txt') as f:
    lines = f.read()

Let's see what data we have in each row

In [None]:
print(lines[:1000])

There are three elements we want to keep:
- line identifier
- speaker
- utterance text

Then we load the dialogues

In [None]:
with open('docs/cornell movie-dialogs corpus/movie_conversations.txt') as f:
    lines_list = f.read()

Let's see what data we have in each row

In [None]:
print(lines_list[:1000])

Here we are interested in keeping only the list of identifiers composing a dialogue.

### Parsing content with RegEx

Now we can use RegEx to extract the useful information we want.

With RegEx we can describe the structure of an entire string, or of a part of it, and we can group pieces of our expression.
In this way we can retrieve the specific pieces of a string that match our request.

Each row in the lines file follows the same pattern:
1. Line identifier
2. Speaker identifier
3. Movie identifier
4. Speaker
5. Utterance text

We are interested in 1, 4, and 5.

Note that we have a peculiar separator between the elements `+++$+++`

Let's write a RegEx first and apply it to the first rows.

We use round brackets `()` to isolate groups (groups can be nested).

In [None]:
# Raw string (r'...') so that backslashes such as \d reach the regex engine unchanged
regex = r'(L\d+) \+\+\+\$\+\+\+ u\d+ \+\+\+\$\+\+\+ m\d+ \+\+\+\$\+\+\+ ([\w\s]+) \+\+\+\$\+\+\+ (.+)'
print(f"Regular expression: '{regex}'")
print("Returns:")
re.findall(regex, lines[:1000])

What do you see in the output?

Now we can retrieve the desired information from each line and use it to build a dictionary whose keys are the IDs of the lines.

Each element of the dictionary will contain the speaker and the uttered text.

In [None]:
lines = {key: {'speaker': sp, 'text': txt} for key, sp, txt in re.findall(regex, lines)}

Now we can access the elements by specific names

In [None]:
print(type(lines))
print(lines['L868'])
print(type(lines['L868']))
print(lines['L867']['speaker'])
print(lines['L867']['text'])
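
The whole step, from raw rows to a dictionary indexed by line ID, can be reproduced on two rows written in the corpus format and used here purely as sample input. The names `demo_rows`, `demo_sep` and `demo_lines` are introduced only for this sketch; the pattern is assembled from pieces so that the escaped separator is written once.

```python
import re

# Two rows in the corpus format, used as sample input.
demo_rows = ('L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!\n'
             'L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!')

demo_sep = r' \+\+\+\$\+\+\+ '  # the escaped field separator
demo_line_regex = (r'(L\d+)' + demo_sep + r'u\d+' + demo_sep + r'm\d+'
                   + demo_sep + r'([\w\s]+)' + demo_sep + r'(.+)')

# findall() returns one (id, speaker, text) tuple per row.
demo_lines = {key: {'speaker': sp, 'text': txt}
              for key, sp, txt in re.findall(demo_line_regex, demo_rows)}
print(demo_lines['L1044'])  # {'speaker': 'CAMERON', 'text': 'They do to!'}
```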

Now we can move to the dialogues file

Each row in the dialogues file follows the same pattern:
1. First speaker identifier
2. Second speaker identifier
3. Movie identifier
4. List of line identifiers (expressed as a list of strings)

We are interested only in 4.

Note that we have again the peculiar separator between the elements `+++$+++`

We split the search into two parts, first isolate the lists and then retrieve elements from the lists. Let's write a RegEx first and apply it to the first rows.

**NOTE: this is not the smartest way to approach it, but it is useful to understand how regular expressions work**

In [None]:
list_regex = r'u\d+ \+\+\+\$\+\+\+ u\d+ \+\+\+\$\+\+\+ m\d+ \+\+\+\$\+\+\+ \[(.+)\]'
print(f"Regular expression: '{list_regex}'")
print("Returns:")
re.findall(list_regex, lines_list[:1000])

Each element here is a string containing the codes of the lines. We can search each string separately for the line IDs.

In [None]:
elem_regex = r'L\d+'
s = re.findall(list_regex, lines_list[:1000])[0]
print(f"Regular expression: '{elem_regex}'")
print(f"String: \"{s}\"")
print("Returns:")
re.findall(elem_regex, s)

Now we are dealing with an actual list of strings

In [None]:
print(type(re.findall(elem_regex, s)))
print(type(re.findall(elem_regex, s)[0]))

Now we can retrieve the desired information from each line and use it to build a list in which each element is itself a list.
The inner list contains the IDs of the lines composing the dialogue.

In [None]:
lines_list = [re.findall(elem_regex, s) for s in re.findall(list_regex, lines_list)]
lines_list[:10]
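
As the note above says, two nested regular expressions are not the only way to read these rows. One possible alternative, sketched here on two sample rows in the same format, is to split each row on the separator and let `ast.literal_eval` from the standard library parse the last field, since it is written like a Python list of strings. The variable names are hypothetical.

```python
import ast

demo_conv_rows = ("u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']\n"
                  "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']")

demo_id_lists = []
for row in demo_conv_rows.splitlines():
    # The last field looks like a Python list literal, so parse it directly.
    last_field = row.split(' +++$+++ ')[-1]
    demo_id_lists.append(ast.literal_eval(last_field))

print(demo_id_lists)  # [['L194', 'L195', 'L196', 'L197'], ['L198', 'L199']]
```

`ast.literal_eval` only accepts literals (strings, numbers, lists and so on), so it is a safer choice than `eval` for text read from a file.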

### Composing the dialogues

Now that we have an indexed collection of lines and, for each dialogue, the list of line IDs composing it, we can finally put it all together.

For each dialogue in `lines_list` we compose a string with all the turns separated by a newline character.
A turn is a speaker-text pair in the sequence.

In [None]:
dialogues = [
    '\n'.join(f'{lines[idx]["speaker"]}: {lines[idx]["text"]}' for idx in indices) 
    for indices in lines_list if all(idx in lines for idx in indices)  # There are some missing 
]
len(dialogues)

We can take a look at one of the extracted dialogues

In [None]:
print(dialogues[0])

In [None]:
print(dialogues[2307])
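
The role of the `all(...)` condition is easier to see on a toy example. In the sketch below, the index and the conversations are invented for illustration; the second conversation refers to a line that is missing from the index, so it is skipped instead of raising a KeyError.

```python
demo_index = {
    'L1': {'speaker': 'ALICE', 'text': 'Who are you?'},
    'L2': {'speaker': 'CATERPILLAR', 'text': 'Who are YOU?'},
}
demo_conversations = [['L1', 'L2'], ['L2', 'L3']]  # 'L3' is not in the index

demo_dialogues = [
    '\n'.join(f'{demo_index[i]["speaker"]}: {demo_index[i]["text"]}' for i in ids)
    for ids in demo_conversations
    if all(i in demo_index for i in ids)  # skip conversations with missing lines
]
print(len(demo_dialogues))  # 1
print(demo_dialogues[0])
```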

## Loading text from a PDF

Much of the text on the internet is present inside PDF documents, and often we'd like to extract text from them. 

There are many different ways to do that in Python. Today we'll use the pdfplumber library: https://github.com/jsvine/pdfplumber
- In order to use the pdfplumber module, we first need to install it. 
- We can do that from inside the Jupyter notebook by calling the pip3 command:

In [None]:
!pip3 install pdfplumber

Now we can import the module and use it to extract content from a PDF. 
- Let's try extracting content from this NLP research paper: https://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf
- The HTTPS server won't allow us direct programmatic access, so you'll need to use the file in the `docs/` directory on WeBeep where you found this notebook (as we did with the original Alice text)

In [None]:
import pdfplumber
filename = 'docs/collobert11a.pdf'
pdf = pdfplumber.open(filename)

How many pages are in the paper? 

In [None]:
len(pdf.pages)

Wow, that's not a lot of pages!

We can have a look at the first couple of pages extracting the text from them: 

In [None]:
text = pdf.pages[0].extract_text(x_tolerance=1)
print(text)

In [None]:
text = pdf.pages[1].extract_text(x_tolerance=1)
print(text)

Extract the text from all the pages of the document into a list
- Note: this might take a minute. There are a lot of pages ;-)

In [None]:
texts = [page.extract_text(x_tolerance=1) for page in pdf.pages]

Now concatenate all the text together into a single string. 
- We'll separate them from one another using a couple of newline characters and some spaces too

In [None]:
text = "  \n\n".join(texts)
print(text)

### Using Regular Expressions to search a PDF

Use some regular expressions to search through the text for some interesting content. 
- You could look for email addresses, phone numbers, addresses, ...
- Let's try first to look for email addresses: 

In [None]:
regex = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
emails = re.findall(regex,text)
print(emails)

Did you find any? 
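
If the PDF text yields nothing, it helps to check the pattern on a string where the answer is known. The addresses below are made up for illustration.

```python
import re

demo_email_regex = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
demo_page = 'Contact: jane.doe@example.org or j_smith+nlp@cs.example.edu (no spam).'

demo_emails = re.findall(demo_email_regex, demo_page)
print(demo_emails)  # ['jane.doe@example.org', 'j_smith+nlp@cs.example.edu']
```

Reading the pattern left to right: one or more allowed characters for the user name, the '@' sign, a domain made of letters, digits, dots and hyphens, and finally a dot followed by at least two letters.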

POS, Chunking, NER, and SRL are all NLP tasks. 
- Are they mentioned anywhere in the paper? 
- Write a regular expression to find out:

In [None]:
# TODO

Other ideas to try:
- In this period researchers started using neural networks to solve NLP tasks. Find out where neural networks are mentioned in the paper and in what context.
- The authors use a data set composed of text coming from the Wall Street Journal (WSJ). Search for references to it in the PDF.

In [None]:
# TODO

Load another PDF and write regular expressions to search for content in it. 
- For example, you can find reports for Ferrari here: https://corporate.ferrari.com/en/investors/results/reports
- Let's load an interim report from September 2020 (you can find it in the same `docs` folder as before):

In [None]:
filename = 'docs/ferrari_interim_report_at_and_for_the_three_and_nine_months_ended_september_30_2020.pdf'
pdf = pdfplumber.open(filename)
text = '\n\n'.join([page.extract_text() for page in pdf.pages])
print(text)

### Extracting Tables from a PDF 

Sometimes it can be useful to extract tabular data from a PDF. 
- Tools exist that allow you to do this programmatically, making the extraction process semi-automatic.
- One tool that can do this is the *tabula* library. Let's install it:

In [None]:
!pip install tabula-py

Now we can use *tabula* to extract all the tables from Ferrari's interim report above: 

In [None]:
import tabula 
tables = tabula.read_pdf(filename, pages="all", multiple_tables=True)

Let's have a look at some of the tables produced
- the first table:

In [None]:
tables[0]

- the third table:

In [None]:
tables[2]

It can be seen that the tables are in need of a bit of cleaning to make them usable. 
- The tables are Pandas dataframes:

In [None]:
type(tables[0])

So we can clean-up the table by:
- dropping some columns
- dropping some rows
- renaming the columns

In [None]:
df = tables[0]
df = df.drop(df.columns[[1,5]], axis=1)  # drop columns: 1,5
df = df.drop([0,5,12])                   # drop rows: 0,5,12
df = df.reset_index(drop=True)           # reset index
df.columns = ('Field','3months_to_30092020','3months_to_30092019','9months_to_30092020','9months_to_30092019') # rename columns
df

During the cleaning phase, some of the values may need to be updated too (e.g. certain values in the Field column above).
- Ideally the above cleaning operations would be done automatically.
- In practice, tables have lots of nested structure (including the one we just extracted), 
- and it's still a hard research problem to do the cleaning reliably (particularly the generation of the column names that we provided manually).

## Loading text from a scanned document

But what if my documents have been scanned? 
- In that case the task of extracting text from them is much more difficult.

It is also possible to extract text from images, but you will need to have an Optical Character Recognition (OCR) system installed. 
We can use a combination of layout parsing and OCR to extract the text.
- Layout Parser is an open-source library for detecting layouts in images: https://towardsdatascience.com/analyzing-document-layout-with-layoutparser-ed24d85f1d44
- Tesseract is an open-source OCR system developed for many years with Google's support. On some systems it may already be installed; otherwise you can install it from here: https://tesseract-ocr.github.io/tessdoc/Home.html 
- If you have Tesseract installed, you can follow the instructions here to use it from Python: https://towardsdatascience.com/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052

Have a look at the first link to see how it works, and try it yourselves.