# Live coding - processing text in Python

This notebook discusses the main affordances of Python `str` objects, as well as how to load `.txt`, `.pdf` and `.html` files as Python objects.

For this, we will end up covering several libraries, including (but not limited to):
- BeautifulSoup4.
- PyPDF2.
- urllib.
- NLTK.

## Python strings

So far in the course we have dealt with numerical objects (`numpy ndarray`s, `DataFrame`s and so on). Now, since the goal is to process text, we will dive into the types of things we can do with Python strings.

Let's start by checking what we can do with the string `"hello world"`:

In [19]:
# dir("hello world")

**Question:** Choose two or three of these methods, and play with them.
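
As one possible starting point, here is a small sketch that tries a few of the methods listed by `dir`. The variable names are only illustrative.

```python
greeting = "hello world"

# String methods return new values; the original string is never modified.
shouted = greeting.upper()                     # all characters in upper case
swapped = greeting.replace("world", "there")   # substitute one substring for another
starts_ok = greeting.startswith("hello")       # boolean prefix check
n_l = greeting.count("l")                      # number of times "l" occurs

print(shouted, swapped, starts_ok, n_l)
```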

**Question:** How do we slice text in Python?
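
A minimal sketch of slicing, using the same `"hello world"` string. Slices use the form `text[start:stop:step]`, where `stop` is excluded and any of the three parts can be omitted.

```python
text = "hello world"

first_word = text[:5]        # characters at positions 0 to 4
second_word = text[6:]       # from position 6 to the end
last_three = text[-3:]       # negative indices count from the end
every_other = text[::2]      # every second character
reversed_text = text[::-1]   # a step of -1 walks the string backwards

print(first_word, second_word, last_three, every_other, reversed_text)
```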

**Question:** Explore the split and the join methods of strings.
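
A short sketch of how `split` and `join` relate to each other. The sample sentence and variable names are made up for illustration.

```python
sentence = "Call me Ishmael. Some years ago"

# With no argument, split() cuts on any run of whitespace.
words = sentence.split()

# join() is called on the separator and takes the list of pieces.
rejoined = " ".join(words)
dashed = "-".join(words)

# With an explicit separator, empty pieces are kept.
fields = "a,b,,c".split(",")

print(words)
print(dashed)
print(fields)
```

Note that punctuation stays attached to the words (`"Ishmael."`), which matters later when counting words.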

## Loading up .txt files

Load up the file `moby_dick.txt` from the Drive folder. The following is a recipe for loading up `.txt` files in Python.

In [None]:
with open("moby_dick.txt") as fp:
  novel = fp.read()

Let's print the first 1000 characters of the novel.

In [None]:
print(novel[:1000])

MOBY-DICK;

or, THE WHALE.

By Herman Melville



CONTENTS

ETYMOLOGY.

EXTRACTS (Supplied by a Sub-Sub-Librarian).

CHAPTER 1. Loomings.

CHAPTER 2. The Carpet-Bag.

CHAPTER 3. The Spouter-Inn.

CHAPTER 4. The Counterpane.

CHAPTER 5. Breakfast.

CHAPTER 6. The Street.

CHAPTER 7. The Chapel.

CHAPTER 8. The Pulpit.

CHAPTER 9. The Sermon.

CHAPTER 10. A Bosom Friend.

CHAPTER 11. Nightgown.

CHAPTER 12. Biographical.

CHAPTER 13. Wheelbarrow.

CHAPTER 14. Nantucket.

CHAPTER 15. Chowder.

CHAPTER 16. The Ship.

CHAPTER 17. The Ramadan.

CHAPTER 18. His Mark.

CHAPTER 19. The Prophet.

CHAPTER 20. All Astir.

CHAPTER 21. Going Aboard.

CHAPTER 22. Merry Christmas.

CHAPTER 23. The Lee Shore.

CHAPTER 24. The Advocate.

CHAPTER 25. Postscript.

CHAPTER 26. Knights and Squires.

CHAPTER 27. Knights and Squires.

CHAPTER 28. Ahab.

CHAPTER 29. Enter Ahab; to Him, Stubb.

CHAPTER 30. The Pipe.

CHAPTER 31. Queen Mab.

CHAPTER 32. Cetology.

CHAPTER 33. The Specksnyder.

CHAPTER 34. The Ca

## Getting the most common words

Usually, some NLP algorithms expect data of type `List[str]`. Given a string like `novel` above, the first step is some preprocessing, such as splitting it into a list of words.

Let's start our NLP analysis by counting words in the novel itself. What do you think the most common word is?

Thankfully, this functionality is already implemented in the NLTK library for us:

In [None]:
from nltk.probability import FreqDist

In [None]:
novel = novel.replace("\n", " ")

In [None]:
document = novel.split()

In [None]:
# FreqDist(document)

**Question:** How do you sort a dictionary by its values?
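
One possible answer, sketched with the standard library's `collections.Counter` (which NLTK's `FreqDist` builds on). The tiny `document` list below is a made-up stand-in for the novel's word list.

```python
from collections import Counter

# A toy stand-in for the list of words obtained with novel.split().
document = "the whale and the sea and the ship".split()

# Counter maps each word to the number of times it appears.
counts = Counter(document)

# Sort the (word, count) pairs by the count, largest first.
by_count = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
top_two = by_count[:2]

# Counter also offers most_common(), which does the same sorting for us.
print(top_two)
print(counts.most_common(2))
```

The same `sorted(..., key=...)` pattern works for any dictionary, not only for word counts.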

## Parsing PDF files

The second code snippet that I want you to have shows how to open and parse a PDF file. **Beware**: text extraction from PDFs is still unreliable, and the extracted text often needs cleaning up.
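
Text extracted from a PDF often comes back with line breaks and padding in odd places. Below is a minimal cleanup sketch, assuming a hypothetical helper `normalize_whitespace` and a made-up sample string in the style of typical extractor output. It only collapses whitespace and cannot repair words that the extractor split in the middle.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of spaces, tabs and newlines into one space."""
    # split() with no argument cuts on any whitespace and drops empty pieces.
    return " ".join(text.split())


# Hypothetical sample imitating raw extractor output.
sample = "International Journal of \nCom\nputing\n \nScience and"
cleaned = normalize_whitespace(sample)
print(cleaned)
```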

In [7]:
!pip install PyPDF2
import PyPDF2

Collecting PyPDF2
  Downloading PyPDF2-1.26.0.tar.gz (77 kB)
[?25l[K     |████▎                           | 10 kB 25.7 MB/s eta 0:00:01[K     |████████▌                       | 20 kB 32.1 MB/s eta 0:00:01[K     |████████████▊                   | 30 kB 15.2 MB/s eta 0:00:01[K     |█████████████████               | 40 kB 18.1 MB/s eta 0:00:01[K     |█████████████████████▏          | 51 kB 20.8 MB/s eta 0:00:01[K     |█████████████████████████▍      | 61 kB 23.6 MB/s eta 0:00:01[K     |█████████████████████████████▋  | 71 kB 25.8 MB/s eta 0:00:01[K     |████████████████████████████████| 77 kB 5.8 MB/s 
[?25hBuilding wheels for collected packages: PyPDF2
  Building wheel for PyPDF2 (setup.py) ... [?25l[?25hdone
  Created wheel for PyPDF2: filename=PyPDF2-1.26.0-py3-none-any.whl size=61100 sha256=ac65df28f4f798f564e104dfea39362a5df8ffa6e5c621d6ae051aca28809dca
  Stored in directory: /root/.cache/pip/wheels/80/1a/24/648467ade3a77ed20f35cfd2badd32134e96dd25ca811e64b3
Succe

In [10]:
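# PyPDF2 1.x API, matching the version installed above. (PyPDF2 3.x renamed
# these to PdfReader, len(reader.pages), reader.pages[i] and extract_text().)
with open("paper.pdf", "rb") as fp:
  pdf_reader = PyPDF2.PdfFileReader(fp)
  n_pages = pdf_reader.getNumPages()
  for page_number in range(n_pages):
    print("-"*80)
    print(page_number)
    page = pdf_reader.getPage(page_number)
    content = page.extractText()
    # Show only the first 50 characters of each page.
    print(content[:50])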

--------------------------------------------------------------------------------
0

--------------------------------------------------------------------------------
1
International Journal of 
Com
puting
 
Science and
--------------------------------------------------------------------------------
2
 
Copyright © 2018
 
IJCSIT
.
 
131
              
--------------------------------------------------------------------------------
3
Copyright © 2018
 
IJCSIT
.
 
132
                
--------------------------------------------------------------------------------
4
 
Copyright © 2018
 
IJCSIT
.
 
133
              


## Loading up websites and HTML using `BeautifulSoup`

One final code snippet that I want you to be aware of downloads a web page and parses its HTML:

In [11]:
from urllib import request
from bs4 import BeautifulSoup

# The URL you want to parse.
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = request.urlopen(url).read().decode('utf8')
# print(html)

parsed_html = BeautifulSoup(html, 'html.parser')
parsed_html.prettify()

'<!DOCTYPE doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<html>\n <head>\n  <title>\n   BBC NEWS | Health | Blondes \'to die out in 200 years\'\n  </title>\n  <meta content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service" name="keywords"/>\n  <meta content="2002/09/27 11:51:55" name="OriginalPublicationDate"/>\n  <meta content="/1/hi/health/2284783.stm" name="UKFS_URL"/>\n  <meta content="/2/hi/health/2284783.stm" name="IFS_URL"/>\n  <meta content="text/html;charset=iso-8859-1" name="HTTP-EQUIV"/>\n  <meta content="Blondes \'to die out in 200 years\'" name="Headline"/>\n  <meta content="Health" name="Section"/>\n  <meta content="Natural blondes are an endangered species and will die out by 2202, a study suggests." name="Description"/>\n  <!-- GENMaps-->\n  <map name="banner">\n   <area alt="BBC NEWS" coords="7,9,167,32" href="http://news.bbc.co.uk/1/hi.html" shape="RECT"/>\n  </ma
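
Notice that one of the meta tags in the output declares `iso-8859-1`, while the cell decodes the bytes as `utf8`. That works as long as the bytes are valid UTF-8, but it can raise `UnicodeDecodeError` on pages with accented characters. Below is a hedged sketch of a more forgiving decoder. The helper `decode_page` and its parameters are hypothetical. With `urllib`, the declared charset could be read from the response via `response.headers.get_content_charset()`.

```python
def decode_page(raw: bytes, declared_charset: str = "utf-8") -> str:
    """Decode downloaded bytes, falling back to lenient UTF-8 on failure."""
    try:
        return raw.decode(declared_charset)
    except (UnicodeDecodeError, LookupError):
        # Wrong or unknown charset: keep going, marking undecodable bytes.
        return raw.decode("utf-8", errors="replace")


# Bytes as an iso-8859-1 server would send them (made-up example).
raw = "café".encode("iso-8859-1")

with_declared = decode_page(raw, "iso-8859-1")
with_default = decode_page(raw)
print(with_declared, with_default)
```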

In [12]:
parsed_html.title

<title>BBC NEWS | Health | Blondes 'to die out in 200 years'</title>

In [17]:
parsed_html.find_all("p")[4:6]

[<p>
 In order for a child to have blonde hair, it must have the gene on both sides of the family in the grandparents' generation. 
 <p><b>Dyed rivals</b>
 <p>
 
 The researchers also believe that so-called bottle blondes may be to blame for the demise of their natural rivals. 
 <p>
 They suggest that dyed-blondes are more attractive to men who choose them as partners over true blondes. 
 <p>
 <!-- GENInlineIMAGE -->
 <table align="RIGHT" border="0" cellpadding="3" cellspacing="3" width="154"><tr><td><font size="2">
 <div class="inlineimage">
 <img alt="Tory MP Ann Widdecombe" border="0" height="180" src="/media/images/38280000/jpg/_38280457_widders150.jpg" vspace="0" width="150"/>
 <div class="caption"><small>Bottle-blondes like Ann Widdecombe may be to blame</small><br/></div>
 </div>
 </font></td></tr></table>
 		
 
 	
 But Jonathan Rees, professor of dermatology at the University of Edinburgh said it was unlikely blondes would die out completely. 
 <p>
 "Genes don't die out unless 

In [18]:
# Even more important: get all the text.
parsed_html.get_text()[:50]

"\n\n\nBBC NEWS | Health | Blondes 'to die out in 200 "
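
As the output shows, `get_text()` keeps the blank lines left behind by the removed tags. A minimal sketch of one way to tidy this up, assuming a hypothetical helper `non_empty_lines` and a made-up sample string modelled on the output above:

```python
from typing import List


def non_empty_lines(text: str) -> List[str]:
    """Return the stripped lines of text, skipping blank ones."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Made-up sample in the style of what get_text() returns.
sample = "\n\n\nBBC NEWS | Health\n\n   \n  Some paragraph text. \n"
lines = non_empty_lines(sample)
print(lines)
```

The resulting list of strings can then be joined back together with `"\n".join(lines)` or processed line by line.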