# Text processing issues

## Writing systems

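You can use the `codecs` module to access text in different writing systems. It has an `open()` function that works just like the normal way of opening files, with an additional `encoding` argument. (In Python 3, the built-in `open()` accepts an `encoding` argument as well.)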

You need to find out which encoding the file uses and pass it as the `encoding` argument to `open()`. Here is an example, using the Project Gutenberg e-book Pride and Prejudice, available here: https://www.gutenberg.org/ebooks/42671 (use the plain text version, and save it in the same directory as your notebook).

In [1]:
import codecs

# the with-statement closes the file automatically after reading
with codecs.open("pg42671.txt", encoding="utf-8") as f:
    content = f.read()
print(content[:200])

The Project Gutenberg eBook, Pride and Prejudice, by Jane Austen, Edited
by R. W. (Robert William) Chapman


This eBook is for the use of anyone anywhere at no cost and with
almost no restrictio
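
To see why naming the right encoding matters, here is a minimal, self-contained sketch. It does not use the Gutenberg file. Instead it writes a short sample string in Latin-1 and reads it back twice, once with the matching encoding and once with the wrong one. The file name `sample_latin1.txt` and the sample text are made up for illustration, and the sketch uses the built-in `open()`, which takes the same `encoding` argument.

```python
import os
import tempfile

# Hypothetical sample text and file name, chosen only for this illustration.
sample = "Café Müller"
path = os.path.join(tempfile.mkdtemp(), "sample_latin1.txt")

# Write the text using the Latin-1 (ISO 8859-1) encoding.
with open(path, "w", encoding="latin-1") as f:
    f.write(sample)

# Reading with the matching encoding recovers the original text.
with open(path, encoding="latin-1") as f:
    right = f.read()
print(right)

# Reading with the wrong encoding fails here, because the Latin-1
# bytes for "é" and "ü" are not valid UTF-8.
try:
    with open(path, encoding="utf-8") as f:
        wrong = f.read()
except UnicodeDecodeError as err:
    wrong = None
    print("could not decode as UTF-8:", err.reason)
```

A wrong encoding does not always raise an error. Sometimes it silently produces garbled characters, so it is worth printing a small sample of the text after reading it.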


## Reading from a URL

The Python ```urllib``` package lets you read the content of a webpage as if you were reading a file. 

Here is how to read the content of a webpage. This code uses the Wikipedia page about the programming language Python as an example. Note that the result is a `bytes` object rather than a string, which is why the output below starts with `b'`.

Opening and reading a web page works in almost the same way as opening and reading a local file. We start with

```f = urllib.request.urlopen(...)```

and then we can access the data with ```f.read()```, as if it were a local file.

In [2]:
# "import urllib" alone does not reliably load the request submodule
import urllib.request
url = "https://en.wikipedia.org/wiki/Python_(programming_language)"
f = urllib.request.urlopen(url)
contents = f.read()
f.close()
print(contents[2000:3000])

b'ventions","Dynamically typed programming languages","Educational programming languages","High-level programming languages",\n"Information technology in the Netherlands","Multi-paradigm programming languages","Object-oriented programming languages","Programming languages","Programming languages created in 1991","Scripting languages","Text-oriented programming languages"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Python_(programming_language)","wgRelevantArticleId":23862,"wgIsProbablyEditable":true,"wgRelevantPageIsProbablyEditable":true,"wgRestrictionEdit":[],"wgRestrictionMove":[],"wgFlaggedRevsParams":{"tags":{"status":{"levels":1}}},"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallbacks":"en"},"wgMFDisplayWikibaseDescriptions":{"search":true,"watchlist":true,"tagline":false,"nearby":true},"wgWMESchemaEditAttemptStepOversample":false,"wgWMEPageLength":100000,"wgNoticeProject":"wikipedia","wgVector2022PreviewPa
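
The `b'...'` prefix in this output shows that `f.read()` returned bytes, not a string. Here is a minimal sketch of turning such bytes into text with `decode()`. To keep it runnable without a network connection, it uses a short made-up byte string (`raw`) in place of the downloaded page, and it assumes the data is encoded as UTF-8.

```python
# Stand-in for the bytes returned by f.read(); made up for this illustration.
raw = "<p>naïve café</p>".encode("utf-8")
print(type(raw))    # a bytes object

# Turn the bytes into a string; this assumes the data is UTF-8 encoded.
text = raw.decode("utf-8")
print(type(text))   # a str object
print(text)

# Non-ASCII characters take more than one byte in UTF-8,
# so the byte string is longer than the decoded text.
print(len(raw), len(text))
```

For a real page, the encoding is often declared in the page itself, as in the `<meta charset="UTF-8"/>` tag of the Wikipedia page.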

You can also access the contents of a webpage and store it directly to a file on your computer. The following code again accesses the Wikipedia page about Python, and stores it as a file in the same directory as this notebook.

In [3]:
# "import urllib" alone does not reliably load the request submodule
import urllib.request

url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# download the page and store it on this
# computer as "python.html"
urllib.request.urlretrieve(url, "python.html")

('python.html', <http.client.HTTPMessage at 0x1035b8f10>)
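
Both calls can be tried without a network connection, since `urllib` also understands `file://` URLs. The following is a minimal sketch under that assumption: it creates a small local HTML file as a stand-in for a web page (the names `page.html` and `copy.html` are made up), reads it with `urlopen()` inside a `with` statement so that it is closed automatically, and then copies it with `urlretrieve()`.

```python
import pathlib
import tempfile
import urllib.request

# Create a small local HTML file to stand in for a web page
# (file names and content are made up for this sketch).
folder = pathlib.Path(tempfile.mkdtemp())
page = folder / "page.html"
page.write_text("<html><body>Hello</body></html>", encoding="utf-8")
url = page.as_uri()   # a file:// URL; an https:// URL is used the same way

# The with-statement closes the response automatically.
with urllib.request.urlopen(url) as f:
    data = f.read()
print(data)

# urlretrieve() copies the resource to a local file and returns
# the file name together with the response headers.
target = folder / "copy.html"
filename, headers = urllib.request.urlretrieve(url, str(target))
print(filename)
```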

## Getting rid of HTML formatting

Web pages are usually a mixture of actual text and formatting commands (markup). For example, the Wikipedia page we downloaded starts like this:

In [4]:
contents[:500]

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Python (programming language) - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"b6b7baac-1453-4f7a-8660'

The Python package BeautifulSoup can be used to remove the formatting commands and get at the plain text of the retrieved webpage. It is not perfect, but it does remove a lot of the formatting. Here it is applied to the Wikipedia page about Python:

In [5]:
from bs4 import BeautifulSoup

# naming the parser explicitly avoids a warning and keeps results consistent
soup = BeautifulSoup(contents, "html.parser")
text = soup.get_text()
print(text[:3000])





Python (programming language) - Wikipedia











































Python (programming language)

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search
General-purpose programming language


PythonParadigmMulti-paradigm: object-oriented,[1] procedural (imperative), functional, structured, reflectiveDesigned byGuido van RossumDeveloperPython Software FoundationFirst appeared20 February 1991; 31 years ago (1991-02-20)[2]Stable release3.10.7[3] 
   / 7 September 2022; 34 days ago (7 September 2022)Preview release3.11.0rc2[4] 
   / 12 September 2022; 29 days ago (12 September 2022)
Typing disciplineDuck, dynamic, strong typing;[5] gradual (since 3.5, but ignored in CPython)[6]OSWindows, macOS, Linux/UNIX, Android[7][8] and more[9]LicensePython Software Foundation LicenseFilename extensions.py, .pyi, .pyc, .pyd, .pyw, .pyz (since 3.5),[10] .pyo (prior to 3.5)[11]Websitepython.orgMajor implementationsCPython, PyPy, Stackless Python, MicroPython, Cir
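
The output above shows two typical problems: a long run of blank lines, and text from neighbouring table cells glued together (`PythonParadigmMulti-paradigm`). For comparison, here is a minimal sketch of a text extractor that uses only the standard library's `html.parser` module instead of BeautifulSoup. It is not the approach used in this notebook, and the class name `TextExtractor` and the sample HTML are made up for illustration. It skips the content of `<script>` and `<style>` elements, drops whitespace-only pieces, and joins the remaining pieces with newlines.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML document (illustrative sketch)."""

    def __init__(self):
        super().__init__()
        self.chunks = []      # pieces of text found so far
        self.skip_depth = 0   # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that is outside script/style elements
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Made-up sample page, standing in for the downloaded HTML.
html = """<html><head><title>Demo page</title>
<script>var x = "not text";</script></head>
<body><table><tr><th>Paradigm</th><td>Multi-paradigm</td></tr></table>
<p>Hello, <b>world</b>!</p></body></html>"""

parser = TextExtractor()
parser.feed(html)
parser.close()
plain = "\n".join(parser.chunks)
print(plain)
```

Putting every piece on its own line keeps table cells apart, at the cost of also breaking up running text such as `Hello, world!`. Which trade-off is better depends on what the text is used for afterwards.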

## Accessing XML files

We use the file `crocodile.xml`, which contains the following poem from Alice in Wonderland, as xml code:

```
<anthology>
  <poem><info title="How doth the little" author="Lewis Carroll"/>
    <stanza>
      <line>How doth the little crocodile</line>
       <line>Improve his shining tail,</line> 
       <line>And pour the waters of the Nile</line>
       <line>On every golden scale! </line>
    </stanza>
    <stanza>
       <line>How cheerfully he seems to grin,</line>
       <line>How neatly spread his claws, </line>
       <line>And welcome little fishes in</line>
       <line>With gently smiling jaws! </line>
    </stanza> 
  </poem>
</anthology>
```

With the ElementTree package, we read an XML file using the `parse()` function.

ElementTree views an XML file as an inverted "tree" whose "root" is the outermost XML tag, here "anthology".

In [6]:
# small demo of the ElementTree package
# see https://docs.python.org/3/library/xml.etree.elementtree.html
# and http://effbot.org/zone/element.htm
import xml.etree.ElementTree as ET

# tree will be an ElementTree
tree = ET.parse("crocodile.xml")

# getting the root: this is an Element data structure
root = tree.getroot()
root

<Element 'anthology' at 0x104a50c20>

Each element has a tag, the name of the element.

In [7]:
# Elements have tags
print(root.tag)

anthology


The "children" of the root element are the tags that are nested directly within "anthology". 

In [8]:
# We can access children like list elements
print("first child of root", root[0])
print("third child of first child of root", root[0][2])

first child of root <Element 'poem' at 0x104a50cc0>
third child of first child of root <Element 'stanza' at 0x104a5a270>


In [9]:
# We can iterate over children of a node
poem = root[0]
for child in poem:
    print(child.tag)

info
stanza
stanza


Knowing the structure of the xml document, we can then take the document apart:

In [10]:
poem = root[0]
info = poem[0]

print(info)

<Element 'info' at 0x104a5cef0>


Elements sometimes come with attribute/value pairs. ElementTree makes them available as `attrib`, a dictionary of key/value pairs.

In [11]:
info.attrib["title"]

'How doth the little'

In [12]:
info.attrib["author"]

'Lewis Carroll'
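
Indexing `attrib` with a key that does not exist raises a `KeyError`. Here is a minimal sketch of a more forgiving way to read attributes, using the element's `get()` method. It parses a one-line XML snippet from a string with `ET.fromstring()` so that it runs without `crocodile.xml`. The snippet is modeled on the outputs above, and the attribute name `year` is made up to show a missing attribute.

```python
import xml.etree.ElementTree as ET

# Small XML snippet for illustration, modeled on the outputs above.
snippet = '<poem><info title="How doth the little" author="Lewis Carroll"/></poem>'
poem = ET.fromstring(snippet)   # parse from a string instead of a file
info = poem[0]

print(info.attrib)              # all attributes as a dictionary
print(info.get("author"))       # value of a single attribute

# get() returns a default instead of raising KeyError
# when the attribute is missing ("year" is a made-up example).
print(info.get("year", "unknown"))
```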

The text of an element is the plain text enclosed between its opening tag and the matching closing tag. We can access it as `text`.

In [13]:
for pcomponent in poem:
    if pcomponent.tag == "stanza":
        for line in pcomponent:
            print(line.text)
        print()

How doth the little crocodile
Improve his shining tail,
And pour the waters of the Nile
On every golden scale! 

How cheerfully he seems to grin,
How neatly spread his claws, 
And welcome little fishes in
With gently smiling jaws! 
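
The nested loops above work when we know exactly how deep the lines are nested. As a possible alternative, ElementTree also offers `iter()`, which walks the whole tree, and `findall()`, which takes a path relative to the element it is called on. Here is a minimal sketch that uses a shortened copy of the poem as an inline string, so that it runs without `crocodile.xml`.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the poem, used here so the sketch is self-contained.
xml_text = """<anthology><poem>
  <stanza>
    <line>How doth the little crocodile</line>
    <line>Improve his shining tail,</line>
  </stanza>
  <stanza>
    <line>How cheerfully he seems to grin,</line>
    <line>How neatly spread his claws, </line>
  </stanza>
</poem></anthology>"""
root = ET.fromstring(xml_text)

# iter() visits every element with the given tag, however deeply nested;
# strip() removes the stray spaces at the ends of some lines.
all_lines = [line.text.strip() for line in root.iter("line")]
print(all_lines)

# findall() follows a path of tags, starting at the children of root.
stanzas = root.findall("poem/stanza")
print(len(stanzas), "stanzas")
for number, stanza in enumerate(stanzas, start=1):
    # find() returns the first matching child, here the first line
    print(number, stanza.find("line").text)
```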

