#  Reading files in Python 

We show how to read external file formats in Python: CSV, XML, JSON, PDF, HTML. 


This notebook starts with some theory and pointers to the literature followed by actual example code.


The general logic of all the readers and writers is pretty much the same:

* open a file for reading
* read it using a special purpose module
* the external file format is converted into a Python data structure, usually a list or a dict.


 
## [Reading files in Python](id:read)

#### Example: spreadsheets
A spreadsheet, for instance, can be transformed into a list of lists: a spreadsheet is a list of rows, and each row is a list of data items.

* in this representation we use the index-number of the column to get to the information in that column.

We can also transform a spreadsheet into a list of dicts, if we have _names_ for the columns. These names are often given in the first row of the spreadsheet.

* now we can use the name of the column to get to that information.
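
As a minimal sketch of both representations, the block below reads a small made-up table with the standard `csv` module (Python 3 syntax, where `csv` works on Unicode text). The content of `raw` and its column names are assumptions for illustration only.

```python
import csv
import io

# Hypothetical spreadsheet content; the first row holds the column names.
raw = u"name;species;height_m\nOak 12;Quercus robur;24\nBeech 3;Fagus sylvatica;31\n"

# List of lists: address a cell by row index and column index.
rows = list(csv.reader(io.StringIO(raw), delimiter=";"))
print(rows[1][1])

# List of dicts: address a cell by row index and column name.
records = list(csv.DictReader(io.StringIO(raw), delimiter=";"))
print(records[0]["species"])
```

Both lookups reach the same cell; the second one keeps working if the column order in the file changes.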

For semi-structured information like XML and JSON, nested dicts are the obvious representation. 


### CSV

###### ~~csvreader~~
~~CSV (Comma-Separated Values) files are spreadsheets in a text format. Each row is on one line; data cells are separated by a comma, a semicolon, a tab, or some other delimiter.~~

~~In Python they can be read using the [csv module](https://docs.python.org/2/library/csv.html). The problem is that in Python 2 this module does not support Unicode well.~~

##### pandas 
[Reading using pandas](http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/master/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb) is a better option. Pandas has lots of tricks for reading and "repairing" a csv file. 

 
* Easiest: make sure the input file has no junk lines on top and has the column names in its first line.

```
import pandas as pd

df =  pd.read_csv('EUspeeches10people2.csv' , sep="\t", encoding='utf-8' )  
```
* Potential trouble:
    * no header
    * junk lines on top
    * strange separator
    * encoding issues
    * (hard) not a fixed number of columns
    * (semi-hard)  very big files
* Most of these can be dealt with inside `pd.read_csv`, for example with its `header`, `skiprows`, `sep` and `encoding` parameters.
* The harder ones need special treatment
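
A minimal sketch of how several of these problems might be handled, assuming a made-up file with two junk lines, no header row and `;` as separator (Python 3 syntax). The data and the column names are invented; when reading from a real path you would also pass `encoding=...` as in the call above.

```python
import io
import pandas as pd

# Made-up file content: two junk lines on top, no header row, ';' as separator.
raw = u"exported by some tool\ngenerated on 2016-07-15\nAlice;34;Amsterdam\nBob;27;Utrecht\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",                        # strange separator
    skiprows=2,                     # junk lines on top
    header=None,                    # no header row in the file
    names=["name", "age", "city"],  # hypothetical column names supplied by hand
)
print(df)

# Very big files: read in chunks instead of all at once.
n_rows = 0
for chunk in pd.read_csv(io.StringIO(raw), sep=";", skiprows=2,
                         header=None, chunksize=1):
    n_rows += len(chunk)
print(n_rows)
```

With `chunksize`, `read_csv` returns an iterator of small DataFrames, so only one chunk is in memory at a time.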
 

### XML
The `lxml` library is a widely used choice for parsing XML. Tab completion in your notebook is a quick way to discover how it works.

```
# See http://lxml.de/objectify.html
import urllib2
from lxml import etree
from lxml import objectify

url = "http://www.volkskrant.nl/cultuur/rss.xml"
rss = urllib2.urlopen(url)

# Now parse it to a tree
parsed = etree.parse(rss)
# use XPath to get the items
titles = parsed.xpath('//item//title')
titlestrings = [t.text for t in titles]
titlestrings
```

This returns a list of titles. 

```
# get all URLs
[url.text for url in parsed.findall('//link')]
```
This returns a list of URLs.

### JSON
See [consuming-json-data-from-a-web-service.ipynb](consuming-json-data-from-a-web-service.ipynb) for an introduction.

For JSON held in a string, you really just need `json.loads` (to read) and `json.dumps` (to write), if you are lucky with the data values.

If you are reading files, use `json.load` and `json.dump`
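
A small sketch of the difference between the string and the file variants (Python 3 syntax). The `record` dict is made up, and `io.StringIO` stands in for a file opened with `open(...)`.

```python
import io
import json

record = {"firstName": "John", "age": 25, "phoneNumber": [{"type": "home"}]}

# Strings: dumps writes a JSON string, loads reads one back.
as_text = json.dumps(record)
from_text = json.loads(as_text)

# File objects: dump writes to a file, load reads from one.
fake_file = io.StringIO()
json.dump(record, fake_file)
fake_file.seek(0)          # rewind before reading what was just written
from_file = json.load(fake_file)

print(from_text == record, from_file == record)
```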

Try this example:

```
# See https://docs.python.org/2/howto/urllib2.html
import json
import urllib2
url = "https://raw.githubusercontent.com/DevTeam-TheOpenBastion/int-py-notes/master/nbsource/list-dict-and-set-comprehensions.ipynb"
jsonfile = urllib2.urlopen(url)

json_as_python_object = json.load(jsonfile)  # the JSON file transformed into a Python dict

# test
json_as_python_object.keys(), len(json_as_python_object)
```





### PDF

PDF parsing is not very well handled in Python. Under Linux there are the Xpdf tools, like [pdftotext](http://en.wikipedia.org/wiki/Pdftotext).

For Python, [pdfminer](https://github.com/euske/pdfminer) is a similar tool, though it does not work as well as pdftotext.
You can run it from the command line or from within a Python script.

```
 # Step 3 From PDF to text
 # We use pdf2txt.py from http://www.unixuser.org/~euske/python/pdfminer/index.html#pdf2txt
 # I put the tool in /Users/admin/bin/pdfminer-20140328


 #do a test
wcw-staff-145-18-164-254:pdfminer-20140328 admin$ pwd
/Users/admin/bin/pdfminer-20140328
wcw-staff-145-18-164-254:pdfminer-20140328 admin$ ./tools/pdf2txt.py ~/Documents/work/onderwijs/DataScience/HarryPotterAnalysis/Harry-Potter-All-7-books-+-3-extras/J.K._Rowling_-Chapter_0_-_Harry_Potter_Prequel.pdf 
C H A P T E R  Z E R O 

 

(cid:145) 1 (cid:145) 


CHAPTER  ZERO 

THE PREQUEL 
```


### HTML

HTML parsing is best done using the [BeautifulSoup Module](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#beautiful-soup-documentation)

* [Intro and installing](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)
* [Search](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree)




In [None]:
# See https://docs.python.org/2/howto/urllib2.html
import urllib2
import requests  # much more pleasant to work with than urllib2

In [85]:
# example

url= 'https://zoek.officielebekendmakingen.nl/h-ek-20152016-1-9.xml'

print "downloading with requests"
r = requests.get(url)
with open("test.xml", "wb") as code:
    code.write(r.content)
    
# test if it works
%ls -lh test.xml
!head -1 test.xml
!xmllint test.xml -noout

downloading with requests
-rw-r--r--  1 admin  staff    93K Jul 26 17:40 test.xml
﻿<?xml version="1.0" encoding="utf-8"?>


# Reading in files, from the web or from your disk

* requests <http://docs.python-requests.org/en/master/user/quickstart/>
* urllib2 <https://docs.python.org/2/howto/urllib2.html>
* from disk: <https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files>
* handy: <https://docs.python.org/2/library/os.path.html#os.path.join>
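
A minimal sketch of building a path with `os.path.join` and reading the file back (Python 3 syntax). The temporary directory and the file name `example.csv` are assumptions made so the example is self-contained.

```python
import os
import tempfile

# Hypothetical data directory, created on the fly.
data_dir = tempfile.mkdtemp()
path = os.path.join(data_dir, "example.csv")   # uses the right separator for the OS

with open(path, "w") as f:
    f.write("a;b\n1;2\n")

with open(path) as f:            # the file is closed automatically after the block
    header = f.readline().strip().split(";")

print(os.path.basename(path), header)
```

Compared with gluing strings together using `+`, `os.path.join` avoids missing or doubled separators.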

In [None]:
## Read from disk
# ls is a unix command (and thus works on mac too), but %ls is an IPython magic, so it also works on Windows
# Mac users can simply type !ls 
%ls ../Data/

In [None]:
# Put your files in a Python list
mijn_files= !ls ../Data
len(mijn_files), mijn_files[:3]

In [None]:
# combine Python with Linux commands
# For example: count the number of words in each file
# Note the ! and the $
for f in mijn_files:
    file_plus_pad = '../Data/'+f
    !wc -w "$file_plus_pad"

In [None]:
f = open('../Data/MONUMENTALE_BOMEN.csv')

# What can you do with f?  Use TAB


In [None]:
columnnames= f.readline().split(';')
columnnames

# More control using "with open as"

In [None]:
with open('../Data/MONUMENTALE_BOMEN.csv') as f:
    alllines=[]
    for l in f:
        alllines.append(l.split(';'))
        
len(alllines), alllines[:2]

In [None]:
# read from the web
url="http://maartenmarx.nl/teaching/DataScience/Data/MONUMENTALE_BOMEN.csv"
f=urllib2.urlopen(url)
# Find out about f: type f. and press TAB
f.readline()

# Read from the web: requests

In [None]:
import requests

f= requests.get(url)
# Find out about f: type f. and press TAB

lines= f.text.split('\n')
f.close()

len(lines), lines[:2]


In [None]:
# Turn it into a matrix using a list comprehension and another split

list_of_lists= [line.split(';') for line in lines ]

len(list_of_lists), list_of_lists[:2]
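
One detail of splitting on `'\n'` is easy to miss: a text that ends with a newline yields an empty last "row". The sketch below (Python 3 syntax, made-up data) shows the effect and one possible way around it.

```python
text = u"name;city\nAlice;Amsterdam\nBob;Utrecht\n"

# split('\n') leaves an empty string after the final newline ...
naive = [line.split(";") for line in text.split("\n")]

# ... while splitlines() plus a filter keeps only real rows.
clean = [line.split(";") for line in text.splitlines() if line.strip()]

print(len(naive), naive[-1])
print(len(clean), clean[-1])
```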

# CSV: use pandas

* See <http://nbviewer.ipython.org/github/jvns/pandas-cookbook/blob/master/cookbook/Chapter%201%20-%20Reading%20from%20a%20CSV.ipynb>


## JSON

* Nice example at <http://en.wikipedia.org/wiki/JSON#Data_types.2C_syntax_and_example>
* `json.load` turns a JSON file into a Python object (a dict for a JSON object)

In [None]:
import json

test='''
{
  "firstName": "John",
  "lastName": "Smith",
  "age": 25,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021"
  },
  "phoneNumber": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "fax",
      "number": "646 555-4567"
    }
  ],
  "gender": {
    "type": "male"
  }
}
'''
 

In [None]:
#What kind of object is test ?
test?

In [None]:
Test_dict= json.loads(test)   


In [None]:
#What kind of object is Test_dict

Test_dict?

In [None]:
# try it out:
Test_dict['address']    
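
Going deeper into the nested structure works by chaining keys and indexes. A short sketch on a trimmed copy of the example (Python 3 syntax); the names `person` and `numbers_by_type` are only for illustration.

```python
import json

person = json.loads('''
{
  "firstName": "John",
  "address": {"city": "New York", "postalCode": "10021"},
  "phoneNumber": [
    {"type": "home", "number": "212 555-1234"},
    {"type": "fax",  "number": "646 555-4567"}
  ]
}
''')

city = person["address"]["city"]                    # dict inside a dict
first_number = person["phoneNumber"][0]["number"]   # dict inside a list inside a dict
numbers_by_type = {p["type"]: p["number"] for p in person["phoneNumber"]}

print(city, first_number, numbers_by_type["fax"])
```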

### Every IPython notebook is a json file

In [None]:

import json
url = "https://raw.githubusercontent.com/DevTeam-TheOpenBastion/int-py-notes/master/nbsource/list-dict-and-set-comprehensions.ipynb"
jsonfile= urllib2.urlopen(url)
 
json_as_python_object = json.load(jsonfile) # The jsonfile transformed into a Python dict

# test 
json_as_python_object.keys() ,len(json_as_python_object) 
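
To try this without a download, the sketch below builds a tiny hand-written notebook and inspects it (Python 3 syntax). It assumes the nbformat 4 layout, with the cells in a top-level `"cells"` list; older notebooks nest their cells under `"worksheets"` instead, so the keys you see for a given file may differ.

```python
import json
from collections import Counter

# A tiny hand-written notebook in the (assumed) nbformat 4 layout.
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "source": ["1 + 1"],
         "outputs": [], "execution_count": None},
        {"cell_type": "code", "metadata": {}, "source": ["2 + 2"],
         "outputs": [], "execution_count": None},
    ],
})

nb = json.loads(notebook_text)
cell_types = Counter(cell["cell_type"] for cell in nb["cells"])
code_sources = ["".join(cell["source"]) for cell in nb["cells"]
                if cell["cell_type"] == "code"]

print(sorted(nb.keys()))
print(cell_types)
```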

## XML

* The JSON example encoded in XML: <http://en.wikipedia.org/wiki/JSON#Samples>

```
<person>
  <firstName>John</firstName>
  <lastName>Smith</lastName>
  <age>25</age>
  <address>
    <streetAddress>21 2nd Street</streetAddress>
    <city>New York</city>
    <state>NY</state>
    <postalCode>10021</postalCode>
  </address>
  <phoneNumbers>
    <phoneNumber type="home">212 555-1234</phoneNumber>
    <phoneNumber type="fax">646 555-4567</phoneNumber>
  </phoneNumbers>
  <gender>
    <type>male</type>
  </gender>
</person>
```

## Alternative encoding in XML with attributes:
```
<person firstName="John" lastName="Smith" age="25">
  <address streetAddress="21 2nd Street" city="New York" state="NY" postalCode="10021" />
  <phoneNumbers>
     <phoneNumber type="home" number="212 555-1234"/>
     <phoneNumber type="fax"  number="646 555-4567"/>
  </phoneNumbers>
  <gender type="male"/>
</person>
```
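
The two encodings are read differently: element content is reached through the text of child elements, attribute content through the attributes of the element itself. A minimal sketch on shortened versions of both documents, using the standard library's `xml.etree.ElementTree` as a stand-in for `lxml` so that it runs without extra installs (Python 3 syntax):

```python
import xml.etree.ElementTree as ET

as_elements = ET.fromstring(
    "<person><firstName>John</firstName>"
    "<address><city>New York</city></address></person>"
)
as_attributes = ET.fromstring(
    '<person firstName="John"><address city="New York"/></person>'
)

# Element encoding: the value is the text of a child element.
name_1 = as_elements.findtext("firstName")
city_1 = as_elements.findtext("address/city")

# Attribute encoding: the value sits in the attributes of the element.
name_2 = as_attributes.get("firstName")
city_2 = as_attributes.find("address").get("city")

print(name_1 == name_2, city_1 == city_2)
```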

In [None]:
# See http://lxml.de/objectify.html
from lxml import etree
from lxml import objectify

url = "http://www.volkskrant.nl/cultuur/rss.xml"
rss = urllib2.urlopen(url)

# Now parse it to a tree

parsed = etree.parse(rss)
root = parsed.getroot()
root

In [None]:
# use XPath to select data from the XML document
root.xpath('//title/text()')[:6]

In [None]:
from IPython.display import HTML
HTML('<iframe width="850" height="700" scrolling="no" frameborder="no" src="http://www.volkskrant.nl/cultuur/rss.xml"></iframe>')
 

In [None]:
parsed.xpath('//item//text()')[:5]

In [None]:
items= parsed.xpath('//item')
first=items[0]
first.findall('*')

In [None]:
# Get the text in the description element of first
first.findtext('description')

In [None]:
# Get all the text of first (the first item in the RSS feed)
first.xpath('.//text() ')

### What is going on here?

* This does not look like the first item at all, does it?
* We get back all text nodes under the first item.
    * This **includes** all the formatting in the XML file.
    
### Printing it neatly:

In [None]:
print ''.join(first.xpath('.//text() '))
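
If the formatting text is not wanted at all, it can be filtered out. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `lxml`; its `itertext()` method plays the role of the `.//text()` XPath here (Python 3 syntax). The `item` element is a made-up stand-in for an RSS item.

```python
import xml.etree.ElementTree as ET

# Hypothetical, pretty-printed RSS item; the indentation is part of the document.
item = ET.fromstring("""<item>
  <title>Some title</title>
  <description>Some description</description>
</item>""")

all_text = list(item.itertext())          # includes the whitespace between elements
clean_text = [t.strip() for t in all_text if t.strip()]

print(all_text)
print(clean_text)
```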

In [None]:
# Now give only the titles of the items

titles=parsed.xpath('//item//title/text()')
 
titles[:5]

In [None]:
# get all urls
[ url.text for url in parsed.findall('//link')][:10]

In [None]:
# get all URLs inside the items (not the boilerplate URLs)
[ url.text for url in parsed.findall('//item/link')][:10]

In [None]:
# alternative:

parsed.xpath('//item/link/text()')[:5]

### Parsing XML with beautifulsoup

If you like beautifulsoup, you can also use it to parse XML of course.

You don't have the full XPath power, but often you do not need it.


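### Note
* With its default HTML parser, BeautifulSoup parses the `link` elements wrongly: in HTML, `<link>` is an empty element, so the URL text ends up outside the tag.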


In [None]:
from bs4 import BeautifulSoup

url = "http://www.volkskrant.nl/cultuur/rss.xml"
rss = urllib2.urlopen(url)
soup = BeautifulSoup(rss)

print soup.prettify()

In [None]:
# get text from the description of the   first item
soup.item.description.text

In [None]:
soup.findAll('title')[:5]

# HTML
* We use BeautifulSoup
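
A minimal sketch of the typical workflow: parse, then search by tag name and CSS class (Python 3 syntax). The HTML snippet, its class names and its URLs are made up and stand in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet standing in for a downloaded page.
html_doc = """
<html><head><title>Monumental trees</title></head>
<body>
  <ul>
    <li class="tree"><a href="/trees/1">Oak</a></li>
    <li class="tree"><a href="/trees/2">Beech</a></li>
    <li class="other">Not a tree</li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")   # parser named explicitly

title = soup.title.text
links = {li.a.text: li.a["href"] for li in soup.find_all("li", class_="tree")}

print(title)
print(links)
```

Naming the parser explicitly keeps the result the same across machines that have different parsers installed.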




# PDF
* We use `pdftotext` (only on mac and linux).
* Later in the course, we will see some Python solutions

In [None]:
from IPython.display import HTML
HTML('<iframe src=http://maartenmarx.nl/pub/HAN8168A06.0000.pdf width=700 height=350></iframe>')

In [None]:
!pdftotext

In [None]:
!curl http://maartenmarx.nl/pub/HAN8168A06.0000.pdf > test.pdf; pdftotext  test.pdf

In [None]:
!pdftotext -layout  test.pdf
! head test.txt

### pdftotext preserves reading order

* Note that it even removes hyphens! 
* There are some encoding issues

In [None]:
!pdftotext   test.pdf
!head -1 test.txt
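
Outside a notebook, `pdftotext` can be called from a Python script through `subprocess`. This is only a sketch: the helper names `pdftotext_command` and `pdf_to_text` are hypothetical, the call assumes `pdftotext` is installed and on the path, and passing `-enc UTF-8` is an assumption meant to sidestep the encoding issues mentioned above (Python 3 syntax).

```python
import subprocess

def pdftotext_command(pdf_path, layout=False):
    """Build the pdftotext command line; '-' sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd

def pdf_to_text(pdf_path, layout=False):
    """Run pdftotext and return its output as a string."""
    raw = subprocess.check_output(pdftotext_command(pdf_path, layout))
    return raw.decode("utf-8", errors="replace")

print(pdftotext_command("test.pdf", layout=True))
# text = pdf_to_text("test.pdf")   # uncomment where pdftotext and test.pdf exist
```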

# Reading gzipped file line by line

In [50]:
# Get a file 
!wget http://web.informatik.uni-mannheim.de/DBpediaAsTables/csv/Brewery.csv.gz

--2016-07-15 14:37:38--  http://web.informatik.uni-mannheim.de/DBpediaAsTables/csv/Brewery.csv.gz
Resolving web.informatik.uni-mannheim.de... 134.155.95.98
Connecting to web.informatik.uni-mannheim.de|134.155.95.98|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102530 (100K) [application/x-gzip]
Saving to: `Brewery.csv.gz.3'


2016-07-15 14:37:38 (1.09 MB/s) - `Brewery.csv.gz.3' saved [102530/102530]



In [58]:
# inspect it
!gunzip --stdout  Brewery.csv.gz |wc -l  # how many lines unzipped
!ls -lh Brewery.csv.gz;    # how large zipped

     368
-rw-r--r--  1 admin  staff   100K Sep 24  2014 Brewery.csv.gz


In [59]:
import gzip

with gzip.open('Brewery.csv.gz','r') as fin:
    c=0
    for line in fin:
        print 'Regel',c,':', line[100:150]
        c+=1

Regel 0 : yPerson","location_label","location","locationCity
Regel 1 : ttp://dbpedia.org/ontology/foundingYear","http://d
Regel 2 : ing","Person","XMLSchema#string","Place","XMLSchem
Regel 3 : org/2000/01/rdf-schema#Literal","http://www.w3.org
Regel 4 : y is a restaurant and brewery located in the South
Regel 5 : ery located in city of Bath England. It was founde
Regel 6 : ny is a craft brewery located in Abita Springs Lou
Regel 7 : rewing company founded by the original Fort Garry 
Regel 8 : edish microbrewery located in Ale Västra Götalan
Regel 9 : pany a regional craft brewery located in Juneau Al
Regel 10 : Company is an American craft brewery founded in 19
Regel 11 : 's is a Canadian brewery founded in 1820 in Halifa
Regel 12 : -owned independent brewery located in Schinnen the
Regel 13 : merican brewery founded in 1999 by Pat Mcilhenney 
Regel 14 :  is a brewing company founded in Rimini in Emilia 
Regel 15 :  Inc. is a microbrewery located in Edmonton Albert
Regel 16 : rewing Co
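
The cell above is Python 2. In Python 3 the default mode of `gzip.open` yields bytes, so a text mode is needed to get decoded lines. A self-contained sketch that writes and then reads a small made-up gzipped file (the file name `tiny.csv.gz` and its content are assumptions):

```python
import gzip
import os
import tempfile

# Write a small, made-up gzipped CSV so the example runs without a download.
path = os.path.join(tempfile.mkdtemp(), "tiny.csv.gz")
with gzip.open(path, "wt", encoding="utf-8") as fout:
    fout.write(u"name,city\nAle brewery,Västra Götaland\n")

# 'rt' asks for decoded text lines, read one at a time.
with gzip.open(path, "rt", encoding="utf-8") as fin:
    lines = [line.rstrip("\n") for line in fin]

print(len(lines), lines[1])
```

Working on decoded text also means that slicing a line cannot cut a multi-byte character in half.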

# Your turn

* Find the longest line in the file 
* Count how often the word "brewery" occurs in the file
    * use regular expressions: `import re`
    * in combination with `re.findall`
    * and use `.TAB` to explore your objects
* [Answer](#longestline)



## Exercises

### 1 Wikipedia
* From a wikipedia page, extract "translations of that page" in another language. 
* Return as a dict of the form `language:url`
* Use <http://en.wikipedia.org/wiki/Conservative_Party_of_Canada>

### 2 Comments harvesting
* Collect all comments from a page that allows comments
* Make sure you collect relevant metadata too
* Use <https://decorrespondent.nl/2324/Zo-bewijs-je-Charlie-Hebdo-misschien-wel-de-meeste-eer/38371206104-46f2aef3>
*   To see the comments, take it from our local copy at <http://maartenmarx.nl/teaching/DataScience/Data/Zo%20bewijs%20je%20Charlie%20Hebdo%20misschien%20wel%20de%20meeste%20eer.html>
* Store the comments as a list of tuples (article title, comment id, name of commenter, text)
* Turn that into a dataframe
 
## Hint
* use regular expressions: `import re`
* in combination with `soup.findAll`
* and use `.TAB` to explore your objects

# Answers

# 1 Wikipedia
* From a wikipedia page, extract "translations of that page" in another language. 
* Return as a dict of the form `language:url`
* Use <http://en.wikipedia.org/wiki/Conservative_Party_of_Canada>


In [None]:
# See https://docs.python.org/2/howto/urllib2.html
import urllib2
from bs4 import BeautifulSoup
import re 

url="http://en.wikipedia.org/wiki/Conservative_Party_of_Canada"
html_doc = urllib2.urlopen(url)

soup = BeautifulSoup(html_doc)



###  Now try to find the right expression to get what we want
* View the page source of the wiki page and look for what uniquely identifies the elements you want
* Use **inspect element** (control-click)

In [None]:


# This raises a SyntaxError: class is a reserved word in Python (see the next cell for the fix)
soup.findAll('li', class=re.compile("^interlanguage-link"))

In [None]:
# See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
soup.findAll("li", class_=re.compile("^interlanguage-link"))[:5]

In [None]:
# Now take out the content
# First try out with the first element
a=soup.findAll("li", class_=re.compile("interlanguage-link"))[0].a
a

In [None]:
# Use TAB to see what methods 'a' has, 
# mmmm, the first looks good 'attrs'
a.attrs

In [None]:
# Got it!
# Now we are ready for the dict comprehension

li = soup.findAll("li", class_=re.compile("interlanguage-link"))
OurDict = {l.a.attrs['lang']:l.a.attrs['href']  for l in li}
OurDict

In [None]:
# repair
OurDict= {key:'http:'+OurDict[key] for key in OurDict}
OurDict.items()[:2]
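
Instead of prepending `'http:'` by hand, the links can be resolved against the URL of the page they came from. A sketch with `urljoin` (Python 3 syntax; in Python 2 the function lives in the `urlparse` module). The hrefs below are made up, in the protocol-relative form that the repair step above suggests.

```python
from urllib.parse import urljoin

page_url = "http://en.wikipedia.org/wiki/Conservative_Party_of_Canada"

# Made-up hrefs that start with '//' and so lack a scheme.
hrefs = {"fr": "//fr.wikipedia.org/wiki/Exemple",
         "nl": "//nl.wikipedia.org/wiki/Voorbeeld"}

absolute = {lang: urljoin(page_url, href) for lang, href in hrefs.items()}
print(absolute["fr"])
```

`urljoin` also handles hrefs that are already absolute or that are relative paths, which string concatenation does not.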

## The whole script for the exercise

In [None]:
import urllib2
from bs4 import BeautifulSoup
import re 

# Download and Open the webpage
url="http://en.wikipedia.org/wiki/Conservative_Party_of_Canada"
html_doc = urllib2.urlopen(url)

# Parse it
soup = BeautifulSoup(html_doc)

# Get the wanted li elements
li = soup.findAll("li", class_=re.compile("interlanguage-link"))
# Take out what we want from them: 
# the href and lang attributes of the a children
OurDict = {l.a.attrs['lang']:l.a.attrs['href']  for l in li}
OurDict= {key:'http:'+OurDict[key] for key in OurDict}
OurDict

# 2 Comments harvesting
* Collect all comments from a page that allows comments
* Make sure you collect relevant metadata too
* Use <https://decorrespondent.nl/2324/Zo-bewijs-je-Charlie-Hebdo-misschien-wel-de-meeste-eer/38371206104-46f2aef3>
* If link does not work, take it from our local copy at <http://maartenmarx.nl/teaching/DataScience/Data/Zo%20bewijs%20je%20Charlie%20Hebdo%20misschien%20wel%20de%20meeste%20eer.html>
* Store the comments as a list of tuples (article title, comment id, name of commenter, text)

In [None]:
url = "https://decorrespondent.nl/2324/Zo-bewijs-je-Charlie-Hebdo-misschien-wel-de-meeste-eer/38371206104-46f2aef3"
html_doc = urllib2.urlopen(url)

# Parse it
soup = BeautifulSoup(html_doc)

# This gives a problem because the comments are not visible for non members
# So we downloaded the file while being logged in
local = '../Data/Zo bewijs je Charlie Hebdo misschien wel de meeste eer.html'
html_doc = open(local).read()

# Parse it
soup = BeautifulSoup(html_doc)

In [None]:
# Get the wanted li elements
comments = soup.findAll('li', class_="comment")

# Take out what we want.
# TODO: name, expertise, url of the piece that they comment on
commenttext=[c.p.text for c in comments]
commenttext[:3]
 

## Look at the structure of one comment

In [None]:
firstcomment= comments[0]
firstcomment
print firstcomment.prettify()

# Put it all together

In [None]:
comment_list=[(soup.title.text,
           c.attrs['data-id'],
           c.find('span', class_='user-name').text ,  
           c.p.text) 
          for c in comments]

In [None]:
import pandas as pd

comment_df = pd.DataFrame(comment_list)
comment_df.columns = ['Artikel', 'comment_id', 'reaguurder', 'reactie']
comment_df.head()

# The complete script

In [None]:
import pandas as pd
from bs4 import BeautifulSoup

# step 1: get the data and parse it
local = '../Data/Zo bewijs je Charlie Hebdo misschien wel de meeste eer.html'
html_doc = open(local).read()
soup = BeautifulSoup(html_doc)
# step 2: extract the comments
comments = soup.findAll('li', class_="comment")
# step 3: extract what we need from the comments
comment_list = [(soup.title.text,
                 c.attrs['data-id'],
                 c.find('span', class_='user-name').text,
                 c.p.text)
                for c in comments]
# step 4: turn it into a dataframe
comment_df = pd.DataFrame(comment_list)
comment_df.columns = ['Artikel', 'comment_id', 'reaguurder', 'reactie']
comment_df.head()

# write to Excel
comment_df.to_excel('comments.xls')
# write to csv
comment_df.to_csv('comments.csv', encoding='utf-8')

In [1]:
# check whether the csv looks right
!head -5 'comments.csv'

Artikel,comment_id,reaguurder,reactie
Zo bewijs je Charlie Hebdo misschien wel de meeste eer,90863,Roy van de Ven,"Bij al die marsen vraag ik me af of die mensen nu niet demonstreren voor het recht om te beledigen onder het mom van 'vrijheid van meningsuiting'. Ik denk niet dat we door steeds dezelfde beledigende cartoons af te drukken, tot een dialoog komen en zo worden de ""radicale Moslims"" alleen maar in de kaart gespeeld."
Zo bewijs je Charlie Hebdo misschien wel de meeste eer,90841,Aksel de Vries,"Deze column beangstigt me. Er wordt gesproken over eer bewijzen aan Charlie Hebdo. Eén zin... en vervolgens trekt Rutger zijn eigen bivakmuts op, ratelt hij zijn eigen verhaal, beweert ie dat zijn visie de enige echte is en dat er niets anders belangrijker is dan zijn idealen, zelfs de moord op zijn collega-journalisten is irrelevant. Sorry, dat is geen eer bewijzen, dat is dansen op andermans graf.Als we eer willen bewijzen, laten we satirisch zijn, dan tekenen we een spotprent van 

### Now without the index column

In [None]:
comment_df.to_csv('comments.csv', encoding='utf-8',index=False )
# check whether the csv looks right
!head -5 'comments.csv'

# [Longestline Answer ](id:longestline)

In [62]:
import gzip

with gzip.open('Brewery.csv.gz','r') as fin:
    maxline = ''
    for line in fin:
        if len(line) > len(maxline):
            maxline=line
    
print len(maxline), maxline[:150]

3670 "http://dbpedia.org/resource/United_Breweries_Group","United Breweries Group","United Breweries Group or UB Group is an Indian conglomerate company he


In [78]:
import re
with gzip.open('Brewery.csv.gz','r') as fin:
    brewerycount=0
    for line in fin:
        hits= re.findall('brewery',line.lower())
        #print len(hits)
        brewerycount+=len(hits)
print brewerycount
    

2515
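
Note that a plain substring pattern also counts hits inside longer words such as "microbrewery". If only the separate word should count, a word-boundary pattern is one option. A small sketch on a made-up sentence (Python 3 syntax):

```python
import re

text = u"A microbrewery is a small brewery. Brewery tours visit several breweries."

# Substring match on lowercased text, as in the answer above.
substring_hits = re.findall("brewery", text.lower())

# Whole-word match, case-insensitive.
whole_word_hits = re.findall(r"\bbrewery\b", text, flags=re.IGNORECASE)

print(len(substring_hits), len(whole_word_hits))
```

Which count is the right one depends on what "the word brewery" should mean for the question at hand.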
