# Scraping traitors

### The 122 MPs who voted against Article 50
Source: Guido: https://order-order.com/2017/02/08/named-122-mps-voted-brexit/

Output:

```
MP,Party,Constituency
Ms Tasmina Ahmed-Sheikh ,SNP,Ochil and South Perthshire
Heidi Alexander ,Labour,Lewisham East
Rushanara Ali ,Labour,Bethnal Green and Bow
Mr Graham Allen ,Labour,Nottingham North
```

We start by importing our modules.

* `BeautifulSoup` is our scraper and parser
* `requests` is to fetch HTML content from the internets
* `csv` will help us parse and write a CSV file at the end

In [2]:
from bs4 import BeautifulSoup
import requests
import csv
import os
import sys  # needed for sys.exit() if the request fails

In [3]:
# We store the URL of the content we'll want to scrape in a variable
url = 'https://order-order.com/2017/02/08/named-122-mps-voted-brexit/'

Let's check that our request went through fine.
We're expecting a status code of 200, and our check that the status code equals 200 should print `True`.

In [4]:
# GET request for the URL content
response = requests.get(url)

# the response has methods and attributes, such as its status code
# docs: http://docs.python-requests.org/en/master/
print("Status code: (200 is good):", response.status_code)

# compare with ==, not `is`: we want equal values, not the same object
if response.status_code == 200:
    print("Is our status code 200?")
    print(True)
else:
    print("ERROR! aborting")
    sys.exit()

Status code: (200 is good): 200
Is our status code 200?
True
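
As an aside, `requests` can do this check for us: `raise_for_status()` raises `requests.HTTPError` for any 4xx or 5xx response. A minimal sketch, where `fetch_html` is a hypothetical helper name (not part of the original script) and the 10-second timeout is an arbitrary choice:

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML bytes at `url`, or raise if the request failed.

    `fetch_html` and the default timeout are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError on 4xx/5xx status codes; does nothing otherwise
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` would then either hand back the page's bytes or stop with an exception that names the failing status code.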


We then parse the HTML document by passing it to the `BeautifulSoup` parser.
According to the [documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/):

> Beautiful Soup gives us a BeautifulSoup object, which represents the document as a nested data structure

In [6]:
# raw HTML of the page, as bytes
html = response.content

# parse it into a BeautifulSoup object, using the lxml parser
soup = BeautifulSoup(html, 'lxml')

Much like we did last week, we can now navigate this nested, tree-like structure.

In this instance, Guido's list is made of two (sadly) `<blockquote>` elements that contain the lists of MPs. Let's grab those...


In [80]:
# find every <blockquote> element on the page
# and store the resulting list in a variable
blockquotes = soup.find_all('blockquote')
print(len(blockquotes), 'blockquote elements found')

2 blockquote elements found


### The nitty-gritty bit

We now have `blockquotes`, a list, containing two elements.

Each of these contains a set of `<em>` elements, each of which holds an MP's name, their party, and their constituency.

We are going to iterate over all of those, in order:

* For each blockquote in the list,
    * Find all `<em>` elements
    * Do things with them
    
We need two `for` loops, one inside the other.
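
To see the two loops in action without hitting the network, here is a small sketch on a made-up HTML snippet. The markup and the names in it are invented for illustration and only assume the shape described above (`<em>` elements inside `<blockquote>` elements). It uses Python's built-in `html.parser` so it does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Invented sample markup: two blockquotes, each holding <em> lines
sample_html = """
<blockquote><p><em>Jane Doe (Party A – Townsville)</em><br/>
<em>John Roe (Party B – Villageton)</em></p></blockquote>
<blockquote><p><em>Ann Other (Party A – Cityford)</em></p></blockquote>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

lines = []
# outer loop: one pass per <blockquote>
for blockquote in sample_soup.find_all('blockquote'):
    # inner loop: one pass per <em> inside that blockquote
    for em in blockquote.find_all('em'):
        lines.append(em.text)

print(lines)
```

The result is one flat list of three strings, in document order, even though they came from two separate blockquotes.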


In [81]:
# empty list
# [row one, row two, row three, etc.]
list_of_rows = []

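# control flow: two nested for loops
# outer loop: each blockquote; inner loop: each <em> element inside it
# (the rows get appended to list_of_rows further down)
for blockquote in blockquotes:

    for row in blockquote.find_all('em'):

        # text of the <em> element, as bytes for now (decoded again before splitting)
        row = row.text.encode('utf-8')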
        

If we print `row` at this stage, we'll see each line from the blog post.

We are going to split these strings, because they're all formatted the same way.

With the `split()` method, we pass the character (or string) to split on, and Python gives us back a list of the pieces.

When we split at the opening bracket, we get the MP's name first (remember, the first item is `[0]`), then the party and constituency, which we need to split again on the dash (an en dash, ` – `, in most lines).
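
A quick illustration of the two splits on a single line. The sample string is an assumption about how a line looks in the post, reconstructed from the output at the top of this page:

```python
# Assumed format of one line: "Name (Party – Constituency)"
sample = 'Heidi Alexander (Labour – Lewisham East)'

# First split, on the opening bracket
parts = sample.split('(')
print(parts)         # ['Heidi Alexander ', 'Labour – Lewisham East)']

# Second split, on the en dash surrounded by spaces
details = parts[1].split(' – ')
print(details)       # ['Labour', 'Lewisham East)']

name = parts[0]
party = details[0]
constituency = details[1][:-1]   # drop the closing bracket
print(name, party, constituency, sep='|')
```

Note the trailing space left on the name, which is why the output at the top shows `Heidi Alexander ,Labour`.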

In [88]:
        split_row = row.decode('utf-8').split('(')

        # MP's name: everything before the first bracket
        mps = split_row[0]
        # party: the part before the en dash
        party = split_row[1].split(' – ')[0]

        # Gotcha: sometimes the row splits into three pieces: see Labour (Co-op)
        if len(split_row) > 2:
            party = 'Labour (Co-op)'
            constituency = split_row[2].replace("Co-op) – ", "")
        else:
            #print(split_row[1].split(' – '))
            try:
                constituency = split_row[1].split(' – ')[1]
                # drop the closing bracket
                constituency = constituency[:-1]
            except IndexError:
                # no en dash found: fall back to a plain hyphen
                constituency = split_row[1].split('-')[1]
                constituency = constituency[:-1]
                party = split_row[1].split('-')[0]
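
For testing the edge cases away from the loop, the same logic can be wrapped in a function. This is only a sketch: `parse_line` is a hypothetical name, and the sample strings are assumptions about the post's formatting, not lines copied from it. It also strips a trailing bracket in the Co-op case, which is likewise an assumption about the format.

```python
def parse_line(line):
    """Split 'Name (Party – Constituency)' into its three parts.

    Hypothetical helper; the formats handled here are assumptions
    based on the cases the loop above deals with.
    """
    split_row = line.split('(')
    name = split_row[0]

    if len(split_row) > 2:
        # e.g. 'Name (Labour (Co-op) – Constituency)': the party itself
        # contains a bracket, so we get three pieces instead of two
        party = 'Labour (Co-op)'
        constituency = split_row[2].replace('Co-op) – ', '')
    else:
        details = split_row[1]
        # Prefer the en dash; fall back to a plain hyphen
        separator = ' – ' if ' – ' in details else '-'
        party, constituency = details.split(separator, 1)

    # Assumption: every line ends with a closing bracket
    return name, party, constituency.rstrip(')')


# Invented sample lines covering the three cases
for sample in ['Heidi Alexander (Labour – Lewisham East)',
               'Jane Doe (Labour (Co-op) – Townsville)',
               'John Roe (SNP-Glasgow North)']:
    print(parse_line(sample))
```

A line with neither kind of dash would raise a `ValueError` here, which makes a badly formatted row easy to spot.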


We created three variables: `mps`, `party`, and `constituency`.

Our big list that will contain everything is `list_of_rows`, which we created outside of the `for` loops.

Because we want to add a complete row (name, party, constituency) for each MP to this list of rows, we create a temporary empty list and append each of these components individually, before appending the whole row to the master list.

In [84]:
        # build our row
        list_of_cells = []
        list_of_cells.append(mps)
        list_of_cells.append(party)
        list_of_cells.append(constituency)
        
        # print(list_of_cells)
        list_of_rows.append(list_of_cells)

We want our script to output a CSV file. In Python world, we need to instantiate a CSV writer.

The writer's `writerow` method takes in a list, eg `[1,2,3]`, and writes it as one line.

Its `writerows` method takes a list of lists and writes each inner list on a separate line.

We start by writing the column names, and then we write our big list of data, row by row, thanks to `writerows`.

In [89]:
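# newline='' is what the csv docs recommend, so the writer controls line endings itself
with open('data.csv', 'w', newline='', encoding='utf-8') as outfile: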
    writer = csv.writer(outfile)
    writer.writerow(['MP', 'Party', 'Constituency'])
    writer.writerows(list_of_rows)
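
To see the difference between `writerow` and `writerows` without touching the disk, here is a small sketch that writes to an in-memory buffer instead of a file. The two data rows are taken from the sample output at the top; `io.StringIO` is used here purely for demonstration.

```python
import csv
import io

rows = [
    ['Heidi Alexander ', 'Labour', 'Lewisham East'],
    ['Rushanara Ali ', 'Labour', 'Bethnal Green and Bow'],
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['MP', 'Party', 'Constituency'])  # one list -> one line
writer.writerows(rows)                            # list of lists -> one line each

print(buffer.getvalue())
```

The printed text has the header on the first line, followed by one comma-separated line per MP, in the same shape as the output shown at the top.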