# Scraping MP info

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Connecting to Source
Using requests, I verify that the link works and begin parsing the page with Python's built-in HTML parser.

In [2]:
response = requests.get('https://www.ourcommons.ca/Members/en/search')
print(response)

# begin parsing (while you read :])
soup = BeautifulSoup(response.content, 'html.parser')

<Response [200]>
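Printing the response is enough for a quick look, but a stricter check can be useful in a script that runs unattended. The following is a minimal sketch, not part of the original notebook: `fetch_page` is a hypothetical helper name, and the 10-second timeout is an arbitrary assumption. The second half builds a response object by hand so the behaviour of `raise_for_status()` can be seen without touching the network.

```python
import requests


def fetch_page(url: str, timeout: float = 10.0) -> requests.Response:
    """Hypothetical helper: GET a URL and raise on any 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on failure
    return response


# Offline illustration: a hand-built response with a failing status code
fake_response = requests.Response()
fake_response.status_code = 404

try:
    fake_response.raise_for_status()
    error_message = None
except requests.HTTPError as exc:
    error_message = str(exc)

print(error_message)
```

In the notebook itself, `fetch_page('https://www.ourcommons.ca/Members/en/search')` could stand in for the bare `requests.get` call, assuming the same URL.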


A status of 200 means that the connection was successful. Using the F12 developer tools in Firefox, I can begin to identify how the information is stored. Immediately, I can see that the structure is a table where each MP has a cell.

Navigating the HTML structure is possible using Beautiful Soup alone, but it is much more tedious, and since we have a browser at our disposal, why not use it?
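For comparison, here is a rough sketch of what exploring the structure with Beautiful Soup alone might look like: it tallies every CSS class that appears in a parsed fragment. The `sample_html` string is a hand-written, hypothetical stand-in for one MP tile, not a copy of the real page.

```python
from collections import Counter

from bs4 import BeautifulSoup

# Hypothetical fragment that mimics one MP tile (not copied from the site)
sample_html = """
<div class="ce-mip-mp-tile-container">
  <div class="ce-mip-mp-picture-container">
    <img class="ce-mip-mp-picture" src="/Content/example.jpg">
  </div>
  <div class="ce-mip-tile-top">
    <div class="ce-mip-mp-honourable"></div>
    <div class="ce-mip-mp-name">Example Member</div>
    <div class="ce-mip-mp-party">Example Party</div>
  </div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Count how often each class name occurs across all tags that have a class
class_counts = Counter(
    class_name
    for tag in sample_soup.find_all(class_=True)
    for class_name in tag.get('class', [])
)

for class_name, count in class_counts.most_common():
    print(f"{count:>3}  {class_name}")
```

This works, but on a full page the list gets long quickly, which is why the browser inspector is the more comfortable tool here.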

## Class Breakdown
I begin by identifying all of the containers that I'd like to scrape and writing them down.

<hr>

**ce-mip-picture-container**

    Picture URL - stored within picture container div as class "ce-mip-mp-picture"

**ce-mip-tile-top**

    Name - stored as class "ce-mip-mp-name"
    Honourable Status - all contain this class, though not all have text, class "ce-mip-mp-honourable" (null/notnull)
    Party - "ce-mip-mp-party"

**ce-mip-tile-bottom**

    Constituency - "ce-mip-mp-constituency"
    Province - "ce-mip-mp-province"
<hr>
Since all of the classes follow the pattern "ce-mip-mp-{element}", I'll use that to simplify the handling, formatting each element name into its class name when I retrieve it. This is not necessary, but it cleans up the code and makes it easier to modify.
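As a small illustration of the naming pattern, the class names can be built from the element names with an f-string. The `class_names` dictionary is only for demonstration; the notebook formats each name inline inside the loop instead.

```python
element_list = ['name', 'picture', 'honourable', 'party', 'constituency', 'province']

# Map each element name to the CSS class it is expected to live under
class_names = {element: f'ce-mip-mp-{element}' for element in element_list}

for element, class_name in class_names.items():
    print(f"{element:<12} -> {class_name}")
```

Adding a new field then only requires appending its name to `element_list`, provided the page follows the same pattern for that field.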

In [3]:
element_list = [
    'name',
    'picture',
    'honourable',
    'party',
    'constituency',
    'province'
]

# Find all containers for MPs (adjust the class if needed)
mp_containers = soup.select('div.ce-mip-mp-tile-container')
print(f"Found {len(mp_containers)} MP containers.")

Found 334 MP containers.


<hr>
With the individual MP containers identified, I can begin looping through each one, initializing a new dictionary and filling it with the information that I am looking for. Note that there are two special cases:
1. **Picture** - requires retrieving the 'src' attribute, which holds the path suffix for the image link.
2. **Honourable** - I wanted a boolean indicating whether or not the member is Honourable, as opposed to simply including the text itself each time. A Member of Parliament who is Honourable will be marked as True.

In [4]:
# Empty list that will host all mp dicts
all_mps = []

for mp in mp_containers:
    # individual mp dict
    mp_info = {}

    for element in element_list:
        # Why not if/elif/else? I think match/case (Python 3.10+) just looks neater; this is a code aesthetic choice.
        match element:
        
            case 'picture':
                # Default to 'N/A' so the key exists even if the container or image is missing
                mp_info['image_link'] = 'N/A'
                picture_container = mp.find('div', class_='ce-mip-mp-picture-container')
                if picture_container:
                    img_tag = picture_container.find('img')
                    # The src attribute is a site-relative path, so prepend the site root
                    if img_tag and 'src' in img_tag.attrs:
                        mp_info['image_link'] = "https://www.ourcommons.ca" + img_tag['src']
            
            case 'honourable':
                # True only when the tag exists and contains non-empty text
                honourable_tag = mp.find(class_='ce-mip-mp-honourable')
                if honourable_tag and honourable_tag.text.strip():
                    mp_info['honourable'] = True  # title text present
                else:
                    mp_info['honourable'] = False  # tag missing or empty
            
            case _: # everything else 
                class_name = f'ce-mip-mp-{element}'
                found_element = mp.find(class_=class_name)

                # error handling
                if found_element:
                    mp_info[element] = found_element.text.strip()
                else:
                    mp_info[element] = 'N/A'  # Use 'N/A' if the element is missing

    all_mps.append(mp_info)

# print(all_mps)
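The two special cases can also be checked in isolation on small hand-written fragments. This is a sketch under assumptions: both tiles below are hypothetical, `image_link` and `is_honourable` are illustrative helper names, and `urljoin` is swapped in for the plain string concatenation used in the loop.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://www.ourcommons.ca'

# Hypothetical tiles: one with a picture and a title, one with neither
tile_with_title = BeautifulSoup(
    '<div><img src="/Content/example.jpg">'
    '<div class="ce-mip-mp-honourable">The Hon.</div></div>',
    'html.parser',
)
tile_without_title = BeautifulSoup(
    '<div><div class="ce-mip-mp-honourable"></div></div>',
    'html.parser',
)


def image_link(tile):
    """Return the absolute image URL, or 'N/A' when no usable <img> exists."""
    img_tag = tile.find('img')
    if img_tag is None or not img_tag.get('src'):
        return 'N/A'
    return urljoin(BASE_URL, img_tag['src'])


def is_honourable(tile):
    """Return True only when the honourable tag exists and has text."""
    tag = tile.find(class_='ce-mip-mp-honourable')
    return tag is not None and bool(tag.text.strip())


print(image_link(tile_with_title), is_honourable(tile_with_title))
print(image_link(tile_without_title), is_honourable(tile_without_title))
```

Pulling the logic into small functions like these makes it easy to test edge cases without requesting the live page.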


In [None]:
df = pd.DataFrame(all_mps)
# incrementing index because starting with 1 makes more sense here.
df.index = df.index + 1
print(df.head())

# output as csv with
# df.to_csv("mp_data_indexed.csv", index_label="Index")

              name                                         image_link  \
1   Ziad Aboultaif  https://www.ourcommons.ca/Content/Parliamentar...   
2  Scott Aitchison  https://www.ourcommons.ca/Content/Parliamentar...   
3        Dan Albas  https://www.ourcommons.ca/Content/Parliamentar...   
4    Omar Alghabra  https://www.ourcommons.ca/Content/Parliamentar...   
5      Shafqat Ali  https://www.ourcommons.ca/Content/Parliamentar...   

   honourable         party                         constituency  \
1       False  Conservative                     Edmonton Manning   
2       False  Conservative                  Parry Sound—Muskoka   
3       False  Conservative  Central Okanagan—Similkameen—Nicola   
4        True       Liberal                   Mississauga Centre   
5       False       Liberal                      Brampton Centre   

           province  
1           Alberta  
2           Ontario  
3  British Columbia  
4           Ontario  
5           Ontario  
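Because missing fields are stored as the string 'N/A', it can be worth counting how many placeholders ended up in the table. The sketch below uses a few hypothetical rows rather than the scraped data; `sample_df`, `missing_per_column`, and `incomplete_rows` are illustrative names.

```python
import pandas as pd

# Hypothetical rows shaped like the scraped records
sample_rows = [
    {'name': 'Member A', 'honourable': False, 'party': 'Party X', 'province': 'Ontario'},
    {'name': 'Member B', 'honourable': True, 'party': 'N/A', 'province': 'Alberta'},
    {'name': 'Member C', 'honourable': False, 'party': 'Party X', 'province': 'N/A'},
]

sample_df = pd.DataFrame(sample_rows)
sample_df.index = sample_df.index + 1  # 1-based index, as in the notebook

# Boolean mask marking every cell that holds the 'N/A' placeholder
is_placeholder = sample_df.isin(['N/A'])

# Number of placeholders in each column
missing_per_column = is_placeholder.sum()
print(missing_per_column)

# Rows that contain at least one placeholder
incomplete_rows = sample_df[is_placeholder.any(axis=1)]
print(incomplete_rows)
```

Running the same two checks on `df` would show whether any MP tile was missing an expected element.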


Runtime Data:
3 seconds
