# Tasks

In this part of the demo, we will extract information from two websites:

- https://en.wikipedia.org/wiki/International_court
- https://members.parliament.uk/members/commons


# Load packages

In [None]:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd

# International court 

- Scenario 1

## Extract the first table using `pandas`

In [None]:
df_tables = pd.read_html("https://en.wikipedia.org/wiki/International_court")

In [None]:
len(df_tables)

In [None]:
df_ic = df_tables[0]

In [None]:
df_ic

## Modifying text

We will extract the founding year from the `Years active` column.

### Simple method

In [None]:
# Take the first four characters of 'Years active' and convert them to an integer
df_ic['Founded'] = df_ic['Years active'].str.slice(0,4).astype('int')

### Using regular expressions

Regular expressions are a powerful tool for extracting and modifying text. Several good introductions are available:

1. LinkedIn Learning (NLP with Python for Machine Learning Essential Training)
  - https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/what-are-regular-expressions
  - https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/learning-how-to-use-regular-expressions
  - https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/regular-expression-replacements

2. YouTube
  - https://www.youtube.com/watch?v=K8L6KVGG-7o
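
The pattern used in the next cells can be tried out with Python's built-in `re` module before applying it to the whole column. This is a minimal sketch; the sample strings are made up for illustration and are not taken from the Wikipedia table.

```python
import re

# Hypothetical sample values resembling a "Years active" column
samples = ["1945–present", "1993–2017", "2002–present"]

# ^ anchors the match at the start of the string; (\d{4}) captures four digits
pattern = re.compile(r"^(\d{4})")

founded_years = []
for text in samples:
    match = pattern.match(text)
    # match is None when the string does not start with four digits
    founded_years.append(int(match.group(1)) if match else None)

print(founded_years)
```

Because the pattern is anchored with `^`, a value such as `"c. 1945"` would not match at all, which is worth keeping in mind when the column contains irregular entries.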

In [None]:
# Preview: group 1 captures the four-digit year, group 2 the rest of the string
df_ic['Years active'].str.extract(r'^(\d{4})(.+)')

In [None]:
# expand=False returns a Series instead of a one-column DataFrame
df_ic['Founded'] = df_ic['Years active'].str.extract(r'^(\d{4})', expand=False).astype(int)
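
`astype(int)` raises an error if any row fails to match the pattern, because unmatched rows become missing values. One possible way to tolerate such rows, sketched here on a made-up Series rather than on `df_ic`, is to convert with `pd.to_numeric` and a nullable integer type.

```python
import pandas as pd

# Hypothetical values; the last one does not start with a four-digit year
years_active = pd.Series(["1945–present", "2002–present", "since the 1990s"])

# expand=False returns a Series (not a one-column DataFrame) for a single group
year_text = years_active.str.extract(r"^(\d{4})", expand=False)

# errors="coerce" turns unparseable entries into NaN; "Int64" is pandas'
# nullable integer dtype, so missing values are kept as <NA>
founded = pd.to_numeric(year_text, errors="coerce").astype("Int64")

print(founded)
```

Whether missing years should be kept, dropped, or filled in by hand depends on the analysis that follows.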

In [None]:
df_ic

## Save the data

In [None]:
df_ic.to_csv("data_international_court.csv")

# List of MEPs

In this part of the demo, we will create a list of the Members of the European Parliament (MEPs).

The base URL (the list of MEPs whose family names start with the letter 'a') is:
https://www.europarl.europa.eu/meps/en/full-list/a
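
The same steps could be repeated for the other initial letters. The sketch below only builds the candidate URLs; it assumes, without having checked every page, that the other letters follow the same URL pattern as 'a'. The names `base_url` and `letter_urls` are introduced here for illustration.

```python
from string import ascii_lowercase

# Assumption: every letter uses the same URL pattern as the page for 'a'
base_url = "https://www.europarl.europa.eu/meps/en/full-list/"

# One URL per initial letter, keyed by the letter
letter_urls = {letter: base_url + letter for letter in ascii_lowercase}

print(letter_urls["a"])
```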

## Extract names

- Scenario 2

In [None]:
url = "https://www.europarl.europa.eu/meps/en/full-list/a"
html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
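
Some servers reject requests that lack a browser-like `User-Agent` header. If `urlopen(url)` fails with an HTTP error, one option is to wrap the URL in the `Request` class imported earlier. This is a sketch: the header value is an arbitrary placeholder, and the request is only constructed here, not sent.

```python
from urllib.request import Request

url = "https://www.europarl.europa.eu/meps/en/full-list/a"

# Placeholder User-Agent string; any descriptive value could be used instead
req = Request(url, headers={"User-Agent": "Mozilla/5.0 (demo scraper)"})

# Request stores header names in capitalized form ("User-agent")
print(req.get_header("User-agent"))

# To fetch the page, pass req to urlopen in place of the plain URL:
# html = urlopen(req)
```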

In [None]:
name_tags = bs.select('#docMembersList .t-item')
mep_name = [item.get_text().strip() for item in name_tags]

In [None]:
mep_name

## Extract political groups and country names

In [None]:
group_and_country = [item.get_text().strip() for item in bs.select('.sln-additional-info.mb-25')]
group_and_country

In [None]:
len(group_and_country)

In [None]:
# Entries alternate between group and country: even positions hold the group
mep_group = [text for i, text in enumerate(group_and_country) if i % 2 == 0]
mep_group

In [None]:
# Odd positions hold the country
country = [text for i, text in enumerate(group_and_country) if i % 2 == 1]
country
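
The two comprehensions above split an interleaved list by index parity. An equivalent and shorter form uses slicing with a step of 2. The sketch below uses made-up placeholder values rather than the scraped data.

```python
# Hypothetical interleaved list: group, country, group, country, ...
group_and_country = ["Group A", "Country X", "Group B", "Country Y"]

# Start at index 0 (or 1) and take every second element
mep_group = group_and_country[0::2]
country = group_and_country[1::2]

# Both lists should have the same length if the entries really alternate
assert len(mep_group) == len(country)

print(list(zip(mep_group, country)))
```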

## Extract party name

In [None]:
# This selector returns three entries per MEP; keep every third one (the party)
items = [item.get_text().strip() for item in bs.select('.erpl_member-list-item-content .sln-additional-info')]
party_name = [text for i, text in enumerate(items) if i % 3 == 2]
party_name

## Extract links to the individual pages

In [None]:
first_item = bs.select('a.erpl_member-list-item-content')[0]

In [None]:
first_item['href']

In [None]:
page_url = [item['href'] for item in bs.select('a.erpl_member-list-item-content')]
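
If the `href` values turn out to be relative paths, they need to be combined with the page URL before they can be requested. `urllib.parse.urljoin` handles both cases. The paths below are invented for illustration; the real `href` values may already be absolute.

```python
from urllib.parse import urljoin

page = "https://www.europarl.europa.eu/meps/en/full-list/a"

# Hypothetical href values: one relative path, one absolute URL
hrefs = ["/meps/en/12345", "https://www.europarl.europa.eu/meps/en/67890"]

# urljoin resolves relative paths against the page URL and
# leaves absolute URLs unchanged
absolute_urls = [urljoin(page, href) for href in hrefs]

print(absolute_urls)
```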

## Combine

In [None]:
df_meps = pd.DataFrame(mep_name, columns = ['mep_name'])

In [None]:
df_meps.head()
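
Assigning a list to a DataFrame column, as in the next cell, fails if the list length differs from the number of rows. That could happen if an entry on the page lacks one of the fields. A quick check before combining makes such a mismatch easier to locate. This is a sketch with a hypothetical helper, `column_lengths`, and made-up data.

```python
def column_lengths(**columns):
    """Return the length of each named list (hypothetical helper)."""
    return {name: len(values) for name, values in columns.items()}

# Made-up placeholder data standing in for the scraped lists
mep_name = ["Name 1", "Name 2"]
mep_group = ["Group A", "Group B"]
party_name = ["Party P"]  # deliberately one entry short

lengths = column_lengths(mep_name=mep_name, group=mep_group, party=party_name)

# All columns line up only if there is exactly one distinct length
all_equal = len(set(lengths.values())) == 1

print(lengths, all_equal)
```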

In [None]:
df_meps['group'] = mep_group
df_meps['country'] = country
df_meps['party'] = party_name
df_meps['page_url'] = page_url
df_meps.head()

## Save the data

In [None]:
df_meps.to_csv("data_mep_list_a.csv")