<a href="https://colab.research.google.com/github/amatsuo-g/GV918-2022-Week06/blob/main/Week06_Demo_2_C.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tasks

In this part of the demo, we will extract information from two websites:

- https://en.wikipedia.org/wiki/International_court
- https://www.europarl.europa.eu/meps/en/full-list/a


# Load packages

In [1]:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import pandas as pd

# International court 

- Scenario 1

## Extract the first table using `pandas`

In [2]:
df_tables = pd.read_html("https://en.wikipedia.org/wiki/International_court")

In [4]:
len(df_tables)

4

In [11]:
df_tables[0].head()

Unnamed: 0,Name,Subject matter and scope,Headquarters,Years active
0,African Court on Human and Peoples' Rights,Human rights within the African Union,"Addis Ababa, Ethiopia (2006–7)Arusha, Tanzania...",2006–present
1,Appellate Body of the World Trade Organization,Trade disputes within the World Trade Organiza...,"Geneva, Switzerland",1995–present
2,Benelux Court of Justice,Trade disputes within the Benelux,"Brussels, Belgium",1975–present
3,Caribbean Court of Justice,General disputes within the Caribbean Community,"Port of Spain, Trinidad and Tobago",2005–present
4,CIS Economic Court,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",1994–present


In [12]:
df_ic = df_tables[0]

In [14]:
df_ic

Unnamed: 0,Name,Subject matter and scope,Headquarters,Years active
0,African Court on Human and Peoples' Rights,Human rights within the African Union,"Addis Ababa, Ethiopia (2006–7)Arusha, Tanzania...",2006–present
1,Appellate Body of the World Trade Organization,Trade disputes within the World Trade Organiza...,"Geneva, Switzerland",1995–present
2,Benelux Court of Justice,Trade disputes within the Benelux,"Brussels, Belgium",1975–present
3,Caribbean Court of Justice,General disputes within the Caribbean Community,"Port of Spain, Trinidad and Tobago",2005–present
4,CIS Economic Court,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",1994–present
5,COMESA Court of Justice,Trade disputes within the Common Market for Ea...,"Khartoum, Sudan",1998–present
6,Common Court of Justice and Arbitration of the...,Interpretation of OHADA treaties and uniform laws,"Abidjan, Ivory Coast",1998–present
7,Court of Justice of the Andean Community,Trade disputes within the Andean Community,"Quito, Ecuador",1983–present
8,Court of the Eurasian Economic Union,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",2015–present
9,East African Court of Justice,Interpretation of East African Community treaties,"Arusha, Tanzania",2001–present


In [15]:
df_ic.to_csv("df_ic.csv")

## Modifying text

We will try to extract the founding year of each court from the `Years active` column.

### Simple method

In [18]:
df_ic['Years active'].str.slice(0,4).astype('int')

0     2006
1     1995
2     1975
3     2005
4     1994
5     1998
6     1998
7     1983
8     2015
9     2001
10    1967
11    1996
12    1959
13    1952
14    1994
15    1960
16    1979
17    1945
18    2002
19    1994
20    1993
21    1945
22    1946
23    2012
24    1994
25    1922
26    2013
27    2005
28    2002
29    2009
Name: Years active, dtype: int64

In [19]:
df_ic['Founded'] = df_ic['Years active'].str.slice(0,4).astype('int')

In [22]:
[int(stri[:4]) for stri in df_ic['Years active']]

[2006,
 1995,
 1975,
 2005,
 1994,
 1998,
 1998,
 1983,
 2015,
 2001,
 1967,
 1996,
 1959,
 1952,
 1994,
 1960,
 1979,
 1945,
 2002,
 1994,
 1993,
 1945,
 1946,
 2012,
 1994,
 1922,
 2013,
 2005,
 2002,
 2009]
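Slicing the first four characters works here because every value starts with a four-digit year. The following is a minimal sketch of how the same idea could be made tolerant of values that do not; the sample strings (in particular the last one) are illustrative assumptions, not rows taken from the Wikipedia table.

```python
import pandas as pd

# Illustrative values; the last one does not start with a four-digit year
years_active = pd.Series(["2006–present", "1945–1946", "c. 1922–1946"])

# Take the first four characters, then turn anything non-numeric into NaN
# instead of raising an error as astype('int') would
founded = pd.to_numeric(years_active.str.slice(0, 4), errors="coerce")
print(founded)
```

Rows where the slice is not a number come back as `NaN`, so they can be inspected and fixed by hand rather than stopping the whole cell.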

### Using regular expressions

Regular expressions are a powerful tool for extracting and modifying text in programming. There are several good introductions:

1. LinkedIn Learning (NLP with Python for Machine Learning Essential Training)
  - https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/what-are-regular-expressions
  - https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/learning-how-to-use-regular-expressions
  - https://www.linkedin.com/learning/nlp-with-python-for-machine-learning-essential-training/regular-expression-replacements

2. YouTube
  - https://www.youtube.com/watch?v=K8L6KVGG-7o

In [23]:
df_ic['Years active'].str.extract(r'^(\d{4})(.+)')

Unnamed: 0,0,1
0,2006,–present
1,1995,–present
2,1975,–present
3,2005,–present
4,1994,–present
5,1998,–present
6,1998,–present
7,1983,–present
8,2015,–present
9,2001,–present


In [24]:
df_ic['Founded2'] = df_ic['Years active'].str.extract(r'^(\d{4})').astype(int)

In [25]:
df_ic

Unnamed: 0,Name,Subject matter and scope,Headquarters,Years active,Founded,Founded2
0,African Court on Human and Peoples' Rights,Human rights within the African Union,"Addis Ababa, Ethiopia (2006–7)Arusha, Tanzania...",2006–present,2006,2006
1,Appellate Body of the World Trade Organization,Trade disputes within the World Trade Organiza...,"Geneva, Switzerland",1995–present,1995,1995
2,Benelux Court of Justice,Trade disputes within the Benelux,"Brussels, Belgium",1975–present,1975,1975
3,Caribbean Court of Justice,General disputes within the Caribbean Community,"Port of Spain, Trinidad and Tobago",2005–present,2005,2005
4,CIS Economic Court,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",1994–present,1994,1994
5,COMESA Court of Justice,Trade disputes within the Common Market for Ea...,"Khartoum, Sudan",1998–present,1998,1998
6,Common Court of Justice and Arbitration of the...,Interpretation of OHADA treaties and uniform laws,"Abidjan, Ivory Coast",1998–present,1998,1998
7,Court of Justice of the Andean Community,Trade disputes within the Andean Community,"Quito, Ecuador",1983–present,1983,1983
8,Court of the Eurasian Economic Union,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",2015–present,2015,2015
9,East African Court of Justice,Interpretation of East African Community treaties,"Arusha, Tanzania",2001–present,2001,2001


In [27]:
df_ic['Ended'] = df_ic['Years active'].str.extract(r'(\d{4})$')

In [28]:
df_ic

Unnamed: 0,Name,Subject matter and scope,Headquarters,Years active,Founded,Founded2,Ended
0,African Court on Human and Peoples' Rights,Human rights within the African Union,"Addis Ababa, Ethiopia (2006–7)Arusha, Tanzania...",2006–present,2006,2006,
1,Appellate Body of the World Trade Organization,Trade disputes within the World Trade Organiza...,"Geneva, Switzerland",1995–present,1995,1995,
2,Benelux Court of Justice,Trade disputes within the Benelux,"Brussels, Belgium",1975–present,1975,1975,
3,Caribbean Court of Justice,General disputes within the Caribbean Community,"Port of Spain, Trinidad and Tobago",2005–present,2005,2005,
4,CIS Economic Court,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",1994–present,1994,1994,
5,COMESA Court of Justice,Trade disputes within the Common Market for Ea...,"Khartoum, Sudan",1998–present,1998,1998,
6,Common Court of Justice and Arbitration of the...,Interpretation of OHADA treaties and uniform laws,"Abidjan, Ivory Coast",1998–present,1998,1998,
7,Court of Justice of the Andean Community,Trade disputes within the Andean Community,"Quito, Ecuador",1983–present,1983,1983,
8,Court of the Eurasian Economic Union,Trade disputes and interpretation of treaties ...,"Minsk, Belarus",2015–present,2015,2015,
9,East African Court of Justice,Interpretation of East African Community treaties,"Arusha, Tanzania",2001–present,2001,2001,
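The `Ended` column holds strings and is empty for courts that are still active. If an integer column is preferred, one possible approach (a sketch, not part of the original demo) is pandas' nullable `Int64` dtype, which can store missing values next to integers. The sample values are illustrative.

```python
import pandas as pd

# Illustrative values in the same shape as the 'Years active' column
years_active = pd.Series(["2006–present", "1945–1946"])

# expand=False returns a Series rather than a one-column DataFrame
ended = years_active.str.extract(r"(\d{4})$", expand=False)

# Convert the year strings to numbers, keeping missing values as <NA>
ended_int = pd.to_numeric(ended).astype("Int64")
print(ended_int)
```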


## Save the data

In [None]:
df_ic.to_csv("data_international_court.csv")

# List of MEPs

In this part of the demo, we will create a list of the Members of the European Parliament (MEPs).

The base URL (the list of MEPs whose family names start with the letter 'a') is:
https://www.europarl.europa.eu/meps/en/full-list/a
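If the pages for the other letters follow the same URL pattern (an assumption; only the 'a' page is used in this demo), the full set of list pages could be generated like this. The names `BASE_URL` and `letter_urls` are made up for this sketch.

```python
from string import ascii_lowercase

# Everything up to the final letter of the base URL used in this demo
BASE_URL = "https://www.europarl.europa.eu/meps/en/full-list/"

# One URL per letter, assuming each letter has its own list page
letter_urls = [BASE_URL + letter for letter in ascii_lowercase]

print(letter_urls[0])
print(len(letter_urls))
```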

## Extract names

- Scenario 2

In [29]:
url = "https://www.europarl.europa.eu/meps/en/full-list/a"
html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
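`Request` was imported above but is not needed for this page. Some sites reject requests that arrive without a browser-like `User-Agent` header; in that case a `Request` object can carry the header. This is a minimal sketch: the helper names and the user-agent string are assumptions, and nothing is downloaded until `fetch_html` is called.

```python
from urllib.request import Request, urlopen

def build_request(url, user_agent="Mozilla/5.0 (teaching demo)"):
    """Wrap the URL in a Request that carries a User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

def fetch_html(url, timeout=30):
    """Download the page and return its raw bytes."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()

# Building the request does not touch the network
req = build_request("https://www.europarl.europa.eu/meps/en/full-list/a")
print(req.full_url)
```

The bytes returned by `fetch_html(url)` can be passed to `BeautifulSoup(..., "html.parser")` in the same way as the `html` object above.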

In [34]:
name_tags = bs.select('#docMembersList .t-item')

mep_name = [item.get_text().strip() for item in name_tags]
mep_name

['Magdalena ADAMOWICZ',
 'Asim ADEMOV',
 'Isabella ADINOLFI',
 'Matteo ADINOLFI',
 'Alex AGIUS SALIBA',
 'Mazaly AGUILAR',
 'Clara AGUILERA',
 'Alviina ALAMETSÄ',
 'João ALBUQUERQUE',
 'Alexander ALEXANDROV YORDANOV',
 'François ALFONSI',
 'Atidzhe ALIEVA-VELI',
 'Abir AL-SAHLANI',
 'Álvaro AMARO',
 'Andris AMERIKS',
 'Christine ANDERSON',
 'Rasmus ANDRESEN',
 'Barry ANDREWS',
 'Eric ANDRIEU',
 'Mathilde ANDROUËT',
 'Nikos ANDROULAKIS',
 'Marc ANGEL',
 'Gerolf ANNEMANS',
 'Andrus ANSIP',
 'Attila ARA-KOVÁCS',
 'Maria ARENA',
 'Pablo ARIAS ECHEVERRÍA',
 'Pascal ARIMONT',
 'Bartosz ARŁUKOWICZ',
 'Konstantinos ARVANITIS',
 'Anna-Michelle ASIMAKOPOULOU',
 'Manon AUBRY',
 'Margrete AUKEN',
 'Petras AUŠTREVIČIUS',
 'Carmen AVRAM',
 'Malik AZMANI']

In [32]:
[t.get_text() for t in name_tags]

['Magdalena ADAMOWICZ',
 'Asim ADEMOV',
 'Isabella ADINOLFI',
 'Matteo ADINOLFI',
 'Alex AGIUS SALIBA',
 'Mazaly AGUILAR',
 'Clara AGUILERA',
 'Alviina ALAMETSÄ',
 'João ALBUQUERQUE',
 'Alexander ALEXANDROV YORDANOV',
 'François ALFONSI',
 'Atidzhe ALIEVA-VELI',
 'Abir AL-SAHLANI',
 'Álvaro AMARO',
 'Andris AMERIKS',
 'Christine ANDERSON',
 'Rasmus ANDRESEN',
 'Barry ANDREWS',
 'Eric ANDRIEU',
 'Mathilde ANDROUËT',
 'Nikos ANDROULAKIS',
 'Marc ANGEL',
 'Gerolf ANNEMANS',
 'Andrus ANSIP',
 'Attila ARA-KOVÁCS',
 'Maria ARENA',
 'Pablo ARIAS ECHEVERRÍA',
 'Pascal ARIMONT',
 'Bartosz ARŁUKOWICZ',
 'Konstantinos ARVANITIS',
 'Anna-Michelle ASIMAKOPOULOU',
 'Manon AUBRY',
 'Margrete AUKEN',
 'Petras AUŠTREVIČIUS',
 'Carmen AVRAM',
 'Malik AZMANI']

In [None]:
mep_name

## Extract political groups and country names

In [35]:
#bs.select(".mb-25:nth-child(1)")

[]

In [36]:
group_and_country = [item.get_text().strip() for item in bs.select('.mb-25')]
group_and_country

["Group of the European People's Party (Christian Democrats)",
 'Poland',
 "Group of the European People's Party (Christian Democrats)",
 'Bulgaria',
 "Group of the European People's Party (Christian Democrats)",
 'Italy',
 'Identity and Democracy Group',
 'Italy',
 'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament',
 'Malta',
 'European Conservatives and Reformists Group',
 'Spain',
 'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament',
 'Spain',
 'Group of the Greens/European Free Alliance',
 'Finland',
 'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament',
 'Portugal',
 "Group of the European People's Party (Christian Democrats)",
 'Bulgaria',
 'Group of the Greens/European Free Alliance',
 'France',
 'Renew Europe Group',
 'Bulgaria',
 'Renew Europe Group',
 'Sweden',
 "Group of the European People's Party (Christian Democrats)",
 'Portugal',
 'Group of the Progressiv

In [None]:
len(group_and_country)

In [39]:
for i, text in enumerate(group_and_country):
  print(i, text)

0 Group of the European People's Party (Christian Democrats)
1 Poland
2 Group of the European People's Party (Christian Democrats)
3 Bulgaria
4 Group of the European People's Party (Christian Democrats)
5 Italy
6 Identity and Democracy Group
7 Italy
8 Group of the Progressive Alliance of Socialists and Democrats in the European Parliament
9 Malta
10 European Conservatives and Reformists Group
11 Spain
12 Group of the Progressive Alliance of Socialists and Democrats in the European Parliament
13 Spain
14 Group of the Greens/European Free Alliance
15 Finland
16 Group of the Progressive Alliance of Socialists and Democrats in the European Parliament
17 Portugal
18 Group of the European People's Party (Christian Democrats)
19 Bulgaria
20 Group of the Greens/European Free Alliance
21 France
22 Renew Europe Group
23 Bulgaria
24 Renew Europe Group
25 Sweden
26 Group of the European People's Party (Christian Democrats)
27 Portugal
28 Group of the Progressive Alliance of Socialists and Democrat

In [40]:
mep_group = [text for i, text in enumerate(group_and_country) if i % 2 == 0]
mep_group

["Group of the European People's Party (Christian Democrats)",
 "Group of the European People's Party (Christian Democrats)",
 "Group of the European People's Party (Christian Democrats)",
 'Identity and Democracy Group',
 'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament',
 'European Conservatives and Reformists Group',
 'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament',
 'Group of the Greens/European Free Alliance',
 'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament',
 "Group of the European People's Party (Christian Democrats)",
 'Group of the Greens/European Free Alliance',
 'Renew Europe Group',
 'Renew Europe Group',
 "Group of the European People's Party (Christian Democrats)",
 'Group of the Progressive Alliance of Socialists and Democrats in the European Parliament',
 'Identity and Democracy Group',
 'Group of the Greens/European Free Alliance',
 'Renew Euro

In [41]:
country = [text for i, text in enumerate(group_and_country) if i % 2 == 1]
country

['Poland',
 'Bulgaria',
 'Italy',
 'Italy',
 'Malta',
 'Spain',
 'Spain',
 'Finland',
 'Portugal',
 'Bulgaria',
 'France',
 'Bulgaria',
 'Sweden',
 'Portugal',
 'Latvia',
 'Germany',
 'Germany',
 'Ireland',
 'France',
 'France',
 'Greece',
 'Luxembourg',
 'Belgium',
 'Estonia',
 'Hungary',
 'Belgium',
 'Spain',
 'Belgium',
 'Poland',
 'Greece',
 'Greece',
 'France',
 'Denmark',
 'Lithuania',
 'Romania',
 'Netherlands']
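The two list comprehensions above pick every other element. List slicing with a step of 2 gives the same result more compactly. A short sketch on the first few values from the output, with a length check added as an extra safeguard:

```python
# First few values from the scraped list, alternating group and country
group_and_country = [
    "Group of the European People's Party (Christian Democrats)",
    "Poland",
    "Identity and Democracy Group",
    "Italy",
]

# The alternating pattern only holds if the list has an even length
if len(group_and_country) % 2 != 0:
    raise ValueError("expected group/country pairs")

mep_group = group_and_country[0::2]  # elements 0, 2, 4, ...
country = group_and_country[1::2]    # elements 1, 3, 5, ...
print(mep_group)
print(country)
```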

## Extract party name

In [42]:
items = [item.get_text().strip() for item in bs.select('.sln-additional-info')]
party_name = [text for i, text in enumerate(items) if i % 3 == 2]
party_name

['Independent',
 'Citizens for European Development of Bulgaria',
 'Forza Italia',
 'Lega',
 'Partit Laburista',
 'VOX',
 'Partido Socialista Obrero Español',
 'Vihreä liitto',
 'Partido Socialista',
 'Union of Democratic Forces',
 'Régions et Peuples Solidaires',
 'Movement for Rights and Freedoms',
 'Centerpartiet',
 'Partido Social Democrata',
 'Gods kalpot Rīgai',
 'Alternative für Deutschland',
 'Bündnis 90/Die Grünen',
 'Fianna Fáil Party',
 'Parti socialiste',
 'Rassemblement national',
 'PASOK-KINAL',
 'Parti ouvrier socialiste luxembourgeois',
 'Vlaams Belang',
 'Eesti Reformierakond',
 'Demokratikus Koalíció',
 'Parti Socialiste',
 'Partido Popular',
 'Christlich Soziale Partei',
 'Platforma Obywatelska',
 'Coalition of the Radical Left',
 'Nea Demokratia',
 'La France Insoumise',
 'Socialistisk Folkeparti',
 'Lietuvos Respublikos liberalų sąjūdis',
 'Partidul Social Democrat',
 'Volkspartij voor Vrijheid en Democratie']

## Extract link to the individual pages

In [44]:
bs.select('a.erpl_member-list-item-content')[0]

<a class="erpl_member-list-item-content mb-2 t-y-block" href="https://www.europarl.europa.eu/meps/en/197490" itemprop="url">
<div>
<div class="erpl_image-frame mb-2">
<img alt="" aria-hidden="true" src="/commonFrontResources/evostrap/4.0.1/lib/assets/img/frame/portraitsize_thumb.png"/>
<span>
<picture>
<img alt="Magdalena ADAMOWICZ" loading="lazy" src="https://www.europarl.europa.eu/mepphoto/197490.jpg"/>
</picture>
</span>
</div>
<div class="erpl_title-h5 t-item">Magdalena ADAMOWICZ</div>
<div>
<div class="sln-additional-info mb-25">Group of the European People's Party (Christian Democrats)</div>
<div class="sln-additional-info mb-25">Poland</div>
<div class="sln-additional-info ">Independent</div>
</div>
</div>
</a>
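The markup above shows that each MEP sits in one `<a>` card containing the name, group, country, party and link. An alternative to the index arithmetic used earlier (`i % 2`, `i % 3`) is to loop over the cards and read every field from inside the same card, so the fields cannot drift out of alignment. This is a sketch on a trimmed copy of the card above, not the approach used in the rest of the demo.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first card shown above
html = """
<a class="erpl_member-list-item-content mb-2 t-y-block" href="https://www.europarl.europa.eu/meps/en/197490">
  <div class="erpl_title-h5 t-item">Magdalena ADAMOWICZ</div>
  <div class="sln-additional-info mb-25">Group of the European People's Party (Christian Democrats)</div>
  <div class="sln-additional-info mb-25">Poland</div>
  <div class="sln-additional-info ">Independent</div>
</a>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for card in soup.select("a.erpl_member-list-item-content"):
    info = [d.get_text(strip=True) for d in card.select(".sln-additional-info")]
    if len(info) != 3:
        continue  # skip cards that do not have group, country and party
    records.append({
        "mep_name": card.select_one(".t-item").get_text(strip=True),
        "group": info[0],
        "country": info[1],
        "party": info[2],
        "page_url": card["href"],
    })

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame`.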

In [45]:
first_item = bs.select('a.erpl_member-list-item-content')[0]

In [46]:
first_item['href']

'https://www.europarl.europa.eu/meps/en/197490'

In [47]:
page_url = [item['href'] for item in bs.select('a.erpl_member-list-item-content')]

In [48]:
page_url

['https://www.europarl.europa.eu/meps/en/197490',
 'https://www.europarl.europa.eu/meps/en/189525',
 'https://www.europarl.europa.eu/meps/en/124831',
 'https://www.europarl.europa.eu/meps/en/197826',
 'https://www.europarl.europa.eu/meps/en/197403',
 'https://www.europarl.europa.eu/meps/en/198096',
 'https://www.europarl.europa.eu/meps/en/125045',
 'https://www.europarl.europa.eu/meps/en/204335',
 'https://www.europarl.europa.eu/meps/en/237224',
 'https://www.europarl.europa.eu/meps/en/197836',
 'https://www.europarl.europa.eu/meps/en/96750',
 'https://www.europarl.europa.eu/meps/en/197848',
 'https://www.europarl.europa.eu/meps/en/197400',
 'https://www.europarl.europa.eu/meps/en/197746',
 'https://www.europarl.europa.eu/meps/en/197783',
 'https://www.europarl.europa.eu/meps/en/197475',
 'https://www.europarl.europa.eu/meps/en/197448',
 'https://www.europarl.europa.eu/meps/en/204332',
 'https://www.europarl.europa.eu/meps/en/113892',
 'https://www.europarl.europa.eu/meps/en/197691',
 

## Combine

In [49]:
df_meps = pd.DataFrame(mep_name, columns = ['mep_name'])

In [50]:
df_meps.head()

Unnamed: 0,mep_name
0,Magdalena ADAMOWICZ
1,Asim ADEMOV
2,Isabella ADINOLFI
3,Matteo ADINOLFI
4,Alex AGIUS SALIBA


In [51]:
df_meps['group'] = mep_group
df_meps['country'] = country
df_meps['party'] = party_name
df_meps['page_url'] = page_url
df_meps.head()

Unnamed: 0,mep_name,group,country,party,page_url
0,Magdalena ADAMOWICZ,Group of the European People's Party (Christia...,Poland,Independent,https://www.europarl.europa.eu/meps/en/197490
1,Asim ADEMOV,Group of the European People's Party (Christia...,Bulgaria,Citizens for European Development of Bulgaria,https://www.europarl.europa.eu/meps/en/189525
2,Isabella ADINOLFI,Group of the European People's Party (Christia...,Italy,Forza Italia,https://www.europarl.europa.eu/meps/en/124831
3,Matteo ADINOLFI,Identity and Democracy Group,Italy,Lega,https://www.europarl.europa.eu/meps/en/197826
4,Alex AGIUS SALIBA,Group of the Progressive Alliance of Socialist...,Malta,Partit Laburista,https://www.europarl.europa.eu/meps/en/197403
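Assigning the lists column by column only works if all of them have the same length. A quick check before building the data frame makes a mismatch easier to diagnose. This is a small sketch using two rows from the output above; the variable names `columns` and `lengths` are made up for illustration.

```python
import pandas as pd

# Two rows taken from the output above
columns = {
    "mep_name": ["Magdalena ADAMOWICZ", "Asim ADEMOV"],
    "country": ["Poland", "Bulgaria"],
    "party": ["Independent", "Citizens for European Development of Bulgaria"],
}

# Every list should contain one entry per MEP
lengths = {name: len(values) for name, values in columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError(f"column lengths differ: {lengths}")

df = pd.DataFrame(columns)
print(df)
```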


## Save the data

In [52]:
df_meps.to_csv("data_mep_list_a.csv")
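By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. If the index carries no information, `index=False` leaves it out. The sketch below writes to an in-memory buffer instead of a file so that nothing is saved to disk; the one-row data frame is illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({"mep_name": ["Magdalena ADAMOWICZ"], "country": ["Poland"]})

# Write without the index, then read the text back in
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(list(round_trip.columns))
```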