Scraping the Finnish Wikipedia page of Leonardo DiCaprio

https://fi.wikipedia.org/wiki/Leonardo_DiCaprio

The goal is to get the 'filmography' table into a pandas DataFrame

The code isn't very clean; I only use it as notes for other projects

Tools: first the pandas HTML reader, then BeautifulSoup further below

In [15]:
import pandas as pd

In [127]:
tables = pd.read_html('https://fi.wikipedia.org/wiki/Leonardo_DiCaprio')

The line above returns four tables:

In [128]:
len(tables)

4

The table of interest turned out to be the second one stored in 'tables' (index 1):

-I changed the column titles to English

In [129]:
df = tables[1]
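
Picking the table by position breaks if the page gains or loses a table. As a rough sketch of an alternative, pd.read_html accepts a 'match' argument that keeps only tables containing the given text. The HTML below is a made-up miniature of the page (not the real markup), and the names 'matched' and 'film_table' are only for illustration:

```python
from io import StringIO

import pandas as pd

# Made-up miniature page with two tables; only the second one contains 'Elokuvat'
html = """
<table><tr><th>Nimi</th></tr><tr><td>Leonardo DiCaprio</td></tr></table>
<table class="wikitable">
<tr><th colspan="2">Elokuvat</th></tr>
<tr><th>Vuosi</th><th>Rooli</th></tr>
<tr><td>1991</td><td>Josh</td></tr>
</table>
"""

# 'match' keeps only the tables whose text matches the string (or regex)
matched = pd.read_html(StringIO(html), match='Elokuvat')
film_table = matched[0]
print(len(matched))
print(film_table.shape)
```

With the real URL the same argument should narrow the four tables down, but I have not checked how many of them mention 'Elokuvat'.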

In [130]:
new_column_names = ['year', 'name_fin', 'name', 'role']
df.columns = new_column_names

In [132]:
df.head(20)

Unnamed: 0,year,name_fin,name,role
0,1991,Critters 3 – nakertajien paluu,Critters 3: You Are What They Eat,Josh
1,1992,Himon vallassa,Poison Ivy,Guy
2,1993,Gilbert Grape,What’s Eating Gilbert Grape,Arnie Grape
3,1993,Tämän pojan elämä,This Boy’s Life,Toby
4,1994,The Foot Shooting Party,The Foot Shooting Party,Bud
5,1995,Total Eclipse,Total Eclipse,Arthur Rimbaud
6,1995,New Yorkin kadut,The Basketball Diaries,Jim Carroll
7,1995,Nopeat ja kuolleet,The Quick and the Dead,The Kid
8,1996,Marvinin tyttäret,Marvin’s Room,Hank
9,1996,William Shakespearen Romeo ja Julia,Romeo + Juliet,Romeo


In [133]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33 entries, 0 to 32
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   year      33 non-null     int64 
 1   name_fin  33 non-null     object
 2   name      33 non-null     object
 3   role      33 non-null     object
dtypes: int64(1), object(3)
memory usage: 1.2+ KB


Now the same with BeautifulSoup:

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
url = 'https://fi.wikipedia.org/wiki/Leonardo_DiCaprio'

In [3]:
r = requests.get(url)

In [4]:
html_doc = r.text
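
The cells above assume the request succeeds. A minimal sketch of a more defensive fetch, assuming a hypothetical helper name 'fetch_html' and an arbitrary 10 second timeout (neither is from the original notes):

```python
import requests


def fetch_html(url, timeout=10, http=requests):
    """Return the page text, raising an error on HTTP failures (hypothetical helper)."""
    # 'http' defaults to the requests module; a requests.Session can be passed instead
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx/5xx responses instead of returning an error page
    response.raise_for_status()
    return response.text
```

It would be used as html_doc = fetch_html(url) in place of the two cells above.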

In [5]:
soup = BeautifulSoup(html_doc, 'html.parser')  #naming the parser avoids bs4's "no parser was explicitly specified" warning

In [37]:
#checking that the contents are right
#print(soup.prettify())

By examining the page's source code, we find that our table has the following tag:

In [7]:
soup.find_all('table', class_ = 'wikitable')

[<table class="wikitable" style="font-size:90%">
 <tbody><tr style="text-align:center;">
 <th colspan="4" style="background:#B0C4DE;">Elokuvat
 </th></tr>
 <tr style="text-align:center;">
 <th style="background:#ccc;">Vuosi
 </th>
 <th style="background:#ccc;">Suomenkielinen nimi
 </th>
 <th style="background:#ccc;">Alkuperäinen nimi
 </th>
 <th style="background:#ccc;">Rooli
 </th></tr>
 <tr>
 <td><a href="/wiki/Elokuvavuosi_1991" title="Elokuvavuosi 1991">1991</a></td>
 <td><i><a class="new" href="/w/index.php?title=Critters_3_%E2%80%93_nakertajien_paluu&amp;action=edit&amp;redlink=1" title="Critters 3 – nakertajien paluu (sivua ei ole)">Critters 3 – nakertajien paluu</a></i></td>
 <td><i>Critters 3: You Are What They Eat</i></td>
 <td>Josh
 </td></tr>
 <tr>
 <td><a href="/wiki/Elokuvavuosi_1992" title="Elokuvavuosi 1992">1992</a></td>
 <td><i><a class="new" href="/w/index.php?title=Himon_vallassa&amp;action=edit&amp;redlink=1" title="Himon vallassa (sivua ei ole)">Himon vallassa</a>

It turned out later that the code in the previous cell returns a ResultSet, a datatype that has no .find_all() method of its own, so I looked up the table's index (1) among all the tables on the page and used that for subsetting instead of 'class_':

!!!If you use "subset = soup.find_all('table', class_="wikitable")", the subset becomes a ResultSet (a list of tags) and cannot be searched further with the find_all() or find() methods until you pick a single item from it by index.
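
To make the difference concrete, here is a small sketch on a made-up one-table snippet (not the real page). It also shows soup.find(), which returns the first match directly as a Tag, so 'class_' can still be used without indexing:

```python
from bs4 import BeautifulSoup

# Made-up snippet standing in for the real page
html = '<table class="wikitable"><tr><th>Vuosi</th></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

result_set = soup.find_all('table', class_='wikitable')  # list-like ResultSet
first_tag = result_set[0]                                # one Tag picked by index
direct_tag = soup.find('table', class_='wikitable')      # first match, already a Tag

print(type(result_set).__name__)
print(type(first_tag).__name__)
# The ResultSet itself has no find_all, but each Tag inside it does
print(hasattr(result_set, 'find_all'))
headers = [th.get_text(strip=True) for th in direct_tag.find_all('th')]
print(headers)
```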

In [8]:
subset = soup.find_all('table')[1]

In [39]:
#another check
#print(subset)

In [10]:
print(type(subset))

<class 'bs4.element.Tag'>


^^That's the type you get with indexing. 

It can be searched further, like below where I want to get the column titles (<th> tags):

In [11]:
titles = subset.find_all('th')

In [12]:
print(titles)

[<th colspan="4" style="background:#B0C4DE;">Elokuvat
</th>, <th style="background:#ccc;">Vuosi
</th>, <th style="background:#ccc;">Suomenkielinen nimi
</th>, <th style="background:#ccc;">Alkuperäinen nimi
</th>, <th style="background:#ccc;">Rooli
</th>]


Now that we have the right table column titles, I will clean them with .text.strip()

The titles list also contains the value 'Elokuvat', which is really the name of the table, so it is removed.

Then, we're ready to make the dataframe, using the column titles from the list:

In [13]:
titles_to_list = [title.text.strip() for title in titles]
titles_to_list.remove('Elokuvat')
print(titles_to_list)

['Vuosi', 'Suomenkielinen nimi', 'Alkuperäinen nimi', 'Rooli']
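
Removing 'Elokuvat' by name only works while the table keeps that title. A possible alternative, sketched on a trimmed copy of the header markup shown earlier, is to skip any <th> that carries a colspan attribute, since only the table title spans several columns here. That is an assumption about this particular table, not a general rule:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the two header rows from the table above
html = """
<table class="wikitable">
<tr><th colspan="4">Elokuvat
</th></tr>
<tr><th>Vuosi
</th><th>Suomenkielinen nimi
</th><th>Alkuperäinen nimi
</th><th>Rooli
</th></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')

# Keep only header cells that do not span several columns
column_titles = [
    th.get_text(strip=True)
    for th in table.find_all('th')
    if not th.has_attr('colspan')
]
print(column_titles)
```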


In [16]:
df2 = pd.DataFrame(columns = titles_to_list)

In [17]:
df2.head()

Unnamed: 0,Vuosi,Suomenkielinen nimi,Alkuperäinen nimi,Rooli


In [18]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 0 entries
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   Vuosi                0 non-null      object
 1   Suomenkielinen nimi  0 non-null      object
 2   Alkuperäinen nimi    0 non-null      object
 3   Rooli                0 non-null      object
dtypes: object(4)
memory usage: 132.0+ bytes


In [38]:
column_data = subset.find_all('tr')
#print(column_data)
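
The <tr> list also contains the two header rows, which hold only <th> cells. As a sketch on a shortened, simplified copy of the table (links stripped, two films only), collecting the <td> texts per row gives empty lists for those header rows, so they can be filtered out instead of being skipped by position:

```python
from bs4 import BeautifulSoup

# Shortened, simplified copy of the table markup shown earlier
html = """
<table class="wikitable">
<tr><th colspan="4">Elokuvat</th></tr>
<tr><th>Vuosi</th><th>Suomenkielinen nimi</th><th>Alkuperäinen nimi</th><th>Rooli</th></tr>
<tr><td>1991</td><td><i>Critters 3 – nakertajien paluu</i></td><td><i>Critters 3: You Are What They Eat</i></td><td>Josh
</td></tr>
<tr><td>1992</td><td><i>Himon vallassa</i></td><td><i>Poison Ivy</i></td><td>Guy
</td></tr>
</table>
"""
table = BeautifulSoup(html, 'html.parser').find('table')

# One list of cleaned cell texts per <tr>; header rows have no <td> and come out empty
all_rows = [
    [td.get_text(strip=True) for td in tr.find_all('td')]
    for tr in table.find_all('tr')
]
data_rows = [cells for cells in all_rows if cells]
print(data_rows)
```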

Now I have everything I need to scrape the table:

-By examining the HTML source of the page, I found that the table is a <table> tag with class 'wikitable'
   
-The table of interest was selected using index '[1]'

-The column titles (<th> tags) were found with the same exploratory method
    
-The titles were cleaned with .text.strip(), stored in a list and used as the dataframe's columns

-<tr> tags were found to mark the rows, and they were collected into column_data above

-<td> tags mark the individual data entries, and they are filled into the dataframe below

In [36]:
#This loop goes through column_data (skipping the two header rows) and picks the data points using the <td> tag,
#storing them in the variable row_data
for row in column_data[2:]:
    row_data = row.find_all('td')
    #The line below cleans the data points in row_data and stores them in ind_row_data as a list;
    #a new list is generated on every iteration
    ind_row_data = [data.text.strip() for data in row_data]
    #"length" is the index of the next row, so the values can be inserted there on every iteration
    length = len(df2)
    #Some lists are missing 'Vuosi' (index=0) and/or 'Suomenkielinen nimi' (index=1),
    #for example: ['The Departed', 'Billy Costigan'] (both missing).
    #This is because of Wikipedia's table formatting.
    #Solution: if the first item is not a year, int() raises ValueError;
    #then the previous row's year is reused and the first item is taken as the Finnish name
    try:
        df2.loc[length, df2.columns[0]] = int(ind_row_data[0])
        df2.loc[length, df2.columns[1]] = ind_row_data[1]
    except ValueError:
        df2.loc[length, df2.columns[0]] = df2.loc[length-1, df2.columns[0]]
        df2.loc[length, df2.columns[1]] = ind_row_data[0]
    #Because every row has the original (English) name and the role,
    #they are easily filled in from the back
    df2.loc[length, df2.columns[-1]] = ind_row_data[-1]
    df2.loc[length, df2.columns[-2]] = ind_row_data[-2]
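
Writing into the dataframe one cell at a time works, but rerunning the cell appends the rows again. As a sketch of another way to do it (not what I ran above), the rows can be collected as plain records and turned into a dataframe in one step. The helper name 'rows_to_frame' is hypothetical, and it uses the same "first cell is a year" heuristic as the loop:

```python
import pandas as pd

COLUMNS = ['Vuosi', 'Suomenkielinen nimi', 'Alkuperäinen nimi', 'Rooli']


def rows_to_frame(rows, columns=COLUMNS):
    """Build the dataframe from lists of cleaned cell texts (hypothetical helper)."""
    records = []
    last_year = None
    for cells in rows:
        # Same heuristic as the loop above: a leading all-digit cell is the year
        if cells and cells[0].isdigit():
            last_year = int(cells[0])
            rest = cells[1:]
        else:
            rest = cells  # year cell missing: reuse the previous row's year
        if len(rest) < 2:
            continue  # header or malformed row, nothing to fill from the back
        records.append({
            columns[0]: last_year,
            columns[1]: rest[0],    # Finnish name (or the only name given)
            columns[2]: rest[-2],   # original name, counted from the back
            columns[3]: rest[-1],   # role, counted from the back
        })
    return pd.DataFrame(records, columns=columns)


# Sample rows from the table in these notes; the second has its year cell dropped
# on purpose to imitate a row where 'Vuosi' is missing
sample_rows = [
    ['1993', 'Gilbert Grape', 'What’s Eating Gilbert Grape', 'Arnie Grape'],
    ['Tämän pojan elämä', 'This Boy’s Life', 'Toby'],
]
df_sketch = rows_to_frame(sample_rows)
print(df_sketch)
```

Like the loop, this would misread a film whose title is all digits as a year, so it is only as reliable as that heuristic.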

Here it is, the table using BeautifulSoup and a little bit of HTML examination:

In [35]:
df2

Unnamed: 0,Vuosi,Suomenkielinen nimi,Alkuperäinen nimi,Rooli
0,1991,Critters 3 – nakertajien paluu,Critters 3: You Are What They Eat,Josh
1,1992,Himon vallassa,Poison Ivy,Guy
2,1993,Gilbert Grape,What’s Eating Gilbert Grape,Arnie Grape
3,1993,Tämän pojan elämä,This Boy’s Life,Toby
4,1994,The Foot Shooting Party,The Foot Shooting Party,Bud
5,1995,Total Eclipse,Total Eclipse,Arthur Rimbaud
6,1995,New Yorkin kadut,The Basketball Diaries,Jim Carroll
7,1995,Nopeat ja kuolleet,The Quick and the Dead,The Kid
8,1996,Marvinin tyttäret,Marvin’s Room,Hank
9,1996,William Shakespearen Romeo ja Julia,Romeo + Juliet,Romeo
