## Python's BeautifulSoup package

### A. Working with HTML

BeautifulSoup is a widely used Python package for parsing HTML documents and extracting elements from them.

In [None]:
import requests
from bs4 import BeautifulSoup

We will use this package to extract the table on the Wikipedia page "List of largest financial services companies by revenue".

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue'
r = requests.get(url)
print(r.url)
print(r.text)

https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-disabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
<head>
<meta charset="UTF-8">
<title>List of largest financial services companies by revenue - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feat

As you can see, the HTTP request returned the HTML document that makes up the Wikipedia page. The raw markup is messy, and its structure is not clear at first glance.
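
Before parsing, it can help to confirm that the request actually succeeded. The sketch below is one possible way to do that; the helper name fetch_html, the 10-second timeout, and the User-Agent string are assumptions made for illustration and do not come from the notebook itself.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    Hypothetical helper: the name, timeout and header value are illustrative.
    """
    headers = {"User-Agent": "bs4-tutorial-example/0.1"}  # assumed value
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text
```

With this helper, a failed request raises an exception instead of silently handing an error page to the parser.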

#### 1. Creating a BeautifulSoup object
This is where the BeautifulSoup package comes in handy! Let's convert the response's "text" attribute into a BeautifulSoup object.

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue'
r = requests.get(url)
html_content = r.text

# requests never sets r.text to None, so check the HTTP status instead
if r.ok:
    # create a BeautifulSoup object
    html_soup = BeautifulSoup(html_content, "html.parser")
    print(type(html_soup))
else:
    raise Exception('Error getting data from {} (status {})'.format(url, r.status_code))

<class 'bs4.BeautifulSoup'>


BeautifulSoup itself relies on an underlying HTML parser, and several are available in Python:
- 'html.parser' - Python's built-in parser, no extra installation needed
- 'lxml' - external package that must be installed separately, very fast
- 'html5lib' - external package that aims to parse a page exactly as a web browser does, but is slow
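
Because 'lxml' and 'html5lib' are optional installs, a script can fall back to the built-in parser when the preferred one is missing. This is a minimal sketch; the function name make_soup is hypothetical, and it relies on the FeatureNotFound exception that bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred="lxml"):
    """Parse `html` with the preferred parser, else with Python's built-in one.

    Hypothetical helper written for illustration only.
    """
    try:
        return BeautifulSoup(html, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the standard-library parser.
        return BeautifulSoup(html, "html.parser")


demo_soup = make_soup("<p>hello</p>")
print(demo_soup.find("p").get_text())
```

Different parsers can build slightly different trees from malformed HTML, so it is usually worth naming the parser explicitly rather than relying on whichever one happens to be installed.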

#### 2. Methods to extract HTML elements
BeautifulSoup takes HTML content and transforms it into a tree-based representation. The two most commonly used methods to fetch data from a BeautifulSoup object are:
- find : returns the first matching element, or None if nothing matches
- find_all : returns a list of all matching elements

Both methods search for elements inside the HTML tree. The first argument is the tag name to look for, given as a string, or a list of tag names. You can also pass the attrs argument, a Python dictionary of attribute names and values; only elements whose attributes match are returned. "find_all" has an extra argument called limit, which caps the number of elements retrieved.
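
These arguments can be tried out on a small hand-written table, which makes the difference between the two methods easier to see than on the full Wikipedia page. The HTML snippet and the variable names below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Tiny, made-up table used only to demonstrate the search arguments.
demo_html = """
<table class="demo">
  <tr><th>Rank</th><th>Company</th></tr>
  <tr><th>1</th><td class="name">Alpha Bank</td></tr>
  <tr><th>2</th><td class="name">Beta Insurance</td></tr>
</table>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

first_row = demo_soup.find("tr")                            # first match only
all_rows = demo_soup.find_all("tr")                         # every match, as a list
cells = demo_soup.find_all(["th", "td"])                    # a list of tag names
names = demo_soup.find_all("td", attrs={"class": "name"})   # filter on attributes
first_two = demo_soup.find_all("tr", limit=2)               # stop after two matches
missing = demo_soup.find("div")                             # no match: find gives None

print(len(all_rows), len(cells), len(names), len(first_two), missing)
```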

In [None]:
html_soup.find('tr')

<tr>
<th data-sort-type="number">Rank
</th>
<th scope="col">Company
</th>
<th scope="col">Industry
</th>
<th scope="col">Revenue
<p>(USD millions)
</p>
</th>
<th>Net Income
<p>(USD millions)
</p>
</th>
<th>Total Assets
<p>(USD billion)
</p>
</th>
<th scope="col">Headquarters
</th></tr>

#### 3. Extracting data using attributes of HTML elements

You can narrow a search further by filtering on the attributes of an element, such as its class.
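
How the class filter matches is easy to get wrong, so here is a small sketch on made-up markup that reuses the class names seen on the Wikipedia page. A single class name matches any element carrying that class, while a multi-class string is compared against the exact value of the class attribute, order included.

```python
from bs4 import BeautifulSoup

# Made-up snippet; only the class names mirror the Wikipedia table.
demo_html = (
    '<table class="wikitable sortable plainrowheads"><tr><td>'
    '<span class="datasortkey" data-sort-value="China">China</span>'
    "</td></tr></table>"
)
demo_soup = BeautifulSoup(demo_html, "html.parser")

by_one_class = demo_soup.find("table", class_="wikitable")
by_exact = demo_soup.find("table", class_="wikitable sortable plainrowheads")
by_reordered = demo_soup.find("table", class_="sortable wikitable")  # order differs

# Attributes that are not valid Python identifiers go through attrs.
by_data = demo_soup.find_all("span", attrs={"data-sort-value": "China"})

print(by_one_class is not None, by_exact is not None, by_reordered, len(by_data))
```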

In [None]:
html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
# html_soup.find('table', class_= 'wikitable sortable plainrowheads')

<table class="wikitable sortable plainrowheads">
<tbody><tr>
<th data-sort-type="number">Rank
</th>
<th scope="col">Company
</th>
<th scope="col">Industry
</th>
<th scope="col">Revenue
<p>(USD millions)
</p>
</th>
<th>Net Income
<p>(USD millions)
</p>
</th>
<th>Total Assets
<p>(USD billion)
</p>
</th>
<th scope="col">Headquarters
</th></tr>
<tr>
<th>1
</th>
<td><a href="/wiki/Transamerica_Corporation" title="Transamerica Corporation">Transamerica Corporation </a></td>
<td>Conglomerate</td>
<td>245,510
</td>
<td>42,521
</td>
<td>873</td>
<td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/3

In [None]:
# search using keyword arguments (class_ because class is a reserved word in Python)
countries = html_soup.find_all(class_= 'datasortkey')
countries

[<span class="datasortkey" data-sort-value="United States"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></span></span> </span><a href="/wiki/United_States" title="United States">United States</a></span>,
 <span class="datasortkey" data-sort-value="China"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikip

#### 4. Filtering the results of find and find_all methods
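You can narrow down the result of the "find" method further. Calling a Tag object like a function, as in my_table('th'), is shorthand for calling its "find_all" method with the same arguments.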
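
The call shorthand can be checked on a small made-up table. The sketch also shows that a search on the whole table returns the header cells of every row, so the search has to start from a single row to get only the column headers.

```python
from bs4 import BeautifulSoup

# Made-up table for illustration.
demo_html = """<table>
<tr><th>Rank</th><th>Company</th></tr>
<tr><th>1</th><td>Alpha Bank</td></tr>
</table>"""
demo_table = BeautifulSoup(demo_html, "html.parser").find("table")

via_call = demo_table("th")              # shorthand
via_method = demo_table.find_all("th")   # explicit form
same = via_call == via_method

# Restrict the search to the first row to get only the column headers.
header_cells = demo_table.find("tr")("th")
header_names = [th.get_text(strip=True) for th in header_cells]

print(same, len(via_call), header_names)
```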

In [None]:
my_table = html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
print(type(my_table))
my_table

<class 'bs4.element.Tag'>


<table class="wikitable sortable plainrowheads">
<tbody><tr>
<th data-sort-type="number">Rank
</th>
<th scope="col">Company
</th>
<th scope="col">Industry
</th>
<th scope="col">Revenue
<p>(USD millions)
</p>
</th>
<th>Net Income
<p>(USD millions)
</p>
</th>
<th>Total Assets
<p>(USD billion)
</p>
</th>
<th scope="col">Headquarters
</th></tr>
<tr>
<th>1
</th>
<td><a href="/wiki/Transamerica_Corporation" title="Transamerica Corporation">Transamerica Corporation </a></td>
<td>Conglomerate</td>
<td>245,510
</td>
<td>42,521
</td>
<td>873</td>
<td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/3

In [None]:
my_table('th')

[<th data-sort-type="number">Rank
 </th>,
 <th scope="col">Company
 </th>,
 <th scope="col">Industry
 </th>,
 <th scope="col">Revenue
 <p>(USD millions)
 </p>
 </th>,
 <th>Net Income
 <p>(USD millions)
 </p>
 </th>,
 <th>Total Assets
 <p>(USD billion)
 </p>
 </th>,
 <th scope="col">Headquarters
 </th>,
 <th>1
 </th>,
 <th>2
 </th>,
 <th>3
 </th>,
 <th>4
 </th>,
 <th>5
 </th>,
 <th>6
 </th>,
 <th>7
 </th>,
 <th>8
 </th>,
 <th>9
 </th>,
 <th>10
 </th>,
 <th>11
 </th>,
 <th>12
 </th>,
 <th>13
 </th>,
 <th>14
 </th>,
 <th>15
 </th>,
 <th>16
 </th>,
 <th>17
 </th>,
 <th>18
 </th>,
 <th>19
 </th>,
 <th>20
 </th>,
 <th>21
 </th>,
 <th>22
 </th>,
 <th>23
 </th>,
 <th>24
 </th>,
 <th>25
 </th>,
 <th>26
 </th>,
 <th>27
 </th>,
 <th>28
 </th>,
 <th>29
 </th>,
 <th>30
 </th>,
 <th>31
 </th>,
 <th>32
 </th>,
 <th>33
 </th>,
 <th>34
 </th>,
 <th>35
 </th>,
 <th>36
 </th>,
 <th>37
 </th>,
 <th>38
 </th>,
 <th>39
 </th>,
 <th>40
 </th>,
 <th>41
 </th>,
 <th>42
 </th>,
 <th>43
 </th>,
 <th>44
 </th>,
 

In [None]:
for eachrow in my_table('tr'):
    print('-----------------')
    print(eachrow)

-----------------
<tr>
<th data-sort-type="number">Rank
</th>
<th scope="col">Company
</th>
<th scope="col">Industry
</th>
<th scope="col">Revenue
<p>(USD millions)
</p>
</th>
<th>Net Income
<p>(USD millions)
</p>
</th>
<th>Total Assets
<p>(USD billion)
</p>
</th>
<th scope="col">Headquarters
</th></tr>
-----------------
<tr>
<th>1
</th>
<td><a href="/wiki/Transamerica_Corporation" title="Transamerica Corporation">Transamerica Corporation </a></td>
<td>Conglomerate</td>
<td>245,510
</td>
<td>42,521
</td>
<td>873</td>
<td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_Unit

In [None]:
for eachrow in my_table('tr'):
    print('----------')
    print(eachrow('th'))

----------
[<th data-sort-type="number">Rank
</th>, <th scope="col">Company
</th>, <th scope="col">Industry
</th>, <th scope="col">Revenue
<p>(USD millions)
</p>
</th>, <th>Net Income
<p>(USD millions)
</p>
</th>, <th>Total Assets
<p>(USD billion)
</p>
</th>, <th scope="col">Headquarters
</th>]
----------
[<th>1
</th>]
----------
[<th>2
</th>]
----------
[<th>3
</th>]
----------
[<th>4
</th>]
----------
[<th>5
</th>]
----------
[<th>6
</th>]
----------
[<th>7
</th>]
----------
[<th>8
</th>]
----------
[<th>9
</th>]
----------
[<th>10
</th>]
----------
[<th>11
</th>]
----------
[<th>12
</th>]
----------
[<th>13
</th>]
----------
[<th>14
</th>]
----------
[<th>15
</th>]
----------
[<th>16
</th>]
----------
[<th>17
</th>]
----------
[<th>18
</th>]
----------
[<th>19
</th>]
----------
[<th>20
</th>]
----------
[<th>21
</th>]
----------
[<th>22
</th>]
----------
[<th>23
</th>]
----------
[<th>24
</th>]
----------
[<th>25
</th>]
----------
[<th>26
</th>]
----------
[<th>27
</th>]
----------


In [None]:
for eachrow in my_table('tr'):
    print('----------')
    print(eachrow(['th','td']))

----------
[<th data-sort-type="number">Rank
</th>, <th scope="col">Company
</th>, <th scope="col">Industry
</th>, <th scope="col">Revenue
<p>(USD millions)
</p>
</th>, <th>Net Income
<p>(USD millions)
</p>
</th>, <th>Total Assets
<p>(USD billion)
</p>
</th>, <th scope="col">Headquarters
</th>]
----------
[<th>1
</th>, <td><a href="/wiki/Transamerica_Corporation" title="Transamerica Corporation">Transamerica Corporation </a></td>, <td>Conglomerate</td>, <td>245,510
</td>, <td>42,521
</td>, <td>873</td>, <td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.

In [None]:
for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])
    print('----------')
    print(my_data)

----------
[<th>1
</th>, <td><a href="/wiki/Transamerica_Corporation" title="Transamerica Corporation">Transamerica Corporation </a></td>, <td>Conglomerate</td>, <td>245,510
</td>, <td>42,521
</td>, <td>873</td>, <td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></span></span> </span><a href="/wiki/United_States" title="United States">United States</a></span>
</td>]
----------
[<th>2
</th>, <td><a class="mw-redire

In [None]:
countries = html_soup.find_all(class_= 'datasortkey')
for i in countries:
    img = i.find('img')
    country = i.text
    print(f"{country}\n{img}")

 United States
<img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/>
 China
<img alt="" class="mw-file-element" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/35px-Flag_of_the_People%27s_Republic_of_China.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/f

In [None]:
for i in countries:
    img = i.find('img')['src']
    country = i.text
    print(f"{country}\n{img}")

 United States
//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png
 China
//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png
 China
//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png
 China
//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png
 China
//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png
 China
//upload.wikimedia.org/wikipedia/commons/thumb/f/fa/Flag_of_the_People%27s_Republic_of_China.svg/23px-Flag_of_the_People%27s_Republic_of_China.svg.png
 Germany
//upload.wikimedia.org/wikipedia/en/thumb/b/ba/Flag_of_Germany.svg/23p
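
The loop above assumes every element contains an image and prints protocol-relative URLs that start with "//". A more defensive variant might look like the sketch below; the markup, the base_url value, and the variable names are assumptions made for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up markup: the second span deliberately has no <img>.
demo_html = """
<span class="datasortkey" data-sort-value="Germany">
  <img src="//upload.example.org/flag_de.png" width="23"/> <a href="/wiki/Germany">Germany</a>
</span>
<span class="datasortkey" data-sort-value="Nowhere">Nowhere</span>
"""
demo_soup = BeautifulSoup(demo_html, "html.parser")
base_url = "https://en.wikipedia.org/wiki/Example"  # assumed page URL

flags = {}
for span in demo_soup.find_all(class_="datasortkey"):
    country = span.get_text(strip=True)   # strips the surrounding whitespace
    img = span.find("img")                # None when the span has no image
    src = img.get("src") if img is not None else None
    # urljoin turns "//host/path" into a full URL using the scheme of base_url
    flags[country] = urljoin(base_url, src) if src else None

print(flags)
```

Using get("src") instead of ["src"] returns None for a missing attribute rather than raising a KeyError.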

#### 5. Extracting data based on string match
You can also pass the string argument to match on the text of an element, optionally combined with a tag name and/or attributes. A plain string must match the element's text exactly.
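
Exact matching can miss cells whose text has extra words or a trailing newline. A compiled regular expression can be passed instead of a plain string; the sketch below compares the two on made-up markup.

```python
import re

from bs4 import BeautifulSoup

# Made-up row; the third cell has extra text and a trailing newline.
demo_html = """<table><tr>
<td>Insurance</td>
<td>Banking</td>
<td>Life Insurance
</td>
</tr></table>"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

exact = demo_soup.find_all("td", string="Insurance")                 # exact text only
contains = demo_soup.find_all("td", string=re.compile("Insurance"))  # regex search

print(len(exact), len(contains))
```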

In [None]:
insurance = html_soup.find_all('td', string='Insurance')
insurance

[<td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>,
 <td>Insurance</td>]



#### 6. Navigating HTML tree using CSS

There is also a "select" method that lets us navigate the HTML tree with CSS selectors. It returns a list of matching elements, and the HTML attributes of each element can be accessed like a dictionary. The related "select_one" method returns only the first match.
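
CSS selectors can combine tag names, classes, and nesting in one expression. The sketch below uses made-up markup to show a class selector, a child combinator, and dictionary-style attribute access on the results.

```python
from bs4 import BeautifulSoup

# Made-up table for illustration.
demo_html = """<table class="wikitable sortable">
<tr><th>1</th><td><a href="/wiki/Alpha" title="Alpha">Alpha</a></td></tr>
<tr><th>2</th><td><a href="/wiki/Beta" title="Beta">Beta</a></td></tr>
</table>"""
demo_soup = BeautifulSoup(demo_html, "html.parser")

# Class order does not matter in a CSS selector.
links = demo_soup.select("table.sortable.wikitable td > a")
hrefs = [a["href"] for a in links]               # attribute access like a dictionary

first_link = demo_soup.select_one("td a")        # first match only
no_match = demo_soup.select_one("div.missing")   # None when nothing matches

print(hrefs, first_link["title"], no_match)
```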

In [None]:
print(type(html_soup.select('table')))
html_soup.select('table')

<class 'bs4.element.ResultSet'>


[<table class="wikitable sortable plainrowheads">
 <tbody><tr>
 <th data-sort-type="number">Rank
 </th>
 <th scope="col">Company
 </th>
 <th scope="col">Industry
 </th>
 <th scope="col">Revenue
 <p>(USD millions)
 </p>
 </th>
 <th>Net Income
 <p>(USD millions)
 </p>
 </th>
 <th>Total Assets
 <p>(USD billion)
 </p>
 </th>
 <th scope="col">Headquarters
 </th></tr>
 <tr>
 <th>1
 </th>
 <td><a href="/wiki/Transamerica_Corporation" title="Transamerica Corporation">Transamerica Corporation </a></td>
 <td>Conglomerate</td>
 <td>245,510
 </td>
 <td>42,521
 </td>
 <td>873</td>
 <td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a

In [None]:
for i in html_soup.select('td'):
    print(i)
    print(i.text)
    print('----')

<td><a href="/wiki/Transamerica_Corporation" title="Transamerica Corporation">Transamerica Corporation </a></td>
Transamerica Corporation 
----
<td>Conglomerate</td>
Conglomerate
----
<td>245,510
</td>
245,510

----
<td>42,521
</td>
42,521

----
<td>873</td>
873
----
<td><span class="datasortkey" data-sort-value="United States"><span class="flagicon"><span class="mw-image-border" typeof="mw:File"><span><img alt="" class="mw-file-element" data-file-height="650" data-file-width="1235" decoding="async" height="12" src="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/23px-Flag_of_the_United_States.svg.png" srcset="//upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/35px-Flag_of_the_United_States.svg.png 1.5x, //upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/46px-Flag_of_the_United_States.svg.png 2x" width="23"/></span></span> </span><a href="/wiki/United_States" title="United States">United States</a></span>


In [None]:
my_table = html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
my_table.select_one('td')

<td><a href="/wiki/Transamerica_Corporation" title="Transamerica Corporation">Transamerica Corporation </a></td>

#### 7. Storing the data

Now that we know exactly where the rows and columns are stored, we are ready to extract them and store them in a dictionary.

Let's begin by creating:
1. a list of items that will be the column headers and
2. a dictionary whose keys are the same column headers and whose values are empty lists, which we will fill with the data we scrape.

In [None]:
# parse the table and convert to Python dictionary
mytable_dict = { 'Rank':[], 'Name':[], 'Industry':[], 'Revenue':[], 'NetIncome':[], 'TotalAssets':[], 'Headquarters':[] }

for idx, item in enumerate(mytable_dict.keys()):
    print(idx, item)

0 Rank
1 Name
2 Industry
3 Revenue
4 NetIncome
5 TotalAssets
6 Headquarters


In [None]:
for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])

    for idx, item in enumerate(mytable_dict.keys()):
        mytable_dict[item].append( my_data[idx].text )
        print(idx, item)
        print('-----------')
        print(mytable_dict[item])

0 Rank
-----------
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '1\n', '2\n', '3\n', '4\n', '5\n', '6\n', '7\n', '8\n', '9\n', '10\n', '11\n', '12\n', '13\n', '14\n', '15\n', '16\n', '17\n', '18\n', '19\n', '20\n', '21\n', '22\n', '23\n', '24\n', '25\n', '26\n', '27\n', '28\n', '29\n', '30\n', '31\n', '32\n', '33\n', '34\n', '35\n', '36\n', '37\n', '38\n', '39\n', '40\n', '41\n', '42\n', '43\n', '44\n', '45\n', '46\n', '47\n', '48\n', '49\n', '50\n', '51\n', '1\n', '2\n', '3\n', '4\n', '5\n', '6\n', '7\n', '8\n', '9\n', '10\n', '11\n', '12\n', '13\n', '14\n', '15\n', '16\n', '17\n', '18\n', '19\n', '20\n', '21\n', '22\n', '23\n', '24\n', '25\n', '26\n', '27\n', '28\n', '29\n', '30\n', '31\n', '32\n', '33\n', '34\n', '35\n', '36\n', '37\n

The raw text of some cells ends with a newline character, so the next cell starts again from an empty dictionary and calls strip() on each value.

In [None]:
# parse the table and convert to Python dictionary
mytable_dict = { 'Rank':[], 'Name':[], 'Industry':[], 'Revenue':[], 'NetIncome':[], 'TotalAssets':[], 'Headquarters':[] }

for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])

    for idx, item in enumerate(mytable_dict.keys()):
        mytable_dict[item].append( my_data[idx].text.strip() )
print(mytable_dict)

{'Rank': ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51'], 'Name': ['Transamerica Corporation', 'Ping An Insurance Group', 'ICBC', 'China Construction Bank', 'Agricultural Bank of China', 'China Life Insurance', 'Allianz', 'Bank of China', 'JP Morgan Chase', 'AXA', 'Fannie Mae', 'Generali Group', 'Bank of America', 'Citigroup', "People's Insurance Company", 'Crédit Agricole', 'BNP Paribas', 'HSBC', 'Wells Fargo', 'State Farm', 'Nippon Life', 'Munich Re', 'Dai-ichi Life', 'Banco Santander', 'MetLife', 'Bank of Communications', 'Freddie Mac', 'Legal & General Group', 'Brookfield Asset Management', 'Aviva', 'China Pacific Insurance', 'China Merchants Bank', 'Zurich Insurance Group', 'Manulife Financial', 'Aegon', 'Prudential Financial', 'Mitsub
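
The loop above indexes my_data by position, which raises an IndexError if a row has fewer cells than there are columns. One possible alternative is to build one dictionary per row and skip rows of unexpected length. The function name table_to_records and the sample table are hypothetical.

```python
from bs4 import BeautifulSoup


def table_to_records(table, columns):
    """Return one dict per data row, skipping rows of unexpected length.

    Hypothetical helper written for illustration only.
    """
    records = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = [c.get_text(strip=True) for c in row.find_all(["th", "td"])]
        if len(cells) != len(columns):
            continue  # e.g. a footnote or spanning row
        records.append(dict(zip(columns, cells)))
    return records


# Made-up table with one irregular row.
demo_html = """<table>
<tr><th>Rank</th><th>Company</th><th>Industry</th></tr>
<tr><th>1</th><td>Alpha Bank</td><td>Banking</td></tr>
<tr><td colspan="3">footnote row</td></tr>
<tr><th>2</th><td>Beta Insurance
</td><td>Insurance</td></tr>
</table>"""
demo_table = BeautifulSoup(demo_html, "html.parser").find("table")
records = table_to_records(demo_table, ["Rank", "Name", "Industry"])
print(records)
```

A list of dictionaries like this can be passed directly to pd.DataFrame, just like the dictionary of lists used in this notebook.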

In [None]:
import pandas as pd

dataframe = pd.DataFrame(mytable_dict)
dataframe.head()

| | Rank | Name | Industry | Revenue | NetIncome | TotalAssets | Headquarters |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Transamerica Corporation | Conglomerate | 245510 | 42521 | 873 | United States |
| 1 | 2 | Ping An Insurance Group | Insurance | 191509 | 20738 | 1460 | China |
| 2 | 3 | ICBC | Banking | 182794 | 45783 | 5110 | China |
| 3 | 4 | China Construction Bank | Banking | 172000 | 39282 | 4311 | China |
| 4 | 5 | Agricultural Bank of China | Banking | 153884 | 31293 | 4169 | China |
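
Every value scraped from the page is a string, and the raw cells contain thousands separators such as '245,510'. If the numeric columns are needed for calculations, they can be converted as in the sketch below; the small demo_df and its values are made up for illustration.

```python
import pandas as pd

# Made-up frame that mimics scraped string values.
demo_df = pd.DataFrame({
    "Name": ["Alpha Bank", "Beta Insurance"],
    "Revenue": ["245,510", "191,509"],
    "TotalAssets": ["873", "1,460"],
})

for col in ["Revenue", "TotalAssets"]:
    # Drop the thousands separator, then convert; unparsable values become NaN.
    cleaned = demo_df[col].str.replace(",", "", regex=False)
    demo_df[col] = pd.to_numeric(cleaned, errors="coerce")

print(demo_df.dtypes)
print(demo_df["Revenue"].sum())
```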


## Summary

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [1]:
# make an HTTP request and convert the text of the response object into a BeautifulSoup object
url = 'https://en.wikipedia.org/wiki/List_of_largest_financial_services_companies_by_revenue'
r = requests.get(url)
html_content = r.text

# requests never sets r.text to None, so check the HTTP status instead
if r.ok:
    html_soup = BeautifulSoup(html_content, "html.parser")
else:
    raise Exception('Error getting data from {} (status {})'.format(url, r.status_code))

# isolate the table we want and save it into a dataframe
my_table = html_soup.find('table', {'class': 'wikitable sortable plainrowheads'})
mytable_dict = { 'Rank':[], 'Name':[], 'Industry':[], 'Revenue':[], 'NetIncome':[], 'TotalAssets':[], 'Headquarters':[] }

for eachrow in my_table('tr')[1:]:
    my_data = eachrow(['th','td'])

    for idx, item in enumerate(mytable_dict.keys()):
        mytable_dict[item].append( my_data[idx].text.strip() )

dataframe = pd.DataFrame(mytable_dict)
dataframe
