# Challenges Scraping Non-Tabular Data

On <a href="https://sandeepmj.github.io/scrape-example-page">this demo page</a> I've reproduced several variations of issues we are likely to encounter when scraping.

- Review scraping a well-organized page.
- Dynamically getting column names.
- Scraping a challenging page.
- Excluding multi-classes.


Let's start by scraping <a href="https://sandeepmj.github.io/scrape-example-page/#organized">the organized CEO data</a>.

In [1]:
# importing libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [2]:
# target URL
url = "https://sandeepmj.github.io/scrape-example-page/"

In [3]:
# response
response = requests.get(url)
response

<Response [200]>

In [4]:
soup = BeautifulSoup(response.text, "html.parser")

In [5]:
# fetch only section needed
organized = soup.find(id="organized")
organized

<section id="organized">
<h2>Organized - Top 5 Compensated CEOs in 2018</h2>
<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>
<div class="ceo">
<p class="rank">Rank: 2</p>
<p class="name">Name: Frank Bisignano</p>
<p class="annual_compensation">Annual Compensation: $102.2 million</p>
<p class="company">Company: First Data (FDC)</p>
</div>
<div class="ceo">
<p class="rank">Rank: 3</p>
<p class="name">Name: Michael Rapino</p>
<p class="annual_compensation">Annual Compensation: $70.6 million</p>
<p class="company">Company: Live Nation Entertainment (LYV)</p>
</div>
<div class="ceo">
<p class="rank">Rank: 4</p>
<p class="name">Name: Leslie Moonves</p>
<p class="annual_compensation">Annual Compensation: 68.4 million</p>
<p class="company">Company: CBS</p>
</div>
<div class="ceo">
<p class="rank">Rank: 5</p>
<p class="name">Name: Gregory Ma

In [6]:
# isolate ceos

ceos_list = organized.find_all("div", class_="ceo")
ceos_list

[<div class="ceo">
 <p class="rank">Rank: 1</p>
 <p class="name">Name: Hock E. Tan</p>
 <p class="annual_compensation">Annual Compensation: $103.2 million</p>
 <p class="company">Company: Broadcom</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 2</p>
 <p class="name">Name: Frank Bisignano</p>
 <p class="annual_compensation">Annual Compensation: $102.2 million</p>
 <p class="company">Company: First Data (FDC)</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 3</p>
 <p class="name">Name: Michael Rapino</p>
 <p class="annual_compensation">Annual Compensation: $70.6 million</p>
 <p class="company">Company: Live Nation Entertainment (LYV)</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 4</p>
 <p class="name">Name: Leslie Moonves</p>
 <p class="annual_compensation">Annual Compensation: 68.4 million</p>
 <p class="company">Company: CBS</p>
 </div>,
 <div class="ceo">
 <p class="rank">Rank: 5</p>
 <p class="name">Name: Gregory Maffei</p>
 <p class="annual_compensation">Annua

In [7]:
# for loop

ceo_names = []
ceo_salaries = []
ceo_companies = []

for ceo in ceos_list:
    names = ceo.find("p", class_="name").get_text().replace("Name: ", "")
    salary = ceo.find("p", class_="annual_compensation").get_text().replace("Annual Compensation: ", "").replace("$", "").replace(" million", "")
    salary = float(salary) * 1_000_000
    company = ceo.find("p", class_="company").get_text().replace("Company: ", "")
    ceo_names.append(names)
    ceo_salaries.append(salary)
    ceo_companies.append(company)
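
The chained `.replace()` calls above work because every compensation string follows the same "number, then `million`" shape. As a minimal sketch of an alternative, the hypothetical helper `parse_compensation` below (not part of the original notebook) pulls the number out with a regular expression, so a missing dollar sign does not matter:

```python
import re

def parse_compensation(text):
    """Convert text such as 'Annual Compensation: $103.2 million' to dollars."""
    # capture the number that sits directly before the word "million"
    match = re.search(r"(\d+(?:\.\d+)?)\s*million", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no compensation figure found in {text!r}")
    return float(match.group(1)) * 1_000_000

samples = [
    "Annual Compensation: $103.2 million",
    "Annual Compensation: 68.4 million",  # no dollar sign, as on the demo page
]
parsed = [parse_compensation(sample) for sample in samples]
print(parsed)
```

Raising an error on an unmatched string is a design choice: it surfaces a broken pattern immediately instead of silently storing a bad value.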

In [8]:
# list comprehension (each `ceo` is one <div class="ceo">)

names_lc = [ ceo.find("p", class_="name").get_text().replace("Name: ", "") for ceo in ceos_list ]
salary_lc = [ float(ceo.find("p", class_="annual_compensation").get_text().replace("Annual Compensation: ", "").replace("$", "").replace(" million", "")) * 1_000_000 for ceo in ceos_list ]
company_lc = [ ceo.find("p", class_="company").get_text().replace("Company: ", "") for ceo in ceos_list ]

In [9]:
# merging lists into a main list
main_list = []
for all_data in zip(ceo_names, ceo_salaries, ceo_companies):
    main_list.append(all_data)

main_list

[('Hock E. Tan', 103200000.0, 'Broadcom'),
 ('Frank Bisignano', 102200000.0, 'First Data (FDC)'),
 ('Michael Rapino', 70600000.0, 'Live Nation Entertainment (LYV)'),
 ('Leslie Moonves', 68400000.0, 'CBS'),
 ('Gregory Maffei', 67200000.0, 'Liberty Media & Qurate Retail Group')]

In [10]:
# using built-in functions (`list` and `zip`) instead of a `for` loop
zip(names_lc, salary_lc, company_lc)

<zip at 0x11ba91a80>

In [11]:
list(zip(names_lc, salary_lc, company_lc))

[('Hock E. Tan', 103200000.0, 'Broadcom'),
 ('Frank Bisignano', 102200000.0, 'First Data (FDC)'),
 ('Michael Rapino', 70600000.0, 'Live Nation Entertainment (LYV)'),
 ('Leslie Moonves', 68400000.0, 'CBS'),
 ('Gregory Maffei', 67200000.0, 'Liberty Media & Qurate Retail Group')]
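
The `<zip at 0x...>` output appears because `zip` returns a lazy iterator; the tuples are only built when something consumes it, such as `list()`. The short sketch below, using two made-up sample lists, illustrates two consequences that can surprise you when scraping: a `zip` object can be consumed only once, and it stops at the shortest input.

```python
names = ["Hock E. Tan", "Frank Bisignano"]
companies = ["Broadcom", "First Data (FDC)"]

pairs = zip(names, companies)  # nothing is paired yet; this is a lazy iterator
first_pass = list(pairs)       # consuming the iterator builds the tuples
second_pass = list(pairs)      # the iterator is now exhausted

# zip stops at the shortest input, so a missing value silently drops a row
short = list(zip(names, ["Broadcom"]))

print(first_pass)
print(second_pass)
print(short)
```

If the scraped lists could differ in length, comparing their lengths before zipping is a cheap safeguard.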

In [12]:
# convert to df
df = pd.DataFrame(main_list, columns = [ "Name", "Annual Compensation", "Company" ])
df

| | Name | Annual Compensation | Company |
|---|---|---|---|
| 0 | Hock E. Tan | 103200000.0 | Broadcom |
| 1 | Frank Bisignano | 102200000.0 | First Data (FDC) |
| 2 | Michael Rapino | 70600000.0 | Live Nation Entertainment (LYV) |
| 3 | Leslie Moonves | 68400000.0 | CBS |
| 4 | Gregory Maffei | 67200000.0 | Liberty Media & Qurate Retail Group |


In [13]:
# trying different BeautifulSoup functions...
# so we don't have to type each column name, eventually
ceos_list[0]["class"]

['ceo']

In [14]:
ptags = ceos_list[0].find_all("p")
ptags[2]["class"]

['annual_compensation']

In [15]:
col_names = []
for ptag in ptags:
    col_names.append(ptag["class"][0])

col_names

['rank', 'name', 'annual_compensation', 'company']

In [16]:
name_of_list = []

# how to use it for df
pd.DataFrame(name_of_list, columns = col_names)

| | rank | name | annual_compensation | company |
|---|---|---|---|---|
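
Once the column names come from the `class` attributes, the same idea can fill the rows as well. The following is a minimal sketch, assuming a two-CEO excerpt of the organized section stored in a hypothetical `sample_html` string; each `<p>` tag contributes one key/value pair, so no column name is typed by hand.

```python
import pandas as pd
from bs4 import BeautifulSoup

# two-CEO excerpt of the organized section, used here as stand-in data
sample_html = """
<section id="organized">
<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>
<div class="ceo">
<p class="rank">Rank: 2</p>
<p class="name">Name: Frank Bisignano</p>
<p class="annual_compensation">Annual Compensation: $102.2 million</p>
<p class="company">Company: First Data (FDC)</p>
</div>
</section>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

rows = []
for ceo in sample_soup.find_all("div", class_="ceo"):
    row = {}
    for ptag in ceo.find_all("p"):
        col_name = ptag["class"][0]  # first class name becomes the column name
        # keep only the text after the leading "Label: " prefix
        row[col_name] = ptag.get_text(strip=True).split(": ", 1)[-1]
    rows.append(row)

sample_df = pd.DataFrame(rows)
print(sample_df)
```

Because each row is a dictionary, `pd.DataFrame` takes the column names from the keys and no `columns=` argument is needed.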


## Reality - not handed to you so cleanly
### Dealing with disorganized data

In [17]:
target = soup.find(id="disorganized")
target

<section id="disorganized">
<h2>Disorganized - Top 5 Compensated CEOs in 2018</h2>
<div class="ceo">
<span>Rank:</span><dt> 1</dt>
<span>Name:</span><dt> Hock E. Tan</dt>
<span>Annual compensation:</span><dt> $103.2 million</dt>
<span>Company:</span><dt> Broadcom</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 2</dt>
<span>Name:</span><dt> Frank Bisignano</dt>
<span>Annual Compensation:</span><dt> $102.2 million</dt>
<span>Company:</span><dt> First Data (FDC)</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 3</dt>
<span>Name:</span><dt> Michael Rapino</dt>
<span>Annual Compensation:</span><dt> $70.6 million</dt>
<span>Company:</span><dt> Live Nation Entertainment (LYV)</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 4</dt>
<span>Name:</span><dt> Leslie Moonves</dt>
<span>Annual Compensation:</span><dt> 68.4 million</dt>
<span>Company:</span><dt> CBS</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 5</dt>
<span>Name:</span> <dt> Gregory Maffei</dt>
<span>Annual Com

In [18]:
ceos = target.find_all("div", class_="ceo")
ceos

[<div class="ceo">
 <span>Rank:</span><dt> 1</dt>
 <span>Name:</span><dt> Hock E. Tan</dt>
 <span>Annual compensation:</span><dt> $103.2 million</dt>
 <span>Company:</span><dt> Broadcom</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 2</dt>
 <span>Name:</span><dt> Frank Bisignano</dt>
 <span>Annual Compensation:</span><dt> $102.2 million</dt>
 <span>Company:</span><dt> First Data (FDC)</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 3</dt>
 <span>Name:</span><dt> Michael Rapino</dt>
 <span>Annual Compensation:</span><dt> $70.6 million</dt>
 <span>Company:</span><dt> Live Nation Entertainment (LYV)</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 4</dt>
 <span>Name:</span><dt> Leslie Moonves</dt>
 <span>Annual Compensation:</span><dt> 68.4 million</dt>
 <span>Company:</span><dt> CBS</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 5</dt>
 <span>Name:</span> <dt> Gregory Maffei</dt>
 <span>Annual Compensation:</span><dt> $67.2 million</dt>
 <span>Com

In [19]:
# personal exercise
ranks_dt = [ rank.find_all("dt")[0].get_text().strip() for rank in ceos ]
names_dt = [ name.find_all("dt")[1].get_text().strip() for name in ceos ]
salaries_dt = [ annual_comp.find_all("dt")[2].get_text().strip() for annual_comp in ceos ] 
companies_dt = [ company.find_all("dt")[3].get_text().strip() for company in ceos ]

In [20]:
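# personal exercise: build column labels from the <span> tags
columns = ceos[0].find_all("span")
columns_list = []
for column in columns:
    # strip whitespace, then drop the trailing colon ("Rank:" -> "Rank")
    columns_list.append(column.get_text(strip=True).rstrip(":"))

columns_list

['Rank', 'Name', 'Annual compensation', 'Company']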
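
The label text in the `<span>` tags is not uniform across the disorganized section (for example `Rank:` versus `Rank: `, and `Annual compensation:` versus `Annual Compensation:`). As a sketch under the assumption that every `<span>` label is followed by exactly one `<dt>` value, the labels can be normalized and paired with the values using `zip`; `sample_html` and `record` are illustrative names only.

```python
from bs4 import BeautifulSoup

# first CEO block of the disorganized section, used here as stand-in data
sample_html = """
<div class="ceo">
<span>Rank:</span><dt> 1</dt>
<span>Name:</span><dt> Hock E. Tan</dt>
<span>Annual compensation:</span><dt> $103.2 million</dt>
<span>Company:</span><dt> Broadcom</dt>
</div>
"""
sample_ceo = BeautifulSoup(sample_html, "html.parser").find("div", class_="ceo")

# normalize labels: trim whitespace, drop the trailing colon, lowercase
labels = [
    span.get_text(strip=True).rstrip(":").lower()
    for span in sample_ceo.find_all("span")
]
values = [dt.get_text(strip=True) for dt in sample_ceo.find_all("dt")]

record = dict(zip(labels, values))
print(record)
```

Lowercasing the labels means `Annual compensation` and `Annual Compensation` end up as the same key when the approach is repeated for every CEO.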

In [21]:
# in-class `for loop` solution

ceo_data_list = []
for ceo in ceos:
    all_targets = ceo.find_all("dt")
    rank = all_targets[0].get_text(strip=True)
    name = all_targets[1].get_text(strip=True)
    annual_comp = all_targets[2].get_text(strip=True)
    company = all_targets[3].get_text(strip=True)
    ceo_data_list.append({"rank": rank, "name": name, "annual compensation": annual_comp, "company": company})

ceo_data_list

[{'rank': '1',
  'name': 'Hock E. Tan',
  'annual compensation': '$103.2 million',
  'company': 'Broadcom'},
 {'rank': '2',
  'name': 'Frank Bisignano',
  'annual compensation': '$102.2 million',
  'company': 'First Data (FDC)'},
 {'rank': '3',
  'name': 'Michael Rapino',
  'annual compensation': '$70.6 million',
  'company': 'Live Nation Entertainment (LYV)'},
 {'rank': '4',
  'name': 'Leslie Moonves',
  'annual compensation': '68.4 million',
  'company': 'CBS'},
 {'rank': '5',
  'name': 'Gregory Maffei',
  'annual compensation': '$67.2 million',
  'company': 'Liberty Media & Qurate Retail Group'}]
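
A list of dictionaries like the one above can go straight into `pd.DataFrame`, with the keys becoming column names. The sketch below uses two of the records shown in the output (copied into a hypothetical `sample_records` list) and then converts the text columns to numbers, since everything scraped arrives as a string.

```python
import pandas as pd

# two records copied from the output above, used as stand-in data
sample_records = [
    {"rank": "1", "name": "Hock E. Tan", "annual compensation": "$103.2 million", "company": "Broadcom"},
    {"rank": "4", "name": "Leslie Moonves", "annual compensation": "68.4 million", "company": "CBS"},
]

sample_df = pd.DataFrame(sample_records)

# scraped values are strings; convert the numeric columns explicitly
sample_df["rank"] = sample_df["rank"].astype(int)
sample_df["annual compensation"] = (
    sample_df["annual compensation"]
    .str.replace("$", "", regex=False)         # literal dollar sign, not a regex anchor
    .str.replace(" million", "", regex=False)
    .astype(float)
    * 1_000_000
)

print(sample_df.dtypes)
print(sample_df)
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.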

In [22]:
# in-class challenge: working with tuples

ceo_values = []
for ceo in ceos:
    data_ceo = ceo.find_all("dt")[0:4]
    ceo_tuple = tuple(dt.get_text(strip=True) for dt in data_ceo)
    ceo_values.append(ceo_tuple)

ceo_values

[('1', 'Hock E. Tan', '$103.2 million', 'Broadcom'),
 ('2', 'Frank Bisignano', '$102.2 million', 'First Data (FDC)'),
 ('3', 'Michael Rapino', '$70.6 million', 'Live Nation Entertainment (LYV)'),
 ('4', 'Leslie Moonves', '68.4 million', 'CBS'),
 ('5',
  'Gregory Maffei',
  '$67.2 million',
  'Liberty Media & Qurate Retail Group')]

In [23]:
# in-class challenge: working with tuples
# more compact way: a generator expression inside a list comprehension

ceo_values = [
    tuple(dt.get_text(strip=True) for dt in ceo.find_all("dt"))
    for ceo in ceos
]

ceo_values

[('1', 'Hock E. Tan', '$103.2 million', 'Broadcom'),
 ('2', 'Frank Bisignano', '$102.2 million', 'First Data (FDC)'),
 ('3', 'Michael Rapino', '$70.6 million', 'Live Nation Entertainment (LYV)'),
 ('4', 'Leslie Moonves', '68.4 million', 'CBS'),
 ('5',
  'Gregory Maffei',
  '$67.2 million',
  'Liberty Media & Qurate Retail Group')]

### The same steps each time:

* Is the content on the page (use ```Reveal Source```)?
* Where and how is the content held on the page?
* Which classes and IDs do we target?
* Is there a pattern?
* Is there anything that breaks the pattern?
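
The last two questions on the checklist can also be asked in code once the data is scraped. The sketch below is one possible check, not a complete validator: it uses three of the tuples from the output above (copied into a hypothetical `ceo_rows` list) and flags rows that have the wrong number of fields or a compensation value without a dollar sign.

```python
# three tuples copied from the scraped output above, used as stand-in data
ceo_rows = [
    ("1", "Hock E. Tan", "$103.2 million", "Broadcom"),
    ("4", "Leslie Moonves", "68.4 million", "CBS"),
    ("5", "Gregory Maffei", "$67.2 million", "Liberty Media & Qurate Retail Group"),
]

expected_fields = 4  # rank, name, annual compensation, company
problems = []
for row in ceo_rows:
    if len(row) != expected_fields:
        problems.append((row, "unexpected number of fields"))
    elif not row[2].startswith("$"):
        problems.append((row, "compensation is missing the dollar sign"))

print(problems)
```

Here the check catches the Leslie Moonves row, the same inconsistency that is visible in the raw HTML.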

# Excluding classes

Most modern sites have tags that include multiple classes.

What if you want to target tags that carry a single class, but that class also appears alongside other classes on tags that hold different types of content?

For example, capture the ```Excluding Some Classes``` section of our page in a ```BeautifulSoup``` object.



In [24]:
## RUN this cell that holds some html
some_html = '''<li> Silly List </li>
<li class="a"> A alone  - UNWANTED </li>
<li class="a z"> A and Z  - UNWANTED </li>
<li class="z"> Z first - my target</li>
<li class="b z"> B and Z  - UNWANTED</li>
<li class="x z"> X and Z - UNWANTED </li>
<li class="z"> Z second - my target</li>'''
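
One possible way to isolate the targets is sketched below; it is an illustration rather than necessarily the approach used in class. `find_all("li", class_="z")` matches every tag whose class list contains `z`, so it also returns the unwanted multi-class items. Passing a filter function that compares the whole class list to `["z"]` keeps only the tags where `z` is the sole class. The name `html_soup` is an assumption made for this example.

```python
from bs4 import BeautifulSoup

# same snippet as in the cell above, repeated so this example runs on its own
some_html = '''<li> Silly List </li>
<li class="a"> A alone  - UNWANTED </li>
<li class="a z"> A and Z  - UNWANTED </li>
<li class="z"> Z first - my target</li>
<li class="b z"> B and Z  - UNWANTED</li>
<li class="x z"> X and Z - UNWANTED </li>
<li class="z"> Z second - my target</li>'''

html_soup = BeautifulSoup(some_html, "html.parser")

# matches any <li> that has "z" among its classes, including "a z", "b z", "x z"
any_z = html_soup.find_all("li", class_="z")

# matches only <li> tags whose class list is exactly ["z"]
only_z = html_soup.find_all(
    lambda tag: tag.name == "li" and tag.get("class") == ["z"]
)

targets = [li.get_text(strip=True) for li in only_z]
print(len(any_z))
print(targets)
```

`tag.get("class")` returns `None` for the first `<li>`, which has no class at all, so that item is excluded without raising an error.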



### Back to our CEOs