# Challenges Scraping Non-Tabular Data

On <a href="https://sandeepmj.github.io/scrape-example-page">this demo page</a> I've reproduced several variations of issues we are likely to encounter when scraping.

- Reviewing a scrape of a well-organized page.
- Dynamically getting column names.
- Scraping a challenging page.
- Excluding multi-classes.


Let's start by scraping <a href="https://sandeepmj.github.io/scrape-example-page/#organized">the organized CEO data</a>.

In [29]:
#import libraries
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [30]:
#target url
url = "https://sandeepmj.github.io/scrape-example-page/#organized"

In [31]:
#response

response = requests.get(url)

In [32]:
#turn into soup
soup = BeautifulSoup(response.text, "html.parser")

In [44]:
organized = soup.find(id="organized")
organized

<section id="organized">
<h2>Organized - Top 5 Compensated CEOs in 2018</h2>
<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>
<div class="ceo">
<p class="rank">Rank: 2</p>
<p class="name">Name: Frank Bisignano</p>
<p class="annual_compensation">Annual Compensation: $102.2 million</p>
<p class="company">Company: First Data (FDC)</p>
</div>
<div class="ceo">
<p class="rank">Rank: 3</p>
<p class="name">Name: Michael Rapino</p>
<p class="annual_compensation">Annual Compensation: $70.6 million</p>
<p class="company">Company: Live Nation Entertainment (LYV)</p>
</div>
<div class="ceo">
<p class="rank">Rank: 4</p>
<p class="name">Name: Leslie Moonves</p>
<p class="annual_compensation">Annual Compensation: 68.4 million</p>
<p class="company">Company: CBS</p>
</div>
<div class="ceo">
<p class="rank">Rank: 5</p>
<p class="name">Name: Gregory Ma

In [45]:
type(organized)

bs4.element.Tag

In [62]:
ceos = organized.find_all("div", class_="ceo")

In [63]:
len(ceos)

5

In [72]:
#find all the names using a for loop (FL)
names_FL = []
for ceo in ceos:
    names_FL.append(ceo.find("p", class_= "name").get_text().replace("Name: ",""))

names_FL

['Hock E. Tan',
 'Frank Bisignano',
 'Michael Rapino',
 'Leslie Moonves',
 'Gregory Maffei']

In [79]:
#same result using a list comprehension (LC); each item in ceos is one CEO's div
names_lc = [ceo.find("p", class_="name").get_text().replace("Name: ","") for ceo in ceos]
names_lc

['Hock E. Tan',
 'Frank Bisignano',
 'Michael Rapino',
 'Leslie Moonves',
 'Gregory Maffei']

In [91]:
#company for each CEO, with the "Company: " label removed
companies_lc = [ceo.find("p", class_="company").get_text().replace("Company: ","") for ceo in ceos]
companies_lc

['Broadcom',
 'First Data (FDC)',
 'Live Nation Entertainment (LYV)',
 'CBS',
 'Liberty Media & Qurate Retail Group']

In [87]:
#compensation for each CEO, with the "Annual Compensation: " label removed
compensation_lc = [ceo.find("p", class_="annual_compensation").get_text().replace("Annual Compensation: ","") for ceo in ceos]
compensation_lc

['$103.2 million',
 '$102.2 million',
 '$70.6 million',
 '68.4 million',
 '$67.2 million']
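
The compensation values are still strings, and one of them (`'68.4 million'`) is missing its dollar sign. Here is a minimal sketch of turning them into numbers, assuming every value follows the `<amount> million` pattern seen above. `to_millions` is a hypothetical helper name, not part of the original notebook.

```python
import re

# Sample values copied from the scraped output above.
compensation = ["$103.2 million", "$102.2 million", "$70.6 million",
                "68.4 million", "$67.2 million"]

def to_millions(text):
    """Return the numeric part of a string like '$103.2 million' as a float."""
    # The dollar sign is optional because the regex only looks for the digits.
    match = re.search(r"(\d+(?:\.\d+)?)\s*million", text)
    if match is None:
        raise ValueError(f"Unexpected compensation format: {text!r}")
    return float(match.group(1))

compensation_millions = [to_millions(value) for value in compensation]
print(compensation_millions)
```

Raising an error on an unexpected format, rather than silently skipping it, makes a pattern break visible as soon as it happens.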

In [95]:
#combine the three lists into one list of tuples using a for loop
ceo_list = []

for item in zip(names_lc, companies_lc, compensation_lc):
    ceo_list.append(item)

ceo_list

[('Hock E. Tan', 'Broadcom', '$103.2 million'),
 ('Frank Bisignano', 'First Data (FDC)', '$102.2 million'),
 ('Michael Rapino', 'Live Nation Entertainment (LYV)', '$70.6 million'),
 ('Leslie Moonves', 'CBS', '68.4 million'),
 ('Gregory Maffei', 'Liberty Media & Qurate Retail Group', '$67.2 million')]

In [111]:
ceo_list = list(zip(names_lc, companies_lc, compensation_lc))

In [114]:
df = pd.DataFrame(ceo_list, columns = ["Name", "Company", "Annual Compensation"])
df

| | Name | Company | Annual Compensation |
|---|---|---|---|
| 0 | Hock E. Tan | Broadcom | $103.2 million |
| 1 | Frank Bisignano | First Data (FDC) | $102.2 million |
| 2 | Michael Rapino | Live Nation Entertainment (LYV) | $70.6 million |
| 3 | Leslie Moonves | CBS | 68.4 million |
| 4 | Gregory Maffei | Liberty Media & Qurate Retail Group | $67.2 million |


In [116]:
#What if you have tons of columns and you don't want to type them all out?
ceos[0]

<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>

In [None]:
#in this html: <div class="some_class">This </div>
#div is a tag
#class is an attribute
#some_class is the value of that attribute

In [122]:
ceos[0]["class"]

['ceo']

In [120]:
my_ptags = ceos[0].find_all("p")
my_ptags

[<p class="rank">Rank: 1</p>,
 <p class="name">Name: Hock E. Tan</p>,
 <p class="annual_compensation">Annual Compensation: $103.2 million</p>,
 <p class="company">Company: Broadcom</p>]

In [123]:
my_ptags[0]["class"]

['rank']
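
Note that `["class"]` returns a list, not a string, which is why the outputs above are wrapped in brackets. The sketch below shows the difference on a made-up tag (the `top` class and the `id` are assumptions added for illustration and do not appear on the demo page).

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two classes and an id, for illustration only.
html = '<p class="rank top" id="first">Rank: 1</p>'
tag = BeautifulSoup(html, "html.parser").p

classes = tag["class"]      # multi-valued attribute: a list of strings
tag_id = tag["id"]          # single-valued attribute: a plain string
missing = tag.get("title")  # .get() returns None instead of raising KeyError

print(classes, tag_id, missing)
```

Because a tag can carry several classes, indexing with `[0]` only gives the first one.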

In [131]:
col_names = []
for ptag in my_ptags:
#    print(ptag["class"])
    col_names.append(ptag["class"][0])
    
col_names

['rank', 'name', 'annual_compensation', 'company']

In [137]:
#column names must follow the order of the values in each ceo_list tuple:
#(name, company, annual compensation)
updated_cols = ['name', 'company', 'annual_compensation']

In [138]:
pd.DataFrame(ceo_list, columns = updated_cols)

| | name | company | annual_compensation |
|---|---|---|---|
| 0 | Hock E. Tan | Broadcom | $103.2 million |
| 1 | Frank Bisignano | First Data (FDC) | $102.2 million |
| 2 | Michael Rapino | Live Nation Entertainment (LYV) | $70.6 million |
| 3 | Leslie Moonves | CBS | 68.4 million |
| 4 | Gregory Maffei | Liberty Media & Qurate Retail Group | $67.2 million |


In [141]:
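#col_names[1:4] is ['name', 'annual_compensation', 'company'],
#so zip the lists in that same order to keep labels and values aligned
pd.DataFrame(list(zip(names_lc, compensation_lc, companies_lc)), columns = col_names[1:4])

| | name | annual_compensation | company |
|---|---|---|---|
| 0 | Hock E. Tan | $103.2 million | Broadcom |
| 1 | Frank Bisignano | $102.2 million | First Data (FDC) |
| 2 | Michael Rapino | $70.6 million | Live Nation Entertainment (LYV) |
| 3 | Leslie Moonves | 68.4 million | CBS |
| 4 | Gregory Maffei | $67.2 million | Liberty Media & Qurate Retail Group |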
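
Building the column names and the values separately makes it easy for the two to drift out of order. One possible alternative, sketched here on a two-record copy of the organized HTML, is to build each row as a dictionary keyed by the class name, so every value stays attached to its own label. The `rows` variable and the split on `": "` are choices made for this sketch, not the notebook's method.

```python
from bs4 import BeautifulSoup

# Two records copied from the "organized" section, enough to show the pattern.
html = '''<div class="ceo">
<p class="rank">Rank: 1</p>
<p class="name">Name: Hock E. Tan</p>
<p class="annual_compensation">Annual Compensation: $103.2 million</p>
<p class="company">Company: Broadcom</p>
</div>
<div class="ceo">
<p class="rank">Rank: 2</p>
<p class="name">Name: Frank Bisignano</p>
<p class="annual_compensation">Annual Compensation: $102.2 million</p>
<p class="company">Company: First Data (FDC)</p>
</div>'''

soup = BeautifulSoup(html, "html.parser")

rows = []
for ceo in soup.find_all("div", class_="ceo"):
    row = {}
    for ptag in ceo.find_all("p"):
        key = ptag["class"][0]  # first class name becomes the column name
        # Drop the "Label: " prefix by splitting once on the first ": ".
        value = ptag.get_text().split(": ", 1)[1]
        row[key] = value
    rows.append(row)

print(rows)
```

Passing `rows` to `pd.DataFrame(rows)` would then use the dictionary keys as column names without typing any of them out.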


In [143]:
#Let's scrape the disorganized section

In [147]:
target = soup.find(id = "disorganized")
target

<section id="disorganized">
<h2>Disorganized - Top 5 Compensated CEOs in 2018</h2>
<div class="ceo">
<span>Rank:</span><dt> 1</dt>
<span>Name:</span><dt> Hock E. Tan</dt>
<span>Annual compensation:</span><dt> $103.2 million</dt>
<span>Company:</span><dt> Broadcom</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 2</dt>
<span>Name:</span><dt> Frank Bisignano</dt>
<span>Annual Compensation:</span><dt> $102.2 million</dt>
<span>Company:</span><dt> First Data (FDC)</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 3</dt>
<span>Name:</span><dt> Michael Rapino</dt>
<span>Annual Compensation:</span><dt> $70.6 million</dt>
<span>Company:</span><dt> Live Nation Entertainment (LYV)</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 4</dt>
<span>Name:</span><dt> Leslie Moonves</dt>
<span>Annual Compensation:</span><dt> 68.4 million</dt>
<span>Company:</span><dt> CBS</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 5</dt>
<span>Name:</span> <dt> Gregory Maffei</dt>
<span>Annual Com

In [153]:
ceos = target.find_all("div", class_="ceo")
ceos

[<div class="ceo">
 <span>Rank:</span><dt> 1</dt>
 <span>Name:</span><dt> Hock E. Tan</dt>
 <span>Annual compensation:</span><dt> $103.2 million</dt>
 <span>Company:</span><dt> Broadcom</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 2</dt>
 <span>Name:</span><dt> Frank Bisignano</dt>
 <span>Annual Compensation:</span><dt> $102.2 million</dt>
 <span>Company:</span><dt> First Data (FDC)</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 3</dt>
 <span>Name:</span><dt> Michael Rapino</dt>
 <span>Annual Compensation:</span><dt> $70.6 million</dt>
 <span>Company:</span><dt> Live Nation Entertainment (LYV)</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 4</dt>
 <span>Name:</span><dt> Leslie Moonves</dt>
 <span>Annual Compensation:</span><dt> 68.4 million</dt>
 <span>Company:</span><dt> CBS</dt>
 </div>,
 <div class="ceo">
 <span>Rank: </span><dt> 5</dt>
 <span>Name:</span> <dt> Gregory Maffei</dt>
 <span>Annual Compensation:</span><dt> $67.2 million</dt>
 <span>Com

In [161]:
for ceo in ceos:
#    print(ceo)
#    print("************")
    print(ceo.find_all("dt")[1].get_text())

 Hock E. Tan
 Frank Bisignano
 Michael Rapino
 Leslie Moonves
 Gregory Maffei


In [166]:
for ceo in ceos:
    print(ceo.find_all("dt")[3].get_text())

 Broadcom
 First Data (FDC)
 Live Nation Entertainment (LYV)
 CBS
 Liberty Media & Qurate Retail Group


In [173]:
#grab the first three dt tags for each CEO and print their text
for ceo in ceos:
    data_ceo = ceo.find_all("dt")[0:3]
    print(f"Data_CEO:{data_ceo}")
    for item in data_ceo:
        print(item.get_text())

Data_CEO:[<dt> 1</dt>, <dt> Hock E. Tan</dt>, <dt> $103.2 million</dt>]
 1
 Hock E. Tan
 $103.2 million
Data_CEO:[<dt> 2</dt>, <dt> Frank Bisignano</dt>, <dt> $102.2 million</dt>]
 2
 Frank Bisignano
 $102.2 million
Data_CEO:[<dt> 3</dt>, <dt> Michael Rapino</dt>, <dt> $70.6 million</dt>]
 3
 Michael Rapino
 $70.6 million
Data_CEO:[<dt> 4</dt>, <dt> Leslie Moonves</dt>, <dt> 68.4 million</dt>]
 4
 Leslie Moonves
 68.4 million
Data_CEO:[<dt> 5</dt>, <dt> Gregory Maffei</dt>, <dt> $67.2 million</dt>]
 5
 Gregory Maffei
 $67.2 million


In [191]:
ceo_list = []
for ceo in ceos:
    all_targets = ceo.find_all("dt")
    rank = all_targets[0].get_text(strip=True)
    name = all_targets[1].get_text(strip=True)
    annual_comp = all_targets[2].get_text(strip=True)
    company = all_targets[3].get_text(strip=True)
    ceo_list.append({"Rank":rank, "Name":name, "Annual Compensation":annual_comp, "Company":company})

ceo_list

[{'Rank': '1',
  'Name': 'Hock E. Tan',
  'Annual Compensation': '$103.2 million',
  'Company': 'Broadcom'},
 {'Rank': '2',
  'Name': 'Frank Bisignano',
  'Annual Compensation': '$102.2 million',
  'Company': 'First Data (FDC)'},
 {'Rank': '3',
  'Name': 'Michael Rapino',
  'Annual Compensation': '$70.6 million',
  'Company': 'Live Nation Entertainment (LYV)'},
 {'Rank': '4',
  'Name': 'Leslie Moonves',
  'Annual Compensation': '68.4 million',
  'Company': 'CBS'},
 {'Rank': '5',
  'Name': 'Gregory Maffei',
  'Annual Compensation': '$67.2 million',
  'Company': 'Liberty Media & Qurate Retail Group'}]
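
The loop above relies on the `dt` tags always appearing in the same position. As a hedged alternative, the labels in the `span` tags can be paired with the `dt` values so the keys come from the page itself. The sketch below runs on a two-record copy of the disorganized HTML. Normalizing the labels with `.rstrip(":")` and `.title()` is an assumption made here to cope with the inconsistent spacing and capitalization.

```python
from bs4 import BeautifulSoup

# Two records copied from the "disorganized" section, including its inconsistencies.
html = '''<div class="ceo">
<span>Rank:</span><dt> 1</dt>
<span>Name:</span><dt> Hock E. Tan</dt>
<span>Annual compensation:</span><dt> $103.2 million</dt>
<span>Company:</span><dt> Broadcom</dt>
</div>
<div class="ceo">
<span>Rank: </span><dt> 2</dt>
<span>Name:</span><dt> Frank Bisignano</dt>
<span>Annual Compensation:</span><dt> $102.2 million</dt>
<span>Company:</span><dt> First Data (FDC)</dt>
</div>'''

soup = BeautifulSoup(html, "html.parser")

records = []
for ceo in soup.find_all("div", class_="ceo"):
    # Clean each label: trim whitespace, drop the trailing colon, unify capitalization.
    labels = [s.get_text(strip=True).rstrip(":").title() for s in ceo.find_all("span")]
    values = [d.get_text(strip=True) for d in ceo.find_all("dt")]
    # Pair each label with the value that follows it.
    records.append(dict(zip(labels, values)))

print(records)
```

This only works while every `span` is followed by exactly one `dt`; a record with a missing value would silently shift the pairs, so it is worth checking that both lists have the same length.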

### The same steps each time:

* Is the content on the page (use ```Reveal Source```)?
* Where and how is the content held on the page?
* Which classes and IDs do we target?
* Is there a pattern?
* Is there anything that breaks the pattern?
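
Part of this checklist can be run as code. The sketch below works on a small made-up fragment in which the third record is deliberately missing a field. The fragment and the variable names are assumptions for illustration, not content from the demo page.

```python
from collections import Counter
from bs4 import BeautifulSoup

# Hypothetical fragment: the third record intentionally breaks the pattern.
html = '''<section id="organized">
<div class="ceo"><p class="rank">Rank: 1</p><p class="name">Name: Hock E. Tan</p></div>
<div class="ceo"><p class="rank">Rank: 2</p><p class="name">Name: Frank Bisignano</p></div>
<div class="ceo"><p class="rank">Rank: 3</p></div>
</section>'''

# Is the content in the raw HTML at all?
content_present = "Hock E. Tan" in html

soup = BeautifulSoup(html, "html.parser")
section = soup.find(id="organized")

# Which tag and class combinations hold the content?
class_counts = Counter(
    (tag.name, " ".join(tag.get("class", [])))
    for tag in section.find_all(True)
)

# Is there a pattern, and does any record break it?
fields_per_record = [len(div.find_all("p")) for div in section.find_all("div", class_="ceo")]
pattern_breakers = [i for i, n in enumerate(fields_per_record) if n != fields_per_record[0]]

print(content_present, class_counts, pattern_breakers)
```

If the first check fails on a real page, the content is probably loaded by JavaScript and will not be in `response.text`.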

# Excluding classes

Most modern sites have tags that include multiple classes.

What if you want to target tags that carry a single class, but that class also appears alongside other classes on tags that hold other types of content?

For example, capture the ```Excluding Some Classes``` section of our page in a ```BeautifulSoup``` object.



In [1]:
## RUN this cell that holds some html
some_html = '''<li> Silly List </li>
<li class="a"> A alone  - UNWANTED </li>
<li class="a z"> A and Z  - UNWANTED </li>
<li class="z"> Z first - my target</li>
<li class="b z"> B and Z  - UNWANTED</li>
<li class="x z"> X and Z - UNWANTED </li>
<li class="z"> Z second - my target</li>'''
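
One possible way to keep only the tags whose sole class is `z` is sketched below. It relies on BeautifulSoup storing `class` as a list, so a tag with exactly one class `z` has `["z"]` as its value. The variable names `loose`, `exact`, and `targets` are made up for this sketch, and this is not necessarily the approach the notebook goes on to use.

```python
from bs4 import BeautifulSoup

some_html = '''<li> Silly List </li>
<li class="a"> A alone  - UNWANTED </li>
<li class="a z"> A and Z  - UNWANTED </li>
<li class="z"> Z first - my target</li>
<li class="b z"> B and Z  - UNWANTED</li>
<li class="x z"> X and Z - UNWANTED </li>
<li class="z"> Z second - my target</li>'''

soup = BeautifulSoup(some_html, "html.parser")

# class_="z" matches any tag whose class list contains "z", wanted or not.
loose = soup.find_all("li", class_="z")

# Passing a function keeps only <li> tags whose class list is exactly ["z"].
exact = soup.find_all(lambda tag: tag.name == "li" and tag.get("class") == ["z"])
targets = [li.get_text(strip=True) for li in exact]

print(len(loose), targets)
```

Using `tag.get("class")` instead of `tag["class"]` avoids a `KeyError` on tags such as the first `<li>`, which has no class at all.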



### Back to our CEOs