# Scraping Stars

This is a quick example of structuring data scraped from a web page: Wikipedia's list of guest stars on 21 Jump Street. The page holds several tables, and each table lists an actor, the character played, the season number, the episode number, and the episode title.

The goal is to create a dataframe that has each of these elements in its own column.

In [1]:
import re

import pandas as pd
import requests

In [75]:
url = 'https://en.wikipedia.org/wiki/List_of_guest_stars_on_21_Jump_Street'

In [76]:
text = requests.get(url)
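
The call above sets no timeout and does not check the HTTP status, so a failed request would only surface later as an odd-looking parse. A minimal sketch of a more defensive fetch, assuming a hypothetical helper named `fetch_html` and a 10-second timeout (neither appears in the original notebook):

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: return the page body as text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

Under these assumptions, `fetch_html(url)` could stand in for `requests.get(url).text` when building the soup.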

In [77]:
from bs4 import BeautifulSoup

In [78]:
soup = BeautifulSoup(text.text, 'html.parser')

In [79]:
soup

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of guest stars on 21 Jump Street - Wikipedia</title>
<script>document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );</script>
<script>(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"List_of_guest_stars_on_21_Jump_Street","wgTitle":"List of guest stars on 21 Jump Street","wgCurRevisionId":820109238,"wgRevisionId":820109238,"wgArticleId":28400841,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["21 Jump Street","Lists of actors by role","Lists of American television series characters","Lists of drama television characters","Lists of guest appearances in television"],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikite

In [85]:
groups = soup.find_all('tr')

In [86]:
len(groups)

137

In [118]:
groups[0].text

'\nActor\nCharacter\nSeason #\nEpisode #\nEpisode Title\n'

In [119]:
groups[1].text

'\nBarney Martin\nCharlie\n1\n1\n"Pilot"\n'

In [120]:
rows = soup.find_all('td')

In [121]:
len(rows)

647

In [122]:
rows[0].text

'Barney Martin'

In [123]:
column_names = []
for i in soup.find_all('th'):
    column_names.append(i.get_text())

In [124]:
len(column_names)

28

In [126]:
column_names = column_names[:5]

In [127]:
column_names

['Actor', 'Character', 'Season #', 'Episode #', 'Episode Title']
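
Slicing `column_names[:5]` works here because the first table's headers come first in document order. A sketch of an alternative that scopes the search to a single table, run on a small inline sample; the markup is an assumption made for illustration, not a copy of the live page:

```python
from bs4 import BeautifulSoup

# Illustrative markup only: two tables that mimic the page's layout
sample_html = """
<table class="wikitable">
  <tr><th>Actor</th><th>Character</th><th>Season #</th><th>Episode #</th><th>Episode Title</th></tr>
  <tr><td>Barney Martin</td><td>Charlie</td><td>1</td><td>1</td><td>"Pilot"</td></tr>
</table>
<table class="wikitable">
  <tr><th>Actor</th><th>Character</th><th>Season #</th><th>Episode #</th><th>Episode Title</th></tr>
  <tr><td>Kurtwood Smith</td><td>Spencer Phillips</td><td>2</td><td>1</td><td>"In the Custody of a Clown"</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find() returns only the first match, so headers of later tables are ignored
first_table = sample_soup.find("table", class_="wikitable")
header_names = [th.get_text(strip=True) for th in first_table.find_all("th")]
print(header_names)
```

Scoping to one table avoids relying on a hard-coded slice length if the header count ever changes.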

In [134]:
rows = [i for i in soup.find_all('td')]

In [135]:
len(rows)

647

In [148]:
rows[:4]

[<td><a href="/wiki/Barney_Martin" title="Barney Martin">Barney Martin</a></td>,
 <td>Charlie</td>,
 <td>1</td>,
 <td>1</td>]

In [153]:
df = pd.DataFrame(columns = column_names)

In [154]:
df

| | Actor | Character | Season # | Episode # | Episode Title |
|---|---|---|---|---|---|


In [155]:
for i in soup.find('tr'):
    print(i)



<th>Actor</th>


<th>Character</th>


<th>Season #</th>


<th>Episode #</th>


<th>Episode Title</th>




In [173]:
# re.compile('') matches any href value, so this prints the text of every link that has an href
for i in soup.find_all('a', attrs={'href': re.compile('')}):
    print(i.text)

navigation
search
television
Fox
21 Jump Street
1 Season 1
2 Season 2
3 Season 3
4 Season 4
5 Season 5
6 See also
7 References
edit
Barney Martin
Brandon Douglas
Reginald T. Dorsey
Billy Jayne
Steve Antin
Traci Lind
Leah Ayres
Geoffrey Blake
Josh Brolin
Jamie Bozian
John D'Aquino
Troy Byer
Lezlie Deane
Blair Underwood
Robert Picardo
Scott Schwartz
Liane Curtis
Byron Thames
Sherilyn Fenn
Christopher Heyerdahl
Kurtwood Smith
Sarah G. Buxton
Jason Priestley
edit
Kurtwood Smith
Ray Walston
Pauly Shore
Shannon Tweed
Lochlyn Munro
Mindy Cohn
Kent McCord
Don S. Davis
Tom Wright
Jean Sagal
Liz Sagal
Deborah Lacey
Bradford English
Christina Applegate
Peter Berg
Gabriel Jarret
Bruce French
Dann Florek
Gregory Itzin
Brad Pitt
Don S. Davis
Sam Anderson
edit
Leo Rossi
Peri Gilpin
Kelly Hu
Russell Wong
Christopher Titus
Dom DeLuise
Kehli O'Byrne
Larenz Tate
Maia Brewton
Michael DeLuise
Mario Van Peebles
*
Bridget Fonda
Conor O'Farrell
Andrew Lauer
Claude Brooks
Margot Rose
edit
Don S. Davis
Robert R
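
The loop above prints every link on the page, including navigation links and the section `edit` links. A sketch that restricts the search to links inside table cells with a CSS selector, again on illustrative inline markup rather than the live page (only the first `href` is taken from the notebook output; the rest are assumed):

```python
from bs4 import BeautifulSoup

# Illustrative markup only: page chrome plus one small table
sample_html = """
<a href="#mw-head">navigation</a>
<h2>Season 1 <a href="/w/index.php?action=edit">edit</a></h2>
<table class="wikitable">
  <tr><th>Actor</th><th>Character</th></tr>
  <tr><td><a href="/wiki/Barney_Martin">Barney Martin</a></td><td>Charlie</td></tr>
  <tr><td><a href="/wiki/Brandon_Douglas">Brandon Douglas</a></td><td>Kenny Weckerle</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# "td a" selects <a> elements that are descendants of a <td>,
# which skips the navigation and edit links outside the table
actor_links = [(a.get_text(strip=True), a["href"]) for a in sample_soup.select("td a")]
print(actor_links)
```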

In [168]:
for row in soup.find('tr'):
    column_mark = 0
    print(row)



<th>Actor</th>


<th>Character</th>


<th>Season #</th>


<th>Episode #</th>


<th>Episode Title</th>




In [159]:
cols

-1

In [96]:
a = []
for i in groups:
    a.append(i.text.split())

In [98]:
a[0]

['Actor', 'Character', 'Season', '#', 'Episode', '#', 'Episode', 'Title']

In [99]:
a[1]

['Barney', 'Martin', 'Charlie', '1', '1', '"Pilot"']
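
Splitting `.text` on whitespace breaks multi-word cells apart: `'Barney Martin'` becomes two tokens, so the rows no longer line up with the five columns. A sketch that keeps one string per cell instead, run on a small inline sample (illustrative markup, not the live page):

```python
from bs4 import BeautifulSoup

sample_html = """
<table>
  <tr><th>Actor</th><th>Character</th><th>Season #</th><th>Episode #</th><th>Episode Title</th></tr>
  <tr><td><a href="/wiki/Barney_Martin">Barney Martin</a></td><td>Charlie</td><td>1</td><td>1</td><td>"Pilot"</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")

cells_per_row = []
for tr in sample_soup.find_all("tr"):
    # Passing a list to find_all matches either tag,
    # so header rows and data rows are both covered
    cells = tr.find_all(["th", "td"])
    cells_per_row.append([cell.get_text(strip=True) for cell in cells])

print(cells_per_row[1])
```

Each inner list then has exactly one entry per table cell, whatever whitespace the cell text contains.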

In [104]:
class HTMLTableParser:
    """Parse every <table> on a page into a pandas DataFrame."""

    def parse_url(self, url):
        # The 'lxml' parser must be installed for this call to work
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'lxml')
        # table.get('class') returns None instead of raising KeyError
        # when a table has no class attribute
        return [(table.get('class'), self.parse_html_table(table))
                for table in soup.find_all('table')]

    def parse_html_table(self, table):
        n_columns = 0
        n_rows = 0
        column_names = []

        # Find the number of rows and columns,
        # and the column titles if there are any
        for row in table.find_all('tr'):

            # Count the rows that contain data cells
            td_tags = row.find_all('td')
            if len(td_tags) > 0:
                n_rows += 1
                if n_columns == 0:
                    # Set the number of columns for our table
                    n_columns = len(td_tags)

            # Take column names from the first row that has header cells
            th_tags = row.find_all('th')
            if len(th_tags) > 0 and len(column_names) == 0:
                for th in th_tags:
                    column_names.append(th.get_text())

        # Safeguard on column titles
        if len(column_names) > 0 and len(column_names) != n_columns:
            raise Exception("Column titles do not match the number of columns")

        columns = column_names if len(column_names) > 0 else range(0, n_columns)
        df = pd.DataFrame(columns=columns,
                          index=range(0, n_rows))

        # Fill the frame cell by cell, skipping rows without data cells
        row_marker = 0
        for row in table.find_all('tr'):
            column_marker = 0
            columns = row.find_all('td')
            for column in columns:
                df.iat[row_marker, column_marker] = column.get_text()
                column_marker += 1
            if len(columns) > 0:
                row_marker += 1

        # Convert to float if possible
        for col in df:
            try:
                df[col] = df[col].astype(float)
            except ValueError:
                pass

        return df
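
The class fills the frame one cell at a time with `df.iat`. A shorter sketch of the same idea, not a replacement for the class: collect each row as a list and hand the whole list to `pd.DataFrame` in one call. The function name `table_to_frame` and the sample markup are assumptions made for illustration, and unlike the class this version leaves every value as a string.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_frame(table):
    """Hypothetical helper: build a DataFrame from one <table> tag."""
    header, body = [], []
    for tr in table.find_all("tr"):
        th_cells = tr.find_all("th")
        td_cells = tr.find_all("td")
        if th_cells and not header:
            # Use the first header row found as the column names
            header = [th.get_text(strip=True) for th in th_cells]
        if td_cells:
            body.append([td.get_text(strip=True) for td in td_cells])
    # With no header row, fall back to pandas' default integer column labels
    return pd.DataFrame(body, columns=header or None)


# Illustrative markup only, modelled on the first rows of the page
sample_html = """
<table class="wikitable">
  <tr><th>Actor</th><th>Character</th><th>Season #</th><th>Episode #</th><th>Episode Title</th></tr>
  <tr><td>Barney Martin</td><td>Charlie</td><td>1</td><td>1</td><td>"Pilot"</td></tr>
  <tr><td>Brandon Douglas</td><td>Kenny Weckerle</td><td>1</td><td>1 &amp; 2</td><td>"Pilot"</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
frame = table_to_frame(sample_table)
print(frame.shape)
```

Building the frame from a list of rows avoids pre-counting rows and columns, at the cost of a less specific error when a row's width does not match the header.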

In [115]:
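hp = HTMLTableParser()
table = hp.parse_url(url)  # list of (class, DataFrame) tuples, one per table on the page
df = pd.DataFrame(table)   # one row per tuple: column 0 holds the class, column 1 the parsed table
df.head()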

| | 0 | 1 |
|---|---|---|
| 0 | [wikitable] | Actor Chara... |
| 1 | [wikitable] | Actor Charact... |
| 2 | [wikitable] | Actor Char... |
| 3 | [wikitable] | Actor Character ... |
| 4 | [wikitable] | Actor Character ... |


In [114]:
table2 = table[0][1]  # the DataFrame from the first (class, DataFrame) tuple
table2.head()

| | Actor | Character | Season # | Episode # | Episode Title |
|---|---|---|---|---|---|
| 0 | Barney Martin | Charlie | 1.0 | 1 | "Pilot" |
| 1 | Brandon Douglas | Kenny Weckerle | 1.0 | 1 & 2 | "Pilot" |
| 2 | Reginald T. Dorsey | Tyrell "Waxer" Thompson | 1.0 | 1 & 2 | "Pilot" |
| 3 | Billy Jayne | Mark Dorian | 1.0 | 2 | "America, What a Town" |
| 4 | Steve Antin | Stevie Delano | 1.0 | 2 | "America, What a Town" |
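
`Season #` comes back as floats while `Episode #` stays as text because `astype(float)` fails on values such as `'1 & 2'`, and the `except ValueError` branch then leaves the whole column untouched. Two hedged options for such a column, sketched on hand-typed values copied from the output above: coerce unparseable entries to `NaN`, or keep only the first episode number. Which one fits depends on what the analysis needs; the variable names are illustrative.

```python
import pandas as pd

# Values typed in by hand from the first five rows of the season 1 table
episodes = pd.Series(["1", "1 & 2", "1 & 2", "2", "2"], name="Episode #")

# Option 1: anything that is not a plain number becomes NaN
coerced = pd.to_numeric(episodes, errors="coerce")

# Option 2: pull out the first run of digits and convert it to an integer
first_episode = episodes.str.extract(r"(\d+)", expand=False).astype(int)

print(coerced.isna().tolist())
print(first_episode.tolist())
```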


In [117]:
table[1][1].head()

| | Actor | Character | Season # | Episode # | Episode Title |
|---|---|---|---|---|---|
| 0 | Kurtwood Smith | Spencer Phillips | 2.0 | 1.0 | "In the Custody of a Clown" |
| 1 | Ray Walston | Judge Desmond | 2.0 | 1.0 | "In the Custody of a Clown" |
| 2 | Barney Martin | Edison Coulter | 2.0 | 1.0 | "In the Custody of a Clown" |
| 3 | Jason Priestley | Brian Krompasick | 2.0 | 4.0 | "Two for the Road " |
| 4 | Pauly Shore | Kenny Ryan | 2.0 | 4.0 | "Two for the Road" |
