# Web Scraping and Data Wrangling with The Office

What does this demonstrate?
1. web scraping.
2. data wrangling.
3. regular expressions.
4. descriptive analysis.

TODO:
1. Running total of character speeches to show importance over time.
2. Compare writers.
3. ?

<https://www.officequotes.net>

In [20]:
from pathlib import Path
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests
import pandas as pd

In [21]:
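def scrape_episode_urls(home_page="https://www.officequotes.net"):
    """Return an iterator over the episode URLs linked from the home page's left sidebar."""
    r = requests.get(home_page)
    r.raise_for_status()
    # Name the parser explicitly so results do not depend on which parsers are installed.
    soup = BeautifulSoup(r.text, "html.parser")
    episode_paths = [
        a.get("href")
        for a in soup.find(id="sidebar-left").find_all("a")
        # Skip anchors without an href; episode paths look like /no1-01.html.
        if a.get("href") and re.match(r"/no\d+-\d+\.html", a.get("href"))
    ]

    episode_urls = map(lambda path: urljoin(home_page, path), episode_paths)

    return episode_urls

# map() returns a lazy iterator, not a list. Converting to a list during development.
episode_urls_list = list(scrape_episode_urls())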
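
To see what the link filter keeps without hitting the site, here is a minimal offline sketch. The `sample_hrefs` list is made up for illustration (it is not scraped from the page), and the pattern escapes the dot before `html` so that it only matches a literal `.html` suffix.

```python
import re
from urllib.parse import urljoin

HOME_PAGE = "https://www.officequotes.net"
EPISODE_PATH = re.compile(r"/no\d+-\d+\.html")

# Hypothetical hrefs standing in for the sidebar links; None mimics an <a> with no href.
sample_hrefs = ["/no1-01.html", "/index.html", None, "/no2-10.html", "/characters.html"]

# Keep only episode paths and resolve them against the home page.
episode_urls = [
    urljoin(HOME_PAGE, href)
    for href in sample_hrefs
    if href and EPISODE_PATH.match(href)
]
print(episode_urls)
```

Only the two `/noS-EE.html` paths survive the filter; the `None` entry is skipped before it reaches `re.match`, which would otherwise raise a `TypeError`.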

In [22]:
episode_urls_list

['https://www.officequotes.net/no1-01.html',
 'https://www.officequotes.net/no1-02.html',
 'https://www.officequotes.net/no1-03.html',
 'https://www.officequotes.net/no1-04.html',
 'https://www.officequotes.net/no1-05.html',
 'https://www.officequotes.net/no1-06.html',
 'https://www.officequotes.net/no2-01.html',
 'https://www.officequotes.net/no2-02.html',
 'https://www.officequotes.net/no2-03.html',
 'https://www.officequotes.net/no2-04.html',
 'https://www.officequotes.net/no2-05.html',
 'https://www.officequotes.net/no2-06.html',
 'https://www.officequotes.net/no2-07.html',
 'https://www.officequotes.net/no2-08.html',
 'https://www.officequotes.net/no2-09.html',
 'https://www.officequotes.net/no2-10.html',
 'https://www.officequotes.net/no2-11.html',
 'https://www.officequotes.net/no2-12.html',
 'https://www.officequotes.net/no2-13.html',
 'https://www.officequotes.net/no2-14.html',
 'https://www.officequotes.net/no2-15.html',
 'https://www.officequotes.net/no2-16.html',
 'https://

In [None]:
# Extract some metadata for the episode.
episode_url = episode_urls_list[0]

episode_page = BeautifulSoup(requests.get(episode_url).text, "html.parser")

In [72]:
def _extract_episode_metadata(episode_page):
    """Scrape the episode's metadata from the parsed page.

    Args:
        episode_page (BeautifulSoup): The parsed page source for the episode.

    Returns:
        dict: The episode's metadata: Season, Episode, Title, WrittenBy, and Director.

    Raises:
        ValueError: If the metadata block is not found in the page text.
    """
    p = re.compile(
        r'Season (?P<Season>\d) - Episode (?P<Episode>\d+)\n"(?P<Title>.*)"\n.*Written by (?P<WrittenBy>.*)\n.*Directed by (?P<Director>.*)\n'
    )
    s = p.search(episode_page.find("main").text)
    if s is None:
        raise ValueError("Episode metadata not found in page text.")

    return s.groupdict()

In [39]:
_extract_episode_metadata(episode_page)

{'Season': '1',
 'Episode': '01',
 'Title': 'Pilot',
 'WrittenBy': 'Greg Daniels, Ricky Gervais, and Stephen Merchant',
 'Director': 'Ken Kwapis'}
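
The named groups in the pattern are what turn a match into a dictionary. As a rough offline illustration, the sketch below runs the same kind of pattern over a hand-written `sample_text`; the exact line layout of that string is an assumption about how the page text looks, not a copy of it.

```python
import re

METADATA = re.compile(
    r"Season (?P<Season>\d) - Episode (?P<Episode>\d+)\n"
    r'"(?P<Title>.*)"\n'
    r".*Written by (?P<WrittenBy>.*)\n"
    r".*Directed by (?P<Director>.*)\n"
)

# Hypothetical text laid out the way the pattern expects.
sample_text = (
    "Season 1 - Episode 01\n"
    '"Pilot"\n'
    "Written by Greg Daniels, Ricky Gervais, and Stephen Merchant\n"
    "Directed by Ken Kwapis\n"
)

match = METADATA.search(sample_text)
# search() returns None when nothing matches, so guard before calling groupdict().
metadata = match.groupdict() if match else {}
print(metadata)
```

Every captured value is a string, so `Season` and `Episode` would need an explicit `int()` conversion before any numeric sorting or arithmetic.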

In [75]:
def extract_script(episode_page, include_deleted_scenes=False):
    """Extract the script from the episode's page.

    Args:
        episode_page (BeautifulSoup): The parsed page source for the episode.
        include_deleted_scenes (bool): Include deleted scenes? Defaults to False.

    Returns:
        DataFrame: Each row is a scene from the episode.
    """

    episode_metadata = _extract_episode_metadata(episode_page)

    # Number the scenes in page order; each "quote" div holds one scene.
    scenes = enumerate(
        [
            scene.text.strip()
            for scene in episode_page.find("main").find_all("div", class_="quote")
        ]
    )
    df = pd.DataFrame.from_records(scenes, columns=["SceneNumber", "Dialogue"]).assign(
        IsDeletedScene=lambda d: d.Dialogue.str.match(r"Deleted Scene"), **episode_metadata
    )

    if not include_deleted_scenes:
        df = df.query("~IsDeletedScene")

    return df

In [76]:
episode_df = extract_script(episode_page)

In [77]:
episode_df

Unnamed: 0,SceneNumber,Dialogue,IsDeletedScene,Season,Episode,Title,WrittenBy,Director
0,0,Michael: All right Jim. Your quarterlies look ...,False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
1,1,"Michael: [on the phone] Yes, I'd like to speak...",False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
2,2,"Michael: I've, uh, I've been at Dunder Mifflin...",False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
3,3,Michael: People say I am the best boss. They g...,False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
4,4,Dwight: [singing] Shall I play for you? Pa rum...,False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
5,5,Jim: My job is to speak to clients on the phon...,False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
6,6,Michael: Whassup!\nJim: Whassup! I still love ...,False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
7,7,Jan: [on her cell phone] Just before lunch. Th...,False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
8,8,Michael: Corporate really doesn't really inter...,False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis
9,9,"Jan: Alright, was there anything you wanted to...",False,1,1,Pilot,"Greg Daniels, Ricky Gervais, and Stephen Merchant",Ken Kwapis


In [78]:
def _convert_dialogue_to_cued_speeches(dialogue_string):
    """Helper function to pair each character cue with its speech in the dialogue.

    Args:
        dialogue_string (str): The complete dialogue of the scene with its character cues.

    Returns:
        list: A list of (character cue, speech) tuples.
    """
    # Splitting on a pattern with a capturing group keeps the captured cue in the result,
    # so the list alternates cue, speech, cue, speech, ...
    p = re.compile(r"(?P<Character>[\w ]+):")
    lines = [str.strip(line) for line in p.split(dialogue_string) if line != ""]
    pair_player_and_line = zip(lines[::2], lines[1::2])
    return list(pair_player_and_line)
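
The pairing relies on `re.split` keeping captured groups in its output. The sketch below walks through that on a short, made-up scene string (the `dialogue` value is an assumption for illustration, not text taken from the site).

```python
import re

CUE = re.compile(r"(?P<Character>[\w ]+):")

# Hypothetical two-line scene.
dialogue = "Michael: Whassup!\nJim: Whassup! I still love that."

# Because the pattern has a capturing group, split() returns the cues as well as
# the text between them; the leading empty string is dropped by the filter.
parts = [part.strip() for part in CUE.split(dialogue) if part != ""]

# Even positions are cues, odd positions are the speeches that follow them.
pairs = list(zip(parts[::2], parts[1::2]))
print(parts)
print(pairs)
```

One caveat of this approach: a colon inside a speech is also read as a cue, so the cue/speech alternation can drift on such lines.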

In [79]:
def character_speeches(episode_df):
    """Break out individual speeches for each character from within the dialogue.

    Args:
        episode_df (DataFrame): DataFrame of the episode's script.

    Returns:
        DataFrame: The provided dataframe with dialogue exploded to cues and speeches.
    """
    df = (
        episode_df.assign(Lines=lambda x: x.Dialogue.apply(_convert_dialogue_to_cued_speeches))
        # One row per (cue, speech) tuple instead of one row per scene.
        .explode("Lines")
        .reset_index()
        .assign(
            # .str[i] indexes into each tuple; cheaper than building a Series per row.
            Character=lambda x: x.Lines.str[0],
            Speech=lambda x: x.Lines.str[1],
        )
    )
    return df

In [80]:
episode_df = extract_script(episode_page).pipe(character_speeches)

episode_df

In [83]:
episode_df.Speech.iloc[-1]

'Yeah, definitely. You too. Enjoy it. [looks at camera] You know what, just come here.'

Let's wrap these steps in a function so that a list comprehension can build one dataframe per episode, ready to be concatenated into a single dataframe.
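
One possible shape for that function is sketched below. The names `combine_episodes`, `toy_pages`, and `toy_build_episode_df` are hypothetical, and the toy data stands in for real pages so the sketch runs without network access.

```python
import pandas as pd


def combine_episodes(episode_pages, build_episode_df):
    """Build one dataframe per page and stack them into a single dataframe."""
    frames = [build_episode_df(page) for page in episode_pages]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index.
    return pd.concat(frames, ignore_index=True)


# Toy stand-ins for parsed episode pages (hypothetical structure).
toy_pages = [
    {"Episode": "01", "Speeches": [("Michael", "Hello."), ("Jim", "Hi.")]},
    {"Episode": "02", "Speeches": [("Pam", "Dunder Mifflin, this is Pam.")]},
]


def toy_build_episode_df(page):
    """Toy per-episode builder: one row per (character, speech) pair."""
    return pd.DataFrame(page["Speeches"], columns=["Character", "Speech"]).assign(
        Episode=page["Episode"]
    )


all_episodes = combine_episodes(toy_pages, toy_build_episode_df)
print(all_episodes)
```

With the notebook's own functions, the builder could plausibly be `lambda page: extract_script(page).pipe(character_speeches)`, applied to one parsed page per entry in `episode_urls_list`. Note that `pd.concat` raises a `ValueError` if the list of frames is empty.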

In [85]:
episode_df.Character.value_counts()

Character
Michael        81
Pam            41
Jim            36
Dwight         29
Jan            12
Ryan            8
Stanley         5
Roy             5
Todd Packer     3
Oscar           3
Phyllis         2
Michel          1
Angela          1
Kevin           1
Man             1
Name: count, dtype: int64

In [86]:
# script.groupby(['SceneNumber', 'Character']).size()
episode_df.groupby("SceneNumber")["Character"].transform("nunique")

0      2
1      2
2      2
3      2
4      2
      ..
224    2
225    2
226    2
227    2
228    2
Name: Character, Length: 229, dtype: int64