In [1]:
import requests
import re
import os
import json
from bs4 import BeautifulSoup

In [2]:
base_url = 'https://transcripts.foreverdreaming.org'
url = base_url + '/viewforum.php?f=22'

First step is to query the index page with the list of links to each individual episode page.

In [3]:
req = requests.get(url)
soup = BeautifulSoup(req.text, 'html.parser')

There's a complication- the index is paginated. A query is needed for each page, at increments of 25 by default. 
The last page is scraped from the initial request, and it's used to generate a list of pages to query.

In [8]:
pages_raw = soup.find("b", {"class": "pagination"}).get_text()
last_page = int(pages_raw[len(pages_raw)-2])

pages = [25 * l for l in list(range(0, last_page))]

With the page queries in hand, the next step is to scrape for link tags. 
The problem is that there are non-episode links which are formatted identically. However, each link tag contains text- for the episode links these contain the episode titles, which are formatted consistently. The titles are filtered to exclude non-episode links.

In [10]:
def title_filter(title):
    ## Filter string to see if it matches the expected title format
    pat = re.compile('\d{2}x\d{2}')
    if (pat.match(title[0:5])):
        return True
    return False

In [11]:
## Loop through each page and scrape for episode titles and links
eps = []
for p in pages:
    page_url = "https://transcripts.foreverdreaming.org/viewforum.php?f=22&start=" + str(p)
    page_req = requests.get(page_url)

    page_soup = BeautifulSoup(page_req.text, 'html.parser')
    
    tds = page_soup.findAll("td", {"class": "topic-titles row2"})

    for td in tds:
        title = td.get_text().replace("\n", "")
        link = td.find("h3").find('a')['href'][1:]
        if (title_filter(title)):
            eps.append(
            {
                "title": title,
                "link": link,
            })

Each page link is scraped now to get the actual scripts.

In [12]:
## Scrape each page for episode script
for ep in eps:
    ep['script'] = [];
    ep_url = base_url + ep['link']
    ep_req = requests.get(ep_url)
    # Findall needs to be converted from a list of tags into a list of strings
    ep_soup = BeautifulSoup(ep_req.text, 'html.parser')
    for tag in ep_soup.find_all("div", class_="postbody")[0].find_all("p"):
        ep['script'].append(tag.text.strip())

Dump raw scripts to JSON

In [15]:
def write_json(soup, file_name):
    with open(file_name + '.json', 'w+') as outfile:
        json.dump(soup, outfile)

In [17]:
write_json(eps, './output/gilmore_raw_scripts')

MOVING ALL THIS JAZZ TO A SEPARATE NOTEBOOK

The scripts unfortunately are not formatted consistently- the immediate issue is that sometimes episode titles are included, as well as credits for writers, directors, transcriber, etc. Even these aren't consistently formatted, or even consistently included.

However, there are a few keywords at the beginning of each credit line, and these credits never go for more than 5 lines. By iterating backwards through the first five lines of each script, the index of the last credit line is found, and then used to remove all lines before it- the scripts end up scrubbed clean of titles and credits.

In [107]:
## Scrub non-script elements out of head of each script
def delete_multiple_element(list_object, indices):
    indices = sorted(indices, reverse=True)
    for idx in indices:
        list_object.pop(idx)
        if idx < len(list_object):
            pop = list_object.pop(idx)

for ep in eps:
    keywords = ["written", "directed", "transcript", "teleplay", "story", "transcribed"]
    for p in range(5, -1, -1):
        if ep['soup'][p].lower().split(" ")[0] in keywords:
            delete_multiple_element(ep['soup'], range(0, p+1))
            break

In [3]:
eps[20]['soup']

["CUT TO RORY'S BEDROOM ",
 '(Rory is asleep in bed. Lorelai opens the door and looks in.) ',
 'LORELAI: Hey! ',
 'RORY: What? ',
 'LORELAI: You are not sleeping through this. ',
 'RORY: Through what? ',
 '(Lorelai walks over to the bed and leans over her.) ',
 'LORELAI: The freaking Blue Man Group is outside our house! ',
 'RORY: I was sleeping through it! ',
 'LORELAI: It had to have woken you up. ',
 'RORY: No my insane mother Margot Kidder Gilmore woke me up. ',
 'CUT TO FRONT PORCH ',
 '(Lorelai walks out the door onto the front porch. Luke is hammering the porch rail.) ',
 'LORELAI: Hey. ',
 'LUKE: Hey. ',
 'LORELAI: How are you today? ',
 'LUKE: Good, how are you? ',
 'LORELAI: Good, good. What are you doing? ',
 'LUKE: Fixing your porch rail. ',
 "LORELAI: That's right. You are. You're fixing my porch rail. . . . At six thirty in the morning! ",
 'LUKE: It was the only time I could do it. ',
 'LORELAI: Why? Why? ',
 'LUKE: It was broken. I noticed last time I was here. It could

Further cleaning notes:
* There's a whitespace at the end of each line that trim isn't removing. I think it's a "\n".
* Stage directions are in both round and square brackets.
* "Cut to" transitions aren't always in parentheses.
* There's a "The End" line at the bottom of some (all?) scripts.
* Need to split out each line into character and dialog.

I think the first step is to find 

In [7]:
## Write our episodes to a json    
for ep in eps:
    write_json(ep, ep['title'].split(" ")[0])