# Experimenting with an existing scraper

This notebook is intended for playing around with web scraping, particularly for [FanFiction.net](https://www.fanfiction.net).

This first section experiments with the scraping library found at https://github.com/smilli/fanfiction to see if it gets me what I need.

In [1]:
# import the library, initialize the scraper
from fanfiction import Scraper
scraper = Scraper()

In [9]:
# this may or may not be my fan fiction from years ago
story_id = 7127370
metadata = scraper.scrape_story_metadata(story_id)
metadata

{'author_id': 1625333,
 'canon': 'Harry Potter',
 'canon_type': 'Books',
 'genres': ['Romance'],
 'id': 7127370,
 'lang': 'English',
 'num_chapters': 38,
 'num_favs': 58,
 'num_follows': 87,
 'num_reviews': 85,
 'num_words': 130649,
 'published': 1309299920,
 'rated': 'Fiction  T',
 'status': 'Complete',
 'title': 'Something Great',
 'updated': 1446343703}

Yup, that's the fan fiction I was thinking of! Let's see when it was published...

In [10]:
published_timestamp = metadata['published']
published_timestamp

1309299920

Let's see when it was published in a format I would understand...

In [11]:
import datetime
datetime.datetime.fromtimestamp(int(published_timestamp)).strftime('%Y-%m-%d %H:%M:%S')

'2011-06-28 18:25:20'
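
One caveat: `datetime.fromtimestamp` without a timezone converts to the local timezone of whatever machine runs the notebook, so the string above depends on where it was run. Here is a minimal sketch of a timezone-explicit version, assuming UTC is an acceptable reference (the helper name `to_utc_string` is made up for illustration):

```python
from datetime import datetime, timezone

def to_utc_string(timestamp):
    """Format a Unix timestamp (seconds) as a UTC date-time string."""
    moment = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

print(to_utc_string(1309299920))  # 2011-06-28 22:25:20
```

The four-hour difference from the output above suggests the notebook ran on a machine set to UTC-4.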

Great! The next step is to figure out how to get a ton of Harry Potter story IDs and parse the metadata for all of them, while keeping fanfiction.net's terms of service in mind.

![screenshot from fanfiction.net](fanficnet_screenshot.png)


So it seems like all of that metadata is on the list of stories as well. Perhaps I don't need this library at all and should instead adapt its code to scrape the story listing directly. That way I'm not pinging fanfiction.net more than necessary (once to get story IDs and again to get each story's metadata). Let's see if we can do it all in one go.

Looking at this screenshot, it seems like there might be a problem where 'Published' is listed as '23m ago' which is not a timestamp. Let's see if that's the case.

In [12]:
story_id = 12582445
metadata = scraper.scrape_story_metadata(story_id)

IndexError: list index out of range

Actually, it seems like the problem isn't the '23m ago'. Rather, this story has only one timestamp: there is no 'updated' time, so what the library treats as 'updated' is really 'published'.

# Experimenting with scraping

Most of this is adapted from the above library's code. Rather than scraping for all of the story IDs and then using their library to scrape each story's metadata, can we do it all in one go? To be kind to the FanFiction.net servers :)

In [2]:
import requests

# we're only going to look at harry potter fanfics 
base_url = "https://www.fanfiction.net/book/Harry-Potter"
# this gets appended to build the query string; the page number goes after p=
page_suffix = "?&srt=1&r=103&p="

# 30 seconds seems reasonable for a human to quickly scroll through a page
rate_limit = 30

# let's start with a single page. this would eventually go into a for loop index, I imagine
page = 23251

Alright- now let's make a request and see what we get in return

In [3]:
url = '{0}/{1}{2}'.format(base_url, page_suffix, str(page))
raw_result = requests.get(url)
html = raw_result.content
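
The `rate_limit` defined above isn't used yet. Here is a rough sketch of how the page loop might eventually look, assuming the same `base_url` and `page_suffix` and a pause of `rate_limit` seconds between requests. The names `build_page_url` and `fetch_pages` are hypothetical, and the `fetch` argument is whatever function downloads a URL (for example `lambda u: requests.get(u).content`), which keeps the sketch runnable without touching the site:

```python
import time

base_url = "https://www.fanfiction.net/book/Harry-Potter"
page_suffix = "?&srt=1&r=103&p="

def build_page_url(page):
    """Build the listing URL for one page, same format as the cell above."""
    return '{0}/{1}{2}'.format(base_url, page_suffix, page)

def fetch_pages(first_page, last_page, fetch, rate_limit=30, sleep=time.sleep):
    """Fetch pages first_page..last_page inclusive, pausing between requests."""
    results = {}
    for page in range(first_page, last_page + 1):
        results[page] = fetch(build_page_url(page))
        if page != last_page:
            # be kind to the server: wait before asking for the next page
            sleep(rate_limit)
    return results
```

Passing `sleep` as a parameter is only there so the pause can be swapped out when trying the loop without actually waiting.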

In [4]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

In [5]:
all_stories_on_page = soup.find_all('div', class_='z-list zhover zpointer ')
len(all_stories_on_page)

25

In [29]:
# choosing number two because it has some reviews/follows
a_story = all_stories_on_page[1]
print(a_story.prettify())

<div class="z-list zhover zpointer " style="min-height:77px;border-bottom:1px #cdcdcd solid;">
 <a class="stitle" href="/s/285322/1/Facing-Life">
  <img class="cimage " height="66" src="/static/images/d_60_90.jpg" style="clear:left;float:left;margin-right:3px;padding:2px;border:1px solid #ccc;-moz-border-radius:2px;-webkit-border-radius:2px;" width="50"/>
  Facing Life
 </a>
 by
 <a href="/u/58646/Sellene">
  Sellene
 </a>
 <a class="reviews" href="/r/285322/">
  reviews
 </a>
 <div class="z-indent z-padtop">
  Sketchy poem.  Repitivie.  Constructive critism needed!
  <div class="z-padtop2 xgray">
   Rated: K+ - English - Angst/Poetry - Chapters: 1 - Words: 736 - Reviews: 6 - Published:
   <span data-xutime="989996400">
    5/16/2001
   </span>
  </div>
 </div>
</div>



Whew, we're getting there! Here's the second story on the page. Let's see if we can get all the metadata the way the fanfiction library does. We'll ignore canon and canon_type since this will by default be all Harry Potter books. So we're going to look for author ID, title, updated, published, language, genres, number of reviews, number of favorites, number of follows, number of words, completion, and the rating.

In [30]:
# let's start with the title
title = a_story.find(class_='stitle').get_text()
title

'Facing Life'

Okay, that was the easy one. I can do this!!

In [31]:
story_url = a_story.find(class_='stitle')['href']
story_url

'/s/285322/1/Facing-Life'

I guess you could also get the title from that, though it seems like it'd be more annoying since I would then have to deal with spaces. So let's stick with this way.

In [32]:
story_url.split("/")

['', 's', '285322', '1', 'Facing-Life']

In [33]:
story_id = story_url.split("/")[2]
story_id

'285322'

In [34]:
a_story.find_all('a')

[<a class="stitle" href="/s/285322/1/Facing-Life"><img class="cimage " height="66" src="/static/images/d_60_90.jpg" style="clear:left;float:left;margin-right:3px;padding:2px;border:1px solid #ccc;-moz-border-radius:2px;-webkit-border-radius:2px;" width="50"/>Facing Life</a>,
 <a href="/u/58646/Sellene">Sellene</a>,
 <a class="reviews" href="/r/285322/">reviews</a>]

In [35]:
# turns out author ID isn't always the third one. for some reason sometimes there isn't a > link
# so we'll look for /u/ 
links = a_story.find_all('a')
author_url = [link['href'] for link in links if "/u/" in link['href']]
author_id = author_url[0].split("/")[2]
author_id

'58646'

In [36]:
metadata_div = a_story.find('div', class_="z-indent z-padtop")
start = metadata_div.text.index('Rated')
metadata_div.text[start:]

'Rated: K+ - English - Angst/Poetry - Chapters: 1 - Words: 736 - Reviews: 6 - Published: 5/16/2001'

In [37]:
times = metadata_div.find_all(attrs={'data-xutime':True})
times

[<span data-xutime="989996400">5/16/2001</span>]

In [38]:
import datetime
def convertTime(time):
    return datetime.datetime.fromtimestamp(int(time)).strftime('%Y-%m-%d %H:%M:%S')

In [39]:
if len(times) == 2:
    updated = times[0]['data-xutime']
    published = times[1]['data-xutime'] 
else:
    updated = times[0]['data-xutime']
    published = updated

print(convertTime(updated))
print(convertTime(published))

2001-05-16 03:00:00
2001-05-16 03:00:00


In [40]:
metadata_div.get_text()

'Sketchy poem.  Repitivie.  Constructive critism needed!Rated: K+ - English - Angst/Poetry - Chapters: 1 - Words: 736 - Reviews: 6 - Published: 5/16/2001'

In [41]:
# looks like things are separated by -'s
metadata_parts = metadata_div.get_text().split('-')
metadata_parts

['Sketchy poem.  Repitivie.  Constructive critism needed!Rated: K+ ',
 ' English ',
 ' Angst/Poetry ',
 ' Chapters: 1 ',
 ' Words: 736 ',
 ' Reviews: 6 ',
 ' Published: 5/16/2001']

In [42]:
def get_genres(genre_text):
    if genre_text.startswith('Chapters'):
        return []
    genres = genre_text.split('/')
    # Hurt/Comfort is annoying because of the '/'
    corrected_genres = []
    for genre in genres:
        if genre == 'Hurt':
            corrected_genres.append('Hurt/Comfort')
        elif genre == 'Comfort':
            continue
        else:
            corrected_genres.append(genre)
    return corrected_genres

In [43]:
# we'll use that library's nice get_genres function
genres = get_genres(metadata_parts[2].strip())
genres

['Angst', 'Poetry']

In [44]:
language = metadata_parts[1].strip()
language

'English'

In [45]:
# put together what we have so far
metadata = {
    'id': story_id,
    'author_id': author_id,
    'title': title,
    'updated': int(updated),
    'published': int(published),
    'language': language,
    'genres': genres
}
metadata

{'author_id': '58646',
 'genres': ['Angst', 'Poetry'],
 'id': '285322',
 'language': 'English',
 'published': 989996400,
 'title': 'Facing Life',
 'updated': 989996400}

In [46]:
# much thanks to the original library for this logic
for parts in metadata_parts:
    parts = parts.strip()
    # already dealt with language and genres- everything else should have name: value
    tag_and_val = parts.split(':')
    if len(tag_and_val) != 2:
        continue
    tag, val = tag_and_val
    tag = tag.strip().lower()
    if tag not in metadata:
        val = val.strip()
        try:
            val = int(val.replace(',', ''))
            metadata['num_'+tag] = val
        except ValueError:
            metadata[tag] = val

metadata

{'author_id': '58646',
 'genres': ['Angst', 'Poetry'],
 'id': '285322',
 'language': 'English',
 'num_chapters': 1,
 'num_reviews': 6,
 'num_words': 736,
 'published': 989996400,
 'sketchy poem.  repitivie.  constructive critism needed!rated': 'K+',
 'title': 'Facing Life',
 'updated': 989996400}
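
The odd `'sketchy poem. ... rated'` key above appears because `metadata_div` still contains the story description, and splitting on a bare `-` would also cut any value that itself contains a hyphen. Here is a minimal sketch of one way around both problems, assuming the fields are always separated by ` - ` and that the metadata starts at `Rated:` (the function name `split_metadata` is made up for illustration):

```python
def split_metadata(text):
    """Split a story blurb's metadata line into its ' - ' separated fields."""
    # drop the description, which comes before 'Rated:'
    start = text.index('Rated:')
    return [part.strip() for part in text[start:].split(' - ')]

blurb = ('Sketchy poem.  Repitivie.  Constructive critism needed!'
         'Rated: K+ - English - Angst/Poetry - Chapters: 1 - Words: 736'
         ' - Reviews: 6 - Published: 5/16/2001')
print(split_metadata(blurb))
```

A description that happens to contain `Rated:` itself would still trip this up, so it is a sketch rather than a complete fix.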

Alright! We'll have to deal with 'status' differently than the library does, because it doesn't show up on the listing page the same way it does on an individual story's page. I'd also like to get the character associations, which the original library doesn't do.

It seems like the last metadata portion is either Published, Complete, or the Character listing. So we'll just have to use if's.

In [24]:
last_part = metadata_parts[-1]
last_part

' Published: 5/16/2001'

In [25]:
# seems like sometimes there are brackets and sometimes there aren't...
def get_characters(character_text):
    stripped = character_text.strip()
    bracketless = stripped.replace('[', "")
    if bracketless.endswith(']'):
        characters = bracketless.replace(']', "")
    else:
        characters = bracketless.replace(']', ",")
    return characters.split(', ')
    
print(get_characters(last_part))
print(get_characters(' [Harry P., Hermione G.] Remus L.'))

['Published: 5/16/2001']
['Harry P.', 'Hermione G.', 'Remus L.']


In [26]:
test = ' [Remus L., Sirius B.] [Lily Evans P., James P.]' # [['Remus L.', 'Sirius B.'], ['Lily Evans P.', 'James P.']]
test2 = ' [Harry P., Hermione G.] James P. ' # [['Harry P.', 'Hermione G.'], 'James P.']
test3 = ' Lily Evans P., Severus S., Petunia D., OC'
test4 = ' [Harry P., Hermione G.] James P., Remus L. '

def get_characters_from_string(string):
    stripped = string.strip()
    if stripped.find('[') == -1:
        return stripped.split(', ')
    else:
        characters = []
        num_pairings = stripped.count('[')
        for idx in range(0, num_pairings):
            open_bracket = stripped.find('[')
            close_bracket = stripped.find(']')
            characters.append(get_characters_from_string(stripped[open_bracket+1:close_bracket]))
            stripped = stripped[close_bracket+1:]
        if stripped != '':
            singles = get_characters_from_string(stripped)
            characters.extend(singles)
        return characters

get_characters_from_string(test4)
#any(isinstance(el, list) for el in get_characters_from_string(test3))

[['Harry P.', 'Hermione G.'], 'James P.', 'Remus L.']
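
Since pairings come back as nested lists, anything that wants a flat list of names needs one more step (the commented-out `isinstance` check above hints at this). Here is a small sketch, with the hypothetical helper names `flatten_characters` and `get_pairings`:

```python
def flatten_characters(characters):
    """Return every character name once, whether or not it is in a pairing."""
    flat = []
    for entry in characters:
        # a pairing is a nested list, a single character is a plain string
        names = entry if isinstance(entry, list) else [entry]
        for name in names:
            if name not in flat:
                flat.append(name)
    return flat

def get_pairings(characters):
    """Return only the nested lists, i.e. the bracketed pairings."""
    return [entry for entry in characters if isinstance(entry, list)]

parsed = [['Harry P.', 'Hermione G.'], 'James P.', 'Remus L.']
print(flatten_characters(parsed))  # ['Harry P.', 'Hermione G.', 'James P.', 'Remus L.']
print(get_pairings(parsed))        # [['Harry P.', 'Hermione G.']]
```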

In [27]:
get_characters_from_string(last_part)

['Published: 5/16/2001']

In [28]:
if last_part.strip() == 'Complete':
    metadata['status'] = 'Complete'
    metadata['characters'] = get_characters_from_string(metadata_parts[len(metadata_parts)-2])
else:
    metadata['status'] = 'Incomplete'
    if last_part.startswith('Published'):
        metadata['characters'] = []
    else:
        metadata['characters'] = get_characters_from_string(last_part)
metadata

{'author_id': '58646',
 'characters': ['Published: 5/16/2001'],
 'genres': [],
 'id': '285308',
 'language': 'English',
 'num_chapters': 1,
 'num_reviews': 6,
 'num_words': 1135,
 'published': 989996400,
 'status': 'Incomplete',
 "three diary entries from back in the mwpp times.  one girl discovers the gift of friendship and its resembolence to a quote about lemonade.  a little silly; the first two entries are..well read it for yourself.  please r/r (don't expect the best; i wroterated": 'K',
 'title': 'Lemonade: The Gift of Friendship',
 'updated': 989996400}
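
Notice that `'characters'` in the output above is `['Published: 5/16/2001']` rather than an empty list. `last_part` still has its leading space, so `last_part.startswith('Published')` is never true. Here is a sketch of the same branching with the whitespace stripped first. `get_status_and_characters` is a hypothetical name, and `split_on_commas` is a stand-in for whichever character parser is actually used:

```python
def split_on_commas(text):
    """Stand-in character parser for this sketch (no pairing support)."""
    return text.split(', ')

def get_status_and_characters(metadata_parts, parse_characters=split_on_commas):
    """Work out completion status and characters from the tail of the metadata."""
    parts = [part.strip() for part in metadata_parts]
    status = 'Incomplete'
    if parts and parts[-1] == 'Complete':
        status = 'Complete'
        parts = parts[:-1]
    # whatever is last now is either the 'Published' field or the characters
    if not parts or parts[-1].startswith('Published'):
        return status, []
    return status, parse_characters(parts[-1])

print(get_status_and_characters([' Reviews: 6 ', ' Published: 5/16/2001']))
# ('Incomplete', [])
```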

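I think that's everything! (The output above comes from a run against a different story on the page, id 285308, which is why it doesn't match the 'Facing Life' values from earlier.) Now let's put it into one function and see how it does...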

## One function

In [27]:
import requests
from bs4 import BeautifulSoup
from fanfiction import Scraper

def scrape_all_stories_on_page(url):
    # names of the classes on fanfiction.net
    story_root_class = 'z-list zhover zpointer '
    
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html.parser")
    
    # get all the stories on the page
    all_stories_on_page = soup.find_all('div', class_=story_root_class)
    metadata_list = {}
    for story in all_stories_on_page:
        # avoid shadowing the built-in id()
        story_id, metadata = scrape_story_blurb(story)
        metadata_list[story_id] = metadata
    return metadata_list
        
def scrape_story_blurb(story):
    # names of the classes on fanfiction.net
    title_class = 'stitle'
    metadata_div_class = 'z-padtop2 xgray'
    
    title = story.find(class_=title_class).get_text()
    story_id = story.find(class_=title_class)['href'].split("/")[2]
    
    # some steps to get to the author id
    links = story.find_all('a')
    author_url = [link['href'] for link in links if "/u/" in link['href']]
    author_id = author_url[0].split("/")[2]
    
    metadata_div = story.find('div', class_=metadata_div_class)
    
    times = metadata_div.find_all(attrs={'data-xutime':True})
    if len(times) == 2:
        updated = times[0]['data-xutime']
        published = times[1]['data-xutime']
    else:
        updated = times[0]['data-xutime']
        published = updated
    
    metadata_parts = metadata_div.get_text().split('-')
    scraper = Scraper()
    genres = scraper.get_genres(metadata_parts[2].strip())
    
    language = metadata_parts[1].strip()
    
    metadata = {
        'author_id': author_id,
        'title': title,
        'updated': int(updated),
        'published': int(published),
        'language': language,
        'genres': genres
    }
    
    for parts in metadata_parts:
        parts = parts.strip()
        # already dealt with language and genres- everything else should have name: value
        tag_and_val = parts.split(':')
        if len(tag_and_val) != 2:
            continue
        tag, val = tag_and_val
        tag = tag.strip().lower()
        if tag not in metadata:
            val = val.strip()
            try:
                val = int(val.replace(',', ''))
                metadata['num_'+tag] = val
            except ValueError:
                metadata[tag] = val
    
    # see if we have characters and/or completion
    # strip first: after splitting on '-' the last part keeps a leading space
    last_part = metadata_parts[-1].strip()
    if last_part == 'Complete':
        metadata['status'] = 'Complete'
        # have to get the second to last now
        metadata['characters'] = get_characters(metadata_parts[-2])
    else:
        metadata['status'] = 'Incomplete'
        metadata['characters'] = get_characters(last_part)
        
    return story_id, metadata    

def get_characters(character_text):
    altered = character_text.strip().replace('[', "")
    if altered.startswith('Published'):
        return []
    else:
        if altered.endswith(']'):
            characters = altered.replace(']', "")
        else:
            characters = altered.replace(']', ",")
        return characters.split(', ')
    

In [28]:
scraped_data = scrape_all_stories_on_page(url)
example_key = list(scraped_data.keys())[0]
print(scraped_data[example_key]['title'])
print(scraped_data[example_key]['characters'])
print(scraped_data[example_key]['genres'])

Raven and the Philosopher Stone
[]
['Adventure']


In [29]:
import json

filename = '20170723_page1.json'
with open(filename, 'w') as outfile:
    json.dump(scraped_data, outfile)

In [30]:
# make sure we can open it and that it is the same
with open(filename) as infile:
    data = json.load(infile)
print(data[example_key]['title'])
print(data[example_key]['characters'])
print(data[example_key]['genres'])

Raven and the Philosopher Stone
[]
['Adventure']
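
The lookup by `example_key` works after reloading only because the story IDs are stored as strings. JSON object keys are always strings, so integer IDs would come back as strings too. Here is a small sketch of the round trip using `with` blocks and a temporary file (the file name is made up, and the sample record reuses values from the earlier cells):

```python
import json
import os
import tempfile

scraped = {285322: {'title': 'Facing Life', 'genres': ['Angst', 'Poetry']}}

path = os.path.join(tempfile.mkdtemp(), 'page.json')
with open(path, 'w') as outfile:
    json.dump(scraped, outfile)

with open(path) as infile:
    reloaded = json.load(infile)

print(list(reloaded.keys()))  # ['285322'], the integer key is now a string
print(reloaded['285322'] == scraped[285322])  # True
```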


In [55]:
import re
pattern = re.compile(r'Last')
soup.find('center').find('a')



<a href="/book/Harry-Potter/?&amp;srt=1&amp;r=103&amp;p=23517">« Prev</a>
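
The `re` pattern above goes unused. The page number can also be pulled out of a pagination link's `href` with the standard library. Here is a minimal sketch, assuming the links keep the `p=` query parameter seen in the output above (`get_page_number` is a hypothetical helper):

```python
from urllib.parse import urlparse, parse_qs

def get_page_number(href):
    """Return the value of the p= query parameter as an int, or None."""
    query = parse_qs(urlparse(href).query)
    pages = query.get('p')
    return int(pages[0]) if pages else None

print(get_page_number('/book/Harry-Potter/?&srt=1&r=103&p=23517'))  # 23517
```

The `&amp;` in the printed tag is just how the HTML is rendered. Reading `link['href']` from BeautifulSoup gives the decoded `&`, which is the form this helper expects.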

In [189]:
# sanity checking from the .py
import json

with open('../src/fanfic/scrape/data.json') as json_file:
    data = json.load(json_file)
example_key = list(data.keys())[72]
#[data[key]['title'] for key in data.keys()]
data[example_key]

{'author_id': '1068464',
 'characters': [['Hermione G.', 'Severus S.'], 'Harry P.'],
 'genres': ['Romance', 'Humor'],
 'language': 'English',
 'num_chapters': 1,
 'num_favs': 1,
 'num_follows': 1,
 'num_reviews': 1,
 'num_words': 1114,
 'published': 1501191659,
 'rated': 'T',
 'status': 'Complete',
 'title': 'The Termination',
 'updated': 1501191659}