# Experimenting with an existing scraper

This notebook is intended for playing around with web scraping, particularly for [FanFiction.net](https://www.fanfiction.net).

This first section experiments with the scraping library found at https://github.com/smilli/fanfiction to see if it will get what I need.

In [77]:
# import the library, initialize the scraper
from fanfiction import Scraper
scraper = Scraper()

In [9]:
# this may or may not be my fan fiction from years ago
story_id = 7127370
metadata = scraper.scrape_story_metadata(story_id)
metadata

{'author_id': 1625333,
 'canon': 'Harry Potter',
 'canon_type': 'Books',
 'genres': ['Romance'],
 'id': 7127370,
 'lang': 'English',
 'num_chapters': 38,
 'num_favs': 58,
 'num_follows': 87,
 'num_reviews': 85,
 'num_words': 130649,
 'published': 1309299920,
 'rated': 'Fiction  T',
 'status': 'Complete',
 'title': 'Something Great',
 'updated': 1446343703}

Yup, that's the fan fiction I was thinking of! Let's see when it was published...

In [10]:
published_timestamp = metadata['published']
published_timestamp

1309299920

Let's see when it was published in a format I would understand...

In [11]:
import datetime
datetime.datetime.fromtimestamp(int(published_timestamp)).strftime('%Y-%m-%d %H:%M:%S')

'2011-06-28 18:25:20'
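
One caveat: `fromtimestamp` without a timezone converts to the local time of whatever machine runs the notebook, so the string above would come out differently elsewhere. Here is a minimal sketch of a timezone-independent version, assuming UTC is an acceptable reference (the helper name `to_utc_string` is made up for this example):

```python
from datetime import datetime, timezone

def to_utc_string(timestamp):
    """Format a Unix timestamp (int or numeric string) as a UTC date string."""
    moment = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')

print(to_utc_string(1309299920))
```

For the story above this prints `2011-06-28 22:25:20`, four hours ahead of the local time shown earlier.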

Great! The next step is to figure out how to get a ton of Harry Potter story IDs and parse the metadata for all of them, while keeping fanfiction.net's terms of service in mind.

![screenshot from fanfiction.net](fanficnet_screenshot.png)


So it seems like all of that metadata is on the list of stories as well. Perhaps I won't need to use this library and should instead adapt its code to scrape this listing of stories. That way I'm not pinging fanfiction.net more than necessary (once to get the story IDs and once more per story to get the metadata). Let's see if we can do it all in one go.

Looking at this screenshot, it seems like there might be a problem where 'Published' is listed as '23m ago' which is not a timestamp. Let's see if that's the case.

In [12]:
story_id = 12582445
metadata = scraper.scrape_story_metadata(story_id)

IndexError: list index out of range

Actually, it seems like the problem isn't the '23m ago'. Rather, this story has never been updated, so there is only one time on the page: what the library reads as 'updated' is really 'published', and there is no 'updated' time at all.

# Experimenting with scraping

Most of this will be adapted from the code from the above library. I'm thinking that rather than me scraping for all of the story IDs and then using their library to scrape for all of the metadata, can we do it all in one go? To be kind to the FanFiction.net servers :)

In [5]:
import requests

# we're only going to look at harry potter fanfics 
base_url = "https://www.fanfiction.net/book/Harry-Potter"
# this gets appended to the base url; the page number goes after 'p='
page_suffix = "?&srt=1&r=103&p="

# 30 seconds seems reasonable for a human to quickly scroll through a page
rate_limit = 30

# let's start with page 1. this would eventually go into a for loop index, I imagine
page=1

Alright, now let's make a request and see what we get in return.

In [10]:
url = '{0}/{1}{2}'.format(base_url, page_suffix, str(page))
raw_result = requests.get(url)
html = raw_result.content
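
The `rate_limit` defined above isn't applied anywhere yet. Here is a rough sketch of how a page loop might honor it, assuming the fetching is passed in as a function so the pacing can be tried without hitting the site. `page_url`, `fetch_pages`, and their parameters are hypothetical names, not part of the library:

```python
import time

BASE_URL = "https://www.fanfiction.net/book/Harry-Potter"
PAGE_SUFFIX = "?&srt=1&r=103&p="

def page_url(page):
    """Build the listing URL for one page, in the same format as the cell above."""
    return '{0}/{1}{2}'.format(BASE_URL, PAGE_SUFFIX, page)

def fetch_pages(pages, fetch, delay_seconds=30, sleep=time.sleep):
    """Call fetch(url) for each page, pausing between consecutive requests."""
    results = []
    for index, page in enumerate(pages):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(page_url(page)))
    return results
```

With `requests`, the `fetch` argument could be something like `lambda url: requests.get(url).content`.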

In [46]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

In [47]:
all_stories_on_page = soup.find_all('div', class_='z-list zhover zpointer ')
len(all_stories_on_page)

25
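
Passing the full class string, trailing space included, only matches when the whole `class` attribute lines up with it, and how that trailing space is treated may depend on the BeautifulSoup version (an assumption worth checking against whatever is installed). Here is a sketch of a less brittle alternative, using a CSS selector on a single class. The HTML is a tiny stand-in for the real page:

```python
from bs4 import BeautifulSoup

sample_html = '''
<div class="z-list zhover zpointer ">story one</div>
<div class="z-list zhover zpointer ">story two</div>
<div class="other">not a story</div>
'''

sample_soup = BeautifulSoup(sample_html, "html.parser")
# 'div.z-list' matches any <div> whose class list contains z-list
stories = sample_soup.select('div.z-list')
print(len(stories))
```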

In [48]:
# choosing index 2 (the third story) because it has some reviews/follows
a_story = all_stories_on_page[2]
print(a_story.prettify())

<div class="z-list zhover zpointer " style="min-height:77px;border-bottom:1px #cdcdcd solid;">
 <a class="stitle" href="/s/12565468/1/Reasons">
  <img class="lazy cimage " data-original="//ffcdn2012t-fictionpressllc.netdna-ssl.com/image/4778442/75/" height="66" src="/static/images/d_60_90.jpg" style="clear:left;float:left;margin-right:3px;padding:2px;border:1px solid #ccc;-moz-border-radius:2px;-webkit-border-radius:2px;" width="50"/>
  Reasons
 </a>
 <a href="/s/12565468/4/Reasons">
  <span class="icon-chevron-right xicon-section-arrow">
  </span>
 </a>
 by
 <a href="/u/6457851/ILoveHarmony">
  ILoveHarmony
 </a>
 <a class="reviews" href="/r/12565468/">
  reviews
 </a>
 <div class="z-indent z-padtop">
  Reasons to love Hermione Jean Granger -by Harry James Potter
  <div class="z-padtop2 xgray">
   Rated: T - English - Romance/Humor - Chapters: 4 - Words: 3,983 - Reviews: 7 - Favs: 17 - Follows: 18 - Updated:
   <span data-xutime="1500814509">
    40m
   </span>
   - Published:
   <spa

Whew, we're getting there! Here's one story from page 1. Let's see if we can get all the metadata the way the fanfiction library does. We'll ignore canon and canon_type since this will by default be all Harry Potter books. So we're going to look for author ID, title, updated, published, language, genres, number of reviews, number of favorites, number of follows, number of words, completion, and the rating.

In [75]:
# let's start with the title
title = a_story.find(class_='stitle').get_text()
title

'Reasons'

Okay, that was the easy one. I can do this!!

In [50]:
story_url = a_story.find(class_='stitle')['href']
story_url

'/s/12565468/1/Reasons'

I guess you could also get the title from that, though it seems like it'd be more annoying since I would then have to deal with spaces. So let's stick with this way.

In [51]:
story_url.split("/")

['', 's', '12565468', '1', 'Reasons']

In [52]:
story_id = story_url.split("/")[2]
story_id

'12565468'
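
The split works as long as the href always has the shape `/s/<id>/<chapter>/<title>`, which is an assumption based on the one URL seen so far. Here is a small sketch that checks that shape with a regular expression and fails loudly otherwise (`parse_story_url` is a hypothetical helper):

```python
import re

# Expected shape: /s/<numeric id>/<chapter number>/<title slug>
STORY_URL_PATTERN = re.compile(r'^/s/(\d+)/(\d+)/([^/]*)$')

def parse_story_url(story_url):
    """Split a story href into its id, chapter number, and title slug."""
    match = STORY_URL_PATTERN.match(story_url)
    if match is None:
        raise ValueError('unexpected story URL: {0!r}'.format(story_url))
    story_id, chapter, slug = match.groups()
    return {'id': story_id, 'chapter': int(chapter), 'slug': slug}

print(parse_story_url('/s/12565468/1/Reasons'))
```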

In [53]:
a_story.find_all('a')

[<a class="stitle" href="/s/12565468/1/Reasons"><img class="lazy cimage " data-original="//ffcdn2012t-fictionpressllc.netdna-ssl.com/image/4778442/75/" height="66" src="/static/images/d_60_90.jpg" style="clear:left;float:left;margin-right:3px;padding:2px;border:1px solid #ccc;-moz-border-radius:2px;-webkit-border-radius:2px;" width="50"/>Reasons</a>,
 <a href="/s/12565468/4/Reasons"><span class="icon-chevron-right xicon-section-arrow"></span></a>,
 <a href="/u/6457851/ILoveHarmony">ILoveHarmony</a>,
 <a class="reviews" href="/r/12565468/">reviews</a>]

In [54]:
# looks like the author's <a> has no class or id to select on. hopefully it's always the third one
a_story.find_all('a')[2]

<a href="/u/6457851/ILoveHarmony">ILoveHarmony</a>

In [55]:
author_url = a_story.find_all('a')[2]['href']
author_id = author_url.split("/")[2]
author_id

'6457851'
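
Relying on the author link being third could break if a blurb has more or fewer links than this one. Here is a sketch that instead looks for the first href starting with `/u/`. It takes plain href strings, which could be collected with `[a['href'] for a in a_story.find_all('a', href=True)]`. `find_author_id` is a made-up name:

```python
import re

# Author links look like /u/<numeric id>/<pen name>
AUTHOR_HREF_PATTERN = re.compile(r'^/u/(\d+)/')

def find_author_id(hrefs):
    """Return the id from the first author-style href, or None if there is none."""
    for href in hrefs:
        match = AUTHOR_HREF_PATTERN.match(href)
        if match:
            return match.group(1)
    return None

hrefs = [
    '/s/12565468/1/Reasons',
    '/s/12565468/4/Reasons',
    '/u/6457851/ILoveHarmony',
    '/r/12565468/',
]
print(find_author_id(hrefs))
```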

In [61]:
metadata_div = a_story.find('div', class_="z-padtop2 xgray")
metadata_div

<div class="z-padtop2 xgray">Rated: T - English - Romance/Humor - Chapters: 4 - Words: 3,983 - Reviews: 7 - Favs: 17 - Follows: 18 - Updated: <span data-xutime="1500814509">40m</span> - Published: <span data-xutime="1499632097">7/9</span> - [Harry P., Hermione G.]</div>

In [62]:
times = metadata_div.find_all(attrs={'data-xutime':True})
times

[<span data-xutime="1500814509">40m</span>,
 <span data-xutime="1499632097">7/9</span>]

In [69]:
import datetime
def convertTime(time):
    return datetime.datetime.fromtimestamp(int(time)).strftime('%Y-%m-%d %H:%M:%S')

In [73]:
if len(times) == 2:
    updated = times[0]['data-xutime']
    published = times[1]['data-xutime'] 
else:
    updated = times[0]['data-xutime']
    published = updated

print(convertTime(updated))
print(convertTime(published))

2017-07-23 08:55:09
2017-07-09 16:28:17
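
The same fallback can be wrapped in a small function that also guards against a blurb with no time spans at all, which would otherwise raise an `IndexError` much like the library did. This is only a sketch: `updated_and_published` is a hypothetical name, and it assumes the values arrive as the `data-xutime` strings in page order:

```python
def updated_and_published(xutimes):
    """Return (updated, published) as ints from a list of data-xutime strings.

    With two values the order is [updated, published]; with a single value
    the story was never updated, so both timestamps are the same.
    """
    if not xutimes:
        raise ValueError('no data-xutime values found')
    updated = int(xutimes[0])
    published = int(xutimes[-1])
    return updated, published

print(updated_and_published(['1500814509', '1499632097']))
print(updated_and_published(['1499632097']))
```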


In [74]:
metadata_div.get_text()

'Rated: T - English - Romance/Humor - Chapters: 4 - Words: 3,983 - Reviews: 7 - Favs: 17 - Follows: 18 - Updated: 40m - Published: 7/9 - [Harry P., Hermione G.]'

In [76]:
# looks like things are separated by -'s
metadata_parts = metadata_div.get_text().split('-')
metadata_parts

['Rated: T ',
 ' English ',
 ' Romance/Humor ',
 ' Chapters: 4 ',
 ' Words: 3,983 ',
 ' Reviews: 7 ',
 ' Favs: 17 ',
 ' Follows: 18 ',
 ' Updated: 40m ',
 ' Published: 7/9 ',
 ' [Harry P., Hermione G.]']
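
Splitting on a bare `-` happens to work for this story, but any hyphen inside a field would split that field too. Assuming a genre such as `Sci-Fi` can appear in this line (not verified against a live listing here), splitting on `' - '` with the surrounding spaces looks safer. The sample text below is made up to show the difference:

```python
sample_text = 'Rated: T - English - Sci-Fi/Humor - Chapters: 4 - Words: 3,983'

# A bare hyphen also cuts 'Sci-Fi' in half
naive_parts = sample_text.split('-')
# Requiring the surrounding spaces keeps hyphenated values together
safer_parts = sample_text.split(' - ')

print(naive_parts)
print(safer_parts)
```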

In [81]:
# we'll use that library's nice get_genres function
genres = scraper.get_genres(metadata_parts[2].strip())
genres

['Romance', 'Humor']

In [83]:
language = metadata_parts[1].strip()
language

'English'

In [85]:
# put together what we have so far
metadata = {
    'id': story_id,
    'author_id': author_id,
    'title': title,
    'updated': int(updated),
    'published': int(published),
    'language': language,
    'genres': genres
}
metadata

{'author_id': '6457851',
 'genres': ['Romance', 'Humor'],
 'id': '12565468',
 'language': 'English',
 'published': 1499632097,
 'title': 'Reasons',
 'updated': 1500814509}

In [86]:
# much thanks to the original library for this logic
for parts in metadata_parts:
    parts = parts.strip()
    # already dealt with language and genres- everything else should have name: value
    tag_and_val = parts.split(':')
    if len(tag_and_val) != 2:
        continue
    tag, val = tag_and_val
    tag = tag.strip().lower()
    if tag not in metadata:
        val = val.strip()
        try:
            val = int(val.replace(',', ''))
            metadata['num_'+tag] = val
        except ValueError:
            metadata[tag] = val

metadata

{'author_id': '6457851',
 'genres': ['Romance', 'Humor'],
 'id': '12565468',
 'language': 'English',
 'num_chapters': 4,
 'num_favs': 17,
 'num_follows': 18,
 'num_reviews': 7,
 'num_words': 3983,
 'published': 1499632097,
 'rated': 'T',
 'title': 'Reasons',
 'updated': 1500814509}

Alright! We will have to deal with 'status' differently than the library does, because it doesn't show up on the listing page the same way it does on a story's own page. I'd also like to get the character associations, which the original library doesn't do.

It seems like the last metadata portion is either Published, Complete, or the character listing, so we'll just have to use ifs.

In [89]:
last_part = metadata_parts[-1]
last_part

' [Harry P., Hermione G.]'

In [106]:
# seems like sometimes there are brackets and sometimes there aren't...
def get_characters(character_text):
    stripped = character_text.strip()
    bracketless = stripped.replace('[', "").replace(']', "")
    return bracketless.split(', ')
    
get_characters(last_part)

['Harry P.', 'Hermione G.']
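
One thing to watch: if a listing can mix a bracketed group with names outside the brackets (an assumption, something like `[Harry P., Hermione G.] Ron W.`), dropping the brackets would glue `Hermione G. Ron W.` into one name. Here is a sketch that treats each bracket as a separator and also keeps the bracketed groups. `parse_characters` is a hypothetical alternative, not the library's function:

```python
import re

def parse_characters(character_text):
    """Return (all character names, bracketed groups) from a character listing."""
    text = character_text.strip()
    # Each [...] group is kept as its own list of names
    groups = [
        [name.strip() for name in group.split(',') if name.strip()]
        for group in re.findall(r'\[([^\]]*)\]', text)
    ]
    # Turn brackets into commas so names on either side stay separate
    flattened = re.sub(r'[\[\]]', ',', text)
    characters = [name.strip() for name in flattened.split(',') if name.strip()]
    return characters, groups

print(parse_characters(' [Harry P., Hermione G.]'))
print(parse_characters('[Harry P., Hermione G.] Ron W., Draco M.'))
```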

In [107]:
# strip first: the parts still carry the whitespace left over from the split
last_part = last_part.strip()
if last_part == 'Complete':
    metadata['status'] = 'Complete'
else:
    metadata['status'] = 'Incomplete'
    if last_part.startswith('Published'):
        metadata['characters'] = []
    else:
        metadata['characters'] = get_characters(last_part)
metadata

{'author_id': '6457851',
 'characters': ['Harry P.', 'Hermione G.'],
 'genres': ['Romance', 'Humor'],
 'id': '12565468',
 'language': 'English',
 'num_chapters': 4,
 'num_favs': 17,
 'num_follows': 18,
 'num_reviews': 7,
 'num_words': 3983,
 'published': 1499632097,
 'rated': 'T',
 'status': 'Incomplete',
 'title': 'Reasons',
 'updated': 1500814509}
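
Looking only at the last part means a completed story never gets a 'characters' entry, even when it lists characters. Here is a sketch that doesn't depend on position, assuming 'Published:' is always present and that anything after it other than 'Complete' is the character listing. `status_and_character_text` is a hypothetical helper:

```python
def status_and_character_text(metadata_parts):
    """Return (status, raw character text) from the split metadata parts."""
    parts = [part.strip() for part in metadata_parts]
    status = 'Complete' if 'Complete' in parts else 'Incomplete'
    # Find where the 'Published: ...' part sits, if it is there at all
    published_index = next(
        (i for i, part in enumerate(parts) if part.startswith('Published:')),
        None,
    )
    trailing = [] if published_index is None else parts[published_index + 1:]
    # Whatever follows 'Published' and is not the status is taken as characters
    character_text = next((part for part in trailing if part != 'Complete'), '')
    return status, character_text

print(status_and_character_text([' Published: 7/9 ', ' [Harry P., Hermione G.]']))
print(status_and_character_text(['Published: 7/9', 'Harry P.', 'Complete']))
```

The returned text could then go through `get_characters`, with an empty string meaning no characters were listed.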

I think that's everything! Now let's put it into one function and see how it does...

## One function

In [None]:
def scrape_story_blurb(url):
    