# Web Scraping of AO3

We want to scrape a number of texts from the popular fan fiction website *Archive of Our Own* (aka **AO3**). We want to analyze whether the fan fics changed their tone, style and view of some topics after J. K. Rowling tweeted her infamous *TERF wars* and published a lengthy essay (see [HERE](https://www.jkrowling.com/opinions/j-k-rowling-writes-about-her-reasons-for-speaking-out-on-sex-and-gender-issues/)) on 10 June 2020. We are primarily interested in the fan response as it shows up in fan fiction.

The fan response to Rowling’s stance on trans people is what started this little project, but it is by no means all there is to, let's say, *uncover* in the interesting world of *Harry Potter* fan fiction. To build a dataset, we will use the aforementioned website **AO3** to collect a large number of texts and their metadata. We will follow (and maybe change) the [tutorial](https://medium.com/nerd-for-tech/mining-fanfics-on-ao3-part-1-data-collection-eac8b5d7a7fa) by Sophia Z.

## Data set

We want all English-language fan fics that were published between Dec 2019 and Jan 2021 (half a year before and after Rowling’s essay was published). We don't mind if the texts aren't finished, and we don't want to look into the comments, just the actual texts. This leaves us with 2833 fan fics, considerably fewer than we expected. To see the selection, follow this [link](https://archiveofourown.org/works?work_search%5Bsort_column%5D=revised_at&include_work_search%5Bfandom_ids%5D%5B%5D=136512&work_search%5Bother_tag_names%5D=&work_search%5Bexcluded_tag_names%5D=&work_search%5Bcrossover%5D=F&work_search%5Bcomplete%5D=&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D=&work_search%5Bdate_from%5D=2019-12-01&work_search%5Bdate_to%5D=2020-01-01&work_search%5Bquery%5D=&work_search%5Blanguage_id%5D=en&commit=Sort+and+Filter&tag_id=Harry+Potter+-+J*d*+K*d*+Rowling).
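
The filter behind that link is just a set of query parameters. As a minimal sketch, the same kind of query string can be assembled with the standard library, which makes the date range easier to read and change. The helper name *build_search_query* and the dictionary *filters* are our own invention for illustration; the parameter names and values are copied from the link above, with the empty ones left out.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical helper (not part of the tutorial): builds the search part of the URL
def build_search_query(date_from, date_to, language="en", fandom_id="136512"):
    filters = {
        "work_search[sort_column]": "revised_at",
        "include_work_search[fandom_ids][]": fandom_id,
        "work_search[crossover]": "F",
        "work_search[date_from]": date_from,
        "work_search[date_to]": date_to,
        "work_search[language_id]": language,
    }
    # urlencode percent-encodes the square brackets as %5B and %5D
    return urlencode(filters)

query = build_search_query("2019-12-01", "2020-01-01")
print(query)
```

Changing the two date arguments is then enough to request a different period.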

We try some sample code from the tutorial, and it actually runs after some work. Apparently the *urllib* package changed, and we need to circumvent the SSL certificate verification. To do that, we use the *ssl* library and create a new *context* variable. StackOverflow warned us NOT to do that, but it works, soooo... yeah!

In [1]:
import urllib.request
import ssl

url = "https://archiveofourown.org/tags/Harry%20Potter%20-%20J*d*%20K*d*%20Rowling/works?commit=Sort+and+Filter&include_work_search%5Bfandom_ids%5D%5B%5D=136512"

context = ssl._create_unverified_context()
req = urllib.request.Request(url)
resp = urllib.request.urlopen(req, context=context)
content = resp.read()

In [2]:
content[:100]

b'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <meta charset="utf-8"/>\n    <meta http-equiv="x-ua-com'
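
For comparison, here is a small sketch of what the unverified context actually switches off. It only inspects two context objects and sends no request. Whether the default, verifying context would work for AO3 on a given machine depends on the installed root certificates, which we have not tested here.

```python
import ssl

# Default context: verifies the server certificate and the host name
verified = ssl.create_default_context()

# The context used in the cell above: both checks are disabled
unverified = ssl._create_unverified_context()

for name, ctx in [("verified", verified), ("unverified", unverified)]:
    print(name, ctx.verify_mode, ctx.check_hostname)
```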

Now, to do the actual web scraping, we use an updated version of the tutorial's final function. We added our search criteria, which go into the URL after the page number, and we store the scraped pages in our own folder.

We should also at some point explain what website we are actually scraping. In essence, we look at a list of fan fics; the easiest way to get a feel for it is to look at the structure of the web page itself. The function below downloads these listing pages one by one:

In [3]:
import time

headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'}

def getContent(url, start_page, end_page=142):
    context = ssl._create_unverified_context()
    basic_url = url
    search_criterea = "&work_search%5Bcomplete%5D=&work_search%5Bcrossover%5D=F&work_search%5Bdate_from%5D=2019-12-01&work_search%5Bdate_to%5D=2020-01-01&work_search%5Bexcluded_tag_names%5D=&work_search%5Blanguage_id%5D=en&work_search%5Bother_tag_names%5D=&work_search%5Bquery%5D=&work_search%5Bsort_column%5D=revised_at&work_search%5Bwords_from%5D=&work_search%5Bwords_to%5D="
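    # `url` should be of the form "https://archiveofourown.org/tags/###TAG###/works?page="
    # so that the page number can be appended directly

    for i in range(start_page, end_page+1):
        url = basic_url + str(i) + search_criterea
        try:
            req = urllib.request.Request(url, headers=headers)
            resp = urllib.request.urlopen(req, context=context)
            pageName = "./src/websites/" + str(i) + ".html"
            with open(pageName, 'w') as f:
                # str() of a bytes object stores its repr (b'...'), not the decoded HTML,
                # so the saved files start with b' and contain escaped characters
                f.write(str(resp.read()))
                print(pageName, end=" ")
            # pause between requests to go easy on the server
            time.sleep(3)
        except urllib.error.HTTPError as e:
            if e.code == 429:
                # rate-limited: report where to resume and stop (no automatic retry)
                print('Too many requests!---SLEEPING---')
                print('we should restart on page', i)
                print('we should restart with this url:', url)
                break
            raise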

In [4]:
url = "https://archiveofourown.org/tags/Harry%20Potter%20-%20J*d*%20K*d*%20Rowling/works?commit=Sort+and+Filter&include_work_search%5Bfandom_ids%5D%5B%5D=136512&page="

We may have to resume the scraping at a later page, because we sent too many requests to the server:

In [5]:
# We already downloaded everything, so we absolutely DON'T have to do that again
# getContent(url, 76)
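
Instead of reading the restart page off the printed message, one could also look at which files already exist. The following is only a sketch: *first_missing_page* is a hypothetical helper of ours, and the demo runs in a throwaway folder rather than in *./src/websites/*.

```python
import os
import tempfile

def first_missing_page(folder, total_pages):
    """Return the first page number without a saved .html file, or None if all exist."""
    for i in range(1, total_pages + 1):
        if not os.path.exists(os.path.join(folder, str(i) + ".html")):
            return i
    return None

# Demo: pages 1 and 2 exist, so scraping would resume at page 3
with tempfile.TemporaryDirectory() as demo_dir:
    for n in (1, 2):
        with open(os.path.join(demo_dir, str(n) + ".html"), "w") as f:
            f.write("")
    resume_at = first_missing_page(demo_dir, total_pages=5)
    all_done = first_missing_page(demo_dir, total_pages=2)

print(resume_at, all_done)
```

With the real folder, the result could be passed to *getContent* as *start_page*.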

After downloading everything, we load all pages into an empty list called *pages*:

In [6]:
totalPages = 142

pages=[]

for i in range(1, totalPages+1):
    pageName = "./src/websites/"+str(i)+".html"
    with open(pageName, mode='r', encoding='utf8') as f:
        pages.append(f.read())

In [7]:
pages[141][:100]

'b\'<!DOCTYPE html>\\n<html lang="en">\\n  <head>\\n    <meta charset="utf-8"/>\\n    <meta http-equiv="x-'
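
The leading *b'* and the doubled backslashes appear because *getContent* writes *str(resp.read())*, i.e. the printable representation of a bytes object rather than the decoded HTML. A minimal sketch with a made-up byte string (not a real AO3 page) shows the difference, and one possible way to get the readable text back from such a stored string with *ast.literal_eval*:

```python
import ast

# Stand-in for resp.read(); the curly apostrophe becomes three UTF-8 bytes
raw = "<p>Sinner’s</p>\n".encode("utf-8")

as_repr = str(raw)             # what currently ends up in the .html files
as_text = raw.decode("utf-8")  # the HTML as readable text

# Turn the stored repr back into bytes, then decode it
recovered = ast.literal_eval(as_repr).decode("utf-8")

print(as_repr)
print(recovered == as_text)
```

*BeautifulSoup* still finds the tags in the repr version, but non-ASCII characters in titles and summaries stay escaped unless the text is decoded.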

The next step is scraping metadata from the saved pages. To do that, we import *BeautifulSoup* for the actual scraping and *re* for regular expressions:

In [8]:
from bs4 import BeautifulSoup
import re

After that, we make a list called *header* to name all our important columns later. We also open an empty *csv* file for our metadata and save the column names in it:

In [9]:
import csv

header = ['Title', 'Author', 'ID', 'Date_updated', 'Rating', 'Pairing', 'Warning', 'Complete', 'Language', 'Word_count', 'Num_chapters', 'Num_comments', 'Num_kudos', 'Num_bookmarks', 'Num_hits']

with open('some_metadata.csv','w', encoding='utf8') as f:
    writer = csv.writer(f)
    writer.writerow(header)

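## How process_basic works

*process_basic* parses one saved listing page with *BeautifulSoup* and loops over every work on it (each *li* element with the attribute *role="article"*). For each work it collects the title, author, ID, update date, rating, pairing, warning, completion status, language, word count, chapter count, and the numbers of comments, kudos, bookmarks and hits. Works without a linked author are stored as 'Anonymous', and missing counts are stored as '0'. Finally, the rows of the page are appended to *some_metadata.csv*.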

In [10]:
def process_basic(page_content):
    bs = BeautifulSoup(page_content, "html.parser")
    titles = []
    authors = []
    ids = []
    date_updated = []
    ratings = []
    pairings = []
    warnings = []
    complete = []
    languages = []
    word_count = []
    chapters = []
    comments = []
    kudos = []
    bookmarks = []
    hits = []

    for article in bs.find_all('li', {'role':'article'}):
        titles.append(article.find('h4', {'class':'heading'}).find('a').text)
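        try:
            authors.append(article.find('a', {'rel':'author'}).text)
        except AttributeError:
            # find() returned None: the work has no linked author
            authors.append('Anonymous')
        # the link looks like "/works/<ID>"; [7:] strips the leading "/works/"
        ids.append(article.find('h4', {'class':'heading'}).find('a').get('href')[7:])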
        date_updated.append(article.find('p', {'class':'datetime'}).text)
        ratings.append(article.find('span', {'class':re.compile(r'rating\-.*rating')}).text)
        pairings.append(article.find('span', {'class':re.compile(r'category\-.*category')}).text)
        warnings.append(article.find('span', {'class':re.compile(r'warning\-.*warnings')}).text)
        complete.append(article.find('span', {'class':re.compile(r'complete\-.*iswip')}).text)
        languages.append(article.find('dd', {'class':'language'}).text)
        count = article.find('dd', {'class':'words'}).text
        if len(count) > 0:
            word_count.append(count)
        else:
            word_count.append('0')
        chapters.append(article.find('dd', {'class':'chapters'}).text.split('/')[0])
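        # the following counts are missing from the listing when they are zero;
        # find() then returns None and .text raises AttributeError
        try:
            comments.append(article.find('dd', {'class':'comments'}).text)
        except AttributeError:
            comments.append('0')
        try:
            kudos.append(article.find('dd', {'class':'kudos'}).text)
        except AttributeError:
            kudos.append('0')
        try:
            bookmarks.append(article.find('dd', {'class':'bookmarks'}).text)
        except AttributeError:
            bookmarks.append('0')
        try:
            hits.append(article.find('dd', {'class':'hits'}).text)
        except AttributeError:
            hits.append('0')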

    # combine the per-column lists into one row per work
    df = pd.DataFrame(list(zip(titles, authors, ids, date_updated, ratings, pairings,
                               warnings, complete, languages, word_count, chapters,
                               comments, kudos, bookmarks, hits)))
    
    with open('some_metadata.csv','a', encoding='utf8') as f:
        df.to_csv(f, header=False, index=False)

In [11]:
import pandas as pd

for page in pages:
    process_basic(page)
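
To make the lookups in *process_basic* a little more concrete, here is a small sketch using only *re*. The class strings are illustrative assumptions about how AO3 marks up a listing, not copied from a real page; the patterns and the *[7:]* slice are the ones from the function above, and the work link uses an ID from our own data.

```python
import re

rating_pattern = re.compile(r'rating\-.*rating')
complete_pattern = re.compile(r'complete\-.*iswip')

# Assumed examples of class attributes on the <span> elements
rating_hit = rating_pattern.search("rating-explicit rating")
complete_hit = complete_pattern.search("complete-yes iswip")
wrong_hit = rating_pattern.search("category-slash category")

print(bool(rating_hit), bool(complete_hit), bool(wrong_hit))

# Work links look like "/works/<ID>"; dropping the first 7 characters leaves the ID
href = "/works/22075171"
work_id = href[7:]
print(work_id)
```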

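## Now for the actual texts

For each work listed on the saved pages, we open its full text on AO3 and store the ID, tags, summary, publication date and content in *texts.csv*.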

In [12]:
header_row = ['ID', 'Tags', 'Summary', 'Date_published', 'Content']

with open('texts.csv','w', encoding='utf8') as f:
    writer = csv.writer(f)
    writer.writerow(header_row)

In [13]:
def get_tags(article):
    tags = []
    for child in article.find('ul', {'class':'tags commas'}).children:
        if isinstance(child, str):
            pass
        else:
            tags.append(child.text.strip())
    return ', '.join(tags)

def get_summary(article):
    try:
        out = article.find('blockquote', {'class':'userstuff summary'}).text.strip()
        return out
    except AttributeError:
        # find() returned None: this work has no summary
        return ''
    
def open_fic(work_id, headers):
    context = ssl._create_unverified_context()
    url = 'https://archiveofourown.org' + work_id + '?view_adult=true&view_full_work=true'
    req = urllib.request.Request(url, headers=headers)
    resp = urllib.request.urlopen(req, context=context)
    print('Successfully opened fiction:', url)
    bs = BeautifulSoup(resp, "html.parser")
    time.sleep(3)
    return bs

In [14]:
def article_to_row(work_id, article, headers, start_index=1):
    # work_id is the link of the work ("/works/<ID>"); start_index is currently unused
    bs = open_fic(work_id, headers=headers)
    publish_date = bs.find('dd', {'class':'published'}).text
    content = bs.find('div', {'id':'chapters'}).text.strip()

    return [work_id[7:], get_tags(article), get_summary(article), publish_date, content]

def process_articles(articles, start_index=0, start_index2=1):
    # enumerate from start_index so that i is the position on the full page,
    # which is the value needed to resume later
    for i, article in enumerate(articles[start_index:], start=start_index):
        work_id = article.find('h4', {'class':'heading'}).find('a').get('href')
        try:
            row = article_to_row(work_id=work_id, article=article, headers=headers, start_index=start_index2)
            with open('texts.csv','a', encoding='utf8') as f:
                writer = csv.writer(f)
                writer.writerow(row)
        except urllib.error.HTTPError as e:
            if e.code == 429:
                print('---Too many requests when accessing ARTICLE---')
                print('We should try this ID later:', work_id, 'which has an index of', i)
                break
            raise
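
Both *getContent* and *process_articles* stop at the first HTTP 429 and leave the restart to us. As a hedged alternative, the sketch below waits and tries again a few times. *fetch_with_retry* is a hypothetical helper, the waiting time of 300 seconds is a guess rather than a value documented by AO3, and the demo uses a fake fetch function so that no real request is sent.

```python
import time
import urllib.error

def fetch_with_retry(fetch, url, max_tries=3, wait_seconds=300, sleep=time.sleep):
    """Call fetch(url); after an HTTP 429, wait and try again, up to max_tries times."""
    for attempt in range(1, max_tries + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as e:
            # re-raise other errors, and give up after the last attempt
            if e.code != 429 or attempt == max_tries:
                raise
            sleep(wait_seconds)

# Demo: a fake fetch function that is rate-limited twice and then succeeds
calls = []

def fake_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise urllib.error.HTTPError(url, 429, "Too Many Requests", None, None)
    return "page content"

# sleep is replaced by a no-op so the demo finishes immediately
result = fetch_with_retry(fake_fetch, "https://example.org/works/1", sleep=lambda s: None)
print(result, len(calls))
```

In the notebook, *fetch* could be a small function wrapping the *urllib.request.urlopen* call.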

In [15]:
totalPages = 142
# index of the first article to process on the first page of this run
# (set it to a non-zero value when resuming in the middle of a page)
ix = 0
for i in range(1,totalPages+1):
    pageName = "./src/websites/"+str(i)+".html"
    with open(pageName, mode='r', encoding='utf8') as f:
        print('========We are opening page', i, '========')
        page = f.read()
        bs = BeautifulSoup(page, 'html.parser')
        l_articles_on_page = bs.find_all('li', {'role':'article'})
        process_articles(articles=l_articles_on_page, start_index=ix, start_index2=1)
        # all following pages are processed from their first article
        ix = 0
        temp = pd.read_csv('texts.csv')
        print('Now we have a total of', len(temp), 'rows of data!')

Successfully opened fiction: https://archiveofourown.org/works/22075171?view_adult=true&view_full_work=true
Successfully opened fiction: https://archiveofourown.org/works/21632368?view_adult=true&view_full_work=true
Successfully opened fiction: https://archiveofourown.org/works/22075246?view_adult=true&view_full_work=true
Successfully opened fiction: https://archiveofourown.org/works/22075135?view_adult=true&view_full_work=true
Successfully opened fiction: https://archiveofourown.org/works/21990796?view_adult=true&view_full_work=true
Successfully opened fiction: https://archiveofourown.org/works/22073674?view_adult=true&view_full_work=true
Successfully opened fiction: https://archiveofourown.org/works/22074886?view_adult=true&view_full_work=true
Successfully opened fiction: https://archiveofourown.org/works/22074667?view_adult=true&view_full_work=true
Successfully opened fiction: https://archiveofourown.org/works/22074604?view_adult=true&view_full_work=true
Successfully opened fiction:

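## We are finished

Finally, we merge the metadata and the texts on the *ID* column, write each text to its own file and save the remaining metadata in *meta.csv*.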

In [16]:
df1 = pd.read_csv("some_metadata.csv")

In [17]:
df2 = pd.read_csv("texts.csv")

In [21]:
df_final = df1.merge(df2, on="ID")
df_final.head()

Unnamed: 0,Title,Author,ID,Date_updated,Rating,Pairing,Warning,Complete,Language,Word_count,Num_chapters,Num_comments,Num_kudos,Num_bookmarks,Num_hits,Tags,Summary,Date_published,Content
0,Cupcake,indoor_queer,22075171,01 Jan 2020,Explicit,F/F,No Archive Warnings Apply,Complete Work,English,1115,1,0,113,9,4054,"No Archive Warnings Apply, Hermione Granger/Gi...",\n Ginny surprises her wife at work.\n,2020-01-01,Work Text:\r\nHermione was working her way thr...
1,The Sinner\xe2\x80\x99s Redemption,oldenuf2nb,21632368,01 Jan 2020,Explicit,M/M,No Archive Warnings Apply,Complete Work,English,55432,25,350,2016,419,32692,"No Archive Warnings Apply, Draco Malfoy/Harry ...","\n When Headmaster, Harry Potter, loses h...",2019-12-01,Chapter 1: Friendships and Freudian Slips\r\n ...
2,Memory is a fleeting moment,faemalenomad,22075246,01 Jan 2020,Teen And Up Audiences,F/M,No Archive Warnings Apply,Complete Work,English,9887,4,6,48,5,981,"No Archive Warnings Apply, Sirius Black/Lily L...","\n In hindsight, Sirius should have known...",2020-01-01,Chapter 1\r\n\r\n\r\n\r\n\r\n\r\nChapter Text\...
3,What You\'ll Sorely Miss,CocosCocoaPuffsAreNotForSale,22075135,01 Jan 2020,General Audiences,F/M,No Archive Warnings Apply,Complete Work,English,2080,1,1,215,18,3566,"No Archive Warnings Apply, Draco Malfoy/Reader...",\n When Harry\xe2\x80\x99s sister disappe...,2020-01-01,"Work Text:\r\n“Harry, what exactly did the egg..."
4,Hidden Walls,ViridianStarVeil (ViridianVeil),21990796,01 Jan 2020,Teen And Up Audiences,"M/M, F/M, Multi",Choose Not To Use Archive Warnings,Work in Progress,English,11787,4,40,441,97,6417,"Creator Chose Not To Use Archive Warnings, Jam...",\n Harry Potter is getting close to the e...,2019-12-27,Chapter 1\r\n\r\n\r\n\r\n\r\n\r\nChapter Text\...


In [23]:
# write each text to its own file, named after the work ID
for work_id, text in zip(df_final["ID"], df_final["Content"]):
    with open("src/texts/" + str(work_id) + ".txt", "w", encoding="utf8") as f:
        f.write(text)

In [25]:
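# errors="ignore" keeps this cell safe to re-run; without it, a second run fails with
# KeyError: "['Content'] not found in axis" because the column is already gone
df_final.drop("Content", axis=1, inplace=True, errors="ignore")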

In [28]:
df_final.to_csv("meta.csv", encoding = "utf8")