# Webscraping Archive of Our Own (AO3) Data

This code is adapted from [Sopia Z.](https://medium.com/nerd-for-tech/mining-fanfics-on-ao3-part-1-data-collection-eac8b5d7a7fa) on Medium. I modified it to fit the needs of my project, specifically by adding the fandoms, relationships, characters, and tags columns to the dataset.

---

### Importing Libraries

Of the libraries below, the only ones you may need to install in your environment are bs4 and pandas, plus lxml, which BeautifulSoup uses as its parser in the processing function further down. Everything else is part of the Python standard library.

In [None]:
import time
import csv
from bs4 import BeautifulSoup
import re
import urllib.request
import pandas as pd

### Scraping the Data from AO3

Before scraping, decide what data you want. If you use AO3's advanced search or filters to select a specific subset of works, the URL will differ from the standard AO3 URL.

Once you know what you need (in my case, the last 4 weeks of fanfiction updates), run the search on AO3 and copy the resulting URL. Find the "page=#" part, where # is the page number you are currently on. Then, in the getContent function below, split the URL around that number and assign it to the "url" variable like this:

url = "https://archiveofourown.org/some/random/gibberish/until&page=" + str(i) + "now/everything/after/the/page/number/if/there/is/more"
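
As an alternative to splitting the string by hand, the page number can be set with the standard library's URL tools. This is only a sketch: `with_page` is a hypothetical helper that is not part of the original code, and the shortened search URL in the usage note is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_page(search_url, page):
    """Return search_url with its 'page' query parameter set to `page`."""
    parts = urlsplit(search_url)
    # keep_blank_values=True preserves empty filters such as "work_search[title]=".
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key != "page"]
    query.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, `with_page("https://archiveofourown.org/works/search?commit=Search&page=3", 7)` returns the same URL with `page=7` moved to the end of the query string.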


In [None]:
headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_2_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Chromium/18.0.1025.168 Chrome/18.0.1025.168'}

def getContent(url, start_page=1, end_page=1):
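    # NOTE: the `url` argument is currently unused; the search URL is hard-coded in the loop.
    # To scrape a tag listing instead, pass a base URL of the form
    # "https://archiveofourown.org/tags/###TAG###/works?page=" and use the commented-out line.
    basic_url = url

    for i in range(start_page, end_page+1):
        #url = basic_url+str(i)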
        url = "https://archiveofourown.org/works/search?commit=Search&page="+str(i)+"&work_search%5Bbookmarks_count%5D=&work_search%5Bcharacter_names%5D=&work_search%5Bcomments_count%5D=&work_search%5Bcomplete%5D=&work_search%5Bcreators%5D=&work_search%5Bcrossover%5D=&work_search%5Bfandom_names%5D=&work_search%5Bfreeform_names%5D=&work_search%5Bhits%5D=&work_search%5Bkudos_count%5D=&work_search%5Blanguage_id%5D=en&work_search%5Bquery%5D=&work_search%5Brating_ids%5D=&work_search%5Brelationship_names%5D=&work_search%5Brevised_at%5D=4+weeks+ago&work_search%5Bsingle_chapter%5D=0&work_search%5Bsort_column%5D=kudos_count&work_search%5Bsort_direction%5D=desc&work_search%5Btitle%5D=&work_search%5Bword_count%5D="
        try:
            req = urllib.request.Request(url,headers=headers)
            resp = urllib.request.urlopen(req)
            pageName = "./Test/"+str(i)+".html"
            with open(pageName, 'w', encoding="utf-8") as f:
                f.write(resp.read().decode('utf-8'))        
                print (pageName, end=" ")
            time.sleep(5)
        except urllib.error.HTTPError as e:
            if e.code == 429:
                print('Too many requests! Stopping here.')
                print('we should restart on page', i)
                print('we should restart with this url:', url)
                break
            raise
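
The function above stops at the first 429 response. One possible alternative, sketched here under assumptions, is to wait and retry a few times before giving up. `fetch_with_backoff`, its parameters, and the delay values are hypothetical and not part of the original code; `fetch` stands for any zero-argument callable that performs the request.

```python
import time
import urllib.error


def fetch_with_backoff(fetch, retries=3, base_delay=60, sleep=time.sleep):
    """Call fetch(); on HTTP 429, wait and retry with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as e:
            # Re-raise anything that is not a rate limit, or the final failure.
            if e.code != 429 or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)
```

Inside getContent, the request could then be wrapped as `fetch_with_backoff(lambda: urllib.request.urlopen(req).read().decode('utf-8'))`. The `sleep` parameter exists only so the waiting can be swapped out when trying the helper.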

### Using the Webscraping Function

Pass the starting and ending page numbers shown in the AO3 search results. The pages are saved in a folder named "Test" next to the notebook, so create that folder first. The function may stop before reaching the end page, for example when AO3 responds with "too many requests". If that happens, run the command again with the start page set to the page it stopped on.
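
To find where to restart, one option is to look for the first page file that has not been saved yet. This is a minimal sketch, assuming the pages are stored as "<page number>.html" as in getContent; `first_missing_page` is a hypothetical helper.

```python
import os


def first_missing_page(folder, start_page, end_page):
    """Return the first page in [start_page, end_page] with no saved HTML file."""
    for page in range(start_page, end_page + 1):
        if not os.path.exists(os.path.join(folder, f"{page}.html")):
            return page
    return None  # every page in the range is already on disk
```

If it returns a number, that number can be used as the new start page; if it returns `None`, nothing is left to download in that range.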

In [None]:
getContent("e", 1967, 2005)  # the first argument is a placeholder; the URL is hard-coded inside getContent

### Converting the HTML Files into Usable Data

This function parses the contents of one saved HTML file, extracts the data for each work listed on the page, and appends the resulting data frame to the CSV file.

In [None]:
def process_basic(page_content):
    bs = BeautifulSoup(page_content, 'lxml')
    titles = []
    authors = []
    ids = []
    fandoms = []
    date_updated = []
    ratings = []
    pairings = []
    warnings = []
    relationships = []
    characters = []
    tags = []
    complete = []
    languages = []
    word_count = []
    chapters = []
    comments = []
    kudos = []
    bookmarks = []
    hits = []

    for article in bs.find_all('li', {'role':'article'}):
        titles.append(article.find('h4', {'class':'heading'}).find('a').text)
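        try:
            authors.append(article.find('a', {'rel':'author'}).text)
        except AttributeError:
            # find() returned None: the work has no author link
            authors.append('Anonymous')
        # the href looks like "/works/<id>", so drop the 7-character "/works/" prefix
        ids.append(article.find('h4', {'class':'heading'}).find('a').get('href')[7:])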
        try:
            result = []
            for lists in article.find_all('h5', {'class':'fandoms heading'}):
                for fandom in lists:
                    if not fandom.find('a'):
                        result.append(fandom.text)
            result.pop(0)
            fandoms.append(result)
        except:
            fandoms.append([])
        date_updated.append(article.find('p', {'class':'datetime'}).text)
        ratings.append(article.find('span', {'class':re.compile(r'rating\-.*rating')}).text)
        pairings.append(article.find('span', {'class':re.compile(r'category\-.*category')}).text)
        warnings.append(article.find('span', {'class':re.compile(r'warning\-.*warnings')}).text)
        try:
            result = []
            for lists in article.find_all('li', {'class':'relationships'}):
                for relation in lists:
                    result.append(relation.text)
            relationships.append(result)
        except:
            relationships.append([])
        try:
            result = []
            for lists in article.find_all('li', {'class':'characters'}):
                for character in lists:
                    result.append(character.text)
            characters.append(result)
        except:
            characters.append([])
        try:
            result = []
            for lists in article.find_all('li', {'class':'freeforms'}):
                for tag in lists:
                    result.append(tag.text)
            tags.append(result)
        except:
            tags.append([])
        complete.append(article.find('span', {'class':re.compile(r'complete\-.*iswip')}).text)
        languages.append(article.find('dd', {'class':'language'}).text)
        count = article.find('dd', {'class':'words'}).text
        if len(count) > 0:
            word_count.append(count)
        else:
            word_count.append('0')
        chapters.append(article.find('dd', {'class':'chapters'}).text.split('/')[0])
        # Each of these stats is omitted from the listing when find() returns None,
        # in which case the count is recorded as '0'.
        try:
            comments.append(article.find('dd', {'class':'comments'}).text)
        except AttributeError:
            comments.append('0')
        try:
            kudos.append(article.find('dd', {'class':'kudos'}).text)
        except AttributeError:
            kudos.append('0')
        try:
            bookmarks.append(article.find('dd', {'class':'bookmarks'}).text)
        except AttributeError:
            bookmarks.append('0')
        try:
            hits.append(article.find('dd', {'class':'hits'}).text)
        except AttributeError:
            hits.append('0')

    df = pd.DataFrame(list(zip(titles, authors, ids, fandoms, date_updated, ratings, pairings,\
                              warnings, relationships, characters, tags, complete, languages,\
                               word_count, chapters, comments, kudos, bookmarks, hits)))
    
    print('Successfully processed', len(df), 'rows!')
    
    with open('March2023_AO3.csv','a', encoding='utf8') as f:
        df.to_csv(f, header=False, index=False)
    temp = pd.read_csv('March2023_AO3.csv')
    print('Now we have a total of', len(temp), 'rows of data!')
    print('================================')
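
To see how these selectors behave, here is a small sketch run on a made-up HTML fragment rather than a real AO3 page. The fragment, `parse_sample`, and `text_or_default` are illustrative assumptions; it uses Python's built-in "html.parser" so that it runs without lxml, and it covers only a few of the columns.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = """
<ol>
  <li role="article">
    <h4 class="heading"><a href="/works/12345">Sample Title</a> by
      <a rel="author" href="/users/x">sample_author</a></h4>
    <dl><dd class="words">1,234</dd><dd class="chapters">2/5</dd></dl>
  </li>
  <li role="article">
    <h4 class="heading"><a href="/works/67890">Another Title</a> by Anonymous</h4>
    <dl><dd class="words"></dd><dd class="chapters">1/1</dd></dl>
  </li>
</ol>
"""


def text_or_default(node, default):
    """Return the node's text, or `default` when find() found nothing."""
    return node.text if node is not None else default


def parse_sample(html):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.find_all("li", {"role": "article"}):
        link = article.find("h4", {"class": "heading"}).find("a")
        rows.append({
            "title": link.text,
            "id": link.get("href")[len("/works/"):],
            "author": text_or_default(article.find("a", {"rel": "author"}), "Anonymous"),
            "words": article.find("dd", {"class": "words"}).text or "0",
            "chapters": article.find("dd", {"class": "chapters"}).text.split("/")[0],
        })
    return rows


rows = parse_sample(SAMPLE_HTML)
```

Checking for `None` explicitly, as `text_or_default` does, is one way to get the same fallback values as the try/except blocks without catching unrelated errors.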

### Creating the CSV file

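This cell creates the CSV file and writes the header row. Run it once before processing the pages, since process_basic only appends rows to the file. Running it again overwrites the file.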

In [None]:
header = ['Title', 'Author', 'ID', 'Fandoms', 'Date_updated', 'Rating', 'Pairing', 'Warning', 'Relationships', 'Characters', 'Tags', 'Complete', 'Language', 'Word_count', 'Num_chapters', 'Num_comments', 'Num_kudos', 'Num_bookmarks', 'Num_hits']
with open('March2023_AO3.csv','w', encoding='utf8', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(header)
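
The Fandoms, Relationships, Characters, and Tags columns hold Python lists, so they will most likely end up in the CSV as text such as "['A', 'B']", and the counts are scraped as text as well. The sketch below shows one way to convert such values when reading the file back. It rests on two assumptions: that list cells are written as Python list literals, and that counts may contain thousands separators like "1,234". `clean_row` and the sample row are hypothetical.

```python
import ast
import csv
import io

LIST_COLUMNS = ("Fandoms", "Relationships", "Characters", "Tags")
COUNT_COLUMNS = ("Word_count", "Num_chapters", "Num_comments",
                 "Num_kudos", "Num_bookmarks", "Num_hits")


def clean_row(row):
    """Turn list-literal text into lists and count text into integers."""
    cleaned = dict(row)
    for col in LIST_COLUMNS:
        if col in cleaned:
            cleaned[col] = ast.literal_eval(cleaned[col])
    for col in COUNT_COLUMNS:
        if col in cleaned:
            # An empty cell is treated as zero.
            cleaned[col] = int(cleaned[col].replace(",", "") or 0)
    return cleaned


# Build a tiny in-memory CSV with made-up values to try the helper on.
sample = io.StringIO()
writer = csv.writer(sample)
writer.writerow(["Title", "Fandoms", "Word_count"])
writer.writerow(["Sample Title", str(["Fandom A", "Fandom B"]), "1,234"])
sample.seek(0)
rows = [clean_row(r) for r in csv.DictReader(sample)]
```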

### Using the Function

This code goes page by page, from page 1 up to a given maximum page number, and calls process_basic on each saved HTML file to extract the data.

In [None]:
totalPages = 2005
for i in range(1, totalPages+1):
    pageName = "./Test/"+str(i)+".html"
    with open(pageName, mode='r', encoding='utf8') as f:
        print('Now we are opening page', i, '...')
        page = f.read()
        process_basic(page)
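
The loop above expects every file from 1.html up to totalPages to exist. If only some pages were downloaded, one option is to process whatever files are actually in the folder. This is a sketch under the assumption that files are named "<page number>.html"; `saved_pages` is a hypothetical helper.

```python
import os
import re


def saved_pages(folder):
    """Return the page numbers of all '<number>.html' files in folder, sorted."""
    pages = []
    for name in os.listdir(folder):
        match = re.fullmatch(r"(\d+)\.html", name)
        if match:
            pages.append(int(match.group(1)))
    return sorted(pages)
```

Looping with `for i in saved_pages("./Test"):` in place of `range(1, totalPages+1)` would then skip missing pages while keeping numeric order.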