# Webscraping and Social Media Scraping Project

by Mateusz Kowalski & Ewa Włodarczyk

### Libraries

In [17]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import re
import time 

### StoryGraph website

The website we want to scrape: https://www.thestorygraph.com/

It provides the following data for each book:
- title and author
- number of pages
- genres
- mood and pacing
- average star rating
- number of reviews
- whether it is part of a series
- first publication date
- editions
- content warnings
- additional information

A look at a sample webpage with information about a book:


![](img/website-screenshot-1.png)

|![](img/website-screenshot-2.png)  |  ![](img/website-screenshot-3.png)|
|------------|-----------|

### Inspecting the main website's HTML structure with BeautifulSoup

In [88]:
#webpage for browsing books
website_url = "https://app.thestorygraph.com/browse"

#html representation of the webpage
website_html = requests.get(website_url) 

main_soup=BeautifulSoup(website_html.text, 'html.parser')
main_soup

<!DOCTYPE html>

<html class="system">
<head>
<script data-domain="app.thestorygraph.com" defer="" src="https://jolly-sunset-7d5a.thestorygraph.workers.dev/js/script.js"></script>
<title>Browse Books | The StoryGraph</title>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="OVdfA7lVvyezSdii7_tsrUOI-w_K9f6hb7UBAxThJqBiDLmVjADyUEWE0krbHgkCSnwgXdqx4V5DW6tcW-JAeQ" name="csrf-token"/>
<meta content="width=device-width, initial-scale=1, user-scalable=no, viewport-fit=cover" name="viewport"/>
<meta content="default" name="apple-mobile-web-app-status-bar-style"/>
<link data-turbo-track="reload" href="https://assets.thestorygraph.com/assets/tailwind-d9aa4f5ce74bdeada0ac6c24734a3caf1d5c383abb150ef38d5a3b6fcb75e220.css" rel="stylesheet"/>
<link data-turbo-track="reload" href="https://assets.thestorygraph.com/assets/application-83a6ca8fa13783cf851fcbf51c558d627dbd63554582ff0af1ef8f39538555e5.css" rel="stylesheet"/>
<link href="https://assets.thestorygraph.com/assets/actiontext-

### Finding links to book pages from the main website

In [89]:
hrefs=[]

for a in main_soup.find_all('a', href=True): # only 'a' elements that actually have an href
    if '/books/' in a['href'] and 'edition' not in a['href']: # keep book pages, skip edition links
        hrefs.append(a['href'])

#removing duplicates
single_hrefs = list(set(hrefs))

print(single_hrefs)

['/books/4cb9c964-4ddb-42c7-8cec-e527a9ebe8df', '/books/c86c4a48-89aa-4805-a513-2fef489d8420', '/books/be049e13-9e22-4b99-82f5-41fc739ae7e1', '/books/edde2ee2-adc4-43d2-b89c-ff81339a749e', '/books/eb83b253-f87d-459a-b571-34db913be377', '/books/8145fb3d-8156-43f9-be7d-f8c656f81a6d', '/books/d891b0dd-229f-457c-8762-cac7d4823763', '/books/8985f7cc-02f7-4007-ad50-ba555394aa03', '/books/1ee3f05f-17bd-4cc1-bc6b-af353b97698a', '/books/c9ca5bdd-30da-48ad-a0c9-3c9ac03a297a']
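
The filtering step above can also be wrapped in a small helper that tolerates anchors without an `href` and returns absolute URLs. This is only a sketch, not the notebook's own code: `book_urls` and `BASE_URL` are hypothetical names, the sample links other than the book link are made up for illustration, and the input is assumed to be a list of raw `href` values (for example collected with `a.get('href')`, which gives `None` when the attribute is missing).

```python
from urllib.parse import urljoin

BASE_URL = "https://app.thestorygraph.com"  # assumed base, taken from the browse URL


def book_urls(hrefs, base=BASE_URL):
    """Return sorted, de-duplicated absolute URLs of book pages."""
    unique = set()
    for href in hrefs:
        # Skip anchors without an href (None) and anything that is not a book page
        if not href or "/books/" not in href or "edition" in href:
            continue
        unique.add(urljoin(base, href))
    return sorted(unique)


sample = [
    "/books/be049e13-9e22-4b99-82f5-41fc739ae7e1",
    "/books/be049e13-9e22-4b99-82f5-41fc739ae7e1",           # duplicate
    "/books/be049e13-9e22-4b99-82f5-41fc739ae7e1/editions",  # hypothetical edition link, skipped
    "/browse?page=2",                                        # not a book page
    None,                                                    # anchor without href
]
print(book_urls(sample))
```

Using `urljoin` avoids having to think about leading or trailing slashes when the relative link is later combined with the site address.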


### Accessing single book page information

In [189]:
#Let's look at the subpage for Bride by Ali Hazelwood
webpage_html = requests.get("https://app.thestorygraph.com/books/be049e13-9e22-4b99-82f5-41fc739ae7e1")
soup = BeautifulSoup(webpage_html.text, 'html.parser')
soup

<!DOCTYPE html>

<html class="system">
<head>
<script data-domain="app.thestorygraph.com" defer="" src="https://jolly-sunset-7d5a.thestorygraph.workers.dev/js/script.js"></script>
<title>Bride by Ali Hazelwood | The StoryGraph</title>
<meta content="authenticity_token" name="csrf-param"/>
<meta content="Ox7cM0iod17RVRTk7TdpdLG4CRoP26v-CZq-PmKB6REVnqvEZNRo2jfBmP-w2FVwZkmiwYvArbvONHArWc20qA" name="csrf-token"/>
<meta content="width=device-width, initial-scale=1, user-scalable=no, viewport-fit=cover" name="viewport"/>
<meta content="default" name="apple-mobile-web-app-status-bar-style"/>
<meta content="summary" name="twitter:card"/>
<meta content="@thestorygraph" name="twitter:site">
<meta content="Bride by Ali Hazelwood" property="og:title">
<meta content="/books/be049e13-9e22-4b99-82f5-41fc739ae7e1" property="og:url"/>
<meta content="A dangerous alliance between a Vampyre bride and an Alpha Werewolf becomes a love deep enough to ..." property="og:description"/>
<meta content="https://cd

### Finding the information we want to scrape

In [52]:
#1 Title and author

title_author = soup.find('meta' , {'property': "og:title"})['content']
title_author

'Bride by Ali Hazelwood'

In [53]:
#2 Number of pages

pages =  soup.find('p' , {'class': 'text-sm font-light text-darkestGrey dark:text-grey mt-1'})
text=pages.text.strip()
pages=int(re.findall(r'[0-9]+(?= pages\n)',text)[0])
pages

399

In [37]:
#3 Genres

genre=soup.find_all('span', {'class': "inline-block text-xs sm:text-sm text-teal-700 dark:text-teal-200 mr-0.5 mt-1 border border-darkGrey dark:border-darkerGrey rounded-sm py-0.5 px-2"})

genres=[]
for item in genre:
    genres.append(item.text)

genres = list(set(genres))
genres

['romance', 'fiction', 'fantasy']

In [36]:
#4 Moods and pacing

mood = soup.find_all('span', {'class': 'md:mr-1'})
percent = soup.find_all('span', {'class': 'percentage'})

moods=[]
for item in mood:
    moods.append(item.text)
    
percents=[]
for item in percent:
    percents.append(item.text)
    
moods_percents = pd.DataFrame({'mood': moods,'percent': percents})

#removing duplicates
moods_percents = moods_percents.drop_duplicates()
moods_percents

| | mood | percent |
|---|---|---|
| 0 | funny | 75% |
| 1 | lighthearted | 46% |
| 2 | adventurous | 45% |
| 3 | mysterious | 45% |
| 4 | emotional | 41% |
| 5 | tense | 17% |
| 6 | hopeful | 15% |
| 7 | dark | 13% |
| 8 | relaxing | 6% |
| 9 | challenging | 4% |
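
A page does not necessarily list every mood, so a lookup for a missing mood needs a default. The following sketch shows one way to turn the scraped mood and percentage strings into a complete row, using 0 for anything absent. It is an illustration under assumptions rather than the notebook's method: `mood_row` and `ALL_MOODS` are hypothetical names, and the label list is the set of mood and pace labels used later in this notebook.

```python
ALL_MOODS = ["funny", "lighthearted", "adventurous", "mysterious", "emotional",
             "tense", "hopeful", "dark", "relaxing", "challenging", "inspiring",
             "reflective", "sad", "informative", "medium", "fast", "slow"]


def mood_row(moods, percents, all_moods=ALL_MOODS):
    """Map every known mood to an integer percentage, using 0 when a mood is absent."""
    seen = {}
    for mood, percent in zip(moods, percents):
        # Keep only the first occurrence of each mood (the page repeats entries)
        seen.setdefault(mood.strip(), int(percent.strip().rstrip("%")))
    return {mood: seen.get(mood, 0) for mood in all_moods}


row = mood_row(["funny", "lighthearted", "funny"], ["75%", "46%", "75%"])
print(row["funny"], row["lighthearted"], row["sad"])
```

Storing the percentages as integers rather than strings such as `'75%'` also makes later numeric analysis simpler.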


In [28]:
#5 Average star rating

star = soup.find('span', {'class': 'average-star-rating'})
stars=float(star.text.strip())
stars

4.17

In [35]:
#6 Number of reviews

review = soup.find('a' , {'class': "standard-link font-medium uppercase border-b"})
text=review.text.strip()
reviews=int(re.findall(r'[0-9][0-9,]*(?= reviews)',text)[0].replace(",", "")) # also handles single-digit counts and several thousands separators
reviews

20057
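
Review counts come with thousands separators, and a pattern that expects at least two digits and at most one comma can miss very small or very large counts. Below is a minimal sketch of a more tolerant parser. `parse_review_count` is a hypothetical helper, and the sample strings are made up, since the exact link text on the site is not shown here.

```python
import re

# One digit followed by any mix of digits and commas, then the word "review(s)"
COUNT_BEFORE_REVIEWS = re.compile(r"(\d[\d,]*)\s+reviews?\b")


def parse_review_count(text):
    """Return the integer in front of 'review(s)', or None if there is no match."""
    match = COUNT_BEFORE_REVIEWS.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


for text in ["20,057 reviews", "7 reviews", "1,234,567 reviews", "no count here"]:
    print(text, "->", parse_review_count(text))
```

Returning `None` instead of raising lets the caller decide whether a missing count should skip the book or just leave the field empty.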

In [168]:
#7 Is it part of a series
series=0
for a in soup.find_all('a', href=True): # only 'a' elements that actually have an href
    if '/series/' in a['href']:
        series=1
series

0

In [194]:
#8 Additional information

info = soup.find_all('span', {'class': 'review-response-summary'})
    
infos=[]
for item in info:
    infos.append(item.text)

mix=re.findall(r"(?<=A mix: )([0-9]+\%)",infos[0])
character=re.findall(r"(?<=Character: )([0-9]+\%)",infos[0])
plot=re.findall(r"(?<=Plot: )([0-9]+\%)",infos[0])

['14%']

In [169]:
#Creating one data frame for all the information
df = [{'title_author': title_author, 'pages': pages,'genres': genres,
       'funny':moods_percents.loc[moods_percents['mood']=='funny', "percent"].item(),
       'lighthearted':moods_percents.loc[moods_percents['mood']=='lighthearted', "percent"].item(),
       'adventurous':moods_percents.loc[moods_percents['mood']=='adventurous', "percent"].item(),
       'mysterious':moods_percents.loc[moods_percents['mood']=='mysterious', "percent"].item(),
       'emotional':moods_percents.loc[moods_percents['mood']=='emotional', "percent"].item(),
       'tense':moods_percents.loc[moods_percents['mood']=='tense', "percent"].item(),
       'hopeful':moods_percents.loc[moods_percents['mood']=='hopeful', "percent"].item(),
       'dark':moods_percents.loc[moods_percents['mood']=='dark', "percent"].item(),
       'relaxing':moods_percents.loc[moods_percents['mood']=='relaxing', "percent"].item(),
       'challenging':moods_percents.loc[moods_percents['mood']=='challenging', "percent"].item(),
       'inspiring':moods_percents.loc[moods_percents['mood']=='inspiring', "percent"].item(),
       'reflective':moods_percents.loc[moods_percents['mood']=='reflective', "percent"].item(),
        'sad':moods_percents.loc[moods_percents['mood']=='sad', "percent"].item(),
       'informative':moods_percents.loc[moods_percents['mood']=='informative', "percent"].item(),
       'medium_pace':moods_percents.loc[moods_percents['mood']=='medium', "percent"].item(),
       'fast_pace':moods_percents.loc[moods_percents['mood']=='fast', "percent"].item(),
       'slow_pace':moods_percents.loc[moods_percents['mood']=='slow', "percent"].item(),
       'avg_rating':stars, 'reviews':reviews, 'series':series}]
df = pd.DataFrame(df)
df

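| | title_author | pages | genres | funny | lighthearted | adventurous | mysterious | emotional | tense | hopeful | ... | inspiring | reflective | sad | informative | medium_pace | fast_pace | slow_pace | avg_rating | reviews | series |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bride by Ali Hazelwood | 399 | [fiction, fantasy] | 3% | 0% | 75% | 23% | 43% | 76% | 2% | ... | 2% | 4% | 15% | 0% | 66% | 16% | 16% | 4.17 | 670 | 0 |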


### Scraping a few subpages using a loop

In [175]:
#creating a common table for all moods

moods=['funny','lighthearted','adventurous','mysterious','emotional','tense',
       'hopeful','dark','relaxing','challenging','inspiring','reflective','sad','informative',
       'medium','fast','slow']
common_moods = pd.DataFrame({'mood': moods})

#creating a results table
df = [{'title_author': 'title_and_author', 
       'pages': 'number_of_pages',
       'genres': 'genres',
       'funny':'mood_funny',
       'lighthearted':'mood_lighthearted',
       'adventurous':'mood_adventurous',
       'mysterious':'mood_mysterious',
       'emotional':'mood_emotional',
       'tense':'mood_tense',
       'hopeful':'mood_hopeful',
       'dark':'mood_dark',
       'relaxing':'mood_relaxing',
       'challenging':'mood_challenging',
       'inspiring':'mood_inspiring',
       'reflective':'mood_reflective',
        'sad':'mood_sad',
       'informative':'mood_informative',
       'medium_pace':'medium_pace',
       'fast_pace':'fast_pace',
       'slow_pace':'slow_pace',
       'avg_rating':'average_star_rating', 
       'reviews':'number_reviews',
       'series':'is_part_of_a_series'}]

In [176]:
start = time.time()
websites_list=[]
for i in range(1, 26): # let's look at first 25 pages each with 10 books
    websites_list.append("https://app.thestorygraph.com/browse?page="+str(i))
#first loop with browse websites
for website in websites_list:
    website_url = website
    website_html = requests.get(website_url) 
    main_soup=BeautifulSoup(website_html.text, 'html.parser')
    hrefs=[]
    for a in main_soup.find_all('a', href=True): # only 'a' elements that actually have an href
        if '/books/' in a['href'] and 'edition' not in a['href']: # keep book pages, skip edition links
            hrefs.append(a['href'])
    single_hrefs = list(set(hrefs))
    #second loop for book webpages on browse website 
    for href in single_hrefs:
        try:
            webpage_html = requests.get("https://app.thestorygraph.com"+href) # href already starts with '/'
            soup = BeautifulSoup(webpage_html.text, 'html.parser')
            #genres are more complicated
            genre=soup.find_all('span', {'class': "inline-block text-xs sm:text-sm text-teal-700 dark:text-teal-200 mr-0.5 mt-1 border border-darkGrey dark:border-darkerGrey rounded-sm py-0.5 px-2"})
            genres=[]
            for item in genre:
                genres.append(item.text)
            genres = list(set(genres))
            #moods are more complicated
            mood = soup.find_all('span', {'class': 'md:mr-1'})
            percent = soup.find_all('span', {'class': 'percentage'})
            moods=[]
            for item in mood:
                moods.append(item.text)
            percents=[]
            for item in percent:
                percents.append(item.text)
            moods_percents = pd.DataFrame({'mood': moods,'percent': percents})
            moods_percents = moods_percents.drop_duplicates()
            moods_percents=common_moods.merge(moods_percents, on='mood', how='left').fillna('0%')
            #series are more complicated
            series=0
            for a in soup.find_all('a', href=True):
                if '/series/' in a['href']:
                    series=1
            #the rest straight to data frame
            new_data = {'title_author': str(soup.find('meta' , {'property': "og:title"})['content']), 
                   'pages': int(re.findall(r'[0-9]+(?= pages\n)',soup.find('p' , {'class': 'text-sm font-light text-darkestGrey dark:text-grey mt-1'}).text.strip())[0]),
                    'genres': genres,
                   'funny':moods_percents.loc[moods_percents['mood']=='funny', "percent"].item(),
                   'lighthearted':moods_percents.loc[moods_percents['mood']=='lighthearted', "percent"].item(),
                   'adventurous':moods_percents.loc[moods_percents['mood']=='adventurous', "percent"].item(),
                   'mysterious':moods_percents.loc[moods_percents['mood']=='mysterious', "percent"].item(),
                   'emotional':moods_percents.loc[moods_percents['mood']=='emotional', "percent"].item(),
                   'tense':moods_percents.loc[moods_percents['mood']=='tense', "percent"].item(),
                   'hopeful':moods_percents.loc[moods_percents['mood']=='hopeful', "percent"].item(),
                   'dark':moods_percents.loc[moods_percents['mood']=='dark', "percent"].item(),
                   'relaxing':moods_percents.loc[moods_percents['mood']=='relaxing', "percent"].item(),
                   'challenging':moods_percents.loc[moods_percents['mood']=='challenging', "percent"].item(),
                   'inspiring':moods_percents.loc[moods_percents['mood']=='inspiring', "percent"].item(),
                   'reflective':moods_percents.loc[moods_percents['mood']=='reflective', "percent"].item(),
                    'sad':moods_percents.loc[moods_percents['mood']=='sad', "percent"].item(),
                   'informative':moods_percents.loc[moods_percents['mood']=='informative', "percent"].item(),
                   'medium_pace':moods_percents.loc[moods_percents['mood']=='medium', "percent"].item(),
                   'fast_pace':moods_percents.loc[moods_percents['mood']=='fast', "percent"].item(),
                   'slow_pace':moods_percents.loc[moods_percents['mood']=='slow', "percent"].item(),
                  'avg_rating':float(soup.find('span', {'class': 'average-star-rating'}).text.strip()),
                   'reviews':int(re.findall(r'[0-9][0-9,]*(?= reviews)',soup.find('a' , {'class': "standard-link font-medium uppercase border-b"}).text.strip())[0].replace(",", "")),
                    'series':series}
            df.append(new_data)
        except Exception: # skip books whose page lacks any of the expected fields
            continue
df = pd.DataFrame(df)
# removing duplicates
df = df.loc[df.astype(str).drop_duplicates().index]
# saving our result
df.to_csv('books_scraped.csv', index=False)
end = time.time()
print(end - start)

375.6298887729645
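
Because the loop wraps the whole book in one `try` block, a single missing field drops the entire book without any trace. One possible alternative is to guard each field separately, so that a book with one missing value still ends up in the table. The sketch below illustrates the idea with plain Python: `safe_field` is a hypothetical helper, and the `page` dictionary is a stand-in for a parsed book page.

```python
def safe_field(extract, default=None):
    """Run a zero-argument extractor and fall back to `default` if it fails."""
    try:
        return extract()
    except (AttributeError, IndexError, KeyError, TypeError, ValueError):
        # Typical failures: find() returned None, the regex found nothing,
        # an attribute was missing, or the text was not a number.
        return default


page = {"rating": "4.17", "pages": None}  # stand-in for a parsed book page

record = {
    "avg_rating": safe_field(lambda: float(page["rating"])),
    "pages": safe_field(lambda: int(page["pages"])),         # TypeError -> None
    "reviews": safe_field(lambda: int(page["reviews"]), 0),  # KeyError -> 0
}
print(record)
```

Catching a fixed list of exception types, rather than everything, keeps genuine programming errors and keyboard interrupts visible.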


In [177]:
df

| | title_author | pages | genres | funny | lighthearted | adventurous | mysterious | emotional | tense | hopeful | ... | inspiring | reflective | sad | informative | medium_pace | fast_pace | slow_pace | avg_rating | reviews | series |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | title_and_author | number_of_pages | genres | mood_funny | mood_lighthearted | mood_adventurous | mood_mysterious | mood_emotional | mood_tense | mood_hopeful | ... | mood_inspiring | mood_reflective | mood_sad | mood_informative | medium_pace | fast_pace | slow_pace | average_star_rating | number_reviews | is_part_of_a_series |
| 1 | The Women by Kristin Hannah | 471 | [historical, fiction] | 1% | 0% | 35% | 0% | 93% | 44% | 29% | ... | 48% | 44% | 76% | 48% | 67% | 27% | 5% | 4.65 | 5336 | 0 |
| 2 | Iron Flame by Rebecca Yarros | 625 | [romance, fiction, fantasy] | 18% | 1% | 95% | 31% | 67% | 68% | 10% | ... | 7% | 3% | 21% | 1% | 59% | 28% | 12% | 4.2 | 73302 | 1 |
| 3 | Bride by Ali Hazelwood | 399 | [romance, fiction, fantasy] | 75% | 46% | 45% | 45% | 41% | 17% | 15% | ... | 2% | 2% | 2% | 1% | 57% | 39% | 3% | 4.17 | 20118 | 0 |
| 4 | A Fate Inked in Blood by Danielle L. Jensen | 432 | [romance, fiction, fantasy] | 15% | 2% | 97% | 36% | 49% | 54% | 7% | ... | 7% | 4% | 7% | 1% | 61% | 33% | 5% | 4.12 | 2794 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 236 | Powerless by Elsie Silver | 396 | [sports, romance, contemporary, fiction] | 56% | 56% | 14% | 1% | 83% | 8% | 48% | ... | 19% | 16% | 16% | 0% | 68% | 23% | 8% | 4.08 | 15559 | 1 |
| 237 | Worst Wingman Ever by Abby Jimenez | 61 | [romance, contemporary, fiction] | 54% | 65% | 1% | 0% | 62% | 0% | 45% | ... | 17% | 12% | 21% | 0% | 15% | 82% | 1% | 3.96 | 5402 | 1 |
| 238 | Vicious by V.E. Schwab | 364 | [science fiction, fiction, fantasy] | 7% | 0% | 59% | 58% | 18% | 66% | 1% | ... | 0% | 9% | 6% | 0% | 43% | 52% | 4% | 4.28 | 55144 | 1 |
| 239 | The Book Thief by Markus Zusak | 552 | [young adult, historical, fiction] | 10% | 2% | 15% | 5% | 90% | 28% | 22% | ... | 24% | 45% | 83% | 13% | 65% | 9% | 24% | 4.46 | 159350 | 0 |


We can see that scraping information about 240 books took over 6 minutes (about 376 seconds).

At that rate, scraping data for 5,000 books would take over 2 hours, so let's look for a more efficient approach (Scrapy).
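
The two-hour figure follows from a simple linear extrapolation of the measured run time. The short calculation below reproduces it, assuming that the time per book stays constant as the number of books grows, which is a simplification. The variable names are chosen here for illustration.

```python
elapsed_seconds = 375.6298887729645  # run time printed by the loop above
books_scraped = 240                  # number of books reported above
target_books = 5000

seconds_per_book = elapsed_seconds / books_scraped
estimated_hours = seconds_per_book * target_books / 3600

print(f"{seconds_per_book:.2f} s per book")
print(f"{estimated_hours:.2f} h for {target_books} books")
```

This works out to about 1.57 seconds per book and roughly 2.17 hours for 5,000 books.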

With it, we will scrape these pages:

https://app.thestorygraph.com/reading_challenge_prompts/1c947bef-adbd-4424-95cc-dd0d3647122f
https://app.thestorygraph.com/reading_challenge_prompts/cff86474-726d-4821-82a5-290871623f48
https://app.thestorygraph.com/reading_challenge_prompts/75f1df1f-f5b4-4535-83df-89e4c68addf8
https://app.thestorygraph.com/reading_challenge_prompts/b7462a07-b2ff-454c-831a-ba8e553afe59
https://app.thestorygraph.com/reading_challenge_prompts/d752e323-5802-4a4a-9f31-7ccc8cba51b6

(They list more books to browse.)