In this notebook we scrape newly released fiction books, grouped by subject, from [Barnes & Noble](https://www.barnesandnoble.com/). We collect our data as follows:
- Scrape the list of subjects used to categorize the fiction books.
- For each subject, accumulate 40 different books along with the title, author, and summary (if it exists) of each book in the subject.
<br>If only a few summaries are missing, we should be able to fill them in manually using targeted scrape functions.

Note: We take 40 books at a time because the B&N website displays at most 40 books per page. We can collect more data by scraping more than one page if we think it is necessary for the purpose of our project.


In [90]:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import WebDriverException

We use both the Requests and Selenium Python libraries because some pages on B&N's website do not work with Requests; for those pages we use Selenium.
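
The idea of trying Requests first and falling back to Selenium can be sketched as follows. This is only an illustration, not the code used in this notebook: the helper names `fetch_html`, `fetch_with_requests`, and `fetch_with_selenium` are hypothetical, and it assumes that a page "not working" shows up as an exception or an empty response.

```python
from typing import Callable, Iterable, Optional


def fetch_html(url: str, fetchers: Iterable[Callable[[str], Optional[str]]]) -> Optional[str]:
    """Try each fetcher in order and return the first non-empty HTML string."""
    for fetch in fetchers:
        try:
            html = fetch(url)
        except Exception:
            # Any failure (network error, driver error) moves on to the next fetcher.
            continue
        if html:
            return html
    return None


def fetch_with_requests(url: str) -> Optional[str]:
    # Hypothetical helper: a plain HTTP request, imported lazily.
    import requests
    response = requests.get(url, timeout=30)
    return response.text if response.ok else None


def fetch_with_selenium(url: str) -> Optional[str]:
    # Hypothetical helper: render the page in headless Chrome.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    options = Options()
    options.add_argument('--headless')
    browser = webdriver.Chrome(options=options)
    try:
        browser.get(url)
        return browser.page_source
    finally:
        browser.quit()
```

A call such as `fetch_html(url, [fetch_with_requests, fetch_with_selenium])` would then only start a browser when the cheaper request fails.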

In [4]:
# Build the list of (relative URL, subject name) pairs for the subjects we want.

URL = "https://www.barnesandnoble.com/b/fiction/books/_/N-2usxZ29Z8q8"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# The sidebar section that lists the fiction subjects.
results = soup.find(id="sidebar-section-FictionSubjects")

wanted_subjects = {'Science Fiction & Fantasy', 'Thrillers'}
Sci_Fic_Thr = []
for litag in results.find_all('a'):
    subject = litag.text.strip()
    if subject in wanted_subjects:
        # Read the href attribute directly instead of splitting the tag's string form.
        Sci_Fic_Thr.append((litag['href'], subject))

In [5]:
Sci_Fic_Thr

[('/b/books/science-fiction-fantasy/_/N-29Z8q8Z180l',
  'Science Fiction & Fantasy'),
 ('/b/books/fiction/thrillers/_/N-29Z8q8Z1d3u', 'Thrillers')]
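
To see the link extraction in isolation, here is a small offline sketch. It swaps BeautifulSoup for the standard library's `html.parser` so it runs without any installed packages or network access. The HTML snippet is made up for illustration (the "Romance" link and its URL are hypothetical), and `SubjectLinkParser` is not part of the notebook.

```python
from html.parser import HTMLParser


class SubjectLinkParser(HTMLParser):
    """Collect (href, text) pairs for <a> tags whose text is a wanted subject."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            text = ''.join(self._text).strip()
            if text in self.wanted:
                self.links.append((self._href, text))
            self._href = None


# Made-up sidebar markup; only the two real subject paths come from the output above.
sample = '''
<div id="sidebar-section-FictionSubjects">
  <a href="/b/books/science-fiction-fantasy/_/N-29Z8q8Z180l"> Science Fiction &amp; Fantasy </a>
  <a href="/b/books/fiction/romance/_/N-0000">Romance</a>
  <a href="/b/books/fiction/thrillers/_/N-29Z8q8Z1d3u">Thrillers</a>
</div>
'''

parser = SubjectLinkParser({'Science Fiction & Fantasy', 'Thrillers'})
parser.feed(sample)
parser.close()
subject_links = parser.links
```

Stripping the link text before comparing it makes the match robust to stray whitespace around the subject name.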

In [6]:
# Build the list of new-releases URLs for the subjects Thrillers and Science Fiction & Fantasy.

def subjects_new_relases_url(subject_links):
    new_releases_url = []
    for link in subject_links:
        url = 'https://www.barnesandnoble.com' + link[0]
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        target = soup.find(id='hotBooksWithDesc_NewReleases')
        # The "see all" link points to the full new-releases listing for the subject.
        see_all = target.find('a', class_='see-all-link')
        new_releases_url.append((see_all['href'], link[1]))

    return new_releases_url

In [7]:
subjects_new_relases_url(Sci_Fic_Thr)

[('/b/books/science-fiction-fantasy/_/N-1sZ29Z8q8Z180l?Ns=P_Sales_Rank',
  'Science Fiction & Fantasy'),
 ('/b/books/fiction/thrillers/_/N-1sZ29Z8q8Z1d3u?Ns=P_Sales_Rank',
  'Thrillers')]

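By default the new releases page displays only 20 books per page, but choosing "Show 40" displays 40 books per page. The following function gets the same result by rewriting the URL's query string: `Ns=P_Sales_Rank` becomes `Nrpp=40&Ns=P_Sales_Rank&page=i` for each page number `i` from `begin` up to, but not including, `end`.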

In [57]:
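def show_40(new_release_links, begin, end):
    show_40_list = []
    for i in range(begin, end):
        for link in new_release_links:
            # Replacing "s=P_Sales_Rank" keeps the leading "N" of "Ns", so the query
            # becomes "Nrpp=40&Ns=P_Sales_Rank&page=i" (40 results per page, page i).
            url = 'https://www.barnesandnoble.com' + link[0].replace("s=P_Sales_Rank", f'rpp=40&Ns=P_Sales_Rank&page={i}')
            show_40_list.append((url, link[1]))
    return show_40_list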

In [58]:
show_40(subjects_new_relases_url(Sci_Fic_Thr),1,3)[:5]

[('https://www.barnesandnoble.com/b/books/science-fiction-fantasy/_/N-1sZ29Z8q8Z180l?Nrpp=40&Ns=P_Sales_Rank&page=1',
  'Science Fiction & Fantasy'),
 ('https://www.barnesandnoble.com/b/books/fiction/thrillers/_/N-1sZ29Z8q8Z1d3u?Nrpp=40&Ns=P_Sales_Rank&page=1',
  'Thrillers'),
 ('https://www.barnesandnoble.com/b/books/science-fiction-fantasy/_/N-1sZ29Z8q8Z180l?Nrpp=40&Ns=P_Sales_Rank&page=2',
  'Science Fiction & Fantasy'),
 ('https://www.barnesandnoble.com/b/books/fiction/thrillers/_/N-1sZ29Z8q8Z1d3u?Nrpp=40&Ns=P_Sales_Rank&page=2',
  'Thrillers')]
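
The same paginated URLs can also be built by editing the query string explicitly rather than by string replacement. The sketch below is an alternative, not the notebook's method: `paged_url` is a hypothetical helper, and the parameter names `Nrpp` and `page` are taken from the URLs printed above. The parameters come out in a different order than in the output above.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

BASE = "https://www.barnesandnoble.com"


def paged_url(path, page, per_page=40):
    """Return the absolute URL for `path` with results-per-page and page number set."""
    parts = urlsplit(urljoin(BASE, path))
    query = dict(parse_qsl(parts.query))   # existing parameters, e.g. Ns=P_Sales_Rank
    query["Nrpp"] = str(per_page)          # results per page
    query["page"] = str(page)              # page number
    return urlunsplit(parts._replace(query=urlencode(query)))


example = paged_url('/b/books/fiction/thrillers/_/N-1sZ29Z8q8Z1d3u?Ns=P_Sales_Rank', 2)
```

Building the query from a dictionary avoids depending on the exact spelling of the existing parameter, at the cost of a few more lines.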

In [46]:
# Build the list of URLs for all the newly released books, paired with their subjects.

def get_all_new_relases(page_links):

    chrome_options = Options()
    chrome_options.add_argument('--headless')
    browser = webdriver.Chrome(options=chrome_options)

    all_new_releases_url_list = []
    try:
        for x in page_links:
            browser.get(x[0])
            html = browser.page_source
            soup = BeautifulSoup(html, "html.parser")
            # Every book on the listing page has a link with the class "pImageLink".
            target = soup.find_all('a', attrs={'class': 'pImageLink'})
            for y in target:
                all_new_releases_url_list.append((y['href'], x[1]))
    finally:
        # Always close the headless browser, even if a page fails to load.
        browser.quit()
    return all_new_releases_url_list

In [64]:
urls = get_all_new_relases(show_40(subjects_new_relases_url(Sci_Fic_Thr),1,6))

In [65]:
urls[:10]

[('/w/a-terrible-fall-of-angels-laurell-k-hamilton/1139142902;jsessionid=7E6181988B6D3C96B31F75E0F1CB050A.prodny_store02-va17?ean=9781984804464',
  'Science Fiction & Fantasy'),
 ('/w/king-bullet-richard-kadrey/1138272150;jsessionid=7E6181988B6D3C96B31F75E0F1CB050A.prodny_store02-va17?ean=9780062951588',
  'Science Fiction & Fantasy'),
 ('/w/paper-blood-kevin-hearne/1139998072;jsessionid=7E6181988B6D3C96B31F75E0F1CB050A.prodny_store02-va17?ean=9781984821287',
  'Science Fiction & Fantasy'),
 ('/w/starlight-enclave-r-a-salvatore/1138123720;jsessionid=7E6181988B6D3C96B31F75E0F1CB050A.prodny_store02-va17?ean=9780063214699',
  'Science Fiction & Fantasy'),
 ('/w/the-hidden-palace-helene-wecker/1138646126;jsessionid=7E6181988B6D3C96B31F75E0F1CB050A.prodny_store02-va17?ean=9780062468710',
  'Science Fiction & Fantasy'),
 ('/w/merely-magic-patricia-rice/1139840300;jsessionid=7E6181988B6D3C96B31F75E0F1CB050A.prodny_store02-va17?ean=2940162371359',
  'Science Fiction & Fantasy'),
 ('/w/million-

In [66]:
len(urls)

400
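
The book URLs above carry a `;jsessionid=...` session token between the path and the query string. If cleaner URLs are wanted, for example to compare links across batches, the token can be stripped with a regular expression. This is a sketch only: `strip_session_id` is a hypothetical helper that is not used elsewhere in this notebook, and it assumes the token always has the form shown in the output.

```python
import re

# Matches ";jsessionid=<token>" up to the start of the query string or fragment.
SESSION_ID = re.compile(r';jsessionid=[^?#/]*')


def strip_session_id(url):
    """Remove the ;jsessionid=... path parameter from a URL, if present."""
    return SESSION_ID.sub('', url)


cleaned = strip_session_id(
    '/w/king-bullet-richard-kadrey/1138272150;jsessionid='
    '7E6181988B6D3C96B31F75E0F1CB050A.prodny_store02-va17?ean=9780062951588'
)
```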

In [92]:
# Get the title, author, and summary for each newly released book, keeping track of the subject each book belongs to.

def get_data(book_links):

    chrome_options = Options()
    chrome_options.add_argument('--headless')
    browser = webdriver.Chrome(options=chrome_options)

    data_dict = {"title": [], "author": [], "overview": [], "subject": []}
    try:
        for link in book_links:
            try:
                url = 'https://www.barnesandnoble.com' + link[0]
                browser.get(url)
                html = browser.page_source
                soup = BeautifulSoup(html, 'html.parser')

                # Get the title: the text between the opening <h1> tag and the next tag.
                title = soup.find('h1')
                title = str(title).split('>')[1].split('<')[0]

                # Get the author: the text that follows the second '>' in the contributors span.
                author = soup.find('span', attrs={'id': 'key-contributors'})
                author = str(author).split('>')[2].split('<')[0]

                # Get the summary if there is one.
                overview = soup.find('div', attrs={'class': 'overview-cntnt'})
                if overview:
                    overview = overview.text.strip()
                else:
                    overview = 'None'

            except (IndexError, ValueError, WebDriverException):
                # If the page fails to load or parse, store placeholders so all columns stay aligned.
                author = 'null'
                title = 'null'
                overview = 'null'

            data_dict["title"].append(title)
            data_dict["author"].append(author)
            data_dict["overview"].append(overview)
            data_dict["subject"].append(link[1])
    finally:
        # Always close the headless browser when we are done.
        browser.quit()

    return data_dict

In [51]:
data = get_data(urls)

In [53]:
import pandas as pd

In [54]:
df = pd.DataFrame.from_dict(data)

In [55]:
df

Unnamed: 0,title,author,overview,subject
0,A Terrible Fall of Angels,Laurell K. Hamilton,"Angels walk among us, but so do other unearthl...",Science Fiction & Fantasy
1,King Bullet (Sandman Slim Series #12),Richard Kadrey,"The incredible finale of the page-turning, hig...",Science Fiction & Fantasy
2,Paper &amp; Blood (Ink &amp; Sigil Series #2),Kevin Hearne,From the New York Times bestselling author of ...,Science Fiction & Fantasy
3,Starlight Enclave (Signed Book),R. A. Salvatore,From New York Times bestselling author R. A. S...,Science Fiction & Fantasy
4,The Hidden Palace: A Novel of the Golem and th...,Helene Wecker,"""Richly nuanced and beautiful. . . . An immers...",Science Fiction & Fantasy
...,...,...,...,...
395,The Sleeping Nymph,Ilaria Tuti,A 2021 Sue Grafton Memorial Award NomineeIn th...,Thrillers
396,Seeking Salvation,Carla Barandas,Four doctors sent on a mission for one of the ...,Thrillers
397,Friend of a Friend,James V. Irving,When an investigation threatens his lucrative ...,Thrillers
398,Red War (Mitch Rapp Series #17),Vince Flynn,This instant #1 New York Times bestseller and ...,Thrillers
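
Some titles in the table, such as row 2, still contain the HTML entity `&amp;` because the title is cut out of the raw tag string. A minimal sketch of a cleanup step uses the standard library's `html.unescape`; the helper name `clean_title` is hypothetical and is not applied anywhere in this notebook.

```python
import html


def clean_title(raw_title):
    """Convert HTML entities such as &amp; back to plain characters."""
    return html.unescape(raw_title).strip()


cleaned_title = clean_title('Paper &amp; Blood (Ink &amp; Sigil Series #2)')
```

With pandas this could be applied to the whole column, for example with `df['title'].map(clean_title)`.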


In [68]:
urls1 = get_all_new_relases(show_40(subjects_new_relases_url(Sci_Fic_Thr),6,11))

In [69]:
data1 = get_data(urls1)

In [70]:
df1 = pd.DataFrame.from_dict(data1)

In [71]:
df1

Unnamed: 0,title,author,overview,subject
0,The Shackled Verities Complete Collection Box Set,Tammy Salyer,The fate of this world may rest on our shoulde...,Science Fiction & Fantasy
1,War and the Wind,Tyler Krings,The war in Anu is over. The Lord of Fate has u...,Science Fiction & Fantasy
2,"The Lifehouse Trilogy (Mindkiller, Time Pressu...",Spider Robinson,THE LIFEHOUSE TRILOGY is comprised of three bo...,Science Fiction & Fantasy
3,"The Hero Is Overpowered But Overly Cautious, V...",Light Tuchihi,Is it time for Seiya’s bag carriers to fulfill...,Science Fiction & Fantasy
4,Hella,David Gerrold,A master of science fiction introduces a world...,Science Fiction & Fantasy
...,...,...,...,...
395,Pain,Adam Turner,PainBy: Adam TurnerAfter Sky blamed her years ...,Thrillers
396,On demande un cadavre,Frédéric DARD,"Alfredo Seruti, un mafieux anglais, envoie deu...",Thrillers
397,Cigarette Break,Paul Weiss,"FICTION: From its perch in The Castle, an aban...",Thrillers
398,Notchwood,Benjamin Mailloux,"Down on his luck, Allen Notchwood tries to for...",Thrillers


In [86]:
urls2 = get_all_new_relases(show_40(subjects_new_relases_url(Sci_Fic_Thr),11,16))

In [93]:
data2 = get_data(urls2)

In [94]:
df2 = pd.DataFrame.from_dict(data2)

In [100]:
df2

Unnamed: 0,title,author,overview,subject
0,Savage: The Wild,Max Bemis,Teenage heartthrob. Feral social icon. Dinosau...,Science Fiction & Fantasy
1,Phase Six,Jim Shepard,A spare and gripping novel about the next pand...,Science Fiction & Fantasy
2,The AOA (Season 1: Episode 3),Kester James Finley,"After facing the villainous Reclaimer, Becca i...",Science Fiction & Fantasy
3,The Heads of Cerberus,Francis Stevens,The Heads of Cerberus (1919) is a science fict...,Science Fiction & Fantasy
4,The Wind Rose (The Moon Singer Book 3),B. Roman,In this third and final adventure of the Moon ...,Science Fiction & Fantasy
...,...,...,...,...
395,White Fang,Jack London,White Fang is a novel by American author Jack ...,Thrillers
396,Pharos the Egyptian (Esprios Classics),Guy Boothby,Guy Newell Boothby (1867-1905) was a prolific ...,Thrillers
397,A Plague of Dissent,Nic Taylor,In a world where media companies hack into per...,Thrillers
398,Distortions: A Quinn Masterson Mystery,James M. Campbell,A message scrawled at a murder scene beckons D...,Thrillers


In [101]:
urls3 = get_all_new_relases(show_40(subjects_new_relases_url(Sci_Fic_Thr),16,21))

In [102]:
data3 = get_data(urls3)

In [103]:
df3 = pd.DataFrame.from_dict(data3)

In [104]:
df3

Unnamed: 0,title,author,overview,subject
0,Edifice Abandoned,Scott Michael Decker,"Studying ancient sites on a backwater planet, ...",Science Fiction & Fantasy
1,The Phantom,Frank Settineri,The Phantom is a realistic fantasy about Tarch...,Science Fiction & Fantasy
2,The Bonds of Osteria: Book Four of the Osteria...,Tammie Painter,"In a fierce clash for power, titans rise, hero...",Science Fiction & Fantasy
3,Soul Savers Box Set: Books 1-3,Kristie Cook,This boxset delivers the first three books in ...,Science Fiction & Fantasy
4,Visigothic: The Barbarians of Midgard,Jay Newcomb,,Science Fiction & Fantasy
...,...,...,...,...
395,Amazonia's Mythical and Legendary Creatures in...,Damon Corrie,Included are real-life oral tradition accounts...,Thrillers
396,Taunusschuld: Zweiter Fall für Melanie Gramberg,Osvin Nöller,Jede Schuld ist zu begleichen! Die eine früher...,Thrillers
397,The Soul Traders,REMI,"The Soul Traders By: REMIGhosts, both good and...",Thrillers
398,Blood on the Bayou,D. J. Donaldson,Don Donaldson brings us to the famous French Q...,Thrillers


In [76]:
urls4 = get_all_new_relases(show_40(subjects_new_relases_url(Sci_Fic_Thr),21,26))

In [77]:
data4 = get_data(urls4)

In [78]:
df4 = pd.DataFrame.from_dict(data4)

In [80]:
df4

Unnamed: 0,title,author,overview,subject
0,The Nature of Mars,Dr. Patch Lieveert,The Nature of MarsBy: Dr. Patch LieveertAs hu...,Science Fiction & Fantasy
1,Don Rodriguez (Esprios Classics),Lord Dunsany,"Edward John Moreton Drax Plunkett, 18th Baron ...",Science Fiction & Fantasy
2,A Book of This,Mackenzie Judd,Dr. Richards works for a mysterious Head Offic...,Science Fiction & Fantasy
3,The Cursed Spirit,Lilliana Rose,The Lost Souls Academy has become Zarya's home...,Science Fiction & Fantasy
4,The 24hourlies,Gaja J. Kos,The ascension to The Dark Ones had brought Ros...,Science Fiction & Fantasy
...,...,...,...,...
395,Book of Water,Robin Brande,Bradamante is camped with the king’s army in t...,Thrillers
396,The Poet,Lisa Renee Jones,New York Times bestselling author Lisa Renee J...,Thrillers
397,Future Ghost / Speedin' Bullet,Jack Grant,'Future Ghost' and 'Speedin' Bullet' are the t...,Thrillers
398,King Solomon's Mines,H. Rider Haggard,It is a curious thing that at my age-fifty-fiv...,Thrillers
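
The five batches above cover pages 1-5, 6-10, 11-15, 16-20, and 21-25, and each batch repeats the same three steps. The page ranges could be generated instead of typed by hand. This is a sketch under the assumption that every batch uses the same size; `page_batches` is a hypothetical helper.

```python
def page_batches(first_page, last_page, batch_size):
    """Return (begin, end) pairs with `end` exclusive, matching range(begin, end)."""
    return [
        (begin, min(begin + batch_size, last_page + 1))
        for begin in range(first_page, last_page + 1, batch_size)
    ]


batches = page_batches(1, 25, 5)

# Possible use with the functions defined earlier in this notebook (not run here):
# frames = []
# for begin, end in batches:
#     batch_urls = get_all_new_relases(show_40(subjects_new_relases_url(Sci_Fic_Thr), begin, end))
#     frames.append(pd.DataFrame.from_dict(get_data(batch_urls)))
```

Keeping the batches separate, as done above, still has the advantage that a failed batch can be rerun without repeating the others.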


In [105]:
# DataFrame.append was removed in pandas 2.0; pd.concat combines the frames the same way.
Data = pd.concat([df, df1, df2, df3, df4])

In [106]:
Data

Unnamed: 0,title,author,overview,subject
0,A Terrible Fall of Angels,Laurell K. Hamilton,"Angels walk among us, but so do other unearthl...",Science Fiction & Fantasy
1,King Bullet (Sandman Slim Series #12),Richard Kadrey,"The incredible finale of the page-turning, hig...",Science Fiction & Fantasy
2,Paper &amp; Blood (Ink &amp; Sigil Series #2),Kevin Hearne,From the New York Times bestselling author of ...,Science Fiction & Fantasy
3,Starlight Enclave (Signed Book),R. A. Salvatore,From New York Times bestselling author R. A. S...,Science Fiction & Fantasy
4,The Hidden Palace: A Novel of the Golem and th...,Helene Wecker,"""Richly nuanced and beautiful. . . . An immers...",Science Fiction & Fantasy
...,...,...,...,...
395,Book of Water,Robin Brande,Bradamante is camped with the king’s army in t...,Thrillers
396,The Poet,Lisa Renee Jones,New York Times bestselling author Lisa Renee J...,Thrillers
397,Future Ghost / Speedin' Bullet,Jack Grant,'Future Ghost' and 'Speedin' Bullet' are the t...,Thrillers
398,King Solomon's Mines,H. Rider Haggard,It is a curious thing that at my age-fifty-fiv...,Thrillers


In [110]:
Data = Data.drop_duplicates()

In [111]:
Data

Unnamed: 0,title,author,overview,subject
0,A Terrible Fall of Angels,Laurell K. Hamilton,"Angels walk among us, but so do other unearthl...",Science Fiction & Fantasy
1,King Bullet (Sandman Slim Series #12),Richard Kadrey,"The incredible finale of the page-turning, hig...",Science Fiction & Fantasy
2,Paper &amp; Blood (Ink &amp; Sigil Series #2),Kevin Hearne,From the New York Times bestselling author of ...,Science Fiction & Fantasy
3,Starlight Enclave (Signed Book),R. A. Salvatore,From New York Times bestselling author R. A. S...,Science Fiction & Fantasy
4,The Hidden Palace: A Novel of the Golem and th...,Helene Wecker,"""Richly nuanced and beautiful. . . . An immers...",Science Fiction & Fantasy
...,...,...,...,...
395,Book of Water,Robin Brande,Bradamante is camped with the king’s army in t...,Thrillers
396,The Poet,Lisa Renee Jones,New York Times bestselling author Lisa Renee J...,Thrillers
397,Future Ghost / Speedin' Bullet,Jack Grant,'Future Ghost' and 'Speedin' Bullet' are the t...,Thrillers
398,King Solomon's Mines,H. Rider Haggard,It is a curious thing that at my age-fifty-fiv...,Thrillers


In [122]:
Data.value_counts(subset = 'subject')

subject
Science Fiction & Fantasy    998
Thrillers                    994
dtype: int64
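
As a quick sanity check on these counts, the arithmetic below compares the number of rows kept with the number we would expect. It assumes that every batch returned 400 book links, as the first batch did (`len(urls)` was 400); that figure is not shown for the later batches.

```python
subjects = 2            # Science Fiction & Fantasy, Thrillers
books_per_page = 40
pages_per_batch = 5
batches = 5

links_per_batch = subjects * books_per_page * pages_per_batch
expected_rows = links_per_batch * batches   # assumes every batch was complete

kept_rows = 998 + 994                       # counts reported by value_counts above
dropped_rows = expected_rows - kept_rows
```

Under that assumption, `drop_duplicates` removed only a handful of the 2000 scraped rows, split almost evenly between the two subjects.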

In [139]:
Data.to_csv('new_releases_Thriller and SciFi&Fantasy.csv', index=False)