In this notebook we scrape newly released fiction books, grouped by subject, from [Barnes & Noble](https://www.barnesandnoble.com/). We collect the data as follows:
- Scrape the list of subjects that B&N uses to categorize fiction books.
- For each subject, collect 40 books along with the title, author, and summary (if one exists) of each book.
<br>If only a few summaries are missing, we should be able to fill them in manually using targeted scrape functions.

Note: We take only 40 books per subject because the B&N website displays at most 40 books per page. We can collect more data by scraping additional pages if the project needs it.



In [74]:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

We use both the Requests and Selenium libraries because some pages on B&N's website do not work with Requests; for those pages we use Selenium.

In [2]:
# Build the list of fiction subjects along with their URLs.

URL = "https://www.barnesandnoble.com/b/fiction/books/_/N-2usxZ29Z8q8"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

# The sidebar section that holds the fiction subject links.
results = soup.find(id="sidebar-section-FictionSubjects")

fic_subjects = []
for litag in results.find_all('a'):
    # The subject links end where the "Browse All Subjects" link begins.
    if litag.text == 'Browse All Subjects':
        break
    # Store (relative URL, subject name), reading the URL from the href attribute.
    fic_subjects.append((litag['href'], litag.text.strip()))


In [3]:
fic_subjects

[('/b/books/fiction/_/N-29Z8q8Z10h8', 'General Fiction'),
 ('/b/books/graphic-novels-comics/_/N-29Z8q8Zucb', 'Graphic Novels & Comics'),
 ('/b/books/fiction/historical-fiction/_/N-29Z8q8Z10nf', 'Historical Fiction'),
 ('/b/books/fiction/horror/_/N-29Z8q8Z1d51', 'Horror'),
 ('/b/books/literature/_/N-29Z8q8Z15v3', 'Literature'),
 ('/b/books/graphic-novels-comics/manga/_/N-29Z8q8Zucc', 'Manga'),
 ('/b/books/mystery-crime/_/N-29Z8q8Z16g4', 'Mystery & Crime'),
 ('/b/books/poetry/_/N-29Z8q8Z1pqh', 'Poetry'),
 ('/b/books/romance/_/N-29Z8q8Z17y3', 'Romance'),
 ('/b/books/science-fiction-fantasy/_/N-29Z8q8Z180l',
  'Science Fiction & Fantasy'),
 ('/b/books/fiction/thrillers/_/N-29Z8q8Z1d3u', 'Thrillers'),
 ('/b/books/fiction/westerns/_/N-29Z8q8Z10j9', 'Westerns')]

In [53]:
# Build the list of new-releases URLs, one per subject.

def subjects_new_relases_url(subjects):
    new_releases_url = []
    for link in subjects:
        url = 'https://www.barnesandnoble.com' + link[0]
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        # Horror has no new releases section, so use "Digital Horror: Chilling eBooks" instead.
        if link == ('/b/books/fiction/horror/_/N-29Z8q8Z1d51', 'Horror'):
            target = soup.find(id='hotBooksWithDesc_DigitalHorror:ChillingeBooks')
        else:
            target = soup.find(id='hotBooksWithDesc_NewReleases')
        # The section's "see all" link points to the full listing page.
        see_all = target.find('a', class_='see-all-link')
        new_releases_url.append((see_all['href'], link[1]))

    return new_releases_url

In [54]:
subjects_new_relases_url(fic_subjects)

[('/b/books/fiction/_/N-1qZ29Z8q8Z10h8?Ns=P_Sales_Rank', 'General Fiction'),
 ('/b/books/graphic-novels-comics/_/N-1sZ29Z8q8Zucb;jsessionid=A29E50FBD50899AD744DAB17E74A915D.prodny_store02-atgap18?Ns=P_Sales_Rank',
  'Graphic Novels & Comics'),
 ('/b/books/fiction/historical-fiction/_/N-1sZ29Z8q8Z10nf?Ns=P_Sales_Rank',
  'Historical Fiction'),
 ('/b/books/fiction/horror/_/N-1z13q39Z29Z8q8Z1d51', 'Horror'),
 ('/b/books/literature/_/N-1sZ29Z8q8Z15v3?Ns=P_Sales_Rank', 'Literature'),
 ('/b/books/graphic-novels-comics/manga/_/N-1sZ29Z8q8Zucc?Ns=P_Sales_Rank',
  'Manga'),
 ('/b/books/mystery-crime/_/N-1sZ29Z8q8Z16g4?Ns=P_Sales_Rank',
  'Mystery & Crime'),
 ('/b/books/poetry/_/N-1sZ29Z8q8Z1pqh?Ns=P_Sales_Rank', 'Poetry'),
 ('/b/books/romance/_/N-1sZ29Z8q8Z17y3?Ns=P_Sales_Rank', 'Romance'),
 ('/b/books/science-fiction-fantasy/_/N-1sZ29Z8q8Z180l?Ns=P_Sales_Rank',
  'Science Fiction & Fantasy'),
 ('/b/books/fiction/thrillers/_/N-1sZ29Z8q8Z1d3u?Ns=P_Sales_Rank',
  'Thrillers'),
 ('/b/books/fiction

By default, a new releases page displays only 20 books, but the page offers an option to show 40. The following code builds the URLs that request 40 books per page.

Note: The Horror subject has no new releases section, so we scrape the category *Digital Horror: Chilling eBooks* instead.

In [55]:
def show_40(subject_urls):
    show_40_list = []
    for link in subject_urls:
        if link[0] == '/b/books/fiction/horror/_/N-1z13q39Z29Z8q8Z1d51':
            # The Horror URL has no query string yet, so append one.
            url = 'https://www.barnesandnoble.com' + link[0] + '?Nrpp=40&page=1'
        else:
            # Turn "?Ns=P_Sales_Rank" into "?Nrpp=40&Ns=P_Sales_Rank&page=1".
            url = 'https://www.barnesandnoble.com' + link[0].replace("s=P_Sales_Rank","rpp=40&Ns=P_Sales_Rank&page=1")
        show_40_list.append((url,link[1]))
    return show_40_list

In [56]:
show_40(subjects_new_relases_url(fic_subjects))

[('https://www.barnesandnoble.com/b/books/fiction/_/N-1qZ29Z8q8Z10h8?Nrpp=40&Ns=P_Sales_Rank&page=1',
  'General Fiction'),
 ('https://www.barnesandnoble.com/b/books/graphic-novels-comics/_/N-1sZ29Z8q8Zucb;jsessionid=A29E50FBD50899AD744DAB17E74A915D.prodny_store02-atgap18?Nrpp=40&Ns=P_Sales_Rank&page=1',
  'Graphic Novels & Comics'),
 ('https://www.barnesandnoble.com/b/books/fiction/historical-fiction/_/N-1sZ29Z8q8Z10nf?Nrpp=40&Ns=P_Sales_Rank&page=1',
  'Historical Fiction'),
 ('https://www.barnesandnoble.com/b/books/fiction/horror/_/N-1z13q39Z29Z8q8Z1d51?Nrpp=40&page=1',
  'Horror'),
 ('https://www.barnesandnoble.com/b/books/literature/_/N-1sZ29Z8q8Z15v3?Nrpp=40&Ns=P_Sales_Rank&page=1',
  'Literature'),
 ('https://www.barnesandnoble.com/b/books/graphic-novels-comics/manga/_/N-1sZ29Z8q8Zucc?Nrpp=40&Ns=P_Sales_Rank&page=1',
  'Manga'),
 ('https://www.barnesandnoble.com/b/books/mystery-crime/_/N-1sZ29Z8q8Z16g4?Nrpp=40&Ns=P_Sales_Rank&page=1',
  'Mystery & Crime'),
 ('https://www.barnesa
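
The string replacement in `show_40` depends on each URL ending in `Ns=P_Sales_Rank`. As a minimal sketch of an alternative, the query string can be parsed and rebuilt with the standard library. The helper name `with_page_size` is hypothetical, and the sketch assumes the site ignores the order of query parameters, which the original code does not confirm.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

BASE = "https://www.barnesandnoble.com"

def with_page_size(path, per_page=40, page=1):
    """Return the absolute URL for `path` with Nrpp and page set.

    Hypothetical helper: it keeps any existing query parameters
    (such as Ns=P_Sales_Rank) and adds or overwrites Nrpp and page.
    """
    parts = urlsplit(BASE + path)
    query = dict(parse_qsl(parts.query))
    query["Nrpp"] = str(per_page)
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))

print(with_page_size("/b/books/romance/_/N-1sZ29Z8q8Z17y3?Ns=P_Sales_Rank"))
print(with_page_size("/b/books/fiction/horror/_/N-1z13q39Z29Z8q8Z1d51"))
```

Because the helper handles URLs with and without an existing query string, the Horror special case is no longer needed. If further pages turn out to be reachable by increasing `page` (an assumption), the same helper could build those URLs as well.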

In [57]:
# Build the list of URLs for all newly released books along with their subjects.

def get_all_new_relases(subject_pages):

    chrome_options = Options()
    chrome_options.add_argument('--headless')
    browser = webdriver.Chrome(options=chrome_options)

    all_new_releases_url_list = []
    try:
        for x in subject_pages:
            browser.get(x[0])
            html = browser.page_source
            soup = BeautifulSoup(html, "html.parser")
            # Each 'pImageLink' anchor links to a book's own page.
            target = soup.find_all('a', attrs = {'class' : 'pImageLink'})
            for y in target:
                all_new_releases_url_list.append((y['href'], x[1]))
    finally:
        # Close the headless browser even if a page fails to load.
        browser.quit()
    return all_new_releases_url_list

In [58]:
urls = get_all_new_relases(show_40(subjects_new_relases_url(fic_subjects)))

In [75]:
urls[:5]

[('/w/black-ice-brad-thor/1138715571;jsessionid=828D556444C073D1DA4ADD3C9BA0715C.prodny_store02-atgap04?ean=9781982186302',
  'General Fiction'),
 ('/w/false-witness-karin-slaughter/1138241185;jsessionid=828D556444C073D1DA4ADD3C9BA0715C.prodny_store02-atgap04?ean=9780063204676',
  'General Fiction'),
 ('/w/the-paper-palace-miranda-cowley-heller/1138261755;jsessionid=828D556444C073D1DA4ADD3C9BA0715C.prodny_store02-atgap04?ean=9780593329825',
  'General Fiction'),
 ('/w/the-cellist-daniel-silva/1138787007;jsessionid=828D556444C073D1DA4ADD3C9BA0715C.prodny_store02-atgap04?ean=9780062834867',
  'General Fiction'),
 ('/w/the-other-passenger-louise-candlish/1137941967;jsessionid=828D556444C073D1DA4ADD3C9BA0715C.prodny_store02-atgap04?ean=9781982174101',
  'General Fiction')]
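
The scraped paths above carry a `;jsessionid=...` segment that belongs to the scraping session rather than to the book. A minimal sketch for removing it and dropping repeated entries follows. The helper names are hypothetical, and the sketch assumes the site serves the same page when the session id is omitted.

```python
import re

# Matches ";jsessionid=<value>" up to the start of the query string or fragment.
_JSESSIONID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)

def strip_session_id(path):
    """Remove the ;jsessionid=... segment from a scraped path (hypothetical helper)."""
    return _JSESSIONID.sub("", path)

def clean_urls(pairs):
    """Strip session ids from (path, subject) pairs and drop exact repeats, keeping order."""
    cleaned = [(strip_session_id(path), subject) for path, subject in pairs]
    return list(dict.fromkeys(cleaned))

sample = [
    ("/w/black-ice-brad-thor/1138715571;jsessionid=828D556444C073D1DA4ADD3C9BA0715C.prodny_store02-atgap04?ean=9781982186302",
     "General Fiction"),
    ("/w/black-ice-brad-thor/1138715571?ean=9781982186302", "General Fiction"),
]
print(clean_urls(sample))
```

The same book listed under two different subjects is kept twice, because the subject is part of each pair.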

In [62]:
# Get the title, author, and summary of each newly released book, keeping track of the book's subject.

def get_data(book_urls):

    chrome_options = Options()
    chrome_options.add_argument('--headless')
    browser = webdriver.Chrome(options=chrome_options)

    data_dict = {"title":[], "author":[], "overview":[], "subject":[]}
    try:
        for link in book_urls:
            url = 'https://www.barnesandnoble.com' + link[0]
            browser.get(url)
            html = browser.page_source
            soup = BeautifulSoup(html, 'html.parser')

            # Get the title: the text between the opening and closing <h1> tags.
            title = soup.find('h1')
            title = str(title).split('>')[1].split('<')[0]

            # Get the author: the text of the first element inside the contributors span.
            author = soup.find('span', attrs = {'id' : 'key-contributors'})
            author = str(author).split('>')[2].split('<')[0]

            # Get the summary if there is one; otherwise store the placeholder string 'None'.
            overview = soup.find('div', attrs = {'class': 'overview-cntnt'})
            if overview:
                overview = overview.text.strip()
            else:
                overview = 'None'

            data_dict["title"].append(title)
            data_dict["author"].append(author)
            data_dict["overview"].append(overview)
            data_dict["subject"].append(link[1])
    finally:
        # Close the headless browser even if a page fails to load.
        browser.quit()

    return data_dict
        
        
        

In [66]:
data = get_data(urls)
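
`get_data` stores the placeholder string `'None'` when a book has no summary, so the books that need a targeted re-scrape can be located by searching for that placeholder. The following is a minimal sketch on a small made-up dictionary with the same keys. The helper name `missing_overview_rows` and the sample rows are illustrative only.

```python
def missing_overview_rows(data_dict, placeholder="None"):
    """Return the positions of books whose overview is the placeholder (hypothetical helper)."""
    return [i for i, text in enumerate(data_dict["overview"]) if text == placeholder]

# Made-up sample in the same shape as the dictionary returned by get_data.
sample = {
    "title": ["Book A", "Book B", "Book C"],
    "author": ["Author A", "Author B", "Author C"],
    "overview": ["A summary.", "None", "Another summary."],
    "subject": ["Poetry", "Poetry", "Westerns"],
}

for i in missing_overview_rows(sample):
    print(i, sample["title"][i], sample["subject"][i])
```

Running the helper on `data` would list the rows to revisit by hand.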

In [64]:
import pandas as pd

In [67]:
df = pd.DataFrame.from_dict(data)

In [69]:
df

Unnamed: 0,title,author,overview,subject
0,Black Ice (Signed Book) (Scot Harvath Series #20),Brad Thor,The new Cold War is about to go hot. #1 New Yo...,General Fiction
1,False Witness (B&amp;N Exclusive Edition),Karin Slaughter,This Barnes and Noble Exclusive Edition includ...,General Fiction
2,The Paper Palace (Barnes &amp; Noble Book Club...,Miranda Cowley Heller,REESE'S BOOK CLUB PICK INSTANT #1 NEW YORK TI...,General Fiction
3,The Cellist (Gabriel Allon Series #21),Daniel Silva,"From Daniel Silva, the internationally acclaim...",General Fiction
4,The Other Passenger,Louise Candlish,One of CrimeReads’s Most Anticipated Crime Boo...,General Fiction
...,...,...,...,...
475,Nueces Blood,Mark Greathouse,"Nueces Blood: Texans Prepare for War, the four...",Westerns
476,Heart of the West,O. Henry,Several of the funniest and best stories by O....,Westerns
477,The Young Lion Hunter (Annotated),Zane Grey,"First published in 1911, this new Raging Bull ...",Westerns
478,That Girl Montana (Annotated),Marah Ellis Ryan,The author takes her characters to the wilds o...,Westerns
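
Some titles in the table above still contain HTML entities such as `&amp;`, because `get_data` cuts the title out of the serialized `<h1>` markup rather than its decoded text. A minimal sketch for decoding them with the standard library follows. The helper name `clean_title` is hypothetical.

```python
import html

def clean_title(raw_title):
    """Decode HTML entities (e.g. &amp; becomes &) and trim whitespace (hypothetical helper)."""
    return html.unescape(raw_title).strip()

titles = ["False Witness (B&amp;N Exclusive Edition)", "The Other Passenger"]
cleaned = [clean_title(t) for t in titles]
print(cleaned)
```

On the DataFrame, the same function could be applied with `df["title"] = df["title"].map(clean_title)`.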


In [70]:
df = df.drop_duplicates()

In [71]:
df

Unnamed: 0,title,author,overview,subject
0,Black Ice (Signed Book) (Scot Harvath Series #20),Brad Thor,The new Cold War is about to go hot. #1 New Yo...,General Fiction
1,False Witness (B&amp;N Exclusive Edition),Karin Slaughter,This Barnes and Noble Exclusive Edition includ...,General Fiction
2,The Paper Palace (Barnes &amp; Noble Book Club...,Miranda Cowley Heller,REESE'S BOOK CLUB PICK INSTANT #1 NEW YORK TI...,General Fiction
3,The Cellist (Gabriel Allon Series #21),Daniel Silva,"From Daniel Silva, the internationally acclaim...",General Fiction
4,The Other Passenger,Louise Candlish,One of CrimeReads’s Most Anticipated Crime Boo...,General Fiction
...,...,...,...,...
475,Nueces Blood,Mark Greathouse,"Nueces Blood: Texans Prepare for War, the four...",Westerns
476,Heart of the West,O. Henry,Several of the funniest and best stories by O....,Westerns
477,The Young Lion Hunter (Annotated),Zane Grey,"First published in 1911, this new Raging Bull ...",Westerns
478,That Girl Montana (Annotated),Marah Ellis Ryan,The author takes her characters to the wilds o...,Westerns


In [72]:
df.to_csv('new_releases_books_by_subjects_dataframe.csv', index=False)