### Problem statement:
Link to the dataset on Kaggle:
https://www.kaggle.com/alhanoofat/goodreadsbest1500books <br>
Goodreads is the world’s largest site for readers and book recommendations. Using it, you can track the books you are reading, have read, and want to read.<br>
It also has a recommendation system that analyzes 20 billion data points to give suggestions tailored to your literary tastes.<br>
In this project, I used web scraping to gather data about the best books of the 21st century: books published from January 1st, 2001 until today, as ranked by Goodreads users. The scraping was handled using BeautifulSoup.<br>
The dataframe contains the following features about 1500 books:<br>

|Feature|Description|
|---|---|
|book_name|The title of the book| 
|author_name|The name of the author|
|book_genre|The genre of the book. For example, Fiction or Fantasy| 
|year_published|The year in which the book was published|
|edition_language|The language in which this edition of the book was written| 
|avg_rating|The average of all ratings provided by Goodreads users (from 1 to 5)|
|no_of_raters|The number of people who rated this book| 
|score|The total score of the book on the list of the best books of the 21st century|
|no_of_ppl_voted|The number of people who voted for this book to be among the best books of the 21st century| 
|book_url|The URL of the book in Goodreads|


#### What to do?
We could apply data analysis and machine learning to the dataframe to understand readers' tastes and how they change over the years, and to predict the likely score of a book given its author, publication year, number of raters, and average rating.  

In [1]:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

The URLs of the pages containing the top 1500 books

In [2]:
# Page 1 uses the short list URL; pages 2 to 15 use the full list name plus a page number
base='https://www.goodreads.com/list/show/7.Best_Books_of_the_21st_Century'
urls=['https://www.goodreads.com/list/show/7']+[f'{base}?page={n}' for n in range(2,16)]

Creating a soup for each page

In [3]:
soups=[]
for url in urls:
    response = requests.get(url)
    soup= BeautifulSoup(response.text, 'html.parser')
    soups.append(soup)
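
The loop above sends 15 requests back to back and parses whatever comes back, including error pages. A minimal sketch of a more defensive fetch, assuming a hypothetical helper named `fetch_pages` (not part of the original notebook) that takes the download function as an argument, so `requests.get` can be passed in:

```python
import time

def fetch_pages(urls, get, delay_seconds=1.0, timeout=30):
    """Fetch each URL in order and return the response bodies as text.

    `get` is any callable that accepts (url, timeout=...) and returns an
    object with `.text` and `.raise_for_status()`, for example requests.get.
    The helper name and the default delay are assumptions for illustration.
    """
    pages = []
    for i, url in enumerate(urls):
        if i:
            # Pause between requests to avoid hammering the site
            time.sleep(delay_seconds)
        response = get(url, timeout=timeout)
        # Stop early on 4xx/5xx instead of parsing an error page
        response.raise_for_status()
        pages.append(response.text)
    return pages

# Possible usage with the notebook's names:
# pages = fetch_pages(urls, requests.get)
# soups = [BeautifulSoup(page, 'html.parser') for page in pages]
```

The same helper could be reused for the 1500 book pages later on, where a failed request is more likely.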
    

In [314]:
# soups[0]

Acquiring data from the soups

In [23]:
books=[]
for soup in soups:
    # Search the page once and iterate over the matches directly
    for a in soup.find_all('a',class_='bookTitle'):
        books.append(a.text)

In [24]:
authors=[]
for soup in soups:
    for a in soup.find_all('a',class_='authorName'):
        authors.append(a.text)

In [27]:
rating=[]
for soup in soups:
    for span in soup.find_all('span',class_='minirating'):
        rating.append(span.text)

In [42]:
score=[]
for soup in soups:
    for span in soup.find_all('span',class_='smallText uitext'):
        score.append(span.text)

Acquiring the URL of each book to get the additional data available on each book's page

In [55]:
name_urls=[]
for soup in soups:
    for url in soup.find_all('a', attrs={'itemprop':'url','href': re.compile("^/book/show/")}):
        name_urls.append(url.get('href'))

In [67]:
s= 'https://www.goodreads.com'
nus=[]
for nu in name_urls:
    nus.append(s+nu)
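
String concatenation works here because every scraped `href` starts with `/book/show/`. As a hedged alternative, the standard library's `urljoin` builds the same absolute URLs and leaves an already absolute link untouched. The helper name `to_absolute` and the example path are hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = 'https://www.goodreads.com'

def to_absolute(hrefs, base=BASE_URL):
    """Resolve each (possibly relative) href against the site root."""
    return [urljoin(base, href) for href in hrefs]

# Hypothetical path, for illustration only
example = to_absolute(['/book/show/123.Example_Title'])
```

With the notebook's names this would be `nus = to_absolute(name_urls)`.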

Creating a soup for each book, which gives 1500 soups

In [75]:
soups3=[]
for url in nus:
    response = requests.get(url)
    soup3= BeautifulSoup(response.text, 'html.parser')
    soups3.append(soup3)

In [315]:
# soups3[1499]

Getting more data from each book's soup

In [102]:
#get genres:
genres=[]
for soup3 in soups3:
    # Look the genre link up once and reuse the result
    genre_link=soup3.find('a',class_='actionLinkLite bookPageGenreLink')
    if genre_link:
        genres.append(genre_link.text)
    else:
        genres.append('unknown')

In [185]:
#get year published:
years=[]
for soup3 in soups3:

    # The second 'row' div holds the publication details
    rows=soup3.find_all('div',class_='row')
    if len(rows)>1:
        years.append(rows[1].text)
    else:
        years.append('unknown')

In [121]:
lang=[]
for soup3 in soups3:
    language_div=soup3.find('div',attrs={'class':'infoBoxRowItem','itemprop':'inLanguage'})
    if language_div:
        lang.append(language_div.text)
    else:
        lang.append('unknown')

Creating a dataframe

In [187]:
gr_books=pd.DataFrame()
gr_books['book_name']=books
gr_books['author_name']=authors
gr_books['avg_rating']=rating
gr_books['score']=score
gr_books['book_genre']=genres
gr_books['year_published']=years
gr_books['book_url']=nus
gr_books['edition_language']=lang
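
Building the dataframe column by column only works if every list has exactly one entry per book. pandas raises a `ValueError` when an assigned list's length does not match the dataframe, which can happen if a selector matches more or fewer elements than expected on some page. A small sketch of a check that could run before the assignments; `check_equal_lengths` is a hypothetical helper, not part of the original notebook:

```python
def check_equal_lengths(**columns):
    """Return a {name: length} dict, or raise if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Report every length so the offending list is easy to spot
        raise ValueError(f'column lengths differ: {lengths}')
    return lengths

# Possible usage with the notebook's lists:
# check_equal_lengths(books=books, authors=authors, rating=rating, score=score,
#                     genres=genres, years=years, nus=nus, lang=lang)
```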

### Data cleaning

Splitting gr_books['avg_rating'] into two columns: one for the average rating and one for the number of raters

In [191]:
gr_books['no_of_raters']=gr_books['avg_rating'].apply(lambda x: x.split()[4])

In [200]:
gr_books['avg_rating']=gr_books['avg_rating'].apply(lambda x: x.split()[0])
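
Splitting by position assumes the text always starts with the average. When extra words come first, every token shifts, which is consistent with row 1424 later in this notebook ending up with `liked` and `rating` in these two columns. A minimal sketch of a pattern-based parse that anchors on the words "avg rating" instead; the sample strings and the separator character are assumptions for illustration, and `parse_minirating` is a hypothetical helper:

```python
import re

# <number> "avg rating" <any non-digits> <count with optional commas> "rating(s)"
MINIRATING = re.compile(r'(\d+(?:\.\d+)?)\s+avg rating\D+?([\d,]+)\s+rating')

def parse_minirating(text):
    """Return (average, number_of_raters), or (None, None) if no match."""
    match = MINIRATING.search(text)
    if match is None:
        return None, None
    average = float(match.group(1))
    raters = int(match.group(2).replace(',', ''))
    return average, raters

# Illustrative inputs, not copied from the scraped pages
plain = parse_minirating('4.61 avg rating - 2,530,201 ratings')
prefixed = parse_minirating('liked it 3.50 avg rating - 2 ratings')
```

Applied to the dataframe, this could fill both columns in one pass, for example with `gr_books['avg_rating'].apply(parse_minirating)`.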

Removing newline characters (\n) from the start and end of the book name

In [202]:
gr_books['book_name']=gr_books['book_name'].apply(lambda x: x.strip('\n'))

Splitting gr_books['score'] into two columns: one for the total score and one for the number of people who voted

In [209]:
gr_books['no_of_ppl_voted']=gr_books['score'].apply(lambda x: x.split('\n')[3].split()[0])

In [216]:
gr_books['score']=gr_books['score'].apply(lambda x: x.split()[1].strip(','))
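
The two positional splits above depend on the exact line breaks in the scraped text, and `strip(',')` removes only a leading or trailing comma, not thousands separators inside the number. A hedged sketch of a single regular expression that extracts both numbers as integers; the layout of the sample string is an assumption, and `parse_score` is a hypothetical helper:

```python
import re

# "score:" <number>, then any non-digits, then <number> "people voted"
SCORE = re.compile(
    r'score:\s*([\d,]*\d)\D+?([\d,]*\d)\s+(?:people|person)\s+voted'
)

def parse_score(text):
    """Return (score, number_of_voters) as ints, or (None, None)."""
    match = SCORE.search(text)
    if match is None:
        return None, None
    score_value = int(match.group(1).replace(',', ''))
    voters = int(match.group(2).replace(',', ''))
    return score_value, voters

# Assumed raw text layout, for illustration only
sample = parse_score('\n    score: 392,793,\n    and\n    3,968 people voted\n')
```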

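Extracting the four-digit year in which the book was published, then converting it to an integer. Judging by the cell numbers, the conversion (In [308]) ran after the missing year was filled in further down (In [304]).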

In [237]:
# Raw string, so that \d is not treated as a string escape sequence
pattern = r'(\d{4})'
gr_books['year_published']=gr_books['year_published'].str.extract(pattern, expand=False)

In [308]:
gr_books['year_published']=gr_books['year_published'].astype(int)
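
A plain integer conversion fails if any row has no extractable year. One possible alternative, sketched here on a toy Series with made-up strings, is to coerce unparseable values to missing and use pandas' nullable `Int64` dtype, so the column stays numeric even with gaps:

```python
import pandas as pd

# Made-up publication strings; the real page text may look different
raw = pd.Series([
    'Published July 21st 2007 by Example Press',
    'Published 2010',
    'unknown',
])

# First run of four digits in each string, NaN where there is none
years = raw.str.extract(r'(\d{4})', expand=False)

# Unparseable values become missing; Int64 keeps the rest as integers
years = pd.to_numeric(years, errors='coerce').astype('Int64')
```

Note that the pattern takes the first four-digit run in the text, so it assumes no other four-digit number appears before the year.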

Manually imputing a missing value: the unknown genre of the book at row 1286

In [287]:
# Assign through .loc so the change is written to the dataframe itself, not to a temporary row copy
gr_books.loc[1286,'book_genre']='Science Fiction'

In [288]:
gr_books.iloc[1286]

book_name                                          Moore's Mythopoeia
author_name                                     Christopher WunderLee
avg_rating                                                       4.33
score                                                             294
book_genre                                            Science Fiction
year_published                                                   2010
book_url            https://www.goodreads.com/book/show/7307551-mo...
edition_language                                              English
no_of_raters                                                       18
no_of_ppl_voted                                                     3
Name: 1286, dtype: object
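
Patching a single cell is most predictable with `.loc[row_label, column]`, because an in-place `replace` on a row selected with `iloc` may act on a temporary copy, depending on the pandas version. A toy sketch with made-up data:

```python
import pandas as pd

# Made-up two-row frame standing in for gr_books
toy = pd.DataFrame({
    'book_name': ['Book A', 'Book B'],
    'book_genre': ['Fiction', 'unknown'],
})

# Write one value into one cell; all other cells are left alone
toy.loc[1, 'book_genre'] = 'Science Fiction'
```

Naming the column also avoids changing other fields in the same row that happen to hold the same placeholder text.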

Rearranging the order of the columns

In [291]:
gr_books=gr_books[['book_name', 'author_name', 'book_genre', 'year_published', 'edition_language', 'avg_rating', 'no_of_raters', 
                   'score', 'no_of_ppl_voted', 'book_url']]

In [296]:
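# Assign the result back; inplace=True on a selected column may not update the dataframe
gr_books['year_published']=gr_books['year_published'].fillna('unknown')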

In [304]:
# Fill in the missing publication year for row 1424 directly
gr_books.loc[1424,'year_published']='2018'

In [305]:
gr_books.iloc[1424]

book_name           The Errors of the National Academy of Sciences...
author_name                                               Harun Yahya
book_genre                                                 Nonfiction
year_published                                                   2018
edition_language                                              English
avg_rating                                                      liked
no_of_raters                                                   rating
score                                                             255
no_of_ppl_voted                                                     3
book_url            https://www.goodreads.com/book/show/6362102-th...
Name: 1424, dtype: object

In [312]:
gr_books.head()

| |book_name|author_name|book_genre|year_published|edition_language|avg_rating|no_of_raters|score|no_of_ppl_voted|book_url|
|---|---|---|---|---|---|---|---|---|---|---|
|0|Harry Potter and the Deathly Hallows (Harry Po...|J.K. Rowling|Fantasy|2007|English|4.61|2530201|392793|3968|https://www.goodreads.com/book/show/136251.Har...|
|1|The Hunger Games (The Hunger Games, #1)|Suzanne Collins|Young Adult|2008|English|4.33|5856382|289899|2958|https://www.goodreads.com/book/show/2767052-th...|
|2|The Kite Runner|Khaled Hosseini|Fiction|2004|English|4.29|2222081|257039|2610|https://www.goodreads.com/book/show/77203.The_...|
|3|The Book Thief|Markus Zusak|Historical|2006|English|4.37|1636312|249885|2545|https://www.goodreads.com/book/show/19063.The_...|
|4|Harry Potter and the Half-Blood Prince (Harry ...|J.K. Rowling|Fantasy|2006|English|4.56|2172153|219198|2257|https://www.goodreads.com/book/show/1.Harry_Po...|


In [313]:
gr_books.to_csv('Goodreads_best1500books.csv')
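
By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. Whether that column is wanted is a design choice. The sketch below uses a made-up frame and in-memory buffers to show the difference `index=False` makes:

```python
import io
import pandas as pd

# Made-up frame standing in for gr_books
toy = pd.DataFrame({'book_name': ['Book A', 'Book B'], 'score': [10, 7]})

# Default: the index is written as a first column with an empty header
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the named columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)
```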