### Grand Dataset
In this notebook, we process and merge the book attributes file and the reviews file. 

In [120]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Read review file

In [121]:
review = pd.read_csv("review.csv")

In [122]:
review.head()

Unnamed: 0.1,Unnamed: 0,book_id_title,book_id,book_title,review_url,review_id,date,rating,user_name,user_url,text,num_likes,sort_order,shelves
0,0,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/1855355089,1855355089,2016-12-29,2.0,Sam,/user/show/59357213-sam,I didn't really respond well to Conversations ...,1043,default,['2017-reads']
1,1,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/2098766690,2098766690,2017-08-23,5.0,Jill,/user/show/2228181-jill,I’ve been thinking a lot about aging lately: t...,937,default,[]
2,2,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/1948088321,1948088321,2017-06-09,3.0,Esil,/user/show/3643764-esil,A very tepid 3 stars. Conversations with Frien...,839,default,['netgalley']
3,3,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/2831723058,2831723058,2019-05-23,5.0,emma,/user/show/32879029-emma,have been truly dealt a series of death blows ...,592,default,"['couldn-t-wait-to-read', 'favorites', 'litera..."
4,4,32187419-conversations-with-friends,32187419-conversations-with-friends,Conversations with Friends,https://www.goodreads.com/review/show/2340296379,2340296379,2018-03-26,2.0,Barry Pierce,/user/show/4593541-barry-pierce,The narrator of Sally Rooney's Conversations w...,480,default,"['21st-century', 'read-in-2018']"


In [123]:
review.shape

(297455, 14)

#### Feature Engineering

There are quite a few duplicates in our dataset because of Goodreads' effective anti-scraping tools. However, we managed to scrape most of the reviews.
- There are a total of 297455 rows. 184011 rows repeat a review_id that has already appeared, which means ~62% of our rows are duplicates.
- We will keep one row from each set of identical rows and remove the rest. 

We only need the book id, review id, ratings and number of likes. We will keep these columns and remove the others. 

In [124]:
relevant_columns = ['book_id', 'review_id', 'rating', 'num_likes']
# Take an explicit copy so later in-place operations do not trigger SettingWithCopyWarning
review = review[relevant_columns].copy()

In [125]:
review['review_id'].duplicated().sum() # 184011
(~review['review_id'].duplicated()).sum() # 113444

review.drop_duplicates(subset=['review_id'], keep = 'first', inplace=True)

review.shape # 113444

(113444, 4)
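
As a sanity check, the two counts above should add up to the original row count (184011 + 113444 = 297455). The sketch below illustrates the same relationship on a small, made-up frame; the `toy` data and variable names are hypothetical and not part of the scraped dataset.

```python
import pandas as pd

# Hypothetical toy data: review_id 11 appears three times, 12 appears twice.
toy = pd.DataFrame({
    "review_id": [11, 11, 12, 13, 12, 11],
    "rating": [5.0, 5.0, 3.0, 4.0, 3.0, 5.0],
})

# Rows whose review_id has already appeared earlier in the frame.
n_repeats = toy["review_id"].duplicated().sum()
# First occurrences only, i.e. the rows that survive deduplication.
n_unique = (~toy["review_id"].duplicated()).sum()

deduped = toy.drop_duplicates(subset=["review_id"], keep="first")

# Repeats plus first occurrences account for every row.
assert n_repeats + n_unique == len(toy)
assert len(deduped) == n_unique
```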

Now that our review data does not contain duplicates, we will remove the review_id column. 
As a reminder, our goal in this notebook is ultimately to produce one variable: the weighted review score.

In [126]:
review = review.drop(['review_id'], axis=1)

We need to create a new feature called weighted review score. 
**Weighted review score:** For each book, we sum the product of each rating and its number of likes, then divide by the book's total number of likes. 
There are two steps to do that. 
1. Create a column for the product of rating and number of likes. 
2. Group by book id, sum the product and the number of likes, and divide the first sum by the second.
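
The steps above can be sketched on a tiny, made-up frame to make the arithmetic concrete. The `toy` data and the `per_book` name are hypothetical; only the column names follow this notebook.

```python
import pandas as pd

# Hypothetical toy data: book "a" has two reviews, book "b" has one.
toy = pd.DataFrame({
    "book_id": ["a", "a", "b"],
    "rating": [5.0, 2.0, 4.0],
    "num_likes": [30, 10, 7],
})

# Step 1: product of rating and number of likes for each review.
toy["rating_num_like"] = toy["rating"] * toy["num_likes"]

# Step 2: sum likes and products per book, then divide.
per_book = toy.groupby("book_id")[["num_likes", "rating_num_like"]].sum()
per_book["weighted_review_score"] = (
    per_book["rating_num_like"] / per_book["num_likes"]
)

# Book "a": (5*30 + 2*10) / (30 + 10) = 170 / 40 = 4.25
print(per_book)
```

Reviews with more likes pull the score toward their rating: book "a" lands at 4.25 rather than the unweighted mean of 3.5.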

The two columns rating and number of likes will need to be numeric. 

In [127]:
from pandas.api.types import is_numeric_dtype

assert is_numeric_dtype(review['rating'])
assert is_numeric_dtype(review['num_likes'])
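
If one of these assertions were to fail (for example because the scraper stored counts as text), one possible remedy is `pd.to_numeric` with `errors="coerce"`. This is only a sketch on made-up data; the `raw` frame and the malformed value are assumptions, not something observed in review.csv.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical raw column where counts arrived as strings, one of them malformed.
raw = pd.DataFrame({"num_likes": ["12", "7", "n/a"]})

# errors="coerce" turns unparseable entries into NaN instead of raising.
raw["num_likes"] = pd.to_numeric(raw["num_likes"], errors="coerce")

assert is_numeric_dtype(raw["num_likes"])

# Count the entries that could not be parsed so they can be inspected or dropped.
n_bad = raw["num_likes"].isna().sum()
```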

Now we can compute the product.

In [128]:
review['rating_num_like'] = review['rating'] * review['num_likes']

The rating column is no longer needed on its own, so we will drop it. 

In [129]:
review = review.drop(['rating'], axis=1)

Each book has multiple reviews, and therefore multiple rating*num_likes values, so we will group by book id and sum them before calculating the weighted review score. 

In [137]:
review = review.groupby('book_id').sum()
review

book_id,num_likes,rating_num_like
1,14734,66212.0
10032672-the-language-of-flowers,3284,12623.0
10054335-rules-of-civility,6793,27561.0
10073,1626,5215.0
10073506-tinker-tailor-soldier-spy,2751,12166.0
...,...,...
99107,2339,10014.0
9915,2022,6590.0
99300,1170,4748.0
99561,11657,31122.0


In [140]:
review = review.copy()  # Work on an explicit copy to avoid SettingWithCopyWarning when adding a column
review['weighted_review_score'] = review['rating_num_like'] / review['num_likes']
review

book_id,num_likes,rating_num_like,weighted_review_score
1,14734,66212.0,4.493824
10032672-the-language-of-flowers,3284,12623.0,3.843788
10054335-rules-of-civility,6793,27561.0,4.057265
10073,1626,5215.0,3.207257
10073506-tinker-tailor-soldier-spy,2751,12166.0,4.422392
...,...,...,...
99107,2339,10014.0,4.281317
9915,2022,6590.0,3.259149
99300,1170,4748.0,4.058120
99561,11657,31122.0,2.669812
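
One edge case worth keeping in mind: if every review of a book had zero likes, the division would be 0/0 and the score would come out as NaN. Whether any such book exists in this dataset is not checked here; the sketch below only shows the behaviour on a made-up frame, with `per_book` and `no_likes` as hypothetical names.

```python
import pandas as pd

# Hypothetical per-book totals: book "b" received no likes at all.
per_book = pd.DataFrame(
    {"num_likes": [40, 0], "rating_num_like": [170.0, 0.0]},
    index=pd.Index(["a", "b"], name="book_id"),
)

per_book["weighted_review_score"] = (
    per_book["rating_num_like"] / per_book["num_likes"]
)

# pandas evaluates 0.0 / 0 to NaN rather than raising an error.
no_likes = per_book["weighted_review_score"].isna()
print(per_book[no_likes])
```

Such rows could then be dropped or given a fallback value, depending on how the score is used downstream.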


#### Read goodreads fiction file

In this file: 
- There are 1247 unique book titles
- Books with the most reviews: ['Inferno', 'Artemis Fowl', 'The Queen of the Damned',
       'Eat, Pray, Love: One Woman's Search for Everything Across Italy, India and Indonesia',
       'The 5th Wave']

In [146]:
fiction = pd.read_csv("goodreads_fiction.csv")

In [147]:
fiction.head()

Unnamed: 0,book_id,book_url,book_title,author_name,ratings,num_of_ratings,date_published,book_shelved,book_genre
0,2657,https://www.goodreads.com/book/show/2657.To_Ki...,To Kill a Mockingbird (Paperback),Harper Lee,4.27,5025333,1960,24464,fiction
1,40961427-1984,https://www.goodreads.com/book/show/40961427-1984,1984 (Kindle Edition),George Orwell,4.19,3609831,1949,24368,fiction
2,4671,https://www.goodreads.com/book/show/4671.The_G...,The Great Gatsby (Paperback),F. Scott Fitzgerald,3.93,4217051,1925,22232,fiction
3,170448,https://www.goodreads.com/book/show/170448.Ani...,Animal Farm (Mass Market Paperback),George Orwell,3.97,3105131,1945,20400,fiction
4,3,https://www.goodreads.com/book/show/3.Harry_Po...,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling,4.47,8031019,1997,20064,fiction


In [148]:
fiction['book_title'].unique().size # 1247


1247

#### Merge two files

In [149]:
review

book_id,num_likes,rating_num_like,weighted_review_score
1,14734,66212.0,4.493824
10032672-the-language-of-flowers,3284,12623.0,3.843788
10054335-rules-of-civility,6793,27561.0,4.057265
10073,1626,5215.0,3.207257
10073506-tinker-tailor-soldier-spy,2751,12166.0,4.422392
...,...,...,...
99107,2339,10014.0,4.281317
9915,2022,6590.0,3.259149
99300,1170,4748.0,4.058120
99561,11657,31122.0,2.669812


In [153]:
fiction_review = pd.merge(fiction, review, how='inner', on="book_id")

In [154]:
fiction_review.head()

Unnamed: 0,book_id,book_url,book_title,author_name,ratings,num_of_ratings,date_published,book_shelved,book_genre,num_likes,rating_num_like,weighted_review_score
0,2657,https://www.goodreads.com/book/show/2657.To_Ki...,To Kill a Mockingbird (Paperback),Harper Lee,4.27,5025333,1960,24464,fiction,18426,82260.0,4.464344
1,40961427-1984,https://www.goodreads.com/book/show/40961427-1984,1984 (Kindle Edition),George Orwell,4.19,3609831,1949,24368,fiction,19290,83313.0,4.318974
2,4671,https://www.goodreads.com/book/show/4671.The_G...,The Great Gatsby (Paperback),F. Scott Fitzgerald,3.93,4217051,1925,22232,fiction,14276,54753.0,3.835318
3,170448,https://www.goodreads.com/book/show/170448.Ani...,Animal Farm (Mass Market Paperback),George Orwell,3.97,3105131,1945,20400,fiction,17489,79191.0,4.528046
4,3,https://www.goodreads.com/book/show/3.Harry_Po...,Harry Potter and the Sorcerer's Stone (Harry P...,J.K. Rowling,4.47,8031019,1997,20064,fiction,31403,108057.0,3.440977


In [155]:
fiction_review.shape

(1250, 12)
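
The merged frame has 1250 rows, slightly more than the 1247 unique titles counted earlier. This may simply mean that a few distinct books share a title, or it may mean that some `book_id` values are repeated in the fiction file; the notebook does not establish which. A minimal sketch of how repeated keys could be detected, using made-up frames (`fiction_toy` and `review_toy` are hypothetical):

```python
import pandas as pd

# Hypothetical stand-ins for the two files; book_id "2" is repeated on the left.
fiction_toy = pd.DataFrame({
    "book_id": ["1", "2", "2"],
    "book_title": ["A", "B", "B"],
})
review_toy = pd.DataFrame({
    "book_id": ["1", "2", "3"],
    "weighted_review_score": [4.5, 3.9, 4.1],
})

# Rows whose key appears more than once in the left frame.
dup_keys = fiction_toy[fiction_toy["book_id"].duplicated(keep=False)]

# A plain inner merge keeps both copies of the repeated key.
merged = pd.merge(fiction_toy, review_toy, how="inner", on="book_id")

# validate="one_to_one" makes pandas raise MergeError when keys repeat.
try:
    pd.merge(fiction_toy, review_toy, how="inner", on="book_id",
             validate="one_to_one")
    is_one_to_one = True
except pd.errors.MergeError:
    is_one_to_one = False
```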

In [156]:
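# index=False keeps the row index out of the file, so reloading it does not create an extra "Unnamed: 0" column
fiction_review.to_csv('fiction_review.csv', index=False)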