# Data Preparation
> Author: [Yalim Demirkesen](https://github.com/demirkeseny)

> Date: March 2019

The main purpose of this notebook is to get the data ready.

We start with a books dataset from [Kaggle](https://www.kaggle.com/zygmunt/goodbooks-10k) called *GoodBooks-10K*. As the name suggests, this data frame holds information on 10,000 books. Because the next stages of the project need a description for each book, we use the [Goodreads](https://github.com/goodreads) module. It exposes detailed information about each book, such as:

- Author
- Genre
- Release Date
- Average Ratings
- Rating Distribution
- Number of Pages
- Goodreads Link of the Book
- Description

Since `.description` provides a paragraph about each book that we cannot obtain from *GoodBooks-10K*, we can run NLP analyses and compare books by their descriptions.

In [1]:
# Necessary libraries:
import pandas as pd
import numpy as np
import xmltodict
import requests
import rauth
from distutils.sysconfig import get_python_lib
import pickle
import os
from goodreads import client

In [2]:
# Enter the Goodreads API credentials the first time you run this notebook,
# then delete them from this cell so that they are not uploaded to an online
# platform. They will be saved in the pkl file.

if not os.path.exists('secret_goodreads_credentials.pkl'):
    goodreads={}
    goodreads['Consumer Key'] = ''
    goodreads['Consumer Secret'] = ''
    goodreads['Access Token'] = ''
    goodreads['Access Token Secret'] = ''
    with open('secret_goodreads_credentials.pkl','wb') as f:
        pickle.dump(goodreads, f)
else:
    # Use a context manager so the file handle is closed after loading:
    with open('secret_goodreads_credentials.pkl','rb') as f:
        goodreads = pickle.load(f)

gc = client.GoodreadsClient(goodreads['Consumer Key'], goodreads['Consumer Secret'])
gc.authenticate(goodreads['Access Token'], goodreads['Access Token Secret'])
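
The credential caching above can also be wrapped in a small helper that closes every file handle it opens. This is a minimal sketch and not part of the original notebook; the function name `load_or_create_credentials` and its `defaults` parameter are hypothetical.

```python
import os
import pickle

def load_or_create_credentials(path, defaults):
    """Return the credentials stored at `path`.

    On the first run the file does not exist yet, so it is created from
    `defaults`. (Hypothetical helper, for illustration only.)
    """
    if not os.path.exists(path):
        with open(path, 'wb') as f:
            pickle.dump(defaults, f)
        return dict(defaults)
    with open(path, 'rb') as f:
        return pickle.load(f)
```

Once the file exists, the `defaults` argument is ignored, so the keys can be blanked out in the notebook after the first run.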

In [3]:
# Assigning the book from goodreads.com with the ID number of 1. 
book = gc.book(1)

Every detail about the book with ID 1 can be retrieved with the queries below:

In [4]:
book.title

'Harry Potter and the Half-Blood Prince (Harry Potter, #6)'

In [5]:
book.authors[0]

J.K. Rowling

In [6]:
book.average_rating

'4.56'

In [7]:
book.description

'When Harry Potter and the Half-Blood Prince opens, the war against Voldemort has begun. The Wizarding world has split down the middle, and as the casualties mount, the effects even spill over onto the Muggles. Dumbledore is away from Hogwarts for long periods, and the Order of the Phoenix has suffered grievous losses. And yet, as in all wars, life goes on.<br /><br />Harry, Ron, and Hermione, having passed their O.W.L. level exams, start on their specialist N.E.W.T. courses. Sixth-year students learn to Apparate, losing a few eyebrows in the process. Teenagers flirt and fight and fall in love. Harry becomes captain of the Gryffindor Quidditch team, while Draco Malfoy pursues his own dark ends. And classes are as fascinating and confounding as ever, as Harry receives some extraordinary help in Potions from the mysterious Half-Blood Prince.<br /><br />Most importantly, Dumbledore and Harry work together to uncover the full and complex story of a boy once named Tom Riddle—the boy who bec

In [8]:
book.num_pages

'652'

In [9]:
book.link

'https://www.goodreads.com/book/show/1.Harry_Potter_and_the_Half_Blood_Prince'

In [10]:
book.publication_date

('9', '16', '2006')
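
The tuple above appears to be ordered as (month, day, year), with each part returned as a string. Below is a minimal sketch of converting such a tuple into a `datetime.date`, assuming that ordering holds for every book and that missing parts come back as `None`; the helper name `to_date` is hypothetical.

```python
from datetime import date

def to_date(publication_date):
    """Convert a (month, day, year) tuple of strings into a date.

    Returns None when any part is missing (assumed to be None).
    """
    month, day, year = publication_date
    if None in (month, day, year):
        return None
    return date(int(year), int(month), int(day))

print(to_date(('9', '16', '2006')))  # 2006-09-16
```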

In [11]:
book.rating_dist

'5:1364327|4:509638|3:147736|2:23160|1:8291|total:2053152'
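
The rating distribution comes back as a single `|`-separated string of `label:count` pairs. A minimal sketch of turning it into a dictionary of integers, assuming every book uses the same layout as the output above (the helper name `parse_rating_dist` is hypothetical):

```python
def parse_rating_dist(rating_dist):
    """Turn '5:10|4:3|total:13' into {'5': 10, '4': 3, 'total': 13}."""
    counts = {}
    for part in rating_dist.split('|'):
        label, _, value = part.partition(':')
        counts[label] = int(value)
    return counts

dist = parse_rating_dist(
    '5:1364327|4:509638|3:147736|2:23160|1:8291|total:2053152')
print(dist['5'], dist['total'])  # 1364327 2053152
```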

In [12]:
book.ratings_count

'1912331'

In [13]:
book.text_reviews_count

'25685'

In [18]:
# `books` is the GoodBooks-10K data frame read from ./data/books.csv:
ids = books.book_id.tolist()

In [19]:
ids[0:10]

[2767052, 3, 41865, 2657, 4671, 11870085, 5907, 5107, 960, 1885]

In [22]:
book = gc.book(ids[0])
print(book.title)
print(book.description)

The Hunger Games (The Hunger Games, #1)
Could you survive on your own, in the wild, with everyone out to make sure you don't live to see the morning?<br /><br />In the ruins of a place once known as North America lies the nation of Panem, a shining Capitol surrounded by twelve outlying districts. The Capitol is harsh and cruel and keeps the districts in line by forcing them all to send one boy and one girl between the ages of twelve and eighteen to participate in the annual Hunger Games, a fight to the death on live TV. Sixteen-year-old Katniss Everdeen, who lives alone with her mother and younger sister, regards it as a death sentence when she is forced to represent her district in the Games. But Katniss has been close to dead before - and survival, for her, is second nature. Without really meaning to, she becomes a contender. But if she is to win, she will have to start making choices that weigh survival against humanity and life against love.<br /><br />New York Times bestselling au
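
The descriptions returned above contain HTML markup such as `<br />`, which would usually be removed before any text analysis. The following is a minimal sketch using only the standard library; it is not a step from the original notebook, the helper name `strip_html` is hypothetical, and a regular expression like this is only adequate for simple inline tags.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]+>')

def strip_html(text):
    """Replace HTML tags with spaces, decode entities, collapse whitespace."""
    without_tags = TAG_RE.sub(' ', text)
    unescaped = html.unescape(without_tags)
    return ' '.join(unescaped.split())

sample = ('And yet, as in all wars, life goes on.'
          '<br /><br />Harry, Ron, and Hermione')
print(strip_html(sample))
# And yet, as in all wars, life goes on. Harry, Ron, and Hermione
```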

Now that the Goodreads package is well understood, we can create a data frame that holds the book ID, the book title, and its description.

In [23]:
books_detailed = pd.DataFrame(index = np.arange(len(books)), 
                              columns = ['id','title','description'])

In [24]:
books_detailed.head()

Unnamed: 0,id,title,description
0,,,
1,,,
2,,,
3,,,
4,,,


To fill this data frame, we use the for-loop below, which fetches the title and description of each book.

In [25]:
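for i in range(len(ids)):
    if i % 250 == 0:
        print('Filling {}th book out of 10000!'.format(i))

    # .loc avoids chained-indexing assignment, which may not write through:
    books_detailed.loc[i, 'id'] = ids[i]

    try:
        book = gc.book(ids[i])
        title, description = book.title, book.description
    except Exception:
        # Leave title and description empty instead of silently reusing
        # the previous book's values when a lookup fails.
        continue

    books_detailed.loc[i, 'title'] = title
    books_detailed.loc[i, 'description'] = description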

Filling 0th book out of 10000!
Filling 250th book out of 10000!
Filling 500th book out of 10000!
Filling 750th book out of 10000!
Filling 1000th book out of 10000!
Filling 1250th book out of 10000!
Filling 1500th book out of 10000!
Filling 1750th book out of 10000!
Filling 2000th book out of 10000!
Filling 2250th book out of 10000!
Filling 2500th book out of 10000!
Filling 2750th book out of 10000!
Filling 3000th book out of 10000!
Filling 3250th book out of 10000!
Filling 3500th book out of 10000!
Filling 3750th book out of 10000!
Filling 4000th book out of 10000!
Filling 4250th book out of 10000!
Filling 4500th book out of 10000!
Filling 4750th book out of 10000!
Filling 5000th book out of 10000!
Filling 5250th book out of 10000!
Filling 5500th book out of 10000!
Filling 5750th book out of 10000!
Filling 6000th book out of 10000!
Filling 6250th book out of 10000!
Filling 6500th book out of 10000!
Filling 6750th book out of 10000!
Filling 7000th book out of 10000!
Filling 7250th book 
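
A variation on the loop above collects one record per book in a list and builds the data frame once at the end, leaving the fields empty when a lookup fails. This is a minimal sketch: `fetch_book` stands in for `gc.book`, and `collect_descriptions` and the stub `fake_fetch` are hypothetical names used only for illustration.

```python
from types import SimpleNamespace

def collect_descriptions(ids, fetch_book):
    """Return one record per ID; failed lookups keep None for the text fields."""
    rows = []
    for book_id in ids:
        row = {'id': book_id, 'title': None, 'description': None}
        try:
            book = fetch_book(book_id)
            row['title'] = book.title
            row['description'] = book.description
        except Exception:
            pass  # keep the empty row rather than reusing the previous book
        rows.append(row)
    return rows

def fake_fetch(book_id):
    # Hypothetical stub: pretend that ID 2 cannot be fetched.
    if book_id == 2:
        raise ValueError('lookup failed')
    return SimpleNamespace(title='Book {}'.format(book_id), description='...')

rows = collect_descriptions([1, 2, 3], fake_fetch)
print([row['title'] for row in rows])  # ['Book 1', None, 'Book 3']
```

Passing the resulting list to `pd.DataFrame(rows)` would give a frame with the same three columns as `books_detailed`.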

In [67]:
# Checking the number of rows and columns:
books_detailed.shape

(10000, 3)

In [40]:
# Changing the ID column to string:
books_detailed['id'] = books_detailed['id'].astype(str)

In [41]:
books_detailed.head()

Unnamed: 0,id,title,description
0,2767052,"The Hunger Games (The Hunger Games, #1)","Could you survive on your own, in the wild, wi..."
1,3,Harry Potter and the Sorcerer's Stone (Harry P...,Harry Potter's life is miserable. His parents ...
2,41865,"Twilight (Twilight, #1)",<b>About three things I was absolutely positiv...
3,2657,"To Kill a Mockingbird (To Kill a Mockingbird, #1)",The unforgettable novel of a childhood in a sl...
4,4671,The Great Gatsby,Alternate Cover Edition ISBN: 0743273567 (ISBN...


Now we need to merge this data frame with our extensive book dataset. But first we need to load the books.csv file.

In [95]:
books = pd.read_csv('./data/books.csv', encoding='utf-8-sig')

In [96]:
books['book_id'] = books['book_id'].astype(str)

In [97]:
# Checking whether the books.csv file also has 10,000 rows:
books.shape

(10000, 23)

In [98]:
# Dropping the unnecessary columns:
books.drop(columns=['id','best_book_id','isbn13'], inplace = True)

In [99]:
books.head()

Unnamed: 0,book_id,work_id,books_count,isbn,authors,original_publication_year,original_title,title,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url
0,2767052,2792775,272,439023483,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...
1,3,4640799,491,439554934,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...
2,41865,3212258,226,316015849,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...
3,2657,3275794,487,61120081,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...
4,4671,245494,1356,743273567,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...


In [100]:
# Merging the two data frames on book ID so that the new 'books' variable also holds the book descriptions:
books = pd.merge(books, books_detailed, left_on = 'book_id', right_on='id', how='left')

In [101]:
books.head()

Unnamed: 0,book_id,work_id,books_count,isbn,authors,original_publication_year,original_title,title_x,language_code,average_rating,...,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,id,title_y,description
0,2767052,2792775,272,439023483,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,...,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,2767052,"The Hunger Games (The Hunger Games, #1)","Could you survive on your own, in the wild, wi..."
1,3,4640799,491,439554934,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,...,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...,3,Harry Potter and the Sorcerer's Stone (Harry P...,Harry Potter's life is miserable. His parents ...
2,41865,3212258,226,316015849,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,...,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...,41865,"Twilight (Twilight, #1)",<b>About three things I was absolutely positiv...
3,2657,3275794,487,61120081,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,...,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...,2657,"To Kill a Mockingbird (To Kill a Mockingbird, #1)",The unforgettable novel of a childhood in a sl...
4,4671,245494,1356,743273567,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,...,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...,4671,The Great Gatsby,Alternate Cover Edition ISBN: 0743273567 (ISBN...
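
The merge above works because both key columns were cast to strings beforehand, and it produces `title_x` and `title_y` because both frames carry a `title` column. The sketch below reproduces this on toy data (the values are made up) and adds two optional checks that the original merge does not use: `validate` to assert that keys are unique on both sides, and `indicator` to flag rows without a match.

```python
import pandas as pd

# Toy frames standing in for `books` and `books_detailed` (made-up values).
left = pd.DataFrame({'book_id': ['1', '2', '3'],
                     'title': ['A', 'B', 'C']})
right = pd.DataFrame({'id': ['1', '3'],
                      'title': ['A', 'C'],
                      'description': ['first', 'third']})

merged = pd.merge(left, right, left_on='book_id', right_on='id',
                  how='left', validate='one_to_one', indicator=True)

# Both frames have a 'title' column, so pandas adds the _x/_y suffixes.
print(list(merged.columns))

# Rows present only in the left frame have no description.
unmatched = int((merged['_merge'] == 'left_only').sum())
print(unmatched)  # 1
```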


In [102]:
# Dropping unnecessary columns:
books.drop(columns = ['work_id','id','title_y'], inplace = True)

In [103]:
books.head()

Unnamed: 0,book_id,books_count,isbn,authors,original_publication_year,original_title,title_x,language_code,average_rating,ratings_count,work_ratings_count,work_text_reviews_count,ratings_1,ratings_2,ratings_3,ratings_4,ratings_5,image_url,small_image_url,description
0,2767052,272,439023483,Suzanne Collins,2008.0,The Hunger Games,"The Hunger Games (The Hunger Games, #1)",eng,4.34,4780653,4942365,155254,66715,127936,560092,1481305,2706317,https://images.gr-assets.com/books/1447303603m...,https://images.gr-assets.com/books/1447303603s...,"Could you survive on your own, in the wild, wi..."
1,3,491,439554934,"J.K. Rowling, Mary GrandPré",1997.0,Harry Potter and the Philosopher's Stone,Harry Potter and the Sorcerer's Stone (Harry P...,eng,4.44,4602479,4800065,75867,75504,101676,455024,1156318,3011543,https://images.gr-assets.com/books/1474154022m...,https://images.gr-assets.com/books/1474154022s...,Harry Potter's life is miserable. His parents ...
2,41865,226,316015849,Stephenie Meyer,2005.0,Twilight,"Twilight (Twilight, #1)",en-US,3.57,3866839,3916824,95009,456191,436802,793319,875073,1355439,https://images.gr-assets.com/books/1361039443m...,https://images.gr-assets.com/books/1361039443s...,<b>About three things I was absolutely positiv...
3,2657,487,61120081,Harper Lee,1960.0,To Kill a Mockingbird,To Kill a Mockingbird,eng,4.25,3198671,3340896,72586,60427,117415,446835,1001952,1714267,https://images.gr-assets.com/books/1361975680m...,https://images.gr-assets.com/books/1361975680s...,The unforgettable novel of a childhood in a sl...
4,4671,1356,743273567,F. Scott Fitzgerald,1925.0,The Great Gatsby,The Great Gatsby,eng,3.89,2683664,2773745,51992,86236,197621,606158,936012,947718,https://images.gr-assets.com/books/1490528560m...,https://images.gr-assets.com/books/1490528560s...,Alternate Cover Edition ISBN: 0743273567 (ISBN...


In [105]:
# Saving the extended book data frame:
books.to_csv('./data/books_extended.csv')