# Project 4 - THABALSDE004

## Project Details
This project scrapes data from the **Goodreads website** (https://www.goodreads.com/) and *produces 2 CSV files*: one containing information about the books and another containing information about the authors.

This notebook explains data scraping, merging of relational data, and plotting graphs for the merged data. It is split into 3 parts:
1. **Data Scraping**
2. **Merging Relational Data**
3. **Plotting Graphs**

## Section 1: Data Scraping

To scrape the website, I used *Scrapy (a Python framework) and Python 2.7*. There are two Scrapy files: one for extracting books data and another for extracting authors data.

In [None]:
import os

# Removing the CSV files if previously present
dir_name = "."
reports = os.listdir(dir_name)
for item in reports:
    if item.endswith(".csv"):
        os.remove(os.path.join(dir_name, item))
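
Removing old CSV files matters because Scrapy's `-o` option appends to an existing output file instead of replacing it. The same cleanup can be wrapped in a function that takes the directory as a parameter, so it can be tried on a scratch directory first. This is only a sketch: the function name `remove_csv_files` and the scratch file names are assumptions for illustration, not part of the project.

```python
import os
import shutil
import tempfile


def remove_csv_files(dir_name):
    """Delete every .csv file directly inside dir_name and return the removed names."""
    removed = []
    for item in sorted(os.listdir(dir_name)):
        if item.endswith(".csv"):
            os.remove(os.path.join(dir_name, item))
            removed.append(item)
    return removed


# Try it on a throwaway directory rather than the project folder.
scratch_dir = tempfile.mkdtemp()
for name in ["BookContents.csv", "AuthorContents.csv", "notes.txt"]:
    with open(os.path.join(scratch_dir, name), "w") as handle:
        handle.write("placeholder\n")

removed_files = remove_csv_files(scratch_dir)
remaining_files = os.listdir(scratch_dir)
shutil.rmtree(scratch_dir)  # remove the scratch directory itself
```

Only the two CSV files are deleted; `notes.txt` is left untouched.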

### Scraping Books Data
#### Running Books.py
**This Python file scrapes the details of each book: Author Name, Book Title, Total Ratings Received and Average Rating.**

The call below runs *Books.py*. Scrapy extracts the contents from the website and writes them to a CSV file, *BookContents.csv*.

In [None]:
from subprocess import call

call(["scrapy", "runspider", "Books.py", "-t", "csv", "-o", "BookContents.csv"])
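
`subprocess.call` waits for the command to finish and returns its exit status, so a failed spider run can be detected before the CSV file is read. The sketch below shows the idea with stand-in commands (the Python interpreter itself) so that it runs without Scrapy installed; the helper name `run_and_report` is an assumption for illustration.

```python
import subprocess
import sys


def run_and_report(args):
    """Run a command, wait for it, and return True when it exits with status 0."""
    exit_code = subprocess.call(args)
    return exit_code == 0


# Stand-in commands: one exits cleanly, the other exits with status 3.
succeeded = run_and_report([sys.executable, "-c", "pass"])
failed = run_and_report([sys.executable, "-c", "import sys; sys.exit(3)"])
```

In the notebook, the same check could wrap the `scrapy runspider` call. Alternatively, `subprocess.check_call` raises `CalledProcessError` on a non-zero exit status.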

### Scraping Authors Data
#### Running authors.py
**This Python file scrapes the details of each author: Author Name, Genre, Birth Date and Death Date.**

The call below runs *authors.py*. Scrapy extracts the contents from the website and writes them to a CSV file, *AuthorContents.csv*.

In [None]:
call(["scrapy", "runspider", "authors.py", "-t", "csv", "-o", "AuthorContents.csv"])

## Section 2: Merging Relational Data

### Importing necessary libraries for Merging the Data Tables

In [None]:
import pandas as pd
import numpy as np

### Creating Data Frames using Pandas
**Creating two Data Frames: one from the _AuthorContents.csv_ file and the other from the _BookContents.csv_ file. Using Pandas, we can read each CSV file as shown below.**

In [None]:
books_df = pd.read_csv("BookContents.csv", keep_default_na=False, na_values=[""])
authors_df = pd.read_csv("AuthorContents.csv", keep_default_na=False, na_values=[""])
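
The options `keep_default_na=False, na_values=[""]` tell Pandas to treat only empty fields as missing, so a scraped string such as "NA" stays as text. A small sketch of the difference, using made-up rows in place of the real scraped file:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample rows; the real file is produced by the authors spider.
csv_text = u"Author,Genre\nAuthor A,Fiction\nAuthor B,\nAuthor C,NA\n"

# Same options as in the notebook: only empty fields count as missing.
strict_df = pd.read_csv(StringIO(csv_text), keep_default_na=False, na_values=[""])

# Pandas defaults: strings such as "NA" are converted to NaN as well.
default_df = pd.read_csv(StringIO(csv_text))

strict_missing = strict_df["Genre"].isnull().tolist()    # [False, True, False]
default_missing = default_df["Genre"].isnull().tolist()  # [False, True, True]
```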

### Details of Books Data Frame(books_df) 
We can see that *books_df* holds the contents of the BookContents.csv file. It has information about each book that
I scraped from Goodreads (https://www.goodreads.com/). For this data frame, I captured details like Author Name, Book Title, Total Ratings and Average Rating. This is how the data frame looks:

In [None]:
print(books_df)

### Details of Authors Data Frame(authors_df)
We can see that *authors_df* holds the contents of the AuthorContents.csv file. It has information about each author that
I scraped from https://www.goodreads.com/search?q=authors. I captured details like Author Name, Genre, Birth Date and Death Date. This is how the data frame looks:

In [None]:
print(authors_df)

### Merging the Data Sets
The two data frames hold relational content: the *Author* field is common to both, so it serves as the join key. I merged them on 'Author' using the **Pandas merge** method with a *left join*, which keeps every row of *authors_df* and attaches the matching rows of *books_df*.

In [None]:
merge_df = pd.merge(left=authors_df, right=books_df, how='left', left_on='Author', right_on='Author')
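
To make the effect of the left join concrete, here is a minimal sketch on two tiny tables. The author names, genres and book titles are made up for illustration; only the join column `Author` and the `Book` column follow the notebook.

```python
import pandas as pd

# Hypothetical miniature tables with the same join column as the notebook.
authors = pd.DataFrame({
    "Author": ["Author A", "Author B", "Author C"],
    "Genre": ["Fiction", "Poetry", "History"],
})
books = pd.DataFrame({
    "Author": ["Author A", "Author A", "Author B"],
    "Book": ["Book One", "Book Two", "Book Three"],
})

# on="Author" is shorthand for left_on="Author", right_on="Author".
merged = pd.merge(left=authors, right=books, how="left", on="Author")

row_count = len(merged)                              # 4
authors_in_order = merged["Author"].tolist()
missing_books = int(merged["Book"].isnull().sum())   # 1
```

An author with two books appears in two rows, while an author with no book still appears once, with `NaN` in the book columns. Any later string processing on the merged frame has to allow for those missing values.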

### Displaying the Contents of the Merged DataFrame

In [None]:
print(merge_df)

## Section 3: Visualization using Bokeh

Using Bokeh, I made bar graphs and line graphs with scatter markers. I made these plots for two scenarios:
1. **Analysis for Books and Total Ratings**
2. **Analysis for Books and Average Ratings**

Here the books are labeled Book1, Book2, ..., Book20, one label per row of the *Book* column in the merged data frame. Total Ratings is the number of ratings received for the book. Average Rating is the average of all the ratings received and lies between 0 and 5.
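
The labels and the two rating values can be sketched with plain Python helpers. The sample strings are assumptions: their format is inferred from the " ratings" and " avg rating" suffixes that the plotting cells strip, and the helper names are hypothetical.

```python
def make_book_labels(count):
    """Return the labels Book1..BookN used on the x-axis of the bar graphs."""
    return ["Book" + str(i + 1) for i in range(count)]


def parse_total_ratings(text):
    """Convert a string such as '1,234 ratings' to the integer 1234."""
    return int(text.replace(" ratings", "").replace(",", ""))


def parse_avg_rating(text):
    """Convert a string such as '4.25 avg rating' to a float and check its range."""
    value = float(text.replace(" avg rating", ""))
    if not 0.0 <= value <= 5.0:
        raise ValueError("average rating out of range: %r" % text)
    return value


labels = make_book_labels(3)                      # ['Book1', 'Book2', 'Book3']
total = parse_total_ratings("1,234,567 ratings")  # 1234567
average = parse_avg_rating("4.25 avg rating")     # 4.25
```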

### Standard Bokeh Imports

In [None]:
from bokeh.plotting import figure
from bokeh.io import show, output_notebook

#### Bar Graph for Books vs. Total Ratings

In [None]:
output_notebook()

###################################################################################
#   Getting the Relevant Data Frames from merge_df  - Books and Total Ratings     #
###################################################################################

# Getting Books
books = ["Book"+str(i+1) for i in range(0, len(merge_df['Book'] ))]

# Getting Total Ratings
book_ratings = merge_df["Ratings"]
ratings = book_ratings.str.replace(" ratings", "").str.replace(",", "").apply(int)

# Creating the Plot
p = figure(x_range=books, plot_width=1000, plot_height = 500, title="Books Vs. Ratings")
p.vbar(x=books, top=ratings, width=0.9)
p.xgrid.grid_line_color = None
p.y_range.start = 0
p.background_fill_color= "#dddddd"
p.xaxis.axis_label="Books"
p.yaxis.axis_label="Total Ratings"

# Display the Plot
show(p)
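
Because the merge is a left join, an author without a matching book would leave a missing value in the `Ratings` column, and converting that value with `int` would fail. A hedged sketch of a more tolerant conversion, using made-up values rather than the scraped data:

```python
import pandas as pd

# Hypothetical values; None stands for an author row with no matching book.
raw_ratings = pd.Series(["1,234 ratings", "56 ratings", None])

# The .str methods skip missing entries instead of raising an error.
cleaned = raw_ratings.str.replace(" ratings", "").str.replace(",", "")

# errors="coerce" turns anything unparseable, including missing values, into NaN.
numeric_ratings = pd.to_numeric(cleaned, errors="coerce")

# Replace NaN with 0 only if a missing book should be drawn as an empty bar.
plot_ratings = numeric_ratings.fillna(0).astype(int).tolist()  # [1234, 56, 0]
```

Whether to fill missing values with 0 or to drop those rows before plotting is a design choice; dropping them keeps the chart limited to books that actually exist in the data.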

#### Line Graph for Books vs. Total Ratings

In [None]:
output_notebook()

###################################################################################
#   Getting the Relevant Data Frames from merge_df  - Books and Total Ratings     #
###################################################################################

# Indexes for each book number
xvalues = list(range(1, len(merge_df['Book']) + 1))

# Getting Total Ratings 
book_ratings = merge_df["Ratings"]
yvalues = book_ratings.str.replace(" ratings", "").str.replace(",", "").apply(int)

# Creating the Plot
p = figure(plot_width=1000, plot_height = 500, title="Books Vs. Ratings -- Line Plot")
p.outline_line_width = 7
p.outline_line_alpha = 0.3
p.outline_line_color = "navy"
p.background_fill_color= "#dddddd"
p.xaxis.axis_label="Books"
p.yaxis.axis_label="Total Ratings"
p.line(xvalues, yvalues, line_width=2)
p.circle(xvalues, yvalues, fill_color="black", size=10)

# Display the Plot
show(p)


#### Bar Graph for Books vs. Avg Ratings

In [None]:
output_notebook()

###################################################################################
#   Getting the Relevant Data Frames from merge_df  - Books and Avg Ratings       #
###################################################################################
# Getting Books
books = ["Book"+str(i+1) for i in range(0, len(merge_df['Book'] ))]

# Getting Avg Ratings
avg_book_ratings = merge_df["Avg Rating"]
avg_ratings =  avg_book_ratings.str.replace(" avg rating", "").astype(float)

# Creating the Plot
p = figure(x_range=books, plot_width = 750, plot_height = 500, title="Books Vs. Avg Ratings")
p.background_fill_color= "#dddddd"
p.xaxis.axis_label="Books"
p.yaxis.axis_label="Avg Ratings"
p.vbar(x=books, top=avg_ratings, width=0.9)
p.xgrid.grid_line_color = None
p.y_range.start = 0

# Display the Plot
show(p)


#### Line Graph for Books vs. Avg Ratings

In [None]:
output_notebook()

###################################################################################
#   Getting the Relevant Data Frames from merge_df  - Books and Avg Ratings       #
###################################################################################
# Indexes for each book number
xvalues = list(range(1, len(merge_df['Book']) + 1))

# Avg Rating for the book
avg_book_ratings = merge_df["Avg Rating"]
yvalues =  avg_book_ratings.str.replace(" avg rating", "").astype(float) 

# Creating the Plot
p = figure(plot_width=600, plot_height=600, title="Books vs. Avg Rating -- Line Plot")
p.outline_line_width = 7
p.outline_line_alpha = 0.3
p.outline_line_color = "navy"
p.background_fill_color= "#dddddd"
p.xaxis.axis_label="Books"
p.yaxis.axis_label="Avg Ratings"
p.line(xvalues, yvalues, line_width=2)
p.circle(xvalues, yvalues, fill_color='white', size=10)

# Displaying the Plot 
show(p)