Import science tools (Python packages)

In [None]:
import pandas
import matplotlib
import matplotlib.pyplot
matplotlib.pyplot.style.use('fivethirtyeight')

Read data from csv file into a dataframe.

In [None]:
books_df = pandas.read_csv('books.csv')

D'OH FAIL. Even this "clean" curated data set from Kaggle.com, made specifically as an example data set for analysis, is formatted incorrectly: it is an invalid CSV file, so pandas fails to load it. After cleaning it up by hand, we get the valid file below, which loads successfully.  
  
(_This initial failed attempt to load the data has been left in the notebook on purpose as a real example to segue into talking about how 90% of "data science" is cleaning up bad, missing, or inconsistent data_)
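
Cleaning by hand is not the only option. The sketch below assumes the failure is of the common "too many fields in one row" kind, which these notes do not actually specify. The in-memory text `raw` and the name `toy_df` are made up for illustration, and `on_bad_lines` requires pandas 1.3 or later.

```python
import io
import pandas

# Hypothetical miniature of a malformed file: the third line has an extra field.
raw = "title,pages\nBook A,100\nBook B,200,oops\nBook C,300\n"

try:
    pandas.read_csv(io.StringIO(raw))
    failed = False
except pandas.errors.ParserError as err:
    # The error message names the offending line, which helps locate bad rows.
    failed = True
    print(err)

# Skip rows that have too many fields instead of raising.
toy_df = pandas.read_csv(io.StringIO(raw), on_bad_lines='skip')
print(toy_df)
```

Skipping rows silently discards data, so it is worth comparing the row count before and after.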

In [None]:
books_df = pandas.read_csv('books_clean.csv')

Inspect the data in the `DataFrame`

In [None]:
books_df.info()

In [None]:
books_df.head()

In [None]:
books_df.describe()

Develop a hypothesis, such as _"book length correlates with book rating"_ or _"number of ratings per book correlates with book rating"_.

Discuss confounding variables, such as language. Look to see how language might affect the hypothesis before proceeding.

How does language affect rating distribution?

In [None]:
books_df.boxplot(column='average_rating', by='language_code', figsize=(22, 8))

There seems to be quite a bit of variation in average rating by language, so it seems we can isolate English to avoid confounding by language. Except for Canadians, who appear to be slightly nicer reviewers than other English speakers :)
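
The same comparison can be made numerically with a `groupby`. This is a minimal sketch on a made-up frame (`toy_df` and its values are illustrative, not taken from the book data); the column names match the ones used in this notebook.

```python
import pandas

# Toy data standing in for books_df.
toy_df = pandas.DataFrame({
    'language_code': ['eng', 'eng', 'fre', 'fre', 'spa'],
    'average_rating': [4.0, 3.0, 4.5, 3.5, 2.0],
})

# Median rating and number of books for each language.
summary = toy_df.groupby('language_code')['average_rating'].agg(['median', 'count'])
print(summary)
```

The `count` column matters as much as the median: a language with only a handful of books will give a noisy estimate.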

But will this leave us with enough data?

In [None]:
books_df.language_code.value_counts().plot(kind='bar', figsize=(18, 5))

Seems sufficient. Let's extract all English books from the dataset.

But first, how do we filter on a column?

In [None]:
books_df[books_df.language_code == 'zho'].head()

Cool, but what is that doing?

First it is selecting a column.

In [None]:
books_df.language_code

In [None]:
books_df['language_code']

Then it is using a condition to generate a vector of booleans. When this vector of booleans is passed into a dataframe, only the rows corresponding to `True` are returned.

In [None]:
books_df.language_code == 'eng'
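
The two steps are easier to see on a tiny frame. This is a sketch with made-up data; `toy_df`, `mask`, and `filtered` are illustrative names.

```python
import pandas

toy_df = pandas.DataFrame({
    'title': ['A', 'B', 'C'],
    'language_code': ['eng', 'zho', 'eng'],
})

# Step 1: the comparison yields one boolean per row.
mask = toy_df['language_code'] == 'eng'
print(mask.tolist())  # [True, False, True]

# Step 2: indexing with the mask keeps only the rows marked True.
filtered = toy_df[mask]
print(filtered['title'].tolist())  # ['A', 'C']
```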

Now let's filter for only English books.

In [None]:
eng_books_df = books_df[books_df['language_code'] == 'eng']
eng_books_df.language_code.value_counts().plot(kind='bar')

It worked! But actually there are other English books. Check the full value count chart again and you will also see `en-GB`, `en-US`, `en-CA`. How can we filter on multiple values?

In [None]:
eng_books_df = books_df[(books_df['language_code'] == 'eng') | (books_df['language_code'] == 'en-GB') | (books_df['language_code'] == 'en-US') | (books_df['language_code'] == 'en-CA')]
eng_books_df.language_code.value_counts().plot(kind='bar')

Ok, that worked, but it was really ugly. Is there a nicer way to do it?

In [None]:
target_languages = ['eng', 'en-GB', 'en-US', 'en-CA']
eng_books_df = books_df[books_df['language_code'].isin(target_languages)]
eng_books_df.language_code.value_counts().plot(kind='bar')
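
If the exact set of codes is not known ahead of time, a prefix match is another option. The sketch below uses a made-up frame (`toy_df` is illustrative). Note that a prefix of `en` also matches other codes that begin with those letters, such as the ISO 639 code `enm` for Middle English, so inspect what matched before relying on it.

```python
import pandas

toy_df = pandas.DataFrame({
    'language_code': ['eng', 'en-US', 'fre', 'enm', 'en-GB'],
})

# True for every code that starts with 'en'.
mask = toy_df['language_code'].str.startswith('en')
matched = toy_df.loc[mask, 'language_code'].tolist()
print(matched)  # ['eng', 'en-US', 'enm', 'en-GB']
```

An explicit list with `isin` is stricter and is usually the safer choice once the codes are known.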

Great. Anything else we might want to filter? Let's check the stats again.

In [None]:
eng_books_df.describe()

Hmm, minimum average rating and rating count are both 0. What's going on?

In [None]:
eng_books_df.hist(column="average_rating", bins=100)

In [None]:
fig, axes = matplotlib.pyplot.subplots()
eng_books_df.average_rating.hist(ax=axes, bins=100, bottom=0.1)
axes.set_yscale('log')

In [None]:
fig, axes = matplotlib.pyplot.subplots()
eng_books_df.ratings_count.hist(ax=axes, bins=100)
axes.set_yscale('log')
axes.set_xscale('log')
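
The histograms can be backed up with a direct count. One plausible explanation, which these notes do not confirm, is that books with no ratings at all are stored with an average rating of 0. The sketch below checks that idea on a made-up frame (`toy_df` and its values are illustrative).

```python
import pandas

toy_df = pandas.DataFrame({
    'ratings_count': [0, 12, 0, 5400],
    'average_rating': [0.0, 3.9, 0.0, 4.2],
})

# Rows for books that nobody has rated.
unrated = toy_df['ratings_count'] == 0
n_unrated = int(unrated.sum())

# Do all unrated books carry an average rating of exactly 0?
all_zero_avg = bool((toy_df.loc[unrated, 'average_rating'] == 0).all())
print(n_unrated, all_zero_avg)
```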

Let's stick to popular books for now.

In [None]:
rated_books_df = eng_books_df[eng_books_df['ratings_count'] > 1000]

Is this enough data?

In [None]:
rated_books_df.info()

5,961 entries. Seems to be enough data. What do the stats look like now?

In [None]:
rated_books_df.describe()

And the new ratings distribution?

In [None]:
rated_books_df.hist(column='average_rating', bins=30)

Ok, time to test our hypothesis. Let's plot average rating vs. ratings count.

In [None]:
rated_books_df.plot(x='average_rating', y='ratings_count', style='.')

Hmm, what else can we plot?

In [None]:
rated_books_df.plot(x='# num_pages', y='ratings_count', style='.')

In [None]:
rated_books_df.plot(x='# num_pages', y='text_reviews_count', style='.', alpha=.15)

In [None]:
# TODO: MOVE THIS TO NOTES

rank_df = pandas.DataFrame({
    'pages': rated_books_df['# num_pages'].rank(),
    'reviews': rated_books_df['text_reviews_count'].rank()
})

rank_df.plot(x='pages', y='reviews', style='.')
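
Why bother with ranks? Replacing values with their ranks keeps only the ordering, so a relationship that is monotonic but not linear still shows up clearly. The Pearson correlation of the ranks is the Spearman rank correlation. This is a minimal sketch on made-up data (`toy_df` is illustrative).

```python
import pandas

# y grows with x, but not linearly.
toy_df = pandas.DataFrame({'x': [1, 2, 3, 4, 5]})
toy_df['y'] = toy_df['x'] ** 3

# Pearson correlation on the raw values is below 1 because of the curvature.
pearson = toy_df['x'].corr(toy_df['y'])

# Pearson correlation on the ranks (Spearman) is 1 for any increasing relationship.
rank_corr = toy_df['x'].rank().corr(toy_df['y'].rank())

print(round(pearson, 3), round(rank_corr, 3))
```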

In [None]:
rated_books_df.plot(x='# num_pages', y='average_rating', style='.')

Can we just plot ALL THE THINGS???

In [None]:
pandas.plotting.scatter_matrix(
    rated_books_df,
    figsize=(20, 20),
    hist_kwds={'bins': 30}
)

Hmm, some of them look quite strongly correlated. Can we see the actual correlation coefficients?

In [None]:
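# numeric_only=True (pandas 1.5 or later) skips text columns such as titles;
# without it, pandas 2.0 and later raise an error on non-numeric columns.
corr = rated_books_df.corr(numeric_only=True)
corr.style.background_gradient(cmap='coolwarm')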

Ok, but can it be more delightful?

In [None]:
import matplotlib.pyplot as plot

f = plot.figure(figsize=(10, 8))
# numeric_only=True (pandas 1.5 or later) skips the non-numeric columns.
plot.matshow(rated_books_df.corr(numeric_only=True), fignum=f.number)
plot.colorbar()
plot.show()
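
Rather than eyeballing the colours, the strongest pair can be pulled out of the correlation matrix directly. This is a sketch on a made-up numeric frame (`toy_df` and its columns are illustrative); it keeps only the entries above the diagonal so that each pair appears once and the self-correlations are ignored.

```python
import numpy
import pandas

toy_df = pandas.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 11],
    'c': [5, 3, 4, 1, 2],
})
corr = toy_df.corr()

# Boolean mask that is True strictly above the diagonal.
upper = numpy.triu(numpy.ones(corr.shape, dtype=bool), k=1)

# One entry per column pair, indexed by (row name, column name).
pairs = corr.where(upper).stack()

# The pair with the largest absolute correlation.
strongest = pairs.abs().idxmax()
print(strongest)
```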

Let's take a closer look at text review count vs. ratings count, as that pair seems to have the strongest correlation.

In [None]:
rated_books_df.plot(x='text_reviews_count', y='ratings_count', style='.', figsize=(8, 8), alpha=.1)

In [None]:
rank_df = pandas.DataFrame({
    'ratings': rated_books_df['ratings_count'].rank(),
    'reviews': rated_books_df['text_reviews_count'].rank()
})

rank_df.plot(x='ratings', y='reviews', style='.')

Do you think anyone bothered to take the time to write a text review but not leave a numeric rating? Seems unlikely. Can we see if any of the points go below the diagonal y = x?

In [None]:
import matplotlib.pyplot as plot

fig, ax = plot.subplots(figsize=(5, 5))
ax.scatter(rated_books_df.text_reviews_count, rated_books_df.ratings_count)
# Draw y = x in data coordinates. Using ax.transAxes here would instead
# draw the diagonal of the rendered chart, regardless of the axis limits.
# Axes.axline requires matplotlib 3.3 or later.
ax.axline((0, 0), slope=1, color='red', linewidth=0.8)
plot.show()

Any points below the red line would be books with more text reviews than ratings.
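
The question can also be answered without a plot by comparing the two columns directly. This is a minimal sketch on a made-up frame (`toy_df` and its values are illustrative); the column names match the ones used in this notebook.

```python
import pandas

toy_df = pandas.DataFrame({
    'ratings_count': [1500, 2000, 1200],
    'text_reviews_count': [100, 2500, 1200],
})

# Rows where text reviews outnumber ratings, i.e. points below y = x
# when text_reviews_count is on the x axis and ratings_count on the y axis.
more_reviews = toy_df[toy_df['text_reviews_count'] > toy_df['ratings_count']]
n_below = len(more_reviews)
print(n_below)
```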