In [3]:
import pandas as pd
import altair as alt  # the cells below refer to the library as `alt`
df = pd.read_csv("https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/top-500-novels/library_top_500.csv", sep=',', header=0, low_memory=False)

In [4]:
df.head()

Unnamed: 0,top_500_rank,title,author,pub_year,orig_lang,genre,author_birth,author_death,author_gender,author_primary_lang,...,gr_num_ratings,gr_num_reviews,gr_avg_rating_rank,gr_num_ratings_rank,oclc_owi,author_viaf,gr_url,wiki_url,pg_eng_url,pg_orig_url
0,1,Don Quixote,Miguel de Cervantes,1605,Spanish,action,1547,1616,male,spa,...,269435,12053,318,211,1810748000.0,17220427,https://www.goodreads.com/book/show/3836.Don_Q...,https://en.wikipedia.org/wiki/Don_Quixote,https://www.gutenberg.org/cache/epub/996/pg996...,https://www.gutenberg.org/cache/epub/2000/pg20...
1,2,Alice's Adventures in Wonderland,Lewis Carroll,1865,English,fantasy,1832,1898,male,eng,...,561016,15380,172,133,11561320000.0,66462036,https://www.goodreads.com/book/show/24213.Alic...,https://en.wikipedia.org/wiki/Alice%27s_Advent...,https://www.gutenberg.org/cache/epub/11/pg11.txt,
2,3,The Adventures of Huckleberry Finn,Mark Twain,1884,English,action,1835,1910,male,eng,...,1262480,19440,373,68,3373178000.0,50566653,https://www.goodreads.com/book/show/2956.The_A...,https://en.wikipedia.org/wiki/Adventures_of_Hu...,https://www.gutenberg.org/cache/epub/76/pg76.txt,
3,4,The Adventures of Tom Sawyer,Mark Twain,1876,English,action,1835,1910,male,eng,...,931898,13603,301,88,3373178000.0,50566653,https://www.goodreads.com/book/show/24583.The_...,https://en.wikipedia.org/wiki/The_Adventures_o...,https://www.gutenberg.org/cache/epub/74/pg74.txt,
4,5,Treasure Island,Robert Louis Stevenson,1883,English,action,1850,1894,male,eng,...,486155,16307,368,145,3434.0,95207986,https://www.goodreads.com/book/show/295.Treasu...,https://en.wikipedia.org/wiki/Treasure_Island,https://www.gutenberg.org/cache/epub/120/pg120...,


<h2>Section One: Examining Occurrences of Gender, Language, and Nationality</h2>

Let's first inspect how often each original language occurs.

In [9]:
df['orig_lang'] = df['orig_lang'].str.lower()

lang_counts = df['orig_lang'].value_counts(normalize=True).reset_index()
lang_counts.columns = ['orig_lang', 'percentage']
lang_counts['percentage'] *= 100

chart = alt.Chart(lang_counts).mark_bar().encode(
    x='orig_lang',
    y='percentage'
).properties(
    title='Distribution of Original Languages (Percentage)'
)

chart

English vastly outnumbers every other language. This is not surprising, since the ranking was made by English speakers. However, it is extremely unlikely that the best books ever written follow the same proportions. If Russian speakers had created such a ranking, for instance, we would expect to see more Russian authors.

In any case, this is a clear example of bias. Let's examine gender next.

In [10]:
gender_counts = df['author_gender'].value_counts(normalize=True).reset_index()
gender_counts.columns = ['author_gender', 'percentage']
gender_counts['percentage'] *= 100

gender_chart = alt.Chart(gender_counts).mark_bar().encode(
    x='author_gender',
    y='percentage'
).properties(
    title='Distribution of Author Genders (Percentage)'
)

gender_chart

We can see that male authors outnumber female authors by more than 2 to 1. The provider of this dataset says the list is based on novels favored by men, so the bias is clear here. Again, it is extremely unlikely that male authors are inherently more than twice as likely to be great authors, so there must be underlying factors.

Let's examine nationality next.

In [11]:
nationality_counts = df['author_nationality'].value_counts(normalize=True).reset_index()
nationality_counts.columns = ['author_nationality', 'percentage']
nationality_counts['percentage'] *= 100

nationality_chart = alt.Chart(nationality_counts).mark_bar().encode(
    x='author_nationality',
    y='percentage'
).properties(
    title='Distribution of Author Nationalities (Percentage)'
)

nationality_chart

At this point, the pattern is clear: authors from the United States and Great Britain vastly outnumber those from all other countries. Again, it is extremely unlikely that some 'je ne sais quoi' of being American or British makes someone a better author. This distribution can probably be attributed to bias.
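The three percentage tables in this section follow the same recipe: count the values, normalize, and scale to percent. A minimal sketch of a reusable helper is shown below. The name `percentage_table` and the toy frame are hypothetical and not part of the original notebook. Note that `value_counts` drops missing values by default, so the percentages are shares of the rows that have a value.

```python
import pandas as pd


def percentage_table(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return each distinct value of `column` with its share of rows, in percent."""
    # value_counts(normalize=True) gives fractions that sum to 1 (NaN rows excluded)
    shares = frame[column].value_counts(normalize=True) * 100
    # Name the index after the column, then turn the Series into a two-column frame
    return shares.rename_axis(column).reset_index(name="percentage")


# Toy frame with made-up values, only to show the shape of the output
toy = pd.DataFrame({"orig_lang": ["english", "english", "english", "french"]})
print(percentage_table(toy, "orig_lang"))
```

With the real data, the same call would be `percentage_table(df, 'orig_lang')`, `percentage_table(df, 'author_gender')`, and so on, and the result can be passed straight to `alt.Chart`.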


<h2>Section Two: Examining Rankings of Gender, Language, and Nationality</h2>

In this section, we will examine which factors might correlate with a book being rated higher.

In [22]:
# Calculate mean rating rank for each language
lang_rating_rank = df.groupby('orig_lang')['gr_avg_rating_rank'].mean().reset_index()
lang_rating_rank.columns = ['orig_lang', 'mean_rating_rank']

# Calculate mean rating rank for each gender
gender_rating_rank = df.groupby('author_gender')['gr_avg_rating_rank'].mean().reset_index()
gender_rating_rank.columns = ['author_gender', 'mean_rating_rank']

# Calculate mean rating rank for each nationality
nationality_rating_rank = df.groupby('author_nationality')['gr_avg_rating_rank'].mean().reset_index()
nationality_rating_rank.columns = ['author_nationality', 'mean_rating_rank']

# Plot the results
lang_chart = alt.Chart(lang_rating_rank).mark_bar().encode(
    x='orig_lang',
    y='mean_rating_rank'
).properties(
    title='Mean Rating Rank by Language'
)

lang_chart

This plot shows the mean rating rank by language. Interestingly, English does not rank higher than most other languages; Japanese and Latin books are rated higher on average than books in all other languages. What explains this discrepancy? Let's examine the value counts for each language:

In [17]:
language_counts = df['orig_lang'].value_counts().reset_index()
language_counts.columns = ['Language', 'Count']
print(language_counts)

      Language  Count
0      english    430
1       french     25
2       german     14
3      russian     11
4      spanish      7
5      italian      5
6      swedish      3
7        latin      1
8     japanese      1
9   portuguese      1
10     chinese      1
11      polish      1


Both Latin and Japanese have only one book in the list, and each of those books happens to be rated quite highly. With a sample of one book per language, we cannot conclude whether books written in these languages are actually better; we would need a larger sample.
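One way to keep single-book groups from dominating the comparison is to report the group size next to each mean and drop groups below a minimum size. The sketch below assumes a hypothetical helper name, `mean_rank_with_counts`, and an arbitrary threshold of 5 books; neither comes from the original notebook, and the toy values are made up.

```python
import pandas as pd


def mean_rank_with_counts(frame, group_col, rank_col="gr_avg_rating_rank", min_count=5):
    """Mean rank and number of books per group, keeping only groups with enough books."""
    summary = (
        frame.groupby(group_col)[rank_col]
        .agg(mean_rating_rank="mean", n_books="count")  # named aggregation
        .reset_index()
    )
    # Discard groups too small to support a meaningful average
    kept = summary[summary["n_books"] >= min_count]
    return kept.sort_values("mean_rating_rank").reset_index(drop=True)


# Toy frame with made-up ranks: one large group and one single-book group
toy = pd.DataFrame({
    "orig_lang": ["english"] * 5 + ["latin"],
    "gr_avg_rating_rank": [10, 20, 30, 40, 50, 1],
})
print(mean_rank_with_counts(toy, "orig_lang"))
```

On the real data, `mean_rank_with_counts(df, 'orig_lang')` would leave only the languages with at least five books, which makes the remaining averages easier to compare.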

Next, we can analyse gender.

In [19]:
gender_chart = alt.Chart(gender_rating_rank).mark_bar().encode(
    x='author_gender',
    y='mean_rating_rank'
).properties(
    title='Mean Rating Rank by Gender'
)
gender_chart


We can see that books written by male authors are ranked higher on average. Let's examine the value counts as we did before:

In [21]:
gender_counts = df['author_gender'].value_counts().reset_index()
gender_counts.columns = ['author_gender', 'count']
print(gender_counts)

  author_gender  count
0          male    354
1        female    145


The sample is indeed smaller for female authors, but at 145 books it is still quite large, so sample size alone does not explain the gap. There are a few possible explanations. First, we could ask whether men are simply better authors than women, but there is no reason to believe this. Second, those who ranked these books may be mostly men who consciously or unconsciously favored male authors. Third, books written by male authors may cater more to male readers, which would matter if the list was compiled mostly by men. Regardless, bias very likely played a role here.
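To check whether a gap in mean rank between two groups could plausibly arise by chance, one option is a permutation test: shuffle the group labels many times and see how often the shuffled gap is at least as large as the observed one. This is a minimal sketch and not part of the original analysis. The function name `permutation_p_value`, the number of permutations, and the seed are assumptions, and the toy values are made up.

```python
import numpy as np
import pandas as pd


def permutation_p_value(frame, group_col, value_col, group_a, group_b,
                        n_permutations=2000, seed=0):
    """Two-sided permutation test for the difference in group means (a minus b)."""
    # Keep only the two groups being compared and rows with a value
    subset = frame[frame[group_col].isin([group_a, group_b])].dropna(subset=[value_col])
    values = subset[value_col].to_numpy(dtype=float)
    is_a = (subset[group_col] == group_a).to_numpy()

    observed = values[is_a].mean() - values[~is_a].mean()

    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(is_a)  # same group sizes, random membership
        diff = values[shuffled].mean() - values[~shuffled].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return observed, (hits + 1) / (n_permutations + 1)


# Toy frame with made-up ranks, only to show how the function is called
toy = pd.DataFrame({
    "author_gender": ["male"] * 5 + ["female"] * 5,
    "gr_avg_rating_rank": [1, 2, 3, 4, 5, 100, 101, 102, 103, 104],
})
diff, p_value = permutation_p_value(toy, "author_gender", "gr_avg_rating_rank", "male", "female")
print(diff, p_value)
```

On the real data the call would be `permutation_p_value(df, 'author_gender', 'gr_avg_rating_rank', 'male', 'female')`. A small p-value would only indicate that the gap is unlikely to be noise; it says nothing about which of the explanations above is responsible.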

Next, let's examine nationality.

In [24]:
nationality_chart = alt.Chart(nationality_rating_rank).mark_bar().encode(
    x='author_nationality',
    y='mean_rating_rank'
).properties(
    title='Mean Rating Rank by Nationality'
)
nationality_chart

In this plot, we see a distribution similar to the language plot. This makes sense, as nationality and language should be highly correlated. As before, we cannot draw firm conclusions about outliers such as Japanese or Nigerian authors, because the sample sizes for these nationalities are very small. We can, however, draw a conclusion about the distribution in general.

When we looked at occurrences, English outweighed other languages, and the US and GB outweighed other countries. When we examine the actual ratings, however, the distribution is far less skewed. We can therefore infer that English-language books and US and GB authors are not simply better; rather, there is bias in the selection and representation of different languages and nationalities. If the people who ranked these books had been presented with a larger sample, including books originally written in other languages by authors of other nationalities, we might see broader representation.

In reality, however, bias is a major factor in subjective ratings.