In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import requests
from io import StringIO
import altair as alt


url = "https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/top-500-novels/library_top_500.csv"


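# Download the CSV with TLS certificate verification left on (the requests default)
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the download failed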
csv_data = StringIO(response.text)
df = pd.read_csv(csv_data, sep=',', header=0, low_memory=False)
df



Unnamed: 0,top_500_rank,title,author,pub_year,orig_lang,genre,author_birth,author_death,author_gender,author_primary_lang,...,gr_num_ratings,gr_num_reviews,gr_avg_rating_rank,gr_num_ratings_rank,oclc_owi,author_viaf,gr_url,wiki_url,pg_eng_url,pg_orig_url
0,1,Don Quixote,Miguel de Cervantes,1605,Spanish,action,1547,1616,male,spa,...,269435,12053,318,211,1.810748e+09,17220427,https://www.goodreads.com/book/show/3836.Don_Q...,https://en.wikipedia.org/wiki/Don_Quixote,https://www.gutenberg.org/cache/epub/996/pg996...,https://www.gutenberg.org/cache/epub/2000/pg20...
1,2,Alice's Adventures in Wonderland,Lewis Carroll,1865,English,fantasy,1832,1898,male,eng,...,561016,15380,172,133,1.156132e+10,66462036,https://www.goodreads.com/book/show/24213.Alic...,https://en.wikipedia.org/wiki/Alice%27s_Advent...,https://www.gutenberg.org/cache/epub/11/pg11.txt,
2,3,The Adventures of Huckleberry Finn,Mark Twain,1884,English,action,1835,1910,male,eng,...,1262480,19440,373,68,3.373178e+09,50566653,https://www.goodreads.com/book/show/2956.The_A...,https://en.wikipedia.org/wiki/Adventures_of_Hu...,https://www.gutenberg.org/cache/epub/76/pg76.txt,
3,4,The Adventures of Tom Sawyer,Mark Twain,1876,English,action,1835,1910,male,eng,...,931898,13603,301,88,3.373178e+09,50566653,https://www.goodreads.com/book/show/24583.The_...,https://en.wikipedia.org/wiki/The_Adventures_o...,https://www.gutenberg.org/cache/epub/74/pg74.txt,
4,5,Treasure Island,Robert Louis Stevenson,1883,English,action,1850,1894,male,eng,...,486155,16307,368,145,3.434000e+03,95207986,https://www.goodreads.com/book/show/295.Treasu...,https://en.wikipedia.org/wiki/Treasure_Island,https://www.gutenberg.org/cache/epub/120/pg120...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,496,Stranger in a Strange Land,Robert A. Heinlein,1961,English,scifi,1907,1988,male,eng,...,311859,9961,310,190,7.894120e+05,12309757,,https://en.wikipedia.org/wiki/Stranger_in_a_St...,NA_not-pub-domain,
496,497,Vision in White,Nora Roberts,2009,English,romance,1965,ALIVE,female,eng,...,138445,4652,128,277,1.559638e+08,66448023,,https://en.wikipedia.org/wiki/Vision_in_White,NA_not-pub-domain,
497,498,The Whipping Boy,Sid Fleischman,1986,English,action,1920,2010,male,eng,...,27444,1623,476,445,4.415520e+08,66438084,,https://en.wikipedia.org/wiki/The_Whipping_Boy,NA_not-pub-domain,
498,499,Room,Emma Donoghue,2010,English,na,1969,ALIVE,female,eng,...,801989,50594,171,101,4.859780e+08,39539889,,https://en.wikipedia.org/wiki/Room_(novel),NA_not-pub-domain,


From this dataset, we will look at gender bias in the rankings. Specifically, we will compare the gender distribution of authors in the top 50 ranked books with that of the bottom 50. We will then create graphics that show the differences or similarities so we can draw a conclusion.

Now we will cut the dataset down to just the gender of the author and the ranking of the book.

In [2]:
df_selected = df[['author_gender', 'top_500_rank']]
df_selected

Unnamed: 0,author_gender,top_500_rank
0,male,1
1,male,2
2,male,3
3,male,4
4,male,5
...,...,...
495,male,496
496,female,497
497,male,498
498,female,499


Now we will keep only the top 50 and bottom 50 rankings.

In [3]:
top50 = df_selected[(df_selected['top_500_rank'] >= 1) & (df_selected['top_500_rank'] <= 50)].reset_index(drop=True)
bot50 = df_selected[(df_selected['top_500_rank'] >= 451) & (df_selected['top_500_rank'] <= 500)].reset_index(drop=True)
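The same selection can be written without hard-coding the rank boundaries 451 and 500. The sketch below is only an illustration: `toy`, `n`, `top_n`, and `bot_n` are made-up names, and the ten-row frame with alternating genders is invented data standing in for `df_selected`.

```python
import pandas as pd

# Toy stand-in for df_selected: 10 ranked rows with made-up genders.
toy = pd.DataFrame({
    "author_gender": ["male", "female"] * 5,
    "top_500_rank": list(range(1, 11)),
})

n = 3  # group size; the notebook uses 50

# The n best-ranked rows (smallest rank numbers).
top_n = toy.nsmallest(n, "top_500_rank").reset_index(drop=True)

# The n worst-ranked rows, re-sorted so ranks read in ascending order.
bot_n = (
    toy.nlargest(n, "top_500_rank")
    .sort_values("top_500_rank")
    .reset_index(drop=True)
)

print(top_n["top_500_rank"].tolist())
print(bot_n["top_500_rank"].tolist())
```

Written this way, the group size is set in one place and the bottom group no longer depends on the list having exactly 500 ranks.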

In [4]:
top50

Unnamed: 0,author_gender,top_500_rank
0,male,1
1,male,2
2,male,3
3,male,4
4,male,5
5,female,6
6,female,7
7,female,8
8,male,9
9,male,10


In [5]:
bot50

Unnamed: 0,author_gender,top_500_rank
0,female,451
1,female,452
2,male,453
3,male,454
4,male,455
5,male,456
6,male,457
7,male,458
8,male,459
9,female,460


Now I will count the male and female authors in the top 50, then do the same for the bottom 50.

In [6]:
top50_counts = top50['author_gender'].value_counts().reset_index()
top50_counts.columns = ['author_gender', 'count']

bot50_counts = bot50['author_gender'].value_counts().reset_index()
bot50_counts.columns = ['author_gender', 'count']

In [7]:
top50_counts

Unnamed: 0,author_gender,count
0,male,39
1,female,11


In [8]:
bot50_counts

Unnamed: 0,author_gender,count
0,male,37
1,female,13


From these counts, we can see a 39/11 male-to-female ratio among the authors of the top 50 books and a 37/13 male-to-female ratio among the authors of the bottom 50 books.
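To make the two groups easier to compare, the counts can be turned into a female share and a male-to-female ratio. This is a minimal sketch that uses only the counts reported in the two tables above; the helper name `summarize` is made up for illustration.

```python
# Counts copied from the two tables above.
counts = {
    "top 50": {"male": 39, "female": 11},
    "bottom 50": {"male": 37, "female": 13},
}


def summarize(group):
    """Return the female share and the male-to-female ratio for one group."""
    total = group["male"] + group["female"]
    return {
        "female_share": group["female"] / total,
        "male_to_female": group["male"] / group["female"],
    }


summary = {name: summarize(group) for name, group in counts.items()}

for name, stats in summary.items():
    print(
        f"{name}: {stats['female_share']:.0%} female, "
        f"{stats['male_to_female']:.2f} male authors per female author"
    )
```

With these counts, the female share is 22% in the top 50 and 26% in the bottom 50.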
Now we will make some graphs to look at it visually. 

In [9]:
bar_top50 = alt.Chart(top50_counts).mark_bar().encode(
    x=alt.X('author_gender:N', title='Gender'),
    y=alt.Y('count:Q', title='Count'),
    color='author_gender:N'
).properties(
    title='Gender Distribution in Top 50 Rankings'
)
bar_bot50 = alt.Chart(bot50_counts).mark_bar().encode(
    x=alt.X('author_gender:N', title='Gender'),
    y=alt.Y('count:Q', title='Count'),
    color='author_gender:N'
).properties(
    title='Gender Distribution in Bottom 50 Rankings'
)
pie_top50 = alt.Chart(top50_counts).mark_arc().encode(
    theta=alt.Theta(field='count', type='quantitative'),
    color=alt.Color(field='author_gender', type='nominal'),
    tooltip=['author_gender', 'count']
).properties(
    title='Gender in Top 50 Rankings'
)
pie_bot50 = alt.Chart(bot50_counts).mark_arc().encode(
    theta=alt.Theta(field='count', type='quantitative'),
    color=alt.Color(field='author_gender', type='nominal'),
    tooltip=['author_gender', 'count']
).properties(
    title='Gender in Bottom 50 Rankings'
)
(bar_top50 & bar_bot50) | (pie_top50 & pie_bot50)

We can see a clear disparity between the two genders in both the top and bottom rankings. The disparity is about the same in both groups, which points to one of two explanations. The first is that, beyond this top 500 list, there are simply more male authors than female authors, so the list's ratio would roughly mirror the ratio among authors overall. To me this is the less likely scenario. What I think it tells us is that there is a bias in how the top 500 were chosen. Across the two groups combined (76 male to 24 female), that bias comes to about 3.2 male authors for each female author. With more information on how the list was compiled, we could come to a more concrete conclusion.
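One way to check whether the top 50 and bottom 50 really differ from each other is Fisher's exact test on the 2x2 table of counts. This test is not part of the analysis above. It is a minimal sketch that uses only the standard library and the counts reported earlier, and the function name `fisher_exact_two_sided` is made up for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def weight(x):
        # Unnormalized hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(total - col1, row1 - x)

    lowest = max(0, row1 - (total - col1))
    highest = min(row1, col1)
    observed = weight(a)

    # Sum every possible table that is no more likely than the observed one.
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / comb(total, row1)


# Rows: top 50, bottom 50. Columns: male, female.
p_value = fisher_exact_two_sided(39, 11, 37, 13)
print(f"p = {p_value:.2f}")
```

For these counts the p-value comes out far above the usual 0.05 threshold, which matches the observation that the two groups look alike. It says nothing about why both groups are male-dominated, so it cannot by itself separate the two explanations given above.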