In this notebook, I plan to explore how the fantasy genre has been dominated by male authors, and how that is slowly changing over time. Let's begin!

In [48]:
import pandas as pd
import altair as alt
import matplotlib.pyplot as plt


In [49]:
novels_df = pd.read_csv('https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/top-500-novels/library_top_500.csv')
nyt_bestsellers_df = pd.read_csv("https://raw.githubusercontent.com/ecds/post45-datasets/main/nyt_full.tsv", sep='\t')
nyt_bestsellers_df = nyt_bestsellers_df.rename(columns={'title': 'nyt_title'})
nyt_bestsellers_df['title'] = nyt_bestsellers_df['nyt_title'].str.capitalize()
combined_novels_nyt_df = novels_df.merge(nyt_bestsellers_df, how='left', on=['author', 'title'])
combined_novels_nyt_df.columns


Index(['top_500_rank', 'title', 'author', 'pub_year', 'orig_lang', 'genre',
       'author_birth', 'author_death', 'author_gender', 'author_primary_lang',
       'author_nationality', 'author_field_of_activity', 'author_occupation',
       'oclc_holdings', 'oclc_eholdings', 'oclc_total_editions',
       'oclc_holdings_rank', 'oclc_editions_rank', 'gr_avg_rating',
       'gr_num_ratings', 'gr_num_reviews', 'gr_avg_rating_rank',
       'gr_num_ratings_rank', 'oclc_owi', 'author_viaf', 'gr_url', 'wiki_url',
       'pg_eng_url', 'pg_orig_url', 'year', 'week', 'rank', 'title_id',
       'nyt_title'],
      dtype='object')

After combining the Top 500 Greatest Novels dataset with the New York Times Best Sellers list, we have a merged table of 721 rows. Because this is a left merge, every Top 500 novel is kept, and the NYT columns are empty for novels with no matching NYT entry.
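To make the behavior of the left merge concrete, here is a minimal sketch on toy data. The rows and the `week` values are hypothetical stand-ins, not the real datasets; only the column names and the `str.capitalize()` step follow the cell above. The `indicator=True` option adds a `_merge` column showing which rows found a match.

```python
import pandas as pd

# Toy stand-ins for the two datasets (hypothetical rows, not the real data)
novels = pd.DataFrame({
    'title': ['Rebecca', 'Treasure Island', 'Don Quixote'],
    'author': ['Daphne Du Maurier', 'Robert Louis Stevenson', 'Miguel de Cervantes'],
})
nyt = pd.DataFrame({
    'nyt_title': ['REBECCA', 'REBECCA', 'TREASURE ISLAND'],
    'author': ['Daphne Du Maurier', 'Daphne Du Maurier', 'Robert Louis Stevenson'],
    'week': [1, 2, 1],  # made-up week numbers
})

# Same normalization as the notebook: only the first letter is upper-cased
nyt['title'] = nyt['nyt_title'].str.capitalize()

# indicator=True records whether each row matched ('both') or not ('left_only')
merged = novels.merge(nyt, how='left', on=['author', 'title'], indicator=True)

print(merged[['title', 'week', '_merge']])
print(merged['_merge'].value_counts())
```

In this sketch the single-word title matches twice, once per NYT row, while the multi-word title does not match at all, because `str.capitalize()` produces "Treasure island" rather than "Treasure Island". Whether the real data is affected the same way would need to be checked against the actual titles.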

In [50]:
print(combined_novels_nyt_df['author'].nunique())

279


In [51]:
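# Store publication years as integers
combined_novels_nyt_df['pub_year'] = combined_novels_nyt_df['pub_year'].astype(int)
# Note: astype(str) turns any missing gender into the literal string 'nan'
combined_novels_nyt_df['author_gender'] = combined_novels_nyt_df['author_gender'].astype(str)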

In [52]:
# Safeguard only: pub_year was already cast to int, so no missing years should remain
combined_novels_nyt_df = combined_novels_nyt_df.dropna(subset=['pub_year'])

valid_years = combined_novels_nyt_df[(combined_novels_nyt_df['pub_year'] >= 1800) & (combined_novels_nyt_df['pub_year'] <= pd.Timestamp.now().year)]

# Aggregate the data
# Group by year and gender, and count the rows in each group
# (rows, not unique authors: a book can appear on more than one row)
author_counts = valid_years.groupby(['pub_year', 'author_gender']).size().reset_index(name='count')

author_counts['pub_year'] = pd.to_datetime(author_counts['pub_year'], format='%Y')
# Create the Altair chart
chart = alt.Chart(author_counts).mark_line().encode(
    x=alt.X('pub_year:T', title='Year'),
    y=alt.Y('count:Q', title='Number of Authors'),
    color=alt.Color('author_gender:N', title='Gender')
).properties(
    title='Gender Disparity in Authors Over Time'
)

# Display the chart
chart

As you can see, this graph ended up being a little wonky. Let's retry with a scatterplot.

In [53]:
scatter_plot = alt.Chart(author_counts).mark_circle(size=60).encode(
    x=alt.X('pub_year:T', title='Year'),
    y=alt.Y('count:Q', title='Number of Authors'),
    color=alt.Color('author_gender:N', title='Gender'),
    tooltip=['pub_year:T', 'count:Q', 'author_gender:N']
).properties(
    title='Gender Disparity in Authors Over Time'
)

# Display the scatter plot
scatter_plot

This scatterplot still doesn't tell us much about the gender of popular authors. We can see some clear outlier years more recently, along with a couple of earlier exceptions such as 1938, which has a lot of female authors. Time to look a little deeper into 1938 to see what caused it to have so many female authors.

In [54]:
female_authors_1938 = combined_novels_nyt_df[(combined_novels_nyt_df['author_gender'] == 'female') & (combined_novels_nyt_df['pub_year'] == 1938)]
print(female_authors_1938)

     top_500_rank         title                    author  pub_year orig_lang  \
167           130       Rebecca         Daphne Du Maurier      1938   English   
168           130       Rebecca         Daphne Du Maurier      1938   English   
169           130       Rebecca         Daphne Du Maurier      1938   English   
170           130       Rebecca         Daphne Du Maurier      1938   English   
171           130       Rebecca         Daphne Du Maurier      1938   English   
172           130       Rebecca         Daphne Du Maurier      1938   English   
173           130       Rebecca         Daphne Du Maurier      1938   English   
174           130       Rebecca         Daphne Du Maurier      1938   English   
175           130       Rebecca         Daphne Du Maurier      1938   English   
176           130       Rebecca         Daphne Du Maurier      1938   English   
177           130       Rebecca         Daphne Du Maurier      1938   English   
178           130       Rebe

It seems this book was listed many times in this dataset, most likely once for every NYT list entry it matched in the merge. That is interesting to note, but it won't help us look at the gender of authors over time. More data cleaning must be done.
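The difference between counting rows and counting authors can be sketched on a few hypothetical rows. The author names below are placeholders, and `distinct_author_counts` is a name introduced only for this illustration. The idea is that `size()` counts every repeated row, while `nunique()` on the `author` column counts each author once per group.

```python
import pandas as pd

# Hypothetical merged rows: 'Author A' is repeated once per list entry
rows = pd.DataFrame({
    'author': ['Author A', 'Author A', 'Author A', 'Author B', 'Author C'],
    'pub_year': [1938, 1938, 1938, 1938, 1940],
    'author_gender': ['female', 'female', 'female', 'male', 'female'],
})

# Counting rows inflates 1938/female to 3
row_counts = (rows.groupby(['pub_year', 'author_gender'])
              .size()
              .reset_index(name='count'))

# Counting distinct authors gives 1 for every group
distinct_author_counts = (rows.groupby(['pub_year', 'author_gender'])['author']
                          .nunique()
                          .reset_index(name='count'))

print(row_counts)
print(distinct_author_counts)
```

This is one possible way to clean the counts; the notebook itself takes a different route and drops duplicate authors in the next cell.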

In [66]:
unique_authors_df = combined_novels_nyt_df.drop_duplicates(subset=['author'])
unique_authors_df

Unnamed: 0,top_500_rank,title,author,pub_year,orig_lang,genre,author_birth,author_death,author_gender,author_primary_lang,...,author_viaf,gr_url,wiki_url,pg_eng_url,pg_orig_url,year,week,rank,title_id,nyt_title
0,1,Don Quixote,Miguel de Cervantes,1605,Spanish,action,1547,1616,male,spa,...,17220427,https://www.goodreads.com/book/show/3836.Don_Q...,https://en.wikipedia.org/wiki/Don_Quixote,https://www.gutenberg.org/cache/epub/996/pg996...,https://www.gutenberg.org/cache/epub/2000/pg20...,,,,,
1,2,Alice's Adventures in Wonderland,Lewis Carroll,1865,English,fantasy,1832,1898,male,eng,...,66462036,https://www.goodreads.com/book/show/24213.Alic...,https://en.wikipedia.org/wiki/Alice%27s_Advent...,https://www.gutenberg.org/cache/epub/11/pg11.txt,,,,,,
2,3,The Adventures of Huckleberry Finn,Mark Twain,1884,English,action,1835,1910,male,eng,...,50566653,https://www.goodreads.com/book/show/2956.The_A...,https://en.wikipedia.org/wiki/Adventures_of_Hu...,https://www.gutenberg.org/cache/epub/76/pg76.txt,,,,,,
4,5,Treasure Island,Robert Louis Stevenson,1883,English,action,1850,1894,male,eng,...,95207986,https://www.goodreads.com/book/show/295.Treasu...,https://en.wikipedia.org/wiki/Treasure_Island,https://www.gutenberg.org/cache/epub/120/pg120...,,,,,,
5,6,Pride and Prejudice,Jane Austen,1813,English,romance,1775,1817,female,eng,...,102333412,https://www.goodreads.com/book/show/1885.Pride...,https://en.wikipedia.org/wiki/Pride_and_Prejudice,https://www.gutenberg.org/cache/epub/1342/pg13...,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
697,494,The Naked and the Dead,Norman Mailer,1948,English,history,1923,2007,male,eng,...,7393743,,https://en.wikipedia.org/wiki/The_Naked_and_th...,NA_not-pub-domain,,,,,,
699,496,Stranger in a Strange Land,Robert A. Heinlein,1961,English,scifi,1907,1988,male,eng,...,12309757,,https://en.wikipedia.org/wiki/Stranger_in_a_St...,NA_not-pub-domain,,,,,,
700,497,Vision in White,Nora Roberts,2009,English,romance,1965,ALIVE,female,eng,...,66448023,,https://en.wikipedia.org/wiki/Vision_in_White,NA_not-pub-domain,,,,,,
701,498,The Whipping Boy,Sid Fleischman,1986,English,action,1920,2010,male,eng,...,66438084,,https://en.wikipedia.org/wiki/The_Whipping_Boy,NA_not-pub-domain,,,,,,


We now have a dataset with all unique authors in the datasets, with a total of 279. Although this data is not as complete as looking at every book by every author, it will give us a cleaner graph of authors' genders based on when they wrote the book they were selected for. Let's make another scatterplot to compare and see how different it looks.

In [67]:
unique_authors_df['pub_year'] = unique_authors_df['pub_year'].astype(str)
unique_authors_df['pub_year'] = pd.to_datetime(unique_authors_df['pub_year'], format='%Y', errors='coerce')
scatter_plot = alt.Chart(unique_authors_df).mark_circle(size=60).encode(
    x=alt.X('pub_year:T', title='Year'),
    y=alt.Y('count()', title='Number of Authors'),
    color=alt.Color('author_gender:N', title='Gender'),
    tooltip=['pub_year:T', 'count()', 'author_gender:N']
).properties(
    title='Gender Disparity in Authors Over Time'
)

# Display the scatter plot
scatter_plot

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  unique_authors_df['pub_year'] = unique_authors_df['pub_year'].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  unique_authors_df['pub_year'] = pd.to_datetime(unique_authors_df['pub_year'], format='%Y', errors='coerce')
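The warning above appears because pandas cannot tell whether the result of `drop_duplicates` is an independent table or a view of the original. A common way to avoid it is to take an explicit `.copy()` before assigning new columns. The sketch below uses a small hypothetical table; whether the warning shows up at all depends on the installed pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the merged table
df = pd.DataFrame({
    'author': ['Author A', 'Author A', 'Author B'],
    'pub_year': [1938, 1938, 1950],
})

# .copy() makes the de-duplicated table independent of df,
# so later column assignments are unambiguous
unique_df = df.drop_duplicates(subset=['author']).copy()

# Convert the integer year to a datetime in a single step
unique_df['pub_year'] = pd.to_datetime(
    unique_df['pub_year'].astype(str), format='%Y', errors='coerce'
)

print(unique_df)
```

The original table is left untouched, and the assignment applies only to the copy.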


This scatterplot is a little more descriptive. We can see there are some years where 2 female authors made the dataset, but only 1 year (2009) where 3 female authors made it, and none with 4 or 5. Now that we can start to see the gender disparity between authors across every genre, it's time to focus on the fantasy genre.
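Because the per-year counts are so small (mostly 1 to 3), one option is to group years into decades before counting. This is not something the notebook does; it is a sketch of an alternative view. The six rows below reuse years and genders visible in the table printed earlier, and the `decade` column is a name introduced only for this illustration.

```python
import pandas as pd

# A few (year, gender) pairs taken from the unique-author table shown above
authors = pd.DataFrame({
    'pub_year': [1813, 1865, 1883, 1884, 1948, 2009],
    'author_gender': ['female', 'male', 'male', 'male', 'male', 'female'],
})

# Integer division drops the last digit of the year: 1884 -> 1880
authors['decade'] = (authors['pub_year'] // 10) * 10

# Count authors per decade and gender
decade_counts = (authors.groupby(['decade', 'author_gender'])
                 .size()
                 .reset_index(name='count'))

print(decade_counts)
```

The resulting table could be passed to the same Altair encoding used above, with `decade` on the x-axis in place of `pub_year`.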

In [68]:
fantasy_unique_authors_df = unique_authors_df[unique_authors_df['genre'] == 'fantasy']
fantasy_unique_authors_df

Unnamed: 0,top_500_rank,title,author,pub_year,orig_lang,genre,author_birth,author_death,author_gender,author_primary_lang,...,author_viaf,gr_url,wiki_url,pg_eng_url,pg_orig_url,year,week,rank,title_id,nyt_title
1,2,Alice's Adventures in Wonderland,Lewis Carroll,1865-01-01,English,fantasy,1832,1898,male,eng,...,66462036,https://www.goodreads.com/book/show/24213.Alic...,https://en.wikipedia.org/wiki/Alice%27s_Advent...,https://www.gutenberg.org/cache/epub/11/pg11.txt,,,,,,
10,11,Gulliver's Travels,Jonathan Swift,1726-01-01,English,fantasy,1667,1745,male,eng,...,14777110,https://www.goodreads.com/book/show/7733.Gulli...,https://en.wikipedia.org/wiki/Gulliver%27s_Tra...,https://www.gutenberg.org/cache/epub/829/pg829...,,,,,,
17,18,"The Hobbit, or, There and Back Again",J.R.R. Tolkien,1937-01-01,English,fantasy,1892,1973,male,eng,...,95218067,https://www.goodreads.com/book/show/437049.The...,https://en.wikipedia.org/wiki/The_Hobbit,NA_not-pub-domain,,,,,,
29,30,The Wizard of Oz,L. Frank Baum,1900-01-01,English,fantasy,1856,1919,male,eng,...,4926394,https://www.goodreads.com/book/show/236093.The...,https://en.wikipedia.org/wiki/The_Wonderful_Wi...,https://www.gutenberg.org/cache/epub/55/pg55.txt,,,,,,
38,39,The Wind in the Willows,Kenneth Grahame,1908-01-01,English,fantasy,1859,1932,male,eng,...,36919188,https://www.goodreads.com/book/show/5659.The_W...,https://en.wikipedia.org/wiki/The_Wind_in_the_...,https://www.gutenberg.org/cache/epub/289/pg289...,,,,,,
44,45,Harry Potter and the Sorcerer's Stone,J.K. Rowling,1997-01-01,English,fantasy,1965,ALIVE,female,eng,...,116796842,https://www.goodreads.com/book/show/42844155-h...,https://en.wikipedia.org/wiki/Harry_Potter_and...,NA_not-pub-domain,,,,,,
58,51,"The Lion, the Witch, and the Wardrobe",C.S. Lewis,1950-01-01,English,fantasy,1898,1963,male,eng,...,22144877,https://www.goodreads.com/book/show/100915.The...,"https://en.wikipedia.org/wiki/The_Lion,_the_Wi...",NA_not-pub-domain,,,,,,
66,59,Peter Pan,J.M. Barrie,1911-01-01,English,fantasy,1860,1937,male,eng,...,64001320,https://www.goodreads.com/book/show/34268.Pete...,https://en.wikipedia.org/wiki/Peter_and_Wendy,https://www.gutenberg.org/cache/epub/16/pg16.txt,,,,,,
104,97,Charlotte's Web,E.B. White,1952-01-01,English,fantasy,1899,1985,male,eng,...,66475004,https://www.goodreads.com/book/show/24178.Char...,https://en.wikipedia.org/wiki/Charlotte%27s_Web,NA_not-pub-domain,,,,,,
251,183,Bridge to Terabithia,Katherine Paterson,1977-01-01,English,fantasy,1932,ALIVE,female,eng,...,98108465,https://www.goodreads.com/book/show/40940121-b...,https://en.wikipedia.org/wiki/Bridge_to_Terabi...,NA_not-pub-domain,,,,,,


This narrows us down to 23 fantasy authors, mostly with books you probably recognize. You can eyeball the data and see it's mostly male dominated, but we can make one more graph to help us visualize it.

In [69]:
gender_counts = fantasy_unique_authors_df['author_gender'].value_counts().reset_index()
gender_counts.columns = ['author_gender', 'count']

# Create the Altair bar chart
bar_chart = alt.Chart(gender_counts).mark_bar().encode(
    x=alt.X('author_gender:N', title='Gender'),
    y=alt.Y('count:Q', title='Number of Authors'),
    color=alt.Color('author_gender:N', title='Gender'),
    tooltip=['author_gender:N', 'count:Q']
).properties(
    title='Number of Unique Authors by Gender'
)

# Display the bar chart
bar_chart


The bar chart shows 16 males to 7 females, since on further inspection the one NaN is T.H. White, a male English author. That is about 70% male (16 of 23). Although not an overwhelming majority, it is clear that fantasy is a mostly male-dominated field for authors. This is especially true in the past, as most of the women on the fantasy list released their book within the last 50 years, with the exception of The Indian in the Cupboard by Lynne Reid Banks.
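The counts above can be checked with a short sketch. The table below is a hypothetical stand-in that only mirrors the numbers in the text (15 rows labeled male, 7 female, and 1 with the string 'nan'); the placeholder author names and the exact spelling 'T.H. White' are assumptions, not values read from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for the fantasy author table (23 rows)
fantasy = pd.DataFrame({
    'author': ([f'Male author {i}' for i in range(15)]
               + [f'Female author {i}' for i in range(7)]
               + ['T.H. White']),
    # 'nan' is a string because the column was cast with astype(str)
    'author_gender': ['male'] * 15 + ['female'] * 7 + ['nan'],
})

# Fill in the one missing gender by hand, as described in the text
fantasy.loc[fantasy['author'] == 'T.H. White', 'author_gender'] = 'male'

counts = fantasy['author_gender'].value_counts()
shares = fantasy['author_gender'].value_counts(normalize=True).round(3)

print(counts)
print(shares)
```

With the missing value filled in, the split is 16 to 7, which is where the roughly 70% male share comes from.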