# Exploring and Visualizing Cultural Bias in Top Ranked Novels

## Introduction
This notebook performs an exploratory data analysis (EDA) on two datasets: **Top 500 Greatest Novels** and **New York Times Hardcover Fiction Bestsellers**. The goal is to uncover patterns, gaps, and biases, focusing on author demographics, genre representation, and historical popularity.


In [1]:
import pandas as pd
import altair as alt

# Load the datasets
top500_df = pd.read_csv("https://raw.githubusercontent.com/melaniewalsh/responsible-datasets-in-context/main/datasets/top-500-novels/library_top_500.csv")
nyt_df = pd.read_csv("https://raw.githubusercontent.com/ecds/post45-datasets/main/nyt_full.tsv", sep="\t")

## Initial Data Overview
We start by inspecting the structure and basic information about each dataset, including missing values and data types.


In [2]:
# Display basic info about the datasets
print("Top 500 Novels Dataset Info:")
top500_df.info()
print("\nNYT Bestsellers Dataset Info:")
nyt_df.info()


### Handling Missing Data
Next, we examine missing data in the Top 500 Novels dataset and consider how it may affect the analysis. The `isna()` method flags missing values, and summing the result gives the number of missing entries in each column.


In [3]:
# Count missing values in every column
missing_data = top500_df.isna().sum()
missing_data
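
Raw counts are easier to judge relative to the size of the dataset. The sketch below shows one way to express missingness as a percentage per column. It uses a small toy DataFrame (`toy_df`, with made-up values) as a stand-in for `top500_df`, so the column names and numbers are illustrative assumptions rather than results from the real data.

```python
import pandas as pd

# Toy stand-in for top500_df; columns and values are illustrative only
toy_df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Fiction", None, "Fantasy", None],
    "author_gender": ["female", "male", None, "male"],
})

# isna().mean() gives the fraction of missing values per column;
# multiply by 100 to express it as a percentage
missing_share = (
    toy_df.isna()
    .mean()
    .mul(100)
    .round(1)
    .sort_values(ascending=False)
)
print(missing_share)
```

Applied to the real data, the same expression with `top500_df` in place of `toy_df` would rank columns by how incomplete they are.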

### Merging the Datasets
We want to see if there are any common titles or authors between the datasets to understand the overlap. For this, we will use `merge()` and check the shared information between the two datasets.


In [4]:
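# Merge on case-insensitive title/author keys. str.capitalize() lowercases everything
# after the first letter ("The great gatsby"), so it rarely matches title-cased titles.
# The NYT file may list a title once per weekly appearance, so deduplicate it first.
nyt_unique = nyt_df[['title', 'author']].dropna().drop_duplicates().copy()
for df in (top500_df, nyt_unique):
    df['title_key'] = df['title'].str.lower().str.strip()
    df['author_key'] = df['author'].str.lower().str.strip()
merged_df = top500_df.merge(nyt_unique, how='inner', on=['title_key', 'author_key'], suffixes=('', '_nyt'))
print(f"Number of shared titles: {len(merged_df)}")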
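
An inner merge reports only the matches. To see how many titles fall on each side of the overlap, one option is an outer merge with `indicator=True`, which adds a `_merge` column labelling each row as `left_only`, `right_only`, or `both`. The sketch below uses two toy DataFrames (`top_toy`, `nyt_toy`) and a hypothetical helper `add_keys`; none of these names come from the original notebook.

```python
import pandas as pd

# Toy stand-ins for the two datasets; values are illustrative only
top_toy = pd.DataFrame({
    "title": ["Beloved", "Ulysses"],
    "author": ["Toni Morrison", "James Joyce"],
})
nyt_toy = pd.DataFrame({
    "title": ["BELOVED", "BELOVED", "THE FIRM"],  # repeated weekly appearances
    "author": ["Toni Morrison", "Toni Morrison", "John Grisham"],
})

def add_keys(df):
    """Hypothetical helper: add lowercased, stripped matching keys."""
    out = df.copy()
    out["title_key"] = out["title"].str.lower().str.strip()
    out["author_key"] = out["author"].str.lower().str.strip()
    return out

top_keyed = add_keys(top_toy)
# Keep one row per title/author pair so repeated weeks are not double counted
nyt_keyed = add_keys(nyt_toy).drop_duplicates(subset=["title_key", "author_key"])

overlap = top_keyed.merge(
    nyt_keyed,
    how="outer",
    on=["title_key", "author_key"],
    suffixes=("_top500", "_nyt"),
    indicator=True,
)
counts = overlap["_merge"].value_counts()
print(counts)
```

Rows marked `left_only` or `right_only` are also a convenient place to look for near misses, such as subtitle or spelling differences that prevent a match.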

## Exploratory Data Analysis (EDA)
The next step is to perform an EDA to identify any noticeable patterns or biases. We start by understanding the genre distribution and author demographics in the Top 500 dataset.


In [5]:
# Plot the distribution of genres
genre_counts = top500_df['genre'].value_counts().reset_index()
genre_counts.columns = ['Genre', 'Count']

genre_chart = alt.Chart(genre_counts).mark_bar().encode(
    x='Count:Q',
    y=alt.Y('Genre:N', sort='-x'),
    color='Genre:N'
).properties(
    title="Genre Distribution in Top 500 Novels"
)
genre_chart

### Author Demographics
The chart below shows the distribution of author genders in the dataset, helping us understand whether there are any biases in the gender representation of top-ranked novels.


In [6]:
# Count author genders
gender_counts = top500_df['author_gender'].value_counts().reset_index()
gender_counts.columns = ['Gender', 'Count']

gender_chart = alt.Chart(gender_counts).mark_bar().encode(
    x='Count:Q',
    y=alt.Y('Gender:N', sort='-x'),
    color='Gender:N'
).properties(
    title="Author Gender Distribution in Top 500 Novels"
)
gender_chart
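
Bar heights show counts, but a claim about representation is usually clearer as a share of the whole, with missing values kept visible rather than silently dropped. The following is a minimal sketch on a toy Series (`genders`, with made-up values); the label `"unknown"` for missing entries is an assumption chosen for illustration.

```python
import pandas as pd

# Toy stand-in for top500_df['author_gender']; values are illustrative only
genders = pd.Series(["male", "male", "female", None, "male"], name="author_gender")

# Label missing entries explicitly so they count toward the total,
# then convert counts to percentages
gender_share = (
    genders.fillna("unknown")
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(gender_share)
```

Keeping the `"unknown"` category makes it visible when a gap in the metadata, rather than the ranking itself, could be driving an apparent imbalance.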

### Genre Trends Over Time
We analyze how genre popularity has evolved over the years by creating a line chart of genre frequency over time.


In [7]:
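# Use the numeric publication year directly. Converting to datetime with
# errors='coerce' silently turns years before 1677 (outside the pandas
# Timestamp range) into NaT, which would drop older novels from the chart.
top500_df['pub_year'] = pd.to_numeric(top500_df['pub_year'], errors='coerce')

# Count novels per publication year and genre
genre_over_time = top500_df.groupby(['pub_year', 'genre']).size().reset_index(name='Count')

trend_chart = alt.Chart(genre_over_time).mark_line().encode(
    x=alt.X('pub_year:Q', title='Publication year', axis=alt.Axis(format='d'), scale=alt.Scale(zero=False)),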
    y='Count:Q',
    color='genre:N'
).properties(
    title="Genre Trends Over Time in Top 500 Novels"
)
trend_chart
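
With only a few hundred novels spread over several centuries, most year and genre combinations contain a single book, so a per-year line can look noisy. One possible refinement is to group by decade before counting. The sketch below illustrates this on a toy DataFrame (`toy_years`, with made-up rows); the `decade` column is an assumption introduced here, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for top500_df; years and genres are illustrative only
toy_years = pd.DataFrame({
    "pub_year": [1605, 1813, 1818, 1925, 1926],
    "genre": ["Fiction", "Fiction", "Horror", "Fiction", "Fiction"],
})

# Integer division floors each year to the start of its decade (1925 -> 1920)
toy_years["decade"] = (toy_years["pub_year"] // 10) * 10

# Count novels per decade and genre
by_decade = (
    toy_years.groupby(["decade", "genre"])
    .size()
    .reset_index(name="Count")
)
print(by_decade)
```

The resulting table has the same shape as `genre_over_time`, so it could be passed to the same Altair chart with `decade` on the x-axis.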

## Conclusion
This analysis points to potential biases in author gender and genre representation among the top-ranked novels. Merging the two datasets shows which works appear in both the Top 500 Greatest Novels and the NYT Bestsellers lists. Because the merge requires a match on both title and author, it measures shared titles; overlap at the author level would need a separate comparison on the author column alone. These findings can inform future research in literature on which genres and authors receive more visibility, and why.