The Nobel Prize has been among the most prestigious international awards since 1901. Each year, awards are bestowed in chemistry, literature, physics, physiology or medicine, economics, and peace. In addition to the honor, prestige, and substantial prize money, the recipient also gets a gold medal with an image of Alfred Nobel (1833 - 1896), who established the prize.

The Nobel Foundation has made available a dataset of all prize winners from the first awards in 1901 through 2023. The dataset used in this project comes from the Nobel Prize API and is available in the `nobel.csv` file in the `data` folder.

In this project, you'll explore and answer several questions about this prizewinning data. We also encourage you to explore any further questions that interest you!

In [2]:
# Loading in required libraries
import pandas as pd
import seaborn as sns
import numpy as np

# Load the dataset
df = pd.read_csv('data/nobel.csv')

# Preview the first five rows
print(df.head())

   year    category                                           prize  \
0  1901   Chemistry               The Nobel Prize in Chemistry 1901   
1  1901  Literature              The Nobel Prize in Literature 1901   
2  1901    Medicine  The Nobel Prize in Physiology or Medicine 1901   
3  1901       Peace                      The Nobel Peace Prize 1901   
4  1901       Peace                      The Nobel Peace Prize 1901   

                                          motivation prize_share  laureate_id  \
0  "in recognition of the extraordinary services ...         1/1          160   
1  "in special recognition of his poetic composit...         1/1          569   
2  "for his work on serum therapy, especially its...         1/1          293   
3                                                NaN         1/2          462   
4                                                NaN         1/2          463   

  laureate_type                     full_name  birth_date         birth_city  \
0    I

In [3]:
# Inspect data overview
print(df.info())

# Summary statistics for numerical columns
print(df.describe().transpose())

# Summary statistics for categorical columns
print(df.describe(include='object').transpose())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   year                  1000 non-null   int64 
 1   category              1000 non-null   object
 2   prize                 1000 non-null   object
 3   motivation            912 non-null    object
 4   prize_share           1000 non-null   object
 5   laureate_id           1000 non-null   int64 
 6   laureate_type         1000 non-null   object
 7   full_name             1000 non-null   object
 8   birth_date            968 non-null    object
 9   birth_city            964 non-null    object
 10  birth_country         969 non-null    object
 11  sex                   970 non-null    object
 12  organization_name     736 non-null    object
 13  organization_city     735 non-null    object
 14  organization_country  735 non-null    object
 15  death_date            596 non-null    o

In [4]:
# Handle missing values of categorical columns

# Define the columns to replace missing values
columns_to_fill = ['motivation', 'birth_city', 'birth_country', 'sex', 
                   'organization_name', 'organization_city', 'organization_country',
                   'death_city', 'death_country']

# Replace missing values with 'Unknown' for each column
df[columns_to_fill] = df[columns_to_fill].fillna('Unknown')

# Verify that there are no more missing values
print(df[columns_to_fill].isna().sum())

motivation              0
birth_city              0
birth_country           0
sex                     0
organization_name       0
organization_city       0
organization_country    0
death_city              0
death_country           0
dtype: int64


### Question 1: 

What is the most commonly awarded gender and birth country?

* Store your answers as string variables top_gender and top_country.

In [5]:
# Count the awards by gender
count_by_gender = df['sex'].value_counts()

# Count the awards by birth_country
count_by_country = df['birth_country'].value_counts()

# Store the results as string variables
top_gender = count_by_gender.idxmax()
top_country = count_by_country.idxmax()

# Print the most commonly awarded gender and birth country
print("The most commonly awarded gender is:", top_gender)
print("The most commonly awarded birth country is:", top_country)

The most commonly awarded gender is: Male
The most commonly awarded birth country is: United States of America


Answer: **Male** is the most commonly awarded gender and **United States of America** is the most common birth country.
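
To see how dominant the top value is, and not only which value it is, `value_counts(normalize=True)` returns shares instead of raw counts. The sketch below runs on a small made-up frame (the rows are hypothetical, not taken from `nobel.csv`). It also drops the `'Unknown'` placeholder introduced earlier, on the assumption that a filled-in missing value should never be reported as the top answer.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not real Nobel data)
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Unknown', 'Male'],
    'birth_country': ['France', 'United States of America', 'France',
                      'Unknown', 'France'],
})

# Exclude the 'Unknown' placeholder before counting
known_sex = toy.loc[toy['sex'] != 'Unknown', 'sex']
known_country = toy.loc[toy['birth_country'] != 'Unknown', 'birth_country']

# normalize=True returns proportions that sum to 1
sex_share = known_sex.value_counts(normalize=True)
country_share = known_country.value_counts(normalize=True)

toy_top_gender = sex_share.idxmax()
toy_top_country = country_share.idxmax()

print(toy_top_gender, round(sex_share.max(), 2))
print(toy_top_country, round(country_share.max(), 2))
```

In the real data the placeholder is far from the most frequent value, so the answer above is unaffected; the filter only guards against that case.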

### Question 2: 

Which decade had the highest ratio of US-born Nobel Prize winners to total winners in all categories?

* Store this as an integer called max_decade_usa.

In [6]:
# Create 'us_born' column to check if birth_country is US
df['us_born'] = (df['birth_country'] == 'United States of America').astype(int)

# Create 'decade' column from the 'year' column
df['decade'] = (df['year'] // 10) * 10

# Group by decade and calculate the ratio of US-born winners to total winners
ratio_df = df.groupby('decade').agg(
    total_winners=('laureate_id', 'count'),  # Count total winners
    us_born_winners=('us_born', 'sum')    # Sum of US-born flags
)

# Calculate the ratio
ratio_df['us_to_total_ratio'] = ratio_df['us_born_winners'] / ratio_df['total_winners']

# Identify the decade with the highest ratio (cast to a plain Python int)
max_decade_usa = int(ratio_df['us_to_total_ratio'].idxmax())
highest_ratio_value = ratio_df['us_to_total_ratio'].max()

print(f'The decade with the highest ratio of US-born Nobel Prize winners to total winners is: {max_decade_usa}')
print(f'The highest ratio value is: {highest_ratio_value:.2f}')

The decade with the highest ratio of US-born Nobel Prize winners to total winners is: 2000
The highest ratio value is: 0.42


Answer: 2000 is the decade that had the highest ratio of US-born Nobel Prize winners to total winners in all categories.
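
The same ratio can be computed in one step, because the mean of a boolean column is the fraction of `True` values. The following is a minimal sketch on a hypothetical toy frame (the rows are invented for illustration), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not real Nobel data)
toy = pd.DataFrame({
    'year': [1901, 1905, 1912, 1915, 1918, 1923],
    'birth_country': ['Germany', 'United States of America',
                      'United States of America', 'United States of America',
                      'France', 'Sweden'],
})

# Floor each year to the start of its decade
toy['decade'] = (toy['year'] // 10) * 10

# Boolean flag; its per-group mean is the share of US-born winners
toy['us_born'] = toy['birth_country'] == 'United States of America'
us_share = toy.groupby('decade')['us_born'].mean()

# idxmax returns the index label (the decade); cast to a plain int
toy_max_decade = int(us_share.idxmax())
print(toy_max_decade)
```

Here the 1910s win with 2 of 3 US-born rows, against 1 of 2 in the 1900s and 0 of 1 in the 1920s.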

### Question 3:

Which decade and Nobel Prize category combination had the highest proportion of female laureates?

* Store this as a dictionary called max_female_dict where the decade is the key and the category is the value. There should only be one key:value pair.

In [9]:
# Create a flag for female winners
df['is_female'] = (df['sex'] == 'Female').astype(int)

# Create 'decade' column from the 'year' column (awarded year)
df['decade'] = (df['year'] // 10) * 10

In [10]:
# Group by 'decade' and 'category' to calculate total and female winners
grouped_df = df.groupby(['decade', 'category']).agg(
    total_winners=('laureate_id', 'count'),
    female_winners=('is_female', 'sum')
).reset_index()

grouped_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72 entries, 0 to 71
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   decade          72 non-null     int64 
 1   category        72 non-null     object
 2   total_winners   72 non-null     int64 
 3   female_winners  72 non-null     int32 
dtypes: int32(1), int64(2), object(1)
memory usage: 2.1+ KB


In [11]:
# Calculate the proportion of female winners
grouped_df['female_ratio'] = grouped_df['female_winners'] / grouped_df['total_winners']

In [12]:
# Find the decade and category with the highest proportion of female winners
max_female = grouped_df.loc[grouped_df['female_ratio'].idxmax()]

print(max_female)

decade                  2020
category          Literature
total_winners              4
female_winners             2
female_ratio             0.5
Name: 68, dtype: object


In [13]:
# Create the dictionary (cast the decade to a plain Python int for the key)
max_female_dict = {int(max_female['decade']): max_female['category']}

In [14]:
max_female_dict

{2020: 'Literature'}

Answer: The **2020** decade and the **Literature** category had the highest proportion of female laureates (2 of 4, or 0.5).
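
A proportion computed over a handful of laureates can top the ranking easily, so it helps to keep the group size next to the ratio. The sketch below uses a hypothetical toy frame (invented rows) and groups on both keys at once; `idxmax` on the resulting MultiIndex returns a `(decade, category)` tuple that unpacks straight into the dictionary.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not real Nobel data)
toy = pd.DataFrame({
    'decade':   [2000, 2000, 2000, 2010, 2010, 2010],
    'category': ['Peace', 'Peace', 'Physics', 'Peace', 'Physics', 'Physics'],
    'sex':      ['Female', 'Male', 'Male', 'Female', 'Male', 'Female'],
})
toy['is_female'] = toy['sex'] == 'Female'

# 'mean' gives the female proportion, 'size' the number of laureates
summary = toy.groupby(['decade', 'category'])['is_female'].agg(['mean', 'size'])

# The index label of the maximum is a (decade, category) tuple
top_decade, top_category = summary['mean'].idxmax()
toy_dict = {int(top_decade): top_category}

print(summary)
print(toy_dict)
```

In this toy data the winning group contains a single laureate, which is exactly the situation the `size` column makes visible.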

### Question 4: 

Who was the first woman to receive a Nobel Prize, and in what category?

Save your string answers as first_woman_name and first_woman_category.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   year                  1000 non-null   int64 
 1   category              1000 non-null   object
 2   prize                 1000 non-null   object
 3   motivation            1000 non-null   object
 4   prize_share           1000 non-null   object
 5   laureate_id           1000 non-null   int64 
 6   laureate_type         1000 non-null   object
 7   full_name             1000 non-null   object
 8   birth_date            968 non-null    object
 9   birth_city            1000 non-null   object
 10  birth_country         1000 non-null   object
 11  sex                   1000 non-null   object
 12  organization_name     1000 non-null   object
 13  organization_city     1000 non-null   object
 14  organization_country  1000 non-null   object
 15  death_date            596 non-null    o

In [20]:
female_df = df[df['is_female'] == 1]

female_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 65 entries, 19 to 999
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   year                  65 non-null     int64 
 1   category              65 non-null     object
 2   prize                 65 non-null     object
 3   motivation            65 non-null     object
 4   prize_share           65 non-null     object
 5   laureate_id           65 non-null     int64 
 6   laureate_type         65 non-null     object
 7   full_name             65 non-null     object
 8   birth_date            65 non-null     object
 9   birth_city            65 non-null     object
 10  birth_country         65 non-null     object
 11  sex                   65 non-null     object
 12  organization_name     65 non-null     object
 13  organization_city     65 non-null     object
 14  organization_country  65 non-null     object
 15  death_date            27 non-null     object


In [22]:
female_df['year'].sort_values(ascending=True)

19     1903
29     1905
51     1909
62     1911
128    1926
       ... 
982    2022
993    2023
998    2023
989    2023
999    2023
Name: year, Length: 65, dtype: int64

In [23]:
# Filter the female_df to find the earliest year
first_woman = female_df.loc[female_df['year'].idxmin()]

first_woman

year                                                                 1903
category                                                          Physics
prize                                     The Nobel Prize in Physics 1903
motivation              "in recognition of the extraordinary services ...
prize_share                                                           1/4
laureate_id                                                             6
laureate_type                                                  Individual
full_name                                     Marie Curie, née Sklodowska
birth_date                                                     1867-11-07
birth_city                                                         Warsaw
birth_country                                     Russian Empire (Poland)
sex                                                                Female
organization_name                                                 Unknown
organization_city                     

In [26]:
# Retrieve the name and category of the first female winner
first_woman_name = first_woman['full_name']
first_woman_category = first_woman['category']

print(first_woman_name)
print(first_woman_category)

Marie Curie, née Sklodowska
Physics


Answer: **Marie Curie, née Sklodowska** was the first woman to receive a Nobel Prize, in the **Physics** category.
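
`idxmin` returns only the first row that holds the minimum, so a tie in the earliest year would go unnoticed. A minimal sketch of a tie check follows, using a hypothetical toy frame with invented names; it assumes the same `year`, `full_name`, `category`, and `sex` columns as the dataset.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not real Nobel data)
toy = pd.DataFrame({
    'year': [1901, 1903, 1903, 1905],
    'full_name': ['A. Example', 'B. Example', 'C. Example', 'D. Example'],
    'category': ['Physics', 'Physics', 'Chemistry', 'Peace'],
    'sex': ['Male', 'Female', 'Female', 'Female'],
})

women = toy[toy['sex'] == 'Female']

# idxmin picks the first index label with the minimum year
first_by_idxmin = women.loc[women['year'].idxmin(), 'full_name']

# Comparing against the minimum keeps every row tied for the earliest year
earliest = women[women['year'] == women['year'].min()]
tied_names = earliest['full_name'].tolist()

print(first_by_idxmin)
print(tied_names)
```

When `tied_names` has a single entry, as it does for the real data above, the `idxmin` result is unambiguous.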

### Question 5:

Which individuals or organizations have won more than one Nobel Prize throughout the years?

Store the full names in a list named repeat_list.

In [27]:
# Count the occurrences of each laureate name
laureate_counts = df['full_name'].value_counts()

laureate_counts

full_name
Comité international de la Croix Rouge (International Committee of the Red Cross)    3
Linus Carl Pauling                                                                   2
John Bardeen                                                                         2
Frederick Sanger                                                                     2
Marie Curie, née Sklodowska                                                          2
                                                                                    ..
Karl Ziegler                                                                         1
Giulio Natta                                                                         1
Giorgos Seferis                                                                      1
Sir John Carew Eccles                                                                1
Claudia Goldin                                                                       1
Name: count, Length: 993, dtype: 

In [29]:
laureate_counts[laureate_counts >= 2]

full_name
Comité international de la Croix Rouge (International Committee of the Red Cross)    3
Linus Carl Pauling                                                                   2
John Bardeen                                                                         2
Frederick Sanger                                                                     2
Marie Curie, née Sklodowska                                                          2
Office of the United Nations High Commissioner for Refugees (UNHCR)                  2
Name: count, dtype: int64

In [32]:
# Store the full names as list
repeat_list = laureate_counts[laureate_counts >= 2].index.tolist()

# Print
for name in repeat_list:
    print(f"{name}")

Comité international de la Croix Rouge (International Committee of the Red Cross)
Linus Carl Pauling
John Bardeen
Frederick Sanger
Marie Curie, née Sklodowska
Office of the United Nations High Commissioner for Refugees (UNHCR)
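
Counting by `full_name` assumes that each name maps to exactly one laureate. Whether that holds for every row of `nobel.csv` is not checked here, so the sketch below shows a hedged alternative that counts by `laureate_id` instead. The toy frame is hypothetical and deliberately contains two different laureates who share a name, plus one laureate whose name is spelled two ways.

```python
import pandas as pd

# Hypothetical toy rows, for illustration only (not real Nobel data)
toy = pd.DataFrame({
    'laureate_id': [1, 1, 2, 3, 4],
    'full_name': ['Jane Example', 'Jane Example, née Sample',
                  'John Doe', 'John Doe', 'Org X'],
})

# Counting by name: flags the two distinct people named 'John Doe'
name_counts = toy['full_name'].value_counts()
repeats_by_name = name_counts[name_counts >= 2].index.tolist()

# Counting by id: flags only the laureate who appears twice
id_counts = toy['laureate_id'].value_counts()
repeat_ids = id_counts[id_counts >= 2].index

# Report one name per repeated id (the first spelling encountered)
toy_repeat_list = (
    toy[toy['laureate_id'].isin(repeat_ids)]
    .drop_duplicates(subset='laureate_id')['full_name']
    .tolist()
)

print(repeats_by_name)
print(toy_repeat_list)
```

If both approaches return the same set on the real data, the name-based list above can be used as is.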
