# Setup and Context

### Introduction

On November 27, 1895, Alfred Nobel signed his last will in Paris. When it was opened after his death, the will caused a lot of controversy, as Nobel had left much of his wealth for the establishment of a prize.

Nobel dictated that his entire remaining estate be used to endow “prizes to those who, during the preceding year, have conferred the greatest benefit to humankind”.

Every year the Nobel Prize is awarded to scientists and scholars in the categories of chemistry, literature, physics, physiology or medicine, economics, and peace.

<img src=https://i.imgur.com/36pCx5Q.jpg>

The raw data is not suitable for building a RAG (retrieval-augmented generation) application as it stands, so it has to be preprocessed first. This notebook performs that data cleaning.

### Import Statements

In [62]:
import pandas as pd

### Notebook Presentation

In [63]:
pd.options.display.float_format = '{:,.2f}'.format

### Read the Data

In [64]:
df_data = pd.read_csv('/content/nobel_prize_data.csv')

# Data Cleaning

* What is the shape of `df_data`? How many rows and columns?
* What are the column names?

In [65]:
df_data.shape

(1001, 16)

In [66]:
df_data.head()

Unnamed: 0,year,category,prize,motivation,prize_share,laureate_type,full_name,birth_date,birth_city,birth_country,birth_country_current,sex,organization_name,organization_city,organization_country,ISO
0,1901,Chemistry,The Nobel Prize in Chemistry 1901,"""in recognition of the extraordinary services ...",1-Jan,Individual,Jacobus Henricus van 't Hoff,1852-08-30,Rotterdam,Netherlands,Netherlands,Male,Berlin University,Berlin,Germany,NLD
1,1901,Literature,The Nobel Prize in Literature 1901,"""in special recognition of his poetic composit...",1-Jan,Individual,Sully Prudhomme,1839-03-16,Paris,France,France,Male,,,,FRA
2,1901,Medicine,The Nobel Prize in Physiology or Medicine 1901,"""for his work on serum therapy, especially its...",1-Jan,Individual,Emil Adolf von Behring,1854-03-15,Hansdorf (Lawice),Prussia (Poland),Poland,Male,Marburg University,Marburg,Germany,POL
3,1901,Peace,The Nobel Peace Prize 1901,,2-Jan,Individual,Frédéric Passy,1822-05-20,Paris,France,France,Male,,,,FRA
4,1901,Peace,The Nobel Peace Prize 1901,,2-Jan,Individual,Jean Henry Dunant,1828-05-08,Geneva,Switzerland,Switzerland,Male,,,,CHE


In [67]:
df_data.tail()

Unnamed: 0,year,category,prize,motivation,prize_share,laureate_type,full_name,birth_date,birth_city,birth_country,birth_country_current,sex,organization_name,organization_city,organization_country,ISO
996,1929,Peace,The Nobel Peace Prize 1929,,1-Jan,Individual,Frank Billings Kellogg,1856-12-22,"Potsdam, NY",United States of America,United States of America,Male,,,,USA
997,1929,Physics,The Nobel Prize in Physics 1929,"""for his discovery of the wave nature of elect...",1-Jan,Individual,Prince Louis-Victor Pierre Raymond de Broglie,1892-08-15,Dieppe,France,France,Male,Sorbonne University,Paris,France,FRA
998,1930,Chemistry,The Nobel Prize in Chemistry 1930,"""for his researches into the constitution of h...",1-Jan,Individual,Hans Fischer,1881-07-27,Hoechst,Germany,Germany,Male,Technische Hochschule (Institute of Technology),Munich,Germany,DEU
999,1930,Literature,The Nobel Prize in Literature 1930,"""for his vigorous and graphic art of descripti...",1-Jan,Individual,Sinclair Lewis,1885-02-07,"Sauk Centre, MN",United States of America,United States of America,Male,,,,USA
1000,1930,Medicine,The Nobel Prize in Physiology or Medicine 1930,"""for his discovery of human blood groups""",1-Jan,Individual,Karl Landsteiner,1868-06-14,Vienna,Austrian Empire (Austria),Austria,Male,Rockefeller Institute for Medical Research,"New York, NY",United States of America,AUT


In [68]:
df_data.columns

Index(['year', 'category', 'prize', 'motivation', 'prize_share',
       'laureate_type', 'full_name', 'birth_date', 'birth_city',
       'birth_country', 'birth_country_current', 'sex', 'organization_name',
       'organization_city', 'organization_country', 'ISO'],
      dtype='object')

### Drop Useless Columns

In [69]:
df_data.drop(columns=['ISO', 'prize_share'], inplace=True)

### Check for Duplicates

In [70]:
df_data.duplicated().sum()

np.int64(39)
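Before dropping the 39 duplicates it can help to look at them. The sketch below uses a small made-up frame (`toy`, with illustrative values, not the Nobel data) to show how `duplicated(keep=False)` lists every member of a duplicate group, while the default `duplicated()` counts only the extra copies that `drop_duplicates()` will remove. On the real data that count is 39, which is consistent with the row count falling from 1001 to 962.

```python
import pandas as pd

# Toy stand-in for df_data; the values are illustrative only.
toy = pd.DataFrame({
    "year": [1901, 1901, 1902, 1902],
    "full_name": ["A", "A", "B", "C"],
    "category": ["Peace", "Peace", "Physics", "Physics"],
})

# keep=False marks every member of a duplicate group, not just the repeats.
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# The default keep="first" flags only the extra copies, which is exactly
# the number of rows drop_duplicates() removes.
n_removed = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()
print(len(toy), "-", n_removed, "=", len(deduped))
```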

### Remove Duplicates

In [71]:
df_data.drop_duplicates(inplace=True)

### Check That No Duplicates Remain

In [72]:
df_data.duplicated().values.any()

np.False_

### Check for NaN Values

In [73]:
df_data.isna().values.any()

np.True_

In [74]:
df_data.isna().sum()

Unnamed: 0,0
year,0
category,0
prize,0
motivation,88
laureate_type,0
full_name,0
birth_date,28
birth_city,31
birth_country,28
birth_country_current,28
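Raw counts are easier to judge next to the share of rows they represent. As a minimal sketch on a made-up frame (`toy` and its values are assumptions, not the Nobel data), the missing counts and percentages can be combined into one small summary table that lists only the affected columns:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_data; the values are illustrative only.
toy = pd.DataFrame({
    "full_name": ["A", "B", "C", "D"],
    "motivation": ["x", np.nan, np.nan, "y"],
    "birth_city": ["Paris", "Geneva", np.nan, "Vienna"],
})

# isna().sum() counts missing cells; isna().mean() gives the fraction.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "percent": (toy.isna().mean() * 100).round(1),
})

# Keep only columns that actually have gaps, worst first.
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```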


### Fill Missing Values and Delete Critical Row

In [75]:
# Columns where a missing value is replaced by an empty string
sub_column = [
    "motivation", "birth_city", "birth_country",
    "birth_country_current", "organization_name", "birth_date",
    "organization_city", "organization_country", "sex"
]

df_data[sub_column] = df_data[sub_column].fillna("")


In [76]:
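# Rows with no recorded sex now hold "", so label them explicitly
df_data["sex"] = df_data["sex"].replace({"": "unknown"})
# laureate_type has no missing values, so this is only a safeguard;
# the fallback uses the same capitalization as the existing values
df_data["laureate_type"] = df_data["laureate_type"].replace("", "Individual")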
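The two steps above (fill with an empty string, then relabel the empty `sex` values) could also be written as a single `fillna` call with a per-column dictionary of defaults. This is an alternative sketch, not the method used in this notebook; `toy` and `defaults` are hypothetical names and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_data; the values are illustrative only.
toy = pd.DataFrame({
    "full_name": ["Person A", "Some Organization"],
    "sex": ["Male", np.nan],
    "birth_city": ["Paris", np.nan],
    "motivation": [np.nan, "for ..."],
})

# One default per column: free-text fields become "", sex becomes "unknown".
defaults = {"sex": "unknown", "birth_city": "", "motivation": ""}
filled = toy.fillna(value=defaults)
print(filled)
```

Columns that are not listed in the dictionary are left untouched, so the defaults stay explicit and easy to review.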


### Check for NaN Values After Filling

In [77]:
df_data.isna().sum()

Unnamed: 0,0
year,0
category,0
prize,0
motivation,0
laureate_type,0
full_name,0
birth_date,0
birth_city,0
birth_country,0
birth_country_current,0


### Type Conversions

#### Convert Year to int and Birth Date to Datetime

In [78]:
df_data["year"] = df_data["year"].astype(int)

In [79]:
# Empty strings (the filled-in missing dates) are parsed as NaT
df_data["birth_date"] = pd.to_datetime(df_data["birth_date"], format="mixed")

In [80]:
df_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 962 entries, 0 to 961
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   year                   962 non-null    int64         
 1   category               962 non-null    object        
 2   prize                  962 non-null    object        
 3   motivation             962 non-null    object        
 4   laureate_type          962 non-null    object        
 5   full_name              962 non-null    object        
 6   birth_date             934 non-null    datetime64[ns]
 7   birth_city             962 non-null    object        
 8   birth_country          962 non-null    object        
 9   birth_country_current  962 non-null    object        
 10  sex                    962 non-null    object        
 11  organization_name      962 non-null    object        
 12  organization_city      962 non-null    object        
 13  organizati
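Note that `birth_date` shows 934 non-null values out of 962: the 28 dates that were filled with `""` earlier come back as missing (`NaT`) once the column is converted to datetime. The sketch below reproduces this on a made-up series (the name `birth_date` and its values are illustrative), using `errors="coerce"` so that any unparseable entry also becomes `NaT` instead of raising.

```python
import pandas as pd

# Illustrative values: one entry was "filled" with an empty string.
birth_date = pd.Series(["1852-08-30", "", "1839-03-16"])

# Empty strings cannot be parsed as dates and become NaT.
parsed = pd.to_datetime(birth_date, errors="coerce")
print(parsed)

# A boolean mask keeps later date arithmetic restricted to known dates.
known = parsed.notna()
print(parsed[known].dt.year.tolist())
```

If downstream code needs a column with no missing values at all, the dates would have to stay as strings or be given an explicit placeholder, which is a design choice this notebook does not make.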

### Shape of the DataFrame After Cleaning

In [81]:
df_data.shape

(962, 14)

### Save the Cleaned File for the RAG Implementation

In [82]:
df_data.to_csv('cleaned_nobel_prize_data.csv', index=False)
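One thing to watch when the saved file is loaded again: by default `pd.read_csv` treats empty fields as `NaN`, so the empty strings written during cleaning would reappear as missing values. A minimal sketch of the effect, using an in-memory buffer and a made-up frame instead of the real file (`toy`, `buffer`, `default_read`, and `preserved` are hypothetical names):

```python
import io

import pandas as pd

# Toy stand-in for the cleaned data; the values are illustrative only.
toy = pd.DataFrame({
    "full_name": ["Person A", "Some Organization"],
    "birth_city": ["Paris", ""],
    "sex": ["Male", "unknown"],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: the empty field is read back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read["birth_city"].isna().sum())

# keep_default_na=False keeps empty fields as empty strings.
buffer.seek(0)
preserved = pd.read_csv(buffer, keep_default_na=False)
print(repr(preserved.loc[1, "birth_city"]))
```

Whichever step loads `cleaned_nobel_prize_data.csv` for the RAG pipeline may therefore want `keep_default_na=False`, or a second `fillna("")` after reading.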