# Overview
###### The Nobel Prize is perhaps the world's best-known scientific award. Every year it is awarded in chemistry, literature, physics, medicine, economics, and peace. The first Nobel Prizes were handed out in 1901, and at that time the prize was Eurocentric and male-focused. Nowadays it's not biased in any way, surely? Well, we’re going to find out! The Nobel Foundation has made available a dataset of all prize winners from the start of the prize, in 1901, to 2016. Let’s load it in and check it out.

## Imports

In [2]:
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Reading Data

In [3]:

dff = pd.read_csv('archive.csv')
df = dff.copy()
df.sample(5)

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
446,1973,Economics,The Sveriges Riksbank Prize in Economic Scienc...,"""for the development of the input-output metho...",1/1,683,Individual,Wassily Leontief,1906-08-05,St. Petersburg,Russia,Male,Harvard University,"Cambridge, MA",United States of America,1999-02-05,"New York, NY",United States of America
278,1952,Peace,The Nobel Peace Prize 1952,,1/1,513,Individual,Albert Schweitzer,1875-01-14,Kaysersberg,Germany (France),Male,,,,1965-09-04,Lambaréné,Gabon
606,1988,Literature,The Nobel Prize in Literature 1988,"""who, through works rich in nuance - now clear...",1/1,665,Individual,Naguib Mahfouz,1911-12-11,Cairo,Egypt,Male,,,,2006-08-30,Cairo,Egypt
675,1995,Chemistry,The Nobel Prize in Chemistry 1995,"""for their work in atmospheric chemistry, part...",1/3,283,Individual,F. Sherwood Rowland,1927-06-28,"Delaware, OH",United States of America,Male,University of California,"Irvine, CA",United States of America,2012-03-10,"Corona del Mar, CA",United States of America
782,2003,Physics,The Nobel Prize in Physics 2003,"""for pioneering contributions to the theory of...",1/3,766,Individual,Alexei A. Abrikosov,1928-06-25,Moscow,Union of Soviet Socialist Republics (Russia),Male,Argonne National Laboratory,"Argonne, IL",United States of America,,,


## Data Exploration and Cleaning

### How many Nobel Prize entries are there between 1901 and 2016?

In [4]:
print("Shape\n%s Rows \n%s Columns " % (df.shape[0],df.shape[1]))

Shape
969 Rows 
18 Columns 


### How many entries belong to male and to female laureates?

In [5]:
df1 = df.groupby(['Sex']).size()
print (df1)

Sex
Female     50
Male      893
dtype: int64


### Birth countries of Nobel Prize winners

In [6]:
df2 =  df.groupby("Birth Country").agg('size')

print(df2.sort_values( ascending=False))


Birth Country
United States of America                  276
United Kingdom                             88
Germany                                    70
France                                     53
Sweden                                     30
                                         ... 
Ottoman Empire (Republic of Macedonia)      1
Ottoman Empire (Turkey)                     1
Pakistan                                    1
Persia (Iran)                               1
Java, Dutch East Indies (Indonesia)         1
Length: 121, dtype: int64


### From the previous group-by we can conclude that the USA dominates the birth-country counts, with 276 entries for laureates born in the USA

### Count of female laureates per year (a first step toward the proportion per decade)

In [17]:
df3=df[df["Sex"] == "Female"].groupby("Year").size()
print(df3)

Year
1903    1
1905    1
1909    1
1911    1
1926    1
1928    1
1931    1
1935    1
1938    1
1945    1
1946    1
1947    1
1963    1
1964    1
1966    1
1976    2
1977    1
1979    1
1982    1
1983    1
1986    1
1988    1
1991    2
1992    1
1993    1
1995    1
1996    1
1997    1
2003    1
2004    3
2007    1
2008    1
2009    6
2011    3
2013    1
2014    2
2015    2
dtype: int64
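
The counts above are per year and are not yet proportions. One possible way to get the share of female laureates per decade is sketched below. It uses a small made-up table (`toy`) rather than the real dataset, and the `Decade` column is a name introduced only for this illustration.

```python
import pandas as pd

# Made-up stand-in for the laureate table; the notebook itself would use `df`
toy = pd.DataFrame({
    "Year": [1903, 1905, 1909, 1911, 1911, 2009, 2009, 2009, 2014, 2015],
    "Sex": ["Female", "Male", "Female", "Female", "Male",
            "Female", "Female", "Male", "Male", "Female"],
})

# Floor each year to the start of its decade, e.g. 1903 -> 1900
toy["Decade"] = (toy["Year"] // 10) * 10

# The mean of a boolean column is the share of True values in each group
female_share = (toy["Sex"] == "Female").groupby(toy["Decade"]).mean()
print(female_share)
```

Unlike a filter on `Sex == "Female"` alone, this keeps decades with no female laureates in the result, with a share of 0.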


### Count of female laureates per category

In [19]:
df4=df[df["Sex"] == "Female"].groupby("Category").size()
print(df4)

Category
Chemistry      4
Economics      2
Literature    14
Medicine      12
Peace         16
Physics        2
dtype: int64
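
To turn these counts into proportions, each count has to be divided by the total number of laureates in that category. A minimal sketch with `pd.crosstab` and row-wise normalisation follows. The table `toy` is invented for the example and is not taken from the dataset.

```python
import pandas as pd

# Invented example rows; the notebook itself would pass df["Category"] and df["Sex"]
toy = pd.DataFrame({
    "Category": ["Peace", "Peace", "Peace", "Peace",
                 "Physics", "Physics", "Physics", "Physics"],
    "Sex": ["Female", "Male", "Female", "Male",
            "Male", "Male", "Male", "Female"],
})

# normalize="index" divides each row by its row total,
# so every category's shares sum to 1
shares = pd.crosstab(toy["Category"], toy["Sex"], normalize="index")
print(shares["Female"])
```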


### The first woman to win the Nobel Prize


In [33]:
# nsmallest already returns the row with the earliest year, so no separate sort is needed
first_woman = df[df["Sex"] == "Female"].nsmallest(1, "Year")
first_woman

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
19,1903,Physics,The Nobel Prize in Physics 1903,"""in recognition of the extraordinary services ...",1/4,6,Individual,"Marie Curie, née Sklodowska",1867-11-07,Warsaw,Russian Empire (Poland),Female,,,,1934-07-04,Sallanches,France


### Who won the Nobel Prize more than once?

In [39]:
df5=  df.groupby("Full Name").agg('size')
print(df5.sort_values( ascending=False))


Full Name
Jack W. Szostak                                                                      3
Comité international de la Croix Rouge (International Committee of the Red Cross)    3
Randy W. Schekman                                                                    2
Robert J. Lefkowitz                                                                  2
Roderick MacKinnon                                                                   2
                                                                                    ..
National Dialogue Quartet                                                            1
Naguib Mahfouz                                                                       1
Nadine Gordimer                                                                      1
Médecins Sans Frontières                                                             1
A. Michael Spence                                                                    1
Length: 904, dtype: int64
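
A caveat on these counts: the Paul Ehrlich rows shown later in this notebook suggest that one prize can appear on several rows when more than one affiliation is listed, so grouping by `Full Name` alone may overstate repeat winners. The sketch below shows one way to count distinct prizes instead, assuming that a prize is identified by the combination of `Laureate ID`, `Year` and `Category`. The rows in `toy` are invented for the illustration.

```python
import pandas as pd

# Invented rows: laureate A has two different prizes,
# laureate B has one prize listed under two affiliations
toy = pd.DataFrame({
    "Laureate ID": [1, 1, 2, 2, 3],
    "Full Name": ["A", "A", "B", "B", "C"],
    "Year": [1903, 1911, 2009, 2009, 1950],
    "Category": ["Physics", "Chemistry", "Medicine", "Medicine", "Peace"],
    "Organization Name": [None, "Univ X", "Inst Y", "Inst Z", None],
})

# Naive count: every row counts, so B looks like a repeat winner
naive_counts = toy.groupby("Full Name").size()

# Collapse rows that describe the same prize, then count prizes per laureate
prizes = toy.drop_duplicates(subset=["Laureate ID", "Year", "Category"])
prize_counts = prizes.groupby("Full Name").size()
repeat_winners = prize_counts[prize_counts > 1]
print(repeat_winners)
```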


### Number of entries and columns

In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 969 entries, 0 to 968
Data columns (total 18 columns):
Year                    969 non-null int64
Category                969 non-null object
Prize                   969 non-null object
Motivation              881 non-null object
Prize Share             969 non-null object
Laureate ID             969 non-null int64
Laureate Type           969 non-null object
Full Name               969 non-null object
Birth Date              940 non-null object
Birth City              941 non-null object
Birth Country           943 non-null object
Sex                     943 non-null object
Organization Name       722 non-null object
Organization City       716 non-null object
Organization Country    716 non-null object
Death Date              617 non-null object
Death City              599 non-null object
Death Country           605 non-null object
dtypes: int64(2), object(16)
memory usage: 136.4+ KB


### Checking the number of null values in each column

In [49]:
df.isnull().sum()

Year                      0
Category                  0
Prize                     0
Motivation               88
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date               29
Birth City               28
Birth Country            26
Sex                      26
Organization Name       247
Organization City       253
Organization Country    253
Death Date              352
Death City              370
Death Country           364
dtype: int64

### What are the values of Laureate Type?

In [50]:
pd.unique(df["Laureate Type"])

array(['Individual', 'Organization'], dtype=object)

### Number of rows with Laureate Type = Organization

In [51]:
df[df["Laureate Type"] == "Organization"].shape

(30, 18)

In [53]:
df[df["Sex"].isnull()]

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
24,1904,Peace,The Nobel Peace Prize 1904,,1/1,467,Organization,Institut de droit international (Institute of ...,,,,,,,,,,
61,1910,Peace,The Nobel Peace Prize 1910,,1/1,477,Organization,Bureau international permanent de la Paix (Per...,,,,,,,,,,
90,1917,Peace,The Nobel Peace Prize 1917,,1/1,482,Organization,Comité international de la Croix Rouge (Intern...,,,,,,,,,,
206,1938,Peace,The Nobel Peace Prize 1938,,1/1,503,Organization,Office international Nansen pour les Réfugiés ...,,,,,,,,,,
222,1944,Peace,The Nobel Peace Prize 1944,,1/1,482,Organization,Comité international de la Croix Rouge (Intern...,,,,,,,,,,
244,1947,Peace,The Nobel Peace Prize 1947,,1/2,508,Organization,Friends Service Council (The Quakers),,,,,,,,,,
245,1947,Peace,The Nobel Peace Prize 1947,,1/2,509,Organization,American Friends Service Committee (The Quakers),,,,,,,,,,
295,1954,Peace,The Nobel Peace Prize 1954,,1/1,515,Organization,Office of the United Nations High Commissioner...,,,,,,,,,,
365,1963,Peace,The Nobel Peace Prize 1963,,1/2,482,Organization,Comité international de la Croix Rouge (Intern...,,,,,,,,,,
366,1963,Peace,The Nobel Peace Prize 1963,,1/2,523,Organization,Ligue des Sociétés de la Croix-Rouge (League o...,,,,,,,,,,


In [55]:
#df[df["Prize Share"] == "1/2"]
df[df["Full Name"] == "Paul Ehrlich"]

Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
46,1908,Medicine,The Nobel Prize in Physiology or Medicine 1908,"""in recognition of their work on immunity""",1/2,302,Individual,Paul Ehrlich,1854-03-14,Strehlen (Strzelin),Prussia (Poland),Male,Goettingen University,Göttingen,Germany,1915-08-20,Bad Homburg vor der Höhe,Germany
47,1908,Medicine,The Nobel Prize in Physiology or Medicine 1908,"""in recognition of their work on immunity""",1/2,302,Individual,Paul Ehrlich,1854-03-14,Strehlen (Strzelin),Prussia (Poland),Male,Königliches Institut für experimentelle Therap...,Frankfurt-on-the-Main,Germany,1915-08-20,Bad Homburg vor der Höhe,Germany


In [57]:

#df[df["Organization Name"].isnull()]
l = df[df["Organization Name"].isnull()]
k = l[l["Laureate Type"] == "Organization"]
print("Shape\n%s Rows \n%s Columns " % (k.shape[0],k.shape[1]))


Shape
30 Rows 
18 Columns 


## Cleaning 

### Removing Duplicates 

In [58]:
print(len(df))
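# Keep one row per laureate. Note that this also drops the later awards of anyone
# who won more than once, not only rows repeated for multiple affiliations.
df.drop_duplicates(subset="Laureate ID", keep="first", inplace=True)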
print(len(df))
df[df["Full Name"] == "Paul Ehrlich"]

969
904


Unnamed: 0,Year,Category,Prize,Motivation,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
46,1908,Medicine,The Nobel Prize in Physiology or Medicine 1908,"""in recognition of their work on immunity""",1/2,302,Individual,Paul Ehrlich,1854-03-14,Strehlen (Strzelin),Prussia (Poland),Male,Goettingen University,Göttingen,Germany,1915-08-20,Bad Homburg vor der Höhe,Germany


### Dropping the Motivation column

In [59]:
del df['Motivation']
df.sample(5)

Unnamed: 0,Year,Category,Prize,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
572,1985,Chemistry,The Nobel Prize in Chemistry 1985,1/2,262,Individual,Herbert A. Hauptman,1917-02-14,"New York, NY",United States of America,Male,The Medical Foundation of Buffalo,"Buffalo, NY",United States of America,2011-10-23,"Buffalo, NY",United States of America
432,1971,Physics,The Nobel Prize in Physics 1971,1/1,93,Individual,Dennis Gabor,1900-06-05,Budapest,Hungary,Male,Imperial College,London,United Kingdom,1979-02-08,London,United Kingdom
238,1946,Physics,The Nobel Prize in Physics 1946,1/1,51,Individual,Percy Williams Bridgman,1882-04-21,"Cambridge, MA",United States of America,Male,Harvard University,"Cambridge, MA",United States of America,1961-08-20,"Randolph, NH",United States of America
545,1981,Physics,The Nobel Prize in Physics 1981,1/4,119,Individual,Arthur Leonard Schawlow,1921-05-05,"Mount Verno, NY",United States of America,Male,Stanford University,"Stanford, CA",United States of America,1999-04-28,"Palo Alto, CA",United States of America
126,1925,Physics,The Nobel Prize in Physics 1925,1/2,30,Individual,James Franck,1882-08-26,Hamburg,Germany,Male,Goettingen University,Göttingen,Germany,1964-05-21,Göttingen,West Germany (Germany)


### Fixing incorrect Laureate Type values (Organization -> Individual)

In [61]:

#df.loc[df["Sex"].isnull(), "Sex"] = "Org"

mask = (df["Laureate Type"] == "Organization") & (df["Birth Date"].notnull())
# Assign through .loc; chained indexing such as df[col][mask] = ... triggers SettingWithCopyWarning
df.loc[mask, "Laureate Type"] = "Individual"

# Check: no Organization rows with a birth date should remain
l = df[df["Laureate Type"] == "Organization"]
k = l[l["Birth Date"].notnull()]
k


Unnamed: 0,Year,Category,Prize,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country


In [62]:
k = df[df["Laureate Type"] == "Organization"]
print("Shape\n%s Rows \n%s Columns " % (k.shape[0],k.shape[1]))
k
#df.isnull().sum()

Shape
23 Rows 
17 Columns 


Unnamed: 0,Year,Category,Prize,Prize Share,Laureate ID,Laureate Type,Full Name,Birth Date,Birth City,Birth Country,Sex,Organization Name,Organization City,Organization Country,Death Date,Death City,Death Country
24,1904,Peace,The Nobel Peace Prize 1904,1/1,467,Organization,Institut de droit international (Institute of ...,,,,,,,,,,
61,1910,Peace,The Nobel Peace Prize 1910,1/1,477,Organization,Bureau international permanent de la Paix (Per...,,,,,,,,,,
90,1917,Peace,The Nobel Peace Prize 1917,1/1,482,Organization,Comité international de la Croix Rouge (Intern...,,,,,,,,,,
206,1938,Peace,The Nobel Peace Prize 1938,1/1,503,Organization,Office international Nansen pour les Réfugiés ...,,,,,,,,,,
244,1947,Peace,The Nobel Peace Prize 1947,1/2,508,Organization,Friends Service Council (The Quakers),,,,,,,,,,
245,1947,Peace,The Nobel Peace Prize 1947,1/2,509,Organization,American Friends Service Committee (The Quakers),,,,,,,,,,
295,1954,Peace,The Nobel Peace Prize 1954,1/1,515,Organization,Office of the United Nations High Commissioner...,,,,,,,,,,
366,1963,Peace,The Nobel Peace Prize 1963,1/2,523,Organization,Ligue des Sociétés de la Croix-Rouge (League o...,,,,,,,,,,
383,1965,Peace,The Nobel Peace Prize 1965,1/1,525,Organization,United Nations Children's Fund (UNICEF),,,,,,,,,,
416,1969,Peace,The Nobel Peace Prize 1969,1/1,527,Organization,International Labour Organization (I.L.O.),,,,,,,,,,


### Dropping organizations, since they represent only about 2.5% of the DataFrame (23 of 904 rows)

In [63]:
print("Shape\n%s Rows \n%s Columns " % (df.shape[0],df.shape[1]))
df = df[df["Laureate Type"] == "Individual"]
print("Shape\n%s Rows \n%s Columns " % (df.shape[0],df.shape[1]))
df.isnull().sum()
#df[df["Death Date"].isnull()]

Shape
904 Rows 
17 Columns 
Shape
881 Rows 
17 Columns 


Year                      0
Category                  0
Prize                     0
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date                2
Birth City                2
Birth Country             0
Sex                       0
Organization Name       220
Organization City       218
Organization Country    218
Death Date              292
Death City              309
Death Country           303
dtype: int64

### Dropping Death Date, Death City, and Death Country, as they are not used in the data visualisation

In [64]:
# Drop the three death-related columns in a single call
df = df.drop(columns=["Death Date", "Death City", "Death Country"])
df.isnull().sum()


Year                      0
Category                  0
Prize                     0
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date                2
Birth City                2
Birth Country             0
Sex                       0
Organization Name       220
Organization City       218
Organization Country    218
dtype: int64


### Adding missing birth date and city for individuals 

In [65]:
df.loc[df["Laureate ID"] == 841, "Birth Date"] = "1952-04-01"  # Venkatraman Ramakrishnan
df.loc[df["Laureate ID"] == 864, "Birth Date"] = "1959-09-22"  # Saul Perlmutter
df.loc[df["Laureate ID"] == 747, "Birth City"] = "Chaguanas"  # Sir Vidiadhar Surajprasad Naipaul
df.loc[df["Laureate ID"] == 855, "Birth City"] = "Changchun"  # Liu Xiaobo
df.isnull().sum()

Year                      0
Category                  0
Prize                     0
Prize Share               0
Laureate ID               0
Laureate Type             0
Full Name                 0
Birth Date                0
Birth City                0
Birth Country             0
Sex                       0
Organization Name       220
Organization City       218
Organization Country    218
dtype: int64

In [66]:

org_cols = ["Organization Name", "Organization City", "Organization Country"]

# Laureates whose organization fields are all set to "self"
df.loc[df["Laureate ID"].isin([6, 684, 685]), org_cols] = "self"

df.loc[df["Laureate ID"] == 318, "Organization Country"] = "Tunisia"

# Laureates whose missing organization country and city are filled in by hand
usa_ids = [270, 461, 770, 811, 831, 842, 837, 878, 885, 886]
df.loc[df["Laureate ID"].isin(usa_ids), "Organization Country"] = "USA"
df.loc[df["Laureate ID"].isin(usa_ids), "Organization City"] = "Maryland"

# Rows still missing an organization country outside Peace and Literature
k = df[df["Organization Country"].isnull()]
l = k[k["Category"] != "Peace"]
l[l["Category"] != "Literature"]

df[df["Organization Country"].isnull()].shape


# After the fixes above, only the Peace and Literature laureates still have null
# organization fields, so the three lines below can fill them with any label you like.
# Keep this order: if the three lines below run first, the Physics and Medicine
# laureates handled above can no longer be found through their null values.

df.loc[df["Organization Name"].isnull(), "Organization Name"] = "Self"
df.loc[df["Organization City"].isnull(), "Organization City"] = "Self"
df.loc[df["Organization Country"].isnull(), "Organization Country"] = "Self"

df.isnull().sum()


Year                    0
Category                0
Prize                   0
Prize Share             0
Laureate ID             0
Laureate Type           0
Full Name               0
Birth Date              0
Birth City              0
Birth Country           0
Sex                     0
Organization Name       0
Organization City       0
Organization Country    0
dtype: int64
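
The three `.loc` assignments above can also be written as a single `fillna` call with one fill value per column. The sketch below shows this on a tiny invented table. `toy`, `filled`, and the placeholder values such as "Some University" are assumptions made for the example only.

```python
import pandas as pd

org_cols = ["Organization Name", "Organization City", "Organization Country"]

# Invented rows: two laureates without an affiliation, one with
toy = pd.DataFrame({
    "Category": ["Peace", "Literature", "Physics"],
    "Organization Name": [None, None, "Some University"],
    "Organization City": [None, None, "Some City"],
    "Organization Country": [None, None, "Some Country"],
})

# A dict maps each column to its fill value; other columns are left untouched
filled = toy.fillna({col: "Self" for col in org_cols})
print(filled.isnull().sum())
```

Because `fillna` returns a new DataFrame here, the original table keeps its null values until the result is assigned back.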

In [67]:
# index=False keeps the row index from being written as an extra unnamed column
df.to_csv('archiveData_Cleaned.csv', index=False)
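
By default `to_csv` also writes the row index, and `read_csv` then loads it back as a column named `Unnamed: 0`. The sketch below round-trips a small invented table through an in-memory buffer to show the difference. `round_trip` is a helper written for this example and is not part of pandas.

```python
import io

import pandas as pd

# Invented two-row table standing in for the cleaned DataFrame
toy = pd.DataFrame({"Year": [1903, 1911], "Category": ["Physics", "Chemistry"]})


def round_trip(frame, **to_csv_kwargs):
    """Write `frame` to an in-memory CSV and read it back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


with_index = round_trip(toy)                   # index written as first column
without_index = round_trip(toy, index=False)   # only the data columns

print(list(with_index.columns))
print(list(without_index.columns))
```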