# Train-Test Split

The [notable physicists dataframe](../data/interim/notable_physicists.csv) consists of a list of great physicists from the Ancient Greeks to modern times. I would like to reduce this to a list of modern physicists who have lived in the years when the *Nobel Prize in Physics* has been awarded. To be more precise, the Nobel Prize in Physics was first awarded on *10 December 1901* on the anniversary of Alfred Nobel's death. The prize has been awarded on the anniversary of his death every year since, excluding the few years in which no prize was awarded. Essentially, I would like a list of physicists who were alive on and after this date. Many of these physicists have died and many of them are still alive.

Since one of the goals of this project is to try to predict the next Physics Nobel Laureate(s), I wish to form a *training set* that consists of physicists who have died. Naturally, some of these are Nobel Laureates and some are not. The aim is to use the training set to build models that predict whether a physicist who is still alive has been awarded or is likely to be awarded the Nobel Prize in Physics. So it is natural that my *test set* only consists of physicists who are still alive.

In [None]:
from datetime import datetime

import pandas as pd

## Exploratory Analysis of Physicists' Birth and Death Dates

First, let's read the notable physicists data into a pandas dataframe and take a look at it.

In [None]:
physicists = pd.read_csv('../data/interim/notable_physicists.csv')
physicists.head(30)

I can see that there are some missing birth and death dates. Let's examine exactly whose birth dates are missing.

In [None]:
dates_names = ['birthDate', 'deathDate', 'fullName']
physicists[physicists.birthDate.isna()][dates_names]
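
For a quick overview of how much is missing, the per-column counts can be sketched as follows. The toy dataframe and its values are made up for illustration; only the column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy data: only the column names match the real dataframe.
toy = pd.DataFrame({
    'fullName': ['Physicist A', 'Physicist B', 'Physicist C'],
    'birthDate': ['1879-03-14', None, None],
    'deathDate': ['1955-04-18', None, '1990-01-01'],
})

# Number of missing values in each date column.
missing_counts = toy[['birthDate', 'deathDate']].isna().sum()
print(missing_counts)
```

Running the same expression on `physicists` would show at a glance whether birth or death dates are the bigger gap.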

Interestingly, Pythagoras is the only name I know here, so it's likely that the rest are modern physicists.
Reading some of the abstracts in the dataframe and performing a Google search on some of these physicists confirms my suspicions. Rather than just dropping them and losing valuable data, I would like to see if I can find out their dates of birth. A combination of the following allows me to *impute* exact or fairly accurate approximate values for most of the missing birth dates:

- Searching the abstract field of the dataframe
- Looking at the *Google Knowledge Graph* results provided from a Google search
- Looking to see if an approximate value is provided in the [list of notable physicists](https://en.wikipedia.org/wiki/List_of_physicists) or [list of notable theoretical physicists](https://en.wikipedia.org/wiki/List_of_theoretical_physicists)
- Examining their homepages and resumes for dates of birth or dates of first degrees

In [None]:
def impute_birth_dates(physicists):
    imputed = physicists.copy()
    imputed.loc[imputed.fullName == 'Alejandro Corichi',
                'birthDate'] = str(datetime(1967, 11, 2).date())
    imputed.loc[imputed.fullName == 'Amanda Barnard',
                'birthDate'] = str(datetime(1971, 12, 31).date())
    imputed.loc[imputed.fullName == 'B. Roy Frieden',
                'birthDate'] = str(datetime(1936, 9, 10).date())
    imputed.loc[imputed.fullName == 'Carlos E.M. Wagner',
                'birthDate'] = str(datetime(1962, 1, 1).date())
    imputed.loc[imputed.fullName == 'Chennupati Jagadish',
                'birthDate'] = str(datetime(1957, 8, 10).date())
    imputed.loc[imputed.fullName == 'Craige Schensted',
                'birthDate'] = str(datetime(1928, 1, 1).date())
    imputed.loc[imputed.fullName == 'Denis Weaire',
                'birthDate'] = str(datetime(1942, 10, 17).date())
    imputed.loc[imputed.fullName == 'Eric Poisson',
                'birthDate'] = str(datetime(1965, 7, 26).date())
    imputed.loc[imputed.fullName == 'Gaetano Vignola',
                'birthDate'] = str(datetime(1947, 1, 1).date())
    imputed.loc[imputed.fullName == 'George W. Clark',
                'birthDate'] = str(datetime(1928, 1, 1).date())
    imputed.loc[imputed.fullName == 'Gerald B. Cleaver',
                'birthDate'] = str(datetime(1963, 1, 1).date())
    imputed.loc[imputed.fullName == 'James E. Faller',
                'birthDate'] = str(datetime(1934, 1, 17).date())
    imputed.loc[imputed.fullName == 'James W. LaBelle',
                'birthDate'] = str(datetime(1958, 6, 21).date())
    imputed.loc[imputed.fullName == 'Kathryn Moler',
                'birthDate'] = str(datetime(1965, 1, 1).date())
    imputed.loc[imputed.fullName == 'Kenneth Young',
                'birthDate'] = str(datetime(1947, 1, 1).date())
    imputed.loc[imputed.fullName == 'Laura Mersini-Houghton',
                'birthDate'] = str(datetime(1969, 1, 1).date())
    imputed.loc[imputed.fullName == 'Marcia Barbosa',
                'birthDate'] = str(datetime(1960, 1, 14).date())
    imputed.loc[imputed.fullName == 'Mark G. Raizen',
                'birthDate'] = str(datetime(1955, 1, 1).date())
    imputed.loc[imputed.fullName == 'Mehran Kardar',
                'birthDate'] = str(datetime(1958, 1, 1).date())
    imputed.loc[imputed.fullName == 'Oleg Sushkov',
                'birthDate'] = str(datetime(1950, 1, 1).date())
    imputed.loc[imputed.fullName == 'Paul Crowell',
                'birthDate'] = str(datetime(1965, 1, 1).date())
    imputed.loc[imputed.fullName == 'Petr Paucek',
                'birthDate'] = str(datetime(1961, 1, 1).date())
    imputed.loc[imputed.fullName == 'Rafael Sorkin',
                'birthDate'] = str(datetime(1945, 1, 1).date())
    imputed.loc[imputed.fullName == 'Raúl Rabadan',
                'birthDate'] = str(datetime(1973, 1, 1).date())
    imputed.loc[imputed.fullName == 'Ray Mackintosh',
                'birthDate'] = str(datetime(1940, 1, 1).date())
    imputed.loc[imputed.fullName == 'Richard Clegg',
                'birthDate'] = str(datetime(1957, 1, 1).date())
    imputed.loc[imputed.fullName == 'Scott Diddams',
                'birthDate'] = str(datetime(1968, 1, 1).date())
    imputed.loc[imputed.fullName == 'Willibald Peter Prasthofer',
                'birthDate'] = str(datetime(1917, 5, 17).date())
    return imputed

In [None]:
physicists = impute_birth_dates(physicists)
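
The same imputation could also be driven by a lookup table, which keeps the researched dates separate from the pandas logic. The following is only a sketch under assumptions: `IMPUTED_BIRTH_DATES` and `impute_from_lookup` are hypothetical names, the table holds just two of the entries used above, and, unlike `impute_birth_dates`, this version only fills values that are missing rather than overwriting whatever is there.

```python
import pandas as pd

# Hypothetical subset of the lookup; a full table would hold one entry per
# physicist handled in impute_birth_dates.
IMPUTED_BIRTH_DATES = {
    'Alejandro Corichi': '1967-11-02',
    'Amanda Barnard': '1971-12-31',
}


def impute_from_lookup(physicists, lookup):
    """Return a copy with missing birthDate values filled from lookup."""
    imputed = physicists.copy()
    # Names absent from the lookup map to NaN, so they stay missing.
    looked_up = imputed.fullName.map(lookup)
    imputed['birthDate'] = imputed.birthDate.fillna(looked_up)
    return imputed


# Toy data for illustration only.
toy = pd.DataFrame({
    'fullName': ['Alejandro Corichi', 'Amanda Barnard',
                 'Albert Einstein', 'Pythagoras'],
    'birthDate': [None, None, '1879-03-14', None],
})
filled = impute_from_lookup(toy, IMPUTED_BIRTH_DATES)
print(filled)
```

Existing birth dates are left untouched and names without an entry remain missing, so the follow-up check for remaining gaps works the same way.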

OK, let's check again to see how many birth dates remain missing.

In [None]:
dead_missing_no_birth_date = physicists[physicists.birthDate.isna()][dates_names]
dead_missing_no_birth_date

Down to only 3. That's good. Pythagoras is clearly dead! Wikipedia tells me that Karl-Heinrich Riewe disappeared in controversial circumstances. The less said on that the better. And further research in fact reveals that William R. Kanne died a long time ago. So let's drop these 3 from the list.

In [None]:
physicists = physicists.drop(index=dead_missing_no_birth_date.index)
assert(physicists.birthDate.isna().sum() == 0)
assert(len(physicists) == 1047)

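So now every physicist in the list has a birth date. Before I can examine the death dates to remove those who died before the Nobel Prize in Physics was first awarded, I must first deal with a Python `datetime` technicality. The `'%Y-%m-%d'` format that I use for parsing expects a four-digit year, and `datetime` cannot represent BC dates at all (its minimum year is 1), so birth dates whose year is not written with exactly four digits do not parse. I therefore remove all physicists born before the year 1000. Nothing is lost, as they all died long before the prize existed and would be removed in the next step anyway.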

In [None]:
born_before_year_1000 = physicists[physicists.birthDate.apply(
    lambda d: len(d.split('-')[0]) != 4)][dates_names]
born_before_year_1000
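
The filter above keeps any date string whose year part is not exactly four characters long. Here is a minimal sketch of why that catches both short years and BC-style strings. The example strings are made up, and the real data may format such dates differently; `has_four_digit_year` and `can_parse` are hypothetical helper names.

```python
from datetime import datetime


def has_four_digit_year(date_string):
    """True if the text before the first '-' is exactly four characters."""
    return len(date_string.split('-')[0]) == 4


def can_parse(date_string):
    """True if strptime accepts the string with the '%Y-%m-%d' format."""
    try:
        datetime.strptime(date_string, '%Y-%m-%d')
        return True
    except ValueError:
        return False


# Hypothetical examples: a modern date, a three-digit year and a BC-style
# string with a leading minus sign.
examples = ['1879-03-14', '965-07-01', '-0570-01-01']
for example in examples:
    print(example, has_four_digit_year(example), can_parse(example))
```

A leading minus sign makes the first element of the split an empty string, so BC-style dates fail the length test just like three-digit years do.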

In [None]:
physicists = physicists.drop(index=born_before_year_1000.index)
assert(len(physicists) == 1035)

Now I can convert the `birthDate` and `deathDate` variables to `datetime` and perform the arithmetic to find those physicists who died before the first Nobel Prize in Physics was awarded.

In [None]:
physicists['birthDate'] = physicists.birthDate.apply(
    lambda d: datetime.strptime(d, '%Y-%m-%d').date())
physicists['deathDate'] = (physicists[~physicists.deathDate.isna()]
                           .deathDate.apply(lambda d: datetime.strptime(
                               d, '%Y-%m-%d').date()))
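
An alternative to parsing only the non-missing subset is a small helper that passes missing values through untouched, so both date columns can be converted in the same way. This is a sketch, not what the cell above does; `parse_date` is a hypothetical name and the toy data is made up.

```python
from datetime import datetime

import pandas as pd


def parse_date(value):
    """Parse a 'YYYY-MM-DD' string to a date, leaving missing values as-is."""
    if pd.isna(value):
        return value
    return datetime.strptime(value, '%Y-%m-%d').date()


# Toy data for illustration only.
toy = pd.DataFrame({'deathDate': ['1955-04-18', None]})
toy['deathDate'] = toy.deathDate.apply(parse_date)
print(toy)
```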

In [None]:
date_prize_first_awarded = datetime(1901, 12, 10).date()
physicists_died_before_prize = physicists[
    physicists.deathDate < date_prize_first_awarded][dates_names]
with pd.option_context('display.max_rows', 116):
    display(physicists_died_before_prize)

So let's drop these 116 great physicists, as they were never eligible to be awarded a Nobel Prize because of the era they lived in.

In [None]:
physicists = physicists.drop(index=physicists_died_before_prize.index)
assert(len(physicists) == 919)
assert(all(physicists.deathDate.isna() |
       (physicists.deathDate >= date_prize_first_awarded)))

## Test Set

Now I form the test set from the remaining physicists who are still alive.

In [None]:
test_physicists = physicists[physicists.deathDate.isna()]
assert(len(test_physicists) == 379)
with pd.option_context('display.max_rows', 379):
    display(test_physicists[dates_names])

There are 379 physicists in this list. You may have noticed that it includes a very famous physicist who has since died: Stephen Hawking. As I mentioned previously, the DBpedia data is 6-18 months behind the Wikipedia data, so this recent death is not yet reflected in DBpedia. For the same reason, there may be a few other dead physicists in this list that I do not know of. This is not a big issue; I will just treat them as still alive for the purposes of this study. But it is important to remember that the Nobel Prize in Physics cannot be awarded posthumously.

## Training Set

Now I form the training set from the remaining physicists who are dead but were still alive on or after the date the Nobel Prize in Physics was first awarded.

In [None]:
train_physicists = physicists[~physicists.deathDate.isna()]
assert(len(train_physicists) == 540)
with pd.option_context('display.max_rows', 540):
    display(train_physicists[dates_names])

There are 540 physicists in this list. Let's do a quick sanity check to make sure that no physicist appears in both lists. OK, everything looks good.

In [None]:
assert(not set(train_physicists.fullName).intersection(set(test_physicists.fullName)))
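
A complementary check, sketched here on made-up toy data, compares the two sets by row index instead of by name and also confirms that together they account for every remaining row. The names `toy_train` and `toy_test` are hypothetical stand-ins for the real dataframes.

```python
import pandas as pd

# Toy data for illustration only: one dead and two living physicists.
toy = pd.DataFrame({
    'fullName': ['Physicist A', 'Physicist B', 'Physicist C'],
    'deathDate': ['1955-04-18', None, None],
})

is_alive = toy.deathDate.isna()
toy_test = toy[is_alive]
toy_train = toy[~is_alive]

# No row may appear in both sets, and no row may be lost by the split.
no_overlap = toy_train.index.intersection(toy_test.index).empty
covers_everything = len(toy_train) + len(toy_test) == len(toy)
print(no_overlap, covers_everything)
```

The name-based check in the cell above remains useful on its own, since it would also flag a physicist who appears twice under different row indices.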

## Persisting the Training and Test Data

Now that I have the training and test dataframes, I'd like to persist them for later analysis. So I'll write each of them out to its own CSV file.

In [None]:
train_physicists.to_csv(
    '../data/processed/train_notable_physicists_from_1901.csv', index=False)
test_physicists.to_csv(
    '../data/processed/test_notable_physicists_from_1901.csv', index=False)
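
One thing to keep in mind when loading these files later is that CSV stores the dates as plain text. The sketch below uses an in-memory buffer and made-up toy rows instead of the real files to show that the date columns come back as strings by default, and that passing `parse_dates` to `pd.read_csv` is one way to get datetimes back.

```python
from datetime import date
from io import StringIO

import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    'fullName': ['Physicist A', 'Physicist B'],
    'birthDate': [date(1879, 3, 14), date(1942, 1, 8)],
    'deathDate': [date(1955, 4, 18), None],
})

# Write to an in-memory buffer rather than to disk.
buffer = StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# By default the date columns are read back as strings.
reloaded = pd.read_csv(StringIO(csv_text))

# Asking pandas to parse them restores datetime values (NaT where missing).
reparsed = pd.read_csv(StringIO(csv_text),
                       parse_dates=['birthDate', 'deathDate'])
print(reloaded.dtypes)
print(reparsed.dtypes)
```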

## Cleaning Up

A few clean-up steps are needed:

- Convert the notebook to an HTML file with all the output.
- Convert the notebook to another notebook with the output removed.

In [None]:
!jupyter nbconvert --output-dir html_output --to html 2.1-train-test-split.ipynb

In [None]:
!jupyter nbconvert --ClearOutputPreprocessor.enabled=True --to notebook 2.1-train-test-split.ipynb