# Train-Validation-Test Split

The [physicists dataframe](../data/interim/physicists.csv) consists of a list of great physicists from the Ancient Greeks to modern times. We would like to reduce this to a list of modern physicists who were eligible to be awarded the Nobel Prize in Physics. To be more precise, the Nobel Prize in Physics was first awarded on **10 December 1901** on the anniversary of *Alfred Nobel's* death. The prize has been awarded on the anniversary of his death every year since, excluding the few years in which no prize was awarded. Essentially, we would like a list of physicists who were alive on and after this date. Some of these physicists are deceased and some are alive.

Since the Nobel Prize in Physics cannot be awarded posthumously and the goal is to develop a model to predict laureates, it makes sense to form a **training set** that consists of deceased physicists and a **validation set** and **test set** that both consist of living physicists. We choose to use a validation set for model selection instead of cross-validation, as the latter is not appropriate due to this "pseudo-time component" of the data.

In [None]:
from datetime import datetime

import pandas as pd

## Exploratory Analysis of Physicists' Birth and Death Dates

First let's read the physicists data into a pandas dataframe and take a look at it.

In [None]:
physicists = pd.read_csv('../data/interim/physicists.csv')
physicists.head(30)

We can see that there are two issues:

1. Missing birth and death dates.
2. Death dates equal to birth dates.

In [None]:
dates_names = ['birthDate', 'deathDate', 'fullName']
physicists[physicists.birthDate.isna() |
           (physicists.birthDate == physicists.deathDate) |
           (physicists.deathDate.isna() & physicists.categories.str.contains(
               'death'))][dates_names]

The second issue is due to poor validation of data in DBpedia. Since these physicists are deceased, the death dates just need to be corrected.

For the first issue, let's examine exactly whose birth dates are missing. Interestingly, *Pythagoras* is the most famous name here, so it's likely that this is a list of modern physicists. Reading some of the abstracts in the dataframe and performing a Google search on some of these physicists confirms the suspicion. Rather than just dropping them and losing valuable data, we would like to see if we can find out their dates of birth. A combination of the following allows us to *impute* exact or sufficiently accurate approximate values for most of the missing birth and death dates for the physicists:

- Searching the **abstract field** of the dataframe
- Looking at the **Google Knowledge Graph** results provided from a Google search
- Looking to see if an **approximate date** is provided in the [list of physicists](https://en.wikipedia.org/w/index.php?title=List_of_physicists&oldid=864677795) or [list of theoretical physicists](https://en.wikipedia.org/w/index.php?title=List_of_theoretical_physicists&oldid=855745137)
- Examining their online **homepages** and **resumes** for *dates of birth* or *dates of degrees*

In [None]:
def impute_birth_dates(physicists):
    """Return a copy with manually researched birth and death dates filled in."""
    imputed = physicists.copy()
    imputed.loc[imputed.fullName == 'Alejandro Corichi',
                'birthDate'] = str(datetime(1967, 11, 2).date())
    imputed.loc[imputed.fullName == 'Amanda Barnard',
                'birthDate'] = str(datetime(1971, 12, 31).date())
    imputed.loc[imputed.fullName == 'B. Roy Frieden',
                'birthDate'] = str(datetime(1936, 9, 10).date())
    imputed.loc[imputed.fullName == 'Carlos E.M. Wagner',
                'birthDate'] = str(datetime(1962, 1, 1).date())
    imputed.loc[imputed.fullName == 'Charlotte Riefenstahl',
                'deathDate'] = str(datetime(1993, 1, 6).date())
    imputed.loc[imputed.fullName == 'Chennupati Jagadish',
                'birthDate'] = str(datetime(1957, 8, 10).date())
    imputed.loc[imputed.fullName == 'Craige Schensted',
                'birthDate'] = str(datetime(1928, 1, 1).date())
    imputed.loc[imputed.fullName == 'David Bohm',
                'deathDate'] = str(datetime(1992, 10, 27).date())
    imputed.loc[imputed.fullName == 'Denis Weaire',
                'birthDate'] = str(datetime(1942, 10, 17).date())
    imputed.loc[imputed.fullName == 'Eric Poisson',
                'birthDate'] = str(datetime(1965, 7, 26).date())
    imputed.loc[imputed.fullName == 'Gaetano Vignola',
                'birthDate'] = str(datetime(1947, 1, 1).date())
    imputed.loc[imputed.fullName == 'George W. Clark',
                'birthDate'] = str(datetime(1928, 1, 1).date())
    imputed.loc[imputed.fullName == 'Gerald B. Cleaver',
                'birthDate'] = str(datetime(1963, 1, 1).date())
    imputed.loc[imputed.fullName == 'James E. Faller',
                'birthDate'] = str(datetime(1934, 1, 17).date())
    imputed.loc[imputed.fullName == 'James W. LaBelle',
                'birthDate'] = str(datetime(1958, 6, 21).date())
    imputed.loc[imputed.fullName == 'Johannes Fischer',
                'deathDate'] = str(datetime(1977, 1, 1).date())
    imputed.loc[imputed.fullName == 'John Archibald Wheeler',
                'deathDate'] = str(datetime(2008, 4, 13).date())
    imputed.loc[imputed.fullName == 'Kathryn Moler',
                'birthDate'] = str(datetime(1965, 1, 1).date())
    imputed.loc[imputed.fullName == 'Kenneth Young',
                'birthDate'] = str(datetime(1947, 1, 1).date())
    imputed.loc[imputed.fullName == 'Laura Mersini-Houghton',
                'birthDate'] = str(datetime(1969, 1, 1).date())
    imputed.loc[imputed.fullName == 'Marcia Barbosa',
                'birthDate'] = str(datetime(1960, 1, 14).date())
    imputed.loc[imputed.fullName == 'Mark G. Raizen',
                'birthDate'] = str(datetime(1955, 1, 1).date())
    imputed.loc[imputed.fullName == 'Mehran Kardar',
                'birthDate'] = str(datetime(1958, 1, 1).date())
    imputed.loc[imputed.fullName == 'Oleg Sushkov',
                'birthDate'] = str(datetime(1950, 1, 1).date())
    imputed.loc[imputed.fullName == 'Paul Crowell',
                'birthDate'] = str(datetime(1965, 1, 1).date())
    imputed.loc[imputed.fullName == 'Petr Paucek',
                'birthDate'] = str(datetime(1961, 1, 1).date())
    imputed.loc[imputed.fullName == 'Rafael Sorkin',
                'birthDate'] = str(datetime(1945, 1, 1).date())
    imputed.loc[imputed.fullName == 'Raúl Rabadan',
                'birthDate'] = str(datetime(1973, 1, 1).date())
    imputed.loc[imputed.fullName == 'Ray Mackintosh',
                'birthDate'] = str(datetime(1940, 1, 1).date())
    imputed.loc[imputed.fullName == 'Richard Clegg',
                'birthDate'] = str(datetime(1957, 1, 1).date())
    imputed.loc[imputed.fullName == 'Sam Treiman',
                'deathDate'] = str(datetime(1999, 11, 30).date())
    imputed.loc[imputed.fullName == 'Scott Diddams',
                'birthDate'] = str(datetime(1968, 1, 1).date())
    imputed.loc[imputed.fullName == 'Willibald Peter Prasthofer',
                'birthDate'] = str(datetime(1917, 5, 17).date())
    return imputed

In [None]:
physicists = impute_birth_dates(physicists)
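The same kind of manual correction can also be written in a data-driven way. The following is only a minimal sketch on a toy dataframe, not the function used in this notebook: the `apply_corrections` helper and the `corrections` dictionary are hypothetical names, the toy rows are illustrative, and the two corrected dates are copied from the cell above.

```python
import pandas as pd

# Toy dataframe (illustrative only) that mimics the two issues:
# a missing birth date and a death date equal to the birth date.
toy = pd.DataFrame({
    'fullName': ['Alejandro Corichi', 'David Bohm', 'Niels Bohr'],
    'birthDate': [None, '1917-12-20', '1885-10-07'],
    'deathDate': [None, '1917-12-20', '1962-11-18'],
})

# Hypothetical lookup: (fullName, column) -> corrected ISO date string.
corrections = {
    ('Alejandro Corichi', 'birthDate'): '1967-11-02',
    ('David Bohm', 'deathDate'): '1992-10-27',
}


def apply_corrections(df, corrections):
    """Return a copy of df with each listed (name, column) cell overwritten."""
    fixed = df.copy()
    for (name, column), value in corrections.items():
        fixed.loc[fixed.fullName == name, column] = value
    return fixed


fixed = apply_corrections(toy, corrections)
```

Keeping the corrections in a dictionary makes it easier to review or extend the list, at the cost of hiding the explicit assignments shown in the cell above.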

OK, let's check again to see if any of these issues remain.

In [None]:
dead_missing_no_birth_date = physicists[physicists.birthDate.isna() |
           (physicists.birthDate == physicists.deathDate) |
           (physicists.deathDate.isna() & physicists.categories.str.contains(
               'death'))][dates_names]
dead_missing_no_birth_date

Down to only 3. That's a big improvement. Well, *Pythagoras* is clearly dead! Wikipedia tells us that *Karl-Heinrich Riewe* disappeared in controversial circumstances. The less said on that the better! And further research in fact reveals that *William R. Kanne* died a long time ago. So let's drop these 3 from the list.

In [None]:
physicists = physicists.drop(index=dead_missing_no_birth_date.index)
assert(physicists.birthDate.isna().sum() == 0)
assert(len(physicists) == 1055)

So now every physicist in the list has a birth date. Before we can examine the death dates to remove those who died before the Nobel Prize in Physics was first awarded, we must first deal with a Python `datetime` technicality. Parsing with `datetime.strptime` and the `%Y` directive expects a four-digit year, so the date strings in this data whose year part is not exactly four digits (years before 1000) cannot be converted. Let's remove all physicists born before the year 1000 as they too, like *Pythagoras*, are clearly deceased.

In [None]:
born_before_year_1000 = physicists[physicists.birthDate.apply(
    lambda d: len(d.split('-')[0]) != 4)][dates_names]
born_before_year_1000

In [None]:
physicists = physicists.drop(index=born_before_year_1000.index)
assert(len(physicists) == 1043)
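To see why the four-digit check is needed, here is a small standalone sketch. The sample strings are hypothetical and only imitate the shape of the date strings in the data; `parses_with_strptime` and `has_four_digit_year` are illustrative helper names, with the latter mirroring the filter used above.

```python
from datetime import datetime


def parses_with_strptime(date_string):
    """Return True if strptime can parse the string as '%Y-%m-%d'."""
    try:
        datetime.strptime(date_string, '%Y-%m-%d')
        return True
    except ValueError:
        return False


def has_four_digit_year(date_string):
    """Mirror the notebook's filter: text before the first hyphen has length 4."""
    return len(date_string.split('-')[0]) == 4


# Hypothetical samples: a modern date, a three-digit year and a negative year.
samples = ['1879-03-14', '965-01-01', '-570-01-01']
results = {s: (has_four_digit_year(s), parses_with_strptime(s)) for s in samples}
```

For these samples the two checks agree: only the string with a four-digit year passes the filter and can be parsed. For the negative year, the text before the first hyphen is empty, so it is also caught by the length check.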

Now we can convert the `birthDate` and `deathDate` variables to the `datetime` type and perform the arithmetic to find those physicists who died before the first Nobel Prize in Physics was awarded.

In [None]:
physicists['birthDate'] = physicists.birthDate.apply(
    lambda d: datetime.strptime(d, '%Y-%m-%d').date())
physicists['deathDate'] = (physicists[~physicists.deathDate.isna()]
                           .deathDate.apply(lambda d: datetime.strptime(
                               d, '%Y-%m-%d').date()))

In [None]:
date_prize_first_awarded = datetime(1901, 12, 10).date()
physicists_died_before_prize = physicists[
    physicists.deathDate < date_prize_first_awarded][dates_names]
assert(len(physicists_died_before_prize) == 116)
with pd.option_context('display.max_rows', 116):
    display(physicists_died_before_prize)
len(physicists_died_before_prize)

So let's drop these great physicists who were never eligible to be awarded a Nobel Prize.

In [None]:
physicists = physicists.drop(index=physicists_died_before_prize.index)
assert(len(physicists) == 927)
assert(all(physicists.deathDate.isna() | (physicists.deathDate >= date_prize_first_awarded)))
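The eligibility rule applied above can be stated as a small standalone function. This is a sketch under the assumption that a missing death date is represented as `None` and means the physicist is treated as living; `was_eligible` is a hypothetical helper name that is not used elsewhere in the notebook.

```python
from datetime import date

DATE_PRIZE_FIRST_AWARDED = date(1901, 12, 10)


def was_eligible(death_date):
    """Return True if the physicist was alive on or after the first award date.

    A death_date of None is treated as "no recorded death", i.e. still living.
    """
    return death_date is None or death_date >= DATE_PRIZE_FIRST_AWARDED


maxwell_eligible = was_eligible(date(1879, 11, 5))    # died before the first award
einstein_eligible = was_eligible(date(1955, 4, 18))   # died after the first award
living_eligible = was_eligible(None)                  # no recorded death date
```

Note that the boundary is inclusive: someone who died on 10 December 1901 itself is kept, which matches dropping only those whose death date is strictly before that day.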

## Training Set

Now we can form the training set from the remaining physicists who are now deceased but were still alive when the Nobel Prize in Physics was first awarded.

In [None]:
train_physicists = physicists[~physicists.deathDate.isna()]
assert(len(train_physicists) == 542)
with pd.option_context('display.max_rows', 542):
    display(train_physicists[dates_names])

## Validation and Test Sets

Now let's form a dataframe of the remaining living physicists.

In [None]:
physicists_alive = physicists[physicists.deathDate.isna()]
assert(len(physicists_alive) == 385)
physicists_alive[dates_names]

Now we can randomly sample this dataframe to ensure that there is approximately a 50-50 split between physicists in the validation and test sets.

In [None]:
validation_physicists = physicists_alive.sample(frac=0.5, random_state=0).sort_index()
assert(len(validation_physicists) == 192)
with pd.option_context('display.max_rows', 192):
    display(validation_physicists[dates_names])

In [None]:
test_physicists = physicists_alive.loc[~physicists_alive.index.isin(
    validation_physicists.index)].sort_index()
assert(len(test_physicists) == 193)
with pd.option_context('display.max_rows', 193):
    display(test_physicists[dates_names])
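The sample-then-take-the-complement pattern used in the two cells above can be sketched on a toy dataframe. The names in `toy_alive` are made up for illustration, and `drop(index=...)` is used here as an equivalent way of selecting the rows that were not sampled.

```python
import pandas as pd

# Toy stand-in for physicists_alive (names are hypothetical).
toy_alive = pd.DataFrame({'fullName': [f'Physicist {i}' for i in range(7)]})

# Sample half of the rows for validation with a fixed seed for reproducibility.
validation = toy_alive.sample(frac=0.5, random_state=0).sort_index()

# Everything that was not sampled goes to the test set.
test = toy_alive.drop(index=validation.index).sort_index()
```

With an odd number of rows the two sets cannot be exactly equal in size, which is why the split above is only approximately 50-50.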

You may have noticed that there are some deceased physicists (including laureates) in these dataframes. Examples include *Stephen Hawking*, *Nicolaas Bloembergen*, *Emil Wolf* and *Hans Georg Dehmelt*. As was mentioned previously, the DBpedia data is 6-18 months behind the Wikipedia data, so recent deaths are not reflected in the data yet. Due to this, there may be a few other dead physicists in the dataframes. It's not a big issue: we will just treat them as still living for the purposes of this study. However, it is important to remember that the Nobel Prize in Physics cannot be awarded posthumously.

Let's run two quick sanity checks to ensure that:

1. The three dataframes together contain the correct total number of physicists.
2. No physicist appears in more than one dataframe.

In [None]:
assert(len(train_physicists) + len(validation_physicists) + len(test_physicists) == len(physicists))
assert(not set(train_physicists.fullName).intersection(set(validation_physicists.fullName)))
assert(not set(train_physicists.fullName).intersection(set(test_physicists.fullName)))
assert(not set(validation_physicists.fullName).intersection(set(test_physicists.fullName)))
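The two checks above amount to verifying that the three sets form a partition of the full list. A generic version could look like the following sketch, where `is_partition` is a hypothetical helper and the labels are toy values rather than real physicists.

```python
from itertools import combinations


def is_partition(full, parts):
    """Return True if parts are pairwise disjoint and together cover full."""
    part_sets = [set(part) for part in parts]
    pairwise_disjoint = all(
        a.isdisjoint(b) for a, b in combinations(part_sets, 2))
    covers_all = set().union(*part_sets) == set(full)
    return pairwise_disjoint and covers_all


# Toy labels standing in for physicist identifiers.
everyone = ['a', 'b', 'c', 'd', 'e']
good_split = is_partition(everyone, [['a', 'b'], ['c'], ['d', 'e']])
overlapping_split = is_partition(everyone, [['a', 'b'], ['b', 'c'], ['d', 'e']])
incomplete_split = is_partition(everyone, [['a', 'b'], ['c'], ['d']])
```

The notebook compares `fullName` values. Passing index labels to a helper like this instead would also guard against two different physicists sharing the same name.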

OK everything looks good. So what percentage of the data is in each of the dataframes?

In [None]:
training_fraction = len(train_physicists) / len(physicists)
validation_fraction = len(validation_physicists) / len(physicists)
test_fraction = len(test_physicists) / len(physicists)
pd.Series(data=[round(100 * training_fraction, 1), round(100 * validation_fraction, 1),
                round(100 * test_fraction, 1)], index=['Training %', 'Validation %', 'Test %'])

This looks like a healthy enough split to proceed with.
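For reference, the percentages follow directly from the set sizes asserted earlier in this notebook (542, 192 and 193 out of 927). The arithmetic can be reproduced without pandas as in this small sketch:

```python
# Set sizes taken from the assertions earlier in the notebook.
counts = {'Training %': 542, 'Validation %': 192, 'Test %': 193}
total = sum(counts.values())

# Percentage of the full list in each set, rounded to one decimal place.
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}
# percentages == {'Training %': 58.5, 'Validation %': 20.7, 'Test %': 20.8}
```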

## Persisting the Data

Now that we have the training, validation and test dataframes, let's persist them for later analysis by writing their contents to CSV files.

In [None]:
train_physicists.to_csv('../data/processed/train-physicists-from-1901.csv', index=False)
validation_physicists.to_csv('../data/processed/validation-physicists-from-1901.csv', index=False)
test_physicists.to_csv('../data/processed/test-physicists-from-1901.csv', index=False)
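One thing to keep in mind when loading these files later is that the dates are stored as text. The following sketch illustrates this round trip on a toy dataframe using an in-memory buffer instead of the real files. The second row is a made-up living physicist, and the re-parsing step is just one possible approach.

```python
import io
from datetime import date, datetime

import pandas as pd

# Toy dataframe with date objects and one missing death date (second row is hypothetical).
toy = pd.DataFrame({
    'fullName': ['Niels Bohr', 'Hypothetical Living Physicist'],
    'birthDate': [date(1885, 10, 7), date(1950, 1, 1)],
    'deathDate': [date(1962, 11, 18), None],
})

# Write to an in-memory buffer and read it back, as a stand-in for a CSV file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, dates are plain strings and the missing death date is NaN,
# so they need to be parsed again before doing any date arithmetic.
reparsed_birth = reloaded.birthDate.apply(
    lambda d: datetime.strptime(d, '%Y-%m-%d').date())
```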