# Build Target

As a recap, the [training data](../data/processed/train-physicists-from-1901.csv), [validation data](../data/processed/validation-physicists-from-1901.csv) and [test data](../data/processed/test-physicists-from-1901.csv) contain information on physicists who were eligible to receive a Nobel Prize in Physics. That is, they were alive on or after 10 December 1901, the date the prize was first awarded.

All of the physicists in the training data are deceased and all of the physicists in the validation and test data are alive. Recall that the Nobel Prize in Physics cannot be awarded posthumously and that one of the goals of this project is to try to predict the next Physics Nobel Laureates. The data was purposely sampled in this way so that the training set can be used to build models that predict whether a living physicist is likely to be awarded the Nobel Prize in Physics.

It is time to use the training, validation and test data, along with the [Nobel Physics Laureates](../data/raw/nobel-physics-prize-laureates.csv) data, to create the target that indicates whether a physicist is a Physics Nobel Laureate.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

%matplotlib inline

## Reading in the Data

First let's read in the training, validation and test data and the list of Nobel Physics Laureates.

In [None]:
train_physicists = pd.read_csv('../data/processed/train-physicists-from-1901.csv')
train_physicists.head()

In [None]:
validation_physicists = pd.read_csv('../data/processed/validation-physicists-from-1901.csv')
validation_physicists.head()

In [None]:
# The double t in the "ttest_" variable names in this file is deliberate. When `ipytest` cleans the
# tests it deletes ANY object in the global namespace that begins with "test_", not just functions.
ttest_physicists = pd.read_csv('../data/processed/test-physicists-from-1901.csv')
ttest_physicists.head()

In [None]:
nobel_physicists = pd.read_csv('../data/raw/nobel-physics-prize-laureates.csv')
nobel_physicists.head()

## Creating the Target

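It is now time to create the target from our data. A physicist is labeled `yes` if their full name appears in the list of Physics Nobel Laureates and `no` otherwise.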

In [None]:
def build_target(full_name, laureate):
    """Build the target variable indicating whether the physicist is a Nobel Laureate or not.

    Args:
        full_name (pandas.Series): Full name of physicist.
        laureate (pandas.Series): Full name of Physics Nobel Laureate.

    Returns:
        pandas.Series: Target variable indicating whether the physicist is a Nobel Laureate or not.
    """
    target = full_name.to_frame(name='full_name')
    # Flag each physicist whose full name exactly matches a laureate, then label as 'yes' / 'no'.
    target['physics_laureate'] = target.full_name.apply(
        lambda name: name in laureate.values).map({True: 'yes', False: 'no'})
    # Return a Series of labels indexed by the physicist's full name.
    target = target.set_index('full_name')['physics_laureate']
    return target
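
The membership test in `build_target` can also be written without `apply`. Below is a minimal sketch of that idea using `Series.isin`. The function name `build_target_isin` and the toy names are hypothetical and not part of this project. The two approaches may also treat missing names differently, so this is an illustration rather than a drop-in replacement.

```python
import pandas as pd

# Toy data: these names are made up for illustration only.
toy_names = pd.Series(['Ada Example', 'Bob Sample', 'Cy Placeholder'], name='fullName')
toy_laureates = pd.Series(['Bob Sample', 'Dee Nobody'], name='Laureate')


def build_target_isin(full_name, laureate):
    """Hypothetical vectorized variant of build_target based on Series.isin."""
    # True where the physicist's full name exactly matches a laureate's name.
    is_laureate = full_name.isin(laureate)
    target = is_laureate.map({True: 'yes', False: 'no'})
    # Index the labels by full name, mirroring the shape of the original target.
    target.index = pd.Index(full_name, name='full_name')
    target.name = 'physics_laureate'
    return target


toy_target = build_target_isin(toy_names, toy_laureates)
print(toy_target)
```

Both versions rely on exact string matches, so any difference in spelling between the two name columns would produce a `no`.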

In [None]:
train_target = build_target(train_physicists.fullName, nobel_physicists.Laureate)
assert len(train_target) == len(train_physicists)
assert isinstance(train_target, pd.Series)
assert (train_target == 'yes').sum() == 123
assert train_target.notna().all()
train_target.head()

In [None]:
validation_target = build_target(validation_physicists.fullName, nobel_physicists.Laureate)
assert len(validation_target) == len(validation_physicists)
assert isinstance(validation_target, pd.Series)
assert (validation_target == 'yes').sum() == 41
assert validation_target.notna().all()
validation_target.head()

In [None]:
ttest_target = build_target(ttest_physicists.fullName, nobel_physicists.Laureate)
assert len(ttest_target) == len(ttest_physicists)
assert isinstance(ttest_target, pd.Series)
assert (ttest_target == 'yes').sum() == 42
assert ttest_target.notna().all()
ttest_target.head()

So what percentage of the physicists in each of the datasets are laureates?

In [None]:
training_fraction = (train_target == 'yes').sum() / len(train_target)
validation_fraction = (validation_target == 'yes').sum() / len(validation_target)
ttest_fraction = (ttest_target == 'yes').sum() / len(ttest_target)
laureate_fraction = pd.Series(
    data=[round(100 * training_fraction, 1), round(100 * validation_fraction, 1),
          round(100 * ttest_fraction, 1)],
    index=['Training', 'Validation', 'Test']
)
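
The percentage calculation above divides a count of `yes` labels by the length of the series. As a small sketch of the same arithmetic, the mean of a boolean series gives that fraction directly. The toy series below is made up for illustration and is not the project's data.

```python
import pandas as pd

# Made-up target for illustration only: one laureate among four physicists.
toy_target = pd.Series(['yes', 'no', 'no', 'no'], name='physics_laureate')

# The mean of a boolean Series is the fraction of True values.
toy_percentage = round(100 * (toy_target == 'yes').mean(), 1)

# value_counts(normalize=True) gives the fraction of every class at once.
toy_class_fractions = toy_target.value_counts(normalize=True)

print(toy_percentage)
print(toy_class_fractions)
```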

In [None]:
ax = laureate_fraction.plot(kind='bar', title='Percentage of Laureates')
ax.set_ylabel('%', labelpad=10, rotation='horizontal')
ax.set_yticks(ticks=np.linspace(0, 40, num=5))
ax.tick_params(left=False, bottom=False)
plt.xticks(rotation=0)
plt.box(False)

The proportion of laureates looks well balanced across the three datasets. There are no real surprises here: as expected, each dataset contains more non-laureates than laureates. Because of this class imbalance, an appropriate metric for selecting and evaluating models will need to be chosen.
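
To illustrate why the choice of metric matters under class imbalance, here is a minimal sketch with made-up labels (not the project's data). A hypothetical baseline that always predicts `no` scores well on accuracy while finding no laureates at all.

```python
import pandas as pd

# Made-up labels for illustration: 2 laureates among 10 physicists.
y_true = pd.Series(['yes', 'no', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'no'])

# Hypothetical baseline that always predicts the majority class.
y_pred = pd.Series(['no'] * len(y_true))

# Accuracy: fraction of predictions that match the true label.
accuracy = (y_true == y_pred).mean()

# Recall for the 'yes' class: fraction of true laureates that were predicted 'yes'.
true_positives = ((y_true == 'yes') & (y_pred == 'yes')).sum()
recall = true_positives / (y_true == 'yes').sum()

print(accuracy, recall)
```

This is only an illustration of the problem. Which metric suits this project is a separate decision.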

## Persisting the Data

Now that we have the training, validation and test target series, we will persist them for future use.

In [None]:
train_target.to_csv('../data/processed/train-target.csv', header=True)
validation_target.to_csv('../data/processed/validation-target.csv', header=True)
ttest_target.to_csv('../data/processed/test-target.csv', header=True)
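
For later use, a persisted target can be read back into a series indexed by full name. The sketch below shows one way to do this, assuming the column names `full_name` and `physics_laureate` produced by `build_target`. It uses an in-memory buffer and toy names in place of the real files.

```python
import io

import pandas as pd

# Toy target with the same index and series names that build_target produces.
toy_target = pd.Series(
    ['yes', 'no'],
    index=pd.Index(['Ada Example', 'Bob Sample'], name='full_name'),
    name='physics_laureate',
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_target.to_csv(buffer, header=True)
buffer.seek(0)

# Read it back, restoring the index, and select the single label column.
restored = pd.read_csv(buffer, index_col='full_name')['physics_laureate']
print(restored)
```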