# Build Target

As a recap, the [training data](../data/processed/train-physicists-from-1901.csv) and [test data](../data/processed/test-physicists-from-1901.csv) contain information on physicists who were eligible to receive a Nobel Prize in Physics. That is, they were alive on or after 10 December 1901, the date the prize was first awarded.

All of the physicists in the training data are deceased and all of the physicists in the test data are alive (to within the last 6-18 months, since this is the approximate length of time that DBpedia data lags behind Wikipedia articles). The data was purposely sampled in this way because one of the goals of this project is to try to predict the next Physics Nobel Laureate(s): the aim is to use the training set to build models that predict whether a physicist who is still alive has been awarded or is likely to be awarded the *Nobel Prize in Physics*.

It is finally time to use the training and test data, along with the [Nobel Physics Laureates](../data/raw/nobel-physics-prize-laureates.csv) collected, in order to create the target which indicates whether a physicist is a *Nobel Laureate in Physics*.

In [None]:
import pandas as pd

## Reading in the Data

First let's read in the training and test data and the list of Nobel Physics laureates.

In [None]:
train_physicists = pd.read_csv(
    '../data/processed/train-physicists-from-1901.csv')
train_physicists.head()

In [None]:
test_physicists = pd.read_csv(
    '../data/processed/test-physicists-from-1901.csv')
test_physicists.head()

In [None]:
nobel_physicists = pd.read_csv(
    '../data/raw/nobel-physics-prize-laureates.csv')
nobel_physicists.head()

## Creating the Target

It is now time to create the target from the data I have collected.

In [None]:
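def build_target(full_name, laureate):
    """Flag each name as 'yes' if it is in the laureate list, else 'no'."""
    # Use a new name for the result rather than rebinding the `laureate`
    # argument that the lambda still reads from.
    target = full_name.apply(
        lambda name: name in laureate.values).map({True: 'yes', False: 'no'})
    target.name = 'physics_laureate'
    return target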
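
The same membership test can also be expressed with `Series.isin`, which avoids calling a Python lambda for every row. The following is only a minimal sketch on made-up toy data, not the project's CSV files; `build_target_isin` is a hypothetical name introduced here for illustration.

```python
import pandas as pd


def build_target_isin(full_name, laureate):
    """Hypothetical vectorized variant of build_target using Series.isin."""
    target = full_name.isin(laureate).map({True: 'yes', False: 'no'})
    target.name = 'physics_laureate'
    return target


# Toy data for illustration only (not read from the project's CSV files).
toy_names = pd.Series(['Marie Curie', 'Lise Meitner', 'Albert Einstein'])
toy_laureates = pd.Series(['Albert Einstein', 'Marie Curie'])

toy_target = build_target_isin(toy_names, toy_laureates)
print(toy_target.tolist())  # ['yes', 'no', 'yes']
```

Both forms rely on exact string equality, so the result is only as good as the agreement between the name spellings in the two sources.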

In [None]:
train_target = build_target(train_physicists.fullName, nobel_physicists.Laureate)
assert((len(train_target) == len(train_physicists)))
assert(isinstance(train_target, pd.core.series.Series))
assert((train_target == 'yes').sum() == 123)
train_target.head()

In [None]:
test_target = build_target(test_physicists.fullName, nobel_physicists.Laureate)
assert(len(test_target) == len(test_physicists))
assert(isinstance(test_target, pd.core.series.Series))
assert((test_target == 'yes').sum() == 83)
test_target.head()
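
Because the target depends on exact name matches, a laureate whose name is spelled differently in the two sources would silently be labeled `no`. The asserted counts guard against this. A complementary check, sketched here on made-up toy data, is to list the laureate names that match no physicist at all; `unmatched_laureates` is a hypothetical helper and the deliberate spelling mismatch is an assumption for illustration.

```python
import pandas as pd


def unmatched_laureates(laureate, *name_series):
    """Return laureate names that appear in none of the given name series."""
    all_names = pd.concat(list(name_series), ignore_index=True)
    return laureate[~laureate.isin(all_names)]


# Toy data for illustration only; the spelling mismatch is deliberate.
toy_train_names = pd.Series(['Marie Curie', 'Lise Meitner'])
toy_test_names = pd.Series(['Donna Strickland'])
toy_laureates = pd.Series(['Marie Curie', 'Donna T. Strickland'])

missing = unmatched_laureates(toy_laureates, toy_train_names, toy_test_names)
print(missing.tolist())  # ['Donna T. Strickland']
```

An empty result would suggest that every laureate name was found in one of the two data sets.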

## Persisting the Data

Now that I have the training and test target series, I'll persist them for future use.

In [None]:
train_target.to_csv('../data/processed/train-target.csv',
                    index=False, header=True)
test_target.to_csv('../data/processed/test-target.csv',
                   index=False, header=True)

Let's perform a quick sanity check to make sure the data is as expected.

In [None]:
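# read_csv's `squeeze` argument was removed in pandas 2.0, so squeeze the
# single-column DataFrame into a Series after reading instead.
train_target_on_disk = pd.read_csv(
    '../data/processed/train-target.csv').squeeze('columns')
assert(train_target_on_disk.equals(train_target))
test_target_on_disk = pd.read_csv(
    '../data/processed/test-target.csv').squeeze('columns')
assert(test_target_on_disk.equals(test_target))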
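
The write-then-read round trip can also be tried without touching the disk. This is a minimal sketch that assumes an in-memory `io.StringIO` buffer in place of the CSV file and a made-up toy series in place of the real targets.

```python
import io

import pandas as pd

# Toy target for illustration only.
toy_target = pd.Series(['yes', 'no', 'yes'], name='physics_laureate')

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy_target.to_csv(buffer, index=False, header=True)
buffer.seek(0)

# read_csv returns a one-column DataFrame; squeeze('columns') turns it
# into a Series.
toy_target_round_trip = pd.read_csv(buffer).squeeze('columns')

assert toy_target_round_trip.equals(toy_target)
assert toy_target_round_trip.name == 'physics_laureate'
```

Writing with `header=True` is what lets the series name survive the round trip, since it is stored as the column header.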