# Build Features

As a recap, the [training data](../data/processed/train-physicists-from-1901.csv) and [test data](../data/processed/test-physicists-from-1901.csv) contain information on physicists who were eligible to receive a Nobel Prize in Physics. That is, they were alive on and after 10 December 1901, the date the prize was first awarded. 

All of the physicists in the training data are deceased and all the physicists in the test data are alive (up to the last 6-18 months, since this is approximately how far DBpedia data lags behind Wikipedia articles). The data was purposely sampled in this way because one of the goals of this project is to try to predict the next Physics Nobel Laureate(s): the aim is to use the training set to build models that predict whether a physicist who is still alive has been awarded or is likely to be awarded the *Nobel Prize in Physics*.

It is finally time to use the training and test data, along with the other various pieces of data ([Nobel Physics Laureates](../data/raw/nobel-physics-prize-laureates.csv), [Nobel Chemistry Laureates](../data/raw/nobel-chemistry-prize-laureates.csv), [Places](../data/processed/places.csv) and [Countries](../data/processed/Countries-List.csv)) that I have collected, in order to create features that may help in predicting *Nobel Laureates in Physics*.

## Setting up the Environment

An initialization step is needed to set up the environment:
- The locale needs to be set for all categories to the user’s default setting (typically specified in the LANG environment variable) to enable correct sorting of words with accents.

In [None]:
import locale
    
locale.setlocale(locale.LC_ALL, '')
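
As a small illustration of why the locale matters, the sketch below compares Python's default code-point ordering with a locale-aware sort using `locale.strxfrm`. The place names are made up for the example, and the `try` / `except` fallback is an assumption added so the snippet also runs on machines without a usable default locale. The locale-aware order depends on the locale in effect, so no particular result is claimed for it.

```python
import locale

# Use the user's default locale; fall back to the current ("C") locale if the
# environment does not provide a usable one (an assumption for portability).
try:
    locale.setlocale(locale.LC_ALL, '')
except locale.Error:
    pass

names = ['Zürich', 'École Polytechnique', 'Aarhus', 'Österreich']

# Default sorting compares Unicode code points, so accented capitals
# such as 'É' and 'Ö' end up after 'Z'.
by_code_point = sorted(names)

# Locale-aware sorting transforms each string with the collation rules
# of the active locale before comparing.
by_locale = sorted(names, key=locale.strxfrm)

print(by_code_point)
print(by_locale)
```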

In [None]:
from datetime import datetime

import numpy as np
import pandas as pd
from pycountry_convert import country_alpha2_to_country_name
from pycountry_convert import country_name_to_country_alpha3
from pycountry_convert import country_alpha2_to_continent_code
from pycountry_convert import country_alpha3_to_country_alpha2
from sklearn.preprocessing import MultiLabelBinarizer

from src.data.country_utils import nationality_to_alpha2_code 

## Reading in the Data

First let's read in the training and test data and the list of Nobel Physics laureates.

In [None]:
train_physicists = pd.read_csv(
    '../data/processed/train-physicists-from-1901.csv')
train_physicists.head()

In [None]:
test_physicists = pd.read_csv(
    '../data/processed/test-physicists-from-1901.csv')
test_physicists.head()

In [None]:
nobel_physicists = pd.read_csv(
    '../data/raw/nobel-physics-prize-laureates.csv')
nobel_physicists.head()

There are some variants of the names in the training and test data. Since I'll be searching for whether academic advisors, students, spouses, children etc. of a physicist are physics laureates, it's convenient to merge the `name` field into the *Nobel Physicists* dataframe.

In [None]:
nobel_columns = ['Year', 'Laureate', 'name', 'Country', 'Rationale']
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
nobel_physicists = pd.merge(nobel_physicists,
                            pd.concat([train_physicists, test_physicists]),
                            how='left', left_on='Laureate',
                            right_on='fullName')[nobel_columns]
nobel_physicists.head()
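
To make the effect of the left merge concrete, here is a minimal sketch on toy data. The two small dataframes and the `name` variant are illustrative only and are not taken from the project files. Every laureate row is kept, and `name` is `NaN` where no physicist has a matching `fullName`.

```python
import pandas as pd

# Toy laureates table (illustrative rows only)
laureates = pd.DataFrame({
    'Year': [1903, 1921, 1918],
    'Laureate': ['Marie Curie', 'Albert Einstein', 'Max Planck'],
})

# Toy physicists table: `name` may be a variant of `fullName`
physicists = pd.DataFrame({
    'fullName': ['Albert Einstein', 'Marie Curie'],
    'name': ['Albert Einstein', 'Marie Skłodowska Curie'],
})

# A left merge keeps every laureate and attaches `name` where
# Laureate == fullName; unmatched laureates get NaN
merged = pd.merge(laureates, physicists, how='left',
                  left_on='Laureate',
                  right_on='fullName')[['Year', 'Laureate', 'name']]
print(merged)
```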

Now let's read in the list of Nobel Chemistry laureates.

In [None]:
nobel_chemists = pd.read_csv(
    '../data/raw/nobel-chemistry-prize-laureates.csv')
nobel_chemists.head()

Again, I'll be searching for whether academic advisors, students, spouses, children etc. of a physicist are chemistry laureates. So for convenience it's useful to merge the `name` field into the *Nobel Chemists* dataframe.

In [None]:
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
nobel_chemists = pd.merge(nobel_chemists,
                          pd.concat([train_physicists, test_physicists]),
                          how='left', left_on='Laureate',
                          right_on='fullName')[nobel_columns]
nobel_chemists.head()

These are essentially physicists who are *Chemistry Nobel Laureates*. Surprisingly, there are quite a few of them. Of course, as noted previously, *Marie Curie* is the only double laureate in Physics and Chemistry.

In [None]:
nobel_chemists[nobel_chemists.name.notna()].name

It is worth noting that if there are alternative names of Chemistry Nobel Laureates in the physicists dataframe other than those above, they will *not* be found. However, I do not expect many of these, as at one point I retrieved all the redirected URLs for the names and very few were associated with laureates. In fact, you can still see that some of these names are present in the [DBpedia redirects](../data/raw/dbpedia-redirects.csv) (e.g. search for "Marie_Curie"). The reason I removed the imputing of these redirects for names when processing the physicist data earlier was that a few of them were plain wrong. For instance, Richard Feynman's children redirect back to him! (E.g. search for "Carl_Feynman" and "Michelle_Feynmann" in the DBpedia redirects, or try http://dbpedia.org/page/Carl_Feynman or http://dbpedia.org/page/Michelle_Feynmann directly in your browser.)

Now, let's read the places and nationalities data into dataframes. It's important at this point to turn off the default behavior of *pandas*, which is to treat the string literal 'NA' as a missing value. In the dataset, 'NA' is both the continent code of North America and the ISO 3166 alpha-2 country code of Namibia. With this default turned off, *pandas* reads genuinely missing values as empty strings, so I then have to convert those back to `NaN`.

In [None]:
places = pd.read_csv('../data/processed/places.csv',
                     keep_default_na=False)
places = places.replace('', np.nan)
assert(all(places[
    places.countryAlpha3Code == 'USA']['continentCode'].values == 'NA'))
places.head()

In [None]:
nationalities = pd.read_csv('../data/processed/Countries-List.csv',
                            keep_default_na=False)
nationalities = nationalities.replace('', np.nan)
assert(nationalities[
    nationalities.Name == 'Namibia']['ISO 3166 Code'].values == 'NA')
nationalities.head()
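
The 'NA' pitfall is easy to reproduce on a tiny in-memory CSV. The sketch below uses made-up rows (not the project's data files) to show that the default parser turns the North America code into a missing value, whereas `keep_default_na=False` followed by replacing empty strings keeps 'NA' and still marks the empty cell as missing.

```python
import io

import numpy as np
import pandas as pd

# Toy CSV: one row uses the continent code 'NA', one row has no codes at all
csv_text = (
    'fullName,countryAlpha3Code,continentCode\n'
    'Windhoek,NAM,AF\n'
    'Chicago,USA,NA\n'
    'Unknown place,,\n'
)

# Default behavior: the literal string 'NA' is parsed as a missing value
default = pd.read_csv(io.StringIO(csv_text))

# Keep 'NA' as a string, then turn the empty strings back into NaN
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
kept = kept.replace('', np.nan)

print(default.continentCode.isna().tolist())  # [False, True, True]
print(kept.continentCode.isna().tolist())     # [False, False, True]
```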

Finally, with all the data read in, I can now move on to the bulk of the work, which is creating the features.

## Creating the Features

It is now time to create the features from the data I have collected. The *features* I am going to create are listed in the table below along with their *type* and *description*. The features can be grouped into three main groups, with the bulk of features falling in the first group, then the second group and so on:

1. Features related to *professional and personal relationships* that the physicists have to *physics or chemistry laureates*, *educational institutions*, *work institutions*, *countries* / *continents*.

2. Features related to the subfield of focus of the physicist denoting whether s/he is an *experimental physicist*, *theoretical physicist* and / or an *astronomer*.

3. Features related to personal characteristics of the physicist, namely, *gender* and *number of years lived*.

Remember that in the first group, there are people and institutions from different countries / continents that are directly involved in the [selection and voting process for the Nobel Prize in Physics](https://www.nobelprize.org/nomination/physics/) and therefore have a direct influence on those who become laureates. The second group is connected to subjective biases that may or may not exist concerning the major subfield of research of the physicist, whilst the third group is connected to subjective biases that may or may not exist concerning the gender and age of a physicist, although age is also related to the invention or discovery "standing the test of time".

| Feature                                  | Type        | Description                                         |
| :---:                                    | :---:       | :---:                                               |
| alma_mater                               | Categorical | List of universities attended                       |
| alma_mater_continent_codes               | Categorical | List of continent codes of universities attended    |
| alma_mater_country_alpha_3_codes         | Categorical | List of country codes of universities attended      |
| birth_continent_codes                    | Categorical | List of continent codes of birth countries          |
| birth_country_alpha_3_codes              | Categorical | List of country codes of birth countries            |
| citizenship_continent_codes              | Categorical | List of continent codes of countries of citizenship |
| citizenship_country_alpha_3_codes        | Categorical | List of country codes of citizenship                |
| death_continent_codes                    | Categorical | List of continent codes of death countries          |
| death_country_alpha_3_codes              | Categorical | List of country codes of death countries            |
| gender                                   | Binary      | Gender of physicist (male / female)                 |
| is_astronomer                            | Binary      | Is the physicist an astronomer? (yes / no)          |
| is_experimental_physicist                | Binary      | Is the physicist an experimental physicist? (yes / no) |
| is_theoretical_physicist                 | Binary      | Is the physicist a theoretical physicist? (yes / no) |
| num_alma_mater                           | Count       | No. of universities attended                        |
| num_alma_mater_continent_codes           | Count       | No. of continent codes of universities attended     |
| num_alma_mater_country_alpha_3_codes     | Count       | No. of country codes of universities attended       |
| num_birth_continent_codes                | Count       | No. of continent codes of birth countries           |
| num_birth_country_alpha_3_codes          | Count       | No. of birth country codes                          | 
| num_chemistry_laureate_academic_advisors | Count       | No. of chemistry laureate academic advisors         |
| num_chemistry_laureate_children          | Count       | No. of chemistry laureate children                  |
| num_chemistry_laureate_doctoral_advisors | Count       | No. of chemistry laureate doctoral advisors         |
| num_chemistry_laureate_doctoral_students | Count       | No. of chemistry laureate doctoral students         |
| num_chemistry_laureate_influenced        | Count       | No. of chemistry laureates the physicist influenced |
| num_chemistry_laureate_influenced_by     | Count       | No. of chemistry laureates the physicist was influenced by | 
| num_chemistry_laureate_notable_students  | Count       | No. of chemistry laureate notable students          |
| num_chemistry_laureate_parents           | Count       | No. of chemistry laureate parents                   |
| num_chemistry_laureate_spouses           | Count       | No. of chemistry laureate spouses                   | 
| num_citizenship_continent_codes          | Count       | No. of continent codes of countries of citizenship  |
| num_citizenship_country_alpha_3_codes    | Count       | No. of country codes of citizenship                 |
| num_death_continent_codes                | Count       | No. of continent codes of death countries           |
| num_death_country_alpha_3_codes          | Count       | No. of country codes of death countries             |
| num_physics_laureate_academic_advisors   | Count       | No. of physics laureate academic advisors           |
| num_physics_laureate_children            | Count       | No. of physics laureate children                    |
| num_physics_laureate_doctoral_advisors   | Count       | No. of physics laureate doctoral advisors           |
| num_physics_laureate_doctoral_students   | Count       | No. of physics laureate doctoral students           |
| num_physics_laureate_influenced          | Count       | No. of physics laureates the physicist influenced   |
| num_physics_laureate_influenced_by       | Count       | No. of physics laureates the physicist was influenced by |
| num_physics_laureate_notable_students    | Count       | No. of physics laureate notable students            |
| num_physics_laureate_parents             | Count       | No. of physics laureate parents                     |
| num_physics_laureate_spouses             | Count       | No. of physics laureate spouses                     |
| num_residence_continent_codes            | Count       | No. of continent codes of residence countries       |
| num_residence_country_alpha_3_codes      | Count       | No. of residence country codes                      |
| num_workplaces                           | Count       | No. of workplaces                                   |
| num_workplaces_continent_codes           | Count       | No. of continent codes of countries of workplaces   |
| num_workplaces_country_alpha_3_codes     | Count       | No. of country codes of countries worked in         |
| num_years_lived                          | Count       | No. of years lived (equals age or age of death)     |
| residence_continent_codes                | Categorical | List of continent codes of countries of residence   |
| residence_country_alpha_3_codes          | Categorical | List of country codes of countries of residence     |
| workplaces                               | Categorical | List of workplaces                                  |
| workplaces_continent_codes               | Categorical | List of continent codes of countries worked in      |
| workplaces_country_alpha_3_codes         | Categorical | List of country codes of countries worked in        |

Some comments are also warranted with regard to the types of the feature variables. As you can see, there are three types of variables, with the bulk falling in the first group, then the second group and so on:

1. **Count** variables of a *discrete*, *quantitative* nature.

2. **Categorical** variables of a qualitative nature.

3. **Binary** (**dichotomous**) variables of a categorical nature.

The categorical variables are all lists of varying lengths of *places* and therefore are not in an appropriate form for machine learning. Once I create them, I will *one-hot encode* them into binary variables and discard the lists. You may ask why the one-hot encoding uses categorical yes / no values rather than 0 / 1 values. It is because the algorithms I will be processing the data with would treat 0 / 1 values as quantitative in nature, which is clearly not what is desired. Essentially, I will be left with just two variable types: binary variables and counts. OK, time to go ahead and create the features.

In [None]:
def build_features(physicists, nobel_physicists, nobel_chemists,
                   places, nationalities):
    """Build features for the physicists.

    Args:
        physicists (pandas.DataFrame): Physicists dataframe.
        nobel_physicists (pandas.DataFrame): Nobel Physics
            Laureate dataframe.
        nobel_chemists (pandas.DataFrame): Nobel Chemistry
            Laureate dataframe.
        places (pandas.DataFrame): Places dataframe.
        nationalities (pandas.DataFrame): Nationalities dataframe.

    Returns:
        pandas.DataFrame: Features dataframe.
    """
    
    features = physicists.copy()[['fullName', 'name', 'gender']].rename(
        mapper={'fullName': 'full_name'}, axis='columns')
    features['num_years_lived'] = _build_num_years_lived(physicists.birthDate,
                                                         physicists.deathDate)
    
    _build_physics_subfield_features(features, physicists)
    _build_num_laureates_features(features, physicists,
                                  nobel_physicists, nobel_chemists)
    
    _build_citizenship_features(features, physicists, nationalities)
    
    _build_places_features(features, physicists, places)
    
    features = _binarize_list_features(features)

    features = features.drop('name', axis='columns')
    return features


def _build_physics_subfield_features(features, physicists):
    features_to_build = {
        'is_theoretical_physicist': {'categories': 'Theoretical physicists',
                                     'others': 'theoretical physic'},
        'is_experimental_physicist': {'categories': 'Experimental physicists',
                                      'others': 'experimental physic'},
        'is_astronomer': {'categories': 'astronomers',
                          'others': 'astronom'}
    }
    
    for feature, search_terms in features_to_build.items():
        features[feature] = _build_physics_subfield(
            physicists.categories, physicists.field, physicists.description,
            physicists.comment, search_terms=search_terms)
    


def _build_num_laureates_features(features, physicists, nobel_physicists,
                                  nobel_chemists):
    features_to_build = {
        'laureate_academic_advisors': 'academicAdvisor',
        'laureate_doctoral_advisors': 'doctoralAdvisor',
        'laureate_doctoral_students': 'doctoralStudent',
        'laureate_notable_students': 'notableStudent',
        'laureate_children': 'child',
        'laureate_parents': 'parent',
        'laureate_spouses': 'spouse',
        'laureate_influenced': 'influenced',
        'laureate_influenced_by': 'influencedBy'
    }
    
    for feature, relation in features_to_build.items():
        features['num_physics_' + feature] = _build_num_laureates(
            physicists[relation], nobel_physicists.Laureate, nobel_physicists.name)
        features['num_chemistry_' + feature] = _build_num_laureates(
            physicists[relation], nobel_chemists.Laureate, nobel_chemists.name)


    
def _build_places_features(features, physicists, places):
    features_to_build = {
        'birth_country_alpha_3_codes': 'birthPlace',
        'birth_continent_codes': 'birthPlace',
        'death_country_alpha_3_codes': 'deathPlace',
        'death_continent_codes': 'deathPlace',
        'residence_country_alpha_3_codes': 'residence',
        'residence_continent_codes': 'residence',
        'alma_mater': 'almaMater',
        'alma_mater_country_alpha_3_codes': 'almaMater',
        'alma_mater_continent_codes': 'almaMater',
        'workplaces': 'workplaces',
        'workplaces_country_alpha_3_codes': 'workplaces',
        'workplaces_continent_codes': 'workplaces'
    }
    
    for feature, place in features_to_build.items():
        code = 'countryAlpha3Code'
        if 'continent' in feature:
            code = 'continentCode'
            
        if feature in ['alma_mater', 'workplaces']:
            features[feature] = physicists[place].apply(
                _get_alma_mater_or_workplaces)           
        else:
            features[feature] = _build_places_codes(
                physicists[place], places.fullName, places[code])
        features['num_' + feature] = features[feature].apply(len)


    
def _build_citizenship_features(features, physicists, nationalities):
    citizenship = physicists.citizenship.apply(
        _get_citizenship_codes, args=(nationalities,))
    nationality = physicists.nationality.apply(
        _get_citizenship_codes, args=(nationalities,))
    citizenship_description = physicists.description.apply(
        _get_citizenship_codes, args=(nationalities,))
    features['citizenship_country_alpha_3_codes'] = (
        (citizenship + nationality + citizenship_description).apply(
            lambda ctz: list(sorted(set(ctz)))))
    features['num_citizenship_country_alpha_3_codes'] = (
        features.citizenship_country_alpha_3_codes.apply(len))
    features['citizenship_continent_codes'] = (
        features.citizenship_country_alpha_3_codes.apply(
            lambda al3: list(sorted({country_alpha2_to_continent_code(
                country_alpha3_to_country_alpha2(cd)) for cd in al3}))))
    features['num_citizenship_continent_codes'] = (
        features.citizenship_continent_codes.apply(len))



def _binarize_list_features(features):
    # union of places and citizenship (without the counts)
    series_to_binarize = {
        'birth_country_alpha_3_codes': 'born_in_',
        'birth_continent_codes': 'born_in_',
        'death_country_alpha_3_codes': 'died_in_',
        'death_continent_codes': 'died_in_',
        'residence_country_alpha_3_codes': 'lived_in_',
        'residence_continent_codes': 'lived_in_',
        'alma_mater': 'alumnus_of_',
        'alma_mater_country_alpha_3_codes': 'alumnus_in_',
        'alma_mater_continent_codes': 'alumnus_in_',
        'workplaces': 'worked_at_',
        'workplaces_country_alpha_3_codes': 'worked_in_',
        'workplaces_continent_codes': 'worked_in_',
        'citizenship_country_alpha_3_codes': 'citizen_of_',
        'citizenship_continent_codes': 'citizen_in_'
    }
        
    for series, prefix in series_to_binarize.items():
        binarized = _binarize_list_feature(features[series], prefix)
        features = features.drop(series, axis='columns').join(binarized)
    return features
    
    

def _build_num_years_lived(birth_date, death_date):
    death_date_no_nan = death_date.apply(_date_no_nan)
    birth_date_no_nan = birth_date.apply(_date_no_nan)
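    # recent pandas versions reject the ambiguous 'Y' timedelta unit, so divide
    # the number of days by the mean Gregorian year length that 'Y' stood for
    years_lived = ((pd.to_datetime(death_date_no_nan) -
                    pd.to_datetime(birth_date_no_nan)).dt.days / 365.2425)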
    return years_lived.astype('int64')


def _build_physics_subfield(categories, field, description, comment, search_terms):
    cat_theoretical_physicist = categories.apply(
        lambda cat: search_terms['categories'] in cat)
    field_theoretical_physicist = field.apply(
        lambda fld: search_terms['others'] in fld.lower() if isinstance(fld, str)
        else False)
    desc_theoretical_physicist = description.apply(
        lambda desc: search_terms['others'] in desc.lower() if isinstance(desc, str)
        else False)
    comm_theoretical_physicist = comment.apply(
        lambda comm: search_terms['others'] in comm.lower() if isinstance(comm, str)
        else False)
    subfield = (cat_theoretical_physicist |
                field_theoretical_physicist |
                desc_theoretical_physicist |
                comm_theoretical_physicist)
    subfield = subfield.apply(lambda val: 'yes' if val == True else 'no')
    return subfield


def _binarize_list_feature(series, prefix):
    mlb = MultiLabelBinarizer()
    binarized = pd.DataFrame(
        mlb.fit_transform(series),
        columns=[prefix + class_.replace(' ', '_') for class_ in mlb.classes_],
        index=series.index)
    binarized = binarized.applymap(lambda val: 'yes' if val == 1 else 'no')
    return binarized
    


def _build_num_laureates(series, laureates, names):
    laureate_names = series.apply(_get_nobel_laureates,
                                  args=(laureates, names))
    return laureate_names.apply(len)


def _build_places_codes(places_in_physicists, full_name_in_places, places_codes):
    codes = places_in_physicists.apply(_get_places_codes,
                                       args=(full_name_in_places, places_codes))
    return codes


def _get_alma_mater_or_workplaces(cell):
    if isinstance(cell, float):
        return list()
    
    places = set()
    places_in_cell = cell.split('|')
    for place_in_cell in places_in_cell:
        # group colleges of University of Oxford and University of Cambridge
        # with their respective parent university
        if place_in_cell.endswith(', Cambridge'):
            places.add('University of Cambridge')
        elif place_in_cell.endswith(', Oxford'):
            places.add('University of Oxford')
        else:
            places.add(place_in_cell)
    
    places = list(places)
    places.sort(key=locale.strxfrm)
    return places


def _get_citizenship_codes(series, nationalities):
    alpha_2_codes = nationality_to_alpha2_code(series, nationalities)
    if isinstance(alpha_2_codes, float):
        return list()
    alpha_2_codes = alpha_2_codes.split('|')
    alpha_3_codes = [country_name_to_country_alpha3(
        country_alpha2_to_country_name(alpha_2_code))
                     for alpha_2_code in alpha_2_codes]
    return alpha_3_codes


def _get_nobel_laureates(cell, laureates, names):
    laureates_in_cell = set()
    
    if isinstance(cell, str):
        # assume the same name if only differs by a hyphen
        # or whitespace at front or end of string
        values = cell.strip().replace('-', ' ').split('|')
        for value in values:
            if value in laureates.values:
                laureates_in_cell.add(value)
            if names.str.contains(value, regex=False).sum() > 0:
                laureates_in_cell.add(value)
                    
    laureates_in_cell = list(laureates_in_cell)
    return laureates_in_cell

    
def _get_places_codes(cell, full_name_in_places, places_codes):
    codes = set()

    if isinstance(cell, str):
        places = cell.split('|')
        for place in places:
            code_indices = full_name_in_places[
                full_name_in_places == place].index
            assert(len(code_indices) <= 1)
            if len(code_indices) != 1:
                continue
            code_index = code_indices[0]
            codes_text = places_codes[code_index]
            if isinstance(codes_text, float):
                continue
            codes_in_cell = codes_text.split('|')
            for code_in_cell in codes_in_cell:
                if code_in_cell:
                    codes.add(code_in_cell)

    codes = list(codes)
    codes.sort()
    return codes
    

def _date_no_nan(date):
    if isinstance(date, str):
        return datetime.strptime(date, '%Y-%m-%d').date()
    return datetime.now().date()
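
The `num_years_lived` feature can be illustrated with plain `datetime` arithmetic. The sketch below is a simplified stand-in for `_build_num_years_lived`, not the function itself. The helper name `years_lived`, its `today` parameter and the constant `DAYS_PER_YEAR` are hypothetical, and it assumes a mean Gregorian year of 365.2425 days. A missing death date falls back to today's date, mirroring `_date_no_nan`.

```python
from datetime import date

# Assumed mean length of a Gregorian year in days
DAYS_PER_YEAR = 365.2425


def years_lived(birth, death=None, today=None):
    """Return whole years between birth and death (or today if still alive)."""
    end = death or today or date.today()
    return int((end - birth).days / DAYS_PER_YEAR)


# Marie Curie: born 7 November 1867, died 4 July 1934
curie_years = years_lived(date(1867, 11, 7), date(1934, 7, 4))
print(curie_years)  # 66

# A living physicist: the end date defaults to today (fixed here for the demo)
living_years = years_lived(date(1950, 1, 1), today=date(2000, 6, 1))
print(living_years)  # 50
```

Because the year length is an average, the result can differ from the calendar age by one for dates that fall very close to a birthday.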

In [None]:
train_physicists_features = build_features(train_physicists, nobel_physicists,
                                           nobel_chemists, places, nationalities)
assert((len(train_physicists_features) == len(train_physicists)))
assert(len(train_physicists_features.columns) == 779)
train_physicists_features

In [None]:
test_physicists_features = build_features(test_physicists, nobel_physicists,
                                          nobel_chemists, places, nationalities)
assert((len(test_physicists_features) == len(test_physicists)))
assert(len(test_physicists_features.columns) == 663)
test_physicists_features

Hold on a second, there are fewer features for the test set than for the training set! Let's inspect the dataframe columns to see the difference. The difference is obviously due to the one-hot encoding, as there are many differing country codes, workplaces, educational institutions etc. in the training and test sets. Some of this is due to the way that the data was sampled. For instance, the `died_in_[country_code]` features cannot possibly appear in the test set features since all the physicists are still alive. However, the majority of differences are due to the sheer variability in the data, which shows how difficult learning may be in this problem.

In [None]:
train_features_cols = set(train_physicists_features.columns.values)
test_features_cols = set(test_physicists_features.columns.values)
feature_cols_difference = train_features_cols.difference(test_features_cols)
display(feature_cols_difference)
len(feature_cols_difference)
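
The reason separate one-hot encodings produce different columns can be seen on a toy example. The sketch below is a plain-Python stand-in for the column naming done in `_binarize_list_feature` (it does not use `MultiLabelBinarizer`). The helper `one_hot_column_names` and the country code lists are hypothetical.

```python
def one_hot_column_names(rows, prefix):
    """Return the set of column names a one-hot encoding of `rows` creates."""
    return {prefix + value.replace(' ', '_') for row in rows for value in row}


# Toy birth country codes: one list per physicist (an empty list = unknown)
train_birth_codes = [['DEU'], ['GBR', 'USA'], []]
test_birth_codes = [['USA'], ['JPN']]

train_cols = one_hot_column_names(train_birth_codes, 'born_in_')
test_cols = one_hot_column_names(test_birth_codes, 'born_in_')

# Each set is encoded independently, so the columns only partly overlap
only_in_train = train_cols - test_cols  # {'born_in_DEU', 'born_in_GBR'}
only_in_test = test_cols - train_cols   # {'born_in_JPN'}
in_common = train_cols & test_cols      # {'born_in_USA'}

print(sorted(only_in_train), sorted(only_in_test), sorted(in_common))
```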

But which and how many features are actually in common between the training and test sets? Under half of the features that are in the training set are in common. Unsurprisingly, these are the main educational institutions, workplaces and countries that are associated with physics research.

In [None]:
feature_cols_intersection = train_features_cols.intersection(test_features_cols)
display(feature_cols_intersection)
len(feature_cols_intersection)

Since any machine learning models I build will be evaluated on the test set, the tempting thing to do is to reduce the features to the common set of features between the training and test sets. However, this would clearly be *data snooping* (cheating) since the test set is meant to be unseen data. The other issue is that if half of the features in the training set are thrown away and new examples come along with those exact features, the model would not be able to leverage this information. So the only logical thing to do is to ensure that the test set features are identical to the full set of features in the training set. And yes, this does mean that over half the test set features will consist of binary columns in which the values are all "no". Let's go ahead and do this now.

In [None]:
def sync_test_features_with_training_features(
    test_features, train_features):
    """Synchronize test features dataframe with train features for physicists.
    
    The test features dataframe is changed so that its columns
    match those in the training features dataframe. The intersection
    of the columns in the training and test features is first taken
    and then columns present in the training set that are not
    present in the test set are joined with it. These columns are
    filled with "no" values. This method should be called after
    `build_features`.

    Args:
        test_features (pandas.DataFrame): Test features dataframe.
        train_features (pandas.DataFrame): Training features
            dataframe.

    Returns:
        pandas.DataFrame: Test features dataframe.
        
        Columns are identical to the columns in the training
        features dataframe.
    """
    
    train_features_cols = set(train_features.columns.values)
    test_features_cols = set(test_features.columns.values)
    feature_cols_intersection = train_features_cols.intersection(
        test_features_cols)
    feature_cols_difference = train_features_cols.difference(
        test_features_cols)
    
    test_features_synced = test_features.copy()
    test_features_synced = test_features_synced[
        list(feature_cols_intersection)]
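    shape = (len(test_features_synced), len(feature_cols_difference))
    # use a sorted list for a deterministic column order and reuse the test
    # index so that the join aligns row for row
    features_to_pad = pd.DataFrame(np.full(shape, 'no'),
                                   columns=sorted(feature_cols_difference),
                                   index=test_features_synced.index)
    test_features_synced = test_features_synced.join(features_to_pad)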
    return test_features_synced

In [None]:
test_physicists_features = sync_test_features_with_training_features(
    test_physicists_features, train_physicists_features)
assert(sorted(test_physicists_features.columns) == sorted(
    train_physicists_features.columns))
assert((len(test_physicists_features) == len(test_physicists)))
test_physicists_features
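
For comparison, the same synchronization can be sketched with `DataFrame.reindex`, which selects the training columns and fills any that are missing in one step. This is an alternative illustration on toy data, not the function used above. It assumes that every column missing from the test set is a binary yes / no column, so that "no" is a sensible fill value.

```python
import pandas as pd

# Toy feature frames with partly overlapping one-hot columns
train = pd.DataFrame({
    'full_name': ['A', 'B'],
    'born_in_DEU': ['yes', 'no'],
    'born_in_USA': ['no', 'yes'],
})
test = pd.DataFrame({
    'full_name': ['C'],
    'born_in_USA': ['yes'],
    'born_in_JPN': ['yes'],
})

# Keep exactly the training columns: test-only columns are dropped and
# training-only columns are created and filled with 'no'
synced = test.reindex(columns=train.columns, fill_value='no')
print(synced)
```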

This looks much better now. 

It is clear to see that the one-hot encoding has tremendously increased the dimensionality of the problem. There are now 778 features (excluding the `full_name`) for 540 observations in the training set and 387 observations in the test set! Any model that is fit to such data would clearly be extremely prone to overfitting, so a dimensionality reduction is clearly needed on this data.

## Persisting the Data

Now that I have the training and test features dataframes, I'll persist them for future use.

In [None]:
train_physicists_features = train_physicists_features.reindex(
    sorted(train_physicists_features.columns), axis='columns')
train_physicists_features.head()

In [None]:
test_physicists_features = test_physicists_features.reindex(
    sorted(test_physicists_features.columns), axis='columns')
test_physicists_features.head()

In [None]:
train_physicists_features.to_csv('../data/processed/train-features.csv',
                                 index=False)
test_physicists_features.to_csv('../data/processed/test-features.csv',
                                index=False)

Let's perform a quick sanity check to make sure the data is as expected.

In [None]:
train_on_disk = pd.read_csv('../data/processed/train-features.csv')
assert(train_on_disk.equals(train_physicists_features))
test_on_disk = pd.read_csv('../data/processed/test-features.csv')
assert(test_on_disk.equals(test_physicists_features))