# Build Features

As a recap, the [training data](../data/processed/train-physicists-from-1901.csv) and [test data](../data/processed/test-physicists-from-1901.csv) contain information on physicists who were eligible to receive a Nobel Prize in Physics. That is, they were alive on and after 10 December 1901, the date the prize was first awarded. 

All of the physicists in the training data are deceased and all of the physicists in the test data are alive (as of the last 6-18 months, since this is the approximate length of time that DBpedia data lags behind Wikipedia articles). The data was purposely sampled in this way because one of the goals of this project is to try to predict the next Physics Nobel Laureate(s): the aim is to use the training set to build models that predict whether a physicist who is still alive has been awarded or is likely to be awarded the *Nobel Prize in Physics*.

It is finally time to use the training and test data, along with the various other pieces of data that I have collected ([Nobel Physics Laureates](../data/raw/nobel-physics-prize-laureates.csv), [Nobel Chemistry Laureates](../data/raw/nobel-chemistry-prize-laureates.csv), [Places](../data/processed/places.csv) and [Countries](../data/processed/Countries-List.csv)), to create features that may help in predicting *Nobel Laureates in Physics*.

## Setting up the Environment

An initialization step is needed to set up the environment:
- The locale needs to be set for all categories to the user’s default setting (typically specified in the LANG environment variable) to enable correct sorting of words with accents.

In [None]:
import locale
    
locale.setlocale(locale.LC_ALL, '')

In [None]:
from datetime import datetime

import numpy as np
import pandas as pd
from pycountry_convert import country_alpha2_to_country_name
from pycountry_convert import country_name_to_country_alpha3
from pycountry_convert import country_alpha2_to_continent_code
from pycountry_convert import country_alpha3_to_country_alpha2
from sklearn.preprocessing import MultiLabelBinarizer

from src.data.country_utils import nationality_to_alpha2_code 

## Reading in the Data

First let's read in the training and test data and the list of Nobel Physics laureates.

In [None]:
train_physicists = pd.read_csv(
    '../data/processed/train-physicists-from-1901.csv')
train_physicists.head()

In [None]:
test_physicists = pd.read_csv(
    '../data/processed/test-physicists-from-1901.csv')
test_physicists.head()

In [None]:
nobel_physicists = pd.read_csv(
    '../data/raw/nobel-physics-prize-laureates.csv')
nobel_physicists.head()

There are some variants of the name in the training and test data. Since I'll be searching for whether academic advisors, students, spouses, children etc. of a physicist are physics laureates, it's convenient to merge the `name` field into the *Nobel Physicists* dataframe.
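As a minimal sketch of this kind of merge, here is a left merge on two tiny made-up dataframes. The names `laureates` and `physicists` and all of their rows are placeholders for illustration only, not the real data. Every laureate row is kept, and `name` is NaN where no physicist matches.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real datasets.
laureates = pd.DataFrame({
    'Year': [2001, 2002],
    'Laureate': ['Physicist A', 'Chemist B'],
})
physicists = pd.DataFrame({
    'fullName': ['Physicist A', 'Physicist C'],
    'name': ['Physicist A (scientist)', 'Physicist C'],
})

# A left merge keeps every laureate, matched or not.
merged = pd.merge(laureates, physicists, how='left',
                  left_on='Laureate', right_on='fullName')
merged = merged[['Year', 'Laureate', 'name']]
print(merged)
```

Rows with a missing `name` are laureates who do not appear in the physicists data.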

In [None]:
nobel_columns = ['Year', 'Laureate', 'name', 'Country', 'Rationale']
# combine the training and test physicists into one dataframe to merge on
all_physicists = pd.concat([train_physicists, test_physicists])
nobel_physicists = pd.merge(nobel_physicists, all_physicists,
                            how='left', left_on='Laureate',
                            right_on='fullName')[nobel_columns]
nobel_physicists.head()

Now let's read in the list of Nobel Chemistry laureates.

In [None]:
nobel_chemists = pd.read_csv(
    '../data/raw/nobel-chemistry-prize-laureates.csv')
nobel_chemists.head()

Again, I'll be searching for whether academic advisors, students, spouses, children etc. of a physicist are chemistry laureates. So it's convenient to merge the `name` field into the *Nobel Chemists* dataframe.

In [None]:
nobel_chemists = pd.merge(nobel_chemists,
                          pd.concat([train_physicists, test_physicists]),
                          how='left', left_on='Laureate',
                          right_on='fullName')[nobel_columns]
nobel_chemists.head()

These are essentially physicists who are *Chemistry Nobel Laureates*. Surprisingly, there are quite a few of them. Of course, as noted previously, *Marie Curie* is the only double laureate in Physics and Chemistry.

In [None]:
nobel_chemists[nobel_chemists.name.notna()].name

It is worth noting that if there are alternative names of Chemistry Nobel Laureates in the physicists dataframe other than those above, they will *not* be found. However, I do not expect many of these, as at one point I retrieved all the redirected URLs for the names and very few were associated with laureates. In fact, you can still see some of these names in the [DBpedia redirects](../data/raw/dbpedia-redirects.csv) (e.g. search for "Marie_Curie"). The reason I removed the imputing of these redirects for names when processing the physicist data earlier was that a few of them were plain wrong. For instance, Richard Feynman's children redirect back to him (e.g. search for "Carl_Feynman" and "Michelle_Feynmann" in the DBpedia redirects, or try http://dbpedia.org/page/Carl_Feynman or http://dbpedia.org/page/Michelle_Feynmann directly in your browser).

Another interesting fact about this list is that the only one still alive is *Manfred Eigen*. So I would not expect to see many (if any) physicists in the test set who have Chemistry Nobel Laureate academic advisors, notable students, spouses etc. This is clearly a facet in which the training and test data are very different. Such differences make learning quite difficult.

Now, let's read the places and nationalities data into dataframes. It's important at this point to turn off the default behavior of *pandas*, which is to treat the string literal 'NA' as a missing value. In the dataset, 'NA' is both the continent code of North America and the ISO 3166 alpha-2 country code of Namibia. With the default turned off, *pandas* reads genuinely missing values as empty strings, so I then have to convert these back to NaN.
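To make the 'NA' pitfall concrete, here is a small sketch on a made-up in-memory CSV. The rows and the column name `countryAlpha2Code` are invented for illustration and are not the real places data. It compares the default read with `keep_default_na=False` followed by restoring NaN for empty strings.

```python
import io

import numpy as np
import pandas as pd

# Made-up CSV: 'NA' is a real code in two cells; the last row has empty cells.
csv_text = (
    'name,countryAlpha2Code,continentCode\n'
    'Namibia,NA,AF\n'
    'United States,US,NA\n'
    'Unknown place,,\n'
)

# Default behavior: both 'NA' and empty cells become NaN.
default_read = pd.read_csv(io.StringIO(csv_text))

# Keep 'NA' as a string, then turn only the empty strings back into NaN.
safe_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
safe_read = safe_read.replace('', np.nan)

print(default_read.isna().sum())
print(safe_read.isna().sum())
```

With the default read, the 'NA' codes are indistinguishable from the truly missing cells, whereas the second read keeps them as strings.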

In [None]:
places = pd.read_csv('../data/processed/places.csv',
                     keep_default_na=False)
places = places.replace('', np.nan)
assert(all(places[
    places.countryAlpha3Code == 'USA']['continentCode'].values == 'NA'))
places.head()

In [None]:
nationalities = pd.read_csv('../data/processed/Countries-List.csv',
                            keep_default_na=False)
nationalities = nationalities.replace('', np.nan)
assert(nationalities[
    nationalities.Name == 'Namibia']['ISO 3166 Code'].values == 'NA')
nationalities.head()

Finally, with all the data read in, I can now move on to the bulk of the work, which is creating the features.

## Creating the Features

It is now time to create the features from the data I have collected. The *features* I am going to create are listed in the table below along with their *type* and *description*. The features can be grouped into three main groups, with the bulk of features falling in the first group, then the second group and so on:

1. Features related to *professional and personal relationships* that the physicists have to *physics or chemistry laureates*, *educational institutions*, *work institutions*, *countries* / *continents*.

2. Features related to the subfield of focus of the physicist denoting whether s/he is an *experimental physicist*, *theoretical physicist* and / or an *astronomer*.

3. Features related to personal characteristics of the physicist, namely, *gender* and *number of years lived*.

Remember that in the first group, there are people and institutions from different countries / continents that are directly involved in the [selection and voting process for the Nobel Prize in Physics](https://www.nobelprize.org/nomination/physics/) and therefore have a direct influence on those who become laureates. The second group is connected to subjective biases that may or may not exist concerning the major subfield of research of the physicist, whilst the third group is connected to subjective biases that may or may not exist concerning the gender and age of a physicist, although age is also related to the invention or discovery "standing the test of time".

| Feature                                        | Type        | Description                                         |
| :---:                                          | :---:       | :---:                                               |
| alma_mater                                     | Categorical | List of universities attended                       |
| alma_mater_continent_codes                     | Categorical | List of continent codes of universities attended    |
| alma_mater_country_alpha_3_codes               | Categorical | List of country codes of universities attended      |
| birth_continent_codes                          | Categorical | List of continent codes of birth countries          |
| birth_country_alpha_3_codes                    | Categorical | List of country codes of birth countries            |
| citizenship_continent_codes                    | Categorical | List of continent codes of countries of citizenship |
| citizenship_country_alpha_3_codes              | Categorical | List of country codes of citizenship                |
| death_continent_codes                          | Categorical | List of continent codes of death countries          |
| death_country_alpha_3_codes                    | Categorical | List of country codes of death countries            |
| gender                                         | Binary      | Gender of physicist (male / female)                 |
| is_astronomer                                  | Binary      | Is the physicist an astronomer? (yes / no)          |
| is_experimental_physicist                      | Binary      | Is the physicist an experimental physicist? (yes / no) |
| is_theoretical_physicist                       | Binary      | Is the physicist a theoretical physicist? (yes / no) |
| ratio_num_alma_mater                           | Ratio       | Ratio of no. of universities attended                       |
| ratio_num_alma_mater_continent_codes           | Ratio       | Ratio of no. of continent codes of universities attended     |
| ratio_num_alma_mater_country_alpha_3_codes     | Ratio       | Ratio of no. of country codes of universities attended       |
| ratio_num_birth_continent_codes                | Ratio       | Ratio of no. of continent codes of birth countries           |
| ratio_num_birth_country_alpha_3_codes          | Ratio       | Ratio of no. of birth country codes                          | 
| ratio_num_chemistry_laureate_academic_advisors | Ratio       | Ratio of no. of chemistry laureate academic advisors         |
| ratio_num_chemistry_laureate_children          | Ratio       | Ratio of no. of chemistry laureate children                  |
| ratio_num_chemistry_laureate_doctoral_advisors | Ratio       | Ratio of no. of chemistry laureate doctoral advisors         |
| ratio_num_chemistry_laureate_doctoral_students | Ratio       | Ratio of no. of chemistry laureate doctoral students         |
| ratio_num_chemistry_laureate_influenced        | Ratio       | Ratio of no. of chemistry laureates the physicist influenced |
| ratio_num_chemistry_laureate_influenced_by     | Ratio       | Ratio of no. of chemistry laureates the physicist was influenced by | 
| ratio_num_chemistry_laureate_notable_students  | Ratio       | Ratio of no. of chemistry laureate notable students          |
| ratio_num_chemistry_laureate_parents           | Ratio       | Ratio of no. of chemistry laureate parents                   |
| ratio_num_chemistry_laureate_spouses           | Ratio       | Ratio of no. of chemistry laureate spouses                   | 
| ratio_num_citizenship_continent_codes          | Ratio       | Ratio of no. of continent codes of countries of citizenship |
| ratio_num_citizenship_country_alpha_3_codes    | Ratio       | Ratio of no. of country codes of citizenship                 |
| ratio_num_death_continent_codes                | Ratio       | Ratio of no. of continent codes of death countries           |
| ratio_num_death_country_alpha_3_codes          | Ratio       | Ratio of no. of country codes of death countries             |
| ratio_num_physics_laureate_academic_advisors   | Ratio       | Ratio of no. of physics laureate academic advisors           |
| ratio_num_physics_laureate_children            | Ratio       | Ratio of no. of physics laureate children                    |
| ratio_num_physics_laureate_doctoral_advisors   | Ratio       | Ratio of no. of physics laureate doctoral advisors           |
| ratio_num_physics_laureate_doctoral_students   | Ratio       | Ratio of no. of physics laureate doctoral students           |
| ratio_num_physics_laureate_influenced          | Ratio       | Ratio of no. of physics laureates the physicist influenced   |
| ratio_num_physics_laureate_influenced_by       | Ratio       | Ratio of no. of physics laureates the physicist was influenced by |
| ratio_num_physics_laureate_notable_students    | Ratio       | Ratio of no. of physics laureate notable students            |
| ratio_num_physics_laureate_parents             | Ratio       | Ratio of no. of physics laureate parents                     |
| ratio_num_physics_laureate_spouses             | Ratio       | Ratio of no. of physics laureate spouses                     |
| ratio_num_residence_continent_codes            | Ratio       | Ratio of no. of continent codes of residence countries       |
| ratio_num_residence_country_alpha_3_codes      | Ratio       | Ratio of no. of residence country codes                      |
| ratio_num_workplaces                           | Ratio       | Ratio of no. of workplaces                                   |
| ratio_num_workplaces_continent_codes           | Ratio       | Ratio of no. of continent codes of countries of workplaces   |
| ratio_num_workplaces_country_alpha_3_codes     | Ratio       | Ratio of no. of country codes of countries worked in         |
| ratio_num_years_lived                          | Ratio       | Ratio of no. of years lived              |
| residence_continent_codes                      | Categorical | List of continent codes of countries of residence   |
| residence_country_alpha_3_codes                | Categorical | List of country codes of countries of residence     |
| workplaces                                     | Categorical | List of workplaces                                  |
| workplaces_continent_codes                     | Categorical | List of continent codes of countries worked in      |
| workplaces_country_alpha_3_codes               | Categorical | List of country codes of countries worked in        |

Some comments are also warranted with regard to the types of the feature variables. As you can see, there are three types of variables:

1. **Ratio** variables of a *continuous*, *quantitative* nature.

2. **Categorical** variables of a qualitative nature.

3. **Binary** (**dichotomous**) variables of a categorical nature.

Every ratio variable is calculated by dividing the *individual physicist count* by the *mean count in the training set* for a particular feature. So a ratio represents how far above or below the average a particular physicist is. Values above one indicate that the physicist is above the average, whilst values below one indicate that s/he is below the average for a specific feature. For instance, if `ratio_num_alma_mater` = 2.5 for a particular physicist, it means that the physicist has attended two and a half times as many universities as the average physicist in the training set.
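The arithmetic can be sketched on a handful of hypothetical counts. The values below are invented for illustration and are not taken from the training set.

```python
import pandas as pd

# Hypothetical counts of universities attended by four physicists.
train_counts = pd.Series([1, 2, 3, 2], name='num_alma_mater')
mean_count = train_counts.mean()  # (1 + 2 + 3 + 2) / 4 = 2.0

# Each ratio is the individual count divided by the training mean.
ratios = train_counts / mean_count

# A physicist who attended 5 universities: 5 / 2.0 = 2.5
new_ratio = 5 / mean_count
print(ratios.tolist(), new_ratio)
```

Here a ratio of 2.5 means two and a half times the mean count of 2.0, i.e. five universities.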

The categorical variables are all lists of varying lengths of *places* and therefore are not in an appropriate form for machine learning. Once I create them I will *one-hot encode* them into binary variables and discard the lists. You may ask why the one-hot encoding is done with categorical yes / no values rather than 0 / 1 values. It is because the algorithms I will be processing the data with would treat 0 / 1 values as quantitative in nature, which is clearly not what is desired. Essentially I will be left with just two variable types: binary variables and ratio variables. OK, time to go ahead and create the features.
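As a rough sketch of what one-hot encoding a list feature into yes / no columns looks like, here is `MultiLabelBinarizer` applied to three made-up lists of country codes. The data and the use of `np.where` for the yes / no mapping are illustrative choices, not necessarily how the notebook's own functions do it.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up list feature: birth country codes per physicist (one has none).
codes = pd.Series([['GBR', 'USA'], ['USA'], []],
                  name='birth_country_alpha_3_codes')

mlb = MultiLabelBinarizer()
matrix = mlb.fit_transform(codes)  # 0 / 1 indicator matrix, one column per code

# Map the 0 / 1 indicators to categorical 'no' / 'yes' values.
encoded = pd.DataFrame(
    np.where(matrix == 1, 'yes', 'no'),
    columns=['born_in_' + code for code in mlb.classes_],
    index=codes.index)
print(encoded)
```

Each distinct code becomes its own column, so a physicist with an empty list gets "no" everywhere.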

In [None]:
def build_features(physicists, nobel_physicists, nobel_chemists,
                   places, nationalities):
    """Build features for the physicists.

    Args:
        physicists (pandas.DataFrame): Physicists dataframe.
        nobel_physicists (pandas.DataFrame): Nobel Physics
            Laureate dataframe.
        nobel_chemists (pandas.DataFrame): Nobel Chemistry
            Laureate dataframe.
        places (pandas.DataFrame): Places dataframe.
        nationalities (pandas.DataFrame): Nationalities dataframe.
            
    Returns:
        pandas.DataFrame: Features dataframe.
    """
    
    features = physicists.copy()[['fullName', 'name', 'gender']].rename(
        mapper={'fullName': 'full_name'}, axis='columns')
    features['num_years_lived'] = _build_num_years_lived(
        physicists.birthDate, physicists.deathDate)
    
    _build_physics_subfield_features(features, physicists)
    _build_num_laureates_features(features, physicists,
                                  nobel_physicists, nobel_chemists)
    
    _build_citizenship_features(features, physicists, nationalities)
    
    _build_places_features(features, physicists, places)
    
    features = features.drop('name', axis='columns')
    return features


def _build_physics_subfield_features(features, physicists):
    features_to_build = {
        'is_theoretical_physicist': {'categories': 'Theoretical physicists',
                                     'others': 'theoretical physic'},
        'is_experimental_physicist': {'categories': 'Experimental physicists',
                                      'others': 'experimental physic'},
        'is_astronomer': {'categories': 'astronomers',
                          'others': 'astronom'}
    }
    
    for feature, search_terms in features_to_build.items():
        features[feature] = _build_physics_subfield(
            physicists.categories, physicists.field, physicists.description,
            physicists.comment, search_terms=search_terms)
    


def _build_num_laureates_features(features, physicists, nobel_physicists,
                                  nobel_chemists):
    features_to_build = {
        'laureate_academic_advisors': 'academicAdvisor',
        'laureate_doctoral_advisors': 'doctoralAdvisor',
        'laureate_doctoral_students': 'doctoralStudent',
        'laureate_notable_students': 'notableStudent',
        'laureate_children': 'child',
        'laureate_parents': 'parent',
        'laureate_spouses': 'spouse',
        'laureate_influenced': 'influenced',
        'laureate_influenced_by': 'influencedBy'
    }
    
    for feature, relation in features_to_build.items():
        features['num_physics_' + feature] = _build_num_laureates(
            physicists[relation], nobel_physicists.Laureate, nobel_physicists.name)
        features['num_chemistry_' + feature] = _build_num_laureates(
            physicists[relation], nobel_chemists.Laureate, nobel_chemists.name)


    
def _build_places_features(features, physicists, places):
    features_to_build = {
        'birth_country_alpha_3_codes': 'birthPlace',
        'birth_continent_codes': 'birthPlace',
        'death_country_alpha_3_codes': 'deathPlace',
        'death_continent_codes': 'deathPlace',
        'residence_country_alpha_3_codes': 'residence',
        'residence_continent_codes': 'residence',
        'alma_mater': 'almaMater',
        'alma_mater_country_alpha_3_codes': 'almaMater',
        'alma_mater_continent_codes': 'almaMater',
        'workplaces': 'workplaces',
        'workplaces_country_alpha_3_codes': 'workplaces',
        'workplaces_continent_codes': 'workplaces'
    }
    
    for feature, place in features_to_build.items():
        code = 'countryAlpha3Code'
        if 'continent' in feature:
            code = 'continentCode'
            
        if feature in ['alma_mater', 'workplaces']:
            features[feature] = physicists[place].apply(
                _get_alma_mater_or_workplaces)           
        else:
            features[feature] = _build_places_codes(
                physicists[place], places.fullName, places[code])
        features['num_' + feature] = features[feature].apply(len)


    
def _build_citizenship_features(features, physicists, nationalities):
    citizenship = physicists.citizenship.apply(
        _get_citizenship_codes, args=(nationalities,))
    nationality = physicists.nationality.apply(
        _get_citizenship_codes, args=(nationalities,))
    citizenship_description = physicists.description.apply(
        _get_citizenship_codes, args=(nationalities,))
    features['citizenship_country_alpha_3_codes'] = (
        (citizenship + nationality + citizenship_description).apply(
            lambda ctz: list(sorted(set(ctz)))))
    features['num_citizenship_country_alpha_3_codes'] = (
        features.citizenship_country_alpha_3_codes.apply(len))
    features['citizenship_continent_codes'] = (
        features.citizenship_country_alpha_3_codes.apply(
            lambda al3: list(sorted({country_alpha2_to_continent_code(
                country_alpha3_to_country_alpha2(cd)) for cd in al3}))))
    features['num_citizenship_continent_codes'] = (
        features.citizenship_continent_codes.apply(len))


def _build_num_years_lived(birth_date, death_date):
    death_date_no_nan = death_date.apply(_date_no_nan)
    birth_date_no_nan = birth_date.apply(_date_no_nan)
    # divide by the length of a mean Gregorian year (365.2425 days)
    years_lived = ((death_date_no_nan - birth_date_no_nan) /
                   pd.Timedelta(days=365.2425))
    return years_lived.astype('int64')


def _build_physics_subfield(categories, field, description, comment, search_terms):
    # the categories column is matched case-sensitively on the category name
    in_categories = categories.apply(
        lambda cat: search_terms['categories'] in cat)
    # the free-text columns are matched case-insensitively; NaN counts as no match
    in_field = field.apply(
        lambda fld: search_terms['others'] in fld.lower() if isinstance(fld, str)
        else False)
    in_description = description.apply(
        lambda desc: search_terms['others'] in desc.lower() if isinstance(desc, str)
        else False)
    in_comment = comment.apply(
        lambda comm: search_terms['others'] in comm.lower() if isinstance(comm, str)
        else False)
    subfield = in_categories | in_field | in_description | in_comment
    subfield = subfield.apply(lambda val: 'yes' if val else 'no')
    return subfield


def _build_num_laureates(series, laureates, names):
    laureate_names = series.apply(_get_nobel_laureates,
                                  args=(laureates, names))
    return laureate_names.apply(len)


def _build_places_codes(places_in_physicists, full_name_in_places, places_codes):
    codes = places_in_physicists.apply(_get_places_codes,
                                       args=(full_name_in_places, places_codes))
    return codes


def _get_alma_mater_or_workplaces(cell):
    if isinstance(cell, float):
        return list()
    
    places = set()
    places_in_cell = cell.split('|')
    for place_in_cell in places_in_cell:
        # group colleges of University of Oxford and University of Cambridge
        # with their respective parent university
        if place_in_cell.endswith(', Cambridge'):
            places.add('University of Cambridge')
        elif place_in_cell.endswith(', Oxford'):
            places.add('University of Oxford')
        else:
            places.add(place_in_cell)
    
    places = list(places)
    places.sort(key=locale.strxfrm)
    return places


def _get_citizenship_codes(series, nationalities):
    alpha_2_codes = nationality_to_alpha2_code(series, nationalities)
    if isinstance(alpha_2_codes, float):
        return list()
    alpha_2_codes = alpha_2_codes.split('|')
    alpha_3_codes = [country_name_to_country_alpha3(
        country_alpha2_to_country_name(alpha_2_code))
                     for alpha_2_code in alpha_2_codes]
    return alpha_3_codes


def _get_nobel_laureates(cell, laureates, names):
    laureates_in_cell = set()
    
    if isinstance(cell, str):
        # assume the same name if only differs by a hyphen
        # or whitespace at front or end of string
        values = cell.strip().replace('-', ' ').split('|')
        for value in values:
            if value in laureates.values:
                laureates_in_cell.add(value)
            if names.str.contains(value, regex=False).sum() > 0:
                laureates_in_cell.add(value)
                    
    laureates_in_cell = list(laureates_in_cell)
    return laureates_in_cell

    
def _get_places_codes(cell, full_name_in_places, places_codes):
    codes = set()

    if isinstance(cell, str):
        places = cell.split('|')
        for place in places:
            code_indices = full_name_in_places[
                full_name_in_places == place].index
            assert(len(code_indices) <= 1)
            if len(code_indices) != 1:
                continue
            code_index = code_indices[0]
            codes_text = places_codes[code_index]
            if isinstance(codes_text, float):
                continue
            codes_in_cell = codes_text.split('|')
            for code_in_cell in codes_in_cell:
                if code_in_cell:
                    codes.add(code_in_cell)

    codes = list(codes)
    codes.sort()
    return codes
    

def _date_no_nan(date):
    if isinstance(date, str):
        return datetime.strptime(date, '%Y-%m-%d').date()
    return datetime(2018, 10, 24).date()  # fix the date for reproducibility

In [None]:
train_features = build_features(
    train_physicists, nobel_physicists, nobel_chemists, places, nationalities)
assert((len(train_features) == len(train_physicists)))
assert(len(train_features.columns) == 52)
train_features.head()

In [None]:
test_features = build_features(
    test_physicists, nobel_physicists, nobel_chemists, places, nationalities)
assert((len(test_features) == len(test_physicists)))
assert(test_features.columns.tolist() == train_features.columns.tolist())
test_features.head()

Now I one-hot encode the list features. As a result of the one-hot encoding, there are fewer features in the test set than in the training set. This is because the two sets contain differing country codes, workplaces, educational institutions, etc. The majority of the differences are due to the way that the data was sampled. For instance, the `died_in_[country_code]` features cannot possibly appear in the test set features since these physicists are still alive. The few remaining differences are due to variability in the data.

Since any machine learning models I build will be evaluated on the test set, the tempting thing to do is to reduce the features to the set common to the training and test sets. However, this would clearly be *data snooping* (cheating) since the test set is meant to be unseen data. The other issue is that if some of the features in the training set are thrown away and new examples come along with those exact features, the model would not be able to leverage this information. So the only logical thing to do is to ensure that the test set features are identical to the training set features. I do this by "padding" the test set with the features it is missing, filling them with all "no" values. Let's go ahead and do this now.
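The idea of syncing the test columns to the training columns can be sketched on two tiny made-up frames. Note that this sketch uses `DataFrame.reindex` with a fill value, which is a different and simpler route than the join-based padding in the function that follows. It also silently drops test-only columns, whereas that function folds them into an "other" column.

```python
import pandas as pd

# Made-up one-hot encoded frames for illustration only.
train = pd.DataFrame({'born_in_USA': ['yes', 'no'],
                      'died_in_USA': ['yes', 'no']})
test = pd.DataFrame({'born_in_USA': ['no', 'yes'],
                     'born_in_NZL': ['yes', 'no']})

# Keep exactly the training columns; columns missing from the
# test set are created and filled with 'no'.
aligned = test.reindex(columns=train.columns, fill_value='no')
print(aligned)
```

The training set alone decides which columns exist, so nothing about the test set leaks into the feature definition.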

In [None]:
def binarize_list_features(features, train_features=None,
                           presence_threshold=0.0):
    """Binarize list features.
    
    One-hot encode the list categorical features in the
    features dataframe.

    Args:
        features (pandas.DataFrame): Features dataframe.
        train_features (pandas.DataFrame): Training features
            dataframe. Pass this parameter when building features
            for a test set so that identical features are
            created for the test set.
        presence_threshold (float): For each category in a
            categorical list feature, the fraction of
            physicists for which the category is present will
            be calculated. If the fraction is below this
            threshold it will be grouped into the "other"
            category (represented by one or more "*" characters
            in its name). This is intended for "bucketing" rare
            values to keep the dimensionality of the feature
            space down and reduce chances of overfitting. Set
            this value to zero to prevent any grouping of
            values. Note that this value will be ignored when
            `train_features` is not None.
            
    Returns:
        pandas.DataFrame: Features dataframe.
    """
    
    # union of places and citizenship (without the counts)
    series_to_binarize = {
        'birth_country_alpha_3_codes': 'born_in_',
        'birth_continent_codes': 'born_in_',
        'death_country_alpha_3_codes': 'died_in_',
        'death_continent_codes': 'died_in_',
        'residence_country_alpha_3_codes': 'lived_in_',
        'residence_continent_codes': 'lived_in_',
        'alma_mater': 'alumnus_of_',
        'alma_mater_country_alpha_3_codes': 'alumnus_in_',
        'alma_mater_continent_codes': 'alumnus_in_',
        'workplaces': 'worked_at_',
        'workplaces_country_alpha_3_codes': 'worked_in_',
        'workplaces_continent_codes': 'worked_in_',
        'citizenship_country_alpha_3_codes': 'citizen_of_',
        'citizenship_continent_codes': 'citizen_in_'
    }
        
    for series, prefix in series_to_binarize.items():
        binarized = _binarize_list_feature(features[series], prefix,
                                           train_features, presence_threshold)
        features = features.drop(series, axis='columns').join(binarized)
        
    # add extra features in test set to sync with training set
    if train_features is not None:
        # sort for a deterministic column order (a set is unordered)
        cols_to_add = sorted(
            set(train_features.columns) - set(features.columns))
        shape = (len(features), len(cols_to_add))
        features_to_pad = pd.DataFrame(
            np.full(shape, 'no'), index=features.index, columns=cols_to_add)
        features = features.join(features_to_pad)
    return features
    
    
def _binarize_list_feature(series, prefix, train_features=None,
                           presence_threshold=0.0):
    mlb = MultiLabelBinarizer()
    binarized = pd.DataFrame(
        mlb.fit_transform(series),
        columns=[prefix + class_.replace(' ', '_') for class_ in mlb.classes_],
        index=series.index)
    
    if not (presence_threshold <= 0.0) or train_features is not None:
        if train_features is not None:
            cols_to_group = [col for col in binarized.columns if col not in
                             train_features.columns]
        else:
            cols_to_group = binarized.mean() < presence_threshold
            cols_to_group = cols_to_group[cols_to_group.values].index.tolist()
            
        # look for at least one '1' value in the row for a physicist
        if cols_to_group:
            other_col = binarized[cols_to_group].applymap(
                lambda val: True if val == 1 else False).any(axis='columns')
            other_col.name = _series_name(series.name, prefix)
            binarized = binarized.drop(cols_to_group, axis='columns').join(other_col)

    binarized = binarized.applymap(lambda val: 'yes' if val == 1 else 'no')
    return binarized


def _series_name(name, prefix):
    if name.endswith('alpha_3_codes'):
        other_name = '***'
    elif name.endswith('continent_codes'):
        other_name = '**'
    else:
        other_name = '*'
    return prefix + other_name

In [None]:
presence_threshold = 0.01
train_features = binarize_list_features(
    train_features, presence_threshold=presence_threshold)
assert((len(train_features) == len(train_physicists)))
assert(len(train_features.columns) == 187)
train_features.head()

In [None]:
test_features = binarize_list_features(test_features,
                                       train_features=train_features)
assert((len(test_features) == len(test_physicists)))
assert(sorted(test_features.columns.tolist()) == sorted(
    train_features.columns.tolist()))
test_features.head()

Now I convert the count features to ratio features by dividing the cell values by the mean value of the feature in the training set. I also drop any count features that are zero for every physicist in the training set since they are uninformative.
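A condensed sketch of these two steps on made-up counts is shown below. The frames `train_counts` and `test_counts` and their values are hypothetical. The columns to keep and the means are both taken from the training counts and then reused for the test counts.

```python
import pandas as pd

# Hypothetical count features; the second is zero for every training row.
train_counts = pd.DataFrame({'num_workplaces': [1, 3],
                             'num_physics_laureate_spouses': [0, 0]})
test_counts = pd.DataFrame({'num_workplaces': [4],
                            'num_physics_laureate_spouses': [0]})

# Keep only the columns with at least one non-zero training count.
non_zero = (train_counts != 0).any(axis='rows')
keep = train_counts.loc[:, non_zero].columns

# Means come from the training set and are applied to both sets.
train_means = train_counts[keep].mean()
train_ratios = (train_counts[keep] / train_means).add_prefix('ratio_')
test_ratios = (test_counts[keep] / train_means).add_prefix('ratio_')
print(train_ratios)
print(test_ratios)
```

Because the test ratios use the training means, they are on the same scale as the training ratios.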

In [None]:
def convert_counts_to_ratios(features, train_features=None):
    """Convert count features to ratios.
    
    Converts all count features in the `features` dataframe to
    ratios by dividing cell values by the mean value of the feature.

    Args:
        features (pandas.DataFrame): Features dataframe.
        train_features (pandas.DataFrame): Training features
            dataframe. Pass this parameter when building features
            for a test set so that the ratios are calculated using
            the means of the training set counts.

    Returns:
        pandas.DataFrame: Features dataframe with counts replaced by ratios.
    """
    
    numerator = features.select_dtypes('int64') 
    if train_features is None:
        # drop columns in training features where the counts are all zero
        numerator = numerator.loc[:, (numerator != 0).any(axis='rows')]
        denominator = numerator.mean()
    else:
        non_zero_cols = train_features.select_dtypes('int64')
        non_zero_cols = non_zero_cols.loc[
            :, (non_zero_cols != 0).any(axis='rows')].columns
        numerator = numerator[non_zero_cols]
        denominator = train_features.select_dtypes('int64')[
            non_zero_cols].mean()

    features_with_ratios = features.drop(
        features.select_dtypes('int64'), axis='columns')
    ratio = numerator / denominator
    ratio.columns = ['ratio_' + col_name for col_name in ratio.columns]
    features_with_ratios = features_with_ratios.join(ratio)

    return features_with_ratios

In [None]:
test_features = convert_counts_to_ratios(test_features,
                                         train_features=train_features)
assert((len(test_features) == len(test_physicists)))
assert(len(test_features.columns) == 184)
assert(test_features.select_dtypes('int64').empty)
test_features.head()

In [None]:
train_features = convert_counts_to_ratios(train_features)
assert((len(train_features) == len(train_physicists)))
assert(sorted(train_features.columns.tolist()) == sorted(
    test_features.columns.tolist()))
assert(train_features.select_dtypes('int64').empty)
train_features.head()

The one-hot encoding has increased the dimensionality of the problem. There are now 183 features (excluding the `full_name`) for 540 observations in the training set and 387 observations in the test set. A model that is fit to such data could be prone to overfitting and a dimensionality reduction on this data may be warranted.

## Persisting the Data

Now that I have the training and test features dataframes, I'll persist them for future use.

In [None]:
train_features = train_features.reindex(
    sorted(train_features.columns), axis='columns')
train_features.head()

In [None]:
test_features = test_features.reindex(
    sorted(test_features.columns), axis='columns')
test_features.head()

In [None]:
train_features.to_csv('../data/processed/train-features.csv', index=False)
test_features.to_csv('../data/processed/test-features.csv', index=False)