# Build Features

As a recap, the [training data](../data/processed/train-physicists-from-1901.csv), [validation data](../data/processed/validation-physicists-from-1901.csv) and [test data](../data/processed/test-physicists-from-1901.csv) contain information on physicists who were eligible to receive a Nobel Prize in Physics. That is, they were alive on and after 10 December 1901, the date the prize was first awarded. 

All of the physicists in the training data are deceased and all the physicists in the validation and test data are alive. Recall that the Nobel Prize in Physics cannot be awarded posthumously and one of the goals of this project is to try to predict the next Physics Nobel Laureates. As a result, the data was purposely sampled in this way, so that the training set can be used to build models, which predict whether a living physicist is likely to be awarded the Nobel Prize in Physics.

It is time to use the training, validation and test data, along with the other various pieces of data: [Nobel Physics Laureates](../data/raw/nobel-physics-prize-laureates.csv), [Nobel Chemistry Laureates](../data/raw/nobel-chemistry-prize-laureates.csv), [Places](../data/processed/places.csv) and [Countries](../data/processed/Countries-List.csv), to create features that may help in predicting Physics Nobel Laureates.

## Setting up the Environment

An initialization step is needed to set up the environment:
- The locale needs to be set for all categories to the user’s default setting (typically specified in the LANG environment variable) to enable correct sorting of words with accents.

In [None]:
import locale
    
locale.setlocale(locale.LC_ALL, '')

In [None]:
from datetime import datetime

import numpy as np
import pandas as pd
from pycountry_convert import country_alpha2_to_country_name
from pycountry_convert import country_name_to_country_alpha3
from pycountry_convert import country_alpha2_to_continent_code
from pycountry_convert import country_alpha3_to_country_alpha2
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.preprocessing import OneHotEncoder

from src.data.country_utils import nationality_to_alpha2_code
from src.features.features_utils import rank_hot_encode

## Reading in the Data

First let's read in the training, validation and test data and the list of Nobel Physics Laureates.

In [None]:
train_physicists = pd.read_csv('../data/processed/train-physicists-from-1901.csv')
train_physicists.head()

In [None]:
validation_physicists = pd.read_csv('../data/processed/validation-physicists-from-1901.csv')
validation_physicists.head()

In [None]:
test_physicists = pd.read_csv('../data/processed/test-physicists-from-1901.csv')
test_physicists.head()

In [None]:
nobel_physicists = pd.read_csv('../data/raw/nobel-physics-prize-laureates.csv')
nobel_physicists.head()

There are some variants of laureate names in the training, validation and test data. Since we will be checking whether the academic advisors, students, spouses, children, etc. of a physicist are physics laureates, it is convenient to merge the `name` field into the Nobel Physicists dataframe.

In [None]:
nobel_columns = ['Year', 'Laureate', 'name', 'Country', 'Rationale']
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the same rows
nobel_physicists = pd.merge(nobel_physicists,
                            pd.concat([train_physicists, validation_physicists, test_physicists]),
                            how = 'left', left_on = 'Laureate',
                            right_on = 'fullName')[nobel_columns]
nobel_physicists.head()
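
To make the effect of the left merge concrete, here is a minimal sketch on made-up data. The names and the two tiny dataframes are hypothetical and are not taken from the real files. Every laureate row is kept, and `name` is filled in only where the laureate's full name also appears among the physicists.

```python
import pandas as pd

# Hypothetical miniature laureate and physicist tables (placeholder names).
laureates = pd.DataFrame({'Year': [1950, 1960],
                          'Laureate': ['Ada B. Example', 'Carl D. Sample']})
train = pd.DataFrame({'fullName': ['Ada B. Example'], 'name': ['Ada Example']})
validation = pd.DataFrame({'fullName': ['Eve F. Placeholder'],
                           'name': ['Eve Placeholder']})

# Stack the physicist tables, then left-merge so that no laureate is dropped.
physicists = pd.concat([train, validation], ignore_index=True)
merged = pd.merge(laureates, physicists, how='left',
                  left_on='Laureate', right_on='fullName')[['Year', 'Laureate', 'name']]
print(merged)
```

The second laureate has no matching physicist, so its `name` is missing. This is why filtering on `name.notna()` later picks out the laureates who are also in the physicists data.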

Now let's read in the list of Nobel Chemistry Laureates.

In [None]:
nobel_chemists = pd.read_csv('../data/raw/nobel-chemistry-prize-laureates.csv')
nobel_chemists.head()

Again, we will be checking whether the academic advisors, students, spouses, children, etc. of a physicist are chemistry laureates, so it is convenient to merge the `name` field into the Nobel Chemists dataframe.

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the same rows
nobel_chemists = pd.merge(nobel_chemists,
                          pd.concat([train_physicists, validation_physicists, test_physicists]),
                          how = 'left', left_on = 'Laureate',
                          right_on = 'fullName')[nobel_columns]
nobel_chemists.head()

These are essentially physicists who are Chemistry Nobel Laureates. Surprisingly, there are quite a few of them. Of course, as noted previously, *Marie Curie* is the only double laureate in Physics and Chemistry.

In [None]:
nobel_chemists[nobel_chemists.name.notna()].name

It is worth noting that if there are alternative names of Chemistry Nobel Laureates in the physicists dataframe other than those above, they will *not* be found. However, we do not expect many of these, as at one point all the redirected URLs for the names were retrieved and very few were associated with laureates. In fact, you can still see that some of these names are present in the [DBpedia redirects](../data/raw/dbpedia-redirects.csv) (e.g. search for "Marie_Curie"). When processing the physicist data earlier, the imputing of these redirects was removed for the names as a few of them were wrong. For instance, *Richard Feynman's* children redirect back to him! (E.g. search for "Carl_Feynman" and "Michelle_Feynmann" in the DBpedia redirects or directly try http://dbpedia.org/page/Carl_Feynman or http://dbpedia.org/page/Michelle_Feynmann in your browser.)

Another interesting observation is that the only one in this list still alive is *Manfred Eigen*. So we should not expect to see many, if any, physicists in the validation or test set who have Chemistry Nobel Laureate academic advisors, notable students, spouses etc. This is clearly a facet in which the training data is different from the validation and test data. Such differences can make learning difficult.

Now, let's read the places and nationalities data into dataframes. It's important at this point to turn off the default behavior of pandas, which is to treat the string literal 'NA' as a missing value. In the dataset, 'NA' is both the continent code of North America and the ISO 3166 alpha-2 country code of Namibia. With the default turned off, pandas reads genuinely missing values as empty strings, so we then have to convert those empty strings back to missing values.

In [None]:
places = pd.read_csv('../data/processed/places.csv', keep_default_na=False)
places = places.replace('', np.nan)
assert(all(places[places.countryAlpha3Code == 'USA']['continentCode'].values == 'NA'))
places.head()

In [None]:
nationalities = pd.read_csv('../data/processed/Countries-List.csv', keep_default_na=False)
nationalities = nationalities.replace('', np.nan)
assert(nationalities[nationalities.Name == 'Namibia']['ISO 3166 Code'].values == 'NA')
nationalities.head()
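
The difference between the two read modes is easy to see on a tiny example. The sketch below uses a made-up two-row CSV held in memory; the column names and the "Atlantis" row are hypothetical and do not come from the real files.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV: 'NA' is a real code in row 1, row 2 has truly empty cells.
csv_text = "country,alpha2Code,continentCode\nNamibia,NA,AF\nAtlantis,,\n"

# Default behavior: both the literal 'NA' and the empty cell are read as NaN.
default_read = pd.read_csv(io.StringIO(csv_text))
print(default_read.alpha2Code.isna().tolist())

# With keep_default_na=False, 'NA' survives as a string and empty cells
# come back as '', which we then convert to NaN explicitly.
safe_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
safe_read = safe_read.replace('', np.nan)
print(safe_read.alpha2Code.tolist())
```

With the defaults, the code for Namibia is indistinguishable from a missing value, which is exactly the situation the assertions in the cells above guard against.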

Finally, with all the data read in, we can now move on to the real work, creating the features.

## Creating the Features

It is now time to create the features from the collected data. The features we will create are listed in the table below along with their type and description. The features can be grouped into three main groups, with the bulk of features falling in the first group, then the second group and so on:

1. Features related to *professional and personal relationships* that the physicists have to *physics or chemistry laureates*, *educational institutions*, *work institutions*, *countries* and *continents*.

2. Features related to the subfield of focus of the physicist denoting whether s/he is an *experimental physicist*, *theoretical physicist* or an *astronomer*.

3. Features related to personal characteristics of the physicist, namely, *gender* and *number of years lived*.

Remember that in the first group, there are people and institutions from different countries and continents that are directly involved in the [selection and voting process for the Nobel Prize in Physics](https://www.nobelprize.org/nomination/physics/) and therefore have a direct influence on who becomes a laureate. The second group is connected to subjective biases that may or may not exist concerning the physicist's major subfield of research, while the third group is connected to subjective biases that may or may not exist concerning the gender and age of a physicist. Age is also related to the invention or discovery "standing the test of time".

| Feature                                        | Type        | Description                                         |
| :---:                                          | :---:       | :---:                                               |
| alma_mater                                     | Categorical | List of universities attended                       |
| alma_mater_continent_codes                     | Categorical | List of continent codes of universities attended    |
| alma_mater_country_alpha_3_codes               | Categorical | List of country codes of universities attended      |
| birth_continent_codes                          | Categorical | List of continent codes of birth countries          |
| birth_country_alpha_3_codes                    | Categorical | List of country codes of birth countries            |
| citizenship_continent_codes                    | Categorical | List of continent codes of countries of citizenship |
| citizenship_country_alpha_3_codes              | Categorical | List of country codes of citizenship                |
| gender                                         | Binary      | Gender of the physicist (male / female)             |
| is_astronomer                                  | Binary      | Is the physicist an astronomer? (yes / no)          |
| is_experimental_physicist                      | Binary      | Is the physicist an experimental physicist? (yes / no) |
| is_theoretical_physicist                       | Binary      | Is the physicist a theoretical physicist? (yes / no) |
| num_alma_mater                                 | Ordinal     | No. of universities attended                       |
| num_alma_mater_continent_codes                 | Ordinal     | No. of continent codes of universities attended     |
| num_alma_mater_country_alpha_3_codes           | Ordinal     | No. of country codes of universities attended       |
| num_birth_continent_codes                      | Ordinal     | No. of continent codes of birth countries           |
| num_birth_country_alpha_3_codes                | Ordinal     | No. of birth country codes                          | 
| num_chemistry_laureate_academic_advisors       | Ordinal     | No. of chemistry laureate academic advisors         |
| num_chemistry_laureate_children                | Ordinal     | No. of chemistry laureate children                  |
| num_chemistry_laureate_doctoral_advisors       | Ordinal     | No. of chemistry laureate doctoral advisors         |
| num_chemistry_laureate_doctoral_students       | Ordinal     | No. of chemistry laureate doctoral students         |
| num_chemistry_laureate_influenced              | Ordinal     | No. of chemistry laureates the physicist influenced |
| num_chemistry_laureate_influenced_by           | Ordinal     | No. of chemistry laureates the physicist was influenced by | 
| num_chemistry_laureate_notable_students        | Ordinal     | No. of chemistry laureate notable students          |
| num_chemistry_laureate_parents                 | Ordinal     | No. of chemistry laureate parents                   |
| num_chemistry_laureate_spouses                 | Ordinal     | No. of chemistry laureate spouses                   | 
| num_citizenship_continent_codes                | Ordinal     | No. of continent codes of countries of citizenship  |
| num_citizenship_country_alpha_3_codes          | Ordinal     | No. of country codes of citizenship                 |
| num_physics_laureate_academic_advisors         | Ordinal     | No. of physics laureate academic advisors           |
| num_physics_laureate_children                  | Ordinal     | No. of physics laureate children                    |
| num_physics_laureate_doctoral_advisors         | Ordinal     | No. of physics laureate doctoral advisors           |
| num_physics_laureate_doctoral_students         | Ordinal     | No. of physics laureate doctoral students           |
| num_physics_laureate_influenced                | Ordinal     | No. of physics laureates the physicist influenced   |
| num_physics_laureate_influenced_by             | Ordinal     | No. of physics laureates the physicist was influenced by |
| num_physics_laureate_notable_students          | Ordinal     | No. of physics laureate notable students            |
| num_physics_laureate_parents                   | Ordinal     | No. of physics laureate parents                     |
| num_physics_laureate_spouses                   | Ordinal     | No. of physics laureate spouses                     |
| num_residence_continent_codes                  | Ordinal     | No. of continent codes of residence countries       |
| num_residence_country_alpha_3_codes            | Ordinal     | No. of residence country codes                      |
| num_workplaces                                 | Ordinal     | No. of workplaces                                   |
| num_workplaces_continent_codes                 | Ordinal     | No. of continent codes of countries of workplaces   |
| num_workplaces_country_alpha_3_codes           | Ordinal     | No. of country codes of countries worked in         |
| num_years_lived_group                          | Ordinal     | No. of years lived group (18-24, 25-34, etc.)             |
| residence_continent_codes                      | Categorical | List of continent codes of countries of residence   |
| residence_country_alpha_3_codes                | Categorical | List of country codes of countries of residence     |
| workplaces                                     | Categorical | List of workplaces                                  |
| workplaces_continent_codes                     | Categorical | List of continent codes of countries worked in      |
| workplaces_country_alpha_3_codes               | Categorical | List of country codes of countries worked in        |
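
The `num_years_lived_group` feature in the table buckets whole years lived into bands such as 18-24 and 25-34. As a rough illustration only, the same bucketing can be sketched with `pandas.cut`. The ages below are made up, and this is an alternative formulation rather than the helper used in this notebook, which uses an explicit lookup over ranges.

```python
import pandas as pd

# Bin edges and labels mirroring the age bands used in this notebook.
edges = [18, 25, 35, 50, 65, 80, 95, 121]
labels = ['18-24', '25-34', '35-49', '50-64', '65-79', '80-94', '95-120']

# Hypothetical whole-year ages, for illustration only.
years_lived = pd.Series([24, 25, 76, 95, 120])

# right=False makes every bin closed on the left: [18, 25), [25, 35), ...
groups = pd.cut(years_lived, bins=edges, labels=labels, right=False)
print(groups.tolist())
```

`pd.cut` returns an ordered categorical when labels are given, which matches the ordinal nature of the feature.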

Some comments are also warranted with regard to the types of the feature variables. As you can see, there are three types of variables:

1. **Ordinal** variables.

2. **Categorical** variables.

3. **Binary** (**dichotomous**) variables.


The categorical variables are all lists of varying lengths of **places** and therefore are not in the appropriate form for machine learning. Once we create them, they will be encoded into binary variables and the lists will be discarded. You may ask why the encoding is done with categorical yes / no values rather than 0 / 1 values. It is because the algorithms that we will be processing the data with would treat 0 / 1 values as quantitative in nature, which clearly is not desired. Essentially, we will be left with two variable types: binary variables and ordinal variables. OK, time to create the features.

In [None]:
def build_features(physicists, nobel_physicists, nobel_chemists, places, nationalities):
    """Build features for the physicists.

    Args:
        physicists (pandas.DataFrame): Physicists dataframe.
        nobel_physicists (pandas.DataFrame): Nobel Physics
            Laureate dataframe.
        nobel_chemists (pandas.DataFrame): Nobel Chemistry
            Laureate dataframe.
        places (pandas.DataFrame): Places dataframe.
        nationalities (pandas.DataFrame): Nationalities dataframe.
            
    Returns:
        pandas.DataFrame: Features dataframe.
    """
    
    features = physicists.copy()[['fullName', 'name', 'gender']].rename(
        mapper={'fullName': 'full_name'}, axis='columns')
    features['num_years_lived_group'] = _build_num_years_lived_group(
        physicists.birthDate, physicists.deathDate)
    
    _build_physics_subfield_features(features, physicists)
    _build_num_laureates_features(features, physicists,
                                  nobel_physicists, nobel_chemists)
    
    _build_citizenship_features(features, physicists, nationalities)
    
    _build_places_features(features, physicists, places)
    
    features = features.drop('name', axis='columns')
    return features


def _build_physics_subfield_features(features, physicists):
    features_to_build = {
        'is_theoretical_physicist': {'categories': 'Theoretical physicists',
                                     'others': 'theoretical physic'},
        'is_experimental_physicist': {'categories': 'Experimental physicists',
                                      'others': 'experimental physic'},
        'is_astronomer': {'categories': 'astronomers',
                          'others': 'astronom'}
    }
    
    for feature, search_terms in features_to_build.items():
        features[feature] = _build_physics_subfield(
            physicists.categories, physicists.field, physicists.description,
            physicists.comment, search_terms=search_terms)
    


def _build_num_laureates_features(features, physicists, nobel_physicists,
                                  nobel_chemists):
    features_to_build = {
        'laureate_academic_advisors': 'academicAdvisor',
        'laureate_doctoral_advisors': 'doctoralAdvisor',
        'laureate_doctoral_students': 'doctoralStudent',
        'laureate_notable_students': 'notableStudent',
        'laureate_children': 'child',
        'laureate_parents': 'parent',
        'laureate_spouses': 'spouse',
        'laureate_influenced': 'influenced',
        'laureate_influenced_by': 'influencedBy'
    }
    
    for feature, relation in features_to_build.items():
        features['num_physics_' + feature] = _build_num_laureates(
            physicists[relation], nobel_physicists.Laureate, nobel_physicists.name)
        features['num_chemistry_' + feature] = _build_num_laureates(
            physicists[relation], nobel_chemists.Laureate, nobel_chemists.name)
    # drop columns where the counts are all zeros
    non_zero = (features != 0).any(axis='rows')
    features.drop(non_zero[non_zero == False].index, axis='columns', inplace=True)


    
def _build_places_features(features, physicists, places):
    features_to_build = {
        'birth_country_alpha_3_codes': 'birthPlace',
        'birth_continent_codes': 'birthPlace',
        'residence_country_alpha_3_codes': 'residence',
        'residence_continent_codes': 'residence',
        'alma_mater': 'almaMater',
        'alma_mater_country_alpha_3_codes': 'almaMater',
        'alma_mater_continent_codes': 'almaMater',
        'workplaces': 'workplaces',
        'workplaces_country_alpha_3_codes': 'workplaces',
        'workplaces_continent_codes': 'workplaces'
    }
    
    for feature, place in features_to_build.items():
        code = 'countryAlpha3Code'
        if 'continent' in feature:
            code = 'continentCode'
            
        if feature in ['alma_mater', 'workplaces']:
            features[feature] = physicists[place].apply(
                _get_alma_mater_or_workplaces)           
        else:
            features[feature] = _build_places_codes(
                physicists[place], places.fullName, places[code])
        features['num_' + feature] = features[feature].apply(len)


    
def _build_citizenship_features(features, physicists, nationalities):
    citizenship = physicists.citizenship.apply(
        _get_citizenship_codes, args=(nationalities,))
    nationality = physicists.nationality.apply(
        _get_citizenship_codes, args=(nationalities,))
    citizenship_description = physicists.description.apply(
        _get_citizenship_codes, args=(nationalities,))
    features['citizenship_country_alpha_3_codes'] = (
        (citizenship + nationality + citizenship_description).apply(
            lambda ctz: list(sorted(set(ctz)))))
    features['num_citizenship_country_alpha_3_codes'] = (
        features.citizenship_country_alpha_3_codes.apply(len))
    features['citizenship_continent_codes'] = (
        features.citizenship_country_alpha_3_codes.apply(
            lambda al3: list(sorted({country_alpha2_to_continent_code(
                country_alpha3_to_country_alpha2(cd)) for cd in al3}))))
    features['num_citizenship_continent_codes'] = (
        features.citizenship_continent_codes.apply(len))


def _build_num_years_lived_group(birth_date, death_date):
    death_date_no_nan = death_date.apply(_date_no_nan)
    birth_date_no_nan = birth_date.apply(_date_no_nan)
    years_lived = ((death_date_no_nan - birth_date_no_nan) /
                   pd.to_timedelta(1, 'Y'))
    years_lived = years_lived.apply(np.floor)
    years_lived_group = years_lived.apply(_years_lived_group)
    return years_lived_group

        
def _years_lived_group(years_lived):
    assert(years_lived >= 18 and years_lived <= 120)
    
    groups = {
        range(18, 25): '18-24',
        range(25, 35): '25-34',
        range(35, 50): '35-49',
        range(50, 65): '50-64',
        range(65, 80): '65-79',
        range(80, 95): '80-94',
        range(95, 121): '95-120'
    }
    
    for range_, code in groups.items():
        if years_lived in range_:
            return code


def _build_physics_subfield(categories, field, description, comment, search_terms):
    # the same search is used for every subfield, so the names are generic
    cat_match = categories.apply(
        lambda cat: search_terms['categories'] in cat)
    field_match = field.apply(
        lambda fld: search_terms['others'] in fld.lower() if isinstance(fld, str)
        else False)
    desc_match = description.apply(
        lambda desc: search_terms['others'] in desc.lower() if isinstance(desc, str)
        else False)
    # search the comment text here (previously the description was searched twice)
    comm_match = comment.apply(
        lambda comm: search_terms['others'] in comm.lower() if isinstance(comm, str)
        else False)
    subfield = cat_match | field_match | desc_match | comm_match
    subfield = subfield.apply(lambda val: 'yes' if val else 'no')
    return subfield


def _build_num_laureates(series, laureates, names):
    laureate_names = series.apply(_get_nobel_laureates,
                                  args=(laureates, names))
    return laureate_names.apply(len)


def _build_places_codes(places_in_physicists, full_name_in_places, places_codes):
    codes = places_in_physicists.apply(_get_places_codes,
                                       args=(full_name_in_places, places_codes))
    return codes


def _get_alma_mater_or_workplaces(cell):
    if isinstance(cell, float):
        return list()
    
    places = set()
    places_in_cell = cell.split('|')
    for place_in_cell in places_in_cell:
        # group colleges of University of Oxford and University of Cambridge
        # with their respective parent university
        if place_in_cell.endswith(', Cambridge'):
            places.add('University of Cambridge')
        elif place_in_cell.endswith(', Oxford'):
            places.add('University of Oxford')
        else:
            places.add(place_in_cell)
    
    places = list(places)
    places.sort(key=locale.strxfrm)
    return places


def _get_citizenship_codes(series, nationalities):
    alpha_2_codes = nationality_to_alpha2_code(series, nationalities)
    if isinstance(alpha_2_codes, float):
        return list()
    alpha_2_codes = alpha_2_codes.split('|')
    alpha_3_codes = [country_name_to_country_alpha3(
        country_alpha2_to_country_name(alpha_2_code))
                     for alpha_2_code in alpha_2_codes]
    return alpha_3_codes


def _get_nobel_laureates(cell, laureates, names):
    laureates_in_cell = set()
    
    if isinstance(cell, str):
        # assume the same name if only differs by a hyphen
        # or whitespace at front or end of string
        values = cell.strip().replace('-', ' ').split('|')
        for value in values:
            if value in laureates.values:
                laureates_in_cell.add(value)
            if names.str.contains(value, regex=False).sum() > 0:
                laureates_in_cell.add(value)
                    
    laureates_in_cell = list(laureates_in_cell)
    return laureates_in_cell

    
def _get_places_codes(cell, full_name_in_places, places_codes):
    codes = set()

    if isinstance(cell, str):
        places = cell.split('|')
        for place in places:
            code_indices = full_name_in_places[
                full_name_in_places == place].index
            assert(len(code_indices) <= 1)
            if len(code_indices) != 1:
                continue
            code_index = code_indices[0]
            codes_text = places_codes[code_index]
            if isinstance(codes_text, float):
                continue
            codes_in_cell = codes_text.split('|')
            for code_in_cell in codes_in_cell:
                if code_in_cell:
                    codes.add(code_in_cell)

    codes = list(codes)
    codes.sort()
    return codes
    

def _date_no_nan(date):
    if isinstance(date, str):
        return datetime.strptime(date, '%Y-%m-%d').date()
    return datetime(2018, 10, 24).date()  # fix the date for reproducibility

In [None]:
train_features = build_features(train_physicists, nobel_physicists, nobel_chemists, places, nationalities)
assert((len(train_features) == len(train_physicists)))
assert(len(train_features.columns) == 45)
train_features.head()

In [None]:
validation_features = build_features(
    validation_physicists, nobel_physicists, nobel_chemists, places, nationalities)
assert((len(validation_features) == len(validation_physicists)))
assert(len(validation_features.columns) == 37)
validation_features.head()

In [None]:
test_features = build_features(test_physicists, nobel_physicists, nobel_chemists, places, nationalities)
assert((len(test_features) == len(test_physicists)))
assert(len(test_features.columns) == 36)
test_features.head()

So there are more features in the training set than in the validation and test sets. What are these extra features? They are mainly related to the relationships physicists have with chemistry and physics laureates. It seems that the data is less rich for more modern physicists.

In [None]:
train_features.columns.difference(validation_features.columns).tolist()

In [None]:
train_features.columns.difference(test_features.columns).tolist()

Any machine learning models that we build will have parameters chosen using the validation set and be evaluated on the test set. The tempting thing to do is to reduce the features to the common set of features between the training, validation and test sets, which have variability across all three datasets. We will do this for the training and validation sets as it seems a perfectly reasonable thing to do. However, using the test set in this way would clearly be *data snooping* (cheating) as it is meant to be unseen data, and as such, cannot be used to make any decisions during the modeling process. So to ensure that the test set features are identical to the training set features, we will "pad" the extra features in the test set with all "0" values and remove any extra features that are not present in the training set.
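
As a side note, dropping the extra columns and padding the missing ones can be expressed in one step with `DataFrame.reindex`. The sketch below uses two made-up single-row frames whose values are hypothetical. It is an alternative formulation, not the approach taken in the cells that follow, which name the padded columns explicitly.

```python
import pandas as pd

# Hypothetical feature frames, for illustration only.
train = pd.DataFrame({'gender': ['male'], 'num_workplaces': [2],
                      'num_physics_laureate_influenced': [1]})
test = pd.DataFrame({'gender': ['female'], 'num_workplaces': [1],
                     'num_residence_continent_codes': [1]})

# reindex keeps the columns that train has (in train's order), drops the
# rest, and pads columns missing from test with the fill value 0.
aligned = test.reindex(columns=train.columns, fill_value=0)
print(aligned)
```

Note that a fill value of 0 is only appropriate here because the padded columns are counts. The decision is driven entirely by the training columns, so nothing is learned from the test set.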

In [None]:
feature_cols = train_features.columns.intersection(validation_features.columns)
assert(validation_features.equals(validation_features[feature_cols]))
train_features = train_features[feature_cols]
assert((len(train_features.columns) == len(validation_features.columns)))
assert(sorted(train_features.columns.tolist()) == sorted(validation_features.columns.tolist()))
train_features.head()

In [None]:
feature_cols = test_features.columns.intersection(train_features.columns)
test_features = test_features[feature_cols]
test_features['num_physics_laureate_influenced'] = 0
test_features['num_physics_laureate_influenced_by'] = 0
assert((len(test_features.columns) == len(train_features.columns)))
assert(sorted(test_features.columns.tolist()) == sorted(train_features.columns.tolist()))
test_features.head()

Now we will binary encode the list features. Due to the binary encoding there will be a differing number of features in the training, test and validation sets. We will follow a methodology analogous to the above in order to ensure that the features are identical in the training, validation and test sets.

The differing features occur due to the differing *country codes*, *workplaces*, *educational institutions*, etc. that the physicists are associated with. Some of the differences are due to variability in the data and some are caused by the **selection bias** that we deliberately introduced in our data sampling process. The latter issue is an important one that we will return to in a later notebook.

We will also be using a `presence_threshold` to group binary features that only appear in a few instances into an "other" category. This is intended to reduce the dimensionality of the feature space and help to prevent overfitting during the model building phase. Let's go ahead and "binarize" the list features now.
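
Before the full implementation, here is a minimal sketch of the idea on a toy list feature. The four rows and the threshold of 0.5 are made up for illustration ("Tiny College" is a placeholder). The sketch also keeps 0 / 1 values for brevity, whereas the notebook's function maps them to yes / no.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical list feature: workplaces of four physicists.
workplaces = pd.Series([['CERN', 'MIT'], ['MIT'], ['MIT', 'Tiny College'], []],
                       name='workplaces')

# One 0/1 column per distinct workplace.
mlb = MultiLabelBinarizer()
binarized = pd.DataFrame(
    mlb.fit_transform(workplaces),
    columns=['worked_at_' + c.replace(' ', '_') for c in mlb.classes_],
    index=workplaces.index)

# Fraction of physicists for which each category is present.
presence = binarized.mean()
rare_cols = presence[presence < 0.5].index.tolist()

# A physicist falls into "other" if any of the rare categories is present.
other = binarized[rare_cols].any(axis='columns').astype(int).rename('worked_at_*')
grouped = binarized.drop(columns=rare_cols).join(other)
print(grouped)
```

Here the two workplaces that each appear for only a quarter of the physicists are folded into a single `worked_at_*` column, so three binary columns become two.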

In [None]:
def binarize_list_features(features, train_features=None, presence_threshold=0.0, pad_features=False):
    """Binarize list features.
    
    Binary encode the list categorical features in the
    features dataframe.

    Args:
        features (pandas.DataFrame): Features dataframe.
        train_features (pandas.DataFrame, optional): Defaults to None.
            Training features dataframe. Pass this parameter when 
            building features for a test or validation set so that 
            features not found in the training features are grouped
            into the "other category" that is mentioned in
            `presence_threshold` below.
        presence_threshold (float, optional): Defaults to 0.0. For
            each category in a categorical list feature, the 
            fraction of physicists for which the category is 
            present will be calculated. If the fraction is below
            this threshold it will be grouped into the "other"
            category (represented by one or more "*" characters in its
            name). This is intended for "bucketing" rare
            values to keep the dimensionality of the feature
            space down and reduce chances of overfitting. Set
            this value to zero to prevent any grouping of
            values. Note that this value will be ignored when
            `train_features` is provided.
        pad_features (bool, optional): Defaults to False. Pad binary
            features not found in the training set with all 'no'
            values. This should be set to True for a test set to
            ensure that the test set features will match the training
            set features.
            
    Returns:
        pandas.DataFrame: Features dataframe.
    """
    
    # union of places and citizenship (without the counts)
    series_to_binarize = {
        'birth_country_alpha_3_codes': 'born_in_',
        'birth_continent_codes': 'born_in_',
        'residence_country_alpha_3_codes': 'lived_in_',
        'residence_continent_codes': 'lived_in_',
        'alma_mater': 'alumnus_of_',
        'alma_mater_country_alpha_3_codes': 'alumnus_in_',
        'alma_mater_continent_codes': 'alumnus_in_',
        'workplaces': 'worked_at_',
        'workplaces_country_alpha_3_codes': 'worked_in_',
        'workplaces_continent_codes': 'worked_in_',
        'citizenship_country_alpha_3_codes': 'citizen_of_',
        'citizenship_continent_codes': 'citizen_in_'
    }
        
    for series, prefix in series_to_binarize.items():
        binarized = _binarize_list_feature(features[series], prefix,
                                           train_features, presence_threshold)
        features = features.drop(series, axis='columns').join(binarized)
        
    # add extra features in test set to sync with training set
    if pad_features:
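        # sorted list: gives a deterministic column order, and recent pandas
        # versions do not accept a set as the `columns` argument
        cols_to_add = sorted(set(train_features.columns) - set(features.columns))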
        shape=(len(features), len(cols_to_add))
        features_to_pad = pd.DataFrame(
            np.full(shape, 'no'), index=features.index, columns=cols_to_add)
        features = features.join(features_to_pad)
    return features
    
    
def _binarize_list_feature(series, prefix, train_features=None,
                           presence_threshold=0.0):
    mlb = MultiLabelBinarizer()
    binarized = pd.DataFrame(
        mlb.fit_transform(series),
        columns=[prefix + class_.replace(' ', '_') for class_ in mlb.classes_],
        index=series.index)
    
    if presence_threshold > 0.0 or train_features is not None:
        if train_features is not None:
            cols_to_group = [col for col in binarized.columns if col not in
                             train_features.columns]
        else:
            cols_to_group = binarized.mean() < presence_threshold
            cols_to_group = cols_to_group[cols_to_group.values].index.tolist()
            
        # look for at least one '1' value in the row for a physicist
        if cols_to_group:
            other_col = (binarized[cols_to_group] == 1).any(axis='columns')
            other_col.name = _series_name(series.name, prefix)
            binarized = binarized.drop(cols_to_group, axis='columns').join(other_col)

    binarized = binarized.applymap(lambda val: 'yes' if val == 1 else 'no')
    return binarized


def _series_name(name, prefix):
    if name.endswith('alpha_3_codes'):
        other_name = '***'
    elif name.endswith('continent_codes'):
        other_name = '**'
    else:
        other_name = '*'
    return prefix + other_name
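
To make the effect of `presence_threshold` concrete, here is a minimal, self-contained sketch of the bucketing idea on toy data. The workplace values and the threshold of 0.3 are invented for illustration, and the sketch uses plain pandas rather than the `MultiLabelBinarizer` used in the function above.

```python
import pandas as pd

# Hypothetical toy data: each physicist has a list of workplaces.
workplaces = pd.Series(
    [['CERN'], ['CERN', 'MIT'], ['CERN'], ['Bell Labs'], ['MIT']],
    name='workplaces')

# One indicator column per distinct value (1 if the value is in the list).
values = sorted({value for row in workplaces for value in row})
binarized = pd.DataFrame(
    [[int(value in row) for value in values] for row in workplaces],
    columns=['worked_at_' + value.replace(' ', '_') for value in values],
    index=workplaces.index)

# Columns whose share of 1s falls below the threshold count as "rare".
presence_threshold = 0.3  # assumed value, chosen only for this example
rare = binarized.columns[binarized.mean() < presence_threshold].tolist()

# Collapse the rare columns into a single "other" column named with '*'.
other = (binarized[rare] == 1).any(axis='columns').astype(int)
bucketed = binarized.drop(rare, axis='columns')
bucketed['worked_at_*'] = other

print(bucketed)
```

Here only `worked_at_Bell_Labs` appears in fewer than 30% of the rows, so it is folded into `worked_at_*` while the more common workplaces keep their own columns.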

In [None]:
train_features = binarize_list_features(train_features, presence_threshold=0.01)
assert(len(train_features.columns) == 157)
train_features.head()

In [None]:
validation_features = binarize_list_features(validation_features, train_features=train_features)
assert(len(validation_features.columns) == 147)
validation_features.head()

There are more features in the training set than in the validation set. What are these extra features?

In [None]:
train_features.columns.difference(validation_features.columns).tolist()

Working on the Manhattan Project certainly makes sense! OK, let's reduce the training and validation sets to the features they have in common.
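
As a small illustration of the two index operations involved, the sketch below uses two invented single-row dataframes. The column names are hypothetical and only stand in for the real features.

```python
import pandas as pd

# Hypothetical stand-ins for the training and validation features.
train = pd.DataFrame({'full_name': ['A'],
                      'worked_at_X': ['yes'],
                      'worked_on_Y': ['no']})
validation = pd.DataFrame({'full_name': ['B'],
                           'worked_at_X': ['no']})

# Columns that exist only in the training set.
extra = train.columns.difference(validation.columns).tolist()

# Columns shared by both sets; restrict the training set to these.
common = train.columns.intersection(validation.columns)
train_common = train[common]

print(extra)
print(train_common.columns.tolist())
```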

In [None]:
feature_cols = train_features.columns.intersection(validation_features.columns)
assert(validation_features.equals(validation_features[feature_cols]))
train_features = train_features[feature_cols]
assert(len(train_features.columns) == len(validation_features.columns))
assert(sorted(train_features.columns.tolist()) == sorted(validation_features.columns.tolist()))
train_features.head()

There are fewer features in the test set, so let's "pad" the missing binary features with "no" to ensure that the features are identical between the training and test sets.
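
The padding step can be sketched on toy data as follows. The dataframes and column names are invented for illustration; the logic mirrors the `pad_features` branch of `binarize_list_features`, under the assumption that every missing column is binary.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test features.
train = pd.DataFrame({'born_in_DEU': ['yes', 'no'],
                      'born_in_USA': ['no', 'yes'],
                      'born_in_***': ['no', 'no']})
test = pd.DataFrame({'born_in_USA': ['yes', 'no', 'no']})

# Columns the training set has but the test set lacks.
missing = [col for col in train.columns if col not in test.columns]

# Fill the missing columns with 'no' for every test row.
padding = pd.DataFrame(np.full((len(test), len(missing)), 'no'),
                       index=test.index, columns=missing)

# Join and reorder so the test columns match the training columns.
padded = test.join(padding)[train.columns]

print(padded)
```

A "no" is a safe default here because a missing column means that none of the test physicists has that attribute.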

In [None]:
test_features = binarize_list_features(test_features, train_features=train_features, pad_features=True)
assert(sorted(test_features.columns.tolist()) == sorted(train_features.columns.tolist()))
test_features.head()

The features almost look good now, but one thing is troubling: the mix of binary and ordinal variables complicates matters when it comes to machine learning. It raises issues such as:
- [How to correctly scale features for machine learning algorithms](https://stats.stackexchange.com/questions/69568/whether-to-rescale-indicator-binary-dummy-predictors-for-lasso)
- [The difficulty of interpreting coefficients in generalized linear models](https://andrewgelman.com/2009/07/11/when_to_standar/)
- [The bias introduced by the lower importance given to binary variables in random forests](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)

We would like to avoid these issues altogether.

Seeing as the majority of the features are binary, it makes sense to convert the ordinal variables to binary variables. However, we do not want to lose the ordinal information that is present in these variables. The [rank-hot encoder](http://scottclowe.com/2016-03-05-rank-hot-encoder/) is an encoding that converts ordinal variables to binary variables whilst maintaining the ordinal information. *Scott C. Lowe*, PhD student studying neuroinformatics at the University of Edinburgh, explains that "the **rank-hot encoder** is similar to a *one-hot encoder*, except every feature up to and including the current rank is hot." He illustrates this with the following example:

<table>
  <thead>
    <tr>
      <th>Satisfaction</th>
      <th>Rank Index</th>
      <th>One-Hot Encoding</th>
      <th>Rank-Hot Encoding</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Very bad</td>
      <td>0</td>
      <td><code class="highlighter-rouge">[1, 0, 0, 0, 0]</code></td>
      <td><code class="highlighter-rouge">[0, 0, 0, 0]</code></td>
    </tr>
    <tr>
      <td>Bad</td>
      <td>1</td>
      <td><code class="highlighter-rouge">[0, 1, 0, 0, 0]</code></td>
      <td><code class="highlighter-rouge">[1, 0, 0, 0]</code></td>
    </tr>
    <tr>
      <td>Neutral</td>
      <td>2</td>
      <td><code class="highlighter-rouge">[0, 0, 1, 0, 0]</code></td>
      <td><code class="highlighter-rouge">[1, 1, 0, 0]</code></td>
    </tr>
    <tr>
      <td>Good</td>
      <td>3</td>
      <td><code class="highlighter-rouge">[0, 0, 0, 1, 0]</code></td>
      <td><code class="highlighter-rouge">[1, 1, 1, 0]</code></td>
    </tr>
    <tr>
      <td>Very good</td>
      <td>4</td>
      <td><code class="highlighter-rouge">[0, 0, 0, 0, 1]</code></td>
      <td><code class="highlighter-rouge">[1, 1, 1, 1]</code></td>
    </tr>
  </tbody>
</table>

He goes on to say, "Instead of answering the query “Is the satisfaction x?”, the entries in a rank-hot encoder tell us “Is the satisfaction level at least x?”. This representation of the data allows a linear model to explain the effect of a high-rank as the additive composition of the effect of each rank in turn."
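
The rank-hot column of the table above can be reproduced with a few lines of NumPy. This is only a sketch of the encoding itself: the helper name `rank_hot` is hypothetical and is not the `rank_hot_encode` function used in this notebook.

```python
import numpy as np


def rank_hot(ranks, n_levels):
    """Rank-hot encode integer ranks in the range 0..n_levels-1.

    Hypothetical helper for illustration only.
    """
    ranks = np.asarray(ranks)
    # Column j answers "is the rank at least j + 1?", so rank 0 is all zeros.
    thresholds = np.arange(1, n_levels)
    return (ranks[:, None] >= thresholds).astype(int)


# Very bad, Bad, Neutral, Good, Very good
encoded = rank_hot([0, 1, 2, 3, 4], n_levels=5)
print(encoded)
```

Each row has one column fewer than the one-hot encoding because the lowest rank is represented by all zeros.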

Sounds very useful, doesn't it? Plus there are some other very nice properties of this encoding scheme that are explained in the blog. The main cons of rank-hot encoding, which are shared with one-hot encoding, are:

- The feature space gets larger.
- Information is lost whenever a categorical value is observed in a new instance (i.e. in the test set) that was not observed in the training (or validation) set.
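
The second point can be seen directly with scikit-learn's `OneHotEncoder`. In this minimal sketch the values are invented: the encoder is fitted on the values 0, 1 and 2 and then asked to transform a value it has never seen.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the values observed in a hypothetical training set.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([[0], [1], [2]]))

# Transform one known value (1) and one unseen value (5).
encoded = enc.transform(np.array([[1], [5]])).toarray()
print(encoded)
```

With `handle_unknown='ignore'` the unseen value becomes a row of zeros, so the model can no longer tell it apart from "none of the known categories".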

However, the benefits mentioned earlier are so important that they outweigh these downsides. Plus there are ways of dealing with the increase in the size of the feature space. OK, let's go ahead and **rank-hot encode** the ordinal features.

In [None]:
ordinal_cols = [col for col in train_features.columns if col.startswith('num_')]
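# NOTE: scikit-learn 1.2 renamed `sparse` to `sparse_output`; adjust the keyword if needed
enc = OneHotEncoder(categories='auto', sparse=False, dtype='int64', handle_unknown='ignore')
# DataFrame.append was removed in pandas 2.0, so stack the frames with pd.concat
enc.fit(pd.concat([train_features[ordinal_cols], validation_features[ordinal_cols]]))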

In [None]:
train_features = rank_hot_encode(train_features, enc, columns=ordinal_cols)
train_features = train_features.replace({0: 'no', 1: 'yes'})
assert(len(train_features.columns) == 206)
assert(train_features.select_dtypes('int64').empty)
assert(train_features.notna().all().all())
train_features.head()

In [None]:
validation_features = rank_hot_encode(validation_features, enc, columns=ordinal_cols)
validation_features = validation_features.replace({0: 'no', 1: 'yes'})
assert(sorted(validation_features.columns.tolist()) == sorted(train_features.columns.tolist()))
assert(validation_features.select_dtypes('int64').empty)
assert(validation_features.notna().all().all())
validation_features.head()

In [None]:
test_features = rank_hot_encode(test_features, enc, columns=ordinal_cols)
test_features = test_features.replace({0: 'no', 1: 'yes'})
assert(sorted(test_features.columns.tolist()) == sorted(train_features.columns.tolist()))
assert(test_features.select_dtypes('int64').empty)
assert(test_features.notna().all().all())
test_features.head()

The following columns in the training features have no variation in the values. Nothing can be learnt from these features, so let's drop them.
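
As a minimal sketch of the check on invented data: a column whose values are all "no" has no variation and can be dropped. The column names below are hypothetical.

```python
import pandas as pd

# Hypothetical binary features; 'born_in_ATA' is 'no' for everyone.
features = pd.DataFrame({
    'born_in_DEU': ['yes', 'no', 'no'],
    'born_in_ATA': ['no', 'no', 'no'],
    'lived_in_USA': ['yes', 'yes', 'no'],
})

# True for every column that contains at least one value other than 'no'.
has_yes = (features != 'no').any(axis='index')

# Columns without any 'yes' carry no information, so drop them.
constant_cols = has_yes[~has_yes].index.tolist()
reduced = features.drop(constant_cols, axis='columns')

print(constant_cols)
```

Note that this check only finds columns that are entirely "no". A column that is entirely "yes" would need a different test, for example `features.nunique() == 1`.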

In [None]:
# a column with no value other than 'no' carries no information
no_variation = (train_features != 'no').any(axis='rows')
no_variation = no_variation[~no_variation]
assert(len(no_variation) == 3)
no_variation

In [None]:
train_features = train_features.drop(no_variation.index, axis='columns')
assert(len(train_features) == len(train_physicists))
assert(len(train_features.columns.tolist()) == 203)
assert(len(train_features.select_dtypes('object').columns) == len(train_features.columns.tolist()))
assert(train_features.notna().all().all())
train_features.head()

In [None]:
validation_features = validation_features.drop(no_variation.index, axis='columns')
assert(len(validation_features) == len(validation_physicists))
assert(sorted(validation_features.columns.tolist()) == sorted(train_features.columns.tolist()))
assert(len(validation_features.select_dtypes('object').columns) == len(validation_features.columns.tolist()))
assert(validation_features.notna().all().all())
validation_features.head()

In [None]:
test_features = test_features.drop(no_variation.index, axis='columns')
assert(len(test_features) == len(test_physicists))
assert(sorted(test_features.columns.tolist()) == sorted(train_features.columns.tolist()))
assert(len(test_features.select_dtypes('object').columns) == len(test_features.columns.tolist()))
assert(test_features.notna().all().all())
test_features.head()

Let's take a quick look at the features that remain.

In [None]:
sorted(train_features.drop('full_name', axis='columns').columns.tolist())

The binary encoding has increased the dimensionality of the problem. There are now 202 features (excluding `full_name`) for 542 observations in the training set, 192 observations in the validation set and 193 observations in the test set. A model fit to such data could be prone to overfitting, so dimensionality reduction may be warranted.

## Persisting the Data

Now that we have the training, validation and test features dataframes, let's persist them for future use.
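
The persistence step amounts to sorting the columns and writing a CSV without the index. The sketch below uses an invented one-row dataframe and an in-memory buffer instead of a file on disk, to show that the "yes" and "no" values survive a round trip as plain strings.

```python
import io

import pandas as pd

# Hypothetical features dataframe with unsorted columns.
features = pd.DataFrame({'full_name': ['Ada Example'],
                         'born_in_GBR': ['yes'],
                         'alumnus_of_*': ['no']})

# Sort the columns so every saved file has the same layout.
features = features.reindex(sorted(features.columns), axis='columns')

# Write to an in-memory buffer (a stand-in for a file path) and read it back.
buffer = io.StringIO()
features.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored)
```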

In [None]:
train_features = train_features.reindex(sorted(train_features.columns), axis='columns')
train_features.head()

In [None]:
validation_features = validation_features.reindex(sorted(validation_features.columns), axis='columns')
validation_features.head()

In [None]:
test_features = test_features.reindex(sorted(test_features.columns), axis='columns')
test_features.head()

In [None]:
train_features.to_csv('../data/processed/train-features.csv', index=False)
validation_features.to_csv('../data/processed/validation-features.csv', index=False)
test_features.to_csv('../data/processed/test-features.csv', index=False)