# Exploratory Data Analysis

Having created the [training features](../data/processed/train-features.csv) and [target variable](../data/processed/train-target.csv), we would now like to perform an [exploratory data analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) (EDA) to see what insights we can derive from the data. In particular, we would like to get a feeling for how useful the features may be in predicting the target, that is, whether a physicist is a laureate.

In [None]:
import matplotlib
import matplotlib.lines as mlines
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

## Reading in the Data

First let's read in the training features and the target variable.

In [None]:
train_features = pd.read_csv('../data/processed/train-features.csv')
train_features.head()

In [None]:
train_target = pd.read_csv('../data/processed/train-target.csv')
display(train_target.head())
train_target = train_target.physics_laureate

## Target Distribution

Since we will be predicting Nobel Laureates in Physics, it seems appropriate to start by looking at the distribution of the target variable.

In [None]:
train_target_dist = train_target.value_counts(normalize=True)
train_target_dist.index = ['Non-laureates', 'Laureates']
ax = train_target_dist.plot(kind='bar')
ax.set_title('Nobel Physics Laureates')
ax.set_ylabel('Fraction of Physicists')
ax.set_xticklabels(train_target_dist.index, rotation='horizontal')
ax.set_yticks(np.linspace(start=0, stop=0.8, num=5))
ax.tick_params(axis='both', left=False, bottom=False)
sns.despine(ax=ax, left=True, bottom=True)

The ratio of non-laureates to laureates is about 3.5:1. Because of this class imbalance, an appropriate metric for selecting and evaluating models will need to be chosen later.
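The ratio quoted above can be read directly off the class counts. Here is a minimal sketch, using a small made-up series in place of `train_target` (the toy values are an assumption for illustration, not the real data). It also shows why the imbalance matters for the choice of metric.

```python
import pandas as pd

# Toy stand-in for train_target (hypothetical values, not the real data).
toy_target = pd.Series(['no'] * 7 + ['yes'] * 2, name='physics_laureate')

counts = toy_target.value_counts()
imbalance_ratio = counts['no'] / counts['yes']

# Accuracy of a classifier that always predicts the majority class.
majority_baseline = counts.max() / counts.sum()

print(f'non-laureates : laureates = {imbalance_ratio:.1f}:1')
print(f'majority-class accuracy = {majority_baseline:.2f}')
```

At a 3.5:1 ratio, always predicting "no" already scores roughly 78% accuracy, so plain accuracy would flatter a model that never identifies a laureate.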

## Binary Variable Distributions

There are a lot of feature variables to look at. We would like to get a sense of whether being a laureate has any effect on the distribution of these variables. So we will:

1. Group the binary variables thematically (e.g. group *all* "alumnus" variables).
2. Group the physicists into laureates and non-laureates.
3. For each group of physicists and each category in a theme, determine the fraction of physicists. 

By comparing the fractions in each category, we should be able to see what effect, if any, being a laureate has. A categorical scatter plot seems like a good way to visualize this, so let's go ahead and take a look at the plots.
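The three steps above can be sketched on toy data as follows. The column names `alumnus_a` and `alumnus_b` and all values are hypothetical. The sketch computes a plain per-group fraction of physicists, which may differ from the normalization used by the plotting helper in this notebook.

```python
import pandas as pd

# Toy stand-ins for the binary features and target (hypothetical data).
toy_features = pd.DataFrame({
    'alumnus_a': ['yes', 'no', 'yes', 'no'],
    'alumnus_b': ['no', 'yes', 'no', 'no'],
})
toy_target = pd.Series(['yes', 'yes', 'no', 'no'], name='physics_laureate')

# Step 1: pick the columns belonging to one theme.
theme = [col for col in toy_features.columns if col.startswith('alumnus')]
is_yes = toy_features[theme] == 'yes'

# Steps 2 and 3: split physicists by the target, then take the mean of
# each boolean column, which is the fraction of 'yes' values per group.
fractions = is_yes.groupby(toy_target).mean()
print(fractions)
```

Each row of `fractions` is one group of physicists and each column is one category, so comparing the two rows column by column mirrors what the plots show.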

In [None]:
binary = train_features.select_dtypes('object').drop('full_name', axis='columns').join(train_target)
binary.head()

In [None]:
def plot_catplot(features, target, columns, title='', figsize=(6, 4)):
    """Plot a categorical scatter plot of group fractions.

    For each column, plot the fraction for non-laureates (black
    circles) and for laureates (blue triangles).

    Args:
        features (pandas.DataFrame): Features dataframe.
        target (pandas.Series): Target series.
        columns (list of `str`): Columns in features dataframe
            to plot.
        title (str): Plot title.
        figsize (tuple(int, int)): Default is (6, 4). matplotlib
            figure size in inches x inches. Currently unused; the
            plot size is set by the `height` argument passed to
            `seaborn.catplot`.
    """
    
    if columns[0] == 'gender':
        data = features['gender'].to_frame().join(target)
        groups = data.groupby(by='physics_laureate')
        data = groups['gender'].value_counts()
        
        no = data.loc['no'] / groups.count().loc['no', :].item()
        no = no.to_frame().T
        yes = data.loc['yes'] / groups.count().loc['yes', :].item()
        yes = yes.to_frame().T
    else:
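        # Encode 'yes' as 1 and anything else as 0 (vectorized; avoids
        # DataFrame.applymap, which is deprecated in recent pandas).
        data = (features[columns] == 'yes').astype(int)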
        data = data.join(target).groupby(by='physics_laureate').sum()
        data = data.div(data.sum(axis='columns'), axis='rows')
        
        no = pd.DataFrame(data.loc['no', :]).T
        yes = pd.DataFrame(data.loc['yes', :]).T
    
    # Legend construction below adapted from:
    # https://stackoverflow.com/questions/47391702/matplotlib-making-a-colored-markers-legend-from-scratch
    grid = sns.catplot(data=no, orient='horizontal', height=10, color='black')
    grid.map(sns.stripplot, data=yes, order=yes.columns, orient='horizontal',
             color='blue', marker='^')

    grid.ax.set_title(title)
    grid.set_xlabels('Fraction')
    grid.ax.set_xlim((0, 1.0))
    grid.ax.tick_params(axis='both', left=False, bottom=False)
    
    black_circle = mlines.Line2D([], [], color='black', marker='o', linestyle='None',
                                 markersize=5, label='Non-laureate')
    blue_triangle = mlines.Line2D([], [], color='blue', marker='^', linestyle='None',
                                  markersize=5, label='Laureate')
    grid.ax.legend(handles=[black_circle, blue_triangle],
                   labels=['Non-laureate', 'Laureate'])
    grid.despine(left=True, bottom=True)

In [None]:
alumnus = [col for col in binary.columns if col.startswith('alumnus')]
plot_catplot(binary, train_target, alumnus, title='Alumnus fractions')

We can see that alumnus status has a slight effect. For example, being an alumnus in Germany or France appears favorable, whereas being an alumnus in Russia appears detrimental to becoming a laureate.

In [None]:
num_alma_mater = [col for col in binary.columns if col.startswith('num_alma_mater')]
plot_catplot(binary, train_target, num_alma_mater, title='Number of alma mater fractions')

It is clear that the biggest positive effect, when it comes to alma maters, occurs when the number of alma maters attended is 2.

In [None]:
num_physics_laureate = [col for col in binary.columns if col.startswith('num_physics_laureate')]
plot_catplot(binary, train_target, num_physics_laureate, title='Number of physics laureate fractions')

With regard to the number of relationships with physics laureates, the data is interesting and may even be counter-intuitive in several ways. Surprisingly, having exactly one laureate doctoral advisor severely damages the chances of a Nobel Prize win, but having 1 or 2 laureate academic advisors actually improves the chances. On the other hand, it seems to pay off to teach well, as having good doctoral and notable students really helps a physicist's chances of becoming a laureate. It is interesting that influencing, or being influenced by, 1 or 2 other laureates has a big negative effect on one's chances of walking away with the big prize. This is pure speculation, as it is difficult to know the real reason, but maybe it is due to too much similar work being produced?

In [None]:
born = [col for col in binary.columns if col.startswith('born')]
plot_catplot(binary, train_target, born, title='Born fractions')

We can see that there is an effect of birth place. For instance, being born in the United States, Great Britain or France is very favorable. On the other hand, being born in Russia or any other country not in the list (represented by "***") seems to hurt one's chances of scooping the Nobel Prize. The story is similar with citizenship below.

In [None]:
citizen = [col for col in binary.columns if col.startswith('citizen')]
plot_catplot(binary, train_target, citizen, title='Citizen fractions')

In [None]:
num_birth = [col for col in binary.columns if col.startswith('num_birth')]
plot_catplot(binary, train_target, num_birth, title='Number of birthplaces fractions')

Here the effect is barely noticeable. However, there is a slight increase in the fraction of laureates for physicists in the mid range of country counts. A similar slight increase appears at the higher counts for the number of citizenship countries below.

In [None]:
num_citizenship = [col for col in binary.columns if col.startswith('num_citizenship')]
plot_catplot(binary, train_target, num_citizenship, title='Number of citizenship fractions')

In [None]:
gender = [col for col in binary.columns if col == 'gender']
plot_catplot(binary, train_target, gender, title='Gender fractions')

Unsurprisingly, it doesn't look like it helps to be female if you want to become a Physics Nobel Laureate!

In [None]:
is_ = [col for col in binary.columns if col.startswith('is')]
plot_catplot(binary, train_target, is_, title='Physics field fractions')

There seems to be a really big effect when it comes to the type of physics endeavor. It seems to be all about experiment, and little love is given to the theorists and astronomers.

In [None]:
lived = [col for col in binary.columns if col.startswith('lived')]
plot_catplot(binary, train_target, lived, title='Lived fractions')

It's all about living in the USA / North America. This seems to have a really big effect on the chances of winning a Nobel Prize. Most likely this is due to a lot of the top physics talent emigrating to the United States. Seems like it doesn't pay to stay in Germany or on the Asian continent!  

In [None]:
num_residence = [col for col in binary.columns if col.startswith('num_residence')]
plot_catplot(binary, train_target, num_residence, title='Number of residence fractions')

It also seems like living exclusively in one country really improves the chances of winning a Nobel Prize.

In [None]:
worked = [col for col in binary.columns if col.startswith('worked')]
plot_catplot(binary, train_target, worked, title='Worked fractions')

Interestingly, working in the USA / North America seems to have a detrimental effect on winning a Nobel Prize, which contradicts the above. This may have more to do with the quality of the data than anything else, since much of the workplace data is incomplete. From the data, it certainly seems beneficial to work in Great Britain and in particular at the University of Cambridge.

In [None]:
num_workplaces = [col for col in binary.columns if col.startswith('num_workplaces')]
plot_catplot(binary, train_target, num_workplaces, title='Number of workplaces fractions')

Working in 2 or 3 workplaces seems to improve the chances of becoming a Nobel Laureate, whereas the diversity of workplace location doesn't make any difference.

In [None]:
num_years_lived_group = [col for col in binary.columns if col.startswith('num_years_lived_group')]
plot_catplot(binary, train_target, num_years_lived_group, title='Number of years lived group fractions')

Being in the age range groups `65-79` and `80-94` seems to improve the chances of winning a Physics Nobel Prize, whilst being in the youngest age range groups seems to negatively affect the chances of picking up the big prize.

## Conclusion

It appears that some of the features may be useful in helping to predict whether a physicist will or will not be awarded the Nobel Prize in Physics. However, in the EDA, we have totally ignored the following:

- Correlations between categorical features.
- Correlations between categorical features and the target.

None of these relationships is easy to analyze due to the size of the feature space. A more formalized exploratory approach is needed to reduce the size of the feature space and gain further insight into the factors which affect whether a physicist will have the title of laureate bestowed upon them.
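As a rough illustration of what measuring a correlation between two categorical features could look like, the sketch below computes Cramér's V from a contingency table. This statistic is not used anywhere in the analysis above. The helper name `cramers_v` and the toy series are assumptions made for this example, and no bias or continuity correction is applied.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Cramér's V for two categorical series (0 = none, 1 = perfect).

    Assumes each series takes at least two distinct values.
    """
    observed = pd.crosstab(x, y).to_numpy().astype(float)
    n = observed.sum()
    # Expected counts under independence: outer product of marginals / n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))


# Toy data (hypothetical values).
a = pd.Series(['yes', 'yes', 'no', 'no'])
b = pd.Series(['yes', 'yes', 'no', 'no'])  # identical to a
c = pd.Series(['yes', 'no', 'yes', 'no'])  # independent of a

v_identical = cramers_v(a, b)
v_independent = cramers_v(a, c)
print(v_identical, v_independent)
```

Applying such a measure to every pair of columns grows quadratically with the number of features, which is one reason a pairwise approach quickly becomes unwieldy here.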