# Exploratory Data Analysis

Having created the [training features](../data/processed/train-features.csv) and [target variable](../data/processed/train-target.csv), I would now like to perform an [exploratory data analysis](https://en.wikipedia.org/wiki/Exploratory_data_analysis) (EDA) to see what insights I can derive from the data. In particular, I'd like to get a feeling for how useful the features may be in predicting the target, that is, whether a physicist is or is likely to become a *Nobel Laureate in Physics*.

In [None]:
import matplotlib
import matplotlib.lines as mlines
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline

## Reading in the Data

First let's read in the training features and the target variable.

In [None]:
train_features = pd.read_csv('../data/processed/train-features.csv')
train_features.head()

In [None]:
# Note: the `squeeze` argument was removed in pandas 2.0 and is not needed here
target = pd.read_csv('../data/processed/train-target.csv')
display(target.head())
target = target.physics_laureate

## Target Distribution

Since the goal of the study is to predict *Nobel Laureates in Physics*, it seems appropriate to start by looking at the distribution of the target variable.

In [None]:
target_dist = target.value_counts(normalize=True)
target_dist.index = ['Non-laureates', 'Laureates']
ax = target_dist.plot(kind='bar')
ax.set_title('Nobel Physics Laureates')
ax.set_ylabel('Fraction of physicists')
ax.set_xticklabels(target_dist.index, rotation='horizontal')
ax.set_yticks(np.linspace(start=0, stop=0.8, num=5))
ax.tick_params(axis='both', left=False, bottom=False)
sns.despine(ax=ax, left=True, bottom=True)

The bar chart shows that the ratio of non-laureates to laureates is about 3.5:1. Because of this class imbalance, an appropriate metric for selecting and evaluating models will need to be chosen later.
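To put numbers on the imbalance, the class ratio and the accuracy of a trivial majority-class model can be computed directly from the target. The following is only a sketch: `toy_target` is a made-up series standing in for the real target, with counts chosen to reproduce the 3.5:1 ratio.

```python
import pandas as pd

# Hypothetical stand-in for the real target: 7 non-laureates, 2 laureates.
toy_target = pd.Series(['no'] * 7 + ['yes'] * 2, name='physics_laureate')

counts = toy_target.value_counts()
imbalance_ratio = counts['no'] / counts['yes']

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = counts.max() / counts.sum()

print(f'non-laureates : laureates = {imbalance_ratio:.1f}:1')
print(f'majority-class baseline accuracy = {baseline_accuracy:.2f}')
```

At a 3.5:1 ratio, always predicting "non-laureate" already scores about 0.78 accuracy, which is why plain accuracy would be a misleading metric here.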

## Continuous Variables Distributions

There are a lot of feature variables to look at, so for the time being let's consider only the numerical variables. I'd like to get a sense of how these variables are distributed across laureates and non-laureates. A box plot seems like a good way to visualize this, so let's take a look at the box plots for all the numerical variables.

In [None]:
numerical = train_features.select_dtypes('float64').join(target)
numerical.head()

In [None]:
def plot_grouped_boxplots(features, target, columns, title='', figsize=(6, 4)):
    """Plot grouped boxplots.
    
    Plot grouped boxplots of numerical features.

    Args:
        features (pandas.DataFrame): Features dataframe.
        target (pandas.Series): Target series.
        columns (list of `str`): Columns in features dataframe
            to plot.
        title (str): Plot title.
        figsize (tuple(int, int)): Default is (6, 4). matplotlib
            figure size in inches x inches.
    """

    # Honor the `columns` argument instead of always plotting every column
    columns = list(columns)
    fig, ax = plt.subplots(nrows=len(columns), sharex=True, figsize=figsize)
    data = features[columns].join(target)
    for i, column in enumerate(columns):
        # plt.subplots returns a single Axes (not an array) when nrows == 1
        axes = ax[i] if isinstance(ax, np.ndarray) else ax
        sns.boxplot(data=data, y='physics_laureate', x=column, ax=axes,
                    hue='physics_laureate')
        sns.despine(left=True, bottom=True)
        axes.set_xlabel(column.replace('_', ' '))
        axes.set_ylabel('Physics laureate')
        axes.tick_params(axis='both', left=False, bottom=False)
        # Some seaborn versions do not draw a legend here, so guard the removal
        legend = axes.get_legend()
        if legend is not None:
            legend.remove()
        if i == 0:
            axes.set_title(title)
    fig.tight_layout()

In [None]:
alma_mater = [col for col in train_features.columns.tolist()
              if 'alma_mater' in col]
alma_mater = train_features[alma_mater]
plot_grouped_boxplots(alma_mater, target, alma_mater.columns, title='Alma maters')

The ratio of the number of alma maters may help separate laureates from non-laureates, since its value is consistently higher for laureates. The ratios for the country and continent codes look useful too: non-laureates show a spread of lower values, whilst laureates on average seem to study in more than one country and on more than one continent.

In [None]:
workplaces = [col for col in train_features.columns.tolist() if 'workplaces' in col]
workplaces = train_features[workplaces]
plot_grouped_boxplots(workplaces, target, workplaces.columns, title='Workplaces')

The ratio of the number of workplaces may help distinguish laureates, as both the median and the interquartile range are larger for laureates. It is interesting, though, that there are several outliers amongst the non-laureates. The ratios for the country and continent codes do not look particularly useful.

In [None]:
years = [col for col in train_features.columns.tolist() if 'years' in col]
years = train_features[years]
plot_grouped_boxplots(years, target, years.columns, title='Years lived')

The ratio of the number of years lived seems to have a slight effect on the target: the median is slightly higher for laureates, and both the interquartile range and the overall range are narrower.
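The medians and interquartile ranges read off the box plots can also be tabulated per group, which makes the comparison less dependent on eyeballing. This is a minimal sketch on made-up data: the `toy` frame and the `num_workplaces` column name are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the feature name is illustrative only.
toy = pd.DataFrame({
    'num_workplaces': [1.0, 2.0, 3.0, 4.0, 2.0, 4.0, 6.0, 8.0],
    'physics_laureate': ['no'] * 4 + ['yes'] * 4,
})

grouped = toy.groupby('physics_laureate')['num_workplaces']

# The same statistics a box plot draws: median and the two quartiles.
summary = pd.DataFrame({
    'median': grouped.median(),
    'q1': grouped.quantile(0.25),
    'q3': grouped.quantile(0.75),
})
summary['iqr'] = summary['q3'] - summary['q1']
print(summary)
```

The same pattern should work on the real data by joining a feature column with `target` before grouping.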

The remaining variables, which are associated with birth and death places, chemistry and physics laureate relationships, citizenships and residences, do not seem very useful in predicting the target, although there may be some useful information to be gleaned from the outliers.

In [None]:
birth = [col for col in train_features.columns.tolist() if 'birth' in col]
birth = train_features[birth]
plot_grouped_boxplots(birth, target, birth.columns, title='Birth places')

In [None]:
chemistry = [col for col in train_features.columns.tolist() if 'chemistry' in col]
chemistry = train_features[chemistry]
plot_grouped_boxplots(chemistry, target, chemistry.columns,
                      title='Chemistry laureate relationships', figsize=(12, 10))

In [None]:
citizenship = [col for col in train_features.columns.tolist() if 'citizenship' in col]
citizenship = train_features[citizenship]
plot_grouped_boxplots(citizenship, target, citizenship.columns, title='Citizenships')

In [None]:
death = [col for col in train_features.columns.tolist() if 'death' in col]
death = train_features[death]
plot_grouped_boxplots(death, target, death.columns, title='Death places')

In [None]:
physics = [col for col in train_features.columns.tolist() if 'physics' in col]
physics = train_features[physics]
plot_grouped_boxplots(physics, target, physics.columns,
                      title='Physics laureate relationships', figsize=(12, 10))

In [None]:
residence = [col for col in train_features.columns.tolist() if 'residence' in col]
residence = train_features[residence]
plot_grouped_boxplots(residence, target, residence.columns, title='Residences')

## Continuous Variables Correlations

Are there any strong correlations between the numerical variables and between any numerical variables and the target? Let's take a look at the correlation matrix.

In [None]:
# Adapted from https://seaborn.pydata.org/examples/many_pairwise_correlations.html

# Compute the correlation matrix
corr = pd.concat([numerical.drop('physics_laureate', axis='columns'),
                  target.map({'yes': 1, 'no': 0})], axis='columns').corr()

# Generate a mask for the upper triangle
# Use the builtin bool: the np.bool alias was removed in NumPy 1.24
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
ax = sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1.0, vmax=1.0, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
ax.set_title('Correlation matrix for continuous variables');

Here we see that there are only weak correlations between the numerical variables and the target. Of these, the strongest are with the variables associated with alma mater, workplaces, and physics laureate doctoral students and academic advisors. There are, however, a few strong correlations amongst the features themselves:

- Alma mater variables
- Workplaces variables
- Residence variables
- Citizenship variables
- Ratio of the number of chemistry and physics laureate notable students
- Ratio of the number of chemistry and physics spouses (these are the Curies)
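Strongly correlated feature pairs like those listed above can also be extracted from the correlation matrix programmatically rather than read off the heatmap. The sketch below uses a made-up frame (`toy`, with hypothetical columns `a`, `b` and `c`) and an arbitrary threshold of 0.8; neither comes from the analysis itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features; 'b' is exactly 2 * 'a', so they correlate perfectly.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})

toy_corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once
# and the diagonal of ones is excluded.
upper = np.triu(np.ones(toy_corr.shape, dtype=bool), k=1)
pairs = toy_corr.where(upper).stack()

threshold = 0.8  # arbitrary cut-off chosen for illustration
strong_pairs = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(strong_pairs)
```

Applied to `corr` from the cell above, the same steps would list the feature pairs that stand out in the heatmap.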

## Categorical Variables Distributions

Now let's consider only the categorical variables. I'd like to get a sense of whether being a laureate has any effect on the distribution of these variables. So I'll:

1. Group the categorical variables thematically (e.g. group all "alumnus" variables).
2. Group the physicists into laureates and non-laureates.
3. For each group of physicists and each category in a theme, determine the fraction of physicists. 

By comparison of the fractions in the categories I should be able to see what effect, if any, being a laureate versus being a non-laureate has. A categorical scatter plot seems like a good way to visualize this so let's go ahead and take a look at the plots.
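The three steps above can be sketched on a small made-up example. The `toy` frame and its `born_in_A` / `born_in_B` columns are hypothetical; the normalisation follows the one used in `plot_catplot`, where the "yes" counts in a theme are divided by their total within each group of physicists.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
toy = pd.DataFrame({
    'born_in_A': ['yes', 'no', 'no', 'no', 'yes', 'yes'],
    'born_in_B': ['no', 'yes', 'yes', 'yes', 'no', 'no'],
    'physics_laureate': ['no', 'no', 'no', 'no', 'yes', 'yes'],
})

# Step 1: pick the columns that belong to one theme.
theme = ['born_in_A', 'born_in_B']
flags = (toy[theme] == 'yes').astype(int)

# Step 2: split into laureates and non-laureates and count the 'yes' values.
counts = flags.join(toy['physics_laureate']).groupby('physics_laureate').sum()

# Step 3: turn counts into fractions within each group.
fractions = counts.div(counts.sum(axis='columns'), axis='rows')
print(fractions)
```

Each row of `fractions` sums to one, so the two groups can be compared category by category even though they differ in size.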

In [None]:
categorical = train_features.select_dtypes('object').drop(
    'full_name', axis='columns').join(target)
categorical.head()

In [None]:
def plot_catplot(features, target, columns, title='', figsize=(6, 4)):
    """Plot categorical scatter plot.
    
    Plot the fractions of laureates and non-laureates in each category
    of the given categorical features.

    Args:
        features (pandas.DataFrame): Features dataframe.
        target (pandas.Series): Target series.
        columns (list of `str`): Columns in features dataframe
            to plot.
        title (str): Plot title.
        figsize (tuple(int, int)): Default is (6, 4). matplotlib
            figure size in inches x inches. 
    """
    
    # Use the `features` argument rather than the global `categorical` frame
    if columns[0] == 'gender':
        data = features['gender'].to_frame().join(target)
        groups = data.groupby(by='physics_laureate')
        data = groups['gender'].value_counts()
        
        no = data.loc['no'] / groups.count().loc['no', :].item()
        no = no.to_frame().T
        yes = data.loc['yes'] / groups.count().loc['yes', :].item()
        yes = yes.to_frame().T
    else:
        # Encode 'yes' as 1 and anything else as 0 (applymap is deprecated)
        data = (features[columns] == 'yes').astype(int)
        data = data.join(target).groupby(by='physics_laureate').sum()
        data = data.div(data.sum(axis='columns'), axis='rows')
        
        no = pd.DataFrame(data.loc['no', :]).T
        yes = pd.DataFrame(data.loc['yes', :]).T
    
    # Adapted from:
    # https://stackoverflow.com/questions/47391702/matplotlib-making-a-colored-markers-legend-from-scratch
    grid = sns.catplot(data=no, orient='horizontal', height=10, color='black')
    grid.map(sns.stripplot, data=yes, order=yes.columns, orient='horizontal',
             color='blue', marker='^')

    grid.ax.set_title(title)
    grid.set_xlabels('Fraction')
    grid.ax.set_xlim((0, 1.0))
    grid.ax.tick_params(axis='both', left=False, bottom=False)
    
    black_circle = mlines.Line2D([], [], color='black', marker='o', linestyle='None',
                                 markersize=5, label='Non-laureate')
    blue_triangle = mlines.Line2D([], [], color='blue', marker='^', linestyle='None',
                                  markersize=5, label='Laureate')
    grid.ax.legend(handles=[black_circle, blue_triangle],
                   labels=['Non-laureate', 'Laureate'])
    grid.despine(left=True, bottom=True)

In [None]:
alumnus = [col for col in categorical.columns if col.startswith('alumnus')]
plot_catplot(categorical, target, alumnus, title='Alumnus fractions')

We can see a slight effect for the alumnus variables. For example, having studied in Germany or France looks favorable, whereas being an alumnus of the University of Vienna looks detrimental to becoming a laureate.

In [None]:
born = [col for col in categorical.columns if col.startswith('born')]
plot_catplot(categorical, target, born, title='Born fractions')

We can see that there is an effect of birth place. For instance, being born in the United States, Great Britain or France is very favorable. On the other hand, being born in Russia or any other country not in the list (represented by ***) really hurts your chances of walking away with the Nobel Prize. The story is similar with citizenship and death places below.

In [None]:
citizen = [col for col in categorical.columns if col.startswith('citizen')]
plot_catplot(categorical, target, citizen, title='Citizen fractions')

In [None]:
died = [col for col in categorical.columns if col.startswith('died')]
plot_catplot(categorical, target, died, title='Died fractions')

In [None]:
gender = [col for col in categorical.columns if col == 'gender']
plot_catplot(categorical, target, gender, title='Gender fractions')

Unsurprisingly, it doesn't look like it helps to be female if you want to become a Physics Nobel Laureate!

In [None]:
is_ = [col for col in categorical.columns if col.startswith('is')]
plot_catplot(categorical, target, is_, title='Physics subtype fractions')

There seems to be a really big effect when it comes to the type of physics endeavor. It seems to be all about experiment, and little love is given to the theorists and astronomers.

In [None]:
lived = [col for col in categorical.columns if col.startswith('lived')]
plot_catplot(categorical, target, lived, title='Lived fractions')

It's all about living in the USA / North America. This has a really big effect on the chances of winning a Nobel Prize. Most likely this is due to a lot of the top physics talent emigrating to the United States. Seems like it doesn't pay to stay in Germany or on the Asian continent!  

In [None]:
worked = [col for col in categorical.columns if col.startswith('worked')]
plot_catplot(categorical, target, worked, title='Worked fractions')

Interestingly enough, working in the USA / North America seems to have a detrimental effect on winning a Nobel Prize, which contradicts the above. I think this may have more to do with the quality of the data than anything else: the workplaces data is probably far from complete. From the data, it certainly seems beneficial to work in Great Britain, and in particular at the University of Cambridge.

## Conclusion

It appears that some of the features may be useful in helping to predict whether a physicist will or will not be awarded the Nobel Prize in Physics. However, in the EDA, I have totally ignored the following:

- Correlations between categorical variables.
- Correlations between continuous and categorical variables.
- Correlations between categorical variables and the target.

None of these relationships is easy to analyze due to the size of the feature space. A more formalized exploratory approach is needed to reduce the size of the feature space and gain further insight into the factors which affect whether a physicist will have the title of Laureate bestowed upon them.
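As one possible starting point for the categorical relationships listed above, the association between two categorical variables can be summarized with Cramér's V. This is not something the analysis above uses: the `cramers_v` helper and the toy series are hypothetical, and the sketch computes the uncorrected statistic and assumes both variables take at least two values.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Uncorrected Cramér's V between two categorical series (0 to 1)."""
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the margins.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return np.sqrt(chi2 / (n * min_dim))


# Hypothetical toy data: one feature tracks the target exactly, one does not.
toy_target = pd.Series(['no', 'no', 'yes', 'yes'])
dependent = pd.Series(['a', 'a', 'b', 'b'])
independent = pd.Series(['a', 'b', 'a', 'b'])

v_dependent = cramers_v(dependent, toy_target)
v_independent = cramers_v(independent, toy_target)
print(v_dependent, v_independent)
```

Values near 1 indicate a strong association and values near 0 indicate none, so the statistic could be used to rank categorical features against the target or against each other.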