In [1]:
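# Render the conclusion from its markdown file; filename= makes IPython read the file rather than treat the path as markdown text.
from IPython.display import Markdown
Markdown(filename='../../conclusion.md')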

# Conclusion

The goals of this study were to answer the following questions:

- Do demographics play a major role in selecting the winner of the Nobel Prize in Physics?
- Which demographic factors have the biggest influence on the outcome?
- Who are the most likely winners of [The Nobel Prize in Physics 2018](https://www.nobelprize.org/prizes/physics/2018/summary/)?

To try to answer these questions, we collected demographic data on almost a thousand world-renowned physicists from [DBpedia](https://wiki.dbpedia.org/about).

From the data, we constructed two sets of binary features. The first was a relatively high-dimensional feature set built from the original demographic data. The second was a lower-dimensional feature set, derived from the original features using the [CorEx topic modeling](https://github.com/gregversteeg/corex_topic) approach.

Furthermore, we split the data into training, validation and test sets for learning, model selection and assessment of generalization performance, respectively. Because the Nobel Prize in Physics cannot be awarded posthumously and we needed to develop a model to predict laureates, we sampled the data so that the training set consisted of deceased physicists and the validation and test sets consisted of living physicists. A **classifier two-sample hypothesis test** formally detected that this sample selection bias introduced a [covariate shift](nobel_physics_prizes/notebooks/5.1-covariate-shift.ipynb) between the training set and the validation and test sets. We tried to correct for the covariate shift during learning by reweighting training samples according to their importance using the [Kullback-Leibler Importance Estimation Procedure (KLIEP)](https://www.ism.ac.jp/editsec/aism/pdf/060_4_0699.pdf). **Logistic regression**, **support vector machine** and **random forest** classifiers were trained on both feature sets, with and without **importance weighting**, in order to predict Physics Nobel Laureates.
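The idea behind a classifier two-sample test can be sketched as follows: label each row by the sample it came from, train a classifier to tell the samples apart, and test whether its held-out accuracy is better than chance. This is a minimal illustration only. The nearest-centroid classifier, the synthetic binary data and all function names are assumptions made for the sketch, not the classifier, data or code used in the study.

```python
import math
import random


def centroid(rows):
    """Per-feature mean of a list of equal-length binary feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def squared_distance(x, c):
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def classifier_two_sample_test(sample_p, sample_q, seed=0):
    """Return (held-out accuracy, one-sided p-value) for H0: P == Q."""
    rng = random.Random(seed)
    # Label each row by the sample it came from, then shuffle and split in half.
    data = [(x, 0) for x in sample_p] + [(x, 1) for x in sample_q]
    rng.shuffle(data)
    half = len(data) // 2
    fit, held_out = data[:half], data[half:]

    # Fit the stand-in classifier: one centroid per sample label.
    c0 = centroid([x for x, label in fit if label == 0])
    c1 = centroid([x for x, label in fit if label == 1])

    # Held-out accuracy at telling the two samples apart.
    correct = sum(
        (squared_distance(x, c1) < squared_distance(x, c0)) == bool(label)
        for x, label in held_out
    )
    n = len(held_out)
    accuracy = correct / n

    # Under H0 the accuracy is approximately Normal(0.5, 0.25 / n).
    z = (accuracy - 0.5) / math.sqrt(0.25 / n)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return accuracy, p_value


def draw(rng, n_rows, probabilities):
    """Synthetic binary rows; feature j is 1 with probability probabilities[j]."""
    return [[int(rng.random() < p) for p in probabilities] for _ in range(n_rows)]


rng = random.Random(42)
deceased = draw(rng, 200, [0.2] * 5)  # synthetic stand-in for the training set
living = draw(rng, 200, [0.8] * 5)    # synthetic stand-in for the validation/test sets
accuracy, p_value = classifier_two_sample_test(deceased, living)
print(f"accuracy={accuracy:.2f}, p-value={p_value:.3g}")
```

A small p-value means the classifier separates the two samples better than chance, which is evidence that their feature distributions differ.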

We introduced a new performance measure, the **normalized area under the Matthews Correlation Coefficient curve**, $NAUC_{MCC}$, and used it for model selection. This measure allowed us to compare the performance of the models across all classification thresholds, and it has the convenient property that its interpretation is analogous to that of the [Matthews Correlation Coefficient](https://en.wikipedia.org/wiki/Matthews_correlation_coefficient) (MCC): it has an upper limit of +1 indicating a perfect prediction, a lower limit of -1 indicating total disagreement between prediction and observation, and a mid value of 0 representing a random prediction. The best performing model was the logistic regression model trained on the original feature set, with an $NAUC_{MCC}$ of 0.37 on the validation data. Neither reducing the feature dimensionality nor importance weighting was found to be advantageous.
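This conclusion does not restate the formula for $NAUC_{MCC}$, so the sketch below rests on an assumption: that the measure is the trapezoidal area under the MCC-versus-threshold curve divided by the width of the threshold range, which keeps the result within [-1, +1]. The function names, the evenly spaced threshold grid and the toy labels and scores are all hypothetical.

```python
import math


def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 or 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: MCC is 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def normalized_auc_mcc(y_true, scores, n_thresholds=101):
    """Trapezoidal area under MCC(threshold), divided by the threshold range.

    ASSUMPTION: this normalization is one plausible definition, not
    necessarily the one used in the study.
    """
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    values = [mcc(y_true, [int(s >= t) for s in scores]) for t in thresholds]
    area = sum(
        0.5 * (values[i] + values[i + 1]) * (thresholds[i + 1] - thresholds[i])
        for i in range(n_thresholds - 1)
    )
    return area / (thresholds[-1] - thresholds[0])


# Toy labels and predicted probabilities, invented for illustration.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.25, 0.40, 0.45, 0.55, 0.60, 0.80, 0.90]
print(round(normalized_auc_mcc(labels, scores), 3))
```

Under this definition, a model whose MCC is high over a wide range of thresholds scores better than one that peaks at a single threshold, which is what makes the measure useful for threshold-independent model selection.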

An optimal threshold of 0.513, corresponding to the maximum value of the MCC and at which the true negative rate (TNR) is higher than the true positive rate (TPR), was chosen as the operating point of the model. This threshold was chosen because minimizing false positives (i.e. maximizing the TNR) is more important than minimizing false negatives (i.e. maximizing the TPR) when classifying the physicists as laureates and non-laureates.
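One reading of this selection rule is "take the threshold with the highest MCC among those where the TNR exceeds the TPR". The sketch below implements that reading on invented data. The function names, the threshold grid and the toy labels and scores are hypothetical, and the study's actual procedure may differ.

```python
import math


def confusion_counts(y_true, scores, threshold):
    """Return (tp, tn, fp, fn) when predicting 1 for scores >= threshold."""
    tp = tn = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted = int(score >= threshold)
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def rates_and_mcc(y_true, scores, threshold):
    """Return (TPR, TNR, MCC) at the given threshold."""
    tp, tn, fp, fn = confusion_counts(y_true, scores, threshold)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return tpr, tnr, mcc


def select_threshold(y_true, scores, n_thresholds=1001):
    """Highest-MCC threshold among those with TNR > TPR (None if none qualify)."""
    best_threshold, best_mcc = None, -math.inf
    for i in range(n_thresholds):
        threshold = i / (n_thresholds - 1)
        tpr, tnr, mcc = rates_and_mcc(y_true, scores, threshold)
        if tnr > tpr and mcc > best_mcc:
            best_threshold, best_mcc = threshold, mcc
    return best_threshold, best_mcc


# Toy labels and predicted probabilities, invented for illustration.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.10, 0.25, 0.40, 0.45, 0.55, 0.60, 0.80, 0.90]
threshold, best_mcc = select_threshold(labels, scores)
tpr, tnr, _ = rates_and_mcc(labels, scores, threshold)
print(f"threshold={threshold:.3f} MCC={best_mcc:.2f} TPR={tpr:.2f} TNR={tnr:.2f}")
```

Requiring TNR > TPR biases the operating point toward fewer false positives, at the cost of missing some true laureates.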

The logistic regression classifier achieved an MCC of 0.36 when evaluated on the test data, which indicates that the classifier performs much better than both random chance (MCC = 0) and the "naive" [baseline classifier](5.0-baseline-model.ipynb) (MCC = 0.19). We conclude from this that there are significant underlying patterns in the demographic data that correlate with being a Physics Nobel Laureate. However, we avoided making strong statements about the classifier's performance in absolute terms and concluded that we would not be willing to make recommendations to the *Nobel Committee* based on its predictions. Essentially, the number of false positives was too high to make any substantial claims about biases that may be present when deciding Nobel Physics Prize winners.

In spite of this, we looked at the demographic factors that had the biggest influence on the logistic regression model classifying a physicist as a laureate. Being an *experimental physicist* was by far the most influential feature. The next two most influential features were *having at least one physics laureate doctoral student* and *living for at least 65-79 years*. Other interesting influential features were being a citizen of France or Switzerland, working at Bell Labs or the University of Cambridge, being an alumnus in Asia and having at least two alma maters.
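This section does not say how influence was measured. One common approach for a logistic regression on binary features, assumed here, is to rank features by the magnitude of their coefficients and to read `exp(coefficient)` as the multiplicative change in the odds when the feature is present. The feature names and coefficient values below are placeholders and are not the study's results.

```python
import math

# Hypothetical coefficients for binary features; these are NOT the study's values.
coefficients = {
    "feature_a": 1.20,
    "feature_b": -0.35,
    "feature_c": 0.70,
    "feature_d": 0.05,
}

# Rank by absolute coefficient: for 0/1 features, a larger magnitude means a
# larger shift in the log-odds when the feature is present.
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)

for name, beta in ranked:
    # exp(beta) is the multiplicative change in the odds when the feature is 1.
    print(f"{name}: coefficient={beta:+.2f}, odds ratio={math.exp(beta):.2f}")
```

A positive coefficient (odds ratio above 1) pushes the prediction toward "laureate", while a negative one (odds ratio below 1) pushes it away.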

Finally, we used the logistic regression model to predict the most likely winners of the [2018 Nobel Physics Prize](https://www.nobelprize.org/prizes/physics/2018/summary/). We were unable to correctly predict the winners because they were not in the original [list of physicists](../data/raw/physicists.txt) derived from Wikipedia. However, we found that the actual winners ([Gérard Mourou](https://en.wikipedia.org/wiki/G%C3%A9rard_Mourou), [Arthur Ashkin](https://en.wikipedia.org/wiki/Arthur_Ashkin) and [Donna Strickland](https://en.wikipedia.org/wiki/Donna_Strickland)) possessed several of the most important demographic factors identified by the logistic regression model.