## Lecture 9 Notebook: Descriptives

In this notebook, we will revisit the replication data from the paper <a href="https://www.cambridge.org/core/journals/american-political-science-review/article/mps-for-sale-returns-to-office-in-postwar-british-politics/E4C2B102194AA1EA0D2F1F777EAE3C08">"MPs for Sale? Returns to Office in Postwar British Politics"</a> by Eggers and Hainmueller.

Let's do some more basic descriptive analysis.

Like before, we start by importing some libraries.




In [None]:
# Importing libraries for tables and plots
from datascience import Table
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')
import numpy as np
from scipy import stats

The datascience library we used before is a bit limited in plotting capabilities, so this week we will use the "pandas" library, which calls tables "dataframes".

In [None]:
# Importing the data into a table called mps
mps = Table.read_table("MPs.csv")
mps = mps.to_df()
mps

One way that we can refer to columns in a data frame is with the format dataframe['column']. So, to pull the 'party' variable from the mps dataframe we type mps['party']. Think of this as a "shortcut" for the mps.column('party') syntax we used before.

In [None]:
mps["party"]

# Categorical Variables

How should we summarize this variable? For categorical and ordinal variables, a good place to start is with a frequency table. We can get that with the `.value_counts()` function. 


In [None]:
mps["party"].value_counts()

Pretty straightforward: our data frame has 223 Tory and 204 Labour candidates. Often it is nice to convert this to proportions (multiply by 100 to get percentages), which we do by adding `normalize=True` to the function:

In [None]:
mps["party"].value_counts(normalize=True)
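To see exactly what `normalize=True` changes, here is a minimal sketch on a toy series. The five party labels below are made up for illustration and are not taken from MPs.csv.

```python
import pandas as pd

# Hypothetical toy data, not the real MPs.csv values
party = pd.Series(["tory", "labour", "tory", "tory", "labour"])

counts = party.value_counts()                 # raw frequencies
shares = party.value_counts(normalize=True)   # proportions that sum to 1
percents = (shares * 100).round(1)            # the same proportions as percentages

print(counts)
print(percents)
```

The normalized version is just each count divided by the total number of rows, so the proportions always add up to 1.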

Another categorical variable is the region:

In [None]:
mps["region"].value_counts()

In [None]:
mps["region"].value_counts(normalize=True)

To summarize categorical variables visually, we can use bar charts. The functionality builds nicely on what we did before: we first make a table with the value counts, then add `.plot(kind='bar')` at the end.

In [None]:
mps["party"].value_counts().plot(kind='bar')

In [None]:
mps["region"].value_counts().plot(kind='bar')

We can also display this with a pie chart:

In [None]:
mps["region"].value_counts().plot(kind='pie')

What is a "typical" value for a categorical variable? If there isn't any "order" to the values, the best way to think about a typical value is the most common one, or the *mode*. We can compute this with the `stats.mode()` function.

In [None]:
stats.mode(mps['party'])

In [None]:
stats.mode(mps['region'])
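Depending on the installed SciPy version, `stats.mode` may be restricted to numeric data, so it can help to know the pandas alternative. The sketch below uses the `.mode()` method on a made-up region column. The region names are hypothetical stand-ins, not the real values in the data.

```python
import pandas as pd

# Hypothetical toy data standing in for mps['region']
region = pd.Series(["London", "Wales", "London", "Scotland", "London", "Wales"])

modes = region.mode()        # returns a Series, because ties are possible
most_common = modes.iloc[0]  # the first entry; the only one when there is no tie
print(most_common)
```

Because `.mode()` returns every value that is tied for most frequent, it is worth checking the length of the result before assuming there is a single mode.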

# Ordinal Variables
There aren't any ordinal variables in the dataset, so here we will make one up. A good ordinal variable for this context is how highly ranked the MP was at the peak of their career. In the UK parliament, a natural ranking is "backbencher" (think regular MP), "cabinet member", and "Prime Minister". And we can actually figure out from the data whether they won or lost using the 'margin' variable. 

So, we are going to create a rank variable that is 0 for losers (from the real data), and for winners we will randomly assign a 1 for backbencher, 2 for cabinet member, and 3 for PM. So, higher rank means "more success".

First, let's create the winner variable:

In [None]:
mps['winner'] = 1*(mps['margin'] > 0)

In [None]:
mps['winner'].value_counts()
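The `1*(...)` trick can look odd at first. A small sketch with made-up margins (hypothetical values, not from the data) shows the two steps separately: the comparison produces True/False, and multiplying by 1 converts that to 1/0.

```python
import pandas as pd

# Hypothetical margins: positive means the candidate won
toy = pd.DataFrame({"margin": [-0.12, 0.03, 0.25, -0.01]})

is_winner = toy["margin"] > 0   # boolean Series: False, True, True, False
toy["winner"] = 1 * is_winner   # multiplying by 1 turns True/False into 1/0
print(toy)
```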

In the following line of code we will do our random draw for winners. Don't worry about the details here.

In [None]:
mps['rank']=mps['winner'] * np.random.choice([1,2,3], 427, p=[.75, .2, .05])
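For those who do want the details, here is a rough sketch of the same idea on six made-up candidates. The winner indicator and the seed are assumptions chosen only for illustration; the seed just makes the draw repeatable.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, used here only so the draw is repeatable

winner = np.array([0, 1, 1, 0, 1, 1])  # hypothetical 0/1 winner indicator

# Draw one rank per candidate: 1 (75% chance), 2 (20%), or 3 (5%)
draws = np.random.choice([1, 2, 3], len(winner), p=[.75, .2, .05])

# Multiplying by the indicator forces every loser's rank to 0
rank = winner * draws
print(rank)
```

Using `len(winner)` instead of a hard-coded number means the draw always matches the number of rows, even if the data set changes size.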

Like with categorical variables, we can summarize these with `.value_counts()`

In [None]:
mps['rank'].value_counts()

And a bar chart

In [None]:
mps['rank'].value_counts().plot(kind='bar')

Also as with categorical variables, the mode will tell us the most frequent value:

In [None]:
stats.mode(mps['rank'])

This tells us the most common outcome is a 0, meaning the candidate lost and never entered parliament.

A second way we can define a typical value is the median. One way to think of the median is that we sort into increasing order, and then pick the number in the middle.

To pick a shorter example, suppose we have five data points: 2, 6, 5, 4, 1. These sort to 1, 2, 4, 5, 6. Hopefully this makes sense visually: the middle is the third point, in the sense that two values are smaller than it and two are larger. So, here the median is 4.

If we have 7 data points the median is the 4th largest, if we have 101 data points it is the 51st largest, etc. In general, if we have an **odd** number of data points n, the median is the value in position (n+1)/2 of the sorted list.


Things are a bit trickier if we have an even number of data points. Say we add 100 to our initial list of 5, giving 1, 2, 4, 5, 6, 100. In a sense the "middle" is now "between 4 and 5". So for this case we define the median as halfway between these two points, which is 4.5.

Checking this:

In [None]:
np.median([2,6,5,4,1])

In [None]:
np.median([2,6,5,4,1,100])
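The sort-and-pick-the-middle recipe can also be written out by hand. The function below is a sketch for illustration (`median_by_hand` is a made-up name, not part of NumPy), and it should agree with `np.median` on both example lists.

```python
import numpy as np

def median_by_hand(values):
    """Sort the values, then take the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # odd n: the value in position (n+1)/2
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n: halfway point

odd_example = [2, 6, 5, 4, 1]
even_example = [2, 6, 5, 4, 1, 100]

print(median_by_hand(odd_example), np.median(odd_example))
print(median_by_hand(even_example), np.median(even_example))
```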

Let's see the median of our rank variable:

In [None]:
np.median(mps['rank'])

This tells us that the "typical" person in our data set doesn't even get into parliament! 

[Optional: suppose we dropped 200 losers from the data set. What would the median be now?]

Since our data are numbers, we can also compute the mean or average. The mean is the sum of all of the numbers, divided by the number of observations. The formula for this is that for an array of numbers $X_1,X_2,...,X_n$, the mean is:
\begin{align}
    \bar{X} = \frac {\sum_{i=1}^n X_i}{n}
\end{align}


So in our 5 data point example it would be (2 + 6 + 5 + 4 + 1)/5. Checking this:

In [None]:
(2 + 6 + 5 + 4 + 1)/5

In [None]:
np.mean([2,6,5,4,1])

Let's apply it to our rank variable:

In [None]:
np.mean(mps['rank'])

Does this really mean anything? Arguably not. .45 is not a valid rank, and it doesn't really make sense to describe someone as 45% of the way between a loser and a backbencher.

For some ordinal variables means are informative. For example, if we gave them an ideological score like in Problem set 2, where -2 is extreme left, -1 is moderate left, 0 is centrist, 1 is moderate right, and 2 is extreme right, then the average of those would give us a number between -2 and 2 which would give a sense of the "average" ideology of the candidates. 

# Numeric Variables
There are a few numeric variables here, including the vote margin. For numeric variables it usually doesn't make sense to construct a frequency table, because often each value shows up only once. A nice analog to a frequency table for numeric variables is a *histogram*.

In [None]:
mps.hist('margin')


Histograms break numeric variables into "bins" and then plot the frequency of those bins. Here the bins have a width of about .05. So, the highest bin is telling us that about 115 candidates in the data have a margin of about -.04 to .01. (If you wanted to break these into more natural cutoffs like 0 to .05, that can be done, but let's just take the default.)

This gives us a general shape of the data. The margin variable is typically close to zero, with fewer and fewer cases as the margin gets very big or small.
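To make the idea of bins concrete, here is a sketch that counts observations per bin with `np.histogram`, using bin edges placed at round numbers. The margins are made up for illustration and are not the real data.

```python
import numpy as np

# Hypothetical margins, not the real data
margins = np.array([-0.30, -0.12, -0.04, -0.02, 0.00, 0.01, 0.03, 0.08, 0.15, 0.40])

# 16 evenly spaced edges from -0.30 to 0.45 give 15 bins of width 0.05
edges = np.linspace(-0.30, 0.45, 16)
counts, edges = np.histogram(margins, bins=edges)

# Each count is the number of observations between two neighboring edges
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:+.2f} to {right:+.2f}: {count}")
```

A histogram plot is a bar chart of these counts. The pandas `hist` method also accepts a `bins` argument, so the same edges could in principle be passed there to get the more natural cutoffs.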

Remember last week we also looked at the net wealth at death, and we had to do a bit of mathematical trickery to determine this from the "natural logarithm of net wealth at death" variable.

In [None]:
mps['net'] = np.exp(mps['ln.net'])
mps.hist('net')

Here wealth is being plotted in "tens of millions of pounds" (the 1e7 part). So .2 means 2 million, .4 means 4 million, etc. This is telling us that the vast majority of candidates have a net wealth below 1 million at death.

Now let's compute typical values. Again we can look at the mean, median and mode:

In [None]:
print("Mean: ", np.mean(mps['margin']))
print("Median: ", np.median(mps['margin']))
print("Mode: ", stats.mode(mps['margin']))

For the margin variable, the mean and median are very similar. The mode is way off though: someone who lost by 48%! In general we won't pay attention to the mode of numeric variables. This is because, as with many numeric variables, all of the margin amounts are unique. So the frequency of all of them is 1! When every value is tied like this, the `stats.mode` function returns the lowest value.
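The tie can be seen directly by counting how often each value appears. This sketch uses `np.unique` on made-up margins. It illustrates the tie-breaking logic and is not a copy of SciPy's internal implementation.

```python
import numpy as np

margins = np.array([0.12, -0.48, 0.03, -0.07])  # hypothetical, all distinct

# np.unique returns the distinct values in sorted order, with their counts
values, counts = np.unique(margins, return_counts=True)
print(counts)  # every value appears exactly once

# argmax returns the first maximum, so the smallest value wins the tie
tie_winner = values[np.argmax(counts)]
print(tie_winner)
```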

Now let's do net wealth:

In [None]:
print("Mean: ", np.mean(mps['net']))
print("Median: ", np.median(mps['net']))

Here we get very different results when looking at the mean vs the median. This is because the mean is very sensitive to "extreme observations".

Here is a simple way to see that. Remember when we started with a data array of [2,6,5,4,1] and then added 100, the median went up just a bit, from 4 to 4.5. Now let's do the same but for the mean:

In [None]:
np.mean([2,6,5,4,1])

In [None]:
np.mean([2,6,5,4,1,100])

Adding this one new observation makes the mean go up by about 16, from 3.6 to roughly 19.7! In some sense this gives a weird answer for a "typical value", as no value in this data array is near 20.
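The two shifts can be put side by side in a short sketch, using only the example arrays from above:

```python
import numpy as np

base = [2, 6, 5, 4, 1]
with_outlier = base + [100]

mean_shift = np.mean(with_outlier) - np.mean(base)        # 118/6 - 18/5
median_shift = np.median(with_outlier) - np.median(base)  # 4.5 - 4

print("Mean moves by:", round(mean_shift, 2))
print("Median moves by:", median_shift)
```

The median only cares about which value sits in the middle, so replacing 100 with a million would leave it unchanged, while the mean would grow without limit.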

A similar thing could happen in our real data. Suppose we add a (deceased) Jeff Bezos to our data. His current net worth in pounds is about 135 billion. How would adding him change the mean and median?

In [None]:
print("Mean with Bezos: ", np.mean(np.append(mps['net'], [135000000000])))
print("Median with Bezos: ", np.median(np.append(mps['net'], [135000000000])))

So, which is a better measure of a typical value? It depends! The median is less sensitive to outliers, describing a typical member independent of how extreme the non-typical units are. And if we want to answer questions like "how does winning office affect wealth", we don't want our answer to be entirely determined by whether Bezos was a winner or a loser when he ran (particularly if we know that his wealth was independent of his hypothetical run for office).

However, averages are often meaningful too. For example, if we did know that a handful of MPs took advantage of their office to become billionaires, then this would be politically meaningful, even if the typical MP got little out of holding office. 

# Measuring Spread
When dealing with numeric variables, we also care a lot about the *spread* of the values. One way to think about this is "how far is a typical value from the average"? A first way to define this is the variance. If we have an array with n numeric observations,  $X_1,X_2,...X_n$ with mean $\bar{X}$, the variance is given by:
\begin{align}
    \frac {\sum_{i=1}^n (X_i - \bar{X})^2} {n-1}
\end{align}

This is close to saying "what is the average squared distance from the mean", though instead of dividing by n we divide by n-1. The reason for this difference is a bit subtle, and it matters less and less as $n$ grows: with a few hundred observations, as we have here, the two versions differ by well under 1%.

Here is how we compute the variance. Note that `np.var` divides by n by default; passing `ddof=1` gives the n-1 version shown in the formula.

In [None]:
np.var(mps['margin'])

In [None]:
np.var(mps['net'])
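To connect the formula to the function call, here is a sketch that computes the variance by hand on the small example array, both with n-1 and with n in the denominator, and compares each to `np.var`.

```python
import numpy as np

x = np.array([2, 6, 5, 4, 1])
n = len(x)
squared_dev = (x - x.mean()) ** 2   # squared distance of each point from the mean

var_n_minus_1 = squared_dev.sum() / (n - 1)  # divide by n-1
var_n = squared_dev.sum() / n                # divide by n instead

print(var_n_minus_1, np.var(x, ddof=1))  # these two agree
print(var_n, np.var(x))                  # np.var's default (ddof=0) divides by n
```

With only five points the two versions differ noticeably, and the gap shrinks as the number of observations grows.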

Note that in the margin case we got a very small number, and in the net worth case we got an extremely large number. That is because our margin variable is typically around .1, and squaring this gives .01. On the other hand, with the wealth variable, we are taking values in the hundreds of thousands and squaring them. One way to think about this problem is that the "unit" of the variance is the "unit squared" of the original variable. In order to get something easier to interpret, we often focus on the *standard deviation* of the variable, which is given by the square root of the variance.

In [None]:
np.std(mps['margin'])

In [None]:
np.std(mps['net'])
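As a quick check that the standard deviation really is the square root of the variance, here is a sketch on made-up wealth figures. The values are hypothetical, and `ddof=1` is passed to both functions so that they use the same denominator.

```python
import numpy as np

# Hypothetical wealth figures in pounds
wealth = np.array([150_000, 220_000, 310_000, 480_000, 1_200_000])

variance = np.var(wealth, ddof=1)  # units: pounds squared
std_dev = np.std(wealth, ddof=1)   # units: pounds

print(variance)
print(std_dev)
print(np.isclose(std_dev, np.sqrt(variance)))
```

The standard deviation comes out on the same scale as the wealth values themselves, which is what makes it easier to interpret than the variance.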

We can also compute variances and standard deviations for ordinal variables, though, as with the mean, whether this has much meaning depends on what we are doing.

In [None]:
np.var(mps['rank'])

In [None]:
np.std(mps['rank'])