Table of Contents:  
1) [Measuring Undemocratic Parties](#gt)  
2) [Deaths in War](#war)  
3) [Presidential Approval](#putin)  
4) [GDP Distributions](#gdp)

## 1. Measuring Undemocratic Parties <a class="anchor" id="gt"></a>

Political parties are often accused of behaving in an undemocratic manner. Is it possible to come up with a measure of this?

Here are two proposals to measure whether parties were undemocratic in 2020. Proposal 1 is to conduct an expert survey, in which people who know a lot about a country are asked to rate parties on a 3-point scale ("No undemocratic behavior", "Minor undemocratic behavior", "Major undemocratic behavior"). Proposal 2 is to come up with 3 examples of undemocratic behavior that can be coded systematically (e.g., Did they propose or enact changes to electoral rules in their favor? Did they censor media outlets that criticized the party?), and then count how many of these things a party did in 2020.

**<span style="color:blue">1.1 What kind of variable (categorical, ordinal, numeric) would measure 1 produce, and why? (2 pts)</span>**

*ANSWER TO 1.1 HERE*

**<span style="color:blue">1.2 What kind of variable (categorical, ordinal, numeric) would measure 2 produce, and why? (2 pts) </span>**

*ANSWER TO 1.2 HERE*

**<span style="color:blue">1.3 Which measure do you think is more reliable, and why? (2 pts)</span>**

*ANSWER TO 1.3 HERE*

**<span style="color:blue">1.4 Which measure do you think is more valid, and why? (2 pts) </span>**

*ANSWER TO 1.4 HERE*

## 2. Deaths in War <a class="anchor" id="war"></a>

A researcher is interested in the severity of wars across time. She first collects data on wars worldwide, finding the average number of people who died in every war with more than 25,000 deaths.

She puts together a data frame called `wars` with the following four columns:
- `War`: the name of the war
- `Location`: the primary location of the war
- `Deaths`: the average deaths from the war
- `Region`: a more general region where the war took place

The cell below imports some libraries and the `wars` table. Run it to create the table!

In [None]:
import numpy as np
from scipy import stats
from datascience import Table
import pandas as pd
from datascience.predicates import are
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

wars = Table.read_table('data/wardeaths.csv')
wars = wars.to_df()
wars

Each column of the `wars` data table is made up of an array. As a reminder, you can access the arrays that make up the columns by writing the name of the data frame followed by the variable name in quotation marks and square brackets. Here's an example that will return the array of values that make up the `War` column.

In [None]:
wars["War"]

Recall we can use the `.value_counts()` function to count how often values show up in a column of our data (or any array). For example, this creates an array of randomly simulated die rolls and then makes a table of how often each number comes up.

In [None]:
# Don't worry about the pd.Index part: 
# this is just putting our array in a format that value_counts() knows how to work with.
fakedata = pd.Index(np.random.randint(1, 7, 200))
fakedata.value_counts()

**<span style="color:blue"> Question 2.1 Use the `.value_counts()` function to make a table to count the frequencies of the `Region` variable. (1 pt) </span>**

In [None]:
# Code for 2.1 here
wars["Region"].value_counts()

The deaths variable is numeric, and so one way to describe the distribution is with a histogram. 

`sns.distplot(values)` allows us to create a histogram of input values. In the cell below, we plot the distribution of a uniform random sample of values between 0 and 100.


In [None]:
sns.distplot(np.random.uniform(0,100, 30))

**<span style="color:blue"> Question 2.2 Make a histogram of the deaths variable. (1 pt) </span>**

In [None]:
# Code for 2.2 here
sns.distplot(wars['Deaths'])

You should see that there are a few extremely high observations. One way we can see this is by pulling out a subset of the table. We can do this by putting a "condition" in square brackets after the data frame name. For example, this code pulls the wars with relatively few deaths.

In [None]:
wars[wars['Deaths'] < 100000]
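
To see why a condition in square brackets selects rows, it can help to look at the condition on its own. The sketch below uses a small made-up table (the war names and numbers are hypothetical, not taken from `wardeaths.csv`): the comparison produces one `True`/`False` value per row, and indexing with that result keeps only the `True` rows.

```python
import pandas as pd

# Hypothetical toy data, NOT the real `wars` table.
toy = pd.DataFrame({
    "War": ["A", "B", "C", "D"],
    "Deaths": [30000, 250000, 90000, 5000000],
})

# The condition by itself gives one True/False value per row.
mask = toy["Deaths"] < 100000
print(mask.tolist())

# Putting the mask in square brackets keeps only the rows marked True.
small_wars = toy[mask]
print(small_wars)
```

Writing `toy[toy["Deaths"] < 100000]` does both steps in one line, which is the form used in the cell above.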

**<span style="color:blue">Question 2.3. Write a line of code to identify the wars with more than 1 million deaths. (1 pt)</span>**

In [None]:
# Code for 2.3 here
wars[wars['Deaths'] > 1000000]

**<span style="color:blue">Question 2.4: Set `mean_deaths` and `median_deaths` equal to the mean and median of the `Deaths` column using `np.mean` and `np.median` in the cells below. (2 pts)</span>**

In [None]:
mean_deaths = np.mean(wars['Deaths'])
mean_deaths

In [None]:
median_deaths = np.median(wars['Deaths'])
median_deaths

**<span style="color:blue">Question 2.5: Explain why the mean (or median) is larger. (2 pts)</span>** 

*ANSWER TO 2.5 HERE*

Let's see how these measures of typical values change if we drop extreme observations. Run the following lines of code to explore this.

In [None]:
wars_noww = wars[wars['Deaths'] <= 25000000]
np.mean(wars_noww['Deaths'])

In [None]:
np.median(wars_noww['Deaths'])

**<span style="color:blue">Question 2.6. What do these lines of code do? (Hints: refer back to the table produced by Question 2.3, and note that `<=` means "less than or equal to".) Compare the output here to your answers to Question 2.4. (2 pts) </span>**

*ANSWER TO 2.6 HERE*
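
The effect of a single extreme value on the mean and the median can also be seen on a tiny made-up array. This is only a sketch: the numbers below are hypothetical and are not drawn from the `wars` data.

```python
import numpy as np

# Hypothetical death counts: four moderate values and one extreme value.
toy_deaths = np.array([30000, 40000, 50000, 60000, 20000000])

# Mean and median with the extreme value included.
mean_all = np.mean(toy_deaths)
median_all = np.median(toy_deaths)

# Drop the extreme value with a condition, then recompute.
trimmed = toy_deaths[toy_deaths < 1000000]
mean_trimmed = np.mean(trimmed)
median_trimmed = np.median(trimmed)

print("with extreme value:   ", mean_all, median_all)
print("without extreme value:", mean_trimmed, median_trimmed)
```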

Collecting data on historic wars can be challenging and time consuming. For the sake of illustration, suppose we could only figure this out for a random sample of the wars. A way we can quickly simulate this is with the `.sample` function. For example, the following takes a random sample of the number of deaths from 10 wars:

In [None]:
wars['Deaths'].sample(10)

Let's suppose we had enough resources to collect data on about half of the major wars, or 45. To get a sense of whether this would give us a reliable estimate of the average deaths per war, we can simulate repeated samples of this size with the following code, and then plot the distribution of sample means.

In [None]:
n=45
np.random.seed(32020)
sample_dist45 = [np.mean(wars['Deaths'].sample(n)) for _ in range(10000)]
sns.distplot(sample_dist45)
plt.axvline(np.mean(wars['Deaths']), ymax=1, color='r')

Note that this isn't quite normally distributed around the real average (represented by the red line): in fact there are two peaks, one below the real average and one above. 

**<span style="color:blue"> Ungraded question: what explains this pattern? (Hint: how often will our sample include World War II?) </span>**

*ANSWER TO UNGRADED QUESTION*

One way to make our analysis closer to the ideal of the Central Limit Theorem is to not include the most extreme wars, subsetting to those with fewer than a million deaths.

In [None]:
wars_nobig = wars[wars['Deaths'] < 1000000]

In [None]:
n=45
np.random.seed(32020)
# Drawing 10,000 samples of size n and recording each sample mean
sample_dist_nobig = [np.mean(wars_nobig['Deaths'].sample(n)) for _ in range(10000)]
# Plotting the distribution of sample means, with the true mean in red
sns.distplot(sample_dist_nobig)
plt.axvline(np.mean(wars_nobig['Deaths']), ymax=1, color='r')

Let's compare this to what would happen if we only had the time to collect data on 10 wars.

In [None]:
smalln=10
np.random.seed(32020)
# Drawing 10,000 samples of size smalln and recording each sample mean
smallsample_dist_nobig = [np.mean(wars_nobig['Deaths'].sample(smalln)) for _ in range(10000)]
# Plotting the two distributions
sns.distplot(sample_dist_nobig)
sns.distplot(smallsample_dist_nobig)
plt.axvline(np.mean(wars_nobig['Deaths']), ymax=1, color='r')

**<span style="color:blue">Question 2.7. Identify three aspects of the Central Limit Theorem which are illustrated by this picture. (3 pts.) </span>**

*ANSWER TO 2.7 HERE*

While it is less realistic, one way we can illustrate the theoretical properties of the Central Limit Theorem is to sample *with replacement* from the data, which allows us to take very large samples even from a small population. In the following cell, we can see that with big enough samples, means from the original data with the outliers of the World Wars are still roughly normally distributed:

In [None]:
replacen=20000
np.random.seed(32020)
# Drawing 10,000 samples of size replacen, with replacement, and recording each sample mean
sample_dist_rep = [np.mean(np.random.choice(wars['Deaths'], replacen)) for _ in range(10000)]
# Plotting the distribution of sample means, with the true mean in red
sns.distplot(sample_dist_rep)
plt.axvline(np.mean(wars['Deaths']), ymax=1, color='r')

The theoretical result of the Central Limit Theorem predicts that the standard deviation of the sampling distribution will be:

In [None]:
np.std(wars['Deaths'])/np.sqrt(replacen)

<span style="color:blue">**Question 2.8. Write a line of code to check that the standard deviation of the sampling distribution is very close to the theoretical prediction. (1 pt)**</span>

In [None]:
# Code for 2.8 below
np.std(sample_dist_rep)
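
The same relationship between the population standard deviation and the spread of sample means can be checked on synthetic data. The sketch below is an illustration under stated assumptions: the population is a made-up skewed (exponential) array rather than the `wars` data, and the sample size and number of repetitions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed population standing in for real data.
population = rng.exponential(scale=100.0, size=1000)

n = 400      # size of each sample, drawn with replacement
reps = 5000  # number of repeated samples

# Each row is one sample of size n; take the mean of every row.
samples = rng.choice(population, size=(reps, n), replace=True)
sample_means = samples.mean(axis=1)

# Theoretical prediction versus the simulated spread of sample means.
theoretical_se = np.std(population) / np.sqrt(n)
simulated_se = np.std(sample_means)
print(theoretical_se, simulated_se)
```

The two printed numbers should be close but not identical, because the simulation uses a finite number of repeated samples.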

## 3. Presidential Approval <a class="anchor" id="putin"></a>
In this question, we will explore Russian presidential approval ratings. We use data from the Levada Center, which conducts public opinion polls in Russia. Its surveys are generally considered among the most credible conducted in a non-democratic country. Their most recent poll, conducted in August 2020 with a sample of 1600, found that 69% of respondents approve of Vladimir Putin’s performance as President.

The following cell loads in Putin's approval as the table `putin`.

In [None]:
from utils import table_dict, time
from datascience import Table
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
import scipy.stats as stats
%matplotlib inline

putin = Table().with_columns(table_dict)
putin

For a better visual of Putin's approval, we will graph his approval and disapproval ratings between 1999 and 2020. For now, you don't need to know how the plotting works, but it's useful to get an idea of the data.

In [None]:
#Plot Putin's approval rates against his disapproval rates from 1999 to 2020
putin = putin.to_df()
putin.plot(figsize=(8, 6), linewidth=2.5)
sns.set(font_scale=1.4)
plt.xlabel("Date", labelpad=15)
plt.ylabel("Percentage of People", labelpad=15)
plt.title("Putin's Approval Ratings", y=1.02, fontsize=22);


If we want to know the highest and lowest approval ratings registered we can use the `min` and `max` functions:

In [None]:
np.min(putin["Approved"])

In [None]:
np.max(putin["Approved"])

We can figure out when these numbers were recorded using tricks similar to those in Section 2:

In [None]:
putin[putin["Approved"] == 31]

In [None]:
putin[putin["Approved"] == 89]

Suppose that the August 1999 survey was a simple random sample with 1500 respondents. The standard error of the estimate for approval would then be:

In [None]:
se_aug99 = np.sqrt((.31)*(1-.31)/1500)
se_aug99

And a 95% confidence interval has lower and upper bounds:

In [None]:
lower_aug99 = .31 - 1.96*se_aug99
upper_aug99 = .31 + 1.96*se_aug99
print('The 95% confidence interval is ['+ str(lower_aug99)+' , '+ str(upper_aug99)+ ']')
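
The two steps above (standard error, then bounds) can be wrapped in one small helper. This is a minimal sketch: `proportion_ci` is a hypothetical name introduced here for illustration, and it assumes a simple random sample and the normal approximation used in the cells above.

```python
import numpy as np

def proportion_ci(p, n, z=1.96):
    """Return (lower, upper) bounds of a normal-approximation
    confidence interval for a sample proportion p with sample size n."""
    se = np.sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se

# Reproduce the August 1999 interval: 31% approval, 1500 respondents.
lower, upper = proportion_ci(0.31, 1500)
print('The 95% confidence interval is [' + str(lower) + ' , ' + str(upper) + ']')
```

Changing the `z` argument gives an interval at a different confidence level.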

**<span style="color:blue">Question 3.1. Suppose the June 2015 survey was a simple random sample with 1600 respondents. Modify the code below to set `se_jun15` to the standard error on the approval. (1 pt)</span>**

In [None]:
se_jun15 = np.sqrt(.89*(1-.89)/1600)
se_jun15

**<span style="color:blue">Question 3.2: Modify the code below to produce a 95% confidence interval for the June 2015 approval rating (2 pts)</span>** 

In [None]:
lower_jun15 = .89 - 1.96*se_jun15
upper_jun15 = .89 + 1.96*se_jun15
print('The 95% confidence interval is ['+ str(lower_jun15)+' , '+ str(upper_jun15)+ ']')

**<span style="color:blue"> Question 3.3. Now write code in the cell below to produce a 90% confidence interval. As a reminder, for any normally distributed variable about 90% of the data will lie within 1.64 standard deviations of the mean. (1 pt) </span>**

In [None]:
# Code for 3.3 here
lower_jun15_90 = .89 - 1.64*se_jun15
upper_jun15_90 = .89 + 1.64*se_jun15
print('The 90% confidence interval is ['+ str(lower_jun15_90)+' , '+ str(upper_jun15_90)+ ']')
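
The multipliers 1.96 and 1.64 are rounded values from the standard normal distribution. As a sketch of where they come from, the helper below (a hypothetical function, `z_critical`, not part of the original notebook) uses `scipy.stats.norm.ppf` to find the cutoff that leaves half of the excluded probability in each tail.

```python
from scipy import stats

def z_critical(confidence):
    """Two-sided critical value of the standard normal distribution
    for a given confidence level (e.g. 0.95)."""
    tail = (1 - confidence) / 2
    return stats.norm.ppf(1 - tail)

z90 = z_critical(0.90)
z95 = z_critical(0.95)
print(z90, z95)
```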

**<span style="color:blue">Question 3.4. Compare your answers to 3.2 and 3.3. Which interval is "wider", and why? (2 pts)</span>**

*ANSWER TO 3.4 HERE*

Now let's ask whether Putin is more or less popular in some subgroups. Here is a formula we can use to calculate a standard error for a difference of proportions.

In [None]:
def se_dprop(p1, p2, n1, n2):
    sd1 = np.sqrt(p1*(1-p1))
    sd2 = np.sqrt(p2*(1-p2))
    return np.sqrt((sd1**2)/n1 + (sd2**2)/n2)

For example, if group 1 has 500 people and 10% support a candidate, and group 2 has 700 people and 20% support the candidate, the standard error for the difference of proportions between the groups is:

In [None]:
se_dprop(p1=.1, p2=.2, n1=500, n2=700)
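
Written out as an equation, the `se_dprop` function computes the following, where $\hat{p}_1, \hat{p}_2$ are the two sample proportions and $n_1, n_2$ the two group sizes (this notation is introduced here for illustration, and the formula assumes the two groups are independent samples):

$$
\mathrm{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}
$$

For the example above:

$$
\sqrt{\frac{0.1 \times 0.9}{500} + \frac{0.2 \times 0.8}{700}} = \sqrt{0.00018 + 0.00022857} \approx 0.0202
$$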

In the June 2020 survey, the overall approval was 60%. Suppose that 480 people in the sample live in the Moscow area, and that the approval rating among these residents was 55%, while the remaining 1120 respondents, who live outside of Moscow, give Putin a 72% approval rating.

**<span style="color:blue"> Question 3.5 Use the `se_dprop` function to compute the standard error on this difference in approval (1 pt) </span>**

In [None]:
se_putin_diff = se_dprop(p1=.55, p2=.72, n1=480, n2=1120)
se_putin_diff

<span style="color:blue">**Question 3.6. Now, write some code like you used for question 3.2 to compute a 95% confidence interval for the difference in Putin approval rating (2 pts)**</span>

In [None]:
# Code for 3.6 here
lower_diff = .55 - .72 - 1.96*se_putin_diff
upper_diff = .55 - .72 + 1.96*se_putin_diff
print('The 95% confidence interval is ['+ str(lower_diff)+' , '+ str(upper_diff)+ ']')

**<span style="color:blue">Question 3.7. Does 0 lie in this confidence interval? Interpret this finding. (2 pts) </span>**

*ANSWER TO 3.7 HERE*