# PS 3 - Problem Set 4.
Due date: Wednesday, 12/9, at 5pm. 

## 1. Correlation with Tips <a class="anchor" id="corr"></a>
We'll be looking at a dataset that records the amount tipped on restaurant bills along with several other characteristics. These include the bill size, whether or not the payer smokes, the day they ate, whether they ate dinner or lunch, and how many people were in their party.


In [None]:
import pandas as pd
import numpy as np
import seaborn as sns 
from scipy import stats

import statsmodels.api as sm
import statsmodels.formula.api as smf

import matplotlib.pyplot as plt
import plotly.express as px
from ipywidgets import *
%matplotlib inline

tips = sns.load_dataset("tips")
tips.head()


As you probably know, a common rule of thumb is to tip 15%. Here is a plot of the total bill and the tip, with a line corresponding to tipping 15%.

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip")
x = np.arange(0,60)
tip_15 = .15*x
plt.plot(x, tip_15)

Recall one way to assess the strength of the relationship is by looking at the correlation between the two variables. A function we can use for this is `stats.pearsonr(VAR1, VAR2)`. 
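As a reminder of the call pattern, here is a minimal sketch on made-up arrays (the names `hours` and `score` are hypothetical and are not part of the tips data). It shows that `stats.pearsonr` returns two values: the correlation coefficient and a two-sided p-value.

```python
import numpy as np
from scipy import stats

# Made-up data for illustration only (not the tips data).
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
score = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # roughly 2 * hours

# pearsonr returns the correlation coefficient and a two-sided p-value.
r, p_value = stats.pearsonr(hours, score)
print(r, p_value)
```

A coefficient near 1 means the points lie close to an upward-sloping line; a coefficient near -1 means they lie close to a downward-sloping line.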

<span style="color:blue">**Question 1.1. Use the `stats.pearsonr` function to find the correlation between the `tip` variable and the `total_bill` variable. Hint: one way to pull a column named `colname` from a data frame called `df` is `df['colname']`. (1 pt)**</span>

In [None]:
# Code for 1.1
tip_cor = ...
tip_cor

<span style="color:blue">**Question 1.2. Is the correlation positive or negative? What does this mean? (1 pt)**</span>

*Answer to 1.2 here*

The correlation tells us how strong the relationship is, but not whether people tip close to 15%. An easier-to-interpret way to look at the relationship is to run a bivariate regression.

Let's do this with the `smf.ols` function. Recall the syntax here is `smf.ols(formula, data=df).fit()`. For a bivariate regression, the formula should look like 'DV ~ IV', where DV and IV are the column names in the data frame `df`.

In [None]:
tip_ols = smf.ols('tip ~ total_bill', data=tips).fit()
tip_ols.summary()

<span style="color:blue">**Question 1.3. What is the slope on the `total_bill` variable? How does this compare to what it would be if everyone tipped 15%? (1 pt)**</span>

*Answer to 1.3*

<span style="color:blue">**Question 1.4. Use some of the output from above to answer the question "could this relationship just be driven by random chance?" (2 pts)**</span>

*Answer to 1.4*

Next, let's do a visual comparison of the predicted tips from the data and the 15% rule. The following plots the OLS prediction with an orange line and the 15% prediction with a blue line.

In [None]:
sns.scatterplot(data=tips, x="total_bill", y="tip")
x = np.arange(0,60)
tip_15 = .15*x
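# Look up the coefficients by name; positional indexing on a labeled Series is deprecated in recent pandas
tip_pred = tip_ols.params['Intercept'] + tip_ols.params['total_bill']*x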
plt.plot(x, tip_15)
plt.plot(x, tip_pred)

<span style="color:blue">**Question 1.5. What does the comparison between these lines tell us about how predicted tips compare to the 15% rule? (1 pt)** </span>

*Answer to 1.5*

Another variable that might affect the tip is the size of the party, represented by the `size` variable. 


<span style="color:blue">**Question 1.6. Modify the code below to use the `smf.ols` function to run a bivariate regression where `tip` is the dependent variable and `size` is the independent variable. (1 pt)**</span>

In [None]:
#Code for 1.6
tip_size = smf.ols(...).fit()
tip_size.summary()

<span style="color:blue">**Question 1.7. Interpret the coefficient on the "size" variable (1 pt).**</span>

*Answer to 1.7*

Now let's see what happens if we run a regression trying to predict tips with both of these variables.

<span style="color:blue">**Question 1.8. Modify the code below to run a regression where `tip` is the dependent variable and `size` and `total_bill` are independent variables. Remember the "formula" part of the code for multivariate regressions is "DV ~ IV1 + IV2 + ...". (1 pt)**</span>
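To illustrate the formula syntax only, here is a small sketch on a made-up data frame. The frame `toy` and its columns `y`, `x1`, and `x2` are hypothetical and unrelated to the tips data. The values are constructed so that `y = 1 + 2*x1 + 1*x2` exactly, which makes the fitted coefficients easy to check.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data for illustration only: y = 1 + 2*x1 + 1*x2 exactly.
toy = pd.DataFrame({
    "y":  [3.0, 5.0, 4.0, 8.0, 9.0, 11.0],
    "x1": [1.0, 2.0, 1.0, 3.0, 3.0, 4.0],
    "x2": [0.0, 0.0, 1.0, 1.0, 2.0, 2.0],
})

# "DV ~ IV1 + IV2": one dependent variable, two independent variables.
toy_fit = smf.ols("y ~ x1 + x2", data=toy).fit()

# Coefficients are labeled by name: Intercept, x1, x2.
print(toy_fit.params)
```

Each coefficient is the predicted change in `y` for a one-unit change in that variable, holding the other variable fixed.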

In [None]:
# Code for 1.8
tip_multi = ...
tip_multi.summary()

<span style="color:blue">**Question 1.9. Compare the coefficient on the `size` variable in the multivariate regression to the one in the bivariate regression from question 1.6. What does this tell us about why larger groups tend to tip more? (2 pts)**</span>

*Answer to 1.9*

## 2. Extremists in Primaries<a class="anchor" id="prim"></a>
We will be working with the `extremist` table below, which comes from the replication data from <a href="https://www.cambridge.org/core/journals/american-political-science-review/article/who-punishes-extremist-nominees-candidate-ideology-and-turning-out-the-base-in-us-elections/366A518712BE9BCC1CB035BF53095D65">this paper</a>, which studies the effect of nominating an "extreme" candidate in a primary on general election performance. 

Each row in the data set corresponds to a congressional district that had a competitive primary in a given year. Using data on campaign contributions, the authors look at how the more "extreme" of the top two candidates did in the primary, and then how the winner of the primary did in the general election.

The `extremist` table has 4 columns: 
* `treat`: whether the extremist candidate won in the primary (1=yes, 0= no)
* `vote_general`: the party vote share in the general election
* `rv`: the difference in vote share for the extreme candidate minus the less extreme candidate in the primary
* `absdist`: the difference in ideology between the primary candidates.

Run the cell below to load and view the table.

In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy import stats
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline

from datascience import Table
from datascience.predicates import are
df = pd.read_stata('data/rd_analysis.dta').dropna(subset=['turnout_party_share'])
extremist_df = df[['treat', 'vote_general', 'rv', 'absdist']].dropna()
# Display the first few rows of the table
extremist_df.head()


We can check that the `treat` variable is coded as we want by looking at the relationship between the primary vote share and `treat`.

In [None]:
sns.scatterplot(data=extremist_df, x='rv', y='treat')
plt.axvline(0)

<span style="color:blue">**Question 2.1. Explain what this graph shows in words. (1 pt)**</span>

*Answer for 2.1*

Let's compare the distribution of general election vote shares for the more extreme vs more moderate candidates.

The code below creates arrays of general election vote shares for the extreme candidates and the moderate candidates, then makes histograms, with the extremists in blue and the moderates in orange.

In [None]:
gen_extreme = extremist_df.vote_general[extremist_df['treat']==1]
gen_mod = extremist_df.vote_general[extremist_df['treat']==0]
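# sns.distplot is deprecated; histplot with kde=True draws a comparable density histogram
sns.histplot(gen_extreme, kde=True, stat="density", color="C0")
sns.histplot(gen_mod, kde=True, stat="density", color="C1")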

It looks like the moderates do a bit better, but let's check this more formally. 


<span style="color:blue">**Question 2.2. Modify the code below to compute the average vote share for extremists, the average vote share for moderates, and the difference of means. (2 pts)**</span>

In [None]:
# Code for 2.2
avg_extreme = ...
avg_extreme

In [None]:
# Code for 2.2
avg_mod = ...
avg_mod

In [None]:
# Code for 2.2
dom = ...
dom

<span style="color:blue">**Question 2.3. Now do a t test to check if this difference is statistically significant. Recall the syntax here is `stats.ttest_ind(array1, array2)`. (1 pt)**</span>
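As a reminder of the call pattern, here is a minimal sketch on two made-up samples (the names `group_a` and `group_b` are hypothetical and are not the election data). It shows that `stats.ttest_ind` returns the t statistic and a two-sided p-value.

```python
import numpy as np
from scipy import stats

# Made-up samples for illustration only (not the election data).
group_a = np.array([5.1, 4.9, 5.3, 5.0, 5.2])
group_b = np.array([4.1, 3.9, 4.3, 4.0, 4.2])

# ttest_ind returns the t statistic and a two-sided p-value.
# By default it assumes the two groups have equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)
```

The sign of the t statistic follows the order of the arguments: it is positive when the first array has the larger mean.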

In [None]:
# Code for 2.3 
stats.ttest_ind(...)

<span style="color:blue">**Question 2.4. Interpret this t test. (1 pt)**</span>

*Answer for 2.4*

We can also look at this relationship as a bivariate regression. 

<span style="color:blue">**Question 2.5. Modify the code below to run a bivariate regression where `vote_general` is the dependent variable and `treat` is the independent variable. (1 pt)**</span>

In [None]:
# Code for 2.5
all_ols = ...
all_ols.summary()

<span style="color:blue">**Question 2.6. The output above should have values that are equal to the difference of means, t value, and p value you computed in questions 2.2 and 2.3. Why is this true? (2 pts)**</span>

*Answer for 2.6*

In the analysis above we are comparing the results for all elections in the data set. In the paper most of the analysis focuses on elections where the *primary* was close. 

<span style="color:blue">**Question 2.7. Why might this do a better job of isolating the causal effect of nominating an extremist? (2 pts)**</span>

*Answer for 2.7*

The following block of code creates a subset of the data called `extremist_close` where the primary election margin was less than 10%.

In [None]:
width = .1
extremist_close = extremist_df[abs(extremist_df['rv']) < width]
extremist_close

<span style="color:blue">**Question 2.8. Write code to run a bivariate regression with `vote_general` as the dependent variable and `treat` as the independent variable, but restricted to close elections. (1 pt)**</span>

In [None]:
# Code for 2.8
close_ols = ...
close_ols.summary()

<span style="color:blue">**Question 2.9. Compare the results from question 2.8 to the results from question 2.5. Assuming the coefficient on `treat` in the regression for just close elections represents the causal effect of nominating an extremist, what does this tell us about the selection bias in using the difference of means in the entire data set to estimate the causal effect? (2 pts)**</span>

*Answer to 2.9*

The analysis above always compares the candidate that the method used considers more extreme, whether this difference is small or large. One way to probe whether extremity plausibly matters is to restrict attention to races where we can be more confident that one of the two candidates was considerably more extreme. To do this, we will run the analysis (not accounting for how close the election is) for cases where there is a large difference in the ideology of the two primary candidates, using the `subset` option in `smf.ols`:

In [None]:
clear_ols = smf.ols('vote_general ~ treat', data=extremist_df, subset=extremist_df['absdist'] > .3).fit()
clear_ols.summary()

<span style="color:blue">**Question 2.10. Compare the coefficient on the `treat` variable from this regression to the one from part 2.5. What does this tell us about the effect of nominating a candidate who is definitely more extreme (vs maybe more extreme)? (2 pts)**</span>

*Answer to 2.10 here*

## The End
Great job, you're done with this homework!
Once you have finished working on your problem set, go to File -> Download as -> PDF via LaTeX. Do not download it as PDF via HTML.

Authors: William McEachen, Mikalya Tom, Carlos Calderon, Aishah Mahmud, Andrew Little