# Python and R

In [29]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# show all columns on pandas dataframes
pd.set_option('display.max_columns', None)


The rpy2.ipython extension is already loaded. To reload it, use:
  %reload_ext rpy2.ipython
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [30]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [31]:
%%R

# My commonly used R imports

require('tidyverse')



# Read the data



The cell below loads the data in Python:

In [32]:
df = pd.read_csv('raw-polls.csv')
df.sample(5)

| | poll_id | question_id | race_id | year | race | location | type_simple | type_detail | pollster | pollster_rating_id | methodology | partisan | polldate | samplesize | cand1_name | cand1_id | cand1_party | cand1_pct | cand2_name | cand2_id | cand2_party | cand2_pct | cand3_pct | margin_poll | electiondate | cand1_actual | cand2_actual | margin_actual | bias | rightcall | comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2994 | 38942 | 33004 | 8655 | 2006 | 2006_House-G_US | US | House-G | House-G | CBS News/The New York Times | 36 | Live Phone | | 10/29/06 | 598.0 | Generic Candidate | 13746 | DEM | 52.0 | Generic Candidate | 13747 | REP | 34.0 | | 18.0 | 11/7/06 | 51.6 | 44.07 | 7.53 | 10.47 | 1.0 | |
| 612 | 63613 | 117077 | 820 | 2000 | 2000_Pres-G_IL | IL | Pres-G | Pres-G | Rasmussen Reports/Pulse Opinion Research | 277 | IVR | | 10/23/00 | 880.0 | Al Gore | 222 | DEM | 44.0 | George W. Bush | 241 | REP | 40.0 | | 4.0 | 11/7/00 | 54.6 | 42.58 | 12.01 | -8.01 | 1.0 | |
| 216 | 54228 | 87764 | 1437 | 1998 | 1998_Gov-G_CO | CO | Gov-G | Gov-G | Ciruli Associates | 51 | Live Phone | | 10/28/98 | 500.0 | Gail Schoettler | 13022 | DEM | 42.0 | William F. Owens | 13021 | REP | 46.0 | | -4.0 | 11/3/98 | 48.42 | 49.04 | -0.62 | -3.38 | 1.0 | |
| 10282 | 72230 | 135502 | 6241 | 2020 | 2020_Pres-G_NH | NH | Pres-G | Pres-G | University of New Hampshire | 357 | Online | | 10/26/20 | 864.0 | Joseph R. Biden Jr. | 13256 | DEM | 53.0 | Donald Trump | 13254 | REP | 45.0 | 1.0 | 8.0 | 11/3/20 | 52.71 | 45.36 | 7.35 | 0.65 | 1.0 | |
| 9613 | 64102 | 117945 | 7735 | 2020 | 2020_Pres-D_NV | NV | Pres-P | Pres-D | Point Blank Political | 550 | Online | | 2/14/20 | 256.0 | Bernard Sanders | 13257 | DEM | 13.0 | Joseph R. Biden Jr. | 13256 | DEM | 14.3 | 12.6 | -1.3 | 2/22/20 | 33.99 | 17.57 | 16.43 | | 0.0 | |
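
Before using the `bias` column, it helps to know what it measures. The sketch below copies the four general-election rows from the sample above and checks one plausible reading: that `bias` equals `margin_poll - margin_actual`. This is an assumption inferred from those four rows only, not a documented definition, and it is not verified here against the whole file. The fifth sampled row (a primary) has no `bias` value, so it is left out.

```python
import numpy as np
import pandas as pd

# Values copied from the four general-election rows in the sample above.
rows = pd.DataFrame({
    "margin_poll":   [18.0, 4.0, -4.0, 8.0],
    "margin_actual": [7.53, 12.01, -0.62, 7.35],
    "bias":          [10.47, -8.01, -3.38, 0.65],
})

# Assumption being checked: bias = margin_poll - margin_actual.
recomputed = rows["margin_poll"] - rows["margin_actual"]

# Compare with a small tolerance because the CSV values are rounded.
matches = np.isclose(recomputed, rows["bias"], atol=0.005)
print(matches.all())
```

If this reading is right, `bias` is signed. In these four rows `cand1` is the Democrat, so a positive value would mean the poll overstated the Democratic margin and a negative value would mean it understated it.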


The cell below loads the same data in R:

In [33]:
%%R

df <- read_csv('raw-polls.csv')

df

Rows: 10776 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (17): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10,776 × 31
   poll_id questio…¹ race_id  year race  locat…² type_…³ type_…⁴ polls…⁵ polls…⁶
     <dbl>     <dbl>   <dbl> <dbl> <chr> <chr>   <chr>   <chr>   <chr>     <dbl>
 1   26013     87909    1455  1998 1998… NY      Gov-G   Gov-G   Blum &…      32
 2   26255     87926    1456  1998 1998… OH      Gov-G   Gov-G   Univer…     346
 3   26026     31266    1736  1998 1998… NV      Sen-G   Sen-G   FM3 Re…      91
 4   26013     31253    1738  1998 1998… NY      Sen-G   Sen-G   Blum &…      32
 5   63632    117103    1738  1998 1998… NY      Sen-G 

# Guided Exploration

In this section you'll make a few charts to explore the data. I'll pose some questions for you to answer by digging around in the data. You can use summary statistics, charts, or both to help answer them. You will have to make some methodological choices along the way. Be aware of the choices you're making! I'll ask you about them shortly.


## Question 1: How accurate are polls from the following pollsters?
Characterize the accuracy of each of these pollsters in a sentence or two. Then, write another few sentences justifying your characterization with insights from the data.
- Siena College/The New York Times Upshot
- Jayhawk Consulting
- Fox News/Beacon Research/Shaw & Co. Research
- Brown University
- American Research Group


👉 **Siena College/The New York Times Upshot** 

In [40]:
df.pollster.value_counts()


SurveyUSA                                          834
Rasmussen Reports/Pulse Opinion Research           748
Zogby Interactive/JZ Analytics                     477
YouGov                                             455
Public Policy Polling                              454
                                                  ... 
Iona College                                         1
Rivercity Polling                                    1
The Polling Company Inc./National Research Inc.      1
Schoen Cooperman Research                            1
Dynamics Marketing                                   1
Name: pollster, Length: 493, dtype: int64

In [41]:

# Siena College/The New York Times Upshot

# Named filter_1 (not `filter`) to avoid shadowing Python's built-in filter()
filter_1 = df.pollster == 'Siena College/The New York Times Upshot'
Sena_NYT = df[filter_1]['bias'].mean()
Sena_NYT


1.4229268292682928

👉 **Jayhawk Consulting**

In [35]:
filter_2 = df.pollster == 'Jayhawk Consulting Services'
Jayhawk = df[filter_2]['bias'].mean()



👉 **Fox News/Beacon Research/Shaw & Co. Research**

In [42]:

filter_3 = df.pollster == 'Fox News/Beacon Research/Shaw & Co. Research'
Fox = df[filter_3]['bias'].mean()


👉 **Brown University**

In [44]:

filter_4 = df.pollster == 'Brown University'
Brown = df[filter_4]['bias'].mean()
Brown

-2.212857142857143

👉 **American Research Group**

In [45]:

filter_5 = df.pollster == 'American Research Group'
American = df[filter_5]['bias'].mean()
American

0.11302325581395348

In [48]:
ranking = pd.DataFrame({'pollster': ['Siena College/The New York Times Upshot', 'Jayhawk Consulting', 'Fox News/Beacon Research/Shaw & Co. Research', 'Brown University', 'American Research Group'], 'bias': [Sena_NYT, Jayhawk, Fox, Brown, American]})
ranking

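# American Research Group shows the least bias of the five pollsters: its average bias across all the polls it has conducted (0.11) is the closest to zero.
# Caveat: this is a mean of signed bias, so errors in opposite directions cancel out. It describes a pollster's lean more than its overall accuracy.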

| | pollster | bias |
|---|---|---|
| 0 | Siena College/The New York Times Upshot | 1.422927 |
| 1 | Jayhawk Consulting | 37.615 |
| 2 | Fox News/Beacon Research/Shaw & Co. Research | 3.073226 |
| 3 | Brown University | -2.212857 |
| 4 | American Research Group | 0.113023 |
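
The cells above summarize each pollster with the mean of the `bias` column. One alternative worth comparing is the mean of the absolute bias, alongside the number of polls behind each figure. The sketch below uses toy numbers invented for illustration (they are not taken from `raw-polls.csv`), and the pollster names `A` and `B` are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration only (not from raw-polls.csv).
toy = pd.DataFrame({
    "pollster": ["A", "A", "B", "B"],
    "bias": [10.0, -10.0, 2.0, 3.0],
})

# For each pollster: signed mean, mean of absolute values, and poll count.
summary = toy.groupby("pollster")["bias"].agg(
    mean_bias="mean",
    mean_abs_bias=lambda s: s.abs().mean(),
    n_polls="count",
)
print(summary)
```

In this toy case pollster `A` has a mean bias of zero even though each of its polls missed by 10 points, while `B` has a larger mean bias but smaller misses. Which metric to report is a methodological choice; the same `groupby` could be run on `df` in place of `toy`.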


## Question 2: Which pollsters are the most accurate? Which are the least accurate?

👉 Which pollsters are the most accurate?

👉 Which are the least accurate?
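
One possible starting point, offered as a sketch rather than the expected answer: rank pollsters by mean absolute bias and keep only those with a minimum number of polls. The function name `rank_pollsters`, the `min_polls` threshold, and the toy data are all hypothetical. The choice of metric and of threshold are exactly the kind of methodological decisions the reflections below ask about.

```python
import numpy as np
import pandas as pd

def rank_pollsters(polls: pd.DataFrame, min_polls: int = 10) -> pd.DataFrame:
    """Rank pollsters by mean absolute bias, smallest first.

    Hypothetical helper: assumes `pollster` and `bias` columns.
    """
    # Rows without a bias value cannot be scored, so drop them.
    scored = polls.dropna(subset=["bias"])
    stats = (
        scored.assign(abs_bias=scored["bias"].abs())
        .groupby("pollster")["abs_bias"]
        .agg(mean_abs_bias="mean", n_polls="count")
    )
    # Exclude pollsters with too few polls to say much about.
    return stats[stats["n_polls"] >= min_polls].sort_values("mean_abs_bias")

# Toy data, invented for illustration only (not from raw-polls.csv).
toy = pd.DataFrame({
    "pollster": ["A", "A", "A", "B", "B", "B", "C"],
    "bias": [1.0, -2.0, 3.0, 6.0, -6.0, np.nan, 0.0],
})

ranking_toy = rank_pollsters(toy, min_polls=2)
print(ranking_toy)
```

Here pollster `C` has a perfect score but only one poll, so the threshold removes it. With the real data, `rank_pollsters(df).head()` and `.tail()` would give candidate lists for most and least accurate under these assumptions.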

### Question 2 Reflections

👉 Write a summary paragraph explaining how you decided what constitutes "most accurate" and "least accurate".


👉 In bullet point form, name **methodological choices** you made in the process of determining which pollsters were the most and least accurate.


👉 In bullet point form, list the **limitations** of your approach.
