# Python and R

In [3]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Show some warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# show all columns on pandas dataframes
pd.set_option('display.max_columns', None)


In [4]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [5]:
%%R

# My commonly used R imports

require('tidyverse')


R[write to console]: Loading required package: tidyverse



── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.0      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# Read the data



The cell below loads the data in Python:

In [6]:
df = pd.read_csv('raw-polls.csv')
df.sample(5)

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
9478,56795,91522,141,2018,2018_Gov-G_ID,ID,Gov-G,Gov-G,Change Research,48,Online,,11/3/18,838.0,Paulette E. Jordan,11357,DEM,39.0,Brad Little,11358,REP,55.0,3.0,-16.0,11/6/18,38.19,59.77,-21.58,5.58,1.0,
474,6474,7943,836,2000,2000_Pres-G_NH,NH,Pres-G,Pres-G,Research 2000,281,Live Phone,,10/17/00,603.0,Al Gore,222,DEM,44.0,George W. Bush,241,REP,41.0,1.0,3.0,11/7/00,46.8,48.07,-1.27,4.27,0.0,
913,54524,117321,853,2000,2000_Pres-G_WA,WA,Pres-G,Pres-G,SurveyUSA,325,IVR,,11/1/00,500.0,Al Gore,222,DEM,46.0,George W. Bush,241,REP,44.0,8.0,2.0,11/7/00,50.16,44.58,5.58,-3.58,1.0,
7834,36621,50264,7621,2016,2016_Pres-D_AZ,AZ,Pres-P,Pres-D,Merrill Poll,206,Live Phone,,3/9/16,300.0,Hillary Rodham Clinton,9207,DEM,50.0,Bernard Sanders,9739,DEM,24.0,,26.0,3/22/16,56.29,41.39,14.91,,1.0,for Arizona State
473,6473,7942,820,2000,2000_Pres-G_IL,IL,Pres-G,Pres-G,Research 2000,281,Live Phone,,10/17/00,601.0,Al Gore,222,DEM,47.0,George W. Bush,241,REP,40.0,3.0,7.0,11/7/00,54.6,42.58,12.01,-5.01,1.0,


The cell below loads the same data in R:

In [7]:
%%R

df <- read_csv('raw-polls.csv')

df

Rows: 10776 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (17): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10,776 × 31
   poll_id questio…¹ race_id  year race  locat…² type_…³ type_…⁴ polls…⁵ polls…⁶
     <dbl>     <dbl>   <dbl> <dbl> <chr> <chr>   <chr>   <chr>   <chr>     <dbl>
 1   26013     87909    1455  1998 1998… NY      Gov-G   Gov-G   Blum &…      32
 2   26255     87926    1456  1998 1998… OH      Gov-G   Gov-G   Univer…     346
 3   26026     31266    1736  1998 1998… NV      Sen-G   Sen-G   FM3 Re…      91
 4   26013     31253    1738  1998 1998… NY      Sen-G   Sen-G   Blum &…      32
 5   63632    117103    1738  1998 1998… NY      Sen-G 

# Guided Exploration

In this section you'll make a few charts to explore the data. I'll raise some questions for you to answer by digging around in the data. You can use summary statistics and/or charts to help answer them. You will have to make some methodological choices along the way. Be aware of what choices you're making! I'll ask you about them shortly.


## Question 1: How accurate are polls from the following pollsters?
Characterize the accuracy of each of these pollsters in a sentence or two. Then, write another few sentences justifying your characterization with insights from the data.
- Siena College/The New York Times Upshot
- Jayhawk Consulting
- Fox News/Beacon Research/Shaw & Co. Research
- Brown University
- American Research Group


In [48]:
%%R

abs(df$bias) %>% summary()

df_G <- df %>% 
    filter(type_simple %in% c('Pres-G', 'Sen-G', 'House-G', 'Gov-G'))

👉 **Siena College/The New York Times Upshot** 

In [19]:
%%R

siena <- df %>% 
    filter(pollster=='Siena College/The New York Times Upshot')
abs(siena$bias) %>% summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.000   1.907   4.070   4.362   5.885  15.010       1 


👉 **Jayhawk Consulting**

In [24]:
%%R

jayhawk <- df %>% 
    filter(pollster=='Jayhawk Consulting Services')

abs(jayhawk$bias) %>% summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  32.29   34.95   37.62   37.62   40.28   42.94 


👉 **Fox News/Beacon Research/Shaw & Co. Research**

In [25]:
%%R

fox <- df %>% 
    filter(pollster=='Fox News/Beacon Research/Shaw & Co. Research')

abs(fox$bias) %>% summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.020   2.120   3.880   4.751   6.090  15.610      15 


👉 **Brown University**

In [26]:
%%R

brown <- df %>% 
    filter(pollster=='Brown University')

abs(brown$bias) %>% summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  3.870   6.100   8.270   8.827  10.465  16.520       5 


👉 **American Research Group**

In [27]:
%%R

arg <- df %>% 
    filter(pollster=='American Research Group')

abs(arg$bias) %>% summary()

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.130   1.738   3.425   4.245   5.652  26.760     191 
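
The five summaries above can also be produced in one table on the Python side. The sketch below is only an illustration: `summarize_abs_bias` is a hypothetical helper (not part of the notebook), and the toy rows are made up so the block runs without `raw-polls.csv`. It assumes a data frame with `pollster` and `bias` columns, as in the data loaded earlier.

```python
import numpy as np
import pandas as pd


def summarize_abs_bias(polls, pollsters):
    """Summarize absolute bias per pollster, keeping poll counts visible.

    Hypothetical helper: `polls` needs `pollster` and `bias` columns.
    """
    subset = polls[polls["pollster"].isin(pollsters)].copy()
    subset["bias_abs"] = subset["bias"].abs()
    subset["bias_missing"] = subset["bias"].isna()
    summary = subset.groupby("pollster").agg(
        n_polls=("bias_abs", "size"),         # all rows, including missing bias
        n_missing=("bias_missing", "sum"),    # rows with no bias value
        median_abs_bias=("bias_abs", "median"),
        mean_abs_bias=("bias_abs", "mean"),
    )
    return summary.sort_values("median_abs_bias")


# Made-up toy rows, only to show the shape of the result.
toy = pd.DataFrame({
    "pollster": ["A", "A", "A", "B", "B", "C"],
    "bias": [1.0, -3.0, np.nan, 10.0, -20.0, 0.5],
})
toy_summary = summarize_abs_bias(toy, ["A", "B"])
print(toy_summary)
```

With the notebook's `df`, the call would take the five pollster names exactly as they are spelled in the `pollster` column (for example `'Jayhawk Consulting Services'`). Showing `n_polls` next to the median matters here: the Jayhawk quartiles above are consistent with as few as two polls, which is thin evidence compared with the other four pollsters.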


## Question 2: Which pollsters are the most accurate? Which are the least accurate?

👉 Which pollsters are the most accurate?

In [29]:
df.head()

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
0,26013,87909,1455,1998,1998_Gov-G_NY,NY,Gov-G,Gov-G,Blum & Weprin Associates,32,Live Phone,,10/13/98,364.0,Peter Vallone,13080,DEM,26.0,George Pataki,13083,REP,57.0,9.0,-31.0,11/3/98,33.16,54.32,-21.15,-9.85,1.0,for New York Daily News | WABC-TV (New York)
1,26255,87926,1456,1998,1998_Gov-G_OH,OH,Gov-G,Gov-G,University of Cincinnati (Ohio Poll),346,Live Phone,,10/13/98,540.0,Lee Fisher,13085,DEM,37.0,Bob Taft,13086,REP,52.0,,-15.0,11/3/98,44.69,50.05,-5.36,-9.64,1.0,
2,26026,31266,1736,1998,1998_Sen-G_NV,NV,Sen-G,Sen-G,FM3 Research,91,Live Phone,D,10/13/98,488.0,Harry Reid,3964,DEM,49.0,John Ensign,3965,REP,44.0,,5.0,11/3/98,47.86,47.77,0.09,4.91,1.0,for unspecified Democratic sponsor
3,26013,31253,1738,1998,1998_Sen-G_NY,NY,Sen-G,Sen-G,Blum & Weprin Associates,32,Live Phone,,10/13/98,364.0,Charles E. Schumer,3966,DEM,44.0,Alfonse M. D'Amato,3967,REP,52.0,,-8.0,11/3/98,54.62,44.08,10.54,-18.54,0.0,for New York Daily News | WABC-TV (New York)
4,63632,117103,1738,1998,1998_Sen-G_NY,NY,Sen-G,Sen-G,Garin-Hart-Yang Research Group,113,Live Phone,D,10/13/98,902.0,Charles E. Schumer,3966,DEM,46.0,Alfonse M. D'Amato,3967,REP,42.0,,4.0,11/3/98,54.62,44.08,10.54,-6.54,1.0,for Charles E. Schumer


In [34]:
df['bias_abs'] = df.bias.abs()
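
Before ranking on `bias_abs`, it helps to be sure what `bias` measures. In the rows printed by `df.sample(5)` earlier, `bias` appears to equal `margin_poll - margin_actual`. That is an observation from those few rows, not a documented definition, so treat it as an assumption. The sketch below re-checks the arithmetic on values copied from that output.

```python
import numpy as np
import pandas as pd

# Values copied from the df.sample(5) output shown earlier in this notebook.
rows = pd.DataFrame({
    "race": ["2018_Gov-G_ID", "2000_Pres-G_NH", "2000_Pres-G_WA", "2000_Pres-G_IL"],
    "margin_poll": [-16.0, 3.0, 2.0, 7.0],
    "margin_actual": [-21.58, -1.27, 5.58, 12.01],
    "bias": [5.58, 4.27, -3.58, -5.01],
})

# Assumed relationship: bias = polled margin minus actual margin.
rows["poll_minus_actual"] = rows["margin_poll"] - rows["margin_actual"]
rows["matches_bias"] = np.isclose(rows["poll_minus_actual"], rows["bias"])

# Absolute bias drops the direction of the miss and keeps only its size.
rows["bias_abs"] = rows["bias"].abs()
print(rows)
```

The fifth sampled row (`2016_Pres-D_AZ`, a primary between two Democrats) has an empty `bias` field. Rows like that are where the `NA`/`NaN` values in the summaries come from, and they carry no information in `bias_abs`.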

In [45]:
df.groupby('pollster').bias_abs.median().sort_values().head(10)

pollster
Winthrop University                   0.18
Amber Integrated                      0.38
Missouri State University             0.50
1892 Polling                          0.61
Ogden & Fry                           0.65
Saint Anselm College                  0.65
Mercyhurst University                 0.74
Singularis Group                      0.80
Public Religion Research Institute    0.86
Louis Harris & Associates             0.86
Name: bias_abs, dtype: float64

👉 Which are the least accurate?

In [47]:
df.groupby('pollster').bias_abs.median().sort_values().tail(50)

pollster
GOP Calls                                                                   21.040
Bainbridge Media Group                                                      21.180
Dane & Associates                                                           21.300
Massie & Associates                                                         23.980
Riggs Research Services                                                     33.650
Jayhawk Consulting Services                                                 37.615
Alabama State University                                                       NaN
Baruch College                                                                 NaN
Beacon Research                                                                NaN
CPEC                                                                           NaN
Canisius College                                                               NaN
Castleton University                                                          
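
In the output above, pollsters whose median is `NaN` sort to the end, because `sort_values()` places missing values last by default. So `.tail(50)` mostly lists pollsters with no computable bias rather than the least accurate ones. A minimal sketch of one way around this follows. `rank_pollsters` and its `min_polls` threshold are hypothetical (a methodological choice, not something the data prescribes), and the toy rows are made up so the block runs on its own.

```python
import numpy as np
import pandas as pd


def rank_pollsters(polls, min_polls=1):
    """Rank pollsters by median absolute bias, most accurate first.

    Hypothetical helper: drops polls without a bias value and, optionally,
    pollsters with fewer than `min_polls` scored polls.
    """
    scored = polls.dropna(subset=["bias"]).copy()
    scored["bias_abs"] = scored["bias"].abs()
    ranking = scored.groupby("pollster")["bias_abs"].agg(
        n_polls="count",
        median_abs_bias="median",
    )
    ranking = ranking[ranking["n_polls"] >= min_polls]
    return ranking.sort_values("median_abs_bias")


# Made-up toy rows, only to show the behavior.
toy = pd.DataFrame({
    "pollster": ["Steady"] * 3 + ["OneHit"] + ["Wild"] * 2 + ["NoScore"] * 2,
    "bias": [1.0, -2.0, 3.0, 0.1, 20.0, -30.0, np.nan, np.nan],
})

all_ranked = rank_pollsters(toy)                   # NoScore is dropped entirely
frequent_ranked = rank_pollsters(toy, min_polls=2)  # OneHit is dropped as well
print(all_ranked)
print(frequent_ranked)
```

With the notebook's `df`, `rank_pollsters(df).tail(10)` would list the least accurate pollsters without the `NaN` rows, and raising `min_polls` would keep a pollster with a single lucky (or unlucky) poll from topping either list.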

### Question 2 Reflections

👉 Write a summary paragraph explaining how you decided what constitutes "most accurate" and "least accurate".


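The most accurate pollster is the one that, across all of its polls, has the smallest median absolute bias, that is, the one whose polled margins typically diverge least from the actual results. The least accurate pollster is the one with the largest median absolute bias.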

👉 In bullet point form, name **methodological choices** you made in the process of determining which pollsters were the most and least accurate.


👉 In bullet point form, list the **limitations** of your approach.
