# Python and R

In [78]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore")  # Silence all warnings
# Alternative: silence only R runtime warnings and keep the rest visible
# warnings.filterwarnings("ignore", category=RRuntimeWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# show all columns on pandas dataframes
pd.set_option('display.max_columns', None)




In [79]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [80]:
%%R

# My commonly used R imports

require('tidyverse')



# Read the data



The cell below loads the data in Python:

In [81]:
df = pd.read_csv('raw-polls.csv')
df.sample(5)

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
2282,4435,26357,1639,2004,2004_Sen-G_OK,OK,Sen-G,Sen-G,Cole Hargrave Snodgrass & Associates,56,Live Phone,,10/26/04,500.0,Brad R. Carson,3311,DEM,38.0,Tom Coburn,3310,REP,43.0,6.0,-5.0,11/2/04,41.24,52.77,-11.52,6.52,1.0,among registered voters
9801,71455,133836,6281,2020,2020_Sen-G_MN,MN,Sen-G,Sen-G,Change Research,48,Online,,10/14/20,1021.0,Tina Smith,15691,DEM,48.0,Jason Lewis,15692,REP,44.0,3.0,4.0,11/3/20,48.74,43.5,5.24,-1.24,1.0,for MinnPost
7496,34385,125142,1247,2014,2014_Gov-G_CO,CO,Gov-G,Gov-G,Public Policy Polling,263,IVR/Online,D,11/2/14,739.0,John Wright Hickenlooper,8708,DEM,47.0,Bob Beauprez,8704,REP,47.5,2.0,-0.5,11/4/14,49.3,45.95,3.34,-3.84,0.0,for unspecified Democratic sponsor; average of...
9376,56723,91369,112,2018,2018_Sen-G_NV,NV,Sen-G,Sen-G,Trafalgar Group,338,IVR,R,10/31/18,2587.0,Jacky Rosen,11150,DEM,45.6,Dean Heller,11151,REP,48.9,,-3.3,11/6/18,50.41,45.38,5.03,-8.33,0.0,for unspecified Republican sponsor
9139,66738,124506,149,2018,2018_Gov-G_MN,MN,Gov-G,Gov-G,St. Cloud State University,312,Live Phone,,10/23/18,404.0,Timothy J. Walz,12420,DEM,50.0,Jeff Johnson,12421,REP,34.0,,16.0,11/6/18,53.84,42.43,11.42,4.58,1.0,


The cell below loads the same data in R:

In [82]:
%%R

df <- read_csv('raw-polls.csv')

df

Rows: 10776 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (17): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10,776 × 31
   poll_id questio…¹ race_id  year race  locat…² type_…³ type_…⁴ polls…⁵ polls…⁶
     <dbl>     <dbl>   <dbl> <dbl> <chr> <chr>   <chr>   <chr>   <chr>     <dbl>
 1   26013     87909    1455  1998 1998… NY      Gov-G   Gov-G   Blum &…      32
 2   26255     87926    1456  1998 1998… OH      Gov-G   Gov-G   Univer…     346
 3   26026     31266    1736  1998 1998… NV      Sen-G   Sen-G   FM3 Re…      91
 4   26013     31253    1738  1998 1998… NY      Sen-G   Sen-G   Blum &…      32
 5   63632    117103    1738  1998 1998… NY      Sen-G 

# Guided Exploration

In this section you'll make a few charts to explore the data. I will raise some questions for you to answer by digging into the data. You can use summary statistics, charts, or both to answer them. You will have to make some methodological choices along the way. Be aware of which choices you're making! I'll ask you about them shortly.


## Question 1: How accurate are polls from the following pollsters?
Characterize the accuracy of each of these pollsters in a sentence or two. Then, write another few sentences justifying your characterization with insights from the data.
- Siena College/The New York Times Upshot
- Jayhawk Consulting
- Fox News/Beacon Research/Shaw & Co. Research
- Brown University
- American Research Group


👉 **Siena College/The New York Times Upshot** 

In [83]:
# Methodological choice: keep only polls with a sample size of at least 600
df_600 = df[df['samplesize'] >= 600]
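
A sample-size cutoff like this one is a methodological choice, so it helps to see what it removes. The sketch below is a minimal illustration on a made-up table (`toy`, its values, and the `summary` dictionary are all hypothetical, not taken from `raw-polls.csv`). It shows that a `>=` comparison also drops rows whose sample size is missing.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the polls table; these values are invented for illustration.
toy = pd.DataFrame({
    "pollster": ["A", "A", "B", "B", "C"],
    "samplesize": [500.0, 800.0, np.nan, 600.0, 1200.0],
})

cutoff = 600
# NaN compares as False, so rows with a missing sample size are dropped too.
keep = toy["samplesize"] >= cutoff
toy_600 = toy[keep]

summary = {
    "rows_before": len(toy),
    "rows_after": len(toy_600),
    "dropped_missing": int(toy["samplesize"].isna().sum()),
    "dropped_small": int((toy["samplesize"] < cutoff).sum()),
}
print(summary)
```

Running the same counts on the real `df` would show how many polls, and which pollsters, the cutoff excludes.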


In [84]:

# Siena College/The New York Times Upshot

# Named filter_1 to avoid shadowing Python's built-in filter()
filter_1 = df_600.pollster == 'Siena College/The New York Times Upshot'
Sena_NYT = df_600[filter_1]['bias'].mean()
Sena_NYT


4.526410256410257

👉 **Jayhawk Consulting**

In [85]:
filter_2 = df_600.pollster == 'Jayhawk Consulting Services'
Jayhawk = df_600[filter_2]['bias'].mean()

Jayhawk

32.29

👉 **Fox News/Beacon Research/Shaw & Co. Research**

In [86]:

filter_3 = df_600.pollster == 'Fox News/Beacon Research/Shaw & Co. Research'
Fox = df_600[filter_3]['bias'].mean()
Fox

3.073225806451613

👉 **Brown University**

In [87]:

filter_4 = df_600.pollster == 'Brown University'
Brown = df_600[filter_4]['bias'].mean()
Brown

6.63

👉 **American Research Group**

In [88]:

filter_5 = df_600.pollster == 'American Research Group'
American = df_600[filter_5]['bias'].mean()
American

0.11116883116883129

In [89]:
ranking = pd.DataFrame({'pollster': ['Siena College/The New York Times Upshot', 'Jayhawk Consulting', 'Fox News/Beacon Research/Shaw & Co. Research', 'Brown University', 'American Research Group'], 'bias': [Sena_NYT, Jayhawk, Fox, Brown, American]})
ranking

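# American Research Group has the smallest mean bias of the five pollsters (about 0.11 points).
# Caveat: bias is a signed error, so overestimates and underestimates can cancel out.
# A mean near zero shows little systematic lean, not necessarily small errors in each poll.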

| | pollster | bias |
|---|---|---|
| 0 | Siena College/The New York Times Upshot | 4.52641 |
| 1 | Jayhawk Consulting | 32.29 |
| 2 | Fox News/Beacon Research/Shaw & Co. Research | 3.073226 |
| 3 | Brown University | 6.63 |
| 4 | American Research Group | 0.111169 |
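
In the sample rows shown earlier, `bias` equals `margin_poll` minus `margin_actual`, so it appears to be a signed error. The sketch below uses invented numbers for two hypothetical pollsters (`Steady` and `Swingy`) to show one way of reporting the mean signed bias alongside the mean absolute error and the number of polls. It is an illustration under that assumption, not the assignment's required method.

```python
import pandas as pd

# Invented numbers, for illustration only: two hypothetical pollsters.
toy = pd.DataFrame({
    "pollster": ["Steady", "Steady", "Swingy", "Swingy"],
    "bias": [2.0, 3.0, -10.0, 10.0],  # signed error: poll margin minus actual margin
})

summary = (
    toy.assign(abs_error=toy["bias"].abs())
       .groupby("pollster")
       .agg(
           mean_bias=("bias", "mean"),            # systematic lean
           mean_abs_error=("abs_error", "mean"),  # typical size of the miss
           n_polls=("bias", "size"),              # how many polls back each number
       )
)
print(summary)
```

Here `Swingy` has a mean bias of zero but misses by 10 points on average, while `Steady` leans consistently but misses by less. The `n_polls` column is a reminder that an average built on one or two polls says little.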


## Question 2: Which pollsters are the most accurate? Which are the least accurate?

👉 Which pollsters are the most accurate?

👉 Which are the least accurate?
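
One possible starting point, sketched on invented data: rank pollsters by mean absolute error and ignore those with too few polls to judge. The table `toy`, the threshold `MIN_POLLS`, and the choice of mean absolute error are all assumptions made for illustration. Other definitions of accuracy are just as defensible, and choosing one is part of the exercise.

```python
import pandas as pd

# Invented data for three hypothetical pollsters.
toy = pd.DataFrame({
    "pollster": ["P1"] * 3 + ["P2"] * 3 + ["P3"],
    "bias": [1.0, -2.0, 3.0, 6.0, -6.0, 9.0, 0.5],
})

MIN_POLLS = 3  # hypothetical threshold; pollsters with fewer polls are excluded

per_pollster = (
    toy.assign(abs_error=toy["bias"].abs())
       .groupby("pollster")
       .agg(mean_abs_error=("abs_error", "mean"), n_polls=("abs_error", "size"))
)

# Keep pollsters with enough polls, then sort from smallest to largest error.
ranked = (
    per_pollster[per_pollster["n_polls"] >= MIN_POLLS]
    .sort_values("mean_abs_error")
)

most_accurate = ranked.index[0]
least_accurate = ranked.index[-1]
print(ranked)
```

In this toy table `P3` has the smallest error but only one poll, so the threshold removes it. Where to set that threshold is itself a choice worth listing in the reflections.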

### Question 2 Reflections

👉 Write a summary paragraph explaining how you decided what constitutes "most accurate" and "least accurate".


👉 In bullet point form, name **methodological choices** you made in the process of determining which pollsters were the most and least accurate.


👉 In bullet point form, list the **limitations** of your approach.
