# Python and R

In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore") # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning) # Alternative: ignore only R runtime warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# show all columns on pandas dataframes
pd.set_option('display.max_columns', None)


In [2]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [3]:
%%R

# My commonly used R imports

require('tidyverse')


R[write to console]: Loading required package: tidyverse



── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# Read the data



The cell below loads the data in Python:

In [4]:
df = pd.read_csv('raw-polls.csv')
df.sample(5)

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
3225,24521,29759,1594,2006,2006_Sen-G_MO,MO,Sen-G,Sen-G,Rasmussen Reports/Pulse Opinion Research,277,IVR,,11/5/06,500.0,Claire McCaskill,2935,DEM,48.0,Jim Talent,2936,REP,49.0,,-1.0,11/7/06,49.58,47.31,2.27,-3.27,0.0,
6927,63968,117731,1256,2013,2013_Gov-G_NJ,NJ,Gov-G,Gov-G,Quinnipiac University,267,Live Phone,,10/24/13,1203.0,Barbara Buono,13765,DEM,31.0,Chris Christie,13766,REP,64.0,,-33.0,11/5/13,38.19,60.3,-22.11,-10.89,1.0,
3798,2241,2816,7414,2008,2008_Pres-D_PA,PA,Pres-P,Pres-D,Franklin & Marshall College,106,Live Phone,,4/11/08,367.0,Hillary Rodham Clinton,45,DEM,49.0,Barack Obama,41,DEM,42.0,,7.0,4/22/08,54.57,45.43,9.14,,1.0,
7720,36487,50013,7587,2016,2016_Pres-D_VA,VA,Pres-P,Pres-D,YouGov,391,Online,,2/24/16,481.0,Hillary Rodham Clinton,9207,DEM,59.0,Bernard Sanders,9739,DEM,39.0,,20.0,3/1/16,64.28,35.22,29.06,,1.0,
3041,74106,138988,3832,2006,2006_House-G_NH-1,NH-1,House-G,House-G,University of New Hampshire,357,Live Phone,,10/31/06,340.0,Carol Shea-Porter,14220,DEM,42.0,Jeb E. Bradley,14221,REP,47.0,,-5.0,11/7/06,51.27,48.64,2.63,-7.63,0.0,for WMUR


The cell below loads the same data in R:

In [5]:
%%R

df <- read_csv('raw-polls.csv')

df

Rows: 10776 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (17): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10,776 × 31
   poll_id questio…¹ race_id  year race  locat…² type_…³ type_…⁴ polls…⁵ polls…⁶
     <dbl>     <dbl>   <dbl> <dbl> <chr> <chr>   <chr>   <chr>   <chr>     <dbl>
 1   26013     87909    1455  1998 1998… NY      Gov-G   Gov-G   Blum &…      32
 2   26255     87926    1456  1998 1998… OH      Gov-G   Gov-G   Univer…     346
 3   26026     31266    1736  1998 1998… NV      Sen-G   Sen-G   FM3 Re…      91
 4   26013     31253    1738  1998 1998… NY      Sen-G   Sen-G   Blum &…      32
 5   63632    117103    1738  1998 1998… NY      Sen-G 

# Guided Exploration

In this section you'll make a few charts to explore the data. Here I will raise some questions for you to dig around in the data and answer. You can use summary statistics and/or charts to help answer the questions. You will have to make some methodological choices along the way. Be aware of what choices you're making! I'll ask you about them shortly.


## Question 1: How accurate are polls from the following pollsters?
Characterize the accuracy of each of these pollsters in a sentence or two. Then, write another few sentences justifying your characterization with insights from the data.
- Siena College/The New York Times Upshot
- Jayhawk Consulting
- Fox News/Beacon Research/Shaw & Co. Research
- Brown University
- American Research Group


👉 **Siena College/The New York Times Upshot** 

In [6]:
# Drop polls with a sample size below 600
df_600 = df[df['samplesize'] >= 600]


In [7]:

# Siena College/The New York Times Upshot

filter_1 = df_600.pollster == 'Siena College/The New York Times Upshot'  # named filter_1 to avoid shadowing the built-in filter()
Sena_NYT = df_600[filter_1]['bias'].mean()
Sena_NYT


4.526410256410257

👉 **Jayhawk Consulting**

In [8]:
filter_2 = df_600.pollster == 'Jayhawk Consulting Services'
Jayhawk = df_600[filter_2]['bias'].mean()

Jayhawk

32.29

👉 **Fox News/Beacon Research/Shaw & Co. Research**

In [9]:

filter_3 = df_600.pollster == 'Fox News/Beacon Research/Shaw & Co. Research'
Fox = df_600[filter_3]['bias'].mean()
Fox

3.073225806451613

👉 **Brown University**

In [10]:

filter_4 = df_600.pollster == 'Brown University'
Brown = df_600[filter_4]['bias'].mean()
Brown

6.63

👉 **American Research Group**

In [11]:

filter_5 = df_600.pollster == 'American Research Group'
American = df_600[filter_5]['bias'].mean()
American

0.11116883116883129

In [12]:
ranking = pd.DataFrame({'pollster': ['Siena College/The New York Times Upshot', 'Jayhawk Consulting', 'Fox News/Beacon Research/Shaw & Co. Research', 'Brown University', 'American Research Group'], 'bias': [Sena_NYT, Jayhawk, Fox, Brown, American]})
ranking

# American Research Group has the lowest average bias of the five pollsters (about 0.11 points across its polls with a sample size of 600 or above).

|   | pollster | bias |
|---|---|---|
| 0 | Siena College/The New York Times Upshot | 4.52641 |
| 1 | Jayhawk Consulting | 32.29 |
| 2 | Fox News/Beacon Research/Shaw & Co. Research | 3.073226 |
| 3 | Brown University | 6.63 |
| 4 | American Research Group | 0.111169 |
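The five per-pollster cells above repeat the same filter-and-mean step. A minimal sketch of doing it in one pass with `groupby` is shown here, which also reports how many polls back each average. The helper name `mean_bias_by_pollster` and the toy rows are illustrative assumptions, not part of the original notebook or of `raw-polls.csv`.

```python
import pandas as pd


def mean_bias_by_pollster(polls, pollsters, min_sample=600):
    """Mean bias and poll count per pollster (hypothetical helper).

    Keeps only polls with samplesize >= min_sample from the listed pollsters.
    "count" ignores rows where bias is missing.
    """
    keep = (polls["samplesize"] >= min_sample) & polls["pollster"].isin(pollsters)
    return (
        polls[keep]
        .groupby("pollster")["bias"]
        .agg(n_polls="count", mean_bias="mean")
        .reset_index()
    )


# Toy rows with made-up values, only to show the shape of the result.
toy = pd.DataFrame({
    "pollster": ["Pollster A", "Pollster A", "Pollster B", "Pollster B", "Pollster C"],
    "samplesize": [800, 650, 500, 900, 1200],
    "bias": [2.0, 4.0, 10.0, -1.0, 3.0],
})

summary = mean_bias_by_pollster(toy, ["Pollster A", "Pollster B"])
print(summary)
```

In the notebook, the same call would take `df` and the five names exactly as they appear in the `pollster` column (for example `'Jayhawk Consulting Services'`). The `n_polls` column makes it visible when an average rests on a single poll.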


## Question 2: Which pollsters are the most accurate? Which are the least accurate?

👉 Which pollsters are the most accurate?

In [13]:
# American Research Group has the lowest average bias of the five pollsters, so we consider it the most reliable of the five. We excluded polls with a sample size below 600 because we treat sample size as an indicator of a poll's reliability.

👉 Which are the least accurate?

In [14]:
# We think Jayhawk Consulting is the least accurate pollster because it has the highest average bias of the five (32.29). Jayhawk Consulting conducted only two polls, and only one of them met our sample-size threshold of 600, which suggests that it is a small pollster and not very reliable.

### Question 2 Reflections

👉 Write a summary paragraph explaining how you decided what constitutes "most accurate" and "least accurate".


In [15]:
# We rank the pollsters from most reliable to least reliable by their average bias across polls: the lower the average bias, the higher the rank.

👉 In bullet point form, name **methodological choices** you made in the process of determining which pollsters were the most and least accurate.


In [16]:
#1. We only included polls with a sample size of 600 or above when calculating each pollster's average bias.
#2. We used each pollster's average bias rather than the bias of individual polls, because we think an average across polls is more representative than any single poll.
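The 600-respondent cutoff is itself a methodological choice, and one way to see how much it matters is to recompute the averages under several cutoffs. A rough sketch follows, assuming the `pollster`, `samplesize`, and `bias` columns used above. `bias_under_thresholds` is a hypothetical helper and the rows are made up for illustration.

```python
import pandas as pd


def bias_under_thresholds(polls, thresholds):
    """One row per (threshold, pollster) with poll count and mean bias."""
    frames = []
    for t in thresholds:
        kept = polls[polls["samplesize"] >= t]
        stats = (
            kept.groupby("pollster")["bias"]
            .agg(n_polls="count", mean_bias="mean")
            .reset_index()
        )
        # Record which cutoff produced these rows.
        stats.insert(0, "min_sample", t)
        frames.append(stats)
    return pd.concat(frames, ignore_index=True)


# Toy rows with made-up values.
toy = pd.DataFrame({
    "pollster": ["Pollster A", "Pollster A", "Pollster B", "Pollster B"],
    "samplesize": [400, 700, 500, 900],
    "bias": [6.0, 2.0, 1.0, 3.0],
})

sensitivity = bias_under_thresholds(toy, [0, 600])
print(sensitivity)
```

In this toy data the ordering of the two pollsters flips once the cutoff is applied, which is the kind of effect worth checking before settling on a threshold.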


👉 In bullet point form, list the **limitations** of your approach 


In [17]:
#1. Jayhawk Consulting has only one poll that meets our threshold of a sample size of 600 or above, so we have limited information about it, though the data we have probably still indicates that Jayhawk Consulting is not a reliable pollster.

#2. We didn't plot the distribution of each pollster's bias because we relied on the average bias instead, so the spread of individual polls around that average is not examined.

#3. We didn't study how each pollster's bias changed over time, even though some pollsters might have improved their methodology.
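Limitations 2 and 3 could be partly addressed without a chart, by summarizing the spread of each pollster's bias and by averaging bias per year. The sketch below assumes the `pollster`, `year`, and `bias` columns from `raw-polls.csv`. The helper names `bias_spread` and `bias_by_year` and the toy rows are illustrative only.

```python
import pandas as pd


def bias_spread(polls):
    """Count, mean, standard deviation, min, and max of bias per pollster."""
    return polls.groupby("pollster")["bias"].agg(
        n_polls="count",
        mean_bias="mean",
        std_bias="std",
        min_bias="min",
        max_bias="max",
    )


def bias_by_year(polls, pollster):
    """Mean bias per year for a single pollster."""
    one = polls[polls["pollster"] == pollster]
    return one.groupby("year")["bias"].mean()


# Toy rows with made-up values.
toy = pd.DataFrame({
    "pollster": ["Pollster A"] * 4 + ["Pollster B"] * 2,
    "year": [2008, 2008, 2016, 2016, 2008, 2016],
    "bias": [5.0, 7.0, 1.0, -1.0, 2.0, 2.0],
})

spread = bias_spread(toy)
trend = bias_by_year(toy, "Pollster A")
print(spread)
print(trend)
```

Two pollsters with similar averages can have very different spreads, and a per-year average can show whether a pollster's bias has shrunk over time; neither is visible from a single overall mean.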