# Python and R

In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore")  # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning)  # Alternative: ignore only R runtime warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# show all columns on pandas dataframes
pd.set_option('display.max_columns', None)


In [2]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [3]:
%%R

# My commonly used R imports

require('tidyverse')


R[write to console]: Loading required package: tidyverse



── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# Read the data



The cell below loads the data in Python:

In [4]:
df = pd.read_csv('raw-polls.csv')
df.sample(5)

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
3183,55745,89297,1355,2006,2006_Gov-G_SC,SC,Gov-G,Gov-G,SurveyUSA,325,IVR,,11/3/06,485.0,Tommy Moore,12752,DEM,40.0,Mark Sanford,12753,REP,57.0,,-17.0,11/7/06,44.79,55.12,-10.33,-6.67,1.0,
5193,73913,138680,4717,2010,2010_House-G_NM-1,NM-1,House-G,House-G,ccAdvertising,396,IVR,,10/18/10,600.0,Martin Heinrich,10891,DEM,39.9,Jonathan L. Barela,10892,REP,47.4,,-7.5,11/2/10,51.8,48.2,3.61,-11.11,0.0,sample size unavailable; estimated at 600 as a...
4512,16583,29313,1577,2008,2008_Sen-G_VA,VA,Sen-G,Sen-G,CNN/Opinion Research Corp.,37,Live Phone,,10/26/08,721.0,Mark R. Warner,2808,DEM,63.0,James S. Gilmore III,2809,REP,35.0,,28.0,11/4/08,65.03,33.73,31.3,-3.3,1.0,for CNN
2687,4090,29978,1591,2006,2006_Sen-G_MI,MI,Sen-G,Sen-G,Strategic Vision LLC,320,Live Phone,,10/22/06,1200.0,Debbie Stabenow,2915,DEM,48.0,Mike Bouchard,2916,REP,42.0,,6.0,11/7/06,56.91,41.26,15.65,-9.65,1.0,
5536,51730,81778,4571,2010,2010_House-G_FL-25,FL-25,House-G,House-G,Susquehanna Polling & Research Inc.,326,IVR,,10/26/10,700.0,Joe Garcia,10754,DEM,43.0,David Rivera,10755,REP,44.0,6.0,-1.0,11/2/10,42.59,52.15,-9.56,8.56,1.0,


The cell below loads the same data in R:

In [5]:
%%R

df <- read_csv('raw-polls.csv')

df

Rows: 10776 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (17): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10,776 × 31
   poll_id questio…¹ race_id  year race  locat…² type_…³ type_…⁴ polls…⁵ polls…⁶
     <dbl>     <dbl>   <dbl> <dbl> <chr> <chr>   <chr>   <chr>   <chr>     <dbl>
 1   26013     87909    1455  1998 1998… NY      Gov-G   Gov-G   Blum &…      32
 2   26255     87926    1456  1998 1998… OH      Gov-G   Gov-G   Univer…     346
 3   26026     31266    1736  1998 1998… NV      Sen-G   Sen-G   FM3 Re…      91
 4   26013     31253    1738  1998 1998… NY      Sen-G   Sen-G   Blum &…      32
 5   63632    117103    1738  1998 1998… NY      Sen-G 

# Guided Exploration

In this section you'll make a few charts to explore the data. I will raise some questions for you to answer by digging into the data. You can use summary statistics, charts, or both. You will have to make some methodological choices along the way. Be aware of the choices you're making! I'll ask you about them shortly.


## Question 1: How accurate are polls from the following pollsters?
Characterize the accuracy of each of these pollsters in a sentence or two. Then, write another few sentences justifying your characterization with insights from the data.
- Siena College/The New York Times Upshot
- Jayhawk Consulting
- Fox News/Beacon Research/Shaw & Co. Research
- Brown University
- American Research Group


👉 **Siena College/The New York Times Upshot** 

In [11]:
df.loc[df.pollster == 'Siena College/The New York Times Upshot', 'bias'].describe()

count    82.000000
mean      1.422927
std       5.219059
min     -15.010000
25%      -2.075000
50%       1.515000
75%       5.125000
max      11.200000
Name: bias, dtype: float64

👉 **Jayhawk Consulting**

In [14]:
df.loc[df.pollster == 'Jayhawk Consulting Services', 'bias'].describe()

count     2.000000
mean     37.615000
std       7.530687
min      32.290000
25%      34.952500
50%      37.615000
75%      40.277500
max      42.940000
Name: bias, dtype: float64

👉 **Fox News/Beacon Research/Shaw & Co. Research**

In [15]:
df.loc[df.pollster == 'Fox News/Beacon Research/Shaw & Co. Research', 'bias'].describe()

count    31.000000
mean      3.073226
std       5.096175
min      -5.870000
25%      -0.290000
50%       2.630000
75%       6.060000
max      15.610000
Name: bias, dtype: float64

👉 **Brown University**

In [16]:
df.loc[df.pollster == 'Brown University', 'bias'].describe()

count     7.000000
mean     -2.212857
std      10.138818
min     -11.080000
25%      -9.060000
50%      -5.570000
75%       1.380000
max      16.520000
Name: bias, dtype: float64

👉 **American Research Group**

In [17]:
df.loc[df.pollster == 'American Research Group', 'bias'].describe()

count    86.000000
mean      0.113023
std       5.737122
min     -10.100000
25%      -3.502500
50%      -0.560000
75%       2.825000
max      26.760000
Name: bias, dtype: float64
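
The five per-pollster summaries above can also be produced in one pass. The following is a minimal sketch, assuming only the `pollster` and `bias` columns shown earlier. The helper `summarize_bias` and the toy frame are hypothetical names introduced here, and the toy numbers are made up purely to show the shape of the result.

```python
import pandas as pd

# Pollster names as spelled in the `pollster` column (copied from the cells above).
POLLSTERS = [
    "Siena College/The New York Times Upshot",
    "Jayhawk Consulting Services",
    "Fox News/Beacon Research/Shaw & Co. Research",
    "Brown University",
    "American Research Group",
]


def summarize_bias(polls, pollsters):
    """Return describe() statistics of `bias`, one row per requested pollster."""
    subset = polls[polls["pollster"].isin(pollsters)]
    return subset.groupby("pollster")["bias"].describe()


# Toy frame with made-up values (NOT taken from raw-polls.csv).
toy = pd.DataFrame({
    "pollster": ["A", "A", "B", "C"],
    "bias": [1.0, 3.0, -2.0, 5.0],
})
summary = summarize_bias(toy, ["A", "B"])
print(summary)
```

In the notebook itself, the equivalent call would be `summarize_bias(df, POLLSTERS)`, which puts all five pollsters side by side in a single table.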

## Question 2: Which pollsters are the most accurate? Which are the least accurate?

👉 Which pollsters are the most accurate?

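Among the five pollsters above, Siena College/The New York Times Upshot looks the most accurate. Its mean (1.42) and median (1.52) bias are relatively close to 0, and its largest misses (-15.01 and 11.20) stay within about 15 points of 0.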

👉 Which are the least accurate?

Jayhawk Consulting Services is the least accurate: its mean bias is 37.6 points, far from 0. However, we cannot say this with confidence, because only 2 of its polls are recorded.
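
One way to extend this comparison beyond five pollsters is to rank every pollster by a single error score. The sketch below is an assumption-laden alternative, not the method used in the cells above: it scores each pollster by mean absolute `bias` (so that over- and under-estimates cannot cancel out) and drops pollsters with fewer than `min_polls` polls. The function name `rank_pollsters`, the threshold of 10, and the toy data are all hypothetical.

```python
import pandas as pd


def rank_pollsters(polls, min_polls=10):
    """Rank pollsters by mean absolute bias, skipping those with few polls."""
    stats = (
        polls.assign(abs_bias=polls["bias"].abs())
        .groupby("pollster")
        .agg(n_polls=("bias", "count"), mean_abs_bias=("abs_bias", "mean"))
    )
    # Keep only pollsters with enough polls for the average to mean something.
    eligible = stats[stats["n_polls"] >= min_polls]
    # Smallest average miss first, i.e. most accurate at the top.
    return eligible.sort_values("mean_abs_bias")


# Toy frame with made-up values (NOT taken from raw-polls.csv).
toy = pd.DataFrame({
    "pollster": ["A"] * 3 + ["B"] * 3 + ["C"],
    "bias": [-4.0, 4.0, 1.0, 1.0, -2.0, 0.0, 0.5],
})
ranking = rank_pollsters(toy, min_polls=3)
print(ranking)
```

In the toy data, pollster A has a signed mean bias near 0 but ranks below B once absolute values are used, and C is excluded for having a single poll. On the real data the call would be `rank_pollsters(df)`, with `.head()` and `.tail()` giving the most and least accurate pollsters under these assumptions.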

### Question 2 Reflections

👉 Write a summary paragraph explaining how you decided what constitutes "most accurate" and "least accurate".


There are several points to consider. First, a pollster's median and mean bias should be as close to 0 as possible, showing that its predictions land close to the actual result. Second, the standard deviation should be reasonably small, so that most of its polls are accurate, not just the average. Lastly, I look at the min and max to see whether the data contain possible outliers and how far from 0 those outliers are.
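
The outlier check in the last step can be made concrete with Tukey's 1.5 × IQR rule. This is a common convention that I am assuming here, not something the notebook applied. The sketch uses the quartiles, min, and max reported above for American Research Group; `iqr_fences` is a hypothetical helper.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences of Tukey's rule: q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Values copied from the American Research Group describe() output above.
q1, q3 = -3.5025, 2.825
smallest, largest = -10.10, 26.76

lower, upper = iqr_fences(q1, q3)
min_is_outlier = smallest < lower
max_is_outlier = largest > upper
print(lower, upper, min_is_outlier, max_is_outlier)
```

Under this rule the largest value (26.76) falls outside the upper fence, while the smallest (-10.10) stays inside the lower one, so only the maximum would be flagged as a possible outlier.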

👉 In bullet point form, name **methodological choices** you made in the process of determining which pollsters were the most and least accurate.


- I judged each pollster's accuracy from the summary statistics of `bias`: the mean, the median, and the 25% and 75% quartiles.
- I treated a mean and median closer to 0 as a sign of a more accurate pollster.
- I treated a narrower interquartile range (IQR) as a sign of a more accurate pollster.

👉 In bullet point form, list the **limitations** of your approach 


- My approach relies on summary numbers alone. Without a chart for verification, I cannot tell whether the data have an unusual distribution.
- Pollsters with fewer polls may be disadvantaged, because their statistics are more easily influenced by outliers.
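
The small-sample limitation can be quantified roughly with the standard error of the mean, `std / sqrt(n)`. The sketch below plugs in the `count` and `std` values from the five `describe()` outputs above. It assumes the polls of one pollster are independent draws, which may not hold for polls of the same race; `standard_error` is a hypothetical helper.

```python
import math


def standard_error(std, n):
    """Standard error of the mean for n observations with sample standard deviation std."""
    return std / math.sqrt(n)


# (count, std) pairs copied from the describe() outputs above.
summaries = {
    "Siena College/The New York Times Upshot": (82, 5.219059),
    "Jayhawk Consulting Services": (2, 7.530687),
    "Fox News/Beacon Research/Shaw & Co. Research": (31, 5.096175),
    "Brown University": (7, 10.138818),
    "American Research Group": (86, 5.737122),
}

errors = {name: standard_error(std, n) for name, (n, std) in summaries.items()}
# Print pollsters from the least to the most uncertain mean bias.
for name, se in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{name}: {se:.2f}")
```

The two pollsters with the fewest polls, Jayhawk Consulting Services and Brown University, end up with by far the largest standard errors, which is one way to express why their mean bias is less trustworthy than that of pollsters with 80 or more polls.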