Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [1]:
NAME = "Aseem Sachdeva"


# Presenting Uncertainty
## School of Information, University of Michigan

## Week 4: Assignment Overview
Version 1.1
### The objectives for this week are for you to:
- recreate the visualization for tracking Donald Trump’s approval ratings throughout his presidency
- create an alternative uncertainty visualization of the same data using principles you learned in class
- analyze and compare the two visualizations

Read the FiveThirtyEight article [How popular/unpopular is Donald Trump](https://projects.fivethirtyeight.com/trump-approval-ratings/). 

The article is based on the datasets [approval_topline.csv](asset/approval_topline.csv) and [approval_poll_list.csv](asset/approval_poll_list.csv). Both of them are in the asset folder.

In [2]:
import pandas as pd
import altair as alt
import numpy as np

In [3]:
import time
import operator

import ipywidgets as widgets
from ipywidgets import interact
from sklearn import linear_model
from sklearn import gaussian_process
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# pandas, altair, and numpy are already imported in the previous cell

# Write chart data to a JSON file instead of embedding it in the notebook
alt.data_transformers.enable('json')
# Show every column when displaying wide DataFrames
pd.set_option('display.max_columns', None)


## Part 0. Load the data (2 points)

Load the "approval_topline.csv" file into the `topline_df` variable and the "approval_poll_list.csv" file into the `polls_df` variable.

In [4]:
# YOUR CODE HERE
topline_df = pd.read_csv("asset/approval_topline.csv")


polls_df = pd.read_csv("asset/approval_poll_list.csv")


In [5]:
topline_df.head()

Unnamed: 0,president,subgroup,modeldate,approve_estimate,approve_hi,approve_lo,disapprove_estimate,disapprove_hi,disapprove_lo,timestamp
0,Donald Trump,Voters,4/15/2020,44.7615,48.405868,41.117133,51.533103,55.28009,47.786116,10:03:57 15 Apr 2020
1,Donald Trump,Adults,4/15/2020,43.994962,47.722998,40.266926,50.842083,55.956679,45.727487,10:02:28 15 Apr 2020
2,Donald Trump,All polls,4/15/2020,44.35656,48.075278,40.637841,51.377554,55.887691,46.867418,10:01:29 15 Apr 2020
3,Donald Trump,Adults,4/14/2020,43.764737,47.704991,39.824482,50.748517,56.060162,45.436873,19:23:26 14 Apr 2020
4,Donald Trump,Voters,4/14/2020,44.645348,48.374384,40.916311,51.774275,55.606283,47.942267,19:24:55 14 Apr 2020


In [6]:
polls_df.tail()

Unnamed: 0,president,subgroup,modeldate,startdate,enddate,pollster,grade,samplesize,population,weight,influence,approve,disapprove,adjusted_approve,adjusted_disapprove,multiversions,tracking,url,poll_id,question_id,createddate,timestamp
11551,Donald Trump,Voters,4/9/2020,4/5/2020,4/7/2020,YouGov,B-,1147.0,rv,0.253976,0.237406,45.0,53.0,44.283645,52.550837,,,https://docs.cdn.yougov.com/ogvntw3mu9/econTab...,65691,121386,4/8/2020,14:30:51 9 Apr 2020
11552,Donald Trump,Voters,4/9/2020,4/5/2020,4/7/2020,YouGov,B-,793.0,rv,0.175591,0.164135,43.0,54.0,42.283645,53.550837,,,https://docs.cdn.yougov.com/gekugap92v/tabs_Tr...,65695,121414,4/9/2020,14:30:51 9 Apr 2020
11553,Donald Trump,Voters,4/9/2020,4/6/2020,4/7/2020,Ipsos,B-,959.0,rv,0.690561,0.690561,42.0,55.0,42.277652,53.066692,,,https://www.ipsos.com/sites/default/files/ct/n...,65688,121411,4/8/2020,14:30:51 9 Apr 2020
11554,Donald Trump,Voters,4/9/2020,4/6/2020,4/8/2020,YouGov,B-,775.0,rv,0.187596,0.187596,43.0,53.0,42.283645,52.550837,,,https://docs.cdn.yougov.com/38z7i3o4q3/tabs_Tr...,65696,121416,4/9/2020,14:30:51 9 Apr 2020
11555,Donald Trump,Voters,4/9/2020,4/6/2020,4/8/2020,Rasmussen Reports/Pulse Opinion Research,C+,1500.0,lv,0.490965,0.490965,46.0,52.0,41.720496,52.781656,,T,http://www.rasmussenreports.com/public_content...,65697,121418,4/9/2020,14:30:51 9 Apr 2020


In [7]:
assert len(topline_df) == 3537, "Load the data: topline_df does not have the required number of rows"
assert len(topline_df.columns) == 10, "Load the data: topline_df does not have the required number of columns"
assert len(polls_df) == 11556, "Load the data: polls_df does not have the required number of rows"
assert len(polls_df.columns) == 22, "Load the data: polls_df does not have the required number of columns"

## Part 1. Recreate the first visualization for tracking Donald Trump’s approval ratings (10 points)

Note that the original visualization has a dropdown for selecting All Polls, Polls of likely or registered voters, or Polls of Adults. Pick just one of these subsets to visualize. (Hint: filter on the `subgroup` column in the dataset)

In [8]:
# YOUR CODE HERE
# Keep only the "Voters" subgroup (polls of likely or registered voters)
polls_df_filtered = polls_df.loc[polls_df['subgroup'] == 'Voters']
topline_df_filtered = topline_df.loc[topline_df['subgroup'] == 'Voters']

source1 = polls_df_filtered    # individual polls (dots)
source2 = topline_df_filtered  # daily estimates with upper/lower bounds (lines and bands)

# Disapproval (orange): estimate line, hi/lo band, and individual polls
line1 = alt.Chart(source2).mark_line(color="#ff7400").encode(
    x='modeldate:T',
    y=alt.Y('disapprove_estimate', axis=alt.Axis(title="disapproval/approval percentage", tickCount=8), scale=alt.Scale(domain=[20, 80]))
)

# The band limits are already columns in the data (y and y2), so no `extent` is needed
band1 = alt.Chart(source2).mark_errorband(color="#ff7400").encode(
    x=alt.X('modeldate:T', axis=alt.Axis(title=None)),
    y=alt.Y('disapprove_hi', axis=alt.Axis(title="disapproval/approval percentage")),
    y2='disapprove_lo'
)

dots1 = alt.Chart(source1).mark_circle(color="#ffe3cc").encode(x='enddate:T', y='disapprove')

# Approval (green): estimate line, hi/lo band, and individual polls
line2 = alt.Chart(source2).mark_line(color="#009f29").encode(
    x=alt.X('modeldate:T', axis=alt.Axis(title=None)),
    y=alt.Y('approve_estimate', axis=alt.Axis(title="disapproval/approval percentage", tickCount=8), scale=alt.Scale(domain=[20, 80]))
)

band2 = alt.Chart(source2).mark_errorband(color="#009f29").encode(
    x='modeldate:T',
    y=alt.Y('approve_hi', axis=alt.Axis(title="disapproval/approval percentage")),
    y2='approve_lo'
)

dots2 = alt.Chart(source1).mark_circle(color='#ccecd4').encode(x='enddate:T', y='approve')

# Layer order within each group: band at the back, then dots, then the estimate line
disapprove_layer = band1 + dots1 + line1
approve_layer = band2 + dots2 + line2

combined_line = approve_layer + disapprove_layer

combined_line = combined_line.properties(height=1000, width=1000, title={"text": ['How Popular/Unpopular is Donald Trump?']})
combined_line

## Part 2. Use one of the techniques from class to create an alternative version of the visualization from Part 1 (15 points) 

Create an alternative uncertainty visualization to your visualization from Part 1. 

**NOTE:** You will either have to make some assumptions in order to construct an alternative visualization, or fit your own model to the data. Document your assumptions in Part 3.

In [9]:

def get_one_bootstrap_disapprove_approve_fit():
    '''Get one bootstrap sample of the polls, charted with cubic fits for disapproval and approval'''
    # resample the polls with replacement (replace=True) into a data frame with
    # the same number of rows (frac=1.0)
    resampled_df = source1.sample(frac=1.0, replace=True)

    # points for the resampled disapproval and approval polls
    base1 = alt.Chart(resampled_df).mark_circle(color="#ffe3cc").encode(x='enddate:T', y=alt.Y('disapprove', scale=alt.Scale(domain=[20, 80]), axis=alt.Axis(tickCount=8))).properties(height=1000, width=1000)

    base2 = alt.Chart(resampled_df).mark_circle(color='#ccecd4').encode(x='enddate:T', y=alt.Y('approve', scale=alt.Scale(domain=[20, 80]), axis=alt.Axis(tickCount=8))).properties(height=1000, width=1000)

    # cubic (order=3) polynomial regression of each rating on the poll end date;
    # the fit itself is computed by Altair's transform_regression
    polynomial_fit1 = base1.transform_regression(
        "enddate", "disapprove", method="poly", order=3
    ).mark_line(color='red')

    polynomial_fit2 = base2.transform_regression(
        "enddate", "approve", method="poly", order=3
    ).mark_line(color='red')

    # layer each fit line over its points, then combine the two groups
    chart1 = alt.layer(base1, polynomial_fit1)
    chart2 = alt.layer(base2, polynomial_fit2)

    return chart1 + chart2

In [10]:
B = 50

# get `B` bootstrapped charts, each containing the resampled points and their cubic fit lines
line_charts = [get_one_bootstrap_disapprove_approve_fit() for _ in range(B)]

# layer all the bootstrap charts together, with one more bootstrap sample on top
alt.layer(*line_charts) + get_one_bootstrap_disapprove_approve_fit()

## Part 3. Assumptions, Justification, and Comparison (5 points)

Document any assumptions you made in Part 2. Then, compare your visualization to the original, justifying your design choices.

In constructing the spaghetti plot above, I assumed that the underlying trend is nonlinear, as is often the case with real-world data. I also assumed that the data need not be modeled as binomial, since we are not counting the number of successes in a series of trials. I therefore chose a polynomial fit, and selected a low degree of three so that the curve captures the trend without overfitting the data. Judging from the output, a third-degree polynomial appears to fit the general trend of the data well.
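
The resample-and-refit idea behind the spaghetti plot can also be sketched numerically. This is a minimal sketch, not the code used in Part 2: it assumes synthetic data in place of the real polls, rescales dates to the interval [0, 1], and uses a hypothetical helper named `bootstrap_poly_fits`. It refits a cubic with `np.polyfit` on each bootstrap resample and then summarizes the spread of the fitted curves as a pointwise percentile band.

```python
import numpy as np

# Synthetic stand-in for the poll data (an assumption, not the real polls)
rng = np.random.default_rng(0)
days = np.linspace(0.0, 1.0, 200)  # poll end dates rescaled to [0, 1]
true_trend = 52 + 6 * days - 14 * days**2 + 8 * days**3
disapprove = true_trend + rng.normal(0, 2.0, size=days.size)


def bootstrap_poly_fits(x, y, n_boot=200, degree=3, grid_size=50, rng=None):
    """Refit a polynomial on bootstrap resamples and evaluate each fit on a grid."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.linspace(x.min(), x.max(), grid_size)
    fits = np.empty((n_boot, grid_size))
    for b in range(n_boot):
        # draw row indices with replacement, same size as the original data
        idx = rng.integers(0, x.size, size=x.size)
        coefs = np.polyfit(x[idx], y[idx], deg=degree)
        fits[b] = np.polyval(coefs, grid)
    return grid, fits


grid, fits = bootstrap_poly_fits(days, disapprove, rng=rng)

# Pointwise 95% percentile band across the bootstrap fits
lower, upper = np.percentile(fits, [2.5, 97.5], axis=0)
```

Each row of `fits` corresponds to one line of a spaghetti plot, while `lower` and `upper` condense the same information into a band.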

Regarding the effectiveness of the spaghetti plot, the viewer can get a general sense of the uncertainty in the model fit from the spread of the lines. However, even with a relatively small number of bootstrap runs (as few as 5), it is nearly impossible to distinguish the individual trendlines, so the plot adds little value for examining how individual fits differ. For this data, it might therefore be more appropriate to explore a small multiples approach, so the viewer can compare the fit from each bootstrap sample.

Please remember to submit both the HTML and .ipynb formats of your completed notebook. When generating your HTML, be sure to run your complete code first before downloading as HTML. Please remember to work on your explanations and interpretations!