# Lab Assignment 12: Interactive Visualizations
## DS 6001: Practice and Application of Data Science

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

## Problem 0
Import the following libraries:

In [1]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
import dash
from jupyter_dash import JupyterDash
from dash import dcc, html
from dash.dependencies import Input, Output
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']


For this lab, we will be working with the 2018 General Social Survey one last time.

In [2]:
%%capture
gss = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
                 encoding='cp1252', na_values=['IAP','IAP,DK,NA,uncodeable', 'NOT SURE',
                                               'DK', 'IAP, DK, NA, uncodeable', '.a', "CAN'T CHOOSE"])

Here is code that cleans the data and gets it ready to be used for data visualizations:

In [23]:
mycols = ['id', 'wtss', 'sex', 'educ', 'region', 'age', 'coninc',
          'prestg10', 'mapres10', 'papres10', 'sei10', 'satjob',
          'fechld', 'fefam', 'fepol', 'fepresch', 'meovrwrk'] 
gss_clean = gss[mycols]
gss_clean = gss_clean.rename({'wtss':'weight', 
                              'educ':'education', 
                              'coninc':'income', 
                              'prestg10':'job_prestige',
                              'mapres10':'mother_job_prestige', 
                              'papres10':'father_job_prestige', 
                              'sei10':'socioeconomic_index', 
                              'fechld':'relationship', 
                              'fefam':'male_breadwinner', 
                              'fehire':'hire_women', 
                              'fejobaff':'preference_hire_women', 
                              'fepol':'men_bettersuited', 
                              'fepresch':'child_suffer',
                              'meovrwrk':'men_overwork'},axis=1)
gss_clean.age = gss_clean.age.replace({'89 or older':'89'})
gss_clean.age = gss_clean.age.astype('float')

The `gss_clean` dataframe now contains the following features:

* `id` - a numeric unique ID for each person who responded to the survey
* `weight` - survey sample weights
* `sex` - male or female
* `education` - years of formal education
* `region` - region of the country where the respondent lives
* `age` - age
* `income` - the respondent's personal annual income
* `job_prestige` - the respondent's occupational prestige score, as measured by the GSS using the methodology described above
* `mother_job_prestige` - the respondent's mother's occupational prestige score, as measured by the GSS using the methodology described above
* `father_job_prestige` - the respondent's father's occupational prestige score, as measured by the GSS using the methodology described above
* `socioeconomic_index` - an index measuring the respondent's socioeconomic status
* `satjob` - responses to "On the whole, how satisfied are you with the work you do?"
* `relationship` - agree or disagree with: "A working mother can establish just as warm and secure a relationship with her children as a mother who does not work."
* `male_breadwinner` - agree or disagree with: "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family."
* `men_bettersuited` - agree or disagree with: "Most men are better suited emotionally for politics than are most women."
* `child_suffer` - agree or disagree with: "A preschool child is likely to suffer if his or her mother works."
* `men_overwork` - agree or disagree with: "Family life often suffers because men concentrate too much on their work."

In [35]:
gss_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2348 entries, 0 to 2347
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   2348 non-null   int64  
 1   weight               2348 non-null   float64
 2   sex                  2348 non-null   object 
 3   education            2345 non-null   float64
 4   region               2348 non-null   object 
 5   age                  2341 non-null   float64
 6   income               2152 non-null   float64
 7   job_prestige         2248 non-null   float64
 8   mother_job_prestige  1657 non-null   float64
 9   father_job_prestige  1842 non-null   float64
 10  socioeconomic_index  2248 non-null   float64
 11  satjob               1739 non-null   object 
 12  relationship         1550 non-null   object 
 13  male_breadwinner     1545 non-null   object 
 14  men_bettersuited     1499 non-null   object 
 15  child_suffer         1536 non-null   o

## Problem 1
Our goal in this lab is to build a dashboard that presents our findings from the GSS. A dashboard is meant to be shared with an audience, whether that audience is a manager, a client, a potential employer, or the general public. So we need to provide context for our results. One way to provide context is to write text using markdown code.

Find one or two websites that discuss the gender wage gap, and write a short paragraph in markdown code summarizing what these sources tell us. Include hyperlinks to these websites. Then write another short paragraph describing what the GSS is, what the data contain, how it was collected, and/or other information that you think your audience ought to know. A good starting point for information about the GSS is here: http://www.gss.norc.org/About-The-GSS

Then save the text as a Python string so that you can use the markdown code in your dashboard later.

It should go without saying, but no plagiarism! If you summarize a website, make sure you put the summary in your own words. Anything that is copied and pasted from the GSS webpage, Wikipedia, or another website without attribution will receive no credit.

(Don't spend too much time on this, and you might want to skip it during the Zoom session and return to it later so that you can focus on working on code with your classmates.) [1 point]

In [42]:
gender_wage_gap_note = """
[Gender pay gap](https://en.wikipedia.org/wiki/Gender_pay_gap) refers to the difference in remuneration earned by working men and women. The issue is widely debated: most agree that it is unfair for working women to earn less on average. The counterargument holds that the adjusted pay gap is statistically insignificant, and that the observed difference stems from the many choices people make during their lives rather than from discriminatory practices ([The Suffolk Journal](https://thesuffolkjournal.com/28647/opinion/the-gender-wage-gap-is-a-myth/)).

"""

general_social_survey_note = """
The [General Social Survey](https://gss.norc.org/About-The-GSS) is a US personal-interview survey that monitors changes in socioeconomic characteristics as well as attitudes. It has been conducted since 1972, and it aims to identify trends and explain long-term changes in attitudes and behaviors.

"""

In [36]:
markdown_text = '''
The [American National Election Study](https://electionstudies.org) (ANES) is a massive public opinion survey conducted after every national election. It is one of the greatest sources of data available about the voting population of the United States. It contains far more information than a typical public opinion poll. Iterations of the survey contain thousands of features from thousands of respondents: the survey examines people's attitudes on the election, the candidates, and the parties; it collects massive amounts of demographic information and other characteristics from voters; and it records people's opinions on a myriad of political and social issues.

Prior to each election the ANES conducts a "pilot study" that asks many of the questions that will be asked on the post-election survey. The idea is to capture a snapshot of the American electorate prior to the election and to get a sense of how the survey instrument is working so that adjustments can be made in time. Here we will work with the [2019 ANES pilot data](https://electionstudies.org/data-center/2019-pilot-study/). To understand the features and the values used to code responses, the data have an associated [questionnaire](https://electionstudies.org/wp-content/uploads/2020/02/anes_pilot_2019_questionnaire.pdf) and [codebook](https://electionstudies.org/wp-content/uploads/2020/02/anes_pilot_2019_userguidecodebook.pdf). The pilot data were collected in December 2019 and contain 900 features collected from 3,165 respondents.
'''

In [44]:
gender_wage_gap_note

'\n[Gender pay gap](https://en.wikipedia.org/wiki/Gender_pay_gap) refers to the difference in remuneration earned by working men and women. The issue is widely debated: most agree that it is unfair for working women to earn less on average. The counterargument holds that the adjusted pay gap is statistically insignificant, and that the observed difference stems from the many choices people make during their lives rather than from discriminatory practices ([The Suffolk Journal](https://thesuffolkjournal.com/28647/opinion/the-gender-wage-gap-is-a-myth/)).\n\n'
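
Since the dashboard needs the context as one markdown string, the two notes can be joined before they are handed to a markdown component. The following is a minimal sketch, not part of the original lab: `combine_notes` is a hypothetical helper name, and the two short strings are stand-ins so the block runs on its own.

```python
def combine_notes(*notes):
    """Join markdown paragraphs with one blank line between them.

    Each note is stripped first, so the padding newlines left over from
    triple-quoted strings do not pile up between paragraphs.
    """
    return "\n\n".join(note.strip() for note in notes)


# Stand-in strings (hypothetical); in the notebook, pass
# gender_wage_gap_note and general_social_survey_note instead.
first_note = "\nA paragraph about the gender wage gap.\n\n"
second_note = "\nA paragraph about the GSS.\n\n"

dashboard_markdown = combine_notes(first_note, second_note)
print(dashboard_markdown)
```

The resulting string could then be passed as the `children` argument of `dcc.Markdown` when the dashboard is assembled.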

## Problem 2
Generate a table that shows the mean income, occupational prestige, socioeconomic index, and years of education for men and for women. Use a function from a `plotly` module to display a web-enabled version of this table. This table is for presentation purposes, so round every column to two decimal places and use more presentable column names. [3 points]

In [25]:
gss_clean.columns

Index(['id', 'weight', 'sex', 'education', 'region', 'age', 'income',
       'job_prestige', 'mother_job_prestige', 'father_job_prestige',
       'socioeconomic_index', 'satjob', 'relationship', 'male_breadwinner',
       'men_bettersuited', 'child_suffer', 'men_overwork'],
      dtype='object')

In [45]:
stats_df = gss_clean.groupby(['sex'])[['income', 'job_prestige', 'socioeconomic_index', 'education']].mean()
stats_df.reset_index(inplace=True)
stats_df = round(stats_df, 2)
stats_df = stats_df.rename(columns = {'sex':'Gender', 
                          'income':'Income',
                          'job_prestige':'Job Prestige',
                          'socioeconomic_index':'SocioEcon index',
                          'education':'Years of Education'                         
                         })

In [126]:
table = ff.create_table(stats_df)
table.show()

## Problem 3
Create an interactive barplot that shows the number of men and women who respond with each level of agreement to `male_breadwinner`. Write presentable labels for the x and y-axes, but don't bother with a title because we will be using a subtitle on the dashboard for this graphic. [3 points]

In [75]:
# Prototype 
barplot_df = pd.DataFrame(gss_clean.groupby(['sex', 'male_breadwinner'])['male_breadwinner'].count())
barplot_df.columns = ['count']
barplot_df.reset_index(inplace=True)

In [76]:
barplot_df

|   | sex | male_breadwinner | count |
|---|---|---|---|
| 0 | female | agree | 152 |
| 1 | female | disagree | 377 |
| 2 | female | strongly agree | 48 |
| 3 | female | strongly disagree | 286 |
| 4 | male | agree | 158 |
| 5 | male | disagree | 337 |
| 6 | male | strongly agree | 40 |
| 7 | male | strongly disagree | 147 |


In [124]:
fig = px.bar(barplot_df, x='count', y='male_breadwinner', color='sex',
             color_discrete_map={'male':'blue', 'female':'red'},
             # Axis labels describe the survey item, not a vote
             labels={'male_breadwinner':'Level of agreement',
                     'count':'Number of respondents',
                     'sex':'Sex'},
             hover_data=['sex', 'male_breadwinner', 'count'],
             barmode='group')
fig.update_layout(showlegend=True)
fig.show()

## Problem 4
Create an interactive scatterplot with `job_prestige` on the x-axis and `income` on the y-axis. Color code the points by `sex` and make sure that the figure includes a legend for these colors. Also include two best-fit lines, one for men and one for women. Finally, include hover data that shows us the values of `education` and `socioeconomic_index` for any point the mouse hovers over. Write presentable labels for the x and y-axes, but don't bother with a title because we will be using a subtitle on the dashboard for this graphic. [3 points]

In [81]:
scatterplot_df = gss_clean[['sex', 'job_prestige', 'income', 'education', 'socioeconomic_index']]
scatterplot_df.head()

|   | sex | job_prestige | income | education | socioeconomic_index |
|---|---|---|---|---|---|
| 0 | male | 47.0 | NaN | 14.0 | 65.3 |
| 1 | female | 22.0 | 22782.5 | 10.0 | 14.8 |
| 2 | male | 61.0 | 112160.0 | 16.0 | 83.4 |
| 3 | female | 59.0 | 158201.8412 | 16.0 | 69.3 |
| 4 | male | 53.0 | 158201.8412 | 18.0 | 68.6 |


In [123]:
fig = px.scatter(scatterplot_df, x='job_prestige', y='income',
                 color='sex',
                 trendline='ols',  # one best-fit line per sex; requires the statsmodels package
                 height=600, width=600,
                 color_discrete_map={'male':'blue', 'female':'red'},
                 labels={'job_prestige':'Job prestige',
                         'income':'Income'},
                 hover_data=['education', 'socioeconomic_index'])
fig.show()

## Problem 5
Create two interactive box plots: one that shows the distribution of `income` for men and for women, and one that shows the distribution of `job_prestige` for men and for women. Write presentable labels for the axis that contains `income` or `job_prestige` and remove the label for `sex`. Also, turn off the legend. Don't bother with titles because we will be using subtitles on the dashboard for these graphics. [3 points]

In [127]:
fig = px.box(scatterplot_df, x='income', y='sex', color='sex',
             color_discrete_map={'male':'blue', 'female':'red'},
             labels={'income':'Income', 'sex':''})  # empty string removes the label for sex
fig.update_layout(showlegend=False)  # turn off the legend
fig.show()

In [128]:
fig = px.box(scatterplot_df, x='job_prestige', y='sex', color='sex',
             color_discrete_map={'male':'blue', 'female':'red'},
             labels={'job_prestige':'Job prestige', 'sex':''})  # empty string removes the label for sex
fig.update_layout(showlegend=False)  # turn off the legend
fig.show()

## Problem 6
Create a new dataframe that contains only `income`, `sex`, and `job_prestige`. Then create a new feature in this dataframe that breaks `job_prestige` into six categories with equally sized ranges. Finally, drop all rows with any missing values in this dataframe.

Then create a facet grid with three rows and two columns in which each cell contains an interactive box plot comparing the income distributions of men and women for each of these new categories. 

(If you want men to be represented by blue and women by red, you can include `color_discrete_map = {'male':'blue', 'female':'red'}` in your plotting function. Or use different colors if you want!) [3 points]

In [132]:
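# .copy() makes an independent dataframe, so adding a column below does not raise SettingWithCopyWarning
facetgrid_df = gss_clean[['sex', 'income', 'job_prestige']].copy()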
facetgrid_df['prestige_cat'] = pd.cut(facetgrid_df['job_prestige'], 6)
list(facetgrid_df['prestige_cat'].unique())

[Interval(37.333, 48.0, closed='right'),
 Interval(15.936, 26.667, closed='right'),
 Interval(58.667, 69.333, closed='right'),
 Interval(48.0, 58.667, closed='right'),
 Interval(26.667, 37.333, closed='right'),
 Interval(69.333, 80.0, closed='right'),
 nan]

In [133]:
facetgrid_df.isnull().sum()

sex               0
income          196
job_prestige    100
prestige_cat    100
dtype: int64

In [135]:
facetgrid_df.dropna(inplace=True)
facetgrid_df.isnull().sum()

sex             0
income          0
job_prestige    0
prestige_cat    0
dtype: int64
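
The categories produced by `pd.cut` are `Interval` objects, and the facets appear in whatever order the values first occur unless an order is supplied. Below is a minimal sketch on made-up prestige scores (not the GSS data) of one way to turn the six equal-width bins into ordered string labels; the variable names are hypothetical.

```python
import pandas as pd

# Made-up prestige scores spanning 16 to 80 (an assumption for illustration).
toy_prestige = pd.Series([16, 25, 38, 47, 59, 80])

# Six bins of equal width; the result is an ordered categorical of Intervals.
prestige_cat = pd.cut(toy_prestige, 6)

# String versions of the interval labels, kept in ascending order.
ordered_labels = [str(interval) for interval in prestige_cat.cat.categories]

# Same categories, now labelled with plain strings.
prestige_text = prestige_cat.cat.rename_categories(ordered_labels)

print(ordered_labels)
print(prestige_text.tolist())
```

A list like `ordered_labels` could be passed to the plotting call as `category_orders={'prestige_cat': ordered_labels}` so that the facets run from the lowest to the highest prestige range.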

In [139]:
fig = px.box(facetgrid_df, x='income', y = 'sex', color = 'sex',
             facet_col='prestige_cat', facet_col_wrap=2,
             color_discrete_map = {'male':'blue', 'female':'red'},
            labels={'sex':'', 'income':''},
            #title = 'Income by prestige categories for different genders,
            )
fig.update(layout=dict(title=dict(x=0.5)))
fig.update_layout(showlegend=False)
fig.show()

## Problem 7
Create a dashboard that displays the following elements:

* A descriptive title

* The markdown text you wrote in problem 1

* The table you made in problem 2

* The barplot you made in problem 3

* The scatterplot you made in problem 4

* The two boxplots you made in problem 5 side-by-side

* The faceted boxplots you made in problem 6

* Subtitles for all of the above elements

Use `JupyterDash` to display this dashboard directly in your Jupyter notebook.

Any working dashboard that displays all of the above elements will receive full credit. [4 points]

## Extra Credit (up to 10 bonus points)
Dashboards are all about good design, functionality, and accessibility. For this extra credit problem, create another version of the dashboard you built for problem 7, but take extra steps to improve the appearance of the dashboard, add user-inputs, and host it on the internet with its own URL.

**Challenge 1**: Be creative and use a layout that significantly departs from the one used for the ANES data in the module 12 notebook. A good place to look for inspiration is the [Dash gallery](https://dash-gallery.plotly.host/Portal/). We will award up to 3 bonus points for creativity, novelty, and style.

**Challenge 2**: Alter the barplot from problem 3 to include user inputs. Create two dropdown menus on the dashboard. The first one should allow a user to display bars for the categories of `satjob`, `relationship`, `male_breadwinner`, `men_bettersuited`, `child_suffer`, or `men_overwork`. The second one should allow a user to group the bars by `sex`, `region`, or `education`. After choosing a feature for the bars and one for the grouping, program the barplot to update automatically to display the user-inputted features. One bonus point will be awarded for a good effort, and 3 bonus points will be awarded for a working user-input barplot in the dashboard.

**Challenge 3**: Follow the steps listed in the module notebook to deploy your dashboard on Heroku. 1 bonus point will be awarded for a Heroku link to an app that isn't working. 4 bonus points will be awarded for a working Heroku link.