# Exercises for Practice

## Exercise 01 -- Nobel Prize Winners 
Georgios Karamanis gathered and shared data on Nobel Prize winners over the years, in a fair amount of detail, and the data were used in the `tidytuesday` series a while back. Use these data for the questions that follow.

In [None]:
readr::read_csv(
    "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-05-14/nobel_winners.csv"
    ) -> nobel_winners 

|variable             |class     |description |
|:---|:---|:-----------|
|prize_year           |double    | Year that Nobel Prize was awarded|
|category             |character | Field of study/category|
|prize                |character | Prize Name |
|motivation           |character | Motivation of the award |
|prize_share          |character | Share eg 1 of 1, 1 of 2, 1 of 4, etc |
|laureate_id          |double    | ID assigned to each winner |
|laureate_type        |character | Individual or organization  |
|full_name            |character | name of the winner|
|birth_date           |double    | birth date of winner |
|birth_city           |character | birth city/state of winner |
|birth_country        |character | birth country of winner |
|gender               |character | binary gender of the winner |
|organization_name    |character | organization name |
|organization_city    |character | organization city |
|organization_country |character | organization country |
|death_date           |double    | death date of the winner (if dead) |
|death_city           |character | death city (if dead) |
|death_country        |character | death country (if dead) |

(a) First create `nobel.df`, keeping only records from 1960 onward and only the "Physics" category. Then generate an appropriate chart that shows the distribution of winners by `birth_country`.

In [None]:
library(tidyverse)

nobel_winners %>%
    filter(
        prize_year >= 1960, category == "Physics"
    ) -> nobel.df

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16) 

ggplot(
    data = nobel.df, 
    aes(x = birth_country)
    ) +
    geom_bar() +
    labs(
        title = "Winners of the Nobel Prize in Physics, by Country of Birth",
        x = "Country of Birth",   # x aesthetic; drawn vertically after coord_flip()
        y = "Number of Winners"
    ) +
    coord_flip() 
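
Bars sorted by count are usually easier to compare than bars in alphabetical order. The following is a minimal sketch of one way to get that ordering, assuming the `forcats` package (installed with the tidyverse) is available. `toy.df` is hypothetical stand-in data, not the Nobel data.

```r
library(ggplot2)
library(forcats)

# Hypothetical toy data standing in for nobel.df
toy.df <- data.frame(
    birth_country = c("USA", "USA", "USA", "Japan", "Japan", "France")
)

# fct_infreq() orders the levels from most to least frequent;
# fct_rev() reverses that order so the most frequent country
# ends up at the top of the chart once coord_flip() is applied
toy.df$birth_country <- fct_rev(fct_infreq(toy.df$birth_country))

p <- ggplot(
    data = toy.df,
    aes(x = birth_country)
    ) +
    geom_bar() +
    labs(
        x = "Country of Birth",
        y = "Number of Winners"
    ) +
    coord_flip()

p
```

The same two `forcats` calls should work on `nobel.df$birth_country` before plotting.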


(b) Now break this distribution out by `gender` to see how the distribution of winners by country differs across genders.

In [None]:
ggplot(
    data = nobel.df, 
    aes(x = birth_country, fill = gender)
    ) +
    geom_bar() +
    labs(
        title = "Winners of the Nobel Prize in Physics, by Country of Birth and Gender",
        x = "Country of Birth",   # x aesthetic; drawn vertically after coord_flip()
        y = "Number of Winners"
    ) +
    facet_wrap(~ gender) +
    coord_flip() +
    theme(legend.position = "none") 

(c) Now go back to `nobel_winners`, the full data-set, and create a simple plot that shows the distribution of prize winners by `death_country`, `gender`, and `category`.

In [None]:
options(repr.plot.width = 20, repr.plot.height = 20) 

ggplot(
    data = nobel_winners, 
    aes(x = death_country, fill = gender)
    ) +
    geom_bar(position = "dodge") +
    labs(
        title = "Winners of the Nobel Prize, by Country of Death, Category, and Gender",
        x = "Country of Death",   # x aesthetic; drawn vertically after coord_flip()
        y = "Number of Winners",
        fill = "Gender"
    ) +
    facet_wrap(~ category) +
    coord_flip() +
    theme(legend.position = "top") 
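
The variable table describes `death_country` as recorded only "if dead", so rows for living laureates are expected to be `NA` and would show up as their own bar. The sketch below shows one way to drop incomplete rows before plotting. `toy.df` and its values are hypothetical, and whether dropping these rows is appropriate depends on the question being asked.

```r
library(dplyr)

# Hypothetical toy data standing in for nobel_winners
toy.df <- tibble(
    full_name     = c("A", "B", "C", "D"),
    gender        = c("Female", "Male", "Male", NA),
    death_country = c("France", NA, "USA", NA)
)

# Keep only rows where both grouping variables are recorded
toy.df %>%
    filter(
        !is.na(death_country), !is.na(gender)
    ) -> toy.complete

toy.complete
```

The filtered data frame could then be passed to `ggplot()` in place of the full data-set.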

## Exercise 02 -- Water levels in the Great Lakes

Download the monthly Great Lakes water level data-set in [SPSS format from here](https://aniruhil.github.io/avsr/teaching/dataviz/greatlakes.sav) or in [Excel format from here](https://aniruhil.github.io/avsr/teaching/dataviz/greatlakes.xlsx). *Note that water level is in meters.*

You may use the following code to download and read in the Excel file:

In [None]:
library(readxl)
url <- "https://aniruhil.github.io/avsr/teaching/dataviz/greatlakes.xlsx"
destfile <- "greatlakes.xlsx"
curl::curl_download(url, destfile)
read_excel(destfile, col_types = c("date", 
     "numeric", "numeric", "numeric", "numeric", 
     "numeric")) -> greatlakes 

greatlakes %>%
    head()

Now use an appropriate chart to show the water level for Lake Superior. 

In [None]:
options(repr.plot.width = 16, repr.plot.height = 10) 

ggplot(
    data = greatlakes,
    aes(x = monthyear, y = Superior)
    ) +
    geom_line() +
    labs(
        title = "Lake Superior's Water Level (in meters)",
        x = "Date",
        y = "Water Level (in Meters)"
        )
    

## Exercise 03 -- County Health Rankings

Download the 2017 County Health Rankings data [SPSS format from here](https://aniruhil.github.io/avsr/teaching/dataviz/CountyHealthRankings2017.sav), [Excel format from here](https://aniruhil.github.io/avsr/teaching/dataviz/CountyHealthRankings2017.xlsx) and the [accompanying codebook](http://www.countyhealthrankings.org/sites/default/files/2017TrendsDocumentation.pdf). 

These data can also be downloaded with the code provided below: 

In [None]:
library(readxl)

url <- "https://aniruhil.github.io/avsr/teaching/dataviz/CountyHealthRankings2017.xlsx"
destfile <- "CountyHealthRankings2017.xlsx"
curl::curl_download(url, destfile)
read_excel(destfile) -> chr.df 

chr.df %>%
    head()

Construct appropriate plots that show the relationship between the following pairs of variables:

(a) Adult obesity and High school graduation 

In [None]:
options(repr.plot.width = 10, repr.plot.height = 7) 

ggplot(
    data = chr.df,
    aes(x = Adult_obesity, y = High_school_graduation)
    ) +
    geom_point() +
    labs(
        title = "Scatterplot of Adult Obesity and High School Graduation",
        x = "Proportion of Adults Obese",
        y = "Proportion of High School Graduates"
    )

(b) Children in poverty and High school graduation 

In [None]:
ggplot(
    data = chr.df,
    aes(x = Children_in_poverty, y = High_school_graduation)
    ) +
    geom_point() +
    labs(
        title = "Scatterplot of Children in Poverty and High School Graduation",
        x = "Proportion of Children Living in Poverty",
        y = "Proportion of High School Graduates"
    )

(c) Preventable hospital stays and Unemployment rate 

In [None]:
ggplot(
    data = chr.df,
    aes(x = Preventable_hospital_stays, y = Unemployment_rate)
    ) +
    geom_point() +
    labs(
        title = "Scatterplot of Preventable Hospital Stays and Unemployment Rate",
        x = "Preventable Hospital Stays",
        y = "Unemployment Rate"
    )
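
For any of the scatterplots in this exercise, a fitted line and a correlation coefficient can supplement the visual impression. The following is a minimal sketch on hypothetical data: the column names are borrowed from `chr.df`, but the values are made up for illustration.

```r
library(ggplot2)

# Hypothetical toy data; the values are made up for illustration
toy.df <- data.frame(
    Adult_obesity          = c(0.20, 0.25, 0.30, 0.35, NA),
    High_school_graduation = c(0.90, 0.85, 0.80, 0.75, 0.70)
)

# Pearson correlation, using only rows where both values are present
r <- cor(
    toy.df$Adult_obesity, toy.df$High_school_graduation,
    use = "complete.obs"
)

# Scatterplot with a least-squares line overlaid
p <- ggplot(
    data = toy.df,
    aes(x = Adult_obesity, y = High_school_graduation)
    ) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
    labs(
        x = "Proportion of Adults Obese",
        y = "Proportion of High School Graduates"
    )

p
```

A straight line is only one possible summary; if the real data show curvature, a different smoother may describe them better.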

## Exercise 04 -- Unemployment Rates

Use the unemployment data given to you (`unemprate.RData`) and construct appropriate plots that show the distribution of unemployment rates across years for each of the four educational attainment groups.

In [None]:
load("data/unemprate.RData")

urate %>%
    head()

Be sure to use a unique color for each educational attainment group.

In [None]:
options(repr.plot.width = 20, repr.plot.height = 15) 

ggplot(
    data = urate,
    aes(x = yearmonth, y = rate, color = educ_group)) +
    geom_point() +
    geom_line() +
    theme(legend.position = "top") +
    labs(
        title = "Unemployment Rate by Year and Education Group",
        x = "Date",
        y = "Unemployment Rate",
        color = ""
        )
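
Mapping `color = educ_group` already gives each group its own default color. If specific colors are wanted, `scale_color_manual()` lets you assign them by name. The sketch below assumes a long layout with the same column names as `urate`; the group labels, values, and color choices are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data in the same long layout as urate
toy.df <- data.frame(
    yearmonth  = rep(as.Date(c("2020-01-01", "2020-02-01")), times = 2),
    rate       = c(5.1, 5.3, 2.0, 2.1),
    educ_group = rep(c("Group A", "Group B"), each = 2)
)

# Illustrative color choices, one named entry per group
group.colors <- c("Group A" = "firebrick", "Group B" = "steelblue")

p <- ggplot(
    data = toy.df,
    aes(x = yearmonth, y = rate, color = educ_group)
    ) +
    geom_line() +
    scale_color_manual(values = group.colors) +
    theme(legend.position = "top") +
    labs(
        x = "Date",
        y = "Unemployment Rate",
        color = ""
    )

p
```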