In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab11.ipynb")

# Lab 11: Tidy Data and Interactive Visualization

Welcome to Lab 11 of Data Wrangling and Visualization!

## Overview
Statistician Hadley Wickham draws an analogy between datasets and Leo Tolstoy's line "Happy families are all alike; every unhappy family is unhappy in its own way." Wickham's insight is that tidy datasets are all alike, but every messy dataset is messy in its own way ([paper](https://vita.had.co.nz/papers/tidy-data.pdf)).

Tidy datasets are desirable because they provide a standard way to connect the structure of a dataset (e.g., columns, physical layout) to its meaning. Tidy data follows three principles:

- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.

Messy data refers to any other arrangement of a dataset.


The five most common problems with messy datasets are
- column headers are values, not variable names
- multiple variables are stored in one column
- variables are stored in both rows and columns
- multiple types of observational units are stored in the same table
- a single observational unit is stored in multiple tables.


The most common tools to address messiness are
- melting (changing wide format data to long format data)
- string splitting
- pivoting (changing long format data to wide format data)
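
As a rough illustration of how these three tools fit together, here is a minimal sketch on a small made-up table. The data frame `wide` and its column names are hypothetical and are not one of the lab datasets; your own solutions may reasonably look different.

```python
import pandas as pd

# Hypothetical toy data (not one of the lab datasets): one row per city,
# with a variable name and a year packed into each column header.
wide = pd.DataFrame({
    "city": ["Oslo", "Lima"],
    "temp_2020": [5.0, 19.0],
    "temp_2021": [6.0, 20.0],
    "rain_2020": [800, 10],
    "rain_2021": [750, 15],
})

# Melting: wide -> long. Each (city, column header) pair becomes a row.
long_df = wide.melt(id_vars="city", var_name="measure_year", value_name="value")

# String splitting: separate the two variables stored in one column.
long_df[["measure", "year"]] = long_df["measure_year"].str.split("_", expand=True)
long_df["year"] = long_df["year"].astype(int)
long_df = long_df.drop(columns="measure_year")

# Pivoting: long -> wide. Each unique value in `measure` becomes a column.
tidy = (
    long_df.pivot(index=["city", "year"], columns="measure", values="value")
    .reset_index()
)
tidy.columns.name = None  # drop the leftover "measure" label on the column axis
print(tidy)
```

After these steps, each row of `tidy` is one city in one year, and `rain` and `temp` are separate columns.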

Other tools that are nice for cleaning data include 
- filter (subsetting or removing observations based on some condition)
- transform (adding or modifying variables, e.g., log transforming a single variable or computing force from mass and acceleration variables)
- aggregate (collapsing multiple values into a single value, such as taking the mean, or summing the total count)
- sort (changing the order of observations, for example, sorting by date or sorting alphabetically)
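
The four operations above can be sketched in Pandas as follows. This is only an illustration: the `trials` data frame and its column names are made up for this example and are not part of the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not one of the lab datasets).
trials = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-02", "2024-03-01", "2024-03-03", "2024-03-01"]),
    "group": ["a", "b", "a", "b"],
    "mass_kg": [2.0, 5.0, 1.0, 4.0],
    "accel_ms2": [3.0, 2.0, 9.0, 0.5],
})

# Filter: keep only the observations that satisfy a condition.
heavy = trials[trials["mass_kg"] >= 2.0]

# Transform: add variables computed from existing ones.
trials = trials.assign(
    force_n=trials["mass_kg"] * trials["accel_ms2"],
    log_mass=np.log(trials["mass_kg"]),
)

# Aggregate: collapse many values into a single value per group.
mean_force = trials.groupby("group")["force_n"].mean()

# Sort: reorder the observations, here by date.
by_date = trials.sort_values("date").reset_index(drop=True)
```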

Tidying a dataset means putting it in the form described by the three principles above. Tidying is one part of the broader task of data cleaning.

## In today's lab, we will
- work on understanding the concept of tidy data
- recognize whether a given dataset is tidy or messy
- utilize common tools to reshape data into tidy form

In [None]:
import pandas as pd
import numpy as np

### 1. Practice: Reshaping data

**Question 1.1:** Given the dataset below, use Pandas methods to reshape the data frame on the left to match the data frame on the right. 


<table style="text-align:center;"><tr>
<th style='text-align:center; vertical-align:middle'> Original </th>
<th style='text-align:center; vertical-align:middle'> Reshaped </th></tr>
<tr>
<td> <img src="df_orig.PNG" alt="Drawing" style="width: 280px;"/> </td>
<td> <img src="df_reshape1.PNG" alt="Drawing" style="width: 180px;"/> </td>
</tr></table>

*NOTES:* 
- The data frame on the right above has been truncated (for space). Your resulting data frame should include all the data from the original data frame.
- The goal in this problem is not to tidy the data. It is to practice and refresh our memories of reshaping methods in Pandas. (We went over these during midterm week, so we didn't get as much practice with them.)

In [None]:
# Create original dataframe:
df_orig = pd.DataFrame(
        {"Person":["Alan","Berta","Charlie","Danielle"], # Name of person
        "House":["A","B","A","C"],                      # Name of the house they live in
        "Age":[32,46,35,28],                            # Age of person
        "Books":[100,30,20,40],                         # Number of books owned
        "Movies":[10,20,80,60]                          # Number of movies watched
        })
df_orig

In [None]:
df_reshaped1 = ...
df_reshaped1

In [None]:
grader.check("q1_1")

**Question 1.2:** Given the dataset below, use Pandas methods to reshape the data frame on the left to match the data frame on the right. 

<table style="text-align:center;"><tr>
<th style='text-align:center; vertical-align:middle'> Original </th>
<th style='text-align:center; vertical-align:middle'> Reshaped </th></tr>
<tr>
<td> <img src="df_orig.PNG" alt="Drawing" style="width: 280px;"/> </td>
<td> <img src="df_reshape2.PNG" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>

*NOTES:* 
- The goal in this problem is not to tidy the data. It is to practice and refresh our memories of reshaping methods in Pandas.

In [None]:
df_reshaped2 = ...
df_reshaped2

In [None]:
grader.check("q1_2")

**Question 1.3:** Given the dataset below, use Pandas methods to reshape the data frame on the left to match the data frame on the right. 

<table style="text-align:center;"><tr>
<th style='text-align:center; vertical-align:middle'> Original </th>
<th style='text-align:center; vertical-align:middle'> Reshaped </th></tr>
<tr>
<td> <img src="df_orig.PNG" alt="Drawing" style="width: 280px;"/> </td>
<td> <img src="df_reshape3.PNG" alt="Drawing" style="width: 200px;"/> </td>
</tr></table>

*NOTES:* 
- The data frame on the right above has *not* been truncated. Your resulting data frame should only include the data that is shown.
- The goal in this problem is not to tidy the data. It is to practice and refresh our memories of reshaping methods in Pandas.

In [None]:
df_reshaped3 = ...
df_reshaped3

In [None]:
grader.check("q1_3")

**Question 1.4:** Given the dataset below, use Pandas methods to reshape the data frame on the left to match the data frame on the right. 

<table style="text-align:center;"><tr>
<th style='text-align:center; vertical-align:middle'> Original </th>
<th style='text-align:center; vertical-align:middle'> Reshaped </th></tr>
<tr>
<td> <img src="df_orig.PNG" alt="Drawing" style="width: 280px;"/> </td>
<td> <img src="df_reshape4.PNG" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>

*NOTES:* 
- The goal in this problem is not to tidy the data. It is to practice and refresh our memories of reshaping methods in Pandas.

In [None]:
df_reshaped4 = ...
df_reshaped4

In [None]:
grader.check("q1_4")

**Question 1.5:** Given the dataset below, use Pandas methods to reshape the data frame on the left to match the data frame on the right. 

<table style="text-align:center;"><tr>
<th style='text-align:center; vertical-align:middle'> Original </th>
<th style='text-align:center; vertical-align:middle'> Reshaped </th></tr>
<tr>
<td> <img src="df_orig3.PNG" alt="Drawing" style="width: 520px;"/> </td>
<td> <img src="df1_reshape.PNG" alt="Drawing" style="width: 250px;"/> </td>
</tr></table>

*NOTES:* 
- You will likely need more than one line of code.
- The goal in this problem is not to tidy the data. It is to practice and refresh our memories of reshaping methods in Pandas.

In [None]:
df1 = pd.DataFrame({
    'ID': [1,2,3],
    'Name': ['John', 'Alice', 'Bob'],
    'Math_Score_Feb': [85, 90, 78],
    'Math_Score_Mar': [88, 85, 90],
    'Science_Score_Feb': [92, 88, 85],
    'Science_Score_Mar': [85, 90, 88]
})
df1

In [None]:
df1_reshaped = ...
df1_reshaped

In [None]:
grader.check("q1_5")

**Question 1.6:** Given the dataset below, use Pandas methods to reshape the data frame on the left to match the data frame on the right. 

<table style="text-align:center;"><tr>
<th style='text-align:center; vertical-align:middle'> Original </th>
<th style='text-align:center; vertical-align:middle'> Reshaped </th></tr>
<tr>
<td> <img src="df_orig3.PNG" alt="Drawing" style="width: 520px;"/> </td>
<td> <img src="df3_reshape.PNG" alt="Drawing" style="width: 300px;"/> </td>
</tr></table>

*NOTES:* 
- You will likely need more than one line of code.
- The goal in this problem is not to tidy the data. It is to practice and become more familiar with reshaping methods in Pandas.

In [None]:
df1_reshaped2 = ...
df1_reshaped2

In [None]:
grader.check("q1_6")

### 2. Tidying Data: Weather in Mexico

**Question 2.1:** Read in the data from `weather_raw.csv` (in your working directory). This dataset contains daily weather data from the Global Historical Climatology Network for one weather station (MX17004) in Mexico for several months in 2010. Some variables are stored in individual columns (`id`, `year`, `month`), one is spread across columns (day, `d1`–`d31`), and two are spread across rows (`tmin` and `tmax`, the minimum and maximum temperature). Months with fewer than 31 days have structural missing values for the last day(s) of the month. The `element` column is not a variable; it stores the names of variables.

In [None]:
weather_df = ...
weather_df.head()

In [None]:
grader.check("q2_1")

**Question 2.2:** Put your weather data frame in long form so that it contains the columns `id`, `year`, `month`, `element`, `day_raw` (containing `d1`, `d2`, etc.), and `value`. The `value` column should contain many NaNs at this stage.

In [None]:
weather_long = ...
weather_long.head()

In [None]:
grader.check("q2_2")

**Question 2.3:** Since `element` is not a variable (it contains the names of variables), reshape the data so that each unique value in `element` becomes a column name and the corresponding values are placed in those columns.

*NOTE:* In this step, remove any rows with NaNs in your reshaped dataframe. 

In [None]:
weather_nearly_tidy = ...
weather_nearly_tidy.head()

In [None]:
grader.check("q2_3")

**Question 2.4:** The `year`, `month`, and `day_raw` columns contain information for a single variable, date. Use those three columns to create a new column called `date`. Make sure `date` is a Pandas datetime type. Then remove the `year`, `month`, and `day_raw` columns. 
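
As a hint, one possible way to combine separate date parts into a single datetime column is sketched below. The `parts` data frame and its column names (`yr`, `mo`, `day_label`, `when`) are hypothetical and deliberately differ from the lab data; this is an illustration under those assumptions, not the required approach.

```python
import pandas as pd

# Hypothetical toy data; the column names here are illustrative only.
parts = pd.DataFrame({
    "yr": [2010, 2010],
    "mo": [1, 2],
    "day_label": ["d30", "d2"],
})

# Strip the non-numeric prefix and convert the remainder to an integer.
day_num = parts["day_label"].str.replace("d", "", regex=False).astype(int)

# pd.to_datetime accepts a DataFrame whose columns are named year, month, day.
parts["when"] = pd.to_datetime(
    pd.DataFrame({"year": parts["yr"], "month": parts["mo"], "day": day_num})
)
```

You can check the result with `parts.dtypes`, where `when` should show a datetime type.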

In [None]:
weather_with_date = ...
weather_with_date.head()

In [None]:
grader.check("q2_4")

**Question 2.5:** Sort the data by `date`. 

In [None]:
weather_sorted = ...
weather_sorted.head(10)

In [None]:
grader.check("q2_5")

<!-- BEGIN QUESTION -->

**Question 2.6:** Create an interactive plot with a dropdown menu that lets users choose between viewing `tmax` vs. time and `tmin` vs. time.

<!-- END QUESTION -->

### 3. Tidying Data: Ebola Outbreak

<!-- BEGIN QUESTION -->

**Question 3:** Import the data from `country_timeseries.csv`. Tidy the data and then create an interactive data visualization which shows the number of cases or the number of deaths through time for Guinea, SierraLeone, and Liberia.

<!-- END QUESTION -->

## You're done! 

Congratulations on finishing the lab! Gus is proud of you! Run the cell below and submit to Canvas. 

<img src="gus_high_five.JPG" alt="drawing" width="500"/>

## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False, run_tests=True)