# Python Activity
## Data Wrangling

This notebook is designed to acquaint you with some of the data wrangling and transformation tools in the Pandas module of Python. Refer to the content in Chapters 8 and 10 of _**Python for Data Analysis (3rd Ed.)**_ for examples of the type of code you need for these exercises.

For EACH exercise:

1. Read the description of the task
2. Type your solution in the code cell marked `# YOUR CODE HERE`
3. Run your code (fix any issues and re-run if needed)
4. Run the TEST CELL that FOLLOWS your code cell. **_DO NOT MODIFY THE TEST CELL._**

The output from the TEST CELL will indicate whether you have performed the task correctly. If the output does not end with _`(Passed.)`_, return to your code cell and revise your code.

### Import Libraries

We will start by loading Pandas and other Python libraries needed for this notebook.

In [None]:
# Libraries needed in this notebook
import numpy as np
import pandas as pd
from IPython.display import display


# Exercises


## Basic Tidying Transformations: Gathering ("Melting") and Spreading ("Casting")

Given a data set and a target set of variables, there are at least two common issues that require tidying.

> Melting and casting are Wickham's terms from [his original paper on tidying data](http://www.jstatsoft.org/v59/i10/paper). In his more recent writing, [on which this tutorial is based](http://r4ds.had.co.nz/tidy-data.html), he refers to the same operations as _gathering_ and _spreading_.

### Melting
First, values often appear as columns. The table on the right is an example. To tidy up, you want to turn columns into rows:

![Gather example](http://r4ds.had.co.nz/images/tidy-9.png)

Because this operation turns columns into rows, making a "fat" table taller and skinnier, it is sometimes called _melting_. It is also referred to as 'Gathering' or simply 'Wide to Long' (see textbook). The `pandas` operation that performs this function is called `melt`.

Notice that "melting" the above table includes the following steps:

1. Identify the column(s) that will be used as _**ID**_ column(s).  In the above example, this is the `country` column.
2. Identify the columns that will provide values for a new _**key**_ column. In the above example, the columns `1999` and `2000` become **values** in a key column called `year`.
3. Convert the values _associated_ with the columns identified in step 2 into a new column as well. In this case, the values formerly in columns `1999` and `2000` become the values in the `cases` column.

Viewing the above example in the context of a melt, it is common to describe `year` as a new _**key**_ variable and `cases` as the new _**value**_ variable for that key.
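
The three steps above can be sketched with `pd.melt` on a small made-up table. The names `store`, `Q1`, `Q2`, `quarter`, and `sales` are hypothetical and are not part of the exercise data; the sketch only illustrates how the ID, key, and value roles map onto the `id_vars`, `value_vars`, `var_name`, and `value_name` parameters.

```python
import pandas as pd

# Hypothetical wide table: one row per store, one column per quarter.
wide = pd.DataFrame({
    "store": ["A", "B"],
    "Q1": [10, 30],
    "Q2": [20, 40],
})

# Step 1: `store` is the ID column and is repeated for every melted row.
# Step 2: the column names `Q1` and `Q2` become values of a new key column, `quarter`.
# Step 3: the cell values from those columns move into a new value column, `sales`.
tall = pd.melt(
    wide,
    id_vars=["store"],
    value_vars=["Q1", "Q2"],
    var_name="quarter",
    value_name="sales",
)
print(tall)
```

The result has one row per (store, quarter) pair, so a 2 x 3 table becomes a 4 x 3 table.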

**Exercise 1.** Write a function called `my_melt` to perform a melt meeting the following specifications:

```python
    def my_melt(df, id_cols, val_cols, key, value):
        ...
```

It should take the following arguments:
- `df`: the input data frame, e.g., `table4` in the example above;
- `id_cols`: a list of the column names that serve as ID columns; e.g., `country` in the example above;
- `val_cols`: a list of the column names that will serve as values; e.g., columns `1999` and `2000` in the example above;
- `key`: name of the new key variable; e.g., `year` in the example above;
- `value`: name of the column to hold the values; e.g., `cases` in the example above.

>#### NOTES
>* By far the easiest way to implement the body of this function is to call the **pandas** `melt` function and pass it the appropriate parameters.
>* The example in the text will get you started with examples of `pd.melt`, but you will need to search for additional documentation. _**HINT:**_ Every argument of your function will need to be passed as the appropriate argument to `pd.melt`.

In [None]:
def my_melt(df, id_cols, val_cols, key, value):
    assert type(df) is pd.DataFrame
    #
    # YOUR CODE HERE
    #


# You can use the code below to try your function before proceeding to the test cell.
df = pd.DataFrame(columns=['Country','Other','1999','2000'],
                      data=list(zip(['Afghanistan','Brazil','China'],
                                    ['Stuff1','Stuff2','Stuff3'],
                                    [745,375,2208],
                                    [841,422,3119])))
display(df)
my_melt(df,['Country','Other'],['1999','2000'],'Year','Count')

In [None]:
# Test: `melt_test`

def tibbles_are_equivalent(A, B):
    """Given two tidy tables ('tibbles'), returns True iff they are
    equivalent.
    """
    Acols = list(A.columns)
    Bcols = list(B.columns)
    if not len(Acols) == len(Bcols):
        return False
    
    try:
        Z = A.merge(B,on=Acols)
    except ValueError:
        return False
    
    if not len(A)==len(Z):
        return False
    
    return True

table4a = pd.read_csv('Data/table4a.csv')
print("\n=== table4a ===")
display(table4a)

m_4a = my_melt(table4a, id_cols = ['country'], val_cols=['1999', '2000'], key='year', value='cases')
print("=== melt(table4a) ===")
display(m_4a)

table4b = pd.read_csv('Data/table4b.csv')
print("\n=== table4b ===")
display(table4b)

m_4b = my_melt(table4b, id_cols = ['country'], val_cols=['1999', '2000'], key='year', value='population')
print("=== melt(table4b) ===")
display(m_4b)

m_4 = pd.merge(m_4a, m_4b, on=['country', 'year'])
print ("\n=== inner-join(melt(table4a), melt (table4b)) ===")
display(m_4)

m_4['year'] = m_4['year'].apply (int)

table1 = pd.read_csv('Data/table1.csv')
print ("=== table1 (target solution) ===")
display(table1)
assert tibbles_are_equivalent(table1, m_4)
print ("\n(Passed.)")

### Casting (or Spreading)
The second most common issue is that an observation might be split across multiple rows. Table 2 is an example. To tidy up, you want to merge rows:

![Spread example](http://r4ds.had.co.nz/images/tidy-8.png)

This operation is conceptually the opposite of melting: it "re-assembles" observations from their parts, and it is sometimes called _casting_. It is also referred to as 'Spreading' or simply 'Long to Wide' (see textbook). The `pandas` operation that performs this function is called `pivot`.
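
As a rough illustration of a long-to-wide reshape, the sketch below pivots a small made-up table. The names `store`, `measure`, and `amount` are hypothetical and do not come from the exercise data; the point is only to show which column supplies the new column names and which supplies the cell values.

```python
import pandas as pd

# Hypothetical long table: each store's observation is split across two rows.
tall = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "measure": ["sales", "returns", "sales", "returns"],
    "amount": [10, 1, 30, 3],
})

# `index` names the ID column, `columns` names the key column whose values
# become new column names, and `values` names the column that fills the cells.
wide = tall.pivot(index="store", columns="measure", values="amount")
print(wide)
```

Each store now occupies a single row, with one column per distinct value of `measure`. Note that `store` ends up as the index of the result rather than as a regular column.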


**Exercise 2.** Write a function called `my_cast` that casts a data frame into wide format, given a key column containing the new variable names and a value column containing the corresponding values. It should meet the following specifications:

```python
    def my_cast(df, id_cols, key, value):
        ...
```

It should take the following arguments:
- `df`: the input data frame; e.g., `table2` in the example above;
- `id_cols`: a list with the names of the ID columns; e.g., `country` and `year` in the above example;
- `key`: name of the column containing the key variable; e.g., column `key` in the above example (`type` in `Data/table2.csv`);
- `value`: name of the column containing the values; e.g., `values` in the above example (`count` in `Data/table2.csv`).

>#### NOTES
>* By far the easiest way to implement the body of this function is to call the **pandas** `pivot` function and pass it the appropriate parameters.
>* The example in the text will get you started with examples of `pd.pivot`, but you will need to search for additional documentation. _**HINT:**_ Every argument of your function will need to be passed as the appropriate argument to `pd.pivot`.
>* If you use the simplest solution, you may find that your ID columns have become the index. This structure will not pass the test, because these columns should be regular columns of the data frame. You can resolve this with code similar to the following:
```python
tibble = tibble.reset_index().rename(columns={"index":"keycols"})
```
>    _This will take all values in the index and put them back into the data frame as the appropriate columns._
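
To see the index issue concretely, here is a minimal sketch on a made-up table with two ID columns. The names `store`, `year`, `measure`, and `amount` are hypothetical, and passing a list to `index` assumes a reasonably recent pandas (1.1 or later).

```python
import pandas as pd

# Hypothetical long table with two ID columns (`store` and `year`).
tall = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "year": [2020, 2020, 2020, 2020],
    "measure": ["sales", "returns", "sales", "returns"],
    "amount": [10, 1, 30, 3],
})

# After pivoting, both ID columns live in a MultiIndex, not in the columns.
wide = tall.pivot(index=["store", "year"], columns="measure", values="amount")
print(list(wide.index.names))

# reset_index() turns every index level back into an ordinary column.
flat = wide.reset_index()
# Optional cosmetic step: drop the leftover "measure" label on the column axis.
flat.columns.name = None
print(flat)
```

After `reset_index()`, `store` and `year` can be used like any other column, for example as merge keys.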

Your code cell below contains a partial solution that verifies that the given `key` and `value` columns are actual columns of the input data frame.


In [None]:
def my_cast(df, id_cols, key, value):
    """Casts the input data frame into a tibble,
    given the unique ID, key column and value column.
    """
    assert type(df) is pd.DataFrame
    assert key in df.columns and value in df.columns  
    #
    # YOUR CODE HERE
    #

# You can use the code below to try your function before proceeding to the test cell.
t = pd.read_csv('Data/table2.csv')
display(t)
my_cast(t,['country','year'],'type','count')

In [None]:
# Test: `cast_test`

table2 = pd.read_csv('Data/table2.csv')
print('=== table2 ===')
display(table2)

print("\n=== tibble2 = my_cast (table2, ['country','year'], 'type', 'count') ===")
tibble2 = my_cast(table2, ['country','year'], 'type', 'count')
display(tibble2)

assert tibbles_are_equivalent(table1, tibble2)
print('\n(Passed.)')

#### READY TO SUBMIT?
You've reached the end of this notebook. Be sure to restart and run all cells again to **make sure all cells are working** when they run in order. Then submit your **completed** HTML to the submission folder for this activity.