# Restructuring data with Pandas

![austin](http://www.austintexas.gov/sites/default/files/aac_logo.jpg)

## Scenario:
You have decided to start your own animal shelter, and you want a better idea of what that will entail and more information for planning. In this lecture, we continue to look at a real data set collected by Austin Animal Center over several years. We will use our pandas skills from the last lecture, and learn some new ones, to explore this data further.


#### Our goals today are to be able to:

Use the pandas library to:

- Get summary info about a dataset and its variables using `.describe()`, `.mean()`, `.max()`, `.min()`
- Reshape a DataFrame using joins, merges, pivoting, concatenating, and melting


## Getting started

Let's take a moment to examine the [Austin Animal Center data set](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238/data). We answered a few questions about this data yesterday.  What other questions could we ask and answer using this dataset?

In pairs and as a class, let's generate ideas.

## Switch gears

Before we answer those questions about the animal shelter data, let's practice on a simpler dataset.
Read about this dataset here: https://www.kaggle.com/ronitf/heart-disease-uci
![heart-data](images/heartbloodpres.jpeg)

The dataset is most often used to practice classification algorithms. Can one develop a model to predict the likelihood of heart disease based on other measurable characteristics? We will return to that specific question in a few weeks, but for now we wish to use the dataset to practice some pandas methods.

### Get summary info about a dataset and its variables

We will apply `.info()`, `.describe()`, `.mean()`, `.min()`, and `.max()` from the pandas library.

The Pandas library has several useful tools built in. Let's explore some of them.

In [None]:
!pwd
!ls -al

In [None]:
import pandas as pd
uci = pd.read_csv('heart.csv')

In [None]:
uci.head()

#### The `.info()` and `.describe()` methods and the `.dtypes` attribute

Pandas DataFrames have many useful methods and attributes! Let's look at `.info()`, `.describe()`, and `.dtypes`.

In [None]:
# Call the .info() method on our dataset. What do you observe?

uci.info()

In [None]:
# Call the .describe() method on our dataset. What do you observe?

uci.describe()

In [None]:
# Run the code below. How does the output differ from .info()?
uci.dtypes

#### `.mean()`, `.min()`, `.max()`, `.sum()`

The methods `.mean()`, `.min()`, `.max()`, and `.sum()` will perform just the way you think they will!

Note that these are methods both for Series and for DataFrames.
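
To make the Series versus DataFrame distinction concrete, here is a minimal sketch on a hypothetical toy frame (the values are made up and are not taken from the heart dataset). Called on a single column, `.mean()` returns one number; called on the whole DataFrame, it returns a Series with one entry per column.

```python
import pandas as pd

# Hypothetical toy data, not the heart dataset
toy = pd.DataFrame({"age": [63, 37, 41], "chol": [233, 250, 204]})

# On a Series (one column), .mean() returns a single number
age_mean = toy["age"].mean()

# On a DataFrame, .mean() returns a Series: one value per column
col_means = toy.mean()

# .max() and .min() follow the same pattern
chol_max = toy["chol"].max()

print(age_mean)
print(col_means)
print(chol_max)
```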

In [None]:
uci.ca.mean()

In [None]:
uci.mean()

### Apply to Animal Shelter Data
Using `.info()`, `.describe()`, `.value_counts()`, and `.dtypes`, what observations can we make about the data?

What breed of dog is the most prevalent in the intakes dataset?

What is the age of the oldest animal in the outcomes dataset?

In [None]:
outcomes = pd.read_pickle('./outcomes.pkl')
intakes = pd.read_pickle('./intakes.pkl')

In [None]:
# your code here

## Methods for Re-Organizing DataFrames
#### `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. Pandas has this, too.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.
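
One thing worth knowing: calling `.groupby()` on its own does not compute anything yet. It returns a grouped object that waits for an aggregate function. The sketch below illustrates this on a hypothetical toy frame (made-up values, not the heart data).

```python
import pandas as pd

# Hypothetical toy data with a grouping column and a numeric column
toy = pd.DataFrame({"sex": [1, 0, 1, 0],
                    "chol": [200, 240, 220, 260]})

# groupby alone returns a lazy DataFrameGroupBy object, not a table
grouped = toy.groupby("sex")
print(type(grouped).__name__)

# Applying an aggregate function produces one result per group
mean_chol = grouped["chol"].mean()
print(mean_chol)
```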

In [None]:
uci.groupby('sex')

#### `.groups` and `.get_group()`

The `.groups` attribute and the `.get_group()` method can help you identify which row indices belong to which group and select just the rows of a given group.
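
As a minimal sketch on a hypothetical toy frame (made-up values), `.groups` maps each group key to the index labels of its rows, and `.get_group()` returns the rows for one key as an ordinary DataFrame.

```python
import pandas as pd

# Hypothetical toy data; the default index is 0, 1, 2, 3
toy = pd.DataFrame({"sex": [1, 0, 1, 0],
                    "age": [63, 37, 41, 56]})
grouped = toy.groupby("sex")

# .groups is dict-like: group key -> index labels of the rows in that group
group_labels = {key: list(idx) for key, idx in grouped.groups.items()}
print(group_labels)

# .get_group(key) pulls out just the rows belonging to that key
group_zero = grouped.get_group(0)
print(group_zero)
```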

In [None]:
uci.groupby('sex').groups

In [None]:
uci.groupby('sex').get_group(0) # .tail()

### Aggregating

Once we have our groups we can then calculate statistics on each group.  Below we are looking for the standard deviation of each variable split apart by sex.
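
If you want several statistics at once rather than a single one, `.agg()` is one option. The sketch below uses named aggregation on a hypothetical toy frame; the output column names `chol_mean` and `age_max` are chosen here purely for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the heart dataset
toy = pd.DataFrame({"sex": [1, 0, 1, 0],
                    "chol": [200, 240, 220, 260],
                    "age": [63, 37, 41, 56]})

# Named aggregation: new_column=(source_column, function)
stats = toy.groupby("sex").agg(
    chol_mean=("chol", "mean"),
    age_max=("age", "max"),
)
print(stats)
```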

In [None]:
uci.groupby('sex').std()

Exercise: Tell me the average cholesterol level for those with heart disease.

In [None]:
# Your code here!


### Your turn: Use groupby methods to examine the animal shelter data

In your group, complete tasks 1 and 2.

#### Task 1
- Use a groupby to show the average age of the different kinds of animal types in the outcomes dataset.
- What about by animal types **and** gender?
 

In [None]:
# your code here

#### Task 2
- Create new columns `year` and `month` by applying lambda functions (for example, `lambda x: x.year`) to the date of outcome
- Use `groupby` and `.size()` to tell me how many animals are adopted by month using the outcomes dataset
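
If the lambda approach is unfamiliar, the sketch below shows the pattern on a hypothetical toy frame. The column names `outcome_date` and `outcome_type` and the value `"Adoption"` are assumptions for illustration and may not match the real outcomes data, so check the actual columns before adapting it.

```python
import pandas as pd

# Hypothetical toy data; column names and values are assumptions
toy = pd.DataFrame({
    "outcome_date": pd.to_datetime(["2019-01-15", "2019-01-20", "2019-02-03"]),
    "outcome_type": ["Adoption", "Transfer", "Adoption"],
})

# A lambda applied to each timestamp extracts one component
toy["year"] = toy["outcome_date"].apply(lambda x: x.year)

# The .dt accessor is an equivalent, vectorized alternative
toy["month"] = toy["outcome_date"].dt.month

# Filter to one outcome type, then count rows per month with .size()
adoptions = toy[toy["outcome_type"] == "Adoption"]
per_month = adoptions.groupby("month").size()
print(per_month)
```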

In [None]:
# Your code here

## Reshaping a DataFrame

### `.pivot()`

Those of you familiar with Excel have probably used Pivot Tables. Pandas has similar functionality.
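
Here is a minimal sketch of `.pivot()` on a hypothetical toy frame in long format (made-up values). One column supplies the new row index, one supplies the new column headers, and one supplies the cell values. Note that `.pivot()` raises an error if an index/column pair appears more than once; `pd.pivot_table()` is the usual alternative in that case because it aggregates duplicates.

```python
import pandas as pd

# Hypothetical long-format data: one row per (patient, visit) pair
toy = pd.DataFrame({
    "patient": ["a", "a", "b", "b"],
    "visit": [1, 2, 1, 2],
    "chol": [200, 210, 240, 230],
})

# Rows come from 'patient', columns from 'visit', cells from 'chol'
wide = toy.pivot(index="patient", columns="visit", values="chol")
print(wide)
```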

In [None]:
uci.pivot(values = 'age', columns = 'target')

### Methods for Combining and Reshaping DataFrames: `.join()`, `.merge()`, `pd.concat()`, `pd.melt()`

See documentation from [pandas](https://pandas.pydata.org/docs/user_guide/merging.html)

### `.join()`

In [None]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns = ['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns = ['age', 'HP'])

In [None]:
toy1.join(toy2.set_index('age'),
          on = 'age',
          lsuffix = '_A',
          rsuffix = '_B').head()

### `.merge()`

The pandas `.merge()` function is very similar to `.join()` but offers some further versatility (at the cost of requiring more detailed inputs).  This method may be preferred when you don't want to join the dataframes on the index.
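
The sketch below shows a merge on columns with different names, using hypothetical toy frames (the names and values are made up). It also contrasts `how='inner'`, which keeps only matching rows, with `how='left'`, which keeps every row from the left frame and fills the gaps with `NaN`.

```python
import pandas as pd

# Hypothetical toy frames; 'ZZ' deliberately has no match on the right
people = pd.DataFrame({"name": ["Ann", "Bo", "Cy"],
                       "home_state": ["TX", "WA", "ZZ"]})
regions = pd.DataFrame({"state": ["TX", "WA"],
                        "region": ["South", "West"]})

# Inner merge: only rows whose key appears in both frames
inner = people.merge(regions, left_on="home_state", right_on="state", how="inner")

# Left merge: all rows from 'people'; unmatched rows get NaN on the right side
left = people.merge(regions, left_on="home_state", right_on="state", how="left")

print(inner)
print(left)
```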

In [None]:
ds_chars = pd.read_csv('ds_chars.csv', index_col = 0)

In [None]:
states = pd.read_csv('states.csv', index_col = 0)

In [None]:
ds_chars.merge(states,
               left_on='home_state',
               right_on = 'state',
               how = 'inner')

### `pd.concat()`

Exercise: Look up the documentation on `pd.concat` (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) and use it to concatenate `ds_chars` and `states`.
<br/>
Your result should still have only five rows!
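
As a hint, `pd.concat` takes an `axis` parameter that controls the direction of concatenation. The sketch below uses hypothetical toy frames that share an index; whether the same call gives the desired result for `ds_chars` and `states` depends on their indices, which you should inspect first.

```python
import pandas as pd

# Hypothetical toy frames that share the same index labels
a = pd.DataFrame({"name": ["Ann", "Bo"]}, index=[0, 1])
b = pd.DataFrame({"state": ["TX", "WA"]}, index=[0, 1])

# Default axis=0 stacks rows; columns missing from a frame are filled with NaN
stacked = pd.concat([a, b])

# axis=1 places the frames side by side, aligning rows on the index
side_by_side = pd.concat([a, b], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```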

In [None]:
pd.concat([ds_chars, states])

### `pd.melt()`

Melting reshapes a DataFrame from wide to long format: the columns you choose are unpivoted into a pair of columns, 'variable' (the former column name) and 'value' (the cell contents).
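
A minimal sketch on a hypothetical toy frame (made-up values) shows the effect: each `id_vars` row is repeated once per melted column, so two rows and two melted columns become four rows.

```python
import pandas as pd

# Hypothetical wide-format data
wide = pd.DataFrame({"name": ["Ann", "Bo"],
                     "HP": [142, 47],
                     "home_state": ["TX", "WA"]})

# 'name' identifies each row; 'HP' and 'home_state' are unpivoted
long = pd.melt(wide, id_vars=["name"], value_vars=["HP", "home_state"])
print(long)
```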

In [None]:
ds_chars

In [None]:
pd.melt(ds_chars,
        id_vars=['name'],
        value_vars=['HP', 'home_state'])

## Your turn!

Merge the intakes and outcomes datasets to answer the question __How long is the average stay for animals in this shelter?__

- What data do we need to solve this question?
- What columns from which dataset?

In [None]:
# your code here

#### What if we wanted to know if the mean days in the shelter differs based on animal type?

In [None]:
# your code here

## BONUS question! Medical needs

1. How many animals come in injured? And what happens to them?
2. How many animals come in and get neutered during their stay?


In [None]:
# your code here