# Cleaning, Analysis, Visualization Walkthrough

[The Austin Animal Shelter Intakes and Outcomes Dataset](https://www.kaggle.com/datasets/aaronschlegel/austin-animal-center-shelter-intakes-and-outcomes)

[Reference: Dan Poynor](https://github.com/danpoynor/pet-shelter-data-analysis-notebook)

In [None]:
# import the things
import pandas as pd
import os

In [None]:
# read csvs from path and create data frames
intakes= r'/Users/brandanscully/Documents/GitHub/DATA_510/aac_intakes.csv'
outcomes= r'/Users/brandanscully/Documents/GitHub/DATA_510/aac_outcomes.csv'

df_i = pd.read_csv(intakes)
df_o = pd.read_csv(outcomes)

In [None]:
# inspect intakes dataframe
df_i.head()

Note that datetime and datetime2 are both datetime fields and appear to be redundant.

In [None]:
#inspect outcomes data frame
df_o.head()

Note that datetime and monthyear are both datetime fields and appear to be redundant.

Note that date_of_birth appears to use datetime format when date would suffice.

In [None]:
#inspect df dimensions
df_i.shape, df_o.shape

Note that the data frames have different numbers of records but the same number of columns.

In [None]:
#deduplicate rows and check the resulting lengths
df_i.drop_duplicates(keep='first', inplace=True)
df_o.drop_duplicates(keep='first', inplace=True)

len(df_i), len(df_o)

That removed 26 rows from df_i and 10 rows from df_o.
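
As a minimal sketch of how that row count could be confirmed before dropping anything, `duplicated()` can tally the exact-duplicate rows first. The toy frame and its values below are made up for illustration and only stand in for df_i.

```python
import pandas as pd

# Toy frame standing in for df_i; the values are invented for illustration.
toy = pd.DataFrame({
    "animal_id": ["A1", "A1", "A2", "A3", "A3"],
    "name": ["Rex", "Rex", "Tom", "Kit", "Kit"],
    "intake_type": ["Stray", "Stray", "Stray", "Stray", "Owner Surrender"],
})

# duplicated() flags every repeat of an earlier, fully identical row.
n_dupes = int(toy.duplicated(keep="first").sum())

# Assignment form of the dedup; reset_index tidies the row labels afterwards.
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)

print(n_dupes, len(toy), len(deduped))
```

Rows that share an animal_id but differ in any other column are not counted as duplicates, which is why repeat visits survive this step.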

In [None]:
#inspect resulting df_i
df_i.info()

In [None]:
#inspect resulting df_o
df_o.info()

In [None]:
#these are the columns that are in df_i but not df_o
set(df_i.columns) - set(df_o.columns)

In [None]:
#these are the columns that are in df_o but not df_i
set(df_o.columns) - set(df_i.columns)

In [None]:
#inspect df_i statistics
df_i.describe()

Note that there are 72,365 unique animal_id values among 80,187 records.

Note datetime and datetime2 appear to be redundant.

In [None]:
#inspect df_o statistics
df_o.describe()

Note that there are 72,877 unique animal_id values among 80,681 records. Compare to 72,365 unique values in df_i.

Note datetime and monthyear fields appear to be redundant.

We'll address the redundant columns using Pandas' [transpose](https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.DataFrame.T.html) method.

We first use .T method to transpose each data frame.
This turns rows to columns.
We then drop_duplicates as before, and .T again to undo our original .T

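I originally tried this with the inplace=True argument and got an error related to None values. That is expected: with inplace=True, drop_duplicates returns None, so the trailing .T has nothing to operate on. It worked with variable assignment.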

In [None]:
df_i = df_i.T.drop_duplicates().T
df_o = df_o.T.drop_duplicates().T

df_i.shape[1], df_o.shape[1]

Each data frame lost one column.
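
One side effect worth knowing about: a round trip through `.T` tends to leave every column with the object dtype. The sketch below, on an invented toy frame, compares the transpose approach with a boolean-mask alternative that keeps the original dtypes. The names `via_transpose` and `via_mask` are hypothetical.

```python
import pandas as pd

# Toy frame with one redundant column; the values are invented.
toy = pd.DataFrame({
    "datetime": ["2017-01-01", "2017-02-01"],
    "datetime2": ["2017-01-01", "2017-02-01"],  # exact copy of "datetime"
    "count": [1, 2],
})

# Approach used above: transpose, drop duplicate rows, transpose back.
via_transpose = toy.T.drop_duplicates().T

# Alternative: flag duplicated columns, then select the others.
mask = toy.T.duplicated()
via_mask = toy.loc[:, ~mask]

print(via_transpose.dtypes)
print(via_mask.dtypes)
```

Both versions drop the same column; the mask version simply avoids re-typing the survivors, which may save a conversion step later.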

Let's check whether each animal_id maps to exactly one name (it should).

In [None]:
"""
We're going to use .groupby to give us an index of 'animal_id'.
We're then going to call the 'name' field and generate a list of 
unique names for the index.
Once we have a list of names for each 'animal_id', 
we will apply len() to the list.
This will result in a series of unique names per 'animal_id'.
We'll sort that to see if there are more than one name per id.
"""
df_i.groupby(['animal_id'])['name'].unique().apply(lambda x:len(x)).sort_values()

In [None]:
#same deal for df_o
df_o.groupby(['animal_id'])['name'].unique().apply(lambda x:len(x)).sort_values()

Looks good. Let's merge!

In [None]:
#df_i becomes left

df_i_o = df_i.merge(
    df_o,
    on=['name', 'animal_id', 'animal_type', 'breed', 'color'],
    suffixes=('_intake', '_outcome')
)

In [None]:
#inspect the resulting dataframe
df_i_o.info()

Initially, each data frame used 8.0+MB of memory. Combined memory is down to 13.7+MB (savings!).

We have some datetime fields that are being stored as objects, so we'll need to convert those if we want to use them.

It looks like there are some null values primarily in names and outcome_subtypes. Also outcome_type, sex_upon_intake, age_upon_outcome.

Let's take a look at those.
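
Before inspecting individual rows, the null counts can be tallied per column. This is a small sketch on a made-up frame; the column names mirror the merged data, but the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_i_o; the values are invented.
toy = pd.DataFrame({
    "name": ["Rex", np.nan, "Tom", np.nan],
    "outcome_type": ["Adoption", "Transfer", np.nan, "Adoption"],
    "outcome_subtype": [np.nan, "Partner", np.nan, np.nan],
})

# Count of nulls per column, largest first.
null_counts = toy.isna().sum().sort_values(ascending=False)

# Same information as a percentage of rows.
null_share = (toy.isna().mean() * 100).round(1)

print(null_counts)
print(null_share)
```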

In [None]:
#start with NaN names
df_i_o[df_i_o.name.isna()].head()

In [None]:
#let's set NaN names to Unknown.
df_i_o['name'] = df_i_o['name'].fillna('Unknown')

In [None]:
"""
on to outcome_subtype.
we'll use a sample of 10 here instead of head.
This should let us see a cross section of possible outcomes 
and their subtypes.
"""
df_i_o[df_i_o.outcome_subtype.isna()].sample(10)

It seems the 'Return to Owner' and 'Adoption' outcomes have no associated subtypes. This is fine. Note it makes sense for the outcome_type to come before outcome_subtype.

Let's take a look at the outcome_type using the same groupby approach we used for animal_id to make sure we're not missing anything.

In [None]:
df_o.groupby(['outcome_type'])['outcome_subtype'].unique()

The sample approach missed "Died", "Disposal", "Relocate", "Rto-Adopt", "Missing", and possibly "Euthanasia".

There's another way to approach this problem.

In [None]:
df_i_o[df_i_o.outcome_subtype.isna()].outcome_type.unique()

This makes it look like there are some NaN outcome_type values.

We should look at these. Let's start with shape.

In [None]:
df_i_o[df_i_o.outcome_type.isna()].shape

Small enough that Jupyter should show us the whole table.

In [None]:
df_i_o[df_i_o.outcome_type.isna()]

Whoa. More than cats and dogs, but the animal types are being saved as breeds with animal_type 'Other'.

Note some * in the name field. That field could use some cleaning.

Let's see how many animal types are in the data set.

In [None]:
df_i_o[df_i_o.animal_type=='Other'].groupby(['breed'])['name'].count().sort_values(ascending=False)

Ok. 97 unique breeds of "Other" animal_types.

Bats, raccoons, and rabbits seem to be the most popular.

It looks like there is some ambiguity in the breed descriptions.

In [None]:
#for completeness
df_i_o['animal_type'].unique()

We learned a few things here.

* The name field could use some cleaning
* Some outcome_types have no associated outcome_subtype.
* Some outcome_types clustered in animal_type='Other' are NaN and appear to have NaN outcome_subtype.
* There's some ambiguity in the breed field for animal_type='Other'
* The rest of the breed field is probably worth investigating/cleaning.
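
For the first bullet, one possible cleanup of the name field could be sketched as follows. It assumes the asterisks carry no meaning we need to keep, which is worth confirming before applying it to the real data. The sample names are invented.

```python
import numpy as np
import pandas as pd

# Invented sample of messy names.
names = pd.Series(["*Bella", "Max", " *Luna ", np.nan], name="name")

cleaned = (
    names
    .str.replace("*", "", regex=False)  # literal asterisk, not a regex
    .str.strip()                        # trim leftover whitespace
    .fillna("Unknown")                  # same placeholder used earlier
)

print(cleaned.tolist())
```

If the asterisk turns out to be a flag (for example, a name assigned by shelter staff), it may be better to move it into its own boolean column rather than discard it.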

# Let's discuss how we should handle these.

In [None]:
#Fix the data here.

Let's engineer some features, starting with stay_duration.

First, we're going to need to convert datetimes stored as objects to datetimes. 

Here's the [pandas datetime docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html)

Here's a [datetime reference](https://towardsdatascience.com/10-tricks-for-converting-numbers-and-strings-to-datetime-in-pandas-82a4645fc23d).
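
One common route for this kind of conversion is `pd.to_datetime`. The sketch below uses invented strings and shows how `errors="coerce"` turns unparseable values into NaT instead of raising.

```python
import pandas as pd

# Invented date strings, including one that cannot be parsed.
raw = pd.Series(["2017-07-26 00:11:00", "2016-04-14 18:43:00", "not a date"])

# errors="coerce" replaces unparseable strings with NaT.
parsed = pd.to_datetime(raw, errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())
```

Counting the NaT values afterwards gives a quick measure of how many rows failed to parse.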

In [None]:
df_i_o = df_i_o.astype({
    'datetime_intake': 'datetime64[ns]',
    'date_of_birth': 'datetime64[ns]',
    'datetime_outcome': 'datetime64[ns]'
})
df_i_o.dtypes

In [None]:
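#That worked. Let's look at our time data
# (datetime_is_numeric was removed in pandas 2.0, so select the datetime columns directly)
df_i_o[['datetime_intake', 'date_of_birth', 'datetime_outcome']].describe()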

In [None]:
df_i_o['stay_days']=(df_i_o['datetime_outcome']-df_i_o['datetime_intake']).dt.days

In [None]:
# Let's look at dtypes
df_i_o.dtypes

In [None]:
#let's look at the column we created
df_i_o['stay_days'].describe()

Clearly something's not right, because we have a negative time value.

Also, one animal has been there over 4 years!

Let's investigate.

In [None]:
df_i_o[df_i_o['stay_days'] < 0][['stay_days','datetime_intake','datetime_outcome']]

This is a substantial number of records. 

# Let's discuss causes and alternative solutions.
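
One plausible cause, offered as a hypothesis rather than a confirmed diagnosis: there are fewer unique animal_id values than records, so some animals visit more than once, and a merge on animal attributes alone pairs every intake with every outcome for that animal. The toy sketch below reproduces the effect and tries one hypothetical fix, a per-animal `visit` counter. That column is not in the original data.

```python
import pandas as pd

# Toy data: animal A1 visits the shelter twice; all values are invented.
intakes = pd.DataFrame({
    "animal_id": ["A1", "A1"],
    "datetime": pd.to_datetime(["2015-01-01", "2016-06-01"]),
})
outcomes = pd.DataFrame({
    "animal_id": ["A1", "A1"],
    "datetime": pd.to_datetime(["2015-01-10", "2016-06-05"]),
})

# Merging on animal_id alone pairs every intake with every outcome.
crossed = intakes.merge(outcomes, on="animal_id", suffixes=("_intake", "_outcome"))
crossed["stay_days"] = (crossed["datetime_outcome"] - crossed["datetime_intake"]).dt.days

# Hypothetical fix: number each visit chronologically per animal.
intakes = intakes.sort_values(["animal_id", "datetime"])
intakes["visit"] = intakes.groupby("animal_id").cumcount() + 1
outcomes = outcomes.sort_values(["animal_id", "datetime"])
outcomes["visit"] = outcomes.groupby("animal_id").cumcount() + 1

# Merge on id plus visit number so each intake meets only its own outcome.
paired = intakes.merge(outcomes, on=["animal_id", "visit"], suffixes=("_intake", "_outcome"))
paired["stay_days"] = (paired["datetime_outcome"] - paired["datetime_intake"]).dt.days

print(crossed[["datetime_intake", "datetime_outcome", "stay_days"]])
print(paired[["datetime_intake", "datetime_outcome", "stay_days"]])
```

The visit counter assumes every intake has exactly one matching outcome in chronological order. Animals still in the shelter, or records missing from either file, would break that assumption and need separate handling.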

In [None]:
df_i_o[df_i_o['stay_days'] >= 1640]

I refuse to believe an 11 month old lab puppy took 4 years to get adopted!

Speaking of the "age_upon_intake" and "age_upon_outcome" fields...

They contain timedelta-like information, e.g. 11 months. 

Let's discuss how we can convert them to a duration.

In [None]:
# One way that preserves the content of the data
# 1: Figure out what the durations are.
durations = df_i_o['age_upon_intake'].apply(lambda x: x.split()).apply(lambda x: x[1]).unique()
durations

In [None]:
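# 2: create a dictionary of duration multipliers
# NOTE: dur_days is matched to durations by position, so its order
# must follow the order of the units printed in step 1.
dur_days = [365, 30, 7, 30, 365, 1, 1, 7]
dur_mult = dict(zip(durations, dur_days))
dur_mult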

In [None]:
# Create a column to hold the product of the duration scalar * duration days
df_i_o['intake_age_days'] = df_i_o['age_upon_intake'].apply(lambda x: x.split()).apply(lambda x: pd.to_timedelta(int(x[0])*dur_mult[x[1]], unit='D'))

In [None]:
df_i_o[['intake_age_days', 'age_upon_intake']]

In [None]:
#check the dtypes
df_i_o[['intake_age_days', 'age_upon_intake']].dtypes

What else could we try?
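
One alternative sketch keys the multipliers by unit name instead of by position, so the result does not depend on the order `unique()` happens to return. The unit spellings in `DAYS_PER_UNIT` are an assumption and should be checked against the `durations` array. The helper `age_to_timedelta` is hypothetical, and months and years are approximated as 30 and 365 days, as above.

```python
import pandas as pd

# Assumed unit spellings (singular); plurals are folded in below.
DAYS_PER_UNIT = {"day": 1, "week": 7, "month": 30, "year": 365}

def age_to_timedelta(text):
    """Convert a string like '11 months' to an approximate Timedelta."""
    count, unit = text.split()
    unit = unit.rstrip("s")  # 'months' -> 'month', 'days' -> 'day'
    return pd.Timedelta(days=int(count) * DAYS_PER_UNIT[unit])

# Invented sample values in the same shape as age_upon_intake.
ages = pd.Series(["11 months", "1 year", "3 weeks", "2 days"])
age_days = ages.apply(age_to_timedelta)

print(age_days.dt.days.tolist())
```

An unexpected unit raises a KeyError here, which is arguably preferable to silently applying the wrong multiplier.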

Let's add some date helper columns.

In [None]:
df_i_o['datetime_intake_year'] = df_i_o['datetime_intake'].dt.year
df_i_o['datetime_intake_month'] = df_i_o['datetime_intake'].dt.month
df_i_o['intake_year_month'] = pd.to_datetime(df_i_o['datetime_intake']).dt.to_period('M')
# Check the result
df_i_o.head()

In [None]:
monthly = df_i_o.groupby('datetime_intake_month').size().sort_index()
# Plot the findings to make the months with higher and lower intakes more obvious.
chrt = monthly.plot(kind='line', figsize=(12, 6), color="#0d47a1", use_index=True, lw=3)
chrt.set_title('Animal Intake Per Month', fontsize=15, fontweight="bold")
chrt.set_xlabel('Month', fontsize=16)
chrt.set_ylabel('Number of Intakes', fontsize=15)
chrt.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]) # Need to avoid FixedFormatter warning
xlbl_mos = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sept", "Oct", "Nov", "Dec"]
chrt.set_xticklabels(xlbl_mos)
chrt.tick_params(colors='#1976d2', which='both', direction='inout', length=6, width="3", labelsize="12")
chrt.grid(True, ls="dashed", lw=".75")
chrt.set_facecolor('#e3f2fd')

In [None]:
monthly_adoptions = df_i_o[df_i_o['outcome_type']=='Adoption'].groupby('datetime_intake_month').size().sort_index()
monthly_adoption_rate = (monthly_adoptions/monthly)*100
# Plot the findings to make the months with higher and lower adoption rates more obvious.
chrt2 = monthly_adoption_rate.plot(kind='line', figsize=(12, 6), color="#0d47a1", use_index=True, lw=3)
chrt2.set_title('Monthly Adoption Rate (%)', fontsize=15, fontweight="bold")
chrt2.set_xlabel('Month', fontsize=16)
chrt2.set_ylabel('Adoption Rate (%)', fontsize=15)
chrt2.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]) # Need to avoid FixedFormatter warning
chrt2.set_xticklabels(xlbl_mos)
chrt2.tick_params(colors='#1976d2', which='both', direction='inout', length=6, width="3", labelsize="12")
chrt2.grid(True, ls="dashed", lw=".75")
chrt2.set_facecolor('#e3f2fd')

This next example brought to you by sheer force of will and Narragansett.

Let's look at the distribution of outcome types by month using an area plot.

In [None]:
"""First, we're going to need the numerator for every month, 
which is the number of each type of outcome"""
df_i_o.groupby(['datetime_intake_month','outcome_type'])['outcome_type'].count()

In [None]:
#the denominator will be the total number of outcomes that month.
df_i_o.groupby(['datetime_intake_month','outcome_type'])['outcome_type'].count().groupby(level=[0]).sum()

In [None]:
#put them together
outcome_counts = df_i_o.groupby(['datetime_intake_month','outcome_type'])['outcome_type'].count()
monthly_outcomes = outcome_counts/outcome_counts.groupby(level=[0]).sum()
monthly_outcomes

In [None]:
#wizardry
m_o_unstack = monthly_outcomes.unstack().fillna(0)
m_o_unstack

Let's check that we did the math right.

# Every month should add to ...?


In [None]:
# sum the outcome shares across each row (month)
m_o_unstack.sum(axis=1)
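
As a hedged alternative to the groupby, divide, and unstack steps, `pd.crosstab` with `normalize="index"` can build a table of the same shape in one call. The toy frame below is invented and only reuses the column names from df_i_o.

```python
import pandas as pd

# Toy frame reusing the real column names; the values are invented.
toy = pd.DataFrame({
    "datetime_intake_month": [1, 1, 1, 2, 2],
    "outcome_type": ["Adoption", "Adoption", "Transfer", "Adoption", "Euthanasia"],
})

# normalize="index" divides each row by its own total.
shares = pd.crosstab(
    toy["datetime_intake_month"],
    toy["outcome_type"],
    normalize="index",
)

print(shares)
print(shares.sum(axis=1))
```

Missing combinations come back as 0 without a separate `fillna`, and the row totals can be checked the same way as above.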

In [None]:
#pandas default behavior
m_o_unstack.plot.area()

In [None]:
# let's gussy that up
chrt3 = m_o_unstack.plot(kind='area', figsize=(12, 6))
chrt3.set_title('Monthly Outcomes', fontsize=15, fontweight="bold")
chrt3.set_xlabel('Month', fontsize=16)
chrt3.set_ylabel('Outcome ratio', fontsize=15)
chrt3.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]) # Need to avoid FixedFormatter warning
chrt3.set_xticklabels(xlbl_mos)
chrt3.tick_params(colors='#1976d2', which='both', direction='inout', length=6, width="3", labelsize="12")

# Let's think about storage.

In [None]:
df_i_o.info()

The data came to us as relational tables for intakes and outcomes. It went from ~16MB of memory to ~14MB of memory after some deduplication, merging, and cleaning. Then we engineered some features.

It's back over 18MB, but to be fair we didn't delete some redundant/unnecessary columns.

# How should we break this down for storage?

List the things:
* delete 'age_upon_intake'...
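
Beyond deleting columns, one option to weigh is the `category` dtype for low-cardinality text columns. This sketch measures the difference on an invented frame; which real columns would benefit is an assumption to test with `memory_usage(deep=True)`.

```python
import pandas as pd

# Invented frame with a few repeated labels, similar in spirit to
# animal_type and outcome_type.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Cat", "Dog", "Other"] * 2500,
    "outcome_type": ["Adoption", "Transfer", "Adoption", "Euthanasia"] * 2500,
})

# deep=True counts the string contents, not just the pointers.
before = toy.memory_usage(deep=True).sum()

compact = toy.astype({"animal_type": "category", "outcome_type": "category"})
after = compact.memory_usage(deep=True).sum()

print(before, after)
```

Category columns store each distinct label once plus a small integer code per row, so the savings grow with the number of repeats.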

# For Funsies

In teams of 2, let's work some analysis problems and then visualize them.

1. What is the average stay duration by animal type?
2. What is the average stay duration by age for dogs and cats?
3. What is the average stay duration by 5 most common dog breeds?
4. What is the most common outcome by age class?
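
As a starting point for the first question, here is a minimal sketch on invented numbers. It only shows the groupby pattern; the real answer depends on how the negative and multi-year stay_days values are handled first.

```python
import pandas as pd

# Invented stays; the real data would come from df_i_o.
toy = pd.DataFrame({
    "animal_type": ["Dog", "Dog", "Cat", "Cat", "Other"],
    "stay_days": [4, 10, 30, 20, 1],
})

# Mean stay per animal type, longest first.
avg_stay = (
    toy.groupby("animal_type")["stay_days"]
    .mean()
    .sort_values(ascending=False)
)

print(avg_stay)
# avg_stay.plot(kind="barh") would chart it, assuming matplotlib is installed.
```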

[pandas docs: visualization](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html?highlight=str%20split)