# All Pandas All The Time

Pandas is a library we're going to be using pretty much every day in this course, so we're going to do a ton of practice so you can be on your way to becoming a _PANDAS MASTER_.

![Kung fu panda excited](https://data.whicdn.com/images/201331793/original.gif)

Let's continue with the data from the Austin Animal Shelter. 

Data source: [intakes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and [outcomes data](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238).

Once again, we'll start with the intake data, which describes the animals as they enter the shelter.

In [1]:
# Imports! Can't use pandas unless we bring it into our notebook
import pandas as pd

In [2]:
# Grab the data, naming the dataframe 'intakes' this time
# Don't forget to read in DateTime as a datetime column
intakes = pd.read_csv("data/Austin_Animal_Center_Intakes.csv", 
                      parse_dates=['DateTime'])

In [3]:
# Check out the first few rows
intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,2019-01-03 16:19:00,01/03/2019 04:19:00 PM,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,07/05/2015 12:59:00 PM,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,04/14/2016 06:43:00 PM,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,2013-10-21 07:59:00,10/21/2013 07:59:00 AM,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,06/29/2014 10:38:00 AM,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [4]:
# Check information on the dataframe
intakes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120403 entries, 0 to 120402
Data columns (total 12 columns):
Animal ID           120403 non-null object
Name                82404 non-null object
DateTime            120403 non-null datetime64[ns]
MonthYear           120403 non-null object
Found Location      120403 non-null object
Intake Type         120403 non-null object
Intake Condition    120403 non-null object
Animal Type         120403 non-null object
Sex upon Intake     120402 non-null object
Age upon Intake     120403 non-null object
Breed               120403 non-null object
Color               120403 non-null object
dtypes: datetime64[ns](1), object(11)
memory usage: 11.0+ MB


Let's do some of the transformations we did yesterday: dropping the MonthYear column, and changing column names to be lowercase without spaces.

In [5]:
# Drop MonthYear
intakes = intakes.drop(columns='MonthYear')

In [6]:
# Rename columns
intakes = intakes.rename(columns = lambda x: x.replace(" ", "_").lower())

In [7]:
# Sanity check
intakes.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


## Dealing with Null Data

It is a fact of the data science life - you will always be surrounded by 'dirty' data. What does it mean for data to be 'dirty'? What are some of the various ways that data can be 'dirty'?

- missing values / blanks / NaNs / nulls
- nonsense data - stuff that doesn't make sense in context (negatives, '9999', any other 'default' value that doesn't convey meaning)
- repeated / duplicate values

In [8]:
# Check for null values recognized by pandas as blank
intakes.isna()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False
3,False,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...
120398,False,False,False,False,False,False,False,False,False,False,False
120399,False,False,False,False,False,False,False,False,False,False,False
120400,False,False,False,False,False,False,False,False,False,False,False
120401,False,False,False,False,False,False,False,False,False,False,False


In [9]:
# Code here for a more helpful null check
intakes.isna().sum()

animal_id               0
name                37999
datetime                0
found_location          0
intake_type             0
intake_condition        0
animal_type             0
sex_upon_intake         1
age_upon_intake         0
breed                   0
color                   0
dtype: int64

There is no one way to deal with null values. What are some of the strategies we can use to deal with them?

- change them to a measure of central tendency - mode, median, mean
- change them to a nonsense number, so you can still see what was null
- drop them
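
As a rough sketch of those three strategies, here they are on a tiny made-up dataframe. The `toy` frame, its `weight` column, and the `-999` placeholder are all hypothetical and only there for illustration; they are not part of the shelter data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the shelter dataset
toy = pd.DataFrame({"weight": [10.0, np.nan, 14.0, 30.0]})

# Strategy 1: fill with a measure of central tendency (here, the median)
filled_median = toy["weight"].fillna(toy["weight"].median())

# Strategy 2: fill with an obvious placeholder so the former nulls stay visible
filled_flag = toy["weight"].fillna(-999)

# Strategy 3: drop the rows that contain the null
dropped = toy.dropna(subset=["weight"])
```

Which one is appropriate depends on the column and on what you plan to do with it afterwards.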


How, in Pandas, can we fill null values recognized by Pandas as null? Let's practice by filling nulls for the Name column with some placeholder value, like 'No name'.

Helpful link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html

In [10]:
# Code here to fill nulls in the Name column
# Assign the result back rather than calling fillna(inplace=True) on a column selection
intakes['name'] = intakes['name'].fillna("No name")

Now let's check for nulls again...

In [11]:
# Sanity check
intakes.isna().sum()

animal_id           0
name                0
datetime            0
found_location      0
intake_type         0
intake_condition    0
animal_type         0
sex_upon_intake     1
age_upon_intake     0
breed               0
color               0
dtype: int64

Let's try a different strategy for the one lonely null in the 'Sex upon Intake' column - let's just drop that row, since it's only one observation.

Helpful link: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

In [12]:
# Code here to drop the whole row where Sex upon Intake is null
intakes = intakes.dropna(subset=['sex_upon_intake'])

In [13]:
# Copy/paste code from above to re-check for nulls
intakes.isna().sum()

animal_id           0
name                0
datetime            0
found_location      0
intake_type         0
intake_condition    0
animal_type         0
sex_upon_intake     0
age_upon_intake     0
breed               0
color               0
dtype: int64

How do we find sneaky null values that aren't marked by Pandas as null?

In [14]:
# Run this cell without changes
intakes['age_upon_intake'].value_counts()

1 year       21211
2 years      18182
1 month      11578
3 years       7247
2 months      6521
4 years       4360
4 weeks       4316
5 years       3957
3 weeks       3529
3 months      3168
4 months      3119
5 months      2989
6 years       2646
2 weeks       2448
6 months      2297
7 years       2274
8 years       2211
7 months      1807
9 months      1791
10 years      1766
8 months      1445
9 years       1286
1 week        1005
10 months      964
1 weeks        866
12 years       854
11 months      773
0 years        748
11 years       725
1 day          622
3 days         562
13 years       552
2 days         463
14 years       374
15 years       323
4 days         322
5 weeks        305
6 days         299
5 days         180
16 years       135
17 years        78
18 years        47
19 years        26
20 years        17
22 years         5
-1 years         4
21 years         1
-3 years         1
23 years         1
25 years         1
24 years         1
Name: age_upon_intake, dtype: int64

Analyze the values you're finding in the 'Age upon Intake' column. What doesn't quite fit here?

**Note:** using `.value_counts()` is just one way to look at the values of a column. In this case, it works because we can see which values are the most common, and it's verbose enough to show even the less common values that might be problematic.

So - how do we want to deal with the data in here that doesn't make sense?

- There are several options, but the first step is recognizing the values that can't be right, like the negative ages, and deciding how to change or fix them


One strategy for dealing with data involves making it so that we can sort by age, and have a standard scale for age.

First, let's see what that would look like if we try it as the column is now:

In [15]:
# Run this cell without changes
intakes['age_upon_intake'].sort_values(ascending=True).unique()

array(['-1 years', '-3 years', '0 years', '1 day', '1 month', '1 week',
       '1 weeks', '1 year', '10 months', '10 years', '11 months',
       '11 years', '12 years', '13 years', '14 years', '15 years',
       '16 years', '17 years', '18 years', '19 years', '2 days',
       '2 months', '2 weeks', '2 years', '20 years', '21 years',
       '22 years', '23 years', '24 years', '25 years', '3 days',
       '3 months', '3 weeks', '3 years', '4 days', '4 months', '4 weeks',
       '4 years', '5 days', '5 months', '5 weeks', '5 years', '6 days',
       '6 months', '6 years', '7 months', '7 years', '8 months',
       '8 years', '9 months', '9 years'], dtype=object)

Let's unpack what is happening in that line of code: we take the `age_upon_intake` column by itself (as a Series), sort the values from lowest to highest (`ascending=True`), then grab only the unique results so we can see how the values were ordered without looking through every row.

Does that do what we want it to? Let's discuss how this worked - how did it sort?

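- The values are strings, so they are sorted character by character rather than by numeric value: '10 years' lands before '2 years', and the units only come into play alphabetically after the digits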
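
To see the string ordering in isolation, here is a small sketch on a made-up Series (the `ages` values are hypothetical). The second sort uses the `key` argument of `sort_values`, which assumes a reasonably recent pandas version, and it ignores the units on purpose just to show the contrast.

```python
import pandas as pd

# Hypothetical sample of age strings
ages = pd.Series(["10 years", "2 years", "1 day", "9 months"])

# Sorted as strings: compared character by character, so "10 years" < "2 years"
as_strings = ages.sort_values().tolist()

# Sorted by the leading number only (units ignored, purely to show the contrast)
as_numbers = ages.sort_values(
    key=lambda s: s.str.split(" ").str[0].astype(int)
).tolist()
```

Neither ordering is a true "youngest to oldest" sort, which is why we need a standard scale for age.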


To make our problem a bit easier, without dealing with the different ways that age is broken out, let's only look at animals where the age is given in years. How can we do that?

In [16]:
# Code here to grab only the animals where age is given in years
years_df = intakes.loc[intakes["age_upon_intake"].str.contains("year", na=False)]

In [17]:
# Check the shape of this subset dataframe
years_df.shape

(69033, 11)

In [18]:
# Sanity check
years_df["age_upon_intake"].unique()

array(['2 years', '8 years', '4 years', '6 years', '14 years', '18 years',
       '1 year', '3 years', '5 years', '15 years', '7 years', '12 years',
       '10 years', '9 years', '11 years', '0 years', '13 years',
       '17 years', '19 years', '16 years', '20 years', '-1 years',
       '22 years', '21 years', '-3 years', '25 years', '24 years',
       '23 years'], dtype=object)

Can we grab only the number of years from this? Let's make a new column where we can put this data.

In [19]:
# Code here to make a new column, 'Age in Years'
years_df["age_in_years"] = years_df["age_upon_intake"].str.split(" ").str[0]

# Did you get a 'SettingWithCopyWarning'? No worries - let's discuss

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
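
One common way to avoid this warning is to take an explicit copy of the filtered rows before adding columns. The sketch below uses a made-up `toy` dataframe standing in for `intakes`; the names `toy`, `mask`, and `subset` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the intakes dataframe
toy = pd.DataFrame({"age_upon_intake": ["2 years", "3 weeks", "8 years"]})

# Take an explicit copy of the filtered rows, so new columns clearly belong to subset
mask = toy["age_upon_intake"].str.contains("year")
subset = toy.loc[mask].copy()

# Assigning on the copy leaves the original toy dataframe untouched
subset["age_in_years"] = (
    subset["age_upon_intake"].str.split(" ").str[0].astype(int)
)
```

The trade-off is a bit of extra memory for the copy, in exchange for making it unambiguous which dataframe is being modified.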
  


In [20]:
years_df["age_in_years"]

0         2
1         8
4         4
5         2
6         6
         ..
120397    1
120399    2
120400    3
120401    3
120402    2
Name: age_in_years, Length: 69033, dtype: object

In [21]:
# Code here to transform that column to an integer
years_df["age_in_years"] = years_df["age_in_years"].astype("int")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [22]:
years_df["age_in_years"]

0         2
1         8
4         4
5         2
6         6
         ..
120397    1
120399    2
120400    3
120401    3
120402    2
Name: age_in_years, Length: 69033, dtype: int64

In [23]:
# Code here to check your work
years_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 69033 entries, 0 to 120402
Data columns (total 12 columns):
animal_id           69033 non-null object
name                69033 non-null object
datetime            69033 non-null datetime64[ns]
found_location      69033 non-null object
intake_type         69033 non-null object
intake_condition    69033 non-null object
animal_type         69033 non-null object
sex_upon_intake     69033 non-null object
age_upon_intake     69033 non-null object
breed               69033 non-null object
color               69033 non-null object
age_in_years        69033 non-null int64
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 6.8+ MB


In [24]:
# Code here to check some statistics on our now-numeric column
years_df["age_in_years"].describe()

count    69033.000000
mean         3.420089
std          3.167055
min         -3.000000
25%          1.000000
50%          2.000000
75%          5.000000
max         25.000000
Name: age_in_years, dtype: float64

In [25]:
# Code here to check the unique values - in order!
years_df["age_in_years"].sort_values(ascending=True).unique()

array([-3, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
       15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25])

In [26]:
# Let's check the mean for our now-numeric column
years_df["age_in_years"].mean()

3.4200889429693047

In [27]:
# Now let's check the median
years_df["age_in_years"].median()

2.0

Let's discuss this column - what does it mean that the mean and median are different? How will that change if we remove some of the nonsense numbers?

- When the mean and the median differ, the data is skewed rather than symmetric (so not normally distributed)
- Removing nonsense values may change the shape of the distribution
- However, the nonsense values we see are at or below zero, while our mean is above our median: most of our data sits below the mean, which is pulled up by the older animals, so removing those low values likely won't change that
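
Here is a small sketch of how a few large values pull the mean above the median. The `ages` numbers are made up for illustration and are not drawn from the shelter data.

```python
import pandas as pd

# Hypothetical ages: mostly young animals plus a couple of old ones (a long right tail)
ages = pd.Series([1, 1, 2, 2, 2, 3, 15, 20])

mean_age = ages.mean()      # pulled upward by the two large values
median_age = ages.median()  # middle of the sorted values

# Without the two largest values, the gap between mean and median shrinks
trimmed = ages[ages < 10]
gap_before = abs(mean_age - median_age)
gap_after = abs(trimmed.mean() - trimmed.median())
```

The median barely reacts to a handful of extreme values, which is why it is often the safer summary for skewed columns.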


In [28]:
nonsense_years = [-3, -1, 0]
# Note: we haven't removed nonsense_years yet, so they still factor into this median
year_median = years_df["age_in_years"].median()

In [29]:
years_df['age_in_years'].isin(nonsense_years).sum()

753

In [30]:
# Code here to deal with those nonsense numbers
years_df['age_in_years'].replace(nonsense_years, year_median, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._update_inplace(new_data)


In [31]:
# Sanity check
years_df["age_in_years"].unique()

array([ 2,  8,  4,  6, 14, 18,  1,  3,  5, 15,  7, 12, 10,  9, 11, 13, 17,
       19, 16, 20, 22, 21, 25, 24, 23])

In [33]:
# Code here to re-check your mean/median values
print(years_df["age_in_years"].mean()) # Went up!
print(years_df["age_in_years"].median()) # Median didn't change after the replacement

3.4420059971318064
2.0


## Group By

We can use the `groupby` method to find interesting patterns among groups in our data. Let's use it now to find the average age in years for each animal type.

In [51]:
# Run just a groupby on the animal_type column - what's the output?
years_df.groupby(by='animal_type')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe664116240>

In [54]:
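# Add an aggregation function
# numeric_only=True skips the text columns; newer pandas versions raise an error on them otherwise
years_df.groupby(by='animal_type').mean(numeric_only=True)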

animal_type,age_in_years
Bird,1.725352
Cat,3.610673
Dog,3.591314
Livestock,1.571429
Other,1.583623
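
A slightly more targeted pattern is to select the column of interest before aggregating. The sketch below uses a made-up `toy` dataframe that borrows the column names from `years_df`; the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the same column names as years_df
toy = pd.DataFrame({
    "animal_type": ["Dog", "Dog", "Cat", "Cat", "Bird"],
    "age_in_years": [2, 6, 1, 3, 4],
})

# Select the column first, then aggregate, so other columns are never involved
mean_age = toy.groupby("animal_type")["age_in_years"].mean()

# .agg accepts several aggregations at once
summary = toy.groupby("animal_type")["age_in_years"].agg(["mean", "median", "count"])
```

Including a count next to the mean is a handy habit: an average over a handful of animals deserves less trust than one over thousands.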


## Dealing with Duplicates

Let's go back to our full intakes dataframe

In [34]:
intakes.head()

Unnamed: 0,animal_id,name,datetime,found_location,intake_type,intake_condition,animal_type,sex_upon_intake,age_upon_intake,breed,color
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,No name,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [35]:
# Check for duplicates
intakes.duplicated().sum()

18

In [45]:
# Now check specifically for Animal IDs that are duplicated
intakes.duplicated(subset=['animal_id']).sum()
# That's a lot!

12756

In [46]:
# Handle duplicates - only take the 1st intake for each animal
# Save it as a new version, named clean_intakes
clean_intakes = intakes.drop_duplicates(subset=['animal_id'])
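
One caveat: `drop_duplicates` keeps the first row in the dataframe's current order, and the first few rows printed above are not in date order. If "first intake" should mean the earliest one, a possible approach is to sort by date first. This is a sketch on made-up data (the `toy` frame is hypothetical), not a claim about how the real file is ordered throughout.

```python
import pandas as pd

# Hypothetical toy intakes: animal A1 came in twice, and the rows are not in date order
toy = pd.DataFrame({
    "animal_id": ["A1", "A2", "A1"],
    "datetime": pd.to_datetime(["2019-05-01", "2018-01-15", "2017-03-10"]),
})

# keep='first' keeps the first row in the current row order, whatever its date
first_in_file = toy.drop_duplicates(subset=["animal_id"], keep="first")

# Sorting by datetime first makes 'first' mean the earliest intake
earliest = (
    toy.sort_values("datetime")
       .drop_duplicates(subset=["animal_id"], keep="first")
)
```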

## Merging Dataframes

We were given two data sources here - both an Intakes and an Outcomes CSV. Let's merge them!

![Merge diagram from Data Science Made Simple](http://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png)

[Image from Data Science Made Simple's post on Joining/Merging Pandas Data Frames](http://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)
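
To make the diagram concrete, here is a minimal sketch of three join types using the `how` argument of `merge`. The `left` and `right` frames are made up; only the `animal_id` key mirrors our real data.

```python
import pandas as pd

# Hypothetical toy frames sharing an 'animal_id' key
left = pd.DataFrame({
    "animal_id": ["A1", "A2", "A3"],
    "intake_type": ["Stray", "Stray", "Owner Surrender"],
})
right = pd.DataFrame({
    "animal_id": ["A2", "A3", "A4"],
    "outcome_type": ["Adoption", "Transfer", "Adoption"],
})

inner = left.merge(right, on="animal_id", how="inner")     # only keys found in both
left_join = left.merge(right, on="animal_id", how="left")  # every row of left
outer = left.merge(right, on="animal_id", how="outer")     # every key from either side
```

`merge` defaults to `how="inner"`, so animals that appear in only one of the two frames are dropped unless you ask otherwise.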

In [57]:
# Read in our outcomes csv as a dataframe named outcomes
outcomes = pd.read_csv('data/Austin_Animal_Center_Outcomes.csv')

In [59]:
# Check out our outcomes data
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,05/08/2019 06:20:00 PM,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,07/18/2018 04:02:00 PM,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,02/13/2016 05:59:00 PM,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
3,A674754,,03/18/2014 11:47:00 AM,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
4,A689724,*Donatello,10/18/2014 06:52:00 PM,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black


What column should we use to merge these DataFrames?

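- `Animal ID` (`animal_id` once we rename the columns): it appears in both dataframes and identifies each animal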


Let's do some quick cleaning on our outcomes dataframe...

In [61]:
# Change the 'DateTime' column here to be recognized as datetime objects
outcomes['DateTime'] = pd.to_datetime(outcomes['DateTime'])

In [64]:
# Change column names to be lower case and remove spaces
outcomes = outcomes.rename(columns= lambda x: x.replace(" ", "_").lower())

In [65]:
# Drop duplicate animal IDs, keeping only the 1st
# Save this as clean_outcomes
clean_outcomes = outcomes.drop_duplicates(subset=['animal_id'])

In [66]:
# Sanity check
clean_outcomes.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color
0,A794011,Chunk,2019-05-08 18:20:00,05/08/2019 06:20:00 PM,05/02/2017,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,2018-07-18 16:02:00,07/18/2018 04:02:00 PM,07/12/2017,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A720371,Moose,2016-02-13 17:59:00,02/13/2016 05:59:00 PM,10/08/2015,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
3,A674754,,2014-03-18 11:47:00,03/18/2014 11:47:00 AM,03/12/2014,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby
4,A689724,*Donatello,2014-10-18 18:52:00,10/18/2014 06:52:00 PM,08/01/2014,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black


Now... let's merge!

In [73]:
# Code here to merge dataframes
combined_df = clean_intakes.merge(clean_outcomes, on='animal_id',
                                  suffixes = ['_intake', '_outcome'])

In [74]:
# Code here to check out the details of our new dataframe
combined_df.head()

Unnamed: 0,animal_id,name_intake,datetime_intake,found_location,intake_type,intake_condition,animal_type_intake,sex_upon_intake,age_upon_intake,breed_intake,...,datetime_outcome,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type_outcome,sex_upon_outcome,age_upon_outcome,breed_outcome,color_outcome
0,A786884,*Brock,2019-01-03 16:19:00,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,...,2019-01-08 15:11:00,01/08/2019 03:11:00 PM,01/03/2017,Transfer,Partner,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,...,2015-07-05 15:13:00,07/05/2015 03:13:00 PM,07/05/2007,Return to Owner,,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,...,2016-04-21 17:17:00,04/21/2016 05:17:00 PM,04/17/2015,Return to Owner,,Dog,Neutered Male,1 year,Basenji Mix,Sable/White
3,A665644,No name,2013-10-21 07:59:00,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,...,2013-10-21 11:39:00,10/21/2013 11:39:00 AM,09/21/2013,Transfer,Partner,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,...,2014-07-02 14:16:00,07/02/2014 02:16:00 PM,06/29/2010,Return to Owner,,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [75]:
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 106902 entries, 0 to 106901
Data columns (total 22 columns):
animal_id              106902 non-null object
name_intake            106902 non-null object
datetime_intake        106902 non-null datetime64[ns]
found_location         106902 non-null object
intake_type            106902 non-null object
intake_condition       106902 non-null object
animal_type_intake     106902 non-null object
sex_upon_intake        106902 non-null object
age_upon_intake        106902 non-null object
breed_intake           106902 non-null object
color_intake           106902 non-null object
name_outcome           69502 non-null object
datetime_outcome       106902 non-null datetime64[ns]
monthyear              106902 non-null object
date_of_birth          106902 non-null object
outcome_type           106894 non-null object
outcome_subtype        52224 non-null object
animal_type_outcome    106902 non-null object
sex_upon_outcome       106902 non-null object
...

Let's discuss - can anyone guess why I had us remove duplicates before this merge? What would happen if I didn't? How could we make our combined_df better?

- The combined df might be huge if we left duplicates in: every intake row for an animal would be matched with every outcome row for that animal. We would need a cleaner way to merge the two and actually pair each intake with its subsequent outcome (for example, an 'instance' column showing whether this was the first/second/etc. time that animal came into the shelter)
    - We can show this by merging intakes/outcomes instead of clean_intakes/clean_outcomes and comparing the lengths of the resulting dfs
- We could merge on more columns if there are things that should be the same for each animal every time (color, breed, etc.)
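
The 'instance' idea could be sketched roughly like this. Everything here is hypothetical: the toy frames, the `add_instance` helper, and the assumption that each animal's nth intake lines up with its nth outcome, which real data may not satisfy.

```python
import pandas as pd

# Hypothetical toy data: animal A1 has two intakes and two outcomes
intakes_toy = pd.DataFrame({
    "animal_id": ["A1", "A1", "A2"],
    "datetime": pd.to_datetime(["2017-01-01", "2018-06-01", "2019-02-01"]),
})
outcomes_toy = pd.DataFrame({
    "animal_id": ["A1", "A1", "A2"],
    "datetime": pd.to_datetime(["2017-01-10", "2018-06-20", "2019-02-05"]),
})

# Merging on animal_id alone pairs every A1 intake with every A1 outcome
naive = intakes_toy.merge(outcomes_toy, on="animal_id",
                          suffixes=("_intake", "_outcome"))

def add_instance(df):
    # Hypothetical helper: number each animal's rows 0, 1, 2, ... in date order
    out = df.sort_values("datetime").copy()
    out["instance"] = out.groupby("animal_id").cumcount()
    return out

# Merging on both keys pairs each intake with the outcome of the same visit
paired = add_instance(intakes_toy).merge(
    add_instance(outcomes_toy),
    on=["animal_id", "instance"],
    suffixes=("_intake", "_outcome"),
)
```

Here `naive` has five rows (2 x 2 for A1, plus 1 for A2) while `paired` has three, one per visit.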


## Level Up!

1. Find the **age in days** for all animals, not just the ones whose age is provided in years. Be sure to do this on the original dataframe, not just on subsets of the dataframe.

   - (Assume a year is 365 days, and a month is 30 days)

        
2. Ask a few questions of the combined dataframe that you couldn't figure out by just looking at the intakes or outcomes dataframes by themselves.

   - Example: Can you find out how long each animal in the combined dataframe has been in the shelter? 
        
       - Hint: Check out datetime objects - a data type that isn't a string or an integer, but which Pandas can recognize as time! https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html
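
If you want a starting point, here is one possible sketch of the building blocks on made-up data. The `toy` frame and the `days_per_unit` mapping are hypothetical, it does not handle the nonsense ages discussed earlier, and the two timestamps are the intake and outcome times from the first row of `combined_df` shown above.

```python
import pandas as pd

# Hypothetical toy data; the real column is age_upon_intake
toy = pd.DataFrame({
    "age_upon_intake": ["2 years", "3 weeks", "1 month", "5 days", "1 year"]
})

# Days per unit, using the stated assumptions (365-day year, 30-day month)
days_per_unit = {"day": 1, "week": 7, "month": 30, "year": 365}

parts = toy["age_upon_intake"].str.split(" ", expand=True)
number = parts[0].astype(int)
unit = parts[1].str.rstrip("s")  # 'years' -> 'year', 'days' -> 'day'

toy["age_in_days"] = number * unit.map(days_per_unit)

# Subtracting one datetime from another gives a Timedelta
stay = pd.Timestamp("2019-01-03 16:19") - pd.Timestamp("2019-01-08 15:11")
stay = -stay  # outcome minus intake
```

On a dataframe, the same subtraction works column-wise on two datetime columns, and the `.dt.days` accessor turns the result into whole days.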

In [40]:
# Code here to work on level up #1


In [41]:
# Code here to work on level up #2
