# Aggregating and Combining `pandas` DataFrames

## Objectives

- Use GroupBy objects to organize and aggregate data
- Create pivot tables from DataFrames
- Combine DataFrames by merging, joining, and concatenating

## Set Up

Surprise, surprise... we're still working with the Austin Animal Center Data! Let's start with Outcomes.

In [1]:
# Imports

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

In [12]:
space = " "
space

' '

In [34]:
outcomes=pd.read_csv('data/Austin_Animal_Center_Outcomes_022822.csv',
                     parse_dates=['DateTime', 'Date of Birth'])

In [35]:
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby


In [36]:
outcomes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 137097 entries, 0 to 137096
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype         
---  ------            --------------   -----         
 0   Animal ID         137097 non-null  object        
 1   Name              96095 non-null   object        
 2   DateTime          137097 non-null  datetime64[ns]
 3   MonthYear         137097 non-null  object        
 4   Date of Birth     137097 non-null  datetime64[ns]
 5   Outcome Type      137073 non-null  object        
 6   Outcome Subtype   62653 non-null   object        
 7   Animal Type       137097 non-null  object        
 8   Sex upon Outcome  137095 non-null  object        
 9   Age upon Outcome  137092 non-null  object        
 10  Breed             137097 non-null  object        
 11  Color             137097 non-null  object        
dtypes: datetime64[ns](2), object(10)
memory usage: 12.6+ MB


In [37]:
outcomes['DateTime'].dt.date

0         2019-05-08
1         2018-07-18
2         2020-08-16
3         2016-02-13
4         2014-03-18
             ...    
137092    2022-01-24
137093    2022-02-28
137094    2022-02-28
137095    2022-02-28
137096    2022-02-28
Name: DateTime, Length: 137097, dtype: object

In [38]:
# Let's create our Age in Days column
outcomes['Calculated Age in Days'] = pd.to_datetime(outcomes['DateTime'].dt.date) - outcomes['Date of Birth']

In [39]:
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,736 days
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,371 days
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray,366 days
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,128 days
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,6 days


In [40]:
# Grab just the integer here...
outcomes['Calculated Age in Days'] = outcomes['Calculated Age in Days'].dt.days

In [41]:
# Sanity check
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,736
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,371
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray,366
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,128
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,6


In [42]:
outcomes['Calculated Age in Days'].dtype

dtype('int64')

## Aggregating over DataFrames: `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. (And if you haven't, you'll see it very soon!) Pandas has this, too.

The `.groupby()` method is especially useful for aggregate functions applied to the data grouped in particular ways.
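As a minimal sketch of the idea, here is a group-then-aggregate on a tiny made-up DataFrame. The `toy` frame and its column names are hypothetical and are not part of the Austin Animal Center data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Austin Animal Center files
toy = pd.DataFrame({
    'animal': ['Cat', 'Dog', 'Cat', 'Dog', 'Bird'],
    'age_days': [100, 400, 300, 200, 50],
})

# Split the rows by 'animal', apply mean() to each group,
# and combine the results into a single Series indexed by group name
mean_age = toy.groupby('animal')['age_days'].mean()
print(mean_age)
```

Each distinct value of `animal` becomes one row of the result, which is the same shape of answer a SQL GROUP BY would give.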

In [43]:
# Just using groupby outputs some weird GroupBy object
outcomes.groupby(by='Animal Type')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fae8a9f8d60>

A GroupBy object is a type of its own, which opens up a suite of attributes and methods. One attribute we can look at is `groups`.

In [44]:
# This returns each group indexed by the group name, e.g. 'Bird',
# along with the row indices of each value

outcomes.groupby('Animal Type').groups

{'Bird': [206, 534, 985, 1027, 1284, 1310, 2220, 2258, 2274, 2417, 2521, 2598, 2712, 2778, 3178, 3379, 3648, 3743, 4003, 4024, 4288, 4702, 4766, 4998, 5063, 5205, 5436, 5656, 5848, 6087, 6236, 6340, 6592, 6682, 7033, 7352, 7428, 7985, 8048, 8315, 8331, 8414, 8538, 8922, 9203, 9448, 9758, 9825, 10103, 10165, 10407, 10657, 10736, 11386, 11616, 11674, 11732, 11765, 11771, 11837, 12205, 12418, 12423, 12451, 12474, 12713, 12902, 12978, 13057, 13063, 13095, 13272, 13317, 13323, 13435, 13474, 13677, 13934, 13950, 13963, 13981, 14108, 14131, 14146, 14193, 15114, 15193, 15543, 15553, 15813, 16022, 16197, 16499, 16866, 17173, 17338, 17390, 17426, 18319, 18359, ...], 'Cat': [0, 4, 7, 8, 10, 11, 14, 15, 16, 17, 18, 20, 24, 26, 34, 37, 49, 54, 56, 66, 67, 68, 70, 75, 78, 80, 83, 84, 89, 90, 92, 94, 95, 97, 98, 102, 113, 115, 116, 117, 118, 120, 122, 126, 139, 141, 142, 145, 147, 148, 151, 152, 156, 157, 158, 164, 167, 168, 170, 171, 176, 178, 184, 191, 192, 194, 200, 202, 203, 207, 209, 212, 215, 2

In [45]:

outcomes.groupby('Animal Type').groups.keys()

dict_keys(['Bird', 'Cat', 'Dog', 'Livestock', 'Other'])
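Beyond `.groups`, a GroupBy object can also be counted and iterated over. The sketch below uses a small hypothetical `toy` frame (not the shelter data) so the results are easy to check by eye.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'animal': ['Cat', 'Dog', 'Cat', 'Dog', 'Bird'],
    'age_days': [100, 400, 300, 200, 50],
})
grouped = toy.groupby('animal')

# ngroups is the number of distinct groups
print(grouped.ngroups)

# size() counts the rows in each group
group_sizes = grouped.size()
print(group_sizes)

# Iterating yields (group name, sub-DataFrame) pairs
group_names = []
for name, frame in grouped:
    group_names.append(name)
    print(name, len(frame))
```

Iterating like this is handy for inspection, though for actual computation an aggregation method is usually the better tool.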

In [46]:
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,736
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,371
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray,366
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,128
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,6


In [47]:
# Same goes for multi-index groupbys
animal_outcome = outcomes.groupby(['Animal Type', 'Outcome Type'])

In [48]:
animal_outcome.groups

{('Cat', 'Rto-Adopt'): [0, 302, 1372, 1675, 1792, 1874, 2325, 2633, 3305, 3854, 4522, 4813, 5095, 5723, 6402, 7476, 7594, 8266, 8481, 9260, 10224, 11986, 12151, 12928, 13817, 16229, 16464, 16682, 16887, 17100, 17275, 18810, 19013, 19017, 20244, 23467, 26596, 26795, 26838, 27954, 29305, 29597, 29755, 30087, 32129, 33733, 33770, 34089, 35555, 37026, 38018, 38166, 39349, 39382, 39524, 39589, 39750, 41369, 42095, 42453, 42723, 44088, 45047, 45094, 45264, 45405, 47103, 48098, 48499, 50079, 50424, 52015, 53810, 56460, 57128, 60371, 61038, 61170, 61357, 61540, 62234, 62256, 62275, 62578, 64067, 64733, 65282, 65338, 66857, 66957, 67447, 67780, 68104, 70264, 70436, 71147, 71735, 71937, 76949, 78131, ...], ('Dog', 'Adoption'): [1, 3, 5, 6, 9, 21, 28, 30, 31, 32, 35, 39, 41, 43, 45, 46, 48, 52, 55, 57, 58, 61, 62, 64, 69, 74, 76, 81, 82, 85, 88, 91, 96, 99, 100, 104, 107, 110, 111, 114, 119, 121, 125, 127, 128, 132, 134, 135, 137, 140, 144, 149, 155, 159, 161, 162, 163, 165, 169, 173, 174, 188, 1

In [49]:
# .groups outputs a dictionary, so we can access the group names using keys()
animal_outcome.groups.keys()

dict_keys([('Cat', 'Rto-Adopt'), ('Dog', 'Adoption'), ('Other', 'Euthanasia'), ('Cat', 'Transfer'), ('Cat', 'Adoption'), ('Cat', 'Return to Owner'), ('Dog', 'Return to Owner'), ('Dog', 'Transfer'), ('Cat', 'Euthanasia'), ('Other', 'Adoption'), ('Dog', 'Rto-Adopt'), ('Cat', 'Died'), ('Dog', 'Euthanasia'), ('Other', 'Transfer'), ('Bird', 'Adoption'), ('Other', 'Disposal'), ('Other', 'Died'), ('Dog', 'Died'), ('Cat', 'Disposal'), ('Other', 'Return to Owner'), ('Bird', 'Euthanasia'), ('Bird', 'Transfer'), ('Livestock', 'Return to Owner'), ('Dog', 'Missing'), ('Other', 'Relocate'), ('Dog', nan), ('Livestock', 'Adoption'), ('Bird', 'Return to Owner'), ('Dog', 'Disposal'), ('Cat', 'Missing'), ('Bird', 'Disposal'), ('Bird', 'Died'), ('Other', 'Missing'), ('Other', 'Rto-Adopt'), ('Bird', 'Relocate'), ('Bird', 'Missing'), ('Other', nan), ('Livestock', 'Transfer'), ('Cat', 'Relocate'), ('Cat', nan), ('Livestock', 'Died'), ('Livestock', 'Euthanasia')])

In [50]:
# We can then get a specific group, such as cats that were adopted
animal_outcome.get_group(('Cat', 'Adoption'))

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days
7,A689724,*Donatello,2014-10-18 18:52:00,Oct 2014,2014-08-01,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black,78
8,A680969,*Zeus,2014-08-05 16:59:00,Aug 2014,2014-06-03,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White/Orange Tabby,63
20,A730621,*Liza,2016-09-10 18:59:00,Sep 2016,2016-05-18,Adoption,,Cat,Spayed Female,3 months,Domestic Shorthair Mix,Calico,115
26,A801106,,2019-08-16 14:05:00,Aug 2019,2019-05-06,Adoption,,Cat,Neutered Male,3 months,Domestic Shorthair,Orange Tabby,102
54,A792258,Vesper,2019-04-10 20:53:00,Apr 2019,2016-09-08,Adoption,,Cat,Spayed Female,2 years,Domestic Shorthair Mix,Tortie,944
...,...,...,...,...,...,...,...,...,...,...,...,...,...
137072,A846689,Coco Chanel,2022-02-26 17:23:00,Feb 2022,2021-08-19,Adoption,,Cat,Spayed Female,6 months,Domestic Shorthair,Blue Tabby,191
137073,A845330,Mitzi,2022-02-26 18:09:00,Feb 2022,2021-01-28,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair,Torbie/White,394
137088,A851184,*Papaya,2022-02-28 11:38:00,Feb 2022,2021-02-08,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Orange Tabby/White,385
137090,A847804,*Mahalia,2022-02-28 11:42:00,Feb 2022,2011-12-08,Adoption,,Cat,Spayed Female,10 years,Domestic Shorthair Mix,Brown Tabby/White,3735


In [51]:
outcomes.groupby(by=['Animal Type', 'Outcome Type']).get_group(('Cat', 'Adoption'))

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days
7,A689724,*Donatello,2014-10-18 18:52:00,Oct 2014,2014-08-01,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black,78
8,A680969,*Zeus,2014-08-05 16:59:00,Aug 2014,2014-06-03,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White/Orange Tabby,63
20,A730621,*Liza,2016-09-10 18:59:00,Sep 2016,2016-05-18,Adoption,,Cat,Spayed Female,3 months,Domestic Shorthair Mix,Calico,115
26,A801106,,2019-08-16 14:05:00,Aug 2019,2019-05-06,Adoption,,Cat,Neutered Male,3 months,Domestic Shorthair,Orange Tabby,102
54,A792258,Vesper,2019-04-10 20:53:00,Apr 2019,2016-09-08,Adoption,,Cat,Spayed Female,2 years,Domestic Shorthair Mix,Tortie,944
...,...,...,...,...,...,...,...,...,...,...,...,...,...
137072,A846689,Coco Chanel,2022-02-26 17:23:00,Feb 2022,2021-08-19,Adoption,,Cat,Spayed Female,6 months,Domestic Shorthair,Blue Tabby,191
137073,A845330,Mitzi,2022-02-26 18:09:00,Feb 2022,2021-01-28,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair,Torbie/White,394
137088,A851184,*Papaya,2022-02-28 11:38:00,Feb 2022,2021-02-08,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Orange Tabby/White,385
137090,A847804,*Mahalia,2022-02-28 11:42:00,Feb 2022,2011-12-08,Adoption,,Cat,Spayed Female,10 years,Domestic Shorthair Mix,Brown Tabby/White,3735


In [52]:
# Could have done this using loc
outcomes.loc[(outcomes['Animal Type'] == 'Cat') & (outcomes['Outcome Type'] == 'Adoption')]

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days
7,A689724,*Donatello,2014-10-18 18:52:00,Oct 2014,2014-08-01,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,Black,78
8,A680969,*Zeus,2014-08-05 16:59:00,Aug 2014,2014-06-03,Adoption,,Cat,Neutered Male,2 months,Domestic Shorthair Mix,White/Orange Tabby,63
20,A730621,*Liza,2016-09-10 18:59:00,Sep 2016,2016-05-18,Adoption,,Cat,Spayed Female,3 months,Domestic Shorthair Mix,Calico,115
26,A801106,,2019-08-16 14:05:00,Aug 2019,2019-05-06,Adoption,,Cat,Neutered Male,3 months,Domestic Shorthair,Orange Tabby,102
54,A792258,Vesper,2019-04-10 20:53:00,Apr 2019,2016-09-08,Adoption,,Cat,Spayed Female,2 years,Domestic Shorthair Mix,Tortie,944
...,...,...,...,...,...,...,...,...,...,...,...,...,...
137072,A846689,Coco Chanel,2022-02-26 17:23:00,Feb 2022,2021-08-19,Adoption,,Cat,Spayed Female,6 months,Domestic Shorthair,Blue Tabby,191
137073,A845330,Mitzi,2022-02-26 18:09:00,Feb 2022,2021-01-28,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair,Torbie/White,394
137088,A851184,*Papaya,2022-02-28 11:38:00,Feb 2022,2021-02-08,Adoption,,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Orange Tabby/White,385
137090,A847804,*Mahalia,2022-02-28 11:42:00,Feb 2022,2011-12-08,Adoption,,Cat,Spayed Female,10 years,Domestic Shorthair Mix,Brown Tabby/White,3735


## Aggregating

Once again, as we will see in SQL, GroupBy objects are intended to be used with aggregation. In SQL, a query that includes GROUP BY must aggregate every selected column that is not itself a grouping column.

We can use `.sum()`, `.mean()`, `.count()`, `.max()`, `.min()`, etc. Find a list of common aggregations [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).

In [53]:
outcomes.groupby('Animal Type').count()

Unnamed: 0_level_0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days
Animal Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
Bird,636,145,636,636,636,636,380,636,636,636,636,636
Cat,52092,30380,52092,52092,52092,52088,31815,52092,52091,52092,52092,52092
Dog,77091,64516,77091,77091,77091,77076,24590,77089,77089,77091,77091,77091
Livestock,25,3,25,25,25,25,19,25,25,25,25,25
Other,7253,1051,7253,7253,7253,7248,5849,7253,7251,7253,7253,7253


In [54]:
outcomes['Animal Type'].value_counts()

Dog          77091
Cat          52092
Other         7253
Bird           636
Livestock       25
Name: Animal Type, dtype: int64

In [55]:
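# numeric_only=True keeps only the numeric columns;
# recent pandas versions raise a TypeError on text columns otherwise
outcomes.groupby('Animal Type').mean(numeric_only=True)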

Unnamed: 0_level_0,Calculated Age in Days
Animal Type,Unnamed: 1_level_1
Bird,533.716981
Cat,532.816939
Dog,1004.114514
Livestock,503.8
Other,479.143665


In [56]:
# Again restrict the mean to numeric columns
animal_outcome.mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Calculated Age in Days
Animal Type,Outcome Type,Unnamed: 2_level_1
Bird,Adoption,510.379147
Bird,Died,406.217391
Bird,Disposal,392.275862
Bird,Euthanasia,500.275591
Bird,Missing,384.0
Bird,Relocate,957.4
Bird,Return to Owner,477.694444
Bird,Transfer,604.748744
Cat,Adoption,467.456066
Cat,Died,445.049351


In [71]:
outcomes.groupby('Animal Type').agg({'Calculated Age in Days': 'max'})

Unnamed: 0_level_0,Calculated Age in Days
Animal Type,Unnamed: 1_level_1
Bird,10996
Cat,8036
Dog,8766
Livestock,1844
Other,7671


In [79]:
outcomes.groupby('Animal Type').agg({'Breed': ' '.join})

Unnamed: 0_level_0,Breed
Animal Type,Unnamed: 1_level_1
Bird,Chicken Mix Chicken Mix Quaker Mix Quaker Chic...
Cat,Domestic Shorthair Mix Domestic Shorthair Mix ...
Dog,Chihuahua Shorthair Mix Anatol Shepherd/Labrad...
Livestock,Pig Mix Pig Pig Pygmy Pig Mix Potbelly Pig Mix...
Other,Raccoon Opossum Bat Mix Bat Mix Bat Polish Rac...
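The `.agg()` method can also compute several aggregations at once. The following is a small sketch on hypothetical toy data (the `toy` frame and the output column names `youngest` and `oldest` are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'animal': ['Cat', 'Dog', 'Cat', 'Dog', 'Bird'],
    'age_days': [100, 400, 300, 200, 50],
})

# A list of function names gives one output column per aggregation
summary = toy.groupby('animal')['age_days'].agg(['min', 'max', 'mean'])
print(summary)

# Named aggregation: new_column=(source_column, function)
named = toy.groupby('animal').agg(
    youngest=('age_days', 'min'),
    oldest=('age_days', 'max'),
)
print(named)
```

Named aggregation is convenient when the result feeds into later steps, since the output columns get meaningful names right away.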


## Exercise

Use `.groupby()` to find the most recent birth date of each (main) animal type.


In [None]:
# Your code here


In [80]:
agg_dict = {'Date of Birth': 'max', 'Calculated Age in Days': 'mean'}
outcomes.groupby('Animal Type')[['Date of Birth',
                                 'Calculated Age in Days']].agg(agg_dict)

Unnamed: 0_level_0,Date of Birth,Calculated Age in Days
Animal Type,Unnamed: 1_level_1,Unnamed: 2_level_1
Bird,2022-01-06,533.716981
Cat,2022-02-18,532.816939
Dog,2022-02-14,1004.114514
Livestock,2020-05-28,503.8
Other,2022-02-11,479.143665


<details>
    <summary>Answer</summary>

```python
outcomes.groupby('Animal Type')['Date of Birth'].max()
```
</details>

# Pivoting a DataFrame

## `.pivot_table()`

Those of you familiar with Excel have probably used Pivot Tables. Pandas has similar functionality.

Grouping by two different columns can be very helpful.

In [81]:
# numeric_only=True avoids a TypeError on text columns in recent pandas
outcomes.groupby(by=['Outcome Type', 'Sex upon Outcome']).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Calculated Age in Days
Outcome Type,Sex upon Outcome,Unnamed: 2_level_1
Adoption,Intact Female,419.505568
Adoption,Intact Male,476.596774
Adoption,Neutered Male,651.126203
Adoption,Spayed Female,648.050078
Adoption,Unknown,389.964286
Died,Intact Female,349.628049
Died,Intact Male,304.392157
Died,Neutered Male,1859.732673
Died,Spayed Female,2099.905263
Died,Unknown,311.192737


In [82]:
outcomes.groupby(by=['Outcome Type', 'Sex upon Outcome']).mean(numeric_only=True).index

MultiIndex([(       'Adoption', 'Intact Female'),
            (       'Adoption',   'Intact Male'),
            (       'Adoption', 'Neutered Male'),
            (       'Adoption', 'Spayed Female'),
            (       'Adoption',       'Unknown'),
            (           'Died', 'Intact Female'),
            (           'Died',   'Intact Male'),
            (           'Died', 'Neutered Male'),
            (           'Died', 'Spayed Female'),
            (           'Died',       'Unknown'),
            (       'Disposal', 'Intact Female'),
            (       'Disposal',   'Intact Male'),
            (       'Disposal', 'Neutered Male'),
            (       'Disposal', 'Spayed Female'),
            (       'Disposal',       'Unknown'),
            (     'Euthanasia', 'Intact Female'),
            (     'Euthanasia',   'Intact Male'),
            (     'Euthanasia', 'Neutered Male'),
            (     'Euthanasia', 'Spayed Female'),
            (     'Euthanasia',       'Unknown'),


But it has the unsavory side effect of creating a two-level index. This can be a good time to use `.pivot_table()`.
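If the two-level index is the only problem, it can also be flattened after the fact. Here is a minimal sketch on a hypothetical `toy` frame (the column names are made up): `.reset_index()` turns the index levels back into columns, and `.unstack()` moves the inner level into the column headers.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    'outcome': ['Adoption', 'Adoption', 'Transfer', 'Transfer'],
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'age_days': [100, 200, 300, 400],
})

# Grouping by two columns produces a Series with a two-level index
two_level = toy.groupby(['outcome', 'sex'])['age_days'].mean()

# reset_index() turns both index levels back into ordinary columns
flat = two_level.reset_index()
print(flat)

# unstack() moves the inner level ('sex') into the column headers
wide = two_level.unstack()
print(wide)
```

The `wide` layout, with one grouping column down the side and the other across the top, is the shape a pivot table produces.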

(There is also a `.pivot()`. For the somewhat subtle differences, see [here](https://stackoverflow.com/questions/30960338/pandas-difference-between-pivot-and-pivot-table-why-is-only-pivot-table-workin).)

In [83]:
outcomes['test_int'] = 5
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days,test_int
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,736,5
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,371,5
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray,366,5
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,128,5
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,6,5


In [85]:
outcomes.groupby(by=['Outcome Type', 'Sex upon Outcome']).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Calculated Age in Days,test_int
Outcome Type,Sex upon Outcome,Unnamed: 2_level_1,Unnamed: 3_level_1
Adoption,Intact Female,419.505568,5.0
Adoption,Intact Male,476.596774,5.0
Adoption,Neutered Male,651.126203,5.0
Adoption,Spayed Female,648.050078,5.0
Adoption,Unknown,389.964286,5.0
Died,Intact Female,349.628049,5.0
Died,Intact Male,304.392157,5.0
Died,Neutered Male,1859.732673,5.0
Died,Spayed Female,2099.905263,5.0
Died,Unknown,311.192737,5.0


In [84]:
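# Check it out!
# Naming the numeric columns in `values` keeps text columns out of the mean
outcomes.pivot_table(index='Outcome Type', columns='Sex upon Outcome',
                     values=['Calculated Age in Days', 'test_int'],
                     aggfunc='mean')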

Unnamed: 0_level_0,Calculated Age in Days,Calculated Age in Days,Calculated Age in Days,Calculated Age in Days,Calculated Age in Days,test_int,test_int,test_int,test_int,test_int
Sex upon Outcome,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown
Outcome Type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2
Adoption,419.505568,476.596774,651.126203,648.050078,389.964286,5.0,5.0,5.0,5.0,5.0
Died,349.628049,304.392157,1859.732673,2099.905263,311.192737,5.0,5.0,5.0,5.0,5.0
Disposal,407.277778,735.935484,1987.777778,2470.166667,447.166016,5.0,5.0,5.0,5.0,5.0
Euthanasia,1133.424632,902.4763,2226.837838,2313.677966,503.333588,5.0,5.0,5.0,5.0,5.0
Missing,266.8,340.846154,1188.285714,1262.733333,169.25,5.0,5.0,5.0,5.0,5.0
Relocate,732.0,,1105.0,495.0,612.2,5.0,,5.0,5.0,5.0
Return to Owner,1067.053879,1110.232182,1634.609964,1751.059214,764.225275,5.0,5.0,5.0,5.0,5.0
Rto-Adopt,1481.142857,1503.175,1254.399577,1252.076923,1590.0,5.0,5.0,5.0,5.0,5.0
Transfer,409.499911,351.252856,1126.40892,1097.636076,169.775301,5.0,5.0,5.0,5.0,5.0


In [86]:
outcomes.pivot_table(index='Outcome Type', 
                     columns='Sex upon Outcome', aggfunc='mean')['test_int']

Sex upon Outcome,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown
Outcome Type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Adoption,5.0,5.0,5.0,5.0,5.0
Died,5.0,5.0,5.0,5.0,5.0
Disposal,5.0,5.0,5.0,5.0,5.0
Euthanasia,5.0,5.0,5.0,5.0,5.0
Missing,5.0,5.0,5.0,5.0,5.0
Relocate,5.0,,5.0,5.0,5.0
Return to Owner,5.0,5.0,5.0,5.0,5.0
Rto-Adopt,5.0,5.0,5.0,5.0,5.0
Transfer,5.0,5.0,5.0,5.0,5.0


In [87]:
# Recent pandas versions require keyword arguments here
outcomes.pivot(index='Outcome Type', columns='Sex upon Outcome')

ValueError: Index contains duplicate entries, cannot reshape
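The `ValueError` above arises because `.pivot()` only reshapes: it needs each index/column pair to appear at most once. `.pivot_table()` aggregates duplicates instead. A small sketch on hypothetical toy data (the `toy` frame is made up for illustration) shows the difference.

```python
import pandas as pd

# Hypothetical toy data; note that ('Adoption', 'Male') appears twice
toy = pd.DataFrame({
    'outcome': ['Adoption', 'Adoption', 'Adoption', 'Transfer'],
    'sex': ['Male', 'Male', 'Female', 'Male'],
    'age_days': [100, 300, 200, 400],
})

# .pivot() cannot decide which of the two duplicate values to keep
try:
    toy.pivot(index='outcome', columns='sex', values='age_days')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# .pivot_table() aggregates the duplicates (the default aggfunc is the mean)
table = toy.pivot_table(index='outcome', columns='sex', values='age_days')
print(table)
```

Combinations with no rows at all, such as ('Transfer', 'Female') here, show up as `NaN` in the pivot table.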

In [88]:
outcomes.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,137087,137088,137089,137090,137091,137092,137093,137094,137095,137096
Animal ID,A794011,A776359,A821648,A720371,A674754,A659412,A814515,A689724,A680969,A840370,...,A690359,A851184,A850725,A847804,A842414,A850166,A852031,A845839,A844321,A813933
Name,Chunk,Gizmo,,Moose,,Princess,Quentin,*Donatello,*Zeus,Tulip,...,*Porridge,*Papaya,Monty,*Mahalia,,Rainey,Noodle,*Carmen,Mia Marie,Lucille
DateTime,2019-05-08 18:20:00,2018-07-18 16:02:00,2020-08-16 11:38:00,2016-02-13 17:59:00,2014-03-18 11:47:00,2020-10-05 14:37:00,2020-05-06 07:59:00,2014-10-18 18:52:00,2014-08-05 16:59:00,2021-08-19 19:36:00,...,2022-02-26 11:34:00,2022-02-28 11:38:00,2022-02-25 11:23:00,2022-02-28 11:42:00,2021-09-17 13:38:00,2022-01-24 18:20:00,2022-02-28 12:50:00,2022-02-28 13:49:00,2022-02-28 13:04:00,2022-02-28 14:19:00
MonthYear,May 2019,Jul 2018,Aug 2020,Feb 2016,Mar 2014,Oct 2020,May 2020,Oct 2014,Aug 2014,Aug 2021,...,Feb 2022,Feb 2022,Feb 2022,Feb 2022,Sep 2021,Jan 2022,Feb 2022,Feb 2022,Feb 2022,Feb 2022
Date of Birth,2017-05-02 00:00:00,2017-07-12 00:00:00,2019-08-16 00:00:00,2015-10-08 00:00:00,2014-03-12 00:00:00,2013-03-24 00:00:00,2018-03-01 00:00:00,2014-08-01 00:00:00,2014-06-03 00:00:00,2019-08-06 00:00:00,...,2014-06-19 00:00:00,2021-02-08 00:00:00,2020-01-30 00:00:00,2011-12-08 00:00:00,2021-07-15 00:00:00,2021-11-19 00:00:00,2020-02-23 00:00:00,2020-05-05 00:00:00,2013-10-15 00:00:00,2018-12-21 00:00:00
Outcome Type,Rto-Adopt,Adoption,Euthanasia,Adoption,Transfer,Adoption,Adoption,Adoption,Adoption,Adoption,...,Transfer,Adoption,Adoption,Adoption,Adoption,Adoption,Transfer,Adoption,Adoption,Adoption
Outcome Subtype,,,,,Partner,,Foster,,,,...,Partner,,,,,,Partner,Foster,Foster,
Animal Type,Cat,Dog,Other,Dog,Cat,Dog,Dog,Cat,Cat,Dog,...,Dog,Cat,Dog,Cat,Dog,Cat,Dog,Dog,Dog,Dog
Sex upon Outcome,Neutered Male,Neutered Male,Unknown,Neutered Male,Intact Male,Spayed Female,Neutered Male,Neutered Male,Neutered Male,Spayed Female,...,Neutered Male,Spayed Female,Neutered Male,Spayed Female,Intact Female,Intact Male,Neutered Male,Spayed Female,Spayed Female,Spayed Female
Age upon Outcome,2 years,1 year,1 year,4 months,6 days,7 years,2 years,2 months,2 months,2 years,...,7 years,1 year,2 years,10 years,2 months,2 months,2 years,1 year,8 years,3 years


# Methods for Combining DataFrames: `.join()`, `.merge()`, `pd.concat()`

There are many ways to combine DataFrames! Luckily, pandas has great docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

In [91]:
new_df = pd.DataFrame(data=[[1,2,3,4,5,6,7,8,9,10,11,12,13,14]], columns=outcomes.columns)
new_df

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days,test_int
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14


In [94]:
pd.concat([outcomes, new_df], ignore_index=True)

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days,test_int
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02 00:00:00,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,736,5
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12 00:00:00,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,371,5
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16 00:00:00,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray,366,5
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08 00:00:00,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,128,5
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12 00:00:00,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,6,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
137093,A852031,Noodle,2022-02-28 12:50:00,Feb 2022,2020-02-23 00:00:00,Transfer,Partner,Dog,Neutered Male,2 years,Pomeranian/Chihuahua Longhair,Buff,736,5
137094,A845839,*Carmen,2022-02-28 13:49:00,Feb 2022,2020-05-05 00:00:00,Adoption,Foster,Dog,Spayed Female,1 year,Pit Bull Mix,Brown,664,5
137095,A844321,Mia Marie,2022-02-28 13:04:00,Feb 2022,2013-10-15 00:00:00,Adoption,Foster,Dog,Spayed Female,8 years,Pit Bull,Black/White,3058,5
137096,A813933,Lucille,2022-02-28 14:19:00,Feb 2022,2018-12-21 00:00:00,Adoption,,Dog,Spayed Female,3 years,Belgian Malinois,Brown/Black,1165,5


## `.join()`

In [95]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'MP'])

toy1

Unnamed: 0,age,HP
0,63,142
1,33,47


In [96]:
toy2

Unnamed: 0,age,MP
0,63,100
1,33,200


In [97]:
# We can't just join these as they are, since we haven't specified our suffixes

toy1.join(toy2)

ValueError: columns overlap but no suffix specified: Index(['age'], dtype='object')

In [98]:
toy1.join(toy2, lsuffix='_l', rsuffix='_r')

Unnamed: 0,age_l,HP,age_r,MP
0,63,142,63,100
1,33,47,33,200


### Left table is always the one you are running the `.join()`/`.merge()` on

If we don't want to keep both `age` columns, we could set the overlapping column as the index in each DataFrame:

In [100]:
toy1.set_index('age').join(toy2.set_index('age')).reset_index()

Unnamed: 0,age,HP,MP
0,63,142,100
1,33,47,200


In [101]:
# Or drop it from one part
toy1.drop('age', axis=1).join(toy2)

Unnamed: 0,HP,age,MP
0,142,63,100
1,47,33,200


In [107]:
toy1.join(toy2.set_index('age'), on='age')

Unnamed: 0,age,HP,MP
0,63,142,100
1,33,47,200


## `.merge()`

Or we could use `.merge()`:

In [111]:
toy1.merge(toy2, left_index=True, right_index=True)

Unnamed: 0,age_x,HP,age_y,MP
0,63,142,63,100
1,33,47,33,200


In [112]:
toy1.merge(toy2)

Unnamed: 0,age,HP,MP
0,63,142,100
1,33,47,200


In [113]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)
ds_chars

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [114]:
states = pd.read_csv('data/states.csv', index_col=0)
states

Unnamed: 0,state,nickname,capital
0,WA,evergreen,Olympia
1,TX,alamo,Austin
2,DC,district,Washington
3,OH,buckeye,Columbus
4,OR,beaver,Salem


## The `how` Parameter

This parameter in both `.join()` and `.merge()` tells pandas what sort of join to perform. We'll cover this in detail when we discuss SQL.

![image showcasing how the how parameter in a join/merge would combine the two datasets, using venn-style diagrams](https://www.datasciencemadesimple.com/wp-content/uploads/2017/09/join-or-merge-in-python-pandas-1.png)
[[Image Source]](https://www.datasciencemadesimple.com/join-merge-data-frames-pandas-python/)
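To make the four join types concrete, here is a minimal sketch on two hypothetical frames, `left` and `right`, that share only some keys. The `indicator=True` option of `.merge()` adds a `_merge` column recording which table each row came from.

```python
import pandas as pd

# Hypothetical toy frames: keys 'b' and 'c' appear in both
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# Count how many rows each join type keeps
row_counts = {how: len(left.merge(right, on='key', how=how))
              for how in ['inner', 'left', 'right', 'outer']}
print(row_counts)  # {'inner': 2, 'left': 3, 'right': 3, 'outer': 4}

# indicator=True labels each row 'left_only', 'right_only', or 'both'
outer = left.merge(right, on='key', how='outer', indicator=True)
print(outer)
```

An inner join keeps only matching keys, left and right joins keep every key from one side, and an outer join keeps every key from both.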

In [115]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='inner')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,rachel,200,TX,TX,alamo,Austin
4,alison,300,DC,DC,district,Washington


In [116]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='outer')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200.0,WA,WA,evergreen,Olympia
1,miles,200.0,WA,WA,evergreen,Olympia
2,alan,170.0,TX,TX,alamo,Austin
3,rachel,200.0,TX,TX,alamo,Austin
4,alison,300.0,DC,DC,district,Washington
5,,,,OH,buckeye,Columbus
6,,,,OR,beaver,Salem


In [117]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='left')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,alison,300,DC,DC,district,Washington
4,rachel,200,TX,TX,alamo,Austin


In [118]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='right')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200.0,WA,WA,evergreen,Olympia
1,miles,200.0,WA,WA,evergreen,Olympia
2,alan,170.0,TX,TX,alamo,Austin
3,rachel,200.0,TX,TX,alamo,Austin
4,alison,300.0,DC,DC,district,Washington
5,,,,OH,buckeye,Columbus
6,,,,OR,beaver,Salem


## `pd.concat()`

This function takes a *list* of pandas objects as its first argument.

In [119]:
prefs = pd.read_csv('data/preferences.csv', index_col=0)
prefs

Unnamed: 0,cuisine,genre
0,Greek,horror
1,Indian,scifi
2,American,fantasy
3,Thai,tech
4,Indian,documentary


In [120]:
pd.concat([ds_chars, prefs], ignore_index=True)

Unnamed: 0,name,HP,home_state,cuisine,genre
0,greg,200.0,WA,,
1,miles,200.0,WA,,
2,alan,170.0,TX,,
3,alison,300.0,DC,,
4,rachel,200.0,TX,,
5,,,,Greek,horror
6,,,,Indian,scifi
7,,,,American,fantasy
8,,,,Thai,tech
9,,,,Indian,documentary


`pd.concat()`, like many other pandas operations, makes use of an `axis` parameter. For this particular function I need to specify whether I want to concatenate the DataFrames *row-wise* (`axis=0`) or *column-wise* (`axis=1`). The default is `axis=0`, so let's override that!

In [121]:
ds_full = pd.concat([ds_chars, prefs], axis=1)
ds_full

Unnamed: 0,name,HP,home_state,cuisine,genre
0,greg,200,WA,Greek,horror
1,miles,200,WA,Indian,scifi
2,alan,170,TX,American,fantasy
3,alison,300,DC,Thai,tech
4,rachel,200,TX,Indian,documentary
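
One thing worth knowing about `axis=1`: rows are paired by index label, not by position. The example above lines up neatly because both frames share the index 0 through 4. The sketch below uses made-up frames (`left`, `right`, and `shifted` are hypothetical) to show what happens when the labels differ.

```python
import pandas as pd

left = pd.DataFrame({'name': ['greg', 'miles', 'alan']}, index=[0, 1, 2])

# Same labels as `left`, but stored in reverse order
right = pd.DataFrame({'cuisine': ['Greek', 'Indian', 'American']},
                     index=[2, 1, 0])

# Rows are matched on label, so greg (label 0) is paired with 'American'
side_by_side = pd.concat([left, right], axis=1)

# Labels that do not overlap at all
shifted = pd.DataFrame({'cuisine': ['Thai', 'Indian']}, index=[3, 4])

# The result covers the union of the labels, padded with NaN
misaligned = pd.concat([left, shifted], axis=1)

print(side_by_side)
print(misaligned)
```

If the two frames are meant to be paired by position, resetting the index on each one first (`reset_index(drop=True)`) is a simple way to make that explicit.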


## Back to the Center

We have Intakes data and we have Outcomes data... time to merge!

In [127]:
# Peek at the outcomes data we already had in here
outcomes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Date of Birth,Outcome Type,Outcome Subtype,Animal Type,Sex upon Outcome,Age upon Outcome,Breed,Color,Calculated Age in Days,test_int
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,736,5
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,371,5
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray,366,5
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,128,5
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,6,5


In [128]:
# Read in the intakes data
intakes = pd.read_csv("data/Austin_Animal_Center_Intakes_022822.csv",
                      parse_dates=['DateTime'])
# Check it out
intakes.head()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
0,A786884,*Brock,2019-01-03 16:19:00,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,Beagle Mix,Tricolor
1,A706918,Belle,2015-07-05 12:59:00,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver
2,A724273,Runster,2016-04-14 18:43:00,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,Basenji Mix,Sable/White
3,A665644,,2013-10-21 07:59:00,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico
4,A682524,Rio,2014-06-29 10:38:00,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray


In [129]:
# Let's try merging on Animal ID
combined = outcomes.merge(intakes, on='Animal ID',
                          how='inner',
                          suffixes=['_outcomes', '_intakes'])

In [130]:
# What was the result?
combined.head()

Unnamed: 0,Animal ID,Name_outcomes,DateTime_outcomes,MonthYear_outcomes,Date of Birth,Outcome Type,Outcome Subtype,Animal Type_outcomes,Sex upon Outcome,Age upon Outcome,...,DateTime_intakes,MonthYear_intakes,Found Location,Intake Type,Intake Condition,Animal Type_intakes,Sex upon Intake,Age upon Intake,Breed_intakes,Color_intakes
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,...,2019-05-02 16:51:00,May 2019,Austin (TX),Owner Surrender,Normal,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,...,2018-07-12 12:46:00,July 2018,7201 Levander Loop in Austin (TX),Stray,Normal,Dog,Intact Male,1 year,Chihuahua Shorthair Mix,White/Brown
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,...,2020-08-16 10:10:00,August 2020,Armadillo Rd And Clubway Ln in Austin (TX),Wildlife,Sick,Other,Unknown,1 year,Raccoon,Gray
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,...,2016-02-08 11:05:00,February 2016,Dove Dr And E Stassney in Austin (TX),Stray,Normal,Dog,Intact Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
4,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,...,2016-02-15 10:37:00,February 2016,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff


In [131]:
combined.shape

(176664, 25)

In [132]:
intakes.shape

(136763, 12)

In [133]:
outcomes.shape

(137097, 14)

Let's discuss/explore: did that work the way we expected?

- 

<details>
    <summary>Observation Notes</summary>

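- We went from about 137k rows in each of the DataFrames to almost 177k, even with an inner join! Something seems off. An inner join can only grow like this when `Animal ID` values repeat in both tables, because every matching intake row gets paired with every matching outcome row.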
    
    
</details>
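
Here is a small sketch of why the row count can grow. The frames `intakes_demo` and `outcomes_demo` are invented for illustration: animal `A1` has two intakes and two outcomes, so the merge produces 2 x 2 = 4 rows for it. The `validate` argument of `merge` is one way to have pandas complain when the keys are not unique.

```python
import pandas as pd

# Invented data: A1 visited the shelter twice, A2 once
intakes_demo = pd.DataFrame({
    'Animal ID': ['A1', 'A1', 'A2'],
    'Intake Type': ['Stray', 'Owner Surrender', 'Stray'],
})
outcomes_demo = pd.DataFrame({
    'Animal ID': ['A1', 'A1', 'A2'],
    'Outcome Type': ['Adoption', 'Transfer', 'Adoption'],
})

# Every A1 outcome is paired with every A1 intake: 4 rows for A1, 1 for A2
combined_demo = outcomes_demo.merge(intakes_demo, on='Animal ID', how='inner')
print(len(combined_demo))

# Ask pandas to check that each key appears at most once on both sides
try:
    outcomes_demo.merge(intakes_demo, on='Animal ID', validate='one_to_one')
    validation_failed = False
except pd.errors.MergeError as err:
    validation_failed = True
    print('Merge rejected:', err)
```

This is the same pattern as Moose (`A720371`) in the real data, who has two outcome rows and two intake rows and therefore four rows in `combined`.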

In [134]:
# We might want to try something different
# Can we clean something to make a better merge?
combined.loc[combined.duplicated(subset='Animal ID', keep=False)]

Unnamed: 0,Animal ID,Name_outcomes,DateTime_outcomes,MonthYear_outcomes,Date of Birth,Outcome Type,Outcome Subtype,Animal Type_outcomes,Sex upon Outcome,Age upon Outcome,...,DateTime_intakes,MonthYear_intakes,Found Location,Intake Type,Intake Condition,Animal Type_intakes,Sex upon Intake,Age upon Intake,Breed_intakes,Color_intakes
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,...,2016-02-08 11:05:00,February 2016,Dove Dr And E Stassney in Austin (TX),Stray,Normal,Dog,Intact Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
4,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,...,2016-02-15 10:37:00,February 2016,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
5,A720371,Moose,2016-02-15 00:00:00,Feb 2016,2015-10-08,Transfer,Partner,Dog,Neutered Male,4 months,...,2016-02-08 11:05:00,February 2016,Dove Dr And E Stassney in Austin (TX),Stray,Normal,Dog,Intact Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
6,A720371,Moose,2016-02-15 00:00:00,Feb 2016,2015-10-08,Transfer,Partner,Dog,Neutered Male,4 months,...,2016-02-15 10:37:00,February 2016,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff
8,A659412,Princess,2020-10-05 14:37:00,Oct 2020,2013-03-24,Adoption,,Dog,Spayed Female,7 years,...,2018-06-13 12:55:00,June 2018,Austin (TX),Owner Surrender,Normal,Dog,Spayed Female,5 years,Chihuahua Shorthair Mix,Brown
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
176643,A850561,Max,2022-02-05 16:22:00,Feb 2022,2020-09-27,Adoption,,Dog,Neutered Male,1 year,...,2022-01-31 17:22:00,January 2022,Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,1 year,Belgian Malinois/German Shepherd,Black/Brown
176644,A850561,Max,2022-02-05 16:22:00,Feb 2022,2020-09-27,Adoption,,Dog,Neutered Male,1 year,...,2022-02-27 16:52:00,February 2022,1414 South Lamar Boulevard in Austin (TX),Owner Surrender,Normal,Dog,Neutered Male,1 year,Belgian Malinois/German Shepherd,Black/Brown
176645,A850561,Max,2022-02-05 16:22:00,Feb 2022,2020-09-27,Adoption,,Dog,Neutered Male,1 year,...,2022-01-26 17:31:00,January 2022,Austin (TX),Public Assist,Normal,Dog,Neutered Male,1 year,Belgian Malinois/German Shepherd,Black/Brown
176656,A850725,Monty,2022-02-25 11:23:00,Feb 2022,2020-01-30,Adoption,,Dog,Neutered Male,2 years,...,2022-01-30 14:32:00,January 2022,14100 Thermal Dr in Austin (TX),Stray,Injured,Dog,Intact Male,2 years,Labrador Retriever,Black/White


In [135]:
# Try again, keeping only the first row for each Animal ID
clean_intakes = intakes.drop_duplicates(subset=['Animal ID'])
clean_outcomes = outcomes.drop_duplicates(subset=['Animal ID'])

In [136]:
clean_combined_df = clean_intakes.merge(clean_outcomes, on='Animal ID',
                                        how='inner',
                                        suffixes=['_intake', '_outcome'])

In [137]:
clean_combined_df.head()

Unnamed: 0,Animal ID,Name_intake,DateTime_intake,MonthYear_intake,Found Location,Intake Type,Intake Condition,Animal Type_intake,Sex upon Intake,Age upon Intake,...,Date of Birth,Outcome Type,Outcome Subtype,Animal Type_outcome,Sex upon Outcome,Age upon Outcome,Breed_outcome,Color_outcome,Calculated Age in Days,test_int
0,A786884,*Brock,2019-01-03 16:19:00,January 2019,2501 Magin Meadow Dr in Austin (TX),Stray,Normal,Dog,Neutered Male,2 years,...,2017-01-03,Transfer,Partner,Dog,Neutered Male,2 years,Beagle Mix,Tricolor,735,5
1,A706918,Belle,2015-07-05 12:59:00,July 2015,9409 Bluegrass Dr in Austin (TX),Stray,Normal,Dog,Spayed Female,8 years,...,2007-07-05,Return to Owner,,Dog,Spayed Female,8 years,English Springer Spaniel,White/Liver,2922,5
2,A724273,Runster,2016-04-14 18:43:00,April 2016,2818 Palomino Trail in Austin (TX),Stray,Normal,Dog,Intact Male,11 months,...,2015-04-17,Return to Owner,,Dog,Neutered Male,1 year,Basenji Mix,Sable/White,370,5
3,A665644,,2013-10-21 07:59:00,October 2013,Austin (TX),Stray,Sick,Cat,Intact Female,4 weeks,...,2013-09-21,Transfer,Partner,Cat,Intact Female,4 weeks,Domestic Shorthair Mix,Calico,30,5
4,A682524,Rio,2014-06-29 10:38:00,June 2014,800 Grove Blvd in Austin (TX),Stray,Normal,Dog,Neutered Male,4 years,...,2010-06-29,Return to Owner,,Dog,Neutered Male,4 years,Doberman Pinsch/Australian Cattle Dog,Tan/Gray,1464,5


In [138]:
clean_combined_df.shape

(121773, 25)
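
Note that `drop_duplicates` keeps the first row it encounters for each `Animal ID`, which means "first in file order", not necessarily the earliest visit. A minimal sketch with invented rows (`intakes_demo` is hypothetical) shows the difference, and how sorting by `DateTime` first makes the choice deliberate.

```python
import pandas as pd

# Invented data: A1's later visit happens to appear first in the file
intakes_demo = pd.DataFrame({
    'Animal ID': ['A1', 'A2', 'A1'],
    'DateTime': pd.to_datetime(['2016-02-15 10:37',
                                '2019-01-03 16:19',
                                '2016-02-08 11:05']),
    'Intake Type': ['Owner Surrender', 'Stray', 'Stray'],
})

# Default keep='first': whichever row comes first in the current order
first_in_file = intakes_demo.drop_duplicates(subset=['Animal ID'])

# Sort by time first, so 'first' now means the earliest visit
earliest_visit = (intakes_demo
                  .sort_values('DateTime')
                  .drop_duplicates(subset=['Animal ID'], keep='first'))

print(first_in_file)
print(earliest_visit)
```

Either way, dropping duplicates throws away repeat visits, so whether it is the right cleanup depends on the question being asked of the data.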

In [139]:
# Look at the most recent intake records
intakes.sort_values(by='DateTime').tail()

Unnamed: 0,Animal ID,Name,DateTime,MonthYear,Found Location,Intake Type,Intake Condition,Animal Type,Sex upon Intake,Age upon Intake,Breed,Color
136756,A852227,Georgia,2022-02-28 12:13:00,February 2022,Travis (TX),Owner Surrender,Normal,Cat,Intact Female,3 months,Domestic Shorthair Mix,Black
136750,A852230,,2022-02-28 12:15:00,February 2022,Austin (TX),Stray,Normal,Cat,Unknown,5 months,Domestic Shorthair,Brown Tabby
136754,A852234,,2022-02-28 12:33:00,February 2022,305 E Lola Lane in Austin (TX),Stray,Injured,Cat,Intact Female,2 years,Domestic Shorthair,Calico
136752,A852219,Nieves,2022-02-28 13:41:00,February 2022,45 And 183 in Travis (TX),Stray,Normal,Dog,Intact Female,3 years,Great Pyrenees,White
136758,A852238,A852238,2022-02-28 14:02:00,February 2022,Cameron And Crosspark Drive in Austin (TX),Stray,Normal,Dog,Intact Female,6 months,Labrador Retriever Mix,Black/White


# Level Up: Quick Column Name Clean Up Code

Here's a quick use of a lambda function to clean up every column name at once:

In [122]:
outcomes_renamed = outcomes.rename(columns = lambda x: x.replace(" ", "_").lower())
outcomes_renamed.head()

Unnamed: 0,animal_id,name,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,calculated_age_in_days,test_int
0,A794011,Chunk,2019-05-08 18:20:00,May 2019,2017-05-02,Rto-Adopt,,Cat,Neutered Male,2 years,Domestic Shorthair Mix,Brown Tabby/White,736,5
1,A776359,Gizmo,2018-07-18 16:02:00,Jul 2018,2017-07-12,Adoption,,Dog,Neutered Male,1 year,Chihuahua Shorthair Mix,White/Brown,371,5
2,A821648,,2020-08-16 11:38:00,Aug 2020,2019-08-16,Euthanasia,,Other,Unknown,1 year,Raccoon,Gray,366,5
3,A720371,Moose,2016-02-13 17:59:00,Feb 2016,2015-10-08,Adoption,,Dog,Neutered Male,4 months,Anatol Shepherd/Labrador Retriever,Buff,128,5
4,A674754,,2014-03-18 11:47:00,Mar 2014,2014-03-12,Transfer,Partner,Cat,Intact Male,6 days,Domestic Shorthair Mix,Orange Tabby,6,5
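
The same cleanup can also be written with the vectorized string methods on the column index. The sketch below uses an empty stand-in frame (`demo` is a hypothetical name) with a few of the column names from above and checks that both approaches agree.

```python
import pandas as pd

# Empty stand-in frame carrying a few of the original column names
demo = pd.DataFrame(columns=['Animal ID', 'Date of Birth', 'Outcome Type'])

# Approach from the cell above: rename calls the lambda once per column name
via_lambda = demo.rename(columns=lambda x: x.replace(' ', '_').lower())

# Alternative: operate on the whole Index with the .str accessor
via_str = demo.copy()
via_str.columns = via_str.columns.str.replace(' ', '_', regex=False).str.lower()

print(list(via_lambda.columns))
print(list(via_str.columns))
```

`rename` returns a new DataFrame and leaves the original alone, while assigning to `.columns` changes the frame in place, which is why the sketch copies `demo` first.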


# Level Up: `pandas.set_option()`

We can adjust how `pandas` works by setting options in advance.

For complete documentation, see [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).

## Block Scientific Notation

For example, suppose we want to prevent numbers from being displayed in scientific notation.

In [123]:
df = pd.DataFrame([[1e9, 2e9], [3e9, 4e9]])
df

Unnamed: 0,0,1
0,1000000000.0,2000000000.0
1,3000000000.0,4000000000.0


Then we can use:

In [124]:
pd.set_option('display.float_format', '{:,.2f}'.format)

df

Unnamed: 0,0,1
0,1000000000.0,2000000000.0
1,3000000000.0,4000000000.0
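
`pd.set_option` changes the setting for the rest of the session. If the formatting is only wanted for one display, `pd.option_context` applies it temporarily. This is a minimal sketch; `df_demo` is a stand-in with the same values as `df` above.

```python
import pandas as pd

df_demo = pd.DataFrame([[1e9, 2e9], [3e9, 4e9]])

# Remember whatever the setting was before we touch it
before = pd.get_option('display.float_format')

# Inside the with-block, floats get thousands separators and two decimals
with pd.option_context('display.float_format', '{:,.2f}'.format):
    formatted = df_demo.to_string()
    print(formatted)

# On leaving the block, the earlier setting is restored automatically
after = pd.get_option('display.float_format')
```

To undo a `pd.set_option` call made earlier in the session, `pd.reset_option('display.float_format')` puts the option back to its default.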


## See More Rows

Or suppose we want `pandas` to show more rows.

In [125]:
df2 = pd.DataFrame(np.arange(100))
df2

Unnamed: 0,0
0,0
1,1
2,2
3,3
4,4
...,...
95,95
96,96
97,97
98,98


In that case we can use:

In [126]:
pd.set_option('display.max_rows', 100)

df2

Unnamed: 0,0
0,0
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9
