# Aggregating and Combining `pandas` DataFrames

In [1]:
import pandas as pd
import numpy as np
import requests as rq
from sklearn.preprocessing import OneHotEncoder
from zipfile import ZipFile

## Learning Goals

- Use GroupBy objects to organize and aggregate data
- Create pivot tables from DataFrames
- Combine DataFrames by joining, merging, and concatenating

We'll work with the Austin Animal Center dataset and with data from King County's Department of Assessments (Seattle housing data).

### Austin Animal Center Data

In [2]:
data = rq.get('https://data.austintexas.gov/resource/9t4d-g238.json').text

In [3]:
animals = pd.read_json(data)

In [4]:
animals.head()

Unnamed: 0,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,name
0,A853501,2022-03-23 13:59:00,2022-03-23T13:59:00.000,2020-03-19T00:00:00.000,Euthanasia,Rabies Risk,Other,Unknown,2 years,Bat,Brown,
1,A853653,2022-03-23 08:35:00,2022-03-23T08:35:00.000,2021-03-22T00:00:00.000,Euthanasia,Rabies Risk,Other,Unknown,1 year,Bat,Brown,
2,A853381,2022-03-22 19:01:00,2022-03-22T19:01:00.000,2020-03-17T00:00:00.000,Adoption,,Dog,Spayed Female,2 years,Basenji Mix,Black/Tan,*Korinna
3,A534157,2022-03-22 18:22:00,2022-03-22T18:22:00.000,2008-05-30T00:00:00.000,Adoption,,Cat,Spayed Female,13 years,Domestic Shorthair Mix,Brown Tabby,Sasha Bell
4,A853445,2022-03-22 18:02:00,2022-03-22T18:02:00.000,2021-10-18T00:00:00.000,Adoption,,Cat,Spayed Female,5 months,Domestic Shorthair,Brown Tabby,Petunia


In [8]:
animals.dtypes

animal_id                   object
datetime            datetime64[ns]
monthyear                   object
date_of_birth               object
outcome_type                object
outcome_subtype             object
animal_type                 object
sex_upon_outcome            object
age_upon_outcome            object
breed                       object
color                       object
name                        object
dtype: object

## Aggregating over DataFrames: `.groupby()`

Those of you familiar with SQL have probably used the GROUP BY command. (And if you haven't, you'll see it very soon!) Pandas has this, too.

The `.groupby()` method splits a DataFrame into groups of rows that share a value (or a combination of values). It is most useful when followed by an aggregate function that is applied to each group separately.
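To make the grouping-then-aggregating idea concrete, here is a minimal sketch on a made-up DataFrame. The `toy_animals` frame and its `weight_lb` column are invented for illustration and are not part of the Austin data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Austin Animal Center dataset.
toy_animals = pd.DataFrame({
    'animal_type': ['Dog', 'Cat', 'Dog', 'Bird', 'Cat', 'Dog'],
    'weight_lb': [40.0, 9.0, 55.0, 1.0, 11.0, 25.0],
})

# Split the rows by animal_type, apply mean() to each group,
# and combine the results into one Series indexed by group name.
mean_weight = toy_animals.groupby('animal_type')['weight_lb'].mean()
print(mean_weight)
```

The result has one entry per distinct `animal_type`, sorted by group name by default.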

In [9]:
animals.groupby('animal_type')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f9385f23130>

In [10]:
animals.columns

Index(['animal_id', 'datetime', 'monthyear', 'date_of_birth', 'outcome_type',
       'outcome_subtype', 'animal_type', 'sex_upon_outcome',
       'age_upon_outcome', 'breed', 'color', 'name'],
      dtype='object')

We can group by multiple columns as well. Either way, `.groupby()` returns a [DataFrameGroupBy](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) object rather than a DataFrame; no aggregation happens until we ask for one.

### `.groups` and `.get_group()`

In [11]:
animals.groupby(['animal_type', 'outcome_type'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f93889f4af0>


In [12]:
# This returns each group indexed by its name (e.g. 'Bird'), along with the row indices of its members.
# .groups tells me which rows belong to which animal type.

animals.groupby('animal_type').groups

{'Bird': [455, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556], 'Cat': [3, 4, 6, 9, 10, 13, 14, 16, 17, 21, 24, 27, 31, 33, 36, 41, 42, 45, 46, 47, 48, 49, 52, 68, 70, 71, 75, 76, 85, 87, 95, 97, 101, 107, 108, 112, 117, 125, 128, 133, 135, 136, 139, 146, 149, 150, 151, 152, 154, 168, 169, 175, 180, 181, 183, 187, 189, 192, 198, 200, 205, 206, 207, 212, 215, 216, 219, 243, 244, 245, 246, 247, 255, 259, 261, 274, 276, 277, 278, 279, 283, 284, 291, 296, 297, 300, 305, 326, 329, 332, 333, 335, 352, 362, 364, 366, 367, 368, 380, 382, ...], 'Dog': [2, 5, 7, 8, 11, 12, 15, 18, 19, 20, 22, 23, 25, 26, 28, 29, 30, 32, 34, 37, 38, 39, 40, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 69, 72, 73, 74, 77, 78, 79, 80, 81, 82, 83, 84, 86, 88, 89, 90, 91, 92, 93, 94, 96, 98, 99, 100, 102, 103, 105, 106, 114, 115, 116, 118, 119, 120, 121, 122, 124, 126, 129, 130, 131, 132, 134, 137, 138, 140, 141, 142, 143, 144, 145, 147, 148, 153, 156, 157, 158, 159, 160

Once we know what type of object we are working with, a suite of attributes and methods opens up. One such attribute is `.groups`, used above; a related method is `.get_group()`.
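As a small sketch of what a GroupBy object exposes, the example below uses a hypothetical four-row frame (`toy` and its values are made up) to show `.groups`, iteration over the groups, and `.get_group()` side by side.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Cat', 'Dog', 'Bird'],
    'name': ['Oso', 'Hope', 'Jake', 'Kiwi'],
})
grouped = toy.groupby('animal_type')

# .groups maps each group label to the index labels of its rows.
group_rows = {label: list(idx) for label, idx in grouped.groups.items()}

# Iterating over a GroupBy yields (label, sub-DataFrame) pairs.
group_sizes = {label: len(sub_df) for label, sub_df in grouped}

# .get_group() returns the sub-DataFrame for a single label.
dogs = grouped.get_group('Dog')

print(group_rows)
print(group_sizes)
print(dogs)
```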

In [14]:
# Once we know the group indices, we can return the groups using those indices.
animals.groupby('animal_type').get_group('Dog')

Unnamed: 0,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,name
2,A853381,2022-03-22 19:01:00,2022-03-22T19:01:00.000,2020-03-17T00:00:00.000,Adoption,,Dog,Spayed Female,2 years,Basenji Mix,Black/Tan,*Korinna
5,A853426,2022-03-22 17:56:00,2022-03-22T17:56:00.000,2022-01-06T00:00:00.000,Adoption,,Dog,Neutered Male,2 months,Chinese Sharpei Mix,Buff/Black,*Flycatcher
7,A853629,2022-03-22 16:43:00,2022-03-22T16:43:00.000,2021-07-22T00:00:00.000,Return to Owner,Field,Dog,Intact Male,7 months,Siberian Husky Mix,White,Oso
8,A852933,2022-03-22 16:32:00,2022-03-22T16:32:00.000,2021-08-10T00:00:00.000,Return to Owner,,Dog,Spayed Female,7 months,Pit Bull,White/Gray,Indica
11,A852469,2022-03-22 15:06:00,2022-03-22T15:06:00.000,2022-01-19T00:00:00.000,Adoption,Foster,Dog,Neutered Male,2 months,German Shepherd Mix,Tricolor,*Pequeno
...,...,...,...,...,...,...,...,...,...,...,...,...
989,A851434,2022-02-13 13:15:00,2022-02-13T13:15:00.000,2010-02-12T00:00:00.000,Transfer,Partner,Dog,Neutered Male,12 years,Chihuahua Shorthair,Fawn/Tan,Taco Bell
990,A849153,2022-02-13 12:00:00,2022-02-13T12:00:00.000,2021-01-03T00:00:00.000,Transfer,Partner,Dog,Spayed Female,1 year,Rottweiler,Black/Brown,Natasha
994,A851303,2022-02-12 18:58:00,2022-02-12T18:58:00.000,2020-02-09T00:00:00.000,Transfer,Partner,Dog,Intact Female,2 years,Cairn Terrier Mix,Brown/White,A851303
997,A851330,2022-02-12 18:16:00,2022-02-12T18:16:00.000,2017-02-10T00:00:00.000,Adoption,,Dog,Neutered Male,5 years,Chihuahua Shorthair Mix,Buff,Jake


#### Multi-Indexing

In [15]:
# Same goes for multi index groupbys
animal_outcome = animals.groupby(['animal_type', 'outcome_type'])
animal_outcome.groups

{('Bird', 'Adoption'): [455, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552, 553, 554, 555, 556], ('Cat', 'Adoption'): [3, 4, 6, 9, 10, 13, 14, 16, 17, 21, 24, 27, 31, 33, 36, 52, 68, 75, 76, 87, 95, 97, 108, 112, 117, 133, 135, 136, 139, 146, 152, 168, 169, 175, 183, 187, 189, 192, 198, 200, 205, 206, 207, 215, 216, 259, 261, 274, 276, 277, 278, 279, 283, 291, 296, 297, 300, 305, 326, 329, 332, 333, 335, 352, 362, 366, 367, 368, 390, 392, 393, 394, 396, 399, 400, 404, 405, 412, 419, 420, 422, 436, 442, 443, 451, 467, 468, 471, 484, 490, 497, 517, 571, 578, 591, 592, 599, 605, 618, 619, ...], ('Cat', 'Died'): [586, 629], ('Cat', 'Euthanasia'): [85, 101, 125, 128, 154, 364, 473, 474, 475, 558, 625, 719, 803, 832, 911, 972, 998], ('Cat', 'Return to Owner'): [151, 284, 380, 459, 540, 598, 801, 802, 910, 922, 934, 985], ('Cat', 'Rto-Adopt'): [514], ('Cat', 'Transfer'): [41, 42, 45, 46, 47, 48, 49, 70, 71, 107, 149, 150, 180, 181, 212, 219, 243, 244, 245, 246, 247, 255, 382, 383

In [16]:
# animal_outcome.groups is a dictionary, so we can access the group names using keys()
animal_outcome.groups.keys()

dict_keys([('Bird', 'Adoption'), ('Cat', 'Adoption'), ('Cat', 'Died'), ('Cat', 'Euthanasia'), ('Cat', 'Return to Owner'), ('Cat', 'Rto-Adopt'), ('Cat', 'Transfer'), ('Dog', 'Adoption'), ('Dog', 'Died'), ('Dog', 'Euthanasia'), ('Dog', 'Return to Owner'), ('Dog', 'Rto-Adopt'), ('Dog', 'Transfer'), ('Other', 'Adoption'), ('Other', 'Disposal'), ('Other', 'Euthanasia'), ('Other', 'Return to Owner'), ('Other', 'Transfer')])

In [17]:
animal_outcome.groups.values()

dict_values([Int64Index([455, 541, 542, 543, 544, 545, 546, 547, 548, 549, 550, 551, 552,
            553, 554, 555, 556],
           dtype='int64'), Int64Index([  3,   4,   6,   9,  10,  13,  14,  16,  17,  21,
            ...
            961, 962, 963, 967, 974, 975, 980, 981, 984, 996],
           dtype='int64', length=179), Int64Index([586, 629], dtype='int64'), Int64Index([ 85, 101, 125, 128, 154, 364, 473, 474, 475, 558, 625, 719, 803,
            832, 911, 972, 998],
           dtype='int64'), Int64Index([151, 284, 380, 459, 540, 598, 801, 802, 910, 922, 934, 985], dtype='int64'), Int64Index([514], dtype='int64'), Int64Index([ 41,  42,  45,  46,  47,  48,  49,  70,  71, 107, 149, 150, 180,
            181, 212, 219, 243, 244, 245, 246, 247, 255, 382, 383, 384, 385,
            386, 387, 388, 428, 431, 432, 438, 518, 519, 520, 563, 581, 582,
            583, 584, 585, 621, 623, 624, 626, 737, 761, 804, 805, 806, 822,
            823, 824, 825, 841, 873, 889, 891, 892, 976, 979],


In [18]:
# We can then get a specific group, such as Cats that were adopted
animal_outcome.get_group(('Cat', 'Adoption'))

Unnamed: 0,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,animal_type,sex_upon_outcome,age_upon_outcome,breed,color,name
3,A534157,2022-03-22 18:22:00,2022-03-22T18:22:00.000,2008-05-30T00:00:00.000,Adoption,,Cat,Spayed Female,13 years,Domestic Shorthair Mix,Brown Tabby,Sasha Bell
4,A853445,2022-03-22 18:02:00,2022-03-22T18:02:00.000,2021-10-18T00:00:00.000,Adoption,,Cat,Spayed Female,5 months,Domestic Shorthair,Brown Tabby,Petunia
6,A852038,2022-03-22 17:54:00,2022-03-22T17:54:00.000,2021-02-28T00:00:00.000,Adoption,,Cat,Neutered Male,1 year,Domestic Shorthair,Black,*Minkus
9,A853444,2022-03-22 15:43:00,2022-03-22T15:43:00.000,2020-03-18T00:00:00.000,Adoption,,Cat,Spayed Female,2 years,Domestic Shorthair,Brown Tabby,Hope
10,A849460,2022-03-22 15:10:00,2022-03-22T15:10:00.000,2021-01-08T00:00:00.000,Adoption,Foster,Cat,Neutered Male,1 year,Domestic Shorthair,Orange Tabby,*Gingerbread
...,...,...,...,...,...,...,...,...,...,...,...,...
975,A850396,2022-02-14 16:08:00,2022-02-14T16:08:00.000,2020-01-31T00:00:00.000,Adoption,,Cat,Spayed Female,2 years,Domestic Shorthair,Brown Tabby,Lovebug
980,A851280,2022-02-14 15:06:00,2022-02-14T15:06:00.000,2020-06-09T00:00:00.000,Adoption,,Cat,Spayed Female,1 year,Domestic Medium Hair,White/Black,Gemma
981,A850855,2022-02-14 14:39:00,2022-02-14T14:39:00.000,2021-10-23T00:00:00.000,Adoption,,Cat,Neutered Male,3 months,Domestic Shorthair,Orange Tabby/White,*Clark
984,A846503,2022-02-14 12:33:00,2022-02-14T12:33:00.000,2021-09-08T00:00:00.000,Adoption,,Cat,Spayed Female,5 months,Domestic Shorthair Mix,Tortie,*Pandora


### Aggregating

GroupBy objects are meant to be used with aggregation. The same holds in SQL, as we will see: a query that includes GROUP BY must aggregate any selected column that is not part of the grouping.

We can use `.sum()`, `.mean()`, `.count()`, `.max()`, `.min()`, etc. Find a list of common aggregations [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html).
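Several aggregations can also be computed in one pass with `.agg()`. The sketch below uses pandas' named-aggregation syntax on a hypothetical frame; the column `age_years` and the output names `n_animals`, `mean_age`, and `oldest` are invented for this example.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Dog', 'Cat', 'Cat', 'Cat'],
    'age_years': [2, 6, 1, 3, 8],
})

# Each keyword names an output column; the tuple is (input column, aggregation).
summary = toy.groupby('animal_type').agg(
    n_animals=('age_years', 'count'),
    mean_age=('age_years', 'mean'),
    oldest=('age_years', 'max'),
)
print(summary)
```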

In [19]:
animals.groupby('animal_type').count()

animal_type,animal_id,datetime,monthyear,date_of_birth,outcome_type,outcome_subtype,sex_upon_outcome,age_upon_outcome,breed,color,name
Bird,17,17,17,17,17,0,17,17,17,17,1
Cat,273,273,273,273,273,106,273,273,273,273,235
Dog,649,649,649,649,649,243,649,649,649,649,602
Other,61,61,61,61,61,54,61,61,61,61,20


### Exercise

Use `.groupby()` to find the most recently born of each (main) animal type.

In [20]:
animals.groupby('animal_type')['date_of_birth'].max()

animal_type
Bird     2020-03-01T00:00:00.000
Cat      2022-03-19T00:00:00.000
Dog      2022-03-13T00:00:00.000
Other    2021-07-02T00:00:00.000
Name: date_of_birth, dtype: object

<details>
    <summary>Answer</summary>
    <code>animals.groupby('animal_type')['date_of_birth'].max()</code>
    </details>
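Note that `date_of_birth` is stored as strings (dtype `object` in `animals.dtypes`). Taking `.max()` still works because ISO-formatted timestamps sort chronologically, but it returns only the date. If the whole row is wanted, one option is to convert the column and use `.idxmax()`. The sketch below assumes a small made-up frame with the same column names.

```python
import pandas as pd

# Hypothetical rows that mimic the dataset's column names and date format.
toy = pd.DataFrame({
    'animal_type': ['Dog', 'Dog', 'Cat', 'Cat'],
    'name': ['Jake', 'Oso', 'Hope', 'Petunia'],
    'date_of_birth': ['2017-02-10T00:00:00.000', '2021-07-22T00:00:00.000',
                      '2020-03-18T00:00:00.000', '2021-10-18T00:00:00.000'],
})

# Convert the strings to real timestamps before comparing them.
toy['date_of_birth'] = pd.to_datetime(toy['date_of_birth'])

# idxmax() gives the index label of the latest birth date within each group.
youngest_idx = toy.groupby('animal_type')['date_of_birth'].idxmax()
youngest = toy.loc[youngest_idx]
print(youngest)
```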

## Pivoting a DataFrame

### `.pivot_table()`

Those of you familiar with Excel have probably used Pivot Tables. Pandas has similar functionality.

Grouping by two different columns can be very helpful.

In [21]:
animals.groupby(by=['outcome_type', 'sex_upon_outcome']).agg(len)

outcome_type,sex_upon_outcome,animal_id,datetime,monthyear,date_of_birth,outcome_subtype,animal_type,age_upon_outcome,breed,color,name
Adoption,Intact Female,26,26,26,26,26,26,26,26,26,26
Adoption,Intact Male,7,7,7,7,7,7,7,7,7,7
Adoption,Neutered Male,254,254,254,254,254,254,254,254,254,254
Adoption,Spayed Female,248,248,248,248,248,248,248,248,248,248
Died,Intact Female,1,1,1,1,1,1,1,1,1,1
Died,Intact Male,2,2,2,2,2,2,2,2,2,2
Died,Neutered Male,2,2,2,2,2,2,2,2,2,2
Died,Unknown,1,1,1,1,1,1,1,1,1,1
Disposal,Unknown,1,1,1,1,1,1,1,1,1,1
Euthanasia,Intact Female,7,7,7,7,7,7,7,7,7,7


But it has the unsavory side effect of creating a two-level index. This can be a good time to use `.pivot_table()`.

(There is also a `.pivot()`. For the somewhat subtle differences, see [here](https://stackoverflow.com/questions/30960338/pandas-difference-between-pivot-and-pivot-table-why-is-only-pivot-table-workin).)
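To illustrate how the two approaches relate, the sketch below builds the same wide table twice: once with a two-column `.groupby()` followed by `.unstack()`, and once with `.pivot_table()`. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data for illustration only.
toy = pd.DataFrame({
    'outcome': ['Adoption', 'Adoption', 'Transfer', 'Transfer', 'Adoption'],
    'sex': ['Female', 'Male', 'Female', 'Female', 'Male'],
    'animal_id': ['A1', 'A2', 'A3', 'A4', 'A5'],
})

# Long form: one row per (outcome, sex) pair, held in a two-level index.
long_counts = toy.groupby(['outcome', 'sex'])['animal_id'].count()

# Moving the 'sex' level into the columns gives the wide form.
wide_from_groupby = long_counts.unstack('sex')

# pivot_table() produces the wide form directly.
wide_from_pivot = toy.pivot_table(index='outcome', columns='sex',
                                  values='animal_id', aggfunc='count')

print(wide_from_groupby)
print(wide_from_pivot)
```

Combinations that never occur in the data (here, male transfers) show up as `NaN` in both versions.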

#### Example

In [23]:
df = pd.DataFrame({"sex": ["male", "male", "male", "male", "male",
                          "female", "female", "female", "female"],
                    "num_puppies": ["one", "one", "one", "two", "two",
                          "one", "one", "two", "two"],
                    "breed": ["terrier", "retriever", "retriever", "terrier",
                          "terrier", "retriever", "terrier", "terrier",
                          "retriever"],
                    "past_owners": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                    "family_members": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
df

Unnamed: 0,sex,num_puppies,breed,past_owners,family_members
0,male,one,terrier,1,2
1,male,one,retriever,2,4
2,male,one,retriever,2,5
3,male,two,terrier,3,5
4,male,two,terrier,3,6
5,female,one,retriever,4,6
6,female,one,terrier,5,8
7,female,two,terrier,6,9
8,female,two,retriever,7,9


In [24]:
# This first example aggregates values by taking the sum.

table = pd.pivot_table(df, values='past_owners', index=['sex', 'num_puppies'],
                     columns=['breed'], aggfunc='sum')
table

sex,num_puppies,retriever,terrier
female,one,4.0,5.0
female,two,7.0,6.0
male,one,4.0,1.0
male,two,,6.0


In [26]:
table.index

MultiIndex([('female', 'one'),
            ('female', 'two'),
            (  'male', 'one'),
            (  'male', 'two')],
           names=['sex', 'num_puppies'])

In [27]:
table.reset_index() # flatten

breed,sex,num_puppies,retriever,terrier
0,female,one,4.0,5.0
1,female,two,7.0,6.0
2,male,one,4.0,1.0
3,male,two,,6.0


#### Back to Austin animals

In [39]:
animals.pivot_table(index='outcome_type', columns='sex_upon_outcome', aggfunc=len)

Unnamed: 0_level_0,age_upon_outcome,age_upon_outcome,age_upon_outcome,age_upon_outcome,age_upon_outcome,animal_id,animal_id,animal_id,animal_id,animal_id,...,name,name,name,name,name,outcome_subtype,outcome_subtype,outcome_subtype,outcome_subtype,outcome_subtype
sex_upon_outcome,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown,...,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown,Intact Female,Intact Male,Neutered Male,Spayed Female,Unknown
outcome_type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
Adoption,26.0,7.0,254.0,248.0,,26.0,7.0,254.0,248.0,,...,26.0,7.0,254.0,248.0,,26.0,7.0,254.0,248.0,
Died,1.0,2.0,2.0,,1.0,1.0,2.0,2.0,,1.0,...,1.0,2.0,2.0,,1.0,1.0,2.0,2.0,,1.0
Disposal,,,,,1.0,,,,,1.0,...,,,,,1.0,,,,,1.0
Euthanasia,7.0,9.0,2.0,1.0,50.0,7.0,9.0,2.0,1.0,50.0,...,7.0,9.0,2.0,1.0,50.0,7.0,9.0,2.0,1.0,50.0
Return to Owner,30.0,43.0,41.0,33.0,2.0,30.0,43.0,41.0,33.0,2.0,...,30.0,43.0,41.0,33.0,2.0,30.0,43.0,41.0,33.0,2.0
Rto-Adopt,1.0,,8.0,5.0,,1.0,,8.0,5.0,,...,1.0,,8.0,5.0,,1.0,,8.0,5.0,
Transfer,63.0,54.0,57.0,44.0,8.0,63.0,54.0,57.0,44.0,8.0,...,63.0,54.0,57.0,44.0,8.0,63.0,54.0,57.0,44.0,8.0
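The table above repeats the same counts under every column name, because `len` is applied to each remaining column. When only the counts are of interest, `pd.crosstab()` is one way to get a single block. The sketch below uses a hypothetical five-row frame with the dataset's column names.

```python
import pandas as pd

# Hypothetical rows that reuse the dataset's column names.
toy = pd.DataFrame({
    'outcome_type': ['Adoption', 'Adoption', 'Transfer', 'Euthanasia', 'Adoption'],
    'sex_upon_outcome': ['Spayed Female', 'Neutered Male', 'Intact Male',
                         'Unknown', 'Neutered Male'],
})

# crosstab counts how often each (row value, column value) pair occurs
# and fills combinations that never occur with 0.
counts = pd.crosstab(toy['outcome_type'], toy['sex_upon_outcome'])
print(counts)
```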


### Exercise

Use `.pivot_table()` to add up the number of my tasks by category. Hint: Use `sum()` as your aggregating function.

In [28]:
tasks = pd.DataFrame({'category': ['house', 'house', 'school', 'school'],
                      'descr': ['kitchen', 'laundry', 'git', 'Python'],
                      'priority': [2, 3, 4, 1], 'num_tasks': [2, 1, 2, 3]})

tasks

Unnamed: 0,category,descr,priority,num_tasks
0,house,kitchen,2,2
1,house,laundry,3,1
2,school,git,4,2
3,school,Python,1,3


In [38]:
tasks.pivot_table(index='category', values='num_tasks', aggfunc='sum')

category,num_tasks
house,3
school,5


<details>
    <summary>Answer</summary>
    <code>tasks.pivot_table(values='num_tasks', index='category', aggfunc='sum')</code>
    </details>

## Methods for Combining DataFrames: `.join()`, `.merge()`, `pd.concat()`

### `.join()`

In [40]:
toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'MP'])

toy1

Unnamed: 0,age,HP
0,63,142
1,33,47


In [41]:
toy2

Unnamed: 0,age,MP
0,63,100
1,33,200


In [42]:
# .join() aligns on the index, and both frames have an 'age' column,
# so pandas raises an error unless we supply suffixes to tell the two apart.

toy1.join(toy2)

ValueError: columns overlap but no suffix specified: Index(['age'], dtype='object')

In [43]:
toy1.join(toy2, lsuffix='1', rsuffix='2')

Unnamed: 0,age1,HP,age2,MP
0,63,142,63,100
1,33,47,33,200


If we don't want to keep both, we could set the overlapping column as the index in each DataFrame:

In [45]:
toy1.set_index('age').join(toy2.set_index('age'))

age,HP,MP
63,142,100
33,47,200


For more on this method, check out the [doc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html)!

### `.merge()`

Or we could use `.merge()`, which by default joins on the columns the two DataFrames have in common:

In [44]:
toy1.merge(toy2)

Unnamed: 0,age,HP,MP
0,63,142,100
1,33,47,200
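With no arguments, `.merge()` picks the shared column (`age` here) as the key and performs an inner join. The sketch below spells those defaults out and adds `validate='one_to_one'`, an optional check that each key appears at most once on either side.

```python
import pandas as pd

toy1 = pd.DataFrame([[63, 142], [33, 47]], columns=['age', 'HP'])
toy2 = pd.DataFrame([[63, 100], [33, 200]], columns=['age', 'MP'])

# Relies on the defaults: join on the common column(s), how='inner'.
implicit = toy1.merge(toy2)

# The same merge with the key and join type written out explicitly.
# validate raises a MergeError if a key is duplicated on either side.
explicit = toy1.merge(toy2, on='age', how='inner', validate='one_to_one')

print(explicit)
```

Writing `on=` explicitly tends to be safer on real data, where two frames may share column names that are not meant to be keys.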


In [46]:
ds_chars = pd.read_csv('data/ds_chars.csv', index_col=0)
ds_chars

Unnamed: 0,name,HP,home_state
0,greg,200,WA
1,miles,200,WA
2,alan,170,TX
3,alison,300,DC
4,rachel,200,TX


In [54]:
states = pd.read_csv('data/states.csv', index_col=0)
states

Unnamed: 0,state,nickname,capital
0,WA,evergreen,Olympia
1,TX,alamo,Austin
2,DC,district,Washington
3,OH,buckeye,Columbus
4,OR,beaver,Salem


### The `how` Parameter

This parameter in both `.join()` and `.merge()` tells pandas which kind of join to perform (for example `'inner'`, `'outer'`, `'left'`, or `'right'`). We'll cover this in detail when we discuss SQL.

In [48]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='inner')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,WA,evergreen,Olympia
2,alan,170,TX,TX,alamo,Austin
3,rachel,200,TX,TX,alamo,Austin
4,alison,300,DC,DC,district,Washington


In [49]:
ds_chars.merge(states,
               left_on='home_state',
               right_on='state',
               how='outer')

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200.0,WA,WA,evergreen,Olympia
1,miles,200.0,WA,WA,evergreen,Olympia
2,alan,170.0,TX,TX,alamo,Austin
3,rachel,200.0,TX,TX,alamo,Austin
4,alison,300.0,DC,DC,district,Washington
5,,,,OH,buckeye,Columbus
6,,,,OR,beaver,Salem


### `pd.concat()`

This function takes a *list* (or other sequence) of pandas objects as its first argument.

In [50]:
ds_full = pd.concat([ds_chars, states])
ds_full

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200.0,WA,,,
1,miles,200.0,WA,,,
2,alan,170.0,TX,,,
3,alison,300.0,DC,,,
4,rachel,200.0,TX,,,
0,,,,WA,evergreen,Olympia
1,,,,TX,alamo,Austin
2,,,,DC,district,Washington
3,,,,OH,buckeye,Columbus
4,,,,OR,beaver,Salem


`pd.concat()`, like many other pandas operations, makes use of an `axis` parameter. For this particular function I need to specify whether I want to concatenate the DataFrames *row-wise* (`axis=0`) or *column-wise* (`axis=1`). The default is `axis=0`, so let's override that!

In [51]:
ds_full = pd.concat([ds_chars, states], axis=1)
ds_full

Unnamed: 0,name,HP,home_state,state,nickname,capital
0,greg,200,WA,WA,evergreen,Olympia
1,miles,200,WA,TX,alamo,Austin
2,alan,170,TX,DC,district,Washington
3,alison,300,DC,OH,buckeye,Columbus
4,rachel,200,TX,OR,beaver,Salem
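Two details of `pd.concat()` are easy to miss: row-wise concatenation keeps the original index labels (hence the repeated 0 to 4 earlier), and column-wise concatenation lines rows up by index label, not by position. The sketch below shows both on tiny made-up frames `a`, `b`, and `c`.

```python
import pandas as pd

# Hypothetical frames for illustration only.
a = pd.DataFrame({'x': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'x': [3, 4]}, index=[0, 1])
c = pd.DataFrame({'y': [10, 20]}, index=[1, 2])

# Row-wise: index labels are carried over, so 0 and 1 each appear twice.
stacked = pd.concat([a, b])

# ignore_index=True discards the old labels and renumbers from 0.
renumbered = pd.concat([a, b], ignore_index=True)

# Column-wise: rows are matched on index label; gaps become NaN.
side_by_side = pd.concat([a, c], axis=1)

print(stacked)
print(renumbered)
print(side_by_side)
```

This is why the column-wise result above pairs each character with whichever state shares its row number, not with the character's home state.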


## King County Assessments

As data scientists, we want to build a model to predict the sale price of a house in Seattle in 2021, based on its square footage. We know that the King County Department of Assessments has comprehensive data available on real property sales in the Seattle area. We need to prepare the data.

### First, get the data!

Go [here](https://info.kingcounty.gov/assessor/DataDownload/default.aspx) and download two files: "Real Property Sales" and "Residential Building". Then unzip them. (Or you can run the cells below if you prefer.)

In [None]:
# %%bash
# cd data
# curl -o property_sales.zip https://aqua.kingcounty.gov/extranet/assessor/Real%20Property%20Sales.zip 

In [None]:
# %%bash
# cd data
# curl -o res_bldg.zip https://aqua.kingcounty.gov/extranet/assessor/Residential%20Building.zip 

In [None]:
# zf = ZipFile('data/property_sales.zip', 'r')
# zf.extractall('data')
# zf.close()

In [None]:
# zf = ZipFile('data/res_bldg.zip', 'r')
# zf.extractall('data')
# zf.close()

In [55]:
# You'll need to use a new encoding here. List of all encodings here:
# https://docs.python.org/3/library/codecs.html#standard-encodings

# Both of these csv files have many columns, so we'll just pre-select
# which ones we want to use.

sales_df = pd.read_csv('data/EXTR_RPSale.csv',
                       encoding='latin-1',
                       usecols=['Major', 'Minor', 'DocumentDate', 'SalePrice'])


In [56]:
sales_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2220587 entries, 0 to 2220586
Data columns (total 4 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   Major         object
 1   Minor         object
 2   DocumentDate  object
 3   SalePrice     int64 
dtypes: int64(1), object(3)
memory usage: 67.8+ MB


In [57]:
bldg_df = pd.read_csv('data/EXTR_ResBldg.csv',
                     usecols=['Major', 'Minor', 'SqFtTotLiving', 'ZipCode'])


In [58]:
bldg_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 522030 entries, 0 to 522029
Data columns (total 4 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   Major          522030 non-null  int64 
 1   Minor          522030 non-null  int64 
 2   ZipCode        470868 non-null  object
 3   SqFtTotLiving  522030 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 15.9+ MB


In [111]:
sales_data = pd.merge(sales_df, bldg_df, on=['Major', 'Minor'])
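Note from the `.info()` outputs that `Major` and `Minor` are `object` in `sales_df` but `int64` in `bldg_df`. Merging on keys of different types can silently miss matches or raise an error, depending on the contents. One hedged way to guard against that is to coerce the keys to a common integer type first. The sketch below uses made-up rows; the `'bad'` value is a hypothetical example of an unparseable key, and the real file may look different.

```python
import pandas as pd

# Hypothetical rows: keys arrive as a mix of strings and ints on one side.
sales = pd.DataFrame({
    'Major': ['4000', 4000, 'bad'],
    'Minor': ['228', 228, '1'],
    'SalePrice': [100, 200, 300],
})
bldg = pd.DataFrame({'Major': [4000], 'Minor': [228], 'SqFtTotLiving': [1560]})

keys = ['Major', 'Minor']

# Turn every key into a number; anything unparseable becomes NaN.
for key in keys:
    sales[key] = pd.to_numeric(sales[key], errors='coerce')

# Drop rows whose key could not be parsed, then match bldg's int64 dtype.
sales = sales.dropna(subset=keys).astype({key: 'int64' for key in keys})

merged = sales.merge(bldg, on=keys, how='inner')
print(merged)
```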

In [60]:
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,ZipCode,SqFtTotLiving
0,4000,228,04/29/1997,103500,98168,1560
1,4000,228,10/15/2014,221900,98168,1560
2,4000,228,08/28/2020,0,98168,1560
3,4000,228,05/06/2005,198000,98168,1560
4,4000,228,04/26/2019,369000,98168,1560


In [65]:
sales_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1544897 entries, 0 to 1544896
Data columns (total 6 columns):
 #   Column         Non-Null Count    Dtype 
---  ------         --------------    ----- 
 0   Major          1544897 non-null  object
 1   Minor          1544897 non-null  object
 2   DocumentDate   1544897 non-null  object
 3   SalePrice      1544897 non-null  int64 
 4   ZipCode        1404584 non-null  object
 5   SqFtTotLiving  1544897 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 82.5+ MB


We can see right away that we're missing ZIP codes for many of the sales transactions.

In [62]:
sales_data.loc[sales_data['ZipCode'].isna()].head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,ZipCode,SqFtTotLiving
28,226700,160,05/08/2003,0,,1560
29,226700,160,05/11/1996,0,,1560
30,226700,160,09/08/2011,0,,1560
31,226700,160,10/09/2018,855000,,1560
32,226700,160,01/24/2020,0,,1560


### Exercise

What percentage of housing records are missing ZIP codes? (The code below returns a fraction; multiply by 100 for a percentage.)

In [87]:
sales_data['ZipCode'].isna().sum() / len(sales_data)

0.09082353063019735

<details>
    <summary>Answer</summary>
    <code>sales_data['ZipCode'].isna().sum() / sales_data.shape[0]</code>
    </details>
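An equivalent shortcut: because `True` counts as 1 and `False` as 0, the mean of a boolean Series is the fraction of `True` values. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical ZIP code column with two missing entries out of five.
zips = pd.Series(['98168', np.nan, '98033', np.nan, '98115'], name='ZipCode')

# isna() gives booleans; their mean is the share of missing values.
frac_missing = zips.isna().mean()
pct_missing = 100 * frac_missing
print(f'{pct_missing:.1f}% missing')
```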

Let's drop the rows with missing ZIP codes.

In [112]:
sales_data = sales_data.dropna(subset=['ZipCode'], how='all')
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,ZipCode,SqFtTotLiving
0,4000,228,04/29/1997,103500,98168,1560
1,4000,228,10/15/2014,221900,98168,1560
2,4000,228,08/28/2020,0,98168,1560
3,4000,228,05/06/2005,198000,98168,1560
4,4000,228,04/26/2019,369000,98168,1560


In [89]:
sales_data.isna().sum()

Major            0
Minor            0
DocumentDate     0
SalePrice        0
ZipCode          0
SqFtTotLiving    0
dtype: int64

In [None]:
# Equivalent to the dropna() call above: keep only rows whose ZipCode is present
sales_data = sales_data.loc[sales_data['ZipCode'].notna(), :]
sales_data.head()

## Time Permitting: Data Cleaning with Pandas

### 1. Investigate and drop rows with invalid values in the SalePrice and SqFtTotLiving columns.

In [90]:
sales_data.describe()

Unnamed: 0,SalePrice,SqFtTotLiving
count,1404584.0,1404584.0
mean,308971.3,2109.228
std,758983.3,977.6886
min,-400.0,0.0
25%,0.0,1440.0
50%,167500.0,1940.0
75%,380000.0,2570.0
max,37500000.0,48160.0


In [74]:
sales_data['SalePrice'].min()

-400

In [113]:
sales_data = sales_data[sales_data['SalePrice'] > 10000]

<details>
    <summary>One possible answer here</summary>
    <code>sales_data = sales_data[sales_data['SalePrice'] > 10000]</code>
    </details>
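The heading also mentions `SqFtTotLiving`, whose minimum in `.describe()` is 0. A possible extension, sketched here on made-up rows, combines both conditions in one boolean mask. The `MIN_PRICE` name is introduced for this example, and the positive-square-footage rule is an assumption, not something the notebook specifies.

```python
import pandas as pd

# Hypothetical rows covering the problem cases seen in describe().
toy_sales = pd.DataFrame({
    'SalePrice': [-400, 0, 369000, 855000, 500000],
    'SqFtTotLiving': [1560, 1560, 1560, 0, 2500],
})

MIN_PRICE = 10_000  # same threshold as the answer above

# Combine conditions element-wise with &; each side needs its own parentheses.
valid = (toy_sales['SalePrice'] > MIN_PRICE) & (toy_sales['SqFtTotLiving'] > 0)
clean = toy_sales.loc[valid].copy()
print(clean)
```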

### 2. Investigate and handle non-numeric ZipCode values

Can you find a way to shorten ZIP+4 codes to the first five digits?

In [114]:
sales_data['ZipCode'].dtype

dtype('O')

In [93]:
sales_dataX = sales_data.copy()

In [105]:
sales_dataX['ZipCode'].sample(10)

399198     98055
1006554    98107
19269      98059
1467688    98022
1330983    98136
273619     98032
1186659    98072
363511     98040
1402524    98115
1536754    98125
Name: ZipCode, dtype: object

In [120]:
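def is_integer(x):
    """Return True if x can be converted to an int as-is."""
    try:
        int(x)
    except (ValueError, TypeError):
        return False
    return True

# Show ZIP codes that cannot be read as integers (e.g. '98033.0')
sales_data.loc[~sales_data['ZipCode'].apply(is_integer), 'ZipCode'].head()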

13    98033.0
14    98033.0
15    98033.0
16    98033.0
17    98033.0
Name: ZipCode, dtype: object

In [121]:
sales_data['ZipCode'].dtype

dtype('O')

In [126]:
sales_data['ZipCode'].sample(10)

906168     98092
423954     98053
1304365    98075
530335     98108
1233248    98092
27086      98117
417839     98058
1039819    98055
774307     98177
464306     98117
Name: ZipCode, dtype: object

In [127]:
def five_digit_ZIP(x):
    """Truncate a ZIP or ZIP+4 value to its first five digits."""
    try:
        return int(str(x)[:5])
    except ValueError:
        # Leave unparseable values alone; they are filtered out below
        return x

sales_data['ZipCode'] = sales_data['ZipCode'].map(five_digit_ZIP)
sales_data = sales_data.loc[sales_data['ZipCode'].apply(is_integer)].copy()
sales_data['ZipCode'] = sales_data['ZipCode'].map(int)

<details>
    <summary>One possible answer here</summary>
    <code>def five_digit_ZIP(x):
    try:
        return int(str(x)[:5])
    except ValueError:
        return x
sales_data['ZipCode'] = sales_data['ZipCode'].map(five_digit_ZIP)
sales_data = sales_data.loc[sales_data['ZipCode'].apply(is_integer)].copy()
sales_data['ZipCode'] = sales_data['ZipCode'].map(int)</code>
    </details>
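A vectorized alternative is to pull the leading five digits out with a regular expression via the `.str` accessor. This is a sketch on a made-up Series, not the notebook's method; the sample values (including `'A'`) are hypothetical stand-ins for the kinds of entries the real column might contain.

```python
import numpy as np
import pandas as pd

# Hypothetical mix of plain ZIPs, float-like strings, ZIP+4, ints, and junk.
raw = pd.Series(['98168', '98033.0', '98115-1234', 98052, np.nan, 'A'],
                name='ZipCode')

# Convert everything to text, then capture exactly five leading digits.
# Values that do not start with five digits become NaN.
five = raw.astype(str).str.extract(r'^(\d{5})', expand=False)

# Drop the unparseable entries and convert the rest to integers.
zip_int = five.dropna().astype('int64')
print(zip_int)
```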

In [128]:
sales_data.head(2)

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,ZipCode,SqFtTotLiving
0,4000,228,04/29/1997,103500,98168,1560
1,4000,228,10/15/2014,221900,98168,1560


### 3. Add a column for PricePerSqFt



In [129]:
sales_data['PricePerSqFt'] =  sales_data['SalePrice']/sales_data['SqFtTotLiving']
sales_data['PricePerSqFt'].head()

0     66.346154
1    142.243590
3    126.923077
4    236.538462
7    217.415730
Name: PricePerSqFt, dtype: float64

<details>
    <summary>Answer here</summary>
    <code>sales_data['PricePerSqFt'] = sales_data['SalePrice'] / sales_data['SqFtTotLiving']</code>
    </details>
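One caveat: any row with `SqFtTotLiving` equal to 0 produces an infinite ratio (pandas does not raise on division by zero), which would distort a later mean. A minimal sketch of one way to neutralize that, on made-up rows:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second has zero living area.
toy = pd.DataFrame({'SalePrice': [369000, 855000],
                    'SqFtTotLiving': [1560, 0]})

# Integer division by zero yields inf rather than an exception.
ratio = toy['SalePrice'] / toy['SqFtTotLiving']

# Replace infinities with NaN so that mean() and friends skip them.
toy['PricePerSqFt'] = ratio.replace([np.inf, -np.inf], np.nan)
print(toy)
```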

### 4. Subset the data to 2021 sales only.

We can assume that the DocumentDate is approximately the sale date.

In [132]:
sales_data['DocumentDate'] = pd.to_datetime(sales_data['DocumentDate'])

In [133]:
sales_data['DocumentDate'].dtype

dtype('<M8[ns]')

In [134]:
sales_data['DocumentDate'].sample(2)

451787    2000-09-06
1512386   1997-07-30
Name: DocumentDate, dtype: datetime64[ns]

In [155]:
sales_data = sales_data[sales_data['DocumentDate'] >= '2021-01-01']
sales_data = sales_data[sales_data['DocumentDate'] < '2022-01-01']
sales_data.head()

Unnamed: 0,Major,Minor,DocumentDate,SalePrice,ZipCode,SqFtTotLiving,PricePerSqFt
38,891050,230,2021-12-13,920000,98133,2770,332.13
46,118000,275,2021-09-29,311000,98178,2880,107.99
147,923890,1045,2021-12-17,726000,98136,1230,590.24
174,251701,640,2021-03-22,605000,98042,2290,264.19
222,329370,160,2021-03-24,815000,98133,2500,326.0


<details>
    <summary>Answer here</summary>
    <code>sales_data['DocumentDate'] = pd.to_datetime(sales_data['DocumentDate'])
sales_data = sales_data.loc[(sales_data['DocumentDate'] >= '2021-01-01') & (sales_data['DocumentDate'] < '2022-01-01')]</code>
    </details>
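Once the column holds real timestamps, the `.dt` accessor offers a more direct way to select a calendar year. The sketch below uses made-up rows in the same `MM/DD/YYYY` format as the raw file.

```python
import pandas as pd

# Hypothetical rows in the raw file's date format.
toy = pd.DataFrame({
    'DocumentDate': ['04/29/1997', '12/13/2021', '01/05/2022', '03/22/2021'],
    'SalePrice': [103500, 920000, 700000, 605000],
})

# Passing the format explicitly avoids ambiguity between day and month.
toy['DocumentDate'] = pd.to_datetime(toy['DocumentDate'], format='%m/%d/%Y')

# .dt.year extracts the year of each timestamp as an integer.
sales_2021 = toy[toy['DocumentDate'].dt.year == 2021]
print(sales_2021)
```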

### 5. What is the mean price per square foot for a house sold in Seattle in 2021?

In [156]:
sales_data['PricePerSqFt'].mean()

507.33230872888953

<details>
    <summary>Answer here</summary>
    <code>sales_data['PricePerSqFt'].mean()</code>
    </details>

## Level Up: `pandas.set_option()`

We can adjust how `pandas` works by setting options in advance.

### Block Scientific Notation

For example, suppose we want to prevent numbers from being displayed in scientific notation.

In [142]:
df = pd.DataFrame([[1e9, 2e9], [3e9, 4e9]])
df

Unnamed: 0,0,1
0,1000000000.0,2000000000.0
1,3000000000.0,4000000000.0


Then we can use:

In [143]:
pd.set_option('display.float_format', '{:.2f}'.format)
df

Unnamed: 0,0,1
0,1000000000.0,2000000000.0
1,3000000000.0,4000000000.0


### See More Rows

Or suppose we want `pandas` to show more rows.

In [144]:
df2 = pd.DataFrame(np.array(range(100)))
df2

Unnamed: 0,0
0,0
1,1
2,2
3,3
4,4
...,...
95,95
96,96
97,97
98,98


In that case we can use:

In [145]:
pd.set_option('display.max_rows', 100)

df2

Unnamed: 0,0
0,0
1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9


For complete documentation, see [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html).
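Options set with `pd.set_option()` stay in effect for the rest of the session. When a setting is only needed briefly, `pd.option_context()` applies it inside a `with` block and restores the previous value afterwards. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[1e9, 2e9], [3e9, 4e9]])

# Options are passed as alternating name/value pairs.
with pd.option_context('display.float_format', '{:.2f}'.format,
                       'display.max_rows', 100):
    inside = pd.get_option('display.max_rows')
    text = df.to_string()  # rendered with the temporary float format

# Outside the block, the earlier settings are back in force.
outside = pd.get_option('display.max_rows')

print(text)
print(inside, outside)
```

To undo a `set_option()` call explicitly, `pd.reset_option('display.max_rows')` returns that option to its default.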