
***Sleep Faster! - Arnold Schwarzenegger***

# 1. Introduction to Pandas

**LEARNING OBJECTIVES**
*After this lesson, you will be able to:*
- Inspect data types
- Clean up a column using `df.apply()`
- Know when to use `.value_counts()` in your code

Since we're starting to get pretty comfortable with using pandas to do EDA, let's add a
couple more tools to our toolbox.

The main data types stored in pandas objects are float, int, bool, datetime64[ns], datetime64[ns, tz], timedelta64[ns],
category, and object.

`df.apply()` applies a function along either axis of the DataFrame. We'll see it in action below.

`pandas.Series.value_counts` returns a Series containing counts of unique values. The resulting
Series is in descending order, so the first element is the most frequently occurring
value. NA values are excluded by default.

- Examples of [dtypes](http://pandas.pydata.org/pandas-docs/stable/pandas.pdf).
- Examples of [value_counts](http://nullege.com/codes/search/pandas.Series.value_counts).

In [0]:
import pandas as pd
import numpy as np

Assigning a datatype with `.astype()`. Note that casting floats to `int` drops the fractional part, so 6.2 becomes 6.

In [0]:
a=pd.Series([1, 2, 3, 4, 5, 6.2])
b=a.astype("int")
print(a)
print(b)

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.2
dtype: float64
0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64
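
To make the truncation explicit, here is a minimal sketch on made-up values (the Series contents are illustrative, not from the lesson data). It contrasts a plain `astype("int")` with rounding first.

```python
import pandas as pd

# Illustrative values chosen to show the difference clearly
a = pd.Series([1.7, -1.7, 6.2])

# astype("int") drops the fractional part (truncation toward zero)
truncated = a.astype("int")

# Rounding first gives the nearest integer instead
rounded = a.round().astype("int")

print(truncated.tolist())  # [1, -1, 6]
print(rounded.tolist())    # [2, -2, 6]
```

If nearest-integer behaviour is what you want, round before casting.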


Printing the data types

In [0]:
print("a: "+str(a.dtypes))
print("b: "+str(b.dtypes))

a: float64
b: int64


Create a pandas series with letters as the index:

In [0]:
c = pd.Series(np.random.randn(7), index=['a', 'b', 'c', 'd', 'e', 'f', 'g'])  
print(c)

As above, but without the `index` argument: the index falls back to the default, which is integers starting from 0.


In [0]:
d = pd.Series(np.random.randn(7))  
print(d)

0   -0.344869
1   -0.322142
2   -0.256439
3   -0.908999
4    1.492016
5   -0.313876
6   -0.771880
dtype: float64


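`loc` selects rows by index label, which stays attached to each row; `iloc` selects by integer position, which changes if the order of the rows changes. Only the last expression in a cell is displayed, so the output below comes from `c.iloc[0:3]`; `c.loc["a":"c"]` returns the same three rows, because label slices include the end label.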

In [0]:
c.loc["a":"c"]
c.iloc[0:3]

a   -1.245113
b    0.513797
c   -0.107182
dtype: float64
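
The difference between the two accessors is easier to see once the rows are reordered. The sketch below uses a small made-up Series (names and values are illustrative only).

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = s.loc["a":"c"]   # label slice: includes the end label "c"
by_position = s.iloc[0:3]   # position slice: excludes position 3

# After reordering, labels still find the same values; positions do not
shuffled = s.sort_values(ascending=False)  # index order is now d, c, b, a

print(by_label.equals(by_position))  # True
print(shuffled.loc["a"])             # 10
print(shuffled.iloc[0])              # 40
```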

Be aware that null values (i.e. NaN) force integer data to be cast to floats, and you cannot cast them back to a NumPy integer dtype while the nulls are still present.


In [0]:
d=pd.Series([1,2,2,np.nan,4])
print(d.dtypes)

float64


This cell will not run:

In [0]:
e=d.astype("int")

ValueError: Cannot convert NA to integer
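
Two possible workarounds, sketched under the assumption that you are on pandas 0.24 or later (where the nullable `Int64` dtype, with a capital I, is available): keep the gap as a missing value in a nullable integer column, or fill it before casting. Filling with 0 is only an illustration; whether a fill value makes sense depends on your data.

```python
import numpy as np
import pandas as pd

d = pd.Series([1, 2, 2, np.nan, 4])

# Option 1: the nullable integer dtype keeps the gap as <NA>
nullable = d.astype("Int64")

# Option 2: fill the gap first, then cast to a NumPy integer
filled = d.fillna(0).astype("int64")

print(nullable.dtype)   # Int64
print(filled.tolist())  # [1, 2, 2, 0, 4]
```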

A very useful method is `value_counts`, which tells you how many times each value occurs.


In [0]:
d.value_counts()

2.0    2
4.0    1
1.0    1
dtype: int64

However, be aware that by default `value_counts` ignores null values; to include them, pass `dropna=False`. (Another good argument to know is `ascending`; also try `sort`.)


In [0]:
d.value_counts(dropna=False, ascending=True)

NaN     1
 1.0    1
 4.0    1
 2.0    2
dtype: int64
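
Another argument that may be worth knowing, beyond those mentioned above, is `normalize=True`, which returns proportions instead of raw counts. A minimal sketch on the same values:

```python
import numpy as np
import pandas as pd

d = pd.Series([1, 2, 2, np.nan, 4])

# Proportions of the non-null values (NaN is still dropped by default)
shares = d.value_counts(normalize=True)

print(shares.loc[2.0])  # 0.5
```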

Boolean indexing lets us select values based on whether a condition is `True` or `False`. The tilde `~` reverses the logic, so `False` becomes `True`.
<br>Note that you can achieve this step without the `.loc`, but it's a good idea to get used to using it.

In [0]:
f = pd.Series(range(-3, 4))
print(f)

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64


In [0]:
print(f.loc[f>0])
# Parenthesise the condition: `~f>0` would be evaluated as `(~f) > 0`
print(f.loc[~(f > 0)])

4    1
5    2
6    3
dtype: int64
0   -3
1   -2
2   -1
3    0
dtype: int64
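
One pitfall when negating a condition is operator precedence. The sketch below, using the same Series `f`, shows how the result changes depending on whether the comparison is parenthesised.

```python
import pandas as pd

f = pd.Series(range(-3, 4))

# `~` binds more tightly than `>`, so this is (~f) > 0.
# On integers, ~x equals -x - 1, so only -3 and -2 pass.
without_parens = f.loc[~f > 0]

# Parenthesising the comparison negates the boolean mask instead
with_parens = f.loc[~(f > 0)]

print(without_parens.tolist())  # [-3, -2]
print(with_parens.tolist())     # [-3, -2, -1, 0]
```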


You can combine conditions: the `OR` operator is the pipe `|` and the `AND` operator is the ampersand `&`.
<br> Note that each condition must be wrapped in parentheses.


In [0]:
print(f.loc[(f < -1) | (f > 2)])
print(f.loc[(f>0)&(f>1)])

0   -3
1   -2
6    3
dtype: int64
5    2
6    3
dtype: int64


Let's create a dataframe with some assorted datatypes


In [0]:
g = pd.DataFrame(dict(A = np.random.rand(3),
                        B = 1,
                        C = 'foo',
                        D = pd.Timestamp('2001-01-02'),
                        E = pd.Series([1.0]*3).astype('float32'),
                        F = False,
                        G = pd.Series([1]*3,dtype='int8')))
g

Unnamed: 0,A,B,C,D,E,F,G
0,0.383265,1,foo,2001-01-02,1.0,False,1
1,0.785179,1,foo,2001-01-02,1.0,False,1
2,0.139806,1,foo,2001-01-02,1.0,False,1


In [0]:
g.dtypes

A           float64
B             int64
C            object
D    datetime64[ns]
E           float32
F              bool
G              int8
dtype: object

`g.dtypes.value_counts()` counts how many columns have each dtype. Older pandas versions also offered `g.get_dtype_counts()` for the same purpose, but it was deprecated in pandas 0.25 and removed in 1.0.


In [0]:
print(g.dtypes.value_counts())

bool              1
int64             1
object            1
int8              1
float32           1
float64           1
datetime64[ns]    1
dtype: int64


A common operation is replacing string values. Thankfully, there is no need to iterate over rows. Note that `.replace()` on a Series swaps whole cell values that match exactly.


In [0]:
g["H"]=g["C"].replace("foo", "bar")
g

Unnamed: 0,A,B,C,D,E,F,G,H
0,0.383265,1,foo,2001-01-02,1.0,False,1,bar
1,0.785179,1,foo,2001-01-02,1.0,False,1,bar
2,0.139806,1,foo,2001-01-02,1.0,False,1,bar
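
`Series.replace` and `Series.str.replace` are easy to confuse. The sketch below, on made-up strings, shows that the first matches whole values while the second matches substrings.

```python
import pandas as pd

# Illustrative values, not from the lesson data
s = pd.Series(["foo", "foobar", "baz"])

whole = s.replace("foo", "bar")                       # entire cell must equal "foo"
partial = s.str.replace("foo", "bar", regex=False)    # replaces "foo" inside each string

print(whole.tolist())    # ['bar', 'foobar', 'baz']
print(partial.tolist())  # ['bar', 'barbar', 'baz']
```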


Arithmetic on a column is applied to every row at once, with no iteration needed.


In [0]:
g["I"]=(g["E"]*2)+50
g

Unnamed: 0,A,B,C,D,E,F,G,H,I
0,0.383265,1,foo,2001-01-02,1.0,False,1,bar,52.0
1,0.785179,1,foo,2001-01-02,1.0,False,1,bar,52.0
2,0.139806,1,foo,2001-01-02,1.0,False,1,bar,52.0


You can use `apply` to run a function over each column or each row of a DataFrame; element-wise functions such as `np.sqrt` are applied to every cell.

In [0]:
h = pd.DataFrame(np.random.randn(5, 4), columns=['a', 'b', 'c', 'd'])
h

Unnamed: 0,a,b,c,d
0,1.370846,2.293187,0.415357,0.840045
1,-0.047257,-0.055056,-0.974981,-0.626014
2,-0.740154,-0.210575,0.496156,-0.122672
3,-0.334158,1.722904,-0.13824,1.560636
4,1.231376,-1.918527,-0.107005,0.216071


In [0]:
h.apply(np.sqrt)

Unnamed: 0,a,b,c,d
0,,1.012468,1.152731,
1,1.261894,1.021123,,
2,,1.127642,1.187174,
3,1.214923,1.102056,0.911368,0.829358
4,0.640189,1.230683,,0.109857


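Note the nulls: `np.sqrt` returns NaN for negative values. Because `h` is generated randomly, the numbers in these outputs change from run to run and will not match the table above.
<br> To get the mean of each column, use `axis=0`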

In [0]:
h.apply(np.mean, axis=0)

a    0.233684
b    1.213693
c    0.128155
d   -0.111374
dtype: float64

For the mean of each row, use `axis=1`

In [0]:
h.apply(np.mean, axis=1)

0    0.119918
1    0.289377
2    0.254697
3    1.052248
4    0.113959
dtype: float64

Note the default is `axis=0`

In [0]:
h.apply(np.mean)

a    0.233684
b    1.213693
c    0.128155
d   -0.111374
dtype: float64
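
For common reductions such as the mean, the built-in DataFrame methods give the same result as `apply` and are usually the more idiomatic choice. A small sketch on a deterministic, made-up frame:

```python
import numpy as np
import pandas as pd

# Fixed values so the results are reproducible
h = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=["a", "b", "c", "d"])

col_means = h.mean()        # same result as h.apply(np.mean, axis=0)
row_means = h.mean(axis=1)  # same result as h.apply(np.mean, axis=1)

print(col_means.tolist())  # [4.0, 5.0, 6.0, 7.0]
print(row_means.tolist())  # [1.5, 5.5, 9.5]
```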

Operations on whole columns are applied to every row at once, so no explicit iteration is required.

In [0]:
h["g"]=np.square(h["a"])+np.square(h["b"])
h

Unnamed: 0,a,b,c,d,g
0,-1.141983,1.02509,1.328789,-0.732227,2.354935
1,1.592377,1.042692,-1.447338,-0.030225,3.622872
2,-1.167852,1.271577,1.409383,-0.49432,2.980788
3,1.476038,1.214527,0.830593,0.687835,3.653762
4,0.409842,1.514581,-1.480655,0.012069,2.461926


Let's bring in a bigger dataset

In [0]:
music=pd.read_csv("Billboard.csv")
music.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,...,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,...,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,...,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,...,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,...,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,...,,,,,,,,,,


`.info()` lists each column's name, non-null count, and dtype

In [0]:
music.info()

`.describe()` gives summary statistics for the numeric columns

In [0]:
music.describe()

Check the shape

In [0]:
music.shape 

(317, 83)

What is `.shape`? What makes it different from `.describe()` and `.info()`?

In [0]:
# .shape is an attribute (a tuple of rows and columns), while .describe() and .info() are methods

Why aren't all columns displayed above? pandas truncates wide tables by default, so let's change the display option.
<br>We can either manually input an arbitrary number of at least 83, or simply pass the known column count from `.shape`.

In [0]:
pd.set_option("display.max_columns", music.shape[1])
music.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,x4th.week,x5th.week,x6th.week,x7th.week,x8th.week,x9th.week,x10th.week,x11th.week,x12th.week,x13th.week,x14th.week,x15th.week,x16th.week,x17th.week,x18th.week,x19th.week,x20th.week,x21st.week,x22nd.week,x23rd.week,x24th.week,x25th.week,x26th.week,x27th.week,x28th.week,x29th.week,x30th.week,x31st.week,x32nd.week,x33rd.week,x34th.week,x35th.week,x36th.week,x37th.week,x38th.week,x39th.week,x40th.week,x41st.week,x42nd.week,x43rd.week,x44th.week,x45th.week,x46th.week,x47th.week,x48th.week,x49th.week,x50th.week,x51st.week,x52nd.week,x53rd.week,x54th.week,x55th.week,x56th.week,x57th.week,x58th.week,x59th.week,x60th.week,x61st.week,x62nd.week,x63rd.week,x64th.week,x65th.week,x66th.week,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,33.0,23.0,15.0,7.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,7.0,10.0,12.0,15.0,22.0,29.0,31.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,5.0,2.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0,15.0,19.0,21.0,26.0,36.0,48.0,47.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,31.0,20.0,13.0,7.0,6.0,4.0,4.0,4.0,6.0,4.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,4.0,8.0,8.0,12.0,14.0,17.0,21.0,24.0,30.0,34.0,37.0,46.0,47.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,14.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,4.0,8.0,11.0,16.0,20.0,25.0,27.0,27.0,29.0,44.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,29.0,23.0,18.0,11.0,9.0,9.0,11.0,1.0,1.0,1.0,1.0,4.0,8.0,12.0,22.0,23.0,43.0,44.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


Let's isolate some columns so we can deal with a smaller dataframe that is easier to view


In [0]:
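# .copy() makes an independent DataFrame, so adding columns later does not trigger SettingWithCopyWarning
music_simple=music[["year", "artist.inverted", "track", "time", "genre", "date.entered", "date.peaked"]].copy()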
music_simple.head()

Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14


Some column names are awkward, so let's rename one


In [0]:
music_simple=music_simple.rename(columns={"artist.inverted":"artist"})

In [0]:
music_simple.dtypes

year             int64
artist          object
track           object
time            object
genre           object
date.entered    object
date.peaked     object
dtype: object

Our dates and times have dtype `object`, meaning they are stored as strings. To manipulate them, pandas needs to know that they are actually datetime values (for the dates) or timedeltas (for the track length).

In [0]:
music_simple["entered_dt"]=pd.to_datetime(music_simple["date.entered"])
music_simple["peaked_dt"]=pd.to_datetime(music_simple["date.peaked"])
music_simple.head()

Unnamed: 0,year,artist,track,time,genre,date.entered,date.peaked,entered_dt,peaked_dt
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,2000-09-23,2000-11-18
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,2000-02-12,2000-04-08
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,1999-10-23,2000-01-29
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,2000-08-12,2000-09-16
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,2000-08-05,2000-10-14


It looks the same, but check the dtypes:

In [0]:
music_simple.dtypes

year                     int64
artist                  object
track                   object
time                    object
genre                   object
date.entered            object
date.peaked             object
entered_dt      datetime64[ns]
peaked_dt       datetime64[ns]
dtype: object

We can now use datetime operations, such as extracting the month:

In [0]:
music_simple["month_entered"] = music_simple["entered_dt"].dt.month
music_simple.head()

Unnamed: 0,year,artist,track,time,genre,date.entered,date.peaked,entered_dt,peaked_dt,month_entered
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,2000-09-23,2000-11-18,9
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,2000-02-12,2000-04-08,2
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,1999-10-23,2000-01-29,10
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,2000-08-12,2000-09-16,8
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,2000-08-05,2000-10-14,8


But the same approach doesn't work directly for the `time` column, because pandas can't tell whether `3:38` means HH:MM or MM:SS. Prepending `0:` makes the format unambiguous (H:MM:SS).

In [0]:
music_simple["time_td"] = "0:" + music_simple["time"]
music_simple.head()

Unnamed: 0,year,artist,track,time,genre,date.entered,date.peaked,entered_dt,peaked_dt,month_entered,time_td
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,2000-09-23,2000-11-18,9,0:3:38
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,2000-02-12,2000-04-08,2,0:4:18
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,1999-10-23,2000-01-29,10,0:4:07
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,2000-08-12,2000-09-16,8,0:3:45
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,2000-08-05,2000-10-14,8,0:3:38


Now the conversion works, because pandas interprets the format as HH:MM:SS, and we can work out the track length in seconds.
<br>Check the pandas documentation for the various functions you can call on timedeltas and datetimes.

In [0]:
music_simple["time_td"] = pd.to_timedelta(music_simple["time_td"])
music_simple["time_seconds"] = music_simple["time_td"].dt.total_seconds()
music_simple.head()

Unnamed: 0,year,artist,track,time,genre,date.entered,date.peaked,entered_dt,peaked_dt,month_entered,time_td,time_seconds
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,2000-09-23,2000-11-18,9,00:03:38,218.0
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,2000-02-12,2000-04-08,2,00:04:18,258.0
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,1999-10-23,2000-01-29,10,00:04:07,247.0
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,2000-08-12,2000-09-16,8,00:03:45,225.0
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,2000-08-05,2000-10-14,8,00:03:38,218.0


How many songs entered the chart on each day? There are several ways to do this; here we use a pivot table. Don't worry, we will go over these in more detail later, so you can take this step as magic for now.

In [0]:
pivot=pd.pivot_table(music_simple, index="entered_dt", values="track", aggfunc="count")
pivot.head()

entered_dt
1999-06-05    1
1999-07-17    1
1999-09-04    1
1999-09-11    1
1999-10-09    2
Name: track, dtype: int64
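
Since the text notes there are several ways to get these counts, here is a sketch of two alternatives: `value_counts` and `groupby`. The toy frame `songs` is made up for illustration; only the column names mirror the lesson data.

```python
import pandas as pd

# Hypothetical miniature of music_simple
songs = pd.DataFrame({
    "track": ["A", "B", "C", "D"],
    "entered_dt": pd.to_datetime(
        ["1999-10-09", "1999-06-05", "1999-10-09", "1999-07-17"]
    ),
})

# Count rows per date, then order chronologically
per_day = songs["entered_dt"].value_counts().sort_index()

# Equivalent counts via groupby
by_group = songs.groupby("entered_dt")["track"].count()

print(per_day.tolist())   # [1, 1, 2]
print(by_group.tolist())  # [1, 1, 2]
```

Both results are indexed by date, like the pivot table, so the same sum check applies.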

Does the total match the number of rows?

In [0]:
pivot.sum()

317

In [0]:
music_simple.shape

(317, 12)

It does.
<br>Let's also add the missing dates on which no songs entered

In [0]:
date_range=pd.date_range(start=pivot.index.min(), end=pivot.index.max())
pivot2=pivot.reindex(date_range)

Fill the nulls with zeros, as no songs were entered on those days

In [0]:
pivot2 = pivot2.fillna(0)
pivot2.head()

1999-06-05    1.0
1999-06-06    0.0
1999-06-07    0.0
1999-06-08    0.0
1999-06-09    0.0
Freq: D, Name: track, dtype: float64
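
The reindex and fill steps can also be combined: `reindex` accepts a `fill_value` argument, which avoids the intermediate NaNs and therefore keeps the integer dtype. A sketch on a made-up two-day Series:

```python
import pandas as pd

# Hypothetical counts on two dates with a gap between them
counts = pd.Series([1, 2], index=pd.to_datetime(["1999-06-05", "1999-06-08"]))

all_days = pd.date_range(start=counts.index.min(), end=counts.index.max())

# Missing dates get 0 directly, so the values are never cast to float
filled = counts.reindex(all_days, fill_value=0)

print(filled.tolist())  # [1, 0, 0, 2]
print(filled.dtype)     # int64
```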

Check that the total still matches

In [0]:
pivot2.sum()

317.0

To drop every row that contains a null value:

In [0]:
no_nulls = music.dropna()

Oh dear! Nothing is left, because every row has at least one null.

In [0]:
no_nulls.shape

(0, 83)

Let's only drop rows with nulls in a subset of columns

In [0]:
some_nulls = music.dropna(subset=["x1st.week", "x2nd.week"])

In [0]:
some_nulls.shape

(312, 83)

Other useful operations: selecting the rows where a column is (or is not) null

In [0]:
third_week_null_rows = music.loc[music["x3rd.week"].isnull(),:]
third_week_non_null_rows = music.loc[music["x3rd.week"].notnull(),:]
third_week_null_rows.shape

(10, 83)
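
To see how many nulls each column holds before deciding what to drop, `isnull().sum()` is a handy companion. The frame below is a made-up miniature with hypothetical column names.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly chart positions with some gaps
df = pd.DataFrame({
    "w1": [78, 15, 71],
    "w2": [63.0, np.nan, 48.0],
    "w3": [np.nan, np.nan, 43.0],
})

# isnull() gives a boolean frame; summing counts the True values per column
null_counts = df.isnull().sum()

print(null_counts.tolist())  # [0, 1, 2]
```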

# 2. Pivot Tables

We load the dataset from an Excel file

In [0]:
sales_funnel_df = pd.read_excel("sales-funnel.xlsx")
sales_funnel_df

Unnamed: 0,Account,Name,Rep,Manager,Product,Quantity,Price,Status
0,714466,Trantow-Barrows,Craig Booker,Debra Henley,CPU,1,30000,presented
1,714466,Trantow-Barrows,Craig Booker,Debra Henley,Software,1,10000,presented
2,714466,Trantow-Barrows,Craig Booker,Debra Henley,Maintenance,2,5000,pending
3,737550,"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,CPU,1,35000,declined
4,146832,Kiehn-Spinka,Daniel Hilton,Debra Henley,CPU,2,65000,won
5,218895,Kulas Inc,Daniel Hilton,Debra Henley,CPU,2,40000,pending
6,218895,Kulas Inc,Daniel Hilton,Debra Henley,Software,1,10000,presented
7,412290,Jerde-Hilpert,John Smith,Debra Henley,Maintenance,2,5000,pending
8,740150,Barton LLC,John Smith,Debra Henley,CPU,1,35000,declined
9,141962,Herman LLC,Cedric Moss,Fred Anderson,CPU,2,65000,won


We change the type of Status and Account into Categories (categorical data)

In [0]:
sales_funnel_df["Status"] = sales_funnel_df["Status"].astype("category")
sales_funnel_df["Account"] = sales_funnel_df["Account"].astype("category")

We start with a simple pivot. Note that the default aggregation function is the mean. For example, Kulas Inc has two rows with quantities 2 and 1, so its Quantity is 1.5.

In [0]:
pd.pivot_table(sales_funnel_df, index=["Name"])

Unnamed: 0_level_0,Price,Quantity
Name,Unnamed: 1_level_1,Unnamed: 2_level_1
Barton LLC,35000.0,1.0
"Fritsch, Russel and Anderson",35000.0,1.0
Herman LLC,65000.0,2.0
Jerde-Hilpert,5000.0,2.0
"Kassulke, Ondricka and Metz",7000.0,3.0
Keeling LLC,100000.0,5.0
Kiehn-Spinka,65000.0,2.0
Koepp Ltd,35000.0,2.0
Kulas Inc,25000.0,1.5
Purdy-Kunde,30000.0,1.0


We now do a pivot using multiple columns for the Index

In [0]:
pd.pivot_table(sales_funnel_df, index=['Name', 'Rep', 'Manager'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Price,Quantity
Name,Rep,Manager,Unnamed: 3_level_1,Unnamed: 4_level_1
Barton LLC,John Smith,Debra Henley,35000.0,1.0
"Fritsch, Russel and Anderson",Craig Booker,Debra Henley,35000.0,1.0
Herman LLC,Cedric Moss,Fred Anderson,65000.0,2.0
Jerde-Hilpert,John Smith,Debra Henley,5000.0,2.0
"Kassulke, Ondricka and Metz",Wendy Yule,Fred Anderson,7000.0,3.0
Keeling LLC,Wendy Yule,Fred Anderson,100000.0,5.0
Kiehn-Spinka,Daniel Hilton,Debra Henley,65000.0,2.0
Koepp Ltd,Wendy Yule,Fred Anderson,35000.0,2.0
Kulas Inc,Daniel Hilton,Debra Henley,25000.0,1.5
Purdy-Kunde,Cedric Moss,Fred Anderson,30000.0,1.0


Notice that if we change the order of the index columns, the rows are grouped by manager first and then by rep, so each manager's reps appear together.

In [0]:
pd.pivot_table(sales_funnel_df, index=['Manager',  'Rep', 'Name'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Price,Quantity
Manager,Rep,Name,Unnamed: 3_level_1,Unnamed: 4_level_1
Debra Henley,Craig Booker,"Fritsch, Russel and Anderson",35000.0,1.0
Debra Henley,Craig Booker,Trantow-Barrows,15000.0,1.333333
Debra Henley,Daniel Hilton,Kiehn-Spinka,65000.0,2.0
Debra Henley,Daniel Hilton,Kulas Inc,25000.0,1.5
Debra Henley,John Smith,Barton LLC,35000.0,1.0
Debra Henley,John Smith,Jerde-Hilpert,5000.0,2.0
Fred Anderson,Cedric Moss,Herman LLC,65000.0,2.0
Fred Anderson,Cedric Moss,Purdy-Kunde,30000.0,1.0
Fred Anderson,Cedric Moss,Stokes LLC,7500.0,1.0
Fred Anderson,Wendy Yule,"Kassulke, Ondricka and Metz",7000.0,3.0


Now, thinking only about understanding the sales funnel for each salesman, we drop the account name from the index

In [0]:
pd.pivot_table(sales_funnel_df, index=['Manager', 'Rep'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Quantity
Manager,Rep,Unnamed: 2_level_1,Unnamed: 3_level_1
Debra Henley,Craig Booker,20000.0,1.25
Debra Henley,Daniel Hilton,38333.333333,1.666667
Debra Henley,John Smith,20000.0,1.5
Fred Anderson,Cedric Moss,27500.0,1.25
Fred Anderson,Wendy Yule,44250.0,3.0


We can also specify which columns to aggregate, for example `Price`

In [0]:
pd.pivot_table(sales_funnel_df, index=['Manager', 'Rep'], values=['Price'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Price
Manager,Rep,Unnamed: 2_level_1
Debra Henley,Craig Booker,20000
Debra Henley,Daniel Hilton,38333
Debra Henley,John Smith,20000
Fred Anderson,Cedric Moss,27500
Fred Anderson,Wendy Yule,44250


So far the value columns have been aggregated with the mean. We can specify another aggregation strategy with the `aggfunc` parameter

In [0]:
pd.pivot_table(sales_funnel_df, index=["Manager", "Rep"], values=['Price'], aggfunc=np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,Price
Manager,Rep,Unnamed: 2_level_1
Debra Henley,Craig Booker,80000
Debra Henley,Daniel Hilton,115000
Debra Henley,John Smith,40000
Fred Anderson,Cedric Moss,110000
Fred Anderson,Wendy Yule,177000


We can also give a list of functions to `aggfunc`, and it will compute each aggregation for all the values

In [0]:
pd.pivot_table(sales_funnel_df, index=["Manager", "Rep"], values=['Price'], aggfunc=[np.sum, len])

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Price,Price
Manager,Rep,Unnamed: 2_level_2,Unnamed: 3_level_2
Debra Henley,Craig Booker,80000,4
Debra Henley,Daniel Hilton,115000,3
Debra Henley,John Smith,40000,2
Fred Anderson,Cedric Moss,110000,4
Fred Anderson,Wendy Yule,177000,4


We can also define columns to further segment our values. Remember that the aggregations are always done on the values.

In [0]:
pd.pivot_table(sales_funnel_df, index=["Manager", "Rep"], values=['Price'], 
               columns=['Product'] ,aggfunc=[np.sum, len])

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,sum,sum,sum,len,len,len,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Price,Price,Price,Price,Price,Price,Price,Price
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Rep,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3
Debra Henley,Craig Booker,65000.0,5000.0,,10000.0,2.0,1.0,,1.0
Debra Henley,Daniel Hilton,105000.0,,,10000.0,2.0,,,1.0
Debra Henley,John Smith,35000.0,5000.0,,,1.0,1.0,,
Fred Anderson,Cedric Moss,95000.0,5000.0,,10000.0,2.0,1.0,,1.0
Fred Anderson,Wendy Yule,165000.0,7000.0,5000.0,,2.0,1.0,1.0,


You can notice that some values are NaN (more on this in the last lesson of the day).
We can change those NaNs into zeros.

In [0]:
pd.pivot_table(sales_funnel_df, index=["Manager", "Rep"], values=['Price'], 
               columns=['Product'] ,aggfunc=[np.sum, len], fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,sum,sum,sum,len,len,len,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Price,Price,Price,Price,Price,Price,Price,Price
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Rep,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3
Debra Henley,Craig Booker,65000,5000,0,10000,2,1,0,1
Debra Henley,Daniel Hilton,105000,0,0,10000,2,0,0,1
Debra Henley,John Smith,35000,5000,0,0,1,1,0,0
Fred Anderson,Cedric Moss,95000,5000,0,10000,2,1,0,1
Fred Anderson,Wendy Yule,165000,7000,5000,0,2,1,1,0


If we move Products into the index, we have another way of summarizing the same info.

In [0]:
pd.pivot_table(sales_funnel_df, index=["Manager", "Rep", "Product"], values=['Price'], 
               aggfunc=[np.sum, len], fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,sum,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Price,Price
Manager,Rep,Product,Unnamed: 3_level_2,Unnamed: 4_level_2
Debra Henley,Craig Booker,CPU,65000,2
Debra Henley,Craig Booker,Maintenance,5000,1
Debra Henley,Craig Booker,Software,10000,1
Debra Henley,Daniel Hilton,CPU,105000,2
Debra Henley,Daniel Hilton,Software,10000,1
Debra Henley,John Smith,CPU,35000,1
Debra Henley,John Smith,Maintenance,5000,1
Fred Anderson,Cedric Moss,CPU,95000,2
Fred Anderson,Cedric Moss,Maintenance,5000,1
Fred Anderson,Cedric Moss,Software,10000,1


We can also use `margins=True` to get a total for each value column (an `All` row is appended to the result).

In [0]:
pd.pivot_table(sales_funnel_df, index=["Manager", "Rep", "Product"], values=['Price'], 
               aggfunc=[np.sum, len], fill_value=0, margins=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,sum,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Price,Price
Manager,Rep,Product,Unnamed: 3_level_2,Unnamed: 4_level_2
Debra Henley,Craig Booker,CPU,65000.0,2.0
Debra Henley,Craig Booker,Maintenance,5000.0,1.0
Debra Henley,Craig Booker,Software,10000.0,1.0
Debra Henley,Daniel Hilton,CPU,105000.0,2.0
Debra Henley,Daniel Hilton,Software,10000.0,1.0
Debra Henley,John Smith,CPU,35000.0,1.0
Debra Henley,John Smith,Maintenance,5000.0,1.0
Fred Anderson,Cedric Moss,CPU,95000.0,2.0
Fred Anderson,Cedric Moss,Maintenance,5000.0,1.0
Fred Anderson,Cedric Moss,Software,10000.0,1.0
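
To show where the total ends up, here is a minimal sketch on a made-up three-row frame. It uses the string `"sum"` as the aggregation function; the total appears under the default label `All`.

```python
import pandas as pd

# Hypothetical extract of the sales funnel
sales = pd.DataFrame({
    "Manager": ["Debra Henley", "Debra Henley", "Fred Anderson"],
    "Price": [30000, 10000, 65000],
})

totals = pd.pivot_table(
    sales, index=["Manager"], values=["Price"], aggfunc="sum", margins=True
)

print(totals.loc["Debra Henley", "Price"])  # 40000
print(totals.loc["All", "Price"])           # 105000
```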


We can go up a level and analyse the value of the funnel for each manager, broken down by status.

In [0]:
pd.pivot_table(sales_funnel_df, index=["Manager", "Status"], values=['Price'], aggfunc=[np.sum, len])

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Price,Price
Manager,Status,Unnamed: 2_level_2,Unnamed: 3_level_2
Debra Henley,declined,70000,2
Debra Henley,pending,50000,3
Debra Henley,presented,50000,3
Debra Henley,won,65000,1
Fred Anderson,declined,65000,1
Fred Anderson,pending,5000,1
Fred Anderson,presented,45000,3
Fred Anderson,won,172000,3


Another trick is to pass a dict to `aggfunc` so we can specify a different aggregation strategy for each value column. Each value of the dict can also be a list, as usual.

In [0]:
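# Use sales_funnel_df (the DataFrame loaded above); `df` is not defined in this notebook
table = pd.pivot_table(sales_funnel_df,index=["Manager","Status"],columns=["Product"],values=["Quantity","Price"],
               aggfunc={"Quantity":len,"Price":[np.sum, len]},fill_value=0)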
print(table)

                        Price                                       \
                          len                                  sum   
Product                   CPU Maintenance Monitor Software     CPU   
Manager       Status                                                 
Debra Henley  declined      2           0       0        0   70000   
              pending       1           2       0        0   40000   
              presented     1           0       0        2   30000   
              won           1           0       0        0   65000   
Fred Anderson declined      1           0       0        0   65000   
              pending       0           1       0        0       0   
              presented     1           0       1        1   30000   
              won           2           1       0        0  165000   

                                                     Quantity              \
                                                          len               
Produ

Finally, once we have a pivot we are happy with, we can also query it.

In [0]:
table.query('Manager == ["Debra Henley"]')

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price,Price,Price,Price,Price,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,len,len,len,len,sum,sum,sum,sum,len,len,len,len
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Status,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3
Debra Henley,declined,2,0,0,0,70000,0,0,0,2,0,0,0
Debra Henley,pending,1,2,0,0,40000,10000,0,0,1,2,0,0
Debra Henley,presented,1,0,0,2,30000,0,0,20000,1,0,0,2
Debra Henley,won,1,0,0,0,65000,0,0,0,1,0,0,0


In [0]:
table.query("Status != ['won']")

Unnamed: 0_level_0,Unnamed: 1_level_0,Price,Price,Price,Price,Price,Price,Price,Price,Quantity,Quantity,Quantity,Quantity
Unnamed: 0_level_1,Unnamed: 1_level_1,len,len,len,len,sum,sum,sum,sum,len,len,len,len
Unnamed: 0_level_2,Product,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software,CPU,Maintenance,Monitor,Software
Manager,Status,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3
Debra Henley,declined,2,0,0,0,70000,0,0,0,2,0,0,0
Debra Henley,pending,1,2,0,0,40000,10000,0,0,1,2,0,0
Debra Henley,presented,1,0,0,2,30000,0,0,20000,1,0,0,2
Fred Anderson,declined,1,0,0,0,65000,0,0,0,1,0,0,0
Fred Anderson,pending,0,1,0,0,0,5000,0,0,0,1,0,0
Fred Anderson,presented,1,0,1,1,30000,0,5000,10000,1,0,1,1
