![MLTrain logo](https://mltrain.cc/wp-content/uploads/2017/11/mltrain_logo-4.png "MLTrain logo")

In [1]:
!wget -q -O changeNBLayout.py https://raw.githubusercontent.com/cmalliopoulos/PfBDAaML/master/changeNBLayout.py
%run changeNBLayout.py
# %run ../PfBDAaML/changeNBLayout.py

# Introduction #
The pandas library implements the basic machinery for handling in-memory tabular data with Python.  
It is a large library, with more than 600 methods, attributes and functions.  
The online documentation of the current version (0.21) is available at https://pandas.pydata.org/pandas-docs/stable/overview.html.  
  
The core data structures in Pandas are the __Series, DataFrame and Index objects__.  
You should think of Series and DataFrames as fact tables (i.e. with named and typed columns) and of Indices as dimensions with hierarchies.  
  
Pandas' methods and functions support standard __dimensional analysis__ (slicing, dicing and pivoting). Pandas also provides a comprehensive set of analytical functions for __group transforms and ranking__.  
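
As a minimal sketch of how the three core objects relate, the following builds each of them from made-up values (the data and variable names are assumptions for illustration; the `print()` calls take a single argument so the snippet should run on both Python 2 and Python 3):

```python
import pandas as pd

# A Series is a one-dimensional array of values labelled by an Index
s = pd.Series([1.5, 1.7, 3.6], index=['a', 'b', 'c'], name='pop')

# A DataFrame is a table whose columns are Series sharing one Index
df = pd.DataFrame({'pop': s, 'year': [2000, 2001, 2002]})

print(type(s.index))               # the Index object behind the Series
print(df['pop'].equals(s))         # a column of a DataFrame is itself a Series
print(df.index.equals(s.index))    # the frame reuses the Series' Index
```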

### Imports ###

In [3]:
from os import linesep as endl
import pandas as pd
import numpy as np

# Adjust Pandas layout options
pd.set_option('display.width', 124)

! wget -q -O nba.csv https://raw.githubusercontent.com/cmalliopoulos/PfBDAaML/master/nba.csv
nba = pd.read_csv('nba.csv')
print 'NBA player stats:', endl, nba.sample(5)

! wget -q -O employees.csv https://raw.githubusercontent.com/cmalliopoulos/PfBDAaML/master/employees.csv
emp = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
print '', endl, emp.sample(5)

NBA player stats: 
                Name                   Team  Number Position   Age Height  Weight        College     Salary
225  Greivis Vasquez        Milwaukee Bucks    21.0       PG  29.0    6-6   217.0       Maryland  6600000.0
112  Marcelo Huertas     Los Angeles Lakers     9.0       PG  33.0    6-3   200.0            NaN   525093.0
92        Jeff Ayres   Los Angeles Clippers    19.0       PF  29.0    6-9   250.0  Arizona State   111444.0
245     Corey Brewer        Houston Rockets    33.0       SG  30.0    6-9   186.0        Florida  8229375.0
415       Randy Foye  Oklahoma City Thunder     6.0       SG  32.0    6-4   213.0      Villanova  3135000.0
 
    First Name  Gender Start Date     Last Login Time  Salary  Bonus % Senior Management             Team
706       Todd    Male 1993-07-04 2017-11-22 18:53:00  128175   18.473              True              NaN
756    Stephen    Male 1984-10-21 2017-11-22 06:26:00  121816   10.615              True     Distribution
466     Walte

# Introspection methods and attributes #

DataFrames have a number of attributes for __introspection__ and convenience methods for inspecting their __content__.

In [10]:
print 'Print first 5 rows', endl, nba.head(5)
print endl, 'Print 5 last rows', endl, nba.tail()
print endl, 'A random sample of 5 rows', endl, nba.sample(5)
print endl, 'Column types', endl, nba.dtypes
print endl, 'column names', endl, nba.columns
print endl, 'The index object', endl, nba.index
print endl, 'DataFrame values as array', endl, nba.values
print endl, 'Shape', endl, nba.shape
print endl, 'Number of elements', endl, nba.size
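# info() prints its summary directly and returns None, so call it on its own line
print endl, 'Summary of columns, non-null counts and dtypes'
nba.info()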

Print first 5 rows 
            Name            Team  Number Position   Age Height  Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0

Print 5 last rows 
             Name       Team  Number Position   Age Height  Weight College     Salary
453  Shelvin Mack  Utah Jazz     8.0       PG  26.0    6-3   203.0  Butler  2433333.0
454     Raul Neto  Utah Jazz    25.0       PG  24.0    6-1   179.0     NaN   900000.0
455  Tibor Pleiss  Utah Jazz    21.0        C  26.0    7-3   256.0     NaN  2900000.0
456   Jeff
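
For actual summary statistics, `describe()` is probably closer to what one expects than `info()`. The sketch below uses a hypothetical four-row miniature of the player table (the values are invented, not taken from `nba.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the player table, for illustration only
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Age': [25.0, 27.0, np.nan, 22.0],
    'Salary': [100.0, 200.0, 300.0, np.nan]})

# describe() returns a DataFrame of summary statistics for the numeric columns
stats = toy.describe()
print(stats)

# count() reports the number of non-null entries in every column
print(toy.count())
```

Both methods skip nulls, so the `count` row of `describe()` can differ from the number of rows in the frame.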

__Column types__ can be set programmatically

In [12]:
print 'Before type-setting:', endl, emp.dtypes

# Change types programmatically
emp["Senior Management"] = emp["Senior Management"].astype('bool')
emp["Gender"] = emp["Gender"].astype("category")

print endl, 'After type-setting', endl, emp.dtypes

Before type-setting: 
First Name                   object
Gender                       object
Start Date           datetime64[ns]
Last Login Time      datetime64[ns]
Salary                        int64
Bonus %                     float64
Senior Management            object
Team                         object
dtype: object

After type-setting 
First Name                   object
Gender                     category
Start Date           datetime64[ns]
Last Login Time      datetime64[ns]
Salary                        int64
Bonus %                     float64
Senior Management              bool
Team                         object
dtype: object
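
One thing to watch when casting an `object` column to `bool`: missing values do not stay missing. The sketch below uses a made-up column; whether `employees.csv` actually contains such gaps in "Senior Management" is not shown above, so treat this as a general caution rather than a statement about that file.

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value, held as dtype object
flags = pd.Series([True, False, np.nan], dtype=object)

# astype('bool') maps the missing value to True, because bool(nan) is True
as_bool = flags.astype('bool')
print(as_bool.tolist())

# A categorical keeps missing values missing and stores each label only once
gender = pd.Series(['Male', 'Female', np.nan, 'Male']).astype('category')
print(list(gender.cat.categories))
print(gender.isnull().sum())
```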


# Construction #

In [22]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = pd.DataFrame(data)
print 'Dictionary ctor:'
print frame

print endl, 'If you specify a different order in "columns" the df will be arranged properly:'
print pd.DataFrame(data, columns = ['year', 'state', 'pop'])

frame2 = pd.DataFrame(
    data, 
    columns = ['year', 'state', 'pop', 'debt'],
    index = ['one', 'two', 'three', 'four', 'five'])
print endl, 'Non-existing columns fill the DataFrame with nulls:'
print frame2

pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

print endl, 'A nested dict creates column and index objects:'
print frame3

# Transpose a list of columns into an array of rows.
# NB: mixing numbers and strings makes numpy store every value as a string
listT = lambda _: np.array(_).T
cols = ['year', 'state', 'pop']
frame4 = pd.DataFrame(
    # Take the columns in an explicit order: data.values() follows the dict's key order,
    # which need not match the labels passed in "columns"
    data = listT([data[col] for col in cols]),
    index = ['one', 'two', 'three', 'four', 'five'],
    columns = cols)

frame4.index.name = 'ixNames'
frame4.columns.name = 'colNames'

print endl, 'Index and columns can be named:'
print frame4


Dictionary ctor:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002

If you specify a different order in "columns" the df will be arranged properly:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

Non-existing columns fill the DataFrame with nulls:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

A nested dict creates column and index objects:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6

Index and columns can be named:
colNames  year   state  pop
ixNames                    
one       2000    Ohio  1.5
two       2001    Ohio  1.7
three     2002    Ohio  3.6
four      2001  Nevada  2.4
five      2002  Nevada  2.9


------------------------------------------
# Projection and selection #

### Projection ###

In [42]:
data = pd.DataFrame(
    np.random.choice(10, 24).reshape((6, 4)),
    index = ['Ohio', 'Colorado', 'Utah', 'New York', 'Atlanta', 'San Francisco'],
    columns = ['one', 'two', 'three', 'four'])

# Projections
print data['one']
print data[['one', 'two']]
print data[[col for col in data.columns if 'o' in col]]

Ohio             4
Colorado         1
Utah             6
New York         9
Atlanta          5
San Francisco    6
Name: one, dtype: int64
               one  two
Ohio             4    0
Colorado         1    7
Utah             6    3
New York         9    2
Atlanta          5    8
San Francisco    6    2
               one  two  four
Ohio             4    0     5
Colorado         1    7     8
Utah             6    3     7
New York         9    2     0
Atlanta          5    8     5
San Francisco    6    2     0


### Selection by index ###

In [65]:
# Selection by index label
data.loc['Ohio']
data.loc['Ohio':'Utah']
data.loc[['Ohio', 'New York']]

# Integer selection
data.iloc[2:4]
data.iloc[::-1]
data[2:4]


# Simultaneous selection by label and projection
data.loc['Atlanta', ['two', 'three']]

two      8
three    1
Name: Atlanta, dtype: int64
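
One difference between the two indexers is easy to miss: label slices with `.loc` include both endpoints, while positional slices with `.iloc` exclude the stop position. A small sketch on a made-up frame (not the random `data` frame above):

```python
import pandas as pd

# Small frame with string labels, made up for illustration
df = pd.DataFrame(
    {'one': [10, 20, 30, 40]},
    index=['Ohio', 'Colorado', 'Utah', 'New York'])

# Label slices include both endpoints
by_label = df.loc['Ohio':'Utah']

# Positional slices exclude the stop position, like Python lists
by_position = df.iloc[0:2]

print(list(by_label.index))
print(list(by_position.index))
```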

### Boolean and relational selections ###

We can select subsets of rows (i.e. parts of the __index__) with boolean or relational expressions on the columns:

In [38]:
# Selection based on categories
print emp[emp['Team'] == 'Marketing'].sample(2)
print endl, emp[(emp['Gender'] == 'Male') & (emp['Team'] == 'Sales')].sample(2)
print endl, emp[(emp['Gender'] == 'Female') & ~(emp['First Name'] == 'Mary')].sample(2)

# Selection based on scalars
print endl, emp[emp['Start Date'] > '1990-01-01'].sample(2)

# a more complex condition
print endl, emp[~emp['Team'].isin(['Marketing', 'Sales'])].sample(5, replace = False)

print endl, emp[emp['Salary'].between(80000, 150000)].sample(4)

    First Name  Gender Start Date     Last Login Time  Salary  Bonus % Senior Management       Team
586       Rose  Female 2004-10-30 2017-11-22 16:34:00   56961    7.585             False  Marketing
860    Phillip    Male 1984-10-07 2017-11-22 11:05:00   36837   14.660             False  Marketing

    First Name Gender Start Date     Last Login Time  Salary  Bonus % Senior Management   Team
739     Carlos   Male 1981-01-25 2017-11-22 10:00:00  138598   14.737             False  Sales
787      Kevin   Male 2005-07-01 2017-11-22 15:22:00  141498    4.135              True  Sales

    First Name  Gender Start Date     Last Login Time  Salary  Bonus % Senior Management     Team
6         Ruby  Female 1987-08-17 2017-11-22 16:20:00   65476   10.012              True  Product
548     Janice  Female 1984-01-02 2017-11-22 21:06:00   41190    3.311              True    Sales

    First Name Gender Start Date     Last Login Time  Salary  Bonus % Senior Management       Team
722     Joshua   Ma
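
The selections above all follow the same pattern: a comparison produces a boolean Series, and indexing the frame with it keeps the rows marked `True`. A minimal sketch with invented employee records (the names `toy`, `is_sales` and `well_paid` are hypothetical):

```python
import pandas as pd

# Hypothetical employee records
toy = pd.DataFrame({
    'Team': ['Sales', 'Marketing', 'Sales', 'Product'],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Salary': [90000, 60000, 120000, 40000]})

# Each comparison yields a boolean Series aligned with the rows
is_sales = toy['Team'] == 'Sales'
well_paid = toy['Salary'].between(80000, 150000)   # both bounds are inclusive

# Combine with & (and), | (or), ~ (not); wrap each comparison in parentheses
both = toy[is_sales & well_paid]
either = toy[(toy['Gender'] == 'Female') | ~is_sales]

print(both)
print(either)
```

The parentheses matter because `&`, `|` and `~` bind more tightly than `==` in Python.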

### General boolean selections ###
Boolean selection also works at the level of individual cells.  
A boolean mask with the same shape as the frame keeps the cells where the mask is `True` and replaces the rest with nulls:

In [50]:
mask = pd.DataFrame(np.random.choice([True, False], emp.shape, p = [.5, .5]), columns = emp.columns)
print emp[mask].isnull().sample(10)

     First Name  Gender  Start Date  Last Login Time  Salary  Bonus %  Senior Management   Team
170       False   False        True             True   False     True              False  False
208        True    True        True            False    True     True               True   True
354        True    True        True            False    True    False               True   True
154       False   False       False            False   False    False              False  False
401        True   False        True             True    True    False               True   True
545       False    True        True            False    True     True               True  False
214        True    True        True             True   False     True               True   True
144       False    True        True             True   False     True              False  False
870       False    True        True             True    True     True               True   True
697        True   False       False     
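
The same behaviour is easier to see on a tiny frame with a hand-written mask. This is only a sketch with made-up values; `where` is shown as a related method that also accepts a replacement value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
mask = pd.DataFrame({'a': [True, False], 'b': [False, True]})

# Cells where the mask is False become NaN; the shape is unchanged
masked = df[mask]
print(masked)

# where() behaves the same way and can fill the masked cells with a value
filled = df.where(mask, -1)
print(filled)
```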

### .isnull and .notnull ###

In [27]:
print emp['Team'].isnull().sample(5)

print endl, 'First names of employees without team:'
print emp['First Name'][emp['Team'].isnull()].sample(5)

460    True
961    True
390    True
322    True
432    True
Name: Team, dtype: bool

First names of employees without team:
843    NaN
500    NaN
38     NaN
182    NaN
97     NaN
Name: First Name, dtype: object


### .unique and .nunique ###

In [38]:
print emp['First Name'][emp['Team'].isnull()].unique()

print endl, len(emp['First Name'][emp['Team'].isnull()].unique())

print endl, 'Not equal. Why?', endl, emp['First Name'][emp['Team'].isnull()].nunique()

print endl, "Now they're equal:", endl, emp['First Name'][emp['Team'].isnull()].nunique(dropna = False)

['Thomas' 'Louise' nan 'James' 'Christopher' 'Jonathan' 'Michael' 'Jeremy'
 'Bobby' 'Edward' 'Joyce' 'Jason' 'Chris' 'Richard' 'Wanda' 'Jimmy' 'Peter'
 'Kimberly' 'Harry' 'Carl' 'Randy' 'Donald' 'Joseph' 'Alice' 'Todd'
 'Daniel' 'Antonio' 'Lawrence' 'Nicole' 'Charles' 'Mildred' 'Phillip'
 'Ryan' 'Joe']

34

Not equal. Why? 
33

Now they're equal: 
34
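
The discrepancy comes from how the two methods treat nulls, which a short made-up Series makes visible:

```python
import numpy as np
import pandas as pd

names = pd.Series(['Ann', 'Bob', np.nan, 'Ann'])

n_unique_values = len(names.unique())          # unique() keeps NaN as a value
n_default = names.nunique()                    # nunique() drops NaN by default
n_with_nan = names.nunique(dropna=False)       # count NaN as a distinct value

print(n_unique_values)
print(n_default)
print(n_with_nan)
```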


# Sorting and ranking #

In [46]:
# To sort by the values of one or more columns use sort_values method
print emp.sort_values(by = 'First Name', na_position = 'last').head(5)

# Create an index by which to sort
print endl, 'Sort by "Start Date" using index sorting'
_0 = emp.set_index(keys = 'Start Date', drop = True)
print endl, _0.sort_index(ascending = False, na_position = 'last').head(5)

    First Name Gender Start Date     Last Login Time  Salary  Bonus %  Senior Management             Team
101      Aaron   Male 2012-02-17 2017-11-21 10:20:00   61602   11.849               True        Marketing
327      Aaron   Male 1994-01-29 2017-11-21 18:48:00   58755    5.097               True        Marketing
440      Aaron   Male 1990-07-22 2017-11-21 14:53:00   52119   11.343               True  Client Services
937      Aaron    NaN 1986-01-22 2017-11-21 19:39:00   63126   18.424              False  Client Services
137       Adam   Male 2011-05-21 2017-11-21 01:45:00   95327   15.120              False     Distribution

Sort by "Start Date" using index sorting

           First Name  Gender     Last Login Time  Salary  Bonus %  Senior Management             Team
Start Date                                                                                            
2016-07-15      Terry     NaN 2017-11-21 00:29:00  140002   19.490               True        Marketing
2016-06-16  

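`rank` assigns each value its position in the sorted order of the column.  
How ties are handled depends on the ranking method, so the ranks may or may not be contiguous. With the `average`, `min`, `max` and `dense` methods, equal values receive the same rank; `first` breaks ties by order of appearance.  
The default method, `average`, assigns each group of equal values the mean of the positions they occupy in the sorted order.  
  
__NB:__ The result keeps the original row order; it is not sorted by rank.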

In [61]:
# create a dummy DataFrame
_1 = pd.DataFrame(np.random.choice(5, [10, 2]))
print _1

# Default ranking
print endl, 'Default ranks of first column elements', endl, _1[0].rank()

print endl, 'Dense ranking', endl, pd.concat([_1[0], _1[0].rank(method = 'dense')], axis = 1)

   0  1
0  1  4
1  4  1
2  4  2
3  4  0
4  4  4
5  0  3
6  1  2
7  3  0
8  1  1
9  1  3

Default ranks of first column elements 
0    3.5
1    8.5
2    8.5
3    8.5
4    8.5
5    1.0
6    3.5
7    6.0
8    3.5
9    3.5
Name: 0, dtype: float64

Dense ranking 
   0    0
0  1  2.0
1  4  4.0
2  4  4.0
3  4  4.0
4  4  4.0
5  0  1.0
6  1  2.0
7  3  3.0
8  1  2.0
9  1  2.0
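
To compare the ranking methods side by side, the following sketch ranks a short made-up Series with a single tie (the column names of `ranks` are chosen here for readability and are not part of the pandas API):

```python
import pandas as pd

values = pd.Series([10, 20, 20, 30])

ranks = pd.DataFrame({
    'value': values,
    'average': values.rank(),                 # ties share the mean position
    'min': values.rank(method='min'),         # ties share the lowest position
    'dense': values.rank(method='dense'),     # like min, but without gaps
    'first': values.rank(method='first')})    # ties broken by order of appearance
print(ranks)
```

Only `dense` yields contiguous ranks when ties are present; `first` is the only method shown here that gives tied values different ranks.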


-------------------
# Groupings and transforms #

Projections and selections, when combined with the `.groupby`, `.apply` and `.transform` methods described below, essentially constitute pandas' powerful framework for multidimensional analysis (MDA).  
MDA supported by visualizations (an area we'll explore later) is what has been known as __Exploratory Data Analysis__, a term coined in the 70s by John Tukey, one of the most prominent statisticians of our era.

__Quiz:__ What else is [Tukey known for](https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm)? (it dominated the electronics industry for 3 decades)

In [4]:
from datetime import datetime as dt
np.random.seed(101)

### Dataset ###
Create a DataFrame with sales records of an imaginary retailer.  
The retailer sells products identified by their stock-keeping unit (SKU column) in several stores identified by their ID (STORE column).  
Each record contains the number and the price of items (SKUs) sold in a store on a particular day.
  
For our case assume there are 4 SKUs, 4 stores and records from 01Oct2017 to 31Oct2017.

In [5]:
# SKUs
storeSales = pd.DataFrame(np.random.choice(list('ABCD'), 100), columns = ['SKU'])
# Stores
storeSales['STORE'] = np.random.choice(list('WXYZ'), 100)
# Dates
storeSales['DAY'] = np.random.choice(pd.date_range(start = dt(2017, 10, 1), end = dt(2017, 10, 31)), 100)
# Price: one random unit price per SKU
prices = dict(zip(list('ABCD'), np.random.uniform(10, 100, 4)))
storeSales['PRICE'] = storeSales['SKU'].map(prices)

# Items sold (an item may have 0 sales)
storeSales['SALES'] = np.random.uniform(0, 100, 100).astype(int)

Permit some null prices (e.g. assume some SKUs were sold with coupons of varying markdowns).  
__Caveat:__  
If you instead try
``` Python
storeSales[np.random.choice([True, False], storeSales.shape[0], p = [.05, .95])]['PRICE'] = np.nan
```
you get a `SettingWithCopyWarning`, i.e. a warning that you are trying to assign to an implicit copy.  
  
__Quiz:__ Where's the implicit copy?

In [6]:
storeSales.loc[np.random.choice([True, False], storeSales.shape[0], p = [.05, .95]), 'PRICE'] = np.nan
print storeSales.sample(20)

   SKU STORE        DAY      PRICE  SALES
91   D     X 2017-10-08  29.675272      3
17   A     Y 2017-10-12  70.296706     11
69   D     W 2017-10-11  29.675272     95
1    D     Z 2017-10-30  29.675272     87
86   D     Z 2017-10-09  29.675272     51
90   C     X 2017-10-14  56.171025     19
34   A     Y 2017-10-08  70.296706     39
35   A     W 2017-10-21  70.296706     18
97   C     W 2017-10-09  56.171025     10
55   D     W 2017-10-18  29.675272     75
62   C     X 2017-10-03  56.171025     62
67   D     Z 2017-10-07  29.675272     34
39   C     Y 2017-10-01  56.171025     77
70   A     Z 2017-10-23  70.296706     12
57   C     Y 2017-10-26  56.171025     23
59   A     Y 2017-10-03  70.296706     36
25   D     Z 2017-10-20  29.675272     23
10   B     Y 2017-10-05  39.044652      0
15   B     Z 2017-10-21  39.044652     74
84   A     Z 2017-10-01  70.296706     68
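
The `.loc[rows, column] = value` form used above performs the row selection and the assignment in a single indexing operation. A minimal sketch on made-up data, assuming a hand-written mask instead of a random one so that the result is predictable:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'SKU': list('ABCD'), 'PRICE': [10.0, 20.0, 30.0, 40.0]})
mask = np.array([True, False, False, True])

# Rows are chosen by the boolean mask, the column by its label
toy.loc[mask, 'PRICE'] = np.nan

print(toy)
print(toy['PRICE'].isnull().sum())
```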


__Sales per STORE per DAY:__

In [7]:
storeSales.groupby(['STORE', 'DAY'])['SALES'].sum().sample(5)

STORE  DAY       
X      2017-10-28      0
W      2017-10-19    106
Z      2017-10-22     22
Y      2017-10-09     62
Z      2017-10-20     23
Name: SALES, dtype: int64

__Average sales per SKU in store X:__

In [8]:
storeSales[storeSales['STORE'] == 'X'].groupby('SKU')['SALES'].mean()

SKU
A    46.800000
B    46.500000
C    51.200000
D    45.636364
Name: SALES, dtype: float64
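
The queries above aggregate, i.e. they return one value per group. The `.transform` method mentioned at the start of this section instead returns one value per original row, which is handy for group-relative quantities. A sketch on a made-up four-row table (the columns `STORE_TOTAL` and `SHARE` are hypothetical names introduced here):

```python
import pandas as pd

toy = pd.DataFrame({
    'STORE': ['X', 'X', 'Y', 'Y'],
    'SALES': [10.0, 30.0, 5.0, 15.0]})

# Aggregation: one row per group
totals = toy.groupby('STORE')['SALES'].sum()

# Transform: one value per original row, here the total of that row's store
toy['STORE_TOTAL'] = toy.groupby('STORE')['SALES'].transform('sum')
toy['SHARE'] = toy['SALES'] / toy['STORE_TOTAL']

print(totals)
print(toy)
```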

-------------------------------------------------------------------------------------
To answer time-based queries we need the `.dt` accessor, which is available on Series objects with a datetime dtype.

In [76]:
storeSales['DAY'].dtype

dtype('<M8[ns]')

In [78]:
# Monday = 0, Sunday = 6
storeSales.groupby(storeSales['DAY'].dt.dayofweek)['SALES'].mean()

DAY
0    50.555556
1    34.611111
2    47.153846
3    55.307692
4    42.230769
5    42.777778
6    50.937500
Name: SALES, dtype: float64
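
Besides `dayofweek`, the `.dt` accessor exposes other components of each timestamp as ordinary Series. A short sketch on three hand-picked dates from October 2017 (a Sunday, a Monday and a Friday):

```python
import pandas as pd

days = pd.Series(pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-06']))

# Each component comes back as a Series aligned with the original rows
weekdays = days.dt.dayofweek.tolist()   # Monday = 0, Sunday = 6
day_of_month = days.dt.day.tolist()
months = days.dt.month.tolist()

print(weekdays)
print(day_of_month)
print(months)
```

Because the result is aligned with the original rows, it can be passed directly to `groupby` or used in a boolean selection.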

### Exercise: ###
Find the average sales on Saturdays