![MLTrain logo](https://mltrain.cc/wp-content/uploads/2017/11/mltrain_logo-4.png "MLTrain logo")

In [1]:
# !wget -q -O changeNBLayout.py https://raw.githubusercontent.com/cmalliopoulos/PfBDAaML/master/changeNBLayout.py
%run changeNBLayout.py

In [1]:
from IPython.display import YouTubeVideo
# YouTubeVideo expects the video id, not the full URL; extra keywords become URL parameters
YouTubeVideo('XDAnFZqJDvI', list = 'PLIivdWyY5sqJxnwJhe3etaK7utrBiPBQ2', index = 12)

# Introduction #

The pandas library implements the basic machinery for handling in-memory tabular data with Python.  
It is a large library with more than 600 methods, attributes and functions.  
The online documentation of the current version (0.21) is available at https://pandas.pydata.org/pandas-docs/stable/overview.html.  
  
The core data structures in Pandas are the __Series, DataFrame and Index objects__.  
You should think of Series and DataFrames as fact tables (i.e. with named and typed columns) and of Indices as dimensions with hierarchies.  
  
Pandas' methods and functions permit standard __dimensional analysis__ (slicing, dicing and pivoting). Moreover, Pandas provides a comprehensive set of analytical functions for __group transforms and ranking__.  
  
__Pandas__ stands for __Pan__el __Da__ta.  
It was developed as a way to do Excel-type calculations programmatically; however, it more closely resembles Microsoft's __Power Pivot__ due to its rich type system, visualization and tabular processing capabilities.  
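
As a rough illustration of the three core structures, the sketch below builds a Series and a DataFrame and looks at their Index objects. The data and the names `heights` and `players` are made up for the example; they are not part of the datasets used later.

```python
import pandas as pd

# A Series is a typed, one-dimensional column of values with an Index of labels
heights = pd.Series([188, 198, 196], index=['Avery', 'Jae', 'John'], name='height_cm')

# A DataFrame is a table of named, typed columns that share one Index
players = pd.DataFrame({'height_cm': heights, 'position': ['PG', 'SF', 'SG']})

print(players.dtypes)     # one dtype per column
print(players.index)      # row labels (an Index object)
print(players.columns)    # column labels are an Index too
```

Both the row labels and the column labels are Index objects, which is why the same selection machinery applies along either axis.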


### Imports ###

In [1]:
from os import linesep as endl
import pandas as pd
import numpy as np

# Adjust Pandas layout options
pd.set_option('display.width', 124)

In [2]:
! wget -q -O nba.csv https://raw.githubusercontent.com/cmalliopoulos/PfBDAaML/master/nba.csv
nba = pd.read_csv('nba.csv')
print 'NBA player stats:', endl, nba.sample(5)


NBA player stats: 
                         Name                  Team  Number Position   Age Height  Weight     College      Salary
451             Chris Johnson             Utah Jazz    23.0       SF  26.0    6-6   206.0      Dayton    981348.0
37               Jerian Grant       New York Knicks    13.0       PG  23.0    6-4   195.0  Notre Dame   1572360.0
252            Terrence Jones       Houston Rockets     6.0       PF  24.0    6-9   252.0    Kentucky   2489530.0
99   Luc Richard Mbah a Moute  Los Angeles Clippers    12.0       PF  29.0    6-8   230.0        UCLA    947276.0
96              Blake Griffin  Los Angeles Clippers    32.0       PF  27.0   6-10   251.0    Oklahoma  18907726.0


In [3]:
! wget -q -O employees.csv https://raw.githubusercontent.com/cmalliopoulos/PfBDAaML/master/employees.csv
emp = pd.read_csv("employees.csv", parse_dates = ["Start Date", "Last Login Time"])
print '', endl, emp.sample(5)

 
    First Name  Gender Start Date     Last Login Time  Salary  Bonus % Senior Management                  Team
841       Ruby  Female 2006-08-13 2017-11-29 18:27:00   48354   19.501             False  Business Development
973    Russell    Male 2013-05-10 2017-11-29 23:08:00  137359   11.105             False  Business Development
152       Ruth  Female 1999-08-19 2017-11-29 04:03:00  129297    8.067              True       Client Services
771      Peter    Male 1991-05-22 2017-11-29 01:39:00  102577   12.026              True               Product
456    Deborah     NaN 1983-02-03 2017-11-29 23:38:00  101457    6.662             False           Engineering


# Introspection methods and attributes #

DataFrames have quite a few attributes for __introspection__ and convenience methods for querying their __content__:

In [5]:
print 'Print first 5 rows', endl, nba.head(5)

Print first 5 rows 
            Name            Team  Number Position   Age Height  Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0


In [19]:
print 'Print 5 last rows', endl, nba.tail()


Print 5 last rows 
             Name       Team  Number Position   Age Height  Weight College     Salary
453  Shelvin Mack  Utah Jazz     8.0       PG  26.0    6-3   203.0  Butler  2433333.0
454     Raul Neto  Utah Jazz    25.0       PG  24.0    6-1   179.0     NaN   900000.0
455  Tibor Pleiss  Utah Jazz    21.0        C  26.0    7-3   256.0     NaN  2900000.0
456   Jeff Withey  Utah Jazz    24.0        C  26.0    7-0   231.0  Kansas   947276.0
457           NaN        NaN     NaN      NaN   NaN    NaN     NaN     NaN        NaN


In [18]:
print 'A random sample of 5 rows', endl, nba.sample(5)


A random sample of 5 rows 
                       Name                    Team  Number Position   Age Height  Weight   College     Salary
32   Thanasis Antetokounmpo         New York Knicks    43.0       SF  23.0    6-7   205.0       NaN    30888.0
4             Jonas Jerebko          Boston Celtics     8.0       PF  29.0   6-10   231.0       NaN  5000000.0
88        Marreese Speights   Golden State Warriors     5.0        C  28.0   6-10   255.0   Florida  3815000.0
457                     NaN                     NaN     NaN      NaN   NaN    NaN     NaN       NaN        NaN
410      Karl-Anthony Towns  Minnesota Timberwolves    32.0        C  20.0    7-0   244.0  Kentucky  5703600.0


In [20]:
print 'Column types', endl, nba.dtypes


Column types 
Name         object
Team         object
Number      float64
Position     object
Age         float64
Height       object
Weight      float64
College      object
Salary      float64
dtype: object


In [17]:
print 'column names', endl, nba.columns


column names 
Index([u'Name', u'Team', u'Number', u'Position', u'Age', u'Height', u'Weight', u'College', u'Salary'], dtype='object')


In [16]:
print 'The index object', endl, nba.index


The index object 
RangeIndex(start=0, stop=458, step=1)


In [15]:
print 'DataFrame values as array', endl, nba.values


DataFrame values as array 
[['Avery Bradley' 'Boston Celtics' 0.0 ..., 180.0 'Texas' 7730337.0]
 ['Jae Crowder' 'Boston Celtics' 99.0 ..., 235.0 'Marquette' 6796117.0]
 ['John Holland' 'Boston Celtics' 30.0 ..., 205.0 'Boston University' nan]
 ..., 
 ['Tibor Pleiss' 'Utah Jazz' 21.0 ..., 256.0 nan 2900000.0]
 ['Jeff Withey' 'Utah Jazz' 24.0 ..., 231.0 'Kansas' 947276.0]
 [nan nan nan ..., nan nan nan]]


In [21]:
print 'Shape', endl, nba.shape


Shape 
(458, 9)


In [13]:
print endl, 'Number of elements', endl, nba.size



Number of elements 
4122


In [22]:
# info() prints its report itself and returns None, so it is called on its own
print 'Summary information'
nba.info()

Summary information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 458 entries, 0 to 457
Data columns (total 9 columns):
Name        457 non-null object
Team        457 non-null object
Number      457 non-null float64
Position    457 non-null object
Age         457 non-null float64
Height      457 non-null object
Weight      457 non-null float64
College     373 non-null object
Salary      446 non-null float64
dtypes: float64(4), object(5)
memory usage: 32.3+ KB


__Value counts__

In [6]:
print nba.Team.value_counts(sort = True).head(5)

New Orleans Pelicans    19
Memphis Grizzlies       18
Milwaukee Bucks         16
New York Knicks         16
Denver Nuggets          15
Name: Team, dtype: int64


__Column types__ can be set programmatically

In [23]:
print 'Before type-setting:', endl, emp.dtypes


Before type-setting: 
First Name                   object
Gender                       object
Start Date           datetime64[ns]
Last Login Time      datetime64[ns]
Salary                        int64
Bonus %                     float64
Senior Management            object
Team                         object
dtype: object


In [24]:
# Change types programmatically
# NB: astype('bool') converts missing values (NaN) to True
emp["Senior Management"] = emp["Senior Management"].astype('bool')
emp["Gender"] = emp["Gender"].astype("category")

print endl, 'After type-setting', endl, emp.dtypes


After type-setting 
First Name                   object
Gender                     category
Start Date           datetime64[ns]
Last Login Time      datetime64[ns]
Salary                        int64
Bonus %                     float64
Senior Management              bool
Team                         object
dtype: object
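
One thing worth checking before a cast like the one above is how missing values are treated. The sketch below uses two small made-up Series (`flags` and `gender` are hypothetical names, not columns of `emp`) to show that casting to `bool` turns a missing value into `True`, whereas a categorical column keeps it as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing booleans with a missing value (hence dtype object)
flags = pd.Series([True, False, np.nan], dtype=object)

# NaN is truthy, so the cast silently turns it into True
as_bool = flags.astype('bool')
print(as_bool.tolist())

# A categorical column stores each label once and keeps missing values missing
gender = pd.Series(['Male', 'Female', None, 'Male']).astype('category')
print(gender.cat.categories.tolist())
print(gender.isnull().sum())
```

If the distinction between "False" and "unknown" matters, it may be safer to fill or drop the missing values explicitly before casting.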


# Construction #

In [25]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

print 'Dictionary ctor:'
print pd.DataFrame(data)


Dictionary ctor:
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


In [26]:
print endl, 'If you specify a different order in "columns" the df will be arranged properly:'
print pd.DataFrame(data, columns = ['year', 'state', 'pop'])



If you specify a different order in "columns" the df will be arranged properly:
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


In [31]:
frame2 = pd.DataFrame(
    data, 
    columns = ['year', 'state', 'pop', 'debt'],
    index = ['one', 'two', 'three', 'four', 'five'])
print 'Non-existing columns fill the DataFrame with nulls:'
print frame2


Non-existing columns fill the DataFrame with nulls:
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


In [29]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame3 = pd.DataFrame(pop)

print 'A nested dict creates column and index objects:'
print frame3


A nested dict creates column and index objects:
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
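
If the outer keys of a nested dict should label the rows rather than the columns, `DataFrame.from_dict` with `orient='index'` is one option. The sketch below reuses the `pop` dict from the cell above; the names `by_state` and `by_year` are only illustrative.

```python
import pandas as pd

pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

# Outer keys become columns, the union of the inner keys becomes the index
by_state = pd.DataFrame(pop)

# orient='index' makes the outer keys the row labels instead
by_year = pd.DataFrame.from_dict(pop, orient='index')

# An inner key that is absent for a state is filled with NaN in both layouts
print(by_state.isnull().sum().sum())
print(by_year.isnull().sum().sum())
```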


In [30]:
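colOrder = ['year', 'state', 'pop']
# Transpose a list of columns into an array of rows
listT = lambda _: np.array(_).T

# Stack the columns in the same order as their labels:
# data.values() follows the dict's own key order, which need not match the labels
frame4 = pd.DataFrame(
    data = listT([data[col] for col in colOrder]), 
    index = ['one', 'two', 'three', 'four', 'five'], 
    columns = colOrder)

frame4.index.name = 'ixNames'
frame4.columns.name = 'colNames'

print 'Index and columns can be named:'
print frame4

Index and columns can be named:
colNames  year   state  pop
ixNames                    
one       2000    Ohio  1.5
two       2001    Ohio  1.7
three     2002    Ohio  3.6
four      2001  Nevada  2.4
five      2002  Nevada  2.9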


------------------------------------------
# Projection and selection #

### Projection ###

In [32]:
data = pd.DataFrame(
    np.random.choice(10, 24).reshape((6, 4)),
    index = ['Ohio', 'Colorado', 'Utah', 'New York', 'Atlanta', 'San Francisco'],
    columns = ['one', 'two', 'three', 'four'])

# Projections
print data['one']

Ohio             9
Colorado         6
Utah             2
New York         5
Atlanta          0
San Francisco    3
Name: one, dtype: int64


In [33]:
print data[['one', 'two']]

               one  two
Ohio             9    0
Colorado         6    6
Utah             2    1
New York         5    0
Atlanta          0    8
San Francisco    3    3


In [34]:
print data[[col for col in data.columns if 'o' in col]]

               one  two  four
Ohio             9    0     5
Colorado         6    6     1
Utah             2    1     5
New York         5    0     0
Atlanta          0    8     9
San Francisco    3    3     3


### Selection by index ###

In [35]:
# Selection by index label
data.loc['Ohio']
data.loc['Ohio':'Utah']
data.loc[['Ohio', 'New York']]

# Selection by integer position
data.iloc[2:4]
data.iloc[::-1]
data[2:4]    # a plain slice also selects rows by position


# Simultaneous selection by label and projection
data.loc['Atlanta', ['two', 'three']]

two      8
three    3
Name: Atlanta, dtype: int64
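
A detail that is easy to miss in the cell above is that label slices and position slices treat their end points differently. The sketch below uses a small deterministic frame (`frame` is a made-up name with made-up values) to show the difference.

```python
import numpy as np
import pandas as pd

# Small deterministic frame; labels and values are made up for the example
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=['Ohio', 'Colorado', 'Utah', 'New York'],
    columns=['one', 'two', 'three'])

by_label = frame.loc['Ohio':'Utah']    # label slice: both end points included
by_position = frame.iloc[0:2]          # position slice: end point excluded

print(by_label.index.tolist())         # ['Ohio', 'Colorado', 'Utah']
print(by_position.index.tolist())      # ['Ohio', 'Colorado']

# Rows and columns can be selected in one call
print(frame.loc['Utah', ['one', 'three']].tolist())    # [6, 8]
```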

### Boolean and relational selections ###

We can select subsets of rows with boolean or relational expressions on the column values:

In [36]:
# Selection based on categories
print emp[emp['Team'] == 'Marketing'].sample(2)

    First Name  Gender Start Date     Last Login Time  Salary  Bonus %  Senior Management       Team
947        NaN    Male 2012-07-30 2017-11-27 15:07:00  107351    5.329               True  Marketing
730     Nicole  Female 2009-04-26 2017-11-27 00:40:00   66047   18.674               True  Marketing


In [37]:
print emp[(emp['Gender'] == 'Male') & (emp['Team'] == 'Sales')].sample(2)
print endl, emp[(emp['Gender'] == 'Female') & ~(emp['First Name'] == 'Mary')].sample(2)


    First Name Gender Start Date     Last Login Time  Salary  Bonus %  Senior Management   Team
371      Larry   Male 2003-08-27 2017-11-27 13:00:00   91133    5.140              False  Sales
202      Roger   Male 1982-11-08 2017-11-27 02:32:00  140558    5.084               True  Sales

    First Name  Gender Start Date     Last Login Time  Salary  Bonus %  Senior Management       Team
361   Margaret  Female 2014-05-05 2017-11-27 06:01:00   55044    4.078              False      Sales
43     Marilyn  Female 1980-12-07 2017-11-27 03:16:00   73524    5.207               True  Marketing


In [38]:
# Selection based on scalars
print emp[emp['Start Date'] > '1990-01-01'].sample(2)


    First Name Gender Start Date     Last Login Time  Salary  Bonus %  Senior Management             Team
70        Todd    NaN 2003-06-10 2017-11-27 14:26:00   84692    6.617              False  Client Services
463       Jose   Male 2002-07-11 2017-11-27 09:15:00   59862    3.269              False          Product


In [40]:
# a more complex condition
print endl, emp[~emp['Team'].isin(['Marketing', 'Sales'])].sample(3, replace = False)
print endl, emp[emp['Salary'].between(80000, 150000)].sample(3)


    First Name  Gender Start Date     Last Login Time  Salary  Bonus %  Senior Management             Team
289    Jessica  Female 1985-09-27 2017-11-27 13:35:00   75145    6.388               True            Legal
966      Louis    Male 2011-08-16 2017-11-27 17:19:00   93022    9.146               True  Human Resources
879        Amy  Female 2009-05-20 2017-11-27 06:26:00   75415   19.132              False  Client Services

    First Name  Gender Start Date     Last Login Time  Salary  Bonus %  Senior Management                  Team
725     Jeremy    Male 1991-05-19 2017-11-27 13:40:00  131513    1.876               True               Finance
219      Billy    Male 1995-03-13 2017-11-27 12:05:00  120444    7.768               True               Finance
164       Mary  Female 1999-08-13 2017-11-27 01:03:00  134645   18.197              False  Business Development
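
The selections above all follow the same pattern: build a boolean Series, optionally combine several of them, and use the result to index the DataFrame. A minimal sketch on a made-up frame (`staff` and its values are hypothetical) could look like this:

```python
import pandas as pd

staff = pd.DataFrame({
    'Team': ['Sales', 'Legal', 'Product', 'Sales'],
    'Salary': [80000, 95000, 150000, 60000]})

# Each comparison yields a boolean Series; combine them with &, | and ~
# (the parentheses are required because & binds tighter than == and >=)
well_paid_sales = staff[(staff['Team'] == 'Sales') & (staff['Salary'] >= 80000)]

# between() includes both end points by default
in_band = staff[staff['Salary'].between(80000, 150000)]

# ~ negates a mask: everyone outside Sales and Legal
others = staff[~staff['Team'].isin(['Sales', 'Legal'])]

print(well_paid_sales.index.tolist())
print(in_band.index.tolist())
print(others.index.tolist())
```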


### More general selections ###
With DataFrames of numeric types more general selections are possible:

In [42]:
df = pd.DataFrame(np.random.randn(5, 5))
print 'Dataframe of floats:', endl, df


Dataframe of floats: 
          0         1         2         3         4
0 -2.202541 -0.223003  0.158628  0.703146  1.529454
1  0.665455 -0.373617  0.627859  2.034397 -0.075145
2 -0.815562 -0.107887  1.090448 -0.882143  0.560488
3  0.188105  1.804585 -2.377244 -1.538447 -0.493399
4  0.062000 -1.273610  1.415440  0.314913  0.123493


In [44]:
mask = (df > .5)
print 'Boolean mask', endl, mask


Boolean mask 
       0      1      2      3      4
0  False  False  False   True   True
1   True  False   True   True  False
2  False  False   True  False   True
3  False   True  False  False  False
4  False  False   True  False  False


In [45]:
print 'Applying mask to df:', endl, df[mask]

Applying mask to df: 
          0         1         2         3         4
0       NaN       NaN       NaN  0.703146  1.529454
1  0.665455       NaN  0.627859  2.034397       NaN
2       NaN       NaN  1.090448       NaN  0.560488
3       NaN  1.804585       NaN       NaN       NaN
4       NaN       NaN  1.415440       NaN       NaN
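
Indexing with a boolean DataFrame appears to behave like `DataFrame.where`, which additionally accepts a replacement value for the cells that fail the condition. The sketch below checks this on a tiny made-up frame (`values` is a hypothetical name).

```python
import pandas as pd

values = pd.DataFrame([[0.2, 0.9], [1.5, -0.3]], columns=['a', 'b'])
mask = values > .5

# Indexing with a boolean DataFrame keeps the shape and blanks out the False cells
masked = values[mask]

# where() gives the same result ...
same = masked.equals(values.where(mask))

# ... and can substitute a value other than NaN
clipped = values.where(mask, 0.0)

print(same)
print(clipped)
```

To drop whole rows instead of blanking individual cells, index with a boolean Series, e.g. `values[values['a'] > .5]`.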


### .isnull and .notnull ###

In [46]:
print emp['Team'].isnull().sample(5)


618    False
538    False
267    False
113    False
208    False
Name: Team, dtype: bool


In [47]:
print 'First names of employees without team:'
print emp['First Name'][emp['Team'].isnull()].sample(5)

First names of employees without team:
781    Lawrence
864        Ryan
774         NaN
479     Richard
199    Jonathan
Name: First Name, dtype: object


### .unique and .nunique ###

In [49]:
print emp['First Name'][emp['Team'].isnull()].unique()


['Thomas' 'Louise' nan 'James' 'Christopher' 'Jonathan' 'Michael' 'Jeremy'
 'Bobby' 'Edward' 'Joyce' 'Jason' 'Chris' 'Richard' 'Wanda' 'Jimmy' 'Peter'
 'Kimberly' 'Harry' 'Carl' 'Randy' 'Donald' 'Joseph' 'Alice' 'Todd'
 'Daniel' 'Antonio' 'Lawrence' 'Nicole' 'Charles' 'Mildred' 'Phillip'
 'Ryan' 'Joe']


In [62]:
print len(emp['First Name'][emp['Team'].isnull()].unique())


34


In [61]:
print 'Not equal. Why?', endl, emp['First Name'][emp['Team'].isnull()].nunique()


Not equal. Why? 
33


In [60]:
print "Now they're equal:", endl, emp['First Name'][emp['Team'].isnull()].nunique(dropna = False)

Now they're equal: 
34
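
The difference between the two counts comes down to how missing values are treated. The sketch below reproduces it on a short made-up Series (`names` is a hypothetical name).

```python
import numpy as np
import pandas as pd

names = pd.Series(['Thomas', 'Louise', np.nan, 'Thomas', np.nan])

n_unique_array = len(names.unique())        # NaN is listed as a value
n_default = names.nunique()                 # NaN is skipped by default
n_with_nan = names.nunique(dropna=False)    # NaN is counted once

print(n_unique_array)    # 3
print(n_default)         # 2
print(n_with_nan)        # 3
```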


# Sorting and ranking #

In [53]:
# To sort by the values of one or more columns use sort_values method
print emp.sort_values(by = 'First Name', na_position = 'last').head(5)


    First Name Gender Start Date     Last Login Time  Salary  Bonus %  Senior Management             Team
101      Aaron   Male 2012-02-17 2017-11-27 10:20:00   61602   11.849               True        Marketing
327      Aaron   Male 1994-01-29 2017-11-27 18:48:00   58755    5.097               True        Marketing
440      Aaron   Male 1990-07-22 2017-11-27 14:53:00   52119   11.343               True  Client Services
937      Aaron    NaN 1986-01-22 2017-11-27 19:39:00   63126   18.424              False  Client Services
137       Adam   Male 2011-05-21 2017-11-27 01:45:00   95327   15.120              False     Distribution


In [59]:
# Create an index by which to sort
print 'Sort by "Start Date" using index sorting'
_0 = emp.set_index(keys = 'Start Date', drop = True)
print endl, _0.sort_index(ascending = False, na_position = 'last').head(5)

Sort by "Start Date" using index sorting

           First Name  Gender     Last Login Time  Salary  Bonus %  Senior Management             Team
Start Date                                                                                            
2016-07-15      Terry     NaN 2017-11-27 00:29:00  140002   19.490               True        Marketing
2016-06-16       Tina  Female 2017-11-27 19:47:00  100705   16.961               True        Marketing
2016-06-05    Lillian  Female 2017-11-27 06:09:00   59414    1.256              False          Product
2016-05-24        NaN    Male 2017-11-27 21:17:00   76409    7.008               True     Distribution
2016-05-12    Lillian     NaN 2017-11-27 15:43:00   64164   17.612              False  Human Resources
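
`sort_values` also accepts several columns, each with its own direction. A minimal sketch on a made-up frame (`staff` and its values are hypothetical):

```python
import pandas as pd

staff = pd.DataFrame({
    'Team': ['Sales', 'Legal', 'Sales', 'Legal'],
    'Salary': [50, 70, 90, 60]})

# Teams in alphabetical order, highest salary first within each team
ordered = staff.sort_values(by=['Team', 'Salary'], ascending=[True, False])

print(ordered)
```

The original row labels travel with the rows, so `ordered.index` shows where each row came from.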


`rank` assigns to each value its position in the sorted order of the values.  
The ranks can be contiguous or not, depending on the ranking method. Except with `method = 'first'`, equal values are assigned the same rank.  
`average` (the default method) assigns to each group of equal values the average of the positions they occupy in the sorted order.  
  
__NB:__ The result keeps the original row order; it is not sorted by rank.

In [67]:
# create a dummy DataFrame
np.random.seed(101)
df = pd.DataFrame(np.random.choice(5, [5, 2]))
print df


   0  1
0  3  1
1  3  1
2  0  4
3  0  4
4  4  0


In [68]:
# Default ranking
print 'Default ranks of first column elements', endl, df[0].rank()

Default ranks of first column elements 
0    3.5
1    3.5
2    1.5
3    1.5
4    5.0
Name: 0, dtype: float64


In [69]:
print 'Dense ranking', endl, pd.concat([df[0], df[0].rank(method = 'dense')], axis = 1)

Dense ranking 
   0    0
0  3  2.0
1  3  2.0
2  0  1.0
3  0  1.0
4  4  3.0
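
To compare the ranking methods side by side, the sketch below ranks the same values as the first column of the dummy DataFrame above with four of the methods that `rank` supports. The names `values` and `ranks` are only illustrative.

```python
import pandas as pd

values = pd.Series([3, 3, 0, 0, 4])

ranks = pd.DataFrame({
    'value': values,
    'average': values.rank(),                 # ties share the mean position
    'min': values.rank(method='min'),         # ties share the lowest position
    'dense': values.rank(method='dense'),     # like 'min', but without gaps
    'first': values.rank(method='first')})    # ties broken by order of appearance

print(ranks)
```

Only `dense` yields contiguous ranks, and only `first` gives every row a distinct rank.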


-------------------
# Groupings and transforms #

Projections and selections, when combined with the `.groupby`, `.apply` and `.transform` methods described below, essentially constitute pandas' powerful framework for multidimensional analysis (MDA).  
MDA supported by visualizations (an area we'll explore later) is what has been known as __Exploratory Data Analysis__, a term coined in the 70s by John Tukey, one of the prominent statisticians of our era.

__Quiz:__ What else is [Tukey known for](https://en.wikipedia.org/wiki/Cooley%E2%80%93Tukey_FFT_algorithm)? (it dominated the electronics industry for 3 decades)

In [72]:
from datetime import datetime as dt
np.random.seed(101)

### Dataset ###
Create a DataFrame with sales records of an imaginary retailer.  
The retailer sells products identified by their stock-keeping unit (SKU column) in several stores identified by their ID (STORE column).  
Each record contains the number and the price of items (SKUs) sold in a store on a particular day.
  
For our case assume there are 4 SKUs and 4 stores, and that we have records from 1 May 2016 to 31 October 2017.

In [73]:
# SKUs
storeSales = pd.DataFrame(np.random.choice(list('ABCD'), 100), columns = ['SKU'])
# Stores
storeSales['STORE'] = np.random.choice(list('WXYZ'), 100)
# Dates
storeSales['DAY'] = np.random.choice(pd.date_range(start = dt(2016, 5, 1), end = dt(2017, 10, 31)), 100)
# Prices: one fixed price per SKU
prices = dict(zip(list('ABCD'), np.random.uniform(10, 100, 4)))
storeSales['PRICE'] = storeSales['SKU'].map(prices)

# Items sold (0 sales on a record are permitted)
storeSales['SALES'] = np.random.uniform(0, 100, 100).astype(int)

Permit some null prices (e.g. assume some SKUs were sold with coupons of varying markdowns).  
__Caveat:__  
If you try to use instead
``` Python
storeSales[np.random.choice([True, False], storeSales.shape[0], p = [.05, .95])]['PRICE'] = np.nan
```
you get a warning that you are trying to assign to a copy of a slice of the DataFrame (`SettingWithCopyWarning`), and `storeSales` itself is left unchanged.  
  
We use `.loc` to avoid this: `.loc[rows, column] = value` selects and assigns in a single operation on the original DataFrame, so the assignment is safe.
  
__Quiz:__  
Where's the implicit copy in the above statement?

In [74]:
storeSales.loc[np.random.choice([True, False], storeSales.shape[0], p = [.05, .95]), 'PRICE'] = np.nan
print storeSales.sample(5)

   SKU STORE        DAY      PRICE  SALES
51   A     X 2016-07-15  50.774371     31
25   D     Z 2017-08-13  11.601213     62
32   D     W 2017-10-01  11.601213     87
9    B     X 2016-07-19  28.058070     46
19   C     Y 2016-11-07  66.801608      1


__Sales per STORE per DAY:__

In [75]:
storeSales.groupby(['STORE', 'DAY'])['SALES'].sum().sample(5)

STORE  DAY       
Y      2017-04-25    59
       2016-07-01    62
W      2016-07-28    18
       2017-09-28    75
       2017-09-01    76
Name: SALES, dtype: int64

__Average sales per SKU in store X__

In [76]:
storeSales[storeSales['STORE'] == 'X'].groupby('SKU')['SALES'].mean()

SKU
A    45.800000
B    22.500000
C    57.000000
D    54.727273
Name: SALES, dtype: float64
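
Several aggregates can be computed in one pass by handing `.agg` a list of function names. The sketch below uses a small made-up table (`records` and its values are hypothetical) rather than the random `storeSales` data, so that the results are easy to verify by hand.

```python
import pandas as pd

records = pd.DataFrame({
    'STORE': ['X', 'X', 'Y', 'Y', 'Y'],
    'SKU':   ['A', 'B', 'A', 'A', 'B'],
    'SALES': [10, 20, 30, 50, 40]})

# Split by STORE, then reduce SALES with several functions at once
per_store = records.groupby('STORE')['SALES'].agg(['sum', 'mean', 'count'])
print(per_store)

# Grouping by two keys yields a Series with a two-level index
per_store_sku = records.groupby(['STORE', 'SKU'])['SALES'].sum()
print(per_store_sku.loc[('Y', 'A')])    # 80
```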

-------------------------------------------------------------------------------------
To answer time queries we use the .dt accessor of Series objects with datetime dtype.

In [79]:
storeSales['DAY'].dtype

dtype('<M8[ns]')

In [80]:
# Monday = 0, Sunday = 6
print storeSales.groupby(storeSales['DAY'].dt.dayofweek)['SALES'].mean()

DAY
0    53.000000
1    47.000000
2    54.250000
3    42.571429
4    51.450000
5    47.272727
6    51.230769
Name: SALES, dtype: float64


-----------------------------------------
<div style = "color: darkred; font-size: 200%; font-weight: bold;  text-decoration: underline"> 
Exercise 
</div>  

Find the average sales by store on Saturdays

-------------------------------------------------------------------------------
We can compute more advanced aggregations by grouping on a key computed from a column instead of on the column itself.  
E.g. to compute the sum of sales per year and week:

In [81]:
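# Encode year and week as one integer, e.g. week 7 of 2017 -> 201707.
# NB: .week is the ISO week number, so 1 January 2017 (ISO week 52 of 2016) gets the key 201752.
yearWeekAsInts = lambda tseries_: tseries_.year * 100 + tseries_.week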
salesByWeek = storeSales.groupby(yearWeekAsInts(storeSales['DAY'].dt))['SALES'].sum()
print salesByWeek.sort_index().sample(10)

DAY
201651    109
201723     90
201743    107
201638     27
201742    157
201752     72
201707     78
201629    126
201622     57
201635    193
Name: SALES, dtype: int64


For more complex aggregations it is better to store the year-week key in a column of its own:

In [82]:
storeSales['YEAR_WEEK'] = yearWeekAsInts(storeSales['DAY'].dt)
print storeSales.groupby(['YEAR_WEEK', 'STORE'])['SALES'].sum().sort_index().sample(10)

YEAR_WEEK  STORE
201647     Y         57
201626     Y         62
201730     Z         59
201722     Z         42
201733     W         43
201717     Z         94
201621     Y        105
201735     W         76
201627     X         75
201631     Z        128
Name: SALES, dtype: int64
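
Combining the calendar year with the ISO week number can mislabel days around New Year, because an ISO week may belong to the neighbouring year. The sketch below shows the effect on three dates and one possible remedy, pairing the ISO week with the ISO year. It assumes a newer pandas (1.1 or later) in which `Series.dt.isocalendar()` is available, so it does not run on the version used in this notebook; the names `days`, `naive_key` and `iso_key` are illustrative.

```python
import pandas as pd

days = pd.Series(pd.to_datetime(['2016-12-30', '2017-01-01', '2017-01-02']))

# ISO year, week and day for each date, as plain integers
iso = days.dt.isocalendar().astype(int)

# Calendar year combined with ISO week: 1 January 2017 lands in "201752"
naive_key = days.dt.year * 100 + iso['week']

# ISO year combined with ISO week keeps each week in one piece
iso_key = iso['year'] * 100 + iso['week']

print(naive_key.tolist())    # [201652, 201752, 201701]
print(iso_key.tolist())      # [201652, 201652, 201701]
```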
