# Transforming Data in Python
In these examples the data will be in Pandas objects, Series or DataFrames.

Pandas documentation is found at:
https://pandas.pydata.org/pandas-docs/stable/ 
and official tutorials are at:
https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html


# setup

In [45]:
import pandas as pd
import pprint
import numpy as np
gradeDataPath = r"C:\Users\Carlos Zambrana\OD\LAS792_Spring2021_ForStudents\data\gradeExample.csv"



In [46]:
from pathlib import Path

print(Path.cwd())

/Users/cagilalbayrak/LAS792


### DataFrame Structure - Indexing and Selecting Data

Documentation on indexing and selecting data for Pandas is at
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html 

### Creating a DataFrame by columns
While you will most likely be importing a DataFrame from a file, 
for this notebook we will be creating example DataFrames with a call to a DataFrame constructor function.

First let's look at the structure of a DataFrame. DataFrame objects have built in indices for rows and columns.

This DataFrame is constructed from a dictionary whose keys are the column names and whose values are lists of the values in each column. The lists must be in the same (row) order and have the same number of entries.

In [47]:
columnDict = {'age':[10,20,30],
            'textAge':['ten', 'twenty', 'thirty']}
print('\n dictionary columnDict\n')
pprint.pprint(columnDict)

print('\n DataFrame myDf')

# Creating a DataFrame from a dictionary object
# The column index is created from the keys
# the row index defaults to range(n) where n is the number of rows
myDf = pd.DataFrame(columnDict)
myDf


 dictionary columnDict

{'age': [10, 20, 30], 'textAge': ['ten', 'twenty', 'thirty']}

 DataFrame myDf


Unnamed: 0,age,textAge
0,10,ten
1,20,twenty
2,30,thirty


In [48]:
# specify a row index explicitly
myDf = pd.DataFrame(columnDict, 
                    index = ['r0','r1','r2'])
myDf

Unnamed: 0,age,textAge
r0,10,ten
r1,20,twenty
r2,30,thirty


### Creating a DataFrame by rows
A DataFrame can also be constructed from a list of dictionaries, where each dictionary represents a row.

In [49]:
rowList = [{'age':10, 'textAge':'ten'},
           {'age':20, 'textAge':'twenty'},
           {'age':30, 'textAge':'thirty'}]
print('rowlist is ', rowList)  

myDf = pd.DataFrame(rowList, 
                    index = ['r0','r1','r2'])
myDf

rowlist is  [{'age': 10, 'textAge': 'ten'}, {'age': 20, 'textAge': 'twenty'}, {'age': 30, 'textAge': 'thirty'}]


Unnamed: 0,age,textAge
r0,10,ten
r1,20,twenty
r2,30,thirty


### Selecting one column from a DataFrame - dot notation
A single column of a DataFrame may be referenced using a dot notation like in this example.

In [50]:
import pandas as pd
import pprint
myDf = pd.DataFrame( {'age':[10,     20,       30],
                  'textAge':['ten', 'twenty', 'thirty']},
                   index = ['r0','r1','r2'])
pprint.pprint(myDf)
myDf.age

    age textAge
r0   10     ten
r1   20  twenty
r2   30  thirty


r0    10
r1    20
r2    30
Name: age, dtype: int64

### A DataFrame column referenced dictionary style
A column may also be referenced like a value in a dictionary, using a column name as a key

In [51]:
col = myDf['age']
print(type(col))
pprint.pprint(col)


<class 'pandas.core.series.Series'>
r0    10
r1    20
r2    30
Name: age, dtype: int64


### Columns
The column names are an index to the column axis

In [52]:
myDf.columns

Index(['age', 'textAge'], dtype='object')

### The Row Index
The row index may be retrieved and changed via the index attribute of the DataFrame

In [53]:
# Display the row index
print(' the row index for myDf is ', myDf.index)

# now change it
myDf.index=['row1','row2','row3']
myDf

 the row index for myDf is  Index(['r0', 'r1', 'r2'], dtype='object')


Unnamed: 0,age,textAge
row1,10,ten
row2,20,twenty
row3,30,thirty


### Row index 
The row index can be changed in an existing DataFrame, or can be specified in the initial DataFrame function.

In [54]:
inputDict2 = {'age':[10,20,30],
            'textAge':['ten', 'twenty', 'thirty'],
             'ID':[3,2,1],
             'fourthCol':['foo','bar','baz']}

myDf2 = pd.DataFrame(inputDict2, 
                     index=['r1','r2','r3'])
myDf2

Unnamed: 0,age,textAge,ID,fourthCol
r1,10,ten,3,foo
r2,20,twenty,2,bar
r3,30,thirty,1,baz


# Boolean Indexing
Elements of a Series can be selected via a Series of boolean values

In [55]:
import pandas as pd
import pprint

exampleSeries = pd.Series(['zero is first', 
               'one is second', 
               'two is third', 
               'three is fourth', 
               'four is last here'])

# This Series (built from a tuple) is True for the first and third elements
isFirstOrThird = pd.Series((True,
                 False,
                 True,
                 False,
                 False))
exampleSeries[isFirstOrThird]


0    zero is first
2     two is third
dtype: object

### Using an expression to generate the Boolean Series
The Boolean Series can be generated by an expression

In [56]:
r = pd.Series(range(0, 10))
print(type(r>3),'\n',r>3)
r[r > 3]


<class 'pandas.core.series.Series'> 
 0    False
1    False
2    False
3    False
4     True
5     True
6     True
7     True
8     True
9     True
dtype: bool


4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64

### recommended indexing methods

The documentation at:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
notes that although the same indexing operators that work with Python iterables can be used with Pandas objects, the *.loc* and *.iloc* access methods are recommended.

*.loc* uses the labels of the row or column indices

*.iloc* uses a numeric position index (0 based)




### by row number

In [57]:
# The second row of myDf, returned as a Series
rowTwo = myDf.iloc[1]

pprint.pprint(rowTwo)
print(type(rowTwo))


age            20
textAge    twenty
Name: row2, dtype: object
<class 'pandas.core.series.Series'>


### all rows and some columns by index

In [58]:

# All rows, and the first and second column of myDf2, returned as a DataFrame
cols = myDf2.iloc[:,0:2]

pprint.pprint(cols)
print(type(cols))


    age textAge
r1   10     ten
r2   20  twenty
r3   30  thirty
<class 'pandas.core.frame.DataFrame'>


### a sub DataFrame

In [59]:
# a sub-DataFrame selected by lists of labels

mySlice = myDf2.loc[['r2','r1'],  ['age','ID']]
pprint.pprint(mySlice)
print(type(mySlice))

    age  ID
r2   20   2
r1   10   3
<class 'pandas.core.frame.DataFrame'>


### all rows and some columns by name

In [60]:
mySlice = myDf2.loc[:,  ['ID', 'age']]
mySlice

Unnamed: 0,ID,age
r1,3,10
r2,2,20
r3,1,30


In [61]:
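# A single = assigns: myDf2.ID = 3 would set every ID to 3.
# A double == compares: myDf2.ID == 3 returns a boolean Series that is True where ID is 3.
# Subset with square brackets, not parentheses:
# myDf2(myDf2.ID == 3) raises TypeError: 'DataFrame' object is not callable.
myDf2[myDf2.ID == 3]

Unnamed: 0,age,textAge,ID,fourthCol
r1,10,ten,3,foo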

### Some rows and all columns

In [None]:
mySlice = myDf2.loc[['r2','r1'],  :]
mySlice

### Boolean subsetting

In [None]:
# a subset of rows by an expression on a value

mySlice2 = myDf2[myDf2.ID>1]
pprint.pprint(mySlice2)
print(type(mySlice2))

### Copy vs view
These methods return a copy, so changing the subset does not change the original DataFrame.

In [None]:
mySlice2['fourthCol'] = 'xxxxx'
myDf2


In [None]:
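mySliceWithLoc = myDf2.loc[['r2','r1'],  ['fourthCol']]
# Assign into the subset; plain mySliceWithLoc = "xxxxx" would only rebind the name to a string
mySliceWithLoc['fourthCol'] = "xxxxx"
myDf2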

### Subsetting on the left hand side of an assignment

In [None]:
myDf2.loc[['r2','r1'],  ['fourthCol']] = 'xxxx'
myDf2

In [None]:
df = myDf2.copy()
df
# the row selector comes first in .loc, then the column label
df.loc[df.fourthCol == 'baz', 'textAge'] = 'other'
df

The editing example below uses a logical expression on a column, such as myDf2.ID == 1. An expression like this returns a Series with boolean values.
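As a minimal sketch of such a logical expression, the cell below builds a small DataFrame laid out like myDf2 and compares one column to a value. The names exampleDf and isIdOne are introduced here only for illustration.

```python
import pandas as pd

# exampleDf is a hypothetical stand-in with the same layout as myDf2
exampleDf = pd.DataFrame({'age': [10, 20, 30],
                          'ID': [3, 2, 1]},
                         index=['r1', 'r2', 'r3'])

# The comparison is evaluated element by element and returns a boolean Series
isIdOne = exampleDf.ID == 1
print(type(isIdOne))
print(isIdOne)

# The boolean Series can then be used as the row selector in .loc
selectedAge = exampleDf.loc[isIdOne, 'age']
print(selectedAge)
```

The boolean Series keeps the row index of the DataFrame, which is what lets .loc line it up with the rows.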

### Ranges
The *range* sequence will be mentioned below. Here are some examples of its usage.

In [None]:
# the range sequence returns a sequence of integers
#https://docs.python.org/3/library/functions.html#func-range 
# https://docs.python.org/3/library/stdtypes.html#typesseq 

# range with only a stop value
print(list( range(4) ) )
    
# range with a start and stop; note that the stop value is not included
print(list( range(4,6) ) )

# range with a start and stop and a step
print(list( range(6,12,2) ) )

### selecting rows with a range

In [64]:
myDf2.iloc[range(0,2)]

Unnamed: 0,age,textAge,ID,fourthCol
r1,10,ten,3,foo
r2,20,twenty,2,bar


### another DataFrame example

In [65]:
rowNumbers = list(range(1,21))
anotherDF = pd.DataFrame({'letter':['a','b']*10,
                     'number':rowNumbers},
                        index=[hex(r) for r in rowNumbers])
anotherDF

Unnamed: 0,letter,number
0x1,a,1
0x2,b,2
0x3,a,3
0x4,b,4
0x5,a,5
0x6,b,6
0x7,a,7
0x8,b,8
0x9,a,9
0xa,b,10


### Subsetting by position with a defined range
This selects the rows at positions 6, 8, and 10. Positions are 0-based, so these are the rows numbered 7, 9, and 11. Note that this is positional, since the labels for the rows (the row index) are hex strings.

In [66]:
myRange = range(6,12,2)
rowNumbers = list(range(1,21))

anotherDF.iloc[list(myRange)]

Unnamed: 0,letter,number
0x7,a,7
0x9,a,9
0xb,a,11


### Subsetting by a list of row ***index values (row names)***
Here the *.loc* accessor is used to extract the rows that have a given list of labels.

In [None]:
rowLabelSubset = [hex(r) for r in myRange]
print('rowLabelSubset is ', rowLabelSubset)
anotherDF.loc[rowLabelSubset]

### Subsetting by a list of values in one of the columns
Here all of the rows whose value in a particular column appears in a list of values are selected. The trick is to make a Boolean list to subset the DataFrame.

In [67]:
wantedRows = [n in [5,2,7] for n in anotherDF.number]
print('wanted Rows', wantedRows)
print('\n  The Resulting DataFrame')
anotherDF[wantedRows]

wanted Rows [False, True, False, False, True, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False]

  The Resulting DataFrame


Unnamed: 0,letter,number
0x2,b,2
0x5,a,5
0x7,a,7
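The same selection can also be sketched with the *isin* method of a Series, which builds the Boolean mask without a list comprehension. The cell below rebuilds the example DataFrame under the illustrative name exampleDF so that it runs on its own.

```python
import pandas as pd

rowNumbers = list(range(1, 21))
exampleDF = pd.DataFrame({'letter': ['a', 'b'] * 10,
                          'number': rowNumbers},
                         index=[hex(r) for r in rowNumbers])

# isin returns a boolean Series aligned with the row index
wanted = exampleDF.number.isin([5, 2, 7])
subset = exampleDF[wanted]
print(subset)
```

Because the mask is a Series rather than a plain list, it carries the row index along with it. The rows come back in their original order, not in the order of the list passed to isin.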


### myDf2

In [63]:
inputDict2 = {'age':[10,20,30],
            'textAge':['ten', 'twenty', 'thirty'],
             'ID':[3,2,1],
             'fourthCol':['foo','bar','baz']}

myDf2 = pd.DataFrame(inputDict2, 
                     index=['r1','r2','r3'])
myDf2

Unnamed: 0,age,textAge,ID,fourthCol
r1,10,ten,3,foo
r2,20,twenty,2,bar
r3,30,thirty,1,baz


### Editing a Value 

Indexing can be used to specify a value to be updated. Editing a DataFrame in place may not be the best practice. Also note that since DataFrames are mutable, in the example below ***myDf2 and myDf3 are both references to the same DataFrame object so they "both" get changed***.

In [24]:
myDf3 = myDf2

myDf2.loc[myDf2.ID == 1, 'age'] = 11
pprint.pprint(myDf2)

myDf3

    age textAge  ID fourthCol
r1   10     ten   3       foo
r2   20  twenty   2       bar
r3   11  thirty   1       baz


Unnamed: 0,age,textAge,ID,fourthCol
r1,10,ten,3,foo
r2,20,twenty,2,bar
r3,11,thirty,1,baz


### Mutability 
If you want a copy of a DataFrame, not just another reference to the same DataFrame, use the .copy method of the DataFrame.

In [25]:
inputDict2 = {'age':[10,20,30],
            'textAge':['ten', 'twenty', 'thirty'],
             'ID':[3,2,1],
             'fourthCol':['foo','bar','baz']}
myDf2 = pd.DataFrame(inputDict2, 
                     index=['r1','r2','r3'])
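# pd.DataFrame(myDf2) does not copy by default; as the output below shows, myDf3 still shares its data with myDf2
myDf3 = pd.DataFrame(myDf2)

# .copy() makes an independent copy; it is the same as pd.DataFrame(myDf2, copy=True)
myDf4 = myDf2.copy()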

myDf2.loc[myDf2.ID == 1, 'age'] = 11
print('myDf2 and myDf3 point to the same DataFrame. It changed. \n',myDf2)
print(myDf3)

print('myDf4 is not changed \n')
myDf4

myDf2 and myDf3 point to the same DataFrame. It changed. 
     age textAge  ID fourthCol
r1   10     ten   3       foo
r2   20  twenty   2       bar
r3   11  thirty   1       baz
    age textAge  ID fourthCol
r1   10     ten   3       foo
r2   20  twenty   2       bar
r3   11  thirty   1       baz
myDf4 is not changed 



Unnamed: 0,age,textAge,ID,fourthCol
r1,10,ten,3,foo
r2,20,twenty,2,bar
r3,30,thirty,1,baz
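A quick way to check whether two names refer to the same DataFrame object is Python's *is* operator. The sketch below uses small illustrative DataFrames (dfA, dfB and dfC are names made up for this example) to contrast plain assignment with the .copy method.

```python
import pandas as pd

dfA = pd.DataFrame({'age': [10, 20, 30], 'ID': [3, 2, 1]})
dfB = dfA            # a second name for the same object
dfC = dfA.copy()     # an independent copy

sameObject = dfB is dfA
copyIsSame = dfC is dfA
print(sameObject, copyIsSame)

# Editing through one name is visible through the other name, but not in the copy
dfA.loc[dfA.ID == 1, 'age'] = 11
print(dfB.loc[2, 'age'])
print(dfC.loc[2, 'age'])
```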


### Beware
The following increases a cell by 10 percent each time it is run. This could be really bad if you don't keep an original copy of the data and run it more times than intended. This is an example of where you need to create your processes so that they are reproducible. 

Try running it several times.


In [26]:
myDf2.loc[myDf2.ID == 1, 'age'] *= 1.1
pprint.pprint(myDf2)


     age textAge  ID fourthCol
r1  10.0     ten   3       foo
r2  20.0  twenty   2       bar
r3  12.1  thirty   1       baz
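One way to make such a step reproducible, sketched here under the assumption that the original data can be kept untouched, is to put the change in a function that works on a copy. The names originalDf and raiseAge are hypothetical and used only for illustration.

```python
import pandas as pd

originalDf = pd.DataFrame({'age': [10, 20, 30], 'ID': [3, 2, 1]},
                          index=['r1', 'r2', 'r3'])

def raiseAge(df):
    """Return a new DataFrame with age raised 10 percent where ID is 1."""
    result = df.copy()
    # convert to float first so the scaled value fits the column type
    result['age'] = result['age'].astype(float)
    result.loc[result.ID == 1, 'age'] *= 1.1
    return result

# Running the step twice from the same input gives the same answer
firstRun = raiseAge(originalDf)
secondRun = raiseAge(originalDf)
print(firstRun)
print(firstRun.equals(secondRun))
```

Since the function always starts from the unchanged original, rerunning the cell cannot compound the increase.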


# Expressions
In Python expressions are built up from *atoms* and operators. In the expression

`1 + myVar`

the `+` is an operator, the `1` is a literal, and `myVar` is an identifier (name) of an object (a variable).

When the expression is evaluated the value of each atom is fed to the appropriate operator.

The formal rules for expressions can be found at:
https://docs.python.org/3/reference/expressions.html



### operations on scalars

In [27]:
myVar = 7
print(" the expression 1 + myVar returns ", 1 + myVar)


 the expression 1 + myVar returns  8


### Computations among objects
Arithmetic operations among indexed objects pair up the cells that have the same index labels.


### first the DataFrames
These examples are from McKinney, *Python for Data Analysis*, pages 148-149.

In [28]:
Df1 = pd.DataFrame(np.arange(12.).reshape((3,4)),
                  columns=list('abcd'))
Df1

Unnamed: 0,a,b,c,d
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0


In [29]:
Df2 = pd.DataFrame(np.arange(20.).reshape((4,5)),
                  columns=list('abcde'))
Df2

Unnamed: 0,a,b,c,d,e
0,0.0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.0,8.0,9.0
2,10.0,11.0,12.0,13.0,14.0
3,15.0,16.0,17.0,18.0,19.0


### Adding two DataFrames
Corresponding cells of the two DataFrames are added together. New cells that did not have a match in one of the DataFrames are given a missing value.

In [30]:
Df1 + Df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


### Using methods to do transformations
Some methods have extra features that simple operators do not. In this example the *add* method is given a fill_value of zero, which stands in for a cell that is missing from one of the DataFrames before the addition. You can find these methods in the documentation for the object (in this case the DataFrame).

In [31]:
Df1.add(Df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
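The fill value only stands in for a cell that is missing on one side. The small sketch below, using two illustrative DataFrames named left and right, shows that a cell missing from both inputs stays missing.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 4.0]})
right = pd.DataFrame({'a': [10.0, np.nan], 'b': [20.0, 40.0]})

# Row 0 of column b is missing only in left, so 0 is used in its place.
# Row 1 of column a is missing in both, so the result is still missing.
total = left.add(right, fill_value=0)
print(total)
```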


### Function Calls
An expression can also include a function call. In the first example below the + operator fails when asked to operate on a string and a number.
In the second example below the *str* function is called to convert the value of the literal 1 to a string before concatenating it to the preceding literal string. Without that conversion the operation throws an error.

You may have also noted that the call to the print function itself is an expression that has the side effect of printing a string to the console.

In [32]:
print("number "+ 1)

TypeError: can only concatenate str (not "int") to str

In [33]:
print("number "+ str(1))

number 1


### Operators as Functions

Operators are functions that also have a special built in syntax, like the + sign in the expression 2+2 .  
https://docs.python.org/3/library/operator.html 

There is a module, *operator*, that defines a function for each intrinsic operator in Python. In the example below the *add* function is used to add two and two

In [34]:
import operator
print(operator.add(2,2))

4


### Precedence
Expressions are evaluated from left to right, but the precedence of operators must be taken into account.

The precedence of operators is listed in the "Operator precedence" section of the Python documentation (the section number varies between Python versions)
https://docs.python.org/3/reference/expressions.html

In [35]:
# Since / has a higher precedence than +  
# the division is performed first
print(1 + 2 / 2)


# This forces the addition to be first
print( (1+2) / 2     )

2.0
1.5
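A few more cases, sketched with made-up variable names, show how precedence and grouping interact. Exponentiation is the usual surprise: it binds more tightly than a unary minus and groups from right to left.

```python
# Exponentiation binds more tightly than unary minus
negSquare = -2 ** 2        # evaluated as -(2 ** 2)
parenSquare = (-2) ** 2

# Multiplication binds more tightly than addition
mixed = 2 + 3 * 4

# Operators of equal precedence group from left to right ...
leftToRight = 10 - 4 - 3   # (10 - 4) - 3

# ... except exponentiation, which groups from right to left
rightToLeft = 2 ** 3 ** 2  # 2 ** (3 ** 2)

print(negSquare, parenSquare, mixed, leftToRight, rightToLeft)
```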


### Another Function Example
In the following a user written function uses expressions within the for loop of the definition to turn a list of objects into a camel case string.

In [36]:
import string
def camelize(stringList):
    camelString = ''
    for aString in stringList:
        aString = str(aString)
        camelString =   camelString + aString.capitalize()
    return camelString

The call to the camelize function appears in an expression. The argument to the function call is also an expression.

In [37]:
expressionResult = "the camelized string is: " + camelize(["foo", "bar", 1])
print(expressionResult)

the camelized string is: FooBar1


### Recoding with Dictionaries - The replace() Function
For variables with discrete values you can specify a dictionary that describes the transformation. In this example the recoded Series is added as a new column. Values that are not keys of the dictionary, such as 'c', are left unchanged.


In [38]:
transform = {"a":1, "b":2}


myDf = pd.DataFrame({'v1':['a','c','b'], 'v2':[1,2,3]})

print(myDf)
myDf['v3'] = myDf.v1.replace(transform.keys(),transform.values() )
myDf

  v1  v2
0  a   1
1  c   2
2  b   3


Unnamed: 0,v1,v2,v3
0,a,1,1
1,c,2,c
2,b,3,2
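The replace method also accepts the dictionary itself, which avoids splitting it into keys and values. The sketch below repeats the recode that way on an illustrative DataFrame named exampleDf.

```python
import pandas as pd

transform = {"a": 1, "b": 2}
exampleDf = pd.DataFrame({'v1': ['a', 'c', 'b'], 'v2': [1, 2, 3]})

# Passing the dictionary maps each key to its value;
# values that are not keys ('c' here) are left unchanged
exampleDf['v3'] = exampleDf.v1.replace(transform)
print(exampleDf)
```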


### in place
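The ***inplace*** parameter changes a Series in place instead of returning a new one. Newer pandas versions may warn about, or ignore, in-place changes made through a column selected from a DataFrame; assigning the result back to a column, as in the previous cell, avoids that.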

In [39]:
myDf = pd.DataFrame({'v1':['a','c','b'], 'v2':[1,2,3]})

print(myDf)
myDf.v1.replace(transform.keys(),transform.values() , inplace=True)
myDf

  v1  v2
0  a   1
1  c   2
2  b   3


Unnamed: 0,v1,v2
0,1,1
1,c,2
2,2,3


## transforming with functions


### applying a dictionary with map

In [40]:
transform = {"a":1, "b":2, np.nan:-9}
myDf = pd.DataFrame({'v1':['a','c','B',np.nan], 'v2':[1,2,3,4]})

print(myDf)
myDf.v1 = myDf.v1.map(transform)
myDf

    v1  v2
0    a   1
1    c   2
2    B   3
3  NaN   4


Unnamed: 0,v1,v2
0,1.0,1
1,,2
2,,3
3,-9.0,4


### applying a function  with map
For a simple function that has one argument the map function can be used to apply said function to each element of a series. 

In [41]:
def triplit(charCell):
    if pd.isna(charCell):
        return charCell
    else:
        return charCell*3

myDf = pd.DataFrame({'v1':['a','c','B',np.nan], 'v2':[1,2,3,4]})
print(myDf)
myDf.v1 = myDf.v1.map(triplit)
myDf

    v1  v2
0    a   1
1    c   2
2    B   3
3  NaN   4


Unnamed: 0,v1,v2
0,aaa,1
1,ccc,2
2,BBB,3
3,,4


### applying a function with apply
The apply function allows for specifying additional arguments.

In [42]:
def upDict(seriesValue, mappingDict, missingValue):
    '''
    maps lowercased values using a dictionary,
    maps missing to specified value'''
    if pd.isnull(seriesValue):
        return missingValue
    newValue = mappingDict.get(seriesValue.lower(), None)
    # keep the original value when the lowercased value has no mapping
    if newValue is None:
        return seriesValue
    else:
        return newValue
    
transform = {"a":1, "b":2}

myDf = pd.DataFrame({'v1':['a','c','B',np.nan], 'v2':[1,2,3,4]})

myDf['v3'] = myDf.v1.apply(upDict, args=(transform,-9))

myDf

Unnamed: 0,v1,v2,v3
0,a,1,1
1,c,2,c
2,B,3,2
3,,4,-9


## Binning
Sometimes it is useful to convert continuous data into categorical (binned) data.

### the cut method

In [43]:
bios = pd.DataFrame(20*np.random.randn(10,1)+150, columns=['weightLbs'])

bios['heightIn'] = pd.Series(8*np.random.randn(10)+68)

bios['bmi'] = 703 * bios.weightLbs / (bios.heightIn**2)

bmiCutVals = [0,15,16,18.5,25,30,35,40,99]

bmiLabels = ['Very severely underweight ',
             'Severely underweight ',
             'Underweight',
             'Normal (healthy weight) ',
             'Overweight',
             'Obese Class I (Moderately obese) ',
             'Obese Class II (Severely obese)',
             'Obese Class III (Very severely obese) ' ]

bios['bmiInt'] = pd.cut(bios.bmi,bmiCutVals)
bios['bmiCat'] = pd.cut(bios.bmi,bmiCutVals, labels=bmiLabels)
bios

Unnamed: 0,weightLbs,heightIn,bmi,bmiInt,bmiCat
0,124.440661,70.33837,17.682067,"(16.0, 18.5]",Underweight
1,164.840341,62.666798,29.508275,"(25.0, 30.0]",Overweight
2,135.646367,78.725313,15.386318,"(15.0, 16.0]",Severely underweight
3,138.164945,76.181998,16.735877,"(16.0, 18.5]",Underweight
4,147.313518,72.313646,19.804205,"(18.5, 25.0]",Normal (healthy weight)
5,158.315725,70.698424,22.266908,"(18.5, 25.0]",Normal (healthy weight)
6,150.951876,62.743958,26.955663,"(25.0, 30.0]",Overweight
7,154.953405,76.410115,18.657555,"(18.5, 25.0]",Normal (healthy weight)
8,172.235595,66.746842,27.177956,"(25.0, 30.0]",Overweight
9,137.870922,56.300291,30.577834,"(30.0, 35.0]",Obese Class I (Moderately obese)


### Intervals vs interval labels
The cell below shows the difference between obtaining an interval from cut and a label. The former is a special pandas object that describes an open or closed interval. The latter is just a string.

In [44]:
bios = pd.DataFrame(20*np.random.randn(10,1)+150, columns=['weightLbs'])

bios['heightIn'] = pd.Series(8*np.random.randn(10)+68)

bios['bmi'] = 703 * bios.weightLbs / (bios.heightIn**2)

bmiCutVals = [0,15,16,18.5,25,30,35,40,99]

bmiLabels = ['Very severely underweight ',
             'Severely underweight ',
             'Underweight',
             'Normal (healthy weight) ',
             'Overweight',
             'Obese Class I (Moderately obese) ',
             'Obese Class II (Severely obese)',
             'Obese Class III (Very severely obese) ' ]

bios['bmiInt'] = pd.cut(bios.bmi,bmiCutVals)
bios['bmiCat'] = pd.cut(bios.bmi,bmiCutVals, labels=bmiLabels)
print(type(bios['bmiInt'][0]))
print(type(bios['bmiCat'][0]))


<class 'pandas._libs.interval.Interval'>
<class 'str'>
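To make the interval objects a little more concrete, the sketch below bins a few fixed values (chosen only for illustration) and inspects one Interval. It also uses the *right* parameter of cut, which controls which side of each bin is closed.

```python
import pandas as pd

bmiValues = pd.Series([17.7, 18.5, 25.0, 29.5])
cutVals = [0, 18.5, 25, 30]

# Default right=True: each bin includes its right edge
rightClosed = pd.cut(bmiValues, cutVals)

# right=False: each bin includes its left edge instead
leftClosed = pd.cut(bmiValues, cutVals, right=False)

# An Interval knows its end points and which side is closed
firstInterval = rightClosed[0]
print(firstInterval.left, firstInterval.right, firstInterval.closed)

# Membership can be tested with the in operator
print(18.5 in firstInterval)

# The value 18.5 lands in a different bin depending on the closed side
print(rightClosed[1], leftClosed[1])
```

A value sitting exactly on a cut point, like 18.5 here, is the case where the choice of closed side matters.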


## groupby

### example data

In [18]:
grades = pd.read_csv(gradeDataPath,
            names=['gender', 'class',  'grade1',  'grade2'])
grades

Unnamed: 0,gender,class,grade1,grade2
0,female,senior,94,96
1,female,senior,90,92
2,female,junior,95,93
3,male,senior,86,88
4,female,sophmore,82,79
5,male,junior,90,93
6,male,sophmore,83,85
7,female,junior,97,94
8,male,sophmore,90,87
9,male,junior,96,97


### the groupby method

In [64]:
classGroup = grades.groupby('class')
classGroup

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002C24AD4D3C8>

### a GroupBy object is iterable
Each iteration yields a tuple holding the key value and the subset of rows for that key.

In [65]:
for key,subset in classGroup:
    print('\nkey is ', key)
    print(subset)


key is  junior
    gender   class  grade1  grade2
2   female  junior      95      93
5     male  junior      90      93
7   female  junior      97      94
9     male  junior      96      97
10  female  junior      82      84

key is  senior
    gender   class  grade1  grade2
0   female  senior      94      96
1   female  senior      90      92
3     male  senior      86      88
12    male  senior      93      90
14  female  senior      96      92

key is  sophmore
    gender     class  grade1  grade2
4   female  sophmore      82      79
6     male  sophmore      83      85
8     male  sophmore      90      87
11  female  sophmore      98      99
13    male  sophmore      86      86


### Its indices attribute shows a little of how it works
This dictionary shows that for each key the groupby can produce the indices of the rows in the original table that have that key.

In [60]:
classGroup.indices

{'junior': array([ 2,  5,  7,  9, 10], dtype=int64),
 'senior': array([ 0,  1,  3, 12, 14], dtype=int64),
 'sophmore': array([ 4,  6,  8, 11, 13], dtype=int64)}

### you can retrieve individual subsets by the key

In [63]:
classGroup.get_group('senior')

Unnamed: 0,gender,class,grade1,grade2
0,female,senior,94,96
1,female,senior,90,92
3,male,senior,86,88
12,male,senior,93,90
14,female,senior,96,92


### compound keys
Keys can be composed of multiple columns

In [69]:
genderClassGroup = grades.groupby(['class', 'gender'])
genderClassGroup.indices

{('junior', 'female'): array([ 2,  7, 10], dtype=int64),
 ('junior', 'male'): array([5, 9], dtype=int64),
 ('senior', 'female'): array([ 0,  1, 14], dtype=int64),
 ('senior', 'male'): array([ 3, 12], dtype=int64),
 ('sophmore', 'female'): array([ 4, 11], dtype=int64),
 ('sophmore', 'male'): array([ 6,  8, 13], dtype=int64)}

## Aggregations
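The GroupBy object has a number of predefined methods that return a single aggregated row for each group. In the example below the mean of each column is returned. Note that the non-numeric gender column is not returned; numeric_only=True requests this explicitly, because newer pandas versions raise an error instead of silently dropping non-numeric columns.

In [75]:
print('Juniors ', (95+90+97+96+82)/5)
grades.groupby('class').mean(numeric_only=True)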

Juniors  92.0


Unnamed: 0_level_0,grade1,grade2
class,Unnamed: 1_level_1,Unnamed: 2_level_1
junior,92.0,92.2
senior,91.8,91.6
sophmore,87.8,87.2


### Grouping of columns
The axis parameter specifies whether to group together rows (axis=0) or columns (axis=1). The code is clearer if the word 'rows' or 'columns' is used instead of the number. The mapping dictionary indicates how to group the columns when axis=1: columns that map to the same value are combined.

In [100]:
mapping={'gender':'CombinedKey', 'class':'CombinedKey',
    'grade1':'total', 'grade2':'total'}
sums = grades.groupby(mapping, axis='columns').sum()
sums

Unnamed: 0,CombinedKey,total
0,femalesenior,190
1,femalesenior,182
2,femalejunior,188
3,malesenior,174
4,femalesophmore,161
5,malejunior,183
6,malesophmore,168
7,femalejunior,191
8,malesophmore,177
9,malejunior,193
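Recent pandas releases deprecate the axis argument of groupby, so the cell above may warn or fail depending on the installed version. As a sketch of one alternative (not the only one), the same two columns can be built directly. The DataFrame exampleGrades below is a small made-up stand-in for the grades data.

```python
import pandas as pd

exampleGrades = pd.DataFrame({'gender': ['female', 'male'],
                              'class': ['senior', 'senior'],
                              'grade1': [94, 86],
                              'grade2': [96, 88]})

exampleSums = pd.DataFrame({
    # adding string columns concatenates them row by row
    'CombinedKey': exampleGrades['gender'] + exampleGrades['class'],
    # sum across the grade columns within each row
    'total': exampleGrades[['grade1', 'grade2']].sum(axis='columns')})
print(exampleSums)
```

The result keeps the row index of the input, so it can be merged back in the same way.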


### merging the results
The result of the groupby.sum() has the same index as the original data, so the two DataFrames can be merged. We’ll look at the details of how merge works in a later session.


In [107]:
pd.merge(grades,sums, left_index=True, right_index=True)

Unnamed: 0,gender,class,grade1,grade2,CombinedKey,total
0,female,senior,94,96,femalesenior,190
1,female,senior,90,92,femalesenior,182
2,female,junior,95,93,femalejunior,188
3,male,senior,86,88,malesenior,174
4,female,sophmore,82,79,femalesophmore,161
5,male,junior,90,93,malejunior,183
6,male,sophmore,83,85,malesophmore,168
7,female,junior,97,94,femalejunior,191
8,male,sophmore,90,87,malesophmore,177
9,male,junior,96,97,malejunior,193


### transform
The transform method applies a function to each key group and then aligns the result with the original index. In the example below the result is then added to the original dataset as a new column. This would allow computing the difference from the group mean for each individual.

In [116]:
print('Juniors ', (95+90+97+96+82)/5)
grades['g1Mean'] = grades.groupby('class')['grade1'].transform('mean')
grades


Juniors  92.0


Unnamed: 0,gender,class,grade1,grade2,g1Mean
0,female,senior,94,96,91.8
1,female,senior,90,92,91.8
2,female,junior,95,93,92.0
3,male,senior,86,88,91.8
4,female,sophmore,82,79,87.8
5,male,junior,90,93,92.0
6,male,sophmore,83,85,87.8
7,female,junior,97,94,92.0
8,male,sophmore,90,87,87.8
9,male,junior,96,97,92.0
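The difference from the group mean mentioned above can be sketched as follows. The DataFrame exampleGrades and the column name g1Diff are made up for this illustration.

```python
import pandas as pd

exampleGrades = pd.DataFrame({'class': ['junior', 'junior', 'senior', 'senior'],
                              'grade1': [95, 90, 94, 86]})

# transform returns one value per original row, aligned on the original index
groupMean = exampleGrades.groupby('class')['grade1'].transform('mean')

exampleGrades['g1Mean'] = groupMean
exampleGrades['g1Diff'] = exampleGrades['grade1'] - groupMean
print(exampleGrades)
```

Because the group means line up with the original rows, the subtraction needs no merge step.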


### agg
The agg method allows user-written aggregation functions to be applied to each group. In the example below a string with a note is generated. First, reset the grades dataset.

In [131]:
# the file is comma separated, so read it the same way as in the example data cell
grades = pd.read_csv(gradeDataPath,
            names=['gender', 'class',  'grade1',  'grade2'])


In [152]:
def uniqueStrings(arr):
    stArr = [str(v) for v in arr]
    return ','.join(sorted(set(stArr)))
print(' testing uniqueStrings returns:', uniqueStrings([1,2,9,3,4,3,2,9]))
genderGrouped = grades.groupby('gender')
genderGrouped['grade1'].agg(uniqueStrings)


 testing uniqueStrings returns: 1,2,3,4,9


gender
female    82,90,94,95,96,97,98
male            83,86,90,93,96
Name: grade1, dtype: object

### separate functions for each column
In this example the argument for agg is a dictionary whose keys are column names and whose values are a function, or a list of functions, to be applied to that column. Note that the user-written function is passed as an object, without quotes, while the built-in GroupBy methods are named with quoted strings.


In [155]:
def uniqueStrings(arr):
    stArr = [str(v) for v in arr]
    return ','.join(sorted(set(stArr)))
print(' testing uniqueStrings returns:', uniqueStrings([1,2,9,3,4,3,2,9]))
genderGrouped = grades.groupby('gender')
genderGrouped.agg({'class': uniqueStrings, 
                  'grade1': ['mean', 'max', 'min']})


 testing uniqueStrings returns: 1,2,3,4,9


Unnamed: 0_level_0,class,grade1,grade1,grade1
Unnamed: 0_level_1,uniqueStrings,mean,max,min
gender,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
female,"junior,senior,sophmore",91.75,98,82
male,"junior,senior,sophmore",89.142857,96,83
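The two-level column header in the result above can be awkward to work with. As a hedged alternative, agg also accepts keyword arguments of the form newName=(column, function), which pandas calls named aggregation and which yields flat column names. The data and the output names (classes, meanGrade1, maxGrade1) below are made up for illustration.

```python
import pandas as pd

exampleGrades = pd.DataFrame({'gender': ['female', 'male', 'female', 'male'],
                              'class': ['senior', 'senior', 'junior', 'junior'],
                              'grade1': [94, 86, 95, 90]})

def uniqueStrings(arr):
    # sorted, comma separated list of the distinct values
    return ','.join(sorted(set(str(v) for v in arr)))

# Each keyword names an output column: (input column, function or method name)
summary = exampleGrades.groupby('gender').agg(
    classes=('class', uniqueStrings),
    meanGrade1=('grade1', 'mean'),
    maxGrade1=('grade1', 'max'))
print(summary)
```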


In [45]:
print("that's all folks")

that's all folks
