## Python for Data Analysis, 2nd ed

### Chapter 3. Built-in Data Structures

#### 3.2 Functions

##### Currying: Partial Argument Application
Currying (named after the mathematician Haskell Curry) means deriving new functions from existing ones by *partial argument application*.

In [7]:
def add_numbers(x, y):
    return x + y

add_five = lambda z: add_numbers(z, 5)

assert add_five(4) == 9

In [33]:
from functools import partial
add_five = partial(add_numbers, 5)

assert add_five(5) == 10
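
Note that `partial` binds positional arguments from the left, so `partial(add_numbers, 5)` fixes `x`, whereas the lambda version fixes `y`. For addition the result is the same either way. The sketch below uses a hypothetical `power` function (not part of the original notes) to show how a keyword argument lets you choose which parameter to fix:

```python
from functools import partial

def power(base, exponent):
    # Hypothetical helper used only for illustration
    return base ** exponent

# Positional binding fixes the leftmost parameter (base)
two_to_the = partial(power, 2)

# Keyword binding fixes a specific parameter (exponent)
square = partial(power, exponent=2)

print(two_to_the(10))  # 2 ** 10
print(square(10))      # 10 ** 2
```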

##### Generators
Generators return a sequence of multiple results lazily, pausing after each one until the next one is requested.

To create a generator, use the `yield` keyword instead of `return` in a function.

In [41]:
def squares(n=10):
    print('Generating squares from 1 to {}'.format(n ** 2))
    for i in range(1, n+1):
        yield i ** 2

gen = squares()

for x in gen:
    print(x)

Generating squares from 1 to 100
1
4
9
16
25
36
49
64
81
100
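
Calling a generator function does not run its body; execution starts only when the first value is requested. A minimal sketch of this behaviour, using a hypothetical `countdown` generator and an `events` list (both introduced here only for illustration) to record when the body actually runs:

```python
events = []

def countdown(n):
    # Hypothetical generator; the append below runs only on the first next()
    events.append('started')
    while n > 0:
        yield n
        n -= 1

gen = countdown(2)
started_before = list(events)   # still empty: the body has not run yet

first = next(gen)               # runs the body up to the first yield
second = next(gen)              # resumes after the first yield

try:
    next(gen)                   # no values left
    exhausted = False
except StopIteration:
    exhausted = True
```

A `for` loop handles the `StopIteration` signal automatically, which is why it never appears in the loop-based example.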


###### Generator expressions
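Generator expressions are the generator analogue of list, dict and set comprehensions: write what would otherwise be a comprehension inside parentheses instead of brackets.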

In [48]:
def _make_gen():
    for x in range(100):
        yield x ** 2

# gen = _make_gen() is equivalent to the generator expression:
gen = (x ** 2 for x in range(100))

In [49]:
sum(x ** 2 for x in range(100))

328350

In [50]:
dict((i, i ** 2) for i in range(5))

{0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
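
Unlike a list comprehension, a generator expression produces its values on demand and can be consumed only once. A short sketch of that difference (the variable names are illustrative):

```python
squares_list = [x ** 2 for x in range(5)]   # list: all values stored
squares_gen = (x ** 2 for x in range(5))    # generator: values produced lazily

first_pass = list(squares_gen)    # consumes the generator
second_pass = list(squares_gen)   # nothing left to produce

print(first_pass)
print(second_pass)
```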

###### itertools module

For example, `groupby` takes any sequence and a function, and groups consecutive elements of the sequence by the function's return value.

In [60]:
import itertools

first_letter = lambda x: x[0]  # returns the first letter of a word
names = ['Zenos', 'Zarathustra', 'Dennett', 'Daniel', 'David', 'Greg']

for letter, group in itertools.groupby(names, first_letter):
    print(letter, list(group))  # group is a generator

Z ['Zenos', 'Zarathustra']
D ['Dennett', 'Daniel', 'David']
G ['Greg']
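
Because `groupby` only groups *consecutive* elements, an unsorted input yields the same key more than once. A small sketch, with a made-up `words` list, that compares grouping before and after sorting by the same key function:

```python
import itertools

first_letter = lambda x: x[0]
words = ['Zenos', 'Dennett', 'Zarathustra', 'Daniel']  # illustrative data

# Unsorted: 'Z' and 'D' each appear twice because the runs are interrupted
unsorted_keys = [key for key, _ in itertools.groupby(words, first_letter)]

# Sorting by the key first makes equal keys adjacent
grouped = {key: list(group)
           for key, group in itertools.groupby(sorted(words, key=first_letter),
                                               first_letter)}

print(unsorted_keys)
print(grouped)
```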


In [74]:
numbers = [10, 212, 43, 94, 764]

from itertools import combinations
for comb in combinations(numbers, 3):
    print(comb)

print('---')
    
from itertools import combinations_with_replacement
for comb in combinations_with_replacement(numbers, 2):
    print(comb)

(10, 212, 43)
(10, 212, 94)
(10, 212, 764)
(10, 43, 94)
(10, 43, 764)
(10, 94, 764)
(212, 43, 94)
(212, 43, 764)
(212, 94, 764)
(43, 94, 764)
---
(10, 10)
(10, 212)
(10, 43)
(10, 94)
(10, 764)
(212, 212)
(212, 43)
(212, 94)
(212, 764)
(43, 43)
(43, 94)
(43, 764)
(94, 94)
(94, 764)
(764, 764)
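
The number of tuples printed above follows the usual counting formulas: choosing k of n items gives n! / (k! (n - k)!) combinations, and allowing repeats gives the same formula with n + k - 1 in place of n. A quick check of both counts, using a small helper `n_choose_k` defined here only for illustration:

```python
from itertools import combinations, combinations_with_replacement
from math import factorial

def n_choose_k(n, k):
    # Binomial coefficient n! / (k! * (n - k)!)
    return factorial(n) // (factorial(k) * factorial(n - k))

numbers = [10, 212, 43, 94, 764]
n = len(numbers)

n_combos = len(list(combinations(numbers, 3)))
n_combos_repl = len(list(combinations_with_replacement(numbers, 2)))

print(n_combos, n_choose_k(n, 3))
print(n_combos_repl, n_choose_k(n + 2 - 1, 2))
```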


##### Errors and Exception Handling

In [75]:
float('1.4314')

1.4314

In [76]:
float('string')

ValueError: could not convert string to float: string

In [103]:
def graceful_float(x):
    try:
        return float(x)
    except ValueError:  # a bare `except:` would catch every exception type
        return x

In [104]:
graceful_float('1.12412')


1.12412

In [105]:
graceful_float('string')

'string'

In [106]:
graceful_float((1, 3))

TypeError: float() argument must be a string or a number

In [107]:
def graceful_float(x):
    try:
        return float(x)
    except (ValueError, TypeError):
        return x

In [108]:
graceful_float(['string one', 'string 2'])

['string one', 'string 2']

In some cases, you may not want to suppress an exception, but you still want some code to run regardless of whether the code in the `try` block succeeds. To do this, use `finally`.

In [123]:
'''
f = open('/some/path', 'w')
try:
    f.write('some text')  # placeholder content
finally:
    f.close()
'''; # Here, the file handle 'f' will always get closed

In [127]:
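'''
f = open(path, 'w')
try:
    f.write('some text')  # placeholder content
except:
    print('Failed')
else:
    print('Succeeded')  # Executes only if the try: block succeeded
finally:
    f.close()
'''; # The else block runs only on success; the finally block always runs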
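
The order in which these blocks run can be checked with a small, self-contained sketch. The `attempt` function and its `log` list are hypothetical names used only to record which branches execute:

```python
def attempt(value):
    log = []
    try:
        float(value)
    except ValueError:
        log.append('except')    # runs only if float() raised ValueError
    else:
        log.append('else')      # runs only if the try block succeeded
    finally:
        log.append('finally')   # runs in both cases
    return log

print(attempt('1.5'))
print(attempt('string'))
```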

##### Exceptions in IPython

IPython will, by default, print a full call stack trace (traceback) with a few lines of context around the position at each point in the stack.

The amount of context shown can be controlled using the `%xmode` magic command, from `Plain` (same as the standard Python interpreter) to `Verbose` (which also shows function argument values).

You can also *step into the stack* using the `%debug` or `%pdb` magic commands after an error has occurred for interactive post-mortem debugging.

### Chapter 5. Getting started with pandas

#### 5.1 Introduction to pandas data structures

In [140]:
import pandas as pd

from pandas import Series, DataFrame

In [141]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
            'year': [2000, 2001, 2002, 2001, 2002, 2003],
            'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

In [142]:
frame.describe()

            pop         year
count       6.0          6.0
mean       2.55       2001.5
std    0.836062     1.048809
min         1.5       2000.0
25%       1.875       2001.0
50%        2.65       2001.5
75%       3.125       2002.0
max         3.6       2003.0


In [143]:
frame.set_index('state')

        pop  year
state
Ohio    1.5  2000
Ohio    1.7  2001
Ohio    3.6  2002
Nevada  2.4  2001
Nevada  2.9  2002
Nevada  3.2  2003


In [144]:
frame.transpose()

          0     1     2       3       4       5
pop     1.5   1.7   3.6     2.4     2.9     3.2
state  Ohio  Ohio  Ohio  Nevada  Nevada  Nevada
year   2000  2001  2002    2001    2002    2003


In [145]:
frame.truncate(1, 4)  # keep rows with index labels 1 through 4 (both inclusive)

   pop   state  year
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


In [146]:
frame.head()  # Return first 5 rows 

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002


In [147]:
# Passing the 'columns' argument arranges the columns in that order and fills any column missing from the data with NaN.
# Columns of the data that are not listed are dropped.
frame2 = pd.DataFrame(data, columns=['state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])

frame2

        state  pop debt
one      Ohio  1.5  NaN
two      Ohio  1.7  NaN
three    Ohio  3.6  NaN
four   Nevada  2.4  NaN
five   Nevada  2.9  NaN
six    Nevada  3.2  NaN


In [148]:
# Retrieving columns: dict-like notation or attribute access
print(frame2['state'])
print('--------------------------')
print(frame2.state)

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object
--------------------------
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object


In [149]:
# Rows can be retrieved by index label with the special loc attribute (iloc selects by integer position)
frame2.loc['three']

state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
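
A minimal sketch of label-based versus position-based row selection, assuming pandas is installed. The small `df` frame is made up for illustration and is not the `frame2` from the notes:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                   'pop': [1.5, 1.7, 2.4]},
                  index=['one', 'two', 'three'])

by_label = df.loc['three']     # row whose index label is 'three'
by_position = df.iloc[2]       # third row, counting from 0

print(by_label)
print(by_label.equals(by_position))
```

Both expressions return the same row as a Series whose index is the frame's column names.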

In [150]:
# Columns can be assigned a scalar
frame2['debt'] = 16.5
print(frame2)

print('--------------------------')

# Or a list/array of values. In this case, the list's length must match the DataFrame's length
frame2.debt = [15.2, 12.4, 24.2, 12.4, 24.1, 24.4]
print(frame2)

print('--------------------------')

# Or a Series, in which case its labels are aligned to the DataFrame's index and missing labels become NaN
val = pd.Series([-1.0, -2.2, -5], index=['one', 'two', 'five'])
frame2.debt = val
print(frame2)

        state  pop  debt
one      Ohio  1.5  16.5
two      Ohio  1.7  16.5
three    Ohio  3.6  16.5
four   Nevada  2.4  16.5
five   Nevada  2.9  16.5
six    Nevada  3.2  16.5
--------------------------
        state  pop  debt
one      Ohio  1.5  15.2
two      Ohio  1.7  12.4
three    Ohio  3.6  24.2
four   Nevada  2.4  12.4
five   Nevada  2.9  24.1
six    Nevada  3.2  24.4
--------------------------
        state  pop  debt
one      Ohio  1.5  -1.0
two      Ohio  1.7  -2.2
three    Ohio  3.6   NaN
four   Nevada  2.4   NaN
five   Nevada  2.9  -5.0
six    Nevada  3.2   NaN
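
The alignment happens by label, not by position. A small sketch, assuming pandas is installed and using a made-up `df` frame, in which the Series labels are given in a different order than the frame's index and one label is missing:

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# Labels are out of order relative to df.index, and 'two' is absent
val = pd.Series([-5.0, -1.0], index=['three', 'one'])
df['debt'] = val

print(df)
```

Each value lands on the row with the matching label, and the row without a matching label is filled with NaN.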


In [151]:
# Assigning a column will create a new one if it doesn't already exist
frame2['eastern'] = frame2.state == 'Ohio'
frame2

        state  pop  debt  eastern
one      Ohio  1.5  -1.0     True
two      Ohio  1.7  -2.2     True
three    Ohio  3.6   NaN     True
four   Nevada  2.4   NaN    False
five   Nevada  2.9  -5.0    False
six    Nevada  3.2   NaN    False


In [152]:
# Deleting columns
del frame2['eastern']
frame2.columns

Index([u'state', u'pop', u'debt'], dtype='object')

In [153]:
# Passing a dict of dicts to DataFrame: outer keys become columns, inner keys become row labels
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)
print(frame3)
print('--------------------------')
print(frame3.T)  # .T is shorthand for .transpose()
print('--------------------------')
print(frame3.transpose())

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
--------------------------
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
--------------------------
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


In [154]:
frame3.index.name = 'Year'
frame3.columns.name = 'State'
frame3

State  Nevada  Ohio
Year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


In [155]:
print(frame3.index.is_monotonic_increasing)  # is_monotonic was removed in pandas 2.0

True
