# Faster Data Types

Python is simple because of its flexibility: it tries to find what will work best.

## Specifying data types

Python runs a series of checks to choose the correct data type, even when the type is already known. The more specific you can be about your data, the fewer checks are done.

In [2]:
import timeit as ti
iterMax = 100000

def makeFloat( ):
    valB = float( '5' )
    
def makeInt( ):
    # No base given - still parsed as base 10, but via a more general conversion path, so it should take longest
    valB = int( '5' )
        
def makeIntMore( ):
    # Provide the base - fewer checks, so it should be faster
    valB = int( '5', 10 )

print 'Float:', ti.timeit( makeFloat,     number = iterMax )
print 'Int:  ', ti.timeit( makeInt,       number = iterMax )
print 'Int+: ', ti.timeit( makeIntMore,   number = iterMax )

Float: 0.0285930633545
Int:   0.0739169120789
Int+:  0.0335159301758
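
To make the role of the base argument concrete, here is a minimal sketch (not part of the original timings) of how int parses a string with and without an explicit base. The example strings and variable names are chosen only for illustration.

```python
# With no base given, int() parses the string as base 10.
defaultBase = int('5')

# Passing base 10 explicitly gives the same value.
explicitBase = int('5', 10)

# Any other base has to be stated explicitly.
hexValue = int('ff', 16)

# Base 0 asks int() to infer the base from a prefix such as '0x' or '0b'.
inferredBase = int('0x1f', 0)

print(defaultBase == explicitBase)
print(hexValue)
print(inferredBase)
```

The result is the same with or without the base, so any timing difference comes from how the conversion is carried out rather than from what it returns.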


## Avoiding Loops

Loops tend to be slow in Python (they carry a cost in any language, but compilers typically unroll them for efficiency).  Let's look at some better ways of doing things.

In [3]:
import numpy as np
someList = range(50)

A basic for loop:

In [4]:
%%timeit
out=[]
for i in range( len( someList ) ):
    out.append( someList[i] * 2 )

100000 loops, best of 3: 7.9 µs per loop


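The same thing, but in non-loop form. Note that the round brackets make this a generator expression, which is lazy: the timing covers only creating the generator, not multiplying the values. Square brackets would give a list comprehension, which does the work immediately: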

In [101]:
%%timeit
twoXSomeList = ( someList[i] * 2 for i in range( len( someList ) ) )

1000000 loops, best of 3: 1.3 µs per loop
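
The difference between a generator expression and a list comprehension can be sketched as follows. This is an illustrative example rather than part of the original benchmark, and the names lazyDoubles, eagerDoubles and fromGenerator are made up for it.

```python
someList = list(range(50))

# Generator expression: nothing is multiplied until the generator is consumed.
lazyDoubles = (someList[i] * 2 for i in range(len(someList)))

# List comprehension: every element is computed immediately.
eagerDoubles = [value * 2 for value in someList]

# Consuming the generator does the actual work and yields the same values.
fromGenerator = list(lazyDoubles)

print(fromGenerator == eagerDoubles)
```

For a fair comparison with the for loop, time the list comprehension or wrap the generator in list().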


The built-ins enumerate and zip can also be used for efficiency if you need to loop over multiple variables.

In [102]:
import timeit as ti
iterMax = 10000
incVal = 1
aTot = range( 0, 500, incVal )
bTot = range( 1000, 1500, incVal )

def bothInd():
    for indVal in range( len( aTot ) ):
        out = aTot[indVal] * bTot[indVal]

def indVal():
    for indVal, aValue in enumerate( aTot ):
        out = aValue * bTot[indVal]
        
def zipped():
    for aValue, bValue in zip(aTot, bTot):
        out = aValue * bValue
            
            
print 'bothInd:', ti.timeit( bothInd, number = iterMax )
print 'indVal: ', ti.timeit( indVal,  number = iterMax )
print 'zipped: ', ti.timeit( zipped,  number = iterMax )

bothInd: 0.507728815079
indVal:  0.388553857803
zipped:  0.42235994339
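
When the results need to be kept, zip also combines naturally with a list comprehension, and enumerate can wrap zip if the position is needed too. The following is a small sketch under the same aTot and bTot definitions; products, firstLarge and the threshold of 100000 are illustrative choices only.

```python
aTot = list(range(0, 500))
bTot = list(range(1000, 1500))

# Pair the elements up and keep every product.
products = [aValue * bValue for aValue, bValue in zip(aTot, bTot)]

# enumerate around zip gives the position as well as both values.
firstLarge = None
for position, (aValue, bValue) in enumerate(zip(aTot, bTot)):
    if aValue * bValue > 100000:
        firstLarge = position
        break

print(len(products))
print(firstLarge)
```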


## Numpy structures

Numpy has built-in structures for creating efficient arrays.  There is some overhead associated with creating the arrays, so larger arrays see greater savings.

In [103]:
import numpy as np
import timeit as ti
iterMax = 10000

def numpyAdd(n):
    a = np.arange(n) ** 2
    b = np.arange(n) ** 3
    return a + b

def listAdd(n):
    a = [i ** 2 for i in range(n)]
    b = [i ** 3 for i in range(n)]
    return [a[i] + b[i] for i in range(n)]

print '#-# 10 '
print 'Numpy:', ti.timeit( 'numpyAdd(10)', 'from __main__ import numpyAdd', number = iterMax )
print 'List: ', ti.timeit( 'listAdd(10)',  'from __main__ import listAdd',  number = iterMax )
print '#-# 100 '
print 'Numpy:', ti.timeit( 'numpyAdd(100)', 'from __main__ import numpyAdd', number = iterMax )
print 'List: ', ti.timeit( 'listAdd(100)',  'from __main__ import listAdd',  number = iterMax )
print '#-# 1000 '
print 'Numpy:', ti.timeit( 'numpyAdd(1000)', 'from __main__ import numpyAdd', number = iterMax )
print 'List: ', ti.timeit( 'listAdd(1000)',  'from __main__ import listAdd',  number = iterMax )


#-# 10 
Numpy: 0.0656039714813
List:  0.0457808971405
#-# 100 
Numpy: 0.0714168548584
List:  0.294731140137
#-# 1000 
Numpy: 0.145833969116
List:  2.82997989655
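
The timings are only comparable if both functions compute the same values. A quick check, sketched here with the two functions repeated so that the block runs on its own (sameValues is an illustrative name):

```python
import numpy as np

def numpyAdd(n):
    # Element-wise arithmetic on whole arrays, with no explicit Python loop.
    a = np.arange(n) ** 2
    b = np.arange(n) ** 3
    return a + b

def listAdd(n):
    # The same calculation using list comprehensions.
    a = [i ** 2 for i in range(n)]
    b = [i ** 3 for i in range(n)]
    return [a[i] + b[i] for i in range(n)]

# Convert the array to a plain list so the two results can be compared directly.
sameValues = numpyAdd(10).tolist() == listAdd(10)
print(sameValues)
```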


In [104]:
import numpy as np
someList = range( 500 )
someNpArray = np.array( someList )

In [105]:
%%timeit
out=[]
for i in range( len( someList ) ):
    out.append( someList[i] * 2 )

The slowest run took 8.98 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 65.9 µs per loop


In [106]:
%%timeit
twoXSomeList = ( someList[i] * 2 for i in range( len( someList ) ) )

The slowest run took 4.36 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 5.09 µs per loop


In [107]:
%%timeit
twoXSomeArray = someNpArray * 2

The slowest run took 15.40 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.09 µs per loop
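
One detail worth keeping in mind when swapping lists for arrays: the * operator means different things for the two types. A minimal sketch, with a short illustrative list:

```python
import numpy as np

someList = [1, 2, 3]
someNpArray = np.array(someList)

# For a list, * repeats the list.
repeatedList = someList * 2

# For a NumPy array, * multiplies each element.
doubledArray = someNpArray * 2

print(repeatedList)
print(doubledArray.tolist())
```

This is why the list versions above need a loop or comprehension, while the array version is a single expression.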


## Pandas structures

Pandas is a module created for working with large data sets efficiently.  It uses DataFrames to store the data together with the column names and the index (row) names.

In [6]:
import pandas as pd
import numpy as np
data = np.array( [ ['X',   'ColA', 'ColB' ],
                   ['RowA', 1.0,    10.1  ],
                   ['RowB', 3.14,   2.1   ],
                   ['RowC', 3.0,    42.   ] ] )
                
dFrame = pd.DataFrame( data    = data[1:,1:],
                       index   = data[1:,0],
                       columns = data[0,1:]) 

print dFrame

      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0
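
Because a NumPy array holds a single type, mixing labels and numbers in one array stores every entry as a string, so a DataFrame built from it holds text rather than floats. As a hedged alternative sketch, the same table can be built from a dictionary of columns so that the values stay numeric. The names mixed, allStrings and numericFrame are illustrative.

```python
import numpy as np
import pandas as pd

# Mixing strings and floats forces NumPy to store everything as strings.
mixed = np.array([['RowA', 1.0, 10.1]])
allStrings = mixed.dtype.kind in ('U', 'S')

# Building from a dictionary of columns keeps each column as floats.
numericFrame = pd.DataFrame({'ColA': [1.0, 3.14, 3.0],
                             'ColB': [10.1, 2.1, 42.0]},
                            index=['RowA', 'RowB', 'RowC'],
                            columns=['ColA', 'ColB'])

print(allStrings)
print(numericFrame.dtypes)
```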


We can use the index and columns attributes to loop over the rows and columns, or pass them to len to count them.

In [7]:
print len(dFrame.index)
print len(dFrame.columns)

3
2
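
The same counts are also available together through the shape attribute. A short sketch, using a small stand-in DataFrame (called frame here) with the same labels as above:

```python
import pandas as pd

frame = pd.DataFrame({'ColA': [1.0, 3.14, 3.0],
                      'ColB': [10.1, 2.1, 42.0]},
                     index=['RowA', 'RowB', 'RowC'],
                     columns=['ColA', 'ColB'])

# shape is a (rows, columns) tuple.
nRows, nCols = frame.shape

# index and columns can be turned into plain lists of labels.
rowLabels = list(frame.index)
colLabels = list(frame.columns)

print(nRows)
print(nCols)
```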


Pandas allows the data to be addressed in multiple ways, either by position or by label.

In [8]:
print dFrame

# Using iat (by position) and at (by label) - all dimensions listed in one set of brackets
print dFrame.iat[1,0]
print dFrame.at['RowB','ColA']

# Using iloc (by position) and loc (by label) - here with each dimension in separate brackets
print dFrame.iloc[1][0]
print dFrame.loc['RowB']['ColA']


      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0
3.14
3.14
3.14
3.14
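
loc and iloc also accept both dimensions in a single set of brackets, which avoids the intermediate row lookup of the chained form. A minimal sketch with a stand-in DataFrame named frame:

```python
import pandas as pd

frame = pd.DataFrame({'ColA': [1.0, 3.14, 3.0],
                      'ColB': [10.1, 2.1, 42.0]},
                     index=['RowA', 'RowB', 'RowC'],
                     columns=['ColA', 'ColB'])

# Row label and column label in one lookup.
byLabel = frame.loc['RowB', 'ColA']

# Row position and column position in one lookup.
byPosition = frame.iloc[1, 0]

print(byLabel)
print(byPosition)
```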


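We can use some tools built into pandas to iterate through the data structure. iterrows yields each row as an (index, Series) pair, and iteritems does the same for columns. Both are convenient, but they are generally slower than the built-in vectorized operations.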

In [10]:
print dFrame

for index, row in dFrame.iterrows():
    #print 'Index:', index
    print row[ 'ColA' ]

      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0
1.0
3.14
3.0
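
If rows have to be visited one at a time, itertuples is another built-in option: it yields each row as a lightweight named tuple instead of a Series. A sketch with a stand-in DataFrame named frame; colAValues and rowNames are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'ColA': [1.0, 3.14, 3.0],
                      'ColB': [10.1, 2.1, 42.0]},
                     index=['RowA', 'RowB', 'RowC'],
                     columns=['ColA', 'ColB'])

colAValues = []
rowNames = []
for row in frame.itertuples():
    # The row label is available as row.Index, the columns as attributes.
    rowNames.append(row.Index)
    colAValues.append(row.ColA)

print(rowNames)
print(colAValues)
```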


In [11]:
print dFrame

# iteritems() was removed in pandas 2.0; newer versions use items() instead
for index, col in dFrame.iteritems():
    #print 'Index:', index
    print col[ 'RowA' ]

      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0
1.0
10.1


Pandas also has built-in math functions known for their efficiency with large data sets.

In [12]:
print dFrame

dFrame["ColA"] = pd.to_numeric( dFrame[ "ColA" ] )
dFrame["ColB"] = pd.to_numeric( dFrame[ "ColB" ] )
dFrame['sum'] = dFrame.sum( axis = 1 )
print dFrame

updatedDFrame = dFrame.applymap(lambda x: x+1) 
print updatedDFrame

      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0
      ColA  ColB    sum
RowA  1.00  10.1  11.10
RowB  3.14   2.1   5.24
RowC  3.00  42.0  45.00
      ColA  ColB    sum
RowA  2.00  11.1  12.10
RowB  4.14   3.1   6.24
RowC  4.00  43.0  46.00
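
For simple element-wise arithmetic, plain operators on the DataFrame are vectorized and avoid calling a Python function once per cell. The following sketch assumes a numeric stand-in DataFrame named frame; plusOne is an illustrative name.

```python
import pandas as pd

frame = pd.DataFrame({'ColA': [1.0, 3.14, 3.0],
                      'ColB': [10.1, 2.1, 42.0]},
                     index=['RowA', 'RowB', 'RowC'],
                     columns=['ColA', 'ColB'])

# Row sums, computed across the columns in one call.
frame['sum'] = frame.sum(axis=1)

# Adding a scalar applies to every cell at once.
plusOne = frame + 1

print(plusOne)
```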
