# Faster Data Types and Loops

Python is simple because of its flexibility, but that flexibility can cost speed.  This notebook examines some methods for speeding things up:
* Looping using alternate methods
* Using NumPy structures for math
* Using pandas structures for data manipulation

As a reminder, the timing results below use these SI prefixes:

$m = 10^{-3} $ (milli)

$\mu = 10^{-6} $ (micro)

$n = 10^{-9} $ (nano)


## Alternate Strategies for Loops

Loops tend to be slow in Python.  Compiled languages can optimize loops (for example by unrolling them), but the Python interpreter executes every iteration step by step.  Let's look at some better ways of doing things.

In [1]:
import numpy as np
someList = range(50)

#### A basic for loop:

In [2]:
%%timeit
out=[]
for i in range( len( someList ) ):
    out.append( someList[i] * 2 )

The slowest run took 4.30 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.42 µs per loop


#### The same thing, but in non-loop form:

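The same computation can be written as a single expression.  Note that the parentheses below create a generator expression, which is lazy: no element is multiplied until the generator is consumed.  The timing therefore measures only the creation of the generator object, not the work itself.  A list comprehension (square brackets instead of parentheses) does the work eagerly and is still usually faster than the explicit loop, because it avoids looking up and calling `out.append` on every step.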

In [3]:
%%timeit
twoXSomeList = ( someList[i] * 2 for i in range( len( someList ) ) )

The slowest run took 13.63 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 570 ns per loop
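
A quick way to see what a generator expression does and does not compute is to consume it. The sketch below uses illustrative names that are not part of the notebook; it contrasts the explicit loop, an eager list comprehension, and a generator, and then times only the two versions that actually build the full list. Timings depend on the machine, so none are quoted here.

```python
import timeit

some_list = range(50)

def with_loop():
    """Explicit loop that appends each doubled value."""
    out = []
    for value in some_list:
        out.append(value * 2)
    return out

def with_comprehension():
    """List comprehension: same work, done eagerly in one expression."""
    return [value * 2 for value in some_list]

def with_generator():
    """Generator expression: nothing is computed until it is consumed."""
    return (value * 2 for value in some_list)

lazy = with_generator()
consumed = list(lazy)   # the multiplications happen here
leftover = list(lazy)   # a generator can only be consumed once

print(consumed == with_loop() == with_comprehension())
print(leftover)

# Compare like with like: both functions return the complete list
for func in (with_loop, with_comprehension):
    print(func.__name__, timeit.timeit(func, number=10000))
```

If the doubled values are only needed once, for example inside `sum(...)`, the generator form avoids building the intermediate list at all.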


#### Swapping for loops 

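`enumerate` and `zip` can also be used for efficiency when a loop needs more than one variable, because they hand the values to the loop directly instead of requiring an index lookup on every iteration.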

In [4]:
import timeit as ti

# Set-up
iterMax = 10000
incVal = 1
aTot = range( 0, 500, incVal )
bTot = range( 1000, 1500, incVal )

# Using an index for both values
def bothInd():
    for indVal in range( len( aTot ) ):
        out = aTot[indVal] * bTot[indVal]

# Use an index for one value and the actual value for the other
def indVal():
    for indVal, aValue in enumerate( aTot ):
        out = aValue * bTot[indVal]

# Don't use any index
def zipped():
    for aValue, bValue in zip(aTot, bTot):
        out = aValue * bValue


print( 'bothInd:', ti.timeit( bothInd, number = iterMax ) )
print( 'indVal: ', ti.timeit( indVal,  number = iterMax ) )
print( 'zipped: ', ti.timeit( zipped,  number = iterMax ) )

bothInd: 1.3269468329999654
indVal:  0.9102434800006449
zipped:  0.42603352300284314
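
To see why the last two versions need fewer index lookups, it helps to look at what `enumerate` and `zip` actually hand to the loop. A small sketch with made-up three-element lists (the names are illustrative):

```python
a_vals = [1, 2, 3]
b_vals = [10, 20, 30]

# enumerate yields (index, value) pairs
pairs_with_index = list(enumerate(a_vals))

# zip yields one value from each input per step, with no index at all
pairs_zipped = list(zip(a_vals, b_vals))

# The zipped values can be used directly in the loop body
products = [a * b for a, b in zip(a_vals, b_vals)]

print(pairs_with_index)  # [(0, 1), (1, 2), (2, 3)]
print(pairs_zipped)      # [(1, 10), (2, 20), (3, 30)]
print(products)          # [10, 40, 90]
```

Keep in mind that `zip` stops at the end of the shortest input, so inputs of unequal length are silently truncated.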


## Numpy structures

NumPy has built-in structures for creating efficient arrays.  There is some overhead associated with creating the arrays, so the savings grow with the size of the array.

In [5]:
import numpy as np
import timeit as ti
iterMax = 10000

# A numpy way of adding n^2 to n^3
def numpyAdd(n):
    a = np.arange(n) ** 2
    b = np.arange(n) ** 3
    return a + b

# A list way of adding n^2 to n^3
def listAdd(n):
    a = [i ** 2 for i in range(n)]
    b = [i ** 3 for i in range(n)]
    return [a[i] + b[i] for i in range(n)]

# Do this with an input of 10
print( '#-# 10 ' )
print( 'Numpy:', ti.timeit( 'numpyAdd(10)', 'from __main__ import numpyAdd', number = iterMax ) )
print( 'List: ', ti.timeit( 'listAdd(10)',  'from __main__ import listAdd',  number = iterMax ) )

# Do this with an input of 100
print( '\n#-# 100 ' )
print( 'Numpy:', ti.timeit( 'numpyAdd(100)', 'from __main__ import numpyAdd', number = iterMax ) )
print( 'List: ', ti.timeit( 'listAdd(100)',  'from __main__ import listAdd',  number = iterMax ) )

# Do this with an input of 1000
print( '\n#-# 1000 ' )
print( 'Numpy:', ti.timeit( 'numpyAdd(1000)', 'from __main__ import numpyAdd', number = iterMax ) )
print( 'List: ', ti.timeit( 'listAdd(1000)',  'from __main__ import listAdd',  number = iterMax ) )

#-# 10 
Numpy: 0.10019599299994297
List:  0.07776459800152224

#-# 100 
Numpy: 0.055238162000023294
List:  0.6829036730014195

#-# 1000 
Numpy: 0.11121881500002928
List:  6.506387186997017
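
Before trusting a speed comparison, it is worth confirming that both versions compute the same values. A minimal check, restating the two functions under illustrative snake_case names:

```python
import numpy as np

def numpy_add(n):
    """n^2 + n^3 for 0..n-1 using array arithmetic."""
    values = np.arange(n)
    return values ** 2 + values ** 3

def list_add(n):
    """n^2 + n^3 for 0..n-1 using a list comprehension."""
    return [i ** 2 + i ** 3 for i in range(n)]

result_np = numpy_add(10)
result_list = list_add(10)

print(result_np.tolist() == result_list)
print(result_list)
```

One difference to be aware of: NumPy integers have a fixed width, so for very large `n` the array version can overflow, whereas Python integers grow as needed.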


#### Avoiding loops by using vector operations

We can also use the built-in functionality of NumPy arrays to avoid loops entirely.

In [6]:
import numpy as np
someList = range( 500 )
someNpArray = np.array( someList )

A basic for loop

In [7]:
%%timeit
out=[]
for i in range( len( someList ) ):
    out.append( someList[i] * 2 )

The slowest run took 5.66 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 91.9 µs per loop


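We can rewrite this as a generator expression, as before.  Again, the timing covers only the creation of the lazy generator, not the 500 multiplications.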

In [8]:
%%timeit
twoXSomeList = ( someList[i] * 2 for i in range( len( someList ) ) )

The slowest run took 9.45 times longer than the fastest. This could mean that an intermediate result is being cached.
1000000 loops, best of 3: 627 ns per loop


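We can do the same thing with an array.  Unlike the generator expression, this actually computes all 500 products, in a single vectorized operation.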

In [9]:
%%timeit
twoXSomeArray = someNpArray * 2

The slowest run took 41.69 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.21 µs per loop
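
One thing to keep in mind when moving from lists to arrays is that the `*` operator means different things for the two types. A short illustration with a made-up three-element example (illustrative names):

```python
import numpy as np

small_list = [1, 2, 3]
small_array = np.array(small_list)

repeated = small_list * 2   # list repetition: the list is concatenated with itself
doubled = small_array * 2   # element-wise multiplication

print(repeated)
print(doubled)
```

A `range` object, as used for `someList` above, does not support `*` at all, which is why the list versions need a loop or a comprehension.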


## Pandas structures

Pandas is a module created for working with large data sets efficiently.  It uses DataFrames, which store the data together with the column names and the index (row) names.  None of the operations shown here is time-intensive on this small example, but they would be on large data sets, so prefer the pandas tools over manipulating the raw data by hand.

In [10]:
import pandas as pd
import numpy as np
data = np.array( [ ['X',   'ColA', 'ColB' ],
                   ['RowA', 1.0,    10.1  ],
                   ['RowB', 3.14,   2.1   ],
                   ['RowC', 3.0,    42.   ] ] )
                
dFrame = pd.DataFrame( data    = data[1:,1:],
                       index   = data[1:,0],
                       columns = data[0,1:]) 

print( dFrame )

      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0
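
Note that a NumPy array holds a single data type, so mixing labels and numbers in one array stores every value as a string. As an alternative sketch (not the notebook's approach), building the DataFrame from a dictionary of columns keeps the numbers numeric. The variable names here are illustrative.

```python
import numpy as np
import pandas as pd

# Mixing a label and numbers in one array coerces everything to strings
mixed = np.array([['RowA', 1.0, 10.1]])
print(mixed.dtype.kind)   # 'U' means unicode strings

# Building from a dict of columns keeps each column's own numeric type
d_frame = pd.DataFrame(
    {'ColA': [1.0, 3.14, 3.0],
     'ColB': [10.1, 2.1, 42.0]},
    index=['RowA', 'RowB', 'RowC'],
)

print(d_frame.dtypes)
print(d_frame['ColA'].sum())
```

With numeric columns, no later conversion step is needed before doing arithmetic.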


We can use the `index` and `columns` attributes to loop through the DataFrame, or as counters with `len`.

In [11]:
print( len(dFrame.index) )
print( len(dFrame.columns) )

3
2


Pandas allows the data to be addressed in multiple ways: by position (`iat`, `iloc`) or by label (`at`, `loc`).

In [12]:
print( dFrame )

# Using iat / at - fast access to a single value
print("\n#-# Found with iat")
print( dFrame.iat[1,0] )
print("#-# Found with at")
print( dFrame.at['RowB','ColA'] )

# Using iloc / loc - general selection; one [row, column] lookup avoids chained indexing
print("#-# Found with iloc")
print( dFrame.iloc[1,0] )
print("#-# Found with loc")
print( dFrame.loc['RowB','ColA'] )


      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0

#-# Found with iat
3.14
#-# Found with at
3.14
#-# Found with iloc
3.14
#-# Found with loc
3.14
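
The same accessors also select whole rows, columns, or sub-tables. A brief sketch, using a numeric copy of the table built under the illustrative name `d_frame`:

```python
import pandas as pd

d_frame = pd.DataFrame(
    {'ColA': [1.0, 3.14, 3.0], 'ColB': [10.1, 2.1, 42.0]},
    index=['RowA', 'RowB', 'RowC'],
)

row_b = d_frame.loc['RowB']                    # one row, returned as a Series
col_a = d_frame['ColA']                        # one column, returned as a Series
sub = d_frame.loc[['RowA', 'RowC'], ['ColB']]  # sub-table selected by labels
first_two = d_frame.iloc[:2, :]                # first two rows selected by position

print(row_b)
print(col_a)
print(sub)
print(first_two)
```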


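We can use some tools built into pandas to iterate through the data structure.  These are convenient, but they are still Python-level loops, so on large data sets prefer whole-column operations where possible.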

In [13]:
# Print values in a column by iterating through the rows with iterrows
print( dFrame )

print("\n#-# Iterate through the rows in ColA")
for index, row in dFrame.iterrows():
    # print( 'Index:', index )
    print( row[ 'ColA' ] )

      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0

#-# Iterate through the rows in ColA
1.0
3.14
3.0


In [14]:
# Print the values in a row by iterating through the columns with items
# (iteritems was removed in pandas 2.0; items is the replacement)
print( dFrame )

print("\n#-# Iterate through the cols in RowA")
for index, col in dFrame.items():
    # print( 'Index:', index )
    print( col[ 'RowA' ] )

      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0

#-# Iterate through the cols in RowA
1.0
10.1


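Pandas also has built-in math functions that are efficient on large data sets.  Because dFrame was built from a NumPy array that mixes strings and numbers, its values are stored as strings, so the columns are converted with `pd.to_numeric` before summing.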

In [15]:
print("#-# Original")
print( dFrame )

# Use sum to add up the columns within each row
dFrame["ColA"] = pd.to_numeric( dFrame[ "ColA" ] )
dFrame["ColB"] = pd.to_numeric( dFrame[ "ColB" ] )
dFrame['sum'] = dFrame.sum( axis = 1 )
print("\n#-# With a sum column")
print( dFrame )

# Use applymap to add 1 to all data values
# (pandas 2.1 renamed applymap to DataFrame.map; dFrame + 1 is the vectorized equivalent)
updatedDFrame = dFrame.applymap(lambda x: x+1)
print("\n#-# Add one; note sum doesn't add 2")
print( updatedDFrame )

#-# Original
      ColA  ColB
RowA   1.0  10.1
RowB  3.14   2.1
RowC   3.0  42.0

#-# With a sum column
      ColA  ColB    sum
RowA  1.00  10.1  11.10
RowB  3.14   2.1   5.24
RowC  3.00  42.0  45.00

#-# Add one; note sum doesn't add 2
      ColA  ColB    sum
RowA  2.00  11.1  12.10
RowB  4.14   3.1   6.24
RowC  4.00  43.0  46.00
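
Element-wise arithmetic such as adding one does not need a per-element function call; arithmetic operators apply to the whole DataFrame at once. A sketch on a numeric copy of the table (illustrative names):

```python
import pandas as pd

d_frame = pd.DataFrame(
    {'ColA': [1.0, 3.14, 3.0], 'ColB': [10.1, 2.1, 42.0]},
    index=['RowA', 'RowB', 'RowC'],
)
d_frame['sum'] = d_frame.sum(axis=1)   # add the columns within each row

plus_one = d_frame + 1                 # adds 1 to every cell in one step
col_totals = d_frame.sum(axis=0)       # add the rows within each column

print(plus_one)
print(col_totals)
```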


# Check yourself

In [38]:
# Variables for use with these examples
import numpy as np
import pandas as pd

someVals = np.arange(500)
pandasD = np.array( [ ['X',       'Sue', 'John', 'Topher' ],
                      ['HW0',     92.,   47.,    85.      ],
                      ['MidTerm', 88.,   75.,    93.      ],
                      ['HW1',     91.,   94.,    91.      ],
                      ['HW2',     77.4,  86.,    89.      ],
                      ['Final',   88.,   86.0,   98.      ] ] )
classScores = pd.DataFrame( data    = pandasD[1:,1:],
                            index   = pandasD[1:,0],
                            columns = pandasD[0,1:] )

Create a new variable, someValsSquared, that is the square of someVals, using the single-expression (non-loop) form instead of an explicit for loop

In [39]:
# Try it here


Add the values in someVals and someValsSquared together using zip to create the variable someAddedVals

In [40]:
# Try it here


What is the average score for each of the graded objects in classScores? What is the average score for each student (assuming all grades are weighted equally)?

In [37]:
# Try it here
