# Pandas (panel data)
- Original ideas came from the R statistical language
- pandas developers now claim:
    - more functionality than R
    - faster algorithms than R
- based on numpy
- Can connect directly to databases
- Can read/write in many file formats
- very large package
- the two primary classes are Series and DataFrame
    - both support vector arithmetic and broadcasting like numpy
- [doc](http://pandas.pydata.org)
- [cheat sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)

In [None]:
# standard abbreviations

import numpy as np
import pandas as pd
import datetime
pd.__version__

# Series is like a numpy 1D array with an index attached
- the index defaults to typical slot addressing, 0 to N-1
- something like a dictionary, where the keys are the index
and the values are the array itself
- there are a number of techniques and tricks for indexing pandas objects - we will only use two:
    - iloc - index by integer position
    - loc - index by label (any object used as an index key)

In [None]:
# got an automatic index(on the left)

ser = pd.Series(range(10,15))

ser

In [None]:
ser.values, ser.index

In [None]:
# use 'iloc' for int indexes

ser.iloc[3]

In [None]:
# a slice keeps the index

ser.iloc[2:5]

In [None]:
# index can be specified explicitly, and does not need to be numeric

ser2 = pd.Series(range(10,15),
    index=['butler', 'math',
           'science', 'avery', 'business'])
ser2

In [None]:
# different type of index

ser2.values, ser2.index

In [None]:
# can retrieve an element by position or by index label
# 'iloc' means positional lookup, 'loc' means key lookup in the index

ser2.iloc[3], ser2.loc['avery']

In [None]:
# slice keeps index

ser2.iloc[1:3]

In [None]:
# slice by keys  
# note - end is inclusive, unlike list/iloc slice

ser2.loc['butler':'science']

In [None]:
# like a dict...

'math' in ser2, 'foo' in ser2, ser2.keys(),ser2.values

In [None]:
ser2.items(), list( ser2.items())

In [None]:
ser3 = pd.Series(range(20,25),
    index=['butler', 'science', 'math',
           'avery', 'business'])
ser3

In [None]:
ser2

In [None]:
# can add series
# science and math in different places
# indexes are aligned, even though index positions are different
# sort of like a database join
# vector arithmetic

ser2 + ser3

In [None]:
# broadcasting

2 * ser2 + 3 * ser3 + 5

In [None]:
# create from a dict
# has some different indexes 

d = {'math':10, 'science':10, 'law':13, 'avery':12}
ser4 = pd.Series(d)
ser4

In [None]:
ser3

In [None]:
# business, butler, law indexes are only 
# defined in one of the summands, 
# so can't compute their sums

# hey, where did the floating point come from??
# what's a NaN??

ssum=ser3+ser4
ssum

# addition of Series
- values with the same index label are added together, even when the labels appear in a different order
- the index of the sum is the union of the indexes of both Series
- if a label is missing from either Series, the result for that label is the special IEEE floating point value NaN (Not a Number), which normally represents the result of an invalid floating point operation
- NaN lets pandas represent missing values efficiently
- note that in order to use NaN, the original integer values were converted to floats!
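
The int-to-float conversion can be checked directly. This is a small sketch that rebuilds the same values as ser3 and ser4 under the hypothetical names `a` and `b`, so it runs on its own. It prints the dtypes before and after the addition and marks which labels came out missing.

```python
import numpy as np
import pandas as pd

# same values as ser3 and ser4 above; 'a' and 'b' are illustrative names
a = pd.Series([20, 21, 22, 23, 24],
              index=['butler', 'science', 'math', 'avery', 'business'])
b = pd.Series({'math': 10, 'science': 10, 'law': 13, 'avery': 12})

total = a + b

print(a.dtype, b.dtype)   # integer dtypes going in
print(total.dtype)        # float dtype coming out, because NaN is a float
print(total.isna())       # True for labels present in only one Series
```

Only the labels that appear in both Series (avery, math, science) get a real sum. The other three are NaN.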

# real world data almost always has missing values 
- need to deal with them somehow


In [None]:
# functions like mean are smart about NaNs
# they just skip NaNs, instead of raising errors

ssum 

In [None]:
ssum.mean(), (35+32+31)/3.

In [None]:
# call sin on each element
# don't raise an error on the NaN's
# sin(NaN) = NaN

np.sin(ssum)

In [None]:
# drop any row with a NaN

ssum.dropna()

In [None]:
ssum

In [None]:
# can fill in missing vals

ssum.fillna(0)

In [None]:
ser3

In [None]:
ser4

In [None]:
# another fix: a label missing from one Series counts as 10 in the sum

ser3.add(ser4, fill_value=10)

In [None]:
ssum

In [None]:
# can be nicer to interpolate missing values

ssum.interpolate()

# Example - find prime numbers
- define findPrimes
- return a list of primes up to a given limit
- use the [sieve of Eratosthenes](https://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#example) algorithm
    - no divide or mod
- use a Series with numpy booleans
- slices with increments can do most of the work




In [None]:
# find primes below 30 (slots 0 to 29)
# if a bool is True, that number is prime

ser = pd.Series(np.ones(30, dtype=bool))
ser

In [None]:
# 0, 1 are not prime

ser[:2] = False
ser

In [None]:
# trash evens

ser[4::2] = False
ser

In [None]:
# trash multiples

for j in range(3, 30, 2):
    ser[2*j::j] = False
ser

In [None]:
# but how do we get a list of the primes?



















# use the boolean Series to index itself!!

ser[ser]

In [None]:
ssi = ser[ser].index
ssi

In [None]:
# not a list, but pretty close

ssi[4], len(ssi), [p for p in ssi]

In [None]:
# can make a list

list(ssi)
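
The steps above can be collected into the findPrimes function named at the start of this example. This is one possible sketch, not a definitive version. It assumes, as in the cells above, that the limit itself is excluded (slots 0 to limit-1), and it uses `iloc` for the slice assignments to make the positional intent explicit. The parameter name `limit` and the local name `sieve` are illustrative.

```python
import numpy as np
import pandas as pd

def findPrimes(limit):
    """Return a list of the primes below limit, using a boolean Series as a sieve."""
    # slot n stays True as long as n is still considered prime
    sieve = pd.Series(np.ones(limit, dtype=bool))
    sieve.iloc[:2] = False      # 0 and 1 are not prime
    sieve.iloc[4::2] = False    # even numbers above 2
    # cross out multiples of each odd number, starting at twice that number
    for j in range(3, limit, 2):
        sieve.iloc[2*j::j] = False
    # boolean-index the Series with itself; the surviving index labels are the primes
    return list(sieve[sieve].index)

print(findPrimes(30))
```

The loop visits every odd number, as in the cells above. Stopping once j*j reaches the limit would do less work and give the same result.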