# Data Analysis Notes

## Numpy

* ndarray - fast, space-efficient multidimensional array providing vectorized operations
* standard math functions with no loops
* linear algebra
* random number generation
* Fourier transform capabilities

### ndarray
A generic multidimensional container for homogeneous data
* all elements must be the same type
* features of an ndarray
    * shape - a tuple indicating the size of each dimension, e.g. data.shape returns (2, 3)
    * dtype - an object describing the data type of the array, e.g. data.dtype returns dtype('float64')

In [1]:
import numpy as np

#### creating ndarrays

* array - convert input data (list, tuple, array) to an ndarray, either inferring the dtype or using one specified explicitly. Copies by default
* asarray - convert input to ndarray, without copying if input is already an ndarray
* arange - like range but returns an ndarray instead of a list
* ones, ones_like - produce an array of 1s with the given shape; ones_like takes an array and produces an array of 1s with the same shape and dtype
* zeros, zeros_like - like ones and ones_like but filled with 0s
* empty, empty_like - allocate new arrays without populating them, so they contain whatever is currently in memory
* eye, identity - create a square NxN identity matrix (1s on the diagonal and 0s elsewhere)
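
The creation functions that the cells below do not exercise (asarray, ones_like, eye) can be sketched as follows. This is a minimal illustration; the variable names are made up for the example.

```python
import numpy as np

base = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

copied = np.array(base)      # np.array copies its input by default
same = np.asarray(base)      # np.asarray hands back the input if it is already an ndarray

ones = np.ones_like(base)    # 1s with the same shape and dtype as base
ident = np.eye(3)            # 3x3 identity matrix

print(copied is base)        # False
print(same is base)          # True
```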

In [5]:
# 1D arrays are created with np.array
arr1 = np.array([6,7.5,8,0,1])
arr1

array([ 6. ,  7.5,  8. ,  0. ,  1. ])

In [9]:
# multidimensional arrays are created from a list of equal-length lists
arr2 = np.array([[1,2,3,4],[5,6,7,8]])
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

In [11]:
print arr2.ndim
print arr2.shape
print arr2.dtype

2
(2, 4)
int64


In [19]:
arr0s = np.zeros(10)
print "Array of 0s: "
print arr0s
print 

arrm0s = np.zeros((3,4))
print "Multidimensional array of 0s: "
print arrm0s
print 

arr1s = np.ones(10)
print "Array of 1s: "
print arr1s
print 

print "empty may return an array of 0s, but this IS NOT GUARANTEED"
np.empty((2,3,2))

Array of 0s: 
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]

Multidimensional array of 0s: 
[[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]

Array of 1s: 
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]

empty may return an array of 0s, but this IS NOT GUARANTEED


array([[[ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.]],

       [[ 0.,  0.],
        [ 0.,  0.],
        [ 0.,  0.]]])

In [21]:
# numpy version of range
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

#### Data Types for ndarrays

* dtype can be defined at creation if needed
* astype can be used to convert dtypes - always creates a copy

In [28]:
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype = np.string_)
print numeric_strings

print numeric_strings.astype(float)
print numeric_strings.astype(float).astype(int)

['1.25' '-9.6' '42']
[  1.25  -9.6   42.  ]
[ 1 -9 42]


#### Operations between Arrays and Scalars

**Array Operations:**
* any arithmetic operation between equal-size arrays is applied elementwise
* operations between differently sized arrays are called broadcasting (discussed in Chapter 12)

**Scalar Operations:**
* propagate the arithmetic through each element
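
The cells below cover equal-size arrays and scalars. As a small sketch of the differently sized case, here is a (2, 3) array combined with a length-3 row; the names `row`, `scaled` and `shifted` are illustrative only.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])   # shape (2, 3)
row = np.array([10., 20., 30.])                # shape (3,)

scaled = arr * 2       # the scalar is applied to every element
shifted = arr + row    # row is broadcast across both rows of arr

print(shifted)
```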


In [31]:
arr = np.array([[1.,2.,3.],[4.,5.,6.]])
print arr

print
print "arr * arr: "
print arr*arr

print
print "arr - arr"
print arr - arr

[[ 1.  2.  3.]
 [ 4.  5.  6.]]

arr * arr: 
[[  1.   4.   9.]
 [ 16.  25.  36.]]

arr - arr
[[ 0.  0.  0.]
 [ 0.  0.  0.]]


In [33]:
print "1/arr"
print 1/arr

print 
print "arr ** 0.5"
print arr ** 0.5


1/arr
[[ 1.          0.5         0.33333333]
 [ 0.25        0.2         0.16666667]]

arr ** 0.5
[[ 1.          1.41421356  1.73205081]
 [ 2.          2.23606798  2.44948974]]


#### Basic Indexing and Slicing

* slicing mostly works as it does for lists
* individual elements of multidimensional arrays can be accessed with repeated indexing (arr[2][0]) or a comma-separated index (arr[2,0])


* array slices are views, NOT copies, so changes to them affect the source array
* if you want a copy use arr[5:8].copy()


* the value assigned to a slice can be either a scalar, which is applied to the whole selection, or a size-matched ndarray
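
The view-versus-copy behaviour is easy to miss, so here is a minimal sketch of it. The names `view` and `safe` are illustrative.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]           # a view: shares memory with arr
view[0] = 100             # also changes arr[5]

safe = arr[5:8].copy()    # an independent copy
safe[1] = -1              # arr is not affected

print(arr)
```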


In [38]:
arr = np.arange(10)
print arr

print
print "access object at index 5"
print arr[5]

print 
print "return ndarray of the elements at indices 5 through 7 (end index 8 excluded)"
print arr[5:8]

[0 1 2 3 4 5 6 7 8 9]

access object at index 5
5

return ndarray of the elements at indices 5 through 7 (end index 8 excluded)
[5 6 7]


In [39]:
# assign objects to a given value
arr[5:8] = 12
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

In [42]:
arr2d = np.array([[1,2,3],[4,5,6],[7,8,9]])

print "a single slice of a multidimensional array returns another ndarray"
print arr2d[2]

print 
print "there are 2 options to slice both"
print arr2d[2][0]
print arr2d[2,0]

a single slice of a multidimensional array returns another ndarray
[7 8 9]

there are 2 options to slice both
7
7


#### Boolean Indexing

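A boolean array can be used as an index as well: the positions where it is True are selected. Combine conditions with & (and) and | (or); the Python keywords and/or do not work on boolean arrays.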

In [56]:
names = np.array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'])
print names == 'Bob'

print
data = np.random.randn(7,4)
print data

print
print data[names == 'Bob', 2:]

print data[(names == 'Bob') | (names == 'Will'), 2:]

[ True False False  True False False False]

[[ 2.30636807  1.58341026 -0.46749486  0.28654039]
 [ 0.06502078 -0.90556446  1.03511216  0.35194532]
 [-0.52156253  0.97710069 -0.933863    0.64848837]
 [-0.03983877  0.97992079  0.51382864 -1.33908823]
 [-0.56505517  0.77945476 -0.46145929 -1.09264386]
 [ 1.2220915   0.95085505  0.72212019 -0.63587666]
 [ 1.06115116 -0.32056447 -0.49341202  1.76311529]]

[[-0.46749486  0.28654039]
 [ 0.51382864 -1.33908823]]
[[-0.46749486  0.28654039]
 [-0.933863    0.64848837]
 [ 0.51382864 -1.33908823]
 [-0.46145929 -1.09264386]]


#### Fancy indexing

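Pass a list of integers to select a subset of rows in a particular order.


For multidimensional arrays, pass one list of indices per axis to pull out particular elements; the lists are paired up, so [[1,5,7,2],[0,3,1,2]] selects the elements at (1,0), (5,3), (7,1) and (2,2).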

In [60]:
arr = np.empty((8,4))
for i in range(8):
    arr[i] = i

print arr
print 
print
print "arr[[4,3,0,6]]"
print arr[[4,3,0,6]]

[[ 0.  0.  0.  0.]
 [ 1.  1.  1.  1.]
 [ 2.  2.  2.  2.]
 [ 3.  3.  3.  3.]
 [ 4.  4.  4.  4.]
 [ 5.  5.  5.  5.]
 [ 6.  6.  6.  6.]
 [ 7.  7.  7.  7.]]


arr[[4,3,0,6]]
[[ 4.  4.  4.  4.]
 [ 3.  3.  3.  3.]
 [ 0.  0.  0.  0.]
 [ 6.  6.  6.  6.]]


In [63]:
arr = np.arange(32).reshape(8,4)
print arr

print 
print arr[[1,5,7,2],[0,3,1,2]]

[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]
 [16 17 18 19]
 [20 21 22 23]
 [24 25 26 27]
 [28 29 30 31]]

[ 4 23 29 10]


#### Transposing and Swapping Axes

* transpose returns a view on the original data without copying
* for higher dimensional arrays transpose accepts a tuple of axis numbers to permute the axes 
* T is a special case of swapping axes; swapaxes takes a pair of axis numbers and also returns a view
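
The cell below only uses a 2D array. As a sketch of the higher-dimensional case, here is a 3D array whose axes are permuted with a tuple; the shape (2, 3, 4) is chosen only for illustration.

```python
import numpy as np

arr = np.arange(24).reshape(2, 3, 4)

permuted = arr.transpose((1, 0, 2))   # axes reordered: shape (3, 2, 4)
swapped = arr.swapaxes(1, 2)          # last two axes exchanged: shape (2, 4, 3)

print(permuted.shape)
print(swapped.shape)
```

Both results are views, so writing to `permuted` or `swapped` would also change `arr`.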

In [69]:
arr = np.arange(15).reshape(3,5)
print arr

print 
print arr.T

print 
print arr.swapaxes(0,1)

[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]

[[ 0  5 10]
 [ 1  6 11]
 [ 2  7 12]
 [ 3  8 13]
 [ 4  9 14]]

[[ 0  5 10]
 [ 1  6 11]
 [ 2  7 12]
 [ 3  8 13]
 [ 4  9 14]]


#### Universal Functions: Fast Element-wise Array Functions

A universal function (ufunc) performs elementwise operations on data in ndarrays
* abs
* sqrt
* exp
* log
* sign
* ceil
* floor
* rint
* modf
* isnan
* isinf
* isfinite
* etc



There are also binary ufuncs which take 2 arrays and return a single array as the result
* add
* subtract
* multiply
* power
* maximum
* minimum
* mod
* greater, greater_equal, less, less_equal, equal, not_equal
* logical_and, logical_or
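
A short sketch of a unary ufunc, a binary ufunc, and one that returns two arrays (modf). The input values are arbitrary and chosen so the results are exact.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])
y = np.array([2.0, 3.0, 10.0])

roots = np.sqrt(x)           # unary: elementwise square root
larger = np.maximum(x, y)    # binary: elementwise maximum of the two arrays

# modf returns the fractional and integral parts as two separate arrays
frac, whole = np.modf(np.array([1.5, -2.25]))

print(roots)
print(larger)
```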

#### Conditional Logic as Array Operations

numpy.where is a vectorized version of the ternary expression x if condition else y

In [70]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])

cond = np.array([True, False, True, True, False])
np.where(cond, xarr, yarr)

array([ 1.1,  2.2,  1.3,  1.4,  2.5])

In [71]:
arr = np.random.randn(4,4)

print arr
print 

print np.where(arr > 0, 2, -2)

[[ 1.08283649 -0.49184401 -2.34512336 -0.04154162]
 [ 0.11962092 -1.99957714 -0.2234407  -2.53879499]
 [-0.05851935  0.1268745  -1.01495372  0.06985686]
 [ 1.30567402  0.83101006 -1.13422051 -0.93632847]]

[[ 2 -2 -2 -2]
 [ 2 -2 -2 -2]
 [-2  2 -2  2]
 [ 2  2 -2 -2]]


#### Mathematical and Statistical Methods

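Many basic statistics are available either as array methods (arr.mean()) or as top-level functions (np.mean(arr)). Most accept an axis argument to compute along rows or columns instead of over the whole array.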

* sum
* mean
* std, var
* min, max
* argmin, argmax
* cumsum
* cumprod
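
A minimal sketch of the method and function forms, with and without an axis argument. The array is arbitrary.

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])

total = arr.sum()                 # over the whole array: 21
col_means = arr.mean(axis=0)      # one mean per column
row_sums = np.sum(arr, axis=1)    # function form, one sum per row
running = arr.cumsum()            # cumulative sum over the flattened array
largest_at = arr.argmax()         # position of the maximum in the flattened array

print(col_means)
print(running)
```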

#### Methods for Boolean Arrays

* any tests whether one or more values in an array is True
* all checks if every value is True
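
A small sketch of any and all. The last line relies on True counting as 1 when summed, which gives the number of True values.

```python
import numpy as np

bools = np.array([False, False, True, False])

any_true = bools.any()    # True: at least one value is True
all_true = bools.all()    # False: not every value is True

# summing a boolean array counts its True values
n_positive = (np.array([-1, 2, 3]) > 0).sum()

print(n_positive)
```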

#### Sorting

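* the sort method (arr.sort()) sorts in place; the top-level np.sort(arr) returns a sorted copy instead
* multi-dimensional arrays can be sorted in place along an axis by passing the axis number to sort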
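
A minimal sketch contrasting the in-place method with the copying function, plus sorting along an axis. Variable names are illustrative.

```python
import numpy as np

original = np.array([3, 1, 2])
sorted_copy = np.sort(original)    # returns a new array, original is unchanged

inplace = np.array([3, 1, 2])
inplace.sort()                     # modifies inplace itself

grid = np.array([[3, 1, 2], [9, 7, 8]])
grid.sort(axis=1)                  # sorts the values within each row

print(grid)
```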

#### Unique and Other Set Logic

* unique - returns a sorted array of the unique values
* intersect1d(x,y) - compute the sorted, common elements in x and y
* union1d(x, y) - compute the sorted union of elements
* in1d(x, y) - returns an array of booleans indicating whether each value of x is contained in y
* setdiff1d(x, y) - set difference, elements in x that are not in y
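
The set routines can be sketched on two small arrays. Note one substitution: the membership test uses np.isin rather than in1d, on the assumption that your NumPy is recent enough to have it (newer releases favour isin).

```python
import numpy as np

x = np.array([3, 1, 2, 3, 1])
y = np.array([2, 3, 4])

uniq = np.unique(x)             # sorted unique values of x
common = np.intersect1d(x, y)   # values present in both
either = np.union1d(x, y)       # values present in at least one
only_x = np.setdiff1d(x, y)     # values of x that are not in y
member = np.isin(x, y)          # per-element membership of x in y

print(member)
```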

#### Linear Algebra

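Matrix helpers are split between the top-level numpy namespace and numpy.linalg:

* np.dot(x,y) - matrix multiplication
* np.diag(x) - returns the diagonal elements of a 2D array as a 1D array, or converts a 1D array to a square matrix with those values on the diagonal
* np.trace - compute the sum of the diagonal
* np.linalg.det - matrix determinant
* np.linalg.eig - compute the eigenvalues and eigenvectors of a square matrix
* np.linalg.inv - inverse of a square matrix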
#### Random Number Generation

The numpy.random module allows you to quickly draw samples from a handful of distributions
* rand - uniform over [0, 1)
* randint - integers from a given low to high range
* randn - normal distribution with mean 0 and standard deviation 1
* binomial - binomial distribution
* normal - normal (Gaussian) distribution
* beta - beta distribution
* chisquare - chi-square distribution
* gamma - gamma distribution
* uniform - uniform over [low, high), which is [0, 1) by default


* seed - seed random number generator
* permutation - return random permutation of a sequence
* shuffle - randomly permute a sequence in place
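
A short sketch of seeding and the permutation helpers. The seed value 42 is arbitrary; the point is only that reseeding with the same value reproduces the same draws.

```python
import numpy as np

np.random.seed(42)
first = np.random.randn(3)
np.random.seed(42)
second = np.random.randn(3)                 # same draws as first

ints = np.random.randint(0, 10, size=5)     # low is inclusive, high is exclusive
perm = np.random.permutation(5)             # new array holding 0..4 in random order

seq = np.arange(5)
np.random.shuffle(seq)                      # reorders seq in place, returns None

print(np.array_equal(first, second))
```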

## Pandas

Suggested import: 



In [72]:
from pandas import Series, DataFrame
import pandas as pd

### Series

A one-dimensional array-like object containing an array of data and an associated array of data labels (the index).

* can be created from a list of values, from a list of values plus a list of index labels, or from a dict
* you can use index labels to select single values or sets of values
* operations very similar to those on numpy arrays can be performed, and they preserve the index
* sort of like a fixed-length ordered dict
* can check whether a label is in the index using 'b' in obj


* missing data shows up as NaN - this happens when a Series is built or reindexed with labels that have no matching data (looking up a missing label directly raises a KeyError instead)
* missing data can be found using isnull or notnull


* Series automatically align differently indexed data in arithmetic operations
* both the Series and its index can be named, using .name and .index.name
* a Series index can be altered in place by assigning to .index
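
Membership tests and the two ways a missing label can surface are sketched below. The labels and values are made up for the example.

```python
import pandas as pd

obj = pd.Series([4, 7, -5], index=['a', 'b', 'c'])

has_b = 'b' in obj     # True: membership is checked against the index
has_z = 'z' in obj     # False

# looking up a label that is not in the index raises KeyError
try:
    obj['z']
    raised = False
except KeyError:
    raised = True

# reindexing to a label with no data produces NaN instead
wider = obj.reindex(['a', 'b', 'c', 'z'])
missing = wider.isnull()

print(missing)
```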

In [74]:
obj = pd.Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [81]:
obj2 = Series({'Ohio':35000, 'Texas':71000, 'Oregon':16000, 'Utah':5000})
obj2

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [75]:
print obj.values
print obj.index

[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)


In [77]:
obj_index = pd.Series([4,7,-5,3], index = ['d', 'b', 'a', 'c'])
obj_index

d    4
b    7
a   -5
c    3
dtype: int64

In [79]:
print obj_index['a']
print obj_index[['c', 'a', 'd']]

-5
c    3
a   -5
d    4
dtype: int64


In [80]:
obj_index[obj_index > 0]

d    4
b    7
c    3
dtype: int64

In [84]:
obj4 = pd.Series(obj2, index = ['California', 'Ohio', 'Oregon', 'Texas'])
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [88]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [90]:
obj2 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

In [91]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [93]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

### DataFrame

* A tabular, spreadsheet-like data structure with an ordered collection of columns, each of which can be a different type
* Has a row and a column index
* Can be thought of as a dict of Series that all share the same index

**Construction of DataFrames**
* One common way is from a dict of equal-length lists or NumPy arrays
* the columns argument sets the column order; a listed column with no data is added and filled with NaN
* alternatively use a nested dict of dicts - outer dict keys become columns and inner keys become row indices
* A dict of Series is another option


**Indexing by column**
* Columns can be grabbed either like a dict or as an attribute (df.state or df['state'])
* Columns can be modified by assignment, assigning a scalar or an array OR a **Series** in which case it will match on index
* assigning a new column will create it
* del will remove a column

**Indexing by row**
* Rows can be retrieved by position (iloc) or by label (loc)


* values returns a 2D ndarray with a dtype chosen to accommodate all columns (often object)
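
The construction and column operations that the cells below do not show (nested dicts, assigning a Series, del) can be sketched as follows. The population figures reuse the ones in these notes; the debt values are invented for illustration.

```python
import pandas as pd

# nested dict: outer keys become columns, inner keys become row labels
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}
frame = pd.DataFrame(pop)

# assigning a Series aligns on the index; rows it does not mention get NaN
frame['debt'] = pd.Series([1.2, -1.5], index=[2001, 2002])

# del removes a column
del frame['Ohio']

print(frame)
```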

In [114]:
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 
        'year':[2000, 2001, 2002, 2001, 2002],
        'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data, columns = ['year', 'state', 'pop', 'debt'])
frame

   year   state  pop debt
0  2000    Ohio  1.5  NaN
1  2001    Ohio  1.7  NaN
2  2002    Ohio  3.6  NaN
3  2001  Nevada  2.4  NaN
4  2002  Nevada  2.9  NaN


In [115]:
frame['state']

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [116]:
frame.state

0      Ohio
1      Ohio
2      Ohio
3    Nevada
4    Nevada
Name: state, dtype: object

In [117]:
frame['debt'] = 16.5
frame

   year   state  pop  debt
0  2000    Ohio  1.5  16.5
1  2001    Ohio  1.7  16.5
2  2002    Ohio  3.6  16.5
3  2001  Nevada  2.4  16.5
4  2002  Nevada  2.9  16.5


In [118]:
frame['debt'] = np.arange(5.)
frame

   year   state  pop  debt
0  2000    Ohio  1.5   0.0
1  2001    Ohio  1.7   1.0
2  2002    Ohio  3.6   2.0
3  2001  Nevada  2.4   3.0
4  2002  Nevada  2.9   4.0


In [119]:
frame2 = DataFrame(data, columns = ['year', 'state', 'pop', 'debt'], index = ['one', 'two', 'three', 'four', 'five'])
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


In [120]:
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [121]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

### Index Objects

Index objects hold axis labels and other metadata
* Index objects are immutable, which matters because it lets them be safely shared among data structures


* Can be different types:
    * MultiIndex
    * DatetimeIndex
    * PeriodIndex
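
Immutability can be checked directly. The sketch below assumes that item assignment on an Index raises TypeError, which is the behaviour I would expect from pandas; replacing the whole index on the owning object is still allowed.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

# item assignment on an Index is rejected
try:
    index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# assigning a whole new index to the Series is fine; the old Index object is untouched
obj.index = ['x', 'y', 'z']

print(mutated)
```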

### Essential Functionality


#### Reindexing
Create a new object where the data conforms to a new index, filling missing values with NaN

* by default missing values will be filled with NaN but other options exist
    * fill_value = 0 - will fill missing with 0
    * method = 'ffill' -  carry values forward
    * method = 'bfill' - carry values backward
    
    
* default is to reindex rows but columns can be reindexed using the keyword columns 

In [122]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [125]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value = 0)
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

In [127]:
obj3 = Series(['blue', 'purple', 'yellow'], index = [0,2,4])
obj3.reindex(range(6), method = 'ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [135]:
frame = DataFrame(np.arange(9).reshape(3,3), index = ['a', 'c', 'd'], columns = ['Ohio', 'Texas', 'California'])
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [137]:
frame.reindex(index = ['a','b','c','d'], columns=['Texas', 'Ohio', 'California', 'Oregon'])

   Texas  Ohio  California  Oregon
a    1.0   0.0         2.0     NaN
b    NaN   NaN         NaN     NaN
c    4.0   3.0         5.0     NaN
d    7.0   6.0         8.0     NaN


#### Dropping entries from an axis

drop returns a new object with the given index labels removed; pass axis = 1 to drop columns instead of rows

In [141]:
obj = Series(np.arange(5.), index = ['a','b','c','d','e'])
print obj
print obj.drop('c')
print obj.drop(['c','d'])

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
a    0.0
b    1.0
e    4.0
dtype: float64


In [142]:
data = DataFrame(np.arange(16).reshape(4,4), 
                 index = ['Ohio', 'Colorado', 'Utah', 'New York'], 
                 columns = ['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [143]:
data.drop(['Colorado', 'Utah'])

          one  two  three  four
Ohio        0    1      2     3
New York   12   13     14    15


In [144]:
data.drop('two', axis = 1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


#### Indexing, selection and filtering

**Series**
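* works very much like numpy indexing except you can also use the index labels
* slicing with labels is inclusive of the end point, unlike positional slicing


**DataFrames**
* indexing with a column name or a list of names selects columns; slicing or a boolean array selects rows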
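
The inclusive end point of label slices on a Series is the main surprise here, so a minimal sketch follows. It uses iloc for the positional slice to keep the two cases unambiguous; the labels are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj['b':'c']        # label slice: includes both 'b' and 'c'
by_position = obj.iloc[1:2]    # positional slice: end excluded, only 'b'

print(by_label)
print(by_position)
```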

In [145]:
data = DataFrame(np.arange(16).reshape(4,4), 
                index = ['Ohio', 'Colorado', 'Utah', 'New York'], 
                columns = ['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [151]:
data[['three', 'one']]

          three  one
Ohio          0    0
Colorado      6    0
Utah         10    8
New York     14   12


In [147]:
data[:2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [148]:
data[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [149]:
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [153]:
data.loc[['Colorado', 'Utah'], ['two', 'three']]

          two  three
Colorado    5      6
Utah        9     10


In [167]:
data[data.three > 5][:'Utah']

          one  two  three  four
Colorado    0    5      6     7
Utah        8    9     10    11


#### Arithmetic and data alignment

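* labels that are missing from one of the operands produce NaN in the result by default
* for DataFrames alignment is performed on both rows and columns


* use the arithmetic methods (add, sub, mul, div) with the fill_value argument to substitute a value for entries missing from one operand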

In [171]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index = ['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index = ['a', 'c', 'e', 'f', 'g'])
print s1
print s2
print s1+s2

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64


In [173]:
df1 = DataFrame(np.arange(12.).reshape(3,4), columns = list('abcd')) 
df2 = DataFrame(np.arange(20.).reshape(4,5), columns = list('abcde'))

In [174]:
print df1+df2
print df1.add(df2, fill_value=0)

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


#### Function application and mapping

* apply - applies a function that operates on 1D arrays to each column (the default) or each row (axis=1)
    * can return a scalar or a Series
* applymap - element-wise operation on a DataFrame
* map - element-wise operation on a Series
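
The cells below cover apply and applymap. For completeness, here is a minimal sketch of map on a Series, using the same kind of formatting function; the values are made up.

```python
import pandas as pd

prices = pd.Series([1.5, 2.25, 10.0], index=['a', 'b', 'c'])

# map calls the function once per element and keeps the index
labels = prices.map(lambda x: '%.2f' % x)

print(labels)
```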

In [177]:
frame = DataFrame(np.random.randn(4,3), columns = list('bde'), index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
f = lambda x: x.max() - x.min()
print frame

               b         d         e
Utah    0.978149 -0.570633 -0.471903
Ohio    0.790802  0.451122 -0.003661
Texas  -0.473697  0.112599  0.733702
Oregon -0.946494  2.218031  1.430734


In [178]:
frame.apply(f)

b    1.924642
d    2.788664
e    1.902637
dtype: float64

In [179]:
frame.apply(f, axis=1)

Utah      1.548782
Ohio      0.794462
Texas     1.207400
Oregon    3.164525
dtype: float64

In [180]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

            b         d         e
min -0.946494 -0.570633 -0.471903
max  0.978149  2.218031  1.430734


In [181]:
format_str = lambda x: '%.2f' % x
frame.applymap(format_str)

            b      d      e
Utah     0.98  -0.57  -0.47
Ohio     0.79   0.45  -0.00
Texas   -0.47   0.11   0.73
Oregon  -0.95   2.22   1.43


#### Sorting and ranking

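* sort_index - sort lexicographically by row or column index (returns new sorted object)
* order - sort a Series by its values (renamed to sort_values in pandas 0.17 and later)


DataFrames
* sort_index with by and a list of columns - sort on the values in those columns (sort_values(by=...) in pandas 0.17 and later)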

In [182]:
obj = Series(range(4), index = ['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64
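
Sorting by values rather than by index can be sketched as follows. This assumes a pandas version that provides sort_values (0.17 or later); the data is made up for the example.

```python
import pandas as pd

obj = pd.Series([4, 7, -3, 2])
by_value = obj.sort_values()      # sorted by value, original labels kept

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
by_cols = frame.sort_values(by=['a', 'b'])   # sort on column a, then on column b

print(by_value)
print(by_cols)
```

Both calls return new objects and leave `obj` and `frame` unchanged.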