# Agenda
* Numpy
* Pandas
* Lab


# Introduction


## Create a new notebook for your code-along:

From our submission directory, type:
    
    jupyter notebook

From the IPython Dashboard, open a new notebook.
Change the title to: "Numpy and Pandas"

# Introduction to Numpy

* Overview
* ndarray
* Indexing and Slicing

More info: [http://wiki.scipy.org/Tentative_NumPy_Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)


## Numpy Overview

* Why Python for data? NumPy brings *decades* of C math into Python!
* NumPy wraps extensive C/C++/Fortran codebases that provide data analysis functionality
* The ndarray allows easy vectorized math and broadcasting (i.e. element-wise operations between arrays of different but compatible shapes)
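
A minimal sketch of vectorized math and broadcasting. The array names and values here are illustrative only and are not part of the lesson data.

```python
import numpy as np

# Vectorized math: the operation is applied to every element without a Python loop.
prices = np.array([10.0, 20.0, 30.0])
discounted = prices * 0.9

# Broadcasting: a (3, 1) column and a (4,) row are stretched to a common (3, 4) shape.
column = np.arange(3).reshape(3, 1)   # shape (3, 1)
row = np.arange(4)                    # shape (4,)
table = column + row                  # shape (3, 4); table[i, j] == i + j

print(discounted)
print(table.shape)
```

Shapes are compatible for broadcasting when, compared from the last axis backwards, each pair of dimensions is equal or one of them is 1.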

In [2]:
import numpy as np
import pandas as pd


### Creating ndarrays

An array object represents a multidimensional, homogeneous array of fixed-size items. 
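
To make "multidimensional" and "homogeneous" concrete, here is a short sketch that inspects an array's attributes. The arrays `m` and `mixed` are made up for illustration.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.ndim)    # number of dimensions
print(m.shape)   # size along each dimension
print(m.dtype)   # one dtype shared by every element (integer width depends on the platform)

# Mixing ints and floats upcasts everything to a single float dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```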

In [6]:
# Creating arrays
# Putting ? before or after a name (e.g. np.zeros?) shows its help in IPython
a = np.zeros(3)
b = np.ones((2,3))
c = np.random.randint(1,10,(2,3,4))
d = np.arange(0,11,1)
print(a, b, c, d)

[ 0.  0.  0.] [[ 1.  1.  1.]
 [ 1.  1.  1.]] [[[2 2 1 2]
  [2 3 3 2]
  [7 8 4 6]]

 [[3 9 8 2]
  [5 5 8 4]
  [2 1 5 7]]] [ 0  1  2  3  4  5  6  7  8  9 10]


What do these functions do? Ask IPython for help:

    np.arange?

In [7]:
# Note the way each array is printed:
a,b,c,d

(array([ 0.,  0.,  0.]), array([[ 1.,  1.,  1.],
        [ 1.,  1.,  1.]]), array([[[2, 2, 1, 2],
         [2, 3, 3, 2],
         [7, 8, 4, 6]],
 
        [[3, 9, 8, 2],
         [5, 5, 8, 4],
         [2, 1, 5, 7]]]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]))

In [8]:
# Arithmetic on arrays is element-wise

In [9]:
a = np.array( [20,30,40,50] )
b = np.arange( 4 )
b

array([0, 1, 2, 3])

In [10]:
c = a-b
c

array([20, 29, 38, 47])

In [11]:
b**2

array([0, 1, 4, 9])

## Indexing, Slicing and Iterating

In [12]:
# one-dimensional arrays work like lists:
a = np.arange(10)**2

In [13]:
a

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [14]:
a[2:5]

array([ 4,  9, 16])

In [15]:
# Multidimensional arrays take one index per axis, separated by commas,
# in (row, column) order and, as always in Python, starting from 0

In [16]:
b = np.random.randint(1,100,(4,4))

In [17]:
b

array([[92, 32, 75, 72],
       [92, 79, 47, 90],
       [94, 84, 29, 97],
       [35, 32, 57, 91]])

In [18]:
# Guess the output
print(b[2,3])
print(b[0,0])


97
92


In [20]:
b[0:3,1],b[:,1]

(array([32, 79, 84]), array([32, 79, 84, 32]))

In [21]:
b[1:3,:]

array([[92, 79, 47, 90],
       [94, 84, 29, 97]])
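
The heading mentions iterating, which the cells above do not show. A small sketch, using a made-up `grid` array, of iterating over an array and of the fact that slices are views:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)   # [[0 1 2], [3 4 5]]

# Iterating over a 2-D array yields one row (a 1-D array) at a time.
row_sums = [int(row.sum()) for row in grid]

# .flat iterates over every element in row-major order.
elements = [int(x) for x in grid.flat]

# A slice is a view: writing through it changes the original array.
first_col = grid[:, 0]
first_col[0] = 100

print(row_sums, elements, grid[0, 0])
```

Call `.copy()` on a slice when you need an independent array.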

# Introduction to Pandas

* Object Creation
* Viewing data
* Selection
* Missing data
* Grouping
* Reshaping
* Time series
* Plotting
* i/o
 

_pandas.pydata.org_

## Pandas Overview

_Source: [pandas.pydata.org](http://pandas.pydata.org/pandas-docs/stable/10min.html)_

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [8]:
dates = pd.date_range('20140101',periods=6)
dates

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [9]:
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
z = pd.DataFrame(index = df.index, columns = df.columns)
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [10]:
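# A DataFrame has an index, columns, and underlying NumPy data.
# df.T is the transpose; only the last expression in a cell is displayed, so this cell shows df.
df.T
df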

,A,B,C,D
2014-01-01,1.078506,-0.558053,-2.057272,-1.072423
2014-01-02,1.205022,-1.318195,-0.942636,0.570518
2014-01-03,-0.576141,-0.421054,0.099232,-0.318254
2014-01-04,1.420392,0.897595,-0.922699,0.827347
2014-01-05,-0.109823,-1.01092,-1.068103,-1.975316
2014-01-06,-1.767098,-0.063905,1.586344,0.257661


In [26]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': 'foo'})
    

df2

,A,B,C,D,E
0,1.0,2013-01-02,1.0,3,foo
1,1.0,2013-01-02,1.0,3,foo
2,1.0,2013-01-02,1.0,3,foo
3,1.0,2013-01-02,1.0,3,foo


In [27]:
# With specific dtypes
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object

#### Viewing Data

In [28]:
df.head()

,A,B,C,D
2014-01-01,0.30544,1.582244,-0.755301,-0.563081
2014-01-02,-2.059095,1.830817,-0.772775,1.927213
2014-01-03,-1.521501,-0.260282,0.446055,-0.305249
2014-01-04,-0.193088,0.348156,0.394483,-1.053801
2014-01-05,1.420014,-0.274107,-0.79355,-0.957536


In [29]:
df.tail()

,A,B,C,D
2014-01-02,-2.059095,1.830817,-0.772775,1.927213
2014-01-03,-1.521501,-0.260282,0.446055,-0.305249
2014-01-04,-0.193088,0.348156,0.394483,-1.053801
2014-01-05,1.420014,-0.274107,-0.79355,-0.957536
2014-01-06,1.479728,1.079766,-1.153144,0.888863


In [30]:
df.index

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [31]:
df.describe()

,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.09475,0.717766,-0.439039,-0.010599
std,1.472249,0.915887,0.681935,1.178233
min,-2.059095,-0.274107,-1.153144,-1.053801
25%,-1.189397,-0.108172,-0.788356,-0.858922
50%,0.056176,0.713961,-0.764038,-0.434165
75%,1.141371,1.456624,0.107037,0.590335
max,1.479728,1.830817,0.446055,1.927213


In [32]:
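# sort_values returns a new sorted DataFrame; its result is discarded here,
# so displaying df afterwards shows the original row order
df.sort_values(by='B')
df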

,A,B,C,D
2014-01-01,0.30544,1.582244,-0.755301,-0.563081
2014-01-02,-2.059095,1.830817,-0.772775,1.927213
2014-01-03,-1.521501,-0.260282,0.446055,-0.305249
2014-01-04,-0.193088,0.348156,0.394483,-1.053801
2014-01-05,1.420014,-0.274107,-0.79355,-0.957536
2014-01-06,1.479728,1.079766,-1.153144,0.888863
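
The output above is unsorted because `sort_values` does not modify `df`. A sketch with a tiny made-up frame that keeps the sorted result:

```python
import pandas as pd

frame = pd.DataFrame({"B": [3.0, 1.0, 2.0]}, index=list("xyz"))

# sort_values returns a new, sorted DataFrame; `frame` keeps its original order.
by_b = frame.sort_values(by="B")
print(list(by_b.index))    # sorted order
print(list(frame.index))   # original order

# To keep the sorted result, rebind the name.
frame = frame.sort_values(by="B", ascending=False)
```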


### Selection

In [33]:
df[['A','B']] # a list of column names inside [] returns a DataFrame with those columns

,A,B
2014-01-01,0.30544,1.582244
2014-01-02,-2.059095,1.830817
2014-01-03,-1.521501,-0.260282
2014-01-04,-0.193088,0.348156
2014-01-05,1.420014,-0.274107
2014-01-06,1.479728,1.079766


In [34]:
df[0:3]

,A,B,C,D
2014-01-01,0.30544,1.582244,-0.755301,-0.563081
2014-01-02,-2.059095,1.830817,-0.772775,1.927213
2014-01-03,-1.521501,-0.260282,0.446055,-0.305249


In [35]:
# By label
df.loc[dates[0]]

A    0.305440
B    1.582244
C   -0.755301
D   -0.563081
Name: 2014-01-01 00:00:00, dtype: float64

In [36]:
# multi-axis by label
df.loc[:,['A','B']]

,A,B
2014-01-01,0.30544,1.582244
2014-01-02,-2.059095,1.830817
2014-01-03,-1.521501,-0.260282
2014-01-04,-0.193088,0.348156
2014-01-05,1.420014,-0.274107
2014-01-06,1.479728,1.079766


In [37]:
# Date Range
df.loc['20140102':'20140104',['B']]

,B
2014-01-02,1.830817
2014-01-03,-0.260282
2014-01-04,0.348156


In [38]:
# Fast access to scalar
df.at[dates[1],'B']

1.8308167450507209

In [39]:
# iloc provides integer locations similar to np style
df.iloc[3:]

,A,B,C,D
2014-01-04,-0.193088,0.348156,0.394483,-1.053801
2014-01-05,1.420014,-0.274107,-0.79355,-0.957536
2014-01-06,1.479728,1.079766,-1.153144,0.888863
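
One difference between `.loc` and `.iloc` is easy to miss: label slices include the end label, while integer slices exclude the stop position. A sketch on a small made-up frame:

```python
import pandas as pd

idx = pd.date_range("20140101", periods=4)
small = pd.DataFrame({"A": [10, 20, 30, 40]}, index=idx)

# Label slices with .loc include BOTH endpoints.
by_label = small.loc[idx[1]:idx[2], "A"]

# Integer slices with .iloc exclude the stop position, like Python lists.
by_position = small.iloc[1:2, 0]

print(by_label.tolist())
print(by_position.tolist())
```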


### Boolean Indexing

In [40]:
df[df.A < 0] # Basically a 'where' operation

,A,B,C,D
2014-01-02,-2.059095,1.830817,-0.772775,1.927213
2014-01-03,-1.521501,-0.260282,0.446055,-0.305249
2014-01-04,-0.193088,0.348156,0.394483,-1.053801


### Setting

In [41]:
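df_posA = df.copy() # Without copy(), df_posA would be another name for df and the edit below would change df too

# Flip the sign of every column in the rows where A is negative
df_posA[df_posA.A < 0] = -1*df_posA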

In [42]:
df_posA

,A,B,C,D
2014-01-01,0.30544,1.582244,-0.755301,-0.563081
2014-01-02,2.059095,-1.830817,0.772775,-1.927213
2014-01-03,1.521501,0.260282,-0.446055,0.305249
2014-01-04,0.193088,-0.348156,-0.394483,1.053801
2014-01-05,1.420014,-0.274107,-0.79355,-0.957536
2014-01-06,1.479728,1.079766,-1.153144,0.888863
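
The assignment above changes every column in the selected rows. If the goal is to change only column A, one possible approach is `.loc` with a boolean mask. This is a sketch on a made-up frame, not the lesson's `df`:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, -2.0, -3.0], "B": [-1.0, 5.0, -6.0]})

fixed = frame.copy()          # work on a copy so `frame` is left untouched
mask = fixed["A"] < 0         # boolean Series: True where A is negative

# .loc[rows, column] limits the assignment to column A of the selected rows.
fixed.loc[mask, "A"] = -fixed.loc[mask, "A"]

print(fixed)
```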


In [43]:
#Setting new column aligns data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))

In [44]:
s1

2014-01-02    1
2014-01-03    2
2014-01-04    3
2014-01-05    4
2014-01-06    5
2014-01-07    6
Freq: D, dtype: int64

In [45]:
df['F'] = s1
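
Index alignment means values are matched by label, not by position. A self-contained sketch with made-up names `frame` and `shifted`:

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]},
                     index=pd.date_range("20140101", periods=3))
shifted = pd.Series([10, 20, 30], index=pd.date_range("20140102", periods=3))

# Assignment matches on index labels: 2014-01-01 has no partner and becomes NaN,
# and the value for 2014-01-04 is dropped because the frame has no such row.
frame["F"] = shifted

print(frame)
```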

In [4]:
# F is aligned on the index: 2014-01-01 has no entry in s1, so F is NaN there
df

### Missing Data

In [47]:
# Add a column with missing data
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])

In [48]:
df1.loc[dates[0]:dates[1],'E'] = 1

In [49]:
df1

,A,B,C,D,F,E
2014-01-01,0.30544,1.582244,-0.755301,-0.563081,,1.0
2014-01-02,-2.059095,1.830817,-0.772775,1.927213,1.0,1.0
2014-01-03,-1.521501,-0.260282,0.446055,-0.305249,2.0,
2014-01-04,-0.193088,0.348156,0.394483,-1.053801,3.0,


In [50]:
# find where values are null
pd.isnull(df1)

,A,B,C,D,F,E
2014-01-01,False,False,False,False,True,False
2014-01-02,False,False,False,False,False,False
2014-01-03,False,False,False,False,False,True
2014-01-04,False,False,False,False,False,True
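
Once missing values are located, they are usually dropped or filled. A short sketch on a made-up frame `gaps`:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({"A": [1.0, 2.0, 3.0], "E": [1.0, np.nan, np.nan]})

dropped = gaps.dropna(how="any")   # keep only rows with no missing values
filled = gaps.fillna(value=5)      # replace every NaN with a constant
n_missing = gaps.isnull().sum()    # count of missing values per column

print(dropped)
print(filled)
print(n_missing)
```

Both `dropna` and `fillna` return new objects and leave `gaps` unchanged.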


### Operations

In [51]:
df.describe()

,A,B,C,D,F
count,6.0,6.0,6.0,6.0,5.0
mean,-0.09475,0.717766,-0.439039,-0.010599,3.0
std,1.472249,0.915887,0.681935,1.178233,1.581139
min,-2.059095,-0.274107,-1.153144,-1.053801,1.0
25%,-1.189397,-0.108172,-0.788356,-0.858922,2.0
50%,0.056176,0.713961,-0.764038,-0.434165,3.0
75%,1.141371,1.456624,0.107037,0.590335,4.0
max,1.479728,1.830817,0.446055,1.927213,5.0


In [52]:
df.mean(), df.mean(axis=1) # Mean of each column, then mean of each row

(A   -0.094750
 B    0.717766
 C   -0.439039
 D   -0.010599
 F    3.000000
 dtype: float64, 2014-01-01    0.142326
 2014-01-02    0.385232
 2014-01-03    0.071805
 2014-01-04    0.499150
 2014-01-05    0.678964
 2014-01-06    1.459043
 Freq: D, dtype: float64)

### Applying functions

In [3]:
# Display df again before applying functions to it
df

In [54]:
df.apply(np.cumsum)

,A,B,C,D,F
2014-01-01,0.30544,1.582244,-0.755301,-0.563081,
2014-01-02,-1.753655,3.413061,-1.528075,1.364132,1.0
2014-01-03,-3.275156,3.152779,-1.082021,1.058883,3.0
2014-01-04,-3.468243,3.500935,-0.687537,0.005082,6.0
2014-01-05,-2.048229,3.226829,-1.481088,-0.952454,10.0
2014-01-06,-0.568501,4.306594,-2.634232,-0.063591,15.0


In [16]:
def custom_func(x, a):
    return x + a

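# lambda defines an anonymous function inline; this one returns each column's maximum
df.apply(lambda x: x.max())
# Wrap a user-defined function to pass it an extra argument (here: add 1 to every value).
# Only this last expression is displayed.
df.apply(lambda x: custom_func(x, 1))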

,A,B,C,D
2014-01-01,2.078506,0.441947,-1.057272,-0.072423
2014-01-02,2.205022,-0.318195,0.057364,1.570518
2014-01-03,0.423859,0.578946,1.099232,0.681746
2014-01-04,2.420392,1.897595,0.077301,1.827347
2014-01-05,0.890177,-0.01092,-0.068103,-0.975316
2014-01-06,-0.767098,0.936095,2.586344,1.257661
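
Instead of wrapping `custom_func` in a lambda, extra arguments can be handed to `apply` directly. A sketch on a small made-up frame:

```python
import pandas as pd

def custom_func(x, a):
    return x + a

frame = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# Extra positional arguments go through `args`; keyword arguments pass straight through.
via_args = frame.apply(custom_func, args=(1,))
via_kwargs = frame.apply(custom_func, a=1)

# axis=1 calls the function once per row instead of once per column.
row_max = frame.apply(lambda row: row.max(), axis=1)

print(via_args)
print(row_max)
```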


In [56]:
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

### Merge

In [57]:
np.random.randn(10,4)

array([[-0.63004922, -1.59825579,  1.57135955,  0.8372529 ],
       [-0.90978749,  0.6164943 , -1.23203251, -1.48755087],
       [ 0.2684147 ,  1.81282425,  1.61585991, -0.3780532 ],
       [ 0.46944204,  0.92522524,  0.41592032,  0.59103306],
       [-1.25728434, -0.44017552, -1.56507323,  1.1092632 ],
       [ 1.55686711,  0.68134391, -0.1474489 , -1.5526394 ],
       [ 0.64392246,  0.48329234,  0.70164411,  1.76367499],
       [ 0.64308961,  1.01804717, -1.00852702, -1.09198794],
       [ 0.10679744,  0.19168672,  0.77541976, -1.36352573],
       [-0.54018422, -1.32098216,  0.71191158, -1.69998242]])

In [58]:
#Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10,4))
df

,0,1,2,3
0,0.540047,-2.362725,0.288254,0.56283
1,-0.413663,-0.909021,-0.012049,-0.701526
2,-0.964266,0.972204,-0.804958,0.545468
3,-0.293776,-0.259829,0.371753,-0.7056
4,-0.417441,1.43227,1.110488,-0.355463
5,0.385209,-0.810505,-0.222489,-1.24728
6,-1.395946,0.327114,1.000198,0.36486
7,-1.432864,1.970238,-0.180278,-0.413504
8,1.751233,-0.291671,-0.757042,0.100739
9,-1.364001,-0.950015,0.542743,0.175593


In [59]:
# Break it into pieces
pieces = [df[:3], df[3:7],df[7:]]
pieces

[          0         1         2         3
 0  0.540047 -2.362725  0.288254  0.562830
 1 -0.413663 -0.909021 -0.012049 -0.701526
 2 -0.964266  0.972204 -0.804958  0.545468,
           0         1         2         3
 3 -0.293776 -0.259829  0.371753 -0.705600
 4 -0.417441  1.432270  1.110488 -0.355463
 5  0.385209 -0.810505 -0.222489 -1.247280
 6 -1.395946  0.327114  1.000198  0.364860,
           0         1         2         3
 7 -1.432864  1.970238 -0.180278 -0.413504
 8  1.751233 -0.291671 -0.757042  0.100739
 9 -1.364001 -0.950015  0.542743  0.175593]

In [60]:
pd.concat(pieces)

,0,1,2,3
0,0.540047,-2.362725,0.288254,0.56283
1,-0.413663,-0.909021,-0.012049,-0.701526
2,-0.964266,0.972204,-0.804958,0.545468
3,-0.293776,-0.259829,0.371753,-0.7056
4,-0.417441,1.43227,1.110488,-0.355463
5,0.385209,-0.810505,-0.222489,-1.24728
6,-1.395946,0.327114,1.000198,0.36486
7,-1.432864,1.970238,-0.180278,-0.413504
8,1.751233,-0.291671,-0.757042,0.100739
9,-1.364001,-0.950015,0.542743,0.175593


In [61]:
# DataFrames can also be combined with join and merge (DataFrame.append was removed in pandas 2.0 in favor of pd.concat)
df

,0,1,2,3
0,0.540047,-2.362725,0.288254,0.56283
1,-0.413663,-0.909021,-0.012049,-0.701526
2,-0.964266,0.972204,-0.804958,0.545468
3,-0.293776,-0.259829,0.371753,-0.7056
4,-0.417441,1.43227,1.110488,-0.355463
5,0.385209,-0.810505,-0.222489,-1.24728
6,-1.395946,0.327114,1.000198,0.36486
7,-1.432864,1.970238,-0.180278,-0.413504
8,1.751233,-0.291671,-0.757042,0.100739
9,-1.364001,-0.950015,0.542743,0.175593
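
This section is titled Merge but only shows `pd.concat`. A minimal sketch of a SQL-style merge and an index-based join, using made-up frames `left` and `right`:

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

# SQL-style join on a shared column.
merged = pd.merge(left, right, on="key")

# DataFrame.join aligns on the index instead of a column.
joined = left.set_index("key").join(right.set_index("key"))

print(merged)
print(joined)
```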


### Grouping


In [62]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8)})

In [63]:
df

,A,B,C,D
0,foo,one,-0.945613,-0.475291
1,bar,one,0.444286,-0.794496
2,foo,two,0.925798,-0.666301
3,bar,three,0.934796,-0.228948
4,foo,two,1.597492,0.465232
5,bar,two,0.680151,-0.909297
6,foo,one,1.226569,0.288632
7,foo,three,-2.18315,-0.038068


In [64]:
df.groupby(['A','B']).sum()

A,B,C,D
bar,one,0.444286,-0.794496
bar,three,0.934796,-0.228948
bar,two,0.680151,-0.909297
foo,one,0.280956,-0.186659
foo,three,-2.18315,-0.038068
foo,two,2.52329,-0.201069
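
`sum()` is only one possible aggregation. A sketch, on a made-up frame `sales`, that computes several aggregates per group in one call:

```python
import pandas as pd

sales = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                      "C": [1.0, 2.0, 3.0, 4.0]})

# Split by the values in A, then compute several aggregates of C per group.
summary = sales.groupby("A")["C"].agg(["sum", "mean", "count"])
print(summary)
```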


### Reshaping

In [65]:
# You can also stack or unstack levels

In [66]:
a = df.groupby(['A','B']).sum()
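
A sketch of what stacking and unstacking do to a grouped result. The frame here is made up and smaller than the lesson's `df`:

```python
import pandas as pd

frame = pd.DataFrame({"A": ["foo", "foo", "bar", "bar"],
                      "B": ["one", "two", "one", "two"],
                      "C": [1.0, 2.0, 3.0, 4.0]})
grouped = frame.groupby(["A", "B"]).sum()   # rows indexed by (A, B)

# unstack moves the inner index level (B) into the columns.
wide = grouped.unstack()

# stack moves that column level back into the index, restoring the long layout.
long_again = wide.stack()

print(wide)
print(long_again)
```

Recent pandas versions may emit a warning about a changing `stack` implementation; the result for this small example should be the same either way.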

In [67]:
# Pivot Tables
pd.pivot_table(df,values=['C','D'],index=['A'],columns=['B'])

,C,C,C,D,D,D
B,one,three,two,one,three,two
A,,,,,,
bar,0.444286,0.934796,0.680151,-0.794496,-0.228948,-0.909297
foo,0.140478,-2.18315,1.261645,-0.093329,-0.038068,-0.100535


### Time Series


In [68]:
import pandas as pd
import numpy as np

In [69]:
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='S')

In [70]:
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [71]:
ts

2014-01-01 00:00:00    295
2014-01-01 00:00:01    351
2014-01-01 00:00:02    447
2014-01-01 00:00:03    122
2014-01-01 00:00:04    353
2014-01-01 00:00:05     45
2014-01-01 00:00:06    459
2014-01-01 00:00:07    395
2014-01-01 00:00:08    328
2014-01-01 00:00:09     29
2014-01-01 00:00:10    410
2014-01-01 00:00:11    340
2014-01-01 00:00:12      8
2014-01-01 00:00:13     71
2014-01-01 00:00:14    481
2014-01-01 00:00:15    300
2014-01-01 00:00:16    316
2014-01-01 00:00:17    215
2014-01-01 00:00:18     86
2014-01-01 00:00:19     18
2014-01-01 00:00:20    203
2014-01-01 00:00:21    255
2014-01-01 00:00:22    111
2014-01-01 00:00:23    436
2014-01-01 00:00:24     49
2014-01-01 00:00:25    243
2014-01-01 00:00:26    490
2014-01-01 00:00:27      1
2014-01-01 00:00:28    280
2014-01-01 00:00:29    191
                      ... 
2014-01-01 00:01:10    235
2014-01-01 00:01:11    259
2014-01-01 00:01:12    231
2014-01-01 00:01:13    163
2014-01-01 00:01:14    424
2014-01-01 00:01:15     71
...

In [72]:
# Built-in resampling
ts.resample('1Min').mean() # Downsample from one-second to one-minute bins, averaging each bin

2014-01-01 00:00:00    259.200
2014-01-01 00:01:00    284.575
Freq: T, dtype: float64
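
Resampling is not limited to the mean. A sketch with a deterministic made-up series so the result is easy to check. It uses the lowercase aliases `s` and `min`, which recent pandas versions prefer over `S` and `T`; that choice is an assumption about your installed version.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("1/1/2014", periods=120, freq="s")
ts = pd.Series(np.arange(120), index=rng)   # values 0..119, one per second

# Group the seconds into one-minute bins and compute several statistics per bin.
per_minute = ts.resample("1min").agg(["mean", "max", "count"])
print(per_minute)
```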

In [73]:
# Many additional time series features:
# type `ts.` and press Tab to list the available methods

### Plotting


In [74]:
ts.plot()

<matplotlib.axes._subplots.AxesSubplot at 0x105c0c150>

In [75]:
def randwalk(startdate, points):
    # Random walk: cumulative sum of normally distributed daily steps
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts = ts.cumsum()
    ts.plot()
    return ts

In [76]:
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)

In [77]:
# The pandas plot function adds labels by default

In [78]:
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index,columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')

<matplotlib.legend.Legend at 0x10b0a0610>

### I/O
I/O is straightforward with, for example, pd.read_csv or df.to_csv
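
A minimal round-trip sketch. It writes to an in-memory buffer rather than a file so it runs anywhere; with a real file you would pass a path such as a hypothetical `'data.csv'` instead of `buffer`.

```python
import io
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

buffer = io.StringIO()
frame.to_csv(buffer, index=False)   # index=False leaves the row labels out of the file
buffer.seek(0)                      # rewind before reading
restored = pd.read_csv(buffer)

print(restored)
```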

#### The benefits of open source:

Let's look under x's in plt modules

# Next Steps

**Recommended Resources**

Name | Description
--- | ---
[Official Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/10min.html) | Wes & Company's selection of tutorials and lectures
[Julia Evans Pandas Cookbook](https://github.com/jvns/pandas-cookbook) | Great resource with examples from weather, bikes and 311 calls
[Learn Pandas Tutorials](https://bitbucket.org/hrojas/learn-pandas) | A great series of Pandas tutorials from Dave Rojas
[Research Computing Python Data PYNBs](https://github.com/ResearchComputing/Meetup-Fall-2013/tree/master/python) | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas