# Agenda
* Numpy
* Pandas
* Lab


# Introduction


## Create a new notebook for your code-along:

From our submission directory, type:
    
    jupyter notebook

From the IPython Dashboard, open a new notebook.
Change the title to: "Numpy and Pandas"

# Introduction to Numpy

* Overview
* ndarray
* Indexing and Slicing

More info: [http://wiki.scipy.org/Tentative_NumPy_Tutorial](http://wiki.scipy.org/Tentative_NumPy_Tutorial)


## Numpy Overview

* Why Python for Data? Numpy brings *decades* of C math into Python!
* Numpy provides a wrapper for extensive C/C++/Fortran codebases, used for data analysis functionality
* ndarray allows easy vectorized math and broadcasting (i.e. arithmetic between arrays of different but compatible shapes)
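
To make the last bullet concrete, here is a minimal sketch of vectorized math and broadcasting. The array names (`prices`, `grid`, `offsets`) are made up for illustration and are not part of the lesson's data.

```python
import numpy as np

# Vectorized math: one expression is applied to every element, no Python loop.
prices = np.array([10.0, 20.0, 30.0])
discounted = prices * 0.9

# Broadcasting: a (2, 3) array plus a (3,) array.
# The 1-D array is conceptually stretched across both rows.
grid = np.ones((2, 3))
offsets = np.array([0.0, 1.0, 2.0])
shifted = grid + offsets   # result has shape (2, 3)
```

Shapes are compatible for broadcasting when, compared from the last dimension backwards, each pair of sizes is equal or one of them is 1.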

In [6]:
!pip install package_name

Collecting package_name
  Downloading package_name-0.1.tar.gz
Building wheels for collected packages: package-name
  Running setup.py bdist_wheel for package-name ... done
  Stored in directory: /Users/edmond_20000/Library/Caches/pip/wheels/97/25/75/79d4ad8fbbcea368670af12dd8d1f2ccbe49a7ce7deb1fb0ab
Successfully built package-name
Installing collected packages: package-name
Successfully installed package-name-0.1


In [7]:
import numpy as np

### Creating ndarrays

An array object represents a multidimensional, homogeneous array of fixed-size items. 
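
As a small illustration of "homogeneous", the sketch below shows that every element of an array shares one dtype, so mixing integers and floats upcasts the whole array. The variable names are illustrative only.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])    # one float forces every element to float64

int_kind = ints.dtype.kind       # 'i' for signed integer
mixed_dtype = mixed.dtype        # float64
mixed_shape = mixed.shape        # (3,)
```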

In [8]:
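# Creating arrays
a = np.zeros(3)                        # 1-D array of three zeros (floats)
b = np.ones((2,3))                     # 2x3 array of ones
c = np.random.randint(1,10,(2,3,4))    # random integers from 1 to 9, shape (2,3,4)
d = np.arange(0,11,1)                  # integers 0 through 10 in steps of 1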

In [13]:
print (a)

[ 0.  0.  0.]


In [14]:
print(b)

[[ 1.  1.  1.]
 [ 1.  1.  1.]]


In [20]:
print(c)

[[[6 7 7 8]
  [1 1 1 4]
  [9 8 9 5]]

 [[3 4 6 9]
  [2 9 7 2]
  [1 4 6 1]]]


In [18]:
print(d)

[ 0  1  2  3  4  5  6  7  8  9 10]


What do these functions do? In IPython, append `?` to a name to view its documentation:

    np.arange?

In [40]:
c = np.random.randint(1,10,4)
print(c)

[5 8 5 5]


In [21]:
# Note the way each array is printed:
a,b,c,d

(array([ 0.,  0.,  0.]), array([[ 1.,  1.,  1.],
        [ 1.,  1.,  1.]]), array([[[6, 7, 7, 8],
         [1, 1, 1, 4],
         [9, 8, 9, 5]],
 
        [[3, 4, 6, 9],
         [2, 9, 7, 2],
         [1, 4, 6, 1]]]), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]))

In [None]:
## Arithmetic on arrays is element-wise

In [22]:
a = np.array( [20,30,40,50] )
b = np.arange( 4 )
b

array([0, 1, 2, 3])

In [23]:
c = a-b
c

array([20, 29, 38, 47])

In [24]:
b**2

array([0, 1, 4, 9])

In [27]:
# With a plain Python list, [0,1,2,3]**2 raises a TypeError, so you need map (or a list comprehension)
list(map(lambda x: x**2, [0,1,2,3]))

[0, 1, 4, 9]

## Indexing, Slicing and Iterating

In [28]:
# one-dimensional arrays work like lists:
a = np.arange(10)**2
print(a)

# Plain-Python equivalent: list(map(lambda x: x**2, range(10)))

[ 0  1  4  9 16 25 36 49 64 81]


In [29]:
a

array([ 0,  1,  4,  9, 16, 25, 36, 49, 64, 81])

In [30]:
a[2:5]

array([ 4,  9, 16])

In [None]:
# Multidimensional arrays use tuples with commas for indexing
# with (row,column) conventions beginning, as always in Python, from 0

In [31]:
b = np.random.randint(1,100,(4,4))

In [32]:
b

array([[25, 65, 69,  7],
       [85, 17,  4, 22],
       [ 7,  5, 58, 16],
       [23, 52, 46, 35]])

In [35]:
# Guess the output
print(b[2,3])
print(b[0,0])


16
25


In [41]:
b[0:3,1],b[:,1]

(array([65, 17,  5]), array([65, 17,  5, 52]))

In [42]:
b[1:3,:]

array([[85, 17,  4, 22],
       [ 7,  5, 58, 16]])
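
Because `b` above is random, your values will differ from the ones printed here. The sketch below repeats the same indexing patterns on a fixed array (the name `m` is illustrative) so the results are reproducible.

```python
import numpy as np

# A fixed 4x4 array holding 0..15, filled row by row.
m = np.arange(16).reshape(4, 4)

single = m[2, 3]      # row 2, column 3
column = m[:, 1]      # every row of column 1
block = m[1:3, :]     # rows 1 and 2 (the end index 3 is excluded), all columns
corner = m[:2, :2]    # top-left 2x2 block
```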

# Introduction to Pandas

* Object Creation
* Viewing data
* Selection
* Missing data
* Grouping
* Reshaping
* Time series
* Plotting
* i/o
 

_pandas.pydata.org_

## Pandas Overview

_Source: [pandas.pydata.org](http://pandas.pydata.org/pandas-docs/stable/10min.html)_

In [43]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [46]:
dates = pd.date_range('20140101',periods=6)
dates

DatetimeIndex(['2014-01-01', '2014-01-02', '2014-01-03', '2014-01-04',
               '2014-01-05', '2014-01-06'],
              dtype='datetime64[ns]', freq='D')

In [57]:
# freq="366D" steps 366 days at a time, so each date lands in the next year (not exactly yearly: the day drifts)
dates = pd.date_range('20140101',periods=6, freq="366D")
print(dates)

DatetimeIndex(['2014-01-01', '2015-01-02', '2016-01-03', '2017-01-03',
               '2018-01-04', '2019-01-05'],
              dtype='datetime64[ns]', freq='366D')


In [49]:
# Indexing a DatetimeIndex returns a single Timestamp, which is handy for historical data
dates = pd.date_range('20140101',periods=6)
dates[0]

# pandas parses date strings such as "2010-05-01" or "2010/04/04" into Timestamps as well.

Timestamp('2014-01-01 00:00:00', freq='D')

In [53]:
#sample timestamp
date1 = dates[0]

In [54]:
#printing the timestamp
print(date1.day)
print(date1.month)
print(date1.year)

1
1
2014
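
The comment about string dates can be sketched as follows. This assumes the strings are in an unambiguous year-first format. `pd.Timestamp` and `pd.to_datetime` are standard pandas calls; the variable names are illustrative.

```python
import pandas as pd

# A single date string becomes a Timestamp.
ts_dash = pd.Timestamp("2010-05-01")
ts_slash = pd.Timestamp("2010/04/04")

# The same attributes used above are available on any Timestamp.
parts = (ts_dash.year, ts_dash.month, ts_dash.day)

# A list of date strings becomes a DatetimeIndex.
parsed = pd.to_datetime(["2014-01-01", "2014-01-02"])
```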


In [60]:
# returns random numbers drawn from the standard normal distribution, arranged in 6 rows and 4 columns.
np.random.randn(6,4)

array([[ 0.21809338,  0.63844799,  1.74294098, -0.43690965],
       [-1.87939836, -0.0397535 , -0.22782256, -0.34279839],
       [ 0.06144449,  0.2960631 ,  1.08412883,  0.15130826],
       [ 1.21702976, -1.23512353,  0.87738496,  0.12984023],
       [ 0.05955683, -0.47370786,  1.25816591,  0.9064061 ],
       [ 0.59344125,  0.26775239,  0.50163379,  0.47862135]])

In [55]:
import time

In [56]:
time.localtime()
#this gives you the local time on your computer

time.struct_time(tm_year=2016, tm_mon=11, tm_mday=14, tm_hour=19, tm_min=34, tm_sec=26, tm_wday=0, tm_yday=319, tm_isdst=0)

In [62]:
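# Making a DataFrame: random values, dates as the index, columns named A-D

df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
# z is an empty DataFrame (all NaN) with the same index and columns as df
z = pd.DataFrame(index = df.index, columns = df.columns)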
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [59]:
df.head()

Unnamed: 0,A,B,C,D
2014-01-01,-0.030941,-0.243624,-0.358242,-0.561736
2015-01-02,-0.951847,1.335492,-0.521909,0.229564
2016-01-03,-0.528537,-0.772701,-0.169211,-0.757516
2017-01-03,-1.66902,0.163882,1.330974,-1.211827
2018-01-04,0.781567,-0.667241,1.167789,1.009775


In [64]:
# .T transposes the DataFrame: rows become columns and columns become rows
df.T


Unnamed: 0,2014-01-01 00:00:00,2015-01-02 00:00:00,2016-01-03 00:00:00,2017-01-03 00:00:00,2018-01-04 00:00:00,2019-01-05 00:00:00
A,1.595521,1.17348,0.951366,-0.073438,-0.671779,-1.080399
B,1.718262,1.744622,-0.79637,0.339853,-0.805496,-1.111915
C,0.782275,-0.195889,-0.547382,-1.523696,-1.305093,-0.915148
D,-1.015218,0.150449,-1.191503,-0.987303,0.420598,-1.143999


In [66]:
# Another way to build a DataFrame is to pass a dictionary: keys become column names, values become the column data.
# A: the scalar 1.0 is repeated in every row.
# B: the same Timestamp is repeated in every row.
# C: a float32 Series of ones indexed 0-3; its length sets the number of rows (4).
# D: an int32 array [3, 3, 3, 3].
# E: the string 'foo' repeated in every row.

df2 = pd.DataFrame({ 'A' : 1.,
                         'B' : pd.Timestamp('20130102'),
                         'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                         'D' : np.array([3] * 4,dtype='int32'),
                         'E' : 'foo' })
    

df2

Unnamed: 0,A,B,C,D,E
0,1.0,2013-01-02,1.0,3,foo
1,1.0,2013-01-02,1.0,3,foo
2,1.0,2013-01-02,1.0,3,foo
3,1.0,2013-01-02,1.0,3,foo


In [67]:
# Each column has its own dtype
# df2.dtypes shows the schema, i.e. the data type of every column.
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E            object
dtype: object

#### Viewing Data

In [68]:
df.head()

Unnamed: 0,A,B,C,D
2014-01-01,1.595521,1.718262,0.782275,-1.015218
2015-01-02,1.17348,1.744622,-0.195889,0.150449
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598


In [69]:
df.tail()

Unnamed: 0,A,B,C,D
2015-01-02,1.17348,1.744622,-0.195889,0.150449
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999


In [70]:
df.index # returns the index (the row labels)

DatetimeIndex(['2014-01-01', '2015-01-02', '2016-01-03', '2017-01-03',
               '2018-01-04', '2019-01-05'],
              dtype='datetime64[ns]', freq='366D')

In [71]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,0.315792,0.181493,-0.617489,-0.627829
std,1.081983,1.298721,0.839596,0.71671
min,-1.080399,-1.111915,-1.523696,-1.191503
25%,-0.522194,-0.803215,-1.207607,-1.111804
50%,0.438964,-0.228259,-0.731265,-1.001261
75%,1.117952,1.37366,-0.283762,-0.133989
max,1.595521,1.744622,0.782275,0.420598


In [74]:
df.sort_values(by='B')

Unnamed: 0,A,B,C,D
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303
2014-01-01,1.595521,1.718262,0.782275,-1.015218
2015-01-02,1.17348,1.744622,-0.195889,0.150449


In [98]:
df.sort_values(by='B', inplace=True) # inplace=True sorts df itself, so the new order persists
df

Unnamed: 0,A,B,C,D
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303
2014-01-01,1.595521,1.718262,0.782275,-1.015218
2015-01-02,1.17348,1.744622,-0.195889,0.150449


In [99]:
df.head() # each row keeps its index label after sorting

Unnamed: 0,A,B,C,D
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303
2014-01-01,1.595521,1.718262,0.782275,-1.015218
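
The difference between sorting into a new object and sorting in place is easy to miss. Here is a minimal sketch on a tiny, made-up DataFrame (`scores` is an illustrative name).

```python
import pandas as pd

scores = pd.DataFrame({"B": [3, 1, 2]}, index=["x", "y", "z"])

# Default: returns a sorted copy and leaves the original untouched.
sorted_copy = scores.sort_values(by="B")
order_before = list(scores.index)

# inplace=True: reorders the original object and returns None.
result = scores.sort_values(by="B", inplace=True)
order_after = list(scores.index)
```

Note that the index labels travel with their rows in both cases; only the row order changes.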


### Selection

In [100]:
df[['A','B']] #returns a dataframe with just 2 columns.

Unnamed: 0,A,B
2019-01-05,-1.080399,-1.111915
2018-01-04,-0.671779,-0.805496
2016-01-03,0.951366,-0.79637
2017-01-03,-0.073438,0.339853
2014-01-01,1.595521,1.718262
2015-01-02,1.17348,1.744622


In [101]:
df[0:3] # slices rows by position: rows 0, 1, 2 => 3 rows, with all columns.

Unnamed: 0,A,B,C,D
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503


In [102]:
# By label
df.loc[dates[0]]

A    1.595521
B    1.718262
C    0.782275
D   -1.015218
Name: 2014-01-01 00:00:00, dtype: float64

In [103]:
# multi-axis by label
# basically give me column A and B
df.loc[:,['A','B']]

Unnamed: 0,A,B
2019-01-05,-1.080399,-1.111915
2018-01-04,-0.671779,-0.805496
2016-01-03,0.951366,-0.79637
2017-01-03,-0.073438,0.339853
2014-01-01,1.595521,1.718262
2015-01-02,1.17348,1.744622


In [104]:
df.loc['2014-01-01':'2014-01-02',['A','B']]

Unnamed: 0,A,B
2014-01-01,1.595521,1.718262


In [107]:
# Date range
# Select column B for rows whose index falls between these dates (both ends included); no rows match here
df.loc['20140102':'20140104',['B']]

Unnamed: 0,B


In [88]:
# Fast access to scalar
df.at[dates[1],'B']

1.744621655062218

In [108]:
# iloc provides integer locations similar to np style
df.iloc[3:]

Unnamed: 0,A,B,C,D
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303
2014-01-01,1.595521,1.718262,0.782275,-1.015218
2015-01-02,1.17348,1.744622,-0.195889,0.150449
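
To compare label-based and position-based selection side by side, here is a sketch on a small DataFrame with a sorted daily index. The data and the name `frame` are assumptions made for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]},
    index=pd.date_range("2014-01-01", periods=4),
)

# loc slices by label and includes BOTH endpoints.
by_label = frame.loc["2014-01-02":"2014-01-03", ["B"]]

# iloc slices by integer position and excludes the end, like NumPy.
by_position = frame.iloc[1:3, [1]]

# at is a fast accessor for a single scalar value.
scalar = frame.at[frame.index[0], "A"]
```

Both slices select the same two rows here, even though one is written with dates and the other with integers.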


### Boolean Indexing

In [109]:
df[df.A < 0] # Basically a 'where' operation.
# Give me all the rows where the value in column A is less than zero

Unnamed: 0,A,B,C,D
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303


### Setting

In [110]:
# This creates an independent copy of your DataFrame
df_posA = df.copy() # Without .copy(), df_posA would be another name for df, so changes would also affect df

df_posA.head()

Unnamed: 0,A,B,C,D
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303
2014-01-01,1.595521,1.718262,0.782275,-1.015218


In [111]:
df_posA[df_posA.A < 0] = -1*df_posA # flip the sign of every value in the rows where A is negative.

In [112]:
df_posA

Unnamed: 0,A,B,C,D
2019-01-05,1.080399,1.111915,0.915148,1.143999
2018-01-04,0.671779,0.805496,1.305093,-0.420598
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503
2017-01-03,0.073438,-0.339853,1.523696,0.987303
2014-01-01,1.595521,1.718262,0.782275,-1.015218
2015-01-02,1.17348,1.744622,-0.195889,0.150449


In [None]:
#Setting new column aligns data by index
s1 = pd.Series([1,2,3,4,5,6],index=pd.date_range('20140102',periods=6))

In [None]:
s1

In [None]:
df['F'] = s1

In [None]:
df
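
Here is a small sketch of what "aligns data by index" means, using made-up data. Values are matched by index label rather than by position, and labels with no match become `NaN`.

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0]},
                     index=pd.date_range("2014-01-01", periods=3))

# This Series starts one day later than the DataFrame.
s = pd.Series([10, 20, 30], index=pd.date_range("2014-01-02", periods=3))

# Assignment matches on dates: 2014-01-01 has no partner and becomes NaN,
# and the Series value for 2014-01-04 is dropped.
frame["F"] = s
```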

### Missing Data

In [None]:
# Add a column with missing data
df1 = df.reindex(index=dates[0:4],columns=list(df.columns) + ['E'])

In [None]:
df1.loc[dates[0]:dates[1],'E'] = 1

In [None]:
df1

In [None]:
# find where values are null
pd.isnull(df1)
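
Once missing values are located, the usual next step is to drop or fill them. A minimal sketch on a made-up DataFrame follows; `dropna` and `fillna` are standard pandas methods, but they are an addition to what this notebook demonstrates.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 2.0, 3.0], "E": [1.0, np.nan, np.nan]})

mask = frame.isnull()            # True wherever a value is missing
complete = frame.dropna()        # keep only the rows with no missing values
filled = frame.fillna(value=0)   # replace every missing value with 0
```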

### Operations

In [None]:
df.describe()

In [None]:
df.mean(), df.mean(axis=1) # column means (axis=0, the default) and row means (axis=1)
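
The two axes are easier to see on a tiny example with round numbers (the data is made up for illustration).

```python
import pandas as pd

frame = pd.DataFrame({"A": [1.0, 3.0], "B": [5.0, 7.0]})

col_means = frame.mean()         # axis=0: one mean per column
row_means = frame.mean(axis=1)   # axis=1: one mean per row
```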

### Applying functions

In [None]:
#we use apply when we are trying to derive a column
df

In [113]:
df.head()

Unnamed: 0,A,B,C,D
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303
2014-01-01,1.595521,1.718262,0.782275,-1.015218


In [117]:
# Create a new column called E whose values are the values of D multiplied by 2.
df['E'] = df.D.apply(lambda x: x*2)

# A lambda is an anonymous function, convenient when it will not be reused.
# With a named function multiplyBy2 defined beforehand, you could instead write => df['E'] = df.D.apply(multiplyBy2)

In [118]:
df.head()

Unnamed: 0,A,B,C,D,E
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999,-2.287998
2018-01-04,-0.671779,-0.805496,-1.305093,0.420598,0.841196
2016-01-03,0.951366,-0.79637,-0.547382,-1.191503,-2.383006
2017-01-03,-0.073438,0.339853,-1.523696,-0.987303,-1.974607
2014-01-01,1.595521,1.718262,0.782275,-1.015218,-2.030437


In [120]:
df.apply(np.cumsum, axis = 1).head() # cumulative sum across each row, moving left to right over the columns.

Unnamed: 0,A,B,C,D,E
2019-01-05,-1.080399,-2.192314,-3.107462,-4.251461,-6.53946
2018-01-04,-0.671779,-1.477275,-2.782368,-2.36177,-1.520574
2016-01-03,0.951366,0.154995,-0.392387,-1.58389,-3.966896
2017-01-03,-0.073438,0.266415,-1.257282,-2.244585,-4.219192
2014-01-01,1.595521,3.313783,4.096057,3.080839,1.050402


In [121]:
df.apply(np.cumsum, axis = 0).head() # cumulative sum down each column, moving top to bottom over the rows.

Unnamed: 0,A,B,C,D,E
2019-01-05,-1.080399,-1.111915,-0.915148,-1.143999,-2.287998
2018-01-04,-1.752178,-1.917411,-2.220241,-0.723401,-1.446802
2016-01-03,-0.800813,-2.713781,-2.767623,-1.914904,-3.829808
2017-01-03,-0.874251,-2.373928,-4.291319,-2.902208,-5.804415
2014-01-01,0.72127,-0.655666,-3.509045,-3.917426,-7.834852


In [122]:
df.apply(lambda x: x.max() - x.min()).head() # the range of each column: that column's max minus its own min.
# Returns a Series. The function is applied column by column (axis=0, the default)

A    2.675920
B    2.856537
C    2.305971
D    1.612101
E    3.224202
dtype: float64

In [123]:
import math
df.apply(lambda x: math.exp(5)*x, axis = 1).head()

Unnamed: 0,A,B,C,D,E
2019-01-05,-160.345475,-165.022797,-135.819991,-169.784534,-339.569067
2018-01-04,-99.700862,-119.546204,-193.692935,62.42229,124.84458
2016-01-03,141.195188,-118.19186,-81.23873,-176.834736,-353.669473
2017-01-03,-10.899236,50.438656,-226.136565,-146.528822,-293.057644
2014-01-01,236.796289,255.012699,116.099838,-150.671776,-301.343553
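
The `apply` patterns above can be checked on a small DataFrame with round numbers. In this sketch `multiply_by_2` is a hypothetical named helper standing in for the lambda, and the data is made up.

```python
import pandas as pd

frame = pd.DataFrame({"C": [1.0, 2.0, 3.0], "D": [4.0, 6.0, 5.0]})

def multiply_by_2(x):
    """Hypothetical named helper, equivalent to lambda x: x*2."""
    return x * 2

via_apply = frame["D"].apply(multiply_by_2)   # called once per element
vectorized = frame["D"] * 2                   # same result without apply

# Applied to a DataFrame, the function receives one column at a time.
col_range = frame.apply(lambda col: col.max() - col.min())

# Built-in equivalent of apply(np.cumsum, axis=1): running total across each row.
row_cumsum = frame.cumsum(axis=1)
```

For simple arithmetic the vectorized form is usually preferable; `apply` earns its keep when the logic cannot be expressed as array operations.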


In [None]:
# Built in string methods
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower()

### Merge

In [None]:
np.random.randn(10,4)

In [None]:
#Concatenating pandas objects together
df = pd.DataFrame(np.random.randn(10,4))
df

In [None]:
# Break it into pieces
pieces = [df[:3], df[3:7],df[7:]]
pieces

In [None]:
pd.concat(pieces)

In [None]:
# DataFrames can also be combined with database-style joins (merge/join)
df
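
Here is a sketch of both ways of combining DataFrames mentioned in this section: rebuilding a frame with `pd.concat`, and a database-style join with `pd.merge`. The `left`/`right` frames and the `key` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Concatenation: splitting a frame into pieces and stacking them back together.
whole = pd.DataFrame(np.arange(20).reshape(10, 2))
pieces = [whole[:3], whole[3:7], whole[7:]]
rebuilt = pd.concat(pieces)

# Merge: match rows from two frames on a shared key column.
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
merged = pd.merge(left, right, on="key")
```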

### Grouping


In [None]:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                       'foo', 'bar', 'foo', 'foo'],
                       'B' : ['one', 'one', 'two', 'three',
                             'two', 'two', 'one', 'three'],
                       'C' : np.random.randn(8),
                       'D' : np.random.randn(8)})

In [None]:
df

In [None]:
df.groupby(['A','B']).sum()
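
With random values it is hard to verify what `groupby` did. This sketch uses a small made-up DataFrame with integer values so the group sums can be checked by hand.

```python
import pandas as pd

frame = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "one"],
    "C": [1, 2, 3, 4],
})

# One row per distinct (A, B) pair; C is summed within each group.
totals = frame.groupby(["A", "B"]).sum()

# Grouping by a single column and summing one column gives a Series.
by_a = frame.groupby("A")["C"].sum()
```

The two `bar`/`one` rows collapse into a single row with C = 2 + 4.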

### Reshaping

In [None]:
# You can also stack or unstack levels

In [None]:
a = df.groupby(['A','B']).sum()
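
Stacking and unstacking are easiest to follow on a grouped result with known values. A minimal sketch, with data invented for illustration:

```python
import pandas as pd

frame = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar"],
    "B": ["one", "two", "one", "two"],
    "C": [1, 2, 3, 4],
})

grouped = frame.groupby(["A", "B"]).sum()   # rows indexed by (A, B)
wide = grouped.unstack()                    # inner level B moves into the columns
tall = wide.stack()                         # and back into the row index
```

After `unstack`, each value of `B` becomes its own column under `C`, which is the same layout a pivot table produces.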

In [None]:
# Pivot Tables
pd.pivot_table(df,values=['C','D'],index=['A'],columns=['B'])

### Time Series


In [None]:
import pandas as pd
import numpy as np

In [None]:
# 100 Seconds starting on January 1st
rng = pd.date_range('1/1/2014', periods=100, freq='S')

In [None]:
# Give each second a random value
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [None]:
ts

In [None]:
# Built in resampling
ts.resample('1Min').mean() # Downsample the per-second data to 1-minute means
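
To see exactly what resampling computes, here is a sketch with predictable values instead of random ones. It assumes the lowercase frequency aliases `"s"` and `"1min"`, which recent pandas versions prefer; the series name `counts` is illustrative.

```python
import numpy as np
import pandas as pd

# 120 seconds of data holding the values 0..119.
rng = pd.date_range("2014-01-01", periods=120, freq="s")
counts = pd.Series(np.arange(120), index=rng)

# Each output row is the mean of the 60 one-second values in that minute.
per_minute = counts.resample("1min").mean()
```

The first minute averages 0 through 59 and the second averages 60 through 119.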

In [None]:
# Many additional time series features are available:
# type "ts." and press Tab to browse them

### Plotting


In [None]:
ts.plot()

In [None]:
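def randwalk(startdate,points):
    # Random walk: cumulative sum of standard-normal steps, one per day from startdate
    ts = pd.Series(np.random.randn(points), index=pd.date_range(startdate, periods=points))
    ts = ts.cumsum()
    ts.plot()   # draw the walk on the current axes
    return ts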

In [None]:
# Using pandas to make a simple random walker by repeatedly running:
a=randwalk('1/1/2012',1000)

In [None]:
# The pandas plot function adds axis labels and a legend by default

In [None]:
df = pd.DataFrame(np.random.randn(100, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()  # the cumulative sum turns the noise into four random walks
df.plot()         # one line per column, drawn on a new figure
plt.legend(loc='best')

### I/O
I/O is straightforward with, for example, pd.read_csv or df.to_csv
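
A minimal round-trip sketch is shown below. It writes to an in-memory buffer rather than a real file so that it runs anywhere; in practice you would pass a file path to `to_csv` and `read_csv`. The data is made up.

```python
import io
import pandas as pd

frame = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})

# Write the DataFrame as CSV text; index=False leaves out the row labels.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Read the same text back into a new DataFrame.
buffer.seek(0)
restored = pd.read_csv(buffer)
```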

#### The benefits of open source:

Let's look under the hood of the plt modules.

# Next Steps

**Recommended Resources**

Name | Description
--- | ---
[Official Pandas Tutorials](http://pandas.pydata.org/pandas-docs/stable/10min.html) | Wes & Company's selection of tutorials and lectures
[Julia Evans Pandas Cookbook](https://github.com/jvns/pandas-cookbook) | Great resource with examples from weather, bikes and 311 calls
[Learn Pandas Tutorials](https://bitbucket.org/hrojas/learn-pandas) | A great series of Pandas tutorials from Dave Rojas
[Research Computing Python Data PYNBs](https://github.com/ResearchComputing/Meetup-Fall-2013/tree/master/python) | A super awesome set of python notebooks from a meetup-based course exclusively devoted to pandas