# GA Data Science 19 (DAT19) - Class 5
## Developing Mastery of Pandas, Numpy & Bokeh

Justin Breucop (with parts from Craig Sakuma)

## Lab goals

- NumPy: Entering the Matrix
- Pandas: DataFrames as Bamboo
- Bokeh: Picture-Perfect Visuals

##NumPy
As we've seen in lecture, linear algebra is the branch of mathematics that describes vectors, matrices, and the transformations between vector spaces. This matters for data work because much of data cleaning consists of converting data from one format to another, and many algorithms require their input in a specific shape.

NumPy is a package for scientific computing, built around an N-dimensional array object.

###Creating an array

In [1]:
import numpy as np
a = np.arange(25).reshape(5,5)
# arange(n) is a function that creates a 1-dimensional array of the integers 0 through n-1
# reshape(M,N) is a method that returns the same values arranged as an MxN array
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])

We can convert lists to arrays. Note, however, that unlike a list, all elements of an array must share the same datatype.

In [2]:
alist = [[ 0,  1,  2,  3,  4],[ 5,  6,  7,  8,  9],[10, 11, 12, 13, 14],[15, 16, 17, 18, 19],[20, 21, 22, 23, 24]]
type(alist)

list

In [3]:
np.array(alist)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24]])
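
To see the single-datatype rule in action, here is a minimal sketch using small made-up lists (the variable names are illustrative only). When a list mixes integers and floats, NumPy picks one common datatype for the whole array.

```python
import numpy as np

# A list can mix types freely
mixed_list = [1, 2.5, 3]

# Converting it upcasts every element to one common dtype (float64 here)
mixed_array = np.array(mixed_list)
print(mixed_array.dtype)

# All-integer input stays an integer array
int_array = np.array([1, 2, 3])
print(int_array.dtype.kind)  # kind 'i' means signed integer
```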

In [4]:
biga = a*10
biga

array([[  0,  10,  20,  30,  40],
       [ 50,  60,  70,  80,  90],
       [100, 110, 120, 130, 140],
       [150, 160, 170, 180, 190],
       [200, 210, 220, 230, 240]])

In [5]:
print biga.mean()
print biga.mean(0) #Average per column
biga.mean(1) #average per row
# type(biga.mean(1))

120.0
[ 100.  110.  120.  130.  140.]


array([  20.,   70.,  120.,  170.,  220.])
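
The axis argument can be easier to follow on a tiny array. A short sketch with hypothetical values: axis 0 collapses the rows and leaves one value per column, while axis 1 collapses the columns and leaves one value per row.

```python
import numpy as np

small = np.array([[1, 2, 3],
                  [4, 5, 6]])

col_means = small.mean(axis=0)  # collapse the rows: one value per column
row_means = small.mean(axis=1)  # collapse the columns: one value per row

print(col_means)  # three values, one per column
print(row_means)  # two values, one per row
```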

In [6]:
bigm = np.matrix(biga-20)
bigm

matrix([[-20, -10,   0,  10,  20],
        [ 30,  40,  50,  60,  70],
        [ 80,  90, 100, 110, 120],
        [130, 140, 150, 160, 170],
        [180, 190, 200, 210, 220]])

In [7]:
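# Caution: the rows of biga-20 are linearly dependent, so the matrix is singular.
# The huge values below are floating-point noise, not a meaningful inverse.
np.linalg.inv(biga-20)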

array([[ -5.82741163e+12,  -3.82630046e+13,   6.47978853e+13,
          8.50288992e+12,  -2.92103589e+13],
       [  1.03354093e+13,   4.66192930e+13,  -6.97823380e+13,
         -4.16348403e+13,   5.44624760e+13],
       [ -6.86095256e+13,   1.33700614e+14,  -7.97512434e+13,
          3.28387473e+13,  -1.81785922e+13],
       [  1.29522470e+14,  -2.54207088e+14,   1.09657960e+14,
          2.52154667e+13,  -1.01888078e+13],
       [ -6.54209419e+13,   1.12150186e+14,  -2.49222636e+13,
         -2.49222636e+13,   3.11528295e+12]])
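
Values on the order of 1e13 are a warning sign that the matrix has no true inverse. A minimal sketch of how one might check this before inverting: the rank check uses the same `biga-20` matrix, while the 2x2 matrix `good` is a hypothetical invertible example added for comparison.

```python
import numpy as np

biga = np.arange(25).reshape(5, 5) * 10
singular = biga - 20

# Each row equals the first row plus a constant, so the rank is 2 rather than 5
rank = np.linalg.matrix_rank(singular)
print(rank)

# An invertible matrix (hypothetical values) for comparison
good = np.array([[2.0, 1.0],
                 [1.0, 3.0]])
good_inv = np.linalg.inv(good)

# A true inverse satisfies good * good_inv = identity
print(np.allclose(good.dot(good_inv), np.eye(2)))
```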

####Slices

In [8]:
bigm = np.array(bigm)
bigm[0]

array([-20, -10,   0,  10,  20])

In [9]:
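# Row 0 of biga, written with an explicit colon for "all columns"
# Only the last expression in a cell is displayed, so the output below is the full biga array
biga[0,:]
biga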

array([[  0,  10,  20,  30,  40],
       [ 50,  60,  70,  80,  90],
       [100, 110, 120, 130, 140],
       [150, 160, 170, 180, 190],
       [200, 210, 220, 230, 240]])

In [10]:
biga[:,3]

array([ 30,  80, 130, 180, 230])

The same slicing rules extend to arrays with more dimensions.

In [11]:
compa = np.arange(30).reshape(5,3,2)
compa

array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]],

       [[12, 13],
        [14, 15],
        [16, 17]],

       [[18, 19],
        [20, 21],
        [22, 23]],

       [[24, 25],
        [26, 27],
        [28, 29]]])

In [12]:
# let's describe it
print compa.shape
print compa.ndim
print compa.dtype

(5L, 3L, 2L)
3
int32


In [13]:
compa[3,:,1]

array([19, 21, 23])
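
A short sketch of how slicing affects the number of dimensions, using the same `compa` array: an integer index drops its axis, while a slice keeps it.

```python
import numpy as np

compa = np.arange(30).reshape(5, 3, 2)

# Block 3, every row, column 1
picked = compa[3, :, 1]
print(picked)

# A slice on the first axis keeps all three dimensions
first_two = compa[:2]
print(first_two.shape)

# An integer index drops that axis; a one-element slice keeps it
print(compa[3].shape)
print(compa[3:4].shape)
```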

In [14]:
compa[0,0,0]

0

In [15]:
compa[0,0,0] = 5.9
compa[0,0,0]

5

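NumPy casts any assigned value to the array's datatype, sometimes to our dismay: in the integer array above, 5.9 was silently truncated to 5. Converting the array to float first avoids this.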

In [16]:
compa = compa.astype(float)
compa[0,0,0] = 5.75
compa[0,0,0]

5.75
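
The same behavior in a smaller, self-contained sketch (the array names are illustrative). Note that `astype` returns a new array and leaves the original untouched.

```python
import numpy as np

ints = np.arange(3)            # integer dtype
ints[0] = 5.9                  # value is cast to int, so the .9 is dropped
print(ints[0])

floats = ints.astype(float)    # astype returns a new array; ints is unchanged
floats[0] = 5.75
print(floats[0])
print(ints[0])
```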

####Random Numbers
Random numbers are useful for testing data pipelines and for running statistical analyses. NumPy's functions for generating random values live in numpy.random.

In [17]:
# Create a 5x5 array of samples drawn uniformly from [0, 1)
rm = np.random.rand(5,5)
rm

array([[ 0.25109588,  0.92131942,  0.40842843,  0.88814629,  0.32451911],
       [ 0.11653304,  0.55786091,  0.58410784,  0.05161716,  0.91906662],
       [ 0.67285099,  0.05513986,  0.44958739,  0.66365954,  0.43529299],
       [ 0.01447089,  0.47676985,  0.02778103,  0.03299097,  0.40672957],
       [ 0.26250527,  0.4410875 ,  0.3376203 ,  0.52388349,  0.55998911]])

In [18]:
rm.shape

(5L, 5L)

In [19]:
print rm.mean()
print rm.mean(0) #Average per column
print rm.mean(1) #average per row

0.415322137436
[ 0.26349121  0.49043551  0.361505    0.43205949  0.52911948]
[ 0.55870182  0.44583711  0.45530615  0.19174846  0.42501714]


In [20]:
# rand draws from a uniform distribution; for a normal distribution use np.random.normal(mean, stdev, shape)
rm = np.random.normal(5,9,(30,30))
rm

array([[  2.88298417e-01,   7.29631656e+00,   1.29529969e+01,
         -9.09464383e+00,  -5.60343936e+00,   1.40103599e+01,
          1.30388358e+01,   4.45701630e+00,   5.15748288e+00,
         -4.26120497e+00,   4.01756692e+00,  -2.87836283e+01,
          1.89436681e+00,   4.32671171e+00,  -5.76308430e+00,
          2.37465293e+00,   4.84450603e+00,   5.81290092e+00,
          8.90949345e+00,  -3.80873026e+00,   1.03483334e+01,
          2.20187841e+01,   2.03814585e+01,  -3.23848345e-01,
          3.84310771e+00,   4.03170374e+00,   1.40857874e+01,
          5.15606036e+00,   6.28912418e+00,  -3.59780550e+00],
       [  1.32380272e+01,   2.53274582e+01,   1.57143618e+01,
          2.45313116e-01,   1.15765307e+01,   5.49046338e+00,
         -1.04831680e+01,   1.89751529e+00,  -2.94859531e+00,
          1.73185038e+01,   1.44339845e+01,   7.34641698e+00,
          8.68015415e+00,  -1.03252306e+01,   5.67596535e+00,
          1.12063572e+00,   2.90379481e+00,   1.19905995e+01,
       

In [21]:
print rm.mean(), "which is hopefully close to the input mean"
print rm.var(), "which should be close to the stdev squared (9**2 = 81)"
print np.median(rm)

4.76701127895 which is hopefully close to the input mean
77.7149436892 which should be close to the stdev squared (9**2 = 81)
4.77486820771
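
When random data feeds a test, it helps to make the draws repeatable. A minimal sketch using `np.random.seed`; the seed value 42 is arbitrary and not part of the lab.

```python
import numpy as np

np.random.seed(42)                        # 42 is an arbitrary seed
first = np.random.normal(5, 9, (30, 30))  # mean 5, standard deviation 9

np.random.seed(42)                        # same seed, same stream of numbers
second = np.random.normal(5, 9, (30, 30))

# Both arrays are identical because the generator was reset to the same state
print(np.array_equal(first, second))

# Variance is the standard deviation squared
print(np.isclose(first.std() ** 2, first.var()))
```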


Find more distributions and random functions here: http://docs.scipy.org/doc/numpy/reference/routines.random.html

###Exercise 1
1) Create a 4x5 array of integers numbering 0 to 19.

In [22]:
np.arange(20).reshape(4,5)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14],
       [15, 16, 17, 18, 19]])

2) Create a 50x500 array of normally distributed values with a mean of 20 and variance of 100. Save it to a variable called `biggie`

In [23]:
biggie = np.random.normal(20,10,(50,500))
print biggie.shape
print biggie.mean()
print biggie.var()

(50L, 500L)
19.991084069
101.06996582


3) Change the mean of the array to a value within 1 of 0 and the variance within 1 of 25. Think about what the mean and the variance represent and try using various mathematical operations.

In [24]:
morph = (biggie - 20)/2
print morph.mean()
print morph.var()

-0.00445796549195
25.267491455
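
Why this works: subtracting a constant shifts the mean without touching the variance, and dividing by a constant c divides the mean by c and the variance by c squared. A sketch that checks both relationships (the seed is an arbitrary choice so the run is repeatable):

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable
biggie = np.random.normal(20, 10, (50, 500))

morph = (biggie - 20) / 2

# Subtracting 20 shifts the mean; dividing by 2 halves the mean and quarters the variance
print(np.isclose(morph.mean(), (biggie.mean() - 20) / 2))
print(np.isclose(morph.var(), biggie.var() / 4))
```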


##Pandas: DataFrames as Bamboo
You've already been exposed to dataframes in the previous labs, so let's dig into how we can work with them.

In [25]:
import pandas as pd

data = pd.read_csv("../data/titanic.csv")
data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S |


In [26]:
data.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [None]:
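# Passengers older than 65
data[data.Age>65]

In [None]:
# Both conditions must hold (& is element-wise AND); each comparison needs its own parentheses
data[(data.Age==11)&(data.SibSp==5)]

In [None]:
# Either condition may hold (| is element-wise OR)
data[(data.Age==11)|(data.SibSp==5)]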
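
Each filter above works by building a boolean Series and passing it to the dataframe. A minimal sketch on a hypothetical four-row table (`people` is not the Titanic file) that names the masks explicitly:

```python
import pandas as pd

# Hypothetical mini passenger table, not the Titanic file
people = pd.DataFrame({'Age': [11, 70, 11, 30],
                       'SibSp': [5, 0, 1, 5]})

is_eleven = people.Age == 11          # boolean Series, one entry per row
has_five_sibs = people.SibSp == 5

both = people[is_eleven & has_five_sibs]     # rows where both are True
either = people[is_eleven | has_five_sibs]   # rows where at least one is True

print(len(both))
print(len(either))
```

The parentheses in the one-line form are required because `&` and `|` bind more tightly than `==` in Python.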

###Cleaning Data

In [27]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB


####Working with nulls
Exclude rows with missing data

In [None]:
# data[data.Age.isnull()]
data[data.Age.notnull()]

In [None]:
# You can also just replace the nulls; fillna returns a new Series and leaves data unchanged
data.Age.fillna(0)

In [None]:
# Replace with the mean so the column average is preserved (mean() skips nulls by default)
avg_age = data.Age.mean()
print avg_age
data.Age.fillna(avg_age)

####Replace with random normal distribution

In [None]:
# Get values of mean and standard deviation
data.Age[data.Age.notnull()].describe()

In [None]:
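# Replace null values with draws from a normal distribution, one draw per missing row.
# np.random.normal(29.7,14.5) with no size returns a single number, which would fill every null with the same value.
missing = data.Age.isnull()
data.loc[missing,'Age'] = np.random.normal(29.7,14.5,missing.sum())

In [None]:
# Check that the summary statistics are still close to the originals
data.Age.describe()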
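
To compare the fill strategies side by side, here is a sketch on a hypothetical five-value Series (`ages`; the seed is arbitrary). It fills once with the mean and once with a separate random draw for each missing value.

```python
import numpy as np
import pandas as pd

np.random.seed(1)  # arbitrary seed
ages = pd.Series([22.0, np.nan, 35.0, np.nan, 54.0])  # hypothetical ages

# Option 1: fill with the mean of the known values (mean skips NaN by default)
mean_filled = ages.fillna(ages.mean())

# Option 2: one random draw per missing value, using the known mean and stdev
missing = ages.isnull()
random_filled = ages.copy()
random_filled[missing] = np.random.normal(ages.mean(), ages.std(), missing.sum())

print(mean_filled.isnull().sum())
print(random_filled.isnull().sum())
```

Random draws can land outside a sensible range (a negative age, for instance), so real data may need clipping afterwards.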

###Convert categorical data to numerical

In [None]:
data.Sex=='female'

In [None]:
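# Rename the column, then store True for female and False for male (True/False act as 1/0 in arithmetic)
data.rename(columns={'Sex':'Is Female'},inplace=True)
data['Is Female']=data['Is Female']=='female'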
data.head()

In [None]:
# get unique values of Embarked
data.Embarked.unique()

In [None]:
# replace values with numbers and assign the result back to the column (the 2 missing values stay null)
data['Embarked'] = data.Embarked.replace(['S', 'C', 'Q'],[1,2,3])
data.head()
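
Another way to encode categories is `Series.map` with a dictionary, which keeps the label-to-number pairing in one place. A sketch on a hypothetical sample (`ports`), not the Titanic column itself:

```python
import numpy as np
import pandas as pd

ports = pd.Series(['S', 'C', 'Q', np.nan, 'S'])  # hypothetical sample

codes = {'S': 1, 'C': 2, 'Q': 3}
encoded = ports.map(codes)  # values without a dictionary entry (including NaN) become NaN

print(encoded.tolist())
```

The numbers are arbitrary labels. They imply an ordering (1 < 2 < 3) that the ports do not really have, which is worth keeping in mind when feeding them to a model.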

###Selecting with .loc, .iloc, & .ix

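Selecting data in pandas can be tricky. The main takeaway is that .loc selects by index label, .iloc selects by integer position, and .ix accepts a mix of the two. Note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so prefer .loc and .iloc in new code.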

In [None]:
df = pd.DataFrame(np.random.randn(6,4),index=list('abcdef'),columns=list('ABCD'))
df

In [None]:
df.loc['f']

In [None]:
df.iloc[len(df.index)-1]

In [None]:
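# .ix accepts either the label 'f' or the position -1 here because the index holds strings
df.A.ix['f'] == df.A.ix[-1]

In [None]:
# Negative positions count from the end, just like in a plain Python list
cc = list('cookies')
cc[-4]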
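
A short sketch contrasting label-based and position-based selection. The frame has the same shape and labels as `df` above but uses `np.arange` instead of random values so the results are predictable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  index=list('abcdef'), columns=list('ABCD'))

by_label = df.loc['f']        # row whose index label is 'f'
by_position = df.iloc[-1]     # last row, whatever its label
print(by_label.equals(by_position))

# Label slices include the end label; position slices exclude the end position
label_slice = df.loc['a':'c']
position_slice = df.iloc[0:2]
print(len(label_slice))
print(len(position_slice))

# Row and column together: label 'b'/'C' is the same cell as position 1/2
print(df.loc['b', 'C'] == df.iloc[1, 2])
```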

###Group by

In [None]:
# Find average age of passengers that survived vs. died
data.groupby('Survived')['Age'].mean()

In [None]:
# Count number of female passengers
data.groupby('Is Female')['PassengerId'].count()

In [None]:
# Count passengers for each combination of survival and class
data.groupby(['Survived','Pclass'])['PassengerId'].count()
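
A group by splits the rows into groups, applies a function to each group, and combines the results. A minimal sketch on a hypothetical five-row table (`toy`) where the answers can be checked by eye:

```python
import pandas as pd

# Hypothetical mini table
toy = pd.DataFrame({'Survived': [0, 0, 1, 1, 1],
                    'Pclass':   [3, 3, 1, 1, 2],
                    'Age':      [20.0, 40.0, 30.0, 50.0, 40.0]})

# One mean per value of Survived
mean_age = toy.groupby('Survived')['Age'].mean()
print(mean_age)

# Grouping by two columns gives one count per (Survived, Pclass) pair
counts = toy.groupby(['Survived', 'Pclass'])['Age'].count()
print(counts)
```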

###Apply

In [None]:


# Convert ticket prices to USD (assumes a rate of 1.6 USD per unit of the original currency)
data.Fare.apply(lambda x: x*1.6)

In [None]:
data.Name

In [None]:
# Extract the last name (the text before the comma)
data.Name.apply(lambda x: x.split(",")[0])
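
`apply` calls a function once per element, which is handy for string work. For plain arithmetic, operating on the whole Series at once gives the same numbers. A sketch using two names and fares taken from the table preview above:

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])
fares = pd.Series([7.25, 7.925])

# apply calls the function once per element
last_names = names.apply(lambda full: full.split(',')[0])

# For simple arithmetic, operating on the whole Series gives the same result
applied = fares.apply(lambda x: x * 1.6)
vectorized = fares * 1.6

print(last_names.tolist())
print((applied - vectorized).abs().max())
```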

###Concatenate

In [None]:
# Despite the name, this holds only the first 10 rows
data_first_half = data.iloc[0:10,:]
data_first_half.info()

In [None]:
# All remaining rows, from position 10 onward
data_second_half = data.iloc[10:,:]

# Stack the two pieces back together, row-wise
remake_data = pd.concat([data_first_half,data_second_half])
remake_data.info()
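
A quick way to confirm that splitting and concatenating loses nothing is to compare the rebuilt frame with the original. A sketch on a hypothetical five-row table (`toy`):

```python
import pandas as pd

toy = pd.DataFrame({'PassengerId': [1, 2, 3, 4, 5]})  # hypothetical table

top = toy.iloc[0:2, :]     # rows at positions 0 and 1
rest = toy.iloc[2:, :]     # everything from position 2 onward

rebuilt = pd.concat([top, rest])   # stack the pieces back together, row-wise

print(rebuilt.equals(toy))
```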

###EXERCISE 2
1) Replace Pclass numbers with 'First Class', 'Second Class', 'Third Class'

2) What was the average ticket price for survivors vs. dead passengers?

###Bonus!!!
Round all ages to the nearest year using `apply`

##Bokeh: Picture-Perfect Visuals

To install Bokeh, go to a terminal and type:

`conda install bokeh` 

Bokeh is built by Continuum Analytics, the same people who created Anaconda. It is designed for web display out of the box, which makes it well suited to building presentation-ready, interactive visuals quickly. Labs in this course will be shown in Bokeh. Check out http://bokeh.pydata.org/en/latest/docs/quickstart.html#concepts to see some of the range of capabilities.

In [None]:
# vplot stacks plots vertically; later Bokeh releases replaced it with bokeh.layouts.column
from bokeh.plotting import figure, output_notebook,show,vplot
output_notebook()

In [None]:
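# Note: pandas.io.data was later moved out of pandas into the separate pandas-datareader package
import pandas.io.data
import datetime
# Note: the variable is named aapl, but the ticker requested here is 'FB' (Facebook)
aapl = pd.io.data.get_data_yahoo('FB', 
                                 start=datetime.datetime(2015, 4, 1), 
                                 end=datetime.datetime(2015, 4, 28))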


In [None]:
# prepare some data
x = aapl.Low
y = aapl.High

# create a new plot with a title and axis labels
p = figure(title="Stock High vs. Low", x_axis_label='Low', y_axis_label='High')

# Add glyphs: one circle per trading day, plus a reference line that scales Low by the ratio of mean High to mean Low
p.circle(x, y,size=30,alpha=.5)
p.line(x,x*y.mean()/x.mean())

# show the results
show(p)

At its core, Bokeh is built from plots and glyphs. A plot is created with the figure function, and glyphs are the visual marks (circles, lines, and so on) that are added to it. The resulting visuals are scalable, interactive, and savable. Glyph properties such as color and size can also be vectorized, with one value per data point.

In [None]:
# prepare some data
N = 4000
x = np.random.random(size=N) * 100
y = np.random.random(size=N) * 100
radii = np.random.random(size=N) * 1.5
# build one hex color string per point; int() is needed because np.floor returns floats
colors = ["#%02x%02x%02x" % (int(r), int(g), 150) for r, g in zip(np.floor(50+2*x), np.floor(30+2*y))]

TOOLS="resize,crosshair,pan,wheel_zoom,box_zoom,reset,box_select,lasso_select"

# create a new plot with the tools above, and explicit ranges
p = figure(tools=TOOLS, x_range=(0,100), y_range=(0,100))

# add a circle renderer with vectorized colors and sizes
p.circle(x,y, radius=radii, fill_color=colors, fill_alpha=0.6, line_color=None)

# show the results
show(p)
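
The color strings are the least obvious part of the cell above. A sketch of just that step, with three hypothetical coordinates in place of the 4000 random ones: each coordinate is mapped to a channel value between 0 and 255 and formatted as two hexadecimal digits, with the blue channel fixed at 150.

```python
import numpy as np

x = np.array([0.0, 50.0, 100.0])   # hypothetical coordinates in the 0-100 range
y = np.array([0.0, 50.0, 100.0])

# Red grows with x, green grows with y, blue stays at 150
colors = ["#%02x%02x%02x" % (int(r), int(g), 150)
          for r, g in zip(np.floor(50 + 2 * x), np.floor(30 + 2 * y))]

print(colors)
```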

In [None]:
p1 = figure(title="Titanic Ages Dead",x_axis_label = 'Age',y_axis_label = 'Count')
# construct the histogram; without density=True, hist already holds the number of passengers per bin
hist, edges = np.histogram(data.Age[data.Survived==0].values, bins=50)
# add the bars, one quad per bin
p1.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],line_color='black')

p2 = figure(title="Titanic Ages Survived",x_axis_label = 'Age',y_axis_label = 'Count')

hist, edges = np.histogram(data.Age[data.Survived==1].values, bins=50)
p2.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],line_color='black')


show(vplot(p1,p2))
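
`np.histogram` can return either raw counts or a density, and the two are easy to mix up when labeling an axis "Count". A sketch with hypothetical ages showing how they relate: counts sum to the number of samples, while density times bin width sums to 1.

```python
import numpy as np

ages = np.array([2.0, 4.0, 25.0, 30.0, 31.0, 60.0])  # hypothetical ages

counts, edges = np.histogram(ages, bins=3)
density, _ = np.histogram(ages, bins=3, density=True)
widths = np.diff(edges)   # width of each bin

# Counts add up to the number of samples
print(counts.sum())

# Density integrates to 1 over the bins
print(np.isclose((density * widths).sum(), 1.0))

# Converting density back to counts needs both the bin width and the sample size
print(np.allclose(density * widths * len(ages), counts))
```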