# numpy/pandas

## numpy
* fundamental package for scientific computing (i.e., numerics and mathematics) with Python
* vector-oriented computing
* efficiently implemented multi-dimensional arrays
* how are numpy arrays different from Python containers?
 * Python variables are references: each value is an independent object with its own space in memory, and a Python variable points (or refers) to it
   * inefficient for large numbers of values of the same type
 * numpy arrays reserve a space in memory and all of the values are contiguous

![alt-text](array_vs_list.png 'array vs. list')

![alt-text](numpy-array.jpg 'numpy-array')
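
The memory difference described above can be checked directly. This is a rough sketch rather than a benchmark: the names `py_list`, `arr`, and `list_bytes` are made up for illustration, and the list figure depends on the Python implementation.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list holds n references; every int is a separate object elsewhere in memory.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# An ndarray keeps the raw values back to back in a single buffer.
print(arr.itemsize)               # bytes per element
print(arr.nbytes)                 # itemsize * number of elements
print(arr.flags['C_CONTIGUOUS'])  # True for a freshly created array
print(list_bytes)                 # implementation-dependent, noticeably larger
```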

## numpy datatypes
* __`numpy`__ is very precise about identifying datatypes
* several types of integers: __`numpy.int8`__, __`numpy.int16`__, __`numpy.int32`__, __`numpy.int64`__ (also unsigned)
* floats: __`numpy.float32`__, __`numpy.float64`__, and on some platforms __`numpy.float128`__ (also complex types)
* boolean
* string, Unicode string (like Python strings, but with a fixed maximum length set when the array is created)
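
A small sketch of what these fixed-size types mean in practice. The variable names are illustrative only; the point is that each value occupies a fixed number of bytes, so integers can wrap around and strings can be truncated.

```python
import numpy as np

# bytes used per value
print(np.dtype(np.int8).itemsize, np.dtype(np.float64).itemsize)

# int8 holds -128..127, so array arithmetic past the limit wraps around
small = np.array([127], dtype=np.int8)
wrapped = small + np.int8(1)
print(wrapped)

# the string width is fixed when the array is created (here: 6 characters)
names = np.array(['numpy', 'pandas'])
print(names.dtype)
names[0] = 'matplotlib'   # longer than 6 characters: silently truncated
print(names[0])
```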

## creating numpy arrays

In [None]:
import numpy as np
a = np.array([1, 2, 3, 4, 5])
a # repr() is being called

In [None]:
type(a), a.dtype

In [None]:
# types matter for ndarrays!
a[0] = 34.7  # OK: the float is truncated to the int 34
try:
    a[0] = 'x'  # cannot be converted to int
except ValueError as err:
    print(err)
a

In [None]:
# If need be, you can specify type
a = np.array([1, 2, 3, 4, 5], dtype=np.float64)
a

In [None]:
a.ndim, a.shape, a.size

In [None]:
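# in the notebook, type "a." and press Tab to browse attributes; dir() is the plain-Python equivalent
[name for name in dir(a) if not name.startswith('_')]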

In [None]:
# unlike Python lists, NumPy arrays can be
# multi-dimensional
b = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
             dtype=np.float64)
b

In [None]:
# ...or initialize using a list comprehension
np.array([range(i, i + 3) for i in [3, 5, 7]])

In [None]:
b, b.ndim, b.shape, b.size

## Creating arrays from scratch
* especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy

In [None]:
np.zeros((4, 6), dtype=int)

In [None]:
np.empty((4, 4), dtype='float64')

In [None]:
np.full((3, 9), 3.14159)

In [None]:
# linear sequence, similar to range()
np.arange(0, 10, 2)

In [None]:
# five values evenly spaced between 0 and 10 (both endpoints included)
np.linspace(0, 10, 5)

In [None]:
# 3x3 array of uniformly distributed random values between 0 and 1
np.random.random((3, 3))

In [None]:
np.random.standard_normal((2, 4))

In [None]:
# 3x3 array of normally distributed random values with mean 0 and stdev 2
np.random.normal(0, 2, (3, 3))

In [None]:
# 4x4 array of random integers in interval [0, 100)
np.random.randint(0, 100, (4, 4))

In [None]:
# identity matrix
np.eye(8, dtype='float32')

## indexing/slicing

In [None]:
a = np.linspace(0, 10, 5)
a

In [None]:
a[3]

In [None]:
aa = np.random.random((5, 4))
aa

In [None]:
aa[1, 1]

In [None]:
aa[2:4] # extract rows 2 and 3

In [None]:
aa[2:5, 1] # extract rows 2-4, column 1

In [None]:
aa[::-1]

In [None]:
aa[::-1, ::-1]
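
One detail worth knowing when slicing: a basic slice of a NumPy array is a view onto the same memory, not a copy. The sketch below uses a small made-up array (`grid`) so the effect is easy to see.

```python
import numpy as np

grid = np.arange(20).reshape(5, 4)   # 5 rows, 4 columns, values 0..19

rows = grid[2:4]        # basic slice: a view onto rows 2 and 3
rows[0, 0] = -1         # writes through to grid[2, 0]
print(grid[2, 0])

independent = grid[2:4].copy()   # explicit copy: changes stay local
independent[0, 0] = 99
print(grid[2, 0])

# reversing both axes puts the last element first
print(grid[::-1, ::-1][0, 0])
```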

## Manipulating numpy arrays

In [None]:
a = np.random.standard_normal((2, 4))
b = np.random.standard_normal((2, 4))
a, b

In [None]:
np.vstack([a, b])

In [None]:
np.hstack([a, b])

In [None]:
a.transpose()
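
The effect of these operations is easiest to follow by looking at shapes. A minimal sketch, using zeros and ones instead of random values so the result is predictable:

```python
import numpy as np

a = np.zeros((2, 4))
b = np.ones((2, 4))

stacked_rows = np.vstack([a, b])   # stack along axis 0: more rows
stacked_cols = np.hstack([a, b])   # stack along axis 1: more columns
flipped = a.transpose()            # same as a.T

print(stacked_rows.shape)   # (4, 4)
print(stacked_cols.shape)   # (2, 8)
print(flipped.shape)        # (4, 2)

# vstack and hstack are shorthand for concatenate along a given axis
same = np.array_equal(stacked_rows, np.concatenate([a, b], axis=0))
print(same)
```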

## Saving/Loading a numpy array

In [None]:
np.save('/tmp/a.npy', a)
a1 = np.load('/tmp/a.npy')
a1

## Performing math on numpy arrays

In [None]:
x = np.linspace(0, 10, 1000)
x

In [None]:
%time sinx = np.sin(x)
# "universal" function which operates on entire array!
sinx


In [None]:
%%timeit
for i in range(0, 1000):
    sinx[i] = np.sin(x[i])
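
The `%time` and `%%timeit` magics only work inside IPython/Jupyter. As a rough equivalent for a plain script, the comparison can be sketched with the standard `timeit` module; the helper `sin_loop` is a made-up name, and the timings will vary by machine.

```python
import math
import timeit
import numpy as np

x = np.linspace(0, 10, 1000)

def sin_loop(values):
    """Compute sin element by element in a Python loop."""
    out = np.empty_like(values)
    for i in range(len(values)):
        out[i] = math.sin(values[i])
    return out

vectorized = np.sin(x)   # one call operates on the whole array
looped = sin_loop(x)     # same values, computed one at a time

t_vec = timeit.timeit(lambda: np.sin(x), number=200)
t_loop = timeit.timeit(lambda: sin_loop(x), number=200)
print(f"vectorized: {t_vec:.4f}s  loop: {t_loop:.4f}s")
```

Both approaches give the same values; the vectorized call is typically much faster because the loop runs in compiled code rather than in the Python interpreter.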

In [None]:
cosx = np.cos(x)
y = sinx * cosx
y

In [None]:
xplus1 = x + 1
a = np.array([[1, 2], [3, 4]])
b = np.array([[-1, -2], [-3, -4]])
np.matmul(a, b)
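
Note that `*` on arrays is element-wise, which is different from the matrix product computed by `np.matmul`. A short sketch with the same two matrices:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[-1, -2], [-3, -4]])

elementwise = a * b        # multiplies matching entries
matrix_product = a @ b     # @ is the operator form of np.matmul

print(elementwise)      # [[ -1  -4] [ -9 -16]]
print(matrix_product)   # [[ -7 -10] [-15 -22]]
```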

## __`numpy`__ Datetime Object

In [None]:
np.datetime64('2016')

In [None]:
np.datetime64('2016-03')

In [None]:
np.datetime64('2016-03-31 08:30:00')

In [None]:
np.datetime64('2016-03-07') < np.datetime64('2016-03-09')

In [None]:
np.datetime64('2016-03-09') - np.datetime64('2016-03-07')

In [None]:
np.datetime64('2016-01-01') + np.timedelta64(59, 'D')

In [None]:
np.arange(np.datetime64('2016-02-01'),
          np.datetime64('2016-03-01'))
#np.timedelta64(67,'D') / np.timedelta64(1, 'W')
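
A few checks that tie the cells above together, including the commented-out division. The variable names are illustrative; 2016 is a leap year, which is why February has 29 days here.

```python
import numpy as np

# one entry per day; the end date is excluded, as with range()
feb = np.arange(np.datetime64('2016-02-01'), np.datetime64('2016-03-01'))
print(len(feb), feb.dtype)

# 59 days after January 1 lands on the leap day
leap_day = np.datetime64('2016-01-01') + np.timedelta64(59, 'D')
print(leap_day)

# dividing two timedeltas gives a plain number: 67 days expressed in weeks
weeks = np.timedelta64(67, 'D') / np.timedelta64(1, 'W')
print(weeks)
```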

# Pandas
* has gained broad acceptance as THE data analysis tool for Python
* built on top of __`numpy`__ and significantly enhances it
* "__`numpy`__ with labels"
* handles data in tabular form and attaches more general labels to the rows and columns
* more robust in handling common data formats and missing data
* adds relational database operations, e.g., joins
* the two most common data types are Series (1D) and DataFrame (2D)
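
A minimal sketch of what "numpy with labels" and missing-data handling look like in practice. The two Series (`q1`, `q2`) and their values are made up for illustration.

```python
import pandas as pd

q1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
q2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

# arithmetic aligns on labels, not positions;
# labels present on only one side become NaN (missing)
total = q1 + q2
print(total)

# fill_value treats a missing entry as 0 instead
filled = q1.add(q2, fill_value=0)
print(filled)
```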

In [None]:
import pandas as pd

# Pandas Series

In [None]:
s = pd.Series([0, 1, 4, 9, 16, 25], name='squares')
s

In [None]:
s.values

In [None]:
s.index

In [None]:
s[2]

In [None]:
s[2:4]

In [None]:
ieee2015 = pd.Series([100.0, 99.9, 99.4, 96.5, 91.3, 84.8, 84.5, 83.0, 
76.2, 72.4], index=['Java', 'C', 'C++', 'Python', 'C#', 'R', 'PHP',
                    'JavaScript', 'Ruby', 'Matlab'])

In [None]:
ieee2015

In [None]:
ieee2015.index

In [None]:
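# use .iloc for positions; plain [] with an integer on a string index is deprecated in recent pandas
ieee2015.iloc[3], ieee2015['Ruby']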


# Pandas indices

In [None]:
s = pd.Series(np.nan, index=[49, 48, 47, 46,
                             45, 1, 2, 3, 4, 5])

In [None]:
s[:3]

In [None]:
# iloc = integer index location
s.iloc[:3]

In [None]:
# all items up to and including the index label 3
# (not the 3rd element in the series)
s.loc[:3]

In [None]:
# there is no index label 6, and integer slices with
# plain [] are positional here, so s[:6] matches s.iloc[:6]
s[:6]

In [None]:
s.iloc[:6]

In [None]:
# label 6 is missing and the index is not sorted, so this raises KeyError
s.loc[:6]
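
To contrast positional and label-based slicing on a well-behaved index, here is a small sketch with a sorted integer index. The Series and its values are made up for illustration.

```python
import pandas as pd

s_sorted = pd.Series([10, 20, 30, 40, 50], index=[1, 2, 3, 5, 8])

by_position = s_sorted.iloc[:3]   # first three items, whatever their labels
by_label = s_sorted.loc[:5]       # labels up to and including 5
missing_ok = s_sorted.loc[:6]     # 6 is absent, but a sorted index can still be sliced

print(list(by_position.index))   # [1, 2, 3]
print(list(by_label.index))      # [1, 2, 3, 5]
print(list(missing_ok.index))    # [1, 2, 3, 5]
```

Label slices include the end label, while positional slices exclude the end position.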

In [None]:
ieee2015[1:4]

In [None]:
ieee2015['C++':'R']

In [None]:
ieee2015[ieee2015 > 95]

# Pandas Series from dict

In [None]:
ieee2015 = pd.Series({'Java': 100.0, 'C': 99.9, 'C++': 99.4,
                      'Python': 96.5, 'C#': 91.3, 'R': 84.8,
                      'PHP': 84.5, 'JavaScript': 83.0, 'Ruby': 76.2,
                      'Matlab': 72.4})

In [None]:
ieee2015

# Pandas DataFrames
* extend numpy 2D arrays by giving labels to the columns and also to the rows (if you provide an explicit index)


In [None]:
ieee2014 = pd.Series([100.0, 99.3, 95.5, 94.5, 92.4, 84.8, 84.5,
    78.9, 74.3, 72.8], index=['Java', 'C', 'C++',
    'Python', 'C#', 'PHP', 'JavaScript', 'Ruby', 'R', 'Matlab'])
ieee2015 = pd.Series({'Java': 100.0, 'C': 99.9, 'C++': 99.4,
        'Python': 96.5, 'C#': 91.3, 'R': 84.8, 'PHP': 84.5,
        'JavaScript': 83.0, 'Ruby': 76.2, 'Matlab': 72.4})
pldata = pd.DataFrame({'2014': ieee2014, '2015': ieee2015})
#ieee2014, ieee2015

#pldata = pd.DataFrame(ieee2014, ieee2015)
print(pldata)

In [None]:
pldata

In [None]:
pldata.sort_values(by='2015', ascending=False)

In [None]:
pldata.values

In [None]:
pldata.columns

In [None]:
pldata['2014']

# Adding a column to a DataFrame

In [None]:
pldata['avg'] = (pldata['2014'] + pldata['2015']) / 2
pldata

# Creating a DataFrame from dicts

In [None]:
presidents = pd.DataFrame([
    { 'name': 'Barack Obama', 'elect': 2008, 'born': 1961 },
    { 'name': 'George W. Bush', 'elect': 2000, 'born': 1946 },
    { 'name': 'Bill Clinton', 'elect': 1992, 'born': 1946 },
    { 'name': 'George H.W. Bush', 'elect': 1988, 'born': 1924 },
])
presidents

# Setting the Index of a DataFrame

In [None]:
president_indexes = presidents.set_index('name')
president_indexes

In [None]:
presidents
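
As the two cells above suggest, `set_index` returns a new DataFrame and leaves the original untouched. A short sketch using two of the rows from above (the variable names `df`, `indexed`, and `restored` are illustrative):

```python
import pandas as pd

df = pd.DataFrame([
    {'name': 'Barack Obama', 'elect': 2008, 'born': 1961},
    {'name': 'Bill Clinton', 'elect': 1992, 'born': 1946},
])

indexed = df.set_index('name')    # new object; 'name' moves into the index
print(list(df.columns))           # original still has all three columns
print(list(indexed.columns))      # ['elect', 'born']

restored = indexed.reset_index()  # turns the index back into a column
print(restored.equals(df))
```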

# Manipulating a DataFrame

In [None]:
president_indexes

In [None]:
president_indexes['born'].idxmax()

In [None]:
president_indexes['born']['Bill Clinton']

In [None]:
president_indexes.loc['Bill Clinton']

In [None]:
president_indexes.loc['Bill Clinton']['born']

In [None]:
#presidents['born']
pd.DataFrame(presidents['born'])

In [None]:
presidents['born'][2]

In [None]:
presidents.iloc[2]

In [None]:
presidents.iloc[2]['born']
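
The chained lookups above (`...loc[row][col]`) work for reading, but a single indexer with both row and column is usually preferred. A minimal sketch using three of the rows from above; the names `table` and `by_name` are illustrative.

```python
import pandas as pd

table = pd.DataFrame([
    {'name': 'Barack Obama', 'elect': 2008, 'born': 1961},
    {'name': 'George W. Bush', 'elect': 2000, 'born': 1946},
    {'name': 'Bill Clinton', 'elect': 1992, 'born': 1946},
])
by_name = table.set_index('name')

# label-based: row label and column label in one step
born_loc = by_name.loc['Bill Clinton', 'born']

# .at is the scalar-only variant of .loc
born_at = by_name.at['Bill Clinton', 'born']

# position-based: row 2, and the position of the 'born' column
born_iloc = table.iloc[2, table.columns.get_loc('born')]

print(born_loc, born_at, born_iloc)
```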

# Merging Two DataFrames

In [None]:
presidents_dads = pd.DataFrame([
    { 'son': 'Barack Obama', 'father': 'Barack Obama, Sr.' },
    { 'son': 'George W. Bush', 'father': 'George H.W. Bush' },
    { 'son': 'George H.W. Bush', 'father': 'Prescott Bush' },
])

presidents_dads

In [None]:
presidents

In [None]:
pd.merge(presidents, presidents_dads, 
         left_on='name', right_on='son')

In [None]:
pd.merge(presidents, presidents_dads, left_on='name',
         right_on='son').drop('son' , axis=1)

In [None]:
pd.merge(presidents, presidents_dads, left_on='name',
         right_on='son', how='left').drop('son', axis=1)
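
The difference between the default inner merge and `how='left'` can be seen on a two-row sketch. The frames below reuse values from above; `indicator=True` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

people = pd.DataFrame({'name': ['Barack Obama', 'Bill Clinton'],
                       'elect': [2008, 1992]})
dads = pd.DataFrame({'son': ['Barack Obama'],
                     'father': ['Barack Obama, Sr.']})

# inner (default): only rows with a match on both sides survive
inner = pd.merge(people, dads, left_on='name', right_on='son')

# left: every row of the left frame is kept; unmatched fields become NaN
left = pd.merge(people, dads, left_on='name', right_on='son',
                how='left', indicator=True)

print(len(inner), len(left))
print(left[['name', 'father', '_merge']])
```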

# Lab: Pandas
* read the weather data from __`weather.csv`__ (__`http://bit.ly/1PL3X6t`__) into a DataFrame called __`weather`__ (there is a Pandas function __`read_csv`__ that will parse and read a CSV file)
* set the index of __`weather`__ to be __`DATE`__
* examine the column __`PrecipitationIn`__ (precipitation in inches by date)
* determine the total amount of precipitation for the entire dataset (there is a __`.sum()`__ function)
* determine the total amount of precipitation for the month of February 2013
* create a new __`DataFrame`__ which only contains the rows of weather for which there was some precipitation
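
The steps can be rehearsed without the real file. The sketch below is an assumption-laden stand-in: the rows are invented, and only the column names `DATE` and `PrecipitationIn` come from the lab description.

```python
import io
import pandas as pd

# Made-up rows standing in for weather.csv (not the real data)
csv_text = """DATE,PrecipitationIn
2013-01-30,0.00
2013-02-01,0.25
2013-02-15,0.00
2013-02-28,0.50
2013-03-02,1.00
"""

# parse DATE as datetimes and use it as the index in one step
weather_demo = pd.read_csv(io.StringIO(csv_text),
                           parse_dates=['DATE'], index_col='DATE')
precip = weather_demo['PrecipitationIn']

total = precip.sum()                      # whole dataset
february = precip.loc['2013-02'].sum()    # partial date string selects the month
wet_days = weather_demo[weather_demo['PrecipitationIn'] > 0]

print(total, february, len(wet_days))
```

Selecting with the partial string `'2013-02'` relies on the index being a `DatetimeIndex`, which is why the dates are parsed on load.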

In [None]:
weather = pd.read_csv('weather.csv')
weather.info()
weather = weather.set_index('DATE')

In [None]:
weather.index

In [None]:
weather['PrecipitationIn']

In [None]:
pin = weather['PrecipitationIn']
pin.sum()

In [None]:
pin.head()

In [None]:
pin.index = pd.to_datetime(weather.index)

In [None]:
type(pin.index)

In [None]:
pin['2013-02-01':'2013-02-28'].sum()

In [None]:
wplus = weather[weather['PrecipitationIn'] > 0]
wplus['PrecipitationIn'].min()

In [None]:
dir(object), dir(float), dir(int)

In [None]:
int.__dict__