# numpy/pandas

## numpy
* fundamental package for scientific computing (i.e., numerics and mathematics) with Python
* vector oriented computing
* efficiently implemented multi-dimensional arrays
* how are numpy arrays different from Python containers?
 * Python variables are references: each value is an independent object with its own space in memory, and a variable points (or refers) to it
   * inefficient when storing many values of the same type
 * numpy arrays reserve one block of memory in which all of the values are stored contiguously

![alt-text](array_vs_list.png 'array vs. list')

![alt-text](numpy-array.jpg 'numpy-array')
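The memory difference described above can be made concrete with a rough measurement. This is only a sketch: the variable names are made up for illustration, and the list figure is an estimate (CPython caches some small integers, so the true overhead varies).

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# A list holds n references; each int is a separate object elsewhere in memory.
# Summing the sizes gives a rough estimate of the total footprint.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(v) for v in py_list)

# An ndarray stores the raw values back to back: n * itemsize bytes.
array_bytes = arr.nbytes

print(list_bytes, array_bytes)
print(arr.flags['C_CONTIGUOUS'])  # True: the values sit in one contiguous block
```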

## numpy datatypes
* __`numpy`__ is very precise about identifying datatypes
* several types of integers: __`numpy.int8`__, __`numpy.int16`__, __`numpy.int32`__, __`numpy.int64`__ (also unsigned)
* __`numpy.float32`__, __`numpy.float64`__, __`numpy.float128`__ (availability of the last is platform-dependent; also complex types)
* boolean
* string, Unicode string (like Python strings, but the maximum length is fixed when the array is created)
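A short sketch of what fixed-size datatypes mean in practice. The array names are illustrative only. It shows that each dtype has a fixed byte size, that integer arrays wrap around silently at their limits, and that fixed-length strings are truncated on assignment.

```python
import numpy as np

# Each dtype occupies a fixed number of bytes per element.
print(np.dtype(np.int8).itemsize, np.dtype(np.float64).itemsize)

# Fixed-width integers have hard limits...
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)

# ...and array arithmetic wraps around silently when a limit is exceeded.
small = np.array([127], dtype=np.int8)
wrapped = small + np.int8(1)
print(wrapped)

# Fixed-length Unicode strings: 'U3' holds at most 3 characters per element.
names = np.array(['abc', 'de'], dtype='U3')
names[1] = 'numpy'   # silently truncated to fit
print(names)
```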

## creating numpy arrays

In [1]:
import numpy as np
a = np.array([1, 2, 3, 4, 5])
a # repr() is being called

array([1, 2, 3, 4, 5])

In [2]:
type(a), a.dtype

(numpy.ndarray, dtype('int64'))

In [3]:
# types matter for ndarrays!
a[0] = 34.7 # OK: the float is truncated to the int 34
a[0] = 'x'  # raises ValueError: 'x' cannot be converted to int
a

ValueError: invalid literal for int() with base 10: 'x'

In [4]:
# If need be, you can specify type
a = np.array([1, 2, 3, 4, 5], dtype=np.float64)
a

array([1., 2., 3., 4., 5.])

In [5]:
a.ndim, a.shape, a.size

(1, (5,), 5)

In [6]:
# unlike Python lists, NumPy arrays can be
# multi-dimensional
b = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]],
             dtype=np.float64)
b

array([[ 1.,  2.,  3.,  4.,  5.],
       [ 6.,  7.,  8.,  9., 10.]])

In [7]:
# ...or initialize using a list comprehension
np.array([range(i, i + 3) for i in [3, 5, 7]])

array([[3, 4, 5],
       [5, 6, 7],
       [7, 8, 9]])

In [8]:
b, b.ndim, b.shape, b.size

(array([[ 1.,  2.,  3.,  4.,  5.],
        [ 6.,  7.,  8.,  9., 10.]]),
 2,
 (2, 5),
 10)

## Creating arrays from scratch
* especially for larger arrays, it is more efficient to create arrays from scratch using routines built into NumPy

In [9]:
np.zeros((4, 6), dtype=int)

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]])

In [10]:
np.empty((4, 4), dtype='float64')

array([[2.05833592e-312, 2.33419537e-312, 0.00000000e+000,
        0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000],
       [0.00000000e+000, 0.00000000e+000, 0.00000000e+000,
        0.00000000e+000]])

In [11]:
np.full((3, 9), 3.14159)

array([[3.14159, 3.14159, 3.14159, 3.14159, 3.14159, 3.14159, 3.14159,
        3.14159, 3.14159],
       [3.14159, 3.14159, 3.14159, 3.14159, 3.14159, 3.14159, 3.14159,
        3.14159, 3.14159],
       [3.14159, 3.14159, 3.14159, 3.14159, 3.14159, 3.14159, 3.14159,
        3.14159, 3.14159]])

In [12]:
# linear sequence, similar to range()
np.arange(0, 10, 2)

array([0, 2, 4, 6, 8])

In [13]:
# five values evenly spaced between 0 and 10
np.linspace(0, 10, 5)

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

In [14]:
# 3x3 array of uniformly distributed random values between 0 and 1
np.random.random((3, 3))

array([[0.88002901, 0.40211093, 0.95274787],
       [0.17577771, 0.66513977, 0.82806628],
       [0.02690665, 0.32556801, 0.51454477]])

In [15]:
np.random.standard_normal((2, 4))

array([[ 0.19203831,  0.10139577,  1.34746388,  0.05372531],
       [ 0.736995  , -0.5704542 ,  0.13652502,  0.54624729]])

In [18]:
# 3x3 array of normally distributed random values with mean 100 and stdev 2
np.random.normal(100, 2, (3, 3))

array([[100.48968342,  98.73720152,  97.65764087],
       [ 99.16138029, 100.84005768, 100.37231338],
       [ 97.70451557, 100.82492292, 100.64800286]])

In [19]:
# 4x4 array of random integers in interval [0, 100)
np.random.randint(0, 100, (4, 4))

array([[39, 64, 23, 22],
       [70, 52, 75, 81],
       [36, 26, 27, 65],
       [53,  5, 73, 39]])

In [20]:
# identity matrix
np.eye(8, dtype='float32')

array([[1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

## indexing/slicing

In [21]:
a = np.linspace(0, 10, 5)
a

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

In [22]:
a[3]

7.5

In [23]:
aa = np.random.random((5, 4))
aa

array([[0.23825205, 0.77844337, 0.69031609, 0.51563637],
       [0.5506819 , 0.40488487, 0.45003278, 0.94762492],
       [0.41653271, 0.10942384, 0.42877519, 0.91195054],
       [0.09349675, 0.49323687, 0.90320342, 0.28048338],
       [0.99867931, 0.41959517, 0.04135449, 0.34225642]])

In [24]:
aa[1, 1]

0.4048848664297481

In [25]:
aa[2:4] # extract rows 2 and 3

array([[0.41653271, 0.10942384, 0.42877519, 0.91195054],
       [0.09349675, 0.49323687, 0.90320342, 0.28048338]])

In [26]:
aa[2:5, 1] # extract column 1 of rows 2-4

array([0.10942384, 0.49323687, 0.41959517])

In [27]:
aa[::-1]

array([[0.99867931, 0.41959517, 0.04135449, 0.34225642],
       [0.09349675, 0.49323687, 0.90320342, 0.28048338],
       [0.41653271, 0.10942384, 0.42877519, 0.91195054],
       [0.5506819 , 0.40488487, 0.45003278, 0.94762492],
       [0.23825205, 0.77844337, 0.69031609, 0.51563637]])

In [28]:
aa[::-1, ::-1]

array([[0.34225642, 0.04135449, 0.41959517, 0.99867931],
       [0.28048338, 0.90320342, 0.49323687, 0.09349675],
       [0.91195054, 0.42877519, 0.10942384, 0.41653271],
       [0.94762492, 0.45003278, 0.40488487, 0.5506819 ],
       [0.51563637, 0.69031609, 0.77844337, 0.23825205]])

## Manipulating numpy arrays

In [29]:
a = np.random.standard_normal((2, 4))
b = np.random.standard_normal((2, 4))
a, b

(array([[-1.18130656, -0.01621406, -0.99367793, -0.66918042],
        [ 0.62520918,  0.47723536, -2.14988918, -1.89072274]]),
 array([[-0.75430286, -0.29504343,  1.37375351,  0.75609414],
        [ 1.16826555, -0.39396515,  0.88584679, -1.03599908]]))

In [30]:
np.vstack([a, b])

array([[-1.18130656, -0.01621406, -0.99367793, -0.66918042],
       [ 0.62520918,  0.47723536, -2.14988918, -1.89072274],
       [-0.75430286, -0.29504343,  1.37375351,  0.75609414],
       [ 1.16826555, -0.39396515,  0.88584679, -1.03599908]])

In [31]:
np.hstack([a, b])

array([[-1.18130656, -0.01621406, -0.99367793, -0.66918042, -0.75430286,
        -0.29504343,  1.37375351,  0.75609414],
       [ 0.62520918,  0.47723536, -2.14988918, -1.89072274,  1.16826555,
        -0.39396515,  0.88584679, -1.03599908]])

In [32]:
a.transpose()

array([[-1.18130656,  0.62520918],
       [-0.01621406,  0.47723536],
       [-0.99367793, -2.14988918],
       [-0.66918042, -1.89072274]])

## Saving/Loading a numpy array

In [33]:
np.save('/tmp/a.npy', a)
a1 = np.load('/tmp/a.npy')
a1

array([[-1.18130656, -0.01621406, -0.99367793, -0.66918042],
       [ 0.62520918,  0.47723536, -2.14988918, -1.89072274]])

## Performing math on numpy arrays

In [34]:
x = np.linspace(0, 10, 1000)
x

array([ 0.        ,  0.01001001,  0.02002002,  0.03003003,  0.04004004,
        0.05005005,  0.06006006,  0.07007007,  0.08008008,  0.09009009,
        0.1001001 ,  0.11011011,  0.12012012,  0.13013013,  0.14014014,
        0.15015015,  0.16016016,  0.17017017,  0.18018018,  0.19019019,
        0.2002002 ,  0.21021021,  0.22022022,  0.23023023,  0.24024024,
        0.25025025,  0.26026026,  0.27027027,  0.28028028,  0.29029029,
        0.3003003 ,  0.31031031,  0.32032032,  0.33033033,  0.34034034,
        0.35035035,  0.36036036,  0.37037037,  0.38038038,  0.39039039,
        0.4004004 ,  0.41041041,  0.42042042,  0.43043043,  0.44044044,
        0.45045045,  0.46046046,  0.47047047,  0.48048048,  0.49049049,
        0.5005005 ,  0.51051051,  0.52052052,  0.53053053,  0.54054054,
        0.55055055,  0.56056056,  0.57057057,  0.58058058,  0.59059059,
        0.6006006 ,  0.61061061,  0.62062062,  0.63063063,  0.64064064,
        0.65065065,  0.66066066,  0.67067067,  0.68068068,  0.69

In [35]:
%time sinx = np.sin(x)
# "universal" function which operates on entire array!
sinx


CPU times: user 191 µs, sys: 90 µs, total: 281 µs
Wall time: 280 µs


array([ 0.        ,  0.01000984,  0.02001868,  0.03002552,  0.04002934,
        0.05002916,  0.06002396,  0.07001275,  0.07999452,  0.08996827,
        0.09993302,  0.10988774,  0.11983146,  0.12976317,  0.13968188,
        0.14958659,  0.15947632,  0.16935006,  0.17920684,  0.18904566,
        0.19886554,  0.20866549,  0.21844453,  0.22820168,  0.23793597,
        0.24764642,  0.25733206,  0.26699191,  0.276625  ,  0.28623038,
        0.29580708,  0.30535414,  0.3148706 ,  0.32435552,  0.33380793,
        0.3432269 ,  0.35261147,  0.36196071,  0.37127369,  0.38054946,
        0.3897871 ,  0.39898569,  0.4081443 ,  0.41726201,  0.42633791,
        0.4353711 ,  0.44436066,  0.45330569,  0.46220531,  0.47105861,
        0.47986471,  0.48862273,  0.49733179,  0.50599102,  0.51459954,
        0.52315651,  0.53166105,  0.54011232,  0.54850948,  0.55685167,
        0.56513807,  0.57336784,  0.58154016,  0.58965421,  0.59770917,
        0.60570425,  0.61363863,  0.62151153,  0.62932216,  0.63

In [36]:
%%timeit
for i in range(0, 1000):
    sinx[i] = np.sin(x[i])

1.79 ms ± 136 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
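The two cells above compute the same values; only the speed differs. The sketch below checks that claim directly, using `time.perf_counter` in place of the `%time`/`%%timeit` magics so that it runs outside a notebook. Timings will vary by machine, so none are quoted here.

```python
import time
import numpy as np

x = np.linspace(0, 10, 1000)

# Vectorized: one call, the loop runs in compiled code.
start = time.perf_counter()
vectorized = np.sin(x)
t_vec = time.perf_counter() - start

# Explicit Python loop: one np.sin call per element.
start = time.perf_counter()
looped = np.empty_like(x)
for i in range(x.size):
    looped[i] = np.sin(x[i])
t_loop = time.perf_counter() - start

print(np.allclose(vectorized, looped))   # same values either way
print(f"vectorized: {t_vec:.6f}s, loop: {t_loop:.6f}s")
```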


In [37]:
cosx = np.cos(x)
y = sinx * cosx
y

array([ 0.        ,  0.01000934,  0.02001467,  0.03001198,  0.03999726,
        0.04996651,  0.05991573,  0.06984094,  0.07973816,  0.08960342,
        0.09943277,  0.10922227,  0.11896799,  0.12866603,  0.1383125 ,
        0.14790354,  0.1574353 ,  0.16690397,  0.17630574,  0.18563684,
        0.19489355,  0.20407215,  0.21316896,  0.22218033,  0.23110265,
        0.23993235,  0.24866589,  0.25729977,  0.26583053,  0.27425474,
        0.28256903,  0.29077008,  0.29885459,  0.30681932,  0.31466108,
        0.32237673,  0.32996317,  0.33741737,  0.34473634,  0.35191714,
        0.35895689,  0.36585278,  0.37260204,  0.37920197,  0.38564992,
        0.3919433 ,  0.3980796 ,  0.40405635,  0.40987116,  0.4155217 ,
        0.4210057 ,  0.42632097,  0.43146538,  0.43643686,  0.44123342,
        0.44585314,  0.45029416,  0.45455472,  0.45863309,  0.46252765,
        0.46623684,  0.46975916,  0.47309321,  0.47623765,  0.47919122,
        0.48195273,  0.48452109,  0.48689525,  0.48907427,  0.49

In [38]:
xplus1 = x + 1
a = np.array([[1, 2], [3, 4]])
b = np.array([[-1, -2], [-3, -4]])
np.matmul(a, b)

array([[ -7, -10],
       [-15, -22]])

## __`numpy`__ Datetime Object

In [39]:
np.datetime64('2016')

numpy.datetime64('2016')

In [40]:
np.datetime64('2016-03')

numpy.datetime64('2016-03')

In [41]:
np.datetime64('2016-03-31 08:30:00')

numpy.datetime64('2016-03-31T08:30:00')

In [42]:
np.datetime64('2016-03-07') < np.datetime64('2016-03-09')

True

In [43]:
np.datetime64('2016-03-09') - np.datetime64('2016-03-07')

numpy.timedelta64(2,'D')

In [44]:
np.datetime64('2016-01-01') + np.timedelta64(59, 'D')

numpy.datetime64('2016-02-29')

In [45]:
np.arange(np.datetime64('2016-02-01'),
          np.datetime64('2016-03-01'))
#np.timedelta64(67,'D') / np.timedelta64(1, 'W')

array(['2016-02-01', '2016-02-02', '2016-02-03', '2016-02-04',
       '2016-02-05', '2016-02-06', '2016-02-07', '2016-02-08',
       '2016-02-09', '2016-02-10', '2016-02-11', '2016-02-12',
       '2016-02-13', '2016-02-14', '2016-02-15', '2016-02-16',
       '2016-02-17', '2016-02-18', '2016-02-19', '2016-02-20',
       '2016-02-21', '2016-02-22', '2016-02-23', '2016-02-24',
       '2016-02-25', '2016-02-26', '2016-02-27', '2016-02-28',
       '2016-02-29'], dtype='datetime64[D]')
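Every `datetime64` and `timedelta64` value carries a unit (year, month, day, second, ...), which is why the results above print as `'D'` or `datetime64[D]`. A small sketch of working with units, including the division that is commented out in the cell above:

```python
import numpy as np

d = np.datetime64('2016-03-31')
print(d.dtype)                        # the unit is inferred from the string

# Convert to a coarser unit: keeps only year and month.
month = d.astype('datetime64[M]')
print(month)

# Subtracting two dates gives a timedelta64 in the same unit (2016 is a leap year).
span = np.datetime64('2016-03-01') - np.datetime64('2016-02-01')
print(span)

# Dividing two timedeltas gives a plain float: 67 days expressed in weeks.
weeks = np.timedelta64(67, 'D') / np.timedelta64(1, 'W')
print(weeks)
```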

# Pandas
* has gained broad acceptance as the standard data analysis tool for Python
* built on top of __`numpy`__ and significantly enhances it
* "__`numpy`__ with labels"
* deals with data in tabular form, and attaches more general labels to the rows and columns
* more robust in handling common data formats and missing data
* adds relational database operations, e.g., joins
* the two most common datatypes are Series (1D) and DataFrame (2D)
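A minimal sketch of what "numpy with labels" and missing-data handling look like in practice. The fruit data is made up for illustration. Arithmetic between two Series lines up values by label rather than by position, and labels present on only one side produce `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two Series with partly different labels.
q1 = pd.Series({'apples': 10, 'pears': 4})
q2 = pd.Series({'pears': 6, 'plums': 3})

# Values are matched by label; unmatched labels become NaN.
total = q1 + q2
print(total)

# Treat a missing entry as 0 instead of propagating NaN.
filled = q1.add(q2, fill_value=0)
print(filled)
```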

In [46]:
import pandas as pd

# Pandas Series

In [47]:
s = pd.Series([0, 1, 4, 9, 16, 25], name='squares')
s

0     0
1     1
2     4
3     9
4    16
5    25
Name: squares, dtype: int64

In [48]:
s.values

array([ 0,  1,  4,  9, 16, 25])

In [49]:
s.index

RangeIndex(start=0, stop=6, step=1)

In [50]:
s[2]

4

In [51]:
s[2:4]

2    4
3    9
Name: squares, dtype: int64

In [52]:
ieee2015 = pd.Series([100.0, 99.9, 99.4, 96.5, 91.3, 84.8, 84.5, 83.0, 
76.2, 72.4], index=['Java', 'C', 'C++', 'Python', 'C#', 'R', 'PHP',
                    'JavaScript', 'Ruby', 'Matlab'])

In [53]:
ieee2015

Java          100.0
C              99.9
C++            99.4
Python         96.5
C#             91.3
R              84.8
PHP            84.5
JavaScript     83.0
Ruby           76.2
Matlab         72.4
dtype: float64

In [54]:
ieee2015.index

Index(['Java', 'C', 'C++', 'Python', 'C#', 'R', 'PHP', 'JavaScript', 'Ruby',
       'Matlab'],
      dtype='object')

In [62]:
# use .iloc for positions; integer keys on a label index are deprecated in newer pandas
ieee2015.iloc[3], ieee2015['Ruby']


(96.5, 76.200000000000003)

# Pandas indices

In [55]:
s = pd.Series(np.nan, index=[49, 48, 47, 46,
                             45, 1, 2, 3, 4, 5])

In [56]:
s[:3]

49   NaN
48   NaN
47   NaN
dtype: float64

In [57]:
# iloc = integer index location
s.iloc[:3]

49   NaN
48   NaN
47   NaN
dtype: float64

In [58]:
# all items up to and including the index label 3
# (not the element at position 3)
s.loc[:3]

49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
2    NaN
3    NaN
dtype: float64

In [67]:
# plain [] slicing with integers is positional, so
# s[:6] == s.iloc[:6] (whether or not a label 6 exists)
s[:6]

49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
dtype: float64

In [68]:
s.iloc[:6]

49   NaN
48   NaN
47   NaN
46   NaN
45   NaN
1    NaN
dtype: float64

In [69]:
s.loc[:6]

KeyError: 6

In [70]:
ieee2015[1:4]

C         99.9
C++       99.4
Python    96.5
dtype: float64

In [71]:
ieee2015['C++':'R']

C++       99.4
Python    96.5
C#        91.3
R         84.8
dtype: float64

In [72]:
ieee2015[ieee2015 > 95]

Java      100.0
C          99.9
C++        99.4
Python     96.5
dtype: float64
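The cells above mix three ways of selecting from a Series. The sketch below puts them side by side on a small made-up Series. The key difference is that position slices with `.iloc` exclude the end point, while label slices with `.loc` include it.

```python
import pandas as pd

# Hypothetical Series with string labels.
s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

# By position: end point excluded, as with Python lists.
by_position = s.iloc[1:3]
print(by_position)

# By label: end point included.
by_label = s.loc['b':'d']
print(by_label)

# By boolean mask: keep only the entries where the condition holds.
mask = s[s > 25]
print(mask)
```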

# Pandas Series from dict

In [59]:
ieee2015 = pd.Series({'Java': 100.0, 'C': 99.9, 'C++': 99.4,
                      'Python': 96.5, 'C#': 91.3, 'R': 84.8,
                      'PHP': 84.5, 'JavaScript': 83.0, 'Ruby': 76.2,
                      'Matlab': 72.4})

In [60]:
ieee2015

Java          100.0
C              99.9
C++            99.4
Python         96.5
C#             91.3
R              84.8
PHP            84.5
JavaScript     83.0
Ruby           76.2
Matlab         72.4
dtype: float64

# Pandas DataFrames
* extend numpy 2D arrays by giving labels to the columns and also to the rows (if you provide an explicit index)
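As a minimal sketch of that idea, a DataFrame can be built directly from a 2D numpy array by supplying the labels. The row and column names here are made up for illustration.

```python
import numpy as np
import pandas as pd

raw = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])

# Same data, now with labels on both axes.
df = pd.DataFrame(raw, index=['r1', 'r2', 'r3'], columns=['x', 'y'])
print(df)

# Look up a single value by row and column label.
print(df.loc['r2', 'y'])

# The underlying values are still available as a numpy array.
print(df.to_numpy())
```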


In [61]:
ieee2014 = pd.Series([100.0, 99.3, 95.5, 94.5, 92.4, 84.8, 84.5,
    78.9, 74.3, 72.8], index=['Java', 'C', 'C++',
    'Python', 'C#', 'PHP', 'JavaScript', 'Ruby', 'R', 'Matlab'])
ieee2015 = pd.Series({'Java': 100.0, 'C': 99.9, 'C++': 99.4,
        'Python': 96.5, 'C#': 91.3, 'R': 84.8, 'PHP': 84.5,
        'JavaScript': 83.0, 'Ruby': 76.2, 'Matlab': 72.4})
pldata = pd.DataFrame({'2014': ieee2014, '2015': ieee2015})
#ieee2014, ieee2015

#pldata = pd.DataFrame(ieee2014, ieee2015)
print(pldata)

             2014   2015
C            99.3   99.9
C#           92.4   91.3
C++          95.5   99.4
Java        100.0  100.0
JavaScript   84.5   83.0
Matlab       72.8   72.4
PHP          84.8   84.5
Python       94.5   96.5
R            74.3   84.8
Ruby         78.9   76.2


In [62]:
pldata

             2014   2015
C            99.3   99.9
C#           92.4   91.3
C++          95.5   99.4
Java        100.0  100.0
JavaScript   84.5   83.0
Matlab       72.8   72.4
PHP          84.8   84.5
Python       94.5   96.5
R            74.3   84.8
Ruby         78.9   76.2


In [63]:
pldata.sort_values(by='2015', ascending=False)

             2014   2015
Java        100.0  100.0
C            99.3   99.9
C++          95.5   99.4
Python       94.5   96.5
C#           92.4   91.3
R            74.3   84.8
PHP          84.8   84.5
JavaScript   84.5   83.0
Ruby         78.9   76.2
Matlab       72.8   72.4


In [64]:
pldata.values

array([[ 99.3,  99.9],
       [ 92.4,  91.3],
       [ 95.5,  99.4],
       [100. , 100. ],
       [ 84.5,  83. ],
       [ 72.8,  72.4],
       [ 84.8,  84.5],
       [ 94.5,  96.5],
       [ 74.3,  84.8],
       [ 78.9,  76.2]])

In [65]:
pldata.columns

Index(['2014', '2015'], dtype='object')

In [66]:
pldata['2014']

C              99.3
C#             92.4
C++            95.5
Java          100.0
JavaScript     84.5
Matlab         72.8
PHP            84.8
Python         94.5
R              74.3
Ruby           78.9
Name: 2014, dtype: float64

# Adding a column to a DataFrame

In [67]:
pldata['avg'] = (pldata['2014'] + pldata['2015']) / 2
pldata

             2014   2015    avg
C            99.3   99.9   99.6
C#           92.4   91.3  91.85
C++          95.5   99.4  97.45
Java        100.0  100.0  100.0
JavaScript   84.5   83.0  83.75
Matlab       72.8   72.4   72.6
PHP          84.8   84.5  84.65
Python       94.5   96.5   95.5
R            74.3   84.8  79.55
Ruby         78.9   76.2  77.55


# Creating a DataFrame from dicts

In [70]:
presidents = pd.DataFrame([
    { 'name': 'Barack Obama', 'elect': 2008, 'born': 1961 },
    { 'name': 'George W. Bush', 'elect': 2000, 'born': 1946 },
    { 'name': 'Bill Clinton', 'elect': 1992, 'born': 1946 },
    { 'name': 'George H.W. Bush', 'elect': 1988, 'born': 1924 },
])
presidents

               name  elect  born
0      Barack Obama   2008  1961
1    George W. Bush   2000  1946
2      Bill Clinton   1992  1946
3  George H.W. Bush   1988  1924


# Setting the Index of a DataFrame

In [71]:
president_indexes = presidents.set_index('name')
president_indexes

                  elect  born
name
Barack Obama       2008  1961
George W. Bush     2000  1946
Bill Clinton       1992  1946
George H.W. Bush   1988  1924


In [72]:
presidents

               name  elect  born
0      Barack Obama   2008  1961
1    George W. Bush   2000  1946
2      Bill Clinton   1992  1946
3  George H.W. Bush   1988  1924


# Manipulating a DataFrame

In [92]:
president_indexes

                  born  elect
name
Barack Obama      1961   2008
George W. Bush    1946   2000
Bill Clinton      1946   1992
George H.W. Bush  1924   1988


In [73]:
president_indexes['born'].idxmax()

'Barack Obama'

In [74]:
president_indexes['born']['Bill Clinton']

1946

In [75]:
president_indexes.loc['Bill Clinton']

elect    1992
born     1946
Name: Bill Clinton, dtype: int64

In [76]:
president_indexes.loc['Bill Clinton']['born']

1946

In [77]:
#presidents['born']
pd.DataFrame(presidents['born'])

   born
0  1961
1  1946
2  1946
3  1924


In [99]:
presidents['born'][2]

1946

In [100]:
presidents.iloc[2]

born             1946
elect            1992
name     Bill Clinton
Name: 2, dtype: object

In [101]:
presidents.iloc[2]['born']

1946

# Merging Two DataFrames

In [78]:
presidents_dads = pd.DataFrame([
    { 'son': 'Barack Obama', 'father': 'Barack Obama, Sr.' },
    { 'son': 'George W. Bush', 'father': 'George H.W. Bush' },
    { 'son': 'George H.W. Bush', 'father': 'Prescott Bush' },
])

presidents_dads

                son             father
0      Barack Obama  Barack Obama, Sr.
1    George W. Bush   George H.W. Bush
2  George H.W. Bush      Prescott Bush


In [79]:
presidents

               name  elect  born
0      Barack Obama   2008  1961
1    George W. Bush   2000  1946
2      Bill Clinton   1992  1946
3  George H.W. Bush   1988  1924


In [104]:
pd.merge(presidents, presidents_dads, 
         left_on='name', right_on='son')

   born  elect              name             father               son
0  1961   2008      Barack Obama  Barack Obama, Sr.      Barack Obama
1  1946   2000    George W. Bush   George H.W. Bush    George W. Bush
2  1924   1988  George H.W. Bush      Prescott Bush  George H.W. Bush


In [80]:
pd.merge(presidents, presidents_dads, left_on='name',
         right_on='son').drop('son' , axis=1)

               name  elect  born             father
0      Barack Obama   2008  1961  Barack Obama, Sr.
1    George W. Bush   2000  1946   George H.W. Bush
2  George H.W. Bush   1988  1924      Prescott Bush


In [81]:
pd.merge(presidents, presidents_dads, left_on='name',
         right_on='son', how='left').drop('son', axis=1)

               name  elect  born             father
0      Barack Obama   2008  1961  Barack Obama, Sr.
1    George W. Bush   2000  1946   George H.W. Bush
2      Bill Clinton   1992  1946                NaN
3  George H.W. Bush   1988  1924      Prescott Bush
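The difference between the default (inner) merge and `how='left'` can be seen on a smaller pair of frames. This is a sketch with made-up placeholder data. The `indicator=True` option adds a `_merge` column recording where each row came from, which makes unmatched rows easy to spot.

```python
import pandas as pd

# Hypothetical frames: 'C' has no matching row on the right.
left = pd.DataFrame({'name': ['A', 'B', 'C'], 'elect': [2008, 2000, 1992]})
right = pd.DataFrame({'son': ['A', 'B'], 'father': ['A Sr.', 'B Sr.']})

# Inner merge (the default): only rows with a match on both sides.
inner = pd.merge(left, right, left_on='name', right_on='son')
print(inner)

# Left merge: every row of `left` is kept; missing matches become NaN.
left_join = pd.merge(left, right, left_on='name', right_on='son',
                     how='left', indicator=True)
print(left_join)
```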


# Lab: Pandas
* read the weather data from __`weather.csv`__ (__`http://bit.ly/1PL3X6t`__) into a DataFrame called __`weather`__ (there is a Pandas function __`read_csv`__ that will parse and read a CSV file)
* set the index of __`weather`__ to be __`DATE`__
* examine the column __`PrecipitationIn`__ (precipitation in inches by date)
* determine the total amount of precipitation for the entire dataset (there is a __`.sum()`__ function)
* determine the total amount of precipitation for the month of February 2013
* create a new __`DataFrame`__ which only contains the rows of weather for which there was some precipitation

In [109]:
weather = pd.read_csv('weather.csv')
weather.info()
weather = weather.set_index('DATE')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 23 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   DATE                        366 non-null    object 
 1   max_tempF                   366 non-null    int64  
 2   mean_tempF                  366 non-null    int64  
 3   min_tempF                   366 non-null    int64  
 4   max_dew_pointF              366 non-null    int64  
 5   mean_dew_pointF             366 non-null    int64  
 6   min_dew_pointF              366 non-null    int64  
 7   max_humidity                366 non-null    int64  
 8   mean_humidity               366 non-null    int64  
 9   min_humidity                366 non-null    int64  
 10  Max Sea Level PressureIn    366 non-null    float64
 11   Mean Sea Level PressureIn  366 non-null    float64
 12   Min Sea Level PressureIn   366 non-null    float64
 13   Max VisibilityMiles        366 non

In [110]:
weather.index

Index(['2012-3-10', '2012-3-11', '2012-3-12', '2012-3-13', '2012-3-14',
       '2012-3-15', '2012-3-16', '2012-3-17', '2012-3-18', '2012-3-19',
       ...
       '2013-3-1', '2013-3-2', '2013-3-3', '2013-3-4', '2013-3-5', '2013-3-6',
       '2013-3-7', '2013-3-8', '2013-3-9', '2013-3-10'],
      dtype='object', name='DATE', length=366)

In [95]:
weather['PrecipitationIn']

DATE
2012-3-10    0.00
2012-3-11    0.00
2012-3-12    0.03
2012-3-13    0.00
2012-3-14    0.00
             ... 
2013-3-6     0.04
2013-3-7     0.00
2013-3-8     0.00
2013-3-9     0.00
2013-3-10    0.00
Name: PrecipitationIn, Length: 366, dtype: float64

In [114]:
pin = weather['PrecipitationIn']
pin.sum()

35.46

In [121]:
pin.head()

DATE
2012-03-10    0.00
2012-03-11    0.00
2012-03-12    0.03
2012-03-13    0.00
2012-03-14    0.00
Name: PrecipitationIn, dtype: float64

In [115]:
pin.index = pd.to_datetime(weather.index)

In [116]:
type(pin.index)

pandas.core.indexes.datetimes.DatetimeIndex

In [123]:
pin['2013-02-01':'2013-02-28'].sum()

2.5499999999999994

In [126]:
wplus = weather[weather['PrecipitationIn'] > 0]
min(wplus['PrecipitationIn'])

0.01
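The lab steps can also be written more compactly by letting `read_csv` set and parse the index in one call. The sketch below is one possible variant, not the only solution. It runs on a tiny in-memory stand-in for `weather.csv`: the column names match the lab, but the values are made up, so the totals differ from those above.

```python
import io
import pandas as pd

# Hypothetical stand-in for weather.csv: same column names, made-up values.
csv_text = """DATE,PrecipitationIn
2013-01-30,0.10
2013-01-31,0.00
2013-02-01,0.25
2013-02-15,0.00
2013-02-28,0.50
2013-03-01,0.05
"""

# index_col + parse_dates gives a DatetimeIndex directly.
weather = pd.read_csv(io.StringIO(csv_text), index_col='DATE', parse_dates=True)
pin = weather['PrecipitationIn']

total = pin.sum()

# With a DatetimeIndex, a 'YYYY-MM' string selects the whole month.
february = pin.loc['2013-02'].sum()

# Rows with some precipitation.
wet_days = weather[weather['PrecipitationIn'] > 0]

print(total, february, len(wet_days))
```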