# Agenda

1. Pandas in general
2. Series
    - Creating 
    - Retrieving from them
    - Methods
    - Working with `nan`
    - Broadcasting
    - Indexes
3. Data frames
    - Creating
    - Retrieving from them
    - Applying methods across all columns
4. Reading data from files
    - CSV
    - (Excel, a little)
    - JSON
    - Retrieving from the network

# Pandas data structures

There are two main data structures you need to know about with Pandas:

- `Series` -- basically a 1D NumPy array
- `DataFrame` -- basically a 2D NumPy array

The rows of a data frame are going to be equivalent to the rows of a NumPy array.

The columns of a data frame are all going to be Series objects.
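
As a minimal sketch of that relationship (the array contents and the column names `a`, `b`, `c` are made up for illustration), we can build a data frame from a 2D NumPy array and check what a single column gives us back:

```python
import numpy as np
import pandas as pd

# A small 2D NumPy array: 2 rows, 3 columns (illustrative values)
arr = np.array([[1, 2, 3],
                [4, 5, 6]])

df = pd.DataFrame(arr, columns=['a', 'b', 'c'])

col = df['b']                 # one column of the data frame
print(type(col))              # a pandas Series
print(col.to_numpy())         # the column's values as a 1D NumPy array
print(df.to_numpy().shape)    # the whole frame as a 2D NumPy array
```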

# Installing pandas

You can install it, like many other Python packages, with `pip` (a command-line program, not to be run inside Python or Jupyter):

    python3 -m pip install -U pandas
    
Once it's installed, we'll want to load it into our program.

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# let's create a Pandas Series

s = Series([10, 20, 30, 40, 50, 60, 70])

In [4]:
type(s)

pandas.core.series.Series

In [5]:
s

0    10
1    20
2    30
3    40
4    50
5    60
6    70
dtype: int64

In [6]:
# I can do many of the same things with a series that I did with NumPy arrays

s[0]

10

In [7]:
s[3]

40

In [8]:
s.sum()

280

In [9]:
s.mean()

40.0

In [10]:
s = Series([10, 20, 30, 40, 50, 60, 70], dtype=np.float64)

In [11]:
s

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
5    60.0
6    70.0
dtype: float64

In [12]:
# behind the scenes, we can find a NumPy array
s.values

array([10., 20., 30., 40., 50., 60., 70.])

In [13]:
# I can create a Pandas Series based on a 1D NumPy array

np.random.seed(0)
s = Series(np.random.randint(0, 100, 10))

s

0    44
1    47
2    64
3    67
4    67
5     9
6    83
7    21
8    36
9    87
dtype: int64

In [14]:
s.min()

9

In [15]:
s.max()

87

In [16]:
s.mean()

52.5

In [17]:
s.std()

25.67424130654432

In [18]:
s.sum()

525

In [20]:
s.count()   # how many non-nan values are there in this series?

10

In [21]:
# can I create a boolean series, and use it as a mask index? YES!

s < 30

0    False
1    False
2    False
3    False
4    False
5     True
6    False
7     True
8    False
9    False
dtype: bool

In [22]:
s[s<30]  # notice that when I filter the elements of s, the items I get back retain their original indexes

5     9
7    21
dtype: int64

In [23]:
s[s<s.mean()]  # retrieve values less than the mean

0    44
1    47
5     9
7    21
8    36
dtype: int64

# Ways to retrieve from a series

1. We've now seen that we can retrieve from a series using `[]`, just as we did with NumPy arrays.  But I'd like you to break this habit, ASAP.  (When we get to data frames, this will make more sense, and you won't regret it.)

2. We can also retrieve from a series using the `.loc` accessor and `[]`.  This lets us retrieve via the index.

3. We can also retrieve from a series using the `.iloc` accessor and `[]`.  This lets us retrieve via the position in the series, regardless of the index, starting at 0.

In [24]:
a = s[s<s.mean()]

In [25]:
a   # this is a series containing 5 elements.

0    44
1    47
5     9
7    21
8    36
dtype: int64

In [26]:
# I can retrieve via the index using just [].  Again: Probably not a good idea, but it will work!
a[5]

9

In [27]:
a[8]

36

In [28]:
a[0]

44

In [29]:
a[[1, 7]]   # fancy indexing works here, too!

1    47
7    21
dtype: int64

In [30]:
# Better than [] alone is retrieving with .loc, which lets us specify the index

a.loc[5]

9

In [31]:
a.loc[8]

36

In [32]:
a.loc[0]

44

In [33]:
a.loc[[1,7]]

1    47
7    21
dtype: int64

In [36]:
# Another way to retrieve is with .iloc, where we specify the numeric position in the series,
# starting with 0.

a.iloc[3]

21

In [37]:
a

0    44
1    47
5     9
7    21
8    36
dtype: int64

In [38]:
a.loc[7]    # same as a.iloc[3]

21

In [39]:
a.iloc[2]

9

In [40]:
a.loc[2]

KeyError: 2

In [41]:
a

0    44
1    47
5     9
7    21
8    36
dtype: int64

In [43]:
# is there a way for me to set the index when I create a series?
# yes!

s = Series([10, 20, 30, 40, 50],
          index=[2,4,6,8,10])      # index and values must be of the same length!

In [44]:
s

2     10
4     20
6     30
8     40
10    50
dtype: int64

In [45]:
s.loc[4]  # this retrieves the item whose index is 4

20

In [46]:
s.iloc[4]  # this retrieves the item whose position is 4

50

In [47]:
s = Series([10, 20, 30, 40, 50],
          index=list('abcde'))    # indexes can be strings, including 1-character strings!

s

a    10
b    20
c    30
d    40
e    50
dtype: int64

In [48]:
s.loc['d']

40

In [49]:
s.iloc[3]

40

In [51]:
# we can retrieve multiple values with fancy indexing
# we'll get a series back as a result (assuming that there's more than one value there)

s.loc[['b', 'd']] # fancy indexing with strings

b    20
d    40
dtype: int64

In [52]:
# I can even use a slice!

s.loc['b':'d']   # ordinary Python slices stop just before the end value -- but label-based slices with .loc are INCLUSIVE of the end label

b    20
c    30
d    40
dtype: int64
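
The inclusive behavior applies to label-based slices with `.loc`. Position-based slices with `.iloc` still stop before the end position, just like ordinary Python slices. A small sketch of the difference, reusing the same five-element series (the variable names are only for illustration):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

by_label = s.loc['b':'d']     # label slice: includes 'd'
by_position = s.iloc[1:3]     # position slice: stops before position 3
plain_list = [10, 20, 30, 40, 50][1:3]   # ordinary Python slice, also end-exclusive

print(by_label.tolist())      # [20, 30, 40]
print(by_position.tolist())   # [20, 30]
print(plain_list)             # [20, 30]
```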

# Exercises: Series

1. Create a series containing 10 random integers from 0-100, with index `a` - `j`.
2. Retrieve the item at index `b`.
3. Retrieve the items at indexes `c`, `d`, and `f`.
4. What is the mean of the items at indexes `a`, `e`, `g`, and `h`?
5. What is the sum of the even values?
6. What is the sum of the items whose positions (`.iloc`) are even?

In [56]:
np.random.seed(0)
s = Series(np.random.randint(0, 100, 10),
          index=list('abcdefghij'))
s

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [58]:
# retrieve the item at index b

s.loc['b']

47

In [59]:
# Retrieve the items at indexes c, d, and f.

s.loc[['c', 'd', 'f']]   # fancy indexing -- pass a list of indexes to .loc[]

c    64
d    67
f     9
dtype: int64

In [60]:
# Retrieve the items at indexes c, d, and e

s.loc['c':'e']  # one set of square brackets here

c    64
d    67
e    67
dtype: int64

In [63]:
# What is the mean of the items at indexes a, e, g, and h?

s.loc[['a', 'e', 'g', 'h']].mean()

53.75

In [66]:
# What is the sum of the even values?

s.loc[s%2==0].sum()

144

In [67]:
# What is the sum of the items whose positions (.iloc) are even?

s.iloc[[0,2,4,6,8]]

a    44
c    64
e    67
g    83
i    36
dtype: int64

In [69]:
s.iloc[range(0, 10, 2)]   # from 0, up to (not including) 10, skipping by 2

a    44
c    64
e    67
g    83
i    36
dtype: int64

In [70]:
s.iloc[range(0, 10, 2)].sum()

294

# How is Pandas different from NumPy?


In [71]:
np.random.seed(0)
s1 = Series(np.random.randint(0, 100, 10), 
           index=list('abcdefghij'))
s2 = Series(np.random.randint(0, 100, 10), 
           index=list('defghijklm'))


In [72]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [73]:
s2

d    70
e    88
f    88
g    12
h    58
i    65
j    39
k    87
l    46
m    88
dtype: int64

In [74]:
s1 + s1  # add it to itself

a     88
b     94
c    128
d    134
e    134
f     18
g    166
h     42
i     72
j    174
dtype: int64

In [75]:
s1 + s2  # what happens here?

a      NaN
b      NaN
c      NaN
d    137.0
e    155.0
f     97.0
g     95.0
h     79.0
i    101.0
j    126.0
k      NaN
l      NaN
m      NaN
dtype: float64

In [76]:
s3 = Series([10, 20, 30, 40], index=list('abcd'))

s1 + s3

a     54.0
b     67.0
c     94.0
d    107.0
e      NaN
f      NaN
g      NaN
h      NaN
i      NaN
j      NaN
dtype: float64

In [77]:
# how can I get around this, if I want to?

s1 + s2   # this is translated in Python to s1.__add__(s2).  In Pandas, that becomes s1.add(s2)

a      NaN
b      NaN
c      NaN
d    137.0
e    155.0
f     97.0
g     95.0
h     79.0
i    101.0
j    126.0
k      NaN
l      NaN
m      NaN
dtype: float64

In [78]:
s1.add(s2)

a      NaN
b      NaN
c      NaN
d    137.0
e    155.0
f     97.0
g     95.0
h     79.0
i    101.0
j    126.0
k      NaN
l      NaN
m      NaN
dtype: float64

In [79]:
# we can pass a value to the "fill_value" keyword argument in s1.add, telling it what
# value to use if the index isn't matched on the other side.

s1.add(s2, fill_value=0)

a     44.0
b     47.0
c     64.0
d    137.0
e    155.0
f     97.0
g     95.0
h     79.0
i    101.0
j    126.0
k     87.0
l     46.0
m     88.0
dtype: float64
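
The key point in all of these examples is that Pandas matches values by index label, not by position. Here is a minimal sketch with two tiny, made-up series that share the same labels in a different order, compared with adding their underlying NumPy arrays:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=list('abc'))
s2 = pd.Series([10, 20, 30], index=list('cba'))   # same labels, reversed order

total = s1 + s2               # matched by label: a+a, b+b, c+c
print(total)

# Adding the underlying NumPy arrays matches by position instead
by_position = s1.to_numpy() + s2.to_numpy()
print(by_position)            # [11 22 33]
```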

# Exercises: Indexing and series

1. Create two series, each with 10 random integers from 0-100.  The first should have an index of a-j. The second should have an index of b-k.
2. Add them together, but first predict which indexes will give you `nan` results.
3. Add them together, with 0 being a fill value.  Can you use a different fill value?
4. Multiply them together, with 1 being a fill value. 
5. Sum all of the even numbers together from both of the series.

In [80]:
np.random.seed(0)

s1 = Series(np.random.randint(0, 100, 10), 
           index=list('abcdefghij'))

s2 = Series(np.random.randint(0, 100, 10), 
           index=list('bcdefghijk'))



In [81]:
s1

a    44
b    47
c    64
d    67
e    67
f     9
g    83
h    21
i    36
j    87
dtype: int64

In [82]:
s2

b    70
c    88
d    88
e    12
f    58
g    65
h    39
i    87
j    46
k    88
dtype: int64

In [83]:
# Add them together, but first predicting which indexes will give you nan results.

# we should get nan for a (only in s1) and k (only in s2)

s1 + s2

a      NaN
b    117.0
c    152.0
d    155.0
e     79.0
f     67.0
g    148.0
h     60.0
i    123.0
j    133.0
k      NaN
dtype: float64

In [84]:
# Add them together, with 0 being a fill value.  Can you use a different fill value?

s1.add(s2, fill_value=0)  # now, if any index doesn't have a match in the other series, we'll use 0

a     44.0
b    117.0
c    152.0
d    155.0
e     79.0
f     67.0
g    148.0
h     60.0
i    123.0
j    133.0
k     88.0
dtype: float64

In [85]:
s1.add(s2, fill_value=123.456)

a    167.456
b    117.000
c    152.000
d    155.000
e     79.000
f     67.000
g    148.000
h     60.000
i    123.000
j    133.000
k    211.456
dtype: float64

In [88]:
# Multiply them together, with 1 being a fill value. 

s1.mul(s2, fill_value=1)

a      44.0
b    3290.0
c    5632.0
d    5896.0
e     804.0
f     522.0
g    5395.0
h     819.0
i    3132.0
j    4002.0
k      88.0
dtype: float64

In [93]:
# Sum all of the even numbers together from both of the series.

s1[s1%2==0]   # all even items in s1


a    44
c    64
i    36
dtype: int64

In [94]:
s2[s2%2==0]  # all even items in s2

b    70
c    88
d    88
e    12
f    58
j    46
k    88
dtype: int64

In [96]:
s1[s1%2==0].add(s2[s2%2==0], fill_value=0).sum()

594.0

# Next up

1. `nan` in Pandas (removing, filling)
2. dtypes and assignment
3. Aggregate methods and convenience methods
4. Sorting by value

In [97]:
s = Series([10, 20, 30, 40, 50])
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [98]:
a = np.array([10, 20, 30, 40, 50])
a

array([10, 20, 30, 40, 50])

In [99]:
a[2] = np.nan

ValueError: cannot convert float NaN to integer

In [100]:
s.loc[2] = np.nan

In [101]:
s

0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
dtype: float64

In [102]:
a = np.array([10, 20, np.nan, 40, 50])
a

array([10., 20., nan, 40., 50.])

In [103]:
a.mean()

nan

In [104]:
s.mean()    # what happens when we ask for the mean of our nan-containing series?

30.0

# Pandas and `nan`

In Pandas, much of the behavior around `nan` is the opposite of what we've seen in NumPy:

- Assigning `nan` (or any float value) to a series with an integer dtype will result in the dtype changing to accommodate our data, including `nan`.  (Newer versions of Pandas may warn about this kind of implicit dtype change.)
- Invoking an aggregate method, such as `mean` or `std`, on a series containing `np.nan` will *ignore* the `np.nan` value(s) when making its calculation.
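
A side-by-side sketch of the second point, using the same five values as above. `np.nanmean` is NumPy's own `nan`-ignoring variant, and `isna` and `count` are handy for seeing how many values are missing:

```python
import numpy as np
import pandas as pd

values = [10, 20, np.nan, 40, 50]
a = np.array(values)
s = pd.Series(values)

print(a.mean())           # nan: NumPy propagates nan
print(np.nanmean(a))      # 30.0: NumPy's nan-ignoring variant
print(s.mean())           # 30.0: Pandas skips nan by default
print(s.isna().sum())     # 1: number of nan values
print(len(s), s.count())  # 5 4: count() ignores nan, len() does not
```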

In [107]:
# what if you really want the nan values, and want them to mess up your calculation?
s.mean(skipna=False)

nan

In [111]:
s.mean()   # implicitly, skipna=True

30.0

In [112]:
s.dropna().mean()

30.0

In [108]:
s

0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
dtype: float64

In [109]:
# if I want to get rid of the nan values, I can use "dropna"

s.dropna()  # we get a new series back, without any nan values

0    10.0
1    20.0
3    40.0
4    50.0
dtype: float64

In [110]:
s  # s is totally unchanged -- dropna returns a new series, not touching/changing the original one

0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
dtype: float64

In [113]:
# fillna replaces all nan values with another value -- it returns a new series
s

0    10.0
1    20.0
2     NaN
3    40.0
4    50.0
dtype: float64

In [114]:
s.fillna(999)

0     10.0
1     20.0
2    999.0
3     40.0
4     50.0
dtype: float64

In [115]:
s.fillna(s.mean())

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float64

# dtypes are similar to NumPy

In [125]:
s = Series([10, 20, 30, 40, 50], dtype=np.float16)

In [126]:
s

0    10.0
1    20.0
2    30.0
3    40.0
4    50.0
dtype: float16

In [127]:
# how can I now turn my series into a dtype of np.int64?
# you cannot set it -- but you can use "astype", just like with NumPy

In [128]:
s.astype(np.int64)

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [129]:
s = s.astype(np.int64)

In [130]:
s

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [131]:
s.loc[3] = 12.34

In [132]:
s

0    10.00
1    20.00
2    30.00
3    12.34
4    50.00
dtype: float64

# Exercises with dtypes and `nan`

1. Create a series containing 10 random *floats* from 0-1,000.
2. Replace those numbers whose integer part is even with `nan`.
3. Replace those `nan` values with the mean of the remaining numbers.

In [134]:
np.random.seed(0)
s = Series(np.random.rand(10) * 1000)
s

0    548.813504
1    715.189366
2    602.763376
3    544.883183
4    423.654799
5    645.894113
6    437.587211
7    891.773001
8    963.662761
9    383.441519
dtype: float64

In [139]:
# find the numbers whose integer part is even, and replace them with nan

s.loc[s.astype(np.int64)%2==0] = np.nan

In [140]:
s

0           NaN
1    715.189366
2           NaN
3           NaN
4    423.654799
5    645.894113
6    437.587211
7    891.773001
8    963.662761
9    383.441519
dtype: float64

In [142]:
s.fillna(s.mean())

0    637.314681
1    715.189366
2    637.314681
3    637.314681
4    423.654799
5    645.894113
6    437.587211
7    891.773001
8    963.662761
9    383.441519
dtype: float64

In [143]:
# actually assign the new series back to s
s = s.fillna(s.mean())

In [144]:
s

0    637.314681
1    715.189366
2    637.314681
3    637.314681
4    423.654799
5    645.894113
6    437.587211
7    891.773001
8    963.662761
9    383.441519
dtype: float64

In [146]:
s = Series(np.random.randint(0, 100, 50))
s

0     39
1     87
2     46
3     88
4     81
5     37
6     25
7     77
8     72
9      9
10    20
11    80
12    69
13    79
14    47
15    64
16    82
17    99
18    88
19    49
20    29
21    19
22    19
23    14
24    39
25    32
26    65
27     9
28    57
29    32
30    31
31    74
32    23
33    35
34    75
35    55
36    28
37    34
38     0
39     0
40    36
41    53
42     5
43    38
44    17
45    79
46     4
47    42
48    58
49    31
dtype: int64

In [147]:
# descriptive statistics
s.min()

0

In [148]:
s.max()

99

In [149]:
s.mean()

45.42

In [150]:
s.std()

27.180492704759224

In [151]:
s.quantile(0.25)   # 25% mark

25.75

In [152]:
s.quantile(0.5)    # 50% mark, aka median

39.0

In [153]:
s.quantile(0.75)

71.25

In [154]:
# having all of these descriptive statistics for our series would be really great
# fortunately, Pandas provides us with such a method, namely .describe

s.describe()

count    50.000000
mean     45.420000
std      27.180493
min       0.000000
25%      25.750000
50%      39.000000
75%      71.250000
max      99.000000
dtype: float64

In [155]:
# another useful method is .head, which shows the first part of a series
s.head()

0    39
1    87
2    46
3    88
4    81
dtype: int64

In [156]:
s.head(3)

0    39
1    87
2    46
dtype: int64

In [157]:
# similarly, we have tail
s.tail()

45    79
46     4
47    42
48    58
49    31
dtype: int64

In [158]:
s.tail(3)

47    42
48    58
49    31
dtype: int64

In [159]:
# how often does each value appear?
s.value_counts()

39    2
9     2
31    2
19    2
79    2
0     2
32    2
88    2
55    1
42    1
74    1
23    1
35    1
75    1
38    1
5     1
4     1
34    1
57    1
36    1
17    1
53    1
28    1
14    1
65    1
87    1
46    1
81    1
37    1
25    1
77    1
72    1
20    1
80    1
69    1
47    1
64    1
82    1
99    1
49    1
29    1
58    1
dtype: int64

In [160]:
s = Series([10, 20, 30, 10, 10, 30, np.nan])

In [161]:
s.value_counts()

10.0    3
30.0    2
20.0    1
dtype: int64

In [163]:
s.value_counts(dropna=False)

10.0    3
30.0    2
20.0    1
NaN     1
dtype: int64
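
If you want proportions rather than raw counts, `value_counts` also accepts `normalize=True`. A short sketch with made-up values:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 10, 10, 30])

shares = s.value_counts(normalize=True)   # fractions instead of counts
print(shares)
```

The fractions add up to 1, so multiplying by 100 gives percentages.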

# Exercises: Weather forecast

1. Create a series in which the values are the forecast high temperatures in the next 10-12 days.  The index should be a list of strings containing the dates in 'MMDD' format.
2. What 2 temperatures occur the greatest number of times?
3. Get descriptive statistics for the weather forecast.

In [165]:
s = Series([30, 31, 33, 34, 34, 30, 28, 29, 28, 29, 29, 31, 31, 32],
          index='0607 0608 0609 0610 0611 0612 0613 0614 0615 0616 0617 0618 0619 0620'.split())

In [166]:
s

0607    30
0608    31
0609    33
0610    34
0611    34
0612    30
0613    28
0614    29
0615    28
0616    29
0617    29
0618    31
0619    31
0620    32
dtype: int64

In [167]:
s.value_counts()

31    3
29    3
30    2
34    2
28    2
33    1
32    1
dtype: int64

In [168]:
s.value_counts().head(2)

31    3
29    3
dtype: int64

In [169]:
s.describe()

count    14.000000
mean     30.642857
std       2.023217
min      28.000000
25%      29.000000
50%      30.500000
75%      31.750000
max      34.000000
dtype: float64

# Sorting

Yesterday, we saw that we can sort a NumPy array using `np.sort`.

In Pandas, we actually have two different ways to sort a series:

- Sort by values
- Sort by the index

In [171]:
s.sort_values()  # returns a new series, with the old series' index and values, sorted ascending

0613    28
0615    28
0614    29
0616    29
0617    29
0607    30
0612    30
0608    31
0618    31
0619    31
0620    32
0609    33
0610    34
0611    34
dtype: int64

In [172]:
s.sort_values(ascending=False)

0610    34
0611    34
0609    33
0620    32
0608    31
0618    31
0619    31
0607    30
0612    30
0614    29
0616    29
0617    29
0613    28
0615    28
dtype: int64

In [174]:
# What are the three highest temperatures we'll see?
s.sort_values(ascending=False).head(3)

0610    34
0611    34
0609    33
dtype: int64
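
As an alternative sketch, `nlargest` gets the top values in a single call, without sorting the whole series first. The shortened forecast below is only for illustration:

```python
import pandas as pd

s = pd.Series([30, 31, 33, 34, 34, 30, 28],
              index='0607 0608 0609 0610 0611 0612 0613'.split())

top3 = s.nlargest(3)          # the three highest values, largest first
print(top3)
```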

In [175]:
# let's sort by the index, instead

s.sort_index()

0607    30
0608    31
0609    33
0610    34
0611    34
0612    30
0613    28
0614    29
0615    28
0616    29
0617    29
0618    31
0619    31
0620    32
dtype: int64

In [176]:
s.sort_index(ascending=False)

0620    32
0619    31
0618    31
0617    29
0616    29
0615    28
0614    29
0613    28
0612    30
0611    34
0610    34
0609    33
0608    31
0607    30
dtype: int64

In [177]:
help(s.sort_values)

Help on method sort_values in module pandas.core.series:

sort_values(axis=0, ascending: 'bool | int | Sequence[bool | int]' = True, inplace: 'bool' = False, kind: 'str' = 'quicksort', na_position: 'str' = 'last', ignore_index: 'bool' = False, key: 'ValueKeyFunc' = None) method of pandas.core.series.Series instance
    Sort by the values.
    
    Sort a Series in ascending or descending order by some
    criterion.
    
    Parameters
    ----------
    axis : {0 or 'index'}, default 0
        Axis to direct sorting. The value 'index' is accepted for
        compatibility with DataFrame.sort_values.
    ascending : bool or list of bools, default True
        If True, sort values in ascending order, otherwise descending.
    inplace : bool, default False
        If True, perform operation in-place.
    kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'
        Choice of sorting algorithm. See also :func:`numpy.sort` for more
        information. 'mergesort' an

# Strings

Python offers powerful strings, and NumPy arrays can contain strings.

In [178]:
a = np.array('this is a test'.split())

In [179]:
a

array(['this', 'is', 'a', 'test'], dtype='<U4')

In [180]:
a[0] = 'pqrstuv'

In [181]:
a

array(['pqrs', 'is', 'a', 'test'], dtype='<U4')

In [184]:
s = Series('this is a test'.split())  # dtype will be "object", meaning: Python objects

In [185]:
s

0    this
1      is
2       a
3    test
dtype: object

In [186]:
# how many characters are in each word?
# len(s) doesn't tell us -- it returns the number of elements in the series

len(s)

4

# Working with strings in Pandas

A Pandas series has a `.str` attribute, which is used to access string functionality. This functionality includes most of the methods provided by the Python `str` class, plus a bunch of other convenience methods that you might want to use.

So *never* use a `for` loop on your Pandas series.  Instead, find the appropriate method via `.str`, and apply it on the series.
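
To make the contrast concrete, here is a small sketch that does the same job both ways. The loop version is shown only for comparison, and the variable names are illustrative:

```python
import pandas as pd

s = pd.Series('this is a test'.split())

# Loop-based version: works, but leaves Pandas behind and returns a plain list
upper_loop = [word.upper() for word in s]

# Vectorized versions through the .str accessor: each returns a new series
upper_str = s.str.upper()
starts_with_t = s.str.startswith('t')

print(upper_str.tolist())       # ['THIS', 'IS', 'A', 'TEST']
print(starts_with_t.tolist())   # [True, False, False, True]
```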

In [187]:
s.str.len()   # this means: run len() on each element of s, and return a series with its results

0    4
1    2
2    1
3    4
dtype: int64

In [188]:
# another method: contains

s.str.contains('a')  # sort of like the "in" operator -- does the argument string appear in each element of s?

0    False
1    False
2     True
3    False
dtype: bool

In [189]:
s.str.contains('e')

0    False
1    False
2    False
3     True
dtype: bool

In [190]:
# contains can actually look for a regular expression

s.str.contains('[ae]')  # does it contain either a or e?

0    False
1    False
2     True
3     True
dtype: bool

In [191]:
s.str.isdigit()  # can I turn each of these elements into an integer?

0    False
1    False
2    False
3    False
dtype: bool

In [192]:
s = Series('ab 12 cd 34'.split())
s

0    ab
1    12
2    cd
3    34
dtype: object

In [193]:
s.astype(np.int64)  # let's convert our series to a bunch of integers

ValueError: invalid literal for int() with base 10: 'ab'

In [197]:
# let's find all of the int-able (digits-only) strings, retrieve them,
# and then turn them into integers

s.loc[s.str.isdigit()].astype(np.int64).sum()

46
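
Another possible approach, sketched here as an alternative to filtering with `isdigit`, is `pd.to_numeric` with `errors='coerce'`. It turns anything it cannot parse into `nan`, which the aggregate methods then skip. Note that it also accepts strings such as `'-5'` or `'1.5'`, which `isdigit` rejects, so the two approaches are not interchangeable:

```python
import pandas as pd

s = pd.Series('ab 12 cd 34'.split())

numbers = pd.to_numeric(s, errors='coerce')   # 'ab' and 'cd' become nan
print(numbers)
print(numbers.sum())    # 46.0: the nan values are skipped
```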

In [198]:
# what happens if we sum the elements in s?
s.sum()

'ab12cd34'

# Exercises: Strings and series

1. Create a series containing 10 words of different lengths.
2. Find the words whose lengths are odd.
3. Find the words whose lengths are greater than the mean word length (i.e., longer than average).
4. Find words that contain the letter 't'.

In [199]:
s = Series('this is a fantastic example of brilliant writing for my course'.split())

In [200]:
s

0          this
1            is
2             a
3     fantastic
4       example
5            of
6     brilliant
7       writing
8           for
9            my
10       course
dtype: object

In [201]:
s.values

array(['this', 'is', 'a', 'fantastic', 'example', 'of', 'brilliant',
       'writing', 'for', 'my', 'course'], dtype=object)

In [202]:
s.str.len()

0     4
1     2
2     1
3     9
4     7
5     2
6     9
7     7
8     3
9     2
10    6
dtype: int64

In [204]:
# get a boolean series in which True says, "This element has an odd length"

s.str.len() % 2 == 1

0     False
1     False
2      True
3      True
4      True
5     False
6      True
7      True
8      True
9     False
10    False
dtype: bool

In [205]:
# apply that boolean series to s.loc, and thus get only those words with odd lengths from s
s.loc[s.str.len() % 2 == 1]

2            a
3    fantastic
4      example
6    brilliant
7      writing
8          for
dtype: object

In [207]:
# get the mean word length
s.str.len().mean()

4.7272727272727275

In [208]:
# get a boolean series based on this, for words longer than the mean length 
s.str.len() > s.str.len().mean()

0     False
1     False
2     False
3      True
4      True
5     False
6      True
7      True
8     False
9     False
10     True
dtype: bool

In [209]:
# use this boolean series as a mask index to get only those words in s that are longer than the mean
s.loc[s.str.len() > s.str.len().mean()]

3     fantastic
4       example
6     brilliant
7       writing
10       course
dtype: object

In [211]:
# find words that contain the letter 't'

s.loc[s.str.contains('t')]

0         this
3    fantastic
6    brilliant
7      writing
dtype: object

In [212]:
s = Series('123 456 789'.split())
s

0    123
1    456
2    789
dtype: object

In [213]:
s.mean()

41152263.0

In [214]:
# what is the mean? It's the sum / the length

s.sum()

'123456789'

In [215]:
123456789 / 3

41152263.0

# Next up

1. Data frames
    - Creating them
    - Retrieving from them
    - Applying the methods we've learned for series on data frames
2. Loading and storing data in different formats    

# Data frames

A data frame is a 2D data structure. It has columns and rows.

The more important thing to understand is that each column is a Pandas series.  Each column has a dtype.

We can create data frames in a few different ways:

- List of lists (or list of series, or a 2D NumPy array)
- List of dicts
- Dict of lists
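
A brief sketch of the last two forms, with made-up values and column names. In a list of dicts, each dict is a row; in a dict of lists, each key is a column:

```python
import pandas as pd

# List of dicts: each dict is one row; the keys become column names
df_rows = pd.DataFrame([{'a': 10, 'b': 20},
                        {'a': 30, 'b': 40}])

# Dict of lists: each key is a column name; each list holds that column's values
df_cols = pd.DataFrame({'a': [10, 30],
                        'b': [20, 40]})

print(df_rows)
print(df_cols.dtypes)             # one dtype per column
print(df_rows.equals(df_cols))    # both approaches build the same data frame
```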


In [216]:
# 3 rows and 4 columns as a list of lists

df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]])

In [217]:
df

    0    1    2    3
0  10   20   30   40
1  50   60   70   80
2  90  100  110  120


In [218]:
# the rows are the index, just like before with series, and can be set in the same way (index=...)
# the columns are set in the same way, but with columns=...

df = DataFrame([[10, 20, 30, 40],
                [50, 60, 70, 80],
                [90, 100, 110, 120]],
               index=list('xyz'),
               columns=list('abcd'))

In [219]:
df

    a    b    c    d
x  10   20   30   40
y  50   60   70   80
z  90  100  110  120


In [220]:
df.index

Index(['x', 'y', 'z'], dtype='object')

In [221]:
df.columns

Index(['a', 'b', 'c', 'd'], dtype='object')

# Retrieving from a data frame

In a data frame, we can use `[]` to retrieve the *columns*.

We can retrieve the rows with our old friends `.loc` and `.iloc` using the index and the numeric position.
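
As a quick sketch of `.iloc` on a data frame, the snippet below rebuilds the same 3x4 frame and retrieves the same row by label and by position. Like `.loc`, `.iloc` accepts a row selector and a column selector, but both are positions:

```python
import pandas as pd

df = pd.DataFrame([[10, 20, 30, 40],
                   [50, 60, 70, 80],
                   [90, 100, 110, 120]],
                  index=list('xyz'),
                  columns=list('abcd'))

row_by_label = df.loc['y']        # row whose index label is 'y'
row_by_position = df.iloc[1]      # row at position 1 (the second row)

print(row_by_label.equals(row_by_position))   # True: same row either way
print(df.iloc[0, 2])              # 30: row position 0, column position 2
```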

In [222]:
df['a']  # this means: retrieve column a

x    10
y    50
z    90
Name: a, dtype: int64

In [223]:
# multiple columns via fancy indexing

df[['a', 'b']]

    a    b
x  10   20
y  50   60
z  90  100


In [224]:
# how do I retrieve rows? Mainly via .loc

df.loc['x']    # when you retrieve a row, you get a (temporary) series based on the items in that row

a    10
b    20
c    30
d    40
Name: x, dtype: int64

In [225]:
# fancy indexing with .loc

df.loc[['x', 'z']]

    a    b    c    d
x  10   20   30   40
z  90  100  110  120


In [228]:
# what if I want to retrieve a particular item in a row and a column?

df['c']['x']     # row 'x', column 'c'

30

# Don't retrieve in this way!

If you have multiple sets of square brackets, you're asking for trouble!

1. It'll take more time, because every `[]` is a separate method call.
2. It'll take more memory, because the intermediate result has to be created and held.
3. There will likely be an interim, temporary series created behind the scenes; if you try to assign to it, the assignment may never reach the original data frame.

Rule of thumb is: Don't use square brackets after square brackets.
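
The assignment problem is the one that bites hardest. A minimal sketch, using a small made-up data frame, of the form that reliably works: one `.loc` call with both selectors on the left side of the assignment.

```python
import pandas as pd

df = pd.DataFrame([[10, 20], [50, 60]],
                  index=list('xy'), columns=list('ab'))

# One .loc call with a row selector and a column selector:
# the assignment reaches df itself
df.loc['x', 'b'] = 999
print(df)

# The chained form, df['b']['x'] = 999, assigns through an intermediate
# series; depending on the Pandas version it may warn or leave df unchanged,
# so it is deliberately not executed here.
```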

In [229]:
# how should I do it?

df.loc['x', 'c']     # .loc takes two arguments -- a row selector and a column selector

30

In [230]:
df.loc[['x', 'y'], 'c']

x    30
y    70
Name: c, dtype: int64

In [231]:
df.loc[['x', 'y'], ['a', 'c']]

    a   c
x  10  30
y  50  70


In [232]:
df

    a    b    c    d
x  10   20   30   40
y  50   60   70   80
z  90  100  110  120


In [233]:
df['a'] > 40    # this will compare every element in df['a'] with 40, and return True/False

x    False
y     True
z     True
Name: a, dtype: bool

In [234]:
df.loc[df['a'] > 40]   # retrieve all rows from df where 'a' > 40

    a    b    c    d
y  50   60   70   80
z  90  100  110  120


In [235]:
#       row selector    column selector
df.loc[ df['a']>40,      ['a', 'c']   ] 

    a    c
y  50   70
z  90  110


# Exercises: Retrieving from data frames

1. Create a 6x6 data frame with rows 'uvwxyz' and columns 'abcdef'.  The values should be random integers from 0-1,000.  (Remember that you can use a 2D NumPy array to seed a data frame.)
2. Retrieve column `b`.
3. Retrieve row `v`.
4. Retrieve columns `b` and `d`.
5. Retrieve rows `u` and `y`.
6. Retrieve the item at row `x`, and column `d`.
7. Retrieve the items at rows `y` and `z`, and columns `a` and `c`.
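
One possible way to approach these, offered as a sketch rather than the only answer. The seed and the variable names are arbitrary choices, and `np.random.randint(0, 1000, ...)` follows the convention used earlier in these notes, so it produces values from 0 through 999:

```python
import numpy as np
import pandas as pd

np.random.seed(0)                         # arbitrary seed, for repeatable values
df = pd.DataFrame(np.random.randint(0, 1000, (6, 6)),   # 1. 6x6 data frame
                  index=list('uvwxyz'),
                  columns=list('abcdef'))

col_b = df['b']                           # 2. column b
row_v = df.loc['v']                       # 3. row v
cols_bd = df[['b', 'd']]                  # 4. columns b and d
rows_uy = df.loc[['u', 'y']]              # 5. rows u and y
item_xd = df.loc['x', 'd']                # 6. row x, column d
block = df.loc[['y', 'z'], ['a', 'c']]    # 7. rows y and z, columns a and c
print(block)
```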

