# Introduction to Pandas: the Python Data Analysis library
This is a short introduction to pandas, geared mainly for new users and adapted heavily from the "10 Minutes to Pandas" tutorial from http://pandas.pydata.org. You can see more complex recipes in the Pandas Cookbook: http://pandas.pydata.org/pandas-docs/stable/cookbook.html
## Initial setup
Let's start by importing 2 useful libraries: `pandas` and `numpy`

In [1]:
import pandas, numpy

You can create a `Series` by passing a list of values. By default, `pandas` will create an integer index.

In [3]:
s = pandas.Series([1,3,5,"data wrangling is awesome",8])
s

0                            1
1                            3
2                            5
3    data wrangling is awesome
4                            8
dtype: object
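
The `dtype: object` in the output above appears because the list mixes integers with a string. As a minimal sketch (the variable names `numbers` and `mixed` are illustrative, not part of the original notebook), compare what `pandas` infers for a purely numeric list:

```python
import pandas

# Only integers: pandas infers an integer dtype
numbers = pandas.Series([1, 3, 5, 8])

# One string in the list forces the general-purpose object dtype
mixed = pandas.Series([1, 3, 5, "data wrangling is awesome", 8])

print(numbers.dtype)
print(mixed.dtype)
```

Keeping a column to a single type generally makes numeric operations such as `.mean()` behave as expected.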

You can also use `pandas` to create a series of `datetime` objects. Let's make one for the week beginning September 9th, 2024:

In [4]:
dates = pandas.date_range('20240909',
                          periods = 7)
dates

DatetimeIndex(['2024-09-09', '2024-09-10', '2024-09-11', '2024-09-12',
               '2024-09-13', '2024-09-14', '2024-09-15'],
              dtype='datetime64[ns]', freq='D')

Now we'll create a `DataFrame` using the `dates` array as our index, fill it with some random values using `numpy`, and give the columns some labels.

In [6]:
df = pandas.DataFrame(numpy.random.randn(7,4), 
                      index = dates, 
                      columns = ['dog','cat','mouse','duck'])
df

Unnamed: 0,dog,cat,mouse,duck
2024-09-09,-2.10556,-0.042539,-0.528332,1.039608
2024-09-10,0.003013,0.470142,0.581258,-2.682506
2024-09-11,-0.393719,-1.855561,-1.684294,-2.943739
2024-09-12,-0.418953,-1.248454,0.59028,-1.503231
2024-09-13,0.248999,0.014043,0.707281,1.048916
2024-09-14,-0.705177,-1.468168,1.403105,0.215721
2024-09-15,0.136218,-3.372879,2.391618,-0.167365


It can also be useful to know how to create a `DataFrame` from a `dict` of objects. This comes in particularly handy when working with JSON-like structures.

In [7]:
df2 = pandas.DataFrame({ 'A' : 1.,
                     'B' : pandas.Timestamp('20130102'),
                     'C' : pandas.Series(1,index=list(range(5)),dtype='float32'),
                     'D' : numpy.array([3] * 5
                                    ,dtype='int32'),
                     'E' : pandas.Categorical(["test","train","blah","train","blah"]),
                     'F' : 'foo' })
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,blah,foo
3,1.0,2013-01-02,1.0,3,train,foo
4,1.0,2013-01-02,1.0,3,blah,foo


## Exploring the data in a DataFrame
We can access the data types of each column in a `DataFrame` as follows:

In [8]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

We can display the `index`, `columns`, and the underlying NumPy `values` separately:

In [9]:
df2.index

Index([0, 1, 2, 3, 4], dtype='int64')

In [10]:
df2.columns

Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')

In [11]:
df2.values

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'blah', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'blah', 'foo']],
      dtype=object)

To get a quick statistical summary of your data, use the `.describe()` method:

In [12]:
df2.describe()

Unnamed: 0,A,B,C,D
count,5.0,5,5.0,5.0
mean,1.0,2013-01-02 00:00:00,1.0,3.0
min,1.0,2013-01-02 00:00:00,1.0,3.0
25%,1.0,2013-01-02 00:00:00,1.0,3.0
50%,1.0,2013-01-02 00:00:00,1.0,3.0
75%,1.0,2013-01-02 00:00:00,1.0,3.0
max,1.0,2013-01-02 00:00:00,1.0,3.0
std,0.0,,0.0,0.0


## Some basic data transformations
`DataFrames` have a built-in transpose:

In [13]:
df.T

Unnamed: 0,2024-09-09,2024-09-10,2024-09-11,2024-09-12,2024-09-13,2024-09-14,2024-09-15
dog,-2.10556,0.003013,-0.393719,-0.418953,0.248999,-0.705177,0.136218
cat,-0.042539,0.470142,-1.855561,-1.248454,0.014043,-1.468168,-3.372879
mouse,-0.528332,0.581258,-1.684294,0.59028,0.707281,1.403105,2.391618
duck,1.039608,-2.682506,-2.943739,-1.503231,1.048916,0.215721,-0.167365


We can also sort a `DataFrame` along a given data dimension. For example, we might want to `sort` by the values in the `duck` column:

In [14]:
df.sort_values(by = "duck")

Unnamed: 0,dog,cat,mouse,duck
2024-09-11,-0.393719,-1.855561,-1.684294,-2.943739
2024-09-10,0.003013,0.470142,0.581258,-2.682506
2024-09-12,-0.418953,-1.248454,0.59028,-1.503231
2024-09-15,0.136218,-3.372879,2.391618,-0.167365
2024-09-14,-0.705177,-1.468168,1.403105,0.215721
2024-09-09,-2.10556,-0.042539,-0.528332,1.039608
2024-09-13,0.248999,0.014043,0.707281,1.048916


We can also sort the rows `(axis=0)` and columns `(axis=1)` by their index/header values:

In [15]:
df.sort_index(axis = 0, 
              ascending = False)

Unnamed: 0,dog,cat,mouse,duck
2024-09-15,0.136218,-3.372879,2.391618,-0.167365
2024-09-14,-0.705177,-1.468168,1.403105,0.215721
2024-09-13,0.248999,0.014043,0.707281,1.048916
2024-09-12,-0.418953,-1.248454,0.59028,-1.503231
2024-09-11,-0.393719,-1.855561,-1.684294,-2.943739
2024-09-10,0.003013,0.470142,0.581258,-2.682506
2024-09-09,-2.10556,-0.042539,-0.528332,1.039608


In [16]:
df.sort_index(axis = 1)

Unnamed: 0,cat,dog,duck,mouse
2024-09-09,-0.042539,-2.10556,1.039608,-0.528332
2024-09-10,0.470142,0.003013,-2.682506,0.581258
2024-09-11,-1.855561,-0.393719,-2.943739,-1.684294
2024-09-12,-1.248454,-0.418953,-1.503231,0.59028
2024-09-13,0.014043,0.248999,1.048916,0.707281
2024-09-14,-1.468168,-0.705177,0.215721,1.403105
2024-09-15,-3.372879,0.136218,-0.167365,2.391618


## Selection
To see only the first few rows of a `DataFrame`, use the `.head()` method:

In [17]:
df.head()

Unnamed: 0,dog,cat,mouse,duck
2024-09-09,-2.10556,-0.042539,-0.528332,1.039608
2024-09-10,0.003013,0.470142,0.581258,-2.682506
2024-09-11,-0.393719,-1.855561,-1.684294,-2.943739
2024-09-12,-0.418953,-1.248454,0.59028,-1.503231
2024-09-13,0.248999,0.014043,0.707281,1.048916


To view only the last few rows, use the `.tail()` method. Note that by default, both `.head()` and `.tail()` return 5 rows. You can also specify the number you want by passing in an integer.

In [18]:
df.tail(3)

Unnamed: 0,dog,cat,mouse,duck
2024-09-13,0.248999,0.014043,0.707281,1.048916
2024-09-14,-0.705177,-1.468168,1.403105,0.215721
2024-09-15,0.136218,-3.372879,2.391618,-0.167365


Selecting a single column via indexing yields a `Series`:

In [19]:
df2['A'] # a lot like df$A in R

0    1.0
1    1.0
2    1.0
3    1.0
4    1.0
Name: A, dtype: float64

We can also select a subset of the rows using slicing, for example by integer position:

In [20]:
df[2:5]

Unnamed: 0,dog,cat,mouse,duck
2024-09-11,-0.393719,-1.855561,-1.684294,-2.943739
2024-09-12,-0.418953,-1.248454,0.59028,-1.503231
2024-09-13,0.248999,0.014043,0.707281,1.048916
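
Rows can be sliced by index label as well as by position. The sketch below uses a stand-in frame called `demo`, filled with made-up integers rather than the random values above, and assumes label slicing through `.loc[]`:

```python
import numpy
import pandas

dates = pandas.date_range('20240909', periods=7)

# Illustrative data only; `demo` is a hypothetical stand-in for `df`
demo = pandas.DataFrame(numpy.arange(28).reshape(7, 4),
                        index=dates,
                        columns=['dog', 'cat', 'mouse', 'duck'])

# Positional slice: rows 2, 3 and 4 (the end position is excluded)
by_position = demo[2:5]

# Label slice: both endpoints are included
by_label = demo.loc['2024-09-11':'2024-09-13']

print(by_label)
```

Both slices return the same three rows here. The difference to keep in mind is that positional slices exclude the end point, while label slices include it.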


To select more than one column at a time, try `.loc[]`:

In [21]:
df.loc[:, ['cat', 'dog']]

Unnamed: 0,cat,dog
2024-09-09,-0.042539,-2.10556
2024-09-10,0.470142,0.003013
2024-09-11,-1.855561,-0.393719
2024-09-12,-1.248454,-0.418953
2024-09-13,0.014043,0.248999
2024-09-14,-1.468168,-0.705177
2024-09-15,-3.372879,0.136218


And of course, you might want to do both at the same time:

In [24]:
df.loc['20240909':'20240911', ['cat', 'duck']]

Unnamed: 0,cat,duck
2024-09-09,-0.042539,1.039608
2024-09-10,0.470142,-2.682506
2024-09-11,-1.855561,-2.943739


## Boolean Indexing
Sometimes it's useful to be able to select all rows that meet some criteria. For example, we might want all rows where the value in the `cat` column is greater than 0:

In [25]:
df[df['cat'] > 0]

Unnamed: 0,dog,cat,mouse,duck
2024-09-10,0.003013,0.470142,0.581258,-2.682506
2024-09-13,0.248999,0.014043,0.707281,1.048916
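
Several conditions can be combined in one filter. This is a small sketch on made-up data (the `demo` frame is hypothetical): use `&` for "and", `|` for "or", and wrap each condition in parentheses.

```python
import pandas

# Illustrative data only
demo = pandas.DataFrame({'cat': [-0.5, 0.4, 1.2, -2.0],
                         'duck': [1.0, -2.5, 0.3, -0.7]})

# Rows where both columns are positive
both_positive = demo[(demo['cat'] > 0) & (demo['duck'] > 0)]

# Rows where at least one column is positive
either_positive = demo[(demo['cat'] > 0) | (demo['duck'] > 0)]

print(both_positive)
print(either_positive)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.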


Or perhaps we'd like to eliminate all negative values:

In [26]:
nonneg = df[df > 0]
nonneg

Unnamed: 0,dog,cat,mouse,duck
2024-09-09,,,,1.039608
2024-09-10,0.003013,0.470142,0.581258,
2024-09-11,,,,
2024-09-12,,,0.59028,
2024-09-13,0.248999,0.014043,0.707281,1.048916
2024-09-14,,,1.403105,0.215721
2024-09-15,0.136218,,2.391618,


And then maybe we'd like to drop all the rows with missing values:

In [27]:
nonneg.dropna()

Unnamed: 0,dog,cat,mouse,duck
2024-09-13,0.248999,0.014043,0.707281,1.048916


Oops... maybe not. How about we set them to 0 instead?

In [28]:
nonneg.fillna(value = 0)

Unnamed: 0,dog,cat,mouse,duck
2024-09-09,0.0,0.0,0.0,1.039608
2024-09-10,0.003013,0.470142,0.581258,0.0
2024-09-11,0.0,0.0,0.0,0.0
2024-09-12,0.0,0.0,0.59028,0.0
2024-09-13,0.248999,0.014043,0.707281,1.048916
2024-09-14,0.0,0.0,1.403105,0.215721
2024-09-15,0.136218,0.0,2.391618,0.0


But what if your values aren't numeric? No problem, we can also do filtering. First, let's copy the `DataFrame` and add a new column of nominal values:

In [29]:
df3 = df.copy()
df3['color'] = ['blue', 'green','red','blue','green','red','blue']
df3

Unnamed: 0,dog,cat,mouse,duck,color
2024-09-09,-2.10556,-0.042539,-0.528332,1.039608,blue
2024-09-10,0.003013,0.470142,0.581258,-2.682506,green
2024-09-11,-0.393719,-1.855561,-1.684294,-2.943739,red
2024-09-12,-0.418953,-1.248454,0.59028,-1.503231,blue
2024-09-13,0.248999,0.014043,0.707281,1.048916,green
2024-09-14,-0.705177,-1.468168,1.403105,0.215721,red
2024-09-15,0.136218,-3.372879,2.391618,-0.167365,blue


Now we can use the `.isin()` function to select only the rows with `green` or `blue` in the `color` column:

In [33]:
df3[df3['color'].isin(['green', 'blue'])]

Unnamed: 0,dog,cat,mouse,duck,color
2024-09-09,-2.10556,-0.042539,-0.528332,1.039608,blue
2024-09-10,0.003013,0.470142,0.581258,-2.682506,green
2024-09-12,-0.418953,-1.248454,0.59028,-1.503231,blue
2024-09-13,0.248999,0.014043,0.707281,1.048916,green
2024-09-15,0.136218,-3.372879,2.391618,-0.167365,blue


In [34]:
df3 # Note that none of the filtering operations change the original DataFrame

Unnamed: 0,dog,cat,mouse,duck,color
2024-09-09,-2.10556,-0.042539,-0.528332,1.039608,blue
2024-09-10,0.003013,0.470142,0.581258,-2.682506,green
2024-09-11,-0.393719,-1.855561,-1.684294,-2.943739,red
2024-09-12,-0.418953,-1.248454,0.59028,-1.503231,blue
2024-09-13,0.248999,0.014043,0.707281,1.048916,green
2024-09-14,-0.705177,-1.468168,1.403105,0.215721,red
2024-09-15,0.136218,-3.372879,2.391618,-0.167365,blue


## Prefer something more like `dplyr`?
The `dfply` module provides dplyr-style piping operations for `pandas DataFrames`, along with many familiar verbs. Note that `>>` is the `dfply` equivalent of `%>%`, and `X` represents the piped `DataFrame`.

### `select()`

In [36]:
from dfply import *

df2 >> select("A")

Unnamed: 0,A
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0


Just be careful about how Python handles line breaks:

In [40]:
# Broken
# df2 >> 
#    select(X.A)
    
# If you want to indent, you need parens
(
df2 >> 
   select('A')
)

Unnamed: 0,A
0,1.0
1,1.0
2,1.0
3,1.0
4,1.0


### Dropping columns with `drop()` or `select()`

`drop()` is a helper function that does the opposite of `select()`:

In [38]:
(df2 >> 
   drop(X.E, X.F))

Unnamed: 0,A,B,C,D
0,1.0,2013-01-02,1.0,3
1,1.0,2013-01-02,1.0,3
2,1.0,2013-01-02,1.0,3
3,1.0,2013-01-02,1.0,3
4,1.0,2013-01-02,1.0,3


You can also use `select()` along with the `~` (which means "not") to drop unwanted columns:

In [41]:
(df2 >> 
   select(~X.E, ~X.F))

Unnamed: 0,A,B,C,D
0,1.0,2013-01-02,1.0,3
1,1.0,2013-01-02,1.0,3
2,1.0,2013-01-02,1.0,3
3,1.0,2013-01-02,1.0,3
4,1.0,2013-01-02,1.0,3


### Filtering with `mask()`
`mask()` keeps all the rows where every criterion is true:

In [42]:
# this is just like filter() in R
(df3 >>
   mask(X.color.isin(['green','blue']), X.cat > 0))

Unnamed: 0,dog,cat,mouse,duck,color
2024-09-10,0.003013,0.470142,0.581258,-2.682506,green
2024-09-13,0.248999,0.014043,0.707281,1.048916,green


### Add new columns with `mutate()`

In [43]:
(df3 >>
   mutate(platypus = (X.cat + X.duck)/2))

Unnamed: 0,dog,cat,mouse,duck,color,platypus
2024-09-09,-2.10556,-0.042539,-0.528332,1.039608,blue,0.498534
2024-09-10,0.003013,0.470142,0.581258,-2.682506,green,-1.106182
2024-09-11,-0.393719,-1.855561,-1.684294,-2.943739,red,-2.39965
2024-09-12,-0.418953,-1.248454,0.59028,-1.503231,blue,-1.375843
2024-09-13,0.248999,0.014043,0.707281,1.048916,green,0.531479
2024-09-14,-0.705177,-1.468168,1.403105,0.215721,red,-0.626224
2024-09-15,0.136218,-3.372879,2.391618,-0.167365,blue,-1.770122
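
If `dfply` is not available, a similar pipeline can be approximated with plain `pandas` method chaining. This is only a sketch on a small made-up frame (`demo` is hypothetical); `.loc[]` with a callable plays the role of `mask()`, `.assign()` the role of `mutate()`, and `.drop()` the role of `drop()`:

```python
import pandas

# Illustrative data only
demo = pandas.DataFrame({'cat': [-0.5, 0.4, 1.2],
                         'duck': [1.0, -2.0, 0.2],
                         'color': ['blue', 'green', 'red']})

result = (
    demo
    # keep rows that satisfy both conditions, like mask()
    .loc[lambda d: d['color'].isin(['green', 'blue']) & (d['cat'] > 0)]
    # add a computed column, like mutate()
    .assign(platypus=lambda d: (d['cat'] + d['duck']) / 2)
    # remove a column, like drop()
    .drop(columns=['color'])
)

print(result)
```

As with the `dfply` examples, the original frame is left unchanged; each step returns a new `DataFrame`.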


## Basic Math
Helper methods on the `DataFrame` object make it straightforward to calculate things like the `.mean()` across all numeric columns:

In [44]:
df.mean(axis = 0)

dog     -0.462168
cat     -1.071917
mouse    0.494417
duck    -0.713228
dtype: float64

We can also perform the same operation on individual rows:

In [45]:
df.mean(axis = 1)

2024-09-09   -0.409206
2024-09-10   -0.407023
2024-09-11   -1.719328
2024-09-12   -0.645090
2024-09-13    0.504810
2024-09-14   -0.138630
2024-09-15   -0.253102
Freq: D, dtype: float64

`.median()` also behaves as expected:

In [46]:
df.median(axis = 0)

dog     -0.393719
cat     -1.248454
mouse    0.590280
duck    -0.167365
dtype: float64

No helper function for what you need? No problem. You can use the `.apply()` method to apply any function to the data. The default is `axis=0`, which applies the function to each column. For example, we might want to perform a cumulative summation (thanks, `numpy`!):

In [50]:
df.apply(numpy.cumsum)

Unnamed: 0,dog,cat,mouse,duck
2024-09-09,-2.10556,-0.042539,-0.528332,1.039608
2024-09-10,-2.102546,0.427603,0.052925,-1.642898
2024-09-11,-2.496265,-1.427958,-1.631368,-4.586637
2024-09-12,-2.915218,-2.676412,-1.041088,-6.089868
2024-09-13,-2.66622,-2.662369,-0.333807,-5.040952
2024-09-14,-3.371397,-4.130537,1.069298,-4.825231
2024-09-15,-3.235179,-7.503416,3.460916,-4.992595


Or apply your own function, such as finding the spread (`max` value - `min` value):

In [54]:
df.apply(lambda x: x.max() - x.min())

dog      2.354558
cat      3.843022
mouse    4.075912
duck     3.992654
dtype: float64
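
The same function can be applied across each row by passing `axis=1`. A brief sketch with made-up numbers (the `demo` frame and the `spread` helper are hypothetical names):

```python
import pandas

# Illustrative data only
demo = pandas.DataFrame({'dog': [1.0, 4.0],
                         'cat': [3.0, -2.0],
                         'duck': [0.5, 1.0]})

def spread(values):
    """Difference between the largest and smallest value."""
    return values.max() - values.min()

# Default axis=0: one result per column
col_spread = demo.apply(spread)

# axis=1: one result per row
row_spread = demo.apply(spread, axis=1)

print(col_spread)
print(row_spread)
```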

## Combining DataFrames
Combining `DataFrame` objects can be done using simple concatenation (provided they have the same columns):

In [55]:
frame_one = pandas.DataFrame(numpy.random.randn(5, 4))
frame_one

Unnamed: 0,0,1,2,3
0,1.836785,0.166022,-0.157606,1.879076
1,-1.523137,0.596956,3.116997,0.622276
2,0.603631,-0.559675,-0.273719,0.549281
3,1.191431,-2.194677,0.106545,0.863456
4,-1.202323,-0.702567,-1.078121,-0.915032


In [56]:
frame_two = pandas.DataFrame(numpy.random.randn(5, 4))
frame_two

Unnamed: 0,0,1,2,3
0,-0.261409,-1.106251,-1.270702,-1.476162
1,-0.473184,-0.168612,-0.841774,-0.224236
2,-0.000337,0.661093,-0.204945,0.854811
3,-0.598379,1.45713,0.969852,0.795908
4,1.303509,-0.756101,-1.56232,-0.827245


In [57]:
pandas.concat([frame_one, frame_two])

Unnamed: 0,0,1,2,3
0,1.836785,0.166022,-0.157606,1.879076
1,-1.523137,0.596956,3.116997,0.622276
2,0.603631,-0.559675,-0.273719,0.549281
3,1.191431,-2.194677,0.106545,0.863456
4,-1.202323,-0.702567,-1.078121,-0.915032
0,-0.261409,-1.106251,-1.270702,-1.476162
1,-0.473184,-0.168612,-0.841774,-0.224236
2,-0.000337,0.661093,-0.204945,0.854811
3,-0.598379,1.45713,0.969852,0.795908
4,1.303509,-0.756101,-1.56232,-0.827245
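
Notice that the concatenated result above repeats the index labels 0 through 4. If the original labels are not meaningful, one option is `ignore_index=True`, sketched here on two tiny made-up frames (`small_one` and `small_two` are hypothetical names):

```python
import numpy
import pandas

# Illustrative data only
small_one = pandas.DataFrame(numpy.zeros((2, 3)))
small_two = pandas.DataFrame(numpy.ones((2, 3)))

# Default behaviour keeps each frame's own index labels: 0, 1, 0, 1
combined = pandas.concat([small_one, small_two])

# ignore_index=True renumbers the rows: 0, 1, 2, 3
renumbered = pandas.concat([small_one, small_two], ignore_index=True)

print(combined.index.tolist())
print(renumbered.index.tolist())
```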


If your `DataFrames` do not have an identical structure, but do share a common key, you can also perform a SQL-style join using the `.merge()` function:

In [58]:
left = pandas.DataFrame({'blah': ['foo', 'bar'], 
                         'lval': [1, 2]})
left

Unnamed: 0,blah,lval
0,foo,1
1,bar,2


In [59]:
right = pandas.DataFrame({'blah': ['foo', 'foo', 'bar'], 
                          'rval': [3, 4, 5]})
right

Unnamed: 0,blah,rval
0,foo,3
1,foo,4
2,bar,5


In [60]:
joined_data = pandas.merge(left, right, on = "blah")
joined_data

Unnamed: 0,blah,lval,rval
0,foo,1,3
1,foo,1,4
2,bar,2,5
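
By default, `merge()` performs an inner join, so keys that appear in only one frame are dropped. The sketch below adds a hypothetical `'baz'` row to the left frame (it is not in the original example) to show how the `how` argument changes the result:

```python
import pandas

# 'baz' is an illustrative extra key with no match on the right
left_demo = pandas.DataFrame({'blah': ['foo', 'bar', 'baz'],
                              'lval': [1, 2, 3]})
right_demo = pandas.DataFrame({'blah': ['foo', 'foo', 'bar'],
                               'rval': [3, 4, 5]})

# Inner join (the default): 'baz' disappears
inner = pandas.merge(left_demo, right_demo, on='blah')

# Left join: every left row is kept, with NaN where there is no match
left_join = pandas.merge(left_demo, right_demo, on='blah', how='left')

print(inner)
print(left_join)
```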


Don't worry, `dfply` lets us do this using `join`s too:

In [61]:
left >> inner_join(right, by = "blah")

Unnamed: 0,blah,lval,rval
0,foo,1,3
1,foo,1,4
2,bar,2,5


## Grouping
Sometimes when working with multivariate data, it's helpful to be able to condense the data along a certain dimension in order to perform a calculation more efficiently. Let's start by creating a somewhat messy `DataFrame`:

In [62]:
foo_bar = pandas.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                            'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
                            'C' : numpy.random.randn(8),
                            'D' : numpy.random.randn(8)})
foo_bar

Unnamed: 0,A,B,C,D
0,foo,one,0.15929,-0.821985
1,bar,one,1.38279,-0.92772
2,foo,two,-0.294717,0.2386
3,bar,three,0.538998,-0.158754
4,foo,two,2.54256,0.872077
5,bar,two,-0.737196,-1.041347
6,foo,one,0.351952,-1.879765
7,foo,three,1.047224,0.453515


Now let's group by column `A`, and `sum()` along the other columns:

In [63]:
foo_bar.groupby('A').sum()

A,B,C,D
bar,onethreetwo,1.184592,-2.127821
foo,onetwotwoonethree,3.80631,-1.137558


Note that column `B` doesn't make a ton of sense, because the summation operator concatenated all the strings. However, if we wanted to retain that information, we could perform the same operation using a hierarchical index:

In [64]:
grouped = foo_bar.groupby(['A','B']).sum()
grouped

A,B,C,D
bar,one,1.38279,-0.92772
bar,three,0.538998,-0.158754
bar,two,-0.737196,-1.041347
foo,one,0.511243,-2.70175
foo,three,1.047224,0.453515
foo,two,2.247843,1.110677
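
If the string column is simply not needed in the result, another option is to select the numeric columns before aggregating. A minimal sketch on a smaller made-up frame (`small` is a hypothetical name and the values are illustrative):

```python
import pandas

# Illustrative data only
small = pandas.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                          'B': ['one', 'one', 'two', 'three'],
                          'C': [1.0, 2.0, 3.0, 4.0],
                          'D': [10.0, 20.0, 30.0, 40.0]})

# Pick the columns to aggregate, then sum within each group of A
sums = small.groupby('A')[['C', 'D']].sum()

print(sums)
```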


The `stack()` function can be used to "compress" a level in the `DataFrame`'s columns:

In [65]:
stacked = grouped.stack()
stacked

A    B       
bar  one    C    1.382790
            D   -0.927720
     three  C    0.538998
            D   -0.158754
     two    C   -0.737196
            D   -1.041347
foo  one    C    0.511243
            D   -2.701750
     three  C    1.047224
            D    0.453515
     two    C    2.247843
            D    1.110677
dtype: float64

To reverse the operation and move the innermost index level of the stacked data back into the columns, you can call `.unstack()`:

In [66]:
stacked.unstack()

A,B,C,D
bar,one,1.38279,-0.92772
bar,three,0.538998,-0.158754
bar,two,-0.737196,-1.041347
foo,one,0.511243,-2.70175
foo,three,1.047224,0.453515
foo,two,2.247843,1.110677


## Time Series
`pandas` has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting per-second data into per-minute data). First, let's create an index of 350 `datetime` values at a frequency of 1 second:

In [67]:
rng = pandas.date_range('1/1/2021', 
                        periods = 350, 
                        freq = 'S')
rng

DatetimeIndex(['2021-01-01 00:00:00', '2021-01-01 00:00:01',
               '2021-01-01 00:00:02', '2021-01-01 00:00:03',
               '2021-01-01 00:00:04', '2021-01-01 00:00:05',
               '2021-01-01 00:00:06', '2021-01-01 00:00:07',
               '2021-01-01 00:00:08', '2021-01-01 00:00:09',
               ...
               '2021-01-01 00:05:40', '2021-01-01 00:05:41',
               '2021-01-01 00:05:42', '2021-01-01 00:05:43',
               '2021-01-01 00:05:44', '2021-01-01 00:05:45',
               '2021-01-01 00:05:46', '2021-01-01 00:05:47',
               '2021-01-01 00:05:48', '2021-01-01 00:05:49'],
              dtype='datetime64[ns]', length=350, freq='S')

Now we'll use that to create a time series, assigning a random integer to each element of the range:

In [68]:
time_series = pandas.Series(numpy.random.randint(0, 500, len(rng)), 
                            index = rng)
time_series.head()

2021-01-01 00:00:00    474
2021-01-01 00:00:01    304
2021-01-01 00:00:02    282
2021-01-01 00:00:03    181
2021-01-01 00:00:04    374
Freq: S, dtype: int64

Next, we'll resample the data by binning the one-second raw values into minutes (and summing the associated values):

In [69]:
time_series.resample('1Min').sum()

2021-01-01 00:00:00    13753
2021-01-01 00:01:00    14051
2021-01-01 00:02:00    16498
2021-01-01 00:03:00    14979
2021-01-01 00:04:00    13111
2021-01-01 00:05:00    12108
Freq: T, dtype: int64
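
Summing is only one way to aggregate each bin. As a sketch on a small made-up series (`demo_series` is a hypothetical name, and the frequencies are given as `Timedelta` objects here rather than string aliases), `.agg()` can compute several statistics per bin at once:

```python
import pandas

# Two minutes of per-second data with illustrative values 0..119
index = pandas.date_range('1/1/2021', periods=120,
                          freq=pandas.Timedelta(seconds=1))
demo_series = pandas.Series(range(120), index=index)

# One row per minute, one column per statistic
summary = (demo_series
           .resample(pandas.Timedelta(minutes=1))
           .agg(['sum', 'mean', 'max']))

print(summary)
```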

We also have support for time zone conversion. For example, if we assume the original `time_series` was in UTC:

In [70]:
ts_utc = time_series.tz_localize('UTC')
ts_utc.head()

2021-01-01 00:00:00+00:00    474
2021-01-01 00:00:01+00:00    304
2021-01-01 00:00:02+00:00    282
2021-01-01 00:00:03+00:00    181
2021-01-01 00:00:04+00:00    374
Freq: S, dtype: int64

We can easily convert it to Eastern time:

In [71]:
ts_utc.tz_convert('US/Eastern').head()

2020-12-31 19:00:00-05:00    474
2020-12-31 19:00:01-05:00    304
2020-12-31 19:00:02-05:00    282
2020-12-31 19:00:03-05:00    181
2020-12-31 19:00:04-05:00    374
Freq: S, dtype: int64

## Reading/Writing to files
Writing to a file works just like we'd expect:

In [72]:
ts_utc.to_csv("foo.csv")

As does reading:

In [73]:
new_frame = pandas.read_csv("foo.csv")
new_frame.head()

Unnamed: 0.1,Unnamed: 0,0
0,2021-01-01 00:00:00+00:00,474
1,2021-01-01 00:00:01+00:00,304
2,2021-01-01 00:00:02+00:00,282
3,2021-01-01 00:00:03+00:00,181
4,2021-01-01 00:00:04+00:00,374
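
In the output above, the timestamps come back as an ordinary column of strings. One way to restore them as the index is to pass `index_col` and `parse_dates` to `read_csv()`. This sketch writes to an in-memory buffer instead of a file and uses a short made-up series, so the names `demo_series` and `restored` are illustrative:

```python
import io
import pandas

index = pandas.date_range('1/1/2021', periods=3,
                          freq=pandas.Timedelta(seconds=1))
demo_series = pandas.Series([474, 304, 282], index=index, name='value')

# Write to an in-memory text buffer rather than a file on disk
buffer = io.StringIO()
demo_series.to_csv(buffer)
buffer.seek(0)

# Use the first column as the index and parse it as dates
restored = pandas.read_csv(buffer, index_col=0, parse_dates=True)

print(restored)
```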


We can also read/write to `.xlsx` format, if for some reason we want to work in Excel:

In [76]:
# Excel conversion doesn't support time zones, so we have to remove them first
ts_none = ts_utc.tz_localize(None)

ts_none.to_excel("foo.xlsx", sheet_name = "Sample 1")

But what if the data is a little... messy? Something like this:

In [77]:
broken_df = pandas.read_csv('bikes.csv')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 15: invalid continuation byte

No problem! The `read_csv()` function has lots of tools to help wrangle this mess. Here we'll:

- Change the column separator to `;`
- Set the encoding to `latin1` (the default is `utf-8`)
- Parse the dates in the `Date` column
- Tell it that our dates have the day first instead of the month first

In [78]:
fixed_df = pandas.read_csv('bikes.csv', 
                           sep = ';', 
                           encoding = 'latin1', 
                           parse_dates = ['Date'], 
                           dayfirst = True)
fixed_df.head()

Unnamed: 0,Date,Berri 1,Brébeuf (données non disponibles),Côte-Sainte-Catherine,Maisonneuve 1,Maisonneuve 2,du Parc,Pierre-Dupuy,Rachel1,St-Urbain (données non disponibles)
0,2012-01-01,35,,0,38,51,26,10,16,
1,2012-01-02,83,,1,68,153,53,6,43,
2,2012-01-03,135,,2,104,248,89,3,58,
3,2012-01-04,144,,1,116,318,111,8,61,
4,2012-01-05,197,,2,124,330,97,13,95,


## Scraping Data
Many of you will probably be interested in scraping data from the web for your projects. For example, what if we were interested in working with some data on municipalities in the state of Massachusetts? Well, we can get that from https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts

First, we'll import two useful libraries: `requests` (to query a URL) and `BeautifulSoup` (to parse the web page):

In [79]:
import requests
from bs4 import BeautifulSoup

Now let's request the page:

In [80]:
result = requests.get("https://en.wikipedia.org/wiki/List_of_municipalities_in_Massachusetts")
result

<Response [200]>

The response code `[200]` indicates that the request was successful. Now let's use `BeautifulSoup` to parse the web page into something semi-readable:

In [81]:
page = result.content
soup = BeautifulSoup(page, "html.parser")  # name the parser explicitly to avoid a warning
print(soup.prettify())

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of municipalities in Massachusetts - Wikipedia
  </title>
  <script>
   (function(){var className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature

Notice that there's a lot of code in there that isn't related to the data we're trying to get. If we inspect the page source of the table we're interested in, we see that it's a `<table>` element with `class="wikitable sortable static-row-numbers"`. Let's ask `BeautifulSoup` to give us just that piece of the page:

In [87]:
table = soup.find("table", { "class" : "wikitable sortable static-row-numbers" })
print(table.prettify())

<table class="wikitable sortable static-row-numbers" style="text-align:center">
 <tbody>
  <tr class="static-row-header" style="text-align:center;vertical-align:bottom;">
   <th>
    Municipality
   </th>
   <th>
    Type
    <sup class="reference" id="cite_ref-MMA_1-0">
     <a href="#cite_note-MMA-1">
      <span class="cite-bracket">
       [
      </span>
      1
      <span class="cite-bracket">
       ]
      </span>
     </a>
    </sup>
   </th>
   <th data-sort-type="text">
    County
    <sup class="reference" id="cite_ref-census_2-0">
     <a href="#cite_note-census-2">
      <span class="cite-bracket">
       [
      </span>
      2
      <span class="cite-bracket">
       ]
      </span>
     </a>
    </sup>
   </th>
   <th>
    Form of government
    <sup class="reference" id="cite_ref-MMA_1-1">
     <a href="#cite_note-MMA-1">
      <span class="cite-bracket">
       [
      </span>
      1
      <span class="cite-bracket">
       ]
      </span>
     </a>
    </sup>
   <

Excellent! Now we just need to pull out the data from the rows and columns of that table. Notice that there are 7 columns, each row is contained in a `<tr>` element (which stands for "table row"), and inside each row there is a `<td>` element for each entry. That's what we want, so let's loop through them:

In [88]:
import re # Might need to do some pattern matching with regular expressions

# Open a file to store the data
f = open('output.csv', 'w')
 
# First grab the headers
t_headers = []
for th in table.find_all("th"):
    # Remove any references in square brackets and extra spaces from left and right
    t_headers.append(re.sub(r'\[.*?\]', '', th.text).strip())

f.write(",".join(t_headers) + "\n")
    
# Loop over each row in the table
for row in table.find_all("tr"):
    
    # Find all the cells in the row
    cells = row.find_all("td")
    
    # If the row is complete (i.e. there are 7 cells)
    #    assign the value in each cell to the appropriate variable.
    if len(cells) == 7:
        
        Municipality = cells[0].find(text = True)
        
        Type = cells[1].find(text = True)
        
        County = cells[2].find(text = True)
        
        Form_of_government = cells[3].find(text = True)
        
        # This cell sometimes has a comma in it, let's get rid of that while we're here
        Population = cells[4].find(text = True).replace(',', '')
        
        Area = cells[5].find(text = True)
        
        Year_est = cells[6].find(text = True)
 
        # Concatenate all the cells together, separated by commas
        line = Municipality  + "," + Type + "," + County + "," + Form_of_government + "," + Population + "," + Area + "," + Year_est + "\n"
        
        # Append the line to the file
        f.write(line)
        
# Clean up when we're done
f.close()
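
Joining fields by hand with `","` works as long as no value contains a comma. As an alternative sketch, Python's built-in `csv` module quotes such values automatically. The rows below are made up for illustration (they are not scraped data), and an in-memory buffer stands in for `output.csv`:

```python
import csv
import io

# Hypothetical rows; note the comma inside the population value
rows = [
    ["Municipality", "Population"],
    ["Example Town", "12,345"],
]

# Write the rows; csv.writer adds quotes around fields that need them
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerows(rows)
text = buffer.getvalue()
print(text)

# Reading the text back recovers the original fields
parsed = list(csv.reader(io.StringIO(text)))
```

With this approach the comma could stay in the data and be handled later, for example when converting the column to a number.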

## And there you have it! You're ready to wrangle some data of your own.