## [ Time Series Basics ]
A basic kind of time series object in pandas is a `Series` indexed by timestamps, which are often represented outside of pandas as Python strings or `datetime` objects.

In [4]:
import numpy as np 
import pandas as pd 
from datetime import datetime

dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
         datetime(2011, 1, 7), datetime(2011, 1, 8),
         datetime(2011, 1, 10), datetime(2011, 1, 12)]

ts = pd.Series(np.random.standard_normal(6), index=dates)
ts

2011-01-02    0.672367
2011-01-05    0.735144
2011-01-07    0.527393
2011-01-08   -0.412556
2011-01-10   -1.455057
2011-01-12   -0.234365
dtype: float64

In [5]:
# under the hood, these datetime objects have been put in a DatetimeIndex
ts.index

# DatetimeIndex
    # it's an index made up of datetime values
    # used in time series DataFrames or Series
    # allows for easy date/time filtering, grouping, resampling, and frequency conversion
    # it makes pandas time-aware
    # it enhances performance and flexibility for time-related operations

DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
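As a small illustration of the "time-aware" point above, the sketch below builds a `DatetimeIndex` from plain date strings and reads calendar fields from it. The variable name `idx` and the three sample dates are chosen only for this example.

```python
import pandas as pd

# pd.to_datetime parses a list of date strings into a DatetimeIndex
idx = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07"])

print(type(idx).__name__)        # class name of the resulting index
print(idx.year.tolist())         # calendar year of each entry
print(idx.day_name().tolist())   # weekday name of each entry
```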

In [6]:
# as with any other Series, arithmetic between differently indexed time series automatically aligns on the dates

ts + ts[::2] # ts[::2] selects every second element in ts

2011-01-02    1.344735
2011-01-05         NaN
2011-01-07    1.054786
2011-01-08         NaN
2011-01-10   -2.910115
2011-01-12         NaN
dtype: float64
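If the `NaN` values produced by alignment are not wanted, one option is `Series.add` with `fill_value`. This is a minimal sketch on made-up values (`full` and `half` are names invented here), not part of the original notebook.

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=4)
full = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
half = full[::2]  # every second element, so two dates are missing

# Plain addition aligns on dates and yields NaN where `half` has no label
aligned = full + half

# add() with fill_value treats the missing side as 0 instead
filled = full.add(half, fill_value=0)

print(aligned)
print(filled)
```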

In [7]:
# pandas stores timestamps using NumPy's datetime64 data type, here at nanosecond resolution
ts.index.dtype

dtype('<M8[ns]')

In [9]:
# scalar values from a DatetimeIndex are pandas Timestamp objects
stamp = ts.index[0]
stamp

Timestamp('2011-01-02 00:00:00')

- `pd.Timestamp` can be substituted in most places where you would use a `datetime` object.
- The reverse is not true, however, because `pd.Timestamp` can store nanosecond precision data, while `datetime` stores only up to microseconds.
- Additionally, `pd.Timestamp` understands how to do time zone conversions and other kinds of manipulations. In older pandas versions it could also store frequency information (if any); that attribute was removed in pandas 2.0.
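The points above can be checked with a short sketch. The sample timestamp and the `Asia/Tokyo` zone are assumptions made for illustration; `warn=False` only silences the warning pandas raises when nanoseconds are dropped.

```python
from datetime import datetime

import pandas as pd

# A Timestamp carrying a single nanosecond
stamp = pd.Timestamp("2011-01-02 00:00:00.000000001")

is_datetime = isinstance(stamp, datetime)  # Timestamp subclasses datetime
nanos = stamp.nanosecond                   # nanosecond component

# Converting to a plain datetime discards the nanosecond part
py_dt = stamp.to_pydatetime(warn=False)

# Time zone handling: attach UTC, then convert to another zone
utc_stamp = pd.Timestamp("2011-01-02 00:00").tz_localize("UTC")
tokyo_stamp = utc_stamp.tz_convert("Asia/Tokyo")

print(is_datetime, nanos, py_dt, tokyo_stamp)
```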

## [ Indexing, Selection, Subsetting ]

In [12]:
# a time series behaves like any other Series when you index and select data based on the label
stamp = ts.index[2]
ts[stamp]

np.float64(0.5273930212284165)

In [13]:
# as a convenience, you can also pass a string that is interpretable as a date
ts["2011-01-10"]

np.float64(-1.4550573267274463)

For longer time series, a year, or a year and month, can be passed to easily select slices of data.

In [14]:
longer_ts = pd.Series(np.random.standard_normal(1000), index=pd.date_range("2000-01-01", periods=1000))
longer_ts

2000-01-01    0.493957
2000-01-02   -1.296484
2000-01-03    0.526959
2000-01-04   -1.668260
2000-01-05   -1.028617
                ...   
2002-09-22   -1.537204
2002-09-23   -0.048205
2002-09-24   -1.239156
2002-09-25    0.375952
2002-09-26    1.559289
Freq: D, Length: 1000, dtype: float64

In [15]:
longer_ts["2001"]
# here the string "2001" is interpreted as a year and selects that time period. This also works if you specify the month

2001-01-01   -0.117729
2001-01-02   -0.826312
2001-01-03    0.450641
2001-01-04    0.003617
2001-01-05   -1.505236
                ...   
2001-12-27   -0.049941
2001-12-28   -2.369377
2001-12-29   -0.107348
2001-12-30   -1.217106
2001-12-31    0.535427
Freq: D, Length: 365, dtype: float64

In [20]:
may = longer_ts["2001-05"]
may[2:10]

2001-05-03    1.353820
2001-05-04   -0.452741
2001-05-05    0.229028
2001-05-06    0.274359
2001-05-07    0.804381
2001-05-08    1.687818
2001-05-09    0.078153
2001-05-10    1.294241
Freq: D, dtype: float64
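To make the year and month selection easier to verify, here is a sketch on deterministic data (`s` holds `0..999` rather than random values). It uses `.loc`, and also shows that a slice between two partial strings includes the whole end month.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1000),
              index=pd.date_range("2000-01-01", periods=1000))

year_2001 = s.loc["2001"]            # every day in 2001
may_2001 = s.loc["2001-05"]          # every day in May 2001
spring = s.loc["2001-03":"2001-05"]  # March 1 through May 31, inclusive

print(len(year_2001), len(may_2001), len(spring))
```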

In [21]:
# slicing with datetime objects works as well
ts[datetime(2011, 1, 7):]

2011-01-07    0.527393
2011-01-08   -0.412556
2011-01-10   -1.455057
2011-01-12   -0.234365
dtype: float64

In [22]:
ts[datetime(2011, 1, 7):datetime(2011, 1, 10)]

2011-01-07    0.527393
2011-01-08   -0.412556
2011-01-10   -1.455057
dtype: float64

In [23]:
# Since time series data is usually arranged in order by date/time, you can use dates that aren't even in the data to select a range of values between two points.
ts

2011-01-02    0.672367
2011-01-05    0.735144
2011-01-07    0.527393
2011-01-08   -0.412556
2011-01-10   -1.455057
2011-01-12   -0.234365
dtype: float64

In [24]:
ts["2011-01-06":"2011-01-11"]

2011-01-07    0.527393
2011-01-08   -0.412556
2011-01-10   -1.455057
dtype: float64
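Range selection with dates that are not in the index relies on the index being in chronological order. The sketch below, on invented data, sorts an out-of-order series with `sort_index()` before slicing.

```python
import pandas as pd

# Hypothetical series whose dates are not in chronological order
unsorted = pd.Series(
    [3, 1, 2],
    index=pd.to_datetime(["2011-01-10", "2011-01-02", "2011-01-07"]),
)

# Sort by date first, then slice with endpoints that are not in the index
ordered = unsorted.sort_index()
window = ordered.loc["2011-01-06":"2011-01-11"]

print(ordered.index.is_monotonic_increasing)
print(window)
```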

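- As before, you can pass a string date, datetime, or timestamp. Remember that slicing in this manner has traditionally produced views on the source time series, like slicing NumPy arrays.
- This means that no data is copied and, in pandas versions without Copy-on-Write, modifications on the slice are reflected in the original data. With Copy-on-Write enabled (the default starting with pandas 3.0), modifying the slice no longer changes the original.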
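Whether a write to a slice reaches the original can depend on the pandas version and settings, so one defensive pattern is to take an explicit `.copy()` before modifying. This is a sketch on made-up data, with `window` as an illustrative name.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(6.0),
              index=pd.date_range("2011-01-01", periods=6))

# An explicit copy is independent of the source series
window = s.loc["2011-01-02":"2011-01-04"].copy()
window.iloc[0] = 99.0

print(window)
print(s)  # the source series keeps its original values
```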

In [25]:
# pandas has a built-in method called truncate() that you can use to cut out a part of a time series between two dates (start and end)

ts.truncate(after="2011-01-09")

2011-01-02    0.672367
2011-01-05    0.735144
2011-01-07    0.527393
2011-01-08   -0.412556
dtype: float64
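`truncate()` also accepts a `before` argument. The sketch below, using the same dates as `ts` but integer values so the result is easy to check, suggests that on a sorted index it selects the same rows as label slicing.

```python
import pandas as pd

idx = pd.to_datetime(["2011-01-02", "2011-01-05", "2011-01-07",
                      "2011-01-08", "2011-01-10", "2011-01-12"])
s = pd.Series(range(6), index=idx)

# Drop everything before Jan 6 and after Jan 11
trimmed = s.truncate(before="2011-01-06", after="2011-01-11")

# The equivalent label-based slice
sliced = s.loc["2011-01-06":"2011-01-11"]

print(trimmed)
```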

In [27]:
# all of this holds true for a DataFrame as well, indexing on its rows
dates = pd.date_range("2000-01-01", periods=100, freq="W-WED")

# W-WED: weekly frequency, where each date is the Wednesday of its week

long_df = pd.DataFrame(np.random.standard_normal((100, 4)), index=dates, 
                       columns=["Colorado", "Texas", "New York", "Ohio"])

long_df
# long_df.loc["2001-05"]

            Colorado     Texas  New York      Ohio
2000-01-05 -0.168608  2.164293 -1.565414 -0.393080
2000-01-12  1.137607 -0.984427 -0.191096  0.954420
2000-01-19 -1.803073  0.282805 -0.475744  1.176564
2000-01-26 -2.402191 -0.488424 -0.358143  0.728807
2000-02-02  0.346653 -0.477836  0.838453  1.923924
...              ...       ...       ...       ...
2001-10-31 -0.380216 -0.785883 -0.095600 -0.314055
2001-11-07  0.023920 -2.478984 -0.705827  0.386737
2001-11-14  0.012912  0.072805  1.264567 -0.452869
2001-11-21  0.380315  0.112513 -0.034569  0.755669


In [29]:
long_df.loc["2001-05"]

            Colorado     Texas  New York      Ohio
2001-05-02 -1.038414 -0.325998 -0.994882 -0.037020
2001-05-09 -0.890256  0.478504  0.681179  2.464736
2001-05-16 -0.143593 -0.652569  1.384029 -1.187708
2001-05-23 -0.972669 -0.537878 -0.578028 -0.706344
2001-05-30  0.376854  0.218363  1.094460 -2.703103

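Row selection by date can be combined with column selection in a single `.loc` call. The following sketch rebuilds a frame with the same shape and index as `long_df`, but filled with `0..399` so the result is reproducible; `df` and the selected columns are choices made for this example.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2000-01-01", periods=100, freq="W-WED")
df = pd.DataFrame(np.arange(400).reshape(100, 4), index=dates,
                  columns=["Colorado", "Texas", "New York", "Ohio"])

# All Wednesdays in May 2001, restricted to two columns
may_two_cols = df.loc["2001-05", ["Colorado", "Ohio"]]

# A date range within the month, all columns
first_half_may = df.loc["2001-05-01":"2001-05-15"]

print(may_two_cols)
print(first_half_may)
```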

## [ Time Series with Duplicate Indices ]


In [30]:
# in some applications, there may be multiple data observations falling on a particular timestamp

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])

dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts

2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int64

In [31]:
# we can tell that the index is not unique by checking its is_unique property
dup_ts.index.is_unique

False

In [32]:
# indexing into this time series will now either produce scalar values or slices, depending on whether a timestamp is duplicated

dup_ts["2000-01-03"] # not duplicated

np.int64(4)

In [33]:
dup_ts["2000-01-02"] # duplicated

2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int64

In [34]:
# to aggregate the data having nonunique timestamps
# one way to do this is to use groupby and pass level=0 (the one and only level)

grouped = dup_ts.groupby(level=0)
grouped.mean()

2000-01-01    0.0
2000-01-02    2.0
2000-01-03    4.0
dtype: float64

In [35]:
grouped.count()

2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
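Two other ways to handle duplicate timestamps are sketched below, on the same data as `dup_ts`: keeping only the first observation per timestamp with `Index.duplicated`, and computing several aggregates at once with `agg`. Which is appropriate depends on the application.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
                          "2000-01-02", "2000-01-03"])
s = pd.Series(np.arange(5), index=dates)

# Keep the first observation for each timestamp, drop later repeats
first_only = s[~s.index.duplicated(keep="first")]

# Several aggregates per timestamp in one DataFrame
summary = s.groupby(level=0).agg(["mean", "count"])

print(first_only)
print(summary)
```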