
# INFO 212: Data Science Programming

## Week 9: Lecture 1: Time Series Data Analysis

---

**Agenda:**
- Apply techniques to time series data

# Time Series
Time series data is an important form of structured data in many different fields, such
as finance, economics, ecology, neuroscience, and physics. Anything that is observed
or measured at many points in time forms a time series. Many time series are fixed
frequency, which is to say that data points occur at regular intervals according to some
rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can
also be irregular without a fixed unit of time or offset between units. How you mark
and refer to time series data depends on the application, and you may have one of the
following:

- Timestamps, specific instants in time
- Fixed periods, such as the month January 2007 or the full year 2010
- Intervals of time, indicated by a start and end timestamp. Periods can be thought
of as special cases of intervals
- Experiment or elapsed time; each timestamp is a measure of time relative to a
particular start time (e.g., the diameter of a cookie baking each second since
being placed in the oven)
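
As a rough illustration of the kinds of time reference listed above, the following sketch builds one pandas object for each. The specific dates and the variable names are made up for the example.

```python
import pandas as pd

# A timestamp: one specific instant in time
ts = pd.Timestamp("2007-01-15 09:30")

# A fixed period: the whole month of January 2007
period = pd.Period("2007-01", freq="M")

# An interval: an arbitrary span given by a start and an end timestamp
interval = pd.Interval(pd.Timestamp("2007-01-10"), pd.Timestamp("2007-01-20"))

# Elapsed time: a duration measured from some starting point
elapsed = ts - pd.Timestamp("2007-01-15 09:00")

print(ts, period, interval, elapsed, sep="\n")
print(ts in interval)                              # the instant falls inside the interval
print(period.start_time <= ts <= period.end_time)  # and inside the period
```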

```
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

## Date and Time Data Types and Tools
The Python standard library includes data types for date and time data, as well as
calendar-related functionality. The datetime, time, and calendar modules are the
main places to start. The datetime.datetime type, or simply datetime, is widely
used.
```
from datetime import datetime
now = datetime.now()

now.year, now.month, now.day
```

We can apply arithmetic operations to datetime objects:
```
datetime.now() - datetime(2024, 4, 20)
```

The result is a timedelta object with time-related attributes:
```
delta = datetime(2021, 1, 7) - datetime(2008, 6, 24, 8, 15)

delta.days
delta.seconds
```
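
A timedelta can also be added to or subtracted from a datetime to produce a new, shifted datetime. The sketch below uses illustrative dates (the `start` value is an assumption, not part of the lecture data) and shows `total_seconds()`, which combines the separate `days` and `seconds` components into one number.

```python
from datetime import datetime, timedelta

start = datetime(2011, 1, 7)
later = start + timedelta(days=12)         # 12 days after start
earlier = start - 2 * timedelta(days=12)   # 24 days before start

delta = datetime(2021, 1, 7) - datetime(2008, 6, 24, 8, 15)
# days and seconds are stored separately; total_seconds() combines them
total = delta.total_seconds()
print(later, earlier, delta.days, delta.seconds, total)
```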

Add a time zone to the current time:
```
import pytz
local_tz = pytz.timezone('America/New_York')

now = datetime.now()

now = now.astimezone(local_tz)

now.year

now.hour

now.tzinfo
```

### Converting Between String and Datetime
Format datetime objects and pandas Timestamp objects as strings using str or the strftime method, passing a format specification.
```
stamp = datetime(2021, 1, 3)
str(stamp)
stamp.strftime('%Y-%m-%d')

s = stamp.strftime('%m/%d/%Y')
```

Convert from string to datetime:
```
d = datetime.strptime(s, '%m/%d/%Y')
```

## Exercise:
```
value = '2021-01-03'
datetime.strptime(value, '%Y-%m-%d')
datestrs = ['7/6/2021', '8/6/2021']
[datetime.strptime(x, '%m/%d/%Y') for x in datestrs]
```

pandas is generally oriented toward working with arrays of dates, whether used as an
axis index or a column in a DataFrame. The pandas.to_datetime function parses many different
kinds of date representations. Standard date formats like ISO 8601 can be
parsed very quickly:
```
datestrs = ['2021-07-06 12:00:00', '2021-08-06 00:00:00']
pd.to_datetime(datestrs)
```
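
Real data often contains missing or malformed dates. The following sketch, using made-up input strings, shows how `pd.to_datetime` represents them as `NaT` ("Not a Time"), and how `errors="coerce"` converts unparseable strings to `NaT` instead of raising an error.

```python
import pandas as pd

datestrs = ["2021-07-06 12:00:00", "2021-08-06 00:00:00"]

# None is converted to NaT, the missing value for timestamps
idx = pd.to_datetime(datestrs + [None])
print(idx)
print(idx.isna())

# errors="coerce" turns unparseable strings into NaT instead of raising
coerced = pd.to_datetime(["2021-07-06", "not a date"], errors="coerce")
print(coerced)
```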

## Time Series Basics
A basic kind of time series object in pandas is a Series indexed by timestamps, which
are often represented outside of pandas as Python strings or datetime objects.

```
from datetime import datetime
dates = [datetime(2021, 1, 2), datetime(2021, 1, 5),
         datetime(2021, 1, 7), datetime(2021, 1, 8),
         datetime(2021, 1, 10), datetime(2021, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts
```

```
ts.index
```

Like other Series, arithmetic operations between differently indexed time series automatically
align on the dates:
```
ts + ts[::2]
```

### Indexing, Selection, Subsetting
A time series behaves like any other pandas.Series when you are indexing and selecting
data based on label:
```
stamp = ts.index[2]
ts[stamp]
```

For longer time series, a year or only a year and month can be passed to easily select
slices of data:
```
longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range('1/1/2023', periods=1000))
longer_ts
```

```
longer_ts['2023-5']
```
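
Date strings can also be used as slice endpoints. The sketch below rebuilds a series like `longer_ts` and uses `.loc` for the selections, a stylistic choice made here to keep label-based access explicit. The particular dates are arbitrary.

```python
import numpy as np
import pandas as pd

longer_ts = pd.Series(np.random.randn(1000),
                      index=pd.date_range("1/1/2023", periods=1000))

# A year string selects every observation in that year
year_2024 = longer_ts.loc["2024"]

# Slicing with date strings includes both endpoints
first_week_may = longer_ts.loc["2023-05-01":"2023-05-07"]

# truncate drops everything after (or before) a given date
until_jan = longer_ts.truncate(after="2023-01-31")

print(len(year_2024), len(first_week_may), len(until_jan))
```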

### Time Series with Duplicate Indices
In some applications, there may be multiple data observations falling on a particular
timestamp. Here is an example:
```
dates = pd.DatetimeIndex(['1/1/2000', '1/2/2000', '1/2/2000',
                          '1/2/2000', '1/3/2000'])
dup_ts = pd.Series(np.arange(5), index=dates)
dup_ts
```
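
Before aggregating, it can help to confirm that the index really has duplicates and to see how selection behaves. In this sketch, which rebuilds `dup_ts` from the example above, selecting a unique timestamp gives a scalar and selecting a repeated one gives a Series.

```python
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(["1/1/2000", "1/2/2000", "1/2/2000",
                          "1/2/2000", "1/3/2000"])
dup_ts = pd.Series(np.arange(5), index=dates)

print(dup_ts.index.is_unique)   # False when any timestamp repeats

single = dup_ts.loc[pd.Timestamp("2000-01-03")]     # unique label: a scalar
repeated = dup_ts.loc[pd.Timestamp("2000-01-02")]   # repeated label: a Series
print(single)
print(repeated)
```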

Suppose you wanted to aggregate the data having non-unique timestamps. One way
to do this is to use groupby and pass level=0:
```
grouped = dup_ts.groupby(level=0)
grouped.mean()
grouped.count()
```

## Date Ranges, Frequencies, and Shifting
Generic time series in pandas are assumed to be irregular; that is, they have no fixed
frequency. For many applications this is sufficient. However, it’s often desirable to
work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if
that means introducing missing values into a time series. Fortunately, pandas has a
full suite of standard time series frequencies and tools for resampling, inferring frequencies,
and generating fixed-frequency date ranges. For example, you can convert
the sample time series to a fixed daily frequency by calling resample:

```
from datetime import datetime
dates = [datetime(2021, 1, 2), datetime(2021, 1, 5),
         datetime(2021, 1, 7), datetime(2021, 1, 8),
         datetime(2021, 1, 10), datetime(2021, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)
ts
```

```
ts.resample('D')
```

```
resampler = ts.resample('D')
resampler
```
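
Calling `resample` on its own only sets up the grouping and returns a resampler object. One way to see the resulting fixed-frequency series is to call `asfreq()` on it, as in this sketch built on the same six dates. The days that were absent from the original series show up as NaN.

```python
from datetime import datetime
import numpy as np
import pandas as pd

dates = [datetime(2021, 1, 2), datetime(2021, 1, 5),
         datetime(2021, 1, 7), datetime(2021, 1, 8),
         datetime(2021, 1, 10), datetime(2021, 1, 12)]
ts = pd.Series(np.random.randn(6), index=dates)

# asfreq materializes the daily series without aggregating anything
daily = ts.resample("D").asfreq()
print(daily)              # days absent from ts appear as NaN
print(daily.index.freq)   # the index now carries a daily frequency
```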

### Generating Date Ranges
pandas.date_range is responsible for
generating a DatetimeIndex with an indicated length according to a particular
frequency:
```
index = pd.date_range('2012-04-01', '2012-06-01')
index
```

```
pd.date_range(start='2012-04-01', periods=20)
pd.date_range(end='2012-06-01', periods=20)
```

```
pd.date_range('2000-01-01', '2000-12-01', freq='BM')
```

```
pd.date_range('2012-05-02 12:56:31', periods=5)
```

```
pd.date_range('2012-05-02 12:56:31', periods=5, normalize=True)
```

### Frequencies and Date Offsets

```
from pandas.tseries.offsets import Hour, Minute
hour = Hour()
hour
```

```
four_hours = Hour(4)
four_hours
```
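
Offset objects such as `Hour` and `Minute` can be combined and used wherever a frequency is expected. The following is a small sketch with arbitrary dates: it adds two offsets, passes an offset as `freq`, and adds an offset to a timestamp.

```python
import pandas as pd
from pandas.tseries.offsets import Hour, Minute

# Offsets can be combined by addition
ninety_minutes = Hour(1) + Minute(30)

# An offset object can be passed as freq in place of a string alias
rng = pd.date_range("2000-01-01", periods=4, freq=Hour(4))

# Adding an offset to a timestamp moves it forward in time
shifted = pd.Timestamp("2000-01-01") + ninety_minutes
print(ninety_minutes, rng, shifted, sep="\n")
```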

#### Week of month dates
One useful frequency class is “week of month,” starting with WOM. This enables you to
get dates like the third Friday of each month:
```
rng = pd.date_range('2012-01-01', '2012-09-01', freq='WOM-3FRI')
list(rng)
```

### Shifting (Leading and Lagging) Data
“Shifting” refers to moving data backward and forward through time. Both Series and
DataFrame have a shift method for doing naive shifts forward or backward, leaving
the index unmodified:

```
ts = pd.Series(np.random.randn(4),
               index=pd.date_range('1/1/2000', periods=4, freq='M'))
ts
ts.shift(2)
ts.shift(-2)
```
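
A common use of `shift` is computing the percent change from one observation to the next. The sketch below uses a small hand-made daily series (the values and the daily frequency are chosen for illustration). It also shows that passing `freq` to `shift` moves the timestamps rather than the values, so no data is discarded.

```python
import numpy as np
import pandas as pd

ts = pd.Series([100.0, 110.0, 99.0, 108.9],
               index=pd.date_range("2000-01-01", periods=4, freq="D"))

# Percent change from the previous observation, computed with a naive shift
pct = ts / ts.shift(1) - 1

# Passing freq moves the timestamps instead of the values, so no data is lost
moved = ts.shift(2, freq="D")

print(pct)
print(moved)
```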

## Time Zone Handling
Working with time zones is generally considered one of the most unpleasant parts of
time series manipulation. As a result, many time series users choose to work with
time series in coordinated universal time or UTC, which is the successor to Greenwich
Mean Time and is the current international standard. Time zones are expressed as
offsets from UTC; for example, New York is four hours behind UTC during daylight
saving time and five hours behind the rest of the year.

In Python, time zone information comes from the third-party pytz library (installable
with pip or conda), which exposes the Olson database, a compilation of world
time zone information. This is especially important for historical data because the
daylight saving time (DST) transition dates (and even UTC offsets) have been
changed numerous times depending on the whims of local governments. In the United
States, the DST transition times have been changed many times since 1900!
```
for tz in pytz.common_timezones:
    if 'America' in tz:
        print(tz)
```

### Time Zone Localization and Conversion
By default, time series in pandas are time zone naive. For example, consider the following
time series:
```
rng = pd.date_range('3/9/2012 9:30', periods=6, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
```
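
Starting from a naive series like the one above, localization and conversion might look like the following sketch. `tz_localize` attaches a zone without changing the clock time, and `tz_convert` re-expresses the same instants in another zone. The choice of UTC and America/New_York is only for illustration.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("3/9/2012 9:30", periods=6, freq="D")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
print(ts.index.tz)   # None: the index is time zone naive

# tz_localize attaches a zone to naive timestamps without changing clock time
ts_utc = ts.tz_localize("UTC")

# tz_convert re-expresses the same instants in another zone
ts_ny = ts_utc.tz_convert("America/New_York")
print(ts_ny.index)
```

Because this date range crosses the start of US daylight saving time on March 11, 2012, the converted local hour changes partway through the index.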

### Operations with Time Zone-Aware Timestamp Objects
Like time series and date ranges, individual Timestamp objects can be
localized from naive to time zone-aware and converted from one time zone to
another:
```
stamp = pd.Timestamp('2011-03-12 04:00')
stamp_utc = stamp.tz_localize('utc')
stamp_utc.tz_convert('America/New_York')
```

### Operations Between Different Time Zones
If two time series with different time zones are combined, the result will be in UTC.
Because the timestamps are stored under the hood in UTC, this is a straightforward
operation that requires no conversion:
```
rng = pd.date_range('3/7/2012 9:30', periods=10, freq='B')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
ts1 = ts[:7].tz_localize('Europe/London')
ts2 = ts1[2:].tz_convert('Europe/Moscow')
result = ts1 + ts2
result.index
```

## Periods and Period Arithmetic
Periods represent timespans, like days, months, quarters, or years. The Period class
represents this data type, requiring a string or integer and a frequency.
```
p = pd.Period(2007, freq='A-DEC')
p
```
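
Period arithmetic works in units of the period's own frequency. The sketch below uses a monthly period rather than the annual one above, simply to keep the example independent of annual alias spellings, and shows integer shifts and `pd.period_range`.

```python
import pandas as pd

p = pd.Period("2007-01", freq="M")

# Adding or subtracting an integer shifts the period by its own frequency
print(p + 5)     # five months later
print(p - 2)     # two months earlier

# period_range builds a regular sequence of periods (a PeriodIndex)
months = pd.period_range("2000-01", "2000-06", freq="M")
print(months)
```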

### Period Frequency Conversion
```
p = pd.Period('2007', freq='A-DEC')
p
p.asfreq('M', how='start')
p.asfreq('M', how='end')
```

### Quarterly Period Frequencies
```
p = pd.Period('2012Q4', freq='Q-JAN')
p
```
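
With `freq='Q-JAN'` the fiscal year ends in January, so the fourth quarter of fiscal 2012 does not line up with the calendar year. Converting the period to daily frequency, as in this sketch, makes its actual start and end dates visible.

```python
import pandas as pd

p = pd.Period("2012Q4", freq="Q-JAN")

# With a fiscal year ending in January, Q4 of 2012 spans Nov 2011 to Jan 2012
first_day = p.asfreq("D", how="start")
last_day = p.asfreq("D", how="end")
print(first_day, last_day)
```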

### Converting Timestamps to Periods (and Back)
```
rng = pd.date_range('2000-01-01', periods=3, freq='M')
ts = pd.Series(np.random.randn(3), index=rng)
ts
pts = ts.to_period()
pts
```
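
For the "and back" direction, `to_timestamp` converts a period-indexed series to timestamps. This sketch builds its own small monthly period series (random values, for illustration only) and shows both the default start-of-period result and the `how="end"` variant.

```python
import numpy as np
import pandas as pd

pts = pd.Series(np.random.randn(3),
                index=pd.period_range("2000-01", periods=3, freq="M"))

# to_timestamp converts back; by default each period maps to its start
back = pts.to_timestamp()
print(back.index)

# how="end" uses the last instant of each period instead
at_end = pts.to_timestamp(how="end")
print(at_end.index)
```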

## Resampling and Frequency Conversion
```
rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
ts.resample('M').mean()
# Index the result by period; resample's kind= argument is deprecated in pandas 2.2+
ts.resample('M').mean().to_period('M')
```

### Downsampling
```
rng = pd.date_range('2000-01-01', periods=12, freq='T')
ts = pd.Series(np.arange(12), index=rng)
ts
```

```
ts.resample('5min', closed='right').sum()
```
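
The `closed` and `label` arguments control which bin edge is inclusive and which edge names the bin. The following sketch rebuilds the one-minute series (using the `'min'` alias) and compares three settings side by side.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# Default for '5min': left edge inclusive, bins labeled by their left edge
left = ts.resample("5min").sum()

# Right edge inclusive, still labeled by the left edge
right = ts.resample("5min", closed="right").sum()

# Right edge inclusive and labeled by the right edge
both = ts.resample("5min", closed="right", label="right").sum()

print(left, right, both, sep="\n\n")
```

With `closed="right"`, the 00:00 observation belongs to the bin that ends at 00:00, which is why an extra bin appears at the start of the result.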

```
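# loffset was removed in pandas 2.0; shift the result index by one second instead
result = ts.resample('5min', closed='right', label='right').sum()
result.index = result.index - pd.Timedelta(seconds=1)
result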
```

#### Open-High-Low-Close (OHLC) resampling
```
ts.resample('5min').ohlc()
```

### Upsampling and Interpolation
```
frame = pd.DataFrame(np.random.randn(2, 4),
                     index=pd.date_range('1/1/2000', periods=2,
                                         freq='W-WED'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame
```

```
frame.resample('W-THU').ffill()
```
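
When upsampling there is nothing to aggregate, so the main decision is how to fill the new rows. This sketch upsamples the same kind of weekly frame to daily frequency (a frequency chosen here for illustration) and compares leaving gaps, forward filling, and forward filling with a limit.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randn(2, 4),
                     index=pd.date_range("1/1/2000", periods=2, freq="W-WED"),
                     columns=["Colorado", "Texas", "New York", "Ohio"])

daily = frame.resample("D").asfreq()            # upsample; new rows are NaN
filled = frame.resample("D").ffill()            # carry each value forward
limited = frame.resample("D").ffill(limit=2)    # fill at most two rows ahead
print(daily, filled, limited, sep="\n\n")
```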

### Resampling with Periods
```
frame = pd.DataFrame(np.random.randn(24, 4),
                     index=pd.period_range('1-2000', '12-2001',
                                           freq='M'),
                     columns=['Colorado', 'Texas', 'New York', 'Ohio'])
frame[:5]
annual_frame = frame.resample('A-DEC').mean()
annual_frame
```

```
# Q-DEC: Quarterly, year ending in December
annual_frame.resample('Q-DEC').ffill()
annual_frame.resample('Q-DEC', convention='end').ffill()
```

```
annual_frame.resample('Q-MAR').ffill()
```