<div style="color:white;
           display:fill;
           border-radius:5px;
           background-color:#5642C5;
           font-size:200%;
           font-family:Arial;letter-spacing:0.5px">

<p width = 20%, style="padding: 10px;
              color:white;">
Time Series: Data Manipulation
              
</p>
</div>

Data Science Cohort Live NYC Nov 2022
<p>Phase 4: Topic 34</p>
<br>
<br>

<div align = "right">
<img src="Images/flatiron-school-logo.png" align = "right" width="200"/>
</div>
    
    

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
energy_data = pd.read_csv('data/energydata_complete.csv', parse_dates = True, index_col = 'date')

#  Objectives
- Understand the use case for time series data
- Manipulate datetime objects
- Understand different resampling techniques
- Visualize time series data

Time series:

- set of data as a function of / indexed by time

Energy expenditure and sensor monitoring for a household:

<img src = "Images/householdenergy.jpg">

In [None]:
fig, ax = plt.subplots()
energy_data.plot(y = 'Appliances', ax = ax, label = 'House1')
ax.set_ylabel('Total Appliance Usage (Watt-hours)')
plt.show()

**Time series forecasting** 

Energy usage forecasting in aggregate and across sets of devices:

*Projected energy usage for next month?*

- consumer end: smart house optimization
- business end: how many power plants should be open / capacity?

**Time series classification**

- Based on EEG voltage data: seizure or eye-lid opening?
- Use temporal features and past-future dependencies for trace class recognition

<img src = "images/x4.png" >

In the next series of lectures, we'll focus on univariate time series forecasting:
- given a time series of a single quantity
- predict future values of quantity as function of time

But there are many, many other types of tasks involving time-series data:

- real-time state prediction
- machine signal encoding-decoding
- etc.

Before any modeling:
- need to understand how to manipulate Python data structures involving time series
- common and important time series data manipulation operations

Time series have a notion of ordering:
- past comes before future
- indexing reflects this temporal ordering

Need special kind of indexing and data type to deal with this:

- datetime objects
- pandas DatetimeIndex

Special methods associated with these.

Let's have a closer look at the house energy usage dataset:

In [None]:
energy_data_df = pd.read_csv('data/energy_expenditure.csv', index_col=[0])

In [None]:
energy_data_df.info()

Time series data:
- appliance energy usage
- light energy usage

A look at the dataframe:

In [None]:
energy_data_df.head()

The date column is a string

In [None]:
print(type(energy_data_df.loc[0, 'date']))
energy_data_df.loc[0, 'date']

Inconvenient for many reasons:
- No implicit understanding of the different components of string 
- No notion of date/time ordering understood between the strings

**Want to convert column to column of datetime objects**

#### Converting column to datetime:
- pd.to_datetime()

In [None]:
energy_data_df['date']

In [None]:
date_asdt = pd.to_datetime(energy_data_df['date'])
date_asdt

Intelligently parses the strings and returns a pandas datetime Series
- works for many common string representations of date times
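
When the string layout is known, it can help to state it explicitly rather than rely on automatic parsing. A minimal sketch, assuming made-up sample strings (not taken from the energy dataset) and the `format` and `errors` arguments of `pd.to_datetime`:

```python
import pandas as pd

# Hypothetical sample strings; the last one is deliberately invalid
raw = pd.Series(['2016-01-11 17:00:00', '2016-01-11 17:10:00', 'not a date'])

# An explicit format avoids guessing; errors='coerce' turns
# unparseable entries into NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%Y-%m-%d %H:%M:%S', errors='coerce')

print(parsed)
print(parsed.isna().sum())   # number of strings that could not be parsed
```

Counting the NaT entries afterwards is a quick check on how many rows failed to parse.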

Looking at a single element:

In [None]:
datetime0 = date_asdt[0]
datetime0

Timestamp object attributes for extracting parts of datetime:

In [None]:
print(datetime0.year)
print(datetime0.month)
print(datetime0.day)

Extracting other components:
- .hour
- .minute
- .second

Can go all the way to nanoseconds if you like

Some nifty methods:

In [None]:
datetime0.day_name() # get the name of the day of the week

In [None]:
datetime0.month_name() # get the name of the month

#### Time-zone awareness
- .tz attribute
- .tz_localize

- pandas Timestamp objects can be made timezone aware
- this functionality useful when comparing time series data:
    - taken contemporaneously across different locations

Right now, not time-zone aware:
- House is in Belgium

In [None]:
dt0_timeaware = datetime0.tz_localize('Europe/Brussels')
dt0_timeaware

Getting a list of useful timezone strings:

In [None]:
from pytz import common_timezones
common_timezones

We can immediately take our datetime and convert to:
- US/Eastern time

In [None]:
dt0_timeaware

In [None]:
dt0_timeaware.tz_convert('US/Eastern')

Calculating the time difference between two logged timestamps:

In [None]:
date_asdt

In [None]:
time_diff = date_asdt[19730] - date_asdt[0]
time_diff

Outputs a Timedelta object:
- encodes time difference in useful representation
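
Timedelta objects also support arithmetic. A small sketch with made-up logging times (the values are assumptions, not from the dataset):

```python
import pandas as pd

# Hypothetical logging times
start = pd.Timestamp('2016-01-11 17:00:00')
end = pd.Timestamp('2016-01-13 05:30:00')

delta = end - start                      # subtracting Timestamps gives a Timedelta
hours = delta / pd.Timedelta(hours=1)    # dividing two Timedeltas gives a plain float
shifted = start + pd.Timedelta(minutes=10)   # shift a timestamp by a fixed step

print(delta)     # 1 days 12:30:00
print(hours)     # 36.5
print(shifted)   # 2016-01-11 17:10:00
```

Dividing by a unit Timedelta is a convenient way to express a difference in whatever unit a task needs.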

#### Useful Timedelta methods/attributes

Getting components in time day/hour/minutes/seconds/... representation

In [None]:
time_diff.components


In [None]:
time_diff.components.minutes

May be useful to convert to difference in units of seconds or milliseconds:

In [None]:
print(time_diff.days)
print(time_diff.total_seconds())

Nanoseconds:

In [None]:
# time difference in nanoseconds
time_diff.value

#### Vectorized datetime operations in Pandas Series

All operations available on pandas Timestamps can be vectorized on a pandas datetime Series:
- Series.dt.attribute
- Series.dt.method()
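
One common use of the `.dt` accessor is building calendar features from a datetime column. A minimal sketch on a made-up Series of 10-minute timestamps (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical 10-minute timestamps crossing midnight
stamps = pd.Series(pd.date_range('2016-01-11 23:40', periods=4, freq='10min'))

features = pd.DataFrame({
    'hour': stamps.dt.hour,                   # attribute, vectorized
    'day_name': stamps.dt.day_name(),         # method, vectorized
    'is_weekend': stamps.dt.dayofweek >= 5,   # Monday=0, ..., Sunday=6
})
print(features)
```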


In [None]:
date_asdt

In [None]:
date_asdt.dt.year

In [None]:
date_asdt.dt.month

In [None]:
date_asdt.dt.day

#### Setting datetime columns as index of Series/DataFrame

First check the data type:

In [None]:
energy_data_df.date

In [None]:
energy_data_df.date = pd.to_datetime(
    energy_data_df.date)
energy_data_df.date

Setting the index

In [None]:
energy_data = energy_data_df.set_index('date')
energy_data.head()

Checking index type

In [None]:
energy_data.index

Selecting the column yields a Series:
- indexed by time

In [None]:
appliance_series = energy_data['Appliances']
appliance_series

We can do some nifty things now.

#### DataFrame selection via datetime index

Getting all timestamps on a given day

In [None]:
appliance_series.loc['2016-03-04']

Using partial string addressing:
- Get all times on this day from 8-9 AM    

In [None]:
appliance_series.loc['2016-03-04 08']

Getting all timestamps for a given month: April

In [None]:
appliance_series.loc['2016-04']

Same idea with partial string selection
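
Partial string selection can be checked on a small synthetic series. This is a sketch with made-up hourly data, not the energy dataset; note that both endpoints of a string slice are included:

```python
import pandas as pd
import numpy as np

# Hypothetical hourly series: 31 March through 2 April 2016
idx = pd.date_range('2016-03-31 00:00', '2016-04-02 23:00', freq='h')
demo = pd.Series(np.arange(len(idx)), index=idx)

april_first = demo.loc['2016-04-01']            # every sample on 1 April
april = demo.loc['2016-04']                     # every sample in April
window = demo.loc['2016-03-31':'2016-04-01']    # both endpoint days are included

print(len(demo), len(april_first), len(april), len(window))   # 72 24 48 48
```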

Recognizes named strings as well:

In [None]:
appliance_series.loc['April 2016']

In [None]:
appliance_series.loc['April 3, 2016']

**Slicing pandas Dataframes and Series using datetime ranges**

- The datetime index allows for slicing dataframes and Series in datetime ranges

In [None]:
appliance_series.loc['2016-4-04':'2016-4-08' ]

Partial slicing:
    
- Can slice on different time scales:
    - e.g., slicing between years
    - slicing between months, etc

In [None]:
#getting appliance energy usage for january and february
janfeb_appliance = appliance_series.loc['2016-1': '2016-2']
janfeb_appliance

We may want to make our datetime index time-zone aware:
- the Series/DataFrame method .tz_localize() acts directly on the DatetimeIndex
- no need to pull the index out first

In [None]:
janfeb_appliance = janfeb_appliance.tz_localize('Europe/Brussels')

In [None]:
janfeb_appliance

In [None]:
janfeb_appliance.index

Your name is Adeel, a data analyst working for a tech/ customer service contractor in Bangalore, India. 

- The smart utility company in Brussels has outsourced certain data tasks as well as customer service issues to your company. 
- Need data for Jan/Feb reindexed in Bangalore time to join with customer service database from your call center.

In [None]:
# time in India based on pytz timezones
janfeb_appliance.tz_convert('Asia/Kolkata')

In [None]:
# time in Brussels
janfeb_appliance

Plotting the time series

In [None]:
data_subset = appliance_series.loc['2016-4-04':'2016-4-11']
data_subset.plot();  # data_subset is already the 'Appliances' Series

There is a lot of data here:
- Frequency of sampling: every 10 minutes.
- Maybe for a specific task: need points only every hour

- E.g., another time series only has hourly samples. Want to compare time series.
- Too much data: hard to store and already have enough data when sampled every hour.

#### Resampling 

> **Resampling** allows us to convert the time series into a particular frequency

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling

**To upsample** is to increase the frequency of the data of interest.  
**To downsample** is to decrease the frequency of the data of interest.


**Down-sampling**

- Series and DataFrames indexed by datetime can be resampled
- .resample(): takes in string argument for sampling frequency

- '1H': every 1 hour samples
- '2H': every 2 hours, etc
- 'T': minute frequency
- 'S' : second frequency
- 'D': daily
- 'W': weekly
- 'M': monthly

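Note: pandas 2.2 and later prefer 'h', 'min', 's' and 'ME' over 'H', 'T', 'S' and 'M', and warn on the older spellings.

For more frequencies: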

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

In [None]:
# hourly samples: downsampling from every 10 minutes
data_subset.resample('1H')

Creates a resampler object:
- with downsampling need to aggregate
- aggregates on data points within one hour interval

Aggregating with mean via chaining:

In [None]:
# time series of hourly mean of appliance usage
data_subset_downsamp = \
data_subset.resample('1H').mean()

The usual aggregation functions are available:
- exactly same as with groupby
- .mean(), .std(), .median(), etc.
- .agg(func) for customized aggregation
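
Several aggregations can be computed in one pass with `.agg`. A minimal sketch on made-up 10-minute readings containing one short spike (the values are assumptions; the lowercase `'h'` alias is used because newer pandas releases prefer it):

```python
import pandas as pd

# Hypothetical 10-minute readings with one short spike at 08:30
idx = pd.date_range('2016-04-04 08:00', periods=12, freq='10min')
usage = pd.Series([50, 50, 60, 400, 60, 50, 50, 50, 50, 60, 50, 50], index=idx)

# One row per hour, one column per aggregation
hourly = usage.resample('h').agg(['mean', 'max', 'count'])
print(hourly)
```

Keeping `max` next to `mean` shows how much of the spike the hourly average hides.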

Comparing the downsampled and original data

In [None]:
data_subset_downsamp

In [None]:
data_subset

Plotting the downsampled data vs. the original data:

In [None]:
# down sampled data
data_subset_downsamp.plot();

In [None]:
# original data
data_subset.plot();

Reproduces many of the longer term features:
- averages out finer spiking as a function of time

Be careful when downsampling:
- aggregating can throw out useful information
- but maybe fine for your use case

**Upsampling**

We may want to know local weather conditions:
- local airport has windspeed and temperature data
- merge it with energy usage data

In [None]:
weather_df = pd.read_csv('data/weather.csv')
weather_df.head()

weather_df.date = pd.to_datetime(weather_df.date)
weather_df.head()

Problem: 
- Weather sampled every 30 minutes. 
- Energy usage data is sampled every 10 mins.

In [None]:
energy_data_df

#### Ordered Merges

First step is to merge the dataset:
- merge two dataframes on a date-time ordered column
- preserving order of observations

Can't do a standard merge:
- need ordered merge

pandas.merge_ordered(left, right, on=None, left_on=None, right_on=None, left_by=None, how='outer')
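
What an ordered outer merge produces can be sketched on two tiny made-up frames. The column names and values are assumptions for illustration; `fill_method='ffill'` is an optional argument of `pd.merge_ordered` not shown in the abbreviated signature above:

```python
import pandas as pd

# Hypothetical frames: 10-minute energy readings, 30-minute weather readings
energy = pd.DataFrame({
    'date': pd.date_range('2016-01-11 17:00', periods=4, freq='10min'),
    'Appliances': [60, 60, 50, 50],
})
weather = pd.DataFrame({
    'date': pd.date_range('2016-01-11 17:00', periods=2, freq='30min'),
    'temp': [6.6, 6.5],
})

# Outer join keeps every timestamp; missing weather readings become NaN
merged = pd.merge_ordered(energy, weather, on='date', how='outer')

# Optionally forward-fill the gaps during the merge itself
filled = pd.merge_ordered(energy, weather, on='date', how='outer', fill_method='ffill')

print(merged)
print(filled)
```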

In [None]:
# columns in each dataframe that are merged on must be ordered
# datetime automatically satisfies this
combined_data = pd.merge_ordered(energy_data_df, weather_df, on = 'date', how = 'outer') # usually will use outer join
combined_data.set_index('date', inplace = True)
combined_data.head()

Can downsample energy usage data to 30 min intervals: 
- but may actually want this data at 10 min intervals.


Can **upsample** weather data to 10 min frequency:
- if between every 30 min, expect time series behaves relatively smoothly

Let's upsample by filling in NaN values. Sequential data has multiple relevant imputation methods:
- .ffill(): fill forward from the last non-NaN value
- .bfill(): fill backward from the next non-NaN value
- .interpolate(): interpolate between neighboring non-NaN values (linear by default)
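
The three methods can be compared side by side on a short made-up series with a two-point gap (the values are assumptions):

```python
import pandas as pd
import numpy as np

# Hypothetical temperature readings with two missing points
idx = pd.date_range('2016-01-11 17:00', periods=4, freq='10min')
temp = pd.Series([6.0, np.nan, np.nan, 9.0], index=idx)

compare = pd.DataFrame({
    'original': temp,
    'ffill': temp.ffill(),                # 6, 6, 6, 9
    'bfill': temp.bfill(),                # 6, 9, 9, 9
    'interpolate': temp.interpolate(),    # 6, 7, 8, 9
})
print(compare)
```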

**.ffill()**

Forward fills NaNs from last non-empty value

In [None]:
combined_data

In [None]:
ff = combined_data.ffill()
ff

**.bfill()**

Backward fills NaNs from the next non-empty value

In [None]:
combined_data

In [None]:
combined_data.bfill()

**.interpolate(method= '...')**
- interpolates NaNs between two values
- can use various specified strategies 

method

- 'linear' (default)
- 'spline' 
- etc.

In [None]:
combined_data

In [None]:
combined_data.interpolate()

Used ordered merge and imputation to upsample less frequent columns.

Can also upsample columns/series in a different way:
- using the resampler object

Take original weather data:

In [None]:
weather_df['date'] = pd.to_datetime(
    weather_df['date'])
weather_df_ind = weather_df.set_index('date')
weather_df_ind.head()

Construct resampler object at 10 minute interval:

In [None]:
upsamp = weather_df_ind.resample('10T')
upsamp

.asfreq() can return original timeseries values at new sampling frequency

In [None]:
upsamp.asfreq()

The data has NaNs where there are no samples at this frequency. Need to impute these.

Resampler object has same fill / imputation methods as dataframe to do this.

- .ffill()
- .bfill()
- .interpolate()
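
The resampler route can be sketched on a small made-up series sampled every 30 minutes (values are assumptions; `'10min'` is the newer spelling of the `'10T'` alias):

```python
import pandas as pd

# Hypothetical 30-minute temperature readings
idx = pd.date_range('2016-01-11 17:00', periods=3, freq='30min')
temp = pd.Series([6.0, 9.0, 6.0], index=idx)

upsampler = temp.resample('10min')

as_is = upsampler.asfreq()          # new 10-minute rows appear as NaN
linear = upsampler.interpolate()    # linear interpolation between known points

print(as_is)
print(linear)
```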

In [None]:
upsamp.ffill()

In [None]:
upsamp.asfreq()

In [None]:
upsamp.bfill()

In [None]:
upsamp.asfreq()

In [None]:
upsamp.interpolate()

In [None]:
upsamp.asfreq()

- subset upsampled and interpolated data for April 4 to 11
- join with our energy data for these dates

In [None]:
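# .copy() gives an independent frame, so adding a column below avoids a SettingWithCopyWarning
data_joined = upsamp.interpolate().loc['April 4 2016': 'April 11 2016'].copy()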
data_joined['Appliances']= data_subset
data_joined.head()

In [None]:
data_joined.info()

Upsampled less frequent weather data and joined with more frequent energy usage measurements.

#### Time Series Visualization

Loading pandas dataframe with datetime index:
- column 0 is the date
- parse_dates = True interprets index as datetime index automatically.

In [None]:
# column 0 is the date; parse_dates = True parses the index as a DatetimeIndex
uber_data = pd.read_csv("Data/uber.csv", index_col = [0], parse_dates = True)
uber_data.head()

Multiple time series indexed on same datetime:
- pandas plotting useful
- subplots = True option 

In [None]:
uber_data.plot(subplots = True, figsize = (8,6))
plt.show()

#### Time Series Differencing and Trend Computation

In many applications:
- don't care *as much* about actual values of time series
- care *more* about changes in values of time series
- or relative changes in values of time series

- Time series differencing
- Evaluating change in percentage from previous value

**Differencing**
- pandas Series has .diff() method
- .diff(periods = ) where periods indicates differencing lag

- periods = 1: returns Series $Y[t] - Y[t - 1]$
- periods = 2: returns Series $Y[t] - Y[t-2]$
- periods = k: returns Series $Y[t] - Y[t-k]$
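
Differencing at lag k is the same as subtracting a shifted copy of the series. A minimal sketch on made-up prices (the values are assumptions):

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 11.0, 15.0])   # hypothetical prices

lag1 = prices.diff()             # Y[t] - Y[t-1]
lag2 = prices.diff(periods=2)    # Y[t] - Y[t-2]

# Equivalent formulation with an explicit shift
lag1_manual = prices - prices.shift(1)

print(lag1.tolist())   # [nan, 2.0, -1.0, 4.0]
print(lag2.tolist())   # [nan, nan, 1.0, 3.0]
```

The first k entries are NaN because there is no value k steps back to subtract.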

Let's take the first-order difference of the adjusted close Series and compare it to the original series.

In [None]:
uber_data['Adj Close'].diff() # default is first order difference

In [None]:
uber_data['Adj Close']

Naturally produces NaN at first element: no previous element to difference on.

Plot visualizing the Adjusted close and the differenced Adjusted Close:

In [None]:
diff_df = pd.DataFrame(uber_data['Adj Close'])
diff_df['Differenced'] = uber_data['Adj Close'].diff()
diff_df.plot(subplots = True, figsize = (7,5))
plt.show()

With stocks, in particular:
- want to predict where percentage change between subsequent time-steps might be large

- pandas Series has .pct_change() calculating the change relative to the previous value:
$$ \frac{Y[t] - Y[t-1]}{Y[t-1]} $$
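
A quick check on made-up prices that `.pct_change()` divides the change by the previous value, $Y[t-1]$ (the values are assumptions):

```python
import pandas as pd

prices = pd.Series([10.0, 12.0, 9.0])   # hypothetical prices

auto = prices.pct_change()

# Manual version: change divided by the previous value
manual = (prices - prices.shift(1)) / prices.shift(1)

print(auto)     # NaN, then roughly 0.20 and -0.25
print(manual)
```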

In [None]:
diff_df['pct_change'] = diff_df['Adj Close'].pct_change()*100
diff_df['pct_change'] 

Visualizing original time series, differenced series, sequential percentage change:

In [None]:
diff_df.plot(subplots = True, figsize = (12,8))
plt.show()

In some cases, want to smooth a time series:
- can be helpful in evaluating time series trends
- around which there are noise fluctuations

.rolling() method:
- creates an object which creates window that slides across time series
- can aggregate within window

Generates a rolling aggregation (rolling mean, etc) as a function of time.

- .rolling(n) generates a rolling object that contains a sequence of windows:
    - each window has $n$ observations in it

In [None]:
diff_df['Adj Close'].rolling(8)

Aggregating will compute statistic in window
- sliding window through time series

In [None]:
# note that first seven rolling means will be NaN
# makes sense for n = 8 window
diff_df['Adj Close'].rolling(8).mean()

Plot the rolling mean and  the actual series:

In [None]:
diff_df['rolling_mean'] = diff_df['Adj Close'].rolling(8).mean()
diff_df[['Adj Close', 'rolling_mean']].plot(figsize = (12,8), linewidth = 4)
plt.show()

Gives us a smoothed version of the Close prices:
- analyze sustained increasing and decreasing trends
- discarding/ignoring high frequency fluctuation/noise
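
The leading NaNs and the lag of a trailing window can be adjusted with the `min_periods` and `center` arguments of `.rolling()`. A minimal sketch on made-up values with a window of 3:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical values

trailing = values.rolling(3).mean()                 # NaN until 3 observations exist
partial = values.rolling(3, min_periods=1).mean()   # averages whatever is available so far
centered = values.rolling(3, center=True).mean()    # window centered on each point

print(trailing.tolist())   # [nan, nan, 2.0, 3.0, 4.0]
print(partial.tolist())    # [1.0, 1.5, 2.0, 3.0, 4.0]
print(centered.tolist())   # [nan, 2.0, 3.0, 4.0, nan]
```

A centered window lines the smoothed curve up with the original series, but it uses future values, so it suits exploration rather than forecasting features.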

We will use many of these time series methods:
- datetime manipulation
- resampling/imputation techniques
- windowed aggregates
- differencing

For:
- exploring time series processes and their internal structure
- modeling the process that generated the series:
    - trend, seasonality, statistics of fluctuations
- constructing parsimonious models for out-of-sample predictions/forecasting
