<a href="https://colab.research.google.com/github/cagBRT/timeSeries/blob/main/1_TimeSeriesData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

By the end of this notebook the student: <br>
1. Will be able to explain the basic process for using time series data <br>
2. Will be able to define and explain time series data<br>
3. Will be able to manipulate time series data<br>

# **Time Series Data**

Time series data is a collection of observations obtained through repeated measurements over time. <br>
**Plot the points on a graph, and one of your axes would always be time.**

Time series data can be useful for:<br>

>Tracking daily, hourly, or weekly weather data<br>
Tracking changes in application performance<br>
Visualizing vitals from medical devices in real time<br>
Tracking network logs<br>

Time series data is unique in that it has a natural time order: *the order in which the data was observed matters*. <br>
The key difference between time series data and other data is that you are always asking how it changes over time.<br>
**A simple way to tell whether a dataset is a time series is to check whether one of its axes is time.**

To determine whether your data is time series data, work out what you need to identify a unique record in the data set.

- If all you need is a timestamp, it’s probably time series data.
- If you need something other than a timestamp, it’s probably cross-sectional data.
- If you need a timestamp plus something else, like an ID, it’s probably panel data.

- **Time series** is a group of observations on a single entity over time:
  - the daily closing prices over one year for a single financial security,
  - a single patient’s heart rate measured every minute over a one-hour procedure,
  - max temperature, humidity, and wind (three variables) in New York City (single entity) collected on the first day of every year (multiple points in time).

  The relevance of time as an axis makes time series data distinct from other types of data.
- **Cross-section** is a group of observations of multiple entities at a single time:
  - today’s closing prices for each of the S&P 500 companies,
  - the heart rates of 100 patients at the beginning of the same procedure,
  - an inventory of a given product in stock at specific stores,
  - a list of grades obtained by a class of students on a given exam.
- **Panel data** is organized along both dimensions, with multiple entities observed at multiple times:
  - max temperature, humidity, and wind (three variables) in New York City, San Francisco, Boston, and Chicago (multiple entities) on the first day of every year (multiple points in time).
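
The three shapes can be told apart by checking which columns are needed to identify a row. The sketch below uses a small made-up table (the cities, dates, and temperatures are hypothetical, not taken from any dataset in this notebook) and is only meant to illustrate the rule of thumb above.

```python
import pandas as pd

# Hypothetical readings; the values are made up for illustration.
records = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2020-01-01", "2019-01-01", "2020-01-01"]),
    "city": ["New York City", "New York City", "Boston", "Boston"],
    "max_temp": [14.0, 9.0, 11.0, 7.0],
})

# Time series: one entity, many dates -> the timestamp alone identifies a row.
time_series = records[records["city"] == "New York City"]
print(time_series["date"].is_unique)    # True

# Cross-section: many entities, one date -> the entity alone identifies a row.
cross_section = records[records["date"] == pd.Timestamp("2019-01-01")]
print(cross_section["city"].is_unique)  # True

# Panel: many entities, many dates -> the date alone is not unique,
# but the (date, city) pair is.
print(records["date"].is_unique)                           # False
print(records.duplicated(subset=["date", "city"]).any())   # False
```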

https://towardsdatascience.com/getting-started-with-time-series-using-pandas-b6b9c9d11949


In [None]:
!git clone -l -s https://github.com/cagBRT/timeSeries.git cloned-repo
%cd cloned-repo

In addition to being captured at regular time intervals, time series data can be captured whenever an event happens, regardless of the time interval, as in logs. <br>
Logs are a registry of events, processes, messages and communication between software applications and the operating system. Many applications and services write a log file that records their activity. Log data is an important contextual source to triage and resolve issues. <br>
For example, in networking, an event log helps provide information about network traffic, usage and other conditions.

In [None]:
from IPython.display import Image
Image("regular-vs-irregular-time-series-data.png" , width=640)

Because they happen at irregular intervals, *individual events are hard to predict and are difficult to model or forecast directly*, since **forecasting assumes that whatever happened in the past is a good indicator of what will happen in the future**.
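
A quick way to see the difference between regular and irregular data is to look at the gaps between consecutive timestamps. The sketch below uses synthetic timestamps (the log events are hypothetical) and `pd.infer_freq`, which returns `None` when no fixed frequency fits. The last step, counting events per fixed window, is one possible way to turn an event log into a regular series; it is shown as an illustration, not as part of the original notebook.

```python
import pandas as pd

# Regular series: one reading every 10 minutes (synthetic).
regular = pd.date_range("2021-01-01 00:00", periods=6, freq="10min")

# Irregular series: hypothetical log events that arrive whenever they happen.
irregular = pd.to_datetime([
    "2021-01-01 00:00:03",
    "2021-01-01 00:00:41",
    "2021-01-01 00:07:12",
    "2021-01-01 00:07:13",
    "2021-01-01 00:31:55",
])

print(pd.infer_freq(regular))     # a fixed frequency string
print(pd.infer_freq(irregular))   # None: no fixed frequency fits

# Gaps between consecutive observations.
regular_gaps = regular.to_series().diff().dropna()
irregular_gaps = irregular.to_series().diff().dropna()
print(regular_gaps.nunique())     # 1: every gap is the same
print(irregular_gaps.nunique())   # 4: every gap is different

# Counting events per 10-minute window gives a regular series of counts.
events_per_window = pd.Series(1, index=irregular).resample("10min").sum()
print(events_per_window)
```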

**Import libraries and prepare the environment**

In [None]:
# Importing required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import datetime as dt
from datetime import datetime    # To access datetime 
from pandas import Series        # To work on series

In [None]:
# Settings for pretty nice plots 
plt.style.use('fivethirtyeight')
%matplotlib inline

In [None]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

**Get the data**<br>
The data file is a CSV file of the stock prices for a car manufacturer in India.

In [None]:
# Reading in the data
df = pd.read_csv('/content/cloned-repo/MARUTI.csv')

In [None]:
# Inspecting the data
df.head()

Keep the columns for daily open, high, low, close, trading volume, and VWAP (volume-weighted average price).

In [None]:
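# .copy() makes data an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
data = df[['Date','Open','High','Low','Close','Volume','VWAP']].copy()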

In [None]:
data.info()
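
VWAP, one of the columns kept above, weights each traded price by the volume traded at that price. As a rough sketch of the arithmetic, the example below computes it for three hypothetical trades (the prices and volumes are made up and are not from `MARUTI.csv`) and compares it with a plain average.

```python
import numpy as np

# Hypothetical trades within one day (made-up numbers).
prices = np.array([100.0, 102.0, 101.0])
volumes = np.array([200, 100, 700])

# VWAP = sum(price * volume) / sum(volume)
vwap = (prices * volumes).sum() / volumes.sum()
plain_mean = prices.mean()

print(vwap)        # pulled toward 101.0, where most of the volume traded
print(plain_mean)  # ignores volume entirely
```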

# **Convert the Date column to datetime and set it as the index**<br>
Now the data can be referenced by the date. 

In [None]:
# Convert the Date strings to datetime64, then use the column as the index
data['Date'] = pd.to_datetime(data['Date'])
data.set_index('Date',inplace=True)
data.head()

In [None]:
data.tail()

**Data for one row**

In [None]:
data.values[0:1]

In [None]:
print('Specific date: \n',data.loc['2020-04-01'])

**Specific date and column**

In [None]:
# Select by row label and column label; 'Volume' is the column at position 4
print('Volume on specific date: ',data.loc['2020-04-01', 'Volume'])

# **Methods to manipulate the data**

**Using the datetime function**

In [None]:
my_year = 2019
my_month = 4
my_day = 21
my_hour = 10
my_minute = 5
my_second = 30

In [None]:
test_data = datetime(my_year,my_month,my_day)
print(test_data)

In [None]:
test_data = datetime(my_year,my_month,my_day,my_hour,my_minute,my_second)
print("The day is : ",test_data.day)
print("The hour is : ",test_data.hour)
print("The month is : ",test_data.month)

**Get the oldest and newest date**

In [None]:
print('Oldest date: ',data.index.min())
print('Newest date: ',data.index.max())

In [None]:
# Earliest date index location
print('Earliest date index location is: ',data.index.argmin())

# Latest date location
print('Latest date location: ',data.index.argmax())

**Assignment 0**<br>
Find the high and low values for the stock on Dec. 20, 2019<br>


In [None]:
#Assignment 0


**Manipulating and plotting the data**

In [None]:
# Prepare to plot VWAP against the Date
df_vwap = df[['Date','VWAP']].copy() # df is the original dataframe; copy to avoid SettingWithCopyWarning
df_vwap['Date'] = pd.to_datetime(df_vwap['Date'])
df_vwap.set_index("Date", inplace = True)
df_vwap.head()

**Plot VWAP over time**

In [None]:
df_vwap['VWAP'].plot(figsize=(16,8),title=' volume weighted average price')

**Slicing the data by time periods**

In [None]:
# Slicing on year (use .loc for label-based slicing; both endpoints are included)
vwap_subset = df_vwap.loc['2017':'2020']

# Slicing on month
vwap_subset = df_vwap.loc['2017-01':'2020-12']

# Slicing on day
vwap_subset = df_vwap.loc['2017-01-01':'2020-12-15']
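
Date-string slicing on a `DatetimeIndex` includes both endpoints, and a partial string such as a year or month matches every row inside that period. A minimal sketch on a synthetic six-day series (the values are made up):

```python
import pandas as pd

# Six consecutive days: 2019-12-30 through 2020-01-04 (synthetic).
idx = pd.date_range("2019-12-30", periods=6, freq="D")
s = pd.Series(range(6), index=idx)

# A partial string selects every row inside that period.
in_2020 = s.loc["2020"]
print(len(in_2020))   # 4 rows: Jan 1 through Jan 4

# A slice includes both endpoints: all of December 2019 through Jan 2, 2020.
window = s.loc["2019-12":"2020-01-02"]
print(window.index.min(), window.index.max())
```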

**Assignment 1**<br>
1. Plot the VWAP for the years 2017 - 2020
2. Plot the VWAP for the time period Jan 1, 2020 - Dec 31,


In [None]:
#Assignment 1

In [None]:
#@title 
#vwap_subset['VWAP'].plot(figsize=(16,8),title=' volume weighted average price')

**Plot the data per month**

In [None]:
ax = df_vwap.loc['2018', 'VWAP'].plot(figsize=(15,6))
ax.set_title('Month-wise Trend in 2018'); 
ax.set_ylabel('VWAP');
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'));

**Plot daily data with a tick mark for each week (Mondays)**

In [None]:
ax = df_vwap.loc['2018-10':'2018-11','VWAP'].plot(marker='o', linestyle='-',figsize=(15,6))
ax.set_title('Oct-Nov 2018 trend'); 
ax.set_ylabel('VWAP');
ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.MONDAY))
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));

**Break the data down into day of week, day, month, and/or year**

In [None]:
df_vwap.reset_index(inplace=True)
df_vwap['year'] = df_vwap.Date.dt.year
df_vwap['month'] = df_vwap.Date.dt.month
df_vwap['day'] = df_vwap.Date.dt.day
df_vwap['day of week'] = df_vwap.Date.dt.dayofweek

#Set Date column as the index column.
df_vwap.set_index('Date', inplace=True)
df_vwap.head()
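
The `.dt` accessor used above numbers the days of the week from Monday = 0 to Sunday = 6. The sketch below shows that mapping on a synthetic week of dates and uses it to filter rows; the column names `dayofweek` and `name` are chosen here for illustration.

```python
import pandas as pd

# One synthetic week, starting on Monday 2020-01-06.
dates = pd.Series(pd.date_range("2020-01-06", periods=7, freq="D"))

parts = pd.DataFrame({
    "date": dates,
    "dayofweek": dates.dt.dayofweek,   # Monday=0 ... Sunday=6
    "name": dates.dt.day_name(),
})
print(parts)

# Keep only the rows that fall on a Friday (dayofweek == 4).
fridays = parts[parts["dayofweek"] == 4]
print(fridays)
```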

**Assignment 2**<br>
Plot the data to show VWAP on Wednesdays in 2020

In [None]:
#Assignment 2


In [None]:
#@title
#ax = df_vwap.loc['2020-1':'2020-12','VWAP'].plot(marker='o', linestyle='-',figsize=(15,6))
#ax.set_title('VWAP in 2020 (ticks on Wednesdays)'); 
#ax.set_ylabel('VWAP');
#ax.xaxis.set_major_locator(mdates.WeekdayLocator(byweekday=mdates.WEDNESDAY))
#ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'));

# **Resampling**<br>
Resampling involves changing the frequency of your time series observations.<br>

Two types of resampling are:<br>
- Downsampling: Where you decrease the frequency of the samples, such as from days to months. Downsampling aggregates the data based on a specified frequency and aggregation function.
- Upsampling: Where you increase the frequency of the samples, such as from minutes to seconds.
<br><br>

In both cases, the resampled values are computed rather than observed.

With upsampling, care is needed in determining how the fine-grained observations are calculated using interpolation. 

With downsampling, care is needed in selecting the summary statistics used to calculate the new aggregated values.
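
To illustrate why the choice of summary statistic matters, the sketch below downsamples a synthetic six-day series (made-up values) to 3-day periods with three different aggregations. Each one tells a different story about the same period.

```python
import pandas as pd

# Six daily observations (synthetic).
idx = pd.date_range("2021-01-01", periods=6, freq="D")
price = pd.Series([10.0, 12.0, 11.0, 20.0, 18.0, 22.0], index=idx)

# Group into 3-day periods, then summarize each period three ways.
by_3d = price.resample("3D")
summary = pd.DataFrame({
    "mean": by_3d.mean(),   # typical level in the period
    "max": by_3d.max(),     # peak in the period
    "last": by_3d.last(),   # value at the end of the period
})
print(summary)
```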

The two main reasons to resample your time series data are:

- Problem Framing: Resampling may be required if your data is not available at the same frequency that you want to make predictions.<br>
- Feature Engineering: Resampling can also be used to provide additional structure or insight into the learning problem for supervised learning models.

[Pandas resample method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html)<br>
[Using the Pandas Resample Function](https://towardsdatascience.com/using-the-pandas-resample-function-a231144194c4)

Note that in the three plots below, resampling preserves the overall shape of the original data series, and the curve becomes smoother as the resampling interval increases. 

In [None]:
df_vwap['VWAP'].plot(figsize=(12,8),title=' volume weighted average price')

In [None]:
fig, ax = plt.subplots(2, sharex=True)
df_vwap['VWAP'].resample('10D').mean().plot(figsize=(12,8), ax=ax[0], 
title="VWAP Down-sampled to 10-day periods",legend=False)
df_vwap['VWAP'].resample('30D').mean().plot(figsize=(12,8), ax=ax[1],
title="VWAP Down-sampled to 30-day periods",legend=False)


Below is a table of offset aliases used with the `resample` method.

In [None]:
Image("resamplingRule.png" , width=640)

Let's look at the data<br>
Note it lists the VWAP for each day

In [None]:
df_vwap.head()

Let's now resample that data to get the average VWAP for each year, labeled by the year-end date.<br>

In [None]:
# 'A' is the year-end alias; pandas 2.2+ prefers 'YE' and warns about 'A'
df_vwap.resample(rule = 'A').mean()[:5]

**Assignment 3** <br>
Plot the end of year average for all the years where there is data<br>

In [None]:
#Assignment 3

In [None]:
#@title
#ax = df_vwap.resample(rule = 'A').mean().plot(marker='o', linestyle='-',figsize=(15,6))
#ax.set_title('Yearly Avg'); 
#ax.set_ylabel('VWAP');


# **UpSampling**<br>

The pandas `Series` object provides an `interpolate()` method to fill in missing values, with a selection of simple and more complex interpolation methods. 

A good starting point is to use a linear interpolation. This draws a straight line between available data and fills in values at the chosen frequency from this line.

In the sample below the code upsamples the data per hour. <br>
The dataset does not have VWAP per hour, so the data is created by the software using linear interpolation. 

In [None]:
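# Take the first two daily rows and upsample them to hourly frequency
# ('H' is the hourly alias; pandas 2.2+ prefers 'h' and warns about 'H')
df_vwapDay = df_vwap[0:2]
vwapResampled = df_vwapDay.resample(rule = 'H')
# Fill the new hourly rows along a straight line between the two observations
interpolated = vwapResampled.interpolate(method='linear')
interpolated.head(30)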

When the interpolated data is plotted, it is a straight line between the two original observations.

In [None]:
interpolated['VWAP'].plot(figsize=(16,8),title=' volume weighted average price')
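
Linear interpolation can be checked by hand on a tiny example. The sketch below upsamples two synthetic observations four days apart to daily frequency (the same idea as the hourly upsampling above, at a coarser scale) and compares linear interpolation with a forward fill, another simple option.

```python
import pandas as pd

# Two observations four days apart (synthetic values).
sparse = pd.Series(
    [100.0, 108.0],
    index=pd.to_datetime(["2020-01-01", "2020-01-05"]),
)

# Upsampling to daily frequency creates three new, empty days.
upsampled = sparse.resample("D").asfreq()
print(upsampled.isna().sum())   # 3

# Linear interpolation: equal steps of (108 - 100) / 4 = 2 per day.
linear = upsampled.interpolate(method="linear")
print(linear.tolist())          # [100.0, 102.0, 104.0, 106.0, 108.0]

# Forward fill: repeat the last known value until a new one arrives.
filled = sparse.resample("D").ffill()
print(filled.tolist())          # [100.0, 100.0, 100.0, 100.0, 108.0]
```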

**Assignment 4**<br>
Upsample the dataset to interpolate for minute data. 

In [None]:
#Assignment 4


**[Shift the time series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shift.html)**

In [None]:
df_vwap.shift(1).head()

In [None]:
df_vwap.shift(-1).head()

In [None]:
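# tshift() was removed in pandas 2.0; shift() with freq moves the index instead of the values
# ('M' is the month-end alias; pandas 2.2+ prefers 'ME' and warns about 'M')
df_vwap.shift(periods=3, freq = 'M').head()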
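
Shifting can move either the values or the index. The sketch below, on a synthetic four-day series, shows both behaviors and one common use of a shifted series, computing the change from the previous observation.

```python
import pandas as pd

# Four daily observations (synthetic).
idx = pd.date_range("2021-01-01", periods=4, freq="D")
s = pd.Series([1.0, 2.0, 4.0, 8.0], index=idx)

# shift(1): values move down one row, the index stays put, the first value becomes NaN.
lagged = s.shift(1)
print(lagged)

# shift(1, freq="D"): the index moves forward one day, the values are untouched.
moved = s.shift(1, freq="D")
print(moved)

# Change from the previous day: today's value minus yesterday's.
change = s - s.shift(1)
print(change)
```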


In [None]:
df_vwap['VWAP'].plot(figsize = (10,6))

**Do rolling window calculations using the [rolling function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html)**

In [None]:
df_vwap.rolling(7).mean().head(10)

In [None]:
df_vwap['VWAP'].plot()
df_vwap.rolling(window=30).mean()['VWAP'].plot(figsize=(16, 6))
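
A rolling mean averages each value with the ones immediately before it, which is why the first rows of the output above are `NaN`: the window is not yet full. A minimal sketch on synthetic data, including the `min_periods` parameter that relaxes that requirement:

```python
import pandas as pd

# Five daily observations (synthetic).
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0, 5.0],
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)

# 3-day rolling mean: the first two rows are NaN because the window is incomplete.
roll3 = s.rolling(window=3).mean()
print(roll3)

# min_periods=1 averages whatever is available so far instead of returning NaN.
roll3_partial = s.rolling(window=3, min_periods=1).mean()
print(roll3_partial)
```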
