## Time Series Data

Time series data is common in many industries, including finance, healthcare, and manufacturing. Pandas provides powerful tools for working with time-indexed data, enabling you to manipulate, analyze, and visualize time-dependent patterns efficiently.

In this module, we will cover:

- Working with datetime objects,
- Time-based indexing and slicing,
- Resampling and aggregation,
- Handling missing time data,
- Rolling and expanding windows,
- Time shifts and differences.

## What is a Time Series?

A time series is a sequence of data points collected or recorded at successive points in time, often at uniform intervals (such as hourly, daily, monthly, or yearly). Unlike other types of data, time series data has a temporal aspect, meaning that time plays a key role in the analysis. The order of the data points matters, and analyzing how the data changes over time is often a central focus.

### Key Characteristics of Time Series Data
#### Temporal Ordering
Time series data is ordered by time, and the sequence in which data points occur is crucial.
#### Frequency
Time series data can be recorded at various intervals, such as:
- Hourly: Sensor readings from a machine every hour.
- Daily: Stock prices at the end of each trading day.
- Monthly: Monthly sales data for a retail store.
#### Trend and Seasonality
Time series data often exhibits trends (long-term upward or downward movements) and seasonality (patterns that recur at regular intervals).
### Examples of Time Series Data
- Stock Market Prices: The closing price of a stock is recorded at the end of each trading day, forming a time series. You could analyze how the price fluctuates daily, weekly, or monthly.

| Date       | Stock Price |
|------------|-------------|
| 2024-01-01 | $150        |
| 2024-01-02 | $152        |
| 2024-01-03 | $148        |

- Weather Data: Daily temperature readings form a time series. You can analyze temperature trends over days, months, or years.

| Date       | Temperature (°C) |
|------------|------------------|
| 2024-01-01 | 5.2              |
| 2024-01-02 | 4.8              |
| 2024-01-03 | 6.0              |

- Sales Data: A retail store’s daily or monthly sales figures are time series data. You could look for trends over time (e.g., sales growing during holiday seasons or dipping during off-seasons).

| Month   | Sales Amount |
|---------|--------------|
| 2024-01 | $10,000      |
| 2024-02 | $9,500       |
| 2024-03 | $12,000      |

- Website Traffic: A website’s hourly or daily visitors form a time series, allowing analysis of how traffic varies throughout the day or week.

| Hour     | Visitors |
|----------|----------|
| 09:00 AM | 200      |
| 10:00 AM | 230      |
| 11:00 AM | 250      |


## Why is Time Series Data Important?

Time series data allows you to analyze patterns over time and make predictions about future values based on past behavior. It’s used in various domains to answer questions like:

- `Trends`: Is there a long-term upward or downward trend in stock prices or sales?
- `Seasonality`: Are there recurring patterns, such as higher sales during holiday seasons or lower website traffic on weekends?
- `Forecasting`: Can we predict the future temperature, sales, or stock prices based on historical data?

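## Working with Datetime Objects

Dates often arrive as plain strings. Converting them with `pd.to_datetime()` gives the column the `datetime64` dtype, which the time series tools in the rest of this module rely on.

In [1]: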
import pandas as pd 

In [2]:
# Sample data with date strings
data = {'Date': ['2024-01-01', '2024-01-02', '2024-01-03'], 'SalesAmount': [200, 150, 300]}
df = pd.DataFrame(data)

In [3]:
# Convert the Date column from strings to datetime

df['Date'] = pd.to_datetime(df['Date'])
print(df['Date'].dtypes)

datetime64[ns]
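Real-world date strings are rarely as clean as the sample above. The sketch below uses a hypothetical `raw` Series (not part of the original data) to show two `pd.to_datetime()` options that often help: an explicit `format` and `errors='coerce'`.

```python
import pandas as pd

# Hypothetical raw strings in day/month/year order, including one invalid entry
raw = pd.Series(['01/02/2024', '15/02/2024', 'not a date'])

# An explicit format removes the ambiguity between day-first and month-first dates;
# errors='coerce' turns unparseable values into NaT instead of raising an error
parsed = pd.to_datetime(raw, format='%d/%m/%Y', errors='coerce')
print(parsed)
```

Coercing to `NaT` keeps the conversion from failing on a single bad row, at the cost of having to check for missing dates afterwards.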


## Extracting Date Components

In [4]:
df

        Date  SalesAmount
0 2024-01-01          200
1 2024-01-02          150
2 2024-01-03          300


In [5]:
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
print(df)

        Date  SalesAmount  Year  Month  Day
0 2024-01-01          200  2024      1    1
1 2024-01-02          150  2024      1    2
2 2024-01-03          300  2024      1    3


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         3 non-null      datetime64[ns]
 1   SalesAmount  3 non-null      int64         
 2   Year         3 non-null      int32         
 3   Month        3 non-null      int32         
 4   Day          3 non-null      int32         
dtypes: datetime64[ns](1), int32(3), int64(1)
memory usage: 212.0 bytes


In [7]:
print(type(df['Date'].dt.year))

<class 'pandas.core.series.Series'>
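Beyond year, month, and day, the `.dt` accessor exposes other calendar attributes. The following is a small sketch on a hypothetical `dates` Series; the column names in `parts` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical dates: a Monday, a Saturday, and a date in the second quarter
dates = pd.Series(pd.to_datetime(['2024-01-01', '2024-01-06', '2024-04-15']))

parts = pd.DataFrame({
    'date': dates,
    'day_name': dates.dt.day_name(),        # weekday name, e.g. 'Monday'
    'day_of_week': dates.dt.dayofweek,      # Monday=0 ... Sunday=6
    'quarter': dates.dt.quarter,            # calendar quarter, 1 to 4
    'is_weekend': dates.dt.dayofweek >= 5,  # True for Saturday and Sunday
})
print(parts)
```

Each of these returns a Series aligned with the original column, so the results can be assigned directly as new columns.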


## Time-Based Indexing and Slicing 
In time series data, using the date as the index lets you select rows directly by date or date range. You can set a datetime column as the index using `set_index()`.

In [8]:
df.set_index('Date', inplace=True)

In [9]:
df

            SalesAmount  Year  Month  Day
Date                                     
2024-01-01          200  2024      1    1
2024-01-02          150  2024      1    2
2024-01-03          300  2024      1    3


## Slicing Time Series

Once the date is the index, you can slice the DataFrame by specific dates or date ranges.

In [10]:
df_slice = df.loc['2024-01-02']
print(df_slice)

SalesAmount     150
Year           2024
Month             1
Day               2
Name: 2024-01-02 00:00:00, dtype: int64


In [12]:
df_s = df.loc['2024-01-01': '2024-01-03']
print(df_s)

            SalesAmount  Year  Month  Day
Date                                     
2024-01-01          200  2024      1    1
2024-01-02          150  2024      1    2
2024-01-03          300  2024      1    3
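A DatetimeIndex also accepts partial date strings, which is easier to see on a longer series than the three-row sample. This sketch assumes a hypothetical 60-day frame named `ts`.

```python
import pandas as pd

# Hypothetical 60-day series starting on 2024-01-01 (ends on 2024-02-29)
idx = pd.date_range('2024-01-01', periods=60, freq='D')
ts = pd.DataFrame({'SalesAmount': range(60)}, index=idx)

# A partial string selects every row in that period
january = ts.loc['2024-01']

# Label-based date slices include both endpoints
first_week_feb = ts.loc['2024-02-01':'2024-02-07']

print(len(january), len(first_week_feb))
```

Note that, unlike positional slicing with integers, the end date of a label slice is included in the result.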


## Resampling and Aggregation

Resampling means changing the frequency of your time series data. For example, you can aggregate daily data into weekly, monthly, or yearly summaries. Pandas provides the `resample()` method for this purpose; it groups rows into time bins, and an aggregation such as `mean()` or `sum()` is then applied to each bin.

### Resampling to a Different Frequency

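You can downsample (move to a lower frequency, such as daily to monthly) or upsample (move to a higher frequency, such as weekly to daily) time series data. Common frequency aliases include:

- `D`: Daily
- `W`: Weekly
- `M`: Monthly (month end; pandas 2.2 and later prefer `ME`)
- `Y`: Yearly (year end; pandas 2.2 and later prefer `YE`)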

In [19]:
df_resampled = df.resample('M').mean()
print(df_resampled)

            SalesAmount    Year  Month  Day
Date                                       
2024-01-31   216.666667  2024.0    1.0  2.0
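The three-row sample collapses into a single monthly row, so the difference between downsampling and upsampling is easier to see on slightly more data. The sketch below uses a hypothetical 14-day frame named `daily`.

```python
import pandas as pd

# Hypothetical two weeks of daily sales, Monday 2024-01-01 to Sunday 2024-01-14
idx = pd.date_range('2024-01-01', periods=14, freq='D')
daily = pd.DataFrame({'SalesAmount': range(1, 15)}, index=idx)

# Downsample: total sales per week (weekly bins end on Sunday by default)
weekly = daily.resample('W').sum()
print(weekly)

# Upsample: go back to daily rows, filling each new row from the previous value
daily_again = weekly.resample('D').ffill()
print(daily_again)
```

Downsampling needs an aggregation because several rows fall into each bin; upsampling needs a fill rule because the new rows have no data of their own.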


## Handling Missing Time Data

Time series data often has missing time periods, which can affect your analysis. You can handle these gaps by either filling the missing values or using interpolation techniques.

### Filling Missing Time Periods

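The `asfreq()` method converts a DataFrame to a specified frequency, inserting a row of NaN for any time period that is missing. It operates on the index, so the date column must be set as the index first. Starting again from the sample data used earlier:

In [46]:
data = {'Date': ['2024-01-01', '2024-01-02', '2024-01-03'], 'SalesAmount': [200, 150, 300]}
df = pd.DataFrame(data)

In [47]:
df['Date'] = pd.to_datetime(df['Date'])

In [48]:
# asfreq() needs a DatetimeIndex, so set Date as the index first
df_filled = df.set_index('Date').asfreq('D')
print(df_filled)

            SalesAmount
Date                   
2024-01-01          200
2024-01-02          150
2024-01-03          300

The sample has no gaps, so the output matches the input; a missing day would appear as a row of NaN.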
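To see missing periods actually being inserted and then filled, the data needs a gap. The following sketch assumes a hypothetical frame named `gappy` in which two days are absent, and compares forward filling with linear interpolation.

```python
import pandas as pd

# Hypothetical sales with 2024-01-03 and 2024-01-04 missing
gappy = pd.DataFrame(
    {'SalesAmount': [200, 150, 300]},
    index=pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05']),
)

# Insert the missing days as rows of NaN
daily = gappy.asfreq('D')

# Option 1: carry the last known value forward
filled_forward = daily.ffill()

# Option 2: linear interpolation between the neighbouring known values
interpolated = daily.interpolate()

print(daily)
print(filled_forward)
print(interpolated)
```

Forward filling suits values that persist until they change (such as a price list), while interpolation suits quantities that vary smoothly (such as temperature). Which one is appropriate depends on the data.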


## Rolling and Expanding Windows

Rolling and expanding windows are techniques used to calculate statistics over a moving window of time. These are commonly used for calculating moving averages, rolling sums, or other time-based metrics.

### Rolling Windows

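The `rolling()` method calculates a statistic over a sliding window of a fixed size. With `window=3`, the first two rows of the result are NaN because fewer than three observations are available at those points. Using our example DataFrame: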

In [49]:
df

        Date  SalesAmount
0 2024-01-01          200
1 2024-01-02          150
2 2024-01-03          300


In [51]:
df['3-day mean'] = df['SalesAmount'].rolling(window=3).mean()

In [52]:
df

        Date  SalesAmount  3-day mean
0 2024-01-01          200         NaN
1 2024-01-02          150         NaN
2 2024-01-03          300  216.666667


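### Expanding Windows

The `expanding()` method computes a statistic over every row from the start of the data up to the current row, so `expanding().sum()` is a cumulative sum.

In [54]: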
df['expanding sum'] = df['SalesAmount'].expanding().sum()

In [55]:
df

        Date  SalesAmount  3-day mean  expanding sum
0 2024-01-01          200         NaN          200.0
1 2024-01-02          150         NaN          350.0
2 2024-01-03          300  216.666667          650.0
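Two `rolling()` options are worth knowing in addition to a fixed row count. This sketch, on a hypothetical `sales` Series with a DatetimeIndex, shows `min_periods` and a time-based window.

```python
import pandas as pd

# Hypothetical daily sales indexed by date
sales = pd.Series(
    [200, 150, 300, 250],
    index=pd.date_range('2024-01-01', periods=4, freq='D'),
)

# min_periods=1 returns a value as soon as one observation is available,
# so the first rows are no longer NaN
partial_mean = sales.rolling(window=3, min_periods=1).mean()
print(partial_mean)

# A time-based window covers the 2 days ending at each timestamp;
# with daily data that is the current row and the previous row
two_day_sum = sales.rolling('2D').sum()
print(two_day_sum)
```

Time-based windows require a datetime index and are useful when observations are irregularly spaced, since the window is defined by elapsed time rather than by a number of rows.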


## Time Shifts and Differences

Time shifting refers to moving data points forward or backward along the time axis. It is useful in time series analysis for comparing data at different points in time, calculating lags, or generating features that capture changes over time. Pandas makes it easy to shift values and compute differences between time periods using the `shift()` and `diff()` methods.

### Shifting Data with shift()

The `shift()` method moves your data forward or backward by a specified number of periods while leaving the index unchanged. Positions with no source value become NaN. This is useful when calculating lag features or creating shifted versions of time series data for comparison.

In [56]:
data = {
    'Date': pd.date_range(start='2024-01-01', periods=5, freq='D'),
    'SalesAmount': [200, 250, 300, 350, 400]
}

df = pd.DataFrame(data)
df.set_index('Date', inplace=True)
print(df)


            SalesAmount
Date                   
2024-01-01          200
2024-01-02          250
2024-01-03          300
2024-01-04          350
2024-01-05          400


In [57]:
df['shifted_sales'] = df['SalesAmount'].shift(1)  # shift(1) moves each value forward by one period (down one row)
print(df)

            SalesAmount  shifted_sales
Date                                  
2024-01-01          200            NaN
2024-01-02          250          200.0
2024-01-03          300          250.0
2024-01-04          350          300.0
2024-01-05          400          350.0
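`shift()` can also move values backward, or move the index instead of the values. A brief sketch on a hypothetical `sales` Series:

```python
import pandas as pd

# Hypothetical three days of sales
sales = pd.Series(
    [200, 250, 300],
    index=pd.date_range('2024-01-01', periods=3, freq='D'),
)

# A negative period pulls the next day's value onto the current row (a "lead");
# the last row becomes NaN
next_day = sales.shift(-1)
print(next_day)

# Passing freq shifts the index labels by one day and keeps every value
moved_index = sales.shift(1, freq='D')
print(moved_index)
```

Shifting values keeps the date range fixed and introduces NaN at one end; shifting with `freq` keeps all values but changes the dates they are attached to.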


### Calculating Differences with diff()

The `diff()` function calculates the difference between consecutive time periods. This is particularly useful for measuring how much a value changes from one time step to the next (e.g., sales growth, stock price changes).

In [58]:
df['sales_diff'] = df['SalesAmount'].diff()
print(df)

            SalesAmount  shifted_sales  sales_diff
Date                                              
2024-01-01          200            NaN         NaN
2024-01-02          250          200.0        50.0
2024-01-03          300          250.0        50.0
2024-01-04          350          300.0        50.0
2024-01-05          400          350.0        50.0
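It can help to see that `diff()` is shorthand for subtracting a shifted copy of the series. The sketch below checks this on the same sales values and also shows the `periods` argument.

```python
import pandas as pd

sales = pd.Series([200, 250, 300, 350, 400])

# diff() is equivalent to subtracting the series shifted by one period
manual = sales - sales.shift(1)
same = sales.diff().equals(manual)
print(same)

# periods=2 compares each value with the one two steps earlier
two_step = sales.diff(periods=2)
print(two_step)
```

With `periods=2`, the first two rows are NaN because there is no value two steps back to compare against.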


### Applications of Shifting and Differences

- Lag Features: You can use shift() to create lagged features that capture the value of a time series at previous time steps. For example, creating features like "Sales lagged by 1 day" can help in building models that predict future sales based on past sales.
- Percentage Change: You can calculate the percentage change between periods using the `pct_change()` function, which is useful for measuring relative changes over time.

In [59]:
df['% change'] = df['SalesAmount'].pct_change() * 100

In [60]:
df

            SalesAmount  shifted_sales  sales_diff   % change
Date                                                         
2024-01-01          200            NaN         NaN        NaN
2024-01-02          250          200.0        50.0  25.000000
2024-01-03          300          250.0        50.0  20.000000
2024-01-04          350          300.0        50.0  16.666667
2024-01-05          400          350.0        50.0  14.285714


This gives the percentage change in `SalesAmount` between consecutive periods.

- Moving Averages: You can combine shifting with rolling functions to calculate moving averages, a key technique in smoothing out time series data and identifying trends.
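As a rough illustration of the lag-feature and moving-average ideas above, the sketch below builds both on the five-day sales data. The column names `lag_1` and `prev_3day_mean` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

df = pd.DataFrame(
    {'SalesAmount': [200, 250, 300, 350, 400]},
    index=pd.date_range('2024-01-01', periods=5, freq='D'),
)

# Lag feature: yesterday's sales on today's row
df['lag_1'] = df['SalesAmount'].shift(1)

# Moving average of the previous 3 days, excluding the current day:
# shift first, then roll, so each row only uses earlier values
df['prev_3day_mean'] = df['SalesAmount'].shift(1).rolling(window=3).mean()

print(df)
```

Shifting before rolling is one way to keep the current day's value out of its own feature, which matters when the feature is later used to predict that same day.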