In [1]:
import pandas as pd
import numpy as np
from time_series_transform.sklearn import *
import time_series_transform as tst

# Introduction

This package provides tools for preprocessing time series data. It has two main components: Time_Series_Transformer and Stock_Transformer. Time_Series_Transformer is a general class for all types of time series data, and Stock_Transformer is a subclass of it. Time_Series_Transformer offers functions for data manipulation, IO conversion, and simple plots. This tutorial takes a quick look at the data manipulation functions and basic IO. The plot functions are covered in a separate tutorial.

# Time_Series_Transformer

Because every time series has a time dimension, Time_Series_Transformer requires a time index column. The simplest case is a time series with no category. In many cases, however, time series data is associated with categories: inventory data is usually tied to product names or stores, and stock data has different ticker names or brokers. To handle this, Time_Series_Transformer lets you specify a main category column. Once it is set, the data can be manipulated in parallel, one category at a time.

Here is a simple example to create a Time_Series_Transformer without specifying its category.

In [2]:
data = {
    'time':[1,2,3,4,5],
    'data1':[1,2,3,4,5],
    'data2':[6,7,8,9,10]
}
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans

data column
-----------
time
data1
data2
time length: 5
category: None


There are two ways to manipulate the data. The first is to use the pre-made functions, and the second is to pass your own function to the transform function. The pre-made functions include make_lag, make_lead, make_lag_sequence, make_lead_sequence, and make_stack_sequence. The following demonstrations cover the lag and lead functions.

### Pre-made functions
make_lag and make_lead create lagged or lead copies of the input columns. This kind of manipulation is useful for building machine learning features.

In [3]:
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.make_lag(
    inputLabels = ['data1','data2'],
    lagNum = 1,
    suffix = '_lag_',
    fillMissing = np.nan
            )
trans.to_pandas()

Unnamed: 0,time,data1,data2,data1_lag_1,data2_lag_1
0,1,1,6,,
1,2,2,7,1.0,6.0
2,3,3,8,2.0,7.0
3,4,4,9,3.0,8.0
4,5,5,10,4.0,9.0


In [4]:
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.make_lead(
    inputLabels = ['data1','data2'],
    leadNum = 1,
    suffix = '_lead_',
    fillMissing = np.nan
            )
trans.to_pandas()

Unnamed: 0,time,data1,data2,data1_lead_1,data2_lead_1
0,1,1,6,2.0,7.0
1,2,2,7,3.0,8.0
2,3,3,8,4.0,9.0
3,4,4,9,5.0,10.0
4,5,5,10,,
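
For readers who want to check these tables independently, the same lag and lead columns can be reproduced with plain pandas. This is only a comparison sketch: it uses `DataFrame.shift` rather than the package's own implementation, and the column names are simply copied from the outputs above.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5],
    "data1": [1, 2, 3, 4, 5],
    "data2": [6, 7, 8, 9, 10],
})

for col in ["data1", "data2"]:
    # shift(1) moves values down one row (lag); shift(-1) moves them up (lead).
    # Rows with no source value are filled with NaN.
    df[f"{col}_lag_1"] = df[col].shift(1)
    df[f"{col}_lead_1"] = df[col].shift(-1)

print(df)
```

The first row of each lag column and the last row of each lead column are missing, which matches the role of `fillMissing = np.nan` in the calls above.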


make_lag_sequence and make_lead_sequence create a sequence of a given window length for a given lag or lead number. This kind of manipulation is useful for deep learning.

In [5]:
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.make_lag_sequence(
    inputLabels = ['data1','data2'],
    windowSize = 2,
    lagNum =1,
    suffix = '_lag_seq_'
)
trans.to_pandas()

Unnamed: 0,time,data1,data2,data1_lag_seq_2,data2_lag_seq_2
0,1,1,6,"[nan, nan]","[nan, nan]"
1,2,2,7,"[nan, 1.0]","[nan, 6.0]"
2,3,3,8,"[1.0, 2.0]","[6.0, 7.0]"
3,4,4,9,"[2.0, 3.0]","[7.0, 8.0]"
4,5,5,10,"[3.0, 4.0]","[8.0, 9.0]"


In [6]:
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.make_lead_sequence(
    inputLabels = ['data1','data2'],
    windowSize = 2,
    leadNum =1,
    suffix = '_lead_seq_'
)
trans.to_pandas()

Unnamed: 0,time,data1,data2,data1_lead_seq_2,data2_lead_seq_2
0,1,1,6,"[2.0, 3.0]","[7.0, 8.0]"
1,2,2,7,"[3.0, 4.0]","[8.0, 9.0]"
2,3,3,8,"[4.0, 5.0]","[9.0, 10.0]"
3,4,4,9,"[nan, nan]","[nan, nan]"
4,5,5,10,"[nan, nan]","[nan, nan]"
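
To make the windowing explicit, here is a minimal pure-Python sketch of the lag-sequence idea. The helper `lag_sequence` is hypothetical and is not part of the package; it only aims to reproduce the `data1_lag_seq_2` column shown above. It does not try to reproduce the lead table, where windows near the end appear as fully missing.

```python
import math

def lag_sequence(values, window_size, lag_num, fill=math.nan):
    """For each position i, collect `window_size` values ending `lag_num` steps before i."""
    out = []
    for i in range(len(values)):
        window = []
        # Oldest value first: walk from the farthest offset back to lag_num.
        for back in range(lag_num + window_size - 1, lag_num - 1, -1):
            j = i - back
            window.append(float(values[j]) if j >= 0 else fill)
        out.append(window)
    return out

seqs = lag_sequence([1, 2, 3, 4, 5], window_size=2, lag_num=1)
print(seqs)
```

Each row receives a fixed-length list, which is the shape that sequence models usually expect as input.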


### Custom Functions

To use the transform function, you write your own function. The input data is passed as a dict of lists, and the output should be a pandas DataFrame, a pandas Series, a numpy ndarray, or a list. Note that the output length must be consistent with the original data length.

For example, this function takes the input dictionary and sums its columns element-wise. The final output is a list.

In [7]:
import copy
def list_output (dataDict):
    res = []
    for i in dataDict:
        if len(res) == 0:
            res = copy.deepcopy(dataDict[i])
            continue
        for ix,v in enumerate(dataDict[i]):
            res[ix] += v
    return res

In [8]:
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.transform(
    inputLabels = ['data1','data2'],
    newName = 'sumCol',
    func = list_output
)
trans.to_pandas()

Unnamed: 0,time,data1,data2,sumCol
0,1,1,6,7
1,2,2,7,9
2,3,3,8,11
3,4,4,9,13
4,5,5,10,15
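
The same element-wise sum can be written more compactly. The function below is a sketch under the assumption stated above, namely that the input arrives as a dict of equal-length lists; `list_output_zip` is a name chosen here for illustration.

```python
def list_output_zip(dataDict):
    # zip(*...) walks the input columns row by row; sum adds each row's values.
    return [sum(row) for row in zip(*dataDict.values())]

sample = {"data1": [1, 2, 3, 4, 5], "data2": [6, 7, 8, 9, 10]}
result = list_output_zip(sample)
print(result)
```

Because it builds a new list, this version does not need `copy.deepcopy` to avoid modifying the input.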


The following example returns a pandas DataFrame and also takes an additional parameter. Note: since the DataFrame already has a column name, the new name is automatically prepended to it, giving sumCol_pandasName.

In [9]:
def pandas_output(dataDict, pandasColName):
    res = []
    for i in dataDict:
        if len(res) == 0:
            res = copy.deepcopy(dataDict[i])
            continue
        for ix,v in enumerate(dataDict[i]):
            res[ix] += v
    return pd.DataFrame({pandasColName:res})

In [10]:
trans = tst.Time_Series_Transformer(data,timeSeriesCol='time')
trans = trans.transform(
    inputLabels = ['data1','data2'],
    newName = 'sumCol',
    func = pandas_output,
    pandasColName = "pandasName"
)
trans.to_pandas()

Unnamed: 0,time,data1,data2,sumCol_pandasName
0,1,1,6,7
1,2,2,7,9
2,3,3,8,11
3,4,4,9,13
4,5,5,10,15
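
As a further illustration of the length requirement, here is a hypothetical custom function that returns a DataFrame of rolling means. It is a sketch that has not been run through `transform`; presumably `window` would be passed as an extra keyword argument in the same way as `pandasColName` above.

```python
import pandas as pd

def rolling_mean_output(dataDict, window):
    frame = pd.DataFrame(dataDict)
    # min_periods=1 averages whatever is available in the first rows,
    # so the result has exactly as many rows as the input.
    return frame.rolling(window, min_periods=1).mean()

sample = {"data1": [1, 2, 3, 4, 5], "data2": [6, 7, 8, 9, 10]}
smoothed = rolling_mean_output(sample, window=2)
print(smoothed)
```

Without `min_periods=1` the output would still have the same number of rows, but the first `window - 1` rows would be NaN.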


### Data with Category

Since time series data can be associated with different categories, Time_Series_Transformer takes a mainCategoryCol parameter that identifies the main category. The class supports only one main category column, because multiple dimensions can be combined into a single new column and used as the main category.

The following example has one category column with two values, a and b. Their timestamps partly overlap and partly differ.

In [15]:
data = {
    "time":[1,2,3,4,5,1,3,4,5],
    'data':[1,2,3,4,5,1,2,3,4],
    "category":['a','a','a','a','a','b','b','b','b']
}

In [17]:
trans = tst.Time_Series_Transformer(data,'time','category')
trans

data column
-----------
time
data
time length: 4
category: b

data column
-----------
time
data
time length: 5
category: a

main category column: category

Because a main category column is specified, the data manipulation functions can use n_jobs to run in parallel. Parallel execution is implemented with joblib (https://joblib.readthedocs.io/en/latest/).

In [18]:
trans = trans.make_lag(
    inputLabels = ['data'],
    lagNum = 1,
    suffix = '_lag_',
    fillMissing = np.nan,
    n_jobs = 2,
    verbose = 10        
)
trans.to_pandas()

[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    2.9s remaining:    0.0s
[Parallel(n_jobs=2)]: Done   2 out of   2 | elapsed:    2.9s finished


Unnamed: 0,time,data,data_lag_1,category
0,1,1,,b
1,3,2,1.0,b
2,4,3,2.0,b
3,5,4,3.0,b
4,1,1,,a
5,2,2,1.0,a
6,3,3,2.0,a
7,4,4,3.0,a
8,5,5,4.0,a
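
The per-category lag above can be cross-checked with a pandas group-wise shift. This sketch runs sequentially and does not use joblib or the package; it only shows that the lag restarts at the beginning of each category.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 1, 3, 4, 5],
    "data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "category": ["a"] * 5 + ["b"] * 4,
})

# Put each category in time order, then shift inside each group
# so that values never leak from one category into another.
df = df.sort_values(["category", "time"]).reset_index(drop=True)
df["data_lag_1"] = df.groupby("category")["data"].shift(1)
print(df)
```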


To further support categories, there are two functions for data whose categories have different time lengths: pad_different_category_time and remove_different_category_time. The first pads every category to the same set of timestamps, while the second removes the timestamps that are not shared by all categories.

In [20]:
trans = tst.Time_Series_Transformer(data,'time','category')
trans = trans.pad_different_category_time(fillMissing = np.nan
)
trans.to_pandas()

Unnamed: 0,time,data,category
0,1,1.0,b
1,2,,b
2,3,2.0,b
3,4,3.0,b
4,5,4.0,b
5,1,1.0,a
6,2,2.0,a
7,3,3.0,a
8,4,4.0,a
9,5,5.0,a


In [21]:
trans = tst.Time_Series_Transformer(data,'time','category')
trans = trans.remove_different_category_time()
trans.to_pandas()

Unnamed: 0,time,data,category
0,1,1,b
1,3,2,b
2,4,3,b
3,5,4,b
4,1,1,a
5,3,3,a
6,4,4,a
7,5,5,a
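
The two strategies can be sketched in plain pandas as follows. This is an illustration of the idea (union of timestamps for padding, intersection for removal), not the package's implementation, and the variable names are chosen here for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 1, 3, 4, 5],
    "data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "category": ["a"] * 5 + ["b"] * 4,
})

# Pad: give every category every timestamp; missing rows become NaN.
all_times = sorted(df["time"].unique())
cats = df["category"].unique()
full_index = pd.MultiIndex.from_product([cats, all_times], names=["category", "time"])
padded = df.set_index(["category", "time"]).reindex(full_index).reset_index()

# Remove: keep only the timestamps that appear in every category.
counts = df.groupby("time")["category"].nunique()
common = counts[counts == df["category"].nunique()].index
removed = df[df["time"].isin(common)].reset_index(drop=True)

print(padded)
print(removed)
```

Padding keeps all observations at the cost of introducing missing values, while removal keeps the data complete at the cost of dropping rows.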


## IO

IO is a major component of this package. The current version supports pandas DataFrame, numpy ndarray, Apache Arrow Table, Apache Feather, and Apache Parquet. Every export format lets you choose whether to expand the category or the time dimension. This demo shows numpy and pandas. The transformer can also combine the make_label function with the sepLabel parameter of the export functions to separate data and labels.

### pandas

In [22]:
data = {
    "time":[1,2,3,4,5,1,3,4,5],
    'data':[1,2,3,4,5,1,2,3,4],
    "category":['a','a','a','a','a','b','b','b','b']
}
df = pd.DataFrame(data)

In [25]:
trans = tst.Time_Series_Transformer.from_pandas(
    pandasFrame = df,
    timeSeriesCol = 'time',
    mainCategoryCol= 'category'
)
trans

data column
-----------
time
data
time length: 4
category: b

data column
-----------
time
data
time length: 5
category: a

main category column: category

To expand the data, all categories must have consistent timestamps. Besides the pad and remove functions, the preprocessType parameter can achieve this.

In [27]:
trans.to_pandas(
    expandCategory = True,
    expandTime = False,
    preprocessType = 'pad'
)

Unnamed: 0,time,data_b,data_a
0,1,1.0,1
1,2,,2
2,3,2.0,3
3,4,3.0,4
4,5,4.0,5


In [28]:
trans.to_pandas(
    expandCategory = False,
    expandTime = True,
    preprocessType = 'pad'
)

Unnamed: 0,data_1,data_2,data_3,data_4,data_5,category
0,1,,2,3,4,b
1,1,2.0,3,4,5,a


In [29]:
trans.to_pandas(
    expandCategory = True,
    expandTime = True,
    preprocessType = 'pad'
)

Unnamed: 0,data_b_1,data_a_1,data_b_2,data_a_2,data_b_3,data_a_3,data_b_4,data_a_4,data_b_5,data_a_5
0,1.0,1,,2,2.0,3,3.0,4,4.0,5
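
The expanded layouts are essentially long-to-wide reshapes. As a rough pandas analogue (an assumption about the shape of the result, not a description of the package internals), `DataFrame.pivot` produces comparable tables and fills the missing category and timestamp combination with NaN, much like `preprocessType = 'pad'`.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 1, 3, 4, 5],
    "data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "category": ["a"] * 5 + ["b"] * 4,
})

# Similar to expandCategory=True: one row per timestamp, one column per category.
by_category = df.pivot(index="time", columns="category", values="data").add_prefix("data_")

# Similar to expandTime=True: one row per category, one column per timestamp.
by_time = df.pivot(index="category", columns="time", values="data").add_prefix("data_")

print(by_category)
print(by_time)
```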


The make_label function can be used together with the sepLabel parameter to separate X and y for machine learning.

In [35]:
trans = trans.make_lead('data',leadNum = 1,suffix = '_lead_')
trans = trans.make_label("data_lead_1")

In [37]:
data, label = trans.to_pandas(
    expandCategory = False,
    expandTime = False,
    preprocessType = 'pad',
    sepLabel = True
)

In [38]:
data

Unnamed: 0,time,data,category
0,1,1.0,b
1,2,,b
2,3,2.0,b
3,4,3.0,b
4,5,4.0,b
5,1,1.0,a
6,2,2.0,a
7,3,3.0,a
8,4,4.0,a
9,5,5.0,a


In [39]:
label

Unnamed: 0,data_lead_1
0,2.0
1,
2,3.0
3,4.0
4,
5,2.0
6,3.0
7,4.0
8,5.0
9,
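
Conceptually, marking a column as a label and exporting with `sepLabel = True` splits the table into features and target. A minimal pandas sketch of that split, without padding and without using the package, could look like this; `label_cols` is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 1, 3, 4, 5],
    "data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "category": ["a"] * 5 + ["b"] * 4,
})

# Next-step value within each category serves as the prediction target.
df["data_lead_1"] = df.groupby("category")["data"].shift(-1)

label_cols = ["data_lead_1"]
X = df.drop(columns=label_cols)  # everything except the label columns
y = df[label_cols]               # only the label columns

print(X)
print(y)
```

The last row of each category has no next value, so its label is NaN and would normally be dropped or masked before training.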


### numpy
Since a numpy array has no column names, columns are specified by index number.

In [40]:
data = {
    "time":[1,2,3,4,5,1,3,4,5],
    'data':[1,2,3,4,5,1,2,3,4],
    "category":['a','a','a','a','a','b','b','b','b']
}
npArray = pd.DataFrame(data).values

In [47]:
trans = tst.Time_Series_Transformer.from_numpy(
    numpyData= npArray,
    timeSeriesCol = 0,
    mainCategoryCol = 2)
trans

data column
-----------
0
1
time length: 4
category: b

data column
-----------
0
1
time length: 5
category: a

main category column: 2
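
The integer arguments above are column positions in the array. The short check below, which uses only pandas and numpy, shows which position holds which original column after `.values` drops the names.

```python
import pandas as pd

data = {
    "time": [1, 2, 3, 4, 5, 1, 3, 4, 5],
    "data": [1, 2, 3, 4, 5, 1, 2, 3, 4],
    "category": ["a", "a", "a", "a", "a", "b", "b", "b", "b"],
}
npArray = pd.DataFrame(data).values

print(npArray.dtype)   # mixed integer and string columns give an object array
print(npArray[:, 0])   # position 0: the former "time" column
print(npArray[:, 2])   # position 2: the former "category" column
```

Positions follow the column order of the DataFrame the array was created from, so reordering the columns changes the indices to pass.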

In [50]:
trans = trans.make_lead(1,leadNum = 1,suffix = '_lead_')
trans = trans.make_label("1_lead_1")

In [52]:
X,y = trans.to_pandas(
    expandCategory = False,
    expandTime = False,
    preprocessType = 'pad',
    sepLabel = True
)

In [53]:
X

Unnamed: 0,0,1,2
0,1,1.0,b
1,2,,b
2,3,2.0,b
3,4,3.0,b
4,5,4.0,b
5,1,1.0,a
6,2,2.0,a
7,3,3.0,a
8,4,4.0,a
9,5,5.0,a


In [54]:
y

Unnamed: 0,1_lead_1
0,2.0
1,
2,3.0
3,4.0
4,
5,2.0
6,3.0
7,4.0
8,5.0
9,


# Stock_Transformer

Stock_Transformer is a subclass of Time_Series_Transformer, so all the functions demonstrated for Time_Series_Transformer can also be used with Stock_Transformer. The difference is that Stock_Transformer requires the High, Low, Open, Close, and Volume columns to be specified. In addition, it integrates pandas-ta strategies to create technical indicators (https://github.com/twopirllc/pandas-ta). The IO class for Stock_Transformer also supports yfinance and investpy, so data can be pulled directly from these APIs.

### create technical indicator

In [56]:
stock = tst.Stock_Transformer.from_stock_engine_period(
    symbols = 'GOOGL',period ='1y', engine = 'yahoo'
)
stock

data column
-----------
Date
Open
High
Low
Close
Volume
Dividends
Stock Splits
time length: 253
category: None


In [58]:
import pandas_ta as ta
MyStrategy = ta.Strategy(
    name="DCSMA10",
    ta=[
        {"kind": "ohlc4"},
        {"kind": "sma", "length": 10},
        {"kind": "donchian", "lower_length": 10, "upper_length": 15},
        {"kind": "ema", "close": "OHLC4", "length": 10, "suffix": "OHLC4"},
    ]
)

In [61]:
stock = stock.get_technial_indicator(MyStrategy)
stock.to_pandas().head()

Unnamed: 0,Date,Open,High,Low,Close,Volume,Dividends,Stock Splits,OHLC4,SMA_10,DCL_10_15,DCM_10_15,DCU_10_15,EMA_10_OHLC4
0,2020-01-02,1348.410034,1368.680054,1346.48999,1368.680054,1363900,0,0,1358.065033,,,,,
1,2020-01-03,1348.0,1373.75,1347.319946,1361.52002,1170400,0,0,1357.647491,,,,,
2,2020-01-06,1351.630005,1398.319946,1351.0,1397.810059,2338400,0,0,1374.690002,,,,,
3,2020-01-07,1400.459961,1403.5,1391.560059,1395.109985,1716500,0,0,1397.657501,,,,,
4,2020-01-08,1394.819946,1411.849976,1392.630005,1405.040039,1765700,0,0,1401.084991,,,,,
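
To see why SMA_10 is empty in the first rows while OHLC4 is not, the two columns can be recomputed by hand from the five rows printed above. This sketch uses plain pandas arithmetic instead of pandas-ta, and assumes the usual definitions: OHLC4 as the mean of the four prices and SMA as a simple rolling mean of Close.

```python
import pandas as pd

# First five rows of the GOOGL output above.
prices = pd.DataFrame({
    "Open":  [1348.410034, 1348.0, 1351.630005, 1400.459961, 1394.819946],
    "High":  [1368.680054, 1373.75, 1398.319946, 1403.5, 1411.849976],
    "Low":   [1346.48999, 1347.319946, 1351.0, 1391.560059, 1392.630005],
    "Close": [1368.680054, 1361.52002, 1397.810059, 1395.109985, 1405.040039],
})

# OHLC4 needs only the current row, so it is defined from the first row on.
prices["OHLC4"] = prices[["Open", "High", "Low", "Close"]].mean(axis=1)

# A 10-period rolling mean needs 10 rows of history, so these 5 rows are all NaN.
prices["SMA_10"] = prices["Close"].rolling(10).mean()

print(prices[["OHLC4", "SMA_10"]])
```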


For more usage examples, please visit our gallery.

In [1]:
from time_series_transform.transform_core_api.tfDataset_adopter import *
from time_series_transform.stock_transform.stock_transfromer import Stock_Transformer

In [2]:
st = Stock_Transformer.from_stock_engine_period('AAPL','1y','yahoo')

In [12]:
st = st.make_lag_sequence(["Close"],20,1,'_lag_seq_')

In [15]:
df = st.to_pandas()
data =df.to_dict('records')

In [16]:
tw = TFRecord_Writer('./test.tfRecord')
tw.write_tfRecord(data)

In [17]:
tr = TFRecord_Reader('./test.tfRecord',tw.get_tfRecord_dtype())

In [18]:
dataset = tr.make_tfDataset()

In [19]:
for i in dataset.as_numpy_iterator():
    print(i)

{'Date': b'2020-01-02', 'Open': 73.425896, 'High': 74.50657, 'Low': 73.16565, 'Close': 74.4446, 'Volume': 135480400, 'Dividends': 0.0, 'Stock Splits': 0.0, 'Close_lag_seq_20': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan], dtype=float32)}
{'Date': b'2020-01-03', 'Open': 73.65144, 'High': 74.501595, 'Low': 73.49033, 'Close': 73.72084, 'Volume': 146322800, 'Dividends': 0.0, 'Stock Splits': 0.0, 'Close_lag_seq_20': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan], dtype=float32)}
{'Date': b'2020-01-06', 'Open': 72.818634, 'High': 74.34792, 'Low': 72.56086, 'Close': 74.308266, 'Volume': 118387200, 'Dividends': 0.0, 'Stock Splits': 0.0, 'Close_lag_seq_20': array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan, nan, nan, nan, nan, nan, nan], dtype=float32)}
{'Date': b'2020-01-07', 'Open': 74.318184, 'High': 74.58092, 'Low': 73.73324

{'Date': b'2020-09-09', 'Open': 117.05805, 'High': 118.93481, 'Low': 115.06149, 'Close': 117.11794, 'Volume': 176940500, 'Dividends': 0.0, 'Stock Splits': 0.0, 'Close_lag_seq_20': array([109.18662 , 112.81537 , 114.81192 , 114.7096  , 114.41011 ,
       115.36347 , 115.50822 , 118.0713  , 124.1558  , 125.64074 ,
       124.610016, 126.304596, 124.7947  , 124.59255 , 128.81775 ,
       133.9489  , 131.17369 , 120.67181 , 120.75167 , 112.625694],
      dtype=float32)}
{'Date': b'2020-09-10', 'Open': 120.15271, 'High': 120.29247, 'Low': 112.306244, 'Close': 113.29454, 'Volume': 182274400, 'Dividends': 0.0, 'Stock Splits': 0.0, 'Close_lag_seq_20': array([112.81537 , 114.81192 , 114.7096  , 114.41011 , 115.36347 ,
       115.50822 , 118.0713  , 124.1558  , 125.64074 , 124.610016,
       126.304596, 124.7947  , 124.59255 , 128.81775 , 133.9489  ,
       131.17369 , 120.67181 , 120.75167 , 112.625694, 117.11794 ],
      dtype=float32)}
{'Date': b'2020-09-11', 'Open': 114.37268, 'High': 115.03

In [3]:
tw.write_tfRecord([{"data":1}])

In [34]:
import tensorflow as tf  # explicit import, in case the star import above does not expose tf

dataset = tf.data.TFRecordDataset("./test.tfRecord","GZIP")

In [35]:
for i in dataset.as_numpy_iterator():
    print(i)
    break

b'\n\x8d\x01\n\x10\n\x04Open\x12\x08\x12\x06\n\x04\x0f\xda\x92B\n\x10\n\x04High\x12\x08\x12\x06\n\x04]\x03\x95B\n\x0f\n\x03Low\x12\x08\x12\x06\n\x04\xd0T\x92B\n\x11\n\x05Close\x12\x08\x12\x06\n\x04\xa3\xe3\x94B\n\x12\n\x06Volume\x12\x08\x1a\x06\n\x04\xd0\x88\xcd@\n\x15\n\tDividends\x12\x08\x12\x06\n\x04\x00\x00\x00\x00\n\x18\n\x0cStock Splits\x12\x08\x12\x06\n\x04\x00\x00\x00\x00'


