# ANOVOS - TS Auto Detection
**This notebook demonstrates the functions of the "ts_auto_detection" module provided in the ANOVOS package:**
- [ts_loop_cols_pre](#ts_loop_cols_pre)
- [regex_date_time_parser](#regex_date_time_parser)
- [ts_preprocess](#ts_preprocess)

The API specification of the **ts_auto_detection** module can be found here: [API Specification](https://docs.anovos.ai/api/data_ingest/ts_auto_detection.html)

**Setting Spark Session**

In [1]:
from anovos.shared.spark import *

2022-03-09 13:00:27.225 | INFO     | anovos.shared.spark:init_spark:54 - Getting spark session, context and sql context app_name: Anovos_pipeline


**Input/Output Path**

In [2]:
inputPath = "../data/time_series_data/csv"
outputPath = "../output/time_series_data/ts_autodetection"

In [3]:
from anovos.data_ingest.data_ingest import read_dataset
from anovos.shared.utils import ends_with

In [4]:
df = read_dataset(spark, file_path=inputPath, file_type="csv",
                  file_configs={"header": "True", "delimiter": ",", "inferSchema": "True"})
df.toPandas().head(5)

Unnamed: 0,STATE,YR,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP
0,ALABAMA,1970,15032.67,7325.8,1655.68,6051.2,35793.8,28418,1010.5,4.7
1,ALABAMA,1971,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2
2,ALABAMA,1972,15972.41,7765.42,1764.75,6442.23,38670.3,31303,1072.3,4.7
3,ALABAMA,1973,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9
4,ALABAMA,1974,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5


In [5]:
from anovos.data_ingest.ts_auto_detection import regex_date_time_parser, ts_loop_cols_pre, ts_preprocess

## ts_loop_cols_pre
- The API specification of the **ts_loop_cols_pre** function can be found <a href="https://docs.anovos.ai/api/data_ingest/ts_auto_detection.html">here</a>

In [6]:
odf = ts_loop_cols_pre(idf=df, id_col='STATE')
odf

(['STATE', 'YR', 'P_CAP', 'HWY', 'WATER', 'UTIL', 'PC', 'GSP', 'EMP', 'UNEMP'],
 ['NA', 'int_c', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'],
 [14, 4, 9, 8, 8, 8, 9, 6, 7, 4])

In [7]:
# List of the field names present in the dataset
odf[0]

['STATE', 'YR', 'P_CAP', 'HWY', 'WATER', 'UTIL', 'PC', 'GSP', 'EMP', 'UNEMP']

In [8]:
# List of the custom data type of each field, which helps identify columns that can be passed for time series analysis
odf[1]

['NA', 'int_c', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA']

In [9]:
# List describing maximum character length of each field
odf[2]

[14, 4, 9, 8, 8, 8, 9, 6, 7, 4]

**Note:** Here the 'YR' column is identified with the custom data type 'int_c' and a maximum length of 4 characters, so it can be passed for time series analysis.
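
The three lists returned above are aligned by position, so candidate columns can be picked out by zipping them together. The following is a minimal sketch in plain Python, using the values copied from the `ts_loop_cols_pre` output shown above. It assumes that `'NA'` marks fields that were not flagged as time series candidates; the variable names are illustrative only.

```python
# Values copied from the ts_loop_cols_pre output shown above.
col_names = ['STATE', 'YR', 'P_CAP', 'HWY', 'WATER', 'UTIL', 'PC', 'GSP', 'EMP', 'UNEMP']
col_types = ['NA', 'int_c', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA']
col_lengths = [14, 4, 9, 8, 8, 8, 9, 6, 7, 4]

# Assumption: a custom type of 'NA' means the field was not flagged as a candidate.
candidates = [
    (name, custom_type, max_length)
    for name, custom_type, max_length in zip(col_names, col_types, col_lengths)
    if custom_type != 'NA'
]

print(candidates)  # [('YR', 'int_c', 4)]
```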

## regex_date_time_parser
- The API specification of the **regex_date_time_parser** function can be found <a href="https://docs.anovos.ai/api/data_ingest/ts_auto_detection.html">here</a>

In [10]:
odf = regex_date_time_parser(spark, idf=df, id_col="STATE", col="YR", tz="local", val_unique_cat=4, trans_cat="int_c")

In [11]:
odf.toPandas()

Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
2,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
3,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
4,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
...,...,...,...,...,...,...,...,...,...,...
39163,WYOMING,5700.41,3400.96,565.58,1733.88,27110.51,10870,196.3,9.0,1986-01-01
39164,WYOMING,5700.41,3400.96,565.58,1733.88,27110.51,10870,196.3,9.0,1986-01-01
39165,WYOMING,5700.41,3400.96,565.58,1733.88,27110.51,10870,196.3,9.0,1986-01-01
39166,WYOMING,5700.41,3400.96,565.58,1733.88,27110.51,10870,196.3,9.0,1986-01-01
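
In the output above, the integer years in 'YR' appear as dates on January 1 of each year. The sketch below reproduces that kind of conversion for a single value using only the Python standard library. It is an illustration of the transformation, not the Anovos implementation, and `year_to_date` is a hypothetical helper name.

```python
from datetime import datetime


def year_to_date(year):
    """Parse a four-digit year into a date on January 1 of that year."""
    # With only "%Y" in the format, strptime defaults month and day to 1.
    return datetime.strptime(str(year), "%Y").date()


print(year_to_date(1970))  # 1970-01-01
print(year_to_date(1986))  # 1986-01-01
```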


## ts_preprocess
- The API specification of the **ts_preprocess** function can be found <a href="https://docs.anovos.ai/api/data_ingest/ts_auto_detection.html">here</a>

In [12]:
# Example 1 - passing all arguments
odf = ts_preprocess(spark, idf=df, id_col='STATE', output_path=outputPath, tz_offset="local", run_type="local")
odf.toPandas()

Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2,1971-01-01
2,ALABAMA,15972.41,7765.42,1764.75,6442.23,38670.30,31303,1072.3,4.7,1972-01-01
3,ALABAMA,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9,1973-01-01
4,ALABAMA,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5,1974-01-01
...,...,...,...,...,...,...,...,...,...,...
811,WYOMING,4731.98,3060.64,408.43,1262.90,27724.96,13056,217.7,5.8,1982-01-01
812,WYOMING,4950.82,3119.98,445.59,1385.25,28586.46,11922,202.5,8.4,1983-01-01
813,WYOMING,5184.73,3195.68,476.57,1512.48,28794.80,12073,204.3,6.3,1984-01-01
814,WYOMING,5448.38,3295.92,523.01,1629.45,29326.94,12022,206.9,7.1,1985-01-01


In [13]:
# Example 2 - with mandatory arguments only (the remaining arguments use their default values)
odf1 = ts_preprocess(spark, idf=df, id_col='STATE', output_path=outputPath)
odf1.toPandas()

Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2,1971-01-01
2,ALABAMA,15972.41,7765.42,1764.75,6442.23,38670.30,31303,1072.3,4.7,1972-01-01
3,ALABAMA,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9,1973-01-01
4,ALABAMA,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5,1974-01-01
...,...,...,...,...,...,...,...,...,...,...
811,WYOMING,4731.98,3060.64,408.43,1262.90,27724.96,13056,217.7,5.8,1982-01-01
812,WYOMING,4950.82,3119.98,445.59,1385.25,28586.46,11922,202.5,8.4,1983-01-01
813,WYOMING,5184.73,3195.68,476.57,1512.48,28794.80,12073,204.3,6.3,1984-01-01
814,WYOMING,5448.38,3295.92,523.01,1629.45,29326.94,12022,206.9,7.1,1985-01-01


In [14]:
# Example 3 - using tz_offset as gmt
odf = ts_preprocess(spark, idf=df, id_col='STATE', output_path=outputPath, tz_offset="gmt", run_type="local")
odf.toPandas()

Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2,1971-01-01
2,ALABAMA,15972.41,7765.42,1764.75,6442.23,38670.30,31303,1072.3,4.7,1972-01-01
3,ALABAMA,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9,1973-01-01
4,ALABAMA,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5,1974-01-01
...,...,...,...,...,...,...,...,...,...,...
811,WYOMING,4731.98,3060.64,408.43,1262.90,27724.96,13056,217.7,5.8,1982-01-01
812,WYOMING,4950.82,3119.98,445.59,1385.25,28586.46,11922,202.5,8.4,1983-01-01
813,WYOMING,5184.73,3195.68,476.57,1512.48,28794.80,12073,204.3,6.3,1984-01-01
814,WYOMING,5448.38,3295.92,523.01,1629.45,29326.94,12022,206.9,7.1,1985-01-01


In [15]:
# Example 4 - using tz_offset as utc
odf = ts_preprocess(spark, idf=df, id_col='STATE', output_path=outputPath, tz_offset="utc", run_type="local")
odf.toPandas()

Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2,1971-01-01
2,ALABAMA,15972.41,7765.42,1764.75,6442.23,38670.30,31303,1072.3,4.7,1972-01-01
3,ALABAMA,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9,1973-01-01
4,ALABAMA,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5,1974-01-01
...,...,...,...,...,...,...,...,...,...,...
811,WYOMING,4731.98,3060.64,408.43,1262.90,27724.96,13056,217.7,5.8,1982-01-01
812,WYOMING,4950.82,3119.98,445.59,1385.25,28586.46,11922,202.5,8.4,1983-01-01
813,WYOMING,5184.73,3195.68,476.57,1512.48,28794.80,12073,204.3,6.3,1984-01-01
814,WYOMING,5448.38,3295.92,523.01,1629.45,29326.94,12022,206.9,7.1,1985-01-01


**Note:** Varying tz_offset makes no visible difference to the output here, because the dataset does not have any timestamp columns; its only time-related field is the year.
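
To see why a timezone setting matters for timestamps but not for date-only values, consider the following standard-library sketch. It is an illustration under stated assumptions and does not show how Anovos applies `tz_offset` internally; the UTC-5 offset is an arbitrary choice for the example.

```python
from datetime import date, datetime, timedelta, timezone

# A date-only value carries no time of day and no timezone information.
year_start = date(1970, 1, 1)

# A timestamp is a specific instant, so its wall-clock reading depends on the timezone.
utc_ts = datetime(1970, 1, 1, 0, 0, tzinfo=timezone.utc)
minus_five = timezone(timedelta(hours=-5))  # hypothetical offset for illustration
shifted_ts = utc_ts.astimezone(minus_five)

print(year_start)  # 1970-01-01
print(shifted_ts)  # 1969-12-31 19:00:00-05:00
```

A dataset with true timestamp columns would therefore be the natural place to observe different results across `tz_offset` values.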