# ANOVOS - TS Auto Detection
**This notebook walks through the functions of the "ts_auto_detection" module provided in the ANOVOS package:**
- [ts_loop_cols_pre](#ts_loop_cols_pre)
- [regex_date_time_parser](#regex_date_time_parser)
- [ts_preprocess](#ts_preprocess)

The API specification of the **ts_auto_detection** module can be found here: [API Specification](https://docs.anovos.ai/api/data_ingest/ts_auto_detection.html)

**Setting Spark Session**

In [1]:
# Set the run type; it selects how the Spark session is created below
run_type = "local"  # one of "local", "emr", "databricks", "ak8s"

In [4]:
# For run_type "ak8s" (Azure Kubernetes), the Spark session is configured manually below
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if run_type == "ak8s":
    fs_path="<insert conf spark.hadoop.fs master url here> ex: spark.hadoop.fs.azure.sas.<container>.<account_name>.blob.core.windows.net"
    auth_key="<insert value of sas_token here>"
    master_url="<insert kubernetes master url path here> ex: k8s://"
    docker_image="<insert name docker image here>"
    kubernetes_namespace ="<insert kubernetes namespace here>"

    # Create Spark config for our Kubernetes based cluster manager
    sparkConf = SparkConf()
    sparkConf.setMaster(master_url)
    sparkConf.setAppName("Anovos_pipeline")
    sparkConf.set("spark.submit.deployMode","client")
    sparkConf.set("spark.kubernetes.container.image", docker_image)
    sparkConf.set("spark.kubernetes.namespace", kubernetes_namespace)
    sparkConf.set("spark.executor.instances", "4")
    sparkConf.set("spark.executor.cores", "4")
    sparkConf.set("spark.executor.memory", "16g")
    sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
    sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    sparkConf.set(fs_path,auth_key)
    sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
    sparkConf.set("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3,io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20,org.apache.spark:spark-avro_2.12:3.2.1")

    # Initialize our Spark cluster, this will actually
    # generate the worker nodes.
    spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
    sc = spark.sparkContext

# For all other run types, import the shared Spark session from anovos.shared.spark
else:
    from anovos.shared.spark import *
    auth_key = "NA"

**Input/Output Path**

In [5]:
inputPath = "../data/time_series_data/csv"
outputPath = "../output/time_series_data/ts_autodetection"

In [6]:
from anovos.data_ingest.data_ingest import read_dataset
from anovos.shared.utils import ends_with

In [7]:
df = read_dataset(
    spark,
    file_path=inputPath,
    file_type="csv",
    file_configs={"header": "True", "delimiter": ",", "inferSchema": "True"},
)
df.toPandas().head(5)

Unnamed: 0,STATE,YR,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP
0,ALABAMA,1970,15032.67,7325.8,1655.68,6051.2,35793.8,28418,1010.5,4.7
1,ALABAMA,1971,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2
2,ALABAMA,1972,15972.41,7765.42,1764.75,6442.23,38670.3,31303,1072.3,4.7
3,ALABAMA,1973,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9
4,ALABAMA,1974,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5


In [8]:
from anovos.data_ingest.ts_auto_detection import regex_date_time_parser,ts_loop_cols_pre,ts_preprocess

## ts_loop_cols_pre
- The API specification of the **ts_loop_cols_pre** function can be found <a href="https://docs.anovos.ai/api/data_ingest/ts_auto_detection.html">here</a>

In [9]:
odf = ts_loop_cols_pre(idf=df, id_col='STATE')
odf

(['STATE', 'YR', 'P_CAP', 'HWY', 'WATER', 'UTIL', 'PC', 'GSP', 'EMP', 'UNEMP'],
 ['NA', 'int_c', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA'],
 [14, 4, 9, 8, 8, 8, 9, 6, 7, 4])

In [10]:
# List describing the names of different fields present in the dataset
odf[0]

['STATE', 'YR', 'P_CAP', 'HWY', 'WATER', 'UTIL', 'PC', 'GSP', 'EMP', 'UNEMP']

In [11]:
# List of custom data types, one per field; these help identify columns that are candidates for time series analysis
odf[1]

['NA', 'int_c', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA']

In [12]:
# List describing maximum character length of each field
odf[2]

[14, 4, 9, 8, 8, 8, 9, 6, 7, 4]

**Note:** The 'YR' column is flagged with the 'int_c' custom data type and has a maximum length of 4 characters, so it is a candidate for time series analysis.
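
The three lists returned by `ts_loop_cols_pre` are aligned by position, so the candidate columns can be picked out programmatically. The snippet below is a minimal sketch in plain Python using the values printed above; the helper name `candidate_ts_columns` is hypothetical, and treating every type other than `'NA'` as a candidate is an assumption made for illustration, not a documented Anovos rule.

```python
# Values copied from the ts_loop_cols_pre output shown above.
names = ['STATE', 'YR', 'P_CAP', 'HWY', 'WATER', 'UTIL', 'PC', 'GSP', 'EMP', 'UNEMP']
custom_types = ['NA', 'int_c', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA']
max_lengths = [14, 4, 9, 8, 8, 8, 9, 6, 7, 4]


def candidate_ts_columns(names, custom_types, max_lengths):
    """Return (name, custom_type, max_length) for each field not flagged 'NA'.

    Hypothetical helper: assumes 'NA' marks a field that is not a candidate.
    """
    return [
        (name, ctype, length)
        for name, ctype, length in zip(names, custom_types, max_lengths)
        if ctype != "NA"
    ]


candidates = candidate_ts_columns(names, custom_types, max_lengths)
print(candidates)  # [('YR', 'int_c', 4)]
```

With a real dataset, `names`, `custom_types`, and `max_lengths` would simply be `odf[0]`, `odf[1]`, and `odf[2]`.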

## regex_date_time_parser
- The API specification of the **regex_date_time_parser** function can be found <a href="https://docs.anovos.ai/api/data_ingest/ts_auto_detection.html">here</a>

In [13]:
odf = regex_date_time_parser(
    spark,
    idf=df,
    id_col="STATE",
    col="YR",
    tz="local",
    val_unique_cat=4,
    trans_cat="int_c",
)

In [14]:
odf.toPandas()

Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
2,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
3,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
4,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
...,...,...,...,...,...,...,...,...,...,...
39163,WYOMING,5700.41,3400.96,565.58,1733.88,27110.51,10870,196.3,9.0,1986-01-01
39164,WYOMING,5700.41,3400.96,565.58,1733.88,27110.51,10870,196.3,9.0,1986-01-01
39165,WYOMING,5700.41,3400.96,565.58,1733.88,27110.51,10870,196.3,9.0,1986-01-01
39166,WYOMING,5700.41,3400.96,565.58,1733.88,27110.51,10870,196.3,9.0,1986-01-01
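
In the output above, the integer years of the input (for example 1970) appear in `YR` as dates (1970-01-01). The snippet below is a minimal standard-library sketch of that kind of conversion; it is not Anovos's implementation, and the helper name `year_to_date` is hypothetical.

```python
from datetime import datetime


def year_to_date(year):
    """Parse a 4-digit year and return the first day of that year as a date."""
    # "%Y" alone leaves month and day at their defaults (January 1).
    return datetime.strptime(str(year), "%Y").date()


converted = [year_to_date(y).isoformat() for y in (1970, 1971, 1986)]
print(converted)  # ['1970-01-01', '1971-01-01', '1986-01-01']
```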


## ts_preprocess
- The API specification of the **ts_preprocess** function can be found <a href="https://docs.anovos.ai/api/data_ingest/ts_auto_detection.html">here</a>

In [15]:
# Example 1 - by passing all arguments 
odf = ts_preprocess(spark, idf=df, id_col='STATE', output_path=outputPath, tz_offset="local", run_type=run_type, auth_key=auth_key)
odf.toPandas()

                                                                                

Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2,1971-01-01
2,ALABAMA,15972.41,7765.42,1764.75,6442.23,38670.30,31303,1072.3,4.7,1972-01-01
3,ALABAMA,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9,1973-01-01
4,ALABAMA,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5,1974-01-01
...,...,...,...,...,...,...,...,...,...,...
811,WYOMING,4731.98,3060.64,408.43,1262.90,27724.96,13056,217.7,5.8,1982-01-01
812,WYOMING,4950.82,3119.98,445.59,1385.25,28586.46,11922,202.5,8.4,1983-01-01
813,WYOMING,5184.73,3195.68,476.57,1512.48,28794.80,12073,204.3,6.3,1984-01-01
814,WYOMING,5448.38,3295.92,523.01,1629.45,29326.94,12022,206.9,7.1,1985-01-01


In [16]:
# Example 2 - passing only the mandatory arguments (the remaining arguments use their default values)
odf1 = ts_preprocess(spark, idf=df, id_col='STATE', output_path=outputPath)
odf1.toPandas()

22/11/30 09:21:47 WARN CacheManager: Asked to cache already cached data.


Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2,1971-01-01
2,ALABAMA,15972.41,7765.42,1764.75,6442.23,38670.30,31303,1072.3,4.7,1972-01-01
3,ALABAMA,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9,1973-01-01
4,ALABAMA,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5,1974-01-01
...,...,...,...,...,...,...,...,...,...,...
811,WYOMING,4731.98,3060.64,408.43,1262.90,27724.96,13056,217.7,5.8,1982-01-01
812,WYOMING,4950.82,3119.98,445.59,1385.25,28586.46,11922,202.5,8.4,1983-01-01
813,WYOMING,5184.73,3195.68,476.57,1512.48,28794.80,12073,204.3,6.3,1984-01-01
814,WYOMING,5448.38,3295.92,523.01,1629.45,29326.94,12022,206.9,7.1,1985-01-01


In [17]:
# Example 3 - using tz_offset as gmt
odf = ts_preprocess(spark, idf=df, id_col='STATE', output_path=outputPath, tz_offset="gmt", run_type=run_type, auth_key=auth_key)
odf.toPandas()

22/11/30 09:21:51 WARN CacheManager: Asked to cache already cached data.


Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2,1971-01-01
2,ALABAMA,15972.41,7765.42,1764.75,6442.23,38670.30,31303,1072.3,4.7,1972-01-01
3,ALABAMA,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9,1973-01-01
4,ALABAMA,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5,1974-01-01
...,...,...,...,...,...,...,...,...,...,...
811,WYOMING,4731.98,3060.64,408.43,1262.90,27724.96,13056,217.7,5.8,1982-01-01
812,WYOMING,4950.82,3119.98,445.59,1385.25,28586.46,11922,202.5,8.4,1983-01-01
813,WYOMING,5184.73,3195.68,476.57,1512.48,28794.80,12073,204.3,6.3,1984-01-01
814,WYOMING,5448.38,3295.92,523.01,1629.45,29326.94,12022,206.9,7.1,1985-01-01


In [18]:
# Example 4 - using tz_offset as utc
odf = ts_preprocess(spark, idf=df, id_col='STATE', output_path=outputPath, tz_offset="utc", run_type=run_type, auth_key=auth_key)
odf.toPandas()

22/11/30 09:21:55 WARN CacheManager: Asked to cache already cached data.


Unnamed: 0,STATE,P_CAP,HWY,WATER,UTIL,PC,GSP,EMP,UNEMP,YR
0,ALABAMA,15032.67,7325.80,1655.68,6051.20,35793.80,28418,1010.5,4.7,1970-01-01
1,ALABAMA,15501.94,7525.94,1721.02,6254.98,37299.91,29375,1021.9,5.2,1971-01-01
2,ALABAMA,15972.41,7765.42,1764.75,6442.23,38670.30,31303,1072.3,4.7,1972-01-01
3,ALABAMA,16406.26,7907.66,1742.41,6756.19,40084.01,33430,1135.5,3.9,1973-01-01
4,ALABAMA,16762.67,8025.52,1734.85,7002.29,42057.31,33749,1169.8,5.5,1974-01-01
...,...,...,...,...,...,...,...,...,...,...
811,WYOMING,4731.98,3060.64,408.43,1262.90,27724.96,13056,217.7,5.8,1982-01-01
812,WYOMING,4950.82,3119.98,445.59,1385.25,28586.46,11922,202.5,8.4,1983-01-01
813,WYOMING,5184.73,3195.68,476.57,1512.48,28794.80,12073,204.3,6.3,1984-01-01
814,WYOMING,5448.38,3295.92,523.01,1629.45,29326.94,12022,206.9,7.1,1985-01-01


**Note:** Varying tz_offset produces identical output here because the dataset has no timestamp column: YR holds only a year, so there is no time-of-day value for a timezone offset to shift.
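
To illustrate why a timezone offset only matters when a time-of-day is present, here is a minimal standard-library sketch. It does not reproduce how Anovos applies `tz_offset`; the sample timestamp and the +05:30 offset are arbitrary values chosen for illustration.

```python
from datetime import date, datetime, timedelta, timezone

# A year-only value such as 1970 parses to a plain date: there is no
# time-of-day, so there is nothing for a timezone offset to move.
year_only = date(1970, 1, 1)

# A full timestamp, by contrast, can land on a different calendar day
# once it is expressed in another timezone.
ts_utc = datetime(1970, 1, 1, 23, 30, tzinfo=timezone.utc)
plus_0530 = timezone(timedelta(hours=5, minutes=30))  # arbitrary example offset
ts_shifted = ts_utc.astimezone(plus_0530)

print(year_only.isoformat())   # 1970-01-01
print(ts_shifted.isoformat())  # 1970-01-02T05:00:00+05:30
```

A dataset with a true timestamp column would be needed to observe different results across the `tz_offset` options.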