# ANOVOS - Data Ingest
This notebook lists the data ingest functions supported by the ANOVOS package and shows how to invoke each of them.
* [Read Dataset](#Read-Dataset)
* [Select Columns](#Select-Columns)
* [Delete Columns](#Delete-Columns)
* [Rename Columns](#Rename-Columns)
* [Recast Columns](#Recast-Columns)
* [Concatenate Datasets](#Concatenate-Datasets)
* [Join Datasets](#Join-Datasets)
* [Write Datasets](#Write-Datasets)

**Setting Spark Session**

In [1]:
#set run type variable
run_type = "local" # "local", "emr", "databricks", "ak8s"

In [2]:
# For run_type "ak8s" (Azure Kubernetes), the following block configures the Spark session
import os
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

if run_type == "ak8s":
    fs_path="<insert conf spark.hadoop.fs master url here> ex: spark.hadoop.fs.azure.sas.<container>.<account_name>.blob.core.windows.net"
    auth_key="<insert value of sas_token here>"
    master_url="<insert kubernetes master url path here> ex: k8s://"
    docker_image="<insert name docker image here>"
    kubernetes_namespace ="<insert kubernetes namespace here>"

    # Create Spark config for our Kubernetes based cluster manager
    sparkConf = SparkConf()
    sparkConf.setMaster(master_url)
    sparkConf.setAppName("Anovos_pipeline")
    sparkConf.set("spark.submit.deployMode","client")
    sparkConf.set("spark.kubernetes.container.image", docker_image)
    sparkConf.set("spark.kubernetes.namespace", kubernetes_namespace)
    sparkConf.set("spark.executor.instances", "4")
    sparkConf.set("spark.executor.cores", "4")
    sparkConf.set("spark.executor.memory", "16g")
    sparkConf.set("spark.kubernetes.pyspark.pythonVersion", "3")
    sparkConf.set("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    sparkConf.set(fs_path,auth_key)
    sparkConf.set("spark.kubernetes.authenticate.serviceAccountName", "spark")
    sparkConf.set("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.2.0,com.microsoft.azure:azure-storage:8.6.3,io.github.histogrammar:histogrammar_2.12:1.0.20,io.github.histogrammar:histogrammar-sparksql_2.12:1.0.20,org.apache.spark:spark-avro_2.12:3.2.1")

    # Initialize our Spark cluster, this will actually
    # generate the worker nodes.
    spark = SparkSession.builder.config(conf=sparkConf).getOrCreate()
    sc = spark.sparkContext

# For all other run types, import the Spark session from anovos.shared.spark
else:
    from anovos.shared.spark import *
    auth_key = "NA"

2022-06-07 11:56:19.534 | INFO     | anovos.shared.spark:init_spark:54 - Getting spark session, context and sql context app_name: Anovos_pipeline



In [3]:
sc.setLogLevel("ERROR")
import warnings
warnings.filterwarnings('ignore')

**Input/Output Path**

In [4]:
inputPath = "../data/income_dataset/csv"
inputPath_parq = "../data/income_dataset/parquet"
inputPath_join = "../data/income_dataset/join"
outputPath = "../output/income_dataset/"

# Read Dataset

- API specification of function **read_dataset** can be found <a href="https://docs.anovos.ai/api/data_ingest/data_ingest.html">here</a>
- Supported file types: csv, parquet, avro
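For orientation, the `file_type` and `file_configs` arguments correspond closely to what Spark's own `DataFrameReader` expects. The sketch below is a rough plain-PySpark equivalent, written under the assumption that `read_dataset` behaves like a thin wrapper. It is not the ANOVOS implementation, and `read_with_plain_spark` and `SUPPORTED_TYPES` are hypothetical names.

```python
# Hypothetical helper, NOT part of ANOVOS: a rough plain-PySpark equivalent
# of read_dataset, shown only to illustrate what file_type and file_configs mean.
SUPPORTED_TYPES = ("csv", "parquet", "avro")  # formats listed in this notebook


def read_with_plain_spark(spark, file_path, file_type, file_configs=None):
    """Load a dataset through Spark's DataFrameReader.

    `spark` is expected to be a SparkSession; `file_configs` holds reader
    options such as {"header": "True", "delimiter": ","}.
    """
    if file_type not in SUPPORTED_TYPES:
        raise ValueError(f"Unsupported file_type: {file_type!r}")
    reader = spark.read.format(file_type)
    # Reader options are passed through unchanged as keyword arguments.
    return reader.options(**(file_configs or {})).load(file_path)
```

Reading Avro this way relies on the `spark-avro` package being on the classpath, which the session set up at the top of this notebook loads.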

In [5]:
from anovos.data_ingest.data_ingest import read_dataset

In [6]:
# Example 1 - Reading CSV file
df = read_dataset(spark, file_path = inputPath, file_type = "csv",file_configs = {"header": "True", 
                                                                           "delimiter": "," , 
                                                                           "inferSchema": "True"})
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,dt_1,dt_2
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45


In [7]:
# Example 2 - Reading Parquet file
df2 = read_dataset(spark, file_path = inputPath_parq, file_type = "parquet")
df2.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,...,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,dt_1,dt_2,label
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,...,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59,0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,...,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09,0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,...,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21,0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,...,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31,0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,...,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45,0


In [8]:
# Example 3 - Reading Avro file
df3 = read_dataset(spark, inputPath_join, "avro")
df3.toPandas().head(5)

Unnamed: 0,ifa,age,workclass
0,2a,,Self-emp-not-inc
1,3a,38.0,Private
2,5a,,Private
3,7a,49.0,Private
4,8a,52.0,Self-emp-not-inc


# Select Columns
- API specification of function **select_column** can be found <a href="https://docs.anovos.ai/api/data_ingest/data_ingest.html">here</a>

In [9]:
from anovos.data_ingest.data_ingest import select_column

In [10]:
# Example 1 - list_of_cols in list format
odf = select_column(idf=df, list_of_cols=['age','race','income'], print_impact=True)
odf.toPandas().head(5)

Before: 
No. of Columns- 20
['ifa', 'age', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income', 'dt_1', 'dt_2']

After: 
No. of Columns- 3
['race', 'age', 'income']


Unnamed: 0,race,age,income
0,White,,<=50K
1,White,,<=50K
2,White,38.0,<=50K
3,Black,53.0,<=50K
4,Black,,<=50K


In [11]:
# Example 2 - list_of_cols in string format
odf = select_column(idf=df, list_of_cols='age|race|income')
odf.toPandas().head(5)

Unnamed: 0,race,age,income
0,White,,<=50K
1,White,,<=50K
2,White,38.0,<=50K
3,Black,53.0,<=50K
4,Black,,<=50K


In [12]:
# Example 3 - Without keyword arguments
odf = select_column(df,'age')
odf.toPandas().head(5)

Unnamed: 0,age
0,
1,
2,38.0
3,53.0
4,
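The three examples above pass `list_of_cols` as a list, as a pipe-delimited string, and as a single column name. A minimal sketch of how such an argument could be normalised into a list is shown below. This is an assumption about the general approach, not the ANOVOS source, and `parse_cols` is a hypothetical name.

```python
# Hypothetical helper, NOT the ANOVOS implementation: normalise a column
# specification that may be a list or a pipe-delimited string.
def parse_cols(list_of_cols):
    """Return a list of column names from a list or an 'a|b|c' string."""
    if isinstance(list_of_cols, str):
        # Split on the pipe and drop surrounding whitespace and empty parts.
        return [c.strip() for c in list_of_cols.split("|") if c.strip()]
    return list(list_of_cols)


print(parse_cols("age|race|income"))  # ['age', 'race', 'income']
print(parse_cols(["age", "race"]))    # ['age', 'race']
print(parse_cols("age"))              # ['age']
```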


# Delete Columns
- API specification of function **delete_column** can be found <a href="https://docs.anovos.ai/api/data_ingest/data_ingest.html">here</a>

In [13]:
from anovos.data_ingest.data_ingest import delete_column

In [14]:
# Example 1 - list_of_cols in list format
odf = delete_column(idf=df, list_of_cols=['age','race','income'], print_impact=True)
odf.toPandas().head(5)

Before: 
No. of Columns-  20
['ifa', 'age', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income', 'dt_1', 'dt_2']
After: 
No. of Columns-  17
['ifa', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dt_1', 'dt_2']


Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours-per-week,native-country,dt_1,dt_2
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,Male,2174.0,0.0,40.0,UnitedStates,1/8/16 5:59,1/16/16 5:59
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,Male,0.0,0.0,13.0,UnitedStates,1/8/16 21:09,1/12/16 21:09
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,Male,0.0,0.0,40.0,UnitedStates,3/8/16 2:21,3/20/16 2:21
3,4a,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Male,0.0,0.0,40.0,UnitedStates,3/8/16 6:31,3/14/16 6:31
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Female,0.0,0.0,40.0,Cuba,3/8/16 9:45,3/10/16 9:45


In [15]:
# Example 2 - list_of_cols in string format
odf = delete_column(idf=df, list_of_cols='age|race|income')
odf.toPandas().head(5)

Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours-per-week,native-country,dt_1,dt_2
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,Male,2174.0,0.0,40.0,UnitedStates,1/8/16 5:59,1/16/16 5:59
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,Male,0.0,0.0,13.0,UnitedStates,1/8/16 21:09,1/12/16 21:09
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,Male,0.0,0.0,40.0,UnitedStates,3/8/16 2:21,3/20/16 2:21
3,4a,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Male,0.0,0.0,40.0,UnitedStates,3/8/16 6:31,3/14/16 6:31
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Female,0.0,0.0,40.0,Cuba,3/8/16 9:45,3/10/16 9:45


In [16]:
# Example 3 - Without keyword arguments
odf = delete_column(df,'age')
odf.toPandas().head(5)

Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,dt_1,dt_2
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45
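The "Before" and "After" listings printed by Example 1 can be reproduced with plain Python to check which columns a delete should leave behind. The snippet below is only an illustration of that bookkeeping (it does not call ANOVOS or Spark); it keeps the original column order, as the printed output does.

```python
# Illustrative only: which columns remain after deleting age, race and income.
all_cols = ['ifa', 'age', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education',
            'education-num', 'marital-status', 'occupation', 'relationship',
            'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
            'native-country', 'income', 'dt_1', 'dt_2']
to_drop = {'age', 'race', 'income'}

# Keep every column that is not marked for deletion, in the original order.
remaining = [c for c in all_cols if c not in to_drop]

print(len(all_cols), len(remaining))  # 20 17
```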


# Rename Columns
- API specification of function **rename_column** can be found <a href="https://docs.anovos.ai/api/data_ingest/data_ingest.html">here</a>

In [17]:
from anovos.data_ingest.data_ingest import rename_column

In [18]:
# Example 1 - list_of_cols & list_of_newcols in list format
odf = rename_column(idf=df, list_of_cols=['age','race','income'], list_of_newcols=['dage','drace','dincome'], print_impact=True)
odf.toPandas().head(5)

Before: 
No. of Columns-  20
['ifa', 'age', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income', 'dt_1', 'dt_2']
After: 
No. of Columns-  20
['ifa', 'dage', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'drace', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dincome', 'dt_1', 'dt_2']


Unnamed: 0,ifa,dage,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,drace,sex,capital-gain,capital-loss,hours-per-week,native-country,dincome,dt_1,dt_2
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45


In [19]:
# Example 2 - list_of_cols & list_of_newcols in string format
odf = rename_column(idf=df, list_of_cols='age|race|income', list_of_newcols='dage|drace|dincome')
odf.toPandas().head(5)

Unnamed: 0,ifa,dage,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,drace,sex,capital-gain,capital-loss,hours-per-week,native-country,dincome,dt_1,dt_2
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45


In [20]:
# Example 3 - list_of_cols & list_of_newcols in mix of list/string format
odf = rename_column(idf=df, list_of_cols=['age','race','income'], list_of_newcols='dage|drace|dincome')
odf.toPandas().head(5)

Unnamed: 0,ifa,dage,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,drace,sex,capital-gain,capital-loss,hours-per-week,native-country,dincome,dt_1,dt_2
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45


In [21]:
# Example 4 - Without keyword arguments
odf = rename_column(df,'age','dage')
odf.toPandas().head(5)

Unnamed: 0,ifa,dage,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,dt_1,dt_2
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45
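Since `list_of_cols` and `list_of_newcols` are matched by position, the two arguments need the same number of entries regardless of whether each is a list or a pipe-delimited string. The sketch below shows one way to build and check that mapping. It is a hypothetical illustration (`build_rename_map` is not an ANOVOS function), and how ANOVOS itself reacts to a length mismatch is not shown in this notebook.

```python
# Hypothetical helper, NOT the ANOVOS implementation.
def _as_list(cols):
    """Accept either a list of names or an 'a|b|c' string."""
    return cols.split("|") if isinstance(cols, str) else list(cols)


def build_rename_map(list_of_cols, list_of_newcols):
    """Pair old and new column names by position."""
    old, new = _as_list(list_of_cols), _as_list(list_of_newcols)
    if len(old) != len(new):
        raise ValueError(f"{len(old)} old names but {len(new)} new names")
    return dict(zip(old, new))


# Mixed list/string input, as in Example 3 above.
mapping = build_rename_map(['age', 'race', 'income'], 'dage|drace|dincome')
columns = ['ifa', 'age', 'workclass', 'race', 'income']
renamed = [mapping.get(c, c) for c in columns]  # untouched names pass through
print(renamed)  # ['ifa', 'dage', 'workclass', 'drace', 'dincome']
```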


# Recast Columns
- API specification of function **recast_column** can be found <a href="https://docs.anovos.ai/api/data_ingest/data_ingest.html">here</a>

In [22]:
from anovos.data_ingest.data_ingest import recast_column

In [23]:
# Example 1 - list_of_cols & list_of_dtypes in list format, list_of_dtypes case-insensitive
odf = recast_column(idf=df, list_of_cols=['age','education-num'], list_of_dtypes=['double','Float'], print_impact=True)
odf.toPandas().head(5)

Before: 
root
 |-- ifa: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: double (nullable = true)
 |-- empty: string (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)
 |-- dt_1: string (nullable = true)
 |-- dt_2: string (nullable = true)

After: 
root
 |-- ifa: string (nullable = true)
 |-- age: double (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: 

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,dt_1,dt_2
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45


In [24]:
# Example 2 - list_of_cols & list_of_dtypes in mix of list/string format, list_of_dtypes short form allowed
odf = recast_column(idf=df, list_of_cols='age|logfnl', list_of_dtypes=['DOUble','int'])
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,dt_1,dt_2
0,1a,,State-gov,77516.0,4.0,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,,Self-emp-not-inc,83311.0,4.0,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,38.0,Private,215646.0,5.0,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,53.0,Private,234721.0,5.0,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,,Private,338409.0,5.0,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45


In [25]:
# Example 3 - Without keyword arguments
odf = recast_column(df,'logfnl', 'integer')
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,dt_1,dt_2
0,1a,,State-gov,77516.0,4.0,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,,Self-emp-not-inc,83311.0,4.0,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,38.0,Private,215646.0,5.0,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,53.0,Private,234721.0,5.0,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,,Private,338409.0,5.0,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45
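Two behaviours are visible in the recast examples above: dtype names such as `'Float'` and `'DOUble'` are accepted in any letter case, and recasting `logfnl` to an integer type drops the fractional part (4.889391 becomes 4, not 5). The plain-Python sketch below mimics both. `normalise_dtype` is a hypothetical name, and lower-casing is only an assumption about how the case-insensitivity is achieved.

```python
# Hypothetical sketch, NOT the ANOVOS implementation.
def normalise_dtype(dtype):
    """Lower-case a dtype name so 'Float' and 'DOUble' are treated alike."""
    return dtype.strip().lower()


print([normalise_dtype(d) for d in ['double', 'Float', 'DOUble', 'int']])
# ['double', 'float', 'double', 'int']

# Casting a floating-point value to an integer truncates toward zero,
# which matches the logfnl values shown in the recast output above.
logfnl = [4.889391, 4.920702, 5.333741]
print([int(v) for v in logfnl])  # [4, 4, 5]
```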


# Concatenate Datasets
- API specification of function **concatenate_dataset** can be found <a href="https://docs.anovos.ai/api/data_ingest/data_ingest.html">here</a>

In [26]:
from anovos.data_ingest.data_ingest import concatenate_dataset

In [27]:
# Example 1: Concatenation by column names
odf = concatenate_dataset(df.select('ifa','age','workclass'),df2.select('ifa','workclass','age'),
                          method_type='name')
print(df.count())
print(df2.count())
odf.toPandas().tail(5)

32561
32561


Unnamed: 0,ifa,age,workclass
65117,32557a,27.0,Private
65118,32558a,40.0,Private
65119,32559a,58.0,Private
65120,32560a,22.0,Private
65121,32561a,52.0,Self-emp-inc


In [28]:
# Example 2: Concatenation by column index
odf = concatenate_dataset(df.select('ifa','age','workclass'),df2.select('ifa','age','workclass'),
                          method_type='index')
odf.toPandas().tail(5)

Unnamed: 0,ifa,age,workclass
65117,32557a,27.0,Private
65118,32558a,40.0,Private
65119,32559a,58.0,Private
65120,32560a,22.0,Private
65121,32561a,52.0,Self-emp-inc


In [29]:
# Example 3 (INCORRECT USAGE): Concatenation by column index with mismatched column order.
# Rows from df2 end up with their age and workclass values swapped.
odf = concatenate_dataset(df.select('ifa','age','workclass'),df2.select('ifa','workclass','age'),
                          method_type='index')
odf.toPandas().tail(5)

Unnamed: 0,ifa,age,workclass
65117,32557a,Private,27
65118,32558a,Private,40
65119,32559a,Private,58
65120,32560a,Private,22
65121,32561a,Self-emp-inc,52
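The difference between `method_type='name'` and `method_type='index'` can be seen on a toy example in plain Python. The sketch below is an illustration of the two alignment strategies, not the ANOVOS implementation, and the helper names are hypothetical. In PySpark the corresponding built-in operations are `DataFrame.unionByName` (align by name) and `DataFrame.union` (align by position); whether ANOVOS uses them internally is an assumption.

```python
# Hypothetical helpers illustrating the two alignment strategies.
def concat_by_index(rows_a, rows_b):
    """Stack rows positionally; column names of the second dataset are ignored."""
    return rows_a + rows_b


def concat_by_name(cols_a, rows_a, cols_b, rows_b):
    """Reorder the second dataset's values to match the first dataset's columns."""
    order = [cols_b.index(c) for c in cols_a]
    return rows_a + [tuple(row[i] for i in order) for row in rows_b]


cols_a, rows_a = ['ifa', 'age', 'workclass'], [('3a', 38, 'Private')]
cols_b, rows_b = ['ifa', 'workclass', 'age'], [('4a', 'Private', 53)]

print(concat_by_name(cols_a, rows_a, cols_b, rows_b))
# [('3a', 38, 'Private'), ('4a', 53, 'Private')]
print(concat_by_index(rows_a, rows_b))
# [('3a', 38, 'Private'), ('4a', 'Private', 53)]   <- age/workclass swapped
```

Concatenating by index is only safe when every input already has its columns in the same order, as in Example 2.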


In [30]:
# Example 4: Multiple Datasets
odf = concatenate_dataset(df, df2, df2, method_type='name')
print(odf.count())
odf.toPandas().head(5)

97683


Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,dt_1,dt_2
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,1/8/16 5:59,1/16/16 5:59
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,1/8/16 21:09,1/12/16 21:09
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 2:21,3/20/16 2:21
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,3/8/16 6:31,3/14/16 6:31
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,3/8/16 9:45,3/10/16 9:45


# Join Datasets
- API specification of function **join_dataset** can be found <a href="https://docs.anovos.ai/api/data_ingest/data_ingest.html">here</a>

In [31]:
from anovos.data_ingest.data_ingest import join_dataset

In [32]:
# Example 1: Inner Join
tmp = rename_column(df3,'age|workclass','age_dupl|workclass_dupl')

odf = join_dataset(df.select('ifa','age','workclass'), tmp, join_cols='ifa',join_type='inner')
print(df.count())
print(df3.count())
print(odf.count())
odf.toPandas().head(5)

32561
24463
24463


Unnamed: 0,ifa,age,workclass,age_dupl,workclass_dupl
0,2a,,Self-emp-not-inc,,Self-emp-not-inc
1,3a,38.0,Private,38.0,Private
2,5a,,Private,,Private
3,7a,49.0,Private,49.0,Private
4,8a,52.0,Self-emp-not-inc,52.0,Self-emp-not-inc


In [33]:
# Example 2: Left Join + Join by multiple columns
tmp = rename_column(df3,'age','age_dupl')

odf = join_dataset(df.select('ifa','age','workclass'), tmp, join_cols='ifa|workclass',join_type='left')
print(df.count())
print(df3.count())
print(odf.count())
odf.toPandas().head(5)

32561
24463
32561


Unnamed: 0,ifa,workclass,age,age_dupl
0,1a,State-gov,,
1,2a,Self-emp-not-inc,,
2,3a,Private,38.0,38.0
3,4a,Private,53.0,
4,5a,Private,,
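The row counts printed above follow from the join type: an inner join keeps only keys present in both datasets (24463 rows), while a left join keeps every row of the first dataset (32561 rows) and fills unmatched columns with nulls. The toy sketch below reproduces this on three rows in plain Python. `join_rows` is a hypothetical helper that handles a single join column only; it is not how ANOVOS or Spark implement joins.

```python
# Hypothetical helper, for illustration only.
def join_rows(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`; supports 'inner' and 'left'."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_cols = [c for c in (right[0] if right else {}) if c != key]

    out = []
    for lrow in left:
        matches = index.get(lrow[key], [])
        if matches:
            out.extend({**lrow, **m} for m in matches)
        elif how == "left":
            # No match: keep the left row and fill right-hand columns with None.
            out.append({**lrow, **{c: None for c in right_cols}})
    return out


left = [{"ifa": "1a", "age": None}, {"ifa": "2a", "age": None}, {"ifa": "3a", "age": 38}]
right = [{"ifa": "2a", "age_dupl": None}, {"ifa": "3a", "age_dupl": 38}]

print(len(join_rows(left, right, "ifa", "inner")))  # 2
print(len(join_rows(left, right, "ifa", "left")))   # 3
```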


# Write Datasets

- API specification of function **write_dataset** can be found <a href="https://docs.anovos.ai/api/data_ingest/data_ingest.html">here</a> <br>
- Supported file types: csv, parquet, avro
- Limitations:
    - csv doesn't work with array columns
    - avro doesn't work with column names containing certain special characters, e.g. a hyphen (-)
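One way to work around the Avro limitation is to rename offending columns before writing. Avro field names must start with a letter or underscore and contain only letters, digits and underscores, so names such as `education-num` are rejected. The sketch below derives Avro-safe names; `avro_safe` is a hypothetical helper, not an ANOVOS function, and it does not guard against two columns collapsing to the same name.

```python
import re


# Hypothetical helper, NOT part of ANOVOS.
def avro_safe(name):
    """Replace characters Avro does not allow in field names with underscores."""
    safe = re.sub(r"[^A-Za-z0-9_]", "_", name)
    # A field name must not start with a digit (or be empty).
    if not re.match(r"[A-Za-z_]", safe):
        safe = "_" + safe
    return safe


cols = ['ifa', 'education-num', 'marital-status', 'dt_1']
print([avro_safe(c) for c in cols])
# ['ifa', 'education_num', 'marital_status', 'dt_1']
```

The resulting names could then be applied with `rename_column(df, cols, [avro_safe(c) for c in cols])` before calling `write_dataset` with `file_type='avro'`.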

In [34]:
from anovos.data_ingest.data_ingest import write_dataset

In [39]:
#Example 1 - CSV
write_dataset(idf=df, file_path=outputPath, file_type='csv', 
              file_configs={'header':True,'repartition':1,'mode':'error','compression':'gzip'})

In [40]:
#Example 2 - Parquet
write_dataset(idf=df, file_path=outputPath, file_type='parquet',
              file_configs={'repartition':1,'mode':'append','compression':'snappy'})

In [41]:
#Example 3 - Avro
write_dataset(idf=df.select('ifa','age','workclass'), file_path=outputPath, file_type='avro', 
              file_configs={'repartition':1,'mode':'overwrite'})

In [42]:
#Example 4 - Without keyword arguments
write_dataset(df, outputPath, 'parquet',{'mode':'overwrite'})