<h1>ANOVOS - Data Ingest<span class="tocSkip"></span></h1>
<p>This notebook lists the data-ingest functions supported by the ANOVOS package and shows how to invoke each of them.</p>
<div class="toc"><ul class="toc-item"><li><span><a href="#Read-Dataset" data-toc-modified-id="Read-Dataset-1">Read Dataset</a></span></li><li><span><a href="#Select-Columns" data-toc-modified-id="Select-Columns-2">Select Columns</a></span></li><li><span><a href="#Delete-Columns" data-toc-modified-id="Delete-Columns-3">Delete Columns</a></span></li><li><span><a href="#Rename-Columns" data-toc-modified-id="Rename-Columns-4">Rename Columns</a></span></li><li><span><a href="#Recast-Columns" data-toc-modified-id="Recast-Columns-5">Recast Columns</a></span></li><li><span><a href="#Concatenate-Datasets" data-toc-modified-id="Concatenate-Datasets-6">Concatenate Datasets</a></span></li><li><span><a href="#Join-Datasets" data-toc-modified-id="Join-Datasets-7">Join Datasets</a></span></li><li><span><a href="#Write-Datasets" data-toc-modified-id="Write-Datasets-8">Write Datasets</a></span></li></ul></div>

**Setting Spark Session**

In [1]:
from anovos.shared.spark import *

**Input/Output Path**

In [2]:
inputPath = "../data/income_dataset/csv"
inputPath_parq = "../data/income_dataset/parquet"
inputPath_join = "../data/income_dataset/join"
outputPath = "../output/income_dataset/"

# Read Dataset

- API specification of function **read_dataset** can be found <a href="../api_specification/anovos/data_ingest/data_ingest.html#anovos.data_ingest.data_ingest.read_dataset">here</a>
- Supported file types: csv, parquet, avro

In [3]:
from anovos.data_ingest.data_ingest import read_dataset

In [4]:
# Example 1 - Reading CSV file
df = read_dataset(spark, file_path=inputPath, file_type="csv",
                  file_configs={"header": "True", "delimiter": ",", "inferSchema": "True"})
df.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


In [5]:
# Example 2 - Reading Parquet file
df2 = read_dataset(spark, file_path = inputPath_parq, file_type = "parquet")
df2.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income,label
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K,0
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K,0
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K,0
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K,0
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K,0


In [6]:
# Example 3 - Reading Avro file
df3 = read_dataset(spark, inputPath_join, "avro")
df3.toPandas().head(5)

Unnamed: 0,ifa,age,workclass
0,2a,,Self-emp-not-inc
1,3a,38.0,Private
2,5a,,Private
3,7a,49.0,Private
4,8a,52.0,Self-emp-not-inc


# Select Columns
- API specification of function **select_column** can be found <a href="../api_specification/anovos/data_ingest/data_ingest.html#anovos.data_ingest.data_ingest.select_column">here</a>

In [7]:
from anovos.data_ingest.data_ingest import select_column

In [8]:
# Example 1 - list_of_cols in list format
odf = select_column(idf=df, list_of_cols=['age','race','income'], print_impact=True)
odf.toPandas().head(5)

Before: 
No. of Columns- 18
['ifa', 'age', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

After: 
No. of Columns- 3
['income', 'race', 'age']


Unnamed: 0,income,race,age
0,<=50K,White,
1,<=50K,White,
2,<=50K,White,38.0
3,<=50K,Black,53.0
4,<=50K,Black,


In [9]:
# Example 2 - list_of_cols in string format
odf = select_column(idf=df, list_of_cols='age|race|income')
odf.toPandas().head(5)

Unnamed: 0,income,race,age
0,<=50K,White,
1,<=50K,White,
2,<=50K,White,38.0
3,<=50K,Black,53.0
4,<=50K,Black,
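Examples 1 and 2 return the same columns because `list_of_cols` accepts either a Python list or a single pipe-delimited string. As a rough illustration of that convention, a helper along these lines could reduce both forms to a list. The name `normalize_cols` is hypothetical and this is not taken from the ANOVOS source.

```python
def normalize_cols(list_of_cols):
    """Return column names as a list, given a list or an 'a|b|c' string.

    Hypothetical helper for illustration; not part of the ANOVOS API.
    """
    if isinstance(list_of_cols, str):
        # Split on the pipe delimiter and drop empty or padded entries.
        return [c.strip() for c in list_of_cols.split("|") if c.strip()]
    return list(list_of_cols)


print(normalize_cols("age|race|income"))
print(normalize_cols(["age", "race", "income"]))
```

Both calls print `['age', 'race', 'income']`, which is why the two invocation styles are interchangeable in the examples.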


In [10]:
# Example 3 - Without keyword arguments
odf = select_column(df,'age')
odf.toPandas().head(5)

Unnamed: 0,age
0,
1,
2,38.0
3,53.0
4,


# Delete Columns
- API specification of function **delete_column** can be found <a href="../api_specification/anovos/data_ingest/data_ingest.html#anovos.data_ingest.data_ingest.delete_column">here</a>

In [11]:
from anovos.data_ingest.data_ingest import delete_column

In [12]:
# Example 1 - list_of_cols in list format
odf = delete_column(idf=df, list_of_cols=['age','race','income'], print_impact=True)
odf.toPandas().head(5)

Before: 
No. of Columns-  18
['ifa', 'age', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
After: 
No. of Columns-  15
['ifa', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country']


Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours-per-week,native-country
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,Male,2174.0,0.0,40.0,UnitedStates
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,Male,0.0,0.0,13.0,UnitedStates
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,Male,0.0,0.0,40.0,UnitedStates
3,4a,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Male,0.0,0.0,40.0,UnitedStates
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Female,0.0,0.0,40.0,Cuba


In [13]:
# Example 2 - list_of_cols in string format
odf = delete_column(idf=df, list_of_cols='age|race|income')
odf.toPandas().head(5)

Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,sex,capital-gain,capital-loss,hours-per-week,native-country
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,Male,2174.0,0.0,40.0,UnitedStates
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,Male,0.0,0.0,13.0,UnitedStates
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,Male,0.0,0.0,40.0,UnitedStates
3,4a,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Male,0.0,0.0,40.0,UnitedStates
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Female,0.0,0.0,40.0,Cuba


In [14]:
# Example 3 - Without keyword arguments
odf = delete_column(df,'age')
odf.toPandas().head(5)

Unnamed: 0,ifa,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Rename Columns
- API specification of function **rename_column** can be found <a href="../api_specification/anovos/data_ingest/data_ingest.html#anovos.data_ingest.data_ingest.rename_column">here</a>

In [15]:
from anovos.data_ingest.data_ingest import rename_column

In [16]:
# Example 1 - list_of_cols & list_of_newcols in list format
odf = rename_column(idf=df, list_of_cols=['age','race','income'], list_of_newcols=['dage','drace','dincome'], print_impact=True)
odf.toPandas().head(5)

Before: 
No. of Columns-  18
['ifa', 'age', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
After: 
No. of Columns-  18
['ifa', 'dage', 'workclass', 'fnlwgt', 'logfnl', 'empty', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'drace', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'dincome']


Unnamed: 0,ifa,dage,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,drace,sex,capital-gain,capital-loss,hours-per-week,native-country,dincome
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


In [17]:
# Example 2 - list_of_cols & list_of_newcols in string format
odf = rename_column(idf=df, list_of_cols='age|race|income', list_of_newcols='dage|drace|dincome')
odf.toPandas().head(5)

Unnamed: 0,ifa,dage,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,drace,sex,capital-gain,capital-loss,hours-per-week,native-country,dincome
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


In [18]:
# Example 3 - list_of_cols & list_of_newcols in mix of list/string format
odf = rename_column(idf=df, list_of_cols=['age','race','income'], list_of_newcols='dage|drace|dincome')
odf.toPandas().head(5)

Unnamed: 0,ifa,dage,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,drace,sex,capital-gain,capital-loss,hours-per-week,native-country,dincome
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


In [19]:
# Example 4 - Without keyword arguments
odf = rename_column(df,'age','dage')
odf.toPandas().head(5)

Unnamed: 0,ifa,dage,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Recast Columns
- API specification of function **recast_column** can be found <a href="../api_specification/anovos/data_ingest/data_ingest.html#anovos.data_ingest.data_ingest.recast_column">here</a>

In [20]:
from anovos.data_ingest.data_ingest import recast_column

In [21]:
# Example 1 - list_of_cols & list_of_dtypes in list format, list_of_dtypes case-insensitive
odf = recast_column(idf=df, list_of_cols=['age','education-num'], list_of_dtypes=['double','Float'], print_impact=True)
odf.toPandas().head(5)

Before: 
root
 |-- ifa: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: double (nullable = true)
 |-- empty: string (nullable = true)
 |-- education: string (nullable = true)
 |-- education-num: integer (nullable = true)
 |-- marital-status: string (nullable = true)
 |-- occupation: string (nullable = true)
 |-- relationship: string (nullable = true)
 |-- race: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- capital-gain: integer (nullable = true)
 |-- capital-loss: integer (nullable = true)
 |-- hours-per-week: integer (nullable = true)
 |-- native-country: string (nullable = true)
 |-- income: string (nullable = true)

After: 
root
 |-- ifa: string (nullable = true)
 |-- age: double (nullable = true)
 |-- workclass: string (nullable = true)
 |-- fnlwgt: integer (nullable = true)
 |-- logfnl: double (nullable = true)
 |-- empty: string (nullable = true)
 |-- educa

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


In [22]:
# Example 2 - list_of_cols & list_of_dtypes in mix of list/string format, list_of_dtypes short form allowed
odf = recast_column(idf=df, list_of_cols='age|logfnl', list_of_dtypes=['DOUble','int'])
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.0,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.0,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.0,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.0,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.0,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


In [23]:
# Example 3 - Without keyword arguments
odf = recast_column(df,'logfnl', 'integer')
odf.toPandas().head(5)

Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.0,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.0,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.0,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.0,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.0,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K
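The recast examples pass type names in mixed case (`'Float'`, `'DOUble'`) and in short form (`'int'` alongside `'integer'`). One way such inputs could be reduced to a canonical name is sketched below. The alias table and the function `normalize_dtype` are assumptions made for illustration, not the ANOVOS implementation.

```python
# Hypothetical alias table: short form -> canonical type name.
DTYPE_ALIASES = {"int": "integer"}


def normalize_dtype(dtype):
    """Lower-case a type name and expand known short forms (illustrative only)."""
    key = dtype.strip().lower()
    return DTYPE_ALIASES.get(key, key)


print([normalize_dtype(d) for d in ["DOUble", "int", "Float", "integer"]])
# ['double', 'integer', 'float', 'integer']
```

Note also that the outputs of Examples 2 and 3 show `logfnl` values such as 4.889391 becoming 4.0, so recasting a double column to an integer type drops the fractional part rather than rounding.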


# Concatenate Datasets
- API specification of function **concatenate_dataset** can be found <a href="../api_specification/anovos/data_ingest/data_ingest.html#anovos.data_ingest.data_ingest.concatenate_dataset">here</a>

In [24]:
from anovos.data_ingest.data_ingest import concatenate_dataset

In [25]:
# Example 1: Concatenation by column names
odf = concatenate_dataset(df.select('ifa','age','workclass'),df2.select('ifa','workclass','age'),
                          method_type='name')
print(df.count())
print(df2.count())
odf.toPandas().tail(5)

32561
32561


Unnamed: 0,ifa,age,workclass
65117,32557a,27.0,Private
65118,32558a,40.0,Private
65119,32559a,58.0,Private
65120,32560a,22.0,Private
65121,32561a,52.0,Self-emp-inc


In [26]:
# Example 2: Concatenation by column index
odf = concatenate_dataset(df.select('ifa','age','workclass'),df2.select('ifa','age','workclass'),
                          method_type='index')
odf.toPandas().tail(5)

Unnamed: 0,ifa,age,workclass
65117,32557a,27.0,Private
65118,32558a,40.0,Private
65119,32559a,58.0,Private
65120,32560a,22.0,Private
65121,32561a,52.0,Self-emp-inc


In [27]:
# Example 3 (INCORRECT USAGE): Concatenation by column index when the column order differs between datasets
odf = concatenate_dataset(df.select('ifa','age','workclass'),df2.select('ifa','workclass','age'),
                          method_type='index')
odf.toPandas().tail(5)

Unnamed: 0,ifa,age,workclass
65117,32557a,Private,27
65118,32558a,Private,40
65119,32559a,Private,58
65120,32560a,Private,22
65121,32561a,Self-emp-inc,52
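The misaligned output of Example 3 comes from stacking rows position by position while the two inputs list `age` and `workclass` in a different order. The difference between the two `method_type` values can be reproduced in plain Python without Spark. The helper `concat_rows` below is a hypothetical stand-in, not the ANOVOS function.

```python
def concat_rows(left_cols, left_rows, right_cols, right_rows, method_type="name"):
    """Stack two small tables given as (column list, row tuples); illustrative only."""
    if method_type == "name":
        # Reorder every right-hand row so it follows the left-hand column order.
        positions = [right_cols.index(c) for c in left_cols]
        right_rows = [tuple(row[i] for i in positions) for row in right_rows]
    elif method_type != "index":
        raise ValueError("method_type must be 'name' or 'index'")
    # With 'index', right-hand rows are appended as-is, position by position.
    return left_cols, list(left_rows) + list(right_rows)


left_cols = ["ifa", "age", "workclass"]
left_rows = [("3a", 38, "Private")]
right_cols = ["ifa", "workclass", "age"]  # same columns, different order
right_rows = [("4a", "Private", 53)]

_, by_name = concat_rows(left_cols, left_rows, right_cols, right_rows, "name")
_, by_index = concat_rows(left_cols, left_rows, right_cols, right_rows, "index")
print(by_name)   # [('3a', 38, 'Private'), ('4a', 53, 'Private')]
print(by_index)  # [('3a', 38, 'Private'), ('4a', 'Private', 53)]
```

With `"index"`, the second row carries `'Private'` under `age`, the same kind of column swap visible in the Example 3 output. Concatenating by index is only safe when every input has its columns in the same order.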


In [28]:
# Example 4: Multiple Datasets
odf = concatenate_dataset(df, df2, df2, method_type='name')
print(odf.count())
odf.toPandas().head(5)

97683


Unnamed: 0,ifa,age,workclass,fnlwgt,logfnl,empty,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,1a,,State-gov,77516.0,4.889391,,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,UnitedStates,<=50K
1,2a,,Self-emp-not-inc,83311.0,4.920702,,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,UnitedStates,<=50K
2,3a,38.0,Private,215646.0,5.333741,,HS-grad,9.0,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.0,0.0,40.0,UnitedStates,<=50K
3,4a,53.0,Private,234721.0,5.370552,,11th,7.0,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.0,0.0,40.0,UnitedStates,<=50K
4,5a,,Private,338409.0,5.529442,,Bachelors,13.0,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.0,0.0,40.0,Cuba,<=50K


# Join Datasets
- API specification of function **join_dataset** can be found <a href="../api_specification/anovos/data_ingest/data_ingest.html#anovos.data_ingest.data_ingest.join_dataset">here</a>

In [29]:
from anovos.data_ingest.data_ingest import join_dataset

In [30]:
# Example 1: Inner Join
tmp = rename_column(df3,'age|workclass','age_dupl|workclass_dupl')

odf = join_dataset(df.select('ifa','age','workclass'), tmp, join_cols='ifa',join_type='inner')
print(df.count())
print(df3.count())
print(odf.count())
odf.toPandas().head(5)

32561
24463
24463


Unnamed: 0,ifa,age,workclass,age_dupl,workclass_dupl
0,2a,,Self-emp-not-inc,,Self-emp-not-inc
1,3a,38.0,Private,38.0,Private
2,5a,,Private,,Private
3,7a,49.0,Private,49.0,Private
4,8a,52.0,Self-emp-not-inc,52.0,Self-emp-not-inc


In [31]:
# Example 2: Left Join + Join by multiple columns
tmp = rename_column(df3,'age','age_dupl')

odf = join_dataset(df.select('ifa','age','workclass'), tmp, join_cols='ifa|workclass',join_type='left')
print(df.count())
print(df3.count())
print(odf.count())
odf.toPandas().head(5)

32561
24463
32561


Unnamed: 0,ifa,workclass,age,age_dupl
0,1a,State-gov,,
1,2a,Self-emp-not-inc,,
2,3a,Private,38.0,38.0
3,4a,Private,53.0,
4,5a,Private,,
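The row counts printed in the two join examples follow from the join type: the inner join keeps only keys present in both inputs (24463 rows, the size of `df3`), while the left join keeps every row of the left input (32561 rows) and leaves the right-hand columns empty where no match exists. A small pure-Python sketch of that behaviour follows. `join_rows` is a hypothetical helper that assumes a single key column with unique values on the right side.

```python
def join_rows(left, right, key, join_type="inner"):
    """Join two lists of dicts on one key column (illustrative only)."""
    if join_type not in ("inner", "left"):
        raise ValueError("join_type must be 'inner' or 'left'")
    lookup = {row[key]: row for row in right}  # assumes unique keys on the right
    right_only = sorted({c for row in right for c in row if c != key})
    joined = []
    for row in left:
        match = lookup.get(row[key])
        if match is not None:
            joined.append({**row, **match})
        elif join_type == "left":
            # Keep the unmatched left row; right-hand columns become None.
            joined.append({**row, **dict.fromkeys(right_only)})
    return joined


left = [{"ifa": "1a", "age": None}, {"ifa": "2a", "age": None}, {"ifa": "3a", "age": 38}]
right = [{"ifa": "2a", "age_dupl": None}, {"ifa": "3a", "age_dupl": 38}]

inner = join_rows(left, right, "ifa", "inner")
left_join = join_rows(left, right, "ifa", "left")
print(len(inner))      # 2
print(len(left_join))  # 3
```

Renaming the overlapping non-key columns first, as the examples do with `rename_column`, keeps the two sides distinguishable in the joined result.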


# Write Datasets

- API specification of function **write_dataset** can be found <a href="../api_specification/anovos/data_ingest/data_ingest.html#anovos.data_ingest.data_ingest.write_dataset">here</a>
- Supported file types: csv, parquet, avro
- Limitations:
    - csv doesn't work with array columns
    - avro doesn't work with column names containing certain special characters, e.g. hyphen (-)
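Avro field names may contain only letters, digits and underscores and must not start with a digit, which is why hyphenated columns such as `education-num` are a problem. One possible workaround is to derive Avro-safe names first. The helper `avro_safe_names` below is a hypothetical sketch and assumes the sanitised names do not collide with each other.

```python
import re


def avro_safe_names(columns):
    """Map column names to Avro-compatible ones (illustrative only)."""
    safe = []
    for name in columns:
        # Replace every character outside [A-Za-z0-9_] with an underscore.
        cleaned = re.sub(r"[^A-Za-z0-9_]", "_", name)
        # Avro names must not begin with a digit.
        if cleaned and cleaned[0].isdigit():
            cleaned = "_" + cleaned
        safe.append(cleaned)
    return safe


cols = ["ifa", "education-num", "marital-status", "hours-per-week"]
print(avro_safe_names(cols))
# ['ifa', 'education_num', 'marital_status', 'hours_per_week']
```

The original and sanitised lists could then be passed as `list_of_cols` and `list_of_newcols` to `rename_column` before calling `write_dataset` with `file_type='avro'`.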

In [32]:
from anovos.data_ingest.data_ingest import write_dataset

In [34]:
#Example 1 - CSV
write_dataset(idf=df, file_path=outputPath, file_type='csv', 
              file_configs={'header':True,'repartition':1,'mode':'error','compression':'gzip'})

In [35]:
#Example 2 - Parquet
write_dataset(idf=df, file_path=outputPath, file_type='parquet',
              file_configs={'repartition':1,'mode':'append','compression':'snappy'})

In [36]:
#Example 3 - Avro
write_dataset(idf=df.select('ifa','age','workclass'), file_path=outputPath, file_type='avro', 
              file_configs={'repartition':1,'mode':'overwrite'})

In [37]:
# Example 4 - Without keyword arguments
write_dataset(df, outputPath, 'parquet',{'mode':'overwrite'})