# Loading text files

`fugue` can read text files natively via `load`, or by dropping into an execution engine.

You might find it useful to use the execution engine directly for loading non-standard files or files that are not natively supported by `fugue`.  

We'll demonstrate `pandas`, `duckdb` and `dask` here.

Here's an example of a non-standard file, where the headers are on the 2nd line and the data starts on the 5th:

"SITENAME"
"TIMESTAMP","RECORD","WS_80m_90deg_Avg","WS_80m_90deg_Std","WS_80m_90deg_3sGust_Max","WS_80m_90deg_Max","WS_80m_270deg_Avg","WS_80m_270deg_Std","WS_80m_270deg_3sGust_Max","WS_80m_270deg_Max","WS_65m_90deg_Avg","WS_65m_90deg_Std","WS_65m_90deg_3sGust_Max","WS_65m_90deg_Max","WS_65m_270deg_Avg","WS_65m_270deg_Std","WS_65m_270deg_3sGust_Max","WS_65m_270deg_Max","WS_50m_90deg_Avg","WS_50m_90deg_Std","WS_50m_90deg_3sGust_Max","WS_50m_90deg_Max","WS_50m_270deg_Avg","WS_50m_270deg_Std","WS_50m_270deg_3sGust_Max","WS_50m_270deg_Max","WS_30m_90deg_Avg","WS_30m_90deg_Std","WS_30m_90deg_3sGust_Max","WS_30m_90deg_Max","WS_30m_270deg_Avg","WS_30m_270deg_Std","WS_30m_270deg_3sGust_Max","WS_30m_270deg_Max","Dir_78m_90deg_avg","Dir_78m_90deg_std","Dir_63m_90deg_avg","Dir_63m_90deg_std","Dir_28m_90deg_avg","Dir_28m_90deg_std","Batt_Volt_Min","Press_Avg","Temp_C80_Avg","Temp_C15_Avg","Hum_Avg"
"TS","RN","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","meters/second","meters/second","meters/second","meters/second","Volts","mB","DegC","DegC","%"
"","","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","WVc","WVc","WVc","WVc","WVc","WVc","Min","Avg","Avg","Avg","Avg"
"2012-05-31 12:20:00",1,1.383,0.6,2.75,3.37,1.368,0.439,2.673,2.74,1.332,0.478,2.75,2.75,1.242,0.379,2.74,2.79,1.162,0.535,2.337,2.75,1.159,0.354,2.34,2.39,1.27,0.614,2.337,2.75,1.322,0.416,2.157,2.24,240.3,46,242,45.39,222,33.45,13.79,1009,13.84,14.08,65.67
"2012-05-31 12:30:00",2,1.183,0.449,1.923,2.13,1.135,0.324,1.94,1.99,0.948,0.524,1.923,2.13,1.068,0.303,1.723,1.74,0.701,0.547,1.923,2.13,0.913,0.308,1.673,1.74,0.771,0.539,1.717,2.13,0.997,0.28,1.657,1.74,282,26.79,264.3,30.25,278.5,62.87,13.73,1009,14.04,14.45,64.51

To read it we'll need to create a custom `Creator` in `fugue`!

In [None]:
import os
import typing

import duckdb
from fugue import DataFrame
from fugue import ExecutionEngine
from fugue import FugueWorkflow
from fugue import NativeExecutionEngine
from fugue_dask import DaskExecutionEngine
from fugue_duckdb import DuckExecutionEngine
import pandas as pd

## Standard text files

First, let's create a sample standard text file:

In [None]:
content = """\
a,b,c
1,2,3
1,2,3"""

We can read it natively with `load`, passing `header=True` so that the first line is used for the column names:

In [None]:
file = "/tmp/fugue_example_std_1.csv"
with open(file, "w") as f:
    f.write(content)

with FugueWorkflow() as dag:
    df = dag.load(file, header=True)
    df.show()

os.unlink(file)

PandasDataFrame
a:str|b:str|c:str
-----+-----+-----
1    |2    |3    
1    |2    |3    
Total count: 2



We can also read multiple files at once using a wildcard `*`; the rows of all matching files end up in one dataframe:

In [None]:
file_1 = "/tmp/fugue_example_std_2.csv"
file_2 = "/tmp/fugue_example_std_3.csv"
with open(file_1, "w") as f1, open(file_2, "w") as f2:
    f1.write(content)
    f2.write(content)
wildcard = "/tmp/fugue_example_std_*.csv"

with FugueWorkflow() as dag:
    df = dag.load(wildcard, header=True)
    df.show()

os.unlink(file_1)
os.unlink(file_2)

PandasDataFrame
a:str|b:str|c:str
-----+-----+-----
1    |2    |3    
1    |2    |3    
1    |2    |3    
1    |2    |3    
Total count: 4
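
To make the effect of the wildcard concrete, here is a minimal standard-library sketch that produces the same four rows. It is not how `fugue` implements `load`; the helper name `read_matching_csvs` and the temporary directory are assumptions made for illustration only.

```python
import csv
import glob
import os
import tempfile

content = "a,b,c\n1,2,3\n1,2,3"


def read_matching_csvs(pattern):
    """Read every CSV file matching the glob pattern and concatenate the rows."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as f:
            # DictReader uses the first line of each file as the header
            rows.extend(csv.DictReader(f))
    return rows


with tempfile.TemporaryDirectory() as tmp_dir:
    # Write two identical files, mirroring the example above
    for i in (2, 3):
        with open(os.path.join(tmp_dir, f"fugue_example_std_{i}.csv"), "w") as f:
            f.write(content)
    rows = read_matching_csvs(os.path.join(tmp_dir, "fugue_example_std_*.csv"))

print(len(rows))
print(rows[0])
```

As in the `fugue` output above, every value is read as a string because no schema inference is applied.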



## Non-standard text files

If your input file is non-standard, we can use the execution engine directly instead:

In [None]:
content = '''\
"SITENAME"
"TIMESTAMP","RECORD","WS_80m_90deg_Avg","WS_80m_90deg_Std","WS_80m_90deg_3sGust_Max","WS_80m_90deg_Max","WS_80m_270deg_Avg","WS_80m_270deg_Std","WS_80m_270deg_3sGust_Max","WS_80m_270deg_Max","WS_65m_90deg_Avg","WS_65m_90deg_Std","WS_65m_90deg_3sGust_Max","WS_65m_90deg_Max","WS_65m_270deg_Avg","WS_65m_270deg_Std","WS_65m_270deg_3sGust_Max","WS_65m_270deg_Max","WS_50m_90deg_Avg","WS_50m_90deg_Std","WS_50m_90deg_3sGust_Max","WS_50m_90deg_Max","WS_50m_270deg_Avg","WS_50m_270deg_Std","WS_50m_270deg_3sGust_Max","WS_50m_270deg_Max","WS_30m_90deg_Avg","WS_30m_90deg_Std","WS_30m_90deg_3sGust_Max","WS_30m_90deg_Max","WS_30m_270deg_Avg","WS_30m_270deg_Std","WS_30m_270deg_3sGust_Max","WS_30m_270deg_Max","Dir_78m_90deg_avg","Dir_78m_90deg_std","Dir_63m_90deg_avg","Dir_63m_90deg_std","Dir_28m_90deg_avg","Dir_28m_90deg_std","Batt_Volt_Min","Press_Avg","Temp_C80_Avg","Temp_C15_Avg","Hum_Avg"
"TS","RN","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","","meters/second","meters/second","meters/second","meters/second","meters/second","meters/second","meters/second","Volts","mB","DegC","DegC","%"
"","","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","Avg","Std","Max","Max","WVc","WVc","WVc","WVc","WVc","WVc","Min","Avg","Avg","Avg","Avg"
"2012-05-31 12:20:00",1,1.383,0.6,2.75,3.37,1.368,0.439,2.673,2.74,1.332,0.478,2.75,2.75,1.242,0.379,2.74,2.79,1.162,0.535,2.337,2.75,1.159,0.354,2.34,2.39,1.27,0.614,2.337,2.75,1.322,0.416,2.157,2.24,240.3,46,242,45.39,222,33.45,13.79,1009,13.84,14.08,65.67
"2012-05-31 12:30:00",2,1.183,0.449,1.923,2.13,1.135,0.324,1.94,1.99,0.948,0.524,1.923,2.13,1.068,0.303,1.723,1.74,0.701,0.547,1.923,2.13,0.913,0.308,1.673,1.74,0.771,0.539,1.717,2.13,0.997,0.28,1.657,1.74,282,26.79,264.3,30.25,278.5,62.87,13.73,1009,14.04,14.45,64.51
'''

Let's read the headers on the 2nd line separately from loading the rest of the text file:

In [None]:
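def read_header(filepath: str) -> typing.List[str]:
    # Skip the 1st line and parse only the header row (nrows=0 reads no data rows)
    row_1 = pd.read_csv(filepath, skiprows=1, nrows=0).columns
    # Strip a "columns: " prefix from the first name, if one is present
    header = [row_1[0].replace("columns: ", ""), *row_1[1:]]
    return header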
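
For comparison, the header row can also be pulled out without `pandas`. The following is a minimal sketch using only the standard library; the function name `read_header_stdlib`, its `header_line` parameter and the shortened three-column sample are assumptions made for illustration.

```python
import csv
import itertools
import os
import tempfile
import typing


def read_header_stdlib(filepath: str, header_line: int = 2) -> typing.List[str]:
    """Return the fields found on the given 1-based line of a CSV file."""
    with open(filepath, newline="") as f:
        reader = csv.reader(f)
        # islice skips the records before the header and yields just that one
        return next(itertools.islice(reader, header_line - 1, header_line))


# Shortened, made-up version of the non-standard file shown earlier
sample = (
    '"SITENAME"\n'
    '"TIMESTAMP","RECORD","Hum_Avg"\n'
    '"TS","RN","%"\n'
    '"","","Avg"\n'
    '"2012-05-31 12:20:00",1,65.67\n'
)

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write(sample)
    path = f.name
try:
    header = read_header_stdlib(path)
finally:
    os.unlink(path)

print(header)
```

This only reads as far as the header line, so it stays cheap even for large files.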

Then we specify these headers as the column names of the data that we are loading in.

> **Note:** `skip` and `columns` for `DuckExecutionEngine` correspond to `skiprows` and `names` for `pandas.read_csv`, because the `duckdb` CSV reader uses different conventions.

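The row counts for the `pandas`-style options can be checked directly in `pandas`. This sketch assumes that `header=True` in `fugue` corresponds to `header=0` in `pandas.read_csv`, and it uses a shortened, made-up three-column sample. With `skiprows=3` and `header=0`, the 4th line is consumed as a header row and replaced by `names`, which gives the same result as skipping four lines with no header row at all.

```python
import io

import pandas as pd

# Shortened, made-up version of the non-standard file shown earlier
sample = (
    '"SITENAME"\n'
    '"TIMESTAMP","RECORD","Hum_Avg"\n'
    '"TS","RN","%"\n'
    '"","","Avg"\n'
    '"2012-05-31 12:20:00",1,65.67\n'
    '"2012-05-31 12:30:00",2,64.51\n'
)
names = ["TIMESTAMP", "RECORD", "Hum_Avg"]

# Skip 3 lines, treat the 4th as a header row and override it with `names`
df = pd.read_csv(io.StringIO(sample), skiprows=3, header=0, names=names)

# Skip 4 lines and declare that no header row is present
df_alt = pd.read_csv(io.StringIO(sample), skiprows=4, header=None, names=names)

print(df)
print(df.equals(df_alt))
```

Both calls start reading data at the 5th line, which is the behaviour the engine-specific options below aim for.
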
In [None]:
def read_text_file(engine: ExecutionEngine, filepath: str) -> DataFrame:
    headers = read_header(filepath)
    if isinstance(engine, NativeExecutionEngine):
        # load_df uses pandas.read_csv
        df = engine.load_df(filepath, infer_schema=True, header=True, skiprows=3, names=headers)
    elif isinstance(engine, DuckExecutionEngine):
        # load_df uses duckdb read_csv_auto
        df = engine.load_df(filepath, infer_schema=True, skip=4, columns=headers)
    elif isinstance(engine, DaskExecutionEngine):
        # load_df uses dask.dataframe.read_csv
        df = engine.load_df(filepath, infer_schema=True, header=True, skiprows=3, names=headers)
    else:
        supported_engines = {NativeExecutionEngine, DuckExecutionEngine, DaskExecutionEngine}   
        raise ValueError(f"Engine {engine} is not supported, must be one of {supported_engines}")
    return df

In [None]:
%%html
<!-- Disable line wrapping in output -->
<style>
div.output_area pre {
    white-space: pre;
}
</style>

In [None]:
file = "/tmp/fugue_example_nonstd.csv"
with open(file, "w") as f:
    f.write(content)

with FugueWorkflow() as dag:
    df = dag.create(read_text_file, params={"filepath": file})
    df.show()

with FugueWorkflow(engine="duckdb") as dag:
    df = dag.create(read_text_file, params={"filepath": file})
    df.show()

with FugueWorkflow(engine="dask") as dag:
    df = dag.create(read_text_file, params={"filepath": file})
    df.show()

os.unlink(file)

PandasDataFrame
TIMESTAMP:str |RECORD:long|WS_80m_90deg_Avg:double|WS_80m_90deg_Std:double|WS_80m_90deg_3sGust_Max:double|WS_80m_90deg_Max:double|WS_80m_270deg_Avg:double|WS_80m_270deg_Std:double|WS_80m_270deg_3sGust_Max:double|WS_80m_270deg_Max:double|WS_65m_90deg_Avg:double|WS_65m_90deg_Std:double|WS_65m_90deg_3sGust_Max:double|WS_65m_90deg_Max:double|WS_65m_270deg_Avg:double|WS_65m_270deg_Std:double|WS_65m_270deg_3sGust_Max:double|WS_65m_270deg_Max:double|WS_50m_90deg_Avg:double|WS_50m_90deg_Std:double|WS_50m_90deg_3sGust_Max:double|WS_50m_90deg_Max:double|WS_50m_270deg_Avg:double|WS_50m_270deg_Std:double|WS_50m_270deg_3sGust_Max:double|WS_50m_270deg_Max:double|WS_30m_90deg_Avg:double|WS_30m_90deg_Std:double|WS_30m_90deg_3sGust_Max:double|WS_30m_90deg_Max:double|WS_30m_270deg_Avg:double|WS_30m_270deg_Std:double|WS_30m_270deg_3sGust_Max:double|WS_30m_270deg_Max:double|Dir_78m_90deg_avg:double|Dir_78m_90deg_std:double|Dir_63m_90deg_avg:double|Dir_63m_90deg_std:double|Dir_28m_90deg_avg

DaskDataFrame
TIMESTAMP:str |RECORD:long|WS_80m_90deg_Avg:double|WS_80m_90deg_Std:double|WS_80m_90deg_3sGust_Max:double|WS_80m_90deg_Max:double|WS_80m_270deg_Avg:double|WS_80m_270deg_Std:double|WS_80m_270deg_3sGust_Max:double|WS_80m_270deg_Max:double|WS_65m_90deg_Avg:double|WS_65m_90deg_Std:double|WS_65m_90deg_3sGust_Max:double|WS_65m_90deg_Max:double|WS_65m_270deg_Avg:double|WS_65m_270deg_Std:double|WS_65m_270deg_3sGust_Max:double|WS_65m_270deg_Max:double|WS_50m_90deg_Avg:double|WS_50m_90deg_Std:double|WS_50m_90deg_3sGust_Max:double|WS_50m_90deg_Max:double|WS_50m_270deg_Avg:double|WS_50m_270deg_Std:double|WS_50m_270deg_3sGust_Max:double|WS_50m_270deg_Max:double|WS_30m_90deg_Avg:double|WS_30m_90deg_Std:double|WS_30m_90deg_3sGust_Max:double|WS_30m_90deg_Max:double|WS_30m_270deg_Avg:double|WS_30m_270deg_Std:double|WS_30m_270deg_3sGust_Max:double|WS_30m_270deg_Max:double|Dir_78m_90deg_avg:double|Dir_78m_90deg_std:double|Dir_63m_90deg_avg:double|Dir_63m_90deg_std:double|Dir_28m_90deg_avg:d