## DataFrameTools

### Set environment variables

To load the modules used in this notebook when running it from outside the repository folder, set **PATH** below to the script directory, e.g. `C:/DataFrameTools`:

In [None]:
PATH = "/media/data/scripts/chn@git/chn-tools/tools/DataFrameTools" # <-- optional if running from native path

In [None]:
import importlib.util, os

if not os.path.isdir(PATH):
    PATH = os.getcwd()
PATH = os.path.realpath(PATH)

spec = importlib.util.spec_from_file_location("__init__", os.path.join(PATH, "__init__.py"))
init = importlib.util.module_from_spec(spec)
spec.loader.exec_module(init)

%matplotlib inline
%load_ext autoreload
%autoreload 2

### Import functions

In [None]:
from concat import concat_files
from dflib import df_load
from filters import df_filter_text
from filters import df_filter_interval
from filters import df_filter_timestamp
from normalize import normalize_df
from write import df_write
from xls2csv import convert_file

### Load data from file

Accepts plain-text files and Excel files (`.xls` and `.xlsx`).

In [None]:
file_name = ""

df = df_load(file_name); df
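
How `df_load` chooses a reader is not shown in this notebook. As a rough sketch, assuming it dispatches on the file extension, the same behavior can be approximated with plain pandas. The helper name `load_table` is hypothetical and is not part of DataFrameTools.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_table(path):
    """Hypothetical stand-in for df_load: choose a pandas reader from the extension."""
    suffix = Path(path).suffix.lower()
    if suffix in (".xls", ".xlsx"):
        # Needs an Excel engine (for example openpyxl for .xlsx) to be installed.
        return pd.read_excel(path)
    # sep=None with the Python engine lets pandas detect the delimiter.
    return pd.read_csv(path, sep=None, engine="python")


# Usage: write a small CSV to a temporary folder and load it back.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text("name,value\nx,1\ny,2\n")
    loaded = load_table(sample)
```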

#### Alternative: convert file → `.csv/.xls(x)`

Converts an existing file to or from Excel format, handling each sheet it contains separately.

In [None]:
file_name = ""

convert_file(file_name)
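
The per-sheet conversion can be pictured with plain pandas. This is only a sketch: the helper `sheets_to_csv` and the `<stem>_<sheet>.csv` naming scheme are assumptions for illustration, and `convert_file` may name its outputs differently.

```python
import tempfile
from pathlib import Path

import pandas as pd


def sheets_to_csv(sheets, stem, out_dir="."):
    """Write each sheet to '<stem>_<sheet>.csv' (a hypothetical naming scheme)."""
    written = []
    for sheet_name, frame in sheets.items():
        target = Path(out_dir) / f"{stem}_{sheet_name}.csv"
        frame.to_csv(target, index=False)
        written.append(target.name)
    return written


# In practice the mapping would come from pd.read_excel(path, sheet_name=None),
# which returns a dict of {sheet name: DataFrame}; here it is built by hand.
sheets = {
    "Sheet1": pd.DataFrame({"a": [1, 2]}),
    "Sheet2": pd.DataFrame({"b": [3]}),
}
with tempfile.TemporaryDirectory() as tmp:
    written = sheets_to_csv(sheets, "workbook", out_dir=tmp)
    first = pd.read_csv(Path(tmp) / written[0])
```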

#### Alternative: load multiple files

In [None]:
folder_name = ""

df = concat_files(folder_name, extension="", output_file="", ignore_header=False); df
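
For reference, stacking every file in a folder can be sketched with `glob` and `pandas.concat`. The helper `concat_folder` is hypothetical; it covers CSV input only and does not reproduce the `output_file` or `ignore_header` options of `concat_files`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def concat_folder(folder, extension="csv"):
    """Hypothetical sketch: read every *.<extension> file and stack the rows."""
    paths = sorted(Path(folder).glob(f"*.{extension}"))
    frames = [pd.read_csv(p) for p in paths]
    # pd.concat raises ValueError if the folder holds no matching files.
    # ignore_index=True renumbers the rows instead of repeating each file's index.
    return pd.concat(frames, ignore_index=True)


# Usage: two small files with the same header end up in one data frame.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.csv").write_text("x,y\n1,2\n")
    Path(tmp, "b.csv").write_text("x,y\n3,4\n")
    combined = concat_folder(tmp, extension="csv")
```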

#### Alternative: load multiple data frames

In [None]:
list_of_data_frames = []

# df_concat is not among the functions imported above; import it before running this cell
df = df_concat(list_of_data_frames); df

### Normalize data frame

Normalize the values in each column using `MinMax` scaling, `Standard` scaling, or the `ECDF` (empirical cumulative distribution function).

In [None]:
method  = "minmax"
columns = []

new_df = normalize_df(df, method=method, columns=columns); new_df
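
The three methods correspond to standard formulas: min-max maps a column to `(x - min) / (max - min)`, standard scaling to `(x - mean) / std`, and the ECDF to the fraction of values less than or equal to `x`. The sketch below implements them with plain pandas. The function `normalize_sketch` is hypothetical, and treating an empty `columns` list as "all numeric columns" is an assumption, not documented behavior of `normalize_df`.

```python
import pandas as pd


def normalize_sketch(df, method="minmax", columns=None):
    """Hypothetical illustration of the three scaling methods."""
    out = df.copy()
    # Assumption: an empty or missing column list means "all numeric columns".
    cols = list(columns) if columns else list(out.select_dtypes("number").columns)
    for col in cols:
        s = out[col].astype(float)
        if method == "minmax":
            # A constant column gives 0/0 here and therefore NaN.
            out[col] = (s - s.min()) / (s.max() - s.min())
        elif method == "standard":
            out[col] = (s - s.mean()) / s.std(ddof=0)
        elif method == "ecdf":
            # rank(method="max") counts the values <= x, ties included.
            out[col] = s.rank(method="max") / s.count()
        else:
            raise ValueError(f"unknown method: {method}")
    return out


example = pd.DataFrame({"x": [1, 2, 3, 4], "label": list("abcd")})
minmax = normalize_sketch(example, "minmax")
standard = normalize_sketch(example, "standard")
ecdf = normalize_sketch(example, "ecdf")
```

Non-numeric columns such as `label` pass through unchanged in this sketch.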

### Filter data from file

Set the parameters below to filter the data frame by text, numeric interval, or timestamp interval, or to cut columns.

#### Select rows that match a text filter rule

In [None]:
keywords = "word|Word"
column   = "column"

In [None]:
new_df = df_filter_text(df, text=keywords, column=column); new_df
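
The `|` in `keywords` suggests the text is treated as a regular expression with alternatives, although the notebook does not confirm this. Under that assumption, an equivalent selection in plain pandas looks like the sketch below; the example data is made up.

```python
import pandas as pd

df_example = pd.DataFrame({"column": ["a word", "Word up", "nothing"]})

# str.contains with regex=True keeps rows where either alternative appears.
mask = df_example["column"].astype(str).str.contains("word|Word", regex=True)
matches = df_example[mask]
```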

#### Select numeric interval to filter values

In [None]:
min_value = 0
max_value = -1
column    = "column"

new_df = df_filter_interval(df, min_value=min_value, max_value=max_value, column=column); new_df
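
A numeric interval filter can be sketched with `Series.between`, which includes both bounds by default. How `df_filter_interval` interprets the `max_value = -1` placeholder above is not stated here; the sketch takes both bounds literally and uses made-up data.

```python
import pandas as pd

df_example = pd.DataFrame({"column": [-5, 0, 3, 10, 42]})

# Keep rows whose value lies in the closed interval [0, 10].
selected = df_example[df_example["column"].between(0, 10)]
```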

#### Select date interval to filter timestamps

In [None]:
min_date = "1970-01-01"
max_date = "2038-01-18 03:14:07"
column   = "timestamp"

new_df = df_filter_timestamp(df, min_date=min_date, max_date=max_date, column=column); new_df
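
The same idea applies to timestamps once the column is parsed. The sketch below assumes the column holds date strings that `pandas.to_datetime` can parse; it is an illustration with made-up data, not the implementation of `df_filter_timestamp`.

```python
import pandas as pd

df_example = pd.DataFrame({
    "timestamp": ["1969-12-31 23:59:59", "2000-06-15 12:00:00", "2038-01-19 00:00:00"],
    "value": [1, 2, 3],
})

# Parse the strings, then keep rows inside the closed date interval.
ts = pd.to_datetime(df_example["timestamp"])
mask = ts.between(pd.Timestamp("1970-01-01"), pd.Timestamp("2038-01-18 03:14:07"))
selected = df_example[mask]
```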

### Write new data frame to file

Accepts both plain-text `CSV` (Comma-Separated Values) and Excel (`XLS`/`XLSX`) output formats.

In [None]:
output_file = "data.csv" # "data.xls"

df_write(new_df, output_file)
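
Choosing the output format from the file extension can be sketched as follows. The helper `write_table` is hypothetical, and whether `df_write` keeps or drops the index is an assumption; the sketch drops it.

```python
import tempfile
from pathlib import Path

import pandas as pd


def write_table(df, output_file):
    """Hypothetical stand-in for df_write: pick the writer from the extension."""
    suffix = Path(output_file).suffix.lower()
    if suffix in (".xls", ".xlsx"):
        # Excel output needs a writer engine (for example openpyxl) installed.
        df.to_excel(output_file, index=False)
    else:
        df.to_csv(output_file, index=False)


# Usage: write a small frame as CSV and read the raw text back.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.csv"
    write_table(pd.DataFrame({"a": [1, 2], "b": ["x", "y"]}), target)
    written_text = target.read_text()
```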

#### Compress output → `output.zip`

In [None]:
!zip output.zip *.{csv,xls,xlsx}
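
Where the `zip` command or shell brace expansion is unavailable (for example on Windows), the archive can be built with Python's standard `zipfile` module instead. This is a sketch; the helper name `zip_outputs` and its defaults are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_outputs(folder=".", archive="output.zip", extensions=(".csv", ".xls", ".xlsx")):
    """Add every file in `folder` with a matching extension to `archive`."""
    folder = Path(folder)
    files = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    with zipfile.ZipFile(folder / archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            # arcname keeps only the file name, not the full path.
            zf.write(p, arcname=p.name)
    return [p.name for p in files]


# Usage: only the CSV file ends up in the archive.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "data.csv").write_text("a,b\n1,2\n")
    Path(tmp, "notes.txt").write_text("not included\n")
    archived = zip_outputs(tmp)
    with zipfile.ZipFile(Path(tmp) / "output.zip") as zf:
        names_in_zip = zf.namelist()
```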

### [Download output files](output.zip)

___

### References

* Pandas documentation: https://pandas.pydata.org/