# Helper Notebook

This notebook declares and sets up the variables that the other notebooks use for their diagrams and charts.  Most importantly, **this notebook must be run before the other notebooks**.

Follow these instructions: 
1. Adjust the constants in the code cell below with the appropriate values.
2. Click on `Cell` and `Run All` to run the notebook.
3. Verify that no errors are raised in [this section](#Verification) of the notebook.
-----

In [None]:
INPUT = 'example.csv' # The .csv file generated by the post processor
IS_INDEX = True # Whether we should have an index in the pandas DataFrame
INDEX = 'id' # Which column to use for the pandas DataFrame index
PYLIST = ['citations', 'citation name', 'anchor text',
          'referring record id', 'tags'] # Which columns should be stored as Python lists
DELIM = '|' # The delimiter for columns that store list-like values

---

## Helper Functions

The rest of this notebook defines helper functions that check and read the data from the *.csv* file.  Modify with caution.
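
As orientation, here is a minimal sketch of what the helpers do end to end: read the CSV, set the index, and split delimited strings into Python lists. It runs on a made-up in-memory CSV rather than the real input. `SAMPLE_CSV` and `SAMPLE_DELIM` are hypothetical names used only for illustration. A column-wise `map` is used here for brevity, so the helper's own implementation may differ in detail.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the post processor's output.
SAMPLE_CSV = "id,tags,title\n1,a|b,First\n2,,Second\n"
SAMPLE_DELIM = '|'  # plays the same role as DELIM

sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
sample.set_index('id', inplace=True)

# Split delimited strings into lists; missing values are left untouched.
sample['tags'] = sample['tags'].map(
    lambda value: value.split(SAMPLE_DELIM) if isinstance(value, str) else value)

print(sample.at[1, 'tags'])           # the list-valued cell for id 1
print(pd.isna(sample.at[2, 'tags']))  # the empty cell for id 2 stays missing
```

Columns not listed in `PYLIST` (here, `title`) are left as plain strings.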

---

In [None]:
from __future__ import annotations
from errno import ENOENT
from os import strerror
from os.path import exists
from typing import List  # used in the type hints of the helper functions
import pandas as pd

In [None]:
def check_input() -> None:
    '''
    Check that the input .csv file exists at the path given by INPUT.
    Raise FileNotFoundError if the file is not found.
    '''
    if not exists(INPUT):
        raise FileNotFoundError(ENOENT, strerror(ENOENT), INPUT)

In [None]:
def create() -> pd.DataFrame:
    '''
    Return the postprocessed data file (.csv) as a pandas DataFrame.
    '''
    with open(INPUT, encoding='utf-8', newline='') as file:
        # Read from the opened handle so the encoding above is actually used
        df = pd.read_csv(file)
    if IS_INDEX:
        df.set_index(INDEX, inplace=True)
    return df

In [None]:
def convert_values(df: pd.DataFrame, columns: List[str], var_type: int) -> None:
    '''
    Convert the values of the given columns of the dataframe df to the type
    selected by var_type.  The dataframe is modified in place.

    var_type
    --------
    - 0: convert values to Python list types

    Parameters
    ----------
    df: pandas DataFrame
    columns: list of column names
    var_type: integer code of the type to convert values to
    '''
    valid_types = [0]
    if var_type not in valid_types:
        raise TypeError('Invalid var_type')
    if var_type == 0:
        convert_list_variables(df, columns)

In [None]:
def convert_list_variables(df: pd.DataFrame, list_col_names: List[str]) -> None:
    '''
    Convert the column values that are lists but stored as strings
    into Python list types, assuming the items are separated by DELIM.
    With DELIM = '|', a stored list looks like:
    "96adf8g9200534sf91134465|13203fs572f502d42957dsf313"
    The dataframe is modified in place.

    Parameters
    ----------
    df: pandas DataFrame
    list_col_names: list
                    The list of the column names
    '''
    for col_name in list_col_names:
        # Split strings on DELIM; leave missing (NaN) values untouched
        df[col_name] = df[col_name].map(
            lambda value: value.split(DELIM) if isinstance(value, str) else value)

---

## Verification

If the cell below runs without errors, the data is ready for visualization.
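
Running without errors does not by itself confirm that the list columns were converted. As an optional extra check, the sketch below tests that every non-missing cell in the chosen columns is a Python list. `list_columns_ok` and the `demo` dataframe are hypothetical and not part of this notebook's helpers.

```python
import pandas as pd


def list_columns_ok(df: pd.DataFrame, columns: list) -> bool:
    """Return True if every non-missing cell in the given columns is a list."""
    for col in columns:
        for value in df[col]:
            if isinstance(value, list):
                continue  # converted cell
            if pd.isna(value):
                continue  # missing cell, left as-is by the conversion
            return False  # a non-missing value that is not a list
    return True


# Hypothetical dataframe: 'tags' has been converted, 'title' has not.
demo = pd.DataFrame({'tags': [['a', 'b'], None], 'title': ['First', 'Second']})
print(list_columns_ok(demo, ['tags']))   # True
print(list_columns_ok(demo, ['title']))  # False
```

In this notebook, the equivalent call would be `list_columns_ok(dataframe, PYLIST)` after the verification cell has run.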

In [None]:
check_input()
dataframe = create()
convert_values(dataframe, PYLIST, 0)

---

## Exportation

Once the pandas dataframe has been created and configured successfully, it can be exported for visualization.  When all cells are run, the code below stores the dataframe with `%store` for use in other notebooks, which can load it with `%store -r dataframe`.
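
The dataframe is stored as an object rather than written back to *.csv* because the list-valued cells need to stay lists. As far as we understand, `%store` persists objects by pickling them. The sketch below does not use `%store`; it uses pandas' `to_pickle` and `read_pickle` as a stand-in to illustrate that lists survive such a round trip. The `frame` dataframe and the temporary file path are made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataframe with list-valued cells, like the PYLIST columns.
frame = pd.DataFrame({'tags': [['a', 'b'], ['c']]}, index=[1, 2])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'frame.pkl')
    frame.to_pickle(path)            # serialize the dataframe
    restored = pd.read_pickle(path)  # load it back, as another notebook would

# The cells come back as Python lists, not as delimited strings.
print(type(restored.at[1, 'tags']))
```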

In [None]:
%store dataframe
del dataframe # DELETE dataframe from this notebook