# Gsheet to md documentation

## About

Each variable in the Behaverse specification is described in a Google Sheet, so that it can be collaboratively edited and commented on.


This notebook extracts the data from those spreadsheets and writes a documentation file in markdown format, which can then be rendered on the Behaverse data model website (to look like [this](https://raw.githubusercontent.com/behaverse/data-model/main/_spec/cognitive-tests/1-trial.md)).


The rationale for doing this work is to make it easier to validate and maintain the documentation while allowing for collaborative work on the specification. 






## TODO

 - update the google spreadsheet
 - write validation tests (e.g., are there missing values?)
 - refactor code in this notebook to make it run as a GitHub job

In [1]:
import pandas as pd

## Downloading the spreadsheet

We created and populated a Google spreadsheet and generated a [public link](https://docs.google.com/spreadsheets/d/1LWTXsg2T4NPo0xbhD4pulIgkEJk5Oey7FusItltlCk8/edit?usp=sharing) with viewer rights. This makes it easy to download the spreadsheet for further processing.




In [2]:
## Get access to the google spreadsheet
sheet_id = "1LWTXsg2T4NPo0xbhD4pulIgkEJk5Oey7FusItltlCk8"
main_sheet = "Tables"
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={main_sheet}"
tables = pd.read_csv(url)
tables.head()


Unnamed: 0,grand_parent,parent,table_level,nav_order,table_name,table_description,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28
0,Cognitive tests,L1 data,L1,1,Trial,,,,,,...,,,,,,,,,,
1,Cognitive tests,L1 data,L1,2,Stimulus,Table describing each of the stimuli that were...,,,,,...,,,,,,,,,,
2,Cognitive tests,L1 data,L1,3,StimulusComponent,Stimuli can comprise multiple components. This...,,,,,...,,,,,,,,,,
3,Cognitive tests,L1 data,L1,4,Click,Table describing each click that was recorded ...,,,,,...,,,,,,,,,,
4,Cognitive tests,L1 data,L1,5,Option,Table describing each option that a subject co...,,,,,...,,,,,,,,,,


In [3]:
# helper function to load a specific sheet in the gsheet
def load_table(table_name):
    url_table = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={table_name}"
    return pd.read_csv(url_table)
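
Both URLs above interpolate the sheet name directly into the query string. That works for the single-word sheet names used here, but it could break for a name containing spaces or `&`. A minimal sketch of a safer URL builder, assuming the same `gviz` CSV endpoint; `build_csv_url` is a hypothetical helper and not part of this notebook:

```python
from urllib.parse import quote


def build_csv_url(sheet_id, sheet_name):
    """Build the CSV export URL for one sheet (hypothetical helper).

    The sheet name is percent-encoded so that spaces or '&' in the name
    cannot corrupt the query string.
    """
    return (
        f"https://docs.google.com/spreadsheets/d/{sheet_id}"
        f"/gviz/tq?tqx=out:csv&sheet={quote(sheet_name, safe='')}"
    )


# "my-sheet-id" is a placeholder, not a real spreadsheet id
print(build_csv_url("my-sheet-id", "Option Component"))
```

The resulting string can be passed to `pd.read_csv` in the same way as `url_table`.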

    

## Test functions (TODO)

These functions test whether the Google spreadsheet is valid (e.g., are there empty fields? inconsistent data types? repeated variable names?).

List of Tests:
 - empty fields
 - enum types present when needed
 - repeated variable names
 - has codebook changed since last time (keep track of versions, save tsv copies with version numbers)
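
The checks listed above could be sketched roughly as follows. This is only an outline under stated assumptions: the rows are plain dictionaries (for a data frame, `df.to_dict("records")` produces them), the columns `name`, `data_type` and `description` are assumed to be mandatory, and all function names are hypothetical.

```python
import hashlib

# Assumption: these columns must be filled in for every codebook row.
REQUIRED_COLUMNS = ("name", "data_type", "description")


def is_blank(value):
    """True for None, NaN (the only value not equal to itself) and empty strings."""
    return value is None or value != value or str(value).strip() == ""


def find_empty_fields(rows, columns=REQUIRED_COLUMNS):
    """Return (row_index, column) pairs whose value is blank."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col in columns
        if is_blank(row.get(col))
    ]


def find_repeated_names(rows):
    """Return variable names that occur more than once, in order of appearance."""
    seen, repeated = set(), []
    for row in rows:
        name = row.get("name")
        if name in seen and name not in repeated:
            repeated.append(name)
        seen.add(name)
    return repeated


def find_enums_without_values(rows):
    """Return names of variables typed 'enum' that list no enum values."""
    return [
        row.get("name")
        for row in rows
        if row.get("data_type") == "enum" and is_blank(row.get("enum_values"))
    ]


def codebook_fingerprint(csv_text):
    """Hash the raw CSV text so that two downloads can be compared for changes."""
    return hashlib.sha256(csv_text.encode("utf-8")).hexdigest()
```

Storing the fingerprint alongside each saved copy would give a cheap way to detect that the codebook has changed since the last run.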

## Printing functions

These functions convert data from the pandas data frame into formatted text to be written to a markdown documentation file.

In [4]:
def print_header(df_row):
    """
    Generates the yaml header text for the documentation md file.

    Args:
        df_row (pd.DataFrame): A one-row slice of the "Tables" sheet.
    """

    # extract data from df
    row = df_row.iloc[0]
    table_name = row['table_name']
    # a missing description is read as NaN; render it as empty text rather than "nan"
    table_description = '' if pd.isna(row['table_description']) else row['table_description']
    nav_order = row['nav_order']
    table_level = row['table_level']
    parent = row['parent']
    grand_parent = row['grand_parent']

    # format string
    header = f"""
---
layout: page
title: {table_name}
permalink: spec/{grand_parent.lower().replace(" ", "-")}/{table_level}/{table_name.lower()}
nav_order: {nav_order}
parent: {parent}
grand_parent: {grand_parent}
is_table: true
---

# <i class="fa fa-table"></i>{table_name} Table
{{: .no_toc }}

{table_description}


# Table of contents
{{: .no_toc .text-delta }}
- TOC
{{:toc}}


"""

    # drop the leading newline: the yaml front matter has to be the first thing in the file
    return header.lstrip()


In [5]:
def print_row(row):
    """Generates a chunk of text describing a specific row of the codebook.

    Args:
        row (pd.Series): A row of the codebook dataframe.
    """

    name = f'{row["name"]}\t[{row["data_type"]}]\n'
    description = f': {row["description"]}\n'

    if not pd.isna(row['index_scope']):
        index_scope = f': {row["index_scope"]}\n'
    else:
        index_scope = ''

    # for enum variables we need to print out the possible values;
    # this is keyed on enum_values being filled in rather than on data_type == 'enum'
    if not pd.isna(row['enum_values']):
        enum_text = enum2text(row)
    else:
        enum_text = ''

    # There can be up to 3 notes for each codebook entry
    # (the attribute marker is written as "{: .note}", without a stray trailing "]")
    notes = [f'\n> {row["note_" + str(i)]} {{: .note}}' for i in range(1, 4) if not pd.isna(row["note_" + str(i)])]
    notes = '\n'.join(notes)
    output = name + description + index_scope + enum_text + notes + '\n\n\n'

    return output
    

In [6]:
from io import StringIO


def enum2text(row):
    """Converts data describing an enum in the codebook (formatted as json) into a md formatted string.

    Args:
        row (pd.Series): A row of the codebook dataframe.
    """
    if pd.isna(row['enum_values']):
        enum_text = ''
    else:
        # wrap the string in StringIO: recent pandas versions deprecate passing literal json
        enum_df = pd.read_json(StringIO(row['enum_values']), orient='index')
        enum_text = enum_df.apply(lambda rr: f':  **{rr["name"]}:** {rr["description"]}', axis=1)
        enum_text = '\n'.join(enum_text) + '\n'

    return enum_text
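
The layout of an `enum_values` cell is not shown in this notebook. Since it is read with `orient='index'`, it is presumably a JSON object whose keys are row labels and whose values carry a `name` and a `description`. The sketch below reproduces the same formatting with the standard library only; the example cell content and the name `enum_json_to_md` are invented for illustration.

```python
import json


def enum_json_to_md(enum_json):
    """Format an enum description (JSON text) as markdown definition lines."""
    entries = json.loads(enum_json)  # assumed shape: {label: {"name": ..., "description": ...}}
    lines = [
        f':  **{entry["name"]}:** {entry["description"]}'
        for entry in entries.values()
    ]
    return "\n".join(lines) + "\n"


# Hypothetical cell content, not taken from the actual spreadsheet
example = (
    '{"0": {"name": "left", "description": "left response key"}, '
    '"1": {"name": "right", "description": "right response key"}}'
)
print(enum_json_to_md(example))
```

Each enum value becomes one `:  **name:** description` line, which sits under the variable name as a further definition entry.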

In [7]:
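def print_table(df):
    """Generates the md text for all rows of a codebook sheet, grouped by category."""
    old_category = ''
    md_text = ''

    for idx, row in df.iterrows():

        # get header title for group of variables;
        # print it only when the category changes (rows without a category get no header)
        current_category = row['category']
        if not pd.isna(current_category) and current_category != old_category:
            old_category = current_category
            category_header = "\n\n## " + current_category + "\n\n"
            md_text += category_header

        # convert and concatenate each row
        md_text += print_row(row)

    return md_text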

In [8]:
def save_md(md_text, filename):
    # write documentation into md file; the with-block closes the file even on error
    with open(filename + '.md', 'w') as md_file:
        md_file.write(md_text)

## Run the code


Loop over all tables listed in the `Tables` sheet, generate the md text for each one, and write it to a file named after the table.

In [9]:
table_count = len(tables["table_name"])

for current_table_index in range(table_count): # loop over tables

    # create header for this table using info from the "Tables" sheet
    header_txt = print_header(tables.iloc[[current_table_index]])

    # create body for this table
    # -- load sheet for this table
    current_table_name = tables["table_name"][current_table_index]
    print(f"Processing Table: {current_table_name}")
    df = load_table(current_table_name)
    
    # format content of the sheet
    body_txt = print_table(df)

    # write to file
    save_md(md_text=header_txt + body_txt, filename=current_table_name)


Processing Table: Trial
Processing Table: Stimulus
Processing Table: StimulusComponent
Processing Table: Click
Processing Table: Option
Processing Table: OptionComponent
Processing Table: Instrument
