# Gsheet to md documentation

## About

Each variable in the Behaverse specification is described in a Google Sheet, so that the descriptions can be collaboratively edited and commented on.


This notebook extracts the data from the spreadsheet and writes a documentation file in markdown format, which can then be rendered on the Behaverse data model website (to look like [this](https://raw.githubusercontent.com/behaverse/data-model/main/_spec/cognitive-tests/1-trial.md)).


The rationale for doing this work is to make it easier to validate and maintain the documentation while allowing for collaborative work on the specification. 






## TODO

 - update the google spreadsheet
 - write validation tests (e.g., are there missing values?)
 - refactor code in this notebook to make it run as a GitHub job

In [1]:
import pandas as pd

## Downloading the spreadsheet

We created and populated a Google spreadsheet and generated a [public link](https://docs.google.com/spreadsheets/d/1LWTXsg2T4NPo0xbhD4pulIgkEJk5Oey7FusItltlCk8/edit?usp=sharing) with viewer rights to it. Because the link is public, pandas can download the sheet directly as CSV for further processing.




In [2]:
# Download the "columns" sheet of the public Google spreadsheet as CSV
sheet_id = "1LWTXsg2T4NPo0xbhD4pulIgkEJk5Oey7FusItltlCk8"
sheet_name = "columns"
url = f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?tqx=out:csv&sheet={sheet_name}"
codebook = pd.read_csv(url)
codebook.head()



Unnamed: 0,id,table,category,name,data_type,data_subtype,variable_type,description,index_scope,example,...,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25,Unnamed: 26,Unnamed: 27,Unnamed: 28,Unnamed: 29,Unnamed: 30
0,1,generic,key,id,integer,,,"The unique trial identifier, generated in temp...",,,...,,,,,,,,,,
1,2,generic,context,study_name,string,,,The name of the study the participant particip...,,,...,,,,,,,,,,
2,3,generic,context,subject_id,string,,,The identifier of the entity (typically person...,,,...,,,,,,,,,,
3,4,generic,context,session_index,integer,,index,"When there are multiple sessions, this variabl...",index of session within subject.,,...,,,,,,,,,,
4,5,generic,task,instrument_name,string,,,The name of the instrument used for collecting...,,,...,,,,,,,,,,
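
The preview above includes several trailing `Unnamed: N` columns, which is how pandas labels columns whose header cell is empty. A minimal sketch of removing them before further processing is shown below. Whether these columns should be dropped at all is an assumption, and `drop_unnamed_columns` is a hypothetical helper that is not part of the notebook; the toy values are made up.

```python
import pandas as pd


def drop_unnamed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df without pandas' auto-named 'Unnamed: N' columns."""
    keep = [col for col in df.columns if not str(col).startswith("Unnamed:")]
    return df[keep].copy()


# Toy frame standing in for the downloaded codebook (values are made up).
example = pd.DataFrame({
    "name": ["id", "study_name"],
    "data_type": ["integer", "string"],
    "Unnamed: 21": [None, None],
    "Unnamed: 22": [None, None],
})

cleaned = drop_unnamed_columns(example)
print(list(cleaned.columns))
```

In the notebook this would be applied as `codebook = drop_unnamed_columns(codebook)` right after `pd.read_csv(url)`.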


## Test functions (TODO)

These functions test whether the Google spreadsheet is valid (e.g., are there empty fields, inconsistent data types, or repeated variable names?).

List of Tests:
 - empty fields
 - enum types present when needed
 - repeated variable names
 - has codebook changed since last time (keep track of versions, save tsv copies with version numbers)
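
The first three tests in the list could be sketched roughly as follows. This is only a starting point under stated assumptions: the function names are hypothetical, the set of required columns is a guess, duplicates are assumed to matter only within the same `table`, and enum variables are assumed to be marked with `data_type == "enum"`. The toy data is made up.

```python
import pandas as pd


def find_missing(codebook, required=("name", "data_type", "description")):
    """Return the rows that have a missing value in any required column."""
    cols = list(required)
    return codebook[codebook[cols].isna().any(axis=1)]


def find_duplicate_names(codebook):
    """Return the variable names that occur more than once within a table."""
    dup = codebook.duplicated(subset=["table", "name"], keep=False)
    return sorted(codebook.loc[dup, "name"].unique())


def find_enums_without_values(codebook):
    """Return enum rows that do not list their possible values."""
    mask = (codebook["data_type"] == "enum") & codebook["enum_values"].isna()
    return codebook[mask]


# Toy codebook with one problem of each kind (values are made up).
toy = pd.DataFrame({
    "table": ["generic"] * 4,
    "name": ["id", "id", "response", "rt"],
    "data_type": ["integer", "integer", "enum", "float"],
    "description": ["Trial id", "Trial id again", None, "Response time"],
    "enum_values": [None, None, None, None],
})

print(find_missing(toy)["name"].tolist())
print(find_duplicate_names(toy))
print(find_enums_without_values(toy)["name"].tolist())
```

Each function returns the offending rows or names rather than raising, so a caller can decide whether a problem should fail the build or only be reported.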

## Printing functions

These functions convert data from the pandas data frame into formatted text to be written in a markdown documentation file.

In [3]:
def print_header():
    """
    Generates the yaml header text for the documentation md file.
    """
    header = """
---
layout: page
title: Trial
permalink: spec/cognitive-tests/L1/trial
nav_order: 1
parent: L1 data
grand_parent: Cognitive tests
is_table: true
---


# <i class="fa fa-table"></i> Trial
{: .no_toc }

# Table of contents
{: .no_toc .text-delta }
- TOC
{:toc}


"""

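    # Jekyll only recognises front matter that starts on the first line of the file,
    # so drop the newline that follows the opening triple quote.
    return header.lstrip("\n")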


In [5]:
def print_row(row):
    """Generates a chunk of text describing a specific row of the codebook.

    Args:
        row (pd.Series): A row of the codebook dataframe.

    Returns:
        str: The markdown text for that row.
    """

    name = f'{row["name"]}\t[{row["data_type"]}]\n'
    description = f': {row["description"]}\n'

    # the index scope is optional
    if not pd.isna(row['index_scope']):
        index_scope = f': {row["index_scope"]}\n'
    else:
        index_scope = ''

    # for enum variables we need to print out the possible values
    # (section headers such as "context" are handled by the main loop,
    # which only prints them when they change across rows)
    if not pd.isna(row['enum_values']):
        enum_text = enum2text(row)
    else:
        enum_text = ''

    # There can be up to 3 notes for each codebook entry
    note_values = [row[f'note_{i}'] for i in range(1, 4)]
    notes = '\n'.join(f'\n> {note} {{ : .note}}' for note in note_values if not pd.isna(note))
    output = name + description + index_scope + enum_text + notes + '\n\n\n'

    return output

    

In [4]:
from io import StringIO

def enum2text(row):
    """Converts data describing an enum in the codebook (formatted as json) into a md formatted string.

    Args:
        row (pd.Series): A row of the codebook dataframe.

    Returns:
        str: One definition line per enum value, or '' if there are none.
    """
    if pd.isna(row['enum_values']):
        enum_text = ''
    else:
        # wrap the json string in StringIO: recent pandas versions deprecate
        # passing a literal json string to read_json
        enum_df = pd.read_json(StringIO(row['enum_values']), orient='index')
        enum_text = enum_df.apply(lambda rr: f':  **{rr["name"]}:** {rr["description"]}', axis=1)
        enum_text = '\n'.join(enum_text) + '\n'

    return enum_text
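
To make the expected shape of the `enum_values` cell concrete, here is a small standalone sketch. The JSON layout (an object keyed by index, each entry holding a `name` and a `description`) is inferred from the `orient='index'` argument and the fields used in `enum2text`; the example values are made up. The hypothetical helper `enum_json_to_md` uses the standard `json` module instead of `pd.read_json`, so it illustrates the conversion rather than reproducing the notebook's function.

```python
import json

# Hypothetical content of one "enum_values" cell (values are made up).
example_enum = json.dumps({
    "0": {"name": "left", "description": "The left key was pressed."},
    "1": {"name": "right", "description": "The right key was pressed."},
})


def enum_json_to_md(enum_json: str) -> str:
    """Render each enum entry as one markdown definition line."""
    entries = json.loads(enum_json)
    lines = [f':  **{e["name"]}:** {e["description"]}' for e in entries.values()]
    return "\n".join(lines) + "\n"


rendered = enum_json_to_md(example_enum)
print(rendered)
```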

## Run the code


Loop over all rows of the codebook to generate the md text, inserting a section header whenever the category changes.

In [7]:


old_category = ''
md_text = print_header()

for idx, row in codebook.iterrows():
    
    # get header title for group of variables
    current_category = row['category']
    if current_category != old_category:
        old_category = current_category
        category_header = "\n\n## " + current_category + "\n\n"
        md_text += category_header

    # convert and concatenate each row    
    md_text += print_row(row)



In [8]:
# write documentation into md file
with open('demo.md', 'w') as md_file:
    md_file.write(md_text)
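
One of the TODO items is to run this notebook as a GitHub job. A possible first step, sketched below, is to move the loop into a function that takes the codebook and a row renderer as arguments, so that it can be called from a script and tested without downloading the spreadsheet. `build_markdown` is a hypothetical name, and the lambda stands in for `print_row` only to keep the example self-contained; the toy data is made up.

```python
import pandas as pd


def build_markdown(codebook, render_row, header=""):
    """Concatenate the header, category headings and rendered rows."""
    parts = [header]
    previous_category = None
    for _, row in codebook.iterrows():
        category = row["category"]
        # start a new section whenever the category changes
        if category != previous_category:
            previous_category = category
            parts.append(f"\n\n## {category}\n\n")
        parts.append(render_row(row))
    return "".join(parts)


# Toy codebook and a stand-in renderer (values are made up).
toy = pd.DataFrame({
    "category": ["key", "context", "context"],
    "name": ["id", "study_name", "subject_id"],
})
md = build_markdown(toy, render_row=lambda row: f'{row["name"]}\n')
print(md)
```

In the notebook the call would look like `build_markdown(codebook, print_row, header=print_header())`.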