<a href="https://colab.research.google.com/github/emcdona1/fmnh_scripts/blob/main/Microplant_Mystery_Zooniverse_Processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Processing Your Zooniverse Project

This Colab script is designed specifically for the [Microplant Mystery Zooniverse Project](https://www.zooniverse.org/projects/nvuitton/unfolding-of-microplant-mysteries/).  It _may_ work for other projects, but I would strongly suggest

### Preparation steps:

  1. Download your Classifications CSV from Zooniverse.
  2. Open this file (e.g. in Excel or Google Sheets) and remove any rows from old versions of your workflows.  (See the section below for details.)
  3. Upload the new CSV file to your Google Drive.

-----

### Specific preprocessing suggestions for Microplant Mystery:
  1. Open the raw CSV file in Excel.
  2. Select all cells (ctrl+A), then turn on filtering (a funnel-shaped icon).
  3. In the `workflow_name` column, select ONLY "Microplant shapes of Inflated Sacs."  Hit OK.
  4. Delete all rows of that workflow.  (Click the row number on the left side to select the whole row, right-click to delete.)
  5. Turn off filtering (click the icon again).
  6. Determine the start date you'd like to use (e.g. the date the project officially launched).
  7. In the `created_at` column, find and delete all rows dated BEFORE your start date.
  8. Save the file as a new CSV.
  9. Upload the file to your Google Drive.
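
If you would rather not do this by hand, the same two filters can be sketched in pandas. This is only a minimal sketch, not part of the original workflow: the sample rows are made up, and the workflow name and start date are placeholders you would replace with your own. It assumes `created_at` holds timestamps that pandas can parse.

```python
import io
import pandas as pd

# Tiny stand-in for a Zooniverse classifications export (made-up rows).
raw_csv = io.StringIO(
    "classification_id,workflow_name,created_at\n"
    "1,Microplant shapes of Inflated Sacs,2021-01-05 10:00:00 UTC\n"
    "2,Stem and Branching Patterns,2020-12-30 09:00:00 UTC\n"
    "3,Stem and Branching Patterns,2021-02-01 12:30:00 UTC\n"
)
classifications = pd.read_csv(raw_csv)

# Placeholder values (assumptions): use your own retired workflow and launch date.
retired_workflow = "Microplant shapes of Inflated Sacs"
start_date = pd.Timestamp("2021-01-01", tz="UTC")

# Parse the timestamps so rows are compared as dates, not as text.
created_at = pd.to_datetime(classifications["created_at"], utc=True)

# Keep rows that are NOT in the retired workflow AND are on or after the start date.
keep = (classifications["workflow_name"] != retired_workflow) & (created_at >= start_date)
filtered = classifications[keep]
print(filtered)
```

From there, `filtered.to_csv(..., index=False)` would write the cleaned file, which you can then upload to your Google Drive as in step 9.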

-----

### Google Colab steps:

**1. Press the play button on the cell below to start up Google Colab and connect it to your Google Drive.**

   1. You may be prompted to open up a URL (blue link) - click the link and follow the prompts to authorize your Google account.
   2. After you connect, you'll see an authorization code (a long string of characters).  Copy this text and paste it into the Colab prompt box (below).
   3. Press Enter and wait 5-10 seconds while Google Drive connects.

In [30]:
import os
import sys
import io
import shutil
import ast
from pathlib import Path
import pandas as pd
from statistics import mode, StatisticsError
from google.colab import drive

drive.mount('/content/drive')
drive_folder = Path('/content/drive/MyDrive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**2. Upload your Zooniverse classification file:**

When prompted, type the name of the CSV file in your Google Drive -- be careful to type the name exactly!

In [34]:
csv_file = input('Type the name of your CSV file: ')
input_filename = Path(csv_file)
csv_file_path = Path(drive_folder, csv_file)
output_filename = f'{input_filename.stem}-processed'
if not os.path.exists(csv_file_path):
    print(f'WARNING!  The file "{csv_file}" is not in your Google Drive!  Did you type the name correctly?  Did it finish uploading?')

Type the name of your CSV file: unfolding-of-microplant-mysteries-classifications (1).csv


'unfolding-of-microplant-mysteries-classifications (1)-processed'
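
If you get the warning above, it can help to list the CSV files that are actually in the folder and copy the name from there. The helper below is a hypothetical addition (`list_csv_files` is not part of this notebook), shown here on a temporary folder so that it runs anywhere; in Colab you would pass `drive_folder` instead.

```python
import tempfile
from pathlib import Path

def list_csv_files(folder: Path) -> list:
    """Return the names of the CSV files directly inside `folder`, sorted."""
    return sorted(p.name for p in folder.glob("*.csv"))

# Demo on a throwaway folder with one CSV file and one other file.
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp)
    (demo_folder / "classifications.csv").write_text("a,b\n1,2\n")
    (demo_folder / "notes.txt").write_text("not a CSV")
    found = list_csv_files(demo_folder)

print(found)
```

Note that this only looks at the top level of the folder, not inside subfolders.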

**3. Apply some basic formatting to the CSV file, extract the metadata, and identify the workflows.**

In [35]:
zooniverse_classifications = pd.read_csv(csv_file_path)
def clean_formatting_and_parse(data: str):
    data = data.replace('null', 'None')
    data = data.replace('true', 'True')
    data = data.replace('false', 'False')
    return ast.literal_eval(data)

zooniverse_classifications.loc[:, 'annotations'] = zooniverse_classifications['annotations'].apply(clean_formatting_and_parse)
zooniverse_classifications.loc[:, 'metadata'] = zooniverse_classifications['metadata'].apply(clean_formatting_and_parse)
zooniverse_classifications.loc[:, 'subject_data'] = zooniverse_classifications['subject_data'].apply(clean_formatting_and_parse)

zooniverse_classifications['metadata-user_agent'] = zooniverse_classifications['metadata'].apply(lambda d: d['user_agent'])
zooniverse_classifications['image_filename'] = zooniverse_classifications['subject_data'].apply(lambda d: d[list(d.keys())[0]]['Filename'])

workflows = list(set(zooniverse_classifications.loc[:, 'workflow_name']))
print('Zooniverse Workflows found:')
for workflow in workflows:
  print(f'-- {workflow}')

Zooniverse Workflows found:
-- Determining the Reproductive Structure of a Liverwort
-- Stem and Branching Patterns 
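
One caveat about the text replacement in `clean_formatting_and_parse`: it also changes the words `null`, `true`, and `false` when they appear inside labels or other text. Assuming these columns hold valid JSON (which is how Zooniverse exports normally look, but please check your own file), a possible alternative is Python's built-in `json` module. The sample cell and the `parse_json_cell` name below are made up for illustration.

```python
import json

# Made-up annotation cell: JSON text with a lowercase null and the word "true" in a label.
raw_annotations = (
    '[{"task": "T0", "task_label": "Is it true that sacs are present?", "value": null}]'
)

def parse_json_cell(data: str):
    """Parse one JSON-encoded cell into Python lists and dictionaries."""
    return json.loads(data)

parsed = parse_json_cell(raw_annotations)
print(parsed[0]["task_label"])  # the label text is left untouched
print(parsed[0]["value"])       # JSON null becomes Python None

# For comparison: plain text replacement also rewrites the label.
replaced = raw_annotations.replace("true", "True")
print("Is it True" in replaced)
```

With this approach, the function would be passed to `.apply(...)` in the same way as `clean_formatting_and_parse`.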


**4. Parse the tasks (from annotations column) into separate columns.**

In [41]:
results_by_workflow = list()
for workflow_name in workflows:
    results = zooniverse_classifications[zooniverse_classifications['workflow_name'] == workflow_name].copy()
    max_count = 0
    for idx, row in results.iterrows():
        tasks = row['annotations']
        for count, task in enumerate(tasks):
            name = f'Task{count+1}'
            results.loc[idx, f'{name}-id'] = task['task']
            results.loc[idx, f'{name}-label'] = task['task_label']
            results.loc[idx, f'{name}-value'] = str(task['value'])
        # Track the largest number of tasks seen in any row of this workflow.
        max_count = max(max_count, len(tasks))
    results_by_workflow.append(results)
    print(f'{workflow_name.strip()} expanded into {max_count} task(s).')

Determining the Reproductive Structure of a Liverwort expanded into 2 task(s).
Stem and Branching Patterns expanded into 1 task(s).
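
To make the column naming easier to see, here is the same expansion for a single row, written as a standalone function. This is a sketch only: `flatten_annotations` and the sample tasks are made up and are not part of the cell above.

```python
def flatten_annotations(tasks):
    """Turn a list of task dictionaries into one flat dictionary of TaskN-* entries."""
    flat = {}
    for count, task in enumerate(tasks, start=1):
        name = f"Task{count}"
        flat[f"{name}-id"] = task["task"]
        flat[f"{name}-label"] = task["task_label"]
        flat[f"{name}-value"] = str(task["value"])
    return flat

# Made-up annotations for one classification with two tasks.
sample_tasks = [
    {"task": "T0", "task_label": "Which structure do you see?", "value": "Antheridia"},
    {"task": "T1", "task_label": "Mark each structure", "value": []},
]
flat_row = flatten_annotations(sample_tasks)
for key, value in flat_row.items():
    print(f"{key}: {value}")
```

Building one dictionary per row and then creating a DataFrame from the list of dictionaries is usually faster than writing cell by cell with `.loc` inside `iterrows()`, though for a small export the difference may not matter.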


**5. Extract the (x, y) coordinates.**

In [37]:
def extract_coordinate_info(workflow, idx, pt_info, colname):
    workflow.loc[idx, f'{colname}-x'] = pt_info['x']
    workflow.loc[idx, f'{colname}-y'] = pt_info['y']
    workflow.loc[idx, f'{colname}-width'] = pt_info['width']
    workflow.loc[idx, f'{colname}-height'] = pt_info['height']
    workflow.loc[idx, f'{colname}-value_label'] = pt_info['tool_label']

for workflow_name, workflow in zip(workflows, results_by_workflow):
    value_columns = [val for val in list(workflow.columns) if 'value' in val]
    for idx, row in workflow.iterrows():
        for col in value_columns:
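            try:
                points = ast.literal_eval(row[col])
                if isinstance(points, list):
                    for pt_idx, point in enumerate(points):
                        colname = col.split('-')[0] + '-Pt' + str(pt_idx + 1)
                        # Use the current point, not always the first one in the list.
                        extract_coordinate_info(workflow, idx, point, colname)
                workflow.loc[idx, col] = None
            except (ValueError, SyntaxError, TypeError, KeyError):
                # Not a list of point dictionaries (e.g. a plain text answer): leave the value as-is.
                pass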
    found_points = list(set([a.split('-')[0]+'-'+a.split('-')[1] for a in list(workflow.columns) if 'Pt' in a]))
    print(f'Workflow: {workflow_name}')
    if len(found_points) > 0:
        print(f'-- Point info found:', found_points)
    else:
        print('-- No point data found.')
    print()


Workflow: Determining the Reproductive Structure of a Liverwort
-- Point info found: ['Task2-Pt1', 'Task2-Pt33', 'Task2-Pt39', 'Task2-Pt45', 'Task2-Pt5', 'Task2-Pt26', 'Task2-Pt11', 'Task2-Pt7', 'Task2-Pt28', 'Task2-Pt23', 'Task2-Pt16', 'Task2-Pt30', 'Task2-Pt20', 'Task2-Pt25', 'Task2-Pt27', 'Task2-Pt40', 'Task2-Pt43', 'Task2-Pt31', 'Task2-Pt29', 'Task2-Pt42', 'Task2-Pt24', 'Task2-Pt17', 'Task2-Pt48', 'Task2-Pt55', 'Task2-Pt53', 'Task2-Pt49', 'Task2-Pt2', 'Task2-Pt13', 'Task2-Pt21', 'Task2-Pt32', 'Task2-Pt12', 'Task2-Pt22', 'Task2-Pt50', 'Task2-Pt14', 'Task2-Pt54', 'Task2-Pt59', 'Task2-Pt9', 'Task2-Pt4', 'Task2-Pt35', 'Task2-Pt57', 'Task2-Pt60', 'Task2-Pt62', 'Task2-Pt37', 'Task2-Pt44', 'Task2-Pt10', 'Task2-Pt46', 'Task2-Pt56', 'Task2-Pt36', 'Task2-Pt8', 'Task2-Pt38', 'Task2-Pt34', 'Task2-Pt47', 'Task2-Pt58', 'Task2-Pt6', 'Task2-Pt19', 'Task2-Pt41', 'Task2-Pt51', 'Task2-Pt52', 'Task2-Pt3', 'Task2-Pt61', 'Task2-Pt15', 'Task2-Pt18']

Workflow: Stem and Branching Patterns 
-- No point data found.
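
As the output above shows, a classification with many marks produces a large number of `Pt` columns. One possible alternative (a sketch only, not what this notebook does) is a "long" layout with one row per point. The `points_to_rows` helper, the sample values, and the use of a `classification_id` field to link points back to their classification are all assumptions for illustration.

```python
import ast

def points_to_rows(classification_id, task_name, value_text):
    """Return one dictionary per marked point, or [] if the value holds no points."""
    try:
        points = ast.literal_eval(value_text)
    except (ValueError, SyntaxError):
        return []  # a plain text answer, not a list of points
    if not isinstance(points, list):
        return []
    rows = []
    for pt_idx, point in enumerate(points, start=1):
        if not isinstance(point, dict) or "x" not in point:
            continue  # skip list entries that are not point dictionaries
        rows.append({
            "classification_id": classification_id,
            "task": task_name,
            "point": pt_idx,
            "x": point["x"],
            "y": point["y"],
            "width": point.get("width"),
            "height": point.get("height"),
            "value_label": point.get("tool_label"),
        })
    return rows

# Made-up value string with two marked rectangles.
sample_value = (
    "[{'x': 10.5, 'y': 20.0, 'width': 5.0, 'height': 6.0, 'tool_label': 'Archegonia'},"
    " {'x': 30.0, 'y': 40.0, 'width': 7.0, 'height': 8.0, 'tool_label': 'Antheridia'}]"
)
point_rows = points_to_rows(101, "Task2", sample_value)
for point_row in point_rows:
    print(point_row)
```

A list of such dictionaries can be passed to `pd.DataFrame(...)` to get a table with a fixed set of columns, no matter how many points a volunteer marked.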

In [38]:
input_filename.parent

PosixPath('.')

**6. Save the files to your Google Drive in a new folder.**

In [40]:
results_folder = Path(drive_folder, input_filename.parent, f'{input_filename.stem}_processed')
if not os.path.exists(results_folder):
  os.makedirs(results_folder)
for (workflow_name, workflow) in zip(workflows, results_by_workflow):
  file_location = Path(results_folder, f'{workflow_name.strip()}.csv')
  workflow.to_csv(file_location, index=False, encoding='UTF-8')

# save_location = shutil.make_archive(output_filename, 'zip', results_folder)
print(f'Saved to {results_folder} in your Google Drive.')

Saved to /content/drive/MyDrive/unfolding-of-microplant-mysteries-classifications (1)_processed in your Google Drive.
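
Since each output file is named after its workflow, a workflow name containing characters such as `/` or `:` could produce an invalid path. If that ever comes up, the name could be cleaned first. The `safe_filename` helper below is hypothetical, and the set of characters it replaces is an assumption, not a complete rule for every file system.

```python
import re

def safe_filename(workflow_name: str) -> str:
    """Strip surrounding spaces and replace characters that are risky in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", workflow_name.strip())
    return f"{cleaned}.csv"

print(safe_filename("Stem and Branching Patterns "))
print(safe_filename("Leaves/Lobes: round?"))
```

The result would take the place of `f'{workflow_name.strip()}.csv'` when building `file_location`.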
