# Load_data/metadata generator
Written by Fernanda Fossa @fefossa

Python 3.9.12

**Description**: This notebook generates the CSV files used by the CellProfiler LoadData module, starting from a list of the images to be analyzed. At the end, based on the user inputs, it also generates **AWS commands** that upload the CSV files and images to an AWS S3 bucket using the AWS CLI. These files, together with the CellProfiler pipelines, are then processed on AWS using Distributed-CellProfiler. See more at https://github.com/DistributedScience/Distributed-CellProfiler

**Inputs**: the notebook requires the following:

- Path to the input folder (where images are located) = Make sure to have only the images that will be analyzed in this folder.

- project_name = this is the name of the folder in AWS where images and CSV will be uploaded.

- batch_id = this is the name of each subproject inside project_name folder. You can have different subprojects within one project. 

- Channels dictionary = a dictionary with the channel as the key and the name you want to give it as the value. Example: 'DAPI':'OrigDNA', where DAPI is the channel (as written in the image filename) and OrigDNA is the name given to this image in CellProfiler. Three pre-made options are available (Cell Painting, Live Cell Painting, and ToxPath panels); if yours is different, provide a new dictionary.

**Outputs**: 

1. **load_data.csv**, used with the Illumination Correction pipelines.

    To build the columns below, we use a regex adapted to files from the Cytation 5 microscope (e.g. B10_02_1_10_GFP_001.tif), where the positions of the Well, Site, and Channel in the filename are known. If your images come from a different microscope with a different naming pattern, you need to **change the regex**.

    - FileName_CHANNEL = the image filenames found in the input folder, where CHANNEL is the value you provided in the dictionary.

    - PathName_CHANNEL = the path to the images on AWS (used later by the virtual machines). To change it, modify the images_dir variable.

    - Metadata_Well and Metadata_Site = both are extracted from the image filename.

    - Metadata_Plate = the plate name, which is usually the name of the FOLDER where the images are located and is extracted with a separate regex. Any spaces are replaced with "_" because AWS does not handle spaces well.

2. **load_data_with_illum.csv**, used with the analysis pipelines after Illumination Correction has been performed.

    The columns are the same as above, with two additional columns per channel:

    - FileName_IllumCHANNEL = the name of the Illumination Correction file (the name pattern is "PlateName_IllumCHANNEL.npy")

    - PathName_IllumCHANNEL = the path on AWS where these Illum files will be located. Again, this pattern is intended for those using AWS with Distributed-CellProfiler.

3. **AWS commands**: based on the project_name, batch_id, and plate, we generate two commands that you can paste into the Command Prompt (after installing and setting up the AWS CLI, https://aws.amazon.com/cli/) to upload the images and CSV files to the AWS S3 bucket using the aws s3 sync command.
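
As a rough illustration of the filename parsing described above, the sketch below applies the notebook's regex (with the dot before `tif` escaped) to the example Cytation 5 filename. The helper name `parse_filename` is hypothetical and is not part of the notebook.

```python
import re

# Same pattern the notebook uses, with the "." before "tif" escaped.
# Layout: Well _ <ignored> _ <ignored> _ Site _ Channel _001.tif
FILENAME_REGEX = r"^(?P<Well>.*)_.*_.*_(?P<Site>.*)_(?P<Channel>.*)_001\.tif"


def parse_filename(filename):
    """Return the Well, Site and Channel of a filename, or None if it does not match.

    Hypothetical helper, for illustration only.
    """
    match = re.search(FILENAME_REGEX, filename)
    if match is None:
        return None
    return match.groupdict()


print(parse_filename("B10_02_1_10_GFP_001.tif"))
# {'Well': 'B10', 'Site': '10', 'Channel': 'GFP'}
print(parse_filename("notes.txt"))
# None
```

If your microscope writes a different pattern, the three named groups (`Well`, `Site`, `Channel`) are the parts the rest of the notebook relies on, so a replacement regex would need to keep those names.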

    





## Import libraries

In [3]:
import pandas as pd
from os import walk
import shutil
import os
import numpy as np
import re
import easygui as eg
import ipywidgets as widgets
from IPython.display import clear_output
import pyperclip

# Inputs

In [4]:
input_folder = eg.enterbox('Paste path to input folder and press OK', 'Paste Path')
# input_folder = input(r"Insert your local path to the folder images here:")

In [5]:
project_name = eg.enterbox('Write project_name here (folder name)', 'Project_Name')

In [6]:
batch_id = eg.enterbox('Write batch_id here (subproject name, second folder)', 'batch_id')

## Dictionary with channel as a key and new name as a value

### Create your own dictionary

- First enter the name of the channel, then the new name. When you are finished, type **done** and press OK.

In [74]:
ch_dic = {}

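while True:
    channel = eg.enterbox('Enter CHANNEL name (e.g. DAPI). Enter done to finish input ')
    # enterbox returns None when the dialog is cancelled; treat that like 'done'
    if channel is None or channel.strip().lower() == 'done':
        break
    name = eg.enterbox('Enter the NEW NAME (e.g. OrigDNA): ')
    if name:  # skip the entry if the second dialog was cancelled or left empty
        ch_dic[channel.strip()] = name.strip()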

print(ch_dic)

{'DAPI': 'OrigDNA', 'GFP': 'Acridine'}


### OR Run one of the cells below to use our pre-made dictionaries

- We have dictionaries for the Cell Painting, Live Cell Painting, and ToxPath image panels.

- Run **ONLY ONE OF THE CELLS BELOW**

#### For CELL PAINTING

In [75]:
ch_dic = {'DAPI':'OrigDNA', 'GFP':'OrigER', 'PropidiumIodide':'OrigAGP', 'CY5':'OrigMito'}
print(ch_dic)

{'DAPI': 'OrigDNA', 'GFP': 'OrigER', 'PropidiumIodide': 'OrigAGP', 'CY5': 'OrigMito'}


#### For LIVE CELL PAINTING

In [7]:
ch_dic = {'GFP':'AOGFP', 'PropidiumIodide':'AOPI'}
print(ch_dic)

{'GFP': 'AOGFP', 'PropidiumIodide': 'AOPI'}


#### For TOXPATH

In [80]:
ch_dic = {'DAPI':'OrigDNA', 'GFP':'OrigLipids', 'TexasRed':'OrigH2ax', 'CY5':'OrigNfkb'}
print(ch_dic)

{'DAPI': 'OrigDNA', 'GFP': 'OrigLipids', 'TexasRed': 'OrigH2ax', 'CY5': 'OrigNfkb'}


## Make sure the inputs are correct

In [8]:
print('Input folder:', input_folder)
print('Project name:', project_name)
print('Batch id:', batch_id)
print('Channels dictionary:', ch_dic)

Input folder: G:\My Drive\Training\20230130_TrainingNanoCell\images\211015_065907_Plate 1
Project name: Training_NanoCell
Batch id: 2021_10_08_AgNPViability
Channels dictionary: {'GFP': 'AOGFP', 'PropidiumIodide': 'AOPI'}
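
One more check that may be worth running at this point is whether every channel present in the filenames is covered by the dictionary. The sketch below is an assumption-laden illustration, not part of the notebook: `unmapped_channels` and the example filenames are hypothetical, and the matching rule (spaces stripped, then "key contained in channel") mirrors what the generator cell does.

```python
import re

# Notebook's filename pattern, with the "." before "tif" escaped
FILENAME_REGEX = r"^(?P<Well>.*)_.*_.*_(?P<Site>.*)_(?P<Channel>.*)_001\.tif"


def unmapped_channels(filenames, ch_dic):
    """Return the channels found in the filenames that no key of ch_dic covers.

    Hypothetical helper, for illustration only.
    """
    found = set()
    for name in filenames:
        match = re.search(FILENAME_REGEX, name)
        if match:
            # The generator strips spaces: "Propidium Iodide" -> "PropidiumIodide"
            found.add(match.group("Channel").replace(" ", ""))
    # A channel counts as covered when some dictionary key is contained in it
    return sorted(ch for ch in found if not any(key in ch for key in ch_dic))


# Hypothetical filenames following the Cytation 5 pattern
example_files = [
    "B10_02_1_10_GFP_001.tif",
    "B10_02_2_10_Propidium Iodide_001.tif",
    "B10_02_3_10_DAPI_001.tif",
]
print(unmapped_channels(example_files, {"GFP": "AOGFP", "PropidiumIodide": "AOPI"}))
# ['DAPI']
```

An empty list means every channel in the folder would receive FileName_ and PathName_ columns; any channel listed would be silently left out of the CSV.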


## Load data generator

- Run the next cell to generate the load data CSVs from the regex, filenames, and folder names. Both files are saved in a load_data_csv folder inside the input folder.

- IMPORTANT: the cell generates both load_data.csv and load_data_with_illum.csv because illum_list contains both True and False. To generate **load_data.csv only**, set illum_list to [False]; to generate **load_data_with_illum.csv only**, set it to [True].
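
To make the folder-name step concrete, here is a minimal sketch of how the plate name is derived from the input folder, using the same plate regex as the generator cell. The function name `plate_from_folder` and the explicit error are assumptions added for illustration.

```python
import re

# Same plate pattern as the notebook: the last path component is the plate,
# the one before it is captured as "Assay". Both "\" and "/" separators match.
PLATE_REGEX = r".*[\\/](?P<Assay>.*)[\\/](?P<Plate>.*)$"


def plate_from_folder(input_folder):
    """Return the plate name with spaces replaced by underscores.

    Hypothetical helper, for illustration only.
    """
    match = re.search(PLATE_REGEX, input_folder)
    if match is None:
        raise ValueError(f"Could not find a plate folder in: {input_folder!r}")
    return match.group("Plate").replace(" ", "_")


folder = r"G:\My Drive\Training\20230130_TrainingNanoCell\images\211015_065907_Plate 1"
print(plate_from_folder(folder))
# 211015_065907_Plate_1
```

Note that a path ending in a separator would leave the `Plate` group empty with this pattern, so it is probably safest to paste the folder path without a trailing slash.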

### Choose: Use this CSV locally or on AWS

- Decide whether the output CSV should contain paths on an AWS machine or local paths (in which case the input folder is also used as the images_dir variable).

In [25]:
aws = False  # True: write AWS paths to the CSV; False: use this CSV LOCALLY
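
The effect of this flag can be sketched as a small function. This is only an illustration of the branch in the generator cell: `build_dirs` is a hypothetical name, while the path layout is copied from the notebook.

```python
def build_dirs(aws, input_folder, project_name, batch_id, plate):
    """Return (images_dir, illum_dir) for AWS or local use.

    Hypothetical helper; the path layout follows the generator cell.
    """
    if aws:
        base = "/home/ubuntu/bucket/projects/" + project_name + "/" + batch_id
        images_dir = base + "/images/" + plate + "/images"
        illum_dir = base + "/illum/" + plate
    else:
        # Local use: the CSV points straight at the input folder
        images_dir = input_folder
        illum_dir = input_folder + "/illum"
    return images_dir, illum_dir


images_dir, illum_dir = build_dirs(
    aws=True,
    input_folder="unused_when_aws_is_true",
    project_name="Training_NanoCell",
    batch_id="2021_10_08_AgNPViability",
    plate="211015_065907_Plate_1",
)
print(images_dir)
print(illum_dir)
```

With `aws=True` the paths start at `/home/ubuntu/bucket/projects/`, which is where the notebook expects the bucket to be visible on the virtual machines; with `aws=False` the input folder itself is written to the CSV.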

# add Width of the image

# add a variation for representative_cells that removes the path

In [26]:
illum_list = [True, False]

for illum_bool in illum_list:
    df = pd.DataFrame()
    #plate name
    regex_plate = r".*[\\/](?P<Assay>.*)[\\/](?P<Plate>.*)$"
    plate_search = re.search(regex_plate, input_folder)
    platefind = plate_search.group('Plate')
    plate = platefind.replace(" ", "_")
    if aws:
        images_dir = "/home/ubuntu/bucket/projects/" + project_name + "/" + batch_id + "/images/" + plate + "/images"
        illum_dir = "/home/ubuntu/bucket/projects/" + project_name + "/" + batch_id + "/illum/" + plate
    else:
        images_dir = input_folder
        illum_dir = input_folder + r"/illum"
    #filesname and channel
    files = []
    for (dirpath, dirnames, filenames) in walk(input_folder):
        files.extend(sorted(filenames))
        break
    #find channels
    channels = []
    regex = r"^(?P<Well>.*)_.*_.*_(?P<Site>.*)_(?P<Channel>.*)_001\.tif"
    for f in files:
        matches = re.search(regex, f)
        if matches:
            channels.append(matches.group('Channel'))
    channels = np.array(channels)
    ch_unique = np.unique(channels)
    #create columns with files and pathnames
    for ch in ch_unique:
        temp_list = []
        for file in files:
            if ch in file and 'tif' in file:
                temp_list.append(file)
        if ' ' in ch:
            ch = ch.replace(' ', '')
        for key,value in ch_dic.items():
            if key in ch:
                df["FileName_"+value] = temp_list
                # print(temp_list)
                df["PathName_"+value] = images_dir
                if illum_bool:
                    illum_temp = plate + "_Illum" + value + ".npy"
                    df["FileName_Illum"+value] = illum_temp
                    df["PathName_Illum"+value] = illum_dir
    #get wells and sites names
    wells = []
    sites = []
    for fname in df.iloc[:, 0]:
        matches = re.search(regex, fname)
        wells.append(matches.group('Well'))
        sites.append(matches.group('Site'))
    df['Metadata_Well'] = wells
    df['Metadata_Site'] = sites
    df['Metadata_Plate'] = plate
    #save df
    directory = "load_data_csv"
    path = os.path.join(input_folder, directory)
    os.makedirs(path, exist_ok=True)
    if illum_bool:
        df.to_csv(os.path.join(path, 'load_data_with_illum.csv'), index=False)
    else:
        df.to_csv(os.path.join(path, 'load_data.csv'), index=False)
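
To help check the generated files, the sketch below lists the columns one would expect in each CSV for a given dictionary, together with the Illum filename pattern. Both helper names are hypothetical, and the column order here simply follows `ch_dic`, whereas the generator cell orders channels alphabetically by the name found in the filenames.

```python
def illum_filename(plate, value):
    """Illum file name pattern used by the notebook: PlateName_IllumCHANNEL.npy."""
    return plate + "_Illum" + value + ".npy"


def expected_columns(ch_dic, with_illum):
    """List the CSV columns each channel contributes, then the metadata columns.

    Hypothetical helper, for illustration only.
    """
    columns = []
    for value in ch_dic.values():
        columns += ["FileName_" + value, "PathName_" + value]
        if with_illum:
            columns += ["FileName_Illum" + value, "PathName_Illum" + value]
    return columns + ["Metadata_Well", "Metadata_Site", "Metadata_Plate"]


ch_dic = {"GFP": "AOGFP", "PropidiumIodide": "AOPI"}
print(expected_columns(ch_dic, with_illum=False))
print(expected_columns(ch_dic, with_illum=True))
print(illum_filename("211015_065907_Plate_1", "AOGFP"))
```

Comparing this list against the header row of load_data.csv or load_data_with_illum.csv is a quick way to spot a channel that was not matched by the dictionary.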

## AWS commands

- The cells below generate an output that can be copied and pasted into the Command Prompt (after installing the AWS CLI).

### Click below to copy the command and upload your CSVs into the cloud in the specified path

In [104]:
#create the command
load_csv = "s3://imaging-platform-ssf/projects/" + project_name + "/workspace/load_data_csv/"+ batch_id + "/" + plate + "/"
load_csv_input = os.path.join(input_folder, "load_data_csv")
load_csv_output = 'aws s3 sync "' + load_csv_input + '" "' + load_csv + '" --exclude "*" --include="*.csv"'
#button to copy the command
button = widgets.Button(description='Copy')
out = widgets.Output()
def on_button_clicked(_):
    # show the command in the output area and copy it to the clipboard
    with out:
        clear_output()
        print(load_csv_output)
        pyperclip.copy(load_csv_output)
# linking button and function together using a button's method
button.on_click(on_button_clicked)
# displaying button and its output together
widgets.VBox([button,out])

VBox(children=(Button(description='Copy', style=ButtonStyle()), Output()))

### Click below to copy the command and upload your IMAGES into the cloud in the specified path

In [106]:
#create the command
images = "s3://imaging-platform-ssf/projects/"+ project_name + "/" + batch_id + "/images/" + plate + "/images/"
images_output = 'aws s3 sync "' + input_folder + '" "' + images + '" --exclude "*" --include="*.tif"'
#button to copy the command
button = widgets.Button(description='Copy')
out = widgets.Output()
def on_button_clicked(_):
    # show the command in the output area and copy it to the clipboard
    with out:
        clear_output()
        print(images_output)
        pyperclip.copy(images_output)
# linking button and function together using a button's method
button.on_click(on_button_clicked)
# displaying button and its output together
widgets.VBox([button,out])

VBox(children=(Button(description='Copy', style=ButtonStyle()), Output()))
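
For readers who prefer to build the commands without the widgets, the two cells above can be approximated by one small helper. This is a sketch under assumptions: `build_sync_command` and the local path `C:/data/Plate 1` are hypothetical, the bucket prefix is the one used in the notebook, and the filters are written in the space-separated `--exclude "*" --include "<pattern>"` form.

```python
def build_sync_command(source, destination, pattern):
    """Build an `aws s3 sync` command that uploads only files matching pattern.

    Hypothetical helper, for illustration only. Paths are quoted so that
    folders containing spaces survive pasting into a terminal.
    """
    return (
        'aws s3 sync "' + source + '" "' + destination + '"'
        + ' --exclude "*" --include "' + pattern + '"'
    )


bucket = "s3://imaging-platform-ssf/projects/"
project_name = "Training_NanoCell"
batch_id = "2021_10_08_AgNPViability"
plate = "211015_065907_Plate_1"

images_dest = bucket + project_name + "/" + batch_id + "/images/" + plate + "/images/"
csv_dest = bucket + project_name + "/workspace/load_data_csv/" + batch_id + "/" + plate + "/"

print(build_sync_command("C:/data/Plate 1", images_dest, "*.tif"))
print(build_sync_command("C:/data/Plate 1/load_data_csv", csv_dest, "*.csv"))
```

Excluding everything first and then including one pattern means only the matching files are uploaded, which keeps stray files in the input folder out of the bucket.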

# Troubleshooting

## Check for disparity in channels numbers

- If you encounter an error when generating the CSVs, it may be caused by a disparity in the number of images per channel: for some reason (e.g., a microscope error when saving images), one of the wells has one image more or fewer in one of the channels, and pandas cannot assign columns of different lengths to the same DataFrame.

- Run the code below to print the number of images in each channel. If one channel has more images than another, the counts will differ.

- Where indicated in the code, change the name to the channel with the disparity (e.g., DAPI); the code then prints each well name with its number of images. There you will find which well has one image more or fewer than the others.

- From there you can investigate what went wrong.
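
As an alternative view of the same check, the sketch below counts images per channel and per (channel, well) pair in one pass, so no channel name needs to be edited by hand. It is an illustration only: `count_images` and the example filenames are hypothetical, and the regex is the notebook's pattern with the dot escaped.

```python
import re
from collections import Counter

# Notebook's filename pattern, with the "." before "tif" escaped
FILENAME_REGEX = r"^(?P<Well>.*)_.*_.*_(?P<Site>.*)_(?P<Channel>.*)_001\.tif"


def count_images(filenames):
    """Count images per channel and per (channel, well) pair.

    Hypothetical helper, for illustration only.
    """
    per_channel = Counter()
    per_well = Counter()
    for name in filenames:
        match = re.search(FILENAME_REGEX, name)
        if match:
            per_channel[match.group("Channel")] += 1
            per_well[(match.group("Channel"), match.group("Well"))] += 1
    return per_channel, per_well


# Hypothetical listing in which well B11 is missing its GFP image
example_files = [
    "B10_02_1_1_DAPI_001.tif",
    "B10_02_2_1_GFP_001.tif",
    "B11_02_1_1_DAPI_001.tif",
]
per_channel, per_well = count_images(example_files)
print(dict(per_channel))
# {'DAPI': 2, 'GFP': 1}
print(dict(per_well))
# {('DAPI', 'B10'): 1, ('GFP', 'B10'): 1, ('DAPI', 'B11'): 1}
```

A well that appears under one channel but not another, or with a different count, is the one to inspect.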

In [12]:
wells = []
df = pd.DataFrame()
#plate name
regex_plate = r".*[\\/](?P<Assay>.*)[\\/](?P<Plate>.*)$"
plate_search = re.search(regex_plate, input_folder)
platefind = plate_search.group('Plate')
plate = platefind.replace(" ", "_")
print(plate)
images_dir = "/home/ubuntu/bucket/projects/" + project_name + "/" + batch_id + "/images/" + plate + "/images"
illum_dir = "/home/ubuntu/bucket/projects/" + project_name + "/" + batch_id + "/illum/" + plate
#filesname and channel
files = []
for (dirpath, dirnames, filenames) in walk(input_folder):
    files.extend(filenames)
    break
#find channels
channels = []
regex = r"^(?P<Well>.*)_.*_.*_(?P<Site>.*)_(?P<Channel>.*)_001\.tif"
for f in files:
    matches = re.search(regex, f)
    if matches:
        channels.append(matches.group('Channel'))
channels = np.array(channels)
ch_unique = np.unique(channels)
#create cols with files and pathnames
temp_list = []
illum_list = []
for ch in ch_unique:
    temp_list = []
    for file in files:
        if ch in file and 'tif' in file:
            temp_list.append(file)
    temp = []
    for i in temp_list:
        temp.append(i)
    print(len(temp), ch)
    # CHANGE HERE
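    if ch == 'DAPI': #CHANGE HERE FOR THE CHANNEL WITH THE DISPARITY NUMBER
        # use a separate loop variable so the `files` list is not overwritten,
        # otherwise the counts for the remaining channels would be wrong
        for fname in temp:
            matches = re.search(regex, fname)
            wells.append(matches.group('Well'))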
my_dict = {i:wells.count(i) for i in wells}
print(my_dict)

211015_065907_Plate_1
160 GFP
160 Propidium Iodide
{}
