# Processing Nanodrop Data, No Python Knowledge Needed

This notebook will walk you through using the nanodrop module to simplify your data. Empty columns will be deleted and the data will be converted to a [tidy format](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf). In this process, the notebook will attempt to extract information from the sample names provided at the nanodrop (visible in the "Sample ID" column of the output). Once the data is reformatted, it can be saved as a new file, and basic plots can be constructed from it.

* [Setup](#setup)
* [Using this notebook](#using_the_notebook)
* [Wrangling your data](#wrangling_your_data)
* [Saving your data](#saving_your_data)
* [Plotting your data](#plotting_your_data)
* [Saving your plot](#saving_your_plot)
* [Troubleshooting](#troubleshooting)

# Setup <a class="anchor" id="setup"></a>

## Necessary installations
In order to use this notebook, you'll need these installed on your computer:
- [Python 3](https://www.python.org/downloads/)
- Jupyter
 - option 1: [install Anaconda](https://www.anaconda.com/products/individual) and launch JupyterLab from Anaconda Navigator
 - option 2: on your command line, run `pip install jupyter`
- numpy and pandas: on your command line, run `pip install numpy pandas`
- wrangling: on your command line, run `pip install git+https://github.com/ebentley17/Deniz_lab_code/`

## Optional installations
These are only necessary for the plotting functions of the notebook.
- bokeh: on your command line, run `pip install bokeh`
- selenium (only necessary to save plots): on your command line, run `pip install selenium`


# Using this notebook <a class="anchor" id="using_the_notebook"></a>
Jupyter notebooks are composed of cells, which are either text (like this one) or code. This notebook is designed to be usable without any knowledge of Python. The notebook is editable; feel free to make notes to yourself, experiment with adjusting the code, or add new cells to try something new. You can always re-download the unedited notebook.

You can navigate between cells with the up and down arrows or by clicking. Run one cell at a time by selecting the cell and pressing `ctrl+enter`, or run all cells by selecting Run -> Run All Cells. You will be prompted to enter specific information after certain cells. A few tips:
- You can rerun a cell to change your input. 
- Do not rerun a cell while it is still requesting input. Type "quit" to stop program execution instead. ([Forgot? Troubleshooting this issue](#nothing_happens))
- Rerunning a cell will not cause another cell to update unless that cell is also rerun. Feel free to change your input in earlier cells and rerun later cells to see how they change. 
- You can rerun cells in any order, but skipping a cell without running it at all may cause problems later.

#### Run this cell first:

In [None]:
import re
import warnings
import numpy as np
import pandas as pd

from wrangling import nanodrop
from wrangling.tutorials import handle_input

print("Required imports successful.")

try:
    import bokeh.plotting
    import bokeh.io
    
    from wrangling.bokeh_scatter import scatter, scatter_palette
    
    bokeh.io.output_notebook()
    
except ImportError as e:
    print("Bokeh could not be loaded. Plotting is not available, but other notebook functions are."
          + f"\nError message: {e}")

# Wrangling your data <a class="anchor" id="wrangling_your_data"></a>

## Select your files
Enter the filepath of the file or files you want to import. You can specify a single file or choose multiple files using the `*` wildcard character. For example, the filepath `C:/Users/Deniz/data/*.tsv` will include all files ending in `.tsv` in the `Users/Deniz/data` folder (but not in any subfolders). You can use a full filepath (on Windows, usually starts with `C:/`; on Macs, usually starts with `/Users/`) or a relative filepath. If your file is in the same folder as this notebook, you can just enter the file name. It is recommended to name your files and folders without spaces.
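
To illustrate how the `*` wildcard matches files, here is a small sketch using Python's standard `glob` module on a temporary folder. The file names are made up for the example, and this is not necessarily how `handle_input` resolves your filepath internally.

```python
import glob
import os
import tempfile

# Build a throwaway folder with hypothetical file names to demonstrate matching.
with tempfile.TemporaryDirectory() as folder:
    os.mkdir(os.path.join(folder, "subfolder"))
    for name in ["run1.tsv", "run2.tsv", "notes.txt", os.path.join("subfolder", "run3.tsv")]:
        open(os.path.join(folder, name), "w").close()

    # "*" matches any characters within one folder level, so the file in
    # "subfolder" and the .txt file are both left out.
    matches = sorted(glob.glob(os.path.join(folder, "*.tsv")))
    matched_names = [os.path.basename(path) for path in matches]

print(matched_names)
```

Only `run1.tsv` and `run2.tsv` are matched; the file inside `subfolder` is skipped because the wildcard does not descend into subfolders.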

The notebook will attempt to automatically detect the file type you are using, or ask you for the file type if it fails. Currently, only csv (comma-separated values) and tsv (tab-separated values) files are supported. 

In [None]:
file_list, file_ext = handle_input.interpret(
    "Type the filepath here:",
    handle_input.validate_file_input
)
    
print("\nYou have selected these files:")
for file_name in file_list:
    # Show only the file name: skip past the last path separator ("/" or "\").
    # If there is no separator, rfind returns -1 and the whole name is printed.
    slash_index = max(file_name.rfind("/"), file_name.rfind("\\"))
    print(file_name[slash_index + 1:])
    
if file_ext in ["tsv", "csv"]:
    print(f"\nAutomatically detected file type: {file_ext}")
else:
    file_ext = handle_input.interpret(
        "\nWhat file type are you using? Choose csv or tsv:",
        handle_input.check_membership,
        list_to_check=["tsv", "csv"]
    )

if "tsv" == file_ext:
    file_reader_kwargs=dict(sep="\t")
elif "csv" == file_ext:
    file_reader_kwargs=dict()

The next cell will display a preview of the first file you selected:

In [None]:
pd.read_csv(file_list[0], **file_reader_kwargs).head(3)

## Describe your data
The sample names you enter at the nanodrop contain important data. For this to work, your names should have a consistent format. For example, I always name my samples in the format: 

\[Peptide\]_\[Peptide Concentration (uM)\]_\[RNA/Peptide Ratio\]

At the nanodrop, I might type: `Peptide1_150_0.5`

I would fill out this section as follows:

    How many pieces of data are in your sample names? 3
    What separator is used in your sample names? _
    Name of data in position 1: Peptide
    Name of data in position 2: Peptide Concentration (uM)
    Name of data in position 3: RNA/Peptide Ratio
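
As a rough picture of what this description is used for, the sketch below splits the example name on its separator and pairs each piece with a column name. It uses plain Python only; the actual logic inside `nanodrop.ParseKey` may differ.

```python
# Inputs matching the worked example above.
sample_id = "Peptide1_150_0.5"
separator = "_"
column_names = ["Peptide", "Peptide Concentration (uM)", "RNA/Peptide Ratio"]

pieces = sample_id.split(separator)

# A name only fits the convention if it has exactly one piece per column.
if len(pieces) == len(column_names):
    parsed = dict(zip(column_names, pieces))
else:
    parsed = None

print(parsed)
```

Note that every piece is still text at this point (for example `"150"`, not the number 150).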

Describe your sample names:

In [None]:
args, kwargs = handle_input.request_parsekey_specifications()
MyKey = nanodrop.ParseKey(*args, **kwargs)

print(f"""\nTo confirm, your sample names take the form: [{f']{MyKey.separator}['.join(MyKey.column_names)}]
If this is incorrect, please run the cell again.""")

## Decide how to handle unusual cases
Your data may include blank/buffer samples, or samples that are incorrectly named by the convention you defined above. Should these samples be dropped from the dataset?

If you keep these samples, they will appear in your dataframe, but no information will be extracted from their names. 
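
The sketch below shows one way these two choices could play out on a handful of made-up sample names. The rule used to recognise buffers and misnamed samples here is an assumption for illustration; `nanodrop.tidy_data` may apply different criteria.

```python
# Hypothetical sample names: one correct, one buffer, one missing a piece, one blank.
sample_ids = ["Peptide1_150_0.5", "buffer", "Peptide1_150", "Blank"]
separator = "_"
expected_pieces = 3


def classify(sample_id):
    """Label a name as 'buffer', 'ok', or 'incorrectly named' (illustrative rule)."""
    if sample_id.strip().lower() in ("buffer", "blank"):
        return "buffer"
    if len(sample_id.split(separator)) == expected_pieces:
        return "ok"
    return "incorrectly named"


labels = {sample_id: classify(sample_id) for sample_id in sample_ids}

# Example answers: drop buffers, keep incorrectly named samples.
drop_buffers = True
drop_incorrectly_named_samples = False

kept = [
    s for s in sample_ids
    if not (drop_buffers and labels[s] == "buffer")
    and not (drop_incorrectly_named_samples and labels[s] == "incorrectly named")
]
print(kept)
```

With these answers, the buffer and blank are removed, while `Peptide1_150` stays in the table with nothing extracted from its name.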

In [None]:
drop_buffers = handle_input.interpret(
    "Should samples labeled as buffer or blank be dropped?",
    handle_input.yes_no_to_bool,
)

drop_incorrectly_named_samples = handle_input.interpret(
    "Should incorrectly named samples be dropped?",
    handle_input.yes_no_to_bool,
)

## Reformat the data
This may take several seconds, especially if you have a lot of files. A preview of the output will display.

In [None]:
data = nanodrop.tidy_data(
    file_list, 
    file_reader_kwargs=file_reader_kwargs,
    ParseKey=MyKey,
    drop_incorrectly_named_samples=drop_incorrectly_named_samples,
    drop_buffers=drop_buffers
)

data
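
For intuition about what "tidy" means, the sketch below reshapes a small made-up table so that each row holds a single measurement, using `pandas.DataFrame.melt`. The column names and values are hypothetical, and the exact columns produced by `nanodrop.tidy_data` may look different.

```python
import pandas as pd

# A hypothetical "wide" table: one row per sample, one column per measurement.
wide = pd.DataFrame({
    "Sample ID": ["Peptide1_150_0.5", "Peptide1_150_1"],
    "A260": [1.25, 2.40],
    "A280": [0.61, 1.18],
})

# Tidy form: one row per (sample, measurement) pair.
tidy = wide.melt(id_vars="Sample ID", var_name="Measurement", value_name="Value")
print(tidy)
```

Two samples with two measurements each become four rows, which is the layout most plotting and grouping tools expect.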

# Saving your data <a class="anchor" id="saving_your_data"></a>

When you are satisfied with the data format shown above, you can save it to a new csv file.

In [None]:
new_file_name = handle_input.interpret("What would you like to name the file?")
if new_file_name[-4:] != ".csv":
    new_file_name += ".csv"

data.to_csv(new_file_name, index=False, float_format="%.3f")

# Plotting your data <a class="anchor" id="plotting_your_data"></a>
Some plug-and-go plots are provided here. 

If a legend is provided, you can click items to make them appear or disappear from the plot.

### Scatter plot

In [None]:
specs = handle_input.request_plot_specifications(data)
scatter_plot = scatter(height=400, **specs)

if "cat" in specs.keys():
    legend_location = handle_input.interpret(
        "Where should the legend be (or leave blank for default)?",
        handle_input.check_membership,
        f"Input must be one of {handle_input.clean_legend_locations}, or leave blank.",
        list_to_check=handle_input.allowed_legend_locations,
    ).replace(" ", "_")

    if legend_location not in ["", "None"]:
        scatter_plot.legend.location = legend_location

bokeh.io.show(scatter_plot)

# Saving your plot <a class="anchor" id="saving_your_plot"></a>

Several filetypes are available to save your plot.
* PNG (standard image type)
* SVG (scalable, good for publications)
* HTML (interactive, opens in a web browser)

In [None]:
file_format = handle_input.interpret(
    "What format would you like to save your plot in?",
    handle_input.check_membership,
    list_to_check=["png", "svg", "html"]
)

file_name = handle_input.interpret("Name your file:")

# Add the extension unless the name already ends with it (for example ".png").
if not file_name.endswith(f".{file_format}"):
    file_name += f".{file_format}"

p = scatter_plot
    
if file_format == "png":
    bokeh.io.export_png(p, filename=file_name)
elif file_format == "svg":
    p.output_backend = 'svg'
    bokeh.io.export_svgs(p, filename=file_name)
    p.output_backend = 'canvas'
elif file_format == "html":
    with warnings.catch_warnings():
        warnings.simplefilter(action='ignore', category=UserWarning)
        bokeh.io.save(p, filename=file_name, title="Bokeh Plot")

# Troubleshooting <a class="anchor" id="troubleshooting"></a>
If you got an error message in a red box, check the last line for a succinct description of the error. Read on for more specific queries.

### I got a NameError
You probably skipped a code cell. Note that if you restart the kernel, all variables are erased - even if the output is still being displayed - and you will need to rerun all the cells.

### The notebook asks me the same question over and over
When you answer a prompt and hit `Enter`, you don't advance to the next cell. If you then hit `Shift+Enter`, you'll run the same cell again. Just click on the next cell or hit the down arrow to move forward.

### Nothing happens when I run a cell <a class="anchor" id="nothing_happens"></a>
Another cell is probably still running. To the left of each expanded code cell, you'll see this: `[ ]:` If there's a `*` in those brackets, it means the cell hasn't finished running. 

A cell won't finish running until you answer all the prompts below it. (You can always type "quit" to stop program execution.) If you rerun a cell before it finishes executing, it will never finish running and you will have to restart the kernel.

If you can't find a cell with unanswered prompts, select Kernel -> Restart Kernel. Note that this will erase the notebook's internal memory (even if the output is still displayed) and you will have to start from the first cell again.

### The reformatted table is empty, or has fewer rows than I expected
You probably described your data incorrectly and told the notebook to **drop** incorrectly named samples. The most common problem is incorrectly entering the separator or number of data chunks. Try re-entering your data description. Otherwise, your naming scheme may be inconsistent.

### The reformatted table has a lot of empty columns
You may have described your data incorrectly and told the notebook to **keep** incorrectly named samples, or you may have kept a large number of buffer measurements. If the program is unable to parse your sample names, it cannot fill columns. The most common problem is incorrectly entering the separator or number of data chunks. Try re-entering your data description. Otherwise, your naming scheme may be inconsistent.
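
If you are comfortable adding a new code cell, a quick check like the sketch below can show whether your sample names split into the number of pieces you expect. The names here are made up; in practice you would look at the "Sample ID" column of your own file.

```python
from collections import Counter

# Hypothetical sample names, including one that uses "-" instead of "_".
sample_ids = ["Peptide1_150_0.5", "Peptide1_150_1", "Peptide2-150-0.5", "buffer"]
separator = "_"

# Count how many names split into 1 piece, 2 pieces, 3 pieces, and so on.
piece_counts = Counter(len(name.split(separator)) for name in sample_ids)

for n_pieces, n_names in sorted(piece_counts.items()):
    print(f"{n_names} sample name(s) split into {n_pieces} piece(s)")
```

If most names do not split into the number of pieces you entered, the separator or the piece count in your description is the likely culprit.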

## Found a bug that isn't mentioned here?
Send me an email at ebentley@scripps.edu.