### STEP : LEfSe Analysis



LEfSe (Linear discriminant analysis Effect Size) determines the features (organisms, clades, operational taxonomic units, genes, or functions) most likely to explain differences between classes by coupling standard tests for statistical significance with additional tests encoding biological consistency and effect relevance.

- https://huttenhower.sph.harvard.edu/lefse/
- https://github.com/statonlab/BiGG2020_CrackNAg/wiki/qiime2-to-lefse
- https://github.com/biobakery/biobakery/wiki/lefse#2-lefse--conda-docker-vm-

**Note: this notebook runs LEfSe through Docker (the `biobakery/lefse` image).**

## Setup and settings

In [1]:
import os
import pandas as pd
from qiime2 import Artifact
from qiime2 import Visualization
from qiime2 import Metadata

from qiime2.plugins.feature_table.methods import filter_samples
from qiime2.plugins.feature_table.methods import relative_frequency
from qiime2.plugins.taxa.methods import collapse

import biom
import re

%matplotlib inline

### Receiving the parameters

The following cell can receive parameters using the [papermill](https://papermill.readthedocs.io/en/latest/) tool.

In [2]:
metadata_file = ''
base_dir = ''
experiment_name = ''
class_col = 'group-id'
replace_files = False

In [3]:
# Parameters
base_dir = "/mnt/nupeb/rede-micro/redemicro-ana-flavia-nutri"
class_col = "group-id"
classifier_file = "/mnt/nupeb/rede-micro/datasets/16S_classifiers_qiime2/silva-138-99-nb-classifier.qza"
experiment_name = "ana-flavia-NCxSTD-NC-trim"
manifest_file = "/mnt/nupeb/rede-micro/redemicro-ana-flavia-nutri/data/raw/manifest/manifest-ana-flavia-NCxSTD-NC.csv"
metadata_file = "/mnt/nupeb/rede-micro/redemicro-ana-flavia-nutri/data/raw/metadata/metadata-ana-flavia-NCxSTD-NC.tsv"
overlap = 12
phred = 20
replace_files = False
threads = 6
top_n = 20
trim = {"forward_primer": "CCTACGGGRSGCAGCAG", "overlap": 8, "reverse_primer": "GGACTACHVGGGTWTCTAAT"}
trunc_f = 0
trunc_r = 0
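
The parameter cell above is the kind of cell papermill writes when a notebook is executed with injected values. A minimal sketch of how such a run could be launched is shown below. The notebook file names and the example paths are hypothetical, and `build_parameters` and `run_notebook` are illustrative helpers, not part of this project; only `papermill.execute_notebook` is the library's own call.

```python
def build_parameters(base_dir, experiment_name, metadata_file,
                     class_col="group-id", replace_files=False):
    """Collect the values papermill injects into the notebook's parameters cell."""
    return {
        "base_dir": base_dir,
        "experiment_name": experiment_name,
        "metadata_file": metadata_file,
        "class_col": class_col,
        "replace_files": replace_files,
    }


def run_notebook(input_nb, output_nb, parameters):
    """Execute a notebook with injected parameters (requires papermill)."""
    # Imported lazily so build_parameters can be used without papermill installed
    import papermill as pm
    return pm.execute_notebook(input_nb, output_nb, parameters=parameters)


params = build_parameters(
    base_dir="/path/to/project",            # hypothetical path
    experiment_name="my-experiment",        # hypothetical name
    metadata_file="/path/to/metadata.tsv",  # hypothetical path
)

# Hypothetical notebook names; uncomment to execute for real:
# run_notebook("step-lefse-analysis.ipynb", "step-lefse-analysis-out.ipynb", params)
```

Values not passed explicitly (here `class_col` and `replace_files`) keep the same defaults as the notebook's own default cell.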


In [4]:
experiment_folder = os.path.abspath(os.path.join(base_dir, 'experiments', experiment_name))
img_folder = os.path.abspath(os.path.join(experiment_folder, 'imgs'))
lefse_folder = os.path.join(experiment_folder, 'lefse')

In [5]:
# Create the LEfSe folder if it does not exist
!mkdir -p {lefse_folder}
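
The `!mkdir -p` shell magic only works inside IPython. If the same step had to run as plain Python, a rough equivalent could use `os.makedirs`; the helper name `ensure_folder` and the temporary demo path are assumptions made for illustration.

```python
import os
import tempfile


def ensure_folder(path):
    """Create `path` and any missing parents; do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrate on a throwaway directory instead of the real experiment folder
with tempfile.TemporaryDirectory() as tmp:
    demo_lefse_folder = ensure_folder(os.path.join(tmp, "experiments", "demo", "lefse"))
    created = os.path.isdir(demo_lefse_folder)
    # A second call is a no-op, like `mkdir -p`
    ensure_folder(demo_lefse_folder)
```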

### Defining names, paths and flags

In [6]:
# QIIME2 Artifacts folder
qiime_folder = os.path.join(experiment_folder, 'qiime-artifacts')

# Input - DADA2 Artifacts
dada2_tabs_path = os.path.join(qiime_folder, 'dada2-tabs.qza')

# Input - Taxonomy
taxonomy_path = os.path.join(qiime_folder, 'metatax.qza')

## Step execution

### Load input files

This step loads the QIIME 2 `FeatureTable[Frequency]` artifact, the `Metadata` file, and the taxonomy artifact, and keeps only the samples listed in the metadata.

In [7]:
#Load Metadata
metadata_qa = Metadata.load(metadata_file)

#Load FeatureTable[Frequency]
tabs = Artifact.load(dada2_tabs_path)

# Filter FeatureTable[Frequency | RelativeFrequency | PresenceAbsence | Composition] based on Metadata sample ID values
tabs = filter_samples(
    table=tabs,
    metadata=metadata_qa,
).filtered_table

# Load Taxonomy
taxonomy = Artifact.load(taxonomy_path)

# Collapse and calculate relative frequency

## Define functions

In [8]:
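def process_biom_file(relative_frequency_tab, metadata_tab, class_id, out_csv):
    """Write a relative-frequency table as a LEfSe input TSV with class and subject header rows."""
    # Create DataFrames (features as rows, samples as columns)
    relative_frequency_df = relative_frequency_tab.view(pd.DataFrame).T
    metadata_df = metadata_tab.to_dataframe()
    
    # Process IDs: drop empty levels ('__'), strip rank prefixes such as 'd__', join with '|'.
    # A regex is used instead of slicing so labels without a prefix (e.g. 'Unassigned') stay intact.
    idx = relative_frequency_df.index
    new_idx = ['|'.join([re.sub(r'^[a-z]__', '', y) for y in x.split(';') if len(y) > 2]) for x in idx]
    
    # Process headers: look up each sample's class by sample ID so both header rows stay aligned
    sample_header = list(relative_frequency_df.columns)
    group_header = list(metadata_df.loc[sample_header, class_id].values)
    headers = pd.MultiIndex.from_arrays([group_header, sample_header], names=[class_id, 'subject_id'])
    
    # Create new DataFrame
    new_relative_frequency_df = relative_frequency_df.copy()
    new_relative_frequency_df.columns = headers
    new_relative_frequency_df.index = new_idx
    new_relative_frequency_df.to_csv(out_csv, sep='\t')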
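
To make the ID processing in `process_biom_file` concrete, the sketch below converts a few collapsed taxonomy labels into the `|`-separated names LEfSe expects. The labels are illustrative examples, not values taken from this dataset, and `to_lefse_name` is a hypothetical helper. It strips the rank prefix with a regular expression, so labels that carry no prefix (such as `Unassigned`) are left intact.

```python
import re


def to_lefse_name(taxon):
    """Convert 'd__A;p__B;__' into 'A|B' (drop empty levels and rank prefixes)."""
    levels = [lvl.strip() for lvl in taxon.split(";")]
    # Levels of two characters or fewer ('__' or '') carry no taxon name
    kept = [re.sub(r"^[a-z]__", "", lvl) for lvl in levels if len(lvl) > 2]
    return "|".join(kept)


examples = [
    "d__Bacteria;p__Firmicutes;c__Bacilli",  # fully annotated to level 3
    "d__Bacteria;p__Firmicutes;__",          # level 3 unannotated
    "Unassigned;__;__",                      # no rank prefix at all
]
converted = [to_lefse_name(t) for t in examples]
# converted == ['Bacteria|Firmicutes|Bacilli', 'Bacteria|Firmicutes', 'Unassigned']
```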

In [9]:
def process_res(res_file):
    """Rewrite the feature names of a LEfSe .res file as '|'-separated lineages."""
    df = pd.read_csv(res_file, sep='\t', index_col=0, header=None)
    # Turn rank prefixes such as 'p__' into '|' separators
    new_idx = [re.sub(r"[a-z]__", "|", x) for x in df.index]
    # Drop a leading separator; names without one are kept so the index length does not change
    new_idx = [x[1:] if x.startswith('|') else x for x in new_idx]
    # Collapse doubled separators
    new_idx = [re.sub(r"\|\|", "|", x) for x in new_idx]
    df.index = new_idx
    # Write the renamed index back together with the data, tab-separated like the input
    df.to_csv(res_file, sep='\t', na_rep='-', header=False)

In [10]:
def process_lefse(raw_csv, tax_lvl, _format='pdf'):
    # Get file name without extension
    base_name = os.path.splitext(os.path.basename(raw_csv))[0]
    in_file = os.path.join(lefse_folder, f'{base_name}.in')
    res_file = os.path.join(lefse_folder, f'{base_name}.res')
    lefse_figs = os.path.join(lefse_folder, 'lefse_plots')
    !mkdir -p {lefse_figs}
    fig_path = os.path.join(lefse_figs, f'{base_name}_metabar.{_format}')
    clad_path = os.path.join(lefse_figs, f'{base_name}_cladogram.{_format}')
    
    # The host root (/) is mounted at /data, which is also the container's working directory,
    # so every absolute host path is passed without its leading slash (path[1:]).
    
    # Prepare the LEfSe input file (class in row 1, subject in row 2, normalization value 1000000)
    !docker run --rm --workdir /data -v /:/data biobakery/lefse format_input.py {raw_csv[1:]} {in_file[1:]} -c 1 -u 2 -o 1000000
    
    # Execute LEfSe
    !docker run --rm --workdir /data -v /:/data biobakery/lefse run_lefse.py  {in_file[1:]} {res_file[1:]}
    
    # Plot figure
    !docker run --rm --workdir /data -v /:/data biobakery/lefse plot_res.py {res_file[1:]} {fig_path[1:]} --format {_format} --max_feature_len 256
    
    # Plot cladogram
    !docker run --rm --workdir /data -v /:/data biobakery/lefse plot_cladogram.py {res_file[1:]} {clad_path[1:]} --format {_format} --colored_labels 1
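
The four Docker calls in `process_lefse` rely on IPython's `!` magic and on the `[1:]` slice that turns an absolute host path into a path relative to `/data`. As a sketch of the same pipeline in plain Python, the commands could be assembled as argument lists and handed to `subprocess.run`. The helper names `container_path` and `build_lefse_commands` and the demo paths are assumptions; the scripts and flags are the ones used in the cell above.

```python
import shlex

# Same invocation prefix as the notebook: host / mounted at /data, which is also the workdir
DOCKER_PREFIX = ["docker", "run", "--rm", "--workdir", "/data",
                 "-v", "/:/data", "biobakery/lefse"]


def container_path(host_path):
    """Map an absolute host path to a path relative to the container's /data workdir."""
    if not host_path.startswith("/"):
        raise ValueError(f"expected an absolute path, got {host_path!r}")
    return host_path[1:]


def build_lefse_commands(raw_csv, in_file, res_file, fig_path, clad_path, fmt="pdf"):
    """Return the four LEfSe commands as argument lists, in execution order."""
    raw, inp, res, fig, clad = (
        container_path(p) for p in (raw_csv, in_file, res_file, fig_path, clad_path)
    )
    return [
        DOCKER_PREFIX + ["format_input.py", raw, inp, "-c", "1", "-u", "2", "-o", "1000000"],
        DOCKER_PREFIX + ["run_lefse.py", inp, res],
        DOCKER_PREFIX + ["plot_res.py", res, fig, "--format", fmt, "--max_feature_len", "256"],
        DOCKER_PREFIX + ["plot_cladogram.py", res, clad, "--format", fmt, "--colored_labels", "1"],
    ]


# Hypothetical demo paths
commands = build_lefse_commands(
    "/tmp/lefse/demo.tsv", "/tmp/lefse/demo.in", "/tmp/lefse/demo.res",
    "/tmp/lefse/lefse_plots/demo_metabar.pdf", "/tmp/lefse/lefse_plots/demo_cladogram.pdf",
)
printable = [shlex.join(cmd) for cmd in commands]

# To actually run them (requires Docker and the biobakery/lefse image):
# import subprocess
# for cmd in commands:
#     subprocess.run(cmd, check=True)
```

Passing argument lists avoids shell quoting problems with unusual file names, and `check=True` would stop the pipeline at the first failing step instead of continuing to the plots.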

In [11]:
def process_tax_level(tax_lvl, tax_tab, abs_tab, metadata_tab, class_id):
    
    # Collapse the table to the tax_lvl level (use the arguments, not the notebook globals)
    collapsed_table = collapse(
        table=abs_tab,
        taxonomy=tax_tab,
        level=tax_lvl
    ).collapsed_table
    
    # Calculate the relative frequency
    relative_frequency_tab = relative_frequency(
        table=collapsed_table,
    ).relative_frequency_table
    
    # Persist qza file
    relative_frequency_path = os.path.join(qiime_folder, f'collapsed_{tax_lvl}_relative_frequency_table.qza')
    relative_frequency_tab.save(filepath=relative_frequency_path)
    
    # Create a new table with metaheader
    out_csv = os.path.join(lefse_folder, f'collapsed_{tax_lvl}_relative_frequency_table_with_metaheader.tsv')
    process_biom_file(relative_frequency_tab, metadata_tab, class_id, out_csv)
    process_lefse(out_csv, tax_lvl)
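
`collapse` takes a numeric depth, so `tax_lvl` is not self-describing in file names or log lines. As a reading aid, the sketch below maps levels 1 to 7 to rank names. It assumes the seven-rank SILVA layout (prefixes `d`, `p`, `c`, `o`, `f`, `g`, `s`) of the classifier named in the parameters; `RANK_NAMES`, `describe_level`, and `lefse_table_name` are hypothetical helpers, not part of QIIME 2.

```python
# Assumed mapping for a seven-rank, SILVA-style taxonomy
RANK_NAMES = {
    1: "domain", 2: "phylum", 3: "class", 4: "order",
    5: "family", 6: "genus", 7: "species",
}


def describe_level(tax_lvl):
    """Return a readable rank name for a collapse level."""
    return RANK_NAMES.get(tax_lvl, f"level {tax_lvl}")


def lefse_table_name(tax_lvl):
    """File name pattern used for the LEfSe input table at a given level."""
    return f"collapsed_{tax_lvl}_relative_frequency_table_with_metaheader.tsv"


labels = [f"{lvl}: {describe_level(lvl)}" for lvl in range(1, 8)]
```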

## Perform LEfSe analysis

In [12]:
for tax_lvl in range(1,8):
    print(f'Processing level: {tax_lvl}')
    process_tax_level(tax_lvl, taxonomy, tabs, metadata_qa, class_col)

Processing level: 1
Number of significantly discriminative features: 0 ( 0 ) before internal wilcoxon
No features with significant differences between the two classes
Number of discriminative features with abs LDA score > 2.0 : 0
No differentially abundant features found in mnt/nupeb/rede-micro/redemicro-ana-flavia-nutri/experiments/ana-flavia-NCxSTD-NC-trim/lefse/collapsed_1_relative_frequency_table_with_metaheader.res

Processing level: 2
Number of significantly discriminative features: 5 ( 5 ) before internal wilcoxon
Number of discriminative features with abs LDA score > 2.0 : 5

Processing level: 3
Number of significantly discriminative features: 8 ( 8 ) before internal wilcoxon
Number of discriminative features with abs LDA score > 2.0 : 8
clade_sep parameter too large, lowered to 0.266967773438

Processing level: 4
Number of significantly discriminative features: 14 ( 14 ) before internal wilcoxon
Number of discriminative features with abs LDA score > 2.0 : 14

Processing level: 5
Number of significantly discriminative features: 25 ( 25 ) before internal wilcoxon
Number of discriminative features with abs LDA score > 2.0 : 25
clade_sep parameter too large, lowered to 0.266967773438

Processing level: 6
Number of significantly discriminative features: 44 ( 44 ) before internal wilcoxon
Number of discriminative features with abs LDA score > 2.0 : 44
clade_sep parameter too large, lowered to 0.266967773438

Processing level: 7
Number of significantly discriminative features: 64 ( 64 ) before internal wilcoxon
Number of discriminative features with abs LDA score > 2.0 : 64
clade_sep parameter too large, lowered to 0.200225830078
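
The per-level counts in the output above are easier to compare once collected in one place. Assuming the cell output has been captured as text, a small parser along these lines could build a level-to-count summary. The helper `summarize_lefse_log` is hypothetical, and the sample text reuses two levels from the log above.

```python
import re

LEVEL_RE = re.compile(r"^Processing level: (\d+)")
COUNT_RE = re.compile(r"abs LDA score > [\d.]+ : (\d+)")


def summarize_lefse_log(log_text):
    """Map each taxonomic level to its number of discriminative features."""
    counts = {}
    level = None
    for line in log_text.splitlines():
        level_match = LEVEL_RE.match(line.strip())
        if level_match:
            level = int(level_match.group(1))
            continue
        count_match = COUNT_RE.search(line)
        if count_match and level is not None:
            counts[level] = int(count_match.group(1))
    return counts


sample_log = """Processing level: 1
Number of discriminative features with abs LDA score > 2.0 : 0
Processing level: 2
Number of discriminative features with abs LDA score > 2.0 : 5
"""
summary = summarize_lefse_log(sample_log)
# summary == {1: 0, 2: 5}
```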
