## Load functions and necessary modules

In [3]:
from bb2022_functions import *
%matplotlib inline
from Bio.SeqIO.FastaIO import SimpleFastaParser
from Bio import SeqIO
pd.options.mode.chained_assignment = None  # default='warn'

## Import and format metadata

In [4]:
md = pd.read_csv("metadata_merged.csv")
merged = pd.read_csv("metadata_niskin.csv")
all_md = pd.read_csv("allmetadata.csv")

### Visualize metadata

In [None]:
maxvals = plot_nutrients(all_md, 'euphotic')

## Add microbial communities

In [5]:
# generate a dataframe from all tables of the specified amplicon
df, comm = consolidate_tables('16S') #frac='pooled') #16S, chloroplast, or 18S
merged = merge_metadata(df, all_md)
separated, contaminants = pick_metadata(comm, merged)
newseparated = make_defract(all_md, separated)

Community is 02-PROKs
Found all 16S tables.
Successfully saved all tables.
Set up metadata ...
Saved merged_asvs_metadata.tsv
Appended all taxonomies to taxos
Saved separated by metadata dataframe.
Community is 02-PROKs
Removed cyanobacteria and chloroplast from 02-PROKs


In [6]:
#apply changes to taxonomy according to NCBI identified ASVs
newdf = apply_replacement(newseparated, "feature_id", "Genus") 
# use "PRSpecies" instead of "Genus" when working with PhytoREF

Values were updated


To generate a MIMARKS file for NCBI sequence submission, run the cell below. The output is a .csv file listing the samples and their metadata (i.e., sample ID, size fraction, date).

In [None]:
make_MIMARKS(newseparated)

In [None]:
#plot of dna concentrations per sample
dnacon(newseparated, depth='all')

In [None]:
#rarefaction curves per community
rarefy_curve(comm, newseparated)

In [None]:
#plot sampling depth histogram to pick library depth for rarefying
plt.hist(newdf['Total'], bins=10, edgecolor='black')

# Add lines for quantiles
quantiles = [0.1, 0.2, 0.5]
colors = ['red', 'blue', 'green']
labels = ['10th Percentile', '20th Percentile', '50th Percentile (Median)']

for q, color, label in zip(quantiles, colors, labels):
    plt.axvline(x=newdf['Total'].quantile(q), color=color, linestyle='--', label=label)
plt.legend()

# Save and show the plot
plt.savefig('sampling_depths.png', dpi=300, bbox_inches='tight')
plt.show()

# Suggest a rarefaction depth
suggested_depth = int(newdf['Total'].quantile(0.5))
print(f"Suggested rarefaction depth: {suggested_depth}")
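
Once a depth is chosen, rarefying means subsampling every library to that many reads without replacement. The cell below is a minimal sketch of that step on toy counts, not the routine used by `bb2022_functions`; the helper name `rarefy_counts` and the toy values are assumptions made for illustration.

```python
import numpy as np

def rarefy_counts(counts, depth, seed=0):
    """Subsample a vector of ASV counts to `depth` reads without replacement.

    Hypothetical helper: returns None when the library is smaller than
    `depth`, mirroring the usual practice of dropping such samples.
    """
    counts = np.asarray(counts, dtype=np.int64)
    if counts.sum() < depth:
        return None
    rng = np.random.default_rng(seed)
    # Drawing `depth` reads without replacement from the pool of reads
    # is a multivariate hypergeometric draw over the ASVs.
    return rng.multivariate_hypergeometric(counts, depth)

sample = [500, 300, 150, 50]          # toy counts for four ASVs
rarefied = rarefy_counts(sample, depth=200)
print(rarefied, rarefied.sum())
```

Samples whose library size falls below the chosen depth are lost, which is why the histogram and quantile lines above are useful for weighing depth against the number of retained samples.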

We can also sort the samples by their library size:

In [None]:
newseparated[['Total','sampleid']].sort_values('Total').drop_duplicates()

#### Explore the taxonomy in the samples and compare

In [None]:
#Produce interactive taxonomic barplots with plotly
phyld, top10d = taxbarplot(comm, newseparated, 'Genus', 5, 10, 'size_code')

In [None]:
#Visualize the static barplots with seaborn, and each size fraction separately
taxonomic_barplots(comm, newdf, [5,60], 'Genus', 20)

In [None]:
#Generate the heatmap for the top genus from each sample
heatmap_top1(comm, newseparated, 'Genus')

The above plot uses taxonomy, but we can generate a similar plot at the feature level by checking whether 80% of the features in each sample are also found in the whole (unfractionated) sample. This is quantified by dividing the number of shared features by the total number of features. A red square (value closer to 1) means the sample is very similar to the unfractionated sample; the bluer the square, the more it differs from the unfractionated sample.

In [None]:
grab_80perc(comm, newseparated, 0.8, 'feature_id')
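
The shared-feature ratio described above can be sketched on a toy table. This is only an illustration of the arithmetic, not the internals of `grab_80perc`: the helper `shared_fraction` is hypothetical, `'W'` is assumed to be the size code of the whole (unfractionated) sample, and the denominator is assumed to be the number of features in the fraction being compared.

```python
import pandas as pd

# Toy long-format table: one row per (size fraction, feature) observation.
toy = pd.DataFrame({
    "size_code": ["W", "W", "W", "S", "S", "L", "L", "L", "L"],
    "feature_id": ["a", "b", "c", "a", "b", "a", "d", "e", "f"],
})

def shared_fraction(df, reference="W", feature_col="feature_id"):
    """Fraction of each size fraction's features also present in `reference`."""
    ref = set(df.loc[df["size_code"] == reference, feature_col])
    out = {}
    for code, grp in df.groupby("size_code"):
        feats = set(grp[feature_col])
        out[code] = len(feats & ref) / len(feats)
    return pd.Series(out)

fractions = shared_fraction(toy)
passes_80 = fractions >= 0.8   # True where at least 80% of features are shared
print(fractions)
print(passes_80)
```

In this toy case the `S` fraction shares both of its features with `W` (ratio 1.0), while `L` shares one of four (ratio 0.25) and would fall below the 0.8 cutoff.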

We can plot alpha diversity measurements, either as 'shannon_diversity' or as 'nASVs', the richness quantified as the total number of ASVs.

In [None]:
# plot alpha diversity and run pairwise t-tests between size fractions for the chosen metric
anova, results = boxplot_depth(newdf, comm, 'all', 'shannon_diversity', 'Shannon Diversity Index')
#results gives the corrected p-values for pairwise comparisons
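
For reference, both metrics can be computed directly from a vector of ASV counts. The sketch below assumes the natural logarithm for the Shannon index (the base used to fill `shannon_diversity` in the notebook's tables is not stated here), and the function names are illustrative.

```python
import numpy as np

def shannon_diversity(counts):
    """Shannon index H = -sum(p_i * ln p_i) over ASVs with non-zero counts."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]          # zero counts contribute nothing
    p = counts / counts.sum()            # relative abundances
    return float(-(p * np.log(p)).sum())

def richness(counts):
    """Richness as the number of ASVs observed at least once."""
    return int((np.asarray(counts) > 0).sum())

even = [10, 10, 10, 10]      # perfectly even toy community
skewed = [37, 1, 1, 1]       # same richness, dominated by one ASV
print(shannon_diversity(even), richness(even))
print(shannon_diversity(skewed), richness(skewed))
```

The two toy communities have the same richness, but the even one reaches the maximum Shannon value for four ASVs (ln 4), which is why the two metrics can rank samples differently.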

Compare the slopes of linear regressions of richness over time. Each value represents how far the slope of a size fraction (column) deviates from the average slope across all size fractions at each depth (row).

In [None]:
tohm, z_sc_df = get_slopes(comm, separated)
# a z-score of 1 = 1 standard deviation away from the mean
# positive values = higher than the mean, negative values = smaller than the mean
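
The slope comparison can be illustrated with a small, self-contained example. This is a sketch of the idea, not the code behind `get_slopes`: the column names `depth` and `week`, the toy richness values, and the use of the population standard deviation are assumptions.

```python
import numpy as np
import pandas as pd

# Toy richness time series for three size fractions at one depth.
toy = pd.DataFrame({
    "depth": [5] * 9,
    "size_code": ["S"] * 3 + ["L"] * 3 + ["W"] * 3,
    "week": [1, 2, 3] * 3,
    "nASVs": [10, 12, 14, 10, 11, 12, 10, 13, 16],
})

def slope_zscores(df, value="nASVs", time="week"):
    """Fit richness ~ time per (depth, size fraction), then z-score slopes per depth."""
    slopes = {}
    for (depth, code), grp in df.groupby(["depth", "size_code"]):
        # np.polyfit with degree 1 returns [slope, intercept]
        slopes[(depth, code)] = np.polyfit(grp[time], grp[value], 1)[0]
    slopes = pd.Series(slopes).unstack()            # rows: depth, columns: size fraction
    centered = slopes.sub(slopes.mean(axis=1), axis=0)
    z = centered.div(slopes.std(axis=1, ddof=0), axis=0)
    return slopes, z

slopes, z = slope_zscores(toy)
print(slopes)
print(z)
```

Here the `S` fraction has the average slope (z-score near 0), `W` gains richness faster than average (positive z-score), and `L` more slowly (negative z-score).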

### Beta diversity and ANCOM analysis

Optionally, we can run ANCOM after removing low-abundance features below a given threshold.

In [None]:
#only if we want to run ANCOM pairwise
news2 = newseparated[newseparated.size_code != 'L']
news2 = news2[news2.size_code != 'SL']

In [None]:
depths = [1,5,10,30,60]
for depth in depths:
    pca, pca_features, sfdclr, dm = pcaplot(newseparated, depth, comm, 'size_code', 'DFr', 'week')
    DAresults, DARejected_SC_taxonomy, prcentile = run_ancom(comm, newseparated, sfdclr, depth, 'size_code', threshold=0)

    #save outputs
    DAresults.to_csv('outputs/ANCOM/chloroplast/none/'+comm+'_D'+str(depth)+'_WSLSL.csv')
    DARejected_SC_taxonomy.to_csv('outputs/ANCOM/chloroplast/none/'+comm+'_D'+str(depth)+'_Trueonly_WSLSL.csv')

    notify()
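
How `run_ancom` applies its `threshold` argument is not shown in this notebook. As one possible interpretation, low-abundance filtering can be sketched as dropping features whose share of all reads falls below a minimum fraction; the helper `drop_low_abundance`, its `min_fraction` parameter, and the toy table are hypothetical.

```python
import pandas as pd

def drop_low_abundance(table, min_fraction=0.001):
    """Keep features (columns) whose total count is at least
    `min_fraction` of all reads in a samples-by-features table."""
    totals = table.sum(axis=0)
    keep = totals / totals.sum() >= min_fraction
    return table.loc[:, keep]

toy = pd.DataFrame(
    {"asv1": [500, 400], "asv2": [90, 8], "asv3": [1, 1]},
    index=["s1", "s2"],
)
filtered = drop_low_abundance(toy, min_fraction=0.01)
unfiltered = drop_low_abundance(toy, min_fraction=0)
print(filtered.columns.tolist())
```

With a minimum fraction of 0 every feature is kept, which corresponds to running the analysis without any abundance filter.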

Depending on the ANCOM results, we can investigate the temporal dynamics of single features.

In [None]:
f_id = '5a94578dd1d7cdd039a52f1c7079f874'
newdf.loc[newdf['feature_id'] == f_id, 'Taxon'].tolist()[0]

Visualize the time series of a single feature in each size fraction over the 16 weeks.

In [None]:
timeseries_fid(comm, newseparated, f_id, 'Oligotrichia', 10)

In [None]:
feature_id_summary = count_feature_id_presence_with_depth_and_W('outputs', comm)
top_asvs_summary = filter_top_asvs(feature_id_summary, method="top_W_sum", n=50)
plot_asv_heatmap(comm, feature_id_summary, file_filter="WSLSL")

### Export files to R analyses

To create a phyloseq object, you need an ASV table, a taxonomy file, and metadata.
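
A minimal sketch of how those three tables could be derived from a long-format dataframe is shown below. It runs on toy data rather than `newseparated`; the count column name `feature_frequency`, the `depth` column, and the helper `split_for_phyloseq` are assumptions, while `sampleid`, `feature_id`, `Taxon`, and `size_code` follow the column names used earlier in this notebook.

```python
import pandas as pd

# Toy long-format table: one row per (sample, feature) with its metadata.
toy = pd.DataFrame({
    "sampleid": ["s1", "s1", "s2", "s2"],
    "feature_id": ["a", "b", "a", "c"],
    "feature_frequency": [10, 5, 7, 3],      # hypothetical count column
    "Taxon": ["d__Bacteria; g__A", "d__Bacteria; g__B",
              "d__Bacteria; g__A", "d__Bacteria; g__C"],
    "size_code": ["W", "W", "S", "S"],
    "depth": [5, 5, 5, 5],
})

def split_for_phyloseq(df, count_col="feature_frequency"):
    """Split a long table into ASV counts, taxonomy, and sample metadata."""
    # Features as rows, samples as columns; absent features get a count of 0.
    asv_table = df.pivot_table(index="feature_id", columns="sampleid",
                               values=count_col, aggfunc="sum", fill_value=0)
    taxonomy = (df[["feature_id", "Taxon"]]
                .drop_duplicates()
                .set_index("feature_id"))
    metadata = (df.drop(columns=["feature_id", count_col, "Taxon"])
                .drop_duplicates()
                .set_index("sampleid"))
    return asv_table, taxonomy, metadata

asv_table, taxonomy, metadata = split_for_phyloseq(toy)
print(asv_table)
print(taxonomy)
print(metadata)
```

Each table can then be written with `DataFrame.to_csv` and read into R, where the row names of the ASV table and taxonomy must match and the sample names must match the metadata row names.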