<a href="https://colab.research.google.com/github/dcolinmorgan/grph/blob/main/generic_metagenomic_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: Quick metagenomic analysis using GPU UMAP analysis & visualization

Using GPU-accelerated UMAP analysis & visualization, metagenomic samples can be compared faster and much more easily explored.

*   Task: Analyze metagenomic samples for similarity
*   Data: 10 samples
*   [data](https://figshare.scilifelab.se/articles/dataset/Metagenomic_dataset_from_Swedish_urban_lakes/22270225?file=39602290)
*   [paper](https://pubmed.ncbi.nlm.nih.gov/15560821/)

**Insight/ Result:**

over XXx faster for entire
Offers more insight when static plot would otherwise fail
(See also: CPU baseline)

# Setup

(install cuda packages first)

In [None]:
!pip install graphistry[ai]
!pip install dirty_cat umap-learn
import umap, dirty_cat
!pip install Biopython

# import /configure

get a free api-key at https://www.graphistry.com/


In [None]:
import pandas as pd
import graphistry

graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username='dcolinmorgan', password='fXjJnkE3Gik6BWy') ## key id, secret key

# graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username='dcolinmorgan', password='***') ## key id, secret key
graphistry.__version__

# Data Download & Description

In [None]:
# !wget https://figshare.scilifelab.se/ndownloader/files/39602290
!unzip -qq /content/39602290
# !wget https://figshare.scilifelab.se/ndownloader/files/39602299

In [None]:
!head /content/All_MAGs/Sample_101_S75_bin_1.fa

>Sample_101_S75-bin_1-k141_1338904_length_14014_cov_309.3572
AATCACGCGTACGCCCGCACCTTGAACCGCTTTGCCGCTGCCCCCACATCATCCTCACGAAAGGTACCTT
TTCATGGAAAAAATTATCAAATCCGATGCGGAATGGCGGGCCGTATTGGACCCCGTTCAATATCATGTCC
TACGGGAGTCCGGCACTGAACGCGCCTTTGCCGGCGCGCTGACCGATGAAAAGCGCGAAGGCGAATTTCG
CTGCGCCGGCTGTGAGACTGCCCTGTTTGCTTCGGACACGAAATTTGACAGCGGTTCGGGTTGGCCAAGC
TTTACCGCGCCCGCAGACAATGATGCTGTTGAAGAGCACCGCGATACATCGCACGGCATGGTCCGCATTG
AAGTGCGCTGTGCCGCATGTGAGGGGCATTTGGGCCATGTCTTCCCCGATGGGCCTGGACCGACTGGCCT
GCGTTACTGCATCAACAGCGCCGCGCTTGCATTCGATCCTGAATAACAAGGCGCTTGTCGGCGGTTACGG
GACTGGGTAACACTCGGGCCATGGCACGCGCGCGCAAGATTTCGAAAGAACGTGGCCCAATGGCAACATG
GATACTCCGCATGGTCAAAGCGGGCGTCATCGCGGCGTTGCTGGGCGTCATGGTTCTTGGCATTTTTGTC


# Read in 10 fasta to compare

In [None]:
from Bio import SeqIO
import glob,os
import pandas as pd
B=pd.DataFrame()
for i in glob.glob('/content/All_MAGs/*.fa')[0:9]:
    # j=os.path.basename(i)
    fasta_sequences = SeqIO.parse(open(i),'fasta')
    identifiers = []
    sequences = []
    for fasta in fasta_sequences:
        name, sequence = fasta.id, str(fasta.seq)
        identifiers.append(name)
        sequences.append(sequence)

    A=pd.DataFrame([identifiers,sequences]).T
    A.columns=['ID','seq']
    A.dropna(inplace=True)
    B=B.append(A)
    # A['ID']#=A.ID.str.split('-')[0:1]
# B['ID']=B['ID'].str.split('-').str[0]+'_'+B['ID'].str.split('-').str[1]#.cat()
B['ID']=B.ID.str.split('_length').str[0]
B.index=B.ID

In [None]:
B

Unnamed: 0_level_0,ID,seq
ID,Unnamed: 1_level_1,Unnamed: 2_level_1
Sample_110_S81-bin_116-k141_613475,Sample_110_S81-bin_116-k141_613475,GCGTAGTCTGACAGGTTCGTGCCAAATGAATCCCTTTTTCGAACTG...
Sample_110_S81-bin_116-k141_1652364,Sample_110_S81-bin_116-k141_1652364,GGTTAGTAGTCCCCAAAAAATTGCGCTGAGCAATGCAGCAAAGATT...
Sample_110_S81-bin_116-k141_719215,Sample_110_S81-bin_116-k141_719215,AGGCCTAACCAAATCAGTGCTAAAGTATTCATGGCGTGTCTCCCGA...
Sample_110_S81-bin_116-k141_2258219,Sample_110_S81-bin_116-k141_2258219,ATTTTCGCGAATTGTTGTGCGGTTTCTACGCTTATTTCATGGGTAT...
Sample_110_S81-bin_116-k141_1022213,Sample_110_S81-bin_116-k141_1022213,ATGAGTTTTAGGATTTAATGCTACTGACTATCAGCCCCAGCCATCC...
...,...,...
Sample_105_S79-bin_9-k141_291995,Sample_105_S79-bin_9-k141_291995,TTCAATTCATCCATCACCACTTAAGTTTCCTGGATTAGATCAGACA...
Sample_105_S79-bin_9-k141_2043216,Sample_105_S79-bin_9-k141_2043216,CATGCCAACATCGCAATCGCAGCTTCTGGAATCATCGGCATATAAA...
Sample_105_S79-bin_9-k141_640011,Sample_105_S79-bin_9-k141_640011,AACTTTGATAATCAACGTAGACAATTCCAAAGCGTTTTGCATACCC...
Sample_105_S79-bin_9-k141_906856,Sample_105_S79-bin_9-k141_906856,TTAAAGCCTGAACAGTTGCAACTAAAAGGTATCCACTCAGAATCTG...


## install [HUMAnN 3](https://huttenhower.sph.harvard.edu/humann), a method for efficiently and accurately profiling the abundance of microbial metabolic pathways and other molecular functions from metagenomic or metatranscriptomic sequencing data.

In [None]:
# !pip install humann --no-binary :all:
!pip install metaphlan

In [None]:
### !humann_databases --download utility_mapping full /path/to/databases --update-config yes

# !humann_test

# !wget https://github.com/biobakery/humann/raw/master/examples/demo.fastq.gz
!humann -i demo.fastq.gz -o sample_results

In [26]:
!humann -i /content/All_MAGs/Sample_101_S75_bin_1.fa -o test_out

Output files will be written to: /content/test_out

Running metaphlan ........

CRITICAL ERROR: Error executing: /usr/local/bin/metaphlan /content/All_MAGs/Sample_101_S75_bin_1.fa -t rel_ab -o /content/test_out/Sample_101_S75_bin_1_humann_temp/Sample_101_S75_bin_1_metaphlan_bugs_list.tsv --input_type fasta --bowtie2out /content/test_out/Sample_101_S75_bin_1_humann_temp/Sample_101_S75_bin_1_metaphlan_bowtie2.txt

Error message returned from metaphlan :
(ERR): bowtie2-align died with signal 9 (KILL) 
Traceback (most recent call last):
  File "/usr/local/bin/read_fastx.py", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/metaphlan/utils/read_fastx.py", line 168, in main
    f_nreads, f_avg_read_length = read_and_write_raw(f, opened=False, min_len=min_len, prefix_id=prefix_id)
  File "/usr/local/lib/python3.10/dist-packages/metaphlan/utils/read_fastx.py", line 130, in read_and_write_raw
    nreads, avg_read_length = read_and_write_raw_int(inf, min_l

In [None]:

# B.drop(columns=['ID'],inplace=True)
g = graphistry.nodes(B)
g2=g.umap(feature_engine='cu_cat',engine='cuml')

emb2=g2._node_embedding
g222=graphistry.nodes(emb2.reset_index(),'ID').edges(g2._edges,'_src_implicit','_dst_implicit').bind(point_x="x",point_y="y").settings(url_params={"play":0})
g222.plot()

# Compare clustering distances to family/genus labels (gold standards)

In [None]:
meta=pd.read_excel('/content/39602299')
A=meta.pplacer_taxonomy.str.split(';', expand=True)
A.index=meta.Bin_name
A

In [None]:
stopppp

# bio-ml dataset


1.   [3 subjects x 10 time points](
https://www.ebi.ac.uk/ena/browser/view/PRJNA544527)

2.  [metadata](
https://static-content.springer.com/esm/art%3A10.1038%2Fs41591-019-0559-3/MediaObjects/41591_2019_559_MOESM3_ESM.xlsx)



#try #2

[pull data](https://trace.ncbi.nlm.nih.gov/Traces/index.html?view=run_browser&acc=SRR6747711&display=data-access)
from [papers](https://www.sciencedirect.com/science/article/pii/S0160412019321774#ec-research-data) [and 2](https://pubs.acs.org/doi/10.1021/acs.est.8b03446)

In [None]:
# !wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR6747711/SRR6747711
# !wget AWS	s3://sra-pub-src-5/SRR6747711/161002_I137_FCH7YT3BBXX_L1_wHAXPI035554-18_1.fq.gz

import locale
def getpreferredencoding(do_setlocale = True):
    return "UTF-8"
locale.getpreferredencoding = getpreferredencoding

# with open(path, 'rb') as f:
#   contents = f.read()

from Bio import SeqIO
import gzip

reads = []
count = 0
    # for record in SeqIO.parse(handle, "fasta"):
    #     print(record.id)
# with gzip.open("39602290", "rt") as handle:
for rec in SeqIO.parse('39602290', "fastq"):
    if count <500: # only read in the first 500 reads to avoid running out of memory
        reads.append(rec.seq)
    count = count+1

# take a look at some of the reads
reads[0:20]

#try #1

[pull metagenomic data](https://www.ncbi.nlm.nih.gov/nuccore/2496718099)


In [None]:
# !get https://sra-download.ncbi.nlm.nih.gov/traces/wgs01/wgs_aux/KH/UX/KHUX01/KHUX01.1.fsa_nt.gz
# !gunzip KHUX01.1.fsa_nt.gz



In [None]:
from Bio import SeqIO

fasta_sequences = SeqIO.parse(open('KHUX01.1.fsa_nt'),'fasta')
    name = []
    sequences = []
    for fasta in fasta_sequences:
        name, sequence = fasta.id, str(fasta.seq)
        identifiers.append(name)
        sequences.append(sequence)

In [None]:
A=pd.DataFrame([identifiers,sequences]).T
A.columns=['ID','seq']
A.dropna(inplace=True)

In [None]:
# !pip install -U --force git+https://github.com/graphistry/pygraphistry.git@cudf
!pip install graphistry[ai] --quiet

In [None]:
import graphistry

graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username='dcolinmorgan', password='fXjJnkE3Gik6BWy') ## key id, secret key

# graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username='dcolinmorgan', password='***') ## key id, secret key
graphistry.__version__

In [None]:
g = graphistry.nodes(A)
g.umap(engine='umap_learn').plot()