<a href="https://colab.research.google.com/github/dcolinmorgan/grph/blob/main/accelearting_metagenomic_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Accelerating metagenomic analysis with [Graphistry](https://www.graphistry.com/)

Using GPU-accelerated UMAP + DBSCAN analysis and visualization, metagenomic samples can be compared faster and explored much more easily.

*   Task: Analyze metagenomic samples for similarity
*   Data: 10 samples
*   [data](https://figshare.scilifelab.se/articles/dataset/Metagenomic_dataset_from_Swedish_urban_lakes/22270225?file=39602290)
*   [paper](https://pubmed.ncbi.nlm.nih.gov/15560821/)

**Insight / Result:**

*   UMAP and DBSCAN run in about 43 s on the GPU, over XXX faster than the CPU baseline for the entire dataset
*   Interactive exploration offers more insight where a static plot would otherwise fail

(See also: [CPU baseline](https://github.com/dcolinmorgan/grph/blob/main/accelerating_chemical_mappings.ipynb))

# Setup

In [None]:
!pip install --extra-index-url=https://pypi.nvidia.com cuml-cu11 cudf-cu11 cugraph-cu11 pylibraft_cu11 raft_dask_cu11 dask_cudf_cu11 pylibcugraph_cu11
import cuml,cudf
print(cuml.__version__)

!pip install -U --force git+https://github.com/graphistry/pygraphistry.git@cudf
!pip install -U git+https://github.com/graphistry/cu-cat.git@DT3
# !pip install dirty_cat

!pip install Biopython

!nvidia-smi


# Import / configure

Get a free API key at https://www.graphistry.com/


In [2]:
import pandas as pd
import graphistry
from time import time


# Register with Graphistry Hub; replace '***' with your own password and never commit a real one
graphistry.register(api=3,protocol="https", server="hub.graphistry.com", username='dcolinmorgan', password='***') ## key id, secret key
graphistry.__version__

'0.28.7+463.gfb96400'
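Hard-coding a password in a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of one alternative, assuming the credentials are exported as environment variables before the notebook starts: the variable names `GRAPHISTRY_USERNAME` and `GRAPHISTRY_PASSWORD` and the helper `graphistry_register_kwargs` are hypothetical choices for illustration, not part of PyGraphistry. The returned dictionary only mirrors the keyword arguments used in the `graphistry.register` call above.

```python
import os


def graphistry_register_kwargs(env=None):
    """Build keyword arguments for graphistry.register from environment variables.

    GRAPHISTRY_USERNAME / GRAPHISTRY_PASSWORD are hypothetical names chosen
    for this sketch; any names work as long as they are set before the call.
    """
    env = os.environ if env is None else env
    required = ("GRAPHISTRY_USERNAME", "GRAPHISTRY_PASSWORD")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "api": 3,
        "protocol": "https",
        "server": "hub.graphistry.com",
        "username": env["GRAPHISTRY_USERNAME"],
        "password": env["GRAPHISTRY_PASSWORD"],
    }
```

With both variables set, the registration becomes `graphistry.register(**graphistry_register_kwargs())`, and no secret appears in the saved notebook.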

# bio-ml dataset


1.   [3 subjects x 10 time points](https://www.ebi.ac.uk/ena/browser/view/PRJNA544527)

2.   [metadata](https://static-content.springer.com/esm/art%3A10.1038%2Fs41591-019-0559-3/MediaObjects/41591_2019_559_MOESM3_ESM.xlsx)

3.   !wget https://raw.githubusercontent.com/dcolinmorgan/grph/main/ftp_PRJNA544527.txt


In [None]:
!wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR922/006/SRR9224006/SRR9224006_1.fastq.gz
!wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR922/006/SRR9224006/SRR9224006_2.fastq.gz
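The two download URLs above follow a regular pattern, so they can be generated from the run accession instead of being typed by hand. The sketch below is an assumption-laden illustration: the helper name `ena_fastq_url` is hypothetical, and the directory rule (first six characters, then the digits beyond the ninth character zero-padded to three) is generalized from the URLs above and ENA's usual layout. Only the `SRR9224006` case is confirmed by this notebook.

```python
def ena_fastq_url(run_accession, mate):
    """Return the ENA FTP URL for one mate (1 or 2) of a paired-end run.

    Assumed layout: vol1/fastq/<first 6 chars>/<padded trailing digits>/<run>/.
    Only the 10-character case (e.g. SRR9224006) is taken from the notebook.
    """
    extra = len(run_accession) - 9  # characters beyond the 9-character form
    if not 0 <= extra <= 3:
        raise ValueError(f"unexpected accession length: {run_accession}")
    # 9-character accessions have no sub-directory; longer ones use the
    # trailing digits, zero-padded to three characters (6 -> 006).
    subdir = run_accession[-extra:].zfill(3) + "/" if extra else ""
    return (
        "ftp://ftp.sra.ebi.ac.uk/vol1/fastq/"
        f"{run_accession[:6]}/{subdir}{run_accession}/{run_accession}_{mate}.fastq.gz"
    )


# Both mates of the run downloaded above
urls = [ena_fastq_url("SRR9224006", mate) for mate in (1, 2)]
```

Each URL in `urls` can then be passed to `wget -nc` exactly as in the cell above.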

In [None]:
! gunzip SRR9224006_1.fastq.gz
! gunzip SRR9224006_2.fastq.gz

In [None]:
!head /content/SRR9224006_1.fastq

@SRR9224006.1 7001174F:HVTFNBCXX161011:HVTFNBCXX:2:2206:18894:58151/1
AAAAAAAACAAAATAATGGAAACAAAAAACATCTACTTCATCAGCGGCATTGATACAGATGCCGGAAAAAGCTATTGCACCGCCTGGTATGCCCGTGAGCT
+
DDDDDIIIIIIGIIIIIIIIIIIIHIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIDHGIIIIGGHIIHHHIIIIIIIHHIIIIIIHIIIIIII
@SRR9224006.2 7001174F:HVTFNBCXX161011:HVTFNBCXX:1:1108:17623:49640/1
AAAAAAAACAACCCAATGCGATTCTGATCGCAATCTACATAAGTTACTACTGGTTATCTTCCCTGAAG
+
DDBDDIGIIHIIIGHGHFHIGHIIIEHIHIIIIIIIIIIIIEHHHGHIIIIIIIIIIIIIIEHHHHHH
@SRR9224006.3 7001174F:HVTFNBCXX161011:HVTFNBCXX:1:1204:5345:82516/1
AAAAAAAACAAGAGCTTTATTAAACACGTCTTGATCTTTTTTACACCTGCCGGAAATTCCATCGT
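
The `head` output above shows the FASTQ layout: each read occupies four lines (an `@` header, the sequence, a `+` separator, and one quality character per base). The sketch below is a dependency-free illustration of that layout, not a replacement for Biopython. The function names are hypothetical, it assumes strict four-line records (no wrapped sequences), and it assumes Phred+33 quality encoding.

```python
def read_fastq(lines):
    """Yield (read_id, sequence, quality) tuples from an iterable of FASTQ lines."""
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # tolerate blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None:
            raise ValueError(f"truncated record: {header}")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record: {header}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header}")
        # The read ID is the first whitespace-delimited token after '@'
        yield header[1:].split()[0], seq, qual


def phred_scores(quality, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]
```

Because `head` prints only ten lines, the third record above is cut off after its sequence line; `read_fastq` would raise a `ValueError` on that fragment rather than silently accept it.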


In [None]:
from Bio import SeqIO
import glob
import pandas as pd

# Parse every FASTQ file in /content into one table of read IDs and sequences
frames = []
for path in glob.glob('/content/*.fastq'):
    identifiers = []
    sequences = []
    # SeqIO.parse accepts a file path directly, so no explicit open() is needed
    for record in SeqIO.parse(path, 'fastq'):
        identifiers.append(record.id)
        sequences.append(str(record.seq))

    A = pd.DataFrame({'ID': identifiers, 'seq': sequences})
    A.dropna(inplace=True)
    frames.append(A)

# DataFrame.append was removed in pandas 2.0; concatenate the per-file tables once
B = pd.concat(frames, ignore_index=True)
B['ID'] = B.ID.str.split('_length').str[0]
B.index = B.ID

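# Install [HUMAnN 3](https://huttenhower.sph.harvard.edu/humann)

HUMAnN 3 is a method for efficiently and accurately profiling the abundance of microbial metabolic pathways and other molecular functions from metagenomic or metatranscriptomic sequencing data. The HUMAnN install below is commented out; the cell installs MetaPhlAn, which is the profiler this notebook actually runs.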

In [None]:
# !pip install humann --no-binary :all:
!pip install metaphlan

In [None]:
### !humann_databases --download utility_mapping full /path/to/databases --update-config yes

# !humann_test

# !wget https://github.com/biobakery/humann/raw/master/examples/demo.fastq.gz
# !humann -i demo.fastq.gz -o sample_results

### Running all samples takes very long (1+ day on a cluster)

In [None]:
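%%bash
# The %%bash magic must be the first line of the cell
mkdir -p assemble epi_sam_out mpa4_out
# humann -i /content/All_MAGs/Sample_101_S75_bin_1.fa -o test_out

# Unique sample names, e.g. /content/SRR9224006_1.fastq -> SRR9224006
seq=$(ls /content/*.fastq | xargs -n 1 basename | cut -d _ -f1 | sort -u)

for i in $seq
do
    # MetaPhlAn does not create the output directory itself
    mkdir -p /content/assemble/${i}
    # Expects a FASTA file /content/<sample>.fa for each sample
    metaphlan /content/${i}.fa --nproc 40 --input_type fasta -o /content/assemble/${i}/h4_out.txt -t rel_ab_w_read_stats
done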
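
The same per-sample loop can be driven from Python, which makes the sample-name logic easier to check before launching a long run. This is a sketch under assumptions: the helper names `sample_names` and `metaphlan_command` are hypothetical, the flags are copied from the `metaphlan` call above, and each sample is assumed to have a FASTA file named `<sample>.fa` in the input directory.

```python
import os


def sample_names(fastq_paths):
    """Derive unique sample names, e.g. /content/SRR9224006_1.fastq -> SRR9224006."""
    return sorted({os.path.basename(p).split("_")[0] for p in fastq_paths})


def metaphlan_command(sample, in_dir="/content", out_dir="/content/assemble", nproc=40):
    """Build the argument list for one MetaPhlAn run (flags as in the cell above)."""
    return [
        "metaphlan", f"{in_dir}/{sample}.fa",
        "--nproc", str(nproc),
        "--input_type", "fasta",
        "-o", f"{out_dir}/{sample}/h4_out.txt",
        "-t", "rel_ab_w_read_stats",
    ]


paths = ["/content/SRR9224006_1.fastq", "/content/SRR9224006_2.fastq"]
commands = [metaphlan_command(name) for name in sample_names(paths)]
```

Each entry of `commands` can be passed to `subprocess.run(cmd, check=True)` once the output directory exists. Building the list first makes it possible to print and inspect the commands without running anything.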

In [None]:
# from sqlalchemy.util.compat import dataclass_fields
!wget https://github.com/dcolinmorgan/grph/raw/main/PRJNA544527_mpa4out.txt
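data=pd.read_csv('/content/PRJNA544527_mpa4out.txt',sep='\t',skiprows=1,index_col=0)
# clade_name is a '|'-separated lineage; field 6 is the species-level label (s__...)
data.index=data.reset_index().clade_name.str.split('|',expand=True)[6]
# rows above species level have no field 6 and are dropped here
data=data.reset_index().dropna(axis=0)
data.index=data[6]
data=data.drop(columns=6)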

!wget https://raw.githubusercontent.com/dcolinmorgan/grph/main/PRJNA544527-meta_inf.txt
meta=pd.read_csv('/content/PRJNA544527-meta_inf.txt',sep='\t',header=None)

mm=pd.merge(data.T,meta[[3,5]],left_index=True,right_on=3)

mm['id']=mm[5].str.split('-').str[0]
mm['time']=mm[5].str.split('_').str[0].str.split('-').str[1]

!wget https://static-content.springer.com/esm/art%3A10.1038%2Fs41591-019-0559-3/MediaObjects/41591_2019_559_MOESM3_ESM.xlsx
metaa=pd.read_excel('/content/41591_2019_559_MOESM3_ESM.xlsx',sheet_name='SupTable2',skiprows=3)
metaa=metaa[['Donor','Age','Sex','BMI']]

Full_table=pd.merge(mm,metaa,left_on='id',right_on='Donor')
Full_table=Full_table.drop(columns=[3, 5, 'id'])
Full_table.time=pd.to_datetime(Full_table.time,unit='s')
data2=Full_table.melt(id_vars=['time','Donor','Age','Sex','BMI'])
data2.to_csv('PRJNA544527_mpa4_annot_table.txt',sep='\t')
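
The cell above keeps field 6 of each `clade_name` because MetaPhlAn writes lineages as `|`-separated ranks, from kingdom (`k__`) down to species (`s__`). The sketch below illustrates that indexing in plain Python. The helper names are hypothetical and the example lineage is illustrative (only the species label `s__Phocaeicola_vulgatus` appears in this notebook's data).

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def parse_clade(clade_name):
    """Map rank names to labels for a '|'-separated MetaPhlAn lineage."""
    return dict(zip(RANKS, clade_name.split("|")))


def species_of(clade_name):
    """Return the species-level label (field 6), or None for higher-rank rows."""
    parts = clade_name.split("|")
    return parts[6] if len(parts) > 6 else None


# Illustrative lineage; the higher ranks are not taken from the notebook's output
lineage = ("k__Bacteria|p__Bacteroidetes|c__Bacteroidia|o__Bacteroidales|"
           "f__Bacteroidaceae|g__Phocaeicola|s__Phocaeicola_vulgatus")
species = species_of(lineage)
```

Rows that stop at genus level or above return `None` here, which corresponds to the `NaN` rows removed by `dropna` in the pandas version.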

# UMAP and DBSCAN

The idea for this metagenomic analysis is based on [Quantifying Shared and Unique Gene Content across 17 Microbial Ecosystems](https://journals.asm.org/doi/full/10.1128/msystems.00118-23).

(This section analyzes all samples, which were run on a cluster.)
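
For intuition about the clustering step, DBSCAN groups points that have enough close neighbours and marks isolated points as noise. The sketch below is a small pure-Python version for illustration only. It is not the cuML implementation that this notebook calls through `g.umap(dbscan=True, ...)`, and the names `dbscan`, `eps`, and `min_samples` follow common convention rather than anything defined here.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    # A point's neighbourhood includes the point itself
    neighbours = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

This version compares every pair of points, so it is quadratic in the number of rows and only suitable for toy inputs; the GPU implementation exists precisely because tables of this notebook's size need something faster.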

In [6]:
# !wget https://raw.githubusercontent.com/dcolinmorgan/grph/main/PRJNA544527_mpa4_annot_table.txt

data2=pd.read_csv('PRJNA544527_mpa4_annot_table.txt',sep='\t',index_col=0)

In [7]:
data2

| | time | Donor | Age | Sex | BMI | variable | value |
|---|---|---|---|---|---|---|---|
| 0 | 1970-01-01 00:03:51 | am | 28 | Male | 23.1 | s__Phocaeicola_vulgatus | 28.36239 |
| 1 | 1970-01-01 00:00:18 | am | 28 | Male | 23.1 | s__Phocaeicola_vulgatus | 27.19811 |
| 2 | 1970-01-01 00:01:57 | am | 28 | Male | 23.1 | s__Phocaeicola_vulgatus | 25.97035 |
| 3 | 1970-01-01 00:01:26 | am | 28 | Male | 23.1 | s__Phocaeicola_vulgatus | 37.47952 |
| 4 | 1970-01-01 00:03:14 | am | 28 | Male | 23.1 | s__Phocaeicola_vulgatus | 31.55610 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 263011 | 1970-01-01 00:00:02 | bp | 28 | Male | 19.7 | s__Pseudomonas_rhodesiae | 0.00000 |
| 263012 | 1970-01-01 00:00:01 | bo | 36 | Male | 22.2 | s__Pseudomonas_rhodesiae | 0.00000 |
| 263013 | 1970-01-01 00:00:01 | ci | 25 | Male | 23.7 | s__Pseudomonas_rhodesiae | 0.00000 |
| 263014 | 1970-01-01 00:00:01 | cj | 35 | Male | 24.4 | s__Pseudomonas_rhodesiae | 0.00000 |


In [9]:
g = graphistry.nodes(data2.drop(columns='time'))

t=time()
g2=g.umap(dbscan=True,feature_engine='cu_cat',engine='cuml')
print("\n"+str(time()-t))

g2.plot()



Using GPU: cu_cat





42.69053101539612
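
To turn a single timing like the one above into a speedup figure, the same workload has to be timed on both engines. The sketch below wraps the `time()` before-and-after pattern in a reusable context manager. The names `timed` and `speedup` are hypothetical helpers, and no CPU timing is claimed here since this notebook only measures the GPU run.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the duration, and optionally record it."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f} s")


def speedup(baseline_seconds, accelerated_seconds):
    """Return how many times faster the accelerated run is than the baseline."""
    return baseline_seconds / accelerated_seconds


results = {}
with timed("example block", results):
    sum(range(1000))  # stand-in for the g.umap(...) call
```

Wrapping the GPU call and a CPU run of the same table in `timed(...)` blocks with different labels would fill `results` with both durations, and `speedup` then gives the ratio.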
