This notebook looks at the single-copy marker genes that were annotated on multiple dominating set pieces, where at least one piece was increased in abundance and at least one was decreased in abundance in CD. It selects a few sequences to investigate further using `spacegraphcats extract_reads`.

In [1]:
setwd("..")

In [2]:
library(dplyr)
library(readr)
library(purrr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [14]:
# look at an overview of all of the single-copy marker gene sequences that had overlapping annotations across the species
marker_ko <- Sys.glob("outputs/sgc_pangenome_catlases_corncob_annotation_analysis/*overlapping_marker_kos.tsv") %>%
  map_dfr(read_tsv, show_col_types = FALSE) %>%
  # fixed = TRUE matches the suffix literally, so "." is not treated as a regex wildcard
  mutate(filename = gsub("_clustered_annotated_seqs.fa", "", filename, fixed = TRUE))

marker_ko_summary <- marker_ko %>%
  select(filename, Preferred_name) %>%
  group_by(filename, Preferred_name) %>%
  tally() %>%
  arrange(desc(Preferred_name))

marker_ko_summary

filename,Preferred_name,n
<chr>,<chr>,<int>
s__Ruminococcus_B-gnavus,rpsQ,2
s__Enterocloster-clostridioformis_A,rpsI,2
s__Ruminococcus_B-gnavus,rpsI,2
s__Enterocloster-clostridioformis_A,rpsH,4
s__Enterocloster-clostridioformis_A,rpsC,4
s__Enterocloster-sp005845215,rpsC,2
s__Enterocloster-sp005845215,rpsB,3
s__Ruminococcus_B-gnavus,rpmC,2
s__Enterocloster-clostridioformis,rplT,2
s__Ruminococcus_B-gnavus,rplQ,2


It looks like every species has at least one `rpl*` gene that was annotated among both increased- and decreased-abundance dominating set pieces.
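
The observation above can be checked programmatically. The following is a minimal sketch, assuming a summary table with the same `filename` and `Preferred_name` columns as `marker_ko_summary`. The data frame `toy_summary` is a hypothetical stand-in built from a handful of rows copied from this notebook's printed outputs, not the full table.

```r
library(dplyr)

# Hypothetical stand-in for marker_ko_summary; rows copied from the printed outputs.
toy_summary <- data.frame(
  filename = c("s__Ruminococcus_B-gnavus", "s__Ruminococcus_B-gnavus",
               "s__Enterocloster-clostridioformis", "s__Enterocloster-bolteae",
               "s__Enterocloster-sp005845215", "s__Enterocloster-sp005845215",
               "s__Enterocloster-clostridioformis_A", "s__Enterocloster-clostridioformis_A"),
  Preferred_name = c("rpsQ", "rplQ", "rplT", "rplC", "rpsC", "rplO", "rpsH", "rplO"),
  n = c(2L, 2L, 2L, 2L, 2L, 2L, 4L, 2L)
)

# For each species, record whether any annotated marker gene is an rpl* gene.
species_with_rpl <- toy_summary %>%
  group_by(filename) %>%
  summarize(has_rpl = any(grepl("^rpl", Preferred_name)), .groups = "drop")

# TRUE only if every species has at least one rpl* gene.
all(species_with_rpl$has_rpl)
```

Swapping `toy_summary` for the real `marker_ko_summary` would run the same check on the full table.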

In [9]:
marker_ko_summary %>%
  filter(grepl("^rpl", Preferred_name))

filename,Preferred_name,n
<chr>,<chr>,<int>
s__Enterocloster-clostridioformis,rplT,2
s__Ruminococcus_B-gnavus,rplQ,2
s__Enterocloster-clostridioformis_A,rplO,2
s__Enterocloster-sp005845215,rplO,2
s__Enterocloster-clostridioformis_A,rplF,4
s__Enterocloster-clostridioformis,rplD,3
s__Enterocloster-clostridioformis_A,rplD,3
s__Enterocloster-bolteae,rplC,2
s__Enterocloster-clostridioformis_A,rplC,2
s__Enterocloster-bolteae,rplB,5


Select the `rpl*` genes that were:
+ both decreased and increased in abundance in multiple species (preferred, not required)
+ annotated in only two differentially abundant dominating set pieces
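
The two criteria above could also be applied in code rather than by reading the table. This is a rough sketch, not the method used in this notebook: `toy_rpl` is a hypothetical data frame copied from the `rpl*` output above, and `candidates` and `marker_kos_candidates` are illustrative names.

```r
library(dplyr)

# Hypothetical stand-in for the rpl* rows of marker_ko_summary (copied from the output above).
toy_rpl <- data.frame(
  filename = c("s__Enterocloster-clostridioformis", "s__Ruminococcus_B-gnavus",
               "s__Enterocloster-clostridioformis_A", "s__Enterocloster-sp005845215",
               "s__Enterocloster-clostridioformis_A", "s__Enterocloster-clostridioformis",
               "s__Enterocloster-clostridioformis_A", "s__Enterocloster-bolteae",
               "s__Enterocloster-clostridioformis_A", "s__Enterocloster-bolteae"),
  Preferred_name = c("rplT", "rplQ", "rplO", "rplO", "rplF",
                     "rplD", "rplD", "rplC", "rplC", "rplB"),
  n = c(2L, 2L, 2L, 2L, 4L, 3L, 3L, 2L, 2L, 5L)
)

candidates <- toy_rpl %>%
  # required: annotated in exactly two differentially abundant pieces
  filter(grepl("^rpl", Preferred_name), n == 2) %>%
  # preferred: observed in more than one species, so count species per gene
  group_by(Preferred_name) %>%
  summarize(n_species = n_distinct(filename), .groups = "drop") %>%
  arrange(desc(n_species), Preferred_name)

# Gene names ordered with the multi-species genes first.
marker_kos_candidates <- candidates$Preferred_name
```

Sorting by `n_species` keeps the preferred criterion as a ranking rather than a hard filter, which matches the "preferred, not required" wording.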

In [15]:
marker_ko_summary %>%
  filter(grepl("^rpl", Preferred_name)) %>%
  filter(n == 2)

marker_kos_keep <- c("rplT", "rplQ", "rplO", "rplC")

filename,Preferred_name,n
<chr>,<chr>,<int>
s__Enterocloster-clostridioformis,rplT,2
s__Ruminococcus_B-gnavus,rplQ,2
s__Enterocloster-clostridioformis_A,rplO,2
s__Enterocloster-sp005845215,rplO,2
s__Enterocloster-bolteae,rplC,2
s__Enterocloster-clostridioformis_A,rplC,2


In [20]:
# look at the record names that were multifasta matches; these will become the queries for extract_reads
marker_ko %>%
  filter(Preferred_name %in% marker_kos_keep) %>%
  select(estimate, bonferroni, filename, record_name, record_id, Preferred_name, dom_id, size)

estimate,bonferroni,filename,record_name,record_id,Preferred_name,dom_id,size
<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>
1.161457,0.01701443,s__Enterocloster-clostridioformis_A,GCF_000424325.1_14560 50S ribosomal protein L15,GCF_000424325.1_14560,rplO,124843,462
-0.6893618,0.0007332007,s__Enterocloster-clostridioformis_A,GCF_000424325.1_14560 50S ribosomal protein L15,GCF_000424325.1_14560,rplO,55634,2868
1.2326538,0.0002176084,s__Enterocloster-clostridioformis_A,GCF_000424325.1_14655 50S ribosomal protein L3,GCF_000424325.1_14655,rplC,143455,435
-0.5479851,0.001651133,s__Enterocloster-clostridioformis_A,GCF_000424325.1_14655 50S ribosomal protein L3,GCF_000424325.1_14655,rplC,297329,2399
1.1106049,0.0002557957,s__Enterocloster-bolteae,GCF_000371665.1_11465 50S ribosomal protein L3,GCF_000371665.1_11465,rplC,533124,621
-0.5181141,0.00877331,s__Enterocloster-bolteae,GCF_000371665.1_11465 50S ribosomal protein L3,GCF_000371665.1_11465,rplC,304454,3199
0.8746773,0.005752351,s__Enterocloster-sp005845215,GCF_900753375.1_04580 50S ribosomal protein L15,GCF_900753375.1_04580,rplO,11558,1858
-0.9882787,0.000620934,s__Enterocloster-sp005845215,GCF_900753375.1_04580 50S ribosomal protein L15,GCF_900753375.1_04580,rplO,12786,1008
0.8485088,0.01125269,s__Ruminococcus_B-gnavus,GCF_013299885.1_02845 50S ribosomal protein L17,GCF_013299885.1_02845,rplQ,250497,1197
-0.5461895,0.0008195545,s__Ruminococcus_B-gnavus,GCF_013299885.1_02845 50S ribosomal protein L17,GCF_013299885.1_02845,rplQ,126054,4332
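
Before using these records as queries, it may be worth confirming that each one really has pieces moving in both directions. The sketch below assumes that a positive `estimate` means increased abundance and a negative one means decreased abundance, which depends on how the corncob model was specified. `toy_hits` is a hypothetical data frame holding the `record_id` and `estimate` values copied from the table above.

```r
library(dplyr)

# Hypothetical stand-in: record_id and estimate copied from the table above.
toy_hits <- data.frame(
  record_id = rep(c("GCF_000424325.1_14560", "GCF_000424325.1_14655",
                    "GCF_000371665.1_11465", "GCF_900753375.1_04580",
                    "GCF_013299885.1_02845"), each = 2),
  estimate = c(1.161457, -0.6893618, 1.2326538, -0.5479851, 1.1106049,
               -0.5181141, 0.8746773, -0.9882787, 0.8485088, -0.5461895)
)

# Count pieces per record by sign of the estimate.
# Assumption: estimate > 0 is "increased", estimate < 0 is "decreased".
direction_check <- toy_hits %>%
  group_by(record_id) %>%
  summarize(n_increased = sum(estimate > 0),
            n_decreased = sum(estimate < 0),
            .groups = "drop")

# TRUE only if every record has at least one piece in each direction.
all(direction_check$n_increased >= 1 & direction_check$n_decreased >= 1)
```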
