In this notebook we look for the reads in the De Micheli 2020 (mouse) day 0 (D0) fastq file that map to Smim41 (6030408B16Rik), because we want to see whether those reads also map to any other genomic region. The signal very likely comes from the gene itself, since the results replicate across days and the gene also appears in the dataset of the De Micheli 2020 paper.

To make sure that the gene is relevant, we follow this strategy:
* Get the fastq file of the reads from day 0.
* Grep selected regions of Smim41 against that fastq file and keep the reads that contain them.
* Use BLAST to check whether the reads matching the selected regions map to Smim41.

Lastly, we create a FASTA file with only those reads and BLAST it with permissive (low-stringency) settings to see whether they map to any other genomic region.
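The grep step of this strategy can also be sketched in pure Python. This is a minimal sketch, not the method used in this notebook: `scan_fastq` and `write_fasta` are hypothetical helper names, and the code assumes a well-formed gzipped FASTQ with exactly four lines per record. Like `grep`, it only finds exact matches on the strand as sequenced, but it checks all probes in a single pass over the file.

```python
import gzip
from collections import defaultdict


def scan_fastq(fastq_gz, probes):
    """Return {probe_name: [(read_id, sequence), ...]} for reads containing each probe.

    `probes` maps a label (hypothetical, e.g. "seq1_mid") to a probe sequence.
    Assumes four lines per FASTQ record.
    """
    hits = defaultdict(list)
    with gzip.open(fastq_gz, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:  # end of file
                break
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            handle.readline()  # quality line, ignored
            for name, probe in probes.items():
                if probe in sequence:  # exact substring match, like grep
                    hits[name].append((header[1:], sequence))  # drop leading '@'
    return dict(hits)


def write_fasta(records, path):
    """Write (read_id, sequence) pairs as FASTA, ready to paste into BLAST."""
    with open(path, "w") as out:
        for read_id, sequence in records:
            out.write(f">{read_id}\n{sequence}\n")
```

For instance, `scan_fastq("SRR10870296_3.fastq.gz", {"seq1_mid": "TTTCCAGCCCTTGACCCTTGGGATTCTTG"})` would collect the same reads as the first grep cell, already without the `@` prefix, so no `sed` pass is needed.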

In [2]:
import scanpy as sc
import scanpy.external as sce
import pandas as pd
import numpy as np
import os
import triku as tk
import matplotlib.pyplot as plt
import matplotlib as mpl
from tqdm.notebook import tqdm
import ray
import subprocess

In [3]:
seed = 10

In [None]:
os.getcwd()

In [5]:
data_dir = os.path.join(os.getcwd(), 'data')

In [6]:
# Palettes for UMAP gene expression

magma = [plt.get_cmap('magma')(i) for i in np.linspace(0,1, 80)]
magma[0] = (0.88, 0.88, 0.88, 1)
magma = mpl.colors.LinearSegmentedColormap.from_list("", magma[:65])

In [None]:
!cd {data_dir}/demicheli_mouse && aria2c -x 16 https://sra-download.ncbi.nlm.nih.gov/traces/sra47/SRR/010615/SRR10870296

In [None]:
!fastq-dump {data_dir}/demicheli_mouse/SRR10870296 --gzip --split-files --outdir {data_dir}/demicheli_mouse

MID SEQUENCES

In [29]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator TTTCCAGCCCTTGACCCTTGGGATTCTTG > grep_TTTCCAGCCCTTGACCCTTGGGATTCTTG_seq1_mid.fasta

In [30]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator TGGGATTCTTGGTTTCCTTCCTCATCTCC  > grep_TGGGATTCTTGGTTTCCTTCCTCATCTCC_seq2_mid.fasta

In [31]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CGCTTCCCACGGCTTTGCCATTAAATAGG  > grep_CGCTTCCCACGGCTTTGCCATTAAATAGG_seq3_mid.fasta

In [32]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator TCTGTAATGGGAGCCAATGCCCTCTTCTG  > grep_TCTGTAATGGGAGCCAATGCCCTCTTCTG_seq4_mid.fasta

In [33]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GGTAGTTACGAACACTGACTGTTCTTCCA  > grep_GGTAGTTACGAACACTGACTGTTCTTCCA_seq5_mid.fasta

In [34]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GTTCTTCCAGAGGTTCTGAGTTTGATTCC  > grep_GTTCTTCCAGAGGTTCTGAGTTTGATTCC_seq6_mid.fasta

END SEQUENCES

In [35]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CACTGTTTCCCCAAGCCTGGCTCTGTTAA  > grep_CACTGTTTCCCCAAGCCTGGCTCTGTTAA_seq1_end.fasta

In [36]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator TGTTAATTATTGTTCTATTGCGATAAAGC > grep_TGTTAATTATTGTTCTATTGCGATAAAGC_seq2_end.fasta

In [37]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator AGTGGGTAATGTGTTTGCCCATCACTATA > grep_AGTGGGTAATGTGTTTGCCCATCACTATA_seq3_end.fasta

In [38]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator ACTATATAAGGTTTTGTATACTATAATTA > grep_ACTATATAAGGTTTTGTATACTATAATTA_seq4_end.fasta

In [39]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GTAAACCTTGCCCTCATCTTTGAAATAGA > grep_GTAAACCTTGCCCTCATCTTTGAAATAGA_seq5_end.fasta

In [40]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator TGAAATAGAAGTGACACCATCAGTGTGAG > grep_TGAAATAGAAGTGACACCATCAGTGTGAG_seq6_end.fasta

BEGINNING SEQUENCES

In [41]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CAAGGGTAGTGTGCACATCTGGGCAGCTG > grep_CAAGGGTAGTGTGCACATCTGGGCAGCTG_seq1_beg.fasta

In [42]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CACATCTGGGCAGCTGGTGGGAGCATGAA > grep_CACATCTGGGCAGCTGGTGGGAGCATGAA_seq2_beg.fasta

In [43]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GCAGCCAAGGCTGCCTGGCTGAGCTGCTG > grep_GCAGCCAAGGCTGCCTGGCTGAGCTGCTG_seq3_beg.fasta

In [44]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GCTGAGCTGCTGCAACCAGTCCGGGCTGC > grep_GCTGAGCTGCTGCAACCAGTCCGGGCTGC_seq4_beg.fasta

In [45]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CAGAGGGGCCACGCATGGTGCAGGCAGTC > grep_CAGAGGGGCCACGCATGGTGCAGGCAGTC_seq5_beg.fasta

In [46]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GGTGCAGGCAGTCGTGCTGGGCGTCCTGT > grep_GGTGCAGGCAGTCGTGCTGGGCGTCCTGT_seq6_beg.fasta

In [28]:
!find . -type f -name 'grep_*.fasta' -exec sed -i 's/@SRR/>SRR/g' {} \; 

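After running sed, the read headers start with `>` instead of `@`, so each grep output is a valid FASTA file that can be submitted to BLAST.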

SPECIFIC TO Smim41-202: We found **no reads** matching the Smim41-202 isoform, so it is probably not the expressed isoform. However, the isoform-specific bases are so few that we cannot be sure.

In [None]:
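# Sequence specific to Smim41-202, stored as a string so that the cell runs
smim41_202_specific = "GAGGCTGAGCCTGGTGCCTGTGGAGGGGATGACGACTCCTAG"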

In [47]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GAGGCTGAGCCTGGTGCCTGTGGAGGG > grep_GAGGCTGAGCCTGGTGCCTGTGGAGGG_smim202_1.fasta

In [48]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GCCTGGTGCCTGTGGAGGGGATGACGA > grep_GCCTGGTGCCTGTGGAGGGGATGACGA_smim202_2.fasta

In [49]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GCCTGTGGAGGGGATGACGACTCCTAG > grep_GCCTGTGGAGGGGATGACGACTCCTAG_smim202_3.fasta
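The three Smim41-202 probes above are 27-base windows starting at offsets 0, 8 and 15 of the 42-base isoform-specific sequence. The sketch below shows one way to generate such overlapping windows automatically. `tile_probes` is a hypothetical helper, and the step of 8 is inferred from the probes rather than stated anywhere in this notebook.

```python
def tile_probes(sequence, k, step):
    """Split `sequence` into overlapping windows of length `k`.

    Windows start every `step` bases; a final window is added so that
    the last `k` bases of the sequence are always covered.
    """
    if k >= len(sequence):
        return [sequence]
    last_start = len(sequence) - k
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the 3' end is covered
    return [sequence[s:s + k] for s in starts]


# Isoform-specific sequence taken from the cell above
smim41_202 = "GAGGCTGAGCCTGGTGCCTGTGGAGGGGATGACGACTCCTAG"
for probe in tile_probes(smim41_202, k=27, step=8):
    print(probe)
```

Shorter windows or a smaller step give more probes and therefore more chances of an exact match, at the cost of lower specificity per probe.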

SPECIFIC TO Smim41-201: the sequences CTCGCCTGCCGCCCACCCGCACCCTA and CCCGCACCCTATTTGTGCTTGTGGTG were matched, so the reads are consistent with this isoform.

In [None]:
# Sequence specific to Smim41-201, stored as a string so that the cell runs
smim41_201_specific = "AGGCTGAGCCTGGTGCCTGTGGAGGGGATGACGACTCCTAGGTGTCAGCTGCTCTTGAGATTCAGCCTGTTCTGTGCCACGCCATCCAGAGTTCATTTCTGTGAGGCACAGCGGAAGGTCCAACCCACAAGCTCTTATTTCCCGCACAGGTGCTGGACATCTCCCCTCGCCTGCCGCCCACCCGCACCCTATTTGTGCTTGTGGTGA"

In [50]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GGCTGAGCCTGGTGCCTGTGGAGGGG > grep_GGCTGAGCCTGGTGCCTGTGGAGGGG_smim201_1.fasta

In [51]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GACGACTCCTAGGTGTCAGCTGCTCT > grep_GACGACTCCTAGGTGTCAGCTGCTCT_smim201_2.fasta

In [52]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator GAGATTCAGCCTGTTCTGTGCCACGC > grep_GAGATTCAGCCTGTTCTGTGCCACGC_smim201_3.fasta

In [53]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CAGCGGAAGGTCCAACCCACAAGCTC > grep_CAGCGGAAGGTCCAACCCACAAGCTC_smim201_4.fasta

In [54]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CCACAAGCTCTTATTTCCCGCACAGG > grep_CCACAAGCTCTTATTTCCCGCACAGG_smim201_5.fasta

In [55]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CAGGTGCTGGACATCTCCCCTCGCCT > grep_CAGGTGCTGGACATCTCCCCTCGCCT_smim201_6.fasta

In [56]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CTCGCCTGCCGCCCACCCGCACCCTA > grep_CTCGCCTGCCGCCCACCCGCACCCTA_smim201_7.fasta

In [58]:
!cd {data_dir}/demicheli_mouse && zcat SRR10870296_3.fastq.gz | grep -B 1 --no-group-separator CTGTGGAGGGGATGACGACTCCTAGG > grep_CTGTGGAGGGGATGACGACTCCTAGG_smim201_9.fasta

# Conclusion

We find reads that match specific regions of the gene. Also, BLASTn of the matching sequences mapped them entirely to Smim41, which supports the existence of the transcript and its expression in this dataset.

Some things to remark:
* No reads were found matching the "beginning" or "end" sequences. This is likely because the "beginning" sequences lie at the 5' end and this kind of single-cell protocol is 3'-enriched. The "end" sequences might be absent because they belong to an isoform that is not expressed.

* Of isoforms 201 and 202, only reads matching 201 were found. However, the 202-specific stretch is short (the full sequence is longer, but this is its only non-overlapping part), so it is likely that either no reads happened to cover it, or reads covered it with some mismatches and were therefore not captured by the exact-match grep.
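The last remark points at a limit of exact-match grep: a read with a single sequencing error, or one sequenced from the opposite strand, is missed. The following is a minimal sketch of a more tolerant check, assuming substitutions only (no indels). The function names and the default of two mismatches are illustrative choices, not part of the analysis above.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def hamming_hit(read, probe, max_mismatches=2):
    """Return the first offset where `probe` fits `read` with at most
    `max_mismatches` substitutions, or -1 if there is no such offset."""
    k = len(probe)
    for start in range(len(read) - k + 1):
        mismatches = 0
        for a, b in zip(read[start:start + k], probe):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # too many mismatches at this offset
        else:
            return start  # inner loop finished without exceeding the limit
    return -1


def matches_either_strand(read, probe, max_mismatches=2):
    """True if the probe or its reverse complement fits the read."""
    return (
        hamming_hit(read, probe, max_mismatches) >= 0
        or hamming_hit(read, reverse_complement(probe), max_mismatches) >= 0
    )
```

This scan is quadratic in read and probe length, which is acceptable for a handful of probes but much slower than grep over a whole fastq file; an aligner remains the better tool at scale.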


A more thorough way to do the analysis would be to use STAR or another aligner to map all the reads, select the ones mapping to Smim41, and then BLAST those reads to see which part of the gene they map to. Since we were only interested in confirming the specificity of the reads, the grep-based approach used here is enough.
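For reference, the aligner-based route could look roughly like the sketch below. It was not run for this notebook. It only builds the command lines, and it assumes that STAR and samtools are installed and that a STAR genome index already exists. All paths, the output prefix and the Smim41 region string are hypothetical placeholders that the caller must supply.

```python
import subprocess


def star_align_cmd(genome_dir, fastq_gz, out_prefix, threads=8):
    """Build a STAR command that aligns one gzipped fastq to a sorted BAM."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,          # pre-built STAR index (assumed to exist)
        "--readFilesIn", fastq_gz,
        "--readFilesCommand", "zcat",       # read the gzipped input directly
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--outFileNamePrefix", out_prefix,
    ]


def extract_region_cmds(bam, region, out_bam):
    """Build samtools commands that index `bam` and keep reads in `region`.

    `region` uses the samtools 'chrom:start-end' syntax; the actual
    Smim41 coordinates are not given here and must come from an annotation.
    """
    return [
        ["samtools", "index", bam],
        ["samtools", "view", "-b", "-o", out_bam, bam, region],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

With a coordinate-sorted output and a prefix such as `d0_`, STAR writes `d0_Aligned.sortedByCoord.out.bam`, which is the file to pass to `extract_region_cmds`. The reads kept in the region BAM can then be converted to FASTA and submitted to BLAST as before.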
