# <font color=#c51b8a>Mine-n-Match (MnM) Mini Tutorial:</font>
### Make sure to load all the necessary imports every time you open this notebook!

In [1]:
# Import functions for mining NCBI
from mine_ncbi_functions import ncbi_fetch_species, ncbi_mine_seq_data 
# Import json so we can load any existing taxonomy files
import json
# Email for when we query NCBI
email = "sethfrazer@ucsb.edu"  # Replace with your email


## <font color=#c994c7>Step 1: Create a Taxonomy Database</font>

In [2]:
# Example usage:
taxa = "Mammalia"
rank = "class"
limit = 500
report_dir = 'taxonomy_data'
out_file = "mammalia_taxonomy"

# If verbose=True, a lot of information is printed when you run this function.
# Also, if you hover your mouse over any of MY functions, you should get a pop-up with more information on how to use them.
species_data = ncbi_fetch_species(email, report_dir=report_dir, out=out_file, taxa=taxa, rank=rank, limit=limit, verbose=True)

# Print the first few entries of the dictionary as an example:
count = 0
for species, lineage in species_data.items():
    print(f"{species}: {lineage}")
    count += 1
    if count >= 5:
        break

Searching for TaxID for Mammalia...
Found TaxID: 40674
Finding direct children of Mammalia...
Found 28 direct children.
  - Child Taxon: Litopterna (TaxID: 1563124)
  - Child Taxon: Notoungulata (TaxID: 1563120)
  - Child Taxon: Cingulata (TaxID: 948951)
  - Child Taxon: Pilosa (TaxID: 948950)
  - Child Taxon: Artiodactyla (TaxID: 91561)
  - Child Taxon: Peramelemorphia (TaxID: 38611)
  - Child Taxon: Notoryctemorphia (TaxID: 38610)
  - Child Taxon: Diprotodontia (TaxID: 38609)
  - Child Taxon: Dasyuromorphia (TaxID: 38608)
  - Child Taxon: Microbiotheria (TaxID: 38607)
  - Child Taxon: Paucituberculata (TaxID: 38606)
  - Child Taxon: Didelphimorphia (TaxID: 38605)
  - Child Taxon: Carnivora (TaxID: 33554)
  - Child Taxon: Dermoptera (TaxID: 30656)
  - Child Taxon: Macroscelidea (TaxID: 28734)
  - Child Taxon: Rodentia (TaxID: 9989)
  - Child Taxon: Lagomorpha (TaxID: 9975)
  - Child Taxon: Pholidota (TaxID: 9971)
  - Child Taxon: Tubulidentata (TaxID: 9815)
  - Child Taxon: Hyracoidea

In [2]:
# If you want to load an existing taxonomy file, this is how you do it
species_data_file = './taxonomy_data/mammalia_taxonomy.json'
with open(species_data_file, 'r') as f:
    species_data = json.load(f)
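
If you are not sure whether a taxonomy file has already been saved, you can check for it before loading. The helper below is a minimal sketch, not part of MnM: `load_cached_taxonomy` is a hypothetical name, and it assumes the JSON file holds the same species-to-lineage dictionary that `ncbi_fetch_species` returned.

```python
import json
import os


def load_cached_taxonomy(path):
    """Return the taxonomy dictionary stored at `path`, or None if the file is missing.

    Hypothetical helper for illustration; it is not one of the MnM functions.
    """
    if not os.path.isfile(path):
        return None
    with open(path, "r") as f:
        return json.load(f)


cached_species_data = load_cached_taxonomy("./taxonomy_data/mammalia_taxonomy.json")
if cached_species_data is None:
    print("No cached taxonomy found; run ncbi_fetch_species (Step 1) first.")
else:
    print(f"Loaded {len(cached_species_data)} species from the cached file.")
```

Returning `None` instead of raising an error lets you decide in the notebook whether to rerun the (slow) NCBI taxonomy search.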

## <font color=#c994c7>Step 2: Query NCBI For a Protein Group/Family of Interest</font>

In [3]:
# Convert all the species we found in our NCBI taxonomy search to a list.
# The species names are the KEYS in the dictionary we created during the search.
species_list = list(species_data.keys())
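
The query cell in this step passes only a slice of the list (`species_list[0:50]`). If you later want to work through the whole list in pieces, one option is to split it into fixed-size batches. This is only a sketch: `chunk_list` is a hypothetical helper, the species names are placeholder examples, and whether `ncbi_mine_seq_data` should be called once per batch is an assumption you would need to check against its documentation.

```python
def chunk_list(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder species names standing in for the real species_list
example_species = ["Homo sapiens", "Mus musculus", "Bos taurus", "Canis lupus", "Felis catus"]
batches = list(chunk_list(example_species, 2))

for i, batch in enumerate(batches, start=1):
    print(f"Batch {i}: {batch}")
```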

In [4]:
query = "(opsin[Title] OR rhodopsin[Title] OR OPN[Title] OR rh1[Title] OR rh2[Title] OR Rh1[Title] OR Rh2[Title]) NOT partial[Title] NOT voucher[All Fields] NOT kinase[All Fields] NOT kinase-like[All Fields] NOT similar[Title] NOT homolog[Title] NOT opsin-like[Title]"

# Remember, if you hover your mouse over any of MY functions then you should get a pop-up with more information on how to use them
ncbi_query_df, query_report_dir = ncbi_mine_seq_data(email=email, job_label='ncbi_mammalia_opsins', out='ncbi_mammalia_opsins', species_list=species_list[0:50], taxa_dictionary=species_data, query=query)

Creating Job Directory

Saving Species Query List to Text

Starting Queries to NCBI for DNA/Protein Sequences



  0% (0 of 50) |                         | Elapsed Time: 0:00:00 ETA:  --:--:--
  2% (1 of 50) |                         | Elapsed Time: 0:00:01 ETA:   0:01:02
  4% (2 of 50) |#                        | Elapsed Time: 0:00:02 ETA:   0:01:02
  6% (3 of 50) |#                        | Elapsed Time: 0:00:03 ETA:   0:01:02
  8% (4 of 50) |##                       | Elapsed Time: 0:00:07 ETA:   0:01:24
 10% (5 of 50) |##                       | Elapsed Time: 0:00:10 ETA:   0:01:32
 12% (6 of 50) |###                      | Elapsed Time: 0:00:18 ETA:   0:02:16
 14% (7 of 50) |###                      | Elapsed Time: 0:00:20 ETA:   0:02:04
 16% (8 of 50) |####                     | Elapsed Time: 0:00:21 ETA:   0:01:53
 18% (9 of 50) |####                     | Elapsed Time: 0:00:23 ETA:   0:01:45
 20% (10 of 50) |####                    | Elapsed Time: 0:00:24 ETA:   0:01:38
 22% (11 of 50) |#####                   | Elapsed Time: 0:00:26 ETA:   0:01:32
 24% (12 of 50) |#####                  

NCBI Queries Complete!
Now Extracting and Formatting Results For DataFrame...

DataFrame Formatted and Saved to CSV file for future use :)

FASTA File Saved...

Saving txt file with names of species that retrieved no results...
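
The long query string used in this step can be hard to edit by hand. As an illustration, the sketch below assembles the same string from lists of terms, so that adding or removing a gene name or an exclusion is a one-line change. `build_ncbi_query` is a hypothetical helper, not an MnM function; the field tags (`[Title]`, `[All Fields]`) are the ones already used in the query above.

```python
def build_ncbi_query(include_terms, exclusions):
    """Build an NCBI search string.

    include_terms: terms OR-ed together, each restricted to the [Title] field.
    exclusions:    (term, field) pairs, each added as 'NOT term[field]'.
    """
    include = " OR ".join(f"{term}[Title]" for term in include_terms)
    exclude = " ".join(f"NOT {term}[{field}]" for term, field in exclusions)
    return f"({include}) {exclude}".strip()


include_terms = ["opsin", "rhodopsin", "OPN", "rh1", "rh2", "Rh1", "Rh2"]
exclusions = [
    ("partial", "Title"),
    ("voucher", "All Fields"),
    ("kinase", "All Fields"),
    ("kinase-like", "All Fields"),
    ("similar", "Title"),
    ("homolog", "Title"),
    ("opsin-like", "Title"),
]

rebuilt_query = build_ncbi_query(include_terms, exclusions)
print(rebuilt_query)
```

To target a different protein family, swap the entries in `include_terms` and adjust `exclusions` as needed.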



## <font color=#c994c7>Step 3: Extract Candidate Genes From a Transcriptome of Interest Using Queried Data</font>

In [6]:
import subprocess  # For running external programs like BLAST
import datetime    # For timestamping output filenames
import os          # For interacting with the filesystem

from blast_matching_functions import format_blast_db, run_blast_query

In [9]:
DB_folder = query_report_dir
QUERY_folder = "./transcriptomes/ostracod_seqData/aa"
results_folder = './blast_results'
job_name = 'opsin_trans_mining'

# Create the results folder if it does not already exist
os.makedirs(results_folder, exist_ok=True)

# blast parameters
evalue = "1e-5"
outfmt = "6"
blasttyp = 'blastp'

print("Checking DB folder:", DB_folder)
print("Checking QUERY folder:", QUERY_folder)

for db in os.listdir(DB_folder):
    if db.endswith('.fasta'):
        print("Found DB file:", db)
        db_path = os.path.join(DB_folder, db)
        # Format DB and get base path
        format_blast_db(db_path)
        db_base = os.path.splitext(db_path)[0]
    
        for query in os.listdir(QUERY_folder):
            if query.lower().endswith(('.fasta', '.fa', '.faa', '.pep')):
                print("Found QUERY file:", query)

                query_path = os.path.join(QUERY_folder, query)


                # Output file
                query_name = os.path.splitext(query)[0]
                time_stamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
                output_file = os.path.join(results_folder, f"{job_name}_{blasttyp}_{query_name}_{time_stamp}.txt")

                run_blast_query(blasttyp, query_path, db_base, output_file, evalue, outfmt)

Checking DB folder: mnm_data/mnm_on_ncbi_mammalia_opsins_2025-05-20_17-00-42
Checking QUERY folder: ./transcriptomes/ostracod_seqData/aa
Found DB file: mined_ncbi_mammalia_opsins_seqs.fasta
Formatting BLAST DB for: mnm_data/mnm_on_ncbi_mammalia_opsins_2025-05-20_17-00-42\mined_ncbi_mammalia_opsins_seqs.fasta
Database formatted successfully.
Found QUERY file: Actinoseta_chelisparsa.fasta.transdecoder.pep
Running blastp...
blastp search completed successfully!
Results saved in: ./blast_results\opsin_trans_mining_blastp_Actinoseta_chelisparsa.fasta.transdecoder_2025-05-20_17-12-45.txt

=== Preview of Results ===

Found QUERY file: Actinoseta_jonesi.fasta.transdecoder.pep
Running blastp...
blastp search completed successfully!
Results saved in: ./blast_results\opsin_trans_mining_blastp_Actinoseta_jonesi.fasta.transdecoder_2025-05-20_17-12-46.txt

=== Preview of Results ===
TRINITY_DN105_c0_g1_i1.p1	XM_004483219.2	30.233	215	145	3	7	220	152	362	2.13e-31	109
TRINITY_DN105_c0_g1_i1.p1	XM_0044

KeyboardInterrupt: 
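
The result files are plain tab-separated text, because `outfmt = "6"` asks BLAST for tabular output. A minimal sketch for reading such a file into Python is shown below. It assumes `run_blast_query` uses the default 12 columns of BLAST format 6 (no custom column list), and `parse_blast6` is a hypothetical helper, not an MnM function. The sample text reuses the hit shown in the preview above, followed by an incomplete row to show that malformed lines are skipped.

```python
import csv
import io

# Default column order for BLAST tabular output (-outfmt 6)
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_blast6(handle):
    """Parse tab-separated BLAST hits from an open text handle into a list of dicts.

    Rows that do not have exactly 12 fields (e.g. truncated lines) are skipped.
    """
    hits = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(BLAST6_COLUMNS):
            continue
        hit = dict(zip(BLAST6_COLUMNS, row))
        for key in INT_COLUMNS:
            hit[key] = int(hit[key])
        for key in FLOAT_COLUMNS:
            hit[key] = float(hit[key])
        hits.append(hit)
    return hits


sample_text = (
    "TRINITY_DN105_c0_g1_i1.p1\tXM_004483219.2\t30.233\t215\t145\t3\t7\t220\t152\t362\t2.13e-31\t109\n"
    "TRINITY_DN105_c0_g1_i1.p1\tXM_0044\n"  # incomplete row, skipped by the parser
)
hits = parse_blast6(io.StringIO(sample_text))

for hit in hits:
    print(hit["qseqid"], hit["sseqid"], hit["pident"], hit["evalue"])
```

To read a real result file, pass an open file instead of the `io.StringIO` sample, for example `with open(output_file) as f: hits = parse_blast6(f)`.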