# Family 3 saccharide BGCs - Visualising Neighbouring Genes (Operon)

## Overview:
- Importing encapsulin csv files.
- Identifying and selecting the relevant Family 3 saccharide BGC information.
- Formatting Family 3 saccharide BGC dataframes:
    - ...

## Importing encapsulin csv files:

**Import the following files:**
- **enc = dataframe with all putative encapsulins discovered.**
- **enc_operons = dataframe with all putative encapsulins discovered, together with their surrounding genes and associated Pfam annotations.**

In [1]:
import pandas as pd

#define file paths
enc_path = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\csv\complete_enc_discovery_and_operons\encapsulin_families.csv'
enc_operons_path = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\csv\complete_enc_discovery_and_operons\operon_df_filtered.csv'

#LEARNING POINT: the 'r' prefix is required before the string for python to treat the backslashes as literal characters instead of escape characters.

#load the encapsulin_families.csv and operon_df_filtered.csv files
enc = pd.read_csv(enc_path)
enc_operons = pd.read_csv(enc_operons_path)
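
The learning point above can be checked on a short string. This is a minimal sketch: the path `C:\new\table.csv` is hypothetical and chosen only because `\n` and `\t` are valid escape sequences. `PureWindowsPath` from the standard library is shown as one possible way to join path parts without typing separators by hand.

```python
from pathlib import PureWindowsPath

# Hypothetical path, used only for illustration.
raw = r'C:\new\table.csv'        # raw string: backslashes are kept literally
escaped = 'C:\\new\\table.csv'   # equivalent: each backslash escaped by hand
broken = 'C:\new\table.csv'      # '\n' and '\t' are read as newline and tab

print(raw == escaped)
print('\n' in broken, '\t' in broken)

# Joining with the / operator avoids writing separators by hand.
folder = PureWindowsPath(r'C:\new')
joined = folder / 'table.csv'
print(str(joined) == raw)
```

The raw string and the hand-escaped string compare equal, while the unprefixed string silently contains a newline and a tab.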

## Identifying and selecting family 3 saccharide BGC information:

**We are only interested in family 3 saccharide BGC encapsulins, hence we selected these encapsulins from the complete putative encapsulin dataframe.**

In [2]:
import pandas as pd
import numpy as np

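#select family 3 saccharide BGCs from encapsulin families (rows 173 to 194 of the table)
print("Selecting DataFrame Rows...")
#reset the index on a new object rather than in place on a slice, which avoids a SettingWithCopyWarning
sacc_BGCs = enc.iloc[173:195].reset_index(drop=True)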

#display the number of saccharide BGC enc. 
encapsulin_column = 'Encapsulin MGYP'
sacc_BGC_enc_count = len(sacc_BGCs[encapsulin_column])
print(f"Number of saccharide BGC encapsulins: {sacc_BGC_enc_count}")

#checkpoint print of dataframe
sacc_BGCs_columns_display = ['Encapsulin MGYP', 'Cargo Description']
print(sacc_BGCs[sacc_BGCs_columns_display])

Selecting DataFrame Rows...
Number of saccharide BGC encapsulins: 22
     Encapsulin MGYP                             Cargo Description
0   MGYP001216717877                                Saccharide BGC
1   MGYP001412933479                                Saccharide BGC
2   MGYP003109322860                                Saccharide BGC
3   MGYP003110882604                                Saccharide BGC
4   MGYP003113059926                                Saccharide BGC
5   MGYP003131024615                                Saccharide BGC
6   MGYP003131986813                                Saccharide BGC
7   MGYP003134444350                                Saccharide BGC
8   MGYP003331015935                                Saccharide BGC
9   MGYP003332394819                                Saccharide BGC
10  MGYP003341041167                                Saccharide BGC
11  MGYP003626144734                                Saccharide BGC
12  MGYP003626701920                                Sacchari
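
Selecting by row position depends on the row order of the csv file. A possible alternative is to select by label. The sketch below uses a toy table with made-up accessions and assumes that the `Cargo Description` value `Saccharide BGC` is enough to identify the rows of interest, which may not hold for the real table (for example if other families share that label).

```python
import pandas as pd

# Toy stand-in for the encapsulin table; accessions and labels are made up.
toy_enc = pd.DataFrame({
    'Encapsulin MGYP': ['MGYP_A', 'MGYP_B', 'MGYP_C', 'MGYP_D'],
    'Cargo Description': ['Ferritin', 'Saccharide BGC', 'Saccharide BGC', 'Peroxidase'],
})

# Assumption: the cargo label alone identifies the rows of interest.
is_sacc = toy_enc['Cargo Description'] == 'Saccharide BGC'
toy_sacc = toy_enc[is_sacc].reset_index(drop=True)

print(f"Number of selected rows: {len(toy_sacc)}")
print(toy_sacc['Encapsulin MGYP'].tolist())
```

A boolean mask keeps working if rows are added or reordered upstream, whereas `iloc[173:195]` would need to be updated by hand.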

**From the complete encapsulin genome neighbourhood (-10 to +10) DataFrame, we select only the family 3 saccharide BGC encapsulins of interest.**

In [3]:
import pandas as pd

#define saccharide BGC enc. variable
list_of_sacc_BGCs = sacc_BGCs['Encapsulin MGYP']

#searching and selecting family 3 sacc. BGC operons
#reset the index on the filtered result directly instead of in place, which avoids a SettingWithCopyWarning
sacc_BGCs_operon_gn_pfam = enc_operons[enc_operons['Encapsulin MGYP'].isin(list_of_sacc_BGCs)].reset_index(drop=True)

#validation of dataframe formation
print("First Few Row:")
print(sacc_BGCs_operon_gn_pfam.head(3))

First Few Row:
   -10   -9   -8   -7   -6                -5                -4  \
0  NaN  NaN  NaN  NaN  NaN  MGYP001273916520  MGYP001327798979   
1  NaN  NaN  NaN  NaN  NaN               NaN               NaN   
2  NaN  NaN  NaN  NaN  NaN               NaN               NaN   

                 -3                -2                -1  ...      Pfam 1  \
0  MGYP001230219023  MGYP001223357618  MGYP001284084291  ...          []   
1               NaN  MGYP003122684177  MGYP001430673539  ...  ['CL0072']   
2               NaN               NaN  MGYP003121124808  ...          []   

       Pfam 2      Pfam 3 Pfam 4 Pfam 5      Pfam 6                Pfam 7  \
0  ['CL0123']          []     []     []  ['CL0159']                    []   
1          []          []     []     []          []                    []   
2          []  ['CL0159']     []     []          []  ['CL0057', 'CL0057']   

        Pfam 8       Pfam 9      Pfam 10  
0           []  ['PF04321']  ['PF00535']  
1           []      
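
After an `isin` filter, the number of rows returned does not have to equal the number of accessions searched for: an accession may match several neighbourhood rows, or none. The sketch below shows one way to check this on toy data. All accessions are made up, and only the `Encapsulin MGYP` column name is taken from the notebook.

```python
import pandas as pd

# Toy data; accessions are made up for illustration.
toy_targets = pd.Series(['MGYP_B', 'MGYP_C', 'MGYP_E'], name='Encapsulin MGYP')
toy_operons = pd.DataFrame({
    '-1': ['MGYP_1', 'MGYP_2', 'MGYP_3', 'MGYP_4'],
    'Encapsulin MGYP': ['MGYP_A', 'MGYP_B', 'MGYP_B', 'MGYP_C'],
})

# Same filtering pattern as in the cell above.
selected = toy_operons[toy_operons['Encapsulin MGYP'].isin(toy_targets)]
selected = selected.reset_index(drop=True)

# How many neighbourhood rows did each accession match?
rows_per_enc = selected['Encapsulin MGYP'].value_counts()

# Which accessions matched nothing at all?
missing = sorted(set(toy_targets) - set(selected['Encapsulin MGYP']))

print(rows_per_enc)
print(f"Accessions without a neighbourhood row: {missing}")
```

Running the same two checks on the real DataFrames would show whether each selected encapsulin has exactly one neighbourhood row.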

## Producing new DataFrames for each putative family 3 saccharide BGC encapsulin:

**Reformatting the genome neighbourhood DataFrame into individual encapsulin DataFrames with MGYP accessions and Pfam information.**

In [None]:
import os
import pandas as pd

#assumes the original DataFrame sacc_BGCs_operon_gn_pfam from the previous cell

#folder in which one csv file per genome neighbourhood is saved
folder_path = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\csv\sacc_BGC_enc_operons'

#genome neighbourhood positions on either side of the encapsulin
neighbour_positions = [*range(-10, 0), *range(1, 11)]

#create an empty dictionary to store DataFrames
r_sacc_BGC_operon_gn_pfam = {}

#iterate through every row (one genome neighbourhood per row) instead of a hard-coded count
for i in range(len(sacc_BGCs_operon_gn_pfam)):
    #extract data for the current row, assuming columns 0-19 hold the neighbouring genes,
    #column 20 the encapsulin and columns 21 onwards the Pfam annotations
    row = sacc_BGCs_operon_gn_pfam.iloc[i]
    neighbours = pd.DataFrame({
        'Genome Neighbourhood Location': neighbour_positions,
        'MGYP accessions': row.iloc[0:20].values,
        'Pfam': row.iloc[21:].values
    })

    #build the encapsulin row (operon position = 0)
    enc_row = pd.DataFrame({
        'Genome Neighbourhood Location': ['0 (Encapsulin_MGYP)'],
        'MGYP accessions': [row.iloc[20]],
        'Pfam': ['[MISANNOTATED_Pfam_PHAGE]']
    })

    #insert the encapsulin row between positions -1 and +1, so that the +1 gene is not overwritten
    r_sacc_BGC_operon_gn_pfam[i] = pd.concat([neighbours.iloc[:10], enc_row, neighbours.iloc[10:]], ignore_index=True)

    #save the DataFrame as a csv file for each genome neighbourhood
    file_name = f'sacc_BGC_operon_{i}.csv'
    saving_path = os.path.join(folder_path, file_name)
    r_sacc_BGC_operon_gn_pfam[i].to_csv(saving_path, index=False)

    print(f"DataFrame for index {i} saved as {file_name}")
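
The reshaping step can also be written against column names instead of column positions, which makes it easier to see where the encapsulin (position 0) ends up. The sketch below uses a toy row with a -2 to +2 window. The accessions and Pfam placeholders are made up, and the `Pfam -2` style column names are an assumption about how the real Pfam columns are labelled.

```python
import pandas as pd

# Toy wide row with a +/-2 window; values and Pfam column names are assumptions.
wide_row = pd.Series({
    '-2': 'MGYP_1', '-1': 'MGYP_2', '1': 'MGYP_3', '2': 'MGYP_4',
    'Encapsulin MGYP': 'MGYP_ENC',
    'Pfam -2': "['PF_a']", 'Pfam -1': '[]', 'Pfam 1': '[]', 'Pfam 2': "['PF_b']",
})
positions = [-2, -1, 1, 2]

# One row per neighbouring gene, looked up by column name.
long_df = pd.DataFrame({
    'Genome Neighbourhood Location': positions,
    'MGYP accessions': [wide_row[str(p)] for p in positions],
    'Pfam': [wide_row[f'Pfam {p}'] for p in positions],
})

# The encapsulin itself is appended as position 0, then everything is sorted.
enc_row = pd.DataFrame({
    'Genome Neighbourhood Location': [0],
    'MGYP accessions': [wide_row['Encapsulin MGYP']],
    'Pfam': ['[]'],
})
long_df = (pd.concat([long_df, enc_row], ignore_index=True)
             .sort_values('Genome Neighbourhood Location')
             .reset_index(drop=True))

print(long_df)
```

Appending the encapsulin as an extra row keeps the genes at -1 and +1 intact, and keeping the location column numeric allows the final sort.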


**Issue: gene names for the Pfam accessions can be found in Naail's saccharide_BGCs.txt file, and some gene names from that file correspond to Pfam accessions that can be found in the DataFrame.**

**To be addressed later by looking into protein function annotation packages to enrich the dataset.**

## Protein Function annotations for genome neighbourhoods:

**Reformatting the saccharide_BGC_genes.fasta file into individual fasta files for each encapsulin.**

In [None]:
import os
from Bio import SeqIO
import pandas as pd

#define input and output fasta file paths
input_fasta_file = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\family_3_saccharide_BGCs_(fasta)\saccharide_BGC_genes.fasta'
output_folder = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\family_3_saccharide_BGCs_(fasta)\sacc_BGC_enc_operons'

#define input folder for csv file paths
input_folder = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\csv\sacc_BGC_enc_operons'

#make sure the output folder exists before writing to it
os.makedirs(output_folder, exist_ok=True)

#list the per-encapsulin csv files in the input folder
sacc_BGC_csv_files = [f for f in os.listdir(input_folder) if f.endswith('.csv')]

#iterate through each file
for csv_file in sacc_BGC_csv_files:
    #form the full path to each file
    csv_path = os.path.join(input_folder, csv_file)

    #read the csv file into a DataFrame
    df = pd.read_csv(csv_path)

    #create a dictionary to store MGYP accessions and protein sequences (empty positions are skipped)
    sequences_dict = {mgyp_accession: None for mgyp_accession in df['MGYP accessions'].dropna()}

    #iterate through the sequences in the input FASTA file
    for record in SeqIO.parse(input_fasta_file, 'fasta'):
        #record.id is the first word of the header without the leading '>', so it is compared directly;
        #the previous check for '>MGYP' never matched, which left every output file empty
        if record.id in sequences_dict:
            sequences_dict[record.id] = str(record.seq)  #convert the sequence to a string

    #create a new FASTA file with the desired sequences
    output_fasta_file = os.path.join(output_folder, f'output_{csv_file.replace(".csv", ".fasta")}')
    with open(output_fasta_file, 'w') as output_handle:
        for mgyp_accession, sequence in sequences_dict.items():
            if sequence is not None:
                output_handle.write(f'>{mgyp_accession}\n{sequence}\n')

    print(f'New FASTA file created: {output_fasta_file}')
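
The header matching can be tried out without the real FASTA file. The sketch below is a dependency-free stand-in for the parsing step, not Biopython's parser: `read_fasta` is a hypothetical helper, and the headers and sequences are made up. Like `record.id` in Biopython, it takes the identifier to be the first word of the header with the leading `>` removed.

```python
import io

def read_fasta(handle):
    """Yield (record_id, sequence) pairs from an open FASTA handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if record_id is not None:
                yield record_id, ''.join(chunks)
            # Identifier = first whitespace-delimited word after '>'.
            words = line[1:].split()
            record_id, chunks = (words[0] if words else ''), []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, ''.join(chunks)

# Toy FASTA content; headers and sequences are made up.
toy_fasta = io.StringIO(
    '>MGYP_1 hypothetical protein\nMKT\nAAL\n'
    '>MGYP_2\nMSS\n'
    '>MGYP_3\nMQQ\n'
)
wanted = {'MGYP_1', 'MGYP_3'}

# Read the file once and keep only the wanted accessions.
subset = {rid: seq for rid, seq in read_fasta(toy_fasta) if rid in wanted}
print(subset)
```

Because the identifiers never contain `>`, membership in the set of wanted accessions is the only test needed. Reading the FASTA file once and reusing the result for every csv file would also avoid re-parsing it inside the loop.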


In [6]:
import os
from Bio import SeqIO
import pandas as pd

#define input and output fasta file paths
input_fasta_file = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\family_3_saccharide_BGCs_(fasta)\saccharide_BGC_genes.fasta'
output_folder = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\family_3_saccharide_BGCs_(fasta)\sacc_BGC_enc_operons'

#define input folder for csv file paths
input_folder = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\csv\sacc_BGC_enc_operons'

#list 'saccharide BGC encapsulins' dataframe with file paths
sacc_BGC_csv_files = [f for f in os.listdir(input_folder) if f.endswith('.csv')]

#iterate through each file
for csv_file in sacc_BGC_csv_files:
    #form the full path to each file
    csv_path = os.path.join(input_folder, csv_file)

    #read the csv file into a DataFrame
    df1 = pd.read_csv(csv_path)



**Sequence-based protein function prediction packages:**

1. EggNOG (Cantalapiedra et al., 2021): assigns functional annotations to protein sequences based on orthology.

2. InterProScan (Jones et al., 2014): searches protein sequences against multiple databases and predicts functional domains and sites.

3. DeepFRI-seq (Gligorijević et al., 2021): predicts protein function using deep learning models trained on protein sequences.
