# Family 3 saccharide BGCs - Visualising Neighbouring Genes (Operon)

## Overview:
- Importing encapsulin csv files.
- Identifying and selecting the relevant Family 3 saccharide BGC information.
- Formatting Family 3 saccharide BGC dataframes:
    - Inserting Family 3 encapsulin info into the +10 to -10 genome neighbourhood.
- ALTERNATE:
    - Use ESM Atlas to fetch the sequences for the corresponding MGYP accession numbers.
    - BEFORE: Script to check accessibility of sequences.
- Obtain the corresponding gene description and location from MGYP.
    - Useful website: https://esmatlas.com/resources?action=atlas_lookup
- Add the information to the DataFrame.
- Using Data_Features_Viewer to visualise the annotated sequences.

## Importing encapsulin csv files:

In [3]:
import pandas as pd

#define file paths
enc_path = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\csv\encapsulin_families.csv'
enc_operons_path = r'C:\Users\Cameron\OneDrive - University College London\PhD\Year 1\ENCAPSULIN BIOINFORMATICS AND METAGENOMICS\encapsulin_bioinformatics_repo\csv\operon_df_filtered.csv'

#LEARNING POINT: the 'r' prefix makes Python treat the backslashes in the string as literal characters instead of escape characters.

#load encapsulin_families.csv file
enc = pd.read_csv(enc_path)
enc_operons = pd.read_csv(enc_operons_path)
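
The absolute Windows paths above only resolve on one machine. A minimal sketch of a more portable alternative follows, assuming the notebook is run from a folder that contains the `csv` directory. The `repo_root` variable and the `load_csv` helper are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
from pathlib import Path

import pandas as pd

# ASSUMPTION: the repository root is the current working directory and the
# csv files live in a 'csv' subfolder, mirroring the tail of the paths above.
repo_root = Path.cwd()


def load_csv(root, filename):
    """Hypothetical helper: read <root>/csv/<filename>, failing early if it is missing."""
    csv_path = Path(root) / "csv" / filename
    if not csv_path.is_file():
        raise FileNotFoundError(f"Expected csv file not found: {csv_path}")
    return pd.read_csv(csv_path)
```

Under that assumption, `enc = load_csv(repo_root, "encapsulin_families.csv")` would replace the hard-coded path, and `pathlib` handles the path separators so no raw-string prefix is needed.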


## Identifying and selecting family 3 saccharide BGC information:

In [9]:
import pandas as pd
import numpy as np

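#select family 3 saccharide BGCs from encapsulin families
#NOTE: the row positions 173-194 are hard-coded and specific to this version of the csv
print("Selecting DataFrame Rows...")
#.copy() gives an independent DataFrame rather than a slice of enc
sacc_BGCs = enc.iloc[173:195].copy()
sacc_BGCs.reset_index(drop=True, inplace=True)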

#display the number of saccharide BGC enc. 
number_of_enc = len(sacc_BGCs[0:21])
print("Number of Saccharide BGC Encapsulins:", number_of_enc)


sacc_BGCs_columns_display = ['Encapsulin MGYP', 'Cargo Description']
print(sacc_BGCs[sacc_BGCs_columns_display])


Selecting DataFrame Rows...
Number of Saccharide BGC Encapsulins: 21
     Encapsulin MGYP                             Cargo Description
0   MGYP001216717877                                Saccharide BGC
1   MGYP001412933479                                Saccharide BGC
2   MGYP003109322860                                Saccharide BGC
3   MGYP003110882604                                Saccharide BGC
4   MGYP003113059926                                Saccharide BGC
5   MGYP003131024615                                Saccharide BGC
6   MGYP003131986813                                Saccharide BGC
7   MGYP003134444350                                Saccharide BGC
8   MGYP003331015935                                Saccharide BGC
9   MGYP003332394819                                Saccharide BGC
10  MGYP003341041167                                Saccharide BGC
11  MGYP003626144734                                Saccharide BGC
12  MGYP003626701920                                Sacchari
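
The selection above depends on fixed row positions, which would silently pick the wrong rows if the csv were re-sorted or extended. A hedged sketch of a label-based alternative is shown below. It assumes that the `'Cargo Description'` value `'Saccharide BGC'` is enough to identify the rows of interest, which the notebook does not confirm (other families could in principle carry the same label). The `select_by_cargo` helper and the toy data are hypothetical.

```python
import pandas as pd


def select_by_cargo(enc_df, cargo_label="Saccharide BGC"):
    """Return the rows whose 'Cargo Description' equals cargo_label, re-indexed from 0."""
    mask = enc_df["Cargo Description"] == cargo_label
    return enc_df.loc[mask].reset_index(drop=True)


# Toy data standing in for encapsulin_families.csv (illustrative values only).
toy_enc = pd.DataFrame({
    "Encapsulin MGYP": ["MGYP_A", "MGYP_B", "MGYP_C"],
    "Cargo Description": ["Saccharide BGC", "Other cargo", "Saccharide BGC"],
})
toy_sacc = select_by_cargo(toy_enc)
print("Number of Saccharide BGC Encapsulins:", len(toy_sacc))
```

Counting with `len()` on the filtered frame also avoids a second hard-coded slice when reporting the number of encapsulins.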

In [5]:
import pandas as pd
import numpy as np

#display the enc_operon columns for verification
print(enc_operons.columns)

#reforming operon dataframe - selecting '+10 to -10' and 'Encapsulin MGYP' columns
#gn = genome neighbourhood (surrounding genes)
enc_operon_gn = enc_operons[['-10', '-9', '-8', '-7', '-6', '-5', '-4', '-3', '-2',
     '-1', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'Encapsulin MGYP']]

#define saccharide BGC enc. variable
list_of_sacc_BGCs = sacc_BGCs['Encapsulin MGYP']

#searching and selecting family 3 sacc. BGC operons
sacc_BGCs_operon_gn = enc_operon_gn[enc_operon_gn['Encapsulin MGYP'].isin(list_of_sacc_BGCs)]
sacc_BGCs_operon_gn.reset_index(drop=True, inplace=True)


#validation of dataframe formation
print("First Few Row:")
print(sacc_BGCs_operon_gn.head(3))

Index(['-10', '-9', '-8', '-7', '-6', '-5', '-4', '-3', '-2', '-1', '1', '2',
       '3', '4', '5', '6', '7', '8', '9', '10', 'Encapsulin MGYP', 'Pfam -10',
       'Pfam -9', 'Pfam -8', 'Pfam -7', 'Pfam -6', 'Pfam -5', 'Pfam -4',
       'Pfam -3', 'Pfam -2', 'Pfam -1', 'Pfam 1', 'Pfam 2', 'Pfam 3', 'Pfam 4',
       'Pfam 5', 'Pfam 6', 'Pfam 7', 'Pfam 8', 'Pfam 9', 'Pfam 10'],
      dtype='object')
First Few Row:
   -10   -9   -8   -7   -6                -5                -4  \
0  NaN  NaN  NaN  NaN  NaN  MGYP001273916520  MGYP001327798979   
1  NaN  NaN  NaN  NaN  NaN               NaN               NaN   
2  NaN  NaN  NaN  NaN  NaN               NaN               NaN   

                 -3                -2                -1  ...  \
0  MGYP001230219023  MGYP001223357618  MGYP001284084291  ...   
1               NaN  MGYP003122684177  MGYP001430673539  ...   
2               NaN               NaN  MGYP003121124808  ...   

                  2                 3                 4       
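
The twenty position labels in the cell above follow a regular pattern, so they can be generated rather than typed out. The sketch below is one possible way to do that and to check that every expected column is present before selecting. `position_labels` and `select_neighbourhood` are hypothetical helper names, and the default span of 10 is taken from the column names printed above.

```python
import pandas as pd


def position_labels(span=10):
    """Labels '-span' ... '-1', '1' ... 'span' (there is no position 0)."""
    return [str(i) for i in range(-span, span + 1) if i != 0]


def select_neighbourhood(operons_df, span=10, id_column="Encapsulin MGYP"):
    """Select the position columns plus the encapsulin column, in that order."""
    wanted = position_labels(span) + [id_column]
    missing = [c for c in wanted if c not in operons_df.columns]
    if missing:
        # Fail with a readable message instead of a long pandas KeyError.
        raise KeyError(f"Columns missing from operon table: {missing}")
    return operons_df[wanted]
```

With the notebook's data this would be called as `select_neighbourhood(enc_operons)`, leaving the `Pfam` columns behind exactly as the explicit list does.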

## Formatting family 3 saccharide BGC dataframes:

In [6]:
import pandas as pd
import numpy as np

#sacc_BGCs_operon_gn is the DataFrame built in the previous cell
#define the column to move and the new position
enc_column_to_move = 'Encapsulin MGYP'
new_position = 10  #index 10 sits between the '-1' and '1' columns

#get the list of columns
enc_column = list(sacc_BGCs_operon_gn.columns)

#remove the column from its current position
enc_column.remove(enc_column_to_move)

#insert the column at the new position
enc_column.insert(new_position, enc_column_to_move)

#create the DataFrame with the new column position
print("Forming DataFrame...")
sacc_BGCs_operon_gn = sacc_BGCs_operon_gn[enc_column]

#validation of dataframe formation
print("First Few Row:")
print(sacc_BGCs_operon_gn.head(3))


Forming DataFrame...
First Few Row:
   -10   -9   -8   -7   -6                -5                -4  \
0  NaN  NaN  NaN  NaN  NaN  MGYP001273916520  MGYP001327798979   
1  NaN  NaN  NaN  NaN  NaN               NaN               NaN   
2  NaN  NaN  NaN  NaN  NaN               NaN               NaN   

                 -3                -2                -1  ...  \
0  MGYP001230219023  MGYP001223357618  MGYP001284084291  ...   
1               NaN  MGYP003122684177  MGYP001430673539  ...   
2               NaN               NaN  MGYP003121124808  ...   

                  1                 2                 3                 4  \
0  MGYP001325026281  MGYP001258357109  MGYP001460135749  MGYP001461491847   
1  MGYP003112855075  MGYP003122684323  MGYP003109280648  MGYP003108657726   
2  MGYP001166445487  MGYP001425121637  MGYP003121124990  MGYP003121125013   

                  5                 6                 7                 8  \
0  MGYP001225842149  MGYP001217703468  MGYP001164885470 
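
The reordering step above relies on the position index 10 matching the number of upstream columns. A small sketch of a variant that derives the position from the column labels instead is given below. `centre_column` is a hypothetical helper, and it assumes that upstream positions are exactly the columns whose labels start with `'-'`.

```python
import pandas as pd


def centre_column(df, column="Encapsulin MGYP"):
    """Return df with `column` placed after the upstream ('-N') columns."""
    others = [c for c in df.columns if c != column]
    # ASSUMPTION: a leading '-' marks an upstream position label.
    upstream = [c for c in others if str(c).startswith("-")]
    downstream = [c for c in others if not str(c).startswith("-")]
    return df[upstream + [column] + downstream]


# Toy neighbourhood with a span of 2 (illustrative values only).
toy_gn = pd.DataFrame(
    [["g1", "g2", "g3", "g4", "enc"]],
    columns=["-2", "-1", "1", "2", "Encapsulin MGYP"],
)
print(list(centre_column(toy_gn).columns))
```

Because the split is computed from the labels, the same function would keep working if the neighbourhood were widened or narrowed.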

## Checking available MGYP accession sequences on ESM Atlas:

In [7]:
import pandas as pd
import requests
import urllib3

#suppress only the InsecureRequestWarning raised because SSL verification is disabled below
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def check_sequence_availability(accession_series):
    """Return (available, unavailable) accession lists based on the ESM Atlas status code."""
    available_sequences = []
    unavailable_sequences = []

    print(f"Total Accession Numbers: {len(accession_series)}")

    for accession_number in accession_series:
        if pd.notna(accession_number):  #check if the value is not NaN
            api_url = f'https://api.esmatlas.com/fetchSequence/{accession_number}'
            #SSL verification is disabled; the timeout stops a stalled request from hanging the loop
            response = requests.get(api_url, verify=False, timeout=30)

            print(f"Accession: {accession_number}, Status Code: {response.status_code}")

            if response.status_code == 200:
                available_sequences.append(accession_number)
            else:
                unavailable_sequences.append(accession_number)

    return available_sequences, unavailable_sequences

# Example usage:
#flatten the neighbourhood table into a single Series of MGYP accession numbers
MGYP_accession_numbers = sacc_BGCs_operon_gn.stack()

available_sequences, unavailable_sequences = check_sequence_availability(MGYP_accession_numbers)

#write the available accession numbers to a text file, one per line
with open('available_MGYP_accession_numbers.txt', 'w') as output_file:
    output_file.write('\n'.join(available_sequences))

print(f"Available sequences: {len(available_sequences)}")
print(f"Unavailable sequences: {len(unavailable_sequences)}")
print(f"List of available sequences: {available_sequences}")
print(f"List of unavailable sequences: {unavailable_sequences}")


Total Accession Numbers: 474
Accession: MGYP001273916520, Status Code: 404
Accession: MGYP001327798979, Status Code: 200
Accession: MGYP001230219023, Status Code: 404
Accession: MGYP001223357618, Status Code: 404
Accession: MGYP001284084291, Status Code: 200
Accession: MGYP001178754852, Status Code: 404
Accession: MGYP001325026281, Status Code: 200
Accession: MGYP001258357109, Status Code: 404
Accession: MGYP001460135749, Status Code: 200
Accession: MGYP001461491847, Status Code: 200
Accession: MGYP001225842149, Status Code: 200
Accession: MGYP001217703468, Status Code: 404
Accession: MGYP001164885470, Status Code: 200
Accession: MGYP001281378533, Status Code: 404
Accession: MGYP001353276347, Status Code: 404
Accession: MGYP001316708597, Status Code: 404
Accession: MGYP003122684177, Status Code: 404
Accession: MGYP001430673539, Status Code: 404
Accession: MGYP001216717877, Status Code: 404
Accession: MGYP003112855075, Status Code: 404
Accession: MGYP003122684323, Status Code: 404
Acces
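
The check above sends one request per table cell, so an accession that appears in several neighbourhoods is queried more than once. The sketch below is one way to deduplicate first and to reuse a single connection. It assumes the same `fetchSequence` endpoint used in the cell above, and the helper names (`unique_accessions`, `http_status`, `split_by_availability`) are hypothetical. Unlike the original cell it leaves SSL verification at the `requests` default.

```python
import pandas as pd
import requests

# Endpoint pattern copied from the cell above.
ESM_FETCH_URL = "https://api.esmatlas.com/fetchSequence/{accession}"


def unique_accessions(neighbourhood_df):
    """Flatten the table and return each non-missing accession once, in first-seen order."""
    values = neighbourhood_df.to_numpy().ravel()
    return pd.Series(values).dropna().drop_duplicates().tolist()


def http_status(accession, session, timeout=30):
    """Status code for one accession, or None if the request itself fails."""
    try:
        response = session.get(ESM_FETCH_URL.format(accession=accession), timeout=timeout)
    except requests.RequestException:
        return None
    return response.status_code


def split_by_availability(accessions, get_status):
    """Partition accessions into (available, unavailable) using a status lookup callable."""
    available, unavailable = [], []
    for accession in accessions:
        target = available if get_status(accession) == 200 else unavailable
        target.append(accession)
    return available, unavailable


# Possible usage with the notebook's DataFrame (performs real network requests):
# with requests.Session() as session:
#     accessions = unique_accessions(sacc_BGCs_operon_gn)
#     available, unavailable = split_by_availability(
#         accessions, lambda acc: http_status(acc, session))
```

Passing the status lookup in as a callable keeps the partitioning logic testable without network access; a dictionary lookup can stand in for the real request.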