# Data preparation and processing notebook

Dataset descriptions and references: 

1) Yamagishi R, Kaneko H. Data from comprehensive analysis of nuclear localization signals. Data Brief. 2015 Dec 12;6:200-3. doi: 10.1016/j.dib.2015.11.064. PMID: 26862559; PMCID: PMC4707185.

    UniProt IDs were extracted from a spreadsheet in the supplementary materials of the paper cited above, which lists documented NLS sequences and the proteins that contain them. These IDs were submitted to the ID-mapping tool on the UniProt website, which returned a FASTA file with the full sequence of each protein. We then used the `SeqIO` module from Biopython to read the sequences from the FASTA file and write them to a CSV. Finally, this CSV was merged with the NLS table using pandas so that each protein, keyed by UniProt ID, carries both its NLS sequence and its full sequence.

2)  https://services.healthtech.dtu.dk/services/DeepLoc-2.0/#

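The FASTA-to-table steps described under reference 1 can be sketched roughly as follows. This is a minimal illustration, not the exact code that produced the CSV: the accessions and sequences are made-up toy values, `fasta_to_frame` is a hypothetical helper, and the `_nls` / `_full` suffixes are an assumption based on the `Sequence_nls` and `Sequence_full` column names used later in this notebook.

```python
from io import StringIO

import pandas as pd
from Bio import SeqIO  # Biopython


def fasta_to_frame(handle):
    """Parse a UniProt FASTA handle into a DataFrame of accession and sequence."""
    rows = []
    for record in SeqIO.parse(handle, "fasta"):
        # UniProt headers look like "sp|P04637|P53_HUMAN ..."; the accession
        # is the second "|"-separated field of record.id.
        parts = record.id.split("|")
        accession = parts[1] if len(parts) >= 3 else record.id
        rows.append({"UniProt ID": accession, "Sequence": str(record.seq)})
    return pd.DataFrame(rows, columns=["UniProt ID", "Sequence"])


# Toy stand-ins for the real files (IDs and sequences are invented).
toy_fasta = StringIO(
    ">sp|P00001|TOY1_HUMAN Toy protein 1\n"
    "MAAPKKKRKVGG\n"
    ">sp|P00002|TOY2_HUMAN Toy protein 2\n"
    "MSSKRPAATKKAGQAKKKKLE\n"
)
nls_table = pd.DataFrame(
    {
        "UniProt ID": ["P00001", "P00002"],
        "Sequence": ["PKKKRKV", "KRPAATKKAGQAKKKK"],
    }
)

full_table = fasta_to_frame(toy_fasta)

# Both frames have a "Sequence" column, so the suffixes keep them apart.
merged = nls_table.merge(
    full_table, on="UniProt ID", how="inner", suffixes=("_nls", "_full")
)
print(merged)
```

With real data, the `StringIO` object would be replaced by an open handle on the FASTA file downloaded from UniProt, and `nls_table` by the spreadsheet of documented NLS sequences.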



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
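# Import only the helpers this notebook uses instead of a wildcard import
from data_prep import generate_annotation, remove_bux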

In [None]:
df = pd.read_csv("../csv_files/SortingSignalsSwissprot.csv")
df["Types"].unique()

In [None]:
# Drop rows with specific values in the 'Types' column
values_to_drop = ['NLS_NES', 'TM_MT', np.nan, 'SP_TM_PTS', 'SP_PTS', 'SP_GPI', 'SP_TM']
df = df[~df['Types'].isin(values_to_drop)]

In [None]:
df = df.drop(columns="Kingdom")

In [None]:
print(df["Types"].unique())

type_counts = df['Types'].value_counts()
print(type_counts)

In [None]:
# Load the NLS table and align its columns with df before combining the two dataframes
df2 = pd.read_csv("../csv_files/finalized_complete_NLS_sequence_table.csv")

df2 = df2.rename(columns={"Sequence_full": "Sequence", "UniProt ID": "ACC"})

# Every protein in this table carries an NLS
df2["Types"] = "NLS"

df2["AnnotEncoded"] = df2.apply(generate_annotation, axis=1)

df2 = df2.drop(columns=["Name", "Begin", "End", "Length", "Evidence", "Sequence_nls", "ECO code"])

In [None]:
#Combining the two data frames
stacked_df = pd.concat([df, df2], ignore_index=True)

In [None]:
# Remove rows whose sequence contains "B", "U", or "X" (helper from data_prep)
cleaned_stacked_df = remove_bux(stacked_df)

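The source of `remove_bux` lives in `data_prep` and is not shown in this notebook. As a rough idea of what such a filter could look like, here is a hypothetical stand-in, `remove_bux_sketch`; the column name `Sequence` and the handling of missing values are assumptions, and the real helper may differ.

```python
import pandas as pd


def remove_bux_sketch(df, column="Sequence"):
    """Return a copy of df without rows whose sequence contains B, U, or X."""
    # na=False treats missing sequences as "no match", so those rows are kept.
    has_ambiguous = df[column].str.contains("[BUX]", regex=True, na=False)
    return df[~has_ambiguous].copy()


# Made-up toy data for illustration only.
toy = pd.DataFrame(
    {
        "ACC": ["A1", "A2", "A3", "A4"],
        "Sequence": ["MKV", "MXK", "MUB", "ACDE"],
    }
)
filtered = remove_bux_sketch(toy)
print(filtered["ACC"].tolist())  # ['A1', 'A4']
```

Returning a copy rather than a view means later column assignments on the result do not trigger pandas' `SettingWithCopyWarning`.
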
In [None]:
# Add a Length column holding the length of each full protein sequence
cleaned_stacked_df["Length"] = cleaned_stacked_df["Sequence"].str.len()

In [None]:
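# Remove duplicated rows; keep=False drops every copy of a duplicated row,
# not just the extra ones (use keep="first" to retain one copy)
df_cleaned = cleaned_stacked_df.drop_duplicates(keep=False)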

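The `keep` argument of `drop_duplicates` changes which rows survive, which matters for the final row count. A small illustration on made-up data:

```python
import pandas as pd

# Rows 0 and 1 are exact duplicates; row 2 is unique.
toy = pd.DataFrame(
    {
        "ACC": ["A1", "A1", "A2"],
        "Types": ["NLS", "NLS", "SP"],
        "Sequence": ["MKV", "MKV", "MAL"],
    }
)

all_copies_removed = toy.drop_duplicates(keep=False)
first_copy_kept = toy.drop_duplicates(keep="first")

print(all_copies_removed["ACC"].tolist())  # ['A2']
print(first_copy_kept["ACC"].tolist())     # ['A1', 'A2']
```

With `keep=False`, a protein that appears as an identical row more than once is removed from the dataset entirely.
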
In [None]:
# Save the final cleaned dataframe
df_cleaned.to_csv("finalized_df_cleaned.csv", index=False)