# Data Wrangle:

In this notebook I inspect a Kaggle dataset of amino acid sequences for pairs of proteins, along with whether or not the two proteins in each pair interact with one another.

The link to the dataset is attached here: https://www.kaggle.com/datasets/spandansureja/ppi-dataset?resource=download



I will first look at the data once it is loaded with pandas, and from there decide how the exploratory data analysis will go.

The preliminary goal is to see whether the presence, or the number, of aromatic residues (phenylalanine, tyrosine, tryptophan, histidine) is a good predictor of protein-protein interactions.
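As a rough illustration of the feature described above, here is a minimal sketch of counting aromatic residues in a sequence written in one-letter codes. The helper names `count_aromatic` and `aromatic_fraction` and the toy sequence are my own assumptions for illustration; they are not part of the dataset or of any library.

```python
# Aromatic residues as listed in the text above:
# F (phenylalanine), Y (tyrosine), W (tryptophan), H (histidine).
AROMATIC = frozenset("FYWH")


def count_aromatic(sequence: str) -> int:
    """Return the number of aromatic residues in a one-letter-code sequence."""
    return sum(1 for residue in sequence.upper() if residue in AROMATIC)


def aromatic_fraction(sequence: str) -> float:
    """Return the share of residues that are aromatic (0.0 for an empty string)."""
    return count_aromatic(sequence) / len(sequence) if sequence else 0.0


example = "MFYWHAAA"  # made-up toy sequence, not taken from the dataset
print(count_aromatic(example))     # 4
print(aromatic_fraction(example))  # 0.5
```

Using the fraction alongside the raw count may help separate "many aromatic residues" from "a long protein", but whether that matters here is something the EDA would need to show.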


The model would take the features of protein 1 and the features of protein 2 as input, and output a classification of 1 if the two proteins interact or 0 if they do not (logistic regression is most likely the best-suited ML algorithm for this binary question).
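To make the input/output shape of such a model concrete, here is a minimal sketch of how logistic regression turns a feature vector for one protein pair into a 0/1 label. The function name `predict_interaction`, the two features, and the weights are all hypothetical placeholders; the weights are not fitted to this dataset and say nothing about what a trained model would learn.

```python
import math


def predict_interaction(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one protein pair.

    The probability is the logistic (sigmoid) function applied to a
    weighted sum of the features; the label is 1 when it reaches the threshold.
    """
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability, int(probability >= threshold)


# Hypothetical features: aromatic residue counts of protein 1 and protein 2.
features = [12, 7]
# Placeholder weights and bias, NOT learned from the data.
weights = [0.1, -0.1]
bias = 0.0

probability, label = predict_interaction(features, weights, bias)
print(probability, label)
```

In practice the weights would be learned from the labelled pairs rather than set by hand.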


1.) Open the CSV of positive (interacting) protein pairs

2.) Open the CSV of negative (non-interacting) protein pairs

3.) Merge the two CSVs into one CSV

## Imports


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## 1.) Open the CSV containing positive protein-protein interactions

In [3]:
yes_df = pd.read_csv('positive_protein_sequences.csv')

In [6]:
yes_df.head()

Unnamed: 0,protein_sequences_1,protein_sequences_2
0,MESSKKMDSPGALQTNPPLKLHTDRSAGTPVFVPEQGGYKEKFVKT...,MARPHPWWLCVLGTLVGLSATPAPKSCPERHYWAQGKLCCQMCEPG...
1,MVMSSYMVNSKYVDPKFPPCEEYLQGGYLGEQGADYYGGGAQGADF...,MAENVVEPGPPSAKRPKLSSPALSASASDGTDFGSLFDLEHDLPDE...
2,MNRHLWKSQLCEMVQPSGGPAADQDVLGEESPLGKPAMLHLPSEQG...,MEGGRRARVVIESKRNFFLGAFPTPFPAEHVELGRLGDSETAMVPG...
3,MAPPSTREPRVLSATSATKSDGEMVLPGFPDADSFVKFALGSVVAV...,MLFYSFFKSLVGKDVVVELKNDLSICGTLHSVDQYLNIKLTDISVT...
4,MQSGPRPPLPAPGLALALTLTMLARLASAASFFGENHLEVPVATAL...,MQTIKCVVVGDGAVGKTCLLISYTTNKFPSEYVPTVFDNYAVTVMI...


In [5]:
yes_df.describe()

Unnamed: 0,protein_sequences_1,protein_sequences_2
count,36630,36630
unique,7590,6995
top,MVDREQLVQKARLAEQAERYDDMAAAMKNVTELNEPLSNEERNLLS...,MEAIAKYDFKATADDELSFKRGDILKVLNEECDQNWYKAELNGKDG...
freq,172,170
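The `describe()` output above shows far fewer unique sequences than rows, so the same proteins appear in many pairs. A minimal sketch of how one might check for fully repeated pairs and count how often each sequence occurs is below. The toy frame stands in for `yes_df` and its sequences are made up; only the column names come from the dataset.

```python
import pandas as pd

# Toy stand-in for yes_df; the sequences are made up for illustration.
toy_df = pd.DataFrame({
    "protein_sequences_1": ["MAAA", "MAAA", "MCCC", "MAAA"],
    "protein_sequences_2": ["MKKK", "MKKK", "MKKK", "MLLL"],
})

pair_cols = ["protein_sequences_1", "protein_sequences_2"]

# Rows whose (protein 1, protein 2) pair already appeared earlier in the frame.
n_duplicate_pairs = int(toy_df.duplicated(subset=pair_cols).sum())

# How often each individual sequence appears in the first column.
counts_1 = toy_df["protein_sequences_1"].value_counts()

print(n_duplicate_pairs)  # 1
print(counts_1["MAAA"])   # 3
```

Note that `duplicated` treats (A, B) and (B, A) as different pairs, so a symmetric check would need the two columns put into a consistent order first.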


In [9]:
yes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36630 entries, 0 to 36629
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   protein_sequences_1  36630 non-null  object
 1   protein_sequences_2  36630 non-null  object
dtypes: object(2)
memory usage: 572.5+ KB


In [10]:
yes_df.dtypes

protein_sequences_1    object
protein_sequences_2    object
dtype: object
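Both columns are `object` dtype, i.e. plain Python strings. Before deriving residue-based features, it may be worth checking that each sequence only uses the 20 standard one-letter amino acid codes. The sketch below is one way to do that; `nonstandard_residues` is a hypothetical helper, and I have not verified whether this dataset actually contains any non-standard letters.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def nonstandard_residues(sequence: str) -> set:
    """Return the set of characters in the sequence that are not standard codes."""
    return set(sequence.upper()) - STANDARD_AMINO_ACIDS


# Start of the first sequence shown in yes_df.head() above.
print(sorted(nonstandard_residues("MESSKKMDSPGALQTNPPLK")))  # []
# Made-up string containing the non-standard letters X and U.
print(sorted(nonstandard_residues("MAXU")))                  # ['U', 'X']
```

On a DataFrame this could be applied column-wise, for example with `Series.map`, to flag rows that need a closer look.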

## Add a third column indicating that all of these pairs are positive protein-protein interactions

In [11]:
# One label per row; 1 marks an interacting pair.
# len(yes_df) avoids hard-coding the row count (36630).
ones = [1] * len(yes_df)

In [12]:
yes_df['protein_interaction'] = ones

In [16]:
yes_df.head()

Unnamed: 0,protein_sequences_1,protein_sequences_2,protein_interaction
0,MESSKKMDSPGALQTNPPLKLHTDRSAGTPVFVPEQGGYKEKFVKT...,MARPHPWWLCVLGTLVGLSATPAPKSCPERHYWAQGKLCCQMCEPG...,1
1,MVMSSYMVNSKYVDPKFPPCEEYLQGGYLGEQGADYYGGGAQGADF...,MAENVVEPGPPSAKRPKLSSPALSASASDGTDFGSLFDLEHDLPDE...,1
2,MNRHLWKSQLCEMVQPSGGPAADQDVLGEESPLGKPAMLHLPSEQG...,MEGGRRARVVIESKRNFFLGAFPTPFPAEHVELGRLGDSETAMVPG...,1
3,MAPPSTREPRVLSATSATKSDGEMVLPGFPDADSFVKFALGSVVAV...,MLFYSFFKSLVGKDVVVELKNDLSICGTLHSVDQYLNIKLTDISVT...,1
4,MQSGPRPPLPAPGLALALTLTMLARLASAASFFGENHLEVPVATAL...,MQTIKCVVVGDGAVGKTCLLISYTTNKFPSEYVPTVFDNYAVTVMI...,1


## 2.) Open the CSV containing negative protein-protein interactions

In [17]:
no_df = pd.read_csv('negative_protein_sequences.csv')

In [18]:
no_df.describe()

Unnamed: 0,protein_sequences_1,protein_sequences_2
count,36480,36480
unique,1355,987
top,MMLQHPGQVSASEVSASAIVPCLSPPGSLVFEDFANLTPFVKEELR...,MSVSGLKAELKFLASIFDKNHERFRIVSWKLDELHCQFLVPQQGSP...
freq,50,82


In [20]:
no_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36480 entries, 0 to 36479
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   protein_sequences_1  36480 non-null  object
 1   protein_sequences_2  36480 non-null  object
dtypes: object(2)
memory usage: 570.1+ KB


In [21]:
no_df.dtypes

protein_sequences_1    object
protein_sequences_2    object
dtype: object

In [25]:
# One label per row; 0 marks a non-interacting pair.
# len(no_df) avoids hard-coding the row count (36480).
nos = [0] * len(no_df)

In [26]:
no_df['protein_interaction'] = nos

In [27]:
no_df.head()

Unnamed: 0,protein_sequences_1,protein_sequences_2,protein_interaction
0,MSVEMDSSSFIQFDVPEYSSTVLSQLNELRLQGKLCDIIVHIQGQP...,MGDTFIRHIALLGFEKRFVPSQHYVYMFLVKWQDLSEKVVYRRFTE...,0
1,MPITRMRMRPWLEMQINSNQIPGLIWINKEEMIFQIPWKHAAKHGW...,MTMPVNGAHKDADLWSSHDKMLAQPLKDSDVEVYNIIKKESNRQRV...,0
2,MLCVRGARLKRELDATATVLANRQDESEQSRKRLIEQSREFKKNTP...,MRLTLLCCTWREERMGEEGSELPVCASCGQRIYDGQYLQALNADWH...,0
3,MDALESLLDEVALEGLDGLCLPALWSRLETRVPPFPLPLEPCTQEF...,MERLQKQPLTSPGSVSPSRDSSVPGSPSSIVAKMDNQVLGYKDLAA...,0
4,MALSRGLPRELAEAVAGGRVLVVGAGGIGCELLKNLVLTGFSHIDL...,MVVMNSLRVILQASPGKLLWRKFQIPRFMPARPCSLYTCTYKTRNR...,0


## 3.) Merge the two DataFrames together

In [29]:
full_df = pd.concat([yes_df,no_df], axis = 0)

In [30]:
full_df.head()

Unnamed: 0,protein_sequences_1,protein_sequences_2,protein_interaction
0,MESSKKMDSPGALQTNPPLKLHTDRSAGTPVFVPEQGGYKEKFVKT...,MARPHPWWLCVLGTLVGLSATPAPKSCPERHYWAQGKLCCQMCEPG...,1
1,MVMSSYMVNSKYVDPKFPPCEEYLQGGYLGEQGADYYGGGAQGADF...,MAENVVEPGPPSAKRPKLSSPALSASASDGTDFGSLFDLEHDLPDE...,1
2,MNRHLWKSQLCEMVQPSGGPAADQDVLGEESPLGKPAMLHLPSEQG...,MEGGRRARVVIESKRNFFLGAFPTPFPAEHVELGRLGDSETAMVPG...,1
3,MAPPSTREPRVLSATSATKSDGEMVLPGFPDADSFVKFALGSVVAV...,MLFYSFFKSLVGKDVVVELKNDLSICGTLHSVDQYLNIKLTDISVT...,1
4,MQSGPRPPLPAPGLALALTLTMLARLASAASFFGENHLEVPVATAL...,MQTIKCVVVGDGAVGKTCLLISYTTNKFPSEYVPTVFDNYAVTVMI...,1


In [32]:
full_df.describe()

Unnamed: 0,protein_interaction
count,73110.0
mean,0.501026
std,0.500002
min,0.0
25%,0.0
50%,1.0
75%,1.0
max,1.0


In [33]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 73110 entries, 0 to 36479
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   protein_sequences_1  73110 non-null  object
 1   protein_sequences_2  73110 non-null  object
 2   protein_interaction  73110 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 2.2+ MB


In [34]:
full_df.to_csv('complete.csv')
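The `info()` output above reports `Index: 73110 entries, 0 to 36479`, which means the row labels of the two frames repeat after `pd.concat`, and `to_csv` writes that index as an extra unnamed column by default. A minimal sketch of one way to avoid both is below, using made-up toy frames and an in-memory buffer instead of a real file; it is an alternative to the cells above, not what the notebook ran.

```python
import io

import pandas as pd

# Toy stand-ins for yes_df and no_df; the sequences are made up.
yes_toy = pd.DataFrame({
    "protein_sequences_1": ["MAAA", "MCCC"],
    "protein_sequences_2": ["MKKK", "MLLL"],
    "protein_interaction": [1, 1],
})
no_toy = pd.DataFrame({
    "protein_sequences_1": ["MDDD"],
    "protein_sequences_2": ["MEEE"],
    "protein_interaction": [0],
})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating labels.
merged = pd.concat([yes_toy, no_toy], axis=0, ignore_index=True)
print(merged.index.tolist())  # [0, 1, 2]

# index=False keeps the row labels out of the CSV, so reading it back
# does not produce an extra "Unnamed: 0" column.
buffer = io.StringIO()
merged.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```

With a unique index, label-based lookups such as `full_df.loc[0]` would return a single row rather than one row from each source frame.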

# Summary:

Looking at the two CSVs:

The two classes are well balanced: 36,630 positive pairs and 36,480 negative pairs (73,110 rows in total).

The protein sequences are stored as strings (`object` dtype).

The interaction label is stored as 0's and 1's (0 meaning no interaction and 1 meaning an interaction).
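As a quick sanity check on the class balance, here is a minimal sketch that rebuilds just the label column from the two row counts reported by `info()` and computes the share of each class. The Series is constructed by hand for illustration; on the real data the same calls would be made on `full_df['protein_interaction']`.

```python
import pandas as pd

# Label column rebuilt from the row counts reported above:
# 36630 positive pairs and 36480 negative pairs.
labels = pd.Series([1] * 36630 + [0] * 36480, name="protein_interaction")

counts = labels.value_counts()                # rows per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts[1], counts[0])  # 36630 36480
print(round(shares[1], 6))   # 0.501026
```

The positive share matches the `mean` of 0.501026 in the `full_df.describe()` output, since the mean of a 0/1 column is the fraction of ones.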

