# Protein dataset

This dataset was obtained from the Protein Data Bank (https://www.rcsb.org/), where you can search for and collate data on a selection of, or all, structures within the database. The first dataset contains selected information for all structures within the database (as of 1st July 2023) and primarily summarises the data quality and some basic information about each structure. More data can be added. Exploring it gives an idea of how the deposited data have changed over time.

Setting up the basic analysis environment

In [None]:
# Import various modules we might need
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import statsmodels.api as sm
import json
np.set_printoptions(precision=5, suppress=True)  # suppress scientific notation
sns.set(color_codes=True)
%matplotlib inline
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
import os

# Read in the first dataset.

In [None]:
# Read the CSV file into a DataFrame
file_path = './datacollection_keep_data.csv'  # Replace with the actual path to your CSV file
data_df = pd.read_csv(file_path)

In [None]:
# What types of data do we have
data_df.dtypes

# What do these things mean?
See supplementary Glossary of terms found in the proteins data set file.

In [None]:
data_df.head()

In [None]:
data_df

So there are 208831 individual structures and 23 columns of information for each.
A simple starting point might be to see how things have changed with time.

How have releases to the PDB changed with time?

Try to plot the total Entry IDs by year.

In [None]:
yearly_counts = data_df.groupby("Release Year")["Entry ID"].count()

# Create the bar plot of entries per release year
plt.figure(figsize=(10, 6))
sns.barplot(x = yearly_counts.index, y = yearly_counts.values, color = 'blue')
plt.xlabel("Release Year")
plt.ylabel("Total Entry ID Counts")
plt.title("Total Entry ID Counts per Release Year")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Conclusion: there has been a year-on-year increase in PDB releases. The value is lower for the current year (2023) because the year is not yet finished.
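The year-on-year claim can also be checked numerically rather than by eye. The following is a minimal sketch on a tiny made-up table (`toy_df` and its values are illustrative assumptions, not the real data); the same steps should apply to `data_df`, which has the `Release Year` and `Entry ID` columns used here.

```python
import pandas as pd

# Toy stand-in for data_df; these values are invented for illustration only.
toy_df = pd.DataFrame({
    "Entry ID": ["1AAA", "1AAB", "1AAC", "1AAD", "1AAE", "1AAF", "1AAG"],
    "Release Year": [2020, 2021, 2021, 2022, 2022, 2022, 2023],
})

# Number of entries released in each year.
yearly = toy_df.groupby("Release Year")["Entry ID"].count()

# Drop the final year, which is assumed to be incomplete.
complete = yearly[yearly.index < yearly.index.max()]

# Change in the number of releases relative to the previous year.
yearly_change = complete.diff().dropna()
print(yearly_change)

# True only if every complete year had more releases than the year before.
always_increasing = bool((yearly_change > 0).all())
print(always_increasing)
```

Any year with a negative value in `yearly_change` would be an exception to the overall trend and worth a closer look.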

Perhaps the plot could be made to look nicer.

In [None]:
# Scatter plot of the yearly counts with a smooth (LOWESS) trendline using Seaborn
plt.figure(figsize=(10, 6))
sns.regplot(x=yearly_counts.index, y=yearly_counts.values, lowess=True, line_kws={'color': 'red'})
plt.xlabel("Release Year")
plt.ylabel("Total Entry ID Counts")
plt.title("Total Entry ID Counts per Release Year")
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

# Can this be broken down in more detail?
Let's have a look at the number of released structures per year per experimental method.


In [None]:
#What are the counts for each experimental method?
data_df["Experimental Method"].value_counts()

X-ray diffraction is by far the most common.
How many structures from each experimental method are released each year?

In [None]:
# try a single plot
# Create a single histogram with different colors for each experimental method
sns.displot(data=data_df, x="Release Year", hue="Experimental Method", kde=True)

This does appear to show changes in experimental method over time. Can it be made clearer?
Let's try a different way of grouping and plotting.

In [None]:
#Use groupby to create a new dataframe containing each method per year and counts for it.
method_counts = data_df.groupby(["Release Year", "Experimental Method"]).size().reset_index(name="Count")

In [None]:
method_counts.head()

In [None]:
#Create a line plot for the experimental methods with year.
plt.figure(figsize=(10, 6))
sns.lineplot(data=method_counts, x="Release Year", y="Count", hue="Experimental Method", marker="o")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.title("Count of Experimental Methods per Release Year")
plt.xticks(rotation=45)
plt.legend(title="Experimental Method")

plt.tight_layout()
plt.show()

Only 3 methods show much variation.

From earlier, the top 3 experimental methods are:

X-RAY DIFFRACTION: 177588,
ELECTRON MICROSCOPY: 16426,
SOLUTION NMR: 13913

Let's plot the top 3 methods in terms of number of releases for each year.


In [None]:
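# Get the top 3 methods for each year using the earlier method_counts dataframe.
# Sorting and then taking the first 3 rows per year keeps the "Release Year" column,
# which groupby().apply() may drop or warn about depending on the pandas version.
top_methods = (
    method_counts.sort_values(["Release Year", "Count"], ascending=[True, False])
    .groupby("Release Year")
    .head(3)
    .reset_index(drop=True)
)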

In [None]:
top_methods.head()
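Selecting the top 3 per year means the three methods kept can differ from one year to the next. An alternative, sketched below on made-up counts (`toy_counts` and its numbers are illustrative assumptions), is to pick the three methods with the largest totals overall and keep only their rows.

```python
import pandas as pd

# Toy stand-in for method_counts; the counts are invented for illustration only.
toy_counts = pd.DataFrame({
    "Release Year": [2021] * 4 + [2022] * 4,
    "Experimental Method": [
        "X-RAY DIFFRACTION", "ELECTRON MICROSCOPY",
        "SOLUTION NMR", "NEUTRON DIFFRACTION",
    ] * 2,
    "Count": [100, 30, 10, 1, 110, 50, 8, 2],
})

# Total count per method across all years.
totals = toy_counts.groupby("Experimental Method")["Count"].sum()

# Names of the three methods with the largest totals.
top3_names = totals.nlargest(3).index

# Keep every year's rows, but only for those three methods.
top3_overall = toy_counts[toy_counts["Experimental Method"].isin(top3_names)]
print(top3_overall)
```

With this approach each line in the plot covers every year in which that method appears, rather than only the years in which it ranked in the top 3.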

In [None]:
#redo the plot with the new top_methods dataframe
# Create a single combined line plot for the top 3 methods
plt.figure(figsize=(10, 6))
sns.lineplot(data=top_methods, x="Release Year", y="Count", hue="Experimental Method", marker="o")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.title("Top 3 Experimental Methods per Release Year")
plt.xticks(rotation=45)
plt.legend(title="Experimental Method")
plt.tight_layout()
plt.show()

That is much clearer. For tidiness, remove the data for 2023, since that year is incomplete.

In [None]:
# Exclude data for the year 2023
top_methods_no2023 = top_methods[top_methods["Release Year"] != 2023]

In [None]:
# Create a single combined line plot for the top 3 methods
plt.figure(figsize=(10, 6))
sns.lineplot(data=top_methods_no2023, x="Release Year", y="Count", hue="Experimental Method", marker="o")
plt.xlabel("Release Year")
plt.ylabel("Count")
plt.title("Top 3 Experimental Methods per Release Year (Excluding 2023)")
plt.xticks(rotation=45)
plt.legend(title="Experimental Method")
plt.tight_layout()
plt.show()

Much clearer.
Conclusion: the major methods for solving macromolecular structures have changed with time.
X-ray diffraction has shown a steady increase since the 1990s but since about 2015 appears to be levelling off.
Solution NMR had a steady increase in the 1990s and early 2000s, after which it has shown a slow decline.
Since about 2015, electron microscopy has shown a rapid increase in structures and is still growing.
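Because the total number of releases grows over time, it can also help to look at each method's share of a year's releases rather than the raw counts. This is a minimal sketch on invented numbers (`toy_counts` is an illustrative assumption with the same columns as `method_counts`).

```python
import pandas as pd

# Toy stand-in for method_counts; the counts are invented for illustration only.
toy_counts = pd.DataFrame({
    "Release Year": [2014, 2014, 2022, 2022],
    "Experimental Method": ["X-RAY DIFFRACTION", "ELECTRON MICROSCOPY"] * 2,
    "Count": [90, 10, 60, 40],
})

# One row per year and one column per method; missing combinations become 0.
wide = toy_counts.pivot_table(
    index="Release Year",
    columns="Experimental Method",
    values="Count",
    aggfunc="sum",
    fill_value=0,
)

# Divide each row by its total so that the shares in each year sum to 1.
shares = wide.div(wide.sum(axis=1), axis=0)
print(shares.round(2))
```

The resulting `shares` table could be passed to `shares.plot()` to compare methods on a common 0 to 1 scale.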

Can you explain the spike in structures from X-ray diffraction in 2020? 

# What else could you find out using this data?

Some things to consider

Do the major methods of structure solution vary in terms of structures solved?

Given improvements to technology, how have the structures solved or the data quality changed?

What factors affect the number or presence of structural features, such as waters?

Do larger structures have poorer quality?
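Several of these questions can be started with a grouped summary. The sketch below compares a quality measure between methods; the column name `Resolution` and all values are assumptions made for illustration, so check `data_df.dtypes` for the real column names before adapting it.

```python
import pandas as pd

# Toy stand-in for data_df. "Resolution" is a hypothetical column name and
# the values are invented for illustration only.
toy_df = pd.DataFrame({
    "Experimental Method": ["X-RAY DIFFRACTION"] * 3 + ["ELECTRON MICROSCOPY"] * 3,
    "Release Year": [2010, 2020, 2020, 2015, 2020, 2020],
    "Resolution": [2.0, 1.8, 2.2, 6.0, 3.0, 3.4],
})

# Number of structures and median resolution for each method.
summary = toy_df.groupby("Experimental Method")["Resolution"].agg(["count", "median"])
print(summary)
```

The median is used here rather than the mean because a few very poorly resolved structures would otherwise dominate the comparison.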


# Protein dataset part 2
The second dataset contains data obtained from the PDBsum database (http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/).
This database contains all the structures within the Protein Data Bank, but with additional analysis of each structure. This includes information on protein-protein interactions, protein-ligand interactions and, in this case, data for structures containing ions.

The data supplied contains data obtained for each structure containing an ion and is a summary of the interactions each ion makes. An example for the source data can be found here http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl?pdb=3f5m&ligtype=01&ligno=01&metal=TRUE for the structure https://www.rcsb.org/structure/3F5M. 

It was extracted using the script Metals_working_keep.ipynb

The data was then combined into a single .csv file

# Read in the second dataset.

In [None]:
# Read the CSV file into a DataFrame
file_path = './combined_metals_keep_data.csv'  # Replace with the actual path to your CSV file
ions_df = pd.read_csv(file_path)

In [None]:
# What types of data do we have
ions_df.dtypes

# What do these things mean?
See supplementary Glossary of terms found in the proteins data set file.

In [None]:
ions_df

There are 94843 entries with 33 columns. You can see from PDB code 6p4d that some structures contain more than one entry (more than one ion). You can also see that 6p4d contains NaN values. Why do you think this might be? How would you deal with it?
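Both observations can be quantified before deciding what to do. The sketch below counts entries per structure and missing values per column, then shows one possible way of handling NaN. The `PDB code` column is from the dataset; `Ion`, `Mean distance` and all values are hypothetical and used for illustration only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for ions_df. "Ion" and "Mean distance" are hypothetical
# column names, and the values are invented for illustration only.
toy_ions = pd.DataFrame({
    "PDB code": ["6p4d", "6p4d", "3f5m", "3f5m", "1abc"],
    "Ion": ["ZN", "MG", "CA", "MG", "NA"],
    "Mean distance": [2.1, np.nan, 2.4, 2.0, np.nan],
})

# Number of ion entries recorded for each structure.
ions_per_structure = toy_ions.groupby("PDB code").size()
print(ions_per_structure)

# Number of missing values in each column.
missing_per_column = toy_ions.isna().sum()
print(missing_per_column)

# One option: keep only the rows where the column being analysed is present.
complete_rows = toy_ions.dropna(subset=["Mean distance"])
print(len(complete_rows))
```

Dropping rows is only one choice. Whether it is appropriate depends on why the value is missing, for example an ion with no recorded interactions of a given type.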

You can compare the original data for 3f5m  http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl?pdb=3f5m&ligtype=01&ligno=01&metal=TRUE with the extracted data.

In [None]:
# Select the rows of ions_df for PDB code 3f5m
row_3f5m = ions_df[ions_df['PDB code'] == '3f5m']

row_3f5m

This structure contains 2 ions; the second is MG. If you wish to compare it with the original data, this can be found here: http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/pdbsum/GetLigInt.pl?pdb=3f5m&ligtype=02&ligno=01&metal=TRUE

# What could you consider using this data?
Some things you could consider

Is the presence of ions dependent on the resolution of the data?

Do the standard deviations in the second dataset vary with resolution?

Does the length of ion-protein interactions vary with the size of the ion?

Is there variation in the amino acid environments around different ions?

Cations are positively charged, so they would be expected to interact with negatively charged amino acids.

Anions are negatively charged, so they would be expected to interact with positively charged amino acids.
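Questions that involve resolution need the two datasets to be combined. The following is a minimal sketch of one way to do that, assuming that `Entry ID` in the first dataset is upper case while `PDB code` in the second is lower case, and that the first dataset has a column named `Resolution` (a hypothetical name). All values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for data_df and ions_df. "Resolution" is a hypothetical
# column name, and the values are invented for illustration only.
toy_structures = pd.DataFrame({
    "Entry ID": ["3F5M", "6P4D", "1ABC"],
    "Resolution": [1.7, 2.9, 3.5],
})
toy_ions = pd.DataFrame({
    "PDB code": ["3f5m", "3f5m", "6p4d"],
})

# Normalise the identifier case so the two tables can be matched.
toy_structures["pdb_key"] = toy_structures["Entry ID"].str.lower()

# Number of ion entries per structure, as a two-column table.
ion_counts = toy_ions.groupby("PDB code").size().reset_index(name="n_ions")

# A left join keeps structures without ions; give those a count of 0.
merged = toy_structures.merge(
    ion_counts, left_on="pdb_key", right_on="PDB code", how="left"
)
merged["n_ions"] = merged["n_ions"].fillna(0).astype(int)
merged["has_ion"] = merged["n_ions"] > 0
print(merged[["Entry ID", "Resolution", "n_ions", "has_ion"]])
```

From `merged`, a comparison such as `merged.groupby("has_ion")["Resolution"].median()` gives a first look at whether structures with ions tend to have different resolution.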
