# **Welcome to our project report! ✨🧪**
## 🚀 *Overview*
This notebook presents an analysis of our pKa prediction package. 

## 🤯 *Acquiring Dataset*
As a first step, we acquire the [pKa dataset](https://github.com/cbio3lab/pKa/blob/main/Data/test_acids_bases_descfinal_nozwitterions.csv) from cbio3lab's repository, which was originally extracted from the Harvard [Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6A67L9).

Next, we will perform an exploratory analysis of the collected dataset.

1) Let's download the data directly into your working directory:

In [None]:
!wget https://github.com/anastasiafloris/pKaPredict/raw/main/data/pkadatasetRAWDATA.csv
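
A github.com URL containing `/blob/` points to the HTML viewer page rather than the file itself, so the CSV has to be fetched from the corresponding `/raw/` URL. For environments where `wget` is not available, the download could be sketched in Python roughly as follows. The helper names `to_raw_github_url` and `download_dataset` are hypothetical and only serve as an illustration.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlretrieve


def to_raw_github_url(url: str) -> str:
    """Rewrite a github.com '/blob/' page URL to its '/raw/' equivalent.

    Hypothetical helper for illustration; URLs that are not GitHub blob
    pages are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected path layout: /<owner>/<repo>/blob/<branch>/<path...>
    if parts.netloc == "github.com" and len(segments) > 4 and segments[3] == "blob":
        segments[3] = "raw"
        return urlunsplit(parts._replace(path="/".join(segments)))
    return url


def download_dataset(url: str, filename: str) -> str:
    """Download the raw file to `filename` (requires network access)."""
    saved_path, _headers = urlretrieve(to_raw_github_url(url), filename)
    return saved_path


blob_url = "https://github.com/anastasiafloris/pKaPredict/blob/main/data/pkadatasetRAWDATA.csv"
raw_url = to_raw_github_url(blob_url)
print(raw_url)
# Uncomment to download: download_dataset(blob_url, "pkadatasetRAWDATA.csv")
```

The download call is left commented out so that the cell can run without network access.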

2) Let's verify that the file exists, then load it and display a preview:

In [None]:
import pandas as pd
from pathlib import Path

# Define the current working directory
current_directory = Path.cwd()
print("Current Directory:", current_directory.resolve())

# Specify the path to the dataset file
file_path = current_directory / "pkadatasetRAWDATA.csv"

# Verify the file's existence and read its contents if available
if file_path.exists():
    print("The dataset file exists. Reading the file contents...\n")
    
    # Preview the raw text (optional, for verification)
    with file_path.open("r") as file:
        print(file.read(500))  # Read and print only the first 500 characters
    
    # Load the dataset using pandas
    try:
        data_pka = pd.read_csv(file_path, delimiter=",")  # Adjust delimiter if necessary
        print("\nDataset successfully loaded. Preview:")
        print(data_pka.head())  # Display first few rows
    except Exception as e:
        print(f"Error loading dataset: {e}")

else:
    print("Error: The specified file does not exist.")
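
The loading step above assumes a comma-separated file. If the delimiter is uncertain, it can be guessed from a small text sample with the standard library's `csv.Sniffer`. This is a minimal sketch: `detect_delimiter` is a hypothetical helper, and the sample string is a toy example, not an excerpt of the real dataset.

```python
import csv


def detect_delimiter(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the column delimiter from a text sample using csv.Sniffer."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Toy sample for illustration only; not taken from the real dataset.
sample = "smiles;pka\nmol_a;1.5\nmol_b;2.5\nmol_c;3.5\n"
delimiter = detect_delimiter(sample)
print(repr(delimiter))
```

In the notebook, the sample could be the first few kilobytes of the downloaded file, and the detected value could then be passed to `pd.read_csv(file_path, delimiter=delimiter)`.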


## 🧹 *Cleaning Dataset*

The function defined below performs the following steps when called: <br>
✅ Prints initial dataset shape <br>
✅ Counts and removes missing values (NaN) <br>
✅ Prints final dataset shape after cleaning <br>
✅ Generates a histogram to visualize pKa value distribution <br>

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Remove missing values and check dataset shape
def clean_and_visualize_pka(data_pka):
    """Cleans dataset by removing NaN values and visualizes pKa distribution."""
    
    # Check initial shape
    initial_shape = data_pka.shape
    print(f"Initial dataset shape: {initial_shape}")

    # Check for missing values
    missing_values = data_pka.isnull().sum().sum()
    print(f"Total missing values: {missing_values}")

    # Drop rows containing NaN values (in place, so the caller's DataFrame is modified too)
    data_pka.dropna(inplace=True)

    # Check final shape after cleaning
    final_shape = data_pka.shape
    print(f"Dataset shape after NaN removal: {final_shape}")

    # Generate histogram for pKa distribution
    print("\nGenerating histogram for pKa distribution...\n")
    sns.set_theme()
    plt.figure(figsize=(8, 5))
    sns.histplot(data=data_pka, x="pka", binwidth=0.5, kde=True)
    plt.xlabel("pKa")
    plt.ylabel("Frequency")
    plt.title("Distribution of pKa Values")
    plt.show()

    return data_pka  # Return cleaned dataset for further processing if needed
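
Note that the function reports the total number of missing cells, which can differ from the number of rows that `dropna` removes, because a single row may contain several NaN values. The toy example below illustrates the difference. The `smiles` column name and all values are assumptions made up for this sketch, and it uses the non-in-place form of `dropna` so that the original frame stays intact.

```python
import pandas as pd

# Toy frame for illustration only: "pka" mirrors the column used in the
# histogram code, "smiles" is an assumed column name, and the values are
# made up rather than taken from the real dataset.
toy = pd.DataFrame(
    {
        "smiles": ["mol_a", None, "mol_c", "mol_d"],
        "pka": [1.0, None, None, 4.0],
    }
)

missing_cells = int(toy.isnull().sum().sum())  # counts individual NaN cells
cleaned = toy.dropna()                         # returns a new frame; `toy` is untouched
rows_removed = len(toy) - len(cleaned)

print(f"Missing cells: {missing_cells}")
print(f"Rows removed:  {rows_removed}")
print(f"Shape before/after: {toy.shape} -> {cleaned.shape}")
```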