# Programming & Computing Assignment 2

### Question 1

#### Program

In [35]:
def EUTOS():
    """
    OUTER function calculates the EUTOS score to determine CML risk groups (high or low risk).

    Returns:
    str: string indicating risk group ('High risk' or 'Low risk') and EUTOS score.
    
    """
    basophil_percentage = int(input("Enter basophil percentage (0-25%): "))
    spleen_size = int(input("Enter spleen size (0-40 cm): "))

    def calculate_patient_score(basophil_percentage, spleen_size):
        """
        INNER function multiplies basophil % by 7 and spleen cm by 4 and returns the sum.

        Parameters:
        basophil_percentage (int): 0-25%
        spleen_size (int): 0-40cm 

        Returns:
        int: EUTOS score.
        
        """
        patient_score = (7 * basophil_percentage) + (4 * spleen_size)
        return patient_score

    # check if input values are within ranges
    if not (0 <= basophil_percentage <= 25):
        return "Basophil percentage value must be between 0 and 25."
    if not (0 <= spleen_size <= 40):
        return "Spleen size value must be between 0 and 40."

    # calculate EUTOS with inner function
    patient_score = calculate_patient_score(basophil_percentage, spleen_size)

    # allocate high or low risk via EUTOS threshold
    if patient_score > 87:
        return f"High risk, EUTOS score: {patient_score}"
    else:
        return f"Low risk, EUTOS score: {patient_score}"


#### Example use

In [34]:
# call function
result = EUTOS()

# print result
print(result)


Enter basophil percentage (0-25%):  1
Enter spleen size (0-40 cm):  1


Low risk, EUTOS score: 11


_Explanation_: The outer function takes user input for the basophil percentage and spleen size. It then checks that the input values are within the specified ranges of 0-25% and 0-40 cm; if not, an error message is returned.
The inner function calculates the EUTOS score from the user inputs. The outer function then compares the score with the threshold of 87 and classifies the risk group.
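Because `EUTOS()` reads from `input()`, it is awkward to check automatically. A minimal sketch of the same arithmetic as pure functions is given here; the names `eutos_score` and `eutos_risk_group` are hypothetical and are not part of the submitted program, and raising `ValueError` instead of returning a message is an assumption made for the sketch.

```python
def eutos_score(basophil_percentage, spleen_size):
    """Return the EUTOS score: 7 * basophils (%) + 4 * spleen size (cm)."""
    # Same range checks as the original program, raised as exceptions here
    if not 0 <= basophil_percentage <= 25:
        raise ValueError("Basophil percentage value must be between 0 and 25.")
    if not 0 <= spleen_size <= 40:
        raise ValueError("Spleen size value must be between 0 and 40.")
    return 7 * basophil_percentage + 4 * spleen_size


def eutos_risk_group(score, threshold=87):
    """Classify a score as 'High risk' if it exceeds the threshold, else 'Low risk'."""
    return "High risk" if score > threshold else "Low risk"
```

For the example inputs above (basophils 1%, spleen 1 cm), `eutos_score(1, 1)` gives 11, which `eutos_risk_group` classifies as low risk.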

### Question 2

#### a)

In [76]:
def encrypt(text):
    """
    ENCRYPT FUNCTION: encrypt user input text. Sub plain alphabet with cipher.
    
    Parameters:
    text (str): text to be encrypted
    
    Returns:
    str: encrypted text
    
    Raises:
    ValueError: if the text contains any character outside a-z after lowercasing (spaces, digits, punctuation)
    
    """

    # define plain and cipher alphabet 
    plain_alphabet = "abcdefghijklmnopqrstuvwxyz"
    cipher_alphabet = "phqgiumeaylnofdxjkrcvstzwb"
    encrypted_text = ""

    # loop through each lowercased character, reject special characters, then sub with cipher alphabet
    for char in text.lower():
        # special characters are not supported
        if char not in plain_alphabet:
            raise ValueError("Can't encrypt words having special characters")
        # use index to look up corresponding character in cipher alphabet -> add encrypted character to encrypted string
        encrypted_text += cipher_alphabet[plain_alphabet.index(char)]

    # return encrypted text
    return encrypted_text


def decrypt(text):
    """
    DECRYPT FUNCTION: decrypt user input text. Sub cipher alphabet with plain. 
    
    Parameters:
    text (str): The text to be decrypted.
    
    Returns:
    str: decrypted text.
    
    Raises:
    ValueError: if the text contains any character outside a-z after lowercasing (spaces, digits, punctuation)
    
    """

    # define plain and cipher alphabet
    plain_alphabet = "abcdefghijklmnopqrstuvwxyz"
    cipher_alphabet = "phqgiumeaylnofdxjkrcvstzwb"
    decrypted_text = ""

    # check for special characters then loop through each character and sub with plain alphabet
    for char in text.lower():
        if char not in cipher_alphabet:
            raise ValueError("Can't decrypt words having special characters")
        decrypted_text += plain_alphabet[cipher_alphabet.index(char)]

    # return decrypted text
    return decrypted_text


def main():
    """
    MAIN: user prompt to input text -> select encrypt (e) or decrypt (d) -> call relevant function -> print result
    
    """

    # user prompt to input text then choose encrypt or decrypt
    text = input("Enter the text to encrypt or decrypt: ")
    choice = input("Enter 'e' to encrypt or 'd' to decrypt: ")

    try:
        # call relevant function and print result
        # call encrypt if e, store result, print
        if choice.lower() == 'e':
            result = encrypt(text)
            print("Encrypted text:", result)
        # call decrypt if d, store result, print
        elif choice.lower() == 'd':
            result = decrypt(text)
            print("Decrypted text:", result)
        else:
            # if neither chosen print error
            print("Invalid choice. Enter 'e' or 'd'.")
    except ValueError as e:
        # if special characters then print error message
        print(e)


if __name__ == '__main__':
    main()
# function only called if script run as main

Enter the text to encrypt or decrypt:  hello
Enter 'e' to encrypt or 'd' to decrypt:  e


Encrypted text: einnd
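The character-by-character lookup above can also be expressed with Python's built-in translation tables. The following is a minimal sketch, not the submitted solution: the constant names and the `substitute` helper are hypothetical, and it keeps the same rule that any character outside a-z raises a `ValueError`.

```python
PLAIN_ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CIPHER_ALPHABET = "phqgiumeaylnofdxjkrcvstzwb"

# Build one lookup table per direction, once, instead of calling index() per character
ENCRYPT_TABLE = str.maketrans(PLAIN_ALPHABET, CIPHER_ALPHABET)
DECRYPT_TABLE = str.maketrans(CIPHER_ALPHABET, PLAIN_ALPHABET)


def substitute(text, table):
    """Lowercase the text, reject non a-z characters, then apply the table."""
    lowered = text.lower()
    if any(char not in PLAIN_ALPHABET for char in lowered):
        raise ValueError("Can't process words having special characters")
    return lowered.translate(table)
```

With these tables, `substitute("hello", ENCRYPT_TABLE)` returns `"einnd"`, matching the output above, and passing that result with `DECRYPT_TABLE` recovers `"hello"`.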


#### b)

_Firstly, the code may need to be adapted to suit an application in which patient data is fed into the function rather than typed in by a user. To do this, the main() function is modified to read patient records from a file or database, encrypt each record with the encrypt function, and write the result to a new file or database._

Example

----------------------------------------------------------------

        import csv

        def encrypt_record(record):
            """
            ENCRYPT FUNCTION: Encrypt a patient record. Sub plain alphabet with cipher.

            Parameters:
            record (list): The patient record to be encrypted.

            Returns:
            list: The encrypted patient record.
            """
            plain_alphabet = "abcdefghijklmnopqrstuvwxyz"
            cipher_alphabet = "phqgiumeaylnofdxjkrcvstzwb"
            encrypted_record = []

            for field in record:
                encrypted_field = ""
                for char in field.lower():
                    if char not in plain_alphabet:
                        raise ValueError("Can't encrypt words having special characters")
                    encrypted_field += cipher_alphabet[plain_alphabet.index(char)]
                encrypted_record.append(encrypted_field)

            return encrypted_record


        def encrypt_patient_records(input_file, output_file):
            """
            ENCRYPT FUNCTION: Encrypt all patient records in a CSV file.

            Parameters:
            input_file (str): The name of the input CSV file.
            output_file (str): The name of the output CSV file.

            Raises:
            ValueError: Special characters found in input CSV file.
            """
            with open(input_file, "r") as f_in, open(output_file, "w", newline="") as f_out:
                reader = csv.reader(f_in)
                writer = csv.writer(f_out)

                for record in reader:
                    try:
                        encrypted_record = encrypt_record(record)
                        writer.writerow(encrypted_record)
                    except ValueError as e:
                        print(f"Skipping record: {e}")


        if __name__ == "__main__":
            encrypt_patient_records("patients.csv", "patients_encrypted.csv")
            
--------------------------------------------------------

_To run this script on QUB's Kelvin HPC, resources appropriate to the size of the dataset and the expected script runtime must be requested via Slurm. To make the script more efficient and ideally faster, parallel processing can be used to divide the program into fragments that execute simultaneously across multiple cores and processors. To do this, the code may need to be amended to add **import multiprocessing** and create a pool of worker processes._
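A minimal sketch of what such a worker pool might look like is given here. It is an illustration under assumptions, not the script that was run: `encrypt_records_parallel` and the `workers` parameter are hypothetical names, and the per-record function is a compact variant of `encrypt_record` from the example above.

```python
import multiprocessing

PLAIN_ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CIPHER_ALPHABET = "phqgiumeaylnofdxjkrcvstzwb"
ENCRYPT_TABLE = str.maketrans(PLAIN_ALPHABET, CIPHER_ALPHABET)


def encrypt_record(record):
    """Encrypt every field of one record (a list of strings)."""
    encrypted = []
    for field in record:
        lowered = field.lower()
        if any(char not in PLAIN_ALPHABET for char in lowered):
            raise ValueError("Can't encrypt words having special characters")
        encrypted.append(lowered.translate(ENCRYPT_TABLE))
    return encrypted


def encrypt_records_parallel(records, workers=4):
    """Encrypt a list of records across a pool of worker processes.

    pool.map returns the results in the same order as the input records.
    """
    with multiprocessing.Pool(processes=workers) as pool:
        return pool.map(encrypt_record, records)
```

A call such as `encrypt_records_parallel(list_of_records, workers=5)` should sit under an `if __name__ == "__main__":` guard so that worker processes do not re-run it on import. Note that in this sketch a `ValueError` raised in any worker propagates out of `pool.map`, so a single invalid record stops the whole batch rather than being skipped.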

_A job script will be submitted to the scheduler, asking for a number of cores and nodes to use, a maximum walltime, a certain amount of memory, and potentially an instruction to send an email on job completion._

Example

-----------------------------------------------------

#!/bin/bash
#SBATCH --job-name=encrypt_patients
#SBATCH --output=/mnt/scratch2/users/swmitch/encrypt_patients.out
#SBATCH --time=24:25:00
#SBATCH --nodes=5
#SBATCH --ntasks=5
#SBATCH --cpus-per-task=5
#SBATCH --mem=1G
#SBATCH --mail-type=BEGIN,END

module load apps/python3/3.8.5/gcc-4.8.5
python3 encrypt_patients.py

-----------------------------------------------------
where:
**#!/bin/bash** tells Slurm to run the script using the Bash command-line interpreter/shell

**--job-name=encrypt_patients** provides the name of the job, which will show in squeue output

**--output=/mnt/scratch2/users/swmitch/encrypt_patients.out** path of the output log file

**--time=24:25:00** sets the time limit for the job (here 24 hours 25 minutes)

**--nodes=5** specifies the number of nodes; if omitted, Slurm allocates enough nodes to satisfy the other resource requests

**--ntasks=5** specifies the number of tasks to be run in parallel (here 5)

**--cpus-per-task=5** specifies the number of CPUs per task (here 5 CPUs for each task)

**--mem=1G** sets the memory limit per node (1 GB)

**--mail-type** selects which events trigger an email: `begin` when the job starts, `end` when the job ends

**module load** loads the Python 3.8.5 module

**encrypt_patients.py** the script to execute, run with the Python interpreter loaded above
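The CPU count requested in the job script can be picked up inside the Python script so the worker pool matches the allocation. The sketch below assumes the job was submitted with `--cpus-per-task`, in which case Slurm sets the `SLURM_CPUS_PER_TASK` environment variable; the helper name `worker_count` is hypothetical.

```python
import os


def worker_count(default=1):
    """Return the CPUs Slurm granted per task, or a default when run outside Slurm."""
    value = os.environ.get("SLURM_CPUS_PER_TASK")
    if value is None:
        # Not running under Slurm, or --cpus-per-task was not given
        return default
    return max(1, int(value))
```

With the job script above, `worker_count()` would return 5, which could then be passed as the pool size instead of a hard-coded number.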


### Question 3

#### a)

In [126]:
import pandas as pd

# load microarray and RNA-Seq datasets
microarray_df = pd.read_csv('Q3-microarray.csv')
rnaseq_df = pd.read_csv('Q3-rna-seq.csv')

# filter for overexpressed genes
# microarray = logFC ≥ 1, adj.P.Val ≤ 0.05
microarray_overexpressed = microarray_df[(microarray_df['logFC'] >= 1) & (microarray_df['adj.P.Val'] <= 0.05)]['RefSeq_IDs']
# RNA-seq = logFC ≥ 1, q_value ≤ 0.05
rnaseq_overexpressed = rnaseq_df[(rnaseq_df['logFC'] >= 1) & (rnaseq_df['q_value'] <= 0.05)]['RefSeq_IDs']

# print results
print("Overexpressed genes in microarray dataset:\n", microarray_overexpressed)
print("\nOverexpressed genes in RNA-Seq dataset:\n", rnaseq_overexpressed)

Overexpressed genes in microarray dataset:
 3424       NM_198440
3437       NM_139280
3439       NM_006804
3442       NM_052945
3447       NM_004448
3448    NM_001005862
3451       NM_182626
3458       NM_032339
3460       NM_033419
3465    NM_001030002
3467       NM_003673
3470       NM_003108
3471       NM_005310
3472       NM_004448
3473       NM_005310
Name: RefSeq_IDs, dtype: object

Overexpressed genes in RNA-Seq dataset:
 0           NM_130786
1        NM_001198818
4           NM_017436
23          NM_005763
27          NM_005502
             ...     
17755    NM_001102657
17760    NM_001136509
17768    NM_001099220
17812    NM_001023560
17814    NM_001112734
Name: RefSeq_IDs, Length: 1559, dtype: object


_Explanation_: The pandas library is imported to load and manipulate the datasets. The microarray and RNA-Seq datasets are read with the read_csv function and stored in dataframes.
The dataframes are filtered for overexpressed genes using the provided conditions. The RefSeq IDs of the overexpressed genes are stored in new variables before being printed.
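The microarray output above lists some RefSeq IDs more than once (for example NM_004448 and NM_005310), presumably because several probes map to the same transcript. A minimal sketch of the same boolean-mask filter with duplicates removed is given here; the helper name `significant_ids` and the toy IDs `NM_A` to `NM_C` are made up for illustration and do not come from the assignment data.

```python
import pandas as pd


def significant_ids(df, p_column, min_logfc=1.0, max_p=0.05):
    """Return unique RefSeq_IDs with logFC >= min_logfc and p_column <= max_p."""
    mask = (df["logFC"] >= min_logfc) & (df[p_column] <= max_p)
    return df.loc[mask, "RefSeq_IDs"].drop_duplicates().tolist()


# Toy data: NM_A passes twice, NM_B fails on logFC, NM_C fails on the p-value
toy = pd.DataFrame({
    "RefSeq_IDs": ["NM_A", "NM_B", "NM_A", "NM_C"],
    "logFC": [1.5, 0.4, 2.0, 1.2],
    "adj.P.Val": [0.01, 0.01, 0.03, 0.20],
})
overexpressed_toy = significant_ids(toy, "adj.P.Val")
```

The same helper would apply to the RNA-Seq dataframe by passing `"q_value"` as `p_column`.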

#### b)

In [141]:
import pandas as pd

# load microarray and RNA-Seq datasets
microarray_df = pd.read_csv('Q3-microarray.csv')
rnaseq_df = pd.read_csv('Q3-rna-seq.csv')

# extract RefSeq_IDs
microarray_ID = set(microarray_df['RefSeq_IDs'])
rnaseq_ID = set(rnaseq_df['RefSeq_IDs'])

# find common RefSeq_IDs
common_refseq = list(microarray_ID.intersection(rnaseq_ID))

# new df with common RefSeq_IDs
common_df = pd.DataFrame({'RefSeq_IDs': common_refseq})

# merge datasets via common RefSeq_IDs
merged_df = pd.merge(microarray_df, rnaseq_df, on='RefSeq_IDs')

# remove rows with RefSeq_IDs only appearing in one dataset
final_df = merged_df[merged_df['RefSeq_IDs'].isin(common_refseq)]

# export final dataset as csv
final_df.to_csv('combined_dataset.csv', index=False)

_Explanation:_ The pandas library is imported for data manipulation, and the datasets are loaded into separate dataframes. RefSeq_IDs are extracted from each dataset via the 'RefSeq_IDs' column and converted to sets. The intersection function finds the RefSeq_IDs common to both sets, and the result is converted to a list, which is also stored in a small dataframe (common_df). The two original datasets are then merged on 'RefSeq_IDs'; because pd.merge performs an inner join by default, the merged df already contains only rows whose RefSeq_IDs appear in both datasets. The isin check against the list of common RefSeq_IDs is therefore a safeguard that removes nothing further. The final result is exported as a CSV file.
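The behaviour of the merge step can be checked on a small example. This sketch uses made-up IDs (`NM_A` to `NM_D`) and hypothetical variable names; it only illustrates that an inner merge keeps the IDs present in both dataframes, and that explicit suffixes make the shared `logFC` column easier to tell apart than the default `_x`/`_y`.

```python
import pandas as pd

# Toy stand-ins for the two datasets; only NM_B and NM_C occur in both
microarray_toy = pd.DataFrame({
    "RefSeq_IDs": ["NM_A", "NM_B", "NM_C"],
    "logFC": [1.2, -0.3, 2.1],
})
rnaseq_toy = pd.DataFrame({
    "RefSeq_IDs": ["NM_B", "NM_C", "NM_D"],
    "logFC": [0.8, 1.9, -1.4],
})

# how="inner" is the default for pd.merge; it is spelled out here for clarity
merged_toy = pd.merge(
    microarray_toy,
    rnaseq_toy,
    on="RefSeq_IDs",
    how="inner",
    suffixes=("_microarray", "_rnaseq"),
)
```

Here `merged_toy` has two rows, for `NM_B` and `NM_C`, with columns `logFC_microarray` and `logFC_rnaseq`.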

#### c)

In [134]:
import pandas as pd

# load microarray and RNA-Seq datasets
microarray_df = pd.read_csv('Q3-microarray.csv')
rnaseq_df = pd.read_csv('Q3-rna-seq.csv')

# filter for downregulated genes
# microarray = logFC ≤ -1, adj.P.Val ≤ 0.05 
microarray_downregulated = set(microarray_df[(microarray_df['logFC'] <= -1) & (microarray_df['adj.P.Val'] <= 0.05)]['RefSeq_IDs'])
# RNA-seq = logFC ≤ -1, q_value ≤ 0.05
rnaseq_downregulated = set(rnaseq_df[(rnaseq_df['logFC'] <= -1) & (rnaseq_df['q_value'] <= 0.05)]['RefSeq_IDs'])

# find common downregulated genes
common_downregulated_genes = microarray_downregulated.intersection(rnaseq_downregulated)

# print results
print("Downregulated genes in microarray dataset:\n", microarray_downregulated)
print("\nDownregulated genes in RNA-Seq dataset:\n", rnaseq_downregulated)
print("\nCommon downregulated genes across both datasets:\n", common_downregulated_genes)



Downregulated genes in microarray dataset:
 {'NM_152999', 'NM_000139', 'NM_031911', 'NM_032918', 'NM_001025108', 'NM_000125', 'NM_012467'}

Downregulated genes in RNA-Seq dataset:
 {'NM_001144937', 'NM_001264573', 'NM_021156', 'NM_017769', 'NM_203394', 'NM_001144757', 'NM_033280', 'NM_016500', 'NM_198066', 'NM_031280', 'NM_178448', 'NM_001080430', 'NM_020379', 'NM_013278', 'NM_016932', 'NM_000946', 'NM_001134851', 'NM_030796', 'NM_032737', 'NM_003686', 'NM_001253725', 'NM_016523', 'NM_002490', 'NM_016048', 'NM_001254', 'NM_001995', 'NM_001012993', 'NM_001195415', 'NM_001101357', 'NM_001145001', 'NM_020190', 'NM_001554', 'NM_003642', 'NM_198465', 'NM_002308', 'NM_001258315', 'NM_000189', 'NM_000236', 'NM_001131055', 'NM_003090', 'NM_000360', 'NM_001005413', 'NM_020647', 'NM_001267580', 'NM_007365', 'NM_006855', 'NM_001029891', 'NM_003521', 'NM_001130910', 'NM_014875', 'NM_001177519', 'NM_014971', 'NM_014817', 'NM_004701', 'NM_015873', 'NM_152446', 'NM_001014797', 'NM_001114133', 'NM_004

_Explanation_: Filter datasets for downregulated genes using the specified logFC and p/q-value thresholds. Convert the resulting RefSeq_IDs columns to sets to make them easily comparable. Find the intersection of the two sets and store for print.

#### d)

In [149]:
import pandas as pd

# Load the microarray and RNA-Seq datasets
microarray_df = pd.read_csv('Q3-microarray.csv')
rnaseq_df = pd.read_csv('Q3-rna-seq.csv')

# Merge the datasets by Refseq_IDs
merged_df = pd.merge(microarray_df, rnaseq_df, on='RefSeq_IDs')

# Get user input for gene/probe ID
user_input = input('Enter a Gene_ID or ProbeSet_ID: ')

# Filter the merged dataset by user input
filtered_df = merged_df[(merged_df['Gene_ID'] == user_input) | (merged_df['ProbeSet_ID'] == user_input)]

# Check if any records were returned
if len(filtered_df) == 0:
    print("There are no records for this Gene or Probe")
else:
    # Print all information related to the gene/probe
    print(filtered_df)

Enter a Gene_ID or ProbeSet_ID:  ILMN_1678535


    ProbeSet_ID  adj.P.Val  P.Value   logFC_x RefSeq_IDs Gene_ID  \
0  ILMN_1678535   0.027303  0.00002 -2.167818  NM_000125    ESR1   

   Parental_FPKM  Persister_FPKM       q_value   logFC_y  
0        12.3511         18.2107  8.820000e-10  0.555816  


_Explanation_: Loads and merges the datasets via RefSeq_IDs as before, then prompts the user for either a Gene_ID or a ProbeSet_ID. The code searches for all rows in the merged dataset that match the input in either the 'Gene_ID' or 'ProbeSet_ID' column. If no records are returned, a message is printed; otherwise, all information related to the gene/probe is printed. Columns shared by both datasets, such as logFC, appear with the suffixes _x (microarray) and _y (RNA-Seq).
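The lookup above requires an exact match, so `esr1` or an input with stray spaces returns no records. A minimal sketch of a more forgiving search is given here as one possible extension; the function name `find_records` is hypothetical, and case-insensitive matching is an assumption about what a user would want rather than a requirement of the question.

```python
import pandas as pd


def find_records(df, query, columns=("Gene_ID", "ProbeSet_ID")):
    """Return rows where any of the given columns equals the query, ignoring case and outer spaces."""
    query = query.strip().lower()
    mask = pd.Series(False, index=df.index)
    for column in columns:
        # Normalise the column the same way as the query before comparing
        mask |= df[column].astype(str).str.strip().str.lower() == query
    return df[mask]
```

An empty result (`len(result) == 0`) can be handled exactly as in the cell above.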

#### e)

In [2]:
import pandas as pd

# load microarray and RNA-Seq datasets
microarray_df = pd.read_csv('Q3-microarray.csv')
rnaseq_df = pd.read_csv('Q3-rna-seq.csv')



########## part a #############

# filter for overexpressed genes
# microarray = logFC ≥ 1, adj.P.Val ≤ 0.05
microarray_overexpressed = microarray_df[(microarray_df['logFC'] >= 1) & (microarray_df['adj.P.Val'] <= 0.05)]['RefSeq_IDs']
# RNA-seq = logFC ≥ 1, q_value ≤ 0.05
rnaseq_overexpressed = rnaseq_df[(rnaseq_df['logFC'] >= 1) & (rnaseq_df['q_value'] <= 0.05)]['RefSeq_IDs']

# export
user_input = input("\nDo you want to export the overexpressed genes results to a file? (Y/N)\n")
if user_input.lower() == 'y':
    file_format = input("\nChoose file format (TXT/CSV):\n")
    if file_format.lower() == 'txt':
        microarray_overexpressed.to_csv("microarray_overexpressed.txt", index=False, header=None, sep='\t')
        rnaseq_overexpressed.to_csv("rnaseq_overexpressed.txt", index=False, header=None, sep='\t')
    elif file_format.lower() == 'csv':
        microarray_overexpressed.to_csv("microarray_overexpressed.csv", index=False, header=None)
        rnaseq_overexpressed.to_csv("rnaseq_overexpressed.csv", index=False, header=None)
    else:
        print("\nInvalid file format. Please choose either TXT or CSV.")
else:
    print("\nOverexpressed genes results not exported.")

########## part b #############

# extract RefSeq_IDs
microarray_ID = set(microarray_df['RefSeq_IDs'])
rnaseq_ID = set(rnaseq_df['RefSeq_IDs'])

# find common RefSeq_IDs
common_refseq = list(microarray_ID.intersection(rnaseq_ID))

# new df with common RefSeq_IDs
common_df = pd.DataFrame({'RefSeq_IDs': common_refseq})

# merge datasets via common RefSeq_IDs
merged_df = pd.merge(microarray_df, rnaseq_df, on='RefSeq_IDs')

# remove rows with RefSeq_IDs only appearing in one dataset
final_df = merged_df[merged_df['RefSeq_IDs'].isin(common_refseq)]

# export
user_input = input("\nDo you want to export the common RefSeq_IDs dataset as a file? (Y/N)\n")
if user_input.lower() == 'y':
    file_format = input("\nChoose file format (TXT/CSV):\n")
    if file_format.lower() == 'txt':
        final_df.to_csv("common_IDs_dataset.txt", index=False, header=True, sep='\t')
    elif file_format.lower() == 'csv':
        final_df.to_csv("common_IDs_dataset.csv", index=False, header=True)
    else:
        print("\nInvalid file format. Please choose either TXT or CSV.")
else:
    print("\nCommon RefSeq_IDs dataset not exported.")

########## part c #############

# filter for downregulated genes
# microarray = logFC ≤ -1, adj.P.Val ≤ 0.05 
microarray_downregulated = set(microarray_df[(microarray_df['logFC'] <= -1) & (microarray_df['adj.P.Val'] <= 0.05)]['RefSeq_IDs'])
# RNA-seq = logFC ≤ -1, q_value ≤ 0.05
rnaseq_downregulated = set(rnaseq_df[(rnaseq_df['logFC'] <= -1) & (rnaseq_df['q_value'] <= 0.05)]['RefSeq_IDs'])

# find common downregulated genes
common_downregulated_genes = microarray_downregulated.intersection(rnaseq_downregulated)

# export results as TXT or CSV file
user_input = input("\nDo you want to export the downregulated genes results as a file? (Y/N)\n")
if user_input.lower() == 'y':
    file_format = input("\nChoose file format (TXT/CSV):\n")
    if file_format.lower() == 'txt':
        # create new df downreg genes
        microarray_downregulated_df = pd.DataFrame({'RefSeq_IDs': list(microarray_downregulated)})
        rnaseq_downregulated_df = pd.DataFrame({'RefSeq_IDs': list(rnaseq_downregulated)})
        
        # export TXT 
        microarray_downregulated_df.to_csv("microarray_downregulated.txt", index=False, header=True, sep='\t')
        rnaseq_downregulated_df.to_csv("rnaseq_downregulated.txt", index=False, header=True, sep='\t')
        with open("common_downregulated.txt", "w") as f:
            f.write('\n'.join(common_downregulated_genes))
    elif file_format.lower() == 'csv':
        # create new df downreg genes
        microarray_downregulated_df = pd.DataFrame({'RefSeq_IDs': list(microarray_downregulated)})
        rnaseq_downregulated_df = pd.DataFrame({'RefSeq_IDs': list(rnaseq_downregulated)})
        common_downregulated_df = pd.DataFrame({'RefSeq_IDs': list(common_downregulated_genes)})
        
        # export CSV 
        microarray_downregulated_df.to_csv("microarray_downregulated.csv", index=False, header=True)
        rnaseq_downregulated_df.to_csv("rnaseq_downregulated.csv", index=False, header=True)
        common_downregulated_df.to_csv("common_downregulated.csv", index=False, header=True)
    else:
        print("\nInvalid file format. Please choose either TXT or CSV.")
else:
    print("\nDownregulated genes results not exported.")


########## part d #############

# Merge the datasets by Refseq_IDs
merged_df = pd.merge(microarray_df, rnaseq_df, on='RefSeq_IDs')

# Get user input for gene/probe ID
user_input = input('Enter a Gene_ID or ProbeSet_ID: ')

# Filter the merged dataset by user input
filtered_df = merged_df[(merged_df['Gene_ID'] == user_input) | (merged_df['ProbeSet_ID'] == user_input)]

# Check if any records were returned
if len(filtered_df) == 0:
    print("There are no records for this Gene or Probe")
else:
    # user input for export
    export_input = input('Do you want to export the gene/probe filtered dataset? (Y/N): ')

    if export_input.upper() == 'Y':
        # user input for file format
        file_format_input = input('Enter the file format for export (CSV/TXT): ')

        # export
        if file_format_input.upper() == 'CSV':
            filtered_df.to_csv(user_input + '_filtered.csv', index=False)
        elif file_format_input.upper() == 'TXT':
            filtered_df.to_csv(user_input + '_filtered.txt', index=False, sep='\t')
        else:
            print('Invalid file format.')
    elif export_input.upper() == 'N':
        print('Filtered dataset was not exported.')
    else:
        print('Invalid input. Filtered dataset was not exported.')


Do you want to export the overexpressed genes results to a file? (Y/N)
 n



Overexpressed genes results not exported.



Do you want to export the common RefSeq_IDs dataset as a file? (Y/N)
 n



Common RefSeq_IDs dataset not exported.



Do you want to export the downregulated genes results as a file? (Y/N)
 n



Downregulated genes results not exported.


Enter a Gene_ID or ProbeSet_ID:  ESR1
Do you want to export the gene/probe filtered dataset? (Y/N):  n


Filtered dataset was not exported.


_Explanation_: The update adds an option at each output stage for the user to decide whether to export the data and, if so, whether to export the relevant dataset as a CSV or TXT file. Feedback messages report when a dataset is not exported or when the input is invalid. This adds functionality and makes the program more user friendly.
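The four export branches repeat the same TXT/CSV logic. One way to reduce the repetition, sketched here under assumptions, is a single helper that maps the chosen format to a separator; `export_frame`, `SEPARATORS` and the `directory` parameter are hypothetical names, and the helper always writes a header row, which differs from part a above.

```python
import os

import pandas as pd

# File extension -> field separator used by DataFrame.to_csv
SEPARATORS = {"csv": ",", "txt": "\t"}


def export_frame(df, stem, file_format, directory="."):
    """Write df as <stem>.csv or tab-separated <stem>.txt and return the file path."""
    key = file_format.strip().lower()
    if key not in SEPARATORS:
        raise ValueError("Invalid file format. Please choose either TXT or CSV.")
    path = os.path.join(directory, f"{stem}.{key}")
    df.to_csv(path, index=False, sep=SEPARATORS[key])
    return path
```

Each part could then call, for example, `export_frame(final_df, "common_IDs_dataset", file_format)` inside a `try`/`except ValueError` block that prints the error message.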