# Verification of the train dataset

## Imports and data loading

In [1]:
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [17]:
def load_jsonl_df(jsonl_file, lab=False):
    """Load a chat-style JSONL file into a DataFrame.

    Each line must hold a 'messages' list whose first entry is the user turn
    and whose second entry is the assistant turn. With lab=True, the
    'target' field (expected pathway) is loaded as well.
    """
    user_contents = []
    assistant_contents = []
    labels = []

    with open(jsonl_file, 'r', encoding='utf-8') as file:
        for line in file:
            if not line.strip():
                continue  # skip blank lines (e.g. a trailing newline)
            data = json.loads(line)
            message = data['messages']
            user_contents.append(message[0]['content'])
            assistant_contents.append(message[1]['content'])
            if lab:
                labels.append(data['target'])

    df = pd.DataFrame({
        'user_content': user_contents,
        'assistant_content': assistant_contents,
    })
    if lab:
        df['target'] = labels
    return df
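
The loader above indexes `messages[0]` and `messages[1]` directly, so a record with a missing turn or a swapped order would either raise or silently land in the wrong column. A minimal sketch of a per-record check follows. It assumes each message carries a `role` key with the values `user` and `assistant`, which this notebook does not show, and `check_record` is a hypothetical helper name.

```python
import json

def check_record(data, lab=False):
    """Return a list of problems found in one parsed JSONL record."""
    problems = []
    messages = data.get('messages')
    if not isinstance(messages, list) or len(messages) < 2:
        return ["'messages' must be a list with at least two entries"]
    # Assumption: each message has a 'role' key ('user' / 'assistant').
    expected_roles = ['user', 'assistant']
    for position, role in enumerate(expected_roles):
        message = messages[position]
        if message.get('role') != role:
            problems.append(f"messages[{position}] should have role '{role}'")
        if not isinstance(message.get('content'), str):
            problems.append(f"messages[{position}] has no text content")
    if lab and 'target' not in data:
        problems.append("labelled record is missing 'target'")
    return problems

# Example on an in-memory line rather than a real dataset file.
line = json.dumps({
    'messages': [
        {'role': 'user', 'content': 'List of genes : A,B'},
        {'role': 'assistant', 'content': 'The pathway is X.'},
    ],
    'target': 'X',
})
print(check_record(json.loads(line), lab=True))  # [] means no problem found
```

Running such a check on every line before building the DataFrame would make it easier to trace a malformed record back to its line number.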

In [71]:
data_lab = load_jsonl_df('dataset/train_dataset_lab.jsonl', lab=True)
data_lab.shape

(500, 3)

In [15]:
data_nc = load_jsonl_df('dataset/train_dataset_nc.jsonl')
data_nc.shape

(110, 2)

In [51]:
dataset = data_lab.drop("target", axis=1)

In [52]:
dataset.shape

(200, 2)

In [54]:
dataset = pd.concat([dataset, data_nc])
dataset.shape

(310, 2)

In [55]:
dataset.head()

|   | user_content | assistant_content |
|---|---|---|
| 0 | List of genes : <<< ADH1C,UGT2B15,UGT1A4,CYP2D... | The analysis of the provided gene and compound... |
| 1 | List of genes : <<< ALDH3B1,GOT1,AOC3,GOT2,PAH... | The metabolic pathway identified from the prov... |
| 2 | List of genes : <<< GLUL,CAT,HYI,AGXT,ACAT2,HA... | The metabolic pathway identified in the experi... |
| 3 | List of genes : <<< CYP2C8,PLA2G6,PLA2G4D,PLA2... | The metabolic pathway identified in this exper... |
| 4 | List of genes : <<< AK9,DGUOK,ENTPD8,NT5C3B,AD... | The metabolic pathway identified in this exper... |


## Pathway in assistant response is the right one

In [46]:
def verify_response(row):
    # True when the expected pathway name appears verbatim
    # (case-insensitive) in the assistant response.
    return row['target'].lower() in row['assistant_content'].lower()

In [47]:
data_lab['Correct'] = data_lab.apply(verify_response, axis=1)

In [48]:
data_lab["Correct"].value_counts()

Correct
True     179
False     21
Name: count, dtype: int64

In [44]:
def print_cells(row):
    # Print the expected pathway, then the full assistant response.
    print(row['target'])
    print(row['assistant_content'] + '\n\n')

In [45]:
data_lab.loc[~data_lab["Correct"]].apply(print_cells, axis=1)

Glycolysis / Gluconeogenesis
The metabolic pathway identified based on the altered genes and compounds is Glycolysis/Gluconeogenesis. This conclusion is supported by the presence of multiple genes directly involved in this pathway, such as PGM1, PFKL, PGAM4, ENO2, and HK2. The compounds D-Glucose, D-Fructose 6-phosphate, beta-D-Glucose 6-phosphate, D-Glyceraldehyde 3-phosphate, 3-Phospho-D-glycerate, 2-Phospho-D-glycerate, and Phosphoenolpyruvate are key intermediates in this pathway. Additionally, the presence of Acetyl-CoA and Acetate suggests involvement of pyruvate metabolism, which is closely linked to Glycolysis/Gluconeogenesis. The altered genes DLAT, LDHAL6A, ALDH3B2, ALDH3A1, and FBP2, while not directly in the pathway, are involved in related metabolic processes, further supporting the identification of Glycolysis/Gluconeogenesis as the affected pathway.


Neomycin, kanamycin and gentamicin biosynthesis
The metabolic pathway identified in this experiment is the "Neomycin, kan

10     None
13     None
21     None
28     None
41     None
42     None
46     None
50     None
76     None
82     None
88     None
105    None
120    None
133    None
143    None
145    None
150    None
153    None
154    None
166    None
198    None
dtype: object

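After checking the 21 rows flagged as incorrect, the mismatches appear to come from special characters being added to or dropped from the pathway names (for example, "Glycolysis/Gluconeogenesis" in the response versus "Glycolysis / Gluconeogenesis" in the target).
Apart from this, all assistant responses were correct.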
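
Since the mismatches come from punctuation and spacing, a more tolerant comparison can reduce the number of rows that need manual review. The sketch below is one possible approach, not the check used in this notebook: it strips everything except ASCII letters and digits before comparing. `normalize` and `normalized_match` are hypothetical helper names.

```python
import re

def normalize(text):
    """Lowercase and drop every character that is not a letter or digit."""
    return re.sub(r'[^a-z0-9]', '', text.lower())

def normalized_match(target, response):
    """True when the target name appears in the response after normalization."""
    return normalize(target) in normalize(response)

target = 'Glycolysis / Gluconeogenesis'
response = 'The metabolic pathway identified is Glycolysis/Gluconeogenesis.'
print(target.lower() in response.lower())  # strict check: False
print(normalized_match(target, response))  # tolerant check: True
```

On the DataFrame this could be applied row by row, for example with `data_lab.apply(lambda r: normalized_match(r['target'], r['assistant_content']), axis=1)`. Removing spaces also removes word boundaries, so a short pathway name could in principle match across two unrelated words; rows that pass only the tolerant check may still deserve a quick look.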

In [50]:
data_lab = data_lab.drop("Correct", axis=1)
data_lab.shape

(200, 3)

## Representation of each pathway

Iter1: The representation of pathways looks somewhat imbalanced, especially for pathways that are quite specific (Neomycin, kanamycin and gentamicin biosynthesis; Caffeine metabolism). Some pathways still have only one example, which highlights the need for further generation. A new generation round was run, excluding these 2 pathways.

Iter2: The representation is still quite imbalanced. A new generation round is run on only the pathways with fewer than 4 examples, excluding pathways that, to my knowledge, are not very specific or are quite rare. Resulting (keeper) list:

In [None]:
iter3gen = ['Phenylalanine metabolism', 'Sulfur metabolism',
       'Nicotinate and nicotinamide metabolism', 'Linoleic acid metabolism',
       'Purine metabolism', 'Glycerophospholipid metabolism',
       'Propanoate metabolism',
       'Ether lipid metabolism', 'Steroid biosynthesis',
       'Pentose and glucuronate interconversions', 'Fatty acid elongation',
       'Pantothenate and CoA biosynthesis', 'Fatty acid metabolism',
       'D-Amino acid metabolism', 'Folate biosynthesis',
       'Terpenoid backbone biosynthesis',
       'Amino sugar and nucleotide sugar metabolism',
       'Mucin type O-glycan biosynthesis', 'Sphingolipid metabolism',
       'Lipoic acid metabolism', 'Pyrimidine metabolism',
       'alpha-Linolenic acid metabolism', 'Nitrogen metabolism',
       'One carbon pool by folate',
       'Valine, leucine and isoleucine degradation']
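
The selection described for Iter2 amounts to a count-then-filter step followed by a manual exclusion. The sketch below illustrates that logic on toy labels using only the standard library. `select_for_generation`, its parameters, and the example labels are illustrative assumptions, not the code that produced the list above.

```python
from collections import Counter

def select_for_generation(labels, min_count=4, excluded=()):
    """Return pathways with fewer than min_count examples, minus exclusions."""
    counts = Counter(labels)
    excluded = set(excluded)
    return sorted(
        name for name, n in counts.items()
        if n < min_count and name not in excluded
    )

# Toy labels, not the real dataset counts.
labels = (['Fatty acid metabolism'] * 5
          + ['Sulfur metabolism'] * 2
          + ['Caffeine metabolism'] * 1)
print(select_for_generation(labels, excluded=['Caffeine metabolism']))
# ['Sulfur metabolism']
```

A pathway with zero examples never appears in the counts, so it would have to be added from a reference list of all pathways if it should be generated too.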

In [75]:
pd.set_option('display.max_rows', 99)
data_lab['target'].value_counts()

target
Fatty acid metabolism                                                      13
Arachidonic acid metabolism                                                11
Ether lipid metabolism                                                      9
Biosynthesis of unsaturated fatty acids                                     9
Purine metabolism                                                           9
Alanine, aspartate and glutamate metabolism                                 8
Vitamin B6 metabolism                                                       8
Starch and sucrose metabolism                                               8
Fructose and mannose metabolism                                             8
Drug metabolism - cytochrome P450                                           8
Sulfur metabolism                                                           8
Nicotinate and nicotinamide metabolism                                      8
Phosphonate and phosphinate metabolism                   

In [73]:
vc = data_lab['target'].value_counts()
vc[vc < 4].index

Index(['Nitrogen metabolism', 'Drug metabolism - other enzymes',
       'Glycosphingolipid biosynthesis - lacto and neolacto series',
       'Folate biosynthesis', 'Butanoate metabolism'],
      dtype='object', name='target')