In [1]:
import torch
import numpy as np
import json
import os
import re
import pandas as pd
import shutil

from datasets import Dataset, load_from_disk, concatenate_datasets, DatasetDict

seed = 42
torch.manual_seed(seed)
np.random.seed(seed)



In [3]:
df = pd.read_csv("/users/zlyu12/Desktop/c2s-RL/Create_Dataset/temp/local(8)_cleaned_DEGs.csv")
df

Unnamed: 0,Cell_Type,Num_DE_Genes,Percentage_Cells,Mean_LogFC,Max_LogFC,Min_Pval,Top_DEG_1,Top_DEG_2,Top_DEG_3,Top_DEG_4,...,Top_DEG_11,Top_DEG_12,Top_DEG_13,Top_DEG_14,Top_DEG_15,Top_DEG_16,Top_DEG_17,Top_DEG_18,Top_DEG_19,Top_DEG_20
0,natural T-regulatory cell,26,21.74,-1.062835,1.314502,3.581882e-219,LTB,CCL20,CD40LG,TNFRSF25,...,TYROBP,CCL3L1,CCL3,IFNG,CTSW,GZMK,GNLY,PRF1,GZMH,KLRD1
1,natural killer cell,691,4.05,1.330674,6.552441,1.481299e-233,GZMA,TYROBP,NKG7,FCER1G,...,CD7,GSTP1,GZMB,KLRD1,CTSW,CST7,HOPX,CD63,CYBA,ARPC2
2,"activated CD8-positive, alpha-beta T cell",6,32.61,-1.605185,-1.249155,6.646944e-15,SELL,FCER1G,CCR7,CCL3,...,,,,,,,,,,
3,gamma-delta T cell,7,16.5,-0.278854,1.2694,6.208144e-150,HSPA1A,HSPA6,DNAJA4,CCR7,...,,,,,,,,,,
4,memory T cell,64,15.16,1.079776,5.538998,0.0,CCL4,CCL4L2,KLF6,NR4A2,...,IFNG,RGCC,CCL3L1,SYTL3,FOS,CCL5,REL,CST7,ADGRE5,GZMK
5,CD4-positive helper T cell,131,7.89,-1.050419,2.781162,2.997044e-291,RPS8,RPL12,RPS11,RPL22,...,SELL,TSHZ2,MCUB,PASK,SATB1,SESN3,RIPOR2,LEF1,AP3M2,FCMR
6,native cell,215,2.06,-0.907267,3.530351,2.6265809999999997e-142,RPS8,RPL32,RPS18,RPS12,...,RPL5,RPL12,RPL13A,RPL34,RPS14,RPS3A,RPL21,RPS4X,RPS13,RPL35A
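
Rows 2 and 3 above leave their trailing Top_DEG columns empty, so any code that reads the gene list has to skip missing values. A minimal sketch of that step, using a toy frame whose cell-type labels and gene names are placeholders (not values from the dataset):

```python
import pandas as pd

# Toy frame with the same column layout as the table above.
# All labels and gene names here are placeholders for illustration.
toy = pd.DataFrame(
    {
        "Cell_Type": ["cell type A", "cell type B"],
        "Num_DE_Genes": [3, 2],
        "Top_DEG_1": ["GENE1", "GENE4"],
        "Top_DEG_2": ["GENE2", "GENE5"],
        "Top_DEG_3": ["GENE3", None],  # fewer DEGs -> missing trailing value
    }
)

deg_cols = [c for c in toy.columns if c.startswith("Top_DEG_")]

# Collect the non-missing genes per row, preserving rank order.
genes_per_type = {
    row["Cell_Type"]: [row[c] for c in deg_cols if pd.notna(row[c])]
    for _, row in toy.iterrows()
}
print(genes_per_type)
```

The same `pd.notna` check is what keeps empty columns out of the prompt text built later in this notebook.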


helper functions and constants

In [15]:
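def dataset_for_prompt(df):
    """Flatten a DEG summary table into one text line per cell type for the prompt."""
    s = ""
    # summary columns vs. the ranked Top_DEG_* gene columns
    cols = [c for c in df.columns if not c.startswith('Top_DEG_')]
    degs = [c for c in df.columns if c.startswith('Top_DEG_')]
    for _, row in df.iterrows():
        for col in cols:
            s += f"{col}: {row[col]}, "
        s += "Differentially Expressed Genes: "
        for deg in degs:
            # cell types with fewer DEGs have NaN in the trailing columns
            if pd.notna(row[deg]):
                s += f"{row[deg]}, "
        s += "\n"

    return s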

prompt_prefix = """This is a scientific manuscript (attached) and an analysis of the single-cell RNA sequencing dataset it is associated with. The analysis is about differentially expressed genes in representative cells from each type.
I'm a researcher training LLMs to understand and be able to analyze and reason with gene expression data formatted as cell sentences (gene names ranked by expression). You need to create example context-question-answer-reasoning pairs for Large Language Models to learn to analyze and reason about gene expression datasets when given pieces of data converted into cell sentences. I want questions that researchers (users) who got a raw scRNA-seq dataset as a series of cell sentences would want to ask about it, maybe in terms of differentially expressed genes, cell type, tissue origin, disease, or other relevant biological information. You should pretend to be a biologist who has only the original gene expression data, with no prior knowledge, no analysis done, no understanding of the data, and not even the cell type. Imitate the tone of such a researcher as the user when asking the question and providing the context. The questions must be open-ended and answerable by the model from the original cell sentences alone, without any analysis given.
First read the given manuscript and consider the biological context of this study, the logical progression, and the conclusions made in the manuscript. Then look at the given gene expression analysis and think about what the researchers could have observed, analyzed, or asked about the original gene expression. When answering the question, discuss the genes that are most relevant.
The manuscript and analysis are only for your reference to create the questions and answers; don't use them directly, quote any part, or even mention them. Do not include any gene name in the context or question, but the answer should have them. Avoid questions about experiment design or procedures, or about general facts. Vary the questions as much as possible to cover a diverse range of topics.
Besides the questions and answers, each pair should include the following. The Context is a brief general background of the study. Do not summarize or describe the cell sentence or gene expression because I want my model to get trained to make those observations. Chain of Thoughts is very detailed reasoning and analysis used to train my LLM to reason and think like human researchers. This should be no less than 50 words and can run to hundreds of words; add a number before each step (like 1. reasoning. 2. reasoning ...). Refer to a cell sentence as gene expression of ... cell (replace ... with the cell type). The Answer should be no less than 20 words. The Keywords are the most essential parts of the answer, like important gene names or biological information. These keywords will be used during training to validate my model's response and must be in the Answer as well.
Give 30 question-answer pairs as one list; don't give any other words. Strictly format like this:
<|Context|>the context<|Question|>the content of the question<|Chain of Thoughts|>the intermediate reasonings<|Answer|>the content of the answer<|Keyword|>the answer keywords
<|Context|>the context<|Question|>the content of the question<|Chain of Thoughts|>the intermediate reasonings<|Answer|>the content of the answer<|Keyword|>the answer keywords
...
"""
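
The tagged format requested at the end of the prompt can be split back into records with a regular expression. This is only a sketch under assumptions: `parse_pairs` and `missing_keywords` are hypothetical helper names, the five tags are assumed to appear in the order given in the prompt, and the sample text uses placeholder gene names. The keyword check uses plain substring matching, so a keyword such as GENE1 would also match GENE10.

```python
import re

FIELDS = ["Context", "Question", "Chain of Thoughts", "Answer", "Keyword"]

# One record: the five tags in order, each followed by its content.
# The last field runs until the next <|Context|> tag or the end of the text.
RECORD_RE = re.compile(
    r"<\|Context\|>(.*?)<\|Question\|>(.*?)<\|Chain of Thoughts\|>(.*?)"
    r"<\|Answer\|>(.*?)<\|Keyword\|>(.*?)(?=<\|Context\|>|\Z)",
    re.DOTALL,
)


def parse_pairs(text):
    """Split tagged LLM output into a list of dicts, one per record."""
    records = []
    for match in RECORD_RE.finditer(text):
        # Collapse stray newlines and repeated spaces inside each field.
        values = [" ".join(group.split()) for group in match.groups()]
        records.append(dict(zip(FIELDS, values)))
    return records


def missing_keywords(record):
    """Return the keywords that do not occur in the record's answer."""
    answer = record["Answer"].lower()
    keywords = [k.strip() for k in record["Keyword"].split(",") if k.strip()]
    return [k for k in keywords if k.lower() not in answer]


# Placeholder sample in the requested format.
sample = (
    "<|Context|>ctx one<|Question|>q one?<|Chain of Thoughts|>1. step.\n2. step."
    "<|Answer|>GENE1 and GENE2 matter.<|Keyword|>GENE1, GENE2 "
    "<|Context|>ctx two<|Question|>q two?<|Chain of Thoughts|>1. step."
    "<|Answer|>Only GENE3 here.<|Keyword|>GENE3, GENE4\n"
)
records = parse_pairs(sample)
print(len(records), [missing_keywords(r) for r in records])
```

Records whose keyword list is not fully covered by the answer can then be flagged or dropped before they reach training.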

Loads all datasets and metadata. \
Assumes the summary datasets are CSV files in one directory, in the format processed by Harry. \
All files used here can be found in my C2S-RL dev GitHub repo branch.

In [16]:
# TODO: change to your own file path
datasets_directory = "/users/zlyu12/Desktop/c2s-RL/Dec19_dataset" # all summary datasets
meta_data_path = "/users/zlyu12/Desktop/c2s-RL/Create_Dataset/meta_data.json" # dataset name, filename, url etc.
hf_dataset_output_path = "/users/zlyu12/Desktop/c2s-RL/Create_Dataset/temp_hf_dataset_new" # temporary output directory

datasets_files_paths = [os.path.join(datasets_directory, f) for f in os.listdir(datasets_directory) if f.endswith('.csv')]
dataset_numbers = np.sort([int(re.search(r'\((\d+)\)', f).group(1)) for f in datasets_files_paths])

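with open(meta_data_path) as f:
    meta_data = json.load(f)
dataset_numbers_iterator = iter(dataset_numbers[5:])  # start from the sixth dataset number

try:
    hf_dataset = load_from_disk(hf_dataset_output_path)
except FileNotFoundError:
    # nothing saved at the output path yet: start from an empty dataset
    hf_dataset = Dataset.from_dict({})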

print(hf_dataset)

Dataset({
    features: [],
    num_rows: 0
})
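
The loading cell reads each dataset number from a parenthesized index in the file path and fails with an AttributeError if a CSV has none. A more defensive sketch, assuming the index always lives in the file name itself; `dataset_number` is a hypothetical helper and the file names below are made up for illustration:

```python
import os
import re

INDEX_RE = re.compile(r"\((\d+)\)")


def dataset_number(path):
    """Return the integer inside the first '(N)' of the file name, or None."""
    # Look only at the file name so parentheses in directory names are ignored.
    match = INDEX_RE.search(os.path.basename(path))
    return int(match.group(1)) if match else None


# Made-up paths for illustration.
paths = ["data/local(12)_summary.csv", "data/local(8)_summary.csv", "data/notes.csv"]
numbers = sorted(n for n in map(dataset_number, paths) if n is not None)
print(numbers)
```

Files without an index are skipped here rather than raising, which may or may not be the desired behavior for a given directory.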


Iterates through all datasets in your directory, one per run. \
This cell prepares the prompt. \
Copy and paste it, add the publication text at the end, and run our favorite LLM. \
Run this cell only once per dataset (each run advances the iterator); later cells rely on variables defined here.

In [19]:
dataset_index = str(next(dataset_numbers_iterator))
dataset_name = [k for k,v in meta_data.items() if f'({dataset_index})' in v.get('filename', '')]
if len(dataset_name) != 1:
    print("dataset index: ", dataset_index)
    print(dataset_name)
    print(f"dataset \"{dataset_name}\" not in meta_data!")
else:
    dataset_name = dataset_name[0]
    cur_url = meta_data[dataset_name]['url']
    dataset_file_path = [path for path in datasets_files_paths if f"({dataset_index})" in path][0]
    dataset_df = pd.read_csv(dataset_file_path)  # read the file matched to this dataset index
    dataset_in_prompt = dataset_for_prompt(dataset_df)
    print("Dataset Index: ", dataset_index)
    print("\nDataset Name: ", dataset_name)
    print("\nPublication URL: ", cur_url)
    print("\nPrompt:\n", prompt_prefix + "Dataset: \n" + dataset_in_prompt + "Manuscript:\n")


Dataset Index:  8

Dataset Name:  UMAP of T-Cells cells

Publication URL:  https://cellxgene.cziscience.com/collections/a18474f4-ff1e-4864-af69-270b956cee5b

Prompt:
 This is a scientific manuscript (attached) and an analysis of the single-cell RNA sequencing dataset it is associated with. The analysis is about differentially expressed genes in representative cells from each type.
I'm a researcher training LLMs to understand and be able to analyze and reason with gene expression data formatted as cell sentences (gene names ranked by expression). You need to create example context-question-answer-reasoning pairs for Large Language Models to learn to analyze and reason about gene expression datasets when given pieces of data converted into cell sentences. I want questions that researchers (users) who got a raw scRNA-seq dataset as a series of cell sentences would want to ask about it, maybe in terms of differentially expressed genes, cell type, tissue origin, disease, or other relevant biolog
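
The metadata lookup in the cell above relies on `meta_data` mapping each dataset name to a dict with `filename` and `url` keys. A small sketch of the same lookup as a function that raises when the match is not unique; `find_dataset` is a hypothetical helper, and the names, file names, and URLs below are placeholders:

```python
# Placeholder metadata with the shape the lookup relies on:
# dataset name -> {"filename": ..., "url": ...}
meta_data = {
    "Example dataset A": {"filename": "local(8)_example.csv", "url": "https://example.org/a"},
    "Example dataset B": {"filename": "local(18)_example.csv", "url": "https://example.org/b"},
    "Example dataset C": {"url": "https://example.org/c"},  # no filename recorded
}


def find_dataset(meta_data, dataset_index):
    """Return (name, entry) for the single entry whose filename contains '(index)'."""
    tag = f"({dataset_index})"
    hits = [
        (name, entry)
        for name, entry in meta_data.items()
        if tag in entry.get("filename", "")
    ]
    if len(hits) != 1:
        raise KeyError(f"expected exactly one entry for {tag}, found {len(hits)}")
    return hits[0]


name, entry = find_dataset(meta_data, "8")
print(name, entry["url"])
```

Because the parentheses are part of the search string, index 8 does not match a file tagged (18).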

Just copy and paste the LLM output into the next cell; no additional processing should be needed.

In [20]:
output = """
<|Context|>In this study, we explore the immune regulatory landscape within a complex tissue, aiming to understand cell-specific signaling pathways.<|Question|>Based on the raw gene expression data from a regulatory immune cell, what potential signaling roles can be inferred?<|Chain of Thoughts|>Examining the gene expression of natural T-regulatory cell reveals several markers that are typically associated with immune modulation. By considering the differential expression metrics and the balance between upregulated and downregulated genes, one can deduce that these cells might be involved in cytokine signaling, cell–cell communication, and immune suppression. The interplay among these markers hints at coordinated regulatory pathways that maintain tissue homeostasis.<|Answer|>The expression data in the natural T-regulatory cell indicates involvement of LTB, CCL20, CD40LG, and TNFRSF25, suggesting these genes contribute to cytokine signaling and immune regulation.<|Keyword|>LTB, CCL20, CD40LG, TNFRSF25 <|Context|>This investigation focuses on cytotoxic immune responses, examining cells known for their ability to kill target cells.<|Question|>What insights regarding cytotoxic potential and effector functions can be derived from the gene expression of a cytotoxic immune cell?<|Chain of Thoughts|>Analyzing the gene expression of a natural killer cell shows a wide range of differentially expressed genes. High levels of cytotoxic markers and key signaling molecules are evident. Such a pattern is typically linked to the cell’s capacity for degranulation and target cell lysis. 
The diversity and strength of these gene signals suggest that this cell type is well-equipped to carry out rapid and effective immune responses.<|Answer|>The gene expression of natural killer cell reveals high expression of GZMA, TYROBP, NKG7, and GZMB, indicating robust cytotoxicity and strong effector functions essential for target cell elimination.<|Keyword|>GZMA, TYROBP, NKG7, GZMB <|Context|>This study examines effector T cell responses in a complex tissue environment with a focus on activated immune cells.<|Question|>What potential effector functions might be suggested by the gene expression patterns of an activated T cell?<|Chain of Thoughts|>Considering the gene expression of an activated CD8-positive, alpha-beta T cell, one notes a pattern where key adhesion molecules and signaling factors are differentially expressed. Although the overall expression is downregulated, the presence of certain markers indicates roles in cell adhesion, migration, and immune activation. This profile may reflect a transient state where the cell is either preparing for migration or entering a phase of controlled activation.<|Answer|>The gene expression of activated CD8-positive, alpha-beta T cell highlights SELL, FCER1G, CCR7, and CCL3, implying roles in cell adhesion, migration, and the orchestration of immune responses.<|Keyword|>SELL, FCER1G, CCR7, CCL3 <|Context|>In a study of unconventional T cell populations, researchers analyze cells that deviate from classical T cell profiles.<|Question|>What could the gene expression data from an unconventional T cell suggest about its stress response and immunological function?<|Chain of Thoughts|>The gene expression of a gamma-delta T cell includes markers typically associated with heat shock and stress response, as well as some immune regulatory signals. This blend indicates that such cells might be primed to react to stressors while also participating in immune surveillance. 
Their moderate differential expression pattern suggests a balance between maintaining cellular integrity under stress and executing immune functions.<|Answer|>The gene expression of gamma-delta T cell, featuring HSPA1A, HSPA6, DNAJA4, and CCR7, indicates a role in managing cellular stress and modulating immune responses effectively.<|Keyword|>HSPA1A, HSPA6, DNAJA4, CCR7 <|Context|>Investigating long-term immune memory, the study focuses on cells that store immunological recall information.<|Question|>What aspects of immune memory and regulation might be reflected in the gene expression of a memory cell?<|Chain of Thoughts|>The gene expression of a memory T cell exhibits a pattern with significant upregulation of effector and regulatory genes. This suggests that memory cells retain a heightened state of readiness for reactivation upon antigen re-encounter. The interplay between cytokine production and transcriptional regulation within these cells underscores their ability to sustain long-term immunity and quickly mobilize defense mechanisms.<|Answer|>The gene expression of memory T cell, with notable markers such as CCL4, FOS, IFNG, and GZMB, suggests a robust capacity for recall responses and effective regulation of secondary immune challenges.<|Keyword|>CCL4, FOS, IFNG, GZMB <|Context|>In analyzing adaptive immune responses, the study examines cells that assist in coordinating various immune functions.<|Question|>What immune coordination roles can be hypothesized from the gene expression profile of a helper T cell?<|Chain of Thoughts|>Reviewing the gene expression of a CD4-positive helper T cell reveals a mix of ribosomal components and key signaling factors. This combination indicates high metabolic activity along with the capacity to support and direct other immune cells. 
The presence of several markers involved in cell communication implies that these cells are essential for orchestrating complex immune interactions, from antigen presentation to the stimulation of cytotoxic responses.<|Answer|>The gene expression of CD4-positive helper T cell shows involvement of RPS8, RPL12, CCR7, and LTB, hinting at roles in antigen presentation and coordinating responses among various immune cell types.<|Keyword|>RPS8, RPL12, CCR7, LTB <|Context|>Native cells serve as a control in this study, providing baseline gene expression data for healthy cellular functions.<|Question|>What can be inferred about the general cellular activity from the gene expression profile of a native cell?<|Chain of Thoughts|>The native cell gene expression profile is dominated by housekeeping genes, particularly those coding for ribosomal proteins. This pattern reflects the essential functions required for protein synthesis and cellular maintenance. Such a stable expression profile serves as an important baseline against which the more dynamic expression patterns in activated or specialized cells can be compared, highlighting deviations associated with specific functions or disease states.<|Answer|>The gene expression of native cell, characterized by several ribosomal proteins like RPS8, RPL32, and RPS18, suggests a strong baseline activity for protein synthesis and general metabolic processes.<|Keyword|>RPS8, RPL32, RPS18 <|Context|>The research investigates immune regulatory cells within a complex tissue microenvironment to understand their signaling dynamics.<|Question|>How might the extent of differential gene expression in a regulatory cell inform its potential impact on immune modulation?<|Chain of Thoughts|>In the natural T-regulatory cell, moderate fold changes in key genes are observed alongside highly significant p-values, indicating that even subtle shifts in expression can have pronounced regulatory effects. 
The magnitude of these changes provides clues about the cell's role in maintaining immune balance and modulating responses. The interplay of these expression levels may highlight a fine-tuned regulatory network essential for immune homeostasis.<|Answer|>The natural T-regulatory cell's expression, featuring genes like IFNG, CD4, and GZMH with moderate differential fold changes, suggests it plays a balanced role in dampening immune responses while facilitating regulatory signaling.<|Keyword|>IFNG, CD4, GZMH <|Context|>This investigation explores the diversity of cytotoxic cells, emphasizing their varied gene expression profiles.<|Question|>What does the large number of differentially expressed genes in a cytotoxic cell imply about its functional versatility?<|Chain of Thoughts|>Natural killer cells exhibit an exceptionally high number of differentially expressed genes, pointing to a complex network of activation and effector pathways. This extensive expression profile suggests that these cells are equipped to handle multiple functional roles, from direct target cell lysis to modulating the activity of other immune cells. The breadth of gene expression reflects their adaptability in varied immune contexts, underlining their critical role in defense mechanisms.<|Answer|>The extensive differential expression in natural killer cell, with markers such as PRF1, GZMB, and NKG7, implies a wide range of cytotoxic and regulatory functions that contribute to its versatile immune role.<|Keyword|>PRF1, GZMB, NKG7 <|Context|>The study focuses on effector T cells, which are crucial for immediate immune responses in tissues.<|Question|>What could the observed downregulated expression in an activated T cell reveal about its activation status and potential exhaustion?<|Chain of Thoughts|>In activated CD8-positive, alpha-beta T cells, the negative mean log fold change suggests a controlled downregulation of certain effector molecules. 
This could be indicative of a feedback mechanism to prevent overactivation or an early sign of cellular exhaustion. Such regulatory adjustments are common in highly active immune cells as they balance between effective pathogen clearance and self-tolerance, ensuring immune responses do not become detrimental.<|Answer|>The downregulated profile in activated CD8-positive, alpha-beta T cell, including reduced expression of markers like CCR7 and TYROBP, might reflect a transitional state toward cellular exhaustion or a finely tuned regulatory mechanism during activation.<|Keyword|>CCR7, TYROBP <|Context|>Unconventional T cells are examined for their unique expression patterns, which may reveal adaptive responses to stress.<|Question|>How does the gene expression pattern in an unconventional T cell suggest its capability to adapt to diverse stress signals?<|Chain of Thoughts|>The gamma-delta T cell displays a combination of stress-related proteins and moderate activation markers. The presence of heat shock proteins and chaperones indicates that these cells are well-equipped to manage environmental stress. This adaptive gene expression allows them to swiftly respond to cellular damage or pathogen-induced stress, highlighting their role as first responders in the immune system.<|Answer|>The gamma-delta T cell's expression, including markers like LTB and HSPA6, combined with stress response genes, indicates an adaptive capacity to respond to diverse stress signals and modulate immune activity accordingly.<|Keyword|>LTB, HSPA6 <|Context|>Long-lived immune cells are critical for rapid recall responses, and their gene expression patterns are key to understanding memory functions.<|Question|>How can the upregulation of certain effector genes in a memory cell contribute to its rapid response during secondary infections?<|Chain of Thoughts|>Memory T cells are primed for quick activation upon antigen re-encounter. 
Their gene expression profile, which includes several upregulated effector genes, suggests a state of readiness that enables rapid cytokine production and cell-mediated responses. This preparation is essential for an effective and timely immune response, ensuring that memory cells can quickly mobilize and protect the host during re-infection.<|Answer|>The memory T cell's elevated expression of genes such as IFNG, GZMB, and CCL3 suggests these molecules are crucial for mounting a swift and effective response during secondary infections, enhancing overall immune recall.<|Keyword|>IFNG, GZMB, CCL3 <|Context|>Helper T cells play a vital role in orchestrating immune responses through diverse molecular signals.<|Question|>What does the diversity of gene expression in helper T cells reveal about their multifaceted roles in immunity?<|Chain of Thoughts|>CD4-positive helper T cells exhibit a broad spectrum of expressed genes, ranging from ribosomal proteins to chemokine receptors. This diversity underlines their ability to support various immune processes, including cell proliferation, antigen presentation, and intercellular communication. The multiplicity of markers reflects their central role in coordinating both innate and adaptive immune responses, highlighting their importance in maintaining immune system balance.<|Answer|>The diversity in helper T cell expression, featuring genes like RPS8, CCR7, and LTB, indicates their multifaceted role in supporting cellular communication, proliferation, and immune regulation across different immune compartments.<|Keyword|>RPS8, CCR7, LTB <|Context|>Native cells are used as a reference to gauge changes in gene expression in more specialized or diseased cells.<|Question|>How does the gene expression profile of a native cell serve as a reference for detecting alterations in other cell populations?<|Chain of Thoughts|>Native cells primarily express housekeeping and ribosomal genes at stable levels. 
This consistency provides a benchmark for comparing the more dynamic gene expression patterns observed in activated or diseased cells. Any significant deviations from this baseline can be attributed to specific cellular responses or pathological conditions, making native cell profiles essential for identifying meaningful changes in gene expression across different samples.<|Answer|>The native cell's expression of consistent housekeeping genes like RPS8 and RPS6 serves as a reliable baseline, enabling the detection of significant deviations in gene expression in other cell populations under various conditions.<|Keyword|>RPS8, RPS6 <|Context|>The research investigates immune regulatory cells within complex tissues to understand their impact on intercellular communication.<|Question|>What insights into cell communication can be derived from the gene expression profile of a regulatory cell?<|Chain of Thoughts|>Natural T-regulatory cells express several key genes that are known to mediate intercellular interactions. By analyzing their gene expression, one can identify signals that potentially influence the behavior of surrounding cells. These molecular cues are critical for establishing an environment where immune responses are finely regulated, preventing overactivation while ensuring effective defense mechanisms are maintained.<|Answer|>The expression of genes like CD40LG and CCL3 in natural T-regulatory cells points to mechanisms that mediate intercellular communication, influencing the activity and coordination of neighboring immune cells.<|Keyword|>CD40LG, CCL3 <|Context|>The cytotoxic cells in this study are examined for the signaling pathways that govern their activation and function.<|Question|>How might the signaling pathways indicated by the gene expression of cytotoxic cells contribute to their effector functions?<|Chain of Thoughts|>In natural killer cells, the presence of specific signaling molecules is essential for activating cytotoxic responses. 
The expression of these genes helps trigger degranulation and the release of cytotoxic factors, which are crucial for eliminating infected or malignant cells. By mapping these pathways, one can better understand how these cells are activated and how they interact with other components of the immune system to execute their functions.<|Answer|>The expression of genes such as TYROBP and CD247 in natural killer cells suggests that these signaling pathways are pivotal in activating cytotoxic responses and orchestrating effective target cell lysis.<|Keyword|>TYROBP, CD247 <|Context|>Activated T cells are critical for reaching sites of infection, and their migratory behavior is reflected in their gene expression profiles.<|Question|>What does the gene expression profile of an activated T cell reveal about its migratory capacity?<|Chain of Thoughts|>In activated CD8-positive T cells, molecules involved in cell adhesion and chemotaxis play an essential role in migration. The presence of these markers suggests that these cells are equipped to travel from circulation to sites of inflammation or infection. This migratory potential is vital for the timely delivery of effector functions to the affected tissue, underscoring the importance of these genes in immune defense.<|Answer|>The elevated expression of SELL and CCR7 in activated CD8-positive T cells indicates a high migratory potential, facilitating their efficient movement toward inflamed or infected tissues where they can exert their effector functions.<|Keyword|>SELL, CCR7 <|Context|>Unconventional T cells are explored for their ability to withstand and respond to cellular stress.<|Question|>How does the gene expression in an unconventional T cell reflect its ability to adapt to cellular stress?<|Chain of Thoughts|>Gamma-delta T cells express a unique set of genes that includes several heat shock proteins and chaperones. These stress-related genes enable the cell to maintain functionality under adverse conditions. 
This adaptive expression pattern suggests that such cells can effectively counteract the effects of cellular stress, thereby contributing to their role in early immune responses and tissue protection.<|Answer|>The presence of HSPA1A and DNAJA4 in the gene expression of gamma-delta T cells underscores their capacity to adapt to cellular stress, ensuring they remain functional even in challenging microenvironments.<|Keyword|>HSPA1A, DNAJA4 <|Context|>Memory cells are crucial for long-term immunity, and their gene expression is key to understanding their rapid recall capabilities.<|Question|>What elements in the gene expression of a memory cell might contribute to its longevity and sustained responsiveness?<|Chain of Thoughts|>Memory T cells exhibit a gene expression profile that includes key transcription factors and cytokines known for supporting cell survival. The upregulation of these genes suggests that memory cells are not only primed for a quick response upon re-exposure to antigens but also possess intrinsic mechanisms to sustain their viability over extended periods. These molecular features are critical for the durability of the immune memory.<|Answer|>The expression of genes such as NR4A2 and IFNG in memory T cells suggests that these cells have built-in mechanisms for long-term survival and rapid reactivation, which are essential for sustained immune protection.<|Keyword|>NR4A2, IFNG <|Context|>Helper T cells are central to orchestrating the immune response, and their gene expression reflects their diverse functional roles.<|Question|>Based on its gene expression profile, how might a helper T cell collaborate with other immune cells?<|Chain of Thoughts|>The gene expression of a CD4-positive helper T cell includes several chemokine receptors and signaling molecules that facilitate communication with other immune cells. 
These markers are indicative of the cell’s role in recruiting and activating various immune cell types, thereby coordinating a comprehensive immune response. Such collaboration is crucial for effective pathogen clearance and overall immune system regulation.<|Answer|>The helper T cell's expression of CCR7, LTB, and SELL indicates its ability to interact with and direct other immune cells, thereby effectively orchestrating a coordinated immune response across different cellular compartments.<|Keyword|>CCR7, LTB, SELL <|Context|>Native cells provide a baseline for healthy tissue function and are used to identify changes in diseased or activated states.<|Question|>What can the gene expression of a native cell tell us about its metabolic and biosynthetic activities?<|Chain of Thoughts|>Native cells typically express a consistent set of housekeeping genes, including those coding for ribosomal proteins. This stable expression profile reflects the cell's ongoing metabolic processes and its commitment to protein synthesis. By comparing these levels to those in more activated or diseased cells, researchers can identify significant deviations that may signal pathological changes or adaptive responses.<|Answer|>The native cell's expression profile, highlighted by genes such as RPS18 and RPL11, reflects vigorous metabolic activity and a high capacity for protein synthesis, serving as a robust baseline for healthy cellular function.<|Keyword|>RPS18, RPL11 <|Context|>This study examines the relative abundance of different immune cell types within a tissue to understand their functional impact.<|Question|>How might the relative proportion of a regulatory immune cell in the dataset correlate with its functional impact?<|Chain of Thoughts|>A natural T-regulatory cell that comprises a significant fraction of the cellular population suggests an important role in maintaining immune balance. 
Its gene expression profile, which includes several key regulatory markers, reinforces the idea that its abundance is linked to its capability to modulate immune responses effectively. The higher percentage can indicate a substantial influence on the local immune microenvironment.<|Answer|>The substantial presence of natural T-regulatory cells, along with the expression of markers such as CD4, IFNG, and GZMH, underscores their pivotal role in modulating and maintaining balanced immune responses within the tissue.<|Keyword|>CD4, IFNG, GZMH <|Context|>Within the dataset, cytotoxic cells are present at a relatively low frequency, prompting questions about their specialized functions.<|Question|>What does the lower percentage of cytotoxic cells suggest about their specialization, as inferred from their gene expression profile?<|Chain of Thoughts|>Natural killer cells, despite being a minor fraction of the overall cell population, display an intense expression of key cytotoxic markers. This suggests that even in small numbers, they are highly specialized and potent. Their focused gene expression pattern likely equips them with the tools necessary for rapid target cell recognition and elimination, emphasizing quality over quantity in immune defense.<|Answer|>The natural killer cell's profile, marked by high expression of genes like GZMB and PRF1 despite a low overall frequency, highlights their specialized and potent cytotoxic functions, essential for targeted immune responses.<|Keyword|>GZMB, PRF1 <|Context|>Effector T cells are notably abundant in the dataset, reflecting an active immune response within the tissue.<|Question|>How might the high proportion of activated T cells, as indicated by their gene expression, reflect an ongoing immune response?<|Chain of Thoughts|>The elevated percentage of activated CD8-positive T cells suggests a heightened state of immune alertness. 
Their gene expression profile, which includes markers involved in cell adhesion and activation, supports the notion of a dynamic and responsive immune environment. This abundance likely correlates with an active process of antigen recognition and elimination, contributing to a robust immune defense mechanism.<|Answer|>The high proportion of activated CD8-positive, alpha-beta T cells, with differential expression of SELL and FCER1G, reflects a vigorous and ongoing immune response, likely associated with active antigen recognition and elimination.<|Keyword|>SELL, FCER1G <|Context|>Unconventional T cells, though present at moderate levels, provide insights into the diversity of immune responses within a tissue.<|Question|>What could the moderate abundance of unconventional T cells imply about their role, based on their gene expression profile?<|Chain of Thoughts|>Gamma-delta T cells, representing about 16.5% of the population, exhibit a unique gene expression signature that includes both stress-response and regulatory markers. This suggests they are strategically positioned to respond early to pathogenic challenges while also modulating local immune responses. Their moderate abundance indicates that while they are not the dominant cell type, they fulfill specialized functions that are critical under certain conditions.<|Answer|>The gamma-delta T cell's expression, including markers like LTB and HSPA6, combined with its moderate abundance, implies a specialized role in initiating early stress responses and fine-tuning immune modulation within the tissue.<|Keyword|>LTB, HSPA6 <|Context|>Long-lived immune cells are essential for ensuring rapid responses upon re-exposure to antigens.<|Question|>How does the gene expression of a memory cell support its capacity for rapid immune recall?<|Chain of Thoughts|>Memory T cells are characterized by a gene expression pattern that primes them for swift reactivation. 
The upregulation of specific effector molecules and cytokines in these cells enables a faster and more robust response upon antigen re-encounter. This preparedness is a hallmark of effective immune memory, allowing the body to combat recurring infections with greater efficiency than naive cells.<|Answer|>The memory T cell expresses key genes such as IFNG, GZMB, and CCL3, which underpin its readiness to mount a rapid and robust response upon re-encounter with previously encountered antigens.<|Keyword|>IFNG, GZMB, CCL3 <|Context|>Helper T cells exhibit a wide array of gene expression signals, reflecting their diverse roles in immune coordination.<|Question|>What does the diversity of gene expression in helper T cells reveal about their multifaceted roles in immunity?<|Chain of Thoughts|>CD4-positive helper T cells display a complex expression profile that includes both ribosomal and signaling molecules. This diversity allows them to support various functions, from facilitating antigen presentation to driving cytokine-mediated communication. Such a multifaceted gene expression pattern highlights their central role in linking innate and adaptive immunity, ensuring a harmonized and effective immune response.<|Answer|>The diversity in helper T cell expression, featuring genes like RPS8, CCR7, and LTB, underscores their multifaceted role in orchestrating immune responses, ranging from effective antigen presentation to robust intercellular signaling.<|Keyword|>RPS8, CCR7, LTB <|Context|>Native cells are critical for establishing a baseline of cellular function, providing context for deviations observed in diseased states.<|Question|>How does the gene expression profile of a native cell serve as a reference for detecting alterations in other cell populations?<|Chain of Thoughts|>Native cell gene expression is dominated by stable housekeeping genes, primarily involved in fundamental processes like protein synthesis. 
This consistency provides a control reference that can be used to pinpoint significant deviations in more specialized or diseased cells. By comparing these profiles, researchers can identify which genes are differentially regulated, thus elucidating potential mechanisms underlying disease or activation states.<|Answer|>The native cell's consistent expression of housekeeping genes such as RPS8 and RPS6 serves as a reliable reference point, allowing for the clear detection of significant gene expression changes in activated or diseased cells.<|Keyword|>RPS8, RPS6 <|Context|>Investigating regulatory cells offers insight into the molecular mechanisms governing intercellular communication within the immune network.<|Question|>How does the gene expression profile of a regulatory cell inform its potential for immune suppression and intercellular signaling?<|Chain of Thoughts|>In natural T-regulatory cells, the coordinated expression of several immune-modulatory genes provides clues about how these cells communicate with their neighbors. The presence of molecules known to participate in inhibitory signaling and cytokine regulation suggests that these cells are equipped to suppress overactive immune responses. 
This regulatory network is essential for maintaining immune tolerance and preventing autoimmunity, highlighting the complex interplay between various signaling pathways.<|Answer|>The natural T-regulatory cell's expression of genes like CD40LG and CCL3 indicates that it plays a crucial role in mediating intercellular communication, thereby exerting significant immune suppressive functions to maintain homeostasis.<|Keyword|>CD40LG, CCL3 <|Context|>Cytotoxic cells are known for their role in direct cell killing and require robust signaling mechanisms for activation.<|Question|>How might the signaling pathways indicated by the gene expression of cytotoxic cells facilitate their interactions with other immune components?<|Chain of Thoughts|>Natural killer cells, through their gene expression, exhibit a network of signaling molecules that not only trigger cytotoxic responses but also enable cross-talk with other immune cells. These interactions are critical for fine-tuning immune responses and ensuring that cytotoxic actions are appropriately targeted. 
The integration of adhesion molecules and activation receptors in their expression profile underscores the complexity and efficiency of these pathways in coordinating immune defenses.<|Answer|>The natural killer cell's expression of TYROBP and CD247, along with other markers, suggests that these signaling pathways are essential for both triggering cytotoxic responses and facilitating effective interactions with other immune cells, thus enhancing overall immune coordination.<|Keyword|>TYROBP, CD247 <|Context|>Activated T cells are pivotal for immune defense and often require effective migration to reach sites of infection.<|Question|>What does the gene expression profile of an activated T cell reveal about its capacity for migration and tissue homing?<|Chain of Thoughts|>In activated CD8-positive T cells, the expression of adhesion molecules and chemokine receptors plays a key role in directing the cells to inflamed or infected tissues. The coordinated regulation of these genes not only ensures proper migration but also supports the cells' ability to home to specific tissue sites. This migratory capacity is crucial for delivering effector functions precisely where they are needed, thereby optimizing immune responses.<|Answer|>The activated CD8-positive T cell shows high expression of SELL and CCR7, which highlights its strong migratory capacity and the ability to home to targeted tissues where immune responses are required.<|Keyword|>SELL, CCR7 <|Context|>Unconventional T cells offer unique insights into stress adaptation and immune function due to their distinct gene expression profiles.<|Question|>How does the gene expression in an unconventional T cell reflect its ability to manage cellular stress and maintain functionality?<|Chain of Thoughts|>Gamma-delta T cells express a series of stress response genes alongside immunologically relevant markers. 
This combination suggests that these cells are particularly adept at coping with adverse conditions while still performing essential immune functions. Their gene expression pattern indicates that they can rapidly adjust to stress, thereby ensuring that their effector capabilities are preserved even under challenging circumstances.<|Answer|>The presence of HSPA1A and DNAJA4 in gamma-delta T cells indicates robust stress management, enabling these cells to maintain functionality and support early immune responses under adverse conditions.<|Keyword|>HSPA1A, DNAJA4 <|Context|>Memory cells are integral to long-term immune protection, with specific gene expression patterns supporting their durability.<|Question|>What elements in the gene expression of a memory cell contribute to its longevity and sustained responsiveness?<|Chain of Thoughts|>Memory T cells are characterized by the expression of genes that promote cell survival and rapid reactivation. The upregulation of specific transcription factors and cytokines indicates that these cells are equipped with mechanisms to resist apoptosis and remain vigilant for re-infection. These molecular traits are essential for maintaining long-term immunological memory and ensuring a swift response upon antigen re-exposure.<|Answer|>The expression of NR4A2 and IFNG in memory T cells suggests that these cells possess intrinsic survival mechanisms and are primed for rapid reactivation, thereby ensuring sustained immune protection over time.<|Keyword|>NR4A2, IFNG <|Context|>Helper T cells are central to orchestrating immune responses, interacting with various cell types through complex molecular signals.<|Question|>Based on its gene expression profile, how might a helper T cell collaborate with other immune cells to coordinate a response?<|Chain of Thoughts|>The gene expression of a CD4-positive helper T cell includes several key markers that facilitate intercellular communication. 
These markers enable the cell to interact with B cells, cytotoxic T cells, and other immune components, effectively coordinating the overall immune response. The presence of chemokine receptors and signaling molecules underscores the cell's role in guiding and activating other immune cells, ensuring a harmonized response to pathogenic challenges.<|Answer|>The helper T cell's expression of CCR7, LTB, and SELL demonstrates its capability to interact with and direct other immune cells, thereby playing a critical role in coordinating a comprehensive immune response.<|Keyword|>CCR7, LTB, SELL <|Context|>Native cells serve as a baseline for evaluating the metabolic and biosynthetic activities in healthy tissue.<|Question|>What can the gene expression of a native cell tell us about its metabolic and biosynthetic activities?<|Chain of Thoughts|>Native cell gene expression is dominated by housekeeping genes, especially those coding for ribosomal proteins. These genes are essential for ongoing protein synthesis and metabolic maintenance, providing a snapshot of the cell’s baseline activities. 
Such a stable expression profile is invaluable for comparing with more dynamic cells, as it highlights the shifts that occur during activation, stress, or disease, thereby revealing the underlying cellular processes.<|Answer|>The native cell's robust expression of ribosomal proteins like RPS18 and RPL11 reflects high metabolic and biosynthetic activity, serving as a critical benchmark for healthy cellular function and protein synthesis.<|Keyword|>RPS18, RPL11 <|Context|>Regulatory cells are key to suppressing overactive immune responses, and their gene expression patterns reveal potential mechanisms of immune inhibition.<|Question|>How does the gene expression profile of a regulatory cell inform its potential for immune suppression?<|Chain of Thoughts|>Natural T-regulatory cells display an expression profile enriched with genes that are known to modulate immune activity and suppress excessive responses. By examining these markers, one can infer that these cells are actively engaged in mechanisms that prevent autoimmunity and maintain homeostasis. The coordinated expression of inhibitory signals and cytokines plays a crucial role in damping down hyperactive immune processes, thus ensuring a balanced immune environment.<|Answer|>The expression of CCL20, IFNG, and CD28 in natural T-regulatory cells suggests that these cells deploy multiple mechanisms to suppress immune overactivity, thereby contributing to the maintenance of immune homeostasis.<|Keyword|>CCL20, IFNG, CD28 <|Context|>Cytotoxic cells not only kill target cells but also interact with other immune cells to fine-tune responses.<|Question|>What might the gene expression of a cytotoxic cell suggest about its interactions with other immune components?<|Chain of Thoughts|>Natural killer cells express a range of adhesion and signaling molecules that facilitate communication with other immune cells. 
This cross-talk is essential for modulating immune responses and ensuring that cytotoxic activity is effectively integrated with the broader immune network. The expression of these markers can indicate roles in recruiting, activating, or even regulating the function of other cells, which is crucial for a coordinated immune response.<|Answer|>The natural killer cell expresses genes such as FCER1G, CD7, and GZMA, which suggest that it actively interacts with other immune cells, helping to modulate and enhance overall immune surveillance and response coordination.<|Keyword|>FCER1G, CD7, GZMA


"""

run the following cell to convert the output into a hf dataset

In [21]:
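qa_dict_list = []
# every QA pair starts with "<|Context|>", so splitting on it yields one chunk per pair
for QA_pair in output.split("<|Context|>"):
    if not QA_pair:
        continue
    try:
        # cut each field out by splitting on the tag before it and the tag after it
        context = QA_pair.split("<|Question|>")[0].strip()
        question = QA_pair.split("<|Question|>")[1].split("<|Chain of Thoughts|>")[0].strip()
        chain_of_thoughts = QA_pair.split("<|Chain of Thoughts|>")[1].split("<|Answer|>")[0].strip()
        answer = QA_pair.split("<|Answer|>")[1].split("<|Keyword|>")[0].strip()
        label = QA_pair.split("<|Keyword|>")[1].strip()
    except IndexError:
        # a missing tag makes one of the [1] lookups fail; a whitespace-only chunk
        # (e.g. the text before the first "<|Context|>") also ends up here
        print("Error in the following QA pair:")
        print(QA_pair)
        continue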

    entry = {
        "Context": context,
        "Summary_Dataset": dataset_in_prompt,
        "Question": question, 
        "Chain of Thoughts": chain_of_thoughts,
        "Answer": answer,
        "Keyword": label,
        "full_QA_pair": "<|Context|>"+QA_pair.strip(),  # restore the tag removed by the split
        "Dataset_Name": dataset_name,
        "Publication_URL": cur_url,
        "Dataset_Index": dataset_index,
    }
    qa_dict_list.append(entry)

qa_dict = {key: [d[key] for d in qa_dict_list] for key in qa_dict_list[0].keys()} # list of dicts to dict of lists

new_hf_dataset = Dataset.from_dict(qa_dict)
print("Example:")
new_hf_dataset[0]

Error in the following QA pair:


Example:


{'Context': 'In this study, we explore the immune regulatory landscape within a complex tissue, aiming to understand cell-specific signaling pathways.',
 'Summary_Dataset': 'Cell_Type: natural T-regulatory cell, Num_DE_Genes: 26, Percentage_Cells: 21.74, Mean_LogFC: -1.0628347, Max_LogFC: 1.3145021, Min_Pval: 3.581881992289846e-219, Differentially Expressed Genes: LTB, CCL20, CD40LG, TNFRSF25, PHACTR2, CD28, CD4, CRTAM, FCER1G, TRDC, TYROBP, CCL3L1, CCL3, IFNG, CTSW, GZMK, GNLY, PRF1, GZMH, KLRD1, \nCell_Type: natural killer cell, Num_DE_Genes: 691, Percentage_Cells: 4.05, Mean_LogFC: 1.3306739, Max_LogFC: 6.552441, Min_Pval: 1.481298890905054e-233, Differentially Expressed Genes: GZMA, TYROBP, NKG7, FCER1G, TRDC, CD247, GNLY, PFN1, PRF1, ACTB, CD7, GSTP1, GZMB, KLRD1, CTSW, CST7, HOPX, CD63, CYBA, ARPC2, \nCell_Type: activated CD8-positive, alpha-beta T cell, Num_DE_Genes: 6, Percentage_Cells: 32.61, Mean_LogFC: -1.6051849, Max_LogFC: -1.2491546, Min_Pval: 6.646944401804179e-15, Diffe
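
The split-based parsing above can also be written as a single regular expression. The following is only a sketch under the assumption that every pair contains the five tags in the same order and that no field contains the characters `<|`; the names `TAGS`, `RECORD_RE` and `parse_records` are made up for this example and are not part of the notebook.

```python
import re

# tag names in the order they appear in each generated QA pair
TAGS = ["Context", "Question", "Chain of Thoughts", "Answer", "Keyword"]

# one capture group per tag; a field may contain anything except the start of a tag ("<|")
RECORD_RE = re.compile(
    "".join(
        rf"<\|{re.escape(tag)}\|>(?P<f{i}>(?:(?!<\|).)*)"
        for i, tag in enumerate(TAGS)
    ),
    re.DOTALL,
)


def parse_records(text):
    """Return one dict per well-formed QA pair found in `text`."""
    records = []
    for match in RECORD_RE.finditer(text):
        records.append(
            {tag: match.group(f"f{i}").strip() for i, tag in enumerate(TAGS)}
        )
    return records
```

With this approach the whitespace before the first `<|Context|>` is skipped without printing anything, and a pair that is missing a tag is dropped silently, so it may be worth comparing `len(parse_records(output))` with the number of pairs that were expected.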

run the following cell every time to save updates \
sometimes save_to_disk will fail because it doesn't automatically overwrite, just delete the old directory and run it again
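
The manual "delete and run again" step can be wrapped in a small helper. This is a minimal sketch, not something the notebook defines: `save_replacing` is a hypothetical name, and it assumes the failure is caused by the target directory already being there. It writes to a temporary sibling directory first, so the old copy is only removed once the new one has been saved.

```python
import os
import shutil


def save_replacing(dataset, path):
    """Save `dataset` to `path`, replacing any directory that is already there.

    `dataset` only needs a `save_to_disk(path)` method (hypothetical helper).
    """
    path = os.path.normpath(path)
    tmp_path = path + ".tmp"  # assumed to be free for this helper to use

    # clear leftovers from an earlier interrupted run
    if os.path.isdir(tmp_path):
        shutil.rmtree(tmp_path)

    # write the new copy next to the old one
    dataset.save_to_disk(tmp_path)

    # swap: drop the old directory, then move the new one into place
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.replace(tmp_path, path)
    return path
```

Usage would look like `save_replacing(hf_dataset, "temp/dataset8_3-12-25_hf_dataset")`. If `hf_dataset` was loaded from that same directory, removing it while the dataset is still in memory may misbehave on some platforms, so treat this as a starting point.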

In [22]:
hf_dataset = concatenate_datasets([hf_dataset, new_hf_dataset])
hf_dataset


Dataset({
    features: ['Context', 'Summary_Dataset', 'Question', 'Chain of Thoughts', 'Answer', 'Keyword', 'full_QA_pair', 'Dataset_Name', 'Publication_URL', 'Dataset_Index'],
    num_rows: 37
})

In [23]:
hf_dataset.save_to_disk("temp/dataset8_3-12-25_hf_dataset")
print("current hf_dataset: ")
hf_dataset

Saving the dataset (1/1 shards): 100%|██████████| 37/37 [00:00<00:00, 1508.87 examples/s]

current hf_dataset: 





Dataset({
    features: ['Context', 'Summary_Dataset', 'Question', 'Chain of Thoughts', 'Answer', 'Keyword', 'full_QA_pair', 'Dataset_Name', 'Publication_URL', 'Dataset_Index'],
    num_rows: 37
})