In [1]:
import os
import sys

In [2]:
os.getcwd()

'/Users/chao/github/mychat'

In [6]:
# Make the local 'donotsync' folder importable so the authentications module can be found
sys.path.append(os.path.join(os.getcwd(), 'donotsync'))
import authentications

In [7]:
os.environ["OPENAI_API_KEY"] = authentications.APIKEY

In [24]:
from langchain_community.document_loaders.text import TextLoader
loader = TextLoader('data/CHAO_DAI_cv.txt')

In [39]:
from langchain_community.document_loaders.directory import DirectoryLoader
loaderdir = DirectoryLoader('data', glob="*.txt")

In [29]:
docs = loader.load()
len(docs)

1

In [41]:
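# Load every matching file under data/ and split it with the loader's default splitter;
# this overwrites the single-file docs loaded above
docs = loaderdir.load_and_split()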

[nltk_data] Downloading package punkt to /Users/chao/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/chao/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [42]:
docs

[Document(page_content='Pervasive effects of sQTLs on gene expression levels The prevalence of unproductive splicing, along with the observation that unproductive splicing anti-correlates with gene expression levels genomewide, predicts that genetic effects on RNA splicing would often impact RNA expression levels. To test this prediction, we used quantitative trait loci (QTL) mapping to identify genetic variants associated with expression (eQTLs) and splicing (splice junction abundance, sQTLs) in naRNA-seq, 4sU-seq and steady-state RNA-seq data. To better distinguish splicing-mediated expression effects from transcriptional effects, we mapped histone QTLs (hQTLs), reflective of variants impacting promoter and enhancer activity (H3K27ac, H3K4me1, and H3K4me3) and transcription across gene bodies (H3K36me3). In total, we identified 57,981 QTLs for 620,020 tested molecular traits. Consistent with previous work, we find a large fraction of eQTLs are explained by transcriptional regulation,

In [43]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

In [44]:
len(all_splits)

13

In [49]:
print(all_splits[0].page_content)

Pervasive effects of sQTLs on gene expression levels The prevalence of unproductive splicing, along with the observation that unproductive splicing anti-correlates with gene expression levels genomewide, predicts that genetic effects on RNA splicing would often impact RNA expression levels. To test this prediction, we used quantitative trait loci (QTL) mapping to identify genetic variants associated with expression (eQTLs) and splicing (splice junction abundance, sQTLs) in naRNA-seq, 4sU-seq and steady-state RNA-seq data. To better distinguish splicing-mediated expression effects from transcriptional effects, we mapped histone QTLs (hQTLs), reflective of variants impacting promoter and enhancer activity (H3K27ac, H3K4me1, and H3K4me3) and transcription across gene bodies (H3K36me3). In total, we identified 57,981 QTLs for 620,020 tested molecular traits. Consistent with previous work, we find a large fraction of eQTLs are explained by transcriptional regulation, as indicated by the


In [50]:
all_splits[10].metadata

{'source': 'data/CHAO_DAI_cv.txt', 'start_index': 3572}
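
The settings used above (chunk_size=1000, chunk_overlap=200, add_start_index=True) can be pictured with a much simpler fixed-window splitter. This is only a sketch of the overlap and start-index idea: RecursiveCharacterTextSplitter also tries to break on separators such as paragraphs and sentences, which is not reproduced here. The function name split_with_overlap and the sample string are made up for illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Naive fixed-window splitter (illustrative only, not LangChain's algorithm)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        # Record where the chunk begins, mirroring the 'start_index' metadata key
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Hypothetical 2,500-character sample document
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)
for c in chunks:
    print(c["start_index"], len(c["text"]))
```

Consecutive chunks share their last and first 200 characters, which is why a sentence cut at a chunk boundary can still be retrieved intact from the neighbouring chunk.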

In [51]:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

vectorstore = Chroma.from_documents(documents=all_splits, 
                                    embedding=OpenAIEmbeddings())

In [52]:
retriever = vectorstore.as_retriever(search_type="similarity", 
                                     search_kwargs={"k": 6})

In [53]:
retrieved_docs = retriever.invoke(
    "what was my job in 2014?"
    )

In [54]:
len(retrieved_docs)

6
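
The retriever above returns the k=6 chunks whose embeddings are closest to the query embedding. A rough sketch of that ranking step, assuming cosine similarity and tiny made-up 3-dimensional vectors in place of real OpenAI embeddings. The metric Chroma actually applies depends on how the collection is configured, so cosine here is an assumption; toy_store, toy_query, and top_k are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=6):
    """Return the texts of the k stored items most similar to query_vec."""
    scored = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]


# Hypothetical (text, embedding) pairs standing in for the vector store
toy_store = [
    ("job history", [0.9, 0.1, 0.0]),
    ("education", [0.1, 0.9, 0.0]),
    ("splicing paper", [0.0, 0.1, 0.9]),
]
toy_query = [1.0, 0.2, 0.0]  # pretend embedding of the question

top_two = top_k(toy_query, toy_store, k=2)
print(top_two)
```

When k exceeds the number of stored chunks, the slice simply returns everything available.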

In [55]:
print(retrieved_docs[0].page_content)

The Weather Company Manager, Revenue and Operations Jan 2014 – Jun 2016, New York, NY

Led a team of analysts to design, develop, and maintain a suite of business insights dashboards and KPIs. Developed sophisticated web traffic and revenue forecast models that greatly minimized unmatched demands and provided valuable insights for product development team to

balance user experience and monetization.

Led industry studies to evaluate industry landscape and researched emerging technological and consumer behavior trends. Evaluated major

advertising technology vendors and proposed potential partners.

IBM Corporation Senior Consultant Apr 2012 – Jan 2014, New York, NY

Consulted with Fortune 500 clients in retail, CPG, consumer electronics, and financial services. Analyzed client’s technology, data infrastructure, and analytics capabilities. Managed project scope, timeline, resource, and communicated efficiently among project

stakeholders.


In [56]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

In [59]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

In [60]:
example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()
example_messages

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:")]

In [61]:
print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


In [62]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough


def format_docs(docs):
    # Join the retrieved chunks into one context string, separated by blank lines
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
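
The data flow through rag_chain can be traced with plain functions: the question is sent to the retriever and also passed through unchanged, the two values fill the prompt, and the prompt text goes to the model. The sketch below uses hypothetical stand-ins (Doc, fake_retriever, build_prompt, fake_llm) so it runs without LangChain or an API key. It mirrors the order of the steps only, not the Runnable machinery.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def format_docs(docs):
    # Same joining rule as in the chain above
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # Hypothetical: ignores the question and always returns the same two chunks
    return [Doc("chunk one"), Doc("chunk two")]


def build_prompt(inputs):
    # Simplified imitation of the question/context/answer layout of the pulled prompt
    return f"Question: {inputs['question']} \nContext: {inputs['context']} \nAnswer:"


def fake_llm(prompt_text):
    # Hypothetical: reports the prompt length instead of calling a model
    return f"(model reply to {len(prompt_text)} prompt characters)"


def run_chain(question):
    # Step 1: the dict stage runs the retriever and passes the question through
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    # Steps 2-3: fill the prompt, then hand it to the model
    return fake_llm(build_prompt(inputs))


print(build_prompt({"context": format_docs(fake_retriever("q")), "question": "q"}))
print(run_chain("q"))
```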

In [66]:
for chunk in rag_chain.stream("what is post-transcriptional regulation?"):
    print(chunk, end="", flush=True)

Post-transcriptional regulation refers to the processes that occur after transcription, such as alternative splicing and alternative polyadenylation, which can affect gene expression levels. It is estimated that at least 24% of eQTLs function post-transcriptionally, with alternative splicing being a major contributor to inter-individual variation in gene expression levels. Alternative polyadenylation, on the other hand, plays a comparatively minor role in post-transcriptional regulation.

In [73]:
from langchain_core.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Be precise and concise with your answers, but do not omit any important information.

{context}

Question: {question}

Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("summarise the documents in bullet points, everything about biology"))

- Project 2: Detected gene fusion using Nanopore long read sequencing in breast cancer cell line MDA-MB-231.
- Project 3: Developed RNA-seq protocols using in-house Tn5 transposase for human breast cancer MDA-MB-231 cells and human embryonic kidney 293 (HEK-293) cells.
- Abstract & Conference: Presented research on essential transcription factors for human cortical neuron differentiation at Cell Symposia in San Francisco.
- Research Experience: Collaborated on research projects on regulatory elements in human cancer and neuron differentiations, including investigating chromatin accessibility and gene expression.
- Education: Holds a Master of Science in Biology from New York University.
