**Identifying and Ingesting Your Data**

In this exercise:  
We'll continue to work with our document.  
Step 1: Extract our data into paragraphs.  
Step 2: Split the paragraphs into chunks. 

Step 1: Extract our data into paragraphs.  

Remember that the LLM will tokenize our content. Should we break it into pages, paragraphs, sentences, or something else? (We could even convert it to HTML and then parse that.)  
[Tokenizer Example](../static/images/tokenizer.png)  
[Python Docx API](https://python-docx.readthedocs.io/en/latest/)
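
To make the granularity question concrete, here is a minimal sketch that splits the same text by paragraph and by sentence using only the standard library. The sample text and the helper names `split_paragraphs` and `split_sentences` are made up for illustration, and the sentence rule is deliberately naive (it will misfire on abbreviations such as "e.g.").

```python
import re

# Made-up sample text; two paragraphs separated by a blank line.
sample = (
    "Jupyter notebooks mix code and prose. They run in a browser.\n"
    "\n"
    "Each cell can be executed on its own. Output appears below the cell."
)

def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_sentences(text: str) -> list[str]:
    """Naive rule: break after '.', '!' or '?' when whitespace follows."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

paragraphs = split_paragraphs(sample)
sentences = split_sentences(sample)

print(len(paragraphs), "paragraphs")
print(len(sentences), "sentences")
```

Smaller units give more precise retrieval targets but carry less context each, which is the trade-off to weigh when choosing a unit.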


In [2]:
from docx import Document as DocxDocument

file_path = "../static/input_files/Jupyter_Notebook_Info.docx"
doc = DocxDocument(file_path)
paragraph_chunks = []

# Extract text content from paragraphs
for para in doc.paragraphs:
    if para.text.strip():
        paragraph_chunks.append(para.text)
        # print(f"PARAGRAPH:{para.text}")



Step 2: Split the paragraphs into chunks.  
Discuss: What's the best way?  

[Langchain RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

# We join all the text into a single string before splitting
text_chunks = text_splitter.split_text("\n".join(paragraph_chunks))

for chunk in text_chunks:
    print(f"CHUNK:{chunk}")
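
To see what `chunk_size` and `chunk_overlap` mean, the following sketch uses a plain fixed-width sliding window. This is a simplification for illustration only, not how `RecursiveCharacterTextSplitter` works internally: that splitter tries to break on separators such as paragraph and line breaks before falling back to smaller units. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

demo_chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
for piece in demo_chunks:
    print(piece)
```

The last three characters of each chunk reappear at the start of the next one. The overlap exists so that a sentence cut at a chunk boundary still appears intact in at least one chunk.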


<u style="color: yellow;">_________________________________________________________________________________________</u>  
  
**Process and Transform Data**  

In this exercise:  
Step 1: Discuss Text Cleaning and Preprocessing Considerations.  
Step 2: Summarize our text chunks.  
Step 3: Vectorize each summarized text chunk.   


Step 1: Text Cleaning and Preprocessing Considerations:  
* Handle Whitespace  
* Lowercase  
* Remove Punctuation (sometimes, but be careful)  
* Remove Special Characters  
* Handle HTML/XML Tags if needed  
* Handle Accents and Diacritics  
* Expand Contractions  
* Stop Word Removal (e.g., "the," "a," "is")  
* Stemming / Lemmatization (Usually Lemmatization is preferred)  
* Consider Metadata  
* Consider Summarization  
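
Several of the considerations above can be sketched with the standard library alone. The function `clean_text`, the tiny contraction map, and the stop-word set below are illustrative assumptions, not a recommended pipeline. Stemming and lemmatization are left out because they need an NLP library. Note that aggressive cleaning (lowercasing, stop-word removal) can hurt embedding quality, so treat each step as optional.

```python
import html
import re
import unicodedata

# Tiny illustrative lookups; real projects would use much larger lists.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "we'll": "we will"}
STOP_WORDS = {"the", "a", "an", "is", "of"}

def clean_text(text: str, remove_stop_words: bool = False) -> str:
    text = re.sub(r"<[^>]+>", " ", text)          # handle HTML/XML tags
    text = html.unescape(text)                    # decode entities such as &amp;
    text = text.lower()                           # lowercase
    text = text.replace("\u2019", "'")            # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():      # expand contractions (naive substring match)
        text = text.replace(short, full)
    # Handle accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9\s.,!?'-]", " ", text)  # remove special characters, keep basic punctuation
    text = re.sub(r"\s+", " ", text).strip()        # handle whitespace
    if remove_stop_words:
        text = " ".join(w for w in text.split() if w not in STOP_WORDS)
    return text

raw = "<p>It's   a  Café   &amp; the résumé is HERE!</p>"
print(clean_text(raw))
print(clean_text(raw, remove_stop_words=True))
```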

Step 2: Summarize our text chunks.  
[Langchain ChatOpenAI](https://python.langchain.com/docs/integrations/chat/openai/)  
[Langchain ChatPromptTemplate](https://sj-langchain.readthedocs.io/en/latest/prompts/langchain.prompts.chat.ChatPromptTemplate.html)  
[Langchain HumanMessage](https://sj-langchain.readthedocs.io/en/latest/schema/langchain.schema.messages.HumanMessage.html)  


In [None]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import HumanMessage

# Initialize the chat model (GPT-4)
llm = ChatOpenAI(model="gpt-4", temperature=0)

# Create the text prompt (note the {element} variable)
text_prompt_text = """
        You are an assistant tasked with summarizing text for semantic retrieval.
        These summaries will be embedded and used to retrieve the raw text elements.
        Give a detailed summary of the text below that is well optimized for retrieval.
        Also, provide a one-line description of what the text is about.
        Do not add additional words like Summary: etc.
        Text chunk:
        {element}
"""

text_prompt = ChatPromptTemplate.from_template(text_prompt_text)

# Create the summaries
summaries = []

for text in text_chunks:
    # format_messages() fills {element} and returns a list with one HumanMessage;
    # format() would instead return a single string prefixed with "Human: "
    messages = text_prompt.format_messages(element=text)
    response = llm.invoke(messages)
    summaries.append(response.content)

for summary in summaries:
    print(summary)

Step 3: Vectorize each summarized chunk.  

[Langchain OpenAIEmbeddings](https://python.langchain.com/docs/integrations/text_embedding/openai/)

In [5]:
from langchain_openai import OpenAIEmbeddings

# Create the embeddings client once and reuse it for every call
openai_embedding = OpenAIEmbeddings(model="text-embedding-3-small")

def get_embedding(text):
    content = openai_embedding.embed_query(text)
    float_content = [float(x) for x in content] # needed for pgvector
    return float_content

for summary in summaries:
    embedded_summary = get_embedding(summary)
    print(embedded_summary[:5])


[-0.0757230594754219, -0.01474311575293541, 0.03324216231703758, -0.01488342322409153, -0.00787342805415392]
[-0.04038451611995697, 0.03060675784945488, 0.05201853811740875, -0.02758493460714817, -0.0067289541475474834]
[-0.01937182806432247, 0.047913625836372375, 0.05982820689678192, -0.022956840693950653, -0.0385521724820137]
[-0.011679466813802719, 0.040910590440034866, 0.06260524690151215, 0.008179757744073868, -0.020018570125102997]
[-0.006696468219161034, 0.045008622109889984, 0.056652382016181946, 0.013334195129573345, 0.01661716215312481]
[-0.005106356460601091, 0.029977384954690933, 0.021410761401057243, -0.026615651324391365, 0.005822173785418272]
[0.005611640866845846, -0.0045518409460783005, 0.0349310077726841, -0.027173271402716637, 0.0020202435553073883]
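
Embedding vectors like the ones printed above become useful for retrieval once they are compared with the embedding of a query. One common comparison is cosine similarity. The sketch below shows that calculation in plain Python on made-up three-dimensional vectors; real embeddings have far more dimensions, and the names `query_vec` and `summary_vecs` are hypothetical. In practice a vector store such as pgvector would normally do this comparison for you.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (made up for illustration).
query_vec = [1.0, 0.0, 1.0]
summary_vecs = {
    "summary_a": [1.0, 0.0, 1.0],
    "summary_b": [0.0, 1.0, 0.0],
    "summary_c": [1.0, 1.0, 0.0],
}

# Rank the summaries from most to least similar to the query.
ranked = sorted(
    summary_vecs,
    key=lambda name: cosine_similarity(query_vec, summary_vecs[name]),
    reverse=True,
)
print(ranked)
```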
