### Testing some potential pipelines

In [6]:
from langchain_community.document_loaders import PyPDFLoader, TextLoader
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings
import numpy as np

## Load the text content

In [7]:
# loader = ("lettre_1.txt")
# with open("lettre_1.txt", "r") as file:
#     loader = file.read()
#     print(loader)

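# TextLoader reads the file; load_and_split() then splits it with LangChain's default text splitter
loader = TextLoader("lettre_1.txt")
pages = loader.load_and_split()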

In [8]:
pages

[Document(page_content="Profession : sans\nTaille: 158\nG : 3 P : 2 \nSérologie :\n/RAI:\tRub +; Toxo -(09/16); CMV: +; \nAg HBs nég/Ac nég/ anti-core nég (04/16);  HbC-(04/16); Syph -(04/16); HIV -(04/16); Chlamydia -;\nAgglu irr. - (08/16);  \n\nMénarche: 12\nRègles: +- réguliers  \nAntécédents obstétricaux : 21/03/07: 39w6C/S pour SFA F2810g Alyena AM\n2010: accht eutocique \nAntécédents médicaux : \ngastrite\nRisques : 1 Césarienne dans antécédents\nFibrinogène: 2.86 ( 12/08/16) \n\nGrossesses antérieures : \n\nGrossesse n° 1 \tsuivie par: Dr. Alain Claudot \taccouchée par: Dr. A. Loos  \nTerme prévu: 22/03/2007 \tFin de grossesse le 21/03/2007 à 39 s. 6 j. \nRésumé de la grossesse:\nrhogam 10/11 sur métro, refait le 5/1, Anémie sidéroprive, traitée par supplément martial  \nAccouchement: Présentation: céphalique \nMotif d'entrée: Rupture prématurée des membranes avec entrée en travail spontané dans les 24 heures  \nTravail: optimalisé en salle de travail par perfusion de syntocino

### Unify content for a better chunking process



In [9]:
# Merge all pages back into one string so the splitters below see the whole document
unify_content = "".join("\n" + page.page_content for page in pages)

In [10]:
unify_content

"\nProfession : sans\nTaille: 158\nG : 3 P : 2 \nSérologie :\n/RAI:\tRub +; Toxo -(09/16); CMV: +; \nAg HBs nég/Ac nég/ anti-core nég (04/16);  HbC-(04/16); Syph -(04/16); HIV -(04/16); Chlamydia -;\nAgglu irr. - (08/16);  \n\nMénarche: 12\nRègles: +- réguliers  \nAntécédents obstétricaux : 21/03/07: 39w6C/S pour SFA F2810g Alyena AM\n2010: accht eutocique \nAntécédents médicaux : \ngastrite\nRisques : 1 Césarienne dans antécédents\nFibrinogène: 2.86 ( 12/08/16) \n\nGrossesses antérieures : \n\nGrossesse n° 1 \tsuivie par: Dr. Alain Claudot \taccouchée par: Dr. A. Loos  \nTerme prévu: 22/03/2007 \tFin de grossesse le 21/03/2007 à 39 s. 6 j. \nRésumé de la grossesse:\nrhogam 10/11 sur métro, refait le 5/1, Anémie sidéroprive, traitée par supplément martial  \nAccouchement: Présentation: céphalique \nMotif d'entrée: Rupture prématurée des membranes avec entrée en travail spontané dans les 24 heures  \nTravail: optimalisé en salle de travail par perfusion de syntocinon® rupture spontanée 

### Set up the three main chunking methods

In [12]:
import services.content_preparator as chunker
from langchain_openai.embeddings import OpenAIEmbeddings

ModuleNotFoundError: No module named 'services.content_preparator'

In [None]:
recursive_splitter = chunker.recursive_splitter(chunk_size=512, overlap=20)

# The semantic splitters need an embedder
embedder = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")
semantic_splitter = chunker.semantic_chunker(embedder)
semantic_splitter_openai = chunker.semantic_chunker(OpenAIEmbeddings())



### Apply the three splitters to the text

In [None]:
recursive_splited_chunks = recursive_splitter.create_documents([unify_content])

semantic_splited_chunks = semantic_splitter.create_documents([unify_content])

semantic_splitter_openai_chunks = semantic_splitter_openai.create_documents([unify_content])
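The `chunk_size` and `overlap` arguments passed to the recursive splitter are easier to reason about on a toy input. The sketch below is only an illustration of those two parameters: `chunk_with_overlap` is a hypothetical helper, not the code behind `chunker.recursive_splitter` (that module is not shown here), and it cuts on raw character positions, whereas a recursive splitter would normally prefer to cut on separators such as paragraph breaks.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts `overlap` characters before the previous one ended,
    so neighbouring chunks share that many characters.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


demo_chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
print(demo_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the role the overlap plays: context that straddles a cut stays visible on both sides.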

In [None]:
print(recursive_splited_chunks[0].page_content)

Spanish cuisine  (Spanish : Cocina española ) consists of the traditions and practices of 
Spanish cooking. It features considerable regional diversity, with important differences 
between the traditions of each of Spain's  regional cuisines . 
Olive oil  (of which Spain is the world's largest producer) is extensively used in Spanish 
cuisine .[1][2] It forms the base of many vegetable sauces (known in Spanish 
as sofritos ).[3] Herbs most commonly used


In [None]:
print(semantic_splited_chunks[0].page_content)


Spanish cuisine  (Spanish : Cocina española ) consists of the traditions and practices of 
Spanish cooking. It features considerable regional diversity, with important differences 
between the traditions of each of Spain's  regional cuisines . Olive oil  (of which Spain is the world's largest producer) is extensively used in Spanish 
cuisine .[1][2] It forms the base of many vegetable sauces (known in Spanish 
as sofritos ).[3] Herbs most commonly used 
include  parsley , oregano , rosemary  and thyme .[4] The use of  garlic  has been noted as 
common in Spanish cooking .[5] The most used meats in Spanish cuisine 
include  chicken , pork, lamb  and veal.[6] Fish and seafood  are also consumed on a regular 
basis .[6] Tapas  and pinchos  are snacks and appetizers commonly served in bars and cafes. Tennis  is a racket sport  that is played either individually against a single opponent (singles ) 
or between two teams of two players each (doubles ).


In [None]:
print(semantic_splitter_openai_chunks[1].page_content)

Each player uses a  tennis racket  that 
is strung with cord to strike a hollow rubber  ball covered with felt over or around a net and 
into the opponent's  court . The object of the game is to manoeuvre the ball in such a way 
that the opponent is not able to play a valid return. The player who is unable to return the 
ball validly will not gain a point, while the opposite player will .[1][2] 
Tennis is an  Olympic  sport and is played at all levels of society and at all ages. The sport 
can be played by anyone who can hold a racket, including  wheelchair users . The original 
forms of tennis developed in  France  during the late  Middle Ages .[3] The modern form of 
tennis originated in  Birmingham , England, in the late 19th century as  lawn tennis .[4] It had 
close connections both to various field (lawn) games such as  croquet  and bowls  as well as 
to the older racket sport today called  real tennis .[5] 
The rules of modern tennis have changed little since the 1890s. Two exce

### Other options

### Custom topic-guided chunking

In [None]:
import re

# Split the text after '.', '?' or '!' whenever whitespace follows
single_sentences_list = re.split(r'(?<=[.?!])\s+', unify_content)
print(f"{len(single_sentences_list)} sentences were found")

16 sentences were found
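The split pattern only fires when whitespace comes directly after `.`, `?` or `!`. In text with citation markers such as `.[1][2]`, the period is followed by `[`, so no boundary is detected and several sentences stay glued together. The sketch below reproduces this on a made-up sample; stripping the markers first (`CITATION_MARKER` is an assumption about their shape, not something the notebook does) is one possible workaround.

```python
import re

SENTENCE_BOUNDARY = r'(?<=[.?!])\s+'   # same pattern as in the cell above
CITATION_MARKER = r'\[\d+\]'           # assumed shape of markers like [1], [23]

# Made-up sample text that mimics the citation style seen in the outputs
sample = ("Olive oil is used in Spanish cuisine .[1][2] It forms the base "
          "of many sauces.[3] Herbs are common.")

# Splitting as-is: every period is followed by '[' or by the end of the string
as_is = re.split(SENTENCE_BOUNDARY, sample)

# Removing the citation markers first lets the whitespace follow the period
without_citations = re.split(SENTENCE_BOUNDARY, re.sub(CITATION_MARKER, '', sample))

print(len(as_is), len(without_citations))
# 1 3
```

Whether dropping the markers is acceptable depends on whether the citations are needed downstream.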


In [None]:
sentences = [{'sentence': x, 'index' : i} for i, x in enumerate(single_sentences_list)]
sentences[:3]

[{'sentence': '\nSpanish cuisine  (Spanish : Cocina española ) consists of the traditions and practices of \nSpanish cooking.',
  'index': 0},
 {'sentence': "It features considerable regional diversity, with important differences \nbetween the traditions of each of Spain's  regional cuisines .",
  'index': 1},
 {'sentence': "Olive oil  (of which Spain is the world's largest producer) is extensively used in Spanish \ncuisine .[1][2] It forms the base of many vegetable sauces (known in Spanish \nas sofritos ).[3] Herbs most commonly used \ninclude  parsley , oregano , rosemary  and thyme .[4] The use of  garlic  has been noted as \ncommon in Spanish cooking .[5] The most used meats in Spanish cuisine \ninclude  chicken , pork, lamb  and veal.[6] Fish and seafood  are also consumed on a regular \nbasis .[6] Tapas  and pinchos  are snacks and appetizers commonly served in bars and cafes.",
  'index': 2}]

In [None]:
def combine_sentences(sentences, buffer_size=1):
    # Go through each sentence dict
    for i in range(len(sentences)):

        # Create a string that will hold the sentences which are joined
        combined_sentence = ''

        # Add sentences before the current one, based on the buffer size.
        for j in range(i - buffer_size, i):
            # Check if the index j is not negative (to avoid index out of range like on the first one)
            if j >= 0:
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += sentences[j]['sentence'] + ' '

        # Add the current sentence
        combined_sentence += sentences[i]['sentence']

        # Add sentences after the current one, based on the buffer size
        for j in range(i + 1, i + 1 + buffer_size):
            # Check if the index j is within the range of the sentences list
            if j < len(sentences):
                # Add the sentence at index j to the combined_sentence string
                combined_sentence += ' ' + sentences[j]['sentence']

        # Store the combined sentence in the current sentence dict
        sentences[i]['combined_sentence'] = combined_sentence

    return sentences

sentences = combine_sentences(sentences)
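The windowing done by `combine_sentences` can be checked on a toy list. The sketch below is a compact variant written for illustration only: `combine_with_window` is a hypothetical name, and it works on plain strings rather than on the sentence dicts used above.

```python
def combine_with_window(sentences, buffer_size=1):
    """Join each sentence with up to buffer_size neighbours on either side."""
    combined = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)   # clamp at the start of the list
        end = i + buffer_size + 1         # slicing clamps at the end for us
        combined.append(' '.join(sentences[start:end]))
    return combined


toy = ["A.", "B.", "C.", "D."]
windows = combine_with_window(toy)
print(windows)
# ['A. B.', 'A. B. C.', 'B. C. D.', 'C. D.']
```

The first and last entries have only one neighbour, so their windows are shorter than the ones in the middle.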

In [None]:
sentences

[{'sentence': '\nSpanish cuisine  (Spanish : Cocina española ) consists of the traditions and practices of \nSpanish cooking.',
  'index': 0,
  'combined_sentence': "\nSpanish cuisine  (Spanish : Cocina española ) consists of the traditions and practices of \nSpanish cooking. It features considerable regional diversity, with important differences \nbetween the traditions of each of Spain's  regional cuisines ."},
 {'sentence': "It features considerable regional diversity, with important differences \nbetween the traditions of each of Spain's  regional cuisines .",
  'index': 1,
  'combined_sentence': "\nSpanish cuisine  (Spanish : Cocina española ) consists of the traditions and practices of \nSpanish cooking. It features considerable regional diversity, with important differences \nbetween the traditions of each of Spain's  regional cuisines . Olive oil  (of which Spain is the world's largest producer) is extensively used in Spanish \ncuisine .[1][2] It forms the base of many vegetabl

In [None]:
sentences_debugging = []

In [None]:
from langchain_openai.embeddings import OpenAIEmbeddings
embedder = OpenAIEmbeddings()

In [None]:
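# Note: this embeds x['sentence'], not x['combined_sentence'].
# The 'combined_sentence_embedding' key is kept because later cells read it.
embeddings = embedder.embed_documents([x['sentence'] for x in sentences])

for sentence, embedding in zip(sentences, embeddings):
    sentence['combined_sentence_embedding'] = embedding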

## Create topic embeddings

In [None]:
topics = ['Text regarding racket sports', 'Text regarding food', 'Text regarding technology']
topics_list = [{'sentence': x, 'index' : i} for i, x in enumerate(topics)]

In [None]:
topics_list

[{'sentence': 'Text regarding racket sports', 'index': 0},
 {'sentence': 'Text regarding food', 'index': 1},
 {'sentence': 'Text regarding technology', 'index': 2}]

In [None]:
topic_embeddings = embedder.embed_documents([x['sentence'] for x in topics_list])
print(topic_embeddings)

[[-0.021430230644980026, 0.0013532153651583977, 0.006850761176059402, -0.0029069072359310675, -0.0034080980343054418, 0.026462878251986444, 5.1658308006100035e-05, -0.013832868642836038, -0.03141256922858737, -0.03821494214131895, 0.02217683286804853, 0.006826565414580916, -0.004960061719554882, 0.008973044710426988, 0.016549668020630907, 0.005205472212926399, 0.03735773045682805, -0.01509794375234187, 0.009325607130021172, -0.007866968722655295, 0.006729783765650893, 0.018844776626579778, -0.002991591236952501, -0.019259556053316775, -0.023587080683780046, -0.007542058668416708, 0.018872429457596682, -0.021886487455597153, 0.005233124112620691, -0.009014523025629731, 0.03437132528984448, 0.005170907571139187, -0.0049013013162891835, 0.0009859634268244203, -0.0162178452242994, 0.0047111941566449446, 0.00017228436604502272, 0.01269222754761583, 0.016936796478996225, -0.000965224502053701, 0.03395654772575271, 0.012961833336804526, 0.0029345591356253597, -0.0160657602416421, -0.006187115

## Store the embeddings in the topic list

In [None]:
for i, topic in enumerate(topics_list):
    topic['combined_topic_embedding'] = topic_embeddings[i]

In [None]:
topics_list

[{'sentence': 'Text regarding racket sports',
  'index': 0,
  'combined_topic_embedding': [-0.021430230644980026,
   0.0013532153651583977,
   0.006850761176059402,
   -0.0029069072359310675,
   -0.0034080980343054418,
   0.026462878251986444,
   5.1658308006100035e-05,
   -0.013832868642836038,
   -0.03141256922858737,
   -0.03821494214131895,
   0.02217683286804853,
   0.006826565414580916,
   -0.004960061719554882,
   0.008973044710426988,
   0.016549668020630907,
   0.005205472212926399,
   0.03735773045682805,
   -0.01509794375234187,
   0.009325607130021172,
   -0.007866968722655295,
   0.006729783765650893,
   0.018844776626579778,
   -0.002991591236952501,
   -0.019259556053316775,
   -0.023587080683780046,
   -0.007542058668416708,
   0.018872429457596682,
   -0.021886487455597153,
   0.005233124112620691,
   -0.009014523025629731,
   0.03437132528984448,
   0.005170907571139187,
   -0.0049013013162891835,
   0.0009859634268244203,
   -0.0162178452242994,
   0.0047111941566449

In [None]:
# embeddings = oaiembeds.embed_documents([x['combined_sentence'] for x in sentences])


# for i, sentence in enumerate(sentences):
#     sentence['combined_sentence_embedding'] = embeddings[i]



from sklearn.metrics.pairwise import cosine_similarity

def get_most_related_topic(embedder, topics, sentences):
    """
    Tag each sentence with the topic whose embedding is most similar to it.

    Parameters:
        embedder: Unused; kept so existing calls keep working. Embeddings are
            read from the precomputed entries in the dicts below.
        topics (list of dict): Topic dicts with 'sentence' (the topic text)
            and 'combined_topic_embedding'.
        sentences (list of dict): Sentence dicts with 'combined_sentence_embedding'.

    Returns:
        list of dict: The same sentence dicts, each with a 'related_topic' key added.
    """
    for sentence in sentences:
        sentence_embedding = sentence['combined_sentence_embedding']

        # Track the best-scoring topic seen so far for this sentence
        max_similarity = float('-inf')
        related_topic = None

        for topic in topics:
            topic_embedding = topic['combined_topic_embedding']

            # Cosine similarity between the sentence and topic embeddings
            similarity = cosine_similarity([sentence_embedding], [topic_embedding])[0][0]

            if similarity > max_similarity:
                max_similarity = similarity
                related_topic = topic['sentence']

        # Record the winner once all topics have been compared
        sentence['related_topic'] = related_topic

    return sentences

In [None]:
sentences2 = get_most_related_topic(embedder, topics=topics_list, sentences=sentences)
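The assignment step boils down to "pick the topic with the highest cosine similarity". A minimal sketch of that idea with hand-made 2-D vectors follows; the vectors, texts and the `cosine` helper are all made up for illustration, whereas the notebook itself uses OpenAI embeddings and scikit-learn's `cosine_similarity`.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy "embeddings": axis 0 stands for sport, axis 1 for food
toy_topics = {
    "racket sports": [1.0, 0.0],
    "food": [0.0, 1.0],
}
toy_sentences = {
    "Tennis is played with a racket.": [0.9, 0.1],
    "Olive oil is used in sofritos.": [0.2, 0.8],
}

# For each sentence, keep the topic name with the highest similarity
assignments = {
    text: max(toy_topics, key=lambda name: cosine(vec, toy_topics[name]))
    for text, vec in toy_sentences.items()
}
print(assignments)
```

Note that every sentence always gets some topic, even when all similarities are low; a minimum-similarity threshold would be one way to leave such sentences unassigned.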

In [None]:
for sentence in sentences2:
    print("Sentence--->", sentence['sentence'])
    print("Topic---->", sentence['related_topic'] )

Sentence---> 
Spanish cuisine  (Spanish : Cocina española ) consists of the traditions and practices of 
Spanish cooking.
Topic----> Text regarding food
Sentence---> It features considerable regional diversity, with important differences 
between the traditions of each of Spain's  regional cuisines .
Topic----> Text regarding food
Sentence---> Olive oil  (of which Spain is the world's largest producer) is extensively used in Spanish 
cuisine .[1][2] It forms the base of many vegetable sauces (known in Spanish 
as sofritos ).[3] Herbs most commonly used 
include  parsley , oregano , rosemary  and thyme .[4] The use of  garlic  has been noted as 
common in Spanish cooking .[5] The most used meats in Spanish cuisine 
include  chicken , pork, lamb  and veal.[6] Fish and seafood  are also consumed on a regular 
basis .[6] Tapas  and pinchos  are snacks and appetizers commonly served in bars and cafes.
Topic----> Text regarding food
Sentence---> Tennis  is a racket sport  that is played eith

### SQL query testing: populate a local DB

In [None]:
import sqlite3

# Connect to the SQLite database (this will create it if it doesn't exist)
conn = sqlite3.connect('./inputs/Chinook.db')

# Create a cursor object using the cursor() method
cursor = conn.cursor()

# Create table as per requirement
sql_create_table = """
CREATE TABLE IF NOT EXISTS People (
    person_id INTEGER PRIMARY KEY,
    record_id INTEGER NOT NULL,
    date TEXT NOT NULL
);
"""

cursor.execute(sql_create_table)

# Insert a sample record into the table
sql_insert_record = """
INSERT INTO People (person_id, record_id, date) VALUES (?, ?, ?);
"""

# Sample data to insert
sample_data = (1, 100, '2024-03-09')

cursor.execute(sql_insert_record, sample_data)

# Commit the transaction
conn.commit()

# Close the connection
conn.close()
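One caveat with the cell above: `person_id` is the primary key and the sample row hard-codes it to `1`, so running the cell a second time should fail with `sqlite3.IntegrityError`. The sketch below shows one way to make the insert safe to re-run, using `INSERT OR IGNORE` and an in-memory database so that nothing on disk is touched. `insert_person` is a hypothetical helper, and silently ignoring duplicates is an assumption about the desired behaviour.

```python
import sqlite3


def insert_person(conn, person_id, record_id, date):
    """Insert a row, silently skipping it if person_id already exists."""
    # `with conn:` commits on success and rolls back if an exception is raised
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO People (person_id, record_id, date) VALUES (?, ?, ?)",
            (person_id, record_id, date),
        )


# In-memory database for the demo; the notebook uses './inputs/Chinook.db'
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS People (
        person_id INTEGER PRIMARY KEY,
        record_id INTEGER NOT NULL,
        date TEXT NOT NULL
    )
    """
)

# Simulate running the notebook cell twice
insert_person(conn, 1, 100, "2024-03-09")
insert_person(conn, 1, 100, "2024-03-09")

row_count = conn.execute("SELECT COUNT(*) FROM People").fetchone()[0]
print(row_count)
# 1
conn.close()
```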


## Text queries: populate a vector store (Chroma)

In [2]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings

In [4]:
loader = TextLoader('./inputs/sample.txt')
documents = loader.load()

# Split the document with the simplest chunking option
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=30)

texts = text_splitter.split_documents(documents)

# Calculate embeddings:
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Save texts and embeddings to chroma
db = Chroma.from_documents(texts, embeddings, persist_directory="./inputs")




Created a chunk of size 855, which is longer than the specified 500
Created a chunk of size 830, which is longer than the specified 500
Created a chunk of size 1097, which is longer than the specified 500
Created a chunk of size 539, which is longer than the specified 500
Created a chunk of size 814, which is longer than the specified 500
Created a chunk of size 666, which is longer than the specified 500
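These warnings are expected with `CharacterTextSplitter`: it cuts on a single separator (a blank line, `"\n\n"`, by default) and then merges the pieces up to `chunk_size`, so a paragraph that is longer than `chunk_size` on its own is emitted whole. The sketch below imitates that behaviour in plain Python to show where the oversized chunks come from. `split_on_separator` is a simplified stand-in written for illustration, not LangChain's implementation, and it ignores `chunk_overlap`.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Merge separator-delimited pieces into chunks of at most chunk_size.

    A single piece longer than chunk_size is kept whole, because there is
    no smaller unit to cut on.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # still fits: keep merging
        else:
            if current:
                chunks.append(current)   # flush what we had so far
            current = piece              # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks


# Three toy paragraphs; the middle one is longer than chunk_size
paragraphs = ["a" * 30, "b" * 120, "c" * 40]
demo = split_on_separator("\n\n".join(paragraphs), chunk_size=100)
sizes = [len(chunk) for chunk in demo]
print(sizes)
# [30, 120, 40]
```

If hard size limits matter, LangChain's `RecursiveCharacterTextSplitter` is a possible alternative, since it falls back to smaller separators when a piece is still too long.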
