# Retrieval-Augmented Generation (RAG) for Legal Questions
##### TODO:
- Preprocess corpus and queries:
    - Tokenize
    - Lowercase
    - Remove punctuation
    - Remove stopwords
    - Stemming
- Implement more similarity measures
- Try other LLMs

In [48]:
from IPython.display import display, Math, Latex
import requests
import json
import spacy 

## NLP Pipeline
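
The preprocessing steps listed in the TODO (tokenize, lowercase, remove punctuation, remove stopwords, stem) could be sketched as follows. This is only an illustration under stated assumptions: `STOPWORDS` is a tiny hand-picked list and `naive_stem` is a crude suffix stripper standing in for a real stemmer. Neither name comes from the notebook, and a library such as spaCy would normally replace both.

```python
import string

# Assumption: a tiny illustrative stopword list, not a complete one.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "on",
             "for", "is", "are", "be", "i", "my", "can"}

# Assumption: crude suffix stripping as a stand-in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")

# Map every punctuation character to a space so words stay separated.
PUNCTUATION_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop punctuation, tokenize, remove stopwords, then stem."""
    text = text.lower().translate(PUNCTUATION_TABLE)
    tokens = text.split()  # split on any run of whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("Can I store personal data for forbidden purposes?"))
```

Applying the same function to both the query and each document keeps the two sides comparable before any similarity measure is computed.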

## Import Laws and Regulations 

In [49]:
# Use a context manager so both files are closed after reading
with open("doc1.txt", "r") as doc1, open("doc2.txt", "r") as doc2:
    corpus = [
        doc1.read(),
        doc2.read()
    ]

print(corpus)

['\t\nAny processing of personal data should be lawful and fair. \nIt should be transparent to natural persons that personal data concerning them are collected, used, consulted or otherwise processed \nand to what extent the personal data are or will be processed. The principle of transparency requires that any information and \ncommunication relating to the processing of those personal data be easily accessible and easy to understand, and that clear and \nplain language be used. That principle concerns, in particular, information to the data subjects on the identity of the controller \nand the purposes of the processing and further information to ensure fair and transparent processing in respect of the natural \npersons concerned and their right to obtain confirmation and communication of personal data concerning them which are being processed. \nNatural persons should be made aware of risks, rules, safeguards and rights in relation to the processing of personal data and how \nto exer

## Jaccard Similarity

In [50]:
display(Math(r'\text{JS}(A, B) = \frac{\vert A \cap B \vert}{\vert A \cup B \vert} = \frac{\vert A \cap B \vert}{\vert A \vert + \vert B \vert - \vert A \cap B \vert}'))


In [51]:
def jaccard_similarity(query, document):
    query = query.lower().split(" ")
    document = document.lower().split(" ")
    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    return len(intersection)/len(union)
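
As a small worked example of the formula above, the snippet below computes the Jaccard similarity of two hand-made word sets and checks that the two expressions for the union size agree. The helper `jaccard_sets` and the sets `A` and `B` are illustrative names, not part of the notebook, and returning 0.0 for two empty sets is an assumed convention.

```python
def jaccard_sets(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (assumed convention)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


A = {"personal", "data", "processing"}
B = {"personal", "data", "consent", "children"}

intersection_size = len(A & B)                        # 2 shared words
union_direct = len(A | B)                             # 5 distinct words overall
union_formula = len(A) + len(B) - intersection_size   # 3 + 4 - 2 = 5

print(union_direct == union_formula)  # True
print(jaccard_sets(A, B))             # 2 / 5 = 0.4
```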

## Sørensen-Dice Similarity 

In [52]:
display(Math(r'\text{SD}(A, B) = \frac{2 \times \vert A \cap B \vert}{\vert A \vert + \vert B \vert}'))


In [53]:
def sorense_dice_similarity(query, document):
    # Work on sets of distinct words so that |A| and |B| match the formula above
    query = set(query.lower().split(" "))
    document = set(document.lower().split(" "))
    intersection = query.intersection(document)
    return 2*len(intersection)/(len(query) + len(document))
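
When both measures are computed on sets, they are tied by the identity SD = 2·JS / (1 + JS), so they always rank documents in the same order and differ only in scale. The sketch below checks this on a toy pair of sets; `dice_sets`, `jaccard_sets`, `A`, and `B` are illustrative names that do not appear in the notebook.

```python
import math


def jaccard_sets(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty (assumed convention)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice_sets(a, b):
    """Sørensen-Dice similarity of two sets; 0.0 when both are empty (assumed convention)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


A = {"personal", "data", "processing"}
B = {"personal", "data", "consent", "children"}

js = jaccard_sets(A, B)  # 2 / 5
sd = dice_sets(A, B)     # 2 * 2 / (3 + 4)

# The two measures are related by SD = 2 * JS / (1 + JS).
print(math.isclose(sd, 2 * js / (1 + js)))
```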

##### Testing Similarity Measures

In [54]:
def get_most_likely_document(query, corpus):
    similarities = []
    for doc in corpus:
        similarity = jaccard_similarity(query, doc)
        #similarity = sorense_dice_similarity(query, doc)
        print(similarity)
        similarities.append(similarity)
    return corpus[similarities.index(max(similarities))]
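
To make it easier to try other similarity measures, the selection step could take the measure as a parameter and return the top `k` documents instead of a single one. The following is a minimal sketch of that idea; `rank_documents`, `word_set`, and the toy corpus are hypothetical and not part of the notebook.

```python
def word_set(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(query, document):
    """Jaccard similarity on word sets; 0.0 when both texts are empty."""
    a, b = word_set(query), word_set(document)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_documents(query, corpus, similarity=jaccard, k=1):
    """Return the k best (index, score) pairs, highest score first."""
    scored = [(similarity(query, doc), index) for index, doc in enumerate(corpus)]
    # Sort by descending score; ties keep the original corpus order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(index, score) for score, index in scored[:k]]


toy_corpus = [
    "personal data must be processed lawfully",
    "parental consent is required for children",
]
print(rank_documents("consent for children", toy_corpus, k=2))
```

Note that this sketch tokenizes with `split()`, which splits on any whitespace, whereas the functions above use `split(" ")`; on text containing newlines or tabs the two can yield different scores.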

In [55]:
get_most_likely_document("Can i record my children and post the video on a social?", corpus)

0.025974025974025976
0.03636363636363636


"An operator is required to obtain verifiable parental consent before any collection, use, or disclosure of personal information \nfrom children, including consent to any material change in the collection, use, or disclosure practices to which the parent has \npreviously consented. An operator must give the parent the option to consent to the collection and use of the child's personal \ninformation without consenting to disclosure of his or her personal information to third parties."

In [56]:
get_most_likely_document("Can I store personal data for a forbidden purposes?", corpus)

0.026490066225165563
0.018867924528301886


'\t\nAny processing of personal data should be lawful and fair. \nIt should be transparent to natural persons that personal data concerning them are collected, used, consulted or otherwise processed \nand to what extent the personal data are or will be processed. The principle of transparency requires that any information and \ncommunication relating to the processing of those personal data be easily accessible and easy to understand, and that clear and \nplain language be used. That principle concerns, in particular, information to the data subjects on the identity of the controller \nand the purposes of the processing and further information to ensure fair and transparent processing in respect of the natural \npersons concerned and their right to obtain confirmation and communication of personal data concerning them which are being processed. \nNatural persons should be made aware of risks, rules, safeguards and rights in relation to the processing of personal data and how \nto exerc

## Adding an LLM (Llama 2)

In [57]:
def get_llama_response(query, corpus):
    doc = get_most_likely_document(query, corpus) # compute similarity to find the most useful document
    
    prompt = """
                You are a bot that answers simple legal questions. 
                You answer in very short sentences and do not include extra information.
                This is the info you have to use to answer: {relevant_document}
                The user input is: {user_input}
                Compile an answer to the user based on the info and the user input.
                Don't add any info other than the answer.
             """
    
    url = 'http://localhost:11434/api/generate'
    
    data = {
        "model": "llama2",
        "prompt": prompt.format(user_input=query, relevant_document=doc)
    }
    
    headers = {'Content-Type': 'application/json'}
    response = requests.post(url, data=json.dumps(data), headers=headers, stream=True)
        
    full_response = []
    try:
        for line in response.iter_lines():
            if line:
                decoded_line = json.loads(line.decode('utf-8')) 
                full_response.append(decoded_line['response'])
    finally:
        response.close()
        
    return ''.join(full_response)
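
If token-by-token streaming is not needed, the same endpoint can be asked for a single JSON object by sending `"stream": False` in the request body. The sketch below shows that variant with the model name as a parameter, which also makes it simple to try other LLMs. The names `build_payload` and `generate`, the injectable `post` argument, and the timeout value are assumptions made for illustration, not part of the notebook.

```python
OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as above


def build_payload(prompt, model="llama2"):
    """Request body for a non-streaming generation call."""
    # "stream": False asks the server for one JSON object instead of a line stream.
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt, model="llama2", post=None, timeout=120):
    """Send the prompt and return the generated text.

    `post` is injectable so the function can be exercised without a running
    server; it defaults to `requests.post`.
    """
    if post is None:
        import requests  # imported lazily so a stub can replace it in tests
        post = requests.post
    response = post(OLLAMA_URL, json=build_payload(prompt, model), timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error body
    return response.json()["response"]
```

With a local server running, `generate(prompt.format(user_input=query, relevant_document=doc))` would then replace the request and streaming loop in `get_llama_response`.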

In [58]:
get_llama_response("Can I record children and post the video on a social?", corpus)

0.025974025974025976
0.03636363636363636


'No, you cannot record your children and post the video on a social media platform without obtaining verifiable parental consent first. The operator is required to obtain consent from the parent before collecting, using, or disclosing personal information from children, including for the purpose of posting videos online.'

In [59]:
get_llama_response("Can I store personal data for a forbidden purposes", corpus)

0.03333333333333333
0.018867924528301886


'No, you cannot store personal data for forbidden purposes. The processing of personal data should be lawful and fair, and should be transparent to natural persons. The principle of transparency requires that any information and communication relating to the processing of personal data be easily accessible and easy to understand, and that clear and plain language be used. Additionally, personal data should be processed only if the purpose of the processing could not reasonably be fulfilled by other means, and time limits should be established by the controller for erasure or for a periodic review. Every reasonable step should be taken to ensure that personal data which are inaccurate are rectified or deleted.'

In [60]:
get_llama_response("Can I store personal data for a lecit purposes", corpus)

0.03333333333333333
0.018867924528301886


'Yes, you can store personal data for legitimate purposes.'