## Installation and imports

In [1]:
!pip install -q -U langchain
!pip install -q -U sentence-transformers
!pip install -q -U chromadb
!pip install -q -U bitsandbytes
!pip install -q -U accelerate 

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf 23.8.0 requires cubinlinker, which is not installed.
cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
cudf 23.8.0 requires ptxcompiler, which is not installed.
cuml 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
dask-cudf 23.8.0 requires cupy-cuda11x>=12.0.0, which is not installed.
keras-cv 0.8.2 requires keras-core, which is not installed.
keras-nlp 0.9.3 requires keras-core, which is not installed.
tensorflow-decision-forests 1.8.1 requires wurlitzer, which is not installed.
apache-beam 2.46.0 requires dill<0.3.2,>=0.3.1.1, but you have dill 0.3.8 which is incompatible.
apache-beam 2.46.0 requires numpy<1.25.0,>=1.14.3, but you have numpy 1.26.4 which is incompatible.
apache-beam 2.46.0 requires pyarrow<10.0.0,>=3.0.0, but you have pyarrow 15.0.2 which is incompatible.

In [2]:
from datasets import load_dataset
import pandas as pd

from langchain.document_loaders import TextLoader, DataFrameLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma

import torch

from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

## Loading and preprocessing the dataset

In [3]:
# Load the data from a local CSV file or from a Kaggle dataset

#df = load_dataset("csv", data_files='medium.csv') 
df = pd.read_csv('/kaggle/input/1300-towards-datascience-medium-articles-dataset/medium.csv')

In [4]:
# Manually cut every article into 400-character pieces, each tagged with its title
data = []
for index in range(len(df)):
    topic = df['Title'][index]
    text = df['Text'][index]
    for i in range(0, len(text), 400):
        # Slicing past the end of a string is safe, so the last piece is simply shorter
        data.append([topic, text[i:i+400]])
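
The loop above cuts each article into fixed 400-character pieces. A minimal standalone sketch of that chunking step follows; the helper name `chunk_text` is hypothetical and not part of the notebook. It relies on the fact that a Python slice running past the end of a string is truncated rather than raising an error.

```python
def chunk_text(text: str, size: int = 400) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    # The final slice may be shorter than `size`; no special case is needed.
    return [text[i:i + size] for i in range(0, len(text), size)]


# Toy input with a small chunk size so the result is easy to inspect
print(chunk_text("abcdefghij", size=4))
```
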
        

In [5]:
# Wrap each row as a Document: the title becomes page_content, the other columns (including Text) go into metadata
articles = DataFrameLoader(df, page_content_column="Title")
document = articles.load()

In [6]:
# Split the documents into chunks of at most 300 characters, with a 10-character overlap
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=10)
docOutput = splitter.split_documents(document)
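
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class first tries to split on separators such as paragraph breaks and spaces); the function name `split_with_overlap` is an assumption made for illustration only.

```python
def split_with_overlap(text: str, chunk_size: int = 300, chunk_overlap: int = 10) -> list[str]:
    """Cut text into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Toy values: windows of 4 characters, each repeating the last character of the previous one
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
```
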

## Creating the database for similarity search
Build a Chroma database for similarity search over the embedded documents, then retrieve the articles most relevant to a given query for use in the generation step.

In [7]:
model = HuggingFaceEmbeddings(model_name="avsolatorio/GIST-small-Embedding-v0")


In [8]:
chroma_db = Chroma.from_documents(documents=docOutput, embedding=model, persist_directory="chroma_db")
retriever = chroma_db.as_retriever()
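
Conceptually, similarity search ranks stored vectors by how close they are to the query vector. The sketch below illustrates that idea with cosine similarity over made-up three-dimensional vectors; the names `cosine_similarity`, `top_k` and `toy_index` are hypothetical, and the metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the k (title, score) pairs with the highest similarity to the query."""
    scored = [(title, cosine_similarity(query_vec, vec)) for title, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real embeddings
toy_index = {
    "plotting": [0.9, 0.1, 0.0],
    "statistics": [0.2, 0.8, 0.1],
    "databases": [0.0, 0.1, 0.9],
}
best = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print([title for title, _ in best])
```
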

In [73]:
query = 'What is ggplot?'

In [74]:
docs = chroma_db.similarity_search(query, k = 3)
print(f"Query: {query}")
print(f"Retrieved documents: {len(docs)}")
for doc in docs:
    print("Source: ", doc.page_content)
    print("Text: ", doc.metadata['Text'][:100], "\n")

Query: What is ggplot?
Retrieved documents: 3
Source:  How to Use ggplot2 in Python
Text:  Introduction

Thanks to its strict implementation of the grammar of graphics, ggplot2 provides an ex 

Source:  Data visualization with Python Using Seaborn and Plotly_ GDP per Capita & Life Expectency Dataset
Text:  Data visualization with Python Using Seaborn and Plotly_ GDP per Capita & Life Expectency Dataset

P 

Source:  Quick Code to Spruce Up Your Histograms & Scatterplots
Text:  Simple graphs are essential visual tools for data analysis. If you are starting to learn how to make 



## Generating the answer based on the articles found
Set up the Mistral 7B Instruct model, build a prompt from the query and the context found in the retrieved articles, and feed it to the model.

In [75]:
text_found = ""  # Full text of the retrieved articles, used as context
for doc in docs:
    text_found += f"\n{doc.metadata['Text']}"

In [76]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
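
A back-of-the-envelope calculation shows why 4-bit loading matters on a single GPU. The figure of 7e9 parameters is a round-number assumption for a "7B" model, and the estimate covers weights only: it ignores quantization constants, any layers kept in higher precision, activations, and the KV cache.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7e9  # assumption: round figure for a "7B" model

fp16 = weight_memory_gib(N_PARAMS, 16)
nf4 = weight_memory_gib(N_PARAMS, 4)
print(f"fp16 weights: {fp16:.1f} GiB, 4-bit weights: {nf4:.1f} GiB")
```
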

In [77]:
model_name='/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1'

In [78]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
textmodel = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
    )

In [80]:
text_generation_pipeline = pipeline(  #creating the pipeline
    "text-generation",
    model=textmodel,
    tokenizer = tokenizer,
    torch_dtype =  torch.float16,
)

In [81]:
prompt = f"Answer the question in under 150 words based on the context given: {text_found}. The question: {query}"
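
Because the context is the full text of every retrieved article, the prompt may grow beyond what the model can attend to. One possible guard, sketched below, caps the context at a character budget. The helper `build_prompt` and its `max_context_chars` parameter are hypothetical, and a character count is only a rough proxy for the model's token limit.

```python
def build_prompt(query: str, contexts: list[str], max_context_chars: int = 4000) -> str:
    """Assemble the prompt, truncating the joined context to a character budget."""
    pieces = []
    remaining = max_context_chars
    for text in contexts:
        if remaining <= 0:
            break  # budget used up; skip the remaining articles
        piece = text[:remaining]
        pieces.append(piece)
        remaining -= len(piece)
    context = "\n".join(pieces)
    return (
        "Answer the question in under 150 words based on the context given:\n"
        f"{context}\n"
        f"The question: {query}"
    )


# Toy example: a budget of 6 characters keeps "aaaa" and the first two characters of "bbbb"
print(build_prompt("What is ggplot?", ["aaaa", "bbbb"], max_context_chars=6))
```
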

In [82]:
result = text_generation_pipeline(  # answer generation from the pipeline
    prompt,
    do_sample=True,
    max_new_tokens=150, 
    temperature=0.7, 
    top_k=50, 
    top_p=0.95,
    num_return_sequences=1,
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
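
The generation call above samples with `temperature`, `top_k` and `top_p`. The following library-independent sketch shows roughly what those three settings do to a next-token distribution. The function name `filter_distribution` and the toy logits are hypothetical; this is a simplified illustration, not the `transformers` implementation.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.95):
    """Return the renormalized token probabilities that survive filtering."""
    # Temperature: divide logits before softmax; values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}

    # Top-k: keep only the k most likely tokens, then renormalize.
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(weight for _, weight in ranked)
    ranked = [(tok, weight / total) for tok, weight in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -1.0}  # made-up values
print(filter_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.9))
```

A token would then be drawn at random from the returned distribution, which is why repeated runs of the same prompt can give different answers.
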


In [83]:
answer = f"""Answer to the query: {query} \n
{result[0]["generated_text"][len(prompt):]}... \n
This answer was created based on articles: \n
""" + "\n ".join(doc.page_content for doc in docs)

print(answer)


Answer to the query: What is ggplot? 



Ggplot is a data visualization package for the R programming language. It provides a set of tools for creating high-quality, informative graphs and charts. It was created by Hadley Wickham in 2006.

It has a lot of similarities to the Python seaborn library.

The ggplot package is built on top of the ggplot2 package, which is a powerful and flexible tool for creating high-quality, informative graphs and charts.

It is used to create a wide range of visualizations, including line charts, scatterplots, bar charts, and more.

It is also used to create interactive visualizations, which allow users to... 

This answer was created based on articles: 

How to Use ggplot2 in Python
 Data visualization with Python Using Seaborn and Plotly_ GDP per Capita & Life Expectency Dataset
 Quick Code to Spruce Up Your Histograms & Scatterplots
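
The answer cell removes the prompt by slicing `generated_text` at `len(prompt)`, which assumes the output begins with the prompt verbatim. A slightly more defensive sketch is shown below; the helper name `extract_completion` is hypothetical. As an alternative, the `transformers` text-generation pipeline accepts `return_full_text=False` to return only the newly generated text.

```python
def extract_completion(generated_text: str, prompt: str) -> str:
    """Return the generated text without the leading prompt, if the prompt is present."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    # Fall back to the full text rather than cutting at an arbitrary position
    return generated_text.strip()


# Toy strings standing in for a real prompt and model output
print(extract_completion("Q: hi A: hello", "Q: hi"))
```
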
