## Raw data
The data comes from Kaggle (https://www.kaggle.com/datasets/stackoverflow/pythonquestions/). It was collected from Stack Overflow questions carrying the Python tag and covers the period from August 2008 to October 2016. The data is stored in three tables: Questions, Answers and Tags. The text is raw, heterogeneous and unstructured. Not all questions have answers. Most questions have multiple answers, each with a score that ranges from negative to strongly positive.

## Data preparation
A dataset containing the answers with the best scores has been prepared; it is then divided into training and validation sets.
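
As a rough illustration of what selecting the best-scored answers could look like, the sketch below keeps the highest-scored answer for each question. The field names (`ParentId`, `Score`, `Body`) are assumed to follow the Kaggle Answers table, and `best_answers` is a hypothetical helper, not the preparation code actually used for this project.

```python
def best_answers(answers):
    """Return a dict mapping question id -> highest-scored answer."""
    best = {}
    for answer in answers:
        qid = answer["ParentId"]
        # Keep the first answer seen, or replace it when a higher score appears
        if qid not in best or answer["Score"] > best[qid]["Score"]:
            best[qid] = answer
    return best


# Toy rows standing in for the Answers table
answers = [
    {"Id": 1, "ParentId": 10, "Score": 3, "Body": "use a list"},
    {"Id": 2, "ParentId": 10, "Score": 7, "Body": "use a dict"},
    {"Id": 3, "ParentId": 11, "Score": -1, "Body": "try eval"},
]
top = best_answers(answers)
print({qid: a["Id"] for qid, a in top.items()})
```

A question whose only answer has a negative score still gets an entry here; filtering such rows out would be a separate design decision.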

## Initialization of the embedding model
The <b>*sentence-transformers/all-MiniLM-L6-v2*</b> model is used to create vector representations of the text, that is, to prepare the data for indexing and subsequent vector search.

## Integration with Pinecone
Pinecone, a vector search service, hosts the `dataspeak-qa` index, which makes it possible to quickly find the answers most relevant to a question. This is a key component of the question-answering system.

## Downloading and configuring the language model
The language model <b>*meta-llama/Llama-2-13b-chat-hf*</b> has been selected for text generation.

## Initialization of the tokenizer and stop criteria
The tokenizer is initialized to convert text into tokens, and criteria are also set for stopping text generation to prevent infinite generation.

## Integration of the embedding model with Pinecone
A LangChain `Pinecone` vector store object is created that connects the embedding model to the Pinecone index, making it possible to run vector searches over the indexed data.

## Vector search and user interaction
The query vector is generated by the embedding model and used to find the most relevant answers in the Pinecone index.

## Conclusion
This project demonstrates an approach to creating a question and answer system using natural language processing and vector search technologies. It combines several components: text preprocessing and vectorization, indexing and vector search, as well as text generation based on the found data.
Using a variety of external services and libraries requires careful testing and possible optimization to increase efficiency and reliability.

In [None]:
# Installing the necessary libraries
!pip install ipykernel
!pip install datasets
!pip install git-lfs
!pip install torch==2.1.0
!pip install --upgrade torchaudio torchdata torchtext torchvision
!pip install accelerate
!pip install evaluate
!pip install tensorflow
!pip install sentence-transformers
!pip install transformers==4.31.0
!pip install sentence-transformers==2.2.2
!pip install pinecone-client # ==2.2.2
!pip install datasets==2.14.0
!pip install accelerate==0.21.0
!pip install einops==0.6.1
!pip install langchain # ==0.0.240
!pip install xformers==0.0.20
!pip install bitsandbytes==0.41.0


Collecting jedi>=0.16 (from ipython>=5.0.0->ipykernel)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi
Successfully installed jedi-0.19.1
Collecting datasets
  Downloading datasets-2.17.0-py3-none-any.whl (536 kB)
Collecting pyarrow>=12.0.0 (from datasets)
  Downloading pyarrow-15.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (38.3 MB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)

In [None]:
#!pip install langchain==0.0.240

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import transformers
import torch
import tensorflow as tf
from torch import cuda, bfloat16
import os
import pinecone
# The LangChain Pinecone vector store is imported later under an alias,
# so that it does not shadow the `pinecone` client module imported above.
from langchain.chains import RetrievalQA
from transformers import StoppingCriteria, StoppingCriteriaList
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
import time
import sys
import json
from google.colab import drive

In [None]:
drive.mount('/content/drive')
filepath = '/content/drive/My Drive/NLP/data/dataspeak'

Mounted at /content/drive


In [None]:
df_best_score = pd.read_csv(filepath, encoding='iso-8859-1')
print(df_best_score.sample(5))


        id_question  owner_user_id_question     creation_date_question  \
46066       4681737                  152439  2011-01-13 15:26:29+00:00   
209889     18150230                 2014962  2013-08-09 15:23:48+00:00   
547167     36995764                 5705019  2016-05-03 04:44:46+00:00   
47722       4834894                  594669  2011-01-29 03:07:54+00:00   
43077       4402685                  235709  2010-12-09 20:12:47+00:00   

        score_question                                              title  \
46066               15  How to calculate the area of a polygon on the ...   
209889               3   Profiling Python code that uses multiprocessing?   
547167               3            Detecting Overlapping Circles in OpenCV   
47722                1   Python: Get the class, which a method belongs to   
43077                0   Switch between FCGI and CGI with Python and Flup   

                                            body_question  id_answer  \
46066   Let's say yo

In [None]:
#df_best_score.info()

# Data Preparation

In [None]:
# Split the data into training and validation sets.
# Only 5% of the rows are kept for training and 2% for validation.
small_train_dataset, small_eval_dataset = train_test_split(df_best_score, test_size=0.02, train_size=0.05, random_state=54321)

small_train_dataset = small_train_dataset.reset_index(drop=True)
small_eval_dataset = small_eval_dataset.reset_index(drop=True)
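
To make the effect of the two fractions concrete, here is a minimal pure-Python sketch of a seeded split into two disjoint subsets. `split_by_fraction` is a hypothetical helper written for illustration; scikit-learn's `train_test_split` rounds the subset sizes slightly differently, so the counts may differ by a row.

```python
import random


def split_by_fraction(rows, train_frac, eval_frac, seed):
    """Shuffle rows with a fixed seed and cut two disjoint subsets."""
    if train_frac + eval_frac > 1:
        raise ValueError("fractions must sum to at most 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_eval = int(len(shuffled) * eval_frac)
    # The two slices never overlap, so no row lands in both subsets
    return shuffled[:n_train], shuffled[n_train:n_train + n_eval]


rows = list(range(1000))
train_rows, eval_rows = split_by_fraction(rows, 0.05, 0.02, seed=54321)
print(len(train_rows), len(eval_rows))
```

The remaining rows are simply left unused, which mirrors passing both `train_size` and `test_size` with a sum below 1.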


In [None]:
train_title = small_train_dataset['title']
train_body_questions = small_train_dataset['body_question']
train_body_answers = small_train_dataset['body_answer']


# Initializing the model to create embeddings

In [None]:

embed_model_id = 'sentence-transformers/all-MiniLM-L6-v2'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

embed_model = HuggingFaceEmbeddings(
    model_name=embed_model_id,
    model_kwargs={'device': device},
    encode_kwargs={'device': device, 'batch_size': 32}
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# Integration with Pinecone

In [None]:
# Initializing Pinecone with explicitly set values
PINECONE_API = "********-****-****-****-************"
PINECONE_ENVIRON = "Chat"
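
One possible alternative to writing the key into the cell is to read it from an environment variable. The sketch below assumes a variable named `PINECONE_API_KEY` and a hypothetical helper `read_api_key`; neither comes from the original notebook.

```python
import os


def read_api_key(var_name="PINECONE_API_KEY"):
    """Fetch a secret from the environment, failing loudly when it is missing."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Demo only: a placeholder value is set so the sketch runs as-is
os.environ.setdefault("PINECONE_API_KEY", "placeholder-key")
PINECONE_API = read_api_key()
```

With this approach the notebook can be shared without masking the key by hand.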


In [None]:
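from pinecone import Pinecone

pc = Pinecone(api_key=PINECONE_API)

# Embed a sample text to find the vector dimensionality the index must use
embeddings = embed_model.embed_documents(['this is a sample document'])
test = len(embeddings[0])

index_name = 'dataspeak-qa'
print(test)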


384


In [None]:
while not pc.describe_index(index_name).status['ready']:
    time.sleep(1)
index = pc.Index(index_name)
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.30613,
 'namespaces': {'': {'vector_count': 30613}},
 'total_vector_count': 30613}

In [None]:
# Populate the index only if it is still empty
index_info = index.describe_index_stats()

if index_info['total_vector_count'] == 0:
    batch_size = 32
    max_metadata_size = 39000  # byte budget for the metadata of a single vector

    for start in range(0, len(small_train_dataset), batch_size):
        end = min(len(small_train_dataset), start + batch_size)
        batch = small_train_dataset.iloc[start:end]
        ids = [f"{row['id_question']}-{row['id_answer']}" for _, row in batch.iterrows()]
        texts = [row['body_answer'] for _, row in batch.iterrows()]
        embeds = embed_model.embed_documents(texts)

        metadata = []
        for text in texts:
            record = {'text': text}
            # Measure the encoded size in bytes, not the Python object size
            size = len(json.dumps(record, ensure_ascii=False).encode('utf-8'))
            if size > max_metadata_size:
                # Truncate the answer text itself, not the list of records
                record = {'text': text[:20000]}
            metadata.append(record)

        index.upsert(vectors=list(zip(ids, embeds, metadata)))
else:
    print('Vectors already exist. Please use an existing index or start over.')
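
The size check in the cell above can be sketched in isolation. The helper below, `fit_metadata`, is hypothetical: it shortens a text until its JSON-encoded metadata fits a byte budget, using a binary search over the prefix length. The 39000-byte default mirrors `max_metadata_size`; the exact limit enforced by the service is an assumption to verify against the Pinecone documentation.

```python
import json


def fit_metadata(text, max_bytes=39000):
    """Shorten `text` until the JSON-encoded metadata fits in `max_bytes`."""
    def size(s):
        return len(json.dumps({"text": s}, ensure_ascii=False).encode("utf-8"))

    if size(text) <= max_bytes:
        return {"text": text}
    # Binary search for the longest prefix that still fits
    lo, hi = 0, len(text)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if size(text[:mid]) <= max_bytes:
            lo = mid
        else:
            hi = mid - 1
    return {"text": text[:lo]}


short = fit_metadata("print('hello')")
long = fit_metadata("é" * 50000)
print(len(short["text"]), len(long["text"]))
```

Measuring the encoded bytes matters for non-ASCII text: each `é` takes two bytes in UTF-8, so a fixed character cut-off can overshoot or undershoot the budget.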

In [None]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.30613,
 'namespaces': {'': {'vector_count': 30613}},
 'total_vector_count': 30613}

# Downloading and configuring the language model

In [None]:
model_id = 'meta-llama/Llama-2-13b-chat-hf'

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

In [None]:
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)
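
A back-of-the-envelope calculation shows why 4-bit loading is attractive here. The sketch assumes a nominal 13 billion parameters and counts the weights only; quantization constants, activations and the attention cache add to the real footprint.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 13e9  # nominal parameter count for a "13b" model (assumption)
fp16_gb = weight_memory_gb(n_params, 16)
nf4_gb = weight_memory_gb(n_params, 4)
print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {nf4_gb:.1f} GB")
```

Under these assumptions the weights shrink to roughly a quarter of their 16-bit size, which is what lets the model fit on a single GPU with limited memory.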

In [None]:
hf_auth = 'hf_***************************'
model_config = transformers.AutoConfig.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)


In [None]:
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    config=model_config,
    quantization_config=bnb_config,
    device_map='auto',
    use_auth_token=hf_auth
)




In [None]:
model.eval()

print(f"The model is loaded on {device}")

# Initialization of the tokenizer and stop criteria

In [None]:
tokenizer = transformers.AutoTokenizer.from_pretrained(
    model_id,
    use_auth_token=hf_auth
)


In [None]:
stop_list = ['\nContext:', '\n```\n', '\nAnswer:']

stop_token_ids = [tokenizer(x)['input_ids'] for x in stop_list]
stop_token_ids = [torch.LongTensor(x).to(device) for x in stop_token_ids]
stop_token_ids

[tensor([    1, 29871,    13,  2677, 29901], device='cuda:0'),
 tensor([    1, 29871,    13, 28956,    13], device='cuda:0'),
 tensor([    1, 29871,    13, 22550, 29901], device='cuda:0')]

In [None]:
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        for stop_ids in stop_token_ids:
            if torch.eq(input_ids[0][-len(stop_ids):], stop_ids).all():
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])
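
The suffix comparison inside `StopOnTokens` can be illustrated with plain Python lists. Each tensor printed above starts with id 1, the beginning-of-sequence token, which normally appears only at the very start of the input. A suffix match against a sequence that includes it is therefore unlikely to fire in the middle of a generation. The sketch below uses a hypothetical helper `ends_with_any` and toy token ids to show the difference; whether a real generation emits exactly these ids depends on the tokenizer.

```python
def ends_with_any(generated_ids, stop_sequences):
    """Return True when the generated ids end with one of the stop sequences."""
    for stop in stop_sequences:
        if len(stop) <= len(generated_ids) and generated_ids[-len(stop):] == stop:
            return True
    return False


BOS_ID = 1  # beginning-of-sequence id seen at the start of each stop sequence
raw_stops = [[1, 29871, 13, 2677, 29901], [1, 29871, 13, 22550, 29901]]
# Drop the leading BOS so the sequence can match in the middle of a generation
stops = [s[1:] if s and s[0] == BOS_ID else s for s in raw_stops]

# Toy generation: BOS, two arbitrary ids, then the ids of the first stop sequence
generated = [1, 450, 1234, 29871, 13, 2677, 29901]
print(ends_with_any(generated, raw_stops), ends_with_any(generated, stops))
```

In the notebook, one way to get sequences without the leading token would be to pass `add_special_tokens=False` to the tokenizer call, though the resulting ids should be checked against real model output.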

# Embedding integration with Pinecone

In [None]:
generate_text = transformers.pipeline(
    model=model, tokenizer=tokenizer,
    return_full_text=True,
    task='text-generation',
    stopping_criteria=stopping_criteria,  # apply the stop sequences defined above
    temperature=0.0,
    max_new_tokens=512,  # the maximum number of tokens to generate in the output
    repetition_penalty=1.1  # without this, the output begins to repeat itself
)

In [None]:
llm = HuggingFacePipeline(pipeline=generate_text)

In [None]:
text_field = 'text'

from langchain.vectorstores import pinecone as pinecone_store

vectorstore = pinecone_store.Pinecone(index, embed_model.embed_query, text_field)




In [None]:
index = pc.Index(index_name)
print(index)

<pinecone.data.index.Index object at 0x7ef968f0cdc0>


# Vector search and user interaction

In [None]:
query = 'tell me about python'
# Converting a text query to a vector using embeddings
query_vector = embed_model.embed_query(query)
# Performing a similarity search using the resulting query vector
search_result = index.query(vector=query_vector, top_k=5)
print(search_result)

{'matches': [{'id': '2324208-2324217', 'score': 0.702793181, 'values': []},
             {'id': '5214399-5214786', 'score': 0.657592773, 'values': []},
             {'id': '2385380-2385560', 'score': 0.657296062, 'values': []},
             {'id': '32421612-32421764', 'score': 0.634758532, 'values': []},
             {'id': '2823983-2824002', 'score': 0.631272554, 'values': []}],
 'namespace': '',
 'usage': {'read_units': 5}}
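
The ranking returned above can be reproduced conceptually with a brute-force search. The sketch assumes the index uses cosine similarity (the index metric is not shown in this notebook) and uses toy three-dimensional vectors instead of the 384-dimensional embeddings; `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, vectors, k):
    """Score every stored vector against the query and keep the k best."""
    scored = [(vec_id, cosine(query_vec, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


vectors = {
    "q1-a1": [1.0, 0.0, 0.0],
    "q2-a2": [0.6, 0.8, 0.0],
    "q3-a3": [0.0, 0.0, 1.0],
}
matches = top_k([1.0, 0.0, 0.0], vectors, k=2)
print(matches)
```

A vector database avoids this linear scan by using approximate nearest-neighbour structures, but the meaning of the score is the same.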


In [None]:
question_answer = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    # The number of retrieved documents is passed through search_kwargs
    retriever=vectorstore.as_retriever(search_kwargs={'k': 5})
)
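
The `stuff` chain type places all retrieved documents into a single prompt. The sketch below shows the idea with a hypothetical template and helper `build_stuffed_prompt`; it is not the prompt LangChain actually uses.

```python
def build_stuffed_prompt(question, documents):
    """Concatenate all retrieved documents into one context block."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy documents standing in for the retrieved answers
docs = ["SciPy builds on NumPy.", "SciPy offers optimization routines."]
prompt = build_stuffed_prompt("What is scipy?", docs)
print(prompt)
```

Because every document ends up in the same prompt, the number of retrieved documents and their length are bounded by the context window of the language model.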

In [None]:
chat_history = []

def chatting(question):
    # RetrievalQA reads the question from the 'query' key;
    # chat_history is only collected here for reference
    result = question_answer({'query': question, 'chat_history': chat_history})
    chat_history.append(result['result'])
    print(result['result'])

    return result['result']


In [None]:
chatting('What is scipy?')



 Scipy is a scientific computing library for Python.


' Scipy is a scientific computing library for Python.'

In [None]:
chatting('What is scikit-learn and how is it helpful?')

 Scikit-learn is a machine learning library for Python that provides tools for classification, regression, clustering, and other tasks. It includes a wide range of algorithms and utilities for data preprocessing, feature selection, and model evaluation. By using scikit-learn, developers can quickly and easily implement machine learning models in their applications without having to write everything from scratch.


' Scikit-learn is a machine learning library for Python that provides tools for classification, regression, clustering, and other tasks. It includes a wide range of algorithms and utilities for data preprocessing, feature selection, and model evaluation. By using scikit-learn, developers can quickly and easily implement machine learning models in their applications without having to write everything from scratch.'

In [None]:
chatting('matplotlib color tables')



You can use matplotlib's color tables to create a sequence of colors that gradually transition from one color to another. Here's an example of how to do this using the `gist_rainbow` colormap:
```
import matplotlib.pyplot as plt

# Define the color sequence
c = plt.get_cmap('gist_rainbow')
colors = [c(i) for i in range(10)]

# Plot the data using the defined colors
plt.scatter(x, y, c=colors)
```
In this example, we first import the `matplotlib.pyplot` module and define the color sequence using the `gist_rainbow` colormap. We then use the `c` function to create a sequence of colors that gradually transition from one color to another. Finally, we plot the data using the defined colors.

You can also use other colormaps available in matplotlib, such as `viridis`, `plasma`, `inferno`, etc. to create different types of color sequences.

You can also use the `icolor` function to create a sequence of colors based on a specific palette. For example:
```
import matplotlib.pyplot as plt

# De

"\n\nYou can use matplotlib's color tables to create a sequence of colors that gradually transition from one color to another. Here's an example of how to do this using the `gist_rainbow` colormap:\n```\nimport matplotlib.pyplot as plt\n\n# Define the color sequence\nc = plt.get_cmap('gist_rainbow')\ncolors = [c(i) for i in range(10)]\n\n# Plot the data using the defined colors\nplt.scatter(x, y, c=colors)\n```\nIn this example, we first import the `matplotlib.pyplot` module and define the color sequence using the `gist_rainbow` colormap. We then use the `c` function to create a sequence of colors that gradually transition from one color to another. Finally, we plot the data using the defined colors.\n\nYou can also use other colormaps available in matplotlib, such as `viridis`, `plasma`, `inferno`, etc. to create different types of color sequences.\n\nYou can also use the `icolor` function to create a sequence of colors based on a specific palette. For example:\n```\nimport matplotlib