<a href="https://www.kaggle.com/code/cheranratnam/final-fine-tuned-few-shot?scriptVersionId=164628138" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Competition Overview

**The goal of this competition is to create notebooks that demonstrate how to use the Gemma LLM to accomplish one or more of the following developer-oriented tasks**:
- Answer common questions about the Kaggle platform. 

I have focused on this task:
**- Explain or teach basic data science concepts.**


- Summarize Kaggle Solution write ups.
- Explain or teach concepts from Kaggle Solution write ups.
- Answer common questions about the Python programming language.


Submissions to this competition take the form of Kaggle Notebooks. Your notebook should demonstrate how to use the Gemma model to complete the task that you have selected. 

**How can Gemma be used to assist Kaggle developers? Show us your ideas today!**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

/kaggle/input/data-assistants-with-gemma/submission_categories.txt
/kaggle/input/data-assistants-with-gemma/submission_instructions.txt
/kaggle/input/gemma/transformers/2b/2/model.safetensors.index.json
/kaggle/input/gemma/transformers/2b/2/gemma-2b.gguf
/kaggle/input/gemma/transformers/2b/2/config.json
/kaggle/input/gemma/transformers/2b/2/model-00001-of-00002.safetensors
/kaggle/input/gemma/transformers/2b/2/model-00002-of-00002.safetensors
/kaggle/input/gemma/transformers/2b/2/tokenizer.json
/kaggle/input/gemma/transformers/2b/2/tokenizer_config.json
/kaggle/input/gemma/transformers/2b/2/special_tokens_map.json
/kaggle/input/gemma/transformers/2b/2/.gitattributes
/kaggle/input/gemma/transformers/2b/2/tokenizer.model
/kaggle/input/gemma/transformers/2b/2/generation_config.json
/kaggle/input/gemma/transformers/2b-it/1/model.safetensors.index.json
/kaggle/input/gemma/transformers/2b-it/1/gemma-2b-it.gguf
/kaggle/input/gemma/transformers/2b-it/1/config.json
/kaggle/input/gemma/transform

In [2]:
#Use PIP to install or upgrade Hugging Face transformers library from GitHub
!pip install -U git+https://github.com/huggingface/transformers  accelerate bitsandbytes 

#installing other packages as needed

!pip install transformers
!pip install bitsandbytes accelerate
!pip install qdrant_client
!pip install sentence_transformers

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-kxnnt72h
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-kxnnt72h
  Resolved https://github.com/huggingface/transformers to commit bd5b9863060c31f60d66b6aec88b9743d3dcd8f4
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
Collecting bitsandbytes
  Downloading bitsandbytes-0.42.0-py3-none-any.whl.metadata (9.9 kB)
Downloading bitsandbytes-0.42.0-py3-none-any.whl (105.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m105.0/105.0 MB[0m [31m13.9 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hBuilding wheels for collected packages: transformers
  Building wheel for transformers (pyproject.toml) ... [?25ldone
[?25h  Created wheel for transformers: f

In [3]:
#These imports allow using AutoTokenizer to tokenize text, load pre-trained models for causal modeling
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

from qdrant_client import models, QdrantClient
from sentence_transformers import SentenceTransformer
from fuzzywuzzy import fuzz

**Quick Note on the imports above**

Hugging Face Transformers (transformers.AutoTokenizer and transformers.AutoModelForCausalLM):

Description: These components facilitate the use of pre-trained language models for natural language processing tasks. AutoTokenizer loads tokenizers, while AutoModelForCausalLM loads models for tasks like text generation by predicting the next word in a sequence.
BitsAndBytesConfig:

Description: A configuration class in the Hugging Face Transformers library, BitsAndBytesConfig is specifically tailored for configuring models in tasks related to bits and bytes processing.
PyTorch (torch module):

Description: PyTorch's torch module is a fundamental part of the PyTorch library, providing tools for building and training deep neural networks. It is widely used in various machine learning applications, including natural language processing and computer vision.
Qdrant (qdrant_client.models and QdrantClient):

Description: Components of the Qdrant library, qdrant_client.models includes structures for Qdrant operations, and QdrantClient is a client for efficient interaction with Qdrant. Qdrant is designed for storing and retrieving vector embeddings, commonly used in similarity search and recommendation systems.
SentenceTransformer:

Description: SentenceTransformer is a Python library designed for creating sentence embeddings using pre-trained transformer models. These embeddings are valuable for tasks such as measuring semantic similarity between sentences and clustering.
Fuzzywuzzy (fuzzywuzzy.fuzz):

Description: The fuzz.ratio function from the fuzzywuzzy library is used for fuzzy string matching. It calculates the similarity ratio between two strings, making it useful for tasks where approximate string matching or similarity scores are needed, such as finding similar text.


In [4]:
#Loading the model in 4bit - This is so we can use our processing power more efficiently
quantization_config = BitsAndBytesConfig(load_in_8bit=False) 

In [5]:
#Define model and data type - We will use this model throughout our approach
model_id = "/kaggle/input/gemma/transformers/2b-it/1" 
dtype = torch.float16

In [6]:
#Set up a language model and tokenizer using a pre-trained model with automatic device mapping, data type, and configurations
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=dtype,
    quantization_config = quantization_config
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [7]:
# Test the base model - Note the output the base model generates to compare after we make some fine tuning
input_text = "What is data science?"
input_ids = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**input_ids, max_new_tokens=250)
print(tokenizer.decode(outputs[0]))

2024-02-28 03:29:05.873827: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-28 03:29:05.873957: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-28 03:29:06.039251: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


<bos>What is data science?

Data science is the rapidly growing field that combines the power of data analysis, statistical modeling, and computer science to extract meaningful insights from complex and ever-growing datasets. It involves a wide range of tasks, from data wrangling and cleaning to data analysis, modeling, and visualization. Data scientists use their skills to uncover hidden patterns, identify trends, and make informed decisions based on data insights.

**Key components of data science include:**

* **Data wrangling and cleaning:** Gathering, transforming, and organizing data from various sources.
* **Data analysis and modeling:** Using statistical methods and machine learning algorithms to identify patterns and relationships.
* **Data visualization:** Creating clear and insightful visualizations to communicate data insights.
* **Data wrangling and cleaning:** Gathering, transforming, and organizing data from various sources.
* **Statistical modeling:** Using statistical 

# Now I will load a custom data set I created to improve results of the base model 

In [8]:
df = pd.read_csv('/kaggle/input/data-science-concepts-data-set-public-version/final_ds_data.csv')
df.head()

Unnamed: 0,text,label
0,Data science is an interdisciplinary academic ...,what is data science?
1,Top 3 data science concepts are: 1. Data Types...,top 3 data science concepts
2,\nThe term “Data Science” was created in the e...,history of data science
3,"Statistics, Visualization, Deep Learning, Mach...",basic data science concepts
4,\nStatistics is the most critical unit of Data...,what is statistics?


In [9]:
#just want to process and adjust the text - if needed (I had this issue in a previous version - always good to process)
df['text'] = df['text'].apply(lambda x: x.replace("\n", ""))

print(df)


                                                 text  \
0   Data science is an interdisciplinary academic ...   
1   Top 3 data science concepts are: 1. Data Types...   
2   The term “Data Science” was created in the ear...   
3   Statistics, Visualization, Deep Learning, Mach...   
4   Statistics is the most critical unit of Data S...   
5   Visualization technique helps you access huge ...   
6   Machine Learning explores the building and stu...   
7   Deep Learning method is new machine learning r...   
8   Data Engineering is the process of organizing,...   
9   A convolutional neural network (CNN) is a cate...   
10  The pooling layer of a CNN is a critical compo...   
11  The process starts by sliding a filter designe...   
12  In 1974, Peter Naur authored the Concise Surve...   
13  The term "data science" has been traced back t...   
14  Just as the name implies, data science is a br...   
15  Data wrangling is the process of converting da...   
16  Data Visualization is one o

In [10]:
df = df[df['text'].notna()] # remove any NaN values as it blows up serialization (there are no in this dataset but as a good practice)
data = df.to_dict('records') # creating a dictionary
len(data)

20

# Using pre-trained Sentence Transformer model to encode textual data into vectors 

**Then searches for similar vectors in an in-memory Qdrant collection. The goal is to retrieve documents that are similar to a given query (user prompt).**

**Here, a Sentence Transformer model named 'all-MiniLM-L6-v2' is loaded. It's a pre-trained model designed for converting sentences into numerical vectors.**

In [11]:
encoder = SentenceTransformer('all-MiniLM-L6-v2') # Model to create embeddings

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

  return self.fget.__get__(instance, owner)()


tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

**A QdrantClient is created, representing an in-memory instance of the Qdrant vector database.**

In [12]:
# create the vector database client
qdrant = QdrantClient(":memory:") # Create in-memory Qdrant instance

**This code initializes a Qdrant collection named 'ds_concepts' to store vectorized representations of data. The vectors have a size determined by the Sentence Transformer model, and cosine distance is used for similarity calculations.**

In [13]:
# Create collection of data Science Concepts
qdrant.recreate_collection(
    collection_name="ds_concepts",
    vectors_config=models.VectorParams(
        size=encoder.get_sentence_embedding_dimension(), # Vector size is defined by used model
        distance=models.Distance.COSINE
    )
)

True

**Data from the 'data' variable, which holds the data science concepts information is vectorized using the Sentence Transformer model and uploaded to the Qdrant collection 'ds_concepts'.**

In [14]:
# vectorize!
qdrant.upload_points(
    collection_name="ds_concepts",
    points=[
        models.PointStruct(
            id=idx,
            vector=encoder.encode(doc["text"]).tolist(),
            payload=doc,
        ) for idx, doc in enumerate(data)  # data is the variable holding all the data science concepts
    ]
)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [15]:
user_prompt = "Top 5 Data Science Concepts"

**A search is performed in the 'ds_concepts' collection to find similar vectors to the one generated from the user prompt "Top 5 Data Science Concepts". You can limit as needed, I selected 122 randomly.**

In [16]:
# Search time for data science concepts!

hits = qdrant.search(
    collection_name="ds_concepts",
    query_vector=encoder.encode(user_prompt).tolist(),
    limit=122
)
for hit in hits:
  print(hit.payload, "score:", hit.score)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

{'text': 'Top 5 data science concepts are: data set, data wrangling, data visualization, outlier, and data imputation', 'label': 'what are top 5 data science concepts?'} score: 0.8214098510272281
{'text': 'Statistics, Visualization, Deep Learning, Machine Learning are important Data Science concepts. Data Science Process goes through Discovery, Data Preparation, Model Planning, Model Building, Operationalize, Communicate Results.', 'label': 'basic data science concepts'} score: 0.5867200099924217
{'text': 'Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm

**The search results are printed, including the payload (original document information) and the similarity score.**

In [17]:
# define a variable to hold the search results
search_results = [hit.payload for hit in hits]

In [18]:
for hit in search_results[:5]:
    print(hit)


{'text': 'Top 5 data science concepts are: data set, data wrangling, data visualization, outlier, and data imputation', 'label': 'what are top 5 data science concepts?'}
{'text': 'Statistics, Visualization, Deep Learning, Machine Learning are important Data Science concepts. Data Science Process goes through Discovery, Data Preparation, Model Planning, Model Building, Operationalize, Communicate Results.', 'label': 'basic data science concepts'}
{'text': 'Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a

**Next, We will generate a response from an AI assistant based on user input. The generate_response function takes the user input, search results (retrieved from the Qdrant search), and other model-related parameters to decide whether to provide an answer directly from the retrieved results or generate a new response using a language model.**

# Key code elements to note in the below approach

* Retrieval-based Response:The code uses the fuzz.ratio function from the fuzzywuzzy library to find the best match in the search results based on the similarity between the user input and the 'label' field in each search result. If the similarity is above a certain threshold (here, 90%), it directly selects the best match.
* Generation-based Response:If no high similarity match is found in the search results, the user input is tokenized and used as input to a language model for response generation. The generated text is then used as the response.
* Example Usage in Interactive Loop:This sets up an interactive loop where the user can input queries, and the AI assistant responds based on either the retrieved results or generated content from the language model.

**In summary, this code combines retrieval and generation techniques to provide responses. If a highly similar result is found in the search results, it directly uses that information. Otherwise, it generates a response using a language model, taking into account the user input and the retrieved search results.
**

# Using the dataset I uploaded and the base model to provide more comprehensive answers 

* Q1. What is data? if the fuzzy ratio was at 80% this would bring an answer from my dataset - What is data science
* Q2. What is data science? This answer is directly from my data set. You can see what the base model provided for this vs what this model is providing
* Q3. Give me some data science concepts ... I have several questions about concepts but since I have my ratio at 90% it pulls from the base model
* Q4. what are top data science concepts? This is one of the answers from my dataset :) Please note I am doing this just to make it clear how the model retrieves specific information and these are just for the sake of testing ... 
* Q5. Who is the father of Data science? Again, just for Kicks :) 
* Q6. Who invented data science?


# Now we will go into more data science concepts

* Q7. What is a CNN? It selects from my dataset
* Q8. What are CNNs used for? This answer comes from the base model
* Q9. What is a pooling layer? I added this answer to see how the model goes back and forth :)
* Q10. What is data engineering? This picks from my dataset as well. 

##It is important to note that your fuzzy ratio comes into play when it comes to when the model will choose your answer vs the base models'.

In a business scenario, one could have a lower fuzzy ratio to perhaps retrieve organizational specific information over more generic information retrieved from the base model.





In [19]:
from fuzzywuzzy import fuzz

def generate_response(user_input, search_results, tokenizer, model, max_length=250):
    # Find the best matching 'label_column'
    best_match = max(search_results, key=lambda hit: fuzz.ratio(user_input.lower(), hit['label'].lower()))

    # Check if the best match has a high enough similarity
    if fuzz.ratio(user_input.lower(), best_match['label'].lower()) >= 90:
        answer = f"{best_match['text']} (related concept: {best_match['label']})."
    else:
        # Tokenize the user input for model-generated response
        input_ids = tokenizer.encode(user_input, return_tensors="pt", max_length=max_length, truncation=True)
        input_ids = input_ids.to(model.device)  # Ensure input_ids are on the same device as the model
    
        # Generate response with max_new_tokens
        outputs = model.generate(input_ids, max_length=max_length, max_new_tokens=200, num_return_sequences=1)
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        answer = generated_text

    return answer

# Example usage in an interactive loop
while True:
    user_input = input("You: ")
    
    # Exit the loop if the user types 'exit'
    if user_input.lower() == 'exit':
        print("AI Assistant: Goodbye! Have a great day.")
        break

    # Use the AI assistant to generate a response
    response = generate_response(user_input, search_results, tokenizer, model)
    
    # Print the assistant's response without including search_results
    print("AI Assistant:", response)


You:  What is data?


Both `max_new_tokens` (=200) and `max_length`(=250) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


AI Assistant: What is data?

Data is a collection of facts, figures, and other information that is used to make decisions. It can be numerical, textual, or graphical. Data is often used to track trends, identify patterns, and make predictions.

Here are some examples of data:

* Sales figures
* Customer names
* Employee salaries
* Weather conditions
* Market trends

What is the difference between data and information?

While data and information are related, they are not the same thing. Data is a collection of facts, figures, and other information, while information is a synthesis of data that is organized and meaningful.

Here is a simple analogy:

* Data is like the raw materials that are used to build a house.
* Information is like the finished product that is built from the raw materials.

Data is a fundamental part of any decision-making process. By understanding data, you can make more informed decisions and achieve your goals more effectively.


You:  What is data science?


AI Assistant: Data science is an interdisciplinary academic field that uses statistics, scientific computing, scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from potentially noisy, structured, or unstructured data.Data science also integrates domain knowledge from the underlying application domain (e.g., natural sciences, information technology, and medicine). Data science is multifaceted and can be described as a science, a research paradigm, a research method, a discipline, a workflow, and a profession.Data science is a "concept to unify statistics, data analysis, informatics, and their related methods" to "understand and analyze actual phenomena" with data. It uses techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, information science, and domain knowledge.  However, data science is different from computer science and information science. Turing Award winner Jim Gray 

You:  Give me some data science concepts


Both `max_new_tokens` (=200) and `max_length`(=250) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


AI Assistant: Give me some data science concepts and techniques that I can use to analyze and visualize data in a healthcare setting?

**Data Science Concepts and Techniques for Healthcare Data Analysis and Visualization**

**1. Data wrangling and cleaning:**
- Data integration from various sources (e.g., EHRs, medical devices, lab reports)
- Handling missing values and outliers
- Data transformation and normalization

**2. Data exploration and analysis:**
- Descriptive statistics (e.g., mean, median, standard deviation)
- Inferential statistics (e.g., hypothesis testing, regression analysis)
- Visualization techniques (e.g., scatter plots, boxplots, heatmaps)

**3. Data mining and pattern discovery:**
- Identifying patterns and relationships in data
- Clustering, classification, and regression algorithms
- Association rule mining

**4. Data visualization:**
- Creating clear and informative visualizations that communicate insights
- Choosing appropriate visualization techniques (e.g., 

You:  what are top data science concepts?


AI Assistant: Top 5 data science concepts are: data set, data wrangling, data visualization, outlier, and data imputation (related concept: what are top 5 data science concepts?).


You:  Who is the father of Data science?


AI Assistant: The term "data science" has been traced back to 1974, when Peter Naur proposed it as an alternative name to computer science (related concept: Who is the father of Data Science?).


You:  Who invented data science?


AI Assistant: In 1974, Peter Naur authored the Concise Survey of Computer Methods, using the term “Data Science,” repeatedly. Naur presented his own convoluted definition of the new concept: “The usefulness of data and data processes derives from their application in building and handling models of reality.” (related concept: who invented data science?).


You:  What is a CNN?


AI Assistant: A convolutional neural network (CNN) is a category of machine learning model, namely a type of deep learning algorithm well suited to analyzing visual data. CNNs -- sometimes referred to as convnets -- use principles from linear algebra, particularly convolution operations, to extract features and identify patterns within images. Although CNNs are predominantly used to process images, they can also be adapted to work with audio and other signal data. CNNs use a series of layers, each of which detects different features of an input image. Depending on the complexity of its intended purpose, a CNN can contain dozens, hundreds or even thousands of layers, each building on the outputs of previous layers to recognize detailed patterns. (related concept: What is a CNN?).


You:  What are CNNs used for?


Both `max_new_tokens` (=200) and `max_length`(=250) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


AI Assistant: What are CNNs used for?

Convolutional neural networks (CNNs) are a type of artificial neural network (ANN) used for image recognition. They are widely used in various applications, including:

* **Medical diagnosis:** CNNs can be used to analyze medical images, such as X-rays, CT scans, and MRI scans, to detect diseases and abnormalities.
* **Security and surveillance:** CNNs can be used to identify and track objects or individuals in security footage, detect suspicious activities, and monitor security systems.
* **Object recognition:** CNNs can be used to identify and classify objects in images, such as faces, animals, and objects in manufacturing.
* **Natural language processing (NLP):** CNNs can be used to process and understand natural language text and images, such as sentiment analysis, text classification, and image captioning.
* **Advertising and marketing:** CNNs can be used to analyze customer data and preferences to create targeted advertising campaigns.

CNNs

You:  What is a pooling layer?


AI Assistant: The pooling layer of a CNN is a critical component that follows the convolutional layer. Similar to the convolutional layer, the pooling layer's operations involve a sweeping process across the input image, but its function is otherwise different. The pooling layer aims to reduce the dimensionality of the input data while retaining critical information, thus improving the network's overall efficiency. This is typically achieved through downsampling: decreasing the number of data points in the input. For CNNs, this typically means reducing the number of pixels used to represent the image. The most common form of pooling is max pooling, which retains the maximum value within a certain window (i.e., the kernel size) while discarding other values. Another common technique, known as average pooling, takes a similar approach but uses the average value instead of the maximum. (related concept: What is a pooling layer?).


You:  What is data engineering? 


AI Assistant: Data Engineering is the process of organizing, managing, and analyzing large amounts of data. It's a key component in the world of data science, but it can be used by anyone who has to deal with big data regularly. Data engineering is about collecting, storing, and processing data. (related concept: what is data engineering?).


You:  exit


AI Assistant: Goodbye! Have a great day.


# Incorporating few-shot prompts to the above model

* Few-shot prompts are used to provide context or guidance to the language model when generating responses. The idea behind few-shot learning is to train a model on a small set of examples (shots) for a particular task or prompt. In your code, the few-shot prompts are combined with the user's input to form a comprehensive prompt for the model.

**Here is how few shot prompting is used in this code:**

* Defining Few-Shot Prompts:

Define a list of few-shot prompts (few_shot_prompts) related to data science concepts. These prompts are examples of the kinds of questions or tasks you want the model to be able to handle.

* Combining Prompts with User Input:

When a user provides input, the code combines the user's input with the few-shot prompts. It creates a new prompt by joining the few-shot prompts and the user's input together.

* Tokenization and Model Input:

The combined prompt is then tokenized using the tokenizer provided by the Hugging Face Transformers library. This tokenized prompt is then converted into input tensors suitable for the model.

* Model Response Generation:

The tokenized prompt is fed into the language model using the "model.generate function". The model generates a response based on the combined input prompt.

* Decoding and Displaying Response:

The generated response is then decoded from token IDs to human-readable text using the tokenizer. This response is then printed as the output of your AI assistant.

By incorporating few-shot prompts, we are essentially guiding the model to understand the context of the user's input better. The prompts act as examples to influence the model's behavior, allowing it to perform specific tasks or answer questions related to data science concepts. The prompts help fine-tune the model's responses based on the specified prompts, making it more tailored to the desired domain of knowledge.




# Let's explore some few shot prompts


* Q1. "Can you explain the concept of regularization in the context of machine learning?"
* Q2. What is visualization in data science? 
* Q3. What is data visualization? This fetches my definition 

I have show an example of how the model would work with few shot prompting and can still pull an answer from my data set in these questions.

In [20]:
from fuzzywuzzy import fuzz

def generate_response(user_input, search_results, tokenizer, model, few_shot_prompts, max_length=550):
    # Find the best matching 'label_column'
    best_match = max(search_results, key=lambda hit: fuzz.ratio(user_input.lower(), hit['label'].lower()))

    # Check if the best match has a high enough similarity
    if fuzz.ratio(user_input.lower(), best_match['label'].lower()) >= 90:
        answer = f"{best_match['text']} (related concept: {best_match['label']})."
    else:
        # Combine few-shot prompts with user input
        prompts = "\n".join([f"{prompt}\n" for prompt in few_shot_prompts])
        full_prompt = f"{prompts}User: {user_input}"

        # Tokenize the combined prompt for model-generated response
        input_ids = tokenizer.encode(full_prompt, return_tensors="pt", max_length=max_length, truncation=True)
        input_ids = input_ids.to(model.device)  # Ensure input_ids are on the same device as the model
    
        # Generate response with max_new_tokens
        outputs = model.generate(input_ids, max_length=max_length, max_new_tokens=200, num_return_sequences=1)
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

        answer = generated_text

    return answer

# Example usage in an interactive loop
while True:
    user_input = input("You: ")
    
    # Exit the loop if the user types 'exit'
    if user_input.lower() == 'exit':
        print("AI Assistant: Goodbye! Have a great day.")
        break

    # Three few-shot prompts related to data science concepts
    few_shot_prompts = [
        "Explain the following data science concept:",
        "Describe the significance of the following term in data science:",
        "Provide an example use case for the following data science technique:"
    ]
    
    response = generate_response(user_input, search_results, tokenizer, model, few_shot_prompts)
    
    # Print the assistant's response without including search_results
    print("AI Assistant:", response)


You:  Can you explain the concept of regularization in the context of machine learning?


Both `max_new_tokens` (=200) and `max_length`(=550) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


AI Assistant: Explain the following data science concept:

Describe the significance of the following term in data science:

Provide an example use case for the following data science technique:
User: Can you explain the concept of regularization in the context of machine learning?

**Regularization** is a technique used in machine learning to reduce overfitting and improve the generalization performance of a model. It achieves this by adding a penalty term to the loss function that is proportional to the magnitude of the model's weights. This encourages the model to find a simpler solution that is less likely to overfit to the training data.

**Significance of regularization:**

* Reduces overfitting: Overfitting occurs when a model becomes too closely fit to the training data and fails to generalize well to new, unseen data. Regularization helps to prevent overfitting by forcing the model to find a more general solution.
* Improves generalization performance: By encouraging the model

You:  What is visualization in data science?


Both `max_new_tokens` (=200) and `max_length`(=550) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


AI Assistant: Explain the following data science concept:

Describe the significance of the following term in data science:

Provide an example use case for the following data science technique:
User: What is visualization in data science?

**Answer:**

**Term:** Visualization

**Significance:** Visualization is a powerful tool in data science that allows users to communicate complex data insights and patterns in a clear and compelling manner. It helps to identify trends, outliers, and relationships between different data variables, making it easier for stakeholders to understand and make informed decisions based on data.

**Example Use Case:**

Imagine a data scientist analyzing sales data for a retail company. By using visualization techniques, the data scientist can create charts and graphs that illustrate the following insights:

* **Sales trends over time:** The data shows that sales have been steadily increasing in recent months.
* **Seasonal variations:** There are significant f

You:  What is data visualization?


AI Assistant: Data Visualization is one of the most important branches of data science. It is one of the main tools used to analyze and study relationships between different variables. Data visualization (e.g., scatter plots, line graphs, bar plots, histograms, qqplots, smooth densities, boxplots, pair plots, heat maps, etc.) can be used for descriptive analytics. Data visualization is also used in machine learning for data preprocessing and analysis, feature selection, model building, model testing, and model evaluation. When preparing a data visualization, keep in mind that data visualization is more of an Art than Science. (related concept: What is data visualization?).


You:  exit


AI Assistant: Goodbye! Have a great day.


# SUMMARY

**This code defines a function, generate_response, for an AI assistant that combines retrieval and generative approaches. It first attempts to find a highly similar match in the provided search results based on user input and a labeled 'concept.' If a high similarity is found, it directly uses the matched information. Otherwise, it constructs a prompt by combining a set of predefined few-shot prompts with the user input and utilizes a language model to generate a response, seamlessly integrating user input with contextual prompts. The interactive loop allows users to engage with the AI assistant, prompting it with queries, and receiving responses that dynamically blend retrieved information and generated content.**

### **Sole Author and contributor to this work: Cheran Ratnam - cheran.jacob@gmail.com | 2149912389 **