In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/config.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model-00002-of-00002.bin
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer_config.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model.bin.index.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/pytorch_model-00001-of-00002.bin
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/special_tokens_map.json
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/.gitattributes
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/tokenizer.model
/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1/generation_config.json
/kaggle/input/news-cats/cnn_news4cats.csv


# **Project: Retrieval-Augmented Generation with Mistral and Vector-Based News Analysis**


## **Project Overview**

This project is an automated system that analyzes news headlines and generates analytical reports. Given a query, it retrieves the most relevant headlines, summarizes their key points, and extracts insights for analysts and researchers.

---

## **Data**

- **Data Source:** A CSV file of CNN news headlines, each labeled with a category (`cnn_news4cats.csv`).
- **Data Structure:**
  - `titles`: News headlines.
  - `category`: Category of the news (e.g., politics, sports, health, weather).
  - `combined`: The category and title joined into one string (`category - title`), used as the text to embed.
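
As a small illustration of the `combined` field, the sketch below builds it with the standard library only. The two sample rows reuse headlines that appear later in this notebook; the helper name `build_combined` is hypothetical and not part of the project code.

```python
import csv
import io

# Two sample rows shaped like the dataset (columns: titles, category).
SAMPLE_CSV = """titles,category
Josef Newgarden wins thrilling Indy 500,sports
How would Democrats replace Biden?,politics
"""


def build_combined(rows):
    """Prefix each headline with its category, mirroring the `combined` column."""
    return [f"{row['category']} - {row['titles']}" for row in rows]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
combined = build_combined(rows)
print(combined[0])  # sports - Josef Newgarden wins thrilling Indy 500
```

Prefixing the category gives the embedding model an explicit topical cue in addition to the headline text.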

---

## **Technologies and Tools**

1. **Programming Language:**
   - Python

2. **Libraries and Frameworks:**
   - **Pandas:** For data manipulation and preprocessing.
   - **LangChain:** For building an analytical pipeline with language models.
   - **Transformers (Hugging Face):** For text generation using models like Mistral.
   - **FAISS:** For efficient vector-based document retrieval.
   - **SentenceTransformers:** For creating dense vector embeddings of text.

3. **Models:**
   - **Mistral (7B):** A powerful large language model for text generation and summarization.
   - **SentenceTransformer (e.g., all-MiniLM-L6-v2):** For generating vector embeddings used in similarity search.

4. **Infrastructure:**
   - **Kaggle Environment:** For model execution and testing.
   - **GPU Acceleration:** Used for Mistral inference; the 4-bit `bitsandbytes` quantization applied in this notebook generally expects a CUDA GPU.
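
To give a rough sense of why the notebook loads Mistral in 4-bit precision, the sketch below estimates the memory needed for the weights alone at 16-bit and 4-bit precision. The parameter count (about 7.2 billion) is an assumption, and the estimate ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumption: roughly 7.2 billion parameters for a "7B" model.
N_PARAMS = 7.2e9

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which is what makes a single Kaggle GPU workable.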

---

## **Workflow**

1. **Data Loading and Preprocessing:**
   - Load the dataset of news headlines and categories.
   - Combine categories with titles to form a unified textual feature.

2. **Embedding Creation:**
   - Generate vector embeddings for the combined text using SentenceTransformer.
   - Store these embeddings in a FAISS index for efficient similarity search.

3. **Query Handling:**
   - A user query is processed to find the top relevant news articles using FAISS.
   - Two retrieval strategies are supported:
     - **Similarity Search:** Finds articles most similar to the query.
     - **MMR (Maximal Marginal Relevance):** Balances relevance and diversity.

4. **Text Generation:**
   - Summarize the top articles using the Mistral model.
   - Extract three distinct insights based on the summary.

5. **Report Generation:**
   - Generate a structured report containing:
     - Top relevant news articles.
     - A concise summary of the articles.
     - Key insights or conclusions derived from the data.
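
The difference between plain similarity search and MMR can be sketched in a few lines of pure Python. This is an illustrative toy version of the usual MMR scoring rule, not the code that LangChain or FAISS runs; the 2-D vectors and the `lambda_mult` values are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return indices of k documents chosen by Maximal Marginal Relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy embeddings: documents 0 and 1 are near-duplicates, document 2 differs.
query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]

balanced = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5)
relevance_only = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(balanced)        # [0, 2]
print(relevance_only)  # [0, 1]
```

With `lambda_mult=1.0` the rule reduces to pure relevance ranking and returns both near-duplicates; with `0.5` the second pick is the less similar but non-redundant document.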

---

## **Key Features**

- **Dynamic Query Handling:** The system adapts to user queries to retrieve and analyze the most relevant information.
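- **Multi-Agent Structure:** Uses two distinct prompts in sequence: one for summarization, then one that derives insights from the generated summary.
- **Fine-Tuned Parameters:** Hand-set generation parameters (e.g., `temperature`, `repetition_penalty`) chosen for coherent and relevant responses; the model weights themselves are not fine-tuned.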
- **Scalability:** Efficiently handles large datasets using FAISS and vector embeddings.

---

## **Use Cases**

- **Media Monitoring:** Quickly analyze the latest news and extract insights.
- **Policy Analysis:** Summarize political events and derive actionable insights.
- **Market Research:** Analyze economic or financial news for trends and predictions.

---

## **Conclusion**

This project demonstrates a pipeline for automated text analysis and report generation. It combines a quantized Mistral 7B model with FAISS-based retrieval to turn a collection of news headlines into a short report: the most relevant headlines, a summary, and a set of insights.


In [2]:
# Installing the libraries needed for quantized inference and retrieval
!pip install -U bitsandbytes  # Install bitsandbytes for 4-bit and 8-bit quantization, optimizing model memory and speed.
!pip install -U langchain-community  # Install community-contributed modules for LangChain, enhancing retrieval and processing.
!pip install langchain faiss-cpu transformers sentence-transformers  # Install LangChain for LLM pipelines, FAISS for vector search, Transformers for model management, and SentenceTransformers for embedding generation.


Collecting bitsandbytes
  Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl.metadata (3.5 kB)
Downloading bitsandbytes-0.44.1-py3-none-manylinux_2_24_x86_64.whl (122.4 MB)
Installing collected packages: bitsandbytes
Successfully installed bitsandbytes-0.44.1
Collecting langchain-community
  Downloading langchain_community-0.3.7-py3-none-any.whl.metadata (2.9 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.7 (from langchain-community)
  Downloading langchain-0.3.7-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.17 (from langchain-community)
  Downloading langchain_core-0.3.18-py3-none-any.whl.metadata (6.3 kB)
Collecting langsmith<0.2.0,>=0.1.125 (from langchain-community)
  Downloading langs

In [3]:
import pandas as pd  # For data manipulation and loading datasets.
import faiss  # For efficient similarity search and indexing of vector embeddings.
# The integrations below are imported from langchain_community; the older langchain.* paths are deprecated.
from langchain_community.vectorstores import FAISS  # FAISS integration with LangChain for vector-based retrieval.
from sentence_transformers import SentenceTransformer  # To generate dense embeddings for textual data.
from langchain_community.embeddings import HuggingFaceEmbeddings  # Wrapper for HuggingFace models to create embeddings.

from langchain_community.llms import HuggingFacePipeline  # Allows integration of Hugging Face models as LLM pipelines.
from langchain.chains import RetrievalQA  # Chain for question-answering using retrieval-augmented generation (RAG).
from langchain.prompts import PromptTemplate  # Utility for creating structured prompts for LLMs.

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, BitsAndBytesConfig  
# AutoTokenizer: Tokenizes text for input to models.
# AutoModelForCausalLM: Loads pre-trained causal language models.
# pipeline: Simplified wrapper to streamline tasks like text generation or question answering.
# BitsAndBytesConfig: Configures model quantization for efficient memory use.

from langchain.schema import Document  # Schema for defining document structure in LangChain.


In [4]:
# Loading the dataset
df = pd.read_csv('/kaggle/input/news-cats/cnn_news4cats.csv')  # Load news titles and categories.

# Combining category and title into a single text feature
df['combined'] = df['category'] + " - " + df['titles']

# Converting the combined texts into a list
combined_texts = df['combined'].tolist()

# Creating LangChain Document objects for each combined text
documents = [Document(page_content=text) for text in combined_texts]

# Initializing HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

# Creating a FAISS index for efficient similarity search
vectorstore = FAISS.from_documents(documents, embedding_model)

# Configuring the retriever to fetch top 5 most similar documents
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={'k': 5})
df.sample()


Unnamed: 0,titles,category,combined
4979,Masters 2023: Koepka sets the pace in Augusta,sports,sports - Masters 2023: Koepka sets the pace in...
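
Conceptually, the retriever configured above embeds the query and returns the `k` stored vectors closest to it. The sketch below shows that idea as an exhaustive search by squared L2 distance in pure Python. It is only an illustration: the 3-D vectors are made-up stand-ins for real embeddings, and `top_k_nearest` is a hypothetical helper, not a FAISS or LangChain function.

```python
import heapq


def l2_squared(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_nearest(query_vec, doc_vecs, k=2):
    """Exhaustive nearest-neighbour search: (index, distance) pairs, closest first."""
    scored = ((i, l2_squared(query_vec, vec)) for i, vec in enumerate(doc_vecs))
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


texts = [
    "sports - Josef Newgarden wins thrilling Indy 500",
    "politics - How would Democrats replace Biden?",
    "politics - Democratic senator: Biden needs to do more",
]
# Made-up 3-D vectors standing in for real sentence embeddings.
toy_vectors = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]
toy_query = [0.0, 1.0, 0.2]  # pretend embedding of a politics question

hits = top_k_nearest(toy_query, toy_vectors, k=2)
for index, distance in hits:
    print(f"{distance:.2f}  {texts[index]}")
```

A real index does the same ranking over hundreds of dimensions and thousands of headlines, which is where FAISS earns its keep.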


In [5]:
# Configuring BitsAndBytes for efficient model usage
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,  # Load model with 4-bit precision for reduced memory usage
    bnb_4bit_quant_type="nf4",  # Use the 4-bit NormalFloat (NF4) data type for the quantized weights
    bnb_4bit_compute_dtype="float16",  # Run computations in 16-bit floating point
    bnb_4bit_use_double_quant=False  # Do not quantize the quantization constants a second time
)

# Loading the Mistral model's tokenizer
tokenizer = AutoTokenizer.from_pretrained('/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1')

# Loading the Mistral model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    '/kaggle/input/mistral/pytorch/7b-instruct-v0.1-hf/1',
    device_map='auto',  # Automatically map model across available GPUs
    quantization_config=bnb_config  # Apply BitsAndBytes configuration for optimization
)

# Creating a text-generation pipeline with the Mistral model
mistral_pipeline = pipeline(
    "text-generation",  # Specify the task type as text generation
    model=model, 
    tokenizer=tokenizer, 
    max_new_tokens=512,  # Cap the number of generated tokens (max_length would also count the prompt)
    repetition_penalty=1.2,  # Penalize repetitive text in outputs
    temperature=0.1,  # Control randomness; lower values produce more deterministic outputs
    top_k=40,  # Limit sampling to top 40 tokens by probability
    top_p=0.9,  # Apply nucleus sampling to include tokens with cumulative probability <= 0.9
    pad_token_id=tokenizer.eos_token_id,  # Use EOS token for padding
    return_full_text=False,  # Only return generated text without input context
    do_sample=True  # Enable sampling to introduce variability in output
)

# Integrating the pipeline with LangChain
llm = HuggingFacePipeline(pipeline=mistral_pipeline)  # Use the pipeline in LangChain for advanced workflows
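
To make the roles of `temperature`, `top_k`, and `top_p` more concrete, here is a simplified pure-Python sketch of how such filters narrow a next-token distribution. It is an approximation for intuition only, not the implementation used by `transformers`, and the four logits are invented for the example.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Apply temperature, then top-k, then top-p; return renormalized probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract the max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens by probability and keep at most top_k of them.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    mass = sum(probs[i] for i in order)

    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, 0.0]  # invented scores for four candidate tokens

cold = filter_distribution(toy_logits, temperature=0.1, top_k=40, top_p=0.9)
warm = filter_distribution(toy_logits, temperature=1.0, top_k=40, top_p=0.9)
print(sorted(cold))  # [0]
print(sorted(warm))  # [0, 1, 2]
```

At `temperature=0.1` nearly all probability mass sits on the top token, so sampling behaves almost deterministically even though `do_sample=True`.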



In [6]:
def get_relevant_summary_and_insights(query, retriever, llm, top_k=5):
    """
    Universal Agent:
    1. Retrieves `top_k` most relevant news articles based on the query.
    2. Generates a concise summary of the retrieved articles.
    3. Extracts three unique insights without repetition.

    Parameters:
    - query: Analyst's query.
    - retriever: Document retriever object.
    - llm: Large language model for text generation.
    - top_k: Number of top news articles to analyze.
    
    Returns:
    - A formatted report string with the `top_k` retrieved headlines, a concise summary, and three insights, or a fallback message when nothing is retrieved.
    """
    
    retriever.search_kwargs["k"] = top_k
    docs = retriever.invoke(query)  # Fetch the most relevant documents based on the query.
    
    if docs:
        top_news = "\n".join([f"- {doc.page_content}" for doc in docs])  # Format the retrieved articles.
        
        # Prompt for generating a concise summary
        summary_prompt = f"""
TASK: Generate a summary based on the articles provided below. Do not include the query or repeat any information unnecessarily. Follow the structure strictly.

### Relevant News Articles:
{top_news}

### Instructions:
- Summarize the key points from these articles.
- Write the summary in one concise paragraph.
- Exclude the question and list of articles from the summary.

### Summary:
"""
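        # HuggingFacePipeline forwards per-call generation settings only via pipeline_kwargs; do_sample=False gives greedy decoding.
        summary = llm.invoke(summary_prompt, pipeline_kwargs={"do_sample": False, "max_new_tokens": 200})  # Generate the summary.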

        # Prompt for generating unique insights
        insights_prompt = f"""
TASK: Generate insights based on the summary provided below. Avoid redundancy and focus on unique conclusions.

### Summary:
{summary.strip()}

### Instructions:
- Provide three distinct insights or conclusions.
- Each insight should be brief and focus on different aspects of the summary.
- Exclude any repetitive or unnecessary information.

### Insights:
"""
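        # Per-call settings go through pipeline_kwargs; do_sample=False gives greedy decoding.
        insights = llm.invoke(insights_prompt, pipeline_kwargs={"do_sample": False, "repetition_penalty": 1.2, "max_new_tokens": 100})  # Generate insights.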
        
        return f"""
===== GENERATED REPORT =====

### Top {top_k} News:

{top_news}

### Summary:

{summary.strip()}

### Insights:

{insights.strip()}

============================
"""
    else:
        return "No relevant news articles found for your query."
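
The two-step chaining inside the function (summary first, then insights derived from that summary) can be exercised without loading the model. The sketch below uses a hypothetical stand-in class, `CannedLLM`, that returns fixed replies and records the prompts it receives; the prompts are shortened versions of the ones above, and the reply strings are placeholders.

```python
class CannedLLM:
    """Hypothetical stand-in for the real model: returns fixed replies in order."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.prompts = []

    def invoke(self, prompt, **kwargs):
        self.prompts.append(prompt)
        return self.replies[len(self.prompts) - 1]


def summarize_then_extract(headlines, llm):
    """Run the summary prompt, then feed the summary into the insights prompt."""
    top_news = "\n".join(f"- {h}" for h in headlines)
    summary = llm.invoke(
        f"### Relevant News Articles:\n{top_news}\n\n### Summary:\n"
    ).strip()
    insights = llm.invoke(f"### Summary:\n{summary}\n\n### Insights:\n").strip()
    return summary, insights


fake_llm = CannedLLM(["  A placeholder summary.  ", "1. A placeholder insight."])
summary, insights = summarize_then_extract(
    ["politics - How would Democrats replace Biden?"], fake_llm
)
print(fake_llm.prompts[1])
```

Printing the second prompt shows that the stripped summary, not the original headlines, is what the insights step sees, so any error in the summary carries over into the insights.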


In [7]:
df.sample(6)

Unnamed: 0,titles,category,combined
6666,"'If you feel lost and you don't know why, you'...",health,health - 'If you feel lost and you don't know ...
1220,Putin's critics have been silenced but the ele...,politics,politics - Putin's critics have been silenced ...
2342,Golfer plays tournament with his two dogs – an...,sports,sports - Golfer plays tournament with his two ...
2678,5 things to know for June 10: European electio...,politics,politics - 5 things to know for June 10: Europ...
7043,Josef Newgarden wins thrilling Indy 500,sports,sports - Josef Newgarden wins thrilling Indy 500
5933,Tropical Storm Calvin threatens to bring heavy...,weather,weather - Tropical Storm Calvin threatens to b...


In [8]:
# Create a retriever for document search
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 10})
# search_type="similarity" returns the k nearest documents; switch to "mmr" (Maximal Marginal Relevance) to balance relevance and diversity.
# 'k': Number of documents to retrieve.

# Example query from an analyst
query = "What is the latest situation with Biden?"  
# The query will fetch relevant news articles about Biden, as an example.

# Generate a summary and insights based on the retrieved documents
overview = get_relevant_summary_and_insights(query, retriever, llm, top_k=10)  
# Pass the query, retriever, and language model (llm) to the function for generating a detailed report.

print("Generated Report:", overview)  
# Display the generated report including the top 10 news articles, summary, and insights.


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


Generated Report: 
===== GENERATED REPORT =====

### Top 10 News:

- politics - Biden is trying to salvage his campaign. Democrats say things are only getting worse
- politics - How would Democrats replace Biden?
- politics - Fareed's take: For Democrats, there's opportunity in the Biden crisis
- politics - Biden just made the hardest decision any politician can make
- politics - What happens in the Democratic nomination now that Biden has left the race
- politics - Costas: Biden must step aside, no longer 'a compelling alternative' to Trump
- politics - 'A scramble': How Biden's team is trying to salvage the president's campaign schedule
- politics - Democratic senator: Biden needs to do more
- politics - Why it will be tough for Biden to defeat Trump
- politics - The last time Biden and Trump debated

### Summary:

Joe Biden is facing criticism from both within and outside his party as he tries to salvage his presidential campaign. Some Democrats believe that things are only getting 