# Exploring Twitter Data with Vector Databases and RAG Systems

This tutorial shows how to build a vector database and use a retrieval-augmented generation (RAG) system to explore Twitter data interactively.

See [LBSocial](https://www.lbsocial.net) for how to collect Twitter data.

## Set up a Database and API Keys

Create a [MongoDB](https://www.mongodb.com) cluster and store the connection string in a safe place, such as AWS Secrets Manager.
- key name: `connection_string`
- key value: <`the connection string`>, with your database password typed in
- secret name: `mongodb`


You also need to purchase an [OpenAI](https://openai.com/) API key and store it in AWS Secrets Manager:
- key name: `api_key`
- key value: <`your openai api key`>
- secret name: `openai`
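
As a rough sketch of what these two secrets look like once retrieved: Secrets Manager returns key/value secrets as a JSON string, which the `get_secret` function later in this tutorial parses with `json.loads`. The values below are placeholders, not real credentials, and the variable names are only for illustration.

```python
import json

# Placeholder values only; the real strings come from MongoDB and OpenAI.
mongodb_secret_string = json.dumps({"connection_string": "<the connection string>"})
openai_secret_string = json.dumps({"api_key": "<your openai api key>"})

# Parsing the JSON string gives a dictionary keyed by the key names listed above.
mongodb_secret = json.loads(mongodb_secret_string)
openai_secret = json.loads(openai_secret_string)

# Print the key names only; avoid printing real secret values in a notebook.
print(sorted(mongodb_secret))
print(sorted(openai_secret))
```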

## Install Python Libraries

- pymongo: manage the MongoDB database
- openai: create embeddings and responses

In [1]:
pip install pymongo openai -q

Note: you may need to restart the kernel to use updated packages.


## Secrets Manager Function

In [2]:
import boto3
from botocore.exceptions import ClientError
import json

def get_secret(secret_name):
    region_name = "us-east-1"

    # Create a Secrets Manager client
    session = boto3.session.Session()
    client = session.client(
        service_name='secretsmanager',
        region_name=region_name
    )

    try:
        get_secret_value_response = client.get_secret_value(
            SecretId=secret_name
        )
    except ClientError as e:
        raise e

    secret = get_secret_value_response['SecretString']
    
    return json.loads(secret)

## Import Python Libraries and Credentials

In [3]:
import pymongo
from pymongo import MongoClient
import json
import re
import os
from openai import OpenAI
openai_api_key  = get_secret('openai')['api_key']
client = OpenAI(api_key=openai_api_key)


mongodb_connect = get_secret('mongodb')['connection_string']


## Connect to the MongoDB cluster

In [4]:
mongo_client = MongoClient(mongodb_connect)
db = mongo_client.demo # use or create a database named demo
tweet_collection = db.tweet_collection #use or create a collection named tweet_collection

## Utility Functions

- the `clean_tweet` function removes URLs from tweets
- the `get_embedding` function uses OpenAI to create tweet embeddings
- the `vector_search` function returns relevant tweets based on a query

In [5]:
def clean_tweet(text):
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)
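
Removing a URL from the middle of a tweet leaves a double space behind. The following is a minimal sketch, not part of the original notebook, of a variant that also collapses the leftover whitespace. The name `clean_tweet_compact`, the simplified URL pattern, and the sample text are assumptions made for this illustration.

```python
import re

# Simplified pattern (an assumption for this sketch): "http://" or "https://"
# followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def clean_tweet_compact(text):
    """Remove URLs, then collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    # split() with no arguments splits on any whitespace, including newlines,
    # so line breaks inside the tweet are also replaced by single spaces.
    return " ".join(without_urls.split())

sample = "Threat alert https://t.co/RApGoh7GPZ from @FDD_CCTI"
cleaned = clean_tweet_compact(sample)
print(cleaned)
```

Whether collapsing line breaks is desirable depends on the data; multi-line tweets lose their layout with this approach.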

In [6]:
embedding_model= 'text-embedding-3-small'

def get_embedding(text):

    try:
        embedding = client.embeddings.create(input=text, model=embedding_model).data[0].embedding
        return embedding
    except Exception as e:
        print(f"Error in get_embedding: {e}")
        return None

In [7]:
def vector_search(query):

    query_embedding = get_embedding(query)
    if query_embedding is None:
        # Return an empty list so callers can always iterate over the result
        print("Invalid query or embedding generation failed.")
        return []
    # Define the vector search pipeline
    pipeline = [
        {
            "$vectorSearch": {
                "index": "tweet_vector",
                "queryVector": query_embedding,
                "path": "tweet.embedding",
                "numCandidates": 1000,  # Number of candidate matches to consider
                "limit": 10  # Return top 10 matches
            }
        },
        {
            "$project": {
                "_id": 0,  # Exclude the _id field
                "tweet.text": 1 # return tweet text

            }
        }
    ]

    results = tweet_collection.aggregate(pipeline)
    return list(results)
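
It can help to see how close each match is to the query. The sketch below builds the same kind of pipeline as `vector_search`, but with the limit and candidate count as parameters and a `score` field added through the `vectorSearchScore` metadata. The function name `build_vector_search_pipeline` and the three-number query vector are assumptions for illustration; real query vectors come from `get_embedding`.

```python
def build_vector_search_pipeline(query_embedding, limit=10, num_candidates=1000):
    """Build a $vectorSearch pipeline that also returns each match's score."""
    return [
        {
            "$vectorSearch": {
                "index": "tweet_vector",
                "queryVector": query_embedding,
                "path": "tweet.embedding",
                "numCandidates": num_candidates,  # candidates considered before ranking
                "limit": limit,  # number of matches returned
            }
        },
        {
            "$project": {
                "_id": 0,
                "tweet.text": 1,
                # Ask MongoDB to include the similarity score of each match
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]

# Toy query vector for illustration only
pipeline = build_vector_search_pipeline([0.1, 0.2, 0.3], limit=5, num_candidates=100)
print(pipeline[0]["$vectorSearch"]["limit"])
```

In the notebook, the returned list could be passed to `tweet_collection.aggregate(...)` in place of the hard-coded pipeline.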

## Tweets Embedding 

For more about text embeddings, please read [Introducing text and code embeddings](https://openai.com/index/introducing-text-and-code-embeddings/).

In [8]:
from tqdm.auto import tqdm
tweets = tweet_collection.find()

for tweet in tqdm(list(tweets)):
    try:
        tweet_embedding = get_embedding(clean_tweet(tweet['tweet']['text']))
        if tweet_embedding is None:
            # get_embedding already reported the error; skip so no null embedding is stored
            continue

        tweet_collection.update_one(
            {'tweet.id': tweet['tweet']['id']},
            {"$set": {'tweet.embedding': tweet_embedding}}
        )
    except Exception as e:
        print(f"error in embedding tweet {tweet.get('_id')}: {e}")


  0%|          | 0/187 [00:00<?, ?it/s]
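
The loop above sends one request per tweet. For larger collections, one option is to send several texts per request, since the embeddings endpoint also accepts a list of inputs. The sketch below shows only the batching logic: `chunked`, `embed_in_batches`, and `fake_embed_batch` are hypothetical names, and the fake function stands in for the real API call so the example runs without credentials.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def embed_in_batches(texts, embed_batch, batch_size=100):
    """Embed `texts` batch by batch, preserving the input order."""
    embeddings = []
    for batch in chunked(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

def fake_embed_batch(batch):
    # Stand-in for the API call: one tiny "vector" per text, based on its length.
    return [[float(len(text))] for text in batch]

texts = ["a", "bb", "ccc", "dddd", "eeeee"]
vectors = embed_in_batches(texts, fake_embed_batch, batch_size=2)
print(vectors)
```

In the notebook, `embed_batch` could wrap `client.embeddings.create(input=batch, model=embedding_model)` and return `[item.embedding for item in response.data]`. A suitable batch size depends on tweet length and account limits, so the default of 100 here is only a guess.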

## Create a Vector Index

For more about the MongoDB vector database, please read [What are Vector Databases?](https://www.mongodb.com/resources/basics/databases/vector-databases)

This code creates a vector index following the [official MongoDB documentation](https://www.mongodb.com/docs/atlas/atlas-vector-search/vector-search-type/#std-label-avs-types-vector-search).

In [9]:
# Create your index model, then create the search index

from pymongo.operations import SearchIndexModel
import time

search_index_model = SearchIndexModel(
    definition={
        "fields": [
            {
                "type": "vector",
                "path": "tweet.embedding",
                "numDimensions": 1536,  # output size of text-embedding-3-small
                "similarity": "cosine"
            }
        ]
    },
    name="tweet_vector",
    type="vectorSearch"
)
result = tweet_collection.create_search_index(model=search_index_model)
print("New search index named " + result + " is building.")

# Wait for initial sync to complete
print("Polling to check if the index is ready. This may take up to a minute.")
while True:
    indices = list(tweet_collection.list_search_indexes(result))
    # The index can be queried once its "queryable" flag is True
    if len(indices) and indices[0].get("queryable") is True:
        break
    time.sleep(5)

print(result + " is ready for querying.")

New search index named tweet_vector is building.
Polling to check if the index is ready. This may take up to a minute.
tweet_vector is ready for querying.
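
The index is configured with `cosine` similarity. As a small sketch of what that metric measures, the function below computes the cosine of the angle between two vectors in plain Python. The two-dimensional vectors are toy values chosen for illustration; real tweet embeddings have 1536 dimensions, and the search itself runs inside MongoDB, not in this code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score close to 1, regardless of their length.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(round(same_direction, 6), orthogonal)
```

Because only direction matters, two tweets on the same topic can score highly even when their lengths differ.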


In [14]:
user_query = 'combats'

for tweet in vector_search(user_query):
    print(tweet['tweet']['text'])

https://t.co/TSbkkmiGuk
To deploy in any major conflict, the US military relies on civilian infrastructure like ports, railways and airlines — all of which have been targeted by Chinese cyber threats actors, says a new report from @FDD_CCTI's Cyberspace Solarium Commission 2.0
https://t.co/RApGoh7GPZ
🚨 Threat Alert: Cyber Threats to US Military Mobilization Infrastructure  

📅 Date: 2025-03-31  

📍 Location: United States  

📌 Attribution: Volt Typhoon, Flax Typhoon, Salt Typhoon (Chinese state-sponsored groups), including known aliases for Volt Typhoon such as Vanguard
An examination of the security threats in the cyber domain that affect unmanned aircraft systems (UAS), focusing on the three core tenets of information security: confidentiality, integrity, and availability.

https://t.co/nwbWo4htwf

@USArmy  @TRADOC @usacac  @ArmyUniversity
Cybersecurity is like a sports event where you're always on the defensive, protecting your turf from hackers and cyber threats
RT @INTERPOL_HQ: Ho

## Retrieval-Augmented Generation (RAG) 

For more about RAG, please read [Retrieval-Augmented Generation (RAG) with Atlas Vector Search](https://www.mongodb.com/docs/atlas/atlas-vector-search/rag/#std-label-avs-rag).

In [15]:
from openai import OpenAI

delimiter = '###'
chat_model = 'gpt-4o'
temperature = 0

chat_history = [{"role": "system", "content": """You are a chatbot that answers user questions based on the returned tweets."""}]

def chatbot(prompt):

    chat_history.append({"role": "user", "content": prompt})
    
    # Retrieve the tweets most relevant to the prompt and pass them to the model as context
    tweets = vector_search(prompt)
    chat_history.append({"role": "system", "content": f"Here are the returned tweets, delimited by {delimiter}: {delimiter}{tweets}{delimiter}"})

    response = client.chat.completions.create(
        model=chat_model,  # Use the model you prefer
        messages=chat_history,
        temperature=temperature  # 0 keeps the answers as deterministic as possible
    )

    reply = response.choices[0].message.content

    chat_history.append({"role": "assistant", "content": reply})
    
    return reply
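
The `chatbot` function places the raw Python list returned by `vector_search` into the prompt. One possible refinement, sketched here under the assumption that each result has the `{'tweet': {'text': ...}}` shape produced by the `$project` stage, is to format the tweets as a plain bulleted block between the delimiters. The helper name `format_context` and the sample tweets are hypothetical.

```python
def format_context(tweets, delimiter="###"):
    """Turn vector search results into a delimited, bulleted text block."""
    texts = [doc["tweet"]["text"] for doc in tweets]
    body = "\n".join(f"- {text}" for text in texts)
    return f"{delimiter}\n{body}\n{delimiter}"

# Made-up results in the same shape that vector_search returns
sample_results = [
    {"tweet": {"text": "first example tweet"}},
    {"tweet": {"text": "second example tweet"}},
]
context = format_context(sample_results)
print(context)
```

Whether this changes answer quality is untested here; the main benefit is a prompt that is easier to read when debugging.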

In [16]:
while True:
    user_input = input("You: ")
    if user_input.lower() in ['exit', 'quit']:
        print("Chatbot: Goodbye!")
        break
    reply = chatbot(user_input)
    print(f"Chatbot: {reply}")

You:  recent threats


Chatbot: Recent threats mentioned in the tweets include:

1. **Cyber Threats in the United States:**
   - The U.S. has extended its National Emergency status due to cyber threats. These malicious cyber activities are primarily originating from individuals outside the United States, with connections to state-sponsored groups.
   - Specific threats have been identified targeting the US critical transportation and military mobilization infrastructures, attributed to Chinese state-sponsored hacking groups like Volt Typhoon, Flax Typhoon, and Salt Typhoon.

2. **Ransomware Trends:**
   - There is a concern about ransomware threats, detailing trends observed in March 2025, with various countries being targeted.

3. **AI-Driven Cyber Threats:**
   - Modern cyber threats are increasingly AI-driven, making them more sophisticated and challenging to defend against.

4. **Omissions and Inclusions in Threat Listings:**
   - There's a mention of confusion regarding the removal of Russia from a pote

You:  strategies to combat cyber threats


Chatbot: To combat cyber threats effectively, consider the following strategies:

1. **System Updates and Antivirus Software:**
   - Regularly update systems and software to protect against vulnerabilities.
   - Use reliable antivirus programs to detect and mitigate malware threats.

2. **Caution with Links and Emails:**
   - Be vigilant about suspicious links and emails, which are common vectors for cyber attacks.

3. **Governmental Measures:**
   - Implement stricter cybersecurity regulations and frameworks at national and organizational levels.

4. **Proactive Cybersecurity Strategy:**
   - Focus on preventing cyber threats with advanced vulnerability assessments and AI-powered threat detection.
   - Utilize virtual security services to maintain a strong defensive posture.

5. **Business Protection Practices:**
   - Regularly back up data securely.
   - Enable two-factor authentication (2FA) and use strong, unique passwords.
   - Avoid using public Wi-Fi for accessing sensitive info

You:  future of cyber threats


Chatbot: The future of cyber threats is expected to be shaped by several evolving factors:

1. **Advancements in Technology:**
   - The advent of quantum computing poses a challenge to current encryption methods, necessitating next-generation security solutions.

2. **Ever-Evolving Nature of Threats:**
   - Cyber threats continue to grow in complexity, targeting sensitive sectors like financial institutions, which must continuously strengthen their security measures.

3. **Rise of Cyberwarfare:**
   - The increase in sophisticated cyber weapons presents significant risks to critical infrastructure, national security, and economic stability. Digital defenses are becoming increasingly important.

4. **Proactive Cybersecurity:**
   - A proactive approach to cybersecurity, utilizing threat intelligence and advanced defense systems, is crucial as attacks become a question of "when," not "if."

5. **AI in Risk Management:**
   - Insurers and businesses are developing AI-powered tools to dete

You:  new methods used in cyber threats


Chatbot: New methods used in cyber threats, as inferred from the tweets, include:

1. **Advanced Attack Techniques:**
   - Cyber threats are evolving rapidly, targeting sensitive sectors like financial institutions by using more sophisticated methods to exploit vulnerabilities.

2. **Real-Time and AI-Driven Attacks:**
   - The use of real-time insights, often powered by AI, enables cyber attackers to swiftly adapt and respond to defensive measures, making threats more dynamic and harder to predict.

3. **Complex Threat Environments:**
   - Cyber threats are becoming multi-faceted, requiring businesses to adopt a Unified Security Posture to gain a comprehensive view of their security maturity and effectively prioritize their defenses.

4. **Sophistication in Ransomware and Data Breaches:**
   - As technology advances, cyber threats such as data breaches and ransomware attacks continue to become more complex, highlighting the need for robust protection strategies.

5. **Threat Intelligen

You:  quit


Chatbot: Goodbye!
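
Because `chatbot` appends the user prompt, the retrieved tweets, and the reply to `chat_history` on every turn, the history keeps growing during a long session. A minimal sketch of one way to cap it, keeping the first system message and the most recent messages, is shown below. The helper `trim_history` and the limit of four messages are assumptions for illustration, not part of the original notebook.

```python
def trim_history(history, max_messages=20):
    """Keep the first (system) message plus the most recent messages."""
    if max_messages < 2:
        raise ValueError("max_messages must be at least 2")
    if len(history) <= max_messages:
        return list(history)
    # First message carries the instructions; the tail carries recent context.
    return [history[0]] + history[-(max_messages - 1):]

# Made-up history: one system message followed by six user messages
history = [{"role": "system", "content": "instructions"}] + [
    {"role": "user", "content": f"message {i}"} for i in range(6)
]
trimmed = trim_history(history, max_messages=4)
print([message["content"] for message in trimmed])
```

A simple cut like this can drop the tweets that an earlier answer relied on, so the limit is a trade-off between context size and conversational memory.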


## References

- *“Introducing Text and Code Embeddings.”* n.d. OpenAI. Accessed October 31, 2024. https://openai.com/index/introducing-text-and-code-embeddings/.
- *“What Are Vector Databases?”* n.d. MongoDB. Accessed October 31, 2024. https://www.mongodb.com/resources/basics/databases/vector-databases.
