# Simple RAG System for SpaceX Projects
## Part (1) Data Acquisition and Preprocessing   
\

## Objective
### Develop a basic Retrieval-Augmented Generation (RAG) system using a diverse dataset to answer questions specifically related to SpaceX.

The primary goal of this project is to build a RAG system that answers user queries about SpaceX, its ongoing projects, and its latest space missions by combining several different data sources.

\
## Dataset Acquisition and Processing
In this project, various datasets are collected from multiple resources, processed, and stored in my Google Drive for easy access and use. The collected data focuses on providing comprehensive information relevant to SpaceX and its activities.

\
### Data Sources
The dataset for this RAG system is derived from four main types of sources, each contributing unique information to form a complete and informative resource:

- **PDF Files**: These include SpaceX mission guides, general information about SpaceX's rocket fleet, and detailed documentation about the company's history, vision, and mission. The PDFs are preprocessed, split into smaller chunks, and stored as text files.
  
- **Wikipedia Articles**: Articles sourced from Wikipedia provide scientific insights and background on topics related to space exploration, space vehicles, rocket technology, and historical space missions.

- **CC News Articles**: This data source comprises journalistic articles covering SpaceX, NASA, and other prominent space organizations. These articles highlight key achievements, collaborations, and the impact of SpaceX on the space industry.

- **Video Transcripts**: Videos related to space exploration, SpaceX’s ambitious goals, and the intricacies of rocket engines are transcribed. The transcripts offer a detailed, comprehensive view of the company’s work and its future vision for space travel and technology.

\
## Preprocessing and Storage
After collecting data from the above sources, each resource was preprocessed using text splitting, cleaning, and formatting techniques. The processed data was then stored in `.txt` files for seamless retrieval and integration into the RAG system.
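
The text-splitting step mentioned above can be illustrated with a minimal sketch. The notebook does not show the splitter that was actually used, so the helper name `split_into_chunks` and the `chunk_size` and `overlap` values below are assumptions chosen only for illustration:

```python
def split_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows (illustrative helper)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

The overlap keeps a little shared context between neighbouring chunks, which can help retrieval when an answer spans a chunk boundary.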

\
## Purpose and Use Case
The resulting RAG system is designed to serve as a Q&A service that provides accurate, contextual information about SpaceX's endeavors, its current projects, and their broader implications for space exploration. With this system, users can gain deeper insight into the company's efforts and keep up to date with its latest achievements and missions.

In [16]:
# install the libraries required to download the datasets and process the video files
!pip install datasets gdown ffmpeg-python


Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Installing collected packages: ffmpeg-python
Successfully installed ffmpeg-python-0.2.0


In [17]:
# install the Whisper model, which is needed to transcribe the videos
!pip install git+https://github.com/openai/whisper.git


Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /tmp/pip-req-build-u8ugmqg8
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /tmp/pip-req-build-u8ugmqg8
  Resolved https://github.com/openai/whisper.git to commit 423492dda7806206abe56bdfe427c1096473a020
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting tiktoken (from openai-whisper==20240927)
  Downloading tiktoken-0.7.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting triton>=2.0.0 (from openai-whisper==20240927)
  Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)
Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)

In [21]:
# import the required libraries

from google.colab import drive
from datasets import load_dataset
import whisper
import gdown
import ffmpeg
import os
import re


In [4]:
# mount Google Drive to load and save the data
drive.mount('/content/drive')

# change the working directory to the RagProject folder to make it easier to work with the drive
# (the folder name ends with a trailing space, as the download paths in the output further down confirm)
os.chdir("/content/drive/MyDrive/RagProject ")

Mounted at /content/drive


## Download and process the Wikipedia and CC-News datasets

In [5]:
# a dataset entry is selected as relevant if its title mentions any of the following keywords

space_keywords = [
    "SpaceX", "Raptor engine", "Falcon 9", "Falcon Heavy", "Elon Musk",
    "rocket engine", "aerospace", "space exploration", "Mars mission",
    "reusable rocket", "rocket technology", "space mission",
    "Raptor rocket engine", "rocket propulsion", "Raptor vacuum",
    "spacecraft", "launch vehicle", "space launch", "orbital rocket",
    "Falcon rocket", "Merlin engine", "Starship"
]


In [6]:
# load the first 30% of the cc_news dataset to limit computation time
cc_news_dataset = load_dataset("cc_news", split="train[:30%]")

# load the first 30% of the English Wikipedia dataset to limit computation time
# this snapshot of the dataset was released on 1 March 2022
wiki_dataset = load_dataset("wikipedia", "20220301.en", split="train[:30%]")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Generating train split:   0%|          | 0/708241 [00:00<?, ? examples/s]

The repository for wikipedia contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/wikipedia.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


Generating train split:   0%|          | 0/6458670 [00:00<?, ? examples/s]

Loading dataset shards:   0%|          | 0/20 [00:00<?, ?it/s]

In [7]:
# inspect the structure of the dataset
wiki_dataset

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 1937601
})

## Process and filter the required articles from the datasets

In [9]:

# create a regex pattern that matches any of the keywords as whole words
pattern = r'\b(' + '|'.join(re.escape(keyword) for keyword in space_keywords) + r')\b'

# collect the Wikipedia articles whose titles match, together with their titles
filtered_wiki_titles = []
filtered_wiki_articles = []

for row in wiki_dataset:
  if re.search(pattern, row["title"], re.IGNORECASE):
    filtered_wiki_titles.append(row["title"])
    filtered_wiki_articles.append(row["text"])

# collect the CC-News articles whose titles match, together with their titles
filtered_cc_news_articles = []
filtered_cc_news_titles = []

for row in cc_news_dataset:
  if re.search(pattern, row["title"], re.IGNORECASE):
    filtered_cc_news_titles.append(row["title"])
    filtered_cc_news_articles.append(row["text"])
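
To see how the whole-word, case-insensitive title filter behaves, here is a small self-contained sketch. It uses a shortened keyword list and made-up titles (both are illustrative assumptions, not data from the datasets), and the helper name `is_relevant` is hypothetical:

```python
import re

# shortened, illustrative subset of the keyword list
sample_keywords = ["SpaceX", "Falcon 9", "Starship"]

# same construction as the notebook's pattern, compiled once for reuse
sample_pattern = re.compile(
    r'\b(' + '|'.join(re.escape(k) for k in sample_keywords) + r')\b',
    re.IGNORECASE,
)

def is_relevant(title):
    """Return True if the title mentions any keyword as a whole word."""
    return sample_pattern.search(title) is not None

# made-up titles used only to demonstrate the matching rules
sample_titles = [
    "SpaceX launches Falcon 9",
    "spacex delays test flight",
    "Starships of science fiction",
    "Local weather report",
]

relevant = [t for t in sample_titles if is_relevant(t)]
print(relevant)
# ['SpaceX launches Falcon 9', 'spacex delays test flight']
```

Note that "Starships" is rejected: the trailing `\b` requires a word boundary right after the keyword, so plural forms do not match unless they are listed explicitly.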



In [10]:
def clean_text(text):

    # replace runs of two or more periods ('..', '...') with a single space
    text = re.sub(r'\.{2,}', ' ', text)
    # replace newline characters with a space
    text = text.replace('\n', ' ')
    # collapse any remaining runs of whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text
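
A quick check of what this cleaning does to a messy string. The function is restated here only so the snippet runs on its own, and the sample input is made up for illustration:

```python
import re

def clean_text(text):
    # same three steps as the notebook's function
    text = re.sub(r'\.{2,}', ' ', text)
    text = text.replace('\n', ' ')
    text = re.sub(r'\s+', ' ', text).strip()
    return text

sample = "Falcon 9... lifts\noff   today"
print(clean_text(sample))
# Falcon 9 lifts off today
```

Because newlines are removed, every cleaned title or article ends up on a single line, which is what lets a newline-based separator be used safely when the records are saved.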

In [12]:
# save filtered CC News titles and articles to separate text files

with open('DownloadedDataSet/filtered_cc_news_titles.txt', 'w', encoding='utf-8') as file:
    for title in filtered_cc_news_titles:
        file.write(clean_text(title) + '\n\n\n*')
with open('textDataSet/filtered_cc_news_articles.txt', 'w', encoding='utf-8') as file:
    for article in filtered_cc_news_articles:
        file.write(clean_text(article) + '\n\n\n*')



# save filtered wiki articles and titles to separate text files

with open('DownloadedDataSet/filtered_wiki_titles.txt', 'w', encoding='utf-8') as file:
    for title in filtered_wiki_titles:
        file.write(clean_text(title) + '\n\n\n*')
with open('textDataSet/filtered_wiki_articles.txt', 'w', encoding='utf-8') as file:
    for article in filtered_wiki_articles:
        file.write(clean_text(article) + '\n\n\n*')
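
Each record is written followed by the separator `'\n\n\n*'`, so a saved file can be turned back into a list by splitting on that same string. The sketch below assumes this is how the files are meant to be read later. The helper name `load_records` and the demo file are hypothetical:

```python
import os
import tempfile

# separator written after every record by the saving code
SEPARATOR = '\n\n\n*'

def load_records(path):
    """Read a saved file back into a list of records (hypothetical helper)."""
    with open(path, encoding='utf-8') as file:
        raw = file.read()
    # the file ends with a separator, so drop the empty trailing piece
    return [record for record in raw.split(SEPARATOR) if record.strip()]

# demo on a throwaway file written the same way as the notebook's files
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_articles.txt')
    with open(demo_path, 'w', encoding='utf-8') as file:
        for article in ["first article", "second article"]:
            file.write(article + SEPARATOR)
    records = load_records(demo_path)

print(records)
# ['first article', 'second article']
```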



## Download the processed Wikipedia and CC-News files

In [15]:

# downloading the previously processed files
gdown.download('https://drive.google.com/uc?id=1WvGZ01SNch4uLYympd3gB1Qm2lW-IhWj', 'wiki_articles.txt', quiet=False)
gdown.download('https://drive.google.com/uc?id=1CoBdWYujQMXVt-C8DRcq6cujKytyPGVh', 'cc_news_articles.txt', quiet=False)


Downloading...
From: https://drive.google.com/uc?id=1WvGZ01SNch4uLYympd3gB1Qm2lW-IhWj
To: /content/drive/MyDrive/RagProject /wiki_articles.txt
100%|██████████| 537k/537k [00:00<00:00, 49.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1CoBdWYujQMXVt-C8DRcq6cujKytyPGVh
To: /content/drive/MyDrive/RagProject /cc_news_articles.txt
100%|██████████| 2.95M/2.95M [00:00<00:00, 112MB/s]


'cc_news_articles.txt'

## Process the video files

In [20]:
# this function is used to extract audio from the video

def extract_audio(video_file, output_audio_file):
    (
        ffmpeg
        .input(video_file)
        .output(output_audio_file, format='mp3')
        .run(overwrite_output=True)
    )


In [22]:

# load the Whisper model, choosing the "base" size
model = whisper.load_model("base")

# transcribe an audio file to text
def transcribe_audio(audio_file):
    result = model.transcribe(audio_file)
    return result['text']


100%|███████████████████████████████████████| 139M/139M [00:03<00:00, 41.3MiB/s]
  checkpoint = torch.load(fp, map_location=device)
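
The function above keeps only `result['text']`. To the best of my knowledge, the dictionary returned by Whisper's `transcribe` also carries a `segments` list whose entries include `start`, `end`, and `text`. Treat that structure as an assumption and check it against the installed version. Under that assumption, a timestamped transcript could be sketched as follows, using a hand-made dictionary in place of a real model call. The helper name `format_segments` is hypothetical:

```python
def format_segments(result):
    """Render one line per segment with its start and end time in seconds."""
    lines = []
    for segment in result.get("segments", []):
        start, end = segment["start"], segment["end"]
        lines.append(f"[{start:.2f}s - {end:.2f}s] {segment['text'].strip()}")
    return "\n".join(lines)

# hand-made stand-in for a transcription result (not real model output)
fake_result = {
    "text": " Hello world. Raptor.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": " Hello world."},
        {"start": 2.5, "end": 4.0, "text": " Raptor."},
    ],
}

print(format_segments(fake_result))
# [0.00s - 2.50s] Hello world.
# [2.50s - 4.00s] Raptor.
```

Timestamps like these can be stored alongside each chunk so that an answer can point back to the moment in the video it came from.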


In [23]:

# this function converts every .mp4 video in a folder into a .txt transcript saved in the given output path

def process_video_to_text(video_dir, output_dir):
    # make sure the output folder exists before writing to it
    os.makedirs(output_dir, exist_ok=True)

    for i in os.listdir(video_dir):
        if i.endswith(".mp4"):
            # Strip only the final extension so names containing dots stay intact
            stem = os.path.splitext(i)[0]

            # Construct the video, audio, and transcript paths for each file
            video_path = os.path.join(video_dir, i)
            audio_path = os.path.join(output_dir, stem + ".mp3")
            text_path = os.path.join(output_dir, stem + ".txt")

            print("start_processing_video:", i)

            # Extract audio
            extract_audio(video_path, audio_path)

            # Transcribe the audio
            transcript = transcribe_audio(audio_path)

            # Save the transcription to a text file
            with open(text_path, "w", encoding="utf-8") as file:
                file.write(transcript)

            print("finished_video:", i)
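
Deriving the output names from the video file name is easy to get wrong when a name contains more than one dot. A small sketch of a safe approach follows. The helper name `output_paths` and the sample file name are made up for illustration:

```python
import os

def output_paths(video_name, output_dir):
    """Return the (audio, transcript) paths for one video file name."""
    # splitext removes only the final extension, keeping any earlier dots
    stem, _extension = os.path.splitext(video_name)
    audio_path = os.path.join(output_dir, stem + ".mp3")
    text_path = os.path.join(output_dir, stem + ".txt")
    return audio_path, text_path

audio_path, text_path = output_paths("Falcon 9 v1.1 launch.mp4", "soundDataSet")
print(os.path.basename(audio_path))
# Falcon 9 v1.1 launch.mp3
```

Splitting the name on the first dot instead would turn this file into `Falcon 9 v1.mp3`, and two videos sharing that prefix would overwrite each other's output.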


In [24]:

# convert the videos to txt transcript
process_video_to_text("VideosDataSet/","soundDataSet/")


start_processing_video: The Real Reason SpaceX Developed The Raptor Engine!.mp4
finished_video: The Real Reason SpaceX Developed The Raptor Engine!.mp4
start_processing_video: The Journey of Elon Musk (Documentary).mp4
finished_video: The Journey of Elon Musk (Documentary).mp4
start_processing_video: Raptor Engine.mp4
finished_video: Raptor Engine.mp4
start_processing_video: SpaceX rocket Engine.mp4
finished_video: SpaceX rocket Engine.mp4
start_processing_video: SpaceX Raptor3.mp4
finished_video: SpaceX Raptor3.mp4
start_processing_video: SpaceX moonBase.mp4
finished_video: SpaceX moonBase.mp4
start_processing_video: How SpaceX and NASA Plan To Build A Mars Colony!.mp4
finished_video: How SpaceX and NASA Plan To Build A Mars Colony!.mp4
