<a href="https://colab.research.google.com/github/amuzetnoM/artifactvirtual/blob/ADE/notebooks/modeltraining/multimodalaitraining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AVA v1.0 APEX
AVA: Advanced Virtual Assistant

This notebook establishes a **robust multimodal AI assistant** capable of understanding and processing text, image, and audio,
- Uses separate quantized models per modality,
- Incorporates [Qwen3](https://huggingface.co/Qwen) as the core LLM interface,
- Supports RAG (retrieval-augmented generation),
- Includes a file and document processing pipeline,
- Robust error handling,
- Ready to expand into generative applications.

In [1]:
!pip install transformers datasets torchaudio torchvision matplotlib sentence-transformers
!pip install pyaudio wave speechrecognition PyMuPDF opencv-python ffmpeg-python
!pip install langchain qwen openai faiss-cpu unstructured
!pip install langchain-community
!pip install tqdm

Collecting argparse<2.0.0,>=1.4.0 (from zetascale->qwen)
  Using cached argparse-1.4.0-py2.py3-none-any.whl.metadata (2.8 kB)
Using cached argparse-1.4.0-py2.py3-none-any.whl (23 kB)
Installing collected packages: argparse
Successfully installed argparse-1.4.0


Collecting langchain-community
  Using cached langchain_community-0.3.23-py3-none-any.whl.metadata (2.5 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Using cached pydantic_settings-2.9.1-py3-none-any.whl.metadata (3.8 kB)
Collecting httpx-sse<1.0.0,>=0.4.0 (from langchain-community)
  Using cached httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting python-dotenv>=0.21.0 (from pydantic-settings<3.0.0,>=2.4.0->langchain-community)
  Using cached python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Using cached langchain_community-0.3.23-py3-none-any.whl (2.5 MB)
Using cached httpx_sse-0.4.0-py3-none-any.whl (7.8 kB)
Using cached pydantic_settings-2.9.1-py3-none-any.whl (44 kB)
Using cached python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv, httpx-sse, pydantic-settings, langchain-community
Successfully installed httpx-sse-0.4.0 langchain-community-0.3.23 pydantic-settings-2.9.1 python-dotenv-1.1.0


In [4]:
import os
import torch
import torchaudio
import wave
import speech_recognition as sr
import matplotlib.pyplot as plt
from PIL import Image
import torchvision.transforms as T
import fitz  # PyMuPDF
import cv2
import tempfile
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoModel, AutoProcessor, pipeline
from sentence_transformers import SentenceTransformer
from langchain.document_loaders import PyPDFLoader, UnstructuredFileLoader
from langchain.vectorstores import FAISS
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline
from tqdm.notebook import tqdm

In [10]:
model_name = "Qwen/Qwen1.5-1.8B-Chat-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto", load_in_4bit=True)

def handle_text(text):
    try:
        if not text or not isinstance(text, str): raise ValueError("Text must be a non-empty string.")
        tokens = tokenizer(text, return_tensors='pt').to(model.device)
        return tokens
    except Exception as e:
        print("Text processing error:", e)
        return None

def handle_image(image_path):
    try:
        if not os.path.exists(image_path): raise FileNotFoundError(image_path)
        image = Image.open(image_path).convert("RGB")
        transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
        return transform(image).unsqueeze(0)
    except Exception as e:
        print("Image error:", e)
        return None

def handle_audio(audio_path):
    try:
        if not os.path.exists(audio_path): raise FileNotFoundError(audio_path)
        waveform, _ = torchaudio.load(audio_path)
        return waveform
    except Exception as e:
        print("Audio error:", e)
        return None

def audio_to_text(audio_path):
    recognizer = sr.Recognizer()
    try:
        with sr.AudioFile(audio_path) as source:
            audio = recognizer.record(source)
            return recognizer.recognize_google(audio)
    except Exception as e:
        print("Speech Recognition failed:", e)
        return ""

def chat(prompt):
    try:
        inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
        output = model.generate(**inputs, max_new_tokens=100, do_sample=True)
        return tokenizer.decode(output[0], skip_special_tokens=True)
    except Exception as e:
        return f"Chat error: {e}"

def setup_rag(pdf_path):
    loader = PyPDFLoader(pdf_path)
    documents = loader.load()
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    db = FAISS.from_documents(documents, embeddings)
    retriever = db.as_retriever(search_kwargs={"k": 3})
    rag = RetrievalQA.from_chain_type(llm=HuggingFacePipeline(pipeline="text-generation", model=model, tokenizer=tokenizer), chain_type="stuff", retriever=retriever)
    return rag

# Example text
tokens = handle_text("Hello world")
if tokens:
    text_vector = model(**tokens).last_hidden_state[:, 0, :]
    print("Text vector shape:", text_vector.shape)

OSError: Qwen/Qwen1.5-1.8B-Chat-GPTQ is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`

# AVA v0.1
This notebook establishes a foundation for processing and combining
text, image, and audio data using deep learning models.
It handles each input type separately, allowing for modularity and flexibility.
Outputs can be fused into a single representation for downstream tasks.



**Install Dependencies**
We install the necessary libraries for handling text, image, and audio data,
as well as for visualization and model loading.

In [15]:
!pip install transformers datasets torchaudio torchvision matplotlib
!pip install wave
!apt-get update && apt-get install -y portaudio19-dev
!pip install pyaudio
!pip install speechrecognition
!pip install PyPDF2

Hit:1 http://security.ubuntu.com/ubuntu jammy-security InRelease
Hit:2 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease
Hit:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Hit:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease
Hit:6 http://archive.ubuntu.com/ubuntu jammy-updates InRelease
Hit:7 http://archive.ubuntu.com/ubuntu jammy-backports InRelease
Hit:8 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:9 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Hit:10 https://ppa.launchpadcontent.net/ubuntugis/ppa/ubuntu jammy InRelease
Reading package lists... Done
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading

**Import Libraries**
Here we import the libraries we'll be using throughout the notebook.
hese include tools for text processing, image manipulation, audio handling,
visualization, and model loading.

In [16]:
from transformers import AutoTokenizer, AutoModel
import torch
import torchvision.transforms as T
from PIL import Image
import torchaudio
import matplotlib.pyplot as plt
import pyaudio
import wave
import speech_recognition as sr
import os
import PyPDF2


**Define Input Handlers with Error Handling and Validations**
These functions handle different input types (text, image, audio) and
include error handling and validations to ensure robustness.

In [17]:
# Text-------------------------------------------------------------------------
def handle_text(text):
    """Processes text input using BERT tokenizer."""
    try:
        if not isinstance(text, str) or not text:
            raise ValueError("Invalid text input. Please provide a non-empty string.")
        tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
        tokens = tokenizer(text, return_tensors='pt')
        return tokens
    except ValueError as e:
        print(f"Error processing text: {e}")
        return None

# Image------------------------------------------------------------------------
def handle_image(image_path):
    """Processes image input using torchvision transforms."""
    try:
        if not os.path.exists(image_path):
            raise FileNotFoundError(f"Image file not found: {image_path}")
        image = Image.open(image_path).convert('RGB')
        transform = T.Compose([
            T.Resize((224, 224)),
            T.ToTensor()
        ])
        return transform(image).unsqueeze(0)
    except (FileNotFoundError, OSError) as e:
        print(f"Error processing image: {e}")
        return None

# Audio------------------------------------------------------------------------
def handle_audio(audio_path):
    """Processes audio input using torchaudio."""
    try:
        if not os.path.exists(audio_path):
            raise FileNotFoundError(f"Audio file not found: {audio_path}")
        waveform, sample_rate = torchaudio.load(audio_path)
        return waveform
    except (FileNotFoundError, OSError) as e:
        print(f"Error processing audio: {e}")
        return None

*Test Handler (optional)*

In [None]:
# Example: Text
text_data = handle_text("This is a test.")

# Example: Image
image_tensor = handle_image("/content/image.jpg")

# Example: Audio
audio_waveform = handle_audio("/content/audio.wav")


**Model Forwarrd Pass**
This section loads the BERT model and performs a forward pass
on the text data to obtain text embeddings.

In [None]:
from transformers import AutoModel

model = AutoModel.from_pretrained('bert-base-uncased')  # Load model once

def get_text_embedding(text):
    """Gets text embedding using BERT model."""
    text_data = handle_text(text)
    if text_data is not None:
        outputs = model(**text_data)
        return outputs.last_hidden_state[:, 0, :]  # CLS token
def get_text_embedding(text):
    """Gets text embedding using BERT model."""
    text_data = handle_text(text)
    if text_data is not None:
        outputs = model(**text_data)
        return outputs.last_hidden_state[:, 0, :]  # CLS token


*Recognize audio, image and text*

In [None]:
filename = 'audio.wav'
# Initialize recognizer
r = sr.Recognizer()
with sr.AudioFile(filename) as source:
    # listen for the data (load audio to memory)
    audio_data = r.record(source)
    # recognize (convert from speech to text)
    text = r.recognize_google(audio_data)
    print(text)

Visualize Image or Audio

In [None]:
# Image
plt.imshow(image_tensor.squeeze(0).permute(1, 2, 0))
plt.title("Loaded Image")
plt.axis('off')
plt.show()

# Audio
plt.plot(audio_waveform.t().numpy())
plt.title("Audio Waveform")
plt.show()


**Fusion (optional)**
You can later combine embeddings (text, image, audio) into a shared vector and train a classifier or generative model on top.

In [None]:
# Combined Vector
text = "This is a test." # Replace with your desired text
text_data = handle_text(text)
if text_data is not None:
    outputs = model(**text_data)  # This line was missing
    text_vector = outputs.last_hidden_state[:, 0, :]  # CLS token
    combined = text_vector  # Later concat with image/audio embeddings

# Classifier layer (optional)
# classifier = torch.nn.Linear(combined.size(1), num_classes)
# logits = classifier(combined)


Summary


Each input type is handled separately.

Outputs can be combined into one representation.

From here, build your own loss function, dataset loader, and training loop.

This is a template, not a finished AI. But it’s the bones of one.