# RFT-Lab — Input Handling Layer

A powerful AI system does not start with a model.
It starts with **robust input handling**.

This notebook builds a unified input layer that supports:
- Text input
- PDF documents
- Images (OCR)
- Audio files
- Live microphone speech

All inputs are converted into **clean text** before entering
the Transformer pipeline.

Design principle:
Input handling is completely **decoupled** from
understanding, reasoning, and generation.


## Step 0: Environment Setup

We install production-friendly libraries and the system packages they depend on.
No model logic is mixed in here.


In [None]:
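!apt-get install -y tesseract-ocr   # pytesseract only wraps the Tesseract binary
!pip install pytesseract PyPDF2 SpeechRecognition
!pip install git+https://github.com/openai/whisper.git   # Whisper also needs ffmpeg installed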

In [27]:
!apt-get install -y portaudio19-dev
!pip install pyaudio

In [29]:
!pip install sounddevice scipy

In [2]:
import re
from typing import Dict, Union

from PIL import Image
import pytesseract
import PyPDF2

import whisper
import speech_recognition as sr
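pytesseract and Whisper are thin Python layers over system programs: pytesseract shells out to the `tesseract` binary, and Whisper decodes audio files with `ffmpeg`. A missing binary only surfaces later as a confusing runtime error, so a quick check up front can help. A minimal sketch; `check_binaries` is a hypothetical helper, not part of any library:

```python
import shutil
from typing import Dict, Iterable

# Hypothetical helper: report which external programs are on the PATH.
def check_binaries(names: Iterable[str] = ("tesseract", "ffmpeg")) -> Dict[str, bool]:
    # shutil.which returns the program's full path if found, else None
    return {name: shutil.which(name) is not None for name in names}

status = check_binaries()
missing = [name for name, found in status.items() if not found]
if missing:
    print("Missing system binaries:", ", ".join(missing))
```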


## Step 1: Text Cleaning

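User inputs are often noisy.
We normalize text early to stabilize downstream reasoning: lowercase everything,
drop every character except letters, digits, underscores, whitespace and `. , ? !`,
and collapse runs of whitespace into single spaces.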


In [4]:
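def clean_text(text: str) -> str:
    text = text.lower()
    # Keep only word characters, whitespace and . , ? !
    text = re.sub(r"[^\w\s.,?!]", "", text)
    # Collapse whitespace *after* removal so a deleted symbol (e.g. " - ")
    # does not leave a double space behind
    text = re.sub(r"\s+", " ", text)
    return text.strip()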
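In `clean_text`, the order of the two `re.sub` calls matters. If whitespace is collapsed *before* symbols are stripped, deleting a standalone symbol such as `-` or `&` leaves a double space where it used to be. A small self-contained comparison of the two orderings (both function names are illustrative only):

```python
import re

def clean_collapse_first(text: str) -> str:
    text = re.sub(r"\s+", " ", text.lower())
    text = re.sub(r"[^\w\s.,?!]", "", text)   # may leave "  " where a symbol was
    return text.strip()

def clean_strip_first(text: str) -> str:
    text = re.sub(r"[^\w\s.,?!]", "", text.lower())
    text = re.sub(r"\s+", " ", text)           # collapse after removal
    return text.strip()

sample = "Senior Engineer - Python & ML"
print(repr(clean_collapse_first(sample)))  # 'senior engineer  python  ml'
print(repr(clean_strip_first(sample)))     # 'senior engineer python ml'
```

Note that the character class also removes apostrophes, hyphens, `@` and `+`, so `don't` becomes `dont` and email addresses lose their `@`. Whether that is acceptable depends on what the downstream model needs.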


## Step 2: Text Input Handling

Plain text input is directly cleaned and normalized.


In [5]:
def handle_text_input(text: str) -> str:
    return clean_text(text)

## Step 3: PDF Input Handling

Users upload resumes, reports, or other documents.
We extract text conservatively, page by page from the PDF's text layer, and defer interpretation.


In [6]:
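def extract_text_from_pdf(file_path: str) -> str:
    # PyPDF2 is no longer maintained; its successor `pypdf` offers the same PdfReader API
    with open(file_path, "rb") as f:
        reader = PyPDF2.PdfReader(f)
        # extract_text() reads only the text layer; scanned pages come back empty
        pages = [page.extract_text() or "" for page in reader.pages]
    return clean_text(" ".join(pages))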
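Because `extract_text` only reads the PDF's text layer, pages that are scanned images come back empty or nearly empty. A hypothetical heuristic (not part of PyPDF2) that flags such documents so they can be routed to the OCR path instead; it works on the per-page strings, and the 20-character threshold is an arbitrary assumption:

```python
from typing import List, Optional

def pages_needing_ocr(page_texts: List[Optional[str]], min_chars: int = 20) -> List[int]:
    """Return 0-based indices of pages whose text layer looks empty."""
    flagged = []
    for i, text in enumerate(page_texts):
        # Image-only pages typically yield "" or whitespace from extract_text()
        if text is None or len(text.strip()) < min_chars:
            flagged.append(i)
    return flagged

def looks_scanned(page_texts: List[Optional[str]], min_chars: int = 20) -> bool:
    """True if more than half of the pages have no usable text layer."""
    if not page_texts:
        return True
    return len(pages_needing_ocr(page_texts, min_chars)) > len(page_texts) / 2
```

With a reader opened as in the cell above, the input would be `[page.extract_text() for page in reader.pages]`.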


## Step 4: Image Input Handling

Images may contain scanned documents or screenshots.
OCR (Tesseract, through the pytesseract wrapper) extracts the readable text.


In [7]:
def extract_text_from_image(image_path: str) -> str:
    # The context manager closes the file once the RGB copy has been made
    with Image.open(image_path) as img:
        image = img.convert("RGB")
    text = pytesseract.image_to_string(image)
    return clean_text(text)
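`image_to_string` returns text with no indication of how sure Tesseract was. pytesseract's `image_to_data(image, output_type=pytesseract.Output.DICT)` returns parallel per-word lists, including `text` and `conf`, with `-1` marking layout rows that are not words. A minimal sketch that keeps only words above a confidence cutoff; the helper name and the cutoff of 60 are assumptions:

```python
from typing import Dict, List

def confident_words(ocr_data: Dict[str, List], min_conf: float = 60.0) -> str:
    """Join words whose Tesseract confidence is at least min_conf.

    ocr_data is assumed to have the shape returned by
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    kept = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        # conf may be an int, float or str depending on the pytesseract version;
        # -1 marks non-word rows
        if word.strip() and float(conf) >= min_conf:
            kept.append(word.strip())
    return " ".join(kept)

# Shape-only example with made-up values (not real OCR output)
sample = {"text": ["", "Jane", "Doe", "~#"], "conf": [-1, 96.2, 91.0, 12.5]}
print(confident_words(sample))  # Jane Doe
```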


## Step 5: Audio File Input Handling

Audio files are converted to text using Whisper.
Reasoning never happens on raw audio.


In [8]:
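import torch

# Use the GPU when present; the "large" checkpoint (~2.9 GB) is slow on CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
whisper_model = whisper.load_model("large", device=device)

def extract_text_from_audio(audio_path: str) -> str:
    try:
        result = whisper_model.transcribe(audio_path)
    except Exception as e:
        # Raise instead of returning the error message, which would otherwise
        # be treated downstream as if it were the transcript
        raise RuntimeError(f"Error transcribing {audio_path}: {e}") from e
    return clean_text(result["text"])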

100%|██████████████████████████████████████| 2.88G/2.88G [00:22<00:00, 137MiB/s]
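`transcribe` returns more than the joined `text`: its `segments` list carries per-segment fields such as `text`, `no_speech_prob` and `avg_logprob`. Whisper can produce spurious text on silence or background noise, so one option is to rebuild the transcript only from segments that look like speech. A sketch; `speech_only_text` is a hypothetical helper, and its thresholds mirror the defaults of Whisper's own silence check (`no_speech_threshold=0.6`, `logprob_threshold=-1.0`) but should be treated as values to tune:

```python
from typing import Any, Dict

def speech_only_text(result: Dict[str, Any],
                     no_speech_threshold: float = 0.6,
                     logprob_threshold: float = -1.0) -> str:
    """Rebuild a transcript from Whisper segments that look like speech.

    A segment is dropped only when it is both likely silence (high
    no_speech_prob) and low-confidence (low avg_logprob).
    """
    kept = []
    for seg in result.get("segments", []):
        likely_silence = seg["no_speech_prob"] > no_speech_threshold
        low_confidence = seg["avg_logprob"] < logprob_threshold
        if not (likely_silence and low_confidence):
            kept.append(seg["text"].strip())
    return " ".join(kept)
```

In this notebook it would be used as `clean_text(speech_only_text(whisper_model.transcribe(path)))`.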


## Step 6: Live Microphone Input

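The system also supports real-time microphone input,
so users can speak naturally instead of typing.

Speech is captured with SpeechRecognition's `sr.Microphone` (which needs PyAudio,
hence the PortAudio install above) and transcribed by `recognize_google`, which calls
the Google Web Speech API and therefore needs an internet connection. The resulting
text is routed into the same pipeline. A hosted runtime such as Colab usually has no
microphone attached to the server, so this path is meant for local runs.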


In [32]:
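import speech_recognition as sr

def record_from_mic() -> str:
    recognizer = sr.Recognizer()

    with sr.Microphone() as source:
        # Calibrate the energy threshold to the room's background noise
        recognizer.adjust_for_ambient_noise(source, duration=0.5)
        print("Listening... speak now")
        audio = recognizer.listen(source)

    try:
        # Google Web Speech API: requires an internet connection
        text = recognizer.recognize_google(audio)
    except sr.UnknownValueError:
        return ""  # unintelligible speech; validation will flag empty input
    except sr.RequestError as e:
        raise RuntimeError(f"Speech API request failed: {e}") from e

    return clean_text(text)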
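`record_from_mic` sends the recording to Google's web API, while audio files go through Whisper. To keep all speech on one transcription path, the captured `sr.AudioData` can be written to a temporary WAV file (via its `get_wav_data()` method) and handed to the Whisper extractor. A sketch; `transcribe_mic_audio` is a hypothetical helper, and the transcriber is passed in so the function does not depend on a loaded model:

```python
import os
import tempfile
from typing import Callable

def transcribe_mic_audio(audio, transcribe_file: Callable[[str], str]) -> str:
    """Write captured mic audio to a temporary WAV file and transcribe that file.

    audio: a speech_recognition.AudioData (anything with get_wav_data()).
    transcribe_file: e.g. extract_text_from_audio from the Whisper step.
    """
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio.get_wav_data())
        return transcribe_file(path)
    finally:
        os.remove(path)  # always clean up the temporary file
```

For example, replacing the `recognize_google` call with `transcribe_mic_audio(audio, extract_text_from_audio)` would route microphone speech through the same local Whisper model; `extract_text_from_audio` already applies `clean_text`.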

## Step 7: Unified Input Router

Regardless of how input arrives,
it is converted into a single normalized format.

This function acts as a clean contract
between the UI and the AI core: every input type yields
the same dictionary (`raw_content`, `length`, `input_type`).


In [34]:
from typing import Dict, Optional

def handle_input(input_type: str, payload: Optional[str] = None) -> Dict:
    """Route any supported input to cleaned text.

    payload is the raw text for "text", a file path for "pdf", "image"
    and "audio", and is ignored for "mic".
    """
    if input_type == "text":
        content = handle_text_input(payload)
    elif input_type == "pdf":
        content = extract_text_from_pdf(payload)
    elif input_type == "image":
        content = extract_text_from_image(payload)
    elif input_type == "audio":
        content = extract_text_from_audio(payload)
    elif input_type == "mic":
        content = record_from_mic()
    else:
        raise ValueError(f"Unsupported input type: {input_type!r}")

    return {
        "raw_content": content,          # extracted, cleaned text
        "length": len(content.split()),  # word count
        "input_type": input_type,        # source type
    }
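The `if/elif` chain works, but every new modality means editing the router. One alternative is a registry that maps input types to handler functions, so the output contract lives in one place. A self-contained sketch; `HANDLERS`, `register` and `route` are hypothetical names, and the lambda-free `_text` handler is only a stand-in for the real extractors:

```python
from typing import Callable, Dict, Optional

# Registry: input type -> function that turns the payload into cleaned text
HANDLERS: Dict[str, Callable[[Optional[str]], str]] = {}

def register(input_type: str):
    """Decorator that adds a handler to the registry."""
    def decorator(fn: Callable[[Optional[str]], str]):
        HANDLERS[input_type] = fn
        return fn
    return decorator

@register("text")
def _text(payload: Optional[str]) -> str:
    # Stand-in for handle_text_input: lowercase and collapse whitespace only
    return " ".join((payload or "").lower().split())

def route(input_type: str, payload: Optional[str] = None) -> Dict:
    try:
        handler = HANDLERS[input_type]
    except KeyError:
        raise ValueError(
            f"Unsupported input type: {input_type!r}; expected one of {sorted(HANDLERS)}"
        ) from None
    content = handler(payload)
    return {"raw_content": content, "length": len(content.split()), "input_type": input_type}

print(route("text", "Analyze  THIS resume."))
# {'raw_content': 'analyze this resume.', 'length': 3, 'input_type': 'text'}
```

Adding a modality is then a single call such as `register("pdf")(extract_text_from_pdf)`, with no change to `route`.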

## Step 8: Input Validation

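Before reasoning begins, input quality is checked.
Empty input is marked invalid; very short (under 5 words) or very long
(over 5,000 words) input keeps `is_valid=True` and only gets a warning,
so execution is not blocked.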


In [21]:
def validate_input(input_dict: Dict) -> Dict:
    content = input_dict.get("raw_content", "")
    length = input_dict.get("length", 0)

    input_dict["is_valid"] = True
    input_dict["warning"] = None

    if not content:
        input_dict["is_valid"] = False
        input_dict["warning"] = "Empty input"

    elif length < 5:
        input_dict["warning"] = "Input too short for deep reasoning"

    elif length > 5000:
        input_dict["warning"] = "Input very long; truncation may occur"

    return input_dict
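The long-input warning only says that truncation *may* occur. If the downstream model has a fixed budget, truncating explicitly keeps the behaviour predictable. A minimal sketch that cuts at a word budget and records the fact in the same dict; the 5,000-word default mirrors the warning threshold in `validate_input`, and the `truncated` key is an assumption. Transformer context limits are counted in tokens rather than words, so a tokenizer-based count would be more precise:

```python
from typing import Dict

def truncate_input(input_dict: Dict, max_words: int = 5000) -> Dict:
    """Return a copy of input_dict whose raw_content holds at most max_words words."""
    out = dict(input_dict)  # do not mutate the caller's dict
    words = out.get("raw_content", "").split()
    out["truncated"] = len(words) > max_words
    if out["truncated"]:
        out["raw_content"] = " ".join(words[:max_words])
        out["length"] = max_words
    return out
```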


## Step 9: End-to-End Test

This simulates the internal flow of the deployed
application: route the input, then validate it.


In [22]:
user_text = "Analyze this resume and highlight weaknesses."

processed = handle_input("text", user_text)
validated = validate_input(processed)

validated

{'raw_content': 'analyze this resume and highlight weaknesses.',
 'length': 6,
 'input_type': 'text',
 'is_valid': True,
 'warning': None}

## Microphone Test

Speak a sentence and verify the extracted text.
Uncomment the cell below on a machine with a microphone.


In [36]:
# spoken = handle_input("mic", None)
# validated = validate_input(spoken)
# validated

## **This notebook demonstrates:**
- True multimodal input support
- Clean separation of concerns
- Production-aware validation
- Speech-ready AI system design