<a href="https://colab.research.google.com/github/dushyant3615/AI_Voice_Chatbot/blob/main/Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
pip install torch transformers datasets pandas speechrecognition

Collecting datasets
  Downloading datasets-3.2.0-py3-none-any.whl.metadata (20 kB)
Collecting speechrecognition
  Downloading SpeechRecognition-3.14.1-py3-none-any.whl.metadata (31 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  

In [3]:
pip install pydub

Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Installing collected packages: pydub
Successfully installed pydub-0.25.1


In [4]:
import os
import json
import pandas as pd
import speech_recognition as sr
from pydub import AudioSegment
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import Dataset

def load_clinc_data(file_path):
    with open(file_path, "r") as file:
        data = json.load(file)

    # Extract user queries and intents
    queries = [item[0] for item in data]  # User query
    intents = [item[1] for item in data]  # Intent

    return pd.DataFrame({"query": queries, "intent": intents})

# To load the Clinc AI dataset
clinc_data = load_clinc_data("/content/train.json")

def load_mozilla_data(csv_path, audio_folder):
    df = pd.read_csv(csv_path, sep="\t")  # Use tab separator for TSV files

    # To extract sentences and build the full path to each audio clip
    sentences = df["sentence"].tolist()
    audio_files = [os.path.join(audio_folder, name) for name in df["path"]]

    return pd.DataFrame({"sentence": sentences, "audio_path": audio_files})

# To load the Mozilla Common Voice dataset
mozilla_data = load_mozilla_data("/content/Mozilla Common Voice Dataset/validated.tsv", "/content/Mozilla Common Voice Dataset/clips")

def audio_to_text(audio_path):
    recognizer = sr.Recognizer()

    # Check if the file exists
    if not os.path.exists(audio_path):
        print(f"File not found: {audio_path}")
        return ""  # To return empty string if file is missing

    # Convert MP3 to WAV, because sr.AudioFile reads WAV/AIFF/FLAC but not MP3
    if audio_path.lower().endswith(".mp3"):
        try:
            audio = AudioSegment.from_mp3(audio_path)
            # Swap only the file extension; str.replace would also change ".mp3" elsewhere in the path
            wav_path = os.path.splitext(audio_path)[0] + ".wav"
            audio.export(wav_path, format="wav")
            audio_path = wav_path  # To use the converted WAV file
        except Exception as e:
            print(f"Error converting {audio_path} to WAV: {e}")
            return ""  # To return empty string if conversion fails

    try:
        with sr.AudioFile(audio_path) as source:
            audio = recognizer.record(source)
            return recognizer.recognize_google(audio)  # To convert the speech to text
    except sr.UnknownValueError:
        print(f"Could not transcribe audio: {audio_path}")
        return ""  # If audio cannot be transcribed
    except sr.RequestError as e:
        print(f"API error for audio: {audio_path}: {e}")
        return ""  # If there's an API error
    except Exception as e:
        print(f"Unexpected error for audio: {audio_path}: {e}")
        return ""  # Handle any other errors

# To transcribe audio files to text
mozilla_data["transcribed_text"] = mozilla_data["audio_path"].apply(audio_to_text)

# To combine the datasets
combined_data = pd.concat([
    clinc_data.rename(columns={"query": "text", "intent": "label"}),
    mozilla_data.rename(columns={"transcribed_text": "text"})[["text"]]
], ignore_index=True)

# To add dummy labels for Mozilla data (since it doesn't have intents)
combined_data["label"] = combined_data["label"].fillna("unknown")

# To convert labels to numerical values
label_to_id = {label: idx for idx, label in enumerate(combined_data["label"].unique())}
combined_data["label"] = combined_data["label"].map(label_to_id)

# To load pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# To tokenize the text data
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# To convert DataFrame to Hugging Face Dataset
dataset = Dataset.from_pandas(combined_data)
tokenized_dataset = dataset.map(tokenize_function, batched=True)

# To load pre-trained BERT model for intent classification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(label_to_id))

# To define training settings
training_args = TrainingArguments(
    output_dir="./results",  # Directory to save results
    per_device_train_batch_size=8,  # Batch size for training
    num_train_epochs=3,  # Number of training epochs
    logging_dir="./logs",  # Directory to save logs
)

# To initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# To train the model
trainer.train()

# To save the trained model and tokenizer
model.save_pretrained("trained_chatbot_model")
tokenizer.save_pretrained("trained_chatbot_model")

print("Model Training Complete. Chatbot is Ready!")

File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41383256.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41823983.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41881685.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41799514.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41552032.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41827319.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41526838.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41435787.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41633128.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41586424.mp3
File not found: /content/Mozilla Common Voice Dataset/clips/common_voice_en_41489529.mp3
File not found: /cont

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Map:   0%|          | 0/251 [00:00<?, ? examples/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter:

 ··········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: Currently logged in as: dushyant3615 (dushyant3615-own) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


Step,Training Loss


Model Training Complete. Chatbot is Ready!
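
The training cell builds `label_to_id` to turn intent names into integer ids, but that mapping is not written out alongside the saved model, so predicted class ids cannot be turned back into intent names later. Below is a minimal sketch of one way to keep and use the inverse mapping. The helper names (`build_label_maps`, `label_map_to_json`, `label_map_from_json`, `decode_prediction`) and the example intents are hypothetical and not part of the notebook, and the logits are made-up numbers standing in for model output.

```python
import json


def build_label_maps(labels):
    """Assign ids in first-seen order, as enumerate(Series.unique()) does."""
    label_to_id = {}
    for label in labels:
        if label not in label_to_id:
            label_to_id[label] = len(label_to_id)
    id_to_label = {idx: label for label, idx in label_to_id.items()}
    return label_to_id, id_to_label


def label_map_to_json(id_to_label):
    """Serialize the id -> label mapping (JSON object keys must be strings)."""
    return json.dumps({str(idx): label for idx, label in id_to_label.items()})


def label_map_from_json(text):
    """Restore the mapping, turning the string keys back into ints."""
    return {int(idx): label for idx, label in json.loads(text).items()}


def decode_prediction(logits, id_to_label):
    """Return the label whose logit is largest (argmax over one example)."""
    best_id = max(range(len(logits)), key=lambda i: logits[i])
    return id_to_label[best_id]


# Hypothetical intents; "unknown" mirrors the filler label used for the Common Voice rows
label_to_id, id_to_label = build_label_maps(["weather", "translate", "weather", "unknown"])

# Round-trip through JSON, as would happen when the mapping is stored and reloaded
restored = label_map_from_json(label_map_to_json(id_to_label))

# Made-up logits for a single query, one score per intent id
print(decode_prediction([0.1, 2.3, -1.0], restored))
```

With the real model, the logits would come from the classifier's output for one tokenized query, and the JSON string could be written to a file next to `trained_chatbot_model` so that inference code can reload it.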
