<a href="https://colab.research.google.com/github/andege19/LuxeChatBot/blob/main/LuxeNailSalonBot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

###Luxe Nails Kenya Chatbot
The chatbot handles user queries about a nail salon, providing responses in either English or Swahili. The process involves loading and preprocessing a dataset, training a neural network model, and building a user interface for interaction. The goals are to:

<ul>
<li>Train a rule-based chatbot using this dataset.
<li>Use ML/NLP (like TF-IDF + cosine similarity) to make it smarter.
<li>Add Swahili / Sheng phrasing for the local language flavor.
</ul>
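
The TF-IDF plus cosine-similarity idea listed above can be sketched in plain Python. This is only an illustration of the technique, not the approach the notebook implements further down (which trains a bag-of-words neural network). The sample patterns and the helper names `tfidf_vectors`, `cosine`, and `best_match` are assumptions made for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so a term that occurs in every document keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * idf[t] for t, c in counts.items()})
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query, documents):
    """Return (closest document, similarity), or (None, 0.0) if no term is known."""
    vectors, idf = tfidf_vectors(documents)
    tokens = tokenize(query)
    counts = Counter(t for t in tokens if t in idf)
    if not counts:
        return None, 0.0
    query_vec = {t: (c / len(tokens)) * idf[t] for t, c in counts.items()}
    scores = [cosine(query_vec, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return documents[best], scores[best]

# Made-up sample patterns, not taken from the project's dataset.
patterns = [
    "what are your opening hours",
    "how much is a gel manicure",
    "how do i book an appointment",
]
match, score = best_match("book appointment", patterns)
print(match, round(score, 3))
```

In a retrieval-style bot, the matched pattern's intent would then be used to look up a response, and a low similarity score could trigger the fallback message.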


###Building Process
###Dataset Loading and Preprocessing
<ul>
<li>Loading: The dataset is loaded from a CSV file, containing intents, patterns, and responses.
<li>Language Mapping: While the dataset is loaded, each response is assigned to English or Swahili using langdetect's detect function, with English as the default when detection fails.
<li>Preprocessing: Text preprocessing consists of tokenization, lowercasing, and stemming to prepare the data for training.
</ul>

###Training Data Creation
<ul>
<li>Bag-of-Words Model: The preprocessed patterns are converted into a bag-of-words representation, creating input features for the neural network.
<li>Labels: Intents are encoded as labels for supervised learning.
</ul>
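
As a small worked example of the two steps above, the sketch below builds bag-of-words vectors and one-hot labels for three made-up, already-tokenized patterns. The sample data and the helper names (`build_vocabulary`, `bag_of_words`, `one_hot`) are assumptions for illustration; the notebook's own `create_training_data` function additionally tokenizes and stems the text.

```python
def build_vocabulary(tokenized_patterns):
    """Sorted list of the unique tokens across all patterns."""
    return sorted({tok for tokens in tokenized_patterns for tok in tokens})

def bag_of_words(tokens, vocabulary):
    """Binary vector: 1 where the vocabulary word occurs in the pattern."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

def one_hot(label, labels):
    """Vector with a single 1 at the position of the label."""
    row = [0] * len(labels)
    row[labels.index(label)] = 1
    return row

# Made-up, already tokenized samples: (tokens, intent tag).
samples = [
    (["hello", "there"], "greeting"),
    (["book", "appointment"], "booking"),
    (["hello", "book", "nails"], "booking"),
]

vocabulary = build_vocabulary([tokens for tokens, _ in samples])
labels = sorted({tag for _, tag in samples})
X = [bag_of_words(tokens, vocabulary) for tokens, _ in samples]
y = [one_hot(tag, labels) for _, tag in samples]

print(vocabulary)  # ['appointment', 'book', 'hello', 'nails', 'there']
print(X[0], y[0])  # [0, 0, 1, 0, 1] [0, 1]
```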

###Neural Network Model
<ul>
<li>Architecture: A simple neural network with one hidden layer is defined.
<li>Training: The model is trained using backpropagation to learn the relationship between input patterns and intents.
</ul>

###Response Handling
<ul>
<li>Intent Prediction: The trained model predicts the intent of user input.
<li>Language Selection: The response language is taken from the language chosen in the interface; the input language is also detected with langdetect during preprocessing.
<li>Response Selection: A response is chosen based on the predicted intent and detected language.
</ul>
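
The response-selection step can be sketched as a lookup with a language fallback. This is a minimal illustration under assumed data: the `pick_response` helper, its parameters, and the sample replies are hypothetical and not part of the notebook.

```python
import random

def pick_response(responses, lang, default_lang="en", rng=random):
    """Pick a reply in `lang`, falling back to `default_lang`, then to any language.

    `responses` maps a language code to a list of reply strings.
    Returns None when no reply exists at all.
    """
    for code in (lang, default_lang):
        candidates = responses.get(code)
        if candidates:
            return rng.choice(candidates)
    for candidates in responses.values():
        if candidates:
            return rng.choice(candidates)
    return None

# Made-up replies for a hypothetical "hours" intent.
hours = {
    "en": ["We open at 9 am.", "Our doors open at 9 am."],
    "sw": [],
}
print(pick_response(hours, "sw"))  # no Swahili reply, so an English one is used
```

Returning `None` instead of raising lets the caller decide which fallback message to show.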

###User Interface
<ul>
<li>Gradio Interface: A user-friendly interface is created using Gradio, allowing users to interact with the chatbot via text input.
</ul>

###Testing Process
###Unit Testing
<ul>
<li>Language Detection: Test the preprocess_and_detect_language function with various inputs to ensure it correctly identifies English and Swahili.
<li>Preprocessing: Verify that the same function correctly tokenizes, lowercases, and stems its input.
<li>Intent Prediction: Test the neural network's ability to predict the correct intent for a variety of input patterns.
</ul>
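
A unit test along these lines could look like the sketch below, written with the standard `unittest` module. To stay self-contained it tests a small hypothetical helper, `normalize_lang`, which mirrors the notebook's rule of treating any language other than English or Swahili as English; the helper and test names are assumptions, not existing project code.

```python
import unittest

SUPPORTED = ("en", "sw")

def normalize_lang(code, default="en"):
    """Map a detector's language code onto the languages the bot supports."""
    return code if code in SUPPORTED else default

class NormalizeLangTest(unittest.TestCase):
    def test_supported_codes_pass_through(self):
        self.assertEqual(normalize_lang("en"), "en")
        self.assertEqual(normalize_lang("sw"), "sw")

    def test_unsupported_code_falls_back_to_default(self):
        self.assertEqual(normalize_lang("fr"), "en")
        self.assertEqual(normalize_lang("fr", default="sw"), "sw")

def run_tests():
    """Run the test case without exiting the interpreter (handy in notebooks)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeLangTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)

result = run_tests()
print("all passed:", result.wasSuccessful())
```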

###Integration Testing
<ul>
<li>End-to-End Testing: Simulate user interactions through the Gradio interface to ensure the entire system works cohesively.
<li>Response Accuracy: Check if the chatbot provides accurate and contextually appropriate responses based on the input language and intent.
</ul>
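
Response accuracy can be quantified as the share of labelled test inputs for which the predicted intent matches the expected one. The sketch below assumes a `predict` callable that maps text to an intent tag; the `keyword_predictor` stand-in and the three examples are made up for illustration and are not the trained model.

```python
def intent_accuracy(predict, labelled_examples):
    """Fraction of (text, expected_tag) pairs that `predict` gets right."""
    if not labelled_examples:
        return 0.0
    correct = sum(1 for text, expected in labelled_examples if predict(text) == expected)
    return correct / len(labelled_examples)

def keyword_predictor(text):
    """Stand-in for the trained model: a single keyword rule."""
    return "booking" if "book" in text.lower() else "greeting"

examples = [
    ("Hello", "greeting"),
    ("I want to book", "booking"),
    ("What are your prices?", "prices"),
]
accuracy = intent_accuracy(keyword_predictor, examples)
print(f"accuracy: {accuracy:.2f}")
```

For a meaningful figure, the labelled examples should be held out from the training data rather than reused from it.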

###Performance Testing
<ul>
<li>Response Time: Measure the time it takes for the chatbot to process an input and generate a response to ensure it meets performance expectations.
<li>Scalability: Test the chatbot with a large number of queries to evaluate its performance under load.
</ul>
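
One way to measure response time is to wrap calls in `time.perf_counter`. The sketch below is a rough illustration: `measure_latency` and the `fake_bot` stand-in are hypothetical names, and in the notebook the function being timed would be `chatbot_response`.

```python
import statistics
import time

def measure_latency(fn, inputs, repeats=3):
    """Call fn on every input `repeats` times and summarize the wall-clock timings."""
    timings = []
    for text in inputs:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(text)
            timings.append(time.perf_counter() - start)
    return {
        "count": len(timings),
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
    }

def fake_bot(text):
    """Stand-in for the real chatbot function."""
    return text.upper()

stats = measure_latency(fake_bot, ["hello", "bei ni ngapi?"], repeats=5)
print(stats)
```

Passing a longer list of inputs gives a first impression of behaviour under load, although it does not simulate concurrent users.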

###User Acceptance Testing
Scenario Testing: Test the chatbot with various scenarios, including edge cases and ambiguous inputs, to see how it handles unexpected situations.

###Challenges in Testing
<ul>
<li>Language Ambiguity: Handling inputs that could be interpreted in multiple languages or have ambiguous meanings.
<li>Intent Overlap: Dealing with inputs that could belong to multiple intents, requiring the chatbot to make a nuanced decision.
<li>Data Variability: Ensuring the test data covers a wide range of possible inputs to thoroughly evaluate the chatbot's performance.
</ul>
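
Intent overlap can be made visible by looking at the margin between the two highest scores rather than only the top one. The sketch below is one possible approach, not something the notebook currently does; the `classify_with_margin` helper and both threshold values are assumptions chosen for illustration.

```python
def classify_with_margin(scores, labels, min_score=0.5, min_margin=0.15):
    """Return the top label, or None when the prediction is weak or ambiguous.

    scores: one score per label (for example the network's output layer).
    min_score: lowest acceptable top score (illustrative value).
    min_margin: required gap between the best and second-best score (illustrative value).
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    top_score, top_label = ranked[0]
    runner_up = ranked[1][0] if len(ranked) > 1 else 0.0
    if top_score < min_score or top_score - runner_up < min_margin:
        return None
    return top_label

labels = ["booking", "hours", "prices"]
print(classify_with_margin([0.90, 0.10, 0.20], labels))  # booking
print(classify_with_margin([0.62, 0.58, 0.10], labels))  # None (two intents too close)
```

A `None` result could route the user to the fallback message or to a clarifying question.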

###Areas for Improvement in Testing
<ul>
<li>Automated Testing: Implement automated tests for language detection, preprocessing, and intent prediction to catch regressions early.
<li>Expanded Test Data: Create a comprehensive test dataset that includes diverse examples, edge cases, and mixed-language inputs.
<li>Continuous Integration: Integrate testing into the development pipeline to automatically run tests with each code change.
<li>User Feedback Loop: Establish a feedback loop with users to continuously gather insights and improve the chatbot based on real-world usage.
<li>Error Logging: Implement detailed error logging to capture and analyze failures during testing and live usage.
</ul>
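
For error logging, Python's standard `logging` module can record the full traceback together with the input that triggered it. The sketch below is a minimal illustration; the logger name `luxe_chatbot` and the `safe_reply` wrapper are hypothetical, and the notebook itself currently reports errors with `print`.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("luxe_chatbot")  # hypothetical logger name

def safe_reply(handler, user_input):
    """Call handler(user_input); log the traceback and return an apology on failure."""
    try:
        return handler(user_input)
    except Exception:
        # logger.exception records the message at ERROR level plus the traceback.
        logger.exception("Failed to answer input %r", user_input)
        return "Sorry, something went wrong."

print(safe_reply(lambda text: text.upper(), "hello"))  # HELLO
print(safe_reply(lambda text: 1 / 0, "hello"))         # Sorry, something went wrong.
```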

###Challenges
<ul>
<li>Language Detection: Accurately detecting the input language, especially when inputs are short or contain mixed languages, can be challenging.
<li>Dataset Quality: Ensuring the dataset is comprehensive and accurately labeled is crucial for model performance.
<li>Model Complexity: Balancing model complexity with performance to ensure real-time response capability.
<li>Handling Ambiguity: Dealing with ambiguous user inputs that could belong to multiple intents.
</ul>

###Areas for Improvement
<ul>
<li>Enhanced Language Detection: Implement more sophisticated language detection algorithms to improve accuracy, especially for mixed-language inputs.
<li>Dataset Expansion: Expand the dataset to include more diverse examples and languages if needed.
<li>Model Optimization: Experiment with different neural network architectures and hyperparameters to improve accuracy and response time.
<li>User Feedback Integration: Incorporate user feedback to continuously improve the chatbot's performance and response quality.
<li>Error Handling: Improve error handling for cases where the input language is not clearly identifiable or the intent is ambiguous.
<li>Multilingual Support: Consider extending support to additional languages if the user base expands.
</ul>
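
For very short inputs, statistical language detectors have little text to work with, so one simple complement is a keyword vote. The sketch below is only an illustration of that idea: the hint-word list, the 0.3 threshold, and the `guess_language` helper are assumptions, and a real list would need to be curated (including Sheng terms) from actual user messages.

```python
# A tiny, hand-picked set of common Swahili words (illustrative, far from complete).
SWAHILI_HINTS = {"habari", "asante", "tafadhali", "bei", "ngapi", "nataka", "saa", "kucha"}

def guess_language(text, default="en", threshold=0.3):
    """Return 'sw' when enough tokens are known Swahili hint words, else the default."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return default
    hits = sum(1 for t in tokens if t in SWAHILI_HINTS)
    return "sw" if hits / len(tokens) >= threshold else default

print(guess_language("Bei ni ngapi?"))       # sw
print(guess_language("What is the price?"))  # en
```

Such a vote could be consulted only when the statistical detector fails or returns a language the bot does not support.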

Installing Dependencies

In [1]:
# Install required packages
!pip install pandas numpy nltk langdetect gradio


Importing Libraries

In [2]:
# Import dependencies
import pandas as pd  # Structured data such as CSV files or spreadsheets.
import numpy as np  # Multi-dimensional arrays and mathematical functions.
import nltk  # Natural-language processing toolkit (tokenization, stemming).
import random  # Random choices, used to vary the chatbot's replies.
import gradio as gr  # Web UI for the chatbot.
from nltk.stem import LancasterStemmer  # An aggressive stemmer that reduces words to a root form.
# detect: guesses the language of a text string.
# DetectorFactory: configures detection (seeded below for reproducible results).
from langdetect import detect, DetectorFactory
# LangDetectException: raised when detection fails, e.g. on empty or gibberish text.
from langdetect.lang_detect_exception import LangDetectException

# Download NLTK tokenizer
nltk.download('punkt')

# Set seed for consistent language detection
DetectorFactory.seed = 0
stemmer = LancasterStemmer()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Loading Dataset

In [3]:
def load_dataset(file_path):
    try:
        df = pd.read_csv(file_path)
        data = {"intents": []}  # Final structure: one entry per intent.

        # Group the rows by the 'Intent' column. Each iteration yields the intent
        # name (e.g. "greeting") and the subset of rows that belong to it.
        for intent, group_data in df.groupby('Intent'):
            intent_entry = {
                "tag": intent,
                "patterns": [],
                "responses": {"en": [], "sw": []}
            }

            for _, row in group_data.iterrows():  # Each row of this intent's group.
                intent_entry["patterns"].append(row['Pattern'])  # User input pattern.
                response_text = row['Response']  # The chatbot's reply for this row.
                try:
                    lang = detect(response_text)  # Guess the response language with langdetect.
                    if lang not in ['en', 'sw']:
                        # Fallback for any other detected language. Swahili is also written
                        # in Latin letters, so this check effectively picks 'en' whenever
                        # the text contains an ASCII letter.
                        lang = 'en' if any(c in 'abcdefghijklmnopqrstuvwxyz' for c in response_text.lower()) else 'sw'
                except LangDetectException:  # Detection failed: default to English.
                    lang = 'en'
                intent_entry["responses"][lang].append(response_text)  # File under "en" or "sw".

            data["intents"].append(intent_entry)  # Store the completed intent.
        return data #Return the final data dictionary, structured for chatbot training.

    except Exception as e:
        print(f"Error loading dataset: {e}")
        return None #If anything goes wrong during file reading or processing, print the error and return None.

dataset = load_dataset('nail_salon_chatbot_100questions_dataset.csv')


Text Preprocessing

In [4]:
def preprocess_and_detect_language(sentence):
    """Return (stemmed lowercase tokens, detected language code) for a user sentence."""
    try:
        # langdetect guesses the language, e.g. 'sw' for Swahili or 'en' for English.
        # Very short inputs can be misclassified.
        lang = detect(sentence)
    except LangDetectException:
        # Detection failed (e.g. empty or gibberish input): fall back to English.
        lang = 'en'

    # Split the sentence into tokens, e.g. "Hi there!" -> ['Hi', 'there', '!'].
    words = nltk.word_tokenize(sentence)

    # Lowercase and stem each token (e.g. "running" -> "run"), using the
    # LancasterStemmer instance created in the import cell.
    stemmed_words = [stemmer.stem(word.lower()) for word in words]
    return stemmed_words, lang

Creating Training Data

In [5]:
# The packages were installed in the first cell, so only the extra NLTK resource is needed here.
# Recent NLTK versions require 'punkt_tab' for word_tokenize.
nltk.download('punkt_tab')

def create_training_data(data): #Takes data (like the dataset we prepared earlier) and converts it into machine learning-friendly format.
    words = [] #(words → all the words in patterns, labels → all unique intent tags (e.g., greeting, booking, etc.), docs → list of tuples: (pattern words, tag))
    labels = []
    docs = []

    for intent in data['intents']:#(Loops over each intent.Tokenizes each pattern (user phrase) into words.Adds these words to the global words list.Saves (tokenized pattern, tag) into docs.)
        for pattern in intent['patterns']:
            wrds = nltk.word_tokenize(pattern)
            words.extend(wrds)
            docs.append((wrds, intent['tag']))

        if intent['tag'] not in labels: #Adds the tag (intent label) to the labels list if it's not already there.
            labels.append(intent['tag'])

    # Lowercase and stem every token, dropping "?". Inflected forms such as
    # "running" and "runs" collapse to one stem.
    words = [stemmer.stem(w.lower()) for w in words if w != "?"]
    # Deduplicate with set() and sort, giving the vocabulary of unique stems.
    words = sorted(list(set(words)))

    training = []#training will hold "bag of words" vectors for input
    output = [] #output will hold one-hot vectors representing the intent tag

    for doc in docs:
        # Stem this pattern's tokens once, exactly as is done at prediction time.
        doc_stems = {stemmer.stem(word.lower()) for word in doc[0]}
        # Binary bag-of-words vector: 1 if the vocabulary stem occurs in the pattern, else 0.
        # The vocabulary is already stemmed, so it is not stemmed a second time here.
        bag = [1 if w in doc_stems else 0 for w in words]

        output_row = [0] * len(labels)#(Creates a one-hot encoded vector for the output intent:If the intent is the second label, output would be: [0, 1, 0, 0, ...])
        output_row[labels.index(doc[1])] = 1

        training.append(bag)#Appends the bag and label vector to the training set.
        output.append(output_row)

    return np.array(training), np.array(output), words, labels #(training as X_train – input features, output as y_train – labels, words – the vocabulary used, labels – the intent classes)

X_train, y_train, all_words, tags = create_training_data(dataset)



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Building Neural Network

In [7]:
# --- Neural Network Training ---
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_model(X, y, hidden_neurons=10, alpha=0.1, epochs=10000):
    """Train a basic two-layer neural network (input -> hidden -> output).

    X: input data (bag-of-words vectors)
    y: output labels (one-hot vectors)
    hidden_neurons: number of neurons in the hidden layer
    alpha: learning rate (controls the weight update step size)
    epochs: number of training iterations
    """

    np.random.seed(1) #Sets a seed so that weight initialization is consistent across runs.
    input_neurons = X.shape[1] #Number of input and output neurons depends on input feature size and output classes.
    output_neurons = y.shape[1]

    synapse_0 = 2 * np.random.random((input_neurons, hidden_neurons)) - 1 #(synapse_0: Weights between input → hidden layer synapse_1: Weights between hidden → output layer. Random values between -1 and 1)
    synapse_1 = 2 * np.random.random((hidden_neurons, output_neurons)) - 1

    for epoch in range(epochs): #Run the learning steps for a fixed number of epochs.
        layer_0 = X # input features
        layer_1 = sigmoid(np.dot(layer_0, synapse_0)) # hidden layer activations
        layer_2 = sigmoid(np.dot(layer_1, synapse_1)) #output predictions (probabilities of each intent)

        layer_2_error = y - layer_2 #Difference between true labels and predictions.
        layer_2_delta = layer_2_error * (layer_2 * (1 - layer_2)) #Derivative of sigmoid (for gradient) = sigmoid(x) * (1 - sigmoid(x))

        layer_1_error = layer_2_delta.dot(synapse_1.T) #Same backpropagation logic applied to the hidden layer
        layer_1_delta = layer_1_error * (layer_1 * (1 - layer_1))

        synapse_1 += alpha * layer_1.T.dot(layer_2_delta) #(Weights are adjusted using gradient descent:Multiply the error gradient with learning rate and dot product with the previous layer’s output.This makes the network slightly more accurate each time.)
        synapse_0 += alpha * layer_0.T.dot(layer_1_delta)

    return synapse_0, synapse_1 #These trained weights (synapse_0, synapse_1) will be used for making predictions with the chatbot.


Fine Tuning the Model

In [8]:
def fine_tune_model(new_data, synapse_0, synapse_1, words, tags, epochs=1000, alpha=0.1): # new_data - A list of new patterns and their correct intent tags (like extra training examples)
    new_training = []#stores the new training input and output data.
    new_output = []

    for entry in new_data: #Tokenizes and stems each new input phrase (just like in training).
        pattern = entry['pattern']
        tag = entry['tag']
        wrds = nltk.word_tokenize(pattern)
        pattern_words = [stemmer.stem(w.lower()) for w in wrds]

        # Entries whose tag the model was not trained on are skipped before anything
        # is appended, so the input and output lists stay aligned.
        if tag not in tags:
            print(f"[!] Tag '{tag}' not in existing tags. Skipping.")
            continue

        # One-hot vector for the tag.
        output_row = [0] * len(tags)
        output_row[tags.index(tag)] = 1

        # Bag-of-words vector: 1 for each known word present in the input.
        bag = [1 if w in pattern_words else 0 for w in words]
        new_training.append(bag)
        new_output.append(output_row)

#Converts lists into arrays for training math.
    X_new = np.array(new_training)
    y_new = np.array(new_output)

#The fine tuning loop is the exact same backpropagation logic as the main training function. The key difference is: you're only updating the model with new examples.
    for _ in range(epochs):
        layer_0 = X_new
        layer_1 = sigmoid(np.dot(layer_0, synapse_0))
        layer_2 = sigmoid(np.dot(layer_1, synapse_1))

        layer_2_error = y_new - layer_2
        layer_2_delta = layer_2_error * (layer_2 * (1 - layer_2))

        layer_1_error = layer_2_delta.dot(synapse_1.T)
        layer_1_delta = layer_1_error * (layer_1 * (1 - layer_1))

        synapse_1 += alpha * layer_1.T.dot(layer_2_delta)
        synapse_0 += alpha * layer_0.T.dot(layer_1_delta)

#Returns the fine-tuned weights, ready to use for predictions.
    return synapse_0, synapse_1


Test for Accuracy

In [9]:
# Train the model with train_model(); synapse_0 and synapse_1 hold the trained weights used for prediction.
synapse_0, synapse_1 = train_model(X_train, y_train)

# --- Response Generation ---
def get_response(intent_tag, input_lang):

    # Find the intent in the dataset that matches intent_tag.
    for intent in dataset['intents']:
        if intent['tag'] == intent_tag:
            # Prefer responses in the requested language ('en' or 'sw').
            lang_responses = intent['responses'].get(input_lang, [])
            # If there are none, use whichever language has responses, English first.
            if not lang_responses:
                lang_responses = intent['responses']['en'] or intent['responses']['sw']
            # Pick a response at random so replies are not repetitive.
            return random.choice(lang_responses)

    # Fallback. Used when no matching intent is found.
    fallback = {
        "en": "I'm sorry, I don't have information on that. Please visit our website https://luxenails.co.ke or contact Luxe Nails directly.",
        "sw": "Samahani, sina habari kuhusu hayo. Tafadhali tembelea tovuti yetu https://luxenails.co.ke au wasiliana na Luxe Nails moja kwa moja.",

    }
    return fallback.get(input_lang, fallback["en"])

# Main function: returns the chatbot's response to a user message.
def chatbot_response(user_input, language="en"):
    try:
        # Tokenize and stem the input. The language is also detected here, but the
        # reply language is taken from the `language` argument chosen in the UI.
        processed_input, detected_lang = preprocess_and_detect_language(user_input)

        # Bag-of-words vector of the input over the training vocabulary.
        bag = [1 if word in processed_input else 0 for word in all_words]

        # Forward pass only (no backpropagation). layer_2 holds one sigmoid score
        # between 0 and 1 per intent.
        layer_0 = np.array(bag)
        layer_1 = sigmoid(np.dot(layer_0, synapse_0))
        layer_2 = sigmoid(np.dot(layer_1, synapse_1))

        # Index of the highest-scoring intent, mapped back to its tag name.
        intent_tag = tags[np.argmax(layer_2)]
        # Fetch a response for that intent in the requested language.
        response = get_response(intent_tag, language)
        return response
    #Prevents the chatbot from crashing if something unexpected happens.
    except Exception as e:
        print(f"Error: {e}")
        return "Sorry, something went wrong."

Chatbot Interface

In [10]:
# --- Gradio Interface ---
# Language Bridge Function
def chatbot_interface(Questions:str, language:str):

    # Maps the dropdown labels ("English", "Swahili") to the language codes the chatbot uses ("en", "sw").
    lang_map = {"English": "en", "Swahili": "sw"}

    #Calls response function. It feeds the question and mapped language code to your chatbot_response() function.
    return chatbot_response(Questions, lang_map[language])

iface = gr.Interface(
    fn=chatbot_interface,  # Called whenever the user submits a question.
    inputs=["text", gr.Dropdown(["English", "Swahili"], value="English")],  # Question text box and a language dropdown.
    outputs="text",  # A single text output (the chatbot's response).
    title="💅 Luxe Nails Chatbot",  # Title shown on the interface.
    description="Ask me anything about Luxe Nails! Services, prices, appointments, and more."
)

iface.launch()  # Launches the web UI; in Colab a temporary public share link is created.


Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://0ed3ba8141c8067430.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


