# Ubuntu Automated Customer Service

Project Overview

This project is focused on creating a conversational AI system designed to automate customer service for Ubuntu users. By leveraging the Ubuntu Dialogue Corpus, the system will be trained to understand customer queries and offer automated solutions. The project encompasses several key phases, including data preprocessing, development and training of a natural language processing (NLP) model, and the integration of this model into a chatbot interface that users can interact with.

Data Source:
- The dataset for this project, the Ubuntu Dialogue Corpus, is available on Kaggle. [Link](https://www.kaggle.com/datasets/rtatman/ubuntu-dialogue-corpus)

Project Goals:
- To preprocess the Ubuntu Dialogue Corpus data for NLP.
- To build and train an NLP model capable of understanding and responding to user queries.
- To integrate the trained NLP model into a chatbot interface for automated customer service.
- To evaluate the effectiveness and accuracy of the conversational AI system in handling real-world user queries.

Steps:

1. **Data Acquisition and Preprocessing**:
- Download the Ubuntu Dialogue Corpus from Kaggle.
- Clean and preprocess the data to format it suitably for NLP tasks. This may include tokenization, removing stop words, and stemming or lemmatization.

2. **Model Development**:
- Select an appropriate NLP model architecture that can process the conversational data effectively. This could involve sequence-to-sequence models, transformers, or other architectures suitable for handling dialogue.
- Implement the model using a machine learning framework such as TensorFlow or PyTorch.

3. **Training**:
- Train the model on the preprocessed Ubuntu Dialogue Corpus, adjusting parameters and structures as necessary to improve performance.
- Use a portion of the data for validation to monitor the model's performance and prevent overfitting.

4. **Chatbot Integration**:
- Develop a chatbot interface that can interact with users in real-time. This interface should be capable of processing user inputs, passing them to the trained NLP model, and displaying the model's responses.
- Ensure the chatbot interface is user-friendly and can handle a variety of query types.

5. **Evaluation and Testing**:
- Test the conversational AI system with a set of predefined queries to assess its response accuracy and relevance.
- Optionally, conduct user testing with real users to gather feedback on the system's performance and identify areas for improvement.

6. **Iteration and Improvement**:
- Based on testing feedback and performance evaluations, make necessary adjustments to the model and chatbot interface.
- Explore advanced NLP techniques and model architectures to enhance the system's understanding and response capabilities.

# Step 1: Data Acquisition and Preprocessing

In [1]:
import pandas as pd
import os
import numpy as np

1.1 Load Data:

In [2]:
def load_data():
    # Read only the first 10 rows of three corpus files to keep the demo small
    df2 = pd.read_csv('Ubuntu-dialogue-corpus/dialogueText.csv', nrows=10)
    df3 = pd.read_csv('Ubuntu-dialogue-corpus/dialogueText_301.csv', nrows=10)
    df4 = pd.read_csv('Ubuntu-dialogue-corpus/dialogueText_196.csv', nrows=10)
    # Stack the three samples and keep only the message text
    df = pd.concat([df2, df3, df4], ignore_index=True)
    df = df.drop(['folder', 'dialogueID', 'date', 'from', 'to'], axis=1)
    return df
df = load_data()
df.head()

Unnamed: 0,text
0,"Hello folks, please help me a bit with the fol..."
1,Did I choose a bad channel? I ask because you ...
2,the second sentence is better english and we...
3,Sock Puppe?t
4,WTF?


1.2 Data Inspection:


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    30 non-null     object
dtypes: object(1)
memory usage: 372.0+ bytes


1.3 Data Preprocessing:


In [4]:
# !pip install nltk
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
#nltk.download()
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
nltk.download('wordnet')
stemmer= PorterStemmer()    
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\CENTER_ELRahama\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\CENTER_ELRahama\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\CENTER_ELRahama\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [5]:
def preprocess_text(text):
    # Lowercase and strip everything except letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    tokens = word_tokenize(text)
    # Stem each token, dropping stop words and tokens of one or two characters
    stemmed = [
        stemmer.stem(token)
        for token in tokens
        if (len(token) > 2) and (token not in stop_words)
    ]
    return " ".join(stemmed)

df['cleaned_text'] = df['text'].apply(preprocess_text)
df.head()

Unnamed: 0,text,cleaned_text
0,"Hello folks, please help me a bit with the fol...",hello folk pleas help bit follow sentenc order...
1,Did I choose a bad channel? I ask because you ...,choos bad channel ask seem dumb like window user
2,the second sentence is better english and we...,second sentenc better english dumb
3,Sock Puppe?t,sock puppet
4,WTF?,wtf


In [6]:
df.dropna(inplace=True)
df.isna().sum()

text            0
cleaned_text    0
dtype: int64

In [17]:
input_texts = df['cleaned_text'].tolist()[:-1]
target_texts = df['cleaned_text'].tolist()[1:]
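
Pairing each row with the next one treats the whole frame as a single conversation, so the last message of one dialogue becomes the input for the first message of the next. A minimal sketch of pairing within each dialogue instead, assuming the `dialogueID` column is kept rather than dropped in `load_data`. The `make_pairs` helper and the toy frame are hypothetical, not part of the notebook:

```python
import pandas as pd

def make_pairs(frame, text_col="cleaned_text", group_col="dialogueID"):
    """Pair each message with the next message of the same dialogue."""
    paired = frame.copy()
    # shift(-1) inside each group: the last message of a dialogue gets NaN
    paired["target"] = paired.groupby(group_col)[text_col].shift(-1)
    # Drop rows that have no following message in their dialogue
    paired = paired.dropna(subset=["target"])
    return paired[text_col].tolist(), paired["target"].tolist()

# Toy data standing in for the real corpus (hypothetical dialogue IDs)
toy = pd.DataFrame({
    "dialogueID": ["a.tsv", "a.tsv", "a.tsv", "b.tsv", "b.tsv"],
    "cleaned_text": ["hello folk", "choos bad channel", "sock puppet",
                     "wtf", "second sentenc"],
})
pair_inputs, pair_targets = make_pairs(toy)
for source, reply in zip(pair_inputs, pair_targets):
    print(f"{source!r} -> {reply!r}")
```

With this grouping, "sock puppet" (the end of the first toy dialogue) is never used as the prompt for "wtf" (the start of the second).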

Transformers, like other sequence-to-sequence models, are trained to predict the next word of the response given the input sequence and the response words seen so far. This requires preparing the target sequences in a specific way:

- Input Sequences: The input sequences remain as they are.

- Response Sequences: These sequences are used to create both the decoder input and the decoder output sequences.

    + Decoder Input: The target sequence shifted right by one position, so it begins with the start token and omits the final token. At each step the model sees the words so far and predicts the next one.

    + Decoder Output: This is the actual target sequence which the model should predict.

Here's how this looks in practice:

For a given target sequence: [START, How, are, you, doing, ?]

Decoder Input: [START, How, are, you, doing]

Decoder Output: [How, are, you, doing, ?]
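
The shift above only works if every target text carries explicit start and end markers. A minimal sketch of adding them and splitting one target into decoder input and output. The marker words `startseq` and `endseq` are assumptions chosen for illustration; plain words are used because the Keras `Tokenizer` default `filters` strip the `<` and `>` characters:

```python
START, END = "startseq", "endseq"  # hypothetical marker words

def add_markers(text):
    """Wrap a cleaned target text in start and end markers."""
    return f"{START} {text} {END}"

def split_decoder_sequences(tokens):
    """Return (decoder_input, decoder_output) for one target token list."""
    # Input drops the last token; output drops the first (the start marker)
    return tokens[:-1], tokens[1:]

target = add_markers("how are you doing").split()
decoder_input, decoder_output = split_decoder_sequences(target)
print(decoder_input)
print(decoder_output)
```

At position `i`, the model reads `decoder_input[:i + 1]` and is trained to emit `decoder_output[i]`.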

- 1.4- `Tokenization` using Tokenizer class and fit_on_texts method

    - Convert texts to tokens 

- 1.5- `Sequences` using texts_to_sequences method

    - Convert tokens to Sequences, return sequences

- 1.6- `Padding` using pad_sequences method

    - Convert Sequences to pad sequences
    - Ensuring all sequences have the same length.
    - Return pad_sequences

- 1.7-  `Split 'target_pad_sequences'` into target_pad_sequences_input and target_pad_sequences_output
    - Needed when building a transformer
    - Prepares the decoder input and decoder output sequences

- 1.8- `Split data` using train_test_split method

    - Perform the train-test split on the padded sequences with 

In [57]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# Convert texts to tokens 
max_vocab_size = 20000
tokenizer = Tokenizer(num_words=max_vocab_size,
                      oov_token='<OOV>')  # placeholder for out-of-vocabulary words
tokenizer.fit_on_texts(input_texts + target_texts)

# Convert tokens to Sequences
input_sequences = tokenizer.texts_to_sequences(input_texts)
target_sequences = tokenizer.texts_to_sequences(target_texts)

# Convert sequences to padded sequences
max_sequence_length = 50
input_pad_sequences = pad_sequences(input_sequences, 
                                maxlen=max_sequence_length, 
                                padding='post')
target_pad_sequences = pad_sequences(target_sequences, 
                                maxlen=max_sequence_length, 
                                padding='post')

# Split 'target_pad_sequences' to prepare Decoder Input and Output Sequences
target_pad_sequences_input = target_pad_sequences[:, :-1]
target_pad_sequences_output = target_pad_sequences[:, 1:]

# Split data for training and validation
input_train, input_val, target_train_input, target_val_input, target_train_output, target_val_output = train_test_split(
    input_pad_sequences, 
    target_pad_sequences_input,
    target_pad_sequences_output,
    test_size=0.2,
    random_state=42
)

print(f"Shape of input train: {input_train.shape} and input val: {input_val.shape}")
print(f"Shape of target train input: {target_train_input.shape} and target val input : {target_val_input.shape}")
print(f"Shape of target train output: {target_train_output.shape} and target val output: {target_val_output.shape}")

Shape of input train: (23, 50) and input val: (6, 50)
Shape of target train input: (23, 49) and target val input : (6, 49)
Shape of target train output: (23, 49) and target val output: (6, 49)


# Step 2: Model Development

2.2- Define the Model Architecture and Implementation for automated chatbot customer service:

2.2.1- Encoder: Processes the input text.

- Input: Tokenized and padded input sequences.
- Transformer Encoder Layers: Stack multiple layers of transformer encoder blocks.
- Each encoder block consists of:
    - Multi-head self-attention mechanism.
    - Feedforward neural network.

2.2.2- Decoder: Generates the response text.

- Input: Tokenized and padded target sequences (shifted right).
- Transformer Decoder Layers: Stack multiple layers of transformer decoder blocks.
- Each decoder block consists of:
    - Multi-head self-attention mechanism (for target).
    - Multi-head attention over the encoder outputs.
    - Feedforward neural network.
- Output: Probability distribution over the vocabulary for generating responses.

2.2.3- Attention Mechanism: Lets each position weigh the other positions when building its representation.

- Self-attention in the encoder and decoder relates tokens within the same sequence; attention over the encoder outputs lets the decoder focus on the relevant parts of the input sequence.
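
To make the attention step concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention with an optional causal mask. It is an illustration of the general technique, not the Keras `MultiHeadAttention` implementation, and the function name is hypothetical:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v, causal=False):
    """q: (Tq, d), k: (Tk, d), v: (Tk, dv). Returns (output, weights)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)              # similarity of each query to each key
    if causal:
        # Position i may only attend to positions <= i
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # 4 tokens, 8-dimensional embeddings
out, w = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)
print(np.round(w, 2))
```

With `causal=True` the weight matrix is lower triangular, which is what stops a decoder position from looking at the words it is supposed to predict.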


Import necessary libraries:


In [49]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

Define model parameters:


In [50]:
vocab_size = len(tokenizer.word_index) + 1  # Add 1 for padding token
embed_dim = 256
num_heads = 8
ff_dim = 512
num_transformer_blocks = 4
dropout_rate = 0.1

Create the transformer block:

In [51]:
def transformer_block(inputs, num_heads, ff_dim, dropout_rate=0.1):
    # Note: embed_dim is read from the notebook's global scope, not passed in
    # Self-attention sub-layer with residual connection and layer normalization
    attention_output = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)(inputs, inputs)
    attention_output = layers.Dropout(dropout_rate)(attention_output)
    out1 = layers.LayerNormalization(epsilon=1e-6)(inputs + attention_output)
    
    # Feedforward sub-layer, projected back to embed_dim for the residual sum
    ffn_output = layers.Dense(ff_dim, activation="relu")(out1)
    ffn_output = layers.Dense(embed_dim)(ffn_output)
    ffn_output = layers.Dropout(dropout_rate)(ffn_output)
    
    return layers.LayerNormalization(epsilon=1e-6)(out1 + ffn_output)


Build the encoder:


In [52]:
def build_encoder(vocab_size, embed_dim, num_heads, ff_dim, num_transformer_blocks, max_sequence_length, dropout_rate=0.1):
    inputs = layers.Input(shape=(max_sequence_length,))
    embedding_layer = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
    x = embedding_layer(inputs)
    x = layers.Dropout(dropout_rate)(x)
    
    for _ in range(num_transformer_blocks):
        x = transformer_block(x, num_heads, ff_dim, dropout_rate)
    
    return keras.Model(inputs=inputs, outputs=x, name="encoder")

Build the decoder:


In [53]:
def build_decoder(vocab_size, embed_dim, num_heads, ff_dim, num_transformer_blocks, max_sequence_length, dropout_rate=0.1):
    inputs = layers.Input(shape=(max_sequence_length,))
    embedding_layer = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
    x = embedding_layer(inputs)
    x = layers.Dropout(dropout_rate)(x)
    
    for _ in range(num_transformer_blocks):
        x = transformer_block(x, num_heads, ff_dim, dropout_rate)
    
    x = layers.Dense(vocab_size, activation="softmax")(x)
    
    return keras.Model(inputs=inputs, outputs=x, name="decoder")

Build the full transformer model:

In [58]:
def build_transformer(vocab_size, embed_dim, num_heads, ff_dim, num_transformer_blocks, max_sequence_length, dropout_rate=0.1):
    encoder_inputs = layers.Input(shape=(max_sequence_length,), name="encoder_inputs")
    decoder_inputs = layers.Input(shape=(max_sequence_length - 1,), name="decoder_inputs")
    
    encoder = build_encoder(vocab_size, embed_dim, num_heads, ff_dim, num_transformer_blocks, max_sequence_length, dropout_rate)
    decoder = build_decoder(vocab_size, embed_dim, num_heads, ff_dim, num_transformer_blocks, max_sequence_length - 1, dropout_rate)
    
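    encoder_outputs = encoder(encoder_inputs)
    # NOTE: encoder_outputs is not passed to the decoder, so as written the
    # prediction depends only on decoder_inputs, not on the user's input text.
    decoder_outputs = decoder(decoder_inputs)
    
    # NOTE: build_decoder already ends in a softmax layer, so this applies a second one.
    outputs = layers.Dense(vocab_size, activation="softmax")(decoder_outputs)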
    
    model = keras.Model(inputs=[encoder_inputs, decoder_inputs], outputs=outputs, name="transformer")
    return model
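
The decoder described in 2.2.2 has two attention steps: masked self-attention over the target and attention over the encoder outputs. A minimal sketch of such a block in Keras follows. The `decoder_block` function, the `key_dim = embed_dim // num_heads` choice, and the small demo sizes are assumptions for illustration, and `use_causal_mask` assumes TensorFlow 2.10 or later:

```python
from tensorflow import keras
from tensorflow.keras import layers

def decoder_block(x, encoder_outputs, embed_dim, num_heads, ff_dim, dropout_rate=0.1):
    key_dim = embed_dim // num_heads  # assumed per-head size
    # 1) Masked self-attention: each target position sees only earlier positions
    self_attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(
        x, x, use_causal_mask=True)
    self_attn = layers.Dropout(dropout_rate)(self_attn)
    x = layers.LayerNormalization(epsilon=1e-6)(x + self_attn)
    # 2) Cross-attention: queries from the decoder, keys/values from the encoder
    cross_attn = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(
        x, encoder_outputs)
    cross_attn = layers.Dropout(dropout_rate)(cross_attn)
    x = layers.LayerNormalization(epsilon=1e-6)(x + cross_attn)
    # 3) Position-wise feedforward network
    ffn = layers.Dense(ff_dim, activation="relu")(x)
    ffn = layers.Dense(embed_dim)(ffn)
    ffn = layers.Dropout(dropout_rate)(ffn)
    return layers.LayerNormalization(epsilon=1e-6)(x + ffn)

# Shape check on already-embedded inputs (demo sizes are arbitrary)
enc = layers.Input(shape=(50, 32), name="demo_encoder_outputs")
dec = layers.Input(shape=(49, 32), name="demo_decoder_embeddings")
demo_out = decoder_block(dec, enc, embed_dim=32, num_heads=4, ff_dim=64)
demo = keras.Model([enc, dec], demo_out, name="decoder_block_demo")
print(demo.output_shape)
```

A full model along these lines would stack several such blocks on the embedded decoder inputs and finish with a single `Dense(vocab_size, activation="softmax")` layer.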

Create and compile the model:

In [59]:
model = build_transformer(vocab_size, embed_dim, num_heads, ff_dim, num_transformer_blocks, max_sequence_length, dropout_rate)

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

model.summary()

Model: "transformer"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 decoder_inputs (InputLayer  [(None, 49)]                 0         []                            
 )                                                                                                
                                                                                                  
 decoder (Functional)        (None, 49, 75)               9508427   ['decoder_inputs[0][0]']      
                                                                                                  
 encoder_inputs (InputLayer  [(None, 50)]                 0         []                            
 )                                                                                                
                                                                                        

# Step 3: Training, validation and testing

Train the model

In [61]:
history = model.fit(
    [input_train, target_train_input],
    target_train_output,
    validation_data=([input_val, target_val_input], target_val_output),
    epochs=3,
    batch_size=64
)

Epoch 1/3
Epoch 2/3
Epoch 3/3
Validation Loss: 4.115200042724609
Validation Accuracy: 0.942176878452301


Evaluate model

In [None]:
val_loss, val_accuracy = model.evaluate([input_val, target_val_input], target_val_output)
print(f"Validation Loss: {val_loss}")
print(f"Validation Accuracy: {val_accuracy}")
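
Because the targets are post-padded to a fixed length, the built-in `accuracy` metric also scores the padding positions, so a model that mostly predicts index 0 can score high without producing useful words. A minimal sketch of an accuracy that ignores padding, assuming index 0 is the padding id. The `masked_accuracy` helper is hypothetical, not a Keras function:

```python
import numpy as np

def masked_accuracy(y_true, y_pred_probs, pad_id=0):
    """Token accuracy computed only where the target is not padding."""
    y_pred = y_pred_probs.argmax(axis=-1)
    mask = y_true != pad_id
    if mask.sum() == 0:
        return 0.0
    return float((y_pred[mask] == y_true[mask]).mean())

# Toy case: two real tokens followed by four padding positions
y_true = np.array([[5, 7, 0, 0, 0, 0]])
probs = np.zeros((1, 6, 10))
probs[..., 0] = 1.0                      # a "model" that always predicts padding

plain = float((probs.argmax(axis=-1) == y_true).mean())
masked = masked_accuracy(y_true, probs)
print(f"plain accuracy:  {plain:.3f}")
print(f"masked accuracy: {masked:.3f}")
```

On the notebook's data the same helper could be called as `masked_accuracy(target_val_output, model.predict([input_val, target_val_input]))`.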

Test model

In [65]:
def generate_response(model, input_sequence, tokenizer, max_sequence_length, temperature=1.0):
    # Clean the raw query the same way as the training text, then tokenize and pad it
    input_seq = tokenizer.texts_to_sequences([preprocess_text(input_sequence)])
    input_seq = pad_sequences(input_seq, maxlen=max_sequence_length, padding='post')

    # Initialize the decoder input. No '<start>' token was added during preprocessing,
    # so this lookup falls back to 0 (the padding index).
    target_seq = np.zeros((1, max_sequence_length - 1))
    target_seq[0, 0] = tokenizer.word_index.get('<start>', 0)

    # Generate the response
    decoded_sentence = []
    for i in range(max_sequence_length - 2):  # keeps i + 1 inside target_seq (length max_sequence_length - 1)
        # Predict the next token
        output = model.predict([input_seq, target_seq], verbose=0)
        sampled_token_index = sample_token(output[0, i, :], temperature)
        sampled_word = tokenizer.index_word.get(sampled_token_index, '<UNK>')

        # Exit condition: either hit max length or find the end token
        if sampled_word == '<end>' or len(decoded_sentence) >= max_sequence_length - 2:
            break

        decoded_sentence.append(sampled_word)

        # Update the target sequence for the next iteration
        if i < max_sequence_length - 2:  # Ensure we don't go out of bounds
            target_seq[0, i+1] = sampled_token_index

    return ' '.join(decoded_sentence)

def sample_token(probabilities, temperature=1.0):
    # Rescale the distribution: temperature < 1 sharpens it, > 1 flattens it
    probabilities = np.asarray(probabilities).astype('float64')
    # Small constant avoids log(0) for tokens with zero probability
    probabilities = np.log(probabilities + 1e-9) / temperature
    exp_probabilities = np.exp(probabilities)
    probabilities = exp_probabilities / np.sum(exp_probabilities)
    
    # Sample a token index from the rescaled distribution
    return np.random.choice(len(probabilities), p=probabilities)

# Example usage:
input_text = "How can I install Ubuntu?"
response = generate_response(model, input_text, tokenizer, max_sequence_length)
print(f"Input: {input_text}")
print(f"Response: {response}")

Input: How can I install Ubuntu?
Response: jdk ibm get plugin help prgidi pleas load prdigi blackdown tri person prgidi tri idea java folk bad ye help afaik blackdown channel noneu pleas sun yet bit person yet follow better java prodigi onlin doesnt dumb somewher jdk noneu hello think <OOV> bad ibm guy develop second
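
The response above was sampled at the default `temperature=1.0`. A minimal sketch of how temperature reshapes a distribution before sampling, which may help when experimenting with lower values for less random output. The `apply_temperature` helper and the three-token distribution are hypothetical:

```python
import numpy as np

def apply_temperature(probabilities, temperature=1.0, eps=1e-9):
    """Rescale a probability vector; eps guards against log(0)."""
    logits = np.log(np.asarray(probabilities, dtype="float64") + eps) / temperature
    logits -= logits.max()               # numerical stability before exponentiating
    scaled = np.exp(logits)
    return scaled / scaled.sum()

probs = [0.7, 0.2, 0.1]
for t in (0.5, 1.0, 2.0):
    print(t, np.round(apply_temperature(probs, t), 3))
```

Temperatures below 1 concentrate mass on the most likely token, and temperatures above 1 spread it out. Temperature alone cannot fix a model that has seen only 30 rows of training data.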


Save the trained model and tokenizer

In [22]:
import pickle
import os

def save_model_and_tokenizer(model, tokenizer, save_dir):
    model.save(os.path.join(save_dir, 'chatbot_model.h5'))
    with open(os.path.join(save_dir, 'tokenizer.pickle'), 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    print(f"Model and tokenizer saved in {save_dir}")

save_dir = '/kaggle/working/saved_models'
os.makedirs(save_dir, exist_ok=True)
save_model_and_tokenizer(model, tokenizer, save_dir)

Model and tokenizer saved in /kaggle/working/saved_models


# Step 4: Chatbot Integration

# Step 5: Evaluation and Testing

# Step 6: Iteration and Improvement