# [Dataset](#Methodology)

In this section we will discuss the clean up of the Ubuntu Dialog Corpus and create Q&A Pairs that we will use to train the dataset

The Ubuntu Dialogue Corpus consists of approximately one million two-person dialogues derived from Ubuntu tech support chat logs. These natural language interactions average 8 turns per conversation and collectively contain over 100 million words. The dataset includes an identifier for each dialogue along with timestamps, sender and recipient information, and the text content of each turn in the conversation - all formatted as text rather than audio. A sample subset of this corpus is available in .csv format across multiple files. Collected by researchers Ryan Lowe et al., this corpus is licensed under Apache License 2.0.

In [9]:
import os
import re
import json
from tqdm import tqdm
import numpy as np
import json
import pandas as pd
from datasets import load_dataset, DatasetInfo
from huggingface_hub import notebook_login
from sklearn.model_selection import train_test_split

## Load the raw dataset

First we processes and combine data from multiple CSV files in the Ubuntu dialogue corpus. Each file is loaded, date is converted to a standardized format, a unique ID is generated by combining folder and dialogueID columns, and original columns are dropped. Finally, all processed data from different files are combined into one large dataframe for easy analysis.

In [2]:
import pandas as pd
import os
from tqdm import tqdm

# Global settings
FOLDER = "./Ubuntu-dialogue-corpus"  # Input folder containing Ubuntu dialogue CSV files
SOURCE = "ubuntu-dialogue"  # Source to use in the parquet for each row

def load_csv(file_path):
    """Loads a CSV file and processes its content.
    
    Args:
        file_path (str): Path to the CSV file.
        
    Returns:
        pd.DataFrame: Processed DataFrame.
    """
    # Load CSV file into a DataFrame
    data = pd.read_csv(file_path)
    
    # Convert 'date' column to datetime format
    data["date"] = pd.to_datetime(data["date"])
    
    # Generate 'id' column by combining 'folder' and 'dialogueID' columns, 
    # and removing '.tsv' extension from 'dialogueID' values
    data["id"] = data.apply(lambda row: f"{row['folder']}_{row['dialogueID'].split('.tsv')[0]}", axis=1)
    
    # Drop the original 'folder' and 'dialogueID' columns
    data.drop(columns=["folder", "dialogueID"], inplace=True)
    
    return data

def aggregate_data(folder_path):
    """Aggregates data from all CSV files in the specified folder.
    
    Args:
        folder_path (str): Path to the folder containing the CSV files.
        
    Returns:
        pd.DataFrame: Aggregated DataFrame.
    """
    # Initialize an empty DataFrame for data aggregation
    aggregated_data = pd.DataFrame()
    
    # Iterate through each file in the specified folder
    for file_name in tqdm(os.listdir(folder_path)):
        # Construct the full path to the current file
        file_path = os.path.join(folder_path, file_name)
        
        # Load and process the current file
        current_data = load_csv(file_path)
        
        # Concatenate the current data with the aggregated data
        aggregated_data = pd.concat([aggregated_data, current_data])
    
    return aggregated_data

# Call the aggregate_data function to process and aggregate all CSV files in the specified folder
aggregated_data = aggregate_data(FOLDER)

100%|██████████| 3/3 [05:42<00:00, 114.00s/it]


In [3]:
# Display the first five rows of the aggregated data to verify the results
aggregated_data.head(20)

Unnamed: 0,date,from,to,text,id
0,2004-11-23 11:49:00+00:00,stuNNed,,any ideas why java plugin takes so long to load?,301_1
1,2004-11-23 11:49:00+00:00,crimsun,stuNNed,java 1.4?,301_1
2,2004-11-23 11:49:00+00:00,stuNNed,crimsun,yes,301_1
3,2004-11-23 11:49:00+00:00,crimsun,stuNNed,java 1.5 loads _much_ faster,301_1
4,2004-11-23 11:50:00+00:00,stuNNed,crimsun,noneus: how can i get 1.5 is there a .deb some...,301_1
5,2004-11-23 11:50:00+00:00,crimsun,stuNNed,not yet.,301_1
6,2004-11-23 11:50:00+00:00,stuNNed,crimsun,noneus: is this blackdown or sun?,301_1
7,2004-11-23 11:50:00+00:00,stuNNed,crimsun,did you install just the jre?,301_1
8,2004-11-23 11:51:00+00:00,crimsun,stuNNed,I use IBM's 1.4.2,301_1
9,2004-11-23 11:51:00+00:00,crimsun,stuNNed,"(jdk, because I do globus development)",301_1


**observation**
- The oringal dataset comprising of 100M examples is loaded into a  dataframe that has the `date`, `from`, to `and` text `and` `id` columns.  One can track the conversation in this dataset through looking at  `id` and `to` and `from`. 
- On first glance, it is difficult to intutively follow the conversations in this this `chatroom`. We will further process this data to create nice question and answer pairs. 

## Create Question and Answer Pairs

Next we processess the  `aggregated_data` dataframe of conversations, refining it into a more understandable format. The `is_valid_question` function checks if a question meets specific criteria, like having at least 12 characters and containing common question words (e.g., 'what', 'who'). The `is_expressive_answer` function examines if an answer is descriptive and relevant, filtering out short or off-topic responses. It also removes responses containing certain words (e.g., 'google' or 'wrong') and those that are simple affirmatives or negatives (e.g., 'yes', 'no'). The `clean_dataframe` function then uses these two functions to process the entire chat data, removing any unneeded conversations and organizing the remaining valid questions and their corresponding expressive answers into a cleaner format.

In [4]:
SOURCE = "ubuntu-dialogue"

def is_valid_question(question):
    """Checks if the question is valid based on certain criteria."""
    if not question or pd.isna(question) or len(question) < 12:
        return False
    question_keywords = re.findall(r'(?i)(?:\?|what|who|where|why|when|how|whose|explain|tell|does|way|can|know|able|best|recommend)', question)
    return bool(question_keywords)

def is_expressive_answer(candidate, all_recipients):
    """Checks if the answer is expressive and on-topic based on certain criteria."""
    if not candidate or pd.isna(candidate) or re.findall(r'(?i)^(yes|yep|yeah|no|nah|nope|sure|yes\s*sir)\W*$', candidate):
        return False
    if len(candidate) < 3 or re.findall(r'(?i)(?:wrong|of.*?topic|else\s*where|ask.+?in|\#\w+|google|you.+?mean)', candidate):
        return False
    if re.findall(r'\b(' + all_recipients + r')\b', candidate):
        return False
    return True

def clean_dataframe(data):
    """Cleans up the DataFrame, removes duplicates, and processes conversations."""
    clean_dict = {col: [] for col in ["question", "answer"]}

    for name, group in tqdm(data.groupby("id")):
        if len(group) not in (3, 4, 5):  # Checking for valid conversation length
            continue

        group.sort_values(by=["date"], ascending=True, inplace=True)
        question = str(group["text"].values[0]).strip()
        question_user = group["from"].values[0]

        if not is_valid_question(question):
            continue

        all_recipients = "|".join([re.escape(item) for item in set(group["to"].tolist() + group["from"].tolist()) if pd.notna(item)])
        answer = None
        answer_user = None

        for _, row in group.iterrows():
            if row["to"] == question_user:
                candidate = str(row["text"]).strip()
                if is_expressive_answer(candidate, all_recipients):
                    answer = candidate
                    answer_user = row["from"]
            elif answer_user is not None and row["to"] == answer_user and row["from"] == question_user:
                if re.findall(r'(?i)(?:thank|thanks|perfect|thx|works|working|great|good|awesome)', str(row["text"])):
                    clean_dict["question"].append(question)
                    clean_dict["answer"].append(answer)
                    break

    return clean_dict

# Usage:
FOLDER = "./Ubuntu-dialogue-corpus"  # Input folder containing Ubuntu dialogue CSV files
data = None
for file in tqdm(os.listdir(FOLDER)):
    data = pd.concat([data, pd.read_csv(os.path.join(FOLDER, file))])

clean_data = clean_dataframe(aggregated_data)

100%|██████████| 3/3 [00:57<00:00, 19.12s/it]
100%|██████████| 1852868/1852868 [03:57<00:00, 7815.52it/s] 


In [5]:
clean = pd.DataFrame(clean_data)
clean.sort_values(by="answer", key=lambda x: x.str.len(), inplace=True, ascending=False)
clean.drop_duplicates(subset=["question"], inplace=True)
clean.sort_index(inplace=True)

In [6]:
print(f"Retrieved {len(clean) / len(aggregated_data['id'].unique()) * 100.:.2f}% of all questions ({len(clean)})")  # 19921

Retrieved 0.93% of all questions (17178)


**observation**
- We have cut down the number of examples from 100M to slightly over 10000 question and answer pairs. We will be using the LoRA technique and the reduced number of examples is going to be sufficient to train an model

## Convert text to lowercase and and remove spaces

In [7]:
# To convert all text in `question` and `answer` into lowercase
clean['question'] = clean['question'].str.lower()
clean['answer'] = clean['answer'].str.lower()

# To remove more than one space from 'question' and 'answer'
clean['question'] = clean['question'].str.replace(r'\s{2,}', ' ', regex=True)
clean['answer'] = clean['answer'].str.replace(r'\s{2,}', ' ', regex=True)

# Now display the first few rows of the cleaned dataframe
clean.head()

Unnamed: 0,question,answer
0,"hi, is there a cli command to roll back any up...",your recourse is to re-install fresh the older...
1,a livecd iso can be burned to a dvd-r and run ...,"i hope so, or the custom dvds i've done are wo..."
2,"hello, is there a way to adjust gamma settings...",for me i have my nvidia settings manager and i...
4,does ubuntu come with a firewall by default?,no iptables rule is loaded by deault on ubuntu
5,can someone tell me howto get rid of google ch...,sudo dpkg -l |grep -i chrom ----> sudo apt-get...


## Split the dataset into train and test

In [10]:
clean = clean.sample(frac = 1)

# Split the data
train, test = train_test_split(clean, test_size=0.30)

print("Training set size:", len(train))
print("Testing set size:", len(test))

Training set size: 12024
Testing set size: 5154


**observation**
- After cleaning the dataset, we were left with 0.93% of all questions, totaling 17,178. We divided these into training and testing data using a 70/30 split. The training set consists of 12,100 questions, while the testing set comprises 5,186 questions.

In [17]:
for index, row in clean.iterrows():
    print("Q >", row["question"])
    print("A >", row["answer"])
    print()
    if index > 20:
        break

Q > i am trying to share folders with nfs through a wireless router, but the shared folder does not show up in the network. would someone help me with this?
A > using nfs over wifi is a bad idea because the protocol does not gracefully handle disconnection. i suggest smb of sshfs instead



**observation**
- We ended up with a clean set of question and answer pairs. Since the downstream task is answer technical questions with special characters, we are limited on how much we can process the data like removing punctcations, ascii characters etc and so leave them in the data

## Create .jsonl file  and upload the cleaned up dataset to huggingface

We will convert the dataset into JSONL also called newline-delimited JSON. JSON Lines is a convenient format for storing structured data that may be processed one record at a time and take the following structure

```
{"question": "What color is the sky?", "answer": "The sky is blue."}
{"question": "Where is the best place to get cloud GPUs?", "answer": "Lambda.com"}
{"question": "Why do Americans love guns so much?", "output": "answer of the Spanish."}
```

In [None]:
# Log into huggingface 
notebook_login()

In [19]:
def create_jsonl(dataframe, filename):
    with open(filename+'.jsonl', 'w') as f:
      for i, row in dataframe.iterrows():
        record = {
          'question': row['question'].strip(),
          'answer': row['answer'].strip(),
        }
        json_record = json.dumps(record)
        f.write(json_record+'\n')
    
create_jsonl(test,"test")
create_jsonl(train,"train")

In [20]:
ubuntu_question_answer_jsonl = load_dataset("json", data_files = {"train": "./train.jsonl","test": "./test.jsonl"})

Downloading and preparing dataset json/default to /home/jupyter/.cache/huggingface/datasets/json/default-66aeeca9882648bc/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /home/jupyter/.cache/huggingface/datasets/json/default-66aeeca9882648bc/0.0.0/fe5dd6ea2639a6df622901539cb550cf8797e5a6b2dd7af1cf934bed8e233e6e. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [21]:
ubuntu_question_answer_jsonl

DatasetDict({
    train: Dataset({
        features: ['question', 'answer'],
        num_rows: 12024
    })
    test: Dataset({
        features: ['question', 'answer'],
        num_rows: 5154
    })
})

In [23]:
ubuntu_question_answer_jsonl.push_to_hub("mugithi/ubuntu_question_answer", private=False)

Pushing split train to the Hub.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/13 [00:00<?, ?ba/s]

Deleting unused files from dataset repository:   0%|          | 0/1 [00:00<?, ?it/s]

Pushing split test to the Hub.


Pushing dataset shards to the dataset hub:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/6 [00:00<?, ?ba/s]

Deleting unused files from dataset repository:   0%|          | 0/1 [00:00<?, ?it/s]

**observation**
- We uploaded the data into hugginface so that we can easily download the the preprocessed data from the hugginface.

# Conclusion

In this notebook,we ingested the initial dataset encompassing 100 million examples was ingested into a dataframe, with designated columns for date, from, to, text, and id, facilitating the tracing of dialogues within the dataset via the id, to, and from fields. Upon initial examination, navigating through the conversations in this 'chatroom' proved to be non-intuitive. Subsequent steps were taken to refine this data into well-structured question and answer pairs for ease of analysis. Post-cleaning, a mere 0.93% of the original questions remained, summing up to 17,178 entries. This refined dataset was then partitioned into training and testing subsets following a 70/30 ratio, yielding 12,100 questions for training and 5,186 for testing purposes. Further enhancing the accessibility and usability of the preprocessed data, it was uploaded to Hugging Face, thereby simplifying the retrieval process for future endeavors.

# References

Lowe, R., Pow, N., Serban, I., & Pineau, J. (2016). The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. Retrieved from https://doi.org/10.48550/arXiv.1506.08909

Nagyfi, R. (2023). Open-Assistant/ubuntu_parser.ipynb. Retrieved from https://github.com/sedthh/Open-Assistant/blob/b3a8c2479b12ea69d66487e2852b836083b7e4db/data/datasets/ubuntu_dialogue_qa/ubuntu_parser.ipynb