1. **Mount Google Drive to Access Data**:
   - The function `mount_drive()` mounts Google Drive to make files stored there accessible. This is necessary for accessing datasets stored in Google Drive from Google Colab.

2. **Define Paths for Data Access and Storage**:
   - The function `define_paths()` sets up file paths for input (the original dataset) and output (processed data). It creates the output directory if it does not already exist; the corpus directory itself is expected to be present on Google Drive.

3. **Load NLP Tools from NLTK for Text Processing**:
   - The function `load_nltk_resources()` downloads the NLTK resources listed below and initializes a stopword set, a WordNet lemmatizer, and a Porter stemmer. These are used for text preprocessing tasks such as removing common stopwords, lemmatizing words to their dictionary forms, and stemming.

    1. **`nltk.download('stopwords')`**: Downloads a list of common words (e.g., "and," "the") that are often removed during text preprocessing.
    2. **`nltk.download('wordnet')`**: Downloads the WordNet lexical database, which is used for lemmatizing words to their base or root forms.
    3. **`nltk.download('omw-1.4')`**: Downloads the Open Multilingual Wordnet, which provides additional linguistic data for WordNet.
    4. **`nltk.download('punkt')`**: Downloads a tokenizer that can split text into sentences and words, used for segmenting text during processing.
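
Step 2 can be tried outside Colab if the base directory is passed in as a parameter. The sketch below is a variant written for illustration, not the notebook's own function: `define_paths_under` and its `base_dir` argument are hypothetical names, and a temporary directory stands in for the Drive folder.

```python
import os
import tempfile


def define_paths_under(base_dir, corpus_name="movie-corpus"):
    """Hypothetical variant of define_paths() with a configurable base directory."""
    corpus = os.path.join(base_dir, corpus_name)
    processed_data_dir = os.path.join(base_dir, "processed")
    # Only the output directory is created; the corpus directory must already exist.
    os.makedirs(processed_data_dir, exist_ok=True)
    return corpus, processed_data_dir


with tempfile.TemporaryDirectory() as tmp:
    corpus, processed = define_paths_under(tmp)
    print(os.path.isdir(processed))  # True: created by makedirs
    print(os.path.exists(corpus))    # False: the corpus has to be supplied separately
```

Because `exist_ok=True` is passed, calling the function a second time on the same base directory does not raise an error.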

In [None]:
import os
import csv
import json
import codecs
import re
import unicodedata
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from google.colab import drive
import random
import string
import nltk

# Step 1: Mount Google Drive to access files
def mount_drive():
    # Mount Google Drive to make sure we can access the dataset stored in Google Drive.
    drive.mount('/content/drive')

# Step 2: Define paths for input and output data
def define_paths():
    # Define the paths to the corpus and processed data within Google Drive
    corpus_name = "movie-corpus"
    data_dir = "/content/drive/My Drive/Colab Notebooks/nlp_pro_babu/data"
    corpus = os.path.join(data_dir, corpus_name)
    processed_data_dir = os.path.join(data_dir, "processed")
    os.makedirs(processed_data_dir, exist_ok=True)  # Create directory if it doesn't exist
    return corpus, processed_data_dir

# Step 3: Load necessary NLP tools from NLTK
def load_nltk_resources():
    # Download necessary resources for text processing
    nltk.download('stopwords')
    nltk.download('wordnet')
    nltk.download('omw-1.4')
    nltk.download('punkt')

    # Initialize stop words, lemmatizer, and stemmer
    stop_words = set(stopwords.words('english'))  # Load stop words from NLTK
    lemmatizer = WordNetLemmatizer()  # Initialize lemmatizer to reduce words to their base form
    stemmer = PorterStemmer()  # Initialize Porter stemmer for stemming words
    return stop_words, lemmatizer, stemmer


Extracting Data from the Corpus File

In [None]:

# Step 4: Function to print lines from a file for preview
def print_lines(file, n=10):
    # This function allows us to check the contents of a file, useful for debugging and verification
    with open(file, 'rb') as datafile:
        lines = datafile.readlines()
    for line in lines[:n]:
        print(line)

# Step 5: Load lines and conversations from the dataset
def load_lines_and_conversations(file_name):
    # This function extracts individual lines and groups them into conversations to be used for training
    lines = {}
    conversations = {}
    with open(file_name, 'r', encoding='iso-8859-1') as f:
        for line in f:
            line_json = json.loads(line)
            line_obj = {
                "lineID": line_json["id"],
                "characterID": line_json["speaker"],
                "text": line_json["text"]
            }
            lines[line_obj['lineID']] = line_obj
            # Group lines into conversations
            if line_json["conversation_id"] not in conversations:
                conv_obj = {
                    "conversationID": line_json["conversation_id"],
                    "movieID": line_json["meta"]["movie_id"],
                    "lines": [line_obj]
                }
            else:
                conv_obj = conversations[line_json["conversation_id"]]
                conv_obj["lines"].insert(0, line_obj)
            conversations[conv_obj["conversationID"]] = conv_obj
    return lines, conversations

# Step 6: Extract sentence pairs from conversations
def extract_sentence_pairs(conversations):
    # Create question-answer pairs by extracting consecutive lines in each conversation
    qa_pairs = []
    for conversation in conversations.values():
        for i in range(len(conversation["lines"]) - 1):
            input_line = conversation["lines"][i]["text"].strip()
            target_line = conversation["lines"][i+1]["text"].strip()
            if input_line and target_line:
                qa_pairs.append([input_line, target_line])
    return qa_pairs
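
The grouping and pairing logic in steps 5 and 6 can be checked on a tiny in-memory sample. The sketch below is a simplified illustration, not the functions above: the three records are invented, only the `text` field is kept, and the newest-first ordering of the sample is an assumption suggested by the `insert(0, ...)` call, which yields chronological order only if the file lists each conversation's utterances in reverse.

```python
import io
import json

# Three made-up utterances in the same JSON-lines shape the loader reads.
# They are listed newest-first, which is what insert(0, ...) assumes.
sample = io.StringIO("\n".join(json.dumps(u) for u in [
    {"id": "L3", "speaker": "u1", "conversation_id": "L1", "text": "Fine, thanks."},
    {"id": "L2", "speaker": "u2", "conversation_id": "L1", "text": "How are you?"},
    {"id": "L1", "speaker": "u1", "conversation_id": "L1", "text": "Hi."},
]))

conversations = {}
for raw in sample:
    utt = json.loads(raw)
    # Prepending restores chronological order when the file is newest-first.
    conversations.setdefault(utt["conversation_id"], []).insert(0, utt["text"])

# Pair each line with the line that follows it in the same conversation.
pairs = [
    [a.strip(), b.strip()]
    for texts in conversations.values()
    for a, b in zip(texts, texts[1:])
    if a.strip() and b.strip()
]
print(pairs)
# [['Hi.', 'How are you?'], ['How are you?', 'Fine, thanks.']]
```

A conversation with n lines yields n - 1 pairs, so every line except the first and last appears once as an input and once as a target.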


**Step 7: Additional Data Processing Techniques**

1. **Normalize the String**:
   - Convert the text to lowercase, strip accents, and replace every character other than letters and the punctuation marks `.`, `!`, and `?` with a space.
   - This helps standardize the text for easier processing by removing inconsistencies like casing and special characters.

2. **Remove Stopwords**:
   - Remove common stopwords like "is" or "the" to reduce noise in the data.
   - This helps the model focus on the more important parts of the sentence that carry meaningful information.

3. **Lemmatize the Sentence**:
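   - Convert each word in the sentence to its base or dictionary form (e.g., "cars" becomes "car"). The code calls the lemmatizer without a part-of-speech tag, so every word is treated as a noun and verb forms such as "running" are left unchanged.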
   - Lemmatization helps standardize different forms of words and improves the efficiency of the model by reducing redundancy.

4. **Stem the Sentence**:
   - Apply stemming to reduce words to their root by removing suffixes (e.g., "studies" becomes "studi").
   - This is another form of word normalization. It is generally more aggressive than lemmatization, and the result is often not a dictionary word.

5. **Remove Punctuation**:
   - Remove punctuation marks from the text to further clean the data.
   - This is often necessary to make the model focus on the core content of sentences without being distracted by punctuation symbols.

6. **Tokenize the Sentence**:
   - Split the sentence into individual words (tokens).
   - This is an important step in NLP, allowing further processing on a word-by-word basis.

7. **Augment Data by Shuffling Words in Sentences**:
   - To make the model more robust, shuffle the words in each sentence randomly to generate variations.
   - The shuffled pairs are appended to the originals, which increases the number of available training pairs (doubling it when `num_augments=1`) and is intended to help the model generalize.
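
To make the first and fifth techniques concrete, the sketch below applies normalization and punctuation removal to one made-up sentence. It uses only the standard library; the two functions mirror the notebook's `normalize_string` and `remove_punctuation`, and the input string is an invented example.

```python
import re
import string
import unicodedata


def normalize_string(s):
    s = s.lower().strip()
    # Decompose accented characters and drop the combining marks.
    s = "".join(c for c in unicodedata.normalize("NFD", s)
                if unicodedata.category(c) != "Mn")
    s = re.sub(r"([.!?])", r" \1", s)       # put a space before . ! ?
    s = re.sub(r"[^a-zA-Z.!?]+", " ", s)    # everything else becomes a space
    return re.sub(r"\s+", " ", s).strip()   # collapse runs of whitespace


def remove_punctuation(sentence):
    return sentence.translate(str.maketrans("", "", string.punctuation))


raw = "Héllo, World!  How's it going?"
normalized = normalize_string(raw)
print(normalized)                              # hello world ! how s it going ?
print(remove_punctuation(normalized).split())  # ['hello', 'world', 'how', 's', 'it', 'going']
```

Note that the apostrophe in "How's" is replaced by a space, so the contraction is split into "how" and a stray "s" token.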


In [None]:

# Step 7: Additional Data Processing Techniques

# 7.1: Normalize the string
def normalize_string(s):
    # Normalization involves converting to lowercase and removing non-letter characters
    s = s.lower().strip()  # Convert to lowercase and remove leading/trailing spaces
    s = ''.join(
        c for c in unicodedata.normalize('NFD', s)  # Decompose special characters
        if unicodedata.category(c) != 'Mn'  # Remove accent characters
    )
    s = re.sub(r"([.!?])", r" \1", s)  # Add space before punctuation marks
    s = re.sub(r"[^a-zA-Z.!?]+", r" ", s)  # Remove any character that isn't a letter or punctuation
    s = re.sub(r"\s+", r" ", s).strip()  # Replace multiple spaces with a single space
    return s

# 7.2: Remove stopwords from the sentence
def remove_stopwords(sentence, stop_words):
    # This function removes common stopwords (e.g., 'is', 'the') to reduce noise in the data
    words = sentence.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(filtered_words)

# 7.3: Lemmatize the sentence
def lemmatize_sentence(sentence, lemmatizer):
    # Lemmatization reduces words to their dictionary form. Without a POS tag, words are
    # treated as nouns (e.g., 'cars' becomes 'car', while 'running' is left unchanged).
    words = sentence.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

# 7.4: Stem the sentence
def stem_sentence(sentence, stemmer):
    # Stemming reduces words to their root form by removing suffixes (e.g., 'running' becomes 'run')
    words = sentence.split()
    stemmed_words = [stemmer.stem(word) for word in words]
    return ' '.join(stemmed_words)

# 7.5: Remove punctuation from the sentence
def remove_punctuation(sentence):
    # This function removes punctuation marks to further clean the text
    return sentence.translate(str.maketrans('', '', string.punctuation))

# 7.6: Tokenize the sentence
def tokenize_sentence(sentence):
    # Tokenization splits the sentence into individual words for further processing
    return nltk.word_tokenize(sentence)

# 7.7: Augment data by shuffling words in sentences
def augment_data(pairs, num_augments=1):
    # This simple augmentation technique introduces variations in word order to make the model more robust
    augmented_pairs = []
    for pair in pairs:
        for _ in range(num_augments):
            input_line_words = pair[0].split()
            target_line_words = pair[1].split()
            random.shuffle(input_line_words)
            random.shuffle(target_line_words)
            augmented_pairs.append([' '.join(input_line_words), ' '.join(target_line_words)])
    return pairs + augmented_pairs

# Step 8: Load, process, and augment data
def load_process_and_augment_data(corpus, stop_words, lemmatizer, stemmer):
    # Load the lines and conversations from the dataset and extract the sentence pairs
    lines, conversations = load_lines_and_conversations(os.path.join(corpus, "utterances.jsonl"))
    qa_pairs = extract_sentence_pairs(conversations)

    # Step 9: Apply data processing to each pair
    # Normalize, remove punctuation, remove stopwords, lemmatize, and stem each sentence in the question-answer pairs
    processed_pairs = []
    for pair in qa_pairs:
        input_sentence, target_sentence = pair
        input_sentence = normalize_string(input_sentence)  # Normalize by converting to lowercase and removing non-letter characters
        target_sentence = normalize_string(target_sentence)
        input_sentence = remove_punctuation(input_sentence)  # Remove punctuation
        target_sentence = remove_punctuation(target_sentence)
        input_sentence = remove_stopwords(input_sentence, stop_words)  # Remove stopwords
        target_sentence = remove_stopwords(target_sentence, stop_words)
        input_sentence = lemmatize_sentence(input_sentence, lemmatizer)  # Lemmatize to reduce words to their base form
        target_sentence = lemmatize_sentence(target_sentence, lemmatizer)
        input_sentence = stem_sentence(input_sentence, stemmer)  # Apply stemming to further reduce words
        target_sentence = stem_sentence(target_sentence, stemmer)
        processed_pairs.append([input_sentence, target_sentence])

    # Step 10: Augment data to increase dataset size
    # Apply data augmentation to increase the number of training pairs and introduce variation
    augmented_pairs = augment_data(processed_pairs, num_augments=1)
    return augmented_pairs

# Step 11: Save the processed data
def save_processed_data(augmented_pairs, processed_data_dir):
    # Save the processed and augmented pairs to a text file for future use
    datafile = os.path.join(processed_data_dir, "formatted_movie_lines.txt")
    delimiter = '\t'  # Tab-separated columns: input sentence, then target sentence

    print("\nWriting processed and augmented file...")
    with open(datafile, 'w', encoding='utf-8') as outputfile:
        writer = csv.writer(outputfile, delimiter=delimiter, lineterminator='\n')
        for pair in augmented_pairs:
            writer.writerow(pair)

    # Step 12: Verify the saved file
    # Print a few lines from the saved processed file to verify the output
    print("Sample lines from processed file:")
    print_lines(datafile)
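
The shuffling augmentation in step 7.7 draws from the global `random` state, so its output changes from run to run. If reproducible output is wanted, one option is a private seeded generator. The sketch below is an assumption, not part of the pipeline above: `augment_data_seeded` and its `seed` parameter are hypothetical names.

```python
import random


def augment_data_seeded(pairs, num_augments=1, seed=0):
    """Hypothetical variant of augment_data() that uses its own seeded RNG."""
    rng = random.Random(seed)
    augmented = []
    for source, target in pairs:
        for _ in range(num_augments):
            source_words, target_words = source.split(), target.split()
            rng.shuffle(source_words)
            rng.shuffle(target_words)
            augmented.append([" ".join(source_words), " ".join(target_words)])
    # Originals come first, followed by the shuffled copies.
    return pairs + augmented


pairs = [["good stuff", "real"], ["listen crap", "crap"]]
first = augment_data_seeded(pairs, seed=42)
second = augment_data_seeded(pairs, seed=42)
print(len(first))       # 4: two originals plus one shuffled copy of each
print(first == second)  # True: same seed, same shuffles
```

A shuffled copy can be identical to its original, which is always the case for one-word sentences, so the augmented set may contain exact duplicates.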


In [None]:

# Main function to run all steps
def main():
    mount_drive()
    corpus, processed_data_dir = define_paths()
    stop_words, lemmatizer, stemmer = load_nltk_resources()  # Call once: downloads the resources and returns the tools
    augmented_pairs = load_process_and_augment_data(corpus, stop_words, lemmatizer, stemmer)
    save_processed_data(augmented_pairs, processed_data_dir)

# Run the main function
if __name__ == "__main__":
    main()


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



Writing processed and augmented file...
Sample lines from processed file:
b'\t\n'
b'okay\thope\n'
b'wow\tlet go\n'
b'kid know sometim becom persona know quit\t\n'
b'\tokay gonna need learn lie\n'
b'figur get good stuff eventu\tgood stuff\n'
b'good stuff\treal\n'
b'real\tlike fear wear pastel\n'
b'listen crap\tcrap\n'
b'crap\tendless blond babbl like bore\n'
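
Several rows in the sample output have an empty side, for example `b'\t\n'`. This is consistent with stopword removal deleting every word of a short line, since the emptiness check in step 6 runs before processing. One possible remedy, which is not part of the pipeline above, is to filter the processed pairs before augmenting and saving them. `drop_empty_pairs` is a hypothetical helper name.

```python
def drop_empty_pairs(pairs):
    """Hypothetical filter: keep only pairs whose two sides are both non-blank."""
    return [pair for pair in pairs if all(side.strip() for side in pair)]


# Rows copied from the sample output above.
rows = [
    ["", ""],
    ["okay", "hope"],
    ["kid know sometim becom persona know quit", ""],
    ["", "okay gonna need learn lie"],
    ["good stuff", "real"],
]
print(drop_empty_pairs(rows))
# [['okay', 'hope'], ['good stuff', 'real']]
```

Applying the filter to `processed_pairs` before `augment_data` would also keep the shuffled copies free of empty rows.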
