# Lego Text Instructions: Data Exploration
Calvin Laughlin and Alex Wang
CS 224N Milestone

### In this notebook, we explore our dataset and look for common structures that will aid preprocessing. In particular, we are looking for ways to pass this data to our GPT-3.5 model for fine-tuning.

In [1]:
import os

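def load_text_files(directory):
    """Read every .txt file in `directory` and return the contents as a list of strings."""
    text_data = []
    # os.listdir returns entries in arbitrary order, so list order may vary between runs
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                text_data.append(file.read())
    return text_data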

directory = "uncleaned_text"
text_data = load_text_files(directory)


In [57]:
import re

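def preprocess_text(text):
    # Collapse every run of whitespace (including newlines) into a single space
    text = re.sub(r'\s+', ' ', text)
    # Drop everything except ASCII letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    # Lowercase so that later counts are case-insensitive
    return text.lower()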

preprocessed_data = [preprocess_text(text) for text in text_data]

After removing special characters and extra whitespace and lowercasing the text, we have a list of all instructions, each as a single string.
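
One detail worth checking is the order of the two substitutions: because whitespace is collapsed before symbols are stripped, a removed symbol that sat between two spaces leaves a double space behind (visible as `fast  furious` in the output below). A minimal sketch of an alternative ordering follows, using a hypothetical `preprocess_text_v2` and a made-up sample string, neither of which is part of the original notebook:

```python
import re

def preprocess_text_v2(text):
    # Hypothetical variant: strip symbols first, then collapse whitespace,
    # so that removed symbols cannot leave runs of spaces behind.
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()

# Made-up sample string for illustration only
sample = "Fast & Furious toy:\n  ages 9+"
cleaned = preprocess_text_v2(sample)
print(cleaned)
```

Either ordering should produce the same tokens downstream, since tokenization splits on any run of whitespace; the difference only matters if the cleaned strings are used directly. Note that both versions also drop meaningful symbols such as the `+` in `9+`.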

In [58]:
preprocessed_data[:3]

['76917 speed champions 2 fast 2 furious nissan skyline gtr r34 set adapted by alex charbonneau and tested by natalie charbonneau this buildable lego speed champions nissan skyline gtr r34 replica model 76917 has been inspired by the iconic car from the 2 fast 2 furious movie kids aged 9 car lovers and fans of the popular movie franchise can experience a rewarding build before proudly displaying this car toy or recreating fastpaced street racing scenes a great gift for car lovers  this fast  furious toy is packed with authentic details from the reallife model including iconic livery on the side a wing at the back a grille on the front impressive wheel arches and a nitro fuel canister on the passenger seat there is also a brian oconner minifigure to place in the drivers seat so car fans can enjoy allaction role play fast  furious car toy  lego speed champions 2 fast 2 furious nissan skyline gtr r34 model 76917 for kids car enthusiasts and fans of the 2 fast 2 furious movie 1 minifigure 

In [59]:
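# word_tokenize needs NLTK's Punkt tokenizer data (installed via nltk.download)
from nltk.tokenize import word_tokenize
# ngrams is used further down when counting bigrams and trigrams
from nltk.util import ngrams

def tokenize_text(text):
    # Split one preprocessed instruction string into a list of word tokens
    return word_tokenize(text)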

tokenized_data = [tokenize_text(text) for text in preprocessed_data]


We then tokenize each instruction with NLTK's `word_tokenize`, keeping the tokens for each instruction in a list.
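
Since the preprocessed strings contain only lowercase letters, digits, and spaces, a plain whitespace split gives a rough, dependency-free approximation of this step. The sketch below uses a hypothetical `simple_tokenize` helper; it is not what the notebook runs, and NLTK's tokenizer may split some words differently.

```python
def simple_tokenize(text):
    # Hypothetical stand-in for word_tokenize on already-cleaned text:
    # str.split() with no arguments splits on any run of whitespace.
    return text.split()

# Sample taken from the start of the first instruction, with a double space added
sample = "76917 speed champions 2 fast  2 furious"
tokens = simple_tokenize(sample)
print(tokens)
```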

In [61]:
tokenized_data[:3]

[['76917',
  'speed',
  'champions',
  '2',
  'fast',
  '2',
  'furious',
  'nissan',
  'skyline',
  'gtr',
  'r34',
  'set',
  'adapted',
  'by',
  'alex',
  'charbonneau',
  'and',
  'tested',
  'by',
  'natalie',
  'charbonneau',
  'this',
  'buildable',
  'lego',
  'speed',
  'champions',
  'nissan',
  'skyline',
  'gtr',
  'r34',
  'replica',
  'model',
  '76917',
  'has',
  'been',
  'inspired',
  'by',
  'the',
  'iconic',
  'car',
  'from',
  'the',
  '2',
  'fast',
  '2',
  'furious',
  'movie',
  'kids',
  'aged',
  '9',
  'car',
  'lovers',
  'and',
  'fans',
  'of',
  'the',
  'popular',
  'movie',
  'franchise',
  'can',
  'experience',
  'a',
  'rewarding',
  'build',
  'before',
  'proudly',
  'displaying',
  'this',
  'car',
  'toy',
  'or',
  'recreating',
  'fastpaced',
  'street',
  'racing',
  'scenes',
  'a',
  'great',
  'gift',
  'for',
  'car',
  'lovers',
  'this',
  'fast',
  'furious',
  'toy',
  'is',
  'packed',
  'with',
  'authentic',
  'details',
  'from

Next, we will use bigrams and trigrams to identify common phrases and patterns from the text.

In [62]:
from collections import Counter

def find_common_phrases(tokenized_data, n=2):
    n_grams = []
    for text in tokenized_data:
        n_grams.extend(ngrams(text, n))
    return Counter(n_grams)

# Bigrams (2-word phrases)
common_bigrams = find_common_phrases(tokenized_data, n=2)
# Trigrams (3-word phrases)
common_trigrams = find_common_phrases(tokenized_data, n=3)
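
To make concrete what `find_common_phrases` counts, here is a small standard-library sketch of the same sliding-window idea on a toy token list. The `sliding_ngrams` helper and the toy sentence are illustrative assumptions, not part of the notebook.

```python
from collections import Counter

def sliding_ngrams(tokens, n):
    # Zip n progressively shifted views of the list so that each token
    # is grouped with the n-1 tokens that follow it.
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy sentence (made up) with a few repeated word pairs
toy_tokens = "to the left of the piece to the right of the piece".split()
toy_bigrams = Counter(sliding_ngrams(toy_tokens, 2))
toy_trigrams = Counter(sliding_ngrams(toy_tokens, 3))
print(toy_bigrams.most_common(3))
```

A list of `k` tokens yields `k - n + 1` n-grams, so longer instructions contribute proportionally more to the totals.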

In [21]:
def display_common_phrases(common_phrases, top_n=10):
    for phrase, freq in common_phrases.most_common(top_n):
        print(' '.join(phrase), freq)

# Display top 10 common bigrams
print("Top 10 common bigrams:")
display_common_phrases(common_bigrams)

# Display top 10 common trigrams
print("\nTop 10 common trigrams:")
display_common_phrases(common_trigrams)

Top 10 common bigrams:
to the 19425
of the 18190
on the 12780
the left 10375
put a 10011
the right 9916
the front 8552
the back 7226
the previous 5143
at the 5119

Top 10 common trigrams:
to the left 5523
to the right 5506
the previous piece 3509
to the front 3332
to the back 2995
of the previous 2906
put a f 2203
the left of 2152
on top of 2093
top of the 1984


From this, we can see that the most common bigrams and trigrams are instruction phrases and relative positions (to the left, to the right, the left of, etc.).
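
One rough way to quantify this observation is to measure how much of the top-10 trigram mass involves a positional word. The sketch below hard-codes the trigram counts printed above; the `POSITION_WORDS` set and the `positional_share` helper are assumptions chosen for illustration, not part of the original analysis.

```python
from collections import Counter

# Top-10 trigram counts copied from the notebook output above
top_trigrams = Counter({
    "to the left": 5523,
    "to the right": 5506,
    "the previous piece": 3509,
    "to the front": 3332,
    "to the back": 2995,
    "of the previous": 2906,
    "put a f": 2203,
    "the left of": 2152,
    "on top of": 2093,
    "top of the": 1984,
})

# Assumed set of positional words (hypothetical, for illustration only)
POSITION_WORDS = {"left", "right", "front", "back", "top"}

def positional_share(counts, position_words):
    # Fraction of occurrences whose phrase contains at least one positional word
    positional = sum(freq for phrase, freq in counts.items()
                     if position_words & set(phrase.split()))
    return positional / sum(counts.values())

share = positional_share(top_trigrams, POSITION_WORDS)
print(f"{share:.1%} of top-10 trigram occurrences contain a positional word")
```

This covers only the ten most frequent trigrams, so it says nothing about the long tail of the distribution.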

In [66]:
# Identify common structures and patterns in the instructions
import re

def find_patterns(text):
    # Count case-insensitive occurrences of each structural marker in one instruction
    patterns = {
        "step": len(re.findall(r'\bstep\s+\d+\b', text, re.IGNORECASE)),
        "bag": len(re.findall(r'\bbag\s+\d+\b', text, re.IGNORECASE)),
        "make a part": len(re.findall(r'\bmake a part\b', text, re.IGNORECASE)),
        "put": len(re.findall(r'\bput\b', text, re.IGNORECASE)),
        # NOTE: "term" reuses the "put" regex, so its count always equals "put"
        "term": len(re.findall(r'\bput\b', text, re.IGNORECASE))
    }
    return patterns

patterns_in_instructions = [find_patterns(instr) for instr in text_data]

for idx, patterns in enumerate(patterns_in_instructions):
    print(f"Patterns in instruction {idx + 1}: {patterns}")


Patterns in instruction 1: {'step': 12, 'bag': 4, 'make a part': 0, 'put': 1, 'term': 1}
Patterns in instruction 2: {'step': 2, 'bag': 26, 'make a part': 9, 'put': 59, 'term': 59}
Patterns in instruction 3: {'step': 2, 'bag': 9, 'make a part': 0, 'put': 6, 'term': 6}
Patterns in instruction 4: {'step': 9, 'bag': 1, 'make a part': 0, 'put': 4, 'term': 4}
Patterns in instruction 5: {'step': 0, 'bag': 0, 'make a part': 1, 'put': 18, 'term': 18}
Patterns in instruction 6: {'step': 53, 'bag': 4, 'make a part': 0, 'put': 5, 'term': 5}
Patterns in instruction 7: {'step': 29, 'bag': 5, 'make a part': 0, 'put': 6, 'term': 6}
Patterns in instruction 8: {'step': 30, 'bag': 6, 'make a part': 0, 'put': 58, 'term': 58}
Patterns in instruction 9: {'step': 3, 'bag': 8, 'make a part': 12, 'put': 226, 'term': 226}
Patterns in instruction 10: {'step': 0, 'bag': 0, 'make a part': 0, 'put': 9, 'term': 9}
Patterns in instruction 11: {'step': 0, 'bag': 0, 'make a part': 1, 'put': 80, 'term': 80}
Patterns in 