## Detailed Report on Data Augmentation Methods

- Controlled Synonym Replacement: This method replaces each word with the first WordNet lemma of its first synset, provided that lemma differs from the word and is purely alphabetic. This adds lexical variety, although the first synset is not guaranteed to match the word's sense in context, so the result should be spot-checked.

- Back-Translation Paraphrasing: This method uses back-translation to generate paraphrases. The sentence is translated to another language and then back to English. By using multiple languages (French, Spanish, German), we generate diverse paraphrases while retaining the original meaning.

- Template-Based Augmentation: This method uses a set of predefined templates to create variations of the original sentence. By inserting the original sentence into different templates, we generate meaningful variations that enhance the dataset.

- Multiple Augmentation Rounds: The dataset is augmented multiple times to generate a larger dataset. Each round includes synonym replacement, back-translation, and template-based augmentation to ensure comprehensive augmentation.
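
To make the effect of multiple rounds concrete, here is a minimal sketch that uses only template-based augmentation on a single toy sentence. The helper names (`apply_templates`, `augment_rounds`) and the two-template list are illustrative assumptions, not the functions used later in this notebook. Because each round augments everything produced so far, the dataset grows faster than linearly with the number of rounds.

```python
def apply_templates(sentence, templates):
    """Wrap a sentence in each template to produce surface variations."""
    return [template.format(sentence) for template in templates]


def augment_rounds(sentences, templates, rounds=1):
    """Augment the whole pool each round, so later rounds wrap earlier variants."""
    pool = list(sentences)
    for _ in range(rounds):
        new = [variant for s in pool for variant in apply_templates(s, templates)]
        # dict.fromkeys removes duplicates while keeping insertion order
        pool = list(dict.fromkeys(pool + new))
    return pool


# Hypothetical two-template list, for illustration only
toy_templates = ["Please explain {} in detail.", "{} - could you elaborate?"]

one_round = augment_rounds(["what is the Alex desk"], toy_templates, rounds=1)
two_rounds = augment_rounds(["what is the Alex desk"], toy_templates, rounds=2)
print(len(one_round), len(two_rounds))  # 3 7
```

With two templates, one round turns 1 sentence into 3, and a second round turns those 3 into 7, since the second round also wraps the already-templated sentences.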

## Install Dependencies

In [4]:
!pip install textblob nltk scikit-learn tqdm

Collecting textblob
  Downloading textblob-0.18.0.post0-py3-none-any.whl.metadata (4.5 kB)
Downloading textblob-0.18.0.post0-py3-none-any.whl (626 kB)
Installing collected packages: textblob
Successfully installed textblob-0.18.0.post0


## Initial Dataset

In [30]:
# format: prompt (key, pos_or_neg) -> top-5 most relevant uiuds (value) [1 (most relevant) to 5 (5th most relevant)]
# if pos_or_neg = 1, it is a positive sample, if pos_or_neg=0, it is a negative sample 
dataset = {
    ("what is the Alex desk",1): [923, 924, 925, 926, 927],
    ("what is the Alex desk",0): [930, 931, 932, 933, 934],
    ("for the Alex desk, what are the warnings I should know of?",1): [924, 923, 925, 926, 927],
    ("for the Alex desk, what are the warnings I should know of?",0): [930, 931, 932, 933, 934],
    ("for the Alex desk, what parts do I need?",1): [925, 924, 923, 926, 927],
    ("for the Alex desk, what parts do I need?",0): [930, 931, 932, 933, 934],
    ("for the Alex desk, what is the first step?",1): [926, 925, 924, 923, 927],
    ("for the Alex desk, what is the first step?",0): [930, 931, 932, 933, 934],
    ("for the Alex desk, what is the second step?",1): [926, 925, 924, 923, 927],
    ("for the Alex desk, what is the second step?",0): [930, 931, 932, 933, 934],
    ("for the Alex desk, how many nails do I need for step one?",1): [926, 925, 924, 923, 927],
    ("for the Alex desk, how many nails do I need for step one?",0): [935, 931, 932, 933, 934],
    ("for the Alex desk, how many parts do I need for step two?",1): [926, 925, 924, 923, 927], 
    ("for the Alex desk, how many parts do I need for step two?",0): [936, 935, 932, 933, 934],
    ("for the Alex desk, what is the last step?",1): [936, 935, 932, 933, 934],
    ("for the Alex desk, what is the last step?",0): [926, 925, 924, 923, 927],
    ("what is the askholmen shelf",1): [831, 832, 833, 834, 835],
    ("what is the askholmen shelf",0): [838, 839, 840, 841, 842],
    ("for the askholmen shelf, what are the warnings I should know of?",1): [833, 832, 831, 834, 835],
    ("for the askholmen shelf, what are the warnings I should know of?",0): [842, 841, 839, 838, 840],
    ("for the askholmen shelf, what parts do I need?",1): [836, 831, 832, 834, 835],
    ("for the askholmen shelf, what parts do I need?",0): [842, 841, 839, 838, 840],
    ("for the askholmen shelf, what is the first step?",1): [837, 836, 832, 834, 835],
    ("for the askholmen shelf, what is the first step?",0): [842, 841, 839, 838, 840],
    ("for the askholmen shelf, what is the second step?",1):  [837, 836, 832, 834, 835],
    ("for the askholmen shelf, what is the second step?",0):  [842, 841, 839, 838, 840],
    ("for the askholmen shelf, how many nails do I need for step one?",1): [837, 836, 832, 834, 835],
    ("for the askholmen shelf, how many nails do I need for step one?",0): [842, 841, 839, 838, 840],
    ("for the askholmen shelf, how many parts do I need for step two?",1): [837, 836, 832, 834, 835], 
    ("for the askholmen shelf, how many parts do I need for step two?",0): [842, 841, 839, 838, 840], 
    ("for the askholmen shelf, what is the last step?",1):  [841, 840, 838, 837, 840],
     ("for the askholmen shelf, what is the last step?",0):  [831, 832, 833, 834, 835],
    ("what is the lunnarp table", 1): [739, 740, 741, 742, 743],
    ("what is the lunnarp table", 0): [750, 749, 748, 747, 746],
    ("for the lunnarp table, what are the warnings I should know of?", 1): [740, 739, 741, 742, 743],
    ("for the lunnarp table, what are the warnings I should know of?", 0): [750, 749, 748, 747, 746],
    ("for the lunnarp table, what parts do I need?", 1): [741, 742, 739, 740, 743],
    ("for the lunnarp table, what parts do I need?", 0): [750, 749, 748, 747, 746],
    ("for the lunnarp table, what is the first step?", 1): [742, 741, 739, 740, 743],
    ("for the lunnarp table, what is the first step?", 0): [750, 749, 748, 747, 746],
    ("for the lunnarp table, what is the second step?", 1): [743, 742, 741, 740, 739],
    ("for the lunnarp table, what is the second step?", 0): [750, 749, 748, 747, 746],
    ("for the lunnarp table, what tool do I need for step one?", 1): [742, 741, 739, 740, 743],
    ("for the lunnarp table, what tool do I need for step one?", 0): [750, 749, 748, 747, 746],
    ("for the lunnarp table, how many parts do I need for step two?", 1): [743, 742, 741, 740, 739],
    ("for the lunnarp table, how many parts do I need for step two?", 0): [750, 749, 748, 747, 746],
    ("for the lunnarp table, what is the last step?", 1): [748, 747, 746, 745, 744],
    ("for the lunnarp table, what is the last step?", 0): [739, 740, 741, 742, 743],
    ("what is the Flisat desk", 1): [887, 888, 889, 890, 891],
    ("what is the Flisat desk", 0): [906, 905, 904, 903, 902],
    ("for the Flisat desk, what are the warnings I should know of?", 1): [888, 887, 889, 890, 891],
    ("for the Flisat desk, what are the warnings I should know of?", 0): [906, 905, 904, 903, 902],
    ("for the Flisat desk, what parts do I need?", 1): [889, 887, 888, 890, 891],
    ("for the Flisat desk, what parts do I need?", 0): [906, 905, 904, 903, 902],
    ("for the Flisat desk, what is the first step?", 1): [890, 887, 889, 888, 891],
    ("for the Flisat desk, what is the first step?", 0): [906, 905, 904, 903, 902],
    ("for the Flisat desk, what is the second step?", 1): [891, 887, 889, 888, 890],
    ("for the Flisat desk, what is the second step?", 0): [906, 905, 904, 903, 902],
    ("for the Flisat desk, how many nails do I need for step one?", 1): [890, 887, 889, 888, 891],
    ("for the Flisat desk, how many nails do I need for step one?", 0): [906, 905, 904, 903, 902],
    ("for the Flisat desk, how many parts do I need for step two?", 1): [891, 887, 889, 888, 890],  
    ("for the Flisat desk, how many parts do I need for step two?", 0): [906, 905, 904, 903, 902],  
    ("for the Flisat desk, what is the last step?", 1): [906, 905, 904, 903, 902],
    ("for the Flisat desk, what is the last step?", 0): [887, 888, 889, 890, 891],
    ("what is the vittsjo shelf", 1): [871, 876, 875, 874, 873],
    ("what is the vittsjo shelf", 0): [886, 885, 884, 883, 882],
    ("for the vittsjo shelf, what are the warnings I should know of?", 1): [872, 873, 874, 875, 871],
    ("for the vittsjo shelf, what are the warnings I should know of?", 0): [886, 885, 884, 883, 882],
    ("for the vittsjo shelf, what parts do I need?", 1): [876, 871, 875, 874, 873],
    ("for the vittsjo shelf, what parts do I need?", 0): [886, 885, 884, 883, 882],
    ("for the vittsjo shelf, what is the first step?", 1): [877, 876, 871, 875, 878],
    ("for the vittsjo shelf, what is the first step?", 0): [886, 885, 884, 883, 882],
    ("for the vittsjo shelf, what is the second step?", 1): [878, 871, 875, 874, 883],
    ("for the vittsjo shelf, what is the second step?", 0): [886, 885, 884, 883, 882],
    ("for the vittsjo shelf, what pieces do I need for step one?", 1): [877, 876, 871, 875, 878],
    ("for the vittsjo shelf, what pieces do I need for step one?", 0): [886, 885, 884, 883, 882],
    ("for the vittsjo shelf, how many parts do I need for step two?", 1): [878, 871, 875, 874, 873], 
    ("for the vittsjo shelf, how many parts do I need for step two?", 0): [886, 885, 884, 883, 882], 
    ("for the vittsjo shelf, what is the last step?", 1): [885, 884, 883, 882, 881],
    ("for the vittsjo shelf, what is the last step?", 0): [877, 876, 871, 875, 878],
    ("what is the vaniljstang shelf", 1): [859, 860, 861, 863, 864],
    ("what is the vaniljstang shelf", 0): [870, 869, 868, 867, 865],
    ("for the vaniljstang shelf, what are the warnings I should know of?", 1): [860, 859, 861, 862, 863],
    ("for the vaniljstang shelf, what are the warnings I should know of?", 0): [870, 869, 868, 867, 865],
    ("for the vaniljstang shelf, what parts do I need?", 1): [861, 859, 862, 863, 864],
    ("for the vaniljstang shelf, what parts do I need?", 0): [870, 869, 868, 867, 865],
    ("for the vaniljstang shelf, what is the first step?", 1): [862, 859, 861, 863, 864],
    ("for the vaniljstang shelf, what is the first step?", 0): [870, 869, 868, 867, 865],
    ("for the vaniljstang shelf, what is the second step?", 1): [863, 859, 862, 861, 864],
    ("for the vaniljstang shelf, what is the second step?", 0): [870, 869, 868, 867, 865],
    ("for the vaniljstang shelf, what pieces do I need for step one?", 1): [862, 859, 861, 863, 864],
    ("for the vaniljstang shelf, what pieces do I need for step one?", 0): [870, 869, 868, 867, 865],
    ("for the vaniljstang shelf, how many parts do I need for step two?", 1): [863, 859, 862, 861, 864],
    ("for the vaniljstang shelf, how many parts do I need for step two?", 0): [870, 869, 868, 867, 865],
    ("for the vaniljstang shelf, what is the last step?", 1): [870, 869, 868, 867, 865],
    ("for the vaniljstang shelf, what is the last step?", 0): [859, 860, 861, 863, 864],
    ("what is the tornviken furniture", 1): [615, 616, 617, 618, 619],
    ("what is the tornviken furniture", 0): [638, 637, 636, 635, 634],
    ("for the tornviken furniture, what are the warnings I should know of?", 1): [616, 615, 617, 618, 619],
    ("for the tornviken furniture, what are the warnings I should know of?", 0): [638, 637, 636, 635, 634],
    ("for the tornviken furniture, what parts do I need?", 1): [617, 616, 615, 618, 619],
    ("for the tornviken furniture, what parts do I need?", 0): [638, 637, 636, 635, 634],
    ("for the tornviken furniture, what is the first step?", 1): [618, 615, 616, 617, 618],
    ("for the tornviken furniture, what is the first step?", 0): [638, 637, 636, 635, 634],
    ("for the tornviken furniture, what is the second step?", 1): [619, 616, 615, 617, 618],
    ("for the tornviken furniture, what is the second step?", 0): [638, 637, 636, 635, 634],
    ("for the tornviken furniture, what pieces do I need for step one?", 1): [618, 615, 616, 617, 618],
    ("for the tornviken furniture, what pieces do I need for step one?", 0): [638, 637, 636, 635, 634],
    ("for the tornviken furniture, how many parts do I need for step two?", 1): [619, 616, 615, 617, 618],
    ("for the tornviken furniture, how many parts do I need for step two?", 0): [638, 637, 636, 635, 634],
    ("for the tornviken furniture, what is the last step?", 1): [638, 637, 636, 635, 634],
    ("for the tornviken furniture, what is the last step?", 0): [619, 616, 615, 617, 618],
    ("what is the tommaryd table", 1): [723, 724, 725, 726, 727],
    ("what is the tommaryd table", 0): [738, 737, 736, 735, 734],
    ("for the tommaryd table, what are the warnings I should know of?", 1): [724, 723, 725, 726, 727],
    ("for the tommaryd table, what are the warnings I should know of?", 0): [738, 737, 736, 735, 734],
    ("for the tommaryd table, what parts do I need?", 1): [724, 725, 723, 726, 727],
    ("for the tommaryd table, what parts do I need?", 0): [738, 737, 736, 735, 734],
    ("for the tommaryd table, what is the first step?", 1): [725, 724, 723, 726, 727],
    ("for the tommaryd table, what is the first step?", 0): [738, 737, 736, 735, 734],
    ("for the tommaryd table, what is the second step?", 1): [726, 725, 724, 726, 727],
    ("for the tommaryd table, what is the second step?", 0): [738, 737, 736, 735, 734],
    ("for the tommaryd table, how many nails do I need for step one?", 1): [725, 724, 723, 726, 727],
    ("for the tommaryd table, how many nails do I need for step one?", 0): [738, 737, 736, 735, 734],
    ("for the tommaryd table, how many parts do I need for step two?", 1): [726, 725, 724, 726, 727],
    ("for the tommaryd table, how many parts do I need for step two?", 0): [738, 737, 736, 735, 734],
    ("for the tommaryd table, what is the last step?", 1): [735, 734, 733, 732, 731],
    ("for the tommaryd table, what is the last step?", 0): [725, 724, 723, 726, 727],
    ("what is the ronninge chair", 1): [97, 98, 99, 100, 101],
    ("what is the ronninge chair", 0): [108, 107, 106, 105, 104],
    ("for the ronninge chair, what are the warnings I should know of?", 1): [98, 97, 99, 100, 101],
    ("for the ronninge chair, what are the warnings I should know of?", 0): [108, 107, 106, 105, 104],
    ("for the ronninge chair, what parts do I need?", 1): [98, 97, 99, 100, 101],
    ("for the ronninge chair, what parts do I need?", 0): [108, 107, 106, 105, 104],
    ("for the ronninge chair, what is the first step?", 1): [99, 97, 98, 100, 101],
    ("for the ronninge chair, what is the first step?", 0): [108, 107, 106, 105, 104],
    ("for the ronninge chair, what is the second step?", 1): [100, 97, 99, 98, 101],
    ("for the ronninge chair, what is the second step?", 0): [108, 107, 106, 105, 104],
    ("for the ronninge chair, how many nails do I need for step one?", 1): [99, 97, 98, 100, 101],
    ("for the ronninge chair, how many nails do I need for step one?", 0): [108, 107, 106, 105, 104],
    ("for the ronninge chair, how many parts do I need for step two?", 1): [100, 97, 99, 98, 101],
    ("for the ronninge chair, how many parts do I need for step two?", 0): [108, 107, 106, 105, 104],
    ("for the ronninge chair, what is the last step?", 1): [107, 106, 105, 103, 102],
    ("for the ronninge chair, what is the last step?", 0): [97, 98, 99, 100, 101],
    ("what is the nordviken chair", 1): [287, 288, 289, 290, 291],
    ("what is the nordviken chair", 0): [298, 297, 296, 295, 294],
    ("for the nordviken chair, what are the warnings I should know of?", 1): [288, 287, 289, 290, 291],
    ("for the nordviken chair, what are the warnings I should know of?", 0): [298, 297, 296, 295, 294],
    ("for the nordviken chair, what parts do I need?", 1): [288, 287, 289, 290, 291],
    ("for the nordviken chair, what parts do I need?", 0): [298, 297, 296, 295, 294],
    ("for the nordviken chair, what is the first step?", 1): [289, 287, 288, 290, 291],
    ("for the nordviken chair, what is the first step?", 0): [298, 297, 296, 295, 294],
    ("for the nordviken chair, what is the second step?", 1): [289, 287, 288, 290, 291],
    ("for the nordviken chair, what is the second step?", 0): [298, 297, 296, 295, 294],
    ("for the nordviken chair, how many nails do I need for step one?", 1): [289, 287, 288, 290, 291],
    ("for the nordviken chair, how many nails do I need for step one?", 0): [298, 297, 296, 295, 294],
    ("for the nordviken chair, how many parts do I need for step two?", 1): [289, 287, 288, 290, 291],
    ("for the nordviken chair, how many parts do I need for step two?", 0): [298, 297, 296, 295, 294],
    ("for the nordviken chair, what is the last step?", 1): [296, 295, 294, 293, 292],
    ("for the nordviken chair, what is the last step?", 0): [287, 288, 289, 290, 291],
    ("what is the silveran bench", 1): [1, 2, 3, 4, 5],
    ("what is the silveran bench", 0): [8, 9, 10, 11, 12],
    ("for the silveran bench, what are the warnings I should know of?", 1): [2, 3, 1, 4, 5],
    ("for the silveran bench, what are the warnings I should know of?", 0): [12, 9, 8, 10, 11],
    ("for the silveran bench, what parts do I need?", 1): [2, 3, 1, 4, 5],
    ("for the silveran bench, what parts do I need?", 0): [11, 9, 10, 8, 12],
    ("for the silveran bench, what is the first step?", 1): [3, 1, 2, 4, 5],
    ("for the silveran bench, what is the first step?", 0): [12, 10, 8, 9, 11],
    ("for the silveran bench, what is the second step?", 1): [3, 1, 2, 4, 5],
    ("for the silveran bench, what is the second step?", 0): [11, 9, 8, 12, 10],
    ("for the silveran bench, how many nails do I need for step one?", 1): [3, 1, 2, 4, 5],
    ("for the silveran bench, how many nails do I need for step one?", 0): [10, 9, 8, 12, 11],
    ("for the silveran bench, how many parts do I need for step two?", 1): [3, 1, 2, 4, 5],
    ("for the silveran bench, how many parts do I need for step two?", 0): [12, 8, 9, 10, 11],
    ("for the silveran bench, what is the last step?", 1): [11, 10, 8, 9, 7],
    ("for the silveran bench, what is the last step?", 0): [1, 2, 3, 4, 5],
    ("what is the tjusig shelf",1): [65, 66, 67, 68, 69],
    ("what is the tjusig shelf",0): [73, 74, 75, 76, 72],
    ("for the tjusig bench, what are the warnings I should know of?",1): [66, 65, 67, 68, 69],
    ("for the tjusig bench, what are the warnings I should know of?",0): [73, 72, 75, 76, 74],
    ("for the tjusig bench, what parts do I need?",1): [67, 65, 66, 68, 69],
    ("for the tjusig bench, what parts do I need?",0): [75, 72, 73, 76, 74],
    ("for the tjusig bench, what is the first step?",1): [68, 65, 66, 67, 69],
    ("for the tjusig bench, what is the first step?",0): [73, 72, 75, 76, 74],
    ("for the tjusig bench, what is the second step?",1): [68, 65, 66, 67, 69],
    ("for the tjusig bench, what is the second step?",0): [72, 73, 75, 76, 74],
    ("for the tjusig bench, how many nails do I need for step one?",1): [68, 65, 66, 67, 69],
    ("for the tjusig bench, how many nails do I need for step one?",0): [73, 72, 75, 76, 74],
    ("for the tjusig bench, how many parts do I need for step two?",1): [68, 65, 66, 67, 69],
    ("for the tjusig bench, how many parts do I need for step two?",0): [73, 75, 72, 76, 74],
    ("for the tjusig bench, what is the last step?",1): [75, 74, 73, 72, 71],
    ("for the tjusig bench, what is the last step?",0): [65, 66, 67, 68, 69],
    ("what is the herman chair", 1): [183, 184, 185, 186, 187],
    ("what is the herman chair", 0): [190, 191, 192, 193, 194],
    ("for the herman chair, what are the warnings I should know of?", 1): [184, 183, 185, 186, 187],
    ("for the herman chair, what are the warnings I should know of?", 0): [194, 191, 192, 193, 190],
    ("for the herman chair, what parts do I need?", 1): [184, 183, 185, 186, 187],
    ("for the herman chair, what parts do I need?", 0): [194, 193, 192, 191, 190],
    ("for the herman chair, what is the first step?", 1): [185, 184, 183, 186, 187],
    ("for the herman chair, what is the first step?", 0): [194, 193, 192, 191, 190],
    ("for the herman chair, what is the second step?", 1): [185, 184, 183, 186, 187],
    ("for the herman chair, what is the second step?", 0): [191, 193, 192, 194, 190],
    ("for the herman chair, how many nails do I need for step one?", 1): [185, 184, 183, 186, 187],
    ("for the herman chair, how many nails do I need for step one?", 0): [194, 193, 192, 191, 190],
    ("for the herman chair, how many parts do I need for step two?", 1): [185, 184, 183, 186, 187],
    ("for the herman chair, how many parts do I need for step two?", 0): [194, 193, 192, 191, 190],
    ("for the herman chair, what is the last step?", 1): [191, 190, 189, 188, 192],
    ("for the herman chair, what is the last step?", 0): [183, 184, 185, 186, 187],
    ("what is the norraker chair", 1): [135, 136, 137, 138, 139],
    ("what is the norraker chair", 0): [146, 142, 145, 144, 143],
    ("for the norraker chair, what are the warnings I should know of?", 1): [136, 135, 137, 138, 139],
    ("for the norraker chair, what are the warnings I should know of?", 0): [146, 142, 145, 144, 143],
    ("for the norraker chair, what parts do I need?", 1): [137, 136, 135, 138, 139],
    ("for the norraker chair, what parts do I need?", 0): [146, 142, 145, 144, 143],
    ("for the norraker chair, what is the first step?", 1): [137, 136, 135, 138, 139],
    ("for the norraker chair, what is the first step?", 0): [146, 142, 145, 144, 143],
    ("for the norraker chair, what is the second step?", 1): [137, 136, 135, 138, 139],
    ("for the norraker chair, what is the second step?", 0): [146, 142, 145, 144, 143],
    ("for the norraker chair, how many nails do I need for step one?", 1): [137, 136, 135, 138, 139],
    ("for the norraker chair, how many nails do I need for step one?", 0): [146, 142, 145, 144, 143],
    ("for the norraker chair, how many parts do I need for step two?", 1): [137, 136, 135, 138, 139],
    ("for the norraker chair, how many parts do I need for step two?", 0): [146, 142, 145, 144, 143],
    ("for the norraker chair, what is the last step?", 1): [146, 145, 144, 143, 142],
    ("for the norraker chair, what is the last step?", 0): [135, 136, 137, 138, 139]
}

# ignore me!
# test_set={
#     "what is the applaro desk": {"all_uiuds": [21, 22, 23, 24, 25, 26,27], "top_5_uiuds": [21,22,23,24,25]},
#     "for the applaro desk, what are the warnings I should know of?": {"all_uiuds": [21, 22, 23, 24, 25, 26,27], "top_5_uiuds": [22,21,23,24,25]},
#     "for the applaro desk, what parts do I need?": {"all_uiuds": [21, 22, 23, 24, 25, 26,27], "top_5_uiuds": [22,21,23,24,25]},
#     "for the applaro desk, what is the first step?": {"all_uiuds": [21, 22, 23, 24, 25, 26,27], "top_5_uiuds": [23,22,21,24,25]},
#     "for the applaro desk, what is the second step?": {"all_uiuds": [21, 22, 23, 24, 25, 26,27], "top_5_uiuds": [24,22,21,23,25]},
#     "for the applaro desk, how many nails do I need for step one?": {"all_uiuds": [21, 22, 23, 24, 25, 26,27], "top_5_uiuds": [23,22,21,24,25]},
#     "for the applaro desk, how many parts do I need for step two?": {"all_uiuds": [21, 22, 23, 24, 25, 26,27], "top_5_uiuds": [23,22,21,24,25]},
# }

In [31]:
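# Each furniture piece contributes 16 entries (8 prompts x positive/negative label),
# so dividing the number of keys by 16 gives the number of furniture pieces
n_unique = len(dataset) / 16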
print("Number of unique furniture pieces in dataset", n_unique)
print("Size of total dataset", len(dataset.keys()))

Number of unique furniture pieces in dataset 14.0
Size of total dataset 224
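
Since the dataset is written by hand, a quick consistency check can catch entries that are easy to miss by eye. The sketch below assumes the `(prompt, label) -> [ids]` format used above; the helper names `check_label_pairs` and `find_duplicate_ids` are hypothetical. It reports prompts that lack either a positive or a negative sample, and entries whose first five IDs contain a repeat.

```python
from collections import defaultdict


def check_label_pairs(dataset):
    """Return prompts that do not have both a positive (1) and a negative (0) sample."""
    labels_by_prompt = defaultdict(set)
    for prompt, label in dataset:
        labels_by_prompt[prompt].add(label)
    return sorted(p for p, labels in labels_by_prompt.items() if labels != {0, 1})


def find_duplicate_ids(dataset, n_ids=5):
    """Return entries whose first n_ids values repeat an index."""
    return {
        key: values[:n_ids]
        for key, values in dataset.items()
        if len(set(values[:n_ids])) < len(values[:n_ids])
    }


# Toy data in the same format as `dataset`
toy = {
    ("what is the Alex desk", 1): [923, 924, 925, 926, 927],
    ("what is the Alex desk", 0): [930, 931, 932, 933, 934],
    ("for the Alex desk, what parts do I need?", 1): [925, 924, 923, 926, 926],
}
print(check_label_pairs(toy))
print(find_duplicate_ids(toy))
```

Running both helpers on the full `dataset` before adding relevancy scores would flag any ID list that names the same image twice.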


In [32]:
import json

# Load the JSON data from the file
with open('image_metadata.json', 'r') as file:
    data = json.load(file)

# Extract filenames
filenames = [item['filename'] for item in data.values()]

# Extract unique furniture items from filenames
unique_items = set()
for filename in filenames:
    item = filename.split('.page')[0]
    unique_items.add(item)

# Count the number of unique furniture items
num_unique_items = len(unique_items)

print(f'Total number of unique furniture items: {num_unique_items}')


Total number of unique furniture items: 90


In [33]:
print("Ratio of furniture in dataset to total unique furniture pieces", round(n_unique/num_unique_items,3))

Ratio of furniture in dataset to total unique furniture pieces 0.156


## Add relevancy scores to retrieved images in dataset:

In [34]:
# Arrays to be appended
first_array = [1, 0.6, 0.2, 0.2]
second_array = [0, 0, 0, 0]

# List of keys
keys = list(dataset.keys())

# Append scores by label rather than by position, so the result does not depend
# on dictionary order: positives (label 1) get first_array, negatives get second_array
for key in keys:
    dataset[key].extend(first_array if key[1] == 1 else second_array)

# Printing updated dataset to check
for key in keys[:5]:  # Print the first 5 entries to check
    print(f'{key}: {dataset[key]}')

for key in keys[49:54]:  # Print 5 entries starting from the 50th to check
    print(f'{key}: {dataset[key]}')

('what is the Alex desk', 1): [923, 924, 925, 926, 927, 1, 0.6, 0.2, 0.2]
('what is the Alex desk', 0): [930, 931, 932, 933, 934, 0, 0, 0, 0]
('for the Alex desk, what parts do I need?', 1): [925, 924, 923, 926, 927, 1, 0.6, 0.2, 0.2]
('what is the Flisat desk', 0): [906, 905, 904, 903, 902, 0, 0, 0, 0]
('for the Flisat desk, what parts do I need?', 1): [889, 887, 888, 890, 891, 1, 0.6, 0.2, 0.2]
('for the Flisat desk, what parts do I need?', 0): [906, 905, 904, 903, 902, 0, 0, 0, 0]
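
After this step each value is a flat list holding five image indices followed by the appended scores. Note that only four scores are appended for five indices, so the fifth index has no score of its own; the notebook does not say whether that is intentional. A minimal sketch for separating the two parts again, assuming the first five values are always the indices (`split_ids_and_scores` is a hypothetical helper):

```python
def split_ids_and_scores(values, n_ids=5):
    """Split a flat [ids..., scores...] list into its two parts."""
    return values[:n_ids], values[n_ids:]


entry = [923, 924, 925, 926, 927, 1, 0.6, 0.2, 0.2]
ids, scores = split_ids_and_scores(entry)
print(ids)                     # [923, 924, 925, 926, 927]
print(scores)                  # [1, 0.6, 0.2, 0.2]
print(len(ids), len(scores))   # 5 4
```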


In [35]:
# New dictionary
new_dict = {}

# Iterate through the original dictionary
for (question, value) in dataset.items():
    # Split the question into words
    words = question[0].split()

    # Find the first word that matches an entry in unique_items
    embed_key = None
    for word in words:
        if word.lower() in unique_items:
            embed_key = word.lower()
            break
    
    # Reformat the value
    new_value = {
        "idxs_and_scores": value,
        "embed_key": embed_key
    }
    
    # Add to the new dictionary
    new_dict[question] = new_value

# Print the new dictionary
for k, v in new_dict.items():
    print(f"{k}: {v}")

('what is the Alex desk', 1): {'idxs_and_scores': [923, 924, 925, 926, 927, 1, 0.6, 0.2, 0.2], 'embed_key': 'alex'}
('what is the Alex desk', 0): {'idxs_and_scores': [930, 931, 932, 933, 934, 0, 0, 0, 0], 'embed_key': 'alex'}
('for the Alex desk, what parts do I need?', 1): {'idxs_and_scores': [925, 924, 923, 926, 927, 1, 0.6, 0.2, 0.2], 'embed_key': 'alex'}
('for the Alex desk, what parts do I need?', 0): {'idxs_and_scores': [930, 931, 932, 933, 934, 0, 0, 0, 0], 'embed_key': 'alex'}
('for the Alex desk, what is the first step?', 1): {'idxs_and_scores': [926, 925, 924, 923, 927, 1, 0.6, 0.2, 0.2], 'embed_key': 'alex'}
('for the Alex desk, what is the first step?', 0): {'idxs_and_scores': [930, 931, 932, 933, 934, 0, 0, 0, 0], 'embed_key': 'alex'}
('for the Alex desk, what is the second step?', 1): {'idxs_and_scores': [926, 925, 924, 923, 927, 1, 0.6, 0.2, 0.2], 'embed_key': 'alex'}
('for the Alex desk, what is the second step?', 0): {'idxs_and_scores': [930, 931, 932, 933, 934, 0, 0, 

In [36]:
import random

def remap_keys(mapping):
    return {str(k): v for k, v in mapping.items()}


# Split the keys into train, val, and test sets
keys = list(new_dict.keys())
random.shuffle(keys)

train_split = int(0.7 * len(keys))
val_split = int(0.85 * len(keys))

train_keys = keys[:train_split]
val_keys = keys[train_split:val_split]
test_keys = keys[val_split:]

# Create train, val, and test dictionaries
train_dict = {k: new_dict[k] for k in train_keys}
val_dict = {k: new_dict[k] for k in val_keys}
test_dict = {k: new_dict[k] for k in test_keys}

# Save each dictionary to a JSON file
with open('augmented_data/initial_dataset_no_aug_train.json', 'w') as f:
    json.dump(remap_keys(train_dict), f, indent=4)

with open('augmented_data/initial_dataset_no_aug_val.json', 'w') as f:
    json.dump(remap_keys(val_dict), f, indent=4)

with open('augmented_data/initial_dataset_no_aug_test.json', 'w') as f:
    json.dump(remap_keys(test_dict), f, indent=4)

print("Data has been split and saved to train.json, val.json, and test.json")

Data has been split and saved to train.json, val.json, and test.json
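
The split above uses an unseeded `random.shuffle`, so each run of the cell produces a different train/val/test partition. If reproducibility matters, one option is a seeded shuffle over a sorted copy of the keys. This is a sketch under that assumption; `split_keys` and its `seed` parameter are not part of the original notebook.

```python
import random


def split_keys(keys, train_frac=0.7, val_end_frac=0.85, seed=0):
    """Deterministically split keys into train/val/test lists."""
    ordered = sorted(keys)              # fixed starting order, independent of dict order
    random.Random(seed).shuffle(ordered)  # local RNG, leaves the global state untouched
    train_end = int(train_frac * len(ordered))
    val_end = int(val_end_frac * len(ordered))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Toy keys in the same (prompt, label) format, 224 entries like the real dataset
toy_keys = [(f"prompt {i // 2}", i % 2) for i in range(224)]
train_keys_demo, val_keys_demo, test_keys_demo = split_keys(toy_keys)
print(len(train_keys_demo), len(val_keys_demo), len(test_keys_demo))  # 156 34 34
```

Note that splitting by individual key, as both versions do, can place the positive and negative sample of the same prompt, or different prompts about the same furniture piece, in different splits. Whether that matters depends on how the splits are used downstream.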


In [37]:
def remap_keys(mapping):
    return {str(k): v for k, v in mapping.items()}
    
# Save the updated dataset to a JSON file
with open('augmented_data/initial_dataset_no_aug.json', 'w') as json_file:
    json.dump(remap_keys(dataset), json_file, indent=4)

# Confirming that the file has been saved
print("Updated dataset saved to 'initial_dataset_no_aug.json'.")

Updated dataset saved to 'initial_dataset_no_aug.json'.


In [38]:
from ast import literal_eval

# Load in data and unwrap it 
# Function to convert stringified tuple keys back to tuples
def unwrap_keys(mapping):
    return {literal_eval(k): v for k, v in mapping.items()}

# Load the JSON file
with open('augmented_data/initial_dataset_no_aug.json', 'r') as json_file:
    data_from_json = json.load(json_file)

# print(data_from_json)
# Unwrap the keys to their original tuple format
unwrapped_data = unwrap_keys(data_from_json)

print("JSON file loaded and keys unwrapped successfully.")
# print(unwrapped_data)

JSON file loaded and keys unwrapped successfully.
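
JSON object keys must be strings, which is why the tuple keys are passed through `str` on save and `ast.literal_eval` on load. The round trip can be verified on a small example; this sketch works in memory and does not touch the files written above.

```python
import json
from ast import literal_eval

original = {("what is the Alex desk", 1): [923, 924, 925, 926, 927]}

# Save side: tuple keys become strings such as "('what is the Alex desk', 1)"
encoded = json.dumps({str(k): v for k, v in original.items()})

# Load side: literal_eval safely parses the string back into a tuple
decoded = {literal_eval(k): v for k, v in json.loads(encoded).items()}

print(decoded == original)  # True
```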


## Todo

- Add image embedding for retrieved images (new notebook)
- Add text embedding (new notebook)
- ~~Add [1, 0.6, 0.2,0.2,0.2] to first 49 entries, [0,0,0,0,0] for next 49 entries (here)~~
- ~~Fix data to point to correct images (here)~~

## Naive Augmentation - make the dataset 10x larger

In [50]:
import json
from nltk.corpus import wordnet
from textblob import TextBlob
from tqdm import tqdm
# Example dataset
# dataset = {
#     "what is the Alex desk": [923, 924, 925, 926, 927],
#     "for the Alex desk, what are the warnings I should know of?": [924, 923, 925, 926, 927],
#     "for the Alex desk, what parts do I need?": [925, 924, 923, 926, 927],
#     "for the Alex desk, what is the first step?": [926, 925, 924, 923, 927],
#     "for the Alex desk, what is the second step?": [926, 925, 924, 923, 927],
#     "for the Alex desk, how many nails do I need for step one?": [926, 925, 924, 923, 927],
#     "for the Alex desk, how many parts do I need for step two?": [926, 925, 924, 923, 927],
# }

TRAIN_TEST_SPLIT_TYPE = "val"
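# Synonym replacement: swap each word for the first lemma of its first WordNet synset
# (requires the WordNet corpus, e.g. nltk.download('wordnet'), to be available)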
def synonym_replacement(sentence):
    words = sentence.split()
    new_sentence = []
    for word in words:
        synonyms = wordnet.synsets(word)
        if synonyms:
            synonym = synonyms[0].lemmas()[0].name()
            if synonym != word and synonym.isalpha():
                new_sentence.append(synonym)
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return ' '.join(new_sentence)

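# Paraphrasing using back-translation with multiple languages.
# Note: TextBlob.translate() was deprecated in textblob 0.16 and appears to have
# been removed in 0.18, in which case every call below fails and no paraphrases
# are produced. Failures are reported instead of being silently swallowed.
import warnings

def back_translate(sentence, languages=('fr', 'es', 'de')):
    translations = []
    blob = TextBlob(sentence)
    for lang in languages:
        try:
            translated = str(blob.translate(to=lang).translate(to='en'))
        except Exception as e:
            # Skip this language only; other languages may still succeed
            warnings.warn(f"Back-translation via '{lang}' failed: {e!r}")
            continue
        if translated != sentence:
            translations.append(translated)
    return translations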

# Template-based augmentation
def template_augmentation(sentence):
    templates = [
        # "Can you tell me about {}?", 
        # "I would like to know about {}.",
        # "What can you say about {}?", 
        # "Provide details about {}.", # train ^^^
        "{} - could you elaborate?",
        "Please explain {} in detail.", 
        "What information is available on {}?", # val ^^^
        # "Could you tell me about {}?",
        # "I need information on {}.",
        # "Could you provide more details on {}?" # test ^^^ 
    ]
    augmented_sentences = []
    for template in templates:
        augmented_sentences.append(template.format(sentence))
    return augmented_sentences

# Function to augment the dataset multiple times
def augment_dataset(dataset, rounds=1):
    augmented_dataset = dataset.copy()
    for _ in range(rounds):
        new_entries = {}
        for question, ids in tqdm(augmented_dataset.items()):
            question, part = question[0],question[1]
            # Synonym Replacement
            augmented_sentence = synonym_replacement(question)
            augmented_tuple = (augmented_sentence, part)
            # print(augmented_tuple)
            if augmented_sentence != question:
                new_entries[augmented_tuple] = ids
            
            # Back-Translation Paraphrasing
            augmented_sentences = back_translate(question)
            for augmented_sentence in augmented_sentences:
                augmented_tuple = (augmented_sentence, part)
                if augmented_sentence != question:
                    new_entries[augmented_tuple] = ids
            
            # Template-Based Augmentation
            augmented_sentences = template_augmentation(question)
            for augmented_sentence in augmented_sentences:
                augmented_tuple = (augmented_sentence, part)
                if augmented_sentence != question:
                    new_entries[augmented_tuple] = ids
            # print(new_entries)
        augmented_dataset.update(new_entries)
    return augmented_dataset

# Augment the dataset
if TRAIN_TEST_SPLIT_TYPE == "train":
    aug_dict = train_dict
elif TRAIN_TEST_SPLIT_TYPE == "val":
    aug_dict = val_dict
elif TRAIN_TEST_SPLIT_TYPE == "test":
    aug_dict = test_dict
else:
    raise ValueError(f"Invalid TRAIN_TEST_SPLIT_TYPE: {TRAIN_TEST_SPLIT_TYPE!r}")
augmented_dataset = augment_dataset(aug_dict, rounds=1)

# augment_dataset() starts from a copy of aug_dict, so the originals are already
# included; the merge below is kept only as a safeguard
final_dataset = {**aug_dict, **augmented_dataset}

# Save the augmented dataset to a JSON file including the original dataset
with open(f'augmented_data/augmented_dataset_{TRAIN_TEST_SPLIT_TYPE}.json', 'w') as f:
    json.dump(remap_keys(final_dataset), f, indent=4)

# Load the augmented dataset from the JSON file
with open(f'augmented_data/augmented_dataset_{TRAIN_TEST_SPLIT_TYPE}.json', 'r') as f:
    loaded_augmented_dataset = json.load(f)

# Print the loaded augmented dataset
# print(json.dumps(loaded_augmented_dataset, indent=4))

# Print the length of the augmented dataset
print(f"Length of the augmented dataset: {len(loaded_augmented_dataset)}")


100%|██████████████████████████████████████████████████████| 34/34 [00:00<00:00, 6219.20it/s]

Length of the augmented dataset: 170
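
The count of 170 is consistent with each of the 34 entries gaining one synonym variant and three template variants (34 × 5 = 170), which may indicate that back-translation contributed nothing in this run. A minimal sketch of a back-translation helper that does not depend on `TextBlob.translate()` is shown below. The `translate_fn` parameter, the `back_translate_with` name, and the `toy_translate` stand-in are all assumptions made for illustration; a real translation backend would be plugged in where the toy function is used.

```python
from typing import Callable, Iterable, List

# Hypothetical signature: translate_fn(text, source_lang, target_lang) -> translated text
TranslateFn = Callable[[str, str, str], str]


def back_translate_with(sentence: str, translate_fn: TranslateFn,
                        languages: Iterable[str] = ("fr", "es", "de")) -> List[str]:
    """Round-trip `sentence` through each pivot language and collect new paraphrases."""
    paraphrases: List[str] = []
    for lang in languages:
        try:
            pivot = translate_fn(sentence, "en", lang)
            restored = translate_fn(pivot, lang, "en")
        except Exception:
            continue  # skip only the failing language
        if restored != sentence and restored not in paraphrases:
            paraphrases.append(restored)
    return paraphrases


# Toy stand-in for demonstration only; it is not a real translation backend.
def toy_translate(text: str, source: str, target: str) -> str:
    if target == "de":
        raise RuntimeError("backend unavailable")
    if source == "en":
        return f"[{target}] {text}"
    # Coming back to English: strip the tag; only the 'fr' round trip rewords the text
    original = text.split("] ", 1)[1]
    return original.replace("what is", "what's") if source == "fr" else original


result = back_translate_with("what is the Alex desk", toy_translate)
print(result)
```

Handling errors per language means one failing pivot does not discard paraphrases already obtained from the others.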





In [100]:
# Load the augmented dataset from the JSON file
with open('augmented_data/augmented_dataset.json', 'r') as f:
    loaded_augmented_dataset = json.load(f)

# Print the loaded augmented dataset
# print(json.dumps(loaded_augmented_dataset, indent=4))

# Print the length of the augmented dataset
print(f"Length of the augmented dataset: {len(loaded_augmented_dataset)}")

Length of the augmented dataset: 1176
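
The save step in the previous cell passes the dataset through `remap_keys`, which is not shown in this section. It is needed because `json.dump` rejects tuple keys such as `(prompt, pos_or_neg)`. The sketch below shows one possible way to make such keys JSON-safe and to restore them after loading. The `encode_keys` and `decode_keys` helpers and the record field names are hypothetical and are not necessarily what `remap_keys` does.

```python
import json


def encode_keys(dataset):
    """Turn {(prompt, label): ids} into a JSON-serializable list of records."""
    return [{"prompt": prompt, "label": label, "ids": ids}
            for (prompt, label), ids in dataset.items()]


def decode_keys(records):
    """Rebuild the tuple-keyed dict from the list of records."""
    return {(r["prompt"], r["label"]): r["ids"] for r in records}


sample = {
    ("what is the Alex desk", 1): [923, 924, 925, 926, 927],
    ("what is the Alex desk", 0): [930, 931, 932, 933, 934],
}

# Dumping the tuple-keyed dict directly fails
try:
    json.dumps(sample)
    direct_dump_ok = True
except TypeError:
    direct_dump_ok = False

# Round trip through the record encoding
restored = decode_keys(json.loads(json.dumps(encode_keys(sample))))
print(direct_dump_ok, restored == sample)
```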


## Augment to 500 samples with a train/val/test split

- Saved as `augmented_train_dataset.json`, `augmented_test_dataset.json`, `augmented_val_dataset.json`

In [50]:
import json
import random
from nltk.corpus import wordnet
from textblob import TextBlob
from sklearn.model_selection import train_test_split

# Example dataset
# dataset = {
#     "what is the Alex desk": [923, 924, 925, 926, 927],
#     "for the Alex desk, what are the warnings I should know of?": [924, 923, 925, 926, 927],
#     "for the Alex desk, what parts do I need?": [925, 924, 923, 926, 927],
#     "for the Alex desk, what is the first step?": [926, 925, 924, 923, 927],
#     "for the Alex desk, what is the second step?": [926, 925, 924, 923, 927],
#     "for the Alex desk, how many nails do I need for step one?": [926, 925, 924, 923, 927],
#     "for the Alex desk, how many parts do I need for step two?": [926, 925, 924, 923, 927],
# }

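# Synonym replacement: swap each word for the first lemma of its first WordNet synset.
# There is no part-of-speech or sense disambiguation, so meaning is not guaranteed to be preserved.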
def synonym_replacement(sentence):
    words = sentence.split()
    new_sentence = []
    for word in words:
        synonyms = wordnet.synsets(word)
        if synonyms:
            synonym = synonyms[0].lemmas()[0].name()
            if synonym != word and synonym.isalpha():
                new_sentence.append(synonym)
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return ' '.join(new_sentence)

# Paraphrasing using back-translation with multiple languages.
# NOTE: TextBlob.translate() was removed in textblob 0.18.0; on that version the call
# raises AttributeError and this function returns no paraphrases.
def back_translate(sentence, languages=('fr', 'es', 'de')):
    translations = []
    blob = TextBlob(sentence)
    for lang in languages:
        try:
            translated = str(blob.translate(to=lang).translate(to='en'))
        except Exception:
            # Skip only the failing language instead of aborting the remaining ones
            continue
        if translated != sentence:
            translations.append(translated)
    return translations

# Expanded Template array
templates = [
    "Can you tell me about {}?",
    "I would like to know about {}.",
    "What can you say about {}?",
    "Provide details about {}.",
    "{} - could you elaborate?",
    "Please explain {} in detail.",
    "What information is available on {}?",
    "Could you tell me about {}?",
    "I need information on {}.",
    "Could you provide more details on {}?",
    "Tell me something regarding {}.",
    "Give me an explanation about {}.",
    "Can you elaborate more on {}?",
    "Provide more insights about {}.",
    "{} - can you give further details?",
    "Could you explain {} more thoroughly?",
    "What do you know about {}?",
    "I want to understand {} better.",
    "{} - could you clarify?",
    "Please provide details regarding {}.",
    "Explain {} to me.",
    "Elaborate on {} please.",
    "Give me an overview of {}.",
    "Tell me what you know about {}.",
    "Could you shed light on {}?",
    "Clarify {} for me.",
    "Discuss {} in depth.",
    "{} - what's the scoop?",
    "Break down {} for me.",
]

# Function to randomly select a template
def random_template(templates):
    return random.choice(templates)

# Function to apply augmentation based on dataset type
def apply_augmentation(dataset, augment_func, rounds=10):
    augmented_dataset = {}
    for _ in range(rounds):
        for question, ids in dataset:
            augmented_sentences = augment_func(question)
            for augmented_sentence in augmented_sentences:
                if augmented_sentence not in augmented_dataset:
                    augmented_dataset[augmented_sentence] = ids
    return augmented_dataset

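# Per-split augmentation hooks for the train, val, and test sets.
# Currently all three are identical: each formats the question with one random template
# drawn from the same shared `templates` list.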
def augment_train_set(question):
    selected_template = random_template(templates)
    augmented_sentence = selected_template.format(question)
    return [augmented_sentence]

def augment_val_set(question):
    selected_template = random_template(templates)
    augmented_sentence = selected_template.format(question)
    return [augmented_sentence]

def augment_test_set(question):
    selected_template = random_template(templates)
    augmented_sentence = selected_template.format(question)
    return [augmented_sentence]

# Split dataset into train, val, test sets
train_dataset, test_dataset = train_test_split(list(dataset.items()), test_size=0.2, random_state=42)
train_dataset, val_dataset = train_test_split(train_dataset, test_size=0.1, random_state=42)

# Apply augmentation to each dataset
augmented_train = apply_augmentation(train_dataset, augment_train_set, rounds=10)
augmented_val = apply_augmentation(val_dataset, augment_val_set, rounds=10)
augmented_test = apply_augmentation(test_dataset, augment_test_set, rounds=10)

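# Combine augmented datasets into final train, val, test sets.
# NOTE: merging the full `dataset` puts every original question into all three splits;
# merge dict(train_dataset), dict(val_dataset), and dict(test_dataset) instead to keep the splits disjoint.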
final_train_dataset = {**dataset, **augmented_train}
final_val_dataset = {**dataset, **augmented_val}
final_test_dataset = {**dataset, **augmented_test}

# Save the augmented datasets to JSON files
with open('augmented_data/augmented_train_dataset.json', 'w') as f:
    json.dump(final_train_dataset, f, indent=4)

with open('augmented_data/augmented_val_dataset.json', 'w') as f:
    json.dump(final_val_dataset, f, indent=4)

with open('augmented_data/augmented_test_dataset.json', 'w') as f:
    json.dump(final_test_dataset, f, indent=4)

# Print lengths of augmented datasets
print(f"Length of augmented train dataset: {len(final_train_dataset)}")
print(f"Length of augmented val dataset: {len(final_val_dataset)}")
print(f"Length of augmented test dataset: {len(final_test_dataset)}")

# Print 10 random samples from each dataset for verification
print("\nRandom samples from augmented train dataset:")
print(random.sample(list(final_train_dataset.items()), 10))
print("\nRandom samples from augmented val dataset:")
print(random.sample(list(final_val_dataset.items()), 10))
print("\nRandom samples from augmented test dataset:")
print(random.sample(list(final_test_dataset.items()), 10))


Length of augmented train dataset: 348
Length of augmented val dataset: 84
Length of augmented test dataset: 136

Random samples from augmented train dataset:
[('Provide more insights about for the vittsjo shelf, what parts do I need?.', [876, 871, 885, 884, 883]), ('Can you elaborate more on for the vittsjo shelf, what parts do I need??', [876, 871, 885, 884, 883]), ('Give me an overview of for the Alex desk, how many nails do I need for step one?.', [926, 925, 924, 923, 927]), ('for the Alex desk, how many parts do I need for step two?', [926, 925, 924, 923, 927]), ('for the Pahl desk, how many nails do I need for step one?', [917, 915, 916, 918, 919]), ('for the Pahl desk, how many parts do I need for step two?', [918, 915, 916, 917, 919]), ('Provide details about for the Flisat desk, what parts do I need?.', [889, 887, 888, 890, 891]), ('What do you know about for the vaniljstang shelf, what is the first step??', [862, 859, 861, 863, 870]), ('for the Fredrik desk, what parts do I n
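
The cell above merges the whole `dataset` into each split and draws from one shared template list, so the three files overlap. A minimal sketch of a leakage-free variant follows: it splits first, keeps only each split's own originals, and gives each split its own template pool, following the train/val/test grouping marked in the comments of the earlier template cell. The names `TEMPLATE_POOLS`, `split_items`, and `augment_split` are hypothetical, the split uses only the standard library as a stand-in for `train_test_split`, and the fractions are taken from the total rather than from the remaining training set.

```python
import random

# Disjoint template pools per split (grouping taken from the earlier cell's comments)
TEMPLATE_POOLS = {
    "train": ["Can you tell me about {}?", "I would like to know about {}.",
              "What can you say about {}?", "Provide details about {}."],
    "val": ["{} - could you elaborate?", "Please explain {} in detail.",
            "What information is available on {}?"],
    "test": ["Could you tell me about {}?", "I need information on {}.",
             "Could you provide more details on {}?"],
}


def split_items(items, val_frac=0.1, test_frac=0.2, seed=42):
    """Shuffle once, then slice into disjoint train/val/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = max(1, round(len(items) * test_frac))
    n_val = max(1, round(len(items) * val_frac))
    return {
        "test": items[:n_test],
        "val": items[n_test:n_test + n_val],
        "train": items[n_test + n_val:],
    }


def augment_split(name, items):
    """Keep only this split's originals and add one variant per template in its pool."""
    out = dict(items)
    for question, ids in items:
        for template in TEMPLATE_POOLS[name]:
            out.setdefault(template.format(question), ids)
    return out


seed_data = {f"question {i}": [i] for i in range(10)}  # toy data for illustration
splits = split_items(seed_data.items())
final = {name: augment_split(name, items) for name, items in splits.items()}
print({name: len(entries) for name, entries in final.items()})
```

Because both the originals and the templates are partitioned, no prompt string can appear in more than one split.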

## Augment to over 1 million samples from 7 seed samples

In [16]:
import json
from nltk.corpus import wordnet
from textblob import TextBlob

# Example dataset
dataset = {
    "what is the Alex desk": [923, 924, 925, 926, 927],
    "for the Alex desk, what are the warnings I should know of?": [924, 923, 925, 926, 927],
    "for the Alex desk, what parts do I need?": [925, 924, 923, 926, 927],
    "for the Alex desk, what is the first step?": [926, 925, 924, 923, 927],
    "for the Alex desk, what is the second step?": [926, 925, 924, 923, 927],
    "for the Alex desk, how many nails do I need for step one?": [926, 925, 924, 923, 927],
    "for the Alex desk, how many parts do I need for step two?": [926, 925, 924, 923, 927],
}

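# Synonym replacement: swap each word for the first lemma of its first WordNet synset.
# There is no part-of-speech or sense disambiguation, so meaning is not guaranteed to be preserved.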
def synonym_replacement(sentence):
    words = sentence.split()
    new_sentence = []
    for word in words:
        synonyms = wordnet.synsets(word)
        if synonyms:
            synonym = synonyms[0].lemmas()[0].name()
            if synonym != word and synonym.isalpha():
                new_sentence.append(synonym)
            else:
                new_sentence.append(word)
        else:
            new_sentence.append(word)
    return ' '.join(new_sentence)

# Paraphrasing using back-translation with multiple languages.
# NOTE: TextBlob.translate() was removed in textblob 0.18.0; on that version the call
# raises AttributeError and this function returns no paraphrases.
def back_translate(sentence, languages=('fr', 'es', 'de')):
    translations = []
    blob = TextBlob(sentence)
    for lang in languages:
        try:
            translated = str(blob.translate(to=lang).translate(to='en'))
        except Exception:
            # Skip only the failing language instead of aborting the remaining ones
            continue
        if translated != sentence:
            translations.append(translated)
    return translations

# Template-based augmentation
def template_augmentation(sentence):
    templates = [
        "Can you tell me about {}?",
        "I would like to know about {}.",
        "What can you say about {}?",
        "Provide details about {}.",
        "{} - could you elaborate?",
        "Please explain {} in detail.",
        "What information is available on {}?",
        "Could you tell me about {}?",
        "I need information on {}.",
        "Could you provide more details on {}?"
    ]
    augmented_sentences = []
    for template in templates:
        augmented_sentences.append(template.format(sentence))
    return augmented_sentences

# Function to augment the dataset multiple times
def augment_dataset(dataset, rounds=1):
    augmented_dataset = dataset.copy()
    for _ in range(rounds):
        new_entries = {}
        for question, ids in augmented_dataset.items():
            # Synonym Replacement
            augmented_sentence = synonym_replacement(question)
            if augmented_sentence != question:
                new_entries[augmented_sentence] = ids
            
            # Back-Translation Paraphrasing
            augmented_sentences = back_translate(question)
            for augmented_sentence in augmented_sentences:
                if augmented_sentence != question:
                    new_entries[augmented_sentence] = ids
            
            # Template-Based Augmentation
            augmented_sentences = template_augmentation(question)
            for augmented_sentence in augmented_sentences:
                if augmented_sentence != question:
                    new_entries[augmented_sentence] = ids
        
        augmented_dataset.update(new_entries)
    return augmented_dataset

# Augment the dataset
augmented_dataset = augment_dataset(dataset, rounds=5)

# Save the augmented dataset to a JSON file including the original dataset
final_dataset = {**dataset, **augmented_dataset}
with open('augmented_data/augmented_dataset.json', 'w') as f:
    json.dump(final_dataset, f, indent=4)

# Print the augmented dataset
# print(json.dumps(final_dataset, indent=4))

# Print the length of the augmented dataset
print(f"Length of the augmented dataset: {len(final_dataset)}")


Length of the augmented dataset: 1123587
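
The jump from 7 seeds to over a million entries follows from the loop structure: every round re-augments everything produced so far, so the dataset can grow by a constant factor per round. The arithmetic below is a rough upper bound, assuming every generated variant is a new, distinct key. The function name `max_dataset_size` is hypothetical.

```python
def max_dataset_size(seed_count: int, variants_per_entry: int, rounds: int) -> int:
    """Upper bound: each round, every entry can add `variants_per_entry` new entries."""
    return seed_count * (1 + variants_per_entry) ** rounds


# 1 synonym variant + 3 back-translations + 10 templates per entry
with_back_translation = max_dataset_size(7, 1 + 3 + 10, 5)
# 1 synonym variant + 10 templates, if back-translation yields nothing
without_back_translation = max_dataset_size(7, 1 + 10, 5)

print(with_back_translation, without_back_translation)
```

The observed 1,123,587 is below both bounds, which is what one would expect when many variants collide with existing keys. Later rounds also nest templates inside already templated sentences, so most of the volume comes from stacked wrappers rather than new phrasings.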


In [27]:
from PIL import Image

# Load the image
image_path = "data_wiki/laiva.page_2.jpg"
image = Image.open(image_path)

# Get image resolution
image_resolution = image.size
image_resolution

(596, 842)