# Hackathon project 1: Query categorisation

In this project you will use a transformer model to classify queries from the UK IKEA website. To get you started, we have provided a list of the 10 000 most common queries, a transformer model, and some sample code.

As you experiment with the model, you will quickly find that the classification depends heavily on _what_ you ask the transformer to do and _how_ you ask it.
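
To make the effect of wording concrete, the sketch below builds several prompts for one and the same query. The template texts and the `build_prompts` helper are illustrative assumptions, not part of the provided sample code; only the first template follows the wording used in the sample cell, and the four category names are taken from that cell as well.

```python
# Hypothetical prompt templates; the wording is an assumption to experiment with.
TEMPLATES = {
    "question": "query: {query}. which category does it belong to? {categories}",
    "instruction": "Classify the search query '{query}' as one of: {categories}.",
    "options": "Search query: {query}\nOptions: {categories}\nAnswer:",
}


def build_prompts(query, categories):
    """Return one prompt per template for a single query."""
    category_text = ", ".join(categories)
    return {
        name: template.format(query=query, categories=category_text)
        for name, template in TEMPLATES.items()
    }


prompts = build_prompts("kallax", ["item", "room", "service", "other"])
for name, prompt in prompts.items():
    print(f"[{name}] {prompt}")
```

Each of these strings could be passed to the tokenizer in place of the prompt in the sample cell, which makes it easy to compare how the model reacts to a change in phrasing alone.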

Finding the best way of prompting the transformer will be the main task in this project. Because the queries are unlabelled, defining what 'best' means will also play a large part.
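
Without labels, one possible starting point is to use proxy measures. The sketch below shows two of them: the share of outputs that are actually one of the allowed categories, and the share of queries on which two prompt variants agree. Both helper functions and the example outputs are made up for illustration. They are assumptions about what 'best' could mean, not a definition supplied with the project.

```python
from collections import Counter


def valid_rate(outputs, allowed_categories):
    """Share of model outputs that exactly match one of the allowed categories."""
    allowed = {c.strip().lower() for c in allowed_categories}
    hits = sum(1 for o in outputs if o.strip().lower() in allowed)
    return hits / len(outputs) if outputs else 0.0


def agreement(outputs_a, outputs_b):
    """Share of queries on which two prompt variants give the same answer."""
    if len(outputs_a) != len(outputs_b):
        raise ValueError("both runs must cover the same queries")
    if not outputs_a:
        return 0.0
    same = sum(a.strip().lower() == b.strip().lower()
               for a, b in zip(outputs_a, outputs_b))
    return same / len(outputs_a)


# Made-up outputs from two hypothetical prompt variants on the same four queries
run_a = ["item", "room", "service", "furniture"]
run_b = ["item", "room", "other", "item"]
allowed_categories = ["item", "room", "service", "other"]

print("valid rate, run A:", valid_rate(run_a, allowed_categories))
print("agreement A vs B:", agreement(run_a, run_b))
print("label distribution, run A:", Counter(run_a))
```

Neither measure says whether a prediction is correct. A small hand-labelled sample of queries would be needed for that, but the two measures do flag prompts that produce off-list answers or unstable results.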

In [1]:
# Import necessary modules
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch
import pandas as pd



In [2]:
# Load the model and tokenizer into memory and move the model to the GPUs
model_name = "/home/transformers"

model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
print("Model and tokenizer loaded")

model.parallelize()
print("Moved model to GPUs")

Model and tokenizer loaded
Moved model to GPUs


In [6]:
hfb = """Customer support, Living room seating, Store and organise furniture, Workspaces, 
Bedroom furniture, Beds & Mattresses, Bathroom, Kitchen, Dining, Children´s IKEA, 
Lighting, Bed and bath textiles, Home textiles, Rugs, Cooking, Eating, 
Decoration, Outdoor & Secondary storage, Home organisation, Other business opportunities, 
Home electronics, Home Appliances"""

# Simple example of running the inference pipeline on a toy data set with a very basic prompt
categories = """item, room, service, other"""

df = pd.read_csv('example.csv')
df['output'] = ""  # string column that will hold the predicted category
batch_size = 5
for i in range(0, 20, batch_size):  # only the first 20 rows are classified
    # Build one prompt per query in the current batch
    batch = df['searchKeyword'].iloc[i:i + batch_size]
    inp = [f"query: {search}. which category does it belong to?" + categories
           for search in batch]
    inputs = tokenizer(inp, return_tensors='pt', padding=True).to("cuda:0")
    with torch.no_grad():
        # Generate once per batch; passing the attention mask lets the model ignore padding
        outputs = model.generate(**inputs)
    # Decode the whole batch and write the answers back to the matching rows
    df.loc[batch.index, 'output'] = tokenizer.batch_decode(
        outputs, skip_special_tokens=True)
print(df[['searchKeyword', 'output']])

          searchKeyword   output
0              wardrobe     item
1                  desk     item
2                kallax     item
3                mirror     item
4      chest of drawers     item
5                  malm     item
6                  sofa     item
7         bedside table     item
8               shelves     item
9               drawers     item
10           kid's room     room
11             delivery  service
12   assemble furniture  service
13     customer support  service
14              returns  service
15          dining room     room
16               closet     room
17             bathroom     room
18                porch     room
19              balcony     room
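
In the run above every answer is one of the four allowed categories, but a generative model is free to return other text, such as a plural form or a word that is not on the list. The sketch below shows one possible way to map raw answers back onto the category list using `difflib` from the standard library. The `normalise_label` helper, the fallback to `other`, and the cutoff of 0.6 are assumptions chosen for illustration.

```python
import difflib


def normalise_label(raw, categories, fallback="other", cutoff=0.6):
    """Map a raw model output to the closest allowed category, or to `fallback`."""
    cleaned = raw.strip().lower().rstrip(".")
    lookup = {c.lower(): c for c in categories}
    if cleaned in lookup:
        return lookup[cleaned]
    # Fuzzy match against the allowed categories; returns [] if nothing is close enough
    match = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[match[0]] if match else fallback


allowed = ["item", "room", "service", "other"]
for raw in ["Item", "services.", "delivery"]:
    print(raw, "->", normalise_label(raw, allowed))
```

Counting how often the fallback is used can itself serve as a signal: a prompt that frequently produces answers outside the list is probably not constraining the model well.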
