In [1]:
import pandas as pd
import numpy as np
import spacy
spacy_eng = spacy.load("en_core_web_sm")

In [40]:
df = pd.read_csv("../input/dataset/train.csv")
test_df = pd.read_csv("../input/dataset/test.csv")

Checking for null values

In [3]:
df.isnull().sum()

content       0
title         0
uid           0
target_ind    0
dtype: int64

In [4]:
test_df.isnull().sum()

content    0
title      0
uid        0
dtype: int64

Great! There are no null values in either dataset.

Making a new column that contains both the content and the title

In [41]:
df['info'] = df['content'] + ' ' + df['title']
test_df['info'] =  test_df['content'] + ' ' + test_df['title']

Loading the GloVe embeddings. I will use them to guide the preprocessing: words that are already present in the embeddings need no cleaning, while words that are missing from the embeddings may have to be cleaned.

In [6]:
import pickle

with open('../input/pickled-glove840b300d-for-10sec-loading/glove.840B.300d.pkl', 'rb') as fp:
    embeddings_index = pickle.load(fp)

Defining the tokenizer

In [7]:
def tokenizer(text):
    return [tok.text.lower() for tok in spacy_eng.tokenizer(text)]

A function to build the vocabulary and another to check how much of the data is covered by the embeddings

In [8]:
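import operator
from tqdm.notebook import tqdm

def check_coverage(vocab, embeddings_index):
    a = {}    # words that have an embedding
    oov = {}  # out-of-vocabulary words and their counts
    k = 0     # word occurrences covered by the embeddings
    i = 0     # word occurrences not covered
    for word in tqdm(vocab):
        try:
            a[word] = embeddings_index[word]
            k += vocab[word]
        except KeyError:
            oov[word] = vocab[word]
            i += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))
    # Most frequent out-of-vocabulary words first
    sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]

    return sorted_x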

def build_vocab(sentences, verbose =  True):

    vocab = {}
    for sentence in tqdm(sentences, disable = (not verbose)):
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab
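
The two percentages printed by check_coverage measure different things: the first counts distinct words, the second counts word occurrences. A minimal sketch on made-up data shows how they can diverge. The sentences and the known_words set are hypothetical stand-ins for the tokenized info column and for embeddings_index.

```python
from collections import Counter

# Toy corpus of already-tokenized sentences (hypothetical data).
sentences = [
    ["red", "cotton", "shirt"],
    ["red", "shirt", "as568a"],
    ["red", "shirt"],
]

# Stand-in for embeddings_index: only these words have a vector.
known_words = {"red", "shirt"}

# Same idea as build_vocab: map each word to its frequency.
vocab = Counter(word for sentence in sentences for word in sentence)

# Share of distinct words that have an embedding.
vocab_coverage = sum(word in known_words for word in vocab) / len(vocab)

# Share of word occurrences that have an embedding.
covered_tokens = sum(count for word, count in vocab.items() if word in known_words)
text_coverage = covered_tokens / sum(vocab.values())

print(f"{vocab_coverage:.2%} of vocab, {text_coverage:.2%} of all text")
```

Here half of the distinct words are unknown, but they are rare, so three quarters of the text is still covered. Text coverage is usually the number that matters more for the model.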

In [10]:
from tqdm.notebook import tqdm
import operator
vocab = build_vocab(df['info'].apply(lambda x:tokenizer(x)))
oov = check_coverage(vocab,embeddings_index)
oov[:10]

  0%|          | 0/35112 [00:00<?, ?it/s]

  0%|          | 0/90916 [00:00<?, ?it/s]

Found embeddings for 68.03% of vocab
Found embeddings for  97.11% of all text


[(' ', 49607),
 ('  ', 12622),
 ('\t', 8756),
 ('   ', 8007),
 ('as568a', 3942),
 ('75a.', 1872),
 ('trichlorethylene', 1872),
 ('1/8&#034', 1741),
 ('70a.', 1702),
 ('1/4&#034', 1613)]

Observations:
- Many words are separated by more than one space.
- The sentences contain tab characters (\t).

Fixing both problems

In [11]:
import re
def preprocess(x):
    x = x.replace('\t', " ")
    x = re.sub(' +', ' ', x)
    x = tokenizer(x)
    return x
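
As a possible variation (not what preprocess does), the tab replacement and the space collapsing could be folded into one regular expression that also covers newlines. This is only a sketch: normalize_whitespace is a hypothetical helper, and the sample string is made up.

```python
import re

def normalize_whitespace(text):
    # Hypothetical helper: collapse any run of spaces, tabs or newlines
    # into a single space and trim both ends.
    return re.sub(r"\s+", " ", text).strip()

sample = "o-ring\tas568a   size \n 70a."
print(normalize_whitespace(sample).split(" "))
```

Whether this matters depends on the data; in the out-of-vocabulary list above only spaces and tabs show up.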

In [12]:
vocab = build_vocab(df['info'].apply(preprocess))
oov = check_coverage(vocab,embeddings_index)
oov[:10]

  0%|          | 0/35112 [00:00<?, ?it/s]

  0%|          | 0/90790 [00:00<?, ?it/s]

Found embeddings for 68.13% of vocab
Found embeddings for  98.42% of all text


[('as568a', 3942),
 ('75a.', 1872),
 ('trichlorethylene', 1872),
 ('1/8&#034', 1741),
 ('70a.', 1702),
 ('1/4&#034', 1613),
 ('1/2&#034', 1408),
 ('3/16&#034', 1361),
 ('sizing.(big', 1324),
 ('7oa', 1301)]

Most of the unknown tokens are just random ids. Therefore, the preprocessing above should be enough.
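
A few of the entries above, such as 1/8&#034, look like HTML character references (&#034; is the double quote, used here as an inch mark) rather than ids. I leave them as they are, but if one wanted to go further, the standard library can decode them. This is an optional sketch, not part of the pipeline; unescape_entities is a hypothetical helper, and it would have to run on the raw string before tokenizing, since the tokenizer appears to split off the trailing semicolon.

```python
import html

def unescape_entities(text):
    # Hypothetical extra cleaning step, not part of the preprocessing above.
    return html.unescape(text)

print(unescape_entities("1/8&#034; o-ring"))
```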

In [15]:
def list2sent(x):
    x = ' '.join(x)
    return x

In [42]:
df['info'] = df['info'].apply(preprocess)
df['info'] = df['info'].apply(list2sent)
test_df['info'] = test_df['info'].apply(preprocess)
test_df['info'] = test_df['info'].apply(list2sent)

Checking if any rows are repeated 

In [28]:
repeated_products = df['info'].value_counts()
repeated_products.head()

the dickies original 874 work pant authentic and signature pant from a company that has been making work wear since 1922 amazon.com : dickies men 's original 874 washed work pant : clothing                                                                                                                                                                                                                                                                                                                            625
a dress classic handcrafted of silky two - ply 80s pinpoint cotton oxford with impeccable tailoring details including single needle stitching for stronger seams . english tab collar with button cuffs . imported . note : add an additional $ 5.00 for big and tall sizing.(big and tall sizes include : 18 , 18.5 , 19 , 20 collar sizes and any 37 \ 38 inch sleeve lengths ) amazon.com : pinpoint oxford tab collar button cuff dress shirt : clothing                                               

Wow, a lot of rows are repeated, the most frequent one 625 times.

In [34]:
d1 = df[df['info'] == repeated_products.index[0]]
d1.head()

Unnamed: 0,content,title,uid,target_ind,info,info_word_count
38,The dickies original 874 work pant authentic a...,Amazon.com: Dickies Men's Original 874 Washed ...,B0000WLVEC,311,the dickies original 874 work pant authentic a...,33
73,The dickies original 874 work pant authentic a...,Amazon.com: Dickies Men's Original 874 Washed ...,B00028AVK4,311,the dickies original 874 work pant authentic a...,33
82,The dickies original 874 work pant authentic a...,Amazon.com: Dickies Men's Original 874 Washed ...,B00028AZ6E,397,the dickies original 874 work pant authentic a...,33
115,The dickies original 874 work pant authentic a...,Amazon.com: Dickies Men's Original 874 Washed ...,B0002ZY2X4,397,the dickies original 874 work pant authentic a...,33
140,The dickies original 874 work pant authentic a...,Amazon.com: Dickies Men's Original 874 Washed ...,B0001YRG1Q,397,the dickies original 874 work pant authentic a...,33


In [35]:
d1['target_ind'].value_counts()

397    316
311    309
Name: target_ind, dtype: int64

We can see that the same product text is repeated many times, under different uids, and that it has been assigned to more than one category.
Since only one category has to be predicted for each test row, I have decided to keep the different categories as different rows. This way the model can learn that some texts belong to multiple categories: it should give a high probability to each of the correct categories and output the most relevant one.
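
To make the effect of this choice concrete, here is a small sketch of the label shares a model would see for one repeated text, before and after keeping a single row per (text, category) pair. The text string is a shortened placeholder, the counts mirror the value_counts output above, and label_distribution is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Hypothetical (text, label) rows mirroring the counts shown above.
text = "dickies 874 work pant"
rows = [(text, 397)] * 316 + [(text, 311)] * 309

def label_distribution(pairs):
    """Return, for each text, the share of rows carrying each label."""
    counts = defaultdict(Counter)
    for row_text, label in pairs:
        counts[row_text][label] += 1
    return {
        row_text: {label: n / sum(c.values()) for label, n in c.items()}
        for row_text, c in counts.items()
    }

before = label_distribution(rows)       # all 625 rows
after = label_distribution(set(rows))   # one row per (text, label) pair

print(before[text])
print(after[text])
```

With all rows kept, the shares are roughly 0.51 and 0.49; with one row per pair they are exactly 0.5 each. Either way, no single category can reach a probability near 1 for this text, so the prediction for such a product is a choice among its valid categories.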

Removing the rows that repeat both the text and the category

In [47]:
df.drop_duplicates(subset = ['info', 'target_ind'], inplace = True)

In [49]:
repeated_products = df['info'].value_counts()
repeated_products.head()

the namco museum series lasted five games on the playstation , collecting old namco arcade titles ranging from the important ( such as pac - man , dig - dug , and galaxian ) to the obscure ( do toy pop , grobda , the tower of druaga , and phozon ring a bell ? ) . the nintendo 64 edition of the line acts as a " best of " the museum tour , gathering together the premium classic games from the company 's past : pac - man , ms. pac - man , pole position , galaga , galaxian , and dig dug . every gamer older than ten should be well familiar with the pellet - chomping , maze - running pac - man games , though some might not recall the other titles as clearly . pole position was the perennial racing game of its time , offering much - improved graphics to a genre that had seen the rudimentary - looking night driver not too much earlier . meanwhile , galaxian and galaga were more stylish takes on taito 's space invaders , each having its own distinct variation on the theme ( the first offered di

In [50]:
d1 = df[df['info'] == repeated_products.index[0]]
d1.head()

Unnamed: 0,content,title,uid,target_ind,info
2452,The Namco Museum series lasted five games on t...,Namco Museum 64,B000038A6U,70,the namco museum series lasted five games on t...
7496,The Namco Museum series lasted five games on t...,Namco Museum 64,B000038A6U,69,the namco museum series lasted five games on t...
8750,The Namco Museum series lasted five games on t...,Namco Museum 64,B000038A6U,43,the namco museum series lasted five games on t...
20472,The Namco Museum series lasted five games on t...,Namco Museum 64,B000038A6U,104,the namco museum series lasted five games on t...
31173,The Namco Museum series lasted five games on t...,Namco Museum 64,B000038A6U,63,the namco museum series lasted five games on t...


Great! We have removed the duplicate rows. The texts that still repeat now have a different category in each row.

Creating a column with the number of words in info

In [51]:
df['info_word_count'] = df['info'].apply(lambda x: len(x.split()))
df.describe()

Unnamed: 0,target_ind,info_word_count
count,28272.0,28272.0
mean,262.453134,206.449385
std,142.071231,249.631573
min,0.0,2.0
25%,138.0,61.0
50%,294.0,138.0
75%,361.0,317.0
max,499.0,10261.0


In [52]:
len(df[df['info_word_count']<=256])/len(df)

0.7057512733446519

In [53]:
len(df[df['info_word_count']<=512])/len(df)

0.9489247311827957

There are 500 labels (target_ind runs from 0 to 499).

About 71% of the rows have at most 256 words and about 95% have at most 512 words. 512 would be the appropriate max length for the model, but due to resource constraints I will use a max length of 256.
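
The length check and the effect of a max length can be sketched in plain Python. Both helpers are hypothetical and the word counts are made up. Note also that word counts are only a proxy: a model tokenizer may split words into several tokens, so a 256-word text can be longer than 256 tokens.

```python
def share_within(lengths, max_len):
    """Fraction of texts whose word count is at most max_len."""
    return sum(n <= max_len for n in lengths) / len(lengths)

def truncate_words(text, max_len):
    """Keep only the first max_len whitespace-separated words."""
    return " ".join(text.split()[:max_len])

# Hypothetical word counts, not taken from the dataset.
lengths = [2, 61, 138, 317, 600]

print(share_within(lengths, 256))
print(truncate_words("a b c d e", 3))
```

Texts longer than the max length lose their tail, which in this dataset includes the title, because info is built as content followed by title.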

Saving the preprocessed datasets, which will be used in the models

In [27]:
df.to_csv('train.csv')
test_df.to_csv('test.csv')