# A notebook to explore text classification using word embedders

In this notebook, I explore a public dataset of books with metadata such as description, title and category/genre.
I'll use a word embedder to vectorize the description and title, then use XGBoost to train a classifier on the category.
I will use Gensim's FastText implementation as the word embedder.
I will then repeat this process using the native FastText implementation and compare the results.
Finally, I will host these models on Amazon SageMaker.

## Install libraries, initialise variables, download dataset

In [1]:
! pip install gensim==3.8.3

Collecting gensim==3.8.3
  Downloading gensim-3.8.3-cp36-cp36m-manylinux1_x86_64.whl (24.2 MB)
     |████████████████████████████████| 24.2 MB 16.2 MB/s eta 0:00:01
Collecting smart-open>=1.8.1
  Downloading smart_open-5.1.0-py3-none-any.whl (57 kB)
     |████████████████████████████████| 57 kB 1.3 MB/s  eta 0:00:01
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.3 smart-open-5.1.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [2]:
import gensim
from gensim.models import FastText
from gensim.test.utils import common_texts  # some example sentences
from gensim.utils import simple_preprocess
print(common_texts[1])
print(len(common_texts))

['survey', 'user', 'computer', 'system', 'response', 'time']
9


Gensim expects the sentences to already be tokenized and pre-processed.

In [3]:
help(gensim.models.FastText)

Help on class FastText in module gensim.models.fasttext:

class FastText(gensim.models.base_any2vec.BaseWordEmbeddingsModel)
 |  Train, use and evaluate word representations learned using the method
 |  described in `Enriching Word Vectors with Subword Information <https://arxiv.org/abs/1607.04606>`_, aka FastText.
 |  
 |  The model can be stored/loaded via its :meth:`~gensim.models.fasttext.FastText.save` and
 |  :meth:`~gensim.models.fasttext.FastText.load` methods, or loaded from a format compatible with the original
 |  Fasttext implementation via :func:`~gensim.models.fasttext.load_facebook_model`.
 |  
 |  Attributes
 |  ----------
 |  wv : :class:`~gensim.models.keyedvectors.FastTextKeyedVectors`
 |      This object essentially contains the mapping between words and embeddings. These are similar to the embeddings
 |      computed in the :class:`~gensim.models.word2vec.Word2Vec`, however here we also include vectors for n-grams.
 |      This allows the model to compute embedding

In [4]:
import pandas as pd
import numpy as np
import json
import sagemaker

In [5]:
# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket() # replace with your own bucket if you have one 
s3 = sagemaker_session.boto_session.resource('s3')


prefix_gensim = 'data_gensim_xgb'
prefix_fasttext = 'data_fasttext'

## Get the data into a working format with just the features we need

In [6]:
# Downloading the book metadata
! wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Books.json.gz
# Uncompressing
!gzip -d meta_Books.json.gz -f

--2021-07-28 17:07:21--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Books.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1215601087 (1.1G) [application/octet-stream]
Saving to: ‘meta_Books.json.gz’


2021-07-28 17:08:21 (19.8 MB/s) - ‘meta_Books.json.gz’ saved [1215601087/1215601087]



The file is a bit too big to work with comfortably, so in the line below we reduce it by taking a subset of the dataset.

In [7]:
#Reducing the dataset 
! head -n 25000 meta_Books.json > books_train.json
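
If `head` is not available (for example when running outside a Linux notebook instance), the same subsetting can be sketched in plain Python. This is only an illustration: `take_first_lines` is a helper name I made up, and it assumes the file is line-delimited JSON, which is what `pd.read_json(..., lines=True)` expects further down.

```python
from itertools import islice


def take_first_lines(src_path, dst_path, n_lines):
    """Copy the first n_lines of src_path to dst_path and return how many lines were written."""
    written = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        # islice stops after n_lines, so the rest of the large file is never read
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

Calling `take_first_lines("meta_Books.json", "books_train.json", 25000)` should then give the same subset as the shell command above.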

In [8]:
#load data
data=pd.read_json('books_train.json', lines=True)
#shuffle the data in place
data = data.sample(frac=1).reset_index(drop=True)
# show first few rows
data.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,image,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin
0,"[Books, History, Military]",,[],,"War Plans of the Great Powers, 1880-1914",[],[],,Visit Amazon's Paul M. Kennedy Page,[],"2,132,852 in Books (",[],Books,,NaT,,0049400827
1,"[Books, Science Fiction &amp; Fantasy, Science...",,[In bestsellers McCaffrey and Scarborough's ch...,,First Warning: Acorna's Children,"[0060525401, 006052541X, 0060525436, 006105095...",[],,Visit Amazon's Anne McCaffrey Page,[],"526,794 in Books (","[0380818485, 0380818477, 0060525401, 006105984...",Books,,NaT,$4.79,006052538X
2,"[Books, Science Fiction & Fantasy, Fantasy]",,[1st UK edition paperback fine In stock shippe...,,Xena Warria Princess - Prophecy of Darkness,"[0441006590, 0441008526, 1524101605, 157297215...",[],,Visit Amazon's Stella Howard Page,[],"10,365,992 in Books (",[],Books,,NaT,$30.17,000651149X
3,"[Books, Medical Books, Medicine]",,[Master dosage calculations with the ratio-pro...,,Dosage Calculations: Ratio-Proportion Approach...,"[0803644140, 0803644752, 0323079334, 149634799...",[],,Visit Amazon's Gloria D. Pickar Page,[],"6,186,165 in Books (","[1439058474, 1285429451, 0803669453, 0470930640]",Books,,NaT,$12.97,0007786662
4,"[Books, Biographies & Memoirs, True Crime]",,[],,A Handful of Summers,"[0006388108, 0307388409, 1558215662, 1476737398]",[],,Visit Amazon's Gordon Forbes Page,[],"4,183,155 in Books (","[1928257429, 147673741X, 1928257445, 1558215662]",Books,,NaT,$72.17,0006388086


We are only interested in a few columns from this dataset, so we will create a dataframe that contains only these.

In [9]:
data_subset = data[["category","description", "title" ]]

In [10]:
data_subset.head()

Unnamed: 0,category,description,title
0,"[Books, History, Military]",[],"War Plans of the Great Powers, 1880-1914"
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,First Warning: Acorna's Children
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,Xena Warria Princess - Prophecy of Darkness
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,Dosage Calculations: Ratio-Proportion Approach...
4,"[Books, Biographies & Memoirs, True Crime]",[],A Handful of Summers


We will do some quick analysis to see what the data looks like.

In [11]:
length = data_subset.category.apply(len)

In [12]:
length.unique()

array([3, 2, 0, 5, 4])

In [13]:
data_subset["cnt_cats"] = data_subset.category.apply(len)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
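
The warning above appears because `data_subset` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is meant for a copy or for the original frame. One common way to avoid it is to take an explicit copy when creating the subset. A minimal sketch on a made-up toy frame (the `toy` and `toy_subset` names and all values are placeholders, not the book data):

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame (values are made up)
toy = pd.DataFrame({
    "category": [["Books", "History"], [], ["Books", "Medical Books", "Medicine"]],
    "title": ["a", "b", "c"],
    "price": [1.0, 2.0, 3.0],
})

# .copy() makes the subset an independent frame, so adding a column is unambiguous
toy_subset = toy[["category", "title"]].copy()
toy_subset["cnt_cats"] = toy_subset["category"].apply(len)
```

The original frame is left untouched and the new column only exists on the copy.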


In [14]:
data_subset["cnt_desc"] = data_subset.description.apply(len)
data_subset.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


Unnamed: 0,category,description,title,cnt_cats,cnt_desc
0,"[Books, History, Military]",[],"War Plans of the Great Powers, 1880-1914",3,0
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,First Warning: Acorna's Children,3,6
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,Xena Warria Princess - Prophecy of Darkness,3,1
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,Dosage Calculations: Ratio-Proportion Approach...,3,1
4,"[Books, Biographies & Memoirs, True Crime]",[],A Handful of Summers,3,0


In [15]:
# delete the rows that have no category or no description
data_subset = data_subset[data_subset.cnt_cats != 0]
data_subset = data_subset[data_subset.cnt_desc != 0]

In [16]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,First Warning: Acorna's Children,3,6
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,Xena Warria Princess - Prophecy of Darkness,3,1
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,Dosage Calculations: Ratio-Proportion Approach...,3,1
6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,A Perversion of Justice: A Southern Tragedy of...,3,1
7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",Bird Populations (Collins New Naturalist Libra...,3,9


In [17]:
data_subset["cat_x2"] = data_subset["category"].str[1]

In [18]:
data_subset.head(10)

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,First Warning: Acorna's Children,3,6,Science Fiction &amp; Fantasy
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,Xena Warria Princess - Prophecy of Darkness,3,1,Science Fiction & Fantasy
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,Dosage Calculations: Ratio-Proportion Approach...,3,1,Medical Books
6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,A Perversion of Justice: A Southern Tragedy of...,3,1,Biographies & Memoirs
7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",Bird Populations (Collins New Naturalist Libra...,3,9,Science & Math
8,"[Books, Literature &amp; Fiction]","[Australian rock musician, lyricist and actor ...",And the ass saw the angel,2,3,Literature &amp; Fiction
10,"[Books, Literature & Fiction, Short Stories & ...",[A wonderful read for all bird-lovers and thos...,The Secret Language of Birds: A Treasury of My...,3,1,Literature & Fiction
13,"[Books, Science Fiction &amp; Fantasy, Fantasy]",[Stephen Donaldson was born in 1947 in Clevela...,One Tree (The Second Chronicles of Thomas Cove...,3,3,Science Fiction &amp; Fantasy
14,"[Books, Mystery, Thriller & Suspense]",[Physical description: 535 p. ; 25 cm. Subject...,A Cure for All Diseases,2,1,"Mystery, Thriller & Suspense"
15,"[Books, Reference, Words, Language &amp; Grammar]","[Science, , ]","Activity Resources (McGraw-Hill Science, Grade 5)",3,3,Reference


We can see that the category column holds an array, which is a hierarchical classification of the book. We can train our classifier on just one of those levels. They are all books, so the first element is not interesting, but the second element looks more useful.

We also want to clean the data, as we can see there were some encoding issues (`&amp;` instead of `&`), which we can fix with a "replace".

In [19]:
data_subset["cat_x2"] = data_subset["cat_x2"].replace("&amp;", "&", regex=True)
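
The replace above only handles `&amp;`. If other HTML entities turn up in the labels, the standard library's `html.unescape` decodes them all in one go. A small sketch on a made-up list of labels (the list itself is just for illustration):

```python
import html

# A few labels in the style of the dataset, some with the HTML-escaped ampersand
raw_labels = ["Science Fiction &amp; Fantasy", "Literature &amp; Fiction", "Medical Books"]

# html.unescape turns entities such as &amp; back into the characters they stand for
clean_labels = [html.unescape(label) for label in raw_labels]
```

On the dataframe this could be applied with something like `data_subset["cat_x2"].apply(html.unescape)`, although I have not run that against the full dataset here.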

In [20]:
data_subset["cat_x2"].head()

1    Science Fiction & Fantasy
2    Science Fiction & Fantasy
3                Medical Books
6        Biographies & Memoirs
7               Science & Math
Name: cat_x2, dtype: object

In [21]:
len(data_subset["cat_x2"].unique())

33

In [22]:
data_subset['description_str'] = data_subset['description'].apply(lambda x: ' '.join(map(str, x)))

In [23]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,First Warning: Acorna's Children,3,6,Science Fiction & Fantasy,In bestsellers McCaffrey and Scarborough's cha...
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,Xena Warria Princess - Prophecy of Darkness,3,1,Science Fiction & Fantasy,1st UK edition paperback fine In stock shipped...
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,Dosage Calculations: Ratio-Proportion Approach...,3,1,Medical Books,Master dosage calculations with the ratio-prop...
6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,A Perversion of Justice: A Southern Tragedy of...,3,1,Biographies & Memoirs,Kathryn Medico lives in South Florida and teac...
7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",Bird Populations (Collins New Naturalist Libra...,3,9,Science & Math,Winner of the British Ecological Society's Ma...


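We want to convert the category column to a pandas categorical type, so that each genre maps to an integer code we can use as the label for the classifier.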

In [24]:
data_subset["cat_x2"] = data_subset["cat_x2"].astype("category")

In [25]:
data_subset["cat_x2"].cat.codes

1        27
2        27
3        18
6         1
7        26
         ..
24994    13
24995     8
24996    17
24998    29
24999    10
Length: 17602, dtype: int8

In [26]:
data_subset["cat_x2_code"] = data_subset["cat_x2"].cat.codes
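
We will need to translate the integer codes back to genre names when reading predictions. As far as I can tell from the output above, the codes follow the alphabetical order of the labels, so the idea can be sketched in plain Python as below. The genre list is a made-up sample, and `label_to_code` / `code_to_label` are names I chose for illustration.

```python
# A made-up sample of genre labels
genres = ["Science Fiction & Fantasy", "Medical Books", "Biographies & Memoirs", "Medical Books"]

# Sort the distinct labels and number them, which mirrors how the codes above appear to be assigned
labels = sorted(set(genres))
label_to_code = {label: code for code, label in enumerate(labels)}
code_to_label = {code: label for label, code in label_to_code.items()}

# Encode every row, then decode again to check the round trip
codes = [label_to_code[g] for g in genres]
decoded = [code_to_label[c] for c in codes]
```

With the real dataframe, `dict(enumerate(data_subset["cat_x2"].cat.categories))` should give the same kind of code-to-label lookup directly.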

In [27]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,First Warning: Acorna's Children,3,6,Science Fiction & Fantasy,In bestsellers McCaffrey and Scarborough's cha...,27
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,Xena Warria Princess - Prophecy of Darkness,3,1,Science Fiction & Fantasy,1st UK edition paperback fine In stock shipped...,27
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,Dosage Calculations: Ratio-Proportion Approach...,3,1,Medical Books,Master dosage calculations with the ratio-prop...,18
6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,A Perversion of Justice: A Southern Tragedy of...,3,1,Biographies & Memoirs,Kathryn Medico lives in South Florida and teac...,1
7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",Bird Populations (Collins New Naturalist Libra...,3,9,Science & Math,Winner of the British Ecological Society's Ma...,26


## Gensim requires us to clean and tokenize the data

In [28]:
import re  # needed by the regex-based helpers below

def remove_numbers(text): 
    '''  
    This function takes strings containing numbers and returns strings with numbers removed.
    '''
    return re.sub(r'\d+', '', text) 

In [29]:
def remove_mentions(text):
    '''  
    This function takes strings containing mentions and returns strings with 
    mentions (@ and the account name) removed.
    Input(string): one tweet, contains mentions
    Output(string): one tweet, mentions (@ and the account name mentioned) removed 
    '''
    mentions = re.compile(r'@\w+ ?')
    return mentions.sub(r'', text)

In [30]:
def extract_mentions(text):
    '''
    This function takes strings containing mentions and returns strings with 
    mentions (@ and the account name) extracted into a different element,
    and removes the mentions in the original sentence.
    Input(string): one sentence, contains mentions
    '''
    mentions = [i[1:] for i in text.split() if i.startswith("@")]
    sentence = re.compile(r'@\w+ ?').sub(r'', text)
    return sentence,mentions

In [31]:
! pip install spacy

Collecting spacy
  Downloading spacy-3.1.1-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.4 MB)
     |████████████████████████████████| 6.4 MB 20.3 MB/s eta 0:00:01
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp36-cp36m-manylinux2014_x86_64.whl (35 kB)
Collecting catalogue<2.1.0,>=2.0.4
  Downloading catalogue-2.0.4-py3-none-any.whl (16 kB)
Collecting thinc<8.1.0,>=8.0.8
  Downloading thinc-8.0.8-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (622 kB)
     |████████████████████████████████| 622 kB 50.7 MB/s eta 0:00:01
Collecting spacy-legacy<3.1.0,>=3.0.7
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting tqdm<5.0.0,>=4.38.0
  Downloading tqdm-4.61.2-py2.py3-none-any.whl (76 kB)
     |████████████████████████████████| 76 kB 1.2 MB/s  eta 0:00:01
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.5-cp36-cp36m-manylinux2014_

In [32]:
! pip install textblob

Collecting textblob
  Downloading textblob-0.15.3-py2.py3-none-any.whl (636 kB)
     |████████████████████████████████| 636 kB 22.5 MB/s eta 0:00:01
Installing collected packages: textblob
Successfully installed textblob-0.15.3
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [33]:
import nltk
import spacy
from textblob import TextBlob
import re
import string
import glob
import sagemaker

In [34]:
punc_list = string.punctuation #you can self define list of punctuation to remove here
def remove_punctuation(text): 
    """
    This function takes strings containing self defined punctuations and returns
    strings with punctuations removed.
    """
    translator = str.maketrans('', '', punc_list) 
    return text.translate(translator) 

In [35]:
def remove_whitespace(text): 
    '''
    This function takes a string and returns it with leading and trailing
    whitespace removed and every run of whitespace collapsed into a single space.
    '''
    return  " ".join(text.split())

In [36]:
def remove_html_tags(text):
    """Remove html tags from a string"""
    import re
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

In [37]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,First Warning: Acorna's Children,3,6,Science Fiction & Fantasy,In bestsellers McCaffrey and Scarborough's cha...,27
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,Xena Warria Princess - Prophecy of Darkness,3,1,Science Fiction & Fantasy,1st UK edition paperback fine In stock shipped...,27
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,Dosage Calculations: Ratio-Proportion Approach...,3,1,Medical Books,Master dosage calculations with the ratio-prop...,18
6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,A Perversion of Justice: A Southern Tragedy of...,3,1,Biographies & Memoirs,Kathryn Medico lives in South Florida and teac...,1
7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",Bird Populations (Collins New Naturalist Libra...,3,9,Science & Math,Winner of the British Ecological Society's Ma...,26


In [38]:
data_subset["description_str"]=data_subset["description_str"].apply(remove_html_tags)
data_subset["title"]=data_subset["title"].apply(remove_html_tags)

In [39]:
data_subset["description_str"] = data_subset["description_str"].str.lower()
data_subset["title"] = data_subset["title"].str.lower()

In [40]:
data_subset["description_str"]=data_subset["description_str"].apply(remove_whitespace).apply(remove_punctuation).apply(remove_numbers)
data_subset["title"]=data_subset["title"].apply(remove_whitespace).apply(remove_punctuation).apply(remove_numbers)
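
The cleaning steps in the last three cells can also be read as one pipeline. Below is a sketch that chains them in the same order for a single string. `clean_text` is a name I made up, and the final whitespace collapse is an extra step that is not in the cells above (removing digits can leave double spaces behind).

```python
import re
import string


def clean_text(text):
    """Apply the notebook's cleaning steps, in order, to a single string."""
    text = re.sub(r'<.*?>', '', text)                                  # strip HTML tags
    text = text.lower()                                                # lowercase
    text = " ".join(text.split())                                      # collapse whitespace
    text = text.translate(str.maketrans('', '', string.punctuation))   # drop punctuation
    text = re.sub(r'\d+', '', text)                                    # drop digits
    return " ".join(text.split())                                      # extra: tidy leftover spaces


example = clean_text("<b>1st UK edition</b>,  paperback!")
```

Note that removing punctuation also glues hyphenated words together ("ratio-proportion" becomes "ratioproportion"), which is visible in the dataframe output further down.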


In [41]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [42]:
from nltk.tokenize import word_tokenize 
def tokenize_sent(text): 
    ''' 
    This function takes strings and returns tokenized words.
    '''
    word_tokens = word_tokenize(text)  
    return word_tokens 

In [43]:
data_subset["description_str_token"] = data_subset["description_str"].apply(tokenize_sent)

In [44]:
data_subset["title_token"] = data_subset["title"].apply(tokenize_sent)

In [45]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [46]:
stopwords_list = set(stopwords.words('english'))

In [47]:
from collections import Counter
counter = Counter()
for word in  [w for sent in data_subset["description_str_token"] for w in sent]:
    counter[word] += 1        
counter.most_common(10)

[('the', 149030),
 ('and', 100553),
 ('of', 90445),
 ('a', 70005),
 ('to', 53743),
 ('in', 50295),
 ('is', 32180),
 ('for', 24949),
 ('with', 22727),
 ('as', 19951)]

In [48]:
# least frequent words (note: the slice [:-10:-1] returns 9 items, not 10)
counter.most_common()[:-10:-1]

[('secondguessers', 1),
 ('therenot', 1),
 ('pickoff', 1),
 ('bullpens', 1),
 ('baserunning', 1),
 ('ballstrike', 1),
 ('adrem', 1),
 ('onlookers', 1),
 ('minimatchups', 1)]
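
The output above has 9 entries rather than 10 because the stop value of a slice is exclusive. A small sketch on a made-up counter shows the difference; `toy_counter` and its word names are placeholders.

```python
from collections import Counter

# Toy counts (made up): 12 distinct words with frequencies 12, 11, ..., 1
toy_counter = Counter({f"w{i}": 12 - i for i in range(12)})

n = 10
ranked = toy_counter.most_common()      # most frequent first

as_in_notebook = ranked[:-n:-1]         # walks backwards, stops before index -n
exactly_n = ranked[:-n - 1:-1]          # one step further, so n items are returned
```

For this notebook the difference is a single rare word, so it does not matter much, but it is worth knowing when `bottom_n` is meant literally.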

In [49]:
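top_n = 10
bottom_n = 10
# add the top_n most frequent words to the stop word list
stopwords_list |= set([word for (word, count) in counter.most_common(top_n)])
# add the least frequent words (this slice yields bottom_n - 1 words)
stopwords_list |= set([word for (word, count) in counter.most_common()[:-bottom_n:-1]])
# add a custom stop word
stopwords_list |= {'thats'}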
def remove_stopwords(tokenized_text): 
    '''
    This function takes a list of tokenized words from the description and title, removes self-defined stop words from the list,
    and returns the list of words with stop words removed
    '''
    filtered_text = [word for word in tokenized_text if word not in stopwords_list] 
    return filtered_text

In [50]:
data_subset["description_str_token"] = data_subset["description_str_token"].apply(remove_stopwords)
data_subset["title_token"] = data_subset["title_token"].apply(remove_stopwords)

In [51]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,title_token
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,first warning acornas children,3,6,Science Fiction & Fantasy,in bestsellers mccaffrey and scarboroughs char...,27,"[bestsellers, mccaffrey, scarboroughs, charmin...","[first, warning, acornas, children]"
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,xena warria princess prophecy of darkness,3,1,Science Fiction & Fantasy,st uk edition paperback fine in stock shipped ...,27,"[st, uk, edition, paperback, fine, stock, ship...","[xena, warria, princess, prophecy, darkness]"
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,dosage calculations ratioproportion approach t...,3,1,Medical Books,master dosage calculations with the ratiopropo...,18,"[master, dosage, calculations, ratioproportion...","[dosage, calculations, ratioproportion, approa..."
6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,a perversion of justice a southern tragedy of ...,3,1,Biographies & Memoirs,kathryn medico lives in south florida and teac...,1,"[kathryn, medico, lives, south, florida, teach...","[perversion, justice, southern, tragedy, murde..."
7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",bird populations collins new naturalist librar...,3,9,Science & Math,winner of the british ecological societys mars...,26,"[winner, british, ecological, societys, marsh,...","[bird, populations, collins, new, naturalist, ..."


In [52]:
! pip install autocorrect

Collecting autocorrect
  Downloading autocorrect-2.5.0.tar.gz (622 kB)
     |████████████████████████████████| 622 kB 38.1 MB/s eta 0:00:01
Building wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py) ... done
  Created wheel for autocorrect: filename=autocorrect-2.5.0-py3-none-any.whl size=621854 sha256=9d91c8ff00b4d4c9a26204bc178d969351b2edaaabe26feddf106a5931e85316
  Stored in directory: /home/ec2-user/.cache/pip/wheels/a4/51/6c/f75116aae65b52be7ad1d57e47ad4e89ab818bf45d9093021f
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-2.5.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [53]:
from autocorrect import Speller

In [54]:
spell = Speller(lang='en', fast = True)
def spelling_correct(tokenized_text):
    """
    This function takes a list of tokenized words from a sentence, spell checks every word and returns the 
    corrected word where applicable. Note that not every misspelled word will be identified.
    """
    corrected = [spell(word) for word in tokenized_text] 
    return corrected

In [55]:
data_subset["description_str_token"] = data_subset["description_str_token"].apply(spelling_correct)
data_subset["title_token"] = data_subset["title_token"].apply(spelling_correct)

In [56]:
# mark empty descriptions as missing so that dropna() can remove them
data_subset['description_str'] = data_subset['description_str'].replace('', np.nan)

In [57]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,title_token
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,first warning acornas children,3,6,Science Fiction & Fantasy,in bestsellers mccaffrey and scarboroughs char...,27,"[bestsellers, mccaffrey, scarboroughs, charmin...","[first, warning, acornas, children]"
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,xena warria princess prophecy of darkness,3,1,Science Fiction & Fantasy,st uk edition paperback fine in stock shipped ...,27,"[st, uk, edition, paperback, fine, stock, ship...","[xeno, warria, princess, prophecy, darkness]"
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,dosage calculations ratioproportion approach t...,3,1,Medical Books,master dosage calculations with the ratiopropo...,18,"[master, dosage, calculations, ratioproportion...","[dosage, calculations, ratioproportion, approa..."
6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,a perversion of justice a southern tragedy of ...,3,1,Biographies & Memoirs,kathryn medico lives in south florida and teac...,1,"[kathryn, medic, lives, south, florida, teache...","[perversion, justice, southern, tragedy, murde..."
7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",bird populations collins new naturalist librar...,3,9,Science & Math,winner of the british ecological societys mars...,26,"[winner, british, ecological, society, marsh, ...","[bird, populations, collins, new, naturalist, ..."


In [58]:
# remove the rows which don't have data
data_subset = data_subset.dropna()

### Now data has been cleansed, we are ready to train a model

When we return a sentence in its vectorized format, we get an array of 50 numbers, because that is the embedding size we have chosen. This vector captures the semantics of the sentence, which lets us compare two sentences to see how similar they are and, for this use case, train a classifier.
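
One common way to compare two such vectors is cosine similarity. The sketch below works on plain Python lists rather than on the model's output, and `cosine_similarity` is a helper I wrote for illustration, not something taken from Gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Values close to 1 mean the vectors point the same way, values near 0 mean they are unrelated, and negative values mean they point in opposite directions.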

In [59]:
model_gensim = FastText(size=50, window=5, min_count=1)

In [60]:
token_desc = data_subset["description_str_token"] + data_subset["title_token"]
token_desc.head()

1    [bestsellers, mccaffrey, scarboroughs, charmin...
2    [st, uk, edition, paperback, fine, stock, ship...
3    [master, dosage, calculations, ratioproportion...
6    [kathryn, medic, lives, south, florida, teache...
7    [winner, british, ecological, society, marsh, ...
dtype: object

In [61]:
model_gensim.build_vocab(sentences=token_desc)

In [62]:
model_gensim.train(sentences=token_desc, total_examples=len(token_desc), epochs=50) 

In [63]:
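from gensim.test.utils import get_tmpfile
fname = get_tmpfile("fasttext.model")  # temporary path, not used below

# save in Gensim's own format (despite the .bin extension, this is not the native fastText binary format)
model_gensim.save('books_gensim_model.bin')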

In [64]:
description_str = data_subset["description_str"]

In [65]:
vector_description_str = model_gensim.wv[description_str]

In [66]:
# what happens if the word vectors are built from token_desc instead?
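
On the question in the cell above: the notebook passes each whole description string to `model_gensim.wv`, so, as far as I understand FastText, each string is treated as one out-of-vocabulary "word" and its vector is built from character n-grams. An alternative that is often used is to average the vectors of the individual tokens. The sketch below shows only the averaging step, with a made-up dictionary standing in for the model; `average_vector` and `toy_vectors` are hypothetical names, and I have not compared the two approaches on this dataset.

```python
def average_vector(tokens, lookup, dim):
    """Mean of the vectors of the tokens found in lookup; a zero vector if none are found."""
    found = [lookup[t] for t in tokens if t in lookup]
    if not found:
        return [0.0] * dim
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]


# Hypothetical 2-dimensional word vectors standing in for model_gensim.wv
toy_vectors = {"bird": [1.0, 0.0], "populations": [0.0, 1.0]}

doc_vector = average_vector(["bird", "populations", "unknownword"], toy_vectors, dim=2)
```

With the real model the lookup would not need the `in` check, since FastText can produce a vector for unseen words too.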

In [67]:
len(vector_description_str)

15357

In [68]:
data_subset["description_str_token"][0]

KeyError: 0

In [69]:
description_str[0]

KeyError: 0
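
The `KeyError: 0` above is expected: row 0 was removed when we filtered out books without a description, and `[0]` on a series with an integer index looks up the label 0, not the first position. A small sketch on a made-up series (the `toy` name and values are placeholders):

```python
import pandas as pd

# Toy series whose index starts at 1, like data_subset after row 0 was filtered out
toy = pd.Series(["first row", "second row"], index=[1, 2])

first_by_position = toy.iloc[0]   # positional access works regardless of the labels
first_by_label = toy.loc[1]       # label access uses the surviving index label

try:
    toy.loc[0]                    # label 0 no longer exists
    missing_label_raises = False
except KeyError:
    missing_label_raises = True
```

So `data_subset["description_str_token"].iloc[0]` is probably what was wanted in the two cells above.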

In [70]:
vector_description_str[0]

array([ 0.15061761, -0.22866273,  0.14617342,  0.30884618,  0.02439997,
       -0.3813262 , -0.13121431, -0.14215586,  0.51443768,  0.1037375 ,
       -0.24933131,  0.08981992,  0.46285298, -0.23384802,  0.15566206,
       -0.33288106,  0.03284791, -0.09148692, -0.03425598, -0.50361264,
        0.26700714, -0.0456257 , -0.2992793 , -0.08063361,  0.12734178,
        0.00625457, -0.01884551, -0.87336093, -0.08160407, -0.07867698,
       -0.39554146,  0.28552851, -0.24006692, -0.28874454, -0.10695081,
       -0.34137124,  0.23884676, -0.35137567,  0.09541436,  0.19921464,
       -0.11534573, -0.06538808, -0.15550601,  0.23118731, -0.34995532,
        0.23691154,  0.13610038, -0.13849372,  0.10327838,  0.04905422])

In [71]:
vector_description_str = np.split(vector_description_str,len(vector_description_str))

In [72]:
vector_description_str[0].shape

(1, 50)

In [73]:
title_str = data_subset["title"]

In [74]:
vector_title_str = model_gensim.wv[title_str]

In [75]:
len(vector_title_str)

15357

In [76]:
vector_title_str.shape

(15357, 50)

In [77]:
vector_title_str = np.split(vector_title_str,len(vector_title_str))

In [78]:
vector_desc_title = np.concatenate((vector_title_str, vector_description_str), axis=1)

In [79]:
vector_title_str[0]

array([[ 0.0651715 , -0.39411464,  0.42241248,  0.3177773 ,  0.4319102 ,
        -0.01771371, -0.4520534 , -0.62032485,  0.9928784 , -0.06408522,
         0.26175588, -0.42529353,  1.0138596 , -0.60837865,  0.2441559 ,
        -1.2245815 , -0.9688899 ,  1.1390104 , -0.46787295, -0.72538525,
         1.1622295 , -0.11322684, -0.31891525, -0.03702623,  0.20710579,
        -0.44438636,  0.29351732, -1.6084657 ,  0.20977274, -1.4457878 ,
        -1.190064  ,  0.1618709 ,  0.38325697, -0.30552247, -0.6841324 ,
        -0.6971913 , -0.62698233,  0.33103782,  0.38949668,  0.9626486 ,
        -0.9678825 , -0.00738468,  0.48337263,  2.2751577 , -0.7544336 ,
         0.2762255 , -0.8119534 , -0.47978908,  0.41971564, -0.6995469 ]],
      dtype=float32)

In [80]:
vector_description_str[0]

array([[ 0.15061761, -0.22866273,  0.14617342,  0.30884618,  0.02439997,
        -0.3813262 , -0.13121431, -0.14215586,  0.51443768,  0.1037375 ,
        -0.24933131,  0.08981992,  0.46285298, -0.23384802,  0.15566206,
        -0.33288106,  0.03284791, -0.09148692, -0.03425598, -0.50361264,
         0.26700714, -0.0456257 , -0.2992793 , -0.08063361,  0.12734178,
         0.00625457, -0.01884551, -0.87336093, -0.08160407, -0.07867698,
        -0.39554146,  0.28552851, -0.24006692, -0.28874454, -0.10695081,
        -0.34137124,  0.23884676, -0.35137567,  0.09541436,  0.19921464,
        -0.11534573, -0.06538808, -0.15550601,  0.23118731, -0.34995532,
         0.23691154,  0.13610038, -0.13849372,  0.10327838,  0.04905422]])

In [82]:
vector_desc_title[0]

array([[ 0.0651715 , -0.39411464,  0.42241248,  0.31777731,  0.43191019,
        -0.01771371, -0.4520534 , -0.62032485,  0.99287838, -0.06408522,
         0.26175588, -0.42529353,  1.01385963, -0.60837865,  0.2441559 ,
        -1.22458148, -0.96888989,  1.13901043, -0.46787295, -0.72538525,
         1.16222954, -0.11322684, -0.31891525, -0.03702623,  0.20710579,
        -0.44438636,  0.29351732, -1.60846567,  0.20977274, -1.44578779,
        -1.19006395,  0.1618709 ,  0.38325697, -0.30552247, -0.6841324 ,
        -0.6971913 , -0.62698233,  0.33103782,  0.38949668,  0.96264857,
        -0.96788251, -0.00738468,  0.48337263,  2.27515769, -0.75443357,
         0.27622551, -0.81195343, -0.47978908,  0.41971564, -0.69954687],
       [ 0.15061761, -0.22866273,  0.14617342,  0.30884618,  0.02439997,
        -0.3813262 , -0.13121431, -0.14215586,  0.51443768,  0.1037375 ,
        -0.24933131,  0.08981992,  0.46285298, -0.23384802,  0.15566206,
        -0.33288106,  0.03284791, -0.09148692, -0.

In [83]:
vector_desc_title.shape

(15357, 2, 50)

We want to reshape the array to 2D, keeping the same number of rows and concatenating the two 50-dimensional vectors of each row into a single 100-dimensional vector.

In [84]:
big_vector_title_descr = vector_desc_title.reshape(len(vector_title_str),100)

In [85]:
big_vector_title_descr.shape

(15357, 100)
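
The reshape acts as a concatenation because NumPy stores arrays in row-major order by default, so flattening the last two axes places the second vector of each row directly after the first. A minimal sketch on a toy array (the shapes and the names `toy`, `flat` and `concat` are illustrative, not the notebook's data):

```python
import numpy as np

# Toy stand-in for vector_desc_title: 3 rows, 2 vectors per row, 4 dimensions each
toy = np.arange(3 * 2 * 4, dtype=np.float32).reshape(3, 2, 4)

# Flatten the last two axes; -1 lets NumPy infer the width (2 * 4 = 8)
flat = toy.reshape(toy.shape[0], -1)

# The same result, built by explicitly concatenating the two vectors of each row
concat = np.concatenate([toy[:, 0, :], toy[:, 1, :]], axis=1)

print(flat.shape)                    # (3, 8)
print(np.array_equal(flat, concat))  # True
```

Using `-1` instead of a hard-coded `100` would keep the cell working if the embedding size changes.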

In [86]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,title_token
1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,first warning acornas children,3,6,Science Fiction & Fantasy,in bestsellers mccaffrey and scarboroughs char...,27,"[bestsellers, mccaffrey, scarboroughs, charmin...","[first, warning, acornas, children]"
2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,xena warria princess prophecy of darkness,3,1,Science Fiction & Fantasy,st uk edition paperback fine in stock shipped ...,27,"[st, uk, edition, paperback, fine, stock, ship...","[xeno, warria, princess, prophecy, darkness]"
3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,dosage calculations ratioproportion approach t...,3,1,Medical Books,master dosage calculations with the ratiopropo...,18,"[master, dosage, calculations, ratioproportion...","[dosage, calculations, ratioproportion, approa..."
6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,a perversion of justice a southern tragedy of ...,3,1,Biographies & Memoirs,kathryn medico lives in south florida and teac...,1,"[kathryn, medic, lives, south, florida, teache...","[perversion, justice, southern, tragedy, murde..."
7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",bird populations collins new naturalist librar...,3,9,Science & Math,winner of the british ecological societys mars...,26,"[winner, british, ecological, society, marsh, ...","[bird, populations, collins, new, naturalist, ..."


In [87]:
len(data_subset)

15357

In [88]:
df_big_vector_title_descr = pd.DataFrame(data=big_vector_title_descr)

In [89]:
df_big_vector_title_descr.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,0.065172,-0.394115,0.422412,0.317777,0.43191,-0.017714,-0.452053,-0.620325,0.992878,-0.064085,...,-0.115346,-0.065388,-0.155506,0.231187,-0.349955,0.236912,0.1361,-0.138494,0.103278,0.049054
1,0.061473,0.437519,-0.152286,0.884974,0.092397,0.022035,-0.237929,-0.649927,1.249475,-1.04289,...,-0.04046,-0.431711,0.578465,0.354228,-0.646574,-0.069177,-0.354534,-0.175961,-0.437133,-0.1419
2,0.311475,-0.826405,-0.583969,0.731414,-0.732494,0.134193,0.169792,-0.448967,-1.33749,0.647214,...,-0.472603,-0.647555,-0.235218,-0.243032,0.114401,-0.229752,0.182064,-0.298881,0.066142,0.035161
3,-0.080059,-0.157396,-0.297149,1.093677,0.010896,-0.89409,0.300231,-0.209488,0.237316,-0.321829,...,0.230122,-0.123347,-0.117917,0.242885,0.072328,0.104013,-0.629506,-0.284695,-0.187429,-0.031264
4,-1.165964,-0.021377,-0.187748,0.081071,-0.395813,-0.245278,-0.673058,-0.152063,1.350735,1.375042,...,-0.132701,-0.266029,-0.029161,-0.138395,-0.436706,-0.218846,0.104312,-0.418283,-0.065884,0.049011


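The indexes of these two DataFrames no longer align: data_subset keeps its original row labels (1, 2, 3, 6, 7, ...), while the new DataFrame is numbered from 0. We need to reset the index of data_subset before we can join them side by side.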

In [90]:
data_subset = data_subset.reset_index()

In [91]:
data_subset.head()

Unnamed: 0,index,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,title_token
0,1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,first warning acornas children,3,6,Science Fiction & Fantasy,in bestsellers mccaffrey and scarboroughs char...,27,"[bestsellers, mccaffrey, scarboroughs, charmin...","[first, warning, acornas, children]"
1,2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,xena warria princess prophecy of darkness,3,1,Science Fiction & Fantasy,st uk edition paperback fine in stock shipped ...,27,"[st, uk, edition, paperback, fine, stock, ship...","[xeno, warria, princess, prophecy, darkness]"
2,3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,dosage calculations ratioproportion approach t...,3,1,Medical Books,master dosage calculations with the ratiopropo...,18,"[master, dosage, calculations, ratioproportion...","[dosage, calculations, ratioproportion, approa..."
3,6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,a perversion of justice a southern tragedy of ...,3,1,Biographies & Memoirs,kathryn medico lives in south florida and teac...,1,"[kathryn, medic, lives, south, florida, teache...","[perversion, justice, southern, tragedy, murde..."
4,7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",bird populations collins new naturalist librar...,3,9,Science & Math,winner of the british ecological societys mars...,26,"[winner, british, ecological, society, marsh, ...","[bird, populations, collins, new, naturalist, ..."


In [92]:
data_subset_2 = pd.concat([data_subset, df_big_vector_title_descr], axis=1)

In [93]:
data_subset_2.head()

Unnamed: 0,index,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,...,90,91,92,93,94,95,96,97,98,99
0,1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,first warning acornas children,3,6,Science Fiction & Fantasy,in bestsellers mccaffrey and scarboroughs char...,27,"[bestsellers, mccaffrey, scarboroughs, charmin...",...,-0.115346,-0.065388,-0.155506,0.231187,-0.349955,0.236912,0.1361,-0.138494,0.103278,0.049054
1,2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,xena warria princess prophecy of darkness,3,1,Science Fiction & Fantasy,st uk edition paperback fine in stock shipped ...,27,"[st, uk, edition, paperback, fine, stock, ship...",...,-0.04046,-0.431711,0.578465,0.354228,-0.646574,-0.069177,-0.354534,-0.175961,-0.437133,-0.1419
2,3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,dosage calculations ratioproportion approach t...,3,1,Medical Books,master dosage calculations with the ratiopropo...,18,"[master, dosage, calculations, ratioproportion...",...,-0.472603,-0.647555,-0.235218,-0.243032,0.114401,-0.229752,0.182064,-0.298881,0.066142,0.035161
3,6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,a perversion of justice a southern tragedy of ...,3,1,Biographies & Memoirs,kathryn medico lives in south florida and teac...,1,"[kathryn, medic, lives, south, florida, teache...",...,0.230122,-0.123347,-0.117917,0.242885,0.072328,0.104013,-0.629506,-0.284695,-0.187429,-0.031264
4,7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",bird populations collins new naturalist librar...,3,9,Science & Math,winner of the british ecological societys mars...,26,"[winner, british, ecological, society, marsh, ...",...,-0.132701,-0.266029,-0.029161,-0.138395,-0.436706,-0.218846,0.104312,-0.418283,-0.065884,0.049011
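
The reset matters because `pd.concat(..., axis=1)` matches rows on index labels rather than on position. A small sketch with made-up frames (the column names and values are illustrative) shows what would happen without it:

```python
import pandas as pd

# Left frame keeps gaps in its index, as data_subset does after filtering
left = pd.DataFrame({"title": ["a", "b", "c"]}, index=[1, 2, 6])
# Right frame has a fresh 0..n-1 index, like a DataFrame built from a NumPy array
right = pd.DataFrame({"v0": [0.1, 0.2, 0.3]})

# Without a reset, rows are matched by label: the union of labels is {0, 1, 2, 6}
misaligned = pd.concat([left, right], axis=1)
print(misaligned.shape)             # (4, 2)
print(misaligned.isna().sum().sum())  # 2

# After a reset, rows are matched one to one
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)
print(aligned.shape)                # (3, 2)
```

The notebook calls `reset_index()` without `drop=True`, which keeps the old labels in an `index` column; `drop=True` is used here only to keep the toy output small.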


### We want to check the count of each class to look for class imbalance

With another version of XGBoost (the open-source one used through our own training script), we can supply a vector of sample weights as a training parameter, which helps the model be less biased towards the large classes when the classes are imbalanced.

In [94]:
data_subset_2['cat_x2_code'].unique()

array([27, 18,  1, 26, 17, 19, 23,  9, 12,  8, 13, 28,  4, 32, 30, 22,  5,
        0, 20,  2, 10, 14, 11, 24, 29, 15,  7, 21, 25,  6, 16, 31,  3],
      dtype=int8)

In [95]:
data_subset_2_cat_x2_agg = data_subset_2.groupby(by=['cat_x2_code']).count()['index']
print(data_subset_2_cat_x2_agg)

cat_x2_code
0      588
1      782
2      430
3        3
4     2739
5      405
6       45
7       64
8      541
9      277
10     221
11     206
12     296
13     854
14     265
15      29
16      15
17    2472
18     115
19     611
20     520
21      67
22     585
23     555
24     395
25     123
26     487
27     266
28     205
29     208
30     703
31       7
32     278
Name: index, dtype: int64


We also get the data into the format that fastText expects.

In [96]:
data_subset_2["fastText_label"] = '__label__' + data_subset_2["cat_x2_code"].astype(str)

The data is now in a format we like, but for training we only need a few of these columns.

In [97]:
data_subset_2.head()

Unnamed: 0,index,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,...,91,92,93,94,95,96,97,98,99,fastText_label
0,1,"[Books, Science Fiction &amp; Fantasy, Science...",[In bestsellers McCaffrey and Scarborough's ch...,first warning acornas children,3,6,Science Fiction & Fantasy,in bestsellers mccaffrey and scarboroughs char...,27,"[bestsellers, mccaffrey, scarboroughs, charmin...",...,-0.065388,-0.155506,0.231187,-0.349955,0.236912,0.1361,-0.138494,0.103278,0.049054,__label__27
1,2,"[Books, Science Fiction & Fantasy, Fantasy]",[1st UK edition paperback fine In stock shippe...,xena warria princess prophecy of darkness,3,1,Science Fiction & Fantasy,st uk edition paperback fine in stock shipped ...,27,"[st, uk, edition, paperback, fine, stock, ship...",...,-0.431711,0.578465,0.354228,-0.646574,-0.069177,-0.354534,-0.175961,-0.437133,-0.1419,__label__27
2,3,"[Books, Medical Books, Medicine]",[Master dosage calculations with the ratio-pro...,dosage calculations ratioproportion approach t...,3,1,Medical Books,master dosage calculations with the ratiopropo...,18,"[master, dosage, calculations, ratioproportion...",...,-0.647555,-0.235218,-0.243032,0.114401,-0.229752,0.182064,-0.298881,0.066142,0.035161,__label__18
3,6,"[Books, Biographies & Memoirs, True Crime]",[Kathryn Medico lives in South Florida and tea...,a perversion of justice a southern tragedy of ...,3,1,Biographies & Memoirs,kathryn medico lives in south florida and teac...,1,"[kathryn, medic, lives, south, florida, teache...",...,-0.123347,-0.117917,0.242885,0.072328,0.104013,-0.629506,-0.284695,-0.187429,-0.031264,__label__1
4,7,"[Books, Science & Math, Biological Sciences]","[, Winner of the British Ecological Society's ...",bird populations collins new naturalist librar...,3,9,Science & Math,winner of the british ecological societys mars...,26,"[winner, british, ecological, society, marsh, ...",...,-0.266029,-0.029161,-0.138395,-0.436706,-0.218846,0.104312,-0.418283,-0.065884,0.049011,__label__26


It might be better to select the columns we need rather than drop so many. Let's look at the head.

In [98]:
#create a new dataframe before saving the data as CSV
df_gensim_xgb_sampleweight = data_subset_2.drop(columns=['index','category','description','title','cnt_cats','cnt_desc','cat_x2','description_str','description_str_token','title_token','fastText_label'])
df_fasttext = data_subset_2[['fastText_label','description_str', 'title']]

In [99]:
df_fasttext.head()

Unnamed: 0,fastText_label,description_str,title
0,__label__27,in bestsellers mccaffrey and scarboroughs char...,first warning acornas children
1,__label__27,st uk edition paperback fine in stock shipped ...,xena warria princess prophecy of darkness
2,__label__18,master dosage calculations with the ratiopropo...,dosage calculations ratioproportion approach t...
3,__label__1,kathryn medico lives in south florida and teac...,a perversion of justice a southern tragedy of ...
4,__label__26,winner of the british ecological societys mars...,bird populations collins new naturalist librar...


In [100]:
df_gensim_xgb_sampleweight.head()

Unnamed: 0,cat_x2_code,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,27,0.065172,-0.394115,0.422412,0.317777,0.43191,-0.017714,-0.452053,-0.620325,0.992878,...,-0.115346,-0.065388,-0.155506,0.231187,-0.349955,0.236912,0.1361,-0.138494,0.103278,0.049054
1,27,0.061473,0.437519,-0.152286,0.884974,0.092397,0.022035,-0.237929,-0.649927,1.249475,...,-0.04046,-0.431711,0.578465,0.354228,-0.646574,-0.069177,-0.354534,-0.175961,-0.437133,-0.1419
2,18,0.311475,-0.826405,-0.583969,0.731414,-0.732494,0.134193,0.169792,-0.448967,-1.33749,...,-0.472603,-0.647555,-0.235218,-0.243032,0.114401,-0.229752,0.182064,-0.298881,0.066142,0.035161
3,1,-0.080059,-0.157396,-0.297149,1.093677,0.010896,-0.89409,0.300231,-0.209488,0.237316,...,0.230122,-0.123347,-0.117917,0.242885,0.072328,0.104013,-0.629506,-0.284695,-0.187429,-0.031264
4,26,-1.165964,-0.021377,-0.187748,0.081071,-0.395813,-0.245278,-0.673058,-0.152063,1.350735,...,-0.132701,-0.266029,-0.029161,-0.138395,-0.436706,-0.218846,0.104312,-0.418283,-0.065884,0.049011


### For this version of XGBoost, we supply three arguments to the model: the features, the labels and, optionally, the sample weights, which can improve the performance of the model because we have an imbalanced dataset

In [101]:
X = df_gensim_xgb_sampleweight.drop(['cat_x2_code'], axis=1).values
y = df_gensim_xgb_sampleweight['cat_x2_code'].values


In [102]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
yX_train = np.column_stack((y_train, X_train))
yX_test = np.column_stack((y_test, X_test))
np.savetxt("book_gensim_train_v1.csv", yX_train, delimiter=",", fmt='%0.3f')
np.savetxt("book_gensim_test_v1.csv", yX_test, delimiter=",", fmt='%0.3f')
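
SageMaker's XGBoost CSV input expects the label in the first column and no header row, which is why the label is stacked in front of the features. A minimal sketch on toy values (the file names and data are illustrative) shows what `np.savetxt` writes, including the effect of `fmt='%0.3f'`, which rounds every value to three decimals and writes the integer label as a float:

```python
import os
import tempfile

import numpy as np

# Toy features and labels standing in for X_train and y_train
X_demo = np.array([[0.1234, -0.5], [0.9, 0.25]])
y_demo = np.array([27, 1])
yX_demo = np.column_stack((y_demo, X_demo))  # label first, then features

workdir = tempfile.mkdtemp()

# Same format string as the notebook: one format applied to every column
path = os.path.join(workdir, "demo_train.csv")
np.savetxt(path, yX_demo, delimiter=",", fmt="%0.3f")
with open(path) as fh:
    rows = fh.read().splitlines()
print(rows)  # ['27.000,0.123,-0.500', '1.000,0.900,0.250']

# Alternative: one format per column keeps the label an integer
path_mixed = os.path.join(workdir, "demo_train_mixed.csv")
np.savetxt(path_mixed, yX_demo, delimiter=",", fmt=["%d", "%0.6f", "%0.6f"])
with open(path_mixed) as fh:
    rows_mixed = fh.read().splitlines()
print(rows_mixed[0])  # 27,0.123400,-0.500000
```

Whether the extra digits in the features would change the trained model is not something tested here.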

In [103]:
print(y_test.shape)

(5068,)


In [104]:
# Upload the dataset to an S3 bucket
input_train = sagemaker_session.upload_data(path='book_gensim_train_v1.csv', key_prefix='%s/data' % prefix_gensim)
input_validation = sagemaker_session.upload_data(path='book_gensim_test_v1.csv', key_prefix='%s/data' % prefix_gensim)

In [105]:
#from sagemaker.inputs import TrainingInput

train_data = sagemaker.inputs.TrainingInput(s3_data=input_train,content_type="csv")
validation_data = sagemaker.inputs.TrainingInput(s3_data=input_validation,content_type="csv")

In our training script, we have a parser that is expecting the hyper-parameters below.

In [106]:
hyperparams = {
        "n_estimators": "300", 
        "n_jobs":"4",
        "max_depth":"10",
#        "min_child_weight": "6",
        "learning_rate": "0.1", 
        "objective":'multi:softmax', 
#        "reg_alpha": "10",
        "gamma": "4"
}

instance_type = "ml.m5.2xlarge"
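
In SageMaker script mode, the entries of `hyperparameters` reach the entry point as command-line arguments, so a training script typically reads them with `argparse`. The notebook's `train.py` is not shown here, so the parser below is a hypothetical sketch: the function name `parse_hyperparameters` and all default values are assumptions.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Parse hyperparameters passed as --name value pairs (hypothetical sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--n_jobs", type=int, default=1)
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--learning_rate", type=float, default=0.3)
    parser.add_argument("--objective", type=str, default="multi:softmax")
    parser.add_argument("--gamma", type=float, default=0.0)
    # parse_known_args tolerates extra arguments instead of raising an error
    args, _unknown = parser.parse_known_args(argv)
    return args


# The values arrive as strings, exactly as written in the hyperparams dict
demo_args = parse_hyperparameters([
    "--n_estimators", "300", "--n_jobs", "4", "--max_depth", "10",
    "--learning_rate", "0.1", "--objective", "multi:softmax", "--gamma", "4",
])
print(demo_args.n_estimators, demo_args.learning_rate, demo_args.objective)
```

Declaring a `type` for each argument converts the string values back into numbers before they are handed to the model.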

Below is our estimator. It uses the SageMaker XGBoost framework container together with our own training script, which trains with the open-source XGBoost library rather than the SageMaker built-in algorithm.

In [107]:
# updated XGBoost to XGBClassifier https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html#train-a-model-with-open-source-xgboost
from sagemaker import get_execution_role
from sagemaker.xgboost.estimator import XGBoost

role = get_execution_role()

xgb_estimator = XGBoost(
    entry_point="train.py",
    hyperparameters=hyperparams,
    role=role,
    instance_count=1,
    instance_type='ml.m5.4xlarge',
    framework_version="1.2-1",
    eval_metric="merror",
)

In [108]:
xgb_estimator.fit({'train': train_data, 'validation': validation_data })

2021-07-28 17:50:45 Starting - Starting the training job...
2021-07-28 17:50:47 Starting - Launching requested ML instancesProfilerReport-1627494645: InProgress
...
2021-07-28 17:51:37 Starting - Preparing the instances for training......
2021-07-28 17:52:45 Downloading - Downloading input data
2021-07-28 17:52:45 Training - Downloading the training image.....
[2021-07-28 17:53:23.576 ip-10-0-119-202.eu-west-1.compute.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Invoking user training script.
INFO:sagemaker-containers:Module train does not provide a setup.py.
Generating setup.py
INFO:sagemaker-containers:Generating setup.cfg
INFO:sagemaker-containers:Generating MANIFEST.in
INFO:sagemaker-container

In [109]:
xgb_predictor_gensim = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.2xlarge"
)

-------------!

In [147]:
print(xgb_predictor_gensim)

<sagemaker.xgboost.model.XGBoostPredictor object at 0x7f8fb7a8c630>


In [110]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import NumpyDeserializer
csv_serializer = CSVSerializer()
np_deserializer = NumpyDeserializer()

xgb_predictor_gensim.serializer = csv_serializer
xgb_predictor_gensim.deserializer = np_deserializer



In [111]:
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

predictions_test_xgb_weighted = [ float(xgb_predictor_gensim.predict(x)) for x in X_test]  
score = f1_score(y_test,predictions_test_xgb_weighted,labels=np.unique(y),average='micro')

print('F1 Score(micro): %.1f' % (score * 100.0))

F1 Score(micro): 53.9
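
Because every book has exactly one true label and receives exactly one predicted label, each wrong prediction counts as one false positive and one false negative, so the micro-averaged F1 reported here equals plain accuracy. A sketch on toy labels (the values and the helper name `micro_f1` are illustrative):

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes before dividing."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


toy_true = [27, 27, 18, 1, 26, 4]
toy_pred = [27, 17, 18, 1, 4, 4]

accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(round(micro_f1(toy_true, toy_pred), 4))  # 0.6667
print(round(accuracy, 4))                      # 0.6667
```

If the rare classes matter, `average='macro'` or the `classification_report` imported above gives a per-class view that the micro average hides.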


In [112]:
# xgb_predictor_gensim.delete_endpoint()

### In the next steps, we will use the built-in XGBoost, which doesn't let you set weights for the classes in the same way, and see how the results differ.

If we use the XGBClassifier, we need to divide our training data into 3 files of the same length: X = features, y = labels and W = weights.

We also need to create a map from each class to its weight.
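
One way to build such a class-to-weight map is inverse-frequency weighting, the heuristic scikit-learn calls "balanced". Choosing this scheme is an assumption made for illustration; the notebook does not say which weighting it uses. A sketch on toy labels:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Map each class to n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy labels standing in for y_train: class 4 is common, class 3 is rare
toy_labels = [4, 4, 4, 4, 17, 17, 17, 3]
class_weight = balanced_class_weights(toy_labels)

# W holds one weight per training row, so it has the same length as the labels
W = [class_weight[label] for label in toy_labels]

print({cls: round(w, 3) for cls, w in sorted(class_weight.items())})
# {3: 2.667, 4: 0.667, 17: 0.889}
```

With the open-source `XGBClassifier`, a vector like `W` would be passed to `fit` through its `sample_weight` argument.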

In [113]:
import boto3
container_uri = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, version='1.0-1')

# Create the estimator
xgb_bi = sagemaker.estimator.Estimator(container_uri,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix_gensim),
                                    sagemaker_session=sagemaker_session)
# Set the hyperparameters
xgb_bi.set_hyperparameters(eta=0.1,
                        max_depth=10,
                        gamma=4,
                        num_class=len(np.unique(y)),
                        alpha=10,
                        min_child_weight=6,
                        silent=0,
                        objective='multi:softmax',
                        num_round=300)

In [114]:
xgb_bi.fit({'train': train_data, 'validation': validation_data })

2021-07-28 18:05:10 Starting - Starting the training job...
2021-07-28 18:05:34 Starting - Launching requested ML instancesProfilerReport-1627495510: InProgress
......
2021-07-28 18:06:34 Starting - Preparing the instances for training......
2021-07-28 18:07:34 Downloading - Downloading input data
2021-07-28 18:07:34 Training - Downloading the training image...
2021-07-28 18:07:54 Training - Training image download completed. Training in progress.
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value multi:softmax to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Deter

# We trained our model and now want to test out the predictions

In [115]:
xgb_predictor = xgb_bi.deploy(
    initial_instance_count=1, 
    instance_type='ml.m4.xlarge'
)

---------------!

In [148]:
print(xgb_predictor)

<sagemaker.predictor.Predictor object at 0x7f8fb7a644e0>


In [116]:
xgb_predictor.serializer = csv_serializer

predictions_test = [ float(xgb_predictor.predict(x).decode('utf-8')) for x in X_test] 
score = f1_score(y_test,predictions_test,labels=np.unique(y),average='micro')

print('F1 Score(micro): %.1f' % (score * 100.0))

F1 Score(micro): 49.9


All done, you can delete your endpoint

In [117]:
#xgb_predictor.delete_endpoint()

# Next we will test out the native fastText supervised text classification

In this step, we want to see whether the native fastText algorithm can do the same job with less work.
With native fastText, you do not need to tokenize your sentences yourself, and you do not have to pick a vector size for the model training, because the `dim` parameter has a default value.
The algorithm does this work for you behind the scenes.
What we do need to do is get the data into the required format: we add the string "__label__" before the label, concatenate that with the description and title into one field, and present that field to the algorithm.



In [118]:
# df_fasttext was selected from data_subset_2, so take an explicit copy
# before adding a column; this avoids pandas' SettingWithCopyWarning
df_fasttext = df_fasttext.copy()
df_fasttext['full'] = df_fasttext['fastText_label'] + ' ' + df_fasttext['description_str'] + ' ' + df_fasttext['title']


In [119]:
df_fasttext.head()

Unnamed: 0,fastText_label,description_str,title,full
0,__label__27,in bestsellers mccaffrey and scarboroughs char...,first warning acornas children,__label__27 in bestsellers mccaffrey and scarb...
1,__label__27,st uk edition paperback fine in stock shipped ...,xena warria princess prophecy of darkness,__label__27 st uk edition paperback fine in st...
2,__label__18,master dosage calculations with the ratiopropo...,dosage calculations ratioproportion approach t...,__label__18 master dosage calculations with th...
3,__label__1,kathryn medico lives in south florida and teac...,a perversion of justice a southern tragedy of ...,__label__1 kathryn medico lives in south flori...
4,__label__26,winner of the british ecological societys mars...,bird populations collins new naturalist librar...,__label__26 winner of the british ecological s...
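
fastText's supervised mode reads one example per line, with each label marked by the `__label__` prefix. The sketch below builds such lines and writes them as plain text; the helper name `to_fasttext_line`, the toy rows and the file name are illustrative assumptions.

```python
import os
import tempfile


def to_fasttext_line(label_code, description, title):
    """Build one training line: '__label__<code> <description> <title>'."""
    # split/join collapses repeated whitespace and removes embedded newlines,
    # so every example stays on a single line
    text = " ".join(f"{description} {title}".split())
    return f"__label__{label_code} {text}"


toy_rows = [
    (27, "st uk edition paperback", "xena warria princess"),
    (18, "master dosage\ncalculations", "dosage calculations"),
]
lines = [to_fasttext_line(*row) for row in toy_rows]

# Write plain text, one example per line
path = os.path.join(tempfile.mkdtemp(), "toy_fasttext_train.txt")
with open(path, "w", encoding="utf-8") as fh:
    fh.write("\n".join(lines) + "\n")

with open(path, encoding="utf-8") as fh:
    written = fh.read().splitlines()
print(written[0])  # __label__27 st uk edition paperback xena warria princess
```

Writing the lines directly, rather than through a CSV writer, avoids quote characters being added around any text that contains commas or quotes. The descriptions in this notebook have already had punctuation stripped, so this is a precaution rather than a known problem.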


We will take the same index as our test example above to see whether the fastText algorithm makes the same prediction.

In [120]:
! pip install fasttext==0.9.1

Collecting fasttext==0.9.1
  Downloading fasttext-0.9.1.tar.gz (57 kB)
Collecting pybind11>=2.2
  Using cached pybind11-2.7.0-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.1-cp36-cp36m-linux_x86_64.whl size=2161577 sha256=1d451c8c4923ce4ed58cc621aaeac8106d21c6693925505ac887b265c1a3e207
  Stored in directory: /home/ec2-user/.cache/pip/wheels/ae/e8/a0/03628c77c2e0aa813f067f6d7708a4579d15abf6f45e8716c5
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.1 pybind11-2.7.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [121]:
import fasttext

In [122]:
fasttext_dataset = df_fasttext['full']

In [123]:
from sklearn.model_selection import train_test_split

train_fasttext_native, val_fasttext_native = train_test_split(fasttext_dataset, test_size=0.33, random_state=42)

train_file_name = 'train_books_fasttext_native.csv'
valid_file_name = 'valid_books_fasttext_native.csv'
train_fasttext_native.to_csv(train_file_name, index=False, header=False)
val_fasttext_native.to_csv(valid_file_name, index=False, header=False)

In [124]:
model_native = fasttext.train_supervised(input=train_file_name, lr=0.1, epoch=50)

In [125]:
modelwordGram = fasttext.train_supervised(input=train_file_name, lr=0.1, epoch=50, wordNgrams=2)

### We will run a simple test with the validation data. `test` returns the number of examples, the precision at k and the recall at k, and we can play with the hyperparameters to tune these

In [175]:
FastText_Precision_Recall = model_native.test(valid_file_name, k=5)
print(FastText_Precision_Recall)

(5068, 0.16988950276243095, 0.8494475138121547)


In [177]:
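# Unpack (number of examples, precision@5, recall@5); using a new name
# avoids overwriting sklearn's f1_score function imported earlier
_, precision_at_5, recall_at_5 = FastText_Precision_Recall
fasttext_f1_at_5 = 2 * (precision_at_5 * recall_at_5) / (precision_at_5 + recall_at_5)
print('F1 Score (k=5): %.1f' % (fasttext_f1_at_5 * 100.0))

F1 Score (k=5): 28.3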
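
This score is not directly comparable with the XGBoost figures above. With `k=5`, fastText returns five labels per book while each book has a single true label, so precision can be at most 1/5 and is exactly the recall divided by 5. A sketch of the arithmetic, assuming 4,305 of the 5,068 validation books have their true label among the top five (a count inferred here from the reported recall, not printed by fastText):

```python
def precision_recall_at_k(hits, n_examples, k, labels_per_example=1):
    """hits = number of examples whose true label is among the top-k predictions."""
    precision = hits / (n_examples * k)                 # k predictions per example
    recall = hits / (n_examples * labels_per_example)   # one true label per example
    return precision, recall


def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)


n_examples, k = 5068, 5
hits = round(0.8494475138121547 * n_examples)  # inferred from the reported recall

p_at_5, r_at_5 = precision_recall_at_k(hits, n_examples, k)
print(hits)                                 # 4305
print(round(p_at_5, 4), round(r_at_5, 4))   # 0.1699 0.8494
print(round(100 * f1(p_at_5, r_at_5), 1))   # 28.3
```

Calling `model_native.test(valid_file_name, k=1)` would score only the top prediction, where precision and recall coincide with accuracy for single-label data. That call was not run in this notebook, so no figure is given for it.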


In [128]:
# The file has no header row, so tell pandas not to use the first example as column names
df_valid_ft = pd.read_csv(valid_file_name, header=None, names=['full'])
df_valid_ft.head()

Unnamed: 0,full
0,__label__27 to say this book is action packed i...
1,__label__0 the inch glass vase now housed in t...
2,__label__30 louisa may alcott is the author o...
3,__label__17 kathryn smith has always loved hap...
4,__label__26 in lonely planets astronomer david...


In [129]:
fasttext_sample_validation = data_subset_2['description_str'] + data_subset_2['title']
fasttext_sample_validation.head()

0    in bestsellers mccaffrey and scarboroughs char...
1    st uk edition paperback fine in stock shipped ...
2    master dosage calculations with the ratiopropo...
3    kathryn medico lives in south florida and teac...
4    winner of the british ecological societys mars...
dtype: object

## Test the prediction versus what we got with the XGB classifier

In [130]:
model_native.predict(fasttext_sample_validation[1], k=5)

(('__label__27', '__label__17', '__label__19', '__label__30', '__label__25'),
 array([0.93902868, 0.03559512, 0.01350102, 0.00587968, 0.00185194]))

In [131]:
modelwordGram.predict(fasttext_sample_validation[1], k=5)

(('__label__27', '__label__17', '__label__19', '__label__22', '__label__30'),
 array([0.82037991, 0.06401689, 0.02658203, 0.02078471, 0.01122167]))

We can host our model on SageMaker. The BlazingText built-in algorithm is compatible with fastText models, so we can upload the fastText model to S3, point a SageMaker endpoint configuration at this model, and then deploy our endpoint.

In [132]:
model_filename = "books_fasttext_native.bin"
model_native.save_model(model_filename)

In [133]:
from time import gmtime, strftime


In [137]:
!tar -czvf model.tar.gz books_fasttext_native.bin
model_location = sagemaker_session.upload_data("model.tar.gz", bucket=bucket, key_prefix=f"fasttext/model-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}/output")
!rm model.tar.gz books_fasttext_native.bin

books_fasttext_native.bin
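
The same packaging step can be done from Python with the standard-library `tarfile` module, which avoids depending on shell commands. This is a sketch: the helper name `package_model` is hypothetical, and a placeholder file stands in for the real model so the example runs on its own.

```python
import os
import tarfile
import tempfile


def package_model(model_path, archive_path):
    """Write model_path into a gzip-compressed tar archive at its top level."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname drops the directory part so the file sits at the archive root
        tar.add(model_path, arcname=os.path.basename(model_path))
    return archive_path


# Placeholder file standing in for the saved fastText model
workdir = tempfile.mkdtemp()
dummy_model = os.path.join(workdir, "books_fasttext_native.bin")
with open(dummy_model, "wb") as fh:
    fh.write(b"not a real model")

archive = package_model(dummy_model, os.path.join(workdir, "model.tar.gz"))
with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
print(members)  # ['books_fasttext_native.bin']
```

Listing the archive members before uploading is a cheap check that the model file is at the top level of the archive rather than nested inside a directory.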


In [139]:
container = sagemaker.image_uris.retrieve("blazingtext",boto3.Session().region_name,  "1")
print('Using SageMaker BlazingText container: {} ({})'.format(container, boto3.Session().region_name))

Using SageMaker BlazingText container: 685385470294.dkr.ecr.eu-west-1.amazonaws.com/blazingtext:1 (eu-west-1)


# Deploy endpoint in SageMaker

BlazingText is compatible with fastText models, so you can train the fastText model wherever you want, push the model to S3 in the required format (saved as a .tar.gz file) and then deploy it in SageMaker, which takes care of the heavy lifting.

In [154]:
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

# Use the BlazingText container with the fastText model uploaded to S3
model_fastText_book = sagemaker.Model(
    model_data=model_location,
    image_uri=container,
    role=role,
    sagemaker_session=sagemaker_session)

# Create the endpoint
model_fastText_book.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge')

# Attach a predictor that sends and receives JSON
predictor = sagemaker.Predictor(
    endpoint_name=model_fastText_book.endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)


-------------!

In [172]:
fasttext_sample_validation[1]

'st uk edition paperback fine in stock shipped from our uk warehousexena warria princess  prophecy of darkness'

In [173]:
sentence = [ fasttext_sample_validation[1] ]
payload = {"instances": sentence }

In [174]:
predictions = predictor.predict(payload)
print(predictions)

[{'label': ['__label__27'], 'prob': [0.9390285015106201]}]
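
To compare the endpoint's answer with the integer codes in `cat_x2_code`, the `__label__` prefix has to be stripped from the returned label. A small sketch, assuming the response always has the shape shown above (the helper name `parse_blazingtext_response` is hypothetical):

```python
def parse_blazingtext_response(response, prefix="__label__"):
    """Return (class_code, probability) for the top label of each instance."""
    parsed = []
    for item in response:
        label = item["label"][0]
        # Convert '__label__27' to 27; leave None if the prefix is missing
        code = int(label[len(prefix):]) if label.startswith(prefix) else None
        parsed.append((code, item["prob"][0]))
    return parsed


# Response copied from the endpoint output above
response = [{'label': ['__label__27'], 'prob': [0.9390285015106201]}]
parsed = parse_blazingtext_response(response)
print(parsed)  # [(27, 0.9390285015106201)]
```

The returned code 27 matches the `cat_x2_code` of this book in `data_subset_2`, and the probability agrees with the local `model_native.predict` result to about seven decimal places.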


# Clean up, delete endpoint

In [None]:
#predictor.delete_endpoint()