# A notebook to explore text classification using word embedders

In this notebook, I take a public dataset of books with metadata such as description, title and category/genre.
I then use a word embedder to vectorize the description and title, and use XGBoost to train a classifier on the category.
The word embedder is Gensim's FastText implementation.
I then repeat the process using the native FastText implementation and compare the results.
Finally, I host these models on Amazon SageMaker.

## Install libraries, initialise variables, download dataset

In [2]:
! pip install gensim==3.8.3

Collecting gensim==3.8.3
  Downloading gensim-3.8.3-cp36-cp36m-manylinux1_x86_64.whl (24.2 MB)
Collecting smart-open>=1.8.1
  Downloading smart_open-5.2.0-py3-none-any.whl (58 kB)
Installing collected packages: smart-open, gensim
Successfully installed gensim-3.8.3 smart-open-5.2.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [3]:
import gensim
from gensim.models import FastText
from gensim.test.utils import common_texts  # some example sentences
from gensim.utils import simple_preprocess
print(common_texts[1])
print(len(common_texts))

['survey', 'user', 'computer', 'system', 'response', 'time']
9


Gensim expects the sentences to already be tokenized and pre-processed, i.e. each sentence is a list of string tokens, as in the `common_texts` example above.
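
To make that input shape concrete, here is a minimal sketch using only the standard library. The `naive_tokenize` helper and the sample sentences are made up for illustration; they stand in for the NLTK-based cleaning and tokenizing done later in this notebook.

```python
import re

def naive_tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens (hypothetical helper)."""
    return re.findall(r"[a-z]+", sentence.lower())

raw_sentences = [
    "Human machine interface for lab computer applications",
    "A survey of user opinion of computer system response time",
]

# Gensim-style corpus: a list of sentences, each sentence a list of string tokens
corpus = [naive_tokenize(s) for s in raw_sentences]
print(corpus[1])
```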

In [5]:
import pandas as pd
import numpy as np
import json
import sagemaker

In [6]:
# Get SageMaker session & default S3 bucket
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket() # replace with your own bucket if you have one 
s3 = sagemaker_session.boto_session.resource('s3')


prefix_gensim = 'data_gensim_xgb'
prefix_fasttext = 'data_fasttext'

## Get the data into a working format with just the features we need

In [7]:
# Downloading the book metadata
! wget http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Books.json.gz
# Uncompressing
!gzip -d meta_Books.json.gz -f

--2021-08-25 07:34:55--  http://deepyeti.ucsd.edu/jianmo/amazon/metaFiles/meta_Books.json.gz
Resolving deepyeti.ucsd.edu (deepyeti.ucsd.edu)... 169.228.63.50
Connecting to deepyeti.ucsd.edu (deepyeti.ucsd.edu)|169.228.63.50|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1215601087 (1.1G) [application/octet-stream]
Saving to: ‘meta_Books.json.gz’


2021-08-25 07:36:06 (16.7 MB/s) - ‘meta_Books.json.gz’ saved [1215601087/1215601087]



The file is rather large, so in the cell below we reduce it by keeping only the first 50,000 records.

In [8]:
#Reducing the dataset 
! head -n 50000 meta_Books.json > books_train.json
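
Note that `head` keeps the first 50,000 lines rather than a random sample; the shuffle in the next cell only reorders rows within that subset. For environments without `head`, the same step can be sketched in Python. The `head_json_lines` helper and the tiny stand-in file below are assumptions for illustration, not part of the original notebook.

```python
import itertools
import json
import os
import tempfile

def head_json_lines(src_path, dst_path, n):
    """Copy the first n lines of a JSON Lines file and return how many were written."""
    written = 0
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in itertools.islice(src, n):
            dst.write(line)
            written += 1
    return written

# Tiny stand-in file so the sketch runs without the 1.1 GB download
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "meta_sample.json")
dst_path = os.path.join(tmp_dir, "books_sample.json")
with open(src_path, "w", encoding="utf-8") as f:
    for i in range(5):
        f.write(json.dumps({"title": f"book {i}"}) + "\n")

n_written = head_json_lines(src_path, dst_path, 3)
with open(dst_path, encoding="utf-8") as f:
    records = [json.loads(line) for line in f]
```

In the notebook itself the call would look like `head_json_lines('meta_Books.json', 'books_train.json', 50000)`.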

In [9]:
#load data
data=pd.read_json('books_train.json', lines=True)
#shuffle the data in place
data = data.sample(frac=1).reset_index(drop=True)
# show first few rows
data.head()

Unnamed: 0,category,tech1,description,fit,title,also_buy,image,tech2,brand,feature,rank,also_view,main_cat,similar_item,date,price,asin
0,"[Books, Children's Books, Growing Up &amp; Fac...",,[PreSchool-Grade 2-When Katie's water phobia t...,,Katie Catz Makes a Splash (Good Sports),"[0618914846, 0140553126, 0803718829]",[],,Visit Amazon's Anne Rockwell Page,[],"2,391,924 in Books (",[],Books,,NaT,,60284412
1,"[Books, Biographies & Memoirs, Historical]",,[This brief biography focuses more on the poli...,,Andrew Jackson,"[0307946371, 189311449X, 1457694700, 081297346...",[],,Robert V. Remini,[],"512,869 in Books (","[0061807885, 0812973461, 1400030722, 080185911...",Books,,NaT,$9.76,60801328
2,"[Books, Literature &amp; Fiction, Genre Fiction]",,[Racial and class conflicts simmer in this lac...,,The Water Dancers: A Novel,[],[],,Ms. Terry Gamble,[],"4,845,671 in Books (","[0062839896, 0060737948]",Books,,NaT,$2.40,60542667
3,"[Books, Christian Books &amp; Bibles, Catholic...",,"[Spong, an Episcopal bishop and best-selling a...",,Born of a Woman: A Bishop Rethinks the Birth o...,"[0062641298, 0060762055, 0060675322, 006067556...",[],,Visit Amazon's John Shelby Spong Page,[],"1,829,110 in Books (","[0060675187, 0060778423, 0060778407, 006236231...",Books,,NaT,$14.69,60675136
4,"[Books, History, World]",,[Re-creates the world from the second to the f...,,Pagans and Christians,"[0679744061, 0307743748, 1631492225, 140514911...",[],,Visit Amazon's Robin Lane Fox Page,[],"1,548,931 in Books (","[0141022957, 0141022965, 0192803204, 067403218...",Books,,NaT,$9.89,60628529


We are only interested in a few columns from this dataset, so we will create a dataframe that contains only these.

In [10]:
data_subset = data[["category","description", "title" ]]

In [11]:
data_subset.head()

Unnamed: 0,category,description,title
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,Katie Catz Makes a Splash (Good Sports)
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,Andrew Jackson
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,The Water Dancers: A Novel
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",Born of a Woman: A Bishop Rethinks the Birth o...
4,"[Books, History, World]",[Re-creates the world from the second to the f...,Pagans and Christians


Next, we check how many elements the category and description lists contain in each row.

In [12]:
length = data_subset.category.apply(len)

In [13]:
length.unique()

array([3, 2, 0, 4, 5])

In [14]:
data_subset["cnt_cats"] = data_subset.category.apply(len)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
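
The warning above appears because `data_subset` was created by selecting columns from `data`, so pandas cannot tell whether the assignment is meant to reach the original frame. A minimal sketch of one common way to avoid it is to take an explicit copy when subsetting. The toy frame below is invented for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "category": [["Books", "History", "World"], []],
    "description": [["first part", "second part"], ["only part"]],
    "title": ["Pagans and Christians", "Untitled"],
    "price": ["$9.89", None],
})

# .copy() makes data_subset an independent frame, so adding columns is unambiguous
data_subset = data[["category", "description", "title"]].copy()
data_subset["cnt_cats"] = data_subset["category"].apply(len)
data_subset["cnt_desc"] = data_subset["description"].apply(len)
```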


In [15]:
data_subset["cnt_desc"] = data_subset.description.apply(len)
data_subset.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


Unnamed: 0,category,description,title,cnt_cats,cnt_desc
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,Katie Catz Makes a Splash (Good Sports),3,11
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,Andrew Jackson,3,2
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,The Water Dancers: A Novel,3,4
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",Born of a Woman: A Bishop Rethinks the Birth o...,3,3
4,"[Books, History, World]",[Re-creates the world from the second to the f...,Pagans and Christians,3,3


In [16]:
# delete the rows that have no category or no description
data_subset = data_subset[data_subset.cnt_cats != 0]
data_subset = data_subset[data_subset.cnt_desc != 0]

In [17]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,Katie Catz Makes a Splash (Good Sports),3,11
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,Andrew Jackson,3,2
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,The Water Dancers: A Novel,3,4
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",Born of a Woman: A Bishop Rethinks the Birth o...,3,3
4,"[Books, History, World]",[Re-creates the world from the second to the f...,Pagans and Christians,3,3


In [18]:
data_subset["cat_x2"] = data_subset["category"].str[1]

In [19]:
data_subset.head(10)

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,Katie Catz Makes a Splash (Good Sports),3,11,Children's Books
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,Andrew Jackson,3,2,Biographies & Memoirs
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,The Water Dancers: A Novel,3,4,Literature &amp; Fiction
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",Born of a Woman: A Bishop Rethinks the Birth o...,3,3,Christian Books &amp; Bibles
4,"[Books, History, World]",[Re-creates the world from the second to the f...,Pagans and Christians,3,3,History
6,"[Books, Business &amp; Money, Marketing &amp; ...","[A few years ago, everybody with a product to ...",Loyalty.Com: Customer Relationship Management ...,3,5,Business &amp; Money
7,"[Books, New, Used & Rental Textbooks]",[These two books restore the true perspective ...,Oscar Wilde: Interviews and Recollections (2 V...,2,1,"New, Used & Rental Textbooks"
8,"[Books, Literature &amp; Fiction, Genre Fiction]","[, Tanner Coles football career was over in le...",Necessary Roughness,3,10,Literature &amp; Fiction
9,"[Books, Literature &amp; Fiction, United States]",[An intrepid heroine with a fierce protective ...,While the Duke Was Sleeping: The Rogue Files,3,9,Literature &amp; Fiction
10,"[Books, Children's Books, Education &amp; Refe...","[Gr 3-6With the help of his characters Arlo, E...",My Weird Writing Tips (My Weird School),3,4,Children's Books


We can see that the category column holds a list that is a hierarchical classification of the book. We can train our classifier on just one level of that hierarchy. Every row is a book, so the first element carries no information, but the second element looks more useful, which is why it was extracted into `cat_x2` above.

The data also needs a little cleaning: some labels contain the HTML entity `&amp;` left over from encoding, which we can fix with a `replace`.

In [20]:
data_subset["cat_x2"] = data_subset["cat_x2"].replace("&amp;", "&", regex=True)
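
As an alternative sketch, not what this notebook does, `html.unescape` from the standard library decodes every HTML entity rather than only `&amp;`. The sample labels are taken from the rows shown above.

```python
import html

labels = ["Literature &amp; Fiction", "Biographies & Memoirs", "Children's Books"]

# Entities are decoded; labels without entities pass through unchanged
cleaned = [html.unescape(label) for label in labels]
print(cleaned)
```

On the dataframe this could be applied with `data_subset["cat_x2"].apply(html.unescape)`.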

In [21]:
data_subset["cat_x2"].head()

0            Children's Books
1       Biographies & Memoirs
2        Literature & Fiction
3    Christian Books & Bibles
4                     History
Name: cat_x2, dtype: object

In [22]:
len(data_subset["cat_x2"].unique())

33

In [23]:
data_subset['description_str'] = data_subset['description'].apply(lambda x: ' '.join(map(str, x)))

In [24]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,Katie Catz Makes a Splash (Good Sports),3,11,Children's Books,PreSchool-Grade 2-When Katie's water phobia th...
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,Andrew Jackson,3,2,Biographies & Memoirs,This brief biography focuses more on the polit...
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,The Water Dancers: A Novel,3,4,Literature & Fiction,Racial and class conflicts simmer in this lack...
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",Born of a Woman: A Bishop Rethinks the Birth o...,3,3,Christian Books & Bibles,"Spong, an Episcopal bishop and best-selling au..."
4,"[Books, History, World]",[Re-creates the world from the second to the f...,Pagans and Christians,3,3,History,Re-creates the world from the second to the fo...


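We want to convert the category column to pandas' categorical dtype so that each label gets an integer code, which can then serve as the classifier target.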

In [25]:
data_subset["cat_x2"] = data_subset["cat_x2"].astype("category")

In [26]:
data_subset["cat_x2"].cat.codes

0         4
1         1
2        17
3         5
4        13
         ..
49992    24
49995    10
49996    17
49997    17
49999     4
Length: 39813, dtype: int8

In [27]:
data_subset["cat_x2_code"] = data_subset["cat_x2"].cat.codes
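
To show what the codes represent, here is a minimal standard-library sketch of the same idea. It assumes, as I understand pandas' behaviour for string labels, that categories are the sorted unique values; the label list is a small made-up sample.

```python
labels = [
    "Children's Books",
    "Biographies & Memoirs",
    "Literature & Fiction",
    "History",
    "Literature & Fiction",
]

# Sorted unique labels get codes 0..k-1
categories = sorted(set(labels))
label_to_code = {label: code for code, label in enumerate(categories)}
codes = [label_to_code[label] for label in labels]

# Reverse lookup to turn a predicted code back into a readable label
decoded = [categories[code] for code in codes]
```

In the notebook, the equivalent reverse lookup table is available as `data_subset["cat_x2"].cat.categories`.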

In [28]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,Katie Catz Makes a Splash (Good Sports),3,11,Children's Books,PreSchool-Grade 2-When Katie's water phobia th...,4
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,Andrew Jackson,3,2,Biographies & Memoirs,This brief biography focuses more on the polit...,1
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,The Water Dancers: A Novel,3,4,Literature & Fiction,Racial and class conflicts simmer in this lack...,17
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",Born of a Woman: A Bishop Rethinks the Birth o...,3,3,Christian Books & Bibles,"Spong, an Episcopal bishop and best-selling au...",5
4,"[Books, History, World]",[Re-creates the world from the second to the f...,Pagans and Christians,3,3,History,Re-creates the world from the second to the fo...,13


## Gensim requires us to cleanse and tokenize the data

In [29]:
import re

def remove_numbers(text):
    '''
    This function takes strings containing numbers and returns strings with numbers removed.
    '''
    return re.sub(r'\d+', '', text)

In [30]:
def remove_mentions(text):
    '''  
    This function takes strings containing mentions and returns strings with 
    mentions (@ and the account name) removed.
    Input(string): one tweet, contains mentions
    Output(string): one tweet, mentions (@ and the account name mentioned) removed 
    '''
    mentions = re.compile(r'@\w+ ?')
    return mentions.sub(r'', text)

In [31]:
def extract_mentions(text):
    '''
    This function takes strings containing mentions and returns strings with 
    mentions (@ and the account name) extracted into a different element,
    and removes the mentions in the original sentence.
    Input(string): one sentence, contains mentions
    '''
    mentions = [i[1:] for i in text.split() if i.startswith("@")]
    sentence = re.compile(r'@\w+ ?').sub(r'', text)
    return sentence,mentions

In [32]:
! pip install spacy

Collecting spacy
  Downloading spacy-3.1.2-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
Collecting srsly<3.0.0,>=2.4.1
  Downloading srsly-2.4.1-cp36-cp36m-manylinux2014_x86_64.whl (456 kB)
Collecting spacy-legacy<3.1.0,>=3.0.7
  Downloading spacy_legacy-3.0.8-py2.py3-none-any.whl (14 kB)
Collecting typer<0.4.0,>=0.3.0
  Downloading typer-0.3.2-py3-none-any.whl (21 kB)
Collecting wasabi<1.1.0,>=0.8.1
  Downloading wasabi-0.8.2-py3-none-any.whl (23 kB)
Collecting cymem<2.1.0,>=2.0.2
  Downloading cymem-2.0.5-cp36-cp36m-manylinux2014_x86_64.whl (35 kB)
Collecting murmurhash<1.1.0,>=0.28.0
  Downloading murmurhash-1.0.5-cp36-cp36m-manylinux2014_x86_64.whl (20 kB)
Collecting thinc<8.1.0,>=8.0.8
  Downloading thinc-8.0.8-cp36-cp36m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (622 kB)

In [33]:
! pip install textblob

Collecting textblob
  Downloading textblob-0.15.3-py2.py3-none-any.whl (636 kB)
Installing collected packages: textblob
Successfully installed textblob-0.15.3
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [34]:
import nltk
import spacy
from textblob import TextBlob
import re
import string
import glob
import sagemaker

In [35]:
punc_list = string.punctuation #you can self define list of punctuation to remove here
def remove_punctuation(text): 
    """
    This function takes strings containing self defined punctuations and returns
    strings with punctuations removed.
    """
    translator = str.maketrans('', '', punc_list) 
    return text.translate(translator) 

In [36]:
def remove_whitespace(text): 
    '''
    This function takes a string and returns it with leading/trailing whitespace
    stripped and runs of whitespace collapsed into single spaces.
    '''
    return  " ".join(text.split())

In [37]:
def remove_html_tags(text):
    """Remove html tags from a string"""
    import re
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

In [38]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,Katie Catz Makes a Splash (Good Sports),3,11,Children's Books,PreSchool-Grade 2-When Katie's water phobia th...,4
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,Andrew Jackson,3,2,Biographies & Memoirs,This brief biography focuses more on the polit...,1
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,The Water Dancers: A Novel,3,4,Literature & Fiction,Racial and class conflicts simmer in this lack...,17
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",Born of a Woman: A Bishop Rethinks the Birth o...,3,3,Christian Books & Bibles,"Spong, an Episcopal bishop and best-selling au...",5
4,"[Books, History, World]",[Re-creates the world from the second to the f...,Pagans and Christians,3,3,History,Re-creates the world from the second to the fo...,13


In [39]:
data_subset["description_str"]=data_subset["description_str"].apply(remove_html_tags)
data_subset["title"]=data_subset["title"].apply(remove_html_tags)

In [40]:
data_subset["description_str"] = data_subset["description_str"].str.lower()
data_subset["title"] = data_subset["title"].str.lower()

In [41]:
data_subset["description_str"]=data_subset["description_str"].apply(remove_whitespace).apply(remove_punctuation).apply(remove_numbers)
data_subset["title"]=data_subset["title"].apply(remove_whitespace).apply(remove_punctuation).apply(remove_numbers)
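
The cleaning steps above can be sketched as one function. This is a consolidated variant under my own assumptions, not the notebook's exact pipeline: it collapses whitespace last, because removing punctuation or digits after the whitespace step can leave double spaces behind. The names `TAG_RE`, `PUNCT_TABLE` and `clean_text` are hypothetical.

```python
import re
import string

TAG_RE = re.compile(r"<.*?>")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text):
    """Apply the cleaning steps to one string, normalising whitespace last."""
    text = TAG_RE.sub("", text)          # strip HTML tags
    text = text.lower()                  # lowercase
    text = text.translate(PUNCT_TABLE)   # drop punctuation
    text = re.sub(r"\d+", "", text)      # drop digits
    return " ".join(text.split())        # collapse whitespace

sample = "<b>PreSchool-Grade 2</b>-When Katie's  water phobia..."
print(clean_text(sample))
```

It could be applied with `data_subset["description_str"].apply(clean_text)`. Note that stripping punctuation joins hyphenated words, which is why `PreSchool-Grade` ends up as `preschoolgrade`.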


In [42]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [43]:
from nltk.tokenize import word_tokenize 
def tokenize_sent(text): 
    ''' 
    This function takes strings and returns tokenized words.
    '''
    word_tokens = word_tokenize(text)  
    return word_tokens 

In [44]:
data_subset["description_str_token"] = data_subset["description_str"].apply(tokenize_sent)

In [45]:
data_subset["title_token"] = data_subset["title"].apply(tokenize_sent)

In [46]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/ec2-user/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [47]:
stopwords_list = set(stopwords.words('english'))

In [48]:
from collections import Counter
counter = Counter()
for word in  [w for sent in data_subset["description_str_token"] for w in sent]:
    counter[word] += 1        
counter.most_common(10)

[('the', 394618),
 ('and', 293214),
 ('of', 252017),
 ('a', 195739),
 ('to', 155360),
 ('in', 131292),
 ('is', 93090),
 ('for', 71907),
 ('with', 64229),
 ('as', 50946)]

In [49]:
# least frequent words (this slice returns the last 9 entries, not 10)
counter.most_common()[:-10:-1]

[('snoozed', 1),
 ('wwwtaherehbookscom', 1),
 ('taherehmafi', 1),
 ('overcaffeinated', 1),
 ('whichwood', 1),
 ('memoirashistory', 1),
 ('jailings', 1),
 ('righth', 1),
 ('abernathys', 1)]
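
The output above has nine entries because a reversed slice `[:-n:-1]` stops before index `-n`. A small sketch on a made-up counter shows the difference between that slice and one that returns exactly `n` items.

```python
from collections import Counter

counter = Counter("a a a a b b b c c d".split())
ranked = counter.most_common()       # most frequent first

n = 2
last_n_minus_one = ranked[:-n:-1]    # stops before index -n, so n - 1 items
last_n = ranked[:-n - 1:-1]          # n items, least frequent first
```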

In [50]:
top_n = 10
bottom_n = 10
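# add the top_n most frequent and the bottom_n least frequent words to the stop words
stopwords_list |= {word for (word, count) in counter.most_common(top_n)}
# -bottom_n - 1 so that the reversed slice really returns bottom_n items
stopwords_list |= {word for (word, count) in counter.most_common()[:-bottom_n - 1:-1]}
stopwords_list |= {'thats'}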
def remove_stopwords(tokenized_text): 
    '''
    This function takes a list of tokenized words from the description and title, removes self-defined stop words from the list,
    and returns the list of words with stop words removed
    '''
    filtered_text = [word for word in tokenized_text if word not in stopwords_list] 
    return filtered_text

In [51]:
data_subset["description_str_token"] = data_subset["description_str_token"].apply(remove_stopwords)
data_subset["title_token"] = data_subset["title_token"].apply(remove_stopwords)

In [52]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,title_token
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,katie catz makes a splash good sports,3,11,Children's Books,preschoolgrade when katies water phobia threat...,4,"[preschoolgrade, katies, water, phobia, threat...","[katie, catz, makes, splash, good, sports]"
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,andrew jackson,3,2,Biographies & Memoirs,this brief biography focuses more on the polit...,1,"[brief, biography, focuses, political, career,...","[andrew, jackson]"
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,the water dancers a novel,3,4,Literature & Fiction,racial and class conflicts simmer in this lack...,17,"[racial, class, conflicts, simmer, lackluster,...","[water, dancers, novel]"
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",born of a woman a bishop rethinks the birth of...,3,3,Christian Books & Bibles,spong an episcopal bishop and bestselling auth...,5,"[spong, episcopal, bishop, bestselling, author...","[born, woman, bishop, rethinks, birth, jesus]"
4,"[Books, History, World]",[Re-creates the world from the second to the f...,pagans and christians,3,3,History,recreates the world from the second to the fou...,13,"[recreates, world, second, fourth, century, ad...","[pagans, christians]"


In [53]:
! pip install autocorrect

Collecting autocorrect
  Downloading autocorrect-2.5.0.tar.gz (622 kB)
Building wheels for collected packages: autocorrect
  Building wheel for autocorrect (setup.py) ... done
  Created wheel for autocorrect: filename=autocorrect-2.5.0-py3-none-any.whl size=621854 sha256=40044dca5d965052160ea821def60d10048655c502b599724cd2e5ed25caa50a
  Stored in directory: /home/ec2-user/.cache/pip/wheels/a4/51/6c/f75116aae65b52be7ad1d57e47ad4e89ab818bf45d9093021f
Successfully built autocorrect
Installing collected packages: autocorrect
Successfully installed autocorrect-2.5.0
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [54]:
from autocorrect import Speller

In [55]:
spell = Speller(lang='en', fast = True)
def spelling_correct(tokenized_text):
    """
    This function takes a list of tokenized words from a sentence, spell-checks every word and returns the
    corrected words where applicable. Note that not every misspelled word will be identified.
    """
    corrected = [spell(word) for word in tokenized_text] 
    return corrected

In [56]:
data_subset["description_str_token"] = data_subset["description_str_token"].apply(spelling_correct)
data_subset["title_token"] = data_subset["title_token"].apply(spelling_correct)

In [57]:
# turn empty descriptions into NaN so that dropna() can remove them below
data_subset['description_str'] = data_subset['description_str'].replace('', np.nan)

In [58]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,title_token
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,katie catz makes a splash good sports,3,11,Children's Books,preschoolgrade when katies water phobia threat...,4,"[preschoolgrade, katie, water, phobia, threate...","[katie, cat, makes, splash, good, sports]"
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,andrew jackson,3,2,Biographies & Memoirs,this brief biography focuses more on the polit...,1,"[brief, biography, focuses, political, career,...","[andrew, jackson]"
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,the water dancers a novel,3,4,Literature & Fiction,racial and class conflicts simmer in this lack...,17,"[racial, class, conflicts, summer, lackluster,...","[water, dancers, novel]"
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",born of a woman a bishop rethinks the birth of...,3,3,Christian Books & Bibles,spong an episcopal bishop and bestselling auth...,5,"[song, episcopal, bishop, bestselling, author,...","[born, woman, bishop, rethink, birth, jesus]"
4,"[Books, History, World]",[Re-creates the world from the second to the f...,pagans and christians,3,3,History,recreates the world from the second to the fou...,13,"[recreates, world, second, fourth, century, ad...","[pagans, christians]"


In [59]:
# remove the rows which don't have data
data_subset = data_subset.dropna()

### Now that the data has been cleansed, we are ready to train a model

When we return a sentence in its vectorized format, we get an array of 50 values, because that is the size we have chosen. The vector is meant to capture the semantics of the sentence, which lets us compare two sentences to see how similar they are and, for this use case, train a classifier.
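
Comparing two sentence vectors is commonly done with cosine similarity. The following is a minimal standard-library sketch with made-up 3-dimensional vectors; the notebook's vectors have 50 dimensions, and `cosine_similarity` is a hypothetical helper rather than something the notebook defines.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 1.0 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
```

Values close to 1 indicate vectors pointing the same way, and values close to 0 indicate unrelated directions.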

In [60]:
# 50-dimensional vectors; `size` is the Gensim 3.x name (renamed to `vector_size` in Gensim 4.x)
model_gensim = FastText(size=50, window=5, min_count=1)

In [61]:
token_desc = data_subset["description_str_token"] + data_subset["title_token"]
token_desc.head()

0    [preschoolgrade, katie, water, phobia, threate...
1    [brief, biography, focuses, political, career,...
2    [racial, class, conflicts, summer, lackluster,...
3    [song, episcopal, bishop, bestselling, author,...
4    [recreates, world, second, fourth, century, ad...
dtype: object

In [62]:
model_gensim.build_vocab(sentences=token_desc)

In [63]:
model_gensim.train(sentences=token_desc, total_examples=len(token_desc), epochs=50) 

In [64]:
# save the trained model in Gensim's own format
model_gensim.save('books_gensim_model.bin')

In [65]:
description_str = data_subset["description_str"]

In [66]:
vector_description_str = model_gensim.wv[description_str]

In [67]:
# open question: what happens if wv is indexed with token_desc instead of the raw strings?
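
One way to explore that question: as far as I can tell, indexing `model_gensim.wv` with whole description strings treats each description as one long out-of-vocabulary word, so its vector comes from character n-grams rather than from the trained tokens. A common alternative, which this notebook does not use, is to average the vectors of the individual tokens. The sketch below uses a hypothetical lookup dict with 3-dimensional vectors standing in for `model_gensim.wv`.

```python
import numpy as np

def average_token_vectors(tokens, lookup, dim):
    """Mean of the vectors of the tokens found in lookup; zeros if none are found."""
    vectors = [lookup[t] for t in tokens if t in lookup]
    if not vectors:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(vectors, axis=0)

# Hypothetical 3-dimensional vectors standing in for model_gensim.wv
toy_vectors = {
    "pagans": np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "christians": np.array([3.0, 2.0, 0.0], dtype=np.float32),
}

doc_vector = average_token_vectors(["pagans", "christians"], toy_vectors, dim=3)
empty_vector = average_token_vectors([], toy_vectors, dim=3)
```

Applying this to `data_subset["description_str_token"]` with the real model would need adapting and is untested here.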

In [68]:
len(vector_description_str)

35856

In [147]:
data_subset["description_str_token"][4]

['recreates',
 'world',
 'second',
 'fourth',
 'century',
 'ad',
 'graecoroman',
 'gods',
 'lost',
 'dominion',
 'christianity',
 'conversion',
 'constantine',
 'triumphed',
 'mediterranean',
 'world']

In [148]:
description_str[4]

'recreates the world from the second to the fourth century ad when the graecoroman gods lost their dominion and christianity with the conversion of constantine triumphed in the mediterranean world'

In [149]:
vector_description_str[4]

array([[-0.01796775, -0.32192793,  0.4427467 ,  1.01180661, -0.34540969,
         0.19013038,  0.27239057,  0.34442747, -0.87200141,  0.39073905,
        -1.11305904,  0.15815021, -0.28595585, -0.67114389, -0.29362822,
         0.52312618, -0.48540148, -0.23456573,  0.16478145, -1.10297   ,
        -0.11184762, -0.36203459,  0.52202058, -0.24631335,  0.07348673,
         0.43191677, -0.02977397,  0.2740055 , -1.282076  , -0.1733315 ,
         0.46160921, -0.36126369,  0.36206457,  0.39822555,  0.03307921,
        -0.3936322 ,  0.82833076,  0.55542183, -0.37598661, -0.68967819,
         0.15875748, -0.17870745, -0.38433194, -0.41593724,  0.00131598,
        -0.13523583,  0.79293609,  0.46883106,  1.0133872 ,  0.27109554]])

In [72]:
vector_description_str = np.split(vector_description_str,len(vector_description_str))

In [73]:
vector_description_str[1].shape

(1, 50)

In [74]:
title_str = data_subset["title"]

In [75]:
vector_title_str = model_gensim.wv[title_str]

In [76]:
len(vector_title_str)

35856

In [77]:
vector_title_str.shape

(35856, 50)

In [78]:
vector_title_str = np.split(vector_title_str,len(vector_title_str))

In [79]:
vector_desc_title = np.concatenate((vector_title_str, vector_description_str), axis=1)

In [80]:
vector_title_str[0]

array([[-0.11389955, -0.17884767, -0.04533883,  0.7479725 , -0.6620094 ,
         0.2998545 ,  0.6683462 , -0.67709017, -0.00276513, -0.01161655,
        -0.43439627,  0.6503054 ,  0.12595864,  0.62316966,  0.43913954,
         0.10223692, -0.00877468,  0.5330663 ,  0.14419468, -0.4775471 ,
         0.3600916 , -0.35590523,  0.11081461, -0.9129674 ,  0.25066143,
        -0.29290995,  0.548772  , -0.15917143,  0.03620967, -0.02649642,
        -0.3584182 ,  0.10806427,  0.40827608,  0.20630829,  0.13886416,
         0.42032284,  0.2844103 ,  0.41912055,  0.10839075,  0.1459674 ,
         0.5091885 ,  0.12314051,  0.38874978,  0.2753711 ,  0.4628218 ,
         0.7065846 , -0.2159911 ,  0.534624  ,  0.6421197 , -0.01627837]],
      dtype=float32)

In [81]:
vector_description_str[0]

array([[ 0.21512936, -0.27270401,  0.09654656,  0.80286735, -0.43365952,
         0.12777027,  0.54810727,  0.01851196, -0.57441145,  0.13433439,
        -0.6212936 ,  0.37913698,  0.08904486,  0.19478741, -0.17652661,
         0.01280498, -0.16443644, -0.01446998,  0.01062257, -0.50620121,
         0.21986067, -0.29872182,  0.36526218, -0.09799331,  0.34898707,
        -0.04536806,  0.29148313,  0.3266516 , -0.39401016,  0.08060693,
        -0.10553125, -0.0111227 ,  0.28406581, -0.13413872,  0.25219336,
        -0.0949575 ,  0.73573917,  0.37018463,  0.36154485, -0.05152865,
         0.11773488,  0.09467947,  0.00430389, -0.17713252,  0.01171682,
         0.32209581,  0.62795264,  0.89702249, -0.02891457, -0.04578722]])

In [82]:
vector_desc_title[0]

array([[-0.11389955, -0.17884767, -0.04533883,  0.74797249, -0.66200942,
         0.29985449,  0.66834623, -0.67709017, -0.00276513, -0.01161655,
        -0.43439627,  0.65030539,  0.12595864,  0.62316966,  0.43913954,
         0.10223692, -0.00877468,  0.53306627,  0.14419468, -0.47754711,
         0.3600916 , -0.35590523,  0.11081461, -0.91296738,  0.25066143,
        -0.29290995,  0.54877198, -0.15917143,  0.03620967, -0.02649642,
        -0.3584182 ,  0.10806427,  0.40827608,  0.20630829,  0.13886416,
         0.42032284,  0.2844103 ,  0.41912055,  0.10839075,  0.14596739,
         0.50918847,  0.12314051,  0.38874978,  0.2753711 ,  0.46282181,
         0.70658457, -0.21599109,  0.53462398,  0.64211971, -0.01627837],
       [ 0.21512936, -0.27270401,  0.09654656,  0.80286735, -0.43365952,
         0.12777027,  0.54810727,  0.01851196, -0.57441145,  0.13433439,
        -0.6212936 ,  0.37913698,  0.08904486,  0.19478741, -0.17652661,
         0.01280498, -0.16443644, -0.01446998,  0.

In [83]:
vector_desc_title.shape

(35856, 2, 50)

We want to reshape the array to 2D, keeping the same number of rows and concatenating the title and description vectors of each row into a single 100-value row.

In [84]:
big_vector_title_descr = vector_desc_title.reshape(len(vector_title_str),100)

In [85]:
big_vector_title_descr.shape

(35856, 100)
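
The split, concatenate and reshape steps above end up placing each title vector next to its description vector. A minimal sketch on small random arrays, invented for illustration, suggests that `np.hstack` on the original 2D arrays gives the same result in one call.

```python
import numpy as np

rng = np.random.default_rng(0)
title = rng.normal(size=(4, 50)).astype(np.float32)   # stand-in for the title vectors
desc = rng.normal(size=(4, 50)).astype(np.float32)    # stand-in for the description vectors

# Route taken in the notebook: split into rows, stack to 3D, flatten to 2D
title_rows = np.split(title, len(title))
desc_rows = np.split(desc, len(desc))
stacked = np.concatenate((title_rows, desc_rows), axis=1)    # shape (4, 2, 50)
notebook_features = stacked.reshape(len(title_rows), 100)    # shape (4, 100)

# Direct route: put the two 2D arrays side by side
direct_features = np.hstack((title, desc))
```

Either way, columns 0 to 49 hold the title vector and columns 50 to 99 hold the description vector.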

In [86]:
data_subset.head()

Unnamed: 0,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,title_token
0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,katie catz makes a splash good sports,3,11,Children's Books,preschoolgrade when katies water phobia threat...,4,"[preschoolgrade, katie, water, phobia, threate...","[katie, cat, makes, splash, good, sports]"
1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,andrew jackson,3,2,Biographies & Memoirs,this brief biography focuses more on the polit...,1,"[brief, biography, focuses, political, career,...","[andrew, jackson]"
2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,the water dancers a novel,3,4,Literature & Fiction,racial and class conflicts simmer in this lack...,17,"[racial, class, conflicts, summer, lackluster,...","[water, dancers, novel]"
3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",born of a woman a bishop rethinks the birth of...,3,3,Christian Books & Bibles,spong an episcopal bishop and bestselling auth...,5,"[song, episcopal, bishop, bestselling, author,...","[born, woman, bishop, rethink, birth, jesus]"
4,"[Books, History, World]",[Re-creates the world from the second to the f...,pagans and christians,3,3,History,recreates the world from the second to the fou...,13,"[recreates, world, second, fourth, century, ad...","[pagans, christians]"


In [87]:
len(data_subset)

35856

In [88]:
df_big_vector_title_descr = pd.DataFrame(data=big_vector_title_descr)

In [89]:
df_big_vector_title_descr.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
0,-0.1139,-0.178848,-0.045339,0.747972,-0.662009,0.299854,0.668346,-0.67709,-0.002765,-0.011617,...,0.117735,0.094679,0.004304,-0.177133,0.011717,0.322096,0.627953,0.897022,-0.028915,-0.045787
1,-1.901171,0.966746,0.800008,1.095033,0.101101,0.258321,-0.849448,-0.399141,-0.033316,-1.279748,...,0.200965,-0.490044,0.204726,-0.386321,0.27571,0.116053,0.806538,0.755876,0.783512,0.026197
2,0.245467,1.100971,-0.25072,1.757939,-0.022062,0.596411,1.154969,0.210083,-1.066611,-0.287484,...,0.082629,-0.125232,-0.167305,-0.067108,0.1056,0.187292,0.31107,0.562679,0.393484,-0.067186
3,0.494056,-0.334348,0.303605,0.42771,-0.208441,0.157509,-0.659632,0.80217,-0.178103,-0.485869,...,0.207267,-0.34752,-0.006677,-0.645979,0.118257,-0.173144,1.493761,1.180336,0.389957,-0.333253
4,0.423756,-0.589404,1.486319,0.376148,-1.256931,0.041691,-0.721385,0.905591,-1.418775,1.202415,...,0.158757,-0.178707,-0.384332,-0.415937,0.001316,-0.135236,0.792936,0.468831,1.013387,0.271096


The index of data_subset no longer lines up with the fresh 0-based index of the new DataFrame, so we reset it before concatenating the two column-wise.

In [90]:
data_subset = data_subset.reset_index()
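
The reason this matters is that pd.concat with axis=1 aligns rows by index label, not by position. A minimal sketch with toy frames (all names and values here are illustrative):

```python
import pandas as pd

# A filtered frame keeps its old index labels ...
left = pd.DataFrame({"title": ["a", "b", "c"]}, index=[10, 11, 12])
# ... while a frame built from a NumPy array gets a fresh 0..n-1 index
right = pd.DataFrame({"v0": [0.1, 0.2, 0.3]})

# Labels do not match, so the outer join produces 6 rows padded with NaN
misaligned = pd.concat([left, right], axis=1)

# After resetting the index, rows line up one to one
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape)  # (6, 2)
print(aligned.shape)     # (3, 2)
```

The notebook calls reset_index() without drop=True, which keeps the old labels in a column named index; that column is used later for the per-class counts.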

In [91]:
data_subset.head()

Unnamed: 0,index,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,title_token
0,0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,katie catz makes a splash good sports,3,11,Children's Books,preschoolgrade when katies water phobia threat...,4,"[preschoolgrade, katie, water, phobia, threate...","[katie, cat, makes, splash, good, sports]"
1,1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,andrew jackson,3,2,Biographies & Memoirs,this brief biography focuses more on the polit...,1,"[brief, biography, focuses, political, career,...","[andrew, jackson]"
2,2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,the water dancers a novel,3,4,Literature & Fiction,racial and class conflicts simmer in this lack...,17,"[racial, class, conflicts, summer, lackluster,...","[water, dancers, novel]"
3,3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",born of a woman a bishop rethinks the birth of...,3,3,Christian Books & Bibles,spong an episcopal bishop and bestselling auth...,5,"[song, episcopal, bishop, bestselling, author,...","[born, woman, bishop, rethink, birth, jesus]"
4,4,"[Books, History, World]",[Re-creates the world from the second to the f...,pagans and christians,3,3,History,recreates the world from the second to the fou...,13,"[recreates, world, second, fourth, century, ad...","[pagans, christians]"


In [92]:
data_subset_2 = pd.concat([data_subset, df_big_vector_title_descr], axis=1)

In [93]:
data_subset_2.head()

Unnamed: 0,index,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,...,90,91,92,93,94,95,96,97,98,99
0,0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,katie catz makes a splash good sports,3,11,Children's Books,preschoolgrade when katies water phobia threat...,4,"[preschoolgrade, katie, water, phobia, threate...",...,0.117735,0.094679,0.004304,-0.177133,0.011717,0.322096,0.627953,0.897022,-0.028915,-0.045787
1,1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,andrew jackson,3,2,Biographies & Memoirs,this brief biography focuses more on the polit...,1,"[brief, biography, focuses, political, career,...",...,0.200965,-0.490044,0.204726,-0.386321,0.27571,0.116053,0.806538,0.755876,0.783512,0.026197
2,2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,the water dancers a novel,3,4,Literature & Fiction,racial and class conflicts simmer in this lack...,17,"[racial, class, conflicts, summer, lackluster,...",...,0.082629,-0.125232,-0.167305,-0.067108,0.1056,0.187292,0.31107,0.562679,0.393484,-0.067186
3,3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",born of a woman a bishop rethinks the birth of...,3,3,Christian Books & Bibles,spong an episcopal bishop and bestselling auth...,5,"[song, episcopal, bishop, bestselling, author,...",...,0.207267,-0.34752,-0.006677,-0.645979,0.118257,-0.173144,1.493761,1.180336,0.389957,-0.333253
4,4,"[Books, History, World]",[Re-creates the world from the second to the f...,pagans and christians,3,3,History,recreates the world from the second to the fou...,13,"[recreates, world, second, fourth, century, ad...",...,0.158757,-0.178707,-0.384332,-0.415937,0.001316,-0.135236,0.792936,0.468831,1.013387,0.271096


### Check the count of each class to look for class imbalance

With the open-source version of XGBoost, we can supply per-sample weights as a vector at training time, which helps make the model less biased towards the majority classes.

In [94]:
data_subset_2['cat_x2_code'].unique()

array([ 4,  1, 17,  5, 13,  2, 20, 12,  9, 26, 18, 14, 25, 30, 11, 28, 29,
       19,  8,  0, 24, 22, 32,  6, 21, 27, 23,  7, 16, 10, 15, 31,  3],
      dtype=int8)

In [95]:
data_subset_2_cat_x2_agg = data_subset_2.groupby(by=['cat_x2_code']).count()['index']
print(data_subset_2_cat_x2_agg)

cat_x2_code
0     1304
1     1707
2     1655
3        8
4     6299
5      803
6      133
7      327
8      949
9      720
10     309
11    1048
12     705
13    1421
14     706
15      50
16      41
17    6230
18     307
19    1566
20    1049
21     164
22    1150
23     921
24     673
25     475
26     980
27     548
28     519
29     517
30    2174
31      12
32     386
Name: index, dtype: int64


Get the data into the format fastText expects too.

In [96]:
data_subset_2["fastText_label"] = '__label__' + data_subset["cat_x2_code"].astype(str) 

Our data is now in a format we like, but for training we only need a few of these columns.

In [97]:
data_subset_2.head()

Unnamed: 0,index,category,description,title,cnt_cats,cnt_desc,cat_x2,description_str,cat_x2_code,description_str_token,...,91,92,93,94,95,96,97,98,99,fastText_label
0,0,"[Books, Children's Books, Growing Up &amp; Fac...",[PreSchool-Grade 2-When Katie's water phobia t...,katie catz makes a splash good sports,3,11,Children's Books,preschoolgrade when katies water phobia threat...,4,"[preschoolgrade, katie, water, phobia, threate...",...,0.094679,0.004304,-0.177133,0.011717,0.322096,0.627953,0.897022,-0.028915,-0.045787,__label__4
1,1,"[Books, Biographies & Memoirs, Historical]",[This brief biography focuses more on the poli...,andrew jackson,3,2,Biographies & Memoirs,this brief biography focuses more on the polit...,1,"[brief, biography, focuses, political, career,...",...,-0.490044,0.204726,-0.386321,0.27571,0.116053,0.806538,0.755876,0.783512,0.026197,__label__1
2,2,"[Books, Literature &amp; Fiction, Genre Fiction]",[Racial and class conflicts simmer in this lac...,the water dancers a novel,3,4,Literature & Fiction,racial and class conflicts simmer in this lack...,17,"[racial, class, conflicts, summer, lackluster,...",...,-0.125232,-0.167305,-0.067108,0.1056,0.187292,0.31107,0.562679,0.393484,-0.067186,__label__17
3,3,"[Books, Christian Books &amp; Bibles, Catholic...","[Spong, an Episcopal bishop and best-selling a...",born of a woman a bishop rethinks the birth of...,3,3,Christian Books & Bibles,spong an episcopal bishop and bestselling auth...,5,"[song, episcopal, bishop, bestselling, author,...",...,-0.34752,-0.006677,-0.645979,0.118257,-0.173144,1.493761,1.180336,0.389957,-0.333253,__label__5
4,4,"[Books, History, World]",[Re-creates the world from the second to the f...,pagans and christians,3,3,History,recreates the world from the second to the fou...,13,"[recreates, world, second, fourth, century, ad...",...,-0.178707,-0.384332,-0.415937,0.001316,-0.135236,0.792936,0.468831,1.013387,0.271096,__label__13


It might be cleaner to select the columns we need rather than drop so many. Let's look at the head.

In [98]:
#create a new dataframe before saving the data as CSV
df_gensim_xgb_sampleweight = data_subset_2.drop(columns=['index','category','description','title','cnt_cats','cnt_desc','cat_x2','description_str','description_str_token','title_token','fastText_label'])
df_fasttext = data_subset_2[['fastText_label','description_str_token', 'title_token']]

In [99]:
df_fasttext['token_sentence'] = df_fasttext['description_str_token'] + df_fasttext['title_token']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


In [100]:
df_fasttext['untoken'] = [' '.join(map(str, l)) for l in df_fasttext['token_sentence']]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


In [101]:
df_fasttext['full'] = df_fasttext['fastText_label'] + ' ' + df_fasttext['untoken'] 

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
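
The warnings above appear because df_fasttext was created as a column selection of another DataFrame and then modified. One common way to avoid them is to take an explicit copy before adding columns. The sketch below uses a toy frame as a stand-in for data_subset_2; the values are illustrative.

```python
import pandas as pd

# Toy stand-in for data_subset_2 (illustrative values only)
frame = pd.DataFrame({
    "fastText_label": ["__label__4", "__label__1"],
    "description_str_token": [["water", "phobia"], ["brief", "biography"]],
    "title_token": [["katie", "splash"], ["andrew", "jackson"]],
    "unused": [0, 1],
})

# .copy() makes the subset an independent DataFrame, so the column
# assignments below do not trigger SettingWithCopyWarning
subset = frame[["fastText_label", "description_str_token", "title_token"]].copy()

# Same three steps as the cells above: join token lists, untokenize, prepend label
subset["token_sentence"] = subset["description_str_token"] + subset["title_token"]
subset["untoken"] = subset["token_sentence"].apply(" ".join)
subset["full"] = subset["fastText_label"] + " " + subset["untoken"]

print(subset["full"].tolist())
# ['__label__4 water phobia katie splash', '__label__1 brief biography andrew jackson']
```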


In [102]:
df_fasttext.head()

Unnamed: 0,fastText_label,description_str_token,title_token,token_sentence,untoken,full
0,__label__4,"[preschoolgrade, katie, water, phobia, threate...","[katie, cat, makes, splash, good, sports]","[preschoolgrade, katie, water, phobia, threate...",preschoolgrade katie water phobia threatens ex...,__label__4 preschoolgrade katie water phobia t...
1,__label__1,"[brief, biography, focuses, political, career,...","[andrew, jackson]","[brief, biography, focuses, political, career,...",brief biography focuses political career andre...,__label__1 brief biography focuses political c...
2,__label__17,"[racial, class, conflicts, summer, lackluster,...","[water, dancers, novel]","[racial, class, conflicts, summer, lackluster,...",racial class conflicts summer lackluster first...,__label__17 racial class conflicts summer lack...
3,__label__5,"[song, episcopal, bishop, bestselling, author,...","[born, woman, bishop, rethink, birth, jesus]","[song, episcopal, bishop, bestselling, author,...",song episcopal bishop bestselling author rescu...,__label__5 song episcopal bishop bestselling a...
4,__label__13,"[recreates, world, second, fourth, century, ad...","[pagans, christians]","[recreates, world, second, fourth, century, ad...",recreates world second fourth century ad graec...,__label__13 recreates world second fourth cent...


In [103]:
df_gensim_xgb_sampleweight.head()

Unnamed: 0,cat_x2_code,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,4,-0.1139,-0.178848,-0.045339,0.747972,-0.662009,0.299854,0.668346,-0.67709,-0.002765,...,0.117735,0.094679,0.004304,-0.177133,0.011717,0.322096,0.627953,0.897022,-0.028915,-0.045787
1,1,-1.901171,0.966746,0.800008,1.095033,0.101101,0.258321,-0.849448,-0.399141,-0.033316,...,0.200965,-0.490044,0.204726,-0.386321,0.27571,0.116053,0.806538,0.755876,0.783512,0.026197
2,17,0.245467,1.100971,-0.25072,1.757939,-0.022062,0.596411,1.154969,0.210083,-1.066611,...,0.082629,-0.125232,-0.167305,-0.067108,0.1056,0.187292,0.31107,0.562679,0.393484,-0.067186
3,5,0.494056,-0.334348,0.303605,0.42771,-0.208441,0.157509,-0.659632,0.80217,-0.178103,...,0.207267,-0.34752,-0.006677,-0.645979,0.118257,-0.173144,1.493761,1.180336,0.389957,-0.333253
4,13,0.423756,-0.589404,1.486319,0.376148,-1.256931,0.041691,-0.721385,0.905591,-1.418775,...,0.158757,-0.178707,-0.384332,-0.415937,0.001316,-0.135236,0.792936,0.468831,1.013387,0.271096


### For this version of XGBoost, we supply the features, the labels and, optionally, sample weights, which can help model performance on an imbalanced dataset

In [104]:
X = df_gensim_xgb_sampleweight.drop(['cat_x2_code'], axis=1).values
y = df_gensim_xgb_sampleweight['cat_x2_code'].values


In [105]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
yX_train = np.column_stack((y_train, X_train))
yX_test = np.column_stack((y_test, X_test))
np.savetxt("book_gensim_train_v1.csv", yX_train, delimiter=",", fmt='%0.3f')
np.savetxt("book_gensim_test_v1.csv", yX_test, delimiter=",", fmt='%0.3f')
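
It is worth seeing what these files look like on disk. The sketch below writes a two-row toy array to an in-memory buffer with the same column_stack and savetxt settings (the values are illustrative). The label ends up in the first column with no header row, which is the CSV layout the SageMaker built-in XGBoost algorithm expects, and fmt='%0.3f' rounds every value, labels included, to three decimals.

```python
import io
import numpy as np

y_demo = np.array([4, 17])
X_demo = np.array([[-0.11389955, 0.74797249],
                   [0.245467, 1.100971]])

# Label becomes column 0, followed by the feature columns
yX_demo = np.column_stack((y_demo, X_demo))

buffer = io.StringIO()
np.savetxt(buffer, yX_demo, delimiter=",", fmt="%0.3f")
csv_text = buffer.getvalue()

print(csv_text)
# 4.000,-0.114,0.748
# 17.000,0.245,1.101
```

Rounding to three decimals keeps the files small at the cost of a little precision in the embeddings; a wider format such as '%0.6f' is an option if that matters.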

In [106]:
print(y_test.shape)

(11833,)


In [107]:
# Upload the dataset to an S3 bucket
input_train = sagemaker_session.upload_data(path='book_gensim_train_v1.csv', key_prefix='%s/data' % prefix_gensim)
input_validation = sagemaker_session.upload_data(path='book_gensim_test_v1.csv', key_prefix='%s/data' % prefix_gensim)

In [108]:
#from sagemaker.inputs import TrainingInput

train_data = sagemaker.inputs.TrainingInput(s3_data=input_train,content_type="csv")
validation_data = sagemaker.inputs.TrainingInput(s3_data=input_validation,content_type="csv")

In our training script, we have a parser that is expecting the hyper-parameters below.

In [109]:
hyperparams = {
        "n_estimators": "300", 
        "n_jobs":"4",
        "max_depth":"10",
#        "min_child_weight": "6",
        "learning_rate": "0.1", 
        "objective":'multi:softmax', 
#        "reg_alpha": "10",
        "gamma": "4"
}

instance_type = "ml.m5.2xlarge"
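
The training script itself is not shown in this notebook. In script mode, SageMaker passes the hyperparameters to the entry point as command-line arguments of the form --name value, so a parser along the following lines would accept the dictionary above. This is a hypothetical sketch: the function name and the defaults are assumptions, not the contents of the actual train.py.

```python
import argparse


def parse_hyperparameters(argv=None):
    """Hypothetical parser for the hyperparameters defined above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--n_jobs", type=int, default=1)
    parser.add_argument("--max_depth", type=int, default=6)
    parser.add_argument("--learning_rate", type=float, default=0.3)
    parser.add_argument("--objective", type=str, default="multi:softmax")
    parser.add_argument("--gamma", type=float, default=0.0)
    # parse_known_args ignores any extra arguments the container may add
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulate the arguments the container would pass for the dictionary above
args = parse_hyperparameters(
    ["--n_estimators", "300", "--max_depth", "10",
     "--learning_rate", "0.1", "--gamma", "4"]
)
print(args.n_estimators, args.max_depth, args.learning_rate, args.gamma)
# 300 10 0.1 4.0
```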

Below is our estimator. It uses the SageMaker XGBoost framework container with our own training script, which runs open-source XGBoost rather than the SageMaker built-in algorithm.

In [110]:
# updated XGBoost to XGBClassifier https://sagemaker.readthedocs.io/en/stable/frameworks/xgboost/using_xgboost.html#train-a-model-with-open-source-xgboost
from sagemaker import get_execution_role
from sagemaker.xgboost.estimator import XGBoost

role = get_execution_role()

xgb_estimator = XGBoost(
    entry_point="train.py",
    hyperparameters=hyperparams,
    role=role,
    instance_count=1,
    instance_type='ml.m5.4xlarge',
    framework_version="1.2-1",
    eval_metric="merror",
)

In [111]:
xgb_estimator.fit({'train': train_data, 'validation': validation_data })

2021-08-25 08:06:02 Starting - Starting the training job...
2021-08-25 08:06:25 Starting - Launching requested ML instancesProfilerReport-1629878761: InProgress
...
2021-08-25 08:06:53 Starting - Preparing the instances for training......
2021-08-25 08:07:54 Downloading - Downloading input data...
2021-08-25 08:08:25 Training - Downloading the training image..
[2021-08-25 08:08:40.380 ip-10-0-241-122.eu-west-1.compute.internal:1 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Invoking user training script.
INFO:sagemaker-containers:Module train does not provide a setup.py.
Generating setup.py
INFO:sagemaker-containers:Generating setup.cfg
INFO:sagemaker-containers:Generating MANIFEST.in
INFO:sagemaker-container

In [112]:
xgb_predictor_gensim = xgb_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.2xlarge"
)

-------------!

In [113]:
print(xgb_predictor_gensim)

<sagemaker.xgboost.model.XGBoostPredictor object at 0x7f4d6123e400>


In [114]:
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import NumpyDeserializer
csv_serializer = CSVSerializer()
np_deserializer = NumpyDeserializer()

xgb_predictor_gensim.serializer = csv_serializer
xgb_predictor_gensim.deserializer = np_deserializer



In [115]:
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

predictions_test_xgb_weighted = [ float(xgb_predictor_gensim.predict(x)) for x in X_test]  
score = f1_score(y_test,predictions_test_xgb_weighted,labels=np.unique(y),average='micro')

print('F1 Score(micro): %.1f' % (score * 100.0))

F1 Score(micro): 60.1
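
A note on the metric: for single-label multiclass problems, micro-averaged F1 over all classes equals plain accuracy, so it is dominated by the large classes. Macro-averaged F1 weights every class equally and is more sensitive to the rare ones. The toy labels below are illustrative, not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: class 0 dominates, classes 1 and 2 are rare
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])
# A model that always predicts the majority class
y_pred = np.zeros_like(y_true)

micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)

print(f"micro F1: {micro:.3f}  accuracy: {accuracy:.3f}  macro F1: {macro:.3f}")
```

Reporting macro F1 (or the per-class output of classification_report, which is already imported above) alongside the micro score would show how the small categories fare.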


In [116]:
# xgb_predictor_gensim.delete_endpoint()

### In the next steps, we will use the built-in XGBoost algorithm without class weights and see how the results differ.

If we use the XGBClassifier, we need to split our training data into three arrays of the same length: X = features, y = labels and W = weights.

We also need to create a map from each class to its weight.
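
One possible way to build that map is the common "balanced" heuristic, where each class gets the weight n_samples / (n_classes * class_count). The sketch below uses toy labels and illustrative names; it is an assumption about how the weights could be derived, not the weighting used elsewhere in this notebook.

```python
import numpy as np

# Toy labels with an obvious imbalance (illustrative values)
y_demo = np.array([4, 4, 4, 4, 17, 17, 3])

classes, counts = np.unique(y_demo, return_counts=True)

# Map each class to its "balanced" weight: rare classes get larger weights
weights = len(y_demo) / (len(classes) * counts)
class_weight = dict(zip(classes.tolist(), weights.tolist()))

# W: one weight per training row, same length as y
w_demo = np.array([class_weight[label] for label in y_demo.tolist()])

print(class_weight)
print(w_demo.shape)
```

The resulting vector can be passed as sample_weight to XGBClassifier.fit. scikit-learn's compute_sample_weight("balanced", y) implements the same heuristic.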

In [117]:
import boto3
container_uri = sagemaker.image_uris.retrieve('xgboost', boto3.Session().region_name, version='1.0-1')

# Create the estimator
xgb_bi = sagemaker.estimator.Estimator(container_uri,
                                    role, 
                                    instance_count=1, 
                                    instance_type='ml.m4.4xlarge',
                                    output_path='s3://{}/{}/output'.format(bucket, prefix_gensim),
                                    sagemaker_session=sagemaker_session)
# Set the hyperparameters
xgb_bi.set_hyperparameters(eta=0.1,
                        max_depth=10,
                        gamma=4,
                        num_class=len(np.unique(y)),
                        alpha=10,
                        min_child_weight=6,
                        silent=0,
                        objective='multi:softmax',
                        num_round=300)

In [118]:
xgb_bi.fit({'train': train_data, 'validation': validation_data })

2021-08-25 08:28:23 Starting - Starting the training job...
2021-08-25 08:28:46 Starting - Launching requested ML instancesProfilerReport-1629880103: InProgress
......
2021-08-25 08:29:46 Starting - Preparing the instances for training......
2021-08-25 08:30:54 Downloading - Downloading input data
2021-08-25 08:30:54 Training - Downloading the training image...
2021-08-25 08:31:15 Training - Training image download completed. Training in progress..
INFO:sagemaker-containers:Imported framework sagemaker_xgboost_container.training
INFO:sagemaker-containers:Failed to parse hyperparameter objective value multi:softmax to Json.
Returning the value itself
INFO:sagemaker-containers:No GPUs detected (normal if no gpus installed)
INFO:sagemaker_xgboost_container.training:Running XGBoost Sagemaker in algorithm mode
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Determined delimiter of CSV input is ','
INFO:root:Dete

# We trained our model and now want to test out the predictions

In [119]:
xgb_predictor = xgb_bi.deploy(
    initial_instance_count=1, 
    instance_type='ml.m4.xlarge'
)

-------------!

In [120]:
print(xgb_predictor)

<sagemaker.predictor.Predictor object at 0x7f4d611ade48>


In [121]:
xgb_predictor.serializer = csv_serializer

predictions_test = [ float(xgb_predictor.predict(x).decode('utf-8')) for x in X_test] 
score = f1_score(y_test,predictions_test,labels=np.unique(y),average='micro')

print('F1 Score(micro): %.1f' % (score * 100.0))

F1 Score(micro): 56.8


All done, you can delete your endpoint

In [122]:
#xgb_predictor.delete_endpoint()

# Next we will test out the native FastText supervised text classification

In this step, we want to see whether the native FastText algorithm can do the same job with less work.
With native FastText, you do not need to tokenize your sentences yourself, and you do not have to pick a vector size for model training (the dim parameter has a default).
The algorithm does this work for you behind the scenes.
What we do need to do is get the data into the required format: prefix each label with the string "__label__", then concatenate it with the description and title into one field and present that to the algorithm.



In [123]:
df_fasttext.head()

Unnamed: 0,fastText_label,description_str_token,title_token,token_sentence,untoken,full
0,__label__4,"[preschoolgrade, katie, water, phobia, threate...","[katie, cat, makes, splash, good, sports]","[preschoolgrade, katie, water, phobia, threate...",preschoolgrade katie water phobia threatens ex...,__label__4 preschoolgrade katie water phobia t...
1,__label__1,"[brief, biography, focuses, political, career,...","[andrew, jackson]","[brief, biography, focuses, political, career,...",brief biography focuses political career andre...,__label__1 brief biography focuses political c...
2,__label__17,"[racial, class, conflicts, summer, lackluster,...","[water, dancers, novel]","[racial, class, conflicts, summer, lackluster,...",racial class conflicts summer lackluster first...,__label__17 racial class conflicts summer lack...
3,__label__5,"[song, episcopal, bishop, bestselling, author,...","[born, woman, bishop, rethink, birth, jesus]","[song, episcopal, bishop, bestselling, author,...",song episcopal bishop bestselling author rescu...,__label__5 song episcopal bishop bestselling a...
4,__label__13,"[recreates, world, second, fourth, century, ad...","[pagans, christians]","[recreates, world, second, fourth, century, ad...",recreates world second fourth century ad graec...,__label__13 recreates world second fourth cent...


Later we take the same row index as in our test example above to see whether the fastText algorithm makes the same prediction.

In [124]:
! pip install fasttext==0.9.1

Collecting fasttext==0.9.1
  Downloading fasttext-0.9.1.tar.gz (57 kB)
Collecting pybind11>=2.2
  Using cached pybind11-2.7.1-py2.py3-none-any.whl (200 kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.1-cp36-cp36m-linux_x86_64.whl size=2161606 sha256=4e2e9aba61ecebec602f468e526d1767e0cdb543957fe6271969b34e91ef26c9
  Stored in directory: /home/ec2-user/.cache/pip/wheels/ae/e8/a0/03628c77c2e0aa813f067f6d7708a4579d15abf6f45e8716c5
Successfully built fasttext
Installing collected packages: pybind11, fasttext
Successfully installed fasttext-0.9.1 pybind11-2.7.1
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/python3/bin/python -m pip install --upgrade pip' command.


In [125]:
import fasttext

In [126]:
fasttext_dataset = df_fasttext['full']

In [127]:
from sklearn.model_selection import train_test_split

train_fasttext_native, val_fasttext_native = train_test_split(fasttext_dataset, test_size=0.33, random_state=42)

train_file_name = 'train_books_fasttext_native.csv'
valid_file_name = 'valid_books_fasttext_native.csv'
train_fasttext_native.to_csv(train_file_name, index=False, header=False)
val_fasttext_native.to_csv(valid_file_name, index=False, header=False)
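
fastText reads plain text with one example per line. Writing through to_csv works here because the cleaned tokens contain no commas or quotes, but if a line did contain them, the CSV writer would wrap it in quotes. A minimal alternative sketch that writes the lines directly (the helper name, file name and sample lines are illustrative):

```python
import os
import tempfile

lines = [
    "__label__1 brief biography focuses political career andrew jackson",
    # comma and quotes included on purpose to show they are written unchanged
    '__label__17 racial class conflicts, "water dancers" novel',
]


def write_fasttext_file(path, examples):
    """Write one example per line, with no CSV quoting and no header."""
    with open(path, "w", encoding="utf-8") as handle:
        for example in examples:
            # Guard against embedded newlines, which would split an example
            handle.write(example.replace("\n", " ") + "\n")


path = os.path.join(tempfile.mkdtemp(), "train_demo.txt")
write_fasttext_file(path, lines)

with open(path, encoding="utf-8") as handle:
    round_trip = handle.read().splitlines()

print(round_trip == lines)  # True
```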

In [128]:
model_native = fasttext.train_supervised(input=train_file_name, lr=0.1, epoch=50)

In [129]:
modelwordGram = fasttext.train_supervised(input=train_file_name, lr=0.1, epoch=50, wordNgrams=2)

### We run a simple test with the validation data, which returns the precision and recall; we can tune the hyperparameters to improve these

In [130]:
FastText_Precision_Recall = model_native.test(valid_file_name, k=1)
print(FastText_Precision_Recall)

(11833, 0.6220738612355278, 0.6220738612355278)


In [131]:
# Named fasttext_f1 so it does not shadow sklearn's f1_score imported earlier
fasttext_f1 = 2*((FastText_Precision_Recall[1]*FastText_Precision_Recall[2])/(FastText_Precision_Recall[1]+FastText_Precision_Recall[2]))
print('F1 Score(micro): %.1f' % (fasttext_f1 * 100.0))

F1 Score(micro): 62.2
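
Precision and recall come out identical above because every example has exactly one true label and we ask for exactly one prediction (k=1): both reduce to the fraction of correct top-1 predictions, and the harmonic mean of two equal numbers is that same number. A small pure-Python sketch with illustrative labels:

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Precision@1 and recall@1 when each example has exactly one true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    precision = correct / len(predicted_labels)  # one prediction per example
    recall = correct / len(true_labels)          # one true label per example
    return precision, recall


true_labels = ["__label__1", "__label__4", "__label__17", "__label__4"]
predicted = ["__label__1", "__label__4", "__label__4", "__label__4"]

p, r = precision_recall_at_1(true_labels, predicted)
f1 = 2 * p * r / (p + r)

print(p, r, f1)  # 0.75 0.75 0.75
```

With k greater than 1, or with multi-label data, the two values would generally differ.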


In [132]:
df_valid_ft= pd.read_csv(valid_file_name)
df_valid_ft.head()

Unnamed: 0,__label__1 new york times fashion critic horn teamed quintessential american designer class write memoir finished weeks death june year nonlinear formatblass skips telling prize designing gingham dress patent leather belt fashion show fort wayne ind back role serving armed forces wwiithe book feel scrapbook memories indeed delightful one considers colorful life class led originally midwest moved new york age eventually became one fashions biggest names written first person peppered snapshots class pat buckle nancy kissinger nancy reagan gloria vanderbilt others blasts memoir tribute designer writes typical american success storycopyright reed business information inc name class signifies bestmade clothes america appears many products including mens wear bed lines blue jeans bill class limited founded seventh avenue business continues today lifetime class recipient numerous industry public service awards trustee new york public library began longawaited memoir bare class completed shortly death june bare class
0,__label__1 supermodel dickinson waste time sug...
1,__label__17 poet childrens book author former ...
2,__label__10 leaders want light entrepreneurial...
3,__label__2 isnt much know maintaining positive...
4,__label__4 grade scored hits five dont know mu...


In [133]:
fasttext_sample_validation = data_subset_2['description_str'] + data_subset_2['title']
fasttext_sample_validation.head()

0    preschoolgrade when katies water phobia threat...
1    this brief biography focuses more on the polit...
2    racial and class conflicts simmer in this lack...
3    spong an episcopal bishop and bestselling auth...
4    recreates the world from the second to the fou...
dtype: object

## Test the prediction versus what we got with the XGB classifier

In [134]:
model_native.predict(fasttext_sample_validation[1], k=1)

(('__label__1',), array([0.69193423]))

In [135]:
modelwordGram.predict(fasttext_sample_validation[1], k=1)

(('__label__1',), array([0.61193895]))

We can host our model on SageMaker. The BlazingText built-in algorithm is compatible with fastText models, so we can upload the fastText model to S3, point a SageMaker endpoint configuration at this model, and then deploy our endpoint.

In [136]:
model_filename = "books_fasttext_native.bin"
model_native.save_model(model_filename)

In [137]:
from time import gmtime, strftime


In [138]:
!tar -czvf model.tar.gz books_fasttext_native.bin
model_location = sagemaker_session.upload_data("model.tar.gz", bucket=bucket, key_prefix=f"fasttext/model-{strftime('%Y-%m-%d-%H-%M-%S', gmtime())}/output")
# Remove the local archive and model file now that they are uploaded
!rm model.tar.gz books_fasttext_native.bin

books_fasttext_native.bin


In [139]:
container = sagemaker.image_uris.retrieve("blazingtext",boto3.Session().region_name,  "1")
print('Using SageMaker BlazingText container: {} ({})'.format(container, boto3.Session().region_name))

Using SageMaker BlazingText container: 685385470294.dkr.ecr.eu-west-1.amazonaws.com/blazingtext:1 (eu-west-1)


# Deploy endpoint in SageMaker

BlazingText is compatible with fastText models, so you can train the fastText model wherever you want, push it to S3 in the required format (a .tar.gz archive), and then deploy it on SageMaker, which takes care of the heavy lifting.

In [140]:
#use blazing text container and the fasttext model
model_fastText_book = sagemaker.Model(
    model_data=model_location, 
    image_uri=container, 
    role=role, 
    sagemaker_session=sagemaker_session)

# Deploy the model to a real-time endpoint

model_fastText_book.deploy(
    initial_instance_count = 1,
    instance_type = 'ml.m4.xlarge')

from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

predictor = sagemaker.Predictor(
    endpoint_name=model_fastText_book.endpoint_name,
    sagemaker_session=sagemaker_session,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)


-------------!

In [141]:
fasttext_sample_validation[1]

'this brief biography focuses more on the political career of andrew jackson than on his military heroism at the battle of new orleans in the war of  it nevertheless provides an overview of the martial events that made jacksons rise to the presidency possible robert remini is widely touted as one of the great historians of the jacksonian era and andrew jackson is his most accessible book on the periods most intriguing figure the best biography of andrew jackson available it summarizes adequately the best of the old scholarship while at the same time branching off to offer significant new interpretations of crucial points  library journalin this concise and wellwritten biography robert v remini has a more ambitious objective than merely recounting the life of a famous manhe portrays the president not as a symbol of the age nor a personification of proletarian striving but as a shrewd and able politician a pioneer in using the office of the presidency for both national and narrowly parti

In [142]:
sentence = [ fasttext_sample_validation[1] ]
payload = {"instances": sentence }

In [143]:
predictions = predictor.predict(payload)
print(predictions)

[{'label': ['__label__1'], 'prob': [0.6919344067573547]}]
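
To turn this response back into something readable, we can strip the "__label__" prefix and look the code up in a code-to-category map. The sketch below hard-codes the response shown above and a partial map taken from the first rows of the DataFrame preview; the helper name is illustrative, and a full map would come from the category encoding used earlier.

```python
# Response copied from the endpoint output above
response = [{"label": ["__label__1"], "prob": [0.6919344067573547]}]

# Partial map, read off the cat_x2 / cat_x2_code columns in the preview rows
code_to_category = {
    1: "Biographies & Memoirs",
    4: "Children's Books",
    5: "Christian Books & Bibles",
    13: "History",
    17: "Literature & Fiction",
}


def decode_prediction(item, mapping, prefix="__label__"):
    """Return (code, category name, probability) for the top prediction."""
    code = int(item["label"][0][len(prefix):])
    return code, mapping.get(code, "unknown"), item["prob"][0]


code, category, prob = decode_prediction(response[0], code_to_category)
print(code, category, round(prob, 3))  # 1 Biographies & Memoirs 0.692
```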


# Clean up, delete endpoint

In [144]:
#predictor.delete_endpoint()