# Notes

Check whether stemming & lemmatization make a difference

Do we need bigram/trigram frequencies (within titles)?

Do we need a dataframe including the cleaned data & genres? (Which cleaned data?) -> maybe create a dictionary of lists
https://www.geeksforgeeks.org/python-ways-to-create-a-dictionary-of-lists/
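
One possible shape for such a dictionary of lists, sketched with two titles from the data imported below. The name `cleaned` and its keys are illustrative assumptions, not something the project has settled on:

```python
# Hypothetical layout: one key per variant of the data, one list entry per title.
cleaned = {
    "title": ["Mom's Family Wall Calendar 2016", "Doug the Pug 2016 Wall Calendar"],
    "tokens": [
        ["moms", "family", "wall", "calendar", "2016"],
        ["doug", "the", "pug", "2016", "wall", "calendar"],
    ],
    "genre": ["Calendars", "Calendars"],
}

# All lists must have the same length so that index i always refers to the same title.
assert len({len(values) for values in cleaned.values()}) == 1

# Look up every variant of the second title.
second = {key: values[1] for key, values in cleaned.items()}
print(second["tokens"])
```

A dictionary in this shape can also be passed to `pd.DataFrame(cleaned)` to get one column per key, which would cover the dataframe option as well.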

How do we incorporate POS tagging & NER in the classifier?

# Importing & Cleaning data

This section imports the data into a pandas dataframe and goes through the following preprocessing steps:

- Case collapsing
- Remove punctuation
- Tokenization
- N-grams: bigrams and trigrams
- Stemming -> check whether this makes a difference
- Lemmatization -> check whether this makes a difference
- Part-of-speech (POS) tagging
- Named entity recognition (NER)

## Import Data

- Import the data as a pandas dataframe
- Create lists of the title and genre columns from the original data
- List all 32 genres

In [1]:
import pandas as pd
df = pd.read_csv(r'/Users/feliciaheilgendorff/Documents/AU/NLP/NLP-Project/amazon/book32listing.csv', encoding='latin1', header=None)
df1 = df[[3,6]] # only columns with titles and genres
df1.columns = ['title', 'genre']
print(df1[:5])

                                               title      genre
0                    Mom's Family Wall Calendar 2016  Calendars
1                    Doug the Pug 2016 Wall Calendar  Calendars
2  Moleskine 2016 Weekly Notebook, 12M, Large, Bl...  Calendars
3            365 Cats Color Page-A-Day Calendar 2016  Calendars
4               Sierra Club Engagement Calendar 2016  Calendars


In [2]:
titles = df1['title'] # pandas Series of all titles
titles1 = titles.values.tolist() # change to list of strings
print(titles1[0:5]) # test whether it worked

["Mom's Family Wall Calendar 2016", 'Doug the Pug 2016 Wall Calendar', 'Moleskine 2016 Weekly Notebook, 12M, Large, Black, Soft Cover (5 x 8.25)', '365 Cats Color Page-A-Day Calendar 2016', 'Sierra Club Engagement Calendar 2016']


In [3]:
genres = df1['genre']
genres = genres.values.tolist()
genres = pd.DataFrame(genres)

In [4]:
df1.genre.unique() # list of all possible genres

array(['Calendars', 'Comics & Graphic Novels', 'Test Preparation',
       'Mystery, Thriller & Suspense', 'Science Fiction & Fantasy',
       'Romance', 'Humor & Entertainment', 'Literature & Fiction',
       'Gay & Lesbian', 'Engineering & Transportation',
       'Cookbooks, Food & Wine', 'Crafts, Hobbies & Home',
       'Arts & Photography', 'Education & Teaching',
       'Parenting & Relationships', 'Self-Help', 'Computers & Technology',
       'Medical Books', 'Science & Math', 'Health, Fitness & Dieting',
       'Business & Money', 'Law', 'Biographies & Memoirs', 'History',
       'Politics & Social Sciences', 'Reference',
       'Christian Books & Bibles', 'Religion & Spirituality',
       'Sports & Outdoors', 'Teen & Young Adult', "Children's Books",
       'Travel'], dtype=object)

## Case Collapsing
Change all uppercase to lowercase letters

In [5]:
case_collap = map(lambda x:x.lower(), titles1)
case_collap_list = list(case_collap)
print(case_collap_list[0:5])

["mom's family wall calendar 2016", 'doug the pug 2016 wall calendar', 'moleskine 2016 weekly notebook, 12m, large, black, soft cover (5 x 8.25)', '365 cats color page-a-day calendar 2016', 'sierra club engagement calendar 2016']


## Remove Punctuation
Remove punctuation by creating a translation table that deletes characters.

The characters to be removed are given by the string string.punctuation.

In [6]:
import string
trans = str.maketrans('', '', string.punctuation)
rem_punct = [s.translate(trans) for s in case_collap_list]
print(rem_punct[0:5])

['moms family wall calendar 2016', 'doug the pug 2016 wall calendar', 'moleskine 2016 weekly notebook 12m large black soft cover 5 x 825', '365 cats color pageaday calendar 2016', 'sierra club engagement calendar 2016']
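
A side effect visible in the output above is that hyphenated words and decimals are fused ("page-a-day" becomes "pageaday", "8.25" becomes "825"). As a hedged alternative, not what the notebook does, punctuation could be mapped to spaces instead. The names `trans_delete`, `trans_space` and `strip_punct` are made up for this sketch:

```python
import string

# Translation table that deletes punctuation (same idea as the cell above).
trans_delete = str.maketrans('', '', string.punctuation)
# Alternative table that maps every punctuation character to a space.
trans_space = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def strip_punct(text, table):
    """Apply the table, then collapse runs of whitespace into single spaces."""
    return ' '.join(text.translate(table).split())

sample = '365 cats color page-a-day calendar 2016'
deleted = strip_punct(sample, trans_delete)
spaced = strip_punct(sample, trans_space)
print(deleted)  # hyphenated word is fused
print(spaced)   # hyphenated word is split into its parts
```

The trade-off is that the space variant also splits possessives ("mom's" becomes "mom s"), so which table works better for the classifier would have to be tested.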


## Tokenization
Split all titles into words (output: list of lists of strings)

In [7]:
import nltk
from nltk.tokenize import word_tokenize
tokenized_titles = [word_tokenize(i) for i in rem_punct]
print(tokenized_titles[:5])

[['moms', 'family', 'wall', 'calendar', '2016'], ['doug', 'the', 'pug', '2016', 'wall', 'calendar'], ['moleskine', '2016', 'weekly', 'notebook', '12m', 'large', 'black', 'soft', 'cover', '5', 'x', '825'], ['365', 'cats', 'color', 'pageaday', 'calendar', '2016'], ['sierra', 'club', 'engagement', 'calendar', '2016']]


## N-Grams: Bigrams and Trigrams
Create bigrams and trigrams (could be done in one go with a parameter n, e.g. n=2 for bigrams, n=3 for trigrams)

In [8]:
# bigrams
token_bigram = []
for title in tokenized_titles:
    title_bigram = []
    for w in range(len(title) - 1):
        title_bigram.append([title[w], title[w + 1]])
    token_bigram.append(title_bigram)
print(token_bigram[:5]) # test whether working

[[['moms', 'family'], ['family', 'wall'], ['wall', 'calendar'], ['calendar', '2016']], [['doug', 'the'], ['the', 'pug'], ['pug', '2016'], ['2016', 'wall'], ['wall', 'calendar']], [['moleskine', '2016'], ['2016', 'weekly'], ['weekly', 'notebook'], ['notebook', '12m'], ['12m', 'large'], ['large', 'black'], ['black', 'soft'], ['soft', 'cover'], ['cover', '5'], ['5', 'x'], ['x', '825']], [['365', 'cats'], ['cats', 'color'], ['color', 'pageaday'], ['pageaday', 'calendar'], ['calendar', '2016']], [['sierra', 'club'], ['club', 'engagement'], ['engagement', 'calendar'], ['calendar', '2016']]]


In [9]:
# trigrams
token_trigram = []
for title in tokenized_titles:
    title_trigram = []
    for w in range(len(title) - 2):
        title_trigram.append([title[w], title[w + 1], title[w + 2]])
    token_trigram.append(title_trigram)
print(token_trigram[:5]) # test whether working

[[['moms', 'family', 'wall'], ['family', 'wall', 'calendar'], ['wall', 'calendar', '2016']], [['doug', 'the', 'pug'], ['the', 'pug', '2016'], ['pug', '2016', 'wall'], ['2016', 'wall', 'calendar']], [['moleskine', '2016', 'weekly'], ['2016', 'weekly', 'notebook'], ['weekly', 'notebook', '12m'], ['notebook', '12m', 'large'], ['12m', 'large', 'black'], ['large', 'black', 'soft'], ['black', 'soft', 'cover'], ['soft', 'cover', '5'], ['cover', '5', 'x'], ['5', 'x', '825']], [['365', 'cats', 'color'], ['cats', 'color', 'pageaday'], ['color', 'pageaday', 'calendar'], ['pageaday', 'calendar', '2016']], [['sierra', 'club', 'engagement'], ['club', 'engagement', 'calendar'], ['engagement', 'calendar', '2016']]]
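
The bigram and trigram loops above differ only in the window size, so they could be folded into one helper that takes n as a parameter. A minimal sketch, where the function name `ngrams` is an assumption for illustration:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list, in order."""
    # range() is empty when the title has fewer than n tokens, so the result is [].
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

tokens = ['doug', 'the', 'pug', '2016', 'wall', 'calendar']
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

Applied to the whole data set this would be `[ngrams(title, 2) for title in tokenized_titles]`. NLTK also ships `nltk.util.ngrams`, which yields tuples rather than lists.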


## Stemming
Test whether stemming makes a difference in the final classifier.

Uses PorterStemmer (one stemming algorithm; less aggressive than LancasterStemmer).

Steps:

- Create an empty list to hold the list of stems for each title
- For each title, create an empty list for its stems
- Add each stemmed word in the title to the second list
- Append the list of stemmed words to the first list

In [10]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
stems = []   
for title in tokenized_titles:
    stems_title = []
    for word in title:
        stems_title.append(porter.stem(word))
    stems.append(stems_title)
    
print(stems[0:5]) # test whether it works

[['mom', 'famili', 'wall', 'calendar', '2016'], ['doug', 'the', 'pug', '2016', 'wall', 'calendar'], ['moleskin', '2016', 'weekli', 'notebook', '12m', 'larg', 'black', 'soft', 'cover', '5', 'x', '825'], ['365', 'cat', 'color', 'pageaday', 'calendar', '2016'], ['sierra', 'club', 'engag', 'calendar', '2016']]


## Lemmatization

Same principle as with stemming

Test this out to see whether it makes a difference in final classifier

In [11]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmas = []
for title in tokenized_titles:
    lemmas_title = []
    for word in title:
        lemmas_title.append(lemmatizer.lemmatize(word))
    lemmas.append(lemmas_title)
    
print(lemmas[0:5]) # test whether it works

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/feliciaheilgendorff/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


[['mom', 'family', 'wall', 'calendar', '2016'], ['doug', 'the', 'pug', '2016', 'wall', 'calendar'], ['moleskine', '2016', 'weekly', 'notebook', '12m', 'large', 'black', 'soft', 'cover', '5', 'x', '825'], ['365', 'cat', 'color', 'pageaday', 'calendar', '2016'], ['sierra', 'club', 'engagement', 'calendar', '2016']]
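
Before comparing classifiers, a quick sanity check is to measure how many tokens each step actually changes. The sketch below hard-codes three of the titles from the outputs above; `changed_fraction` is a hypothetical helper, and the real test remains the effect on classifier accuracy:

```python
def changed_fraction(original, normalized):
    """Share of tokens that the normalization step actually altered."""
    pairs = [(o, n)
             for o_title, n_title in zip(original, normalized)
             for o, n in zip(o_title, n_title)]
    changed = sum(1 for o, n in pairs if o != n)
    return changed / len(pairs)

# Three titles copied from the tokenization, stemming and lemmatization outputs.
tokens = [['moms', 'family', 'wall', 'calendar', '2016'],
          ['365', 'cats', 'color', 'pageaday', 'calendar', '2016'],
          ['sierra', 'club', 'engagement', 'calendar', '2016']]
stems_sample = [['mom', 'famili', 'wall', 'calendar', '2016'],
                ['365', 'cat', 'color', 'pageaday', 'calendar', '2016'],
                ['sierra', 'club', 'engag', 'calendar', '2016']]
lemmas_sample = [['mom', 'family', 'wall', 'calendar', '2016'],
                 ['365', 'cat', 'color', 'pageaday', 'calendar', '2016'],
                 ['sierra', 'club', 'engagement', 'calendar', '2016']]

print(changed_fraction(tokens, stems_sample))   # stemming alters more tokens
print(changed_fraction(tokens, lemmas_sample))  # lemmatization alters fewer
```

On the full data the same call would be `changed_fraction(tokenized_titles, stems)` and `changed_fraction(tokenized_titles, lemmas)`.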


## Part-of-Speech (POS) Tagging
Create list of lists with tokens and their corresponding part-of-speech tag in each title

In [12]:
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag

postag = []
for title in tokenized_titles:
    postag.append(nltk.pos_tag(title))
    
print(postag[0:5]) # testing whether postag worked

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/feliciaheilgendorff/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


[[('moms', 'NNS'), ('family', 'NN'), ('wall', 'NN'), ('calendar', 'NN'), ('2016', 'CD')], [('doug', 'VB'), ('the', 'DT'), ('pug', 'NN'), ('2016', 'CD'), ('wall', 'NN'), ('calendar', 'NN')], [('moleskine', 'NN'), ('2016', 'CD'), ('weekly', 'JJ'), ('notebook', 'NN'), ('12m', 'CD'), ('large', 'JJ'), ('black', 'JJ'), ('soft', 'JJ'), ('cover', 'NN'), ('5', 'CD'), ('x', 'JJ'), ('825', 'CD')], [('365', 'CD'), ('cats', 'NNS'), ('color', 'VBP'), ('pageaday', 'IN'), ('calendar', 'NN'), ('2016', 'CD')], [('sierra', 'NN'), ('club', 'NN'), ('engagement', 'NN'), ('calendar', 'NN'), ('2016', 'CD')]]
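
One way the tags might feed into a bag-of-words classifier, offered only as an assumption and not as the project's chosen method, is to treat each token/tag pair as a single feature or to count tags per title. Both helper names below are hypothetical:

```python
from collections import Counter

def token_tag_features(tagged_title):
    """Join each token with its POS tag so the pair counts as one feature."""
    return [f"{token}_{tag}" for token, tag in tagged_title]

def tag_counts(tagged_title):
    """Count how often each POS tag occurs in a title."""
    return Counter(tag for _, tag in tagged_title)

# Second title from the POS tagging output above.
tagged = [('doug', 'VB'), ('the', 'DT'), ('pug', 'NN'),
          ('2016', 'CD'), ('wall', 'NN'), ('calendar', 'NN')]
features = token_tag_features(tagged)
counts = tag_counts(tagged)
print(features)
print(counts)
```

The joined features can be counted exactly like plain tokens, so the rest of a Naive Bayes pipeline would not need to change.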


## Named Entity Recognition (NER)

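Create a list of lists with the words and their corresponding POS and named entity tags for each title. Note that the code below tokenizes and tags the original titles (titles1) again rather than reusing the POS tags from the previous step, so case and punctuation are preserved.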

In [13]:
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk import ne_chunk
from nltk.chunk import tree2conlltags

ner = []
for title in titles1:
    ner.append(tree2conlltags(ne_chunk(pos_tag(word_tokenize(title)))))
    
print(ner[0:5]) # test whether NER worked

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /Users/feliciaheilgendorff/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     /Users/feliciaheilgendorff/nltk_data...
[nltk_data]   Package words is already up-to-date!


[[('Mom', 'NNP', 'B-PERSON'), ("'s", 'POS', 'O'), ('Family', 'NNP', 'B-PERSON'), ('Wall', 'NNP', 'I-PERSON'), ('Calendar', 'NNP', 'I-PERSON'), ('2016', 'CD', 'O')], [('Doug', 'NNP', 'O'), ('the', 'DT', 'O'), ('Pug', 'NNP', 'O'), ('2016', 'CD', 'O'), ('Wall', 'NNP', 'B-FACILITY'), ('Calendar', 'NNP', 'I-FACILITY')], [('Moleskine', 'NN', 'O'), ('2016', 'CD', 'O'), ('Weekly', 'NNP', 'O'), ('Notebook', 'NNP', 'O'), (',', ',', 'O'), ('12M', 'CD', 'O'), (',', ',', 'O'), ('Large', 'NNP', 'B-PERSON'), (',', ',', 'O'), ('Black', 'NNP', 'B-PERSON'), (',', ',', 'O'), ('Soft', 'NNP', 'B-PERSON'), ('Cover', 'NNP', 'I-PERSON'), ('(', '(', 'O'), ('5', 'CD', 'O'), ('x', 'RB', 'O'), ('8.25', 'CD', 'O'), (')', ')', 'O')], [('365', 'CD', 'O'), ('Cats', 'NNPS', 'B-ORGANIZATION'), ('Color', 'NNP', 'I-ORGANIZATION'), ('Page-A-Day', 'NNP', 'O'), ('Calendar', 'NNP', 'O'), ('2016', 'CD', 'O')], [('Sierra', 'NNP', 'B-PERSON'), ('Club', 'NNP', 'B-ORGANIZATION'), ('Engagement', 'NNP', 'I-ORGANIZATION'), ('Calenda

# Classifiers

Check whether stemming/lemmatization makes a difference in the final classifiers
-> does it improve or worsen the classifier?


- Naive Bayes (bag-of-words)
- BERT (language model)

https://towardsdatascience.com/text-classification-with-nlp-tf-idf-vs-word2vec-vs-bert-41ff868d1794

In [14]:
# First step: make third column in original dataframe for cleaned titles 
# (could have columns for stemmed/lemmatized etc versions)
# Columns: tokens, bigrams, trigrams, stems, lemmas, NER (this includes POS)

In [15]:
# create dataframe containing tokenized titles and genres
tok_title = pd.DataFrame({0: tokenized_titles})
data_in = [tok_title[0], df1["genre"]]
headers = ["titles", "genres"]

data = pd.concat(data_in, axis=1, keys=headers)
print(data[:5])

                                              titles     genres
0               [moms, family, wall, calendar, 2016]  Calendars
1             [doug, the, pug, 2016, wall, calendar]  Calendars
2  [moleskine, 2016, weekly, notebook, 12m, large...  Calendars
3       [365, cats, color, pageaday, calendar, 2016]  Calendars
4         [sierra, club, engagement, calendar, 2016]  Calendars


In [16]:
# split data into train and test
import numpy as np

test_pct = 0.2 # target split of roughly 80/20 (each row is assigned at random)

# create mask
mask = np.random.choice([0, 1], p=[1 - test_pct, test_pct], size=data.shape[0])

# apply mask
data["mask"] = mask
test = data[data["mask"] == 1]
train = data[data["mask"] == 0]

# remove the mask column and reset the index
# (drop=True discards the old index instead of adding it as an "index" column)
test = test.drop("mask", axis="columns").reset_index(drop=True)
train = train.drop("mask", axis="columns").reset_index(drop=True)
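
Because the mask is drawn at random without a fixed seed, each run produces a different split and therefore different counts downstream. A minimal sketch of a reproducible mask, using only the standard library; `split_mask` and the seed value are assumptions made for illustration:

```python
import random

def split_mask(n_rows, test_pct, seed):
    """Return a list of 0/1 flags; 1 marks a row for the test set."""
    rng = random.Random(seed)  # private generator, so the seed affects only this mask
    return [1 if rng.random() < test_pct else 0 for _ in range(n_rows)]

mask = split_mask(1000, 0.2, seed=42)
print(sum(mask) / len(mask))  # close to 0.2, but not exactly 0.2
```

With the NumPy approach used in the cell above, the same effect can be had by calling `np.random.seed(...)` with a fixed value before `np.random.choice`.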

## Multinomial Naive Bayes

In [17]:
from collections import Counter

In [18]:
# token frequencies within each title
# (Counter counts the tokens of a title directly)
unigram_count = [Counter(title) for title in tokenized_titles]

print(unigram_count[:5]) # test

[Counter({'moms': 1, 'family': 1, 'wall': 1, 'calendar': 1, '2016': 1}), Counter({'doug': 1, 'the': 1, 'pug': 1, '2016': 1, 'wall': 1, 'calendar': 1}), Counter({'moleskine': 1, '2016': 1, 'weekly': 1, 'notebook': 1, '12m': 1, 'large': 1, 'black': 1, 'soft': 1, 'cover': 1, '5': 1, 'x': 1, '825': 1}), Counter({'365': 1, 'cats': 1, 'color': 1, 'pageaday': 1, 'calendar': 1, '2016': 1}), Counter({'sierra': 1, 'club': 1, 'engagement': 1, 'calendar': 1, '2016': 1})]


In [19]:
# token frequencies in total vocab
tok_freq = Counter()
for title in train['titles']:
    for i in title:
        tok_freq[i] += 1
        
print(tok_freq)



In [20]:
# group data by genre

grouped = train.groupby(train.genres)

Calendars = grouped.get_group("Calendars")
Comics = grouped.get_group("Comics & Graphic Novels")
Test = grouped.get_group("Test Preparation")
Mystery = grouped.get_group("Mystery, Thriller & Suspense")
SciFi = grouped.get_group("Science Fiction & Fantasy")
Romance = grouped.get_group("Romance")
Humor = grouped.get_group("Humor & Entertainment")
Literature = grouped.get_group("Literature & Fiction")
LGBTQ = grouped.get_group("Gay & Lesbian")
Engineering = grouped.get_group("Engineering & Transportation")
Food = grouped.get_group("Cookbooks, Food & Wine")
Crafts = grouped.get_group("Crafts, Hobbies & Home")
Arts = grouped.get_group("Arts & Photography")
Education = grouped.get_group("Education & Teaching")
Parenting = grouped.get_group("Parenting & Relationships")
SelfHelp = grouped.get_group("Self-Help")
Computers = grouped.get_group("Computers & Technology")
Medical = grouped.get_group("Medical Books")
Science = grouped.get_group("Science & Math")
Health = grouped.get_group("Health, Fitness & Dieting")
Business = grouped.get_group("Business & Money")
Law = grouped.get_group("Law")
Biographies = grouped.get_group("Biographies & Memoirs")
History = grouped.get_group("History")
Politics = grouped.get_group("Politics & Social Sciences")
Reference = grouped.get_group("Reference")
Bibles = grouped.get_group("Christian Books & Bibles")
Religion = grouped.get_group("Religion & Spirituality")
Sports = grouped.get_group("Sports & Outdoors")
Teen = grouped.get_group("Teen & Young Adult")
Childrens = grouped.get_group("Children's Books")
Travel = grouped.get_group("Travel")

GenreGroups = [Calendars['titles'], Comics['titles'], Test['titles'], Mystery['titles'], SciFi['titles'], 
               Romance['titles'], Humor['titles'], Literature['titles'], LGBTQ['titles'], Engineering['titles'], 
               Food['titles'], Crafts['titles'], Arts['titles'], Education['titles'], Parenting['titles'], 
               SelfHelp['titles'], Computers['titles'], Medical['titles'], Science['titles'], Health['titles'], 
               Business['titles'], Law['titles'], Biographies['titles'], History['titles'], Politics['titles'], 
               Reference['titles'], Bibles['titles'], Religion['titles'], Sports['titles'], Teen['titles'], 
               Childrens['titles'], Travel['titles']]

In [21]:
# token frequencies in each genre
keys = ['Calendars', 'Comics & Graphic Novels', 'Test Preparation',
       'Mystery, Thriller & Suspense', 'Science Fiction & Fantasy',
       'Romance', 'Humor & Entertainment', 'Literature & Fiction',
       'Gay & Lesbian', 'Engineering & Transportation',
       'Cookbooks, Food & Wine', 'Crafts, Hobbies & Home',
       'Arts & Photography', 'Education & Teaching',
       'Parenting & Relationships', 'Self-Help', 'Computers & Technology',
       'Medical Books', 'Science & Math', 'Health, Fitness & Dieting',
       'Business & Money', 'Law', 'Biographies & Memoirs', 'History',
       'Politics & Social Sciences', 'Reference',
       'Christian Books & Bibles', 'Religion & Spirituality',
       'Sports & Outdoors', 'Teen & Young Adult', "Children's Books",
       'Travel']

genre_count = []
for g in GenreGroups:
    genre_title = Counter()
    for title in g:
        for i in title:
            genre_title[i] += 1
    genre_count.append(genre_title)
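
The per-genre counts can also be built without one variable per genre and without a separately maintained key list. A hedged sketch using a dictionary of Counters; `count_tokens_by_genre` is a hypothetical helper and the third title is made up to get a second genre:

```python
from collections import Counter, defaultdict

def count_tokens_by_genre(token_lists, genres):
    """Map each genre to a Counter of token frequencies over its titles."""
    counts = defaultdict(Counter)
    for tokens, genre in zip(token_lists, genres):
        counts[genre].update(tokens)  # add this title's tokens to its genre
    return counts

token_lists = [
    ['moms', 'family', 'wall', 'calendar', '2016'],
    ['doug', 'the', 'pug', '2016', 'wall', 'calendar'],
    ['travel', 'guide'],  # made-up title for a second genre
]
genres = ['Calendars', 'Calendars', 'Travel']
by_genre = count_tokens_by_genre(token_lists, genres)
print(by_genre['Calendars']['calendar'])
```

On the training data the call would be `count_tokens_by_genre(train['titles'], train['genres'])`, and the genre names come straight from the data, so they cannot drift out of order.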

In [22]:
# combine keys and titles grouped by genres
zipped_values = zip(keys, genre_count)
tok_freq_genres = list(zipped_values)

In [23]:
tok_freq_genres[0][1]

Counter({'moms': 8,
         'family': 9,
         'wall': 791,
         'calendar': 1939,
         '2016': 1128,
         'doug': 1,
         'the': 364,
         'pug': 4,
         'moleskine': 27,
         'weekly': 63,
         'notebook': 18,
         '12m': 7,
         'large': 27,
         'black': 26,
         'soft': 9,
         'cover': 29,
         '5': 14,
         'x': 27,
         '825': 13,
         '365': 76,
         'cats': 34,
         'color': 13,
         'pageaday': 52,
         'sierra': 3,
         'club': 4,
         'engagement': 68,
         'wilderness': 9,
         'ansel': 3,
         'adams': 3,
         'dilbert': 5,
         'daytoday': 94,
         'mary': 8,
         'engelbreit': 6,
         'deluxe': 34,
         'never': 3,
         'give': 1,
         'up': 9,
         'amy': 3,
         'knapp': 2,
         'big': 11,
         'grid': 4,
         'essential': 1,
         'organization': 1,
         'and': 133,
         'communication': 1,
       

In [24]:
# number of distinct words (types) in a class
# (the total token count Nc would be sum(tok_freq_genres[0][1].values()))
len(tok_freq_genres[0][1]) # first genre

2353

In [25]:
# class name
tok_freq_genres[0][0] # first genre

'Calendars'

In [26]:
# size of the total vocabulary (training set)
V = len(tok_freq)
print(V)

74847


In [27]:
# number of titles (in training set)
N_titles = len(train)
print(N_titles)

166420


In [28]:
# number of titles in each genre
N_genre = train['genres'].value_counts()
print(N_genre[:5])

Travel                       14709
Children's Books             10918
Medical Books                 9682
Health, Fitness & Dieting     9483
Business & Money              7974
Name: genres, dtype: int64


In [29]:
N_genre['Travel'] # access number of titles in specified genre

14709

In [None]:
# compute priors for each class: P(genre) = titles in genre / all titles
# (word counts per class are already stored in tok_freq_genres; total vocab = V)
priors = {}
for genre, n in N_genre.items():
    priors[genre] = n / N_titles

In [None]:
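# likelihoods for each word: P(word | genre), with add-one (Laplace) smoothing
# x  = frequency of the word in the genre (tok_freq_genres[k][1][word])
# Nc = total number of tokens in the genre
# V  = size of the total vocabulary
# P(word | genre) = (x + 1) / (Nc + V)
likelihoods = {}
for genre, counts in tok_freq_genres:
    Nc = sum(counts.values()) # total tokens; len(counts) would only give distinct words
    likelihoods[genre] = {word: (counts[word] + 1) / (Nc + V) for word in tok_freq}

# a word that never occurs in a genre gets (0 + 1) / (Nc + V)
# still to do: handle test-set words that are not in the training vocabulary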

In [None]:
# test NB using test data
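
A minimal end-to-end sketch of how training, prediction and testing could fit together, on made-up toy data. The function names are hypothetical, log-probabilities are summed to avoid numerical underflow, and skipping unseen test words is an assumption rather than a decision taken in this notebook:

```python
import math
from collections import Counter, defaultdict

def train_nb(token_lists, genres):
    """Collect per-genre token counts, title counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    title_counts = Counter(genres)
    for tokens, genre in zip(token_lists, genres):
        word_counts[genre].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, title_counts, vocab

def predict_nb(tokens, word_counts, title_counts, vocab):
    """Return the genre with the highest log posterior for one title."""
    n_titles = sum(title_counts.values())
    best_genre, best_score = None, -math.inf
    for genre, counts in word_counts.items():
        n_c = sum(counts.values())                        # tokens in this genre
        score = math.log(title_counts[genre] / n_titles)  # log prior
        for word in tokens:
            if word not in vocab:
                continue  # assumption: skip words never seen in training
            # add-one smoothing: (x + 1) / (Nc + V)
            score += math.log((counts[word] + 1) / (n_c + len(vocab)))
        if score > best_score:
            best_genre, best_score = genre, score
    return best_genre

# Tiny made-up training and test sets, only to exercise the functions.
train_titles = [['wall', 'calendar', '2016'], ['cats', 'calendar'],
                ['travel', 'guide', 'italy'], ['italy', 'road', 'trips']]
train_genres = ['Calendars', 'Calendars', 'Travel', 'Travel']
test_titles = [['dogs', 'calendar', '2016'], ['italy', 'guide']]
test_genres = ['Calendars', 'Travel']

model = train_nb(train_titles, train_genres)
predictions = [predict_nb(t, *model) for t in test_titles]
accuracy = sum(p == g for p, g in zip(predictions, test_genres)) / len(test_genres)
print(predictions, accuracy)
```

With the notebook's data the inputs would be `train['titles']`, `train['genres']`, `test['titles']` and `test['genres']`.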

## BERT

In [None]:
# import relevant packages
import transformers

# do we need PyTorch / Tensorflow as well?