In this case study we implement an elementary text classification model built on word embeddings. Word embeddings encode information about the contexts in which a word occurs. In this notebook we use pretrained GloVe vectors to look up an embedding for each word in a sentence (here, each news article). The average of all embeddings in a sentence serves as the sentence representation, and each sentence representation is then classified into one of the categories. The entire process is described step by step below:

1. Load the dataset from the disk
2. Tokenize text in the dataset and create vocabulary
3. Load the pretrained GloVe vectors from the disk into a Python dictionary
4. Load embeddings for each word and take average
5. Encode the target labels as integers
6. Train the classifier

In [331]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [332]:
from nltk.tokenize import RegexpTokenizer
import numpy as np
import re

### Load the dataset from the disk

In [333]:
import pandas as pd
df = pd.read_csv('/content/bbc-text.csv')
df.head()

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...


In [334]:
df['text'][0]

'tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to high

### Tokenizer
A regular-expression-based tokenizer that keeps alphabetic sequences, drops numeric sequences, and strips runs of `x`/`X` such as the `xx` and `XX` placeholders in the sample text below.

In [335]:
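def complaint_to_words(comp):
    # Split into runs of word characters (letters, digits, underscore)
    words = RegexpTokenizer(r'\w+').tokenize(comp)
    # Strip runs of x/X and digits from each token, then lowercase
    # (note: this also removes any 'x' inside ordinary words)
    words = [re.sub(r'[xX]+|\d+', '', w).lower() for w in words]
    # Drop tokens that became empty
    words = [w for w in words if w]

    return words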

In [336]:
text = "I have outdated information on my credit report that I have previously disputed that has yet to be removed this information xx is XX more then seven years old and 12.1 does not meet credit reporting requirements"

In [337]:
words = complaint_to_words(text)

In [338]:
words

['i',
 'have',
 'outdated',
 'information',
 'on',
 'my',
 'credit',
 'report',
 'that',
 'i',
 'have',
 'previously',
 'disputed',
 'that',
 'has',
 'yet',
 'to',
 'be',
 'removed',
 'this',
 'information',
 'is',
 'more',
 'then',
 'seven',
 'years',
 'old',
 'and',
 'does',
 'not',
 'meet',
 'credit',
 'reporting',
 'requirements']
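
The cleaning rule above deletes `x` runs wherever they occur, not only in masked tokens. The sketch below illustrates this on a made-up sentence and contrasts it with a hypothetical variant, `tokenize_drop_masks`, that discards a token only when it consists entirely of `x`/`X` or digits. It uses `re.findall` as a stand-in for `RegexpTokenizer` so that it runs without NLTK; both function names and the sample sentence are assumptions for illustration and are not used elsewhere in this notebook.

```python
import re

def tokenize_strip_everywhere(text):
    # Same cleaning rule as complaint_to_words: delete runs of x/X
    # and digits anywhere inside a token, then lowercase.
    words = re.findall(r'\w+', text)
    words = [re.sub(r'[xX]+|\d+', '', w).lower() for w in words]
    return [w for w in words if w]

def tokenize_drop_masks(text):
    # Hypothetical variant: drop a token only if it is made up
    # entirely of x/X characters or entirely of digits.
    words = re.findall(r'\w+', text)
    return [w.lower() for w in words if not re.fullmatch(r'[xX]+|\d+', w)]

sample = "The box office took 12 million XX dollars"
print(tokenize_strip_everywhere(sample))
# ['the', 'bo', 'office', 'took', 'million', 'dollars']
print(tokenize_drop_masks(sample))
# ['the', 'box', 'office', 'took', 'million', 'dollars']
```

Whether the second behaviour is preferable depends on the dataset; for news text without masking characters it keeps words such as "box" intact, which matters because a mangled token will not be found in the pretrained vocabulary.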

### Vocabulary
Extracting all the unique words from the dataset.

In [339]:
df.shape

(2225, 2)

In [340]:
all_words = list()
for comp in df['text']:
    for w in complaint_to_words(comp):
        all_words.append(w)

In [341]:
print('Size of vocabulary: {}'.format(len(set(all_words))))

Size of vocabulary: 27850


In [342]:
all_words[-10:-1]

['player', 'more', 'of', 'the', 'same', 'in', 'future', 'please', 'he']

In [343]:
print('Complaint\n', df['text'][10], '\n')
print('Tokens\n', complaint_to_words(df['text'][10]))

Complaint
 berlin cheers for anti-nazi film a german movie about an anti-nazi resistance heroine has drawn loud applause at berlin film festival.  sophie scholl - the final days portrays the final days of the member of the white rose movement. scholl  21  was arrested and beheaded with her brother  hans  in 1943 for distributing leaflets condemning the  abhorrent tyranny  of adolf hitler. director marc rothemund said:  i have a feeling of responsibility to keep the legacy of the scholls going.   we must somehow keep their ideas alive   he added.  the film drew on transcripts of gestapo interrogations and scholl s trial preserved in the archive of communist east germany s secret police. their discovery was the inspiration behind the film for rothemund  who worked closely with surviving relatives  including one of scholl s sisters  to ensure historical accuracy on the film. scholl and other members of the white rose resistance group first started distributing anti-nazi leaflets in the su

### Indexing
Indexing each unique word in the dataset by assigning it a unique number.

In [344]:
index_dict = dict()
count = 1
index_dict['<unk>'] = 0
for word in set(all_words):
    index_dict[word] = count
    count += 1

In [345]:
#index_dict
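
As a rough illustration of how such an index is used, the sketch below builds a word-to-integer mapping on a toy token list and encodes a sentence with it, sending unseen words to the `<unk>` index. The helper names `build_index` and `encode` are assumptions for this example. The vocabulary is sorted before numbering because iteration order over a set of strings is not guaranteed to be the same from one run to the next.

```python
def build_index(tokens):
    # Reserve 0 for unknown words; sort the vocabulary so the
    # word-to-integer mapping is reproducible across runs.
    index = {'<unk>': 0}
    for i, word in enumerate(sorted(set(tokens)), start=1):
        index[word] = i
    return index

def encode(tokens, index):
    # Replace each token by its index, falling back to <unk>.
    return [index.get(w, index['<unk>']) for w in tokens]

toy_index = build_index(['tv', 'future', 'in', 'the', 'hands', 'of', 'the', 'viewers'])
print(encode(['the', 'future', 'of', 'radio'], toy_index))
# [5, 1, 4, 0]
```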

### Dataset
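The word index built above can be used to replace words by their indices, which makes the dataset numerical and Keras-readable. The next cell instead loads the pretrained GloVe vectors into a Python dictionary that maps each word to its 300-dimensional vector; the averaging step further below works from this dictionary.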

In [346]:
embeddings_index = {}
# GloVe (Global Vectors): each line is a word followed by its 300 coefficients
with open('/content/glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
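
The GloVe text format stores one word per line, followed by its space-separated coefficients. The sketch below parses that format from an in-memory string, so the parsing logic can be checked without the full download. The function name `load_glove` and the three-dimensional toy vectors are assumptions for illustration.

```python
import io
import numpy as np

def load_glove(handle):
    # Parse lines of the form "<word> <c1> <c2> ... <cn>" into a dict
    # mapping each word to a float32 vector.
    embeddings = {}
    for line in handle:
        values = line.rstrip().split(' ')
        if len(values) < 2:
            continue  # skip blank or malformed lines
        embeddings[values[0]] = np.asarray(values[1:], dtype='float32')
    return embeddings

# Toy stand-in for the real file (made-up values, 3 dimensions)
toy_file = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
toy_embeddings = load_glove(toy_file)
print(sorted(toy_embeddings))
print(toy_embeddings['cat'].shape)
```

The same function accepts a real file handle, for example one opened with `open(path, encoding='utf-8')`.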

In [347]:
#embeddings_index

In [348]:
# emmbed_dict = {}
# with open('/content/glove.6B.200d.txt','r') as f:
#   for line in f:
#     values = line.split()
#     word = values[0]
#     vector = np.asarray(values[1:],'float32')
#     emmbed_dict[word]=vector

In [349]:
from scipy import spatial

In [350]:
def find_similar_word(emmbedes):
  nearest = sorted(embeddings_index.keys(), key=lambda word: spatial.distance.euclidean(embeddings_index[word], emmbedes))
  return nearest
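
Sorting the whole vocabulary with a Python-level distance call per word is slow on a large embedding table. A possible alternative, sketched below on made-up two-dimensional vectors, stacks the vectors into one matrix and computes all Euclidean distances in a single NumPy operation. The function name `nearest_words` and the parameter `k` are assumptions for this example.

```python
import numpy as np

def nearest_words(query, embeddings, k=3):
    # Stack all vectors into one (n_words, dim) matrix and compute the
    # Euclidean distance from every row to the query at once.
    words = list(embeddings.keys())
    matrix = np.stack([embeddings[w] for w in words])
    dists = np.linalg.norm(matrix - query, axis=1)
    order = np.argsort(dists)[:k]
    return [words[i] for i in order]

# Toy vectors (made-up values)
toy = {
    'rat': np.array([1.0, 0.0]),
    'mouse': np.array([0.9, 0.1]),
    'car': np.array([-1.0, 0.5]),
}
print(nearest_words(toy['rat'], toy, k=2))
# ['rat', 'mouse']
```

As with `find_similar_word`, the query word itself comes back first because its distance to itself is zero.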


In [351]:
#find_similar_word(embeddings_index['rat'])#[0:10]

In [352]:
#len(embeddings_index['unk'])

In [353]:
#embeddings_list = embeddings_index.items()

In [354]:
#list(embeddings_list)[99:101]

#### Taking average of all word embeddings in a sentence to generate the sentence representation.

In [355]:
data_list = list()
for comp in df['text']:
    sentence = np.zeros(300)
    count = 0
    for w in complaint_to_words(comp):
        try:
            sentence += embeddings_index[w]
            count += 1
        except KeyError:
            # Skip words that have no pretrained vector
            continue
    # Guard against division by zero when no word was found
    data_list.append(sentence / max(count, 1))

In [356]:
len(data_list[0])

300
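
The averaging step can be written as a small function and checked on toy vectors. In the sketch below, words without a pretrained vector are skipped, and a text with no known words yields the zero vector rather than a division by zero. The function name `average_embedding` and the two-dimensional toy vectors are assumptions for illustration.

```python
import numpy as np

def average_embedding(tokens, embeddings, dim):
    # Sum the vectors of all tokens found in the embedding table and
    # divide by how many were found.
    total = np.zeros(dim)
    count = 0
    for w in tokens:
        vec = embeddings.get(w)
        if vec is not None:
            total += vec
            count += 1
    # With no known tokens, return the zero vector instead of dividing by 0
    return total / count if count else total

# Toy vectors (made-up values)
toy = {'good': np.array([1.0, 3.0]), 'film': np.array([3.0, 1.0])}
print(average_embedding(['good', 'film', 'zzz'], toy, 2))  # [2. 2.]
print(average_embedding(['zzz'], toy, 2))                  # [0. 0.]
```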

#### Converting categorical labels to numerical format (the scikit-learn classifiers used below accept integer labels directly, so no further one hot encoding is applied).

In [357]:
df

Unnamed: 0,category,text
0,tech,tv future in the hands of viewers with home th...
1,business,worldcom boss left books alone former worldc...
2,sport,tigers wary of farrell gamble leicester say ...
3,sport,yeading face newcastle in fa cup premiership s...
4,entertainment,ocean s twelve raids box office ocean s twelve...
...,...,...
2220,business,cars pull down us retail figures us retail sal...
2221,politics,kilroy unveils immigration policy ex-chatshow ...
2222,entertainment,rem announce new glasgow concert us band rem h...
2223,politics,how political squabbles snowball it s become c...


In [358]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(df['category'])  # the label column in this dataset is 'category'
df['Target'] = le.transform(df['category'])
df.head()
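
The mapping from category names to integers, and the one hot form a model such as a softmax network would need, can be sketched with NumPy alone. The example below uses the five category values shown in the table above; `np.unique` is a stand-in here and, like `LabelEncoder`, numbers the classes in sorted order. The variable names are assumptions for illustration.

```python
import numpy as np

labels = np.array(['tech', 'business', 'sport', 'sport', 'entertainment'])

# Sorted unique class names, plus the integer code of each label
classes, targets = np.unique(labels, return_inverse=True)
print(classes)   # ['business' 'entertainment' 'sport' 'tech']
print(targets)   # [3 0 2 2 1]

# One hot encoding: pick row i of the identity matrix for class i
one_hot = np.eye(len(classes))[targets]
print(one_hot.shape)  # (5, 4)

# Map integer codes back to category names
print(classes[targets])
```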

### Train/test split

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(np.array(data_list), df.Target.values, 
    test_size=0.15, random_state=0)

In [None]:
print(X_train.shape)

In [None]:
print(y_train.shape)
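
With several categories and a small test fraction, a purely random split can leave the classes unevenly represented in the test set. One option, sketched below on synthetic data, is the `stratify` argument of `train_test_split`, which keeps class proportions roughly equal in both parts. The synthetic arrays are made up for illustration, and the notebook's own split above does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 5 features, 5 balanced classes
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.repeat(np.arange(5), 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.15, random_state=0, stratify=y)

print(X_tr.shape, X_te.shape)
# Per-class counts in the test set
print(np.bincount(y_te, minlength=5))
```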

#### Training and testing the classifier

In [None]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score
clf = BernoulliNB()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))
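
One point worth keeping in mind: `BernoulliNB` binarizes its inputs (threshold `binarize=0.0` by default), so on dense averaged embeddings it only sees whether each dimension is positive. The sketch below shows the effect on synthetic data where two classes differ in magnitude but not in sign. `GaussianNB` is swapped in purely for comparison; it is not part of this notebook's pipeline, and the data is made up.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Two classes whose features are all positive but differ in magnitude
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, size=(200, 5)),
               rng.normal(4.0, 0.5, size=(200, 5))])
y = np.array([0] * 200 + [1] * 200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# BernoulliNB thresholds at 0, so every feature becomes 1 for both classes
acc_bernoulli = accuracy_score(y_te, BernoulliNB().fit(X_tr, y_tr).predict(X_te))
# GaussianNB models the continuous values directly
acc_gaussian = accuracy_score(y_te, GaussianNB().fit(X_tr, y_tr).predict(X_te))

print('BernoulliNB:', acc_bernoulli)
print('GaussianNB: ', acc_gaussian)
```

Real embedding averages contain both positive and negative values, so `BernoulliNB` still receives some signal there; the sketch only isolates what the binarization discards.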

In [None]:
import nltk
nltk.download('reuters')

In [None]:
from nltk.corpus import reuters

In [None]:
reuters.fileids()

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(accuracy_score(y_test, pred))