## Amazon Reviews 

In [1]:
import urllib.request
import pandas as pd
from sklearn import model_selection, preprocessing

#Feature Engineering
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif

#keras
from keras.preprocessing.text import Tokenizer


Using TensorFlow backend.


In [0]:
#import data from my github repo
url = "https://raw.githubusercontent.com/deanhoperobertson/NLP-Text-Classifiation-/master/corpus.txt"
data = urllib.request.urlopen(url).read()
data = data.decode('utf-8')


In [0]:
labels, texts = [], []
for line in data.split("\n"):
    content = line.replace("\\", "").split()
    # skip blank lines (e.g. a trailing newline), which carry no label
    if not content:
        continue
    labels.append(content[0])
    texts.append(" ".join(content[1:]))

# create a dataframe using texts and labels
trainDF = pd.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

In [4]:
trainDF.head()

Unnamed: 0,text,label
0,Stuning even for the non-gamer: This sound tra...,__label__2
1,The best soundtrack ever to anything.: I'm rea...,__label__2
2,Amazing!: This soundtrack is my favorite music...,__label__2
3,Excellent Soundtrack: I truly like this soundt...,__label__2
4,"Remember, Pull Your Jaw Off The Floor After He...",__label__2


In [0]:
#split into train and test set (default is 25%)
#set seed for consistency
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'], test_size=0.25, random_state=1)


In [0]:
# encode target labels into integers between 0 and n_classes - 1
# __label__1 -> 0
# __label__2 -> 1
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
# reuse the mapping learned on the training labels instead of refitting
valid_y = encoder.transform(valid_y)

## Feature Engineering

Now we need to convert the sentences into vectors by replacing the words with numerical representations.

For example:
"The cat sat on the hat" = {1,3,10,7,1,12}

This can be done in a number of ways such as:
- CountVectorizer (Bag of Words)
- TF-IDF Matrix
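
As a rough illustration of the word-to-number idea in the example above, the sketch below assigns each unique word an integer ID in order of first appearance. The helper names `build_index` and `encode` are hypothetical and not part of any library, and the IDs it produces differ from those in the example because they depend on the vocabulary.

```python
def build_index(sentences):
    """Assign each unique lower-cased word an integer ID, starting at 1."""
    index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index


def encode(sentence, index):
    """Replace each known word with its integer ID; unknown words are skipped."""
    return [index[w] for w in sentence.lower().split() if w in index]


word_index = build_index(["The cat sat on the hat"])
encoded = encode("The cat sat on the hat", word_index)
print(encoded)  # [1, 2, 3, 4, 1, 5]
```

Note that "the" maps to the same ID both times it occurs, which is what lets later steps count or weight repeated words.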



## 1.1 Bag of Words

Here we use CountVectorizer to create a "bag of words", that is, a vocabulary of the unique words in the text that we are interested in modeling. CountVectorizer's default settings:
- Convert all characters into lower case
- Keep stop words such as “the”, “a”, “an”, “in” (stop_words=None); pass stop_words='english' to remove them

This bag of words is then used to vectorize each sentence (row) into a document-term matrix, in which each column corresponds to a specific term/word.

- fit() function - learns a vocabulary from text
- transform() function - encodes the documents into vectors
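
To make the two steps concrete, here is a minimal pure-Python sketch of what fitting and transforming do conceptually. It is not scikit-learn's implementation: the functions `fit` and `transform` below are simplified stand-ins that split on whitespace only, and the two toy documents are made up for illustration.

```python
from collections import Counter


def fit(documents):
    """Learn a vocabulary: map each unique lower-cased word to a column index."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: col for col, word in enumerate(words)}


def transform(documents, vocabulary):
    """Encode each document as a row of word counts (a document-term matrix)."""
    rows = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows


docs = ["the cat sat on the hat", "the dog sat"]
vocab = fit(docs)            # columns: cat, dog, hat, on, sat, the
matrix = transform(docs, vocab)
print(matrix)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row has one entry per vocabulary word, so word order is lost but repeated words (here "the") are counted.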


**Pros**
- Easy to understand and apply
- Flexible vocabulary

**Cons**
- All words are treated equally (including meaningless words)
- Discarding word order ignores context, which can be very valuable in a text model.


![alt text](Bag of Words.png "Title")

### 1.1.1 CountVectorizer

In [0]:
# instantiate a count vectorizer object to tokenize on a word level
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}', strip_accents ="ascii", stop_words=None)
count_vect.fit(trainDF['text'])

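#see a sample of the unique words in the learned vocabulary
#note: newer scikit-learn versions use get_feature_names_out() instead
count_vect.get_feature_names()[1:10]

#create a document term matrix using the fitted vectorizer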
xtrain_count =  count_vect.transform(train_x)
xvalid_count =  count_vect.transform(valid_x)

In [0]:
#Here we can see how each word has been tokenized (word -> column index)
# count_vect.vocabulary_

In [9]:
#view the document-term matrix as a dataframe
pd.DataFrame(xtrain_count.toarray(), columns=count_vect.get_feature_names()).head()

Unnamed: 0,0,00,000,001,002,00290,007,0070412901,0072316373,008,...,zzzz,zzzzz,zzzzzz,zzzzzzz,zzzzzzzzzz,zzzzzzzzzzzz,zzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzz,zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 1.1.2 Bag of Words ( with Keras)

In [0]:
#tokenize the words - num_words lets us cap the vocabulary at the most frequent words (None keeps them all)
t = Tokenizer(num_words=None)
# fit the tokenizer on the documents
t.fit_on_texts(trainDF['text'])

In [11]:
#see how many times each word occurs
print(t.word_counts)



In [12]:
# Easy method to calculate vocabulary (unique word count)
len(set(t.word_counts))


33470

In [13]:
xtrain_keras = t.texts_to_matrix(train_x, mode='count')
xvalid_keras = t.texts_to_matrix(valid_x, mode='count')
print(xtrain_keras)

[[ 0. 11.  1. ...  0.  0.  0.]
 [ 0.  1.  1. ...  0.  0.  0.]
 [ 0.  3.  0. ...  0.  0.  0.]
 ...
 [ 0.  3.  1. ...  0.  0.  0.]
 [ 0.  2.  1. ...  0.  0.  0.]
 [ 0.  6.  3. ...  0.  0.  0.]]


## 1.2 TF-IDF Vectors

A TF-IDF matrix is an alternative method of representing words in tokenized form. This method differs from the traditional bag of words model in that it weights each word differently depending on its frequency across the documents.

This accounts for words like “the”, which generally appear many times (large counts) but are not very meaningful in the encoded vectors. The TF-IDF vectorization method weights these less meaningful words lower than words that offer more value. The TF-IDF score is composed of two terms:

- Normalized Term Frequency (TF)
- Inverse Document Frequency (IDF)

with the following formula:

\begin{align}
\text{TF-IDF} = \text{Term Frequency (TF)} \times \text{Inverse Document Frequency (IDF)}
\end{align}

The TF component summarizes how often a given word appears within a document, and the IDF component reduces the weighting of words that appear in many documents. IDF is computed as the natural logarithm (ln) of the number of documents in the corpus divided by the number of documents in which the specific term appears.

The end result is a method of vectorizing text that **highlights** words that are more interesting within a collection of texts.
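
The calculation described above can be sketched in a few lines of plain Python. This is an illustrative version under stated assumptions: the function `tf_idf` and the three-document corpus are hypothetical, TF is taken as the word's count divided by the document length, and IDF is the unsmoothed ln(N / df) from the text. scikit-learn's TfidfVectorizer uses a smoothed IDF and normalises each row by default, so its numbers will differ.

```python
import math


def tf_idf(term, document, corpus):
    """Score one term in one document as TF * IDF (unsmoothed)."""
    words = document.lower().split()
    # term frequency: share of the document's words that are this term
    tf = words.count(term) / len(words)
    # document frequency: number of documents that contain the term
    df = sum(1 for doc in corpus if term in doc.lower().split())
    if df == 0:
        return 0.0  # term never appears in the corpus
    idf = math.log(len(corpus) / df)
    return tf * idf


corpus = ["the cat sat", "the dog barked", "the cat purred"]
score_the = tf_idf("the", corpus[0], corpus)  # appears in every document
score_cat = tf_idf("cat", corpus[0], corpus)  # appears in 2 of 3 documents
score_dog = tf_idf("dog", corpus[1], corpus)  # appears in 1 of 3 documents
print(score_the, score_cat, score_dog)
```

The word "the" appears in every document, so its IDF is ln(1) = 0 and its score vanishes, while the rarer "dog" scores higher than "cat".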

**Pros**
- Accounts for different word importance
- Small range of values in the matrix can speed up convergence

**Cons**
- Discards word order and thus loses context.



### 1.2.1 Word-Level

In [0]:
# vectorizing on a word level
tfidf_vect = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', max_features=5000)
tfidf_vect.fit(trainDF['text'])
xtrain_tfidf =  tfidf_vect.transform(train_x)
xvalid_tfidf =  tfidf_vect.transform(valid_x)

In [0]:
# The first print shows the vocabulary (word -> column index)
# The idf score indicates how rare a word is: words that appear in more documents score lower, with a minimum of 1.0 for a word that appears in every document
print(tfidf_vect.vocabulary_)
print(tfidf_vect.idf_)

[5.85373154 6.62692143 6.54687872 ... 6.80924299 6.22145632 6.77645316]


### 1.2.2 N-gram Level (Basic)

A further level of vectorization is the N-gram approach, which processes text as sequences of words. Simply put, an N-gram is a sequence of N consecutive words. N can be any size from one word upward. For example:

"Doctor" - Uni-gram <br>
"Doctor has" - Bi-gram <br>
"Doctor has evidence" - Tri-gram <br>

The N-gram approach caters for the fact that some words are often found in close proximity to each other and together can have a very different meaning than when processed individually. For example, the bi-gram "not good" has a very different meaning from "not" and "good" processed separately on a word level. The N-gram model allows these word relationships to be recognized within the text.

In most cases, we set an N-gram range that allows N-grams of several sizes to be represented.
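
As a small sketch of what an N-gram range means, the hypothetical helper `ngrams` below lists every run of consecutive words whose length falls within the range. It splits on whitespace only, which is simpler than the tokenization a vectorizer performs.

```python
def ngrams(text, n_min, n_max):
    """Return all word N-grams of text with n_min <= N <= n_max."""
    words = text.split()
    grams = []
    for n in range(n_min, n_max + 1):
        # slide a window of n words across the text
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


grams = ngrams("Doctor has evidence", 1, 3)
print(grams)
# ['Doctor', 'has', 'evidence', 'Doctor has', 'has evidence', 'Doctor has evidence']
```

A range of (2, 3), as used in the next cell, would keep only the bi-grams and the tri-gram from this list.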


In [0]:
# ngram level tf-idf 
tfidf_vect_ngram = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', ngram_range=(2,3), max_features=5000)
tfidf_vect_ngram.fit(trainDF['text'])
xtrain_tfidf_ngram =  tfidf_vect_ngram.transform(train_x)
xvalid_tfidf_ngram =  tfidf_vect_ngram.transform(valid_x)

## 1.3 Customize a Vectorization Function

In [0]:
#only take the top 20000 N-grams and
#ignore words that have a doc freq of less than 2
TOP_K = 20000
MIN_DOC_FREQ = 2

def ngram_vectorize(train_texts, train_labels, val_texts):
    kwargs = {
        'ngram_range' : (1, 2),
        'dtype' : 'int32',
        'strip_accents' : 'unicode',
        'decode_error' : 'replace',
        'analyzer' : 'word',
        'min_df' : MIN_DOC_FREQ,
    }
    
    # Learn Vocab from train texts and vectorize train and val sets
    tfidf_vectorizer = TfidfVectorizer(**kwargs)
    x_train = tfidf_vectorizer.fit_transform(train_texts)
    x_val = tfidf_vectorizer.transform(val_texts)
    
    # Select best k features, with feature importance measured by f_classif (which is the default)
    selector = SelectKBest(f_classif, k=min(TOP_K, x_train.shape[1]))  
    selector.fit(x_train, train_labels)
    x_train = selector.transform(x_train).astype('float32')
    x_val = selector.transform(x_val).astype('float32')
    return x_train, x_val

In [0]:
# Vectorize the data and return tf-idf matrices for the train and validation sets
x_train, x_val = ngram_vectorize(train_x, train_y, valid_x)

