# Natural Language Processing - NLP

NLP is one of the fields that, together with ML, make up the bigger part of AI. It deals with understanding between computers and humans, and interactions between them as such.

NLP is used in fields like Information Engineering, Web Mining, Structure Analysis and Generation, Text Analysis, Speech Recognition, etc.

### Examples of good NLP engines - Siri, Google Assistant, Alexa, etc.

The basic idea behind NLP is to bridge the gap of understanding between humans and machines - making machines independant and to automate everything with textual or voice data, instead of manual interference / effort.

### NLP, in a nutshell, is the key to understanding and building a functional AI. 

Any AI depends on the interactions made by a human and a machine understanding them as such. Natural Language refers to the making, structuring and understanding of a language that we, as humans have developed. We are able to understand each other seamlessly, so long the topic of discussion is generalistic. Imagine teaching the concept of language and understanding to a Machine. NLP basically caters to this genre of intelligence in Data Sciences.

An Advanced version of NLP (NLU) requires indepth understanding of knowledge and logic to form and understand sentences, ability to parse speech data, and recognize each person using 'memories'. Technically, being able to train a machine to understand and learn the way humans do - only much, much faster. So with this in mind, let us take our first steps towards the true building blocks of Artificial Intelligence - the Natural Language Processing.

### As such, let us start with understanding our data -

As much as we shied away from categorical data in Machine Learning, NLP deals with "natural language" data, i.e., mostly categorical data. As such, we need to improvise and develop upon methods that would help us get a better idea of what we are dealing with.

In [1]:
import nltk
import gensim.models
import networkx as nx
import re
import pandas as pd
import regex as regex

### For this introduction to NLP, we will be dealing with specific aspects of data : 
#### Processing Documents,   Building a Sample Information Retrieval model,   Sentiment Analysis,   and Text Analysis. 
### Finally, we conclude by tinkering on Speech data.


# NLP - Part 1 : Processing Documents

#### Let us take a look at our dataset, and basic types of pre-processing we would need to do on these.
- Removing frequent / unimportant words (stopwords)
- Standardizing all words that have the same root (stemming / lemmatizing)
- Getting Sentences/phrases in the document
- Ranking terms / words in the document in order for easier lookup


In [2]:
import collections
from sklearn import metrics
from nltk import ngrams
from stemming.porter2 import stem

In [3]:
raw_training_data, raw_testing_data = [],[]

with open("train.txt") as f:
    train=f.readlines()
    ttrain=f.read()
for item in train:
    raw_training_data.append(item.strip())
    
with open("test.txt") as f:
    test=f.readlines()
for item in test:
    raw_testing_data.append(item.strip())

#get an idea of the data
freq=pd.Series(' '.join(train[1]).split()).value_counts()[:10]
freq=list(freq.index)
wordcount = {}

In [4]:
# eliminate duplicates, split by punctuation and use case demiliters
for word in ttrain.lower().split('\t', 1):
    word = word.replace(".","")
    word = word.replace(",","")
    word = word.replace(":","")
    word = word.replace("\"","")
    word = word.replace("!","")
    word = word.replace("*","")
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
#most common words / features
word_counter = collections.Counter(wordcount)

#clean data

### In the files taken, we see that entirety of the dataset is categorical data. Hence, some cleaning is required.

In [5]:
def clean(raw_data):
    
    #split into labels and text
    labels=[lab.split('\t', 1)[0] for lab in raw_data]
    training_data= [item.split('\t', 1)[1] for item in raw_data]
    
    labels,training_data
    
    #convert to lowercase, stem / lemmatize
    training_data = [i.lower() for i in training_data]
    training_data = [" ".join([stem(word) for word in sentence.split(" ")]) for sentence in training_data]
        
    #replace links, email_id, currencies, entities etc
    training_data=[re.sub(r'[\w\.-]+@[\w\.-]+', "$EMAIL_ID", i) for i in training_data]
    training_data=[re.sub(r"(<?)http:\S+", "$URL", i) for i in training_data]
    training_data=[re.sub(r"\$\d+", "$CURR", i) for i in training_data]
    training_data=[re.sub(r'\b\d+\b', "$NUM", i) for i in training_data]
    training_data=[re.sub(r'\b(me|her|him|us|them|you)\b', "$ENTITIES", i) for i in training_data]
    
    
    #remove punctuation, special chars, tokenize data
    training_data = [regex.sub(r"[^\P{P}$]+", " ", i) for i in training_data]
    training_data = [re.sub(r"[^0-9A-Za-z/$' ]", " ", i) for i in training_data]
    
    #regularize data w.r.t days, times, months and year
    regex_match_days= r'monday|tuesday|wednesday|thursday|friday|saturday|sunday'
    regex_match_times= r'morning|afternoon|evening'
    regex_match_events= r'after|before|during'
    regex_match_month= r'january|february|march|april|may|june|july|august|september|october|november|december'
    
    training_data = [re.sub(regex_match_days, "$DAY", i) for i in training_data]
    training_data = [re.sub(regex_match_times, "$TIMES", i) for i in training_data]
    training_data = [re.sub(regex_match_events, "$EVENTS", i) for i in training_data]
    training_data = [re.sub(regex_match_month, "$MONTH", i) for i in training_data]
    
    #remove extra spaces and blanks
    training_data = [item.strip() for item in training_data]
    
    #return cleaned data
    return training_data, labels

## We see that data has been read into, and cleaning has been completed.

### Now we need to see how to handle cleaned data - we will develop a small phrase extraction algorithm to get phrases from the data. We will train our data in terms of these phrases. 

In [6]:
def get_phrases(text, n):
    
    #can define in two ways - one way is to get both bigrams and tri-grams from a single pass of the function
    #depends on the number of values unpacked by the host OS, so it has been rewritten for a dual pass and as a multi n-gram formation function
    
    n_grams=ngrams(text.split(),n)
    gram_list = []
    for grams in n_grams:
        gram_list.append('_'.join(map(str,grams)))
    gram_string = ' '.join(gram_list)
    #print gram_string
    return gram_string

### To use these phrases as our training data, we need to rank them and make more 'numerical' weights out of them, so that a Machine Learning classifier like SVM or Naive Bayes Classifier can be used directly on our data.

TF-IDF (Term Frequency - Inverse Document Frequency) is one such ranking metric. Basically, the words that appear really often in all documents are given no weights. The rarer the words are, the higher they will have an IDF value. However, the number of times a word is specified in that particular (sentence / phrase / paragraph) document decides it's Term Frequency.

Multiplication of TF with IDF gives a numeric value, which can be used to rank documents. Higher TF-IDF value means higher ranking of the document.


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.cross_validation import train_test_split

ImportError: No module named 'sklearn.cross_validation'

In [15]:
def tf_idf(data):
    
    vector_model= TfidfVectorizer(min_df=1) #single dataframe
    X = vector_model.fit_transform(data) #fit model and transform data to required vector
    
    #only one axis holds the ranking -> X axis
    return X

In [16]:
#get the cleaned data : 
training_data, training_labels = clean(raw_training_data)
testing_data, testing_labels = clean(raw_testing_data)

X = tf_idf(training_data+testing_data)
Y = training_labels+testing_labels

#set random_state to 42 for reproducible results
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

### We will train our model by using the SVM classifier. It is a simple binary classifier, fit for our dataset.

In [9]:
from sklearn.svm import SVC


### Since we do not know what kind of phrases we are working with, let us go with the idea of unigrams, bigrams and trigrams.

This next cell deals with training and model fitting for a unigram phrase - basically, presence of a single word. Although it does not make much sense to have something like this, this acts as a dummy idea for the highest un-augmented data accuracy for the model.

In [11]:
svm=SVC(C=2100, kernel = 'rbf')
svm.fit(x_train, y_train)
print ("SVM classification : \n")
print ("Training Accuracy : ",svm.score(x_train, y_train),"\nTesting Accuracy : ",svm.score(x_test, y_test))
# Training Accuracy : 0.885 , Testing Accuracy : 0.781
#bi-gram and tri-gram SVM model
total_data = training_data+testing_data
total_labels = training_labels+testing_labels

bi_gram_data = [get_phrases(item,2) for item in total_data]
tri_gram_data = [get_phrases(item,3) for item in total_data]

X2, Y2 = tf_idf(bi_gram_data), total_labels
X3, Y3 = tf_idf(tri_gram_data), total_labels

x_train1, x_test1, y_train1, y_test1 = train_test_split(X2, Y2, test_size=0.3, random_state=42)
x_train2, x_test2, y_train2, y_test2 = train_test_split(X3, Y3, test_size=0.3, random_state=42)

NameError: name 'x_train' is not defined

### Let us build the same model fit for bigrams and trigrams now.

In [21]:
best_c={}
for increment in range(1000, 15000, 100):
    svm=SVC(C=increment, kernel = 'rbf')
    svm.fit(x_train1, y_train1)
    tr_ac, te_ac = svm.score(x_train1, y_train1), svm.score(x_test1, y_test1)
    best_c[str(increment)]=te_ac
import operator
print " Max C value : ",max(best_c.iteritems(),key=operator.itemgetter(1))[0]

#SVM C value for tri-grams, takes time to compute. Comment it out till the print statement if you dont want to run time constraints
best_tri_c={}
for increment in range(10000, 30000, 100):
    svm=SVC(C=increment, kernel = 'rbf')
    svm.fit(x_train1, y_train1)
    tr_ac, te_ac = svm.score(x_train1, y_train1), svm.score(x_test1, y_test1)
    best_tri_c[str(increment)]=te_ac
import operator
print " Max C value trigrams : ",max(best_tri_c.iteritems(),key=operator.itemgetter(1))[0] 


 Max C value :  14600
 Max C value trigrams :  23200


In [22]:
#bi-gram SVM max value as C = 14600.
svm=SVC(C=14600, kernel='rbf')
svm.fit(x_train1, y_train1)
print "SVM Classification for Bi-Grams : "
print "Training accuracy : ",svm.score(x_train1, y_train1) # 0.9797
print "Testing accuracy : ",svm.score(x_test1, y_test1) # 0.7828

#tri-gram SVM max value as C = 23200.
svm=SVC(C=23200, kernel='rbf')
svm.fit(x_train2, y_train2)
print "\nSVM Classification for Tri-Grams : "
print "Training accuracy : ",svm.score(x_train2, y_train2) # 0.9929
print "Testing accuracy : ",svm.score(x_test2, y_test2) # 0.7391

SVM Classification for Bi-Grams : 
Training accuracy :  0.9797172710510141
Testing accuracy :  0.7827956989247312

SVM Classification for Tri-Grams : 
Training accuracy :  0.9929317762753535
Testing accuracy :  0.739068100358423


### From this, we see that even though the training accuracy is really high, our testing accuracy is reducing. We can assume from this that since the data we have is categorical, formation of higher order n_grams would reduce the model accuracy.

Hence, it is a better idea for us to use the bi_gram model fit rather than the tri_gram model fit, because bi_grams give better accuracy difference between the train / test values, and still maintain an intuitive logic for presentation.

# NLP - Part 2 : Building a Sample Information Retrieval Model

#### Let us take a look at our dataset, and basic types of pre-processing we would need to do on these.
- Removing / transformation of data on the basis of encoding
- Extract words from the encoded data
- Removing Stopwords and forming regex compilation patterns to look for in data
- Building the solution in a structured format

In [23]:
import re
import json
from collections import defaultdict
from nltk import ngrams
from nltk.tokenize import word_tokenize
from collections import OrderedDict

### My data is encoded in a different format for the given bytestream, so I will decode the data to direct python objects. For those of you who do not have this problem, skip to the next cell.

In [None]:
DATAPOINTS = ["delivery_currency", "delivery_amount", "delivery_rounding", "return_currency", "return_amount", "return_rounding"]

def read_json(filename):

    with open(filename,"r") as f_p:
        list_isda_data = json.load(f_p)
    print "LIST DATA : unicode characters detected. Manually removing them."
    print "Cleaned JSON data..."
    cleaned_data = json_load_byteified(open(filename));
    print "Sending cleaned data for information extraction"

    return cleaned_data

def json_load_byteified(file_handle): #if you want to check with filenames
    return convbytes(json.load(file_handle, object_hook=convbytes),ignore_dicts=True)

def json_loads_byteified(json_text): #if you want to check with text data without filenames
    return convbytes(json.loads(json_text, object_hook=convbytes),ignore_dicts=True)

def convbytes(data, ignore_dicts=False):
    if isinstance(data,unicode):
        return data.encode('utf-8')
    if isinstance(data, list):
        return [ convbytes(item, ignore_dicts=True) for item in data ]
    if isinstance(data, dict) and not ignore_dicts:
        return { convbytes(key, ignore_dicts=True): convbytes(value, ignore_dicts=True) for key,value in data.iteritems() }
    return data

In [31]:
def extract(data):
    #we have the cleaned data from read_json()
    #we need to get the data we need for stopwords, i.e, the text needs to be processed.

    #SOLUTION and Explanation : 
    #remove stopwords first and then use regex.compile and pattern.check to find currency, rounding type and amount.	
    #amount appears right after currency, so extract format should be `currency`:`amount`. use regex.compile for this.
    #if currency regex matches more than once, then compare return_curr with delivery_curr
    #if "rounding" occurs anywhere, note the type. if type is not mentioned, take nearest as default.
    #if "rounding" occurs multiple times, take the first and last occurrance of rounding, one will be deliver, the other will be return
    #match amount regex if it appears multiple times. first occurrance is delivery, next occurrance is return.

    #start by finding stopwords. Take the stopwords file. Note: not using the stopwords from nltk.corpus since it has a few common words missing.
    stopwords,finwords=[],[]
    with open("stopwords.txt","r+") as stop:
        for line in stop:
            for word in line.split():
                stopwords.append(word)

    #extract "text" from the data object parsed from read_json().
    #remove stopwords from "text" and tokenize words, form bi-grams for regex matching at currency and rounding.
    predicted_output=[]
    for i in range(len(data)):
        text_data=[datum['text']for datum in data][i].lower()
        print "\nTEXT DATA : \n",text_data
        word_tokens=word_tokenize(text_data)
        filtered_sentence=[w for w in word_tokens if not w in stopwords]
        #print "\nGetting bi-grams"
        #bi-grams are a text input mechanism, so we will do a copy from list -> text as well to send to the param.

        st_text=""
        info_list=[]
        for item in filtered_sentence:
            st_text=st_text+" "+item
        info_list = get_bigrams(st_text)
        #bi-grams have been formed, returned. Now a simple return and regex check for the phrases we are looking for :

        curr_type_and_amount, rounding_type, it, it1 = [],[],0,0
        for item in info_list:
            del_curr = re.match("[a-z]{3}_[0-9]",item)
            if del_curr:
                curr_type_and_amount.insert(it+1,item)
            if item == "rounded_up" or item == "rounded_down" or item == "rounded_nearest" or item== "amount_rounded":
                rounding_type.insert(it1+1,item)
        #currency type and amount has been found. Let us move to delivery type and amount.
        #if currency type and amount is found more than once, the latter type and amount is return, not delivery
        #if rounded type is found more than once, former is delivery, latter is return
        #if found only once, that means return param = del param.

        #print "CURR TYPE AND AMOUNT : \n",curr_type_and_amount
        #print "ROUNDING TYPE: \n",rounding_type
        #search for number of items and decide what item could be where.

        # INTUITION p1:

        #1st item in list1 would be currency type + amount, 2nd list would be rounding type.
        #if there is a 2nd item in list1, it could be return curr/amount.
        #if there is a 2nd item in list2, it could be return rounding type.

        # INTUITION p2:

        #if there is only one item in both lists, return param = delivery param.
        #if there is only one item in list1, and no items in list2, return param = delivery param AND rounding type =  nearest.
        datap=OrderedDict()
        """"""
        datap["delivery_currency"]=""
        datap["delivery_amount"]=""
        datap["delivery_rounding"]=""
        datap["return_currency"]=""
        datap["return_amount"]=""
        datap["return_rounding"]=""

        if len(rounding_type) == 0 or "amount_rounded" in rounding_type:
            datap["delivery_rounding"]="nearest"
            datap["return_rounding"]="nearest"
        else :
            datap["delivery_rounding"]=rounding_type[0][8:] 
            datap["return_rounding"]=rounding_type[len(rounding_type)-1][8:]

        datap["delivery_currency"]=curr_type_and_amount[0][0:3].upper()
        datap["delivery_amount"]=curr_type_and_amount[0][4:]

        datap["return_currency"]=curr_type_and_amount[len(curr_type_and_amount)-1][0:3].upper()
        datap["return_amount"]=curr_type_and_amount[len(curr_type_and_amount)-1][4:]

        for key,value in datap.items():
            print key,": ",value

        for datum in data:
            predicted_output.append(datap)
    final_output = []
    for i in range(len(predicted_output)):
        if i%len(data)==0:
            final_output.append(predicted_output[i])
    return final_output

In [32]:
def get_bigrams(text):
    bi_grams=ngrams(text.split(),2) #set number type of n-grams. We want bi-grams, hence we are using '2' as our param.
    l=[]
    for grams in bi_grams:
        l.append('_'.join(map(str,grams)))
#f1= ' '.join(l) = string output for the bi-grams. We want a list output to re.match() for delivery amount, etc.
#print l
    return l

In [34]:
def evaluate(input_data, predicted_output):
    result = defaultdict(lambda: 0)
    for i, input_instance in enumerate(input_data):
        for key in DATAPOINTS:
            #print "INPUT INSTANCE :",input_instance[key],"\nPred out: ",predicted_output[i/3][key],"\n\n"
            if input_instance[key] == predicted_output[i][key]:
                result[key] += 1

    # compute the accuracy for each datapoint
    print "\nACCURACY : \n"
    for key in DATAPOINTS:
        print(key, 1.0 * result[key] / len(input_data))

    return result
fname=raw_input("Enter filename you want to use for the IR module : ")
data=read_json(fname)
pred_out=extract(data)
result=evaluate(data,pred_out)

Enter filename you want to use for the IR module : isda_data.json
LIST DATA : unicode characters detected. Manually removing them.
Cleaned JSON data...
Sending cleaned data for information extraction

TEXT DATA : 
rounding. the delivery amount and the return amount will be rounded to the nearest integral multiple o f eur 100,000; provided that if an amount corresponds to the exact half o f such multiple, then it will be rounded up; and provided further that, for the purpose o f the calculation o f the return amount where a party's credit support amount is, or is deemed to be, zero, the return amount shall not be rounded.
delivery_currency :  EUR
delivery_amount :  100,000
delivery_rounding :  nearest
return_currency :  EUR
return_amount :  100,000
return_rounding :  nearest

TEXT DATA : 
rounding. the delivery amount and the return amount will be rounded to the nearest integral multiple of usd 10,000 provided that if an amount corresponds to the exact half of such multiple, then it wil

# NLP - Part 3 : Sentiment Analysis

#### Let us take at what we would need to do for this.
- 3 Basic types of sentiments
- Compound data
- Use of  VADER for Sentiment Analysis
- Unsupervised dataset classification with -1,0,+1 classifications

### Sentiment Analysis and the Stigma behind it :

SA has always been a buzz word so long the community at large has been concerned. It is perhaps the most well known version of NLP. It works on the basis of Opinion Mining, understanding the basic attitude of the speaker.

We will make our job easier using the VADER library, although SA has multiple hard learning aspects to it. VADER works on rule-based lexicon for SA, working on their semantic orientation, giving it a rating of positive / negative / neutral and a compound score. 

Compound Score is nothing but sum of all the lexicon ratings in that particular document / sentence, normalized between -1 to 1 (Extreme Negative to Extreme Positive). 

- On average, a positive sentiment has a Compound Score > 0.05,
- A negative sentiment has a Compound score <= -0.05.
- A neutral sentiment lies between these two.

In [12]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
from io import open

def sentiment_scores(sentence): 
  
    sid_obj = SentimentIntensityAnalyzer() 
  
    # polarity_scores method of SentimentIntensityAnalyzer - has positive, negative and compound scores
    sentiment_dict = sid_obj.polarity_scores(sentence) 
    
    print "Sentence Sent : ",sentence
    print "Overall sentiment dictionary is : ", sentiment_dict
    print "sentence was rated as ", sentiment_dict['neg']*100, "% Negative" 
    print "sentence was rated as ", sentiment_dict['neu']*100, "% Neutral" 
    print "sentence was rated as ", sentiment_dict['pos']*100, "% Positive" 
  
    print "Sentence Overall Rated As", 
  
    if sentiment_dict['compound'] >= 0.05 : 
        print("Positive") 
  
    elif sentiment_dict['compound'] <= - 0.05 : 
        print("Negative") 
  
    else : 
        print("Neutral") 

SyntaxError: Missing parentheses in call to 'print' (<ipython-input-12-81d21853c065>, line 11)

### Let us check the SA code with random sentences. 

# NLP - Part 4 : Text Analysis

#### Let us take at what we would need to do for this.
- Remove delimiters
- Clean the text , preprocess it
- Stem (or Lemmatize)
- Convert it all to lower case for standardization
- Tokenize, make a bag of words
- Split datasets into training / testing
- Build a Prediction Model (RF Classifier), Test your code!

In [47]:
import numpy as np   
import pandas as pd  
  
# Import dataset 
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t')  
import re  
  
# Natural Language Tool Kit 
import nltk  
  
nltk.download('stopwords') 
  
# to remove stopword 
from nltk.corpus import stopwords 
  
# for Stemming propose  
from nltk.stem.porter import PorterStemmer 
  
# Initialize empty array 
# to append clean text  
corpus = []  
  
# 1000 (reviews) rows to clean 
for i in range(0, 1000):  
      
    # column : "Review", row ith 
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])  
      
    # convert all cases to lower cases 
    review = review.lower()  
      
    # split to array(default delimiter is " ") 
    review = review.split()  
      
    # creating PorterStemmer object to 
    # take main stem of each word 
    ps = PorterStemmer()  
      
    # loop for stemming each word 
    # in string array at ith row     
    review = [ps.stem(word) for word in review 
                if not word in set(stopwords.words('english'))]  
                  
    # rejoin all string array elements 
    # to create back into a string 
    review = ' '.join(review)   
      
    # append each string to create 
    # array of clean text  
    corpus.append(review)

[nltk_data] Downloading package stopwords to /home/viole/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [48]:
from sklearn.feature_extraction.text import CountVectorizer 

# "max_features" is attribute to experiment with to get better results 
cv = CountVectorizer(max_features = 1500)  

X = cv.fit_transform(corpus).toarray()  
y = dataset.iloc[:, 1].values 

from sklearn.cross_validation import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25) 

In [49]:
from sklearn.ensemble import RandomForestClassifier 
  
# n_estimators can be said as number of trees
model = RandomForestClassifier(n_estimators = 501, criterion = 'entropy')                              
model.fit(X_train, y_train) 

  from numpy.core.umath_tests import inner1d


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=501, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

### Final Predictions !


In [50]:
y_pred = model.predict(X_test) 
  
y_pred 

array([1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
       1, 1, 0, 1, 0, 0, 0, 1])

### Get an Idea of how many of the sentences were classified correctly ! - Using Confusion Matrices

In [52]:
from sklearn.metrics import confusion_matrix 
  
# TP, FN, FP, TN
cm = confusion_matrix(y_test, y_pred) 
print cm

[[92 29]
 [36 93]]


In [None]:
from collections import Counter 
  
def count_words(text):                   #counts word frequency 
    skips = [".", ", ", ":", ";", "'", '"'] 
    for ch in skips: 
        text = text.replace(ch, "") 
    word_counts = {} 
    for word in text.split(" "): 
        if word in word_counts: 
            word_counts[word]+= 1 
        else: 
            word_counts[word]= 1 
    return word_counts 
  
    # >>>count_words(text)  You can check the function 
  
  
def count_words_fast(text):      #counts word frequency using Counter from collections 
    text = text.lower() 
    skips = [".", ", ", ":", ";", "'", '"'] 
    for ch in skips: 
        text = text.replace(ch, "") 
    word_counts = Counter(text.split(" ")) 
    return word_counts

def read_book(title_path):  #read a book and return it as a string 
    with open(title_path, "r", encoding ="utf8") as current_file: 
        text = current_file.read() 
        text = text.replace("\n", "").replace("\r", "") 
        return text

def word_stats(word_counts):     # word_counts = count_words_fast(text)    
    num_unique = len(word_counts) 
    counts = word_counts.values() 
    return (num_unique, counts) 


