# Reading the Dataset

In [53]:
reviews_train = []
reviews_test = []
for line in open('full_train.txt','r', encoding="utf8"):
    reviews_train.append(line.strip())
for test in open('full_test.txt','r', encoding="utf8"):
    reviews_test.append(test.strip())  

UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the ‘8’ means that 8-bit values are used in the encoding.

The strip() method returns a copy of the string by removing both the leading and the trailing characters (based on the string argument passed).  

# Length of the Training and Testing Data

In [54]:
print(len(reviews_train))
print(len(reviews_test))

25000
25000


# Cleaning the Data

In [55]:
import re

REPLACE_NO_SPACE = re.compile("[.;:!\'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile("(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    
    return reviews

In [56]:
clean_reviews_train = preprocess_reviews(reviews_train)
clean_reviews_test = preprocess_reviews(reviews_test)

In [57]:
clean_reviews_train[0]

'bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell highs satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled  at  high a classic line inspector im here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isnt'

The re.compile() method is used to compile a regular expression pattern provided as a string into a regex pattern object.

In the above code we see that after quotes we have [ ] which mean those square barckets hold list of characters in them. ( ) hold specific patterns within them.

The .sub() method will replace all those characters with a blank and in the line that follows .sub() will replace all those patterns with a space.

# Text Processing

### 1) Remove Stop Words

Stop words are the very common words like ‘if’, ‘but’, ‘we’, ‘he’, ‘she’, and ‘they’. We can usually remove these words without changing the semantics of a text.

In [58]:
from nltk.corpus import stopwords
english_stop_words = stopwords.words('english')
len(english_stop_words)

179

In [59]:
def stopwords_removal(corpus):
    remove_stop_words = []
    modified = ''
    for review in corpus:
        modified = ''
        words = review.split()
        for word in words:
            if word not in english_stop_words:
                modified = modified + word + " "
        remove_stop_words.append(modified)
    return remove_stop_words    

In [60]:
text_processed = stopwords_removal(clean_reviews_train)

### 2) Normalization

Normalzation is the process to convert all of the different forms of a given word into one.

It can be implemented in 2 ways mainly Stemming and Lemmitization.

#### 1) Stemming 

Stemmers remove morphological affixes from words, leaving only the word stem. Basically prefixes and suffixes are removed leaving behind the word stem. We will be using the Porter Stemmer and Snowball Stemmer on our dataset.

In [61]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize
porter = PorterStemmer()
snowball =SnowballStemmer('english')

def porter_stemmed_reviews(corpus):
    porter_final = []
    for review in corpus:
        words = word_tokenize(review)
        modified_porter = ''
        for w in words:
            porter_stem = porter.stem(w)
            modified_porter = modified_porter + porter_stem + " "
        porter_final.append(modified_porter)    
    return porter_final
def snowball_stemmed_reviews(corpus):
    snowball_final = []
    for review in corpus:
        words = review.split()
        modified_snowball = ''
        for w in words:
            snowball_stem = snowball.stem(w)
            modified_snowball = modified_snowball + snowball_stem + " "
        snowball_final.append(modified_snowball)
    return snowball_final    

In [62]:
stemmed_porter = porter_stemmed_reviews(text_processed)
stemmed_snowball = snowball_stemmed_reviews(text_processed)

Porter Stemmer removes suffixes from the word to reduce it to its word stem.

Snowball Stemmer is a better and an updated version of Porter. It is called as Porter2 algorithm.

#### 2) Lemmatization

Lemmatization works by identifying the part-of-speech of a given word and then applying more complex rules to transform the word into its true root.

In [63]:
from nltk.stem import WordNetLemmatizer
lemmatized  = WordNetLemmatizer()
def lemmatized_reviews(corpus):
    final_lemmatized = []
    for review in corpus:
        words = review.split()
        modified_lemmatized = ''
        for w in words:
            lemmatized_text = lemmatized.lemmatize(w)
            modified_lemmatized = modified_lemmatized + lemmatized_text + " "
        final_lemmatized.append(modified_lemmatized)
    return final_lemmatized                                

In [64]:
lem_final = lemmatized_reviews(text_processed)

In contrast to stemming, lemmatization is a lot more powerful. It looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words, aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

An inflectional ending is a word part that is added to the end of a base word that changes the number or tense of a base word. A base word can stand alone and has meaning (for example, cat, bench, eat, walk).

# Vectorization 

### Bag of Words Model

The basic definition of Vectorization is to represent text in numerical format. Implementation would be creating one large matrix containing one column of all the unique words. Next is to create rows containing sequence of 0's and 1's. 1 if the word in your corpus is present in that respective column of the matrix and a 0 if it is absent. This is One Hot Encoding.

Bag of Words is used to turn text files into numerical vectors or a bag of words.
The Bag of Words (BoW) model is the most basic type of numerical text representation. A phrase can be represented as a bag of words vector, just like the term itself (a string of numbers).

In [65]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(binary = True)
cv.fit(text_processed)
vector_train = cv.transform(clean_reviews_train)
vector_test = cv.transform(clean_reviews_test)

In [66]:
print('CountVectorizerTrain(n_grams=1):',vector_train.shape)
print('CountVectorizerTest(n_grams=1):',vector_test.shape)

CountVectorizerTrain(n_grams=1): (25000, 92688)
CountVectorizerTest(n_grams=1): (25000, 92688)


In the above code we used only single word features in our model, which we call 1-grams or unigrams. We can potentially add more predictive power to our model by adding two or three word sequences (bigrams or trigrams) as well. For example, if a review had the three word sequence “didn’t love movie” we would only consider these words individually with a unigram-only model and probably not capture that this is actually a negative sentiment because the word ‘love’ by itself is going to be highly correlated with a positive review.

We are increasing the words in our matrix to two or three word sequences. The n-grams parameter allows us to use more than one word sequence.

In [67]:
cv1 = CountVectorizer(binary = False, min_df=0,max_df=1,ngram_range=(1,3))
cv1.fit(clean_reviews_train)
inc_vector_train = cv1.transform(clean_reviews_train)
inc_vector_test = cv1.transform(clean_reviews_test)

In [68]:
print('CountVectorizerTrain(n_grams=1-3):',inc_vector_train.shape)
print('CountVectorizerTest(n_grams=1-3):',inc_vector_test.shape)

CountVectorizerTrain(n_grams=1-3): (25000, 4373123)
CountVectorizerTest(n_grams=1-3): (25000, 4373123)


### TF-IDF Vectorizer

The term tf–idf stands for term frequency–inverse document frequency, it is a mathematical statistic that is planned to reflect how significant a word is to a record in a collection or corpus. The tf–idf esteem builds proportionally to the number of times a word shows up in the document.

Term Frequency (tf) - It gives us the recurrence of the word in each report in the corpus. It is the proportion of the number of times the word shows up in a report contrasted with the all-out the number of words in that record. It increments as the quantity of events of that word inside the record increments.

Inverse Document Frequency (idf) - It is used to figure the heaviness of uncommon words over all reports in the corpus. The words that happen seldom in the corpus have a high IDF score.

In [69]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,3))
tv_train_reviews=tv.fit_transform(clean_reviews_train)
tv_test_reviews=tv.transform(clean_reviews_test)

In [70]:
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)

Tfidf_train: (25000, 4373123)
Tfidf_test: (25000, 4373123)


# Building the Classifier using Logistic Regression

In [72]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

In [73]:
target= []
for i in range(25000):
    if i < 12500:
        sentiment = 1
    else:
        sentiment = 0
    target.append(sentiment)

The targets/labels we use will be the same for training and testing because both datasets are structured the same, where the first 12.5k are positive and the last 12.5k are negative.

In [74]:
X_train,X_val,y_train,y_val = train_test_split(inc_vector_train,target,test_size = 0.25)

In [75]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train,y_train)
    predict = lr.predict(X_val)
    print("Accuracy for C={} : {}".format(c,accuracy_score(y_val,predict)))

Accuracy for C=0.01 : 0.50496
Accuracy for C=0.05 : 0.50496
Accuracy for C=0.25 : 0.50496
Accuracy for C=0.5 : 0.50496
Accuracy for C=1 : 0.50496


In [76]:
final_model = LogisticRegression(C=1)
final_model.fit(inc_vector_train,target)
pred = final_model.predict(inc_vector_test)
print('Accuracy on the Test Data is:{}'.format(accuracy_score(target,pred)))

Accuracy on the Test Data is:0.64664


In [77]:
X_train,X_val,y_train,y_val = train_test_split(tv_train_reviews,target,test_size = 0.25)

In [78]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train,y_train)
    predict = lr.predict(X_val)
    print("Accuracy for C={} : {}".format(c,accuracy_score(y_val,predict)))

Accuracy for C=0.01 : 0.49376
Accuracy for C=0.05 : 0.49376
Accuracy for C=0.25 : 0.49376
Accuracy for C=0.5 : 0.49376
Accuracy for C=1 : 0.49376


In [79]:
final_model = LogisticRegression(C=0.5)
final_model.fit(tv_train_reviews,target)
pred = final_model.predict(tv_test_reviews)
print('Accuracy on the Test Data is:{}'.format(accuracy_score(target,pred)))

Accuracy on the Test Data is:0.71736


In [80]:
from sklearn.svm import SVC

In [81]:
X_trian,X_val,y_train,y_val = train_test_split(inc_vector_train,target,test_size = 0.25,random_state = 101)

In [83]:
for c in [0.01,0.05,0.25,0.5,1]:
    svm = SVC(C=c)
    svm.fit(X_train,y_train)
    predict = svm.predict(X_val)
    print('Accuracy for C={} : {}'.format(c,accuracy_score(y_val,predict)))

Accuracy for C=0.01 : 0.49168
Accuracy for C=0.05 : 0.49168
Accuracy for C=0.25 : 0.48736
Accuracy for C=0.5 : 0.48736
Accuracy for C=1 : 0.48736


In [87]:
final_model = SVC(C=0.01)
final_model.fit(inc_vector_train,target)
pred = final_model.predict(inc_vector_test)
print('Accuracy on Test Data is {}'.format(accuracy_score(target,pred)))

Accuracy on Test Data is 0.50064


In [91]:
X_train,X_val,y_train,y_val = train_test_split(tv_train_reviews,target,test_size = 0.25,random_state = 101)

In [92]:
for c in [0.01,0.05,0.25,0.5,1]:
    svm = SVC(C=c)
    svm.fit(X_train,y_train)
    predict = svm.predict(X_val)
    print('Accuracy for C={} : {}'.format(c,accuracy_score(y_val,predict)))

Accuracy for C=0.01 : 0.49168
Accuracy for C=0.05 : 0.49168
Accuracy for C=0.25 : 0.49168
Accuracy for C=0.5 : 0.49168
Accuracy for C=1 : 0.49168


In [94]:
final_model = SVC(C=1)
final_model.fit(tv_train_reviews,target)
pred = final_model.predict(tv_test_reviews)
print('Accuracy on Test Data is {}'.format(accuracy_score(target,pred)))

Accuracy on Test Data is 0.50392


As we can see we were able to achieve an accuracy of 72% with Logistic Regression and 51% with Support Vector Machines.