<a href="https://colab.research.google.com/github/arjangvt/CodeFolio/blob/main/ML/NLP/TextClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Here I use NLTK for basic text and language processing.
I keep these examples as my own reference.

Note:
Typically, if you install NLTK on your personal computer, you don't need to install each package separately (see the NLTK installation Colab file). Since Colab does not persist the installation, we have to install the necessary packages separately. <br>

Written by: Arjang Fahim <br>
Date: 5/10/2020

In this demo, the movie reviews dataset is classified using the NLTK API. <br>
The dataset is available in the NLTK package, and each review is labeled as negative or positive. <br>
This is a simple Naive Bayes classification with no further attempt to improve the model. <br>

Written by: Arjang Fahim <br>
Date: 2/2/2020


# Text Classification

In [4]:
import nltk
import random

from nltk.corpus import movie_reviews

nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


True

In [2]:
# Create a list of (word_list, category) tuples
# The words of each review serve as features
documents = []

In [5]:
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append((list(movie_reviews.words(fileid)), category))

In [6]:
# Shuffle the documents
random.shuffle(documents)
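
The shuffle above draws on the global random state, so the later train/test split can differ from run to run. A minimal sketch of a repeatable shuffle, assuming a fixed seed is acceptable for a demo. The `toy_documents` list, the `shuffled_copy` helper, and the seed value are illustrative and not part of the original notebook.

```python
import random

# Hypothetical stand-in for the (word_list, category) tuples built above.
toy_documents = [
    (["great", "film"], "pos"),
    (["dull", "plot"], "neg"),
    (["loved", "it"], "pos"),
    (["awful", "acting"], "neg"),
]

def shuffled_copy(docs, seed):
    """Return a shuffled copy of docs using a private, seeded generator."""
    rng = random.Random(seed)   # private RNG; the global random state is untouched
    copy = list(docs)
    rng.shuffle(copy)
    return copy

first = shuffled_copy(toy_documents, seed=42)
second = shuffled_copy(toy_documents, seed=42)
print(first == second)  # True: the same seed gives the same order
```

Using a private `random.Random` instance keeps the seed from affecting any other code that relies on the `random` module.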

In [7]:
# Normalize the dataset by converting all words to lower case

all_words = []

In [8]:
for w in movie_reviews.words():
    all_words.append(w.lower())

In [9]:
# nltk frequency distribution
all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))
print(all_words['love'])

[(',', 77717), ('the', 76529), ('.', 65876), ('a', 38106), ('and', 35576), ('of', 34123), ('to', 31937), ("'", 30585), ('is', 25195), ('in', 21822), ('s', 18513), ('"', 17612), ('it', 16107), ('that', 15924), ('-', 15595)]
1119
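
`nltk.FreqDist` is built on `collections.Counter`, so the counting behavior used above can be illustrated with the standard library alone. A small sketch on a made-up token list (`toy_words` is hypothetical and not taken from the corpus):

```python
from collections import Counter

# Hypothetical tokens; the real notebook uses movie_reviews.words().
toy_words = ["The", "film", "is", "good", ",", "the", "cast", "is", "good", "."]

# Lower-case before counting so "The" and "the" are counted together.
counts = Counter(w.lower() for w in toy_words)

print(counts.most_common(3))  # the three most frequent tokens with their counts
print(counts["good"])         # 2
print(counts["love"])         # 0: a missing word counts as zero, no KeyError
```

As in the real output above, punctuation and function words dominate the top of the list unless they are filtered out first.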


In [10]:
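# Limit the vocabulary to 3000 words. In NLTK 3 the keys follow insertion order, so this
# keeps the first 3000 distinct words seen; all_words.most_common(3000) gives the most frequent.
word_features = list(all_words.keys())[:3000]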
print(word_features)
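
Slicing the keys and taking the most common entries are not the same thing when the counter keeps insertion order, as `collections.Counter` does (and, as far as I know, `FreqDist` in NLTK 3 as well). The sketch below shows the difference on a hypothetical token stream. It does not use the corpus.

```python
from collections import Counter

# Toy token stream; "rare" appears first but only once.
tokens = ["rare", "the", "movie", "the", "was", "the", "movie"]
counts = Counter(tokens)

n = 2
first_seen = list(counts.keys())[:n]                   # insertion order
most_frequent = [w for w, _ in counts.most_common(n)]  # frequency order

print(first_seen)     # ['rare', 'the']
print(most_frequent)  # ['the', 'movie']
```

Which of the two is wanted depends on the goal. A frequency-based vocabulary is the more common choice for bag-of-words features.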



In [11]:
# Find features within the documents

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
        
    return features
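
To make the shape of the feature dictionary concrete, here is the same idea on a three-word vocabulary. The names `toy_vocabulary` and `find_toy_features` are hypothetical; the function takes the vocabulary as an argument instead of reading the global `word_features`.

```python
# Hypothetical three-word vocabulary standing in for word_features.
toy_vocabulary = ["good", "bad", "plot"]

def find_toy_features(document, vocabulary):
    """Map each vocabulary word to True/False depending on its presence."""
    words = set(document)  # set membership tests are fast and ignore repeats
    return {w: (w in words) for w in vocabulary}

review = ["the", "plot", "was", "good", "good"]
print(find_toy_features(review, toy_vocabulary))
# {'good': True, 'bad': False, 'plot': True}
```

Each review becomes a dictionary with one boolean per vocabulary word, so word counts and word order are discarded.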

In [12]:
print(find_features(movie_reviews.words('pos/cv000_29590.txt')))



In [13]:
featuresets = [(find_features(rev), category) for (rev, category) in documents]
print(featuresets[0])



In [14]:
# Split the data into training and test sets
train_set = featuresets[:1900]
test_set = featuresets[1900:]
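
The hard-coded index 1900 works because the movie reviews corpus contains 2000 reviews, which leaves 100 for testing. A sketch of the same split expressed as a fraction, assuming the list has already been shuffled; `split_train_test` and `toy_items` are illustrative names.

```python
def split_train_test(items, train_fraction=0.95):
    """Split a list into (train, test) at the given fraction, keeping order."""
    cut = round(len(items) * train_fraction)
    return items[:cut], items[cut:]

toy_items = list(range(2000))  # stand-in for the 2000 feature sets
train, test = split_train_test(toy_items)
print(len(train), len(test))   # 1900 100
```

Expressing the split as a fraction keeps the code valid if the number of documents changes.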

In [15]:
# Train the classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)
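
For intuition about what the training call does, here is a minimal pure-Python Naive Bayes over boolean features. This is a sketch and not NLTK's implementation: it assumes add-one smoothing, whereas NLTK uses its own probability estimators, and every name in it (`train_naive_bayes`, `classify`, `toy_train`) is hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count labels and, per (label, feature), how often each value occurs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> Counter of values
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
    return label_counts, value_counts

def classify(model, features):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)  # log prior
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Boolean features have 2 possible values, hence the "+ 2".
            score += math.log((seen + 1) / (n_label + 2))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data in the same (features, label) shape as featuresets.
toy_train = [
    ({"good": True, "bad": False}, "pos"),
    ({"good": True, "bad": False}, "pos"),
    ({"good": False, "bad": True}, "neg"),
    ({"good": False, "bad": True}, "neg"),
]
model = train_naive_bayes(toy_train)
print(classify(model, {"good": True, "bad": False}))  # pos
print(classify(model, {"good": False, "bad": True}))  # neg
```

The "naive" part is the assumption that features are independent given the label, which is why the per-feature log probabilities are simply added.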

In [16]:
# Test the accuracy

print('Accuracy:', nltk.classify.accuracy(classifier=classifier, gold=test_set) * 100)

classifier.show_most_informative_features(30)


Accuracy: 80.0
Most Informative Features
                  regard = True              pos : neg    =     11.7 : 1.0
                   sucks = True              neg : pos    =      9.3 : 1.0
                  annual = True              pos : neg    =      9.1 : 1.0
           unimaginative = True              neg : pos    =      7.6 : 1.0
              schumacher = True              neg : pos    =      7.4 : 1.0
             silverstone = True              neg : pos    =      7.0 : 1.0
                 idiotic = True              neg : pos    =      7.0 : 1.0
                   kudos = True              pos : neg    =      6.6 : 1.0
               atrocious = True              neg : pos    =      6.6 : 1.0
                  turkey = True              neg : pos    =      6.5 : 1.0
                 cunning = True              pos : neg    =      6.4 : 1.0
                 singers = True              pos : neg    =      6.4 : 1.0
                  suvari = True              neg : pos    =