
In [1]:
## Getting the dataset
import os
import sys
import tarfile
import time
import urllib.request

source = 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
target = 'aclImdb_v1.tar.gz'

In [2]:
if os.path.exists(target):
  os.remove(target)

def reporthook(count, block_size, total_size):
  global start_time
  if count == 0:
    start_time = time.time()
    return
  duration = time.time() - start_time
  progress_size=int(count * block_size)
  speed= progress_size / (1024. **2 * duration)
  percent = count * block_size * 100./ total_size

  sys.stdout.write(f'\r{int(percent)}% | {progress_size / (1024. **2):.2f} MB '
                  f' | {speed:.2f} MB/s | {duration:.2f} sec elapsed')
  sys.stdout.flush()


if not os.path.isdir('aclImdb') and not os.path.isfile('aclImdb_v1.tar.gz'):
  urllib.request.urlretrieve(source, target, reporthook)

100% | 80.23 MB  | 2.24 MB/s | 35.81 sec elapsed

In [3]:
if not os.path.isdir('aclImdb'):

    with tarfile.open(target, 'r:gz') as tar:
        tar.extractall()

## Data Preprocessing

In [4]:
#install pyprind, a Python progress indicator
!pip install pyprind

Collecting pyprind
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3


In [5]:
#read the individual review files into a single DataFrame
import pyprind
import pandas as pd
import os
import sys
from packaging import version

basepath='aclImdb'

labels = {'pos': 1, 'neg':  0}

#if the progress bar does not show, change stream=sys.stdout to stream = 2
pbar=pyprind.ProgBar(50000, stream=sys.stdout)

df= pd.DataFrame()
for s in ('test', 'train'):
  for l in ('pos', 'neg'):
    path = os.path.join(basepath, s, l)
    for file in sorted(os.listdir(path)):
      with open(os.path.join(path, file),
                'r', encoding='utf-8') as infile:
          txt=infile.read()

      if version.parse(pd.__version__) >= version.parse("1.3.2"):
        x=pd.DataFrame([[txt, labels[l]]], columns=['review', 'sentiment'])
        df=pd.concat([df, x], ignore_index=False)

      else:
        df=df.append([[txt, labels[l]]], ignore_index=True)

      pbar.update()

df.columns= ['review', 'sentiment']


In [6]:
#shuffle data
import numpy as np

if version.parse(pd.__version__) >= version.parse("1.3.2"):
  df=df.sample(frac=1, random_state=0).reset_index(drop=True)

else:
  np.random.seed(0)
  df=df.reindex(np.random.permutation(df.index))

In [7]:
# optional: save to csv
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [8]:
#import the csv and view the first three reviews and sentiments
df=pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0


In [9]:
df.shape

(50000, 2)

### Introducing the bag-of-words model

In [10]:
#Using CountVectorizer to construct bag of words on sample sentences
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count=CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [11]:
#print the vocabulary
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [12]:
print(bag.toarray()) #note that word order does not matter in bag-of-words model

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
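
The counts above can be reproduced by hand. The following is a minimal sketch in plain Python, not scikit-learn's implementation; it assumes the default `CountVectorizer` behavior of lowercasing and keeping tokens of two or more word characters, and the names `tokenized`, `vocabulary`, and `bag` are illustrative.

```python
from collections import Counter
import re

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

# Lowercase and keep tokens of two or more word characters, which mirrors
# CountVectorizer's default token pattern.
tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]

# Sort the unique tokens alphabetically and number them from 0.
vocabulary = {term: idx for idx, term in
              enumerate(sorted({t for doc in tokenized for t in doc}))}

# One row per document, one column per vocabulary term (in index order).
bag = []
for tokens in tokenized:
    counts = Counter(tokens)
    bag.append([counts[term] for term in vocabulary])

for row in bag:
    print(row)
```

Each row only records how often a term occurs, so shuffling the words within a sentence leaves the row unchanged.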


### Assessing word relevancy via term frequency-inverse document frequency

In [13]:
#using TfidfTransformer to assign low weights to terms that are not useful or discriminatory
from sklearn.feature_extraction.text import TfidfTransformer

tfidf =TfidfTransformer(use_idf=True, norm='l2', smooth_idf= True)

print(tfidf.fit_transform(count.fit_transform(docs)).toarray())



[[0.         0.43370786 0.         0.55847784 0.55847784 0.
  0.43370786 0.         0.        ]
 [0.         0.43370786 0.         0.         0.         0.55847784
  0.43370786 0.         0.55847784]
 [0.50238645 0.44507629 0.50238645 0.19103892 0.19103892 0.19103892
  0.29671753 0.25119322 0.19103892]]
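
As a check on the third row of this output, the weights can be recomputed by hand. This is a sketch under the assumption that, with `smooth_idf=True`, the inverse document frequency is ln((1 + n) / (1 + df)) + 1 and each row is then scaled to unit L2 length; the variable names are illustrative.

```python
import math

# Raw term counts for the third document, in vocabulary order:
# and, is, one, shining, sun, sweet, the, two, weather
tf = [2, 3, 2, 1, 1, 1, 2, 1, 1]
# Number of the three documents that contain each term.
df_counts = [1, 3, 1, 2, 2, 2, 3, 1, 2]
n_docs = 3

# Smoothed idf (assumed formula): ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df_counts]

# Multiply counts by idf, then divide by the row's L2 norm.
raw = [t * w for t, w in zip(tf, idf)]
l2 = math.sqrt(sum(v * v for v in raw))
tfidf_doc3 = [v / l2 for v in raw]

print([round(v, 4) for v in tfidf_doc3])
```

The word "is" occurs three times in this document but appears in every document, so its idf is 1 and it ends up with a lower weight than "and" or "one", which occur only twice.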


### Cleaning Text Data

In [14]:
#display the last 50 characters from the first document
df.loc[0, 'review'][-50:]

'is seven.<br /><br />Title (Brazil): Not Available'

In [15]:
#use regex to remove HTML markup and punctuation while keeping emoticons
import re
def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text) #remove HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)  #find emoticons
    #replace non-word characters with spaces, lowercase, then append the emoticons without the "nose" character
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

In [16]:
##using the preprocessor
preprocessor(df.loc[0, 'review'][-50:])

'is seven title brazil not available'

In [17]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'

In [18]:
#apply the preprocessor function to all movie reviews
df['review']=df['review'].apply(preprocessor)

### Processing documents into tokens

In [19]:
#tokenize documents by splitting them on whitespace
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer() #stemmer that reduces words to their root form

def tokenizer(text):
  return text.split()

def tokenizer_porter(text):
  return [porter.stem(word) for word in text.split()]

In [20]:
#try the functions
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [21]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']

In [24]:
#removing stopwords
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords') #fetch the stop-word list on first use
stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')
 if w not in stop]

['runner', 'like', 'run', 'run', 'lot']

## Training a logistic regression model for document classification

In [25]:
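#split df into 25,000 training and 25,000 test reviews
#(.loc slicing includes the end label, so the training slice stops at 24999)
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values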

In [26]:
#set up a logistic regression pipeline and use grid search to find the best parameters
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

"""
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]
"""

small_param_grid = [{'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [None],
                     'vect__tokenizer': [tokenizer, tokenizer_porter],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    {'vect__ngram_range': [(1, 1)],
                     'vect__stop_words': [stop, None],
                     'vect__tokenizer': [tokenizer],
                     'vect__use_idf':[False],
                     'vect__norm':[None],
                     'clf__penalty': ['l2'],
                     'clf__C': [1.0, 10.0]},
                    ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(solver='liblinear'))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, small_param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

In [27]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits
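
The figure of 8 candidates follows from multiplying out the options in each dictionary of `small_param_grid`. The sketch below counts them with `itertools.product`; strings stand in for the tokenizer functions and the stop-word list, and `count_candidates` is a hypothetical helper, not part of scikit-learn.

```python
from itertools import product

# Stand-in values: strings replace the tokenizer functions and stop-word list.
small_param_grid = [
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': [None],
     'vect__tokenizer': ['tokenizer', 'tokenizer_porter'],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
    {'vect__ngram_range': [(1, 1)],
     'vect__stop_words': ['stop', None],
     'vect__tokenizer': ['tokenizer'],
     'vect__use_idf': [False],
     'vect__norm': [None],
     'clf__penalty': ['l2'],
     'clf__C': [1.0, 10.0]},
]

def count_candidates(grids):
    """Count the parameter combinations across a list of grid dictionaries."""
    return sum(len(list(product(*grid.values()))) for grid in grids)

n_candidates = count_candidates(small_param_grid)
n_folds = 5
print(n_candidates, n_candidates * n_folds)
```

Each dictionary contributes 4 combinations, and every candidate is fitted once per fold, which gives the 40 fits reported in the log line.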




In [28]:
#get the best parameters
print(f'Best parameter set: {gs_lr_tfidf.best_params_}')
print(f'CV Accuracy: {gs_lr_tfidf.best_score_:.3f}')

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7aa2c01595a0>}
CV Accuracy: 0.897


In [29]:
#evaluate the best model on the test set
clf=gs_lr_tfidf.best_estimator_
print(f'Test Accuracy: {clf.score(X_test, y_test):.3f}')

Test Accuracy: 0.899


## Working with bigger data - online algorithms and out-of-core learning

In [32]:
#using the csv data
import os
import gzip


if not os.path.isfile('movie_data.csv'):
    if not os.path.isfile('movie_data.csv.gz'):
        print('Please place a copy of movie_data.csv.gz '
              'in this directory. You can obtain it by '
              'a) executing the code at the beginning of this '
              'notebook or b) downloading it from GitHub: '
              'https://github.com/rasbt/machine-learning-book/'
              'blob/main/ch08/movie_data.csv.gz')
    else:
        with gzip.open('movie_data.csv.gz', 'rb') as in_f, \
                open('movie_data.csv', 'wb') as out_f:
            out_f.write(in_f.read())

In [34]:
#define a tokenizer function to clean the unprocessed text data

import numpy as np
import re
from nltk.corpus import stopwords

stop=stopwords.words('english')

def tokenizer(text):
    text = re.sub(r'<[^>]*>', '', text)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    text = re.sub(r'[\W]+', ' ', text.lower()) +\
        ' '.join(emoticons).replace('-', '')
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

#generator function that reads in and returns one document at a time
def stream_docs(path):
  with open(path, 'r', encoding='utf-8') as csv:
    next(csv) #skip header
    for line in csv:
      text, label=line[:-3], int(line[-2])
      yield text, label

In [35]:
#read in first document to test the generator function
next(stream_docs(path='movie_data.csv'))

('"In 1974, the teenager Martha Moxley (Maggie Grace) moves to the high-class area of Belle Haven, Greenwich, Connecticut. On the Mischief Night, eve of Halloween, she was murdered in the backyard of her house and her murder remained unsolved. Twenty-two years later, the writer Mark Fuhrman (Christopher Meloni), who is a former LA detective that has fallen in disgrace for perjury in O.J. Simpson trial and moved to Idaho, decides to investigate the case with his partner Stephen Weeks (Andrew Mitchell) with the purpose of writing a book. The locals squirm and do not welcome them, but with the support of the retired detective Steve Carroll (Robert Forster) that was in charge of the investigation in the 70\'s, they discover the criminal and a net of power and money to cover the murder.<br /><br />""Murder in Greenwich"" is a good TV movie, with the true story of a murder of a fifteen years old girl that was committed by a wealthy teenager whose mother was a Kennedy. The powerful and rich f

In [40]:
#define a function that takes a document stream and returns the requested number of documents
def get_minibatch(doc_stream, size):
  docs, y = [], []
  try:
    for _ in range(size):
      text, label = next(doc_stream)
      docs.append(text)
      y.append(label)
  except StopIteration:
    return None, None
  return docs, y
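
To see how the two helpers interact, here is a small sketch that runs them on an in-memory file. It assumes a hypothetical five-row CSV in place of `movie_data.csv`, and `stream_docs` is adapted to take an iterable of lines rather than a path so the example is self-contained.

```python
import io

def stream_docs(lines):
    """Yield (text, label) pairs from an iterable of CSV lines."""
    lines = iter(lines)
    next(lines)  # skip header
    for line in lines:
        # The last three characters are ',<label>\n'.
        yield line[:-3], int(line[-2])

def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

# Hypothetical five-row file standing in for movie_data.csv.
fake_csv = io.StringIO(
    'review,sentiment\n'
    'great film,1\n'
    'dull plot,0\n'
    'loved it,1\n'
    'too long,0\n'
    'fine cast,1\n'
)

stream = stream_docs(fake_csv)
first = get_minibatch(stream, size=2)
second = get_minibatch(stream, size=2)
third = get_minibatch(stream, size=2)
print(first)
print(second)
print(third)
```

The third call returns `(None, None)` even though one row was left: when the stream runs out partway through a batch, the partial batch is discarded. With 50,000 rows and batches of 1,000 or 5,000 this never happens in the notebook, but it matters for batch sizes that do not divide the row count evenly.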

In [44]:
#setting up the vectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
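
The idea behind hashing-based vectorization is that no vocabulary has to be stored: each token is hashed straight to a column index. The sketch below illustrates this with `zlib.crc32` as a stand-in hash (scikit-learn uses MurmurHash3); `hash_vectorize` is a hypothetical helper, and it leaves out the sign alternation and L2 normalization that `HashingVectorizer` applies by default.

```python
import zlib

def hash_vectorize(tokens, n_features=2**21):
    """Map tokens to a sparse {column index: count} dictionary."""
    row = {}
    for token in tokens:
        # crc32 is a stand-in hash for illustration only.
        idx = zlib.crc32(token.encode('utf-8')) % n_features
        row[idx] = row.get(idx, 0) + 1
    return row

row = hash_vectorize(['sun', 'is', 'shining', 'sun'])
print(row)
```

Because the mapping depends only on the token, documents seen later land in the same columns without refitting, which is what makes this vectorizer suitable for streaming. A large `n_features` such as 2**21 keeps the chance of two different tokens colliding in one column low.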

In [46]:
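#initialize a logistic regression classifier trained with stochastic gradient descent
from packaging import version
from sklearn import __version__ as sklearn_version

#scikit-learn 1.1 renamed the logistic loss from 'log' to 'log_loss'
if version.parse(sklearn_version) >= version.parse('1.1'):
    clf = SGDClassifier(loss='log_loss', random_state=1)
else:
    clf = SGDClassifier(loss='log', random_state=1)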


doc_stream = stream_docs(path='movie_data.csv')

In [47]:
#setting up out-of-core learning
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:36


In [48]:
#evaluate performance on the last 5,000 documents
X_test, y_test = get_minibatch(doc_stream, size=5000)
X_test = vect.transform(X_test)
print(f'Accuracy: {clf.score(X_test, y_test):.3f}')

Accuracy: 0.868


In [49]:
#use the last 5,000 documents to update the model
clf=clf.partial_fit(X_test, y_test)

## Topic modeling
### Decomposing text documents with Latent Dirichlet Allocation
#### Latent Dirichlet Allocation with scikit-learn

In [50]:
import pandas as pd

df=pd.read_csv('movie_data.csv', encoding='utf-8')
df.head()

                                              review  sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...          1
1  OK... so... I really like Kris Kristofferson a...          0
2  ***SPOILER*** Do not read this, if you think a...          0
3  hi for all the people who have seen this wonde...          1
4  I recently bought the DVD, forgetting just how...          0


In [51]:
from sklearn.feature_extraction.text import CountVectorizer

count=CountVectorizer(stop_words='english',
                      max_df=.1,
                      max_features=5000)

X = count.fit_transform(df['review'].values)

In [54]:
#using sklearn LDA
from sklearn.decomposition import LatentDirichletAllocation

lda=LatentDirichletAllocation(n_components=10, #number of topics
                              random_state=123,
                              learning_method='batch')

X_topics=lda.fit_transform(X)

In [55]:
#inspect the components_ attribute, which stores a matrix of word importances per topic
lda.components_.shape

(10, 5000)

In [56]:
#print the 5 most important words for each topic
n_top_words=5
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
  print(f'Topic {topic_idx + 1}')
  print(' '.join([feature_names[i]
                  for i in topic.argsort()[:-n_top_words - 1:-1]]))

Topic 1
worst minutes awful script stupid
Topic 2
family mother father children girl
Topic 3
american war dvd music tv
Topic 4
human audience cinema art sense
Topic 5
police guy car dead murder
Topic 6
horror house sex girl woman
Topic 7
role performance comedy actor performances
Topic 8
series episode war episodes tv
Topic 9
book version original read novel
Topic 10
action fight guy guys cool
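
The slice `[:-n_top_words - 1:-1]` used to pick the top words can be hard to read at first glance. The sketch below reproduces it on a plain list, with a pure-Python stand-in for `argsort()` and hypothetical weights for a single topic.

```python
weights = [0.1, 2.5, 0.7, 3.2, 1.4, 0.3]  # hypothetical word weights for one topic
n_top_words = 3

# Indices sorted by ascending weight, like numpy's argsort().
ascending = sorted(range(len(weights)), key=weights.__getitem__)

# Walk backwards from the end and stop after n_top_words items.
top = ascending[:-n_top_words - 1:-1]
print(top)
```

The result lists the indices of the three largest weights in descending order of weight, which is why the notebook's loop prints each topic's most important words first.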


In [57]:
#print the first 300 characters of the top 3 reviews for the horror topic (topic 6, column index 5)
horror = X_topics[:, 5].argsort()[::-1]

for iter_idx, movie_idx in enumerate(horror[:3]):
    print(f'\nHorror movie #{iter_idx + 1}:')
    print(df['review'][movie_idx][:300], '...')


Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...

Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there  ...

Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...
