<a href="https://colab.research.google.com/github/bijouvj/ST-summer-2024/blob/main/SentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

--2024-06-24 03:32:59--  http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2024-06-24 03:33:15 (5.14 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [3]:
import tarfile
with tarfile.open('aclImdb_v1.tar.gz', 'r:gz') as tar:
  tar.extractall()

In [5]:
!pip3 install pyprind

Collecting pyprind
  Downloading PyPrind-2.11.3-py2.py3-none-any.whl (8.4 kB)
Installing collected packages: pyprind
Successfully installed pyprind-2.11.3


In [6]:
import pyprind
import pandas as pd
import os
import sys

basepath = 'aclImdb'

In [14]:
labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=sys.stdout)
# Collect [text, label] rows in a list; calling pd.concat inside the loop
# copies the whole DataFrame on every iteration
rows = []

for split in ('test', 'train'):
  for label in ('pos', 'neg'):
    path = os.path.join(basepath, split, label)
    for file in sorted(os.listdir(path)):
      with open(os.path.join(path, file),
              'r', encoding='utf-8') as infile:
        txt = infile.read()
      rows.append([txt, labels[label]])
      pbar.update()

# Build the DataFrame once; the columns keep the default names 0 and 1 here
df = pd.DataFrame(rows)

In [15]:
df.columns = ['review', 'sentiment']

In [16]:
df.head()

                                              review  sentiment
0  I went and saw this movie last night after bei...          1
1  Actor turned director Bill Paxton follows up h...          1
2  As a recreational golfer with some knowledge o...          1
3  I saw this film in a sneak preview, and it is ...          1
4  Bill Paxton has taken the true story of the 19...          1


In [31]:
import numpy as np
np.random.seed(0)

In [32]:
df = df.reindex(np.random.permutation(df.index))
df.to_csv('movie_data.csv', index=False, encoding='utf-8')

In [33]:
df = pd.read_csv('movie_data.csv', encoding='utf-8')

df.head(3)

                                              review  sentiment
0  Devil's Experiment: 1/10: Hardcore porn films ...          0
1  I was not expecting much from this movie. I wa...          1
2  Having borrowed this movie from the local libr...          1


In [20]:
df.shape

(50000, 2)

In [21]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()

In [24]:
docs = np.array(['The sun is shining',
                 'The weather is sweet',
                 'The sun is shining, the weather is sweet, '
                 'and one and one is two'])

bag = count.fit_transform(docs)

In [25]:
docs

array(['The sun is shining', 'The weather is sweet',
       'The sun is shining, the weather is sweet, and one and one is two'],
      dtype='<U64')

In [26]:
docs.dtype

dtype('<U64')

In [27]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


In [28]:
# Each column index in the feature vectors below corresponds to the integer
# value stored for a word in the CountVectorizer vocabulary. For example,
# the feature at index 0 is the count of the word 'and', which occurs only
# in the last document (twice), while the word 'is' at index 1 (the second
# feature in the document vectors) occurs in all three documents.
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
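
To make the mapping between vocabulary indices and counts concrete, the following is a minimal standard-library sketch that rebuilds the same matrix by hand. It assumes that lowercasing plus tokens of two or more word characters approximates the CountVectorizer defaults for these sentences; the names `token_pattern`, `vocabulary_manual`, and `bag_manual` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

# Assumption: lowercase the text and keep tokens of 2+ word characters
token_pattern = re.compile(r'\b\w\w+\b')
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Assign indices in alphabetical order of the distinct tokens
all_terms = sorted({term for tokens in tokenized for term in tokens})
vocabulary_manual = {term: idx for idx, term in enumerate(all_terms)}

# One row per document, one column per vocabulary term
bag_manual = []
for tokens in tokenized:
    counts = Counter(tokens)  # missing terms count as 0
    bag_manual.append([counts[term] for term in all_terms])

print(vocabulary_manual)
for row in bag_manual:
    print(row)
```

Under these assumptions the rows should agree with the array printed above; for real corpora, CountVectorizer remains the practical choice because it returns a sparse matrix.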


In [29]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)
np.set_printoptions(precision=2)

In [30]:
print(tfidf.fit_transform(count.fit_transform(docs)).toarray())

[[0.   0.43 0.   0.56 0.56 0.   0.43 0.   0.  ]
 [0.   0.43 0.   0.   0.   0.56 0.43 0.   0.56]
 [0.5  0.45 0.5  0.19 0.19 0.19 0.3  0.25 0.19]]
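
As a sanity check on where these numbers come from, here is a small standard-library sketch that recomputes the tf-idf values from the raw counts. It assumes the smoothed idf formula idf = ln((1 + n_docs) / (1 + df)) + 1 followed by L2 normalization of each row, which is how scikit-learn documents `smooth_idf=True` and `norm='l2'`; the names `doc_freq`, `tfidf_row`, and `tfidf_manual` are illustrative.

```python
import math

# Raw term counts from the bag-of-words matrix above
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    raw = [tf * weight for tf, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf_manual = [tfidf_row(row) for row in counts]
for row in tfidf_manual:
    print([round(value, 2) for value in row])
```

Note how 'is' has the largest raw count in the third document (3) yet ends up with a smaller weight than 'and' or 'one' (2 each), because it appears in every document and so receives the minimum idf of 1.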


In [44]:
# review text has HTML in it
df.loc[47, 'review']

'For me, this is another one of those films that I got to see off of the Los Angeles based "Z" Channel when it was in service. And it was another one of those movies that I saw when I was young...and learned that there was a world out there...one I did not want to accept.<br /><br />Moving to Los Angeles and getting to watch international cinema became quite the guilty pleasure hobby of mine and to date, no premiere channel programming has matched the "Z" Channel in its showing of international films. The three international films that stuck in my young head were "Spetters", "Beau Pere" and of course this one, "Pixote".<br /><br />This was the most shocking and saddest movie I ever witnessed in my life. This was also one of the first movies that made me understand that there IS a difference in cinema: to entertain, and to inform. Let me be honest..growing up in a small town on the east coast, I had no idea anything like this -- to this extent -- existed. All I knew from South America w

In [45]:
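import re
def preprocessor(text):
  # Strip HTML tags such as <br />
  text = re.sub(r'<[^>]*>', '', text)
  # Collect emoticons like :) ;-( =D before punctuation is removed
  emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
  # Lowercase, collapse runs of non-word characters into single spaces,
  # then append the emoticons with their '-' noses removed
  text = re.sub(r'\W+', ' ', text.lower()) + \
         ' '.join(emoticons).replace('-', '')
  return text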

In [47]:
preprocessor("</a>This :) is :( <br /><br />a test :-)!")

'this is a test :) :( :)'
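
To see how the result above is assembled, this sketch runs the three steps of the preprocessor separately on the same test string. The intermediate variable names (`no_html`, `words_only`, `cleaned`) are illustrative only, and the regular expressions are written as raw strings.

```python
import re

text = "</a>This :) is :( <br /><br />a test :-)!"

# Step 1: drop anything that looks like an HTML tag
no_html = re.sub(r'<[^>]*>', '', text)

# Step 2: remember the emoticons before punctuation is stripped
emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', no_html)

# Step 3: lowercase and replace runs of non-word characters with one space
words_only = re.sub(r'\W+', ' ', no_html.lower())

# Finally, append the emoticons with the '-' nose removed
cleaned = words_only + ' '.join(emoticons).replace('-', '')

print(repr(no_html))
print(emoticons)
print(repr(words_only))
print(repr(cleaned))
```

The emoticons end up after the words only because the trailing ':-)!' collapses to a single space in step 3; if a text ended in a word character, the first emoticon would be glued directly onto the last word.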

In [48]:
# let's clean the data
df['review'] = df['review'].apply(preprocessor)