## Getting started

#### Standard imports and installations

In [None]:
!pip install hub -q


In [None]:
!hub login

In [None]:
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
import hub
from hub.schema import Text, ClassLabel
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer

In [None]:
!wget https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz --quiet
!tar -xf aclImdb_v1.tar.gz

#### Reading one sample review

In [None]:
filename = "aclImdb/train/pos/0_9.txt"
# The with block closes the file on exit, so no explicit close() is needed.
with open(filename, "r") as fin:
    line = fin.readline()

In [None]:
line

'Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school life, such as "Teachers". My 35 years in the teaching profession lead me to believe that Bromwell High\'s satire is much closer to reality than is "Teachers". The scramble to survive financially, the insightful students who can see right through their pathetic teachers\' pomp, the pettiness of the whole situation, all remind me of the schools I knew and their students. When I saw the episode in which a student repeatedly tried to burn down the school, I immediately recalled ......... at .......... High. A classic line: INSPECTOR: I\'m here to sack one of your teachers. STUDENT: Welcome to Bromwell High. I expect that many adults of my age think that Bromwell High is far fetched. What a pity that it isn\'t!'
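`readline()` returns only the text up to the first newline. That is enough if each review file holds a single line, as the sample above suggests. The sketch below uses a throwaway temporary file (not the real dataset) to show how `read()` differs once a file has more than one line.

```python
import os
import tempfile

# Write a small two-line file to a temporary location (illustrative data only).
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as fout:
        fout.write("First line of a review.\nSecond line of a review.")

    # readline() stops after the first newline.
    with open(sample_path, "r", encoding="utf-8") as fin:
        first_line = fin.readline()

    # read() returns the whole file.
    with open(sample_path, "r", encoding="utf-8") as fin:
        full_text = fin.read()

print(repr(first_line))
print(repr(full_text))
```

If a review ever spans several lines, `read()` is the safer choice for loading it in full.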

#### Collecting all filenames for processing

In [None]:
file_names = []
reviews_df = pd.DataFrame(columns=["Review", "Label"])
for root, dirs, files in os.walk("aclImdb/train/pos"):
    file_names.append(files)
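
`os.walk` yields one `(root, dirs, files)` tuple per directory it visits, so for a folder without subdirectories `file_names` ends up holding a single list, which is why later cells index `file_names[0]`. The sketch below reproduces this on a temporary directory; the file names are made up, modeled on the sample file `0_9.txt`. It also shows `sorted(os.listdir(...))` as an alternative that gives a reproducible order.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create three empty files (hypothetical names, for illustration only).
    for name in ["2_9.txt", "0_9.txt", "1_7.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()

    # One list of file names per visited directory; here there is only one directory.
    walked = [files for _root, _dirs, files in os.walk(tmp_dir)]
    # Flat, sorted listing of the same directory.
    listed = sorted(os.listdir(tmp_dir))

print(len(walked))
print(listed)
```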

#### Appending all positive reviews to the DataFrame

In [None]:
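root_dir = "aclImdb/train/pos/"
rows = []
for name in file_names[0]:
    # The with block closes each file on exit.
    with open(root_dir + name, "r") as fin:
        rows.append({"Review": fin.readline(), "Label": 1})
# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows instead.
reviews_df = pd.concat([reviews_df, pd.DataFrame(rows)], ignore_index=True)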

In [None]:
reviews_df

| | Review | Label |
|---|---|---|
| 0 | I have watched this episode more often than an... | 1 |
| 1 | I really enjoyed "Doctor Mordrid". This is a l... | 1 |
| 2 | Hickory Dickory Dock was a good Poirot mystery... | 1 |
| 3 | Fragile Carne, just before his great period. A... | 1 |
| 4 | So I don't ruin it for you, I'll be very brief... | 1 |
| ... | ... | ... |
| 12495 | Film can be a looking glass to see the world i... | 1 |
| 12496 | A message movie, but a rather good one. Outsta... | 1 |
| 12497 | Kurosawa weaves a tale that has a cast of char... | 1 |
| 12498 | When you compare what Brian De Palma was doing... | 1 |


#### Appending all negative reviews to the DataFrame

In [None]:
file_names = []
for root, dirs, files in os.walk("aclImdb/train/neg"):
    file_names.append(files)

In [None]:
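root_dir = "aclImdb/train/neg/"
rows = []
for name in file_names[0]:
    # The with block closes each file on exit.
    with open(root_dir + name, "r") as fin:
        rows.append({"Review": fin.readline(), "Label": 0})
# DataFrame.append was removed in pandas 2.0, so concatenate the collected rows instead.
reviews_df = pd.concat([reviews_df, pd.DataFrame(rows)], ignore_index=True)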

In [None]:
# Length in characters of the longest review; used as max_shape in the Text schema below.
max_length = max((len(review) for review in reviews_df["Review"]), default=0)

### Uploading the DataFrame to Hub

In [None]:
# Run this cell only once. After the dataset has been uploaded, you can fetch it by running
# hub.Dataset(url)

# Replace url with your username and dataset name. For example, if your username is Akash and
# your dataset is FlipkartReviews, then
# url = "Akash/FlipkartReviews"
# Before you can upload datasets, log in to Hub by running the `!hub login` cell above.

url = "dhiganthrao/IMDB-MovieReviews"

# Uncomment the following lines if you're uploading *this* dataset for the first time.
# my_schema = {"Review": Text(shape=(None, ), max_shape=(max_length, )),
#              "Label": ClassLabel(num_classes=2)}

# ds = hub.Dataset(url, shape=(25000,), schema=my_schema)
# for i in tqdm(range(len(ds))):
#     ds["Review", i] = reviews_df["Review"][i]
#     ds["Label", i] = reviews_df["Label"][i]

In [None]:
# Comment out the following line if you're uploading the dataset for the first time.
ds = hub.Dataset(url)

#### Flushing dataset to disk

In [None]:
# If you've uploaded your own dataset to Hub, run this command.
# It saves all changes to the cloud. You can also view the dataset at
# https://app.activeloop.ai

# ds.flush()

## Fetching data from Hub

In [None]:
print(type(ds))
print(ds.schema)

print(ds["Review", 4].compute())
print(ds["Label", 4].compute())

<class 'hub.api.dataset.Dataset'>
SchemaDict({'Review': Text(shape=(None,), dtype='int64', max_shape=(13704,)), 'Label': ClassLabel(shape=(), dtype='int64', num_classes=2)})
A simple and effective film about what life is all about, responding to challenges. It took a lot of gall for Homer and his friends to be able to grow into manhood without falling in the trap of a prefabricated future that runs from father to son, to be a miner in the local mine and never get out of that fate. It took also three different challenges for Homer and his friends to conquer a personal and free future. The challenge of the first ever man-made artificial satellite, Sputnik 1, a Soviet satellite, a milestone in human history, a turning point that Homer and his friends could not miss, did not want to miss. Then the challenge of science and applied mechanics to calculate and to devise a rocket from scratch or rather from what they could gather in books and order in their minds. Finally the challenge of a wor

## Training a model with our dataset

In [None]:
import re


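def preprocessor(text):
    # Strip HTML tags such as <br />.
    text = re.sub(r"<[^>]*>", "", text)
    # Collect emoticons such as :) or :-( before punctuation is removed.
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    # Lowercase, collapse non-word characters to spaces, then append the emoticons without "-".
    text = re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")
    return text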


preprocessor("This is a :) test :-( !")

'this is a test :) :('
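
The first substitution in `preprocessor` removes HTML tags. The sketch below repeats the same logic (written with raw strings) on two made-up inputs to show the effect, including one caveat: because a tag is replaced by an empty string, two words separated only by a tag are fused.

```python
import re


def preprocessor(text):
    # Same logic as the cell above, with raw strings for the regex patterns.
    text = re.sub(r"<[^>]*>", "", text)
    emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", text)
    return re.sub(r"[\W]+", " ", text.lower()) + " ".join(emoticons).replace("-", "")


# Made-up inputs for illustration.
with_tags = preprocessor("Great film!<br /><br />Loved it :-)")
fused = preprocessor("one<br />two")

print(repr(with_tags))  # 'great film loved it :)'
print(repr(fused))      # 'onetwo'
```

If fused words turn out to matter, replacing tags with a space instead of an empty string is one possible adjustment; the notebook's results were obtained without it.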

In [None]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()


def tokenizer(text):
    return text.split()


tokenizer("I find it fun to use Hub")

['I', 'find', 'it', 'fun', 'to', 'use', 'Hub']

In [None]:
def tokenizer_stemmer(text):
    return [porter.stem(word) for word in text.split()]


tokenizer_stemmer("Hub is extremely easy and efficient to use")

['hub', 'is', 'extrem', 'easi', 'and', 'effici', 'to', 'use']

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    strip_accents=None,
    lowercase=True,
    preprocessor=preprocessor,
    tokenizer=tokenizer_stemmer,
    use_idf=True,
    norm="l2",
    smooth_idf=True,
)
X = tfidf.fit_transform(
    [item["Review"].compute() for item in ds]
)  # Our training dataset
y = ds["Label"].compute()  # Training Labels
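
With `use_idf=True`, `smooth_idf=True` and `norm="l2"`, scikit-learn documents the weighting as raw term count times `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by scaling each document vector to unit Euclidean length. The sketch below works through that formula in plain Python on three made-up documents. `tfidf_vectors` is a hypothetical helper that splits on whitespace only (no stemming), and it is not scikit-learn's implementation.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (illustrative helper)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf, following the formula documented for smooth_idf=True.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # l2 normalization: divide by the Euclidean length of the vector.
        l2 = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / l2 for t, w in weights.items()} if l2 else {})
    return vectors


docs = ["good film", "bad film", "good good plot"]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in sorted(vec.items())})
```

In the second document, "bad" appears in fewer documents than "film", so it receives the larger weight.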

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegressionCV

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=1, test_size=0.5, shuffle=True
)
clf = LogisticRegressionCV(
    cv=5, scoring="accuracy", random_state=0, n_jobs=-1, verbose=3, max_iter=300
).fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:   51.1s finished
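
The call to `train_test_split` above shuffles the rows and holds out half of them for testing. The sketch below illustrates that idea with the standard library only. `shuffled_split` is a hypothetical helper, and it will not reproduce the exact rows that scikit-learn selects for a given `random_state`.

```python
import random


def shuffled_split(items, test_size=0.5, seed=1):
    """Shuffle indices with a fixed seed, then split into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


# Toy data: ten integers standing in for ten reviews.
train, test = shuffled_split(list(range(10)))
print(len(train), len(test))
```

Fixing the seed makes the split repeatable, which is the role `random_state=1` plays in the cell above.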


In [None]:
print(f"Accuracy: {clf.score(X_test, y_test)}")

Accuracy: 0.88648
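
Accuracy is the fraction of test reviews whose predicted label matches the true label. The sketch below computes it by hand on toy labels; `accuracy` is a hypothetical helper, not the scikit-learn scorer. The last lines assume the test set holds 12,500 reviews (half of the 25,000 rows), in which case the reported score corresponds to 11,081 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels: 3 of 4 predictions are correct.
toy_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(toy_accuracy)

# Assumption: 12,500 test reviews (25,000 rows with test_size=0.5).
implied_correct = round(0.88648 * 12500)
print(implied_correct)
```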
