## CS 178 Project Code

### Arun Malani, Brock Allan, Nathan Chau

Includes data analysis, model training, performance charts, model fine-tuning, and brief explanations.

In [1]:
#First, we need to import the data we are using: the IMDB Review / Large Movie Review Dataset.
#Dataset                                    Type   #Instances   #Labels   Each instance
#IMDB Review / Large Movie Review Dataset   NLP    50K          2         Movie review

#The dataset provides 25,000 highly polar movie reviews for training and 25,000 for testing; there is also unlabeled data, which could be used for unsupervised learning.
#Both raw text and an already-processed bag-of-words format are provided. See the README file contained in the release for more details.
#Information gained from the README:
#25k positive and 25k negative labeled reviews
#50k unlabeled reviews for unsupervised learning
#each movie has at most 30 reviews, since reviews for the same movie tend to have correlated ratings
#train and test sets contain disjoint sets of movies, so a model gains no significant performance by memorizing movie-unique terms
#negative reviews are those with a score <= 4 out of 10
#positive reviews are those with a score >= 7 out of 10
#labeled data sets do NOT include the neutral reviews (5-6), but the unlabeled data set DOES
#clustering the unlabeled data may therefore call for k = 3 clusters (negative, neutral, positive)
#the README states that the unlabeled data has equal numbers of reviews with ratings > 5 and <= 5

#the dataset contains two top-level folders, train and test
#each of these contains a pos (positive) and a neg (negative) folder
#the reviews inside are stored in text files named as follows:
#`[[id]_[rating].txt]` where id is a unique id and rating is the star rating for that review on a scale of 1-10
#for example: `[test/pos/200_8.txt]` is the text of a positive-labeled test review with id 200 and a rating of 8/10
#the unlabeled set is in train (train/unsup) and has a rating of 0 for all files
#the data also includes the URL for each review, where the review's id is the line of the URL file on which it appears
#for example: in `[urls_[pos, neg, unsup].txt]`, id 200 means line 200; the URL, however, only points to the movie's review page

#the already-tokenized bag-of-words features are stored in the .feat files in each directory
#the text tokens are listed in `[imdb.vocab]`; the README states that an entry 0:7 in a line of a .feat file
#means that the first word in imdb.vocab (which is "the") appears 7 times in that review
#this gives a way to count the number of times certain words appear in a review
#it may be useful for finding a correlation between, say, using "hate" 5 or more times and leaving a negative review
#the release also includes a file called `[imdbEr.txt]`, which contains the expected rating for each token in the vocab file
#the README states that it is a good way to get a sense of the average polarity of a word in the dataset
#maybe this would be a way to test our algorithms?

#make sure we cite the dataset at some point
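
The file-naming and `.feat` conventions described in the notes above can be sketched as two small parsers. This is a minimal illustration, not code from the dataset release: the helper names `parse_review_filename` and `parse_feat_line` are hypothetical, the example `.feat` line is made up, and the sketch assumes each `.feat` line starts with the review's rating followed by `index:count` pairs.

```python
import re

# Hypothetical helpers; the thresholds follow the README notes above.
FILENAME_RE = re.compile(r"^(\d+)_(\d+)\.txt$")


def parse_review_filename(name):
    """Split '[id]_[rating].txt' into (id, rating, label)."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    review_id, rating = int(match.group(1)), int(match.group(2))
    if rating == 0:
        label = "unsup"   # unlabeled files carry a rating of 0
    elif rating <= 4:
        label = "neg"
    elif rating >= 7:
        label = "pos"
    else:
        label = None      # 5-6 should not occur in the labeled folders
    return review_id, rating, label


def parse_feat_line(line):
    """Parse one .feat line, assumed to be '<rating> <index>:<count> ...'."""
    rating, *pairs = line.split()
    counts = {}
    for pair in pairs:
        index, count = pair.split(":")
        counts[int(index)] = int(count)   # vocab index -> occurrences
    return int(rating), counts


print(parse_review_filename("200_8.txt"))
print(parse_feat_line("8 0:7 3:2"))   # made-up line for illustration
```

Under these assumptions, key `0` in the returned dictionary refers to the first word in `imdb.vocab`, so `{0: 7}` would mean that "the" appears 7 times in the review.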

In [5]:
#Let's start with unsupervised learning so we can decide which classifiers
#or ensembles to use on the data
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.inspection import DecisionBoundaryDisplay

import requests
from concurrent.futures import ThreadPoolExecutor

api_url = "https://api.github.com/repos/apmalani/cs-178-project/contents/train/unsup?ref=main"
base_raw = "https://raw.githubusercontent.com/apmalani/cs-178-project/main/train/unsup/"

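# List the directory via the GitHub contents API.
# Note: this API returns at most 1,000 files per directory.
resp = requests.get(api_url, timeout=30)
resp.raise_for_status()
items = resp.json()

# Keep only the review text files
txt_files = [item["name"] for item in items if item["name"].endswith(".txt")]
print(f"Text files found: {len(txt_files)}")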

def fetch_file(idx_name):
    """Download one review; return (index, text) so results keep their order."""
    idx, name = idx_name
    file_resp = requests.get(base_raw + name, timeout=30)
    file_resp.raise_for_status()
    return idx, file_resp.text

# Pre-allocate the list so that list index i and dict key i refer to the same review
Unlabeled_List = [None] * len(txt_files)
Unlabeled_dict = {}

with ThreadPoolExecutor(max_workers=20) as executor:
    futures = [executor.submit(fetch_file, (i, name)) for i, name in enumerate(txt_files)]
    for future in futures:
        idx, text = future.result()
        Unlabeled_List[idx] = text
        Unlabeled_dict[idx] = text

# Now Unlabeled_List[0] == Unlabeled_dict[0]
# Fix the random seed for reproducibility in later cells
seed = 1234
np.random.seed(seed)

Text files found: 1000
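
A count of exactly 1000 is consistent with the GitHub contents API's documented limit of 1,000 files per directory. One possible way to list more files is the Git Trees API, sketched below. This is only a sketch under assumptions: the helper names `unsup_txt_names` and `list_unsup_files` are hypothetical, the URL assumes the same repository and `main` branch as the cell above, and very large trees can still come back truncated.

```python
# Assumed values: same repository and branch as the cell above.
TREE_URL = "https://api.github.com/repos/apmalani/cs-178-project/git/trees/main?recursive=1"
PREFIX = "train/unsup/"


def unsup_txt_names(tree_json, prefix=PREFIX):
    """Return .txt file names under `prefix` from a Git Trees API response."""
    names = []
    for entry in tree_json.get("tree", []):
        path = entry.get("path", "")
        is_file = entry.get("type") == "blob"
        if is_file and path.startswith(prefix) and path.endswith(".txt"):
            names.append(path[len(prefix):])
    return names


def list_unsup_files(url=TREE_URL):
    """Fetch the recursive tree listing (network call; not run here)."""
    import requests  # imported lazily so the offline example needs no network library

    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    if data.get("truncated"):
        print("Warning: tree listing was truncated; results are incomplete.")
    return unsup_txt_names(data)


# Offline example with a hand-written response in the Trees API shape.
fake_response = {
    "tree": [
        {"path": "train/unsup/0_0.txt", "type": "blob"},
        {"path": "train/unsup", "type": "tree"},
        {"path": "train/pos/1_9.txt", "type": "blob"},
    ],
    "truncated": False,
}
print(unsup_txt_names(fake_response))
```

The names returned by `list_unsup_files()` could stand in for `txt_files` in the download loop, since `base_raw + name` is built the same way.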


In [7]:
#I decided to collect the data into two different data structures:
print(Unlabeled_List[0])
print("________________________")
print(Unlabeled_dict[0])
#The list contains the same data as the dict, but the dict uses integer
#keys for each review; I thought this would be easier to use for something like a NN

I admit, the great majority of films released before say 1933 are just not for me. Of the dozen or so "major" silents I have viewed, one I loved (The Crowd), and two were very good (The Last Command and City Lights, that latter Chaplin circa 1931).<br /><br />So I was apprehensive about this one, and humor is often difficult to appreciate (uh, enjoy) decades later. I did like the lead actors, but thought little of the film.<br /><br />One intriguing sequence. Early on, the guys are supposed to get "de-loused" and for about three minutes, fully dressed, do some schtick. In the background, perhaps three dozen men pass by, all naked, white and black (WWI ?), and for most, their butts, part or full backside, are shown. Was this an early variation of beefcake courtesy of Howard Hughes?
________________________
I admit, the great majority of films released before say 1933 are just not for me. Of the dozen or so "major" silents I have viewed, one I loved (The Crowd), and two were very good (T

After running the code above, the following should be true:

The first 1,000 unlabeled movie reviews (1,000 for now) are stored as text in two structures:

- `Unlabeled_List`: a 0-indexed list
- `Unlabeled_dict`: a dictionary whose keys are integer indexes (starting at 0) and whose values are the reviews
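
With the reviews held as plain strings, a first clustering pass might look like the sketch below. It is an assumption-laden illustration rather than the project's final method: the function name `cluster_reviews` is hypothetical, TF-IDF features are one choice among several, `k = 3` follows the earlier note about negative, neutral, and positive reviews, and the `<br />` stripping is an assumed preprocessing step based on the tags visible in the raw text.

```python
import re

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_reviews(texts, k=3, seed=1234):
    """Cluster raw review strings with TF-IDF features and k-means."""
    # Assumption: replace the literal <br /> tags found in the raw reviews.
    cleaned = [re.sub(r"<br\s*/?>", " ", text) for text in texts]
    vectorizer = TfidfVectorizer(stop_words="english")
    features = vectorizer.fit_transform(cleaned)   # sparse document-term matrix
    model = KMeans(n_clusters=k, n_init=10, random_state=seed)
    labels = model.fit_predict(features)
    return labels, vectorizer, model


# Tiny made-up sample so the sketch runs on its own;
# in the notebook, pass Unlabeled_List instead.
sample = [
    "A wonderful, moving film with great acting.<br /><br />Loved it.",
    "Brilliant direction and a fantastic cast.",
    "Terrible plot, awful dialogue, boring pacing.",
    "Dreadful script and the worst ending imaginable.",
    "Average story, decent visuals, forgettable soundtrack.",
    "Mediocre thriller, watchable but nothing special.",
]
labels, _, _ = cluster_reviews(sample)
print(labels)
```

The cluster ids are arbitrary, and nothing guarantees that the clusters line up with sentiment; they may instead group reviews by genre or topic, which is worth checking before relying on them.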

### EDA

### Models

### Fine-tuning