<a href="https://colab.research.google.com/github/aypy01/tensorflow/blob/main/sentiments.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Sentiment Analysis using TensorFlow and the IMDB reviews dataset

##Importing libraries

In [None]:
import matplotlib.pyplot as plt
import os
import re
import shutil
import string
import tensorflow as tf




##Downloading Dataset

In [None]:
print(tf.__version__)
url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset_dir = tf.keras.utils.get_file("aclImdb_v1", url, untar=True, cache_dir=".", cache_subdir="")  # Downloading and extracting

dataset = os.path.join(dataset_dir, "aclImdb")

# Paths to the train and test directories
train_ds = os.path.join(dataset, "train")  # directory for the train set in the downloaded data
test_ds = os.path.join(dataset, "test")  # directory for the test set

2.19.0


###Viewing the dataset

In [None]:
# View one specific review from the positive training set
sample_file = os.path.join(train_ds, 'pos/10000_8.txt')
with open(sample_file) as f:
  print(f.read())


Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a plan to help those on the street that were once considered human who did everything from going to school, work, or vote for the matter. Most people think of the homeless as just a lost cause while worrying about things such as racism, the war on Iraq, pressuring kids to succeed, technology, the elections, inflation, or worrying if they'll be next to end up on the streets.<br /><br />But what if you were given a bet to live on the streets for a month without the luxuries you once had from a home, the entertainment sets, a bathroom, pictures on the wall, a computer, and everything you once treasure to see what it's like to be homeless? That is Goddard Bolt's lesson.<br /><br />Mel Brooks (who directs) who stars as Bolt plays a rich man who has everything in the world until deciding to make a bet with a sissy rival (Jeffery Tambor) to see if he can live in the streets for thirty days without th

In [None]:
# random review from the positive set
import random

positive = os.path.join(train_ds, "pos")
sample_filename = random.choice(os.listdir(positive))
sample_path = os.path.join(positive, sample_filename)

with open(sample_path) as f:
    print(f"\nReview: {sample_filename}\n")
    print(f.read())


Review: 8170_7.txt

As if most people didn't already have a jittery outlook on the field of dentistry, this little movie will sure make you paranoid patients squirm. A successful dental hygienist witnesses his wife going down on the pool man (on their anniversary of all days!) and snaps big time into a furious breakdown. After shooting an attack dog's head off, he strolls into work and ends up taking his marital aggression out on the patients as he plans what to do about his "slut" of a wife. There are plenty of up-close shots of mouth-jabbing, tongue-cutting, and beauty queen fondling, as well as a marvelously deranged performance by Corbin Bernsen. The scene in which he ties up and gases his wife before mercilessly yanking her teeth out is definitely hard to watch. A dentist is absolutely the wrong kind of person to go off the deep end and this movie sure explains that in detail. "The Dentist" is incredibly entertaining, fast-paced, and laughably gory at times. Check it out!


###Removing the unused directory

In [None]:
# Remove the directory for the unsupervised dataset from 'train'
remove_dir = os.path.join(
    train_ds, "unsup"
)  # Removing the file directory of the unsupervised set
shutil.rmtree(remove_dir)
# This deletes the 'unsup' directory and all its contents
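
Why the removal matters: with inferred labels, `text_dataset_from_directory` treats every subdirectory as one class, so leaving `unsup` in place would produce three classes instead of two. A minimal sketch of that directory logic, using a hypothetical temporary folder (not the real dataset) and a helper name, `class_dirs`, that is made up for illustration:

```python
import os
import shutil
import tempfile

# Hypothetical miniature copy of the train/ layout, built in a temp directory.
root = tempfile.mkdtemp()
for name in ("pos", "neg", "unsup"):
    os.makedirs(os.path.join(root, name))

def class_dirs(path):
    """Return the sorted subdirectory names, one per would-be class."""
    return sorted(d for d in os.listdir(path) if os.path.isdir(os.path.join(path, d)))

before = class_dirs(root)
shutil.rmtree(os.path.join(root, "unsup"))  # same call as in the cell above
after = class_dirs(root)

print(before)  # ['neg', 'pos', 'unsup']
print(after)   # ['neg', 'pos']

shutil.rmtree(root)  # clean up the temporary directory
```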

##Data-Loading

`validation_split` inside `model.fit` only works when your data is in memory (like NumPy arrays, Pandas DataFrames, or tf.Tensors).

It does not work with a `tf.data.Dataset` object (which is what `text_dataset_from_directory` gives you).

Instead, pass `validation_split` together with `subset="training"` or `subset="validation"` to `text_dataset_from_directory` → TensorFlow handles the split.

### IMDB Dataset Split Notes

- `text_dataset_from_directory`:
  - Reads subfolders (`pos/` → label=1, `neg/` → label=0).
  - Returns `(text, label)` pairs inside a `tf.data.Dataset`.

- IMDB training set = **25,000 reviews**  
  - 12,500 positive, 12,500 negative.

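- With `validation_split=0.2`:
  - `subset="training"` → **20,000 reviews** (roughly 10k pos + 10k neg).
  - `subset="validation"` → **5,000 reviews** (roughly 2.5k pos + 2.5k neg).
  - The split is a random shuffle, not a stratified one, so the class counts are close to balanced but not exact.

- **Seed is important** → using the same seed in both calls keeps the training and validation subsets from overlapping and makes the split reproducible.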

- You don’t need to manually separate positive/negative; TF handles labels automatically.
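
The role of the seed can be sketched in plain Python. This is only an illustration of the idea (shuffle with a fixed seed, then hold out the last 20%); it is not Keras' actual shuffling code, and `split_indices` is a hypothetical helper:

```python
import random

def split_indices(n_samples, validation_split, seed):
    """Shuffle all indices with a fixed seed, then hold out the tail for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = round(n_samples * validation_split)
    return indices[:-n_val], indices[-n_val:]

# 12,500 negative (0) and 12,500 positive (1) reviews, as in the IMDB training set.
labels = [0] * 12500 + [1] * 12500

train_idx, val_idx = split_indices(len(labels), 0.2, seed=42)
train_idx_again, val_idx_again = split_indices(len(labels), 0.2, seed=42)

train_pos = sum(labels[i] for i in train_idx)
print(len(train_idx), len(val_idx))              # 20000 5000
print("positives in training subset:", train_pos)  # close to 10,000, not exactly
print("disjoint:", set(train_idx).isdisjoint(val_idx))
print("reproducible:", train_idx == train_idx_again)
```

Calling the helper twice with the same seed yields the same shuffle, which is why the training and validation calls must share a seed: otherwise the two subsets could overlap.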


In [None]:
# Loading the raw training dataset with text files in the directory
x_train = tf.keras.utils.text_dataset_from_directory(
    train_ds,  # The directory where the training data is located (after removing 'unsup')
    batch_size=32,  # Number of samples to return in each batch
    validation_split=0.2,  # 20% of the training files are held out for validation
    subset="training",  # Return the training subset (the remaining 80%)
    seed=42,  # Seed for the shuffle that defines the split; must match the validation call
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [None]:
# Load the validation dataset
x_val = tf.keras.utils.text_dataset_from_directory(
    train_ds,  # Same directory as the training subset
    batch_size=32,
    validation_split=0.2,  # 20% of the training data will be used for validation
    subset="validation",  # Specifies this is for validation
    seed=42,  # Ensures reproducibility
)


Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [None]:
# Load the test dataset
x_test = tf.keras.utils.text_dataset_from_directory(
    test_ds,  # Directory holding the test reviews
    batch_size=32,  # The batch size for the test dataset
)

Found 25000 files belonging to 2 classes.


Since this is a binary classification task, let's check what the labels 0 and 1 represent here.

In [None]:
print("Label 0 corresponds to", x_train.class_names[0])

print("Label 1 corresponds to", x_train.class_names[1])


Label 0 corresponds to neg
Label 1 corresponds to pos


Now we know that label 0 represents `neg` and label 1 represents `pos`.
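
This mapping follows from the directory names: with inferred labels, the class names are taken in alphanumeric order and each label is the position in that list. A small sketch of the idea (the variable names are illustrative, not part of the Keras API):

```python
# Directory names found under train/ after removing 'unsup'.
directory_names = ["pos", "neg"]

class_names = sorted(directory_names)  # alphanumeric order
label_of = {name: index for index, name in enumerate(class_names)}

print(class_names)  # ['neg', 'pos']
print(label_of)     # {'neg': 0, 'pos': 1}
```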

Example: the first 3 reviews in the first training batch

In [None]:
# Iterate over the dataset and take the first batch
# Each batch is a (reviews, labels) pair: the first element holds the review texts, the second the labels

for review, label in x_train.take(1):
    # 'take(1)' retrieves the first batch of data (text and labels)

    # Loop through the first 3 items in the batch of 32
    for i in range(3):
        # Print the text review (converted from tensor to numpy array for readability)
        print("Review", review.numpy()[i])

        # Print the corresponding label (converted from tensor to numpy array for readability)
        print("Label", label.numpy()[i])

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0
Review b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into 

##Preprocessing
Before training, the text has to be cleaned and converted to numbers:

1.   Standardize = lowercase the text and remove punctuation and HTML `<br />` tags
2.   Tokenize = split the text into words on whitespace
3.   Vectorize = convert each word into an integer index

All three steps are handled by the `TextVectorization` layer in `tf.keras.layers`.

###Standardization/tokenization

In [None]:
import re
import string

def standardization(input_data):
    # Convert input to lowercase
    lowercase = tf.strings.lower(input_data)

    # Replace HTML line breaks (<br />) with a space
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")

    # Remove punctuation using a regex character class
    return tf.strings.regex_replace(stripped_html, "[%s]" % re.escape(string.punctuation), "")
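
To see what the three steps do to a concrete string, here is a plain-Python stand-in for the function above. It uses Python's `re` module instead of `tf.strings`, so it is only an approximation for illustration; `standardize_py` is a hypothetical name:

```python
import re
import string

def standardize_py(text):
    """Plain-Python mirror of the three standardization steps."""
    lowercase = text.lower()
    stripped_html = lowercase.replace("<br />", " ")
    return re.sub("[%s]" % re.escape(string.punctuation), "", stripped_html)

cleaned = standardize_py("Great film!<br /><br />Don't miss it.")
print(repr(cleaned))  # 'great film  dont miss it'
```

Each `<br />` becomes one space, so two consecutive tags leave a double space; whitespace tokenization later treats that the same as a single space.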


###Vectorization

Choosing Between `adapt()` vs. Manual Vocabulary in `TextVectorization`

- **`adapt()`**  
  - Learns vocabulary automatically from dataset.  
  - Best for **small/medium datasets** or when vocab is unknown.  
  - Simple and reliable (handles token frequency, `[UNK]` etc.).  

- **Manual `vocabulary=` argument**  in TextVectorization
  - Use when you already have a predefined vocab (e.g., pretrained models, research papers, or large standardized datasets).  
  - Saves time on huge datasets (no scanning needed).  
  - Ensures vocab consistency across experiments.  
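
Conceptually, `adapt()` counts token frequencies and keeps the most frequent ones, while a manual vocabulary skips the counting step. A rough sketch, assuming whitespace tokenization and the `int` output mode, where index 0 is reserved for padding and index 1 for `[UNK]`; `build_vocabulary` is a hypothetical helper, not a Keras function:

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Keep the most frequent tokens, after two reserved slots."""
    counts = Counter(token for text in texts for token in text.split())
    # Index 0 is the padding token, index 1 the out-of-vocabulary token.
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

corpus = ["good movie", "bad movie", "good good plot"]

learned_vocab = build_vocabulary(corpus, max_tokens=4)  # what adapt() would learn
fixed_vocab = ["", "[UNK]", "movie", "plot"]            # a predefined vocabulary

print(learned_vocab)  # ['', '[UNK]', 'good', 'movie']
print(fixed_vocab)
```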

In [None]:
from tensorflow.keras.layers import TextVectorization

vectorize_layer = TextVectorization(
    max_tokens=10000,  # Maximum vocabulary size (including padding and [UNK])
    standardize=standardization,  # Custom standardization function defined above
    output_sequence_length=250,  # Pad or truncate every review to 250 tokens
    output_mode='int',  # Output integer indices instead of text
)


###Mapping Text

In [None]:
# Extract only the text (input data) from the training dataset, dropping the labels
x_train_text = x_train.map(lambda x, y: x)

# Adapt the vectorization layer to the training data (this builds the vocabulary)
vectorize_layer.adapt(x_train_text)


#### Text Standardization & Vectorization (Step-by-Step)

- Convert all input text to lowercase → ensures uniform word matching.  
- Remove HTML tags like `<br />` → cleans unnecessary formatting.  
- Remove punctuation → reduces noise in vocabulary.  
- Create a `TextVectorization` layer → turns text into integer sequences.  
- Limit vocabulary to the 10k most frequent words → keeps model efficient.  
- Pad or truncate reviews to fixed length (250 tokens) → ensures uniform input size.  
- Replace out-of-vocab words with `[UNK]` token → handles unseen words.  
- Extract only the text (drop labels) → prepares raw data for vocabulary building.  
- Adapt the vectorization layer on training text → builds internal vocabulary mapping.  
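
The padding, truncation, and `[UNK]` steps from the list above can be sketched in a few lines of plain Python. This assumes an already-built vocabulary where index 0 is padding and index 1 is `[UNK]`; `encode` is a hypothetical helper, not the layer's real implementation:

```python
def encode(text, vocabulary, output_sequence_length):
    """Map tokens to indices, then pad or truncate to a fixed length."""
    index = {token: i for i, token in enumerate(vocabulary)}
    ids = [index.get(token, 1) for token in text.split()]  # 1 = [UNK]
    ids = ids[:output_sequence_length]                     # truncate long inputs
    return ids + [0] * (output_sequence_length - len(ids)) # pad short inputs with 0

vocab = ["", "[UNK]", "good", "movie"]

short = encode("good movie tonight", vocab, 5)
long = encode("good good good movie movie good", vocab, 5)
print(short)  # [2, 3, 1, 0, 0]
print(long)   # [2, 2, 2, 3, 3]
```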

---

#### Note on Errors
- If cells are run **out of order**, the vectorization layer may be uninitialized.  
- This can trigger `UnimplementedError` when processing text.  
- Always run notebook **top-to-bottom** or reinitialize the layer before using.


###Vectorize function

In [None]:
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label
# Get one batch (32 reviews and their labels, matching batch_size=32) from the dataset
review, label = next(iter(x_train))
# Take the first review of the batch together with its label
first_review, first_label = review[0], label[0]
print("Review:", first_review)
print("Label:", x_train.class_names[first_label])
print()
print("Vectorized review:", vectorize_text(first_review, first_label))


Review: tf.Tensor(b'Silent Night, Deadly Night 5 is the very last of the series, and like part 4, it\'s unrelated to the first three except by title and the fact that it\'s a Christmas-themed horror flick.<br /><br />Except to the oblivious, there\'s some obvious things going on here...Mickey Rooney plays a toymaker named Joe Petto and his creepy son\'s name is Pino. Ring a bell, anyone? Now, a little boy named Derek heard a knock at the door one evening, and opened it to find a present on the doorstep for him. Even though it said "don\'t open till Christmas", he begins to open it anyway but is stopped by his dad, who scolds him and sends him to bed, and opens the gift himself. Inside is a little red ball that sprouts Santa arms and a head, and proceeds to kill dad. Oops, maybe he should have left well-enough alone. Of course Derek is then traumatized by the incident since he watched it from the stairs, but he doesn\'t grow up to be some killer Santa, he just stops talking.<br /><br />

In [None]:
# Look up the words behind individual integer indices in the vocabulary
# The vocabulary is built from the whole training set (x_train), not only from the example review above
print("1287 meaning", vectorize_layer.get_vocabulary()[1287])
print("7827 meaning",vectorize_layer.get_vocabulary()[7827])
print("9006 meaning ", vectorize_layer.get_vocabulary()[9006])

1287 meaning silent
7827 meaning cmon
9006 meaning  shrill


## Understanding `vectorize_text` and Batch Behavior

- **`tf.expand_dims(text, -1)`**
  - Adds an extra trailing dimension to the tensor.  
  - Reason: `TextVectorization` works on batches of strings.  
  - A single string like `"this movie is great"` is rank-0 → expanding makes it rank-1 with shape `(1,)`, i.e. a batch of one.  
  - When mapped over a dataset batch of shape `(32,)`, the result has shape `(32, 1)`, which the layer turns into `(32, 250)`.  


- **`next(iter(x_train))`**
  - Converts the dataset into an iterator and grabs the first batch.  
  - Returns a tuple: `(reviews_batch, labels_batch)`.  
  - Here `reviews_batch` has 32 reviews, and `labels_batch` has 32 labels.  

- **Picking a single review**
  - `first_review, first_label = review[0], label[0]` → selects the first sample.  
  - `x_train.class_names[first_label]` → decodes numeric label into `"pos"` or `"neg"`.  

- **Vectorization step**
  - `vectorize_text(first_review, first_label)` →  
    - Expands the review’s dimensions,  
    - Passes it through `vectorize_layer` to get integers,  
    - Returns `(vectorized_text, label)`.  
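
The `next(iter(...))` pattern and the label decoding can be tried without TensorFlow. The sketch below uses a hypothetical generator, `batched`, as a stand-in for the batched dataset; the review strings and labels are made up:

```python
def batched(pairs, batch_size):
    """Yield (reviews, labels) tuples, batch_size samples at a time."""
    for start in range(0, len(pairs), batch_size):
        chunk = pairs[start:start + batch_size]
        reviews = [text for text, _ in chunk]
        labels = [label for _, label in chunk]
        yield reviews, labels

class_names = ["neg", "pos"]
pairs = [(f"review {i}", i % 2) for i in range(100)]  # toy (text, label) data

# Turn the batches into an iterator and grab only the first batch.
reviews_batch, labels_batch = next(iter(batched(pairs, 32)))
first_review, first_label = reviews_batch[0], labels_batch[0]

print(len(reviews_batch), len(labels_batch))    # 32 32
print(first_review, class_names[first_label])   # review 0 neg
```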


##Vectorizing and Caching the Datasets

In [None]:
# Mapping the vectorize_text function to each dataset
train = x_train.map(vectorize_text)
val = x_val.map(vectorize_text)
test = x_test.map(vectorize_text)

# Cache the datasets to speed up access
# This stores the dataset in memory after the first pass, preventing it from being read multiple times from disk
# The prefetch method prepares the next batch of data while the model is training
AUTOTUNE = tf.data.AUTOTUNE

train = train.cache().prefetch(buffer_size=AUTOTUNE)
val = val.cache().prefetch(buffer_size=AUTOTUNE)
test = test.cache().prefetch(buffer_size=AUTOTUNE)

##Model

  - In typical tabular/classical ML, you see `Dense → Dense → Output`.  
  - In text problems, you need to first **convert words into numbers that make sense** before a Dense layer can work.  
  - That’s why the pipeline looks different: text → embedding → pooling → Dense.  





In [None]:
from tensorflow.keras import layers,models

model= models.Sequential([
        layers.Embedding(input_dim=10000,output_dim=16),
        layers.GlobalAveragePooling1D(),
        layers.Dropout(0.2),
        layers.Dense(1, activation='sigmoid')
                             ])

#model.summary()

In [None]:
for x_batch, y_batch in train.take(1):
    print("Input shape:", x_batch.shape)


Input shape: (32, 250)


###Layer by Layer Breakdown

1. **Embedding Layer**
   - `layers.Embedding(input_dim=10000, output_dim=16)`  
   - Think of it as a **lookup table**: every word index gets mapped to a 16-dimensional vector.  
   - Example:  
     - Word ID `45` → `[0.12, -0.8, 0.33, ..., 0.09]`  
   - Purpose: capture semantic meaning (words with similar context end up with similar vectors).  
   - Without this, the model would treat words as meaningless IDs.

2. **GlobalAveragePooling1D**
   - Takes the sequence of word embeddings and **averages them into a single vector**.  
   - This compresses each review's sequence into a fixed-size representation.  
   - Example: Review length 250 tokens → `[250, 16]` matrix → becomes `[16]` vector.

3. **Dropout**
   - `layers.Dropout(0.2)` randomly zeroes 20% of the pooled values during training.  
   - Reduces overfitting by forcing the model not to rely too heavily on certain features.

4. **Dense Layer**
   - `layers.Dense(1, activation='sigmoid')`  
   - Outputs a single probability between 0 and 1.  
   - Used for binary classification: **positive vs negative review**.



- Embeddings + pooling let the model **understand word relationships and compress info**.  
- After that, a single Dense layer is enough to classify sentiment.
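
As a rough illustration of the data flow (lookup → average → weighted sum → sigmoid), here is a plain-Python forward pass with random, untrained weights. It is a sketch of the idea, not the Keras implementation; dropout is left out because it is inactive at inference time, and all sizes are shrunk for readability:

```python
import math
import random

def forward(token_ids, embedding_table, dense_weights, dense_bias):
    """Embedding lookup, global average pooling, then Dense(1) + sigmoid."""
    vectors = [embedding_table[i] for i in token_ids]          # [seq_len, dim]
    dim = len(vectors[0])
    pooled = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]  # [dim]
    logit = sum(w * x for w, x in zip(dense_weights, pooled)) + dense_bias
    return 1.0 / (1.0 + math.exp(-logit))

rng = random.Random(0)
vocab_size, embed_dim = 10, 4  # toy sizes; the notebook uses 10000 and 16
embedding_table = [[rng.uniform(-1, 1) for _ in range(embed_dim)] for _ in range(vocab_size)]
dense_weights = [rng.uniform(-1, 1) for _ in range(embed_dim)]

probability = forward([2, 3, 1, 0, 0], embedding_table, dense_weights, 0.0)
print(probability)  # a single value strictly between 0 and 1
```

Note that the padding index 0 takes part in the average here, just like any other token.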



### Intuition
- Embedding = **learnable word meaning**  
- Pooling = **combine word meanings into sentence meaning**  
- Dense + Sigmoid = **decide positive/negative**  

Rule of Thumb
- `Embedding.input_dim` = `TextVectorization.max_tokens`  
- `Embedding.output_dim` = embedding size you choose (e.g., 16, 32, 128 …)

If `input_dim` < vocabulary size → token indices at or above `input_dim` fall outside the embedding table, which typically fails with an out-of-range lookup error.   
 If `input_dim` > vocabulary size → no error, but extra rows in the embedding table are wasted.
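
The arithmetic behind this rule of thumb, worked out for the sizes used in this notebook. The out-of-range part uses a plain Python list as a stand-in for the embedding table, so the exact error type differs from what TensorFlow would raise:

```python
max_tokens = 10000   # TextVectorization.max_tokens
embedding_dim = 16   # Embedding.output_dim

embedding_params = max_tokens * embedding_dim  # one 16-d row per token index
dense_params = embedding_dim * 1 + 1           # 16 weights + 1 bias
total_params = embedding_params + dense_params
print(embedding_params, dense_params, total_params)  # 160000 17 160017

# A table with one row too few cannot serve the highest token index (9999).
too_small_table = [[0.0] * embedding_dim for _ in range(max_tokens - 1)]
try:
    too_small_table[max_tokens - 1]
    lookup_failed = False
except IndexError:
    lookup_failed = True
print(lookup_failed)  # True
```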


##Compile

### `BinaryAccuracy` VS `accuracy`

- **`accuracy` (shorthand)**
  - In Keras, `"accuracy"` is just an alias.  
  - For **binary classification**, it defaults to `BinaryAccuracy(threshold=0.5)`.  
  - For **categorical classification**, it defaults to `CategoricalAccuracy`.  

- **`BinaryAccuracy(threshold=0.5)`**
  - Explicit version: says "count prediction as **1** if sigmoid output > 0.5, else **0**".  
  - You can change threshold if needed (e.g., `0.7` for stricter positives).  
  - Gives you more control and clarity in binary tasks.  


## Rule of Thumb
- Use `"accuracy"` if you’re okay with defaults.  
- Use `BinaryAccuracy` if you want to **tune threshold** or make it clear to future-you what’s happening.  
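
To make the effect of the threshold concrete, here is a small plain-Python version of the metric. It assumes a strict greater-than comparison against the threshold; the function and the sample values are illustrative only:

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    matches = sum(int(p == t) for p, t in zip(predictions, y_true))
    return matches / len(y_true)

y_true = [1, 0, 1, 1]
y_prob = [0.9, 0.4, 0.6, 0.3]

default_acc = binary_accuracy(y_true, y_prob)        # threshold 0.5
strict_acc = binary_accuracy(y_true, y_prob, 0.7)    # stricter positives
print(default_acc)  # 0.75
print(strict_acc)   # 0.5
```

Raising the threshold to 0.7 flips the 0.6 prediction from positive to negative, which costs one correct match in this toy example.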


In [None]:
from tensorflow.keras import losses,metrics
model.compile(
              optimizer="Adam",  # popular optimizer to optimize weights and biases
              loss=losses.BinaryCrossentropy(),  # Loss function for binary classification
              metrics=[metrics.BinaryAccuracy(threshold=0.5)]  # Predictions above 0.5 count as positive (1), the rest as negative (0)
              )
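
For intuition about the loss, binary cross-entropy averages `-[y*log(p) + (1-y)*log(1-p)]` over the samples. A minimal plain-Python sketch; the clipping constant `eps=1e-7` is an assumption chosen for this example:

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.5, 0.5])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4))  # 0.1054
print(round(unsure, 4))           # 0.6931
print(round(confident_wrong, 4))  # 2.3026
```

Confident wrong answers are punished far more than unsure ones, which is why the loss can rise even while accuracy stays roughly flat.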

##Training


In [None]:
model.fit(train, validation_data=val, epochs=10, verbose=1, shuffle=True)

Epoch 1/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 5ms/step - binary_accuracy: 0.9751 - loss: 0.0879 - val_binary_accuracy: 0.8616 - val_loss: 0.4200
Epoch 2/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 5ms/step - binary_accuracy: 0.9756 - loss: 0.0845 - val_binary_accuracy: 0.8606 - val_loss: 0.4326
Epoch 3/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 5ms/step - binary_accuracy: 0.9780 - loss: 0.0802 - val_binary_accuracy: 0.8606 - val_loss: 0.4406
Epoch 4/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 5ms/step - binary_accuracy: 0.9802 - loss: 0.0760 - val_binary_accuracy: 0.8542 - val_loss: 0.4643
Epoch 5/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 5ms/step - binary_accuracy: 0.9797 - loss: 0.0733 - val_binary_accuracy: 0.8570 - val_loss: 0.4661
Epoch 6/10
[1m625/625[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 7ms/step - binary_accuracy: 0.9808 - loss: 

<keras.src.callbacks.history.History at 0x7943d977d820>
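
In the log above, training accuracy keeps climbing while validation loss drifts upward, a pattern that is usually read as overfitting. A quick way to check this from the logged numbers, using only the five epochs whose `val_loss` is fully visible in the output:

```python
# Validation loss per epoch, copied from the first five epochs of the log above.
val_loss = [0.4200, 0.4326, 0.4406, 0.4643, 0.4661]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# True if validation loss never decreases from one epoch to the next.
rising = all(later >= earlier for earlier, later in zip(val_loss, val_loss[1:]))

print(best_epoch, rising)  # 1 True
```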

##Evaluate

In [None]:
loss, accuracy = model.evaluate(test, verbose=2)
# Evaluates the model on the test dataset (`test`)
# 'verbose=2' prints a single summary line instead of a progress bar (0 is silent, 1 shows the progress bar)

print("Loss", loss)  # How far off the model's predictions are (lower is better)
print(f"Accuracy {accuracy:.3f}")  # Accuracy as a fraction between 0 and 1, formatted to 3 decimal places


782/782 - 1s - 2ms/step - binary_accuracy: 0.8190 - loss: 0.6408
Loss 0.6407927870750427
Accuracy 0.819


##Predicting

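To predict on raw strings, build a new model that wraps `vectorize_layer` and the trained model, so that preprocessing happens inside the model. An extra activation layer can be appended if needed.

In [None]:
export_model = tf.keras.Sequential([
    vectorize_layer,                      # Handles raw string input: preprocessing happens inside the model
    model,                                # The trained model (already ends in a sigmoid)
    layers.Activation('sigmoid')          # Caution: this applies a second sigmoid on top of the model's own,
                                          # squeezing outputs into roughly 0.5 to 0.73; remove it to get the model's probabilities
])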

export_model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer='adam',
    metrics=['accuracy']
)

# Now, you can pass RAW TEXT directly
sample_texts = tf.constant(["This movie was great!",
                            "very good movie",
                            "Terribly good movie"])
predictions = export_model.predict(sample_texts)
print(predictions)


[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 131ms/step
[[0.5076539]
 [0.5061186]
 [0.5029051]]
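
One likely reason all three predictions sit just above 0.5: `model` already ends in a sigmoid, and `export_model` applies `Activation('sigmoid')` on top of it. A probability in [0, 1] passed through a second sigmoid can only land between 0.5 and about 0.73. The sketch below shows that range and undoes the extra sigmoid on the printed values with the inverse function (`logit`); both helpers are written here for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse of sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))

# Any probability in [0, 1] fed through sigmoid again lands in a narrow band.
low, high = sigmoid(0.0), sigmoid(1.0)
print(low, round(high, 4))  # 0.5 0.7311

# Undo the extra sigmoid on the three values printed above.
printed = [0.5076539, 0.5061186, 0.5029051]
recovered = [logit(p) for p in printed]
print([round(r, 3) for r in recovered])
```

The recovered numbers are what `model` itself would have returned for the three sample texts, under the assumption that the extra sigmoid is the only difference.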


##Saving the Model

In [None]:
model.save("sentiments.keras")


##Summary:

### Dataset Handling
- Downloaded **IMDB dataset** → noticed it’s directory-structured:
  - `train/pos`, `train/neg`
  - `test/pos`, `test/neg`
  - `train/unsup` (unlabeled reviews → ignored for supervised training).  
- Labels are inferred from directory names (`pos = 1`, `neg = 0`).  
- Unlike CSV datasets, reviews are stored as **raw text files**.  

### Data Preparation
- Used `tf.keras.utils.text_dataset_from_directory` to load train/test datasets.  
- Specified:
  - `validation_split=0.2`
  - `seed` (to shuffle consistently)
  - `batch_size` (for batching).
- Confirmed that reviews + labels are mixed inside batches (not separated by directory).  

### Text Preprocessing
- Built a **TextVectorization layer**:
  - `max_tokens = 10000` (limit vocabulary size).  
  - `output_sequence_length = 250` (pad/truncate reviews).  
  - `output_mode = "int"` (map words to integers).  
- Wrote a **custom standardization function**:
  - Lowercased text.  
  - Removed punctuation.  
  - Replaced `<br />` tags with spaces.  
- Applied `vectorize_text` function to transform `(text, label)` pairs into integer sequences.  

### Model Architecture
- Used a **Sequential model** with:
  1. **Embedding layer** (`input_dim=10000`, `output_dim=16`).  
  2. **GlobalAveragePooling1D** (summarizes embeddings).  
  3. **Dropout** (to reduce overfitting).  
  4. `Dense(1, activation='sigmoid')` (binary output).  
- Loss: `binary_crossentropy`.  
- Optimizer: `Adam`.  
- Metric: `binary_accuracy` (threshold at 0.5).  

### Training & Evaluation
- Trained with `model.fit(train, validation_data=val, epochs=10)`.  
- Evaluated on test set → **~81.9% accuracy**.  
- Saved model as `sentiments.keras`.  



---
## Navigation

###  Explore More Projects
[![Project: Fuel Efficiency](https://img.shields.io/badge/Project-Fuel_Efficiency-e6770b?style=for-the-badge&logo=github&logoColor=00FF80&labelColor=765898)](https://github.com/aypy01/tensorflow/blob/main/fuel_efficiency.ipynb)

[![Project: Fashion MNIST](https://img.shields.io/badge/Project-Fashion_MNIST-e6770b?style=for-the-badge&logo=github&logoColor=00FF80&labelColor=765898)](https://github.com/aypy01/tensorflow/blob/main/fashion-mnist-image-classifier.py)

---

## Author
 <p align="left">
  Created and maintained by &nbsp;
  <a href="https://github.com/aypy01" target="_blank">
  <img src="https://img.shields.io/badge/Aaditya_Yadav-aypy01-e6770b?style=flat-square&logo=github&logoColor=00FF80&labelColor=765898" alt="GitHub Badge"/>
</a>

</p>

<p>
<img src="https://readme-typing-svg.demolab.com?font=Fira+Code&duration=3000&pause=500&color=00FF80&background=765898&center=false&vCenter=false&width=440&lines=Break+Things+First%2C+Understand+Later;Built+to+Debug%2C+Not+Repeat;Learning+What+Actually+Sticks;Code.+Observe.+Refine." alt="Typing SVG" />
</p>


## License

This project is licensed under the [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT).
