<a href="https://colab.research.google.com/github/ekerintaiwoa/MediaApp/blob/master/bookimageprediction20.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Step 1: Extract Text from Images

In [4]:
from PIL import Image
import pytesseract
import pandas as pd
import os

In [6]:
# Paths
csv_path = "/content/drive/MyDrive/dataset/subset_with_ocr.csv"
image_dir = "/content/drive/MyDrive/dataset/mybookcovers"  # Unzipped folder

In [3]:
!pip install pytesseract

Collecting pytesseract
  Downloading pytesseract-0.3.13-py3-none-any.whl.metadata (11 kB)
Downloading pytesseract-0.3.13-py3-none-any.whl (14 kB)
Installing collected packages: pytesseract
Successfully installed pytesseract-0.3.13


In [10]:
df = pd.read_csv(csv_path)

In [11]:
df.shape

(2500, 8)

In [12]:
df.columns

Index(['ASIN', 'FILENAME', 'IMAGE_URL', 'TITLE', 'AUTHOR', 'CATEGORY_ID',
       'CATEGORY', 'ocr_text'],
      dtype='object')

In [13]:
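# Extract text from one cover image using OCR
def extract_text(filename):
    """Return the OCR text for an image in image_dir, or "" if OCR fails."""
    try:
        image_path = os.path.join(image_dir, filename)
        # Close the file handle as soon as OCR is done
        with Image.open(image_path) as img:
            text = pytesseract.image_to_string(img)
        return text.strip()
    except Exception:
        # Unreadable or corrupt image: empty strings are dropped later
        return ""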

In [14]:

# Only keep rows with matching image files
available_images = set(os.listdir(image_dir))
# .copy() avoids a SettingWithCopyWarning when a column is added to df later
df = df[df["FILENAME"].isin(available_images)].copy()

In [15]:
# Apply OCR
df["OCR_TEXT"] = df["FILENAME"].apply(extract_text)

In [16]:

# Drop empty OCR outputs (copy so later column assignments act on an independent frame)
df = df[df["OCR_TEXT"].str.strip() != ""].copy()
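OCR output from book covers tends to be noisy: stray symbols, titles broken across lines, and runs of whitespace. As a minimal sketch, the text could be normalized before tokenization. The helper `clean_ocr_text` is hypothetical and not part of the original pipeline, and the set of characters it keeps is an assumption.

```python
import re

def clean_ocr_text(text: str) -> str:
    """Normalize raw OCR output (hypothetical helper, not in the original notebook)."""
    # Replace anything that is not a letter, digit, apostrophe, or whitespace with a space.
    text = re.sub(r"[^A-Za-z0-9'\s]", " ", text)
    # Collapse newlines and repeated spaces into a single space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = "ANIMAL\nFARM ~~ |\n\n  George   Orwell"  # made-up OCR output
print(clean_ocr_text(raw))  # ANIMAL FARM George Orwell
```

In the notebook this could be applied with `df["OCR_TEXT"].apply(clean_ocr_text)`; whether it helps accuracy would need to be checked empirically.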

Step 2: Train TensorFlow DistilBERT on OCR Text

In [17]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification, create_optimizer
import tensorflow as tf


In [18]:
# Encode labels
label_encoder = LabelEncoder()
df["LABEL"] = label_encoder.fit_transform(df["CATEGORY"])
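`LabelEncoder` assigns integer ids according to the sorted order of the unique class names, and `inverse_transform` maps ids back to names. The plain-Python sketch below illustrates that mapping on a made-up list of categories; it is an illustration of the behavior, not scikit-learn's implementation.

```python
# Made-up sample of category names (not taken from the dataset).
categories = ["Travel", "Children's Books", "Travel", "Cookbooks"]

# Unique class names in sorted order define the ids.
classes = sorted(set(categories))
to_id = {name: i for i, name in enumerate(classes)}

labels = [to_id[c] for c in categories]   # forward mapping (like fit_transform)
decoded = [classes[i] for i in labels]    # reverse mapping (like inverse_transform)

print(classes)  # ["Children's Books", 'Cookbooks', 'Travel']
print(labels)   # [2, 0, 2, 1]
```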

In [19]:
# Train-test split
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["OCR_TEXT"], df["LABEL"], test_size=0.2, random_state=42
)

In [20]:
# Tokenization
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
train_enc = tokenizer(list(train_texts), truncation=True, padding=True, max_length=128, return_tensors="tf")
test_enc = tokenizer(list(test_texts), truncation=True, padding=True, max_length=128, return_tensors="tf")

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_enc), train_labels)).batch(16)
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_enc), test_labels)).batch(16)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [21]:
# Model
model = TFDistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(label_encoder.classes_)
)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

In [24]:
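# Compile
num_epochs = 20  # keep equal to the epochs argument passed to model.fit
steps = len(train_dataset) * num_epochs  # total optimizer steps covered by the LR schedule
optimizer, _ = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=steps)
model.compile(optimizer=optimizer, loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=["accuracy"])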
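With its default settings, `create_optimizer` attaches a learning-rate schedule that decays linearly from `init_lr` to zero over `num_train_steps`. The sketch below reproduces that shape in plain Python to show why `num_train_steps` should cover every epoch that is actually trained: once the step count passes it, the rate stays at zero. The function name and the figure of 100 steps per epoch are illustrative assumptions.

```python
def linear_decay_lr(step: int, init_lr: float, total_steps: int) -> float:
    """Linear decay from init_lr to 0, held at 0 after total_steps (illustrative)."""
    progress = min(step, total_steps) / total_steps
    return init_lr * (1.0 - progress)

steps_per_epoch = 100  # assumed value, for illustration only
init_lr = 2e-5

# Compare the learning rate at the start of epoch 11 for two schedule lengths.
for planned_epochs in (3, 20):
    total = steps_per_epoch * planned_epochs
    lr = linear_decay_lr(steps_per_epoch * 10, init_lr, total)
    print(f"schedule for {planned_epochs} epochs -> lr after 10 epochs: {lr:.2e}")
```

With a schedule sized for 3 epochs the rate is already zero by epoch 4, so later epochs would leave the weights effectively unchanged.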

In [26]:
# Train
model.fit(train_dataset, validation_data=test_dataset, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tf_keras.src.callbacks.History at 0x7942a6136350>

In [27]:
# Evaluate
loss, acc = model.evaluate(test_dataset)
print(f"OCR-based classifier accuracy: {acc:.4f}")

OCR-based classifier accuracy: 0.2924
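An accuracy figure is easier to interpret next to a trivial baseline. As a minimal sketch, the function below computes the accuracy obtained by always predicting the most frequent class. The function name is hypothetical and the label values are made up for illustration; in the notebook one would pass `list(test_labels)` instead.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label (hypothetical helper)."""
    counts = Counter(labels)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labels)

example_labels = [0, 2, 2, 1, 2, 0, 3, 2]  # made-up labels
print(f"Majority-class baseline: {majority_baseline_accuracy(example_labels):.3f}")
```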


In [29]:
# Save model and tokenizer
model.save_pretrained("tf_genre_classifier")
tokenizer.save_pretrained("tf_genre_classifier")

('tf_genre_classifier/tokenizer_config.json',
 'tf_genre_classifier/special_tokens_map.json',
 'tf_genre_classifier/vocab.txt',
 'tf_genre_classifier/added_tokens.json')

Step 3: Test the Model on a New Book Cover Image

In [28]:
from PIL import Image
import pytesseract

def extract_text_from_image(image_path):
    image = Image.open(image_path)
    text = pytesseract.image_to_string(image)
    return text.strip()


In [30]:
from transformers import DistilBertTokenizer, TFDistilBertForSequenceClassification
import tensorflow as tf
import numpy as np

# Load the trained model and tokenizer
model = TFDistilBertForSequenceClassification.from_pretrained("tf_genre_classifier")
tokenizer = DistilBertTokenizer.from_pretrained("tf_genre_classifier")


Some layers from the model checkpoint at tf_genre_classifier were not used when initializing TFDistilBertForSequenceClassification: ['dropout_19']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at tf_genre_classifier and are newly initialized: ['dropout_39']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [32]:
import pickle

# Assuming label_encoder is already defined and trained
with open("label_encoder.pkl", "wb") as f:
    pickle.dump(label_encoder, f)

In [33]:

# Load your label encoder (must match training)
import pickle
with open("label_encoder.pkl", "rb") as f:
    label_encoder = pickle.load(f)

In [34]:
# Prediction function
def predict_genre_from_image(image_path):
    text = extract_text_from_image(image_path)
    inputs = tokenizer(text, return_tensors="tf", truncation=True, padding=True, max_length=128)
    logits = model(**inputs).logits
    pred = tf.argmax(logits, axis=1).numpy()[0]
    return label_encoder.inverse_transform([pred])[0]
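Taking only the `argmax` hides how confident the model is. As a rough sketch, the logits can be turned into probabilities with a softmax and the top few genres listed. The helper `top_k_genres` and the example logits and genre names are hypothetical; in the notebook the inputs would be something like `logits[0].numpy().tolist()` and `list(label_encoder.classes_)`.

```python
import math

def top_k_genres(logits, class_names, k=3):
    """Return the k most probable (name, probability) pairs (hypothetical helper)."""
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort classes by probability, highest first.
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits and genre names, for illustration only.
for name, prob in top_k_genres([2.0, 0.5, 1.0], ["Travel", "Cookbooks", "History"], k=2):
    print(f"{name}: {prob:.2f}")
```

With accuracy near 0.29, near-uniform probabilities on a given cover would be a signal to treat the predicted genre with caution.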

In [35]:
genre = predict_genre_from_image("/content/sample_data/animalfarm.jpg")
print(f"Predicted Genre: {genre}")

Predicted Genre: Children's Books


In [38]:
genre = predict_genre_from_image("/content/sample_data/silentnightsex.png")
print(f"Predicted Genre: {genre}")

Predicted Genre: Travel
