# Part 4 - Binary Sentiment Classification & Part 6 - Clustering

# Install Dependencies
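This section installs the required libraries: scikit-learn for machine learning and DuckDB for querying Parquet files. The PyPI package is named `scikit-learn`; the deprecated `sklearn` alias fails to install, as the first cell shows. The remaining cells confirm the active environment and the installed scikit-learn version.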

In [2]:
!pip install sklearn

Collecting sklearn
  Downloading sklearn-0.0.post12.tar.gz (2.6 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'


  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
      rather than 'sklearn' for pip commands.
      
      Here is how to fix this error in the main use cases:
      - use 'pip install scikit-learn' rather than 'pip install sklearn'
      - replace 'sklearn' by 'scikit-learn' in your pip requirements files
        (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
      - if the 'sklearn' package is used by one of your dependencies,
        it would be great if you take some time to track which package uses
        'sklearn' instead of 'scikit-learn' and report it to their issue tracker
      - as a last resort, set the environment variable
        SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
      
      More information is available at
      https://github.com/scikit-learn/sklearn-pypi-packag

In [12]:
!pip install scikit-learn

Collecting scikit-learn
  Downloading scikit_learn-1.6.1-cp311-cp311-win_amd64.whl.metadata (15 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Downloading scipy-1.15.2-cp311-cp311-win_amd64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.6.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.6.1-cp311-cp311-win_amd64.whl (11.1 MB)
Using cached joblib-1.4.2-py3-none-any.whl (301 kB)
Downloading scipy-1.15.2-cp311-cp311-win_amd64.whl (41.2 MB)

In [8]:
import sys
print(sys.prefix)

c:\Users\Alindo\Downloads\assignment3notebook\.conda


In [13]:
import sklearn
print(sklearn.__version__)

1.6.1


In [3]:
!pip install scikit-learn



In [1]:
!pip install duckdb
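
A quick way to confirm what the install cells left behind is to query package metadata from the standard library. The sketch below is illustrative and not part of the original notebook; the helper name `installed_version` is hypothetical, and it looks packages up by their PyPI distribution names (`scikit-learn`, not `sklearn`).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used with pip, not import names.
for name in ("scikit-learn", "duckdb"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```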



## Task 6: Clustering / Segmentation (k-means)

### Objective
Segment products into 5 clusters using k-means clustering based on the following features per product:
- **mean_rating**: Average user rating per product.
- **total_reviews**: Total number of reviews for the product.
- **brand_id**: Integer ID mapped from distinct brand strings.
- **category_id**: Integer ID mapped from distinct main category strings.
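
To make the first two features concrete, the sketch below derives `mean_rating` and `total_reviews` from a handful of made-up review rows in plain Python. The rows, the ASIN strings, and the helper name `aggregate_products` are invented for illustration; the notebook itself computes these values with a DuckDB `GROUP BY parent_asin` query.

```python
from collections import defaultdict


def aggregate_products(reviews):
    """Group (parent_asin, rating) rows and compute per-product features."""
    ratings_by_product = defaultdict(list)
    for parent_asin, rating in reviews:
        # Mirror the notebook's filter: keep only ratings between 1 and 5.
        if 1 <= rating <= 5:
            ratings_by_product[parent_asin].append(rating)
    return {
        asin: {
            "mean_rating": sum(ratings) / len(ratings),
            "total_reviews": len(ratings),
        }
        for asin, ratings in ratings_by_product.items()
    }


# Hypothetical review rows: (parent_asin, rating). The 0 rating is invalid.
toy_reviews = [("A1", 5), ("A1", 4), ("B2", 2), ("A1", 3), ("B2", 0)]
features = aggregate_products(toy_reviews)
print(features)
```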

### Methodology
1. **Data Preparation**:
   - Use DuckDB to query the parquet files and aggregate features by `parent_asin`.
   - Compute `mean_rating` (average rating) and `total_reviews` (count of reviews) per product.
   - Map brands and categories to integer IDs using distinct values from the dataset.
   - Filter out invalid entries (e.g., missing or empty brands/categories, ratings outside the 1-5 range).
2. **Feature Scaling**:
   - Standardize features using `StandardScaler` so that each has zero mean and unit variance and contributes comparably to the distances k-means computes.
3. **K-means Clustering**:
   - Apply k-means with `k=5`, the default `k-means++` initialization (`n_init='auto'`), and `random_state=42` using `sklearn.cluster.KMeans`.
   - Assign cluster labels to each product.
4. **Cluster Analysis**:
   - For each cluster, compute:
     - Size (number of products).
     - Average `mean_rating`, `total_reviews`, `brand_id`, and `category_id` (the ID averages are not meaningful on their own, since the IDs are arbitrary labels).
     - Top 3 brands and categories by frequency.
   - Provide an interpretation of each cluster (e.g., high-rating products, low-review items).
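
Steps 1 and 2 can be sketched in plain Python: assign each distinct brand string an integer ID, then standardize a numeric column to zero mean and unit variance. The brand names and review counts below are made up, and `build_id_map` and `standardize` are hypothetical helpers. The scaling uses the population standard deviation, which is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev


def build_id_map(values):
    """Map each distinct string to an integer ID in first-seen order."""
    id_map = {}
    for value in values:
        if value not in id_map:
            id_map[value] = len(id_map)
    return id_map


def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


# Hypothetical products: brand strings and review counts.
brands = ["Acme", "Globex", "Acme", "Initech"]
total_reviews = [2, 4, 4, 6]

brand_map = build_id_map(brands)
brand_ids = [brand_map[b] for b in brands]
scaled_reviews = standardize(total_reviews)

print(brand_map)
print(brand_ids)
print(scaled_reviews)
```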

### Implementation
The following code performs the clustering and analysis and prints the results for each cluster. DuckDB aggregates the features directly from the Parquet files; the run shown here covers 32,609,113 unique products (about 32.6 million).

In [None]:
import duckdb
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from collections import Counter
import time

input_path = "F:/sentiments/sentiments/sentiment_*.parquet"
n_clusters = 5
random_state = 42

expected_categories = [
    "sentiment_All_Beauty", "sentiment_Amazon_Fashion", "sentiment_Appliances", "sentiment_Arts_Crafts_and_Sewing", "sentiment_Automotive",
    "sentiment_Baby_Products", "sentiment_Beauty_and_Personal_Care", "sentiment_Books", "sentiment_CDs_and_Vinyl",
    "sentiment_Cell_Phones_and_Accessories", "sentiment_Clothing_Shoes_and_Jewelry", "sentiment_Digital_Music", "sentiment_Electronics",
    "sentiment_Gift_Cards", "sentiment_Grocery_and_Gourmet_Food", "sentiment_Handmade_Products", "sentiment_Health_and_Household",
    "sentiment_Health_and_Personal_Care", "sentiment_Home_and_Kitchen", "sentiment_Industrial_and_Scientific", "sentiment_Kindle_Store",
    "sentiment_Magazine_Subscriptions", "sentiment_Movies_and_TV", "sentiment_Musical_Instruments", "sentiment_Office_Products",
    "sentiment_Patio_Lawn_and_Garden", "sentiment_Pet_Supplies", "sentiment_Software", "sentiment_Sports_and_Outdoors",
    "sentiment_Subscription_Boxes", "sentiment_Tools_and_Home_Improvement", "sentiment_Toys_and_Games", "sentiment_Video_Games", "sentiment_Unknown"
]


def get_most_common(series, top_n=3):
    """Gets the most common items and their counts from a Pandas Series."""
    counts = Counter(series.dropna())
    return counts.most_common(top_n)


start_time = time.time()
print("--- Starting Product Segmentation ---")


print("Initializing DuckDB...")
con = duckdb.connect()
print("DuckDB Initialized.")


print("Creating Brand and Category Mappings...")
try:
    mapping_query = f"""
        SELECT DISTINCT brand, main_category
        FROM read_parquet('{input_path}')
        WHERE brand IS NOT NULL AND TRIM(brand) != ''
          AND main_category IS NOT NULL AND TRIM(main_category) != ''
    """
    mappings_df = con.execute(mapping_query).fetch_df()

    unique_brands = mappings_df['brand'].unique()
    unique_categories = mappings_df['main_category'].unique() 

    brand_map = {brand: idx for idx, brand in enumerate(unique_brands)}
    category_map = {cat: idx for idx, cat in enumerate(unique_categories)}

    id_to_brand = {idx: brand for brand, idx in brand_map.items()}
    id_to_category = {idx: cat for cat, idx in category_map.items()}

    print(f"Found {len(brand_map)} unique brands and {len(category_map)} unique categories.")

except Exception as e:
    print(f"Error creating mappings: {e}")
    con.close()
    exit()


print("Aggregating product features (mean_rating, total_reviews)...")
try:
    aggregation_query = f"""
        WITH ProductStats AS (
            SELECT
                parent_asin,
                CAST(AVG(CAST(rating AS FLOAT)) AS FLOAT) as mean_rating,
                COUNT(*) as total_reviews,
                -- Use MIN/MAX or LIST to handle potential multiple brands/categories per parent_asin if needed
                -- Using FIRST assumes one primary brand/category per parent_asin in the data
                FIRST(brand) as brand,
                FIRST(main_category) as main_category
            FROM read_parquet('{input_path}')
            WHERE rating BETWEEN 1 AND 5
              AND parent_asin IS NOT NULL AND TRIM(parent_asin) != ''
              AND brand IS NOT NULL AND TRIM(brand) != ''        -- Ensure brand is valid
              AND main_category IS NOT NULL AND TRIM(main_category) != '' -- Ensure category is valid
            GROUP BY parent_asin
        )
        -- Final selection ensures we only include products where brand/category mapping is possible
        SELECT
            ps.parent_asin as product_id,
            ps.mean_rating,
            ps.total_reviews,
            ps.brand,
            ps.main_category
        FROM ProductStats ps
        WHERE ps.brand IS NOT NULL AND ps.main_category IS NOT NULL; -- Redundant check, but safe
    """
    
    product_features_df = con.execute(aggregation_query).fetch_df()
    print(f"Aggregated features for {len(product_features_df)} unique products.")

    if product_features_df.empty:
        print("No product data found after aggregation. Exiting.")
        con.close()
        exit()

except Exception as e:
    print(f"Error aggregating features: {e}")
    con.close()
    exit()


print("Preparing data for K-means (mapping IDs, scaling)...")
try:
    
    product_features_df['brand_id'] = product_features_df['brand'].map(brand_map)
    product_features_df['category_id'] = product_features_df['main_category'].map(category_map)

    missing_brands = product_features_df['brand_id'].isna().sum()
    missing_categories = product_features_df['category_id'].isna().sum()
    if missing_brands > 0 or missing_categories > 0:
        print(f"Warning: Found {missing_brands} products with unmapped brands and {missing_categories} with unmapped categories.")
        
        # Let's drop rows with mapping issues for cleaner clustering
        product_features_df.dropna(subset=['brand_id', 'category_id'], inplace=True)
        product_features_df['brand_id'] = product_features_df['brand_id'].astype(int)
        product_features_df['category_id'] = product_features_df['category_id'].astype(int)
        print(f"Proceeding with {len(product_features_df)} products after removing mapping failures.")

    if product_features_df.empty:
        print("No products remaining after handling mapping issues. Exiting.")
        con.close()
        exit()

    features = product_features_df[['mean_rating', 'total_reviews', 'brand_id', 'category_id']].values

    scaler = StandardScaler()
    features_scaled = scaler.fit_transform(features)
    print("Features scaled.")

except Exception as e:
    print(f"Error preparing data for K-means: {e}")
    con.close()
    exit()


print(f"Applying K-means clustering (k={n_clusters})...")
try:
    kmeans = KMeans(n_clusters=n_clusters, random_state=random_state, n_init='auto', verbose=0) # verbose=1 for logs
    cluster_labels = kmeans.fit_predict(features_scaled)

    product_features_df['cluster'] = cluster_labels
    print("K-means fitting complete.")

except Exception as e:
    print(f"Error during K-means clustering: {e}")
    con.close()
    exit()


print("\n--- Cluster Analysis ---")
for cluster_id in range(n_clusters):
    # Filter data for the current cluster
    cluster_df = product_features_df[product_features_df['cluster'] == cluster_id]
    cluster_size = len(cluster_df)

    if cluster_size == 0:
        print(f"\nCluster {cluster_id}: Empty")
        continue

    avg_mean_rating = cluster_df['mean_rating'].mean()
    avg_total_reviews = cluster_df['total_reviews'].mean()
    # IDs are arbitrary labels, so their averages carry no meaning; reported for completeness
    avg_brand_id = cluster_df['brand_id'].mean()
    avg_category_id = cluster_df['category_id'].mean()

    # Find most common brands and categories for interpretation
    top_brands = get_most_common(cluster_df['brand'], top_n=3)
    top_categories = get_most_common(cluster_df['main_category'], top_n=3)

    
    rating_desc = "High" if avg_mean_rating >= 4.2 else "Low" if avg_mean_rating < 3.5 else "Moderate"
   
    review_desc = "Very High" if avg_total_reviews > 1000 else "High" if avg_total_reviews > 100 else "Moderate" if avg_total_reviews > 10 else "Low"
    
    # Describe common elements
    brands_str = ", ".join([f"{b} ({c})" for b, c in top_brands]) if top_brands else "N/A"
    categories_str = ", ".join([f"{cat} ({c})" for cat, c in top_categories]) if top_categories else "N/A"

    interpretation = (
        f"{rating_desc} Rating ({avg_mean_rating:.2f}), {review_desc} Reviews ({avg_total_reviews:.1f}). "
        f"Common Brands: [{brands_str}]. Common Categories: [{categories_str}]."
    )

    print(f"\nCluster {cluster_id}:")
    print(f"  Size: {cluster_size} products ({cluster_size / len(product_features_df) * 100:.1f}%)")
    print(f"  Avg Mean Rating: {avg_mean_rating:.2f}")
    print(f"  Avg Total Reviews: {avg_total_reviews:.2f}")
    
    print(f"  Avg Brand ID: {avg_brand_id:.2f}")
    print(f"  Avg Category ID: {avg_category_id:.2f}")
    print(f"  Top 3 Brands: {top_brands}")
    print(f"  Top 3 Categories: {top_categories}")
    print(f"  Interpretation: {interpretation}")


print("\nClosing DuckDB connection...")
con.close()

end_time = time.time()
print(f"--- Script finished in {end_time - start_time:.2f} seconds ---")

--- Starting Product Segmentation ---
Initializing DuckDB...
DuckDB Initialized.
Creating Brand and Category Mappings...


Found 4769188 unique brands and 50 unique categories.
Aggregating product features (mean_rating, total_reviews)...


Aggregated features for 32609113 unique products.
Preparing data for K-means (mapping IDs, scaling)...
Features scaled.
Applying K-means clustering (k=5)...
K-means fitting complete.

--- Cluster Analysis ---

Cluster 0:
  Size: 6703788 products (20.6%)
  Avg Mean Rating: 4.48
  Avg Total Reviews: 10.88
  Avg Brand ID: 3215458.76
  Avg Category ID: 8.53
  Top 3 Brands: [('Naturalizer', 10367), ('Aerosoles', 10183), ('Goodthreads', 9859)]
  Top 3 Categories: [('Books', 2303290), ('AMAZON FASHION', 1616574), ('Amazon Home', 763316)]
  Interpretation: High Rating (4.48), Moderate Reviews (10.9). Common Brands: [Naturalizer (10367), Aerosoles (10183), Goodthreads (9859)]. Common Categories: [Books (2303290), AMAZON FASHION (1616574), Amazon Home (763316)].

Cluster 1:
  Size: 4716768 products (14.5%)
  Avg Mean Rating: 4.37
  Avg Total Reviews: 16.39
  Avg Brand ID: 1908860.60
  Avg Category ID: 29.78
  Top 3 Brands: [('Unknown', 404677), ('Generic', 46357), ('Format: DVD', 37983)]
  Top 3
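
The interpretation strings in this output come from fixed thresholds on the cluster averages. The sketch below restates those thresholds as two small functions so the labels can be checked by hand; the function names are hypothetical, while the cut-offs are the ones used in the analysis loop.

```python
def rating_label(avg_mean_rating):
    """Label a cluster's average rating using the notebook's cut-offs."""
    if avg_mean_rating >= 4.2:
        return "High"
    if avg_mean_rating < 3.5:
        return "Low"
    return "Moderate"


def review_label(avg_total_reviews):
    """Label a cluster's average review count using the notebook's cut-offs."""
    if avg_total_reviews > 1000:
        return "Very High"
    if avg_total_reviews > 100:
        return "High"
    if avg_total_reviews > 10:
        return "Moderate"
    return "Low"


# Cluster 0 averages as printed in the output.
print(rating_label(4.48), review_label(10.88))
```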

## Task 4: Binary Sentiment Prediction (Logistic Regression)

### Objective
Perform binary sentiment classification on review text, where:
- **Positive**: Rating > 3
- **Negative**: Rating ≤ 3

Evaluate the model using accuracy, F1 score, and a confusion matrix.
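
The labelling rule is a single threshold on the star rating. A minimal sketch, with a hypothetical helper name and made-up ratings:

```python
def sentiment_label(rating):
    """Return 1 (positive) for ratings above 3, else 0 (negative)."""
    return int(rating > 3)


ratings = [1, 2, 3, 4, 5]
labels = [sentiment_label(r) for r in ratings]
print(labels)  # → [0, 0, 0, 1, 1]
```

A 3-star review therefore lands in the negative class.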

### Methodology
1. **Data Preparation**:
   - Use DuckDB to query review text and ratings from parquet files.
   - Filter for valid ratings (1-5) and non-empty review text.
   - Create binary sentiment labels (1 for positive, 0 for negative).
   - Split each batch into 80% training and 20% test sets with `train_test_split` (`random_state=42`).
2. **Text Vectorization**:
   - Apply TF-IDF vectorization on review text using `TfidfVectorizer` with:
     - Lowercase conversion.
     - Token pattern `\b\w+\b`.
     - Minimum document frequency of 5.
     - Maximum document frequency of 80%.
3. **Classifier**:
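   - Use `SGDClassifier` with `loss='log_loss'`, which fits a logistic regression model by stochastic gradient descent.
   - Train the model incrementally with `partial_fit` in batches of 5 million reviews to handle the large dataset. The TF-IDF vocabulary is fitted on the first batch and reused for all later batches.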
4. **Evaluation**:
   - Compute accuracy, F1 score, and confusion matrix for each test batch.
   - Aggregate metrics across all batches to report overall performance.

### Implementation
The following code processes the dataset in batches, trains the logistic regression model, and evaluates performance. It handles 503,024,752 valid reviews in 101 batches, with results reported per batch and overall.

In [3]:
import duckdb
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

input_path = "F:/sentiments/sentiments/sentiment_*.parquet"

batch_size = 5_000_000  # reviews fetched per batch

vectorizer = TfidfVectorizer(
    lowercase=True,
    token_pattern=r'\b\w+\b',
    min_df=5,
    max_df=0.8
)
model = SGDClassifier(loss='log_loss', random_state=42)
classes = [0, 1]

offset = 0
batch_num = 1
con = duckdb.connect()

all_preds = []
all_truths = []

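while True:
    # Page through the filtered rows with LIMIT/OFFSET. The query has no ORDER BY,
    # so this relies on DuckDB returning rows in the same order on every pass.
    query = f"""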
        SELECT text, rating
        FROM read_parquet('{input_path}')
        WHERE rating BETWEEN 1 AND 5
          AND text IS NOT NULL
          AND LENGTH(TRIM(text)) > 0
        LIMIT {batch_size} OFFSET {offset}
    """
    df = con.execute(query).fetch_df()
    if df.empty:
        break

    df["label"] = (df["rating"] > 3).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        df["text"], df["label"], test_size=0.2, random_state=42
    )

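    if batch_num == 1:
        # Fit the TF-IDF vocabulary and IDF weights on the first batch only;
        # `classes` must be supplied on the first partial_fit call.
        X_train_vec = vectorizer.fit_transform(X_train)
        model.partial_fit(X_train_vec, y_train, classes=classes)
    else:
        # Later batches reuse the first batch's vocabulary; unseen terms are ignored.
        X_train_vec = vectorizer.transform(X_train)
        model.partial_fit(X_train_vec, y_train)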

    X_test_vec = vectorizer.transform(X_test)
    y_pred = model.predict(X_test_vec)

    all_preds.extend(y_pred)
    all_truths.extend(y_test)

    acc = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    cm = confusion_matrix(y_test, y_pred)

    print(f" Batch {batch_num} — accuracy: {acc:.4f}, F1: {f1:.4f}")
    print(cm, "\n")

    offset += batch_size
    batch_num += 1

# Calculate overall metrics
if all_preds and all_truths:
    overall_acc = accuracy_score(all_truths, all_preds)
    overall_f1 = f1_score(all_truths, all_preds)
    overall_cm = confusion_matrix(all_truths, all_preds)

    print("Overall Metrics Across All Batches:")
    print(f"Total Confusion Matrix:\n{overall_cm}")
    print(f"Overall Accuracy: {overall_acc:.4f}")
    print(f"Overall F1 Score: {overall_f1:.4f}")

con.close()

 Batch 1 — accuracy: 0.8713, F1: 0.9175
[[155895  99598]
 [ 29096 715411]] 



 Batch 2 — accuracy: 0.8839, F1: 0.9307
[[103513  96347]
 [ 19786 780354]] 



 Batch 3 — accuracy: 0.8802, F1: 0.9266
[[124176  97018]
 [ 22800 756006]] 



 Batch 4 — accuracy: 0.8753, F1: 0.9276
[[ 75756 109543]
 [ 15193 799508]] 



 Batch 5 — accuracy: 0.8732, F1: 0.9244
[[ 98472 108185]
 [ 18593 774750]] 



 Batch 6 — accuracy: 0.8732, F1: 0.9222
[[121276 104512]
 [ 22336 751876]] 



 Batch 7 — accuracy: 0.8686, F1: 0.9161
[[150792 104706]
 [ 26691 717811]] 



 Batch 8 — accuracy: 0.8684, F1: 0.9200
[[111479 110180]
 [ 21418 756923]] 



 Batch 9 — accuracy: 0.8599, F1: 0.9095
[[156018 111985]
 [ 28105 703892]] 



 Batch 10 — accuracy: 0.8594, F1: 0.9147
[[105720 116688]
 [ 23880 753712]] 



 Batch 11 — accuracy: 0.8604, F1: 0.9141
[[117397 114830]
 [ 24758 743015]] 



 Batch 12 — accuracy: 0.8618, F1: 0.9133
[[133923 111594]
 [ 26591 727892]] 



 Batch 13 — accuracy: 0.8635, F1: 0.9173
[[106781 116880]
 [ 19573 756766]] 



 Batch 14 — accuracy: 0.8688, F1: 0.9277
[[ 27198 124035]
 [  7201 841566]] 



 Batch 15 — accuracy: 0.8718, F1: 0.9293
[[ 29420 120828]
 [  7353 842399]] 



 Batch 16 — accuracy: 0.8729, F1: 0.9298
[[ 30664 120393]
 [  6726 842217]] 



 Batch 17 — accuracy: 0.8739, F1: 0.9303
[[ 31804 119065]
 [  7072 842059]] 



 Batch 18 — accuracy: 0.8747, F1: 0.9307
[[ 32998 118063]
 [  7204 841735]] 



 Batch 19 — accuracy: 0.8767, F1: 0.9326
[[ 23018 117543]
 [  5783 853656]] 



 Batch 20 — accuracy: 0.8592, F1: 0.9173
[[ 78177 122813]
 [ 17961 781049]] 



 Batch 21 — accuracy: 0.8475, F1: 0.9016
[[148854 119608]
 [ 32877 698661]] 



 Batch 22 — accuracy: 0.8538, F1: 0.9053
[[154724 113956]
 [ 32260 699060]] 



 Batch 23 — accuracy: 0.8582, F1: 0.9080
[[158018 110158]
 [ 31686 700138]] 



 Batch 24 — accuracy: 0.8574, F1: 0.9098
[[137887 112836]
 [ 29749 719528]] 



 Batch 25 — accuracy: 0.8577, F1: 0.9131
[[110437 116237]
 [ 26023 747303]] 



 Batch 26 — accuracy: 0.8608, F1: 0.9148
[[113293 112905]
 [ 26324 747478]] 



 Batch 27 — accuracy: 0.8628, F1: 0.9159
[[115644 111158]
 [ 25996 747202]] 



 Batch 28 — accuracy: 0.8643, F1: 0.9168
[[116449 109364]
 [ 26380 747807]] 



 Batch 29 — accuracy: 0.8654, F1: 0.9175
[[117600 108303]
 [ 26266 747831]] 



 Batch 30 — accuracy: 0.8665, F1: 0.9180
[[118989 107263]
 [ 26243 747505]] 



 Batch 31 — accuracy: 0.8679, F1: 0.9188
[[120139 106231]
 [ 25895 747735]] 



 Batch 32 — accuracy: 0.8686, F1: 0.9192
[[120636 105040]
 [ 26405 747919]] 



 Batch 33 — accuracy: 0.8698, F1: 0.9200
[[121737 104349]
 [ 25810 748104]] 



 Batch 34 — accuracy: 0.8693, F1: 0.9196
[[121986 104659]
 [ 26053 747302]] 



 Batch 35 — accuracy: 0.8701, F1: 0.9201
[[122277 103999]
 [ 25913 747811]] 



 Batch 36 — accuracy: 0.8704, F1: 0.9202
[[122801 103740]
 [ 25862 747597]] 



 Batch 37 — accuracy: 0.8651, F1: 0.9175
[[115343 112055]
 [ 22825 749777]] 



 Batch 38 — accuracy: 0.8566, F1: 0.9117
[[116018 123200]
 [ 20250 740532]] 



 Batch 39 — accuracy: 0.8584, F1: 0.9127
[[117823 121077]
 [ 20569 740531]] 



 Batch 40 — accuracy: 0.8602, F1: 0.9137
[[119800 118848]
 [ 20936 740416]] 



 Batch 41 — accuracy: 0.8616, F1: 0.9145
[[121634 117385]
 [ 21004 739977]] 



 Batch 42 — accuracy: 0.8623, F1: 0.9148
[[123186 116297]
 [ 21383 739134]] 



 Batch 43 — accuracy: 0.8634, F1: 0.9154
[[123786 115267]
 [ 21356 739591]] 



 Batch 44 — accuracy: 0.8640, F1: 0.9157
[[124857 114585]
 [ 21455 739103]] 



 Batch 45 — accuracy: 0.8651, F1: 0.9164
[[125957 113174]
 [ 21764 739105]] 



 Batch 46 — accuracy: 0.8519, F1: 0.9095
[[108044 127419]
 [ 20680 743857]] 



 Batch 47 — accuracy: 0.8483, F1: 0.9068
[[109853 130318]
 [ 21430 738399]] 



 Batch 48 — accuracy: 0.8523, F1: 0.9088
[[116142 125561]
 [ 22109 736188]] 



 Batch 49 — accuracy: 0.8692, F1: 0.9208
[[109180 111168]
 [ 19600 760052]] 



 Batch 50 — accuracy: 0.8629, F1: 0.9165
[[110593 116654]
 [ 20487 752266]] 



 Batch 51 — accuracy: 0.8645, F1: 0.9205
[[ 79793 117852]
 [ 17681 784674]] 



 Batch 52 — accuracy: 0.8626, F1: 0.9185
[[ 88221 119295]
 [ 18066 774418]] 



 Batch 53 — accuracy: 0.8614, F1: 0.9166
[[ 99881 119799]
 [ 18818 761502]] 



 Batch 54 — accuracy: 0.8664, F1: 0.9181
[[117330 112126]
 [ 21442 749102]] 



 Batch 55 — accuracy: 0.8730, F1: 0.9221
[[121063 104988]
 [ 22009 751940]] 



 Batch 56 — accuracy: 0.8743, F1: 0.9230
[[120672 103433]
 [ 22260 753635]] 



 Batch 57 — accuracy: 0.8742, F1: 0.9229
[[121827 103538]
 [ 22240 752395]] 



 Batch 58 — accuracy: 0.8748, F1: 0.9232
[[122180 103134]
 [ 22074 752612]] 



 Batch 59 — accuracy: 0.8753, F1: 0.9235
[[122531 102489]
 [ 22204 752776]] 



 Batch 60 — accuracy: 0.8756, F1: 0.9237
[[122537 101945]
 [ 22448 753070]] 



 Batch 61 — accuracy: 0.8759, F1: 0.9238
[[123285 101811]
 [ 22307 752597]] 



 Batch 62 — accuracy: 0.8760, F1: 0.9239
[[123042 101634]
 [ 22361 752963]] 



 Batch 63 — accuracy: 0.8766, F1: 0.9242
[[124068 101078]
 [ 22284 752570]] 



 Batch 64 — accuracy: 0.8764, F1: 0.9241
[[123828 101364]
 [ 22228 752580]] 



 Batch 65 — accuracy: 0.8766, F1: 0.9242
[[124236 100779]
 [ 22574 752411]] 



 Batch 66 — accuracy: 0.8773, F1: 0.9246
[[124842 100302]
 [ 22377 752479]] 



 Batch 67 — accuracy: 0.8753, F1: 0.9237
[[120021 102518]
 [ 22209 755252]] 



 Batch 68 — accuracy: 0.8686, F1: 0.9219
[[ 92988 113508]
 [ 17882 775622]] 



 Batch 69 — accuracy: 0.8809, F1: 0.9357
[[ 14319 115172]
 [  3936 866573]] 



 Batch 70 — accuracy: 0.8637, F1: 0.9254
[[ 18473 131565]
 [  4775 845187]] 



 Batch 71 — accuracy: 0.8760, F1: 0.9327
[[ 16208 120103]
 [  3885 859804]] 



 Batch 72 — accuracy: 0.8765, F1: 0.9330
[[ 16812 119502]
 [  4006 859680]] 



 Batch 73 — accuracy: 0.8758, F1: 0.9324
[[ 19340 119713]
 [  4532 856415]] 



 Batch 74 — accuracy: 0.8427, F1: 0.9094
[[ 52903 144631]
 [ 12659 789807]] 



 Batch 75 — accuracy: 0.8468, F1: 0.9113
[[ 59956 140140]
 [ 13099 786805]] 



 Batch 76 — accuracy: 0.8527, F1: 0.9148
[[ 61318 134699]
 [ 12630 791353]] 



 Batch 77 — accuracy: 0.8633, F1: 0.9202
[[ 75240 121894]
 [ 14850 788016]] 



 Batch 78 — accuracy: 0.8764, F1: 0.9277
[[ 83151 105791]
 [ 17849 793209]] 



 Batch 79 — accuracy: 0.8703, F1: 0.9206
[[118251 108904]
 [ 20752 752093]] 



 Batch 80 — accuracy: 0.8650, F1: 0.9175
[[114214 112617]
 [ 22344 750825]] 



 Batch 81 — accuracy: 0.8572, F1: 0.9127
[[111081 120080]
 [ 22699 746140]] 



 Batch 82 — accuracy: 0.8527, F1: 0.9059
[[144186 121483]
 [ 25779 708552]] 



 Batch 83 — accuracy: 0.8482, F1: 0.8997
[[167593 123126]
 [ 28665 680616]] 



 Batch 84 — accuracy: 0.8433, F1: 0.9023
[[119968 131836]
 [ 24864 723332]] 



 Batch 85 — accuracy: 0.8445, F1: 0.9031
[[119750 131231]
 [ 24270 724749]] 



 Batch 86 — accuracy: 0.8452, F1: 0.9032
[[122789 130661]
 [ 24182 722368]] 



 Batch 87 — accuracy: 0.8178, F1: 0.8840
[[124012 154696]
 [ 27477 693815]] 



 Batch 88 — accuracy: 0.8780, F1: 0.9288
[[ 81560 100595]
 [ 21452 796393]] 



 Batch 89 — accuracy: 0.8714, F1: 0.9241
[[ 88846 110182]
 [ 18418 782554]] 



 Batch 90 — accuracy: 0.8697, F1: 0.9218
[[101066 112095]
 [ 18255 768584]] 



 Batch 91 — accuracy: 0.8652, F1: 0.9162
[[128869 113797]
 [ 20989 736345]] 



 Batch 92 — accuracy: 0.8699, F1: 0.9203
[[118350 106245]
 [ 23854 751551]] 



 Batch 93 — accuracy: 0.8688, F1: 0.9198
[[115763 108707]
 [ 22528 753002]] 



 Batch 94 — accuracy: 0.8699, F1: 0.9206
[[115143 109015]
 [ 21122 754720]] 



 Batch 95 — accuracy: 0.8691, F1: 0.9201
[[115086 109871]
 [ 21013 754030]] 



 Batch 96 — accuracy: 0.8697, F1: 0.9205
[[115450 109406]
 [ 20900 754244]] 



 Batch 97 — accuracy: 0.8830, F1: 0.9315
[[ 87646 101029]
 [ 15941 795384]] 



 Batch 98 — accuracy: 0.8802, F1: 0.9286
[[100322 103244]
 [ 16599 779835]] 



 Batch 99 — accuracy: 0.8763, F1: 0.9234
[[130409 104414]
 [ 19294 745883]] 



 Batch 100 — accuracy: 0.8595, F1: 0.9106
[[143305 115598]
 [ 24944 716153]] 



 Batch 101 — accuracy: 0.8457, F1: 0.9031
[[ 76729  77338]
 [ 16028 434856]] 



Overall Metrics Across All Batches:
Total Confusion Matrix:
[[10628617 11438593]
 [ 2071764 76465977]]
Overall Accuracy: 0.8657
Overall F1 Score: 0.9188
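
The overall figures can be re-derived from the total confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predictions, class 0 first) and uses the counts printed above; the function name is hypothetical.

```python
def metrics_from_confusion(cm):
    """Compute accuracy and positive-class F1 from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# Total confusion matrix across all test batches, copied from the output.
overall_cm = [[10628617, 11438593], [2071764, 76465977]]
accuracy, f1 = metrics_from_confusion(overall_cm)
print(f"accuracy={accuracy:.4f}, f1={f1:.4f}")  # → accuracy=0.8657, f1=0.9188
```

Read the same way, the first row shows that 11,438,593 of the 22,067,210 negative test reviews were predicted positive, a pattern worth keeping in mind alongside the headline accuracy.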
