#Step 1: Installing Dask for Sentiment Analysis
This command installs several Python libraries for parallel computing, machine learning, and natural language processing, including Dask with all optional dependencies, Dask-ML for scalable ML algorithms, the Transformers library for pre-trained NLP models, and PyTorch for neural network-based AI tasks.

In [1]:
!pip install dask[complete] dask-ml transformers torch



# Step 2: Intialise Dask Clusters and Load the data
Read the dataset from a file, which could be stored on your Google Drive:

In [2]:
from dask.distributed import Client, LocalCluster

# For local development
cluster = LocalCluster(
    n_workers=4,          # Number of workers
    threads_per_worker=2, # Number of threads per each worker
    memory_limit='2GB'    # Memory limit for each worker
)
client = Client(cluster)

# Enable adaptive scaling
cluster.adapt(minimum=2, maximum=10)

client = Client(processes=False)  # Use processes=False if you're on a single machine for better debugging

INFO:distributed.http.proxy:To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO:distributed.scheduler:State start
INFO:distributed.scheduler:  Scheduler at:     tcp://127.0.0.1:40917
INFO:distributed.scheduler:  dashboard at:  http://127.0.0.1:8787/status
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:36659'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:40233'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:41793'
INFO:distributed.nanny:        Start Nanny at: 'tcp://127.0.0.1:43923'
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:39085', name: 0, status: init, memory: 0, processing: 0>
INFO:distributed.scheduler:Starting worker compute stream, tcp://127.0.0.1:39085
INFO:distributed.core:Starting established connection to tcp://127.0.0.1:34734
INFO:distributed.scheduler:Register worker <WorkerState 'tcp://127.0.0.1:44791', name: 2, stat

In [3]:
from google.colab import drive
drive.mount('/content/drive')

INFO:distributed.deploy.adaptive:Retiring workers [1, 2]
INFO:distributed.scheduler:Retire worker names (1, 2)
INFO:distributed.scheduler:Retiring worker tcp://127.0.0.1:42481
INFO:distributed.scheduler:Retiring worker tcp://127.0.0.1:44791
INFO:distributed.active_memory_manager:Retiring worker tcp://127.0.0.1:44791; no unique keys need to be moved away.
INFO:distributed.active_memory_manager:Retiring worker tcp://127.0.0.1:42481; no unique keys need to be moved away.
INFO:distributed.scheduler:Remove worker <WorkerState 'tcp://127.0.0.1:42481', name: 1, status: closing_gracefully, memory: 0, processing: 0> (stimulus_id='retire-workers-1714865591.2226017')
INFO:distributed.scheduler:Retired worker tcp://127.0.0.1:42481
INFO:distributed.scheduler:Remove worker <WorkerState 'tcp://127.0.0.1:44791', name: 2, status: closing_gracefully, memory: 0, processing: 0> (stimulus_id='retire-workers-1714865591.2226017')
INFO:distributed.scheduler:Retired worker tcp://127.0.0.1:44791
INFO:distribute

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [4]:
import dask.dataframe as dd

# Define schema: Dask uses pandas for dtype, so define similarly
schema = {
    "customer_id": str,
    "product_id": str,
    "product_parent": str,
    "product_title": str,
    "product_category": str,
    "star_rating": int,
    "full_text": str,
    "language": str
}

# Read the data with predefined schema
df = dd.read_parquet('/content/drive/MyDrive/Big Data Project/language.parquet/', columns=list(schema.keys()))


In [5]:
import torch

# Check if CUDA is available and then set the default device to GPU
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("Using GPU:", torch.cuda.get_device_name(0))
else:
    device = torch.device("cpu")
    print("GPU not available, using CPU instead.")


INFO:distributed.nanny:Worker process 123800 was killed by signal 9
INFO:distributed.nanny:Worker process 123803 was killed by signal 9


Using GPU: NVIDIA A100-SXM4-40GB


# Step 3 : Sentiment Analysis

In [6]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Pre-load models and tokenizers based on languages to avoid reloading them multiple times
model_paths = {
    "en": "nlptown/bert-base-multilingual-uncased-sentiment",
    "es": "dccuchile/bert-base-spanish-wwm-cased",
    "fr": "camembert/camembert-large",
    "de": "bert-base-german-dbmdz-uncased",
    "zh": "bert-base-chinese",
    "ar": "aubmindlab/bert-base-arabert",
    "ru": "DeepPavlov/rubert-base-cased",
    "pt": "neuralmind/bert-base-portuguese-cased",
    "nl": "wietsedv/bert-base-dutch-cased",
    "it": "dbmdz/bert-base-italian-xxl-cased"
}

models = {lang: AutoModelForSequenceClassification.from_pretrained(path) for lang, path in model_paths.items()}
tokenizers = {lang: AutoTokenizer.from_pretrained(path) for lang, path in model_paths.items()}

def get_sentiment(texts, lang_codes):
    results = []
    for text, lang_code in zip(texts, lang_codes):
        model = models.get(lang_code, models['en'])
        tokenizer = tokenizers.get(lang_code, tokenizers['en'])
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
        outputs = model(**inputs)
        predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
        results.append(predictions.argmax(dim=-1).item())
    return results

# Use Dask's map_partitions to apply the function
import pandas as pd

def apply_sentiment(df):
    df['sentiment'] = get_sentiment(df['full_text'].tolist(), df['language'].tolist())
    return df




The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at dccuchile/bert-base-spanish-wwm-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at camembert/camembert-large and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.ou

In [7]:
# Correct meta information including all columns
meta_df = {
    "customer_id": str,
    "product_id": str,
    "product_parent": str,
    "product_title": str,
    "product_category": str,
    "star_rating": int,
    "full_text": str,
    "language": str,
    "sentiment": int  # New sentiment column as integer
}

result_df = df.map_partitions(apply_sentiment, meta=meta_df)

INFO:distributed.core:Event loop was unresponsive in Nanny for 118.09s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Scheduler for 117.92s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Worker for 117.92s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Scheduler for 117.93s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Nanny for 117.93s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This ca

In [8]:
result_df = result_df.compute()  # This will trigger the computation

This may cause some slowdown.
Consider scattering data ahead of time and using futures.
INFO:distributed.core:Event loop was unresponsive in Scheduler for 9.06s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
INFO:distributed.core:Event loop was unresponsive in Worker for 9.06s.  This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


In [19]:
result_df.to_parquet('/content/drive/MyDrive/Big Data Project/senti.parquet')

In [20]:
print(result_df)

     customer_id  product_id product_parent  \
0       32035145  B00CO2UY6C      546440076   
1        9213870  B00WE2W2A8      319585048   
2       12190192  B00NEZ6OW6      721363676   
3        4176674  B00GJQN89O      934898207   
4       22633251  B00AYZB6Z4      551463410   
...          ...         ...            ...   
2427    27014713  B000P6DYWA      911228139   
2428    49034600  0743458117      859537711   
2429      468675  B011THMEII      476091171   
2430      588716  B000AUMYMM      306567904   
2431    50849801  B0091UIMK0      750008325   

                                          product_title product_category  \
0     Safavieh Lyndhurst Collection LNH214A Traditio...        Furniture   
1     New Wayzon Mini Clip Metal Screen MP3 Music Me...      Electronics   
2     Polaroid Cube HD 1080p Lifestyle Action Video ...           Camera   
3     2 PINK Droplet Latex Free Blender Sponges Liqu...           Beauty   
4                40 Inch LED Light Bar DR 14,400 Lumens