# Labeling

## Automatic Fake Review Labeling using DistilBERT

- Uses the `transformers` pipeline with `astrosbd/fake-reviews-distilbert-v3` to automatically label reviews as **real (1)** or **fake (0)**. The model itself returns the labels `OR` and `CG`, which the code below maps to 1 and 0.
- Processes a CSV dataset of user reviews and applies the classifier to each entry, with a `tqdm` progress bar to track the run.
- Saves the newly labeled dataset for downstream tasks such as fine-tuning, evaluation, or data balancing.


In [None]:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="astrosbd/fake-reviews-distilbert-v3")
# Example review to classify
review = "nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice nice"
# Alternative example to try:
# review = "This product is amazing! It works exactly as described and I highly recommend it."
# Classify the review as fake or real
result = pipe(review)

# Output the result
print(result)


Device set to use cuda:0


[{'label': 'CG', 'score': 0.9998170733451843}]
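
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. The following is a minimal sketch of turning that output into the integer labels used later in this notebook (`OR` to 1, `CG` to 0). The helper name `to_binary_label` is hypothetical, and the `OR` prediction is a hand-written stand-in shaped like the output above, so the sketch runs without downloading the model.

```python
from typing import Optional

# Mapping used in this notebook: 'OR' is treated as real (1), 'CG' as fake (0).
LABEL_TO_INT = {"OR": 1, "CG": 0}


def to_binary_label(prediction: list) -> Optional[int]:
    """Convert one pipeline result (a list holding one dict) to 1, 0, or None."""
    if not prediction:
        return None
    # Unknown labels fall through to None instead of raising.
    return LABEL_TO_INT.get(prediction[0].get("label"))


# Same shape and values as the output printed above.
cg_prediction = [{"label": "CG", "score": 0.9998170733451843}]
# Hand-written example; the score is illustrative, not a model output.
or_prediction = [{"label": "OR", "score": 0.9}]

print(to_binary_label(cg_prediction))  # 0
print(to_binary_label(or_prediction))  # 1
print(to_binary_label([]))             # None
```

Keeping the mapping in one dictionary means an unexpected label yields `None` rather than a silent misclassification.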


In [None]:
import pandas as pd
from transformers import pipeline
from tqdm import tqdm

# Enable tqdm for pandas
tqdm.pandas()

# Load the fake review detection pipeline
pipe = pipeline("text-classification", model="astrosbd/fake-reviews-distilbert-v3")

# Load your dataset
df = pd.read_csv('/content/processed_user_reviewContent.csv')

# Optional: Preview the data
print(df.head())

# Function to classify reviews as real or fake
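def classify_review(review):
    """Return 1 for a real review ('OR'), 0 for a fake one ('CG'), else None."""
    if isinstance(review, str) and review.strip():
        # truncation=True keeps long reviews within the model's maximum input length
        result = pipe(review, truncation=True)
        label = result[0]['label']
        if label == 'OR':
            return 1  # Real
        elif label == 'CG':
            return 0  # Fake
    return None  # Default for empty, invalid, or unrecognized input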

# Use tqdm to track progress
df['label'] = df['reviewContent'].progress_apply(classify_review)

# Drop rows that could not be labeled (missing or empty reviewContent)
df = df[df['label'].notna()]

# Save to CSV
df.to_csv('New_final_Processed_labeled_reviews.csv', index=False)

# Count and print results
real_reviews_count = df[df['label'] == 1].shape[0]
fake_reviews_count = df[df['label'] == 0].shape[0]

print("Labeled dataset saved to 'New_final_Processed_labeled_reviews.csv'")
print(f"Number of real reviews: {real_reviews_count}")
print(f"Number of fake reviews: {fake_reviews_count}")


Device set to use cuda:0


                                          productURL            userName  \
0  https://www.aliexpress.com/item/10050068766578...  AliExpress Shopper   
1  https://www.aliexpress.com/item/10050068766578...               A***r   
2  https://www.aliexpress.com/item/10050068766578...               J***s   
3  https://www.aliexpress.com/item/10050068766578...               Б***ч   
4  https://www.aliexpress.com/item/10050068766578...               A***r   

  userCountry  userStar                                      reviewContent  \
0          UA       5.0                  Small, it won't work for the car.   
1          BR       5.0  Very satisfied with the product, it is really ...   
2         NaN       5.0             Very satisfied with this little vacuum   
3          RU       5.0               Just a bomb! Works like a hurricane!   
4          KR       5.0                            Strong and good suction   

          reviewTime language  
0  14 Oct 2024 12:28       en  
1  10 Sep 

100%|██████████| 17607/17607 [01:30<00:00, 194.27it/s]


Labeled dataset saved to 'New_final_Processed_labeled_reviews.csv'
Number of real reviews: 12916
Number of fake reviews: 4691
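
The run above calls the pipeline once per row. Hugging Face pipelines also accept a list of texts together with a `batch_size` argument, which may be faster on a GPU. Below is a rough sketch of that idea, not the method used in this notebook: `label_in_batches` and `fake_classifier` are hypothetical names, and the stand-in classifier exists only so the example runs without the model.

```python
from typing import Callable, List, Optional

LABEL_TO_INT = {"OR": 1, "CG": 0}


def label_in_batches(
    texts: list,
    classify: Callable[[List[str]], List[dict]],
    batch_size: int = 32,
) -> List[Optional[int]]:
    """Label texts in chunks; empty or non-string entries stay None."""
    labels: List[Optional[int]] = [None] * len(texts)
    # Positions of entries that are non-empty strings.
    valid = [i for i, t in enumerate(texts) if isinstance(t, str) and t.strip()]
    for start in range(0, len(valid), batch_size):
        idx = valid[start:start + batch_size]
        predictions = classify([texts[i] for i in idx])
        for i, prediction in zip(idx, predictions):
            labels[i] = LABEL_TO_INT.get(prediction["label"])
    return labels


def fake_classifier(batch: List[str]) -> List[dict]:
    """Stand-in for the real pipeline: flags text made of one repeated word as 'CG'."""
    results = []
    for text in batch:
        words = text.split()
        repeated = len(words) > 1 and len(set(words)) == 1
        results.append({"label": "CG" if repeated else "OR", "score": 1.0})
    return results


texts = ["nice nice nice", "Strong and good suction", "", None]
labels = label_in_batches(texts, fake_classifier, batch_size=2)
print(labels)  # [0, 1, None, None]
```

With the real model, one might pass `lambda batch: pipe(batch, batch_size=32, truncation=True)` as `classify` and assign the result with `df['label'] = label_in_batches(df['reviewContent'].tolist(), ...)`. Whether this is actually faster depends on the hardware and the review lengths, so it is worth timing on a sample first.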
