# BERT-based Fashion-word-check Model
### Goal: 
- To fine-tune a BERT-based model to identify fashion-related words in the 'post_title' and 'post_content' columns using the labeled data from topics0611.csv
### Required Data:
- topics0611.csv: Contains the labeled fashion-related keywords in the 'keyword group' column
- poster_test_fashion_nlpclean.csv: Contains the columns 'post_title', 'post_content', and 'post_comments' (the code below reads the comments from a column named 'post_comment_content'). I use 'post_title' and 'post_content' to identify fashion-related words, while the comments serve as a source of candidate non-fashion-related examples
### Steps to Fine-Tune the BERT Model:
#### 1. Prepare the Dataset:
- Extract fashion-related keywords from topics0611.csv
- Use the 'post_title', 'post_content', and 'post_comments' columns from poster_test_fashion_nlpclean.csv to generate labeled data for training
#### 2. Tokenization and Data Preparation:
- Tokenize the text using a Chinese BERT tokenizer (bert-base-chinese)
- Prepare the dataset for BERT, labeling fashion-related terms from the 'keyword group' and non-fashion-related terms based on 'post_comments' or by excluding fashion keywords from the other columns
#### 3. Model Fine-Tuning:
- Fine-tune the BERT model using the prepared dataset
- Train the model to classify whether a term or sentence is fashion-related
#### 4. Evaluation:
- Evaluate the model on a validation set to ensure it correctly identifies fashion-related content
#### 5. Inference:
- Apply the model to the 'post_title' and 'post_content' columns to identify fashion-related terms

In [5]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from torch.utils.data import Dataset
from sklearn.feature_extraction.text import CountVectorizer  


##### Step 1: Load Data

In [3]:
topics_file = '/home/disk1/red_disk1/Multimodal_MKT/topics0611.csv'
posts_file = '/home/disk1/red_disk1/Multimodal_MKT/test/poster_test_fashion_nlpclean.csv'

topics_df = pd.read_csv(topics_file)
posts_df = pd.read_csv(posts_file)

##### Step 1.1 Data cleaning

In [14]:
import pandas as pd
import re
import emoji
import jieba

# Load stopwords from the provided file
with open('/home/disk1/red_disk1/Multimodal_MKT/stopwords_cn.txt', 'r', encoding='utf-8') as f:
    stopwords = set(f.read().splitlines())

# Load the poster_test_fashion_nlpclean.csv file
poster_df = pd.read_csv('/home/disk1/red_disk1/Multimodal_MKT/test/poster_test_fashion_nlpclean.csv')

# Remove duplicate rows based on poster_id and post_id
poster_df = poster_df.drop_duplicates(subset=['poster_id', 'post_id'])

# Ensure that the post_title and post_content columns are filled
poster_df['post_title'] = poster_df['post_title'].fillna('')
poster_df['post_content'] = poster_df['post_content'].fillna('')

# Combine titles and content for searching
poster_df['combined_text'] = poster_df['post_title'] + ' ' + poster_df['post_content']

# Function for text cleaning
def clean_text(text, stopwords):
    # Convert emojis to text
    text = emoji.demojize(text)
    
    # Remove specific patterns
    text = re.sub(r'- 小红书,,', '', text)
    text = re.sub(r',,\d{2}-\d{2},,', '', text)
    text = re.sub(r'#', ' ', text)
    
    # Remove digits
    text = re.sub(r'\d+', '', text)
    
    # Remove special characters
    cleaned_text = ''.join(char for char in text if char.isalnum() or char.isspace())
    
    # Tokenize
    words = jieba.cut(cleaned_text)
    
    # Remove stopwords
    filtered_words = [word for word in words if word not in stopwords]
    
    return ' '.join(filtered_words)

# Apply data cleaning to post_title, post_content, and post_comments
poster_df['post_title_clean'] = poster_df['post_title'].apply(lambda x: clean_text(x, stopwords))
poster_df['post_content_clean'] = poster_df['post_content'].apply(lambda x: clean_text(x, stopwords))
poster_df['post_comments_clean'] = poster_df['post_comment_content'].fillna('').apply(lambda x: clean_text(str(x), stopwords))
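
To see what the regex and character-filter steps of `clean_text` do in isolation, here is a minimal sketch that reproduces only those steps with the standard library (no `emoji` or `jieba`). The function name `strip_platform_patterns` and the sample string are invented for illustration and are not part of the original pipeline.

```python
import re

def strip_platform_patterns(text: str) -> str:
    """Apply the regex and character-filter steps of clean_text (no emoji/jieba)."""
    text = re.sub(r'- 小红书,,', '', text)        # platform suffix
    text = re.sub(r',,\d{2}-\d{2},,', '', text)    # ",,MM-DD,," date stamps
    text = re.sub(r'#', ' ', text)                 # hashtags become separators
    text = re.sub(r'\d+', '', text)                # remaining digits
    # Keep letters (including CJK characters) and whitespace only
    return ''.join(ch for ch in text if ch.isalnum() or ch.isspace())

# Invented example string, not taken from the dataset
sample = "春季穿搭 - 小红书,,#连衣裙,,03-15,,2023新款!"
print(strip_platform_patterns(sample).split())
```

Note that the date pattern has to run before the generic digit removal, otherwise the `,,MM-DD,,` stamps would no longer match.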

##### Step 2: Prepare Fashion and Non-Fashion Keywords


In [15]:
# Build the fashion keyword list from the 'keyword group' column of topics0611.csv.
# Assumption: each cell holds a single keyword; split the cells first if they hold several.
fashion_keywords = set(topics_df['keyword group'].dropna().astype(str).str.strip()) - {''}
fashion_df = pd.DataFrame(sorted(fashion_keywords), columns=['keyword'])
fashion_df['label'] = 1  # Label as fashion-related

# Debugging: Check the content of comments_text before vectorization
comments_text = poster_df['post_comments_clean'].fillna('').str.cat(sep=' ')
if not comments_text.strip():  # Check if comments_text is empty or only contains spaces
    print("Warning: comments_text is empty after cleaning!")
else:
    vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')  # Token pattern to capture words
    X_comments = vectorizer.fit_transform([comments_text])
    comment_keywords = vectorizer.get_feature_names_out()  # Extract unique words from comments

    # Filter out fashion-related keywords to get non-fashion-related keywords
    non_fashion_keywords = [kw for kw in comment_keywords if kw not in fashion_keywords]

    # Create a DataFrame for the non-fashion keywords
    non_fashion_df = pd.DataFrame(non_fashion_keywords, columns=['keyword'])
    non_fashion_df['label'] = 0  # Label as non-fashion-related

    # Combine fashion and non-fashion keywords into a single DataFrame
    combined_df = pd.concat([fashion_df, non_fashion_df]).reset_index(drop=True)
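
The labeling rule used here (keywords from the topics file get label 1, comment words outside that list get label 0) can be checked on toy data before running it on the full files. The sketch below is a plain-Python illustration; `label_keywords` and the toy inputs are invented, and the printed class counts are only meant to show how quickly the negative class can outnumber the positive one.

```python
from collections import Counter

def label_keywords(fashion_keywords, comment_vocab):
    """Label fashion keywords 1 and comment-only words 0."""
    fashion = set(fashion_keywords)
    rows = [(kw, 1) for kw in sorted(fashion)]
    # Words that also appear in the fashion list are excluded from the negatives
    rows += [(kw, 0) for kw in sorted(set(comment_vocab) - fashion)]
    return rows

# Invented toy inputs, not taken from the real CSV files
rows = label_keywords(["连衣裙", "穿搭"], ["喜欢", "穿搭", "好看", "链接"])
label_counts = Counter(label for _, label in rows)
print(label_counts)
```

A comment word that is fashion-related but missing from the keyword list still ends up with label 0 under this rule, so the negatives are likely to contain some noise.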


##### Step 3: Train-Test Split


In [16]:
train_df, test_df = train_test_split(combined_df, test_size=0.2, random_state=42)
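
If the fashion keywords are much rarer than the comment words, a purely random split may leave very few positives in the test set. scikit-learn's `train_test_split` accepts a `stratify` argument (for example `stratify=combined_df['label']`) for this case. The sketch below shows the same idea in plain Python so the mechanics are visible; `stratified_split` and the synthetic rows are hypothetical stand-ins, not the notebook's actual data.

```python
import random
from collections import defaultdict

def stratified_split(rows, test_size=0.2, seed=42):
    """Split (keyword, label) rows so each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # Take the same fraction from every label, at least one row each
        n_test = max(1, round(len(group) * test_size))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Synthetic, imbalanced data: 10 positives and 90 negatives
rows = [(f"fashion_{i}", 1) for i in range(10)] + [(f"other_{i}", 0) for i in range(90)]
train, test = stratified_split(rows)
print(len(train), len(test), sum(label for _, label in test))
```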


##### Step 4: Prepare Dataset for BERT


In [17]:
class KeywordDataset(Dataset):
    def __init__(self, keywords, labels, tokenizer, max_len):
        self.keywords = keywords
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.keywords)

    def __getitem__(self, idx):
        keyword = str(self.keywords[idx])
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            keyword,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

# Load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')

# Create datasets
MAX_LEN = 16
train_dataset = KeywordDataset(
    keywords=train_df['keyword'].values,
    labels=train_df['label'].values,
    tokenizer=tokenizer,
    max_len=MAX_LEN
)

test_dataset = KeywordDataset(
    keywords=test_df['keyword'].values,
    labels=test_df['label'].values,
    tokenizer=tokenizer,
    max_len=MAX_LEN
)
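
To make the effect of `MAX_LEN = 16` with `padding='max_length'` and `truncation=True` concrete, the sketch below mimics the shape of the resulting encoding. It is an illustration only and does not call the real tokenizer: it assumes one token per character, which matches how `bert-base-chinese` handles CJK characters but not how it splits Latin words or digits. The helper `pad_or_truncate` is invented for this example.

```python
def pad_or_truncate(chars, max_len, pad="[PAD]"):
    """Mimic the shape of a max_length-padded BERT encoding for one keyword."""
    body = list(chars)[: max_len - 2]          # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + body + ["[SEP]"]
    attention_mask = [1] * len(tokens)         # 1 marks real tokens
    padding = max_len - len(tokens)
    return tokens + [pad] * padding, attention_mask + [0] * padding

tokens, mask = pad_or_truncate("连衣裙", max_len=16)
print(tokens[:6], sum(mask))
```

Under that assumption a keyword can be at most 14 characters long before it is truncated, which is usually enough for single keywords but not for whole titles or posts.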





##### Step 5: Load BERT model


In [24]:
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-chinese and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


##### Step 6: Training Arguments and Trainer


In [27]:
def compute_metrics(p):
    # Convert logits to class predictions once and reuse them for every metric
    preds = p.predictions.argmax(-1)
    return {
        'accuracy': accuracy_score(p.label_ids, preds),
        'precision': precision_score(p.label_ids, preds, zero_division=0),
        'recall': recall_score(p.label_ids, preds, zero_division=0),
        'f1': f1_score(p.label_ids, preds, zero_division=0),
    }

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy='epoch'  # recent transformers releases name this eval_strategy
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)


ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.21.0`: Please run `pip install transformers[torch]` or `pip install accelerate -U`

##### Step 7: Train the model


In [26]:
trainer.train()


NameError: name 'trainer' is not defined

##### Step 8: Evaluate the model


In [None]:
trainer.evaluate()
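
For reference, the four metrics reported by the evaluation can be worked through by hand on a small example. The sketch below recomputes accuracy, precision, recall, and F1 for the positive (fashion) class in plain Python; the function `binary_metrics` and the label and prediction lists are invented for illustration and are not results from this model.

```python
def binary_metrics(labels, predictions):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    correct = sum(1 for y, p in pairs if y == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Invented labels and predictions: 2 true positives, 1 false positive, 1 false negative
labels      = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(labels, predictions)
print(metrics)
```

With an imbalanced keyword set, accuracy alone can look good even when few fashion keywords are found, which is why precision, recall, and F1 are worth reading alongside it.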


##### Step 9: Apply model to post_title and post_content


In [None]:
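def identify_fashion_related(text_series, tokenizer, model, max_len=16, batch_size=64):
    """Return one 0/1 prediction per text (1 = fashion-related)."""
    texts = text_series.fillna('').astype(str).tolist()
    device = next(model.parameters()).device  # run on the device the model lives on
    model.eval()  # disable dropout for inference
    predictions = []
    with torch.no_grad():  # gradients are not needed at inference time
        for start in range(0, len(texts), batch_size):
            batch = texts[start:start + batch_size]
            inputs = tokenizer(batch, padding=True, truncation=True,
                               max_length=max_len, return_tensors='pt').to(device)
            logits = model(**inputs).logits
            predictions.extend(torch.argmax(logits, dim=1).tolist())
    return predictions

# Apply to the cleaned title and content columns, which were created on poster_df in Step 1.1
poster_df['title_fashion_related'] = identify_fashion_related(poster_df['post_title_clean'], tokenizer, model)
poster_df['content_fashion_related'] = identify_fashion_related(poster_df['post_content_clean'], tokenizer, model)

print(poster_df[['post_title', 'title_fashion_related', 'post_content', 'content_fashion_related']])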
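
The classifier is trained on single keywords, while this step feeds it whole titles and posts truncated to 16 tokens. One possible way to get term-level output that matches the stated goal is to classify each space-separated term of the cleaned text separately. The sketch below shows only the bookkeeping; `fashion_terms` and the `is_fashion` callable are hypothetical, and the toy lexicon merely stands in for a call to the fine-tuned model.

```python
def fashion_terms(cleaned_text, is_fashion):
    """Return the unique space-separated terms that the classifier flags."""
    seen, flagged = set(), []
    for term in cleaned_text.split():
        if term in seen:
            continue                 # classify each distinct term only once
        seen.add(term)
        if is_fashion(term):
            flagged.append(term)
    return flagged

# Stand-in classifier for illustration only; replace with the fine-tuned model
toy_lexicon = {"连衣裙", "穿搭"}
found = fashion_terms("今天 穿搭 分享 连衣裙 穿搭", toy_lexicon.__contains__)
print(found)
```

In practice `is_fashion` would wrap the tokenizer and model call for a single term, ideally batched across all distinct terms in the dataset.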
