# Model Introduction

The model is downloaded from Hugging Face and detects several kinds of stereotypes. Given a single sentence, it assigns one of nine classes, including gender stereotypes. We therefore use the model to compute a gender stereotype score for each sentence and then average these scores over the entire article. For details, see: https://huggingface.co/wu981526092/Sentence-Level-Stereotype-Detector
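
The averaging described above can be sketched in plain Python without loading the model. This is a minimal illustration, not the notebook's actual cell: the function name `article_gender_score` is hypothetical, the prediction values are made up, and the `unrelated` label is used here only as a placeholder for "any class other than `stereotype_gender`".

```python
GENDER_LABEL = "stereotype_gender"  # label filtered for in the scoring cell of this notebook


def article_gender_score(sentence_results):
    """Average gender-stereotype score over all sentences of one article.

    sentence_results holds one entry per sentence; each entry is a list of
    {"label": str, "score": float} dicts, the shape a text-classification
    pipeline returns for a single input.
    """
    if not sentence_results:
        return 0.0
    total = sum(
        r["score"]
        for result in sentence_results
        for r in result
        if r["label"] == GENDER_LABEL
    )
    # Divide by the number of sentences, so sentences without a
    # gender-stereotype prediction pull the average towards 0.
    return total / len(sentence_results)


# Made-up predictions for a three-sentence article (illustrative values only)
example = [
    [{"label": "stereotype_gender", "score": 0.9}],
    [{"label": "unrelated", "score": 0.8}],
    [{"label": "stereotype_gender", "score": 0.6}],
]
print(article_gender_score(example))  # (0.9 + 0.6) / 3
```

Note that the denominator is the total sentence count, not the number of flagged sentences, so a long article with a single strongly stereotyped sentence still receives a low score.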

In [1]:
import pandas as pd
import spacy
from transformers import pipeline



In [2]:
detector = pipeline(
    "text-classification",
    model="../models/wu981526092",
    tokenizer="../models/wu981526092"
)

Device set to use cuda:0
NVIDIA GeForce RTX 5060 Laptop GPU with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_75 sm_80 sm_86 sm_89 sm_90 compute_90.
If you want to use the NVIDIA GeForce RTX 5060 Laptop GPU GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/



# Process the dataset

In [3]:
news = pd.read_csv("../data/splited/m_ir_ms.csv")

In [4]:
# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def sent_tokenize(text):
    """Split text into a list of sentence strings using spaCy."""
    doc = nlp(text)  # calling the pipeline returns a spaCy Doc
    return [sent.text for sent in doc.sents]

In [5]:
avg_scores = []
for idx, row in news.iterrows():
    sentences = sent_tokenize(str(row['content']))
    scores = []
    for sent in sentences:
        # With default settings the pipeline returns only the top-scoring label
        result = detector(sent)
        # Keep the score only when that label is the gender-stereotype class
        gender_scores = [r['score'] for r in result if r['label'] == 'stereotype_gender']
        scores.extend(gender_scores)

    # Average over all sentences in the article; an article with no
    # gender-stereotype sentences gets a score of 0
    avg_score = sum(scores) / len(sentences) if scores else 0
    avg_scores.append(avg_score)


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
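
The warning above appears because the pipeline is called once per sentence. One possible way to reduce the overhead, short of building a dataset as the message suggests, is to pass all sentences of an article in a single call and let the pipeline batch them. The sketch below is an assumption-laden illustration rather than the notebook's method: the helper name `score_article_batched` and the `batch_size=16` default are hypothetical, and `truncation=True` is added on the assumption that over-long sentences should be cut to the model's maximum input length.

```python
def score_article_batched(detector, sentences, batch_size=16,
                          label="stereotype_gender"):
    """Score one article with a single batched pipeline call.

    detector is expected to behave like a transformers text-classification
    pipeline: given a list of strings, it returns one
    {"label": str, "score": float} dict per input (default top-1 output).
    """
    if not sentences:
        return 0.0
    # batch_size and truncation are forwarded to the pipeline call;
    # the value 16 is an arbitrary starting point, not a tuned setting.
    results = detector(sentences, batch_size=batch_size, truncation=True)
    total = sum(r["score"] for r in results if r["label"] == label)
    # Same averaging as the loop above: divide by the number of sentences.
    return total / len(sentences)
```

With the objects defined earlier, a call might look like `score_article_batched(detector, sent_tokenize(text))`. Whether batching actually speeds things up depends on the GPU and on sentence lengths, so it is worth timing both variants on a sample of articles.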


In [6]:
news['stereotype_score'] = avg_scores
news_sorted = news.sort_values(by='stereotype_score', ascending=False)
news_sorted.to_csv("../data/stereotype_calculated/m_ir_ms_with_stereotype_score.csv", index=False)