<a href="https://colab.research.google.com/github/ethanwongca/COMP396/blob/main/Analyzing_Language_Bias_Between_French_and_English_in_Conventional_Multilingual_Sentiment_Analysis_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Calculating Bias for Multilingual Support Vector Machine and Naive-Bayes for Sentiment Analysis


## Library Imports

Libraries Used:


*   **Sklearn**: Used to use the Multinomial Naive-Bayes and Support Vector Machine Model, build the Tf-Idf Matrix, use proper train test splitting, and build accuracy reports.
*   **Pandas**: Used for building DataFrames
*   **Numpy**: Provides operations for the DataFrames
*   **FairLearn**: Builds specified bias metrics in models

Dependenices are available at **requirements.txt.**



In [10]:
!pip install fairlearn
!pip install optuna



In [11]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score, classification_report
from fairlearn.metrics import MetricFrame, demographic_parity_difference, equalized_odds_difference, selection_rate, equalized_odds_ratio, demographic_parity_ratio
from google.colab import drive
import optuna
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Pre-Processing

Installing spaCy's French and English stop words and pre-processing tools, removing taggers as dataset is unordered.

In [None]:
import spacy
!python -m spacy download en_core_web_sm
!python -m spacy download fr_core_news_sm

In [None]:
nlp_en = spacy.load("en_core_web_sm", disable=["ner", "parser"])
nlp_fr = spacy.load("fr_core_news_sm", disable=["ner", "parser"])

## Pre-Processing Functions

*   **Review Parsing** - Parses through the dataset and reconstructs the text from the word frequencies provided, along with the sentiment labels as well.
*   **Batch Pre-Processing** - Pre-processes the text via spaCy using batch pre-processing with a batch size of 100.
*   **Load Dataset** - The fully pre-process data is placed in a dataframe.



In [None]:
def review_parsing(line):
    """
    Parses lines from the dataset to reconstruct the text by multiplying by the
    proper word frequencies and extracting the sentiment labels as well.

    Args:
      String: The lines in the dataset

    Returns:
      Dict {str:str}: A dictionary that has text and sentiment as the keys and
      the reconstructed text and sentiment as values.
    """
    words = []
    parts = line.strip().split()
    sentiment = None

    for part in parts:
        if part.startswith("#label#"):
            sentiment = part.split(":")[1]
        else:
            word, freq = part.split(":")
            words.extend([word] * int(freq))

    reconstructed_text = " ".join(words)
    return {'text': reconstructed_text, 'sentiment': sentiment}

def batch_preprocess_en(texts):
    """
    Batch pre-process English texts.

    Args:
      List[str]: All of the English texts.

    Return:
      List[str]: All of the English texts fully pre-processed.
    """
    processed_texts = []
    for doc in nlp_en.pipe(texts, batch_size=100):
        tokens = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
        processed_texts.append(' '.join(tokens))
    return processed_texts

def batch_preprocess_fr(texts):
    """
    Batch pre-process French texts.

    Args:
      List[str]: All of the French texts.

    Return:
      List[str]: All of the French texts fully pre-processed.
    """
    processed_texts = []
    for doc in nlp_fr.pipe(texts, batch_size=20):
        tokens = [token.lemma_.lower() for token in doc if token.is_alpha and not token.is_stop]
        processed_texts.append(' '.join(tokens))
    return processed_texts

def load_dataset_to_dataframe_en(file_path):
    """
    Transforming the English pre-processed data into a dataframe.

    Args:
      FILE: The CSV file with all the English data.

    Return:
      DataFrame: A dataframe that has the pre-processed texts along with the
      sentiments.
    """
    texts, sentiments = [], []

    with open(file_path, 'r') as file:
        for line in file:
            parsed_line = review_parsing(line)
            texts.append(parsed_line['text'])
            sentiments.append(parsed_line['sentiment'])

    processed_texts = batch_preprocess_en(texts)

    df = pd.DataFrame({
        'ProcessedText': processed_texts,
        'Sentiment': sentiments
    })

    return df

def load_dataset_to_dataframe_fr(file_path):
    """
    Transforming the French pre-processed data into a dataframe.

    Args:
      FILE: The CSV file with all the French data.

    Return:
      DataFrame: A dataframe that has the pre-processed texts along with the
      sentiments.
    """
    texts, sentiments = [], []

    with open(file_path, 'r') as file:
        for line in file:
            parsed_line = review_parsing(line)
            texts.append(parsed_line['text'])
            sentiments.append(parsed_line['sentiment'])

    processed_texts = batch_preprocess_fr(texts)

    df = pd.DataFrame({
        'ProcessedText': processed_texts,
        'Sentiment': sentiments
    })

    return df


## Loading the Particular Datasets

The file path to the Webis-CLS-10 Dataset

In [None]:
FILE_PATH_EN = '/content/drive/My Drive/dataset/en/music/unlabeled.processed'
FILE_PATH_FR = '/content/drive/My Drive/dataset/fr/music/unlabeled.processed'

FILE_PATH_EN_dvd = '/content/drive/My Drive/dataset/en/dvd/unlabeled.processed'
FILE_PATH_FR_dvd = '/content/drive/My Drive/dataset/fr/dvd/unlabeled.processed'

FILE_PATH_EN_books = '/content/drive/My Drive/dataset/en/books/unlabeled.processed'
FILE_PATH_FR_books = '/content/drive/My Drive/dataset/fr/books/unlabeled.processed'

Loading the French and English dataset from the Webis-CLS-10 Dataset. We are taking the all three sub-categories of the dataset which are the dvd, music, and books categories.

In [None]:
data_en = load_dataset_to_dataframe_en(FILE_PATH_EN)
data_fr = load_dataset_to_dataframe_fr(FILE_PATH_FR)

data_en_dvd = load_dataset_to_dataframe_en(FILE_PATH_EN_dvd)
data_fr_dvd = load_dataset_to_dataframe_fr(FILE_PATH_FR_dvd)

data_en_books = load_dataset_to_dataframe_en(FILE_PATH_EN_books)
data_fr_books = load_dataset_to_dataframe_fr(FILE_PATH_FR_books)

## Caching the Datasets

Caching the datasets so there is no need to keep pre-processing the data

In [None]:
data_en.to_csv("/content/drive/My Drive/dataset/en/music/unlabeled_update.csv", index=False)
data_fr.to_csv("/content/drive/My Drive/dataset/fr/music/unlabeled_update.csv", index=False)

data_en_dvd.to_csv("/content/drive/My Drive/dataset/en/dvd/unlabeled_update.csv", index=False)
data_fr_dvd.to_csv("/content/drive/My Drive/dataset/fr/dvd/unlabeled_update.csv", index=False)

data_en_books.to_csv("/content/drive/My Drive/dataset/en/books/unlabeled_update.csv", index=False)
data_fr_books.to_csv("/content/drive/My Drive/dataset/fr/books/unlabeled_update.csv", index=False)

In [12]:
data_en = pd.read_csv("/content/drive/My Drive/dataset/en/music/unlabeled_update.csv")
data_fr = pd.read_csv("/content/drive/My Drive/dataset/fr/music/unlabeled_update.csv")

data_en_dvd = pd.read_csv("/content/drive/My Drive/dataset/en/dvd/unlabeled_update.csv")
data_fr_dvd = pd.read_csv("/content/drive/My Drive/dataset/fr/dvd/unlabeled_update.csv")

data_en_books = pd.read_csv("/content/drive/My Drive/dataset/en/books/unlabeled_update.csv")
data_fr_books = pd.read_csv("/content/drive/My Drive/dataset/fr/books/unlabeled_update.csv")

In [None]:
print(data_fr_books.shape)

(32870, 2)


# The Dataset Pre-Processed

In [None]:
display(data_en)

Unnamed: 0,ProcessedText,Sentiment
0,pretty pretty pretty good good good record rec...,negative
1,classical alive year narration peter child you...,positive
2,chamillionaire chamillionaire chamillionaire c...,negative
3,perfect perfect giants giants world world linc...,positive
4,play play playing playing autoharp autoharp au...,positive
...,...,...
25215,num num num num num num num num num num num nu...,positive
25216,album album album excuse excuse word word like...,positive
25217,destiny destiny music music stone stone beauti...,negative
25218,well well good tha tha tha rap rap hear hear r...,positive


In [None]:
display(data_fr)

Unnamed: 0,ProcessedText,Sentiment
0,ringard ringard ringard esrt aufray rêver fair...,negative
1,l l l d d jamais jamais n n peur indépendanc t...,negative
2,muse muse muse muse muse muse muse qu qu qu qu...,negative
3,num num num num l l l groupe groupe groupe évo...,negative
4,d d d d d d d sympa sympa sympa sympa sympa sy...,negative
...,...,...
15935,d d rappeler dizain j année cd petit adore osu...,positive
15936,vallenato vallenato vallenato rien rien modern...,negative
15937,sambora sambora connaître guitarist album proc...,positive
15938,disco disco n n compil qualité énorme passez c...,negative


## Creating the Multi-Lingual Dataset

**Sample Data** A function that ensures equal number of Englsih and French Reviews in the Dataframe.

In [13]:
def sample_data(df_en, df_fr, seed=42):
    """
    Adjusts sampling to ensure an equal number of English and French samples,
    maximizing the amount of data used while respecting the specified percentages.

    Args:
      Dataframe: The Pre-Processed English DataFrame
      Dataframe: The Pre-Proecessed French DataFrame
      Seed: Determines the random values

    Returns:
      Dataframe: The multilingual dataset
    """
    # Determine the smallest number of positive or negative reviews in any language
    min_samples = min(len(df_en[df_en['Sentiment'] == 'positive']),
                      len(df_en[df_en['Sentiment'] == 'negative']),
                      len(df_fr[df_fr['Sentiment'] == 'positive']),
                      len(df_fr[df_fr['Sentiment'] == 'negative']))

    # Sample from each subgroup to ensure balance
    en_pos = df_en[df_en['Sentiment'] == 'positive'].sample(n=min_samples, random_state=seed)
    en_neg = df_en[df_en['Sentiment'] == 'negative'].sample(n=min_samples, random_state=seed)
    fr_pos = df_fr[df_fr['Sentiment'] == 'positive'].sample(n=min_samples, random_state=seed)
    fr_neg = df_fr[df_fr['Sentiment'] == 'negative'].sample(n=min_samples, random_state=seed)

    # Mark each sample with its language
    en_pos['Language'] = 'English'
    en_neg['Language'] = 'English'
    fr_pos['Language'] = 'French'
    fr_neg['Language'] = 'French'

    # Combine all samples and shuffle them
    balanced_dataset = pd.concat([en_pos, en_neg, fr_pos, fr_neg], ignore_index=True)
    balanced_dataset = balanced_dataset.sample(frac=1, random_state=seed).reset_index(drop=True)

    return balanced_dataset

### Building the Tf-Idf Matrix

A Tf-Idf Matrix must be built to have the SVM and Naive Bayes to work.

In [14]:
def preprocess_and_vectorize(df):
    """
    Pre-Processes and vectorizes the text data in the DataFrame.

    Args:
      DataFrame: The Multi-Lingual Dataset

    Returns:
      DataFrame: Tf-Idf of the shape of the samples and feature representing the
      vectorized text data.
      DataFrame: The sentiment labels associated with each text
      TfidfVectorizer: Contains the vocabulary and idf scores of each term
      NumpyArray: An array of every text entry's language
    """
    tfidf = TfidfVectorizer(max_features=10000)
    X = tfidf.fit_transform(df['ProcessedText'])
    y = df['Sentiment'].values
    return X, y, tfidf, df['Language'].values

## Calculating the Specified Bias Metrics, Training the SVM and Naive Bayes, and Splitting the Data

**Ensure Binary Labels**: Ensures the specified labels are binary, so the bias metrics function works as intended.

**Calculate Bias Metrics**: Calculates the demographic parity ratiom equalized odds ratio, demographic parity difference, and equalized odds difference.

**Map Labels**: Maps the sentiment labels to have binary classification.

**Train and Evaluate:** Trains the SVM and Naive Bayes models, and calculating the corresponding precision, recall and f1-scores for each langugage. Also outputs the bias metrics for the models.

In [16]:
def ensure_binary_labels(y):
    """
    Ensures the specified labels are binary, so the bias metrics function works
    as intended

    Args:
      DataFrame: The sentiment dataframe
    """
    unique_labels = np.unique(y)
    if set(unique_labels) == {0, 1} or set(unique_labels) == {-1, 1}:
        return np.where(y == -1, 0, y)
    else:
        raise ValueError("Labels must be binary and in {0, 1} or {-1, 1}.")

def calculate_bias_metrics(y_true, y_pred, sensitive_features):
    """
    Calculates the demographic parity ratiom equalized odds ratio,
    demographic parity difference, and equalized odds difference.

    Args:
      DataFrame: Dataframe the contains the actual sentiment labels for the
      specified text in the dataset
      DataFrame: Dataframe the contains the predicted sentiment labels for the
      specified text in the dataset
      DataFrame: Contains the dataframe with the languages corresponding to
      the sentiment labels
    """
    y_true_binary = ensure_binary_labels(y_true)
    y_pred_binary = ensure_binary_labels(y_pred)

    m_dpr = demographic_parity_ratio(y_true_binary, y_pred_binary, sensitive_features=sensitive_features)
    m_eqo = equalized_odds_ratio(y_true_binary, y_pred_binary, sensitive_features=sensitive_features)
    m_dpr_2 = demographic_parity_difference(y_true_binary, y_pred_binary, sensitive_features=sensitive_features)
    m_eqo_2 = equalized_odds_difference(y_true_binary, y_pred_binary, sensitive_features=sensitive_features)

    print(f"The demographic parity ratio is {m_dpr}")
    print(f"The equalized odds ratio is {m_eqo}")
    print(f"The demographic parity differece is {m_dpr_2}")
    print(f"The equalized odds difference is {m_eqo_2}")

def map_labels(y):
    """
    Maps the positve and negative setiments to be binary

    Args:
      DataFrame: The dataset without binary labels for sentiment

    Return:
      DataFrame: The dataset with binary labels

    """

    return np.where(y == 'positive', 1, 0)

def objective(trial):
    """
    Objective function for hyperparameter tuning using Optuna.

    Args:
        trial (optuna.trial): A trial is a process of evaluating an objective function.

    Returns:
        float: The accuracy of the model with the suggested parameters.
    """
    # Selecting the model type
    model_type = trial.suggest_categorical('model_type', ['SVM', 'NaiveBayes'])

    # Configuring parameters for SVM
    if model_type == 'SVM':
        C = trial.suggest_loguniform('svm_C', 1e-10, 1e10)
        kernel = trial.suggest_categorical('svm_kernel', ['linear', 'rbf', 'poly', 'sigmoid'])
        gamma = trial.suggest_categorical('svm_gamma', ['scale', 'auto'])
        model = SVC(C=C, kernel=kernel, gamma=gamma)

    # Configuring parameters for Naive Bayes
    else:
        alpha = trial.suggest_float('nb_alpha', 1e-10, 10.0)
        fit_prior = trial.suggest_categorical('nb_fit_prior', [True, False])
        model = MultinomialNB(alpha=alpha, fit_prior=fit_prior)

    # Training the model on the training dataset
    model.fit(X_train, map_labels(y_train))
    y_pred = model.predict(X_test)
    y_pred_mapped = map_labels(y_pred)  # Ensure predictions are in binary format

    # Calculate and return the accuracy
    return accuracy_score(map_labels(y_test), y_pred_mapped)

def train_and_evaluate_with_optuna(X_train, y_train, X_test, y_test, languages_test):
    """
    Sets up and runs the hyperparameter optimization using Optuna, then evaluates the best model.

    Args:
        X_train, y_train: Training data and labels.
        X_test, y_test: Testing data and labels.
        languages_test: DataFrame containing the language data corresponding to y_test.

    Uses Optuna to optimize the hyperparameters of either an SVM or Naive Bayes classifier.
    """
    # Creating a study object to maximize the objective function
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50)  # Adjust the number of trials based on resource availability

    # Use the best model parameters to predict and evaluate
    best_model = study.best_trial.user_attrs['model']
    y_pred = best_model.predict(X_test)
    y_pred_mapped = map_labels(y_pred)  # Map labels to binary format

    # Calculate and print bias metrics
    calculate_bias_metrics(map_labels(y_test), y_pred_mapped, languages_test)

    y_pred_final = np.where(y_pred_mapped == 1, 'positive', 'negative')
    print("Results for the best model:")
    print("Overall Accuracy:", accuracy_score(y_test, y_pred_final))
    print("Overall Classification Report:")
    print(classification_report(y_test, y_pred_final))

    # Evaluating model performance for each language
    for language in ['English', 'French']:
        idx = languages_test == language
        y_test_lang = y_test[idx]
        y_pred_lang = y_pred_final[idx]

        print(f"Accuracy on {language}: {accuracy_score(y_test_lang, y_pred_lang)}")
        print(f"Classification Report for {language}:")
        print(classification_report(y_test_lang, y_pred_lang))

    print("----------------------------------------------------")

def train_and_evaluate(X_train, y_train, X_test, y_test, languages_test, model, model_name="Model", use_random_search=False):
    """
    Trains the SVM and Naive Bayes models, and calculates the corresponding
    precision, recall, and f1-scores for each language. Also outputs the
    bias metrics for the models. Optionally includes random search for hyperparameter tuning.

    Args:
      X_train, y_train: Training data and labels.
      X_test, y_test: Testing data and labels.
      languages_test: Array indicating the language of each test sample.
      model: The machine learning model instance to be trained.
      model_name: A name for the model for display purposes.
      use_random_search: If True, perform random search to find optimal hyperparameters.
    """

    y_train_mapped = map_labels(y_train)
    y_test_mapped = map_labels(y_test)

    if use_random_search:
        # Define the parameter distribution for SVM
        param_distributions = {
            'C': np.logspace(-4, 4, 20),  # Regularization parameter
            'kernel': ['linear', 'rbf', 'sigmoid'],  # Type of kernel
            'gamma': ['scale', 'auto'] + list(np.logspace(-4, 1, 20))  # Kernel coefficient
        }
        # Setup the RandomizedSearchCV
        random_search = RandomizedSearchCV(estimator=model, param_distributions=param_distributions,
                                           n_iter=50, cv=3, verbose=2, random_state=42, n_jobs=-1)
        random_search.fit(X_train, y_train_mapped)

        # Using the best estimator found by RandomizedSearchCV
        model = random_search.best_estimator_
        print(f"Best parameters found: {random_search.best_params_}")
        print(f"Best cross-validation score: {random_search.best_score_:.2f}")

    else:
        # Train the model as usual
        model.fit(X_train, y_train_mapped)

    # Predict on the test set using the trained or best model found
    y_pred_mapped = model.predict(X_test)

    # Calculate bias metrics
    bias_metrics = calculate_bias_metrics(y_test_mapped, y_pred_mapped, languages_test)

    # Convert predictions back to 'positive'/'negative' for reporting
    y_pred = np.where(y_pred_mapped == 1, 'positive', 'negative')

    print(f"Results for {model_name}:")
    print("Overall Accuracy:", accuracy_score(y_test, y_pred))
    print("Overall Classification Report:")
    print(classification_report(y_test, y_pred))

    # Accuracy and classification report by language
    for language in ['English', 'French']:
        idx = languages_test == language
        y_test_lang = y_test[idx]
        y_pred_lang = y_pred[idx]

        print(f"Accuracy on {language}: {accuracy_score(y_test_lang, y_pred_lang)}")
        print(f"Classification Report for {language}:")
        print(classification_report(y_test_lang, y_pred_lang))

    print("----------------------------------------------------")

# Training the Music Data Using SVM and Naive Bayes

Splitting up the data, using an 80-20 training and validation split and calling the corresponding functions for a properly trained model.

In [17]:
df_sampled = sample_data(data_en, data_fr)

X, y, tfidf, languages = preprocess_and_vectorize(df_sampled)

X_train, X_test, y_train, y_test, languages_train, languages_test = train_test_split(X, y, languages, test_size=0.2, random_state=42)


In [None]:
print(df_sampled.shape)

(31880, 3)


## Checking the Hyperparameters

In [None]:
print("Running hyperparameter optimization and model evaluation...")
train_and_evaluate_with_optuna(X_train, y_train, X_test, y_test, languages_test)

[I 2024-04-25 15:40:02,076] A new study created in memory with name: no-name-3c587e75-fbda-48a0-8cc5-ea4085a2e7d5
[I 2024-04-25 15:40:02,106] Trial 0 finished with value: 0.5039209535759097 and parameters: {'model_type': 'NaiveBayes', 'nb_alpha': 9.978064474438852, 'nb_fit_prior': False}. Best is trial 0 with value: 0.5039209535759097.
[I 2024-04-25 15:40:02,128] Trial 1 finished with value: 0.5039209535759097 and parameters: {'model_type': 'NaiveBayes', 'nb_alpha': 8.3281649430947, 'nb_fit_prior': False}. Best is trial 0 with value: 0.5039209535759097.
[I 2024-04-25 15:40:02,149] Trial 2 finished with value: 0.5039209535759097 and parameters: {'model_type': 'NaiveBayes', 'nb_alpha': 7.3043382898505005, 'nb_fit_prior': False}. Best is trial 0 with value: 0.5039209535759097.
[I 2024-04-25 15:40:02,170] Trial 3 finished with value: 0.5039209535759097 and parameters: {'model_type': 'NaiveBayes', 'nb_alpha': 7.486178784179179, 'nb_fit_prior': True}. Best is trial 0 with value: 0.5039209535

Running hyperparameter optimization and model evaluation...


  C = trial.suggest_loguniform('svm_C', 1e-10, 1e10)
[I 2024-04-25 15:46:57,510] Trial 8 finished with value: 0.5039209535759097 and parameters: {'model_type': 'SVM', 'svm_C': 7.429608749911732e-05, 'svm_kernel': 'sigmoid', 'svm_gamma': 'scale'}. Best is trial 0 with value: 0.5039209535759097.
[I 2024-04-25 15:46:57,540] Trial 9 finished with value: 0.5039209535759097 and parameters: {'model_type': 'NaiveBayes', 'nb_alpha': 7.450184595546536, 'nb_fit_prior': True}. Best is trial 0 with value: 0.5039209535759097.
  C = trial.suggest_loguniform('svm_C', 1e-10, 1e10)
[I 2024-04-25 15:53:20,190] Trial 10 finished with value: 0.5039209535759097 and parameters: {'model_type': 'SVM', 'svm_C': 3859568647.185754, 'svm_kernel': 'poly', 'svm_gamma': 'auto'}. Best is trial 0 with value: 0.5039209535759097.
  C = trial.suggest_loguniform('svm_C', 1e-10, 1e10)
[I 2024-04-25 15:59:44,235] Trial 11 finished with value: 0.5039209535759097 and parameters: {'model_type': 'SVM', 'svm_C': 1.131587750285519

## Running the model

In [18]:
svm_model = SVC(kernel='linear')
nb_model = MultinomialNB()

# Train and evaluate models, including language-specific performance
print("Evaluating SVM...")
train_and_evaluate(X_train, y_train, X_test, y_test, languages_test, svm_model, "SVM")

print("Evaluating Naive Bayes...")
train_and_evaluate(X_train, y_train, X_test, y_test, languages_test, nb_model, "Naive Bayes")

Evaluating SVM...
The demographic parity ratio is 0.9390938937751449
The equalized odds ratio is 0.4708679956050855
The demographic parity differece is 0.031663550059306544
The equalized odds difference is 0.09143453850117986
Results for SVM:
Overall Accuracy: 0.8800188205771644
Overall Classification Report:
              precision    recall  f1-score   support

    negative       0.89      0.87      0.88      3213
    positive       0.87      0.89      0.88      3163

    accuracy                           0.88      6376
   macro avg       0.88      0.88      0.88      6376
weighted avg       0.88      0.88      0.88      6376

Accuracy on English: 0.8482003129890454
Classification Report for English:
              precision    recall  f1-score   support

    negative       0.86      0.83      0.85      1603
    positive       0.83      0.87      0.85      1592

    accuracy                           0.85      3195
   macro avg       0.85      0.85      0.85      3195
weighted avg   

# Training the DVD Data Using SVM and Naive-Bayes

In [None]:
print(sample_data(data_en_dvd, data_fr_dvd).shape)

(18716, 3)


In [None]:
df_sampled = sample_data(data_en_dvd, data_fr_dvd)
X, y, tfidf, languages = preprocess_and_vectorize(df_sampled)

# Split the data, ensuring languages array is split consistently with X and y
X_train, X_test, y_train, y_test, languages_train, languages_test = train_test_split(X, y, languages, test_size=0.2, random_state=42)

# Initialize models
svm_model = SVC(kernel='linear')
nb_model = MultinomialNB()

# Train and evaluate models, including language-specific performance
print("Evaluating SVM...")
train_and_evaluate(X_train, y_train, X_test, y_test, languages_test, svm_model, "SVM")

print("Evaluating Naive Bayes...")
train_and_evaluate(X_train, y_train, X_test, y_test, languages_test, nb_model, "Naive Bayes")


Evaluating SVM...
The demographic parity ratio is 0.9938598414836628
The equalized odds ratio is 0.8510878861234726
The demographic parity differece is 0.003132134051684776
The equalized odds difference is 0.023429677896095652
Results for SVM:
Overall Accuracy: 0.8629807692307693
Overall Classification Report:
              precision    recall  f1-score   support

    negative       0.87      0.85      0.86      1873
    positive       0.86      0.87      0.86      1871

    accuracy                           0.86      3744
   macro avg       0.86      0.86      0.86      3744
weighted avg       0.86      0.86      0.86      3744

Accuracy on English: 0.8553191489361702
Classification Report for English:
              precision    recall  f1-score   support

    negative       0.87      0.84      0.85       947
    positive       0.84      0.87      0.86       933

    accuracy                           0.86      1880
   macro avg       0.86      0.86      0.86      1880
weighted avg  

# Training the Books Data Using SVM and Naive-Bayes

In [None]:
print(sample_data(data_en_books, data_fr_books).shape)

(65740, 3)


In [None]:
df_sampled = sample_data(data_en_books, data_fr_books)
X, y, tfidf, languages = preprocess_and_vectorize(df_sampled)

X_train, X_test, y_train, y_test, languages_train, languages_test = train_test_split(X, y, languages, test_size=0.2, random_state=42)

svm_model = SVC(kernel='linear')
nb_model = MultinomialNB()

print("Evaluating SVM...")
train_and_evaluate(X_train, y_train, X_test, y_test, languages_test, svm_model, "SVM")

print("Evaluating Naive Bayes...")
train_and_evaluate(X_train, y_train, X_test, y_test, languages_test, nb_model, "Naive Bayes")

Evaluating SVM...
The demographic parity ratio is 0.9883989554152294
The equalized odds ratio is 0.7175357836302722
The demographic parity differece is 0.005885656300913311
The equalized odds difference is 0.04176079944506905
Results for SVM:
Overall Accuracy: 0.8768634012777609
Overall Classification Report:
              precision    recall  f1-score   support

    negative       0.88      0.87      0.88      6565
    positive       0.87      0.88      0.88      6583

    accuracy                           0.88     13148
   macro avg       0.88      0.88      0.88     13148
weighted avg       0.88      0.88      0.88     13148

Accuracy on English: 0.8562471325890809
Classification Report for English:
              precision    recall  f1-score   support

    negative       0.86      0.85      0.86      3294
    positive       0.85      0.86      0.86      3245

    accuracy                           0.86      6539
   macro avg       0.86      0.86      0.86      6539
weighted avg   

# Transforming the Notebook into a Latex File

In [None]:
!apt-get -q install texlive-xetex texlive-fonts-recommended texlive-plain-generic

In [None]:
!jupyter nbconvert --to pdf "/content/drive/My Drive/Colab Notebooks/Analyzing Language Bias Between French and English in Conventional Multilingual Sentiment Analysis Models.ipynb"

[NbConvertApp] Converting notebook /content/drive/My Drive/Colab Notebooks/Analyzing Language Bias Between French and English in Conventional Multilingual Sentiment Analysis Models.ipynb to pdf
[NbConvertApp] Writing 101974 bytes to notebook.tex
[NbConvertApp] Building PDF
[NbConvertApp] Running xelatex 3 times: ['xelatex', 'notebook.tex', '-quiet']
[NbConvertApp] Running bibtex 1 time: ['bibtex', 'notebook']
[NbConvertApp] PDF successfully created
[NbConvertApp] Writing 99654 bytes to /content/drive/My Drive/Colab Notebooks/Analyzing Language Bias Between French and English in Conventional Multilingual Sentiment Analysis Models.pdf


In [None]:
import os
from google.colab import files
files.download(f"/content/drive/My Drive/Colab Notebooks/Analyzing Language Bias Between French and English in Conventional Multilingual Sentiment Analysis Models.pdf")

<IPython.core.display.Javascript object>

<IPython.core.display.Javascript object>