# Text Translation and Sentiment Analysis using Transformers

## Project Overview:

The objective of this project is to analyze the sentiment of movie reviews in three languages: English, French, and Spanish. We have been given 30 movies, 10 in each language, along with their reviews and synopses in separate CSV files named `movie_reviews_eng.csv`, `movie_reviews_fr.csv`, and `movie_reviews_sp.csv`.

- The first step of this project is to convert the French and Spanish reviews and synopses into English. This will allow us to analyze the sentiment of all reviews in the same language. We will be using pre-trained transformers from HuggingFace to achieve this task.

- Once the translations are complete, we will have a single dataframe that contains all 30 movies with their titles, years of release, synopses, and reviews in English, together with the original language of each movie. This dataframe will be used to perform sentiment analysis on the reviews of each movie.

- Finally, we will use pretrained transformers from HuggingFace to analyze the sentiment of each review. The sentiment analysis results will be added to the dataframe as a new column.


The output of the project will be a CSV file whose header row contains the column names **Title**, **Year**, **Synopsis**, **Review**, **Review Sentiment**, and **Original Language**. The **Original Language** column will indicate the language of the review and synopsis (`en`, `fr`, or `sp`) before translation. The dataframe will consist of 30 rows, with each row corresponding to a movie.
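The output layout described above can be checked mechanically. The following is a minimal sketch using only the standard library; the helper name `check_output_csv` and the two-row sample are hypothetical and only illustrate the idea, while the column names and language codes come from the description above.

```python
import csv
import io

# Column names and language codes as described in the project overview.
EXPECTED_COLUMNS = ["Title", "Year", "Synopsis", "Review",
                    "Review Sentiment", "Original Language"]
ALLOWED_LANGUAGES = {"en", "fr", "sp"}


def check_output_csv(csv_text: str, expected_rows: int) -> list:
    """Return a list of problems found in the CSV text (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        problems.append(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        language = row.get("Original Language")
        if language not in ALLOWED_LANGUAGES:
            problems.append(f"row {i}: bad language {language!r}")
    return problems


# Hypothetical two-row sample; the real file would have 30 rows.
sample = (
    "Title,Year,Synopsis,Review,Review Sentiment,Original Language\n"
    "Movie A,2001,A synopsis.,A review.,POSITIVE,en\n"
    "Movie B,2002,Another synopsis.,Another review.,NEGATIVE,fr\n"
)
print(check_output_csv(sample, expected_rows=2))
```

For the real export, the same function could be called with the file's text and `expected_rows=30`.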

In [None]:
import transformers
import sentencepiece



In [2]:
# imports
import pandas as pd
from transformers import MarianMTModel, MarianTokenizer
from transformers import pipeline


### Get data from `.csv` files and then preprocess data

In [3]:
import pandas as pd

# Language code -> CSV path; the codes are the values stored in 'Original Language'
SOURCE_FILES = {
    "en": "data/movie_reviews_eng.csv",
    "fr": "data/movie_reviews_fr.csv",
    "sp": "data/movie_reviews_sp.csv",
}
COLUMN_NAMES = ["Title", "Year", "Synopsis", "Review"]

def preprocess_data() -> pd.DataFrame:
    """
    Reads the three movie review CSV files from the local 'data' folder,
    standardizes column names, adds 'Original Language', and returns
    a single combined dataframe.
    """
    frames = []
    for language, path in SOURCE_FILES.items():
        frame = pd.read_csv(path)
        # Rename the first four columns by position so all files share one schema
        frame = frame.rename(columns=dict(zip(frame.columns[:4], COLUMN_NAMES)))
        frame["Original Language"] = language
        frames.append(frame)

    # Combine all into one dataframe
    return pd.concat(frames, ignore_index=True)

df = preprocess_data()
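
Renaming columns by position matters when the source files use different header names: `pd.concat` aligns frames by column label, so mismatched headers would produce extra, partly empty columns. The sketch below shows the idea on toy data. The header names in `toy_fr` are invented for illustration (the real CSV headers are not shown in this notebook), and `standardize` is a hypothetical helper.

```python
import pandas as pd

STANDARD_COLUMNS = ["Title", "Year", "Synopsis", "Review"]


def standardize(frame: pd.DataFrame, language: str) -> pd.DataFrame:
    """Rename the first four columns by position and tag the language."""
    renamed = frame.rename(columns=dict(zip(frame.columns[:4], STANDARD_COLUMNS)))
    renamed["Original Language"] = language
    return renamed


# Hypothetical stand-ins for the CSV contents; header names are invented.
toy_en = pd.DataFrame({"Title": ["Movie A"], "Year": [2001],
                       "Synopsis": ["..."], "Review": ["..."]})
toy_fr = pd.DataFrame({"Titre": ["Film B"], "Année": [2002],
                       "Synopsis": ["..."], "Critiques": ["..."]})

combined = pd.concat([standardize(toy_en, "en"), standardize(toy_fr, "fr")],
                     ignore_index=True)
print(combined.columns.tolist())
print(combined["Original Language"].tolist())
```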


In [4]:
df.sample(10)

| | Title | Year | Synopsis | Review | Original Language |
|---|---|---|---|---|---|
| 18 | Les Visiteurs en Amérique | 2000 | Dans cette suite de la comédie française Les V... | "Le film est une perte de temps totale. Les bl... | fr |
| 1 | The Dark Knight | 2008 | Batman (Christian Bale) teams up with District... | "The Dark Knight is a thrilling and intense su... | en |
| 4 | Inception | 2010 | Dom Cobb (Leonardo DiCaprio) is a skilled thie... | "Inception is a mind-bending and visually stun... | en |
| 28 | Torrente: El brazo tonto de la ley | 1998 | En esta comedia española, un policía corrupto ... | "Torrente es una película vulgar y ofensiva qu... | sp |
| 21 | La Casa de Papel | (2017-2021) | Esta serie de televisión española sigue a un g... | "La Casa de Papel es una serie emocionante y a... | sp |
| 14 | Le Fabuleux Destin d'Amélie Poulain | 2001 | Cette comédie romantique raconte l'histoire d'... | "Le Fabuleux Destin d'Amélie Poulain est un fi... | fr |
| 13 | Les Choristes | 2004 | Ce film raconte l'histoire d'un professeur de ... | "Les Choristes est un film magnifique qui vous... | fr |
| 27 | El Bar | 2017 | Un grupo de personas quedan atrapadas en un ba... | "El Bar es una película ridícula y sin sentido... | sp |
| 26 | Toc Toc | 2017 | En esta comedia española, un grupo de personas... | "Toc Toc es una película aburrida y poco origi... | sp |
| 10 | La La Land | 2016 | Cette comédie musicale raconte l'histoire d'un... | "La La Land est un film absolument magnifique ... | fr |


### Text translation

Translate the **Review** and **Synopsis** column values to English.


We load two translation models (French→English and Spanish→English) and define a simple function called `translate()`. This function takes a text, tokenizes it, sends it through the model, and returns the decoded English translation.

We will use it later to translate all French and Spanish reviews and synopses into English.


In [None]:
import torch



In [None]:
import safetensors



In [7]:
# load translation models and tokenizers
from transformers import MarianMTModel, MarianTokenizer

# TODO 2: Update the code below
fr_en_model_name = "Helsinki-NLP/opus-mt-fr-en"
es_en_model_name = "Helsinki-NLP/opus-mt-es-en"

fr_en_tokenizer = MarianTokenizer.from_pretrained(fr_en_model_name)
fr_en_model = MarianMTModel.from_pretrained(fr_en_model_name)

es_en_tokenizer = MarianTokenizer.from_pretrained(es_en_model_name)
es_en_model = MarianMTModel.from_pretrained(es_en_model_name)


# TODO 3: Complete the function below
def translate(text: str, model, tokenizer) -> str:
    """
    Translate a single text with the given MarianMT model and its tokenizer.
    Inputs longer than the tokenizer's maximum length are truncated.
    """
    # encode the text using the tokenizer (returns PyTorch tensors)
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    # generate the translation using the model
    outputs = model.generate(**inputs, max_length=512)

    # decode the first (and only) generated sequence into plain text
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return decoded

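
`translate()` handles one text per call, which is fine for 30 movies. For larger inputs, one possible variation is to send several texts to the model at once. The sketch below is an assumption-laden illustration, not part of the original notebook: `translate_in_batches` and `fake_translate_batch` are hypothetical names, and the stand-in translator just upper-cases text so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence


def translate_in_batches(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 8,
) -> List[str]:
    """Translate texts in fixed-size batches, preserving input order."""
    translated: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        translated.extend(translate_batch(batch))
    return translated


# With the Marian objects loaded above, a batch function could look like this
# (not executed here, since it needs the downloaded model weights):
#
#   def fr_en_batch(batch):
#       inputs = fr_en_tokenizer(batch, return_tensors="pt",
#                                padding=True, truncation=True)
#       outputs = fr_en_model.generate(**inputs, max_length=512)
#       return fr_en_tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Stand-in translator so the sketch runs without any model.
def fake_translate_batch(batch: List[str]) -> List[str]:
    return [text.upper() for text in batch]


print(translate_in_batches(["un", "deux", "trois"], fake_translate_batch, batch_size=2))
```

Passing the batch function in as an argument keeps the batching logic independent of the model, so it can be exercised with a stub first.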


### Translate the French and Spanish rows

The dataframe contains movies in three languages: English, French, and Spanish. English reviews and synopses don't need translation, but the French and Spanish ones do.

So in this code we:
- Find all French reviews and translate them to English.
- Find all French synopses and translate them to English.
- Find all Spanish reviews and translate them to English.
- Find all Spanish synopses and translate them to English.
- Put the translated text back into the dataframe, replacing the original French/Spanish text.

After this step, every movie has its review and synopsis in English, no matter what the original language was.
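
The select, translate, and write-back pattern described above can be seen on a toy dataframe. This is a minimal sketch under stated assumptions: the `translators` entries are hypothetical stand-ins for calls to `translate()` and simply tag the text, so no model is needed, and the three rows are invented.

```python
import pandas as pd

# Hypothetical stand-ins for translate(text, model, tokenizer); each simply
# tags the text so the effect on the dataframe is visible.
translators = {
    "fr": lambda text: f"[fr->en] {text}",
    "sp": lambda text: f"[es->en] {text}",
}

toy = pd.DataFrame({
    "Review": ["Great film.", "Un film magnifique.", "Una película aburrida."],
    "Synopsis": ["A hero rises.", "Une histoire.", "Una historia."],
    "Original Language": ["en", "fr", "sp"],
})

for language, translate_fn in translators.items():
    # Boolean mask selecting only the rows written in this language
    mask = toy["Original Language"] == language
    for column in ["Review", "Synopsis"]:
        # Translate the selected cells and write them back in place
        toy.loc[mask, column] = toy.loc[mask, column].astype(str).apply(translate_fn)

print(toy["Review"].tolist())
```

The English row is never selected by a mask, so it passes through unchanged.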

In [8]:
# TODO 4: Update the code below

# filter reviews in French and translate to English
fr_reviews = df.loc[df["Original Language"] == "fr", "Review"]
fr_reviews_en = fr_reviews.apply(lambda x: translate(str(x), fr_en_model, fr_en_tokenizer))

# filter synopsis in French and translate to English
fr_synopsis = df.loc[df["Original Language"] == "fr", "Synopsis"]
fr_synopsis_en = fr_synopsis.apply(lambda x: translate(str(x), fr_en_model, fr_en_tokenizer))

# filter reviews in Spanish and translate to English
es_reviews = df.loc[df["Original Language"] == "sp", "Review"]
es_reviews_en = es_reviews.apply(lambda x: translate(str(x), es_en_model, es_en_tokenizer))

# filter synopsis in Spanish and translate to English
es_synopsis = df.loc[df["Original Language"] == "sp", "Synopsis"]
es_synopsis_en = es_synopsis.apply(lambda x: translate(str(x), es_en_model, es_en_tokenizer))

# update dataframe with translated text (overwrite original Review and Synopsis)
df.loc[df["Original Language"] == "fr", "Review"] = fr_reviews_en
df.loc[df["Original Language"] == "fr", "Synopsis"] = fr_synopsis_en

df.loc[df["Original Language"] == "sp", "Review"] = es_reviews_en
df.loc[df["Original Language"] == "sp", "Synopsis"] = es_synopsis_en



In [None]:
df.sample(10)


### Sentiment Analysis

Use a pretrained HuggingFace model for sentiment analysis of the reviews. Store the resulting label (**POSITIVE** or **NEGATIVE**) in a new column titled **Review Sentiment** in the dataframe.

### Classify each review as positive or negative

We want to know whether each movie review is positive or negative. To do that, we use a pretrained model from HuggingFace.

HuggingFace is like an “app store for AI models”: you can download models for translation, summarization, sentiment analysis, and more, all without building them yourself.

In this part of the code:
- We load a sentiment analysis model from HuggingFace.
- This model has been fine-tuned to read English text and classify it as positive or negative.
- We write a simple function called `analyze_sentiment()`:
  - It takes a piece of text and sends it to the model.
  - The model returns a label: POSITIVE or NEGATIVE.

So now, for every translated review, we can automatically check how the reviewer felt about the movie.
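
For a single input string, a `sentiment-analysis` pipeline returns a list containing one dictionary with a `label` and a `score`. The sketch below mimics that call shape with a stand-in so it runs without a model. `fake_classifier` (a trivial keyword rule with an invented score) and `label_and_score` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple


# Stand-in for the HuggingFace pipeline: same call shape (text in, list with
# one {'label', 'score'} dict out), but a keyword rule instead of a model.
def fake_classifier(text: str) -> List[Dict[str, object]]:
    negative_words = {"boring", "waste", "ridiculous"}
    is_negative = any(word in text.lower() for word in negative_words)
    return [{"label": "NEGATIVE" if is_negative else "POSITIVE", "score": 0.99}]


def label_and_score(
    text: Optional[str], classifier: Callable
) -> Tuple[Optional[str], Optional[float]]:
    """Return (label, score) for a text, or (None, None) if it is missing."""
    if text is None:
        return None, None
    result = classifier(str(text))[0]  # first (and only) prediction
    return result["label"], result["score"]


print(label_and_score("A boring waste of time.", fake_classifier))
print(label_and_score("A stunning film.", fake_classifier))
```

Keeping the score alongside the label is optional; it can help to spot reviews where the model is less certain.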

In [None]:
# TODO 5: Update the code below
# load sentiment analysis model
from transformers import pipeline

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
sentiment_classifier = pipeline("sentiment-analysis", model=model_name)


# TODO 6: Complete the function below
def analyze_sentiment(text, classifier):
    """
    function to perform sentiment analysis on a text using a model
    """
    if pd.isna(text):
        return None
    
    result = classifier(str(text))[0]     # returns {'label': 'POSITIVE', 'score': ...}
    return result["label"]


### Positive or negative label
We apply the `analyze_sentiment()` function to every movie review in the dataframe.  

For each review, the model returns either **POSITIVE** or **NEGATIVE**.  
We store this result in a new column called **Review Sentiment**, so every movie now has a sentiment label based on its English review.
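
The model's labels are upper-case. If title-case values such as Positive and Negative are preferred in the output file, a small mapping can be applied afterwards. This is an optional sketch; `LABEL_NAMES` and `normalize_label` are hypothetical names, not part of the original notebook.

```python
from typing import Optional

# Mapping from the model's upper-case labels to title-case display names.
LABEL_NAMES = {"POSITIVE": "Positive", "NEGATIVE": "Negative"}


def normalize_label(label: Optional[str]) -> Optional[str]:
    """Map a model label to its display name; pass through anything unknown."""
    if label is None:
        return None
    return LABEL_NAMES.get(label, label)


print([normalize_label(x) for x in ["POSITIVE", "NEGATIVE", None]])
```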

In [None]:
# TODO 7: Add code below for sentiment analysis
# perform sentiment analysis on reviews and store results in new column

df["Review Sentiment"] = df["Review"].apply(lambda x: analyze_sentiment(x, sentiment_classifier))

In [None]:
df.sample(10)

In [None]:
# export the results to a .csv file
df.to_csv("reviews_with_sentiment.csv", index=False)
