# **Problem Set 4**

In this problem set we will use survey data from Switzerland to study what voters think about immigration.

In particular, this data was collected in 2020, after a referendum to introduce immigration restrictions. If you are curious, see "Immigration Restriction Referendum" in [this page](https://en.wikipedia.org/wiki/2020_Swiss_referendums#:~:text=The%20referendum%2C%20which%20was%20sponsored,of%20people%20to%20foreign%20nationals.).

The dataset contains responses from a sample of approximately 1500 respondents, who were asked, among other things, why they voted as they did. Their answers to these questions are recorded as text.

## **Install packages**

In [None]:
! pip install datasets
! pip install transformers[torch]
! pip install accelerate -U
! pip install -U transformers
! pip install evaluate
! pip install -q -U google-generativeai
! pip install openai
! pip install -U sentence-transformers
! pip install bertopic


In [None]:
import pandas as pd

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
df = pd.read_stata('/content/gdrive/MyDrive/1225_VOTO13_Data_Stata_v1.0.1.dta') #Put here the correct path

For a description of the variables, you can use the codebook. The dataset contains results from a survey collected in Switzerland after citizens voted on five referendums: "Initiative for Moderate Immigration," "Purchase of new fighter jets," "Hunting Law," "Child Tax Deductions," and "Paternity Leave." Respondents are asked how they voted and, among other things, why they did so.

In [None]:
df.columns

Index(['personid_4', 'polint', 'part', 'title_1', 'title_2', 'title_3',
       'title_4', 'title_5', 'title_8', 'title_9',
       ...
       'w_dtccpv1_pop', 'w_dtccpv2_pop', 'w_dtccpv3_pop', 'w_dtccpv4_pop',
       'w_dtccpv5_pop', 'w_dtccpv1_samp', 'w_dtccpv2_samp', 'w_dtccpv3_samp',
       'w_dtccpv4_samp', 'w_dtccpv5_samp'],
      dtype='object', length=193)

Notice that the dataframe contains 193 variables, most of which are numeric. We will focus on a set of variables that contain text: the answers to questions about the main reason why the respondent voted in a specific way. These are the variables whose names begin with "reason" and end with "_txt". Since we want to focus on those, we will drop all other variables (keeping only the respondent identifier).

In [None]:
columns = [col for col in df.columns if col.startswith('reason') and col.endswith('_txt')]
columns.insert(0, "personid_4")

df = df[columns]

In [None]:
df.head()

Unnamed: 0,personid_4,reason1_acc1_txt,reason1_acc2_txt,reason1_acc3_txt,reason1_acc4_txt,reason1_acc5_txt,reason1_den1_txt,reason1_den2_txt,reason1_den3_txt,reason1_den4_txt,...,reason2_acc1_txt,reason2_acc2_txt,reason2_acc3_txt,reason2_acc4_txt,reason2_acc5_txt,reason2_den1_txt,reason2_den2_txt,reason2_den3_txt,reason2_den4_txt,reason2_den5_txt
0,10001,,,,,,,,,,...,,,,,,,,,,
1,10056,,,,,,C'est dangereux vis-à-vis de nos rapports avec...,Je souhaite une régulation naturelle de la faune,Les personnes aisées qui en profitent,Je pense qu'il peut y avoir des abus,...,,,,,,"J'ai tendance à voter plutôt à gauche, les arg...",,,,La situation politique dans 10 ans est inconnue.
2,10153,,,,c'est déjà un premier pas et en suisse on a dé...,,je vote jamais UDC.,pour la protection des animaux,je me suis basée d'après les débats et je souh...,,...,,,,,,,,,,
3,10187,,revidirung muss geben,,,finde schweiz braucht leistungsfähige armee,wischi waschi vorlage,,es ist bis jetzt gegangen,die söllen ferien nehmên wer soll das alles bez.,...,,,,,,,,,,
4,10217,,,,,,,,,,...,,,,,,,,,,


There are a few elements that we need to underline regarding the data:

1. Missing values are common (that is, cells that are empty).
2. Answers, when present, are quite short.
3. Data come in three different languages (German, French, and Italian).
4. Answers are organized in terms of "most important" and "second most important."

We will need to make the dataset more homogeneous. For this reason, we will proceed by translating the text into English.
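
One way to make the data more homogeneous before translating is to reshape it from wide to long format, so that each row holds a single non-empty explanation. The sketch below runs on a small made-up frame that follows the same column naming pattern; the helper name `to_long` and the output column names `question` and `reason` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def to_long(frame, id_col="personid_4"):
    """Reshape reason*_txt columns into one row per non-empty answer."""
    text_cols = [c for c in frame.columns
                 if c.startswith("reason") and c.endswith("_txt")]
    long_df = frame.melt(id_vars=id_col, value_vars=text_cols,
                         var_name="question", value_name="reason")
    # Treat missing values and whitespace-only strings as empty, then drop them
    long_df["reason"] = long_df["reason"].fillna("").astype(str).str.strip()
    long_df = long_df[long_df["reason"] != ""]
    return long_df.reset_index(drop=True)

# Toy data (hypothetical values) mimicking the survey layout
toy = pd.DataFrame({
    "personid_4": [1, 2, 3],
    "reason1_acc1_txt": ["", "genug ist genug", None],
    "reason1_den1_txt": ["je vote jamais UDC.", "", ""],
})
long_toy = to_long(toy)
print(long_toy)
```

Keeping the `question` column makes it possible to recover later whether an answer was given as the most important or the second most important reason.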

## Question 1: Translate into English using OpenAI API (or Google Gemini)

Prepare the code to translate the explanations into English. You can either use ChatGPT or Google Gemini.

Explain the prompt you are using and apply it to a few examples. For example, be aware that the party promoting the referendum has different names depending on the language (SVP or UDC). Do you think this is something that should be addressed?

You don't need to translate the whole dataset, as you can use a version I have already prepared below.

Set up a function called "translate_text" which makes the call given a text input.

The multilingual naming problem should be addressed. Therefore, besides asking for a translation into English, my prompt also instructs the model to translate in the context of the relevant country. Because Gemini may otherwise return lengthy explanations, I also asked it to return only the English translation.

In [None]:
import google.generativeai as genai
from google.colab import userdata

# Set up the API key
GOOGLE_API_KEY=userdata.get('GEMINI')
genai.configure(api_key=GOOGLE_API_KEY)

In [None]:
# set up translate text function
def translate_text(input_text, target_language="en"):
    model = genai.GenerativeModel("gemini-1.5-flash")
    prompt = f"Translate this text into {target_language}, please be aware that the text can be in different languages, and please translate in the context of the relating country and only return the translation: {input_text}"
    response = model.generate_content(prompt)
    return response.text

In [None]:
print(translate_text("weil wir zu viele Leute sind in der kleinen Schweiz. Man merkt das, wenn man irgendwo hin will - überall hat es viele Leute."))

ERROR:tornado.access:503 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 23753.26ms
ERROR:tornado.access:503 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 633.67ms


because we have too many people in small Switzerland. You notice it when you want to go somewhere – everywhere is crowded.



In [None]:
print(translate_text("on a déjà assez d'étrangers."))

ERROR:tornado.access:503 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 962.39ms


We already have enough foreigners.
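
The `503` lines in the output above indicate that the service was temporarily unavailable before the request eventually succeeded. When translating many rows, it may help to wrap the call in an explicit retry loop with exponential backoff. The following is a minimal sketch: `with_retries` is a hypothetical helper, and the stand-in function `flaky` only simulates a failing call so that the example runs without an API key.

```python
import time

def with_retries(func, *args, max_attempts=4, base_delay=1.0, **kwargs):
    """Call func, retrying with exponential backoff if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            # Wait base_delay, 2*base_delay, 4*base_delay, ... between attempts
            time.sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for an API call that fails twice before succeeding (simulated)
calls = {"n": 0}

def flaky(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("503 service unavailable (simulated)")
    return text.upper()

result = with_retries(flaky, "genug ist genug", base_delay=0.01)
print(result, "after", calls["n"], "calls")
```

In the notebook, the same wrapper could be applied as `with_retries(translate_text, some_text)`.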



## Clustering

In this part, we want to analyze the explanations. We will begin by using BERTopic, a topic model / clustering algorithm that takes advantage of sentence-level embeddings from transformer models to create topics.

In [None]:
#HERE READ THE FILE WITH THE TRANSLATED EXPLANATIONS
df = pd.read_csv("/content/gdrive/MyDrive/df_translated_V2_updated.csv")

In [None]:
df.head()

Unnamed: 0,id,year,reason,gpt_response
0,10815,2020,era l'unico partito che in questo frangente so...,The language of the text is Italian.\n\nIn the...
1,10971,2020,genug ist genug,The language of the given text is German.\n\nT...
2,10992,2020,nous devons travailer avec l'europe.,The language of the given text is French.\n\nI...
3,11082,2020,wegen unserer älteren bürger die werden ausgeg...,The language of the given text is German.\n\nI...
4,11231,2020,weil wir zu viele Leute sind in der kleinen Sc...,The language of the given text is German.\n\nI...


In [None]:
na_count = df['gpt_response'].isna().sum()
print(f"Number of NA values in gpt_response column: {na_count}")

Number of NA values in gpt_response column: 0


In [None]:
import re

df['reason_ENG'] = df['gpt_response'].apply(lambda x: re.split(r'\s*\${3}\s*', x)[-1])
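
The regular expression above keeps only what follows the last `$$$` delimiter in each response. The example below illustrates this on a made-up response; the exact layout of `gpt_response` (a language note, then `$$$`, then the translation) is an assumption, since the preview shows only the beginning of each string. `extract_translation` is a hypothetical helper name.

```python
import re

def extract_translation(response):
    """Return the text after the last '$$$' delimiter (whole string if absent)."""
    return re.split(r"\s*\${3}\s*", response)[-1]

# Hypothetical response in the assumed format
sample = "The language of the given text is German.\n\n$$$ enough is enough"
with_delimiter = extract_translation(sample)
without_delimiter = extract_translation("no delimiter here")
print(with_delimiter)
print(without_delimiter)
```

Note that a response without the delimiter is kept in full, so it is worth checking a few rows of `reason_ENG` for leftover explanatory text.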

In [None]:
# Collect the English explanations in a plain Python list
reasons = df['reason_ENG'].tolist()

In [None]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
reasons_embeddings_mpnet = model.encode(reasons)



### Question 2
1. What are we doing in the previous cell? Go on Hugging Face and have a look at the description of the model we are using.
2. Go and check the documentation for BERTopic. What is the model doing? [Documentation](https://maartengr.github.io/BERTopic/index.html)
3. What are its limitations? (See, for example, [here](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html).)


1. We first checked the file for missing values, extracted the English translation into a new column named "reason_ENG", and collected those values in a list called "reasons". The previous cell then encodes each explanation with all-mpnet-base-v2, a model that maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

2. BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions. It supports several variants of topic modeling, including supervised and semi-supervised approaches.

3. For example, when each document contains several paragraphs, the document is truncated and the topic model is trained on only a small part of the data. Moreover, removing stop words as a preprocessing step is not advised, because transformer-based embedding models need the full context to create accurate embeddings.
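
To make the c-TF-IDF step more concrete, the sketch below computes a simplified class-based TF-IDF score by hand: all documents in a cluster are merged into one bag of words, and each word is weighted by its frequency in the cluster times log(1 + A / f), where A is the average number of words per cluster and f is the word's frequency across all clusters. This is an illustration under simplifying assumptions (whitespace tokenization, toy documents, a hypothetical function name `c_tf_idf`), not BERTopic's actual implementation.

```python
import math
from collections import Counter

def c_tf_idf(clusters):
    """clusters: dict mapping cluster id -> list of documents (strings)."""
    # Merge each cluster's documents into a single bag of words
    counts = {c: Counter(" ".join(docs).lower().split())
              for c, docs in clusters.items()}
    total = Counter()
    for cnt in counts.values():
        total.update(cnt)
    avg_words = sum(total.values()) / len(counts)  # A: average words per cluster
    scores = {}
    for c, cnt in counts.items():
        n_words = sum(cnt.values())
        # Term frequency within the cluster times the class-based idf term
        scores[c] = {w: (f / n_words) * math.log(1 + avg_words / total[w])
                     for w, f in cnt.items()}
    return scores

toy_clusters = {
    0: ["wage dumping hurts workers", "low wage jobs"],
    1: ["bilateral agreements at risk", "keep bilateral agreements"],
}
scores = c_tf_idf(toy_clusters)
top0 = max(scores[0], key=scores[0].get)
top1 = max(scores[1], key=scores[1].get)
print(top0, top1)
```

Words that are frequent inside one cluster but rare elsewhere receive the highest scores, which is why they end up in the topic description.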

In [None]:
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

In [None]:
from umap import UMAP
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

from hdbscan import HDBSCAN
hdbscan_model = HDBSCAN(min_cluster_size=60, metric='euclidean', cluster_selection_method='leaf', prediction_data=True)

In [None]:
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance, OpenAI, PartOfSpeech
from sklearn.feature_extraction.text import CountVectorizer
vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 3), lowercase=True)

keybert_model = KeyBERTInspired()

# MMR
mmr_model = MaximalMarginalRelevance(diversity=0.3)

# Part-of-Speech
pos_model = PartOfSpeech("en_core_web_sm")

representation_model = {
    "KeyBERT": keybert_model,
    "MMR": mmr_model,
    "POS": pos_model
}

In [None]:
topic_model = BERTopic(

  # Pipeline models
  embedding_model=model,
  umap_model=umap_model,
  hdbscan_model=hdbscan_model,
  vectorizer_model=vectorizer_model,
  representation_model=representation_model,

  # Hyperparameters
  top_n_words=15,
  verbose=True,
  calculate_probabilities=True
  )

# Train model
topics, probs = topic_model.fit_transform(reasons, reasons_embeddings_mpnet)

# Show topics
topic_model.get_topic_info()

2024-11-21 01:17:09,040 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2024-11-21 01:17:46,298 - BERTopic - Dimensionality - Completed ✓
2024-11-21 01:17:46,300 - BERTopic - Cluster - Start clustering the reduced embeddings
2024-11-21 01:17:46,599 - BERTopic - Cluster - Completed ✓
2024-11-21 01:17:46,610 - BERTopic - Representation - Extracting topics from clusters using representation models.
2024-11-21 01:18:40,021 - BERTopic - Representation - Completed ✓


Unnamed: 0,Topic,Count,Name,Representation,KeyBERT,MMR,POS,Representative_Docs
0,-1,1205,-1_switzerland_people_foreigners_bilateral,"[switzerland, people, foreigners, bilateral, a...","[economic reasons, bilateral agreements, agree...","[switzerland, foreigners, swiss, bilateral agr...","[people, foreigners, bilateral, agreements, bi...",[because we have bilateral agreements with the...
1,0,595,0_federal_initiative_council_federal council,"[federal, initiative, council, federal council...","[federal council, council, federal, protest, l...","[initiative, council, federal council, solutio...","[federal, initiative, council, good, solution,...","[The Federal Council reacts., Recommendation o..."
2,1,192,1_party_people party_swiss people party_swiss ...,"[party, people party, swiss people party, swis...","[party initiative swiss, vote swiss people, sw...","[swiss people party, party swiss people, party...","[party, people, initiatives, initiative, princ...","[to go against the Swiss People's Party., I am..."
3,2,188,2_european_european union_union_europe,"[european, european union, union, europe, rela...","[european union jeopardize, relations european...","[european union, relationship european, relati...","[relationship, relations, agreements, problems...",[to have a good relationship with the European...
4,3,168,3_need_foreign_foreigners_workers,"[need, foreign, foreigners, workers, need fore...","[foreign workers need, need foreign workers, w...","[need foreigners, need people, skilled workers...","[need, foreign, foreigners, workers, work, ski...","[There is a need for foreign workers., We need..."
5,4,135,4_swiss_switzerland_workers_work,"[swiss, switzerland, workers, work, foreigners...","[workers switzerland needs, foreign labor swit...","[switzerland, jobs swiss, switzerland needs, s...","[workers, work, foreigners, jobs, people, fore...","[Switzerland needs foreign labor., Switzerland..."
6,5,134,5_switzerland_european_union_european union,"[switzerland, european, union, european union,...","[switzerland remain, european union switzerlan...","[european union, bilateral agreements, europea...","[agreements, island, bilateral, bilateral agre...","[With the European Union, Switzerland had to a..."
7,6,132,6_foreigners_people_country_population,"[foreigners, people, country, population, spac...","[overpopulation foreigners, proportion foreign...","[population, overwhelmed foreigners, proportio...","[foreigners, people, country, population, spac...","[has had enough foreigners, too many foreigner..."
8,7,129,7_immigration_limit_borders_foreigners,"[immigration, limit, borders, foreigners, cont...","[restrict immigration, limit immigration, stop...","[immigration, immigrate, mass immigration, con...","[immigration, limit, borders, foreigners, cont...","[Get immigration under control., I do not want..."
9,8,84,8_bilateral_agreements_bilateral agreements_risk,"[bilateral, agreements, bilateral agreements, ...","[bilateral agreements jeopardized, endangers b...","[bilateral agreements risk, agreements risk, a...","[bilateral, agreements, bilateral agreements, ...","[The bilateral agreements are at risk., The bi..."


### Question 3:
1. Check the documentation, what does Topic -1 refer to?
2. What could we do to increase the number of topics generated by the model? What could we do to decrease it?
3. Can you show a few examples of explanations from Topic 9 (wage dumping)? Do these explanations match well with the representation given by BERTopic or with the KeyBERT output?

1. -1 refers to all outliers and should typically be ignored.

2. The most direct lever is min_cluster_size in HDBSCAN: lowering it allows smaller clusters and therefore more topics, while raising it yields fewer, larger topics. In UMAP, a lower n_neighbors emphasizes local structure and tends to produce more, smaller clusters, while a higher value tends to produce fewer, larger ones. To decrease the number of topics after training, we can also set nr_topics in BERTopic or call reduce_topics. By contrast, top_n_words and ngram_range only change how each topic is described (the number of keywords and the length of the phrases), so they do not change the number of topics.
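
When experimenting with these settings, it helps to track how many topics come out and how many documents land in the outlier topic. The sketch below uses a made-up list of topic assignments in place of the `topics` returned by `fit_transform`; `summarize_topics` is a hypothetical helper.

```python
from collections import Counter

def summarize_topics(topic_ids):
    """Return (number of non-outlier topics, share of documents labelled -1)."""
    sizes = Counter(topic_ids)
    n_topics = len([t for t in sizes if t != -1])
    outlier_share = sizes.get(-1, 0) / len(topic_ids)
    return n_topics, outlier_share

toy_topics = [-1, 0, 0, 1, -1, 2, 2, 2]  # hypothetical assignments
n_topics, share = summarize_topics(toy_topics)
print(n_topics, share)
```

Calling `summarize_topics(topics)` after each fit would make it easy to compare runs with different parameter values.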

In [None]:
# Extract all documents assigned to Topic 9
docs_topic_9 = [reasons[i] for i, topic in enumerate(topics) if topic == 9]

# Display 10 examples
for i, doc in enumerate(docs_topic_9[:10]):
    print(f"Document {i + 1}:\n{doc}\n")


Document 1:
There are problems of wage dumping as well as unemployment exacerbated by foreign labor.

Document 2:
I live in Chiasso and know many cases of people who have seen their jobs taken away because the newcomers cost less.

Document 3:
Too many people are working here with low wages.

Document 4:
Due to overpopulation and migration, they cost us money; they come in and do not want to work at all. But there are also exceptions.

Document 5:
There are already many unemployed people due to Covid-19.

Document 6:
Young people no longer find work.

Document 7:
My daughter does not have a job, my cousin has been without work for a year and a half, if there were fewer cross-border workers, perhaps our young people would have a few more job positions with equal qualifications (we are talking about the Mendrisiotto).

Document 8:
The increase in the population burdens unskilled workers.

Document 9:
Because I think it was better for the economy, we probably generally need these people t

In [None]:
topic_9_representation = topic_model.get_topic(9)
print(topic_9_representation)

[('wage', 0.06817033407196416), ('dumping', 0.05931492712102103), ('wage dumping', 0.055739093205216896), ('job', 0.044944146352866345), ('jobs', 0.04460549517361525), ('young', 0.04038370744047584), ('work', 0.04026976567597674), ('wages', 0.04010852708417081), ('young people', 0.037816999950020636), ('unemployed', 0.036616403540346495), ('people', 0.035296741541988724), ('labor', 0.024617590717201425), ('cost', 0.024586366912835767), ('unemployment', 0.021390263834041898), ('low', 0.019669093530268616)]


3. The examples in Topic 9 are mostly about wage dumping, job losses, and unemployment, which matches the BERTopic representation quite well.
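
The match can also be checked roughly in code by counting how many documents contain at least one of the topic's top keywords. The sketch below uses three of the documents printed above and a handful of the keywords returned by `get_topic(9)`; `keyword_coverage` is a hypothetical helper, and plain substring matching is a simplifying assumption that ignores word boundaries.

```python
def keyword_coverage(docs, keywords):
    """Share of documents containing at least one keyword (case-insensitive)."""
    hits = [any(k.lower() in d.lower() for k in keywords) for d in docs]
    return sum(hits) / len(docs)

docs = [
    "Too many people are working here with low wages.",
    "Young people no longer find work.",
    "The increase in the population burdens unskilled workers.",
]
keywords = ["wage", "dumping", "job", "unemployed"]
coverage = keyword_coverage(docs, keywords)
print(coverage)
```

A low coverage for narrow keywords such as "wage" suggests that the topic also gathers more general statements about work and unemployment.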



## Zero-Shot Classification

Now that we have an idea about which topics are more prevalent in the data, we can use another method to try to refine the previous results.

### Question 4
Set up a classification using the zero-shot classifier on the subsample of explanations assigned to Topic 9 (wage dumping). The goal in this case is to detect whether the explanation mentions wages.

Try using this model: https://huggingface.co/MoritzLaurer/deberta-v3-large-zeroshot-v2.0
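
A note on reading the results: for each input, the zero-shot pipeline returns a dictionary whose `labels` are sorted by descending score, so the first score does not always belong to the first candidate label. The sketch below shows one way to look up the score of a specific label; the `example_result` dictionary is made up to mimic the pipeline's output format, and `score_for` is a hypothetical helper.

```python
def score_for(result, label):
    """Return the score the classifier assigned to `label`."""
    return dict(zip(result["labels"], result["scores"]))[label]

# Made-up result in the format returned by the zero-shot pipeline
example_result = {
    "sequence": "Young people no longer find work.",
    "labels": ["non-wage", "wage"],  # sorted by score, highest first
    "scores": [0.75, 0.25],
}
wage_score = score_for(example_result, "wage")
mentions_wage = wage_score > 0.5
print(wage_score, mentions_wage)
```

The 0.5 threshold is an arbitrary choice for two labels; it can be adjusted after inspecting a few borderline cases.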

In [None]:
! pip install transformers torch



In [None]:
from transformers import pipeline
# Initialize zero-shot classifier
classifier = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0")

# Define the candidate labels
labels = ["wage", "non-wage"]

# Extract the subsample of explanations assigned to Topic 9 (wage dumping)
docs_topic_9 = [reasons[i] for i, topic in enumerate(topics) if topic == 9]


In [None]:
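# Classify each explanation with the pipeline defined above
for i, explanation in enumerate(docs_topic_9):
    result = classifier(explanation, candidate_labels=labels)
    # Scores are sorted in descending order; result['labels'] gives the matching label order
    print(f"Document {i + 1}:")
    print(f"Scores: {result['scores']}\n")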

Document 1:
Scores: [0.8844894170761108, 0.11551059037446976]

Document 2:
Scores: [0.5747592449188232, 0.4252406656742096]

Document 3:
Scores: [0.9579455852508545, 0.0420544259250164]

Document 4:
Scores: [0.9191432595252991, 0.08085674047470093]

Document 5:
Scores: [0.8867632746696472, 0.11323676258325577]

Document 6:
Scores: [0.7487205266952515, 0.25127944350242615]

Document 7:
Scores: [0.9071237444877625, 0.09287624806165695]

Document 8:
Scores: [0.7814421057701111, 0.2185579091310501]

Document 9:
Scores: [0.5277970433235168, 0.47220301628112793]

Document 10:
Scores: [0.915023922920227, 0.08497604727745056]

Document 11:
Scores: [0.8291773796081543, 0.1708226501941681]

Document 12:
Scores: [0.8474560379981995, 0.15254397690296173]

Document 13:
Scores: [0.910165548324585, 0.08983449637889862]

Document 14:
Scores: [0.8637590408325195, 0.13624094426631927]

Document 15:
Scores: [0.5053606629371643, 0.4946393072605133]

Document 16:
Scores: [0.6985858678817749, 0.301414102315