# Analysis of Weaponised Wikipedia Edits

## Data Import and Structure

I loaded every `.csv` file in the **LLM Results** folder into a list of DataFrames called `dfs`.

I also created a dictionary, `dfs_named`, that maps each file stem (the article name followed by `_analysis`) to the DataFrame holding all edits for that article:

```python
dfs_named["2004_Ukrainian_presidential_election_analysis"]
dfs_named["Abortion_in_Ukraine_analysis"]
```

In total, `dfs` holds 120 tables with 117,517 rows.

## Filtering for Weaponised Edits

From these 117,517 rows, I kept only those that the LLM classified as weaponised:
```python
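# reset_index gives each filtered table a fresh 0..n-1 index, so row labels refer to the filtered data
dfs_weaponised = [df[df['weaponised'] == 'Weaponised'].reset_index(drop=True) for df in dfs]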
```

This reduces the dataset to **18,707 rows** (about 15.9% of the total), i.e., 18,707 edits that the LLM flagged as potentially weaponised.
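
The two row counts can be reproduced by summing DataFrame lengths before and after the filter. The sketch below is only illustrative: `toy_dfs` is a hypothetical stand-in for the real `dfs`, and `count_rows` is a helper name introduced here, not part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the real per-article tables in `dfs`.
toy_dfs = [
    pd.DataFrame({"weaponised": ["Weaponised", "Not weaponised", "Weaponised"]}),
    pd.DataFrame({"weaponised": ["Not weaponised", "Not weaponised"]}),
]

def count_rows(frames):
    """Total number of rows across a list of DataFrames."""
    return sum(len(df) for df in frames)

# Same filter as above, applied to the toy tables.
toy_weaponised = [df[df["weaponised"] == "Weaponised"] for df in toy_dfs]

total = count_rows(toy_dfs)
kept = count_rows(toy_weaponised)
print(f"{kept} of {total} rows kept ({kept / total:.1%})")  # 2 of 5 rows kept (40.0%)
```

Applied to the real `dfs` and `dfs_weaponised`, the same helper should return 117,517 and 18,707 respectively.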

Each DataFrame in `dfs_weaponised` contains exactly 7 columns:

* `initial_version` – text before the edit
* `changed_version` – text after the edit
* `comment` – user’s comment explaining the edit
* `user` – username of the editor
* `date` – timestamp of the edit (ISO format)
* `llm_output` – LLM evaluation of the edit
* `weaponised` – classification; always `"Weaponised"` here, because rows labelled `"Not weaponised"` have already been filtered out
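
A quick schema check can confirm that every table really has these seven columns. This is a minimal sketch: `EXPECTED_COLUMNS` and `missing_columns` are names introduced here for illustration, and `toy` is a hypothetical table rather than one of the real CSVs.

```python
import pandas as pd

# Column names as listed above.
EXPECTED_COLUMNS = [
    "initial_version", "changed_version", "comment",
    "user", "date", "llm_output", "weaponised",
]

def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that `df` does not have."""
    return [col for col in expected if col not in df.columns]

# Hypothetical table that lacks the `llm_output` column.
toy = pd.DataFrame(columns=[c for c in EXPECTED_COLUMNS if c != "llm_output"])
print(missing_columns(toy))  # ['llm_output']
```

Running `missing_columns` over each DataFrame in `dfs_weaponised` and collecting the non-empty results would flag any table that deviates from the schema.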

I also created another dictionary, `dfs_weaponised_named`, to keep track of article names alongside their corresponding filtered DataFrames.

## Mapping External Chunks to Wikipedia Edits

The goal is to check whether externally provided text chunks appear in the weaponised edits.
To do this, I implemented a function:

```python
def generate_ngrams(text, n=4):
    words = text.split()
    return [" ".join(words[i:i+n]) for i in range(len(words)-n+1)]
```

This function slices a text into overlapping sequences of n consecutive words (n-grams).
I then looped over all weaponised DataFrames and checked if any n-gram was present in the **changed_version** column.
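
The search loop described above might look roughly like the following. It is a sketch, not the exact notebook code: `find_matches` is a hypothetical helper name, `toy` is a miniature stand-in for `dfs_weaponised_named`, and the match is assumed to be a plain, case-sensitive substring test.

```python
import pandas as pd

def generate_ngrams(text, n=4):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def find_matches(chunk, frames_by_name, n=4, column="changed_version"):
    """Map article name -> row labels whose `column` contains any n-gram of `chunk`."""
    ngrams = generate_ngrams(chunk, n)
    matches = {}
    for name, df in frames_by_name.items():
        # Missing values become empty strings so the substring test never fails on NaN.
        texts = df[column].fillna("").astype(str)
        rows = [idx for idx, text in texts.items() if any(g in text for g in ngrams)]
        if rows:
            matches[name] = rows
    return matches

# Hypothetical miniature of `dfs_weaponised_named`.
toy = {
    "Article_A": pd.DataFrame({"changed_version": [
        "Unrelated text.",
        "This led to the bloodless annexation of Crimea by Russia.",
    ]}),
    "Article_B": pd.DataFrame({"changed_version": ["Nothing relevant here.", None]}),
}

result = find_matches("the bloodless annexation of Crimea by Russia", toy)
for name, rows in result.items():
    print(f"Match found in article '{name}', rows {rows}")
```

Note that a chunk with fewer than `n` words produces no n-grams and therefore can never match under this approach.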

### Example

Consider the following chunk:

```
"A referendum in the largely ethnic Russian Ukrainian autonomous region of Crimea resulted in the bloodless annexation of Crimea by Russia on 18 March 2014."
```

Its 4-word n-grams include:

```python
['A referendum in the',
 'the largely ethnic Russian',
 'Russian Ukrainian autonomous region',
 'region of Crimea resulted',
 'resulted in the bloodless',
 'bloodless annexation of Crimea',
 'Crimea by Russia on',
 'on 18 March 2014.']
```
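
For reference, `generate_ngrams` with its default step of one word returns more n-grams than the eight listed; the listed ones correspond to every third element of the full output. The snippet below reproduces this on the example chunk (the variable names `chunk` and `ngrams` are chosen here for illustration).

```python
def generate_ngrams(text, n=4):
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

chunk = ("A referendum in the largely ethnic Russian Ukrainian autonomous region "
         "of Crimea resulted in the bloodless annexation of Crimea by Russia "
         "on 18 March 2014.")

ngrams = generate_ngrams(chunk)
print(len(ngrams))   # 25 words -> 25 - 4 + 1 = 22 n-grams
print(ngrams[::3])   # every third n-gram, starting from the first
```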

Searching for the n-gram `"bloodless annexation of Crimea"` returned the following matches:

```
Match found in article 'History_of_Ukraine_analysis', rows [384, 389]
Match found in article 'History_of_Ukraine#Early_modern_period_analysis', rows [391, 394]
Match found in article 'History_of_Ukraine#World_War_II_and_the_Nazi_Occupation_analysis', rows [381, 390]
```

## Manual Verification

I manually checked one example:

```python
dfs_weaponised_named["History_of_Ukraine#Early_modern_period_analysis"].loc[391, "changed_version"]
```
The output confirmed that the **changed_version** at that row contained the phrase *"bloodless annexation of Crimea"*.
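
The same check can be done programmatically instead of reading the cell output by eye. The sketch below assumes a literal, case-sensitive match and uses a hypothetical `toy` table in place of the real DataFrame.

```python
import pandas as pd

phrase = "bloodless annexation of Crimea"

# Hypothetical stand-in for one table of `dfs_weaponised_named`.
toy = pd.DataFrame({"changed_version": [
    "Text without the phrase.",
    "... resulted in the bloodless annexation of Crimea by Russia ...",
    None,
]})

# regex=False treats the phrase as a literal string;
# na=False counts missing text as "no match".
hits = toy["changed_version"].str.contains(phrase, regex=False, na=False)
print(hits[hits].index.tolist())  # [1]
```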

In [1]:
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# Path to the folder containing CSVs or .xlsx
csv_folder = Path("datas/previous-project/Hedi/LLM Results")

# Load each CSV once, keyed by file stem (e.g. "Crimea_analysis")
dfs_named = {f.stem: pd.read_csv(f) for f in csv_folder.glob("*.csv")}

# The same DataFrames as a plain list
dfs = list(dfs_named.values())

# Example: show how many DataFrames were loaded
print(f"Loaded {len(dfs)} CSV files.")
print("Available keys:", list(dfs_named.keys()))

Loaded 120 CSV files.
Available keys: ['COVID-19_pandemic_in_Ukraine_analysis', 'Immigration_to_Ukraine_analysis', 'Andrei_Sheptytsky_analysis', 'Football_in_Ukraine_analysis', 'Abortion_in_Ukraine_analysis', '2004_Ukrainian_presidential_election_analysis', 'History_of_Ukraine_analysis', 'Christianity_in_Russia_analysis', 'Cossack’s_songs_of_Dnipropetrovsk_Region_analysis', 'Flag_of_Ukraine_analysis', 'Belarusians_in_Ukraine_analysis', 'Cinema_of_Ukraine_analysis', 'Brotherhood_of_Independent_Baptist_Churches_and_Ministries_of_Ukraine_analysis', 'Hillsong_Ukraine_analysis', 'Communist_Party_of_the_Soviet_Union_analysis', 'Electricity_in_Ukraine_analysis', 'Censuses_in_Ukraine_analysis', 'Declaration_of_Independence_of_Ukraine_analysis', 'History_of_Christianity_in_Ukraine_analysis', 'Bessarabia_analysis', 'Human_trafficking_in_Ukraine_analysis', 'Crimea_analysis', 'Epiphanius_I_of_Ukraine_analysis', 'Hillsong_Church_Kiev_analysis', 'Government_of_the_Ukrainian_People_s_Republic_in_exil

In [3]:
# dfs_weaponised is a list of the dataframes with only edits that were classified as weaponised by the LLM
dfs_weaponised = [df[df['weaponised'] == 'Weaponised'].reset_index(drop=True) for df in dfs]
print(len(dfs_weaponised))

120


In [4]:
best_chunks = pd.read_excel("datas/previous-project/Hedi/best_chunks_semi_automated_annotated_data_repaired.xlsx")

In [5]:
# Keep only chunks that all four annotation columns mark as "Correct"
# (missing annotations compare unequal to 'Correct', so they are dropped too)
annot_cols = ['Annot 1', 'Annot 2', 'Annot 1 - new', 'Annot 2 - new']
best_chunks = best_chunks[(best_chunks[annot_cols] == 'Correct').all(axis=1)]

In [84]:
best_chunks.iloc[10:20]

Unnamed: 0,row_index,detected_before,detected_after,clean_before,clean_after,type_of_change_extracted,category_extracted_clean,propaganda_similarity,category_extracted_propaganda_mapped,aligned_before_chunk,aligned_after_chunk,similarity,significance_extracted,Annot 1,Annot 2,Annot 1 - new,Annot 2 - new,Column 1
34,132,The '''Ukrainian Orthodox Church''' is an auto...,"--- \n+++ \n@@ -3,7 +3,7 @@\n | icon ...",The '''Ukrainian Orthodox Church''' is an auto...,"--- \n+++ \n@@ -3,7 +3,7 @@\n | icon ...",rephrasing and renaming,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness",The '''Ukrainian Orthodox Church''' is an auto...,The '''Ukrainian Orthodox Church''' ('''UOC''...,0.91382,The change from 'Ukrainian Orthodox Church' to...,Correct,Correct,Correct,Correct,Agree
36,151,The '''[[Ukraine|Ukrainian]] revolution''' of ...,The '''NATO Coup of 2014''' (never known as th...,The '''Ukrainian revolution''' of February 201...,The '''NATO Coup of 2014''' (never known as th...,synonym swap,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness",The '''Ukrainian revolution''' of February 201...,The '''NATO Coup of 2014''' (never known as th...,0.898157,This change alters the historical narrative by...,Correct,Correct,Correct,Correct,Agree
37,154,In so far as he had any decided political conv...,Insofar as he may have had political syphilis ...,In so far as he had any decided political conv...,Insofar as he may have had political syphilis ...,rephrasing,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness",In so far as he had any decided political conv...,Insofar as he may have had political syphilis ...,0.577801,This edit introduces vulgar and derogatory lan...,Correct,Correct,Correct,Correct,Agree
39,159,The '''English Civil War''' consisted of a ser...,The '''English Civil War''' consisted of a ser...,The '''English Civil War''' consisted of a ser...,The '''English Civil War''' consisted of a ser...,word replacement and rephrasing,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness",The '''English Civil War''' consisted of a ser...,The '''English Civil War''' consisted of a ser...,0.877601,The changes introduce derogatory and sexualize...,Correct,Correct,Correct,Correct,Agree
41,165,Kutia (sweet grain pudding) is traditionally s...,Kutta (sweet grain pudding) is traditionally s...,Kutia (sweet grain pudding) is traditionally s...,Kutta (sweet grain pudding) is traditionally s...,Terminology manipulation and phrase alteration,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness",Kutia (sweet grain pudding) is traditionally s...,Kutta (sweet grain pudding) is traditionally s...,0.857476,The change from 'Kutia' to 'Kutta' may dilute ...,Correct,Correct,Correct,Correct,Agree
42,173,Yanukovych fled Ukraine amid the [[Euromaidan]...,Yanukovych fled Ukraine amid the [[Euromaidan]...,Yanukovych fled Ukraine amid the Euromaidan up...,Yanukovych fled Ukraine amid the Euromaidan co...,synonym swap,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness",Yanukovych fled Ukraine amid the Euromaidan up...,Yanukovych fled Ukraine amid the Euromaidan co...,0.973109,The change from 'uprising' to 'coup' alters th...,Correct,Correct,Correct,Correct,Agree
44,183,"In late February 2014, Russia began to send tr...","In late February 2014, black american chimpanz...","In late February 2014, Russia began to send tr...","In late February 2014, black american chimpanz...",replacement of a phrase,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness","In late February 2014, Russia began to send tr...","In late February 2014, black american chimpanz...",0.692616,This change introduces a racially charged and ...,Correct,Correct,Correct,Correct,Agree
47,200,[[Image:Kraina_Mriy_Ukraine_folk_costume_flowe...,"--- \n+++ \n@@ -2,15 +2,15 @@\n The '''Ukraini...",thumb|A girl wearing a Ukrainian costume with ...,"--- \n+++ \n@@ -2,15 +2,15 @@\n The '''Ukraini...",synonym swap,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness",\nThe flowers used to make the vinok were gene...,\n The flowers used to make the ''vinok'' were...,0.989811,The replacement of 'vinok' with 'wreath' dimin...,Correct,Correct,Correct,Correct,Agree
50,219,After the [[Soviet invasion]]{{dn|date=Februar...,After the [[Soviet invasion]]{{disambiguation ...,After the Soviet invasion of Ukraine in 1920 a...,After the Soviet invasion of Ukraine in 1920 a...,Formatting change,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness",After the Soviet invasion of Ukraine in 1920 a...,After the Soviet invasion of Ukraine in 1920 a...,1.0,The change from 'dn' to 'disambiguation needed...,Correct,Correct,Correct,Correct,Agree
58,252,In accordance with that time's Imerial termino...,In accordance with that time's prevailing term...,In accordance with that time's Imerial termino...,In accordance with that time's prevailing term...,synonym swap,Terminology Manipulation,0.418396,"Obfuscation, intentional vagueness",In accordance with that time's Imerial termino...,In accordance with that time's prevailing term...,0.93302,The change from 'Imperial terminology' to 'pre...,Correct,Correct,Correct,Correct,Agree


In [6]:
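# Drop articles that have no weaponised edits
dfs_weaponised = [df for df in dfs_weaponised if not df.empty]

# Same filter, keyed by article name (each table is filtered only once)
dfs_weaponised_named = {}
for name, df in dfs_named.items():
    filtered = df[df['weaponised'] == 'Weaponised'].reset_index(drop=True)
    if not filtered.empty:
        dfs_weaponised_named[name] = filtered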