In [7]:
import pandas as pd
from hashlib import sha256
from difflib import SequenceMatcher

# Function to compute a hash of the string
def compute_hash(text):
    return sha256(text.encode('utf-8')).hexdigest()

# Function to check for a match: token-set overlap first, then character-level similarity
def is_token_set_match(a, b, threshold=0.9):
    tokens_a = set(a.split())
    tokens_b = set(b.split())
    largest = max(len(tokens_a), len(tokens_b))
    if largest == 0:
        # Both texts are empty: nothing to compare, and this avoids dividing by zero
        return False
    intersection = tokens_a.intersection(tokens_b)
    if len(intersection) / largest >= threshold:
        return SequenceMatcher(None, a, b).ratio() >= threshold
    return False

# Load datasets
multitarget_conan = pd.read_csv('data/Multitarget-CONAN.csv')
intentconan_train = pd.read_csv('data/iconan/iconan_train.csv')
intentconan_test = pd.read_csv('data/iconan/iconan_test.csv')
intentconan_dev = pd.read_csv('data/iconan/iconan_dev.csv')
intentconanv2_train = pd.read_csv('data/iconanv2/train.csv')
intentconanv2_test = pd.read_csv('data/iconanv2/test.csv')
intentconanv2_val = pd.read_csv('data/iconanv2/val.csv')

# Concatenate intentconan and intentconanv2 datasets
intentconan = pd.concat([intentconan_train, intentconan_test, intentconan_dev])
intentconanv2 = pd.concat([intentconanv2_train, intentconanv2_test, intentconanv2_val])

# Normalize the text for comparison
multitarget_conan['COUNTER_NARRATIVE'] = multitarget_conan['COUNTER_NARRATIVE'].str.lower().str.strip()
intentconan['counterSpeech'] = intentconan['counterSpeech'].str.lower().str.strip()
intentconanv2['counterspeech'] = intentconanv2['counterspeech'].str.lower().str.strip()

# Compute hashes for fast comparison
multitarget_conan['CN_HASH'] = multitarget_conan['COUNTER_NARRATIVE'].apply(compute_hash)
intentconan['CS_HASH'] = intentconan['counterSpeech'].apply(compute_hash)
intentconanv2['CS_HASH'] = intentconanv2['counterspeech'].apply(compute_hash)

# Prepare to store matches
matches = []

# Check for matches in intentconan
for idx, row in multitarget_conan.iterrows():
    cn_hash = row['CN_HASH']
    for _, iconan_row in intentconan[intentconan['CS_HASH'] == cn_hash].iterrows():
        if is_token_set_match(row['COUNTER_NARRATIVE'], iconan_row['counterSpeech']):
            matches.append({
                'source': 'intentconan',
                'HATE_SPEECH': row['HATE_SPEECH'],
                'COUNTER_NARRATIVE': row['COUNTER_NARRATIVE'],
                'Matched_counterSpeech': iconan_row['counterSpeech']
            })

# Check for matches in intentconanv2
for idx, row in multitarget_conan.iterrows():
    cn_hash = row['CN_HASH']
    for _, iconanv2_row in intentconanv2[intentconanv2['CS_HASH'] == cn_hash].iterrows():
        if is_token_set_match(row['COUNTER_NARRATIVE'], iconanv2_row['counterspeech']):
            matches.append({
                'source': 'intentconanv2',
                'HATE_SPEECH': row['HATE_SPEECH'],
                'COUNTER_NARRATIVE': row['COUNTER_NARRATIVE'],
                'Matched_counterSpeech': iconanv2_row['counterspeech']
            })

# Convert matches to DataFrame
matches_df = pd.DataFrame(matches)

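# Save matches to a CSV file and report where they were written
matches_df.to_csv('data/optimized_matched_counterspeech.csv', index=False)
print("Optimized results have been saved to 'optimized_matched_counterspeech.csv'")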



Optimized results have been saved to 'optimized_matched_counterspeech.csv'


## Key Steps:

1. Load multiple datasets containing counter-narratives.
2. Normalize the text data for consistency.
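3. Generate a SHA-256 hash of each normalized counter-narrative so that identical texts can be found quickly.
4. Compare counter-narratives across datasets by hash, which finds texts that are identical after normalization, then confirm each candidate pair with a similarity check.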
5. Save the identified matches into a new CSV file.

## Step-by-Step Explanation:

### Importing Libraries:

- pandas is imported for handling and processing the datasets.
- sha256 from hashlib is used to generate a unique identifier (hash) for each counter-narrative.
- SequenceMatcher from difflib helps in comparing the similarity between two texts.

### Defining Functions:

- compute_hash(text): Takes a string (text), encodes it as UTF-8, and returns its SHA-256 digest as a hexadecimal string. Identical strings always produce the same digest, so the hash is used to quickly identify matching counter-narratives.
- is_token_set_match(a, b, threshold=0.9): Checks whether two strings a and b are similar in two stages. It first splits each string on whitespace into a set of tokens and computes the share of common tokens relative to the larger of the two sets. If that share reaches the threshold (0.9 by default), it then requires the character-level SequenceMatcher ratio of the two strings to reach the same threshold. Both checks must pass for the function to return True.
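
The two-stage behaviour of is_token_set_match() can be seen on a few toy strings. This is a minimal sketch: the function is restated so the block runs on its own, the guard against two empty strings is an addition made for this sketch, and the sample sentences are invented for illustration.

```python
from difflib import SequenceMatcher

def is_token_set_match(a, b, threshold=0.9):
    """Token-set overlap first, then character-level similarity."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    largest = max(len(tokens_a), len(tokens_b))
    if largest == 0:
        # Guard added for this sketch: two empty strings are not compared
        return False
    overlap = len(tokens_a & tokens_b) / largest
    if overlap < threshold:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold

# Identical strings pass both stages
identical = is_token_set_match("hate has no place here", "hate has no place here")

# Same tokens in reverse order: stage one passes, the character-level stage fails
reordered = is_token_set_match("a b c d e f g h i j", "j i h g f e d c b a")

# No shared tokens: rejected at stage one
different = is_token_set_match("hate has no place here", "everyone deserves respect")

print(identical, reordered, different)
```

The reordered pair shows why the second stage exists: a set of tokens ignores word order, so the SequenceMatcher ratio is what rejects texts that merely share a vocabulary.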

### Loading the Datasets:

Seven CSV files are loaded with pd.read_csv(): Multitarget-CONAN (one file) and the train, test, and dev/val splits of intentconan and intentconanv2. The counter-narrative column is named differently in each source: COUNTER_NARRATIVE, counterSpeech, and counterspeech, respectively.

### Concatenating Datasets:

The script stacks the splits (train, test, dev/val) of intentconan and intentconanv2 with pd.concat(), producing one DataFrame per dataset version.
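
One detail worth knowing about pd.concat() is that it keeps each split's original row index unless told otherwise. The sketch below uses tiny invented frames in place of the real CSV files; the split column is a hypothetical addition that records where each row came from and is not part of the original script.

```python
import pandas as pd

# Toy stand-ins for the train/test/dev splits (invented rows)
train = pd.DataFrame({"counterSpeech": ["a", "b"]})
test = pd.DataFrame({"counterSpeech": ["c"]})
dev = pd.DataFrame({"counterSpeech": ["d"]})

# As in the script: index labels from each split are kept, so they repeat
plain = pd.concat([train, test, dev])
print(plain.index.tolist())  # [0, 1, 0, 0]

# Variant: tag each row with its split and build a fresh 0..n-1 index
combined = pd.concat(
    [train.assign(split="train"), test.assign(split="test"), dev.assign(split="dev")],
    ignore_index=True,
)
print(combined.index.tolist())  # [0, 1, 2, 3]
print(combined["split"].tolist())  # ['train', 'train', 'test', 'dev']
```

The repeated index labels do not affect the matching in this script, because rows are selected by hash and not by label, but a unique index makes later lookups less error-prone.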

### Normalizing the Text Data:

The counter-narrative text in each dataset is converted to lowercase and stripped of leading/trailing whitespace to ensure uniformity in comparisons. This step removes discrepancies due to letter case or accidental spaces at either end; internal whitespace and punctuation are left unchanged.
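
A small sketch of what this normalization does and does not change, using an invented Series. The fillna("") call is an assumption added here to keep missing values from breaking the string operations; the original script does not include it.

```python
import pandas as pd

# Invented examples: case/edge-space variants, a double internal space, a missing value
raw = pd.Series([
    "  Hate Has No Place Here ",
    "hate has no place here",
    "hate  has no place here",
    None,
])

# Same operations as the script, preceded by fillna("") (an assumption of this sketch)
normalized = raw.fillna("").str.lower().str.strip()
print(normalized.tolist())

# Case and edge whitespace are unified ...
print(normalized[0] == normalized[1])  # True
# ... but the double space inside the third text is still there
print(normalized[2] == normalized[1])  # False
```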

### Generating Hashes:

The compute_hash() function is applied to each counter-narrative in the datasets to create a hash. These hashes serve as a quick way to check if two counter-narratives are exactly the same.
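
To illustrate what hash equality can and cannot detect, the following sketch hashes three invented strings with the same compute_hash() function. A single extra character yields a completely different digest, so hashes identify only texts that are exactly equal.

```python
from hashlib import sha256

def compute_hash(text):
    # SHA-256 digest of the UTF-8 encoded text, as a 64-character hex string
    return sha256(text.encode("utf-8")).hexdigest()

a = compute_hash("hate has no place here")
b = compute_hash("hate has no place here")
c = compute_hash("hate has no place here.")  # trailing period added

print(len(a))  # 64
print(a == b)  # True: identical text, identical hash
print(a == c)  # False: one extra character changes the hash
```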

### Finding Matches:

The script iterates over the rows of the multitarget_conan dataset twice, once for intentconan and once for intentconanv2.
For each counter-narrative, it selects the rows of the other dataset whose hash equals its own (fast check) and passes each candidate pair to is_token_set_match(). Because equal hashes mean the normalized texts are identical, this second check passes for every non-empty text. The script therefore finds only texts that match exactly after normalization; paraphrases, or texts that differ by a single character, are not detected.
If a match is found, a dictionary containing the source dataset, the original hate speech, the counter-narrative, and its matched counterpart is appended to the matches list.
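
Since the comparison comes down to hash equality, the same result could plausibly be obtained with a single join in place of the nested loops. The following is a sketch under that assumption, not the script's own method; the two small frames are invented stand-ins for the real datasets.

```python
import pandas as pd
from hashlib import sha256

def compute_hash(text):
    return sha256(text.encode("utf-8")).hexdigest()

# Toy stand-ins for multitarget_conan and intentconan (invented rows)
multitarget = pd.DataFrame({
    "HATE_SPEECH": ["hs one", "hs two"],
    "COUNTER_NARRATIVE": ["hate has no place here", "everyone deserves respect"],
})
other = pd.DataFrame({
    "counterSpeech": ["hate has no place here", "something unrelated"],
})

multitarget["CN_HASH"] = multitarget["COUNTER_NARRATIVE"].apply(compute_hash)
other["CS_HASH"] = other["counterSpeech"].apply(compute_hash)

# An inner join keeps only the row pairs whose hashes are equal
joined = multitarget.merge(other, left_on="CN_HASH", right_on="CS_HASH", how="inner")

# Shape the result like the dictionaries appended in the loop
matches_df = joined[["HATE_SPEECH", "COUNTER_NARRATIVE", "counterSpeech"]].rename(
    columns={"counterSpeech": "Matched_counterSpeech"}
)
matches_df.insert(0, "source", "intentconan")
print(matches_df)
```

A join avoids filtering the second dataset once per row of the first, which tends to matter as the datasets grow. Repeating it for intentconanv2 and concatenating the two results would cover both sources.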

### Saving the Matches:

Once both passes are complete, the matches list is converted to a DataFrame called matches_df.
This DataFrame is then saved, without its index column, to data/optimized_matched_counterspeech.csv.