## Final results deduplication

This notebook serves two main purposes: 

1. Identify **exact duplicates** using a collection of variables appropriate for your research objectives

> For example, you may define content duplicates based on all of the text fields (e.g. 'ad_creative_body' for Facebook ads, 'ad_text' for Google ads, 'google_asr_text', 'aws_ocr_video_text', 'aws_ocr_img_text') and checksum values ('checksum', derived from the [image-video-data-preparation/code/01-get-checksum-for-deduplication.ipynb](https://github.com/Wesleyan-Media-Project/image-video-data-preparation/blob/main/code/01-get-checksum-for-deduplication.ipynb) step)



2. Label duplicate ads with a shared creative identifier (cid)

> The output tables of this notebook map each ad_id to a creative id (cid). Ads with duplicate content therefore share the same cid.


> Output table for Facebook 2022 ads:
> + cid_fb2022.csv

> Output table for Google 2022 ads:
> + cid_google2022.csv

## Generate creative-level unique IDs 
Assign a unique ID to each piece of unique creative content 

In [None]:
import pandas as pd

In [None]:
'''
Load your final ad_id level "text table" (output from 01-merge-results/01_merge_preprocessed_results/). 

for Facebook 2022 ads: 
fb_2022_adid_text.csv.gz

for Google 2022 ads:
g2022_adid_01062021_11082022_text.csv

'''

df = pd.read_csv('my-final-text-table.csv')

If you include checksum values to identify duplicate content, merge the `checksum` column for images and videos (produced by [image-video-data-preparation/code/01-get-checksum-for-deduplication.ipynb](https://github.com/Wesleyan-Media-Project/image-video-data-preparation/blob/main/code/01-get-checksum-for-deduplication.ipynb)) into `df` on the `filename` column.
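
The merge described above could look roughly like the sketch below. The two small tables and the name `checksums` are made up for illustration; in practice `df` is the text table loaded earlier and the checksum table comes from the checksum notebook, and this assumes both carry a `filename` column as stated above.

```python
import pandas as pd

# Toy stand-in for the ad_id-level text table (hypothetical values).
df_demo = pd.DataFrame({
    'ad_id': ['a1', 'a2', 'a3'],
    'filename': ['f1.mp4', 'f2.jpg', 'f3.jpg'],
    'ad_text': ['vote', 'vote', 'donate'],
})

# Hypothetical checksum table: one row per media file.
checksums = pd.DataFrame({
    'filename': ['f1.mp4', 'f2.jpg'],
    'checksum': ['abc123', 'def456'],
})

# A left merge keeps every ad, including ads without a matching media file
# (their checksum becomes NaN). validate='many_to_one' raises an error if a
# filename appears more than once in the checksum table.
df_demo = df_demo.merge(checksums, on='filename', how='left',
                        validate='many_to_one')
```

A left join is used here so that ads without media are not dropped before deduplication; whether that is the right choice depends on your research objectives.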


In [None]:
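'''
Content deduplication fields.

Both lists below are assigned to the same variable, so keep only the
assignment for the platform you are processing (if both run, the Facebook
list overwrites the Google list).
'''

# Google ads
columns_for_dedup = ['ad_title', 'ad_text', 'google_asr_text', 'aws_ocr_video_text', 'aws_ocr_img_text', 'checksum']

# Facebook ads
# Note: 'ad_id' is part of this list. If ad_id is unique per row, every row
# forms its own group, so drop it to group ads by content alone.
columns_for_dedup = ['ad_id', 'page_name', 'disclaimer', 'ad_creative_body', 'ad_creative_link_caption', 
                     'ad_creative_link_title', 'ad_creative_link_description', 
                     'google_asr_text', 'aws_ocr_text_img', 'aws_ocr_text_vid', 'checksum']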
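
Because column names differ between platforms and tables, it may help to confirm that every deduplication field exists before grouping, since `groupby` raises a `KeyError` for an unknown column. A minimal sketch, using a made-up one-row table and a hypothetical `wanted` list:

```python
import pandas as pd

# Toy table standing in for df (hypothetical values).
df_demo = pd.DataFrame({
    'ad_id': ['a1'],
    'ad_text': ['vote'],
    'aws_ocr_img_text': ['VOTE'],
})

# Hypothetical list of fields we would like to deduplicate on.
wanted = ['ad_text', 'aws_ocr_text_img', 'checksum']

# Collect the fields that are absent from the table, in list order.
missing = [c for c in wanted if c not in df_demo.columns]
if missing:
    print('Columns not found in the table:', missing)
```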

In [None]:
'''
Assign a group index (which becomes the creative-level unique ID) to every row:
rows with identical values across columns_for_dedup get the same index.

'ngroups' is a pandas Series that records the group index of every row.
dropna=False treats missing values as a valid key, so rows with empty fields are still grouped.
'''

ngroups = df.groupby(by = columns_for_dedup, dropna=False, as_index=False).ngroup()
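
To make the grouping step concrete, here is a small sketch on a made-up table. The `toy` frame, its values, and the two-column key are illustrative only and are not project data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ad_id': ['a1', 'a2', 'a3', 'a4'],
    'ad_text': ['vote', 'vote', 'donate', 'vote'],
    'checksum': ['x', 'x', np.nan, 'y'],
})

# Rows a1 and a2 agree on both key columns, so they receive the same index.
# a4 has the same text but a different checksum, so it gets its own index.
# a3 has a missing checksum; dropna=False still assigns it to a group.
toy_groups = toy.groupby(by=['ad_text', 'checksum'], dropna=False).ngroup()

toy['wmp_creative_id'] = 'cid_' + toy_groups.astype(str)
```

The group numbers themselves depend on the sort order of the key values, so a cid is only meaningful within one run over one table.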

In [None]:
'''
Prefix the group indices to generate unique IDs as text strings: cid_0, cid_1, cid_2, etc.
(group indices start at 0)

Name the variable accordingly (we named it wmp_creative_id)
'''

df.loc[:, 'wmp_creative_id'] = ['cid_' + str(i) for i in ngroups]

## Save ad_id to cid (creative id) mapping files

**ad_id to cid mapping**: ad_ids that share the same cid (creative id) have duplicate content.

In [None]:
adid_cid_mapping = pd.DataFrame({'ad_id': df['ad_id'],
             'wmp_creative_id': df['wmp_creative_id']})
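
Before saving, a quick look at how many ads share each cid can confirm that the deduplication behaved as expected. This is a sketch on a made-up mapping; `mapping_demo`, `ads_per_cid`, `dup_cids`, and `one_cid_per_ad` are hypothetical names, and with real data you would run the same calls on `adid_cid_mapping`.

```python
import pandas as pd

mapping_demo = pd.DataFrame({
    'ad_id': ['a1', 'a2', 'a3', 'a4'],
    'wmp_creative_id': ['cid_1', 'cid_1', 'cid_0', 'cid_2'],
})

# Number of ads attached to each creative id.
ads_per_cid = mapping_demo['wmp_creative_id'].value_counts()

# Creative ids shared by more than one ad, i.e. duplicated content.
dup_cids = ads_per_cid[ads_per_cid > 1].index.tolist()

# Sanity check: every ad_id should map to exactly one creative id.
one_cid_per_ad = (
    mapping_demo.groupby('ad_id')['wmp_creative_id'].nunique().eq(1).all()
)
```

If no cid is shared by more than one ad, it is worth rechecking whether a per-ad identifier ended up among the deduplication fields.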


In [8]:
'''
Preview ad_id to cid mapping file
'''
adid_cid_mapping.head()

| | ad_id | wmp_creative_id |
|---|---|---|
| 0 | CR00000257354440376321 | cid_3604 |
| 1 | CR00001130641550737409 | cid_26959 |
| 2 | CR00001435481149538305 | cid_44482 |
| 3 | CR00001915967730876417 | cid_63037 |
| 4 | CR00002202734107295745 | cid_50901 |


In [None]:
'''
Set the output file name before saving. For example, we named our output data

'cid_fb2022.csv'

'cid_google2022.csv'
'''

OUTFILE_NAME = ''

adid_cid_mapping.to_csv(OUTFILE_NAME, index=False)
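
When the mapping file is loaded again later, reading the identifiers as strings avoids any numeric reinterpretation of ad_ids that consist only of digits. The sketch below writes a made-up mapping to a temporary directory and reads it back; the file name `cid_demo.csv` and the values are hypothetical.

```python
import os
import tempfile

import pandas as pd

mapping_demo = pd.DataFrame({
    'ad_id': ['0012345', '0067890'],
    'wmp_creative_id': ['cid_0', 'cid_1'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'cid_demo.csv')

    # index=False keeps the row index out of the file, so no extra
    # unnamed column appears when the file is read back.
    mapping_demo.to_csv(out_path, index=False)

    # dtype=str keeps identifiers exactly as written, including leading zeros.
    reloaded = pd.read_csv(out_path, dtype=str)
```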