## Final results deduplication

Drop **exact duplicates** using a set of variables appropriate for your research objectives.

For example, you may drop content duplicates based on all of the text fields ('ad_type', 'ad_title', 'ad_text', 'google_asr_text', 'aws_ocr_video_text', 'aws_ocr_img_text') together with the checksum values ('checksum', derived from the [image and video data preparation](https://github.com/Wesleyan-Media-Project/image-video-data-preparation/blob/main/code/01-get-checksum-for-deduplication.ipynb) step).
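
As a minimal sketch of the dropping step itself, assuming a pandas workflow like the one used later in this notebook: the toy table, its values, and the names `toy`, `dedup_columns`, and `toy_dedup` are made up for illustration and use only a subset of the columns listed above.

```python
import pandas as pd

# Toy table with made-up values; real tables would carry all the
# text fields and the checksum column described above.
toy = pd.DataFrame({
    "ad_title": ["Vote now", "Vote now", "Vote early"],
    "ad_text": ["Polls open at 7", "Polls open at 7", "Polls open at 7"],
    "checksum": ["abc123", "abc123", "def456"],
})

# Hypothetical subset of the deduplication columns
dedup_columns = ["ad_title", "ad_text", "checksum"]

# Keep the first row from every set of rows that match exactly
# on all of dedup_columns, then renumber the index.
toy_dedup = toy.drop_duplicates(subset=dedup_columns, keep="first").reset_index(drop=True)

print(len(toy), len(toy_dedup))
```

Whether to keep the first or the last occurrence (`keep="first"` vs. `keep="last"`) depends on which row-level metadata you want to retain for each creative.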

## Generate creative-level unique IDs

Assign a unique ID to each piece of unique creative content.

In [None]:
import pandas as pd

In [None]:
'''
Load your final results table
'''

df = pd.read_csv('my-final-table.csv')

In [None]:
columns_for_dedup = ['ad_type', 'ad_title', 'ad_text', 'google_asr_text', 'aws_ocr_video_text', 'aws_ocr_img_text', 'checksum']

In [None]:
'''
Assign a group index to every row based on the combination of values in columns_for_dedup.
These group indices will become our creative-level unique IDs.

In other words, rows that are duplicates of each other get the same group index.
The variable 'ngroups' is a pandas Series that records the group index for every row.
dropna=False treats missing values as valid keys, so rows with missing values still get a group index.
'''

ngroups = df.groupby(by=columns_for_dedup, dropna=False).ngroup()
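
To make the grouping behavior concrete, here is a small sketch on a toy table with made-up values (the names `toy` and `toy_groups` are illustrative, not part of the pipeline). It shows that rows with identical key values share a group index and that, with `dropna=False`, rows whose keys contain missing values are grouped as well.

```python
import numpy as np
import pandas as pd

# Toy table: rows 0 and 2 are duplicates, rows 3 and 4 are duplicates
# that have a missing value in one of the key columns.
toy = pd.DataFrame({
    "ad_text": ["b", "a", "b", np.nan, np.nan],
    "checksum": ["x", "y", "x", "z", "z"],
})

# One group index per row; identical key combinations share an index.
toy_groups = toy.groupby(by=["ad_text", "checksum"], dropna=False).ngroup()

print(toy_groups.tolist())
print(toy_groups.nunique())  # number of distinct key combinations
```

Note that group numbers depend on the set of key values present in the table, so the same creative is not guaranteed to receive the same index if the table changes.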

In [None]:
'''
Prefix the group indices to generate unique IDs as text strings: cid_0, cid_1, cid_2, etc.
(ngroup() numbers the groups starting from 0.)

Name the variable accordingly (we named it wmp_creative_id).
'''

df.loc[:, 'wmp_creative_id'] = ['cid_' + str(i) for i in ngroups]
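
As an optional sanity check, the sketch below repeats the ID assignment on a toy table and then confirms that each ID maps to exactly one combination of key values. The `ad_id` column, the values, and the names `toy`, `key_columns`, `combos_per_id`, and `creatives` are hypothetical and only serve the illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "ad_id": ["ad_1", "ad_2", "ad_3"],  # hypothetical ad-level identifier
    "ad_text": ["Polls open at 7", "Polls open at 7", "Vote early"],
    "checksum": ["abc123", "abc123", "def456"],
})
key_columns = ["ad_text", "checksum"]

# Same technique as above: group index -> prefixed string ID.
group_index = toy.groupby(by=key_columns, dropna=False).ngroup()
toy["wmp_creative_id"] = "cid_" + group_index.astype(str)

# Each creative ID should correspond to exactly one key combination.
combos_per_id = (
    toy.drop_duplicates(subset=key_columns)
       .groupby("wmp_creative_id")
       .size()
)
assert (combos_per_id == 1).all()

# Creative-level table: one row per unique creative.
creatives = toy.drop_duplicates(subset="wmp_creative_id").reset_index(drop=True)

print(toy["wmp_creative_id"].tolist())  # ['cid_0', 'cid_0', 'cid_1']
```

Keeping the ad-level table with the ID column attached, and deriving the creative-level table from it, lets you map results back to individual ads later.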