## **Handling of Near-Duplicates**

In social media data analysis, particularly when using Twitter for trend discovery related to additive manufacturing (AM), handling redundant content is critical. Tweets are frequently duplicated, paraphrased, or retweeted with slight modifications. These near-duplicate posts can skew trend detection and inflate topic frequency metrics, ultimately introducing bias into large language model (LLM) responses. Furthermore, when LLM APIs are used in a paid setting, submitting repetitive data leads to unnecessary cost and processing overhead.

To mitigate this, a near-duplicate detection pipeline is implemented that retains a single representative tweet per cluster of near-duplicates. This approach preserves **semantic diversity**, reduces **computational cost**, and improves the **accuracy of trend extraction** from tweet corpora.

---

### **Theoretical Background**

#### **1. Jaccard Similarity**

Jaccard similarity is a measure of the overlap between two sets, defined as:
$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
$$

It is widely used to measure document similarity, particularly with bag-of-words or other token-based representations [Manning, Raghavan, & Schütze, 2008].

> 🔹 **Key reference**:  
> Manning, C. D., Raghavan, P., & Schütze, H. (2008). *Introduction to Information Retrieval*. Cambridge University Press.
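
As a small worked example of the definition, the sketch below computes the exact Jaccard similarity of two whitespace-tokenized tweets. The helper names `whitespace_tokens` and `jaccard` are illustrative and are not taken from the project's utilities; the second tweet is a made-up retweet of the first.

```python
def whitespace_tokens(text):
    """Lowercase the text and split on whitespace, returning a set of tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Exact Jaccard similarity |A ∩ B| / |A ∪ B|; defined here as 0.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


original = whitespace_tokens("who gave my little brother a 3d printer")
retweet = whitespace_tokens("RT who gave my little brother a 3d printer")  # hypothetical retweet

# 8 shared tokens out of 9 distinct tokens
print(round(jaccard(original, retweet), 3))  # 0.889
```

With a threshold of 0.7, this pair would count as a near-duplicate.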

#### **2. MinHashing**

MinHash is a probabilistic technique for estimating Jaccard similarity between sets efficiently. Rather than computing the actual intersection and union of sets (which can be costly for large corpora), MinHash represents each set with a fixed-length signature generated through hash functions. The similarity between two MinHash signatures approximates the Jaccard similarity between the original sets.

MinHash was introduced by **Broder (1997)** at AltaVista for detecting near-duplicate web pages, and later refined in [Broder et al., 2000].

> 🔹 **Key reference**:  
> Broder, A. Z. (1997). *On the resemblance and containment of documents*. Proceedings. Compression and Complexity of SEQUENCES 1997.  
> Broder, A. Z., Charikar, M., Frieze, A. M., & Mitzenmacher, M. (2000). *Min-wise independent permutations*. Journal of Computer and System Sciences, 60(3), 630–659.
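
To illustrate the idea, here is a minimal standard-library sketch of MinHash. It simulates each permutation by salting a 64-bit BLAKE2b hash with the permutation index, which is a simplifying assumption and not how `datasketch` builds its signatures. The function names are hypothetical.

```python
import hashlib


def minhash_signature(tokens, num_perm=128):
    """Return num_perm minimum hash values for a non-empty set of string tokens."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # one salted hash function per "permutation"
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(tok.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for tok in tokens
            )
        )
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions on which two equal-length signatures agree."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    return matches / len(sig_a)


a = set("who gave my little brother a 3d printer".split())
b = set("rt who gave my little brother a 3d printer".split())

exact = len(a & b) / len(a | b)
approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The estimate is unbiased, and its standard error shrinks roughly as one over the square root of the number of permutations, which is why more permutations trade speed for accuracy.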

#### **3. Locality Sensitive Hashing (LSH)**

LSH is a method for indexing high-dimensional data that allows similar items to be hashed into the same "buckets" with high probability. When combined with MinHash, LSH enables **sub-linear time** similarity search over large text corpora [Indyk & Motwani, 1998].

> 🔹 **Key reference**:  
> Indyk, P., & Motwani, R. (1998). *Approximate nearest neighbors: Towards removing the curse of dimensionality*. Proceedings of the thirtieth annual ACM symposium on Theory of computing.

LSH is particularly effective in high-volume tweet streams, where full pairwise similarity comparisons are computationally infeasible.
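
The banding scheme commonly used with MinHash can be sketched as follows. A signature of length `bands * rows` is cut into bands, and two items become candidates if they agree on every row of at least one band. The band and row counts below are illustrative assumptions, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_probability(similarity, bands, rows):
    """Probability that two sets with the given Jaccard similarity share at least one band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def candidate_pairs(signatures, bands, rows):
    """Return pairs of item ids whose signatures agree on all rows of some band.

    `signatures` maps an item id to a signature of length bands * rows.
    """
    buckets = defaultdict(set)
    for item_id, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(item_id)

    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


# Toy signatures: t1 and t2 agree on the first band only, t3 matches nothing.
toy = {"t1": [1, 2, 3, 4], "t2": [1, 2, 9, 9], "t3": [7, 7, 7, 7]}
print(candidate_pairs(toy, bands=2, rows=2))  # {('t1', 't2')}

# The S-curve: higher similarity means a higher chance of becoming a candidate.
for s in (0.5, 0.7, 0.9):
    print(s, round(candidate_probability(s, bands=16, rows=8), 3))
```

Only items that share a bucket are compared, which is what avoids the full pairwise scan.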

#### **4. Applications in NLP and Social Media Analytics**

LSH-based similarity search, including its MinHash variant, has been applied in:
- First-story detection in tweet streams [Petrovic et al., 2010]
- Scalable document clustering [Rajaraman & Ullman, 2011]
- Spam detection and topic tracking [Leskovec et al., 2014]

> 🔹 **Further references**:  
> Petrovic, S., Osborne, M., & Lavrenko, V. (2010). *Streaming first story detection with application to Twitter*. NAACL.  
> Rajaraman, A., & Ullman, J. D. (2011). *Mining of Massive Datasets*. Cambridge University Press.  
> Leskovec, J., Rajaraman, A., & Ullman, J. D. (2014). *Mining of Massive Datasets* (2nd ed.). Cambridge University Press.

---

### **Implementation Overview**

The deduplication process applied in this work consists of the following steps:

1. **Tokenization**  
   Each tweet is converted to lowercase and tokenized into a set of words using whitespace separation.

2. **Signature Generation (MinHash)**  
   A MinHash signature is created for each token set using multiple permutations (e.g., 128 or 256) to balance accuracy and speed.

3. **Similarity Indexing (LSH)**  
   The MinHash signatures are inserted into an LSH index that groups similar tweets by approximate Jaccard similarity.

4. **Duplicate Group Extraction**  
   Near-duplicate pairs are identified based on a similarity threshold (e.g., Jaccard ≥ 0.7) and grouped into connected components (clusters).

5. **Representative Tweet Selection**  
   Within each cluster, only the tweet with the highest engagement (likes, retweets, replies) is retained.

6. **Network Visualization**  
   A graph is constructed where nodes represent tweets and edges represent duplicate relationships. Nodes are colored by cluster and annotated with their dataset index for reference.
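
Steps 4 and 5 can be sketched with the standard library alone. This is an illustrative assumption about the logic, not the code of `remove_duplicates_keep_highest_engagement`: pairs are merged into connected components with a union-find structure, and every tweet except the most-engaged one in each component is marked for removal. The pairs are taken from the sample output in this notebook; engagement values other than those of rows 0 and 3 are made up.

```python
def group_duplicates(pairs):
    """Return the connected components of an undirected pair list as sorted lists."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(c) for c in clusters.values())


def rows_to_drop(pairs, engagement):
    """Indices to remove so that each cluster keeps only its most-engaged tweet."""
    drop = set()
    for cluster in group_duplicates(pairs):
        # On ties, max() keeps the lowest index because the cluster is sorted.
        keep = max(cluster, key=lambda idx: engagement[idx])
        drop.update(idx for idx in cluster if idx != keep)
    return drop


pairs = [(0, 56627), (0, 91512), (3, 167), (3, 257)]
engagement = {0: 439944, 56627: 12, 91512: 3, 3: 283017, 167: 40, 257: 7}  # partly hypothetical

print(group_duplicates(pairs))  # [[0, 56627, 91512], [3, 167, 257]]
print(sorted(rows_to_drop(pairs, engagement)))  # [167, 257, 56627, 91512]
```

Because clusters are connected components, two tweets can end up in the same cluster through a chain of pairs even if their own similarity is below the threshold.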

---

### **Purpose and Benefits**

This method ensures that:
- **Only unique, high-quality tweets** are passed to the LLM.
- **Topic models and trend analyses** are not skewed by repetition.
- **API cost is minimized** by reducing redundant input.

By integrating this preprocessing step, the pipeline achieves **scalability**, **semantic efficiency**, and **economic viability** for trend analysis tasks in AM-related discourse on social media.


In [3]:
import gc
gc.collect()

0

In [4]:
from datasketch import MinHash, MinHashLSH
from tqdm.notebook import tqdm
import pandas as pd
import re
from am_sma_data_cleaning.utils import (
    remove_spammers,
    tokenize,
    find_near_duplicates_parallel,
    remove_duplicates_keep_highest_engagement,
    plot_simplified_duplicate_network,
)
pd.set_option('display.max_colwidth', None)

In [5]:
df = pd.read_csv("CLEANED_data_pre_near_duplicates_handled_20250530_204510.csv")
df

Unnamed: 0,Author ID,Total Engagement,Date,tweet_id,row_num,Normalized Text,langdetect_is_english
0,@Bill_Gross,439944,2017-10-17 21:48:10+00:00,@Bill_Gross_2017-10-17T21:48:10.000Z,1399480,"in the ""i'm getting old"" department.., a kid saw this and said, ""oh, you 3d-printed the 'save' icon.""",True
1,@rustbeltlady,383384,2019-03-08 04:41:13+00:00,@rustbeltlady_2019-03-08T04:41:13.000Z,2703961,who gave my little brother a 3d printer,True
2,@McJesse,348608,2021-12-31 00:34:14+00:00,@McJesse_2021-12-31T00:34:14.000Z,2887953,"got a 3d printer for christmas, realized i can use it to print any new year’s glasses i want.",True
3,@olivelorraine_,283017,2021-08-15 20:47:32+00:00,@olivelorraine__2021-08-15T20:47:32.000Z,2963114,the vagina is the original 3d printer,True
4,@rveenewman,213595,2019-01-09 13:25:07+00:00,@rveenewman_2019-01-09T13:25:07.000Z,2447524,a 3d printed light projected animation. proof that there's always new ways to animate everything. #3dprint #animation,True
...,...,...,...,...,...,...,...
3922033,@BramKnaapen,0,2013-11-03 21:51:10+00:00,@BramKnaapen_2013-11-03T21:51:10.000Z,5849490,customizable 3d printed titanium glasses that look good; 3d printing is moving beyond gadgetry : http://,True
3922034,@SybilCollas,0,2014-01-17 15:31:11+00:00,@SybilCollas_2014-01-17T15:31:11.000Z,5178014,"customizable 3d printed tabletop miniatures. oh money, where are you when i need you so? http:// via @kickstarter",True
3922035,@davebower,0,2014-01-16 20:40:39+00:00,@davebower_2014-01-16T20:40:39.000Z,5383868,customizable 3d printed tabletop miniatures. hmmmm. http:// orge/customizable-3d-printed-tabletop-miniatures ...,True
3922036,@SethRichard,0,2014-02-16 05:21:51+00:00,@SethRichard_2014-02-16T05:21:51.000Z,5550420,"customizable 3d printed tabletop miniatures, via @kickstarter only 46 hours left to back it https:// orge/customizable-3d-printed-tabletop-miniatures ...",True


### Find Near Duplicates

In [6]:
BATCH_SIZE = 100_000  # rows per batch
all_duplicates = []

num_batches = (len(df) + BATCH_SIZE - 1) // BATCH_SIZE  # ceiling division

# Loop over batches. Detection runs within each batch only, so near-duplicates
# that land in different batches are not paired.
for batch_idx, i in enumerate(range(0, len(df), BATCH_SIZE), start=1):
    # iloc slicing keeps the original index labels of df
    batch_df = df.iloc[i:i + BATCH_SIZE].copy()
    print(f"Processing batch {batch_idx} of {num_batches}: rows {i} to {i + len(batch_df) - 1}")

    # Run duplicate detection on the batch
    batch_duplicates = find_near_duplicates_parallel(batch_df, text_column="Normalized Text")

    # Append batch result
    all_duplicates.append(batch_duplicates)

# Combine all duplicates
duplicates_df = pd.concat(all_duplicates, ignore_index=True)

Processing batch 1 of 40: rows 0 to 99999
Building MinHash signatures in parallel...
Inserting MinHash into LSH...
Finding near-duplicate pairs...
Number of near-duplicate pairs found: 10346
Processing batch 2 of 40: rows 100000 to 199999
Number of near-duplicate pairs found: 7560
Processing batch 3 of 40: rows 200000 to 299999
Number of near-duplicate pairs found: 5294
Processing batch 4 of 40: rows 300000 to 399999
Number of near-duplicate pairs found: 4823
Processing batch 5 of 40: rows 400000 to 499999
Number of near-duplicate pairs found: 4097
Processing batch 6 of 40: rows 500000 to 599999
Number of near-duplicate pairs found: 5328
Processing batch 7 of 40: rows 600000 to 699999
Number of near-duplicate pairs found: 8833
Processing batch 8 of 40: rows 700000 to 799999
Number of near-duplicate pairs found: 12160
Processing batch 9 of 40: rows 800000 to 899999
Number of near-duplicate pairs found: 11583
Processing batch 10 of 40: rows 900000 to 999999
Number of near-duplicate pairs found: 17705
Processing batch 11 of 40: rows 1000000 to 1099999
Number of near-duplicate pairs found: 480899
Processing batch 12 of 40: rows 1100000 to 1199999
Number of near-duplicate pairs found: 203626
Processing batch 13 of 40: rows 1200000 to 1299999
Number of near-duplicate pairs found: 335881
Processing batch 14 of 40: rows 1300000 to 1399999
Number of near-duplicate pairs found: 187246
Processing batch 15 of 40: rows 1400000 to 1499999
Number of near-duplicate pairs found: 317376
Processing batch 16 of 40: rows 1500000 to 1599999
Number of near-duplicate pairs found: 351359
Processing batch 17 of 40: rows 1600000 to 1699999
Number of near-duplicate pairs found: 347586
Processing batch 18 of 40: rows 1700000 to 1799999
Number of near-duplicate pairs found: 842850
Processing batch 19 of 40: rows 1800000 to 1899999
Number of near-duplicate pairs found: 199969
Processing batch 20 of 40: rows 1900000 to 1999999
Number of near-duplicate pairs found: 203018
Processing batch 21 of 40: rows 2000000 to 2099999
Number of near-duplicate pairs found: 292597
Processing batch 22 of 40: rows 2100000 to 2199999
Number of near-duplicate pairs found: 1198257
Processing batch 23 of 40: rows 2200000 to 2299999
Number of near-duplicate pairs found: 222938
Processing batch 24 of 40: rows 2300000 to 2399999
Number of near-duplicate pairs found: 300914
Processing batch 25 of 40: rows 2400000 to 2499999
Number of near-duplicate pairs found: 289244
Processing batch 26 of 40: rows 2500000 to 2599999
Number of near-duplicate pairs found: 265092
Processing batch 27 of 40: rows 2600000 to 2699999
Number of near-duplicate pairs found: 253698
Processing batch 28 of 40: rows 2700000 to 2799999
Number of near-duplicate pairs found: 112710
Processing batch 29 of 40: rows 2800000 to 2899999
Number of near-duplicate pairs found: 79258
Processing batch 30 of 40: rows 2900000 to 2999999
Number of near-duplicate pairs found: 1274286
Processing batch 31 of 40: rows 3000000 to 3099999
Number of near-duplicate pairs found: 439001
Processing batch 32 of 40: rows 3100000 to 3199999
Number of near-duplicate pairs found: 224156
Processing batch 33 of 40: rows 3200000 to 3299999
Number of near-duplicate pairs found: 215610
Processing batch 34 of 40: rows 3300000 to 3399999
Number of near-duplicate pairs found: 1143923
Processing batch 35 of 40: rows 3400000 to 3499999
Number of near-duplicate pairs found: 420183
Processing batch 36 of 40: rows 3500000 to 3599999
Number of near-duplicate pairs found: 603947
Processing batch 37 of 40: rows 3600000 to 3699999
Number of near-duplicate pairs found: 342986
Processing batch 38 of 40: rows 3700000 to 3799999
Number of near-duplicate pairs found: 215912
Processing batch 39 of 40: rows 3800000 to 3899999
Number of near-duplicate pairs found: 207047
Processing batch 40 of 40: rows 3900000 to 3922037
Number of near-duplicate pairs found: 42999


In [7]:
duplicates_df

Unnamed: 0,Index_1,Tweet_1,Index_2,Tweet_2,Approx_Jaccard
0,0,"in the ""i'm getting old"" department.., a kid saw this and said, ""oh, you 3d-printed the 'save' icon.""",56627,"in the “i’m getting old department....”, a kid saw this and said “oh you 3d printed the ‘ios settings’ icon.”",0.7
1,0,"in the ""i'm getting old"" department.., a kid saw this and said, ""oh, you 3d-printed the 'save' icon.""",91512,"rt @bill_gross in the ""i'm getting old"" department.., a kid saw this and said, ""oh, you 3d-printed the 'save' icon.""",0.7
2,3,the vagina is the original 3d printer,167,the uterus is the original 3d printer,0.7
3,3,the vagina is the original 3d printer,257,the uterus is the original 3d printer tbqh,0.7
4,3,the vagina is the original 3d printer,28954,the original 3d printer,0.7
...,...,...,...,...,...
11702292,3922021,customizable 3d printed tabletop miniatures by hero forge — kickstarter http:// 5252686/ ...,3922022,customizable 3d printed tabletop miniatures by hero forge — kickstarter http://,0.7
11702293,3922023,customizable 3d printed tabletop miniatures by hero forge — kickstarter hit stretch goal #5! we will http:// 5373707/ ...,3922024,customizable 3d printed tabletop miniatures by hero forge — kickstarter hit stretch goal #5! we will http:// 5373705/ ...,0.7
11702294,3922023,customizable 3d printed tabletop miniatures by hero forge — kickstarter hit stretch goal #5! we will http:// 5373707/ ...,3922025,customizable 3d printed tabletop miniatures by hero forge — kickstarter hit stretch goal #5! we will http:// 5373701/ ...,0.7
11702295,3922024,customizable 3d printed tabletop miniatures by hero forge — kickstarter hit stretch goal #5! we will http:// 5373705/ ...,3922025,customizable 3d printed tabletop miniatures by hero forge — kickstarter hit stretch goal #5! we will http:// 5373701/ ...,0.7


### Visualize

In [6]:
#plot_simplified_duplicate_network(duplicates_df, save_path="near_duplicates_network.png")

In [8]:
idx_list = [0, 59953]
cols = ['Date', 'tweet_id', 'Normalized Text']

filtered_df = df.loc[idx_list, cols]
filtered_df

Unnamed: 0,Date,tweet_id,Normalized Text
0,2017-10-17 21:48:10+00:00,@Bill_Gross_2017-10-17T21:48:10.000Z,"in the ""i'm getting old"" department.., a kid saw this and said, ""oh, you 3d-printed the 'save' icon."""
59953,2022-11-15 10:47:44+00:00,@joeltelling_2022-11-15T10:47:44.000Z,surprise! new video from @formnext_expo - this is just debuted and i rode it! cc: @stratasys bicycle of the future?


In [9]:
print(repr(df.loc[3, 'Normalized Text']))

'the vagina is the original 3d printer'


### Write Out to CSV

In [10]:
duplicates_df.to_csv("near_duplicates.csv", index=False)

### Drop the Near-duplicates

In [10]:
df_final = remove_duplicates_keep_highest_engagement(df, duplicates_df, engagement_column="Total Engagement")
df_final

Found 267312 groups of near-duplicates
Removing 1086279 duplicates


Unnamed: 0,Author ID,Total Engagement,Date,tweet_id,row_num,Normalized Text,langdetect_is_english
0,@Bill_Gross,439944,2017-10-17 21:48:10+00:00,@Bill_Gross_2017-10-17T21:48:10.000Z,1399480,"in the ""i'm getting old"" department.., a kid saw this and said, ""oh, you 3d-printed the 'save' icon.""",True
1,@rustbeltlady,383384,2019-03-08 04:41:13+00:00,@rustbeltlady_2019-03-08T04:41:13.000Z,2703961,who gave my little brother a 3d printer,True
2,@McJesse,348608,2021-12-31 00:34:14+00:00,@McJesse_2021-12-31T00:34:14.000Z,2887953,"got a 3d printer for christmas, realized i can use it to print any new year’s glasses i want.",True
3,@olivelorraine_,283017,2021-08-15 20:47:32+00:00,@olivelorraine__2021-08-15T20:47:32.000Z,2963114,the vagina is the original 3d printer,True
4,@rveenewman,213595,2019-01-09 13:25:07+00:00,@rveenewman_2019-01-09T13:25:07.000Z,2447524,a 3d printed light projected animation. proof that there's always new ways to animate everything. #3dprint #animation,True
...,...,...,...,...,...,...,...
3922032,@_MichaelShell,0,2013-05-27 22:31:17+00:00,@_MichaelShell_2013-05-27T22:31:17.000Z,5827476,customizable 3d printer that is easy to use and affordable for all. 3d print almost any object. there are no limits! http:// 50769/rigidbot-3d-printer ...,True
3922033,@BramKnaapen,0,2013-11-03 21:51:10+00:00,@BramKnaapen_2013-11-03T21:51:10.000Z,5849490,customizable 3d printed titanium glasses that look good; 3d printing is moving beyond gadgetry : http://,True
3922034,@SybilCollas,0,2014-01-17 15:31:11+00:00,@SybilCollas_2014-01-17T15:31:11.000Z,5178014,"customizable 3d printed tabletop miniatures. oh money, where are you when i need you so? http:// via @kickstarter",True
3922036,@SethRichard,0,2014-02-16 05:21:51+00:00,@SethRichard_2014-02-16T05:21:51.000Z,5550420,"customizable 3d printed tabletop miniatures, via @kickstarter only 46 hours left to back it https:// orge/customizable-3d-printed-tabletop-miniatures ...",True


In [11]:
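# Remove leftover URL fragments (literal 'http://', 'https://', and '.html');
# other path remnants after the scheme are left in place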
df_final['Normalized Text'] = (
    df_final['Normalized Text']
      .str.replace('http://',  '', regex=False)
      .str.replace('https://', '', regex=False)
      .str.replace('.html',    '', regex=False)
)
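
To make the effect of these literal replacements concrete, the following standalone sketch applies the same three substitutions to a plain string taken from the sample rows above. The helper name `strip_url_remnants` is hypothetical, and the final whitespace collapse is an addition that the cell above does not perform.

```python
def strip_url_remnants(text):
    """Remove the literal fragments 'http://', 'https://', and '.html' from a string."""
    for fragment in ("http://", "https://", ".html"):
        text = text.replace(fragment, "")
    # Extra step (not in the notebook cell): collapse runs of whitespace left behind.
    return " ".join(text.split())


sample = (
    "customizable 3d printed tabletop miniatures. hmmmm. "
    "http:// orge/customizable-3d-printed-tabletop-miniatures ..."
)
print(strip_url_remnants(sample))
# customizable 3d printed tabletop miniatures. hmmmm. orge/customizable-3d-printed-tabletop-miniatures ...
```

As the output shows, truncated path fragments survive this step, so downstream consumers should expect some URL debris in `Normalized Text`.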

In [12]:
from datetime import datetime

now = datetime.now().strftime("%Y%m%d_%H%M%S")

filename = f"CLEANED_DATASET_{now}.csv"

df_final.to_csv(filename, index=False)