blakelaw/Duplicate-Account-Finder

NLP-Driven Identification of Duplicate Accounts

Description

This project investigates several methods for detecting users who operate multiple accounts online. It is written in Python and uses pandas, NLTK, NumPy, scikit-learn, SciPy, and matplotlib.

Features

  • Data extraction and preprocessing: Extracted 400,000 Disqus comments through the Disqus Web API (API_Requests.py), normalized the JSON objects, and created new dataframes for features. Cleaned the comment data by removing HTML-like formatting tags, newline characters, and URLs.

  • Methods

    • Word Frequency: Built a corpus (the collection of all words in over 400,000 comments), found the 50 most frequently used words, and computed each user's frequency for each of those words
    • Syntax Analysis: Created syntax-based metrics, including average sentence length, frequency of long words, and percentage of sentences capitalized
    • Temporal Analysis: Extracted the hour (0-23) of each post and created a dataframe of post frequency by hour
  • Visualizations: Created dendrograms for each hierarchical clustering method; see the Dendrograms folder. (Note: The image files are very large and may be difficult to view in a browser. I recommend GIMP or other photo-editing software.)
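The comment-cleaning step listed under data extraction and preprocessing might look roughly like the sketch below. The regular expressions and the clean_comment name are assumptions made for illustration; the rules actually used in the project may differ.

```python
import html
import re

# Hypothetical patterns; the project's actual cleaning rules may differ.
TAG_RE = re.compile(r"<[^>]+>")                # HTML-like tags such as <p> or <br>
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # bare links
WS_RE = re.compile(r"\s+")                     # newlines, tabs, repeated spaces


def clean_comment(raw):
    """Strip tags, URLs, and newline characters from one raw comment."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)      # turn entities such as &amp; back into characters
    text = URL_RE.sub(" ", text)
    return WS_RE.sub(" ", text).strip()
```

Tags are removed before entities are unescaped so that text a user typed literally (for example an escaped angle bracket) is not mistaken for markup.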

Usage

API_Requests.py - Extracts Disqus comments and saves them in Feather file format

Stylometry.py - Initial data preprocessing and analysis: finds the 50 most frequent words and their distributions
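As a rough illustration of the word-frequency idea, the sketch below counts words across all users, picks the most common ones, and computes each user's relative frequency for those words. It uses a simple regex tokenizer rather than NLTK, and the function names and input layout (a dict of user to list of comments) are assumptions, not the script's actual interface.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z']+")  # simplistic tokenizer, assumed for this sketch


def tokenize(text):
    return TOKEN_RE.findall(text.lower())


def top_words(comments_by_user, n=50):
    """Return the n most common words across every user's comments."""
    totals = Counter()
    for comments in comments_by_user.values():
        for comment in comments:
            totals.update(tokenize(comment))
    return [word for word, _ in totals.most_common(n)]


def user_word_frequencies(comments_by_user, vocabulary):
    """Map each user to the relative frequency of each vocabulary word."""
    table = {}
    for user, comments in comments_by_user.items():
        counts = Counter(t for c in comments for t in tokenize(c))
        total = sum(counts.values()) or 1  # avoid dividing by zero for empty users
        table[user] = [counts[w] / total for w in vocabulary]
    return table
```

Each user then becomes a fixed-length vector, which is the kind of input a clustering step can compare.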

Syntax_Analysis.py - Builds on the previous script's output, adding new metrics such as capitalization frequency, sentence length, and use of long words
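The three syntax metrics could be computed along the lines of the sketch below. The sentence splitter, the 7-character threshold for a "long" word, and the syntax_metrics name are all assumptions chosen for illustration; the script may define them differently.

```python
import re

SENTENCE_RE = re.compile(r"[^.!?]+[.!?]*")  # naive split on ., !, ?
WORD_RE = re.compile(r"[A-Za-z']+")


def syntax_metrics(text, long_word_len=7):
    """Return average sentence length, long-word rate, and capitalized-sentence rate."""
    sentences = [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]
    words = WORD_RE.findall(text)
    if not sentences or not words:
        return {"avg_sentence_length": 0.0, "long_word_rate": 0.0, "capitalized_rate": 0.0}
    capitalized = sum(1 for s in sentences if s[0].isupper())
    long_words = sum(1 for w in words if len(w) >= long_word_len)
    return {
        "avg_sentence_length": len(words) / len(sentences),  # words per sentence
        "long_word_rate": long_words / len(words),
        "capitalized_rate": capitalized / len(sentences),
    }
```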

Time_Analysis.py - Extracts statistics about post frequency for each user
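A per-user posting profile by hour can be sketched as follows. The input format (pairs of user name and ISO-8601 timestamp string) and the hourly_profile name are assumptions; the real script works on dataframes and may parse timestamps differently.

```python
from collections import Counter, defaultdict
from datetime import datetime


def hourly_profile(posts):
    """posts: iterable of (user, ISO-8601 timestamp string) pairs.

    Returns {user: list of 24 relative frequencies, indexed by hour 0-23}.
    """
    counts = defaultdict(Counter)
    for user, stamp in posts:
        counts[user][datetime.fromisoformat(stamp).hour] += 1
    profiles = {}
    for user, by_hour in counts.items():
        total = sum(by_hour.values())
        profiles[user] = [by_hour[h] / total for h in range(24)]
    return profiles
```

Normalizing by each user's total keeps very active and occasional posters comparable.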

Dendrograms.py - Hierarchical cluster analysis of the data to identify duplicate accounts
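One way to surface candidate duplicate pairs from a hierarchical clustering is sketched below with SciPy's linkage function. The closest_merges helper, the average-linkage method, and the Euclidean metric are assumptions for illustration; this is not the project's process_linkage function, and the script may use other settings.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage


def closest_merges(features, labels, k=15, method="average", metric="euclidean"):
    """Return the k lowest-distance merges that join two single accounts."""
    Z = linkage(np.asarray(features, dtype=float), method=method, metric=metric)
    n = len(labels)
    pairs = []
    for left, right, dist, _ in Z:
        left, right = int(left), int(right)
        # Indices below n are original observations; larger ones are merged clusters.
        if left < n and right < n:
            pairs.append((labels[left], labels[right], float(dist)))
    pairs.sort(key=lambda p: p[2])
    return pairs[:k]
```

The same linkage matrix can be passed to scipy.cluster.hierarchy.dendrogram to draw the tree.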

Dataframes - Contains all dataframes created or used in the above files. (Note: This does not include the raw data, which was too expensive to upload. I'm happy to share it with anyone who requests it.)

  • message_time - Raw data for the time of each message
  • time_data - Posting frequency for every hour for each user
  • unfiltered_corpus - Raw corpus for each user along with three statistics
  • corpus - Cleaned-up corpus (i.e., HTML tags and links removed) for each user along with three style-based statistics
  • Top50 - Top 50 word frequency for each user
  • df_time_words - Time and word frequencies for 266 active users
  • df_time_words_expanded - Time and word frequencies for 521 users
  • metrics_top25 - Style-based statistics and word frequency for each user

Dendrograms - Contains .png files of each dendrogram created in Dendrograms.py. I recommend viewing them in GIMP or other photo-editing software, as the files are quite large and might slow your browser if viewed directly.

Results

Method | Word Frequency | Syntax Analysis | Temporal Analysis | Accuracy¹
C1     |                |                 |                   | 93%
C2²    |                |                 |                   |
C3     |                |                 |                   | 73%
C4     |                |                 |                   | 73%
C5     |                |                 |                   | 67%

1 - Lower bound on the true positive identification rate: for each clustering, take the 15 closest matches and compute the proportion that are known duplicate accounts. The true accuracy could be higher if an unconfirmed match were later revealed to be the same person.

2 - Dataset filtered to include only active users

Sample Calculation: Running process_linkage(C1, LB1).head(15) in Dendrograms.py yields the following table. It has been annotated to denote known alternate accounts:

In this chart, blue represents a match, while orange represents no evidence of a match. Where a pair shows two users with exactly the same name (e.g., BlueDot and BlueDot), these are two distinct accounts: the user created a new account, and its word frequencies were similar to those of the previous one. Thus, the lower bound for accuracy in this method is $\frac{14}{15} \approx 0.93$, or 93%. However, it is possible that Xzplorer is indeed Heebs13, in which case the accuracy at this threshold is even higher.
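The lower-bound calculation can be reproduced with a few lines of Python. The lower_bound_accuracy helper and the placeholder account names are hypothetical; only the 14-of-15 ratio comes from the sample above.

```python
def lower_bound_accuracy(matches, known_duplicates):
    """Share of reported pairs that are confirmed duplicate accounts."""
    if not matches:
        raise ValueError("matches must not be empty")
    known = {frozenset(pair) for pair in known_duplicates}  # order-insensitive lookup
    confirmed = sum(1 for pair in matches if frozenset(pair) in known)
    return confirmed / len(matches)


# Placeholder names: 15 reported pairs, 14 of them confirmed.
matches = [(f"user{i}", f"alt{i}") for i in range(15)]
known = matches[:14]
print(round(lower_bound_accuracy(matches, known), 2))  # prints 0.93
```

Any unconfirmed pair that later proves to be the same person would only raise this figure, which is why it is a lower bound.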

Sample Visualizations

dendrogram1.png zoomed in. Colored components indicate matches


Word cloud of the top 50 words in PredictIt.org community comments, from corpus

About

Using natural language processing to identify duplicate accounts online
