This project investigates several methods for detecting users operating multiple accounts online. It is written in Python and uses pandas, NLTK, NumPy, scikit-learn, SciPy, and matplotlib throughout.
- Data extraction and preprocessing: Extracted 400,000 Disqus comments through the Disqus Web API (`API_Requests.py`). Normalized the JSON objects, created new dataframes for features, and cleaned the comment text by removing HTML-like formatting tags, newline characters, and URLs.
- Methods
  - Word Frequency: Built a corpus (the collection of all words across the 400,000+ comments), found the top 50 most frequently used words, and computed each user's frequency for each word
  - Syntax Analysis: Created syntax-based metrics, including average sentence length, frequency of long words, and percentage of capitalized sentences
  - Temporal Analysis: Extracted the hour (0-23) of each post and created a dataframe of post frequency by hour
- Visualizations: Created dendrograms for each hierarchical clustering method; see the `Dendrograms` folder. (Note: the image files are very large and may be difficult to view in a browser. I recommend GIMP or other photo-editing software.)
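The cleaning step above can be sketched as follows. This is a minimal illustration, not the exact code in the project scripts; the column name and regular expressions are assumptions.

```python
import re

import pandas as pd

# Hypothetical sketch of the comment-cleaning step: strip HTML-like tags,
# URLs, and newline characters, then collapse the leftover whitespace.
def clean_comment(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)       # remove HTML-like formatting tags
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = text.replace("\n", " ")             # remove newline characters
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

# Stand-in data; the real raw comments come from the Disqus Web API.
df = pd.DataFrame({"message": ["<p>See https://example.com</p>\nGreat post!"]})
df["clean"] = df["message"].map(clean_comment)
```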
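The three feature families can be sketched together on stand-in data. The function and variable names here are illustrative assumptions, not the actual code in `Stylometry.py`, `Syntax_Analysis.py`, or `Time_Analysis.py`.

```python
import re
from collections import Counter

# Stand-in comments as (user, text, ISO timestamp) tuples.
comments = [
    ("alice", "I think this is great. Really great work!", "2021-05-04T14:22:10"),
    ("bob", "unbelievable arguments here. i disagree completely", "2021-05-04T02:05:33"),
]

# Word frequency: one corpus over all comments, then the most common words.
corpus = Counter(
    w.lower() for _, text, _ in comments for w in re.findall(r"[a-zA-Z']+", text)
)
top_words = [w for w, _ in corpus.most_common(50)]

# Syntax metrics per comment: average sentence length, long-word rate,
# and the fraction of sentences that start with a capital letter.
def syntax_metrics(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-zA-Z']+", text)
    return {
        "avg_sentence_len": len(words) / len(sentences),
        "long_word_freq": sum(len(w) > 6 for w in words) / len(words),
        "pct_capitalized": sum(s.strip()[0].isupper() for s in sentences) / len(sentences),
    }

# Temporal feature: the hour (0-23) of each post.
hours = [int(ts[11:13]) for _, _, ts in comments]
```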
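The clustering-and-visualization step can be sketched with SciPy as below. The feature matrix is random stand-in data, and Ward linkage is one plausible choice of method rather than necessarily the one used for each dendrogram in this project.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Random stand-in features: 10 users x 5 features (the real features are
# the word-frequency, syntax, and temporal columns described above).
rng = np.random.default_rng(0)
features = rng.random((10, 5))

# Hierarchical clustering, then a dendrogram saved to a .png file.
Z = linkage(features, method="ward")
fig, ax = plt.subplots(figsize=(8, 4))
dendrogram(Z, labels=[f"user{i}" for i in range(10)], ax=ax)
fig.savefig("dendrogram_sketch.png", dpi=150)
```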
- `API_Requests.py` - Script to extract Disqus comments to Feather file format
- `Stylometry.py` - Initial data preprocessing and analysis: finds the 50 most frequent words and their distributions
- `Syntax_Analysis.py` - Builds on the previous output with new metrics, including capitalization frequency, sentence length, and use of long words
- `Time_Analysis.py` - Extracts statistics about post frequency for each user
- `Dendrograms.py` - Hierarchical cluster analysis of the data to identify duplicate accounts
- `Dataframes` - Contains all dataframes created or used in the above files. (Note: this does not include the raw data, which was too expensive to upload. I'm happy to share it with anyone who requests it.)
  - `message_time` - raw data for the time of each message
  - `time_data` - posting frequency for every hour for each user
  - `unfiltered_corpus` - raw corpus for each user along with three statistics
  - `corpus` - cleaned corpus (i.e., HTML tags and links removed) for each user along with three style-based statistics
  - `Top50` - top-50 word frequency for each user
  - `df_time_words` - time and word frequencies for 266 active users
  - `df_time_words_expanded` - time and word frequencies for 521 users
  - `metrics_top25` - style-based statistics and word frequency for each user
- `Dendrograms` - Contains .png files of each dendrogram created in `Dendrograms.py`. I recommend viewing them in GIMP or other photo-editing software, as the files are quite large and might slow your browser if viewed directly.
| Method | Word Frequency | Syntax Analysis | Temporal Analysis | Accuracy¹ |
|---|---|---|---|---|
| C1 | ✓ | ✘ | ✘ | 93% |
| C2² | ✓ | ✘ | ✘ | |
| C3 | ✓ | ✘ | ✓ | 73% |
| C4 | ✓ | ✓ | ✘ | 73% |
| C5 | ✓ | ✓ | ✓ | 67% |
¹ Lower bound of the true-positive identification rate, obtained by taking the 15 closest points in each clustering and computing the proportion of those matches that are known duplicate accounts. The true accuracy could be higher if a pair not yet known to be duplicates were later confirmed to be the same person.
² Dataset filtered to include only active users.
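The accuracy lower bound described in the footnote can be sketched as follows. The features and the set of known duplicates are hypothetical stand-ins, and this does not reproduce the project's `process_linkage` function, only the idea behind the estimate.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Stand-in data: 30 users with 4 features, plus a hypothetical set of
# account pairs already known to belong to the same person.
rng = np.random.default_rng(1)
features = rng.random((30, 4))
users = [f"user{i}" for i in range(30)]
known_duplicates = {("user0", "user1"), ("user2", "user3")}

# All pairwise distances, then the 15 closest pairs of users.
D = squareform(pdist(features))
pairs = [
    (D[i, j], users[i], users[j])
    for i in range(len(users))
    for j in range(i + 1, len(users))
]
top15 = sorted(pairs)[:15]

# Proportion of those 15 closest pairs that are known duplicates:
# a lower bound, since unconfirmed pairs may still be the same person.
hits = sum(
    (a, b) in known_duplicates or (b, a) in known_duplicates for _, a, b in top15
)
lower_bound = hits / 15
```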
Sample Calculation: Running `process_linkage(C1, LB1).head(15)` in `Dendrograms.py` yields the following table, annotated to denote known alternate accounts:
In this chart, blue indicates a match, while orange indicates no evidence of a match. Note that when the method matches users with the same exact name (e.g., BlueDot and BlueDot), these are not the same account: the user created a new account and had word frequencies similar to their previous one. Thus, the lower bound for accuracy in this method is 93%.
Sample Visualizations

`dendrogram1.png`, zoomed in; colored components indicate matches.

