Stylometry
Sample project that uses stylometry to deanonymize the author of a Twitter account.
Instructions
1. Install the dependencies and run Python:
$ pip install tweepy numpy unidecode nltk scipy scikit-learn
$ python3
In Python, import nltk and download the punkt model:
>>> import nltk
>>> nltk.download()
2. Download files:
$ git clone https://github.com/ViliamV/stylometry.git
$ cd stylometry/
3. Get Twitter API credentials
- Follow these steps.
- Enter the credentials into twitter-API.txt.
4. Download tweets
- Create accounts.txt in the main directory and list the names of the accounts to download, one per line. Put the unknown author's account last.
- Create a directory named data in the main directory.
- Run tweet-downloader.py and wait. Due to Twitter API rate limits, it might take a while.
- Verify that data contains the downloaded tweets.
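The manual preparation in step 4 can also be scripted. The following is a minimal sketch, not part of the repository: the helper names `prepare_inputs` and `downloaded_files` are hypothetical, and it makes no assumption about how tweet-downloader.py names its output files, only that they end up in data.

```python
from pathlib import Path


def prepare_inputs(root, known_accounts, unknown_account):
    """Write accounts.txt (unknown account last) and create the data directory."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)

    # One account name per line; the unknown author's account goes last.
    accounts = list(known_accounts) + [unknown_account]
    (root / "accounts.txt").write_text("\n".join(accounts) + "\n", encoding="utf-8")

    data_dir = root / "data"
    data_dir.mkdir(exist_ok=True)
    return data_dir


def downloaded_files(data_dir):
    """Return the sorted names of non-empty files found in the data directory."""
    return sorted(
        p.name
        for p in Path(data_dir).iterdir()
        if p.is_file() and p.stat().st_size > 0
    )
```

Calling `prepare_inputs(".", ["author_a", "author_b"], "mystery_account")` before running tweet-downloader.py, and `downloaded_files("data")` afterwards, covers the first, second, and last bullets; the account names here are placeholders.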
5. Run stylometry
- Edit classification.py and change the value of UNKNOWN (line 28) to the unknown author's account:
UNKNOWN="example_account"
- Run classification.py.
About classification
This code uses a Bag of Words model to extract features from the text. A great introduction to implementing this model can be found here.
The code also uses Czech stopwords and a Czech tokenizer; however, both are quite simple to change.
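To illustrate the idea, here is a minimal Bag of Words sketch using only the standard library. It is not the code in classification.py: the English stop-word set, the regex tokenizer, and the cosine-similarity comparison are all assumptions chosen to keep the example self-contained, and they stand in for the Czech stopwords, Czech tokenizer, and classifier used by the project.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the project uses Czech stopwords.
STOPWORDS = {"the", "a", "is", "and", "to", "of"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (a stand-in tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bag_of_words(text, stopwords=STOPWORDS):
    """Count how often each token that is not a stop word occurs in the text."""
    return Counter(t for t in tokenize(text) if t not in stopwords)


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_author(known_texts, unknown_text):
    """Return the known author whose word counts are closest to the unknown text."""
    unknown = bag_of_words(unknown_text)
    scores = {
        author: cosine_similarity(bag_of_words(text), unknown)
        for author, text in known_texts.items()
    }
    return max(scores, key=scores.get), scores
```

Swapping the language only requires passing a different `stopwords` set and replacing `tokenize`, which mirrors how the Czech-specific parts of the project can be changed.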