Commit

Updating wording about python version, minor edits
analyticascent committed May 11, 2018
1 parent e355d04 commit d34d8e1
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions Stylometric Analysis and Obfuscation Using Python.mdown
@@ -4,7 +4,7 @@

___

-*You will need [Python 2.7 Anaconda Distribution](https://www.continuum.io/downloads) and [Tweepy](http://www.tweepy.org/) installed to run project code.*
+*You will need [Python 3 Anaconda Distribution](https://www.continuum.io/downloads) and [Tweepy](http://www.tweepy.org/) installed to run project code.*

## Contents:

@@ -54,10 +54,10 @@ As successful methods for "fingerprinting" feeds are found, adversarial techniqu

This project can be thought of as a more sophisticated version of what many consider to be the "Hello world" of machine learning: [The Iris classification problem](https://en.wikipedia.org/wiki/Iris_flower_data_set). Rather than classifying *four* existing measurement features (length/width measurements of sepals and petals) under *three* categorical outcomes, this project will entail the use of *over a dozen features* to attribute tweets between *two categorical outcomes.* From start to finish, it will boil down to the following:

-* Utilizing Twitter's API to acquire tweets in CSV form
+* Utilizing Twitter's API to acquire user tweets in CSV form
* Write code blocks for each of the tweet features being measured
-* Pre-process (fingerprint) users so logistic regression can be applied
-* Apply logistic regression to build authorship attribution model
+* Pre-process (fingerprint) users so classification can be applied
+* Apply classification algorithms to build authorship attribution model
* If possible, develop methods for subverting those classification schemes

The outcome of this project will be a tradeoff between accuracy and simplicity. Pre-processing will be used to "fingerprint" feeds, then supervised learning in the form of linear discriminant analysis will be used to attribute authorship. As requested by a prospective user of the final code result, this project will account for the fact that many people would rather use something they can *understand* than something that "claims" to be most effective.
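
For readers of this diff, the "write code blocks for each feature, fingerprint users, then classify" steps could look roughly like the sketch below. It is not the project's code. The seven features, the function names (`tweet_features`, `fingerprint`, `attribute`), and the sample tweets are all hypothetical. To stay dependency-free, it uses a nearest-profile rule as a stand-in for the logistic regression or linear discriminant analysis the text mentions.

```python
import re
from statistics import mean


def tweet_features(text):
    """Return a small, hypothetical stylometric feature vector for one tweet."""
    words = text.split()
    n_chars = len(text)
    return [
        n_chars,                                                        # length in characters
        len(words),                                                     # word count
        mean(len(w) for w in words) if words else 0.0,                  # average word length
        sum(c.isupper() for c in text) / n_chars if n_chars else 0.0,   # uppercase ratio
        len(re.findall(r"[!?.,;:]", text)),                             # punctuation count
        text.count("#"),                                                # hashtag markers
        text.count("@"),                                                # mention markers
    ]


def fingerprint(tweets):
    """Average each feature over a user's tweets to get one profile vector."""
    rows = [tweet_features(t) for t in tweets]
    return [mean(col) for col in zip(*rows)]


def attribute(tweet, profiles):
    """Return the user whose profile is nearest (squared Euclidean distance).

    Stand-in for a trained classifier; features are left unscaled for brevity.
    """
    vec = tweet_features(tweet)

    def dist(profile):
        return sum((a - b) ** 2 for a, b in zip(vec, profile))

    return min(profiles, key=lambda user: dist(profiles[user]))


# Made-up sample tweets for two hypothetical users.
profiles = {
    "user_a": fingerprint(["GREAT news today!!! #winning",
                           "SO excited!!! @friend #party"]),
    "user_b": fingerprint(["the meeting is moved to thursday.",
                           "i will send the notes later."]),
}
print(attribute("WOW what a game!!! #sports", profiles))
```

In a real run the tweets would come from the CSV files acquired through Twitter's API, and the features would normally be scaled before distances are compared, since raw character counts otherwise dominate the ratios.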