
Next Steps: Notes from 2018 10 26

dvfeinblum edited this page Oct 29, 2018 · 1 revision

spaCy.io

  • entity recognition
  • the NLTK alternative that people actually use
  • underlying stuff is also more interesting
  • classification stuff
  • sentiment recognition

Misc

  • Bag of words: a vectorized representation of whatever you're analyzing
  • One-hot encoding
  • Dimensionality reduction: look at one sentence, apply statistics to look for features that actually matter, then scale up (principal component analysis)
  • In lieu of bag of words, word2vec
  • "I wanna train a simple classifier"
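
A rough sketch of what the bag-of-words and one-hot vectors look like, using only the Python standard library. The toy corpus and the helper names (`build_vocab`, `bag_of_words`, `one_hot`) are made up for illustration; a real pipeline would more likely lean on spaCy or similar for tokenizing.

```python
from collections import Counter


def build_vocab(sentences):
    """Assign each distinct lowercase token a column index (alphabetical order)."""
    tokens = sorted({tok for s in sentences for tok in s.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(sentence, vocab):
    """Count how often each vocabulary word occurs; unknown words are ignored."""
    vec = [0] * len(vocab)
    for tok, n in Counter(sentence.lower().split()).items():
        if tok in vocab:
            vec[vocab[tok]] = n
    return vec


def one_hot(token, vocab):
    """Vector with a single 1 in the column belonging to `token`."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


# Hypothetical toy corpus, for illustration only.
corpus = ["the cat sat", "the dog sat on the cat"]
vocab = build_vocab(corpus)

print(vocab)                           # word -> column index
print(bag_of_words(corpus[1], vocab))  # one count per vocabulary word
print(one_hot("dog", vocab))           # a single 1 in the "dog" column
```

Note that the vector length equals the vocabulary size, which is why dimensionality reduction comes up as soon as the corpus gets big.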

Google dataset search

Possible workflow

  1. Tokenize
  2. Reduce dimension
  3. Take that reduced
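
A minimal sketch of the first two steps only (the third is not spelled out in these notes). It tokenizes a few documents into a count matrix, then reduces each document to a single number by projecting onto the first principal component, found here by power iteration. Everything in it is an assumption for illustration: the regex tokenizer, the toy documents, the function names, and the choice of power iteration in place of a library PCA.

```python
import math
import random
import re


def tokenize(text):
    """Step 1: lowercase and split into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_matrix(docs):
    """One row per document, one column per vocabulary word."""
    vocab = sorted({t for d in docs for t in tokenize(d)})
    index = {t: i for i, t in enumerate(vocab)}
    rows = []
    for d in docs:
        row = [0.0] * len(vocab)
        for t in tokenize(d):
            row[index[t]] += 1.0
        rows.append(row)
    return vocab, rows


def first_principal_component(rows, iterations=200, seed=0):
    """Step 2: estimate the top principal direction by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Random (seeded) start so we are unlikely to begin orthogonal to the answer.
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]

    for _ in range(iterations):
        # v <- X^T (X v), normalized, without forming the covariance matrix.
        proj = [sum(x * w for x, w in zip(row, v)) for row in centered]
        new_v = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in new_v))
        if norm == 0.0:  # all documents identical: nothing to reduce
            break
        v = [x / norm for x in new_v]

    # Each document's coordinate along the principal direction.
    scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
    return v, scores


# Hypothetical toy documents, for illustration only.
docs = [
    "cats purr and cats nap",
    "cats nap all day",
    "stocks fell and bonds rose",
    "bonds rose all week",
]
vocab, rows = count_matrix(docs)
direction, scores = first_principal_component(rows)
print(len(vocab), "dimensions reduced to 1:", scores)
```

Keeping more than one component would need deflation or a proper eigendecomposition; one component is enough to show the idea.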

To actually detect my writing v. not:

  1. Crawl other blogs. Like, a lot.
  2. Grab a bunch of your sentences, grab other sentences, go forth and sample
  3. The model would be, give me a word vector, spit out yes/no (dave/not dave)
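
One way the dave/not-dave model could look: a tiny logistic regression over bag-of-words counts, trained by plain gradient steps. This is a sketch under heavy assumptions, not the plan itself. The six sentences are stand-ins (no crawled data here), the function names are hypothetical, and a real attempt would need far more samples plus a held-out test set.

```python
import math


def tokenize(text):
    return text.lower().split()


def featurize(sentence, index):
    """Bag-of-words count vector; words outside the vocabulary are ignored."""
    vec = [0.0] * len(index)
    for tok in tokenize(sentence):
        if tok in index:
            vec[index[tok]] += 1.0
    return vec


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(X, y, lr=0.5, epochs=300):
    """Batch gradient ascent on the logistic log-likelihood."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        residuals = [
            yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            for xi, yi in zip(X, y)
        ]
        for j in range(d):
            w[j] += lr * sum(r * xi[j] for r, xi in zip(residuals, X)) / n
        b += lr * sum(residuals) / n
    return w, b


def prob_dave(sentence, index, w, b):
    x = featurize(sentence, index)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Stand-in samples (hypothetical): 1 = dave, 0 = not dave.
dave = [
    "i wanna train a simple classifier",
    "go forth and sample",
    "crawl other blogs like a lot",
]
not_dave = [
    "the quarterly report is attached",
    "please review the attached invoice",
    "the meeting is moved to friday",
]

sentences = dave + not_dave
labels = [1] * len(dave) + [0] * len(not_dave)
words = sorted({t for s in sentences for t in tokenize(s)})
index = {t: i for i, t in enumerate(words)}

X = [featurize(s, index) for s in sentences]
w, b = train(X, labels)

for s in sentences:
    verdict = "dave" if prob_dave(s, index, w, b) >= 0.5 else "not dave"
    print(verdict, "<-", s)
```

Swapping the count vectors for word2vec embeddings would only change `featurize`; the yes/no model on top stays the same.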