Create a Runmode that Splits Sentences Instead of Words #12

dvfeinblum · 2018-10-27T20:18:21Z

Is your feature request related to a problem? Please describe.
The meta-purpose of this project is to learn some NLP. Word2Vec is a really nice, low-barrier-to-entry way of doing that, and vectors for sentences would be a nice place to start.

Describe the solution you'd like
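Currently, the blog parser sanitizes posts by removing punctuation and then tokenizing the words in the post with NLTK. We should do something similar but, instead of splitting on spaces, we should split on periods. Since the sanitizer strips punctuation, the sentence split has to happen before that step.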
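
A minimal sketch of that idea, using only the standard library. The function name `split_sentences` and the rule "a sentence ends at `.`, `!` or `?` followed by whitespace" are assumptions for illustration, not the parser's actual code. The naive rule will mis-split abbreviations such as "e.g."; NLTK's `sent_tokenize` is one possible alternative that handles many of those cases.

```python
import re
import string

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# Translation table that deletes ASCII punctuation (this includes apostrophes,
# so "here's" becomes "heres").
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def split_sentences(post_text):
    """Split raw post text into sentences, then strip punctuation from each.

    The split runs first because the sentence boundaries are themselves
    punctuation and would be lost if the text were sanitized beforehand.
    """
    sentences = []
    for raw in _SENTENCE_END.split(post_text.strip()):
        cleaned = raw.translate(_STRIP_PUNCT)
        cleaned = " ".join(cleaned.split())  # collapse runs of whitespace
        if cleaned:
            sentences.append(cleaned)
    return sentences
```

Each returned string can then go through the existing word-level tokenizer, so the word and sentence run modes share the same sanitizing step.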

Describe alternatives you've considered
N/A

Additional context
N/A

dvfeinblum · 2018-11-12T00:59:57Z

Currently, I'm trying to decide where and how to store sentences. One option is to just add them to Postgres. Something like:

| sentence | url | word_count | vector |
| --- | --- | --- | --- |
| here's a self-aggrandizing sentence | blag.web/post-1 | 4 | (0.123,0.2431,0.234232,...) |
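
One way to picture that layout, as a rough sketch: the table name `sentences`, the column types, and the helper `make_row` are all hypothetical, chosen only to mirror the four columns above. The DDL is kept as a plain string and is not executed here.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical Postgres DDL; the table name and column types are assumptions.
CREATE_SENTENCES_TABLE = """
CREATE TABLE IF NOT EXISTS sentences (
    sentence   TEXT NOT NULL,
    url        TEXT NOT NULL,
    word_count INTEGER NOT NULL,
    vector     DOUBLE PRECISION[]
);
"""


@dataclass(frozen=True)
class SentenceRow:
    """In-memory mirror of one row in the proposed table."""
    sentence: str
    url: str
    word_count: int
    vector: Tuple[float, ...]


def make_row(sentence: str, url: str, vector: Sequence[float]) -> SentenceRow:
    """Build a row, deriving word_count by splitting on whitespace."""
    return SentenceRow(
        sentence=sentence,
        url=url,
        word_count=len(sentence.split()),
        vector=tuple(vector),
    )
```

For example, `make_row("here's a self-aggrandizing sentence", "blag.web/post-1", [0.123, 0.2431])` yields a row whose `word_count` is 4, matching the sample above.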

dvfeinblum · 2018-11-12T01:09:39Z

Oooh; also, one nice thing about the word tokenizer I already wrote is that we can throw out words that don't mean much. Glancing at the word_details table, we can probably toss words with the following part_of_speech:

DT
TO
CC
PRP
IN
PRP$
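
A small sketch of that filter, assuming the words arrive as `(word, tag)` pairs, which is the shape `nltk.pos_tag` returns. The names `LOW_VALUE_TAGS` and `drop_low_value_words` are made up for illustration.

```python
# Penn Treebank tags from the list above:
#   DT   determiner                  TO   the word "to"
#   CC   coordinating conjunction    PRP  personal pronoun
#   IN   preposition / subordinating conjunction
#   PRP$ possessive pronoun
LOW_VALUE_TAGS = frozenset({"DT", "TO", "CC", "PRP", "IN", "PRP$"})


def drop_low_value_words(tagged_words, skip_tags=LOW_VALUE_TAGS):
    """Return the words whose part-of-speech tag is not in skip_tags.

    tagged_words is an iterable of (word, tag) pairs.
    """
    return [word for word, tag in tagged_words if tag not in skip_tags]
```

Keeping the tag set as a parameter makes it easy to experiment with a stricter or looser list without touching the tokenizer itself.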


dvfeinblum · 2018-11-25T21:21:52Z

Leaving this issue open because the last comment still needs to be implemented!

dvfeinblum · 2018-11-26T01:39:45Z

Oh well hey now:

https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/examples/tutorials/word2vec/word2vec_basic.py

dvfeinblum self-assigned this Oct 27, 2018

dvfeinblum added the enhancement ("New feature or request") label Oct 27, 2018

dvfeinblum added this to To do in Vectorize Everything Oct 27, 2018

dvfeinblum moved this from To do to Backlog in Vectorize Everything Oct 27, 2018

dvfeinblum moved this from Backlog to Scrubbed in Vectorize Everything Oct 27, 2018

dvfeinblum moved this from Scrubbed to Backlog in Vectorize Everything Oct 27, 2018

dvfeinblum moved this from Backlog to In progress in Vectorize Everything Nov 12, 2018

dvfeinblum mentioned this issue Nov 12, 2018

[Issues #12] Add Sentence Tokenization #16

Merged

dvfeinblum added a commit that referenced this issue Nov 25, 2018

[Issues #12] Add Sentence Tokenization (#16)

fa0ba35

* created sentence table and updated utils
* Changed parser to store sentences and words.
* Fixed unittest

dvfeinblum moved this from In progress to Done in Vectorize Everything Nov 25, 2018
