Create a Runmode that Splits Sentences Instead of Words #12

Open
dvfeinblum opened this issue Oct 27, 2018 · 4 comments

@dvfeinblum
Owner

Is your feature request related to a problem? Please describe.
The meta-purpose of this project is to learn some NLP. Word2Vec is a really nice, low-barrier-to-entry way of doing that, and vectors for sentences would be a nice place to start.

Describe the solution you'd like
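Currently, the blog parser sanitizes posts by removing punctuation and then tokenizing the words in each post with NLTK. We should do something similar but, instead of splitting on spaces, we should split on periods. Note that the split has to happen before the punctuation is stripped, otherwise there are no periods left to split on.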
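
As a rough sketch of the idea, assuming the parser hands us the raw post text as a string: split on periods first, then strip the remaining punctuation from each chunk. The function name `split_sentences` is made up for illustration and is not taken from the repo.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCTUATION = str.maketrans("", "", string.punctuation)


def split_sentences(post_text):
    """Split a post into sentences on periods, then sanitize each one.

    Hypothetical helper: splitting happens before punctuation removal,
    because the periods are the sentence boundaries.
    """
    sentences = []
    for chunk in post_text.split("."):
        cleaned = chunk.translate(_STRIP_PUNCTUATION)
        # Collapse runs of whitespace and trim the ends.
        cleaned = " ".join(cleaned.split())
        if cleaned:  # skip empty chunks, e.g. after a trailing period
            sentences.append(cleaned)
    return sentences


if __name__ == "__main__":
    print(split_sentences("Hello, world. This is a post.  "))
```

A plain period split will break on abbreviations like "e.g." and on decimal numbers. NLTK's `sent_tokenize` handles many of those cases, at the cost of needing its tokenizer data downloaded, so it may be worth swapping in later.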

Describe alternatives you've considered
N/A

Additional context
N/A

@dvfeinblum dvfeinblum self-assigned this Oct 27, 2018
@dvfeinblum dvfeinblum added the enhancement New feature or request label Oct 27, 2018
@dvfeinblum dvfeinblum moved this from To do to Backlog in Vectorize Everything Oct 27, 2018
@dvfeinblum dvfeinblum moved this from Backlog to Scrubbed in Vectorize Everything Oct 27, 2018
@dvfeinblum dvfeinblum moved this from Scrubbed to Backlog in Vectorize Everything Oct 27, 2018
@dvfeinblum dvfeinblum moved this from Backlog to In progress in Vectorize Everything Nov 12, 2018
@dvfeinblum
Owner Author

Currently, I'm trying to decide where and how to store sentences. One option is to just add them to postgres. Something like

| sentence | url | word_count | vector |
| --- | --- | --- | --- |
| here's a self-aggrandizing sentence | blag.web/post-1 | 4 | (0.123,0.2431,0.234232,...) |
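
To make the row shape concrete, here is a minimal sketch of one record with those four columns. Everything named here (`SentenceRow`, `make_sentence_row`, the `sentences` table name and the column types in the DDL string) is an assumption for illustration, not the schema that actually landed in the repo.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical Postgres DDL matching the columns above; the table name and
# column types are guesses, not the project's real schema.
CREATE_SENTENCES_TABLE = """
CREATE TABLE IF NOT EXISTS sentences (
    sentence   TEXT NOT NULL,
    url        TEXT NOT NULL,
    word_count INTEGER NOT NULL,
    vector     DOUBLE PRECISION[]
);
"""


@dataclass
class SentenceRow:
    sentence: str
    url: str
    word_count: int
    vector: Tuple[float, ...]


def make_sentence_row(sentence: str, url: str, vector: Sequence[float]) -> SentenceRow:
    """Build one row, deriving word_count from whitespace-separated words."""
    return SentenceRow(
        sentence=sentence,
        url=url,
        word_count=len(sentence.split()),
        vector=tuple(vector),
    )


if __name__ == "__main__":
    print(make_sentence_row("here's a self-aggrandizing sentence",
                            "blag.web/post-1", [0.123, 0.2431]))
```

Deriving `word_count` from the sentence rather than passing it in keeps the two from drifting apart, though it does assume words are whitespace-separated.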

@dvfeinblum
Owner Author

Oooh; also, one nice thing about the word tokenizer I already wrote is that we can throw out words that don't mean much. Glancing at the word_details table, we can probably toss words with the following part_of_speech:

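  • DT (determiner)
  • TO ("to")
  • CC (coordinating conjunction)
  • PRP (personal pronoun)
  • IN (preposition or subordinating conjunction)
  • PRP$ (possessive pronoun)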
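
The filter over those tags could look something like the sketch below. It assumes the words are already tagged as `(word, tag)` pairs with Penn Treebank tags, which is the shape `nltk.pos_tag` returns; the names `LOW_SIGNAL_TAGS` and `drop_low_signal_words` are invented for this example.

```python
# Part-of-speech tags from the list above that carry little meaning on their own.
LOW_SIGNAL_TAGS = frozenset({"DT", "TO", "CC", "PRP", "IN", "PRP$"})


def drop_low_signal_words(tagged_words, skip_tags=LOW_SIGNAL_TAGS):
    """Return the words whose tag is not in skip_tags, preserving order.

    tagged_words is an iterable of (word, tag) pairs, e.g. the output of
    nltk.pos_tag on a tokenized sentence.
    """
    return [word for word, tag in tagged_words if tag not in skip_tags]


if __name__ == "__main__":
    tagged = [("she", "PRP"), ("walked", "VBD"), ("to", "TO"),
              ("the", "DT"), ("store", "NN")]
    print(drop_low_signal_words(tagged))
```

Keeping the tag set as a parameter makes it easy to experiment with tossing more (or fewer) parts of speech without touching the filter itself.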

dvfeinblum added a commit that referenced this issue Nov 25, 2018
* created sentence table and updated utils

* Changed parser to store sentences and words.

* Fixed unittest
@dvfeinblum dvfeinblum moved this from In progress to Done in Vectorize Everything Nov 25, 2018
@dvfeinblum
Owner Author

Leaving this issue open because the last comment still needs to be implemented!
