Topic mining #109

Closed
dvmorozov opened this issue Nov 11, 2022 · 0 comments
Labels
feature New feature

Comments

@dvmorozov
Owner

dvmorozov commented Nov 11, 2022

Solution

  1. Implement a script that collects the dictionary. Represent each document as a bag of words. Save the dictionary to a file. ✔️
  2. Implement an iterator class over the files in a directory. ✔️
  3. Implement the model and feed it with the iterator class. ✔️
  4. Save the corpus to a file (each text converted into a single line) for processing with META. ✔️
  5. Remove Greek letters from the list of special characters. ❓
  6. Output the topics as JSON. ✔️
  7. Add lemmatization. Add a reference to the main page. ✔️
  8. Make the encoding used for reading and writing files a script parameter. ✔️
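
Steps 1, 2, 4 and 8 could be sketched roughly as follows. This is a dependency-free illustration, not the code committed for this issue: the names `DirectoryCorpus`, `build_dictionary`, `to_bag_of_words`, `save_dictionary` and `save_corpus_one_line_per_document` are hypothetical, and the tokenizer is a deliberately naive regular expression. In the gensim tutorials referenced below, `gensim.corpora.Dictionary` and its `doc2bow` method play the role of the dictionary and bag-of-words helpers.

```python
import json
import os
import re
from collections import Counter

# Naive tokenizer (an assumption): runs of Unicode word characters.
TOKEN_RE = re.compile(r"\w+", re.UNICODE)


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


class DirectoryCorpus:
    """Iterate over the files of a directory, yielding one token list per file."""

    def __init__(self, directory, encoding="utf-8"):
        self.directory = directory
        self.encoding = encoding  # step 8: encoding is a parameter

    def __iter__(self):
        # Sorted order keeps document indices stable between passes.
        for name in sorted(os.listdir(self.directory)):
            path = os.path.join(self.directory, name)
            if os.path.isfile(path):
                with open(path, encoding=self.encoding) as handle:
                    yield tokenize(handle.read())


def build_dictionary(corpus):
    """Map every distinct token to an integer id, in order of first appearance."""
    token2id = {}
    for tokens in corpus:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bag_of_words(tokens, token2id):
    """Return sorted (token_id, count) pairs for tokens known to the dictionary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


def save_dictionary(token2id, path, encoding="utf-8"):
    """Write the token -> id mapping as JSON."""
    with open(path, "w", encoding=encoding) as handle:
        json.dump(token2id, handle, ensure_ascii=False, indent=2)


def save_corpus_one_line_per_document(corpus, path, encoding="utf-8"):
    """Write each document as a single line of space-separated tokens (step 4)."""
    with open(path, "w", encoding=encoding) as handle:
        for tokens in corpus:
            handle.write(" ".join(tokens) + "\n")
```

Because `DirectoryCorpus` re-reads the files on every pass, the corpus never has to fit in memory; the dictionary is built in one pass and the bag-of-words vectors are produced in a second one.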

Related

  1. Graph displaying topics vs. time #111.
  2. Save article word vectors into files for subsequent comparison #112.
  3. Support adding new files to the corpus (merge dictionaries) #113.
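
For the dictionary merge mentioned in #113, one possible approach is sketched below. It assumes that a dictionary is a plain token-to-id mapping with contiguous ids starting at 0; the helper names `merge_dictionaries` and `remap_bag_of_words` are hypothetical. gensim's `Dictionary.merge_with` provides comparable behaviour for its own dictionary class.

```python
def merge_dictionaries(base, other):
    """Merge `other` into a copy of `base`.

    Returns the merged token -> id mapping and a map from ids used in
    `other` to the corresponding ids in the merged dictionary.
    Assumes ids in `base` are contiguous and start at 0.
    """
    merged = dict(base)
    remap = {}
    for token, old_id in other.items():
        # Known tokens keep their id from `base`; new tokens get the next free id.
        new_id = merged.setdefault(token, len(merged))
        remap[old_id] = new_id
    return merged, remap


def remap_bag_of_words(bow, remap):
    """Rewrite (token_id, count) pairs built against `other` into merged ids."""
    return sorted((remap[token_id], count) for token_id, count in bow)
```

Vectors already computed against `base` stay valid after the merge, since none of its ids change; only vectors built against `other` need remapping.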

References

https://www.qblocks.cloud/blog/best-nlp-libraries-python

https://pypi.org/project/gensim/ ✔️
https://radimrehurek.com/gensim/
https://radimrehurek.com/gensim/auto_examples/core/run_topics_and_transformations.html (model)
https://radimrehurek.com/gensim/auto_examples/core/run_corpora_and_vector_spaces.html (corpus iteration)
https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#pre-process-and-vectorize-the-documents (lemmatization)

https://www.nltk.org/ ✔️ (pip install --user -U nltk; after that, recreate the virtual environment so that it inherits the installed packages)

https://github.com/clips/pattern

https://www.machinelearningplus.com/nlp/gensim-tutorial/
https://www.machinelearningplus.com/nlp/lemmatization-examples-python/#stanfordcorenlplemmatization

https://stackoverflow.com/questions/8884188/how-to-read-and-write-ini-file-with-python3
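
A minimal sketch of item 8 in combination with an INI file, using only the standard library. The section name `files`, the option name `encoding`, the default file name `settings.ini` and the `--config` / `--encoding` flags are assumptions made for illustration and are not taken from the project.

```python
import argparse
import configparser


def read_encoding(config_path, default="utf-8"):
    """Read the file encoding from an INI file, falling back to a default."""
    parser = configparser.ConfigParser()
    # ConfigParser.read() silently skips missing files, so the default still applies.
    parser.read(config_path, encoding="utf-8")
    return parser.get("files", "encoding", fallback=default)


def parse_args(argv=None):
    """Parse the command line; --encoding overrides the INI file."""
    parser = argparse.ArgumentParser(description="Topic mining script (sketch).")
    parser.add_argument("--config", default="settings.ini")
    parser.add_argument("--encoding", default=None)
    return parser.parse_args(argv)


def resolve_encoding(args):
    """Prefer the command-line value, then the INI file, then UTF-8."""
    return args.encoding or read_encoding(args.config)
```

The resolved value can then be passed to every `open(..., encoding=...)` call for both reading and writing.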

@dvmorozov dvmorozov added the feature New feature label Nov 11, 2022
@dvmorozov dvmorozov self-assigned this Nov 11, 2022
dvmorozov added a commit that referenced this issue Nov 18, 2022
dvmorozov added a commit that referenced this issue Nov 20, 2022
dvmorozov added a commit that referenced this issue Nov 20, 2022
dvmorozov added a commit that referenced this issue Nov 20, 2022
dvmorozov added a commit that referenced this issue Nov 22, 2022
dvmorozov added a commit that referenced this issue Nov 23, 2022
…ry is written to file, path to dictionary has been corrected.

#109
dvmorozov added a commit that referenced this issue Nov 23, 2022
dvmorozov added a commit that referenced this issue Nov 23, 2022
dvmorozov added a commit that referenced this issue Nov 24, 2022
dvmorozov added a commit that referenced this issue Nov 24, 2022
dvmorozov added a commit that referenced this issue Nov 25, 2022
…eased. Redundant iterations have been removed. Minor improvements.

#109
dvmorozov added a commit that referenced this issue Nov 27, 2022
dvmorozov added a commit that referenced this issue Nov 27, 2022
dvmorozov added a commit that referenced this issue Dec 3, 2022
dvmorozov added a commit that referenced this issue Dec 14, 2022
dvmorozov added a commit that referenced this issue Dec 18, 2022
dvmorozov added a commit that referenced this issue Dec 19, 2022
…ble for radial cluster graph. 200 topics over all corpus have been mined as well.

#109
dvmorozov added a commit that referenced this issue Dec 19, 2022