Key-Terms-Extractor

A program that extracts key terms from news articles. It uses NLTK and sklearn for tokenization, lemmatization, part-of-speech tagging, and TF-IDF vectorization.

Features

Command line arguments
XML parsing with lxml
Word tokenization with NLTK
Word lemmatization with NLTK's WordNetLemmatizer
Stopword removal
TF-IDF (term frequency-inverse document frequency) vectorization with sklearn
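
To make the TF-IDF step concrete, here is a minimal pure-Python sketch of how terms could be ranked per article. It is not the code in key_terms.py, which uses sklearn. The function name top_terms and the alphabetical tie-break are assumptions made for illustration. The idf formula follows the smoothed variant that sklearn's TfidfVectorizer documents as its default, and the L2 normalization step is left out because it does not change the ranking within a single document.

```python
import math
from collections import Counter


def top_terms(docs, n):
    """Return the n highest-scoring terms for each tokenized document.

    docs is a list of token lists, assumed to be already lowercased,
    lemmatized and stripped of stopwords. The name and signature are
    hypothetical, not taken from key_terms.py.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))

    results = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed idf: ln((1 + N) / (1 + df)) + 1, multiplied by the raw count.
        scores = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Highest score first; ties broken alphabetically (an arbitrary choice).
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        results.append(ranked[:n])
    return results


docs = [
    ["sleep", "brain", "sleep", "cortex"],
    ["skull", "fossil", "skull", "brain"],
]
for terms in top_terms(docs, 2):
    print(" ".join(terms))
```

Terms that occur often in one article but rarely in the others score highest, which is why a word shared by every article (here "brain") drops to the bottom of each ranking.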

Usage

This program reads the articles from an XML file. Each article is a 'news' tag containing two 'value' tags: the first holds the headline, and the second holds the article text. Example:

<news>
    <value>Brain Disconnects During Sleep</value>
    <value>Scientists may have ... in Pasadena.</value>
</news>
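
As a rough illustration of how a file in this format could be read, the sketch below uses the standard library's xml.etree.ElementTree rather than lxml, whose etree module offers a largely compatible interface. The enclosing 'data' root tag and the function name read_articles are assumptions for this example. The README only shows the 'news' block itself.

```python
import xml.etree.ElementTree as ET

# Sample input. The <data> wrapper is an assumption; only the <news>
# block is shown in the README.
SAMPLE = """<data>
  <news>
    <value>Brain Disconnects During Sleep</value>
    <value>Scientists may have ... in Pasadena.</value>
  </news>
</data>"""


def read_articles(xml_text):
    """Return a list of (headline, text) pairs, one per <news> element."""
    root = ET.fromstring(xml_text)
    articles = []
    # iter() walks the whole tree, so <news> may sit at any nesting depth.
    for news in root.iter("news"):
        values = news.findall("value")
        if len(values) < 2:
            continue  # skip entries that lack a headline or a text
        headline = values[0].text or ""
        text = values[1].text or ""
        articles.append((headline, text))
    return articles


for headline, text in read_articles(SAMPLE):
    print(f"{headline}:")
    print(text)
```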

Pass the path to the XML file and the number of key terms to extract per article as command line arguments. Example:

python key_terms.py example.xml 5
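
The two positional arguments above could be handled with argparse along these lines. This is a sketch, not the parser used in key_terms.py. The argument names path and count are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the XML path and the number of key terms from the command line."""
    parser = argparse.ArgumentParser(
        description="Extract key terms from news articles in an XML file."
    )
    # Both names are illustrative; the real script may call them something else.
    parser.add_argument("path", help="path to the XML file")
    parser.add_argument("count", type=int, help="number of key terms per article")
    return parser.parse_args(argv)


# Passing an explicit list stands in for sys.argv[1:] in this example.
args = parse_args(["example.xml", "5"])
print(f"Searching for {args.count} key terms at {args.path}")
```

Declaring count with type=int lets argparse reject a non-numeric value with a usage message before any parsing work begins.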

Output

Searching for 5 key terms at example.xml
Brain Disconnects During Sleep:
sleep cortex consciousness tononi tm 

New Portuguese skull may be an early relative of Neandertals:
skull fossil europe trait genus

cau777/Key-Terms-Extractor
