This repository contains the following files/folders:
- `ConvKey_processed_min2max4.csv` - the main file that was used to train the Keyword Extractor. It combines both datasets: the News-Keywords based dataset and the IR-Keyword-based dataset. The dataset has been filtered under the assumption that a sentence has at least 2 and at most 4 keywords.
- folder `./news_raw` - contains raw data (sentences with keywords) collected from news sources: BBC, Salon and The Local (`./news_raw/news_bbc.csv`, `./news_raw/news_salon.csv`, `./news_raw/news_thelocal.csv`).
- folder `./IR_kwds_raw` - contains raw data (sentences with keywords) collected using the IR Keywords based method. The folder has 2 files: one includes sentences from which a minimum of 1 and a maximum of 3 keywords were identified, the other includes sentences from which a minimum of 2 and a maximum of 4 keywords were identified (`./IR_kwds_raw/qulac_min1_max3kwds.csv`, `./IR_kwds_raw/qulac_min2_max4kwds.csv`).

All data files follow the same structure and contain the following attributes:
- `text` - tokenized version of the original text
- `keywords_indices` - indices of the keywords in a sentence
- `keywords_count` - count of keywords in a sentence

For example, for the original sentence: “Conservatives and liberals drink different beer”
Attribute | Data example | Comment |
---|---|---|
text | ['conservatives', 'and', 'liberals', 'drink', 'different', 'beer'] | Tokenized sentence |
keywords_indices | [0, 2, 5] | Meaning that the keywords are: ['conservatives', 'liberals', 'beer'] |
keywords_count | 3 | Count of the keywords |
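
The mapping from `keywords_indices` back to keyword tokens can be sketched in a few lines of Python. This is only an illustration of the table above: the helper name `keywords_from_indices` is hypothetical and not part of this repository, and the indices are assumed to be 0-based positions into the token list, as the example suggests.

```python
def keywords_from_indices(tokens, indices):
    """Return the tokens located at the given keyword positions.

    Hypothetical helper for illustration; assumes 0-based indices.
    """
    return [tokens[i] for i in indices]


# Values taken from the example table above.
tokens = ["conservatives", "and", "liberals", "drink", "different", "beer"]
keywords_indices = [0, 2, 5]

keywords = keywords_from_indices(tokens, keywords_indices)
print(keywords)  # ['conservatives', 'liberals', 'beer']
```

The length of the returned list should match the `keywords_count` value of the same row (3 in this example).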

Below you can find a sample of the data taken from the `ConvKey_processed_min2max4.csv` dataset:

text | keywords_indices | keywords_count |
---|---|---|
['find', 'background', 'information', 'about', 'man', 'made', 'satellites'] | [2, 5] | 2 |
['bring', 'back', 'the', 'dark', 'how', 'our', 'overuse', 'of', 'artificial', 'light', 'is', 'changing', 'nighttime', 'for', 'the', 'worse'] | [6, 8, 9] | 3 |
['the', 'hidden', 'vast', 'cruelty', 'of', 'this', 'health', 'care', 'bill', 'an', 'attack', 'on', 'care', 'for', 'the', 'elderly', 'disabled', 'and', 'most', 'vulnerable'] | [2, 6, 7, 8] | 4 |
['us', 'tanks', 'arrive', 'in', 'germany', 'to', 'help', 'nato', 'defences'] | [0, 4, 7] | 3 |
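
A minimal sketch of reading rows in this structure with the Python standard library is shown below. It rests on assumptions that are not confirmed by this README: that the files are comma-separated with a header row, and that the `text` and `keywords_indices` columns are stored as Python-style list literals, as they appear in the sample above. The function name `read_rows` and the in-memory `SAMPLE_CSV` are hypothetical stand-ins used only for illustration.

```python
import ast
import csv
import io

# In-memory stand-in for a data file. The serialization (comma-separated,
# list columns written as Python list literals) is an assumption.
SAMPLE_CSV = '''text,keywords_indices,keywords_count
"['us', 'tanks', 'arrive', 'in', 'germany', 'to', 'help', 'nato', 'defences']","[0, 4, 7]",3
"['find', 'background', 'information', 'about', 'man', 'made', 'satellites']","[2, 5]",2
'''


def read_rows(handle):
    """Parse rows and resolve keyword indices to keyword tokens."""
    rows = []
    for record in csv.DictReader(handle):
        # literal_eval only accepts Python literals, so it is safer than eval.
        tokens = ast.literal_eval(record["text"])
        indices = ast.literal_eval(record["keywords_indices"])
        count = int(record["keywords_count"])
        # Sanity check: the stored count should equal the number of indices.
        if count != len(indices):
            raise ValueError(f"keywords_count mismatch for sentence: {tokens}")
        rows.append({
            "text": tokens,
            "keywords_indices": indices,
            "keywords_count": count,
            "keywords": [tokens[i] for i in indices],
        })
    return rows


rows = read_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["keywords_count"], row["keywords"])
```

To read one of the actual files, the `io.StringIO(...)` argument could be replaced with an opened file handle, for example `open("ConvKey_processed_min2max4.csv", newline="", encoding="utf-8")`, provided the assumptions above hold for that file.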