aliannejadi/ConvKey

ConvKey

Data

This repository contains the following files/folders:

  • ConvKey_processed_min2max4.csv - the main file, which was used to train the keyword extractor. It combines both datasets: the News-Keywords-based dataset and the IR-Keywords-based dataset. The data has been filtered so that every sentence has at least 2 and at most 4 keywords.
  • folder ./news_raw - contains raw data (sentences with keywords) collected from three news sources: BBC, Salon, and The Local (./news_raw/news_bbc.csv, ./news_raw/news_salon.csv, ./news_raw/news_thelocal.csv).
  • folder ./IR_kwds_raw - contains raw data (sentences with keywords) collected using the IR-Keywords-based method. The folder has 2 files: one contains sentences in which at least 1 and at most 3 keywords were identified, and the other contains sentences in which at least 2 and at most 4 keywords were identified (./IR_kwds_raw/qulac_min1_max3kwds.csv, ./IR_kwds_raw/qulac_min2_max4kwds.csv).

Structure of the data

All data files follow the same structure and contain the following attributes:

  • text - tokenized version of the original text
  • keywords_indices - indices of the keywords in the sentence
  • keywords_count - number of keywords in the sentence

For example, for the original sentence: “Conservatives and liberals drink different beer”

| Attribute | Data example | Comment |
| --- | --- | --- |
| text | ['conservatives', 'and', 'liberals', 'drink', 'different', 'beer'] | Tokenized sentence |
| keywords_indices | [0, 2, 5] | Meaning that the keywords are: ['conservatives', 'liberals', 'beer'] |
| keywords_count | 3 | Count of the keywords |
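
The mapping from keywords_indices back to keyword tokens can be sketched in a few lines of Python. This is a minimal illustration, not code from the repository: the helper name `extract_keywords` is hypothetical, and it assumes the indices are zero-based positions in the tokenized text, as the example above suggests.

```python
def extract_keywords(tokens, keyword_indices):
    """Return the tokens located at the given zero-based positions.

    `extract_keywords` is a hypothetical helper name used for illustration.
    """
    return [tokens[i] for i in keyword_indices]


# Values copied from the example above.
tokens = ['conservatives', 'and', 'liberals', 'drink', 'different', 'beer']
keywords_indices = [0, 2, 5]

keywords = extract_keywords(tokens, keywords_indices)
print(keywords)       # ['conservatives', 'liberals', 'beer']
print(len(keywords))  # 3, matching keywords_count
```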

Below is a sample of the data taken from the ConvKey_processed_min2max4.csv dataset:

| text | keywords_indices | keywords_count |
| --- | --- | --- |
| ['find', 'background', 'information', 'about', 'man', 'made', 'satellites'] | [2, 5] | 2 |
| ['bring', 'back', 'the', 'dark', 'how', 'our', 'overuse', 'of', 'artificial', 'light', 'is', 'changing', 'nighttime', 'for', 'the', 'worse'] | [6, 8, 9] | 3 |
| ['the', 'hidden', 'vast', 'cruelty', 'of', 'this', 'health', 'care', 'bill', 'an', 'attack', 'on', 'care', 'for', 'the', 'elderly', 'disabled', 'and', 'most', 'vulnerable'] | [2, 6, 7, 8] | 4 |
| ['us', 'tanks', 'arrive', 'in', 'germany', 'to', 'help', 'nato', 'defences'] | [0, 4, 7] | 3 |
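
A possible way to read rows like these with the Python standard library is sketched below. It rests on assumptions that should be checked against the actual files: that the CSV is comma-delimited with a header row, and that the text and keywords_indices cells are stored as Python-style list literals, as the sample suggests. The function name `read_rows` is hypothetical, and an in-memory string stands in for the real file so the example is self-contained.

```python
import ast
import csv
import io


def read_rows(handle):
    """Parse ConvKey-style rows from an open text handle.

    Assumes comma-delimited CSV with a header row, and list-valued cells
    written as Python literals (an assumption based on the sample above).
    """
    rows = []
    for record in csv.DictReader(handle):
        rows.append({
            "text": ast.literal_eval(record["text"]),
            "keywords_indices": ast.literal_eval(record["keywords_indices"]),
            "keywords_count": int(record["keywords_count"]),
        })
    return rows


# In-memory stand-in for the file, using two rows from the sample above.
sample_csv = '''text,keywords_indices,keywords_count
"['find', 'background', 'information', 'about', 'man', 'made', 'satellites']","[2, 5]",2
"['us', 'tanks', 'arrive', 'in', 'germany', 'to', 'help', 'nato', 'defences']","[0, 4, 7]",3
'''

rows = read_rows(io.StringIO(sample_csv))
for row in rows:
    # Look up each keyword token by its zero-based index.
    keywords = [row["text"][i] for i in row["keywords_indices"]]
    print(keywords, row["keywords_count"])
```

To read the real file, the `io.StringIO(...)` argument would be replaced with something like `open("ConvKey_processed_min2max4.csv", newline="", encoding="utf-8")`; the encoding here is also an assumption.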
