andreeaiana/nemig


NeMig

NeMig is a bilingual news collection with accompanying knowledge graphs on the topic of migration. The news corpora in German and English were collected from online media outlets in Germany and the US, respectively. NeMig contains rich textual and metadata information, sentiment and political orientation annotations, as well as named entities extracted from the articles' content and metadata and linked to Wikidata. The corresponding knowledge graphs (NeMigKG), one built from each corpus, are expanded with up to two-hop Wikidata neighbors of the initial set of linked entities.

Features

NeMigKG comes in four flavors, for both the German and the English corpora:

  • Base NeMigKG: contains literals and entities from the corresponding annotated news corpus;
  • Entities NeMigKG: derived from the Base NeMigKG by removing all literal nodes, it contains only resource nodes;
  • Enriched Entities NeMigKG: derived from the Entities NeMigKG by enriching it with up to two-hop neighbors from Wikidata, it contains only resource nodes and Wikidata triples;
  • Complete NeMigKG: the combination of the Base and Enriched Entities NeMigKG, it contains both literals and resources.
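
The relationship between the four flavors can be illustrated with a small sketch. This is not the repository's implementation: the function names, the toy identifiers (`ex:...`, `wd:E1`, `wdt:p`), and the choice to treat every resource node as a seed and follow only outgoing edges are all assumptions made for illustration. Triples are modelled as plain tuples, and literals as strings that start with a double quote, as in N-Triples.

```python
def is_literal(term):
    """Literals are modelled as quoted strings, as in N-Triples."""
    return term.startswith('"')


def entities_only(triples):
    """Base -> Entities: drop every triple whose object is a literal."""
    return {t for t in triples if not is_literal(t[2])}


def k_hop_enrich(triples, external, k=2):
    """Entities -> Enriched: add resource-only triples from `external`
    that are reachable within k outgoing hops of the nodes in `triples`."""
    # Index the external graph by subject, skipping literal objects.
    index = {}
    for s, p, o in external:
        if not is_literal(o):
            index.setdefault(s, []).append((s, p, o))

    frontier = {s for s, _, _ in triples} | {o for _, _, o in triples}
    seen = set(frontier)
    enriched = set(triples)
    for _ in range(k):
        next_frontier = set()
        for node in frontier:
            for s, p, o in index.get(node, []):
                enriched.add((s, p, o))
                if o not in seen:
                    seen.add(o)
                    next_frontier.add(o)
        frontier = next_frontier
    return enriched


# Toy data; all identifiers are made up.
base = {
    ("ex:article1", "ex:mentions", "wd:E1"),
    ("ex:article1", "ex:title", '"Some headline"'),
}
wikidata = {
    ("wd:E1", "wdt:p", "wd:E2"),
    ("wd:E2", "wdt:p", "wd:E3"),
    ("wd:E3", "wdt:p", "wd:E4"),  # three hops away, not reached with k=2
}

entities = entities_only(base)
enriched = k_hop_enrich(entities, wikidata, k=2)
complete = base | enriched  # Complete = Base combined with Enriched Entities
```

In this toy example the Entities graph keeps only the `ex:mentions` triple, the Enriched graph adds the two Wikidata triples within two hops, and the Complete graph restores the literal title.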

Project Structure

The directory structure of the project looks like this:

├── configs                             <- Hydra configuration files
│   ├── dataset                            <- Dataset configs
│   ├── entity_filtering                   <- Entity filtering configs
│   ├── experiment                         <- Experiment configs
│   ├── hydra                              <- Hydra configs
│   ├── kg_construction                    <- Knowledge graph construction configs
│   ├── kg_serialization                   <- Knowledge graph serialization configs
│   ├── named_entity_linking               <- Named entity linking configs
│   ├── named_entity_recognition           <- Named entity recognition configs
│   ├── sentiment_classification           <- Sentiment classification configs
│   │
│   ├── pipeline.yaml                   <- Main config for the pipeline
│
├── data                   <- Project data
│
├── logs                   <- Logs generated by hydra loggers
│
├── notebooks              <- Jupyter notebooks 
│
├── scripts                <- Shell scripts
│
├── src                             <- Source code
│   ├── dataset                            <- Dataset creation and processing
│   ├── entity_filtering                   <- Entity filtering model
│   ├── kg_construction                    <- Knowledge graph construction model
│   ├── kg_serialization                   <- Knowledge graph serialization model
│   ├── named_entity_linking               <- Named entity linking model
│   ├── named_entity_recognition           <- Named entity recognition model
│   ├── sentiment_classification           <- Sentiment classification model
│   ├── utils                              <- Utility scripts
│   │
│   └── pipeline.py                 <- Run pipeline
│
├── .gitignore                <- List of files ignored by git
├── requirements.txt          <- File for installing python dependencies
└── README.md

How to run

Install dependencies

# clone project
git clone https://github.com/andreeaiana/nemig
cd nemig

# [OPTIONAL] create conda environment
conda create -n nemig_env python=3.9
conda activate nemig_env

# install requirements
pip install -r requirements.txt

Download the mGENRE model, which is needed for running the entity linking model, as described in mGENRE.

Run the pipeline with a chosen experiment configuration from configs/experiment/

python src/main.py experiment=experiment_name.yaml

You can override any parameter from the command line like this

python src/main.py language='de' kg_construction.k_hop=1
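
To make the dotted-key syntax concrete, here is a minimal sketch of how such overrides map onto a nested configuration. It is not Hydra's implementation and uses only the standard library. The `apply_overrides` helper and the default values are hypothetical; only the keys `language` and `kg_construction.k_hop` come from the command above. Note that the shell strips the quotes around `'de'` before the program sees the argument.

```python
import ast


def apply_overrides(config, overrides):
    """Apply `a.b=value` style overrides to a nested dict, in place."""
    for item in overrides:
        dotted_key, _, raw = item.partition("=")
        # Interpret numbers and similar literals; fall back to a plain string.
        try:
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config


# Hypothetical defaults, standing in for the contents of the YAML configs.
defaults = {"language": "en", "kg_construction": {"k_hop": 2}}
config = apply_overrides(defaults, ["language=de", "kg_construction.k_hop=1"])
```

Each dot in a key descends one level into the configuration, so `kg_construction.k_hop=1` changes only the `k_hop` entry of the `kg_construction` group.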

Run the Subtopic Modelling notebook to extract sub-topics from the data and integrate the results in the pipeline.

The chosen version of NeMig will be constructed and cached in the cache folder. NeMigKG is serialized in N-Triples format, and the resulting files are placed in the kg folder.

Data

A sample of the annotated news corpora used to construct the knowledge graphs is available in the cache folder. Due to copyright policies, this sample does not contain the body of the articles. A full version of the news corpus is available upon request.

The anonymized user data for each dataset is available in the user data folder.

KG Triples

NeMigKG is hosted on Zenodo. All files are gzipped and in N-Triples format.

A sample of the triple files can be found in the kg folder. Due to copyright policies, these samples do not contain the body of the news articles.
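
As a rough illustration of working with gzipped N-Triples files, the sketch below streams a file line by line using only the standard library. The file name, the example IRIs, and the `iter_triples` helper are assumptions; the snippet writes its own tiny sample file so that it runs without the real data. The line splitting is deliberately naive, and a full RDF library such as rdflib is a more robust choice for real use.

```python
import gzip
import os
import tempfile
from collections import Counter


def iter_triples(path):
    """Yield (subject, predicate, object) from a gzipped N-Triples file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            subject, predicate, rest = line.split(None, 2)
            obj = rest.rstrip()
            if obj.endswith("."):
                obj = obj[:-1].rstrip()  # drop the terminating " ."
            yield subject, predicate, obj


# Write a tiny, made-up sample so the example is self-contained.
sample = (
    '<http://example.org/a1> <http://example.org/title> "A headline" .\n'
    "<http://example.org/a1> <http://example.org/mentions> <http://example.org/e1> .\n"
)
path = os.path.join(tempfile.mkdtemp(), "sample.nt.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

triples = list(iter_triples(path))
predicate_counts = Counter(p for _, p, _ in triples)
literal_count = sum(1 for _, _, o in triples if o.startswith('"'))
```

Streaming the file this way avoids decompressing it to disk first, which helps with the larger triple files.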

License

The code is licensed under the MIT License. The data files are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Citation

If you use the dataset, please cite:

@dataset{iana_andreea_2022_7908392,
  author       = {Iana, Andreea and
                  Alam, Mehwish and
                  Grote, Alexander and
                  Nikolajevic, Nevena and
                  Ludwig, Katharina and
                  Müller, Philipp and
                  Weinhardt, Christof and
                  Paulheim, Heiko},
  title        = {{NeMig - A Bilingual News Collection and Knowledge 
                   Graph about Migration}},
  month        = dec,
  year         = 2022,
  publisher    = {Zenodo},
  version      = {v1.0.1},
  doi          = {10.5281/zenodo.7442424},
  url          = {https://doi.org/10.5281/zenodo.7442424}
}
