Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts

latest commit 529cf6c441
chrishan authored September 05, 2012


Linking Entities in CommonCrawl Dataset onto Wikipedia Concepts. Live Demo


This project is our entry to the CommonCrawl contest. The idea is inspired by Google's release of the entity linking dataset, which provides a baseline for research on entity linking and other information retrieval and natural language processing tasks.

Human language is ambiguous, and synonymy and polysemy are fundamental problems in natural language processing (NLP) and information retrieval (IR). One approach to Word Sense Disambiguation (WSD) is to use an external ontology, e.g. Wikipedia, to determine the meaning of a word based on the probabilities that it maps to each of the possible Wikipedia concepts. Our entry aims to build such a corpus of anchortext-WikipediaConcept-Count triples from the CommonCrawl dataset, so as to benefit research on WSD, NLP and IR. More specifically, we extract every anchortext (the text you click on in a webpage link) that points to a Wikipedia page, together with the Wikipedia page it points to, and count how often each pair occurs. Based on the corpus, we developed this web application to demonstrate the anchortext-WikipediaConcept-Count structure.
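To make the triple structure concrete, here is a minimal Python sketch of the extraction step described above. It is an illustration only, not the code in this repository: the class name `WikipediaAnchorParser`, the helper `count_triples`, the lowercasing of anchortexts, and the rule that a Wikipedia link is any URL on a `wikipedia.org` host whose path starts with `/wiki/` are all assumptions.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import unquote, urlparse


class WikipediaAnchorParser(HTMLParser):
    """Collect (anchortext, Wikipedia page title) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._concept = None  # page title of the Wikipedia link currently open
        self._text = []       # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        url = urlparse(dict(attrs).get("href") or "")
        host = url.netloc.lower()
        is_wikipedia = host == "wikipedia.org" or host.endswith(".wikipedia.org")
        if is_wikipedia and url.path.startswith("/wiki/"):
            self._concept = unquote(url.path[len("/wiki/"):])
            self._text = []
        else:
            self._concept = None

    def handle_data(self, data):
        if self._concept is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._concept is not None:
            # Assumption: normalize whitespace and case so variants are merged.
            anchor = " ".join("".join(self._text).split()).lower()
            if anchor:
                self.pairs.append((anchor, self._concept))
            self._concept = None


def count_triples(pages):
    """Map (anchortext, concept) to the number of links seen across pages."""
    counts = Counter()
    for html in pages:
        parser = WikipediaAnchorParser()
        parser.feed(html)
        counts.update(parser.pairs)
    return counts
```

Calling `count_triples` on an iterable of HTML strings returns a `Counter` keyed by (anchortext, concept) pairs. At CommonCrawl scale the same per-page logic would need to run as a distributed job rather than in a single loop.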

Application scenarios

  • Given a concept (represented as a Wikipedia page), it can tell which terms people most commonly use to describe the concept. This can be seen as a form of "Explicit Topic Modeling". Example

  • Given a sentence, it can help identify entities (person, location, organization) in the sentence and map them onto Wikipedia concepts

  • Comparing CommonCrawl with Google in terms of the richness and precision of the anchortext-WikipediaConcept-Count corpus

  • For entity linking tasks, will combining both corpora boost performance compared with using each dataset individually?
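The first two scenarios above amount to simple lookups over the triples. The following Python sketch shows one way to do them, assuming the corpus is held in memory as a dict from (anchortext, concept) pairs to counts. The function names `top_anchors` and `concept_probabilities` are hypothetical, and the counts in the usage note are made up for illustration.

```python
def top_anchors(triples, concept, k=3):
    """Return the k most frequent anchortexts that link to a concept."""
    matches = [(anchor, n) for (anchor, c), n in triples.items() if c == concept]
    # Sort by descending count, breaking ties alphabetically for stable output.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return matches[:k]


def concept_probabilities(triples, anchor):
    """Estimate P(concept | anchortext) as a relative link frequency."""
    counts = {c: n for (a, c), n in triples.items() if a == anchor}
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}
```

With toy counts such as `{("apple", "Apple_Inc."): 30, ("apple", "Apple"): 10}`, `concept_probabilities(triples, "apple")` gives the company page a probability of 0.75 and the fruit page 0.25. An entity linker could use these values as a prior before looking at the surrounding sentence.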


Live Demo:

Help Spread

If you find our work interesting, please vote for our entry on the CommonCrawl Contest website and stay tuned for our release of the dataset.
