Wikipedia trending topic detection
Python topic detection module for SparkWiki. The module computes statistics, performs clustering, and assigns topics to clusters of trending Wikipedia pages extracted using the Anomaly Detection Algorithm. The topic classification model is available here. The module works with all language editions of Wikipedia.
- Compute degree, betweenness centrality and modularity for clustering the graph by events
- Match wikipages with their Qids (unique Wikidata IDs)
- Match wikipages with their corresponding topics
- Match wikipages with their pageviews
- Save a new corresponding graph with these attributes
- Give a graphical topics partition of each cluster
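The first and fifth features above can be illustrated with a minimal sketch. It is not the module's own code: the function name `annotate_graph` and the attribute names are assumptions, and it uses the Louvain implementation bundled with networkx (`louvain_communities`, available in networkx 2.8 and later) instead of the `python-louvain` package listed in the dependencies, so that the example needs only networkx.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity


def annotate_graph(G, seed=0):
    """Attach degree, betweenness and community id to every node of G.

    Returns the modularity score of the detected partition.
    Attribute names are illustrative, not the ones the module writes.
    """
    degree = dict(G.degree())
    betweenness = nx.betweenness_centrality(G)

    # Louvain community detection; the seed makes the result reproducible.
    communities = louvain_communities(G, seed=seed)
    modularity_class = {
        node: idx for idx, nodes in enumerate(communities) for node in nodes
    }

    nx.set_node_attributes(G, degree, "degree")
    nx.set_node_attributes(G, betweenness, "betweenness")
    nx.set_node_attributes(G, modularity_class, "modularity_class")
    return modularity(G, communities)


# Toy graph: two 5-node cliques joined by a single edge (nodes 4 and 5).
G = nx.barbell_graph(5, 0)
score = annotate_graph(G)
```

After the call, every node carries the three attributes, so the graph could then be written out with a networkx writer such as `nx.write_gexf` for inspection in Gephi.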
- numpy, matplotlib, pandas, networkx, requests
$ pip install python-louvain
- googletrans (Optional)
$ pip install googletrans
Put the graph file into a local folder
Language: EN, FR, RU, etc.
Date format: YYYYMMDD
Graph file name format:
To run the whole pipeline on a graph whose file name and folder path follow the required format (cf. Pre-requisites), run the following command in the terminal:
$ python main.py EN 20200316 20200331
The pipeline can also be run partially. To do so, pass an optional parameter from 1 to 7 to run only the part of the pipeline corresponding to the features described in the table below:
$ python main.py EN 20200316 20200331 1
||Compute degree, betweenness centrality and modularity|
||Save graph attributes|
||Give topics partition per cluster|
||Translate labels into English|
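As an illustration of the command-line interface shown above, here is a minimal sketch of how the arguments could be parsed and validated. The actual argument handling in `main.py` is not shown in this document, so the parser layout, the argument names (`language`, `start`, `end`, `step`) and the helper `yyyymmdd` are assumptions.

```python
import argparse
from datetime import date, datetime


def yyyymmdd(value):
    """Parse a date written as YYYYMMDD, rejecting anything else."""
    if len(value) != 8 or not value.isdigit():
        raise argparse.ArgumentTypeError(f"expected YYYYMMDD, got {value!r}")
    try:
        return datetime.strptime(value, "%Y%m%d").date()
    except ValueError as exc:
        raise argparse.ArgumentTypeError(str(exc))


def build_parser():
    # Hypothetical layout mirroring: python main.py EN 20200316 20200331 [step]
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("language", type=str.upper, help="language edition, e.g. EN")
    parser.add_argument("start", type=yyyymmdd, help="start date, YYYYMMDD")
    parser.add_argument("end", type=yyyymmdd, help="end date, YYYYMMDD")
    parser.add_argument(
        "step",
        nargs="?",
        type=int,
        choices=range(1, 8),
        default=None,
        help="optional pipeline step, 1 to 7 (omit to run everything)",
    )
    return parser


args = build_parser().parse_args(["EN", "20200316", "20200331", "1"])
full_run = build_parser().parse_args(["fr", "20200316", "20200331"])
```

Leaving `step` as `None` when it is omitted lets the caller distinguish a full run from a partial one without reserving a special number.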
Alternatively, one can run the Topics_exctraction.ipynb notebook. The notebook also includes the code generating visualisations.
Every stage of the pipeline generates and saves a .csv file with corresponding results.
The final step creates a /Figures folder with figures of the topics partition per cluster.
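The topics partition per cluster can be understood as the share of each topic among the pages of a cluster. The sketch below shows one way to compute it; the function `topic_shares`, the input mappings and the topic labels are hypothetical placeholders, not the module's data structures or the classifier's real label set.

```python
from collections import Counter, defaultdict


def topic_shares(page_cluster, page_topic):
    """Return {cluster: {topic: share}} for pages that have a topic."""
    counts = defaultdict(Counter)
    for page, cluster in page_cluster.items():
        topic = page_topic.get(page)
        if topic is not None:  # skip pages without an assigned topic
            counts[cluster][topic] += 1

    shares = {}
    for cluster, counter in counts.items():
        total = sum(counter.values())
        shares[cluster] = {topic: n / total for topic, n in counter.items()}
    return shares


# Toy input: page -> cluster id, and page -> topic label (placeholders).
page_cluster = {"A": 0, "B": 0, "C": 0, "D": 1}
page_topic = {"A": "Sports", "B": "Sports", "C": "Politics", "D": "Music"}
shares = topic_shares(page_cluster, page_topic)
```

Each inner dictionary sums to 1, so it can be passed directly to a pie or bar chart, one figure per cluster.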
Also, the final stage creates a graph file with all the computed attributes:
In order to explore the detected topics, the graph can be visualized in Gephi. We used Circle Pack Layout with modularity class as a partitioning attribute.
Wikipedia graphs of trending pages are available in Python/Result for the periods 16/08/2018 to 31/12/2018 and 17/12/2019 to 15/04/2020, for the DE, EN, ES, FR, IT, RU and ZH languages.
Topic_comparison.ipynb gives a topic comparison between the EN, FR and RU languages. The figures are saved in
Gephi files representing the graphs are also located in
Here you can see a visual example. The animation shows trending topics for the last four months of 2018. The graph visualization illustrates the graph computed for the period 1-15 March 2020.
Wikipedia trending topics detection: SparkWiki
Clustering of trending pages: Community detection
Topic classification model: Language-Agnostic Topic Classification
Labels translation: Googletrans