Birds of a feather flock together: A deep dive into communities of news outlets in Quotebank

Link to the project: Improvise, ADApt, Overcome 2021

Abstract

News sources today are diverse: newspapers, radio stations, websites and more. They are also widely accessible. The internet has allowed us to collect news from all around the world, covering practically any topic. From anywhere, we can read Russian sports news and hear about the latest celebrity gossip from Argentina.

The goal of this project is to organize these news outlets into communities and then analyze the attributes of the resulting communities using external datasets. Some of the factors we want to examine are geolocation, interviewees, owner, size, and focus.

We use the Quotebank dataset to cluster news outlets by the speakers they quote. Once clustered, we analyze each cluster based on news outlet features extracted from Wikidata.

Research questions

Can we find a meaningful way to group news outlets?
Once found, is there a way to find which attributes characterize these groups?
Do some attributes strongly predict how groups are formed?
Can we visually convey the identity of such groups?

Additional datasets

Wikidata, for publisher and speaker information

Methods

Dimensionality reduction to help clustering:

Latent Semantic Analysis

Clustering algorithms to group news outlets:

K-means
DBSCAN
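
As an illustration of how the dimensionality reduction and clustering steps fit together, here is a minimal sketch using scikit-learn. It is not the code from src/clustering.ipynb: the outlet names, the count matrix, and the parameter values are made up for the example, and `TruncatedSVD` is used as the usual scikit-learn way to perform Latent Semantic Analysis on a count matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

# Toy outlet-by-speaker count matrix (made-up numbers, not Quotebank data).
# Rows are outlets, columns are speakers, entries are quote counts.
outlets = ["sports_a", "sports_b", "politics_a", "politics_b"]
counts = np.array([
    [9, 7, 0, 0],
    [8, 6, 1, 0],
    [0, 1, 8, 9],
    [0, 0, 7, 8],
], dtype=float)

# LSA step: truncated SVD projects each outlet onto a few latent dimensions.
svd = TruncatedSVD(n_components=2, random_state=0)
embedding = svd.fit_transform(counts)

# Clustering step: K-means on the reduced representation.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(embedding)

for name, label in zip(outlets, labels):
    print(name, label)
```

On this toy input, the two outlets that quote the same speakers should receive the same label. `sklearn.cluster.DBSCAN` could be swapped in for `KMeans` when the number of clusters is not known in advance; its `eps` and `min_samples` parameters would need tuning on the real data.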

Find characteristic attributes from raw Wikidata:

Latent Dirichlet Allocation (topic extraction)
Semantic analysis with Empath

Find cluster predictors:

Classification methods such as logistic regression or random forests
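
One way to look for cluster predictors is to train a classifier on outlet features and inspect which features it relies on. The sketch below does this with a random forest on synthetic data; the two feature names and the data are hypothetical, and the approach in src/feature_importance_analysis.ipynb may be different.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_outlets = 200

# Synthetic features (hypothetical): one that determines the cluster label
# and one that is pure noise.
informative = rng.normal(size=n_outlets)
noise = rng.normal(size=n_outlets)
X = np.column_stack([informative, noise])

# Synthetic cluster label, fully determined by the first feature.
y = (informative > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(["informative", "noise"], forest.feature_importances_))
print(importances)
```

Here the informative feature should receive most of the importance. With logistic regression, the analogous step would be to standardize the features and compare the magnitudes of the fitted coefficients.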

Visualization tools for clusters:

three.js (3D representation)
Leaflet (Geographical representation)

Notebooks

Our pipeline is split across multiple notebooks:

News agency identification and feature extraction from Wikidata: src/wikidata-handling.ipynb
Analysis, transformation and cleaning of Quotebank: src/handling_quotes_bank.ipynb
Clustering methods exploration: src/clustering.ipynb
Describing quotes using LDA: src/lda.ipynb
Semantic analysis of clusters: src/semantic_analysis.ipynb
Analysis of cluster features: src/feature_importance_analysis.ipynb

Additionally, the src/utilities folder contains notebooks used to generate the data for the data story visualizations, and Python files used as libraries:

Finding journal locations and adding that info to the cluster file: src/utilities/cluster_location.ipynb
Finding the top speakers and top journals in each cluster: src/utilities/cluster-attributes.ipynb
Library to find the mean location from a list of coordinates: src/utilities/distances.py
Library to extract data from Wikidata (online/local): src/utilities/wiki_helpers.py
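
For readers curious about the mean-location helper listed above, here is one common way to average geographic coordinates. It is a sketch, not the contents of src/utilities/distances.py: the function name `mean_location` is hypothetical, and the method (averaging unit vectors on the sphere) is an assumption about how such a helper could work.

```python
import math


def mean_location(coords):
    """Return the (lat, lon) centroid, in degrees, of (lat, lon) pairs in degrees."""
    if not coords:
        raise ValueError("coords must not be empty")

    # Sum the 3D unit vectors corresponding to each point on the sphere.
    x = y = z = 0.0
    for lat_deg, lon_deg in coords:
        lat, lon = math.radians(lat_deg), math.radians(lon_deg)
        x += math.cos(lat) * math.cos(lon)
        y += math.cos(lat) * math.sin(lon)
        z += math.sin(lat)

    # Average the vectors, then convert the result back to latitude/longitude.
    n = len(coords)
    x, y, z = x / n, y / n, z / n
    lon = math.atan2(y, x)
    lat = math.atan2(z, math.hypot(x, y))
    return math.degrees(lat), math.degrees(lon)


print(mean_location([(0.0, 0.0), (0.0, 90.0)]))
```

Averaging vectors rather than raw latitude and longitude values avoids the wrap-around problem at the 180° meridian: two points at longitudes 179 and -179 average to a point near 180, not 0. The result is undefined when the points cancel out exactly, for example two antipodal points.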

Data story outline

Our data story will first introduce the problem, our research questions and the data. We will then briefly detail our methodology along with elementary statistics about the dataset.

The results of our analysis are presented in two interactive graphs. Each graph is accompanied by short insights we have found about our data.

The first graph is a 3D projection of our clusters. The reader can click on the cluster labels to show detailed information about each cluster. The information includes the top news sources, the most quoted speakers, and the features that best predict the cluster.
The second graph maps the news outlets geographically. The user can select which clusters to show on the map.

Timeline

[12.11] Merge pipeline stages and make sure the full pipeline works correctly. Create a small library of useful functions for the project
[19.11] Explore and analyze the initial results, and dig where interesting things happen
[03.12] Start working on data story and possible visualizations
[10.12] Finalize data story and create blog post

Team organization

Team assignments:

Erwan: data cleaning, feature importance analysis, data story
Julian: semantic analysis, LDA, data story
Luca: Wikidata extraction, geographical visualization & cluster properties panel, data story
Sylvain: clustering, 3D visualization, data story
