Project on media bias, studying 15 French news sites.

Ukratic/Media_bias

Media Bias

Structure

  1. News scraper
  2. Preprocessing & clustering
  3. Classification & dynamic distribution
  4. Analysis & political score (pending)
  5. Training a new BERT model (pending)

1. News scraper

Titles are scraped daily from 15 French news sites with rvest.
WARNING: News sites often change their page structure (news cycle, big headlines, redesigns...).
Check the HTML tags and CSS selectors before use.
Titles are mostly cleaned with regex; some stray spaces and slashes may remain.

library(rvest)    # html_elements() and the %>% pipe
library(stringr)  # str_squish()

# Remove HTML tags, then trim and collapse stray spaces and line breaks
clean_title = function(htmlString) {
  return(str_squish(gsub("<.*?>", "", htmlString)))
}

# Le Monde: all article titles, their count, and the main headline
# (le_monde is assumed to be the parsed home page, e.g. from read_html())
titles_lm = le_monde %>% html_elements(".article__title") %>% clean_title
num_articles_lm = length(titles_lm)
title_lm = le_monde %>% html_elements(".article__title-label") %>% clean_title

Output:

  • 1 headers file (all dates)
  • 1 big file with every title (all dates)

Selected journals:


In bold: main media

Why scrape in R, you may ask, when the rest of the project is in Python? No particular reason, really: I was just coding in R at the time!
If you want to scrape your own data and don't know R, the code can easily be adapted to Python using requests, BeautifulSoup and pandas.
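As a rough illustration of such a port, here is a minimal sketch that mirrors the R snippet above. To stay runnable without extra installs it deliberately uses only the Python standard library (`urllib`, `html.parser`, `re`) rather than requests and BeautifulSoup. The names `TitleParser`, `extract_titles` and `fetch_html` are hypothetical helpers, not part of this repository.

```python
import re
import urllib.request
from html.parser import HTMLParser

# Tags that never have a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta", "source", "wbr"}


def clean_title(html_string):
    """Port of the R helper: strip tags, then trim and collapse whitespace."""
    without_tags = re.sub(r"<.*?>", "", html_string, flags=re.DOTALL)
    return " ".join(without_tags.split())


class TitleParser(HTMLParser):
    """Collect the text of every element that carries a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.titles = []
        self._depth = 0  # > 0 while inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if self._depth > 0:
                self._buffer.append(" ")  # treat <br> and similar as a space
            return
        if self._depth > 0:
            self._depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            # Same whitespace squish as clean_title, applied to the text nodes.
            self.titles.append(" ".join("".join(self._buffer).split()))

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_titles(html, css_class):
    """Return the cleaned text of all elements with the given class."""
    parser = TitleParser(css_class)
    parser.feed(html)
    parser.close()
    return parser.titles


def fetch_html(url):
    """Download a page (hypothetical helper; the caller supplies the URL)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a page in hand, `extract_titles(page_html, "article__title")` would play the role of the `html_elements(".article__title") %>% clean_title` line. As with the R version, the class names need checking against the live site before use.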

2. Preprocessing & clustering

Some themes and recurring words clearly stand out in the data.
Word cloud

Steps:

  • Cleaning articles, removing stopwords, and lemmatizing with spaCy
  • Vectorizing into n-grams and checking topic distribution with Latent Dirichlet Allocation
  • Hierarchical and guided topic modeling with BERTopic
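
To make the first two steps concrete, here is a small sketch of stopword removal and n-gram counting. It is a standard-library stand-in, not the project's pipeline: it does not lemmatize (spaCy handles that in the project), it does not run LDA or BERTopic, and the `STOPWORDS` set is a tiny hypothetical sample rather than a real French stopword list.

```python
import re
from collections import Counter

# Tiny illustrative sample; a real run would use a full French stopword list.
STOPWORDS = {
    "le", "la", "les", "de", "des", "du", "un", "une",
    "et", "en", "au", "aux", "pour", "sur", "dans",
}

# Lowercase letters, including common French accented characters.
WORD_RE = re.compile(r"[a-zàâçéèêëîïôûùüÿœ]+")


def tokenize(title):
    """Lowercase, keep alphabetic tokens, drop stopwords and single letters."""
    return [
        tok for tok in WORD_RE.findall(title.lower())
        if len(tok) > 1 and tok not in STOPWORDS
    ]


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(titles, n_values=(1, 2)):
    """Count n-grams over all titles for each n in n_values."""
    counts = Counter()
    for title in titles:
        tokens = tokenize(title)
        for n in n_values:
            counts.update(ngrams(tokens, n))
    return counts
```

A count table like this is the kind of input that a vectorizer passes on to LDA. Note that n-grams are built after stopword removal here, so "guerre en Ukraine" yields the bigram "guerre ukraine".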

INA classifies French TV news into 14 topics.
For clarity and consistency, one goal is to describe the topic distribution using the same framework.
A short notebook for this plot is in the ina_topics folder.
INA topics

The full hierarchical topic model (BERTopic) is in the images folder (too large to display meaningfully here).

3. Classification & dynamic distribution

  • Plot dynamic topic distribution
  • Merge topics into something more readable & practical
  • Check labels and vocabulary consistency

Dynamic Distribution of topics

Most journals clearly have more articles on French Politics around April 10 and 24, 2022: the first and second rounds of the French presidential election.
French Politics is almost always the top topic in this period, though International news (mostly about Ukraine) takes the lead in February and March for some journals.
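
The numbers behind a plot like this could be computed along the following lines. This is a sketch under stated assumptions: each title is represented as a `(date, journal, topic)` tuple with ISO dates, and `MERGE_MAP` is a hypothetical example of merging fine-grained topics into broader ones. The actual labels and merges used in the project may differ.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from fine-grained topic labels to broader ones.
MERGE_MAP = {
    "presidential election": "French Politics",
    "government": "French Politics",
    "ukraine": "International",
}


def merge_topic(topic, merge_map=MERGE_MAP):
    """Map a fine-grained topic to its merged label (unchanged if unmapped)."""
    return merge_map.get(topic, topic)


def daily_topic_counts(rows, merge_map=MERGE_MAP):
    """Count titles per journal, per day, per merged topic.

    rows: iterable of (date, journal, topic) with date as 'YYYY-MM-DD'.
    Returns {journal: {date: Counter({topic: n_titles})}}.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    for date, journal, topic in rows:
        counts[journal][date][merge_topic(topic, merge_map)] += 1
    return counts


def top_topic_per_day(counts, journal):
    """Return {date: most frequent topic} for one journal, in date order."""
    return {
        date: day_counts.most_common(1)[0][0]
        for date, day_counts in sorted(counts[journal].items())
    }
```

Plotting one line per topic from `counts[journal]` gives the dynamic distribution for that journal, and `top_topic_per_day` is a quick way to check which topic leads on a given date.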

Daily average topics

All journals have a roughly similar distribution of topics, with the notable but unsurprising exception of Les Echos (a financial journal first and foremost).
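
One way to make this comparison quantitative is sketched below. It assumes that "daily average" means the number of titles per topic divided by the number of days on which a journal was scraped, and it then normalizes those averages into shares so that journals of different sizes can be compared. Both function names are hypothetical.

```python
from collections import Counter, defaultdict


def daily_average_topics(rows):
    """Average number of titles per topic per scraped day, for each journal.

    rows: iterable of (date, journal, topic) tuples.
    Returns {journal: {topic: average titles per day}}.
    """
    totals = defaultdict(Counter)  # journal -> topic -> total titles
    days = defaultdict(set)        # journal -> set of scraped dates
    for date, journal, topic in rows:
        totals[journal][topic] += 1
        days[journal].add(date)
    return {
        journal: {topic: n / len(days[journal]) for topic, n in topic_counts.items()}
        for journal, topic_counts in totals.items()
    }


def topic_shares(averages):
    """Normalize each journal's averages so that its shares sum to 1."""
    shares = {}
    for journal, topics in averages.items():
        total = sum(topics.values())
        shares[journal] = {topic: value / total for topic, value in topics.items()}
    return shares
```

Comparing the share dictionaries across journals is what would surface an outlier such as Les Echos, for instance through an unusually large share for an economy-related topic.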

4. Analysis & political score

Pending: graphs and exploration of bias.

5. Training a new BERT model

Pending.
a) Use the newly labeled data to train a classification model on topics of French news articles.
b) Use the newly labeled data to train a sentiment analysis model on bias in French news articles.
c) Use both models on a new dataset of French news articles.
