- News scraper
- Preprocessing & clustering
- Classification & dynamic distribution
- Analysis & political score (pending)
- Training a new BERT model (pending)
Titles are scraped daily from 15 French news sites using rvest.
WARNING: News sites often change their page structure (news cycle, big headline, overhaul...).
Check the HTML tags and CSS selectors before use.
Titles are mostly cleaned with regex; some stray spaces and slashes may remain.
library(rvest)    # html_elements() and the %>% pipe
library(stringr)  # str_squish()
# Function to remove HTML tags, stray spaces and line returns
clean_title = function(htmlString) {
  return(str_squish(gsub("<.*?>", "", htmlString)))
}
# Titles (articles) + title (main title) for Le Monde
# le_monde is assumed to hold the already parsed page (e.g. the result of read_html())
titles_lm = le_monde %>% html_elements(".article__title") %>% clean_title
num_articles_lm = length(titles_lm)
title_lm = le_monde %>% html_elements(".article__title-label") %>% clean_title
OUTPUT:
- 1 headers file (all dates)
- 1 big file with every title (all dates)
Selected outlets:
- Le Monde
- Le Figaro
- C News
- 20 Minutes
- Les Echos
- France Info
- Nouvel Obs
- La Croix
- Marianne
- BFM TV
- L'Express
- Le Point
- Valeurs Actuelles
- Libération
- L'Humanité
Political compass:
- Liberal center: BFM TV, L'Express, Le Point, Les Echos, 20 Minutes
- Conservative center: La Croix, Marianne
- Left: Le Monde, Nouvel Obs, Libération
- Center Left: France Info
- Right: Le Figaro, C News
- Hard Right: Valeurs Actuelles
- Hard Left: L'Humanité
In bold: main media
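For the Python side of the project, the compass above could be kept as a plain lookup table. This is only a sketch: the names `POLITICAL_COMPASS` and `outlets_by_leaning` are hypothetical and not taken from the project's code; the outlet labels are copied from the list above.

```python
# Outlet -> political leaning, copied from the compass above.
# The variable and function names are illustrative assumptions.
POLITICAL_COMPASS = {
    "BFM TV": "Liberal center",
    "L'Express": "Liberal center",
    "Le Point": "Liberal center",
    "Les Echos": "Liberal center",
    "20 Minutes": "Liberal center",
    "La Croix": "Conservative center",
    "Marianne": "Conservative center",
    "Le Monde": "Left",
    "Nouvel Obs": "Left",
    "Libération": "Left",
    "France Info": "Center Left",
    "Le Figaro": "Right",
    "C News": "Right",
    "Valeurs Actuelles": "Hard Right",
    "L'Humanité": "Hard Left",
}


def outlets_by_leaning(compass):
    """Invert the mapping: leaning -> list of outlets, in insertion order."""
    grouped = {}
    for outlet, leaning in compass.items():
        grouped.setdefault(leaning, []).append(outlet)
    return grouped


grouped = outlets_by_leaning(POLITICAL_COMPASS)
print(grouped["Right"])
```

Keeping the mapping in one place makes it easy to join a leaning onto every scraped title later on.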
Why scrape in R, you may ask, when the rest of the project is in Python? No particular reason really, I was just coding in R at the time!
If you want to scrape your own data and don't know R, the code can easily be adapted to Python using requests, BeautifulSoup and pandas.
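Here is a rough idea of what such a port could look like. To stay runnable without installing anything, this sketch uses only the standard library and parses an inline, made-up HTML snippet. In a real scraper, requests would fetch the page and BeautifulSoup's `select(".article__title")` would most likely replace the hypothetical `TitleCollector` class. `clean_title` mirrors the R function above.

```python
import re
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no closing tag


def clean_title(html_string):
    """Strip HTML tags, then collapse runs of whitespace (like str_squish)."""
    without_tags = re.sub(r"<.*?>", "", html_string, flags=re.DOTALL)
    return " ".join(without_tags.split())


class TitleCollector(HTMLParser):
    """Collect the cleaned text of every element carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.titles = []
        self._depth = 0    # > 0 while inside a matching element
        self._buffer = []  # text chunks of the current title

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
        elif self.css_class in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:
            self.titles.append(clean_title("".join(self._buffer)))

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


# Made-up page fragment standing in for a downloaded front page.
sample_page = """
<section>
  <h3 class="article__title">  Premier   titre <em>important</em> </h3>
  <h3 class="article__title">Second titre</h3>
  <p class="other">Ignored</p>
</section>
"""

collector = TitleCollector("article__title")
collector.feed(sample_page)
titles_lm = collector.titles
num_articles_lm = len(titles_lm)
print(num_articles_lm, titles_lm)
```

The same caveat applies as for the R version: the class names are whatever the site uses on a given day, so they need checking before each run.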
The data clearly shows some dominant themes and recurring words.
Steps:
- Cleaning articles, removing stopwords and lemmatizing with spaCy
- Vectorizing into n-grams and checking topic distribution with Latent Dirichlet Allocation
- Hierarchical and guided topic modeling with BERTopic
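To make the n-gram step above concrete, here is a dependency-free sketch of turning cleaned titles into a count matrix, which is the kind of input an LDA implementation expects. It is not the project's code. The tiny stopword set is a hypothetical stand-in for spaCy's French list, there is no lemmatization, and the function names are made up for illustration. In practice a library vectorizer (for instance scikit-learn's `CountVectorizer` with an `ngram_range`, which is an assumption about tooling) would do this job.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (hypothetical); spaCy ships a full French one.
STOPWORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "en", "à", "au"}


def tokenize(title):
    """Lowercase, keep alphabetic tokens, drop stopwords (no lemmatization here)."""
    tokens = re.findall(r"[^\W\d_]+", title.lower())
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(titles, max_n=2):
    """Return (vocabulary, rows): one n-gram count vector per title."""
    counts = [
        Counter(g for n in range(1, max_n + 1) for g in ngrams(tokenize(t), n))
        for t in titles
    ]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows


# Made-up titles, for illustration only.
titles = [
    "Le gouvernement annonce des mesures",
    "Les mesures du gouvernement",
]
vocabulary, rows = vectorize(titles)
print(vocabulary)
print(rows)
```

Each row is a bag of unigrams and bigrams for one title; word order only survives inside the bigrams.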
INA's thematic breakdown of French TV news comprises 14 topics.
For clarity and consistency, one of the goals will be to describe the topic distribution using the same framework.
A short notebook for this plot is in the ina_topics folder.
The full hierarchical BERT topic modeling is in the images folder (too large to display meaningfully here).
- Plot dynamic topic distribution
- Merge topics into something more readable & practical
- Check labels and vocabulary consistency
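The first two steps above can be sketched as follows: fine-grained topic labels are merged through a lookup table, then each topic's share of articles is computed per date, which is the quantity a dynamic distribution plot shows. Everything here is illustrative. `MERGE_MAP`, the function names, the fine-grained labels and the sample rows are assumptions; only the merged names "French Politics" and "International" echo topics mentioned in this README.

```python
from collections import Counter, defaultdict

# Hypothetical mapping from fine-grained topic labels to merged, readable ones.
MERGE_MAP = {
    "presidential election": "French Politics",
    "government": "French Politics",
    "ukraine war": "International",
}


def merge_topic(label, merge_map=MERGE_MAP):
    """Return the merged label, or the label itself if it has no mapping."""
    return merge_map.get(label, label)


def topic_shares_by_date(rows):
    """rows: iterable of (date, topic). Returns {date: {topic: share}}.

    Shares are fractions of that date's articles and sum to 1 per date.
    """
    counts = defaultdict(Counter)
    for date, topic in rows:
        counts[date][merge_topic(topic)] += 1
    shares = {}
    for date, counter in counts.items():
        total = sum(counter.values())
        shares[date] = {topic: n / total for topic, n in counter.items()}
    return shares


# Made-up labeled titles: (date, fine-grained topic).
labeled = [
    ("2022-04-10", "presidential election"),
    ("2022-04-10", "government"),
    ("2022-04-10", "ukraine war"),
    ("2022-04-10", "economy"),
    ("2022-04-11", "ukraine war"),
    ("2022-04-11", "economy"),
]
shares = topic_shares_by_date(labeled)
print(shares["2022-04-10"])
```

Using shares rather than raw counts keeps days with many articles comparable to quiet days.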
Most outlets clearly publish more articles on French Politics around April 10th and 24th, the first and second rounds of the French presidential election.
French Politics is almost always the top topic in this period, though International news (mostly about Ukraine) takes the lead in February and March for some outlets.
All outlets have a roughly similar distribution of topics, with the notable but unsurprising exception of Les Echos (a financial newspaper first and foremost).
Pending: graphs and exploration of bias.
Pending:
a) Use the newly labeled data to train a topic classification model for French news articles.
b) Use the newly labeled data to train a sentiment analysis model for bias in French news articles.
c) Apply both models to a new dataset of French news articles.