Archive of news front pages

Scraper

Collector of raw index.html files and the like. Should have started this 20 years ago!

Scraped sites:

| id | since | files | url |
| --- | --- | --- | --- |
| bild.de | 2022-01-28 | 26 | https://www.bild.de |
| compact-online.de | 2022-01-29 | 5 | https://www.compact-online.de/ |
| faz.net | 2022-01-29 | 14 | https://www.faz.net/ |
| fr.de | 2022-01-28 | 8 | https://www.fr.de/ |
| gmx.net | 2022-01-29 | 7 | https://www.gmx.net/ |
| heise.de | 2022-01-28 | 12 | https://www.heise.de/ |
| spiegel.de | 2022-01-28 | 21 | https://www.spiegel.de/ |
| spiegeldaily.de | 2022-01-28 | 5 | https://www.spiegeldaily.de/ |
| sueddeutsche.de | 2022-01-29 | 7 | https://www.sueddeutsche.de/ |
| t-online.de | 2022-01-29 | 8 | https://www.t-online.de/ |
| volksstimme.de | 2022-01-29 | 8 | https://www.volksstimme.de/ |
| web.de | 2022-01-29 | 15 | https://web.de |
| welt.de | 2022-01-29 | 16 | https://www.welt.de |
| zeit.de | 2022-01-29 | 16 | https://www.zeit.de/ |
| zeitfuerdieschule.de | 2022-01-29 | 5 | https://www.zeitfuerdieschule.de |
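
For illustration, a minimal sketch of the snapshot step: fetch each front page and store the raw HTML under a per-site, per-timestamp path. This is an assumed simplification, not the repository's actual code; the `SITES` mapping, directory layout, and file naming here are only examples.

```python
# Minimal sketch of a snapshot run (illustrative, not the repo's real scraper).
import datetime
import pathlib

import requests

SITES = {
    "spiegel.de": "https://www.spiegel.de/",
    "heise.de": "https://www.heise.de/",
}

def snapshot(base_dir: str = "docs") -> None:
    """Fetch every configured front page and save the raw HTML."""
    stamp = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H-%M")
    for site_id, url in SITES.items():
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        path = pathlib.Path(base_dir) / site_id / f"{stamp}.html"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(response.text, encoding="utf-8")

if __name__ == "__main__":
    snapshot()
```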

Well, let's see how far this goes with a free GitHub account. Many websites embed click-IDs and random UUIDs in their documents, so every file changes in every snapshot, even when the actual content stays the same.
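
One way to reduce that churn would be a normalization pass that blanks out volatile tokens before committing. The repository does not do this; the regex patterns and query parameter names below are hypothetical examples only.

```python
# Hypothetical normalization step: blank out UUIDs and tracking query
# parameters so unchanged pages do not produce spurious diffs.
import re

UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
# Parameter names here are examples, not a definitive list.
CLICK_ID_RE = re.compile(r"([?&](?:wt_mc|ref|cid)=)[^\"'&\s]+")

def normalize_html(html: str) -> str:
    """Replace volatile tokens with stable placeholders."""
    html = UUID_RE.sub("00000000-0000-0000-0000-000000000000", html)
    html = CLICK_ID_RE.sub(r"\1REMOVED", html)
    return html
```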

Anyway, currently each snapshot adds about 10 MB to the repository size (measured as the size of the .git directory). That's not going to work for long :-(

UPDATE

Okay, raw data is just too much. The snapshot rate is now set to once a month. I'll try to scrape just the article headlines and archive them in another repository.
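
A headline-only extraction could look roughly like the sketch below, using BeautifulSoup over a stored snapshot. This is an assumption about how it might be done, not the planned implementation; real front pages would need per-site selectors instead of generic heading tags.

```python
# Sketch of extracting only headlines from a stored snapshot (illustrative).
from bs4 import BeautifulSoup

def extract_headlines(html: str) -> list[str]:
    """Return the text of heading tags found in a front-page snapshot."""
    soup = BeautifulSoup(html, "html.parser")
    headlines = []
    for tag in soup.find_all(["h1", "h2", "h3"]):
        text = tag.get_text(" ", strip=True)
        if text:
            headlines.append(text)
    return headlines
```

Storing just these strings (one small text or JSON file per site and snapshot) would keep the archive orders of magnitude smaller than the raw HTML.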

TODO