Collector of raw index.html files and the like. Should have started this 20 years ago!
| id | since | files | url |
|---|---|---|---|
| bild.de | 2022-01-28 | 26 | https://www.bild.de |
| compact-online.de | 2022-01-29 | 5 | https://www.compact-online.de/ |
| faz.net | 2022-01-29 | 14 | https://www.faz.net/ |
| fr.de | 2022-01-28 | 8 | https://www.fr.de/ |
| gmx.net | 2022-01-29 | 7 | https://www.gmx.net/ |
| heise.de | 2022-01-28 | 12 | https://www.heise.de/ |
| spiegel.de | 2022-01-28 | 21 | https://www.spiegel.de/ |
| spiegeldaily.de | 2022-01-28 | 5 | https://www.spiegeldaily.de/ |
| sueddeutsche.de | 2022-01-29 | 7 | https://www.sueddeutsche.de/ |
| t-online.de | 2022-01-29 | 8 | https://www.t-online.de/ |
| volksstimme.de | 2022-01-29 | 8 | https://www.volksstimme.de/ |
| web.de | 2022-01-29 | 15 | https://web.de |
| welt.de | 2022-01-29 | 16 | https://www.welt.de |
| zeit.de | 2022-01-29 | 16 | https://www.zeit.de/ |
| zeitfuerdieschule.de | 2022-01-29 | 5 | https://www.zeitfuerdieschule.de |
Well, let's see how far this goes with a free GitHub account. Many websites embed click-IDs and random UUIDs in their documents, so every file changes in each snapshot.

Anyway, each snapshot currently adds about 10 MB to the repository size (the size of the .git directory). That's not going to work for long :-(
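One way to tame that noise (just a sketch, not something this repo does yet) would be to normalize each snapshot before committing, replacing volatile UUIDs with a fixed placeholder so that otherwise unchanged pages stay byte-identical. The `normalize_snapshot` helper and the UUID pattern below are assumptions for illustration; real click-ID formats vary per site.

```python
import re

# Matches the standard 8-4-4-4-12 hex UUID layout that many trackers embed.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

def normalize_snapshot(html: str) -> str:
    """Replace volatile UUIDs with a constant placeholder so that two
    snapshots of an unchanged page produce identical files (and thus
    no new git objects)."""
    return UUID_RE.sub("UUID", html)
```

Two fetches of the same page that differ only in their tracking IDs then normalize to the same bytes, which keeps the .git directory from growing on no-op snapshots.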
Okay, the raw data is just too much. The snapshot rate is now set to once a month. I'll try to scrape just the article headlines and archive them in another repository. Candidates:
- https://www.n-tv.de/
- https://www.handelsblatt.com/
- https://www.taz.de/
- https://www.wa.de/
- https://www.rnd.de/
- https://www.nzz.ch/
- https://www.bazonline.ch/
- https://www.focus.de/
- https://www.tagesschau.de/
- https://www.heise.de/tp/
- https://www.golem.de/
- https://www.kicker.de/
- https://www.achgut.com/
- https://www.stern.de/
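The headline-scraping idea could start as simply as collecting heading text with Python's stdlib `html.parser`. The `extract_headlines` function below is a hypothetical sketch; real sites will each need their own selectors, since headlines aren't always in plain heading tags.

```python
from html.parser import HTMLParser

class HeadlineParser(HTMLParser):
    """Collect the visible text of <h1>-<h3> elements."""

    def __init__(self):
        super().__init__()
        self.headlines = []   # finished headline strings
        self._depth = 0       # nesting level inside heading tags
        self._buf = []        # text fragments of the current headline

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3") and self._depth:
            self._depth -= 1
            # Collapse whitespace and inline markup into one line of text.
            text = " ".join("".join(self._buf).split())
            if text:
                self.headlines.append(text)
            self._buf = []

    def handle_data(self, data):
        if self._depth:
            self._buf.append(data)

def extract_headlines(html: str) -> list:
    """Return all heading texts found in an HTML document."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines
```

Archiving one line per headline (instead of the full page) should shrink a snapshot from megabytes to a few kilobytes per site.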