# Loading, Scraping, and Walking through Text Corpora as Datasets

---

## Ways to get data...

- [1. Read a file from your local system](#1)
- [2. Download text files, e.g. from gutenberg.org](#2)
- [2.1. If your IP is blocked for any reason](#3)
- [3. Scraping (static) text corpora from the darknet](#4)
- [4. Scraping (static) text corpora from the web](#5)
- [5. Scraping PDFs from the web](#6)
- [6. Scraping RSS feeds](#7)
- [7. Allison Parrish's Gutenberg Poetry Corpus](#8)
- [8. Ready-made datasets for text processing](#9)

---
---

<a class="anchor" id="1"></a>

### Read file from local system

In [1]:
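#set the path to the text file
filename = './data/alles-macht-weiter.txt'
#open the file in text mode (UTF-8 is assumed here because the text contains umlauts)
with open(filename, 'rt', encoding='utf-8') as file:
    #read the whole file into one string; the file is closed automatically afterwards
    amw1 = file.read()
#print the first 100 characters
print(amw1[0:100])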

Die Geschichtenerzähler machen weiter. die Autoindustrie macht weiter. die Arbeiter machen weiter. d
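
If a corpus consists of several text files rather than one, the same read step can be repeated for every file in a folder. The following is a minimal sketch, assuming all files are UTF-8 encoded and end in `.txt`; the helper name `load_corpus` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def load_corpus(folder, pattern="*.txt", encoding="utf-8"):
    """Read every file matching `pattern` in `folder` into a dict {filename: text}."""
    corpus = {}
    # sorted() makes the order of the files reproducible
    for path in sorted(Path(folder).glob(pattern)):
        corpus[path.name] = path.read_text(encoding=encoding)
    return corpus

# usage (assumes the ./data folder from the cell above exists):
# corpus = load_corpus('./data')
# print(list(corpus)[:5])
```

Keeping the file name as the dictionary key makes it easy to trace a passage back to its source file later on.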


<a class="anchor" id="2"></a>

---
### Download text files, e.g. from gutenberg.org (Crime and Punishment)

In [2]:
#import library
import requests

url = "http://www.gutenberg.org/files/2554/2554-0.txt"

#request the text
r = requests.get(url)
#the file is UTF-8 with a byte order mark; decode it as such so that no stray "ï»¿" appears
r.encoding = 'utf-8-sig'

#print the first 527 characters
print(r.text[0:527])

The Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook
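
As the output shows, the downloaded file starts with a license header rather than with the novel itself. A rough sketch of how the header and footer could be cut off is given below. It assumes that the body is framed by marker lines beginning with `*** START OF THE PROJECT GUTENBERG EBOOK` and `*** END OF THE PROJECT GUTENBERG EBOOK`; the exact wording varies between files, and the function name is hypothetical.

```python
import re

# Assumption: the body sits between two marker lines; the wording differs
# between Gutenberg files, so the patterns are kept deliberately loose.
START_RE = re.compile(r"^\*\*\* ?START OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF (THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines, or the text unchanged."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    if not start or not end or end.start() < start.end():
        # markers not found: leave the text as it is rather than guess
        return text
    return text[start.end():end.start()].strip()

# usage with the response from the cell above:
# body = strip_gutenberg_boilerplate(r.text)
# print(body[:200])
```

Returning the input unchanged when no markers are found keeps the helper safe to apply to texts from other sources.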


<a class="anchor" id="3"></a>

---

### If your IP is blocked for any reason

You can route your requests through the Tor SOCKS proxy, for example as shown below. Tor must be installed and running on your machine first, and `requests` needs SOCKS support (e.g. `pip install requests[socks]`).

In [None]:
#import library
import requests

#get your usual IP address
r = requests.get('http://httpbin.org/ip')
print("request:", r.text)

#create an empty session object without proxies
session = requests.session()
session.proxies = {}

#the session still reports your usual IP address
s = session.get('http://httpbin.org/ip')
print(s.text)

#add the Tor proxy (socks5h also resolves host names through the proxy)
session.proxies['http'] = 'socks5h://localhost:9050'
session.proxies['https'] = 'socks5h://localhost:9050'

#get the new IP address
t = session.get('http://httpbin.org/ip')
print(t.text)

In [None]:
#now get the data you'd like to scrape

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
t2 = session.get(url)
print(t2.text[0:527])
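
The cells above only work while a Tor process is listening on `localhost:9050`. Before sending requests through the proxy, it can help to check that the port answers at all. The sketch below uses only the standard library; the function name `port_is_open` is hypothetical, and a successful connection only shows that something is listening on the port, not that it is actually Tor.

```python
import socket

def port_is_open(host="localhost", port=9050, timeout=1.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # connection refused, timed out, or host not resolvable
        return False

# usage: only route requests through Tor if the SOCKS port answers
# if not port_is_open():
#     print("Nothing is listening on localhost:9050; is Tor running?")
```

If you use the Tor Browser instead of a standalone Tor service, its SOCKS port is typically 9150 rather than 9050.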

<a class="anchor" id="4"></a>

---
### Scraping text corpora (in HTML format) from the darknet

* List of libraries in the darknet: http://zqktlwiuavvvqqt4ybvgvi7tyo4hjl5xgfuvpdf6otjiycgwqbym2qad.onion/wiki/Libraries
* The Hidden Wiki: http://zqktlwiuavvvqqt4ybvgvi7tyo4hjl5xgfuvpdf6otjiycgwqbym2qad.onion/wiki/index.php/Main_Page

In [None]:
# reuse the Tor session from above and replace the web URL with an onion address
dw = session.get('http://libraryqtlpitkix.onion/library/Fiction/Stanislaw%20Lem%20-%20GOLEM%20XIV.txt')
#print(dw.headers, "\n")
print(dw.text[0:1527])

<a class="anchor" id="5"></a>

---
### Download (static) text corpora as HTML from the web

In [4]:
from urllib import request
url = "https://taz.de/Vorwuerfe-von-schwarzer-KI-Forscherin/!5730475/"
html = request.urlopen(url).read().decode('utf8')
html[:160]

'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:my="mynames" lang="de"><!-- DEBUG start 22:39:30+01:00 page_id=4627 :: Netzökonomie--><!--\n\t\tCo'

In [7]:
from bs4 import BeautifulSoup
import nltk
raw = BeautifulSoup(html, 'html.parser').get_text()
print(type(raw))
print(raw[438:900])
tokens = nltk.word_tokenize(raw)
print(tokens[300:900])

<class 'str'>
Vorwürfe von schwarzer KI-Forscherin: Proteste bei Google

Die bekannte KI-Forscherin Timnit Gebru verlässt Google im Streit. Grund ist eine Studie zu Sprachverarbeitung, die dem Konzern nicht passt.
Forscherin Timnit Gebru wirft ihrem Ex-Arbeitgeber Zensur vor  Foto: Kristin Callahan/ZUMA Press/imago
Google liebt sein Image als uneigennütziger Tech-Konzern. Da passt es nicht gut ins Bild, wenn Tausende Mitarbeiter:innen protestieren und in einem offenen Bri
['würden', 'damit', 'die', 'Umwelt', 'belasten', '.', 'Auch', 'könnten', 'derartige', 'Systeme', 'für', 'Desinformation', 'missbraucht', 'werden', '.', 'Google', 'wies', 'den', 'Entwurf', 'intern', 'zurück', ',', 'weil', 'er', 'angeblich', 'nicht', 'genügend', 'aktuelle', 'Studien', 'berücksichtige', '.', 'Wis\xadsen\xadschaft\xadler', ':', 'in\xadnen', 'sprechen', 'hingegen', 'von', 'Zensur', '.', 'In', 'der', 'Literaturliste', 'sind', 'indes', 'mehr', 'als', '128', 'Verweise', 'angeführt', '.', 'Schon', 'im', 'Jahr'
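
The token list above contains entries such as `'Wis\xadsen\xadschaft\xadler'`. The `\xad` characters are soft hyphens that the website inserts as invisible hyphenation hints. They survive `get_text()` and would make a word with hints count as different from the same word without them. A small sketch of removing them before tokenizing follows; the function name is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"  # the same character that appears as '\xad' in the token list

def remove_soft_hyphens(text):
    """Delete invisible soft hyphens so that words compare equal again."""
    return text.replace(SOFT_HYPHEN, "")

sample = "Wis\xadsen\xadschaft\xadler:in\xadnen sprechen von Zensur."
print(remove_soft_hyphens(sample))
# Wissenschaftler:innen sprechen von Zensur.
```

In the notebook this would be applied to `raw` before calling `nltk.word_tokenize(raw)`.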

In [30]:
#nltk.word_tokenize?

<a class="anchor" id="6"></a>

---
### Download PDFs from the web

In [31]:
#import libraries
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#set URL
url = "https://www.christian-lindner.de/reden"

#create the target folder (including parent folders) if it does not exist yet
folder_location = r'./data/lindner-talks'
os.makedirs(folder_location, exist_ok=True)

#find all PDF links on this page and store the files in the folder
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    #name each PDF file after the last portion of its link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
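
The loop above takes the file name straight from the `href` attribute, which works for this page. On other pages the links may be relative or percent-encoded (for example `%20` for a space). The following sketch, with the hypothetical helper name `pdf_target`, shows one way to derive both the absolute download URL and a readable local file name using only the standard library.

```python
import os
from urllib.parse import urljoin, urlparse, unquote

def pdf_target(page_url, href, folder):
    """Return (absolute_url, local_path) for a PDF link found on page_url."""
    # resolve relative links against the page they were found on
    absolute_url = urljoin(page_url, href)
    # take the file name from the URL path only and decode %XX escapes
    name = os.path.basename(unquote(urlparse(absolute_url).path))
    return absolute_url, os.path.join(folder, name)

# example.org is a placeholder domain, not a real source of PDFs
print(pdf_target("https://example.org/reden", "/files/Rede%201.pdf", "data"))
```

Inside the download loop, the two return values would replace the `urljoin(...)` call and the `filename` variable.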

<a class="anchor" id="7"></a>

---
### Processing RSS Feeds

In [32]:
#!pip install feedparser
import feedparser
from bs4 import BeautifulSoup
from nltk import word_tokenize

#define the page to parse
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
#define what you want to see (the title of the feed-page):
llog['feed']['title']

'Language Log'

In [33]:
#how many posts does the feed have?
len(llog.entries)

13

In [34]:
#set a variable for the third post
post = llog.entries[2]
#print the title from it
post.title

'Aristotelian aerosols?'

In [35]:
#set variable for its content
content = post.content[0].value
#print the first 70 characters
content[:70]

'<p>Kasha Patel, "<a href="https://www.washingtonpost.com/weather/2022/'

In [37]:
#parse the html content with beautifulsoup as text
raw = BeautifulSoup(content, 'html.parser').get_text()
#print it
print(raw[:200])

Kasha Patel, "Covid-19 may have seasons for different temperature zones, study suggests", WaPo 1/28/2022:
Aerosol researcher and co-author Chang-Yu Wu explained that local humidity and temperature pla
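
The cells above handle a single post. To turn the whole feed into a small corpus, the same two steps (take the HTML content, strip the markup) can be applied to every entry. The sketch below is an illustration under two assumptions: every entry has a `title` and a `content[0].value` field, as the post above does, and a simple tag stripper built on the standard library's `html.parser` is good enough in place of BeautifulSoup. The names `html_to_text` and `collect_posts` are hypothetical.

```python
from html.parser import HTMLParser
from types import SimpleNamespace

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html):
    """Return the text of an HTML fragment with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

def collect_posts(entries):
    """Build a list of {'title', 'text'} dicts from feed entries."""
    return [
        {"title": e.title, "text": html_to_text(e.content[0].value)}
        for e in entries
    ]

# stand-in for llog.entries so that the sketch runs without a network connection
fake_entries = [
    SimpleNamespace(title="A", content=[SimpleNamespace(value='<p>Hello <a href="x">world</a></p>')]),
]
print(collect_posts(fake_entries))
```

With the real feed, `collect_posts(llog.entries)` would be the call; feeds that only provide a summary instead of full content would need a different field.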


<a class="anchor" id="8"></a>

---
## Allison Parrish's Gutenberg Poetry Corpus
see: https://github.com/aparrish/gutenberg-poetry-corpus

By [Allison Parrish](https://www.decontextualize.com/)

Allison Parrish made a corpus of around three million lines of poetry from Project Gutenberg. In her notebook [A Project Gutenberg Poetry Corpus: Quick Experiments](https://github.com/aparrish/gutenberg-poetry-corpus/blob/master/quick-experiments.ipynb) she shows a couple of quick examples and experiments in using the corpus in Python. The following examples are taken from that notebook:

---

First, download the corpus via this [link](http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz) and store it in the same folder as this notebook.

The file is in gzipped [newline delimited JSON format](http://ndjson.org/): there's a JSON object on each line. You don't need to decompress the file to work with it, since Python has a handy library for working with gzipped files right in the code. The following cell will read in the file and create a list `all_lines` that contains all of these JSON objects.

In [9]:
# download it via `curl`
##!curl -O http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 52.2M  100 52.2M    0     0  2914k      0  0:00:18  0:00:18 --:--:-- 2584k


In [23]:
#read the gzipped file line by line and parse each line as JSON
import gzip, json
all_lines = []
for line in gzip.open("gutenberg-poetry-v001.ndjson.gz"):
    all_lines.append(json.loads(line.strip()))

In [24]:
#draw 8 random lines from it
import random
random.sample(all_lines, 8)

[{'s': 'WHAT will you give me, O World, O World!', 'gid': '39032'},
 {'s': 'From thee in more than pristine beauty rise,', 'gid': '27663'},
 {'s': "I ask'd my fair, one happy day,", 'gid': '24815'},
 {'s': "For o'er the crackling fire he heard", 'gid': '214'},
 {'s': 'green for so hot a day.', 'gid': '13118'},
 {'s': 'Handsomest of all the people,', 'gid': '25953'},
 {'s': 'Too joyous to last,', 'gid': '19525'},
 {'s': 'A twelvemonth and a day,', 'gid': '27441'}]

Each object has a key `s` that contains the text of the line of poetry, and a key `gid` that contains the Project Gutenberg ID of the file in question. You can use this ID to look up the title and author of the book of poetry that the line came from (either using the [Project Gutenberg website](https://www.gutenberg.org/) or using pre-built metadata from, e.g., [Gutenberg, dammit](https://github.com/aparrish/gutenberg-dammit/)).
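
Because every line carries its `gid`, the corpus can also be summarized per book. The sketch below counts how many lines each Project Gutenberg ID contributes. The sample data is hand-made for illustration and only mimics the shape of `all_lines`; the function name is hypothetical.

```python
from collections import Counter

def lines_per_book(lines):
    """Count how many poetry lines each Project Gutenberg ID contributes."""
    return Counter(line["gid"] for line in lines)

# hand-made sample in the same shape as `all_lines` (not real corpus data)
sample_lines = [
    {"s": "first line", "gid": "1"},
    {"s": "second line", "gid": "1"},
    {"s": "third line", "gid": "2"},
]
counts = lines_per_book(sample_lines)
print(counts.most_common(1))
# [('1', 2)]
```

On the real corpus, `lines_per_book(all_lines).most_common(10)` would list the ten books with the most lines.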

In [25]:
randompoem = random.sample(all_lines, 8)
print(randompoem)
print("～ ❀ ～")

randompoem_t = [line['s'] for line in randompoem]
print(randompoem_t)
print("～ ❀ ～")

randompoem_lb = "\n".join(randompoem_t)
print(randompoem_lb)

[{'s': 'The sound of fight is silent long', 'gid': '5720'}, {'s': 'That I must needs dismount, and search on foot', 'gid': '1365'}, {'s': "Let's cut from all precedent loose,", 'gid': '35033'}, {'s': "Perhaps there's buried treasure out", 'gid': '32553'}, {'s': 'Will enjoy the sunset, the pouring-in of the flood-tide, the', 'gid': '1322'}, {'s': "And reach'd the Bay of Trinity, dark, lone,", 'gid': '37365'}, {'s': 'Such witness yield; a monarch from his throne', 'gid': '4272'}, {'s': 'CCCXLI. To Mrs. Dunlop. Her friendship. A farewell', 'gid': '18500'}]
～ ❀ ～
['The sound of fight is silent long', 'That I must needs dismount, and search on foot', "Let's cut from all precedent loose,", "Perhaps there's buried treasure out", 'Will enjoy the sunset, the pouring-in of the flood-tide, the', "And reach'd the Bay of Trinity, dark, lone,", 'Such witness yield; a monarch from his throne', 'CCCXLI. To Mrs. Dunlop. Her friendship. A farewell']
～ ❀ ～
The sound of fight is silent long
That I must ne

---

You could also, for example, search the random output for a specific word:

In [27]:
import re
#search the joined lines for the word 'sound'
match = re.search('sound', randompoem_lb)
print(match)

<re.Match object; span=(4, 9), match='sound'>


In [28]:
# or find a specific word in the whole corpus and print the matching lines (or 8 of them)
dido_line = [line['s'] for line in all_lines if re.search('Dido', line['s'])]
random.sample(dido_line, 8)

['The banquet-hall of Dido has remained throughout this recital in',
 'Grant that it had been, whom should Dido dread,',
 'because Dido is dead.]',
 "Lo, such was Dido; joyously she bore herself e'en such",
 "Wars, and filial faith, and Dido's pyre;",
 'There now did Dido, Sidon-born, uprear a mighty fane',
 'Now Dido leads',
 'here!" (577-661). Dido speaks him fair and echoes his words, "If']
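
A plain `re.search('Dido', ...)` is case-sensitive and also matches longer words that merely contain the search string. If whole-word matches are wanted, a word-boundary pattern can be used instead. The following is a minimal sketch with a hypothetical helper `find_lines`; the sample lines are partly taken from the output above and partly made up to show the difference.

```python
import re

def find_lines(lines, word, ignore_case=True):
    """Return the text of every line that contains `word` as a whole word."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b marks a word boundary; re.escape protects special characters in `word`
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", flags)
    return [line["s"] for line in lines if pattern.search(line["s"])]

# small sample in the shape of `all_lines`; the gid values are placeholders
sample_lines = [
    {"s": "Now Dido leads", "gid": "0"},
    {"s": "Wars, and filial faith, and Dido's pyre;", "gid": "0"},
    {"s": "a Didot typeface", "gid": "0"},      # made-up line, must not match
    {"s": "dido in lower case", "gid": "0"},    # made-up line, matches only if case is ignored
]
print(find_lines(sample_lines, "Dido", ignore_case=False))
```

On the full corpus the call would be `find_lines(all_lines, 'Dido')`.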

<a class="anchor" id="9"></a>

# Ready-Made Datasets for Text Processing
---

Import the `datasets` library from [Hugging Face](https://huggingface.co/):

In [18]:
import datasets

All the information you need about this library can be found here: https://pypi.org/project/datasets/

### Example:
1. Download the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/)
2. Load the training split
3. Print the second training example

In [20]:
from datasets import load_dataset
print(load_dataset('squad', split='train')[1])

Reusing dataset squad (/home/whoami/.cache/huggingface/datasets/squad/plain_text/1.0.0/4c81550d83a2ac7c7ce23783bd8ff36642800e6633c1f18417fb58c3ff50cdd7)


{'answers': {'answer_start': [188], 'text': ['a copper statue of Christ']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f4190066117f', 'question': 'What is in front of the Notre Dame Main Building?', 'title': 'University_of_Notre_Dame'}
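
In the printed example, `answer_start` is a character offset into `context`, so the answer text can be recovered by slicing. The sketch below illustrates this with a shortened copy of the example above (the context is cut off after the sentence containing the answer); the function name `answer_spans` is hypothetical.

```python
def answer_spans(example):
    """Slice each answer out of the context using its character offset."""
    context = example["context"]
    answers = example["answers"]
    return [
        context[start:start + len(text)]
        for start, text in zip(answers["answer_start"], answers["text"])
    ]

# shortened copy of the SQuAD example printed above
example = {
    "answers": {"answer_start": [188], "text": ["a copper statue of Christ"]},
    "context": "Architecturally, the school has a Catholic character. Atop the Main "
               "Building's gold dome is a golden statue of the Virgin Mary. Immediately "
               "in front of the Main Building and facing it, is a copper statue of Christ "
               'with arms upraised with the legend "Venite Ad Me Omnes".',
}
print(answer_spans(example))
# ['a copper statue of Christ']
```

The same function can be applied to any item returned by `load_dataset('squad', split='train')`, since each item has the same keys.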
