# Loading / Scraping / Walking through Text Corpora as Datasets

---

## Ways to get data...

- [Read a file from your local system](#1)
- [Download text files, e.g. from gutenberg.org](#2)
- [Scraping (static) text corpora from the web](#5)
- [Processing RSS feeds](#7)
- [Ready-made datasets from Hugging Face](#8)
- [Extract text from Wikipedia](#9)

---
---

<a class="anchor" id="1"></a>

# Read a file from the local system

In [5]:
# path to the text file
filename = './data/alles-macht-weiter.txt'
# open the file in text mode for reading
file = open(filename, 'rt')
# read the whole file into one string
amw1 = file.read()
# close the file handle
file.close()
# print the first 193 characters
print(amw1[0:193])

Die Geschichtenerzähler machen weiter. die Autoindustrie macht weiter. die Arbeiter machen weiter. die Regierungen machen weiter. die Rock’n’Roll-Sänger machen weiter. die Preise machen weiter.
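
The open/read/close steps above can also be written with a `with` block, which closes the file even if reading fails. The following is a minimal sketch, not the notebook's own code: the helper name `read_text`, the throwaway demo file, and the `utf-8` encoding are assumptions made for illustration.

```python
import os
import tempfile

def read_text(path, encoding="utf-8"):
    """Return the full contents of a text file as one string."""
    # the with-block closes the file even if reading raises an error
    with open(path, "rt", encoding=encoding) as handle:
        return handle.read()

# demo on a throwaway file so the sketch runs without ./data/
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt")
    with open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("Die Preise machen weiter.")
    demo_text = read_text(demo_path)

# slicing a string works the same way as in the cell above
print(demo_text[:10])
```

Passing the encoding explicitly avoids depending on the platform default, which matters for text with umlauts or typographic apostrophes.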


<a class="anchor" id="2"></a>

---
# Download text files, e.g. from gutenberg.org (Kafka's »Die Verwandlung«)

In [7]:
#import library
import requests

url = "https://gutenberg.org/cache/epub/22367/pg22367.txt"

#request the text
r = requests.get(url)

#print the first 527 characters
print(r.text[0:527])

﻿The Project Gutenberg EBook of Die Verwandlung, by Franz Kafka

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org


Title: Die Verwandlung

Author: Franz Kafka

Release Date: August 21, 2007 [EBook #22367]

Language: German


*** START OF THIS PROJECT GUTENBERG EBOOK DIE VERWANDLUNG ***




Produced by J
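
The output above shows that the download starts with a licence header before the actual text, introduced by a `*** START OF ...` marker line. A rough sketch of cutting the text down to the part between the markers follows. The function name is hypothetical, the sample string is made up, and the matching `*** END OF ...` line is an assumption about the end of the file, which the output above does not show.

```python
import re

# marker lines look like "*** START OF THIS PROJECT GUTENBERG EBOOK ... ***";
# the END pattern is an assumption, other files may word the markers differently
START_RE = re.compile(r"^\*\*\* ?START OF.*\*\*\*\s*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF.*\*\*\*\s*$", re.MULTILINE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END marker lines."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    # fall back to the full text if a marker is missing
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

# made-up miniature file for illustration
sample = (
    "Title: Die Verwandlung\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK DIE VERWANDLUNG ***\n"
    "\n"
    "Body text.\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK DIE VERWANDLUNG ***\n"
    "License text\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

With the real download, the same function would be called on `r.text`.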


<a class="anchor" id="3"></a>

<a class="anchor" id="5"></a>

---
# Download (static) Text Corpora as HTML from the Web

In [8]:
from urllib import request
url = "https://taz.de/Vorwuerfe-von-schwarzer-KI-Forscherin/!5730475/"
html = request.urlopen(url).read().decode('utf8')
html[:160]

'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml" xmlns:my="mynames" lang="de"><!-- DEBUG start 17:32:42+01:00 page_id=4627 :: Netzökonomie--><!--\n\t\tCo'

In [12]:
from bs4 import BeautifulSoup
import nltk
raw = BeautifulSoup(html, 'html.parser').get_text()
print(type(raw))
print(raw[438:900])

<class 'str'>
portBerlinNordWahrheit

5730475


Vorwürfe von schwarzer KI-Forscherin: Proteste bei Google

Die bekannte KI-Forscherin Timnit Gebru verlässt Google im Streit. Grund ist eine Studie zu Sprachverarbeitung, die dem Konzern nicht passt.
Forscherin Timnit Gebru wirft ihrem Ex-Arbeitgeber Zensur vor  Foto: Kristin Callahan/ZUMA Press/imago
Google liebt sein Image als uneigennütziger Tech-Konzern. Da passt es nicht gut ins Bild, wenn Tausende Mitarbeiter:innen pro
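
The `get_text()` output above still contains navigation residue (`portBerlinNordWahrheit`) because it returns the text of every element on the page. One way to keep only running text is to collect just the `<p>` elements. The sketch below uses only the standard library's `html.parser` so that it runs without extra packages; the class name, the helper, and the sample HTML are made up for illustration. With BeautifulSoup, `find_all('p')` would be the comparable call.

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collect the text of every <p> element and ignore everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraph strings
        self._in_p = False     # True while inside a <p> element
        self._chunks = []      # text pieces of the current paragraph

    def _flush(self):
        # store the paragraph collected so far, if it is not empty
        text = "".join(self._chunks).strip()
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()      # an unclosed <p> ends when the next one starts
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

def extract_paragraphs(html_source):
    parser = ParagraphExtractor()
    parser.feed(html_source)
    parser.close()
    return parser.paragraphs

# made-up page: a navigation bar followed by two paragraphs
sample_html = (
    "<html><body><nav>Berlin Nord Wahrheit</nav>"
    "<p>First <b>paragraph</b>.</p><p>Second paragraph.</p>"
    "</body></html>"
)
paragraphs = extract_paragraphs(sample_html)
print(paragraphs)
```

Whether the article body of a given site really sits in `<p>` elements has to be checked per page.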


<a class="anchor" id="6"></a>

<a class="anchor" id="7"></a>

---
# Processing RSS Feeds

In [14]:
#!pip install feedparser
import feedparser
from bs4 import BeautifulSoup
from nltk import word_tokenize

#define the page to parse
llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
#define what you want to see (the title of the feed-page):
llog['feed']['title']

'Language Log'

In [15]:
# how many posts does the feed contain?
len(llog.entries)

13

In [16]:
# pick the third post (index 2)
post = llog.entries[2]
# show its title
post.title

'How Many Languages Are There in China?'

In [17]:
#set variable for its content
content = post.content[0].value
# show the first 70 characters
content[:70]

'<p>At least three hundred.</p>\n<p>I like the title, not the one on the'

In [18]:
#parse the html content with beautifulsoup as text
raw = BeautifulSoup(content, 'html.parser').get_text()
#print it
print(raw[:200])

At least three hundred.
I like the title, not the one on the first panel, but the one at the top of each frame, which I have also given as the title of this post.
You probably don't have time to watch
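
To make the structure behind `llog.entries` more tangible: an Atom feed is an XML document whose `<entry>` elements each carry a title and HTML content. The sketch below reads such a document with the standard library only. It is an illustration of the data layout, not a replacement for `feedparser`; the sample feed and the function name are made up.

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace
ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# hypothetical two-entry feed, made up for this sketch
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry>
    <title>First post</title>
    <content type="html">&lt;p&gt;Hello feed.&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Second post</title>
    <content type="html">&lt;p&gt;More text.&lt;/p&gt;</content>
  </entry>
</feed>"""

def read_entries(atom_xml):
    """Return (title, content) pairs for every entry in an Atom document."""
    root = ET.fromstring(atom_xml)
    pairs = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        content = entry.findtext("atom:content", default="", namespaces=ATOM)
        pairs.append((title, content))
    return pairs

feed_entries = read_entries(SAMPLE_FEED)
print(len(feed_entries))
print(feed_entries[1][0])
```

As in the cells above, the content of each entry is still HTML and needs a further step to become plain text.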


---
# Ready-made Datasets from Hugging Face

Import the `datasets` library from [Hugging Face](https://huggingface.co/).

In [18]:
import datasets

All the information you need about this library is available here: https://pypi.org/project/datasets/

### Example:
1. Download the [Stanford Question Answering Dataset (SQuAD)](https://rajpurkar.github.io/SQuAD-explorer/)
2. Load the training split
3. Print the second training example

In [20]:
from datasets import load_dataset
print(load_dataset('squad', split='train')[1])

Reusing dataset squad (/home/whoami/.cache/huggingface/datasets/squad/plain_text/1.0.0/4c81550d83a2ac7c7ce23783bd8ff36642800e6633c1f18417fb58c3ff50cdd7)


{'answers': {'answer_start': [188], 'text': ['a copper statue of Christ']}, 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'id': '5733be284776f4190066117f', 'question': 'What is in front of the Notre Dame Main Building?', 'title': 'University_of_Notre_Dame'}
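
The record printed above stores each answer as a character offset (`answer_start`) into `context`. The following sketch shows how the answer text can be sliced back out of the context with that offset. It works on a plain dictionary copied from the output, with the context shortened to its first three sentences, so it runs without downloading the dataset; the function name is hypothetical.

```python
# record copied from the output above; context shortened for this sketch
example = {
    "answers": {"answer_start": [188], "text": ["a copper statue of Christ"]},
    "context": (
        "Architecturally, the school has a Catholic character. "
        "Atop the Main Building's gold dome is a golden statue of the Virgin Mary. "
        "Immediately in front of the Main Building and facing it, "
        "is a copper statue of Christ with arms upraised "
        'with the legend "Venite Ad Me Omnes".'
    ),
    "question": "What is in front of the Notre Dame Main Building?",
}

def answer_spans(record):
    """Yield (start, end, text) for each answer, sliced out of the context."""
    context = record["context"]
    starts = record["answers"]["answer_start"]
    texts = record["answers"]["text"]
    for start, text in zip(starts, texts):
        end = start + len(text)
        yield start, end, context[start:end]

spans = list(answer_spans(example))
print(spans)
```

Comparing the sliced text with the stored answer text is a quick sanity check that offsets and context still belong together after any preprocessing.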


<a class="anchor" id="8"></a>

---

# Extract text from Wikipedia

Wikipedia provides an [API](https://en.wikipedia.org/wiki/API), which makes it easy to access the data.<br>
We will use the library [wikipedia-api](https://pypi.org/project/Wikipedia-API/) which makes use of this API.

In [19]:
import wikipediaapi
import re
from tqdm.notebook import tqdm

## Load a Wikipedia page into a page object

In [20]:
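# create a wikipediaapi object and define the language version
# note: newer releases of wikipedia-api also expect a user_agent argument here;
# check the documentation of your installed version
wiki = wikipediaapi.Wikipedia(
    language='en',
    extract_format=wikipediaapi.ExtractFormat.WIKI
)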

In [21]:
page = wiki.page('Aesthetics')

We can get a summary of the text with `.summary`.

In [22]:
print(page.summary)

Aesthetics, or esthetics, is a branch of philosophy that deals with the nature of beauty and taste, as well as the philosophy of art (its own area of philosophy that comes out of aesthetics). It examines aesthetic values, often expressed through judgments of taste.Aesthetics covers both natural and artificial sources of experiences and how we form a judgment about those sources. It considers what happens in our minds when we engage with objects or environments such as viewing visual art, listening to music, reading poetry, experiencing a play, watching a fashion show, movie, sports or even exploring various aspects of nature. The philosophy of art specifically studies how artists imagine, create, and perform works of art, as well as how people use, enjoy, and criticize art. Aesthetics considers why people like some works of art and not others, as well as how art can affect moods or even our beliefs. Both aesthetics and the philosophy of art try to find answers for what exactly is art, 

Or we can get the full text with `.text`.

In [23]:
print(page.text)

Aesthetics, or esthetics, is a branch of philosophy that deals with the nature of beauty and taste, as well as the philosophy of art (its own area of philosophy that comes out of aesthetics). It examines aesthetic values, often expressed through judgments of taste.Aesthetics covers both natural and artificial sources of experiences and how we form a judgment about those sources. It considers what happens in our minds when we engage with objects or environments such as viewing visual art, listening to music, reading poetry, experiencing a play, watching a fashion show, movie, sports or even exploring various aspects of nature. The philosophy of art specifically studies how artists imagine, create, and perform works of art, as well as how people use, enjoy, and criticize art. Aesthetics considers why people like some works of art and not others, as well as how art can affect moods or even our beliefs. Both aesthetics and the philosophy of art try to find answers for what exactly is art, 
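
In the output above, two paragraphs run together without a space (`taste.Aesthetics`). A rough heuristic for repairing this with `re` is sketched below. The pattern and the function name are assumptions made for illustration; the rule only looks for a lowercase letter, a full stop, and an uppercase letter in direct sequence, so it can misfire on unusual abbreviations.

```python
import re

# lowercase letter, full stop, uppercase letter with no space in between
MISSING_SPACE_RE = re.compile(r"(?<=[a-z])\.(?=[A-Z])")

def restore_sentence_spaces(text):
    """Insert a space after a full stop that is glued to the next sentence."""
    return MISSING_SPACE_RE.sub(". ", text)

# fragment taken from the summary output above
fragment = "judgments of taste.Aesthetics covers both natural and artificial sources"
repaired = restore_sentence_spaces(fragment)
print(repaired)
```

With the page object from above, the function would be applied to `page.summary` or `page.text`.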