# Web Scraping

CS72/LING48 Spring 2022 Final Project

**Author:** Zhiyan Zhong (zhiyan.zhong.gr@dartmouth.edu) <br>
**Date:** 6/6/2022 <br>
**Description:** This program uses BeautifulSoup to extract data from specific parts of a website. I used it to collect the text of folktales from different regions on www.worldoftales.com. <br>
**Note:** Because the website does not store every type of story in the same way, the program has to be modified and tailored for each story type.

**Reference:** Our TA Sara Kay's web scraping session helped me a lot with this program. Some of the base code was adapted from Sara's tutorial.

## General Method

In [None]:
# Fetch the home page HTML and list the folktale index pages that are available

from bs4 import BeautifulSoup
import requests
import pandas as pd

URL = 'https://www.worldoftales.com/#gsc.tab=0'
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

result = soup.find_all('a', class_ = True, href = True)

URLS = [r['href'] for r in result if 'folktales' in r['href']]
URLS

['African_folktales.html',
 'Nigerian_folktales.html',
 'South_African_folktales.html',
 'Tanzanian_folktales.html',
 'Asian_folktales.html',
 'Arabic_folktales.html',
 'Chinese_folktales.html',
 'Indian_folktales.html',
 'Japanese_folktales.html',
 'Filipino_folktales.html',
 'Australian_folktales.html',
 'Australian_folktales.html',
 'European_folktales.html',
 'Germanic_folktales.html',
 'English_folktales.html',
 'German_folktales.html',
 'Norwegian_folktales.html',
 'Swedish_folktales.html',
 'Romanic_folktales.html',
 'Portuguese_folktales.html',
 'Spanish_folktales.html',
 'Italian_folktales.html',
 'Romanian_folktales.html',
 'Slavic_folktales.html',
 'Czech_folktales.html',
 'Polish_folktales.html',
 'Russian_folktales.html',
 'Slovak_folktales.html',
 'Ukrainian_folktales.html',
 'Celtic_folktales.html',
 'Irish_folktales.html',
 'Scottish_folktales.html',
 'North_American_folktales.html',
 'Native_American_folktales.html',
 'United_States_folktales.html',
 'South_American_fo

In [None]:
# Inspect one region's index page to figure out the URL pattern of the story pages

currURL = 'https://www.worldoftales.com/' + URLS[0] + '#gsc.tab=0'
page = requests.get(currURL)
soup = BeautifulSoup(page.content, "html.parser")
result = soup.find_all('a', href = True)

storyURLs = [r['href'] for r in result if 'folktales' in r['href']]
storyURLs

['media/african_folktales.jpg',
 'African_folktales/African_Folktale_2.html',
 'African_folktales/African_Folktale_1.html',
 'African_folktales/Nigerian_folktale_1.html',
 'African_folktales/Nigerian_folktale_3.html',
 'African_folktales/Nigerian_folktale_4.html',
 'African_folktales/Nigerian_folktale_7.html',
 'African_folktales/African_Folktale_40.html',
 'African_folktales/Nigerian_folktale_16.html',
 'African_folktales/Nigerian_folktale_19.html',
 'African_folktales/African_Folktale_44.html',
 'South_African_folktales.html',
 'Nigerian_folktales.html',
 'Nigerian_folktales.html',
 'Nigerian_folktales.html',
 'Tanzanian_folktales.html',
 'Tanzanian_folktales.html',
 'Tanzanian_folktales.html',
 'Users_folktales.html',
 'African_folktales/African_Folktale_2.html',
 'African_folktales/African_Folktale_1.html',
 'https://www.worldoftales.com/Ukrainian_folktales.html',
 'folktales.html',
 'African_folktales.html',
 'Nigerian_folktales.html',
 'South_African_folktales.html',
 'Tanzanian_
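
The list above mixes images, region index pages, duplicates, and one absolute URL with the actual story links. A minimal sketch of one way to narrow it down, assuming story pages are the `.html` files inside a `*_folktales/` folder (an assumption based only on the output above). The names `story_links`, `INDEX_URL`, and `hrefs` are hypothetical, and `hrefs` is a hand-copied subset of the output rather than a live request.

```python
from urllib.parse import urljoin

# Assumed address of the index page the relative links came from.
INDEX_URL = "https://www.worldoftales.com/African_folktales.html"

# A few hrefs copied from the output above.
hrefs = [
    "media/african_folktales.jpg",
    "African_folktales/African_Folktale_2.html",
    "African_folktales/African_Folktale_1.html",
    "African_folktales/Nigerian_folktale_1.html",
    "South_African_folktales.html",
    "African_folktales/African_Folktale_2.html",
    "https://www.worldoftales.com/Ukrainian_folktales.html",
]


def story_links(hrefs, base_url):
    """Keep links that look like story pages and make them absolute."""
    seen = set()
    links = []
    for href in hrefs:
        # Assumption: story pages are .html files inside a "*_folktales/" folder.
        if "_folktales/" not in href or not href.endswith(".html"):
            continue
        absolute = urljoin(base_url, href)
        if absolute not in seen:  # drop repeated links, keep first-seen order
            seen.add(absolute)
            links.append(absolute)
    return links


links = story_links(hrefs, INDEX_URL)
for link in links:
    print(link)
```

Because `urljoin` resolves each relative link against the index page, the base URL does not need to be concatenated by hand.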

In [None]:
# Write a while loop to collect 40 folktales of the same type

count = 1
article_numbers = []
article_list = []

while count <= 40:
    storyURL = 'https://www.worldoftales.com/African_folktales/Nigerian_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

# Create a dataframe to store the data
data = {'Region': 'Africa',
        'Country/Area': 'Nigeria',
        'Text':article_list,
        }

df = pd.DataFrame(data)
df

https://www.worldoftales.com/African_folktales/Nigerian_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/Nigerian_folktale_12.html#gsc.tab=0
https://www.w

Unnamed: 0,Region,Country/Area,Text
0,Africa,Nigeria,There was once a king who was very powerful. H...
1,Africa,Nigeria,Many years ago there was a Calabar hunter call...
2,Africa,Nigeria,Eyamba I. of Calabar was a very powerful king....
3,Africa,Nigeria,Efriam Duke was an ancient king of Calabar. He...
4,Africa,Nigeria,Ituen was a young man of Calabar. He was the o...
5,Africa,Nigeria,"Mbotu was a very famous king of Old Town, Cala..."
6,Africa,Nigeria,A bush rat called Oyot was a great friend of E...
7,Africa,Nigeria,Effiong Edem was a native of Cobham Town. He h...
8,Africa,Nigeria,"King Effiom of Duke Town, Calabar, was very fo..."
9,Africa,Nigeria,Okun Archibong was one of King Archibong's sla...
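
The loop above assumes every page has a `<div id="text">`; if one does not, `soup.find(...)` returns `None` and the `.findAll` call raises an `AttributeError`. It also joins paragraphs without a separator, so the last word of one paragraph runs into the first word of the next. A hedged sketch of an extraction helper that handles both cases is shown here. The function name `extract_story_text` and the sample HTML are made up for illustration and are not taken from the website.

```python
from bs4 import BeautifulSoup


def extract_story_text(html):
    """Return the story text from one page, or None if the layout differs."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": "text"})
    if container is None:
        # The page does not follow the expected layout; let the caller decide.
        return None
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    # Join with a space so adjacent paragraphs do not fuse together.
    return " ".join(p for p in paragraphs if p)


# Invented HTML that mimics the assumed page structure.
sample_html = """
<html><body>
  <div id="text"><p>There was once a king.</p><p> He was very powerful. </p></div>
</body></html>
"""

story = extract_story_text(sample_html)
missing = extract_story_text("<html><body><p>No story here.</p></body></html>")
print(story)
print(missing)
```

Inside the scraping loop, a `None` result could be used to skip the page or to log its URL for manual inspection.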


In [None]:
# Convert the dataframe to a csv file
df.to_csv('nigeria.csv')

## More examples

Below are some examples of collecting data using the web scraping method mentioned above.
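
The cells in this section differ only in the folder name, the file-name prefix, and the range of story numbers. As a sketch of how those three pieces could be factored out, the hypothetical helper `story_urls` below builds the list of page URLs for one collection. The two example entries reuse patterns and ranges that appear in this notebook. The `#gsc.tab=0` fragment is left out on the assumption that it is not needed, since URL fragments are not sent to the server.

```python
BASE_URL = "https://www.worldoftales.com/"


def story_urls(folder, prefix, first, last):
    """Build the URLs of stories numbered first..last (inclusive)."""
    return [f"{BASE_URL}{folder}/{prefix}{n}.html" for n in range(first, last + 1)]


# (folder, file-name prefix, first story number, last story number)
# Note the inconsistent capitalisation of "folktale" between collections.
COLLECTIONS = [
    ("African_folktales", "Nigerian_folktale_", 1, 40),
    ("Asian_folktales", "Chinese_Folktale_", 1, 89),
]

for folder, prefix, first, last in COLLECTIONS:
    urls = story_urls(folder, prefix, first, last)
    print(prefix, len(urls), urls[0])
```

Each new story type then needs only one more entry in `COLLECTIONS`, although the prefix and range still have to be checked by hand for every type.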

### Africa

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 40:
    storyURL = 'https://www.worldoftales.com/African_folktales/African_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Africa',
        'Country/Area': 'South Africa',
        'Text':article_list,
        }

df = pd.DataFrame(data)
df

df.to_csv('south_africa.csv')

https://www.worldoftales.com/African_folktales/African_Folktale_1.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_2.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_3.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_4.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_5.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_6.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_7.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_8.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_9.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_10.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_11.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_12.html#gsc.tab=0
https://www.worldoftales.

In [None]:
df = pd.concat(map(pd.read_csv, ['nigeria.csv', 'south_africa.csv']))

In [None]:
df

Unnamed: 0.1,Unnamed: 0,Region,Country/Area,Text
0,0,Africa,Nigeria,There was once a king who was very powerful. H...
1,1,Africa,Nigeria,Many years ago there was a Calabar hunter call...
2,2,Africa,Nigeria,Eyamba I. of Calabar was a very powerful king....
3,3,Africa,Nigeria,Efriam Duke was an ancient king of Calabar. He...
4,4,Africa,Nigeria,Ituen was a young man of Calabar. He was the o...
...,...,...,...,...
35,35,Africa,South Africa,The youngest of the three children had brought...
36,36,Africa,South Africa,"“But,” demanded Annie of the old Hottentot, a ..."
37,37,Africa,South Africa,The children were accompanying Old Hendrik fro...
38,38,Africa,South Africa,"Once upon a time Kee′ma, the monkey, and Pa′pa..."
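
The `Unnamed: 0` and `Unnamed: 0.1` columns above appear because `to_csv` writes the row index as an extra column by default, and `read_csv` then reads it back as ordinary data. One possible way to avoid this, sketched here with small made-up frames and in-memory buffers instead of the real CSV files, is to pass `index=False` when writing and `ignore_index=True` when concatenating.

```python
import io

import pandas as pd

# Tiny stand-ins for the scraped data; the texts are placeholders.
nigeria = pd.DataFrame({"Region": "Africa",
                        "Country/Area": "Nigeria",
                        "Text": ["story a", "story b"]})
south_africa = pd.DataFrame({"Region": "Africa",
                             "Country/Area": "South Africa",
                             "Text": ["story c"]})

# Write each frame without its index (StringIO stands in for a file on disk).
buffers = []
for frame in (nigeria, south_africa):
    buf = io.StringIO()
    frame.to_csv(buf, index=False)
    buf.seek(0)
    buffers.append(buf)

# Read the pieces back and renumber the rows from 0.
combined = pd.concat(map(pd.read_csv, buffers), ignore_index=True)
print(combined)
```

With real files, the same idea would be `df.to_csv('nigeria.csv', index=False)` for each piece before concatenating.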


In [None]:
# df.to_csv('africa.csv')

In [None]:
count = 41
article_numbers = []
article_list = []

while count <= 48:
    storyURL = 'https://www.worldoftales.com/African_folktales/African_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Africa',
        'Country/Area': 'Tanzania',
        'Text':article_list,
        }

df = pd.DataFrame(data)
df

df.to_csv('tanzania.csv')

https://www.worldoftales.com/African_folktales/African_Folktale_41.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_42.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_43.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_44.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_45.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_46.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_47.html#gsc.tab=0
https://www.worldoftales.com/African_folktales/African_Folktale_48.html#gsc.tab=0


In [None]:
df = pd.concat(map(pd.read_csv, ['nigeria.csv', 'south_africa.csv', 'tanzania.csv']))
df.to_csv('africa.csv')

### Arab

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 40:
    storyURL = 'https://www.worldoftales.com/Asian_folktales/Arab_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Asia',
        'Country/Area': 'Arab',
        'Text':article_list,
        }

df = pd.DataFrame(data)
df

df.to_csv('arab.csv')

https://www.worldoftales.com/Asian_folktales/Arab_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_12.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Arab_folktale_13.html#gsc.tab=0
https://

### China

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 89:
    storyURL = 'https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Asia',
        'Country/Area': 'China',
        'Text':article_list,
        }

df = pd.DataFrame(data)
df

df.to_csv('china.csv')

https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_1.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_2.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_3.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_4.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_5.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_6.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_7.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_8.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_9.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_10.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_11.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_12.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chin

### India

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 77:
    storyURL = 'https://www.worldoftales.com/Asian_folktales/Indian_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Asia',
        'Country/Area': 'India',
        'Text':article_list,
        }

df = pd.DataFrame(data)
df

df.to_csv('india.csv')

https://www.worldoftales.com/Asian_folktales/Indian_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_12.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Indian_folktale_

### Japan

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 67:
    storyURL = 'https://www.worldoftales.com/Asian_folktales/Japanese_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Asia',
        'Country/Area': 'Japan',
        'Text':article_list,
        }

df = pd.DataFrame(data)
df

df.to_csv('japan.csv')

https://www.worldoftales.com/Asian_folktales/Japanese_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Japanese_folktale_12.html#gsc.tab=0
https://www.worldoftales.com/Asian_fo

### Philippines

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 61:
    storyURL = 'https://www.worldoftales.com/Asian_folktales/Filipino_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Asia',
        'Country/Area': 'Filipino',
        'Text':article_list,
        }

df = pd.DataFrame(data)
df

df.to_csv('filipino.csv')

https://www.worldoftales.com/Asian_folktales/Filipino_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Filipino_folktale_12.html#gsc.tab=0
https://www.worldoftales.com/Asian_fo

In [None]:
df = pd.concat(map(pd.read_csv, ['arab.csv', 'china.csv', 'japan.csv', 'filipino.csv' ]))
df.to_csv('asia.csv')

In [None]:
# Australian

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 31:
    storyURL = 'https://www.worldoftales.com/Australian_folktales/Australian_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Australia',
        'Country/Area': 'Australia',
        'Text':article_list,
        }

df = pd.DataFrame(data)
df

df.to_csv('australia.csv')

https://www.worldoftales.com/Australian_folktales/Australian_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/Australian_folktales/Australian_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/Australian_f

### Europe

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 38:
    storyURL = 'https://www.worldoftales.com/European_folktales/Irish_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Europe',
        'Country/Area': 'Ireland',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('ireland.csv')

https://www.worldoftales.com/European_folktales/Irish_Folktale_1.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_2.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_3.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_4.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_5.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_6.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_7.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_8.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_9.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_10.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_11.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Irish_Folktale_12.html#gsc.tab=0
https://www.worldoftales.com/European

In [None]:
# England

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 42:
    storyURL = 'https://www.worldoftales.com/European_folktales/English_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Europe',
        'Country/Area': 'England',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('england.csv')

https://www.worldoftales.com/European_folktales/English_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/English_folktale_12.html#gsc.tab=0
https://www.w

In [None]:
# Germany

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 30:
    storyURL = 'https://www.worldoftales.com/European_folktales/German_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Europe',
        'Country/Area': 'Germany',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('germany.csv')

https://www.worldoftales.com/European_folktales/German_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/German_folktale_12.html#gsc.tab=0
https://www.worldoftales.

In [None]:
# Swedish

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 28:
    storyURL = 'https://www.worldoftales.com/European_folktales/Swedish_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Europe',
        'Country/Area': 'Sweden',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('sweden.csv')

https://www.worldoftales.com/European_folktales/Swedish_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Swedish_folktale_12.html#gsc.tab=0
https://www.w

In [None]:
# French

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 40:
    storyURL = 'https://www.worldoftales.com/European_folktales/French_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Europe',
        'Country/Area': 'France',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('france.csv')

https://www.worldoftales.com/European_folktales/French_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/French_folktale_12.html#gsc.tab=0
https://www.worldoftales.

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 32:
    storyURL = 'https://www.worldoftales.com/European_folktales/Italian_folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Europe',
        'Country/Area': 'Italy',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('italy.csv')

https://www.worldoftales.com/European_folktales/Italian_folktale_1.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_2.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_3.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_4.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_5.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_6.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_7.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_8.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_9.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_10.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_11.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Italian_folktale_12.html#gsc.tab=0
https://www.w

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 40:
    storyURL = 'https://www.worldoftales.com/European_folktales/Russian_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

    article_bits = soup.find("div", {"id":"text"}).findAll('p')
    
    full_article = ''

    for a in article_bits:
      full_article += a.text.strip()

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Europe',
        'Country/Area': 'Russia',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('russia.csv')

https://www.worldoftales.com/European_folktales/Russian_Folktale_1.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_2.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_3.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_4.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_5.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_6.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_7.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_8.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_9.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_10.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_11.html#gsc.tab=0
https://www.worldoftales.com/European_folktales/Russian_Folktale_12.html#gsc.tab=0
https://www.w

In [None]:
df = pd.concat(map(pd.read_csv, ['england.csv', 'germany.csv', 'ireland.csv', 'sweden.csv', 'france.csv', 'italy.csv', 'russia.csv']))
df.to_csv('europe.csv')
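
The scraping cells here all build their page URLs the same way, by appending a counter to a fixed prefix. That pattern could be factored into a small helper, sketched below. `story_urls` and its parameters are hypothetical names that are not part of the original program. The `#gsc.tab=0` fragment is left off on the assumption that it is not needed for the request, since URL fragments are handled by the browser and are not sent to the server.

```python
def story_urls(section, prefix, first, last, base="https://www.worldoftales.com"):
    """Build the numbered story URLs for one collection (hypothetical helper)."""
    # range() excludes its end point, so add 1 to include the last story
    return [f"{base}/{section}/{prefix}_{n}.html" for n in range(first, last + 1)]


# Same collection and page range as the Russian cell above
russian_urls = story_urls("European_folktales", "Russian_Folktale", 1, 40)
print(len(russian_urls))
print(russian_urls[0])
```

Each scraping loop could then iterate over the returned list instead of maintaining `count` by hand.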

### North America

In [None]:
count = 8
article_numbers = []
article_list = []

while count <= 76:
    storyURL = 'https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

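    # The story text sits in the <p> tags inside <div id="text">
    article_bits = soup.find("div", {"id": "text"}).find_all('p')

    # Join paragraphs with a space so words do not fuse across paragraph breaks
    full_article = ' '.join(a.text.strip() for a in article_bits)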

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'North America',
        'Country/Area': 'Native America',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('native_america_new2.csv')

https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_8.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_9.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_10.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_11.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_12.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_13.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_14.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_15.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_16.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_17.html#gsc.tab=0
https://www.worldoftal

In [None]:
df = pd.concat(map(pd.read_csv, ['native_america_new1.csv', 'native_america_new2.csv']))
df.to_csv('north_america_new.csv')
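
The extraction step repeated in each scraping cell assumes every page has a `<div id="text">`. If a page lacks it, `soup.find` returns `None` and the chained call raises an `AttributeError`. The sketch below shows one way to guard against that, run on a made-up HTML snippet rather than the live site. `extract_story_text` and `sample_html` are hypothetical names, not part of the original program.

```python
from bs4 import BeautifulSoup


def extract_story_text(html, container_id="text"):
    """Return the story paragraphs joined by spaces, or None if the container is missing."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", {"id": container_id})
    if container is None:
        # No story container on this page; let the caller decide what to do
        return None
    paragraphs = (p.get_text().strip() for p in container.find_all("p"))
    # Skip empty paragraphs and separate the rest with a single space
    return " ".join(text for text in paragraphs if text)


# Made-up page for illustration only
sample_html = """
<html><body>
  <div id="menu"><p>Navigation</p></div>
  <div id="text">
    <p> Once upon a time. </p>
    <p></p>
    <p>The end.</p>
  </div>
</body></html>
"""

story = extract_story_text(sample_html)
missing = extract_story_text("<html><body><p>No story here.</p></body></html>")
print(story)
print(missing)
```

Returning `None` lets a scraping loop skip or log a page with an unexpected layout instead of stopping partway through a collection.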

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 45:
    storyURL = 'https://www.worldoftales.com/United_States_folktales/US_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

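    # The story text sits in the <p> tags inside <div id="text">
    article_bits = soup.find("div", {"id": "text"}).find_all('p')

    # Join paragraphs with a space so words do not fuse across paragraph breaks
    full_article = ' '.join(a.text.strip() for a in article_bits)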

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'North America',
        'Country/Area': 'US',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('us_new.csv')

https://www.worldoftales.com/United_States_folktales/US_Folktale_1.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_2.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_3.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_4.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_5.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_6.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_7.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_8.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_9.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_10.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_11.html#gsc.tab=0
https://www.worldoftales.com/United_States_folktales/US_Folktale_12.html#gsc.tab=0
https://www.w

In [None]:
count = 51
article_numbers = []
article_list = []

while count <= 76:
    storyURL = 'https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

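    # The story text sits in the <p> tags inside <div id="text">
    article_bits = soup.find("div", {"id": "text"}).find_all('p')

    # Join paragraphs with a space so words do not fuse across paragraph breaks
    full_article = ' '.join(a.text.strip() for a in article_bits)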

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'North America',
        'Country/Area': 'Canada',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('canada.csv')

https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_51.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_52.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_53.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_54.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_55.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_56.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_57.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_58.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_59.html#gsc.tab=0
https://www.worldoftales.com/Native_American_folktales/Native_American_Folktale_60.html#gsc.tab=0
https://www.worldoft

In [None]:
df = pd.concat(map(pd.read_csv, ['native_america.csv', 'us.csv', 'canada.csv']))
df.to_csv('north_america.csv')

### South America

In [None]:
count = 1
article_numbers = []
article_list = []

while count <= 34:
    storyURL = 'https://www.worldoftales.com/South_American_folktales/South_American_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

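    # The story text sits in the <p> tags inside <div id="text">
    article_bits = soup.find("div", {"id": "text"}).find_all('p')

    # Join paragraphs with a space so words do not fuse across paragraph breaks
    full_article = ' '.join(a.text.strip() for a in article_bits)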

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'South America',
        'Country/Area': 'Brazil',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('south_america.csv')

https://www.worldoftales.com/South_American_folktales/South_American_Folktale_1.html#gsc.tab=0
https://www.worldoftales.com/South_American_folktales/South_American_Folktale_2.html#gsc.tab=0
https://www.worldoftales.com/South_American_folktales/South_American_Folktale_3.html#gsc.tab=0
https://www.worldoftales.com/South_American_folktales/South_American_Folktale_4.html#gsc.tab=0
https://www.worldoftales.com/South_American_folktales/South_American_Folktale_5.html#gsc.tab=0
https://www.worldoftales.com/South_American_folktales/South_American_Folktale_6.html#gsc.tab=0
https://www.worldoftales.com/South_American_folktales/South_American_Folktale_7.html#gsc.tab=0
https://www.worldoftales.com/South_American_folktales/South_American_Folktale_8.html#gsc.tab=0
https://www.worldoftales.com/South_American_folktales/South_American_Folktale_9.html#gsc.tab=0
https://www.worldoftales.com/South_American_folktales/South_American_Folktale_10.html#gsc.tab=0
https://www.worldoftales.com/South_American_folkt

In [None]:
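df = pd.concat(map(pd.read_csv, ['asia.csv', 'africa.csv', 'australia.csv', 'europe.csv', 'north_america.csv', 'south_america.csv']))
# Save the combined data under its own name; writing to 'asia.csv' would overwrite the Asia-only input file
df.to_csv('all_data.csv')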

In [None]:
df = pd.concat(map(pd.read_csv, ['all/asia.csv', 'all/africa.csv', 'all/australia.csv', 'all/europe.csv', 'all/north_america.csv', 'all/south_america.csv']))
df.to_csv('all_data.csv')
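
One thing to watch when merging CSV files this way: `DataFrame.to_csv` writes the row index by default, and `pd.read_csv` then reads that index back as an extra column named `Unnamed: 0`. The sketch below shows the effect on a small made-up frame and one way to avoid it. The sample data and variable names are illustrative only and are not taken from the scraped files.

```python
import io

import pandas as pd

# Made-up rows standing in for one regional file
region = pd.DataFrame({"Region": ["Asia", "Asia"], "Text": ["tale one", "tale two"]})

# Default to_csv() writes the row index as an unnamed first column
with_index = pd.read_csv(io.StringIO(region.to_csv()))
print(list(with_index.columns))

# index=False leaves the index out of the file
without_index = pd.read_csv(io.StringIO(region.to_csv(index=False)))
print(list(without_index.columns))

# ignore_index=True renumbers the merged rows instead of repeating each file's own 0..k
combined = pd.concat([without_index, without_index], ignore_index=True)
print(list(combined.index))
```

If the extra column is unwanted, passing `index=False` when writing each file keeps the merged output to the `Region`, `Country/Area`, and `Text` columns.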

In [None]:
count = 80
article_numbers = []
article_list = []

while count <= 81:
    storyURL = 'https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_' + str(count) + '.html#gsc.tab=0'
    print(storyURL)
    
    page = requests.get(storyURL)
    soup = BeautifulSoup(page.content, "html.parser")

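    # The story text sits in the <p> tags inside <div id="text">
    article_bits = soup.find("div", {"id": "text"}).find_all('p')

    # Join paragraphs with a space so words do not fuse across paragraph breaks
    full_article = ' '.join(a.text.strip() for a in article_bits)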

    article_list.append(full_article)
    article_numbers.append(count)

    count += 1

data = {'Region': 'Asia',
        'Country/Area': 'China',
        'Text':article_list,
        }

df = pd.DataFrame(data)

df.to_csv('china_SINGLE.csv')

https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_80.html#gsc.tab=0
https://www.worldoftales.com/Asian_folktales/Chinese_Folktale_81.html#gsc.tab=0
