# Extracting Text From Websites

Now that I have a list of websites that may contain information related to Biblical archeology, I need to extract the actual text from those articles in order to analyze each article and decide if it is in fact related to Biblical archeology. I will later use this data to create a model that predicts the articles that I would actually be interested in. I will also use this data to identify any references to specific Bible verses within an article.

This site describes how to extract only the text data from a website: https://www.quora.com/How-can-I-extract-only-text-data-from-HTML-pages

In [1]:
import re
import urllib.request
from bs4 import BeautifulSoup
from pandas_ods_reader import read_ods
import pandas as pd
import sqlite3
import time

I am going to develop this concept using the following website as my subject: https://en.wikipedia.org/wiki/Cana#Written_references_to_Cana

In [2]:
url = 'https://en.wikipedia.org/wiki/Cana#Written_references_to_Cana'

Next, I will use *urllib* to open this website and *BeautifulSoup* to extract the text.

In [3]:
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html)
data = soup.findAll(text=True)

Let's see what happened.

In [4]:
data

['html',
 '\n',
 '\n',
 '\n',
 'Cana - Wikipedia',
 '\n',
 'document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"fbd0444a-95e8-48f4-9f1f-9d2789646944","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Cana","wgTitle":"Cana","wgCurRevisionId":969504170,"wgRevisionId":969504170,"wgArticleId":432485,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["All articles with dead external links","Articles with dead external links from December 2017","Articles with permanently dead external links","CS1 uses Hebrew-language script (he)","CS1 Hebrew-language sources (he)","CS1 Greek-language sources (el)","CS1 Latin-language s

That was pretty easy, but clearly, I extracted way more information than I need. I will filter to just the visible text.

Below, I define a function that does this.

In [5]:
def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element.encode('utf-8'))):
        return False
    return True
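
One caveat about the comment check: in Python 3, `str(element.encode('utf-8'))` produces a string that begins with `b'`, so a pattern anchored at `<!--` never matches and HTML comments can slip through. A minimal alternative sketch, assuming BeautifulSoup 4, tests for the `Comment` class instead. The name `visible_strict` and the sample HTML are hypothetical, chosen only for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_strict(element):
    # Hypothetical variant of visible(): same tag filter, different comment test.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Comment is the string subclass BeautifulSoup uses for <!-- ... --> nodes.
    if isinstance(element, Comment):
        return False
    return True

# Small made-up page with a title, a comment, a paragraph, and a script.
sample_html = (
    "<html><head><title>Demo</title></head>"
    "<body><!-- hidden note --><p>Visible paragraph.</p>"
    "<script>var x = 1;</script></body></html>"
)
sample_soup = BeautifulSoup(sample_html, "html.parser")
kept = [s for s in sample_soup.find_all(string=True) if visible_strict(s)]
print(kept)
```

Only the paragraph text should survive the filter; the title, the comment, and the script body are dropped.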

Next I filter the data using the function *visible*.

In [6]:
result = filter(visible, data)

After this, I want to convert *result* to a list and join all of the elements together into a single string that I can further process.

In [7]:
text = ' '.join(list(result))

In [8]:
text

'\n \n \n \n \n \n  CentralNotice  \n \n \n Cana \n \n From Wikipedia, the free encyclopedia \n \n \n \n Jump to navigation \n Jump to search \n This article is about a place mentioned in the New Testament. For other uses, see  Cana (disambiguation) . \n \n \n \n Kafr Kanna Khirbet Qana Qana Reineh \n \n \n   Possible locations of Cana\n \n \n \n   Kafr Kanna  described as "Cana of Galilee".  Holy Land Photographed  by Daniel B. Shepp. 1894 \n Cana of  Galilee  ( Ancient Greek :  Κανὰ τῆς Γαλιλαίας ,  Arabic :  قانا الجليل \u200e,  romanized :\xa0 Qana al-Jalil ,  lit. \xa0 \'Qana of the Galilee\') is the location of the  Marriage at Cana , at which the  miracle  of turning water into wine took place in the  Gospel of John . \n The location is disputed, with the four primary locations being   Kafr Kanna ,  Khirbet Qana  and  Reineh  in  Lower Galilee  and  Qana  in  Upper Galilee . The Arabic name "Qana al-Jalil" has been said to apply to a number of sites, but is of doubtful authentic

The characters *\n* separate different sections of the visible text. Since I'm not really interested in anything but the body of the article, I want to get rid of any short sections of text. This would include things like the title, advertisements, web links, etc. In order to do this, I'll split the data by *\n* and remove any items that contain 100 characters or fewer. This should get rid of most of the text I'm not interested in.

In [9]:
lines = text.split("\n")
result2 = [item for item in lines if len(item)>100]
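
Note that `len(item)` counts characters, not words. The hypothetical helpers below (the names `keep_long_lines` and `keep_wordy_lines` and their parameters are mine, not part of this notebook) make the threshold explicit and show how a word-based cutoff would behave differently on the same made-up sample.

```python
def keep_long_lines(text, min_chars=100):
    """Return the newline-separated segments longer than min_chars characters."""
    return [line for line in text.split("\n") if len(line) > min_chars]

def keep_wordy_lines(text, min_words=100):
    """Word-based variant: keep segments with more than min_words words."""
    return [line for line in text.split("\n") if len(line.split()) > min_words]

# One short navigation-style segment followed by one longer body-style segment.
sample = "Jump to navigation\n" + "Cana of Galilee is the location of the Marriage at Cana. " * 3

print(len(keep_long_lines(sample)))   # segments that pass the character cutoff
print(len(keep_wordy_lines(sample)))  # segments that pass the word cutoff
```

A 100-character cutoff keeps ordinary paragraphs, while a 100-word cutoff would be much stricter and would drop many of them.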

Finally, I will combine all of these results into a single body of text.

In [10]:
text2 = ' '.join(list(result2))
print(text2)

 This article is about a place mentioned in the New Testament. For other uses, see  Cana (disambiguation) .   Cana of  Galilee  ( Ancient Greek :  Κανὰ τῆς Γαλιλαίας ,  Arabic :  قانا الجليل ‎,  romanized :  Qana al-Jalil ,  lit.   'Qana of the Galilee') is the location of the  Marriage at Cana , at which the  miracle  of turning water into wine took place in the  Gospel of John .   The location is disputed, with the four primary locations being   Kafr Kanna ,  Khirbet Qana  and  Reineh  in  Lower Galilee  and  Qana  in  Upper Galilee . The Arabic name "Qana al-Jalil" has been said to apply to a number of sites, but is of doubtful authenticity. [1]     Cana is very positively located in  Shepherd's Historical Atlas , 1923: modern scholars are less sure.   Among Christians and other students of the  New Testament , Cana is best known as the place where, according to the  Fourth Gospel ,  Jesus  performed "the first of his signs", his first public  miracle , the turning of a large quanti

This looks great! We still have a ton of processing to do before we can use this text to make predictions, but this is exactly the text I want to use.
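
The steps above can be gathered into one function, which makes them easier to reuse on many pages. This is a sketch under assumptions: `extract_body_text`, `min_chars`, `HIDDEN_PARENTS`, and the sample HTML are hypothetical names, `html.parser` is passed explicitly so the result does not depend on which parser happens to be installed, and comments are detected with BeautifulSoup's `Comment` class rather than the regular expression used earlier.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Tags whose text is not rendered as page content.
HIDDEN_PARENTS = {'style', 'script', '[document]', 'head', 'title'}

def extract_body_text(html, min_chars=100):
    """Return the long visible text segments of an HTML document as one string."""
    soup = BeautifulSoup(html, "html.parser")
    strings = [
        s for s in soup.find_all(string=True)
        if s.parent.name not in HIDDEN_PARENTS and not isinstance(s, Comment)
    ]
    # Join everything, then split on newlines and drop the short segments.
    text = ' '.join(strings)
    lines = text.split("\n")
    return ' '.join(line.strip() for line in lines if len(line) > min_chars)

# Made-up page: a title, a short link, and one paragraph of body text.
paragraph = "Cana of Galilee is the location of the Marriage at Cana. " * 3
sample_html = (
    "<html><head><title>Cana</title></head><body>\n"
    "<a href='#'>Jump to search</a>\n"
    "<p>" + paragraph + "</p>\n"
    "</body></html>"
)
print(extract_body_text(sample_html))
```

On the sample page, the title and the short link are filtered out and only the paragraph remains.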

# Creating a Dataset of Extracted Text
Now that I have extracted text from a single website, I want to create a dataset that contains all of the extracted text from the websites I have labeled.

First, I have to import the ODS file that contains the labeled websites.

In [1]:
import re
import urllib.request
from bs4 import BeautifulSoup
from pandas_ods_reader import read_ods
import pandas as pd
import sqlite3
import time

In [2]:
def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element.encode('utf-8'))):
        return False
    return True

In [3]:
path = "C:/Bible Research/data/labeled websites - no gov.ods"

sheet_idx = 1
df = read_ods(path, sheet_idx)

In [4]:
len(df)

2542

In [5]:
df.relevant.sum()

490.0

This shows that I read and labeled 2,542 articles about Biblical archeology. These are the articles that my web crawler pulled from Google in relation to Biblical archeology and the nations mentioned in the Bible. Of these, I judged 490 to be relevant to the topic of Biblical archeology. The others were tangential, e.g. sermon notes that mention Biblical archeology but add nothing to the topic.

Note: Labeling this data is a likely source of noise. While I tried very hard to be consistent in my labeling, this process took hundreds of hours in an effort that lasted nearly half a year. Because the effort was stretched across such a long period, there is every reason to believe that I labeled some sites inconsistently. To control for this, once I had read and labeled the entire list, I reread all of the relevant articles and decided again whether or not they were truly relevant. Unfortunately, I did not have time to reread all of the non-relevant articles to form a second opinion. I'm somewhat interested in seeing which non-relevant articles my predictive model will classify as relevant and how relevant they actually are. The worst possible outcome would be so much noise that my model cannot accurately classify relevant articles. If it mistakenly classifies non-relevant articles as relevant, I can always read those and make an informed decision. If it mistakenly classifies relevant articles as non-relevant, those are simply lost.

Before I extract the text data, I need to create an empty dataset in which to place the extracted text. I want to keep the website and whether or not it's relevant. I also want to create a counter variable. Also, if the function *visible* has not been run, that should be done before proceeding.

In [6]:
My_Text = pd.DataFrame(columns=['website', 'relevant', 'text'])
n = 0

Next, I'm going to use a FOR loop to iterate through these websites and extract the text for each. Because this FOR loop is unexpectedly stopping and I'm losing all of the great text that's been extracted, I'm going to do this in small sections. This will take more time, but I feel like it's the best way to get the information.

In [7]:
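for index, row in df[2400:2541].iterrows():

    # Show which site is being processed (the first column holds the URL)
    print(row.iloc[0])

    # Skip sites that cannot be opened
    try:
        html = urllib.request.urlopen(row.iloc[0])
    except Exception:
        continue

    # Pull every text node, then keep only the visible ones
    soup = BeautifulSoup(html)
    data = soup.findAll(text=True)
    result = filter(visible, data)
    text = ' '.join(list(result))

    # Keep only the segments longer than 100 characters
    lines = text.split("\n")
    j = ' '.join([item.strip() for item in lines if len(item)>100])

    # Store the result and advance the row counter
    My_Text.loc[n] = [row.website] + [row.relevant] + [j]
    n+=1

    # Pause between requests, then report progress
    time.sleep(2)
    print(n)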

https://www.simchajtv.com/statue-of-biblical-joseph-found-story-covered-up/
1
https://www.slideshare.net/SharingGodsDream/archaeology-in-the-holy-bible-list-of-artifacts-in-biblical-studies-of-archaeology-student-study-work-book
2
https://www.smithsonianmag.com/history/race-save-syrias-archaeological-treasures-180958097/
3
https://www.smithsonianmag.com/history/what-is-beneath-the-temple-mount-920764/
4
https://www.smithsonianmag.com/travel/keepers-of-the-lost-ark-179998820/
5
https://www.smithsonianmag.com/videos/category/history/some-very-compelling-evidence-the-tower-of-b/
6
https://www.studylight.org/dictionaries/hbd/h/hazazon-tamar.html
7
https://www.studylight.org/dictionaries/wtd/m/mysia.html
8
https://www.studylight.org/encyclopedias/mse/g/gabbatha.html
9
https://www.studylight.org/encyclopedias/mse/h/hazezon-tamar.html
10
https://www.teldanexcavations.com/
11
https://www.telegraph.co.uk/news/worldnews/islamic-state/11247310/Archaeological-site-uncovered-by-Lawrence-of-Arabia-t

In [8]:
len(My_Text)

128
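
Because the loop can stop partway through, one option is to write each extracted page to SQLite as soon as it is processed, so an interruption loses at most the page in progress. The following is a minimal sketch, not the approach used in this notebook: the function `scrape_with_checkpoints`, the `fetch_text` callable, and the table name `scrape_checkpoint` are all hypothetical, and the demo uses an in-memory database and a stand-in fetcher instead of real requests.

```python
import sqlite3

def scrape_with_checkpoints(rows, fetch_text, conn, table="scrape_checkpoint"):
    """Fetch text for each (website, relevant) pair, committing after every page."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(website TEXT PRIMARY KEY, relevant REAL, text TEXT)"
    )
    # URLs already saved by an earlier, interrupted run are skipped.
    done = {r[0] for r in conn.execute(f"SELECT website FROM {table}")}
    saved = 0
    for website, relevant in rows:
        if website in done:
            continue
        try:
            text = fetch_text(website)
        except Exception as err:
            # A failed page is reported and skipped; the loop keeps going.
            print(f"skipped {website}: {err}")
            continue
        conn.execute(
            f"INSERT INTO {table} (website, relevant, text) VALUES (?, ?, ?)",
            (website, relevant, text),
        )
        conn.commit()  # persist immediately so a later crash cannot lose this row
        done.add(website)
        saved += 1
    return saved

def fake_fetch(url):
    # Stand-in for the real download-and-extract step (assumption for the demo).
    if "broken" in url:
        raise ValueError("could not open page")
    return f"text from {url}"

demo_conn = sqlite3.connect(":memory:")
demo_rows = [("http://a.example", 1.0), ("http://broken.example", 0.0), ("http://b.example", 0.0)]
first_run = scrape_with_checkpoints(demo_rows, fake_fetch, demo_conn)
second_run = scrape_with_checkpoints(demo_rows, fake_fetch, demo_conn)
print(first_run, second_run)
```

In the notebook, `fetch_text` could wrap the `urllib.request.urlopen` call and the text filtering; `urlopen` also accepts a `timeout` argument, which may help if a request hangs.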

Now, I will rename this dataframe and run the next 100. I'll do this until I've iterated through the entire list of websites.

In [9]:
conn = sqlite3.connect(r"C:\Bible Research\SQL database\biblesql.db")

In [10]:
My_Text_Full = pd.read_sql('select * from labeled_text', conn)

In [11]:
c = pd.concat([My_Text_Full,My_Text],ignore_index=True)

In [12]:
len(c)

1991

In [13]:
#c.to_sql('labeled_text', conn, if_exists='replace', index=False)
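
An alternative to reading the whole table, concatenating, and replacing it is to append only the new batch with `if_exists='append'`. The sketch below illustrates that option and is not what this notebook does: it uses an in-memory database, and the names `batch_1`, `batch_2`, and `demo_text` are hypothetical.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")

# Two made-up batches with the same columns as My_Text.
batch_1 = pd.DataFrame({"website": ["http://a.example"], "relevant": [1.0], "text": ["first batch"]})
batch_2 = pd.DataFrame({"website": ["http://b.example"], "relevant": [0.0], "text": ["second batch"]})

# The first call creates the table; the second adds rows without touching existing ones.
batch_1.to_sql("demo_text", demo_conn, if_exists="append", index=False)
batch_2.to_sql("demo_text", demo_conn, if_exists="append", index=False)

total = pd.read_sql("select count(*) as Count from demo_text", demo_conn)
print(total)
```

Appending avoids rewriting rows that are already stored, but running the same batch twice would insert duplicates, so the batch boundaries need to be tracked carefully.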

In [14]:
pd.read_sql('select count(*) as Count from labeled_text', conn)

| | Count |
|---|---|
| 0 | 1991 |


In [16]:
pd.read_sql('select * from labeled_text limit 1', conn)

| | website | relevant | text |
|---|---|---|---|
| 0 | http://apologeticspress.org/apcontent.aspx?cat... | 1.0 | Almost fifty times in the Old Testament, we ca... |


In [17]:
cursor = conn.cursor()

cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())

[('bible_bbe',), ('book_key',), ('books',), ('bible_metrics',), ('labeled_text',)]
