# Creating a Web Crawler

I am going to create a web crawler that pulls articles related to Biblical archeology and a list of nations found in the Bible. I want to be considerate of those hosting these articles, so I am going to pause the web crawler after each iteration. I will then save all of the article addresses to a permanent dataset for later analysis.

In [1]:
import pandas as pd
from pandas_ods_reader import read_ods  # assumed source of read_ods, which is used below
from googlesearch import search
import time
from datetime import datetime

Prior to writing this code, I visited this Wikipedia page to find a list of all nations mentioned in the Bible: https://en.wikipedia.org/wiki/List_of_nations_mentioned_in_the_Bible. I copied these names into an ODS file. 

In [22]:
path = "C:/Bible Research/data/List of nations in Bible.ods"

# load a sheet based on its index (1 based)
sheet_idx = 1
nations = read_ods(path, sheet_idx)

In [23]:
len(nations)

110

The list contains 110 entries, a few of which (for example Noph and Rabbah) appear twice. We will look for any websites about archeological finds related to those nations.

In [24]:
nations.head()

Unnamed: 0,Nation
0,Adramyttium
1,Ai
2,Almon
3,Antioch
4,Aphek


Now it is time to write the web crawler. The first thing I do is create an empty list called *all_nations*. Then, within a FOR loop, I ask the program to print the name of each nation as it is being queried. This is not necessary, but it allows me to keep track of progress.

I then create a web query that requires the key words "archeology" OR "artifact," "Bible," and the name of the nation that is being queried. This site was helpful: https://www.geeksforgeeks.org/performing-google-search-using-python-code/. I asked for ten results at a time (num=10) and told the query to stop at result 25 (stop=25). I did this so that the search did not go too deep, i.e. page 4 searches. I also told the program to pause for 10 seconds between requests (pause=10), and I built in an additional pause of 3 seconds after each result. Some of the features of the function I used did not work as I expected. I will need to dig into this at a later date. After each result is returned, I append the web address to the list *all_nations*. Finally, I make the program wait 30 seconds between each nation in order to be respectful of Google. This site was helpful: https://www.pythoncentral.io/pythons-time-sleep-pause-wait-sleep-stop-your-code/.

In [25]:
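all_nations = []
for index, row in nations.iterrows():
    # The first column holds the nation name
    nation = row.iloc[0]

    # Print the nation being queried to keep track of progress
    print(nation)

    query = "archeology OR artifact + Bible +" + nation

    # stop=25 ends the search at result 25; pause=10 waits between requests
    for j in search(query, num=10, stop=25, pause=10):
        # Additional 3 second pause after each result
        time.sleep(3)

        all_nations.append(j)

    # Wait 30 seconds between nations to be respectful of Google
    time.sleep(30)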

Adramyttium
Ai
Almon
Antioch
Aphek
Assos
Attalia
Awen
Baal_shalisha
Babylon
Berea
Arbel
Beth_anath
Beth_tappuah
Bozrah
Bubastis
Cauda
Cenchrea
Chezib
Cuthah
Dedan
Ecbatana
Elim
En_gannim
Erech
Eshtemoa
Etam
Gath
Gath_hepher
Gerar
Gerasa
Gibeah
Giloh
Gozan
Haggoyim
Hamath
Hammath
Hapharaim
Harosheth
Hazezon_tamar
Hukok
Ible_am
Iconium
Iron
Jaffo
Jezreel
Kadesh
Kadesh_barnea
Kedesh_naphtali
Kir
Kiriath_Jearim
Kitron
Laish
Leshem
Lod
Lystra
Medeba
Mitylene
Myra
Neapolis
Nephtoah
Nicopolis
No
No_amon
Noph
Noph
On
Ono
Ophir
Ophrah
Pelusium
Pergamum
Philadelphia
Pi_beseth
Pirathon
Ptolemais
Puteoli
QiryatArba
Rabbah
Rabbah
Rakkath
Ramah
Rhegium
Sarepta
Seba
Sheba
Shechem
Shiloah
Shomron
Shunem
Sin
Smyrna
Socoh
Succoth
Susa
Syene
Tadmor
Tahpanhes
Thebes
Thessalonica
Thyatira
Timnah
Timnath_heres
Tirzah
Ur
Zaphon
Zephathah
Ziddim
Zoan
Zorah
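
As a possible refinement, the loop above could be wrapped in a function so that one failed search does not lose the addresses already collected. This is only a sketch under some assumptions: the names `build_query`, `crawl`, `search_fn`, `sleep_fn` and `pause_between_nations` are hypothetical and not part of the original notebook, and replacing underscores with spaces in names such as "Baal_shalisha" is my guess at what would search better, not something I have tested against Google.

```python
import time


def build_query(nation):
    """Build the search string for one nation, using the same key words as above."""
    # Assumption: underscores in names such as "Baal_shalisha" come from the
    # spreadsheet, so they are replaced with spaces before searching.
    return "archeology OR artifact + Bible +" + nation.replace("_", " ")


def crawl(nations, search_fn, pause_between_nations=30, sleep_fn=time.sleep):
    """Collect result URLs for every nation, pausing after each one.

    search_fn is any callable that takes a query string and yields URLs
    (hypothetical parameter; in the notebook it would wrap googlesearch.search).
    """
    urls = []
    for nation in nations:
        print(nation)  # progress indicator, as in the original loop
        try:
            for url in search_fn(build_query(nation)):
                urls.append(url)
        except Exception as err:
            # Keep what has been collected so far and move on to the next nation.
            print(f"Search failed for {nation}: {err}")
        sleep_fn(pause_between_nations)
    return urls


# Intended use in the notebook (not run here, since it needs network access):
# all_nations = crawl(nations.iloc[:, 0],
#                     lambda q: search(q, num=10, stop=25, pause=10))
```

Passing the search function in as an argument also makes it possible to try the loop with a stand-in function before sending any real requests.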


Now, I will deduplicate this list of websites.

In [27]:
all_nations = pd.Series(all_nations)
nations2 = pd.Series(all_nations.unique())

In [31]:
nations2.head()

0    https://biblearchaeologyreport.com/2019/04/12/...
1            https://en.wikipedia.org/wiki/Ai_(Canaan)
2    https://en.wikipedia.org/wiki/Ai_(Canaan)#Bibl...
3    https://en.wikipedia.org/wiki/Ai_(Canaan)#Poss...
4    https://en.wikipedia.org/wiki/Ai_(Canaan)#Et-Tell
dtype: object
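
The output above shows that several addresses point to the same Wikipedia page and differ only in the part after the "#". One way to treat these as duplicates is to drop that fragment before deduplicating. The following is a minimal sketch using only the standard library; the helper names `strip_fragment` and `dedupe_pages` are hypothetical and were not part of the original notebook.

```python
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return the URL without its '#fragment' part."""
    return urldefrag(url).url


def dedupe_pages(urls):
    """Drop fragments, then remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(strip_fragment(u) for u in urls))


# Two of the addresses shown in the output above
sample = [
    "https://en.wikipedia.org/wiki/Ai_(Canaan)",
    "https://en.wikipedia.org/wiki/Ai_(Canaan)#Et-Tell",
]
pages = dedupe_pages(sample)
print(pages)
```

With the pandas Series used in this notebook, the same idea would presumably be `all_nations.map(strip_fragment).unique()`.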

Finally, I need to output this list to a CSV file in which each website can be labeled as relevant or irrelevant. I will then use this labeled list of websites to create a predictive model.

In [32]:
nations2.to_pickle("C:/Bible Research/data/archeology websites about nations.pkl")

In [None]:
nations2.to_csv(r'C:/Users/david/OneDrive/Desktop/Bible Research/00 Python Output/Places/all places.csv')
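
Since the file is meant for labeling, it may help to write it with an empty column ready to be filled in. This is a sketch only: the column names `url` and `relevant` and the helper `write_label_sheet` are assumptions of mine, and the example writes to an in-memory buffer rather than to the path used above.

```python
import csv
import io


def write_label_sheet(urls, handle):
    """Write one row per URL with an empty 'relevant' column to fill in by hand."""
    writer = csv.writer(handle)
    writer.writerow(["url", "relevant"])
    for url in urls:
        writer.writerow([url, ""])


# Write to an in-memory buffer to show the layout of the file
buffer = io.StringIO()
write_label_sheet(["https://en.wikipedia.org/wiki/Ai_(Canaan)"], buffer)
print(buffer.getvalue())
```

To write a real file, the handle would come from `open(path, "w", newline="")`, which is how the csv module expects files to be opened.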