# Data Scraping - GTD
Using the Global Terrorism Database (GTD) and Selenium to scrape the data required for research and, eventually, for predicting future terror attacks globally.

### Libraries Used
We will use the following libraries:
* Selenium - Our main tool for interacting with the GTD site, scraping page after page until we have collected 50,000 rows of data.
* BeautifulSoup - We load each page into a soup object so we can extract the data we need.
* Pandas - All data extracted from the soup objects is stored in a Pandas DataFrame and eventually saved to a .csv file.
* Time - Lets us slow down some of the steps so the site does not block us as a bot attempting a DDoS attack.
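
As a rough illustration of the throttling idea in the last bullet, here is a minimal sketch of a pause helper with random jitter. The name `polite_pause` and its default values are hypothetical and are not part of this notebook; the `sleep` parameter exists only so the helper can be exercised without actually waiting.

```python
import random
import time


def polite_pause(base_seconds=2.0, jitter_seconds=1.0, sleep=time.sleep):
    """Sleep for base_seconds plus a random extra of up to jitter_seconds.

    Returns the requested delay so the caller can log it.
    The defaults are illustrative assumptions, not values tuned for the GTD site.
    """
    delay = base_seconds + random.uniform(0.0, jitter_seconds)
    sleep(delay)
    return delay
```

Calling `polite_pause()` between page loads would spread requests out slightly unevenly, which may look less mechanical than a fixed delay.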


In [None]:
import bs4
import time
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.support import expected_conditions as EC

Let's define a function that creates a Soup object from the HTML source of each page we scrape.

In [None]:
# func to create a soup obj from a html source
def load_html_to_soup(html_source):
    bsobj = soup(html_source, 'html.parser')
    return bsobj
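
To see what the helper returns, the sketch below runs it on a small hand-written table. The sample markup and its values are hypothetical; the real GTD results page may be structured differently.

```python
from bs4 import BeautifulSoup as soup


def load_html_to_soup(html_source):
    # Same helper as above, repeated so this example runs on its own
    return soup(html_source, 'html.parser')


# Hypothetical sample markup, not copied from the GTD site
sample_html = """
<table class="results">
  <tbody>
    <tr><td>1</td><td>2020-01-01</td><td>Exampleland</td></tr>
    <tr><td>2</td><td>2020-01-02</td><td>Samplestan</td></tr>
  </tbody>
</table>
"""

bsobj = load_html_to_soup(sample_html)
table = bsobj.find('table', class_='results')

# One list of stripped cell texts per table row
rows = [
    [td.text.strip() for td in tr.find_all('td')]
    for tr in table.tbody.find_all('tr')
]
print(rows)
```

Each inner list holds the cell texts of one `<tr>`, which is the same shape the scraping loop works with when it indexes into `columns`.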

Let's define the source URL and start the Selenium WebDriver for the Chrome browser.

In [None]:
URL = "https://www.start.umd.edu/gtd/search/Results.aspx?page=1&chart=injuries&casualties_type=&casualties_max=&count=100"
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(URL)
time.sleep(10)
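
The URL carries its settings in the query string (`page=1`, `count=100`). The sketch below shows one way to rebuild it for another page number using only the standard library. It assumes the `page` parameter selects the results page, which the URL suggests but which is not verified here; this notebook navigates by clicking instead. The helper name `url_for_page` is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

URL = "https://www.start.umd.edu/gtd/search/Results.aspx?page=1&chart=injuries&casualties_type=&casualties_max=&count=100"


def url_for_page(base_url, page):
    """Return base_url with its 'page' query parameter set to the given page."""
    parts = urlsplit(base_url)
    # keep_blank_values keeps the empty parameters such as casualties_type=
    query = parse_qs(parts.query, keep_blank_values=True)
    query["page"] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(url_for_page(URL, 3))
```

If that assumption holds, a failed run could be resumed from a given page rather than restarted from page 1.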

Creating an empty DataFrame whose column names are the attributes we collect.

In [None]:
df = pd.DataFrame(columns=['Date', 'Country', 'City', 'Perpetrator', 'Fatalities', 'Injured', 'Target'])

### Scraping
Each page of the website contains 100 rows of data, so we run the loop 500 times (50,000 rows in total) to get all the data.

After scraping each page, we use the Selenium WebDriver to click through to the next page until we reach page 500.

In [None]:
for i in range(500):
    try:
        table = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "results-table"))
        )
    except Exception:
        # The results table never appeared: close the browser and stop scraping
        print("Exiting with error")
        driver.quit()
        break

    bsobj = load_html_to_soup(driver.page_source)
    # Look for the table with the class 'results'
    tableobj = bsobj.find('table', class_='results')
    # Collect the data from every row that has cells
    for row in tableobj.tbody.find_all('tr'):
        columns = row.find_all('td')
        if columns:
            incident_date = columns[1].text.strip()
            country = columns[2].text.strip()
            city = columns[3].text.strip()
            perpetrator = columns[4].text.strip()
            fatalities = columns[5].text.strip()
            injured = columns[6].text.strip()
            target = columns[7].text.strip()
            df_new_row = pd.DataFrame({
                'Date': [incident_date],
                'Country': [country],
                'City': [city], 
                'Perpetrator': [perpetrator], 
                'Fatalities': [fatalities], 
                'Injured': [injured], 
                'Target': [target]
            })
            df = pd.concat([df, df_new_row], ignore_index=True)
    nextResultButton = driver.find_element(by=By.PARTIAL_LINK_TEXT, value="MORE RESULTS")
    nextResultButton.click()
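
The loop above calls `pd.concat` once per row, which builds a new DataFrame each time. A possible alternative, sketched below, is to collect plain dictionaries in a list and build the DataFrame once at the end. The helper `cells_to_record` and the sample cell texts are hypothetical; like the loop, it skips the first cell and reads cells 1 through 7.

```python
import pandas as pd

COLUMNS = ['Date', 'Country', 'City', 'Perpetrator', 'Fatalities', 'Injured', 'Target']


def cells_to_record(cells):
    """Map one row's cell texts to a record, skipping the first cell."""
    values = [cell.strip() for cell in cells[1:8]]
    return dict(zip(COLUMNS, values))


# Hypothetical cell texts for two rows; real values come from the scraped page
page_rows = [
    ['id-1', ' 2020-01-01 ', 'Exampleland', 'Sample City', 'Unknown', '0', '1', 'Business'],
    ['id-2', '2020-01-02', 'Samplestan', 'Test Town', 'Unknown', '2', '0', 'Police'],
]

records = [cells_to_record(cells) for cells in page_rows]

# Build the DataFrame once, after all rows have been collected
df = pd.DataFrame(records, columns=COLUMNS)
print(df)
```

With 50,000 rows, building the DataFrame once is likely to be noticeably faster than growing it row by row, though this has not been measured here.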



That's it: we save the DataFrame as a .csv file, close the Selenium WebDriver session, and exit.

In [None]:
df.to_csv('output.csv', mode='a', header=False)
driver.quit()
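
Because `to_csv` is called with `mode='a'` and `header=False`, the file has no header row, its first column is the DataFrame index, and running the notebook again appends to the existing file. The sketch below shows one way to read such a file back, using a temporary file and a hypothetical sample row; the column label `Index` is an assumption made for illustration.

```python
import os
import tempfile

import pandas as pd

COLUMNS = ['Date', 'Country', 'City', 'Perpetrator', 'Fatalities', 'Injured', 'Target']

# Hypothetical single-row DataFrame standing in for the scraped data
df = pd.DataFrame(
    [['2020-01-01', 'Exampleland', 'Sample City', 'Unknown', '0', '1', 'Business']],
    columns=COLUMNS,
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'output.csv')
    df.to_csv(path, mode='a', header=False)
    df.to_csv(path, mode='a', header=False)  # a second run appends to the same file

    # No header row in the file, so the column names are supplied here
    loaded = pd.read_csv(path, header=None, names=['Index'] + COLUMNS)

print(loaded)
```

The repeated `Index` values in the result are a quick way to spot that a file contains rows from more than one run.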