# Murderpedia Web Scraper

* Author: [Daniel Iova](https://github.com/daniel-iova)
* Github Repo: [murderpedia-web-scraper](https://github.com/daniel-iova/murderpedia-web-scraper)


## Table of contents:
1. [Introduction](#Introduction)
2. [Base components](#Base-components)
3. [Getting started](#Getting-started)
4. [Prerequisite methods](#Prerequisite-methods)
5. [Defining the brains of the operation](#Defining-the-brains-of-the-operation)
6. [Collecting, cleaning and storing the data](#Collecting,-cleaning-and-storing-the-data)
7. [Conclusions](#Conclusions)
8. [References](#References)

## Introduction

The goal of this project, along with introducing me to the world of data science, is to scrape data from *most* entries (murderers) on [Murderpedia](https://murderpedia.org).

Each murderer stored on the website has a table of *interesting data*, which is what I want to collect.
<br>This table contains keys like "Classification", "Location", "Status", etc.

All collected data is stored as a *JSON-serialized object* and can easily be integrated into a non-relational database.
<br>For those who need to store the data in a relational database, a *general table model* is also created.

With that being said, the scraper can be extended to collect all data present on each murderer's page, but that is out of the scope of this project.

Because of:
- the inconsistencies between entries
- the variable way the pages were written
- the way HTTP requests work<br>

the scraper often produces slightly different results on each run (a difference of at most 10 entries between runs).

Even so, each run returns around 85-95% of the 6921 entries the site is supposed to contain, which is still a large amount of data.
<br>However, due to how large this dataset is, the scraper is relatively slow (execution time of ~15-25 minutes on a modern computer).

The scraper produces three files:
* **dataset.json** : contains the data that could be retrieved.
* **model.json** : contains a dictionary with all keys that appear in the dataset and their occurrences.
* **count.txt** : contains the number of entries that were scraped.

The *License*, along with the source files for this project, can be found in the GitHub repo.

## Base components

The two most important components of this web scraper are:
* The **process_all_murderers** method, which is the brain of the scraping operation.
* The **global_dataset** dictionary that stores the results of the above method.

## Getting started

First of all we need to install all required libraries.

In [None]:
!pip install requests
!pip install fuzzywuzzy
!pip install python-Levenshtein
!pip install bs4
!pip install unidecode

We then import the libraries needed to run the scraper...

In [None]:
import concurrent.futures as cf
import re
import string
import requests
from bs4 import BeautifulSoup
from fuzzywuzzy import fuzz
from unidecode import unidecode
from collections import defaultdict
import json

...and instantiate the **global_dataset**.

In [None]:
global_dataset = defaultdict(lambda: defaultdict(dict))

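To make the shape of this structure concrete, here is a minimal sketch of how a two-level `defaultdict` behaves. The three levels (gender, first letter, per-entry key) match how the scraper fills it in later on; the entry keys and names are made-up placeholders.

```python
import json
from collections import defaultdict

# Same construction as global_dataset, under a different name
tree = defaultdict(lambda: defaultdict(dict))

# Intermediate levels are created on first access, so no setup is needed
tree["male"]["D"]["JohnDoe_0"] = {"Name": "John Doe"}    # placeholder entry
tree["female"]["R"]["JaneRoe_0"] = {"Name": "Jane Roe"}  # placeholder entry

# defaultdict is a dict subclass, so it serializes like a plain dict
print(json.dumps(tree, indent=2))
```

One side effect worth knowing: merely reading a missing key, for example `tree["male"]["Z"]`, also creates it.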
## Prerequisite methods

We need to define methods to tackle all aspects of scraping the data for a murderer.

The **get_page_lang** method returns the page language.
<br>We only want to scrape English pages, so we need a way to break the process if the page is in a foreign language.

In [None]:
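def get_page_lang(soup):
    # Look up the <meta http-equiv="Content-Language"> tag, if the page has one
    meta = soup.find("meta", attrs={"http-equiv": "Content-Language"})
    if meta is None:
        return ""
    # Keep only the two-letter language code, e.g. "en" from "en-us"
    return meta.get("content", "")[:2]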

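The language check can be tried out without any third-party library. The sketch below is a stand-in built on the standard library's `html.parser` instead of BeautifulSoup, so it is not the scraper's own code; the class name, the helper `page_lang` and both HTML snippets are made up for the example.

```python
from html.parser import HTMLParser


class ContentLanguageParser(HTMLParser):
    """Records the content of <meta http-equiv="Content-Language">."""

    def __init__(self):
        super().__init__()
        self.lang = ""

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("http-equiv") == "Content-Language":
            # Keep only the two-letter code, e.g. "en" from "en-us"
            self.lang = (attributes.get("content") or "")[:2]


def page_lang(html):
    parser = ContentLanguageParser()
    parser.feed(html)
    return parser.lang


english = '<html><head><meta http-equiv="Content-Language" content="en-us"></head></html>'
spanish = '<html><head><meta http-equiv="Content-Language" content="es"></head></html>'

for html in (english, spanish):
    lang = page_lang(html)
    print(lang, "-> scrape" if lang == "en" else "-> skip")
```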
<br>The **get_image_url** method finds the image URL of our entry.<br>Fuzzywuzzy's *token_set_ratio* method is used to compute the similarity between the found image's filename (with digits removed) and the name of the folder that contains it.
<br>If the ratio is above a certain threshold (25 in our case), we return the image URL. If not, we return "IMG_NOT_FOUND".
<br>With this threshold, the method has an accuracy of ~90%.
 

In [None]:
def get_image_url(soup, murderer):
    for img in soup.find_all("img"):
        src = img.get("src", "")  # some <img> tags may have no src attribute
        if "../images/" in src:
            # Compare the parent folder name with the filename (digits removed)
            tokens = src.split("/")
            ratio = fuzz.token_set_ratio(tokens[-2], re.sub(r"\d+", "", tokens[-1]))
            if ratio > 25:
                # Drop the leading "../" and make the URL absolute
                return murderer["Base_URL"] + src[3:]
    return "IMG_NOT_FOUND"

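As a rough illustration of the matching idea, the sketch below splits an image path, compares the folder name with the digit-free filename, and builds the absolute URL. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for fuzzywuzzy, so the scores are not comparable to *token_set_ratio*; the `src` value is a made-up example.

```python
import re
from difflib import SequenceMatcher

src = "../images/doe_john/doe001.jpg"  # made-up src attribute

tokens = src.split("/")
folder = tokens[-2]                        # "doe_john"
filename = re.sub(r"\d+", "", tokens[-1])  # "doe.jpg"

# difflib stands in for fuzz.token_set_ratio here; scores are not comparable
score = round(100 * SequenceMatcher(None, folder, filename).ratio())

base_url = "http://murderpedia.org/male.D/"  # same shape as Base_URL in the scraper
if score > 25:
    print(base_url + src[3:])  # src[3:] drops the leading "../"
else:
    print("IMG_NOT_FOUND")
```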
<br>The **get_tables** method is defined for the purpose of finding the list of "table" tags that *could* contain the data we want to extract.
<br>Along with the soup object, the function receives a *font_size* parameter, because the table's styling either has a font size of 8 or a font size of 10.<br>First we run with 8, and, if no tables were found, we try with 10.

In [None]:
def get_tables(soup, font_size=8):
    tables = soup.find_all(
        name="table",
        attrs={
            "border": "0",
            "style": f"font-size: {font_size}pt; color: #000000; border-collapse: collapse",
            "cellpadding": "0",
        },
    )
    # Return the font size too, since the caller reuses it to match the <td> tags
    return tables, font_size

<br>We define the **process_text** method to deal with the irregular formatting of the text we extract.

In [None]:
def process_text(text):
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space
    return re.sub(r"\s+", " ", text.strip())

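A quick way to see what this normalization does is to run it on a messy string. The sketch below uses a condensed one-line version of the same logic under a different name, and the input text is invented.

```python
import re


def normalize_whitespace(text):
    # Strip the ends, then collapse runs of spaces, tabs and newlines
    return re.sub(r"\s+", " ", text.strip())


raw = "  Classification:\n\t  Murderer \n"
print(repr(normalize_whitespace(raw)))  # 'Classification: Murderer'
```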
<br>The **get_dict_from_data_list** method converts a list of *"raw_key:raw_value"* strings to a dictionary and returns the result.

In [None]:
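def get_dict_from_data_list(data_list):
    data_dict = {}
    for item in data_list:
        # Collapse repeated colons, then split on the first one only,
        # so that values which contain a colon are kept whole
        key, sep, value = re.sub(r":+", ":", item).partition(":")
        if not sep:
            continue  # no "key:value" separator in this item
        data_dict[unidecode(process_text(key))] = unidecode(process_text(value))
    return data_dict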

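Here is a small, dependency-free sketch of the same conversion. It is a simplified stand-in: it skips the `unidecode` step and uses a hypothetical helper name, and the sample strings are invented (only the key names come from the table described in the introduction).

```python
import re


def pairs_to_dict(items):
    result = {}
    for item in items:
        item = re.sub(r":+", ":", item)        # "Status::" -> "Status:"
        key, sep, value = item.partition(":")  # split on the first colon
        if sep:
            result[key.strip()] = value.strip()
    return result


raw = ["Classification: Murderer", "Location:  Ohio, USA", "Status:: Unknown"]
print(pairs_to_dict(raw))
# {'Classification': 'Murderer', 'Location': 'Ohio, USA', 'Status': 'Unknown'}
```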
<br>All above methods are used by the **get_data** method.
<br>This method executes the following steps:
1. Initialize an empty *data dictionary*.
2. Get the soup object associated to the given murderer.
3. Use **get_page_lang** to break the processing if the page is not in English.
4. Assign the return value of **get_image_url** to the *Image_URL* key in the *data dictionary*.
5. Use **get_tables** to get the particular "table" tags and the *font size*.
6. Build a *data list* by iterating through the tables and finding specific "td" tags in the current table with the help of the *font size*.<br>Stop iteration after we find "Status:" in one of the "td" tag's text.
7. Use **get_dict_from_data_list** to update the *data dictionary* with data associated to the given murderer.
8. Return the *data dictionary*.



In [None]:
def get_data(murderer):
    
    data_dict = {}

    url = murderer["Murderpedia_URL"]
    r = requests.get(url, headers = {"User-Agent": 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
    c = r.content
    soup = BeautifulSoup(c, "html5lib")

    if get_page_lang(soup) != "en":
        return None

    data_dict["Image_URL"] = get_image_url(soup, murderer)

    tables, font_size = get_tables(soup)
    if tables == []:
        tables, font_size = get_tables(soup, 10)
    
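    # Defaults, so that a page without matching tables yields no fields
    # instead of raising a NameError below
    data_list = []
    classif_index = 0
    for t in tables:
        tds = t.find_all(name = "td", attrs = {"width":"100%", "style": f"font-size: {font_size}pt; color: #000000"})
        data_list = []
        classif_index = 0
        found_status = False
        for i, td in enumerate(tds):
            data_list.append(process_text(td.text))
            # Remember where the block of interesting data starts
            if "Classification:" in data_list[i]:
                classif_index = i
            # Stop once "Status:" has been found
            if "Status:" in td.text:
                found_status = True
                break
        if found_status:
            break
    # Drop everything that precedes the "Classification:" entry
    data_list = data_list[classif_index:]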

    data_dict.update(get_dict_from_data_list(data_list))
    return data_dict

<br>The **process_murderer** method builds the data dictionary for the murderer using the given parameters and the **get_data** method. 

In [None]:
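def process_murderer(gender, letter, murderer, index):
    data = {}
    data["Name"] = murderer["Name"]
    data["Murderpedia_URL"] = murderer["Murderpedia_URL"]
    page_data = get_data(murderer)
    if page_data is None:
        # get_data returns None for non-English pages; the caller reports the skip
        raise ValueError("page is not in English")
    data.update(page_data)
    return gender, letter, data, index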

<br>The **name_to_key** method is used to generate the key where the data is stored in the **global_dataset**.

In [None]:
def name_to_key(name, index):
    return re.sub(r"[^\w]", "", name) + f"_{index}"

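For example, the key drops every character that is not a letter, digit or underscore and appends the index, so two entries with the same name still get distinct keys. The names below are placeholders.

```python
import re


def name_to_key(name, index):
    return re.sub(r"[^\w]", "", name) + f"_{index}"


print(name_to_key("John Doe", 0))          # JohnDoe_0
print(name_to_key("John Doe", 7))          # JohnDoe_7
print(name_to_key("O'Neil, Mary-Ann", 3))  # ONeilMaryAnn_3
```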
<br>The **add_to_global_tree** method uses the return tuple of the **process_murderer** method to add an entry to the **global_dataset**.

In [None]:
def add_to_global_tree(gender, letter, dic, index):
    global_dataset[gender][letter][name_to_key(dic["Name"], index)] = dic

<br>The **single_entry** method is defined to facilitate multithreading.
<br>It is responsible for updating the **global_dataset** with an entry for the given murderer.
<br>It receives a list named *components*, which has the following elements:

1. Tag that contains the link to the murderer
2. Base murderer dictionary
3. Gender
4. Letter
5. Index (used to differentiate between murderers with the same name)

In [None]:
def single_entry(components):
    x, dic, gender, l, index = components
    base_url = "http://murderpedia.org/" + gender + "." + l+"/"
    dic["Murderpedia_URL"] = base_url + x["href"]
    dic["Base_URL"] = base_url
    try:
        add_to_global_tree(*process_murderer(gender, l, dic, index))
        print("DONE", dic["Name"])
    except Exception:
        print("COULDN'T DO", dic["Name"], dic["Murderpedia_URL"])

## Defining the brains of the operation

<br>The **process_all_murderers** method is defined for the purpose of getting all scrapable data and assigning it to the **global_dataset**.
<br>Because the murderers are indexed by the first letter of their surname, we first scrape the index page of each letter.
<br>For each "letter page" we build a list of the *components* used by the **single_entry** method.
<br>Once all possible entries have been appended, we pass the list to a *ThreadPoolExecutor*'s map method.
<br>The number of worker threads is given as a parameter.

In [None]:
def process_all_murderers(workers):
    
    genders = ["male", "female"]
    alphabet = list(string.ascii_uppercase)

    for gender in genders:
        base = "http://murderpedia.org/" + gender
        for letter in alphabet:
            
            # Build url and get html content

            url = f"{base}.{letter}/index.{letter}.htm"
            r = requests.get(url, headers = {"User-Agent": 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'})
            c = r.content
            soup = BeautifulSoup(c, "html5lib")

            # Build a list of all correct <tr> rows

            allrows = soup.find_all("tr")
            rows = [row for row in allrows if row.text.strip() != ""][13:]
            
            # Build the entry_thread_list used to facilitate multithreading

            entry_thread_list = []
            index = 0
            for row in rows:
                data = row.find_all("td")
                links = row.find_all("a")
                if len(data) == 5:
                    dic = {}
                    dic["Name"] = re.sub("&", "and", re.sub(r"\s+", " ", data[1].text.strip().replace("\n","").replace("\t"," ")))
                    if dic["Name"] != "":
                        for x in links:
                            if re.match(r"%s\d*/.*" % letter, x["href"], re.I):
                                entry_thread_list.append([x, dic, gender, letter, index])
                                index += 1
                                break
                    else:
                        print("DIDN'T PASS NAME CHECK on", url)

            # Execute the calls to the single_entry method using the entry_thread_list
            
            with cf.ThreadPoolExecutor(max_workers=workers) as executor : 
                executor.map(single_entry, entry_thread_list)

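The threading pattern used above, where a list of work items is handed to `ThreadPoolExecutor.map` and each call stores its own result in a shared dictionary, can be tried in isolation. In this sketch the worker function, the `results` dictionary and the work items are made up, and the worker does no network access.

```python
import concurrent.futures as cf

results = {}


def fake_entry(components):
    # Unpack the work item, like single_entry does with its components list
    name, index = components
    results[f"{name}_{index}"] = len(name)


work = [("Doe", 0), ("Roe", 1), ("Doe", 2)]

with cf.ThreadPoolExecutor(max_workers=2) as executor:
    executor.map(fake_entry, work)
# Leaving the with block waits for all submitted calls to finish

print(sorted(results.items()))
```

Since the iterator returned by `map` is never consumed, an exception raised inside a worker would go unnoticed, which is why the worker has to report failures itself.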
## Collecting, cleaning and storing the data

<br>Now that we have defined the base functions, we can go ahead and actually use them.<br>
<br>First, we define where we want to save the data we collect.

In [None]:
dataset_filename = "dataset.json"
model_filename = "model.json"
count_filename = "count.txt"

<br>Then, we define the number of workers we want to use.
<br>I found that 8 workers offer a good balance between speed and the number of entries lost from run to run.

In [None]:
max_workers = 8

<br>Finally, to get the raw data, we call the **process_all_murderers** method.
<br>**Disclaimer**: The run time of this method is between **15 and 25 minutes**.

In [None]:
process_all_murderers(max_workers)

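<br>The data needs to be cleaned: an entry with five or fewer keys holds at most two fields from the table (the other three are *Name*, *Murderpedia_URL* and *Image_URL*), so we drop it. To do this, we define the **clean_data** method...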

In [None]:
def clean_data(dataset):
    for g in list(dataset.keys()):
        for l in list(dataset[g].keys()):
            for m in list(dataset[g][l].keys()):
                if len(dataset[g][l][m]) <= 5:
                    del dataset[g][l][m]
    return dataset

...and use it to clean the dataset. We dump the resulting dictionary to file.

In [None]:
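dataset = clean_data(global_dataset)
# Use a context manager so the file is flushed and closed after writing
with open(dataset_filename, "w") as dataset_file:
    json.dump(dataset, dataset_file)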

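To see the effect of the cleaning step, here is **clean_data** applied to a tiny hand-made dataset. All names and values are placeholders; only the key names follow the scraper's layout.

```python
def clean_data(dataset):
    # Same logic as above: drop entries with five or fewer keys
    for g in list(dataset.keys()):
        for l in list(dataset[g].keys()):
            for m in list(dataset[g][l].keys()):
                if len(dataset[g][l][m]) <= 5:
                    del dataset[g][l][m]
    return dataset


toy = {
    "male": {
        "D": {
            # Six keys: kept
            "JohnDoe_0": {
                "Name": "John Doe",
                "Murderpedia_URL": "placeholder-url",
                "Image_URL": "IMG_NOT_FOUND",
                "Classification": "placeholder",
                "Location": "placeholder",
                "Status": "placeholder",
            },
            # Three keys: removed, because no table data was found
            "JohnDoe_1": {
                "Name": "John Doe",
                "Murderpedia_URL": "placeholder-url",
                "Image_URL": "IMG_NOT_FOUND",
            },
        }
    }
}

cleaned = clean_data(toy)
print(list(cleaned["male"]["D"].keys()))  # ['JohnDoe_0']
```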
<br>After this, we compile a *general table model*.<br>
It contains every key that appears in any scraped entry, with the number of times that key occurs in the dataset as its value.
<br>We can later use this model if we want to store the data into a relational database.
<br>This is done by defining and using the **compile_general_model** method, which receives our dataset and a filename for the model.

In [None]:
def compile_general_model(dataset, model_filename):
    # Count how many entries contain each key
    occurrences = {}
    for gender in dataset.keys():
        for letter in dataset[gender].keys():
            for murderer in dataset[gender][letter].keys():
                for key in dataset[gender][letter][murderer].keys():
                    occurrences[key] = occurrences.get(key, 0) + 1
    # Most frequent keys first
    model = dict(sorted(occurrences.items(), key=lambda x: x[1], reverse=True))
    with open(model_filename, "w") as model_file:
        json.dump(model, model_file)


compile_general_model(dataset, model_filename)

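As one possible use of the model, the sketch below turns its keys into the columns of a single SQLite table. This is only an assumption about how the model might be used, not part of the scraper: the table name `entries`, the helper `to_column`, and the keys and counts in `model` are all made up, and keys that normalize to the same column name would need extra handling.

```python
import re
import sqlite3

# Made-up excerpt of model.json: key -> number of entries containing it
model = {"Name": 3, "Classification": 3, "Status": 2, "Date of arrest": 1}


def to_column(key):
    # Turn a scraped key into a lowercase identifier with underscores
    return re.sub(r"\W+", "_", key).strip("_").lower()


columns = [to_column(key) for key in model]
column_sql = ", ".join(f'"{column}" TEXT' for column in columns)
ddl = f"CREATE TABLE entries (id INTEGER PRIMARY KEY, {column_sql})"

conn = sqlite3.connect(":memory:")
conn.execute(ddl)

# List the column names SQLite actually created
created = [row[1] for row in conn.execute("PRAGMA table_info(entries)")]
print(created)
```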
<br>In the end, we count the number of murderers we scraped.
<br>This is done by defining two methods: 
* **count_children** : counts the number of children on only one branch, "male" or "female".
* **count_murderers** : counts all murderers using the first method.

In [None]:
def count_children(dic):
    count = 0
    for key in dic.keys():
        count += len(dic[key])
    return count

def count_murderers(dataset_filename, count_filename):
    with open(dataset_filename) as dataset_file:
        data = json.load(dataset_file)
    male_count, female_count = 0, 0
    if "female" in data.keys():
        female_count = count_children(data["female"])
    if "male" in data.keys():
        male_count = count_children(data["male"])
    # The with block closes the file, so everything written is flushed to disk
    with open(count_filename, "w") as outf:
        print(f"{male_count} Males and {female_count} Females.", file = outf)
        print(f"In total {male_count+female_count} out of 6921 total entries.", file = outf)
        print(f"{6921 - (male_count+female_count)} murderers not scraped.", file = outf)


count_murderers(dataset_filename, count_filename)

## Conclusions

<br>As my first project, this was a great learning experience.
<br>It taught me how HTTP requests work, how data is scraped from websites, how multithreading works, and so on.
<br>Along with this, it allowed me to collect a large amount of data, which I plan to use in the future.

## References

* [Murderpedia](https://murderpedia.org) by [Juan Ignacio Blanco](https://es.wikipedia.org/wiki/Juan_Ignacio_Blanco)
* [Murder...and learning how to analyze data](https://mellybess.github.io/2017/01/23/getting-some-data.html) by [mellybess](https://github.com/mellybess)
* [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)