# Pull Case Data From NamUs

This project originally set out to train a generative model (a GAN or VAE) that takes forensic sketches of unidentified persons and produces more realistic photographs of them. From there, the photographs and details of the found person would be compared against the data of known missing persons to look for a match.

Unfortunately, NamUs does not provide an API and scraping their website would likely be against the Terms of Use. For this reason, I won't be proceeding with this project, but this code may be useful to others.

In [1]:
# Common imports
import numpy as np
import pandas as pd
import os
import bs4

In [2]:
#Read in files
list_of_files = ['missing.csv', 'unclaimed.csv', 'unidentified.csv']
df_missing = pd.read_csv(list_of_files[0])
df_unclaimed = pd.read_csv(list_of_files[1])
df_unidentified = pd.read_csv(list_of_files[2])

#Sample case URLs, one per case set (each returns the case file as JSON):

#https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/88851
#https://www.namus.gov/api/CaseSets/NamUs/UnclaimedPersons/Cases/88882/
#https://www.namus.gov/api/CaseSets/NamUs/UnidentifiedPersons/Cases/88796/

In [3]:
#By adding the case number to the end of these base URLs, we can pull the cases we need.
unidentified_head = 'https://www.namus.gov/api/CaseSets/NamUs/UnidentifiedPersons/Cases/'
unclaimed_head = 'https://www.namus.gov/api/CaseSets/NamUs/UnclaimedPersons/Cases/'
missing_head = 'https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/'

In [4]:
#Remove MP, UP, UCP from the beginning of the case number.
#Vectorized .str slicing assigns the whole column at once, which avoids the
#SettingWithCopyWarning raised by chained assignment like df['col'][x] = ...
df_missing['Case Number'] = df_missing['Case Number'].str[2:]
df_unclaimed['Case Number'] = df_unclaimed['Case Number'].str[3:]
df_unidentified['Case'] = df_unidentified['Case'].str[2:]


In [5]:
df_missing

Unnamed: 0,Case Number,DLC,Last Name,First Name,Missing Age,City,County,State,Sex,Race / Ethnicity,Date Modified
0,88851,02/09/2022,McFarland,Auttum,16 Years,Batesville,Independence,AR,Female,White / Caucasian,02/11/2022
1,88847,02/07/2022,Xi Ical,Fredy,13 Years,Culpeper,Culpeper,VA,Male,"Hispanic / Latino, White / Caucasian",02/11/2022
2,88777,02/07/2022,Bottorff,Desinee,17 Years,Alma,Crawford,AR,Female,White / Caucasian,02/09/2022
3,88733,01/31/2022,Luber,John,28 Years,Poughkeepsie,Dutchess,NY,Male,White / Caucasian,02/10/2022
4,88728,01/28/2022,McDonald,Morris,66 Years,Crowley,Tarrant,TX,Male,White / Caucasian,02/10/2022
...,...,...,...,...,...,...,...,...,...,...,...
9172,14632,10/30/1926,Clark,Marvin,69 Years,Tigard,Multnomah,OR,Male,White / Caucasian,07/29/2020
9173,24452,07/01/1925,Bartlett,Gerald,21 Years,Davil's Lake,Ramsey,ND,Male,White / Caucasian,02/08/2022
9174,24464,04/23/1916,Gates,John,33 Years,Longmont,Boulder,CO,Male,White / Caucasian,02/08/2022
9175,24522,11/01/1915,Davis,Noel,16 Years,Reno,Washoe,NV,Male,White / Caucasian,07/11/2019
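
The slicing above hard-codes the length of each prefix (two characters for MP and UP, three for UCP). As a minimal sketch, assuming the prefixes are always leading letters, a regular expression can strip them without knowing the length, and the result can be joined to the base URL. The names `strip_prefix`, `case_url`, and `API_ROOT` are hypothetical helpers, not part of the original notebook.

```python
import pandas as pd

# Base of the case endpoint, taken from the sample URLs in this notebook
API_ROOT = 'https://www.namus.gov/api/CaseSets/NamUs/'

def strip_prefix(case_numbers):
    """Drop leading letters (MP, UP, UCP) and keep the numeric part as a string."""
    return case_numbers.astype(str).str.replace(r'^[A-Za-z]+', '', regex=True)

def case_url(case_set, case_num):
    """Build the URL for one case, e.g. case_set='MissingPersons'."""
    return API_ROOT + case_set + '/Cases/' + case_num

# Case numbers as they appear in the CSV files, prefix included
sample = pd.Series(['MP88851', 'UCP88882', 'UP88796'])
stripped = strip_prefix(sample)

print(stripped.tolist())
print(case_url('MissingPersons', stripped[0]))
```

Keeping the case numbers as strings matters here, because they are concatenated directly onto the URL.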


This code only needs to be run once to download the approximately 22,000 case files from NamUs. As written, only the first two cases from each set are pulled as a test.

In [6]:
#Pull cases - only need to do this once
import requests
import time

def pull_cases(case_numbers, url_head, limit=2):
    #Download the first `limit` cases; pass limit=None to pull every case
    for case_num in list(case_numbers)[:limit]:
        url_to_pull = url_head + case_num
        print(url_to_pull)
        r = requests.get(url_to_pull, timeout=30)
        #Stop on HTTP errors instead of saving an error page as a case file
        r.raise_for_status()
        fileName = case_num + '.txt'
        with open(fileName, "w") as f:
            f.write(r.text)
        print('Downloaded ' + url_to_pull)
        #Pause between requests to avoid overloading the server
        time.sleep(3)

pull_cases(df_missing['Case Number'], missing_head)
pull_cases(df_unclaimed['Case Number'], unclaimed_head)
pull_cases(df_unidentified['Case'], unidentified_head)

https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/88851
Downloaded https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/88851
https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/88847
Downloaded https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/88847
https://www.namus.gov/api/CaseSets/NamUs/UnclaimedPersons/Cases/88882
Downloaded https://www.namus.gov/api/CaseSets/NamUs/UnclaimedPersons/Cases/88882
https://www.namus.gov/api/CaseSets/NamUs/UnclaimedPersons/Cases/88850
Downloaded https://www.namus.gov/api/CaseSets/NamUs/UnclaimedPersons/Cases/88850
https://www.namus.gov/api/CaseSets/NamUs/UnidentifiedPersons/Cases/88796
Downloaded https://www.namus.gov/api/CaseSets/NamUs/UnidentifiedPersons/Cases/88796
https://www.namus.gov/api/CaseSets/NamUs/UnidentifiedPersons/Cases/88337
Downloaded https://www.namus.gov/api/CaseSets/NamUs/UnidentifiedPersons/Cases/88337
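
Because the full download covers roughly 22,000 files at three seconds each, an interrupted run is likely. One possible way to make the download resumable is to skip any case whose file is already on disk. The sketch below is an assumption, not the notebook's method: `pull_case` and its `fetch` parameter are hypothetical, and a stub replaces the network call so the example runs offline.

```python
import os
import tempfile

def pull_case(case_num, url_head, out_dir, fetch):
    """Save one case file unless it already exists; return True if it was fetched."""
    path = os.path.join(out_dir, case_num + '.txt')
    if os.path.exists(path):
        return False  # already downloaded on an earlier run
    text = fetch(url_head + case_num)
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return True

# Stub standing in for the network call so the sketch runs offline.
# A real run might pass: lambda url: requests.get(url, timeout=30).text
requested = []
def fake_fetch(url):
    requested.append(url)
    return '{"id": 88851}'

missing_head = 'https://www.namus.gov/api/CaseSets/NamUs/MissingPersons/Cases/'
out_dir = tempfile.mkdtemp()

first = pull_case('88851', missing_head, out_dir, fake_fetch)
second = pull_case('88851', missing_head, out_dir, fake_fetch)
print(first, second, len(requested))
```

The second call returns without contacting the server, so rerunning the loop only fetches the cases that are still missing.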


In [7]:
df_missing['Case Number'][1]

'88847'

In [8]:
import json

#Read each downloaded case file, flatten its JSON into a one-row df, then combine them.
#This is an example only for the missing data but it applies equally well to the unidentified or unclaimed.
frames = []
for x in list(range(2)):
    #Open the file and parse it as JSON (the with block closes the file)
    with open(df_missing['Case Number'][x] + '.txt', "r") as case_file_for_json:
        temp_json = json.load(case_file_for_json)
    #Flatten the nested JSON into a single-row df
    frames.append(pd.json_normalize(temp_json))

#Combine once at the end. The original df_json.append(df_temp) inside the loop
#kept only the first and last case, and DataFrame.append was removed in pandas 2.0.
df_missing_json = pd.concat(frames)

In [9]:
df_missing_json

Unnamed: 0,id,idFormatted,createdDateTime,modifiedDateTime,grantPermissionToPublish,caseIsResolved,hasPendingByNcmec,hasNcmecContributors,notes,clothingAndAccessoriesArticles,...,physicalDescription.leftEyeColor.name,physicalDescription.leftEyeColor.localizedName,physicalDescription.rightEyeColor.id,physicalDescription.rightEyeColor.name,physicalDescription.rightEyeColor.localizedName,defaultImage.id,defaultImage.defaultImageIdentityId,subjectIdentification.middleName,subjectIdentification.nicknames,subjectDescription.weightTo
0,88851,MP88851,2022-02-11T12:40:57.597,2022-02-11T14:43:20.14,True,False,False,False,[],"[{'id': 92657, 'identityId': 85664, 'article':...",...,Brown,Brown,3,Brown,Brown,91550,171275,,,
0,88847,MP88847,2022-02-10T22:57:17.897,2022-02-14T23:20:11.397,True,False,False,False,[],"[{'id': 92656, 'identityId': 85663, 'article':...",...,Brown,Brown,3,Brown,Brown,91545,171264,Gustavo,Henry Alexis Ical Sabano,90.0
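
The dotted column names and the NaN entries in the output above both come from how the nested JSON is flattened and combined. The sketch below illustrates this with two hand-made records that reuse a few values from the output; they are heavily trimmed stand-ins, not actual API responses.

```python
import pandas as pd

# Trimmed, hand-made records shaped like the case JSON (illustrative only)
record_a = {
    'id': 88851,
    'idFormatted': 'MP88851',
    'physicalDescription': {'rightEyeColor': {'id': 3, 'name': 'Brown'}},
}
record_b = {
    'id': 88847,
    'idFormatted': 'MP88847',
    'physicalDescription': {'rightEyeColor': {'id': 3, 'name': 'Brown'}},
    'subjectIdentification': {'middleName': 'Gustavo'},
}

# json_normalize joins nested keys with '.', giving one row per record
frames = [pd.json_normalize(rec) for rec in (record_a, record_b)]

# concat takes the union of columns; fields missing from a record become NaN
df = pd.concat(frames, ignore_index=True)

print(sorted(df.columns))
print(df['subjectIdentification.middleName'].tolist())
```

Passing `ignore_index=True` gives each row its own index label, unlike the repeated `0` shown in the output above.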
