This [notebook](https://crosscompute.com/n/Sfknz2iPxwDdsLTcJvmWQeYETwYkhwIb) was written by Aida Shoydokova in preparation for a workshop on [Computational Approaches to Fight Human Trafficking](https://www.meetup.com/spatiotemporal-analysis-for-community-health-and-safety/events/244179401).


# **Information extraction**
Build an information extraction system and populate a well-organized database.


### Extract the following information from unstructured data:

* **Category and Subcategory of Human Trafficking**:
    * Sex Trafficking
        * Adult Sex Trafficking
        * Child Sex Trafficking
    * Labor
        * Bonded Labor or Debt Bondage
        * Domestic Servitude
        * Forced Child Labor
        * Unlawful Recruitment and Use of Child Soldiers
    * Organ Removal
    * Not Human Trafficking Article
    * Something else
* **Date**
    * Publication Date 
    * Conviction Date
    * Incident Start Date
    * Incident End Date 
* **Geo-Political Location**    
    * Country where a trafficker was operating
    * Country of origin of victim
    * Country of origin of trafficker
    * State/Province where a trafficker was operating
    * State/Province of origin of victim
    * State/Province of trafficker
    * City where a trafficker was operating
    * City of origin of victim
    * City of origin of trafficker
* **"ID Information"** - information that might help to dedupe incidents
    * Trafficker name
    * Victim Name
* **Demographic Information** 
    * Victim race
    * Trafficker race
    * Ethnicity of trafficker
    * Ethnicity of victim
    * Victim Age
    * Trafficker Age
    * Victim Gender
    * Trafficker Gender
    * Victim's Level of education
    * Trafficker's Level of education
    * Occupation of trafficker
    * Prior occupation of victim
    * Post occupation of victim
    * Victim's Income level
    * Trafficker's Income level
    * Victim's Marital status
    * Trafficker's Marital status
    * Religion of victim
    * Religion of trafficker
* **Length of Human Trafficking**
    * How long was a victim harbored?
    * How long did a trafficker operate?
* **How was a victim recruited?** 
    * threat
    * coercion
    * abduction
    * fraud/deceit/deception
    * abuse of power
    * something else
* **How was a victim transported/transferred?**
* **How did a victim escape?**
* **Is it a repeat victim?**
* **Is it a repeat trafficker?**
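
One way to make the target database concrete is to write some of the fields above down as a typed record. The sketch below is only an illustration: the class name `IncidentRecord`, its field names and the `dedupe_key` helper are assumptions that do not appear elsewhere in this notebook, and only a handful of the listed fields are included.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class IncidentRecord:
    """Hypothetical record for one extracted incident (illustrative only)."""
    category: Optional[str] = None        # e.g. 'Sex Trafficking'
    subcategory: Optional[str] = None     # e.g. 'Child Sex Trafficking'
    publication_date: Optional[date] = None
    conviction_date: Optional[date] = None
    operating_country: Optional[str] = None
    trafficker_names: List[str] = field(default_factory=list)
    recruitment_methods: List[str] = field(default_factory=list)

    def dedupe_key(self):
        """Crude key for spotting duplicates: normalized names plus date."""
        names = tuple(sorted(
            name.strip().lower() for name in self.trafficker_names))
        return names, self.publication_date


record = IncidentRecord(
    category='Sex Trafficking',
    publication_date=date(2017, 11, 11),
    trafficker_names=[' John Doe ', 'Jane Roe'])
print(record.dedupe_key())
```

Two press releases about the same incident can carry different publication dates, so a real dedupe key would probably need fuzzier matching than this.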

## DOJ Press Releases
Build a system that extracts information from DOJ Press Releases

### Loading all needed libraries

In [None]:
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
import re
import matplotlib.dates as mdates
import operator
from collections import Counter
from bs4 import BeautifulSoup

# enable IPython to display matplotlib graphs
%matplotlib inline

### Reading data from the database to pandas dataframe

In [None]:
import requests
from os.path import basename, join
from pandas import read_csv
from urllib.parse import urlparse

url = (
    'https://www.dropbox.com/s/74mgua40dhg6acq/'
    'human-trafficking-usa-doj-20171111-1730.csv-sample-100.xz?dl=1')
pr = read_csv(url, compression='xz')
len(pr)

In [None]:
url = (
    'https://www.dropbox.com/s/zr1tem2w4w1ocjz/'
    'human-trafficking-usa-doj-20171111-1730.csv.xz?dl=1')
pr = read_csv(url, compression='xz')
len(pr)

In [None]:
pd.options.display.max_colwidth=50
print(pr.head())

### Quick Data Exploration

In [None]:
pr.columns

In [None]:
import datetime
datetime.datetime.fromtimestamp(pr.published_time.min()).date()

In [None]:
def get_date(x):
    return datetime.datetime.fromtimestamp(x).date()

In [None]:
# datetime.datetime.fromtimestamp(1347517370)
import datetime

print(
    '1. The earliest date:', get_date(pr.published_time.min()), 
    '; Last pulled date:', get_date(pr.published_time.max()))
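
`datetime.datetime.fromtimestamp` interprets the value in the local timezone of the machine running the notebook, so the printed dates can shift by a day from one machine to another. Here is a minimal sketch of a vectorized, timezone-explicit alternative, assuming `published_time` holds Unix timestamps in seconds. The toy frame stands in for `pr`, and its values are made up for illustration.

```python
import pandas as pd

# toy stand-in for pr; values are assumed to be Unix timestamps in seconds
toy = pd.DataFrame({'published_time': [1347517370, 1510421400, None]})

# convert the whole column at once; missing values become NaT
published = pd.to_datetime(toy['published_time'], unit='s', utc=True)

print('Earliest date:', published.min().date())
print('Latest date:', published.max().date())
```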

In [None]:
import numpy as np

def f(i):
    try:
        if np.isnan(i):
            return 0
    except TypeError:
        return len(i.split(';'))
    
pr['topic_names'].map(lambda x: f(x))[:5]

In [None]:
import numpy as np

def f(i):
    try:
        if np.isnan(i):
            return None
    except TypeError:
        return len(i.split(';'))
    
pr['# topics'] = pr['topic_count'] = pr['topic_names'].map(
    lambda x: f(x))

In [None]:
# pr['# topics']=pr['topic'].map(lambda x: int(len(x)) if len(x) else None)

print(
    '2. Number of records:',len(pr), 
    '; Percentage of records that have topics:',
    "{0:.0f}%".format(100*len(pr[~pd.isnull(
        pr['topic_count'])])/len(pr)) )

In [None]:
print(
    '3. Number of records with empty body:',
    pr['body'].isnull().sum(),
    '; Number of records with empty title:',
    # treat a missing title as empty instead of failing on len(NaN)
    (pr['title'].fillna('').str.len() == 0).sum())

In [None]:
print('Distribution of the number of topics per document')
dist_n_topics = pr['# topics'].value_counts(sort=True)
print(dist_n_topics)

In [None]:
pr['topic_names'].dropna()[:10]

In [None]:
pr['topic_names'][:3]

In [None]:
from collections import defaultdict

dist_topics = count_by_topic = defaultdict(int)
for x in pr['topic_names']:
    if pd.isnull(x):
        continue
    for topic_name in x.split(';'): 
        count_by_topic[topic_name.strip()] += 1
count_by_topic
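
The same count can probably be obtained without an explicit loop. The sketch below uses a toy series in place of `pr['topic_names']` (the sample values are made up) and assumes a pandas version that provides `Series.explode`.

```python
import pandas as pd

# toy stand-in for pr['topic_names']: semicolon-separated names, None when missing
topic_names = pd.Series([
    'Human Trafficking; Civil Rights',
    None,
    'Human Trafficking',
    'Tax',
])

topic_counts = (
    topic_names.dropna()
    .str.split(';')   # one list of topic names per document
    .explode()        # one row per (document, topic) pair
    .str.strip()
    .value_counts())
print(topic_counts)
```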

In [None]:
"""
print('Topic Distribution')
dist_topics = {}
for row in pr['topic_names']:
     for topic in row:
        dist_topics[topic['name']] = dist_topics.get(topic['name'], 0) + 1
"""        

In [None]:
print(pd.DataFrame.from_records(list(dist_topics.items()), columns=['Topic Name', '# Documents']) ) 

In [None]:
plt.figure(1)
X = np.arange(len(dist_topics))
# matplotlib expects a sequence, not a dict view
plt.bar(X, list(dist_topics.values()), align='center', width=0.5)
_xticks = [list(dist_topics.keys())[i] if i in [21,37] else '' for i in X ]
plt.xticks(X, _xticks)
ymax = max(dist_topics.values()) + 1
plt.ylim(0, ymax)     

In [None]:
# Get the number of cases that have no published_time
# (comparing with `== None` never matches NaN, so use isnull instead)
len(pr[pr['published_time'].isnull()])

In [None]:
pr['published_time'].isnull().sum()

In [None]:
pr['topic_count'].isnull().sum()

In [None]:
# Get number of cases that lack both published_time and topic

In [None]:
(pr['published_time'].isnull() & pr['topic_count'].isnull()).sum()

In [None]:
print('Empty Topic Distribution by Year')
plt.figure(2)
pr['Year'] = pr['published_time'].map(
    lambda x: None if pd.isnull(x) else get_date(x).year)
empty = pr[pd.isnull(pr['# topics'])]['Year'].value_counts()
plt.bar(empty.index.values, empty.values)
_xticks = ['' if x%2 else int(x) for x in empty.index.values]
plt.xticks(empty.index.values, _xticks)
ymax = max(empty.values) + 1
plt.ylim(0, ymax)

### Subsetting data by only Human Trafficking related articles

In [None]:
pr['topic_names'][0]

In [None]:
def human_trafficking_in_topic(x):
    ht = ''
    if pd.isnull(x):
        return ''        
    for e in x.split(';'):
        if re.search(r'human\s+traffic', e ,re.I):
            ht += ';' + e
    return ht 

pr['ht_in_topic'] = pr['topic_names'].map(human_trafficking_in_topic)


In [None]:
# We decided to trust the original labeling:
# if 'human trafficking' appears in the body but the topics do not mention it,
# we treat the press release as not being about human trafficking.
# Only press releases without any topic are checked by title and body.
def human_trafficking_in_body_title_empty_topic(x):
    ht = ''
    topic_names = x['topic_names']
    # a missing topic is read from the CSV as NaN, so treat NaN as empty
    if pd.isnull(topic_names) or not len(topic_names):
        if re.search(r'human\s+traffic',str(x['title']),re.I):
            ht += ';' + r'human\s+traffic'
        elif re.search(r'human\s+traffic',str(x['body']),re.I):
            ht += ';' + r'human\s+traffic'
    return ht

pr['ht_in_body_title'] = pr.apply(human_trafficking_in_body_title_empty_topic,axis=1)

In [None]:
# a new Human Trafficking dataframe
# (copy so that adding columns later does not trigger SettingWithCopyWarning)
pr_ht = pr[(pr['ht_in_topic'] != '') | (pr['ht_in_body_title'] != '')].copy()
# statistics
print('human trafficking in topic',len(pr[pr['ht_in_topic']!='']))
print('human trafficking in body or title with empty topic',len(pr[pr['ht_in_body_title']!='']))
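
A vectorized variant of the same filter is sketched below on a toy frame. It is a simplification, not a drop-in replacement: it produces boolean flags instead of the matched topic strings, it checks only the title (not the body) when the topic is missing, and the sample rows are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['Man Sentenced', 'Tax Fraud Plea', 'Human  Trafficking Ring'],
    'topic_names': ['Human Trafficking; Civil Rights', 'Tax', None],
})

# na=False treats a missing value as "no match" instead of propagating NaN
ht_in_topic = toy['topic_names'].str.contains(
    r'human\s+traffic', case=False, regex=True, na=False)

# fall back to the title only when the topic is missing
ht_in_title = toy['topic_names'].isnull() & toy['title'].str.contains(
    r'human\s+traffic', case=False, regex=True, na=False)

toy_ht = toy[ht_in_topic | ht_in_title].copy()
print(toy_ht)
```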

### Cleaning the body and title

We spend 80% of our time cleaning the data and the remaining 20% complaining that the data could have been cleaner.

* The bodies of press releases contain HTML markup; let's remove it
<img alt="DOJ Raw Data" src="resources/DOJ_raw_view.png" width="600px" />

* Articles or press releases usually have a location at the beginning of the article

<center>**Press Release Example and Template**</center>
<img alt="example-press-release" src="resources/example_press_release.png" width="400px" />
<center>**New York Times Article**</center>
<img alt="NYT" src="resources/NYT.png" width="600px" />

In [None]:
# HYPHEN-MINUS, NON-BREAKING HYPHEN, FIGURE DASH, EN DASH, EM DASH, HORIZONTAL BAR
DASHES = r'[\-\u2011\u2012\u2013\u2014\u2015]+'
# e.g. "NEW ORLEANS, La. -"
LOCATION = re.compile(
    r'^\s*\w+\s*(\s*\w+\s*){0,1}(,\s*\w+\s*[\.\s]*){0,1}' + DASHES)
# e.g. "WASHINGTON, D.C. -"
WASHINGTON = re.compile(
    r'^\s*WASHINGTON\s*[,\s]*D\.{0,1}\s*C\.{0,1}\s*' + DASHES, re.I)

def remove_location_at_beginning(x):
    if LOCATION.search(x):
        return LOCATION.sub('', x)
    elif WASHINGTON.search(x):
        # the flag lives in the compiled pattern: passing re.I as the fourth
        # positional argument of re.sub would set `count`, not the flags
        return WASHINGTON.sub('', x)
    else:
        return x
    
# strip the HTML markup (using the lxml parser)
pr_ht['body'] = pr_ht['body'].map(lambda x: BeautifulSoup(str(x),"lxml").text)
# press releases can have the location of the press release at the beginning of the text
pr_ht['body'] = pr_ht['body'].map(remove_location_at_beginning) 

In [None]:
pr_ht['body'].iloc[0].strip()
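
A pattern that accepts a plain hyphen right after the first words will also strip the start of sentences such as "A well-known defendant ...". The sketch below shows one possible way to guard against that. The pattern and the sample strings are assumptions made up for illustration: a plain hyphen counts as a dateline separator only when whitespace precedes it, and cities of three or more words are not handled.

```python
import re

DATELINE = re.compile(
    r'^\s*\w+(?:\s+\w+)?'             # city: one or two words
    r'(?:,\s*[\w.]+)?'                # optional ", State" or ", D.C."
    r'(?:\s*[\u2012-\u2015]+|\s+-+)'  # a real dash, or a hyphen after whitespace
    r'\s*')


def strip_dateline(text):
    """Remove a leading 'CITY, State -' dateline if one is present."""
    return DATELINE.sub('', text, count=1)


samples = [
    'WASHINGTON \u2013 A federal jury convicted him.',
    'NEW ORLEANS, La. - Two men were charged.',
    'Washington, D.C.\u2014The Department announced.',
    'A well-known defendant pleaded guilty.',
]
for sample in samples:
    print(repr(strip_dateline(sample)))
```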

## Typical architecture for an information extraction system
[Material is taken from here](http://www.nltk.org/book/ch07.html#ref-chunkex-cp)

#### <center>Information Extraction Architecture</center>
<img alt="Architecture" src="resources/information_extraction_architecture.png" width="600px" />

### Segmentation, Tokenization and Tagging Part of Speech
* Let's perform the first three tasks
* If you want, you can improve the standard models by training them on your own corpus, especially for step 3, POS tagging
* If your corpus contains 'new' words, the standard models could be inaccurate

In [None]:
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

def ie_preprocess(document):
    sentences = nltk.sent_tokenize(document) 
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

# create new columns
pr_ht['pos body'] = pr_ht['body'].map(ie_preprocess)
pr_ht['pos title'] = pr_ht['title'].map(ie_preprocess)
pd.options.display.max_colwidth=100
print(pr_ht[['pos body','pos title']].head())    

#### List of all possible Part of Speech tags in NLTK

In [None]:
nltk.download('tagsets')

In [None]:
nltk.help.upenn_tagset()

### Pre-entity detection stage: Noun Phrase chunking or NP-chunking
We will search for chunks corresponding to individual noun phrases
<img alt="Chunk Segmentation" src="resources/chunk-segmentation.png" width="600px" />
<center>**Tree representation**</center>
<img alt="Chunk Segmentation" src="resources/chunk-tree.png" width="600px" />

### Rule-based approach: NP-chunking with regular expressions
* Let's define the rule (**chunk grammar**) that will divide the sentences into our NP-chunks based on POS tagging
* Part-of-speech tags are delimited using angle brackets
* This rule will chunk any sequence of tokens beginning with an optional determiner or possessive pronoun, followed by zero or more adjectives of any type (including comparative adjectives like earlier/JJR), followed by one or more nouns of any type
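
To see the rule in isolation, the sketch below applies it to a hand-tagged sentence (the running example from the NLTK book), so no tokenizer or tagger model is needed. The names `demo_grammar` and `demo_parser` are chosen here only to avoid clashing with the notebook's own variables.

```python
import nltk

# hand-tagged sentence, so no tagger model has to be downloaded
sentence = [
    ('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
    ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')]

# optional determiner or possessive pronoun, any adjectives, one or more nouns
demo_grammar = r'NP: {<DT|PRP\$>?<JJ.*>*<NN.*>+}'
demo_parser = nltk.RegexpParser(demo_grammar)
tree = demo_parser.parse(sentence)

# collect the words of every NP subtree
noun_phrases = [
    ' '.join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees() if subtree.label() == 'NP']
print(noun_phrases)
```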

In [None]:
# define the chunk grammar
# in general case you want the chunk grammar as below
#grammar = """NP: {<DT|PRP\$>?<JJ.*>*<NN.*>+}"""
# for our case, let's have this one instead
grammar = """NP: {<JJ.*>*<NN.*>+}"""

# Using the grammar, create a chunk parser
cp = nltk.RegexpParser(grammar)

# Chunk the sentences based on your rule
pr_ht['chunks'] = pr_ht['pos body'].map(lambda sentences: [cp.parse(sentence) for sentence in sentences])

# Print one example
pd.options.display.max_colwidth=1000
print(pr_ht['body'].iloc[2])
print(pr_ht['chunks'].iloc[2])

#### Here you can play around with chunk grammar to accurately determine NP-chunks

In [None]:
# define the index of the document you want to investigate
i = 0

# extract the document's raw text
document = pr_ht['body'].iloc[i]
print('Original document')
print(document)
print('------------------------------------------------')

# process your document
processed_document = nltk.sent_tokenize(document) 
processed_document = [nltk.word_tokenize(sent) for sent in processed_document]
processed_document = [nltk.pos_tag(sent) for sent in processed_document]
print('\nProcessed document')
print(processed_document)
print('------------------------------------------------')

# define the chunk rule, let's say everything will be a chunk if a noun is preceded by 0 or more adjectives
# (<JJ.*> matches any adjective tag; <JJ*> would only match J, JJ, JJJ, ...)
grammar = """My Chunk. Yeah!: {<JJ.*>*<NN.*>+}"""

# create a chunk parser
cp = nltk.RegexpParser(grammar)

# chunk your document 
chunk_trees = [cp.parse(sent) for sent in processed_document]
print('\nYour Chunked Phrases')
for tree in chunk_trees:
    for subtree in tree.subtrees():
        if subtree.label() == 'My Chunk. Yeah!': 
            print(subtree)
print('------------------------------------------------')
print('Visual representation (tree) of the document\'s first sentence')
#chunk_trees[0].draw()

### Machine Learning approach: NP-chunking (I'll leave this as an exercise for you)
* For an ML approach, you need annotated data. Unfortunately, the DOJ data is not annotated.
* You can use existing annotated data such as the WSJ data in the NLTK package. You then have to assume that the DOJ data and the annotated data come from the same distribution, so that models built on the annotated data will also work for the DOJ data
* A simple ML model can be built by using an n-gram tagger to label sentences with chunk tags [part 3.2](http://www.nltk.org/book/ch07.html#ref-chunkex-cp)
* A more comprehensive ML model would be a classifier-based chunker such as the ConsecutiveNPChunker [part 3.3](http://www.nltk.org/book/ch07.html#ref-chunkex-cp)
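
The n-gram idea can be sketched without any corpus download. The toy example below trains a unigram tagger on a single hand-chunked sentence, so it only shows the mechanics: a real chunker would be trained on an annotated corpus, and the variable names are made up for this illustration.

```python
import nltk

# one hand-chunked training sentence; real training data would be an annotated corpus
rule_chunker = nltk.RegexpParser(r'NP: {<DT>?<JJ>*<NN>}')
train_tree = rule_chunker.parse([
    ('the', 'DT'), ('little', 'JJ'), ('yellow', 'JJ'), ('dog', 'NN'),
    ('barked', 'VBD'), ('at', 'IN'), ('the', 'DT'), ('cat', 'NN')])

# convert the tree to (word, POS tag, IOB chunk tag) triples
iob_triples = nltk.chunk.tree2conlltags(train_tree)

# the unigram "chunker" learns the most frequent chunk tag for each POS tag
train_data = [[(pos, chunk) for word, pos, chunk in iob_triples]]
unigram_chunk_tagger = nltk.UnigramTagger(train_data)

predicted = unigram_chunk_tagger.tag(['DT', 'JJ', 'NN', 'VBD'])
print(predicted)
```

POS tags that never occur in the training data receive `None` as their chunk tag, which is one reason a larger annotated corpus is needed in practice.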

## Find the category of Human Trafficking
<img alt="Example HT" src="resources/HT_example1.png" width="600px" />

* Titles are usually very informative: they often state the gender and location of a trafficker and the category of Human Trafficking
* The first paragraph is usually more relevant to an incident than the following ones. The last paragraph usually has general information on Human Trafficking rather than information related to the incident (**task: divide a document into paragraphs**)
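
A possible starting point for the paragraph task is sketched below. It assumes that paragraphs are separated by blank lines, which may not hold for every press release once the markup has been stripped. The function name `split_paragraphs` and the sample text are made up for illustration.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs on blank (whitespace-only) lines."""
    parts = re.split(r'\n\s*\n', text.strip())
    # collapse line breaks and repeated spaces inside each paragraph
    return [' '.join(part.split()) for part in parts if part.strip()]


sample = (
    'A man was sentenced today for sex trafficking.\n'
    'He recruited two victims.\n'
    '\n'
    '   \n'
    'Human trafficking is a serious crime.\n')
paragraphs = split_paragraphs(sample)
first_paragraph, last_paragraph = paragraphs[0], paragraphs[-1]
print(len(paragraphs), first_paragraph)
```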

### Entity Detection - Category

In [None]:
def human_trafficking_category(chunk_trees):
    categories = []
    for tree in chunk_trees:
        for subtree in tree.subtrees():
            # find all NP-chunk
            if subtree.label() == 'NP':
                leaves = [w[0] for w in subtree.leaves()]
                # if an NP-chunk has trafficking in it, then we start extracting category
                if 'trafficking' in leaves:
                    # clean up and creating a string out of a list
                    c = ' '.join([re.sub('[^a-z]','',x.lower()) for x in leaves if not re.search('human',x,re.I)])
                    if re.sub(r'\s','',c) != '':
                        categories.append(c.strip())
                        
    categories2 = Counter(categories)
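    keys = []
    for key in categories2.keys():
        for i,k in enumerate(keys):
            if key in k:
                # key is already covered by a longer phrase
                break
            elif k in key:
                # key is the longer phrase, so it replaces the shorter one
                keys[i] = key
                break
        else:
            # no existing phrase overlaps with key (also covers the empty list);
            # for/else avoids appending a key that was just merged at the last index
            keys.append(key)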
                
    new_categories = dict((k, 0) for k in keys)  
    for key in categories2:
        for key1 in new_categories:
            if key in key1:
                new_categories[key1] += categories2[key]
                break
    
    new_categories = sorted(new_categories, key=new_categories.get) 
    if len(new_categories):
        category = new_categories[-1]
        if category == 'trafficking':
            category = ''
    else:
        category = ''
    return(category)            
 
#print(human_trafficking_category(pr_ht['chunks'].iloc[10]))    
pr_ht['category'] = pr_ht['chunks'].map(human_trafficking_category)  
print(pr_ht['category'].head(50))
print('# of articles that have a category:',len(pr_ht[pr_ht['category']!='']),\
      round(100*len(pr_ht[pr_ht['category']!=''])/len(pr_ht),2))
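
The phrase-merging step inside `human_trafficking_category` can be studied on its own. The sketch below is a simplified variant, not the exact logic of the function above: it processes the longest phrases first and folds each shorter phrase into the first longer phrase that contains it. The helper name and the sample counts are made up for illustration.

```python
from collections import Counter


def merge_contained_phrases(counts):
    """Fold each phrase into the first longer phrase that contains it."""
    merged = Counter()
    kept = []
    # longest phrases first, so shorter ones can find their container
    for phrase in sorted(counts, key=len, reverse=True):
        container = next((k for k in kept if phrase in k), None)
        if container is None:
            kept.append(phrase)
            merged[phrase] += counts[phrase]
        else:
            merged[container] += counts[phrase]
    return merged


counts = Counter({
    'sex trafficking': 3,
    'trafficking': 2,
    'child sex trafficking': 1,
    'labor trafficking': 1,
})
merged = merge_contained_phrases(counts)
print(merged.most_common(1)[0][0])
```

Note that the generic phrase 'trafficking' is credited to whichever longer phrase comes first, which inflates that category; whether that is desirable depends on how the counts are used.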

## Extract Human Trafficking Category Through spaCy
Let's use spaCy's dependency parsing to detect categories of Human Trafficking.
The model used here is big and very resource-consuming.

In [None]:
from os import environ

environment_level = int(environ.get(
    'CROSSCOMPUTE_ENVIRONMENT_LEVEL', 0))
# default to 0 so that int() does not fail when the variable is not set
memory_level = int(environ.get(
    'CROSSCOMPUTE_MEMORY_LEVEL', 0))

if environment_level < 1:
    print(
        'environment_level.error = environment level must be set to '
        'computational in order to use the spacy package because it '
        'takes too long to install')

if memory_level < 3:
    print(
        'memory_level.error = memory level should be set to large '
        'or higher in order to use the en_core_web_lg spacy model')

In [None]:
import spacy

# load spacy's large model (be careful it has size of 812 MB!)
nlp = spacy.load('en_core_web_lg')

def type_ht_spacy (document, token, ptag):    
    types = []
    for chunk in document.noun_chunks:
        if re.search(token,chunk.root.text, re.I):
            doc = nlp(chunk.text)
            for word in doc:
                if word.pos_ == ptag and not re.search(
                    token,word.text,re.I
                ) and not re.search('human',word.lemma_):
                    types.append(word.lemma_)               
    return Counter(types)

pr_ht['type ht spacy'] = pr_ht['body'].map(
    lambda x: type_ht_spacy(nlp(x),'traffic','NOUN'))   
print(pr_ht['type ht spacy'])
print('# of articles with a category:',len(pr_ht[pr_ht[
    'type ht spacy'] != {}]))

In [None]:
print('# of articles with a category:',len(pr_ht[pr_ht['type ht spacy'] != {}]))