# 0. Imports

In [1]:
## helpful packages
import pandas as pd
import numpy as np
import random
import re

## nltk imports
import nltk
### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('punkt')
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords


## spacy imports
import spacy
### uncomment and run the below line if you haven't loaded the en_core_web_sm library yet
! python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
[K     |████████████████████████████████| 12.0 MB 5.2 MB/s eta 0:00:01
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')


# 2. Text analysis of DOJ press releases

For background, here's the Kaggle that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 

Here's the code the dataset owner used to scrape those press releases here if you're interested: https://github.com/jbencina/dojreleases

In [2]:
## run this code to load the unzipped json file and convert to a dataframe
## and convert some of the things from lists to values
doj = pd.read_json("combined.json", lines = True)

## due to json, topics are in a list so remove them and concatenate with ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 'components_clean']].copy()

doj.head()

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",2014-10-01T00:00:00-04:00,No topic,National Security Division (NSD)
1,12-919,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,2012-07-25T00:00:00-04:00,No topic,Environment and Natural Resources Division
2,11-1002,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,2011-08-03T00:00:00-04:00,No topic,Environment and Natural Resources Division
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,2010-01-08T00:00:00-05:00,No topic,Environment and Natural Resources Division
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,Environment,Environment and Natural Resources Division


## 2.1 NLP on one press release (10 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to to convert it from a pandas series to a single string.


- Part of speech tagging- extract verbs and sort from most occurrences to least occurrences
- Named entity recognition --- what are the different organizations mentioned? how would you like to make more granular?
- Sentence level versus document-level sentiment scoring

- For sentence level scoring, print a few top positive and top negative. Does the automatic classifier seem to work?


### 2.1.1: part of speech tagging (3 points)

A. Preprocess the press release to remove all punctuation / digits (so can subset to one_word.isalpha())

B. Then, use part of speech tagging within nltk to tag all the words in that one press release with their part of speech. 

C. Finally, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print the 5 most frequent adjectives. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for .isalpha(): https://www.w3schools.com/python/ref_string_isalpha.asp
- `processtext` function here has an example of tokenizing and filtering to words where .isalpha() is true: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_solution.ipynb 
- Part of speech tagging section of this code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partI_textmining_examplecode.ipynb



In [4]:
doj_17 = str(doj.loc[doj['id'] == '17-1204'].contents)
pd.set_option('display.max_colwidth', None)
print(len(doj_17))
#From stack overflow: https://stackoverflow.com/questions/47361776/how-do-you-remove-punctuation-from-a-string-python
def cleaner(string):
    cleaned = [i for i in string if i.isalpha() or i.isspace()]
    return ''.join(cleaned)

doj_18 = cleaner(doj_17)
#print(doj_17)


## Code from slides
tokens = word_tokenize(doj_18) # Generate list of tokens
tokens_pos = pos_tag(tokens) # generate part of speech tags for those tokens
 
#for one_tok in tokens_pos:
   # print(one_tok)

adjectives = [one_tok[0] for one_tok in tokens_pos 
                if one_tok[1] in ["JJ", "JJS", "JJR"]]

adj_df = pd.DataFrame(adjectives)
adj_df.value_counts().head()
#adj_df.count(ascending = False).head()



9290


former        8
opioid        5
nationwide    4
addictive     3
other         3
dtype: int64

### 2.1.2 named entity recognition (3 points)


A. Using the alpha-only press release you created in the previous step, use spaCy to extract all named entities from the press release

B. Print all the named entities along with their tag

C. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these.

D. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convincted after this indictment.

**Resources**:

- Named entity recognition part of this code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partI_textmining_examplecode.ipynb
- re.search and re.findall examples here for filtering to ones containing year (multiple approaches; some need not involve `re`): https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/04_basicregex_formerging.ipynb 

In [43]:
spacy_contents = nlp(doj_18)
for one_tok in spacy_contents.ents:
    print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)

years_list = []
for one_tok in spacy_contents.ents:

    if (one_tok.label_ == "DATE") & ('year' in one_tok.text):
            print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)
            years_list.append(one_tok.text)
        
print(years_list)



Entity: Insys Therapeutics Inc; NER tag: ORG
Entity: today; NER tag: DATE
Entity: Fentanyl; NER tag: PERSON
Entity: Americans; NER tag: NORP
Entity: last year; NER tag: DATE
Entity: millions; NER tag: CARDINAL
Entity: Jeff Sessions; NER tag: PERSON
Entity: This Justice Department; NER tag: ORG
Entity: Trump; NER tag: PERSON
Entity: American; NER tag: NORP
Entity: Phoenix Ariz; NER tag: ORG
Entity: the Board of Directors of Insys; NER tag: ORG
Entity: this morning; NER tag: TIME
Entity: Arizona; NER tag: GPE
Entity: RICO; NER tag: LAW
Entity: the AntiKickback Law Kapoor; NER tag: LAW
Entity: the Board; NER tag: ORG
Entity: Insys; NER tag: ORG
Entity: Phoenix; NER tag: GPE
Entity: today; NER tag: DATE
Entity: US; NER tag: GPE
Entity: District Court; NER tag: ORG
Entity: Boston; NER tag: GPE
Entity: a later date; NER tag: DATE
Entity: today; NER tag: DATE
Entity: Boston; NER tag: GPE
Entity: Insys; NER tag: ORG
Entity: December; NER tag: DATE
Entity: Kapoor Michael L Babich; NER tag: PERS

In [44]:
sentences = sent_tokenize(doj_17)

for x in years_list:
    for y in sentences:
        if x in y:
            print(x + ":" + y)

## assuming the CEO attempted to violate the anti-kickback law, 
##he is looking at five years in prison and three years on probation

last year:"More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids.
three years:This investigation highlights our commitment to defending our mail system from illegal misuse and ensuring public trust in the mail.”“The U.S. Department of Veterans Affairs, Office of Inspector General will continue to aggressively investigate those that attempt to fraudulently impact programs designed to benefit our veterans and their families,” said Donna L. Neves, Special Agent in Charge of the VA OIG Northeast Field Office.The charges of conspiracy to commit RICO and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than 20 years in prison, three years of supervised release and a fine of $250,000, or twice the amount of pecuniary gain or loss.
three years:The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, three years of supervised release and a $

### 2.1.3 Sentiment analysis (4 points)

A. Use a `SentimentIntensityAnalyzer` and `polarity_scores` to score the entire press release for its sentiment (you can go back to the raw string of the press release without punctuation/digits removed)

B. Remove all named entities from the string and score the sentiment of the press release without named entities. Did the neutral score go up or down relative to the version of the press release containing named entities? Why do you think this occurred?

C. With the version of the string that removes named entities, try to split the press release into discrete sentences (hint: re.split() may be useful since it allows or conditions in the pattern you're looking for). Print the first 5 sentences of the split press release (there will not be deductions if there remain some erroneous splits; just make sure it's generally splitting)

D. Score each sentence in the split press release and print the top 5 sentences in the press release with the most negative sentiment (use the `neg` score- higher values = more negative). **Hint**: you can use pd.DataFrame to rowbind a list of dictionaries; you can then add the press release sentence for each row back as a column in that dataframe and use sort_values()                                                  
                
**Resources**:

- Sentiment analysis section of this script: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partI_textmining_examplecode.ipynb

- Discussion of using `re.split()` to split on multiple delimiters: https://stackoverflow.com/questions/4998629/split-string-with-multiple-delimiters-in-python

In [48]:
# create sentiment object
sent_obj = SentimentIntensityAnalyzer()
sentiment = sent_obj.polarity_scores(doj_17)
#print(sentiment)

ents =[]
spacy_dirty = nlp(doj_17)
for one_tok in spacy_dirty.ents:
    ents.append(one_tok.text)
#print(ents)

joined_ents = "|".join(ents)
#print(joined_ents)
subbed_doj = re.sub(joined_ents, "", doj_17)
##print(subbed_doj)

new_sent_obj = SentimentIntensityAnalyzer()
sent_new = sent_obj.polarity_scores(subbed_doj)
#print(sent_new)

## The nuetral score has decreased since removing all of the named entities.
## This is likely because these named entities have scores of their own and 
## can sway nuetrality in their use

sentence_senti_list = []
 
for sentence in sentences:
    holder = []
    holder.append(sentence)
    senter = SentimentIntensityAnalyzer()
    senti = sent_obj.polarity_scores(sentence)
    holder.append(senti["neg"])
    #print(neg)
    sentence_senti_list.append(holder)
    
#print(sentence_dict)
sentences_df = pd.DataFrame(sentence_senti_list)
sentences_df.head()
sentences_df.columns = ["content", "neg_score"]
sentences_df.sort_values(ascending = False, by = "neg_score").head()


Unnamed: 0,0,1
0,"4909 The founder and majority owner of Insys Therapeutics Inc., was arrested today and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a Fentanyl spray intended for cancer patients experiencing breakthrough pain.",0.382
1,"""More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids.",0.184
2,"And yet some medical professionals would rather take advantage of the addicts than try to help them,"" said Attorney General Jeff Sessions.",0.0
3,"""This Justice Department will not tolerate this.",0.0
4,We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic.,0.0


Unnamed: 0,content,neg_score
13,“Today's arrest and charges reflect our ongoing efforts to attack the opioid crisis from all angles.,0.494
0,"4909 The founder and majority owner of Insys Therapeutics Inc., was arrested today and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a Fentanyl spray intended for cancer patients experiencing breakthrough pain.",0.382
9,"The medication, called “Subsys,” is a powerful narcotic intended to treat cancer patients suffering intense breakthrough pain.",0.378
5,"And under the leadership of President Trump, we are fully committed to defeating this threat to the American people.”John N. Kapoor, 74, of Phoenix, Ariz., a current member of the Board of Directors of Insys, was arrested this morning in Arizona and charged with RICO conspiracy, as well as other felonies, including conspiracy to commit mail and wire fraud and conspiracy to violate the Anti-Kickback Law.",0.352
26,"The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, three years of supervised release and a $25,000 fine.",0.351


## 2.2 sentiment scoring across many press releases (10 points)


A. Subset the press releases to those labeled with one of free topics (can just do if topic_clean == that topic rather than finding where that topic is mentioned in a longer list): Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.

B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string
- Scores the sentiment of the entire press release

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- You may want to use re.escape at some point to avoid errors relating to escape characters like ( in the press release
- I used a function + list comprehension to execute and it takes about 30 seconds on my local machine and about 2 mins on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample

C. Add the scores to the `doj_subset` dataframe. Sort from highest neg to lowest neg score and print the top 5 most neg.

D. With that dataframe, find the mean compound score for each of the three topics using group_by and agg. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)

**Resources**:

- Same named entity and sentiment resources as above

In [18]:
top = ['Hate Crimes', 'Civil Rights', 'Project Safe Childhood']
doj_subset = doj[doj['topics_clean'].isin(top)]


def super_senti(presser):
    named_ents = nlp(presser)

    ent_list = [one_tok.text for one_tok in named_ents.ents]
    joined_ents = "|".join(ent_list)

    subbed = re.sub(re.escape(joined_ents), "", presser)
    
    new_sent_obj = SentimentIntensityAnalyzer()
    sent_new = new_sent_obj.polarity_scores(subbed)
    return sent_new
    
#test_string = "Dartmouth College has many majors for the most ambitious students"
#super_senti(test_string)


sentiment_list = [super_senti(press_release) for press_release in doj_subset.contents]
senti_df = pd.DataFrame(sentiment_list)
senti_df.head()

senti_df.reset_index(drop = True, inplace = True)
doj_subset.reset_index(drop = True, inplace = True)


type(senti_df)



Unnamed: 0,neg,neu,pos,compound
0,0.169,0.763,0.068,-0.9893
1,0.118,0.774,0.108,-0.7003
2,0.088,0.811,0.101,0.6124
3,0.114,0.779,0.107,-0.5385
4,0.141,0.803,0.056,-0.9786


pandas.core.frame.DataFrame

In [20]:
doj_subset_with_sentiment = pd.concat([doj_subset, senti_df], axis = 1 )
len(doj_subset)
doj_subset_with_sentiment.head()

717

Unnamed: 0,id,title,contents,date,topics_clean,components_clean,neg,neu,pos,compound
0,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,"A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened. James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack. Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up. “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division. “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.” “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.” This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section.",2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle",0.169,0.763,0.068,-0.9893
1,15-1522,Alabama Man Found Guilty of Aggravated Sexual Abuse of a Child,"A federal jury convicted Rick Lee Evans, 43, of Anniston, Alabama, today of aggravated sexual abuse of a child after a five-day trial, Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Joyce White Vance of the Northern District of Alabama announced. According to evidence introduced at trial, Evans, a former U.S. Army soldier, and his then-wife, a Department of Defense employee, were residing in Germany when they were asked to take temporary custody of a five-year-old child whose parents were deployed to Iraq with the U.S. Army. Evans sexually abused the child on multiple occasions during the 18 months that the child lived with him from May 2007 to December 2008. Trial Attorney Austin M. Berry of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case. U.S. Army Criminal Investigations Division and the FBI’s Birmingham, Alabama, Division investigated the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse, launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2015-12-11T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern",0.118,0.774,0.108,-0.7003
2,16-213,Alabama Man Indicted on Child Pornography and Sex Tourism Charges,"An Alabama native was indicted today and charged with multiple crimes involving travel with intent to engage in illicit sexual conduct with minors and child pornography, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Kenyen R. Brown of the Southern District of Alabama. Clarence Edward Evers Jr., aka Bud, a technology teacher employed by the Conecuh County, Alabama, Board of Education, was arrested on Feb. 11, 2016, and was charged today with five counts of travel with intent to engage in illicit sexual conduct with a minor, one count of attempted travel with intent to engage in illicit sexual conduct with a minor, one count of production and attempted production of child pornography, one count of transportation of child pornography, one count of receipt of child pornography, one count of access with intent to view child pornography and one count of possession of child pornography. According to the indictment, Evers allegedly traveled to Thailand in the summers of 2010 through 2014 for the purpose of engaging in illicit sexual conduct with a minor and allegedly attempted to make a similar trip in the spring of 2015. During the 2014 trip, Evers also allegedly photographed his victims’ abuse and then transported the images back to the United States. In addition, Evers allegedly had other images of child sexual exploitation on his computers and other electronic devices. The charges contained in the indictment are only allegations. Evers is presumed innocent unless and until he is proven guilty beyond a reasonable doubt in a court of law. ICE-HSI is investigating this case. Trial Attorney James E. Burke IV of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorneys Sean P. Costello and Maria E. Murphy of the Southern District of Alabama are prosecuting the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-02-24T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Southern",0.088,0.811,0.101,0.6124
3,16-381,Alabama Man Indicted for Producing Child Pornography Involving Multiple Victims,"An Alabama man was indicted today by a federal grand jury in Birmingham, Alabama, on charges related to the production of child pornography involving four minor victims, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Joyce White Vance of the Northern District of Alabama. Gregory Jerome Lee, 53, formerly of Cullman County, Alabama, was indicted on four counts of production of child pornography, one count of conspiracy to advertise child pornography and one count of conspiracy to distribute and receive child pornography. According to the indictment, from September 1996 through December 2004, Lee used, persuaded, coerced and enticed minors to engage in sexually explicit conduct in order to produce images of that conduct. Between September 1996 and August 2007, Lee conspired with other individuals to distribute and receive child pornography through a variety of means, including the Internet. The U.S. Postal Inspection Service (USPIS) is investigating the case. Trial Attorney Amy E. Larson of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case. The charges and allegations contained in an indictment are merely accusations. The defendant is presumed innocent unless and until proven guilty. Members of the public who may have information related to this matter should call the USPIS Birmingham Office at (205) 326-2909. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-03-30T00:00:00-04:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern",0.114,0.779,0.107,-0.5385
4,14-464,Alabama Man Indicted for Threatening African-American Man and Another Person at Restaurant,"Jeremy Heath Higgins was indicted for threatening an African-American man at a Quinton, Alabama, restaurant, and for threatening another person who ordered Higgins to leave the restaurant due to his behavior, Acting Assistant Attorney General Jocelyn Samuels for the Justice Department’s Civil Rights Division and U.S. Attorney Joyce Vance for the Northern District of Alabama announced today. Higgins, 28, was charged in a three count indictment returned yesterday by a federal grand jury in the U.S. District Court for the Northern District of Alabama. The indictment charges him with one felony count and two misdemeanor counts of interference with a federally-protected activity. The indictment alleges that on June 14, 2013, Higgins approached and threatened an African-American man at the Alabama Rose Steakhouse because the man was present at the restaurant with a white woman. According to the indictment, another person ordered Higgins to leave the premises of the restaurant because of Higgins’ behavior toward the African-American man, after which Higgins allegedly shouted a threat to burn down the restaurant. The indictment further alleges that Higgins threatened the person who had ordered him to leave the restaurant by painting graffiti on the restaurant’s exterior and fence. If convicted of the felony count of the indictment, Higgins could face a maximum sentence of 10 years in prison and a $250,000 fine. For each of the misdemeanor charges, Higgins could face a maximum sentence of one year in prison and a $200,000 fine. This case is being investigated by the FBI and is being prosecuted by Assistant U.S. Attorney Robin B. Mark of the Northern District of Alabama and Trial Attorney David Reese of the Justice Department’s Civil Rights Division. An indictment is merely an accusation, and the defendant is presumed innocent unless proven guilty.",2014-05-01T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.141,0.803,0.056,-0.9786


100

100

## 2.3 topic modeling (25 points)

For this question, use the `doj_subset` data that is reestricted to civil rights, hate crimes, and project safe childhood and with the sentiment scores added


### 2.3.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in each of the raw strings in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem
    
B. Print the preprocessed text for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
    
**Resources**:

- Here's code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
- Here's more condensed code with topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_solution.ipynb 
- Here's longer code with more broken-out topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_examplecode.ipynb

In [8]:
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                       "trial", "assistance", "assist"]


In [21]:
pd.set_option('display.max_colwidth', None)
snow_stemmer = SnowballStemmer(language='english')
def preprocess(df, stopword_list, min_token_length = 4):
    
    new_col = []
    for x in df["contents"]:
    
        product = x.lower()
        try:

            remove_stop = [word for word in wordpunct_tokenize(product)
                          if word not in stopword_list]

            processed_string = " ".join([snow_stemmer.stem(i) 
                            for i in remove_stop if 
                            i.isalpha() and len(i) >= min_token_length])

            #print(processed_string)
            new_col.append(processed_string)
        except:
            processed_string = "error" 
            new_col.append(processed_string)
   # print(new_col)

    return(new_col)
    #df.contents.head()

## combine stopwords lists    
list_stopwords = stopwords.words("english")

list_stopwords = list_stopwords + custom_doj_stopwords

stemmed_contents = preprocess(doj_subset_with_sentiment, list_stopwords, min_token_length = 4)
doj_subset_with_sentiment["stemmed"] = stemmed_contents
doj_subset_with_sentiment.head()



Unnamed: 0,id,title,contents,date,topics_clean,components_clean,neg,neu,pos,compound,stemmed
0,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,"A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened. James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack. Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up. “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division. “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.” “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.” This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section.",2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle",0.169,0.763,0.068,-0.9893,former supervisori correct offic louisiana state penitentiari angola louisiana plead guilti yesterday connect beat handcuf shackl inmat addit conspir cover misconduct falsifi offici record lie intern investig happen jame savoy marksvill louisiana admit plea hear wit offic use excess forc inmat fail interven conspir offic cover beat engag varieti obstruct act person falsifi offici prison record cover attack scotti kennedi beeb arkansa john sander marksvill louisiana previous plead guilti novemb septemb role beat cover everi citizen right process protect unreason forc correct offic violat basic constitut must held account egregi action said act general john gore continu vigor prosecut correct offic violat public trust commit crime cover violat feder crimin yesterday anoth exampl unwav commit pursu violat feder crimin law said act unit state middl louisiana corey amundson continu work close ensur investig baton roug resid agenc prosecut frederick menner middl louisiana christoph perra crimin section
1,15-1522,Alabama Man Found Guilty of Aggravated Sexual Abuse of a Child,"A federal jury convicted Rick Lee Evans, 43, of Anniston, Alabama, today of aggravated sexual abuse of a child after a five-day trial, Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Joyce White Vance of the Northern District of Alabama announced. According to evidence introduced at trial, Evans, a former U.S. Army soldier, and his then-wife, a Department of Defense employee, were residing in Germany when they were asked to take temporary custody of a five-year-old child whose parents were deployed to Iraq with the U.S. Army. Evans sexually abused the child on multiple occasions during the 18 months that the child lived with him from May 2007 to December 2008. Trial Attorney Austin M. Berry of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case. U.S. Army Criminal Investigations Division and the FBI’s Birmingham, Alabama, Division investigated the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse, launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2015-12-11T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern",0.118,0.774,0.108,-0.7003,feder juri convict rick evan anniston alabama today aggrav sexual abus child five general lesli caldwel crimin joyc white vanc northern alabama announc accord evid introduc evan former armi soldier wife defens employe resid germani ask take temporari custodi five year child whose parent deploy iraq armi evan sexual abus child multipl occas month child live decemb austin berri crimin child exploit obscen section ceo jacquelyn hutzel northern alabama prosecut armi crimin investig birmingham alabama investig brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit
2,16-213,Alabama Man Indicted on Child Pornography and Sex Tourism Charges,"An Alabama native was indicted today and charged with multiple crimes involving travel with intent to engage in illicit sexual conduct with minors and child pornography, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Kenyen R. Brown of the Southern District of Alabama. Clarence Edward Evers Jr., aka Bud, a technology teacher employed by the Conecuh County, Alabama, Board of Education, was arrested on Feb. 11, 2016, and was charged today with five counts of travel with intent to engage in illicit sexual conduct with a minor, one count of attempted travel with intent to engage in illicit sexual conduct with a minor, one count of production and attempted production of child pornography, one count of transportation of child pornography, one count of receipt of child pornography, one count of access with intent to view child pornography and one count of possession of child pornography. According to the indictment, Evers allegedly traveled to Thailand in the summers of 2010 through 2014 for the purpose of engaging in illicit sexual conduct with a minor and allegedly attempted to make a similar trip in the spring of 2015. During the 2014 trip, Evers also allegedly photographed his victims’ abuse and then transported the images back to the United States. In addition, Evers allegedly had other images of child sexual exploitation on his computers and other electronic devices. The charges contained in the indictment are only allegations. Evers is presumed innocent unless and until he is proven guilty beyond a reasonable doubt in a court of law. ICE-HSI is investigating this case. Trial Attorney James E. Burke IV of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorneys Sean P. Costello and Maria E. Murphy of the Southern District of Alabama are prosecuting the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-02-24T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Southern",0.088,0.811,0.101,0.6124,alabama nativ indict today charg multipl crime involv travel intent engag illicit sexual conduct minor child pornographi announc general lesli caldwel crimin kenyen brown southern alabama clarenc edward ever technolog teacher employ conecuh counti alabama board educ arrest charg today five count travel intent engag illicit sexual conduct minor count attempt travel intent engag illicit sexual conduct minor count product attempt product child pornographi count transport child pornographi count receipt child pornographi count access intent view child pornographi count possess child pornographi accord indict ever alleg travel thailand summer purpos engag illicit sexual conduct minor alleg attempt make similar trip spring trip ever also alleg photograph victim abus transport imag back unit state addit ever alleg imag child sexual exploit comput electron devic charg contain indict alleg ever presum innoc unless proven guilti beyond reason doubt court investig jame burk crimin child exploit obscen section ceo attorney sean costello maria murphi southern alabama prosecut brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit
3,16-381,Alabama Man Indicted for Producing Child Pornography Involving Multiple Victims,"An Alabama man was indicted today by a federal grand jury in Birmingham, Alabama, on charges related to the production of child pornography involving four minor victims, announced Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Joyce White Vance of the Northern District of Alabama. Gregory Jerome Lee, 53, formerly of Cullman County, Alabama, was indicted on four counts of production of child pornography, one count of conspiracy to advertise child pornography and one count of conspiracy to distribute and receive child pornography. According to the indictment, from September 1996 through December 2004, Lee used, persuaded, coerced and enticed minors to engage in sexually explicit conduct in order to produce images of that conduct. Between September 1996 and August 2007, Lee conspired with other individuals to distribute and receive child pornography through a variety of means, including the Internet. The U.S. Postal Inspection Service (USPIS) is investigating the case. Trial Attorney Amy E. Larson of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case. The charges and allegations contained in an indictment are merely accusations. The defendant is presumed innocent unless and until proven guilty. Members of the public who may have information related to this matter should call the USPIS Birmingham Office at (205) 326-2909. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-03-30T00:00:00-04:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern",0.114,0.779,0.107,-0.5385,alabama indict today feder grand juri birmingham alabama charg relat product child pornographi involv four minor victim announc general lesli caldwel crimin joyc white vanc northern alabama gregori jerom former cullman counti alabama indict four count product child pornographi count conspiraci advertis child pornographi count conspiraci distribut receiv child pornographi accord indict septemb decemb use persuad coerc entic minor engag sexual explicit conduct order produc imag conduct septemb august conspir individu distribut receiv child pornographi varieti mean includ internet postal inspect servic uspi investig larson crimin child exploit obscen section ceo jacquelyn hutzel northern alabama prosecut charg alleg contain indict mere accus defend presum innoc unless proven guilti member public inform relat matter call uspi birmingham brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit
4,14-464,Alabama Man Indicted for Threatening African-American Man and Another Person at Restaurant,"Jeremy Heath Higgins was indicted for threatening an African-American man at a Quinton, Alabama, restaurant, and for threatening another person who ordered Higgins to leave the restaurant due to his behavior, Acting Assistant Attorney General Jocelyn Samuels for the Justice Department’s Civil Rights Division and U.S. Attorney Joyce Vance for the Northern District of Alabama announced today. Higgins, 28, was charged in a three count indictment returned yesterday by a federal grand jury in the U.S. District Court for the Northern District of Alabama. The indictment charges him with one felony count and two misdemeanor counts of interference with a federally-protected activity. The indictment alleges that on June 14, 2013, Higgins approached and threatened an African-American man at the Alabama Rose Steakhouse because the man was present at the restaurant with a white woman. According to the indictment, another person ordered Higgins to leave the premises of the restaurant because of Higgins’ behavior toward the African-American man, after which Higgins allegedly shouted a threat to burn down the restaurant. The indictment further alleges that Higgins threatened the person who had ordered him to leave the restaurant by painting graffiti on the restaurant’s exterior and fence. If convicted of the felony count of the indictment, Higgins could face a maximum sentence of 10 years in prison and a $250,000 fine. For each of the misdemeanor charges, Higgins could face a maximum sentence of one year in prison and a $200,000 fine. This case is being investigated by the FBI and is being prosecuted by Assistant U.S. Attorney Robin B. Mark of the Northern District of Alabama and Trial Attorney David Reese of the Justice Department’s Civil Rights Division. An indictment is merely an accusation, and the defendant is presumed innocent unless proven guilty.",2014-05-01T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.141,0.803,0.056,-0.9786,jeremi heath higgin indict threaten african american quinton alabama restaur threaten anoth person order higgin leav restaur behavior act general jocelyn samuel joyc vanc northern alabama announc today higgin charg three count indict return yesterday feder grand juri court northern alabama indict charg feloni count misdemeanor count interfer feder protect activ indict alleg june higgin approach threaten african american alabama rose steakhous present restaur white woman accord indict anoth person order higgin leav premis restaur higgin behavior toward african american higgin alleg shout threat burn restaur indict alleg higgin threaten person order leav restaur paint graffiti restaur exterior fenc convict feloni count indict higgin could face maximum sentenc year prison fine misdemeanor charg higgin could face maximum sentenc year prison fine investig prosecut robin mark northern alabama david rees indict mere accus defend presum innoc unless proven guilti


In [22]:
doj_16_718 = (doj_subset_with_sentiment.loc[doj_subset['id'] == '16-718'].stemmed)
doj_16_217 = (doj_subset_with_sentiment.loc[doj_subset['id'] == '16-217'].stemmed)

print(doj_16_718)
print(doj_16_217)
len(doj_subset)

632    nine count indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit fals report three lie convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci fals statement charg year prison fals report charg indict mere accus defend presum innoc unless proven guilti investig jackson cooper mississippi correct prosecut robert coleman northern mississippi dana mulhaus crimin section marsher indict
Name: stemmed, dtype: object
313    reach comprehens settlement agr

717

### 2.3.2 Create a document-term matrix from the preprocessed press releases and to explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the `compound` sentiment column you added and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so most positive)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so most negative)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. What are the top 10 words for press releases in each of the three `topics_clean`?

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data

**Resources**:

- Here contains an example of applying the create_dtm function: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_examplecode.ipynb


In [11]:
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase = True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), columns=vectorizer.get_feature_names())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis = 1)
    return(dtm_dense_named_withid)

In [24]:

#doj_subset.head()
dtm_nopre = create_dtm(list_of_strings= doj_subset_with_sentiment.stemmed,
                     metadata = doj_subset_with_sentiment[['id', 'compound', 'topics_clean']].rename(columns = 
                                                        {'id': 'id_metadata', 'compound': 'compound_score'}))
dtm_nopre.head()


Unnamed: 0,index,id_metadata,compound_score,topics_clean,aaron,abandon,abbat,abbi,abbott,abdomen,...,zane,zealand,zealous,zeeman,zero,zionism,zobel,zone,zunggeemog,zwengel
0,0,17-1235,-0.9893,Civil Rights,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,15-1522,-0.7003,Project Safe Childhood,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,16-213,0.6124,Project Safe Childhood,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,16-381,-0.5385,Project Safe Childhood,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,14-464,-0.9786,Hate Crimes,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [25]:

#top_05 = dtm_nopre[dtm_nopre["compound_score"] < dtm_nopre["compound_score"].quantile(.05)]



In [27]:
top_05.shape

(36, 6870)

In [32]:
#top_terms = [col for col in dtm_nopre.columns if col not in ['id_metadata', 'index', 'compound_score', 'topics_clean']]

#top_05[top_terms].sum(axis = 0).sort_values(ascending = False).head(10)


assault     185
crime       163
victim      147
hate        126
conspir     119
offic       117
american    110
defend      106
charg       101
african     100
dtype: int64

In [None]:
top_terms.sort_values(ascending = False).head(10)

In [35]:
def quant_words(df):
    top_terms = [col for col in df.columns if col not in ['id_metadata', 'index', 'compound_score', 'topics_clean']]

    return df[top_terms].sum(axis = 0).sort_values(ascending = False).head(10)

top_05 = dtm_nopre[dtm_nopre["compound_score"] < dtm_nopre["compound_score"].quantile(.05)]
bottom_05 = dtm_nopre[dtm_nopre["compound_score"] > dtm_nopre["compound_score"].quantile(.95)]


quant_words(top_05)
quant_words(bottom_05)




assault     185
crime       163
victim      147
hate        126
conspir     119
offic       117
american    110
defend      106
charg       101
african     100
dtype: int64

agreement     170
disabl        132
enforc        122
ensur         104
settlement    103
state         103
communiti     101
hous           90
polic          88
student        85
dtype: int64

In [37]:
hate_crimes = dtm_nopre.loc[dtm_nopre['topics_clean'] == 'Hate Crimes']
civil_rights = dtm_nopre.loc[dtm_nopre['topics_clean'] == 'Civil Rights']
psc = dtm_nopre.loc[dtm_nopre['topics_clean'] == 'Project Safe Childhood']





quant_words(hate_crimes) 
quant_words(civil_rights) 
quant_words(psc) 


victim      591
crime       557
hate        524
defend      484
prosecut    478
charg       463
sentenc     455
american    451
feder       432
guilti      430
dtype: int64

offic        637
hous         633
discrimin    616
enforc       544
disabl       532
said         497
feder        479
violat       477
state        452
court        414
dtype: int64

child          1022
exploit         701
sexual          572
safe            479
childhood       474
project         472
pornographi     452
children        423
crimin          405
prosecut        374
dtype: int64

### 2.3.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.3.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

**Resources**:

- Same topic modeling resources linked to above

In [39]:
tokenized_text = [wordpunct_tokenize(token) for token in 
                                      doj_subset_with_sentiment.stemmed]

### create dictionary
text_proc_dict = corpora.Dictionary(tokenized_text)
### filter dictionary- using 2% as bounds
text_proc_dict.filter_extremes(no_below = round(doj_subset_with_sentiment.shape[0]*0.02),
                             no_above = round(doj_subset_with_sentiment.shape[0]*0.98))

### create corpus from dictionary
corpus_fromdict_proc = [text_proc_dict.doc2bow(token) 
                   for token in tokenized_text]

### estimate model
n_topics = 3
ldamod_proc = gensim.models.ldamodel.LdaModel(corpus_fromdict_proc, 
                                         num_topics = n_topics, id2word=text_proc_dict, 
                                         passes=6, alpha = 'auto',
                                         per_word_topics = True, random_state = 91988)


In [40]:
topics = ldamod_proc.print_topics(num_words = 15)
for topic in topics:
    print(topic)

(0, '0.012*"victim" + 0.012*"crime" + 0.011*"prosecut" + 0.011*"hate" + 0.010*"sentenc" + 0.010*"defend" + 0.010*"charg" + 0.010*"said" + 0.010*"guilti" + 0.010*"feder" + 0.010*"american" + 0.009*"year" + 0.008*"african" + 0.008*"today" + 0.007*"investig"')
(1, '0.012*"hous" + 0.011*"offic" + 0.010*"disabl" + 0.009*"enforc" + 0.009*"discrimin" + 0.008*"said" + 0.008*"violat" + 0.008*"feder" + 0.008*"alleg" + 0.008*"state" + 0.007*"court" + 0.007*"agreement" + 0.007*"general" + 0.006*"polic" + 0.006*"today"')
(2, '0.032*"child" + 0.021*"exploit" + 0.018*"sexual" + 0.015*"safe" + 0.014*"project" + 0.014*"childhood" + 0.014*"pornographi" + 0.013*"children" + 0.013*"crimin" + 0.012*"prosecut" + 0.011*"sentenc" + 0.011*"victim" + 0.010*"ceo" + 0.010*"minor" + 0.009*"year"')


### 2.3.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, use the argument `minimum_probability` = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the `doj_subset` dataframe

B. Add the topic probabilities to the `doj_subset` dataframe as columns and code each document to its highest-probability topic

C. For each of the manual labels in `topics_clean` (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)

**Resources**:

- End of this code contains example of how to use `get_document_topics` and other steps to add topic probabilities back to data: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_solution.ipynb

In [54]:
topic_probs_bydoc =[ldamod_proc.get_document_topics(item, minimum_probability = 0) for item in corpus_fromdict_proc]
assert len(topic_probs_bydoc) == len(doj_subset)

topic_probs_bydoc_long = pd.DataFrame([t for lst in topic_probs_bydoc for t in lst],
                                     columns = ['topic', 'probability'])


topic_probs_bydoc_long['doc_id'] = list(np.concatenate([[one_id] * n_topics for one_id in doj_subset_with_sentiment.id]).flat)
topic_probs_bydoc_long.head()

topic_probs_bydoc_wide = pd.pivot_table(topic_probs_bydoc_long, index = ['doc_id'],
                        columns = ['topic']).reset_index().reset_index(drop = True)
topic_probs_bydoc_wide.columns = ['doc_id'] + ["topic_" + str(i) for i in np.arange(0, n_topics)]
topic_probs_bydoc_wide.head()

Unnamed: 0,topic,probability,doc_id
0,0,0.30378,17-1235
1,1,0.69573,17-1235
2,2,0.00049,17-1235
3,0,0.000928,15-1522
4,1,0.000905,15-1522


Unnamed: 0,doc_id,topic_0,topic_1,topic_2
0,09-063,0.996543,0.002072,0.001386
1,09-067,0.999265,0.00044,0.000294
2,09-081,0.999487,0.000307,0.000206
3,09-1011,0.999179,0.000492,0.000329
4,09-1021,0.998045,0.001171,0.000784


In [55]:
topic_wmeta = pd.merge(topic_probs_bydoc_wide,
                      doj_subset_with_sentiment,
                      left_on = 'doc_id',
                      right_on = 'id')

topic_wmeta['toptopic'] = topic_wmeta[[col for col in topic_wmeta.columns if 
                                    "topic_" in col]].idxmax(axis=1)
topic_wmeta.head()




Unnamed: 0,doc_id,topic_0,topic_1,topic_2,id,title,contents,date,topics_clean,components_clean,neg,neu,pos,compound,stemmed,toptopic
0,09-063,0.996543,0.002072,0.001386,09-063,New York Man Pleads Guilty to Federal Hate Crime Conspiracy,"WASHINGTON – Brian Carranza, 21, pleaded guilty today before U.S. District Court Judge Carol B. Amon in Brooklyn, N.Y., to conspiring to assault African-American residents in Staten Island, N.Y., in retaliation for President Barack Obama winning last year’s presidential election, Acting Assistant Attorney General Loretta King for the Civil Rights Division and U.S. Attorney for the Eastern District of New York Benton J. Campbell announced. Carranza, of Staten Island, N.Y., faces a maximum sentence of 10 years in prison and a 250,000 fine. A sentencing date has not been set by the court. The case is being investigated by the FBI and the New York City Police Department. The case is being prosecuted by Special Litigation Counsel Kristy Parker of the Justice Department’s Civil Rights Division and Assistant U.S. Attorney Pamela Chen.",2009-01-26T00:00:00-05:00,Hate Crimes,Civil Rights Division,0.106,0.809,0.084,-0.5106,washington brian carranza plead guilti today court judg carol amon brooklyn conspir assault african american resid staten island retali presid barack obama win last year presidenti elect act general loretta king eastern york benton campbel announc carranza staten island face maximum sentenc year prison fine sentenc date court investig york citi polic prosecut special litig counsel kristi parker pamela chen,topic_0
1,09-067,0.999265,0.00044,0.000294,09-067,"Three Men Indicted for Racially-Motivated Church Arson in Springfield, Mass.","WASHINGTON – Three individuals were indicted today by a federal grand jury in the District of Massachusetts for conspiring to interfere with the civil rights of members of the Macedonia Church of God in Christ, a Springfield, Mass., church with a predominantly African-American congregation. The indictment was announced by Loretta King, Acting Assistant Attorney General for the Civil Rights Division; U.S. Attorney Michael J. Sullivan for the District of Massachusetts; Glenn N. Anderson, Special Agent in Charge of the Bureau of Alcohol, Tobacco, Firearms and Explosives - Boston Field Division; Warren T. Bamford, Special Agent in Charge of the FBI’s Boston Field Office; Colonel Mark Delaney, Superintendent of the Massachusetts State Police; William Bennett, Hampden County District Attorney; and Commissioner William J. Fitchet of the Springfield Police Department. The church’s newly constructed building burned to the ground on Nov. 5, 2008, hours after the election of President Barack Obama. The indictment alleges that Benjamin Haskell, 22, Michael Jacques, 24, and Thomas Gleason, 21, all of Springfield, conspired to burn the church in retaliation for the election of the country’s first African-American president. ""These allegations of racial violence connected with the presidential election are serious and disturbing,"" said Acting Assistant Attorney General Loretta King. ""The Justice Department will aggressively prosecute individuals who conspire to commit such acts of violence and intimidation."" The indictment alleges that several hours after Barack Obama was elected President, Haskell, Jacques and Gleason conspired to burn the Macedonia Church of God in Christ’s new under-construction church building, which was 75 percent complete at the time of the fire. According to the indictment, on Election Night, Haskell, Jacques and Gleason used racial slurs and expressed anger with the election of Barack Obama and discussed burning the Macedonia Church of God in Christ’s new church building because the church members, congregants and bishop were African-American. They then obtained gasoline, poured it on the interior and exterior of the new church building and started a fire that destroyed nearly the entire structure. Some of the responding firefighters suffered injuries as they worked to extinguish the blaze. ""Racism has devastating effects on individuals, and stifles the quality of life in the community,"" said U.S. Attorney Sullivan. ""We will not tolerate those who victimize others and I am angered and saddened that the neighborhood has endured such cruel acts by those allegedly living in the same community."" If convicted, Haskell, Jacques and Gleason face a maximum prison sentence of 10 years to be followed by three years of supervised release. The details contained in the indictment are allegations. The defendants are presumed to be innocent unless and until proven guilty beyond a reasonable doubt in a court of law. The case is being investigated by the FBI; Bureau of Alcohol, Tobacco, Firearms and Explosives; Massachusetts State Police; Hampden County District Attorney’s Office and the Springfield Police Department. It is being prosecuted by Trial Attorney Erin Aslan of the Justice Department’s Civil Rights Division and Assistant U.S. Attorneys Paul Smyth and Kevin O’Regan of the U.S. Attorney’s Office for the District of Massachusetts.",2009-01-27T00:00:00-05:00,Hate Crimes,Civil Rights Division,0.12,0.832,0.048,-0.9921,washington three individu indict today feder grand juri massachusett conspir interfer member macedonia church christ springfield mass church predomin african american congreg indict announc loretta king act general michael sullivan massachusett glenn anderson special agent charg bureau alcohol tobacco firearm explos boston field warren bamford special agent charg boston field colonel mark delaney superintend massachusett state polic william bennett hampden counti commission william fitchet springfield polic church newli construct build burn ground hour elect presid barack obama indict alleg benjamin haskel michael jacqu thoma gleason springfield conspir burn church retali elect countri first african american presid alleg racial violenc connect presidenti elect serious disturb said act general loretta king aggress prosecut individu conspir commit act violenc intimid indict alleg sever hour barack obama elect presid haskel jacqu gleason conspir burn macedonia church christ construct church build percent complet time fire accord indict elect night haskel jacqu gleason use racial slur express anger elect barack obama discuss burn macedonia church christ church build church member congreg bishop african american obtain gasolin pour interior exterior church build start fire destroy near entir structur respond firefight suffer injuri work extinguish blaze racism devast effect individu stifl qualiti life communiti said sullivan toler victim other anger sadden neighborhood endur cruel act alleg live communiti convict haskel jacqu gleason face maximum prison sentenc year follow three year supervis releas detail contain indict alleg defend presum innoc unless proven guilti beyond reason doubt court investig bureau alcohol tobacco firearm explos massachusett state polic hampden counti springfield polic prosecut erin aslan attorney paul smyth kevin regan massachusett,topic_0
2,09-081,0.999487,0.000307,0.000206,09-081,Final Defendant Pleads Guilty to Anti-Obama Assaults,"WASHINGTON - Ralph Nicoletti pleaded guilty in Brooklyn, N.Y., federal court today before U.S. District Judge Carol B. Amon to committing three assaults targeting African-American residents in Staten Island, N.Y., on the night of President Barack Obama’s election victory. Nicoletti was the last of four defendants to plead guilty in the federal prosecution stemming from the attacks. The other three defendants – Bryan Garaventa, Michael Contreras and Brian Carranza – previously pleaded guilty to conspiring to commit the hate crime assaults and each face sentences of up to 10 years in prison. As part of his plea, Nicoletti has agreed to a sentence of 12 years, subject to the court’s approval. The guilty plea was announced by Loretta King, Acting Assistant Attorney General for the Department of Justice’s Civil Rights Division; Benton J. Campbell, U.S. Attorney for the Eastern District of New York; Joseph M. Demarest, Jr., Assistant Director-in-Charge, FBI, New York Field Office; and Raymond W. Kelly, Commissioner, New York City Police Department. At the plea proceeding, Nicoletti admitted that on Nov. 4, 2008, the night of the presidential election, the defendants decided to assault African-Americans in Staten Island after President Obama was declared the winner of the election. The defendants targeted African-Americans believing that they had voted for President Obama. Nicoletti drove the group to the Park Hill section of Staten Island, a predominantly African-American neighborhood, where they came upon an African-American teenager and assaulted him. Nicoletti struck the teenager with a metal pipe and Garaventa hit him with a collapsible police baton. Nicoletti then drove to the Port Richmond section of Staten Island, where the defendants assaulted an unidentified African-American man. During that assault, Garaventa tripped the victim and pushed him to the ground. The third assault was against an individual whom the defendants mistakenly believed was African-American. The plan was for Contreras to hit the victim with the police baton as the defendants drove by him. Instead, Nicoletti deliberately drove his car into the victim’s body. The victim was thrown onto the hood of the car and hit the front windshield, smashing it. The victim was seriously injured and remained in a coma for several weeks after the attack. ""This successful prosecution sends a clear message that racially-motivated acts of violence targeted at those who are exercising their right to vote are intolerable and will be aggressively investigated and prosecuted,"" said Acting Assistant Attorney General King. ""It is a tragedy that these crimes occur at all, but the Department of Justice will remain vigilant in our efforts to combat hate crimes, as they tear at the very fabric of our great nation."" ""The conduct of the defendants is shocking and deplorable,"" stated U.S. Attorney Campbell. ""On a night of historic significance, these four angry men assaulted their victims in an attempt to punish them for exercising a fundamental right of all Americans – the right to vote. Those who commit such crimes will be swiftly apprehended, prosecuted and punished. We are grateful for our partnership with the Department of Justice Civil Rights Division, the FBI and the New York City Police Department, which has been vital to the success of this case, and I particularly wish to thank the Richmond County District Attorney’s Office for its assistance in this matter."" ""The crimes these defendants have now admitted to were violent assaults that in one case nearly killed a man,"" said FBI Assistant Director-in-Charge Demarest of the New York Field Office. ""In attempting to intimidate voters, the defendants also violated the victims’ civil rights in a way that was an attack on the democratic process. These were serious crimes that prompted the serious response the FBI will always bring to bear in civil rights enforcement."" ""It was important to make certain that those who seriously injured individuals, based on their race, did not escape justice,"" said Police Commissioner Raymond W. Kelly. ""NYPD Inspector Michael J. Osgood, Commanding Officer of the NYPD Hate Crime Task Force, had the foresight to assign a special team on Election Night until 4 a.m. the next morning. As a result, his investigators were in position to respond quickly to the bias attacks as reports of them began to emerge. Detectives located an eyewitness to one of the attacks, and their subsequent distribution of flyers in the Rosebank area of Staten Island over three days led to the first major break in the case. I also want to thank the FBI agents who helped, and the federal prosecutors who succeeded in winning the guilty pleas."" The government’s case is being prosecuted by Assistant U.S. Attorneys Pamela K. Chen and Margo K. Brodie, and Department of Justice Special Litigation Counsel Kristy Parker.",2009-02-02T00:00:00-05:00,Hate Crimes,Civil Rights Division,0.171,0.73,0.099,-0.9956,washington ralph nicoletti plead guilti brooklyn feder court today judg carol amon commit three assault target african american resid staten island night presid barack obama elect victori nicoletti last four defend plead guilti feder prosecut stem attack three defend bryan garaventa michael contrera brian carranza previous plead guilti conspir commit hate crime assault face sentenc year prison part plea nicoletti agre sentenc year subject court approv guilti plea announc loretta king act general benton campbel eastern york joseph demarest director charg york field raymond kelli commission york citi polic plea proceed nicoletti admit night presidenti elect defend decid assault african american staten island presid obama declar winner elect defend target african american believ vote presid obama nicoletti drove group park hill section staten island predomin african american neighborhood came upon african american teenag assault nicoletti struck teenag metal pipe garaventa collaps polic baton nicoletti drove port richmond section staten island defend assault unidentifi african american assault garaventa trip victim push ground third assault individu defend mistaken believ african american plan contrera victim polic baton defend drove instead nicoletti deliber drove victim bodi victim thrown onto hood front windshield smash victim serious injur remain coma sever week attack success prosecut send clear messag racial motiv act violenc target exercis right vote intoler aggress investig prosecut said act general king tragedi crime occur remain vigil effort combat hate crime tear fabric great nation conduct defend shock deplor state campbel night histor signific four angri assault victim attempt punish exercis fundament right american right vote commit crime swift apprehend prosecut punish grate partnership york citi polic vital success particular wish thank richmond counti matter crime defend admit violent assault near kill said director charg demarest york field attempt intimid voter defend also violat victim attack democrat process serious crime prompt serious respons alway bring bear enforc import make certain serious injur individu base race escap said polic commission raymond kelli nypd inspector michael osgood command offic nypd hate crime task forc foresight assign special team elect night next morn result investig posit respond quick bias attack report began emerg detect locat eyewit attack subsequ distribut flyer rosebank area staten island three day first major break also want thank agent help feder prosecutor succeed win guilti plea govern prosecut attorney pamela chen margo brodi special litig counsel kristi parker,topic_0
3,09-1011,0.999179,0.000492,0.000329,09-1011,Last Defendant in Tennessee Islamic Center Burning Pleads Guilty,"WASHINGTON – Eric Ian Baker pleaded guilty today in federal court in Nashville, Tenn., for his role in burning and vandalizing the Islamic Center of Columbia, Tenn., on Feb. 9, 2008. Baker was charged with violating civil rights that protect religious property and for using fire in the commission of a felony. Two other defendants, Michael Corey Golden and Jonathan Edward Stone, had previously pleaded guilty in November 2008 for their roles in the arson. During the plea hearing, Baker admitted that he, Golden and Stone assembled Molotov cocktail incendiary devices, broke into the Islamic Center, ignited the devices and used them to completely destroy the mosque. He admitted to painting swastikas and the phrase ""White Power"" on the mosque in the course of the arson and that they acted because of the religious character of the property. ""The law protects the right of all Americans to worship where and how they choose without fear of violence or intimidation,"" said Loretta King, Acting Assistant Attorney General for the Civil Rights Division. ""The Civil Rights Division will vigorously prosecute those who, through acts of terror, attempt to interfere with that right."" ""This type of crime strikes at the heart of our civil rights and religious freedoms in America. I am very pleased that through local, state and federal cooperation all defendants responsible for this vile attack have been brought to justice,"" said U.S. Attorney Edward M. Yarbrough for the Middle District of Tennessee. ""Every Muslim who saw the news photos with the Swastika painted on the burned out Islamic center was victimized by this attack. Today, they can clearly see that American law enforcement stands strongly with them to guarantee their freedoms to worship and assemble,"" said ATF Nashville Field Division Special Agent in Charge James M. Cavanaugh. ""The FBI is committed to protecting the civil rights of all people through the enforcement of federal civil rights statutes,"" said FBI Memphis Division Special Agent in Charge My Harrison. ""The destruction of any place of worship will not be ignored and the FBI will make every effort to bring those who commit such heinous acts to justice.""‬‪ A date for Baker’s sentencing hearing will be scheduled at a later time. Stone and Golden are scheduled to be sentenced on Nov. 23, 2009. All three defendants face prison sentences of up to 30 years for damaging religious property and for using fire and an explosive device to commit a federal felony offense. The case was investigated by the Columbia, Tenn., Police Department and special agents with the Bureau of Alcohol, Tobacco, Firearms and Explosives, Tennessee State Bomb and Arson and the FBI. The case is being prosecuted by Assistant U.S. Attorney Hal McDonough and Civil Rights Division Trial Attorney Jonathan Skrmetti.",2009-09-22T00:00:00-04:00,Hate Crimes,Civil Rights Division,0.143,0.74,0.117,-0.9653,washington eric baker plead guilti today feder court nashvill tenn role burn vandal islam center columbia tenn baker charg violat protect religi properti use fire commiss feloni defend michael corey golden jonathan edward stone previous plead guilti novemb role arson plea hear baker admit golden stone assembl molotov cocktail incendiari devic broke islam center ignit devic use complet destroy mosqu admit paint swastika phrase white power mosqu cours arson act religi charact properti protect right american worship choos without fear violenc intimid said loretta king act general vigor prosecut act terror attempt interfer right type crime strike heart religi freedom america pleas local state feder cooper defend respons vile attack brought said edward yarbrough middl tennesse everi muslim news photo swastika paint burn islam center victim attack today clear american enforc stand strong guarante freedom worship assembl said nashvill field special agent charg jame cavanaugh commit protect peopl enforc feder statut said memphi special agent charg harrison destruct place worship ignor make everi effort bring commit heinous act date baker sentenc hear schedul later time stone golden schedul sentenc three defend face prison sentenc year damag religi properti use fire explos devic commit feder feloni offens investig columbia tenn polic special agent bureau alcohol tobacco firearm explos tennesse state bomb arson prosecut mcdonough jonathan skrmetti,topic_0
4,09-1021,0.998045,0.001171,0.000784,09-1021,Two Indiana Men Plead Guilty to Cross Burning,"Richard LaShure, 41, and Aaron Latham, 20, both of Muncie, Ind., pleaded guilty to conspiring to violate the civil rights of an African American family and to interfering with their housing rights by burning a cross in the family’s yard. According to the charging document, on July 25, 2008, the two men, acting with the assistance of a third participant, built a cross and poured gasoline on it, then set it on fire in the yard of an African-American family who lived in the neighborhood. They will be sentenced on Nov. 5, 2009. This is the second case in two years in which the Civil Rights Division has brought charges for a cross burning that occurred in Muncie, Ind. Two men were convicted in 2008 for burning a cross at the home of a woman who had biracial children. ""These two men used a despicable and unmistakable symbol of hatred, the burning cross, to intimidate a family because they are African American,"" said Loretta King, Acting Assistant Attorney General for the Civil Rights Division. ""The Civil Rights Division will continue to prosecute this type of illegal, hateful behavior to the fullest extent of the law."" The guilty pleas resulted from an investigation by Special Agent Charlie Rownd from the Muncie Field Office of the FBI and Betsy Biffl from the Civil Rights Division of the United States Department of Justice .",2009-09-24T00:00:00-04:00,Hate Crimes,Civil Rights Division,0.117,0.848,0.035,-0.9584,richard lashur aaron latham munci plead guilti conspir violat african american famili interf hous burn cross famili yard accord charg document juli act third particip built cross pour gasolin fire yard african american famili live neighborhood sentenc second year brought charg cross burn occur munci convict burn cross home woman biraci children use despic unmistak symbol hatr burn cross intimid famili african american said loretta king act general continu prosecut type illeg hate behavior fullest extent guilti plea result special agent charli rownd munci field betsi biffl unit state,topic_0


In [64]:
#topic_wmeta.head()
pd.crosstab(topic_wmeta.topics_clean, topic_wmeta.toptopic).apply(lambda r: r/r.sum(), axis=1)

toptopic,topic_0,topic_1,topic_2
topics_clean,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Civil Rights,0.146179,0.847176,0.006645
Hate Crimes,1.0,0.0,0.0
Project Safe Childhood,0.0,0.00625,0.99375


In [None]:
## civil rights has the most challenge matching data. This is likely because civil rights discussions
## oft include mentions of things like hate crimes, but also other triumphs. This is not the same with
## reports of hate crimes, where there is generally very little postive news. 

## 2.5 OPTIONAL extra credit (5 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 2.1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings. Feel free to load additional packages if needed

Find the top two pairs (so four press releases total)-- do they seem like different stages of the same crime or just press releases covering similar crimes?