# Problem set 4: Text analysis of DOJ press releases

**Total points (without extra credit)**: 52 

- For background:

    - DOJ is the federal law enforcement agency responsible for federal prosecutions; this contrasts with the local prosecutions in the Cook County dataset we analyzed earlier. Here's a short explainer on which crimes get prosecuted federally versus locally: https://www.criminaldefenselawyer.com/resources/criminal-defense/federal-crime/state-vs-federal-crimes.htm#:~:text=Federal%20criminal%20prosecutions%20are%20handled,of%20state%20and%20local%20law. 
    - Here's the Kaggle page that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 
    - Here's the code the dataset creator used to scrape those press releases, if you're interested: https://github.com/jbencina/dojreleases

## 0.0 Import packages

In [361]:
## helpful packages
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import random
import re
import string

## nltk imports
import nltk
### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
# nltk.download('averaged_perceptron_tagger')
# nltk.download('stopwords')
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

## spacy imports
import spacy
### uncomment and run the below line if you haven't loaded the en_core_web_sm library yet
# ! python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts and wide-format text
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_colwidth', None)

## 0.1 Load and clean text data

In [362]:
## first, unzip the file pset4_inputdata.zip 
## then, run this code to load the unzipped json file and convert to a dataframe
## (may need to change the pathname depending on where you store stuff)
## and convert some of the attributes from lists to values
doj = pd.read_json("combined.json", lines = True)

## topics are stored as lists in the json, so join each list into one string separated by ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 
           'components_clean']].copy()

doj.head(2)

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23, who was convicted in 2013 of attempting to use a weapon of mass destruction (explosives) in connection with a plot to detonate a vehicle bomb at an annual Christmas tree lighting ceremony in Portland, was sentenced today to serve 30 years in prison, followed by a lifetime term of supervised release. Mohamud, a naturalized U.S. citizen from Somalia and former resident of Corvallis, Oregon, was arrested on Nov. 26, 2010, after he attempted to detonate what he believed to be an explosives-laden van that was parked near the tree lighting ceremony in Portland. The arrest was the culmination of a long-term undercover operation, during which Mohamud was monitored closely for months as his bomb plot developed. The device was in fact inert, and the public was never in danger from the device. At sentencing, United States District Court Judge Garr M. King, who presided over Mohamed’s 14-day trial, said “the intended crime was horrific,” and that the defendant, even though he was presented with options by undercover FBI employees, “never once expressed a change of heart.” King further noted that the Christmas tree ceremony was attended by up to 10,000 people, and that the defendant “wanted everyone to leave either dead or injured.” King said his sentence was necessary in view of the seriousness of the crime and to serve as deterrence to others who might consider similar acts. “With today’s sentencing, Mohamed Osman Mohamud is being held accountable for his attempted use of what he believed to be a massive bomb to attack innocent civilians attending a public Christmas tree lighting ceremony in Portland,” said John P. Carlin, Assistant Attorney General for National Security. “The evidence clearly indicated that Mohamud was intent on killing as many people as possible with his attack. 
Fortunately, law enforcement was able to identify him as a threat, insert themselves in the place of a terrorist that Mohamud was trying to contact, and thwart Mohamud’s efforts to conduct an attack on our soil. This case highlights how the use of undercover operations against would-be terrorists allows us to engage and disrupt those who wish to commit horrific acts of violence against the innocent public. The many agents, analysts, and prosecutors who have worked on this case deserve great credit for their roles in protecting Portland from the threat posed by this defendant and ensuring that he was brought to justice.” “This trial provided a rare glimpse into the techniques Al Qaeda employs to radicalize home-grown extremists,” said Amanda Marshall, U.S. Attorney for the District of Oregon. “With the sentencing today, the court has held this defendant accountable. I thank the dedicated professionals in the law enforcement and intelligence communities who were responsible for this successful outcome. I look forward to our continued work with Muslim communities in Oregon who are committed to ensuring that all young people are safe from extremists who seek to radicalize others to engage in violence.” According to the trial evidence, in February 2009, Mohamud began communicating via e-mail with Samir Khan, a now-deceased al Qaeda terrorist who published Jihad Recollections, an online magazine that advocated violent jihad, and who also published Inspire, the official magazine of al-Qaeda in the Arabian Peninsula. Between February and August 2009, Mohamed exchanged approximately 150 emails with Khan. Mohamud wrote several articles for Jihad Recollections that were published under assumed names. In August 2009, Mohamud was in email contact with Amro Al-Ali, a Saudi national who was in Yemen at the time and is today in custody in Saudi Arabia for terrorism offenses. 
Al-Ali sent Mohamud detailed e-mails designed to facilitate Mohamud’s travel to Yemen to train for violent jihad. In December 2009, while Al-Ali was in the northwest frontier province of Pakistan, Mohamud and Al-Ali discussed the possibility of Mohamud traveling to Pakistan to join Al-Ali in terrorist activities. Mohamud responded to Al-Ali in an e-mail: “yes, that would be wonderful, just tell me what I need to do.” Al-Ali referred Mohamud to a second associate overseas and provided Mohamud with a name and email address to facilitate the process. In the following months, Mohamud made several unsuccessful attempts to contact Al-Ali’s associate. Ultimately, an FBI undercover operative contacted Mohamud via email under the guise of being an associate of Al-Ali’s. Mohamud and the FBI undercover operative agreed to meet in Portland in July 2010. At the meeting, Mohamud told the FBI undercover operative he had written articles that were published in Jihad Recollections. Mohamud also said that he wanted to become “operational.” Asked what he meant by “operational,” Mohamud said he wanted to put an explosion together, but needed help. According to evidence presented at trial, at a meeting in August 2010, Mohamud told undercover FBI operatives he had been thinking of committing violent jihad since the age of 15. Mohamud then told the undercover FBI operatives that he had identified a potential target for a bomb: the annual Christmas tree lighting ceremony in Portland’s Pioneer Courthouse Square on Nov. 26, 2010. The undercover FBI operatives cautioned Mohamud several times about the seriousness of this plan, noting there would be many people at the event, including children, and emphasized that Mohamud could abandon his attack plans at any time with no shame. Mohamud indicated the deaths would be justified and that he would not mind carrying out a suicide attack on the crowd. 
According to evidence presented at trial, in the ensuing months Mohamud continued to express his interest in carrying out the attack and worked on logistics. On Nov. 4, 2010, Mohamud and the undercover FBI operatives traveled to a remote location in Lincoln County, Oregon, where they detonated a bomb concealed in a backpack as a trial run for the upcoming attack. During the drive back to Corvallis, Mohamud was asked if was capable looking at all the bodies of those who would be killed during the explosion. In response, Mohamud noted, “I want whoever is attending that event to be, to leave either dead or injured.” Mohamud later recorded a video of himself, with the assistance of the undercover FBI operatives, in which he read a statement that offered his rationale for his bomb attack. On Nov. 18, 2010, undercover FBI operatives picked up Mohamud to travel to Portland to finalize the details of the attack. On Nov. 26, 2010, just hours before the planned attack, Mohamud examined the 1,800 pound bomb in the van and remarked that it was “beautiful.” Later that day, Mohamud was arrested after he attempted to remotely detonate the inert vehicle bomb rked near the Christmas tree lighting ceremony This case was investigated by the FBI, with assistance from the Oregon State Police, the Corvallis Police Department, the Lincoln County Sheriff’s Office and the Portland Police Bureau. The prosecution was handled by Assistant U.S. Attorneys Ethan D. Knight and Pamala Holsinger from the U.S. Attorney’s Office for the District of Oregon. Trial Attorney Jolie F. Zimmerman, from the Counterterrorism Section of the Justice Department’s National Security Division, assisted. # # # 14-1077",2014-10-01T00:00:00-04:00,No topic,National Security Division (NSD)
1,12-919,$1 Million in Restitution Payments Announced to Preserve North Carolina Wetlands,"WASHINGTON – North Carolina’s Waccamaw River watershed will benefit from a $1 million restitution order from a federal court, funding environmental projects to acquire and preserve wetlands in an area damaged by illegal releases of wastewater from a corporate hog farm, announced Ignacia S. Moreno, Assistant Attorney General of the Justice Department’s Environment and Natural Resources Division; U.S. Attorney for the Eastern District of North Carolina Thomas G. Walker; Director Greg McLeod from the North Carolina State Bureau of Investigation; and Camilla M. Herlevich, Executive Director of the North Carolina Coastal Land Trust. Freedman Farms Inc. was sentenced in February 2012 to five years of probation and ordered to pay $1.5 million in fines, restitution and community service payments for violating the Clean Water Act when it discharged hog waste into a stream that leads to the Waccamaw River. William B. Freedman, president of Freedman Farms, was sentenced to six months in prison to be followed by six months of home confinement. Freedman Farms also is required to implement a comprehensive environmental compliance program and institute an annual training program. In an order issued on April 19, 2012, the court ordered that the defendants would be responsible for restitution of $1 million in the form of five annual payments starting in January 2013, which the court will direct to the North Carolina Coastal Land Trust (NCCLT). The NCCLT plans to use the money to acquire and conserve land along streams in the Waccamaw watershed. The court also directed a $75,000 community service payment to the Southern Environmental Enforcement Network, an organization dedicated to environmental law enforcement training and information sharing in the region. 
“The resolution of the case against Freedman Farms demonstrates the commitment of the Department of Justice to enforcing the Clean Water Act to ensure the protection of human health and the environment,” said Assistant Attorney General Moreno. “The court-ordered restitution in this case will conserve wetlands for the benefit of the people of North Carolina. By enforcing the nation’s environmental laws, we will continue to ensure that concentrated animal feeding operations (CAFOs) operate without threatening our drinking water, the health of our communities and the environment.” “This office is committed to doing our part to hold accountable those who commit crimes against our environment, which can cause serious health problems to residents and damage the environment that makes North Carolina such a beautiful place to live and visit,” said U.S. Attorney Walker. “This case shows what we can accomplish when our SBI agents work closely with their local, state and federal partners to investigate environmental crimes and hold the polluters accountable,” said Director McLeod. “We’ll continue our efforts to fight illegal pollution that damages our water and puts the public’s health at risk.” “The Waccamaw is unique and wild,” said Director Herlevich of the North Carolina Coastal Land Trust. “Its watershed includes some of the most extensive cypress gum swamps in the state, and its headwaters at Lake Waccamaw contain fish that are found nowhere else on Earth. We appreciate the trust of the court and the U. S. Attorney, and we look forward to using these funds for conservation projects in a river system that is one of our top conservation priorities.” According to evidence presented in court, in December 2007 Freedman Farms discharged hog waste into Browder’s Branch, a tributary to the Waccamaw River that flows through the White Marsh, a large wetlands complex. 
Freedman Farms, located in Columbus County, N.C., is in the business of raising hogs for market, and this particular farm had some 4,800 hogs. The hog waste was supposed to be directed to two lagoons for treatment and disposal. Instead, hog waste was discharged from Freedman Farms directly into Browder’s Branch. The Clean Water Act is a federal law that makes it illegal to knowingly or negligently discharge a pollutant into a water of the United States. The Freedman case was investigated by the U.S. Environmental Protection Agency (EPA) Criminal Investigation Division, the U.S. Army Corps of Engineers and the North Carolina State Bureau of Investigation, with assistance from the EPA Science and Ecosystem Support Division. The case was prosecuted by Assistant U.S. Attorney J. Gaston B. Williams of the Eastern District of North Carolina and Trial Attorney Mary Dee Carraway of the Environmental Crimes Section of the Justice Department’s Environment and Natural Resources Division. The North Carolina Coastal Land Trust is celebrating its 20th anniversary of saving special lands in eastern North Carolina. The organization has protected nearly 50,000 acres of lands with scenic, recreational, historic and ecological values. North Carolina Coastal Land Trust has saved streams and wetlands that provide clean water, forests that are havens for wildlife, working farms that provide local food and nature parks that everyone can enjoy. More information about the Coastal Land Trust is available at www.coastallandtrust.org.",2012-07-25T00:00:00-04:00,No topic,Environment and Natural Resources Division


## 1. Tagging and sentiment scoring (17 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to convert it from a pandas series to a single string.

We'll call the raw string of this press release `pharma`

In [363]:
## your code to subset to one press release and take the string
pharma = doj[doj["id"] == "17-1204"]["contents"].values[0]
pharma

'The founder and majority owner of Insys Therapeutics Inc., was arrested today and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a Fentanyl spray intended for cancer patients experiencing breakthrough pain.\xa0"More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids. And yet some medical professionals would rather take advantage of the addicts than try to help them," said Attorney General Jeff Sessions. "This Justice Department will not tolerate this.\xa0 We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic.\xa0 And under the leadership of President Trump, we are fully committed to defeating this threat to the American people.”John N. Kapoor, 74, of Phoenix, Ariz., a current member of the Board of Directors of Insys, was arrested this morning in Arizona and charged with RICO conspi

### 1.1 part of speech tagging (3 points)

A. Preprocess the `pharma` press release to remove all punctuation and digits (you can use `.isalpha()` to subset)

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech. 

C. Using the output from B, extract the adjectives and sort them from most occurrences to fewest. Print a dataframe with the 5 most frequent adjectives and their counts in the `pharma` release. See here for a list of the adjective tags within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for `.isalpha()`: https://www.w3schools.com/python/ref_string_isalpha.asp
- `processtext` function here has an example of tokenizing and filtering to words where `.isalpha()` is true: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb
- Part of speech tagging section of this code: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partI_textmining_solutions.ipynb



In [364]:
## your code here to restrict to alpha
pharma_alpha = "".join([char for char in pharma if char.isalpha() or char == " "])
pharma_alpha

'The founder and majority owner of Insys Therapeutics Inc was arrested today and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a Fentanyl spray intended for cancer patients experiencing breakthrough painMore than  Americans died of synthetic opioid overdoses last year and millions are addicted to opioids And yet some medical professionals would rather take advantage of the addicts than try to help them said Attorney General Jeff Sessions This Justice Department will not tolerate this We will hold accountable anyone  from street dealers to corporate executives  who illegally contributes to this nationwide epidemic And under the leadership of President Trump we are fully committed to defeating this threat to the American peopleJohn N Kapoor  of Phoenix Ariz a current member of the Board of Directors of Insys was arrested this morning in Arizona and charged with RICO conspiracy as well as other felonies including cons
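
The output above fuses some neighboring words (`painMore`, `peopleJohn`) because characters such as `\xa0` and quotation marks are deleted rather than replaced. A minimal sketch of one alternative, assuming it is acceptable to treat only ASCII letters as alphabetic (accented letters would be dropped): swap every run of non-letters for a single space. The helper name `keep_alpha` and the sample string are illustrative, not part of the assignment.

```python
import re

def keep_alpha(text: str) -> str:
    """Replace every run of non-letter characters with a single space."""
    # [^A-Za-z]+ matches punctuation, digits, and whitespace such as \xa0
    return re.sub(r"[^A-Za-z]+", " ", text).strip()

# Short excerpt in the style of the press release (illustrative sample)
sample = 'breakthrough pain.\xa0"More than 20,000 Americans died'
cleaned = keep_alpha(sample)
print(cleaned)  # breakthrough pain More than Americans died
```

Because the replacement is a space rather than an empty string, the sentence boundary between `pain` and `More` survives as a token boundary for the tagger.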

In [365]:
## your code here for part of speech tagging
pharma_alpha_tokens = word_tokenize(pharma_alpha)
pharma_alpha_tokens_pos = pos_tag(pharma_alpha_tokens)
# pharma_alpha_tokens_pos

In [366]:
adj_count_dict = {}
for token in pharma_alpha_tokens_pos:
    if (token[1] == "JJ") or (token[1] == "JJR") or (token[1] == "JJS"):
        if token[0] in adj_count_dict:
            adj_count_dict[token[0]] += 1
        else:
            adj_count_dict[token[0]] = 1
            
adj_count_dict_df = (
    pd.DataFrame.from_dict(adj_count_dict, orient = "index", columns = ["count"])
    .reset_index()
    .rename(columns = {"index": "adj"})
    .sort_values(by = "count", ascending = False)
)
adj_count_dict_df.head()

Unnamed: 0,adj,count
9,former,8
26,opioid,5
0,nationwide,4
30,addictive,3
8,other,3
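
The counting loop can also be sketched with `collections.Counter`. The `tagged` list below is a hypothetical stand-in with the same `(word, tag)` shape that `pos_tag` returns; it is not the real tagger output for `pharma`.

```python
from collections import Counter

# Hypothetical (word, tag) pairs in the shape nltk's pos_tag returns
tagged = [
    ("former", "JJ"), ("executives", "NNS"), ("former", "JJ"),
    ("larger", "JJR"), ("opioid", "JJ"), ("arrested", "VBN"),
]

# Penn Treebank adjective tags: base, comparative, superlative
ADJ_TAGS = {"JJ", "JJR", "JJS"}

adj_counts = Counter(word for word, tag in tagged if tag in ADJ_TAGS)
top_adjs = adj_counts.most_common(2)  # list of (adjective, count) pairs
print(top_adjs)
```

A set membership test replaces the three chained `==` comparisons, and `most_common` handles the sorting step.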


### 1.2 named entity recognition (4 points)

A. Using the original `pharma` press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.

B. Print the unique named entities with the tag: `LAW`

**Resources**:

- For parts A and B: named entity recognition part of this code: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partI_textmining_solutions.ipynb

In [367]:
## your code here for part A
spacy_pharma = nlp(pharma)

In [368]:
## your code here for part B
## dict.fromkeys drops repeated entity strings while keeping first-appearance order
law_ents = dict.fromkeys(one_tok.text for one_tok in spacy_pharma.ents if one_tok.label_ == "LAW")
for ent_text in law_ents:
    print("Entity: " + ent_text + "; NER tag: LAW")

Entity: RICO; NER tag: LAW
Entity: the Controlled Substances Act; NER tag: LAW


C. Use Google to summarize in one sentence what the `RICO` named entity means and why this might apply to a pharmaceutical kickbacks case (and not just a mafia case...) 

In [369]:
## your code here 
"RICO is the 'Racketeer Influenced and Corrupt Organizations Act'; it targets any enterprise engaged in a pattern of racketeering activity, so it can be used to prosecute not only organized crime such as the mafia but also corporate schemes like pharmaceutical kickbacks that rely on bribery and fraud."

"RICO is the 'Racketeer Influenced and Corrupt Organizations Act'; it targets any enterprise engaged in a pattern of racketeering activity, so it can be used to prosecute not only organized crime such as the mafia but also corporate schemes like pharmaceutical kickbacks that rely on bribery and fraud."

D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these named entities.

In [370]:
## your code here

date_toks = [one_tok for one_tok in spacy_pharma.ents if one_tok.label_ == "DATE"]

date_toks_contains_year = [one_tok for one_tok in date_toks if "year" in one_tok.text.lower()]
for one_tok in date_toks_contains_year:
    print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)

Entity: last year; NER tag: DATE
Entity: 20 years; NER tag: DATE
Entity: three years; NER tag: DATE
Entity: five years; NER tag: DATE
Entity: three years; NER tag: DATE
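
The hint in part D mentions the `re` module. A minimal sketch of that variant, run on the entity strings printed above plus two hypothetical extras that show what the word boundaries exclude:

```python
import re

# Entity strings copied from the printed output above
date_entities = ["last year", "20 years", "three years", "five years", "three years"]
# Hypothetical extra strings, added only to show what the pattern rejects
date_entities += ["yearly", "November 2017"]

# \b anchors keep "year"/"years" as whole words, so "yearly" does not match
year_pattern = re.compile(r"\byears?\b", flags=re.IGNORECASE)

with_year = [ent for ent in date_entities if year_pattern.search(ent)]
print(with_year)
```

Compared with the substring test `"year" in text`, the regex is stricter about word boundaries, which matters only if the entities include words like "yearly".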


E. Pull and print the original parts of the press release where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned, describe the maximum). 

**Hint**: you may want to use re.search or re.findall 

- For part E, `re.search` and `re.findall` examples here for filtering to ones containing year (multiple approaches; some need not involve `re`): https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_basicregex_solutions.ipynb
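
One way to use `re.findall` for part E, sketched on two sentences quoted from the press release. The pattern is an assumption about how the terms are phrased ("N years in prison", "N years of supervised release") and would miss other wordings.

```python
import re

# Two sentences quoted from the press release
sentences = [
    "The charges of conspiracy to commit RICO and conspiracy to commit mail "
    "and wire fraud each provide for a sentence of no greater than 20 years "
    "in prison, three years of supervised release and a fine of $250,000, "
    "or twice the amount of pecuniary gain or loss.",
    "The charges of conspiracy to violate the Anti-Kickback Law provide for "
    "a sentence of no greater than five years in prison, three years of "
    "supervised release and a $25,000 fine.",
]

# Group 1: the length (digits or a number word); group 2: the kind of term
term_pattern = re.compile(r"(\w+) years? (in prison|of supervised release)")

# With two groups, findall returns a list of (length, kind) tuples per sentence
terms = [term_pattern.findall(sentence) for sentence in sentences]
for sentence_terms in terms:
    print(sentence_terms)
```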

In [371]:
## your code here
pharma_sentences = [one_sent for one_sent in spacy_pharma.sents]
pharma_sentences_contains_year = [one_sent for one_sent in pharma_sentences if "year" in one_sent.text.lower()]
for one_sent in pharma_sentences_contains_year:
    print(one_sent.text)
    
"If convicted, the CEO faces a maximum of 20 years in prison and three years of supervised release on the RICO conspiracy charge, and a maximum of five years in prison and three years of supervised release on the charge of conspiracy to violate the Anti-Kickback Law."

"More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids.
The charges of conspiracy to commit RICO and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than 20 years in prison, three years of supervised release and a fine of $250,000, or twice the amount of pecuniary gain or loss.  
The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, three years of supervised release and a $25,000 fine.


"If convicted, the CEO faces a maximum of 20 years in prison and three years of supervised release on the RICO conspiracy charge, and a maximum of five years in prison and three years of supervised release on the charge of conspiracy to violate the Anti-Kickback Law."

### 1.3 sentiment analysis (10 points)

- Sentiment analysis section of this script: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partI_textmining_solutions.ipynb


A. Subset the press releases to those labeled with one of three topics via `topics_clean`: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.



In [372]:
## your code here for subsetting
topics_of_interest = ["Civil Rights", "Hate Crimes", "Project Safe Childhood"]
doj_subset = doj[doj["topics_clean"].isin(topics_of_interest)].copy()
doj_subset.head(2)
doj_subset.shape

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
77,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,"A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened. James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack. Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up. “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division. “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.” “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.” This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. 
Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section.",2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle"
155,15-1522,Alabama Man Found Guilty of Aggravated Sexual Abuse of a Child,"A federal jury convicted Rick Lee Evans, 43, of Anniston, Alabama, today of aggravated sexual abuse of a child after a five-day trial, Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division and U.S. Attorney Joyce White Vance of the Northern District of Alabama announced. According to evidence introduced at trial, Evans, a former U.S. Army soldier, and his then-wife, a Department of Defense employee, were residing in Germany when they were asked to take temporary custody of a five-year-old child whose parents were deployed to Iraq with the U.S. Army. Evans sexually abused the child on multiple occasions during the 18 months that the child lived with him from May 2007 to December 2008. Trial Attorney Austin M. Berry of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jacquelyn Hutzell of the Northern District of Alabama are prosecuting the case. U.S. Army Criminal Investigations Division and the FBI’s Birmingham, Alabama, Division investigated the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse, launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2015-12-11T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Alabama, Northern"


(717, 6)

B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string (**Hint**: you may want to use `re.sub` with an or condition)
- Scores the sentiment of the entire press release using the `SentimentIntensityAnalyzer` and `polarity_scores`
- Returns the length-four (negative, positive, neutral, compound) sentiment dictionary (any order is fine)

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- I used a function plus a list comprehension; it takes about 30 seconds on my local machine and about 2 minutes on jhub. If yours is taking much longer, check your code for inefficiencies. If you can't fix them, you can take a small random sample of the 717 press releases for partial credit on this part and full credit on the remainder.


In [373]:
## your code here to define function
def get_sentiment(text):
    """Remove named entities from text and return sentiment score."""
    text = text.lower()
    spacy_text = nlp(str(text))
    named_entities = [one_tok.text for one_tok in spacy_text.ents]
    
    text_no_named_entities = text
    
    for entity in named_entities:
        text_no_named_entities = text_no_named_entities.replace(entity, "")
    
    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(text_no_named_entities)]

In [374]:
## your code here executing the function
# get_sentiment(pharma)
sentiments = doj_subset["contents"].apply(get_sentiment)
sentiments

77       [{'neg': 0.188, 'neu': 0.765, 'pos': 0.047, 'compound': -0.9931}]
155      [{'neg': 0.141, 'neu': 0.759, 'pos': 0.101, 'compound': -0.9118}]
157      [{'neg': 0.101, 'neu': 0.808, 'pos': 0.091, 'compound': -0.6808}]
162      [{'neg': 0.131, 'neu': 0.767, 'pos': 0.101, 'compound': -0.8827}]
168          [{'neg': 0.16, 'neu': 0.8, 'pos': 0.04, 'compound': -0.9864}]
                                       ...                                
13002    [{'neg': 0.151, 'neu': 0.789, 'pos': 0.059, 'compound': -0.9737}]
13032     [{'neg': 0.089, 'neu': 0.797, 'pos': 0.114, 'compound': 0.7717}]
13034    [{'neg': 0.156, 'neu': 0.735, 'pos': 0.109, 'compound': -0.9578}]
13068     [{'neg': 0.138, 'neu': 0.769, 'pos': 0.093, 'compound': -0.988}]
13081     [{'neg': 0.142, 'neu': 0.828, 'pos': 0.03, 'compound': -0.9913}]
Name: contents, Length: 717, dtype: object

C. Add the four sentiment scores to the `doj_subset` dataframe to create a new dataframe, `doj_subset_wscore`. Sort from highest to lowest neg score and print the `id`, `contents`, and `neg` columns for the two most negative press releases. 

Notes:

- Don't worry if your sentiment scores differ slightly from our output on GitHub; differences in preprocessing can lead to different scores
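
One way to attach the scores, sketched on toy data rather than the real `doj_subset`: since each VADER result is a dictionary with the same four keys, a list of those dictionaries can be expanded into columns in one step. The frame `toy` and the score values below are made up for illustration.

```python
import pandas as pd

# Toy stand-ins (hypothetical values, not real DOJ data)
toy = pd.DataFrame({"id": ["a-1", "a-2", "a-3"],
                    "contents": ["first", "second", "third"]},
                   index=[10, 20, 30])
toy_scores = [
    {"neg": 0.10, "neu": 0.80, "pos": 0.10, "compound": 0.00},
    {"neg": 0.30, "neu": 0.65, "pos": 0.05, "compound": -0.90},
    {"neg": 0.05, "neu": 0.75, "pos": 0.20, "compound": 0.70},
]

# Expand the dictionaries into columns, reusing the original index so rows line up
score_cols = pd.DataFrame(toy_scores, index=toy.index)
toy_wscore = pd.concat([toy, score_cols], axis=1)

# Two most negative rows, restricted to the requested columns
most_neg = (toy_wscore.sort_values("neg", ascending=False)
            .loc[:, ["id", "contents", "neg"]]
            .head(2))
print(most_neg)
```

Passing `index=toy.index` matters when the dataframe has a non-default index (as a subset usually does); otherwise `pd.concat` would align on mismatched labels and produce missing values.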

In [375]:
## your code here
doj_subset_wscore = doj_subset.copy()
doj_subset_wscore["sentiment_neg"] = [one_sent[0]["neg"] for one_sent in sentiments]
doj_subset_wscore["sentiment_neu"] = [one_sent[0]["neu"] for one_sent in sentiments]
doj_subset_wscore["sentiment_pos"] = [one_sent[0]["pos"] for one_sent in sentiments]
doj_subset_wscore["sentiment_compound"] = [one_sent[0]["compound"] for one_sent in sentiments]

# Sort by negative sentiment
doj_subset_wscore.sort_values(by = "sentiment_neg", ascending = False).head(2)

Unnamed: 0,id,title,contents,date,topics_clean,components_clean,sentiment_neg,sentiment_neu,sentiment_pos,sentiment_compound
329,14-248,Albuquerque Man Charged with Federal Hate Crime Related to Anti-Semitic Threats Against Businesswoman,"The Department of Justice announced that this morning John W. Ng, 58, of Albuquerque, N.M., made his initial appearance in federal court on a criminal complaint charging him with a hate crime offense. This charge is related to anti-Semitic threats Ng made against a Jewish woman who owns and operates the Nosh Jewish Delicatessen and Bakery in Albuquerque. Ng was arrested by the FBI on March 7, 2014, based on a criminal complaint alleging that he interfered with the victim’s federally protected rights by threatening her and interfering with her business because of her religion. According to the criminal complaint, between Jan. 22, 2014, and Feb. 8, 2014, Ng allegedly posted threatening anti-Semitic notes on and in the vicinity of the victim’s business. A criminal complaint merely establishes probable cause, and Ng is presumed innocent unless proven guilty. If convicted on the offense charged in the criminal complaint, Ng faces a maximum statutory penalty of one year in prison. This matter was investigated by the Albuquerque Division of the FBI and is being prosecuted by Assistant U.S. Attorney Mark T. Baker of the U.S. Attorney’s Office for the District of New Mexico and Trial Attorney AeJean Cha of the U.S. Department of Justice’s Civil Rights Division.",2014-03-10T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.29,0.675,0.035,-0.995
11593,16-718,Three Mississippi Correctional Officers Indicted for Inmate Assault and Cover-Up,"In a nine-count indictment unsealed today, two Mississippi correctional officers were charged with beating an inmate and a third was charged with helping to cover it up. The indictment charged Lawardrick Marsher, 28, and Robert Sturdivant, 47, officers at Mississippi State Penitentiary, in Parchman, Mississippi, with a beating that included kicking, punching and throwing the victim to the ground. Marsher and Sturdivant were charged with violating the right of K.H., a convicted prisoner, to be free from cruel and unusual punishment. Sturdivant was also charged with failing to intervene while Marsher was punching and beating K.H. The indictment alleges that their actions involved the use of a dangerous weapon and resulted in bodily injury to the victim. A third officer, Deonte Pate, 23, was charged along with Marsher and Sturdivant for conspiring to cover up the beating. The indictment alleges that all three officers submitted false reports and that all three lied to the FBI. If convicted, Marsher and Sturdivant face a maximum sentence of 10 years in prison on the excessive force charges. Each of the three officers faces up to five years in prison on the conspiracy and false statement charges, and up to 20 years in prison on the false report charges. An indictment is merely an accusation, and the defendants are presumed innocent unless and until proven guilty. This case is being investigated by the FBI’s Jackson Division, with the cooperation of the Mississippi Department of Corrections. It is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorney Dana Mulhauser of the Civil Rights Division’s Criminal Section. Marsher Indictment",2016-06-21T00:00:00-04:00,Civil Rights,"Civil Rights Division; Civil Rights - Criminal Section; USAO - Mississippi, Northern",0.282,0.687,0.03,-0.9968


D. With the dataframe from part C, find the mean compound sentiment score for each of the three topics in `topics_clean` using `groupby` and `agg`.

E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)
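
For part D, a minimal sketch of the `groupby` plus `agg` pattern on toy data (the values and the output column name `mean_compound` are invented for illustration):

```python
import pandas as pd

# Toy frame with made-up compound scores
toy = pd.DataFrame({
    "topics_clean": ["Hate Crimes", "Hate Crimes", "Civil Rights", "Civil Rights"],
    "compound": [-0.9, -0.7, 0.5, -0.1],
})

# Named aggregation: output column name = (input column, aggregation function)
mean_by_topic = (toy.groupby("topics_clean")
                    .agg(mean_compound=("compound", "mean"))
                    .reset_index())
print(mean_by_topic)
```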


In [376]:
## agg and find the mean compound score by topic
doj_subset_wscore.groupby("topics_clean")["sentiment_compound"].mean()
"Hate crimes are more universally accepted to be negative than civil rights or project safe childhood, so that is likely why their compound sentiment scores are lower on average."

topics_clean
Civil Rights             -0.049092
Hate Crimes              -0.931611
Project Safe Childhood   -0.620363
Name: sentiment_compound, dtype: float64

'Hate crimes are more universally accepted to be negative than civil rights or project safe childhood, so that is likely why their compound sentiment scores are lower on average.'

# 2. Topic modeling (25 points)

For this question, use the `doj_subset_wscore` data, which is restricted to civil rights, hate crimes, and project safe childhood and has the sentiment scores added.


## 2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in a single raw string in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem

- Returns a joined preprocessed string
    
B. Use `apply` or list comprehension to execute that function and create a new column in the data called `processed_text`
    
C. Print the `id`, `contents`, and `processed_text` columns for the following press releases:

id = 16-718 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)

id = 16-217 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)
    
**Resources**:

- Here are code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
- Here's code with topic modeling steps: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb
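
One detail to watch in the "only retains alpha words" step: deleting punctuation characters outright fuses hyphenated words (for example, "officer-involved" becomes "officerinvolved"). A hedged alternative, sketched below, is to extract runs of letters with a regular expression so that punctuation acts as a separator. The helper `alpha_tokens` is hypothetical, it omits stemming, and the pattern `[a-z]+` assumes ASCII text.

```python
import re

def alpha_tokens(text, min_len=4, stop_words=frozenset()):
    """Lowercase, split into alphabetic runs, then drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tok for tok in tokens
            if tok not in stop_words and len(tok) >= min_len]

example = "Officer-involved shootings by MPD officers in 1994."
print(alpha_tokens(example, stop_words={"officers"}))
```

The resulting token list could then be passed through the snowball stemmer and joined back into a string.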

In [377]:
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                       "trial", "assistance", "assist"]

stop_words = stopwords.words('english') + custom_doj_stopwords
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [378]:
## your code defining a text processing function
def preprocessing(content):
    
    # lowercase
    content = content.lower()
    
    # only keep letters
    content = ''.join([char for char in content if char.isalpha() or char == " "])
    
    # remove stopwords
    remove_stopwords_list = [word for word in content.split() if word not in stop_words]
    content = ' '.join(remove_stopwords_list)
    
    # keep only words that are 4 characters or longer
    long_words_list = [word for word in content.split() if len(word) >= 4]
    content = ' '.join(long_words_list)
    
    # Uses the snowball stemmer from nltk to stem
    snow_stemmer = SnowballStemmer(language='english')
    stemmerwords_list = [snow_stemmer.stem(word) for word in content.split()]
    content = ' '.join(stemmerwords_list)
    
    return content
    
test_string = 'A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened.     James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack.   Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up.   “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division.  “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.”   “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.”     This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section.'
print(preprocessing(test_string))

former supervisori correct offic louisiana state penitentiari angola louisiana plead guilti yesterday connect beat handcuf shackl inmat addit conspir cover misconduct falsifi offici record lie intern investig happen jame savoy marksvill louisiana admit plea hear wit offic use excess forc inmat fail interven conspir offic cover beat engag varieti obstruct act person falsifi offici prison record cover attack scotti kennedi beeb arkansa john sander marksvill louisiana previous plead guilti novemb septemb role beat cover everi citizen right process protect unreason forc correct offic violat basic constitut must held account egregi action said act general john gore continu vigor prosecut correct offic violat public trust commit crime cover violat feder crimin yesterday anoth exampl offic unwav commit pursu violat feder crimin law said act unit state middl louisiana corey amundson continu work close depart ensur investig fbis baton roug resid agenc prosecut frederick menner middl louisiana c

In [379]:
## your code executing the function
doj_subset_wscore['processed_contents'] = doj_subset_wscore['contents'].apply(preprocessing)

In [380]:
## your code showing the examples
doj_subset_wscore.loc[(doj_subset_wscore['id'] == '16-718') | (doj_subset_wscore['id'] == '16-217'), ['id', 'contents', 'processed_contents']]

Unnamed: 0,id,contents,processed_contents
6727,16-217,"The Justice Department has reached a comprehensive settlement agreement with the city of Miami and the Miami Police Department (MPD) resolving the Justice Department’s investigation of officer-involved shootings by MPD officers, announced Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department’s Civil Rights Division and U.S. Attorney Wifredo A. Ferrer of the Southern District of Florida. The settlement, which was approved by Miami’s city commission today and will go into effect when the agreement is signed by all parties, resolves claims stemming from the Justice Department’s investigation into officer-involved shootings by MPD officers, which was conducted under the Violent Crime Control and Law Enforcement Act of 1994. The investigation’s findings, issued in July 2013, identified a pattern or practice of excessive use of force through officer-involved shootings in violation of the Fourth Amendment of the Constitution. The city’s compliance with the settlement will be monitored by an independent reviewer, former Tampa, Florida, Police Chief Jane Castor. Under the settlement agreement, the city will implement comprehensive reforms to ensure constitutional policing and support public trust. The settlement agreement is designed to minimize officer-involved shootings and to more effectively and quickly investigate officer-involved shootings that do occur, through measures that include: “This settlement represents a renewed commitment by the city of Miami and Chief Rodolfo Llanes to provide constitutional policing for Miami residents and to protect public safety through sustainable reform,” said Principal Deputy Assistant Attorney General Gupta. 
“The agreement will help to strengthen the relationship between the MPD and the communities they serve by improving accountability for officers who fire their weapons unlawfully, and provides for community participation in the enforcement of this agreement.” “Today's agreement is the result of a joint effort between the Department of Justice and the City of Miami to ensure that the Miami Police Department continues its efforts to make our community safe while protecting the sacred Constitutional rights of all of our citizens,” said U.S. Attorney Ferrer. “Through oversight and communication, the agreement seeks to make permanent the positive changes that former Chief Orosa and Chief Llanes have made, and we applaud the City Commission’s vote.” The settlement agreement builds upon important reforms implemented by the city since the Justice Department issued its findings, including: The investigation was conducted by attorneys and staff from the Civil Rights Division’s Special Litigation Section and the Civil Division of the U. S. 
Attorney’s Office of the Southern District of Florida.",reach comprehens settlement agreement citi miami miami polic resolv depart officerinvolv shoot offic announc princip deputi general vanita gupta head depart wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem depart officerinvolv shoot offic conduct violent crime control enforc investig find issu juli identifi pattern practic excess forc officerinvolv shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trustth settlement agreement design minim officerinvolv shoot effect quick investig officerinvolv shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc agreement today agreement result joint effort citi miami ensur miami polic continu effort make communiti safe protect sacr constitut citizen said ferrer oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss vote settlement agreement build upon import reform implement citi sinc issu find includ conduct attorney staff divis special litig section attorney southern florida
11593,16-718,"In a nine-count indictment unsealed today, two Mississippi correctional officers were charged with beating an inmate and a third was charged with helping to cover it up. The indictment charged Lawardrick Marsher, 28, and Robert Sturdivant, 47, officers at Mississippi State Penitentiary, in Parchman, Mississippi, with a beating that included kicking, punching and throwing the victim to the ground. Marsher and Sturdivant were charged with violating the right of K.H., a convicted prisoner, to be free from cruel and unusual punishment. Sturdivant was also charged with failing to intervene while Marsher was punching and beating K.H. The indictment alleges that their actions involved the use of a dangerous weapon and resulted in bodily injury to the victim. A third officer, Deonte Pate, 23, was charged along with Marsher and Sturdivant for conspiring to cover up the beating. The indictment alleges that all three officers submitted false reports and that all three lied to the FBI. If convicted, Marsher and Sturdivant face a maximum sentence of 10 years in prison on the excessive force charges. Each of the three officers faces up to five years in prison on the conspiracy and false statement charges, and up to 20 years in prison on the false report charges. An indictment is merely an accusation, and the defendants are presumed innocent unless and until proven guilty. This case is being investigated by the FBI’s Jackson Division, with the cooperation of the Mississippi Department of Corrections. It is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorney Dana Mulhauser of the Civil Rights Division’s Criminal Section. 
Marsher Indictment",ninecount indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit fals report three lie convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci fals statement charg year prison fals report charg indict mere accus defend presum innoc unless proven guilti investig fbis jackson cooper mississippi correct prosecut robert coleman northern mississippi dana mulhaus divis crimin section marsher indict


## 2.2 Create a document-term matrix from the preprocessed press releases and explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: `id`, `compound` sentiment column you added, and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. Print the top 10 words for press releases in each of the three `topics_clean`

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.

**Resources**:

- Here is an example of applying the `create_dtm` function: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb
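
The subset-then-summarize idea from steps B through D can be sketched on a toy document-term matrix. Everything below (the frame `toy_dtm`, the helper `top_terms`, and the quartile cutoffs used in place of 5% because the toy corpus is tiny) is an illustrative assumption, not the expected solution.

```python
import pandas as pd

# Toy document-term matrix: metadata columns first, then one column per term
toy_dtm = pd.DataFrame({
    "id": ["d1", "d2", "d3", "d4"],
    "compound": [-0.9, -0.2, 0.4, 0.8],
    "assault": [3, 1, 0, 0],
    "agreement": [0, 1, 2, 4],
    "victim": [2, 2, 0, 1],
})
meta_cols = ["id", "compound"]

def top_terms(dtm_subset, meta_cols, n=2):
    """Sum term counts over the rows of a subset and return the n largest."""
    term_counts = dtm_subset.drop(columns=meta_cols).sum(axis=0)
    return term_counts.sort_values(ascending=False).head(n)

# Cutoffs from the quantile function, then boolean subsetting
high_cut = toy_dtm["compound"].quantile(0.75)
low_cut = toy_dtm["compound"].quantile(0.25)
most_positive = top_terms(toy_dtm[toy_dtm["compound"] >= high_cut], meta_cols)
most_negative = top_terms(toy_dtm[toy_dtm["compound"] <= low_cut], meta_cols)
print(most_positive)
print(most_negative)
```

Dropping the metadata columns by name, rather than by position, keeps the helper correct if the number of metadata columns changes.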


In [381]:
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase = True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), columns=vectorizer.get_feature_names_out())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis = 1)
    return(dtm_dense_named_withid)

In [382]:
# your code here
dtm = create_dtm(list_of_strings = doj_subset_wscore["processed_contents"], metadata = doj_subset_wscore[['id', 'sentiment_compound', 'topics_clean']])

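def get_topwords(dtm_subset, n_meta_cols = 4, n_words = 10):
    """Return the n_words most frequent terms in a (subsetted) document-term matrix."""
    # create_dtm puts the metadata first (index, id, sentiment_compound, topics_clean),
    # so the term columns start at position n_meta_cols
    top_terms = dtm_subset[dtm_subset.columns[n_meta_cols:]].sum(axis = 0)
    
    return top_terms.sort_values(ascending=False).head(n_words)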

# Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)
threshold_1 = dtm['sentiment_compound'].quantile(0.95)
top_5_percent = dtm[dtm['sentiment_compound'] >= threshold_1]
get_topwords(top_5_percent)

# Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)
threshold_2 = dtm['sentiment_compound'].quantile(0.05)
bottom_5_percent = dtm[dtm['sentiment_compound'] <= threshold_2]
get_topwords(bottom_5_percent)

# Print the top 10 words for press releases in each of the three `topics_clean`
civil_rights = dtm[dtm['topics_clean'] == 'Civil Rights']
get_topwords(civil_rights)

hate_crimes = dtm[dtm['topics_clean'] == 'Hate Crimes']
get_topwords(hate_crimes)

safe_childhood = dtm[dtm['topics_clean'] == 'Project Safe Childhood']
get_topwords(safe_childhood)

agreement     168
enforc        131
ensur         110
state         110
disabl        110
depart         99
communiti      94
settlement     87
student        85
general        85
dtype: int64

assault     201
crime       180
victim      160
hate        136
defend      130
offic       121
charg       105
sentenc     102
anderson     94
guilti       92
dtype: int64

offic        636
hous         626
discrimin    547
enforc       533
disabl       519
depart       495
said         495
violat       476
feder        475
state        446
dtype: int64

victim      588
crime       546
hate        485
prosecut    478
defend      457
sentenc     455
charg       448
guilti      429
feder       424
said        418
dtype: int64

child          1018
exploit         699
sexual          570
safe            476
project         472
childhood       472
pornographi     443
children        417
crimin          403
prosecut        373
dtype: int64

## 2.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

**Hints and Resources**:

- Same topic modeling resources linked to above
- Make sure to use the `random_state` argument within the model so that the numbering of topics does not move around between runs of your code
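
If you prune the vocabulary with gensim's `Dictionary.filter_extremes` before estimating the model, note that `no_below` is an absolute number of documents while `no_above` is a fraction of the corpus. The pure-Python sketch below mimics that rule on toy tokens so the two thresholds are easy to inspect; `filter_by_doc_freq` is a hypothetical helper written for illustration, not gensim's implementation.

```python
from collections import Counter

def filter_by_doc_freq(tokenized_docs, no_below, no_above):
    """Keep tokens found in at least `no_below` documents (a count)
    and in at most `no_above` of all documents (a fraction)."""
    n_docs = len(tokenized_docs)
    # Count each token once per document it appears in
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok for tok, df in doc_freq.items()
            if df >= no_below and df <= no_above * n_docs}

toy_docs = [
    ["said", "child", "exploit"],
    ["said", "hate", "crime"],
    ["said", "hate", "victim"],
    ["said", "child", "victim"],
]
kept = filter_by_doc_freq(toy_docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

Here "said" is dropped for appearing in every document, and "exploit" and "crime" are dropped for appearing in only one.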

In [383]:
# A. 
# create tokenized list of words
doj_subset_wscore = doj_subset_wscore[doj_subset_wscore["processed_contents"] != ""].copy()

text_tokens = [wordpunct_tokenize(one_text) for one_text in doj_subset_wscore["processed_contents"]]

# create dictionary
text_dict = corpora.Dictionary(text_tokens)

# filter dictionary: drop tokens in fewer than 2% or more than 98% of documents
# (no_below is an absolute document count; no_above is a fraction of the corpus)
text_dict.filter_extremes(no_below = round(doj_subset_wscore.shape[0]*0.02),
                             no_above = 0.98)

# create corpus from dictionary
corpus_fromdict = [text_dict.doc2bow(one_text) 
                       for one_text in text_tokens]

# estimate model
n_topics = 3
ldamod_proc = gensim.models.ldamodel.LdaModel(corpus_fromdict, 
                                              num_topics = n_topics, 
                                              id2word=text_dict, 
                                              passes=6, alpha = 'auto',
                                              per_word_topics = True, 
                                              random_state = 91988)

In [384]:
# B.
# print topics and words
topics = ldamod_proc.print_topics(num_words = 15)
for topic in topics:
    print(topic)

(0, '0.028*"child" + 0.019*"exploit" + 0.019*"sexual" + 0.013*"safe" + 0.013*"children" + 0.013*"project" + 0.013*"childhood" + 0.012*"pornographi" + 0.011*"crimin" + 0.011*"victim" + 0.010*"prosecut" + 0.010*"sentenc" + 0.009*"hous" + 0.008*"minor" + 0.008*"ceo"')
(1, '0.013*"victim" + 0.012*"charg" + 0.012*"sentenc" + 0.012*"prosecut" + 0.011*"crime" + 0.011*"guilti" + 0.011*"defend" + 0.010*"feder" + 0.010*"said" + 0.009*"hate" + 0.009*"indict" + 0.008*"prison" + 0.008*"year" + 0.008*"investig" + 0.008*"assault"')
(2, '0.011*"disabl" + 0.011*"discrimin" + 0.010*"enforc" + 0.010*"depart" + 0.009*"offic" + 0.009*"agreement" + 0.008*"state" + 0.008*"said" + 0.007*"hous" + 0.007*"violat" + 0.007*"court" + 0.007*"feder" + 0.007*"requir" + 0.007*"general" + 0.007*"alleg"')


## 2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, use the argument `minimum_probability` = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the `doj_subset_wscore` dataframe

B. Add the topic probabilities to the `doj_subset_wscore` dataframe as columns and create a column, `top_topic`, that maps each document to its highest-probability topic (e.g., topic 1, 2, or 3)

C. For each of the manual labels in `topics_clean` (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)

**Resources**:

- End of this code (`Additional summaries of topics and documents`) contains example of how to use `get_document_topics` and other steps to add topic probabilities back to data: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb
- If you're getting errors, use `shape`, `len`, and other commands to check the dimensionality of things at different steps 
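
As a rough illustration of parts A and B, the sketch below starts from toy data shaped like the output of `get_document_topics` with `minimum_probability=0` (one list of `(topic_id, probability)` pairs per document) and turns it into one column per topic plus a `top_topic` column. The probabilities, the frame `toy_docs`, and the `topic_` column prefix are all made up for illustration.

```python
import pandas as pd

# Toy output: one list of (topic_id, probability) pairs per document
toy_topic_probs = [
    [(0, 0.90), (1, 0.07), (2, 0.03)],
    [(0, 0.10), (1, 0.60), (2, 0.30)],
    [(0, 0.05), (1, 0.25), (2, 0.70)],
]
toy_docs = pd.DataFrame({"id": ["d1", "d2", "d3"]}, index=[7, 42, 99])

# The lengths must match before the probabilities are attached
assert len(toy_topic_probs) == len(toy_docs)

# One column per topic; reuse the dataframe's index so rows align
prob_cols = pd.DataFrame([dict(doc) for doc in toy_topic_probs],
                         index=toy_docs.index).add_prefix("topic_")
toy_docs = pd.concat([toy_docs, prob_cols], axis=1)

# Highest-probability topic for each document
toy_docs["top_topic"] = prob_cols.idxmax(axis=1)
print(toy_docs)
```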

In [385]:
## your code here to get doc-level topic probabilities 

# extract probabilities
topic_probs_bydoc = [ldamod_proc.get_document_topics(item, minimum_probability=0) for item in corpus_fromdict]
# print(topic_probs_bydoc)

# make sure the length of the list equals the number of rows in the doj_subset_wscore dataframe
assert len(topic_probs_bydoc) == len(doj_subset_wscore), \
    "topic_probs_bydoc and doj_subset_wscore have different lengths"
print('It is', len(topic_probs_bydoc) == len(doj_subset_wscore), 'that the length of topic_probs_bydoc equals the length of doj_subset_wscore')

It is True that the length of topic_probs_bydoc equals the length of doj_subset_wscore


In [386]:
## your code here to add those topic probabilities to the dataframe
def highestValue(topic_probs):
    """Return the topic id with the highest probability from a list of (topic_id, probability) pairs."""
    return max(topic_probs, key=lambda x: x[1])[0]

doj_subset_wscore['top_topic'] = [highestValue(probability_set) for probability_set in topic_probs_bydoc]
doj_subset_wscore['top_topic'].sample(20)

10410    1
11928    1
174      1
12566    2
6468     2
5820     1
1514     1
7061     2
6685     2
4124     1
11250    1
1045     1
11539    1
9650     0
6722     2
11740    1
4816     0
11340    1
8075     0
1190     0
Name: top_topic, dtype: int64

In [387]:
## your code here to summarize the topic proportions for each of the topics_clean 

# Calculate the crosstab with the numerical distribution
ct_counts = pd.crosstab(doj_subset_wscore['topics_clean'], doj_subset_wscore['top_topic'])
ct_counts

# Calculate the crosstab with the percentage distribution
ct_prop = pd.crosstab(doj_subset_wscore['topics_clean'], doj_subset_wscore['top_topic'], normalize='index')
ct_prop

# Loop over each label in topics_clean and print the percentage breakdown of top topics
for label in doj_subset_wscore['topics_clean'].unique():
    print(f"\nLabel: {label}")
    print(ct_prop.loc[label] * 100) # Multiply by 100 to convert decimal to percentage

top_topic                 0    1    2
topics_clean
Civil Rights              3  102  200
Hate Crimes               0  246    0
Project Safe Childhood  165    0    1


top_topic                      0         1         2
topics_clean
Civil Rights            0.009836  0.334426  0.655738
Hate Crimes             0.000000  1.000000  0.000000
Project Safe Childhood  0.993976  0.000000  0.006024



Label: Civil Rights
top_topic
0     0.983607
1    33.442623
2    65.573770
Name: Civil Rights, dtype: float64

Label: Project Safe Childhood
top_topic
0    99.39759
1     0.00000
2     0.60241
Name: Project Safe Childhood, dtype: float64

Label: Hate Crimes
top_topic
0      0.0
1    100.0
2      0.0
Name: Hate Crimes, dtype: float64


In [388]:
"It appears Project Safe Childhood maps almost 100% onto topic 0, and Hate Crimes maps 100% onto topic 1. Civil Rights maps about 1/3 onto topic 1 and 2/3 onto topic 2. This is likely because Civil Rights cases are more broad than the other two topics, and so they are more likely to have a mix of topics."

'It appears Project Safe Childhood maps almost 100% onto topic 0, and Hate Crimes maps 100% onto topic 1. Civil Rights maps about 1/3 onto topic 1 and 2/3 onto topic 2. This is likely because Civil Rights cases are more broad than the other two topics, and so they are more likely to have a mix of topics.'

# 3. Extend the analysis from unigrams to bigrams (10 points)

In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of words)

A. Using the `doj_subset_wscore` data and the `processed_text` column (so the words after stemming/other preprocessing), create a column in the data called `processed_text_bigrams` that combines each consecutive pair of words into a bigram separated by an underscore. E.g.:

"depart reach settlem" would become "depart_reach reach_settlem"

Do this by writing a function `create_bigram_onedoc` that takes in a single `processed_text` string and returns a string with its bigrams structured similarly to above example
 
**Hint**: there are many ways to solve but `zip` may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list

B. Print the `id`, `processed_text`, and `processed_text_bigrams` columns for the press release with id = 16-217.

In [389]:
## your code here 
def create_bigram_onedoc(processed_contents):
    words = processed_contents.split()
    bigrams = [f"{w1}_{w2}" for w1, w2 in zip(words, words[1:])]
    return " ".join(bigrams)

doj_subset_wscore["processed_contents_bigrams"] = doj_subset_wscore["processed_contents"].apply(create_bigram_onedoc)

doj_subset_wscore[doj_subset_wscore.id == '16-217'][['id', 'processed_contents', 'processed_contents_bigrams']]

Unnamed: 0,id,processed_contents,processed_contents_bigrams
6727,16-217,reach comprehens settlement agreement citi miami miami polic resolv depart officerinvolv shoot offic announc princip deputi general vanita gupta head depart wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem depart officerinvolv shoot offic conduct violent crime control enforc investig find issu juli identifi pattern practic excess forc officerinvolv shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trustth settlement agreement design minim officerinvolv shoot effect quick investig officerinvolv shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc agreement today agreement result joint effort citi miami ensur miami polic continu effort make communiti safe protect sacr constitut citizen said ferrer oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss vote settlement agreement build upon import reform implement citi sinc issu find includ conduct attorney staff divis special litig section attorney southern florida,reach_comprehens comprehens_settlement settlement_agreement agreement_citi citi_miami miami_miami miami_polic polic_resolv resolv_depart depart_officerinvolv officerinvolv_shoot shoot_offic offic_announc announc_princip princip_deputi deputi_general general_vanita vanita_gupta gupta_head head_depart depart_wifredo wifredo_ferrer ferrer_southern southern_florida florida_settlement settlement_approv approv_miami miami_citi citi_commiss commiss_today today_effect effect_agreement 
agreement_sign sign_parti parti_resolv resolv_claim claim_stem stem_depart depart_officerinvolv officerinvolv_shoot shoot_offic offic_conduct conduct_violent violent_crime crime_control control_enforc enforc_investig investig_find find_issu issu_juli juli_identifi identifi_pattern pattern_practic practic_excess excess_forc forc_officerinvolv officerinvolv_shoot shoot_violat violat_fourth fourth_amend amend_constitut constitut_citi citi_complianc complianc_settlement settlement_monitor monitor_independ independ_review review_former former_tampa tampa_florida florida_polic polic_chief chief_jane jane_castor castor_settlement settlement_agreement agreement_citi citi_implement implement_comprehens comprehens_reform reform_ensur ensur_constitut constitut_polic polic_support support_public public_trustth trustth_settlement settlement_agreement agreement_design design_minim minim_officerinvolv officerinvolv_shoot shoot_effect effect_quick quick_investig investig_officerinvolv officerinvolv_shoot shoot_occur occur_measur measur_includ includ_settlement settlement_repres repres_renew renew_commit commit_citi citi_miami miami_chief chief_rodolfo rodolfo_llane llane_provid provid_constitut constitut_polic polic_miami miami_resid resid_protect protect_public public_safeti safeti_sustain sustain_reform reform_said said_princip princip_deputi deputi_general general_gupta gupta_agreement agreement_help help_strengthen strengthen_relationship relationship_communiti communiti_serv serv_improv improv_account account_offic offic_fire fire_weapon weapon_unlaw unlaw_provid provid_communiti communiti_particip particip_enforc enforc_agreement agreement_today today_agreement agreement_result result_joint joint_effort effort_citi citi_miami miami_ensur ensur_miami miami_polic polic_continu continu_effort effort_make make_communiti communiti_safe safe_protect protect_sacr sacr_constitut constitut_citizen citizen_said said_ferrer ferrer_oversight oversight_communic communic_agreement 
agreement_seek seek_make make_perman perman_posit posit_chang chang_former former_chief chief_orosa orosa_chief chief_llane llane_made made_applaud applaud_citi citi_commiss commiss_vote vote_settlement settlement_agreement agreement_build build_upon upon_import import_reform reform_implement implement_citi citi_sinc sinc_issu issu_find find_includ includ_conduct conduct_attorney attorney_staff staff_divis divis_special special_litig litig_section section_attorney attorney_southern southern_florida
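
The `zip` approach extends naturally beyond pairs, and it is worth checking how it behaves on very short documents. Below is a small sketch. `create_ngram_onedoc` is a hypothetical generalization of `create_bigram_onedoc` and is not part of the assignment.

```python
def create_ngram_onedoc(processed_text, n=2):
    """Join each run of n consecutive tokens with underscores."""
    words = processed_text.split()
    # zip over n staggered views of the token list; zip stops at the shortest view
    shifted = [words[i:] for i in range(n)]
    return " ".join("_".join(gram) for gram in zip(*shifted))

print(create_ngram_onedoc("depart reach settlem"))       # depart_reach reach_settlem
print(create_ngram_onedoc("depart reach settlem", n=3))  # depart_reach_settlem
print(repr(create_ngram_onedoc("depart")))               # '' (one word, no bigrams)
```

Documents with fewer than `n` tokens come back as empty strings, so they contribute no terms to the document-term matrix.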


C. Use the `create_dtm` function and the `processed_text_bigrams` column to create a document-term matrix (`dtm_bigram`) with these bigrams. Keep the following three columns in the data: `id`, `topics_clean`, and `compound`.

D. Print (1) the dimensions of the `dtm` matrix from question 2.2 and (2) the dimensions of the `dtm_bigram` matrix. Comment on why the bigram matrix has more columns than the unigram matrix.

E. Find and print the 10 most prevalent bigrams for each of the three `topics_clean` categories using the `get_topwords` function from 2.2.

In [390]:
# your code here
# c. create dtm
dtm_bi = create_dtm(list_of_strings = doj_subset_wscore["processed_contents_bigrams"], metadata = doj_subset_wscore[['id', 'sentiment_compound', 'topics_clean']])

# d. print dimensions
print(dtm.shape)
print(dtm_bi.shape)

"The bigram dtm has more columns than the unigram dtm because there are more unique bigrams than unigrams."

# e. print the top 10 bigrams for press releases in each of the three `topics_clean`
civil_rights = dtm_bi[dtm_bi['topics_clean'] == 'Civil Rights']
get_topwords(civil_rights)

hate_crimes = dtm_bi[dtm_bi['topics_clean'] == 'Hate Crimes']
get_topwords(hate_crimes)

safe_childhood = dtm_bi[dtm_bi['topics_clean'] == 'Project Safe Childhood']
get_topwords(safe_childhood)


(717, 7847)
(717, 74085)


'The bigram dtm has more columns than the unigram dtm because there are more unique bigrams than unigrams.'

fair_hous         228
deputi_general    221
princip_deputi    221
vanita_gupta      202
gupta_head        200
general_vanita    199
said_princip      185
unit_state        154
nation_origin     142
head_depart       134
dtype: int64

hate_crime       368
plead_guilti     274
year_prison      159
special_agent    117
thoma_perez      111
grand_juri       101
perez_general     95
said_thoma        90
unit_state        88
act_general       85
dtype: int64

project_safe         472
safe_childhood       472
child_pornographi    441
child_exploit        278
sexual_exploit       223
exploit_children     199
plead_guilti         194
exploit_obscen       175
obscen_section       174
child_sexual         174
dtype: int64
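
The jump from 7,847 to 74,085 columns in part D reflects how many more distinct adjacent word pairs there are than distinct words. Below is a toy illustration that counts distinct unigrams and bigrams and lists the most frequent bigrams. The three stemmed documents are made up for demonstration only.

```python
from collections import Counter

# hypothetical stemmed documents
docs = [
    "plead guilti hate crime",
    "hate crime charg plead guilti",
    "guilti plead crime hate",
]

unigram_counts = Counter()
bigram_counts = Counter()
for doc in docs:
    words = doc.split()
    unigram_counts.update(words)
    # pair each word with its successor, as in create_bigram_onedoc
    bigram_counts.update(f"{w1}_{w2}" for w1, w2 in zip(words, words[1:]))

print(len(unigram_counts), "distinct unigrams")
print(len(bigram_counts), "distinct bigrams")
print(bigram_counts.most_common(2))
```

Even with only five distinct words, changes in word order create new bigrams, so the bigram vocabulary grows much faster than the unigram vocabulary as the corpus gets larger.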

# 4. Optional extra credit (2 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach this is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings.

Find the top two pairs (so four press releases total). Do they seem like different stages of the same crime or just press releases covering similar crimes?

In [408]:
# your code here 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

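# work on a random sample of 100 releases to keep the pairwise loop fast;
# take an explicit copy because new columns are added to this sample below
doj_subset_wscore_small = doj_subset_wscore.sample(100).copy()

vectorizer = TfidfVectorizer()

def get_most_similar_article(processed_contents):
    """Return (id, cosine similarity) of the sampled release most similar to the input text."""
    id_list = []
    for _, row in doj_subset_wscore_small.iterrows():
        # skip the release itself (and any exact duplicates of it)
        if row["processed_contents"] != processed_contents:
            # tf-idf is refit on just this pair, so idf weights are pair-specific
            corpus = (processed_contents, row["processed_contents"])
            trsfm = vectorizer.fit_transform(corpus)
            similarity = cosine_similarity(trsfm[0:1], trsfm)
            id_list.append((row["id"], similarity[0][1]))

    # highest similarity first
    most_similar = sorted(id_list, key=lambda x: x[1], reverse=True)
    return most_similar[0]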
            

In [409]:
# apply the function to the dataframe
most_similar_articles = doj_subset_wscore_small["processed_contents"].apply(get_most_similar_article)
doj_subset_wscore_small["most_similar_article_id"] = [item[0] for item in most_similar_articles]
doj_subset_wscore_small["most_similar_article_score"] = [item[1] for item in most_similar_articles]

In [413]:
# print the top 2 pairs of most similar articles
# (here each pair shows up twice, once per direction, so the top 2 pairs fill the first 4 rows)
doj_subset_wscore_small_sorted_by_similarity = doj_subset_wscore_small.sort_values(by="most_similar_article_score", ascending=False)

doj_subset_wscore_small_sorted_by_similarity[["id", "contents", "most_similar_article_id", "most_similar_article_score"]].head(4)

"Both pairs seem to be referencing different stages of the same case."

Unnamed: 0,id,contents,most_similar_article_id,most_similar_article_score
4689,15-1332,"Former Tate County, Mississippi, Lieutenant Randy T. Doss pleaded guilty in federal court today to unlawfully tasing an inmate at the Tate County Jail. The tasing, which occurred in 2012, caused J.W., a pre-trial detainee, to fall to the concrete floor and fracture his skull. On Jan. 27, 2012, a jail-wide search was ordered after an inmate was reportedly assaulted with a razor. When corrections officers entered J.W.’s pod to conduct their search, he and his fellow inmates were ordered to stand facing the wall. At the time of the incident, the victim was standing against a wall with his hands over his head, not posing a physical threat to anyone. Doss tased the victim from 11 feet away. The victim fell backward and hit his head on the concrete floor, necessitating brain surgery. The incident was captured on video. “The defendant was an experienced law-enforcement officer who abused the authority entrusted to him,” said Principal Deputy Assistant Attorney General Vanita Gupta, head of the Civil Rights Division. “The right to be free from excessive force is a Constitutional guarantee for all citizens, including those in custody. The U.S. Department of Justice and the Civil Rights Division will vigorously enforce this right.” “The actions of the defendant are reprehensible and inexcusable,” said U.S. Attorney Felicia C. Adams of the Northern District of Mississippi. “He abused his authority, violated the law and the public trust. While the majority of law enforcement officers are hardworking professionals who risk their lives daily for our safety, the U. S. Attorney's Office is committed to aggressively prosecuting those officers who break the law and violate the public trust.” “Officers who abuse their power further undermine the public's trust in law enforcement,” said Special Agent in Charge Donald Alway of the FBI’s Jackson Division. 
“These types of cases are some of the FBI's most important work and help ensure and maintain a healthy democracy."" Doss, 63, had more than 20 years of experience in law enforcement. He had been certified to train other officers on the proper use of force, including how to use a taser. The defendant was indicted on March 30, 2015, by a grand jury sitting in Oxford, Mississippi. He was charged with a single count of violating the victim’s right not to be deprived of liberty without due process of law. Doss was charged with both using a dangerous weapon – a taser electronic control device – and causing bodily injury to the victim. The defendant will be sentenced by U.S. District Court Judge Michael P. Mills of the Northern District of Mississippi on Feb. 18, 2016. This case was investigated by the FBI’s Jackson Division, with the cooperation of the Tate County Sheriff’s Office. The case is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorneys Dana Mulhauser and Andres Palacio of the Civil Rights Division. Doss Plea Agreement",16-353,0.782375
4690,16-353,"Former Tate County, Mississippi, Lieutenant Randy T. Doss, 63, was sentenced to two years in prison today for unlawfully tasing a pretrial detainee, J.W., at the Tate County Jail. The tasing, which occurred in 2012, caused the victim to fall to the concrete floor and fracture his skull. At the time of the incident, which was captured on video, the victim was standing against a wall with his hands over his head, not posing a physical threat to anyone. Doss tased the victim from 11 feet away. The victim fell backward and hit his head on the concrete floor, necessitating brain surgery. “The defendant is a veteran law enforcement officer who had been certified to train other officers on appropriate use of force,” said Principal Deputy Assistant Attorney General Gupta. “The Department of Justice will protect the rights of all citizens from excessive force at the hands of law enforcement.” “The defendant abused his authority, violated the law and the public trust,” said U.S. Attorney Felicia C. Adams of the Northern District of Mississippi. “While the majority of law enforcement officers are hardworking professionals who risk their lives daily for our safety, the U. S. Attorney’s Office is committed to aggressively prosecuting those officers who break the law and violate an individual’s constitutional rights.” “In making arrests, maintaining order and defending life, the law allows law enforcement officers to use whatever force is ‘reasonably’ necessary,” said Special Agent in Charge Donald Alway of the FBI’s Jackson Division. “Violations of federal law occur when it can be shown, as in this case, that the force used was willfully ‘unreasonable’ or ‘excessive.’” Doss had more than 20 years of experience in law enforcement, and had been certified to train other officers on the proper use of force, including how to use a taser. Doss was indicted on March 30, 2015, by a grand jury in Oxford, Mississippi. 
He was charged with a single count of violating the rights of J.W. not to be deprived of liberty without due process of law. Doss was charged with both using a dangerous weapon – a taser – and causing bodily injury to the victim. He pleaded guilty to the single count in October 2015. The case was investigated by the FBI’s Jackson Division, with the cooperation of the Tate County Sheriff’s Office. It was prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorneys Dana Mulhauser and Andres Palacio of the Civil Rights Division’s Criminal Section.",15-1332,0.782375
9661,10-579,"WASHINGTON- Daniel Lee Jones, a Portland, Ore., white supremacist, pleaded guilty to using the Postal Service to send a threatening communication to the president of the Lima, Ohio, chapter of the National Association for the Advancement of Colored People, the Justice Department announced today. In the plea agreement, Jones admits to mailing F.M. Jason Upthegrove a hangman’s noose, which arrived at Mr. Upthegrove’s home on or about Feb. 14, 2008. Jones further states in the plea agreement that he mailed the communication containing the hangman’s noose in order to convey a threat to injure Mr. Upthegrove because he was an African-American who publicly advocated for better police services for African-Americans in Lima, Ohio. The indictment indicates that Mr. Upthegrove also spoke out in the media against Jones’s white supremacist group’s mailing of hate flyers related to the shooting of an African-American woman by a member of the Lima Police Department. Jones faces a maximum prison sentence of five years and a potential fine of up to $250,000 for his conviction. ""A noose is an unmistakable symbol of hate in our nation, and it was used in this case to intimidate an individual for exercising his right to speak out and advocate on behalf of others,"" said Thomas E. Perez, Assistant Attorney General of the Civil Rights Division. ""The Department of Justice will vigorously prosecute those who resort to violent threats to silence such advocates, especially when that threat is motivated by hate."" ""Sending a noose is a threat that harkens back to some of the darkest days of our history. We simply will not tolerate such actions any longer,"" said U.S. Attorney Steven M. Dettelbach. The case was investigated by FBI Special Agent Brian Russ, and the prosecution was handled by Assistant U.S. Attorney David Bauer from the U.S. 
Attorney’s Office, and Special Legal Counsel Barry Kowalski and Trial Attorney Shan Patel from the Civil Rights Division of the Department of Justice.",10-1265,0.768189
9648,10-1265,"WASHINGTON - Daniel Lee Jones, a Portland, Ore., white supremacist, was sentenced today to 18 months in prison and three years supervised release for threatening the president of the Lima, Ohio, chapter of the NAACP by mailing him a noose. Jones entered a guilty plea on May 17, 2010, to using the U.S. Postal Service to send a threatening communication. In the plea agreement, Jones admitted to mailing F.M. Jason Upthegrove a hangman’s noose, which arrived at Mr. Upthegrove’s home on or about Feb. 14, 2008. Jones stated in the plea agreement that he mailed the hangman’s noose in order to convey a threat to Mr. Upthegrove because he was an African-American who publicly advocated for better police services for African-Americans in Lima, Ohio. The indictment indicated that Mr. Upthegrove also spoke out in the media against Jones’s white supremacist group’s mailing of hate flyers related to the shooting of an African American woman by a member of the Lima Police Department. ""A noose, an unmistakable symbol of hatred in this nation, was used by this defendant as a threat of violence aimed at silencing a civil rights advocate,"" said Thomas E. Perez, Assistant Attorney General of the Civil Rights Division. ""The Department of Justice will vigorously prosecute those who use threats of violence to attempt to silence proponents of racial equality."" ""We will not tolerate those who use threats of violence, such as by mailing a noose, to intimidate individuals who are advocating for racial equality,""said U.S. Attorney for the Northern District of Ohio Steven M. Dettelbach. The case was investigated by Special Agent Brian Russ of the FBI, and the prosecution was handled by Assistant U.S. Attorney David Bauer from the U.S. Attorney’s Office, and Special Legal Counsel Barry Kowalski and Trial Attorney Shan Patel from the Civil Rights Division of the U.S. Department of Justice.",10-579,0.768189


'Both pairs seem to be referencing different stages of the same case.'
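
An alternative to refitting TF-IDF on every pair is to weight all documents against one shared vocabulary and then rank each unordered pair once, which also avoids listing every pair in both directions. The sketch below hand-rolls a simplified TF-IDF (raw term counts times a smoothed log inverse document frequency) using only the standard library. The weighting, the toy documents, and the helper names are assumptions for illustration, and the scores need not match the scikit-learn results above.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(docs):
    """Map each document to a {term: weight} dict using one shared idf."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in Counter(tokens).items()} for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def top_pairs(ids, docs, k=2):
    """Return the k most similar unordered pairs as (score, id_a, id_b)."""
    vectors = tfidf_vectors(docs)
    scored = [
        (cosine(vectors[i], vectors[j]), ids[i], ids[j])
        for i, j in combinations(range(len(docs)), 2)
    ]
    return sorted(scored, reverse=True)[:k]

# hypothetical toy corpus of already-processed text
toy_ids = ["a-plea", "a-sentenc", "b-plea", "b-sentenc"]
toy_docs = [
    "doss plead guilti tase inmat jail",
    "doss sentenc prison tase inmat jail",
    "jone plead guilti mail noos threat",
    "jone sentenc prison mail noos threat",
]
result = top_pairs(toy_ids, toy_docs)
for score, id_a, id_b in result:
    print(id_a, id_b, round(score, 3))
```

In the toy corpus, the plea and sentencing releases for the same defendant share names and case facts, so they outrank pairs that only share stage vocabulary such as "plead guilti". This is the same pattern as the two pairs found above.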