# Problem set 3: Text analysis of DOJ press releases

**Total points (without extra credit)**: 52 

- For background:

    - DOJ is the federal law enforcement agency responsible for federal prosecutions; this contrasts with the local prosecutions in the Cook County dataset we analyzed earlier. Here's a short explainer on which crimes get prosecuted federally versus locally: https://www.criminaldefenselawyer.com/resources/criminal-defense/federal-crime/state-vs-federal-crimes.htm#:~:text=Federal%20criminal%20prosecutions%20are%20handled,of%20state%20and%20local%20law. 
    - Here's the Kaggle dataset that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 
    - Here's the code the dataset creator used to scrape those press releases, if you're interested: https://github.com/jbencina/dojreleases

## 0.0 Import packages

In [1]:
## helpful packages
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import random
import re
import string

## nltk imports
import nltk
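### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
### (if one is still missing, the LookupError that nltk raises names the exact resource to download)
#nltk.download('punkt')  # tokenizer models used by word_tokenize
#nltk.download('averaged_perceptron_tagger')
#nltk.download('stopwords')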
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

## spacy imports
import spacy
### uncomment and run the line below if you haven't downloaded the en_core_web_sm model yet
#! python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts and wide-format text
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_colwidth', None)

## 0.1 Load and clean text data

In [3]:
## first, unzip the file pset3_inputdata.zip
## then, run this code to load the unzipped json file into a dataframe
## and convert some of the attributes from lists to strings
## (you may need to change the pathname depending on where you stored the file)
doj = pd.read_json("combined.json", lines = True)

## topics are stored as lists in the json, so join each list into one string separated by ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 
           'components_clean']].copy()

doj.head()

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23, who was convicted in 2013 of attempting to use a weapon of mass destruction (explosives) in connection with a plot to detonate a vehicle bomb at an annual Christmas tree lighting ceremony in Portland, was sentenced today to serve 30 years in prison, followed by a lifetime term of supervised release. Mohamud, a naturalized U.S. citizen from Somalia and former resident of Corvallis, Oregon, was arrested on Nov. 26, 2010, after he attempted to detonate what he believed to be an explosives-laden van that was parked near the tree lighting ceremony in Portland. The arrest was the culmination of a long-term undercover operation, during which Mohamud was monitored closely for months as his bomb plot developed. The device was in fact inert, and the public was never in danger from the device. At sentencing, United States District Court Judge Garr M. King, who presided over Mohamed’s 14-day trial, said “the intended crime was horrific,” and that the defendant, even though he was presented with options by undercover FBI employees, “never once expressed a change of heart.” King further noted that the Christmas tree ceremony was attended by up to 10,000 people, and that the defendant “wanted everyone to leave either dead or injured.” King said his sentence was necessary in view of the seriousness of the crime and to serve as deterrence to others who might consider similar acts. “With today’s sentencing, Mohamed Osman Mohamud is being held accountable for his attempted use of what he believed to be a massive bomb to attack innocent civilians attending a public Christmas tree lighting ceremony in Portland,” said John P. Carlin, Assistant Attorney General for National Security. “The evidence clearly indicated that Mohamud was intent on killing as many people as possible with his attack. 
Fortunately, law enforcement was able to identify him as a threat, insert themselves in the place of a terrorist that Mohamud was trying to contact, and thwart Mohamud’s efforts to conduct an attack on our soil. This case highlights how the use of undercover operations against would-be terrorists allows us to engage and disrupt those who wish to commit horrific acts of violence against the innocent public. The many agents, analysts, and prosecutors who have worked on this case deserve great credit for their roles in protecting Portland from the threat posed by this defendant and ensuring that he was brought to justice.” “This trial provided a rare glimpse into the techniques Al Qaeda employs to radicalize home-grown extremists,” said Amanda Marshall, U.S. Attorney for the District of Oregon. “With the sentencing today, the court has held this defendant accountable. I thank the dedicated professionals in the law enforcement and intelligence communities who were responsible for this successful outcome. I look forward to our continued work with Muslim communities in Oregon who are committed to ensuring that all young people are safe from extremists who seek to radicalize others to engage in violence.” According to the trial evidence, in February 2009, Mohamud began communicating via e-mail with Samir Khan, a now-deceased al Qaeda terrorist who published Jihad Recollections, an online magazine that advocated violent jihad, and who also published Inspire, the official magazine of al-Qaeda in the Arabian Peninsula. Between February and August 2009, Mohamed exchanged approximately 150 emails with Khan. Mohamud wrote several articles for Jihad Recollections that were published under assumed names. In August 2009, Mohamud was in email contact with Amro Al-Ali, a Saudi national who was in Yemen at the time and is today in custody in Saudi Arabia for terrorism offenses. 
Al-Ali sent Mohamud detailed e-mails designed to facilitate Mohamud’s travel to Yemen to train for violent jihad. In December 2009, while Al-Ali was in the northwest frontier province of Pakistan, Mohamud and Al-Ali discussed the possibility of Mohamud traveling to Pakistan to join Al-Ali in terrorist activities. Mohamud responded to Al-Ali in an e-mail: “yes, that would be wonderful, just tell me what I need to do.” Al-Ali referred Mohamud to a second associate overseas and provided Mohamud with a name and email address to facilitate the process. In the following months, Mohamud made several unsuccessful attempts to contact Al-Ali’s associate. Ultimately, an FBI undercover operative contacted Mohamud via email under the guise of being an associate of Al-Ali’s. Mohamud and the FBI undercover operative agreed to meet in Portland in July 2010. At the meeting, Mohamud told the FBI undercover operative he had written articles that were published in Jihad Recollections. Mohamud also said that he wanted to become “operational.” Asked what he meant by “operational,” Mohamud said he wanted to put an explosion together, but needed help. According to evidence presented at trial, at a meeting in August 2010, Mohamud told undercover FBI operatives he had been thinking of committing violent jihad since the age of 15. Mohamud then told the undercover FBI operatives that he had identified a potential target for a bomb: the annual Christmas tree lighting ceremony in Portland’s Pioneer Courthouse Square on Nov. 26, 2010. The undercover FBI operatives cautioned Mohamud several times about the seriousness of this plan, noting there would be many people at the event, including children, and emphasized that Mohamud could abandon his attack plans at any time with no shame. Mohamud indicated the deaths would be justified and that he would not mind carrying out a suicide attack on the crowd. 
According to evidence presented at trial, in the ensuing months Mohamud continued to express his interest in carrying out the attack and worked on logistics. On Nov. 4, 2010, Mohamud and the undercover FBI operatives traveled to a remote location in Lincoln County, Oregon, where they detonated a bomb concealed in a backpack as a trial run for the upcoming attack. During the drive back to Corvallis, Mohamud was asked if was capable looking at all the bodies of those who would be killed during the explosion. In response, Mohamud noted, “I want whoever is attending that event to be, to leave either dead or injured.” Mohamud later recorded a video of himself, with the assistance of the undercover FBI operatives, in which he read a statement that offered his rationale for his bomb attack. On Nov. 18, 2010, undercover FBI operatives picked up Mohamud to travel to Portland to finalize the details of the attack. On Nov. 26, 2010, just hours before the planned attack, Mohamud examined the 1,800 pound bomb in the van and remarked that it was “beautiful.” Later that day, Mohamud was arrested after he attempted to remotely detonate the inert vehicle bomb rked near the Christmas tree lighting ceremony This case was investigated by the FBI, with assistance from the Oregon State Police, the Corvallis Police Department, the Lincoln County Sheriff’s Office and the Portland Police Bureau. The prosecution was handled by Assistant U.S. Attorneys Ethan D. Knight and Pamala Holsinger from the U.S. Attorney’s Office for the District of Oregon. Trial Attorney Jolie F. Zimmerman, from the Counterterrorism Section of the Justice Department’s National Security Division, assisted. # # # 14-1077",2014-10-01T00:00:00-04:00,No topic,National Security Division (NSD)
1,12-919,$1 Million in Restitution Payments Announced to Preserve North Carolina Wetlands,"WASHINGTON – North Carolina’s Waccamaw River watershed will benefit from a $1 million restitution order from a federal court, funding environmental projects to acquire and preserve wetlands in an area damaged by illegal releases of wastewater from a corporate hog farm, announced Ignacia S. Moreno, Assistant Attorney General of the Justice Department’s Environment and Natural Resources Division; U.S. Attorney for the Eastern District of North Carolina Thomas G. Walker; Director Greg McLeod from the North Carolina State Bureau of Investigation; and Camilla M. Herlevich, Executive Director of the North Carolina Coastal Land Trust. Freedman Farms Inc. was sentenced in February 2012 to five years of probation and ordered to pay $1.5 million in fines, restitution and community service payments for violating the Clean Water Act when it discharged hog waste into a stream that leads to the Waccamaw River. William B. Freedman, president of Freedman Farms, was sentenced to six months in prison to be followed by six months of home confinement. Freedman Farms also is required to implement a comprehensive environmental compliance program and institute an annual training program. In an order issued on April 19, 2012, the court ordered that the defendants would be responsible for restitution of $1 million in the form of five annual payments starting in January 2013, which the court will direct to the North Carolina Coastal Land Trust (NCCLT). The NCCLT plans to use the money to acquire and conserve land along streams in the Waccamaw watershed. The court also directed a $75,000 community service payment to the Southern Environmental Enforcement Network, an organization dedicated to environmental law enforcement training and information sharing in the region. 
“The resolution of the case against Freedman Farms demonstrates the commitment of the Department of Justice to enforcing the Clean Water Act to ensure the protection of human health and the environment,” said Assistant Attorney General Moreno. “The court-ordered restitution in this case will conserve wetlands for the benefit of the people of North Carolina. By enforcing the nation’s environmental laws, we will continue to ensure that concentrated animal feeding operations (CAFOs) operate without threatening our drinking water, the health of our communities and the environment.” “This office is committed to doing our part to hold accountable those who commit crimes against our environment, which can cause serious health problems to residents and damage the environment that makes North Carolina such a beautiful place to live and visit,” said U.S. Attorney Walker. “This case shows what we can accomplish when our SBI agents work closely with their local, state and federal partners to investigate environmental crimes and hold the polluters accountable,” said Director McLeod. “We’ll continue our efforts to fight illegal pollution that damages our water and puts the public’s health at risk.” “The Waccamaw is unique and wild,” said Director Herlevich of the North Carolina Coastal Land Trust. “Its watershed includes some of the most extensive cypress gum swamps in the state, and its headwaters at Lake Waccamaw contain fish that are found nowhere else on Earth. We appreciate the trust of the court and the U. S. Attorney, and we look forward to using these funds for conservation projects in a river system that is one of our top conservation priorities.” According to evidence presented in court, in December 2007 Freedman Farms discharged hog waste into Browder’s Branch, a tributary to the Waccamaw River that flows through the White Marsh, a large wetlands complex. 
Freedman Farms, located in Columbus County, N.C., is in the business of raising hogs for market, and this particular farm had some 4,800 hogs. The hog waste was supposed to be directed to two lagoons for treatment and disposal. Instead, hog waste was discharged from Freedman Farms directly into Browder’s Branch. The Clean Water Act is a federal law that makes it illegal to knowingly or negligently discharge a pollutant into a water of the United States. The Freedman case was investigated by the U.S. Environmental Protection Agency (EPA) Criminal Investigation Division, the U.S. Army Corps of Engineers and the North Carolina State Bureau of Investigation, with assistance from the EPA Science and Ecosystem Support Division. The case was prosecuted by Assistant U.S. Attorney J. Gaston B. Williams of the Eastern District of North Carolina and Trial Attorney Mary Dee Carraway of the Environmental Crimes Section of the Justice Department’s Environment and Natural Resources Division. The North Carolina Coastal Land Trust is celebrating its 20th anniversary of saving special lands in eastern North Carolina. The organization has protected nearly 50,000 acres of lands with scenic, recreational, historic and ecological values. North Carolina Coastal Land Trust has saved streams and wetlands that provide clean water, forests that are havens for wildlife, working farms that provide local food and nature parks that everyone can enjoy. More information about the Coastal Land Trust is available at www.coastallandtrust.org.",2012-07-25T00:00:00-04:00,No topic,Environment and Natural Resources Division
2,11-1002,$1 Million Settlement Reached for Natural Resource Damages at Superfund Site in Massachusetts,"BOSTON– A $1-million settlement has been reached for natural resource damages (NRD) at the Blackburn & Union Privileges Superfund Site in Walpole, Mass., the Departments of Justice and Interior (DOI), and the Office of the Massachusetts Attorney General announced today. The Blackburn & Union Privileges Superfund Site includes 22 acres of contaminated land and water in Walpole. The contamination resulted from the operations of various industrial facilities dating back to the 19th century that exposed the site to asbestos, arsenic, lead and other hazardous substances. The private parties involved in the settlement include two former owners and operators of the site, W.R. Grace & Co.– Conn. and Tyco Healthcare Group LP, as well as the current owners, BIM Investment Corp. and Shaffer Realty Nominee Trust. From about 1915 to 1936, a predecessor of W.R. Grace manufactured asbestos brake linings and clutch linings on a large portion of the property. From 1946 to about 1983, a predecessor of Tyco Healthcare operated a cotton fabric manufacturing business, which used caustic solutions, on a portion of the property. In a 2010 settlement with U.S. Environmental Protection Agency (EPA), the four private parties agreed to perform a remedial action to clean up the site at an estimated cost of $13 million. The consent decree lodged today resolves both state and federal NRD liability claims; it requires the parties to pay $1,094,169.56 to the state and federal natural resource trustees, the Massachusetts Executive Office of Energy and Environmental Affairs (EEA) and DOI, for injuries to ecological resources including groundwater and wetlands, which provide habitat for waterfowl and wading birds, including black ducks and great blue herons. The trustees will use the settlement funds for natural resource restoration projects in the area. 
“This settlement demonstrates our commitment to recovering damages from the parties responsible for injury to natural resources, in partnership with state trustees,” said Bruce Gelber, Acting Deputy Assistant Attorney General of the Justice Department’s Environment and Natural Resources Division. “The citizens of Walpole have had to live with the environmental impact of this contamination for many years,” Attorney General Martha Coakley said. “We are pleased that today’s agreement will not only require the responsible parties to reimburse taxpayer dollars, but will also provide funding to begin restoring or replacing the wetland and other natural resources.” The consent decree was lodged in the U.S. District Court for Massachusetts. A portion of the funds, $300,000, will be distributed to the EEA-sponsored groundwater restoration projects; $575,000 will be used for ecological restoration projects jointly sponsored by EEA and the U.S. Fish and Wildlife Service (FWS). In addition, $125,000 will go for projects jointly sponsored by EEA and FWS that achieve both ecological and groundwater restoration; $57,491.34 will be allocated for reimbursement for the FWS’s assessment costs; and $36,678.22 will be distributed as reimbursement for the commonwealth’s assessment costs. “This settlement provides the means for a range of projects designed to compensate the public for decades of groundwater and other ecological damage at this site. I encourage local citizens and organizations to become engaged in the public process that will take place as we solicit, take comment on, and choose these projects in the months ahead,” said Energy and Environmental Affairs Secretary Richard K. Sullivan Jr., who serves as the Commonwealth’s Natural Resources Damages trustee. “This settlement will help restore habitat for fish and wildlife in the Neponset River watershed,” said Tom Chapman of the FWS New England Field Office. 
“We look forward to working with the commonwealth and local stakeholders to implement restoration.” “More than 100 years-worth of industrial activities at this site caused major environmental contamination to the Neponset River, nearby wetlands and to groundwater below the site,” said Commissioner Kenneth Kimmell of the Massachusetts Department of Environmental Protection (MassDEP), which will staff the Trustee Council for the Commonwealth. “We will ensure that the community and the public will be active participants in the process to use these NRD funds to restore the injured natural resources.” Under the federal Comprehensive Environmental Response, Compensation and Liability Act, EEA and DOI, acting through the FWS, are the designated state and federal natural resource Trustees for the site. The site has been listed on the EPA’s National Priorities List since 1994. The consent decree is subject to a public comment period and court approval. A copy of the consent decree and instructions about how to submit comments is available on www.usdoj.gov/enrd/Consent_Decrees.html . After the consent decree is approved, EEA and FWS will develop proposed restoration plans to use the settlement funds for restoration projects. The proposed restoration plans will also be made available to the public for review and comment. Assistant Attorney General Matthew Brock of Massachusetts Attorney General Coakley's Environmental Protection Division handled this matter. Attorney Jennifer Davis of MassDEP, Attorney Anna Blumkin of EEA and MassDEP’s NRD Coordinator Karen Pelto also worked on this settlement.",2011-08-03T00:00:00-04:00,No topic,Environment and Natural Resources Division
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying Vehicle Emissions Tests,"WASHINGTON—A federal grand jury in Las Vegas today returned indictments against 10 Nevada-certified emissions testers for falsifying vehicle emissions test reports, the Justice Department announced. Each defendant faces one felony Clean Air Act count for falsifying reports between November 2007 and May 2009. The number of falsifications varied by defendant, with some defendants having falsified approximately 250 records, while others falsified more than double that figure. One defendant is alleged to have falsified over 700 reports. The individuals indicted include: Escudero resides in Pahrump, Nev. All other individuals are from Clark County, Nev. The 10 defendants are alleged to have engaged in a practice known as ""clean scanning"" vehicles. The scheme involved entering the Vehicle Identification Number (VIN) for a vehicle that would not pass the emissions test into the computerized system, then connecting a different vehicle the testers knew would pass the test. These falsifications were allegedly performed for anywhere from $10 to $100 over and above the usual emissions testing fee. The U.S. Environmental Protection Agency (EPA), under the Clean Air Act, requires the state of Nevada to conduct vehicle emissions testing in certain areas because the areas exceed national standards for carbon monoxide and ozone. Las Vegas is currently required to perform emissions testing. To obtain a registration renewal, vehicle owners bring the vehicles to a licensed inspection station for testing. The emissions inspector logs into a computer to activate the system by using a unique password issued to the emissions inspector. The emissions inspector manually inputs the vehicle’s VIN to identify the tested vehicle, then connects the vehicle for model year 1996 and later to an onboard diagnostics port connected to an analyzer. 
The analyzer downloads data from the vehicle’s computer, analyzes the data and provides a ""pass"" or ""fail"" result. The pass or fail result and vehicle identification data are reported on the Vehicle Inspection Report. It is a crime to knowingly alter or conceal any record or other document required to be maintained by the Clean Air Act. ""Falsifications of vehicle emissions testing, such as those alleged in the indictments unsealed today, are serious matters and we intend to use all of our enforcement tools to stop this harmful practice. These actions undermine a system that is designed to reduce air pollutants including smog and provide better air quality for the citizens of Nevada,"" said Ignacia S. Moreno, Assistant Attorney General for the Justice Department’s Environment and Natural Resources Division. ""The residents of Nevada deserve to know that the vast majority of licensed vehicle emission inspectors are not corrupt and are not circumventing emission testing procedures,"" said U.S. Attorney Bogden. ""These indictments should serve as a clear warning to offenders that the Department of Justice will prosecute you if you make fraudulent statements and reports concerning compliance with the federal Clean Air Act."" ""Lying about car emissions means dirtier air, which is especially of concern in areas like Las Vegas that are already experiencing air quality problems,"" said Cynthia Giles, Assistant Administrator for Enforcement and Compliance Assurance at EPA. ""We will take aggressive action to ensure communities have clean air."" The maximum penalty for the felony violations contained in the indictments includes up to two years in prison and a fine of up to $250,000. An indictment is merely an accusation, and a defendant is presumed innocent unless and until proven guilty in a court of law. The case was investigated by the EPA, Criminal Investigation Division; and the Nevada Department of Motor Vehicles Compliance Enforcement Division. 
The case is being prosecuted by the U.S. Attorney’s Office for the District of Nevada and the Justice Department’s Environmental Crimes Section.",2010-01-08T00:00:00-05:00,No topic,Environment and Natural Resources Division
4,18-898,"$100 Million Settlement Will Speed Cleanup Work at Centredale Manor Superfund Site in North Providence, R.I.","The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin contaminated sediment and soil at the Centredale Manor Restoration Project Superfund Site in North Providence and Johnston, Rhode Island. “We are pleased to reach a resolution through collaborative work with the responsible parties, EPA, and other stakeholders,” said Acting Assistant Attorney General Jeffrey H. Wood for the Justice Department's Environment and Natural Resources Division . “Today’s settlement ends protracted litigation and allows for important work to get underway to restore a healthy environment for citizens living in and around the Centredale Manor Site and the Woonasquatucket River.” “This settlement demonstrates the tremendous progress we are achieving working with responsible parties, states, and our federal partners to expedite sites through the entire Superfund remediation process,” said EPA Acting Administrator Andrew Wheeler. “The Centredale Manor Site has been on the National Priorities List for 18 years; we are taking charge and ensuring the Agency makes good on its promise to clean it up for the betterment of the environment and those communities affected.” “Successfully concluding this settlement paves the way for EPA to make good on our commitment to aggressively pursue cleaning up the Centredale Manor Superfund Site,” said EPA New England Regional Administrator Alexandra Dunn. 
“We are excited to get to work on the cleanup at this site, and get it closer to the goal of being fully utilized by the North Providence and Johnston communities.” “We are pleased that the collective efforts of the State of Rhode Island, EPA, and DOJ in these negotiations have concluded in this major milestone toward the cleanup of the Centredale Manor Restoration Superfund site and are consistent with our long-standing efforts to make the polluter pay,” said RIDEM Director Janet Coit. “The settlement will speed up a remedy that protects public health and the river environment, and moves us closer to the day that we can reclaim recreational uses of this beautiful river resource.” The settlement, which includes cleanup work in the Woonasquatucket River (River) and bordering residential and commercial properties along the River, requires the companies to perform the remedy selected by EPA for the Site in 2012, which is estimated to cost approximately $100 million, and resolves longstanding litigation. The cleanup remedy includes excavation of contaminated sediment and floodplain soil from the Woonasquatucket River, including from adjacent residential properties. Once the cleanup remedy is completed, full access to the Woonasquatucket River should be restored for local citizens. The cleanup will be a step toward the State’s goal of a fishable and swimmable river. The work will also include upgrading caps over contaminated soil in the peninsula area of the Site that currently house two high-rise apartment buildings. The settlement also ensures that the long-term monitoring and maintenance of the site, as directed in the remedy, will be implemented to ensure that public health is protected. Under the settlement, Emhart and Black & Decker will reimburse EPA for approximately $42 million in past costs incurred at the Site. 
The companies will also reimburse EPA and the State of Rhode Island for future costs incurred by those agencies in overseeing the work required by the settlement. The settlement will also include payments on behalf of two federal agencies to resolve claims against those agencies. These payments, along with prior settlements related to the Site, will result in a 100 percent recovery for the United States of its past and future response costs related to the Site. Litigation related to the Site has been ongoing for nearly eight years. While the Federal District Court found Black & Decker and Emhart to be liable for their hazardous waste and responsible to conduct the cleanup of the Site, it had also ruled that EPA needed to reconsider certain aspects of that cleanup. EPA appealed the decision requiring it to reconsider aspects of the cleanup. This settlement, once entered by the District Court, will resolve the litigation between the United States, Rhode Island, and Emhart and Black and Decker, allowing the cleanup of the Site to begin. The Site spans a one and a half mile stretch of the Woonasquatucket River and encompasses a nine-acre peninsula, two ponds and a significant forested wetland. From the 1940s to the early 1970s, Emhart’s predecessor operated a chemical manufacturing facility on the peninsula and used a raw material that was contaminated with 2,3,7,8-tetrachlorodibenzo-p-dioxin, a toxic form of dioxin. The Site property was also previously used by a barrel refurbisher. Elevated levels of dioxins and other contaminants have been detected in soil, groundwater, sediment, surface water and fish. The Site was added to the National Priorities List (NPL) in 2000, and in December 2017, EPA included the Centredale Manor Restoration Project Superfund Site on a list of Superfund sites targeted for immediate and intense attention. 
Several short-term actions were previously performed at the Site to address immediate threats to the residents and minimize potential erosion and downstream transport of contaminated soil and sediment. This settlement is the latest agreement EPA has reached since the Site was listed on the NPL. Prior agreements addressed the performance and recovery of costs for the past environmental investigations and interim cleanup actions from Emhart, the barrel reconditioning company, the current owners of the peninsula portion of the Site, and other potentially responsible parties. The Consent Decree, lodged in the U.S. District Court of Rhode Island, will be posted in the Federal Register and available for public comment for a period of 30 days. The Consent Decree can be viewed on the Justice Department website: www.justice.gov/enrd/Consent_Decrees.html. EPA information on the Centredale Manor Superfund Site: www.epa.gov/superfund/centredale.",2018-07-09T00:00:00-04:00,Environment,Environment and Natural Resources Division
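
The list-to-string cleanup that the loading cell applies to `topics` and `components` can be tried on toy data first. This is a minimal sketch in plain Python; `toy_topics` is made-up data for illustration, not values taken from the DOJ file.

```python
# Made-up topic lists mimicking the shape of the `topics` column
# (an assumption for illustration, not real data)
toy_topics = [["Environment", "Tax"], [], ["Health Care Fraud"]]

# Join each non-empty list with "; " and label empty lists explicitly
toy_topics_clean = ["; ".join(topic) if len(topic) > 0 else "No topic"
                    for topic in toy_topics]

print(toy_topics_clean)
```

The explicit `"No topic"` label keeps empty lists from turning into empty strings, which would be easy to overlook when grouping or filtering later.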


## 1. Tagging and sentiment scoring (17 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to convert it from a pandas series to a single string.

We'll call the raw string of this press release `pharma`

In [5]:
## your code to subset to one press release and take the string
subset = doj[doj["id"]=="17-1204"]
pharma = subset["contents"].iloc[0]
pharma

## to check that it is a string:
#type(pharma)

'The founder and majority owner of Insys Therapeutics Inc., was arrested today and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a Fentanyl spray intended for cancer patients experiencing breakthrough pain.\xa0"More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids. And yet some medical professionals would rather take advantage of the addicts than try to help them," said Attorney General Jeff Sessions. "This Justice Department will not tolerate this.\xa0 We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic.\xa0 And under the leadership of President Trump, we are fully committed to defeating this threat to the American people.”John N. Kapoor, 74, of Phoenix, Ariz., a current member of the Board of Directors of Insys, was arrested this morning in Arizona and charged with RICO conspi

## 1.1 part of speech tagging (3 points)

A. Preprocess the `pharma` press release to remove all punctuation / digits (you can use `.isalpha()` to subset)

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech. 

C. Using the output from B, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the `pharma` release. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for `.isalpha()`: https://www.w3schools.com/python/ref_string_isalpha.asp
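
A minimal sketch of the filter-then-count pattern behind parts A and C. It uses a toy token list and hand-written `(word, tag)` pairs rather than real `pos_tag` output, so the tokens and tags below are assumptions made up for illustration:

```python
from collections import Counter

# Toy tokens standing in for the output of a tokenizer
tokens = ["The", "former", "CEO", ",", "74", ",", "sold",
          "addictive", "opioid", "spray", "."]

# Part A idea: keep only purely alphabetic tokens (drops "," "74" ".")
alpha_tokens = [tok for tok in tokens if tok.isalpha()]

# Hand-written (word, tag) pairs; real tags would come from nltk's pos_tag
toy_tagged = [("former", "JJ"), ("CEO", "NN"), ("addictive", "JJ"),
              ("former", "JJ"), ("larger", "JJR")]

# Part C idea: keep the adjective tags, then count occurrences of each word
adj_tags = {"JJ", "JJR", "JJS"}
adj_counts = Counter(word for word, tag in toy_tagged if tag in adj_tags)

print(alpha_tokens)
print(adj_counts.most_common(2))
```

`Counter.most_common` returns `(word, count)` pairs already sorted from most to fewest occurrences, which is the same ordering `value_counts()` gives on a pandas series.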

In [7]:
## Part A:
pharma_processed = [token for token in wordpunct_tokenize(pharma)
                   if token.isalpha()]
#pharma_processed

In [9]:
## Part B:
pharma_pos = pos_tag(pharma_processed)

#pharma_pos

In [11]:
## Part C:
adj_tags = ["JJ", "JJR", "JJS"]
all_adjectives = [tok[0] for tok in pharma_pos
                 if tok[1] in adj_tags]

adj_series = pd.Series(all_adjectives)
adj_counts = adj_series.value_counts()

top_5_adj = adj_counts.head(5).reset_index()
top_5_adj.columns = ["adjective", "count"]
top_5_adj

Unnamed: 0,adjective,count
0,former,8
1,opioid,5
2,nationwide,4
3,addictive,3
4,other,3


## 1.2 named entity recognition (4 points)

A. Using the original `pharma` press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.

B. Print the unique named entities with the tag: `LAW`

In [13]:
## Part A:
spacy_pharma = nlp(pharma)
#print(type(spacy_pharma))

In [15]:
## Part B:
law_entities = []
for tok in spacy_pharma.ents:
    if tok.label_ == "LAW" and tok.text not in law_entities: 
        law_entities.append(tok.text)

print(law_entities)

['RICO', 'the Controlled Substances Act']


C. Use Google to summarize in one sentence what the `RICO` named entity means and why this might apply to a pharmaceutical kickbacks case (and not just a mafia case...) 

**Part C:**

According to Herman Law White Collar Criminal Defense, a violation of RICO occurs when a person, in connection with an enterprise, engages in a pattern of racketeering activity that can include the distribution of a controlled substance or wire fraud, meaning that this pharmaceutical kickbacks case can be classified under RICO because it relates to the distribution of opioids (a controlled substance) through a company (Insys Therapeutics) in a way that violates anti-kickback law and involves wire fraud.

D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these named entities.
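
A small sketch of the label-plus-regex filter described in part D. The `(text, label)` pairs are hypothetical stand-ins for what `doc.ents` would provide, and the `\b` word boundaries are an optional design choice (they keep a name like "Yearwood" from matching):

```python
import re

# Hypothetical (text, label) pairs standing in for spaCy's doc.ents
toy_ents = [("last year", "DATE"), ("20 years", "DATE"), ("today", "DATE"),
            ("RICO", "LAW"), ("20 years", "DATE"), ("Yearwood", "PERSON")]

# "years?" matches "year" or "years"; IGNORECASE avoids calling .lower()
year_pattern = re.compile(r"\byears?\b", flags=re.IGNORECASE)

year_dates = []
for text, label in toy_ents:
    # Keep DATE entities that mention year(s), skipping repeats
    if label == "DATE" and year_pattern.search(text) and text not in year_dates:
        year_dates.append(text)

print(year_dates)
```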

In [27]:
## Part D:
sentence_lengths =[]
for tok in spacy_pharma.ents:
    if tok.label_ == "DATE" and re.search(r"years?", tok.text.lower()) and tok.text not in sentence_lengths:
        sentence_lengths.append(tok.text)
        
print(sentence_lengths)

['last year', '20 years', 'three years', 'five years']


E. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum). 

**Hint**: you may want to use re.search or re.findall 

- For part E, you can use `re.search` and `re.findall`, or anything that works 😳.

In [29]:
## PART E:
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(pharma) #breaks pharma into sentences
year_sentences = [s for s in sentences if re.search(r"years?", s.lower())]

print("Sentences that mention year(s):\n")
for s in year_sentences:
    print("-", s, "\n")

Sentences that mention year(s):

- "More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids. 

- This investigation highlights our commitment to defending our mail system from illegal misuse and ensuring public trust in the mail.”“The U.S. Department of Veterans Affairs, Office of Inspector General will continue to aggressively investigate those that attempt to fraudulently impact programs designed to benefit our veterans and their families,” said Donna L. Neves, Special Agent in Charge of the VA OIG Northeast Field Office.The charges of conspiracy to commit RICO and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than 20 years in prison, three years of supervised release and a fine of $250,000, or twice the amount of pecuniary gain or loss. 

- The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, three years of supervised releas

**Part E:**

If convicted after this indictment, the CEO may face up to 20 years in prison and up to three years of supervised release on each of the RICO conspiracy and mail/wire fraud conspiracy charges, and up to five years in prison and three years of supervised release on the charge of conspiracy to violate the Anti-Kickback Law.

## 1.3 sentiment analysis  (10 points)

A. Subset the press releases to those labeled with one of three topics via `topics_clean`: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.



In [31]:
## PART A:
doj_subset = doj[doj["topics_clean"].isin(["Civil Rights", "Hate Crimes", "Project Safe Childhood"])]
doj_subset.shape

(717, 6)

B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string (**Hint**: you may want to use `re.sub` with an or condition)
- Scores the sentiment of the entire press release using the `SentimentIntensityAnalyzer` and `polarity_scores`
- Returns the length-four (negative, positive, neutral, compound) sentiment dictionary (any order is fine)

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- A function plus a list comprehension to execute it will take about 30 seconds on a respectable local machine and about 2 minutes on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on the remainder, you can take a small random sample of the 717
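
A minimal sketch of the `re.sub` "or condition" idea from the first hint. The function name `strip_entities` and the toy entity list are hypothetical; in the real workflow the entity strings would come from spaCy. Sorting longest-first is one possible choice, made here so a short entity does not clip part of a longer one:

```python
import re

def strip_entities(text, entities):
    """Remove each entity string from text with a single regex pass."""
    if not entities:
        return text
    # Longest first so "Dartmouth College" is tried before "Dartmouth"
    ordered = sorted(set(entities), key=len, reverse=True)
    # re.escape keeps characters such as "." or "(" from acting as regex syntax
    pattern = "|".join(re.escape(ent) for ent in ordered)
    return re.sub(pattern, "", text)

toy_text = "Dartmouth College praised the U.S. ruling. Dartmouth agreed."
toy_entities = ["Dartmouth", "Dartmouth College", "U.S."]
cleaned = strip_entities(toy_text, toy_entities)
print(cleaned)
```

On the efficiency hint: one thing worth checking is whether objects such as the `SentimentIntensityAnalyzer` are created once outside the function rather than on every call.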


In [33]:
## your code here to define function
def func(string):
    ## Remove named entities from each press release:
    spacy_string = nlp(string)
    named_entities = [tok.text for tok in spacy_string.ents]
    if named_entities:
        ## Join the escaped entities into one alternation pattern; re.escape keeps
        ## special characters in the entity text from being read as regex syntax
        pattern = "(" + "|".join([re.escape(ent) for ent in named_entities]) + ")"
        ## Any time one of the named entities appears in the string, replace it with ""
        cleaned_text = re.sub(pattern, "", string)
        ## A for loop that replaces one entity at a time would also work, but it would
        ## likely be slower since it scans the string once per entity instead of once overall
    else:
        cleaned_text = string
    #just a check:
    #print(cleaned_text)
    
    ## Score the sentiment of the entire press release using SentimentIntensityAnalyzer and polarity_scores
    sia = SentimentIntensityAnalyzer()
    sentiment_scores = sia.polarity_scores(cleaned_text)

    ## Returns the length-four sentiment dictionary in any order
    return sentiment_scores

In [35]:
## Running it on a few examples:
func("Barack Obama spoke at the United Nations in 2014. He said the U.S. economy had improved.")
func("The U.S. News and Report (a special website) named Dartmouth College as one of the top universities. This made all of its undergraduates extremely happy, so they celebrated. People who don't go to Dartmouth College were very angry about this and called them up.")


{'neg': 0.0, 'neu': 0.744, 'pos': 0.256, 'compound': 0.4767}

{'neg': 0.078, 'neu': 0.651, 'pos': 0.271, 'compound': 0.835}

In [37]:
## your code here executing the function:
doj_sentiment_scores = [func(press_rel) for press_rel in doj_subset["contents"]]

#check:
#doj_sentiment_scores

C. Add the four sentiment scores to the `doj_subset` dataframe to create a new dataframe, `doj_subset_wscore`. Sort from highest to lowest `neg` score and print the `id`, `contents`, and `neg` columns for the two press releases with the highest `neg` scores.

Notes:

- Don't worry if your sentiment score differs slightly from our output on GitHub; differences in preprocessing can lead to different scores

In [45]:
## PART C:
doj_subset_wscore = doj_subset.copy()
doj_subset_wscore["sentiment_scores"] = doj_sentiment_scores
doj_subset_wscore["neg"] = doj_subset_wscore["sentiment_scores"].apply(lambda x: x["neg"])
doj_subset_wscore["neu"] = doj_subset_wscore["sentiment_scores"].apply(lambda x: x["neu"])
doj_subset_wscore["pos"] = doj_subset_wscore["sentiment_scores"].apply(lambda x: x["pos"])
doj_subset_wscore["compound"] = doj_subset_wscore["sentiment_scores"].apply(lambda x: x["compound"])

highest_neg_scores = doj_subset_wscore.sort_values(by="neg", ascending=False)
print(highest_neg_scores[["id", "contents", "neg"]].head(2))

# also without print, since the formatting is kinda ugly:
highest_neg_scores[["id", "contents", "neg"]].head(2)


Unnamed: 0,id,contents,neg
329,14-248,"The Department of Justice announced that this morning John W. Ng, 58, of Albuquerque, N.M., made his initial appearance in federal court on a criminal complaint charging him with a hate crime offense. This charge is related to anti-Semitic threats Ng made against a Jewish woman who owns and operates the Nosh Jewish Delicatessen and Bakery in Albuquerque. Ng was arrested by the FBI on March 7, 2014, based on a criminal complaint alleging that he interfered with the victim’s federally protected rights by threatening her and interfering with her business because of her religion. According to the criminal complaint, between Jan. 22, 2014, and Feb. 8, 2014, Ng allegedly posted threatening anti-Semitic notes on and in the vicinity of the victim’s business. A criminal complaint merely establishes probable cause, and Ng is presumed innocent unless proven guilty. If convicted on the offense charged in the criminal complaint, Ng faces a maximum statutory penalty of one year in prison. This matter was investigated by the Albuquerque Division of the FBI and is being prosecuted by Assistant U.S. Attorney Mark T. Baker of the U.S. Attorney’s Office for the District of New Mexico and Trial Attorney AeJean Cha of the U.S. Department of Justice’s Civil Rights Division.",0.323
572,13-312,"John Hall, 27, an Aryan Brotherhood member and inmate at the Federal Correctional Institution (FCI) in Seagoville, Texas, was sentenced today by U.S. District Judge Reed O’Connor after pleading guilty to violating the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act stemming from his assault of a fellow inmate, whom he believed to be gay, the Department of Justice announced. Hall assaulted his fellow inmate with a dangerous weapon, causing bodily injury to the victim on Dec. 20, 2011. Hall was sentenced to serve 71 months in prison to be served consecutively with the sentence he is currently serving. The assault occurred on Dec. 20, 2011, inside the FCI Seagoville when Hall targeted and attacked the victim, a fellow inmate, because he believed the victim was gay or involved in a sexual relationship with another male inmate. Hall repeatedly punched, kicked and stomped on the victim’s face with his shod feet, a dangerous weapon, while yelling a homophobic slur. The victim lost consciousness during the assault and suffered multiple lacerations to his face. The victim also sustained a fractured eye socket, lost a tooth, fractured other teeth and was treated at a hospital for the injuries he sustained during Hall’s unprovoked attack. Hall pleaded guilty to violating the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act on Nov. 8, 2012. “Brutality and violence based on sexual orientation has no place in a civilized society,” said Thomas E. Perez, Assistant Attorney General for the Civil Rights Division. “The Justice Department is committed to using all the tools in our law enforcement arsenal, including the Matthew Shepard and James Byrd Jr. 
Hate Crimes Prevention Act, to prosecute acts motivated by hate.” “This prosecution sends a clear message that this office, in partnership with attorneys in the department’s Civil Rights Division, will prioritize and aggressively prosecute hate crimes and others civil rights violations in North Texas,” said U.S. Attorney Sarah R. Saldaña of the Northern District of Texas. This case was investigated by the FBI Dallas Division. The case was prosecuted by Assistant U.S. Attorney Errin Martin and Trial Attorney Adriana Vieco of the Civil Rights Division.",0.303


D. With the dataframe from part C, find the mean compound sentiment score for each of the three topics in `topics_clean` using `groupby` and `agg`.

E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)
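
For intuition on the compound scale: as far as I can tell from the vaderSentiment source, the summed word valences `x` are squashed into the range (-1, 1) by `x / sqrt(x**2 + alpha)` with `alpha = 15`. The sketch below re-implements only that step under the name `normalize_sketch` (a hypothetical name), and the input sums are made-up numbers, not scores from these press releases:

```python
import math

def normalize_sketch(valence_sum, alpha=15):
    """Map an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# Larger absolute sums push the result toward -1 or +1
for toy_sum in [-20.0, -2.0, 0.0, 2.0, 20.0]:
    print(toy_sum, round(normalize_sketch(toy_sum), 4))
```

One consequence is that a long document with many negatively scored words can saturate near -1, which may be worth keeping in mind when comparing topic means.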


In [47]:
## Part D:
mean_sentiment_by_topic = doj_subset_wscore.groupby("topics_clean")["compound"].agg("mean")
mean_sentiment_by_topic


topics_clean
Civil Rights             -0.093931
Hate Crimes              -0.930943
Project Safe Childhood   -0.681391
Name: compound, dtype: float64

**Part E:**

We likely observe this variation because the topics differ in the language they use: Hate Crimes and Project Safe Childhood releases mostly describe violent or exploitative offenses, charges, and sentences in negatively scored words, while Civil Rights releases also cover settlements and agreements, where neutral or positive words are more common.

# 2. Topic modeling (25 points)

For this question, use the `doj_subset_wscore` data that is restricted to civil rights, hate crimes, and project safe childhood and with the sentiment scores added


## 2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in a single raw string in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem

- Returns a joined preprocessed string
    
B. Use `apply` or list comprehension to execute that function and create a new column in the data called `processed_text`
    
C. Print the `id`, `contents`, and `processed_text` columns for the following press releases:

id = 16-718 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)

id = 16-217 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)
    
**Resources**:

- Here are code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
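
A rough sketch of the order of the filtering steps in part A, written without nltk so it runs on its own. The stopword set is a tiny stand-in for the nltk English list plus the custom stopwords, the tokenizing regex is only meant to behave similarly to `wordpunct_tokenize`, and stemming is left out (a comment marks where it would go):

```python
import re

# Tiny stand-in stopword list (an assumption for illustration only)
toy_stopwords = {"the", "of", "with", "were", "and", "justice", "department"}

def preprocess_sketch(text):
    """Lowercase, tokenize, drop stopwords, keep alphabetic words of 4+ characters."""
    # Split into runs of word characters or runs of punctuation
    tokens = re.findall(r"\w+|[^\w\s]+", text.lower())
    kept = [tok for tok in tokens
            if tok not in toy_stopwords and tok.isalpha() and len(tok) >= 4]
    # A stemmer would be applied to each kept token here before joining
    return " ".join(kept)

result = preprocess_sketch("The Justice Department charged 2 officers with assault.")
print(result)
```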

In [49]:
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                       "trial", "assistance", "assist"]

In [51]:
## your code defining a text processing function
## PART A:
def preprocess(string):
    
    ## Convert word to lowercase:
    str_lower = string.lower()

    ## Remove stopwords, adding custom stopwords to the default list
    ## (a set makes each membership check fast):
    stopword_set = set(stopwords.words("english") + custom_doj_stopwords)
    str_token = [word for word in wordpunct_tokenize(str_lower) if word not in stopword_set]

    ## Keep only alphabetic words of 4 or more characters, then stem them with nltk's snowball stemmer:
    stemmer = SnowballStemmer("english")
    str_stem = [stemmer.stem(tok) for tok in str_token if tok.isalpha() and len(tok) > 3]

    ## Return a joined preprocessed string:
    return " ".join(str_stem)


In [53]:
## your code executing the function
## PART B:
doj_subset_wscore["processed_text"] = doj_subset_wscore["contents"].apply(preprocess)

#check:
# doj_subset_wscore[["contents", "processed_text"]]

In [55]:
## your code showing the examples
## PART C:
id_list = ["16-718", "16-217"]
example = doj_subset_wscore[doj_subset_wscore["id"].isin(id_list)][["id", "contents", "processed_text"]]
example

Unnamed: 0,id,contents,processed_text
6727,16-217,"The Justice Department has reached a comprehensive settlement agreement with the city of Miami and the Miami Police Department (MPD) resolving the Justice Department’s investigation of officer-involved shootings by MPD officers, announced Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department’s Civil Rights Division and U.S. Attorney Wifredo A. Ferrer of the Southern District of Florida. The settlement, which was approved by Miami’s city commission today and will go into effect when the agreement is signed by all parties, resolves claims stemming from the Justice Department’s investigation into officer-involved shootings by MPD officers, which was conducted under the Violent Crime Control and Law Enforcement Act of 1994. The investigation’s findings, issued in July 2013, identified a pattern or practice of excessive use of force through officer-involved shootings in violation of the Fourth Amendment of the Constitution. The city’s compliance with the settlement will be monitored by an independent reviewer, former Tampa, Florida, Police Chief Jane Castor. Under the settlement agreement, the city will implement comprehensive reforms to ensure constitutional policing and support public trust. The settlement agreement is designed to minimize officer-involved shootings and to more effectively and quickly investigate officer-involved shootings that do occur, through measures that include: “This settlement represents a renewed commitment by the city of Miami and Chief Rodolfo Llanes to provide constitutional policing for Miami residents and to protect public safety through sustainable reform,” said Principal Deputy Assistant Attorney General Gupta. 
“The agreement will help to strengthen the relationship between the MPD and the communities they serve by improving accountability for officers who fire their weapons unlawfully, and provides for community participation in the enforcement of this agreement.” “Today's agreement is the result of a joint effort between the Department of Justice and the City of Miami to ensure that the Miami Police Department continues its efforts to make our community safe while protecting the sacred Constitutional rights of all of our citizens,” said U.S. Attorney Ferrer. “Through oversight and communication, the agreement seeks to make permanent the positive changes that former Chief Orosa and Chief Llanes have made, and we applaud the City Commission’s vote.” The settlement agreement builds upon important reforms implemented by the city since the Justice Department issued its findings, including: The investigation was conducted by attorneys and staff from the Civil Rights Division’s Special Litigation Section and the Civil Division of the U. S. 
Attorney’s Office of the Southern District of Florida.",reach comprehens settlement agreement citi miami miami polic resolv offic involv shoot offic announc princip deputi general vanita gupta head wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem offic involv shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc offic involv shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim offic involv shoot effect quick investig offic involv shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc agreement today agreement result joint effort citi miami ensur miami polic continu effort make communiti safe protect sacr constitut citizen said ferrer oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss vote settlement agreement build upon import reform implement citi sinc issu find includ conduct attorney staff special litig section southern florida
11593,16-718,"In a nine-count indictment unsealed today, two Mississippi correctional officers were charged with beating an inmate and a third was charged with helping to cover it up. The indictment charged Lawardrick Marsher, 28, and Robert Sturdivant, 47, officers at Mississippi State Penitentiary, in Parchman, Mississippi, with a beating that included kicking, punching and throwing the victim to the ground. Marsher and Sturdivant were charged with violating the right of K.H., a convicted prisoner, to be free from cruel and unusual punishment. Sturdivant was also charged with failing to intervene while Marsher was punching and beating K.H. The indictment alleges that their actions involved the use of a dangerous weapon and resulted in bodily injury to the victim. A third officer, Deonte Pate, 23, was charged along with Marsher and Sturdivant for conspiring to cover up the beating. The indictment alleges that all three officers submitted false reports and that all three lied to the FBI. If convicted, Marsher and Sturdivant face a maximum sentence of 10 years in prison on the excessive force charges. Each of the three officers faces up to five years in prison on the conspiracy and false statement charges, and up to 20 years in prison on the false report charges. An indictment is merely an accusation, and the defendants are presumed innocent unless and until proven guilty. This case is being investigated by the FBI’s Jackson Division, with the cooperation of the Mississippi Department of Corrections. It is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorney Dana Mulhauser of the Civil Rights Division’s Criminal Section. 
Marsher Indictment",nine count indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit fals report three lie convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci fals statement charg year prison fals report charg indict mere accus defend presum innoc unless proven guilti investig jackson cooper mississippi correct prosecut robert coleman northern mississippi dana mulhaus crimin section marsher indict


## 2.2 Create a document-term matrix from the preprocessed press releases and explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: `id`, `compound` sentiment column you added, and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. Print the top 10 words for press releases in each of the three `topics_clean`

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.
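
One possible shape for this, sketched with the standard library on invented data: a helper (here called `top_words_sketch`, a hypothetical name) receives an already-subsetted collection of documents and only sums and ranks the word counts. The toy example uses quartile cutoffs instead of the 5% and 95% cutoffs because it has only four documents:

```python
from collections import Counter
from statistics import quantiles

# Toy documents: (compound score, word counts); values are invented
toy_docs = [
    (0.95, Counter({"agreement": 3, "settlement": 2})),
    (0.10, Counter({"court": 1, "agreement": 1})),
    (-0.50, Counter({"victim": 2, "charge": 1})),
    (-0.98, Counter({"assault": 4, "victim": 3})),
]

def top_words_sketch(docs, n=10):
    """Sum word counts over a subset of documents; return the n most frequent."""
    total = Counter()
    for _, counts in docs:
        total.update(counts)
    return total.most_common(n)

scores = [score for score, _ in toy_docs]
# method="inclusive" interpolates linearly between observed values
q25, _, q75 = quantiles(scores, n=4, method="inclusive")

# Subsetting happens outside the helper, so the helper is reusable
most_positive = [doc for doc in toy_docs if doc[0] > q75]
most_negative = [doc for doc in toy_docs if doc[0] < q25]

print(top_words_sketch(most_positive, n=2))
print(top_words_sketch(most_negative, n=2))
```

Returning the ranked list rather than printing inside the helper leaves the display choice to the caller; either design satisfies the "avoid duplicated code" goal.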


In [57]:
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase = True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), 
        columns=vectorizer.get_feature_names_out())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis = 1)
    return(dtm_dense_named_withid)

In [59]:
# PART A
strings_list = doj_subset_wscore["processed_text"]
updated_doj = doj_subset_wscore.rename(columns={"compound": "compound_sent"})
#to avoid duplicated "compound" or "id" columns once we create the dtm

data = updated_doj[["id", "compound_sent", "topics_clean"]]

dtm = create_dtm(strings_list, data)
dtm

Unnamed: 0,index,id,compound_sent,topics_clean,aaron,abandon,abbat,abbi,abbott,abdomen,...,zane,zealand,zealous,zeeman,zero,zionism,zobel,zone,zunggeemog,zwengel
0,77,17-1235,-0.9931,Civil Rights,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,155,15-1522,-0.9325,Project Safe Childhood,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,157,16-213,-0.7579,Project Safe Childhood,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,162,16-381,-0.9037,Project Safe Childhood,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,168,14-464,-0.9864,Hate Crimes,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
712,13002,09-368,-0.9689,Hate Crimes,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
713,13032,18-775,0.7003,Project Safe Childhood,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
714,13034,12-596,-0.9648,Project Safe Childhood,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
715,13068,18-359,-0.9798,Civil Rights,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [61]:
## Defining the function:

def get_topwords(df_sub):
    word_columns = df_sub.drop(columns=["index", "id", "compound_sent", "topics_clean"])
    word_sums = word_columns.sum().sort_values(ascending=False)
    print(word_sums.head(10))



    

In [63]:
## PART B:
cutoff = dtm["compound_sent"].quantile(0.95)
b_sub = dtm[dtm["compound_sent"] > cutoff]

print("Top 10 words for press releases with compound sentiment in the top 5%:")
get_topwords(b_sub)

## PART C:
cutoff2 = dtm["compound_sent"].quantile(0.05)
c_sub = dtm[dtm["compound_sent"] < cutoff2]

print("\nTop 10 words for press releases with compound sentiment in the bottom 5%:")
get_topwords(c_sub)

## PART D:
d_sub_cr = dtm[dtm["topics_clean"]=="Civil Rights"]
print("\nTop 10 words for Civil Rights press releases:")
get_topwords(d_sub_cr)

d_sub_hc = dtm[dtm["topics_clean"]=="Hate Crimes"]
print("\nTop 10 words for Hate Crimes press releases:")
get_topwords(d_sub_hc)

d_sub_safe = dtm[dtm["topics_clean"]=="Project Safe Childhood"]
print("\nTop 10 words for Project Safe Childhood press releases:")
get_topwords(d_sub_safe)


Top 10 words for press releases with compound sentiment in the top 5%:
agreement     171
enforc        130
state         110
disabl        107
ensur         107
communiti      99
student        87
settlement     86
servic         83
general        82
dtype: int64

Top 10 words for press releases with compound sentiment in the bottom 5%:
assault     191
crime       171
victim      164
hate        129
defend      126
offic       119
conspir     105
sentenc     104
charg        97
american     95
dtype: int64

Top 10 words for Civil Rights press releases:
offic        637
hous         633
discrimin    616
enforc       544
disabl       532
said         497
feder        479
violat       477
state        452
court        414
dtype: int64

Top 10 words for Hate Crimes press releases:
victim      591
crime       557
hate        524
defend      484
prosecut    478
charg       463
sentenc     455
american    451
feder       432
guilti      430
dtype: int64

Top 10 words for Project Safe Childhoo

## 2.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

**Hints and Resources**:

- Same topic modeling resources linked to above
- Make sure to use the `random_state` argument within the model so that the numbering of topics does not move around between runs of your code

In [65]:
# PART A:
#Step 1: re-tokenize our preprocessed text
text_raw_tokens = [wordpunct_tokenize(one_text) for one_text in 
                  doj_subset_wscore.processed_text]

#Step 2: use gensim to create the dictionary
text_raw_dict = corpora.Dictionary(text_raw_tokens)
raw_len = len(text_raw_dict)

#Step 3: Skip filtering out rare and common words

#Step 4: apply dictionary to tokenized text
corpus_fromdict = [text_raw_dict.doc2bow(one_text) 
                   for one_text in text_raw_tokens]

#Step 5: estimate the model
ldamod = gensim.models.ldamodel.LdaModel(corpus_fromdict, 
                                         num_topics = 3, 
                                         id2word=text_raw_dict, 
                                         passes=15, 
                                         random_state=100,
                                         alpha = 'auto',
                                         per_word_topics = True)

#check
#print(type(ldamod))
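
As a rough illustration of what Steps 2 and 4 produce, the sketch below builds a token-to-id mapping and bag-of-words pairs by hand. It is a stand-in written for this explanation, not gensim's implementation: the toy documents, `token2id`, and `to_bow` are made-up names, and gensim may assign ids in a different order.

```python
from collections import Counter

# Toy tokenized documents (made up for illustration; not from the DOJ data)
toy_docs = [["hate", "crime", "victim", "crime"],
            ["hous", "discrimin", "hous"]]

# Assign each distinct token an integer id in order of first appearance
# (assumption: gensim's own id assignment may differ; only the idea is the same)
token2id = {}
for doc in toy_docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in toy_docs]
print(toy_corpus)
```

Each document becomes a short list of (id, count) pairs, which is the shape of input the LDA model in Step 5 is fit on.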

In [66]:
# PART B:
topics = ldamod.print_topics(num_words = 15)
## printed in this format so it would be easier to compare the list of words in each topic
## with our earlier exercise using pre-determined topics
for topic_num, topic_words in topics:
    print(f"Topic {topic_num + 1}: ")
    print(" -" + topic_words.replace(" + ", "\n -").replace("*", ": "))
    print("\n")


Topic 1: 
 -0.019: "child"
 -0.013: "exploit"
 -0.011: "sexual"
 -0.010: "sentenc"
 -0.010: "prosecut"
 -0.010: "crimin"
 -0.009: "offic"
 -0.009: "victim"
 -0.009: "safe"
 -0.009: "childhood"
 -0.008: "project"
 -0.008: "pornographi"
 -0.008: "year"
 -0.008: "children"
 -0.008: "investig"


Topic 2: 
 -0.011: "hous"
 -0.011: "discrimin"
 -0.009: "disabl"
 -0.008: "enforc"
 -0.007: "agreement"
 -0.007: "state"
 -0.007: "said"
 -0.006: "court"
 -0.006: "violat"
 -0.006: "alleg"
 -0.006: "feder"
 -0.005: "requir"
 -0.005: "general"
 -0.005: "settlement"
 -0.005: "fair"


Topic 3: 
 -0.011: "victim"
 -0.011: "crime"
 -0.011: "defend"
 -0.010: "charg"
 -0.010: "hate"
 -0.009: "prosecut"
 -0.009: "said"
 -0.009: "sentenc"
 -0.008: "american"
 -0.008: "feder"
 -0.008: "guilti"
 -0.008: "indict"
 -0.008: "assault"
 -0.007: "african"
 -0.007: "year"




## 2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, use the argument `minimum_probability` = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the `doj_subset_wscore` dataframe.

B. Add the topic probabilities to the `doj_subset_wscore` dataframe as columns and create a column, `top_topic`, that assigns each document to its highest-probability topic (e.g., topic 1, 2, or 3)

C. For each of the manual labels in `topics_clean` (Hate Crimes, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crimes has 246 documents; if 123 of those documents are coded to topic_1, that would be 50%, and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)
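
To make the shape of the data in Parts A and B concrete, here is a minimal sketch using made-up probabilities. It assumes each document's result is a list of `(topic_id, probability)` pairs with 0-based topic ids; `toy_topic_probs` and `top_topic_label` are hypothetical names, not part of gensim.

```python
# Made-up output for two documents, in the (topic_id, probability) shape
toy_topic_probs = [
    [(0, 0.10), (1, 0.05), (2, 0.85)],
    [(0, 0.70), (1, 0.20), (2, 0.10)],
]

def top_topic_label(doc_probs):
    """Return a 1-based label such as 'Topic_3' for the most probable topic."""
    best_id, _ = max(doc_probs, key=lambda pair: pair[1])
    return f"Topic_{best_id + 1}"

toy_top_topics = [top_topic_label(doc) for doc in toy_topic_probs]
print(toy_top_topics)
```

Setting `minimum_probability` to 0 matters here because every document then has the same number of pairs, so the probabilities line up as columns.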


In [69]:
## PART A:
topic_probs = [ldamod.get_document_topics(item, minimum_probability=0) for item in corpus_fromdict]

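# Assert that there is one list of topic probabilities per row of the dataframe
assert len(topic_probs) == len(doj_subset_wscore)

# Displaying the comparison should also return True, which it does:
len(topic_probs) == len(doj_subset_wscore)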


True

In [71]:
## PART B:
# Convert topic_probs into a DataFrame
topic_df = pd.DataFrame([[prob for i, prob in doc] for doc in topic_probs],
                        columns=["Topic_1", "Topic_2", "Topic_3"])

# Join topic probabilities to main DataFrame
doj_subset_wscore = doj_subset_wscore.reset_index(drop=True).join(topic_df)

# Add column for most probable topic
doj_subset_wscore["top_topic"] = topic_df.idxmax(axis=1)
#doj_subset_wscore

In [73]:
## PART C
topic_distribution = pd.crosstab(index = doj_subset_wscore["topics_clean"], columns = doj_subset_wscore["top_topic"], normalize = "index")

# Multiply everything by 100 to get percentages
topic_distribution = topic_distribution.mul(100)

print(topic_distribution)


top_topic                  Topic_1    Topic_2    Topic_3
topics_clean                                            
Civil Rights             10.163934  69.508197  20.327869
Hate Crimes               9.349593   0.000000  90.650407
Project Safe Childhood  100.000000   0.000000   0.000000
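
For intuition about what row normalization does in the table above, the sketch below computes row percentages by hand on a few made-up (label, top topic) pairs. The pairs are invented for illustration and do not reproduce the real counts; `pairs`, `counts`, and `row_pct` are hypothetical names.

```python
from collections import Counter, defaultdict

# Made-up (manual label, top topic) pairs; not the real DOJ counts
pairs = [("Hate Crimes", "Topic_3"), ("Hate Crimes", "Topic_3"),
         ("Hate Crimes", "Topic_1"), ("Hate Crimes", "Topic_3"),
         ("Civil Rights", "Topic_2"), ("Civil Rights", "Topic_3")]

# Count documents in each (label, topic) cell
counts = defaultdict(Counter)
for label, topic in pairs:
    counts[label][topic] += 1

# Divide each cell by its row total and convert to a percentage
row_pct = {
    label: {topic: 100 * n / sum(row.values()) for topic, n in row.items()}
    for label, row in counts.items()
}
print(row_pct)
```

Because each row is divided by its own total, every manual label sums to 100% regardless of how many press releases it contains.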


In [75]:
## PART D
# To look at examples:
doj_subset_wscore.sample(5)

## We believe that Project Safe Childhood maps most cleanly onto an estimated topic: in the Part C table, 100% of its press releases have
## Topic_1 as their top topic, and Topic_1's leading words ("child", "exploit", "pornographi") are absent from the top 10 words for the other
## two categories. Civil Rights and Hate Crimes have more overlapping top words with each other than either does with Project Safe Childhood,
## so the LDA model cannot separate them as easily. The effect is strongest for Civil Rights: only about 70% of its releases land in Topic_2
## and about 20% land in Topic_3, the topic that captures about 91% of Hate Crimes releases.

Unnamed: 0,id,title,contents,date,topics_clean,components_clean,sentiment_scores,neg,neu,pos,compound,processed_text,Topic_1,Topic_2,Topic_3,top_topic
644,11-599,Two Arkansas Men Plead Guilty to Firebombing an Interracial Couple’s Home,"WASHINGTON – Two Arkansas men pleaded guilty today in U.S. District Court in Little Rock, Ark., to charges related to their involvement in the firebombing of the house of an interracial couple, the Justice Department announced. During the plea proceedings, Dustin Hammond of Sharp County, Ark., and Jake Murphy of Scott County, Ark., admitted that on the night of Jan.14, 2011, while at a party in Evening Shade, Ark., they and two other men devised a plan to firebomb an interracial couple’s home. Thereafter, all four co-defendants drove from Evening Shade to the victims’ house in Hardy, Ark. Upon arrival, the co-defendants constructed three Molotov cocktails and threw them at the house. The couple was also barraged with racial slurs and threatened with future violence if they did not leave Arkansas. The victims’ house sustained some damage during the incident. The victims were not injured. Hammond and Murphy pleaded guilty to one count of conspiracy against rights and one count of criminal violation of housing rights. “Firebombing a family’s home because of their race is a deplorable act of hate that will not be tolerated in our country,” said Thomas E. Perez, Assistant Attorney General of the Civil Rights Division. “The Justice Department will vigorously prosecute those who resort to violent acts motivated by hate.” Hammond and Murphy face a maximum penalty of 20 years in prison. Sentencing has been set for Aug. 12, 2011. The remaining co-defendants are scheduled to go to trial on May 31, 2011. This case was investigated by the Little Rock, Ark., Division of the FBI and is being prosecuted by Assistant U.S. 
Attorney John Ray White of the Eastern District of Arkansas and Trial Attorney Henry Leventis of the Civil Rights Division.",2011-05-10T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,"{'neg': 0.193, 'neu': 0.762, 'pos': 0.045, 'compound': -0.9906}",0.193,0.762,0.045,-0.9906,washington arkansa plead guilti today court littl rock charg relat involv firebomb hous interraci coupl announc plea proceed dustin hammond sharp counti jake murphi scott counti admit night parti even shade devis plan firebomb interraci coupl home thereaft four defend drove even shade victim hous hardi upon arriv defend construct three molotov cocktail threw hous coupl also barrag racial slur threaten futur violenc leav arkansa victim hous sustain damag incid victim injur hammond murphi plead guilti count conspiraci count crimin violat hous firebomb famili home race deplor hate toler countri said thoma perez general vigor prosecut resort violent act motiv hate hammond murphi face maximum penalti year prison sentenc remain defend schedul investig littl rock prosecut john white eastern arkansa henri leventi,0.000586,0.000469,0.998945,Topic_3
329,16-588,Justice Department Reaches Extension Agreement to Improve Georgia’s Developmental Disability and Mental Health System,"The Justice Department today announced that it has entered into an extension agreement with the state of Georgia to improve the quality and availability of services for people with developmental disabilities living in the community and to provide supported housing to individuals with significant mental illness who need it. The extension agreement builds upon a 2010 settlement agreement resolving a lawsuit brought by the department under the Americans with Disabilities Act and the Supreme Court’s Olmstead decision. The case involves Georgia’s provision of community services for individuals with mental illness and developmental disabilities. The department found in 2009 that Georgia was forcing people with disabilities into state hospitals instead of providing community-based services, in violation of the ADA’s integration requirements. In January, the department alleged that Georgia was not in compliance with the 2010 agreement, both regarding helping people move from institutions into their communities and regarding quality and oversight of community-based services. In light of the agreement and the significant commitments Georgia has made in it, the department has agreed to withdraw its motion to enforce that earlier agreement. The agreement will resolve the seven areas of alleged deficiency identified by the department in its January court filing. Under the agreement, Georgia will help people with developmental disabilities move from its state hospitals to integrated settings, consistent with their needs and preferences; will identify and address each individual’s needs in the community prior to discharge; and will monitor services and track outcomes for people after their discharge. 
For individuals who have moved from the state hospitals to the community, Georgia will monitor their health and wellbeing to ensure that emerging needs are met in a timely fashion. The extension agreement also calls for creation of at least 675 new Medicaid home- and community-based waiver slots as alternatives to placement in a facility. Georgia will provide clinical oversight and enhanced support coordination for individuals with developmental disabilities served by the state. The extension agreement enhances quality oversight, requiring specific actions in the event of serious incidents and corrective actions to address deficiencies. The state will collect and review data to identify any trends and develop quality improvement initiatives. In addition, Georgia will require providers to develop risk management and quality improvement programs. Under the agreement, at least 600 additional individuals with mental illness will receive bridge funding and at least 633 will receive housing vouchers under the Georgia housing voucher program. By June 30, 2018, the state is to have capacity to provide supported housing to any of the people with mental illness covered by the settlement agreement that need it. The extension agreement requires a referral procedure to supported housing for people who need it leaving the state hospitals, jails, prisons, emergency rooms or homeless shelters. “By strengthening the services provided by Georgia’s mental health system, this agreement will make a difference in the lives of Georgians with developmental disabilities or mental illness who wish to build lives in the community,” said Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department’s Civil Rights Division. 
“We look forward to working with Georgia to deliver on the promise of community integration enshrined in the ADA.” “During the past five years, the State of Georgia has significantly changed the way it provides services for people with disabilities,” said U.S Attorney John A. Horn of the Northern District of Georgia. “Recognizing that we have more work to do in this area, I am encouraged by Georgia’s willingness to continue to partner with the Department of Justice and stakeholders to improve the quality of services for people with developmental disabilities and significant mental illness in our community.” The Civil Rights Division enforces the ADA, which authorizes the Attorney General to investigate whether a state is serving individuals in the most integrated settings appropriate to their needs. Please visit www.justice.gov/crt to learn more about the Olmstead decision, the ADA and other laws enforced by the Justice Department’s Civil Rights Division. The agreement was secured due to the efforts of Civil Rights Division’s Special Litigation Section and the U.S. Attorney’s Office of the Northern District of Georgia. 
Georgia ADA Extension Agreement",2016-05-18T00:00:00-04:00,Civil Rights,"Civil Rights Division; Civil Rights - Special Litigation Section; USAO - Georgia, Northern","{'neg': 0.039, 'neu': 0.795, 'pos': 0.166, 'compound': 0.9977}",0.039,0.795,0.166,0.9977,today announc enter extens agreement state georgia improv qualiti avail servic peopl development disabl live communiti provid support hous individu signific mental ill need extens agreement build upon settlement agreement resolv lawsuit brought american disabl suprem court olmstead decis involv georgia provis communiti servic individu mental ill development disabl found georgia forc peopl disabl state hospit instead provid communiti base servic violat integr requir januari alleg georgia complianc agreement regard help peopl move institut communiti regard qualiti oversight communiti base servic light agreement signific commit georgia made agre withdraw motion enforc earlier agreement agreement resolv seven area alleg defici identifi januari court file agreement georgia help peopl development disabl move state hospit integr set consist need prefer identifi address individu need communiti prior discharg monitor servic track outcom peopl discharg individu move state hospit communiti georgia monitor health wellb ensur emerg need time fashion extens agreement also call creation least medicaid home communiti base waiver slot altern placement facil georgia provid clinic oversight enhanc support coordin individu development disabl serv state extens agreement enhanc qualiti oversight requir specif action event serious incid correct action address defici state collect review data identifi trend develop qualiti improv initi addit georgia requir provid develop risk manag qualiti improv program agreement least addit individu mental ill receiv bridg fund least receiv hous voucher georgia hous voucher program june state capac provid support hous peopl mental ill cover settlement agreement need extens agreement requir referr 
procedur support hous peopl need leav state hospit jail prison emerg room homeless shelter strengthen servic provid georgia mental health system agreement make differ live georgian development disabl mental ill wish build live communiti said princip deputi general vanita gupta head look forward work georgia deliv promis communiti integr enshrin past five year state georgia signific chang provid servic peopl disabl said john horn northern georgia recogn work area encourag georgia willing continu partner stakehold improv qualiti servic peopl development disabl signific mental ill communiti enforc author general investig whether state serv individu integr set appropri need pleas visit learn olmstead decis law enforc agreement secur effort special litig section northern georgia georgia extens agreement,0.000202,0.999625,0.000173,Topic_2
374,17-296,Justice Department Settles Immigration-Related Discrimination Claim Against Florida Pizza Delivery Chain,"The Justice Department reached a settlement agreement today with Pizzerias, LLC (Pizzerias), a pizza restaurant franchisee with 31 locations in Miami, Florida. The agreement resolves the department’s investigation into whether Pizzerias violated the Immigration and Nationality Act (INA) by discriminating against work-authorized immigrants when checking their work authorization documents. The department’s investigation concluded that Pizzerias routinely requested that lawful permanent residents produce a specific document – a Permanent Resident Card – to prove their work authorization, while not requesting a specific document from U.S. citizens. Lawful permanent residents often have the same work authorization documents available to them as U.S. citizens, and may choose acceptable documents other than a Permanent Resident Card to prove they are authorized to work. The antidiscrimination provision of the INA prohibits employers from subjecting employees to unnecessary documentary demands based on citizenship or national origin. Under the settlement, Pizzerias must pay a civil penalty of $140,000 to the United States, post notices informing workers about their rights under the INA’s antidiscrimination provision, train their human resources personnel, and be subject to departmental monitoring and reporting requirements. “The Justice Department is committed to ensuring the rights of lawful U.S. workers to be free from discriminatory barriers based on their citizenship, immigration status, or national origin,” said Acting Assistant Attorney General Tom Wheeler of the Civil Rights Division. 
“Pizzerias’ responsiveness throughout the course of the investigation assisted in a speedy resolution of this matter.” The division’s Immigrant and Employee Rights Section (IER), formerly known as the Office of Special Counsel for Immigration-Related Unfair Employment Practices, is responsible for enforcing the anti-discrimination provision of the INA. The statute prohibits, among other things, citizenship, immigration status, and national origin discrimination in hiring, firing, or recruitment or referral for a fee; unfair documentary practices; retaliation and intimidation. For more information about protections against employment discrimination under immigration laws, call IER’s worker hotline at 1-800-255-7688 (1-800-237-2515, TTY for hearing impaired); call IER’s employer hotline at 1-800-255-8155 (1-800-237-2515, TTY for hearing impaired); sign up for a free webinar; email IER@usdoj.gov; or visit IER’s English and Spanish websites. Applicants or employees who believe they were subjected to different documentary requirements based on their citizenship, immigration status, or national origin; or discrimination based on their citizenship, immigration status, or national origin in hiring, firing, or recruitment or referral, should contact IER’s worker hotline for assistance.",2017-03-21T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Immigrant and Employee Rights Section,"{'neg': 0.049, 'neu': 0.884, 'pos': 0.067, 'compound': 0.6597}",0.049,0.884,0.067,0.6597,reach settlement agreement today pizzeria pizzeria pizza restaur franchise locat miami florida agreement resolv whether pizzeria violat immigr nation discrimin work author immigr check work author document conclud pizzeria routin request law perman resid produc specif document perman resid card prove work author request specif document citizen law perman resid often work author document avail citizen choos accept document perman resid card prove author work antidiscrimin provis prohibit 
employ subject employe unnecessari documentari demand base citizenship nation origin settlement pizzeria must penalti unit state post notic inform worker antidiscrimin provis train human resourc personnel subject department monitor report requir commit ensur law worker free discriminatori barrier base citizenship immigr status nation origin said act general wheeler pizzeria respons throughout cours assist speedi resolut matter immigr employe section former known special counsel immigr relat unfair employ practic respons enforc anti discrimin provis statut prohibit among thing citizenship immigr status nation origin discrimin hire fire recruit referr unfair documentari practic retali intimid inform protect employ discrimin immigr law call worker hotlin hear impair call employ hotlin hear impair sign free webinar email usdoj visit english spanish websit applic employe believ subject differ documentari requir base citizenship immigr status nation origin discrimin base citizenship immigr status nation origin hire fire recruit referr contact worker hotlin,0.000337,0.999372,0.00029,Topic_2
111,09-081,Final Defendant Pleads Guilty to Anti-Obama Assaults,"WASHINGTON - Ralph Nicoletti pleaded guilty in Brooklyn, N.Y., federal court today before U.S. District Judge Carol B. Amon to committing three assaults targeting African-American residents in Staten Island, N.Y., on the night of President Barack Obama’s election victory. Nicoletti was the last of four defendants to plead guilty in the federal prosecution stemming from the attacks. The other three defendants – Bryan Garaventa, Michael Contreras and Brian Carranza – previously pleaded guilty to conspiring to commit the hate crime assaults and each face sentences of up to 10 years in prison. As part of his plea, Nicoletti has agreed to a sentence of 12 years, subject to the court’s approval. The guilty plea was announced by Loretta King, Acting Assistant Attorney General for the Department of Justice’s Civil Rights Division; Benton J. Campbell, U.S. Attorney for the Eastern District of New York; Joseph M. Demarest, Jr., Assistant Director-in-Charge, FBI, New York Field Office; and Raymond W. Kelly, Commissioner, New York City Police Department. At the plea proceeding, Nicoletti admitted that on Nov. 4, 2008, the night of the presidential election, the defendants decided to assault African-Americans in Staten Island after President Obama was declared the winner of the election. The defendants targeted African-Americans believing that they had voted for President Obama. Nicoletti drove the group to the Park Hill section of Staten Island, a predominantly African-American neighborhood, where they came upon an African-American teenager and assaulted him. Nicoletti struck the teenager with a metal pipe and Garaventa hit him with a collapsible police baton. Nicoletti then drove to the Port Richmond section of Staten Island, where the defendants assaulted an unidentified African-American man. During that assault, Garaventa tripped the victim and pushed him to the ground. 
The third assault was against an individual whom the defendants mistakenly believed was African-American. The plan was for Contreras to hit the victim with the police baton as the defendants drove by him. Instead, Nicoletti deliberately drove his car into the victim’s body. The victim was thrown onto the hood of the car and hit the front windshield, smashing it. The victim was seriously injured and remained in a coma for several weeks after the attack. ""This successful prosecution sends a clear message that racially-motivated acts of violence targeted at those who are exercising their right to vote are intolerable and will be aggressively investigated and prosecuted,"" said Acting Assistant Attorney General King. ""It is a tragedy that these crimes occur at all, but the Department of Justice will remain vigilant in our efforts to combat hate crimes, as they tear at the very fabric of our great nation."" ""The conduct of the defendants is shocking and deplorable,"" stated U.S. Attorney Campbell. ""On a night of historic significance, these four angry men assaulted their victims in an attempt to punish them for exercising a fundamental right of all Americans – the right to vote. Those who commit such crimes will be swiftly apprehended, prosecuted and punished. We are grateful for our partnership with the Department of Justice Civil Rights Division, the FBI and the New York City Police Department, which has been vital to the success of this case, and I particularly wish to thank the Richmond County District Attorney’s Office for its assistance in this matter."" ""The crimes these defendants have now admitted to were violent assaults that in one case nearly killed a man,"" said FBI Assistant Director-in-Charge Demarest of the New York Field Office. ""In attempting to intimidate voters, the defendants also violated the victims’ civil rights in a way that was an attack on the democratic process. 
These were serious crimes that prompted the serious response the FBI will always bring to bear in civil rights enforcement."" ""It was important to make certain that those who seriously injured individuals, based on their race, did not escape justice,"" said Police Commissioner Raymond W. Kelly. ""NYPD Inspector Michael J. Osgood, Commanding Officer of the NYPD Hate Crime Task Force, had the foresight to assign a special team on Election Night until 4 a.m. the next morning. As a result, his investigators were in position to respond quickly to the bias attacks as reports of them began to emerge. Detectives located an eyewitness to one of the attacks, and their subsequent distribution of flyers in the Rosebank area of Staten Island over three days led to the first major break in the case. I also want to thank the FBI agents who helped, and the federal prosecutors who succeeded in winning the guilty pleas."" The government’s case is being prosecuted by Assistant U.S. Attorneys Pamela K. Chen and Margo K. 
Brodie, and Department of Justice Special Litigation Counsel Kristy Parker.",2009-02-02T00:00:00-05:00,Hate Crimes,Civil Rights Division,"{'neg': 0.19, 'neu': 0.713, 'pos': 0.097, 'compound': -0.9964}",0.19,0.713,0.097,-0.9964,washington ralph nicoletti plead guilti brooklyn feder court today judg carol amon commit three assault target african american resid staten island night presid barack obama elect victori nicoletti last four defend plead guilti feder prosecut stem attack three defend bryan garaventa michael contrera brian carranza previous plead guilti conspir commit hate crime assault face sentenc year prison part plea nicoletti agre sentenc year subject court approv guilti plea announc loretta king act general benton campbel eastern york joseph demarest director charg york field raymond kelli commission york citi polic plea proceed nicoletti admit night presidenti elect defend decid assault african american staten island presid obama declar winner elect defend target african american believ vote presid obama nicoletti drove group park hill section staten island predomin african american neighborhood came upon african american teenag assault nicoletti struck teenag metal pipe garaventa collaps polic baton nicoletti drove port richmond section staten island defend assault unidentifi african american assault garaventa trip victim push ground third assault individu defend mistaken believ african american plan contrera victim polic baton defend drove instead nicoletti deliber drove victim bodi victim thrown onto hood front windshield smash victim serious injur remain coma sever week attack success prosecut send clear messag racial motiv act violenc target exercis right vote intoler aggress investig prosecut said act general king tragedi crime occur remain vigil effort combat hate crime tear fabric great nation conduct defend shock deplor state campbel night histor signific four angri assault victim attempt punish exercis fundament right american right vote 
commit crime swift apprehend prosecut punish grate partnership york citi polic vital success particular wish thank richmond counti matter crime defend admit violent assault near kill said director charg demarest york field attempt intimid voter defend also violat victim attack democrat process serious crime prompt serious respons alway bring bear enforc import make certain serious injur individu base race escap said polic commission raymond kelli nypd inspector michael osgood command offic nypd hate crime task forc foresight assign special team elect night next morn result investig posit respond quick bias attack report began emerg detect locat eyewit attack subsequ distribut flyer rosebank area staten island three day first major break also want thank agent help feder prosecutor succeed win guilti plea govern prosecut attorney pamela chen margo brodi special litig counsel kristi parker,0.103953,0.060304,0.835744,Topic_3
674,17-931,Two Texas Men Plead Guilty to Federal Hate Crime for Assaults Based on Victim’s Sexual Orientation,"Nigel Garrett, 21, and Cameron Ajiduah, 18, pleaded guilty today to assaulting men because of the victim’s sexual orientation, the Justice Department’s Civil Rights Division, the U.S. Attorney’s Office of the Eastern District of Texas, and the U.S. Bureau of Alcohol, Tobacco, Firearms and Explosives’ Dallas Division announced. According to the plea agreement signed by Garrett on January 19, 2017, defendants Garrett, Anthony Shelton and Chancler Encalade used Grindr, a social media dating platform for gay men, to arrange to meet the victim at the victim’s home. Upon entering the victim’s home, the defendants restrained the victim with tape, physically assaulted the victim, and made derogatory statements to the victim for being gay. The defendants brandished a firearm during the home invasion, and stole the victim’s property, including his motor vehicle. Included in a separate plea agreement signed by Ajiduah on February 7, 2017, defendants Ajiduah, Garrett, and Shelton used the same scheme on a different victim, including restraining the victim and covering his eyes with tape, verbally berating him for his sexual orientaion, and physically assaulting him. A federal grand jury previously returned an eighteen-count indictment against Ajiduah, Shelton, Garrett, and Chancler Encalade including charges of hate crimes, kidnappings, carjackings, and the use of firearms to commit violent crimes. The indictment also charged the defendants with conspiring to cause bodily injury because of the victim’s sexual orientation during four home invasions in Plano, Frisco, and Aubrey, Texas, between January 17 and February 7, 2017. “The Justice Department will not tolerate hate crimes against any individual based on sexual orientation,” said Acting Assistant Attorney General John Gore. 
“Hate crimes are violent crimes, but also attack the fundamental principles of the United States. The Justice Department will continue to aggressively investigate and prosecute hate crimes.” ""Garrett and Ajiduah invaded homes, robbed and assaulted their victims, and particularly horrendous, targeted their victims based on the victim’s sexual orientation,” said Acting U.S. Attorney Brit Featherston. “In response to such a hate crime, let it be known that law enforcement will leave no stone unturned to catch and prosecute the likes of these criminals to the fullest extent of the law."" Garrett and Ajiduah face a maximum statutory penalty of life in prison and a $250,000 fine for their guilty plea for the hate crime charge. The investigation is being conducted by the U.S. Bureau of Alcohol, Tobacco, Firearms and Explosives, the Plano Police Department, and the Frisco Police Department. The case is being prosecuted by Assistant U.S. Attorney Tracey Batson of the U.S. Attorney’s Office for the Eastern District of Texas and Trial Attorney Saeed Mody of the Civil Rights Division.",2017-08-22T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,"{'neg': 0.248, 'neu': 0.716, 'pos': 0.036, 'compound': -0.9982}",0.248,0.716,0.036,-0.9982,nigel garrett cameron ajiduah plead guilti today assault victim sexual orient eastern texa bureau alcohol tobacco firearm explos dalla announc accord plea agreement sign garrett januari defend garrett anthoni shelton chancler encalad use grindr social media date platform arrang meet victim victim home upon enter victim home defend restrain victim tape physic assault victim made derogatori statement victim defend brandish firearm home invas stole victim properti includ motor vehicl includ separ plea agreement sign ajiduah februari defend ajiduah garrett shelton use scheme differ victim includ restrain victim cover eye tape verbal berat sexual orientaion physic assault feder grand juri previous return eighteen 
count indict ajiduah shelton garrett chancler encalad includ charg hate crime kidnap carjack firearm commit violent crime indict also charg defend conspir caus bodili injuri victim sexual orient four home invas plano frisco aubrey texa januari februari toler hate crime individu base sexual orient said act general john gore hate crime violent crime also attack fundament principl unit state continu aggress investig prosecut hate crime garrett ajiduah invad home rob assault victim particular horrend target victim base victim sexual orient said act brit featherston respons hate crime known enforc leav stone unturn catch prosecut like crimin fullest extent garrett ajiduah face maximum statutori penalti life prison fine guilti plea hate crime charg conduct bureau alcohol tobacco firearm explos plano polic frisco polic prosecut tracey batson eastern texa saeed modi,0.00031,0.000248,0.999443,Topic_3


# 3. Extend the analysis from unigrams to bigrams (10 points)

In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of words).

A. Using the `doj_subset_wscore` data and the `processed_text` column (so the words after stemming/other preprocessing), create a column in the data called `processed_text_bigrams` that combines each consecutive pair of words into a bigram separated by an underscore. E.g.:

"depart reach settlem" would become "depart_reach reach_settlem"

Do this by writing a function `create_bigram_onedoc` that takes in a single `processed_text` string and returns a string with its bigrams structured similarly to the above example.
 
**Hint**: there are many ways to solve but `zip` may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list

B. Print the `id`, `processed_text`, and `processed_text_bigrams` columns for the press release with id = 16-217
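The `zip` hint in part A can be sketched as follows. This is a minimal illustration, not the required solution; the function name `bigrams_with_zip` is hypothetical and is used only to avoid clashing with `create_bigram_onedoc`.

```python
def bigrams_with_zip(text):
    """Join each consecutive pair of whitespace-separated tokens with an underscore."""
    words = text.split()
    # zip(words, words[1:]) pairs every token with the token that follows it.
    # zip stops at the shorter input, so the final token never starts a pair.
    return " ".join(f"{first}_{second}" for first, second in zip(words, words[1:]))


print(bigrams_with_zip("depart reach settlem"))  # depart_reach reach_settlem
```

A document with fewer than two tokens has no consecutive pairs, so the function returns an empty string in that case.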

In [77]:
## PART A:
def create_bigram_onedoc(proc_text_string):
    words = proc_text_string.split()
    bigrams = [f"{words[i]}_{words[i+1]}" for i in range(len(words)-1)]
    return " ".join(bigrams)  # one string of underscore-joined bigrams, separated by spaces

doj_subset_wscore["processed_text_bigrams"] = doj_subset_wscore["processed_text"].apply(create_bigram_onedoc)

# check:
#doj_subset_wscore.head()

In [79]:
## PART B:
example = doj_subset_wscore[doj_subset_wscore["id"]=="16-217"]
example[["id", "processed_text", "processed_text_bigrams"]]

Unnamed: 0,id,processed_text,processed_text_bigrams
313,16-217,reach comprehens settlement agreement citi miami miami polic resolv offic involv shoot offic announc princip deputi general vanita gupta head wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem offic involv shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc offic involv shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim offic involv shoot effect quick investig offic involv shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc agreement today agreement result joint effort citi miami ensur miami polic continu effort make communiti safe protect sacr constitut citizen said ferrer oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss vote settlement agreement build upon import reform implement citi sinc issu find includ conduct attorney staff special litig section southern florida,reach_comprehens comprehens_settlement settlement_agreement agreement_citi citi_miami miami_miami miami_polic polic_resolv resolv_offic offic_involv involv_shoot shoot_offic offic_announc announc_princip princip_deputi deputi_general general_vanita vanita_gupta gupta_head head_wifredo wifredo_ferrer ferrer_southern southern_florida florida_settlement settlement_approv approv_miami miami_citi citi_commiss commiss_today today_effect effect_agreement agreement_sign sign_parti parti_resolv resolv_claim claim_stem stem_offic offic_involv 
involv_shoot shoot_offic offic_conduct conduct_violent violent_crime crime_control control_enforc enforc_find find_issu issu_juli juli_identifi identifi_pattern pattern_practic practic_excess excess_forc forc_offic offic_involv involv_shoot shoot_violat violat_fourth fourth_amend amend_constitut constitut_citi citi_complianc complianc_settlement settlement_monitor monitor_independ independ_review review_former former_tampa tampa_florida florida_polic polic_chief chief_jane jane_castor castor_settlement settlement_agreement agreement_citi citi_implement implement_comprehens comprehens_reform reform_ensur ensur_constitut constitut_polic polic_support support_public public_trust trust_settlement settlement_agreement agreement_design design_minim minim_offic offic_involv involv_shoot shoot_effect effect_quick quick_investig investig_offic offic_involv involv_shoot shoot_occur occur_measur measur_includ includ_settlement settlement_repres repres_renew renew_commit commit_citi citi_miami miami_chief chief_rodolfo rodolfo_llane llane_provid provid_constitut constitut_polic polic_miami miami_resid resid_protect protect_public public_safeti safeti_sustain sustain_reform reform_said said_princip princip_deputi deputi_general general_gupta gupta_agreement agreement_help help_strengthen strengthen_relationship relationship_communiti communiti_serv serv_improv improv_account account_offic offic_fire fire_weapon weapon_unlaw unlaw_provid provid_communiti communiti_particip particip_enforc enforc_agreement agreement_today today_agreement agreement_result result_joint joint_effort effort_citi citi_miami miami_ensur ensur_miami miami_polic polic_continu continu_effort effort_make make_communiti communiti_safe safe_protect protect_sacr sacr_constitut constitut_citizen citizen_said said_ferrer ferrer_oversight oversight_communic communic_agreement agreement_seek seek_make make_perman perman_posit posit_chang chang_former former_chief chief_orosa orosa_chief chief_llane llane_made 
made_applaud applaud_citi citi_commiss commiss_vote vote_settlement settlement_agreement agreement_build build_upon upon_import import_reform reform_implement implement_citi citi_sinc sinc_issu issu_find find_includ includ_conduct conduct_attorney attorney_staff staff_special special_litig litig_section section_southern southern_florida


C. Use the create_dtm function and the `processed_text_bigrams` column to create a document-term matrix (`dtm_bigram`) with these bigrams. Keep the following three columns in the data: `id`, `topics_clean`, and `compound` 

D. Print the (1) dimensions of the `dtm` matrix from question 2.2  and (2) the dimensions of the `dtm_bigram` matrix. Comment on why the bigram matrix has more dimensions than the unigram matrix 

In [81]:
## PART C:
strings_list_bigrams = doj_subset_wscore["processed_text_bigrams"]

updated_doj2 = doj_subset_wscore.rename(columns={"compound": "compound_sent"})
#to avoid duplicated "compound" or "id" columns once we create the dtm
data_bigram = updated_doj2[["id", "topics_clean", "compound_sent"]]

dtm_bigram = create_dtm(strings_list_bigrams, data_bigram)

#check:
#dtm_bigram

In [83]:
dtm.shape
dtm_bigram.shape

(717, 6870)

(717, 72721)

**Part D:** Both matrices have the same number of rows (717 documents), but the bigram matrix has far more columns (72,721 vs. 6,870). In a DTM, each column represents one unique term. Because the same word can be preceded or followed by many different words, there are many more unique consecutive word pairs than unique words, so a bigram representation inherently produces more terms.
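A small sketch of the same effect on a toy corpus, assuming five made-up stemmed "documents" (these strings are illustrative and do not come from the DOJ data). It counts the unique unigrams and unique bigrams, which correspond to the number of columns each DTM would have.

```python
# Toy corpus of already-stemmed text (hypothetical, for illustration only)
docs = [
    "offic involv shoot offic conduct",
    "polic offic fire weapon",
    "offic fire polic conduct review",
    "review polic conduct shoot",
    "weapon fire review offic shoot",
]

unigram_vocab = set()
bigram_vocab = set()
for doc in docs:
    words = doc.split()
    unigram_vocab.update(words)                 # each unique word is one DTM column
    bigram_vocab.update(zip(words, words[1:]))  # each unique consecutive pair is one column

n_unigrams = len(unigram_vocab)
n_bigrams = len(bigram_vocab)
print(n_unigrams, n_bigrams)
```

Here the eight distinct words recombine into sixteen distinct consecutive pairs, so the bigram vocabulary is twice the size of the unigram vocabulary even in this tiny example.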

E. Find and print the 10 most prevalent bigrams for each of the three topics_clean using the `get_topwords` function from 2.2

In [85]:
# your code here


print("Civil Rights: ")
get_topwords(dtm_bigram[dtm_bigram["topics_clean"] == "Civil Rights"])
print("\n")

print("Hate Crimes: ")
get_topwords(dtm_bigram[dtm_bigram["topics_clean"] == "Hate Crimes"])
print("\n")

print("Project Safe Childhood: ")
get_topwords(dtm_bigram[dtm_bigram["topics_clean"] == "Project Safe Childhood"])
print("\n")

Civil Rights: 
fair_hous         231
deputi_general    221
princip_deputi    221
vanita_gupta      202
gupta_head        200
general_vanita    199
said_princip      186
unit_state        156
nation_origin     143
consent_decre     128
dtype: int64


Hate Crimes: 
hate_crime          379
african_american    367
plead_guilti        275
year_prison         161
special_agent       124
racial_motiv        114
thoma_perez         111
grand_juri          101
perez_general        95
said_thoma           91
dtype: int64


Project Safe Childhood: 
safe_childhood       474
project_safe         472
child_pornographi    450
child_exploit        281
sexual_exploit       223
exploit_children     200
plead_guilti         197
exploit_obscen       176
obscen_section       175
child_sexual         174
dtype: int64




# 4. Optional extra credit (2 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach this is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings.

Find the top two pairs (so four press releases total)-- do they seem like different stages of the same crime or just press releases covering similar crimes?

In [66]:
## checking the og data:
#doj.head()

#checking the press releases that we want to work with (from question 1):
#doj_subset

In [60]:
## indictment --> charging someone with a crime
## conviction --> conviction after settlement/trial
## sentencing --> x years of prison/supervised release post conviction

## PAIRS OF PRESS RELEASES where one is from one stage and another is from a different stage

## One of Google's suggestions for finding pairwise similarity is TF-IDF with cosine similarity, which measures overlap in word importance between documents
## This also comes from the sklearn package, which we discussed in class

## https://goodboychan.github.io/python/datacamp/natural_language_processing/2020/07/17/04-TF-IDF-and-similarity-scores.html 
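To make the cosine-similarity idea concrete, here is a minimal pure-Python sketch. It is a simplification: it compares raw term counts rather than TF-IDF weights, and the three short strings are made up for illustration. The function name `cosine_sim_counts` is hypothetical.

```python
import math
from collections import Counter


def cosine_sim_counts(doc_a, doc_b):
    """Cosine similarity between the raw term-count vectors of two documents."""
    counts_a, counts_b = Counter(doc_a.split()), Counter(doc_b.split())
    # Only terms present in both documents contribute to the dot product
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[term] * counts_b[term] for term in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


plea = "man pleaded guilty today to conspiracy"
sentence = "man sentenced today for conspiracy"
unrelated = "city reaches housing settlement"

print(round(cosine_sim_counts(plea, sentence), 3))
print(cosine_sim_counts(plea, unrelated))
```

Documents that share many terms score close to 1, and documents with no terms in common score 0. TF-IDF weighting additionally down-weights terms that appear in most documents, so shared boilerplate counts for less than shared names or places.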

In [89]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

## Get cleaned (but not overly processed) text from our data frame
texts = doj_subset["contents"]  # Google suggests not using our overly processed text because we want to keep entities, and TfidfVectorizer will tokenize for us

## TF-IDF vectorization
## (term frequency-inverse document frequency = converts text into numeric vectors representing the importance of words within each document)
vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = vectorizer.fit_transform(texts)

## Computing cosine similarity (measure similarity between 2 vectors)
cosine_sim = cosine_similarity(tfidf_matrix)

## Remove self comparisons
np.fill_diagonal(cosine_sim, 0)


## need only unique pairs since cosine similarity is symmetric (so avoid AB then BA)
pairs = []
num_doc = cosine_sim.shape[0]  # number of documents in the data set

## indexing into the cosine similarity matrix
for i in range(num_doc):
    for j in range(i+1, num_doc):
        pairs.append((cosine_sim[i,j], i, j))


top_pairs = sorted(pairs, reverse=True)[:2]

for sim_score, idx1, idx2 in top_pairs:
    
    print(f"Similarity Score: {cosine_sim[idx1, idx2]:.3f}\n")
    print(f"Index 1: {idx1}")
    print(f"Index 2: {idx2}")
    print("\n")

    doc1 = texts.iloc[idx1]
    doc2 = texts.iloc[idx2]

    df = pd.DataFrame({
        f"Document {idx1}": [doc1],
        f"Document {idx2}": [doc2]
    })

    display(df) #so you can read them side by side


## major difference seems to be pleading guilty vs. being sentenced --> otherwise similar details (the plea release just omits the plea date)

Similarity Score: 0.986

Index 1: 203
Index 2: 686




Unnamed: 0,Document 203,Document 686
0,"A Church Hill, Maryland, resident was sentenced today to 20 years in prison to be followed by a lifetime term of supervised release for enticement of a minor to engage in sexual activity and attempting to transfer obscene materials to a minor, announced Acting Assistant Attorney General Kenneth A. Blanco of the Justice Department’s Criminal Division and Acting U.S. Attorney Benjamin G. Greenberg of the Southern District of Florida. Lee Robert Moore, 38, pleaded guilty March 1, 2017, before U.S. District Judge Daniel T. K. Hurley of the Southern District of Florida. Moore was employed by the U.S. Secret Service-Uniformed Division and was assigned to the White House at the time of his arrest on Nov. 9, 2015, and has remained in custody since that time. Moore has since been terminated from his Secret Service position. According to admissions made in connection with his plea, Moore maintained a profile on the social media application “Meet24,” which provides a mobile-based platform for exchanging digital images, as well as voice and text messages. Delaware State Police Detectives with the Delaware Child Predator Task Force created a profile on this site, posing as a 14-year-old girl, with whom Moore engaged in a number of online chat sessions, via the “Meet24” and “Kik” mobile apps over a two-month period, including while Moore was at work. A number of the online chats between Moore and the undercover officers posing as a female minor were sexual in nature and, on several occasions, Moore sent pictures of himself, including one sexually explicit image. According to the plea documents, after his arrest, law enforcement discovered that Moore had communicated with a minor in Florida. Moore admitted that in those communications, he sent sexually explicit images of himself and enticed the minor to send sexually explicit photos of herself as well. Moore engaged in the same type of behavior with a 14-year-old girl in Texas and another 17-year-old girl in Missouri. 
Moore requested that his federal charges in Delaware be transferred to the Southern District of Florida so that he could plead guilty to both charges at one time. U.S. Immigration and Customs Enforcement’s Homeland Security Investigations and the Delaware Child Predator Task Force investigated the case. Trial Attorney Austin M. Berry of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Corey Steinberg of the Southern District of Florida prosecuted the case, with assistance from the U.S. Attorney’s Office for the District of Delaware. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse, launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit http://www.justice.gov/psc.","A Church Hill, Maryland resident pleaded guilty today in federal court to one count of enticement of a minor to engage in sexual activity and one count of attempting to transfer obscene materials to a minor, announced Acting Assistant Attorney General Kenneth A. Blanco of the Justice Department’s Criminal Division and U.S. Attorney Wifredo A. Ferrer of the Southern District of Florida. Lee Robert Moore, 38, pleaded guilty today before U.S. District Judge Daniel T. K. Hurley of the Southern District of Florida. Moore was employed by the U.S. Secret Service-Uniformed Division and was assigned to the White House at the time of his arrest on Nov. 9, 2015, and has remained in custody since that time. Moore has since been terminated from his Secret Service position. 
According to admissions made in connection with his plea, Moore maintained a profile on the social media application “Meet24,” which provides a mobile-based platform for exchanging digital images, as well as voice and text messages. Delaware State Police Detectives with the Delaware Child Predator Task Force created a profile on this site, posing as a 14-year-old girl, with whom Moore engaged in a number of online chat sessions, via the “Meet24” and “Kik” mobile apps over a two-month period, including while Moore was at work. A number of the online chats between Moore and the undercover officers posing as a female minor were sexual in nature and, on several occasions, Moore sent pictures of himself, including one sexually explicit image. According to the plea documents, after his arrest, law enforcement discovered that Moore had communicated with a minor in Florida. Moore admitted that in those communications, he sent sexually explicit images of himself and enticed the minor to send sexually explicit photos of herself as well. Moore engaged in the same type of behavior with a 14-year-old girl in Texas and another 17-year-old girl in Missouri. Moore requested that his federal charges in Delaware be transferred to the Southern District of Florida so that he could plead guilty to both charges at one time. U.S. Immigration and Customs Enforcement’s Homeland Security Investigations and the Delaware Child Predator Task Force investigated the case. Trial Attorney Austin M. Berry of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Corey Steinberg of the Southern District of Florida are prosecuting the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse, launched in May 2006 by the Department of Justice. Led by U.S. 
Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit http://www.justice.gov/psc."


Similarity Score: 0.958

Index 1: 462
Index 2: 463




Unnamed: 0,Document 462,Document 463
0,"A Weed, California man pleaded guilty today to conspiracy to produce child pornography for his participation in a website that was operated for the purpose of coercing and enticing minors as young as eight years old to engage in sexually explicit conduct on web camera. Acting Assistant Attorney General Kenneth A. Blanco of the Justice Department’s Criminal Division; U.S. Attorney Dana J. Boente of the Eastern District of Virginia; and Assistant Director Stephen E. Richardson of the FBI’s Criminal Investigative Divisionmade the announcement. Jeffery Van Dyke, 46, was charged on April 4, 2016, and pleaded guilty before U.S. District Judge T.S. Ellis III of the Eastern District of Virginia. Sentencing is set for June 9. According to admissions made in connection with the plea agreement, members of the conspiracy created false profiles on social networking sites popular with children posing as young teenagers to lure children to two websites they controlled. Once on the conspirators’ websites, Van Dyke admitted that members of the conspiracy showed the children pre-recorded videos of prior minor victims, often engaging in sexually explicit conduct, to make the new victims think that they were chatting with another minor. Van Dyke further admitted that conspirators used these videos to coerce and entice children to engage in sexually explicit activity on their own web cameras, which could be viewed live by other members without the victim’s knowledge and which the website automatically recorded and made available for download later. Van Dyke admitted that he linked minors to one of the websites and chatted with them there in furtherance of the conspiracy. The defendant also admitted that one of the websites ranked the efforts of the members to successfully coerce and entice children to engage in sexually explicit conduct on live web camera. Law enforcement agencies have disabled both websites. 
VCACS special agents led the investigation with the assistance of the FBI’s Operation Rescue Me and the FBI’s Digital Analysis and Research Center and the Office of Victim Assistance. The South Africa Police Service, Family Violence, Child Protection and Sexual Offenses, Gauteng; Royal Canadian Mounted Police, National Child Exploitation Coordination Centre; the Dutch Police Service Agency, KLPD; and the Australian Federal Police, Child Protection Operations, Sydney were active partners in Operation Subterfuge, a multinational investigation coordinated by members of the FBI’s Violent Crimes Against Children International Task Force. Trial Attorney Lauren Britsch of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Whitney Russell of the Eastern District of Virginia prosecuted the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.","A Weed, California man was sentenced to 218 months in prison for conspiracy to produce child pornography based on his participation in a website that was operated for the purpose of coercing and enticing minors as young as eight years old to engage in sexually explicit conduct on web camera. Acting Assistant Attorney General Kenneth A. Blanco of the Justice Department’s Criminal Division; U.S. Attorney Dana J. Boente of the Eastern District of Virginia; and Section Chief John J. Brosnan of the FBI’s Violent Crimes Against Children Section (VCACS) made the announcement. 
Jeffery Van Dyke, 46, was charged on April 4, 2016, and pleaded guilty before U.S. District Judge T.S. Ellis III of the Eastern District of Virginia on March 10, 2017. According to admissions made in connection with the plea agreement, members of the conspiracy created false profiles on social networking sites popular with children posing as young teenagers to lure children to two websites they controlled. Once on the conspirators’ websites, Van Dyke admitted that members of the conspiracy showed the children pre-recorded videos of prior minor victims, often engaging in sexually explicit conduct, to make the new victims think that they were chatting with another minor. Van Dyke further admitted that conspirators used these videos to coerce and entice children to engage in sexually explicit activity on their own web cameras, which could be viewed live by other members without the victim’s knowledge and which the website automatically recorded and made available for download later. Van Dyke admitted that he linked minors to one of the websites and chatted with them there in furtherance of the conspiracy. The defendant also admitted that one of the websites ranked the efforts of the members to successfully coerce and entice children to engage in sexually explicit conduct on live web camera. Law enforcement agencies have disabled both websites. Van Dyke’s sentence will be followed by 15 years of supervised release and he was further ordered to pay $15, 215 in restitution. VCACS special agents led the investigation with the assistance of the FBI’s Operation Rescue Me and the FBI’s Digital Analysis and Research Center and the Office of Victim Assistance. 
The South Africa Police Service, Family Violence, Child Protection and Sexual Offenses, Gauteng; Royal Canadian Mounted Police, National Child Exploitation Coordination Centre; the Dutch Police Service Agency, KLPD; and the Australian Federal Police, Child Protection Operations, Sydney were active partners in Operation Subterfuge, a multinational investigation coordinated by members of the FBI’s Violent Crimes Against Children International Task Force. Trial Attorney Lauren Britsch of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Whitney Russell of the Eastern District of Virginia prosecuted the case. The Criminal Division’s Office of International Affairs provided substantial assistance in this matter. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc."


**Do they seem like different stages of the same crime or just press releases covering similar crimes?** In both pairs, the two press releases describe the same case (pair 1: the Church Hill, Maryland resident, Lee Robert Moore; pair 2: the Weed, California man, Jeffery Van Dyke), but at different stages. In each pair, one release covers the guilty plea and the other covers the sentencing, including how long the defendant will serve in prison.
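As a side note on the pair-ranking step in the TF-IDF cell, the nested loop over `i < j` can also be written with NumPy's upper-triangle indices. This is a sketch on a made-up 3x3 similarity matrix; `top_k_pairs` is a hypothetical helper name, and ties may come out in a different order than with `sorted(pairs, reverse=True)`.

```python
import numpy as np


def top_k_pairs(sim_matrix, k=2):
    """Return the k highest-scoring (score, i, j) pairs with i < j."""
    sim = np.asarray(sim_matrix, dtype=float)
    # k=1 starts one step above the diagonal, so self-comparisons are excluded
    # and each unordered pair appears exactly once.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    scores = sim[rows, cols]
    order = np.argsort(scores)[::-1][:k]  # positions of the k largest scores
    return [(float(scores[n]), int(rows[n]), int(cols[n])) for n in order]


# Toy symmetric similarity matrix (illustrative values only)
toy = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])
print(top_k_pairs(toy, k=2))
```

Because only the strict upper triangle is read, zeroing the diagonal beforehand is not needed with this approach.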