# Problem set 4: Text analysis of DOJ press releases

**Total points (without extra credit)**: 52 

- For background:

    - DOJ is the federal law enforcement agency responsible for federal prosecutions; this contrasts with the local prosecutions in the Cook County dataset we analyzed earlier. Here's a short explainer on which crimes get prosecuted federally versus locally: https://www.criminaldefenselawyer.com/resources/criminal-defense/federal-crime/state-vs-federal-crimes.htm#:~:text=Federal%20criminal%20prosecutions%20are%20handled,of%20state%20and%20local%20law. 
    - Here's the Kaggle that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 
    - Here's the code the dataset creator used to scrape those press releases here if you're interested: https://github.com/jbencina/dojreleases

## 0.0 Import packages

In [1]:
## helpful packages
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import random
import re
import string

## nltk imports
import nltk
### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
### nltk.download('averaged_perceptron_tagger')
### nltk.download('stopwords')
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

## spacy imports
import spacy
### uncomment and run the below line if you haven't loaded the en_core_web_sm library yet
### ! python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts and wide-format text
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
pd.set_option('display.max_colwidth', None)

## 0.1 Load and clean text data

In [2]:
## first, unzip the file pset4_inputdata.zip 
## then, run this code to load the unzipped json file and convert to a dataframe
## (may need to change the pathname depending on where you store stuff)
## and convert some of the attributes from lists to values
doj = pd.read_json("combined.json", lines = True)

## due to json, topics are in a list so remove them and concatenate with ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 
           'components_clean']].copy()

doj.head()

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23, who was convicted in 2013 of attempting to use a weapon of mass destruction (explosives) in connection with a plot to detonate a vehicle bomb at an annual Christmas tree lighting ceremony in Portland, was sentenced today to serve 30 years in prison, followed by a lifetime term of supervised release. Mohamud, a naturalized U.S. citizen from Somalia and former resident of Corvallis, Oregon, was arrested on Nov. 26, 2010, after he attempted to detonate what he believed to be an explosives-laden van that was parked near the tree lighting ceremony in Portland. The arrest was the culmination of a long-term undercover operation, during which Mohamud was monitored closely for months as his bomb plot developed. The device was in fact inert, and the public was never in danger from the device. At sentencing, United States District Court Judge Garr M. King, who presided over Mohamed’s 14-day trial, said “the intended crime was horrific,” and that the defendant, even though he was presented with options by undercover FBI employees, “never once expressed a change of heart.” King further noted that the Christmas tree ceremony was attended by up to 10,000 people, and that the defendant “wanted everyone to leave either dead or injured.” King said his sentence was necessary in view of the seriousness of the crime and to serve as deterrence to others who might consider similar acts. “With today’s sentencing, Mohamed Osman Mohamud is being held accountable for his attempted use of what he believed to be a massive bomb to attack innocent civilians attending a public Christmas tree lighting ceremony in Portland,” said John P. Carlin, Assistant Attorney General for National Security. “The evidence clearly indicated that Mohamud was intent on killing as many people as possible with his attack. Fortunately, law enforcement was able to identify him as a threat, insert themselves in the place of a terrorist that Mohamud was trying to contact, and thwart Mohamud’s efforts to conduct an attack on our soil. This case highlights how the use of undercover operations against would-be terrorists allows us to engage and disrupt those who wish to commit horrific acts of violence against the innocent public. The many agents, analysts, and prosecutors who have worked on this case deserve great credit for their roles in protecting Portland from the threat posed by this defendant and ensuring that he was brought to justice.” “This trial provided a rare glimpse into the techniques Al Qaeda employs to radicalize home-grown extremists,” said Amanda Marshall, U.S. Attorney for the District of Oregon. “With the sentencing today, the court has held this defendant accountable. I thank the dedicated professionals in the law enforcement and intelligence communities who were responsible for this successful outcome. I look forward to our continued work with Muslim communities in Oregon who are committed to ensuring that all young people are safe from extremists who seek to radicalize others to engage in violence.” According to the trial evidence, in February 2009, Mohamud began communicating via e-mail with Samir Khan, a now-deceased al Qaeda terrorist who published Jihad Recollections, an online magazine that advocated violent jihad, and who also published Inspire, the official magazine of al-Qaeda in the Arabian Peninsula. Between February and August 2009, Mohamed exchanged approximately 150 emails with Khan. Mohamud wrote several articles for Jihad Recollections that were published under assumed names. In August 2009, Mohamud was in email contact with Amro Al-Ali, a Saudi national who was in Yemen at the time and is today in custody in Saudi Arabia for terrorism offenses. Al-Ali sent Mohamud detailed e-mails designed to facilitate Mohamud’s travel to Yemen to train for violent jihad. In December 2009, while Al-Ali was in the northwest frontier province of Pakistan, Mohamud and Al-Ali discussed the possibility of Mohamud traveling to Pakistan to join Al-Ali in terrorist activities. Mohamud responded to Al-Ali in an e-mail: “yes, that would be wonderful, just tell me what I need to do.” Al-Ali referred Mohamud to a second associate overseas and provided Mohamud with a name and email address to facilitate the process. In the following months, Mohamud made several unsuccessful attempts to contact Al-Ali’s associate. Ultimately, an FBI undercover operative contacted Mohamud via email under the guise of being an associate of Al-Ali’s. Mohamud and the FBI undercover operative agreed to meet in Portland in July 2010. At the meeting, Mohamud told the FBI undercover operative he had written articles that were published in Jihad Recollections. Mohamud also said that he wanted to become “operational.” Asked what he meant by “operational,” Mohamud said he wanted to put an explosion together, but needed help. According to evidence presented at trial, at a meeting in August 2010, Mohamud told undercover FBI operatives he had been thinking of committing violent jihad since the age of 15. Mohamud then told the undercover FBI operatives that he had identified a potential target for a bomb: the annual Christmas tree lighting ceremony in Portland’s Pioneer Courthouse Square on Nov. 26, 2010. The undercover FBI operatives cautioned Mohamud several times about the seriousness of this plan, noting there would be many people at the event, including children, and emphasized that Mohamud could abandon his attack plans at any time with no shame. Mohamud indicated the deaths would be justified and that he would not mind carrying out a suicide attack on the crowd. According to evidence presented at trial, in the ensuing months Mohamud continued to express his interest in carrying out the attack and worked on logistics. On Nov. 4, 2010, Mohamud and the undercover FBI operatives traveled to a remote location in Lincoln County, Oregon, where they detonated a bomb concealed in a backpack as a trial run for the upcoming attack. During the drive back to Corvallis, Mohamud was asked if was capable looking at all the bodies of those who would be killed during the explosion. In response, Mohamud noted, “I want whoever is attending that event to be, to leave either dead or injured.” Mohamud later recorded a video of himself, with the assistance of the undercover FBI operatives, in which he read a statement that offered his rationale for his bomb attack. On Nov. 18, 2010, undercover FBI operatives picked up Mohamud to travel to Portland to finalize the details of the attack. On Nov. 26, 2010, just hours before the planned attack, Mohamud examined the 1,800 pound bomb in the van and remarked that it was “beautiful.” Later that day, Mohamud was arrested after he attempted to remotely detonate the inert vehicle bomb rked near the Christmas tree lighting ceremony This case was investigated by the FBI, with assistance from the Oregon State Police, the Corvallis Police Department, the Lincoln County Sheriff’s Office and the Portland Police Bureau. The prosecution was handled by Assistant U.S. Attorneys Ethan D. Knight and Pamala Holsinger from the U.S. Attorney’s Office for the District of Oregon. Trial Attorney Jolie F. Zimmerman, from the Counterterrorism Section of the Justice Department’s National Security Division, assisted. # # # 14-1077",2014-10-01T00:00:00-04:00,No topic,National Security Division (NSD)
1,12-919,$1 Million in Restitution Payments Announced to Preserve North Carolina Wetlands,"WASHINGTON – North Carolina’s Waccamaw River watershed will benefit from a $1 million restitution order from a federal court, funding environmental projects to acquire and preserve wetlands in an area damaged by illegal releases of wastewater from a corporate hog farm, announced Ignacia S. Moreno, Assistant Attorney General of the Justice Department’s Environment and Natural Resources Division; U.S. Attorney for the Eastern District of North Carolina Thomas G. Walker; Director Greg McLeod from the North Carolina State Bureau of Investigation; and Camilla M. Herlevich, Executive Director of the North Carolina Coastal Land Trust. Freedman Farms Inc. was sentenced in February 2012 to five years of probation and ordered to pay $1.5 million in fines, restitution and community service payments for violating the Clean Water Act when it discharged hog waste into a stream that leads to the Waccamaw River. William B. Freedman, president of Freedman Farms, was sentenced to six months in prison to be followed by six months of home confinement. Freedman Farms also is required to implement a comprehensive environmental compliance program and institute an annual training program. In an order issued on April 19, 2012, the court ordered that the defendants would be responsible for restitution of $1 million in the form of five annual payments starting in January 2013, which the court will direct to the North Carolina Coastal Land Trust (NCCLT). The NCCLT plans to use the money to acquire and conserve land along streams in the Waccamaw watershed. The court also directed a $75,000 community service payment to the Southern Environmental Enforcement Network, an organization dedicated to environmental law enforcement training and information sharing in the region. “The resolution of the case against Freedman Farms demonstrates the commitment of the Department of Justice to enforcing the Clean Water Act to ensure the protection of human health and the environment,” said Assistant Attorney General Moreno. “The court-ordered restitution in this case will conserve wetlands for the benefit of the people of North Carolina. By enforcing the nation’s environmental laws, we will continue to ensure that concentrated animal feeding operations (CAFOs) operate without threatening our drinking water, the health of our communities and the environment.” “This office is committed to doing our part to hold accountable those who commit crimes against our environment, which can cause serious health problems to residents and damage the environment that makes North Carolina such a beautiful place to live and visit,” said U.S. Attorney Walker. “This case shows what we can accomplish when our SBI agents work closely with their local, state and federal partners to investigate environmental crimes and hold the polluters accountable,” said Director McLeod. “We’ll continue our efforts to fight illegal pollution that damages our water and puts the public’s health at risk.” “The Waccamaw is unique and wild,” said Director Herlevich of the North Carolina Coastal Land Trust. “Its watershed includes some of the most extensive cypress gum swamps in the state, and its headwaters at Lake Waccamaw contain fish that are found nowhere else on Earth. We appreciate the trust of the court and the U. S. Attorney, and we look forward to using these funds for conservation projects in a river system that is one of our top conservation priorities.” According to evidence presented in court, in December 2007 Freedman Farms discharged hog waste into Browder’s Branch, a tributary to the Waccamaw River that flows through the White Marsh, a large wetlands complex. Freedman Farms, located in Columbus County, N.C., is in the business of raising hogs for market, and this particular farm had some 4,800 hogs. The hog waste was supposed to be directed to two lagoons for treatment and disposal. Instead, hog waste was discharged from Freedman Farms directly into Browder’s Branch. The Clean Water Act is a federal law that makes it illegal to knowingly or negligently discharge a pollutant into a water of the United States. The Freedman case was investigated by the U.S. Environmental Protection Agency (EPA) Criminal Investigation Division, the U.S. Army Corps of Engineers and the North Carolina State Bureau of Investigation, with assistance from the EPA Science and Ecosystem Support Division. The case was prosecuted by Assistant U.S. Attorney J. Gaston B. Williams of the Eastern District of North Carolina and Trial Attorney Mary Dee Carraway of the Environmental Crimes Section of the Justice Department’s Environment and Natural Resources Division. The North Carolina Coastal Land Trust is celebrating its 20th anniversary of saving special lands in eastern North Carolina. The organization has protected nearly 50,000 acres of lands with scenic, recreational, historic and ecological values. North Carolina Coastal Land Trust has saved streams and wetlands that provide clean water, forests that are havens for wildlife, working farms that provide local food and nature parks that everyone can enjoy. More information about the Coastal Land Trust is available at www.coastallandtrust.org.",2012-07-25T00:00:00-04:00,No topic,Environment and Natural Resources Division
2,11-1002,$1 Million Settlement Reached for Natural Resource Damages at Superfund Site in Massachusetts,"BOSTON– A $1-million settlement has been reached for natural resource damages (NRD) at the Blackburn & Union Privileges Superfund Site in Walpole, Mass., the Departments of Justice and Interior (DOI), and the Office of the Massachusetts Attorney General announced today. The Blackburn & Union Privileges Superfund Site includes 22 acres of contaminated land and water in Walpole. The contamination resulted from the operations of various industrial facilities dating back to the 19th century that exposed the site to asbestos, arsenic, lead and other hazardous substances. The private parties involved in the settlement include two former owners and operators of the site, W.R. Grace & Co.– Conn. and Tyco Healthcare Group LP, as well as the current owners, BIM Investment Corp. and Shaffer Realty Nominee Trust. From about 1915 to 1936, a predecessor of W.R. Grace manufactured asbestos brake linings and clutch linings on a large portion of the property. From 1946 to about 1983, a predecessor of Tyco Healthcare operated a cotton fabric manufacturing business, which used caustic solutions, on a portion of the property. In a 2010 settlement with U.S. Environmental Protection Agency (EPA), the four private parties agreed to perform a remedial action to clean up the site at an estimated cost of $13 million. The consent decree lodged today resolves both state and federal NRD liability claims; it requires the parties to pay $1,094,169.56 to the state and federal natural resource trustees, the Massachusetts Executive Office of Energy and Environmental Affairs (EEA) and DOI, for injuries to ecological resources including groundwater and wetlands, which provide habitat for waterfowl and wading birds, including black ducks and great blue herons. The trustees will use the settlement funds for natural resource restoration projects in the area. “This settlement demonstrates our commitment to recovering damages from the parties responsible for injury to natural resources, in partnership with state trustees,” said Bruce Gelber, Acting Deputy Assistant Attorney General of the Justice Department’s Environment and Natural Resources Division. “The citizens of Walpole have had to live with the environmental impact of this contamination for many years,” Attorney General Martha Coakley said. “We are pleased that today’s agreement will not only require the responsible parties to reimburse taxpayer dollars, but will also provide funding to begin restoring or replacing the wetland and other natural resources.” The consent decree was lodged in the U.S. District Court for Massachusetts. A portion of the funds, $300,000, will be distributed to the EEA-sponsored groundwater restoration projects; $575,000 will be used for ecological restoration projects jointly sponsored by EEA and the U.S. Fish and Wildlife Service (FWS). In addition, $125,000 will go for projects jointly sponsored by EEA and FWS that achieve both ecological and groundwater restoration; $57,491.34 will be allocated for reimbursement for the FWS’s assessment costs; and $36,678.22 will be distributed as reimbursement for the commonwealth’s assessment costs. “This settlement provides the means for a range of projects designed to compensate the public for decades of groundwater and other ecological damage at this site. I encourage local citizens and organizations to become engaged in the public process that will take place as we solicit, take comment on, and choose these projects in the months ahead,” said Energy and Environmental Affairs Secretary Richard K. Sullivan Jr., who serves as the Commonwealth’s Natural Resources Damages trustee. “This settlement will help restore habitat for fish and wildlife in the Neponset River watershed,” said Tom Chapman of the FWS New England Field Office. “We look forward to working with the commonwealth and local stakeholders to implement restoration.” “More than 100 years-worth of industrial activities at this site caused major environmental contamination to the Neponset River, nearby wetlands and to groundwater below the site,” said Commissioner Kenneth Kimmell of the Massachusetts Department of Environmental Protection (MassDEP), which will staff the Trustee Council for the Commonwealth. “We will ensure that the community and the public will be active participants in the process to use these NRD funds to restore the injured natural resources.” Under the federal Comprehensive Environmental Response, Compensation and Liability Act, EEA and DOI, acting through the FWS, are the designated state and federal natural resource Trustees for the site. The site has been listed on the EPA’s National Priorities List since 1994. The consent decree is subject to a public comment period and court approval. A copy of the consent decree and instructions about how to submit comments is available on www.usdoj.gov/enrd/Consent_Decrees.html . After the consent decree is approved, EEA and FWS will develop proposed restoration plans to use the settlement funds for restoration projects. The proposed restoration plans will also be made available to the public for review and comment. Assistant Attorney General Matthew Brock of Massachusetts Attorney General Coakley's Environmental Protection Division handled this matter. Attorney Jennifer Davis of MassDEP, Attorney Anna Blumkin of EEA and MassDEP’s NRD Coordinator Karen Pelto also worked on this settlement.",2011-08-03T00:00:00-04:00,No topic,Environment and Natural Resources Division
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying Vehicle Emissions Tests,"WASHINGTON—A federal grand jury in Las Vegas today returned indictments against 10 Nevada-certified emissions testers for falsifying vehicle emissions test reports, the Justice Department announced. Each defendant faces one felony Clean Air Act count for falsifying reports between November 2007 and May 2009. The number of falsifications varied by defendant, with some defendants having falsified approximately 250 records, while others falsified more than double that figure. One defendant is alleged to have falsified over 700 reports. The individuals indicted include: Escudero resides in Pahrump, Nev. All other individuals are from Clark County, Nev. The 10 defendants are alleged to have engaged in a practice known as ""clean scanning"" vehicles. The scheme involved entering the Vehicle Identification Number (VIN) for a vehicle that would not pass the emissions test into the computerized system, then connecting a different vehicle the testers knew would pass the test. These falsifications were allegedly performed for anywhere from $10 to $100 over and above the usual emissions testing fee. The U.S. Environmental Protection Agency (EPA), under the Clean Air Act, requires the state of Nevada to conduct vehicle emissions testing in certain areas because the areas exceed national standards for carbon monoxide and ozone. Las Vegas is currently required to perform emissions testing. To obtain a registration renewal, vehicle owners bring the vehicles to a licensed inspection station for testing. The emissions inspector logs into a computer to activate the system by using a unique password issued to the emissions inspector. The emissions inspector manually inputs the vehicle’s VIN to identify the tested vehicle, then connects the vehicle for model year 1996 and later to an onboard diagnostics port connected to an analyzer. The analyzer downloads data from the vehicle’s computer, analyzes the data and provides a ""pass"" or ""fail"" result. The pass or fail result and vehicle identification data are reported on the Vehicle Inspection Report. It is a crime to knowingly alter or conceal any record or other document required to be maintained by the Clean Air Act. ""Falsifications of vehicle emissions testing, such as those alleged in the indictments unsealed today, are serious matters and we intend to use all of our enforcement tools to stop this harmful practice. These actions undermine a system that is designed to reduce air pollutants including smog and provide better air quality for the citizens of Nevada,"" said Ignacia S. Moreno, Assistant Attorney General for the Justice Department’s Environment and Natural Resources Division. ""The residents of Nevada deserve to know that the vast majority of licensed vehicle emission inspectors are not corrupt and are not circumventing emission testing procedures,"" said U.S. Attorney Bogden. ""These indictments should serve as a clear warning to offenders that the Department of Justice will prosecute you if you make fraudulent statements and reports concerning compliance with the federal Clean Air Act."" ""Lying about car emissions means dirtier air, which is especially of concern in areas like Las Vegas that are already experiencing air quality problems,"" said Cynthia Giles, Assistant Administrator for Enforcement and Compliance Assurance at EPA. ""We will take aggressive action to ensure communities have clean air."" The maximum penalty for the felony violations contained in the indictments includes up to two years in prison and a fine of up to $250,000. An indictment is merely an accusation, and a defendant is presumed innocent unless and until proven guilty in a court of law. The case was investigated by the EPA, Criminal Investigation Division; and the Nevada Department of Motor Vehicles Compliance Enforcement Division. The case is being prosecuted by the U.S. Attorney’s Office for the District of Nevada and the Justice Department’s Environmental Crimes Section.",2010-01-08T00:00:00-05:00,No topic,Environment and Natural Resources Division
4,18-898,"$100 Million Settlement Will Speed Cleanup Work at Centredale Manor Superfund Site in North Providence, R.I.","The U.S. Department of Justice, the U.S. Environmental Protection Agency (EPA), and the Rhode Island Department of Environmental Management (RIDEM) announced today that two subsidiaries of Stanley Black & Decker Inc.—Emhart Industries Inc. and Black & Decker Inc.—have agreed to clean up dioxin contaminated sediment and soil at the Centredale Manor Restoration Project Superfund Site in North Providence and Johnston, Rhode Island. “We are pleased to reach a resolution through collaborative work with the responsible parties, EPA, and other stakeholders,” said Acting Assistant Attorney General Jeffrey H. Wood for the Justice Department's Environment and Natural Resources Division . “Today’s settlement ends protracted litigation and allows for important work to get underway to restore a healthy environment for citizens living in and around the Centredale Manor Site and the Woonasquatucket River.” “This settlement demonstrates the tremendous progress we are achieving working with responsible parties, states, and our federal partners to expedite sites through the entire Superfund remediation process,” said EPA Acting Administrator Andrew Wheeler. “The Centredale Manor Site has been on the National Priorities List for 18 years; we are taking charge and ensuring the Agency makes good on its promise to clean it up for the betterment of the environment and those communities affected.” “Successfully concluding this settlement paves the way for EPA to make good on our commitment to aggressively pursue cleaning up the Centredale Manor Superfund Site,” said EPA New England Regional Administrator Alexandra Dunn. “We are excited to get to work on the cleanup at this site, and get it closer to the goal of being fully utilized by the North Providence and Johnston communities.” “We are pleased that the collective efforts of the State of Rhode Island, EPA, and DOJ in these negotiations have concluded in this major milestone toward the cleanup of the Centredale Manor Restoration Superfund site and are consistent with our long-standing efforts to make the polluter pay,” said RIDEM Director Janet Coit. “The settlement will speed up a remedy that protects public health and the river environment, and moves us closer to the day that we can reclaim recreational uses of this beautiful river resource.” The settlement, which includes cleanup work in the Woonasquatucket River (River) and bordering residential and commercial properties along the River, requires the companies to perform the remedy selected by EPA for the Site in 2012, which is estimated to cost approximately $100 million, and resolves longstanding litigation. The cleanup remedy includes excavation of contaminated sediment and floodplain soil from the Woonasquatucket River, including from adjacent residential properties. Once the cleanup remedy is completed, full access to the Woonasquatucket River should be restored for local citizens. The cleanup will be a step toward the State’s goal of a fishable and swimmable river. The work will also include upgrading caps over contaminated soil in the peninsula area of the Site that currently house two high-rise apartment buildings. The settlement also ensures that the long-term monitoring and maintenance of the site, as directed in the remedy, will be implemented to ensure that public health is protected. Under the settlement, Emhart and Black & Decker will reimburse EPA for approximately $42 million in past costs incurred at the Site. The companies will also reimburse EPA and the State of Rhode Island for future costs incurred by those agencies in overseeing the work required by the settlement. The settlement will also include payments on behalf of two federal agencies to resolve claims against those agencies. These payments, along with prior settlements related to the Site, will result in a 100 percent recovery for the United States of its past and future response costs related to the Site. Litigation related to the Site has been ongoing for nearly eight years. While the Federal District Court found Black & Decker and Emhart to be liable for their hazardous waste and responsible to conduct the cleanup of the Site, it had also ruled that EPA needed to reconsider certain aspects of that cleanup. EPA appealed the decision requiring it to reconsider aspects of the cleanup. This settlement, once entered by the District Court, will resolve the litigation between the United States, Rhode Island, and Emhart and Black and Decker, allowing the cleanup of the Site to begin. The Site spans a one and a half mile stretch of the Woonasquatucket River and encompasses a nine-acre peninsula, two ponds and a significant forested wetland. From the 1940s to the early 1970s, Emhart’s predecessor operated a chemical manufacturing facility on the peninsula and used a raw material that was contaminated with 2,3,7,8-tetrachlorodibenzo-p-dioxin, a toxic form of dioxin. The Site property was also previously used by a barrel refurbisher. Elevated levels of dioxins and other contaminants have been detected in soil, groundwater, sediment, surface water and fish. The Site was added to the National Priorities List (NPL) in 2000, and in December 2017, EPA included the Centredale Manor Restoration Project Superfund Site on a list of Superfund sites targeted for immediate and intense attention. Several short-term actions were previously performed at the Site to address immediate threats to the residents and minimize potential erosion and downstream transport of contaminated soil and sediment. This settlement is the latest agreement EPA has reached since the Site was listed on the NPL. Prior agreements addressed the performance and recovery of costs for the past environmental investigations and interim cleanup actions from Emhart, the barrel reconditioning company, the current owners of the peninsula portion of the Site, and other potentially responsible parties. The Consent Decree, lodged in the U.S. District Court of Rhode Island, will be posted in the Federal Register and available for public comment for a period of 30 days. The Consent Decree can be viewed on the Justice Department website: www.justice.gov/enrd/Consent_Decrees.html. EPA information on the Centredale Manor Superfund Site: www.epa.gov/superfund/centredale.",2018-07-09T00:00:00-04:00,Environment,Environment and Natural Resources Division


## 1. Tagging and sentiment scoring (17 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to to convert it from a pandas series to a single string.

We'll call the raw string of this press release `pharma`

In [3]:
## your code to subset to one press release and take the string
kickback_row = doj[doj["id"] == "17-1204"]
kickback_row

pharma = str(kickback_row["contents"])
print(type(pharma))

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
4909,17-1204,Founder and Owner of Pharmaceutical Company Insys Arrested and Charged with Racketeering,"The founder and majority owner of Insys Therapeutics Inc., was arrested today and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a Fentanyl spray intended for cancer patients experiencing breakthrough pain. ""More than 20,000 Americans died of synthetic opioid overdoses last year, and millions are addicted to opioids. And yet some medical professionals would rather take advantage of the addicts than try to help them,"" said Attorney General Jeff Sessions. ""This Justice Department will not tolerate this. We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic. And under the leadership of President Trump, we are fully committed to defeating this threat to the American people.”John N. Kapoor, 74, of Phoenix, Ariz., a current member of the Board of Directors of Insys, was arrested this morning in Arizona and charged with RICO conspiracy, as well as other felonies, including conspiracy to commit mail and wire fraud and conspiracy to violate the Anti-Kickback Law. Kapoor, the former Executive Chairman of the Board and CEO of Insys, will appear in federal court in Phoenix today. He will appear in U.S. District Court in Boston at a later date. The superseding indictment, unsealed today in Boston, also includes additional allegations against several former Insys executives and managers who were initially indicted in December 2016.The superseding indictment charges that Kapoor; Michael L. Babich, 40, of Scottsdale, Ariz., former CEO and President of the company; Alec Burlakoff, 42, of Charlotte, N.C., former Vice President of Sales; Richard M. Simon, 46, of Seal Beach, Calif., former National Director of Sales; former Regional Sales Directors Sunrise Lee, 36, of Bryant City, Mich., and Joseph A. Rowan, 43, of Panama City, Fla.; and former Vice President of Managed Markets, Michael J. Gurry, 53, of Scottsdale, Ariz., conspired to bribe practitioners in various states, many of whom operated pain clinics, in order to get them to prescribe a fentanyl-based pain medication. The medication, called “Subsys,” is a powerful narcotic intended to treat cancer patients suffering intense breakthrough pain. In exchange for bribes and kickbacks, the practitioners wrote large numbers of prescriptions for the patients, most of whom were not diagnosed with cancer.The indictment also alleges that Kapoor and the six former executives conspired to mislead and defraud health insurance providers who were reluctant to approve payment for the drug when it was prescribed for non-cancer patients. They achieved this goal by setting up the “reimbursement unit,” which was dedicated to obtaining prior authorization directly from insurers and pharmacy benefit managers. “In the midst of a nationwide opioid epidemic that has reached crisis proportions, Mr. Kapoor and his company stand accused of bribing doctors to overprescribe a potent opioid and committing fraud on insurance companies solely for profit,” said Acting United States Attorney William D. Weinreb. “Today's arrest and charges reflect our ongoing efforts to attack the opioid crisis from all angles. We must hold the industry and its leadership accountable - just as we would the cartels or a street-level drug dealer.”“As alleged, these executives created a corporate culture at Insys that utilized deception and bribery as an acceptable business practice, deceiving patients, and conspiring with doctors and insurers,” said Harold H. Shaw, Special Agent in Charge of the Federal Bureau of Investigation, Boston Field Division. “The allegations of selling a highly addictive opioid cancer pain drug to patients who did not have cancer, make them no better than street-level drug dealers. Today's charges mark an important step in holding pharmaceutical executives responsible for their part in the opioid crisis. The FBI will vigorously investigate corrupt organizations with business practices that promote fraud with a total disregard for patient safety.”“These Insys executives allegedly fueled the opioid epidemic by paying doctors to needlessly prescribe an extremely dangerous and addictive form of fentanyl,” said Phillip Coyne, Special Agent in Charge for the Office of Inspector General of the U.S. Department of Health and Human Services. “Corporate executives intent on illegally driving up profits need to be aware they are now squarely in the sights of law enforcement.”“As alleged, Insys executives improperly influenced health care providers to prescribe a powerful opioid for patients who did not need it, and without complying with FDA requirements, thus putting patients at risk and contributing to the current opioid crisis,” said Mark A. McCormack, Special Agent in Charge, FDA Office of Criminal Investigations’ Metro Washington Field Office. “Our office will continue to work with our law enforcement partners to pursue and bring to justice those who threaten the public health.”“Pharmaceutical companies whose products include controlled medications that can lead to addiction and overdose have a special obligation to operate in a trustworthy, transparent manner, because their customers’ health and safety and, indeed, very lives depend on it,” said DEA Special Agent in Charge Michael J. Ferguson. “DEA pledges to work with our law enforcement and regulatory partners nationwide to ensure that rules and regulations under the Controlled Substances Act are followed.”“Today’s arrest is the result of a joint effort to identify, investigate and prosecute individuals who engage in fraudulent activity and endanger patient health,” stated Special Agent in Charge Leigh-Alistair Barzey, Defense Criminal Investigative Service (DCIS) Northeast Field Office. “DCIS will continue to work with the U.S. Attorney’s Office, District of Massachusetts, and our law enforcement partners, to protect U.S. military members, retirees and their dependents and the integrity of TRICARE, the Defense Department’s healthcare system.”“As alleged, John Kapoor and other top executives committed fraud, placing profit before patient safety, to sell a highly potent and addictive opioid. EBSA will take every opportunity to work collaboratively with our law enforcement partners in these important investigations to protect participants in private sector health plans and contribute in fighting the opioid epidemic,” said Susan A. Hensley, Regional Director of the U.S. Department of Labor, Employee Benefits Security Administration, Boston Regional Office.“Once again, the United States Postal Inspection Service is fully committed to protecting our nation’s mail system from criminal misuse,” said Shelly Binkowski, Inspector in Charge of the U.S. Postal Inspection Service. “We are proud to work alongside our law enforcement partners to dismantle high level prescription drug practices which directly contribute to the opioid abuse epidemic. This investigation highlights our commitment to defending our mail system from illegal misuse and ensuring public trust in the mail.”“The U.S. Department of Veterans Affairs, Office of Inspector General will continue to aggressively investigate those that attempt to fraudulently impact programs designed to benefit our veterans and their families,” said Donna L. Neves, Special Agent in Charge of the VA OIG Northeast Field Office.The charges of conspiracy to commit RICO and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than 20 years in prison, three years of supervised release and a fine of $250,000, or twice the amount of pecuniary gain or loss. The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, three years of supervised release and a $25,000 fine. Sentences are imposed by a federal district court judge based upon the U.S. Sentencing Guidelines and other statutory factors.The investigation was conducted by a team that included the FBI; HHS-OIG; FDA Office of Criminal Investigations; the Defense Criminal Investigative Service; the Drug Enforcement Administration; the Department of Labor, Employee Benefits Security Administration; the Office of Personnel Management; the U.S. Postal Inspection Service; the U.S. Postal Service Office of Inspector General; and the Department of Veterans Affairs. The U.S. Attorney’s Office would like to acknowledge the cooperation and assistance of the U.S. Attorney’s Offices around the country engaged in parallel investigations, including the District of Connecticut, Eastern District of Michigan, Southern District of Alabama, Southern District of New York, District of Rhode Island, and the District of New Hampshire. The efforts of the Central District of California and the Justice Department’s Civil Fraud Section of the Department of Justice are also greatly appreciated. Assistant U.S. Attorneys K. Nathaniel Yeager, Chief of Weinreb’s Health Care Fraud Unit, and Susan M. Poswistilo, of Weinreb’s Civil Division, are prosecuting the case.The details contained in the charging documents are allegations. The defendants are presumed innocent unless and until proven guilty beyond a reasonable doubt.",2017-10-26T00:00:00-04:00,Opioids,Office of the Attorney General


<class 'str'>


### 1.1 part of speech tagging (3 points)

A. Preprocess the `pharma` press release to remove all punctuation / digits (so can use `.isalpha()` to subset)

B. With the preprocessed press release from part A, use the part of speech tagger within nltk to tag all the words in that one press release with their part of speech. 

C. Using the output from B, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print a dataframe with the 5 most frequent adjectives and their counts in the `pharma` release. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for `.isalpha()`: https://www.w3schools.com/python/ref_string_isalpha.asp
- `processtext` function here has an example of tokenizing and filtering to words where `.isalpha()` is true: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb
- Part of speech tagging section of this code: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partI_textmining_solutions.ipynb



In [4]:
## your code here to restrict to alpha
pharma_alpha = re.sub(r'\W+', ' ', pharma)
pharma_alpha = re.sub(r'\d+', ' ', pharma_alpha)

In [5]:
## your code here for part of speech tagging

pharma_tokens = word_tokenize(pharma_alpha)
pharma_pos = pos_tag(pharma_tokens)

pharma_df = pd.DataFrame.from_records(pharma_pos, columns = ["word", "pos"])
pharma_adj_df = pharma_df[pharma_df["pos"].str.contains("JJ") == True]

pharma_adj_count = pharma_adj_df.value_counts("word").sort_values(ascending = False)

pharma_adj_count.head()

word
former        8
opioid        5
nationwide    4
other         3
addictive     3
dtype: int64

## 1.2 named entity recognition (4 points)

A. Using the original `pharma` press release (so the one before stripping punctuation/digits), use spaCy to extract all named entities from the press release.

B. Print the unique named entities with the tag: `LAW`

**Resources**:

- For parts A and B: named entity recognition part of this code: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partI_textmining_solutions.ipynb

In [6]:
## your code here for part A
spacy_pharma = nlp(pharma)

In [7]:
## your code here for part B

for one_tok in spacy_pharma.ents:
    print(one_tok)
    print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)
    #if one_tok.label_ == "LAW":
    #    print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)

4909
Entity: 4909; NER tag: DATE
Insys Therapeutics Inc.
Entity: Insys Therapeutics Inc.; NER tag: ORG
today
Entity: today; NER tag: DATE
Fentanyl
Entity: Fentanyl; NER tag: PERSON
More than 20,000
Entity: More than 20,000; NER tag: CARDINAL
Americans
Entity: Americans; NER tag: NORP
last year
Entity: last year; NER tag: DATE
millions
Entity: millions; NER tag: CARDINAL
Jeff Sessions
Entity: Jeff Sessions; NER tag: PERSON
This Justice Department
Entity: This Justice Department; NER tag: ORG
Trump
Entity: Trump; NER tag: PERSON
American
Entity: American; NER tag: NORP
”John N. Kapoor
Entity: ”John N. Kapoor; NER tag: PERSON
74
Entity: 74; NER tag: DATE
Phoenix
Entity: Phoenix; NER tag: GPE
Ariz.
Entity: Ariz.; NER tag: GPE
the Board of Directors
Entity: the Board of Directors; NER tag: ORG
Insys
Entity: Insys; NER tag: ORG
this morning
Entity: this morning; NER tag: TIME
Arizona
Entity: Arizona; NER tag: GPE
RICO
Entity: RICO; NER tag: LAW
Kapoor
Entity: Kapoor; NER tag: PERSON
Executiv

C. Use Google to summarize in one sentence what the `RICO` named entity means and why this might apply to a pharmaceutical kickbacks case (and not just a mafia case...) 

In [8]:
## your code here 

D. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these named entities.

In [9]:
## your code here
maybe_sentence = []
maybe_sentence_years = []
for one_tok in spacy_pharma.ents:
    if one_tok.label_ == "DATE":
        maybe_sentence.append(one_tok.text)

years = [re.findall(r'\w+\syears*', s) for s in maybe_sentence]

for r in years:
    if r != []:
        maybe_sentence_years = maybe_sentence_years + r
        
print(maybe_sentence)
print(maybe_sentence_years)

['4909', 'today', 'last year', '74', 'today', 'a later date', 'today', 'December 2016.The', '40', '42', '46', '36', '43', '53', 'Today', 'Today', '20 years', 'three years', 'five years', 'three years']
['last year', '20 years', 'three years', 'five years', 'three years']


E. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convicted after this indictment (if there are multiple lengths mentioned describe the maximum). 

**Hint**: you may want to use re.search or re.findall 

- For part E, `re.search` and `re.findall` examples here for filtering to ones containing year (multiple approaches; some need not involve `re`): https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_basicregex_solutions.ipynb

In [10]:
## your code here


## 1.3 sentiment analysis  (10 points)

- Sentiment analysis section of this script: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partI_textmining_solutions.ipynb


A. Subset the press releases to those labeled with one of three topics via `topics_clean`: Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.



In [11]:
## your code here for subsetting
doj_subset = doj[doj["topics_clean"] == "Civil Rights"]
doj_subset = doj_subset.append(doj[doj["topics_clean"] == "Hate Crimes"])
doj_subset = doj_subset.append(doj[doj["topics_clean"] == "Project Safe Childhood"]).reset_index(drop = True)
doj_subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717 entries, 0 to 716
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   id                707 non-null    object
 1   title             717 non-null    object
 2   contents          717 non-null    object
 3   date              717 non-null    object
 4   topics_clean      717 non-null    object
 5   components_clean  717 non-null    object
dtypes: object(6)
memory usage: 33.7+ KB


B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string (**Hint**: you may want to use `re.sub` with an or condition)
- Scores the sentiment of the entire press release using the `SentimentIntensityAnalyzer` and `polarity_scores`
- Returns the length-four (negative, positive, neutral, compound) sentiment dictionary (any order is fine)

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- I used a function + list comprehension to execute and it takes about 30 seconds on my local machine and about 2 mins on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample of the 717


In [19]:
## your code here to define function
test_string = "A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened.    James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack.   Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up.   “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division.  “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.”   “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.”    This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section."

def sentiment_analysis(string):
    #new_string = string
    #
    #spacy_string = nlp(string)
    #entities = []
    #for one_tok in spacy_string.ents:
    #    entities.append(one_tok.text)
    #
    #for ent in entities:
    #    print(ent)
    #    print(type(ent))
    #    new_string.replace(ent, "")
    
    #string
    #new_string
        
    sent_obj = SentimentIntensityAnalyzer()
    sentiment = sent_obj.polarity_scores(string)
    sentiment
    (neg, neu, pos, comp) = sentiment["neg"], sentiment["neu"], sentiment["pos"], sentiment["compound"]

    return neg, neu, pos, comp

sentiment_analysis(test_string)


(0.169, 0.763, 0.068, -0.9893)

In [20]:
## your code here executing the function
releases_list = doj_subset["contents"].to_list()

for release in releases_list:
    print(sentiment_analysis(release))
    

(0.169, 0.763, 0.068, -0.9893)
(0.146, 0.774, 0.079, -0.9814)
(0.097, 0.751, 0.152, 0.9623)
(0.131, 0.737, 0.132, -0.6771)
(0.215, 0.633, 0.152, -0.7717)
(0.029, 0.857, 0.114, 0.9485)
(0.103, 0.816, 0.082, -0.7783)
(0.109, 0.834, 0.057, -0.9746)
(0.08, 0.849, 0.071, -0.2575)
(0.057, 0.854, 0.089, 0.9493)
(0.105, 0.83, 0.065, -0.985)
(0.118, 0.801, 0.081, -0.9578)
(0.0, 0.834, 0.166, 0.9949)
(0.097, 0.858, 0.044, -0.91)
(0.079, 0.863, 0.058, -0.6085)
(0.044, 0.821, 0.134, 0.9897)
(0.072, 0.816, 0.112, 0.9673)
(0.033, 0.842, 0.126, 0.9893)
(0.069, 0.843, 0.088, 0.8779)
(0.105, 0.734, 0.16, 0.9872)
(0.056, 0.839, 0.105, 0.9833)
(0.008, 0.828, 0.164, 0.9976)
(0.021, 0.864, 0.114, 0.9963)
(0.021, 0.828, 0.151, 0.9959)
(0.069, 0.821, 0.11, 0.9838)
(0.079, 0.812, 0.109, 0.9636)
(0.062, 0.797, 0.141, 0.9962)
(0.027, 0.913, 0.06, 0.9147)
(0.076, 0.789, 0.135, 0.9918)
(0.137, 0.766, 0.097, -0.9816)
(0.118, 0.79, 0.092, -0.9855)
(0.115, 0.828, 0.057, -0.9995)
(0.109, 0.797, 0.093, -0.9371)
(0.172

(0.029, 0.875, 0.096, 0.9028)
(0.125, 0.77, 0.106, -0.9186)
(0.0, 0.879, 0.121, 0.8779)
(0.0, 0.84, 0.16, 0.9442)
(0.026, 0.74, 0.234, 0.9897)
(0.0, 0.916, 0.084, 0.9118)
(0.196, 0.735, 0.069, -0.969)
(0.133, 0.787, 0.079, -0.9745)
(0.137, 0.777, 0.086, -0.9778)
(0.178, 0.75, 0.072, -0.9936)
(0.022, 0.863, 0.115, 0.9861)
(0.195, 0.783, 0.022, -0.9959)
(0.135, 0.797, 0.068, -0.9571)
(0.149, 0.817, 0.034, -0.9875)
(0.259, 0.713, 0.028, -0.9968)
(0.167, 0.748, 0.086, -0.9863)
(0.235, 0.712, 0.052, -0.9963)
(0.211, 0.695, 0.094, -0.9817)
(0.17, 0.739, 0.092, -0.9799)
(0.05, 0.858, 0.092, 0.9545)
(0.047, 0.819, 0.134, 0.994)
(0.163, 0.737, 0.1, -0.9976)
(0.148, 0.762, 0.09, -0.9633)
(0.013, 0.856, 0.131, 0.9986)
(0.031, 0.831, 0.138, 0.9947)
(0.125, 0.746, 0.13, -0.2054)
(0.141, 0.803, 0.056, -0.9786)
(0.122, 0.819, 0.059, -0.9798)
(0.082, 0.844, 0.073, -0.3612)
(0.179, 0.772, 0.049, -0.9965)
(0.257, 0.699, 0.044, -0.9943)
(0.104, 0.818, 0.078, -0.9081)
(0.205, 0.739, 0.056, -0.9983)
(0.117

(0.098, 0.81, 0.092, -0.3182)
(0.081, 0.829, 0.09, 0.5267)
(0.144, 0.756, 0.1, -0.9864)
(0.081, 0.815, 0.104, 0.8225)
(0.064, 0.812, 0.124, 0.9979)
(0.102, 0.777, 0.12, 0.5423)
(0.105, 0.79, 0.105, -0.1717)
(0.098, 0.814, 0.088, -0.8316)
(0.096, 0.768, 0.136, 0.9022)
(0.085, 0.805, 0.11, 0.7845)
(0.108, 0.768, 0.124, 0.7128)
(0.121, 0.782, 0.096, -0.802)
(0.133, 0.772, 0.095, -0.9446)
(0.086, 0.8, 0.114, 0.8481)
(0.152, 0.733, 0.114, -0.9477)
(0.13, 0.763, 0.107, -0.6597)
(0.108, 0.775, 0.117, 0.8173)
(0.102, 0.795, 0.103, -0.6909)
(0.127, 0.788, 0.085, -0.9169)
(0.092, 0.823, 0.085, -0.7411)
(0.101, 0.805, 0.095, -0.34)
(0.068, 0.814, 0.118, 0.9485)
(0.089, 0.868, 0.043, -0.9741)
(0.078, 0.813, 0.109, 0.9062)
(0.117, 0.794, 0.089, -0.875)
(0.132, 0.758, 0.11, -0.6486)
(0.13, 0.769, 0.101, -0.9738)
(0.14, 0.798, 0.062, -0.9952)
(0.135, 0.798, 0.067, -0.9726)
(0.095, 0.783, 0.122, 0.8332)
(0.124, 0.772, 0.104, -0.9423)
(0.104, 0.76, 0.136, 0.743)
(0.096, 0.795, 0.109, 0.4767)
(0.121, 0.

C. Add the four sentiment scores to the `doj_subset` dataframe to create a dataframe: `doj_subset_wscore`. Sort from highest neg to lowest neg score and print the top `id`, `contents`, and `neg` columns of the two most neg press releases. 

Notes:

- Don't worry if your sentiment score differs slightly from our output on GitHub; differences in preprocessing can lead to diff scores

In [46]:
## your code here

#for i, row in df.iterrows():
#    ifor_val = something
#    if <condition>:
#        ifor_val = something_else
#    df.at[i,'ifor'] = ifor_val

doj_subset_wscore = doj_subset

doj_subset_wscore["sent_neg"]  = 0
doj_subset_wscore["sent_neu"]  = 0
doj_subset_wscore["sent_pos"]  = 0
doj_subset_wscore["sent_comp"] = 0

#Used StackOverflow to troubleshoot this chunk of code
for i, row in doj_subset.iterrows():
    (neg, neu, pos, comp) = sentiment_analysis(doj_subset_wscore.at[i, "contents"])
    print([neg, neu, pos, comp])
    doj_subset_wscore.at[i, "sent_neg"]  = neg
    doj_subset_wscore.at[i, "sent_neu"]  = neu
    doj_subset_wscore.at[i, "sent_pos"]  = pos
    doj_subset_wscore.at[i, "sent_comp"] = comp
    
doj_subset_wscore

[0.169, 0.763, 0.068, -0.9893]
[0.146, 0.774, 0.079, -0.9814]
[0.097, 0.751, 0.152, 0.9623]
[0.131, 0.737, 0.132, -0.6771]
[0.215, 0.633, 0.152, -0.7717]
[0.029, 0.857, 0.114, 0.9485]
[0.103, 0.816, 0.082, -0.7783]
[0.109, 0.834, 0.057, -0.9746]
[0.08, 0.849, 0.071, -0.2575]
[0.057, 0.854, 0.089, 0.9493]
[0.105, 0.83, 0.065, -0.985]
[0.118, 0.801, 0.081, -0.9578]
[0.0, 0.834, 0.166, 0.9949]
[0.097, 0.858, 0.044, -0.91]
[0.079, 0.863, 0.058, -0.6085]
[0.044, 0.821, 0.134, 0.9897]
[0.072, 0.816, 0.112, 0.9673]
[0.033, 0.842, 0.126, 0.9893]
[0.069, 0.843, 0.088, 0.8779]
[0.105, 0.734, 0.16, 0.9872]
[0.056, 0.839, 0.105, 0.9833]
[0.008, 0.828, 0.164, 0.9976]
[0.021, 0.864, 0.114, 0.9963]
[0.021, 0.828, 0.151, 0.9959]
[0.069, 0.821, 0.11, 0.9838]
[0.079, 0.812, 0.109, 0.9636]
[0.062, 0.797, 0.141, 0.9962]
[0.027, 0.913, 0.06, 0.9147]
[0.076, 0.789, 0.135, 0.9918]
[0.137, 0.766, 0.097, -0.9816]
[0.118, 0.79, 0.092, -0.9855]
[0.115, 0.828, 0.057, -0.9995]
[0.109, 0.797, 0.093, -0.9371]
[0.172

[0.018, 0.86, 0.122, 0.995]
[0.029, 0.875, 0.096, 0.9028]
[0.125, 0.77, 0.106, -0.9186]
[0.0, 0.879, 0.121, 0.8779]
[0.0, 0.84, 0.16, 0.9442]
[0.026, 0.74, 0.234, 0.9897]
[0.0, 0.916, 0.084, 0.9118]
[0.196, 0.735, 0.069, -0.969]
[0.133, 0.787, 0.079, -0.9745]
[0.137, 0.777, 0.086, -0.9778]
[0.178, 0.75, 0.072, -0.9936]
[0.022, 0.863, 0.115, 0.9861]
[0.195, 0.783, 0.022, -0.9959]
[0.135, 0.797, 0.068, -0.9571]
[0.149, 0.817, 0.034, -0.9875]
[0.259, 0.713, 0.028, -0.9968]
[0.167, 0.748, 0.086, -0.9863]
[0.235, 0.712, 0.052, -0.9963]
[0.211, 0.695, 0.094, -0.9817]
[0.17, 0.739, 0.092, -0.9799]
[0.05, 0.858, 0.092, 0.9545]
[0.047, 0.819, 0.134, 0.994]
[0.163, 0.737, 0.1, -0.9976]
[0.148, 0.762, 0.09, -0.9633]
[0.013, 0.856, 0.131, 0.9986]
[0.031, 0.831, 0.138, 0.9947]
[0.125, 0.746, 0.13, -0.2054]
[0.141, 0.803, 0.056, -0.9786]
[0.122, 0.819, 0.059, -0.9798]
[0.082, 0.844, 0.073, -0.3612]
[0.179, 0.772, 0.049, -0.9965]
[0.257, 0.699, 0.044, -0.9943]
[0.104, 0.818, 0.078, -0.9081]
[0.205, 0

[0.098, 0.81, 0.092, -0.3182]
[0.081, 0.829, 0.09, 0.5267]
[0.144, 0.756, 0.1, -0.9864]
[0.081, 0.815, 0.104, 0.8225]
[0.064, 0.812, 0.124, 0.9979]
[0.102, 0.777, 0.12, 0.5423]
[0.105, 0.79, 0.105, -0.1717]
[0.098, 0.814, 0.088, -0.8316]
[0.096, 0.768, 0.136, 0.9022]
[0.085, 0.805, 0.11, 0.7845]
[0.108, 0.768, 0.124, 0.7128]
[0.121, 0.782, 0.096, -0.802]
[0.133, 0.772, 0.095, -0.9446]
[0.086, 0.8, 0.114, 0.8481]
[0.152, 0.733, 0.114, -0.9477]
[0.13, 0.763, 0.107, -0.6597]
[0.108, 0.775, 0.117, 0.8173]
[0.102, 0.795, 0.103, -0.6909]
[0.127, 0.788, 0.085, -0.9169]
[0.092, 0.823, 0.085, -0.7411]
[0.101, 0.805, 0.095, -0.34]
[0.068, 0.814, 0.118, 0.9485]
[0.089, 0.868, 0.043, -0.9741]
[0.078, 0.813, 0.109, 0.9062]
[0.117, 0.794, 0.089, -0.875]
[0.132, 0.758, 0.11, -0.6486]
[0.13, 0.769, 0.101, -0.9738]
[0.14, 0.798, 0.062, -0.9952]
[0.135, 0.798, 0.067, -0.9726]
[0.095, 0.783, 0.122, 0.8332]
[0.124, 0.772, 0.104, -0.9423]
[0.104, 0.76, 0.136, 0.743]
[0.096, 0.795, 0.109, 0.4767]
[0.121, 0.

Unnamed: 0,id,title,contents,date,topics_clean,components_clean,sent_neg,sent_neu,sent_pos,sent_comp
0,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,"A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened. James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack. Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up. “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division. “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.” “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.” This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section.",2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle",0.169,0.763,0.068,-0.9893
1,17-240,Anoka County Resident Sentenced to Six Months in Prison for Threatening Two Clinics that Provide Reproductive Health Services,"On, Feb. 27, 2017, Michael John Harris, 34, was sentenced to six months imprisonment and one year of supervised release for making telephonic threats to two medical clinics in Minneapolis, Minnesota, that provide reproductive health services. On March 2, 2016, Harris pleaded guilty to two violations of 18 U.S.C. § 248(a)(1). During his plea hearing, Harris admitted that on May 12, 2014, he made telephonic threats to two different health clinics in Minneapolis that provide reproductive health services. In a call to the first clinic, Harris threatened to kill the recipient of the call with his bare hands and to cut the recipient’s head off with a band saw. In a call to the second clinic, Harris told the recipient that he was going to kill the recipient and the recipient’s co-workers, and that he was going to travel to the clinic and shoot everyone present. Harris further admitted that he made these threats because the recipient was and has been, and in order to intimidate the recipient and any other person from, obtaining and providing reproductive health services. “This defendant threatened these clinic workers with death and brutality,” said Acting Assistant Attorney General Tom Wheeler of the Justice Department’s Civil Rights Division. “The department is pleased that the defendant accepted responsibility and will face consequences for his actions. The Department is committed to vigorously enforcing the civil rights of all individuals in this country.” “The violence threatened by this defendant against health care workers is unacceptable,” said United States Attorney Andrew M. Luger of the District of Minnesota. “This sentence should serve as a reminder to individuals who would engage in such threats that the federal government will prosecute these crimes.” This case is being investigated by the Federal Bureau of Investigation, and is being prosecuted by Trial Attorney Risa Berkower of the Civil Rights Division of the United States Department of Justice and Assistant U.S. Attorney Manda M. Sertich of the U.S. Attorney’s Office for the District of Minnesota.",2017-03-02T00:00:00-05:00,Civil Rights,Civil Rights Division; Civil Rights - Criminal Section; Federal Bureau of Investigation (FBI); USAO - Minnesota,0.146,0.774,0.079,-0.9814
2,17-379,Arson Awareness Week 2017 to Focus on Preventing Arson at Houses of Worship,"The Justice Department today announced that its Civil Rights Division is partnering with the Federal Emergency Management Agency’s U.S. Fire Administration on this year’s Arson Awareness Week, May 7-13, with a focus on Preventing Arson at Houses of Worship. There were an average of 103 arsons of houses of worship per year from 2000 to 2015. Half of all reported fires at houses of worship turn out to involve arson. The Department of Justice enforces a number of federal statutes protecting places of worship from attack, including 18 U.S.C. § 247, known as the Church Arson Prevention Act, which was passed in the 1990s in response to a sharp increase in church arsons. That law makes it a federal crime to target religious property because of the religion or race of the congregation. In February of this year, the Department indicted an Idaho man under § 247 alleging that he set fire to a Catholic Church in Bonner’s Ferry in April 2016. In 2013, an Indiana man was sentenced to 20 years imprisonment for setting a fire at the Islamic Center of Greater Toledo. FEMA and the Department of Justice have produced a number of materials to help congregations, community organizations and local law enforcement and fire safety officials to increase arson awareness and hold events highlighting proactive steps that can be taken to try to reduce house of worship arson. These materials are available at the Arson Awareness Week homepage, www.usfa.fema.gov/aaw. “Arson against houses of worship is a serious crime that the Department of Justice is committed to prosecuting to the fullest extent of the law,” said Acting Assistant Attorney General Tom Wheeler of the Justice Department’s Civil Rights Division. “But our role as prosecutors, while critically important, only comes after the fact when the damage is already done. That is why we encourage communities and local officials to take proactive steps to increase public awareness of the problem and measures that can be taken to reduce the likelihood of being a victim of house of worship arson.” Further information about hate crimes, including arsons against on places of worship, is available at the Civil Rights Division hate crimes page, https://www.justice.gov/crt/hate-crimes-0.",2017-04-10T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Criminal Section,0.097,0.751,0.152,0.9623
3,18-121,Attorney General Issues National Slavery and Human Trafficking Prevention Month Proclamation,"Attorney General Jeff Sessions issued the following proclamation commemorating January as National Slavery and Human Trafficking Prevention Month: “Human trafficking is a nationwide public health and civil rights crisis. Its victims are everywhere: at truck stops, in cities, in rural areas, and in suburbs, and who now total an unconscionable 25 million victims globally according to some estimates. That means 25 million human beings—parents, siblings, and children—have been coerced into a commercial sex act, forced into labor, or exploited because they desperately seek a better life. It is a priority of the Department of Justice to combat this depraved and predatory behavior through swift and aggressive enforcement of our nation’s laws to bring traffickers to justice and restore the lives of victims and survivors. “The Justice Department’s U.S. Attorneys’ Offices, working closely with the Federal Bureau of Investigation (FBI), other federal agencies, and our state, local, and tribal partners, are on the front lines, leading our shared fight against human trafficking in all its forms. These entities are supported by the Department’s Civil Rights Division which is home to a team of dedicated investigators and prosecutors—the Human Trafficking Prosecution Unit (the HTPU)—tasked with bringing human traffickers to justice and vindicating the rights of their victims. Additionally, the Department’s Criminal Division includes the Child Exploitation and Obscenity Section (CEOS), which is committed to harnessing expertise in attacking the technological and systemic challenges that are involved in the sexual exploitation of minors, as well as other specialized prosecution teams who bring expertise in organized crime and money laundering. “Our efforts have produced high-impact prosecutions to dismantle transnational organized human trafficking enterprises, have launched interagency anti-trafficking initiatives with unprecedented momentum, and have vindicated the rights and freedoms of countless victims and survivors. “These efforts resulted in the conviction of nearly 500 defendants in trafficking cases in fiscal year 2017, and making $47 million available to help trafficking survivors. Last fall, the FBI—along with state and local task forces and international law enforcement partners—recovered 84 minors and arrested 120 traffickers, as part a single week-long operation. However, we are keenly aware that many challenges lie ahead and we are committed to taking our efforts to the next level. “In his Presidential Proclamation, President Trump asked us to ‘recommit ourselves to eradicating the evil of enslavement’ and to ‘pledge to do all in our power to end the horrific practice of human trafficking.’ In the spirit of the President’s request, the Justice Department is hosting a Human Trafficking Summit in Washington, D.C. on February 2, 2018, two days before Super Bowl LII. The Super Bowl provides an opportunity to raise awareness of the surge in commercial sex activity around major sporting events, and of our commitment to finding and protecting sex trafficking victims who are at risk of being compelled, coerced, or exploited as minors in that context. “The Human Trafficking Summit will be led by Associate Attorney General Rachel Brand and will convene law enforcement, victim support organizations, and the business community to focus on enhancing the strong partnerships behind all successful anti-trafficking efforts and identifying opportunities to increase collaboration and coordination as we take on new challenges. “There is no room in a civilized society for those who choose to violate an individual’s rights and freedoms by subjecting them to any form of human trafficking. To those that still make that choice: make no mistake, the Justice Department will use every lawful tool to uncover your illegal activity and bring you to justice.”",2018-01-31T00:00:00-05:00,Civil Rights,Civil Rights Division; Office of the Attorney General,0.131,0.737,0.132,-0.6771
4,16-603,Attorney General Loretta E. Lynch Statement on the Case of Dylann Roof,"Attorney General Loretta E. Lynch today released the following statement regarding the United States v. Dylann Roof: “Following the department’s rigorous review process to thoroughly consider all relevant factual and legal issues, I have determined that the Justice Department will seek the death penalty. The nature of the alleged crime and the resulting harm compelled this decision.”",2016-05-24T00:00:00-04:00,Civil Rights,Office of the Attorney General,0.215,0.633,0.152,-0.7717
...,...,...,...,...,...,...,...,...,...,...
712,18-559,Virginia Man Sentenced to Over 15 Years in Prison for Sex Trafficking a Minor and Producing Child Pornography,"A Virginia man was sentenced today to 186 months in prison and 10 years of supervised release for multiple crimes related to the prostitution and exploitation of a 15-year-old minor. Assistant Attorney General Brian A. Benczkowski of the Justice Department’s Criminal Division, U.S. Attorney G. Zachary Terwilliger of the Eastern District of Virginia, and Assistant Director in Charge Nancy McNamara of the FBI’s Washington Field Office made the announcement after the sentence was handed down by U.S. District Judge Anthony J. Trenga of the Eastern District of Virginia. Abdul Karim Bangura Jr. aka “AJ”, 22, of Triangle, Virginia pleaded guilty in August 2017 to all counts of an indictment charging him with sex trafficking of a minor, conspiracy to engage in sex trafficking of a minor, interstate transportation of a minor for the purposes of prostitution, and production of child pornography. According to admissions made in connection with his plea, Bangura and his co-defendant Christian Hood conspired to recruit a 15-year-old girl to work as a prostitute and to advertise her prostitution services on Backpage.com. Bangura also transported the minor to hotels in Virginia, Maryland, and Washington, D.C. for prostitution dates, and he took a portion of the money she made from commercial sex customers. Bangura also used a phone to record a video of himself having sex with the minor. In August 2017, Hood was convicted at trial of sex trafficking and conspiracy to engage in sex trafficking of this same minor. This matter was investigated by the FBI Washington Field Office’s Child Exploitation and Human Trafficking Task Force with assistance from the Washington, D.C. Metropolitan Police Department and the Prince William County Police Department. Assistant U.S. Attorney Maureen C. Cain of the Eastern District of Virginia and Trial Attorney Kyle P. Reynolds of the Criminal Division’s Child Exploitation and Obscenity Section are prosecuting the case. This case was brought as part of Project Safe Childhood, a nationwide initiative launched in May 2006 by the Department of Justice to combat the growing epidemic of child sexual exploitation and abuse. Led by U.S. Attorneys’ Offices and the Child Exploitation and Obscenity Section (CEOS), Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2018-07-20T00:00:00-04:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Virginia, Eastern",0.077,0.839,0.084,-0.0708
713,16-278,Virginia Music Volunteer Convicted of Production of Child Pornography,"A Virginia man who served as a volunteer with the music program at Grace E. Metz Middle School in Manassas, Virginia, was found guilty today by a federal jury of four counts of production of child pornography, one count of attempted coercion/enticement of a minor, one count of distribution of child pornography and two counts of receipt of child pornography. Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division, U.S. Attorney Dana J. Boente of the Eastern District of Virginia, Special Agent in Charge Clark E. Settles of U.S. Immigration and Customs Enforcement’s Homeland Security Investigations (HSI) Washington, D.C., and Chief Douglas Keen of the Manassas City Police Department made the announcement. According to evidence presented at trial, David Alexander Battle II, 24, of Manassas, used his home computer to share images of child sexual exploitation via webcam on a chat website in April 2015. Battle also posed as a minor girl on another chat platform and chatted with minor boys, including two boys he personally knew, coercing and enticing them to send him sexually explicit images of themselves. The trial evidence also showed Battle’s laptop contained gigabytes of child sexual exploitation files. Trial Attorney Lauren Britsch of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jay Prabhu of the Eastern District of Virginia are prosecuting the case. HSI and the Manassas City Police Department investigated the case, with assistance from the Herndon, Virginia, Police Department and the Northern Virginia/Washington, D.C., Internet Crimes Against Children Task Force. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-03-10T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Virginia, Eastern",0.073,0.816,0.111,0.9022
714,16-672,Virginia Music Volunteer Sentenced to 300 Months in Prison for Production of Child Pornography,"A Virginia man who served as a volunteer with the music program at Grace E. Metz Middle School in Manassas, Virginia, was sentenced today to 25 years in prison for production of child pornography, attempted coercion and enticement of a minor, and distribution and receipt of child pornography. Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division, U.S. Attorney Dana J. Boente of the Eastern District of Virginia, Special Agent in Charge Clark E. Settles of U.S. Immigration and Customs Enforcement’s Homeland Security Investigations (HSI) Washington, D.C., and Chief Douglas Keen of the Manassas City Police Department made the announcement. David Alexander Battle II, 24, of Manassas, was sentenced by U.S. District Judge Claude M. Hilton of the Eastern District of Virginia, who also ordered Battle to serve 15 years of supervised release. Battle was convicted by a federal jury on March 10, 2016. According to evidence presented at trial, Battle used his home computer to share images of child sexual exploitation via webcam on a chat website in April 2015. Battle also posed as a minor girl on another chat platform and chatted with minor boys, including two boys he personally knew, coercing and enticing them to send him sexually explicit images of themselves. The trial evidence also showed Battle’s laptop contained gigabytes of child sexual exploitation files. HSI and the Manassas City Police Department investigated the case, with assistance from the Herndon, Virginia, Police Department and the Northern Virginia/Washington, D.C., Internet Crimes Against Children Task Force. Trial Attorney Lauren Britsch of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jay Prabhu of the Eastern District of Virginia are prosecuting the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-06-10T00:00:00-04:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Virginia, Eastern",0.093,0.804,0.102,0.5574
715,18-775,Wisconsin Man Indicted for Producing Child Pornography Outside of the United States,"A Wisconsin man was charged in an indictment yesterday with the crimes of producing and possessing child pornography and engaging in illicit sexual conduct in a foreign place, announced Acting Assistant Attorney General John P. Cronan of the Justice Department’s Criminal Division and U.S. Attorney Matthew D. Krueger of the Eastern District of Wisconsin. Jeffrey H. Ernisse, 61, is currently incarcerated for state offenses related to child exploitation at the Red Granite Correctional Institution in Wisconsin. A grand jury in the U.S. District Court for the Eastern District of Wisconsin indicted Ernisse on two counts of producing child pornography, two counts of producing child pornography outside of the United States, one count of engaging in illicit sexual conduct with a minor in the Philippines and one count of possessing child pornography. According to the indictment, on or about March 10, 2015 and then again, on or about April 7, 2015, Ernisse used a minor to engage in sexually explicit conduct for the purpose of producing child pornography. Between approximately June 17, 2014, and approximately April 11, 2015, Ernisse engaged in illicit sexual conduct with a minor in the Republic of the Philippines. And on or about Dec. 18, 2015, Ernisse possessed child pornography. The charges contained in the indictment are merely allegations. The defendant is presumed innocent until proven guilty beyond a reasonable doubt in a court of law. U.S. Immigration and Customs Enforcement’s Homeland Security Investigations (HSI) is investigating this case with the cooperation of the Sheboygan, Wisconsin, Police Department. Trial Attorney William M. Grady of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorneys Megan J. Paulson and Penelope L. Coblentz of the Eastern District of Wisconsin are prosecuting the case. This investigation is a part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state, and local resources to better locate, apprehend, and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2018-06-13T00:00:00-04:00,Project Safe Childhood,Criminal Division; Criminal - Child Exploitation and Obscenity Section,0.076,0.805,0.120,0.9460


D. With the dataframe from part C, find the mean compound sentiment score for each of the three topics in `topics_clean` using group_by and agg.

E. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)


In [28]:
## agg and find the mean compound score by topic
comp_mean = doj_subset.groupby("topics_clean").agg({"sent_comp": "mean"})
comp_mean

Unnamed: 0_level_0,sent_comp
topics_clean,Unnamed: 1_level_1
Civil Rights,0.154595
Hate Crimes,-0.884882
Project Safe Childhood,-0.245364


# 2. Topic modeling (25 points)

For this question, use the `doj_subset_wscores` data that is restricted to civil rights, hate crimes, and project safe childhood and with the sentiment scores added


## 2.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in a single raw string in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem

- Returns a joined preprocessed string
    
B. Use `apply` or list comprehension to execute that function and create a new column in the data called `processed_text`
    
C. Print the `id`, `contents`, and `processed_text` columns for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
    
**Resources**:

- Here's code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
- Here's code with topic modeling steps: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb

In [29]:
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                       "trial", "assistance", "assist"]

In [35]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/spencer/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [44]:
## your code defining a text processing function
def preprocessing(string):
    default_stopwords = stopwords.words("english")
    total_stopwords = default_stopwords + custom_doj_stopwords
    
    lower = string.lower()
    wordpunct_tokenize(string)
    string_nostop = [word
                     for word in wordpunct_tokenize(string)
                     if word not in total_stopwords]
    
    string_alpha = [word
                    for word in string_nostop
                    if word.isalpha() == True]
    
    string_bigwords = [word
                       for word in string_alpha
                       if len(word) >= 4]
    
    string_stemmed = []
    snow_stemmer = SnowballStemmer(language='english')
    for word in string_bigwords:
        x = snow_stemmer.stem(word)
        string_stemmed.append(x)
    
    string_clean = ""
    for word in string_stemmed:
        string_clean = string_clean + word + " "
    
    return string_clean

In [49]:
## your code executing the function
doj_subset_wscore["processed_text"] = doj_subset_wscore["contents"].apply(preprocessing)
doj_subset_wscore

Unnamed: 0,id,title,contents,date,topics_clean,components_clean,sent_neg,sent_neu,sent_pos,sent_comp,processed_text
0,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,"A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened. James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack. Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up. “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division. “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.” “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.” This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section.",2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle",0.169,0.763,0.068,-0.9893,former supervisori correct offic louisiana state penitentiari angola louisiana plead guilti yesterday connect beat handcuf shackl inmat addit conspir cover misconduct falsifi offici record lie intern investig happen jame savoy marksvill louisiana admit plea hear wit offic use excess forc inmat fail interven conspir offic cover beat engag varieti obstruct act person falsifi offici prison record cover attack scotti kennedi beeb arkansa john sander marksvill louisiana previous plead guilti novemb septemb role beat cover everi citizen right process protect unreason forc correct offic violat basic constitut must held account egregi action said act assist attorney general john gore civil right divis justic depart continu vigor prosecut correct offic violat public trust commit crime cover violat feder crimin yesterday anoth exampl unwav commit pursu violat feder crimin law said act unit state attorney middl district louisiana corey amundson continu work close justic depart civil right divis ensur this investig baton roug resid agenc prosecut assist attorney frederick menner middl district louisiana trial attorney christoph perra civil right divis crimin section
1,17-240,Anoka County Resident Sentenced to Six Months in Prison for Threatening Two Clinics that Provide Reproductive Health Services,"On, Feb. 27, 2017, Michael John Harris, 34, was sentenced to six months imprisonment and one year of supervised release for making telephonic threats to two medical clinics in Minneapolis, Minnesota, that provide reproductive health services. On March 2, 2016, Harris pleaded guilty to two violations of 18 U.S.C. § 248(a)(1). During his plea hearing, Harris admitted that on May 12, 2014, he made telephonic threats to two different health clinics in Minneapolis that provide reproductive health services. In a call to the first clinic, Harris threatened to kill the recipient of the call with his bare hands and to cut the recipient’s head off with a band saw. In a call to the second clinic, Harris told the recipient that he was going to kill the recipient and the recipient’s co-workers, and that he was going to travel to the clinic and shoot everyone present. Harris further admitted that he made these threats because the recipient was and has been, and in order to intimidate the recipient and any other person from, obtaining and providing reproductive health services. “This defendant threatened these clinic workers with death and brutality,” said Acting Assistant Attorney General Tom Wheeler of the Justice Department’s Civil Rights Division. “The department is pleased that the defendant accepted responsibility and will face consequences for his actions. The Department is committed to vigorously enforcing the civil rights of all individuals in this country.” “The violence threatened by this defendant against health care workers is unacceptable,” said United States Attorney Andrew M. Luger of the District of Minnesota. “This sentence should serve as a reminder to individuals who would engage in such threats that the federal government will prosecute these crimes.” This case is being investigated by the Federal Bureau of Investigation, and is being prosecuted by Trial Attorney Risa Berkower of the Civil Rights Division of the United States Department of Justice and Assistant U.S. Attorney Manda M. Sertich of the U.S. Attorney’s Office for the District of Minnesota.",2017-03-02T00:00:00-05:00,Civil Rights,Civil Rights Division; Civil Rights - Criminal Section; Federal Bureau of Investigation (FBI); USAO - Minnesota,0.146,0.774,0.079,-0.9814,michael john harri sentenc month imprison year supervis releas make telephon threat medic clinic minneapoli minnesota provid reproduct health servic march harri plead guilti violat dure plea hear harri admit made telephon threat differ health clinic minneapoli provid reproduct health servic call first clinic harri threaten kill recipi call bare hand recipi head band call second clinic harri told recipi go kill recipi recipi worker go travel clinic shoot everyon present harri admit made threat recipi order intimid recipi person obtain provid reproduct health servic this defend threaten clinic worker death brutal said act assist attorney general wheeler justic depart civil right divis pleas defend accept respons face consequ action depart commit vigor enforc individu countri violenc threaten defend health care worker unaccept said unit state attorney andrew luger district minnesota this sentenc serv remind individu would engag threat feder govern prosecut crime this investig feder bureau investig prosecut trial attorney risa berkow civil right divis unit state depart justic assist attorney manda sertich attorney offic district minnesota
2,17-379,Arson Awareness Week 2017 to Focus on Preventing Arson at Houses of Worship,"The Justice Department today announced that its Civil Rights Division is partnering with the Federal Emergency Management Agency’s U.S. Fire Administration on this year’s Arson Awareness Week, May 7-13, with a focus on Preventing Arson at Houses of Worship. There were an average of 103 arsons of houses of worship per year from 2000 to 2015. Half of all reported fires at houses of worship turn out to involve arson. The Department of Justice enforces a number of federal statutes protecting places of worship from attack, including 18 U.S.C. § 247, known as the Church Arson Prevention Act, which was passed in the 1990s in response to a sharp increase in church arsons. That law makes it a federal crime to target religious property because of the religion or race of the congregation. In February of this year, the Department indicted an Idaho man under § 247 alleging that he set fire to a Catholic Church in Bonner’s Ferry in April 2016. In 2013, an Indiana man was sentenced to 20 years imprisonment for setting a fire at the Islamic Center of Greater Toledo. FEMA and the Department of Justice have produced a number of materials to help congregations, community organizations and local law enforcement and fire safety officials to increase arson awareness and hold events highlighting proactive steps that can be taken to try to reduce house of worship arson. These materials are available at the Arson Awareness Week homepage, www.usfa.fema.gov/aaw. “Arson against houses of worship is a serious crime that the Department of Justice is committed to prosecuting to the fullest extent of the law,” said Acting Assistant Attorney General Tom Wheeler of the Justice Department’s Civil Rights Division. “But our role as prosecutors, while critically important, only comes after the fact when the damage is already done. That is why we encourage communities and local officials to take proactive steps to increase public awareness of the problem and measures that can be taken to reduce the likelihood of being a victim of house of worship arson.” Further information about hate crimes, including arsons against on places of worship, is available at the Civil Rights Division hate crimes page, https://www.justice.gov/crt/hate-crimes-0.",2017-04-10T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Criminal Section,0.097,0.751,0.152,0.9623,justic depart today announc civil right divis partner feder emerg manag agenc fire administr year arson awar week focus prevent arson hous worship there averag arson hous worship year half report fire hous worship turn involv arson depart justic enforc number feder statut protect place worship attack includ known church arson prevent pass respons sharp increas church arson that make feder crime target religi properti religion race congreg februari year depart indict idaho alleg fire cathol church bonner ferri april indiana sentenc year imprison set fire islam center greater toledo fema depart justic produc number materi help congreg communiti organ local enforc fire safeti offici increas arson awar hold event highlight proactiv step taken reduc hous worship arson these materi avail arson awar week homepag usfa fema arson hous worship serious crime depart justic commit prosecut fullest extent said act assist attorney general wheeler justic depart civil right divis role prosecutor critic import come fact damag alreadi done that encourag communiti local offici take proactiv step increas public awar problem measur taken reduc likelihood victim hous worship arson further inform hate crime includ arson place worship avail civil right divis hate crime page https hate crime
3,18-121,Attorney General Issues National Slavery and Human Trafficking Prevention Month Proclamation,"Attorney General Jeff Sessions issued the following proclamation commemorating January as National Slavery and Human Trafficking Prevention Month: “Human trafficking is a nationwide public health and civil rights crisis. Its victims are everywhere: at truck stops, in cities, in rural areas, and in suburbs, and who now total an unconscionable 25 million victims globally according to some estimates. That means 25 million human beings—parents, siblings, and children—have been coerced into a commercial sex act, forced into labor, or exploited because they desperately seek a better life. It is a priority of the Department of Justice to combat this depraved and predatory behavior through swift and aggressive enforcement of our nation’s laws to bring traffickers to justice and restore the lives of victims and survivors. “The Justice Department’s U.S. Attorneys’ Offices, working closely with the Federal Bureau of Investigation (FBI), other federal agencies, and our state, local, and tribal partners, are on the front lines, leading our shared fight against human trafficking in all its forms. These entities are supported by the Department’s Civil Rights Division which is home to a team of dedicated investigators and prosecutors—the Human Trafficking Prosecution Unit (the HTPU)—tasked with bringing human traffickers to justice and vindicating the rights of their victims. Additionally, the Department’s Criminal Division includes the Child Exploitation and Obscenity Section (CEOS), which is committed to harnessing expertise in attacking the technological and systemic challenges that are involved in the sexual exploitation of minors, as well as other specialized prosecution teams who bring expertise in organized crime and money laundering. “Our efforts have produced high-impact prosecutions to dismantle transnational organized human trafficking enterprises, have launched interagency anti-trafficking initiatives with unprecedented momentum, and have vindicated the rights and freedoms of countless victims and survivors. “These efforts resulted in the conviction of nearly 500 defendants in trafficking cases in fiscal year 2017, and making $47 million available to help trafficking survivors. Last fall, the FBI—along with state and local task forces and international law enforcement partners—recovered 84 minors and arrested 120 traffickers, as part a single week-long operation. However, we are keenly aware that many challenges lie ahead and we are committed to taking our efforts to the next level. “In his Presidential Proclamation, President Trump asked us to ‘recommit ourselves to eradicating the evil of enslavement’ and to ‘pledge to do all in our power to end the horrific practice of human trafficking.’ In the spirit of the President’s request, the Justice Department is hosting a Human Trafficking Summit in Washington, D.C. on February 2, 2018, two days before Super Bowl LII. The Super Bowl provides an opportunity to raise awareness of the surge in commercial sex activity around major sporting events, and of our commitment to finding and protecting sex trafficking victims who are at risk of being compelled, coerced, or exploited as minors in that context. “The Human Trafficking Summit will be led by Associate Attorney General Rachel Brand and will convene law enforcement, victim support organizations, and the business community to focus on enhancing the strong partnerships behind all successful anti-trafficking efforts and identifying opportunities to increase collaboration and coordination as we take on new challenges. “There is no room in a civilized society for those who choose to violate an individual’s rights and freedoms by subjecting them to any form of human trafficking. To those that still make that choice: make no mistake, the Justice Department will use every lawful tool to uncover your illegal activity and bring you to justice.”",2018-01-31T00:00:00-05:00,Civil Rights,Civil Rights Division; Office of the Attorney General,0.131,0.737,0.132,-0.6771,attorney general jeff session issu follow proclam commemor januari nation slaveri human traffick prevent month human traffick nationwid public health crisi victim everywher truck stop citi rural area suburb total unconscion million victim global accord estim that mean million human be parent sibl children coerc commerci forc labor exploit desper seek better life prioriti depart justic combat deprav predatori behavior swift aggress enforc nation law bring traffick restor live victim survivor justic depart attorney offic work close feder bureau investig feder agenc state local tribal partner front line lead share fight human traffick form these entiti support depart civil right divis home team dedic investig prosecutor human traffick prosecut unit htpu task bring human traffick vindic victim addit depart crimin divis includ child exploit obscen section ceo commit har expertis attack technolog system challeng involv sexual exploit minor well special prosecut team bring expertis organ crime money launder effort produc high impact prosecut dismantl transnat organ human traffick enterpris launch interag anti traffick initi unpreced momentum vindic freedom countless victim survivor these effort result convict near defend traffick case fiscal year make million avail help traffick survivor last fall along state local task forc intern enforc partner recov minor arrest traffick part singl week long oper howev keen awar mani challeng ahead commit take effort next level presidenti proclam presid trump ask recommit erad evil enslav pledg power horrif practic human traffick spirit presid request justic depart host human traffick summit washington februari day super bowl super bowl provid opportun rais awar surg commerci activ around major sport event commit find protect traffick victim risk compel coerc exploit minor context human traffick summit associ attorney general rachel brand conven enforc victim support organ busi communiti focus enhanc strong partnership behind success anti traffick effort identifi opportun increas collabor coordin take challeng there room civil societi choos violat individu freedom subject form human traffick still make choic make mistak justic depart everi law tool uncov illeg activ bring
4,16-603,Attorney General Loretta E. Lynch Statement on the Case of Dylann Roof,"Attorney General Loretta E. Lynch today released the following statement regarding the United States v. Dylann Roof: “Following the department’s rigorous review process to thoroughly consider all relevant factual and legal issues, I have determined that the Justice Department will seek the death penalty. The nature of the alleged crime and the resulting harm compelled this decision.”",2016-05-24T00:00:00-04:00,Civil Rights,Office of the Attorney General,0.215,0.633,0.152,-0.7717,attorney general loretta lynch today releas follow statement regard unit state dylann roof follow rigor review process thorough consid relev factual legal issu determin justic depart seek death penalti natur alleg crime result harm compel decis
...,...,...,...,...,...,...,...,...,...,...,...
712,18-559,Virginia Man Sentenced to Over 15 Years in Prison for Sex Trafficking a Minor and Producing Child Pornography,"A Virginia man was sentenced today to 186 months in prison and 10 years of supervised release for multiple crimes related to the prostitution and exploitation of a 15-year-old minor. Assistant Attorney General Brian A. Benczkowski of the Justice Department’s Criminal Division, U.S. Attorney G. Zachary Terwilliger of the Eastern District of Virginia, and Assistant Director in Charge Nancy McNamara of the FBI’s Washington Field Office made the announcement after the sentence was handed down by U.S. District Judge Anthony J. Trenga of the Eastern District of Virginia. Abdul Karim Bangura Jr. aka “AJ”, 22, of Triangle, Virginia pleaded guilty in August 2017 to all counts of an indictment charging him with sex trafficking of a minor, conspiracy to engage in sex trafficking of a minor, interstate transportation of a minor for the purposes of prostitution, and production of child pornography. According to admissions made in connection with his plea, Bangura and his co-defendant Christian Hood conspired to recruit a 15-year-old girl to work as a prostitute and to advertise her prostitution services on Backpage.com. Bangura also transported the minor to hotels in Virginia, Maryland, and Washington, D.C. for prostitution dates, and he took a portion of the money she made from commercial sex customers. Bangura also used a phone to record a video of himself having sex with the minor. In August 2017, Hood was convicted at trial of sex trafficking and conspiracy to engage in sex trafficking of this same minor. This matter was investigated by the FBI Washington Field Office’s Child Exploitation and Human Trafficking Task Force with assistance from the Washington, D.C. Metropolitan Police Department and the Prince William County Police Department. Assistant U.S. Attorney Maureen C. Cain of the Eastern District of Virginia and Trial Attorney Kyle P. Reynolds of the Criminal Division’s Child Exploitation and Obscenity Section are prosecuting the case. This case was brought as part of Project Safe Childhood, a nationwide initiative launched in May 2006 by the Department of Justice to combat the growing epidemic of child sexual exploitation and abuse. Led by U.S. Attorneys’ Offices and the Child Exploitation and Obscenity Section (CEOS), Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2018-07-20T00:00:00-04:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Virginia, Eastern",0.077,0.839,0.084,-0.0708,virginia sentenc today month prison year supervis releas multipl crime relat prostitut exploit year minor assist attorney general brian benczkowski justic depart crimin divis attorney zachari terwillig eastern district virginia assist director charg nanci mcnamara washington field offic made announc sentenc hand district judg anthoni trenga eastern district virginia abdul karim bangura triangl virginia plead guilti august count indict charg traffick minor conspiraci engag traffick minor interst transport minor purpos prostitut product child pornographi accord admiss made connect plea bangura defend christian hood conspir recruit year girl work prostitut advertis prostitut servic backpag bangura also transport minor hotel virginia maryland washington prostitut date took portion money made commerci custom bangura also use phone record video minor august hood convict traffick conspiraci engag traffick minor this matter investig washington field offic child exploit human traffick task forc washington metropolitan polic depart princ william counti polic depart assist attorney maureen cain eastern district virginia trial attorney kyle reynold crimin divis child exploit obscen section prosecut this brought part project safe childhood nationwid initi launch depart justic combat grow epidem child sexual exploit abus attorney offic child exploit obscen section ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit
713,16-278,Virginia Music Volunteer Convicted of Production of Child Pornography,"A Virginia man who served as a volunteer with the music program at Grace E. Metz Middle School in Manassas, Virginia, was found guilty today by a federal jury of four counts of production of child pornography, one count of attempted coercion/enticement of a minor, one count of distribution of child pornography and two counts of receipt of child pornography. Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division, U.S. Attorney Dana J. Boente of the Eastern District of Virginia, Special Agent in Charge Clark E. Settles of U.S. Immigration and Customs Enforcement’s Homeland Security Investigations (HSI) Washington, D.C., and Chief Douglas Keen of the Manassas City Police Department made the announcement. According to evidence presented at trial, David Alexander Battle II, 24, of Manassas, used his home computer to share images of child sexual exploitation via webcam on a chat website in April 2015. Battle also posed as a minor girl on another chat platform and chatted with minor boys, including two boys he personally knew, coercing and enticing them to send him sexually explicit images of themselves. The trial evidence also showed Battle’s laptop contained gigabytes of child sexual exploitation files. Trial Attorney Lauren Britsch of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jay Prabhu of the Eastern District of Virginia are prosecuting the case. HSI and the Manassas City Police Department investigated the case, with assistance from the Herndon, Virginia, Police Department and the Northern Virginia/Washington, D.C., Internet Crimes Against Children Task Force. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-03-10T00:00:00-05:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Virginia, Eastern",0.073,0.816,0.111,0.9022,virginia serv volunt music program grace metz middl school manassa virginia found guilti today feder juri four count product child pornographi count attempt coercion entic minor count distribut child pornographi count receipt child pornographi assist attorney general lesli caldwel justic depart crimin divis attorney dana boent eastern district virginia special agent charg clark settl immigr custom enforc homeland secur investig washington chief dougla keen manassa citi polic depart made announc accord evid present david alexand battl manassa use home comput share imag child sexual exploit webcam chat websit april battl also pose minor girl anoth chat platform chat minor boy includ boy person knew coerc entic send sexual explicit imag evid also show battl laptop contain gigabyt child sexual exploit file trial attorney lauren britsch crimin divis child exploit obscen section ceo assist attorney prabhu eastern district virginia prosecut manassa citi polic depart investig herndon virginia polic depart northern virginia washington internet crime against children task forc this brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch depart justic attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit
714,16-672,Virginia Music Volunteer Sentenced to 300 Months in Prison for Production of Child Pornography,"A Virginia man who served as a volunteer with the music program at Grace E. Metz Middle School in Manassas, Virginia, was sentenced today to 25 years in prison for production of child pornography, attempted coercion and enticement of a minor, and distribution and receipt of child pornography. Assistant Attorney General Leslie R. Caldwell of the Justice Department’s Criminal Division, U.S. Attorney Dana J. Boente of the Eastern District of Virginia, Special Agent in Charge Clark E. Settles of U.S. Immigration and Customs Enforcement’s Homeland Security Investigations (HSI) Washington, D.C., and Chief Douglas Keen of the Manassas City Police Department made the announcement. David Alexander Battle II, 24, of Manassas, was sentenced by U.S. District Judge Claude M. Hilton of the Eastern District of Virginia, who also ordered Battle to serve 15 years of supervised release. Battle was convicted by a federal jury on March 10, 2016. According to evidence presented at trial, Battle used his home computer to share images of child sexual exploitation via webcam on a chat website in April 2015. Battle also posed as a minor girl on another chat platform and chatted with minor boys, including two boys he personally knew, coercing and enticing them to send him sexually explicit images of themselves. The trial evidence also showed Battle’s laptop contained gigabytes of child sexual exploitation files. HSI and the Manassas City Police Department investigated the case, with assistance from the Herndon, Virginia, Police Department and the Northern Virginia/Washington, D.C., Internet Crimes Against Children Task Force. Trial Attorney Lauren Britsch of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorney Jay Prabhu of the Eastern District of Virginia are prosecuting the case. This case was brought as part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state and local resources to better locate, apprehend and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2016-06-10T00:00:00-04:00,Project Safe Childhood,"Criminal Division; Criminal - Child Exploitation and Obscenity Section; USAO - Virginia, Eastern",0.093,0.804,0.102,0.5574,virginia serv volunt music program grace metz middl school manassa virginia sentenc today year prison product child pornographi attempt coercion entic minor distribut receipt child pornographi assist attorney general lesli caldwel justic depart crimin divis attorney dana boent eastern district virginia special agent charg clark settl immigr custom enforc homeland secur investig washington chief dougla keen manassa citi polic depart made announc david alexand battl manassa sentenc district judg claud hilton eastern district virginia also order battl serv year supervis releas battl convict feder juri march accord evid present battl use home comput share imag child sexual exploit webcam chat websit april battl also pose minor girl anoth chat platform chat minor boy includ boy person knew coerc entic send sexual explicit imag evid also show battl laptop contain gigabyt child sexual exploit file manassa citi polic depart investig herndon virginia polic depart northern virginia washington internet crime against children task forc trial attorney lauren britsch crimin divis child exploit obscen section ceo assist attorney prabhu eastern district virginia prosecut this brought part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch depart justic attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit
715,18-775,Wisconsin Man Indicted for Producing Child Pornography Outside of the United States,"A Wisconsin man was charged in an indictment yesterday with the crimes of producing and possessing child pornography and engaging in illicit sexual conduct in a foreign place, announced Acting Assistant Attorney General John P. Cronan of the Justice Department’s Criminal Division and U.S. Attorney Matthew D. Krueger of the Eastern District of Wisconsin. Jeffrey H. Ernisse, 61, is currently incarcerated for state offenses related to child exploitation at the Red Granite Correctional Institution in Wisconsin. A grand jury in the U.S. District Court for the Eastern District of Wisconsin indicted Ernisse on two counts of producing child pornography, two counts of producing child pornography outside of the United States, one count of engaging in illicit sexual conduct with a minor in the Philippines and one count of possessing child pornography. According to the indictment, on or about March 10, 2015 and then again, on or about April 7, 2015, Ernisse used a minor to engage in sexually explicit conduct for the purpose of producing child pornography. Between approximately June 17, 2014, and approximately April 11, 2015, Ernisse engaged in illicit sexual conduct with a minor in the Republic of the Philippines. And on or about Dec. 18, 2015, Ernisse possessed child pornography. The charges contained in the indictment are merely allegations. The defendant is presumed innocent until proven guilty beyond a reasonable doubt in a court of law. U.S. Immigration and Customs Enforcement’s Homeland Security Investigations (HSI) is investigating this case with the cooperation of the Sheboygan, Wisconsin, Police Department. Trial Attorney William M. Grady of the Criminal Division’s Child Exploitation and Obscenity Section (CEOS) and Assistant U.S. Attorneys Megan J. Paulson and Penelope L. Coblentz of the Eastern District of Wisconsin are prosecuting the case. This investigation is a part of Project Safe Childhood, a nationwide initiative to combat the growing epidemic of child sexual exploitation and abuse launched in May 2006 by the Department of Justice. Led by U.S. Attorneys’ Offices and CEOS, Project Safe Childhood marshals federal, state, and local resources to better locate, apprehend, and prosecute individuals who exploit children via the Internet, as well as to identify and rescue victims. For more information about Project Safe Childhood, please visit www.justice.gov/psc.",2018-06-13T00:00:00-04:00,Project Safe Childhood,Criminal Division; Criminal - Child Exploitation and Obscenity Section,0.076,0.805,0.120,0.9460,wisconsin charg indict yesterday crime produc possess child pornographi engag illicit sexual conduct foreign place announc act assist attorney general john cronan justic depart crimin divis attorney matthew krueger eastern district wisconsin jeffrey erniss current incarcer state offens relat child exploit granit correct institut wisconsin grand juri district court eastern district wisconsin indict erniss count produc child pornographi count produc child pornographi outsid unit state count engag illicit sexual conduct minor philippin count possess child pornographi accord indict march april erniss use minor engag sexual explicit conduct purpos produc child pornographi between approxim june approxim april erniss engag illicit sexual conduct minor republ philippin erniss possess child pornographi charg contain indict mere alleg defend presum innoc proven guilti beyond reason doubt court immigr custom enforc homeland secur investig investig cooper sheboygan wisconsin polic depart trial attorney william gradi crimin divis child exploit obscen section ceo assist attorney megan paulson penelop coblentz eastern district wisconsin prosecut this part project safe childhood nationwid initi combat grow epidem child sexual exploit abus launch depart justic attorney offic ceo project safe childhood marshal feder state local resourc better locat apprehend prosecut individu exploit children internet well identifi rescu victim inform project safe childhood pleas visit


In [66]:
## your code showing the examples
doj_subset_wscore_print = doj_subset_wscore[doj_subset_wscore["id"] == "16-718"]
doj_subset_wscore_print = doj_subset_wscore_print.append(doj_subset_wscore[doj_subset_wscore["id"] == "16-217"])
doj_subset_wscore_print = doj_subset_wscore_print[["id", "contents", "processed_text"]]
doj_subset_wscore_print

$$$
$$$
$$$
$$$


Unnamed: 0,id,contents,processed_text
293,16-718,"In a nine-count indictment unsealed today, two Mississippi correctional officers were charged with beating an inmate and a third was charged with helping to cover it up. The indictment charged Lawardrick Marsher, 28, and Robert Sturdivant, 47, officers at Mississippi State Penitentiary, in Parchman, Mississippi, with a beating that included kicking, punching and throwing the victim to the ground. Marsher and Sturdivant were charged with violating the right of K.H., a convicted prisoner, to be free from cruel and unusual punishment. Sturdivant was also charged with failing to intervene while Marsher was punching and beating K.H. The indictment alleges that their actions involved the use of a dangerous weapon and resulted in bodily injury to the victim. A third officer, Deonte Pate, 23, was charged along with Marsher and Sturdivant for conspiring to cover up the beating. The indictment alleges that all three officers submitted false reports and that all three lied to the FBI. If convicted, Marsher and Sturdivant face a maximum sentence of 10 years in prison on the excessive force charges. Each of the three officers faces up to five years in prison on the conspiracy and false statement charges, and up to 20 years in prison on the false report charges. An indictment is merely an accusation, and the defendants are presumed innocent unless and until proven guilty. This case is being investigated by the FBI’s Jackson Division, with the cooperation of the Mississippi Department of Corrections. It is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorney Dana Mulhauser of the Civil Rights Division’s Criminal Section. Marsher Indictment",nine count indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit fals report three lie convict marsher sturdiv face maximum sentenc year prison excess forc charg each three offic face five year prison conspiraci fals statement charg year prison fals report charg indict mere accus defend presum innoc unless proven guilti this investig jackson divis cooper mississippi depart correct prosecut assist attorney robert coleman northern district mississippi trial attorney dana mulhaus civil right divis crimin section marsher indict
164,16-217,"The Justice Department has reached a comprehensive settlement agreement with the city of Miami and the Miami Police Department (MPD) resolving the Justice Department’s investigation of officer-involved shootings by MPD officers, announced Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department’s Civil Rights Division and U.S. Attorney Wifredo A. Ferrer of the Southern District of Florida. The settlement, which was approved by Miami’s city commission today and will go into effect when the agreement is signed by all parties, resolves claims stemming from the Justice Department’s investigation into officer-involved shootings by MPD officers, which was conducted under the Violent Crime Control and Law Enforcement Act of 1994. The investigation’s findings, issued in July 2013, identified a pattern or practice of excessive use of force through officer-involved shootings in violation of the Fourth Amendment of the Constitution. The city’s compliance with the settlement will be monitored by an independent reviewer, former Tampa, Florida, Police Chief Jane Castor. Under the settlement agreement, the city will implement comprehensive reforms to ensure constitutional policing and support public trust. The settlement agreement is designed to minimize officer-involved shootings and to more effectively and quickly investigate officer-involved shootings that do occur, through measures that include: “This settlement represents a renewed commitment by the city of Miami and Chief Rodolfo Llanes to provide constitutional policing for Miami residents and to protect public safety through sustainable reform,” said Principal Deputy Assistant Attorney General Gupta. “The agreement will help to strengthen the relationship between the MPD and the communities they serve by improving accountability for officers who fire their weapons unlawfully, and provides for community participation in the enforcement of this agreement.” “Today's agreement is the result of a joint effort between the Department of Justice and the City of Miami to ensure that the Miami Police Department continues its efforts to make our community safe while protecting the sacred Constitutional rights of all of our citizens,” said U.S. Attorney Ferrer. “Through oversight and communication, the agreement seeks to make permanent the positive changes that former Chief Orosa and Chief Llanes have made, and we applaud the City Commission’s vote.” The settlement agreement builds upon important reforms implemented by the city since the Justice Department issued its findings, including: The investigation was conducted by attorneys and staff from the Civil Rights Division’s Special Litigation Section and the Civil Division of the U. S. Attorney’s Office of the Southern District of Florida.",justic depart reach comprehens settlement agreement citi miami miami polic depart resolv justic depart offic involv shoot offic announc princip deputi assist attorney general vanita gupta head justic depart civil right divis attorney wifredo ferrer southern district florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem justic depart offic involv shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc offic involv shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor under settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim offic involv shoot effect quick investig offic involv shoot occur measur includ this settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi assist attorney general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc agreement today agreement result joint effort depart justic citi miami ensur miami polic depart continu effort make communiti safe protect sacr constitut citizen said attorney ferrer through oversight communic agreement seek make perman posit chang former chief orosa chief llane made applaud citi commiss vote settlement agreement build upon import reform implement citi sinc justic depart issu find includ conduct attorney staff civil right divis special litig section civil divis attorney offic southern district florida


## 2.2 Create a document-term matrix from the preprocessed press releases and to explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the following columns: `id`, `compound` sentiment column you added, and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so the most positive sentiment)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so the most negative sentiment)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. Print the top 10 words for press releases in each of the three `topics_clean`

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data. There are different ways to structure it but one way is to feed it subsetted data (so data subsetted to one topic etc.) and for it to get the top words for that subset.

**Resources**:

- Here contains an example of applying the `create_dtm` function: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb


In [None]:
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase = True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), 
        columns=vectorizer.get_feature_names())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis = 1)
    return(dtm_dense_named_withid)

In [None]:
# your code here

## 2.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.3.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

**Hints and Resources**:

- Same topic modeling resources linked to above
- Make sure to use the `random_state` argument within the model so that the numbering of topics does not move around between runs of your code

In [None]:
# your code here

## 2.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, use the argument `minimum_probability` = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the `doj_subset_wscores` dataframe

B. Add the topic probabilities to the `doj_subset_wscores` dataframe as columns and create a column, `top_topic`, that reflects each document to its highest-probability topic (eg topic 1, 2, or 3)

C. For each of the manual labels in `topics_clean` (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)

**Resources**:

- End of this code (`Additional summaries of topics and documents`) contains example of how to use `get_document_topics` and other steps to add topic probabilities back to data: https://github.com/jhaber-zz/QSS20_public/blob/main/activities/solutions/05_textasdata_partII_topicmodeling_solutions.ipynb
- If you're getting errors, use `shape`, `len`, and other commands to check the dimensionality of things at different steps 

In [None]:
## your code here to get doc-level topic probabilities 

In [None]:
## your code here to add those topic probabilities to the dataframe

In [None]:
## your code here to summarize the topic proportions for each of the topics_clean 

# 3. Extend the analysis from unigrams to bigrams (10 points)

In the previous question, you found top words via a unigram representation of the text. Now, we want to see how those top words change with bigrams (pairs of words)

A. Using the `doj_subset_wscore` data and the `processed_text` column (so the words after stemming/other preprocessing), create a column in the data called `processed_text_bigrams` that combines each consecutive pairs of word into a bigram separated by an underscore. Eg:

"depart reach settlem" would become "depart_reach reach_settlem"

Do this by writing a function `create_bigram_onedoc` that takes in a single `processed_text` string and returns a string with its bigrams structured similarly to above example
 
**Hint**: there are many ways to solve but `zip` may be helpful: https://stackoverflow.com/questions/21303224/iterate-over-all-pairs-of-consecutive-items-in-a-list

B. Print the `id`, `processed_text`, and `processed_text_bigram` columns for press release with id = 16-217

In [None]:
## your code here 

C. Use the create_dtm function and the `processed_text_bigrams` column to create a document-term matrix (`dtm_bigram`) with these bigrams. Keep the following three columns in the data: `id`, `topics_clean`, and `compound` 

D. Print the (1) dimensions of the `dtm` matrix from question 2.2  and (2) the dimensions of the `dtm_bigram` matrix. Comment on why the bigram matrix has more dimensions than the unigram matrix 

E. Find and print the 10 most prevelant bigrams for each of the three topics_clean using the `get_topwords` function from 2.2

In [None]:
# your code here

# 4. Optional extra credit (2 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings.

Find the top two pairs (so four press releases total)-- do they seem like different stages of the same crime or just press releases covering similar crimes?

In [None]:
# your code here 