# 0. Imports

In [1]:
import nltk
#nltk.download('punkt')

In [2]:
## helpful packages
import pandas as pd
import numpy as np
import random
import re

## nltk imports
import nltk
### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
#nltk.download('averaged_perceptron_tagger')
#nltk.download('stopwords')
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

## spacy imports
import spacy
### uncomment and run the below line if you haven't loaded the en_core_web_sm library yet
#python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"


# 2. Text analysis of DOJ press releases

For background, here's the Kaggle that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 

Here's the code the dataset owner used to scrape those press releases here if you're interested: https://github.com/jbencina/dojreleases

In [3]:
## run this code to load the unzipped json file and convert to a dataframe
## and convert some of the things from lists to values
doj = pd.read_json("../../combined.json", lines = True)

## due to json, topics are in a list so remove them and concatenate with ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 'components_clean']].copy()

doj.head()

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",2014-10-01T00:00:00-04:00,No topic,National Security Division (NSD)
1,12-919,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,2012-07-25T00:00:00-04:00,No topic,Environment and Natural Resources Division
2,11-1002,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,2011-08-03T00:00:00-04:00,No topic,Environment and Natural Resources Division
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,2010-01-08T00:00:00-05:00,No topic,Environment and Natural Resources Division
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,Environment,Environment and Natural Resources Division


## 2.1 NLP on one press release (10 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to to convert it from a pandas series to a single string.


- Part of speech tagging- extract verbs and sort from most occurrences to least occurrences
- Named entity recognition --- what are the different organizations mentioned? how would you like to make more granular?
- Sentence level versus document-level sentiment scoring

- For sentence level scoring, print a few top positive and top negative. Does the automatic classifier seem to work?


### 2.1.1: part of speech tagging (3 points)

A. Preprocess the press release to remove all punctuation / digits (so can subset to one_word.isalpha())

B. Then, use part of speech tagging within nltk to tag all the words in that one press release with their part of speech. 

C. Finally, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print the 5 most frequent adjectives. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for .isalpha(): https://www.w3schools.com/python/ref_string_isalpha.asp
- `processtext` function here has an example of tokenizing and filtering to words where .isalpha() is true: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_solution.ipynb 
- Part of speech tagging section of this code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partI_textmining_examplecode.ipynb



In [4]:
pd.set_option("max_colwidth", 10000)
#prevents string in 'contents' from getting cut off
doj2_1 = doj[doj['id']== "17-1204"]
string2_1 = str(doj2_1["contents"])
list_2_1 = [word for word in wordpunct_tokenize(string2_1) if word.isalpha()]
words_2_1 = " ".join(list_2_1)
#words_2_1
tokens = word_tokenize(words_2_1) 
tokens_pos = pos_tag(tokens)
all_adjectives = [one_tok[0] for one_tok in tokens_pos 
                if one_tok[1] == "JJ" or one_tok[1] == "JJR" or one_tok[1] == "JJS"]
all_adjectives_df = pd.DataFrame(all_adjectives)
df = all_adjectives_df.value_counts()
df.head()

former        8
opioid        5
nationwide    4
other         3
addictive     3
dtype: int64

In [None]:
## rj note - good job!

### 2.1.2 named entity recognition (3 points)


A. Using the alpha-only press release you created in the previous step, use spaCy to extract all named entities from the press release

B. Print all the named entities along with their tag

C. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these.

D. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convincted after this indictment.

**Resources**:

- Named entity recognition part of this code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partI_textmining_examplecode.ipynb
- re.search and re.findall examples here for filtering to ones containing year (multiple approaches; some need not involve `re`): https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/04_basicregex_formerging.ipynb 

In [5]:
#words_2_1
spacy_words21 = nlp(words_2_1)
#for one_tok in spacy_words21.ents:
    #print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)
#^these two lines are for Part B, I commented them out to save space while debugging the rest of this cell
print("Part C")
for one_tok in spacy_words21.ents:
    if one_tok.label_ == "DATE" and "year" in one_tok.text: print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)
[re.findall(r".{,80}" + r"year" + r".{,80}", words_2_1)]
#The CEO if convicted may face up to five years in prison and three years of parole

Part C
Entity: last year; NER tag: DATE
Entity: three years; NER tag: DATE
Entity: five years; NER tag: DATE
Entity: three years; NER tag: DATE


[['g breakthrough pain More than Americans died of synthetic opioid overdoses last year and millions are addicted to opioids And yet some medical professionals would r',
  'cy to commit mail and wire fraud each provide for a sentence of no greater than years in prison three years of supervised release and a fine of or twice the amount ',
  'to violate the Anti Kickback Law provide for a sentence of no greater than five years in prison three years of supervised release and a fine Sentences are imposed b']]

In [8]:
## rj note- close! small deduction since there was
## a discussion of a 20 year sentence; see solutions!

### 2.1.3 Sentiment analysis (4 points)

A. Use a `SentimentIntensityAnalyzer` and `polarity_scores` to score the entire press release for its sentiment (you can go back to the raw string of the press release without punctuation/digits removed)

B. Remove all named entities from the string and score the sentiment of the press release without named entities. Did the neutral score go up or down relative to the version of the press release containing named entities? Why do you think this occurred?

C. With the version of the string that removes named entities, try to split the press release into discrete sentences (hint: re.split() may be useful since it allows or conditions in the pattern you're looking for). Print the first 5 sentences of the split press release (there will not be deductions if there remain some erroneous splits; just make sure it's generally splitting)

D. Score each sentence in the split press release and print the top 5 sentences in the press release with the most negative sentiment (use the `neg` score- higher values = more negative). **Hint**: you can use pd.DataFrame to rowbind a list of dictionaries; you can then add the press release sentence for each row back as a column in that dataframe and use sort_values()                                                  
                
**Resources**:

- Sentiment analysis section of this script: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partI_textmining_examplecode.ipynb

- Discussion of using `re.split()` to split on multiple delimiters: https://stackoverflow.com/questions/4998629/split-string-with-multiple-delimiters-in-python

In [10]:
doj2_1 = doj[doj['id']== "17-1204"]
string2_1 = str(doj2_1["contents"])
sent_obj = SentimentIntensityAnalyzer()
sentiment = sent_obj.polarity_scores(string2_1)
sentiment
spacey_words21 = nlp(string2_1)
text_no_ents = string2_1
for ent in spacey_words21.ents:
    text_no_ents = re.sub(str(ent), '', text_no_ents)
#text_no_ents
split_passage = re.split("\. | \!| \?", text_no_ents)
#split_passage[:5]
list_of_dictionaries = []
for sent in split_passage:
    sent_list = sent_obj.polarity_scores(sent)
    score = sent_list["neg"]
    dictionary = {"sentence":sent, "Score":score}
    list_of_dictionaries.append(dictionary)
dictionary_df = pd.DataFrame(list_of_dictionaries)
dictionary_df.sort_values("Score", ascending = False)
dictionary_df.head()

## rj note: no deduction but want to combine .head()
## with the sort values otherwise it's just printing the 
## unsorted ones- eg
dictionary_df.sort_values("Score", ascending = False).head(5)

{'neg': 0.141, 'neu': 0.746, 'pos': 0.113, 'compound': -0.9962}

Unnamed: 0,sentence,Score
6,“'s arrest and charges reflect our ongoing efforts to attack the opioid crisis from all angles,0.494
0,"The founder and majority owner of , was arrested and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a spray intended for cancer patients experiencing breakthrough pain. "" died of synthetic opioid overdoses , and are addicted to opioids",0.381
2,""" will not tolerate this. We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic. And under the leadership of President , we are fully committed to defeating this threat to the people.”John , , of , , a current member of , was arrested in na and charged with conspiracy, as well as other felonies, including conspiracy to commit mail and wire fraud and conspiracy to violate",0.312
8,"“The allegations of selling a highly addictive opioid cancer pain drug to patients who did not have cancer, make them no better than street-level drug dealers",0.289
12,"“We are proud to work alongside our law enforcement partners to dismantle high level prescription drug practices which directly contribute to the opioid abuse epidemic. This investigation highlights our commitment to defending our mail system from illegal misuse and ensuring public trust in the mail.”“The Department of Veterans Affairs, will continue to aggressively investigate those that attempt to fraudulently impact programs designed to benefit our veterans and their families,” said , of the VA OIG .The charges of conspiracy to commit and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than in prison, of supervised release and a fine of $, or twice the amount of pecuniary gain or loss. The charges of conspiracy to violate provide for a sentence of in prison, of supervised release and a $ fine",0.234
5,"and his company stand accused of bribing doctors to overprescribe a potent opioid and committing fraud on insurance companies solely for profit,” said Attorney",0.212
4,"Gurry, , of , , conspired to bribe practitioners in various states, many of whom operated pain clinics, in order to get them to prescribe a fentanyl-based pain medication. The medication, called “Subsys,” is a powerful narcotic intended to treat cancer patients suffering intense breakthrough pain. In exchange for bribes and kickbacks, the practitioners wrote large numbers of prescriptions for the patients, most of whom were not diagnosed with cancer.The indictment also alleges that and the former executives conspired to mislead and defraud health insurance providers who were reluctant to approve payment for the drug when it was prescribed for non-cancer patients. They achieved this goal by setting up the “reimbursement unit,” which was dedicated to obtaining prior authorization directly from insurers and pharmacy benefit managers. “In the midst of a nationwide opioid epidemic that has reached crisis proportions, Mr",0.153
9,"'s charges mark an important step in holding pharmaceutical executives responsible for their part in the opioid crisis. The will vigorously investigate corrupt organizations with business practices that promote fraud with a total disregard for patient safety.”“These executives allegedly fueled the opioid epidemic by paying doctors to needlessly prescribe an extremely dangerous and addictive form of fentanyl,” said , of the Department of Health and Human Services. “Corporate executives intent on illegally driving up profits need to be aware they are now squarely in the sights of law enforcement.”“As alleged, executives improperly influenced health care providers to prescribe a powerful opioid for patients who did not need it, and without complying with requirements, thus putting patients at risk and contributing to the current opioid crisis,” said , , Office of Criminal Investigations’",0.136
13,"Sentences are imposed by a federal district court judge based upon the Sentencing Guidelines and other statutory factors.The investigation was conducted by a team that included the ; ; Office of Criminal Investigations; the ; ; , ; ; the Postal Inspection Service; the Postal Service ; and . The Attorney’s Office would like to acknowledge the cooperation and assistance of the Attorney’s Offices around the country engaged in parallel investigations, including , of gan, of , of , of , and the of New Hampshire. The efforts of the Central of rnia and Civil Fraud Section of are also greatly appreciated. Assistant Attorneys , Chief of , and , of , are prosecuting the case.The details contained in the charging documents are allegations. The defendants are presumed innocent unless and until proven guilty beyond a reasonable doubt.\nName: contents, dtype: object",0.097
7,"We must hold the industry and its leadership accountable - just as we would the cartels or a street-level drug dealer.”“As alleged, these executives created a corporate culture at that utilized deception and bribery as an acceptable business practice, deceiving patients, and conspiring with doctors and insurers,” said , , Field Division",0.092


Unnamed: 0,sentence,Score
0,"The founder and majority owner of , was arrested and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a spray intended for cancer patients experiencing breakthrough pain. "" died of synthetic opioid overdoses , and are addicted to opioids",0.381
1,"And yet some medical professionals would rather take advantage of the addicts than try to help them,"" said Attorney General",0.0
2,""" will not tolerate this. We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic. And under the leadership of President , we are fully committed to defeating this threat to the people.”John , , of , , a current member of , was arrested in na and charged with conspiracy, as well as other felonies, including conspiracy to commit mail and wire fraud and conspiracy to violate",0.312
3,", the former Executive Chairman of and CEO of , will appear in federal court in . He will appear in in at . The superseding indictment, unsealed in , also includes additional allegations against several former executives and managers who were initially indicted in 2016.The superseding indictment charges that ; , , of , , former CEO and President of the company; , , of , , former Vice President of Sales; , , of , , former Director of Sales; former Regional Sales Directors , , of , , and , , of , ; and former Vice President of , el J",0.02
4,"Gurry, , of , , conspired to bribe practitioners in various states, many of whom operated pain clinics, in order to get them to prescribe a fentanyl-based pain medication. The medication, called “Subsys,” is a powerful narcotic intended to treat cancer patients suffering intense breakthrough pain. In exchange for bribes and kickbacks, the practitioners wrote large numbers of prescriptions for the patients, most of whom were not diagnosed with cancer.The indictment also alleges that and the former executives conspired to mislead and defraud health insurance providers who were reluctant to approve payment for the drug when it was prescribed for non-cancer patients. They achieved this goal by setting up the “reimbursement unit,” which was dedicated to obtaining prior authorization directly from insurers and pharmacy benefit managers. “In the midst of a nationwide opioid epidemic that has reached crisis proportions, Mr",0.153


Unnamed: 0,sentence,Score
6,“'s arrest and charges reflect our ongoing efforts to attack the opioid crisis from all angles,0.494
0,"The founder and majority owner of , was arrested and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a spray intended for cancer patients experiencing breakthrough pain. "" died of synthetic opioid overdoses , and are addicted to opioids",0.381
2,""" will not tolerate this. We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic. And under the leadership of President , we are fully committed to defeating this threat to the people.”John , , of , , a current member of , was arrested in na and charged with conspiracy, as well as other felonies, including conspiracy to commit mail and wire fraud and conspiracy to violate",0.312
8,"“The allegations of selling a highly addictive opioid cancer pain drug to patients who did not have cancer, make them no better than street-level drug dealers",0.289
12,"“We are proud to work alongside our law enforcement partners to dismantle high level prescription drug practices which directly contribute to the opioid abuse epidemic. This investigation highlights our commitment to defending our mail system from illegal misuse and ensuring public trust in the mail.”“The Department of Veterans Affairs, will continue to aggressively investigate those that attempt to fraudulently impact programs designed to benefit our veterans and their families,” said , of the VA OIG .The charges of conspiracy to commit and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than in prison, of supervised release and a fine of $, or twice the amount of pecuniary gain or loss. The charges of conspiracy to violate provide for a sentence of in prison, of supervised release and a $ fine",0.234


## 2.2 sentiment scoring across many press releases (10 points)


A. Subset the press releases to those labeled with one of free topics (can just do if topic_clean == that topic rather than finding where that topic is mentioned in a longer list): Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.

B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string
- Scores the sentiment of the entire press release

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- You may want to use re.escape at some point to avoid errors relating to escape characters like ( in the press release
- I used a function + list comprehension to execute and it takes about 30 seconds on my local machine and about 2 mins on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample

C. Add the scores to the `doj_subset` dataframe. Sort from highest neg to lowest neg score and print the top 5 most neg.

D. With that dataframe, find the mean compound score for each of the three topics using group_by and agg. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)

**Resources**:

- Same named entity and sentiment resources as above

In [28]:
doj_subset = doj[doj["topics_clean"] == "Civil Rights"]
doj2 = doj[doj["topics_clean"] == "Hate Crimes"]
doj3 = doj[doj["topics_clean"] == "Project Safe Childhood"]
doj_subset = doj_subset.append(doj2)
doj_subset = doj_subset.append(doj3)

In [12]:
def sent_scoring(string):
    sent_obj = SentimentIntensityAnalyzer()
    spacey_words = nlp(string)
    text_no_ents_funct = string
    for ent in spacey_words.ents:
        text_no_ents_funct = re.sub(re.escape(str(ent)), '', text_no_ents_funct)
    sentiment = sent_obj.polarity_scores(string)
    return sentiment

In [13]:
output_temp = []
pos_sent = [0] * 717
neg_sent = [0] * 717
neu_sent = [0] * 717
comp_sent = [0] * 717
i=0
for content in doj_subset['contents']:
    string_temp = str(content)
    output_temp.append(string_temp)

for entry in output_temp:
    sent = sent_scoring(entry)
    pos_sent[i] = sent["pos"]
    neg_sent[i] = sent["neg"]
    neu_sent[i] = sent["neu"]
    comp_sent[i] = sent["compound"]
    i = i + 1


doj_subset["Positive Score"] = pos_sent
doj_subset["Negative Score"] = neg_sent
doj_subset["Neutral Score"] = neu_sent
doj_subset["Compound Score"] = comp_sent
doj_subset.sort_values("Negative Score", ascending = False)
doj_subset.head()

Unnamed: 0,id,title,contents,date,topics_clean,components_clean,Positive Score,Negative Score,Neutral Score,Compound Score
11593,16-718,Three Mississippi Correctional Officers Indicted for Inmate Assault and Cover-Up,"In a nine-count indictment unsealed today, two Mississippi correctional officers were charged with beating an inmate and a third was charged with helping to cover it up. The indictment charged Lawardrick Marsher, 28, and Robert Sturdivant, 47, officers at Mississippi State Penitentiary, in Parchman, Mississippi, with a beating that included kicking, punching and throwing the victim to the ground. Marsher and Sturdivant were charged with violating the right of K.H., a convicted prisoner, to be free from cruel and unusual punishment. Sturdivant was also charged with failing to intervene while Marsher was punching and beating K.H. The indictment alleges that their actions involved the use of a dangerous weapon and resulted in bodily injury to the victim. A third officer, Deonte Pate, 23, was charged along with Marsher and Sturdivant for conspiring to cover up the beating. The indictment alleges that all three officers submitted false reports and that all three lied to the FBI. If convicted, Marsher and Sturdivant face a maximum sentence of 10 years in prison on the excessive force charges. Each of the three officers faces up to five years in prison on the conspiracy and false statement charges, and up to 20 years in prison on the false report charges. An indictment is merely an accusation, and the defendants are presumed innocent unless and until proven guilty. This case is being investigated by the FBI’s Jackson Division, with the cooperation of the Mississippi Department of Corrections. It is being prosecuted by Assistant U.S. Attorney Robert Coleman of the Northern District of Mississippi and Trial Attorney Dana Mulhauser of the Civil Rights Division’s Criminal Section. Marsher Indictment",2016-06-21T00:00:00-04:00,Civil Rights,"Civil Rights Division; Civil Rights - Criminal Section; USAO - Mississippi, Northern",0.028,0.259,0.713,-0.9968
329,14-248,Albuquerque Man Charged with Federal Hate Crime Related to Anti-Semitic Threats Against Businesswoman,"The Department of Justice announced that this morning John W. Ng, 58, of Albuquerque, N.M., made his initial appearance in federal court on a criminal complaint charging him with a hate crime offense. This charge is related to anti-Semitic threats Ng made against a Jewish woman who owns and operates the Nosh Jewish Delicatessen and Bakery in Albuquerque. Ng was arrested by the FBI on March 7, 2014, based on a criminal complaint alleging that he interfered with the victim’s federally protected rights by threatening her and interfering with her business because of her religion. According to the criminal complaint, between Jan. 22, 2014, and Feb. 8, 2014, Ng allegedly posted threatening anti-Semitic notes on and in the vicinity of the victim’s business. A criminal complaint merely establishes probable cause, and Ng is presumed innocent unless proven guilty. If convicted on the offense charged in the criminal complaint, Ng faces a maximum statutory penalty of one year in prison. This matter was investigated by the Albuquerque Division of the FBI and is being prosecuted by Assistant U.S. Attorney Mark T. Baker of the U.S. Attorney’s Office for the District of New Mexico and Trial Attorney AeJean Cha of the U.S. Department of Justice’s Civil Rights Division.",2014-03-10T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.044,0.257,0.699,-0.9943
572,13-312,Aryan Brother Inmate Sentenced for Federal Hate Crime for Assaulting Fellow Inmate,"John Hall, 27, an Aryan Brotherhood member and inmate at the Federal Correctional Institution (FCI) in Seagoville, Texas, was sentenced today by U.S. District Judge Reed O’Connor after pleading guilty to violating the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act stemming from his assault of a fellow inmate, whom he believed to be gay, the Department of Justice announced. Hall assaulted his fellow inmate with a dangerous weapon, causing bodily injury to the victim on Dec. 20, 2011. Hall was sentenced to serve 71 months in prison to be served consecutively with the sentence he is currently serving. The assault occurred on Dec. 20, 2011, inside the FCI Seagoville when Hall targeted and attacked the victim, a fellow inmate, because he believed the victim was gay or involved in a sexual relationship with another male inmate. Hall repeatedly punched, kicked and stomped on the victim’s face with his shod feet, a dangerous weapon, while yelling a homophobic slur. The victim lost consciousness during the assault and suffered multiple lacerations to his face. The victim also sustained a fractured eye socket, lost a tooth, fractured other teeth and was treated at a hospital for the injuries he sustained during Hall’s unprovoked attack. Hall pleaded guilty to violating the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act on Nov. 8, 2012. “Brutality and violence based on sexual orientation has no place in a civilized society,” said Thomas E. Perez, Assistant Attorney General for the Civil Rights Division. “The Justice Department is committed to using all the tools in our law enforcement arsenal, including the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act, to prosecute acts motivated by hate.” “This prosecution sends a clear message that this office, in partnership with attorneys in the department’s Civil Rights Division, will prioritize and aggressively prosecute hate crimes and others civil rights violations in North Texas,” said U.S. Attorney Sarah R. Saldaña of the Northern District of Texas. This case was investigated by the FBI Dallas Division. The case was prosecuted by Assistant U.S. Attorney Errin Martin and Trial Attorney Adriana Vieco of the Civil Rights Division.",2013-03-14T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.036,0.248,0.716,-0.9980
501,11-626,Arkansas Man Pleads Guilty to Federal Hate Crime Related to the Assault of Five Hispanic Men,"WASHINGTON – The Justice Department announced today that Sean Popejoy, 19, of Green Forest, Ark., pleaded guilty in federal court to one count of committing a federal hate crime and one count of conspiring to commit a federal hate crime. This is the first conviction for a violation of the Matthew Shepard and James Byrd Jr. Hate Crimes Prevention Act, which was enacted in October 2009. Information presented during the plea hearing established that in the early morning hours of June 20, 2010, Popejoy admitted that he was part of a conspiracy to threaten and injure five Hispanic men who had pulled into a gas station parking lot. The co-conspirators pursued the victims in a truck. When the co-conspirators caught up to the victims, Popejoy leaned outside of the front passenger window and waived a tire wrench at the victims and continued to threaten and hurl racial epithets at the victims. The co-conspirator rammed into the victims' car, which caused the victims’ car to cross the opposite lane of traffic, go off the road, crash into a tree and ignite. As a result of the co-conspirators’ actions, the victims suffered bodily injury, including one victim who sustained life-threatening injuries. “James Byrd, Jr. and Matthew Shepard were brutally murdered more than a decade ago, and today the first defendant is convicted for a hate crime under the critical new law enacted in their names,” said Thomas E. Perez, Assistant Attorney General for the Civil Rights Division. “It is unacceptable that violent acts of hate committed because of someone’s race continue to occur in 2011, and the department will continue to use every available tool to identify and prosecute hate crimes whenever and wherever they occur. “It is terrible and disturbing that violence motivated by hatred of another’s race continues to occur,” said Conner Eldridge, U.S. Attorney for the Western District of Arkansas. “We are committed to prosecuting such crimes in the Western District of Arkansas.” If convicted, the defendant faces a maximum punishment of 15 years in prison. This case is being investigated by the FBI’s Fayetteville Division in cooperation with the Arkansas State Police Department and the Carroll County Sheriff’s Office. The case is being prosecuted by Trial Attorney Edward Chung of the Department of Justice’s Civil Rights Division and Assistant U.S. Attorney Kyra Jenner for the Western District of Arkansas.",2011-05-16T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.032,0.242,0.725,-0.9985
10635,11-1531,Seven Ohio Men Arrested for Hate Crime Attacks Against Amish Men,"CLEVELAND – Seven Ohio men were arrested today on charges that they committed and conspired to commit religiously-motivated physical assaults in violation of the Matthew Shepard-James Byrd Hate Crimes Prevention Act. The arrests were announced today by Thomas E. Perez, Assistant Attorney General for the Civil Rights Division and Steven M. Dettelbach, U.S. Attorney for the Northern District of Ohio. The criminal complaint, filed in Cleveland, charges Samuel Mullet Sr., Johnny S. Mullet, Daniel S. Mullet, Levi F. Miller, Eli M. Miller and Emanuel Schrock, all of Bergholz, Ohio; and Lester S. Mullet, of Hammondsville, Ohio, with willfully causing bodily injury to any person, or attempting to do so by use of a dangerous weapon, because of the actual or perceived religion of that person. The maximum potential penalty for these violations is life in prison. According to the affidavit filed in support of the arrest warrants, the defendants conspired to carry out a series of assaults against fellow Amish individuals with whom they were having a religiously-based dispute. In doing so, the defendants forcibly restrained multiple Amish men and cut off their beards and head hair with scissors and battery-powered clippers, causing bodily injury to these men while also injuring others who attempted to stop the attacks. In the Amish religion, a man’s beard and head hair are sacred. This case is being investigated by the Cleveland Division of the FBI and is being prosecuted by Assistant U.S. Attorney Bridget M. Brennan of the U.S. Attorney’s Office for the Northern District of Ohio and Deputy Chief Kristy Parker of the Civil Rights Division’s Criminal Section. A criminal complaint is merely an accusation. All defendants are presumed innocent of the charges until proven guilty beyond a reasonable doubt in court.",2011-11-23T00:00:00-05:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.027,0.240,0.734,-0.9968
...,...,...,...,...,...,...,...,...,...,...
11064,16-471,"Statement from Head of the Civil Rights Division Vanita Gupta Regarding District Court’s Approval of Consent Decree with City of Ferguson, Missouri","Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department’s Civil Rights Division, released the following statement regarding the ruling issued by U.S. District Judge Catherine D. Perry of the Eastern District of Missouri approving the department’s consent decree with the city of Ferguson, Missouri: “Now that the consent decree has been approved by the court, the department is looking forward to working with the city of Ferguson as it implements the decree and continues the essential work to create a police department that the Constitution requires and that residents deserve.”",2016-04-19T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Special Litigation Section,0.121,0.000,0.879,0.8779
11085,15-667,"Statement from Vanita Gupta, Head of the Justice Department's Civil Rights Division, U.S. Attorney Steven M. Dettlebach for the Northern District of Ohio and Special Agent in Charge Stephen D. Anthony for the FBI","Statement from Vanita Gupta, head of the Justice Department’s Civil Rights Division, U.S. Attorney Steven M. Dettelbach for the Northern District of Ohio and Special Agent in Charge Stephen D. Anthony for the FBI: “The U.S. Attorney's Office, the Federal Bureau of Investigation and the Civil Rights Division of the Department of Justice have been monitoring the extensive investigation that has been conducted around the events of Nov. 29, 2012. We will now review the testimony and evidence presented in the state trial. We will continue our assessment, review all available legal options and will collaboratively determine what, if any, additional steps are available and appropriate given the requirements and limitations of the applicable laws in the federal judicial system. This review is separate and distinct from the Civil Rights Division and U.S. Attorney's Office's productive efforts to resolve civil pattern and practice allegations under 42 U.S.C. 14141 with the city of Cleveland.”",2015-05-23T00:00:00-04:00,Civil Rights,Civil Rights Division,0.084,0.000,0.916,0.9118
6211,16-1019,Justice Department Ends Agreement with West Virginia School District after Successful Implementation of English Language Programs,"The Justice Department announced today that it has terminated its January 2012 settlement agreement with the Mercer County, West Virginia, School District following the district’s successful implementation of programs and services for its English Learner (EL) students, as required by the Equal Educational Opportunities Act (EEOA) of 1974. After entering into the settlement agreement, the district implemented a process whereby every new student completed a home language survey so that all students with non-English speaking backgrounds were timely identified; had their English proficiency assessed; and if they were not proficient, were provided with individualized English language services and supports. The district also implemented a new curriculum for the instruction of EL students, improved its teacher training, carefully monitored the academic progress of current and former EL students and enhanced its communications with limited-English proficient families. As a result of its efforts, the district has successfully integrated dozens of EL students into its student body, enabling them to access the curriculum and develop strong relationships with their teachers and peers. EL students and their parents have credited the district’s individualized programs and the dedication of EL teachers in furthering the students’ progress. “We commend the Mercer County School District for successfully implementing the settlement agreement and for showing dedication and care to its English Learner students and their families,” said Principal Deputy Assistant Attorney General Vanita Gupta, head of the Justice Department’s Civil Rights Division. “We hope other rural districts with growing EL populations will learn from Mercer County’s positive example and significant progress.” The EEOA requires state and local education agencies to take appropriate action to overcome language barriers that impede students’ equal participation in instructional programs. Enforcement of the EEOA is a top priority of the Civil Rights Division. Additional information about the Civil Rights Division is available on its website at www.justice.gov/crt.",2016-09-07T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Educational Opportunities Section,0.189,0.000,0.811,0.9959
6787,17-132,Justice Department Reaches Agreement with St. James Parish Louisiana School District to Desegregate Schools,"The Department of Justice has reached an agreement with the St. James Parish School District in Louisiana that upon completion will end court supervision of the district’s schools. The consent order, approved yesterday by the U.S. District Court for the Eastern District of Louisiana, addresses all remaining issues in the school desegregation case, and when fully implemented will lead to the closing of that case. The consent order, negotiated with the school district and private plaintiffs, represented by the NAACP Legal Defense and Educational Fund, puts the district on a path to full unitary status within three years provided it: The consent order declares that the district has already met its desegregation obligations in the area of transportation. The court will retain jurisdiction over the consent order during its implementation, and the Justice Department will monitor the district’s compliance. “We are pleased to have worked hand-in-hand with the schools to ensure equal and fair treatment for the students of the St. James Parish School District,” said Acting Assistant Attorney General Tom Wheeler of the Civil Rights Division. “We look forward to working with the district and private plaintiffs to implement the consent order and bring this case to a successful close.” Promoting school desegregation and enforcing Title IV of the Civil Rights Act of 1964 is a top priority of the Justice Department’s Civil Rights Division. Additional information about the Civil Rights Division is available on its website at www.justice.gov/crt. St. James Parish Consent Order",2017-01-31T00:00:00-05:00,Civil Rights,Civil Rights Division; Civil Rights - Educational Opportunities Section,0.176,0.000,0.824,0.9905


Unnamed: 0,id,title,contents,date,topics_clean,components_clean,Positive Score,Negative Score,Neutral Score,Compound Score
77,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,"A former supervisory correctional officer at Louisiana State Penitentiary in Angola, Louisiana, pleaded guilty yesterday in connection with the beating of a handcuffed and shackled inmate, in addition to conspiring to cover up their misconduct by falsifying official records and lying to internal investigators about what happened. James Savoy, 39, of Marksville, Louisiana, admitted during his plea hearing that he witnessed other officers using excessive force against the inmate and failed to intervene; that he conspired with other officers to cover up the beating by engaging in a variety of obstructive acts; and that he personally falsified official prison records to cover up the attack. Scotty Kennedy, 48, of Beebe, Arkansas, and John Sanders, 30, of Marksville, Louisiana previously pleaded guilty in November 2016, and September 2017, for their roles in the beating and cover up. “Every citizen has the right to due process and protection from unreasonable force, and correctional officers who violate these basic Constitutional rights must be held accountable for their egregious actions” said Acting Assistant Attorney General John Gore of the Civil Rights Division. “The Justice Department will continue to vigorously prosecute correctional officers who violate the public’s trust by committing crimes and to covering up violations of federal criminal law.” “Yesterday is another example of our office’s unwavering commitment to pursuing those who violate the federal criminal civil rights laws,” said Acting United States Attorney for the Middle District of Louisiana Corey Amundson. “We will continue to work closely with the Justice Department’s Civil Rights Division and the FBI to ensure that no one is above the law.” This case is being investigated by the FBI’s Baton Rouge Resident Agency and is being prosecuted by Assistant U.S. Attorney Frederick A. Menner, Jr. of the Middle District of Louisiana and Trial Attorney Christopher J. Perras of the Civil Rights Division’s Criminal Section.",2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle",0.068,0.169,0.763,-0.9893
423,17-240,Anoka County Resident Sentenced to Six Months in Prison for Threatening Two Clinics that Provide Reproductive Health Services,"On, Feb. 27, 2017, Michael John Harris, 34, was sentenced to six months imprisonment and one year of supervised release for making telephonic threats to two medical clinics in Minneapolis, Minnesota, that provide reproductive health services. On March 2, 2016, Harris pleaded guilty to two violations of 18 U.S.C. § 248(a)(1). During his plea hearing, Harris admitted that on May 12, 2014, he made telephonic threats to two different health clinics in Minneapolis that provide reproductive health services. In a call to the first clinic, Harris threatened to kill the recipient of the call with his bare hands and to cut the recipient’s head off with a band saw. In a call to the second clinic, Harris told the recipient that he was going to kill the recipient and the recipient’s co-workers, and that he was going to travel to the clinic and shoot everyone present. Harris further admitted that he made these threats because the recipient was and has been, and in order to intimidate the recipient and any other person from, obtaining and providing reproductive health services. “This defendant threatened these clinic workers with death and brutality,” said Acting Assistant Attorney General Tom Wheeler of the Justice Department’s Civil Rights Division. “The department is pleased that the defendant accepted responsibility and will face consequences for his actions. The Department is committed to vigorously enforcing the civil rights of all individuals in this country.” “The violence threatened by this defendant against health care workers is unacceptable,” said United States Attorney Andrew M. Luger of the District of Minnesota. “This sentence should serve as a reminder to individuals who would engage in such threats that the federal government will prosecute these crimes.” This case is being investigated by the Federal Bureau of Investigation, and is being prosecuted by Trial Attorney Risa Berkower of the Civil Rights Division of the United States Department of Justice and Assistant U.S. Attorney Manda M. Sertich of the U.S. Attorney’s Office for the District of Minnesota.",2017-03-02T00:00:00-05:00,Civil Rights,Civil Rights Division; Civil Rights - Criminal Section; Federal Bureau of Investigation (FBI); USAO - Minnesota,0.079,0.146,0.774,-0.9814
568,17-379,Arson Awareness Week 2017 to Focus on Preventing Arson at Houses of Worship,"The Justice Department today announced that its Civil Rights Division is partnering with the Federal Emergency Management Agency’s U.S. Fire Administration on this year’s Arson Awareness Week, May 7-13, with a focus on Preventing Arson at Houses of Worship. There were an average of 103 arsons of houses of worship per year from 2000 to 2015. Half of all reported fires at houses of worship turn out to involve arson. The Department of Justice enforces a number of federal statutes protecting places of worship from attack, including 18 U.S.C. § 247, known as the Church Arson Prevention Act, which was passed in the 1990s in response to a sharp increase in church arsons. That law makes it a federal crime to target religious property because of the religion or race of the congregation. In February of this year, the Department indicted an Idaho man under § 247 alleging that he set fire to a Catholic Church in Bonner’s Ferry in April 2016. In 2013, an Indiana man was sentenced to 20 years imprisonment for setting a fire at the Islamic Center of Greater Toledo. FEMA and the Department of Justice have produced a number of materials to help congregations, community organizations and local law enforcement and fire safety officials to increase arson awareness and hold events highlighting proactive steps that can be taken to try to reduce house of worship arson. These materials are available at the Arson Awareness Week homepage, www.usfa.fema.gov/aaw. “Arson against houses of worship is a serious crime that the Department of Justice is committed to prosecuting to the fullest extent of the law,” said Acting Assistant Attorney General Tom Wheeler of the Justice Department’s Civil Rights Division. “But our role as prosecutors, while critically important, only comes after the fact when the damage is already done. That is why we encourage communities and local officials to take proactive steps to increase public awareness of the problem and measures that can be taken to reduce the likelihood of being a victim of house of worship arson.” Further information about hate crimes, including arsons against on places of worship, is available at the Civil Rights Division hate crimes page, https://www.justice.gov/crt/hate-crimes-0.",2017-04-10T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Criminal Section,0.152,0.097,0.751,0.9623
851,18-121,Attorney General Issues National Slavery and Human Trafficking Prevention Month Proclamation,"Attorney General Jeff Sessions issued the following proclamation commemorating January as National Slavery and Human Trafficking Prevention Month: “Human trafficking is a nationwide public health and civil rights crisis. Its victims are everywhere: at truck stops, in cities, in rural areas, and in suburbs, and who now total an unconscionable 25 million victims globally according to some estimates. That means 25 million human beings—parents, siblings, and children—have been coerced into a commercial sex act, forced into labor, or exploited because they desperately seek a better life. It is a priority of the Department of Justice to combat this depraved and predatory behavior through swift and aggressive enforcement of our nation’s laws to bring traffickers to justice and restore the lives of victims and survivors. “The Justice Department’s U.S. Attorneys’ Offices, working closely with the Federal Bureau of Investigation (FBI), other federal agencies, and our state, local, and tribal partners, are on the front lines, leading our shared fight against human trafficking in all its forms. These entities are supported by the Department’s Civil Rights Division which is home to a team of dedicated investigators and prosecutors—the Human Trafficking Prosecution Unit (the HTPU)—tasked with bringing human traffickers to justice and vindicating the rights of their victims. Additionally, the Department’s Criminal Division includes the Child Exploitation and Obscenity Section (CEOS), which is committed to harnessing expertise in attacking the technological and systemic challenges that are involved in the sexual exploitation of minors, as well as other specialized prosecution teams who bring expertise in organized crime and money laundering. “Our efforts have produced high-impact prosecutions to dismantle transnational organized human trafficking enterprises, have launched interagency anti-trafficking initiatives with unprecedented momentum, and have vindicated the rights and freedoms of countless victims and survivors. “These efforts resulted in the conviction of nearly 500 defendants in trafficking cases in fiscal year 2017, and making $47 million available to help trafficking survivors. Last fall, the FBI—along with state and local task forces and international law enforcement partners—recovered 84 minors and arrested 120 traffickers, as part a single week-long operation. However, we are keenly aware that many challenges lie ahead and we are committed to taking our efforts to the next level. “In his Presidential Proclamation, President Trump asked us to ‘recommit ourselves to eradicating the evil of enslavement’ and to ‘pledge to do all in our power to end the horrific practice of human trafficking.’ In the spirit of the President’s request, the Justice Department is hosting a Human Trafficking Summit in Washington, D.C. on February 2, 2018, two days before Super Bowl LII. The Super Bowl provides an opportunity to raise awareness of the surge in commercial sex activity around major sporting events, and of our commitment to finding and protecting sex trafficking victims who are at risk of being compelled, coerced, or exploited as minors in that context. “The Human Trafficking Summit will be led by Associate Attorney General Rachel Brand and will convene law enforcement, victim support organizations, and the business community to focus on enhancing the strong partnerships behind all successful anti-trafficking efforts and identifying opportunities to increase collaboration and coordination as we take on new challenges. “There is no room in a civilized society for those who choose to violate an individual’s rights and freedoms by subjecting them to any form of human trafficking. To those that still make that choice: make no mistake, the Justice Department will use every lawful tool to uncover your illegal activity and bring you to justice.”",2018-01-31T00:00:00-05:00,Civil Rights,Civil Rights Division; Office of the Attorney General,0.132,0.131,0.737,-0.6771
914,16-603,Attorney General Loretta E. Lynch Statement on the Case of Dylann Roof,"Attorney General Loretta E. Lynch today released the following statement regarding the United States v. Dylann Roof: “Following the department’s rigorous review process to thoroughly consider all relevant factual and legal issues, I have determined that the Justice Department will seek the death penalty. The nature of the alleged crime and the resulting harm compelled this decision.”",2016-05-24T00:00:00-04:00,Civil Rights,Office of the Attorney General,0.152,0.215,0.633,-0.7717


In [14]:
doj_subset.groupby(["topics_clean"]).agg('mean')
#it's logical that Hate Crimes would be the most negative category, as the only positive possible story discussing hate crimes would be a release about the lack of them - compared to Project Safe Childhood, which takes concrete steps to deal with a negative issue, making it somewhat intuitive why releases discussing said steps were positive in nature

Unnamed: 0_level_0,Positive Score,Negative Score,Neutral Score,Compound Score
topics_clean,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Civil Rights,0.103239,0.08697,0.809757,0.154595
Hate Crimes,0.071711,0.149224,0.779045,-0.884882
Project Safe Childhood,0.10109,0.109705,0.789145,-0.245364


In [None]:
## rj note; good job! see solutions
## for way of executing the function that has much lower runtime than using .append
## small deduction for hardcoding length 717 rather than obtaining from shape

## 2.3 topic modeling (25 points)

For this question, use the `doj_subset` data that is reestricted to civil rights, hate crimes, and project safe childhood and with the sentiment scores added


### 2.3.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in each of the raw strings in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem
    
B. Print the preprocessed text for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
    
**Resources**:

- Here's code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
- Here's more condensed code with topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_solution.ipynb 
- Here's longer code with more broken-out topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_examplecode.ipynb

In [16]:
stemmer = SnowballStemmer(language="english")
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                        "trial", "assistance", "assist"]
orig_stopwords = stopwords.words('english')
new_stopwords = orig_stopwords + custom_doj_stopwords
def process2_3(df):
    #print("shape: ", df.shape[0])
    content_list = []
    content = []
    for i in range(0, df.shape[0]):
        content = df.iloc[i].contents.lower()
        tokens = word_tokenize(content)
        tokens = [stemmer.stem(word) for word in tokens if word not in new_stopwords and word.isalpha() and len(word) > 3]
        content = " ".join(tokens)
        content_list.append(content)
    #print("Content: ",len(content_list))
    df['contents'] = content_list
    return df



In [17]:
pd.set_option("max_colwidth", 1000)
#prevents string in 'contents' from getting cut off
doj_2_3 = process2_3(doj_subset)
doj2_3_pt1 = doj_2_3[doj_2_3['id']== "16-718"]
doj_temp = doj_2_3[doj_2_3['id']== "16-217"]
doj2_3_pt1.append(doj_temp)
doj2_3_pt1['contents']
doj_temp['contents']

Unnamed: 0,id,title,contents,date,topics_clean,components_clean,Positive Score,Negative Score,Neutral Score,Compound Score
11593,16-718,Three Mississippi Correctional Officers Indicted for Inmate Assault and Cover-Up,indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit fals report three lie convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci fals statement charg year prison fals report charg indict mere accus defend presum innoc unless proven guilti investig jackson cooper mississippi correct prosecut robert coleman northern mississippi dana mulhaus crimin section marsher indict,2016-06-21T00:00:00-04:00,Civil Rights,"Civil Rights Division; Civil Rights - Criminal Section; USAO - Mississippi, Northern",0.028,0.259,0.713,-0.9968
6727,16-217,Justice Department Reaches Agreement with the City of Miami and the Miami Police Department to Implement Reforms on Officer-Involved Shootings,reach comprehens settlement agreement citi miami miami polic resolv shoot offic announc princip deputi general vanita gupta head wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim shoot effect quick investig shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc today agreement result joint effort...,2016-02-25T00:00:00-05:00,Civil Rights,"Civil Rights Division; Civil Rights - Special Litigation Section; USAO - Florida, Southern",0.217,0.036,0.746,0.9977


11593    indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit fals report three lie convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci fals statement charg year prison fals report charg indict mere accus defend presum innoc unless proven guilti investig jackson cooper mississippi correct prosecut robert coleman northern mississippi dana mulhaus crimin section marsher indict
Name: contents, dtype: object

6727    reach comprehens settlement agreement citi miami miami polic resolv shoot offic announc princip deputi general vanita gupta head wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehens reform ensur constitut polic support public trust settlement agreement design minim shoot effect quick investig shoot occur measur includ settlement repres renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc today agreement result joint ef

In [18]:
## rj note- good job! small deduction for this part
## Takes in each of the raw strings in the `contents` column from that dataframe
## since iterating over all rows within the function is less modular
## than feeding the function one press release at a time

### 2.3.2 Create a document-term matrix from the preprocessed press releases and to explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the `compound` sentiment column you added and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so most positive)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so most negative)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. What are the top 10 words for press releases in each of the three `topics_clean`?

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data

**Resources**:

- Here contains an example of applying the create_dtm function: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_examplecode.ipynb


In [19]:
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase = True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), columns=vectorizer.get_feature_names())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis = 1)
    return(dtm_dense_named_withid)

In [20]:
output_temp23 = []
i = 0
for content in doj_2_3['contents']:
    string_temp = str(content)
    output_temp.append(string_temp)
for entry in output_temp23:
    sent = sent_scoring(entry)
    pos_sent[i] = sent["pos"]
    neg_sent[i] = sent["neg"]
    neu_sent[i] = sent["neu"]
    comp_sent[i] = sent["compound"]
    i = i + 1


doj_2_3["Positive Score"] = pos_sent
doj_2_3["Negative Score"] = neg_sent
doj_2_3["Neutral Score"] = neu_sent
doj_2_3["Compound Score"] = comp_sent
doj_dtm = create_dtm(doj_2_3.contents, doj_2_3[['topics_clean', 'Compound Score']])

In [21]:
def get_topwords(df):
    return(df[df.columns[2:]].sum(axis=0).sort_values(ascending = False).head(10))

In [24]:
## rj added
doj_dtm.head()

Unnamed: 0,index,topics_clean,Compound Score,aaron,abandon,abbat,abbi,abbott,abdomen,abduct,...,zamora,zane,zealand,zealous,zeeman,zero,zobel,zone,zunggeemog,zwengel
0,77,Civil Rights,-0.9893,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,423,Civil Rights,-0.9814,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,568,Civil Rights,0.9623,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,851,Civil Rights,-0.6771,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,914,Civil Rights,-0.7717,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [22]:
print("Positive: \n", get_topwords(doj_dtm[doj_dtm.iloc[:,2] >= doj_dtm.iloc[:,2].quantile(.95)]))
print("Negative: \n", get_topwords(doj_dtm[doj_dtm.iloc[:,2] <= doj_dtm.iloc[:,2].quantile(.05)]))

## rj note- hardcoding to the 3rd column is dangerous since
## if you feed other metadata cols to create_dtm
## it would no longer reference the sentiment col

Positive: 
 agreement     168.0
disabl        126.0
enforc        118.0
ensur         104.0
settlement    103.0
state         101.0
communiti      89.0
hous           87.0
polic          85.0
student        85.0
dtype: float64
Negative: 
 assault     183.0
crime       155.0
victim      147.0
hate        121.0
offic       117.0
defend      102.0
charg       100.0
sentenc      95.0
anderson     93.0
guilti       92.0
dtype: float64


In [23]:
print("Civil Rights: \n", get_topwords(doj_dtm[doj_dtm.topics_clean == "Civil Rights"]))
print("Hate Crimes: \n", get_topwords(doj_dtm[doj_dtm.topics_clean == "Hate Crimes"]))
print("Project Safe Childhood: \n", get_topwords(doj_dtm[doj_dtm.topics_clean == "Project Safe Childhood"]))

Civil Rights: 
 offic        627.0
hous         620.0
discrimin    541.0
enforc       531.0
disabl       509.0
said         497.0
feder        475.0
violat       470.0
state        443.0
general      408.0
dtype: float64
Hate Crimes: 
 victim      590.0
crime       533.0
prosecut    476.0
hate        472.0
defend      459.0
sentenc     455.0
charg       452.0
guilti      430.0
feder       426.0
said        424.0
dtype: float64
Project Safe Childhood: 
 child          1018.0
exploit         698.0
sexual          570.0
safe            476.0
childhood       472.0
project         472.0
pornographi     447.0
children        416.0
crimin          404.0
prosecut        374.0
dtype: float64


### 2.3.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.3.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

**Resources**:

- Same topic modeling resources linked to above

In [25]:
tokens_2_33 = [wordpunct_tokenize(content) for content in doj_2_3['contents']]
dictionary_2_33 = corpora.Dictionary(tokens_2_33)
corpus_2_33 = [dictionary_2_33.doc2bow(one_tok) for one_tok in tokens_2_33]
model_topics = gensim.models.ldamodel.LdaModel(corpus_2_33, num_topics = 3, id2word = dictionary_2_33, alpha = "auto", passes = 8, per_word_topics = True)
model_topics.print_topics(num_words = 15)

[(0,
  '0.007*"said" + 0.007*"state" + 0.007*"feder" + 0.006*"religi" + 0.006*"today" + 0.005*"charg" + 0.005*"general" + 0.005*"sentenc" + 0.005*"guilti" + 0.005*"prosecut" + 0.005*"assault" + 0.005*"defend" + 0.005*"crime" + 0.004*"nation" + 0.004*"violat"'),
 (1,
  '0.013*"hous" + 0.011*"disabl" + 0.009*"discrimin" + 0.008*"enforc" + 0.007*"agreement" + 0.007*"said" + 0.007*"fair" + 0.006*"court" + 0.006*"alleg" + 0.006*"individu" + 0.006*"feder" + 0.006*"state" + 0.005*"general" + 0.005*"access" + 0.005*"ensur"'),
 (2,
  '0.012*"child" + 0.012*"victim" + 0.010*"prosecut" + 0.010*"sentenc" + 0.008*"charg" + 0.008*"exploit" + 0.008*"guilti" + 0.008*"feder" + 0.008*"offic" + 0.008*"sexual" + 0.008*"investig" + 0.007*"crimin" + 0.007*"prison" + 0.007*"year" + 0.006*"crime"')]

### 2.3.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, use the argument `minimum_probability` = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the `doj_subset` dataframe

B. Add the topic probabilities to the `doj_subset` dataframe as columns and code each document to its highest-probability topic

C. For each of the manual labels in `topics_clean` (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)

**Resources**:

- End of this code contains example of how to use `get_document_topics` and other steps to add topic probabilities back to data: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_solution.ipynb

In [29]:
probs_34 = [model_topics.get_document_topics(item, minimum_probability = 0) for item in corpus_2_33]
assert len(probs_34) == doj_subset.shape[0]
df_34 = pd.DataFrame(probs_34)
for i in df_34.columns:
    df_34[i] = [df_34[i][x][1] for x in range(0, df_34.shape[0])]
    
## original code:    
# doj_subset = doj_subset.merge(right = df_34, how = 'left', left_index = True, right_index = True)
# #doj_subset
# doj_subset['topic_model']= doj_subset[[0,1,2]].idxmax(axis=1)
# #for some reason the above line only seems to work every other time i run this cell
# pd.crosstab(doj_subset.topics_clean, doj_subset.topic_model, normalize = "index")

## rj code to prevent issue
doj_subset_wtop = doj_subset.merge(right = df_34, how = 'left', left_index = True, right_index = True)
#doj_subset
doj_subset_wtop['topic_model']= doj_subset_wtop[[0,1,2]].idxmax(axis=1)
#for some reason the above line only seems to work every other time i run this cell
pd.crosstab(doj_subset_wtop.topics_clean, doj_subset_wtop.topic_model, normalize = "index")

## rj note- i think it's working! main thing is you can use random_state = something
## within the model estimation step to fix which topic is labeled 0, 1, 2
## and the subsetting to the 0, 1, 2 columns while writing over doj_subset causes
## issues- added code to make more stable
## topic 0, 1, 2 etc
## small deduction for this interpretation part:
## Using a couple press releases as examples, write a 1-2 
## sentence interpretation of why some of the manual topics map
## on more cleanly to an estimated topic than other manual topic(s)

topic_model,0.0,1.0,2.0
topics_clean,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Civil Rights,0.0,0.0,1.0
Hate Crimes,0.227273,0.181818,0.590909
Project Safe Childhood,0.0,0.545455,0.454545


## 2.5 OPTIONAL extra credit (5 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 2.1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings. Feel free to load additional packages if needed

Find the top two pairs (so four press releases total)-- do they seem like different stages of the same crime or just press releases covering similar crimes?