# 0. Imports

In [44]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Cameron\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [5]:
## helpful packages
import pandas as pd
import numpy as np
import random
import re

## nltk imports
import nltk
### uncomment and run these lines if you haven't downloaded relevant nltk add-ons yet
#nltk.download('averaged_perceptron_tagger')
#nltk.download('stopwords')
from nltk import pos_tag
from nltk.tokenize import word_tokenize, wordpunct_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

## spacy imports
import spacy
### uncomment and run the below line if you haven't loaded the en_core_web_sm library yet
#python -m spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load()

## vectorizer
from sklearn.feature_extraction.text import CountVectorizer

## sentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## lda
from gensim import corpora
import gensim

## repeated printouts
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"




# 2. Text analysis of DOJ press releases

For background, here's the Kaggle that contains the data: https://www.kaggle.com/jbencina/department-of-justice-20092018-press-releases 

Here's the code the dataset owner used to scrape those press releases here if you're interested: https://github.com/jbencina/dojreleases

In [6]:
## run this code to load the unzipped json file and convert to a dataframe
## and convert some of the things from lists to values
doj = pd.read_json("combined.json", lines = True)

## due to json, topics are in a list so remove them and concatenate with ;
doj['topics_clean'] = ["; ".join(topic) 
                      if len(topic) > 0 else "No topic" 
                      for topic in doj.topics]

## similarly with components
doj['components_clean'] = ["; ".join(comp) 
                           if len(comp) > 0 else "No component" 
                           for comp in doj.components]

## drop older columns from data
doj = doj[['id', 'title', 'contents', 'date', 'topics_clean', 'components_clean']].copy()

doj.head()

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
0,,Convicted Bomb Plotter Sentenced to 30 Years,"PORTLAND, Oregon. – Mohamed Osman Mohamud, 23,...",2014-10-01T00:00:00-04:00,No topic,National Security Division (NSD)
1,12-919,$1 Million in Restitution Payments Announced t...,WASHINGTON – North Carolina’s Waccamaw River...,2012-07-25T00:00:00-04:00,No topic,Environment and Natural Resources Division
2,11-1002,$1 Million Settlement Reached for Natural Reso...,BOSTON– A $1-million settlement has been...,2011-08-03T00:00:00-04:00,No topic,Environment and Natural Resources Division
3,10-015,10 Las Vegas Men Indicted \r\nfor Falsifying V...,WASHINGTON—A federal grand jury in Las Vegas...,2010-01-08T00:00:00-05:00,No topic,Environment and Natural Resources Division
4,18-898,$100 Million Settlement Will Speed Cleanup Wor...,"The U.S. Department of Justice, the U.S. Envir...",2018-07-09T00:00:00-04:00,Environment,Environment and Natural Resources Division


## 2.1 NLP on one press release (10 points)

Focus on the following press release: `id` == "17-1204" about this pharmaceutical kickback prosecution: https://www.forbes.com/sites/michelatindera/2017/11/16/fentanyl-billionaire-john-kapoor-to-plead-not-guilty-in-opioid-kickback-case/?sh=21b8574d6c6c 

The `contents` column is the one we're treating as a document. You may need to to convert it from a pandas series to a single string.


- Part of speech tagging- extract verbs and sort from most occurrences to least occurrences
- Named entity recognition --- what are the different organizations mentioned? how would you like to make more granular?
- Sentence level versus document-level sentiment scoring

- For sentence level scoring, print a few top positive and top negative. Does the automatic classifier seem to work?


### 2.1.1: part of speech tagging (3 points)

A. Preprocess the press release to remove all punctuation / digits (so can subset to one_word.isalpha())

B. Then, use part of speech tagging within nltk to tag all the words in that one press release with their part of speech. 

C. Finally, extract the adjectives and sort those adjectives from most occurrences to fewest occurrences. Print the 5 most frequent adjectives. See here for a list of the names of adjectives within nltk: https://pythonprogramming.net/natural-language-toolkit-nltk-part-speech-tagging/

**Resources**:

- Documentation for .isalpha(): https://www.w3schools.com/python/ref_string_isalpha.asp
- `processtext` function here has an example of tokenizing and filtering to words where .isalpha() is true: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_solution.ipynb 
- Part of speech tagging section of this code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partI_textmining_examplecode.ipynb



In [110]:
pd.set_option("max_colwidth", 10000)
#prevents string in 'contents' from getting cut off
doj2_1 = doj[doj['id']== "17-1204"]
string2_1 = str(doj2_1["contents"])
list_2_1 = [word for word in wordpunct_tokenize(string2_1) if word.isalpha()]
words_2_1 = " ".join(list_2_1)
#words_2_1
tokens = word_tokenize(words_2_1) 
tokens_pos = pos_tag(tokens)
all_adjectives = [one_tok[0] for one_tok in tokens_pos 
                if one_tok[1] == "JJ" or one_tok[1] == "JJR" or one_tok[1] == "JJS"]
all_adjectives_df = pd.DataFrame(all_adjectives)
df = all_adjectives_df.value_counts()
df.head()

former        8
opioid        5
nationwide    4
addictive     3
other         3
dtype: int64

### 2.1.2 named entity recognition (3 points)


A. Using the alpha-only press release you created in the previous step, use spaCy to extract all named entities from the press release

B. Print all the named entities along with their tag

C. You want to extract the possible sentence lengths the CEO is facing; pull out the named entities with (1) the label `DATE` and (2) that contain the word year or years (hint: you may want to use the `re` module for that second part). Print these.

D. Pull and print the original parts of the press releases where those year lengths are mentioned (e.g., the sentences or rough region of the press release). Describe in your own words (1 sentence) what length of sentence (prison) and probation (supervised release) the CEO may be facing if convincted after this indictment.

**Resources**:

- Named entity recognition part of this code: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partI_textmining_examplecode.ipynb
- re.search and re.findall examples here for filtering to ones containing year (multiple approaches; some need not involve `re`): https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/04_basicregex_formerging.ipynb 

In [112]:
#words_2_1
spacy_words21 = nlp(words_2_1)
#for one_tok in spacy_words21.ents:
    #print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)
#^these two lines are for Part B, I commented them out to save space while debugging the rest of this cell
print("Part C")
for one_tok in spacy_words21.ents:
    if one_tok.label_ == "DATE" and "year" in one_tok.text: print("Entity: " + one_tok.text + "; NER tag: " + one_tok.label_)
[re.findall(r".{,80}" + r"year" + r".{,80}", words_2_1)]
#The CEO if convicted may face up to five years in prison and three years of parole

Part C
Entity: last year; NER tag: DATE
Entity: no greater than years; NER tag: DATE
Entity: three years; NER tag: DATE
Entity: three years; NER tag: DATE


[['g breakthrough pain More than Americans died of synthetic opioid overdoses last year and millions are addicted to opioids And yet some medical professionals would r',
  'cy to commit mail and wire fraud each provide for a sentence of no greater than years in prison three years of supervised release and a fine of or twice the amount ',
  'to violate the Anti Kickback Law provide for a sentence of no greater than five years in prison three years of supervised release and a fine Sentences are imposed b']]

### 2.1.3 Sentiment analysis (4 points)

A. Use a `SentimentIntensityAnalyzer` and `polarity_scores` to score the entire press release for its sentiment (you can go back to the raw string of the press release without punctuation/digits removed)

B. Remove all named entities from the string and score the sentiment of the press release without named entities. Did the neutral score go up or down relative to the version of the press release containing named entities? Why do you think this occurred?

C. With the version of the string that removes named entities, try to split the press release into discrete sentences (hint: re.split() may be useful since it allows or conditions in the pattern you're looking for). Print the first 5 sentences of the split press release (there will not be deductions if there remain some erroneous splits; just make sure it's generally splitting)

D. Score each sentence in the split press release and print the top 5 sentences in the press release with the most negative sentiment (use the `neg` score- higher values = more negative). **Hint**: you can use pd.DataFrame to rowbind a list of dictionaries; you can then add the press release sentence for each row back as a column in that dataframe and use sort_values()                                                  
                
**Resources**:

- Sentiment analysis section of this script: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partI_textmining_examplecode.ipynb

- Discussion of using `re.split()` to split on multiple delimiters: https://stackoverflow.com/questions/4998629/split-string-with-multiple-delimiters-in-python

In [152]:
doj2_1 = doj[doj['id']== "17-1204"]
string2_1 = str(doj2_1["contents"])
sent_obj = SentimentIntensityAnalyzer()
sentiment = sent_obj.polarity_scores(string2_1)
sentiment
spacey_words21 = nlp(string2_1)
text_no_ents = string2_1
for ent in spacey_words21.ents:
    text_no_ents = re.sub(str(ent), '', text_no_ents)
#text_no_ents
split_passage = re.split("\. | \!| \?", text_no_ents)
#split_passage[:5]
list_of_dictionaries = []
for sent in split_passage:
    sent_list = sent_obj.polarity_scores(sent)
    score = sent_list["neg"]
    dictionary = {"sentence":sent, "Score":score}
    list_of_dictionaries.append(dictionary)
dictionary_df = pd.DataFrame(list_of_dictionaries)
dictionary_df.sort_values("Score", ascending = False)
dictionary_df.head()

{'neg': 0.141, 'neu': 0.746, 'pos': 0.113, 'compound': -0.9962}

Unnamed: 0,sentence,Score
6,“'s arrest and charges reflect our ongoing efforts to attack the opioid crisis from all angles,0.494
0,"The founder and majority owner of , was arrested and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a spray intended for cancer patients experiencing breakthrough pain. "" died of synthetic opioid overdoses , and are addicted to opioids",0.381
2,""" will not tolerate this. We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic. And under the leadership of President , we are fully committed to defeating this threat to the people.”John , , of , , a current member of of , was arrested in na and charged with conspiracy, as well as other felonies, including conspiracy to commit mail and wire fraud and conspiracy to violate the Anti-Kickback Law",0.3
8,"“The allegations of selling a highly addictive opioid cancer pain drug to patients who did not have cancer, make them no better than street-level drug dealers",0.289
12,"“We are proud to work alongside our law enforcement partners to dismantle high level prescription drug practices which directly contribute to the opioid abuse epidemic. This investigation highlights our commitment to defending our mail system from illegal misuse and ensuring public trust in the mail.”“The Department of , will continue to aggressively investigate those that attempt to fraudulently impact programs designed to benefit our veterans and their families,” said , Special Agent in of the VA .The charges of conspiracy to commit and conspiracy to commit mail and wire fraud each provide for a sentence of no greater than 20 years in prison, of supervised release and a fine of $, or twice the amount of pecuniary gain or loss. The charges of conspiracy to violate the Anti-Kickback Law provide for a sentence of no greater than five years in prison, of supervised release and a $ fine",0.228
5,"and his company stand accused of bribing doctors to overprescribe a potent opioid and committing fraud on insurance companies solely for profit,” said Acting Attorney",0.205
4,"Gurry, , of , , conspired to bribe practitioners in various states, many of whom operated pain clinics, in order to get them to prescribe a fentanyl-based pain medication. The medication, called “,” is a powerful narcotic intended to treat cancer patients suffering intense breakthrough pain. In exchange for bribes and kickbacks, the practitioners wrote large numbers of prescriptions for the patients, most of whom were not diagnosed with cancer.The indictment also alleges that and the former executives conspired to mislead and defraud health insurance providers who were reluctant to approve payment for the drug when it was prescribed for non-cancer patients. They achieved this goal by setting up the “reimbursement unit,” which was dedicated to obtaining prior authorization directly from insurers and pharmacy benefit managers. “In the midst of a nationwide opioid epidemic that has reached crisis proportions, Mr",0.153
9,"'s charges mark an important step in holding pharmaceutical executives responsible for their part in the opioid crisis. The will vigorously investigate corrupt organizations with business practices that promote fraud with a total disregard for patient safety.”“These executives allegedly fueled the opioid epidemic by paying doctors to needlessly prescribe an extremely dangerous and addictive form of fentanyl,” said , Special Agent in of the Department of Health and Human Services. “Corporate executives intent on illegally driving up profits need to be aware they are now squarely in the sights of law enforcement.”“As alleged, executives improperly influenced health care providers to prescribe a powerful opioid for patients who did not need it, and without complying with requirements, thus putting patients at risk and contributing to the current opioid crisis,” said , Special Agent in , Office of Criminal Investigations’ Metro Washington Field Office",0.125
11,"Ferguson. “ pledges to work with our law enforcement and regulatory partners nationwide to ensure that rules and regulations under are followed.”“’s arrest is the result of a joint effort to identify, investigate and prosecute individuals who engage in fraudulent activity and endanger patient health,” stated Special Agent in Leigh-Alistair Barzey, (DCIS) . “DCIS will continue to work with the Attorney’s Office, of , and our law enforcement partners, to protect military members, retirees and their dependents and the integrity of , healthcare system.”“As alleged, John and other top executives committed fraud, placing profit before patient safety, to sell a highly potent and addictive opioid. EBSA will take every opportunity to work collaboratively with our law enforcement partners in these important investigations to protect participants in private sector health plans and contribute in fighting the opioid epidemic,” said , Regional Director of the Department of Labor, Employee Benefits Security Administration, Regional Office.“Once again, the Postal Inspection Service is fully committed to protecting our nation’s mail system from criminal misuse,” said , Inspector in of the Postal Inspection Service",0.085
7,"We must hold the industry and its leadership accountable - just as we would the cartels or a street-level drug dealer.”“As alleged, these executives created a corporate culture at that utilized deception and bribery as an acceptable business practice, deceiving patients, and conspiring with doctors and insurers,” said , Special Agent in of the Federal Bureau of Investigation, Field Division",0.079


Unnamed: 0,sentence,Score
0,"The founder and majority owner of , was arrested and charged with leading a nationwide conspiracy to profit by using bribes and fraud to cause the illegal distribution of a spray intended for cancer patients experiencing breakthrough pain. "" died of synthetic opioid overdoses , and are addicted to opioids",0.381
1,"And yet some medical professionals would rather take advantage of the addicts than try to help them,"" said Attorney General",0.0
2,""" will not tolerate this. We will hold accountable anyone – from street dealers to corporate executives -- who illegally contributes to this nationwide epidemic. And under the leadership of President , we are fully committed to defeating this threat to the people.”John , , of , , a current member of of , was arrested in na and charged with conspiracy, as well as other felonies, including conspiracy to commit mail and wire fraud and conspiracy to violate the Anti-Kickback Law",0.3
3,", the former Executive Chairman of the and CEO of , will appear in federal court in . He will appear in in at . The superseding indictment, unsealed in , also includes additional allegations against several former executives and managers who were initially indicted in 2016.The superseding indictment charges that ; , , of , , former CEO and President of the company; , , of , , former Vice President of Sales; , , of , , former National Director of Sales; former Regional Sales Directors , , of , , and , , of , ; and former Vice President of , el J",0.019
4,"Gurry, , of , , conspired to bribe practitioners in various states, many of whom operated pain clinics, in order to get them to prescribe a fentanyl-based pain medication. The medication, called “,” is a powerful narcotic intended to treat cancer patients suffering intense breakthrough pain. In exchange for bribes and kickbacks, the practitioners wrote large numbers of prescriptions for the patients, most of whom were not diagnosed with cancer.The indictment also alleges that and the former executives conspired to mislead and defraud health insurance providers who were reluctant to approve payment for the drug when it was prescribed for non-cancer patients. They achieved this goal by setting up the “reimbursement unit,” which was dedicated to obtaining prior authorization directly from insurers and pharmacy benefit managers. “In the midst of a nationwide opioid epidemic that has reached crisis proportions, Mr",0.153


## 2.2 sentiment scoring across many press releases (10 points)


A. Subset the press releases to those labeled with one of free topics (can just do if topic_clean == that topic rather than finding where that topic is mentioned in a longer list): Civil Rights, Hate Crimes, and Project Safe Childhood. We'll call this `doj_subset` going forward and it should have 717 rows.

B. Write a function that takes one press release string as an input and:

- Removes named entities from each press release string
- Scores the sentiment of the entire press release

Apply that function to each of the press releases in `doj_subset`. 

**Hints**: 

- You may want to use re.escape at some point to avoid errors relating to escape characters like ( in the press release
- I used a function + list comprehension to execute and it takes about 30 seconds on my local machine and about 2 mins on jhub; if it's taking a very long time, you may want to check your code for inefficiencies. If you can't fix those, for partial credit on this part/full credit on remainder, you can take a small random sample

C. Add the scores to the `doj_subset` dataframe. Sort from highest neg to lowest neg score and print the top 5 most neg.

D. With that dataframe, find the mean compound score for each of the three topics using group_by and agg. Add a 1 sentence interpretation of why we might see the variation in scores (remember that compound is a standardized summary where -1 is most negative; +1 is most positive)

**Resources**:

- Same named entity and sentiment resources as above

In [22]:
doj_subset = doj[doj["topics_clean"] == "Civil Rights"]
doj2 = doj[doj["topics_clean"] == "Hate Crimes"]
doj3 = doj[doj["topics_clean"] == "Project Safe Childhood"]
doj_subset = doj_subset.append(doj2)
doj_subset = doj_subset.append(doj3)

In [14]:
def sent_scoring(string):
    sent_obj = SentimentIntensityAnalyzer()
    spacey_words = nlp(string)
    text_no_ents_funct = string
    for ent in spacey_words.ents:
        text_no_ents_funct = re.sub(re.escape(str(ent)), '', text_no_ents_funct)
    sentiment = sent_obj.polarity_scores(string)
    return sentiment

In [43]:
output_temp = []
pos_sent = [0] * 717
neg_sent = [0] * 717
neu_sent = [0] * 717
comp_sent = [0] * 717
i=0
for content in doj_subset['contents']:
    string_temp = str(content)
    output_temp.append(string_temp)

for entry in output_temp:
    sent = sent_scoring(entry)
    pos_sent[i] = sent["pos"]
    neg_sent[i] = sent["neg"]
    neu_sent[i] = sent["neu"]
    comp_sent[i] = sent["compound"]
    i = i + 1


doj_subset["Positive Score"] = pos_sent
doj_subset["Negative Score"] = neg_sent
doj_subset["Neutral Score"] = neu_sent
doj_subset["Compound Score"] = comp_sent
doj_subset.sort_values("Negative Score", ascending = False)
doj_subset.head()

Unnamed: 0,id,title,contents,date,topics_clean,components_clean,Positive Score,Negative Score,Neutral Score,Compound Score
7730,18-876,Justice Department Update on Hate Crimes Prosecutions,today year anniversari hate crime summit announc updat hate crime prosecut crimin section commit enforc feder hate crime statut allow prosecut certain crime commit actual perceiv race color religion nation origin gender sexual orient gender ident disabl person recent year ramp prosecut hate crime increa train feder state local enforc offic ensur hate crime identifi prosecut fullest extent possibl past year charg defend hate crime offen matthew shepard jame byrd hate crime prevent hcpa provid valuabl tool effort hcpa indict defend hate crime convict date charg defend obtain convict sinc januari indict defend involv commit hate crime secur convict defend hate crime incid individu live live free threat violenc discrimin matter believ worship said general john gore proud work alreadi accomplish continu work dilig bring perpetr hate crime across hate crime prosecut januari present base latest uniform crime statist report issu novemb calendar year incid report involv offen victim known o...,2018-06-29T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.090,0.416,0.494,-0.9987
572,13-312,Aryan Brother Inmate Sentenced for Federal Hate Crime for Assaulting Fellow Inmate,john hall aryan brotherhood member inmat feder correct institut seagovil texa sentenc today judg reed connor plead guilti violat matthew shepard jame byrd hate crime prevent stem assault fellow inmat believ announc hall assault fellow inmat danger weapon caus bodili injuri victim hall sentenc serv month prison serv consecut sentenc current serv assault occur insid seagovil hall target attack victim fellow inmat believ victim involv sexual relationship anoth male inmat hall repeat punch kick stomp victim face shod feet danger weapon yell homophob slur victim lost conscious assault suffer multipl lacer face victim also sustain fractur socket lost tooth fractur teeth treat hospit injuri sustain hall unprovok attack hall plead guilti violat matthew shepard jame byrd hate crime prevent brutal violenc base sexual orient place societi said thoma perez general commit tool enforc arsenal includ matthew shepard jame byrd hate crime prevent prosecut motiv prosecut send clear messag partnershi...,2013-03-14T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.046,0.390,0.565,-0.9977
501,11-626,Arkansas Man Pleads Guilty to Federal Hate Crime Related to the Assault of Five Hispanic Men,washington announc today sean popejoy green forest plead guilti feder court count commit feder hate crime count conspir commit feder hate crime first convict violat matthew shepard jame byrd hate crime prevent enact octob inform present plea hear establish morn hour june popejoy admit part conspiraci threaten injur five hispan pull station park pursu victim truck caught victim popejoy lean outsid front passeng window waiv tire wrench victim continu threaten hurl racial epithet victim victim caus victim cross opposit lane traffic road crash tree ignit result action victim suffer bodili injuri includ victim sustain injuri jame byrd matthew shepard brutal murder decad today first defend convict hate crime critic enact name said thoma perez general unaccept violent hate commit someon race continu occur continu everi avail tool identifi prosecut hate crime whenev wherev occur terribl disturb violenc motiv hatr anoth race continu occur said conner eldridg western arkansa commit prosecut ...,2011-05-16T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.048,0.390,0.563,-0.9978
8655,12-1060,Michigan Man Pleads Guilty to Federal Hate Crimes Charge,everett dwayn averi detroit plead guilti feder court today feder hate crime admit assault victim believ victim eastern michigan barbara mcquad special agent charg robert foley announc today time plea averi admit march struck victim face custom conveni store detroit believ victim victim suffer fractur socket facial injuri result incid place societi said thoma perez general commit tool enforc arsenal includ matthew shepard jame byrd hate crime prevent prosecut violenc motiv averi face maximum year prison sentenc schedul judg john corbett meara hate crime differ simpl assault attack individu victim attack everyon share particular characterist said mcquad pass statut congress made clear attack base victim sexual orient toler commit protect communiti motiv hate victim anyon result sexual orient said special agent charg foley investig detroit prosecut thompson sanjay patel,2012-08-29T00:00:00-04:00,Hate Crimes,Civil Rights Division,0.118,0.365,0.517,-0.9931
12166,17-931,Two Texas Men Plead Guilty to Federal Hate Crime for Assaults Based on Victim’s Sexual Orientation,nigel garrett cameron ajiduah plead guilti today assault victim sexual orient eastern texa bureau alcohol tobacco firearm explo dalla announc accord plea agreement sign garrett januari defend garrett anthoni shelton chancler encalad grindr social media date platform arrang meet victim victim home upon enter victim home defend restrain victim tape physic assault victim made derogatori statement victim defend brandish firearm home inva stole victim properti includ motor vehicl includ separ plea agreement sign ajiduah februari defend ajiduah garrett shelton scheme differ victim includ restrain victim cover tape verbal berat sexual orientaion physic assault feder grand juri previous return indict ajiduah shelton garrett chancler encalad includ charg hate crime kidnap carjack firearm commit violent crime indict also charg defend conspir caus bodili injuri victim sexual orient four home inva plano frisco aubrey texa januari februari toler hate crime individu base sexual orient said gener...,2017-08-22T00:00:00-04:00,Hate Crimes,Civil Rights Division; Civil Rights - Criminal Section,0.052,0.355,0.593,-0.9980
...,...,...,...,...,...,...,...,...,...,...
6089,18-119,Justice Department Announces Religious Liberty Update to U.S. Attorneys’ Manual and Directs the Designation of Religious Liberty Point of Contact for All U.S. Attorney's Offices,today announc updat unit state manual usam section titl associ general approv notic requir issu implic religi general issu memorandum execut depart agenc entitl feder protect religi liberti memo direct compon unit state offic guidanc litig advic execut branch oper grant aspect work order ensur complianc general memo usam updat languag direct relev compon updat usam also instruct relev compon consult religi liberti principl laid general octob memo consid whether notic approv requir initi order fulli effectu approv notic requir updat usam instruct design point contact lead effort religi liberti inalien right protect constitut defend import thing said associ general rachel brand presid trump direct general session issu robust clear guidanc document octob clear explain feder govern appli religi liberti protect current book requir offic design religi liberti point contact ensur general memorandum effect implement design respon work direct leadership offic relat religi liberti ensur rece...,2018-01-31T00:00:00-05:00,Civil Rights,Civil Rights Division; Office of the Associate Attorney General,0.112,0.000,0.888,0.9423
11065,16-294,"Statement from Head of the Civil Rights Division Vanita Gupta Regarding Ferguson, Missouri, City Council Vote to Approve Consent Decree",princip deputi general vanita gupta head relea follow statement regard ferguson missouri citi council vote approv propo consent decr tonight citi ferguson missouri took import step toward guarant citizen protect constitut plea approv consent decr document design provid framework need institut constitut polic ferguson look forward file court come begin work toward implement,2016-03-15T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Special Litigation Section,0.113,0.000,0.887,0.6597
11064,16-471,"Statement from Head of the Civil Rights Division Vanita Gupta Regarding District Court’s Approval of Consent Decree with City of Ferguson, Missouri",princip deputi general vanita gupta head relea follow statement regard rule issu judg catherin perri eastern missouri approv consent decr citi ferguson missouri consent decr approv court look forward work citi ferguson implement decr continu essenti work creat polic constitut requir resid deserv,2016-04-19T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Special Litigation Section,0.085,0.000,0.915,0.4215
10286,15-1559,Readout of Department of Justice’s First Meetings in Chicago Following Announcement of Pattern or Practice Investigation of the Chicago Police Department,includ lawyer senior leader northern illinoi complet introductori meet chicago today follow last week announc pattern practic chicago polic team compri primarili lawyer join head vanita gupta well zachari fardon northern illinoi forc dispar forc account system northern illinoi group superintend john escal brief command staff investig process also initi meet communiti member organ order solicit inform explain pattern practic scope process today addit communiti group citi offici union repr meet citi chicago includ mayor rahm emanuel staff separ meet independ polic review author administr sharon fairley throughout investig process continu meet repr communiti citi union cour communiti member opportun provid inform public meet privat public meet announc later date anyon wish share inform relev encourag contact phone email,2015-12-17T00:00:00-05:00,Civil Rights,"Civil Rights Division; Civil Rights - Special Litigation Section; USAO - Illinois, Northern",0.074,0.000,0.926,0.8020


Unnamed: 0,id,title,contents,date,topics_clean,components_clean,Positive Score,Negative Score,Neutral Score,Compound Score
77,17-1235,Additional Former Correctional Officer Pleads Guilty to Beating of Handcuffed and Shackled Inmate at Angola State Prison,former supervisori correct offic louisiana state penitentiari angola louisiana plead guilti yesterday connect beat handcuf shackl inmat addit conspir cover misconduct falsifi offici record intern investig happen jame savoy marksvil louisiana admit plea hear offic excess forc inmat fail interven conspir offic cover beat engag varieti obstruct person falsifi offici prison record cover attack scotti kennedi beeb arkansa john sander marksvil louisiana previous plead guilti novemb septemb role beat cover everi citizen right process protect unreason forc correct offic violat basic constitut must held account egregi action said general john gore continu vigor prosecut correct offic violat public trust commit crime cover violat feder crimin yesterday anoth exampl unwav commit pursu violat feder crimin said unit state middl louisiana corey amundson continu work close ensur investig baton roug resid agenc prosecut frederick menner middl louisiana christoph perra crimin section,2017-11-02T00:00:00-04:00,Civil Rights,"Civil Rights Division; USAO - Louisiana, Middle",0.091,0.086,0.824,-0.296
423,17-240,Anoka County Resident Sentenced to Six Months in Prison for Threatening Two Clinics that Provide Reproductive Health Services,michael john harri sentenc month imprison year supervi relea make telephon threat medic clinic minneapoli minnesota provid reproduct health servic march harri plead guilti violat plea hear harri admit made telephon threat differ health clinic minneapoli provid reproduct health servic call first clinic harri threaten kill recipi call bare hand recipi head band call second clinic harri told recipi kill recipi recipi travel clinic shoot everyon present harri admit made threat recipi order intimid recipi person obtain provid reproduct health servic defend threaten clinic worker death brutal said general wheeler plea defend accept respon face consequ action commit vigor enforc individu violenc threaten defend health care worker unaccept said unit state andrew luger minnesota sentenc serv remind individu would engag threat feder govern prosecut investig feder bureau prosecut risa berkow unit state manda sertich minnesota,2017-03-02T00:00:00-05:00,Civil Rights,Civil Rights Division; Civil Rights - Criminal Section; Federal Bureau of Investigation (FBI); USAO - Minnesota,0.097,0.237,0.666,-0.9805
568,17-379,Arson Awareness Week 2017 to Focus on Preventing Arson at Houses of Worship,today announc partner feder emerg manag agenc fire administr year arson awar week focus prevent arson hous worship averag arson hous worship year half report fire hous worship turn involv arson enforc number feder statut protect place worship attack includ known church arson prevent pass respon sharp increa church arson make feder crime target religi properti religion race congreg februari year indict idaho alleg fire cathol church bonner ferri april indiana sentenc year imprison fire islam center greater toledo fema produc number materi help congreg communiti organ local enforc fire safeti offici increa arson awar hold event highlight proactiv step taken reduc hous worship arson materi avail arson awar week homepag arson hous worship serious crime commit prosecut fullest extent said general wheeler role prosecutor critic import come fact damag alreadi done encourag communiti local offici take proactiv step increa public awar problem measur taken reduc likelihood victim hous worshi...,2017-04-10T00:00:00-04:00,Civil Rights,Civil Rights Division; Civil Rights - Criminal Section,0.167,0.214,0.619,-0.9423
851,18-121,Attorney General Issues National Slavery and Human Trafficking Prevention Month Proclamation,general jeff session issu follow proclam commemor januari nation slaveri human traffick prevent month human traffick nationwid public health crisi victim everywh truck stop citi rural area suburb total unconscion million victim global accord estim mean million human sibl coerc commerci forc labor exploit desper seek better life prioriti combat deprav predatori behavior swift aggress enforc nation bring traffick restor live victim survivor offic work close feder bureau feder agenc state local tribal partner front line lead share fight human traffick form entiti support home team dedic investig human traffick prosecut unit htpu bring human traffick vindic victim addit crimin includ child exploit obscen section commit experti attack technolog system challeng involv sexual exploit minor well special prosecut team bring experti organ crime money launder effort produc prosecut dismantl transnat organ human traffick enterpri launch interag initi unprec momentum vindic freedom countless vi...,2018-01-31T00:00:00-05:00,Civil Rights,Civil Rights Division; Office of the Attorney General,0.189,0.132,0.68,0.9709
914,16-603,Attorney General Loretta E. Lynch Statement on the Case of Dylann Roof,general loretta lynch today relea follow statement regard unit state dylann roof follow rigor review process thorough consid relev factual legal issu determin seek death penalti natur alleg crime result harm compel deci,2016-05-24T00:00:00-04:00,Civil Rights,Office of the Attorney General,0.036,0.263,0.7,-0.886


In [45]:
doj_subset.groupby(["topics_clean"]).agg('mean')
#it's logical that Hate Crimes would be the most negative category, as the only positive possible story discussing hate crimes would be a release about the lack of them - compared to Project Safe Childhood, which takes concrete steps to deal with a negative issue, making it somewhat intuitive why releases discussing said steps were positive in nature

Unnamed: 0_level_0,Positive Score,Negative Score,Neutral Score,Compound Score
topics_clean,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Civil Rights,0.113849,0.10063,0.785534,0.154301
Hate Crimes,0.091411,0.200687,0.707959,-0.698521
Project Safe Childhood,0.102542,0.102934,0.79459,0.202216


## 2.3 topic modeling (25 points)

For this question, use the `doj_subset` data that is reestricted to civil rights, hate crimes, and project safe childhood and with the sentiment scores added


### 2.3.1 Preprocess the data by removing stopwords, punctuation, and non-alpha words (5 points)

A. Write a function that:

- Takes in each of the raw strings in the `contents` column from that dataframe
- Does the following preprocessing steps:

    - Converts the words to lowercase
    - Removes stopwords, adding the custom stopwords in your code cell below to the default stopwords list
    - Only retains alpha words (so removes digits and punctuation)
    - Only retains words 4 characters or longer
    - Uses the snowball stemmer from nltk to stem
    
B. Print the preprocessed text for the following press releases:

id = 16-718 (this case: https://www.seattletimes.com/nation-world/doj-miami-police-reach-settlement-in-civil-rights-case/)

id = 16-217 (this case: https://www.wlbt.com/story/32275512/three-mississippi-correctional-officers-indicted-for-inmate-assault-and-cover-up/)
    
**Resources**:

- Here's code examples for the snowball stemmer: https://www.geeksforgeeks.org/snowball-stemmer-nlp/
- Here's more condensed code with topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_solution.ipynb 
- Here's longer code with more broken-out topic modeling steps: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_examplecode.ipynb

In [32]:
stemmer = SnowballStemmer(language="english")
custom_doj_stopwords = ["civil", "rights", "division", "department", "justice",
                        "office", "attorney", "district", "case", "investigation", "assistant",
                        "trial", "assistance", "assist"]
orig_stopwords = stopwords.words('english')
new_stopwords = orig_stopwords + custom_doj_stopwords
def process2_3(df):
    #print("shape: ", df.shape[0])
    content_list = []
    content = []
    for i in range(0, df.shape[0]):
        content = df.iloc[i].contents.lower()
        tokens = word_tokenize(content)
        tokens = [stemmer.stem(word) for word in tokens if word not in new_stopwords and word.isalpha() and len(word) > 3]
        content = " ".join(tokens)
        content_list.append(content)
    #print("Content: ",len(content_list))
    df['contents'] = content_list
    return df



In [38]:
pd.set_option("max_colwidth", 1000)
#prevents string in 'contents' from getting cut off
doj_2_3 = process2_3(doj_subset)
doj2_3_pt1 = doj_2_3[doj_2_3['id']== "16-718"]
doj_temp = doj_2_3[doj_2_3['id']== "16-217"]
doj2_3_pt1.append(doj_temp)
doj2_3_pt1['contents']
doj_temp['contents']

Unnamed: 0,id,title,contents,date,topics_clean,components_clean
11593,16-718,Three Mississippi Correctional Officers Indicted for Inmate Assault and Cover-Up,indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit report three convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci statement charg year prison report charg indict mere accus defend presum innoc unless proven guilti investig jackson cooper mississippi correct prosecut robert coleman northern mississippi dana mulhaus crimin section marsher indict,2016-06-21T00:00:00-04:00,Civil Rights,"Civil Rights Division; Civil Rights - Criminal Section; USAO - Mississippi, Northern"
6727,16-217,Justice Department Reaches Agreement with the City of Miami and the Miami Police Department to Implement Reforms on Officer-Involved Shootings,reach comprehen settlement agreement citi miami miami polic resolv shoot offic announc princip deputi general vanita gupta head wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehen reform ensur constitut polic support public trust settlement agreement design minim shoot effect quick investig shoot occur measur includ settlement repr renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc today agreement result joint effort cit...,2016-02-25T00:00:00-05:00,Civil Rights,"Civil Rights Division; Civil Rights - Special Litigation Section; USAO - Florida, Southern"


11593    indict unseal today mississippi correct offic charg beat inmat third charg help cover indict charg lawardrick marsher robert sturdiv offic mississippi state penitentiari parchman mississippi beat includ kick punch throw victim ground marsher sturdiv charg violat right convict prison free cruel unusu punish sturdiv also charg fail interven marsher punch beat indict alleg action involv danger weapon result bodili injuri victim third offic deont pate charg along marsher sturdiv conspir cover beat indict alleg three offic submit report three convict marsher sturdiv face maximum sentenc year prison excess forc charg three offic face five year prison conspiraci statement charg year prison report charg indict mere accus defend presum innoc unless proven guilti investig jackson cooper mississippi correct prosecut robert coleman northern mississippi dana mulhaus crimin section marsher indict
Name: contents, dtype: object

6727    reach comprehen settlement agreement citi miami miami polic resolv shoot offic announc princip deputi general vanita gupta head wifredo ferrer southern florida settlement approv miami citi commiss today effect agreement sign parti resolv claim stem shoot offic conduct violent crime control enforc find issu juli identifi pattern practic excess forc shoot violat fourth amend constitut citi complianc settlement monitor independ review former tampa florida polic chief jane castor settlement agreement citi implement comprehen reform ensur constitut polic support public trust settlement agreement design minim shoot effect quick investig shoot occur measur includ settlement repr renew commit citi miami chief rodolfo llane provid constitut polic miami resid protect public safeti sustain reform said princip deputi general gupta agreement help strengthen relationship communiti serv improv account offic fire weapon unlaw provid communiti particip enforc today agreement result joint effort

### 2.3.2 Create a document-term matrix from the preprocessed press releases and to explore top words (5 points)

A. Use the `create_dtm` function I provide (alternately, feel free to write your own!) and create a document-term matrix using the preprocessed press releases; make sure metadata contains the `compound` sentiment column you added and the `topics_clean` column

B. Print the top 10 words for press releases with compound sentiment in the top 5% (so most positive)

C. Print the top 10 words for press releases with compound sentiment in the bottom 5% (so most negative)

**Hint**: for these, remember the pandas quantile function from pset one.  

D. What are the top 10 words for press releases in each of the three `topics_clean`?

For steps B - D, to receive full credit, write a function `get_topwords` that helps you avoid duplicated code when you find top words for the different subsets of the data

**Resources**:

- Here contains an example of applying the create_dtm function: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_examplecode.ipynb


In [46]:
def create_dtm(list_of_strings, metadata):
    vectorizer = CountVectorizer(lowercase = True)
    dtm_sparse = vectorizer.fit_transform(list_of_strings)
    dtm_dense_named = pd.DataFrame(dtm_sparse.todense(), columns=vectorizer.get_feature_names())
    dtm_dense_named_withid = pd.concat([metadata.reset_index(), dtm_dense_named], axis = 1)
    return(dtm_dense_named_withid)

In [52]:
output_temp23 = []
i = 0
for content in doj_2_3['contents']:
    string_temp = str(content)
    output_temp.append(string_temp)
for entry in output_temp23:
    sent = sent_scoring(entry)
    pos_sent[i] = sent["pos"]
    neg_sent[i] = sent["neg"]
    neu_sent[i] = sent["neu"]
    comp_sent[i] = sent["compound"]
    i = i + 1


doj_2_3["Positive Score"] = pos_sent
doj_2_3["Negative Score"] = neg_sent
doj_2_3["Neutral Score"] = neu_sent
doj_2_3["Compound Score"] = comp_sent
doj_dtm = create_dtm(doj_2_3.contents, doj_2_3[['topics_clean', 'Compound Score']])

In [62]:
def get_topwords(df):
    return(df[df.columns[2:]].sum(axis=0).sort_values(ascending = False).head(10))

In [70]:
print("Positive: \n", get_topwords(doj_dtm[doj_dtm.iloc[:,2] >= doj_dtm.iloc[:,2].quantile(.95)]))
print("Negative: \n", get_topwords(doj_dtm[doj_dtm.iloc[:,2] <= doj_dtm.iloc[:,2].quantile(.05)]))

Positive: 
 agreement     160.0
state         110.0
enforc        102.0
hous          100.0
disabl         96.0
ensur          84.0
discrimin      83.0
communiti      79.0
settlement     78.0
said           77.0
dtype: float64
Negative: 
 crime       211.0
assault     193.0
hate        184.0
victim      184.0
defend      147.0
sentenc     103.0
charg       102.0
prosecut     99.0
said         96.0
anderson     93.0
dtype: float64


In [68]:
print("Civil Rights: \n", get_topwords(doj_dtm[doj_dtm.topics_clean == "Civil Rights"]))
print("Hate Crimes: \n", get_topwords(doj_dtm[doj_dtm.topics_clean == "Hate Crimes"]))
print("Project Safe Childhood: \n", get_topwords(doj_dtm[doj_dtm.topics_clean == "Project Safe Childhood"]))

Civil Rights: 
 offic        627.0
hous         620.0
discrimin    541.0
enforc       531.0
disabl       509.0
said         497.0
feder        475.0
violat       470.0
state        443.0
general      408.0
dtype: float64
Hate Crimes: 
 victim      590.0
crime       533.0
prosecut    476.0
hate        472.0
defend      459.0
sentenc     455.0
charg       452.0
guilti      430.0
feder       426.0
said        424.0
dtype: float64
Project Safe Childhood: 
 child          1018.0
exploit         698.0
sexual          570.0
safe            476.0
project         472.0
childhood       472.0
pornographi     447.0
children        416.0
crimin          404.0
prosecut        374.0
dtype: float64


### 2.3.3 Estimate a topic model using those preprocessed words (5 points)

A. Going back to the preprocessed words from part 2.3.1, estimate a topic model with 3 topics, since you want to see if the unsupervised topic models recover different themes for each of the three manually-labeled areas (civil rights; hate crimes; project safe childhood). You have free rein over the other topic model parameters beyond the number of topics.

B. After estimating the topic model, print the top 15 words in each topic.

**Resources**:

- Same topic modeling resources linked to above

In [76]:
tokens_2_33 = [wordpunct_tokenize(content) for content in doj_2_3['contents']]
dictionary_2_33 = corpora.Dictionary(tokens_2_33)
corpus_2_33 = [dictionary_2_33.doc2bow(one_tok) for one_tok in tokens_2_33]
model_topics = gensim.models.ldamodel.LdaModel(corpus_2_33, num_topics = 3, id2word = dictionary_2_33, alpha = "auto", passes = 8, per_word_topics = True)
model_topics.print_topics(num_words = 15)

[(0,
  '0.012*"offic" + 0.009*"feder" + 0.009*"charg" + 0.008*"sentenc" + 0.008*"said" + 0.007*"victim" + 0.007*"defend" + 0.007*"prosecut" + 0.007*"investig" + 0.006*"violat" + 0.006*"general" + 0.006*"today" + 0.006*"prison" + 0.006*"enforc" + 0.006*"assault"'),
 (1,
  '0.014*"child" + 0.009*"sexual" + 0.009*"exploit" + 0.008*"hous" + 0.007*"state" + 0.007*"discrimin" + 0.007*"children" + 0.007*"disabl" + 0.007*"enforc" + 0.007*"safe" + 0.006*"project" + 0.006*"childhood" + 0.006*"individu" + 0.006*"feder" + 0.006*"pornographi"'),
 (2,
  '0.012*"victim" + 0.010*"guilti" + 0.009*"prosecut" + 0.009*"crime" + 0.009*"sentenc" + 0.008*"charg" + 0.008*"feder" + 0.008*"defend" + 0.008*"said" + 0.007*"hate" + 0.007*"indict" + 0.007*"prison" + 0.006*"plead" + 0.006*"today" + 0.006*"year"')]

### 2.3.4 Add topics back to main data and explore correlation between manual labels and our estimated topics (10 points)

A. Extract the document-level topic probabilities. Within `get_document_topics`, use the argument `minimum_probability` = 0 to make sure all 3 topic probabilities are returned. Write an assert statement to make sure the length of the list is equal to the number of rows in the `doj_subset` dataframe

B. Add the topic probabilities to the `doj_subset` dataframe as columns and code each document to its highest-probability topic

C. For each of the manual labels in `topics_clean` (Hate Crime, Civil Rights, Project Safe Childhood), print the breakdown of the % of documents with each top topic (so, for instance, Hate Crime has 246 documents-- if 123 of those documents are coded to topic_1, that would be 50%; and so on). **Hint**: pd.crosstab and normalize may be helpful: https://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.crosstab.html

D. Using a couple press releases as examples, write a 1-2 sentence interpretation of why some of the manual topics map on more cleanly to an estimated topic than other manual topic(s)

**Resources**:

- End of this code contains example of how to use `get_document_topics` and other steps to add topic probabilities back to data: https://github.com/rebeccajohnson88/qss20_slides_activities/blob/main/activities/06_textasdata_partII_topicmodeling_solution.ipynb

In [90]:
probs_34 = [model_topics.get_document_topics(item, minimum_probability = 0) for item in corpus_2_33]
assert len(probs_34) == doj_subset.shape[0]
df_34 = pd.DataFrame(probs_34)
for i in df_34.columns:
    df_34[i] = [df_34[i][x][1] for x in range(0, df_34.shape[0])]
    
doj_subset = doj_subset.merge(right = df_34, how = 'left', left_index = True, right_index = True)
#doj_subset
doj_subset['topic_model']= doj_subset[[0,1,2]].idxmax(axis=1)
#for some reason the above line only seems to work every other time i run this cell
pd.crosstab(doj_subset.topics_clean, doj_subset.topic_model, normalize = "index")

topic_model,0.0,1.0,2.0
topics_clean,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Civil Rights,0.0,0.333333,0.666667
Hate Crimes,0.363636,0.227273,0.409091
Project Safe Childhood,0.272727,0.636364,0.090909


## 2.5 OPTIONAL extra credit (5 points)

You notice that the pharmaceutical kickbacks press release we analyzed in question 2.1 was for an indictment, and that in the original data, there's not a clear label for whether a press release outlines an indictment (charging someone with a crime), a conviction (convicting them after that charge either via a settlement or trial), or a sentencing (how many years of prison or supervised release a defendant is sentenced to after their conviction).

You want to see if you can identify pairs of press releases where one press release is from one stage (e.g., indictment) and another is from a different stage (e.g., a sentencing).

You decide that one way to approach is to find the pairwise string similarity between each of the processed press releases in `doj_subset`. There are many ways to do this, so Google for some approaches, focusing on ones that work well for entire documents rather than small strings. Feel free to load additional packages if needed

Find the top two pairs (so four press releases total)-- do they seem like different stages of the same crime or just press releases covering similar crimes?