In [1]:
setwd('/Users/chansoosong/Desktop/Research/edsp2019project-chansooligans/Archive/Progress_Reports/')
library(rjson)
library(stringr)

# Measuring Judge Ideology

![Martin_Quinn](https://mqscores.lsa.umich.edu/images/ipAnim1937_2006.gif)

Standard IRT model with discrimination parameter (gamma):

This is a simplified, static version of the Martin-Quinn estimation. The only difference is that each judge's ideology is assumed to be constant over time.

For this simplified example, suppose the outcome, y, is coded so that 1 = conservative vote and 0 = liberal vote.

![Image1](images/img1.png)
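To make the model concrete, here is a minimal simulation sketch. The parameterization is an assumption (a common two-parameter logistic form) and may differ from the one in the image above; `theta`, `beta`, and `gamma` are illustrative names for the judge ideal points, case locations, and case discrimination parameters.

```r
# Sketch: simulate votes from a static two-parameter IRT model.
# Assumed parameterization (not taken from the source):
#   P(y_ij = 1) = logistic(gamma_j * (theta_i - beta_j))
set.seed(1)
n_judges <- 9
n_cases  <- 200

theta <- rnorm(n_judges)          # judge ideal points, constant over time
beta  <- rnorm(n_cases)           # case location parameters
gamma <- runif(n_cases, 0.5, 2)   # case discrimination parameters

# Linear predictor: rows are judges, columns are cases
eta <- outer(theta, beta, "-")    # theta_i - beta_j
eta <- sweep(eta, 2, gamma, "*")  # scale column j by gamma_j

p <- plogis(eta)                  # probability of a conservative vote
y <- matrix(rbinom(length(p), 1, p), nrow = n_judges)  # 1 = conservative, 0 = liberal
```

A larger `gamma` means a case separates judges on either side of `beta` more sharply, while a `gamma` near zero makes the vote close to a coin flip regardless of ideology.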

## Plan:

1. Get the data
2. Collect metadata
3. Clean the data
4. Get vector representations for documents
5. Create vector representations for judges
6. Using the metadata, plot the principal components of the vector representations 



### Problem 1: "plain_text", "html", "html_lawbox", "html_columbia", "html_with_citations" each contain opinions with different styles

- Opinions are stored in JSON under several different keys
- Each key stores the text in a different style (e.g., the HTML tags may differ)

In [2]:
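# Read every opinion JSON file in data/ into a list
file_names = list.files('data/')
opinions = vector('list', length(file_names))
example1 = fromJSON(file='data/2448.json')  # a single file, kept for spot checks
# seq_along() is safe even when the directory is empty (1:length() would yield 1:0)
for(i in seq_along(file_names)) opinions[[i]] = fromJSON(file=file.path('data',file_names[i]))
str(opinions[[1]])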

List of 21
 $ resource_uri       : chr "http://www.courtlistener.com/api/rest/v3/opinions/1036108/"
 $ absolute_url       : chr "/opinion/1036108/manning-v-boston-medical-center/"
 $ cluster            : chr "http://www.courtlistener.com/api/rest/v3/clusters/1036108/"
 $ author             : NULL
 $ joined_by          : list()
 $ author_str         : chr ""
 $ per_curiam         : logi FALSE
 $ date_created       : chr "2013-08-01T21:05:37.076430Z"
 $ date_modified      : chr "2017-03-28T13:06:54.959623Z"
 $ type               : chr "010combined"
 $ sha1               : chr "2ab04ba7e11cba1db185ccda5e4c4f447b569732"
 $ page_count         : num 65
 $ download_url       : chr "http://media.ca1.uscourts.gov/pdf.opinions/12-1573P-01A.pdf"
 $ local_path         : chr "pdf/2013/08/01/manning_v._boston_medical_center.pdf"
 $ plain_text         : chr "          United States Court of Appeals\n                        For the First Circuit\n\n\nNos. 12-1573, 12-1"| __truncated__
 $ html         

In [3]:
# Print the first n_chars characters of each text field of one opinion
print_some_lines = function(x,n_chars) {
    print(paste('plain_text:',substr(x$plain_text,1,n_chars)))
    print(paste('html:',substr(x$html,1,n_chars)))
    print(paste('html_columbia:',substr(x$html_columbia,1,n_chars)))
    print(paste('html_with_citations:',substr(x$html_with_citations,1,n_chars)))
}

In [4]:
print_some_lines(opinions[[1]],2000)

[1] "plain_text:           United States Court of Appeals\n                        For the First Circuit\n\n\nNos. 12-1573, 12-1653\n\n                    ELIZABETH MANNING ET AL.,\n\n                        Plaintiffs, Appellants,\n\n                                  v.\n\n               BOSTON MEDICAL CENTER CORPORATION;\n                  ELAINE ULLIAN; JAMES CANAVAN,\n\n                        Defendants, Appellees,\n\n  BOSTON REGIONAL MEDICAL CENTER, INC.; BOSTON REGIONAL MEDICAL\n     CENTER, LLC; BOSTON MEDICAL CENTER 403B RETIREMENT PLAN,\n\n                              Defendants.\n\n\n          APPEAL FROM THE UNITED STATES DISTRICT COURT\n                FOR THE DISTRICT OF MASSACHUSETTS\n\n            [Hon. Rya W. Zobel, U.S. District Judge]\n\n\n                                Before\n\n                   Thompson, Stahl, and Lipez,\n                         Circuit Judges.\n\n\n     Guy A. Talia, with whom Patrick J. Solomon and Thomas &\nSolomon LLP were on brief, for 

### Problem 2: Get Author Name

In [5]:
opinions[[1067]]$author
opinions[[1067]]$author_str

NULL

https://www.courtlistener.com/opinion/237855/doris-sylvia-grey-infant-and-howard-martin-grey-infant-children-of/

In [6]:
substr(opinions[[1067]]$html_with_citations,1025,1200)

In [7]:
str_match(opinions[[1067]]$html_with_citations, "<p class=\"indent\"([ \t>]*)([a-zA-Z,]*)[ \t]Circuit[ \t]+Judge[.]")[,3]
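Because the second capture group allows commas, the extracted name can carry a trailing comma. The sketch below runs the same pattern on a hypothetical HTML snippet (the `html` string is made up for illustration) and strips that comma.

```r
library(stringr)

# Hypothetical snippet imitating the opening of an opinion
html <- "<p class=\"indent\">  SELYA, Circuit Judge.</p>"

pattern <- "<p class=\"indent\"([ \t>]*)([a-zA-Z,]*)[ \t]Circuit[ \t]+Judge[.]"

# Column 1 is the full match; column 3 is the second capture group (the name)
raw_name <- str_match(html, pattern)[, 3]

# Drop the trailing comma picked up by the [a-zA-Z,]* group
judge <- sub(",$", "", raw_name)
judge
```

When the pattern does not match, `str_match` returns `NA`, so opinions without an author line can be found with `is.na(judge)`.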

### Problem 3: "Year" is missing  

- Year is important for removing time/era effects after obtaining vector representations of the texts.
- It is easy to get from `local_path` if available; otherwise it must be extracted from the opinion text.

In [8]:
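# Month name (full or abbreviated), day, and four-digit year, e.g. "August 1, 2013".
# The pattern is built with paste0() so no literal newline ends up inside the regex.
regex_date = paste0('(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?',
                    '|Nov(ember)?|Dec(ember)?)\\s+\\d{1,2},\\s+\\d{4}')
# Element [1] of the match matrix is the full matched date string
date = str_match(opinions[[4]]$html_with_citations, regex_date)[1]
date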
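The two routes to a year described above can be combined in one helper. This is a sketch only: `extract_year` is a hypothetical function, and it assumes `local_path` has the form `pdf/YYYY/MM/DD/...` seen in the sample record.

```r
library(stringr)

# Hypothetical helper: take the year from local_path when present,
# otherwise fall back to the first date found in the opinion text.
extract_year <- function(local_path, text) {
  regex_date <- paste0(
    "(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|",
    "Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)\\s+\\d{1,2},\\s+\\d{4}"
  )
  year <- NA_character_
  if (!is.null(local_path) && !is.na(local_path)) {
    # Assumed path layout: pdf/YYYY/MM/DD/file.pdf
    year <- str_match(local_path, "^pdf/(\\d{4})/")[, 2]
  }
  if (is.na(year) && !is.null(text)) {
    date_str <- str_extract(text, regex_date)   # NA when no date is found
    year <- str_extract(date_str, "\\d{4}$")
  }
  as.integer(year)
}

extract_year("pdf/2013/08/01/manning_v._boston_medical_center.pdf", NULL)
extract_year(NULL, "Decided August 1, 2013.")
```

The text fallback returns the first date in the document, which may be an argued date or a cited date rather than the decision date, so results from that branch deserve a spot check.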



### Problem 4: Dissents, Concurring, Per Curiam, Errata Sheets

Example Dissent: 

https://www.courtlistener.com/opinion/1036108/manning-v-boston-medical-center/

Example Errata Sheet:

https://www.courtlistener.com/opinion/1034770/in-reauerhahn-v/

Excluding these for now  

Total First Circuit Files: 34834  
Dissents: 275 (< 1%)  
Errata: 1527 (4%)  
Per Curiam: 3713 (11%)  

![image2](images/federal_dissents.png)

(From "Why (and When) Judges Dissent: A Theoretical and Empirical Analysis" - Epstein, Landes, Posner (2011))

## Metadata

In [9]:
load('data_inventory.ca1.RDATA')

In [10]:
head(df)

file_name,year,case_name,alt_case_name,circuit,local_path,absolute_url,type,author,joined_by,download_url,plain_text,judge,dissent,concurring,per_curiam,errata
1.json,2010,US_v._Davila-Gonzalez,united-states-v-davila-gonzalez,ca1,pdf/2010/02/10/US_v._Davila-Gonzalez.pdf,/opinion/1/united-states-v-davila-gonzalez/,010combined,,,http://www.ca1.uscourts.gov/pdf.opinions/08-2575P-01A.pdf,plain,"SELYA,",0,0,0,0
10.json,2010,US_v._Mitchell,united-states-v-mitchell,ca1,pdf/2010/02/22/US_v._Mitchell.pdf,/opinion/10/united-states-v-mitchell/,010combined,,,http://www.ca1.uscourts.gov/pdf.opinions/09-1260P-01A.pdf,plain,"TORRUELLA,",0,0,0,0
1000.json,2010,Adams_v._Adams,adams-v-adams,ca1,pdf/2010/03/31/Adams_v._Adams.pdf,/opinion/1000/adams-v-adams/,010combined,,,http://www.ca1.uscourts.gov/pdf.opinions/09-1443P-01A.pdf,plain,"STAHL,",0,0,0,0
1001.json,2010,Airframe_Systems_Inc._v._Raytheon_Company,airframe-systems-inc-v-raytheon-co,ca1,pdf/2010/03/31/Airframe_Systems_Inc._v._Raytheon_Company.pdf,/opinion/1001/airframe-systems-inc-v-raytheon-co/,010combined,,,http://www.ca1.uscourts.gov/pdf.opinions/09-1624P-01A.pdf,plain,"LYNCH,",0,0,0,0
1032057.json,2013,united_states_v._hogan,united-states-v-hogan,ca1,pdf/2013/07/05/united_states_v._hogan.pdf,/opinion/1032057/united-states-v-hogan/,010combined,,,http://media.ca1.uscourts.gov/pdf.opinions/12-1039P-01A.pdf,plain,"THOMPSON,",0,0,0,0
1032437.json,2013,mitchell_v._us_airways_inc.,mitchell-v-us-airways-inc,ca1,pdf/2013/07/09/mitchell_v._us_airways_inc..pdf,/opinion/1032437/mitchell-v-us-airways-inc/,010combined,,,http://media.ca1.uscourts.gov/pdf.opinions/12-1543P-01A.pdf,plain,"SELYA,",0,0,0,0


***
## Text Pre-Processing



Tokenize Terms:

1. convert to lowercase
2. split strings using delimiter: " \r\n\t.,;:()?!//"
3. remove punctuation
4. remove numbers
5. trim white space
6. remove stop words
7. drop tokens shorter than 3 characters
8. drop tokens longer than 100 characters
9. stem tokens
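The steps above can be sketched in base R as follows. This is an illustration, not the pipeline actually used: `tokenize` and the short `stop_words` vector are hypothetical, and stemming is applied only if the SnowballC package happens to be installed.

```r
# Hypothetical stop-word list; a real run would use a fuller one
stop_words <- c("the", "and", "for", "was", "with", "that", "this")

tokenize <- function(text, min_chars = 3, max_chars = 100) {
  text   <- tolower(text)                                # 1. lowercase
  tokens <- strsplit(text, "[ \r\n\t.,;:()?!/]+")[[1]]   # 2. split on delimiters
  tokens <- gsub("[[:punct:]]", "", tokens)              # 3. remove punctuation
  tokens <- gsub("[[:digit:]]", "", tokens)              # 4. remove numbers
  tokens <- trimws(tokens)                               # 5. trim white space
  tokens <- tokens[!tokens %in% stop_words]              # 6. remove stop words
  n <- nchar(tokens)
  tokens <- tokens[n >= min_chars & n <= max_chars]      # 7-8. length limits
  if (requireNamespace("SnowballC", quietly = TRUE)) {
    tokens <- SnowballC::wordStem(tokens, language = "english")  # 9. stem
  }
  tokens
}

tokens <- tokenize("The court AFFIRMED the judgment, No. 12-1573 (1st Cir. 2013).")
tokens
```

Applying the length filter after removing digits and punctuation means fragments such as "st" (left over from "1st") are dropped along with empty strings.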

## Text Model

![doc2vec](images/doc2vec.png)

## Analysis

Average the document vectors for each judge, then plot the result.
Only judges with more than 100 available opinions are kept (25 judges).
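A minimal sketch of this step on simulated data is shown below. The matrix `doc_vecs`, the `judge` labels, and the dimensions are all stand-ins chosen for illustration; only the 100-opinion threshold comes from the text.

```r
set.seed(1)
# Simulated stand-ins (assumptions): 600 documents, 10-dimensional vectors, 4 judges
n_docs   <- 600
doc_vecs <- matrix(rnorm(n_docs * 10), nrow = n_docs)
judge    <- sample(c("A", "B", "C", "D"), n_docs, replace = TRUE,
                   prob = c(0.40, 0.30, 0.25, 0.05))

# Keep only judges with more than 100 opinions
counts      <- table(judge)
keep_judges <- names(counts)[counts > 100]
keep        <- judge %in% keep_judges

# Average the document vectors within each judge (one row per judge)
judge_vecs <- rowsum(doc_vecs[keep, ], group = judge[keep]) /
  as.vector(counts[keep_judges])

# Project the judge vectors onto their first two principal components
pca <- prcomp(judge_vecs, center = TRUE)
plot(pca$x[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1:2], labels = rownames(judge_vecs))
```

`rowsum` returns one row per judge in sorted order, which matches the order of `counts[keep_judges]`, so the division yields per-judge means.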

In [None]:
# Load Data
judge_meta = read.csv(file='ca1_Judge_Metadata.csv')
head(judge_meta,30)

## First Pass:

![pca](images/NoDeMeaning.png)

## De-mean by Circuit Court:

![pca](images/DeMeanCourts.png)
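De-meaning here is taken to mean subtracting each group's average vector from every document in that group. The sketch below illustrates that reading on simulated data; `demean_by`, `doc_vecs`, and `court` are hypothetical names, not objects from the project.

```r
set.seed(1)
# Simulated stand-ins (assumptions): 8 documents, 3-dimensional vectors, 2 courts
doc_vecs <- matrix(rnorm(8 * 3), nrow = 8)
court    <- rep(c("ca1", "ca2"), each = 4)

# Subtract each group's mean vector from the documents in that group
demean_by <- function(x, group) {
  group_means <- rowsum(x, group) / as.vector(table(group))
  x - group_means[as.character(group), , drop = FALSE]
}

demeaned <- demean_by(doc_vecs, court)

# Within each court, the de-meaned vectors now sum to (numerically) zero
rowsum(demeaned, court)
```

The same helper could be applied a second time with opinion year as the grouping variable to remove era effects.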

## Color-Code Birthyear:

![pca](images/ColorCodeBirth.png)

## De-mean by Opinion Year too:

![pca](images/finalPCA1.png)

![pca](images/finalPCA2.png)