# Homework 10: Cleaning data with Regular Expressions

Time to use regular expressions!

# Hints and notes

### Opening files in subdirectories

Notice that this notebook might be in **homework/**, but!!! the CSVs and text files might be in **homework/scraped/** or **homework/scraped/minutes_pdfs/** or **homework/pdfs/**. To open a file in a subdirectory, instead of having the filename be `"file.csv"` you'll just use `"some/subfolder/file.csv"`. Stick to forward slashes: they work on Windows too, and backslashes can get misread as escape characters.
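
As a small sketch of the idea, here is one way to build a subdirectory path with `pathlib`. The temporary folder and `example.csv` are made up for illustration; they stand in for whatever your own homework folder contains.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: a temporary folder stands in for homework/
base = Path(tempfile.mkdtemp())
(base / "scraped").mkdir()
(base / "scraped" / "example.csv").write_text("name,price\nCandle,$6.99\n", encoding="utf8")

# Joining pieces with / picks the right separator on every operating system
csv_path = base / "scraped" / "example.csv"
first_line = csv_path.read_text(encoding="utf8").splitlines()[0]
print(first_line)
```

A plain string like `"scraped/example.csv"` works just as well with `open` or `pd.read_csv`; `pathlib` is only a convenience.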

### Opening text files

This will open up a file, read it in and show you the first 500 characters.

```python
contents = open("your-filename.txt").read()
contents[0:500]
```

> You might need `open("your-filename.txt", encoding="utf8").read()`

### Using regex

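For some dumb reason you need to put `r` in front of the string you use when you're talking about regex. Just plain `"(\d\d\d)"` will usually work, but *sometimes* it won't and you'll need `r"(\d\d\d)"`. (The reason: without the `r`, Python tries to interpret backslash sequences like `\b` itself before the regex engine ever sees them.) It's best to just use the `r` all of the time, if you can remember!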
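
To see the difference, here is a minimal sketch comparing the same pattern with and without the `r` prefix. The sample text is made up.

```python
import re

plain = "\bcat\b"   # Python turns each \b into a backspace character
raw = r"\bcat\b"    # the backslashes survive, so regex sees word boundaries

text = "cat concat cats"
plain_matches = re.findall(plain, text)  # looks for backspace + "cat" + backspace
raw_matches = re.findall(raw, text)      # looks for "cat" as a whole word

print(len(plain), len(raw))
print(plain_matches, raw_matches)
```

The plain version is two characters shorter and finds nothing, while the raw version finds the standalone word.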

### Using `.str.extract`

When you use `.str.extract`, you're always going to **capture one thing** and save it to a new column. You need to wrap the thing you're interested in with parentheses `(` `)`. Rows where the pattern doesn't match end up as `NaN`.

```python
df['phone_number'] = df['old_column'].str.extract(r"My phone number is (\d\d\d-\d\d\d-\d\d\d\d)")
```
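
Here is the same idea as a self-contained sketch on a made-up two-row dataframe. The `expand=False` argument is optional; it asks pandas for a Series rather than a one-column dataframe when there is exactly one capture group.

```python
import pandas as pd

# Made-up rows for illustration
df = pd.DataFrame({"old_column": [
    "My phone number is 555-123-4567",
    "No number here",
]})

# Only the part inside the parentheses is saved to the new column
df["phone_number"] = df["old_column"].str.extract(
    r"My phone number is (\d\d\d-\d\d\d-\d\d\d\d)", expand=False
)
print(df)
```

The second row has no match, so its `phone_number` is missing.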

### Setting pandas options

Pandas has a lot of options, like how many columns or rows it will show you, or how many characters it will show in a column before it stops showing you anything. Here are a few useful ones:

* `display.max_columns`: Number of columns to show at once
* `display.max_rows`: Number of rows to show at once
* `display.max_colwidth`: Maximum number of characters displayed from a string

You can set them using `pd.set_option("display.max_rows", 1000)`, for example, to show 1000 rows at a time. You can find a lot more at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.set_option.html
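
A short sketch of setting an option, reading it back, and changing one only temporarily. The specific numbers are arbitrary.

```python
import pandas as pd

# Change an option for the rest of the session, then read it back
pd.set_option("display.max_rows", 1000)
current_max_rows = pd.get_option("display.max_rows")

# option_context changes an option only inside the with-block
with pd.option_context("display.max_colwidth", 20):
    inside = pd.get_option("display.max_colwidth")

# Put display.max_rows back to its default
pd.reset_option("display.max_rows")
```

`pd.option_context` is handy when you want one wide display without changing how every later cell looks.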

### Regular expressions reference

I personally think http://www.regular-expressions.info/ is a wonderful wonderful reference (and tutorial), even if it's ugly! But here's a quick reference for you:

* `\d` is a digit
* `\d*` is zero or more digits 
* `\d+` is one or more digits
* `.` matches anything for ONE character
* `.*` is "give me anything forever"
* `\s` is whitespace, a.k.a. spaces, tabs and newlines
* `\w` is a word character, which includes capital and lowercase letters, numbers and underscores (but not hyphens).
* You can put `*` after anything, so `\w*` would mean "as many word characters as you can find"
* `\b` is a word boundary (you'll need the `r""` thing for this one)
* `( )` is a "capture group" for saving something
* `\1` is used when doing find/replace to say "put the first captured group here" (note, it's a dollar sign instead of a backslash in some editors)
* `[ABCDE]` is a character class, which means "match one of these, I don't care which"
* dollar sign means "end of the line"
* caret ^ means "beginning of the line"
* `\.` means "no really seriously I mean a period not just anything"
* You can use `\` with anything else that would normally be a special character, too, not just periods. `(` or `[` or whatever.
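
A few of the items above, tried out with Python's `re` module on a made-up sentence. This is only a sketch for experimenting, not part of the homework data.

```python
import re

line = "Order #42: 3 mugs at $4.50 each."

digits = re.findall(r"\d+", line)                    # runs of one or more digits
price = re.search(r"\$(\d+\.\d\d)", line).group(1)   # escaped $ and ., one capture group
first_word = re.search(r"^\w+", line).group(0)       # ^ anchors to the beginning
ends_with_period = bool(re.search(r"\.$", line))     # $ anchors to the end
swapped = re.sub(r"(\w+) at", r"\1 for", line)       # \1 puts the first group back
vowel_words = re.findall(r"\b[aeiou]\w*", line)      # character class after a word boundary

print(digits, price, first_word, ends_with_period)
print(swapped)
print(vowel_words)
```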

### Cleaning up extracted columns

Sometimes you get `\n` (newlines) or spaces or `\t` (tabs) or stuff at the beginning or the end of your column. `.str.strip()` will usually take care of that, just attach it after your `.str.extract()`

After you extract something, it's still a string even though you look at it and know it's a number. Use `.astype(int)` to turn it into an integer (no decimal) or `.astype(float)` to turn it into a float (yes decimal)
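
A sketch of that cleanup on made-up values. It uses `pd.to_numeric(..., errors="coerce")` as a forgiving alternative to `.astype(int)`, which would raise an error on the `"n/a"` row.

```python
import pandas as pd

# Made-up values with stray whitespace and a thousands separator
raw = pd.Series(["score: 1,204 \n", "score: 87\t", "score: n/a"])

extracted = raw.str.extract(r"score: (.*)", expand=False)

# Trim whitespace from both ends, then drop the commas
cleaned = extracted.str.strip().str.replace(",", "", regex=False)

# Anything that still isn't a number becomes NaN instead of raising an error
scores = pd.to_numeric(cleaned, errors="coerce")
print(scores)
```

Because of the missing value the result is a float column; if every row is a clean number, `.astype(int)` works fine.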

### Writing regular expressions in general

Even if I'm using regex in pandas or Python, I like to test them in my text editor with "Find." The highlighting really helps me see if I'm matching things! I also like to think "what stays the same?" when designing patterns, write those parts first, then fill in the blanks with what I want to capture.

## Importing

There might be more, I just wanted to put this up here for the `pd.set_option` part. It allows you to see a lot of content in a single column of pandas, which will be important for some parts below.

In [1]:
import re
import pandas as pd
pd.set_option('display.max_colwidth', 500)

# Part 1: Using `.str.extract` to pull data from columns in pandas

## 1.1 H&M

Open up `hm.csv` from the `scraped` directory. I want **four new columns**:

1. `price_original`, the original price
2. `price_discounted`, the discounted price
3. `pct_discount`, the percent discount
4. `article_id`, the article id (from the url)

Save as **hm_cleaned.csv**.

**Note:** When you look at it, it... won't look right. I don't know why, pandas is weird. Look at the `price` column by itself using `df['price']` before you write your regex.

**Tip:** Remember that `$` is a special regex symbol! You might need to escape it.

**Tip:** When doing `.str.extract`, the whole match doesn't get captured, only what you put `()` around! Think about anchoring to different points of the string, or things in the string.

**Tip:** Not all prices have cents!

**Tip:** Your first instinct about how to compute the percent discount is probably wrong
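
To make the tips concrete, here is a sketch on a single price string copied from the first row of the data. It assumes the discounted price is listed first and the original price second, which is how the sample rows read.

```python
import re

price_text = "$59.99 $129"  # discounted price first, then the original price

# \$ is a literal dollar sign; (?:\.\d+)? makes the cents optional
discounted, original = [
    float(p) for p in re.findall(r"\$(\d+(?:\.\d+)?)", price_text)
]

# Percent discount is the amount saved divided by the original price
pct_discount = (original - discounted) / original * 100
print(round(pct_discount, 2))
```

Dividing the discounted price by the original gives the share you still pay, not the share you save, which is the usual first-instinct mistake.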

In [2]:
# Forward slashes avoid backslash-escape problems and work on every OS
df = pd.read_csv('scraped/hm.csv', encoding='utf-8')
df

Unnamed: 0,name,price,url
0,Washed Linen Duvet Cover Set,$59.99 $129,http://www.hm.com/us/product/13472?article=13472-N
1,Candle in Glass Jar,$6.99 $17.99,http://www.hm.com/us/product/35079?article=35079-D
2,Glittery Cushion Cover,$7.99 $17.99,http://www.hm.com/us/product/72462?article=72462-A
3,Textured-weave Cushion Cover,$6.99 $12.99,http://www.hm.com/us/product/58926?article=58926-C
4,Stoneware Bowl,$17.99 $24.99,http://www.hm.com/us/product/74242?article=74242-A
5,Slub-weave Cushion Cover,$3.99 $9.99,http://www.hm.com/us/product/70965?article=70965-D
6,Braided Cushion Cover,$7.99 $17.99,http://www.hm.com/us/product/62818?article=62818-B
7,Jacquard-weave Cushion Cover,$7.99 $17.99,http://www.hm.com/us/product/69163?article=69163-B
8,Scented Candle in Glass Holder,$9.99 $17.99,http://www.hm.com/us/product/40910?article=40910-C
9,2-pack Curtain Panels,$27.99 $34.99,http://www.hm.com/us/product/69699?article=69699-B


In [3]:
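# The original price is the second one: whitespace, a literal $, digits, optional cents
df['price_original'] = df['price'].str.extract(r'\s\$(\d+(?:\.\d+)?)', expand=False).astype(float)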
df['price_original']

0     129.00
1      17.99
2      17.99
3      12.99
4      24.99
5       9.99
6      17.99
7      17.99
8      17.99
9      34.99
10      5.99
11      5.99
12     12.99
13     49.99
14     49.99
15     17.99
16     29.99
17      5.99
18     17.99
19     24.99
20     24.99
21     12.99
22     17.99
23      9.99
24     12.99
25     17.99
26      6.99
27     17.99
28      6.99
29     12.99
30     34.99
31     17.99
32     49.99
33     79.99
34     12.99
35     24.99
36      5.99
37     34.99
38     17.99
39     17.99
40     17.99
41    199.00
42     99.00
43     17.99
44     12.99
45     17.99
46      9.99
47     17.99
48     12.99
49      5.99
50      9.99
51     12.99
52      5.99
53     79.99
54     17.99
55     79.99
56      5.99
57     24.99
58     12.99
59     17.99
Name: price_original, dtype: float64

In [4]:
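# The discounted price is the first one: a literal $, digits, optional cents
df['price_discounted'] = df['price'].str.extract(r'\$(\d+(?:\.\d+)?)', expand=False).astype(float)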
df['price_discounted']

0     59.99
1      6.99
2      7.99
3      6.99
4     17.99
5      3.99
6      7.99
7      7.99
8      9.99
9     27.99
10     2.99
11     2.99
12     9.99
13    39.99
14    39.99
15     7.99
16    19.99
17     2.99
18     7.99
19    14.99
20     9.99
21     7.99
22     7.99
23     4.99
24     9.99
25     7.99
26     3.99
27     7.99
28     2.99
29     9.99
30    14.99
31     7.99
32    24.99
33    29.99
34     4.99
35    19.99
36     2.99
37    14.99
38     6.99
39     6.99
40     6.99
41    99.00
42    54.99
43     6.99
44     4.99
45     7.99
46     3.99
47     6.99
48     4.99
49     2.99
50     4.99
51     6.99
52     2.99
53    34.99
54     7.99
55    34.99
56     2.99
57    14.99
58     5.99
59    12.99
Name: price_discounted, dtype: float64

In [5]:
# Percent discount: the amount saved as a share of the original price
df['pct_discount'] = (df['price_original'] - df['price_discounted']) / df['price_original'] * 100
df['pct_discount']

0     53.496124
1     61.145081
2     55.586437
3     46.189376
4     28.011204
5     60.060060
6     55.586437
7     55.586437
8     44.469150
9     20.005716
10    50.083472
11    50.083472
12    23.094688
13    20.004001
14    20.004001
15    55.586437
16    33.344448
17    50.083472
18    55.586437
19    40.016006
20    60.024010
21    38.491147
22    55.586437
23    50.050050
24    23.094688
25    55.586437
26    42.918455
27    55.586437
28    57.224607
29    23.094688
30    57.159188
31    55.586437
32    50.010002
33    62.507813
34    61.585835
35    20.008003
36    50.083472
37    57.159188
38    61.145081
39    61.145081
40    61.145081
41    50.251256
42    44.454545
43    61.145081
44    61.585835
45    55.586437
46    60.060060
47    61.145081
48    61.585835
49    50.083472
50    50.050050
51    46.189376
52    50.083472
53    56.257032
54    55.586437
55    56.257032
56    50.083472
57    40.016006
58    53.8876

In [6]:
df['article_id']=df['url'].str.extract(r'(\d\d\d\d\d)')
df['article_id']

0     13472
1     35079
2     72462
3     58926
4     74242
5     70965
6     62818
7     69163
8     40910
9     69699
10    76916
11    68473
12    74498
13    41512
14    76727
15    76977
16    73369
17    59417
18    76797
19    76794
20    80432
21    78352
22    70658
23    72111
24    77665
25    70603
26    72262
27    70427
28    72262
29    77665
30    73380
31    35079
32    74773
33    75033
34    60637
35    70306
36    60621
37    69643
38    72455
39    72460
40    72462
41    71871
42    73734
43    74267
44    73344
45    73325
46    74264
47    72455
48    70799
49    68473
50    72111
51    58926
52    58211
53    75469
54    76982
55    74825
56    74263
57    78047
58    60637
59    78381
Name: article_id, dtype: object

In [7]:
#just checking 
df

Unnamed: 0,name,price,url,price_original,price_discounted,pct_discount,article_id
0,Washed Linen Duvet Cover Set,$59.99 $129,http://www.hm.com/us/product/13472?article=13472-N,129.0,59.99,53.496124,13472
1,Candle in Glass Jar,$6.99 $17.99,http://www.hm.com/us/product/35079?article=35079-D,17.99,6.99,61.145081,35079
2,Glittery Cushion Cover,$7.99 $17.99,http://www.hm.com/us/product/72462?article=72462-A,17.99,7.99,55.586437,72462
3,Textured-weave Cushion Cover,$6.99 $12.99,http://www.hm.com/us/product/58926?article=58926-C,12.99,6.99,46.189376,58926
4,Stoneware Bowl,$17.99 $24.99,http://www.hm.com/us/product/74242?article=74242-A,24.99,17.99,28.011204,74242
5,Slub-weave Cushion Cover,$3.99 $9.99,http://www.hm.com/us/product/70965?article=70965-D,9.99,3.99,60.06006,70965
6,Braided Cushion Cover,$7.99 $17.99,http://www.hm.com/us/product/62818?article=62818-B,17.99,7.99,55.586437,62818
7,Jacquard-weave Cushion Cover,$7.99 $17.99,http://www.hm.com/us/product/69163?article=69163-B,17.99,7.99,55.586437,69163
8,Scented Candle in Glass Holder,$9.99 $17.99,http://www.hm.com/us/product/40910?article=40910-C,17.99,9.99,44.46915,40910
9,2-pack Curtain Panels,$27.99 $34.99,http://www.hm.com/us/product/69699?article=69699-B,34.99,27.99,20.005716,69699


## 1.2 Sci-Fi Authors

Open up `sci-fi.csv` to clean. Get rid of the `\n` on the title and give me six new columns:

* `avg_rating`
* `rating_count`
* `total_score`
* `score_votes`
* `series` the series the book belongs to
* `series_no` the book in the series that it is

For series, I'm talking about e.g. `(The Hunger Games, #1)` is `series` "The Hunger Games" and `series_no` 1.

Save as **sci-fi_cleaned.csv**.

**Tip:** You don't need regex to clean the title - there's a special thing that removes whitespace from the beginning/end of strings

**Tip:** Remember that `(` and `)` are special characters

**BONUS:** When you make the `total_score` column, pay close attention to it. If you notice the problem, fix it.

**BONUS:** You don't need these columns to be numbers, but life would be better if they were. 
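
As a sketch of the parentheses tip, here are two titles copied from the data, with the series name and number pulled out in one go. Using two capture groups in a single `.str.extract` is just one possible approach; one group per column works equally well.

```python
import pandas as pd

titles = pd.Series([
    "\nThe Hunger Games (The Hunger Games, #1)\n",
    "\nKindred\n",
])

# .str.strip() removes the newlines at both ends, no regex needed
clean_titles = titles.str.strip()

# \( and \) match literal parentheses; the unescaped pairs are capture groups
parts = clean_titles.str.extract(r"\((.*), #([^)]+)\)")
parts.columns = ["series", "series_no"]
print(parts)
```

With more than one capture group, `.str.extract` returns a dataframe with one column per group, and titles without a series come back as missing values.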

In [8]:
# Forward slashes avoid backslash-escape problems and work on every OS
df_scifi = pd.read_csv('scraped/sci-fi.csv', encoding='utf-8')
df_scifi
#okay, this is a mess...

Unnamed: 0,full_rating,full_score,rank,title,url
0,"4.07 avg rating — 785,502 ratings","\nscore: 28,539,\n and\n292 people voted\n \n \n",1,\nThe Handmaid's Tale\n,/book/show/38447.The_Handmaid_s_Tale
1,"4.34 avg rating — 5,212,935 ratings","\nscore: 27,566,\n and\n282 people voted\n \n \n",2,"\nThe Hunger Games (The Hunger Games, #1)\n",/book/show/2767052-the-hunger-games
2,"3.76 avg rating — 922,308 ratings","\nscore: 20,049,\n and\n205 people voted\n \n \n",3,"\nFrankenstein, or The Modern Prometheus\n",/book/show/18490.Frankenstein_or_The_Modern_Prometheus
3,"4.04 avg rating — 702,272 ratings","\nscore: 17,684,\n and\n185 people voted\n \n \n",4,"\nA Wrinkle in Time (A Wrinkle in Time Quintet, #1)\n",/book/show/18131.A_Wrinkle_in_Time
4,"4.06 avg rating — 77,664 ratings","\nscore: 16,070,\n and\n165 people voted\n \n \n",5,\nThe Left Hand of Darkness\n,/book/show/18423.The_Left_Hand_of_Darkness
5,"4.23 avg rating — 2,345,974 ratings","\nscore: 12,935,\n and\n134 people voted\n \n \n",6,"\nDivergent (Divergent, #1)\n",/book/show/13335037-divergent
6,"4.30 avg rating — 2,049,239 ratings","\nscore: 12,261,\n and\n128 people voted\n \n \n",7,"\nCatching Fire (The Hunger Games, #2)\n",/book/show/6148028-catching-fire
7,"4.12 avg rating — 1,379,452 ratings","\nscore: 11,238,\n and\n117 people voted\n \n \n",8,"\nThe Giver (The Giver, #1)\n",/book/show/3636.The_Giver
8,"4.19 avg rating — 57,605 ratings","\nscore: 10,246,\n and\n107 people voted\n \n \n",9,\nThe Dispossessed\n,/book/show/13651.The_Dispossessed
9,"4.20 avg rating — 53,473 ratings","\nscore: 9,907,\n and\n104 people voted\n \n \n",10,\nKindred\n,/book/show/60931.Kindred


In [9]:
#Get rid of the \n on the title
df_scifi['title']=df_scifi['title'].str.strip()

In [10]:
df_scifi['avg_rating']= df_scifi['full_rating'].str.extract(r'(\d\.\d\d)\s').astype(float) 
df_scifi['avg_rating']

0     4.07
1     4.34
2     3.76
3     4.04
4     4.06
5     4.23
6     4.30
7     4.12
8     4.19
9     4.20
10    4.00
11    4.03
12    3.95
13    4.03
14    4.10
15    4.17
16    4.12
17    3.97
18    4.10
19    4.14
20    4.13
21    4.29
22    4.31
23    3.84
24    4.06
25    3.69
26    4.44
27    3.88
28    4.08
29    4.06
      ... 
70    4.41
71    4.02
72    3.83
73    3.55
74    3.78
75    4.01
76    4.46
77    4.18
78    4.05
79    4.32
80    4.48
81    3.88
82    4.19
83    4.28
84    3.99
85    4.91
86    3.86
87    4.04
88    4.01
89    3.91
90    3.74
91    4.03
92    4.05
93    3.82
94    4.19
95    4.17
96    3.78
97    4.16
98    3.85
99    3.83
Name: avg_rating, Length: 100, dtype: float64

In [11]:
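# Capture the digits and commas that come right before "ratings"
df_scifi['rating_count'] = df_scifi['full_rating'].str.extract(r'([\d,]+) ratings?', expand=False)
# Drop the thousands separators and save the integer version back to the column
df_scifi['rating_count'] = df_scifi['rating_count'].str.replace(',', '', regex=False).astype(int)
df_scifi['rating_count']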

0      785502
1     5212935
2      922308
3      702272
4       77664
5     2345974
6     2049239
7     1379452
8       57605
9       53473
10     176024
11    1932930
12    1355662
13      38534
14      36316
15      46116
16      22554
17      53063
18     106206
19      31668
20      26705
21      18831
22      20045
23     793665
24     979341
25      33313
26      12743
27       1466
28      16555
29      81801
       ...   
70        336
71     206616
72        799
73       4568
74        215
75       8590
76        256
77        748
78      10873
79       3203
80        244
81       5588
82        367
83         93
84     366089
85         56
86        134
87       3160
88       4504
89       9782
90      11925
91       3338
92      57864
93        116
94       3328
95       3717
96     103428
97         97
98        111
99       3244
Name: rating_count, Length: 100, dtype: int32

In [12]:
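df_scifi['total_score'] = df_scifi['full_score'].str.extract(r'(\d.*\d),\n', expand=False)
# Drop the thousands separators and save the integer version back to the column
df_scifi['total_score'] = df_scifi['total_score'].str.replace(',', '', regex=False).astype(int)
df_scifi['total_score']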

0     28539
1     27566
2     20049
3     17684
4     16070
5     12935
6     12261
7     11238
8     10246
9      9907
10     9628
11     9423
12     8250
13     8175
14     7523
15     6571
16     6418
17     6257
18     6238
19     5999
20     5933
21     5698
22     5046
23     5012
24     4899
25     4413
26     4398
27     4394
28     4249
29     4245
      ...  
70     1817
71     1775
72     1772
73     1740
74     1698
75     1692
76     1687
77     1684
78     1683
79     1575
80     1565
81     1564
82     1559
83     1495
84     1461
85     1458
86     1455
87     1420
88     1380
89     1323
90     1316
91     1301
92     1285
93     1278
94     1240
95     1197
96     1174
97     1170
98     1169
99     1155
Name: total_score, Length: 100, dtype: int32

In [13]:
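# Anchor on the words that follow the number instead of guessing how many digits it has
df_scifi['score_votes'] = df_scifi['full_score'].str.extract(r'(\d+) people voted', expand=False).astype(int)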
df_scifi['score_votes']

0     292
1     282
2     205
3     185
4     165
5     134
6     128
7     117
8     107
9     104
10    102
11    100
12     88
13     88
14     81
15     69
16     68
17     70
18     68
19     64
20     64
21     61
22     54
23     53
24     54
25     48
26     48
27     45
28     47
29     46
     ... 
70     19
71     20
72     19
73     22
74     17
75     22
76     18
77     18
78     24
79     19
80     17
81     19
82     17
83     16
84     16
85     16
86     16
87     18
88     15
89     18
90     15
91     16
92     16
93     14
94     16
95     16
96     13
97     13
98     13
99     16
Name: score_votes, Length: 100, dtype: int32

In [14]:
df_scifi['series']=df_scifi['title'].str.extract(r'[(](\b.* )')
# Remove the comma, then strip the trailing space the capture group picked up
df_scifi['series']=df_scifi['series'].str.replace(',','').str.strip()
df_scifi['series']

0                              NaN
1                 The Hunger Games
2                              NaN
3        A Wrinkle in Time Quintet
4                              NaN
5                        Divergent
6                 The Hunger Games
7                        The Giver
8                              NaN
9                              NaN
10                       MaddAddam
11                The Hunger Games
12                             NaN
13              Oxford Time Travel
14                             NaN
15                     The Sparrow
16                 Vorkosigan Saga
17                  Imperial Radch
18            Dragonriders of Pern
19                       Earthseed
20              Oxford Time Travel
21                 Vorkosigan Saga
22                 Vorkosigan Saga
23                        The Host
24                       Divergent
25                             NaN
26                 Vorkosigan Saga
27                 Aurora Rh

In [15]:
df_scifi['series_no']= df_scifi['title'].str.extract(r', #(.*)[)]')
df_scifi['series_no']

0     NaN
1       1
2     NaN
3       1
4     NaN
5       1
6       2
7       1
8     NaN
9     NaN
10      1
11      3
12    NaN
13      1
14    NaN
15      1
16      1
17      1
18      1
19      1
20      2
21      2
22      7
23      1
24      2
25    NaN
26     10
27      1
28      1
29      2
     ... 
70      4
71    NaN
72      1
73    NaN
74      1
75    NaN
76      5
77      2
78     14
79    1-4
80      6
81      1
82      2
83     10
84      1
85      2
86      2
87    NaN
88    NaN
89    NaN
90      1
91    NaN
92      1
93      5
94      3
95      2
96      1
97      1
98      4
99    NaN
Name: series_no, Length: 100, dtype: object

In [16]:
#just checking
df_scifi

Unnamed: 0,full_rating,full_score,rank,title,url,avg_rating,rating_count,total_score,score_votes,series,series_no
0,"4.07 avg rating — 785,502 ratings","\nscore: 28,539,\n and\n292 people voted\n \n \n",1,The Handmaid's Tale,/book/show/38447.The_Handmaid_s_Tale,4.07,785502,28539,292,,
1,"4.34 avg rating — 5,212,935 ratings","\nscore: 27,566,\n and\n282 people voted\n \n \n",2,"The Hunger Games (The Hunger Games, #1)",/book/show/2767052-the-hunger-games,4.34,5212935,27566,282,The Hunger Games,1
2,"3.76 avg rating — 922,308 ratings","\nscore: 20,049,\n and\n205 people voted\n \n \n",3,"Frankenstein, or The Modern Prometheus",/book/show/18490.Frankenstein_or_The_Modern_Prometheus,3.76,922308,20049,205,,
3,"4.04 avg rating — 702,272 ratings","\nscore: 17,684,\n and\n185 people voted\n \n \n",4,"A Wrinkle in Time (A Wrinkle in Time Quintet, #1)",/book/show/18131.A_Wrinkle_in_Time,4.04,702272,17684,185,A Wrinkle in Time Quintet,1
4,"4.06 avg rating — 77,664 ratings","\nscore: 16,070,\n and\n165 people voted\n \n \n",5,The Left Hand of Darkness,/book/show/18423.The_Left_Hand_of_Darkness,4.06,77664,16070,165,,
5,"4.23 avg rating — 2,345,974 ratings","\nscore: 12,935,\n and\n134 people voted\n \n \n",6,"Divergent (Divergent, #1)",/book/show/13335037-divergent,4.23,2345974,12935,134,Divergent,1
6,"4.30 avg rating — 2,049,239 ratings","\nscore: 12,261,\n and\n128 people voted\n \n \n",7,"Catching Fire (The Hunger Games, #2)",/book/show/6148028-catching-fire,4.30,2049239,12261,128,The Hunger Games,2
7,"4.12 avg rating — 1,379,452 ratings","\nscore: 11,238,\n and\n117 people voted\n \n \n",8,"The Giver (The Giver, #1)",/book/show/3636.The_Giver,4.12,1379452,11238,117,The Giver,1
8,"4.19 avg rating — 57,605 ratings","\nscore: 10,246,\n and\n107 people voted\n \n \n",9,The Dispossessed,/book/show/13651.The_Dispossessed,4.19,57605,10246,107,,
9,"4.20 avg rating — 53,473 ratings","\nscore: 9,907,\n and\n104 people voted\n \n \n",10,Kindred,/book/show/60931.Kindred,4.20,53473,9907,104,,


## 1.3 Where you're just doing one of my former students' projects

Once upon a time my student Stefan did a project that involved some lawyer stuff. Most of the content was in PDFs, though! I converted them to text files and put them into the `pdfs` folder, and gave you code below to open up each of them and save their contents into a dataframe.

What a nice dataframe! I want you to add the following columns to it:

* `lawyer_app`, the applicant's lawyer (pro se means that they did it themselves, that's fine)
* `lawyer_gov`, the government's lawyer
* `judge`, the name of the judge
* `access`, whether the clearance is granted or denied (although you might miss a few)

Save as **court_cleaned.csv**.

**Note:** You can look at the original PDFs, they're also included.

**Note:** This uses a fun utility called `glob`, which is mostly fun because you use it as `glob.glob`. It's used to find files that match a certain filename pattern.

**BONUS:** You'll be happy once you get the judge, but make sure it doesn't have any extra punctuation on it.

**BONUS:** You can search for some words using `.str.contains("blah")` and save the result into new columns. Maybe `has_debt`, `has_bankruptcy`, etc.

> It's okay if it isn't perfect. Converting PDF into data rarely is! Usually you get 90% of it done with computers, then send people to enter the other 10% by hand.
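
Here is a sketch of `glob.glob` and `.str.contains` working together. The three files are made up and written to a temporary folder so the example can run on its own; the real text files live in the `pdfs` folder.

```python
import glob
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical files for illustration only
folder = Path(tempfile.mkdtemp())
(folder / "a.txt").write_text("Applicant filed for bankruptcy in 2012.", encoding="utf8")
(folder / "b.txt").write_text("Applicant has delinquent DEBT.", encoding="utf8")
(folder / "notes.md").write_text("not a text file we want", encoding="utf8")

# glob.glob returns every path matching the pattern; * stands for any run of characters
filenames = sorted(glob.glob(str(folder / "*.txt")))
contents = [Path(f).read_text(encoding="utf8") for f in filenames]
df = pd.DataFrame({"filename": filenames, "content": contents})

# case=False makes the search ignore capitalization
df["has_debt"] = df["content"].str.contains("debt", case=False)
df["has_bankruptcy"] = df["content"].str.contains("bankruptcy", case=False)
print(df[["has_debt", "has_bankruptcy"]])
```

Only the two `.txt` files are picked up, and each new column is a True/False flag per document.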

In [17]:
pd.set_option('display.max_colwidth', 1000)

In [18]:
import glob
filenames = glob.glob("pdfs/*.txt")
contents = [open(filename, encoding="utf8").read() for filename in filenames]
df_law= pd.DataFrame({'filename': filenames, 'content': contents})
df_law

Unnamed: 0,filename,content
0,pdfs\11-02438.h1.pdf.txt,"\n\n DEPARTMENT OF DEFENSE \n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \n\n \n \nIn the matter of: \n \n \n \n \nApplicant for Security Clearance \n\n \n \n\n \n\nISCR Case No. 11-02438 \n\nFor Government: Stephanie C. Hess, Esq., Department Counsel \n\nFor Applicant: Pro se \n\nAppearances \n\n______________ \n\n \nDecision \n\n______________ \n\n \n\n \n\n \n\n \n \n\n \n\nCOACHER, Robert E., Administrative Judge: \n\n \nApplicant has not mitigated the alcohol consumption security concerns. Eligibility \n\nfor access to classified information is denied. \n\nStatement of the Case \n\nOn June 16, 2015, the Department of Defense Consolidated Adjudications \nFacility (DOD CAF) issued Applicant a Statement of Reasons (SOR) detailing security \nconcerns under Guideline G, alcohol consumption. DOD CAF acted under Executive \nOrder (EO) 10865, ..."
1,pdfs\11-03073.h1.pdf.txt,"\n\n DEPARTMENT OF DEFENSE \n\n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n \n\n \n\nISCR Case No. 11-03073 \n\n \n \n\n) \n) \n) \n) \n) \n \n \n\n \n \n\nAppearances \n\n______________ \n\n \nDecision \n\n______________ \n\n \n\n \n\n \n\n \n \n\nFor Government: Robert J. Kilmartin, Esq., Department Counsel \n\nFor Applicant: Mark S. Zaid, Esq. \n\nLOUGHRAN, Edward W., Administrative Judge: \n\n \nApplicant mitigated the financial considerations security concerns. Eligibility for \n\naccess to classified information is granted. \n \n\nStatement of the Case \n\nOn October 28, 2014, the Department of Defense (DOD) issued a Statement of \nReasons (SOR) to Applicant detailing security concerns under Guideline F, financial \nconsiderations. The action was taken under Executive Order..."
2,pdfs\11-04909.h1.pdf.txt,"\n\n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n DEPARTMENT OF DEFENSE \n\n \n \n\n \n\n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n \n \n\n \n\nISCR Case No. 11-04909 \n\n \n\nFor Government: Richard Stevens, Esq., Department Counsel \n\nFor Applicant: Pro se \n\n \n\n \nDUFFY, James F., Administrative Judge: \n\n \nApplicant mitigated \n\nconsiderations). Clearance is granted. \n\nthe security concerns under Guideline F (financial \n\nStatement of the Case \n\nOn April 5, 2015, the Department of Defense (DOD) Consolidated Adjudications \nFacility (CAF) issued Applicant a Statement of Reasons (SOR) detailing security \nconcerns under Guideline F. DOD CAF took that action under Executive Order 10865, \nSafeguarding Classified Information Within Industry, dated February 20, 1960, as \namended; DOD Directive 5220.6, Defense Industria..."
3,pdfs\11-07728.h1.pdf.txt,"DEPARTMENT OF DEFENSE \n\n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n------------------------ \n \n\n \n\nISCR Case No. 11-07728 \n\n \n \n\n) \n) \n) \n) \n) \n \n \n\n \n\n \n \n\nAppearances \n\n___________ \n\n \nDecision \n\n___________ \n\nFor Government: Julie R. Mendez, Esq., Department Counsel \n\nFor Applicant: Mark S. Zaid, Esq. \n\n \n\nHARVEY, Mark, Administrative Judge: \n \nApplicant’s statement of reasons (SOR) alleges two allegations under Guideline \n \nK (handling protected information) and five allegations under Guideline E (personal \nconduct). All allegations relate to his handling of confidential data in December 2007 \nand January 2008 and his participation in the follow-up investigation in 2009 and 2010. \nApplicant was assured that “trusted downloads” provided by the Navy and Company L \ndid not contain classified information, when..."
4,pdfs\11-08313.h1.pdf.txt,"\n\n DEPARTMENT OF DEFENSE \n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n\nISCR Case No. 11-08313 \n\n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n--------------- \n \n\n \n\n \n \n\n) \n) \n) \n) \n) \n \n\n \n \n\nAppearances \n\n______________ \n\n \nDecision \n\n______________ \n\nFor Government: Julie R. Mendez, Esquire, Department Counsel \n\nFor Applicant: Pro se \n\n \n \n\n \n\nMARSHALL, Jr., Arthur E., Administrative Judge: \n\n \n Statement of the Case \n \nOn April 4, 2014, the Department of Defense (DOD) issued Applicant a \nStatement of Reasons (SOR) detailing security concerns under Guideline B (Foreign \nInfluence) and Guideline E (Personal Conduct).1 In a response signed April 28, 2014, \nApplicant admitted all allegations and requested a hearing based on the writte..."
5,pdfs\11-11916.h1.pdf.txt,"\n\nDEPARTMENT OF DEFENSE \n\nDEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \n\n \n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\nREDACTED \n\n \n\nISCR Case No. 11-11916 \n\nMENDEZ, Francisco, Administrative Judge: \n\n \nApplicant did not mitigate security concerns raised by his exercise of foreign \ncitizenship, including the possession of a current foreign passport. He also did not \nmitigate security concerns raised by his substantial ties to Russia through which he \ncould be subjected to adverse foreign influence. Clearance is denied. \n \n\nHistory of the Case \n\nOn August 15, 2009, Applicant submitted a security clearance application (SCA). \nHe voluntarily disclosed his dual U.S.-Russian citizenship, as well as his connections \nand property interest in Russia. \n\n \nOn April 17, 2014, the Department of Defense ..."
6,pdfs\11-12537.a1.pdf.txt,"KEYWORD: Guideline F\n\nDIGEST: Applicant’s concern regarding the malfunction of video-teleconference equipment is\novercome by the Judge’s express statement that he did not consider those pages of the transcript. \nAdverse decision affirmed.\n\nCASENO: 11-12537.a1\n\nDATE: 02/11/2016\n\nIn Re:\n\n----------\n \n\nApplicant for Security Clearance\n\nDATE: February 11, 2016\n\nISCR Case No. 11-12537\n\n)\n)\n)\n)\n)\n)\n)\n)\n\nAPPEAL BOARD DECISION\n\nAPPEARANCES\n\nFOR GOVERNMENT\n\nJames B. Norman, Esq., Chief Department Counsel\n\nFOR APPLICANT\n\nPro se\n\nThe Department of Defense (DoD) declined to grant Applicant a security clearance. On\nOctober 17, 2014, DoD issued a statement of reasons (SOR) advising Applicant of the basis for that\ndecision–security concerns raised under Guideline F (Financial Considerations) of Department of\nDefense Directive 5220.6 (Jan. 2, 1992, as amended) (Directive). Applicant requested a hearing.\nOn November 21, 2015, after the hearing,..."
7,pdfs\11-12635.h1.pdf.txt,"DEPARTMENT OF DEFENSE \n\nDEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n \n\n--- \n \n\n \n \n\n) \n) \n) \n) \n) \n \n \n\n \n\nISCR Case No. 11-12635 \n\nAppearances \n\nFor Government: Richard Stevens, Esquire, Department Counsel \n\nFor Applicant: Ryan C. Nerney, Esquire \n\n \n \n\n______________ \n\n \n\nDecision \n\n______________ \n\n \n\n \n\n \n \n\nGALES, Robert Robinson, Administrative Judge: \n\n \nApplicant mitigated the security concerns regarding personal conduct, use of \ninformation technology systems, and foreign influence. Eligibility for a security clearance \nand access to classified information is granted. \n\n \n\nStatement of the Case \n\n \nOn April 28, 2010, Applicant applied for a security clearance and submitted an \nElectronic Questionnaire for Investigations Processing (e-QIP) version of a Security \nClearance Application.1 ..."
8,pdfs\11-14135.h1.pdf.txt,"\n\n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n DEPARTMENT OF DEFENSE \n\n \n \n\n \n\n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n \n \n\n \n\nISCR Case No. 11-14135 \n\nAppearances \n\nFor Government: Tara Karoian, Esq., Department Counsel \n\nFor Applicant: Pro se \n\n \n\n \n\n) \n) \n) \n) \n) \n \n \n\n \n \n \n\n \n\n \n\n \n1 \n \n \n\n__________ \n\n \nDecision \n__________ \n\n \nDUFFY, James F., Administrative Judge: \n\n \nApplicant failed to mitigate the security concerns arising under Guideline K \n(handling protected information) and Guideline E (personal conduct). Clearance is \ndenied. \n\nStatement of the Case \n\nOn February 13, 2015, the Department of Defense Consolidated Adjudications \nFacility (DOD CAF) issued Applicant a Statement of Reasons (SOR) detailing security \nconcerns under Guidelines K and E. T..."
9,pdfs\11-14832.h1.pdf.txt,"\n\n DEPARTMENT OF DEFENSE \n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \n\n \nIn the matter of: \n \n[Name Redacted] \n \n \nApplicant for Security Clearance \n\n \n\n) \n) \n) \n) \n) \n) \n \n \n\n \n\nISCR Case No. 11-14832 \n\nAppearances \n\nFor Government: Eric Borgstrom, Esquire, Department Counsel \n\nFor Applicant: Pro se \n\n \n\n______________ \n\n \n\nDecision \n\n______________ \n\n \n\n \n\n \n \n\nHOGAN, Erin C., Administrative Judge: \n\n \nOn September 3, 2014, the Department of Defense (DOD) issued a Statement of \nReasons (SOR) to Applicant detailing security concerns under Guideline F, Financial \nConsiderations. The action was taken under Executive Order 10865, Safeguarding \nClassified Information within Industry (February 20, 1960), as amended; Department of \nDefense Directive 5220.6, Defense Industrial Personnel ..."


In [19]:
df_law['lawyer_app']= df_law['content'].str.extract(r'\n\nFor Applicant: (.*)')
df_law['lawyer_app']

0                      Pro se 
1          Mark S. Zaid, Esq. 
2                      Pro se 
3          Mark S. Zaid, Esq. 
4                      Pro se 
5                      Pro se 
6                          NaN
7     Ryan C. Nerney, Esquire 
8                      Pro se 
9                      Pro se 
10                     Pro Se 
11                         NaN
12                      Pro se
13     Stephen Glassman, Esq. 
14                     Pro se 
Name: lawyer_app, dtype: object
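
The extracted names above still carry a trailing space, and the same value shows up as both `Pro se` and `Pro Se`. Here is a minimal sketch of one way to tidy that up. It assumes a small made-up Series called `lawyers` standing in for `df_law['lawyer_app']`, and the choice of `Pro se` as the canonical spelling is arbitrary.

```python
import pandas as pd

# Made-up stand-in for df_law['lawyer_app']; values copied from the output above
lawyers = pd.Series(["Pro se ", "Mark S. Zaid, Esq. ", "Pro Se ", None])

# Remove leading/trailing whitespace; missing values stay missing
cleaned = lawyers.str.strip()

# Collapse case variants of "pro se" into one spelling ((?i) = ignore case)
cleaned = cleaned.str.replace(r"(?i)^pro se$", "Pro se", regex=True)

print(cleaned.dropna().tolist())
```

After this, counting values would treat every self-represented applicant as the same category.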

In [20]:
#The government's lawyer
df_law['lawyer_gov']=df_law['content'].str.extract(r'\n\nFor Government: (.*),')
df_law['lawyer_gov']

0       Stephanie C. Hess, Esq.
1     Robert J. Kilmartin, Esq.
2         Richard Stevens, Esq.
3         Julie R. Mendez, Esq.
4      Julie R. Mendez, Esquire
5          David F. Hayes, Esq.
6                           NaN
7      Richard Stevens, Esquire
8            Tara Karoian, Esq.
9       Eric Borgstrom, Esquire
10       Robert Kilmartin, Esq.
11                          NaN
12       Tovah Minster, Esquire
13          Erin Thompson, Esq.
14      David F. Hayes, Esquire
Name: lawyer_gov, dtype: object
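
The pattern above keeps `, Esq.` in the name because `.*` is greedy: it grabs as much of the line as it can and only backs up to the *last* comma. A small sketch of the difference between greedy `.*` and lazy `.*?`, using one line copied from the documents (the variable names are just for illustration):

```python
import re

line = "For Government: Stephanie C. Hess, Esq., Department Counsel"

# Greedy: .* runs to the end of the line, then backs up to the last comma
greedy = re.search(r"For Government: (.*),", line).group(1)

# Lazy: .*? stops at the first comma it can
lazy = re.search(r"For Government: (.*?),", line).group(1)

print(greedy)  # Stephanie C. Hess, Esq.
print(lazy)    # Stephanie C. Hess
```

Which one you want depends on whether the title belongs in the column.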

In [21]:
#Name of the judge
df_law['judge']=df_law['content'].str.extract(r'\n{0,2}(.*,.*), Administrative Judge')
df_law['judge']

0           COACHER, Robert E.
1          LOUGHRAN, Edward W.
2              DUFFY, James F.
3                 HARVEY, Mark
4     MARSHALL, Jr., Arthur E.
5            MENDEZ, Francisco
6                          NaN
7       GALES, Robert Robinson
8              DUFFY, James F.
9               HOGAN, Erin C.
10      GOLDSTEIN, Jennifer I.
11                         NaN
12            MOGUL, Martin H.
13        GARCIA, Candace Le’i
14             HOWE, Philip S.
Name: judge, dtype: object

In [22]:
#Whether the clearance is granted or denied 
df_law['access']=df_law['content'].str.extract(r'Clearance is (.*)\.')
df_law['access']

0         NaN
1         NaN
2     granted
3         NaN
4         NaN
5      denied
6         NaN
7         NaN
8         NaN
9         NaN
10        NaN
11        NaN
12        NaN
13        NaN
14        NaN
Name: access, dtype: object
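
Most rows come back `NaN` because the decisions are not all phrased as `Clearance is ...` on a single line. Some break the line before `denied`, and others say `access to classified information is granted`. Below is a sketch of a more forgiving pattern, assuming those two phrasings (taken from the rows visible above) are the only ones that matter. `find_decision` is a hypothetical helper, not part of the assignment.

```python
import re

# \s+ also matches line breaks, so "Clearance is \ndenied" still counts
DECISION_PATTERN = r"(?:Clearance|classified information)\s+is\s+(granted|denied)"

def find_decision(text):
    """Hypothetical helper: return 'granted', 'denied', or None."""
    match = re.search(DECISION_PATTERN, text)
    return match.group(1) if match else None

# Snippets copied from the content column above
one_line = "considerations). Clearance is granted. \n\nthe security concerns"
broken_line = "Guideline E (personal conduct). Clearance is \ndenied. \n\nStatement"
other_wording = "Eligibility \n\nfor access to classified information is denied. \n"

print(find_decision(one_line))       # granted
print(find_decision(broken_line))    # denied
print(find_decision(other_wording))  # denied

# The original pattern misses the broken line because . does not match \n
print(re.search(r"Clearance is (.*)\.", broken_line))  # None
```

The same pattern string could be passed to `.str.extract`, since it has exactly one capture group.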

Okay, now do the work and **make those new columns!**

In [23]:
#just checking
df_law

Unnamed: 0,filename,content,lawyer_app,lawyer_gov,judge,access
0,pdfs\11-02438.h1.pdf.txt,"\n\n DEPARTMENT OF DEFENSE \n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \n\n \n \nIn the matter of: \n \n \n \n \nApplicant for Security Clearance \n\n \n \n\n \n\nISCR Case No. 11-02438 \n\nFor Government: Stephanie C. Hess, Esq., Department Counsel \n\nFor Applicant: Pro se \n\nAppearances \n\n______________ \n\n \nDecision \n\n______________ \n\n \n\n \n\n \n\n \n \n\n \n\nCOACHER, Robert E., Administrative Judge: \n\n \nApplicant has not mitigated the alcohol consumption security concerns. Eligibility \n\nfor access to classified information is denied. \n\nStatement of the Case \n\nOn June 16, 2015, the Department of Defense Consolidated Adjudications \nFacility (DOD CAF) issued Applicant a Statement of Reasons (SOR) detailing security \nconcerns under Guideline G, alcohol consumption. DOD CAF acted under Executive \nOrder (EO) 10865, ...",Pro se,"Stephanie C. Hess, Esq.","COACHER, Robert E.",
1,pdfs\11-03073.h1.pdf.txt,"\n\n DEPARTMENT OF DEFENSE \n\n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n \n\n \n\nISCR Case No. 11-03073 \n\n \n \n\n) \n) \n) \n) \n) \n \n \n\n \n \n\nAppearances \n\n______________ \n\n \nDecision \n\n______________ \n\n \n\n \n\n \n\n \n \n\nFor Government: Robert J. Kilmartin, Esq., Department Counsel \n\nFor Applicant: Mark S. Zaid, Esq. \n\nLOUGHRAN, Edward W., Administrative Judge: \n\n \nApplicant mitigated the financial considerations security concerns. Eligibility for \n\naccess to classified information is granted. \n \n\nStatement of the Case \n\nOn October 28, 2014, the Department of Defense (DOD) issued a Statement of \nReasons (SOR) to Applicant detailing security concerns under Guideline F, financial \nconsiderations. The action was taken under Executive Order...","Mark S. Zaid, Esq.","Robert J. Kilmartin, Esq.","LOUGHRAN, Edward W.",
2,pdfs\11-04909.h1.pdf.txt,"\n\n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n DEPARTMENT OF DEFENSE \n\n \n \n\n \n\n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n \n \n\n \n\nISCR Case No. 11-04909 \n\n \n\nFor Government: Richard Stevens, Esq., Department Counsel \n\nFor Applicant: Pro se \n\n \n\n \nDUFFY, James F., Administrative Judge: \n\n \nApplicant mitigated \n\nconsiderations). Clearance is granted. \n\nthe security concerns under Guideline F (financial \n\nStatement of the Case \n\nOn April 5, 2015, the Department of Defense (DOD) Consolidated Adjudications \nFacility (CAF) issued Applicant a Statement of Reasons (SOR) detailing security \nconcerns under Guideline F. DOD CAF took that action under Executive Order 10865, \nSafeguarding Classified Information Within Industry, dated February 20, 1960, as \namended; DOD Directive 5220.6, Defense Industria...",Pro se,"Richard Stevens, Esq.","DUFFY, James F.",granted
3,pdfs\11-07728.h1.pdf.txt,"DEPARTMENT OF DEFENSE \n\n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n------------------------ \n \n\n \n\nISCR Case No. 11-07728 \n\n \n \n\n) \n) \n) \n) \n) \n \n \n\n \n\n \n \n\nAppearances \n\n___________ \n\n \nDecision \n\n___________ \n\nFor Government: Julie R. Mendez, Esq., Department Counsel \n\nFor Applicant: Mark S. Zaid, Esq. \n\n \n\nHARVEY, Mark, Administrative Judge: \n \nApplicant’s statement of reasons (SOR) alleges two allegations under Guideline \n \nK (handling protected information) and five allegations under Guideline E (personal \nconduct). All allegations relate to his handling of confidential data in December 2007 \nand January 2008 and his participation in the follow-up investigation in 2009 and 2010. \nApplicant was assured that “trusted downloads” provided by the Navy and Company L \ndid not contain classified information, when...","Mark S. Zaid, Esq.","Julie R. Mendez, Esq.","HARVEY, Mark",
4,pdfs\11-08313.h1.pdf.txt,"\n\n DEPARTMENT OF DEFENSE \n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n\nISCR Case No. 11-08313 \n\n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n--------------- \n \n\n \n\n \n \n\n) \n) \n) \n) \n) \n \n\n \n \n\nAppearances \n\n______________ \n\n \nDecision \n\n______________ \n\nFor Government: Julie R. Mendez, Esquire, Department Counsel \n\nFor Applicant: Pro se \n\n \n \n\n \n\nMARSHALL, Jr., Arthur E., Administrative Judge: \n\n \n Statement of the Case \n \nOn April 4, 2014, the Department of Defense (DOD) issued Applicant a \nStatement of Reasons (SOR) detailing security concerns under Guideline B (Foreign \nInfluence) and Guideline E (Personal Conduct).1 In a response signed April 28, 2014, \nApplicant admitted all allegations and requested a hearing based on the writte...",Pro se,"Julie R. Mendez, Esquire","MARSHALL, Jr., Arthur E.",
5,pdfs\11-11916.h1.pdf.txt,"\n\nDEPARTMENT OF DEFENSE \n\nDEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \n\n \n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\nREDACTED \n\n \n\nISCR Case No. 11-11916 \n\nMENDEZ, Francisco, Administrative Judge: \n\n \nApplicant did not mitigate security concerns raised by his exercise of foreign \ncitizenship, including the possession of a current foreign passport. He also did not \nmitigate security concerns raised by his substantial ties to Russia through which he \ncould be subjected to adverse foreign influence. Clearance is denied. \n \n\nHistory of the Case \n\nOn August 15, 2009, Applicant submitted a security clearance application (SCA). \nHe voluntarily disclosed his dual U.S.-Russian citizenship, as well as his connections \nand property interest in Russia. \n\n \nOn April 17, 2014, the Department of Defense ...",Pro se,"David F. Hayes, Esq.","MENDEZ, Francisco",denied
6,pdfs\11-12537.a1.pdf.txt,"KEYWORD: Guideline F\n\nDIGEST: Applicant’s concern regarding the malfunction of video-teleconference equipment is\novercome by the Judge’s express statement that he did not consider those pages of the transcript. \nAdverse decision affirmed.\n\nCASENO: 11-12537.a1\n\nDATE: 02/11/2016\n\nIn Re:\n\n----------\n \n\nApplicant for Security Clearance\n\nDATE: February 11, 2016\n\nISCR Case No. 11-12537\n\n)\n)\n)\n)\n)\n)\n)\n)\n\nAPPEAL BOARD DECISION\n\nAPPEARANCES\n\nFOR GOVERNMENT\n\nJames B. Norman, Esq., Chief Department Counsel\n\nFOR APPLICANT\n\nPro se\n\nThe Department of Defense (DoD) declined to grant Applicant a security clearance. On\nOctober 17, 2014, DoD issued a statement of reasons (SOR) advising Applicant of the basis for that\ndecision–security concerns raised under Guideline F (Financial Considerations) of Department of\nDefense Directive 5220.6 (Jan. 2, 1992, as amended) (Directive). Applicant requested a hearing.\nOn November 21, 2015, after the hearing,...",,,,
7,pdfs\11-12635.h1.pdf.txt,"DEPARTMENT OF DEFENSE \n\nDEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n \n\n--- \n \n\n \n \n\n) \n) \n) \n) \n) \n \n \n\n \n\nISCR Case No. 11-12635 \n\nAppearances \n\nFor Government: Richard Stevens, Esquire, Department Counsel \n\nFor Applicant: Ryan C. Nerney, Esquire \n\n \n \n\n______________ \n\n \n\nDecision \n\n______________ \n\n \n\n \n\n \n \n\nGALES, Robert Robinson, Administrative Judge: \n\n \nApplicant mitigated the security concerns regarding personal conduct, use of \ninformation technology systems, and foreign influence. Eligibility for a security clearance \nand access to classified information is granted. \n\n \n\nStatement of the Case \n\n \nOn April 28, 2010, Applicant applied for a security clearance and submitted an \nElectronic Questionnaire for Investigations Processing (e-QIP) version of a Security \nClearance Application.1 ...","Ryan C. Nerney, Esquire","Richard Stevens, Esquire","GALES, Robert Robinson",
8,pdfs\11-14135.h1.pdf.txt,"\n\n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n DEPARTMENT OF DEFENSE \n\n \n \n\n \n\n \nIn the matter of: \n \n \n \nApplicant for Security Clearance \n\n \n \n\n \n\nISCR Case No. 11-14135 \n\nAppearances \n\nFor Government: Tara Karoian, Esq., Department Counsel \n\nFor Applicant: Pro se \n\n \n\n \n\n) \n) \n) \n) \n) \n \n \n\n \n \n \n\n \n\n \n\n \n1 \n \n \n\n__________ \n\n \nDecision \n__________ \n\n \nDUFFY, James F., Administrative Judge: \n\n \nApplicant failed to mitigate the security concerns arising under Guideline K \n(handling protected information) and Guideline E (personal conduct). Clearance is \ndenied. \n\nStatement of the Case \n\nOn February 13, 2015, the Department of Defense Consolidated Adjudications \nFacility (DOD CAF) issued Applicant a Statement of Reasons (SOR) detailing security \nconcerns under Guidelines K and E. T...",Pro se,"Tara Karoian, Esq.","DUFFY, James F.",
9,pdfs\11-14832.h1.pdf.txt,"\n\n DEPARTMENT OF DEFENSE \n DEFENSE OFFICE OF HEARINGS AND APPEALS \n\n \n \n\n \nIn the matter of: \n \n[Name Redacted] \n \n \nApplicant for Security Clearance \n\n \n\n) \n) \n) \n) \n) \n) \n \n \n\n \n\nISCR Case No. 11-14832 \n\nAppearances \n\nFor Government: Eric Borgstrom, Esquire, Department Counsel \n\nFor Applicant: Pro se \n\n \n\n______________ \n\n \n\nDecision \n\n______________ \n\n \n\n \n\n \n \n\nHOGAN, Erin C., Administrative Judge: \n\n \nOn September 3, 2014, the Department of Defense (DOD) issued a Statement of \nReasons (SOR) to Applicant detailing security concerns under Guideline F, Financial \nConsiderations. The action was taken under Executive Order 10865, Safeguarding \nClassified Information within Industry (February 20, 1960), as amended; Department of \nDefense Directive 5220.6, Defense Industrial Personnel ...",Pro se,"Eric Borgstrom, Esquire","HOGAN, Erin C.",


In [24]:
#Save as court_cleaned.csv
df_law.to_csv('court_cleaned.csv', index=False)
df_law= pd.read_csv('court_cleaned.csv')
df_law.head()

Unnamed: 0,filename,content,lawyer_app,lawyer_gov,judge,access
0,pdfs\11-02438.h1.pdf.txt,"\r\n\r\n DEPARTMENT OF DEFENSE \r\n DEFENSE OFFICE OF HEARINGS AND APPEALS \r\n\r\n \r\n \r\n\r\n \r\n \r\nIn the matter of: \r\n \r\n \r\n \r\n \r\nApplicant for Security Clearance \r\n\r\n \r\n \r\n\r\n \r\n\r\nISCR Case No. 11-02438 \r\n\r\nFor Government: Stephanie C. Hess, Esq., Department Counsel \r\n\r\nFor Applicant: Pro se \r\n\r\nAppearances \r\n\r\n______________ \r\n\r\n \r\nDecision \r\n\r\n______________ \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n \r\n\r\n \r\n\r\nCOACHER, Robert E., Administrative Judge: \r\n\r\n \r\nApplicant has not mitigated the alcohol consumption security concerns. Eligibility \r\n\r\nfor access to classified information is denied. \r\n\r\nStatement of the Case \r\n\r\nOn June 16, 2015, the Department of Defense Consolidated Adjudications \r\nFacility (DOD CAF) issued Applicant a Statement of Reasons (SOR) detailing securi...",Pro se,"Stephanie C. Hess, Esq.","COACHER, Robert E.",
1,pdfs\11-03073.h1.pdf.txt,"\r\n\r\n DEPARTMENT OF DEFENSE \r\n\r\n DEFENSE OFFICE OF HEARINGS AND APPEALS \r\n\r\n \r\n \r\nIn the matter of: \r\n \r\n \r\n \r\nApplicant for Security Clearance \r\n\r\n \r\n\r\n \r\n\r\nISCR Case No. 11-03073 \r\n\r\n \r\n \r\n\r\n) \r\n) \r\n) \r\n) \r\n) \r\n \r\n \r\n\r\n \r\n \r\n\r\nAppearances \r\n\r\n______________ \r\n\r\n \r\nDecision \r\n\r\n______________ \r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n \r\n\r\nFor Government: Robert J. Kilmartin, Esq., Department Counsel \r\n\r\nFor Applicant: Mark S. Zaid, Esq. \r\n\r\nLOUGHRAN, Edward W., Administrative Judge: \r\n\r\n \r\nApplicant mitigated the financial considerations security concerns. Eligibility for \r\n\r\naccess to classified information is granted. \r\n \r\n\r\nStatement of the Case \r\n\r\nOn October 28, 2014, the Department of Defense (DOD) issued a Statement of \r\nReasons (SOR) to ...","Mark S. Zaid, Esq.","Robert J. Kilmartin, Esq.","LOUGHRAN, Edward W.",
2,pdfs\11-04909.h1.pdf.txt,"\r\n\r\n DEFENSE OFFICE OF HEARINGS AND APPEALS \r\n\r\n DEPARTMENT OF DEFENSE \r\n\r\n \r\n \r\n\r\n \r\n\r\n \r\nIn the matter of: \r\n \r\n \r\n \r\nApplicant for Security Clearance \r\n\r\n \r\n \r\n\r\n \r\n\r\nISCR Case No. 11-04909 \r\n\r\n \r\n\r\nFor Government: Richard Stevens, Esq., Department Counsel \r\n\r\nFor Applicant: Pro se \r\n\r\n \r\n\r\n \r\nDUFFY, James F., Administrative Judge: \r\n\r\n \r\nApplicant mitigated \r\n\r\nconsiderations). Clearance is granted. \r\n\r\nthe security concerns under Guideline F (financial \r\n\r\nStatement of the Case \r\n\r\nOn April 5, 2015, the Department of Defense (DOD) Consolidated Adjudications \r\nFacility (CAF) issued Applicant a Statement of Reasons (SOR) detailing security \r\nconcerns under Guideline F. DOD CAF took that action under Executive Order 10865, \r\nSafeguarding Classified Information Within In...",Pro se,"Richard Stevens, Esq.","DUFFY, James F.",granted
3,pdfs\11-07728.h1.pdf.txt,"DEPARTMENT OF DEFENSE \r\n\r\n DEFENSE OFFICE OF HEARINGS AND APPEALS \r\n\r\n \r\nIn the matter of: \r\n \r\n \r\n \r\nApplicant for Security Clearance \r\n\r\n------------------------ \r\n \r\n\r\n \r\n\r\nISCR Case No. 11-07728 \r\n\r\n \r\n \r\n\r\n) \r\n) \r\n) \r\n) \r\n) \r\n \r\n \r\n\r\n \r\n\r\n \r\n \r\n\r\nAppearances \r\n\r\n___________ \r\n\r\n \r\nDecision \r\n\r\n___________ \r\n\r\nFor Government: Julie R. Mendez, Esq., Department Counsel \r\n\r\nFor Applicant: Mark S. Zaid, Esq. \r\n\r\n \r\n\r\nHARVEY, Mark, Administrative Judge: \r\n \r\nApplicant’s statement of reasons (SOR) alleges two allegations under Guideline \r\n \r\nK (handling protected information) and five allegations under Guideline E (personal \r\nconduct). All allegations relate to his handling of confidential data in December 2007 \r\nand January 2008 and his participation in the follow-up investigation in 2009 and 2010. \r\nApplicant was as...","Mark S. Zaid, Esq.","Julie R. Mendez, Esq.","HARVEY, Mark",
4,pdfs\11-08313.h1.pdf.txt,"\r\n\r\n DEPARTMENT OF DEFENSE \r\n DEFENSE OFFICE OF HEARINGS AND APPEALS \r\n\r\n \r\n\r\nISCR Case No. 11-08313 \r\n\r\n \r\nIn the matter of: \r\n \r\n \r\n \r\nApplicant for Security Clearance \r\n\r\n--------------- \r\n \r\n\r\n \r\n\r\n \r\n \r\n\r\n) \r\n) \r\n) \r\n) \r\n) \r\n \r\n\r\n \r\n \r\n\r\nAppearances \r\n\r\n______________ \r\n\r\n \r\nDecision \r\n\r\n______________ \r\n\r\nFor Government: Julie R. Mendez, Esquire, Department Counsel \r\n\r\nFor Applicant: Pro se \r\n\r\n \r\n \r\n\r\n \r\n\r\nMARSHALL, Jr., Arthur E., Administrative Judge: \r\n\r\n \r\n Statement of the Case \r\n \r\nOn April 4, 2014, the Department of Defense (DOD) issued Applicant a \r\nStatement of Reasons (SOR) detailing security concerns under Guideline B (Foreign \r\nInfluence) and Guideline E (Personal Conduct).1...",Pro se,"Julie R. Mendez, Esquire","MARSHALL, Jr., Arthur E.",


# Reading books

When you're doing text work, you're legally obligated to work on Jane Austen's Pride and Prejudice (at least I *think* so). Let's do some naive analysis of it!

## Read in Jane Austen's Pride and Prejudice (without moving the file!)

It's in the `data/` directory, and named `Austen_Pride.txt`.

## Look at the first 500 or so characters of it 

In [25]:
contents = open('data/Austen_Pride.txt', encoding="utf8").read()
contents[0:500]

' Pride and Prejudice\nby Jane Austen\nChapter 1\nIt is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\nHowever little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered the rightful property of some one or other of their daughters.\n"My dear Mr. Bennet," said his lady to him one day, "have you heard that Nethe'

## Use a regular expression to find every "he" or "she" in the book. There should be about 3000 of them.

**Tip:** Do you know about **word boundaries?** `\b` means "the beginning or end of a word."

**Tip:** You might also want to use `re.IGNORECASE`. Maybe you'll need to google it? 

**Tip:** Do NOT use `re.compile`
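
To see why the word boundaries matter, here is a small sketch on a made-up sentence (not from the book). Without `\b`, the letters "he" inside words like "the" and "there" get counted too.

```python
import re

sample = "She said the other man was there when he arrived."

# No boundaries: also matches the "he" hiding inside the, other, there, when
loose = re.findall(r"s?he", sample, re.IGNORECASE)

# With \b on both sides: only the standalone words
strict = re.findall(r"\bs?he\b", sample, re.IGNORECASE)

print(len(loose))  # 6
print(strict)      # ['She', 'he']
```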

In [26]:
he_mentions= re.findall(r'(\bhe\b)',contents,re.IGNORECASE)
she_mentions= re.findall(r'(\bshe\b)',contents,re.IGNORECASE)
gender_mentions= re.findall(r'(\bs?he\b)',contents,re.IGNORECASE)
len(he_mentions)
len(she_mentions)
len(gender_mentions)

3047

## Use a regular expression to find those same "he" or "she"s, but also match *the word after it*

The first four should be:

* he is
* he had
* she told
* he came

In [65]:
#trying for he
he_mentions_plus= re.findall(r'\bhe\b\W+\w+',contents,re.IGNORECASE)
he_mentions_plus[0:5]

['he is', 'he had', 'he came', 'he agreed', 'he is']

In [66]:
#trying for she
she_mentions_plus= re.findall(r'\bshe\b\W+\w+',contents,re.IGNORECASE)
she_mentions_plus[0:5]

['she; "for', 'she told', 'she ought', 'she is', 'She was']

In [67]:
#trying for both 
gender_mentions_plus= re.findall(r'\bs?he\b\W+\w+',contents,re.IGNORECASE)
gender_mentions_plus [0:10]

['he is',
 'he had',
 'she; "for',
 'she told',
 'he came',
 'he agreed',
 'he is',
 'he married',
 'he may',
 'he comes']
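
The stray `'she; "for'` shows up because `\W+` matches *any* run of non-word characters, including a semicolon and a quotation mark. One possible alternative, sketched below on a made-up snippet modeled on that match, is to allow only whitespace between the pronoun and the next word with `\s+`. Whether that is the right call depends on whether you want to count a word that follows punctuation.

```python
import re

sample = 'said she; "for Mrs. Long has just been here, and she told me all about it."'

# \W+ lets punctuation sit between the pronoun and the next word
with_punctuation = re.findall(r"\bs?he\b\W+\w+", sample, re.IGNORECASE)

# \s+ only allows whitespace, so 'she; "for' is skipped
whitespace_only = re.findall(r"\bs?he\b\s+\w+", sample, re.IGNORECASE)

print(with_punctuation)  # ['she; "for', 'she told']
print(whitespace_only)   # ['she told']
```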

## Use capture groups to save the pronoun (he/she) as one match and the word as another

The first five should look like

```
[('he', 'is'),
 ('he', 'had'),
 ('she', 'told'),
 ('he', 'came'),
 ('he', 'agreed')]
```

In [69]:
capture_gender_mentions= re.findall(r'(\bs?he\b)\W+(\w+)',contents,re.IGNORECASE)
capture_gender_mentions [0:10]

[('he', 'is'),
 ('he', 'had'),
 ('she', 'for'),
 ('she', 'told'),
 ('he', 'came'),
 ('he', 'agreed'),
 ('he', 'is'),
 ('he', 'married'),
 ('he', 'may'),
 ('he', 'comes')]
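
What `re.findall` gives back depends on how many capture groups the pattern has: with no groups you get the whole match as a string, and with two or more groups you get one tuple per match. A quick sketch on a made-up sentence:

```python
import re

sample = "He is here and she told me"

# No groups: each result is the full matched text
whole = re.findall(r"\bs?he\b\W+\w+", sample, re.IGNORECASE)

# Two groups: each result is a (pronoun, next word) tuple
pairs = re.findall(r"(\bs?he\b)\W+(\w+)", sample, re.IGNORECASE)

print(whole)  # ['He is', 'she told']
print(pairs)  # [('He', 'is'), ('she', 'told')]
```

A list of tuples like `pairs` is exactly the shape `pd.DataFrame` accepts when you pass `columns=`.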

## Save those matches into a dataframe

You can give the column names with `columns=['pronoun', 'verb']`

In [92]:
df= pd.DataFrame(capture_gender_mentions, columns=['pronoun', 'verb'])
df.head()

Unnamed: 0,pronoun,verb
0,he,is
1,he,had
2,she,for
3,she,told
4,he,came


## How many times is each pronoun used?

In [93]:
df['pronoun'].value_counts()

she    1384
he     1103
She     325
He      235
Name: pronoun, dtype: int64

## Oh, wait, clean that up.

Make it only 'he' and 'she' lowercase.

It should be about 1600 'she' and 1300 'he'

In [96]:
df['pronoun']=df['pronoun'].str.lower()
df['pronoun'].value_counts()

she    1709
he     1338
Name: pronoun, dtype: int64

## What are the top 20 most common verbs?

In [97]:
df.groupby(by='verb').count().sort_values('pronoun',ascending= False).head(20)

verb,pronoun
had,372
was,372
could,172
is,140
would,94
has,72
did,69
will,50
might,46
should,41


## What are the top 20 most common verbs for 'he', and the top 20 most common for 'she'

**Tip:** Don't use groupby, just filter. If you want to know how, though, you can also look at "value counts for different categories" on [this page](http://jonathansoma.com/lede/foundations-2017/classes/more-pandas/class-notes/)

In [98]:
df[df['pronoun']== 'he'].groupby(by='verb').count().sort_values('pronoun',ascending= False).head(20)

verb,pronoun
had,167
was,160
is,74
has,51
could,40
would,35
did,30
should,26
will,24
must,24


In [99]:
df[df['pronoun']== 'she'].groupby(by='verb').count().sort_values('pronoun',ascending= False).head(20)

verb,pronoun
was,212
had,205
could,132
is,66
would,59
did,39
felt,33
saw,29
will,26
might,25
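
The tip suggests filtering instead of grouping. A minimal sketch of that approach is to filter to one pronoun, then call `value_counts()` on the `verb` column. The pairs below are made up for illustration, and `top_verbs` is a hypothetical helper, not something the assignment defines.

```python
import pandas as pd

# Made-up pairs; the real ones come from re.findall on the book
pairs = [("he", "had"), ("she", "was"), ("she", "was"), ("she", "had"),
         ("he", "had"), ("he", "is"), ("she", "was")]
sample_df = pd.DataFrame(pairs, columns=["pronoun", "verb"])

def top_verbs(frame, pronoun, n=20):
    """Hypothetical helper: most common words following one pronoun."""
    only_one = frame[frame["pronoun"] == pronoun]
    return only_one["verb"].value_counts().head(n)

print(top_verbs(sample_df, "she"))
print(top_verbs(sample_df, "he"))
```

`value_counts()` already sorts from most to least common, so no separate sort step is needed.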


## Who cries more, men or women? Give me a percentage answer.

**Tip:** It's `cried`, because of, you know, how books are written

In [107]:
df[df['verb']=='cried'].pronoun.value_counts()

she    11
he      1
Name: pronoun, dtype: int64
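
The question asks for a percentage, so the counts above still need one more step. Here is a sketch using the numbers already shown in this notebook (11 and 1 for `cried`, 1709 and 1338 for the pronoun totals). It computes two different percentages, since "who cries more" could mean either. In pandas, `value_counts(normalize=True)` would give the first one directly.

```python
# Counts copied from the outputs above
cried_counts = {"she": 11, "he": 1}
pronoun_totals = {"she": 1709, "he": 1338}

# Share of all "cried" matches that belong to each pronoun
total_cried = sum(cried_counts.values())
share_of_cried = {p: round(100 * n / total_cried, 1) for p, n in cried_counts.items()}

# How often each pronoun is followed by "cried", relative to its own total
rate_per_pronoun = {p: round(100 * cried_counts[p] / pronoun_totals[p], 2)
                    for p in cried_counts}

print(share_of_cried)    # {'she': 91.7, 'he': 8.3}
print(rate_per_pronoun)  # {'she': 0.64, 'he': 0.07}
```

The second number corrects for the fact that "she" appears more often overall in this book.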

## How much more common is 'he' than 'she' in J.R.R. Tolkien's Fellowship of the Ring? How does that compare to Pride and Prejudice?

The book is in the same directory.

In [108]:
#myfavoriteofthetrilogyyyy
fellowship = open('data/LordoftheRings_1955.txt', encoding="utf8").read()
fellowship[0:500]

'THE FELLOWSHIP OF THE RING\n\n\n\n\nBEING THE FIRST PART OF\n\nTHE LORD OF THE RINGS\n\nBY\n\nJ.R.R. TOLKIEN\n\n\n\n\n\nThree Rings for the Elven-kings under the sky,\n\nSeven for the Dwarf-lords in their halls of stone,\n\nNine for Mortal Men doomed to die,\n\nOne for the Dark Lord on his dark throne\n\nIn the Land of Mordor where the Shadows lie.\n\nOne Ring to rule them all, One Ring to find them,\n\nOne Ring to bring them all and in the darkness bind them\n\nIn the Land of Mordor where the Shadows lie.\n\n\n\n\n\nCONTENTS\n\n\nCOV'

In [109]:
gender_fellowship= re.findall(r'(\bs?he\b)\W+(\w+)',fellowship,re.IGNORECASE)
gender_fellowship [0:10]

[('he', 'had'),
 ('He', 'was'),
 ('he', 'wrote'),
 ('he', 'recorded'),
 ('he', 'had'),
 ('he', 'sent'),
 ('he', 'had'),
 ('he', 'learned'),
 ('he', 'recorded'),
 ('he', 'used')]

In [115]:
dfF= pd.DataFrame(gender_fellowship, columns=['pronoun', 'verb'])
dfF.head(20)

Unnamed: 0,pronoun,verb
0,he,had
1,He,was
2,he,wrote
3,he,recorded
4,he,had
5,he,sent
6,he,had
7,he,learned
8,he,recorded
9,he,used


In [116]:
dfF['pronoun']=dfF['pronoun'].str.lower()
dfF['pronoun'].value_counts()

he     3062
she     159
Name: pronoun, dtype: int64

In [128]:
#How much more common is 'he' than 'she' in J.R.R. Tolkien's Fellowship of the Ring?
#How does that compare to Pride and Prejudice?
print("In J.R.R. Tolkien's Fellowship of the Ring, 'he' is mentioned 3062 times but 'she' only 159 times. In Jane Austen's Pride and Prejudice, 'she' is the most used pronoun.")

In J.R.R. Tolkien's Fellowship of the Ring, 'he' is mentioned 3062 times but 'she' only 159 times. In Jane Austen's Pride and Prejudice, 'she' is the most used pronoun.
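
To put a number on "how much more common", one option is to divide the `he` count by the `she` count for each book. A small sketch using the counts printed earlier in this notebook (the dictionary layout is just for illustration):

```python
# Counts copied from the value_counts() outputs above
counts = {
    "Fellowship of the Ring": {"he": 3062, "she": 159},
    "Pride and Prejudice": {"he": 1338, "she": 1709},
}

# Ratio of 'he' to 'she' for each book
ratios = {title: round(c["he"] / c["she"], 2) for title, c in counts.items()}

for title, ratio in ratios.items():
    print(f"{title}: 'he' appears {ratio} times as often as 'she'")
```

With these counts, 'he' is roughly 19 times as common as 'she' in Fellowship of the Ring, while in Pride and Prejudice it is less common (a ratio of about 0.78).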
