# How prevalent is personally identifying information?

Below, we run a named entity recognition (NER) model over the PeerRead reviews and count how many reviews contain at least one detected person mention. The detections are written to a JSON file for manual error analysis.

## Collect PeerRead Reviews

In [1]:
!git clone https://github.com/allenai/PeerRead.git

fatal: destination path 'PeerRead' already exists and is not an empty directory.


In [2]:
import json
import os

In [3]:
DATA_PATH = './PeerRead/data'

In [4]:
reviews = []
for folder in os.listdir(DATA_PATH):
    if folder in ['acl_accepted.txt', 'nips_2013-2017', 'arxiv.cs.lg_2007-2017', 'arxiv.cs.cl_2007-2017', 'arxiv.cs.ai_2007-2017']:
        continue
    
    for split in ['dev', 'train', 'test']:
        review_folder = os.path.join(DATA_PATH, folder, split, 'reviews')
        for rev in os.listdir(review_folder):
            if rev:
                with open(os.path.join(review_folder, rev)) as f:
                    reviews.append(json.load(f))
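
The loop above assumes that every remaining folder contains all three splits, each with a `reviews` subfolder. A slightly more defensive variant is sketched below. The function name `load_reviewed_docs`, its parameters, and the assumption that review files end in `.json` are illustrative and not part of the original notebook.

```python
import json
from pathlib import Path


def load_reviewed_docs(data_path, skip=(), splits=("dev", "train", "test")):
    """Load every review JSON found under <data_path>/<venue>/<split>/reviews.

    `skip` holds folder or file names to ignore (hypothetical parameter).
    """
    docs = []
    for venue in sorted(Path(data_path).iterdir()):
        # Ignore explicitly skipped entries and stray files such as text listings.
        if venue.name in skip or not venue.is_dir():
            continue
        for split in splits:
            review_dir = venue / split / "reviews"
            if not review_dir.is_dir():
                continue  # tolerate venues that lack this split
            # Assumption: each reviewed document is stored as one *.json file.
            for path in sorted(review_dir.glob("*.json")):
                with path.open(encoding="utf-8") as f:
                    docs.append(json.load(f))
    return docs
```

Calling `load_reviewed_docs(DATA_PATH, skip=[...])` with the same skip list as above should collect the same files, provided the `.json` assumption holds for the checkout in use.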


In [5]:
len(reviews)  # number of reviewed documents, not individual reviews

586

In [6]:
review_with_docs = 0
for reviewed_doc in reviews:
    for review in reviewed_doc['reviews']:
        if review['comments']:
            review_with_docs += 1
print(review_with_docs)

5798


## Run spaCy NER over the reviews and analyze entities

In [7]:
import spacy

In [8]:
nlp = spacy.load("en_core_web_trf")  # run `python -m spacy download en_core_web_trf` first if the model is not installed

In [9]:
entities_of_interest = ['PERSON']

In [10]:
reviews_with_people = []
for reviewed_doc in reviews:
    for review in reviewed_doc['reviews']:
        doc = nlp(review['comments'])
        people = [ent for ent in doc.ents if ent.label_ in entities_of_interest]
        if len(people) > 0:
            reviews_with_people.append((people, doc))

Token indices sequence length is longer than the specified maximum sequence length for this model (1170 > 512). Running this sequence through the model will result in indexing errors


In [11]:
# Write the detected names alongside the full review text for manual inspection.
with open('reviews_with_people.json', 'w') as file:
    json.dump([{
        'people': [str(person) for person in people],
        'review': str(doc)
    } for people, doc in reviews_with_people], file, indent=2)

In [14]:
print(f"{len(reviews_with_people) / review_with_docs:.0%} of reviews have people mentions. {len(reviews_with_people)} reviews with people out of {review_with_docs}")

28% of reviews have people mentions. 1641 reviews with people out of 5798
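
One possible starting point for the manual error analysis is to tally which detected strings occur most often, since frequent entries are cheap to check for false positives. The sketch below reads the `reviews_with_people.json` file written above. The helper names `load_records` and `count_person_mentions`, and the whitespace and case normalisation, are assumptions made for illustration.

```python
import json
import os
from collections import Counter


def load_records(path="reviews_with_people.json"):
    """Read the list of {'people': [...], 'review': ...} records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_person_mentions(records):
    """Count how often each detected PERSON string occurs across all records."""
    counts = Counter()
    for record in records:
        for name in record["people"]:
            # Collapse whitespace and case so trivially different spellings merge.
            counts[" ".join(name.split()).lower()] += 1
    return counts


if __name__ == "__main__" and os.path.exists("reviews_with_people.json"):
    # Print the 20 most frequent detections for a quick visual check.
    for name, n in count_person_mentions(load_records()).most_common(20):
        print(f"{n:5d}  {name}")
```

Lowercasing merges some distinct names, so the tally is only a rough guide to where to look first.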
