# Tutorial - Extracting Text and Analysing Sentiment from Digitized Documents in the UCSD Library's Digital Archive


## Part 2 - Cleaning, Data Loss, Sentiment Analysis, and Categorization

## Description

In Part 2 of this tutorial, we clean and analyze the text we extracted from each image (page) in Part 1.

For analysis, we use the Google Cloud Natural Language API (the NLP API) to read sentiment scores for each document, find the most frequent terms for each document, and determine the probability that a document corresponds to each of a set of pre-trained categories (for example, "health care" or "arts and entertainment"). [1]

We will also estimate the amount of data retained or lost from each document and investigate the relationship between data loss and the metrics we read from the NLP API.

[1] https://cloud.google.com/natural-language/docs/categories

## Contents

In Part 2 of this tutorial, we will

1. Extract text from every page of the document we previously analyzed in Part 1 of the tutorial.

2. Remove non-alphanumeric characters using NLTK

3. Estimate the sentiment score for each document using the pre-trained NLP web API

4. Remove all non-English language words and stop words

5. Estimate the amount of data retained for each document

6. List the most common words for each document

7. Find what pre-trained categories apply to each document

8. Investigate the relationship between the retention rate for a document (the amount of text that remains after extraction and cleaning) and sentiment scores.

### Extract Text from Each Page of the PDF Document

#### Note - this section of Part 2 covers material similar to Part 1. The main difference is that we extract text from all pages and store them in a list. In the next section, we'll get into new material, including cleaning and analyzing text. 

In [2]:
!pip install pdf2image
!apt-get install poppler-utils 

Collecting pdf2image
  Downloading https://files.pythonhosted.org/packages/c3/12/ba5aadb3ba2e9c0f15d897622aa5707d64d0b2cab1fb34bee21559fa386a/pdf2image-1.12.1.tar.gz
Building wheels for collected packages: pdf2image
  Building wheel for pdf2image (setup.py) ... done
  Created wheel for pdf2image: filename=pdf2image-1.12.1-cp36-none-any.whl size=9027 sha256=3b6b6be9b3786dc9e7b69d639181f3a494154c4248a1eaf7c7b9fdd2728656a5
  Stored in directory: /root/.cache/pip/wheels/0f/80/3a/fac1dc0f7dbe12c805b9dc6497f325f0e9f9cedbec3ab0185b
Successfully built pdf2image
Installing collected packages: pdf2image
Successfully installed pdf2image-1.12.1
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 25 not upgraded.
Need to get 154 kB of archives.
After this operation, 613 kB of additional disk space will be used.
Get:1 http://archive.ubun

In [0]:
import nltk
import os
import base64
import pandas as pd
from pdf2image import convert_from_path, convert_from_bytes
from nltk.corpus import words
from nltk.corpus import wordnet 
from nltk.corpus import stopwords
from googleapiclient.discovery import build
from io import BytesIO
from google.colab import drive
import getpass

In [4]:
drive.mount('/content/gdrive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/gdrive


In [5]:
APIKEY = getpass.getpass()

··········


In [0]:
service = build('translate', 'v2', developerKey=APIKEY)

In [0]:
images = (convert_from_path('gdrive/My Drive//No-More-Silence/glbths_2005-13_001_001.pdf', fmt='png'))

In [0]:
def extract_text(base64string):
  vservice = build('vision', 'v1', developerKey=APIKEY)
  request = vservice.images().annotate(body={
          'requests': [{
                  'image': {
                      'content':base64string
                  },
                  'features': [{
                      'type': 'TEXT_DETECTION',
                      'maxResults': 5,
                  }]
              }],
          })
  responses = request.execute(num_retries=5)
  return responses

In [0]:
extracted_texts = []
document_page_id = []

# keep a page count, combined with doc name as a unique document/page id
for p, img in enumerate(images):
  # re-encode the page as JPEG and base64-encode it for the Vision API request
  buffered = BytesIO()
  img.save(buffered, format="JPEG")
  base64str = base64.b64encode(buffered.getvalue()).decode('ascii')
  extracted_text = extract_text(base64str)
  # the first annotation holds the full text detected on the page
  extracted_texts.append(extracted_text['responses'][0]['textAnnotations'][0]['description'])
  document_page_id.append("glbths_2005-13_001_001.pdf_" + str(p))
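
The loop above indexes straight into `textAnnotations`, which raises a `KeyError` when that key is missing from a response. The following is a minimal sketch of a more defensive lookup, not part of the original notebook. The helper name `first_description` is hypothetical, and it assumes that a page with no detected text comes back without a `textAnnotations` entry. The sample dictionaries are hand-written stand-ins, not real API output.

```python
def first_description(api_response):
  """Return the full-page text from a Vision-style response dict, or '' if absent."""
  responses = api_response.get('responses') or [{}]
  annotations = responses[0].get('textAnnotations') or []
  if not annotations:
    return ''
  return annotations[0].get('description', '')


# Hand-written stand-ins for API responses (assumed shapes, not real output).
page_with_text = {'responses': [{'textAnnotations': [{'description': 'Dear Sue,\n'}]}]}
blank_page = {'responses': [{}]}

print(repr(first_description(page_with_text)))
print(repr(first_description(blank_page)))
```

If a helper like this were used, an empty string would still need to be handled wherever the length of the extracted text is used as a divisor.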

### Remove Non-Alphanumeric Characters and Non-English Words

We will remove all non-English words from the extracted text. Keep in mind that this often results in a higher level of data loss for certain images, particularly those containing handwriting or typing that is not consistently aligned or formatted.

We start by creating a set of words to retain from an image scan.

In [11]:
nltk.download('words')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [0]:
extended_words = set().union(words.words(), wordnet.words())

Next, we tokenize each page and keep only the tokens that appear in the set of words we created above. Punctuation and any token the word set does not recognize are dropped.

In [0]:
cleaned_texts = []
for e in extracted_texts:
  e = e.replace('\n', ' ')
  e = e.lower()
  e = " ".join(w for w in nltk.wordpunct_tokenize(e) if w.lower() in extended_words)
  
  cleaned_texts.append(e)
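
The contents list also mentions removing stop words and listing the most common words for each document, which the cells above do not show. The following is one possible sketch, not the notebook's own code. It tokenizes with the regular expression that, to our knowledge, underlies `nltk.wordpunct_tokenize`, and it uses a tiny hand-written vocabulary and stop-word set (both assumptions made for illustration) so that it runs without downloading any corpus. The function name `clean_tokens` is hypothetical.

```python
import re
from collections import Counter


def clean_tokens(text, vocabulary, stop_words=frozenset()):
  """Lowercase and tokenize text, keeping tokens in vocabulary that are not stop words."""
  tokens = re.findall(r"\w+|[^\w\s]+", text.lower())
  return [t for t in tokens if t in vocabulary and t not in stop_words]


# Toy inputs, for illustration only.
toy_vocabulary = {'the', 'prison', 'is', 'home', 'for', 'state', 'of'}
toy_stop_words = {'the', 'is', 'for', 'of'}
sample = "The prison is home for 5,000 state prisonerz.\nThe State of the prison"

kept = clean_tokens(sample, toy_vocabulary)                     # vocabulary filter only
content = clean_tokens(sample, toy_vocabulary, toy_stop_words)  # stop words removed too
top_terms = Counter(content).most_common(2)                     # most frequent remaining terms

print(" ".join(kept))
print(top_terms)
```

In the notebook itself, `extended_words` and `set(stopwords.words('english'))` could be passed in place of the toy sets.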

### Read Sentiment Scores

In [0]:
lservice = build('language', 'v1', developerKey=APIKEY)

In [0]:
def read_sentiment(document_str):
  # returns (score, magnitude, document_str) for one page of cleaned text
  response = lservice.documents().analyzeSentiment(
    body={
      'document': {
        'type': 'PLAIN_TEXT',
        'content': document_str
      }
    }).execute()
  score = response['documentSentiment']['score']
  magnitude = response['documentSentiment']['magnitude']
  return score, magnitude, document_str


In [0]:
sentiment_data = []

for et in cleaned_texts:
  sentiment_data.append(read_sentiment(et))
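
The sentiment `score` ranges from -1.0 (negative) to 1.0 (positive), while `magnitude` is a non-negative measure of how much emotional content the text carries overall. As a rough sketch of how the two might be read together, the hypothetical helper below maps a pair to a coarse label. The function name and both cutoffs (0.25 for the score, 1.0 for the magnitude) are arbitrary assumptions for illustration, not values taken from the API or the notebook.

```python
def label_sentiment(score, magnitude, threshold=0.25, mixed_magnitude=1.0):
  """Map a (score, magnitude) pair to a coarse label; both cutoffs are arbitrary."""
  if score >= threshold:
    return 'positive'
  if score <= -threshold:
    return 'negative'
  # near-zero score: a high magnitude hints at mixed emotion rather than none
  return 'mixed' if magnitude >= mixed_magnitude else 'neutral'


# (score, magnitude) pairs in the shape returned by read_sentiment, text omitted
examples = [(-0.6, 0.6), (0.0, 0.0), (0.1, 2.4)]
labels = [label_sentiment(s, m) for s, m in examples]
print(labels)
```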


### Estimate Retention Rate

To estimate the amount of data retained, we compare the character count of the extracted text before cleaning with the character count of the text remaining after cleaning.

There's plenty that could go wrong with this approach. It does not account for the possibility that the text extraction API may simply give up and produce little or nothing for some images. In that case almost nothing is preserved from the page, yet the ratio cannot show it, so we can't trust this ratio in all cases.

In [0]:
retained = []
for cleaned, extracted in zip(cleaned_texts, extracted_texts):
  # guard against an empty extraction to avoid dividing by zero
  retained.append(len(cleaned) / len(extracted) if extracted else 0.0)
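
To make the ratio concrete, here is a small worked sketch using a shortened version of one of the extracted pages. The helper name `retention_rate` is hypothetical. Note that even perfectly recognized text scores slightly below 1.0, because cleaning drops punctuation and trailing newlines.

```python
def retention_rate(extracted, cleaned):
  """Share of characters kept after cleaning; 0.0 when nothing was extracted."""
  if not extracted:
    return 0.0
  return len(cleaned) / len(extracted)


extracted = "NOT CENSORED\nNot Responsible for Contents\n"
cleaned = "not censored not responsible for contents"

rate = retention_rate(extracted, cleaned)
print(len(extracted), len(cleaned), round(rate, 3))
```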

### Create a Pandas Dataframe

For ease of viewing and querying, we'll organize the information we've generated into a pandas dataframe.

In [0]:
df = pd.DataFrame(sentiment_data)

In [0]:
df['Retained'] = retained
df['Extracted'] = extracted_texts
df['document_page_id'] = document_page_id


In [0]:
df.columns = ['Sentiment', 'Magnitude', 'Cleaned', 'Retained', 'Extracted', 'document_page_id']

We'll persist this, since it takes a long time to generate. In the next tutorial, we'll read from the CSV file.

In [0]:
df.to_csv('gdrive/My Drive//No-More-Silence/glbths_2005-13_001_001.csv')

### Read Categories

We can use the NLP service to estimate the probability that a particular record belongs to one of a set of pre-trained categories.

We'll add this to a separate dataframe, as the Categories column will hold a dictionary rather than a single value.

In [0]:
def read_categories(text):
  # classifyText can fail (for example, on very short text), so fall back to ""
  try:
    response = lservice.documents().classifyText(
      body={
        'document': {
          'type': 'PLAIN_TEXT',
          'content': text }
      }).execute()
  except Exception:
    response = ""
  return response

In [0]:
df_categories = pd.DataFrame()
df_categories['document_page_id'] = document_page_id
df_categories['Extracted'] = extracted_texts
df_categories['Cleaned'] = cleaned_texts
df_categories['Categories'] = df['Cleaned'].apply(read_categories)

Let's persist this one as well.

In [0]:
df_categories.to_csv('gdrive/My Drive//No-More-Silence/glbths_2005-13_001_001_categories.csv')

In [0]:
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('display.max_colwidth', 500)

In [29]:
df

Unnamed: 0,Sentiment,Magnitude,Cleaned,Retained,Extracted,document_page_id
0,-0.6,0.6,july 24 de ar sue louisiana state penitentiary of 18 acres of f arm land once ante plantation land its so to say home for 5 or more state the bulk of whom is confined to he remaining few are confined to which falls in the category of one man or two man currently there t any area here at angola where on ly hiv positive are confined and or restricted and matter of fact of the 94 hiv positive here at angola the majority of them are in which means that they are able to freely congregate or inter...,0.775498,"July 24, 1990\nDe ar Sue,\nIhe Louisiana State Penitentiary consistes of 18, 000\nacres of f arm land. Once ante bellum plantation land, its so to\nsay, home for 5,000 (or more) state prisoners, the bulk of\nwhom is confined to dormitories. he remaining few are confined\nto Cellblocks, which falls in the category of one man Cells\nor two man Cells. Currently, there aren't any designated area\nhere at Angola, where on ly HIV positive prisoners are confined\nand/or restricted. And matter of fa...",glbths_2005-13_001_001.pdf_0
1,-0.7,1.5,since then there have been absolutely no effort to educate or update us on or the progress in fighting it and to my knowledge incoming are not educated about aids in november of 1 were confined to a cell next door to a gay prisoner with the aids virus it was one experience stood under daily verbal and physical attack by the other confined to this tier and compound when would use the shower no one would use it after him until it was g i the next day no one he would shower one day and forsake ...,0.796132,"Since then there have been absolutely no effort to educate or\nUpdate us on ALDS or the progress in fighting it. And to my\nknowledge incoming prisoners are not educated about AIDS, In\nNovember of 1987, 1 were confined to a Cell, next door to a\nGay prisoner diagnosed with having the AIDS virus. It was one\nhelluva experience. Ronnie Waymyer stood under daily verbal\nand physical attack by the other prisoners confined to this\ntier and compound. When Waymyer would use the shower no one\nwou...",glbths_2005-13_001_001.pdf_1
2,-0.2,0.2,and be cause of his inability to intellectually defend himself the prison been in protecting him from the aggression of the gay and hopefully this information will help you in your task and you can use my name and whereabouts 1 m a prison activist as always clark cr d tier cell 4 louisiana state penitentiary angola louisiana,0.806931,"And be cause of his inability to intellectually defend himself,\nthe prison guards has been ind ifferent in protecting him\nfrom the aggression of the Gay bashers and homophobes.\nHopefully, this information will help you in your task.\nAnd you can use my name and whereabouts. 1'm a prison\nactivist.\nAs always,\nAlbertvChui Clark\n79909\nCR, D.Tier, Cell 4\nLouisiana State Penitentiary\nAngola, Louisiana\n70712\n",glbths_2005-13_001_001.pdf_2
3,-0.6,0.6,sue if possible will you consider sending me a copy of your article once printed sometime in the a printed and published here at ang ola did a story on angola with aids was featured in for further information concerning aids here at angola contact israel camp j shark 2 r louisiana state penitentiary angola la i m told that he have been doing some work in this area to better inform the community at large about the plight of gay people here at angola lastly i m sorry if this letter and informa...,0.842022,"Sue,\nIf possible will you consider sending me a copy of your\narticle once printed.\nSometime in 1987, the Angolite (a magezine printed and\npublished here at Ang ola), did a story on Angola prisoners\nwith AIDS (wthich Waymyer was featured in).\nFor further information concerning AIDS here at Angola,\ncontact Israel Izra Perkins #107028, Camp J, Shark 2-R, Louisiana\nState Penitentiary, Angola, La. 0712. I'm told that he have\nbeen doing some work in this area to better inform the communit...",glbths_2005-13_001_001.pdf_3
4,0.0,0.0,pm albert clark cell 4 louisiana state penitentiary angola louisiana o 27 usa 25 7 ms sue 6 box ithaca new york baton,0.5625,"TOUGE\nPM\nAlbert Chui Clark\n79909\nČK D_Tier, Cell 4\nLouisiana State Penitentiary\nAngola, Louisiana\nO 27 JUL\nUSA 25\n1990\n7 C712\nMs. Sue Rochman\n1o IRC\n$6°Box 713\nIthaca, New York\n14851\nlulbllall..lll\nBATON\n708\n",glbths_2005-13_001_001.pdf_4
5,-0.5,0.5,not censored not responsible for contents la penitentiary 28 la state pen an all male penal institute,0.848739,NOT CENSORED\nNot Responsible for Contents\nLa Srate Penitentiary\nJUL 28 1990\nLA. STATE PEN.\nAN ALL MALE PENAL INSTITUTE\n,glbths_2005-13_001_001.pdf_5
6,-0.5,0.5,jan 91 dean an low was your christmas d t seen yan fu the a even to mind d you my what see your story of may to amy oft data le el m into my june that after 9 f on git up to be how that el m want my month t went from to gin aug to dee and now of since 18 dee i el set to my weight ai fell more like oe my el m may o naw day year hat about these par t s d con look t so far an t no that n f try pat hog in and for men male gear her and she turned the advocate for this sa that en to me was real of...,0.504621,"Jan '91\nDean An,\nklay\nLow was your\nchristmas? d laven t seen yan erticlkin\nKspoy fu Jeo,\naiting\nthe Rdvocate, a even mne\nto\nMind ond dhided\nasone d you chorged\npinting My\nwhat happened\nsee\nyour\nstory of whot.\nMay\nto chorgg amy oft,\ndata, le aów el'm into\ntakng A2T-my\n283n June\nthat isolA rin, after 9 monthe, f\non git'up To\nbe how, that el 'm fially ontof\nwant\nMy\nmonth\nT-cells 'cont went from\nto 594 gin Aug\nto 444 wń\nDee, and now ont of Contro lled Howsing\nノ\nsi...",glbths_2005-13_001_001.pdf_6
7,-0.2,0.2,as d t any of any the to 90 l n till year an g keeping butter non that al n at s hat el think a still i live for day at a time el n live for hut reality to come they with f have came d the all the hove came od brow the result will arrive le fore old age does the will happen from t so that tn any my where el benefit to cope thing one just bine tl d of that dread ful hole and on the and future come,0.54212,"tr as d lanen t develaped any of\nany\nThe commen symptons relating to Aibs\nsimge Laing hy'5h\nHIVH (180ét 90) l'n till\nIbaing\niergtie hich\nyear, an\nZuess\ng aign toond keeping\nbutter mentelléttitede non that al'n at S hat\nISOLAHION, el think llhe A, Still taigit\nI'll live for\nday at a time, el'n haping IU live for\nelso telle\nAue\nhut reality\nto come ,\nnamJacrept thuings\nthey dilvelope with\nf'have, came d oww The\nall the courase hove, came od Brow the\nexheglgat result will a...",glbths_2005-13_001_001.pdf_7
8,0.0,0.0,c austin pm 78 us lint p 0 box tx usa 2 jan se p 0 box hew york,0.401274,"C NESON\nAUSTIN\nPM\nイX 78\nRiCHARD\nus# 78826-012-Haston lint\nP.0.Box#1010-FCI\nBASTROR TX. 78602\nUSA\n*2 JAN\n/991\nSe\nRoohman\nP.0. Box"" 713\nIthora, Hew York 14857\n",glbths_2005-13_001_001.pdf_8
9,-0.5,0.5,fletcher ta health talk that each its i 7 someone un to get better there sant to be aal aids,0.391489,Doud Fletcher\nWorksatRles\nta Publec Health Org\nTalk abaut that\n718-626-3414\nX354\nPakeus\nDaid Flatcher\nEach Bld has its aun clei i\n7 deff Bids\nSomeone un Adu Stages\ntranstrued to\nget better cae there\nsant to be theee\nSpe aal AIDS Doimf\n,glbths_2005-13_001_001.pdf_9
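
As a first look at the relationship between retention and sentiment mentioned in the contents, one could compute a Pearson correlation over the `Retained` and `Sentiment` columns shown above. The sketch below is an assumption about how that might be done, not the tutorial's own analysis. It copies the ten values from the table and uses a hand-written `pearson` helper (a hypothetical name) so that it runs without pandas. With only ten pages, the result should be read as illustrative rather than conclusive.

```python
import math


def pearson(xs, ys):
  """Pearson correlation coefficient of two equal-length numeric sequences."""
  n = len(xs)
  mean_x = sum(xs) / n
  mean_y = sum(ys) / n
  cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
  var_x = sum((x - mean_x) ** 2 for x in xs)
  var_y = sum((y - mean_y) ** 2 for y in ys)
  return cov / math.sqrt(var_x * var_y)


# Values copied from the Retained and Sentiment columns of the table above.
retained = [0.775498, 0.796132, 0.806931, 0.842022, 0.5625,
            0.848739, 0.504621, 0.54212, 0.401274, 0.391489]
sentiment = [-0.6, -0.7, -0.2, -0.6, 0.0, -0.5, -0.5, -0.2, 0.0, -0.5]

r = pearson(retained, sentiment)
print(round(r, 3))
```

With the dataframe in memory, `df['Retained'].corr(df['Sentiment'])` should give the same coefficient.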


In [30]:
df_categories

Unnamed: 0,document_page_id,Extracted,Cleaned,Categories
0,glbths_2005-13_001_001.pdf_0,"July 24, 1990\nDe ar Sue,\nIhe Louisiana State Penitentiary consistes of 18, 000\nacres of f arm land. Once ante bellum plantation land, its so to\nsay, home for 5,000 (or more) state prisoners, the bulk of\nwhom is confined to dormitories. he remaining few are confined\nto Cellblocks, which falls in the category of one man Cells\nor two man Cells. Currently, there aren't any designated area\nhere at Angola, where on ly HIV positive prisoners are confined\nand/or restricted. And matter of fa...",july 24 de ar sue louisiana state penitentiary of 18 acres of f arm land once ante plantation land its so to say home for 5 or more state the bulk of whom is confined to he remaining few are confined to which falls in the category of one man or two man currently there t any area here at angola where on ly hiv positive are confined and or restricted and matter of fact of the 94 hiv positive here at angola the majority of them are in which means that they are able to freely congregate or inter...,"{'categories': [{'name': '/Health/Health Conditions/Infectious Diseases', 'confidence': 0.93}, {'name': '/Health/Reproductive Health', 'confidence': 0.93}, {'name': '/Health/Health Conditions/AIDS & HIV', 'confidence': 0.81}]}"
1,glbths_2005-13_001_001.pdf_1,"Since then there have been absolutely no effort to educate or\nUpdate us on ALDS or the progress in fighting it. And to my\nknowledge incoming prisoners are not educated about AIDS, In\nNovember of 1987, 1 were confined to a Cell, next door to a\nGay prisoner diagnosed with having the AIDS virus. It was one\nhelluva experience. Ronnie Waymyer stood under daily verbal\nand physical attack by the other prisoners confined to this\ntier and compound. When Waymyer would use the shower no one\nwou...",since then there have been absolutely no effort to educate or update us on or the progress in fighting it and to my knowledge incoming are not educated about aids in november of 1 were confined to a cell next door to a gay prisoner with the aids virus it was one experience stood under daily verbal and physical attack by the other confined to this tier and compound when would use the shower no one would use it after him until it was g i the next day no one he would shower one day and forsake ...,{'categories': []}
2,glbths_2005-13_001_001.pdf_2,"And be cause of his inability to intellectually defend himself,\nthe prison guards has been ind ifferent in protecting him\nfrom the aggression of the Gay bashers and homophobes.\nHopefully, this information will help you in your task.\nAnd you can use my name and whereabouts. 1'm a prison\nactivist.\nAs always,\nAlbertvChui Clark\n79909\nCR, D.Tier, Cell 4\nLouisiana State Penitentiary\nAngola, Louisiana\n70712\n",and be cause of his inability to intellectually defend himself the prison been in protecting him from the aggression of the gay and hopefully this information will help you in your task and you can use my name and whereabouts 1 m a prison activist as always clark cr d tier cell 4 louisiana state penitentiary angola louisiana,{'categories': []}
3,glbths_2005-13_001_001.pdf_3,"Sue,\nIf possible will you consider sending me a copy of your\narticle once printed.\nSometime in 1987, the Angolite (a magezine printed and\npublished here at Ang ola), did a story on Angola prisoners\nwith AIDS (wthich Waymyer was featured in).\nFor further information concerning AIDS here at Angola,\ncontact Israel Izra Perkins #107028, Camp J, Shark 2-R, Louisiana\nState Penitentiary, Angola, La. 0712. I'm told that he have\nbeen doing some work in this area to better inform the communit...",sue if possible will you consider sending me a copy of your article once printed sometime in the a printed and published here at ang ola did a story on angola with aids was featured in for further information concerning aids here at angola contact israel camp j shark 2 r louisiana state penitentiary angola la i m told that he have been doing some work in this area to better inform the community at large about the plight of gay people here at angola lastly i m sorry if this letter and informa...,"{'categories': [{'name': '/People & Society', 'confidence': 0.65}]}"
4,glbths_2005-13_001_001.pdf_4,"TOUGE\nPM\nAlbert Chui Clark\n79909\nČK D_Tier, Cell 4\nLouisiana State Penitentiary\nAngola, Louisiana\nO 27 JUL\nUSA 25\n1990\n7 C712\nMs. Sue Rochman\n1o IRC\n$6°Box 713\nIthaca, New York\n14851\nlulbllall..lll\nBATON\n708\n",pm albert clark cell 4 louisiana state penitentiary angola louisiana o 27 usa 25 7 ms sue 6 box ithaca new york baton,{'categories': []}
5,glbths_2005-13_001_001.pdf_5,NOT CENSORED\nNot Responsible for Contents\nLa Srate Penitentiary\nJUL 28 1990\nLA. STATE PEN.\nAN ALL MALE PENAL INSTITUTE\n,not censored not responsible for contents la penitentiary 28 la state pen an all male penal institute,
6,glbths_2005-13_001_001.pdf_6,"Jan '91\nDean An,\nklay\nLow was your\nchristmas? d laven t seen yan erticlkin\nKspoy fu Jeo,\naiting\nthe Rdvocate, a even mne\nto\nMind ond dhided\nasone d you chorged\npinting My\nwhat happened\nsee\nyour\nstory of whot.\nMay\nto chorgg amy oft,\ndata, le aów el'm into\ntakng A2T-my\n283n June\nthat isolA rin, after 9 monthe, f\non git'up To\nbe how, that el 'm fially ontof\nwant\nMy\nmonth\nT-cells 'cont went from\nto 594 gin Aug\nto 444 wń\nDee, and now ont of Contro lled Howsing\nノ\nsi...",jan 91 dean an low was your christmas d t seen yan fu the a even to mind d you my what see your story of may to amy oft data le el m into my june that after 9 f on git up to be how that el m want my month t went from to gin aug to dee and now of since 18 dee i el set to my weight ai fell more like oe my el m may o naw day year hat about these par t s d con look t so far an t no that n f try pat hog in and for men male gear her and she turned the advocate for this sa that en to me was real of...,"{'categories': [{'name': '/People & Society/Religion & Belief', 'confidence': 0.57}]}"
7,glbths_2005-13_001_001.pdf_7,"tr as d lanen t develaped any of\nany\nThe commen symptons relating to Aibs\nsimge Laing hy'5h\nHIVH (180ét 90) l'n till\nIbaing\niergtie hich\nyear, an\nZuess\ng aign toond keeping\nbutter mentelléttitede non that al'n at S hat\nISOLAHION, el think llhe A, Still taigit\nI'll live for\nday at a time, el'n haping IU live for\nelso telle\nAue\nhut reality\nto come ,\nnamJacrept thuings\nthey dilvelope with\nf'have, came d oww The\nall the courase hove, came od Brow the\nexheglgat result will a...",as d t any of any the to 90 l n till year an g keeping butter non that al n at s hat el think a still i live for day at a time el n live for hut reality to come they with f have came d the all the hove came od brow the result will arrive le fore old age does the will happen from t so that tn any my where el benefit to cope thing one just bine tl d of that dread ful hole and on the and future come,{'categories': []}
8,glbths_2005-13_001_001.pdf_8,"C NESON\nAUSTIN\nPM\nイX 78\nRiCHARD\nus# 78826-012-Haston lint\nP.0.Box#1010-FCI\nBASTROR TX. 78602\nUSA\n*2 JAN\n/991\nSe\nRoohman\nP.0. Box"" 713\nIthora, Hew York 14857\n",c austin pm 78 us lint p 0 box tx usa 2 jan se p 0 box hew york,
9,glbths_2005-13_001_001.pdf_9,Doud Fletcher\nWorksatRles\nta Publec Health Org\nTalk abaut that\n718-626-3414\nX354\nPakeus\nDaid Flatcher\nEach Bld has its aun clei i\n7 deff Bids\nSomeone un Adu Stages\ntranstrued to\nget better cae there\nsant to be theee\nSpe aal AIDS Doimf\n,fletcher ta health talk that each its i 7 someone un to get better there sant to be aal aids,"{'categories': [{'name': '/Health', 'confidence': 0.67}]}"
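
The `Categories` column holds three kinds of values: a dictionary with one or more categories, a dictionary with an empty list, or an empty value where the API call failed. A minimal sketch of pulling out the highest-confidence category while tolerating all three cases might look like the following. The helper name `top_category` is hypothetical, and the sample values are copied from the output above.

```python
def top_category(response):
  """Return (name, confidence) of the highest-confidence category, or None."""
  if not isinstance(response, dict):
    return None  # read_categories returns "" when the API call failed
  categories = response.get('categories') or []
  if not categories:
    return None
  best = max(categories, key=lambda c: c['confidence'])
  return best['name'], best['confidence']


# Sample values copied from the Categories column above.
page_0 = {'categories': [
    {'name': '/Health/Health Conditions/Infectious Diseases', 'confidence': 0.93},
    {'name': '/Health/Reproductive Health', 'confidence': 0.93},
    {'name': '/Health/Health Conditions/AIDS & HIV', 'confidence': 0.81}]}
page_1 = {'categories': []}
page_5 = ""

for page in (page_0, page_1, page_5):
  print(top_category(page))
```

If the categories are read back from the CSV file rather than kept in memory, each cell will be a string and would first need to be parsed, for example with `ast.literal_eval`.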
