# Textual Analysis for Archive Metadata or Journal Articles


In [1]:
%%capture
!pip install JATA
!pip install altair
from Text import *
from CJH import CJH_Archives
import altair as alt

# Choosing the Data Source: PDFs or Metadata

## Importing Metadata

**Run the following cell if you want to work with metadata. If not, skip it.**

In [2]:
collections = CJH_Archives('AJHS').get_meta_data('collections', 1, 2)

Creating CJHA Scraper Object for AJHS
Scraping Collections (Finding Aids)
Scraping Archive Index for Entry Links 1
Scraping Archive Index for Entry Links 2
Number of Objects Extracted:  60
Scraping entry meta data...
Record:  1 https://archives.cjh.org/repositories/3/resources/15236 
      The White Jew Newspaper
    
Record:  2 https://archives.cjh.org/repositories/3/resources/13248 
      Synagogue Council of America Records
    
Record:  3 https://archives.cjh.org/repositories/3/resources/15562 
      Admiral Lewis Lichtenstein Strauss Papers
    
Record:  4 https://archives.cjh.org/repositories/3/resources/15566 
      Meyer Greenberg Papers
    
Record:  5 https://archives.cjh.org/repositories/3/resources/15570 
      Louis Lipsky Papers
    
Record:  6 https://archives.cjh.org/repositories/3/resources/15623 
      Noah Benevolent Society Records
    
Record:  7 https://archives.cjh.org/repositories/3/resources/15557 
      Leo Hershkowitz Collection of Court Records
    
Record: 

Finding aid descriptions are the default field, but you can pick any column name from the imported data. You may also want to experiment with the records data!

In [3]:
# Set initial quotes df.
df_quotes = collections
li_quotes = df_quotes['Finding Aid & Administrative Information'].tolist()
stringV = li_quotes
print("Number of Rows", len(li_quotes))
a = ' '.join(stringV)
b_meta = wordninja.split(a)
print(len(b_meta))

Number of Rows 60
5814


All possible fields we could analyze:

In [4]:
df_quotes.columns

Index(['Additional Description', 'Creator', 'Dates', 'Extent',
       'Finding Aid & Administrative Information', 'Language of Materials',
       'Link', 'Name', 'Physical Storage Information', 'Related Names',
       'Repository Details', 'Scope and Content Note', 'Subjects', 'Use Terms',
       'Access Terms'],
      dtype='object')

# Parsing and Converting a Group of PDFs to Plain Text

## Considerations for analyzing this medium
The step that poses the most issues when analyzing journal articles or academic papers is converting the file from PDF to plain text. Each page of a PDF carries a lot of information besides the body text: page numbers, citation caveats, margin notes, tables, and graphs.
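As a rough illustration of the cleanup this implies, the sketch below drops lines that contain nothing but a page number. It is not how JATA parses PDFs; `drop_page_number_lines`, the regular expression, and the sample text are all hypothetical and made up for this example.

```python
import re

# Hypothetical pattern (not from JATA): a line holding only "12" or "Page 3 of 40".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d{1,4}(\s+of\s+\d{1,4})?\s*$", re.IGNORECASE)


def drop_page_number_lines(page_text: str) -> str:
    """Return page_text without the lines that contain only a page number."""
    kept = [line for line in page_text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)


# Made-up sample of text extracted from a PDF page.
sample = "Introduction\nThis collection holds 3 boxes.\n12\nPage 3 of 40\nIt is open to researchers."
cleaned = drop_page_number_lines(sample)
print(cleaned)
```

Numbers inside a sentence survive because the pattern must match the whole line. Margin notes, tables, and graphs need more than a one-line rule.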


## Load the article text from our parsed data

***Set the variable in the following cell to True if you want to work with PDFs; if not, leave it as is. Load your PDFs into the content folder on the left.***

In [5]:
working_with_pdfs = False

In [6]:
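if working_with_pdfs:
  # Parse every PDF in the current directory into a DataFrame with a 'Text' column.
  fileDF = parse_all_pdfs_in_curr_dir()
  # Set initial quotes df.
  df_quotes = fileDF

  li_quotes = df_quotes['Text'].tolist()
  stringV = li_quotes
  print("Number of Articles", len(li_quotes))

  # Join all articles into one string, then split run-together words.
  a = ' '.join(stringV)

  b_pdf = wordninja.split(a)
  print(len(b_pdf))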

# Tokenize sentences and words, remove stopwords, use stemmer & lemmatizer

First, a note on the difference between Stemming vs Lemmatization:

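* Stemming: Shortens a word by stripping affixes with simple, often regex-based, rules. The result need not be a real word (in the output below, "repository" is stemmed to "repositori").

* Lemmatization: Finds the root word (lemma) using linguistic rules, with some help from regex rules. The result is a real word (in the output below, "finding" is lemmatized to "find").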
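To make the contrast concrete, here is a toy sketch of both ideas. It is not what `stopStemLem` does internally: the suffix rules and the lemma dictionary are tiny, hypothetical stand-ins for a real stemmer and a real lexicon.

```python
import re

# Toy suffix rules, tried in order; real stemmers use many more rules.
SUFFIX_RULES = [(r"ies$", "i"), (r"ing$", ""), (r"ion$", ""), (r"s$", "")]

# Toy lemma dictionary; a real lemmatizer consults a full lexicon and POS tags.
LEMMAS = {"details": "detail", "finding": "find", "wrote": "write"}


def toy_stem(word: str) -> str:
    """Strip the first matching suffix; the result may not be a real word."""
    for pattern, replacement in SUFFIX_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, replacement, word)
    return word


def toy_lemmatize(word: str) -> str:
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


for word in ["description", "repositories", "finding", "wrote"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The irregular form "wrote" shows the difference: no suffix rule applies, so the stemmer leaves it alone, while the dictionary lookup maps it to "write".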

## Process results, find the most popular lemmatized words and group results by Part of Speech (POS)

In [7]:
df_words = stopStemLem(li_quotes)

df_token_lists.head(5): [wide table, columns 0 to 90; rows not shown in the source]

In [8]:
print("df_words.head(10):")
print(df_words.head(10))

df_words.head(10):
           lem  index        token        stem pos  counts
0  description     14  description    descript  NN     217
1   repository     24   repository  repositori  NN     182
2     language     13     language     languag  NN     122
3       detail     25      details      detail  NN     120
4      english     15      english     english  JJ     116
5          aid    166          aid         aid  NN     106
6         find    165      finding        find  VB     106
7         part     28         part        part  NN      67
8       jewish     30       jewish      jewish  NN      65
9       script     16       script      script  NN      64


## Frequency of Lemmatized Words Grouped by Parts of Speech

In [9]:
#hide-input
df_words.head(50)

Unnamed: 0,lem,index,token,stem,pos,counts
0,description,14,description,descript,NN,217
1,repository,24,repository,repositori,NN,182
2,language,13,language,languag,NN,122
3,detail,25,details,detail,NN,120
4,english,15,english,english,JJ,116
5,aid,166,aid,aid,NN,106
6,find,165,finding,find,VB,106
7,part,28,part,part,NN,67
8,jewish,30,jewish,jewish,NN,65
9,script,16,script,script,NN,64


## Top 10 words per Part Of Speech (POS)

In [10]:
df_words = df_words[['lem', 'pos', 'counts']].head(200)
dfList_pos = format_stopstemlem(df_words)
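The internals of `format_stopstemlem` are not shown here, so the following is only a sketch of the general idea: group the lemmas by POS tag and keep the most frequent ones in each group. The rows are copied from the `df_words.head(10)` output above; `top_n_per_pos` is a hypothetical helper name.

```python
from collections import defaultdict

# (lemma, POS tag, count) rows copied from the df_words.head(10) output.
rows = [
    ("description", "NN", 217), ("repository", "NN", 182), ("language", "NN", 122),
    ("detail", "NN", 120), ("english", "JJ", 116), ("aid", "NN", 106),
    ("find", "VB", 106), ("part", "NN", 67), ("jewish", "NN", 65), ("script", "NN", 64),
]


def top_n_per_pos(rows, n=10):
    """Group rows by POS tag and keep the n most frequent lemmas per tag."""
    grouped = defaultdict(list)
    for lem, pos, count in rows:
        grouped[pos].append((lem, count))
    return {
        pos: sorted(items, key=lambda item: item[1], reverse=True)[:n]
        for pos, items in grouped.items()
    }


top = top_n_per_pos(rows, n=3)
print(top["NN"])
print(top["JJ"])
```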

### Nouns

In [11]:
#hide-input
dfList_pos[0]

Unnamed: 0,index,lem,pos,counts
0,0,description,NN,217
1,1,repository,NN,182
2,2,language,NN,122
3,3,detail,NN,120
4,5,aid,NN,106
5,7,part,NN,67
6,8,jewish,NN,65
7,9,script,NN,64
8,11,york,NN,63
9,12,note,NN,62


### Adjectives

In [12]:
dfList_pos[1]

Unnamed: 0,index,lem,pos,counts
0,4,english,JJ,116
1,10,new,JJ,63
2,13,united,JJ,61
3,14,historical,JJ,61
4,24,american,JJ,60
5,38,undated,JJ,28
6,56,consolidated,JJ,10
7,58,mixed,JJ,10
8,63,physical,JJ,10
9,66,undetermined,JJ,8


### Verbs

In [13]:
dfList_pos[2]

Unnamed: 0,index,lem,pos,counts
0,6,find,VB,106
1,31,write,VB,50
2,32,create,VB,48
3,42,describe,VB,27
4,43,complete,VB,27
5,57,process,VB,10
6,68,make,VB,8
7,71,add,VB,6
8,73,remove,VB,6
9,78,derive,VB,5


### Adverbs

In [14]:
dfList_pos[3]

Unnamed: 0,index,lem,pos,counts


### Frequency plot grouped by POS type

In [17]:
# Keep lemmas that occur more than once, most frequent first.
source = df_words[df_words.counts > 1].sort_values(by=['counts'], ascending=False)
alt.Chart(source).mark_bar(opacity=0.7).encode(
    # sort='-x' orders the bars by the x encoding (counts), descending.
    y=alt.Y('lem:N', sort='-x'),
    x=alt.X('counts:Q', stack=None),
    color="pos:N",
)

# Machine Learning Text Generation Model

When parsing PDFs, the most common artifacts are page numbers and words that are stucktogetherlikethis. To handle this and make our training data more robust, we use a package called wordninja, which uses English word frequencies and some fancy math to split run-together words correctly. We also remove all numbers that are not spelled out in the text.
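The "fancy math" can be sketched as a small dynamic program: give frequent words a low cost and choose the split with the lowest total cost. This is an illustration of the idea under stated assumptions, not wordninja's actual code; the eight-word vocabulary and its ordering are hypothetical, and the last line shows one simple way to drop digits.

```python
import math
import re

# Hypothetical toy vocabulary, ordered from most to least frequent.
WORDS = ["the", "to", "this", "like", "aid", "finding", "stuck", "together"]
# Rank-based cost: the rarer the word, the more it costs to use.
COST = {word: math.log(rank + 2) for rank, word in enumerate(WORDS)}
MAX_LEN = max(len(word) for word in WORDS)
UNKNOWN_COST = 1e9  # heavy penalty for chunks that are not in the vocabulary


def split_words(text: str) -> list:
    """Split run-together text into the cheapest sequence of words."""
    # best[i] holds (total cost, start of last word) for the prefix text[:i].
    best = [(0.0, 0)]
    for i in range(1, len(text) + 1):
        candidates = [
            (best[j][0] + COST.get(text[j:i], UNKNOWN_COST), j)
            for j in range(max(0, i - MAX_LEN), i)
        ]
        best.append(min(candidates))
    # Walk backwards through the stored split points to recover the words.
    words = []
    i = len(text)
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return words[::-1]


print(split_words("stucktogetherlikethis"))
print(split_words("findingaid"))
# Replace every run of digits with a space, then collapse the whitespace.
print(" ".join(re.sub(r"\d+", " ", "box 12 folder 3").split()))
```

In the notebook itself, `wordninja.split` is called on the joined text and returns a list of words.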

In [18]:
#collapse-hide
clean_text_for_training = clean_plain_text_for_training(stringV)
file = " ".join(clean_text_for_training)


60


In [19]:
x_data, X, y, chars = find_patterns(file)

Total number of characters: 32060
Total vocab: 37
Total Patterns: 31960
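These counts are consistent with a sliding window of 100 characters, since 32060 - 100 = 31960. That window length is an inference from the numbers, not something the output states. The sketch below shows how such patterns are commonly built for character-level models; `build_patterns` is a hypothetical stand-in, not the source of `find_patterns`.

```python
def build_patterns(text: str, seq_length: int):
    """Encode text as integers and cut it into (input window, next character) pairs."""
    chars = sorted(set(text))  # the vocabulary
    char_to_int = {char: index for index, char in enumerate(chars)}
    inputs, targets = [], []
    for start in range(len(text) - seq_length):
        window = text[start:start + seq_length]
        inputs.append([char_to_int[char] for char in window])
        # The target is the single character that follows the window.
        targets.append(char_to_int[text[start + seq_length]])
    return inputs, targets, chars


sample = "finding aid written in english"
inputs, targets, chars = build_patterns(sample, seq_length=10)
print("Total number of characters:", len(sample))
print("Total vocab:", len(chars))
print("Total Patterns:", len(inputs))
```

Each pattern pairs a window with the character that follows it, so a text of n characters yields n minus the window length patterns.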


## Setting Parameters and Training

JATA comes with a built-in function that sets the model parameters, but feel free to override it with the custom function below. Set the my_own_params flag to True if you want to use your own settings!

***FYI - This can take some time (sometimes up to an hour using the built-in settings), so grab a snack or take a nap!***

Reducing the number of epochs shortens training time; however, it also reduces the robustness of your output!

In [20]:
my_own_params = False

In [21]:
def set_model_params_custom(X, y):
    # Three stacked LSTM layers, each followed by 20% dropout to limit overfitting.
    model = Sequential()
    model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(256, return_sequences=True))
    model.add(Dropout(0.2))
    model.add(LSTM(128))
    model.add(Dropout(0.2))
    # One softmax output per column of y (one per character, assuming y is one-hot encoded).
    model.add(Dense(y.shape[1], activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    # Save the weights whenever the training loss improves.
    filepath = "model_weights_saved.hdf5"
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
    desired_callbacks = [checkpoint]
    # Raise epochs for better output at the cost of longer training.
    model.fit(X, y, epochs=1, batch_size=200, callbacks=desired_callbacks)
    return model

import time

start = time.time()
if my_own_params:
  model = set_model_params_custom(X,y)
else:
  model = set_model_params(X,y)
end = time.time()
print("Time Elapsed:")
print(end - start)

Epoch 1/5
Epoch 00001: loss improved from inf to 3.06437, saving model to model_weights_saved.hdf5
Epoch 2/5
Epoch 00002: loss improved from 3.06437 to 3.01255, saving model to model_weights_saved.hdf5
Epoch 3/5
Epoch 00003: loss improved from 3.01255 to 2.99917, saving model to model_weights_saved.hdf5
Epoch 4/5
Epoch 00004: loss improved from 2.99917 to 2.92984, saving model to model_weights_saved.hdf5
Epoch 5/5
Epoch 00005: loss improved from 2.92984 to 2.66646, saving model to model_weights_saved.hdf5
Time Elapsed:
2405.342127799988


## Loading the Model and Generating Text

## Getting Some Output

In [22]:
filepath = "model_weights_saved.hdf5"


In [24]:
print(what_does_the_robot_say(x_data,model, chars,filepath))

ry details part american jewish historical society repository http j hs org contact  west  th st
s   p  status completed author finding aid created marc  ead j hs xsl date  descripti
 papers undated    p  status progress author finding aid created marc  ead j hs xsl 
ety repository http j hs org contact  west  th street new york ny  united states inquiries 
eanup physical storage information container consolidated box p  folder p  mixed materials repos
uthor finding aid michael mont albano part cj h holocaust resource initiative made possible conferen
ing aid created marc  ead j hs xsl date  language description english script description latin 
ndated   p  status completed author processed yakov ill ich sk lar date  description 
scription english script description latin language description note finding aid written english rev
  status completed author finding aid created marc  ead j hs xsl date  description rules
pt description latin language description note finding aid written eng

Play around with the training data and model params until you find your desired output!