<a href="https://colab.research.google.com/github/bstrain71/usn_instruction_search/blob/master/usn_instruction_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### INGEST

OPNAV Instructions: https://www.secnav.navy.mil/doni/opnav.aspx?RootFolder=%2Fdoni%2FDirectives%2F01000%20Military%20Personnel%20Support&FolderCTID=0x012000E8AF0DD9490E0547A7DE7CF736393D04&View=%7BCACF3AEF%2DAED4%2D433A%2D8CE5%2DA45245715B5C%7D

Note: This example only contains series 01-01 through 01-500 as it is just for demonstration. Here is a key to their topics:

 
01-01 General Military Personnel Records

01-100 General Recruiting Records 

01-200 Personnel Classification and Designation 

01-300 Assignment and Distribution Services  

01-400 Promotion and Advancement Programs 

01-500 Military Training and Education Services 



In [0]:
%matplotlib inline

In [58]:
# Load gdrive
from google.colab import drive
drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


In [0]:
!pip install -q tika #install tika

In [0]:
from tika import parser

# Test code for seeing how tika works.
#raw = parser.from_file('/content/gdrive/My Drive/MSDS_422/OPNAV_Instructions/1000.16L With CH-2.pdf')
# raw['content'] gives the raw text of the pdf
#print(raw['content'])
#raw.keys()
#raw['metadata']['resourceName'] # Calls the file name - not metadata. not every file has metadata so don't rely on it.

In [0]:
import os

# The indices for these are the same so they can be zipped later.
# The .pdf filename is what is read, not the metadata, so whatever
# the file is called in the folder is what it will be listed as
# in the results.

documents = [] # Contains raw text from all .pdfs.
filenames = [] # Contains the filenames of the .pdfs

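# Parse all .pdf contents and their filenames.
directory = '/content/gdrive/My Drive/MSDS_422/OPNAV_Instructions/'
for file in os.listdir(directory):
  if not file.lower().endswith('.pdf'):
    continue  # Skip anything in the folder that is not a .pdf.
  temp = parser.from_file(os.path.join(directory, file))
  # 'content' can be None when no text could be extracted.
  documents.append(temp['content'] or '')
  # Use the name on disk, as described above, instead of the metadata.
  filenames.append(file)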

### EDA and Data Cleaning

Make everything in the documents lowercase and remove stopwords. The stopword list can be modified as needed; this one comes directly from the gensim documentation and is relatively standard.

In [0]:
from collections import defaultdict
from gensim import corpora

# improved list from Stone, Denis, Kwantes (2010)
STOPWORDS = """
a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be
became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can
cannot cant co computer con could couldnt cry de describe
detail did didn do does doesn doing don done down due during
each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen
fify fill find fire first five for former formerly forty found four from front full further get give go
had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie
if in inc indeed interest into is it its itself keep last latter latterly least less ltd
just
kg km
made make many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely
neither never nevertheless next nine no nobody none noone nor not nothing now nowhere of off
often on once one only onto or other others otherwise our ours ourselves out over own part per
perhaps please put rather re
quite
rather really regarding
same say see seem seemed seeming seems serious several she should show side since sincere six sixty so some somehow someone something sometime sometimes somewhere still such system take ten
than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they thick thin third this those though three through throughout thru thus to together too top toward towards twelve twenty two un under
until up unless upon us used using
various very very via
was we well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without would yet you
your yours yourself yourselves
"""

# remove common words and tokenize
stoplist = set(STOPWORDS.split())
texts = [
    [word for word in document.lower().split() if word not in stoplist]
    for document in documents
]

# remove words that appear only once
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [
    [token for token in text if frequency[token] > 1]
    for text in texts
]

# Building the corpus.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
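
To make the two filtering steps above easier to follow, here is a minimal standard-library sketch on toy data. The names `toy_documents`, `toy_stoplist`, and `toy_texts` are hypothetical stand-ins for the notebook's `documents`, `stoplist`, and `texts`; the logic is intended to mirror the cell above, not to replace it.

```python
from collections import Counter

# Hypothetical toy inputs; the notebook's real `documents` come from the parsed PDFs.
toy_documents = [
    "The Personnel Recovery program",
    "Personnel records and the recovery of records",
    "Training program",
]
toy_stoplist = {"the", "and", "of"}

# Lowercase, split on whitespace, and drop stopwords.
toy_texts = [
    [word for word in doc.lower().split() if word not in toy_stoplist]
    for doc in toy_documents
]

# Count every token across the whole collection.
frequency = Counter(token for text in toy_texts for token in text)

# Keep only tokens that occur more than once in the collection.
toy_texts = [[token for token in text if frequency[token] > 1] for text in toy_texts]

print(toy_texts)
```

In this toy run the word "training" appears only once in the whole collection, so it is dropped. Note that splitting on whitespace leaves punctuation attached to words, which is also true of the cell above.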

### Modeling

Build the model to measure similarity between documents. Here we will use Latent Semantic Indexing (LSI), also known as Latent Semantic Analysis (LSA).

In [0]:
from gensim import models
# LSI uses singular value decomposition (SVD). num_topics is the
# number of latent dimensions (singular values and their vectors)
# kept in the truncated decomposition.
# "Indexing by Latent Semantic Analysis" <http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf>
# Latent Semantic Indexing <https://en.wikipedia.org/wiki/Latent_semantic_indexing>
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=100)

Now suppose a user types in a query (whatever string is assigned to `doc` below). We would like to sort the corpus documents in decreasing order of relevance to this query. Unlike modern search engines, here we concentrate on a single aspect of possible similarity: the apparent semantic relatedness of the texts (words). No hyperlinks, no random-walk static ranks, just a semantic extension over the boolean keyword match:

In [72]:
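# `doc` is the user's search term.
# `vec_bow` is the search term as a bag of words: (token_id, count) pairs
# for the query words that exist in the dictionary.
# `vec_lsi` is the same query projected into LSI space: a list of
# (topic_id, weight) tuples, one per topic. These are not similarities yet.

# The bag-of-words vector alone only captures literal keyword matches,
# much like doing Ctrl+F and counting the number of hits.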

doc = "personnel recovery"


vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]  # convert the query to LSI space
print(vec_lsi)

[(0, -0.013713959072796473), (1, 0.2024285724290296), (2, -0.05372699122833611), (3, 0.0845648569468504), (4, 0.006650192946701625), (5, 0.06435716867364291), (6, -0.017810813855787914), (7, 0.07809246379208735), (8, 0.020914464153183616), (9, -0.04360172752032436), (10, 0.029131633925782923), (11, 0.029582256269949894), (12, 0.04050317525986838), (13, -0.05770880198636202), (14, 0.03671056249646966), (15, 0.05463676156811324), (16, 0.00027852422945872935), (17, -0.011144916944591276), (18, -0.00796559642777531), (19, -0.007227705040453201), (20, -0.008182080718364713), (21, 0.18931805497119344), (22, 0.19190137497335943), (23, 0.12037824455248419), (24, -0.027031518500016803), (25, 0.08527406077895502), (26, -0.0816882638818058), (27, -0.04454863873906595), (28, -0.13615992548983594), (29, -0.024735572146978694), (30, 0.04324184238067171), (31, -0.008247487285190553), (32, 0.020263038606386348), (33, 0.04512507268797478), (34, -0.025537255113263588), (35, 0.03913739122147676), (36, 0.
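
As a rough illustration of the bag-of-words step that precedes the LSI projection, the sketch below mimics what `dictionary.doc2bow` produces: sorted `(token_id, count)` pairs, with words missing from the vocabulary silently ignored. The helper `toy_doc2bow` and the `token2id` mapping are hypothetical and written only for this example; they are not gensim's implementation.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens not in token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary; in the notebook the real mapping lives in `dictionary`.
token2id = {"personnel": 0, "recovery": 1, "training": 2}

query = "Personnel recovery and personnel accountability"
bow = toy_doc2bow(query.lower().split(), token2id)
print(bow)
```

One consequence worth keeping in mind: a search term made up entirely of words that are not in the dictionary yields an empty bag of words, so there is nothing for the model to project.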

In addition, we will be considering [cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity)
to determine the similarity of two vectors. Cosine similarity is a standard measure
in Vector Space Modeling, but wherever the vectors represent probability distributions,
[different similarity measures](http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Symmetrised_divergence)
may be more appropriate.
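
For readers who want to see the measure itself, here is a small standard-library sketch of cosine similarity between two dense vectors. The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this example; gensim computes the same quantity internally on normalized vectors.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # Convention chosen here for all-zero vectors.
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on direction, not length, which is why a short query can still score highly against a long instruction.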

#### Initializing query structures

To prepare for similarity queries, we need to enter all documents which we want
to compare against subsequent queries.



In [73]:
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus])  # transform corpus to LSI space and index it



<div class="alert alert-danger"><h4>Warning</h4><p>The class <code>similarities.MatrixSimilarity</code> is only appropriate when the whole
  set of vectors fits into memory. For example, a corpus of one million documents
  would require 2GB of RAM in a 256-dimensional LSI space, when used with this class.

  Without 2GB of free RAM, you would need to use the <code>similarities.Similarity</code> class.
  This class operates in fixed memory, by splitting the index across multiple files on disk, called shards.
  It uses <code>similarities.MatrixSimilarity</code> and <code>similarities.SparseMatrixSimilarity</code> internally,
  so it is still fast, although slightly more complex.</p></div>

Index persistence is handled via the standard `save` and `load` methods:



In [74]:
index.save('/tmp/deerwester.index')
index = similarities.MatrixSimilarity.load('/tmp/deerwester.index')



This is true for all similarity indexing classes (`similarities.Similarity`,
`similarities.MatrixSimilarity` and `similarities.SparseMatrixSimilarity`).
Also in the following, `index` can be an object of any of these. When in doubt,
use `similarities.Similarity`, as it is the most scalable version, and it also
supports adding more documents to the index later.

#### Performing queries

To obtain similarities of our query against every indexed instruction:



In [75]:
# index is the similarity index, or what we are checking against (corpus in the lsi space)
# vec_lsi is our query in the lsi space
# sims is an array of cosine similarities, one per indexed document,
# in the same order as filenames

sims = index[vec_lsi]

similarity_list = list(zip(sims, filenames))
#print(similarity_list)
print(sorted(similarity_list, reverse = True))

[(0.53474784, '1640.9A.pdf'), (0.4640187, '3006.1 w CH-2.pdf'), (0.4238381, '1001.19D.pdf'), (0.3761509, '1414.4D.pdf'), (0.3660003, '1160.6C.pdf'), (0.36188006, '1000.24D.pdf'), (0.35457563, '1220.1E.pdf'), (0.33956644, '1070.2C.pdf'), (0.3228429, '3120.32D W CH-1.pdf'), (0.26242042, '1430.4C.pdf'), (0.2618715, '1520.36C.pdf'), (0.2552052, '3120.16C.pdf'), (0.25459012, '1330.2C.pdf'), (0.25389275, '1770.1B.pdf'), (0.24652676, '2221.5D.pdf'), (0.24579406, '1301.11.pdf'), (0.23478015, '1620.3.pdf'), (0.23458447, '3060.7C.pdf'), (0.22452052, '1520.39A.pdf'), (0.22212721, '1571.1B.pdf'), (0.2170622, '3130.7B.pdf'), (0.2122381, '1540.2F.pdf'), (0.21204665, '1500.85.pdf'), (0.20718022, '1754.4A.pdf'), (0.20310572, '1001.27.pdf'), (0.19834265, '2100.2A.pdf'), (0.19519977, '1306.4A.pdf'), (0.19471943, '1730.1E.pdf'), (0.19431277, '1220.2B.pdf'), (0.19288923, '1780.4.pdf'), (0.19110031, '1223.1D.pdf'), (0.19034609, '1320.6.pdf'), (0.18617517, '1000.16L With CH-2.pdf'), (0.18490621, '1770.3A.pd
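
Printing every document makes the ranking hard to read once the corpus grows. A possible refinement, sketched below with the standard library, is to show only the best few matches. The helper `top_matches` and the toy scores and filenames are hypothetical; in the notebook the inputs would be `sims` and `filenames`.

```python
import heapq

def top_matches(scores, names, k=5):
    """Return the k (score, name) pairs with the highest scores, best first."""
    return heapq.nlargest(k, zip(scores, names), key=lambda pair: pair[0])

# Hypothetical scores and filenames for illustration only.
toy_sims = [0.12, 0.53, 0.46, 0.05]
toy_filenames = ["a.pdf", "b.pdf", "c.pdf", "d.pdf"]

for score, name in top_matches(toy_sims, toy_filenames, k=2):
    print(f"{score:.3f}  {name}")
```

Ranking on the score alone (the `key` argument) avoids comparing filenames when two documents happen to tie.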

### Conclusion

This model could be very useful in production. USN personnel have to sort through a huge corpus of instructions to get information, so an easy way to query this corpus for the instructions relevant to a topic would save everyone time and energy. There is no objective metric with which to measure 'accuracy', but the searches have consistently returned sane and sensible results during operator testing.