In [None]:
import nbpresent

nbpresent.__version__

from IPython.display import Image

%matplotlib inline

# Search with Inference
---


### GA SG Data Science 5 - Project

## Matthew A. Snell

#### 2017-10-21


# Objective

---

Leverage Ngrams plus TensorFlow (Word2Vec) to rank a body of *sources* against a provided search *phrase*

  - Rank Sources using Ngram based scoring (width of 3)
    
  - Supplement Ngram search with Word2Vec Skipgram **Link**: *augment* the search term with its most likely nearest neighbours
  

# Long-Term Objective

## Build a Personal (Self-Hosted) Search Engine

### Indexes and searches preferred and personal data sources


#### Data Sources Can Include (Directly or via Plugins)

  - AV Metadata
  - Personal Files/Directories
    - eg. DOC, ODF, PDF, TXT, CSV, MOBI, EPUB, code
  - IMAP / Email Accounts
  - Evernote, Wallabag, Pocket
  - RSS Feeds
  - Web Browser Bookmarks and History
  - Social Media Streams and/or *Pages* or Subscriptions
    - eg. FB, Twitter, Reddit Subs etc
  - Online Storage
    - eg. Dropbox, GDrive, Box, pCloud, iCloud, WebDAV (Nextcloud, ownCloud) etc
  - Any Data Store with a defined API, authentication (eg OAUTH) and/or open (or documented) standards

#### Functionality Extendable via Plugins

  - P2P, Prediction and related sources etc
  

# Lessons Learnt ... *So Far* ...

---

  - Who needs classes? It's just a proof of concept

    **Wrong** - Code Complexity
    
  - Who needs a DBMS? It's just a proof of concept

    **Wrong** - Performance
      
  - ETL (cleansing) is slow (cpu time)

    **Get the Data Models Right** - Re-Runs & Validators are expensive
    
  - Dictionaries `dict()` or `{ k:v }` are Great!
   
   **Nested Dictionaries are (code) Messy**:
   
   `dict = { k1: { k2: { k3: v }}}` -> `dict[k1][k2][k3]`
   
   

# Lessons Learnt ... *So Far* ... (cont..)

---
    
     
  - I'll build a non/semi-Data Science model 1st for comparison

    **Time Consuming** - More about Python than Data Science

  - RegEx - Need I say more...
  
  - I don't know how to describe Data Science...
  
    - But I know what it is when I see it...
    

# What is an Ngram?

For the purposes of Data Science or NLP, an Ngram (monogram, bigram, trigram, quadgram etc) is a *sliding* group of *words* taken from a set of text.

eg. *using:* **`consider this line of text`**
```
    1gram = 5 items - [ consider, this, line, of, text ]

    2gram = 4 items - [ consider this, this line, line of, of text ]

    3gram = 3 items - [ consider this line, this line of, line of text ]
    
    Total = 12 potential Ngrams to leverage
```
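The sliding window above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code; the function name `sliding_ngrams` and its `max_width` parameter are assumptions made for the example.

```python
def sliding_ngrams(text, max_width=3):
    """Return every 1gram up to max_width-gram as a sliding window over the words."""
    words = text.split()
    ngrams = []
    for width in range(1, max_width + 1):
        # A window of this width fits len(words) - width + 1 times
        for start in range(len(words) - width + 1):
            ngrams.append(" ".join(words[start:start + width]))
    return ngrams


ngrams = sliding_ngrams("consider this line of text")
print(len(ngrams))   # 12
print(ngrams[-3:])   # ['consider this line', 'this line of', 'line of text']
```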

# But for Search Phrases,
# We handle Ngrams differently

How a user sequences or weights the words in a search phrase can be very different from how the same terms are used in a source.

Consider:

**Search Phrase:** `Python Learning Examples SciKit Data Science`

vs
         
**Book Title:** `Data Science and SciKit: Learning through Python with Examples`

Using sliding Ngrams, the best hit we get beyond the (low scoring) 1grams is the 2gram `Data Science`. In a sea of documents this will not rank high, unless one of the words is rare, eg. `scikit`.


To counter this, for *search phrases* we generate every possible Ngram, up to a given width, and drop all those never seen before (ie. not in our Ngram Datastore)

eg. `Python Learning Examples SciKit Data Science` ->

```
[ data, data examples, data examples learning, data examples python, data examples science, data examples scikit, data learning, data learning examples, data learning python, data learning science, data learning scikit, data python, data python examples, data python learning, data python science, data python scikit, data science, data science examples, data science learning, data science python, data science scikit, data scikit, data scikit examples, data scikit learning, data scikit python, data scikit science, examples, examples data, examples data learning, examples data python, examples data science, examples data scikit, examples learning, examples learning data, examples learning python, examples learning science, examples learning scikit, examples python, examples python data, examples python learning, examples python science, examples python scikit, examples science, examples science data, examples science learning, examples science python, examples science scikit, examples scikit, examples scikit data, examples scikit learning, examples scikit python, examples scikit science, learning, learning data, learning data examples, learning data python, learning data science, learning data scikit, learning examples, learning examples data, learning examples python, learning examples science, learning examples scikit, learning python, learning python data, learning python examples, learning python science, learning python scikit, learning science, learning science data, learning science examples, learning science python, learning science scikit, learning scikit, learning scikit data, learning scikit examples, learning scikit python, learning scikit science, python, python data, python data examples, python data learning, python data science, python data scikit, python examples, python examples data, python examples learning, python examples science, python examples scikit, python learning, python learning data, python learning examples, python learning science, 
python learning scikit, python science, python science data, python science examples, python science learning, python science scikit, python scikit, python scikit data, python scikit examples, python scikit learning, python scikit science, science, science data, science data examples, science data learning, science data python, science data scikit, science examples, science examples data, science examples learning, science examples python, science examples scikit, science learning, science learning data, science learning examples, science learning python, science learning scikit, science python, science python data, science python examples, science python learning, science python scikit, science scikit, science scikit data, science scikit examples, science scikit learning, science scikit python, scikit, scikit data, scikit data examples, scikit data learning, scikit data python, scikit data science, scikit examples, scikit examples data, scikit examples learning, scikit examples python, scikit examples science, scikit learning, scikit learning data, scikit learning examples, scikit learning python, scikit learning science, scikit python, scikit python data, scikit python examples, scikit python learning, scikit python science, scikit science, scikit science data, scikit science examples, scikit science learning, scikit science python ]
```
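One way to produce a list like the one above is with ordered permutations of the search words. The sketch below assumes that approach; `multiplex_ngrams`, `drop_unseen` and the tiny `known_ngrams` set are hypothetical names and data, standing in for the real Ngram Datastore lookup.

```python
from itertools import permutations


def multiplex_ngrams(words, max_width=3):
    """Every ordered arrangement of 1..max_width distinct search words."""
    candidates = []
    for width in range(1, max_width + 1):
        for combo in permutations(words, width):
            candidates.append(" ".join(combo))
    return sorted(candidates)


def drop_unseen(candidates, known_ngrams):
    """Keep only the Ngrams that already exist in the datastore."""
    return [ngram for ngram in candidates if ngram in known_ngrams]


words = "python learning examples scikit data science".split()
candidates = multiplex_ngrams(words)
print(len(candidates))  # 156 = 6 + 30 + 120

# Hypothetical datastore contents, for illustration only
known_ngrams = {"data", "data science", "python", "scikit", "learning python"}
print(drop_unseen(candidates, known_ngrams))
```

With 6 words and a width of 3 the candidate list stays small, but it grows quickly with longer phrases, which is why dropping unseen Ngrams early matters.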


# Datastores

### Config
Contains application-wide baselines (eg. Instance UUID, Corpus Directory, Ngram width, Word2Vec Ngram Width) & master keys (currently only last used `vector`)

---

### Dictionary
`word` to `vector` (`int`) mappings. A `vector` is assigned sequentially when a new `word` is found.

---

### Vector
`vector`  (`int`) to `word` mappings

---

### NGram
Per `ngram`, per `srcID` list of lines with Ngram


### Sources
Per Source Path (file path, URL etc) to `srcID` (UUID)

Potential to add source hash (MD5 or SHA1) to identify duplicates or alternate sources
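
The hashing idea could look something like the sketch below, which uses SHA1 from Python's standard `hashlib`. It is an illustration only: the helper names and the sample paths are hypothetical, and a byte-for-byte hash only catches exact duplicates, not reformatted copies.

```python
import hashlib


def source_digest(content):
    """Return the SHA1 hex digest of a source's raw bytes."""
    return hashlib.sha1(content).hexdigest()


def find_duplicates(contents):
    """contents: source path -> raw bytes. Returns duplicate path -> first path seen."""
    first_seen = {}
    duplicates = {}
    for path, content in contents.items():
        digest = source_digest(content)
        if digest in first_seen:
            duplicates[path] = first_seen[digest]
        else:
            first_seen[digest] = path
    return duplicates


# Hypothetical sources, for illustration only
contents = {
    "docs/guide.txt": b"Python data science guide",
    "backup/guide.txt": b"Python data science guide",
    "docs/other.txt": b"Something else",
}
duplicates = find_duplicates(contents)
print(duplicates)  # {'backup/guide.txt': 'docs/guide.txt'}
```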

---

### DocMeta (Document Meta)
Per `source` summary information, including type (URL, FILE, ...) and sub-type (HTML, PDF, TXT, CSV, ...), index status, Word2Vec status, `list` of ngrams, indexed data version, when indexed etc

---

### DocStat (Document Statistics)
Per document (`srcID`), per line list of Ngrams
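
As a rough picture of how some of these datastores relate, the sketch below models Config, Dictionary, Vector, Sources and NGram as plain in-memory dictionaries. The variable and function names are assumptions for illustration; the actual storage layout is not shown in these slides.

```python
import uuid
from collections import defaultdict

config = {"ngram_width": 3, "last_vector": 0}   # application-wide baselines
dictionary = {}   # word -> vector (int)
vector = {}       # vector (int) -> word
sources = {}      # source path -> srcID (UUID string)
# ngram -> srcID -> list of line numbers containing the ngram
ngram_store = defaultdict(lambda: defaultdict(list))


def word_to_vector(word):
    """Assign the next sequential vector the first time a word is seen."""
    if word not in dictionary:
        config["last_vector"] += 1
        dictionary[word] = config["last_vector"]
        vector[config["last_vector"]] = word
    return dictionary[word]


def source_id(path):
    """Return the srcID for a source path, creating one if needed."""
    if path not in sources:
        sources[path] = str(uuid.uuid4())
    return sources[path]


src = source_id("corpus/text/example.txt")   # hypothetical path
ngram_store["data science"][src].append(4)
print(word_to_vector("data"), word_to_vector("science"), word_to_vector("data"))  # 1 2 1
```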


# Algorithm - Normalise Text



### Normalise Text

---

Convert to lowercase, then apply RegEx substitutions:
```
import re

# norm_text holds the raw line; convert to lowercase first
norm_text = norm_text.lower()

# Remove apostrophe (or backtick) in words: [alpha]'[alpha]=[alpha][alpha]
# eg. don't = dont
norm_text = re.sub(r"(\w+)[`'](\w+)", r'\1\2', norm_text)

# Replace non-AlphaNumeric sequences with Space
norm_text = re.sub(r'[^\w]+', ' ', norm_text)

# Replace spaces, underscores, tabs, newlines and return
# sequences with a space (mostly redundant except for '_')
norm_text = re.sub(r'[ _\t\n\r]+', ' ', norm_text)

# Replace pure digits with space eg 1234, but not 4u or best4u.
# \b does not consume the separators, so runs such as "12 34 56"
# are removed in full; then collapse the leftover spaces
norm_text = re.sub(r'\b\d+\b', ' ', norm_text)
norm_text = re.sub(r' +', ' ', norm_text)
```


# Algorithm - Index - Ngram


Per Line in File/Source;

  - Normalise
  
  - Remove *Stop Words* (eg. and, it, for, a)
  
  - Break into 1gram to Ngrams (Currently 3gram)
  
  - Add 1gram (words) to Dictionary/Vector Datastore
  
  - Add to Ngram Datastore (Ngram & *Source*)
  
  - Add Ngram's existence to DocMeta Datastore
  
  - Add Ngram & LineID (No.) to DocStat for *Source*
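
The indexing steps above can be sketched end to end as follows. This is a simplified illustration under several assumptions: the normalisation is reduced to one RegEx, the stop word list is tiny, and the datastores are in-memory dictionaries with hypothetical names.

```python
import re
from collections import defaultdict

STOP_WORDS = {"and", "it", "for", "a"}   # tiny illustrative list
NGRAM_WIDTH = 3

dictionary = {}                                        # word -> vector (int)
ngram_store = defaultdict(lambda: defaultdict(list))   # ngram -> srcID -> [lineIDs]
doc_meta = defaultdict(set)                            # srcID -> ngrams present
doc_stat = defaultdict(dict)                           # srcID -> lineID -> [ngrams]


def index_source(src_id, lines):
    for line_id, line in enumerate(lines, start=1):
        # Normalise (simplified) and remove stop words
        text = re.sub(r"[^\w]+", " ", line.lower())
        words = [w for w in text.split() if w not in STOP_WORDS]
        # Break into 1gram to Ngrams with a sliding window
        line_ngrams = []
        for width in range(1, NGRAM_WIDTH + 1):
            for start in range(len(words) - width + 1):
                line_ngrams.append(" ".join(words[start:start + width]))
        # Add 1grams to the Dictionary, assigning sequential vectors
        for word in words:
            dictionary.setdefault(word, len(dictionary) + 1)
        # Record each Ngram against the source, DocMeta and DocStat
        for ngram in line_ngrams:
            ngram_store[ngram][src_id].append(line_id)
            doc_meta[src_id].add(ngram)
        doc_stat[src_id][line_id] = line_ngrams


index_source("src-1", ["A guide for Data Science", "Data Science and Python"])
print(ngram_store["data science"]["src-1"])  # [1, 2]
print(dictionary)  # {'guide': 1, 'data': 2, 'science': 3, 'python': 4}
```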

# Algorithm - Word2Vec

As at 11am Oct 21st - Broken

Per File/Pseudo Line;

  - Normalise

  - Build Frequency List `[ [word, 5] ]`
  
  - Vectorized List `"sample sentence"` -> `[ 17634, 23654 ]`

  - Load/Initialize TensorFlow
  
  - Perform 10,0000 Iterations of Skipgram Algorithm
    - (word-1, word, word+1) -> calculate nearest neighbours likelihood
  
  - Save Model
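
The data preparation steps above (frequency list, vectorised list, and skipgram pairs for a window of one word either side) can be sketched without TensorFlow. The training and save steps are deliberately left out, and all function names here are assumptions for illustration.

```python
from collections import Counter


def build_word_count(words):
    """Frequency list in the form [[word, count], ...], most frequent first."""
    return [[word, count] for word, count in Counter(words).most_common()]


def vectorise(words, dictionary):
    """Map each word to its integer vector."""
    return [dictionary[word] for word in words]


def skipgram_pairs(vector_list, window=1):
    """(target, context) pairs for every neighbour within the window."""
    pairs = []
    for i, target in enumerate(vector_list):
        for j in range(max(0, i - window), min(len(vector_list), i + window + 1)):
            if j != i:
                pairs.append((target, vector_list[j]))
    return pairs


words = "data science with python data science".split()
word_count = build_word_count(words)
dictionary = {word: i for i, (word, _) in enumerate(word_count, start=1)}
vector_list = vectorise(words, dictionary)
print(vector_list)                       # [1, 2, 3, 4, 1, 2]
print(skipgram_pairs(vector_list)[:3])   # [(1, 2), (2, 1), (2, 3)]
```

These pairs are the training examples a skipgram model learns from: given the target, predict the context word.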
  

**Produced vectorList:**

2, 79985, 79981, 4, 3, 6, 8, 7, 9, 8, 9, 6, 10, 7, 10, 6, 11, 7, 11, 6, 15, 7, 15, 12, 14, 13, 6, 16, 7, 16, 18, 19, 20, 21, 17, 6, 23, 7, 23, 22, 6, 24, 25, 24, 6, 26, 7, 26, 6, 27, 7, 27, 6, 28, 7, 28, 6, 29, 7, 29, 30, 6, 31, 7, 31, 6, 32, 7, 32, 33, 6, 34, 7, 34, 6, 35, 7, 35, 6, 36, 7, 36, 6, 37, 7, 37, 6, 39, 7, 29, 39, 40, 38, 41, 6, 
42, 25, 42, 6, 43...

**Produced wordCount:**

['fawn', 1], ['unattackable', 1], ['middleman', 1], ['yellow', 4], ['narcotic', 1], ['four', 4], ['prices', 1], ['woods', 1], ['woody', 2], ['aggression', 2], ['marching', 1], ['looking', 1], ['eligible', 2], ['electricity', 3], ['similarity', 2], ['albumen', 1], ['immature', 1], ['antecede', 1], ['slothful', 1], ['regional', 2], ['pigment', 1], ['medicament', 1], ['disturb', 1], ['prize', 5], ['wooden', 2], ['reliable', 2], ['ornamental', 1], ['charter', 2], ['tired', 2], ['bacon', 2], ['pulse', 1], ['empirical', 2], ['elegant', 2], ['second', 7], ['tether', 1], ['horseshoe', 1], ['inanimate', 1], ['errors', 1], ['medically', 1], ['widen', 3], ['cooking', 4], ['schism', 1], ['fossil', 3], ['numeral', 1], ['contributes', 1], ['inducement', 1], ['cull', 1], ['specialist', 2], ['hero', 4], ['reporter', 2]...


# Algorithm - Search

For Input;

  - Normalise
  
  - Sanitise - remove duplicate words
  
  - Multiplex Ngram *Search Phrase*
  
  - Drop Unseen Ngrams (ie. not in Ngram Datastore)
  

For Scoring Ngram Matches;
  
  - Restrict calculations to statistics from *sources* with 1 or more Ngram Match
  
  - Rare and/or Long Ngrams Boost a *source's* score
    - 1gram < 2gram < ... < Ngram Weighting
    - Scarce Ngrams > Frequent Ngrams
  
  - `NgramWeight = math.log(AllNgramCounts, NgramCount)`
    - This reverses the count/frequency: the scarcer the Ngram, the higher its weight
  
  - Normalise to 0 -> 1 range 
    - `NgramWeight /= (HighestWeight - LowestWeight)`
  
  - *Source* Weight is sum of each Ngram contained;
    - `SrcScore += NgramWeight * SourceAppearances * NgramWidth`
  
  - Normalise *source's* score 0 -> 1 range
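
The scoring steps above might be sketched as below. Several details are assumptions, not taken from the project: the total is the sum of the matched Ngram counts, `+ 1` is added to the log base because `math.log(x, 1)` raises `ZeroDivisionError` for an Ngram seen only once, and source scores are normalised by dividing by the top score. All names and sample counts are hypothetical.

```python
import math


def ngram_weights(ngram_counts):
    """ngram_counts: ngram -> corpus-wide count. Scarcer ngrams weigh more."""
    total = sum(ngram_counts.values())
    # Assumption: count + 1 keeps the log base above 1
    weights = {ng: math.log(total, count + 1) for ng, count in ngram_counts.items()}
    spread = max(weights.values()) - min(weights.values())
    if spread > 0:
        weights = {ng: w / spread for ng, w in weights.items()}
    return weights


def score_sources(appearances, weights):
    """appearances: srcID -> {ngram: times the ngram appears in that source}."""
    scores = {}
    for src_id, counts in appearances.items():
        # weight * appearances * ngram width, summed per source
        scores[src_id] = sum(
            weights[ng] * n * len(ng.split()) for ng, n in counts.items()
        )
    top = max(scores.values())
    if top > 0:
        scores = {src_id: score / top for src_id, score in scores.items()}
    return scores


# Hypothetical counts, for illustration only
ngram_counts = {"data": 50, "data science": 5, "oregon state university": 1}
appearances = {
    "src-a": {"data": 10, "data science": 2},
    "src-b": {"data": 1, "oregon state university": 1},
}
weights = ngram_weights(ngram_counts)
scores = score_sources(appearances, weights)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Note how a single appearance of a rare 3gram in `src-b` nearly matches many appearances of common Ngrams in `src-a`.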
  

**Search Words;** data, dont, guide, handy, library, oregon, python, science, state, university

topSrcMatches:
```
 1 : FILE TXT 4fb66765-... 1.0               19 .../text/Library_List.txt
 2 : FILE CSV d37b3802-... 0.215896415671    16 .../csv/deep-nlp-Sheet_2.csv
 3 : FILE TXT 4ca0fe9d-... 0.0394573896895    9 .../text/t8.shakespeare.txt
 4 : FILE TXT 24076e31-... 0.00736442711495   5 .../text/CASTLE.txt
 5 : FILE TXT 65982f57-... 0.00634658230231   7 .../text/core-wordnet.txt
 6 : FILE TXT 67324674-... 0.00229748047626   4 .../text/rq3.txt
 7 : FILE TXT 2022f0e4-... 0.0021464153977    4 .../text/dd_dwarf.txt
 8 : FILE TXT 23260beb-... 0.00181603356786   5 .../text/i11.txt
 9 : FILE TXT 3c81a474-... 0.00115478241492   9 .../text/dictionary.txt
10 : FILE CSV 43820f0b-... 0.000982226868107  1 .../csv/deep-nlp-Sheet_1.csv
```
**ngrams used;**
    data, data science, data science python, data state, dont, guide, guide data, 
    guide state, handy, library, library data, library guide, oregon, oregon state, oregon state university, 
    python, python data, python data science, science, science python, state, state university, 
    university, university library, university oregon, university python, university state

For Scoring Based on Word2Vec Skipgram;

  - Take Sanitized *Search Phrase*
    
  - For each word, add the 2 most predicted neighbours by W2V skipgram

  - Re-Sanitise - remove duplicate words
  
  - Multiplex Ngram *Search Phrase*
  
  - Drop Unseen Ngrams (ie. not in Ngram Datastore)

  - Repeat Scoring based on Ngram Match algorithm & merge results
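
The augmentation and re-sanitise steps can be sketched as below. The `nearest_neighbours` mapping is hypothetical sample data; in practice a trained skipgram model would supply it. After this step, multiplexing and scoring proceed as for the plain Ngram search.

```python
def augment_search_words(words, nearest_neighbours, per_word=2):
    """Append each word's top predicted neighbours, then drop duplicates."""
    augmented = list(words)
    for word in words:
        augmented.extend(nearest_neighbours.get(word, [])[:per_word])
    # Re-sanitise: remove duplicates while keeping first-seen order
    return list(dict.fromkeys(augmented))


# Hypothetical neighbour predictions, for illustration only
nearest_neighbours = {
    "python": ["code", "data", "script"],
    "science": ["data", "research"],
}
augmented = augment_search_words(["python", "science"], nearest_neighbours)
print(augmented)  # ['python', 'science', 'code', 'data', 'research']
```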
  


# Moving Forward -> Opportunities

### Index and Search

---

  - Word2Vec Skipgram Model - Build nGram2Vec?
    - The more data processed, the better it gets
    
  - Use Word2Vec prediction against search term expansion or document scoring?
  
  - Identify & Improve weight based on Ngram usage
    - eg. title, headings, tags
    - Easier with rich/tagged sources eg. HTML `<h1>Important Heading</h1>`
    
  - De-Duplication or Identification of Same Material, but Different route
    - Same, Same but Different
    - What source is better or preferred?


### Index and Search (cont...)

---
 
  - Crawler
    - When URLs or other sources are linked, extend indexing
    - *BackRub*
      - Original Google Algorithm **Link**
        - More references/links = higher score
        - Higher Score of Link Source = Higher Score for Target
  
  - Language detection and Multi-Lingual Support
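
The two link scoring rules listed under *BackRub* are the core of what became PageRank. The sketch below is a minimal power iteration version for illustration only; the function name, the damping value of 0.85 and the sample link graph are assumptions, and nothing like this exists in the project yet.

```python
def link_scores(links, damping=0.85, iterations=50):
    """links: page -> list of pages it links to. Returns page -> score."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    scores = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_scores = {page: (1.0 - damping) / len(pages) for page in pages}
        for page in pages:
            # Pages without outgoing links share their score with every page
            targets = links.get(page) or pages
            share = damping * scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical link graph: a and b both link to c, c links back to a
links = {"a": ["c"], "b": ["c"], "c": ["a"]}
scores = link_scores(links)
print(sorted(scores, key=scores.get, reverse=True))  # ['c', 'a', 'b']
```

Here `c` ranks first because it has the most incoming links, and `a` outranks `b` because its single link comes from the high-scoring `c`.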
  
  

### Technical

---

  - Data Models
  
  - Multi-thread (currently single threaded & Batch Driven)
  
  - Full RDBMS backend support needed with concurrent access
  
  - Identify Opportunities for 1-pass vs Batch processing
  
    - eg. Acquire, parse, and 1gram *source*
    
    - Batch Ngram & Word2Vec process?
  
  - Language - Python vs Go? Both?
  
  - Stack - separate & thread activities
  
    - eg UI, Acquisition, API, Analysis, Indexing, *Source* validation & updates etc
    
  - Multi-Lingual Support

# Thank you


# Questions?
