<a href="https://colab.research.google.com/github/adrianmoses/text-search-nlp/blob/main/TextSearchNLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install spacy



In [2]:
!python -m spacy download en_core_web_md

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
[K     |████████████████████████████████| 96.4 MB 1.2 MB/s 
Building wheels for collected packages: en-core-web-md
  Building wheel for en-core-web-md (setup.py) ... [?25l[?25hdone
  Created wheel for en-core-web-md: filename=en_core_web_md-2.2.5-py3-none-any.whl size=98051302 sha256=b242a9b2ed28e4d1ec469eadac4aca40e99bb7d3180204b98f18f47d6bfc2742
  Stored in directory: /tmp/pip-ephem-wheel-cache-n8z3lcyc/wheels/69/c5/b8/4f1c029d89238734311b3269762ab2ee325a42da2ce8edb997
Successfully built en-core-web-md
Installing collected packages: en-core-web-md
Successfully installed en-core-web-md-2.2.5
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_md')


In [3]:
import spacy
nlp = spacy.load("en_core_web_sm")

In [4]:
!ls

drive  sample_data


In [13]:
import json
def tokenize_cdc_data():
    with open('./drive/MyDrive/cdc_sample_data.json') as f:
         data = json.load(f)
    for item in data:
        doc = nlp(item['text'].lower())
        item['tokenized_text'] = [token.lemma_ 
                                  for token in doc 
                                  if not token.is_stop
                                  and not token.is_punct
                                  and token.dep_]

    with open('./cdc_tokenized_sample_data.json', 'w') as nf:
        json.dump(data, nf)
        

In [14]:
tokenize_cdc_data()

In [19]:
from itertools import chain
from collections import Counter


In [20]:
def build_vocabulary(documents):
    with open('./cdc_tokenized_sample_data.json') as f:
        data = json.load(f)
    all_tokens = list(chain(*[item['tokenized_text'] for item in documents]))
    token_counter = Counter(all_tokens)
    tc_dict = dict(token_counter)
    return tc_dict

In [21]:
def count_docs_with_token(token):
    doc_counter = 0
    for item in data:
        if token in item['tokenized_text']:
            doc_counter += 1
    return doc_counter

In [29]:
def compute_tfidfs(document):
    vocab = build_vocabulary(data)
    tf_idf = []
    for token, token_count in vocab.items():
        docs_with_token = count_docs_with_token(token)
        count_in_doc = Counter(document)[token]
        tf = count_in_doc / token_count
        idf = len(data) / docs_with_token 
        tf_idf.append(tf * idf)
    return tf_idf

In [25]:
for item in data:
    item['tf_idfs'] = compute_tfidfs(item['tokenized_text'])

In [27]:
with open('vocab.json', 'w') as vocab_file:
    vocab = build_vocabulary(data)
    json.dump(vocab, vocab_file)

In [28]:
with open('cdc_vectorized.json', 'w') as vec_file:
    json.dump(data, vec_file)

In [32]:
def tokenizer(input_string):
    doc = nlp(input_string.lower())
    tokens = [token.lemma_ 
                                  for token in doc 
                                  if not token.is_stop
                                  and not token.is_punct
                                  and token.dep_]
    return tokens

In [41]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def search_tfids(query, documents):
    tokens = tokenizer(query)
    tf_idfs = compute_tfidfs(tokens)
    doc_sim = [(doc, cosine_similarity(np.array([tf_idfs]), np.array([doc['tf_idfs']]))) for doc in documents]
    doc_sim.sort(key=lambda tup: tup[1])
    ranked_documents = [d[0] for d in doc_sim]
    return ranked_documents

In [42]:
search_tfids("care", data)

[{'text': 'A pandemic (from Greek πᾶν, pan, "all" and δῆμος, demos, "people") is an epidemic of an infectious disease that has spread across a large region, for instance multiple continents or worldwide, affecting a substantial number of people. A widespread endemic disease with a stable number of infected people is not a pandemic. Widespread endemic diseases with a stable number of infected people such as recurrences of seasonal influenza are generally excluded as they occur simultaneously in large regions of the globe rather than being spread worldwide.\nThroughout human history, there have been a number of pandemics of diseases such as smallpox and tuberculosis. The most fatal pandemic in recorded history was the Black Death (also known as The Plague), which killed an estimated 75–200 million people in the 14th century. The term was not used yet but was for later pandemics including the 1918 influenza pandemic (Spanish flu). Current pandemics include COVID-19 (SARS-CoV-2) and HIV/AI