# Overview
This project captures the end-to-end flow from receiving documents to querying them. That is, it covers parsing the documents, storing them, and clustering the contents of our database.

At a high level, documents in various formats arrive with some accompanying metadata (e.g. a related paper). They pass through our parsers and are formatted into JSON documents, which are uploaded to an Elasticsearch cluster. We then train an [LDA model](https://radimrehurek.com/gensim/models/ldamodel.html) to obtain latent topics from the accompanying metadata. At query time, given a list of descriptors, we try to find the topics relevant to those descriptors.

Prior to running this notebook, run the following in a terminal:
```
conda install elasticsearch
```

# Acquiring Datasets

We begin by obtaining the data sets and recording them in a [JSON list](https://docs.google.com/document/d/1gSiucl9H1AR-2aCdE4dvOPwHZxP6zpkFzpIRfJ0usDc/edit). `Filename` is the name of the downloaded file, `Data Source` is the URL from which we manually download the file, and `Relevant Articles` lists the articles that make up the corpus used to train our LDA model.

In [53]:
data_sets = [
  {
    "Filename": "U.S._Chronic_Disease_Indicators__CDI_.csv",
    "Data Source": "https://chronicdata.cdc.gov/views/g4ie-h725/rows.csv?accessType=DOWNLOAD",
    "Relevant Articles": [
      "https://www.cdc.gov/mmwr/pdf/rr/rr6401.pdf"
    ]
  },
  {
    "Filename": "diabetic_data.csv",
    "Data Source": "https://archive.ics.uci.edu/ml/machine-learning-databases/00296/",
    "Relevant Articles": [
      "https://www.hindawi.com/journals/bmri/2014/781670/",
      "https://archive.ics.uci.edu/ml/datasets/diabetes+130-us+hospitals+for+years+1999-2008"
    ]
  },
  {
    "Filename": "2016_Central_Line-Associated_Bloodstream_Infections__CLABSI__Table__Original_Baseline_.csv",
    "Data Source": "https://data.oregon.gov/api/views/757s-zskx/rows.csv?accessType=DOWNLOAD",
    "Relevant Articles": [
      "https://www.oregon.gov/oha/PH/DISEASESCONDITIONS/COMMUNICABLEDISEASE/HAI/Documents/Reports/2016_HAI_Annual_Report.pdf",
      "https://www.oregon.gov/oha/PH/DISEASESCONDITIONS/COMMUNICABLEDISEASE/HAI/Documents/Reports/2016_HAI_Annual_Report_Exec_Summary.pdf"
    ]
  },
  {
    "Filename": "crimelabaccidentaldrugdeathsextract2017.csv",
    "Data Source": "https://data.wprdc.org/dataset/7fb0505e-8e2c-4825-b22c-4fbee8fc8010/resource/2d963e35-4f69-495e-985e-55acd72c87ca/download/crimelabaccidentaldrugdeathsextract2017.csv",
    "Relevant Articles": [
      "https://www.alleghenycountyanalytics.us/wp-content/uploads/2017/04/Opiate-Related-Overdose-Deaths-in-Allegheny-County.pdf",
      "https://data.wprdc.org/dataset/7fb0505e-8e2c-4825-b22c-4fbee8fc8010/resource/a71e43e1-5a38-4fb3-b5f8-6ed7e51caade/download/me-data-dictionary.pdf",
      "https://www.overdosefreepa.pitt.edu/know-the-facts/view-overdose-death-data/"
    ]
  },
  {
    "Filename": "EMS_-_Transport_Count_by_Destination.csv",
    "Data Source": "https://data.austintexas.gov/api/views/jtkc-5pgh/rows.csv?accessType=DOWNLOAD",
    "Relevant Articles": [
      "https://data.austintexas.gov/Public-Safety/EMS-Transport-Count-by-Destination/jtkc-5pgh",
      "https://data.austintexas.gov/api/views/jtkc-5pgh/files/l9hg5sYLDEzQMylruCIknoVFpSq9kwMX2RvmlqN51g4?download=true&filename=EMS%20-%20Transport%20Count%20by%20Destination%20Metadata.pdf",
      "http://www.austintexas.gov/department/ems"
    ]
  }
]
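
Because the files are downloaded manually, it can help to confirm they are all in place before parsing. The following is a minimal sketch, assuming Python 3 and that the files live in a `datasets` directory (the path the upload cell reads from). The helper name `find_missing_files` and the example entries are hypothetical, not part of the notebook.

```python
import os

def find_missing_files(data_sets, directory="datasets"):
    """Return the "Filename" values that are not present in `directory`."""
    return [
        data["Filename"]
        for data in data_sets
        if not os.path.isfile(os.path.join(directory, data["Filename"]))
    ]

# Hypothetical entries, shaped like the items of data_sets above.
example_sets = [
    {"Filename": "diabetic_data.csv"},
    {"Filename": "EMS_-_Transport_Count_by_Destination.csv"},
]
print(find_missing_files(example_sets))
```

An empty list means every listed file was found. Anything else names the downloads that still need to be done.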

# Parsing and Uploading to Elasticsearch
The datasets above are all in CSV format for easy parsing. As the use case expands, we will need to handle other data formats.

### DO NOT RUN THE FOLLOWING CELL AGAIN. EACH UPLOADED DOCUMENT IS GIVEN A NEW UNIQUE ID, SO RE-RUNNING CREATES DUPLICATES INSTEAD OF OVERWRITING THE EXISTING DOCUMENTS.

In [54]:
import csv
import json
from elasticsearch import Elasticsearch, helpers

def convert_csv_to_json(csv_file):
    """Read a CSV file and return its rows as a list of dicts keyed by column name."""
    with open(csv_file) as f:
        reader = csv.DictReader(f)
        return list(reader)

def format_es_document(index, document):
    return {
        "_index": index.lower(), # Must be lowercase
        "_type": index.lower(),
        "_source": document
    }

es = Elasticsearch("https://search-data-pipeline-poc-bdfr3wal5lxncg2zllpoo6nd2e.us-east-1.es.amazonaws.com")
es_documents_for_datasets = []
for data in data_sets:
    raw_documents = convert_csv_to_json("datasets/" + data["Filename"])
    for raw_document in raw_documents:
        es_documents_for_datasets.append(format_es_document(data["Filename"], raw_document))

print("Uploading documents...")
# helpers.bulk returns a tuple: (number of successful actions, list of errors).
print(helpers.bulk(es, es_documents_for_datasets))


Uploading documents...
(2227, [])
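
One way to make the upload cell safe to re-run is to give every document a deterministic `_id`, so that a repeated bulk `index` action replaces the stored document instead of adding a copy. The sketch below is only an illustration under stated assumptions: the helper `format_es_document_with_id` is hypothetical, the ID is derived from the lowercased file name plus the row position, and `_type` is left out on the assumption that the target cluster does not require it.

```python
import hashlib

def format_es_document_with_id(index, row_number, document):
    """Build a bulk action whose _id depends only on the file name and row position."""
    index_name = index.lower()  # index names must be lowercase
    raw_id = "{}:{}".format(index_name, row_number)
    doc_id = hashlib.sha1(raw_id.encode("utf-8")).hexdigest()
    return {
        "_index": index_name,
        "_id": doc_id,
        "_source": document,
    }

# Made-up rows, standing in for the output of convert_csv_to_json.
example_rows = [{"a": "1"}, {"a": "2"}]
actions = [
    format_es_document_with_id("Example.csv", row_number, row)
    for row_number, row in enumerate(example_rows)
]
print(actions[0]["_id"] != actions[1]["_id"])
```

The trade-off is that the ID is tied to row order: if a CSV is re-exported with rows in a different order, the same ID would point at a different row.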


# Deriving Corpuses for LDA Model and Model Generation
This section derives latent topics from the articles specified above.

In [74]:
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer

import string
import re
import urllib
import gensim
from gensim import corpora

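def combine_relevant_articles(data):
    # Concatenate the raw response bodies of all linked articles.
    # No text extraction happens here: HTML markup and PDF bytes pass through as-is.
    joined_articles = ""
    for link in data["Relevant Articles"]:
        file = urllib.urlopen(link)
        document = file.read()
        joined_articles += document
    return joined_articles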
        
def derive_corpuses(data_sets):
    """Return one cleaned, tokenized document per data set."""
    documents = [combine_relevant_articles(data) for data in data_sets]
    return [clean(document).split() for document in documents]

def clean(doc):
    """Normalize a document: ASCII only, lowercase, no stop words or punctuation, lemmatized."""
    stop = set(stopwords.words('english'))
    exclude = set(string.punctuation)
    lemma = WordNetLemmatizer()
    # Replace non-ASCII characters with spaces.
    doc = ''.join([i if ord(i) < 128 else ' ' for i in doc])
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized
    
corpuses = derive_corpuses(data_sets)
dictionary = corpora.Dictionary(corpuses)
doc_term_matrix = [dictionary.doc2bow(corpus) for corpus in corpuses]
ldamodel = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=50, id2word=dictionary, passes=50)
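
To make the `doc2bow` step more concrete, here is a minimal pure-Python sketch of a bag-of-words conversion. It is not gensim's implementation: the helpers `build_vocabulary` and `to_bag_of_words` and the toy documents are made up for illustration, and the integer IDs gensim assigns may differ. The shape of the result matches what the cell above feeds to the model: one list of `(token_id, count)` pairs per document.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token_to_id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token_to_id:
                token_to_id[token] = len(token_to_id)
    return token_to_id

def to_bag_of_words(tokens, token_to_id):
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(token_to_id[t] for t in tokens if t in token_to_id)
    return sorted(counts.items())

# Toy documents standing in for the cleaned, tokenized articles.
docs = [
    ["opioid", "overdose", "death", "overdose"],
    ["diabetes", "hospital", "death"],
]
vocab = build_vocabulary(docs)
matrix = [to_bag_of_words(doc, vocab) for doc in docs]
print(matrix)
```

Word order is discarded at this point. The model only sees how often each vocabulary entry occurs in each document.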

# Testing Using Existing Data
Before continuing, let's look at some of the topics discovered:


In [73]:
print(ldamodel.print_topics(num_topics=3, num_words=10))

[(0, u'0.006*"div" + 0.002*"dataset" + 0.002*"asset" + 0.001*"data" + 0.001*"medical" + 0.001*"try" + 0.001*"please" + 0.001*"column" + 0.001*"owner" + 0.001*"requires"'), (1, u'0.000*"r" + 0.000*"p" + 0.000*"z" + 0.000*"v" + 0.000*"g" + 0.000*"x" + 0.000*"q" + 0.000*"k" + 0.000*"h" + 0.000*"u"'), (2, u'0.016*"q" + 0.015*"w" + 0.015*"u" + 0.015*"h" + 0.015*"z" + 0.015*"f" + 0.015*"g" + 0.015*"v" + 0.015*"r" + 0.015*"k"')]
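
The `div` token in topic 0 suggests that HTML markup is reaching the model, and the single-letter topics may come from non-text content such as PDF bytes. One possible mitigation for the HTML case is to extract only the visible text before cleaning. The sketch below assumes Python 3 and HTML input. The names `VisibleTextExtractor` and `html_to_text` are hypothetical, and PDFs would need a separate text extractor that is not shown here.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the collected chunks and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())

def html_to_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

sample = "<div class='x'><p>Opioid overdose data</p><script>var a = 1;</script></div>"
print(html_to_text(sample))
```

Applied to each HTML article before `clean`, a step like this would keep tag names and script contents out of the vocabulary.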


We first examine the topics generated for [the following problem](http://www.allerganclinicaltrials.com/pdfs/neuroscience/Approved/Combunox-OXY-MD-05-00.pdf), then validate them using the existing JSON attributes.

In [None]:
problem_url = "http://www.allerganclinicaltrials.com/pdfs/neuroscience/Approved/Combunox-OXY-MD-05-00.pdf"
problem_url_file = urllib.urlopen(problem_url)
problem = problem_url_file.read()

clean(problem)
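
Once the problem text is cleaned, a natural next step is to ask the trained model for its topic distribution and keep the strongest topics. The sketch below covers only the ranking step and is an assumption about how that step might look: `top_topics` is a hypothetical helper and the probabilities are made up. In the notebook, the `(topic_id, probability)` pairs would come from gensim, as shown in the comments.

```python
def top_topics(topic_distribution, k=3):
    """Return the k (topic_id, probability) pairs with the highest probability."""
    return sorted(topic_distribution, key=lambda pair: pair[1], reverse=True)[:k]

# In the notebook, the distribution would come from the trained model, e.g.:
#   bow = dictionary.doc2bow(clean(problem).split())
#   distribution = ldamodel.get_document_topics(bow)
# The values below are made up purely to illustrate the ranking step.
example_distribution = [(0, 0.05), (7, 0.60), (12, 0.25), (31, 0.10)]
print(top_topics(example_distribution, k=2))
```

The returned topic IDs can then be passed to `ldamodel.print_topic` to see which words characterize each topic.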
