<a href="https://colab.research.google.com/github/bptripp/ai-course/blob/main/document_vector.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with Unstructured Data
Fundamentally, the input to any machine learning system is a list of numbers (a "vector", to use the mathematical term). For example, recall when we used a decision tree to predict heart disease. To get a prediction for any patient, we had to give the decision tree a list of numbers including age, a numerical code for sex, numerical results of various tests, etc. To use a convolutional network to detect pneumonia, we had to provide a chest x-ray image. An image is also a list of numbers: one for each pixel in a greyscale image; three or four for each pixel in a colour image. 
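
To make the point about images concrete, here is a minimal sketch of how an image becomes a list of numbers. The 4-by-4 images are made-up placeholders filled with zeros (real x-rays are far larger), and the variable names are only illustrative.

```python
import numpy as np

# Hypothetical tiny images, used only for illustration.
grey_image = np.zeros((4, 4))       # one number per pixel
colour_image = np.zeros((4, 4, 3))  # three numbers (red, green, blue) per pixel

# Flattening lays the pixel values out as a single list of numbers.
grey_vector = grey_image.flatten()      # 4 * 4 = 16 numbers
colour_vector = colour_image.flatten()  # 4 * 4 * 3 = 48 numbers

print(len(grey_vector), len(colour_vector))
```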

What if we want a machine learning system to make inferences from something less structured, like a hospital discharge letter? We need a way to convert the discharge letter into a vector. There are many ways to do this and some work better than others. 

Let's first consider how to assign vectors to individual words. A simple way to do this is called "one-hot" encoding. This involves defining a vocabulary of perhaps 10,000 important words. The vector for each word is all zeros except for a single 1 at the word's index in the vocabulary. To illustrate, let's make up a very simple vocabulary with eight words: hypertension, diabetes, hyperlipidemia, fibrillation, infection, anxiety, reflux, and pain. With this vocabulary, the one-hot code for hypertension would be [1, 0, 0, 0, 0, 0, 0, 0], the one-hot code for diabetes would be [0, 1, 0, 0, 0, 0, 0, 0], and so on. 

Here is some code to produce one-hot vectors:

In [None]:
import numpy as np
vocabulary = ['hypertension', 'diabetes', 'hyperlipidemia', 'fibrillation', 'infection', 'anxiety', 'reflux', 'pain']

def make_word_vector(word):
  vector = np.zeros(len(vocabulary)) # start with a vector of all zeros 
  if word in vocabulary: 
    vector[vocabulary.index(word)] = 1 # change the appropriate entry to 1
  return vector  

print(make_word_vector('fibrillation'))
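
With a realistic vocabulary of thousands of words, `vocabulary.index(word)` has to scan the list on every call. One possible variation, sketched here under the assumption that the vocabulary is fixed in advance, looks up each word's position in a dictionary instead. The names `word_to_index` and `make_word_vector_fast` are hypothetical and not part of the code above. The second call also shows that a word outside the vocabulary produces a vector of all zeros.

```python
import numpy as np

vocabulary = ['hypertension', 'diabetes', 'hyperlipidemia', 'fibrillation',
              'infection', 'anxiety', 'reflux', 'pain']

# Hypothetical helper: map each word to its position in the vocabulary.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def make_word_vector_fast(word):
  vector = np.zeros(len(vocabulary)) # start with a vector of all zeros
  index = word_to_index.get(word) # None if the word is not in the vocabulary
  if index is not None:
    vector[index] = 1 # change the appropriate entry to 1
  return vector

print(make_word_vector_fast('fibrillation')) # a 1 in the fourth position
print(make_word_vector_fast('obesity')) # not in the vocabulary: all zeros
```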

Given vectors for each word, the simplest way to create a vector for a passage of text is to add up the vectors for the words in the text. For example, consider the discharge letter below. 

_This 79 year old woman was admitted with a complaint of recurrent chest pain. There is a background of ischaemic heart disease with previous myocardial infarction. Other history is of hypertension, cerebral vascular disease, type II diabetes mellitus and obesity. ECG showed sinus rhythm with old  infarction. There were no sequential changes and troponin was not raised.
I felt that her symptoms were consistent with musculoskeletal origin._

This note contains the following words from our small vocabulary: pain, hypertension, and diabetes. If we add up the vectors for each of these words, we get this vector: [1, 1, 0, 0, 0, 0, 0, 1]. This could serve as input to a decision tree or a neural network. Here is the code to create this vector. 


In [None]:
import re # Python's regular-expression module, used here for text processing

def make_note_vector(note):
  note = re.sub(r'[^\w\s]', '', note) # discard punctuation (keep only word characters and whitespace)
  words = note.lower().split() # make a list of words (all lower-case)
  
  note_vector = np.zeros(len(vocabulary)) # start with a vector of zeros
  for word in words: # loop through the words
    note_vector = note_vector + make_word_vector(word) # add word vector to total
  
  return note_vector

# To keep this example simple we paste the letter into a string.
# In practice, we would read it from an electronic health record, perhaps using 
# a software package that performs HL7 FHIR or SQL database queries. 
note = "This 79 year old woman was admitted with a complaint of recurrent chest pain. There is a background of ischaemic heart disease with previous myocardial infarction. Other history is of hypertension, cerebral vascular disease, type II diabetes mellitus and obesity. ECG showed sinus rhythm with old  infarction. There were no sequential changes and troponin was not raised. I felt that her symptoms were consistent with musculoskeletal origin."

note_vector = make_note_vector(note)
print(note_vector)
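
Because the word vectors are added together, a word that appears more than once contributes more than once, so the entries of the note vector are counts rather than yes/no flags. The self-contained sketch below illustrates this with a made-up note. The function name `count_vocabulary_words` and the example text are hypothetical, and the last step is just one possible way to reduce counts to presence or absence if that is all a model needs.

```python
import re
import numpy as np

vocabulary = ['hypertension', 'diabetes', 'hyperlipidemia', 'fibrillation',
              'infection', 'anxiety', 'reflux', 'pain']

def count_vocabulary_words(note):
  # discard punctuation, lower-case the text, and split it into words
  words = re.sub(r'[^\w\s]', '', note).lower().split()
  counts = np.zeros(len(vocabulary)) # start with a vector of zeros
  for word in words:
    if word in vocabulary:
      counts[vocabulary.index(word)] += 1 # count each mention
  return counts

# Made-up note in which "pain" appears twice and "infection" once.
example_note = "Chest pain on admission. Pain resolved; no infection."
counts = count_vocabulary_words(example_note)
presence = (counts > 0).astype(int) # 1 if the word appears at all, else 0

print(counts)
print(presence)
```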
