# ***Information Extraction***
-------------------------------

In [1]:
# Information is hidden in free text.
# Most traditional transactional information is structured.
# Most of the information generated today is unstructured.

In [2]:
# How do we convert this unstructured data into a structured form?
# i.e., how do we extract meaningful information from messy, unstructured data?

In [3]:
# To make the extracted information reusable, it needs to be in a structured form.

![WebMD](./webmd.png)

In [4]:
# Identify and extract fields of interest from the above free text.

In [5]:
# Title -> Erbitux helps treat lung cancer.
# We can drop the modifier "advanced" but not "lung", since treatments for other forms of cancer, such as blood cancer, may differ drastically from
# treatments for lung cancer.

# Author -> Charlene Laino
# Reviewer -> Louise Chang, MD
# Published date -> 23/Sep/2009
# Published place -> Berlin, Germany
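
As a minimal sketch, a few of the fields extracted above could be stored in a structured, reusable record. The `Article` class and its field names are illustrative assumptions, not something the article or these notes define.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass
class Article:
    """Hypothetical container for fields pulled out of the free text."""
    title: str
    author: str
    published_date: date
    published_place: str


record = Article(
    title="Erbitux helps treat lung cancer",
    author="Charlene Laino",
    published_date=date(2009, 9, 23),
    published_place="Berlin, Germany",
)

# default=str turns the date object into an ISO string when serializing
print(json.dumps(asdict(record), default=str, indent=2))
```

Once the fields live in a record like this, they can be filtered, joined, or exported like any other structured data.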

## ***Fields of interest***
-------------------

In [9]:
# Fields of interest are named entities.
# News -> people, places, dates ...
# Finance -> money, companies, stocks, investments, banks ...
# Medicine -> diseases, pathogens, treatments ...
# Protected information -> unique identifiers of participants in drug trials, addresses, telephone numbers, bank account numbers, email addresses.

In [7]:
# Named entities can be very diverse.
# Names of people and places are mostly in title case.
# Dates occur in specific formats.

# But names of other things, like diseases or micro-organisms, are not generally capitalized.
# e.g. bacteria, fungi, viruses, arthritis, cholesterol, diabetes
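
The point above can be illustrated with a rough capitalization heuristic. The sentence below is made up for illustration, and `capitalized_tokens` is a hypothetical helper. It picks up names of people and places but misses the uncapitalized disease names.

```python
def capitalized_tokens(sentence):
    """Return tokens that start with an uppercase letter.

    The first token is skipped because sentence-initial words are
    capitalized whether or not they are names.
    """
    tokens = [tok.strip(".,;:!?") for tok in sentence.split()]
    return [tok for i, tok in enumerate(tokens) if i > 0 and tok[:1].isupper()]


# Made-up example sentence
sentence = "Yesterday Charlene Laino reported from Berlin that diabetes and arthritis are common."
candidates = capitalized_tokens(sentence)
print(candidates)  # "diabetes" and "arthritis" are not picked up
```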

## ***Relations***
-----------------------

In [8]:
# This is about what happened to whom, when, where, why, etc.
# whom is typically a person / organization
# when is a time / date
# where is a place
# why is a reason

In [10]:
# The relationships between these named entities are what give the text its meaning.

## ***Named Entity Recognition Task***
---------------------------

In [11]:
# Named entities are noun phrases of a specific type that refer to specific individuals, places, organizations, etc.
# The task is to identify all mentions of a predefined set of named entities in a text corpus.

In [12]:
# Identify the occurrence of the named entity -> boundary detection.
# Identify the type -> tagging / classification. Consider the entity "Chicago":
        # Which Chicago is this? The city? A music album? A font?

In [18]:
"""\
The patient is a 63-old female with a three-years old history of bilateral hand numbness and occassional weakness. \
Within the past year these symptoms have progressively gotten worse, to encompass also her feet. \
She had a workup by her neurologist and an MRI revealed a C5-6 disc herniation with cord compression and a T2 signal change at that level.\
"""

'The patient is a 63-old female with a three-years old history of bilateral hand numbness and occassional weakness. Within the past year these symptoms have progressively gotten worse, to encompass also her feet. She had a workup by her neurologist and an MRI revealed a C5-6 disc herniation with cord compression and a T2 signal change at that level.'

In [19]:
# The condition -> bilateral hand numbness, occassional weakness, C5-6 disc herniation with cord compression
# Personal info -> 63-old, female, three-years old history of bilateral hand numbness and occassional weakness
# Medical speciality -> neurologist, MRI
# Body parts -> feet, hand

In [None]:
# These named entities do not have to be separate from one another.
# A larger phrase might serve as a named entity, while a smaller phrase / token within that phrase might function as a separate named entity.

# e.g. "cord compression" is a medical ailment while "cord" is a body part, i.e. the spinal cord.
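
One common way to represent nested entities is as character-offset spans, where one span may sit entirely inside another. The sketch below assumes hypothetical label names (`CONDITION`, `BODY_PART`) and helper functions. It illustrates the representation and is not a trained recognizer.

```python
text = "an MRI revealed a C5-6 disc herniation with cord compression"


def find_span(text, phrase, label):
    """Locate the first occurrence of phrase and return (start, end, label)."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    return (start, start + len(phrase), label)


def is_nested(inner, outer):
    """True if the inner span lies completely within the outer span."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


condition = find_span(text, "cord compression", "CONDITION")
body_part = find_span(text, "cord", "BODY_PART")

print(condition, body_part, is_nested(body_part, condition))
```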

## ***Approaches for Identifying Named Entities***
--------------------

In [20]:
# The approach depends on the kind of entities that need to be identified.
# For strictly formatted entities like dates / times / phone numbers, regular expressions may help.
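
A minimal sketch of the regular-expression approach follows. The patterns assume dates written like `23/Sep/2009` and phone numbers written like `555-010-4477`; real data would need broader patterns. The sample text is made up.

```python
import re

# Assumed formats: DD/Mon/YYYY dates and NNN-NNN-NNNN phone numbers
DATE_PATTERN = re.compile(r"\b\d{1,2}/[A-Za-z]{3}/\d{4}\b")
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

sample = "Published 23/Sep/2009. Call 555-010-4477 for details."

dates = DATE_PATTERN.findall(sample)
phones = PHONE_PATTERN.findall(sample)
print(dates, phones)
```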

In [21]:
# For other fields we could use machine learning.
# In fact, even for dates / times / phone numbers, we could still use machine learning.
# e.g. phone numbers & fax numbers look very similar.
# The only things that can be leveraged to differentiate them are the tokens preceding them, like "tp no", "phone", "hotline", "fax", etc.
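
The role of the preceding tokens can be sketched with a simple rule-based stand-in. This is not a machine learning model. It hard-codes the kind of context cue that a learned model would pick up as a feature. The cue table, number format, and sample text are all assumptions.

```python
import re

NUMBER = re.compile(r"\d{3}-\d{3}-\d{4}")  # assumed number format
# Hypothetical cue words mapped to the label they suggest
CUES = {"fax": "FAX", "phone": "PHONE", "hotline": "PHONE", "tp": "PHONE"}


def classify_numbers(text, window=3):
    """Label each number using the nearest cue word among the preceding tokens."""
    results = []
    for match in NUMBER.finditer(text):
        preceding = re.findall(r"[a-z]+", text[:match.start()].lower())[-window:]
        label = "UNKNOWN"
        for token in reversed(preceding):  # nearest token first
            if token in CUES:
                label = CUES[token]
                break
        results.append((match.group(), label))
    return results


labeled = classify_numbers("Phone: 555-010-4477, fax: 555-010-4478")
print(labeled)
```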

In [22]:
# The standard NER (named entity recognition) task involves a four-class model:

# Person
# Organization
# Location / Geopolitical entities
# Outside (anything else)

In [23]:
"John met Jennifer at the library"

'John met Jennifer at the library'

In [24]:
# John & Jennifer are named entities -> Person
# met is a verb -> Outside
# at, the -> Outside
# library -> Location
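
The token-level labeling above can be written out as (token, tag) pairs. The sketch below uses a toy lookup table in place of a trained model. The `GAZETTEER` dictionary and the tag names are assumptions for illustration only.

```python
# Toy lookup table standing in for a trained four-class model
GAZETTEER = {
    "John": "PERSON",
    "Jennifer": "PERSON",
    "library": "LOCATION",
}


def tag_tokens(sentence):
    """Tag each whitespace-separated token; unknown tokens get 'O' (Outside)."""
    return [(token, GAZETTEER.get(token, "O")) for token in sentence.split()]


tagged = tag_tokens("John met Jennifer at the library")
for token, tag in tagged:
    print(f"{token:10s} {tag}")
```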

## ***Relation Extraction***
-------------------

In [25]:
# Identifying relationships between named entities

# Consider "Erbitux helps treat advanced lung cancer".
# This sentence has a relationship between 2 named entities:

# Erbitux -> a drug
# Lung cancer -> a disease

# Erbitux and lung cancer are related by a treatment relationship,
# i.e. Erbitux is a treatment for lung cancer,
# or, equivalently, lung cancer can be treated with Erbitux.
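
A relationship like this is often stored as a (subject, relation, object) triple. The sketch below extracts one with a hand-written pattern. The pattern, the relation name `treats`, and the choice to drop the modifier "advanced" are assumptions tailored to this one sentence, not a general extractor.

```python
import re

# Hand-written pattern for "<Drug> helps treat / treats [advanced] <disease>"
TREATMENT_PATTERN = re.compile(
    r"(?P<drug>[A-Z]\w+) (?:helps treat|treats) (?:advanced )?(?P<disease>[\w ]+)"
)

sentence = "Erbitux helps treat advanced lung cancer"
match = TREATMENT_PATTERN.search(sentence)

triple = None
if match:
    triple = (match.group("drug"), "treats", match.group("disease"))
print(triple)
```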

## ***Co-reference Resolution***
------------------

In [26]:
# Disambiguate mentions in text and group together the mentions that refer to the same entity.

In [27]:
# Consider,
# Anita met Nathan at the market. He surprised her with a rose.

In [None]:
# Two named entities -> Anita & Nathan
# But in the second sentence, the same people are referred to by pronouns (he, her).
# In this case, these pronouns need to be resolved.
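
A naive way to resolve such pronouns is to link each one to the most recently mentioned name of matching gender. The sketch below assumes a hypothetical name-to-gender lexicon. Real co-reference resolvers use far richer evidence than this.

```python
import re

# Hypothetical lexicons for this example only
NAME_GENDER = {"Anita": "female", "Nathan": "male"}
PRONOUN_GENDER = {"he": "male", "him": "male", "she": "female", "her": "female"}


def resolve_pronouns(text):
    """Replace each known pronoun with the latest name of the same gender."""
    last_seen = {}  # gender -> most recently mentioned name
    resolved = []
    for token in re.findall(r"[A-Za-z]+", text):
        if token in NAME_GENDER:
            last_seen[NAME_GENDER[token]] = token
            resolved.append(token)
        elif token.lower() in PRONOUN_GENDER:
            # Fall back to the pronoun itself if no antecedent was seen
            resolved.append(last_seen.get(PRONOUN_GENDER[token.lower()], token))
        else:
            resolved.append(token)
    return resolved


resolved = resolve_pronouns("Anita met Nathan at the market. He surprised her with a rose.")
print(" ".join(resolved))
```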

## ***Q/A Task***
-----------------

In [28]:
# Given a question, find the most appropriate answer from the text.
# What does Erbitux treat?

# We first have to identify that Erbitux is a drug.
# Then, leveraging its treatment relationship to a malady ________________, find the malady!

# Who gave Anita a rose at the market?
# We do know "he" gave Anita a rose (from the second sentence).
# Pronoun resolution, with the help of the first sentence, will show that "he" is Nathan.
# A relationship mapping between Nathan & the market will reveal that it was at the market that Nathan gave a rose to Anita.
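
Once entities, relations, and pronouns are resolved, answering both questions reduces to a lookup over stored facts. The sketch below assumes the facts have already been extracted into triples. The relation names and the `query` helper are hypothetical.

```python
# Facts assumed to come from the earlier extraction and resolution steps
FACTS = [
    ("Erbitux", "treats", "lung cancer"),
    ("Nathan", "gave_rose_to", "Anita"),
]


def query(subject=None, relation=None, obj=None):
    """Return all triples matching the given fields; None acts as a wildcard."""
    wanted = (subject, relation, obj)
    return [
        fact for fact in FACTS
        if all(w is None or w == value for w, value in zip(wanted, fact))
    ]


# What does Erbitux treat?
malady = query(subject="Erbitux", relation="treats")[0][2]
# Who gave Anita a rose?
giver = query(relation="gave_rose_to", obj="Anita")[0][0]
print(malady, giver)
```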