# Information Extraction (IE)


## IE Applications

1. Tagging news and other content
![1.PNG](attachment:1.PNG)

2. Chatbots: 
A chatbot needs to understand the user’s question in order to generate/retrieve a
correct response. For example, consider the question, “What are the best cafes
around the Eiffel Tower?” The chatbot needs to understand that “Eiffel Tower”
and “cafe” are locations, then identify cafes within a certain distance of the Eiffel
Tower. 

3. Applications in social media:


4. Extracting data from forms and receipts

## IE Tasks

![2.PNG](attachment:2.PNG)

- Keyword or keyphrase extraction (KPE): Identifying that the article is about “buyback” or “stock price” 
- Named entity recognition (NER): Identifying Apple as an organization and Luca Maestri as a person
- Named Entity Disambiguation (NED) and Named Entity Linking (NEL): Recognizing that Apple is not a fruit, but a company, and that it refers to Apple, Inc. and not some other company with the word “apple” in its name.
- Relation extraction: Extracting the information that Luca Maestri is the finance chief of Apple.
- Event extraction: Identifying that this article is about a single event (let’s call it “Apple buys back stocks”) and being able to link it to other articles talking about the same event over time
- Temporal information extraction: Extracting information about times and dates, which is useful for developing calendar applications and interactive personal assistants.
- Template filling: Automatically generating text such as weather reports or flight announcements, which follow a standard template with slots that are filled in from extracted data.
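To make the template-filling task concrete, here is a minimal sketch using Python's standard `string.Template`. The slot names and values are hypothetical; in a real system they would come from upstream IE steps such as NER and temporal information extraction.

```python
from string import Template

# Hypothetical slot values, standing in for the output of upstream IE steps
# (e.g., NER for the city, temporal extraction for the date).
extracted_slots = {
    "city": "Paris",
    "date": "Tuesday",
    "condition": "light rain",
    "high_c": 18,
}

# A fixed report template with named slots.
report_template = Template(
    "Weather for $city on $date: $condition, with a high of ${high_c} degrees Celsius."
)


def fill_template(template, slots):
    """Fill every slot; raises KeyError if a required slot was not extracted."""
    return template.substitute(slots)


report = fill_template(report_template, extracted_slots)
print(report)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing slot fails loudly instead of producing a report with an unfilled placeholder.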

## Overview of Information Extraction (IE) Tasks and Approaches:

- In industry, IE is generally implemented as a hybrid system that incorporates rule-based and learning-based approaches.
- IE is an active area of research, and not all tasks are considered "solved" or mature enough to have standard approaches for real-world applications.
- Named Entity Recognition (NER) and Keyphrase Extraction (KPE) are more widely studied than other IE tasks and have some tried-and-tested solutions.
- Other IE tasks are more challenging, and it is common to rely on pay-as-you-use services from large providers like Microsoft, Google, and IBM.

## The General Pipeline for IE

![3.PNG](attachment:3.PNG)

## Keyphrase Extraction Task 

Keyphrase extraction (KPE) can be solved using supervised or unsupervised learning methods. 

- Supervised approaches require labeled datasets, which are time- and cost-intensive to create, while unsupervised approaches are more domain-agnostic and therefore more popular in real-world applications. Recent research also suggests that state-of-the-art deep learning methods for KPE do not necessarily perform better than unsupervised approaches.

- The most popular unsupervised KPE algorithms are graph-based: they represent the words and phrases in a text as nodes in a weighted graph, and score each node by how well connected it is to the rest of the graph. The top-N highest-scoring nodes are then returned as keyphrases. Graph-based KPE approaches differ in how they select candidate words/phrases from the text and in how they score these candidates in the graph.
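The graph-based idea can be sketched in plain Python. This is a simplified, TextRank-style illustration and not the implementation of any particular library: the stopword list, the window size, and the function names are all assumptions made for the example. Candidate words become nodes, co-occurrence within a small window adds edge weight, and a weighted PageRank (computed by power iteration) scores the nodes.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list (an assumption; real systems use a fuller list).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "it", "to", "and", "on", "for", "its", "by"}


def build_cooccurrence_graph(text, window=2):
    """Link candidate words that occur within `window` positions of each other.

    Positions are counted after stopwords are removed, which is a simplification.
    """
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return {node: dict(edges) for node, edges in graph.items()}


def rank_nodes(graph, damping=0.85, iterations=50):
    """Score nodes with a weighted PageRank computed by power iteration."""
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node, edges in graph.items():
            incoming = 0.0
            for neighbor, weight in edges.items():
                # Each neighbor shares its score in proportion to edge weight.
                incoming += weight / sum(graph[neighbor].values()) * scores[neighbor]
            new_scores[node] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores


def top_keywords(text, n=3):
    """Return the n highest-scoring words."""
    scores = rank_nodes(build_cooccurrence_graph(text))
    return sorted(scores, key=scores.get, reverse=True)[:n]


sample = "stock buyback stock price stock market"
print(top_keywords(sample, n=2))
```

Full TextRank implementations typically go further, for example by restricting candidates by part of speech and merging adjacent top-ranked words into multi-word phrases.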

### Applying RAKE (Rapid Automatic Keyword Extraction) for KPE:

RAKE is a domain-independent keyword extraction algorithm that identifies key phrases in a body of text by analyzing how frequently each word appears and how often it co-occurs with other words in the text.
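To show how frequency and co-occurrence combine into a score, here is a minimal plain-Python sketch of RAKE-style scoring using the degree-to-frequency ratio. It is an illustration under simplifying assumptions, not the `rake_nltk` implementation: the stopword list is a tiny hand-picked one, and the function names are made up for this example.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list (an assumption; rake_nltk uses NLTK's list).
STOPWORDS = {"the", "is", "a", "by", "it", "s", "one", "of", "most", "in"}


def candidate_phrases(text):
    """Split text into runs of non-stopwords, breaking at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r"[^\w\s]", text.lower()):  # break at punctuation
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(text):
    """Score each candidate phrase as the sum of its words' degree/frequency ratios."""
    phrases = candidate_phrases(text)
    frequency = defaultdict(int)  # how often the word occurs
    degree = defaultdict(int)     # co-occurrences within phrases, counting the word itself
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    word_score = {w: degree[w] / frequency[w] for w in frequency}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


sample_text = (
    "The Mona Lisa is a 16th century oil painting created by Leonardo da Vinci. "
    "It's one of the most famous paintings in the world."
)
ranked = rake_scores(sample_text)
for phrase, score in ranked:
    print(f"{phrase} ({score})")
```

Because every word in a long phrase co-occurs with every other word in it, long candidate phrases tend to receive high scores, which is worth keeping in mind when reading RAKE output.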

In [1]:
from rake_nltk import Rake

# Sample text to extract keyphrases from
text = """The Mona Lisa is a 16th century oil painting created by Leonardo da Vinci. 
It's one of the most famous paintings in the world."""

# Initialize the Rake object
r = Rake()

# Extract keyphrases from the text
r.extract_keywords_from_text(text)

# Get the top 4 keyphrases and their scores
keyphrases = r.get_ranked_phrases_with_scores()[:4]

# Print the keyphrases and their scores
for score, phrase in keyphrases:
    print(f"{phrase} ({score})")


16th century oil painting created (25.0)
leonardo da vinci (9.0)
mona lisa (4.0)
famous paintings (4.0)


### Classwork: Apply RAKE to a text file

In [None]:
from rake_nltk import Rake

# Load text from a .txt file
file_path = 'sample.txt'
with open(file_path, 'r', encoding='utf-8') as file:
    text = file.read()

# Initialize RAKE
r = Rake()  # You can also pass custom stopwords if needed

# Extract keywords
r.extract_keywords_from_text(text)

# Get top ranked phrases with scores
keyphrases = r.get_ranked_phrases_with_scores()

# Display results
print("Top Keyphrases:")
for score, phrase in keyphrases[:10]:  # Top 10
    print(f"{phrase} ({score})")


### Applying TextRank for KPE

In [2]:
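# gensim >= 4.0 no longer ships gensim.summarization (its TextRank-based
# keywords() function), so check which version is installed.
import gensim
print(gensim.__version__)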

4.3.3


### Classwork: Apply TextRank to a text file

In [None]:
import spacy
import pytextrank

# Load spaCy model and add TextRank
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

# Load text from file
with open('example.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Process the text
doc = nlp(text)

# Print top-ranked phrases
print("Top Keyphrases (TextRank):")
for phrase in doc._.phrases[:10]:  # top 10
    print(f"{phrase.text} ({phrase.rank:.4f})")


### Practical Advice on KPE

- Extracting candidate n-grams and building the graph from them becomes more expensive as documents get longer, which can be an issue in production. One way to deal with this is to skip the full text and use only the first M% and the last N% of it, since the introductory and concluding parts usually cover the main points of the text.


- Since each keyphrase is ranked independently, we sometimes end up with overlapping keyphrases (e.g., “buy back stock” and “buy back”). One solution is to compute a similarity measure (e.g., cosine similarity) between the top-ranked keyphrases and keep the ones that are most dissimilar to one another.


- Improper text extraction can affect the rest of the KPE process, especially when dealing with formats such as PDF or scanned images. This is primarily because KPE is sensitive to sentence structure in the document. Hence, it’s always a good idea to add some post-processing to the extracted keyphrase list to create a final, meaningful list without noise.



## Named Entity Recognition (NER) task:
 

In [5]:
# import the spacy library and load the pre-trained language model
import spacy
nlp = spacy.load("en_core_web_lg")

# define the text to be analyzed
text_from_fig = "On Tuesday, Apple announced its plans for another major chunk of the money: It will buy back a further $75 billion in stock."

# process the text with the loaded language model
doc = nlp(text_from_fig)

# iterate over the entities in the analyzed text and print their text and label
for ent in doc.ents:
    print(ent.text, "\t", ent.label_)


Tuesday 	 DATE
Apple 	 ORG
$75 billion 	 MONEY
