## Extract type facts from a Wikipedia file


### === Purpose ===

The goal of this lab is to write an algorithm that extracts the type of an entity from its Wikipedia article.  
The algorithm's input contains both the title and the content of the article. The title names the entity whose type we want to extract.

For example, we want to extract the type of the entity Leicester from its corresponding Wikipedia article. The input is:

    title: Leicester

    content: Leicester is a small city in England
    
and the goal is to return:

    Leicester TAB city


### === Provided Data ===

We provide:

1. a preprocessed version of the Simple Wikipedia (`wikipedia-first.txt`), in the title/content format shown above
2. a template for your code, `type_extraction.py`
3. a gold standard sample (`gold-standard-sample.tsv`).

### === Task ===

Complete the `extract_type()` function so that it extracts the type of the article entity from the content.
For example, for the content "Leicester is a beautiful English city in the UK", it should return "city".
Exclude terms that are too abstract ("member of...", "way of..."), and try to extract exactly the noun.
You can also skip articles (e.g. `return None`) if you are not sure or if the text does not contain any type.
In order to ensure a fair evaluation, do not use any non-standard Python libraries except `nltk` (`pip install nltk`).

Input:

April  
April is the fourth month of the year with 30 days.

Output:
April TAB month
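
One way to approach the task is sketched below using only the standard `re` module. This is an illustrative heuristic, not the reference solution: the function name `guess_type`, the `ABSTRACT_HEADS` set, and the `BOUNDARY_WORDS` set are all assumptions introduced here, and the word lists would need tuning against the gold sample.

```python
import re

# Hypothetical word lists (assumptions for illustration, not part of the lab template).
ABSTRACT_HEADS = {"kind", "type", "sort", "form", "member", "way", "part"}
BOUNDARY_WORDS = {"of", "in", "on", "at", "to", "for", "from", "with", "by",
                  "that", "which", "who", "where", "located", "used", "written"}
ARTICLES = {"a", "an", "the"}

# A copula followed by an article, then everything up to the next punctuation mark.
COPULA = re.compile(r"\b(?:is|was|are|were)\s+(?:a|an|the)\s+([^.,;()]+)", re.IGNORECASE)


def guess_type(content):
    """Return the head noun of the phrase after 'is a', or None when unsure."""
    match = COPULA.search(content)
    if match is None:
        return None
    words = match.group(1).split()
    phrase = []
    i = 0
    while i < len(words):
        word = words[i].lower()
        # "kind of fruit": drop the abstract head and restart after "of".
        if word == "of" and phrase and phrase[-1] in ABSTRACT_HEADS:
            phrase = []
            i += 1
            if i < len(words) and words[i].lower() in ARTICLES:
                i += 1
            continue
        # The noun phrase ends at the first preposition or relative word.
        if word in BOUNDARY_WORDS:
            break
        phrase.append(word)
        i += 1
    if not phrase:
        return None
    head = phrase[-1]
    # Skip the article rather than return an abstract or non-alphabetic term.
    if head in ABSTRACT_HEADS or not head.isalpha():
        return None
    return head


print(guess_type("Leicester is a beautiful English city in the UK"))
print(guess_type("April is the fourth month of the year with 30 days."))
print(guess_type("An apple is a kind of fruit."))
```

Returning `None` whenever the head noun looks abstract trades recall for precision, which matches the precision-weighted score used for evaluation.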


### === Development and Testing ===

We provide a number of gold samples for validating your model.
The final score is an F-beta score, computed with the following equation:

`F_beta = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)`

with `beta = 0.5`, which weights precision more heavily than recall.
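
To see the effect of `beta = 0.5`, the formula can be evaluated on two mirrored cases. The helper name `f_beta` and the sample precision/recall values are assumptions chosen for illustration only.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score as defined above; returns 0.0 when the denominator is zero."""
    denominator = beta * beta * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta * beta) * precision * recall / denominator


# Same two numbers, swapped between precision and recall.
precise_but_sparse = f_beta(precision=0.8, recall=0.4)
broad_but_noisy = f_beta(precision=0.4, recall=0.8)

print(round(precise_but_sparse, 3))  # about 0.667
print(round(broad_but_noisy, 3))     # about 0.444
```

The high-precision system scores higher even though both systems have the same pair of values, so skipping uncertain articles can pay off.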


### === Submission ===

1. Collect your code, any resources needed to run it, and the output of your code on the test dataset (no need to include the other datasets).
2. ZIP these files into a file called `firstName_lastName.zip`.
3. Submit it at the link below before the deadline announced during the lab:

https://ln5.sync.com/dl/04d8f4540/ezmyuduy-rss32ktt-kx923xbr-ncmnfatt

### === Contact ===

If you have any additional questions, you can send an email to: zacchary.sadeddine@telecom-paris.fr


In [1]:
"""
Don't modify this code.
"""

import sys


class Page:
    """
    This class is used to store title and content of a wiki page
    """
    __author__ = "Jonathan Lajus"

    def __init__(self, title, content):
        self.content = content
        self.title = title
        if sys.version_info[0] < 3:
            self.title = title.decode("utf-8")
            self.content = content.decode("utf-8")

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.title == other.title and self.content == other.content

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.title, self.content))

    def __str__(self):
        return 'Wikipedia page: "' + (self.title.encode("utf-8") if sys.version_info[0] < 3 else self.title) + '"'

    def __repr__(self):
        return self.__str__()

    def _to_tuple(self):
        return self.title, self.content


class Parsy:
    """
    Parse a Wikipedia file, return page objects
    """
    __author__ = "Jonathan Lajus"

    def __init__(self, wikipediaFile):
        self.file = wikipediaFile

    def __iter__(self):
        title, content = None, ""
        with open(self.file, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line and title is not None:
                    yield Page(title, content.rstrip())
                    title, content = None, ""
                elif title is None:
                    title = line
                elif title is not None:
                    content += line + " "


def eval_f1(gold_file, pred_file):

    # Dictionaries
    goldstandard = dict()
    student = dict()

    # Reading first file
    with open(gold_file, 'r', encoding="utf-8") as f:
        for line in f:
            temp = line.split("\t")
            if len(temp) != 2:
                print("The line:", line, "has an incorrect number of tabs")
            else:
                if temp[0] in goldstandard:
                    print(temp[0], " has two solutions")
                goldstandard[temp[0]] = str.lower(temp[1])

    # Reading second file
    with open(pred_file, 'r', encoding="utf-8") as f:
        for line in f:
            temp = line.split("\t")
            if len(temp) != 2:
                print("The line: '", line, "' has an incorrect number of tabs")
            else:
                if temp[0] in student:
                    print(temp[0], " has two solutions")
                student[temp[0]] = str.lower(temp[1])

    true_pos = 0
    false_pos = 0
    false_neg = 0

    for key in student:
        if key in goldstandard:
            if student[key] == goldstandard[key]:
                true_pos += 1
            else:
                false_pos += 1
                print("You got", key, "wrong. Expected output: ", goldstandard[key], ",given:", student[key])

    for key in goldstandard:
        if key not in student:
            false_neg += 1
            print("No solution was given for", key)

    if true_pos + false_pos != 0:
        precision = float(true_pos) / (true_pos + false_pos) * 100.0
    else:
        precision = 0.0

    if true_pos + false_neg != 0:
        recall = float(true_pos) / (true_pos + false_neg + false_pos) * 100.0
    else:
        recall = 0.0

    beta = 0.5

    if precision + recall != 0.0:
        f05 = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)
    else:
        f05 = 0.0

    # grade = 0.75 * precision + 0.25 * recall
    grade = f05

    print("Comment :=>>", "Precision:", precision, "%")
    print("Comment :=>>", "Recall:", recall, "%")
    print("Simulated Grade (F0.5) :=>>", grade, "%")


In [2]:
# a simplified wiki page document
wiki_file = 'wikipedia-first.txt'
# some gold samples for validation
gold_file = 'gold-standard-sample.tsv'
# predicted results generated by your model
# you are supposed to submit this file
result_file = 'results.tsv'

In [3]:
import nltk

In [4]:
sentence='April is the fourth month of the year with 30 days.'
tokens=nltk.word_tokenize(sentence)

LookupError: 
**********************************************************************
  Resource [93mpunkt[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt/english.pickle[0m

  Searched in:
    - 'C:\\Users\\INTERNET_DIGITAL/nltk_data'
    - 'c:\\Users\\INTERNET_DIGITAL\\mambaforge\\envs\\CESDS\\nltk_data'
    - 'c:\\Users\\INTERNET_DIGITAL\\mambaforge\\envs\\CESDS\\share\\nltk_data'
    - 'c:\\Users\\INTERNET_DIGITAL\\mambaforge\\envs\\CESDS\\lib\\nltk_data'
    - 'C:\\Users\\INTERNET_DIGITAL\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************


In [5]:
import re

In [37]:
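def extract_type(wiki_page):
    """
    Extract the type of the entity described by a wiki page.

    :param wiki_page: a Page object holding a title and the first sentence of its wiki page
    :return: the extracted type as a string, or None if no type was found
    """
    title = wiki_page.title
    content = wiki_page.content

    # Baseline: take the first lowercase word that follows "is a"
    pattern = re.compile(r'is a ([a-z]+)')
    match = pattern.search(content)
    if match is not None:
        print(content)  # debug output: show every sentence that matched
        return match.group(1)

    return None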
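
The baseline pattern above returns the first lowercase word after "is a", which is often an adjective or an abstract term rather than the type. The short check below applies the same pattern to a few sentences taken from the run output. The helper name `first_word_after_is_a` is an assumption introduced for this illustration.

```python
import re

# Same pattern as the baseline extract_type() above.
baseline = re.compile(r'is a ([a-z]+)')

samples = [
    "Angola is a country in Africa.",
    "An apple is a kind of fruit.",
    "An abbreviation is a shorter way to write a word or phrase.",
    "Basic English is a made-up language written by Charles Kay Ogden.",
]


def first_word_after_is_a(sentence):
    """Return the word captured by the baseline pattern, or None."""
    match = baseline.search(sentence)
    return match.group(1) if match else None


for sentence in samples:
    print(first_word_after_is_a(sentence), "<-", sentence)
```

Only the first sentence yields a plausible type ("country"); the others give "kind", "shorter", and "made", which is why the task asks for the noun itself and for abstract terms to be excluded.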

In [38]:
def run():
    '''
    First, extract types from each sentence in the wiki file
    Next, use gold samples to evaluate your model
    :return:
    '''
    with open(result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wiki_file):
            typ = extract_type(page)
            if typ:
                output.write(page.title + "\t" + typ + "\n")

    # Evaluate on some gold samples for checking your model
    eval_f1(gold_file, result_file)


run()

"Adobe Illustrator" is a computer program for making graphic design and illustrations.
Andouille is a sort of pork sausage.
Australia is a continent in the Southern Hemisphere between the Pacific Ocean and the Indian Ocean.
An abbreviation is a shorter way to write a word or phrase.
In many religions, an angel is a good spirit.
An apple is a kind of fruit.
Algebra is a part of mathematics  that helps show the general links between numbers and math operations  used on the numbers.
Afghanistan  is a country located in South Asia.
Angola is a country in Africa.
Argentina or the República Argentina, is a country in south South America.
Austria , officially "Republic of Austria", is a country in Central Europe.
Armenia is a country in the Caucasus region of Europe.
Acceleration is a measure of how fast velocity changes.
Basic English is a made-up language written by Charles Kay Ogden.
A boot is a type of footwear that protects the foot and ankle.
Breakfast sausage is a type of fresh pork sa