Adressa Dataset: Gulla, J. A., Zhang, L., Liu, P., Özgöbek, Ö., & Su, X. (2017, August). The Adressa dataset for news recommendation. In Proceedings of the International Conference on Web Intelligence (pp. 1042-1048). ACM.

In [7]:
from tqdm import tqdm
import os
import json
from utils.adressa_util import preprocessing

In [8]:
one_week_path = './adressa/one_week'
content_news = './adressa/content_refine'
out_path = './adressa/mind_format'

In [9]:
hash_title, hash2id = preprocessing.news_title(one_week_path)

1513739it [00:28, 53654.44it/s]


In [10]:
from pathlib import Path

out_path = Path(out_path)
preprocessing.write_news_files(hash_title, hash2id, out_path)

100%|██████████| 4641/4641 [00:00<00:00, 781569.30it/s]


In [11]:
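# Reference line from MIND news.tsv, kept as a template for the target format.
# Tab-separated columns: news id, category, subcategory, title, abstract, URL,
# title entities (JSON), abstract entities (JSON).
news_line = """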
N55528	lifestyle	lifestyleroyals	The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By	Shop the notebooks, jackets, and more that the royals can't live without.	https://assets.msn.com/labs/mind/AAGH0ET.html	[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]	[]
"""

In [12]:
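# Map each article hash to a flat {field: value} dict built from its content file.
article_hash_content = {}
for filename in os.listdir(content_news):
    # The articles are Norwegian, so read them explicitly as UTF-8.
    with open(os.path.join(content_news, filename), "r", encoding="utf-8") as f:
        article = json.load(f)
        article_hash = article['id']
        # 'fields' is a list of {'field': name, 'value': value} records.
        article_dict = {d['field'] : d['value'] for d in article['fields']}
        article_hash_content[article_hash] = article_dict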
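The flattening step can be tried on a toy record without the dataset on disk. This is a minimal sketch: the article below is made up, `flatten_article` is a hypothetical helper name, and joining the `body` paragraphs with a space is an assumption about how the text might be consumed later, not something the preprocessing module is known to do.

```python
def flatten_article(article):
    """Turn the list of {'field', 'value'} records into one flat dict."""
    return {d["field"]: d["value"] for d in article["fields"]}

# Made-up article mimicking the structure of a content file.
toy_article = {
    "id": "example-hash",
    "fields": [
        {"field": "title", "value": "Eksempel"},
        {"field": "body", "value": ["Første avsnitt.", "Andre avsnitt."]},
    ],
}

flat = flatten_article(toy_article)
# 'body' is a list of paragraphs; join it when a single string is needed.
body_text = " ".join(flat.get("body", []))
print(flat["title"], "|", body_text)
```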

In [13]:
article_hash_content['7ae358503726b54c7b489062955c51fc496e07b0']

{'modifiedtime': '2012-04-19T11:05:59.000Z',
 'adressa-access': 'free',
 'adressa-importance': '0',
 'author': 'ingrid j. brissach',
 'body': ['Saken oppdateres.',
  'I 1991 satte en duo ved navn Trond A. Tune og Nicolai Riise i gang med å lete opp operaen «Fredkulla», skrevet i 1858 av Martin Andreas Udbye (1820-1889). Bakgrunnen for gravejobben var en hovedfagsoppgave i musikkvitenskap ved Universitetet i Oslo. Etter det begynte snøballen å rulle, og det endte med uroppførelse av Norges første opera i Olavshallen under tusenårsjubileet i Trondheim i 1997.',
  'Udbyes gate ligger på Øya og går mellom Olav Kyrres gate og Abels gate.',
  'Martin Andreas Udbye var født i Trondheim og vokste opp i Sanden. Foreldrene var Ole og Bergitte (Øien) Udbye. Begge foreldrene var musikalske, men ikke særlig bemidlet. Men selv om gutten ikke hadde mer skolegang enn vanlig «almueskole», viste han tidlig gode evner.',
  'Allerede som 16-åring ble han huslærer på Verdalsøra og senere i Sparbu. I disse 

In [14]:
from utils.adressa_util.preprocessing import write_news_files_full
write_news_files_full(hash_title, hash2id, out_path, article_hash_content)

100%|██████████| 4641/4641 [00:00<00:00, 340025.24it/s]


In [18]:
from transformers import M2M100ForConditionalGeneration
from utils.tokenization_small100 import SMALL100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("alirezamsh/small100")
tokenizer = SMALL100Tokenizer.from_pretrained("alirezamsh/small100")

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'M2M100Tokenizer'. 
The class this function is called from is 'SMALL100Tokenizer'.


In [None]:
no_text = article_hash_content['7ae358503726b54c7b489062955c51fc496e07b0']['description']

# Translate the Norwegian description to English
tokenizer.tgt_lang = "en"
encoded_no = tokenizer(no_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_no)
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

In [None]:
from refined.inference.processor import Refined
refined = Refined.from_pretrained(
    model_name='wikipedia_model',
    entity_set="wikidata"
)

Downloading /Users/andreaparolin/.cache/refined/wikipedia_model/model.pt: 100%|██████████| 734M/734M [04:15<00:00, 2.87MB/s] 
Downloading /Users/andreaparolin/.cache/refined/wikipedia_model/config.json: 100%|██████████| 702/702 [00:00<00:00, 2.37kB/s]
Downloading /Users/andreaparolin/.cache/refined/wikipedia_model/precomputed_entity_descriptions_emb_wikidata_33831487-300.np:  24%|██▍       | 4.87G/20.3G [19:14<1:51:59, 2.30MB/s]

In [None]:
from typing import List

from refined.data_types.base_types import Entity, Span

def spans_to_mind_format(results: List[Span]):
    """Convert ReFinED spans into MIND-style entity dicts."""
    entity_list = []
    for entity in results:
        predicted = entity.predicted_entity
        # Skip mentions that were detected but are not linked to a Wikidata page
        if predicted is None or predicted.wikidata_entity_id is None:
            continue
        entity_list.append({
            'Label': predicted.wikipedia_entity_title,
            'Type': entity.coarse_mention_type,
            'WikidataId': predicted.wikidata_entity_id,
            # Score of the top-ranked candidate
            'Confidence': entity.candidate_entities[0][1],
            'OccurrenceOffsets': [entity.start],
            'SurfaceForms': [entity.text]
        })
    return entity_list
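The conversion can be exercised without downloading the ReFinED model by feeding it stand-in objects. The sketch below is an assumption-laden illustration: the `SimpleNamespace` objects only mimic the attributes the function above reads (`text`, `start`, `coarse_mention_type`, `predicted_entity`, `candidate_entities`) and are not real `Span` instances, and the duck-typed `spans_to_mind_format_sketch` is a local copy made for this test. Serialising with `json.dumps` mirrors the JSON entity columns of the MIND reference line.

```python
import json
from types import SimpleNamespace

def spans_to_mind_format_sketch(spans):
    """Duck-typed copy of the conversion, for testing without ReFinED."""
    entity_list = []
    for span in spans:
        predicted = span.predicted_entity
        # Skip mentions that are not linked to a Wikidata page.
        if predicted is None or predicted.wikidata_entity_id is None:
            continue
        entity_list.append({
            "Label": predicted.wikipedia_entity_title,
            "Type": span.coarse_mention_type,
            "WikidataId": predicted.wikidata_entity_id,
            "Confidence": span.candidate_entities[0][1],
            "OccurrenceOffsets": [span.start],
            "SurfaceForms": [span.text],
        })
    return entity_list

# Stand-in objects (not real ReFinED spans): one linked, one unlinked.
linked = SimpleNamespace(
    text="Prince Philip", start=48, coarse_mention_type=None,
    predicted_entity=SimpleNamespace(
        wikidata_entity_id="Q80976",
        wikipedia_entity_title="Prince Philip, Duke of Edinburgh",
    ),
    candidate_entities=[("Q80976", 1.0)],
)
unlinked = SimpleNamespace(
    text="Fredkulla", start=0, coarse_mention_type=None,
    predicted_entity=SimpleNamespace(
        wikidata_entity_id=None, wikipedia_entity_title=None,
    ),
    candidate_entities=[],
)

entities = spans_to_mind_format_sketch([linked, unlinked])
# ensure_ascii=False keeps Norwegian characters readable in the output file.
entity_column = json.dumps(entities, ensure_ascii=False)
print(entity_column)
```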