# Binary Annotation

In this notebook we use the NeuroX library to create a binary labeled dataset based on a pattern. We will use the `annotate_data` function from the `neurox.data.annotate` module.

# Imports

In [1]:
from neurox.data.annotate import annotate_data
import neurox.data.extraction.transformers_extractor as transformers_extractor
import neurox.data.loader as data_loader
import re

# Inspect data

In [2]:
!cat "data/sentences.txt"

In the year 1969, Neil Armstrong became the first person to set foot on the moon during the Apollo 11 mission.
The Berlin Wall, which divided East and West Germany, stood from 1961 until its fall in 1989.
In 1776, the United States declared its independence from Great Britain with the signing of the Declaration of Independence.
The year 1945 marked the end of World War II, with the surrender of Germany and Japan.
The internet as we know it today began to take shape in 1969, when the first host-to-host connection was established between two computers.
The devastating earthquake and tsunami in Japan occurred in 2011, causing widespread destruction and a nuclear disaster at the Fukushima Daiichi power plant.
The Chernobyl nuclear disaster took place in 1986, when a reactor at the Chernobyl power plant in Ukraine exploded, releasing a significant amount of radioactive material.
The year 1492 is famous for Christopher Columbus's first voyage to the Americas, which opened up a new era

# Extract Activations

In [3]:
transformers_extractor.extract_representations('bert-base-uncased',
    "data/sentences.txt",
    'activations.json',
    aggregation="average"  # other options: "last", "first"
)

Loading model: bert-base-uncased


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Reading input corpus
Preparing output file
Extracting representations from model
Sentence         : "In the year 1969, Neil Armstrong became the first person to set foot on the moon during the Apollo 11 mission."
Original    (021): ['In', 'the', 'year', '1969,', 'Neil', 'Armstrong', 'became', 'the', 'first', 'person', 'to', 'set', 'foot', 'on', 'the', 'moon', 'during', 'the', 'Apollo', '11', 'mission.']
Tokenized   (025): ['[CLS]', 'in', 'the', 'year', '1969', ',', 'neil', 'armstrong', 'became', 'the', 'first', 'person', 'to', 'set', 'foot', 'on', 'the', 'moon', 'during', 'the', 'apollo', '11', 'mission', '.', '[SEP]']
Filtered   (023): ['in', 'the', 'year', '1969', ',', 'neil', 'armstrong', 'became', 'the', 'first', 'person', 'to', 'set', 'foot', 'on', 'the', 'moon', 'during', 'the', 'apollo', '11', 'mission', '.']
Detokenized (021): ['in', 'the', 'year', '1969,', 'neil', 'armstrong', 'became', 'the', 'first', 'person', 'to', 'set', 'foot', 'on', 'the', 'moon', 'during', 'the', 'apoll

Sentence         : "The year 1492 is famous for Christopher Columbus's first voyage to the Americas, which opened up a new era of exploration and colonization."
Original    (023): ['The', 'year', '1492', 'is', 'famous', 'for', 'Christopher', "Columbus's", 'first', 'voyage', 'to', 'the', 'Americas,', 'which', 'opened', 'up', 'a', 'new', 'era', 'of', 'exploration', 'and', 'colonization.']
Tokenized   (030): ['[CLS]', 'the', 'year', '149', '##2', 'is', 'famous', 'for', 'christopher', 'columbus', "'", 's', 'first', 'voyage', 'to', 'the', 'americas', ',', 'which', 'opened', 'up', 'a', 'new', 'era', 'of', 'exploration', 'and', 'colonization', '.', '[SEP]']
Filtered   (028): ['the', 'year', '149', '##2', 'is', 'famous', 'for', 'christopher', 'columbus', "'", 's', 'first', 'voyage', 'to', 'the', 'americas', ',', 'which', 'opened', 'up', 'a', 'new', 'era', 'of', 'exploration', 'and', 'colonization', '.']
Detokenized (023): ['the', 'year', '149##2', 'is', 'famous', 'for', 'christopher', "columbu
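
The log above shows a word such as `1492` being split into the subword pieces `149` and `##2` and then merged back into a single token. The `aggregation` argument controls how the vectors of those pieces are combined into one vector per word. The following is a minimal sketch of that idea in plain Python, not NeuroX's actual implementation; the `aggregate` helper and the toy 3-dimensional vectors are hypothetical and only illustrate the three options.

```python
def aggregate(subword_vectors, how="average"):
    """Combine the vectors of one word's subword pieces into a single vector.

    Hypothetical helper for illustration only; not part of NeuroX.
    """
    if how == "first":
        return list(subword_vectors[0])
    if how == "last":
        return list(subword_vectors[-1])
    if how == "average":
        n = len(subword_vectors)
        # zip(*...) iterates over dimensions, so each `dim` holds one
        # coordinate from every subword vector.
        return [sum(dim) / n for dim in zip(*subword_vectors)]
    raise ValueError(f"unknown aggregation: {how}")


# Toy vectors standing in for the activations of '149' and '##2'.
pieces = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]

averaged = aggregate(pieces, "average")
first = aggregate(pieces, "first")
last = aggregate(pieces, "last")
print(averaged, first, last)
```

With `"average"`, every subword piece contributes equally to the word's vector, while `"first"` and `"last"` keep only one piece and discard the others.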

# Annotate the data

In [4]:
pattern = re.compile(r"\b(19\d{2}|20\d{2})\b")
annotate_data("data/sentences.txt", "activations.json", pattern, "annotations")

Loading json activations from activations.json...
10 13.0
Creating binary dataset ...
Number of Positive examples:  9


IndexError: tuple index out of range
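
Independently of NeuroX, it can help to check which tokens the pattern would label as positive. The sketch below assumes that tokens are whitespace-separated (as in the `Original` lines of the extraction log) and that the pattern is applied from the start of each token with `match`; both are assumptions about how `annotate_data` uses the pattern, and `label_tokens` is a hypothetical helper.

```python
import re

# Same pattern as in the annotation cell: four-digit years 19xx or 20xx.
pattern = re.compile(r"\b(19\d{2}|20\d{2})\b")


def label_tokens(sentence, pattern):
    """Return (token, is_positive) pairs for whitespace-separated tokens.

    Hypothetical helper for illustration; assumes the pattern is applied
    with `match`, i.e. anchored at the start of each token.
    """
    return [(tok, bool(pattern.match(tok))) for tok in sentence.split()]


sentence = (
    "The Berlin Wall, which divided East and West Germany, "
    "stood from 1961 until its fall in 1989."
)
positives = [tok for tok, is_pos in label_tokens(sentence, pattern) if is_pos]
print(positives)

# Years outside 1900-2099 are not matched by this pattern.
no_match = label_tokens("In 1776, the", pattern)
print(no_match)
```

Under these assumptions, a token with trailing punctuation such as `1989.` still counts as positive because `\b` matches between the last digit and the period, whereas years like `1776` and `1492` fall outside the pattern and are labeled negative.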