# Overview

The goal of this notebook is to ingest the question, context, and content-answer files and build the X and Y portions of a dataset. It then loads a BERT model (DistilBERT by default) that encodes the context and content text into vector representations.


# Setup

## Imports

In [2]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import warnings
warnings.filterwarnings('ignore')

## Constants

In [3]:
GCS_APP_ID = "aqa-research"
GCS_BUCKET = "dabi-aqa-data-00"

FN_QUESTIONS = "questions.csv"
FN_CONTEXTS  = "contexts.csv"
FN_CONTENT_ANSWERS   = "content_answers.csv"

## GCS Auth and Methods

In [4]:
!gcloud auth login --launch-browser


You are now logged in as [willpowe@gmail.com].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [5]:
!gcloud config set project {GCS_APP_ID}

Updated property [core/project].


In [6]:
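import subprocess

def download_file_from_gcs(src_fn, dest_fn):
  """Copy gs://GCS_BUCKET/src_fn to /content/dest_fn."""
  src_url = f"gs://{GCS_BUCKET}/{src_fn}"
  # check=True raises CalledProcessError if gsutil fails, rather than failing silently
  subprocess.run(["gsutil", "-m", "cp", src_url, f"/content/{dest_fn}"], check=True)

def upload_file_to_gcs(src_fn, dest_fn):
  """Copy the local file src_fn to gs://GCS_BUCKET/dest_fn."""
  dest_url = f"gs://{GCS_BUCKET}/{dest_fn}"
  subprocess.run(["gsutil", "-m", "cp", src_fn, dest_url], check=True)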

## Raw Data

In [7]:
download_file_from_gcs(FN_QUESTIONS, FN_QUESTIONS)
download_file_from_gcs(FN_CONTEXTS, FN_CONTEXTS)
download_file_from_gcs(FN_CONTENT_ANSWERS, FN_CONTENT_ANSWERS)

## BERT Setup

In [8]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.0-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.0 tokenizers-0.13.2 transformers-4.24.0


In [9]:
import transformers as ppb

# Embedding Content

In [10]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
#model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
questions = pd.read_csv(FN_QUESTIONS)
contexts  = pd.read_csv(FN_CONTEXTS)
content_answers = pd.read_csv(FN_CONTENT_ANSWERS)

In [12]:
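def tokenize_and_count(df_in, text_col):
  """Add 'tokens' (input ids) and 'token_length' columns to df_in, in place."""
  # add_special_tokens=True wraps each sequence as [CLS] ... [SEP]
  df_in['tokens'] = df_in[text_col].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))
  df_in['token_length'] = df_in['tokens'].apply(len)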

tokenize_and_count(questions, 'question_text')
tokenize_and_count(contexts, 'context_text')
tokenize_and_count(content_answers, 'content_text')

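Next, pad the content token sequences with 0 (the `[PAD]` id in these uncased vocabularies) up to the longest sequence found across all three tables, so they can be stacked into a single array.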

In [13]:
max_len = max(questions['token_length'].max(), contexts['token_length'].max(), content_answers['token_length'].max())
padded_content = np.array([i + [0]*(max_len-len(i)) for i in content_answers['tokens'].values])
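
As a rough illustration of the padding step, the sketch below right-pads a couple of toy sequences. The helper name `pad_sequences` and the toy ids are assumptions made for this example, not part of the notebook; 101 and 102 stand in for the `[CLS]` and `[SEP]` ids.

```python
import numpy as np

def pad_sequences(token_lists, pad_id=0, max_len=None):
    """Right-pad lists of token ids to a common length and stack them.

    Hypothetical helper for illustration; sequences longer than max_len
    are truncated so every row fits.
    """
    if max_len is None:
        max_len = max(len(tokens) for tokens in token_lists)
    padded = np.full((len(token_lists), max_len), pad_id, dtype=np.int64)
    for row, tokens in enumerate(token_lists):
        tokens = tokens[:max_len]
        padded[row, :len(tokens)] = tokens
    return padded

# Toy ids only; real ids come from tokenizer.encode(...)
toy_tokens = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
toy_padded = pad_sequences(toy_tokens)
print(toy_padded.shape)  # (2, 5)
```

Pre-allocating with `np.full` avoids building intermediate Python lists, which may matter little at 2,000 rows but keeps the pad id explicit.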

Then create an attention mask so the model ignores the padded 0's when it computes attention.

In [14]:
attention_mask = np.where(padded_content != 0, 1, 0)
print(attention_mask.shape)
print(padded_content.shape)

(2000, 354)
(2000, 354)
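
Deriving the mask from `!= 0` relies on 0 being used only for padding. A possible alternative, sketched here on toy data, builds the same mask from the sequence lengths instead; `mask_from_lengths` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def mask_from_lengths(lengths, max_len):
    """Return a (n, max_len) 0/1 mask with ones over the first lengths[i] positions."""
    positions = np.arange(max_len)
    lengths = np.asarray(lengths)
    return (positions[None, :] < lengths[:, None]).astype(np.int64)

# Toy padded batch: two sequences of true length 3 and 5
toy_padded = np.array([[101, 7592, 102, 0, 0],
                       [101, 7592, 2088, 999, 102]])

mask_a = np.where(toy_padded != 0, 1, 0)       # approach used in the cell above
mask_b = mask_from_lengths([3, 5], max_len=5)  # length-based alternative

print(np.array_equal(mask_a, mask_b))  # True
```

In the notebook, the lengths would presumably come from the `token_length` column.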


Now we can run the padded content through the model to obtain token-level embeddings for each content entry.

In [15]:
input_ids = torch.tensor(padded_content)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
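
The cell above sends all rows through the model in a single call, which may use a lot of memory at this size. One possible workaround is to process the rows in batches. The sketch below shows only the slicing logic on toy NumPy arrays; `iter_batches` and the batch size are assumptions for illustration.

```python
import numpy as np

def iter_batches(ids, mask, batch_size=32):
    """Yield aligned (ids, mask) slices of at most batch_size rows."""
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        yield ids[start:end], mask[start:end]

# Toy stand-ins for padded_content and its attention mask
toy_ids = np.arange(20).reshape(10, 2)
toy_mask = np.ones_like(toy_ids)

batch_shapes = [b_ids.shape for b_ids, _ in iter_batches(toy_ids, toy_mask, batch_size=4)]
print(batch_shapes)  # [(4, 2), (4, 2), (2, 2)]
```

In the notebook, each batch would be converted with `torch.tensor`, passed to `model` under `torch.no_grad()`, and the per-batch outputs joined afterwards, for example with `torch.cat`.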

The following keeps only the first embedding of each sequence and discards the rest. That position corresponds to the [CLS] token, whose embedding is commonly used as a representation of the full sequence.

In [16]:
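# last_hidden_states[0] has shape (batch, seq_len, hidden); position 0 on the sequence axis is [CLS]
comment_embeddings = last_hidden_states[0][:,0,:].numpy()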
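
To make the slicing concrete, the sketch below applies the same [CLS] selection to a toy array. It also shows masked mean pooling, a different technique that the notebook does not use, as a point of comparison. Both function names are hypothetical.

```python
import numpy as np

def cls_pool(hidden):
    """Keep the embedding at sequence position 0 ([CLS]) for every row."""
    return hidden[:, 0, :]

def masked_mean_pool(hidden, mask):
    """Average token embeddings over non-padded positions only."""
    weights = mask[:, :, None].astype(hidden.dtype)
    summed = (hidden * weights).sum(axis=1)
    counts = np.clip(weights.sum(axis=1), 1, None)  # guard against all-padding rows
    return summed / counts

# Toy hidden states: 2 sequences, 3 positions, hidden size 4
toy_hidden = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
toy_mask = np.array([[1, 1, 0],
                     [1, 1, 1]])

cls_vectors = cls_pool(toy_hidden)
mean_vectors = masked_mean_pool(toy_hidden, toy_mask)
print(cls_vectors.shape, mean_vectors.shape)  # (2, 4) (2, 4)
```

Which pooling works better for the downstream task is an empirical question; the [CLS] slice is the one the notebook uses.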