# NLP Project - Word Embeddings

## Overview:
This project tackles reading comprehension on the BoolQ dataset, using pre-trained word embeddings (word2vec) together with a two-layer classifier with ReLU activation.

## Tools & Libraries:
1. **Dataset:** BoolQ from Hugging Face datasets.
2. **Word Embeddings:** Pre-trained word2vec embeddings.
3. **Neural Network:** A 2-layer classifier with ReLU non-linearity.
4. **Evaluation & Monitoring:** Weights & Biases for experiment tracking.
5. **Framework:** PyTorch (for modelling).

## Introduction
### Problem:
A reading comprehension task based on the BoolQ dataset.

### Approach:
Use word2vec embeddings with a classifier model.

### Objective:
Classify whether the answer to a given question, based on its context passage, is "yes" or "no".

## Setup
Install the necessary libraries: 
- Hugging Face datasets
- PyTorch
- gensim
- Weights & Biases (`wandb`)
- scikit-learn

## Preprocessing
- Tokenize the dataset: convert text to tokens compatible with the word2vec vocabulary.
- Embed the tokens using pre-trained word2vec embeddings (loaded with gensim).
- Handle out-of-vocabulary words, for example by substituting an averaged word vector or by ignoring them.
- Prepare input for the model: combine the question and context embeddings.
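
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: `toy_vectors` is a hypothetical stand-in for the gensim `KeyedVectors` object, the regex tokenizer with lowercasing is a simplifying assumption, out-of-vocabulary tokens are simply skipped, and "combining" the question and context is assumed to mean concatenating their averaged vectors.

```python
import re
import numpy as np

EMBED_DIM = 4  # toy size for illustration; real word2vec vectors are much larger

# Hypothetical stand-in for a gensim KeyedVectors object. A plain dict supports
# the same `token in vectors` and `vectors[token]` lookups used below.
toy_vectors = {
    "iran": np.array([1.0, 0.0, 0.0, 0.0]),
    "language": np.array([0.0, 1.0, 0.0, 0.0]),
    "persian": np.array([0.0, 0.0, 1.0, 0.0]),
}

def tokenize(text):
    """Lowercase and keep alphanumeric runs (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def embed_text(text, vectors, dim):
    """Average the vectors of in-vocabulary tokens; OOV tokens are skipped."""
    known = [vectors[tok] for tok in tokenize(text) if tok in vectors]
    if not known:
        # No known tokens at all: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean(known, axis=0)

def build_input(question, passage, vectors, dim):
    """Concatenate the averaged question and passage embeddings."""
    return np.concatenate([
        embed_text(question, vectors, dim),
        embed_text(passage, vectors, dim),
    ])

features = build_input(
    "do iran and afghanistan speak the same language",
    "Persian is spoken in Iran.",
    toy_vectors,
    EMBED_DIM,
)
print(features.shape)
```

With concatenation, the model input is twice the embedding size. Note that some pre-trained word2vec vocabularies are case-sensitive, so lowercasing may increase the number of out-of-vocabulary tokens.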

## Model Architecture
- Input: A vector whose size is determined by the word2vec embedding size.
- Layers:
  - Layer 1: Fully connected layer with ReLU activation.
  - Layer 2: Fully connected layer outputting two logits (for binary classification).
- Use softmax to convert logits into probabilities for final classification.
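
As a rough sketch of the architecture described above, a PyTorch module might look like the following. The class name `TwoLayerClassifier`, the hidden size of 128, and the input size of 600 (assuming the question and passage are each represented by a 300-dimensional averaged vector and then concatenated) are illustrative assumptions, not values taken from the project.

```python
import torch
from torch import nn

class TwoLayerClassifier(nn.Module):
    """Two fully connected layers with a ReLU in between (hypothetical sizes)."""

    def __init__(self, input_dim, hidden_dim=128, num_classes=2):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)    # Layer 1
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, num_classes)  # Layer 2: two logits

    def forward(self, x):
        # Return raw logits; softmax is applied separately at prediction time.
        return self.fc2(self.relu(self.fc1(x)))

torch.manual_seed(0)
model = TwoLayerClassifier(input_dim=600)  # assumed: 2 x 300-d averaged vectors
batch = torch.randn(4, 600)                # random stand-in for 4 examples

logits = model(batch)
probs = torch.softmax(logits, dim=1)       # convert logits to probabilities
preds = probs.argmax(dim=1)                # predicted class index per example
print(logits.shape, preds.shape)
```

Keeping softmax outside `forward` is a common design choice because `nn.CrossEntropyLoss` expects raw logits during training; softmax is then only needed when probabilities are reported.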


In [2]:
from datasets import load_dataset

dataset = load_dataset('boolq')

train_data = dataset['train']
validation_data = dataset['validation']

print(train_data[0])

# View some basic statistics
print(f"Number of training samples: {len(train_data)}")
print(f"Number of validation samples: {len(validation_data)}")

{'question': 'do iran and afghanistan speak the same language', 'answer': True, 'passage': 'Persian (/ˈpɜːrʒən, -ʃən/), also known by its endonym Farsi (فارسی fārsi (fɒːɾˈsiː) ( listen)), is one of the Western Iranian languages within the Indo-Iranian branch of the Indo-European language family. It is primarily spoken in Iran, Afghanistan (officially known as Dari since 1958), and Tajikistan (officially known as Tajiki since the Soviet era), and some other regions which historically were Persianate societies and considered part of Greater Iran. It is written in the Persian alphabet, a modified variant of the Arabic script, which itself evolved from the Aramaic alphabet.'}
Number of training samples: 9427
Number of validation samples: 3270


In [5]:
import gensim.downloader as api
