---
This notebook converts the GloVe dataset into bcolz .dat files. For each vector dimension, two pickled files are also created: one containing the list of words and the other mapping each word to its index. In "word_embedding_test", we then build a dictionary from these files and test it.
The code is adapted from https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76.

---

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


In [None]:
!pip install bcolz --quiet

Collecting bcolz
  Downloading https://files.pythonhosted.org/packages/5c/4e/23942de9d5c0fb16f10335fa83e52b431bcb8c0d4a8419c9ac206268c279/bcolz-1.2.1.tar.gz (1.5MB)
Building wheels for collected packages: bcolz
  Building wheel for bcolz (setup.py) ... done
  Created wheel for bcolz: filename=bcolz-1.2.1-cp36-cp36m-linux_x86_64.whl size=2668988 sha256=232a2d44a604db879078857368d149a93bfe8464b41d3d02703f5814c3f18f43
  Stored in directory: /root/.cache/pip/wheels/9f/78/26/fb8c0acb91a100dc8914bf236c4eaa4b207cb876893c40b745
Successfully built bcolz
Installing collected packages: bcolz
Successfully installed bcolz-1.2.1


In [None]:
# # No need to run again

# !unzip './drive/My Drive/AML_assignments/Assignment2/glove.6B.zip' -d './drive/My Drive/AML_2'

Archive:  ./drive/My Drive/AML_assignments/Assignment2/glove.6B.zip
  inflating: ./drive/My Drive/AML_2/glove.6B.50d.txt  
  inflating: ./drive/My Drive/AML_2/glove.6B.100d.txt  
  inflating: ./drive/My Drive/AML_2/glove.6B.200d.txt  
  inflating: ./drive/My Drive/AML_2/glove.6B.300d.txt  


In [None]:
import bcolz
import numpy as np
import pickle


---
> Libraries/Packages imported
---

> *   The bcolz package provides columnar, chunked data containers that can be compressed either in memory or on disk. Column storage allows tables to be queried efficiently and makes adding and removing columns cheap. It is based on NumPy.
> *   bcolz offers two containers: carray for homogeneous data and ctable for heterogeneous (row-wise) data. This notebook uses carray.
> *   A carray is very similar to a NumPy ndarray in that it supports the same types and basic data access interface. The main difference is that a carray can keep data compressed (both in memory and on disk), which makes it possible to handle larger datasets with the same amount of memory or disk space.
> *   The pickle module implements binary protocols for serializing and de-serializing a Python object structure. “Pickling” is the process whereby a Python object hierarchy is converted into a byte stream.
> *   The function pickle.dump(obj, file) writes the pickled representation of the object obj to the open file object file.



---
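
As a small illustration of the pickle calls described above, the following sketch writes a word list and a word-to-index dictionary to a temporary directory and reads them back. The sample words and file names are made up for the example and are not the notebook's real paths.

```python
import os
import pickle
import tempfile

# Hypothetical sample data standing in for the GloVe vocabulary
sample_words = ["the", "of", "to"]
sample_word2idx = {w: i for i, w in enumerate(sample_words)}

with tempfile.TemporaryDirectory() as tmp_dir:
    words_path = os.path.join(tmp_dir, "sample_words.pkl")
    idx_path = os.path.join(tmp_dir, "sample_idx.pkl")

    # pickle.dump(obj, file) serializes obj into an open binary file
    with open(words_path, "wb") as f:
        pickle.dump(sample_words, f)
    with open(idx_path, "wb") as f:
        pickle.dump(sample_word2idx, f)

    # pickle.load(file) reconstructs the original object from the file
    with open(words_path, "rb") as f:
        loaded_words = pickle.load(f)
    with open(idx_path, "rb") as f:
        loaded_word2idx = pickle.load(f)

print(loaded_words == sample_words, loaded_word2idx == sample_word2idx)
```

Opening the files in a with block closes them as soon as the dump or load finishes, so the data is fully written before anything else reads it.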

In [None]:
words = []
idx = 0
word2idx = {}
# Start with a one-element placeholder carray; it is dropped below via vectors[1:]
vectors = bcolz.carray(np.zeros(1), rootdir='./drive/My Drive/AML_2/6B.50d.dat', mode='w')

# Read the file line by line: the first token is the word, the rest are its vector components

with open('./drive/My Drive/AML_2/glove.6B.50d.txt', 'rb') as f:
    for l in f:
        line = l.decode().split()
        word = line[0]
        words.append(word)
        word2idx[word] = idx
        idx += 1
        # Convert the string components to float64
        vect = np.array(line[1:]).astype(np.float64)
        vectors.append(vect)

# Drop the placeholder, reshape to (number of words, 50) and write the carray to disk
# (len(words) is 400000 for glove.6B)
vectors = bcolz.carray(vectors[1:].reshape((len(words), 50)), rootdir='./drive/My Drive/AML_2/6B.50d.dat', mode='w')
vectors.flush()
# Store the word list and the word-to-index mapping in pkl files
with open('./drive/My Drive/AML_2/6B.50_words.pkl', 'wb') as f:
    pickle.dump(words, f)
with open('./drive/My Drive/AML_2/6B.50_idx.pkl', 'wb') as f:
    pickle.dump(word2idx, f)
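
The parsing loop above can be tried without the GloVe file or bcolz. The sketch below runs the same split-and-index logic on two made-up lines held in memory. The function name parse_glove_lines and the sample values are hypothetical, and plain Python lists stand in for the carray.

```python
import io


def parse_glove_lines(lines):
    """Split each 'word v1 v2 ...' line into a word and a list of floats."""
    words, word2idx, rows = [], {}, []
    for idx, raw in enumerate(lines):
        tokens = raw.decode().split()
        word = tokens[0]
        words.append(word)
        word2idx[word] = idx  # row index of this word's vector
        rows.append([float(t) for t in tokens[1:]])
    return words, word2idx, rows


# Made-up 3-dimensional sample in the GloVe text layout (not real GloVe values)
sample_file = io.BytesIO(b"the 0.1 0.2 0.3\ncat -0.5 0.0 1.5\n")
sample_words, sample_word2idx, sample_rows = parse_glove_lines(sample_file)

print(sample_words)
print(sample_word2idx)
print(sample_rows)
```

Because each word's index equals its row position, the word list, the index dictionary and the vector rows stay aligned.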

In [None]:
words = []
idx = 0
word2idx = {}
# Start with a one-element placeholder carray; it is dropped below via vectors[1:]
vectors = bcolz.carray(np.zeros(1), rootdir='./drive/My Drive/AML_2/6B.200d.dat', mode='w')

# Read the file line by line: the first token is the word, the rest are its vector components

with open('./drive/My Drive/AML_2/glove.6B.200d.txt', 'rb') as f:
    for l in f:
        line = l.decode().split()
        word = line[0]
        words.append(word)
        word2idx[word] = idx
        idx += 1
        # Convert the string components to float64
        vect = np.array(line[1:]).astype(np.float64)
        vectors.append(vect)

# Drop the placeholder, reshape to (number of words, 200) and write the carray to disk
# (len(words) is 400000 for glove.6B)
vectors = bcolz.carray(vectors[1:].reshape((len(words), 200)), rootdir='./drive/My Drive/AML_2/6B.200d.dat', mode='w')
vectors.flush()
# Store the word list and the word-to-index mapping in pkl files
with open('./drive/My Drive/AML_2/6B.200_words.pkl', 'wb') as f:
    pickle.dump(words, f)
with open('./drive/My Drive/AML_2/6B.200_idx.pkl', 'wb') as f:
    pickle.dump(word2idx, f)
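
The introduction mentions building a dictionary from the saved files. One plausible form of that step is sketched below with small in-memory stand-ins. In the real workflow the words and indices would be loaded from the pkl files and the vectors from the .dat carray. All names prefixed with sample_ and all values here are made up for illustration.

```python
# Stand-ins for the contents of the pkl files and the .dat carray (made-up values)
sample_words = ["the", "cat"]
sample_word2idx = {"the": 0, "cat": 1}
sample_vectors = [[0.1, 0.2, 0.3], [-0.5, 0.0, 1.5]]

# Map each word directly to its vector by looking up its row index
glove = {w: sample_vectors[sample_word2idx[w]] for w in sample_words}

print(glove["cat"])
```

With this dictionary, a word's embedding can be fetched in one lookup instead of going through the index mapping each time.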