
### Data Preprocessing
- Gensim word2vec expects its training input as a list of lists: the outer list holds the documents, and each inner list holds the tokens of one document.
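
To make the expected shape concrete, here is a minimal sketch of a list-of-lists corpus. The two toy sentences and the helper `is_list_of_token_lists` are illustrative assumptions, not part of the original notebook, and the gensim call is left as a comment so the snippet runs without gensim installed.

```python
# Two toy "documents" (assumed for illustration only).
documents = [
    "Call me Ishmael",
    "Some years ago never mind how long precisely",
]

# Outer list = documents, inner list = tokens of one document.
sentences = [doc.lower().split() for doc in documents]


def is_list_of_token_lists(corpus):
    """Hypothetical helper: True if corpus is a list of lists of strings."""
    return isinstance(corpus, list) and all(
        isinstance(doc, list) and all(isinstance(tok, str) for tok in doc)
        for doc in corpus
    )


print(sentences)
print(is_list_of_token_lists(sentences))

# With gensim installed, a corpus of this shape could be passed along the lines of:
# model = gensim.models.Word2Vec(sentences=sentences, min_count=1)
```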


### References
---
- https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92?gi=2df9a428548d
- data: https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92?gi=2df9a428548d
    

# Setup

In [5]:
import os
import pandas as pd
from string import punctuation

In [1]:
import gensim



### Globals

In [2]:
DIR_DATA = "/home/oem/repositories/nlp/data"
FILENAME_DATA = "moby10b.txt"

### Load Data

In [3]:
with open(os.path.join(DIR_DATA, FILENAME_DATA), 'r') as f:
    text_raw = f.readlines()

In [4]:
text_raw[:10]

['**The Project Gutenberg Etext of Moby Dick, by Herman Melville**\n',
 '#3 in our series by Herman Melville\n',
 '\n',
 'This Project Gutenberg version of Moby Dick is based on a combination\n',
 'of the etext from the ERIS project at Virginia Tech and another from\n',
 "Project Gutenberg's archives, as compared to a public-domain hard copy.\n",
 '\n',
 'Copyright laws are changing all over the world, be sure to check\n',
 'the copyright laws for your country before posting these files!!\n',
 '\n']

In [5]:
f"Number of Lines of Text => {len(text_raw)}"

'Number of Lines of Text => 23244'

<br>

---

# Create a Training Dataset
---

1. Create a list of rows of text.
2. Standardize the text.
3. Create a list of lists, where each inner list holds the tokens of one row.

<br>

### Clean Text

In [6]:
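def clean_text(text_line, punctuation):
    """Remove every character in `punctuation` from `text_line`."""
    for p in punctuation:
        text_line = text_line.replace(p, '')
        # Collapse double spaces left behind by removed characters.
        text_line = text_line.replace('  ', ' ')
    return text_line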
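
As an alternative sketch of the same cleaning step, `str.translate` can delete all punctuation in one pass, and `" ".join(s.split())` collapses any run of whitespace. The name `strip_punctuation` is hypothetical. Unlike `clean_text`, this version also trims leading and trailing whitespace, including the newline.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", punctuation)


def strip_punctuation(text_line: str) -> str:
    """Remove punctuation, then collapse whitespace runs to single spaces."""
    no_punct = text_line.translate(_PUNCT_TABLE)
    return " ".join(no_punct.split())


line = "Project Gutenberg's archives,  as compared to a public-domain hard copy.\n"
print(strip_punctuation(line))
# Project Gutenbergs archives as compared to a publicdomain hard copy
```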

In [7]:
text_clean = [
    x.lower().replace("\n", "")
    for x in text_raw if x
]

In [8]:
text_clean = [clean_text(x, punctuation).split(' ') for x in text_clean]


In [9]:
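# Note: split(' ') turns a blank line into [''], which is truthy,
# so this filter leaves those rows in place.
text_clean = [x for x in text_clean if x]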

In [10]:
text_clean[:10]

[['the',
  'project',
  'gutenberg',
  'etext',
  'of',
  'moby',
  'dick',
  'by',
  'herman',
  'melville'],
 ['3', 'in', 'our', 'series', 'by', 'herman', 'melville'],
 [''],
 ['this',
  'project',
  'gutenberg',
  'version',
  'of',
  'moby',
  'dick',
  'is',
  'based',
  'on',
  'a',
  'combination'],
 ['of',
  'the',
  'etext',
  'from',
  'the',
  'eris',
  'project',
  'at',
  'virginia',
  'tech',
  'and',
  'another',
  'from'],
 ['project',
  'gutenbergs',
  'archives',
  'as',
  'compared',
  'to',
  'a',
  'publicdomain',
  'hard',
  'copy'],
 [''],
 ['copyright',
  'laws',
  'are',
  'changing',
  'all',
  'over',
  'the',
  'world',
  'be',
  'sure',
  'to',
  'check'],
 ['the',
  'copyright',
  'laws',
  'for',
  'your',
  'country',
  'before',
  'posting',
  'these',
  'files'],
 ['']]
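
The `['']` rows in the output above come from blank lines in the source file. A minimal sketch of one way to drop them, assuming it is acceptable to tokenize with the no-argument `str.split()` (which returns an empty list for a blank line); the function name `tokenize_lines` and the three sample lines are illustrative, not part of the original notebook:

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", punctuation)


def tokenize_lines(lines):
    """Lowercase, strip punctuation, tokenize on whitespace, skip empty rows."""
    tokenized = []
    for line in lines:
        tokens = line.lower().translate(_PUNCT_TABLE).split()
        if tokens:  # a blank line yields [], which is falsy
            tokenized.append(tokens)
    return tokenized


sample = [
    "#3 in our series by Herman Melville\n",
    "\n",
    "Copyright laws are changing all over the world, be sure to check\n",
]
print(tokenize_lines(sample))
```

With this approach the blank line in `sample` produces no row at all, so the result contains only the two non-empty token lists.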