# 1. Loading and Preparing the Dataset
Language is highly domain-specific. Businesses collect text datasets tailored to their domains (legal, healthcare, insurance, social networks, finance, etc.). These domain-specific corpora can then be exploited in multiple ways: entity recognition, search indexing, query completion, product recommendation systems, or sentiment analysis, to name just a few.

Language models are at the core of many natural language processing (NLP) applications like the ones listed in the previous paragraph. Simply put, given surrounding or preceding words, a language model predicts a word. As you will see, you can directly exploit language models for any application that generates text, such as machine translation, speech to text, text generation, or query completion applications.

Off-the-shelf models trained on large generic corpora do not reflect the particularities of a given business domain. For instance, completing the query “how to avoid over…” will not give the same results in a general context (overeating, overthinking, …) as in a data science context, where “overfitting” would be a more appropriate completion.

To reap the benefits of domain-specific corpora, we must build language models that are tuned to the particular vocabulary of the domain at hand. Since Stack Exchange operates domain-specific forums, the content of these forums is an ideal corpus for building domain-specific language models.

In this project, you will take on the role of an NLP data scientist at Stack Exchange, a network of question-and-answer (Q&A) websites on topics in diverse fields. Stack Exchange has over 10 million registered users and is best known for flagship websites such as Stack Overflow and Ask Ubuntu. You will build statistics-focused language models using gradually more complex methods. You will evaluate and apply these models to the tasks of:

- Query completion
- Larger text generation
- Sentence selection

At the end of this project, you will be able to build the foundations of any domain-specific NLP system by creating a robust and efficient language model.

The link to the dataset is available here:

https://go.aws/30Vpl5b

# 1.1 Loading and preparing the dataset
### Objective

- The goal of this preliminary task is to load and clean the dataset. The raw text is noisy and we want to remove nonwords, keep punctuation to a minimum, and reduce the overall vocabulary of the corpus.

### Workflow

1. Load the dataset into a pandas dataframe.
2. Use regular expressions to remove elements that are not words such as HTML tags, LaTeX expressions, URLs, digits, line returns, and so on.
3. Remove rows with missing text values.
4. Remove texts that are extremely large or too short to bring any information to the model. We want to keep paragraphs that contain at least a few words and remove the paragraphs that are composed of large numerical tables.
5. Use a tokenizer to create a version of the original text that is a string of space-separated lowercase tokens. For instance,

Thank you!, This equation $y = ax + b$, is very helpful.
would be transformed as:
thank you ! this equation , is very helpful .

“retrieve a distance matrix” is a matter of coding. It also might be irrelevant: one can imagine creative answers.
becomes, if you choose to remove double quotes from the original text:
retrieve a distance matrix is a matter of coding . it also might be irrelevant : one can imagine creative answers .

- Note that punctuation signs (, . : !) are also represented as tokens.

6. Export the resulting dataframe to a CSV file.
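
Steps 2 and 5 of the workflow can be sketched with the standard library alone. This is a minimal illustration, not the cleaning pipeline used in this notebook: every pattern and the helper names `clean_text` and `to_token_string` are assumptions made for the example, and each regex is a rough approximation (the LaTeX pattern, for instance, would also swallow text between two dollar amounts).

```python
import re

# Hypothetical patterns: rough approximations, not exhaustive rules.
HTML_TAG = re.compile(r"<[^>]+>")                            # <p>, </a>, <br/> ...
LATEX_MATH = re.compile(r"\$\$.*?\$\$|\$.*?\$", re.DOTALL)   # $...$ and $$...$$
URL = re.compile(r"https?://\S+|www\.\S+")
DIGITS = re.compile(r"\d+")
WHITESPACE = re.compile(r"\s+")                              # includes line returns
TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")                  # a word, or one punctuation sign


def clean_text(text):
    """Remove non-word elements and collapse whitespace."""
    text = HTML_TAG.sub(" ", text)
    text = LATEX_MATH.sub(" ", text)
    text = URL.sub(" ", text)
    text = DIGITS.sub(" ", text)
    # Drop straight and curly double quotes.
    text = text.replace('"', " ").replace("“", " ").replace("”", " ")
    return WHITESPACE.sub(" ", text).strip()


def to_token_string(text):
    """Return the cleaned text as space-separated lowercase tokens."""
    return " ".join(TOKEN.findall(clean_text(text).lower()))


example = ("“retrieve a distance matrix” is a matter of coding. "
           "It also might be irrelevant: one can imagine creative answers.")
print(to_token_string(example))
```

Applied to the second example of step 5, this sketch yields the same token string as shown there. A dedicated tokenizer such as NLTK or spaCy handles many more edge cases than the single `TOKEN` pattern.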

# 1.2 Regular expressions
A regular expression (regex) is a way of recognizing and often extracting data from certain patterns of text. A regex that recognizes a piece of text or a string is said to match that text or string. A regex is defined by a string in which certain characters (the so-called metacharacters) can have a special meaning, which enables a single regex to match many different specific strings.

It's easier to understand this through example than through explanation. Here's a program with a regular expression that counts how many lines in a text file contain the word hello. A line that contains hello more than once is counted only once:

In [1]:
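import re

# Match "hello" or "Hello" anywhere in a line.
regexp = re.compile("[hH]ello")
count = 0

# The context manager closes the file even if an error occurs.
with open("textfile", "r") as file:
    for line in file:
        if regexp.search(line):
            count += 1  # a matching line is counted once, however many matches it has
print(count)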

2


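Regular expressions can also extract data. The next example uses named groups to pull the last name, first name, optional middle name, and phone number out of each line of a file:

In [2]:
import re

# Named groups (?P<name>...) label each field so it can be retrieved by name.
regexp = re.compile(r"(?P<last>[-a-zA-Z]+),"              # last name, then a comma
                    r" (?P<first>[-a-zA-Z]+)"             # first name
                    r"( (?P<middle>([-a-zA-Z]+)))?"       # optional middle name
                    r": (?P<phone>(\d{3}-)?\d{3}-\d{4})"  # phone, optional area code
                    )

with open("textfile", "r") as file:
    for line in file:
        result = regexp.search(line)
        if result is None:
            print("Oops, I don't think this is a record")
        else:
            lastname = result.group('last')
            firstname = result.group('first')
            middlename = result.group('middle')
            # An unmatched optional group returns None.
            if middlename is None:
                middlename = ""
            phonenumber = result.group('phone')
            # print('Name:', firstname, middlename, lastname, ' Number:', phonenumber)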

Oops, I don't think this is a record
Oops, I don't think this is a record
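
Because the contents of "textfile" are not shown, the output above only tells us that two lines were not records. The sketch below runs a similar pattern on a few in-memory lines to show what a successful match returns. The sample lines and the helper `parse_record` are hypothetical, and the optional area code is written as `(\d{3}-)?`, which is an assumption about the intended phone format.

```python
import re

# Pattern adapted from the cell above; the area-code group (\d{3}-)? is an
# assumption about the intended phone format.
RECORD = re.compile(
    r"(?P<last>[-a-zA-Z]+),"
    r" (?P<first>[-a-zA-Z]+)"
    r"( (?P<middle>[-a-zA-Z]+))?"
    r": (?P<phone>(\d{3}-)?\d{3}-\d{4})"
)

# Hypothetical sample lines standing in for the contents of "textfile".
SAMPLE_LINES = [
    "Doe, John: 555-1234",
    "Smith, Mary Ann: 800-555-9876",
    "hello world",
]


def parse_record(line):
    """Return a dict of the named groups, or None if the line is not a record."""
    match = RECORD.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # An unmatched optional group is None; normalize it to an empty string.
    fields["middle"] = fields["middle"] or ""
    return fields


for line in SAMPLE_LINES:
    print(parse_record(line))
```

Returning `None` for non-records lets the caller decide how to report them, instead of printing from inside the parsing logic.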


# 1.3 Tokenization
Tokenization is the process of splitting input text into smaller units. There are two types: word tokenization and sentence tokenization. Word tokenization splits a sentence into tokens (roughly, words and punctuation marks), as in the examples of section 1.1. Sentence tokenization, on the other hand, splits a piece of text that may contain several sentences into individual sentences. In NLP, "tokenization" on its own usually means word tokenization.

Many NLP libraries and frameworks support tokenization out of the box, because it is one of the most fundamental and widely used pre-processing steps in NLP. In what follows, I’d like to show you how to do tokenization using two popular NLP libraries—NLTK (https://www.nltk.org/) and spaCy (https://spacy.io/).

In [3]:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

In [4]:
s = '''Good muffins cost $3.88\nin New York.  Please buy me two of them.\n\nThanks.'''

In [5]:
word_tokenize(s)

['Good',
 'muffins',
 'cost',
 '$',
 '3.88',
 'in',
 'New',
 'York',
 '.',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 'Thanks',
 '.']

In [6]:
sent_tokenize(s)

['Good muffins cost $3.88\nin New York.',
 'Please buy me two of them.',
 'Thanks.']

NLTK implements a wide range of tokenizers in addition to the default one we used here. Its documentation page (https://www.nltk.org/api/nltk.tokenize.html) is a good starting point if you are interested in exploring more options.

You can tokenize text with spaCy as follows. The word tokens are shown below; sentence boundaries are available through doc.sents:

In [7]:
import spacy

In [8]:
nlp = spacy.load("en_core_web_sm")

In [9]:
doc = nlp(s)

In [10]:
[token.text for token in doc]

['Good',
 'muffins',
 'cost',
 '$',
 '3.88',
 '\n',
 'in',
 'New',
 'York',
 '.',
 ' ',
 'Please',
 'buy',
 'me',
 'two',
 'of',
 'them',
 '.',
 '\n\n',
 'Thanks',
 '.']

Note that the results from NLTK and spaCy are slightly different from each other. For example, spaCy’s word tokenizer leaves newlines ('\n') intact. The behavior of tokenizers differs from one implementation to another, and there is no single standard solution that every NLP practitioner agrees upon. Although standard libraries such as NLTK and spaCy give a good baseline, be ready to experiment depending on your task and data. Also, if you are dealing with languages other than English, your options may vary (and might be quite limited depending on the language). If you are familiar with the Java ecosystem, Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/) is another good NLP framework worth checking out.

Finally, there is an increasingly popular and important tokenization method for neural network-based NLP models called byte-pair encoding (BPE). Byte-pair encoding is a purely statistical technique that splits text into sequences of characters in any language, relying not on heuristic rules (such as spaces and punctuation) but only on character statistics from the dataset.
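
To make the idea concrete, here is a minimal sketch of how BPE merges could be learned from a toy corpus. It is a simplified illustration, not the implementation of any particular library: the function names are hypothetical, the corpus is made up, and the text is pre-split on whitespace with an end-of-word marker `</w>`, which is a simplification many BPE variants use.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, most frequent pair first."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = dict(Counter(tuple(w) + ("</w>",) for w in words))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab


# Made-up toy corpus for illustration.
corpus = ("low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3).split()
merges, vocab = learn_bpe(corpus, num_merges=3)
print(merges)
print(vocab)
```

Frequent character sequences are merged first, so common words end up as single symbols while rare words remain split into smaller pieces.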

We can now load the Stack Exchange dataset and tokenize it with spaCy:

In [12]:
import pandas as pd
import numpy as np
import spacy

In [13]:
nlp = spacy.load("en_vectors_web_lg")

In [14]:
stack_df = pd.read_csv('dataset/stackexchange_812k.csv')

In [15]:
stack_df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category
0,1,,,Eliciting priors from experts,title
1,2,,,What is normality?,title
2,3,,,What are some valuable Statistical Analysis op...,title
3,4,,,Assessing the significance of differences in d...,title
4,6,,,The Two Cultures: statistics vs. machine learn...,title


In [16]:
# Lowercase the text before tokenizing.
stack_df['text_tokens'] = stack_df.text.str.lower()

In [17]:
# Cast every value to str so that missing values (NaN) do not break the tokenizer.
lines = [str(l) for l in stack_df.text_tokens.values]

In [18]:
# Tokenize each text with spaCy; each row becomes a list of Token objects.
stack_df.text_tokens = [list(nlp(t)) for t in lines]

In [19]:
# Export the original and tokenized text as a tab-separated file.
stack_df.to_csv("stackexchange.csv", index=False, sep='\t',
                encoding='utf-8', columns=['text', 'text_tokens'])

In [21]:
stack_df.head()

Unnamed: 0,post_id,parent_id,comment_id,text,category,text_tokens
0,1,,,Eliciting priors from experts,title,"[eliciting, priors, from, experts]"
1,2,,,What is normality?,title,"[what, is, normality, ?]"
2,3,,,What are some valuable Statistical Analysis op...,title,"[what, are, some, valuable, statistical, analy..."
3,4,,,Assessing the significance of differences in d...,title,"[assessing, the, significance, of, differences..."
4,6,,,The Two Cultures: statistics vs. machine learn...,title,"[the, two, cultures, :, statistics, vs., machi..."
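
The cells above do not cover steps 3 and 4 of the workflow (removing missing texts and filtering by length). A minimal sketch of those two steps with pandas follows. The helper `filter_texts` and the word-count thresholds are hypothetical and should be tuned after inspecting the length distribution of the actual corpus; the small frame stands in for the real dataset.

```python
import pandas as pd

# Hypothetical thresholds, chosen only for illustration.
MIN_WORDS = 4
MAX_WORDS = 5000


def filter_texts(df, text_col="text", min_words=MIN_WORDS, max_words=MAX_WORDS):
    """Drop rows with missing text, then rows that are too short or too long."""
    df = df.dropna(subset=[text_col])
    # Approximate the length by the number of whitespace-separated words.
    n_words = df[text_col].str.split().str.len()
    return df[n_words.between(min_words, max_words)].reset_index(drop=True)


# Tiny in-memory frame standing in for the Stack Exchange dataset.
demo = pd.DataFrame(
    {"text": ["What is normality?", None, "ok", "Eliciting priors from experts"]}
)
result = filter_texts(demo, min_words=3)
print(result)
```

Filtering on word count rather than character count keeps the thresholds easy to interpret, but either measure would work here.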
