# Remark<div class='tocSkip'/>

The code in this notebook differs slightly from the printed book because some boilerplate was removed for print. For example, the notebook frequently uses pretty print (`pp.pprint`) instead of `print` and `tqdm`'s `progress_apply` instead of Pandas' `apply`.

Moreover, several layout and formatting commands, like `figsize` to control figure size or subplot commands, are removed in the book. Numbers in the book may have fewer decimal places than shown here in the notebook. We also used `textwrap` to control line breaks for the book. The respective statements here are just for formatting, so you can ignore them.

You may also find some lines marked with three hashes (`###`). Those are not in the book either, as they don't contribute to the concept.

All of this is done to simplify the code in the book and put the focus on the important parts.

# Setup<div class='tocSkip'/>

## Determine Environment<div class='tocSkip'/>

In [None]:
import sys
ON_COLAB = 'google.colab' in sys.modules

if ON_COLAB:
    BASE_DIR = "/content"
    print("You are working on Google Colab.")
    print(f'Files will be downloaded to "{BASE_DIR}".')
    # adjust release
    GIT_ROOT = "https://github.com/blueprints-for-text-analytics-python/early-release/raw/master"
else:
    BASE_DIR = ".."
    print("You are working on a local system.")
    print(f'Files will be searched relative to "{BASE_DIR}".')

## Download data files<div class='tocSkip'/>

In [None]:
import os, subprocess
from subprocess import PIPE

required_files = [
                  'settings.py',
                  'packages/blueprints/__init__.py',
                  'packages/blueprints/exploration.py',
                  'data/reddit-selfposts/reddit-selfposts.db.gz',
                  'ch04/colab_requirements.txt'
]

if ON_COLAB:
    print("Downloading required files ...")
    for file in required_files:
        cmd = ['wget', '-P', os.path.dirname(BASE_DIR+'/'+file), GIT_ROOT+'/'+file]
        print('!'+' '.join(cmd))
        stdout, stderr = subprocess.Popen(cmd, stdout=PIPE, stderr=PIPE).communicate()
        # print(stderr.decode()) # uncomment in case of problems

## Install required libraries and additional setup<div class='tocSkip'/>

It may take a moment to install the required Python libraries.

In [None]:
if ON_COLAB:
    print("\nAdditional setup ...")
    setup_cmds = ['pip install -r ch04/colab_requirements.txt',
                  'mkdir -p models',
                 # f'gunzip -k {BASE_DIR}/data/reddit-selfposts/reddit-selfposts.db.gz'
                 ]

    for cmd in setup_cmds:
        print('!'+cmd)
        if os.system(cmd) != 0:
            print('  --> ERROR')

## Common Imports<div class='tocSkip'/>

In [None]:
%matplotlib inline
%config InlineBackend.figure_format = 'png'

%run "$BASE_DIR/settings.py"

%reload_ext autoreload
%autoreload 2

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
# to import blueprints package
import os, sys
sys.path.append(BASE_DIR + '/packages')

In [None]:
# otherwise text between $ signs will be interpreted as formula and printed in italic
pd.set_option('display.html.use_mathjax', False)

# How to Prepare Textual Data For Statistics and Machine Learning
## What you'll learn and what we build


# A Data Preprocessing Pipeline


# Introducing the Data Set: Reddit Self Posts


## Loading Data into Pandas


In [None]:
posts_file = "../data/reddit-selfposts/rspct.tsv.gz"
posts_file = "../data/reddit-selfposts/rspct_autos.tsv.gz" ### for faster loads use this subset
posts_df = pd.read_csv(posts_file, sep='\t')

subred_file = "../data/reddit-selfposts/subreddit_info.csv.gz"
subred_df = pd.read_csv(subred_file).set_index(['subreddit'])

df = posts_df.join(subred_df, on='subreddit')
len(df) ###

In [None]:
# write subset with autos to rspct_autos.tsv.gz

# auto_subreddits = subred_df[subred_df['category_1'] == 'autos'].index.to_list()
# posts_df[posts_df.subreddit.isin(auto_subreddits)] \
#   .to_csv('../data/reddit-selfposts/rspct_autos.tsv.gz', sep='\t', index=False)

## Standardizing Attribute Names


In [None]:
print(df.columns)

In [None]:
column_mapping = {
    'id': 'id',
    'subreddit': 'subreddit',
    'title': 'title',
    'selftext': 'text',
    'category_1': 'category',
    'category_2': 'subcategory',  
    'category_3': None, # no data
    'in_data': None, # not needed
    'reason_for_exclusion': None # not needed
}

# define remaining columns
columns = [c for c in column_mapping.keys() if column_mapping[c] is not None]

# select and rename those columns
df = df[columns].rename(columns=column_mapping)

In [None]:
df = df[df['category'] == 'autos']
len(df) ###

In [None]:
pd.options.display.max_colwidth = None ###
df.sample(1, random_state=7).T
pd.options.display.max_colwidth = 200 ###

## Checking for Missing Values


In [None]:
df.isna().sum()

## Saving and Loading a Data Frame


In [None]:
df.to_pickle("reddit_dataframe.pkl")

In [None]:
import sqlite3

db_path = "../data/reddit-selfposts/reddit-selfposts.db"

con = sqlite3.connect(db_path)
df.to_sql("posts", con, index=False, if_exists="replace")
con.close()

In [None]:
import sqlite3 ###
db_path = "../data/reddit-selfposts/reddit-selfposts.db" ###
con = sqlite3.connect(db_path)
df = pd.read_sql("select * from posts", con)
con.close()

In [None]:
len(df)

# Cleaning Textual Data with Regular Expressions


In [None]:
text = """
After viewing the [PINKIEPOOL Trailer](https://www.youtu.be/watch?v=ieHRoHUg)
it got me thinking about the best match ups.
<lb>Here's my take:<lb><lb>[](/sp)[](/ppseesyou) Deadpool<lb>[](/sp)[](/ajsly)
Captain America<lb>"""

text = text.replace('\n', ' ').strip() ###
print(text) ###

## Blueprint: Identifying Dirty Data


In [None]:
import re

RE_SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')

def impurity(text, min_len=10):
    if text is None or len(text) < min_len:
        return 0
    else:
        # return share of suspicious characters in a text
        return len(RE_SUSPICIOUS.findall(text))/len(text)

print(impurity(text))
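
To get a feeling for the score, it can help to run the function on a few hand-made strings. The following is a minimal, self-contained sketch that repeats the definition from above; the sample strings are made up for illustration and are not taken from the Reddit data.

```python
import re

RE_SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')

def impurity(text, min_len=10):
    """Share of suspicious characters; 0 for missing or very short texts."""
    if text is None or len(text) < min_len:
        return 0
    return len(RE_SUSPICIOUS.findall(text)) / len(text)

# made-up samples for illustration
samples = [
    "short",                        # shorter than min_len, scored as 0
    "A perfectly clean sentence.",  # no suspicious characters
    "<lb>Hello &amp; bye<lb>",      # 5 suspicious characters out of 23
]

for s in samples:
    print(f"{impurity(s):.3f}  {s!r}")
```

Texts shorter than `min_len` are scored as 0 so that very short strings with a single special character do not dominate the ranking.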

In [None]:
pd.options.display.max_colwidth = 100 ###
# add new column to data frame
df['impurity'] = df['text'].progress_apply(impurity, min_len=10)

# get the top 3 records
df[['text', 'impurity']].sort_values(by='impurity', ascending=False).head(3)
pd.options.display.max_colwidth = 200 ###

In [None]:
from blueprints.exploration import count_words ###
count_words(df, column='text', preprocess=lambda t: re.findall(r'<[\w/]*>', t))
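
The same kind of tag count can be sketched with the standard library alone. This is only an illustration of the idea and not the implementation of `count_words` from the `blueprints` package; the two sample texts are made up.

```python
import re
from collections import Counter

# made-up sample texts for illustration
texts = [
    "First line<lb>second line<lb>",
    "<tab>indented</tab> and a<lb>break",
]

tag_counts = Counter()
for t in texts:
    # same pattern as above: anything that looks like <tag> or </tag>
    tag_counts.update(re.findall(r'<[\w/]*>', t))

print(tag_counts.most_common())
```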

## Blueprint: Text-Cleaning with Regular Expressions


In [None]:
import html

def clean(text):
    # convert html escapes like &amp; to characters.
    text = html.unescape(text) 
    # tags like <tab>
    text = re.sub(r'<[^<>]*>', ' ', text)
    # markdown URLs like [Some text](https://....)
    text = re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1', text)
    # text or code in brackets like [0]
    text = re.sub(r'\[[^\[\]]*\]', ' ', text)
    # standalone sequences of specials, matches &# but not #cool
    text = re.sub(r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' ', text)
    # standalone sequences of hyphens like --- or ==
    text = re.sub(r'(?:^|\s)[\-=\+]{2,}(?:\s|$)', ' ', text)
    # sequences of white spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
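
Because the substitutions build on each other, it can be instructive to look at the intermediate result after every step. The sketch below applies the same patterns in the same order as `clean` and prints the text after each one. The helper name `clean_traced` and the sample string are assumptions made for this illustration.

```python
import html
import re

# (description, pattern, replacement) in the same order as in clean()
STEPS = [
    ("tags",           r'<[^<>]*>',                               ' '),
    ("markdown URLs",  r'\[([^\[\]]*)\]\([^\(\)]*\)',             r'\1'),
    ("brackets",       r'\[[^\[\]]*\]',                           ' '),
    ("specials",       r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' '),
    ("hyphens/equals", r'(?:^|\s)[\-=\+]{2,}(?:\s|$)',            ' '),
    ("whitespace",     r'\s+',                                    ' '),
]

def clean_traced(text):
    """Hypothetical helper: like clean(), but prints every intermediate text."""
    text = html.unescape(text)
    print(f"{'unescape':>15}: {text!r}")
    for name, pattern, repl in STEPS:
        text = re.sub(pattern, repl, text)
        print(f"{name:>15}: {text!r}")
    return text.strip()

# made-up sample for illustration
sample = "See [the docs](https://example.com) &amp; enjoy<lb>---<lb>[0] done"
result = clean_traced(sample)
print(result)
```

The order matters: the markdown URL pattern has to run before the generic bracket pattern, otherwise the link text would be removed together with the brackets.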

In [None]:
clean_text = clean(text)
print(clean_text)
print("Impurity:", impurity(clean_text))

In [None]:
# just for formatting - ignore
import textwrap

for line in textwrap.wrap(clean_text):
    print(line)
print("Impurity:", impurity(clean_text))

In [None]:
df['clean_text'] = df['text'].progress_apply(clean)
df['impurity']   = df['clean_text'].apply(impurity, min_len=20)

In [None]:
df[['clean_text', 'impurity']].sort_values(by='impurity', ascending=False).head(3)

In [None]:
# just for formatting in the book - ignore
df[['clean_text', 'impurity']].sort_values(by='impurity', ascending=False).head(3) \
.applymap(lambda x: x if isinstance(x, float) else x[:80]+'...')

## Removing Noise with textacy 


In [None]:
from textacy.preprocessing.resources import RE_URL

count_words(df, column='clean_text', preprocess=RE_URL.findall).head(3)

### Pattern-based Data Masking with Textacy


In [None]:
from textacy.preprocessing.replace import replace_urls

text = "Check out https://spacy.io/usage/spacy-101"

# using default substitution _URL_
print(replace_urls(text))

### Unicode Character Normalization


In [None]:
text = "The café “Saint-Raphaël” is loca-\nted on Côte dʼAzur."

In [None]:
import textacy.preprocessing as tprep

def normalize(text):
    text = tprep.normalize_hyphenated_words(text)
    text = tprep.normalize_quotation_marks(text)
    text = tprep.normalize_unicode(text)
    text = tprep.remove_accents(text)
    return text

print(normalize(text))
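
If textacy is not available, accent removal can be approximated with the standard library. The sketch below decomposes each character with `unicodedata` and drops the combining marks. It illustrates the general technique only; the function name `strip_accents` is hypothetical, and the code is not claimed to be identical to textacy's implementation.

```python
import unicodedata

def strip_accents(text):
    """Hypothetical stdlib stand-in for accent removal."""
    # NFKD splits 'é' into 'e' plus a combining acute accent
    decomposed = unicodedata.normalize('NFKD', text)
    # drop the combining marks, keep the base characters
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("café Saint-Raphaël, Côte"))
```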

In [None]:
df['clean_text'] = df['clean_text'].progress_map(normalize)

In [None]:
df['text'] = df['clean_text']
df.drop(columns=['clean_text', 'impurity'], inplace=True)

db_path = "../data/reddit-selfposts/reddit-selfposts.db" ###
con = sqlite3.connect(db_path)
df.to_sql("posts_cleaned", con, index=False, if_exists="replace")
con.close()

# Tokenization


In [None]:
text = """
2019-08-10 23:32: @pete/@louis - I don't have a well-designed 
solution for today's problem. The code of module AC68 should be -1. 
Have to think a bit... #goodnight ;-) 😩😬"""

## Tokenization with Regular Expressions


In [None]:
tokens = re.findall(r'\w\w+', text)
print("|".join(tokens))

In [None]:
# just for formatting - ignore
import textwrap

for line in textwrap.wrap(" ".join(re.findall(r'\w\w+', text))):
    print(line.replace(" ", "|"))

In [None]:
RE_TOKEN = re.compile(r"""
               ( [#]?[@\w'’\.\-\:]*\w     # words, hash tags and email addresses
               | [:;<]\-?[\)\(3]          # coarse pattern for basic text emojis
               | [\U0001F100-\U0001FFFF]  # coarse code range for unicode emojis
               )
               """, re.VERBOSE)

def tokenize(text):
    return RE_TOKEN.findall(text)

tokens = tokenize(text)
print("|".join(tokens))

In [None]:
# just for formatting - ignore
for line in textwrap.wrap(" ".join(tokens)):
    print(line.replace(" ", "|"))

In [None]:
df['tokens'] = df['text'].progress_map(tokenize)

## Tokenization with NLTK


In [None]:
import nltk

nltk.download('punkt') ###
tokens = nltk.tokenize.word_tokenize(text)
print("|".join(tokens))

In [None]:
# just for formatting - ignore
for line in textwrap.wrap(" ".join(tokens)):
    print(line.replace(" ", "|"))

## Recommendations for Tokenization


# Linguistic Processing with spaCy


## Instantiating a Pipeline


In [None]:
import spacy
nlp = spacy.load('en')

In [None]:
nlp.pipeline

In [None]:
nlp = spacy.load("en", disable=["parser", "ner"])

## Processing Text


In [None]:
nlp = spacy.load("en")
text = "My best friend Ryan Peters likes fancy adventure games."
doc = nlp(text)

In [None]:
for token in doc:
    print(token, end="|")

In [None]:
from blueprints.preparation import display_nlp
display_nlp(doc)
### Table tab-nlp-result: Result of spaCy's document processing as generated by `display_nlp`

In [None]:
def display_nlp(doc, include_punct=False):
    """Generate data frame for visualization of spaCy tokens."""

    rows = []
    for i, t in enumerate(doc):
        if not t.is_punct or include_punct:
            row = {'token': i, 
                   'text': t.text, 'lemma': t.lemma_, 
                   'is_stop': t.is_stop, 'is_alpha': t.is_alpha,
                   'pos': t.pos_, 'dep': t.dep_, 
                   'ent_type': t.ent_type_}
            rows.append(row)
    
    df = pd.DataFrame(rows).set_index('token')
    df.index.name = None
    
    return df

## Modifying Tokenization


In [None]:
text = "@Pete: choose low-carb #food #eat-smart. _url_ ;-) 😋👍"
nlp = spacy.load('en') ###
doc = nlp(text)

print(*[token for token in doc], sep="|")

In [None]:
import re ###
import spacy ###
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, \
                       compile_infix_regex, compile_suffix_regex

def custom_tokenizer(nlp):
    
    # use default patterns except the ones matched by re.search
    prefixes = [pattern for pattern in nlp.Defaults.prefixes 
                if pattern not in ['-', '_', '#']]
    suffixes = [pattern for pattern in nlp.Defaults.suffixes
                if pattern not in ['_']]
    infixes  = [pattern for pattern in nlp.Defaults.infixes
                if not re.search(pattern, 'xx-xx')]

    return Tokenizer(vocab          = nlp.vocab, 
                     rules          = nlp.Defaults.tokenizer_exceptions,
                     prefix_search  = compile_prefix_regex(prefixes).search,
                     suffix_search  = compile_suffix_regex(suffixes).search,
                     infix_finditer = compile_infix_regex(infixes).finditer,
                     token_match    = nlp.Defaults.token_match)

nlp = spacy.load('en')
nlp.tokenizer = custom_tokenizer(nlp)

doc = nlp(text)
print(*[token for token in doc], sep="|")
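
The infix filter above keeps every pattern that does not fire on the probe string `'xx-xx'`, which removes the rules that would split words at an inner hyphen. The following sketch shows that selection logic in isolation. The three patterns are hypothetical stand-ins chosen for illustration and are not spaCy's actual defaults.

```python
import re

# Hypothetical infix patterns for illustration; NOT spaCy's default list
infix_patterns = [
    r'\.\.+',                 # ellipsis
    r'(?<=[a-z])-(?=[a-z])',  # hyphen between two letters
    r'[,;]',                  # comma or semicolon
]

# keep only the patterns that do not match inside a hyphenated word
kept = [p for p in infix_patterns if not re.search(p, 'xx-xx')]
print(kept)
```

Probing with a sample string avoids having to know the exact text of the default patterns, which may differ between spaCy versions.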

## Lemmatization


In [None]:
stemmer = nltk.snowball.SnowballStemmer("english")

words = "university universe easily easy"

for word in words.split():
    print(f"{word:>12} --> {stemmer.stem(word)}")

## Stop Word Detection


In [None]:
from spacy.lang.en import STOP_WORDS as stop_words
print(len(stop_words))

In [None]:
nlp = spacy.load('en')
nlp.vocab['down'].is_stop = False
nlp.vocab['Dear'].is_stop = True
nlp.vocab['Regards'].is_stop = True

In [None]:
text = "Dear Ryan, we need to sit down and talk. Regards, Pete"
doc = nlp.make_doc(text) # only tokenize
    
# keep only tokens that are neither stop words nor punctuation
tokens_wo_stop = [token for token in doc
                  if not token.is_stop and not token.is_punct]
print(*tokens_wo_stop, sep='|')

## Part-of-Speech Tagging


# Blueprints for Feature Extraction


## Extracting Words based on Part-of-Speech


In [None]:
import textacy

text = "My best friend Ryan Peters likes fancy adventure games."
doc = nlp(text)

tokens = textacy.extract.words(doc, 
            filter_stops = True,           # default True, no stopwords
            filter_punct = True,           # default True, no punctuation
            filter_nums = True,            # default False, no numbers
            include_pos = ['ADJ', 'NOUN'], # default None = include all
            exclude_pos = None,            # default None = exclude none
            min_freq = 1)                  # minimum frequency of words

In [None]:
def extract_lemmas(doc, **kwargs):
    return [t.lemma_ for t in textacy.extract.words(doc, **kwargs)]

lemmas = extract_lemmas(doc, include_pos=['ADJ', 'NOUN'])
print(*lemmas, sep='|')

## Extracting Noun Chunks


In [None]:
def ngrams(tokens, n=2, sep=' '):
    return [sep.join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])]

# just for formatting - ignore
for t in textwrap.wrap(' '.join(ngrams(tokenize(text), sep='_'))):
    print(t.replace(' ', '|'))
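
A short usage example on a plain token list may make the `zip` trick easier to follow: the list is zipped with shifted copies of itself, so every tuple holds `n` consecutive tokens. The token list below is made up for illustration.

```python
def ngrams(tokens, n=2, sep=' '):
    # zip the token list with n-1 shifted copies of itself
    return [sep.join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])]

# made-up token list for illustration
tokens = ["my", "best", "friend", "likes", "games"]

bigrams = ngrams(tokens)
trigrams = ngrams(tokens, n=3, sep='_')

print(bigrams)
print(trigrams)
```

Because `zip` stops at the shortest input, a list with fewer than `n` tokens simply yields an empty result.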

In [None]:
spans = textacy.extract.matches(doc, patterns=["POS:ADJ:? POS:NOUN:+"])
print(*spans, sep='|')

In [None]:
print(*doc.noun_chunks, sep='|')

In [None]:
def extract_noun_chunks(doc, include_pos=['NOUN'], sep='_'):

    chunks = []
    for noun_chunk in doc.noun_chunks:
        chunk = [token.lemma_ for token in noun_chunk
                 if token.pos_ in include_pos]
        if len(chunk) >= 2:
            chunks.append(sep.join(chunk))
    return chunks

In [None]:
noun_chunks = extract_noun_chunks(doc, include_pos=['ADJ', 'NOUN', 'PROPN'])
print(*noun_chunks, sep='|')

## Extracting Named Entities


In [None]:
def extract_entities(doc, include_types=None, sep='_'):

    ents = textacy.extract.entities(doc, 
             include_types=include_types, 
             exclude_types=None, 
             drop_determiners=True, 
             min_freq=1)
    
    return [re.sub(r'\s+', sep, e.lemma_)+'/'+e.label_ for e in ents]

In [None]:
nlp = spacy.load('en') ###
text = "George Washington was the first president of the United States."
doc = nlp(text)

entities = extract_entities(doc, ['PERSON', 'GPE'])
print(*entities, sep='|')

# Extracting NLP Features on a Large Dataset
## One Function to Get It All


In [None]:
nlp = spacy.load('en') # load model
nlp.tokenizer = custom_tokenizer(nlp) # optional

def nlp_extract(text):

    doc = nlp(text)
    
    lemmas          = extract_lemmas(doc, exclude_pos = ['PART', 'PUNCT', 
                                           'DET', 'PRON', 'SYM', 'SPACE'],
                                          filter_stops = False)
    adjs_verbs      = extract_lemmas(doc, include_pos = ['ADJ', 'VERB'])
    nouns           = extract_lemmas(doc, include_pos = ['NOUN', 'PROPN'])
    noun_chunks     = extract_noun_chunks(doc, ['NOUN'])
    adj_noun_chunks = extract_noun_chunks(doc, ['NOUN', 'ADJ'])
    entities        = extract_entities(doc, ['PERSON', 'ORG', 'GPE', 'LOC'])

    return lemmas, adjs_verbs, nouns, noun_chunks, adj_noun_chunks, entities

In [None]:
text = "My best friend Ryan Peters likes fancy adventure games."
print(*nlp_extract(text), sep='\n')

## Creating Multiple Columns in a Data Frame


In [None]:
import sqlite3 ###
db_path = "../data/reddit-selfposts/reddit-selfposts.db" ###
con = sqlite3.connect(db_path)
df = pd.read_sql("select * from posts_cleaned", con)
con.close()

df['text'] = df['title'] + ': ' + df['text']

In [None]:
# for faster processing
# df = df.sample(500) ###
# len(df) ###

In [None]:
# define column names
nlp_columns = ['lemmas', 'adjs_verbs', 'nouns', 'noun_chunks', 
               'adj_noun_chunks', 'entities']

### this takes about 10-15 min
df[nlp_columns] = df.progress_apply(lambda row: nlp_extract(row['text']), 
                                    axis='columns', result_type='expand')
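
The assignment above relies on `result_type='expand'`, which turns the tuple returned for each row into separate columns. The mechanism can be tried on a toy data frame without spaCy. In the sketch below, `toy_extract` and the column names are hypothetical stand-ins for `nlp_extract` and `nlp_columns`.

```python
import pandas as pd

def toy_extract(text):
    """Hypothetical stand-in for nlp_extract: returns a tuple of two lists."""
    words = text.split()
    long_words = [w for w in words if len(w) > 4]
    return words, long_words

df = pd.DataFrame({'text': ["fancy adventure games", "my best friend"]})

new_columns = ['words', 'long_words']
# each element of the returned tuple ends up in its own column
df[new_columns] = df.apply(lambda row: toy_extract(row['text']),
                           axis='columns', result_type='expand')
print(df)
```

The number of names in `new_columns` has to match the number of elements in the returned tuple.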

In [None]:
count_words(df, 'noun_chunks').head(10).plot(kind='barh', figsize=(8,3)).invert_yaxis()
### img_width: 80%

## Persisting the Result


In [None]:
import sqlite3 ###
df[nlp_columns] = df[nlp_columns].applymap(lambda items: ' '.join(items))

con = sqlite3.connect(db_path) 
df.to_sql("posts_nlp", con, index=False, if_exists="replace")
con.close() 
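
SQLite columns hold scalar values, which is presumably why the token lists are joined into space-separated strings before saving. Since multi-word items such as noun chunks use `_` as the inner separator, a plain `split()` restores the original list after loading. The round trip can be sketched with an in-memory database; the table name `demo` and the sample values are made up for this illustration.

```python
import sqlite3

# made-up noun chunks; inner separator is '_' as in extract_noun_chunks
noun_chunks = ["adventure_game", "oil_change", "brake_pad"]

con = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
con.execute("CREATE TABLE demo (noun_chunks TEXT)")  # hypothetical table
con.execute("INSERT INTO demo VALUES (?)", (' '.join(noun_chunks),))
stored, = con.execute("SELECT noun_chunks FROM demo").fetchone()
con.close()

# splitting on whitespace restores the original list
restored = stored.split()
print(stored)
print(restored)
```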

### A Note on Execution Time


# There is More
## Language Detection
## Spell Checking
## Token Normalization


# Closing Remarks and Recommendations
