# Data Acquisition

How to get the data?
1. Use a public dataset
2. Scrape data
3. Product intervention: the AI team works with the product team to collect more data by developing better instrumentation.
4. Data augmentation
    - Synonym replacement
    - Back translation
    - TF-IDF-based word replacement
    - Bigram flipping
    - Replacing entities
    - Adding noise to data
    - Advanced techniques: Snorkel, Easy Data Augmentation (EDA), Active learning
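
Two of the augmentation techniques listed above, synonym replacement and bigram flipping, can be sketched in a few lines. This is only an illustration: the `SYNONYMS` table is a hypothetical hand-written stand-in (a real project might draw synonyms from WordNet or embeddings), and the function names are made up for this example.

```python
import random

# Hypothetical, hand-written synonym table used only for illustration.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "happy": ["glad", "cheerful"],
}


def synonym_replacement(tokens, rng):
    """Replace every token that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS else t for t in tokens]


def bigram_flip(tokens, rng):
    """Swap one randomly chosen pair of adjacent tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    flipped = list(tokens)
    flipped[i], flipped[i + 1] = flipped[i + 1], flipped[i]
    return flipped


rng = random.Random(0)  # fixed seed so repeated runs give the same result
sentence = "the quick dog looks happy".split()
print(synonym_replacement(sentence, rng))
print(bigram_flip(sentence, rng))
```

Each call yields a new variant of the sentence that keeps roughly the same meaning, which is the point of augmenting a small dataset.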

# Text Extraction and Cleanup

## HTML parsing and cleanup

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

In [2]:
myurl = "https://stackoverflow.com/questions/415511/how-to-get-the-current-time-in-python"

html = urlopen(myurl).read()
soupified = BeautifulSoup(html, "html.parser")

question = soupified.find("div", {"class": "question"})
questiontext = question.find("div", {"class": "s-prose js-post-body"})
print("Question: \n", questiontext.get_text().strip())

answer = soupified.find("div", {"class": "answer"})
answertext = answer.find("div", {"class": "s-prose js-post-body"})
print("Best answer: \n", answertext.get_text().strip())

Question: 
 What is the module/method used to get the current time?
Best answer: 
 Use:
>>> import datetime
>>> datetime.datetime.now()
datetime.datetime(2009, 1, 6, 15, 8, 24, 78915)

>>> print(datetime.datetime.now())
2009-01-06 15:08:24.789150

And just the time:
>>> datetime.datetime.now().time()
datetime.time(15, 8, 24, 78915)

>>> print(datetime.datetime.now().time())
15:08:24.789150

See the documentation for more information.
To save typing, you can import the datetime object from the datetime module:
>>> from datetime import datetime

Then remove the leading datetime. from all of the above.
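
The cell above depends on a live page and on BeautifulSoup. As a rough offline sketch of the same idea (pull the visible text out of HTML and drop the markup), the following uses only the standard library's `html.parser`. The `TextExtractor` class, the `html_to_text` helper, and the sample HTML string are assumptions made for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML fragment as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ("<div class='question'><p>What is the <b>module</b> used?</p>"
          "<script>var x = 1;</script></div>")
print(html_to_text(sample))  # What is the module used?
```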


## Unicode normalization

In [3]:
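text = 'I love 🍕!  Shall we book a 🚗 to gizza?'
# Encode to UTF-8 bytes so emoji and other non-ASCII characters become explicit byte sequences
encoded_text = text.encode("utf-8")
print(encoded_text)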

b'I love \xf0\x9f\x8d\x95!  Shall we book a \xf0\x9f\x9a\x97 to gizza?'
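
Encoding to UTF-8 makes the bytes explicit, but it does not by itself make visually identical strings compare equal. A minimal sketch of that second step, using the standard library's `unicodedata.normalize`, is shown here; the example strings are chosen only for illustration.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

# The two strings render identically but differ code point by code point.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 4 5

# NFC composes characters, so both spellings collapse to the same string.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)                  # True

# Emoji have no decomposition, so normalization leaves them unchanged.
text = "I love 🍕!"
print(unicodedata.normalize("NFC", text) == text)  # True
```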


## Spelling correction

In [4]:
# Requires an upgraded (paid) Azure account and a valid subscription key
# Tutorial: https://docs.microsoft.com/en-us/azure/cognitive-services/bing-spell-check/quickstarts/python

import requests
import json

api_key = "<ENTER-KEY-HERE>"
example_text = "Hollo, wrld"
endpoint = "https://api.cognitive.microsoft.com/bing/v7.0/SpellCheck"

data = {'text': example_text}
params = {
    'mkt':'en-us',
    'mode':'proof'
    }
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Ocp-Apim-Subscription-Key': api_key,
    }
response = requests.post(endpoint, headers=headers, params=params, data=data)
json_response = response.json()
print(json.dumps(json_response, indent=4))

{
    "error": {
        "code": "401",
        "message": "Access denied due to invalid subscription key or wrong API endpoint. Make sure to provide a valid key for an active subscription and use a correct regional API endpoint for your resource."
    }
}
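
Because the request above fails without a valid subscription key, here is a small offline sketch of a different technique: a dictionary-based corrector that looks for a known word one edit away. It is not the Bing API's method. The `VOCABULARY` set is a hypothetical toy word list, and ties between candidates are broken alphabetically for simplicity.

```python
import string

# Hypothetical toy vocabulary; a real corrector would use a large word list
# with frequencies to rank candidates.
VOCABULARY = {"hello", "help", "world"}


def edits1(word):
    """All strings one edit (delete, swap, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)


def correct(word):
    """Return the lowercased word if known, else a known word one edit away."""
    lowered = word.lower()
    if lowered in VOCABULARY:
        return lowered
    candidates = sorted(edits1(lowered) & VOCABULARY)  # alphabetical tie-break
    return candidates[0] if candidates else lowered


print([correct(w) for w in "Hollo wrld".split()])  # ['hello', 'world']
```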


## System-specific error correction

In [5]:
from PIL import Image
import pytesseract
from pytesseract import image_to_string

# Tesseract must be installed before pytesseract can call it; follow these steps:
# https://stackoverflow.com/questions/50951955/pytesseract-tesseractnotfound-error-tesseract-is-not-installed-or-its-not-i
pytesseract.pytesseract.tesseract_cmd = r'D:\Program Files\Tesseract-OCR\tesseract.exe'

filename = "scanned_document.png"
text = image_to_string(Image.open(filename))
print(text)

in the nineteenth century the only Kind of linguistics considered
seriously was this comparative and historical study of words in languages
known or believed to be cognate—say the Semitic languages, or the Indo-
European languages. It is significant that the Germans who really made
the subject what it was, used the term Indo-germanisch. Those who know
the popular works of Otto Jespersen will remember how fitmly he
declares that linguistic science is historical. And those who have noticed



# Pre-Processing
- Preliminaries:
    - Sentence segmentation, word tokenization
- Frequent steps:
    - Stop word removal, stemming and lemmatization, removing digits/punctuation, lowercasing.
- Other steps:
    - Normalization, language detection, code mixing, transliteration, etc.
- Advanced processing:
    - POS tagging, parsing, coreference resolution, etc.

In [6]:
# Sentence segmentation
from nltk.tokenize import sent_tokenize, word_tokenize

mytext = "In the previous chapter, we saw examples of some common NLP \
applications that we might encounter in everyday life. If we were asked to \
build such an application, think about how we would approach doing so at our \
organization. We would normally walk through the requirements and break the \
problem down into several sub-problems, then try to develop a step-by-step \
procedure to solve them. Since language processing is involved, we would also \
list all the forms of text processing needed at each step. This step-by-step \
processing of text is known as pipeline. It is the series of steps involved in \
building any NLP model. These steps are common in every NLP project, so it \
makes sense to study them in this chapter. Understanding some common procedures \
in any NLP pipeline will enable us to get started on any NLP problem encountered \
in the workplace. Laying out and developing a text-processing pipeline is seen \
as a starting point for any NLP application development process. In this \
chapter, we will learn about the various steps involved and how they play \
important roles in solving the NLP problem and we’ll see a few guidelines \
about when and how to use which step. In later chapters, we’ll discuss \
specific pipelines for various NLP tasks (e.g., Chapters 4–7)."

my_sentences = sent_tokenize(mytext)
my_sentences

['In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life.',
 'If we were asked to build such an application, think about how we would approach doing so at our organization.',
 'We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them.',
 'Since language processing is involved, we would also list all the forms of text processing needed at each step.',
 'This step-by-step processing of text is known as pipeline.',
 'It is the series of steps involved in building any NLP model.',
 'These steps are common in every NLP project, so it makes sense to study them in this chapter.',
 'Understanding some common procedures in any NLP pipeline will enable us to get started on any NLP problem encountered in the workplace.',
 'Laying out and developing a text-processing pipeline is seen as a starting point for any NLP application developmen

In [7]:
# word tokenization
for sentence in my_sentences:
    print(sentence)
    print(word_tokenize(sentence))

In the previous chapter, we saw examples of some common NLP applications that we might encounter in everyday life.
['In', 'the', 'previous', 'chapter', ',', 'we', 'saw', 'examples', 'of', 'some', 'common', 'NLP', 'applications', 'that', 'we', 'might', 'encounter', 'in', 'everyday', 'life', '.']
If we were asked to build such an application, think about how we would approach doing so at our organization.
['If', 'we', 'were', 'asked', 'to', 'build', 'such', 'an', 'application', ',', 'think', 'about', 'how', 'we', 'would', 'approach', 'doing', 'so', 'at', 'our', 'organization', '.']
We would normally walk through the requirements and break the problem down into several sub-problems, then try to develop a step-by-step procedure to solve them.
['We', 'would', 'normally', 'walk', 'through', 'the', 'requirements', 'and', 'break', 'the', 'problem', 'down', 'into', 'several', 'sub-problems', ',', 'then', 'try', 'to', 'develop', 'a', 'step-by-step', 'procedure', 'to', 'solve', 'them', '.']
Since

In [8]:
# frequent steps
from nltk.corpus import stopwords
from string import punctuation

def preprocess_corpus(texts):
    mystopwords = set(stopwords.words("english"))
    def remove_stops_digits(tokens):
        # Note: stop words are matched before lowercasing, so capitalized
        # stop words such as "In" or "We" pass the filter.
        return [token.lower() for token in tokens if token not in mystopwords and
                not token.isdigit() and token not in punctuation]
    return [remove_stops_digits(word_tokenize(text)) for text in texts]

In [9]:
preprocess_corpus([mytext])

[['in',
  'previous',
  'chapter',
  'saw',
  'examples',
  'common',
  'nlp',
  'applications',
  'might',
  'encounter',
  'everyday',
  'life',
  'if',
  'asked',
  'build',
  'application',
  'think',
  'would',
  'approach',
  'organization',
  'we',
  'would',
  'normally',
  'walk',
  'requirements',
  'break',
  'problem',
  'several',
  'sub-problems',
  'try',
  'develop',
  'step-by-step',
  'procedure',
  'solve',
  'since',
  'language',
  'processing',
  'involved',
  'would',
  'also',
  'list',
  'forms',
  'text',
  'processing',
  'needed',
  'step',
  'this',
  'step-by-step',
  'processing',
  'text',
  'known',
  'pipeline',
  'it',
  'series',
  'steps',
  'involved',
  'building',
  'nlp',
  'model',
  'these',
  'steps',
  'common',
  'every',
  'nlp',
  'project',
  'makes',
  'sense',
  'study',
  'chapter',
  'understanding',
  'common',
  'procedures',
  'nlp',
  'pipeline',
  'enable',
  'us',
  'get',
  'started',
  'nlp',
  'problem',
  'encountered',
  '
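
The output above still contains stop words such as 'in', 'if', and 'we', because `preprocess_corpus` compares each token with the lowercase stop-word list before lowercasing it. A minimal sketch of one way around this is to lowercase first. To stay runnable without NLTK data, it assumes a tiny hand-written stop-word set (`STOP_WORDS`) and a regex tokenizer (`simple_tokenize`); both are illustrative stand-ins, not NLTK's.

```python
import re
from string import punctuation

# Assumption: a tiny stand-in stop-word list so the sketch runs without
# downloading NLTK data; stopwords.words("english") would replace it in practice.
STOP_WORDS = {"in", "the", "we", "of", "if", "this", "it", "is", "a", "as"}


def simple_tokenize(text):
    """Split text into word tokens (keeping inner hyphens) and punctuation marks."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)


def preprocess(text, stop_words=STOP_WORDS):
    """Lowercase first, then drop stop words, digits, and punctuation."""
    tokens = (token.lower() for token in simple_tokenize(text))
    return [t for t in tokens
            if t not in stop_words and not t.isdigit() and t not in punctuation]


print(preprocess("In the previous chapter, we saw 3 step-by-step examples."))
# ['previous', 'chapter', 'saw', 'step-by-step', 'examples']
```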

In [10]:
# stemming and lemmatization
import nltk
nltk.download('wordnet')

from nltk.stem.porter import PorterStemmer

# Stemming strips suffixes by rule; the result need not be a dictionary word.
stemmer = PorterStemmer()
word1, word2 = "cars", "revolution"
print(stemmer.stem(word1), stemmer.stem(word2))


from nltk.stem import WordNetLemmatizer

# Lemmatization maps a word to its dictionary form, given its part of speech.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("better", pos="a"))  # "a" is for adjective

import spacy

# spaCy assigns a lemma to every token as part of its pipeline.
sp = spacy.load('en_core_web_sm')
token = sp('better')
for word in token:
    print(word.text, word.lemma_)

[nltk_data] Downloading package wordnet to C:\Users\Yasir Abdur
[nltk_data]     Rohman\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


car revolut
good
better well


In [11]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Charles Spencer Chaplin was born on 16 April 1889 toHannah Chaplin (born Hannah Harriet Pedlingham Hill) and Charles Chaplin Sr')
for token in doc:
    print(token.text, token.lemma_, token.pos_,
          token.shape_, token.is_alpha, token.is_stop)

Charles Charles PROPN Xxxxx True False
Spencer Spencer PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
was be AUX xxx True True
born bear VERB xxxx True False
on on ADP xx True True
16 16 NUM dd False False
April April PROPN Xxxxx True False
1889 1889 NUM dddd False False
toHannah toHannah PROPN xxXxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
( ( PUNCT ( False False
born bear VERB xxxx True False
Hannah Hannah PROPN Xxxxx True False
Harriet Harriet PROPN Xxxxx True False
Pedlingham Pedlingham PROPN Xxxxx True False
Hill Hill PROPN Xxxx True False
) ) PUNCT ) False False
and and CCONJ xxx True True
Charles Charles PROPN Xxxxx True False
Chaplin Chaplin PROPN Xxxxx True False
Sr Sr PROPN Xx True False


# Feature Engineering

- Classical NLP/ML Pipeline
- DL Pipeline

# Modeling
- Start with Simple Heuristics
- Building Your Model
- Building The Model
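
As an illustration of the "start with simple heuristics" step, a first spam filter could be a handful of regular expressions rather than a trained model. The patterns and the `min_hits` threshold below are hypothetical; real rules would come from inspecting the data.

```python
import re

# Hypothetical keyword patterns chosen only for this sketch.
SPAM_PATTERNS = [
    re.compile(r"\bfree\b", re.IGNORECASE),
    re.compile(r"\bwin(ner)?\b", re.IGNORECASE),
    re.compile(r"click here", re.IGNORECASE),
]


def looks_like_spam(message, min_hits=2):
    """Flag a message when at least min_hits patterns match."""
    hits = sum(1 for pattern in SPAM_PATTERNS if pattern.search(message))
    return hits >= min_hits


print(looks_like_spam("WINNER! Click here for a FREE prize"))     # True
print(looks_like_spam("Are we still free for lunch on Friday?"))  # False
```

A rule set like this gives a baseline to beat and can later serve as a source of features for a learned model.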

# Evaluation

The quality of the evaluation depends on two factors:
1. Using the right metric for evaluation
2. Following the right evaluation process

Two types of evaluation:
1. Intrinsic: focuses on intermediary objectives, e.g., precision and recall
2. Extrinsic: focuses on evaluating performance on the final objective, e.g., the amount of time users spend on a spam email
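
Precision and recall, the intrinsic metrics mentioned above, can be computed directly from true and predicted labels. This is a minimal sketch for binary labels; the `precision_recall` helper and the sample labels are made up for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against division by zero when there are no predicted or actual positives.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```

Here three of the four predicted spam messages are truly spam (precision 0.75), and three of the four actual spam messages are caught (recall 0.75).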

# Post-Modeling Phases
- Deployment
- Monitoring
- Updating the model