# Extract Negative Keyword List

## I. Load dataset

In [1]:
import pandas as pd
import numpy as np

**Convert Search-terms-report.csv to UTF-8 before loading it!**
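
If the export is not already UTF-8, one way to re-encode it is sketched below. The source encoding (`utf-16` here), the helper name `convert_to_utf8`, and the raw file name in the usage comment are assumptions rather than details taken from the report, so check the actual encoding of your export first.

```python
from pathlib import Path

def convert_to_utf8(src, dst, src_encoding="utf-16"):
    """Re-encode a text file as UTF-8.

    src_encoding is an assumption about how the report was exported;
    adjust it to match the real file.
    """
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding="utf-8")
    return dst

# Hypothetical usage (the raw file name is illustrative only):
# convert_to_utf8("Search-terms-report-raw.csv", "Search-terms-report.csv")
```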

In [4]:
# Read Search terms report
df = pd.read_csv('Search-terms-report.csv')
df.head()

Unnamed: 0,Search term,Match type,Added/Excluded,Campaign,Ad group,Keyword,Clicks,Impr.,Conversions,CTR,Avg. CPC,Cost / conv.,Cost
0,cost to lease lexus es 350,Phrase match,,NEWSEM_Models_C_E_C,Sedan_ES,+Lexus +ES,1,1,0.0,100.00%,$1.64,$0.00,$1.64
1,lexus fairfield,Exact match,Added,NEWSEM_DealerName_CE_E_EC,Neighborhood,+Lexus +Fairfield,1,1,0.0,100.00%,$0.81,$0.00,$0.81
2,lexus rx 450h,Phrase match,,NEWSEM_Models_C_E_C,SUV_RX,Lexus +RX,1,3,0.0,33.33%,$2.18,$0.00,$2.18
3,new lexus es in danville ca 94506,Broad match,,NEWSEM_DealerName_CE_E_EC,Neighborhood,+Lexus +Danville,1,1,0.0,100.00%,$0.86,$0.00,$0.86
4,lexus is300 for sale,Phrase match (close variant),,NEWSEM_Models_C_E_C,Sedan_IS,"""Lexus IS 300""",1,2,0.0,50.00%,$4.68,$0.00,$4.68


In [5]:
df.tail()

Unnamed: 0,Search term,Match type,Added/Excluded,Campaign,Ad group,Keyword,Clicks,Impr.,Conversions,CTR,Avg. CPC,Cost / conv.,Cost
188,lexus fairfield ca,Phrase match,,NEWSEM_DealerName_CE_E_EC,Neighborhood,+Lexus +Fairfield,1,2,0.0,50.00%,$0.61,$0.00,$0.61
189,lexus lx 570 lease,Phrase match,,NEWSEM_Models_C_E_C,LX,"""Lexus LX 570""",1,1,0.0,100.00%,$1.43,$0.00,$1.43
190,2019 lexus nx 300,Phrase match,,NEWSEM_Models_C_E_C,SUV_NX,+Lexus +NX,1,1,0.0,100.00%,$3.68,$0.00,$3.68
191,lexus es 300h for sale,Phrase match,,NEWSEM_Models_C_E_C,Sedan_ES,+Lexus +ES,1,1,0.0,100.00%,$3.33,$0.00,$3.33
192,2017 lexus nx,Phrase match,,NEWSEM_Models_C_E_C,SUV_NX,+Lexus +NX,1,1,0.0,100.00%,$4.87,$0.00,$4.87


In [6]:
# Keep only the keyword-related columns
dfn = df[['Search term', 'Ad group', 'Keyword']]
dfn.head()

Unnamed: 0,Search term,Ad group,Keyword
0,cost to lease lexus es 350,Sedan_ES,+Lexus +ES
1,lexus fairfield,Neighborhood,+Lexus +Fairfield
2,lexus rx 450h,SUV_RX,Lexus +RX
3,new lexus es in danville ca 94506,Neighborhood,+Lexus +Danville
4,lexus is300 for sale,Sedan_IS,"""Lexus IS 300"""


## II. Text Normalization (Text Wrangling)

Text normalization is a series of steps that wrangle, clean, and standardize textual data into a form that other NLP and analytics systems can consume as input. Besides tokenization, common techniques include cleaning text, case conversion, spelling correction, removal of stopwords and other unnecessary terms, stemming, and lemmatization. Text normalization is also often called text cleansing or text wrangling.

In [7]:
import nltk
import re
import string

### 1. Expanding Contractions

Contractions are shortened versions of words or syllables. They exist in both written and spoken forms and are created by removing specific letters and sounds. English contractions are often created by removing one of the vowels from the word, as in *do not* becoming *don't*.

Contractions pose a problem for NLP and text analytics because, to start with, the word contains an apostrophe. Ideally, we keep a mapping from contractions to their expansions and use it to expand every contraction in the text.

In [8]:
# CONTRACTION_MAP is assumed to be a dict (contraction -> expansion) from a local contractions.py
from contractions import CONTRACTION_MAP

# Define function to expand contractions
def expand_contractions(text):
    # Build a single alternation pattern from all known contractions
    contractions_pattern = re.compile('({})'.format('|'.join(CONTRACTION_MAP.keys())),
                                      flags=re.IGNORECASE | re.DOTALL)

    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        # Look up the exact match first, then fall back to its lowercase form
        expanded_contraction = CONTRACTION_MAP.get(match) or CONTRACTION_MAP.get(match.lower())
        # Preserve the case of the first character
        expanded_contraction = first_char + expanded_contraction[1:]
        return expanded_contraction

    expanded_text = contractions_pattern.sub(expand_match, text)
    # Drop any apostrophes that remain (for example, possessives)
    expanded_text = re.sub("'", "", expanded_text)
    return expanded_text
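
To see the expansion in isolation, here is a minimal sketch that uses a three-entry stand-in for the full map. `TOY_CONTRACTION_MAP` and `expand_contractions_toy` are hypothetical names introduced only for this illustration; the real `CONTRACTION_MAP` is assumed to be much larger.

```python
import re

# Hypothetical three-entry stand-in for the full CONTRACTION_MAP
TOY_CONTRACTION_MAP = {"don't": "do not", "what's": "what is", "i'm": "i am"}

def expand_contractions_toy(text, mapping=TOY_CONTRACTION_MAP):
    # One alternation pattern over all (escaped) contractions
    pattern = re.compile("|".join(re.escape(k) for k in mapping),
                         flags=re.IGNORECASE)

    def expand_match(m):
        matched = m.group(0)
        expanded = mapping[matched.lower()]
        # Keep the case of the first character of the original match
        return matched[0] + expanded[1:]

    expanded_text = pattern.sub(expand_match, text)
    # Remove apostrophes that were not part of a known contraction
    return expanded_text.replace("'", "")

print(expand_contractions_toy("what's the lexus dealer's number, i'm asking"))
# what is the lexus dealers number, i am asking
```

Note that the possessive in "dealer's" is not expanded; only its apostrophe is removed.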

### 2. Removing Special Characters

One important task in text normalization is removing unnecessary and special characters, such as special symbols or punctuation that occurs in sentences. This step can be performed before or after tokenization. The main reason is that punctuation and special characters often carry little significance when we analyze the text and extract features or information using NLP and ML.

In [26]:
# Define the function to remove special characters
def remove_characters(text):
    text = text.strip()
    PATTERN = '[^a-zA-Z0-9 ]' # matches anything that is not an ASCII letter, digit, or space
    filtered_text = re.sub(PATTERN, '', text)
    return filtered_text
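
A quick check of what this function does to keyword match-type notation (`+`, quotes, brackets) can be sketched as follows. The function body is repeated so the snippet runs on its own; the sample strings are keywords that appear in the report.

```python
import re

def remove_characters(text):
    text = text.strip()
    # Drop everything except ASCII letters, digits, and spaces
    return re.sub('[^a-zA-Z0-9 ]', '', text)

for raw in ['+Lexus +ES', '"Lexus IS 300"', '[Lexus NX 300h]', '+凌志']:
    print(repr(remove_characters(raw)))
# 'Lexus ES'
# 'Lexus IS 300'
# 'Lexus NX 300h'
# ''
```

Because the pattern keeps ASCII characters only, non-ASCII keywords such as `+凌志` are reduced to an empty string.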

### 3. Tokenizing Text

Tokenization can be defined as the process of breaking down or splitting textual data into smaller meaningful components called tokens.

**Sentence tokenization** is the process of splitting a text corpus into sentences, which act as the first level of tokens that make up the corpus. This is also known as sentence segmentation, because we try to segment the text into meaningful sentences.

**Word tokenization** is the process of splitting sentences into their constituent words. A sentence is a collection of words, and tokenization splits it into a list of words from which the sentence can be reconstructed.

In [10]:
# Define the tokenization function
def tokenize_text(text):
    word_tokens = nltk.word_tokenize(text)
    tokens = [token.strip() for token in word_tokens]
    return tokens
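
The two levels of tokenization can be illustrated without NLTK by a deliberately naive regex sketch. `split_sentences` and `split_words` are hypothetical helpers, and they are a simplification rather than the algorithm `nltk.word_tokenize` uses.

```python
import re

def split_sentences(text):
    """Naive sentence tokenizer: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def split_words(sentence):
    """Naive word tokenizer: runs of letters, digits, or underscores."""
    return re.findall(r'\w+', sentence)

text = "I want a new lexus. Is the es 350 for sale?"
sentences = split_sentences(text)
print(sentences)
# ['I want a new lexus.', 'Is the es 350 for sale?']
print([split_words(s) for s in sentences])
# [['I', 'want', 'a', 'new', 'lexus'], ['Is', 'the', 'es', '350', 'for', 'sale']]
```

Search terms are short single phrases, so only word tokenization is needed in this notebook.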

### 4. Removing Stopwords

*Stopwords* are words that have little or no significance. They are usually removed from text during processing so as to retain the words with the most significance and context. Stopwords are typically the words that occur most often when you aggregate a corpus by single tokens and check their frequencies. Words like *a*, *the*, and *me* are stopwords.

In [11]:
from nltk.corpus import stopwords
# In Python, searching a set is much faster than searching a list, 
# so convert the stop words to a set
stopword_list = set(stopwords.words("english"))

# Define function to remove stopwords
def remove_stopwords(tokens):
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    return filtered_tokens
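
The effect on search terms can be sketched with a small stand-in set. `TOY_STOPWORDS` and `remove_stopwords_toy` are hypothetical names; the six words chosen are all assumed to be in NLTK's English stopword list, which is far longer.

```python
# Hypothetical six-word stand-in for NLTK's English stopword list
TOY_STOPWORDS = {"a", "the", "to", "in", "for", "is"}

def remove_stopwords_toy(tokens, stopword_set=TOY_STOPWORDS):
    # Keep only tokens that are not stopwords
    return [t for t in tokens if t not in stopword_set]

print(remove_stopwords_toy("cost to lease lexus es 350".split()))
# ['cost', 'lease', 'lexus', 'es', '350']
print(remove_stopwords_toy("lexus is 300".split()))
# ['lexus', '300']
```

The second call shows a side effect worth knowing about: the model name "IS" collides with the stopword "is", which is why the keyword `"Lexus IS 300"` normalizes to `lexus 300` later in this notebook.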

### 5. Correcting Words

One of the main challenges in text normalization is the presence of incorrect words in the text. Incorrect here covers both words with spelling mistakes and words with repeated letters that add nothing to their meaning.

#### 5.1 Correcting Repeating Characters

In [12]:
from nltk.corpus import wordnet

# Define function to remove repeated characters
def remove_repeated_characters(tokens):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')
    match_substitution = r'\1\2\3'
    def replace(old_word):
        if wordnet.synsets(old_word):
            return old_word
        new_word = repeat_pattern.sub(match_substitution, old_word)
        return replace(new_word) if new_word != old_word else new_word

    correct_tokens = [replace(word) for word in tokens]
    return correct_tokens
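
The recursion is easier to follow with a small vocabulary in place of WordNet. In the sketch below, `TOY_VOCAB` and `remove_repeated_characters_toy` are hypothetical; the regex and the stopping rule are the same as in the function above.

```python
import re

# Hypothetical stand-in for the WordNet lookup
TOY_VOCAB = {"finally", "really", "good"}

def remove_repeated_characters_toy(tokens, vocab=TOY_VOCAB):
    repeat_pattern = re.compile(r'(\w*)(\w)\2(\w*)')

    def replace(word):
        # Stop as soon as the word is a known vocabulary entry
        if word in vocab:
            return word
        # Remove one character from one repeated pair
        new_word = repeat_pattern.sub(r'\1\2\3', word)
        return replace(new_word) if new_word != word else new_word

    return [replace(t) for t in tokens]

print(remove_repeated_characters_toy(["finalllyyy", "good", "nx300"]))
# ['finally', 'good', 'nx30']
```

Each pass drops one repeated character, so "finalllyyy" shrinks step by step until it matches "finally". Tokens that are outside the vocabulary and contain a legitimate repeat, such as the model code "nx300", are altered too, which is worth keeping in mind given that the pipeline below leaves this step commented out.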

#### 5.2 Correcting Spellings

In [14]:
from collections import Counter

# Generate a map of frequently occurring words in English and their counts
"""
The input corpus we use is a file containing several books from the Gutenberg corpus and also 
a list of most frequent words from Wiktionary and the British National Corpus. You can find 
the file under the name big.txt or download it from http://norvig.com/big.txt and use it.
"""
def tokens(text):
    """
    Get all words from the corpus
    """
    return re.findall('[a-z]+', text.lower())

# Read the corpus with a context manager so the file is closed afterwards
with open('big.txt') as f:
    WORDS = tokens(f.read())
WORD_COUNTS = Counter(WORDS)

In [15]:
# Define functions that compute sets of words that are one and two edits away from input word.
def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

In [16]:
# Define function that returns a subset of words from our candidate set of words obtained from 
# the edit functions, based on whether they occur in our vocabulary dictionary WORD_COUNTS.
# This gives us a list of valid words from our set of candidate words.
def known(words): 
    "The subset of `words` that appear in the dictionary of WORD_COUNTS."
    return set(w for w in words if w in WORD_COUNTS)

In [17]:
# Define function to correct words
def correct(words):
    # Get the best correct spellings for the input words
    def candidates(word): 
        # Generate possible spelling corrections for word.
        # Priority is for edit distance 0, then 1, then 2, else defaults to the input word itself.
        candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
        return candidates
    
    corrected_words = [max(candidates(word), key=WORD_COUNTS.get) for word in words]
    return corrected_words
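
The candidate selection can be tried out without `big.txt` by swapping in a tiny word-count table. `TOY_WORD_COUNTS` and `correct_toy` are hypothetical, and the counts are made up for illustration; the edit logic mirrors the functions above.

```python
from collections import Counter

# Hypothetical toy vocabulary standing in for the counts built from big.txt
TOY_WORD_COUNTS = Counter({"lease": 5, "dealer": 4, "sale": 3})

def edits1(word):
    "All strings that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words, counts):
    "The subset of `words` present in `counts`."
    return {w for w in words if w in counts}

def correct_toy(words, counts=TOY_WORD_COUNTS):
    def candidates(word):
        # Lazily generated strings two edits away
        two_away = (e2 for e1 in edits1(word) for e2 in edits1(e1))
        # Prefer distance 0, then 1, then 2, else keep the word unchanged
        return (known([word], counts) or known(edits1(word), counts)
                or known(two_away, counts) or [word])
    # Among the candidates, pick the most frequent one
    return [max(candidates(w), key=lambda c: counts[c]) for w in words]

print(correct_toy(["leese", "sael", "lexus"]))
# ['lease', 'sale', 'lexus']
```

With this toy vocabulary "lexus" has no neighbour within two edits and is returned unchanged; with a real corpus, brand names and model codes that are missing from the vocabulary may be rewritten to a nearby common word.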

### 6. Lemmatization

Lemmatization removes word affixes to get to the base form of a word. The base form, also known as the root word or lemma, is always present in the dictionary.

In [18]:
import spacy
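# Assumes the small English model is installed (python -m spacy download en_core_web_sm);
# spaCy 3.x no longer accepts the "en" shortcut
nlp = spacy.load("en_core_web_sm")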

In [19]:
# Define function for Lemmatization
def Lemmatize_tokens(tokens):
    doc = ' '.join(tokens)
    Lemmatized_tokens = [token.lemma_ for token in nlp(doc)]
    return Lemmatized_tokens

### 7. Text Normalization

In [27]:
def normalize_corpus(corpus):
    normalized_corpus = []    
    for text in corpus:
        text = text.lower()
        text = expand_contractions(text)
        text = remove_characters(text)
        tokens = tokenize_text(text)
        tokens = remove_stopwords(tokens)
        #tokens = remove_repeated_characters(tokens)
        #tokens = correct(tokens)
        tokens = Lemmatize_tokens(tokens)
        text = ' '.join(tokens)
        normalized_corpus.append(text)
                    
    return normalized_corpus

In [28]:
# Normalize Search term
dfn = dfn.assign(Norm_SearchTerm = normalize_corpus(dfn['Search term']))

In [30]:
dfn.head(10)

Unnamed: 0,Search term,Ad group,Keyword,Norm_SearchTerm,Norm_Keyword
0,cost to lease lexus es 350,Sedan_ES,+Lexus +ES,cost lease lexus es 350,lexus es
1,lexus fairfield,Neighborhood,+Lexus +Fairfield,lexus fairfield,lexus fairfield
2,lexus rx 450h,SUV_RX,Lexus +RX,lexus rx 450h,lexus rx
3,new lexus es in danville ca 94506,Neighborhood,+Lexus +Danville,new lexus es danville ca 94506,lexus danville
4,lexus is300 for sale,Sedan_IS,"""Lexus IS 300""",lexus is300 sale,lexus
5,2018 nx300,SUV_NX,+NX +300,2018 nx300,nx
6,lexus concord stock,DealerName,"""Lexus Concord""",lexus concord stock,lexus concord
7,2018 lexus rx,SUV_RX,Lexus +RX,2018 lexus rx,lexus rx
8,lexus nx300h,SUV_NX,[Lexus NX 300h],lexus nx300h,lexus nx h
9,凌志 汽车 2018,OEM Make,+凌志,2018,


In [31]:
# Normalize Keyword
dfn = dfn.assign(Norm_Keyword = normalize_corpus(dfn['Keyword']))

In [32]:
dfn.head(20)

Unnamed: 0,Search term,Ad group,Keyword,Norm_SearchTerm,Norm_Keyword
0,cost to lease lexus es 350,Sedan_ES,+Lexus +ES,cost lease lexus es 350,lexus es
1,lexus fairfield,Neighborhood,+Lexus +Fairfield,lexus fairfield,lexus fairfield
2,lexus rx 450h,SUV_RX,Lexus +RX,lexus rx 450h,lexus rx
3,new lexus es in danville ca 94506,Neighborhood,+Lexus +Danville,new lexus es danville ca 94506,lexus danville
4,lexus is300 for sale,Sedan_IS,"""Lexus IS 300""",lexus is300 sale,lexus 300
5,2018 nx300,SUV_NX,+NX +300,2018 nx300,nx 300
6,lexus concord stock,DealerName,"""Lexus Concord""",lexus concord stock,lexus concord
7,2018 lexus rx,SUV_RX,Lexus +RX,2018 lexus rx,lexus rx
8,lexus nx300h,SUV_NX,[Lexus NX 300h],lexus nx300h,lexus nx 300h
9,凌志 汽车 2018,OEM Make,+凌志,2018,


## III. Generate Negative Keyword

In [33]:
import warnings 
warnings.filterwarnings("ignore", category=DeprecationWarning)

Below is a helper function that removes unwanted text patterns from a string. It takes two arguments: the original string and the pattern to remove. It returns the input string with every match of the pattern removed.

In [34]:
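def remove_pattern(input_txt, pattern):
    # Find every substring that matches the pattern
    r = re.findall(pattern, input_txt)
    for i in r:
        # Escape the match so it is removed literally, not reinterpreted as a regex
        input_txt = re.sub(re.escape(i), '', input_txt)

    return input_txt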

In [38]:
# Remove keyword from Search Term
n = len(dfn)
nkw = []
for i in range(n):
    text = dfn['Norm_SearchTerm'][i]
    tokens = tokenize_text(dfn['Norm_Keyword'][i])
    for pattern in tokens:
        text = remove_pattern(text, pattern)
    nkw.append(text)
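
The loop above removes keyword tokens as substrings, so a keyword token such as `300` is also stripped from inside `is300`. If whole-token matching is preferred, one possible alternative is sketched below. `extract_negative_terms` is a hypothetical helper that behaves differently from `remove_pattern`; it is not the method used in this notebook.

```python
def extract_negative_terms(search_term, keyword):
    """Return the search-term tokens that are not keyword tokens.

    Matching is done on whole whitespace-separated tokens, so a keyword
    token is never removed from the middle of a longer token.
    """
    keyword_tokens = set(keyword.split())
    remaining = [t for t in search_term.split() if t not in keyword_tokens]
    return ' '.join(remaining)

print(extract_negative_terms('cost lease lexus es 350', 'lexus es'))
# cost lease 350
print(extract_negative_terms('lexus is300 sale', 'lexus 300'))
# is300 sale
```

The trade-off is that search terms written without spaces (for example `nx300h` against the keyword tokens `nx` and `300h`) are kept in full, whereas the substring approach removes them.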

In [39]:
dfn['Negative Keyword'] = nkw

In [40]:
dfn.head(20)

Unnamed: 0,Search term,Ad group,Keyword,Norm_SearchTerm,Norm_Keyword,Negative Keyword
0,cost to lease lexus es 350,Sedan_ES,+Lexus +ES,cost lease lexus es 350,lexus es,cost lease 350
1,lexus fairfield,Neighborhood,+Lexus +Fairfield,lexus fairfield,lexus fairfield,
2,lexus rx 450h,SUV_RX,Lexus +RX,lexus rx 450h,lexus rx,450h
3,new lexus es in danville ca 94506,Neighborhood,+Lexus +Danville,new lexus es danville ca 94506,lexus danville,new es ca 94506
4,lexus is300 for sale,Sedan_IS,"""Lexus IS 300""",lexus is300 sale,lexus 300,is sale
5,2018 nx300,SUV_NX,+NX +300,2018 nx300,nx 300,2018
6,lexus concord stock,DealerName,"""Lexus Concord""",lexus concord stock,lexus concord,stock
7,2018 lexus rx,SUV_RX,Lexus +RX,2018 lexus rx,lexus rx,2018
8,lexus nx300h,SUV_NX,[Lexus NX 300h],lexus nx300h,lexus nx 300h,
9,凌志 汽车 2018,OEM Make,+凌志,2018,,2018


In [41]:
dfn1 = dfn[['Search term', 'Ad group', 'Keyword', 'Negative Keyword']]

In [43]:
# Save the result to an Excel file
dfn1.to_excel('NKW.xlsx', index=False)