<a href="https://colab.research.google.com/github/andrewxu13/TextAnalytics/blob/main/Week_04_Feature_Engineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regex
Regular expressions (regex) describe text patterns to search for, split on, or replace.
See the Python regex HOWTO: https://docs.python.org/3/howto/regex.html

In [None]:
import re

In [None]:
#search
regex = re.compile(r"\d{3}-\d{3}-\d{4}") # compile the pattern (raw string, so \d is not treated as a string escape)

In [None]:
x1 ='If you want to call the White house, please use 202-456-1111'
x2 ='If you want to call the White house, please use (202)456-1111'
x3 ='If you want to call the White house, please use (202) 456-1111'
x4 ='If you want to call the White house, please use 2024561111'
x5 ='If you want to call the White house, please use 202.456.1111'
x6 ='If you want to call the White house, please use 202 456 1111'
x7 ='If you want to call the White house, please use 202 456 1111'
x8 =' If you want to call the White house, please use 202-456-1111. Dial 212-897-1964 to get a very special #Ghostbusters message!'

## Regex findall function

In [None]:
re.findall(regex, x1) #return the match

['202-456-1111']

In [None]:
re.findall(regex, x2) #return the match

[]

In [None]:
re.findall(regex, x8) #can return all the matches

['202-456-1111', '212-897-1964']
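The compiled pattern only matches the dash-separated form, which is why `x2` returns an empty list. As a rough sketch (the pattern `flexible_regex` below is an assumption for illustration, not part of the original notebook), optional parentheses and optional separators can cover the other formats used in `x1` to `x6`:

```python
import re

# Assumed pattern (not from the original notebook): optional parentheses
# around the area code, then an optional '-', '.', or whitespace separator.
flexible_regex = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

samples = [
    "please use 202-456-1111",
    "please use (202)456-1111",
    "please use (202) 456-1111",
    "please use 2024561111",
    "please use 202.456.1111",
    "please use 202 456 1111",
]

# One findall result per sample string.
matches = [flexible_regex.findall(s) for s in samples]
for m in matches:
    print(m)
```

This pattern is deliberately loose: it also accepts any run of ten digits and unbalanced parentheses, so a stricter pattern may be needed for real data.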

## Regex Split function



In [None]:
re.split(regex, x1) #split, remove the match

['If you want to call the White house, please use ', '']

In [None]:
re.split(regex, x8) #split on every match; the pieces between matches come back as a list

[' If you want to call the White house, please use ',
 '. Dial ',
 ' to get a very special #Ghostbusters message!']
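By default `re.split` throws the matched text away. If the pattern is wrapped in a capturing group, the matches are kept in the result list between the surrounding pieces. A small sketch, using a shortened example sentence that is not from the notebook:

```python
import re

# Same phone pattern as above, wrapped in parentheses (a capturing group).
capturing_regex = re.compile(r"(\d{3}-\d{3}-\d{4})")

text = "Call 202-456-1111. Dial 212-897-1964 now"

# With a capturing group, the matched numbers are included in the output.
parts = re.split(capturing_regex, text)
print(parts)
```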

## Regex Sub Function

In [None]:
re.sub(regex, "999-999-9999", x8) #substitution

' If you want to call the White house, please use 999-999-9999. Dial 999-999-9999 to get a very special #Ghostbusters message!'

In [None]:
text1 = re.sub(regex,"999-999-9999",x8,count=1) #count=1 replaces only the first match
text1

' If you want to call the White house, please use 999-999-9999. Dial 212-897-1964 to get a very special #Ghostbusters message!'

In [None]:
text2 = re.sub(regex,"888-888-8888",text1,flags=0)
text2

' If you want to call the White house, please use 888-888-8888. Dial 888-888-8888 to get a very special #Ghostbusters message!'

In [None]:
re.sub(regex, "[redacted]", x8) #substitution as anything you want

' If you want to call the White house, please use [redacted]. Dial [redacted] to get a very special #Ghostbusters message!'
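The replacement does not have to be a fixed string. `re.sub` also accepts a function that receives each match object and returns the replacement text. The sketch below uses a hypothetical helper, `mask_number`, to keep only the last four digits:

```python
import re

phone_regex = re.compile(r"\d{3}-\d{3}-\d{4}")

def mask_number(match):
    """Hypothetical helper: hide everything except the last four digits."""
    return "XXX-XXX-" + match.group(0)[-4:]

text = "Call 202-456-1111. Dial 212-897-1964 now"

# The function is called once per match.
masked = phone_regex.sub(mask_number, text)
print(masked)
```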

# Vectorization

Basically, vectorization turns human-readable text into numeric vectors that are ready for modeling.

It's called feature engineering for the fancy mind.

## One-hot Coding

Vectorize each document based on its words: every word in the vocabulary gets its own position in the vector, and that position is coded as 1 when the word appears.

### Using Python
Flex your Python muscle

In [None]:
#basic cleaning, lower case, then remove period
documents = ["Dog bites man.", "Man bites dog.", "Dog eats meat.", "Man eats food."]
processed_docs = [doc.lower().replace(".","") for doc in documents]
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [None]:
#Build the vocabulary through dict, with the number as the position of the word
vocab = {}
count = 0
for doc in processed_docs:
    for word in doc.split():
        if word not in vocab:
            count = count +1
            vocab[word] = count

print(vocab)

{'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}


In [None]:
#Get one hot representation for any string based on this vocabulary.
#If the word exists in the vocabulary, its representation is returned.
#If not, a list of zeroes is returned for that word.
def get_onehot_vector(somestring):
    onehot_encoded = []
    for word in somestring.split():
        temp = [0]*len(vocab)
        if word in vocab:
            temp[vocab[word]-1] = 1 # -1 is to take care of the fact indexing in array starts from 0 and not 1
        onehot_encoded.append(temp)
    return onehot_encoded

In [None]:
print(processed_docs[1])
get_onehot_vector(processed_docs[1]) #one hot representation for a text from our corpus.

man bites dog


[[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]

In [None]:
get_onehot_vector("man bites dog")

[[0, 0, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]]

In [None]:
get_onehot_vector("man eats fruits") #still works (the code doesn't break), but "fruits" is out of vocabulary, so its vector is all zeros

[[0, 0, 1, 0, 0, 0], [0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0]]
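`get_onehot_vector` returns one vector per word. To get a single vector per document, as described at the top of this section, the per-word vectors can be collapsed into one. A minimal sketch, where `get_document_vector` is a hypothetical name and the vocabulary is copied from the output above:

```python
# Vocabulary copied from the printed output above (positions start at 1).
vocab = {'dog': 1, 'bites': 2, 'man': 3, 'eats': 4, 'meat': 5, 'food': 6}

def get_document_vector(somestring, vocab):
    """Hypothetical helper: 1 at each position whose word appears in the text."""
    vector = [0] * len(vocab)
    for word in somestring.split():
        if word in vocab:
            vector[vocab[word] - 1] = 1  # positions are 1-based, lists are 0-based
    return vector

print(get_document_vector("man bites dog", vocab))
print(get_document_vector("man eats fruits", vocab))  # "fruits" is ignored
```

Note that word order is lost: "dog bites man" and "man bites dog" get the same document vector.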

### Using scikit-learn

Documentation at https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [None]:
S1 = 'dog bites man'
S2 = 'man bites dog'
S3 = 'dog eats meat'
S4 = 'man eats food'

In [None]:
#Preprocessing data
from sklearn.preprocessing import OneHotEncoder

data = [S1.split(), S2.split(), S3.split(), S4.split()]
values = data[0]+data[1]+data[2]+data[3]
print("The data: ",values)



The data:  ['dog', 'bites', 'man', 'man', 'bites', 'dog', 'dog', 'eats', 'meat', 'man', 'eats', 'food']


In [None]:
data

[['dog', 'bites', 'man'],
 ['man', 'bites', 'dog'],
 ['dog', 'eats', 'meat'],
 ['man', 'eats', 'food']]

In [None]:
#One-Hot Encoding
onehot_encoder = OneHotEncoder()
onehot_encoded = onehot_encoder.fit_transform(data).toarray() #toarray() to get the matrix to the right representation
print(onehot_encoder.get_feature_names_out())
print("Onehot Encoded Matrix:\n",onehot_encoded)

['x0_dog' 'x0_man' 'x1_bites' 'x1_eats' 'x2_dog' 'x2_food' 'x2_man'
 'x2_meat']
Onehot Encoded Matrix:
 [[1. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 1. 0. 1. 0. 0. 0.]
 [1. 0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 1. 0. 0.]]
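The feature names show that the encoder treats each word position as a separate feature, which is why "dog" shows up as both `x0_dog` and `x2_dog`. The pure-Python sketch below reproduces the same columns and matrix under that reading; it is an illustration of the layout, not how scikit-learn is implemented internally:

```python
data = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "eats", "meat"],
    ["man", "eats", "food"],
]

# One block of columns per token position; categories are sorted within a block.
columns = []
for position in range(len(data[0])):
    categories = sorted({row[position] for row in data})
    columns.extend((position, token) for token in categories)

feature_names = [f"x{position}_{token}" for position, token in columns]

# A cell is 1 when the row has that token at that position.
encoded = [
    [1 if row[position] == token else 0 for position, token in columns]
    for row in data
]

print(feature_names)
for row in encoded:
    print(row)
```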


## Bag of Words

We use sklearn CountVectorizer to create the Bag of Words

Documentation https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

We can set `binary=True` to record only whether a word appears (1) or not (0), instead of counting how many times it appears.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

#look at the documents list
print("Our corpus: ", processed_docs)



Our corpus:  ['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']


In [None]:
count_vect1 = CountVectorizer()
#Build a BOW representation for the corpus
bow_rep1 = count_vect1.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect1.vocabulary_)

Our vocabulary:  {'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}


In [None]:
#see the BOW rep for first 2 documents
print("BoW representation for" , processed_docs[0], ": ", bow_rep1[0].toarray())
print("BoW representation for" , processed_docs[1], ": ",bow_rep1[1].toarray())


BoW representation for dog bites man :  [[1 1 0 0 1 0]]
BoW representation for man bites dog :  [[1 1 0 0 1 0]]


In [None]:

#Get the representation using this vocabulary, for a new text
temp = "dog and dog are friends"
t_transform = count_vect1.transform([temp])
print("Bow representation:",t_transform.toarray() ) #don't forget the [ ] for vectorization

Bow representation: [[0 2 0 0 0 0]]


In [None]:
#let's try one more
temp = "###" #new text here
t_transform = count_vect1.transform([temp])
print("Bow representation:",t_transform.toarray() ) #don't forget the [ ] for vectorization

Bow representation: [[0 0 0 0 0 0]]


In [None]:
#BoW with binary vectors: only whether the word appears or not
count_vect2 = CountVectorizer(binary=True)
count_vect2.fit(processed_docs)
temp = count_vect2.transform(["dog and dog are friends"])
print("Bow representation for 'dog and dog are friends':", temp.toarray())

Bow representation for 'dog and dog are friends': [[0 1 0 0 0 0]]
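To see what CountVectorizer is doing, the same vectors can be rebuilt with the standard library. This is a simplified sketch: `tokenize` and `bow_vector` are hypothetical helpers, and the regular expression is the default `token_pattern` from the CountVectorizer documentation (tokens of two or more word characters).

```python
import re
from collections import Counter

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

# Alphabetically sorted vocabulary: word -> column index.
all_words = sorted({word for doc in corpus for word in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(all_words)}

def bow_vector(text, binary=False):
    counts = Counter(tokenize(text))
    vector = [counts.get(word, 0) for word in vocabulary]
    # binary=True caps every count at 1 (present / absent).
    return [min(c, 1) for c in vector] if binary else vector

print(vocabulary)
print(bow_vector("dog and dog are friends"))
print(bow_vector("dog and dog are friends", binary=True))
```

Words that are not in the vocabulary ("and", "are", "friends") are simply dropped, which matches the outputs above.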


## Bag of N-grams



### Uni-gram == one word

In [None]:
#Ngram vectorization example with count vectorizer and uni-gram first
count_vect3 = CountVectorizer(ngram_range=(1,1)) #set a range
# Build a BOW representation for the corpus
bow_rep3 = count_vect3.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect3.vocabulary_)

#see the BOW rep for first 2 documents
print("BoW representation for" , processed_docs[0], ": ", bow_rep3[0].toarray())
print("BoW representation for" , processed_docs[1], ": ",bow_rep3[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect3.transform(["dog and dog are friends"])

print("Bow representation for 'dog and dog are friends':", temp.toarray())

Our vocabulary:  {'dog': 1, 'bites': 0, 'man': 4, 'eats': 2, 'meat': 5, 'food': 3}
BoW representation for dog bites man :  [[1 1 0 0 1 0]]
BoW representation for man bites dog :  [[1 1 0 0 1 0]]
Bow representation for 'dog and dog are friends': [[0 2 0 0 0 0]]


### Bi-gram == two words

In [None]:
#Ngram vectorization example with count vectorizer, bi-grams only
count_vect4 = CountVectorizer(ngram_range=(2,2)) #set a range
#Build a BOW representation for the corpus
bow_rep4 = count_vect4.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect4.vocabulary_)

#see the BOW rep for first 2 documents
print("BoW representation for" , processed_docs[0], ": ", bow_rep4[0].toarray())
print("BoW representation for" , processed_docs[1], ": ",bow_rep4[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect4.transform(["dog and dog are friends"])

print("Bow representation for 'dog and dog are friends':", temp.toarray())

Our vocabulary:  {'dog bites': 2, 'bites man': 1, 'man bites': 6, 'bites dog': 0, 'dog eats': 3, 'eats meat': 5, 'man eats': 7, 'eats food': 4}
BoW representation for dog bites man :  [[0 1 1 0 0 0 0 0]]
BoW representation for man bites dog :  [[1 0 0 0 0 0 1 0]]
Bow representation for 'dog and dog are friends': [[0 0 0 0 0 0 0 0]]
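The all-zero vector for the new sentence follows from how bi-grams are built: every pair of adjacent words is one feature, and none of the pairs in "dog and dog are friends" occurs in the corpus. A small sketch with a hypothetical `ngrams` helper (plain whitespace tokenization, which is enough for this corpus):

```python
def ngrams(text, n):
    """Hypothetical helper: all runs of n adjacent words, joined by a space."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

# Sorted bi-gram vocabulary of the corpus.
bigram_vocab = sorted({gram for doc in corpus for gram in ngrams(doc, 2)})

new_bigrams = ngrams("dog and dog are friends", 2)
overlap = [gram for gram in new_bigrams if gram in bigram_vocab]

print(bigram_vocab)
print(new_bigrams)
print(overlap)  # empty: no bi-gram of the new sentence is in the vocabulary
```

Unlike the uni-gram vectors, bi-grams keep some word order, so "dog bites man" and "man bites dog" now get different representations.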


### Tri-gram == three words

In [None]:
#Ngram vectorization example with count vectorizer, tri-grams only
count_vect5 = CountVectorizer(ngram_range=(3,3)) #set a range
#Build a BOW representation for the corpus
bow_rep5 = count_vect5.fit_transform(processed_docs)

#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect5.vocabulary_)
#see the BOW rep for first 2 documents
print("BoW representation for" , processed_docs[0], ": ", bow_rep5[0].toarray())
print("BoW representation for" , processed_docs[1], ": ",bow_rep5[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect5.transform(["dog and dog are friends"])

print("Bow representation for 'dog and dog are friends':", temp.toarray())

Our vocabulary:  {'dog bites man': 0, 'man bites dog': 2, 'dog eats meat': 1, 'man eats food': 3}
BoW representation for dog bites man :  [[1 0 0 0]]
BoW representation for man bites dog :  [[0 0 1 0]]
Bow representation for 'dog and dog are friends': [[0 0 0 0]]


### Bag of N-grams
Let's have uni-grams, bi-grams, and tri-grams all together

In [None]:
#Ngram vectorization example with count vectorizer, uni-grams through tri-grams
count_vect5 = CountVectorizer(ngram_range=(1,3)) #set a range from 1 to 3 gram
#Build a BOW representation for the corpus
bow_rep5 = count_vect5.fit_transform(processed_docs)

#Look at the vocabulary mapping


print("Our vocabulary: ", {k: v for k, v in sorted(count_vect5.vocabulary_.items(), key=lambda item: item[1])})  #count_vect5.vocabulary_ is the dict, let's sort it

#see the BOW rep for first 2 documents
print("BoW representation for" , processed_docs[0], ": ", bow_rep5[0].toarray())
print("BoW representation for" , processed_docs[1], ": ",bow_rep5[1].toarray())

#Get the representation using this vocabulary, for a new text
temp = count_vect5.transform(["dog and dog are friends"])

print("Bow representation for 'dog and dog are friends':", temp.toarray())

Our vocabulary:  {'bites': 0, 'bites dog': 1, 'bites man': 2, 'dog': 3, 'dog bites': 4, 'dog bites man': 5, 'dog eats': 6, 'dog eats meat': 7, 'eats': 8, 'eats food': 9, 'eats meat': 10, 'food': 11, 'man': 12, 'man bites': 13, 'man bites dog': 14, 'man eats': 15, 'man eats food': 16, 'meat': 17}
BoW representation for dog bites man :  [[1 0 1 1 1 1 0 0 0 0 0 0 1 0 0 0 0 0]]
BoW representation for man bites dog :  [[1 1 0 1 0 0 0 0 0 0 0 0 1 1 1 0 0 0]]
Bow representation for 'dog and dog are friends': [[0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0]]
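With `ngram_range=(1,3)` the vocabulary is the union of all uni-grams, bi-grams, and tri-grams, which gives 6 + 8 + 4 = 18 columns here. The sketch below rebuilds that vocabulary and the vectors in plain Python; `ngrams` and `vectorize` are hypothetical helpers using simple whitespace tokenization:

```python
corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]

def ngrams(text, n):
    """Hypothetical helper: all runs of n adjacent words, joined by a space."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Union of 1-, 2-, and 3-grams, sorted alphabetically.
vocab = sorted({gram for doc in corpus for n in (1, 2, 3) for gram in ngrams(doc, n)})
index = {gram: i for i, gram in enumerate(vocab)}

def vectorize(text):
    vector = [0] * len(vocab)
    for n in (1, 2, 3):
        for gram in ngrams(text, n):
            if gram in index:
                vector[index[gram]] += 1
    return vector

print(len(vocab))
print(vectorize("dog bites man"))
print(vectorize("dog and dog are friends"))
```

The vocabulary grows quickly as the range widens, so wide n-gram ranges on a real corpus can produce very large, sparse matrices.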


## TF-IDF

You can flex your Python to calculate TF, IDF, and then TF-IDF by hand, but you don't have to: sklearn has it covered.

Documentation
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

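L2 normalization rescales each document vector to unit Euclidean length, so that long and short documents can be compared on the same scale. It does not make the vector shorter: the number of dimensions stays the same. It is the default setting (`norm='l2'`) in TfidfVectorizer.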

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer() # norm='l2' is the default
bow_rep_tfidf = tfidf.fit_transform(processed_docs)


In [None]:
processed_docs

['dog bites man', 'man bites dog', 'dog eats meat', 'man eats food']

In [None]:
#IDF for all words in the vocabulary
print("IDF for all words in the vocabulary",tfidf.idf_)


IDF for all words in the vocabulary [1.51082562 1.22314355 1.51082562 1.91629073 1.22314355 1.91629073]
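These numbers can be checked by hand. With the default `smooth_idf=True`, the scikit-learn documentation gives the formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. A short sketch in plain Python (variable names are illustrative):

```python
import math

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
n_docs = len(corpus)

vocab = sorted({word for doc in corpus for word in doc.split()})

# Document frequency: in how many documents does each word appear?
doc_freq = {word: sum(word in doc.split() for doc in corpus) for word in vocab}

# Smoothed IDF, as documented for TfidfVectorizer's defaults.
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

for word in vocab:
    print(word, doc_freq[word], round(idf[word], 8))
```

Words that appear in more documents ("dog", "man") get a lower IDF than rare ones ("food", "meat").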


In [None]:
#All words in the vocabulary.
print("All words in the vocabulary",list(tfidf.get_feature_names_out())) # get_feature_names() was removed in newer scikit-learn versions


All words in the vocabulary ['bites', 'dog', 'eats', 'food', 'man', 'meat']




In [None]:
#TFIDF representation for all documents in our corpus
print("TFIDF representation for all documents in our corpus\n",bow_rep_tfidf.toarray())

#column order follows the vocabulary order printed above


TFIDF representation for all documents in our corpus
 [[0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.65782931 0.53256952 0.         0.         0.53256952 0.        ]
 [0.         0.44809973 0.55349232 0.         0.         0.70203482]
 [0.         0.         0.55349232 0.70203482 0.44809973 0.        ]]


In [None]:
temp = tfidf.transform(["dog and man are friends."])
print("Tfidf representation for 'dog and man are friends':\n", temp.toarray())

Tfidf representation for 'dog and man are friends':
 [[0.         0.70710678 0.         0.         0.70710678 0.        ]]
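The full TF-IDF vectors can also be rebuilt by hand: multiply each raw term count by the word's IDF, then divide the vector by its Euclidean (L2) length. The sketch below assumes the default TfidfVectorizer settings (smoothed IDF, `norm='l2'`, raw counts as TF) and uses a hypothetical `tfidf_vector` helper with whitespace tokenization:

```python
import math

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
n_docs = len(corpus)
vocab = sorted({word for doc in corpus for word in doc.split()})

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {
    word: math.log((1 + n_docs) / (1 + sum(word in doc.split() for doc in corpus))) + 1
    for word in vocab
}

def tfidf_vector(text):
    tokens = text.split()
    # Raw TF-IDF: term count times IDF; out-of-vocabulary words are ignored.
    raw = [tokens.count(word) * idf[word] for word in vocab]
    # L2 normalization: scale the vector to length 1 (skip all-zero vectors).
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]

print([round(v, 8) for v in tfidf_vector("dog bites man")])
print([round(v, 8) for v in tfidf_vector("dog and man are friends")])
```

In the second sentence only "dog" and "man" are in the vocabulary and they share the same IDF, so after normalization each gets 1/sqrt(2), roughly 0.7071.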
