<h1>3 Text to Features (Feature Engineering on Text Data)</h1>
<p>Before preprocessed text can be analyzed, it must be converted into features. Several techniques are available:</p>
<ul>
    <li>Syntactic Parsing</li>
    <li>Entities</li>
    <li>N-grams</li>
    <li>Word-Based Features</li>
</ul>

<h1>3.1 Syntactic Parsing</h1>
<p>Syntactic parsing analyzes the words in a sentence for grammar and arrangement, in a way that shows the relationships among the words.</p>
<p>Dependency grammar and part-of-speech (POS) tags are two important attributes of text syntax.</p>

<h3>Dependency Grammar:</h3>
<ul>
    <li>Relationship among words in a sentence.</li>
    <li>Dependency Grammar is a class of syntactic text analysis that deals with (labeled) asymmetrical binary relations between two lexical items (words).</li>
    <li>Every relation can be represented as a triplet: (relation, governor, dependent)</li>
</ul>

<p>Example: “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.”</p>
<img src="dependency-grammar.png">

<ul>
    <li>The tree shows that "submitted" is the root word of the sentence and is linked to two subtrees: a subject subtree and an object subtree.</li>
    <li>Each subtree is itself a dependency tree.</li>
    <li>Parsing this type of tree top-down yields grammar relation triplets that can be used as features for NLP problems such as:
        <ul>
            <li>entity wise sentiment analysis</li>
            <li>actor &amp; entity identification</li>
            <li>text classification</li>
        </ul>
    </li>
    <li>StanfordCoreNLP (by Stanford NLP Group) and NLTK dependency grammars can be used to generate dependency trees</li>
</ul>
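<p>As a rough illustration of how such triplets might be turned into features, the sketch below starts from a hand-written subset of (relation, governor, dependent) triplets for the example sentence. The triplets are typed in by hand in Stanford-style notation rather than produced by a parser, and the helper names <code>find_root</code> and <code>dependents_of</code> are hypothetical.</p>

```python
from collections import defaultdict

# Hand-written (relation, governor, dependent) triplets for the example
# sentence; an illustrative subset, not actual parser output.
triplets = [
    ("nsubjpass", "submitted", "Bills"),
    ("auxpass", "submitted", "were"),
    ("agent", "submitted", "Brownback"),
    ("nn", "Brownback", "Senator"),
    ("appos", "Brownback", "Republican"),
    ("prep_of", "Republican", "Kansas"),
    ("prep_on", "Bills", "ports"),
    ("conj_and", "ports", "immigration"),
]


def find_root(triplets):
    """Return the word that governs others but is never a dependent."""
    governors = {gov for _, gov, _ in triplets}
    dependents = {dep for _, _, dep in triplets}
    roots = governors - dependents
    return roots.pop() if len(roots) == 1 else None


def dependents_of(triplets):
    """Group (relation, dependent) pairs under each governor."""
    grouped = defaultdict(list)
    for relation, governor, dependent in triplets:
        grouped[governor].append((relation, dependent))
    return dict(grouped)


root = find_root(triplets)
children = dependents_of(triplets)

# One simple feature encoding: a "relation(governor, dependent)" string.
features = [f"{rel}({gov}, {dep})" for rel, gov, dep in triplets]

print(root)
print(children[root])
print(features[:3])
```

<p>Encoding each triplet as a single string is only one option; the strings could then be fed to any bag-of-features model alongside plain word features.</p>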

<h3>Part of Speech Tagging</h3>
<p>Every word in a sentence is associated with a part-of-speech (POS) tag.</p>
<p>Parts of Speech:</p>
<ul>
    <li>Nouns</li>
    <li>Verbs</li>
    <li>Adjectives</li>
    <li>Adverbs</li>
    <li>Etc.</li>
</ul>

In [5]:
from nltk import word_tokenize, pos_tag

sample_text = "Rex, the running brown dog, was panting loudly."

tokens = word_tokenize(sample_text)
tagged_tokens = pos_tag(tokens)
tagged_tokens

[('Rex', 'NNP'),
 (',', ','),
 ('the', 'DT'),
 ('running', 'VBG'),
 ('brown', 'JJ'),
 ('dog', 'NN'),
 (',', ','),
 ('was', 'VBD'),
 ('panting', 'VBG'),
 ('loudly', 'RB'),
 ('.', '.')]

<h3>A. Word Sense Disambiguation:</h3>
<p>Some words in a language have multiple meanings according to their usage.</p>

In [19]:
book1 = "I am going to book a flight to PDX."
book2 = "I am going to read this book on my flight to PDX."

print("Basic tokenizing/POS tagging cannot detect the difference " +
      "between these two uses of 'book': ", book1, book2, sep="\n",
      end="\n\n")

book1_tokens = word_tokenize(book1)
book1_tagged_tokens = pos_tag(book1_tokens)
print("Example 1 'book':", book1_tagged_tokens[4], end="\n\n")

book2_tokens = word_tokenize(book2)
book2_tagged_tokens = pos_tag(book2_tokens)
print("Example 2 'book':", book2_tagged_tokens[6])

Basic tokenizing/POS tagging cannot detect the difference between these two uses of 'book': 
I am going to book a flight to PDX.
I am going to read this book on my flight to PDX.

Example 1 'book': ('book', 'NN')

Example 2 'book': ('book', 'NN')


<h3>Other Important NLP Uses of POS Tagging:</h3>

<h3>B. Improving word-based features:</h3>
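<p>A learning model can conflate the different contexts in which a word is used. Attaching the part-of-speech tag to the word (for example, treating "book" as a verb and "book" as a noun as separate features) helps preserve that context.</p>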
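<p>A minimal sketch of this idea, assuming the tokens have already been tagged: each word is joined with its tag so that the two uses of "book" become distinct features. The tags here are assigned by hand for illustration, and <code>word_pos_features</code> is a hypothetical helper name.</p>

```python
# Hand-tagged tokens (Penn Treebank tags) for the two senses of "book";
# the tags are assigned by hand for illustration, not produced by a tagger.
tagged_verb_use = [("I", "PRP"), ("will", "MD"), ("book", "VB"),
                   ("a", "DT"), ("flight", "NN")]
tagged_noun_use = [("I", "PRP"), ("read", "VBD"), ("this", "DT"),
                   ("book", "NN")]


def word_pos_features(tagged_tokens):
    """Join each lowercased word with its POS tag, e.g. 'book_VB'."""
    return [f"{word.lower()}_{tag}" for word, tag in tagged_tokens]


verb_features = word_pos_features(tagged_verb_use)
noun_features = word_pos_features(tagged_noun_use)

print(verb_features)
print(noun_features)
```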

<h3>C. Normalization and Lemmatization:</h3>
<p>POS tags are the basis of the lemmatization process, which converts a word into its base form (lemma).</p>

<h3>D. Efficient stopword removal:</h3>
<p>POS tags are also useful for efficient removal of stopwords.</p>
<p>Some tags typically mark words that carry little importance. Examples:</p>
<ul>
    <li>IN (Preposition or Subordinating Conjunction): "within", "upon", ...</li>
    <li>CD (Cardinal Number): "one", "two", ...</li>
    <li>MD (Modal): "may", "must", ...</li>
</ul>
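<p>The filtering step could look roughly like the sketch below. It assumes tokens that are already tagged (hand-tagged here, not tagger output), and both the tag set <code>LOW_VALUE_TAGS</code> and the helper <code>drop_low_value_tokens</code> are hypothetical names chosen for this example.</p>

```python
# Tags treated as low-importance in this sketch; the set is an assumption
# and would normally be tuned per task.
LOW_VALUE_TAGS = {"IN", "CD", "MD"}

# Hand-tagged example tokens (not tagger output).
tagged_tokens = [("You", "PRP"), ("must", "MD"), ("read", "VB"),
                 ("two", "CD"), ("books", "NNS"), ("within", "IN"),
                 ("a", "DT"), ("week", "NN")]


def drop_low_value_tokens(tagged_tokens, low_value_tags=LOW_VALUE_TAGS):
    """Keep only the words whose tag is not in the low-value set."""
    return [word for word, tag in tagged_tokens if tag not in low_value_tags]


kept = drop_low_value_tokens(tagged_tokens)
print(kept)
```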

<h1>3.2 Entity Extraction (Entities as features)</h1>
<p>Entities are the most important chunks of a sentence: noun phrases and verb phrases. Entity detection algorithms usually combine rule-based parsing, dictionary lookups, POS tagging, and dependency parsing.</p>
<h3>Key NLP Entity Detection Methods:</h3>

<h3>A. Named Entity Recognition (NER)</h3>
<p>The process of detecting named entities, such as person names, organizations, and locations, in text.</p>
<p><strong>Sentence:</strong> "Sergey Brin, the manager of Google Inc., is walking the streets of New York."</p>
<p><strong>Named Entities:</strong> ("person": "Sergey Brin"), ("org": "Google Inc"), ("location": "New York")</p>
<p>A typical NER model consists of three blocks:</p>
<ul>
    <li><strong>Noun phrase identification:</strong> extracting the noun phrases using dependency parsing and POS tagging</li>
    <li><strong>Phrase Classification:</strong> all extracted noun phrases are classified into respective categories. Resources:
    <ul>
        <li>Google Maps API - location disambiguation</li>
        <li>Wikipedia - person/company names</li>
    </ul>
    </li>
    <li><strong>Entity disambiguation:</strong> Entities may be misclassified, so a validation layer on top of the results can be useful.</li>
</ul>
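<p>The phrase classification block can be approximated, very roughly, with a plain dictionary lookup. The sketch below is not a full NER model: it skips noun phrase identification and disambiguation entirely, and the <code>GAZETTEER</code> contents and the <code>lookup_entities</code> helper are hypothetical, built only from the example sentence.</p>

```python
# Hypothetical gazetteer: phrase -> entity label. A real system would
# load far larger dictionaries than this three-entry example.
GAZETTEER = {
    "Sergey Brin": "person",
    "Google Inc": "org",
    "New York": "location",
}


def lookup_entities(sentence, gazetteer=GAZETTEER):
    """Return (label, phrase) pairs for gazetteer phrases found in the
    sentence, ordered by where each phrase first appears."""
    found = []
    for phrase, label in gazetteer.items():
        position = sentence.find(phrase)
        if position != -1:
            found.append((position, label, phrase))
    return [(label, phrase) for _, label, phrase in sorted(found)]


sentence = ("Sergey Brin, the manager of Google Inc., is walking "
            "the streets of New York.")
entities = lookup_entities(sentence)
print(entities)
```

<p>Exact substring matching is brittle (it misses spelling variants and cannot resolve ambiguous names), which is one reason the validation layer described above is useful.</p>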

<h3>B. Topic Modeling</h3>
<p>Topic modeling is the process of automatically identifying the topics present in a text corpus. It derives hidden patterns among the words in the corpus in an unsupervised manner. Example topics and their terms:</p>
<ul>
    <li><strong>Healthcare:</strong> "health", "doctor", "patient", "hospital"</li>
    <li><strong>Farming:</strong> "farm", "crops", "wheat"</li>
</ul>
<p>Latent Dirichlet Allocation (LDA) is one of the most popular topic modeling techniques.</p>

In [38]:
import gensim
from gensim import corpora

doc1 = "Sugar is bad to consume. My sister likes to have sugar, but not my father." 
doc2 = "My father spends a lot of time driving my sister around to dance practice."
doc3 = "Doctors suggest that driving may cause increased stress and blood pressure."
doc_complete = [doc1, doc2, doc3]
doc_clean = [doc.split() for doc in doc_complete]

# Create the term dictionary of the corpus, where every unique
# term is assigned an index
# Ex: (0, "Sugar"), (1, "is"), (2, "bad") ...
dictionary = corpora.Dictionary(doc_clean)

# Convert the list of documents (corpus) into a Document Term 
# Matrix using the dictionary prepared above
# doc2bow creates bag-of-words representation:
# list of (word_id, word_frequency)
# [(0, 1), (1, 1), (2, 1), (3, 2), ...]
doc_term_matrix = [dictionary.doc2bow(doc) for doc in doc_clean]

# Create the object for the LDA model using the gensim library
Lda = gensim.models.ldamodel.LdaModel

# Run and train LDA model on the document term matrix
ldamodel = Lda(doc_term_matrix, num_topics=3,
               id2word=dictionary, passes=50)

# Results
print(ldamodel.print_topics())


[(0, '0.060*"driving" + 0.060*"may" + 0.060*"suggest" + 0.060*"stress" + 0.060*"that" + 0.060*"pressure." + 0.060*"and" + 0.060*"Doctors" + 0.060*"blood" + 0.060*"increased"'), (1, '0.089*"to" + 0.051*"My" + 0.051*"my" + 0.051*"sister" + 0.051*"sugar," + 0.051*"consume." + 0.051*"is" + 0.051*"Sugar" + 0.051*"but" + 0.051*"have"'), (2, '0.053*"driving" + 0.053*"sister" + 0.053*"my" + 0.053*"My" + 0.053*"a" + 0.053*"father" + 0.053*"dance" + 0.053*"time" + 0.053*"spends" + 0.053*"lot"')]


<h3>C. N-Grams as Features</h3>
<p>A sequence of N adjacent words is called an N-gram. N-grams with N &gt; 1 are generally more informative as features than single words (unigrams). Bigrams (N = 2) are often considered the most important of these.</p>

In [47]:
def generate_ngrams(text, n):
    words = text.split()
    output = []
    for i in range(len(words) - n + 1):
        output.append(tuple(words[i:i+n]))
        print(words[i:i+n])
    return output

sample_text = "this sentence will be turned into bigrams"
generate_ngrams(sample_text, 2)

['this', 'sentence']
['sentence', 'will']
['will', 'be']
['be', 'turned']
['turned', 'into']
['into', 'bigrams']


[('this', 'sentence'),
 ('sentence', 'will'),
 ('will', 'be'),
 ('be', 'turned'),
 ('turned', 'into'),
 ('into', 'bigrams')]

<h1>3.3 Statistical Features</h1>
<p>Text data can also be quantified directly into numbers using several techniques.</p>
<h3>A. Term Frequency - Inverse Document Frequency (TF - IDF)</h3>
<p>TF-IDF is a weighted model commonly used for information retrieval problems. It converts text documents into vectors on the basis of the occurrence of words in the documents, without considering their exact ordering.</p>
<p><strong>Example:</strong> Consider a dataset of N text documents. For any document "D", TF and IDF are defined as:</p>
<ul>
    <li>Term Frequency (TF): the TF for a term "t" is the count of "t" in document "D".</li>
    <li>Inverse Document Frequency (IDF): the IDF for a term "t" is the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing "t".</li>
    <li>TF-IDF: the TF-IDF value gives the relative importance of a term in a corpus (list of documents), given by the following formula:<img src="TF-IDF-formula.png"></li>
</ul>
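<p>The definitions above can be sketched directly in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization, raw counts for TF, and the natural logarithm for IDF (the text does not specify a base); the toy corpus and the helper name <code>tf_idf</code> are made up for the example. Note that scikit-learn's <code>TfidfVectorizer</code> applies a smoothed IDF and L2 normalization by default, so its values will not match this plain version.</p>

```python
import math
from collections import Counter

# Small whitespace-tokenized toy corpus (illustrative only).
corpus = [
    "this is a sample document".split(),
    "this is another sample".split(),
    "this is random".split(),
]


def tf_idf(corpus):
    """Return one {term: weight} dict per document, where
    weight = count(t in D) * log(N / number of documents containing t)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in corpus:
        term_counts = Counter(doc)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return weights


weights = tf_idf(corpus)
print(weights[0])
```

<p>A term that appears in every document (such as "this") gets a weight of zero, while a term unique to one document gets the largest IDF, which is the behaviour the definition is meant to capture.</p>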

In [51]:
from sklearn.feature_extraction.text import TfidfVectorizer

d1 = "This is a sort of long sample document."
d2 = "This is another sample document."
d3 = "This is a random document."

obj = TfidfVectorizer()
corpus = [d1, d2, d3]
X = obj.fit_transform(corpus)
print(X)

  (0, 8)	0.274634427112
  (0, 2)	0.274634427112
  (0, 7)	0.464996505949
  (0, 4)	0.464996505949
  (0, 3)	0.464996505949
  (0, 6)	0.35364182827
  (0, 1)	0.274634427112
  (1, 8)	0.364544396761
  (1, 2)	0.364544396761
  (1, 6)	0.469417284322
  (1, 1)	0.364544396761
  (1, 0)	0.617227317565
  (2, 8)	0.412858572062
  (2, 2)	0.412858572062
  (2, 1)	0.412858572062
  (2, 5)	0.699030327257


<p>The model creates a vocabulary dictionary and assigns an index to each word. Each row in the output contains a tuple (i,j) and a tf-idf value of the word at index j in document i.</p>

<h3>B. Count/Density/Readability Features</h3>
<p>Count- or density-based features can also be used in models and analysis. These features might seem trivial, but they can have a large impact on learning models. Some of these features are:</p>
<ul>
    <li>Word Count</li>
    <li>Sentence Count</li>
    <li>Punctuation Count</li>
    <li>Industry Specific Word Count</li>
</ul>
<p>Other types of measures include readability measures such as:</p>
<ul>
    <li>Syllable Counts</li>
    <li>SMOG Index</li>
    <li>Flesch Reading Ease</li>
</ul>
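<p>A rough sketch of how a few of these features might be computed with the standard library only. The regular expressions and the vowel-group syllable heuristic are simplifying assumptions (real syllable counting is considerably harder), and <code>count_syllables</code> and <code>text_features</code> are hypothetical helper names. The Flesch Reading Ease score uses the standard formula 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).</p>

```python
import re
import string


def count_syllables(word):
    """Very rough heuristic: count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def text_features(text):
    """Compute simple count and readability features for a text."""
    words = re.findall(r"[A-Za-z']+", text)
    # Treat runs of '.', '!' or '?' as sentence terminators.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    syllables = sum(count_syllables(w) for w in words)

    n_words = len(words)
    n_sentences = max(1, len(sentences))
    # Flesch Reading Ease: higher scores indicate easier text.
    flesch = (206.835
              - 1.015 * (n_words / n_sentences)
              - 84.6 * (syllables / max(1, n_words)))
    return {
        "word_count": n_words,
        "sentence_count": len(sentences),
        "punctuation_count": punctuation,
        "syllable_count": syllables,
        "flesch_reading_ease": flesch,
    }


features = text_features("The dog ran. The cat sat!")
print(features)
```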

<h1>3.4 Word Embedding (text vectors)</h1>
<p>Word embeddings represent words as vectors. The aim is to map high-dimensional word features to low-dimensional feature vectors while preserving contextual similarity in the corpus. They are widely used in deep learning models such as Convolutional Neural Networks and Recurrent Neural Networks.</p>
<p>Word2Vec and GloVe are two popular models for creating word embeddings of a text. Each takes a corpus as input and produces word vectors as output.</p>
<p>Word2Vec is composed of a preprocessing module and two shallow neural network models: Continuous Bag of Words (CBOW) and skip-gram. It first constructs a vocabulary from the training corpus and then learns word embedding representations. These models are widely used for NLP problems.</p>
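<p>To make the difference between the two architectures more concrete, the sketch below builds the kind of training examples each one is commonly described as using: skip-gram predicts context words from a center word, while CBOW predicts the center word from its context. The window size of 1 and the helper names <code>skipgram_pairs</code> and <code>cbow_examples</code> are assumptions for illustration; real implementations add further details that are omitted here.</p>

```python
def skipgram_pairs(sentence, window=1):
    """Return (center, context) pairs within `window` words of each center."""
    pairs = []
    for i, center in enumerate(sentence):
        start = max(0, i - window)
        end = min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


def cbow_examples(sentence, window=1):
    """Return (context_words, center) examples for CBOW-style training."""
    examples = []
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - window):i]
        right = sentence[i + 1:i + window + 1]
        context = left + right
        if context:
            examples.append((context, center))
    return examples


sentence = ["learning", "science", "data", "analytics"]
pairs = skipgram_pairs(sentence)
examples = cbow_examples(sentence)

print(pairs)
print(examples)
```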

<h3>Create the Text Vectors</h3>

In [8]:
from gensim.models import Word2Vec
sentences = [["data", "science"], ["learning", "science", "data", "analytics"],
             ["machine", "learning"], ["deep", "learning"]]

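# Train the model on the toy corpus; min_count=1 keeps every word,
# even those that appear only once
model = Word2Vec(sentences, min_count=1)

# Query word vectors through `model.wv`; accessing them on the model
# itself is deprecated in newer gensim versions
print(model.wv.similarity("data", "science"))
print(model.wv["learning"])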

-0.0556528586891
[  1.04084518e-03  -1.77087018e-03  -4.61314805e-03  -3.06192669e-03
  -1.42174272e-03  -2.12180754e-03  -3.43434309e-04   1.28958223e-03
   2.03153747e-03   2.86165567e-04  -2.45753489e-03  -4.70468448e-03
  -4.27514518e-04  -1.77422364e-03   3.74865299e-03   3.30105773e-03
   4.28414717e-03  -4.15968476e-03  -3.81730171e-03   3.44632915e-03
  -4.37509082e-03  -3.91984126e-03   4.96243173e-03   2.10108631e-03
  -4.47424827e-04   3.18493135e-03  -4.91256453e-03   3.60128330e-03
   3.76950120e-05  -4.82656760e-03  -6.73248258e-04  -7.62612792e-04
   3.35885887e-03  -4.64811642e-03   5.33412036e-04  -1.91591599e-03
  -3.18837818e-03  -3.16291675e-03  -3.00405663e-03   1.34144025e-03
  -2.76611256e-03   9.11801762e-04   2.10785563e-03   1.06058235e-03
  -1.23878563e-04   4.94708493e-03  -3.91837023e-03   3.36146983e-03
  -3.94387683e-03  -1.05841365e-03   2.58075166e-03   2.48690927e-03
   4.98169975e-04  -4.48935665e-03  -4.39680321e-03   1.05393201e-03
   2.51843012e-03