<h1>4 Important Tasks of NLP</h1>

<h1>4.1 Text Classification</h1>
<p>Text classification is a classic NLP problem. Examples:
<ul>
    <li>Email Spam Identification</li>
    <li>Topic Classification of News</li>
    <li>Sentiment Classification and Organization of Web Pages</li>
</ul>
</p>
<p>Text classification is the task of systematically assigning a text object (a document or a sentence) to one of a fixed set of categories. It is especially helpful for filtering, organizing, and storing large amounts of data.</p>
<p>A typical natural language classifier consists of two parts:
<ul>
    <li>Training</li>
    <li>Prediction</li>
</ul>
First, the input text is processed and features are extracted from it. A machine learning model then learns from these features and uses what it has learned to predict the category of new text.
</p>
<img src="language-classifier.png">
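<p>The two parts can be sketched in plain Python. This is a minimal illustration, not the method used by any particular library: it assumes a bag-of-words feature step and a multinomial Naive Bayes model with add-one smoothing, and the names <code>extract_features</code>, <code>train</code>, <code>predict</code>, and <code>toy_corpus</code> are hypothetical.</p>

```python
import math
from collections import Counter, defaultdict

def extract_features(text):
    """Feature step: turn raw text into a list of lowercase word tokens."""
    tokens = (w.strip(".,!?").lower() for w in text.split())
    return [t for t in tokens if t]

def train(labeled_texts):
    """Training step: count how often each word occurs under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_texts:
        label_counts[label] += 1
        word_counts[label].update(extract_features(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(model, text):
    """Prediction step: return the label with the highest log-probability."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in extract_features(text):
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, for illustration only
toy_corpus = [
    ("I love this burger", "Class_A"),
    ("What an awesome view", "Class_A"),
    ("My management is poor", "Class_B"),
    ("He is my worst enemy", "Class_B"),
]
toy_model = train(toy_corpus)
print(predict(toy_model, "an awesome burger"))  # Class_A
print(predict(toy_model, "poor management"))    # Class_B
```

<p>Real classifiers differ mainly in how rich the feature step is and which model does the learning; the train-then-predict split stays the same.</p>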

<h3>Code: Using a Naive Bayes classifier with the TextBlob library (built on top of NLTK)</h3>

In [7]:
from textblob.classifiers import NaiveBayesClassifier as NBC
from textblob import TextBlob

training_corpus = [
                   ('I am exhausted from this work.', 'Class_B'),
                   ("I can't cooperate with this", 'Class_B'),
                   ('He is my worst enemy!', 'Class_B'),
                   ('My management is poor.', 'Class_B'),
                   ('I love this burger.', 'Class_A'),
                   ('This is an brilliant place!', 'Class_A'),
                   ('I feel very good about these dates.', 'Class_A'),
                   ('This is my best work.', 'Class_A'),
                   ("What an awesome view", 'Class_A'),
                   ('I do not like this dish', 'Class_B')]
test_corpus = [
                ("I am not feeling well today.", 'Class_B'), 
                ("I feel brilliant!", 'Class_A'), 
                ('Gary is a friend of mine.', 'Class_A'), 
                ("I can't believe I'm doing this.", 'Class_B'), 
                ('The date was good.', 'Class_A'),
                ('I do not enjoy my job', 'Class_B')]

model = NBC(training_corpus)
print(model.classify("Their codes are amazing."))

print(model.classify("I don't like their computer."))

test_results = [model.classify(tup[0]) for tup in test_corpus]
print(test_results)
print(model.accuracy(test_corpus))

Class_A
Class_B
['Class_B', 'Class_A', 'Class_A', 'Class_B', 'Class_A', 'Class_B']
1.0


<h3>Scikit-learn Pipeline Framework for Text Classification</h3>

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn import svm

# Prepare data for SVM model - using same training
# corpus and test corpus from Naive Bayes example
train_data = []
train_labels = []
for tup in training_corpus:
    train_data.append(tup[0])
    train_labels.append(tup[1])
    
test_data = []
test_labels = []
for tup in test_corpus:
    test_data.append(tup[0])
    test_labels.append(tup[1])
    
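# Create feature vectors (df = document frequency)
# min_df=4 keeps only terms found in at least 4 training documents;
# max_df=0.9 drops terms found in more than 90% of them
vectorizer = TfidfVectorizer(min_df=4, max_df=0.9)

# Learn the vocabulary and IDF weights from the training data, then vectorize it
train_vectors = vectorizer.fit_transform(train_data)

# Vectorize the test data using the vocabulary learned from the training data
test_vectors = vectorizer.transform(test_data)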

# Perform classification with SVM, kernel=linear
model = svm.SVC(kernel="linear")
model.fit(train_vectors, train_labels)
prediction = model.predict(test_vectors)
# prediction.tolist()

print(classification_report(test_labels, prediction))

             precision    recall  f1-score   support

    Class_A       0.50      0.67      0.57         3
    Class_B       0.50      0.33      0.40         3

avg / total       0.50      0.50      0.49         6
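
<p>One plausible reason for the low scores above is the <code>min_df=4</code> setting, which is strict for a ten-sentence corpus. The sketch below recomputes document frequencies in plain Python to see which terms would survive. It assumes the vectorizer's default behavior (lowercasing, tokens of two or more word characters); <code>training_texts</code> repeats the training sentences so the block runs on its own, and the variable names are hypothetical.</p>

```python
import re
from collections import Counter

training_texts = [
    "I am exhausted from this work.",
    "I can't cooperate with this",
    "He is my worst enemy!",
    "My management is poor.",
    "I love this burger.",
    "This is an brilliant place!",
    "I feel very good about these dates.",
    "This is my best work.",
    "What an awesome view",
    "I do not like this dish",
]

# Assumption: mimic the default tokenization (lowercase, 2+ word characters)
token_pattern = re.compile(r"\b\w\w+\b")

# Count, for each term, the number of documents that contain it
doc_freq = Counter()
for text in training_texts:
    doc_freq.update(set(token_pattern.findall(text.lower())))

# Apply the same thresholds as min_df=4 and max_df=0.9
n_docs = len(training_texts)
surviving = {term: df for term, df in doc_freq.items()
             if df >= 4 and df / n_docs <= 0.9}
print(sorted(surviving.items()))  # [('is', 4), ('this', 6)]
```

<p>Under these assumptions only two very common words remain as features, which leaves the SVM little to separate the classes with. Lowering <code>min_df</code> would be a natural thing to try on a corpus this small.</p>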



<h1>4.2 Text Matching / Similarity</h1>
<p>One of the important areas of NLP is the matching of text objects to find similarities. Important applications of text matching include:
<ul>
    <li>Automatic Spelling Correction</li>
    <li>Data De-Duplication</li>
    <li>Genome Analysis</li>
</ul>
</p>
<h3>Some Text Matching Techniques:</h3>

<h3>A. Levenshtein Distance</h3>
<p>The Levenshtein distance between two strings is the minimum number of single-character edits needed to transform one string into the other. The allowed edit operations are:
<ul>
    <li>Insertion</li>
    <li>Deletion</li>
    <li>Substitution</li>
</ul>
</p>

In [9]:
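def levenshtein(s1, s2):
    """Return the Levenshtein distance between strings s1 and s2."""
    # Keep s1 as the shorter string so each row of the table stays small
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    # distances[i] = edits needed to turn s1[:i] into the part of s2 seen so far
    distances = list(range(len(s1) + 1))
    for index2, char2 in enumerate(s2):
        new_distances = [index2 + 1]
        for index1, char1 in enumerate(s1):
            if char1 == char2:
                # Characters match: no additional edit is needed
                new_distances.append(distances[index1])
            else:
                # One edit plus the cheapest of the three neighboring cells
                # (substitution, or an insertion/deletion on either side)
                new_distances.append(1 + min(distances[index1],
                                             distances[index1 + 1],
                                             new_distances[-1]))
        distances = new_distances
    return distances[-1]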

print(levenshtein("analyze", "analyse"))

1
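
<p>Edit distance is the basis of the spelling-correction application mentioned earlier. The following sketch picks the closest entry from a word list; it is an illustration under simple assumptions (a tiny hypothetical vocabulary, ties broken by list order), and it uses its own compact <code>edit_distance</code> helper so the block is self-contained.</p>

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only one previous row of the table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # drop char_a
                               current[j - 1] + 1,       # add char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def suggest(word, vocabulary):
    """Return the vocabulary entry with the smallest edit distance to word."""
    return min(vocabulary, key=lambda candidate: edit_distance(word, candidate))

# Hypothetical word list, for illustration only
vocabulary = ["language", "processing", "natural", "classification"]
print(suggest("langauge", vocabulary))   # language
print(suggest("procesing", vocabulary))  # processing
```

<p>A practical corrector would also weigh word frequency and limit the candidates it compares against, since scanning a full dictionary for every word is slow.</p>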


<h3>B. Phonetic Matching</h3>
<p>A phonetic matching algorithm takes a keyword as input (a person's name, a location name, etc.) and produces a character string that identifies a set of words that are (roughly) phonetically similar. It is useful for searching large text corpora, correcting spelling errors, and matching relevant names. Soundex and Metaphone are the two main phonetic algorithms used for this purpose. The Python package Fuzzy can be used to compute Soundex strings for different words:</p>

In [10]:
import fuzzy

soundex = fuzzy.Soundex(4)

print(soundex("aunt"))
print(soundex("ant"))

A53
A53
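
<p>To show roughly how such a code is derived, here is a plain-Python sketch of the classic American Soundex rules: keep the first letter, map the remaining consonants to digits, skip vowels, and collapse repeated codes. It is not the Fuzzy library's implementation. In particular, this sketch pads the result with zeros to four characters, so it returns <code>A530</code> where the library output above shows <code>A53</code>. The names <code>SOUNDEX_CODES</code> and <code>soundex_code</code> are hypothetical.</p>

```python
# Consonant groups that share a Soundex digit
SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}

def soundex_code(word, length=4):
    """Return the classic Soundex code of word, zero-padded to length."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous_code = SOUNDEX_CODES.get(letters[0], "")
    for char in letters[1:]:
        code = SOUNDEX_CODES.get(char, "")  # vowels, h, w, y have no code
        if code and code != previous_code:
            encoded += code
        if char not in "hw":  # h and w do not separate repeated codes
            previous_code = code
    return (encoded + "0" * length)[:length]

print(soundex_code("aunt"))    # A530
print(soundex_code("ant"))     # A530
print(soundex_code("Robert"))  # R163
print(soundex_code("Rupert"))  # R163
```

<p>Words that map to the same code, such as "Robert" and "Rupert", are treated as phonetic matches.</p>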


<h3>C. Flexible String Matching</h3>
<p>A complete text matching system pipelines several algorithms together to handle a variety of text variations. Regular expressions are helpful for this purpose as well. Other common techniques include:
<ul>
    <li>Exact String Matching</li>
    <li>Lemmatized Matching</li>
    <li>Compacted Matching (takes care of spaces, punctuation, slang, etc.)</li>
</ul>
</p>
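
<p>As a small illustration of compacted matching, the sketch below normalizes both strings before comparing them. It assumes that "compacting" means lowercasing and removing everything except ASCII letters and digits; it does not handle slang, which would need a lookup table. The function names are hypothetical.</p>

```python
import re

def compact(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def compacted_match(a, b):
    """Return True when the two strings are equal after compacting."""
    return compact(a) == compact(b)

print("write-up" == "Write up")                 # exact matching: False
print(compacted_match("write-up", "Write up"))  # True
print(compacted_match("write-up", "write-off")) # False
```

<p>In a pipeline, a cheap check like this would typically run right after exact matching and before costlier steps such as lemmatized or phonetic matching.</p>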

<h3>D. Cosine Similarity</h3>
<p>When text is represented as vectors, cosine similarity can be used to measure how similar two vectors are. The following code converts text to vectors (using term frequency) and applies cosine similarity to measure the closeness between two texts:</p>


In [16]:
import math
from collections import Counter

def get_cosine(vec1, vec2):
    common = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in common])
    
    sum1 = sum([vec1[x]**2 for x in vec1.keys()])
    sum2 = sum([vec2[x]**2 for x in vec2.keys()])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)
    
    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator

def text_to_vector(text):
    words = text.split()
    return Counter(words)

text1 = "This is a write-up about natural language processing."
text2 = "The write up is about natural language processing."

vector1 = text_to_vector(text1)
vector2 = text_to_vector(text2)
cosine = get_cosine(vector1, vector2)
print(cosine)
    

0.6249999999999999


<h1>4.3 Coreference Resolution</h1>
<p>Coreference resolution is the process of finding relational links among the words (or phrases) within sentences, that is, working out which expressions refer to the same entity.</p>
<p>Example: "Donald went to John's office to see the new table. He looked at it for a minute."</p>
<p>Here, coreference resolution determines that "He" refers to Donald and "it" refers to the table. Coreference resolution is used in:
<ul>
    <li>Document Summarization</li>
    <li>Question Answering</li>
    <li>Information Extraction</li>
</ul>
</p>

<h1>4.4 Other NLP Problems/Tasks</h1>
<ul>
    <li><strong>Text Summarization - </strong> Given a text article or paragraph, summarize it automatically to produce the most important and relevant sentences in order.</li>
    <li><strong>Machine Translation - </strong> Automatically translate text from one human language to another, taking into account grammar, semantics, and information about the real world.</li>
    <li><strong>Natural Language Generation and Understanding - </strong> Natural language generation converts information from computer databases or semantic intents into readable human language. Natural language understanding goes the other way: it converts chunks of text into more logical structures that are easier for computer programs to manipulate.</li>
    <li><strong>Optical Character Recognition - </strong> Given an image representing printed text, determine the corresponding text.</li>
    <li><strong>Document to Information - </strong> This involves parsing the textual data present in documents (websites, files, PDFs, and images) into a clean, analyzable format.</li>
</ul>
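
<p>As one concrete example from the list above, extractive text summarization can be sketched with word frequencies alone. This is a deliberately simple heuristic, not a description of how production summarizers work: it assumes sentences end with ".", "!" or "?", scores each sentence by the average frequency of its non-stop words, and keeps the top sentences in their original order. <code>STOP_WORDS</code> and <code>summarize</code> are hypothetical names, and the stop-word list is a small placeholder.</p>

```python
import re
from collections import Counter

# Small placeholder stop-word list (an assumption for this sketch)
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "to", "in", "also"}

def content_words(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def summarize(text, max_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequencies = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(frequencies[t] for t in tokens) / max(len(tokens), 1)

    # Rank sentence indices by score, then restore document order
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)

article = ("NLP systems process text. "
           "NLP systems also classify text. "
           "Lunch was late.")
print(summarize(article, max_sentences=2))
# NLP systems process text. NLP systems also classify text.
```

<p>The off-topic sentence is dropped because none of its words recur elsewhere in the text.</p>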