### Application of Machine Learning: Natural Language Processing

Natural Language Processing, or NLP, is a branch of computer science concerned with enabling computers and their programs to understand text and spoken words in much the same way that human beings do.

### Import the Necessary Libraries

Here, we again use the `scikit-learn` machine learning library, along with another free, open-source library built specifically for natural language processing: `nltk`. The Natural Language Toolkit, or `nltk`, is a suite of libraries most often used by researchers working with human language data.

In [None]:
import nltk
nltk.download('popular')

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...

True

In [None]:
import string
from nltk.corpus import stopwords 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

### Process the Words

Here, we go through several processing steps that are common in NLP.

In the function `clean_string`, we pre-process the sentence. We remove the punctuation, convert all letters to lowercase so that words such as "The" and "the" are treated as the same word, and remove stop words: very frequent words (e.g. "the", "an") that contribute little to the meaning of the sentence.
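To make these cleaning steps concrete, the following standalone sketch applies them one at a time and keeps each intermediate result. It assumes a tiny hand-written stop-word set (`DEMO_STOPWORDS`, a hypothetical stand-in for the much longer `nltk` English list) so that it runs without any downloads; the helper name `clean_steps` is also only illustrative.

```python
import string

# Hypothetical, tiny stand-in for nltk's English stop-word list.
DEMO_STOPWORDS = {"this", "is", "a", "to", "the", "an"}

def clean_steps(text):
    """Return the intermediate result of each cleaning step."""
    # Step 1: drop every punctuation character.
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    # Step 2: lowercase everything.
    lowered = no_punct.lower()
    # Step 3: keep only the words that are not stop words.
    kept = " ".join(w for w in lowered.split() if w not in DEMO_STOPWORDS)
    return no_punct, lowered, kept

no_punct, lowered, kept = clean_steps("This sentence is similar to a foo bar sentence.")
print(no_punct)  # This sentence is similar to a foo bar sentence
print(lowered)   # this sentence is similar to a foo bar sentence
print(kept)      # sentence similar foo bar sentence
```

Note that the order matters: lowercasing before the stop-word check is what lets "This" match the lowercase entry "this".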

The `clean_vectorizer` function converts the cleaned strings into numerical vectors. Each sentence becomes a vector of word counts, with one dimension for every distinct word found across the sentences, so the sentences can be treated as mathematical objects in a multi-dimensional space.
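As a rough illustration of what a count-based vectorizer does, the sketch below builds the word-count vectors by hand with the standard library. It is a simplified assumption about the idea, not the `CountVectorizer` implementation, which among other things applies its own tokenization rules; the helper name `count_vectors` is hypothetical.

```python
from collections import Counter

def count_vectors(sentences):
    """Build a shared vocabulary and one word-count vector per sentence."""
    tokenized = [s.split() for s in sentences]
    # Sorted list of every distinct word across all sentences.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One entry per vocabulary word; Counter returns 0 for absent words.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

vocabulary, vectors = count_vectors([
    "foo bar sentence",
    "sentence similar foo bar sentence",
])
print(vocabulary)  # ['bar', 'foo', 'sentence', 'similar']
print(vectors)     # [[1, 1, 1, 0], [1, 1, 2, 1]]
```

Both vectors share the same four dimensions, which is what makes them directly comparable.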

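In the function `cosine_sim_vectors`, we measure the cosine similarity between the two vectorized strings. Cosine similarity is "a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them." Because word-count vectors have no negative entries, the score ranges from 0 (no words in common) to 1 (the vectors point in the same direction).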
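Put as a formula, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below computes it with the standard library for two small word-count vectors chosen for illustration; the function name `cosine` is hypothetical and is not part of `scikit-learn`.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vec1 = [1, 1, 1, 0]
vec2 = [1, 1, 2, 1]
score = cosine(vec1, vec2)   # 4 / (sqrt(3) * sqrt(7))
print(round(score, 4))       # 0.8729
```

Here the dot product is 4 and the lengths are √3 and √7, so the score is 4/√21, or about 0.87.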

Finally, the `compare` function combines all of these functions into one: it cleans, vectorizes, and then compares two strings with each other, returning the similarity as a percentage.

In [None]:
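# English stop words from nltk, stored as a set for fast membership tests
stopwords_ = set(stopwords.words('english'))

def clean_string(text):
    # Drop punctuation characters
    text = ''.join([char for char in text if char not in string.punctuation])
    # Lowercase so that 'This' and 'this' count as the same word
    text = text.lower()
    # Drop stop words
    text = ' '.join([word for word in text.split() if word not in stopwords_])
    return text

def clean_vectorizer(cleaned_sentences):
    # fit_transform returns a sparse matrix of word counts, one row per sentence
    count_matrix = CountVectorizer().fit_transform(cleaned_sentences)
    vectors = count_matrix.toarray()
    return vectors

def cosine_sim_vectors(vec1, vec2):
    # cosine_similarity expects 2-D inputs, so reshape each vector to one row
    vec1 = vec1.reshape(1, -1)
    vec2 = vec2.reshape(1, -1)
    return cosine_similarity(vec1, vec2)[0][0]

def compare(a, b):
    sentences = [a, b]
    cleaned = list(map(clean_string, sentences))
    vectors = clean_vectorizer(cleaned)
    rawscore = cosine_sim_vectors(vectors[0], vectors[1])
    # Convert the 0-1 score to a percentage
    finalscore = rawscore * 100
    return finalscore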

### Compare Some Sentences

Finally, we can compare two similar sentences and check whether the score matches our intuition.

In [None]:
original_sentence = 'This is a foo bar sentence'
modified_sentence = 'This sentence is similar to a foo bar sentence.'

In [None]:
print('The two sentences are {:.2f}% similar.'.format(compare(original_sentence, modified_sentence)))

The two sentences are 87.29% similar.
