Noun Phrase Extraction & Text Similarity
--


Extracting Noun Phrases
--

Problem
--
You want to extract a noun phrase.

Solution
--
Noun Phrase extraction is important when you want to analyze the “who”
in a sentence. Let’s see an example below using TextBlob.

In [23]:
#Import libraries
import nltk
from textblob import TextBlob

# Extract noun phrases
blob = TextBlob(" Zoya and Rocky are Riding a Bike with Zuhrah")

for phrase in blob.noun_phrases:
    print(phrase)

# Note: any word written in uppercase, or with an uppercase first
# character (here "Riding" and "Bike"), is mistakenly recognized as a noun.

zoya
rocky
riding
bike
zuhrah


Finding Similarity Between Texts
--

In this coding example, we discuss how to find the similarity between two documents or texts. There are many similarity metrics, such as Euclidean, cosine, and Jaccard.

Applications of text similarity can be found in areas like spelling correction and data deduplication.

Here are a few of the similarity measures:

> Cosine similarity: Calculates the cosine of the angle between the two vectors.

> Jaccard similarity: The score is the size of the intersection of the two word sets divided by the size of their union.
Jaccard Index = (the number in both sets) / (the number in either set) * 100.

> Levenshtein distance: Minimal number of insertions, deletions, and replacements required for transforming string “a” into string “b.”

> Hamming distance: Number of positions at which the two strings have different symbols. It is defined only for strings of equal length.
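
The three set- and string-based measures above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names are made up for this example, words are split on whitespace and lowercased (an assumption), and the Jaccard score is returned as a fraction between 0 and 1, so multiply by 100 to get the percentage form given above.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard index of the two word sets, as a fraction between 0 and 1."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0  # assumption: two empty texts count as identical
    return len(words_a & words_b) / len(words_a | words_b)


def levenshtein_distance(a, b):
    """Minimal number of insertions, deletions, and replacements turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # replacement
        previous = current
    return previous[-1]


def hamming_distance(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


print(jaccard_similarity("I like NLP", "I like advanced NLP"))  # 0.75
print(levenshtein_distance("kitten", "sitting"))                # 3
print(hamming_distance("karolin", "kathrin"))                   # 3
```

The Levenshtein function keeps only two rows of the dynamic-programming table at a time, which is enough when only the distance is needed and not the edit sequence.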

Problem
--
You want to find the similarity between texts/documents.

Solution
--
The simplest way to do this is by using cosine similarity from the sklearn
library.

In [3]:
documents = (
"I like NLP",
"I am exploring NLP",
"I am a beginner in NLP",
"I want to learn NLP",
"I like advanced NLP"
)

# code to find the similarity.
# Import libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

#Compute tfidf
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
tfidf_matrix.shape

(5, 10)

In [4]:
# Compute the similarity of the first sentence to every sentence (including itself)
cosine_similarity(tfidf_matrix[0:1],tfidf_matrix)

array([[1.        , 0.17682765, 0.14284054, 0.13489366, 0.68374784]])
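
To make the cosine computation itself concrete, here is a small sketch that computes it by hand on raw word counts. It is an illustration under stated assumptions: the function name is hypothetical, words are split on whitespace and lowercased, and no TF-IDF weighting is applied, so the score differs from the sklearn result above (which weights terms and, with its default settings, drops one-character tokens such as "I").

```python
import math
from collections import Counter


def cosine_from_counts(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that occur in both texts
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared_words)
    # Euclidean length of each count vector
    norm_a = math.sqrt(sum(count * count for count in counts_a.values()))
    norm_b = math.sqrt(sum(count * count for count in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty text is similar to nothing
    return dot / (norm_a * norm_b)


print(round(cosine_from_counts("I like NLP", "I like advanced NLP"), 3))  # 0.866
```

The two sentences share three words, and their count vectors have lengths sqrt(3) and 2, which gives 3 / (2 * sqrt(3)), roughly 0.866.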

Phonetic matching
--
The next kind of similarity checking is phonetic matching, which roughly
matches two words or sentences by how they sound. It encodes each word or text as an alphanumeric string, so that similar-sounding inputs get the same or similar codes.

It is very useful for searching large text corpora, correcting spelling errors, and matching relevant names.

Soundex and Metaphone are the two main phonetic algorithms used for this
purpose. The simplest way to use them is through the Fuzzy library.

In [2]:
# Install the library (the PyPI package is named Fuzzy)
!pip install Fuzzy

# The module itself is imported in lowercase
import fuzzy

# Create a Soundex encoder that produces 4-character codes.
# Soundex converts the input string into a short code, which can be
# compared with the Soundex code calculated for another string.
soundex = fuzzy.Soundex(4)

# Generate the phonetic form
soundex('natural')

# Note: in the run recorded below, the build failed on Windows because
# Microsoft Visual C++ 14.0 was missing, so the import failed as well.

# Recommended reading:
# https://medium.com/@yash_agarwal2/soundex-and-levenshtein-distance-in-python-8b4b56542e9e

Collecting Fuzzy
  Using cached https://files.pythonhosted.org/packages/ad/b0/210f790e81e3c9f86a740f5384c758ad6c7bc1958332cf64263a9d3cf336/Fuzzy-1.2.2.tar.gz
Building wheels for collected packages: Fuzzy
  Building wheel for Fuzzy (setup.py): started
  Building wheel for Fuzzy (setup.py): finished with status 'error'
  Running setup.py clean for Fuzzy
Failed to build Fuzzy
Installing collected packages: Fuzzy
    Running setup.py install for Fuzzy: started
    Running setup.py install for Fuzzy: finished with status 'error'


  ERROR: Command errored out with exit status 1:
   command: 'c:\program files\python36\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\Zuhrah\\AppData\\Local\\Temp\\pip-install-23ae7qqs\\Fuzzy\\setup.py'"'"'; __file__='"'"'C:\\Users\\Zuhrah\\AppData\\Local\\Temp\\pip-install-23ae7qqs\\Fuzzy\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\Zuhrah\AppData\Local\Temp\pip-wheel-qitld2c1' --python-tag cp36
       cwd: C:\Users\Zuhrah\AppData\Local\Temp\pip-install-23ae7qqs\Fuzzy\
  Complete output (8 lines):
  running bdist_wheel
  running build
  running build_ext
  cythoning src/fuzzy.pyx to src\fuzzy.c
    tree = Parsing.p_module(s, pxd, full_module_name)
  building 'fuzzy' extension
  error: Microsoft Visual C++ 14.0 is required. Get it with "Build Tools for Visual Studio": https://visualstudio.microso

ModuleNotFoundError: No module named 'Fuzzy'
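
Because the Fuzzy build failed in the log above, a pure-Python sketch of Soundex may help to see what the encoding does without a compiled extension. This follows the commonly described American Soundex rules and assumes plain ASCII English words. The function and table names are made up for this example, and the Fuzzy library may differ in edge cases.

```python
# Letter groups that share a Soundex digit; vowels, y, h, and w have no digit
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(word, length=4):
    """Encode a word as its first letter followed by digits, padded with zeros."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous_code = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous_code:
            encoded += code
        previous_code = code  # a vowel resets this, so a repeated code is kept
    return (encoded + "0" * length)[:length]


print(soundex("natural"))                    # N364
print(soundex("Robert"), soundex("Rupert"))  # R163 R163
```

Two names that sound alike, such as "Robert" and "Rupert", end up with the same code, which is what makes the encoding usable for matching names.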

Important resource links (make your own notebook)
--

> https://pypi.org/project/Fuzzy/  

(From the link above: run pip install Fuzzy and execute only the first example.)

> http://www.informit.com/articles/article.aspx?p=1848528

(Read and execute the three code examples: Soundex, NYSIIS, and DMetaphone.)

> https://www.datacamp.com/community/tutorials/fuzzy-string-python

(Read everything, but start executing the examples from the middle of the blog, where the "Levenshtein" package is used, i.e. import Levenshtein as lev.)


*only for extra reading*  
> https://medium.com/@yash_agarwal2/soundex-and-levenshtein-distance-in-python-8b4b56542e9e