# AIM

1) Stemmers

2) Lemmatizer

3) Vectorising
    - Bag of Words
    - TF-IDF

4) Similarity Measuring
    - Manhattan distance
    - Euclidean distance
    - Cosine similarity

## Stemmers

In [14]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

In [15]:
text = """The Government of India, often abbreviated as GoI, is the union government created by the constitution of India as the legislative, executive and judicial authority of the union of 29 states and seven union territories of a constitutionally democratic republic. It is located in New Delhi, the capital of India."""

In [16]:
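stemmer = PorterStemmer()
# Splitting on spaces leaves punctuation attached to words, so tokens such as
# "legislative," are not reduced to their stem.
example = [stemmer.stem(token) for token in text.split(" ")]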
print(" ".join(example))

the govern of india, often abbrevi as goi, is the union govern creat by the constitut of india as the legislative, execut and judici author of the union of 29 state and seven union territori of a constitut democrat republic. It is locat in new delhi, the capit of india.
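
Because the cell above splits on spaces, tokens such as `legislative,` keep their trailing comma and are not reduced to a stem. A minimal sketch of one way around this, assuming a simple regular-expression tokenizer is acceptable (the helper name `simple_tokenize` is hypothetical and not part of NLTK). The resulting tokens could then be passed to `stemmer.stem` as before:

```python
import re

def simple_tokenize(text):
    """Return lowercase word tokens with punctuation removed (hypothetical helper)."""
    # \w+ matches runs of letters, digits and underscores, so commas and
    # full stops are dropped instead of staying attached to the words.
    return re.findall(r"\w+", text.lower())

sample = "the legislative, executive and judicial authority of the union."
tokens = simple_tokenize(sample)
print(tokens)
```

This is only a sketch: it also splits on apostrophes and hyphens, which may or may not be what a given task needs.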


## Lemmatizer

In [17]:
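lemmatizer = WordNetLemmatizer()
# lemmatize() defaults to pos='n' (noun), which is why "as" is reduced to "a"
# in the output and verbs such as "created" are left as they are.
example = [lemmatizer.lemmatize(token) for token in text.split(" ")]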
print(" ".join(example))

The Government of India, often abbreviated a GoI, is the union government created by the constitution of India a the legislative, executive and judicial authority of the union of 29 state and seven union territory of a constitutionally democratic republic. It is located in New Delhi, the capital of India.


In [18]:
# pos='a' tells the lemmatizer to treat the word as an adjective.
print(lemmatizer.lemmatize('better', pos='a'))

good


In [19]:
# Without a pos argument the word is treated as a noun, so it is returned unchanged.
print(lemmatizer.lemmatize('better'))

better


## Vectorising

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [11]:
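# binary=True records only whether a term is present (1) or absent (0), not how often it occurs.
vect = CountVectorizer(binary=True)
# Note: "cahracter" in the first sentence is misspelled, so it becomes its own
# vocabulary entry, separate from "character".
corpus = ["Tesseract is an optical cahracter recognition engine", "optical character recognition"]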
vect.fit(corpus)
print(vect.transform(corpus).toarray())

[[1 1 0 1 1 1 1 1]
 [0 0 1 0 0 1 1 0]]


In [12]:
vocab = vect.vocabulary_
# Print each term with its column index, in alphabetical order.
for key, index in sorted(vocab.items()):
    print(f"{key}:{index}")

an:0
cahracter:1
character:2
engine:3
is:4
optical:5
recognition:6
tesseract:7
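
To make the mapping from text to matrix explicit, the same binary bag-of-words rows can be rebuilt without scikit-learn. This is a minimal sketch, assuming lowercasing and a two-or-more-character token pattern like `CountVectorizer`'s defaults; the helper names `tokenize` and `binary_vector` are hypothetical:

```python
import re

corpus = ["Tesseract is an optical cahracter recognition engine",
          "optical character recognition"]

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted alphabetically and numbered.
all_terms = sorted({term for doc in corpus for term in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(all_terms)}

def binary_vector(doc):
    # 1 if the term occurs in the document at all, 0 otherwise.
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocabulary]

matrix = [binary_vector(doc) for doc in corpus]
for row in matrix:
    print(row)
```

Each column corresponds to one vocabulary term, so the misspelled `cahracter` and the correct `character` occupy different columns.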


### Cosine Similarity

In [13]:
from sklearn.metrics.pairwise import cosine_similarity

In [15]:
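# "Tessaract" is misspelled here, so it is not in the fitted vocabulary and
# transform() silently ignores it.
similarity = cosine_similarity(vect.transform(["Tessaract is an optical character recognition engine"]).toarray(),
                               vect.transform(["Optical character recognition"]).toarray())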
print(similarity)

[[0.70710678]]
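
The value above can be checked by hand. The longer sentence maps to six known vocabulary terms (the misspelled "Tessaract" is out of vocabulary), the shorter one to three, and all three are shared, so the cosine is 3 / (√6 · √3) = 1/√2. A minimal sketch in plain Python, with a hypothetical helper named `cosine`:

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Column order: an, cahracter, character, engine, is, optical, recognition, tesseract
query = [1, 0, 1, 1, 1, 1, 1, 0]  # longer sentence; "tessaract" has no column
short = [0, 0, 1, 0, 0, 1, 1, 0]  # "Optical character recognition"

print(cosine(query, short))
```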


## Spacy

In [16]:
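# Requires the spaCy package (for example, `pip install spacy`); it is not installed in this environment.
import spacy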

ModuleNotFoundError: No module named 'spacy'

## Finding Cosine Similarity between Texts

In [18]:
str1 = "Winter is the coldest season of the year in polar and temperate zones (winter does not occur in most of the tropical zone). It occurs after autumn and before spring in each year. Winter is caused by the axis of the Earth in that hemisphere being oriented away from the Sun. Different cultures define different dates as the start of winter, and some use a definition based on weather. When it is winter in the Northern Hemisphere, it is summer in the Southern Hemisphere, and vice versa. In many regions, winter is associated with snow and freezing temperatures. The moment of winter solstice is when the Sun's elevation with respect to the North or South Pole is at its most negative value (that is, the Sun is at its farthest below the horizon as measured from the pole). The day on which this occurs has the shortest day and the longest night, with day length increasing and night length decreasing as the season progresses after the solstice. The earliest sunset and latest sunrise dates outside the polar regions differ from the date of the winter solstice, however, and these depend on latitude, due to the variation in the solar day throughout the year caused by the Earth's elliptical orbit (see earliest and latest sunrise and sunset)."
str2 = "Autumn, also known as fall in American English and sometimes in Canadian English, is one of the four temperate seasons. Autumn marks the transition from summer to winter, in September (Northern Hemisphere) or March (Southern Hemisphere), when the duration of daylight becomes noticeably shorter and the temperature cools considerably. One of its main features in temperate climates is the shedding of leaves from deciduous trees."
str3 = "Spring is one of the four temperate seasons, following winter and preceding summer. There are various technical definitions of spring, but local usage of the term varies according to local climate, cultures and customs. When it is spring in the Northern Hemisphere, it is autumn in the Southern Hemisphere and vice versa. At the spring (or vernal) equinox, days and nights are approximately twelve hours long, with day length increasing and night length decreasing as the season progresses."

In [19]:
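# binary=True clips term counts to 0/1; IDF weighting and L2 normalisation are
# still applied, so the resulting vectors are not themselves binary.
vect = TfidfVectorizer(binary=True)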

In [20]:
corpus = [str1, str2, str3]
vect.fit(corpus)

TfidfVectorizer(analyzer='word', binary=True, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words=None, strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)
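
As a rough illustration of what a vectorizer with the settings printed above (`binary=True`, `smooth_idf=True`, `norm='l2'`) computes, here is a pure-Python sketch. It assumes scikit-learn's documented smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and uses a hypothetical two-sentence toy corpus and hypothetical helper names rather than the notebook's texts:

```python
import math
import re

docs = ["winter is cold", "spring is warm"]  # toy corpus, not the notebook's texts

def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

token_sets = [set(tokenize(d)) for d in docs]
vocabulary = sorted(set().union(*token_sets))
n_docs = len(docs)

# Smoothed inverse document frequency for each term.
idf = {}
for term in vocabulary:
    df = sum(1 for tokens in token_sets if term in tokens)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_vector(tokens):
    # Binary term frequency (0 or 1) times IDF, then L2 normalisation.
    weights = [idf[term] if term in tokens else 0.0 for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

vectors = [tfidf_vector(tokens) for tokens in token_sets]
# The vectors have unit length, so their dot product is the cosine similarity.
similarity = sum(a * b for a, b in zip(vectors[0], vectors[1]))
print(vocabulary)
print(similarity)
```

The only shared term, "is", appears in every document and therefore gets the lowest IDF weight, which keeps the similarity of the two toy sentences low.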

In [21]:
vecstr1 = vect.transform([str1]).toarray()
vecstr2 = vect.transform([str2]).toarray()
vecstr3 = vect.transform([str3]).toarray()

In [22]:
sim = cosine_similarity(vecstr1, vecstr2)

In [23]:
sim

array([[0.12107879]])

In [24]:
print('Cosine similarity between text 1 and 2:', cosine_similarity(vecstr1, vecstr2))

print('Cosine similarity between text 2 and 3:', cosine_similarity(vecstr2, vecstr3))

print('Cosine similarity between text 1 and 3:', cosine_similarity(vecstr1, vecstr3))

Cosine similarity between text 1 and 2: [[0.12107879]]
Cosine similarity between text 2 and 3: [[0.20105444]]
Cosine similarity between text 1 and 3: [[0.23285187]]
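
The outline also lists Manhattan and Euclidean measures, which the cells above do not demonstrate. A minimal sketch, assuming plain Python lists as vectors and hypothetical helper names. Both are distances, so smaller values mean more similar vectors, which is the opposite of cosine similarity:

```python
import math

def manhattan_distance(u, v):
    # Sum of absolute coordinate differences (L1 distance).
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean_distance(u, v):
    # Square root of the summed squared differences (L2 distance).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Binary bag-of-words rows from the Tesseract example earlier in the notebook.
doc1 = [1, 1, 0, 1, 1, 1, 1, 1]
doc2 = [0, 0, 1, 0, 0, 1, 1, 0]

print(manhattan_distance(doc1, doc2))  # six positions differ
print(euclidean_distance(doc1, doc2))  # square root of 6
```

For array inputs, scikit-learn provides `manhattan_distances` and `euclidean_distances` in `sklearn.metrics.pairwise`, alongside the `cosine_similarity` function used above.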
