
In [1]:
import pandas as pd
import chardet
with open('Hydra-Movie-Scrape.csv', 'rb') as f:
    encoding = chardet.detect(f.read())['encoding']


In [8]:
df=pd.read_csv('Hydra-Movie-Scrape.csv',encoding=encoding)

In [9]:
df.head()

Unnamed: 0,Title,Year,Summary,Short Summary,IMDB ID,Runtime,YouTube Trailer,Rating,Movie Poster,Director,Writers,Cast
0,Patton Oswalt: Annihilation,2017,"Patton Oswald, despite a personal tragedy, pro...","Patton Oswalt, despite a personal tragedy, pro...",tt7026230,66,4hZi5QaMBFc,7.4,https://hydramovies.com/wp-content/uploads/201...,Bobcat Goldthwait,Patton Oswalt,Patton Oswalt
1,New York Doll,2005,A recovering alcoholic and recently converted ...,A recovering alcoholic and recently converted ...,tt0436629,75,jwD04NsnLLg,7.9,https://hydramovies.com/wp-content/uploads/201...,Greg Whiteley,Arthur Kane,Sylvain Sylvain
2,Mickey's Magical Christmas: Snowed in at the H...,2001,After everyone is snowed in at the House of Mo...,Mickey and all his friends hold their own Chri...,tt0300195,65,uCKwHHftrU4,6.8,https://hydramovies.com/wp-content/uploads/201...,Tony Craig,Thomas Hart,Carlos Alazraqui|Wayne Allwine
3,Mickey's House of Villains,2001,The villains from the popular animated Disney ...,The villains from the popular animated Disney ...,tt0329374,0,JA03ciYt-Ek,6.6,https://hydramovies.com/wp-content/uploads/201...,Jamie Mitchell,Thomas Hart,Tony Anselmo|Wayne Allwine
4,And Then I Go,2017,"In the cruel world of junior high, Edwin suffe...","In the cruel world of junior high, Edwin suffe...",tt2018111,99,8CdIiD6-iF0,7.6,https://hydramovies.com/wp-content/uploads/201...,Vincent Grashaw,Brett Haley,Arman Darbo|Sawyer Barth


In [10]:
df.shape

(3886, 12)

In [12]:
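# Keep only the Summary column, which is all we need to demonstrate cosine similarity between movies.
# .copy() makes df an independent DataFrame, so later column assignments do not raise SettingWithCopyWarning.
df = df[['Summary']].copy()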
df.head()

Unnamed: 0,Summary
0,"Patton Oswald, despite a personal tragedy, pro..."
1,A recovering alcoholic and recently converted ...
2,After everyone is snowed in at the House of Mo...
3,The villains from the popular animated Disney ...
4,"In the cruel world of junior high, Edwin suffe..."


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3886 entries, 0 to 3885
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Summary  3881 non-null   object
dtypes: object(1)
memory usage: 30.5+ KB



**Data Cleaning and Preprocessing**




In [24]:
df[df['Summary'].isnull()]


Unnamed: 0,Summary
977,
1843,
2613,
3391,
3729,


In [25]:
df['Summary'][977]

nan

In [28]:
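# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['Summary'] = df['Summary'].fillna("Not Available")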
df['Summary'][977]

'Not Available'

In [29]:

import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

#download resources

nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [30]:
# Get the English stopwords list
stopwords_list = stopwords.words('english')


In [31]:
# Stop words plus punctuation marks, stored in a set for fast membership tests
stop_words = set(stopwords.words('english')) | set(string.punctuation)


In [33]:
# let's make a single function for preprocessing using nltk
def get_wordnet_pos(word):
  """Map POS tag to first character lemmatize() accepts"""
  tag = nltk.pos_tag([word])[0][1][0].upper()
  tag_dict = {"J": wordnet.ADJ,
              "N": wordnet.NOUN,
              "V": wordnet.VERB,
              "R": wordnet.ADV}
  return tag_dict.get(tag, wordnet.NOUN)

def preprocess_data(text):
  #tokenizing
  preprocess_tokens=word_tokenize(text)

  # lowercase, then drop stop words, punctuation, and non-alphabetic tokens
  preprocess_tokens = [word.lower() for word in preprocess_tokens if word.lower() not in stop_words]
  preprocess_tokens = [token for token in preprocess_tokens if token.isalpha()]

  # Lemmatization with POS tagging
  lemmatizer = WordNetLemmatizer()
  preprocess_tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in preprocess_tokens]

  # Join the tokens back into a single string
  preprocessed_text = " ".join(preprocess_tokens)
  return preprocessed_text


df["Summary_preprocess"] = df['Summary'].apply(preprocess_data)

In [34]:
df=df.drop(['Summary'],axis=1)

In [35]:
df.head()


Unnamed: 0,Summary_preprocess
0,patton oswald despite personal tragedy produce...
1,recover alcoholic recently convert mormon arth...
2,everyone snow house mouse mickey suggests thro...
3,villain popular animate disney film gather hou...
4,cruel world junior high edwin suffers state an...


**Get the TF-IDF Matrix of the Summary Column**



In [41]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Limit the vocabulary to the 2000 most frequent terms across the corpus
vectorizer = TfidfVectorizer(max_features=2000)
df_summary_tfidf = vectorizer.fit_transform(df['Summary_preprocess'])

tfidf_matrix_df = pd.DataFrame(df_summary_tfidf.toarray(), columns=vectorizer.get_feature_names_out())

# Shape is (number of summaries, number of TF-IDF terms)
print(tfidf_matrix_df.shape)

(3886, 2000)


# Cosine Similarity

- Cosine similarity measures how similar two vectors are by comparing their direction rather than their magnitude.

- It is the cosine of the angle between the two vectors in a multidimensional space: 1 means the vectors point in the same direction, and 0 means they are orthogonal (for TF-IDF vectors, the two documents share no terms).

- Both vectors must be non-zero, because the formula divides by their norms. Here each vector is one row of the TF-IDF matrix, i.e. one movie summary.

- Mathematically, cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms (magnitudes).
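
Written as a formula, the definition above looks as follows. The symbols $A$ and $B$ (two document vectors with $n$ terms each) and $\theta$ (the angle between them) are notation assumed here for illustration:

$$
\text{cosine similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

Because TF-IDF weights are never negative, the value for two TF-IDF vectors always lies between 0 and 1.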

## Example Calculation

```
       term1      term2      term3
Doc1    0.2        0.6        0.4

Doc2    0.8        0.4        0.3
```



```
Dot Product = (0.2 * 0.8) + (0.6 * 0.4) + (0.4 * 0.3)
            = 0.16 + 0.24 + 0.12
            = 0.52
```

```
Norm of Doc1 = sqrt((0.2)^2 + (0.6)^2 + (0.4)^2)
             = sqrt(0.04 + 0.36 + 0.16)
             = sqrt(0.56)
             ≈ 0.748
```

```
Norm of Doc2 = sqrt((0.8)^2 + (0.4)^2 + (0.3)^2)
             = sqrt(0.64 + 0.16 + 0.09)
             = sqrt(0.89)
             ≈ 0.943
```
```
Cosine Similarity = Dot Product / (Norm of Doc1 * Norm of Doc2)
                  = 0.52 / (0.748 * 0.943)
                  ≈ 0.737
```
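
The same calculation can be reproduced in a few lines of plain Python. This is only a sketch for checking the arithmetic: the helper name `cosine_sim` and the variables `doc1` and `doc2` are made up for this example and are not part of the notebook.

```python
import math

# TF-IDF weights from the example table (term1, term2, term3)
doc1 = [0.2, 0.6, 0.4]
doc2 = [0.8, 0.4, 0.3]

def cosine_sim(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))        # dot product
    norm_a = math.sqrt(sum(x * x for x in a))     # Euclidean norm of a
    norm_b = math.sqrt(sum(y * y for y in b))     # Euclidean norm of b
    return dot / (norm_a * norm_b)

similarity = cosine_sim(doc1, doc2)
print(round(similarity, 3))  # 0.737
```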

In [None]:
# Cosine similarity: measures how similar each pair of documents (movie summaries)
# in this collection is.
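
One possible way to carry out this comparison is scikit-learn's `cosine_similarity`, sketched below on a tiny corpus. The three summaries are made-up stand-ins for `df['Summary_preprocess']`, and the names `summaries`, `similarity_matrix`, and `most_similar` are illustrative only; this is not necessarily how the notebook goes on to do it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for df['Summary_preprocess']; these strings are made up.
summaries = [
    "villain disney film gather house mouse",
    "mickey friend snow house mouse christmas",
    "recover alcoholic musician reunite band",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(summaries)  # sparse matrix, one row per summary

# Pairwise cosine similarity between all rows: a dense (3, 3) array
similarity_matrix = cosine_similarity(tfidf)

# Find the document most similar to document 0.
# Overwrite the self-similarity (always 1) so it cannot win.
scores = similarity_matrix[0].copy()
scores[0] = -1.0
most_similar = int(scores.argmax())
print(most_similar)  # 1, the only other summary sharing terms with summary 0
```

Applied to the full data, `cosine_similarity(df_summary_tfidf)` would return a 3886 x 3886 array in which row `i` holds the similarity of movie `i` to every movie in the collection.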