In [1]:
!pip install gensim
import pandas as pd
import gensim
from gensim.parsing.preprocessing import preprocess_documents
import numpy as np



In [2]:
!wget https://storage.googleapis.com/pet-detect-239118/text_retrieval/archive.zip

--2022-03-14 18:20:36--  https://storage.googleapis.com/pet-detect-239118/text_retrieval/archive.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 172.253.63.128, 172.217.2.112, 142.251.16.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.253.63.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 31376869 (30M) [application/x-zip-compressed]
Saving to: ‘archive.zip’

2022-03-14 18:20:36 (153 MB/s) - ‘archive.zip’ saved [31376869/31376869]


In [3]:
from zipfile import ZipFile
file_name = '/content/archive.zip'

# Use a name other than `zip` so the built-in is not shadowed
with ZipFile(file_name, 'r') as archive:
  archive.extractall()
  print('Done!!')

Done!!


The Singular-Value Decomposition (SVD) is a matrix decomposition method that factors a matrix into its constituent parts in order to make certain subsequent matrix calculations simpler.

The formula for SVD can be expressed as: \begin{align}
A = U \Sigma V^T
\end{align}
where $A$ is the real $m \times n$ matrix to be decomposed, $U$ is an $m \times m$ matrix, $\Sigma$ is an $m \times n$ diagonal matrix, and $V^T$ is the transpose of an $n \times n$ matrix $V$.

The diagonal values in the $\Sigma$ matrix are known as the singular values of the original matrix $A$. The columns of the $U$ matrix are called the left-singular vectors of $A$, and the columns of $V$ are called the right-singular vectors of $A$.

The SVD is calculated via iterative numerical methods. It is widely used both in the calculation of other matrix operations, such as the matrix inverse, and as a data reduction method in machine learning. It can also be used in least squares linear regression, image compression, and denoising data.
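
To make the least squares use case concrete, here is a minimal sketch that solves a small linear regression through the pseudoinverse $V \Sigma^{-1} U^T$. The data (`X`, `y`) is made up for illustration, and `numpy.linalg.svd` is used rather than the `scipy` routine so the block depends on NumPy only.

```python
import numpy as np

# Hypothetical toy data: a column of ones (intercept) and one feature,
# generated from y = 1 + 2x with no noise.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Thin SVD: U is 4x2, s has 2 entries, VT is 2x2
U, s, VT = np.linalg.svd(X, full_matrices=False)

# Pseudoinverse built from the factors: V * diag(1/s) * U^T
# (safe here because both singular values are clearly nonzero)
X_pinv = VT.T @ np.diag(1.0 / s) @ U.T

# Least squares coefficients: intercept and slope
coef = X_pinv @ y
print(coef)
```

The recovered coefficients should be close to an intercept of 1 and a slope of 2. When some singular values are near zero, inverting them directly is unstable, which is why library routines such as `np.linalg.pinv` apply a cutoff first.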

In [13]:
# Singular-value decomposition
from numpy import array
from scipy.linalg import svd
# define a matrix
A = array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(A)
# SVD
U, s, VT = svd(A)
print("U matrix:")
print(U)
print("Sigma matrix:")
print(s)
print("V-transpose matrix:")
print(VT)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
U matrix:
[[-0.21483724  0.88723069  0.40824829]
 [-0.52058739  0.24964395 -0.81649658]
 [-0.82633754 -0.38794278  0.40824829]]
Sigma matrix:
[1.68481034e+01 1.06836951e+00 4.41842475e-16]
V-transpose matrix:
[[-0.47967118 -0.57236779 -0.66506441]
 [-0.77669099 -0.07568647  0.62531805]
 [-0.40824829  0.81649658 -0.40824829]]
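
Note that `svd` returns the singular values as a 1-D vector, not as the full $\Sigma$ matrix. As a quick check, the sketch below rebuilds $\Sigma$ with `np.diag`, reconstructs $A$, and then drops the third singular value, which is zero up to rounding error in the output above. It uses `numpy.linalg.svd` in place of the `scipy` call; the tolerance `1e-10` is an arbitrary choice for this example.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
U, s, VT = np.linalg.svd(A)

# s holds only the diagonal entries; build the 3x3 Sigma matrix explicitly
Sigma = np.diag(s)
A_rebuilt = U @ Sigma @ VT
print(np.allclose(A, A_rebuilt))  # True

# Count singular values above a small tolerance to get the numerical rank
rank = int(np.sum(s > 1e-10))
print(rank)  # 2

# Keep only the top k singular values and vectors (truncated SVD)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ VT[:k, :]
print(np.allclose(A, A_k))  # True, because A has rank 2
```

Keeping only the largest singular values is the idea behind using SVD for data reduction: for this matrix two components already reproduce it exactly, and for real data a small `k` often gives a good approximation.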


To save RAM, we keep only movies released in 2000 or later.

In [4]:
df = pd.read_csv('/content/wiki_movie_plots_deduped.csv', sep=',')
df = df[df['Release Year'] >= 2000]
text_corpus = df['Plot'].values

In [5]:
text_corpus.shape

(12560,)

gensim.parsing.preprocessing.preprocess_documents applies the following filters to each document:

*   strip_tags: removes markup tags,
*   strip_punctuation: replaces punctuation characters with spaces,
*   strip_multiple_whitespaces: collapses runs of whitespace into a single space,
*   strip_numeric: removes digits,
*   remove_stopwords: removes stop words,
*   strip_short: removes words shorter than 3 characters,
*   stem_text: transforms the document into lowercase and stems it.
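
To illustrate what these filters do to a plot summary, here is a rough pure-Python approximation of some of them. It is not gensim's implementation: `rough_preprocess` is a hypothetical helper, the regular expressions are simplified, and stop-word removal and stemming are left out entirely.

```python
import re
import string

def rough_preprocess(text, min_len=3):
    """Approximate a few of the gensim filters (no stop words, no stemming)."""
    text = text.lower()
    text = re.sub(r"<[^>]*>", " ", text)                             # like strip_tags
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)   # like strip_punctuation
    text = re.sub(r"[0-9]+", " ", text)                              # like strip_numeric
    text = re.sub(r"\s+", " ", text).strip()                         # like strip_multiple_whitespaces
    return [tok for tok in text.split() if len(tok) >= min_len]      # like strip_short

tokens = rough_preprocess("<b>Dom Cobb</b> steals secrets, in 2010,   from dreams!")
print(tokens)
# ['dom', 'cobb', 'steals', 'secrets', 'from', 'dreams']
```

The real pipeline would additionally drop stop words such as "from" and reduce the remaining words to their stems, so its output for the same sentence will differ.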

In [6]:
processed_corpus = preprocess_documents(text_corpus)
dictionary = gensim.corpora.Dictionary(processed_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

Build the TF-IDF model from the bag-of-words corpus.

In [7]:
tfidf = gensim.models.TfidfModel(bow_corpus)

Transform the whole corpus via our model and index it, in preparation for similarity queries.

In [8]:
index = gensim.similarities.MatrixSimilarity(tfidf[bow_corpus])
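
To show what the model and the index compute, here is a small NumPy sketch of TF-IDF weighting followed by cosine similarity on a made-up three-document corpus. It assumes the weighting that gensim's `TfidfModel` uses by default as far as I understand it (raw term count times $\log_2(N/\mathrm{df})$, then L2 normalization); the corpus, the query, and the helper names `bow` and `tfidf_unit` are hypothetical.

```python
import numpy as np

# Hypothetical, already tokenized toy corpus
docs = [["dream", "heist", "dream"], ["dance", "music"], ["dream", "dance"]]

vocab = sorted({tok for doc in docs for tok in doc})
tok2id = {tok: i for i, tok in enumerate(vocab)}

def bow(tokens):
    """Dense bag-of-words count vector; unknown tokens are ignored."""
    vec = np.zeros(len(vocab))
    for tok in tokens:
        if tok in tok2id:
            vec[tok2id[tok]] += 1
    return vec

counts = np.array([bow(doc) for doc in docs])

# Inverse document frequency: log2(N / number of documents containing the term)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log2(len(docs) / doc_freq)

def tfidf_unit(vec):
    """TF-IDF weights scaled to unit length (left as-is if all zero)."""
    weighted = vec * idf
    norm = np.linalg.norm(weighted)
    return weighted / norm if norm > 0 else weighted

doc_vecs = np.array([tfidf_unit(row) for row in counts])
query_vec = tfidf_unit(bow(["dream", "heist"]))

# With unit-length vectors, the dot product is the cosine similarity
sims = doc_vecs @ query_vec
ranking = np.argsort(-sims)
print(ranking, sims)
```

The first document shares both query terms and ranks first, the third shares only "dream", and the second shares nothing and scores zero. The dense matrix here mirrors what `MatrixSimilarity` holds in memory, which is one reason the corpus was restricted by release year.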

In [9]:
query = "Infiltrate in minds and extract information through a shared dream world. Different levels of dreams"

In [10]:
new_doc = gensim.parsing.preprocessing.preprocess_string(query)
new_vec = dictionary.doc2bow(new_doc)
vec_bow_tfidf = tfidf[new_vec]
sims = index[vec_bow_tfidf]
# Print the 10 most similar plots, highest cosine similarity first
for i, score in sorted(enumerate(sims), key=lambda item: -item[1])[:10]:
    print(f"{df['Title'].iloc[i]} : {score}")

Let's Dance : 0.19282113
Dream : 0.19088349
Inception : 0.18904062
Swapner Din : 0.15743802
Dancing Queen : 0.1458902
Aalukkoru Aasai : 0.14180927
The Good Night : 0.13072486
Darwin : 0.12126312
Popcorn : 0.11409524
Days of Our Own : 0.11404269
