<a href="https://colab.research.google.com/github/dasparagjyoti/Web-Data-Mining/blob/main/Copy_of_Simple_Search_Engine.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create a Simple Search Engine Using Python
## Use TF-IDF and Cosine Similarity to Retrieve Articles Similar to a Query
Information retrieval is an important task as the amount of available information keeps growing. How can a system retrieve the articles we want from a query? Here are the steps:
1. Extract documents from the Internet (by web scraping or by collecting them manually).
2. Clean the documents to make retrieval easier.
3. Create a term-document matrix with TF-IDF weighting.
4. Write a query and convert it to a vector using the same TF-IDF vocabulary and weights.
5. Calculate the cosine similarity between the query vector and each document vector.
6. Finally, show the documents whose similarity is greater than 0, ranked from most to least similar.
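
Steps 3 to 6 can be sketched on a toy corpus before any scraping is involved. This is a minimal illustration, not the notebook's own pipeline: the three sentences, the query, and the names `toy_docs`, `toy_vectorizer`, and `ranking` are assumptions made up for the example, and scikit-learn's `cosine_similarity` helper is used in place of a hand-written loop.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the scraped articles (illustrative only)
toy_docs = [
    "the striker scored two goals in the final match",
    "the central bank raised interest rates again",
    "the coach praised the striker after the match",
]

# Step 3: build the TF-IDF matrix (rows = documents, columns = terms)
toy_vectorizer = TfidfVectorizer()
doc_matrix = toy_vectorizer.fit_transform(toy_docs)

# Step 4: project the query into the same vector space
query_vec = toy_vectorizer.transform(["striker goals"])

# Step 5: cosine similarity between the query and every document
scores = cosine_similarity(query_vec, doc_matrix).ravel()

# Step 6: keep documents with similarity > 0, best match first
ranking = [
    (i, s)
    for i, s in sorted(enumerate(scores), key=lambda p: p[1], reverse=True)
    if s > 0
]
for i, s in ranking:
    print(f"{s:.3f}  {toy_docs[i]}")
```

The second sentence shares no term with the query, so its score is exactly 0 and it is filtered out of the ranking.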

### Created by Irfan Alghani Khalid

In [None]:
import re
import string
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer

def retrieve_docs_and_clean():
  # Get the links to the most popular news articles.
  # Assumption: the selectors below ('most__wrap', 'read__content') are Kompas
  # markup, and the outputs in this notebook come from kompas.com, so the URL is
  # set to the Kompas football section. https://www.w3schools.com/ has no such
  # elements.
  r = requests.get('https://bola.kompas.com/', timeout=10)
  soup = BeautifulSoup(r.content, 'html.parser')

  wrap = soup.find('div', {'class': 'most__wrap'})
  if wrap is None:
    raise ValueError("No 'most__wrap' block found; the page layout may have changed")

  # Ask for each article as a single page
  links = [a['href'] + '?page=all' for a in wrap.find_all('a', href=True)]

  # Retrieve paragraphs
  documents = []
  for url in links:
    r = requests.get(url, timeout=10)
    soup = BeautifulSoup(r.content, 'html.parser')

    content = soup.find('div', {'class': 'read__content'})
    if content is None:
      continue  # Skip pages without an article body
    documents.append(' '.join(p.text for p in content.find_all('p')))

  # Clean paragraphs
  documents_clean = []
  for d in documents:
    doc = re.sub(r'[^\x00-\x7F]+', ' ', d)  # Replace non-ASCII characters with a space
    doc = re.sub(r'@\w+', '', doc)  # Remove @mentions
    doc = doc.lower()
    doc = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', doc)  # Punctuation to space
    doc = re.sub(r'[0-9]', '', doc)  # Remove digits
    doc = re.sub(r'\s{2,}', ' ', doc)  # Collapse repeated whitespace
    documents_clean.append(doc)

  return documents_clean
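
To see what the cleaning steps in `retrieve_docs_and_clean` do without hitting the network, the same regular expressions can be applied to a single string. The helper name `clean_text` and the sample sentence are hypothetical and exist only for this sketch.

```python
import re
import string

def clean_text(text):
    """Apply the same cleaning steps as retrieve_docs_and_clean to one string."""
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # non-ASCII runs become a space
    text = re.sub(r'@\w+', '', text)            # drop @mentions
    text = text.lower()
    text = re.sub(r'[%s]' % re.escape(string.punctuation), ' ', text)
    text = re.sub(r'[0-9]', '', text)           # drop digits
    text = re.sub(r'\s{2,}', ' ', text)         # collapse repeated whitespace
    return text

sample = "Barcelona menang 3-1, kata @pelatih!"
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that the pipeline never strips leading or trailing whitespace, so a cleaned string can still end with a single space. That is harmless here because `TfidfVectorizer` tokenizes on word boundaries.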

In [None]:
docs = retrieve_docs_and_clean()
# Create Term-Document Matrix with TF-IDF weighting
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

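# Create a DataFrame with one row per term and one column per document
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
df = pd.DataFrame(X.T.toarray(), index=vectorizer.get_feature_names_out())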
df.head()



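| term | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| absen | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.035829 | 0.0 | 0.0 | 0.0 | 0.066786 |
| ac | 0.0 | 0.0 | 0.0 | 0.031057 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| acara | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.016402 | 0.0 | 0.0 | 0.0 |
| ada | 0.04692 | 0.045287 | 0.0 | 0.0 | 0.0 | 0.027869 | 0.021691 | 0.0 | 0.0 | 0.0 |
| adalah | 0.038093 | 0.012256 | 0.027732 | 0.0 | 0.017182 | 0.0 | 0.079246 | 0.016004 | 0.0 | 0.0 |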
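
The weights in a matrix like this come from `TfidfVectorizer`'s default settings: raw term counts multiplied by a smoothed inverse document frequency, with each document vector then scaled to unit length. The sketch below reproduces that computation by hand on a toy corpus and compares it with the library's result. The three short documents and the names `counts`, `manual`, and `reference` are made up for illustration, and the sketch keeps documents as rows (the transpose of the DataFrame above).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus (illustrative only)
toy_docs = ["ada acara", "ada absen absen", "acara ac"]

# Raw term counts: rows = documents, columns = terms
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale every document vector to unit (L2) length
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Same matrix from TfidfVectorizer with its default settings
reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))
```

Because every non-empty document vector has unit length, the cosine similarity between a query and a document reduces to a plain dot product of their TF-IDF vectors.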


In [None]:
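def get_similar_articles(q, df):
  print("query:", q)
  print("The following is the article with the highest cosine similarity value:")
  # Convert the query into a TF-IDF vector over the same vocabulary as df
  q_vec = vectorizer.transform([q]).toarray().reshape(df.shape[0],)
  q_norm = np.linalg.norm(q_vec)
  sim = {}
  # Cosine similarity between the query and every document (one column each)
  for i in range(df.shape[1]):
    d_vec = df.loc[:, i].values
    denom = np.linalg.norm(d_vec) * q_norm
    # A zero norm means an all-zero vector, so define the similarity as 0
    sim[i] = np.dot(d_vec, q_vec) / denom if denom > 0 else 0.0

  # Sort documents from most to least similar
  sim_sorted = sorted(sim.items(), key=lambda x: x[1], reverse=True)

  # Print only the documents that share at least one term with the query
  for k, v in sim_sorted:
    if v > 0.0:
      print("Similarity Value:", v)
      print(docs[k])
      print()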


q1 = 'barcelona'
q2 = 'gareth bale'
q3 = 'shin tae yong'

get_similar_articles(q1, df)
print('-'*100)
get_similar_articles(q2, df)
print('-'*100)
get_similar_articles(q3, df)

query: barcelona
The following is the article with the highest cosine similarity value:
----------------------------------------------------------------------------------------------------
query: gareth bale
The following is the article with the highest cosine similarity value:
----------------------------------------------------------------------------------------------------
query: shin tae yong
The following is the article with the highest cosine similarity value:
Similarity Value: 0.030445917836089895
kompas com ganda putra indonesia fajar alfian muhammad rian ardianto membuka perjuangan di korea open dengan kemenangan fajar rian bertanding dengan gwang min na jon seong noh pada babak pertama atau besar korea open selasa pagi wib baca juga jadwal korea open wakil indonesia beraksi di besar hari ini bertanding di palma stadium suncheon fajar rian mulus ke babak selanjutnya setelah menang dua gim atas gwang jon ganda putra nomor dunia itu berhasil menang dengan dalam tempo menit sela
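
One plausible reason the first two queries print no articles is that none of their terms occur in the fitted vocabulary, in which case the query vector is all zeros and every similarity is 0. The sketch below shows one way to check this before searching. The toy corpus and the helper `split_query_terms` are hypothetical; only `build_analyzer()` and `vocabulary_` are actual `TfidfVectorizer` members.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the scraped articles (illustrative only)
toy_docs = [
    "fajar rian menang di korea open",
    "shin tae yong memimpin latihan timnas",
]
toy_vectorizer = TfidfVectorizer().fit(toy_docs)

def split_query_terms(query, fitted_vectorizer):
    """Return (known, unknown) query terms relative to the fitted vocabulary."""
    analyze = fitted_vectorizer.build_analyzer()  # same tokenization as fit()
    terms = analyze(query)
    known = [t for t in terms if t in fitted_vectorizer.vocabulary_]
    unknown = [t for t in terms if t not in fitted_vectorizer.vocabulary_]
    return known, unknown

for q in ["barcelona", "gareth bale", "shin tae yong"]:
    known, unknown = split_query_terms(q, toy_vectorizer)
    print(f"{q!r}: known={known}, unknown={unknown}")
```

A query whose `known` list is empty cannot match any document, so reporting the unknown terms to the user is usually more helpful than printing an empty result.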