# Example search of urls with tags

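In this example we walk through how a simple keyword search works: we turn text into binary feature vectors with scikit-learn's CountVectorizer and score each URL against a query.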

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

Here is some example data: 20 made-up URLs, each tagged with two tags drawn from a pool of 13.

In [67]:
data = [
  {"url": "https://djangoexamples.com/modern-forms", "tags": ["forms", "views"]},
  {"url": "https://djangosecurityhub.net/secure-coding-practices", "tags": ["security", "views"]},
  {"url": "https://djangorestframeworktutorial.org/api-design-guide", "tags": ["REST", "views"]},
  {"url": "https://djangomigrationsguide.com/advanced-patterns", "tags": ["migrations", "databases"]},
  {"url": "https://djangoformshandbook.com/custom-widgets", "tags": ["forms", "Frontend"]},
  {"url": "https://djangostaticfiles.com/efficient-management", "tags": ["statics", "caching"]},
  {"url": "https://djangomediahandling.com/upload-strategies", "tags": ["media", "databases"]},
  {"url": "https://djangodatabasesoptimization.com/indexing-tips", "tags": ["databases", "performance"]},
  {"url": "https://djangocachepatterns.com/strategies-for-scaling", "tags": ["caching", "performance"]},
  {"url": "https://djangofrontendintegration.com/css-frameworks", "tags": ["Frontend", "statics"]},
  {"url": "https://djangosecuritychecklist.com/xss-prevention", "tags": ["security", "Frontend"]},
  {"url": "https://djangorestbestpractices.com/token-authentication", "tags": ["REST", "security"]},
  {"url": "https://djangoquickmigrations.com/zero-downtime", "tags": ["migrations", "databases"]},
  {"url": "https://djangocleancodeprinciples.com/refactoring-techniques", "tags": ["views", "refactoring"]},
  {"url": "https://djangostaticassetmanagement.com/cdn-integration", "tags": ["statics", "caching"]},
  {"url": "https://djangomediaoptimization.com/compression-techniques", "tags": ["media", "performance"]},
  {"url": "https://djangodatabasearchitectures.com/replication", "tags": ["databases", "caching"]},
  {"url": "https://djangocacheoptimization.com/memoization", "tags": ["caching", "views"]},
  {"url": "https://djangofrontendpatterns.com/react-integration", "tags": ["Frontend", "REST"]},
  {"url": "https://djangosecuredeployment.com/https-configurations", "tags": ["security", "deployment"]}
]

Let's assume the tags are not tags but written text, for example a summary, and that we want to find the URLs that best match our topics of interest. We simulate this by joining the tags into a single space-separated string; later we show how a tokenizer extracts the features again.

In [68]:
df = pd.DataFrame(data)
df['summary'] = df['tags'].apply(lambda x: " ".join(x))
df = df[['url', 'summary']]
df

,url,summary
0,https://djangoexamples.com/modern-forms,forms views
1,https://djangosecurityhub.net/secure-coding-pr...,security views
2,https://djangorestframeworktutorial.org/api-de...,REST views
3,https://djangomigrationsguide.com/advanced-pat...,migrations databases
4,https://djangoformshandbook.com/custom-widgets,forms Frontend
5,https://djangostaticfiles.com/efficient-manage...,statics caching
6,https://djangomediahandling.com/upload-strategies,media databases
7,https://djangodatabasesoptimization.com/indexi...,databases performance
8,https://djangocachepatterns.com/strategies-for...,caching performance
9,https://djangofrontendintegration.com/css-fram...,Frontend statics


To search for keywords in our dataset, we first need to process the data and extract its features. We already know that the features in the "summary" column are separated by a single space, so we can define a *tokenizer* that converts the text into tokens:

In [69]:
def tokenizer(text):
    return text.split(" ")
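
Splitting on a single space is strict, so it may help to check how the tokenizer treats less tidy input. The sketch below repeats the definition so that it runs on its own; the sample strings are made up for illustration.

```python
def tokenizer(text):
    # Same definition as above: split on a single space character.
    return text.split(" ")

print(tokenizer("forms views"))     # ['forms', 'views']
# Punctuation stays attached to the token, so "caching," is not "caching".
print(tokenizer("caching, views"))  # ['caching,', 'views']
# Two consecutive spaces produce an empty token.
print(tokenizer("forms  views"))    # ['forms', '', 'views']
```

For free-form text, a tokenizer that strips punctuation and collapses whitespace would probably be a better fit; for the space-joined tags used here, the simple split is enough.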

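We can now fit a CountVectorizer on the whole corpus. Fitting extracts every feature (the vocabulary), and the fitted vectorizer can then be used for search. With `binary=True` it records only whether a token is present, not how often it occurs.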

In [74]:
count_vectorizer = CountVectorizer(binary=True, tokenizer=tokenizer, token_pattern=None)
corpus = df['summary'].tolist()
count_vectorizer.fit(corpus)
feature_names = count_vectorizer.get_feature_names_out()
feature_names

array(['caching', 'databases', 'deployment', 'forms', 'frontend', 'media',
       'migrations', 'performance', 'refactoring', 'rest', 'security',
       'statics', 'views'], dtype=object)

We can see how our data maps onto the features with:

In [71]:
count_vectorizer.transform(corpus).toarray()

array([[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]])

This generates a 20 by 13 matrix: 20 is the number of URLs in our data and 13 is the number of unique tags in the list defined earlier. A cell is 1 when the URL has the corresponding tag and 0 otherwise. A labelled table may be easier to read:

In [76]:
matrix = count_vectorizer.transform(df['summary'])
tokens = sorted(count_vectorizer.vocabulary_.keys(), key=lambda token: count_vectorizer.vocabulary_[token])
fit = pd.DataFrame(matrix.toarray(), columns=tokens)
fit["url"] = df["url"]
fit.set_index("url", inplace=True)
fit

url,caching,databases,deployment,forms,frontend,media,migrations,performance,refactoring,rest,security,statics,views
https://djangoexamples.com/modern-forms,0,0,0,1,0,0,0,0,0,0,0,0,1
https://djangosecurityhub.net/secure-coding-practices,0,0,0,0,0,0,0,0,0,0,1,0,1
https://djangorestframeworktutorial.org/api-design-guide,0,0,0,0,0,0,0,0,0,1,0,0,1
https://djangomigrationsguide.com/advanced-patterns,0,1,0,0,0,0,1,0,0,0,0,0,0
https://djangoformshandbook.com/custom-widgets,0,0,0,1,1,0,0,0,0,0,0,0,0
https://djangostaticfiles.com/efficient-management,1,0,0,0,0,0,0,0,0,0,0,1,0
https://djangomediahandling.com/upload-strategies,0,1,0,0,0,1,0,0,0,0,0,0,0
https://djangodatabasesoptimization.com/indexing-tips,0,1,0,0,0,0,0,1,0,0,0,0,0
https://djangocachepatterns.com/strategies-for-scaling,1,0,0,0,0,0,0,1,0,0,0,0,0
https://djangofrontendintegration.com/css-frameworks,0,0,0,0,1,0,0,0,0,0,0,1,0


## Search against the data

To search, we transform the query with the same vectorizer and take the dot product of the document matrix with the query vector. The result counts, for each URL, how many of the query tokens appear in its summary.

In [None]:
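# Tokens are split on spaces, so the query must not contain punctuation
# ("caching," would not match "caching"). "clean" and "code" are not in the
# vocabulary and are ignored by transform().
query = "caching clean code"
query_vector = count_vectorizer.transform([query])
# Dot product: number of query tokens present in each summary (one value per URL).
overlap = matrix.dot(query_vector.T).toarray().ravel()
# Number of query tokens known to the vocabulary, e.g. for normalising scores.
number_of_query_tokens = query_vector.sum()
fit["score"] = overlap
fit = fit.sort_values("score", ascending=False)
fit[["score"]]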
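
The search step could be wrapped in a small helper that also normalises the score by the number of query tokens found in the vocabulary. The sketch below is one possible way to do this, not part of the original notebook: the `search` function, the `example.com` URLs and the three summaries are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def tokenizer(text):
    return text.split(" ")

def search(query, vectorizer, doc_matrix, urls):
    """Rank urls by the fraction of known query tokens they contain (hypothetical helper)."""
    query_vector = vectorizer.transform([query])
    # Query tokens that exist in the fitted vocabulary.
    known_tokens = query_vector.sum()
    # Number of query tokens present in each document.
    overlap = doc_matrix.dot(query_vector.T).toarray().ravel()
    # Avoid dividing by zero when no query token is in the vocabulary.
    scores = overlap / known_tokens if known_tokens else np.zeros(len(urls))
    return pd.Series(scores, index=urls, name="score").sort_values(ascending=False)

# Made-up data for illustration only.
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
summaries = ["caching views", "caching performance", "forms views"]

vec = CountVectorizer(binary=True, tokenizer=tokenizer, token_pattern=None)
doc_matrix = vec.fit_transform(summaries)

ranking = search("caching views", vec, doc_matrix, urls)
print(ranking)
```

A score of 1.0 means the summary contains every known query token, and 0.5 means it contains half of them. Other normalisations, such as cosine similarity, would be equally reasonable choices here.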