---
title: Document Search with Scikit-Learn and TF-IDF
pubDate: 2024-04-25
shortDescription: Implementing keyword search on documents using scikit-learn's CountVectorizer for text analysis
tags:
  - Backend
keywords: machine learning, text search, scikit-learn, nlp, document retrieval, vector search
---

# Keyword search on a list of documents (with scikit-learn)

In [1]:
documents = [
  {"id": 1, "text": "Programming requires logical thinking and problem-solving skills."},
  {"id": 2, "text": "Learning to code improves problem-solving and logical thinking abilities."},
  {"id": 3, "text": "Logical thinking is essential for writing efficient code."},
  {"id": 4, "text": "Effective programming involves solving complex problems logically."},
  {"id": 5, "text": "Developing software enhances problem-solving and logical thinking."}
]

## Convert the documents to a matrix of token counts

Install the required packages:

    pip install scikit-learn pandas numpy


In [2]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [3]:
df = pd.DataFrame(documents)
df

|   | id | text |
|---|---|---|
| 0 | 1 | Programming requires logical thinking and prob... |
| 1 | 2 | Learning to code improves problem-solving and ... |
| 2 | 3 | Logical thinking is essential for writing effi... |
| 3 | 4 | Effective programming involves solving complex... |
| 4 | 5 | Developing software enhances problem-solving a... |


In [4]:
df["text"].tolist()

['Programming requires logical thinking and problem-solving skills.',
 'Learning to code improves problem-solving and logical thinking abilities.',
 'Logical thinking is essential for writing efficient code.',
 'Effective programming involves solving complex problems logically.',
 'Developing software enhances problem-solving and logical thinking.']

In [5]:
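# binary=True records whether a token occurs (0/1) instead of how often it occurs
count_vectorizer = CountVectorizer(binary=True)
# fit() learns the vocabulary: one column per distinct lowercased token
count_vectorizer.fit(df["text"].tolist())
count_vectorizer.get_feature_names_out()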

array(['abilities', 'and', 'code', 'complex', 'developing', 'effective',
       'efficient', 'enhances', 'essential', 'for', 'improves',
       'involves', 'is', 'learning', 'logical', 'logically', 'problem',
       'problems', 'programming', 'requires', 'skills', 'software',
       'solving', 'thinking', 'to', 'writing'], dtype=object)
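
The vocabulary above shows two effects of the default analyzer: text is lowercased, and "problem-solving" becomes the two tokens `problem` and `solving` because the hyphen is not a word character. The default token pattern also drops single-character tokens. A minimal sketch for inspecting this directly with `build_analyzer()`; the sample sentence is made up for illustration and is not part of the corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that CountVectorizer applies to each document
analyzer = CountVectorizer().build_analyzer()

# Hypothetical sample sentence, not one of the documents above
sample_tokens = analyzer("Problem-solving is a SKILL.")
print(sample_tokens)  # ['problem', 'solving', 'is', 'skill']
```

Queries go through the same analyzer, so a search for "Problem-Solving" and one for "problem solving" end up matching the same columns.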

In [6]:
corpus = df["text"].tolist()
# One row per document, one column per vocabulary token (1 = token present)
matrix = count_vectorizer.transform(corpus)
# get_feature_names_out() returns the tokens in column order
tokens = count_vectorizer.get_feature_names_out()
fit = pd.DataFrame(matrix.toarray(), columns=tokens)
# Use the document text as the row label to make the matrix readable
fit["text"] = df["text"]
fit.set_index("text", inplace=True)
fit

| text | abilities | and | code | complex | developing | effective | efficient | enhances | essential | for | ... | problem | problems | programming | requires | skills | software | solving | thinking | to | writing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Programming requires logical thinking and problem-solving skills. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| Learning to code improves problem-solving and logical thinking abilities. | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| Logical thinking is essential for writing efficient code. | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| Effective programming involves solving complex problems logically. | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Developing software enhances problem-solving and logical thinking. | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
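
Because the vectorizer was created with `binary=True`, every cell in the matrix above is 0 or 1 regardless of how many times a token appears in a document. The following sketch contrasts that with the default counting behavior. The one-sentence corpus is invented for illustration and is not part of the documents above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-document corpus in which "code" appears three times
repeated = ["code code code review"]

# Default behavior: each cell holds the number of occurrences
counts = CountVectorizer().fit_transform(repeated).toarray()

# binary=True: each cell only records presence (1) or absence (0)
flags = CountVectorizer(binary=True).fit_transform(repeated).toarray()

print(counts)  # [[3 1]]  columns: 'code', 'review'
print(flags)   # [[1 1]]
```

With binary features, a score computed from the matrix counts how many distinct query terms a document contains, not how often it repeats them.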


In [7]:
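# Encode the query with the same vocabulary as the documents
query_vector = count_vectorizer.transform(["logical thinking"])
# Dot product of each binary document row with the binary query vector:
# the result is the number of query terms the document contains
scores = matrix.dot(query_vector.T).toarray().ravel()
fit["score"] = scores
fit.sort_values("score", ascending=False)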

| text | abilities | and | code | complex | developing | effective | efficient | enhances | essential | for | ... | problems | programming | requires | skills | software | solving | thinking | to | writing | score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Programming requires logical thinking and problem-solving skills. | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 2 |
| Learning to code improves problem-solving and logical thinking abilities. | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 2 |
| Logical thinking is essential for writing efficient code. | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 2 |
| Developing software enhances problem-solving and logical thinking. | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 2 |
| Effective programming involves solving complex problems logically. | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
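
All four matching documents tie at a score of 2, so binary term matching cannot rank them against each other. One possible refinement is to weight terms with TF-IDF and rank by cosine similarity. The sketch below swaps `CountVectorizer` for scikit-learn's `TfidfVectorizer`, which is a different technique from the cells above. The `search` helper and its `top_k` parameter are hypothetical names introduced here for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Programming requires logical thinking and problem-solving skills.",
    "Learning to code improves problem-solving and logical thinking abilities.",
    "Logical thinking is essential for writing efficient code.",
    "Effective programming involves solving complex problems logically.",
    "Developing software enhances problem-solving and logical thinking.",
]

# Learn the vocabulary and the per-term IDF weights from the corpus
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)


def search(query, top_k=3):
    """Return (score, text) pairs for the top_k documents most similar to the query."""
    query_vector = tfidf_vectorizer.transform([query])
    # Cosine similarity between every document row and the query vector
    scores = cosine_similarity(tfidf_matrix, query_vector).ravel()
    # Document indices ordered from highest to lowest score
    ranked = scores.argsort()[::-1][:top_k]
    return [(float(scores[i]), corpus[i]) for i in ranked]


results = search("logical thinking")
for score, text in results:
    print(f"{score:.3f}  {text}")
```

As with the binary scores, the document containing only "logically" still scores 0, because the analyzer treats `logical` and `logically` as unrelated tokens. Matching them would need stemming or a similar normalization step, which is outside the scope of this sketch.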
