# Semantic Search using the best tools available

We are going to perform semantic search on a dataset of song lyrics with the best tools available:
- OpenAI embedding models
- MongoDB vector database functionality
- spaCy for some data preprocessing
- Monggregate as a query builder
- Good old pandas for data manipulation

## Imports

In [3]:
import os

import numpy as np
import pandas as pd

import spacy
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

from openai import OpenAI
from dotenv import load_dotenv

from pymongo import MongoClient
from monggregate import Pipeline

## Loading Data

In [4]:
df = pd.read_csv("data/57k_spotify_songs.csv")
df.head()

Unnamed: 0,artist,song,link,text
0,ABBA,Ahe's My Kind Of Girl,/a/abba/ahes+my+kind+of+girl_20598417.html,"Look at her face, it's a wonderful face \nAnd..."
1,ABBA,"Andante, Andante",/a/abba/andante+andante_20002708.html,"Take it easy with me, please \nTouch me gentl..."
2,ABBA,As Good As New,/a/abba/as+good+as+new_20003033.html,I'll never know why I had to go \nWhy I had t...
3,ABBA,Bang,/a/abba/bang_20598415.html,Making somebody happy is a question of give an...
4,ABBA,Bang-A-Boomerang,/a/abba/bang+a+boomerang_20002668.html,Making somebody happy is a question of give an...


## Exploration

### Basics

In [5]:
df.shape

(57650, 4)

In [6]:
df.describe(include="all")

Unnamed: 0,artist,song,link,text
count,57650,57650,57650,57650
unique,643,44824,57650,57494
top,Donna Summer,Have Yourself A Merry Little Christmas,/a/abba/ahes+my+kind+of+girl_20598417.html,I just came back from a lovely trip along the ...
freq,191,35,1,6


**Observations**:

* No missing data
* Each row represents a song
* 4 columns
   * artist : artist name
   * song : song title
   * link : a relative URL whose target I could not identify; I won't use it anyway
   * text : the lyrics of the song
* There are 57650 songs coming from "only" 643 artists (assuming there are no typos in the artist names).

### Domain specific exploration

In [7]:
df["text"].sample(100)

8368     [Verse:]  \nWhatever it takes, whatever Your w...
26601    Well she drew out all her money from the south...
34690    Introducing..Tre' Cool..  \nHi!..um..I wrote t...
1277     Where are you ? I love you . Where are you ? I...
22875    Funny how everything changes for me  \nMemorie...
                               ...                        
4181     Tak tertahan Berdiam diri... sakit  \nSementar...
18723    Get down deeper and down  \nDown down deeper a...
10157    You're such a poet  \nI wish I could be Wesley...
16484    [Jessica Rivera]  \n"Hmmm, so you're the one t...
40079    Baby when I met you there was peace unknown  \...
Name: text, Length: 100, dtype: object

In [8]:
df["artist"].unique()

array(['ABBA', 'Ace Of Base', 'Adam Sandler', 'Adele', 'Aerosmith',
       'Air Supply', 'Aiza Seguerra', 'Alabama', 'Alan Parsons Project',
       'Aled Jones', 'Alice Cooper', 'Alice In Chains', 'Alison Krauss',
       'Allman Brothers Band', 'Alphaville', 'America', 'Amy Grant',
       'Andrea Bocelli', 'Andy Williams', 'Annie', 'Ariana Grande',
       'Ariel Rivera', 'Arlo Guthrie', 'Arrogant Worms', 'Avril Lavigne',
       'Backstreet Boys', 'Barbie', 'Barbra Streisand', 'Beach Boys',
       'The Beatles', 'Beautiful South', 'Beauty And The Beast',
       'Bee Gees', 'Bette Midler', 'Bill Withers', 'Billie Holiday',
       'Billy Joel', 'Bing Crosby', 'Black Sabbath', 'Blur', 'Bob Dylan',
       'Bob Marley', 'Bob Rivers', 'Bob Seger', 'Bon Jovi', 'Boney M.',
       'Bonnie Raitt', 'Bosson', 'Bread', 'Britney Spears',
       'Bruce Springsteen', 'Bruno Mars', 'Bryan White', 'Cake',
       'Carly Simon', 'Carol Banawa', 'Carpenters', 'Cat Stevens',
       'Celine Dion', 'Chaka Khan

**Observations**:

* Most of the songs seem to be in English
* ==> let's verify this with language detection

In [9]:
def robust_detect(text):
    try:
        return detect(text)
    except LangDetectException:
        return "N/A"

In [9]:
df["language"] = df["text"].apply(robust_detect)
df["language"].head()

0    en
1    en
2    en
3    en
4    en
Name: language, dtype: object

Since the step above is time-consuming, let's save the results to a file to avoid running it again.

In [14]:
df.to_csv("data/cache/57k_spotify_songs.csv", index=False)

In [10]:
df["language"].value_counts()

language
en    57189
tl      133
id       93
fr       38
so       36
it       34
sw       24
de       24
es       18
ca       13
nl       10
ro        8
cy        8
pt        6
af        4
et        3
da        3
hr        2
no        1
sv        1
sq        1
sl        1
Name: count, dtype: int64

In [11]:
(df[df["language"]=="en"].shape[0] / df.shape[0]) * 100

99.20034692107545

About 99.2% of the songs are indeed detected as English.

## Filtering Dataset

In [12]:
selected_artist = ["David Bowie", "Drake", "Adele", "Vybz Kartel", "Lauryn Hill", "Gucci Mane", "Chris Brown", "Bob Marley", "Britney Spears"]

In [16]:
filtered_df = df[(df["artist"].isin(selected_artist))|(df["language"].isin(["fr", "es"]))]
filtered_df.shape

(950, 5)

In [18]:
# Saving the results
filtered_df.to_csv("data/cache/sample.csv", index=False)

## Embedding

### Cost Estimation

In [19]:
filtered_df["raw_size"] = filtered_df["text"].apply(len)
filtered_df["raw_size"].head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  filtered_df["raw_size"] = filtered_df["text"].apply(len)


19     1564
133    1494
134    1402
135     990
136    1025
Name: raw_size, dtype: int64
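
The "A value is trying to be set on a copy of a slice" warning above appears because `filtered_df` was created by boolean indexing on `df`, so pandas cannot tell whether the new column should also reach `df`. A minimal sketch of the usual way around it, assuming a toy table (the rows below are made up and not taken from the dataset): take an explicit `.copy()` when filtering.

```python
import pandas as pd

# Toy stand-in for the song table (hypothetical rows, not from the dataset).
df = pd.DataFrame(
    {
        "artist": ["Adele", "ABBA", "Drake"],
        "text": ["hello", "mamma mia", "started"],
    }
)

# .copy() makes the filtered frame independent of df, so adding a column
# is unambiguous and does not trigger the warning.
filtered = df[df["artist"].isin(["Adele", "Drake"])].copy()
filtered["raw_size"] = filtered["text"].str.len()

print(filtered)
```

The original `df` is left untouched; only `filtered` gains the `raw_size` column.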

In [22]:
# NOTE: You need to first run `python -m spacy download en_core_web_sm` in the terminal for the line below to work
nlp = spacy.load('en_core_web_sm')

In [24]:
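def count_token(text):
    # Count spaCy tokens; this only approximates the tokens OpenAI bills for
    try:
        doc = nlp(text)
        return len(doc)
    except Exception:
        # Fall back to NaN instead of aborting the whole apply()
        return np.nan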

In [25]:
# Prices in USD per 1,000 tokens (hence the division by 1000 further down)
openai_price_min = 0.00002
openai_price_med = 0.00010
openai_price_max = 0.00013

In [26]:
filtered_df["token_counts"] = filtered_df["text"].apply(count_token)
filtered_df["token_counts"].value_counts(dropna=False)

token_counts
210    7
246    7
286    7
268    6
244    6
      ..
536    1
403    1
603    1
245    1
145    1
Name: count, Length: 537, dtype: int64

In [27]:
filtered_df["price_max"] = filtered_df["token_counts"] * openai_price_max / 1000
filtered_df["price_med"] = filtered_df["token_counts"] * openai_price_med / 1000
filtered_df["price_min"] = filtered_df["token_counts"] * openai_price_min / 1000

In [28]:
filtered_df["price_max"].sum()

0.05262582
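
The estimate above is plain arithmetic: tokens times the price per 1,000 tokens, divided by 1,000. Here is a small sketch of that formula. The helper name `estimate_cost` is mine, and the total token count is back-derived from the `price_max` total above rather than printed by the notebook. Keep in mind that spaCy tokens are only a proxy for the tokens OpenAI actually bills, so this is an order of magnitude, not an invoice.

```python
def estimate_cost(token_count: int, price_per_1k_tokens: float) -> float:
    """Return the estimated embedding cost in USD for `token_count` tokens."""
    return token_count * price_per_1k_tokens / 1000


# Back-derived from the total above: 0.05262582 / 0.00013 * 1000 = 404,814 spaCy tokens.
total_tokens = 404_814

# Same three price points as in the notebook (USD per 1,000 tokens).
for label, price in [("min", 0.00002), ("med", 0.00010), ("max", 0.00013)]:
    print(f"{label}: ${estimate_cost(total_tokens, price):.4f}")
```

Even at the highest price point, embedding the 950-song sample stays well under ten cents.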

In [30]:
filtered_df["token_counts"].sort_values(ascending=False).head(10)

4537     1028
4529     1026
41293    1024
34812    1017
7284      979
30515     977
30467     976
30507     949
30452     943
30478     943
Name: token_counts, dtype: int64

### Generation

In [2]:
# load_dotenv returns True when the .env file was found and loaded
env_loaded = load_dotenv(".env")
env_loaded

True

In [3]:

openai_client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
)

In [4]:
model_name = "text-embedding-3-large"
max_dimensions = 2048
# NOTE: text-embedding-3-large returns 3072-dimensional embeddings by default,
# but MongoDB vector search indexes could not handle more than 2048 dimensions at the time of writing.
# Cf.: https://www.mongodb.com/community/forums/t/vector-search-dimensions-limited-to-2048/257718

In [5]:
def _get_embedding(text: str, model: str = model_name):
    # Newlines are replaced by spaces before the text is sent to the API
    text = text.replace("\n", " ")
    response = openai_client.embeddings.create(
        input=[text], model=model, dimensions=max_dimensions
    )
    return response.data[0]

def get_embedding(text: str, model: str = model_name):
    # Return the embedding as a list of floats, or NaN if the API call fails
    try:
        return _get_embedding(text, model).embedding
    except Exception as e:
        print(e)
        return np.nan
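
Calling the API once per song works for 950 rows, but the embeddings endpoint also accepts a list of inputs, so texts can be sent in batches. Below is a rough sketch of the batching logic only. `chunked`, `embed_in_batches` and `fake_embed_batch` are hypothetical helpers, and the embedder is injected as a callable so the example runs without an API key.

```python
from typing import Callable, Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list],
    batch_size: int = 100,
) -> list:
    """Embed `texts` batch by batch, keeping the input order."""
    embeddings = []
    for batch in chunked(texts, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings


def fake_embed_batch(batch: Sequence[str]) -> list:
    # Stand-in embedder: one 1-dimensional "vector" per text (its length).
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(vectors)
```

With the real client, `embed_batch` could wrap `openai_client.embeddings.create(input=list(batch), model=model_name, dimensions=max_dimensions)` and return `[item.embedding for item in response.data]`; I have not run that variant here.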

### Testing

In [7]:
# filtered_df no longer exists after a kernel restart, so reload the cached sample first
filtered_df = pd.read_csv("data/cache/sample.csv")
sample_text = filtered_df.reset_index().loc[0, "text"]
sample_text

In [52]:
_embedding = _get_embedding(sample_text)
_embedding

Embedding(embedding=[-0.021085655316710472, 0.007618457544595003, -0.010882432572543621, 0.006392108276486397, -0.0009174033766612411, 0.01473128143697977, 0.002773435553535819, 0.006920381914824247, -0.0038431892171502113, 0.011576734483242035, -0.053853701800107956, 0.04398253560066223, 0.01691984198987484, 0.05131798982620239, 0.018066950142383575, 0.02750040404498577, -0.03380949795246124, 0.013833216391503811, -0.02075359784066677, -0.0057242196053266525, 0.029326722025871277, 0.01752358302474022, -0.014414317905902863, 0.0012763462727889419, 0.0075392164289951324, 0.017900921404361725, 0.00042615627171471715, 0.02476847730576992, 0.006229852791875601, -0.033869873732328415, 0.000345264415955171, 0.020210230723023415, 0.01457279920578003, -0.013772842474281788, -0.03700932487845421, -0.010444720275700092, -0.007456202059984207, 0.01528974249958992, -0.006452482659369707, -0.01468600146472454, -0.02922106720507145, 0.006837367545813322, 0.00613174494355917, -0.0010565468110144138, 

In [53]:
_embedding.dict()

{'embedding': [-0.021085655316710472,
  0.007618457544595003,
  -0.010882432572543621,
  0.006392108276486397,
  -0.0009174033766612411,
  0.01473128143697977,
  0.002773435553535819,
  0.006920381914824247,
  -0.0038431892171502113,
  0.011576734483242035,
  -0.053853701800107956,
  0.04398253560066223,
  0.01691984198987484,
  0.05131798982620239,
  0.018066950142383575,
  0.02750040404498577,
  -0.03380949795246124,
  0.013833216391503811,
  -0.02075359784066677,
  -0.0057242196053266525,
  0.029326722025871277,
  0.01752358302474022,
  -0.014414317905902863,
  0.0012763462727889419,
  0.0075392164289951324,
  0.017900921404361725,
  0.00042615627171471715,
  0.02476847730576992,
  0.006229852791875601,
  -0.033869873732328415,
  0.000345264415955171,
  0.020210230723023415,
  0.01457279920578003,
  -0.013772842474281788,
  -0.03700932487845421,
  -0.010444720275700092,
  -0.007456202059984207,
  0.01528974249958992,
  -0.006452482659369707,
  -0.01468600146472454,
  -0.029221067205

In [54]:
len(_embedding.embedding)

2048
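
The length check confirms that the `dimensions` argument did its job. As far as I understand the OpenAI documentation, the same effect can be obtained on a full-size vector by cutting it and re-normalizing it to unit length. A minimal sketch of that manual path, assuming a hypothetical helper `truncate_and_normalize` and a toy 3-dimensional vector:

```python
import math


def truncate_and_normalize(vector: list[float], dims: int) -> list[float]:
    """Keep the first `dims` values and rescale them to unit L2 norm."""
    shortened = vector[:dims]
    norm = math.sqrt(sum(x * x for x in shortened))
    if norm == 0.0:
        # A zero vector cannot be normalized; return it unchanged.
        return shortened
    return [x / norm for x in shortened]


full = [3.0, 4.0, 12.0]  # toy stand-in for a 3072-dimensional embedding
short = truncate_and_normalize(full, 2)
print(short)
```

Re-normalizing matters because similarity measures such as the dot product assume unit-length vectors.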

In [55]:
filtered_df["embedding"] = filtered_df["text"].apply(get_embedding)

In [58]:
filtered_df.to_csv("data/cache/sample_with_lang_and_embedding.csv", index=False) 
#filtered_df.to_json("data/cache/sample_with_lang_and_embedding.json", orient="records")     
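
One caveat with the CSV cache: a list-valued column is written as text, so each embedding comes back as a string like `"[0.1, -0.2]"` when the file is read again, and failed embeddings (NaN) come back as empty cells. A small sketch of how the column could be parsed back, using only the standard library. `parse_embedding` is a hypothetical helper and the rows are toy data.

```python
import ast
import csv
import io


def parse_embedding(cell: str):
    """Turn a CSV cell like "[0.1, 0.2]" back into a list of floats (None if empty)."""
    cell = cell.strip()
    if not cell or cell.lower() == "nan":
        return None
    return [float(value) for value in ast.literal_eval(cell)]


# Simulate the cached file in memory with two toy rows.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["song", "embedding"])
writer.writerow(["Get Up, Stand Up", str([0.1, -0.2, 0.3])])
writer.writerow(["Song without embedding", ""])
buffer.seek(0)

rows = list(csv.DictReader(buffer))
parsed = [parse_embedding(row["embedding"]) for row in rows]
print(parsed)
```

The commented-out JSON export avoids this round trip altogether, since JSON stores the embedding as a real array.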

## Vector Search

In [13]:
mongo_client = MongoClient(os.environ["MONGODB_URI"])
db = mongo_client["spofity"]
collection = db["songs"]

In [14]:
db.list_collection_names()

['songs']
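
The notebook does not show how the `songs` collection was filled or how the vector index was created. The sketch below is one plausible way to prepare both, not the steps actually used: it builds an Atlas Vector Search index definition for the `embedding` field and filters out rows without an embedding before insertion. The `cosine` similarity and both helper names are assumptions, and the rows are toy data.

```python
import json


def build_vector_index_definition(
    path: str = "embedding",
    num_dimensions: int = 2048,
    similarity: str = "cosine",  # assumption: the notebook does not state it
) -> dict:
    """Return an Atlas Vector Search index definition for one vector field."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


def to_documents(rows: list[dict]) -> list[dict]:
    """Keep only the rows whose embedding is an actual list of numbers."""
    return [row for row in rows if isinstance(row.get("embedding"), list)]


definition = build_vector_index_definition()
print(json.dumps(definition, indent=2))

# Toy rows; real embeddings would have 2048 values.
rows = [
    {"artist": "Bob Marley", "song": "Get Up, Stand Up", "embedding": [0.1, 0.2]},
    {"artist": "Adele", "song": "Placeholder title", "embedding": None},
]
docs = to_documents(rows)
print(len(docs))
# With a live connection, the documents could then be stored with:
# collection.insert_many(docs)
```

The definition can be pasted into the Atlas UI when creating the index; its name has to match the one used in the query.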

In [24]:
index = "lyrics_semantics"

def find_songs(text: str) -> list[dict]:
    # Embed the query text, then ask the vector index for the 10 nearest songs
    # among 100 candidates; only artist, song and text are returned
    pipeline = Pipeline()
    pipeline.vector_search(
        index=index,
        path="embedding",
        query_vector=get_embedding(text),
        limit=10,
        num_candidates=100,
    ).project(projection={"_id": 0, "text": 1, "artist": 1, "song": 1})

    return list(collection.aggregate(pipeline.export()))
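
For readers who don't use Monggregate, here is roughly what the same query looks like as a raw aggregation pipeline. This is a sketch based on the documented `$vectorSearch` stage; I have not checked that it matches the output of `pipeline.export()` byte for byte. `build_vector_search_pipeline` and its `with_score` flag are hypothetical additions.

```python
def build_vector_search_pipeline(
    query_vector: list[float],
    index: str = "lyrics_semantics",
    path: str = "embedding",
    limit: int = 10,
    num_candidates: int = 100,
    with_score: bool = False,
) -> list[dict]:
    """Build a $vectorSearch stage followed by a $project stage as plain dicts."""
    projection = {"_id": 0, "text": 1, "artist": 1, "song": 1}
    if with_score:
        # Expose the similarity score computed by the vector search stage.
        projection["score"] = {"$meta": "vectorSearchScore"}
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": query_vector,
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {"$project": projection},
    ]


stages = build_vector_search_pipeline([0.1, 0.2, 0.3], with_score=True)
print(stages)
```

The resulting list can be passed to `collection.aggregate(...)` directly. Projecting the score is handy when judging how close the tenth result really is to the first.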

### Testing some queries

In [28]:
# French query, in English: "Let us fight for our freedom and the defense of our rights."
description = """
Combattons pour notre liberté et la défense de nos droits.


"""

results = find_songs(description)
results

[{'artist': 'Bob Marley',
  'song': 'Get Up, Stand Up',
  'text': "Get up, stand up, stand up for your right  \nGet up, stand up, stand up for your right  \nGet up, stand up, stand up for your right  \nGet up, stand up, don't give up the fight  \n  \nPreacher man don't tell me heaven is under the earth  \nI know you don't know what life is really worth  \nIs not all that glitters in gold and  \nHalf the story has never been told  \nSo now you see the light, aay  \nStand up for your right. Come on  \n  \nGet up, stand up, stand up for your right  \nGet up, stand up, don't give up the fight  \n(Repeat)  \n  \nMost people think great God will come from the sky  \nTake away ev'rything, and make ev'rybody feel high  \nBut if you know what life is worth  \nYou would look for yours on earth  \nAnd now you see the light  \nYou stand up for your right, yeah!  \n  \nGet up, stand up, stand up for your right  \nGet up, stand up, don't give up the fight  \nGet up, stand up. Life is your right  \nS

In [34]:
def print_results(results:list[dict]):
    for result in results:
        print(f"{result['artist']}: {result['song']}")
        #print(result["text"])
        #print("\n")

In [27]:
for result in results:
    print(f"{result['artist']}: {result['song']}")

Celine Dion: Je Ne Vous Oublie Pas
Marianne Faithfull: Coquillage
Celine Dion: La Religieuse
Michael Bolton: Une Femme Comme Toi
Gloria Estefan: Amour Infini
David Bowie: Chant Of The Ever Circling Skeletal
Celine Dion: Carmen
Celine Dion: Glory Alleluia
Marianne Faithfull: Plaisir D'amour
Celine Dion: Je Ne Veux Pas


In [30]:
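# French query, in English: "I have succeeded in my career, I am free, I no longer give a damn what people think of me."
headlines_description = """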
J'ai réussi dans ma carrière je suis libre, j'en ai plus rien à foutre de ce que les gens pensent de moi.

"""

In [35]:
results = find_songs(headlines_description)
print_results(results)

Gucci Mane: Grown Man
Warren Zevon: Laissez-Moi Tranquille
Celine Dion: Lolita
Ween: Ode To Rene
Drake: Money 2 Blow
Celine Dion: Cherche Encore
Celine Dion: Je Ne Vous Oublie Pas
David Bowie: Almost Grown
Chris Brown: 4 Years Old
David Bowie: Anyway, Anyhow, Anywhere


In [36]:
started_from_the_bottom_description = """
Grinding and going up the ladder

"""

In [37]:
results = find_songs(started_from_the_bottom_description)
print_results(results)

Lauryn Hill: I Had To Walk
David Bowie: Chant Of The Ever Circling Skeletal
Gucci Mane: Hustle
Bob Marley: Iron, Lion, Zion
Drake: Stunt Hard
Gucci Mane: I'm Up
David Bowie: Jump They Say
David Bowie: It Ain't Easy
Drake: Say What's Real
Lauryn Hill: Do You Like The Way
