In [1]:
import pandas as pd
import numpy as np
import chromadb

# Motivation
The goal is to build a vector database filled with data about football players, which makes it possible to find similar players based on their attributes and characteristics. This notebook builds the basic functionality needed to reach that goal. It covers the following steps:
* Build a vector database
* Load players into the database
* Run test queries
* Draw conclusions

The first step is to load the preprocessed data.

In case you want to use the Pinecone vector database instead:

`import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())
api_key = os.getenv("PINECONE_API_KEY")
print(api_key)`

#### Load player statistics and player information

In [2]:
# load data
df = pd.read_csv('../data/preprocessed_data.csv', sep=',')
df_player = pd.read_csv('../data/player_data.csv', sep=',')

# drop the leftover index column
df = df.drop(columns='Unnamed: 0')
df_player = df_player.drop(columns='Unnamed: 0')

df_player

Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
0,212198,Bruno Fernandes,26,88,CAM,Portugal,Manchester United,€250K,€107.5M
1,209658,L. Goretzka,26,87,LDM,Germany,FC Bayern München,€140K,€93M
2,176580,L. Suárez,34,88,RS,Uruguay,Atlético de Madrid,€135K,€44.5M
3,192985,K. De Bruyne,30,91,RCM,Belgium,Manchester City,€350K,€125.5M
4,224334,M. Acuña,29,84,LB,Argentina,Sevilla FC,€45K,€37M
...,...,...,...,...,...,...,...,...,...
16705,240558,18 L. Clayton,17,53,RES,England,Cheltenham Town,€1K,€100K
16706,262846,�. Dobre,20,53,RES,Romania,FC Academica Clinceni,€550,€180K
16707,241317,21 Xue Qinghao,19,47,RES,China PR,Shanghai Shenhua FC,€700,€100K
16708,259646,A. Shaikh,18,47,SUB,India,ATK Mohun Bagan FC,€500,€110K


Drop unnecessary columns.

### Create Vector Database

In [3]:
chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name = "player_database",
    metadata={"hnsw:space": "cosine"}
)


#### Prepare Data
* The `ID` column is taken from the `df` dataframe and stored separately as a one-dimensional list of strings.
* Each row (player) is converted into a one-dimensional list, which is added to the collection as that player's embedding.

For testing purposes, only 50 players were entered at first; the cells below add the full data set.

In [4]:
# get embeddings and ids
ids = df['ID'].astype(str).tolist()
df_cleaned = df.drop(columns='ID')
embeddings = df_cleaned.values.tolist()
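
Before the vectors are added, it may be worth checking that the IDs are unique and that every remaining column is numeric, since the database expects one numeric vector per ID. The sketch below runs these checks on a tiny made-up frame. The column names `Pace` and `Passing` and the helper `split_ids_and_embeddings` are hypothetical and only serve as an illustration.

```python
import pandas as pd


def split_ids_and_embeddings(frame, id_column="ID"):
    """Return (ids, embeddings) after basic sanity checks (illustrative helper)."""
    if frame[id_column].duplicated().any():
        raise ValueError("duplicate IDs found")

    features = frame.drop(columns=id_column)

    # every feature column must be numeric to form a valid vector
    non_numeric = features.select_dtypes(exclude="number").columns.tolist()
    if non_numeric:
        raise ValueError(f"non-numeric columns: {non_numeric}")

    if features.isna().any().any():
        raise ValueError("missing values found")

    ids = frame[id_column].astype(str).tolist()
    return ids, features.values.tolist()


# made-up data, not taken from the player data set
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "Pace": [0.8, 0.5, 0.9],
    "Passing": [0.7, 0.6, 0.4],
})
toy_ids, toy_embeddings = split_ids_and_embeddings(toy)
print(toy_ids)
print(toy_embeddings)
```

Raising early on duplicates or missing values keeps such problems from surfacing later as confusing query results.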


### Add items

In [5]:
# add all player embeddings and their ids to the vector database
collection.add(
    embeddings=embeddings,
    #metadatas=[{"source": "source a"}, {"source": "source b"}],
    ids=ids
)
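
Depending on the Chroma version, a single `add` call may be limited to a maximum batch size, so loading all players at once might fail on some setups. The sketch below shows one way to split the IDs and embeddings into chunks. The helper `add_in_batches`, the default `batch_size` of 5000, and the stand-in `FakeCollection` are assumptions for illustration. Any object with an `add(embeddings=..., ids=...)` method, such as the `collection` above, could be passed in its place.

```python
def add_in_batches(collection, ids, embeddings, batch_size=5000):
    """Add ids/embeddings to `collection` in chunks of at most `batch_size`."""
    if len(ids) != len(embeddings):
        raise ValueError("ids and embeddings must have the same length")

    n_batches = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        collection.add(embeddings=embeddings[start:end], ids=ids[start:end])
        n_batches += 1
    return n_batches


class FakeCollection:
    """Stand-in that only records what was added (not a Chroma class)."""

    def __init__(self):
        self.ids = []

    def add(self, embeddings, ids):
        self.ids.extend(ids)


# made-up data to exercise the helper
demo = FakeCollection()
demo_ids = [str(i) for i in range(12)]
demo_embeddings = [[float(i), 1.0] for i in range(12)]

n = add_in_batches(demo, demo_ids, demo_embeddings, batch_size=5)
print(n, len(demo.ids))
```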

#### Query item
Create a custom query for a single target player.

In [6]:
### random player
#target = df.sample(n=1, random_state=42)

### custom target
target = df[df['ID'] == 212198]

# prepare target
target_id = target['ID'].iloc[0]
target_cleaned = target.drop(columns='ID')
target_embedding = target_cleaned.values.tolist()

In [7]:
query_player = target_embedding

results = collection.query(
    query_embeddings=query_player,
    n_results=10
)

# show the queried player
print("Queried Player")
display(df_player[df_player['ID'] == target_id])

query_idx = list(results.get('ids')[0])

# output player names
print("Similar players")


for idx in query_idx:
    id_int = int(idx)
    display(df_player[df_player['ID'] == id_int])


Queried Player


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
0,212198,Bruno Fernandes,26,88,CAM,Portugal,Manchester United,€250K,€107.5M


Similar players


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
0,212198,Bruno Fernandes,26,88,CAM,Portugal,Manchester United,€250K,€107.5M


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
3,192985,K. De Bruyne,30,91,RCM,Belgium,Manchester City,€350K,€125.5M


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
50,233927,Lucas Paquetá,23,81,CAM,Brazil,Olympique Lyonnais,€65K,€37M


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
115,188152,Oscar,29,82,CAM,Brazil,Shanghai Port FC,€38K,€30M


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
18,204923,M. Sabitzer,27,84,SUB,Austria,FC Bayern München,€110K,€48M


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
179,214097,B. Bourigeaud,27,78,RCM,France,Stade Rennais FC,€47K,€16M


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
56,228251,L. Pellegrini,25,81,CAM,Italy,Roma,€61K,€38M


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
120,187072,L. Stindl,32,82,CAM,Germany,Borussia Mönchengladbach,€41K,€24M


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
7,181291,G. Wijnaldum,30,84,SUB,Netherlands,Paris Saint-Germain,€115K,€40.5M


Unnamed: 0,ID,Name,Age,Overall,Position,Nationality,Club,Wage,Value
47,186942,I. Gündoğan,30,85,LCM,Germany,Manchester City,€185K,€51.5M
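
The ten separate tables above could also be shown as a single ranked table. The sketch below is one way to do that, assuming the returned IDs are strings, `ID` is unique in the lookup frame, and the distances are cosine distances as configured for the collection. The toy frame and the helper name `lookup_in_order` are made up for illustration.

```python
import pandas as pd


def lookup_in_order(players, result_ids, distances):
    """Return the matching player rows in the order given by `result_ids`."""
    # .loc with a list of labels keeps the order of that list
    ordered = (
        players.set_index("ID")
        .loc[[int(i) for i in result_ids]]
        .reset_index()
    )
    # convert each cosine distance into a similarity percentage
    ordered["Similarity %"] = [(1 - d) * 100 for d in distances]
    return ordered


# made-up lookup data, not taken from the player data set
toy_players = pd.DataFrame({"ID": [10, 20, 30], "Name": ["A", "B", "C"]})
ranked = lookup_in_order(toy_players, ["30", "10"], [0.0, 0.02])
print(ranked)
```

With the notebook's objects, the call would presumably take `df_player`, `results['ids'][0]`, and `results['distances'][0]`.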


$$ \text{Similarity Percentage} = (1 - \text{cosine distance}) \times 100 $$

In [8]:
distances = results['distances'][0] 
for dist in distances:
    similarity_percentage = ((1 - dist) * 100)
    print(f"Similarity percentage with base query item: {similarity_percentage:.2f}%")


Similarity percentage with base query item: 100.00%
Similarity percentage with base query item: 98.55%
Similarity percentage with base query item: 98.00%
Similarity percentage with base query item: 97.81%
Similarity percentage with base query item: 97.66%
Similarity percentage with base query item: 97.57%
Similarity percentage with base query item: 97.55%
Similarity percentage with base query item: 97.48%
Similarity percentage with base query item: 97.39%
Similarity percentage with base query item: 97.36%
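
To make the percentages above easier to interpret, the sketch below computes a cosine distance by hand with NumPy and converts it in the same way as the cell above. For a collection configured with the cosine space, Chroma reports a distance of 1 minus the cosine similarity, which is why the distance is subtracted from 1. The vectors and the helper names `cosine_distance` and `similarity_percentage` are illustrative assumptions, not part of the data set.

```python
import numpy as np


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity_percentage(distance):
    """Map a cosine distance to a similarity percentage."""
    return (1.0 - distance) * 100.0


# made-up attribute vectors
player_a = [80.0, 75.0, 60.0]
player_b = [78.0, 70.0, 65.0]

same = cosine_distance(player_a, player_a)
other = cosine_distance(player_a, player_b)

print(f"{similarity_percentage(same):.2f}%")
print(f"{similarity_percentage(other):.2f}%")
```

Cosine similarity ignores vector length, so two players whose attributes differ only by a constant factor would still score close to 100%.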


### Conclusion
The player vectors consist of around 85 dimensions.

#### Loading
Loading the data for more than 16,000 players took less than 5 seconds, quicker than I expected. The IDs were detached from the cleaned data set and stored separately. The player embeddings and their IDs were then added to the database in a single statement.

#### Querying
Querying was also straightforward and took only one statement. The result contains several fields that can be used to look up the player IDs and calculate the similarity percentage.

#### Results
The similarity search works surprisingly well and quickly, and the responses are reasonably accurate. The high-dimensional vector space, combined with cosine similarity, distinguishes effectively between different positions and their characteristics. It picks up whether a player is more offensive or defensive without this being coded explicitly, and the roughly 84 variables do not seem to trigger the curse of dimensionality.

#### Outlook
In the future, it could be meaningful to add filtering or weights for some variables. Open questions are:
 * does age matter
 * does wage matter
 * does value matter
 * does international reputation matter
 * what makes them similar (key attributes)
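
One possible way to experiment with such weights is to scale selected attributes before computing cosine similarity, so that they contribute more or less to the result. The sketch below does this with NumPy on made-up vectors. The attribute order, the weight values, and the helper `weighted_cosine_similarity` are assumptions for illustration, not results from this notebook.

```python
import numpy as np


def weighted_cosine_similarity(a, b, weights):
    """Cosine similarity after scaling each attribute by its weight."""
    w = np.asarray(weights, dtype=float)
    a = np.asarray(a, dtype=float) * w
    b = np.asarray(b, dtype=float) * w
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# hypothetical attribute order: [age, wage, passing]
player_a = [0.9, 0.2, 0.8]
player_b = [0.1, 0.2, 0.8]

# all attributes count equally
unweighted = weighted_cosine_similarity(player_a, player_b, [1.0, 1.0, 1.0])

# a weight of 0 removes age from the comparison entirely
age_ignored = weighted_cosine_similarity(player_a, player_b, [0.0, 1.0, 1.0])

print(f"unweighted:  {unweighted:.3f}")
print(f"age ignored: {age_ignored:.3f}")
```

Applying the same weights to the stored embeddings and to the query vector before they reach the database should have the same effect on a cosine-based search, though this has not been tried in this notebook.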