## How large would the Wikipedia embeddings be in GB?

In [10]:
"""
First, estimate the storage needed for a given number of chunks and embedding dimension to get a sense of the size.
"""

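def estimate_size(n_rows, embedding_dimension):
    """Estimate the raw size of n_rows float32 embeddings, in units of 2**30 bytes."""
    n_floats = n_rows * embedding_dimension
    n_bytes = n_floats * 4  # 4 bytes per float32 value
    n_gigabytes = n_bytes / 2**30  # binary gigabytes (GiB), not 10**9 bytes
    return n_gigabytes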

cohere_embedding_vector_dimension = 768
n_rows = 36 * 10**6
print(f"dimension of cohere embedding vector: {cohere_embedding_vector_dimension}")


print(f"size of cohere embedding vector: {estimate_size(n_rows, cohere_embedding_vector_dimension)} GB")
print(f"size if 438 dimensions: {estimate_size(n_rows, 438)} GB")
print(f"size if 16 dimensions: {estimate_size(n_rows, 16)} GB")


dimension of cohere embedding vector: 768
size of cohere embedding vector: 102.996826171875 GB
size if 438 dimensions: 58.74037742614746 GB
size if 16 dimensions: 2.1457672119140625 GB
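One detail worth making explicit: `estimate_size` divides by `2**30`, so the figures above are binary gigabytes (GiB), which are slightly smaller numbers than decimal GB for the same byte count. The sketch below shows both for the same 36 million rows at 768 dimensions. The helper name `size_in_units` is hypothetical and not part of the original notebook, and float32 storage (4 bytes per value) is assumed.

```python
def size_in_units(n_rows, embedding_dimension, bytes_per_value=4):
    """Return the raw embedding size in bytes, decimal GB, and binary GiB.

    Assumes float32 storage (4 bytes per value) unless told otherwise.
    """
    n_bytes = n_rows * embedding_dimension * bytes_per_value
    return {
        "bytes": n_bytes,
        "GB": n_bytes / 10**9,   # decimal gigabytes
        "GiB": n_bytes / 2**30,  # binary gigabytes, as used by estimate_size
    }


sizes = size_in_units(36 * 10**6, 768)
print(f"{sizes['bytes']:,} bytes = {sizes['GB']:.2f} GB = {sizes['GiB']:.2f} GiB")
# prints: 110,592,000,000 bytes = 110.59 GB = 103.00 GiB
```

Which unit matters depends on how the vector database or cloud provider reports capacity, so it is worth checking before comparing against a quota.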


In [15]:
embedding_dimensions = [128, 384, 438, 768, 1024, 1536]

n_vectors = [
    500_000,
    1_000_000,
    2_000_000,
    3_000_000,
    4_000_000,
    5_000_000,
    6_000_000,
    7_000_000,
    8_000_000,
    9_000_000,
    10_000_000,
    20_000_000,
    50_000_000,
    100_000_000,
]

import pandas as pd

# Build the table of estimated sizes directly as floats:
# one row per embedding dimension, one column per number of vectors
df = pd.DataFrame(
    {n: [estimate_size(n, dim) for dim in embedding_dimensions] for n in n_vectors},
    index=embedding_dimensions,
)

# Format column names to be human readable
df.columns = [f"{col:,} vectors" for col in df.columns]

# Display the dataframe
df


Estimated size in GiB (2**30 bytes, float32) per embedding dimension and number of vectors:

| Embedding dimension | 500,000 vectors | 1,000,000 vectors | 2,000,000 vectors | 3,000,000 vectors | 4,000,000 vectors | 5,000,000 vectors | 6,000,000 vectors | 7,000,000 vectors | 8,000,000 vectors | 9,000,000 vectors | 10,000,000 vectors | 20,000,000 vectors | 50,000,000 vectors | 100,000,000 vectors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 128 | 0.238419 | 0.476837 | 0.953674 | 1.430511 | 1.907349 | 2.384186 | 2.861023 | 3.33786 | 3.814697 | 4.291534 | 4.768372 | 9.536743 | 23.841858 | 47.683716 |
| 384 | 0.715256 | 1.430511 | 2.861023 | 4.291534 | 5.722046 | 7.152557 | 8.583069 | 10.01358 | 11.444092 | 12.874603 | 14.305115 | 28.610229 | 71.525574 | 143.051147 |
| 438 | 0.815839 | 1.631677 | 3.263354 | 4.895031 | 6.526709 | 8.158386 | 9.790063 | 11.42174 | 13.053417 | 14.685094 | 16.316772 | 32.633543 | 81.583858 | 163.167715 |
| 768 | 1.430511 | 2.861023 | 5.722046 | 8.583069 | 11.444092 | 14.305115 | 17.166138 | 20.027161 | 22.888184 | 25.749207 | 28.610229 | 57.220459 | 143.051147 | 286.102295 |
| 1024 | 1.907349 | 3.814697 | 7.629395 | 11.444092 | 15.258789 | 19.073486 | 22.888184 | 26.702881 | 30.517578 | 34.332275 | 38.146973 | 76.293945 | 190.734863 | 381.469727 |
| 1536 | 2.861023 | 5.722046 | 11.444092 | 17.166138 | 22.888184 | 28.610229 | 34.332275 | 40.054321 | 45.776367 | 51.498413 | 57.220459 | 114.440918 | 286.102295 | 572.20459 |


**Conclusion**

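English will run into memory issues: at the 36 million rows and 768 dimensions assumed above, the embeddings alone come to roughly 103 GB by the estimate above.  
Alternative solutions to look at:
 - Another vector DB that supports quantization?
 - Dimensionality reduction?
 - Fewer vectors, i.e. more text per chunk.
 - Binary vectors?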
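To get a rough feel for what quantization or binary vectors could save, the sketch below redoes the size estimate with fewer bytes per dimension. It is only back-of-the-envelope arithmetic on raw vector storage: index overhead and metadata are ignored, and it says nothing about retrieval quality or about which formats a particular vector DB supports. The names `BYTES_PER_DIMENSION` and `estimate_size_gib` are hypothetical, introduced here for illustration.

```python
# Assumed storage cost per dimension for each format (raw vectors only;
# index overhead and metadata are not included).
BYTES_PER_DIMENSION = {
    "float32": 4.0,
    "float16": 2.0,
    "int8 (scalar quantization)": 1.0,
    "binary (1 bit per dimension)": 1 / 8,
}


def estimate_size_gib(n_rows, embedding_dimension, bytes_per_dimension=4.0):
    """Return the raw storage size in GiB (2**30 bytes)."""
    return n_rows * embedding_dimension * bytes_per_dimension / 2**30


N_ROWS = 36 * 10**6  # same row count as the estimate above
DIMENSION = 768      # same embedding dimension as the estimate above

for name, width in BYTES_PER_DIMENSION.items():
    print(f"{name}: {estimate_size_gib(N_ROWS, DIMENSION, width):.1f} GiB")
```

Under these assumptions the four formats come out at about 103.0, 51.5, 25.7 and 3.2 GiB. Dimensionality reduction and larger chunks scale the result linearly in the same way, through `embedding_dimension` and `n_rows` respectively.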


Need to test this on one language first; start with Icelandic.  

Continue with Pinecone for now, and check how much memory and usage the Icelandic Wikipedia results in. 

