# Processing cost and time for tagging historical data

Count tokens

- Price per year
- n docs


### OpenAI pricing as of 12/14/2023

model: `gpt-4-1106-preview`
- input: $0.01 per 1K tokens = $1e-5 per token
- output: $0.03 per 1K tokens = $3e-5 per token
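
The per-token rates above turn into a dollar estimate with one multiplication per token type. The sketch below is illustrative only: `estimate_cost` and its argument names are hypothetical, and the output-token count is an assumed input, since this notebook only counts input tokens.

```python
# Rates for gpt-4-1106-preview as listed above (USD per token).
INPUT_COST_PER_TOKEN = 0.01 / 1000
OUTPUT_COST_PER_TOKEN = 0.03 / 1000


def estimate_cost(input_tokens: int, output_tokens: int = 0) -> float:
    """Return the estimated cost in USD for a given number of tokens.

    Hypothetical helper: output_tokens defaults to 0 because the
    notebook's estimate covers input tokens only.
    """
    return (
        input_tokens * INPUT_COST_PER_TOKEN
        + output_tokens * OUTPUT_COST_PER_TOKEN
    )


# Example: a 1,000-token prompt with an assumed 100-token response.
example_cost = estimate_cost(input_tokens=1_000, output_tokens=100)
print(f"${example_cost:.4f}")  # prints "$0.0130"
```

Because output tokens cost three times as much as input tokens, long responses would shift the estimate noticeably; short tag-style outputs would add comparatively little.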

In [1]:
import tiktoken
from datetime import datetime
import altair as alt
from database import load_sqlite_as_df

INPUT_COST_PER_TOKEN = 0.01 / 1000

In [2]:
df = load_sqlite_as_df()

In [3]:
def count_token(string: str, encoding_name: str = "cl100k_base") -> int:
    """Return the number of tokens in a text string for a tiktoken encoding."""
    # "cl100k_base" is an encoding name, not a model name; GPT-4 models use it.
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))

In [4]:
# Count tokens for price estimation
df["title_tokens"] = df["title"].apply(count_token)
df["summary_tokens"] = df["summary"].apply(count_token)
df["content_tokens"] = df["content"].apply(count_token)

df["title_and_summary_tokens"] = df["title_tokens"] + df["summary_tokens"]
df["title_and_content_tokens"] = df["title_tokens"] + df["content_tokens"]

# Estimate input-token cost (output tokens are not counted)
df["cost_title_and_summary"] = df["title_and_summary_tokens"] * INPUT_COST_PER_TOKEN
df["cost_title_and_content"] = df["title_and_content_tokens"] * INPUT_COST_PER_TOKEN

# Get year
df["date"] = df["date_gmt"].apply(datetime.fromisoformat)
df["year"] = df["date"].dt.year

In [5]:
# Make plot data

df_plot = (
    df.groupby("year")
    .agg({"cost_title_and_summary": "sum", "cost_title_and_content": "sum"})
    .reset_index()
    .melt(id_vars="year", value_name="cost")
)

df_plot["plan"] = df_plot["variable"].apply(lambda x: x.split("_")[-1])

In [6]:
df_plot.sample(5)

|    | year | variable | cost | plan |
|---:|---:|---|---:|---|
| 14 | 2011 | cost_title_and_summary | 0.66261 | summary |
| 45 | 2015 | cost_title_and_content | 5.42788 | content |
| 36 | 2006 | cost_title_and_content | 6.79363 | content |
| 50 | 2020 | cost_title_and_content | 5.80743 | content |
| 37 | 2007 | cost_title_and_content | 5.7756 | content |


In [7]:
alt.Chart(df_plot).mark_bar().encode(
    x="year", y=alt.Y("cost", title="cost($)"), color="plan"
)

In [8]:
df_plot.groupby("plan").agg({"cost": "sum"}).reset_index()

|    | plan | cost |
|---:|---|---:|
| 0 | content | 151.85273 |
| 1 | summary | 13.08241 |


Time cost

- The DSI account has a rate limit of 300,000 tokens per minute (TPM)
- Assume processing at 1/3 of the maximum, i.e. 100,000 TPM


In [9]:
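PROCESS_TPM = 1e5  # 1/3 of the 300,000 TPM limit

# Minutes needed = total input tokens / tokens processed per minute
total_process_minute_long = df["title_and_content_tokens"].sum() / PROCESS_TPM
total_process_minute_short = df["title_and_summary_tokens"].sum() / PROCESS_TPM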

print(f"Total process time (long): {total_process_minute_long:.2f} minutes")
print(f"Total process time (short): {total_process_minute_short:.2f} minutes")

Total process time (long): 151.85 minutes
Total process time (short): 13.08 minutes
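
The time estimate is just total tokens divided by the effective tokens-per-minute rate, so it is easy to see how it changes with the share of the rate limit actually used. The following is a minimal sketch: `processing_minutes` and its `utilization` parameter are hypothetical names, and the token total is back-derived from the $151.85273 input cost reported above (cost divided by $1e-5 per token), not read from the DataFrame.

```python
TPM_LIMIT = 300_000  # account rate limit stated above, tokens per minute

# Assumption: total title + content tokens, back-derived from the
# reported input cost of $151.85273 at $1e-5 per token.
TOTAL_TOKENS_LONG = 15_185_273


def processing_minutes(
    total_tokens: int,
    tpm_limit: int = TPM_LIMIT,
    utilization: float = 1 / 3,
) -> float:
    """Return the minutes needed to send total_tokens at a fraction of the TPM limit."""
    effective_tpm = tpm_limit * utilization
    return total_tokens / effective_tpm


minutes = processing_minutes(TOTAL_TOKENS_LONG)
print(f"{minutes:.2f} minutes = {minutes / 60:.2f} hours")

# Using the full limit instead of one third cuts the time by a factor of three.
print(f"{processing_minutes(TOTAL_TOKENS_LONG, utilization=1.0):.2f} minutes")
```

This ignores request latency, retries, and output tokens, so it is best read as a lower bound on wall-clock time.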


Summary

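- Processing everything (title + content, 34,635 news articles) should cost under $200 USD: input tokens alone come to about $152, and output tokens are not included in this estimate. At 100,000 TPM, processing should take about 152 minutes, roughly 2.5 hours. Very manageable.
- Another, more cost-efficient option is to process only the title + summary, which would cost under $20 USD (about $13 in input tokens) and take about 13 minutes for the same 34,635 articles.
- In conclusion, we can process everything at a reasonable cost and in a reasonable time.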
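
As a quick sanity check on the totals, the per-article cost can be computed from the figures reported in this notebook. This is a small sketch; `per_article_cost` is a hypothetical helper, and the totals are the input-only costs from the table above.

```python
N_ARTICLES = 34_635  # number of news articles stated in the summary

# Input-token cost totals (USD) reported by the groupby above.
TOTAL_COST_USD = {"content": 151.85273, "summary": 13.08241}


def per_article_cost(total_cost_usd: float, n_articles: int = N_ARTICLES) -> float:
    """Return the average input cost in USD per article."""
    return total_cost_usd / n_articles


for plan, total in TOTAL_COST_USD.items():
    print(f"title + {plan}: ${per_article_cost(total):.5f} per article")
```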