In [15]:
import pandas as pd 
import numpy as np

In [16]:
df = pd.read_csv("./data/posts_rows.csv")
df = df.sort_values(by="post_order")

In [17]:
df.head(100)

Unnamed: 0,post_id,date,title,body,user_id,tag,image_urls,summary,post_order
33,56c08225-a216-4c92-9481-bdf9c396913c,22/10/2024,Strom Drain Version One,Today I have started the creation of my own pe...,fbc72f17-b191-48a6-86ab-54ed20be6cf1,project-progress,[],,1
44,612c56d6-78f7-4dda-a81b-a93b8a7f85be,22/10/2024,Search Algorithm of Life,I was thinking a lot today about how I approac...,fbc72f17-b191-48a6-86ab-54ed20be6cf1,thoughts,[],,2
123,fdedbbc8-5198-4a79-b3fb-a70336c83254,23/10/2024,Demo Advice,Angus just told me when demoing a project to a...,fbc72f17-b191-48a6-86ab-54ed20be6cf1,thoughts,[],,3
95,c693fe03-08e6-414b-9aa3-f17095a789b7,23/10/2024,Stormdrain progress,I have been able to get the frontend to talk t...,fbc72f17-b191-48a6-86ab-54ed20be6cf1,project-progress,[],,4
73,a9c73ec7-b053-4d38-ba2d-b2fd06cfe5cd,23/10/2024,useRef Hook,After watching a tutorial trying to figure out...,fbc72f17-b191-48a6-86ab-54ed20be6cf1,frontend-learning,[],,5
...,...,...,...,...,...,...,...,...,...
8,133fefcd-c37c-4dbd-977c-e9566478b665,13/11/2024,Terminology Alert!!!,Be aware in the nextjs documentation api route...,fbc72f17-b191-48a6-86ab-54ed20be6cf1,frontend-learning,[],,96
99,d0381848-5c8e-4240-afcd-e459f335420e,13/11/2024,ML Projects,The two big projects I will do in order or get...,fbc72f17-b191-48a6-86ab-54ed20be6cf1,machine-learning,[],,97
32,53fbf4df-fbbc-4573-9c5e-aac15af10f2b,14/11/2024,Data Engineer AI App,After speaking with Angus about his supercord ...,fbc72f17-b191-48a6-86ab-54ed20be6cf1,thoughts,[],,98
107,db0c5775-8dea-4523-a6f0-a11f4d5416f4,14/11/2024,Azure Storage Accounts,Yesterday and today I have gone through the le...,fbc72f17-b191-48a6-86ab-54ed20be6cf1,azure-data-engineer,[],,99


In [18]:
body = df["body"]
body.head()

33     Today I have started the creation of my own pe...
44     I was thinking a lot today about how I approac...
123    Angus just told me when demoing a project to a...
95     I have been able to get the frontend to talk t...
73     After watching a tutorial trying to figure out...
Name: body, dtype: object

In [19]:
from dotenv import load_dotenv
import os

load_dotenv()

OPENAI_KEY = os.getenv('OPENAI_KEY')
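
`os.getenv` returns `None` when the variable is not set, which can make the resulting error harder to trace later on. A minimal sketch of a fail-fast check, assuming the variable name `OPENAI_KEY` used in the cell above; `require_env` is a hypothetical helper name, not part of `os` or `python-dotenv`:

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set; check your .env file")
    return value

# Possible usage in the notebook, after load_dotenv():
# OPENAI_KEY = require_env("OPENAI_KEY")
```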

In [20]:
from openai import OpenAI
client = OpenAI(
    api_key=OPENAI_KEY
)

def get_embedding(text, model="text-embedding-3-small"):
    # Replace newlines with spaces before embedding
    text = text.replace("\n", " ")
    # Call the embeddings endpoint and return the vector for the single input
    response = client.embeddings.create(input=[text], model=model)
    return response.data[0].embedding

# Create the combined column allowing us to embed the title and the body of each post
df['combined'] = (
    "Title: " + df.title.str.strip() + "; Content: " + df.body.str.strip().apply(lambda x: x.replace("\n", " "))
)

# Add the embedding column that we want
df['embedding'] = df.combined.apply(lambda x: get_embedding(x, model='text-embedding-3-small'))
# Save the new rows to the csv
df.to_csv('./data/posts-with-embeddings.csv', index=False)
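
One thing to keep in mind with the cell above: `to_csv` writes each embedding list as its string representation, so reading the file back with `pd.read_csv` gives strings rather than lists of floats. A small sketch of converting them back, assuming the column was written from plain Python lists of floats (which is what `get_embedding` returns); `parse_embedding` is a hypothetical helper name:

```python
import json

def parse_embedding(cell: str) -> list[float]:
    """Turn the text form of an embedding (e.g. "[0.1, -0.2]") back into a list of floats."""
    values = json.loads(cell)
    return [float(v) for v in values]

# str() of a list of finite floats is also valid JSON, which is why json.loads works here
stored = str([0.25, -0.5, 1.0])
restored = parse_embedding(stored)
print(restored)
```

In the notebook this could be applied after loading the file, for example `df["embedding"] = df["embedding"].apply(parse_embedding)`.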

In [21]:
# Create the combined column allowing us to embed the title and the body of each post
df['combined'] = (
    "Title: " + df.title.str.strip() + "; Content: " + df.body.str.strip().apply(lambda x: x.replace("\n", " "))
)

df['combined'][20]

'Title: Confidence vs Prediction Intervals; Content: I finished off the Applied Statistics for Machine Learning Engineers video from mike west and it covered a lot!  I will definitely need to sit down and review the content in the future but I think I will hold off until I have completed the entire course. Today what I learned was the there are there main types of intervals we look at, tolerance, confidence and prediction.  Prediction intervals define our uncertainty or certainty in a models prediction, while confidence intervals focus on the certainty in a specific model parameter say the mean or std. Mike provided a great diagram to understand the difference between the two of these. With tolerance interval, I am still a little confused.  After a little more research, I think that I understand it more.  Essentially it looks at the observed values as a whole as opposed to a single prediction or a parameter like prediction and confidence intervals.  A tolerance interval makes a stateme

### Prompt Format
After testing a few summarisation prompts on gpt-4o-mini, I have found one that seems to produce the kind of TLDR summaries I want for my blog:

"Please provide a TLDR summary of the following content in one sentence: [insert combined title and body of post here, with \n stripped] The summary should capture the main points and give an overview of the key activities or plans mentioned."
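
The template above can be kept in one place so the notebook text and the code do not drift apart. A minimal sketch; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, and the wording is copied from the prompt quoted above:

```python
PROMPT_TEMPLATE = (
    "Please provide a TLDR summary of the following content in one sentence: "
    "{content} "
    "The summary should capture the main points and give an overview of the "
    "key activities or plans mentioned."
)

def build_prompt(combined: str) -> str:
    """Fill the template with a post's combined title and body, newlines stripped."""
    # split() + join collapses newlines and any repeated whitespace into single spaces
    content = " ".join(combined.split())
    return PROMPT_TEMPLATE.format(content=content)

print(build_prompt("Title: useRef Hook; Content: After watching\na tutorial"))
```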

In [30]:
from openai import OpenAI

def get_summary(combined):
    
    prompt = f"Please provide a TLDR summary of the following content in one sentence: {combined} The summary should capture the main points and give an overview of the key activities or plans mentioned. The author's name is Adam, for your reference if needed."
    chat_completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    
    result = chat_completion.choices[0].message.content
    
    return result

# Quick test on a single post (uncomment to run)
# test = get_summary(df["combined"][0])
# print(test)
# Add the summary column
df['summary'] = df.combined.apply(lambda x: get_summary(x))
# save the resulting dataframe to a csv for analysis and to reupload to database when completed
df = df.drop('combined', axis=1) # Removed combined column as not needed in final db 
df.to_csv('./data/posts-with-embeddings-and-summary.csv', index=False)


KeyboardInterrupt: 
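
The `KeyboardInterrupt` above suggests the run was stopped partway through, and because the column is only assigned once `apply` finishes, nothing is kept when that happens. One possible way around this, sketched under assumptions: loop over the posts, skip any that already have a summary, and store each result as soon as it arrives. The function `fill_missing_summaries` and its `summarise` argument are hypothetical; in the notebook `summarise` would be `get_summary`.

```python
from typing import Callable, Optional

def fill_missing_summaries(
    texts: dict[str, str],
    summaries: dict[str, Optional[str]],
    summarise: Callable[[str], str],
) -> dict[str, Optional[str]]:
    """Summarise only the posts that have no summary yet, keyed by post id.

    Results are written into `summaries` one at a time, so an interrupted run
    keeps everything finished so far and can be resumed by calling this again.
    """
    for post_id, text in texts.items():
        if summaries.get(post_id):  # already done on an earlier run
            continue
        summaries[post_id] = summarise(text)
    return summaries

# Stand-in for get_summary so the sketch runs without an API call
def fake_summarise(text: str) -> str:
    return text[:10]

texts = {"a": "first post body", "b": "second post body"}
done = fill_missing_summaries(texts, {"a": "kept"}, fake_summarise)
print(done)
```

Writing `summaries` to disk every few posts would additionally let the progress survive a kernel restart.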