In [1]:
# import libraries
import googleapiclient
from googleapiclient.discovery import build
import pandas as pd
import re
import ollama
from langchain.schema import HumanMessage
import chromadb
from sentence_transformers import SentenceTransformer



In [2]:
response = ollama.generate(model='llama2:7B', prompt='how many parameters do you have')
print(response['response'])


As a text-based AI language model, I don't have any physical parameters or characteristics. However, I can provide information on the number of parameters in different contexts:

1. Number of parameters in a linear regression model: In a linear regression model, the number of parameters (also known as coefficients) is equal to the number of independent variables plus one. For example, if you have two independent variables, the model will have three parameters (one for each independent variable and one for the intercept).
2. Number of parameters in a neural network: The number of parameters in a neural network depends on the number of layers, the number of neurons in each layer, and the complexity of the connections between them. In general, the number of parameters in a neural network can be quite large, especially for deeper networks with more layers and more complex connections.
3. Number of parameters in a statistical model: The number of parameters in a statistical model depends o

In [3]:
response = ollama.generate(model='llama2:7B', prompt='how can I input a dataframe into chroma db on python')
print(response['response'])


To input a Pandas DataFrame into ChromaDB using Python, you can use the `chroma_db.load()` function. Here's an example of how to do this:
```
import pandas as pd
from chroma_db import ChromaDB

# Load the dataframe into a ChromaDB object
db = ChromaDB()
db.load(pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})
```
In this example, we first create a Pandas DataFrame with two columns (`x` and `y`) and three rows. We then pass this DataFrame to the `load()` function of the ChromaDB object, which will store the data in the ChromaDB database.

You can also specify additional options when loading the DataFrame, such as the column names to use for the ChromaDB keys and values, like this:
```
db = ChromaDB()
db.load(pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]}), key_column='x', value_column='y')
```
This tells ChromaDB to use the `x` column as the keys and the `y` column as the values in the database.

Alternatively, you can also load a DataFrame from a file using the `load_from_csv()` function

In [4]:
response = ollama.generate(model='llama2:7B', prompt='what is your context length')
print(response['response'])


The context length refers to the number of words or characters that are considered when calculating the similarity between two texts. The context length can affect the accuracy of the text similarity measurement, as longer context lengths can capture more subtle similarities and variations in language use.

Here are some common context lengths used in text similarity measures:

1. Word-level: This is the most common context length, where the similarity between two texts is measured based on the number of shared words.
2. Character-level: At this level, the similarity is measured based on the number of shared characters between two texts, without considering word boundaries.
3. Context window: This involves dividing a text into a fixed-size context window and measuring the similarity between the window and another text. The size of the context window can vary, but common sizes include 10, 20, or 30 words.
4. N-grams: This involves measuring the similarity between two texts based on a s

In [5]:
response = ollama.generate(model='llama2:7B', prompt='how long of a response can you type out')
print(response['response'])


I can generate text responses of varying lengths, but the maximum length of a response I can provide is limited by the platform's character limit. On most platforms, my response will be limited to around 200-300 characters. However, if you have a longer message or request, feel free to ask and I will do my best to accommodate it!


In [6]:
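import os

# read the api key from an environment variable instead of hard-coding it in the notebook
# (YOUTUBE_API_KEY is an assumed variable name)
api_key = os.environ["YOUTUBE_API_KEY"]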
video_id = ["oqL7Ke4O3fg", "Tl8RS0sR-qA", "u4LUix-BU0s", "FHsONupIdlo", "KZeIEiBrT_w", "A5w-dEgIU1M", "bBC-nXj3Ng4", "vhzYhq0oTu8", "8Gm7kSUkBAk", "IF8YNn4v-y4", "aiv6kJ7eJ5U"]
# create a list of dictionaries that will store the output
output = []
# use api key to create youtube object
youtube = build('youtube', 'v3', developerKey=api_key)

In [7]:
# function to call api and produce a list of comment-reply pairs

# video == specify id of the video to get comments from
# amount_of_comments == amount of comments to retrieve from current video
# output == list item to store comment-reply pairs (pass as argument so function can continuously add to it)
def call_youTube_API(video, amount_of_comments, output):

    # call api to get comments on a particular video using video id
    # order comments by relevance, popular comments are more likely to have replies
    apiCall = youtube.commentThreads().list(part=["snippet","replies"], videoId=video, maxResults=amount_of_comments, order="relevance").execute()
    
    # iterate through the comments the api returned
    for i in range(len(apiCall["items"])):
        
        # get comment text
        textOutput = apiCall["items"][i]["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        
        # get count of replies
        replyCount = apiCall["items"][i]["snippet"]["totalReplyCount"]
        
        if replyCount > 0:
            
            # get list of all the returned replies (api usually returns 5 replies)
            replies = apiCall["items"][i]["replies"]["comments"]

            # get the likes per reply
            likes = []
            for reply in replies:
                likes.append(reply["snippet"]["likeCount"])
                
            # get index of reply with most likes
            maxIndex = likes.index(max(likes))
            # get text of reply with most likes
            mostLikedReplyText = replies[maxIndex]["snippet"]["textDisplay"]

            # save comment text and most liked reply text to the output list of dictionaries
            output.append({"comment":textOutput, "reply":mostLikedReplyText})
        
    return
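
The index lookup for the most liked reply can also be written with `max` and a key function. A minimal sketch on made-up data shaped like the `replies` list; `most_liked_reply` is a hypothetical helper name, not part of the notebook above.

```python
def most_liked_reply(replies):
    """Return the text of the reply with the highest likeCount (first one wins ties)."""
    best = max(replies, key=lambda reply: reply["snippet"]["likeCount"])
    return best["snippet"]["textDisplay"]

# made-up sample shaped like apiCall["items"][i]["replies"]["comments"]
sample_replies = [
    {"snippet": {"textDisplay": "first reply", "likeCount": 2}},
    {"snippet": {"textDisplay": "second reply", "likeCount": 7}},
    {"snippet": {"textDisplay": "third reply", "likeCount": 7}},
]

print(most_liked_reply(sample_replies))
```

Like `likes.index(max(likes))`, `max` keeps the first reply when several share the top like count.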

In [8]:
# call function to get comments through all videos
for video in video_id:
    
    call_youTube_API(video, 250, output)

# convert list of dictionaries to dataframe
output = pd.DataFrame(output)
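
The YouTube Data API documents `maxResults` for `commentThreads.list` as a value from 1 to 100, so a request for 250 is not guaranteed to return 250 threads in one call. Getting more would mean following `nextPageToken` across calls. A sketch of that loop, assuming each response carries `items` and an optional `nextPageToken`; `fetch_page` is a hypothetical callable standing in for the real request with a `pageToken` argument, and the fake pages are made up.

```python
def collect_items(fetch_page, wanted):
    """Follow nextPageToken until `wanted` items are collected or the pages run out."""
    items = []
    token = None
    while len(items) < wanted:
        response = fetch_page(token)
        items.extend(response.get("items", []))
        token = response.get("nextPageToken")
        if token is None:
            break
    return items[:wanted]

# made-up stand-in for the API: three pages keyed by page token
fake_pages = {
    None: {"items": [1, 2, 3], "nextPageToken": "p2"},
    "p2": {"items": [4, 5, 6], "nextPageToken": "p3"},
    "p3": {"items": [7, 8]},
}

def fake_fetch(token):
    return fake_pages[token]

print(collect_items(fake_fetch, 5))
print(len(collect_items(fake_fetch, 250)))
```

Passing the fetch step in as a callable keeps the paging logic separate from the API client, so it can be checked without network access.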

In [9]:
# do some data cleaning

# in order: remove links, html tags, special characters and punctuation, emojis
output['comment'] = output['comment'].astype(str) \
    .str.replace(r"http\S+|www\S+|https\S+", "", regex=True) \
    .str.replace(r"<.*?>", "", regex=True) \
    .str.replace(r"[^\w\s]", "", regex=True) \
    .str.replace(r"[\U00010000-\U0010ffff]", "", regex=True)

output['reply'] = output['reply'].astype(str) \
    .str.replace(r"http\S+|www\S+|https\S+", "", regex=True) \
    .str.replace(r"<.*?>", "", regex=True) \
    .str.replace(r"[^\w\s]", "", regex=True) \
    .str.replace(r"[\U00010000-\U0010ffff]", "", regex=True)
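
The same cleaning steps can live in one helper so the comment and reply columns do not repeat the chain. This sketch works on plain strings with the standard library and adds one step that is not in the cell above: decoding HTML entities with `html.unescape` before punctuation is stripped, on the assumption that `textDisplay` can contain entities such as `&#39;`. `clean_comment` is a hypothetical name.

```python
import html
import re

def clean_comment(text):
    """Apply the cleaning steps to one string, decoding HTML entities first."""
    text = html.unescape(str(text))             # assumption: added step, not in the original cell
    text = re.sub(r"http\S+|www\S+", "", text)  # links
    text = re.sub(r"<.*?>", "", text)           # html tags
    text = re.sub(r"[^\w\s]", "", text)         # punctuation and symbols, which covers emojis too
    return text

sample = "I don&#39;t care <b>at all</b> https://example.com !"
print(clean_comment(sample).split())
```

Without the decoding step, stripping `&`, `#` and `;` from `&#39;` leaves a stray `39` inside the word. With pandas, `output['comment'].map(clean_comment)` would apply the helper row by row.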

In [10]:
# initialize an instance of the database
# PersistentClient saves the database to disk so I can reference it in other scripts
database = chromadb.PersistentClient(path="./youtube_comment_database")
# create a collection (group of documents and their embeddings)
# get_or_create_collection avoids an error when this cell is re-run and the collection already exists
collection = database.get_or_create_collection(name="youtube_comments")

# import sentence embedder from huggingface
# using all-MiniLM-L6-v2 since llama2:7B doesn't have an encoder and this one is light enough for me to run
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
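
The distance metric of a Chroma collection is chosen when the collection is created, and as far as I know it defaults to squared L2 rather than cosine. If the embeddings are unit length (worth checking for the model in use), the two metrics rank neighbours identically, because the squared L2 distance equals 2 minus twice the cosine similarity. A small standard-library sketch of that identity on made-up 3-d vectors standing in for real embeddings:

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# made-up vectors, normalized to unit length
a = normalize([1.0, 2.0, 2.0])
b = normalize([2.0, 1.0, 0.0])

# for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(squared_l2(a, b), 2 - 2 * cosine_similarity(a, b))
```

For embeddings that are not normalized the identity does not hold, and the choice of metric can change which comments come back first.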

In [17]:
# prepping data for embedding model

# only going to embed the comments
# users will prompt the llm with a comment, and the llm will draft a reply based on the comment-reply pairs the semantic search returns
# the semantic search should only search the comments, since I want the comment-reply pairs for the most similar comments
# therefore the replies will be stored as metadata in the database while only the comments will be embedded

# convert dataframe items to lists
comments = output["comment"].to_list()
replies = output["reply"].to_list()

# convert replies to list of dictionaries so I can pass it as metadata
replies_dict = [{"reply":reply} for reply in replies]

# embed comments
encoded_comments = embedding_model.encode(comments)

In [18]:
# add data into database
collection.add(
    ids=[str(i) for i in range(len(comments))],
    embeddings=encoded_comments,
    documents=comments,
    metadatas=replies_dict
)

In [26]:
# technically this file would end here
# in the next file I will load the database and the model and then query it
# doing a test query here though

query = "the mazda civic sucks"

# need to encode the query with the embedder
encoded_query = embedding_model.encode(query)

# search the database using the encoded query (semantic search)
# Chroma's default distance metric is squared L2, not cosine; to use cosine it has to be set when the collection is created
semantic_search_results = collection.query(query_embeddings=encoded_query, n_results=5)

semantic_search_results

{'ids': [['611', '7', '10', '1', '0']],
 'embeddings': None,
 'documents': [['Well I don39t care whatever others say My Honda Civic puts a smile on my face everytime I drive it One day hopefully I39ll put my hands on an Impreza but for now my Civic is perfect',
   'Just sold a 2024 EX Civic and leased a 2025 premium 3 and man this thing is in a different league than the civic Dont listen to anyone that says but it has a torsion beam rear suspension They dont understand Mazda engineering',
   'Mazdas are great cars very under appreciated ',
   'As a current gen Mazda 3 owner this makes me very happy',
   'When most companies make phones with wheels Mazda does the opposite I love it']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'reply': 'I recently got a Civic 11th gen and its amazing Note that I never drove RWD before so I cant directly compare but the Civic is awesome in corners Same as you always got a big smile driving it Ver
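
The query result is a dictionary of nested lists, with one inner list per query, so the comment, its stored reply, and the distance sit at the same position in `documents`, `metadatas`, and `distances`. A sketch of pairing them up, run on a made-up dictionary shaped like the output above; `comment_reply_pairs` is a hypothetical helper.

```python
# made-up result shaped like the collection.query output (one inner list per query)
sample_results = {
    "ids": [["611", "7"]],
    "documents": [["comment a", "comment b"]],
    "metadatas": [[{"reply": "reply a"}, {"reply": "reply b"}]],
    "distances": [[0.8, 1.1]],
}

def comment_reply_pairs(results, query_index=0):
    """Return (comment, reply, distance) tuples for one query, in the order returned."""
    docs = results["documents"][query_index]
    metas = results["metadatas"][query_index]
    dists = results["distances"][query_index]
    return [(doc, meta["reply"], dist) for doc, meta, dist in zip(docs, metas, dists)]

for comment, reply, distance in comment_reply_pairs(sample_results):
    print(f"{distance:.2f} | {comment} -> {reply}")
```

These tuples are the comment-reply pairs that would be handed to the llm as context when it drafts a reply.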

In [15]:
# separate data pipeline for later use
# ingest comments that do and do not have more than 100 likes

# video == specify id of the video to get comments from
# amount_of_comments == amount of comments to retrieve from current video
# output == list item to store comment-like pairs (pass as argument so function can continuously add to it)
def call_youTube_API_likes(video, amount_of_comments, output):

    # call api to get comments on a particular video using video id
    # order comments by relevance, popular comments are more likely to have replies
    apiCall = youtube.commentThreads().list(part=["snippet","replies"], videoId=video, maxResults=amount_of_comments, order="relevance").execute()
    
    # iterate through the comments the api returned
    for i in range(len(apiCall["items"])):
        
        # get comment text
        textOutput = apiCall["items"][i]["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        
        # get count of likes
        likeCount = apiCall["items"][i]["snippet"]["topLevelComment"]["snippet"]["likeCount"]
        
        output.append({"comment":textOutput,"like_count":likeCount})
        
    return
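
The comment-like pairs collected by this function could then be split at the 100-like mark mentioned above. A sketch on made-up rows shaped like the `output` entries; the `popular` flag and `label_by_likes` are hypothetical names, and "more than 100" is read as strictly greater than 100.

```python
def label_by_likes(rows, threshold=100):
    """Add a boolean 'popular' flag: True when like_count is strictly above threshold."""
    return [{**row, "popular": row["like_count"] > threshold} for row in rows]

# made-up rows shaped like the dictionaries appended to output
sample_rows = [
    {"comment": "great video", "like_count": 250},
    {"comment": "first", "like_count": 3},
    {"comment": "exactly one hundred", "like_count": 100},
]

labeled = label_by_likes(sample_rows)
print([row["popular"] for row in labeled])
```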