# Project 2 AI Chat Notebook

In this notebook, show the ChatGPT interactions you used to solve
the unsupervised and supervised machine learning tasks in `project2.ipynb`.


# ChatGPT Functions to Use

You are free to use these functions, another LLM, or the ChatGPT web interface, as long as you include those systems' outputs
so that we can see you used a large language model to perform the machine learning tasks.


In [1]:
from openai import OpenAI

client = OpenAI(
    # Point the client at the aitools endpoint rather than the default OpenAI URL
    base_url="http://aitools.cs.vt.edu:7860/openai/v1",
    api_key="aitools",
)


In [2]:

def converse(prompt, messages=None,
             model="gpt-3.5-turbo",
             max_tokens=4000,
             temperature=0,
             top_p=1,
             frequency_penalty=0,
             presence_penalty=0):
    # Add the user's message to the list of messages
    if messages is None:
        messages = []

    messages.append({"role": "user", "content": prompt})

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
        top_p=top_p,
        frequency_penalty=frequency_penalty,
        presence_penalty=presence_penalty,
    ).choices[0].message.content

    # Add the assistant's message to the list of messages
    messages.append({"role": "assistant", "content": response})

    return response, messages
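
The cells below rely on `converse` carrying the conversation forward: each call appends the user prompt and the assistant reply to `messages`, and the returned list is passed into the next call. The sketch below illustrates that bookkeeping without a network call. `EchoClient` and `converse_with` are hypothetical stand-ins written for this illustration; they are not part of the `openai` package or of the notebook.

```python
from types import SimpleNamespace


class EchoClient:
    """Hypothetical stand-in for the OpenAI client: it echoes the last user message."""

    def __init__(self):
        # Mimic the attribute path client.chat.completions.create used by converse
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, model, messages, **kwargs):
        reply = f"echo: {messages[-1]['content']}"
        message = SimpleNamespace(content=reply)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


def converse_with(client, prompt, messages=None, model="gpt-3.5-turbo"):
    """Same history handling as converse, with the client passed in explicitly."""
    messages = [] if messages is None else messages
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(model=model, messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply, messages


client_stub = EchoClient()
reply, history = converse_with(client_stub, "load the data")
reply, history = converse_with(client_stub, "now cluster it", history)

# The history alternates user and assistant turns across both calls
roles = [m["role"] for m in history]
print(roles)
```

Passing `[]` (or nothing) starts a fresh conversation, which is why the first prompt in the next section is sent with an empty list.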

# Unsupervised Learning

In this section, show us how you performed the unsupervised learning tasks with the help of ChatGPT.

In [3]:
import glob

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


In [4]:
prompt = """
Load all the data files matching 'data/datasets/train-00000-5k-*.parquet', merge them into a single dataframe named df, and print the head."""

response, messages = converse(prompt, [])
print(response)

import pandas as pd
import glob

# Load all the data files matching the pattern
file_paths = glob.glob('data/datasets/train-00000-5k-*.parquet')
dfs = [pd.read_parquet(file) for file in file_paths]

# Merge the dataframes into a single dataframe
df = pd.concat(dfs)

# Print the head of the merged dataframe
print(df.head())


In [5]:
prompt = """Now fetch the code column from the dataset and name it code_snippets. Do the same for the language column and name it labels."""
response, messages = converse(prompt, messages)
print(response)

# Fetch the 'code' column and name it as code_snippets
code_snippets = df['code']

# Fetch the 'language' column and name it as labels
labels = df['language']

# Print the head of the code_snippets and labels
print("Code Snippets:")
print(code_snippets.head())

print("\nLabels:")
print(labels.head())


In [6]:
prompt = """Now turn code_snippets into TF-IDF features."""
response, messages = converse(prompt, messages)
print(response)

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the code_snippets to get the tf-idf features
tfidf_features = tfidf_vectorizer.fit_transform(code_snippets)

# Print the shape of the tf-idf features
print("Shape of tf-idf features:", tfidf_features.shape)
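
To make the TF-IDF step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer()` applies with its default settings: lowercasing, tokens of two or more word characters, smoothed IDF, and L2 row normalisation. The function `tfidf_rows` and the toy snippets are hypothetical, written only for this illustration; the notebook itself uses scikit-learn.

```python
import math
import re
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    # Lowercase and keep tokens of two or more word characters
    tokenised = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(documents)

    # Document frequency: in how many documents each token appears
    doc_freq = Counter(tok for toks in tokenised for tok in set(toks))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {
        tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1
        for tok in vocabulary
    }

    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocabulary]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocabulary, rows


toy_snippets = [
    "def add(a, b): return a + b",
    "int add(int a, int b) { return a + b; }",
    "SELECT name FROM users",
]
vocabulary, rows = tfidf_rows(toy_snippets)
print(vocabulary)
print(len(rows), "documents x", len(vocabulary), "terms")
```

Single-character tokens such as `a` and `b` are dropped by the token pattern, and a term that is frequent in one snippet but rare across snippets (here `int`) gets the largest weight in its row.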


In [7]:
prompt = """Now perform k-means clustering on this data using the TF-IDF features.
"""
response, messages = converse(prompt, messages)
print(response)

from sklearn.cluster import KMeans

# Initialize the KMeans model with 5 clusters
kmeans = KMeans(n_clusters=5, random_state=42)

# Fit the KMeans model on the tf-idf features
kmeans.fit(tfidf_features)

# Get the cluster labels
cluster_labels = kmeans.labels_

# Print the cluster labels
print("Cluster Labels:")
print(cluster_labels)
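
The `labels` column fetched earlier is not used by k-means, so it can serve as a rough check on how well the clusters line up with programming languages. The sketch below computes cluster purity, the fraction of snippets that belong to the majority language of their cluster. The helper `cluster_purity` and the toy values are hypothetical; in the notebook the inputs would be `cluster_labels` and `labels`.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Return (purity, per-cluster label counts) for paired cluster ids and labels."""
    per_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, true_labels):
        per_cluster[cluster_id][label] += 1

    # Count the snippets that match the majority label of their cluster
    majority_total = sum(c.most_common(1)[0][1] for c in per_cluster.values())
    total = sum(sum(c.values()) for c in per_cluster.values())
    breakdown = {cid: dict(counts) for cid, counts in per_cluster.items()}
    return majority_total / total, breakdown


toy_clusters = [0, 0, 1, 1, 0]
toy_languages = ["Python", "Python", "Java", "Java", "SQL"]
purity, breakdown = cluster_purity(toy_clusters, toy_languages)
print(breakdown)
print(f"purity = {purity:.2f}")
```

A purity near 1.0 means each cluster is dominated by a single language; a low value suggests the number of clusters or the features need another look.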


In [8]:
prompt = """Now let's print the cluster assignments, with each code snippet truncated to 80 characters.
"""
response, messages = converse(prompt, messages)
print(response)

# Create a new dataframe with code snippets and their corresponding cluster labels
cluster_df = pd.DataFrame({'code_snippets': code_snippets, 'cluster_labels': cluster_labels})

# Truncate each code snippet to 80 characters
cluster_df['code_snippets_truncated'] = cluster_df['code_snippets'].apply(lambda x: x[:80])

# Group the dataframe by cluster labels and print the truncated code snippets for each cluster
for cluster_label, group in cluster_df.groupby('cluster_labels'):
    print(f"Cluster {cluster_label}:")
    for code_snippet in group['code_snippets_truncated']:
        print(code_snippet)
    print()


## Showing Unsupervised Clustering Visually (Optional)

In [None]:
prompt = """
- Generate code to make a plot (use matplotlib) showing the data points and clusters. Use a distinct color for each cluster.
"""
response, messages = converse(prompt, messages)
print(response)
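
No response is recorded for this optional cell. As a rough idea of what such a plot could look like, the sketch below projects TF-IDF features to two dimensions with `TruncatedSVD` (which accepts sparse input) and draws one scatter series per cluster so that each gets its own color. The toy snippets, the choice of two clusters, and the use of SVD for the projection are assumptions made for this illustration, not output from the LLM.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for code_snippets
toy_snippets = [
    "def add(first, second): return first + second",
    "def sub(first, second): return first - second",
    "import os\nprint(os.getcwd())",
    "SELECT name FROM users WHERE age > 30",
    "SELECT count(*) FROM orders",
    "INSERT INTO users (name) VALUES ('ann')",
]

features = TfidfVectorizer().fit_transform(toy_snippets)
toy_cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(features)

# Reduce the sparse TF-IDF matrix to two dimensions for plotting
coords = TruncatedSVD(n_components=2, random_state=42).fit_transform(features)

fig, ax = plt.subplots(figsize=(8, 6))
for cluster_id in sorted(set(toy_cluster_labels)):
    mask = toy_cluster_labels == cluster_id
    # Each scatter call takes the next color in matplotlib's default cycle
    ax.scatter(coords[mask, 0], coords[mask, 1], label=f"Cluster {cluster_id}")

ax.set_xlabel("SVD component 1")
ax.set_ylabel("SVD component 2")
ax.set_title("K-means clusters of TF-IDF features (2-D projection)")
ax.legend()
```

With the notebook's own variables, `tfidf_features` and `cluster_labels` would replace `features` and `toy_cluster_labels`.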

# Supervised Learning

In this section, show us how you performed the supervised learning tasks with the help of ChatGPT or another LLM.

In [9]:
prompt = """Now let's work on supervised learning. Define a function named learn that performs all of the following tasks.

- Write a function named learn that takes a dataframe and an index as parameters.
- The index indicates the round number of learning. The function should:
- 1. Create a dictionary to store the measurements for this round
- 2. Print a message indicating the round number
- 3. Store the round number in the dictionary under 'round'
- 4. Split the dataframe into code snippets and labels
- 5. Split the code snippets and labels into training and test sets
- 6. Create a TF-IDF vectorizer
- 7. Use the 'fit_transform' method on the training data to learn the vocabulary and idf, and return term-document matrix.
- 8. Use the 'transform' method on the test data to transform documents to document-term matrix.
- 9. Create a Support Vector Machine classifier
- 10. Train the classifier using the training data
- 11. Save the model to a file and load it back from a file (to make sure it works)
- 12. Use the classifier to predict the labels for the test data
- 13. Print the classification report which should be a dictionary
- 14. Store the training time in the dictionary under 'train'
- 15. Store the prediction time in the dictionary under 'predict'
- 16. Store the classification report in the dictionary under 'report'
- 17. Add 'accuracy' to the dictionary and set it to the accuracy score from the classification report
- 18. Append the dictionary to the measurements list for this round
"""

response, messages = converse(prompt, messages)
print(response)

import pickle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from time import time

measurements = []

def learn(df, index):
    global measurements
    
    # Create a dictionary to store the measurements for this round
    round_measurements = {}
    
    # Print a message indicating the round number
    print(f"Round {index}:")
    
    # Store the round number in the dictionary under 'round'
    round_measurements['round'] = index
    
    # Split the dataframe into code snippets and labels
    code_snippets = df['code']
    labels = df['language']
    
    # Split the code snippets and labels into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(code_snippets, labels, test_size=0.2, random_state=42)
    
    # Create a TF-IDF vectorizer
    tfidf_vectorizer = TfidfVectorizer()
    
    # Use the 'fit_transform
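
The recorded response stops partway through the function. The sketch below is one way the 18 steps in the prompt could be completed; it is not the LLM's actual output. The name `learn_sketch`, the `model_path` parameter, the toy dataframe, and the extra `'tfidf'` timing entry (which the later plotting prompt mentions) are assumptions made for this illustration.

```python
import pickle
import tempfile
from pathlib import Path
from time import perf_counter

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

measurements = []


def learn_sketch(df, index, model_path):
    """Run one learning round on df and append its measurements (hypothetical helper)."""
    round_measurements = {"round": index}
    print(f"Round {index}:")

    # Split into features/labels, then into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        df["code"], df["language"], test_size=0.2, random_state=42
    )

    # Learn vocabulary and idf on the training data only, then transform the test data
    vectorizer = TfidfVectorizer()
    start = perf_counter()
    X_train_tfidf = vectorizer.fit_transform(X_train)
    X_test_tfidf = vectorizer.transform(X_test)
    round_measurements["tfidf"] = perf_counter() - start  # assumed extra entry

    # Train the Support Vector Machine classifier and time it
    classifier = SVC()
    start = perf_counter()
    classifier.fit(X_train_tfidf, y_train)
    round_measurements["train"] = perf_counter() - start

    # Save the model and load it back to confirm the round trip works
    with open(model_path, "wb") as fh:
        pickle.dump(classifier, fh)
    with open(model_path, "rb") as fh:
        classifier = pickle.load(fh)

    # Predict on the test set and time it
    start = perf_counter()
    predictions = classifier.predict(X_test_tfidf)
    round_measurements["predict"] = perf_counter() - start

    # Classification report as a dictionary, including overall accuracy
    report = classification_report(
        y_test, predictions, output_dict=True, zero_division=0
    )
    print(report)
    round_measurements["report"] = report
    round_measurements["accuracy"] = report["accuracy"]

    measurements.append(round_measurements)
    return round_measurements


# Hypothetical toy data with the same column names as the notebook's dataframe
toy_df = pd.DataFrame({
    "code": [
        "def add(first, second): return first + second",
        "import os\nprint(os.getcwd())",
        "for item in range(10): print(item)",
        "class Point: pass",
        "def main(): return None",
        "SELECT name FROM users",
        "SELECT count(*) FROM orders",
        "INSERT INTO users (name) VALUES ('ann')",
        "UPDATE users SET name = 'bob' WHERE id = 1",
        "DELETE FROM orders WHERE id = 2",
    ],
    "language": ["Python"] * 5 + ["SQL"] * 5,
})

with tempfile.TemporaryDirectory() as tmp_dir:
    result = learn_sketch(toy_df, 0, Path(tmp_dir) / "svm_model.pkl")
```

Fitting the vectorizer on the training split only, as step 7 asks, keeps test-set vocabulary from leaking into the model.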

In [11]:
prompt = """Now, in each round, call learn with the data from the data files matching data/datasets/train-00000-5k-*.parquet.
- Load all data files matching 'data/datasets/train-00000-5k*.parquet'.
- For each file loaded, merge the latest data file with the merged data to date,
- and call the learn function with the dataframe and the index of the file in the list of files.
"""
response, messages = converse(prompt, messages)
print(response)

import pandas as pd
import glob

# Load all data files matching 'data/datasets/train-00000-5k*.parquet'
# (sorted so the rounds run in a reproducible order)
file_paths = sorted(glob.glob('data/datasets/train-00000-5k*.parquet'))
merged_data = None

for index, file_path in enumerate(file_paths):
    # Load the data file
    data = pd.read_parquet(file_path)
    
    # Merge the latest data file with the merged data to date
    merged_data = data if merged_data is None else pd.concat([merged_data, data], ignore_index=True)
    
    # Call the learn function with the dataframe and the index of the file in the list of files
    learn(merged_data, index)


In [12]:
prompt = """Write code to plot graphs for tfidf, train, predict and accuracy using matplotlib. Add a legend, axis labels and a title to the graph.
"""
response, messages = converse(prompt, messages)
print(response)

import matplotlib.pyplot as plt

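# Extract the measurements for tfidf, train, predict, and accuracy
# (rounds without a recorded 'tfidf' time plot as gaps instead of repeating the train time)
tfidf_times = [round_measurements.get('tfidf', float('nan')) for round_measurements in measurements]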
train_times = [round_measurements['train'] for round_measurements in measurements]
predict_times = [round_measurements['predict'] for round_measurements in measurements]
accuracies = [round_measurements['accuracy'] for round_measurements in measurements]

# Plot the graphs
plt.figure(figsize=(12, 8))

plt.plot(range(1, len(measurements)+1), tfidf_times, label='TF-IDF Time')
plt.plot(range(1, len(measurements)+1), train_times, label='Train Time')
plt.plot(range(1, len(measurements)+1), predict_times, label='Predict Time')
plt.plot(range(1, len(measurements)+1), accuracies, label='Accuracy')

plt.xlabel('Round')
plt.ylabel('Time (s) / Accuracy')
plt.title('Performance Metrics over Rounds')
plt.legend()
plt.grid(True)
plt.show()
