<a href="https://colab.research.google.com/github/community-bytes/c-bytes/blob/main/GChomp_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GCHOMP Data Processor
This notebook is part of the G-Chomp project for the [Chicago Innovate Hackathon 2023](https://www.chicagoinnovate.tech/hackathon).

For a general description of the project, visit the project [github repo](https://github.com/community-bytes/c-bytes).

This stage of the project takes a dataset of posts from the McNeel Forum as CSV input, parses the content, and runs topic analysis on the dataset.

Most of this notebook is derived from examples shared during [Becoming AI Power-Users: A Hands-on Workshop on Machine Learning and Generative AI](https://www.chicagoinnovate.tech/courses-1/becoming-ai-power-users%3A-a-hands-on-workshop-on-machine-learning-and-generative-ai) by Seyedomid Sajedi of Thornton Tomasetti during Chicago Innovate 2023.

## Step 1: Dependencies
Install the dependencies for computing the embeddings and pre-processing the text from the CSV.

In [1]:
%%capture
!pip install tiktoken
!pip install bertopic datasets accelerate bitsandbytes xformers adjustText
!pip install feedparser

In [2]:
# https://spacy.io/
import spacy
# !pip install --upgrade spacy # might be needed if the default spacy in colab is not working
import requests
from io import BytesIO
#from PyPDF2 import PdfReader
from tqdm import tqdm
import tiktoken
import pickle

from bertopic import BERTopic
#from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

import pandas as pd

## Step 2: Setup Processor Models
These models are set up before parsing the CSV so that the dataset can be swapped out and the notebook rerun easily.

In [4]:
# Setup model to remove 'stop' words - words of high frequency - for topical analysis
nlp_model = spacy.load("en_core_web_sm")
# Get the list of stop words
stop_words = list(nlp_model.Defaults.stop_words)
# print(len(stop_words))
# print(stop_words)

In [5]:
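# Append Rhino-specific stop words.
# Note: spacy_tokenizer below filters on token.is_stop, which does not read this
# list, so these words still appear in the tokenized output.
stop_words.extend(["Rhino", "rhino", "grasshopper"])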

spaCy tokenizer that drops punctuation and stop words from the sample text and returns the remaining tokens as lowercase lemmas

In [6]:
def spacy_tokenizer(text, nlp):
    doc = nlp(text)
    tokens = []
    for token in doc:
        # Check if the token is not punctuation and not a stop word
        if not (token.is_punct or token.is_stop):
            tokens.append(token.lemma_.lower())
    return tokens
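
`token.is_stop` reads spaCy's own vocabulary flags, so words appended to a separate Python list are not filtered by `spacy_tokenizer`. One way to apply such a list is to filter the returned tokens afterwards. The sketch below assumes the tokens are already lowercase lemmas. `drop_custom_stop_words` and `CUSTOM_STOP_WORDS` are hypothetical names, not part of the notebook.

```python
# Hypothetical helper: filter tokens against a custom stop-word list after
# spacy_tokenizer has run. Names are illustrative, not from the notebook.
CUSTOM_STOP_WORDS = {"rhino", "grasshopper"}


def drop_custom_stop_words(tokens, custom_stop_words=CUSTOM_STOP_WORDS):
    """Return the tokens whose lowercase form is not in custom_stop_words."""
    blocked = {word.lower() for word in custom_stop_words}
    return [token for token in tokens if token.lower() not in blocked]


# Toy input standing in for the output of spacy_tokenizer
sample_tokens = ["rhino", "getline", "grasshopper", "component", "lunchbox"]
filtered_tokens = drop_custom_stop_words(sample_tokens)
print(filtered_tokens)
```

Filtering after tokenization leaves the spaCy model untouched, so the same model can be reused with different domain word lists.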

## Step 3: Parsing the dataset
- Load the dataset from our [github repo](https://github.com/community-bytes/c-bytes/tree/main/helpers/csv) (make sure to use the link to the raw file), or
- upload it into the notebook (in that case, name the file "output.csv" and uncomment the local-file code below)
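
The "raw file" link differs from the page GitHub shows in the browser. The sketch below shows one way to derive it from a regular `github.com/.../blob/...` URL. `github_blob_to_raw` is a hypothetical helper, and it assumes the usual `owner/repo/blob/branch/path` layout.

```python
from urllib.parse import urlsplit


def github_blob_to_raw(url: str) -> str:
    """Convert a github.com 'blob' page URL into its raw.githubusercontent.com URL."""
    parts = urlsplit(url)
    if parts.netloc != "github.com":
        raise ValueError(f"not a github.com URL: {url}")
    segments = parts.path.strip("/").split("/")
    # Assumed layout: owner / repo / "blob" / branch / path to file
    if len(segments) < 5 or segments[2] != "blob":
        raise ValueError(f"not a file (blob) URL: {url}")
    owner, repo, _blob, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


raw_url = github_blob_to_raw(
    "https://github.com/community-bytes/c-bytes/blob/main/helpers/csv/output.csv"
)
print(raw_url)
```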

In [None]:
# Retrieve dataset from github repo
import requests
from io import StringIO

# Define the GitHub repository URL and the path to the CSV file
github_repo_url = "https://raw.githubusercontent.com/community-bytes/c-bytes/"
csv_file_path = "main/helpers/csv/output.csv"

# Combine the URL and file path to get the raw file URL
raw_csv_url = github_repo_url + csv_file_path

# Send a GET request to the raw CSV file URL
response = requests.get(raw_csv_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the response body into a DataFrame
    df = pd.read_csv(StringIO(response.text))
    print("CSV file loaded into a DataFrame.")
    # check contents of csv
    # text_values = df['text']
    # print(text_values)
else:
    print("Failed to retrieve CSV file. Status code:", response.status_code)


In [55]:
# Uncomment this if using local csv file
# df = pd.read_csv('output.csv')

In [8]:
# Check contents of CSV
text_values = df['text']
print(text_values)

0      <p>Hello <a class="mention" href="/u/alex_8">@...
1      <p>Hi Garry - that layout appears to be corrup...
2      <p>just to share the script with everyone:<br>...
3      <p><div class="lightbox-wrapper"><a class="lig...
4      <p>Yes it is normal. For a one-shot osnap like...
                             ...                        
123    <p>Hi <a class="mention" href="/u/siemen">@sie...
124    <aside class="quote no-group" data-post="3" da...
125    <p>You can certainly code it up in Grasshopper...
126    <p>I solved the problem, I cant do DataTree&gt...
127    <p><a href="https://mcneel.myjetbrains.com/you...
Name: text, Length: 128, dtype: object


In [9]:
# Strip HTML tags and newline characters from the text
df['text'] = df['text'].str.replace(r'<[^<>]*>', '', regex=True)
df['text'] = df['text'].str.replace(r'\n', ' ', regex=True)
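
The tag-stripping pattern leaves HTML entities such as `&gt;` in place (one is visible in the sample output above). The sketch below is a standard-library variant that also decodes entities. `clean_post_html` is a hypothetical name, and unlike the cell above it collapses every run of whitespace into a single space.

```python
import html
import re

# Same tag pattern as the notebook cell
TAG_RE = re.compile(r"<[^<>]*>")


def clean_post_html(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    no_tags = TAG_RE.sub("", raw)
    # Decode entities only after removing tags, so a decoded "<" is never
    # mistaken for the start of a tag.
    decoded = html.unescape(no_tags)
    return " ".join(decoded.split())


sample = '<p>Hi <a class="mention" href="/u/siemen">@siemen</a></p>\nDataTree&lt;int&gt;'
cleaned = clean_post_html(sample)
print(cleaned)
```

With pandas, a function like this could be applied per row through `df['text'].map(clean_post_html)`.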

In [10]:
# Remove @ referenced names
# df['text'] = df['text'].str.replace(r'@\w+\s*', '', regex=True)
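
The commented-out pattern can be tried on a single string with the standard `re` module. This variant adds a `(?<!\w)` lookbehind, which is an assumption beyond the original pattern, so that e-mail addresses are left alone. `strip_mentions` is a hypothetical name.

```python
import re

# "@name" plus trailing whitespace; the lookbehind skips "@" that follows a
# word character, as in user@example.com
MENTION_RE = re.compile(r"(?<!\w)@\w+\s*")


def strip_mentions(text: str) -> str:
    """Remove @-mentions such as '@alex_8' from a post."""
    return MENTION_RE.sub("", text)


without_mentions = strip_mentions("hello @alex_8 look clear explanation")
print(without_mentions)
```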

In [11]:
# text_values = df['text']
# print(text_values)

### Remove stop words from tokenization

In [12]:
# Tokenize the dataset, keeping each post as a list of separate words
# (the next cell overwrites tokenized_dataset with joined strings)
tokenized_dataset= []
text_dataset =[]
for line in df['text']:
  text = line
  text_dataset.append(line)
  tokenized_page = spacy_tokenizer(text,nlp_model)
  tokenized_dataset.append(tokenized_page)

In [13]:
# tokenize dataset with words from a single post rejoined into a string
tokenized_dataset= []
text_dataset =[]
for line in df['text']:
  text = line
  text_dataset.append(line)
  tokenized_page = spacy_tokenizer(text,nlp_model)
  separator = " "  # Separator between strings, e.g., a space
  tokenized_page = separator.join(tokenized_page)
  tokenized_dataset.append(tokenized_page)

In [14]:
print(tokenized_dataset)

['hello @alex_8 look clear explanation   screen shot 2019 02 24 11.15.09.png1260×655 95.3 kb   note need component lunchbox work dot indicate straight planar basis locate topo_v1.gh 34.8 kb', 'hi garry layout appear corrupt able thing look right   selall ctrl page 3 copytoclipboard ctrl c new layout paste ctrl v   file heap developer look https://mcneel.myjetbrains.com/youtrack/issue/rh-51111 thanks -pascal', 'share script change input method getline feedback group output use risk    noecho -runscript distbetween sub distbetween    dim arrpt0 arrpt1 dbltolerance dbldistance strline arrmidpt   strdot strgroup strlinearr   create line arrowhead strlinearr = rhino getline               isnull(strlinearr exit sub              isnull(strlinearr(0 exit sub              isnull strlinearr(1 exit sub              strline = rhino addline strlinearr(0 strlinearr(1              isnull(strline exit sub                                        exit distance small add line dbldistance = rhino distance(

In [15]:
tokenized_dataset[0]

'hello @alex_8 look clear explanation   screen shot 2019 02 24 11.15.09.png1260×655 95.3 kb   note need component lunchbox work dot indicate straight planar basis locate topo_v1.gh 34.8 kb'

## Step 4: Analytics
- prepare embeddings
- use [BERTopic](https://colab.research.google.com/corgiredirector?site=https%3A%2F%2Fmaartengr.github.io%2FBERTopic%2Findex.html) by Maarten Grootendorst for analytics on the set

In [None]:
# Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(tokenized_dataset, show_progress_bar=True)
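
Embeddings like these are typically compared by cosine similarity, which is what the imported `cosine_similarity` computes and what the `metric='cosine'` setting refers to. The sketch below is a standard-library version for intuition. The three-dimensional vectors are made up; real `all-MiniLM-L6-v2` vectors have 384 dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Made-up toy vectors, not real embeddings
post_a = [1.0, 2.0, 3.0]
post_b = [2.0, 4.0, 6.0]   # same direction as post_a
post_c = [3.0, 0.0, -1.0]  # orthogonal to post_a

print(cosine(post_a, post_b), cosine(post_a, post_c))
```

Because only direction matters, scaling a vector does not change its similarity to another vector.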

In [17]:
topic_model = BERTopic().fit(tokenized_dataset, embeddings)

In [18]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 39 | -1_vector_vertex_geometry_display | [vector, vertex, geometry, display, value, fil... | [ stevebaer bump minimum thickness is... |
| 1 | 0 | 49 | 0_kb_surface_look_use | [kb, surface, look, use, fillet, offset, large... | [close evaluate surface require uv point xyz p... |
| 2 | 1 | 40 | 1_rhino_rs_object_print | [rhino, rs, object, print, point, import, retu... | [hi @mrhe author tarsier source code attempt s... |
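
In BERTopic, topic `-1` collects the posts the model treated as outliers rather than assigning them to a cluster. As a quick sanity check on the table above, the sketch below computes each topic's share of the 128 posts. The counts are copied from the output, and the variable names are illustrative.

```python
# Counts copied from the get_topic_info() output: topic id -> number of posts
topic_counts = {-1: 39, 0: 49, 1: 40}

total_posts = sum(topic_counts.values())
shares = {topic: count / total_posts for topic, count in topic_counts.items()}
outlier_share = shares[-1]

print(f"total posts: {total_posts}")
for topic, share in shares.items():
    print(f"topic {topic}: {share:.1%}")
```

A large outlier share on a dataset this small suggests the topics should be treated as exploratory.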


In [19]:
topic_model.visualize_barchart(top_n_topics=50,n_words=5)

In [20]:
# Not currently working because the sample size is too small
#fig= topic_model.visualize_topics()
#fig.write_html("topics_LLM.html")
#fig

In [21]:
topic_model.visualize_heatmap(height=1000,width=1000)

In [22]:
# Maarten suggests reducing the dimensionality of the embeddings first; this step is optional, but it makes iterating on the plot much faster:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
fig = topic_model.visualize_documents(tokenized_dataset, reduced_embeddings=reduced_embeddings,height=1200,width=1800)
fig.write_html("document_view.html")
fig