<a href="https://colab.research.google.com/github/community-bytes/c-bytes/blob/main/GChomp_Analytics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GCHOMP Data Processor
This notebook is part of the G-Chomp project for the [Chicago Innovate Hackathon 2023](https://www.chicagoinnovate.tech/hackathon).

For a general description of the project, visit the project [GitHub repo](https://github.com/community-bytes/c-bytes).

This stage of the project takes a dataset of posts from the McNeel Forum as CSV input, parses and cleans the post content, and runs topic analytics on the result.

Most of this notebook is derived from examples shared during [Becoming AI Power-Users: A Hands-on Workshop on Machine Learning and Generative AI](https://www.chicagoinnovate.tech/courses-1/becoming-ai-power-users%3A-a-hands-on-workshop-on-machine-learning-and-generative-ai) by Seyedomid Sajedi of Thornton Tomasetti during Chicago Innovate 2023.

## Step 1: Dependencies
Install the dependencies for computing embeddings and pre-processing the text from the CSV.

In [1]:
%%capture
!pip install tiktoken
!pip install bertopic datasets accelerate bitsandbytes xformers adjustText
!pip install feedparser

In [2]:
# https://spacy.io/
import spacy
# !pip install --upgrade spacy # might be needed if the default spacy in colab is not working
import requests
from io import BytesIO
#from PyPDF2 import PdfReader
from tqdm import tqdm
import tiktoken
import pickle

from bertopic import BERTopic
#from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

import pandas as pd

## Step 2: Setup Processor Models
The models are set up before the CSV is parsed, which makes it easier to swap in a different dataset and rerun the notebook.

In [30]:
# Setup model to remove 'stop' words - words of high frequency - for topical analysis
nlp_model = spacy.load("en_core_web_sm")
# Get the list of stop words
stop_words = list(nlp_model.Defaults.stop_words)
# print(len(stop_words))
# print(stop_words)

In [31]:
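# Append Rhino stop words.
# Note: spacy_tokenizer filters on token.is_stop, which does not read this list,
# so these extra words are not removed unless they are filtered separately.
stop_words.extend(["Rhino", "rhino", "grasshopper"])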

spaCy tokenizer that drops punctuation and stop words from the input text, then returns the lowercased lemma of each remaining token.

In [32]:
def spacy_tokenizer(text, nlp):
    doc = nlp(text)
    tokens = []
    for token in doc:
        # Check if the token is not punctuation and not a stop word
        if not (token.is_punct or token.is_stop):
            tokens.append(token.lemma_.lower())
    return tokens
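
One caveat with the cells above: `token.is_stop` reads a flag stored in spaCy's vocabulary, so it does not consult the separate `stop_words` list that was extended with the Rhino terms. A minimal sketch of one way to apply those extra terms is to filter the tokenizer's output afterwards. The helper name `drop_extra_stop_words` and the sample tokens are hypothetical and not part of the notebook.

```python
def drop_extra_stop_words(tokens, extra_stop_words):
    """Return tokens with any case-insensitive match in extra_stop_words removed."""
    blocked = {word.lower() for word in extra_stop_words}
    return [token for token in tokens if token.lower() not in blocked]


# Hypothetical tokens, shaped like the output of spacy_tokenizer
sample_tokens = ["rhino", "curve", "grasshopper", "loft", "surface"]
filtered = drop_extra_stop_words(sample_tokens, ["Rhino", "rhino", "grasshopper"])
print(filtered)
```

Filtering after lemmatization keeps the spaCy model untouched, at the cost of one extra pass over each token list.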

## Step 3: Parsing the dataset
- Load the dataset from our [GitHub repo](https://github.com/community-bytes/c-bytes/tree/main/helpers/csv) (make sure to use the link to the raw file), or
- upload it to the notebook under the name "output.csv" and uncomment the local-file code below.
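
As a rough illustration of what "the link to the raw file" means, the sketch below converts a regular GitHub page URL into its `raw.githubusercontent.com` equivalent. The function name `github_blob_to_raw` is hypothetical, and the page URL is simply derived from the raw URL this notebook already uses. It assumes the branch name contains no slashes.

```python
from urllib.parse import urlparse


def github_blob_to_raw(url):
    """Convert a github.com 'blob' page URL to its raw.githubusercontent.com form."""
    parsed = urlparse(url)
    parts = parsed.path.strip("/").split("/")
    # Expected path shape: owner/repo/blob/branch/path/to/file
    if parsed.netloc != "github.com" or len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"Not a GitHub blob URL: {url}")
    owner, repo, _, *rest = parts  # rest = [branch, path, to, file]
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


page_url = "https://github.com/community-bytes/c-bytes/blob/main/helpers/csv/updated_output.csv"
raw_url = github_blob_to_raw(page_url)
print(raw_url)
```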

In [33]:
# Retrieve dataset from github repo
import requests
from io import StringIO

# Define the GitHub repository URL and the path to the CSV file
github_repo_url = "https://raw.githubusercontent.com/community-bytes/c-bytes/"
csv_file_path = "main/helpers/csv/updated_output.csv"

# Combine the URL and file path to get the raw file URL
raw_csv_url = github_repo_url + csv_file_path

# Send a GET request to the raw CSV file URL
response = requests.get(raw_csv_url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the response body into a DataFrame
    df = pd.read_csv(StringIO(response.text))
    print("CSV file loaded into a DataFrame.")
    # check contents of csv
    # text_values = df['text']
    # print(text_values)
else:
    print("Failed to retrieve CSV file. Status code:", response.status_code)


CSV file loaded into a DataFrame.


In [34]:
# Uncomment this if using local csv file
# df = pd.read_csv('output.csv')

In [35]:
# Check contents of CSV
text_values = df['text']
print(text_values)

0      <p>Because projection is just additional step,...
1                                  <p>Great, thanks.</p>
2      <aside class="quote quote-modified" data-post=...
3      <p>HI Pascal,<br>\nsuper fast reply as usual, ...
4      <p>BTW: Added some proof (But can The Lord be ...
                             ...                        
810    <p>Hi <a class="mention" href="/u/siemen">@sie...
811    <aside class="quote no-group" data-post="3" da...
812    <p>You can certainly code it up in Grasshopper...
813    <p>I solved the problem, I cant do DataTree&gt...
814    <p><a href="https://mcneel.myjetbrains.com/you...
Name: text, Length: 815, dtype: object


In [36]:
# Strip HTML tags, then replace newline characters with spaces
df['text'] = df['text'].str.replace(r'<[^<>]*>', '', regex=True)
df['text'] = df['text'].str.replace(r'\n', ' ', regex=True)
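
To make the effect of the two substitutions concrete, here is a small standalone sketch that applies the same patterns to a single made-up post using only the standard library. The final `html.unescape` step is an addition that the notebook does not perform; it is included because the raw posts contain entities such as `&gt;`. The function name `clean_post` and the sample string are hypothetical.

```python
import html
import re

TAG_PATTERN = re.compile(r"<[^<>]*>")  # same tag pattern as the pandas call


def clean_post(raw_html):
    """Strip HTML tags, flatten newlines, then decode HTML entities."""
    text = TAG_PATTERN.sub("", raw_html)
    text = text.replace("\n", " ")
    # Decode entities last, so that text like &lt;T&gt; is not mistaken for a tag
    return html.unescape(text)


sample = '<p>Hi <a class="mention" href="/u/x">@x</a>,<br>\nI cant do DataTree&lt;T&gt;</p>'
cleaned = clean_post(sample)
print(cleaned)
```

The order matters here: decoding entities before stripping tags would turn `&lt;T&gt;` into `<T>`, which the tag pattern would then delete.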

In [37]:
# Remove @ referenced names
# df['text'] = df['text'].str.replace(r'@\w+\s*', '', regex=True)

In [38]:
# text_values = df['text']
# print(text_values)

### Tokenize the posts and remove stop words

In [39]:
# Tokenize each post into a list of lemmas (one list per post).
# Note: the next cell rebuilds both lists, so this result is overwritten.
tokenized_dataset = []
text_dataset = []
for line in df['text']:
  text_dataset.append(line)
  tokenized_dataset.append(spacy_tokenizer(line, nlp_model))

In [40]:
# Tokenize each post, then rejoin its tokens into a single space-separated string
tokenized_dataset = []
text_dataset = []
for line in df['text']:
  text_dataset.append(line)
  tokens = spacy_tokenizer(line, nlp_model)
  tokenized_dataset.append(" ".join(tokens))

In [41]:
print(tokenized_dataset)



In [42]:
tokenized_dataset[0]

'projection additional step planarization side guarantee planarity depend planarize geometry edge line cut plane cut point connect polyline flat sit plane define plane well connection surface side depend straight forward averaging point optimization method average plane adjacent plane planarizemesh.gh 23.3 kb   projection986×570 12.4 kb    grid1352×739 82.6 kb    grid21352×739 131 kb    canvas%20at%2007%3b40%3b001284×464 68 kb'
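
The tail of this sample still carries attachment residue from the forum markup (file sizes such as `23.3 kb` and image dimensions such as `986×570`). A possible extra cleaning step, sketched below under the assumption that such residue always follows these two shapes, would strip it before tokenization. The function name `strip_attachment_residue` and both patterns are hypothetical, and the outputs recorded in this notebook were produced without this step.

```python
import re

DIMENSIONS = re.compile(r"\d+×\d+")  # e.g. 986×570
FILE_SIZE = re.compile(r"\b\d+(?:\.\d+)?\s*(?:kb|mb)\b", re.IGNORECASE)  # e.g. 23.3 kb


def strip_attachment_residue(text):
    """Remove image dimensions and file sizes, then collapse repeated whitespace."""
    text = DIMENSIONS.sub(" ", text)
    text = FILE_SIZE.sub(" ", text)
    return " ".join(text.split())


sample = "planarizemesh.gh 23.3 kb   projection986×570 12.4 kb"
stripped = strip_attachment_residue(sample)
print(stripped)
```

Note that the dimension pattern also swallows digits that belong to a file name when they sit directly in front of the dimensions, so the patterns would need checking against the real data.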

## Step 4: Analytics
- Prepare embeddings
- Use [BERTopic](https://maartengr.github.io/BERTopic/index.html) by Maarten Grootendorst to analyse the dataset

In [43]:
# Prepare embeddings
sentence_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = sentence_model.encode(tokenized_dataset, show_progress_bar=True)


In [44]:
topic_model = BERTopic().fit(tokenized_dataset, embeddings)

In [45]:
topic_model.get_topic_info()

| | Topic | Count | Name | Representation | Representative_Docs |
|---|---|---|---|---|---|
| 0 | -1 | 10 | -1_duplicate_light_edge_block | [duplicate, light, edge, block, fair, material... | [work edge duplicate method, oh sorry us... |
| 1 | 0 | 785 | 0_kb_rhino_point_use | [kb, rhino, point, use, rs, curve, surface, wo... | [little late quality poor start think add fill... |
| 2 | 1 | 20 | 1_thank_lot_great_perfect | [thank, lot, great, perfect, work, regard, yes... | [yes work thank, thank help lot, great thank] |


In [53]:
topic_model.visualize_barchart(top_n_topics=50,n_words=5)

In [47]:
# not currently working b/c sample size is too small
#fig= topic_model.visualize_topics()
#fig.write_html("topics_LLM.html")
#fig

In [48]:
topic_model.visualize_heatmap(height=1000,width=1000)

In [49]:
# Maarten Grootendorst suggests reducing the dimensionality of the embeddings first; this step is optional, but it makes iterating on the plot much faster:
reduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)
fig = topic_model.visualize_documents(tokenized_dataset, reduced_embeddings=reduced_embeddings,height=1200,width=1800)
fig.write_html("document_view.html")
fig

## Step 5: Parse Extended Data
- Grasshopper file data is extracted from the .gh file linked to a post on a topic.
- The data is generated by crawling the file in its XML (.ghx) form.
## Step 6: Analyse Grasshopper file data

## Step 7: Correlate topical data with Grasshopper file data