# Commonsense statements cleaning & preprocessing

## Libraries and setup

Run the following cell to import the necessary libraries and set up the environment.


In [1]:
# Data Processing
import pandas as pd
import numpy as np
import os
import openai
import csv


# Modelling
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, ConfusionMatrixDisplay
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint

# Tree Visualisation
from sklearn.tree import export_graphviz
from IPython.display import Image
import graphviz


openai.organization = os.getenv("OPENAI_ORGANIZATION")
openai.api_key = os.getenv("OPENAI_API_KEY")
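
`os.getenv` returns `None` when a variable is unset, so a missing key would only surface later as an authentication error. The following is a minimal sketch of an earlier check. The helper name `require_env` is hypothetical and not part of the original notebook.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`; raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Possible usage (assumes the variable was exported before Jupyter was started):
# openai.api_key = require_env("OPENAI_API_KEY")
```

Failing at setup time makes the cause obvious before the long-running embedding cell is started.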



## Looking into data and preprocessing

We import the cleaned statements and their properties and look into the data. We also preprocess the data to make it ready for the model.

In [14]:
cleaned_statements_df = pd.read_csv('statements.csv')
statement_properties_df = pd.read_csv('statement_properties.csv')
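
One way to look into the loaded frames is a per-column summary. This is a sketch under assumptions: the helper `summarize` is hypothetical, and a toy frame stands in for `cleaned_statements_df` so the cell runs without the CSV files.

```python
import pandas as pd

def summarize(df):
    """Return a per-column summary: dtype, missing-value count, number of unique values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a value
    })

# Toy frame standing in for cleaned_statements_df
toy = pd.DataFrame({"statement": ["a ball is round", "1 plus 1 is 2", None]})
print(summarize(toy))
```

The same call on `cleaned_statements_df` or `statement_properties_df` would show which columns contain missing values before any embeddings are requested.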

## Getting the embeddings for the fixed statements via OpenAI API

Run the first cell to get the embeddings from the OpenAI API. This will take a while (roughly 20 minutes). The embeddings will be saved to `embedded_statements.pkl`.

In [None]:
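def get_embedding(text, model="text-embedding-ada-002"):
    # Replace newlines with spaces before sending the text to the API
    text = text.replace("\n", " ")
    # One API request per statement; the response holds a single embedding
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']


# Embed every fixed statement (get_embedding defaults to text-embedding-ada-002)
cleaned_statements_df['embeddings'] = cleaned_statements_df['fixed statement'].apply(get_embedding)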
cleaned_statements_df.to_pickle('embedded_statements.pkl')
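
The cell above sends one request per statement, which is likely a large part of the running time. A possible alternative is to send several statements per request. The sketch below shows only the batching logic. `embed_in_batches` and `fake_embed` are hypothetical names, and `embed_fn` is an assumed callable that takes a list of strings and returns one embedding per string in the same order.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed `texts` by calling `embed_fn` on chunks of at most `batch_size` items."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        # Same newline cleanup as get_embedding
        batch = [t.replace("\n", " ") for t in texts[start:start + batch_size]]
        result = embed_fn(batch)
        if len(result) != len(batch):
            raise ValueError("embed_fn returned a different number of embeddings than inputs")
        embeddings.extend(result)
    return embeddings

# Stand-in for the API call so the sketch runs offline: "embeds" a text as [its length]
def fake_embed(batch):
    return [[float(len(t))] for t in batch]

vectors = embed_in_batches(["a ball is round", "1 plus 1\nis 2"], fake_embed, batch_size=1)
```

With the pre-1.0 `openai` package used in this notebook, `embed_fn` could wrap `openai.Embedding.create(input=batch, model=...)` and collect the `'embedding'` field of each item in `'data'`. This assumes the items come back in input order. Pass a plain list, for example `cleaned_statements_df['fixed statement'].tolist()`.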

In [38]:
cleaned_statements_df.to_pickle('embedded_statements.pkl')

In [22]:
embedded_statements = pd.read_pickle('embedded_statements.pkl')

In [25]:
merged_df = statement_properties_df.merge(embedded_statements, left_index=True, right_index=True)
merged_df.head(5)

Unnamed: 0,statement_number,statement_x,behavior,everyday,figure_of_speech,judgment,opinion,reasoning,category,elicitation,statement_y,fixed statement,embeddings
0,1,1 plus 1 is 2,0,1,0,0,0,1,Mathematics and logic,category response,1 plus 1 is 2,1 plus 1 equals 2.,"[0.030699048191308975, -0.004340122453868389, ..."
1,2,5 is alot bigger than 1,0,0,0,0,0,0,Mathematics and logic,category response,5 is alot bigger than 1,5 is significantly larger than 1.,"[-9.334934293292463e-05, 0.013633872382342815,..."
2,3,a balanced diet and regular exercise is needed...,1,1,0,1,0,1,Health and fitness,category response,a balanced diet and regular exercise is needed...,"To maintain good health, one needs a balanced ...","[0.011176474392414093, 0.004732023924589157, 0..."
3,4,a ball is round,0,1,0,0,0,0,Natural and physical sciences,Concept Net,a ball is round,A ball is round.,"[-0.004082511644810438, -6.48864806862548e-05,..."
4,5,a baton twirler doesn't want a broken finger,0,1,0,1,1,0,Human activities,Concept Net,a baton twirler doesn't want a broken finger,A baton twirler wouldn't want to suffer a brok...,"[-0.02298557199537754, 0.006573873572051525, 0..."
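
The merge above joins the two frames on their row index, so it relies on both files listing the statements in the same order. The `statement_x` and `statement_y` columns in the output are the two copies of the overlapping `statement` column. A small sketch with toy data (not the real files) shows how the suffixes arise and how the alignment could be checked:

```python
import pandas as pd

# Toy stand-ins for statement_properties_df and embedded_statements
props = pd.DataFrame({"statement": ["a ball is round", "1 plus 1 is 2"],
                      "everyday": [1, 1]})
embedded = pd.DataFrame({"statement": ["a ball is round", "1 plus 1 is 2"],
                         "fixed statement": ["A ball is round.", "1 plus 1 equals 2."]})

merged = props.merge(embedded, left_index=True, right_index=True)

# Overlapping column names get the default suffixes "_x" (left) and "_y" (right).
# If the rows line up, both copies of the statement agree on every row.
mismatched = merged[merged["statement_x"] != merged["statement_y"]]
print(list(merged.columns))
print(len(mismatched))
```

A non-zero mismatch count on the real data would indicate that the index join paired properties with the wrong statements.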


In [35]:
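# The embedding is stored as a string such as "[0.03, -0.004, ...]"; split it into its components
arr = merged_df.embeddings[0].strip('[]').split(',')

# Convert each component to float so the result is a numeric array
np.array([float(item) for item in arr])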

array([ 0.03069905, -0.00434012, -0.00222734, ..., -0.00375783,
       -0.00634472, -0.04861955])
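
The cell above converts a single embedding. To feed a model, the whole column would need the same conversion. The sketch below assumes the embeddings are stored as JSON-style strings, as the `strip`/`split` parsing suggests, but it also accepts real lists. The helper name `embeddings_to_matrix` is hypothetical.

```python
import json
import numpy as np
import pandas as pd

def embeddings_to_matrix(column):
    """Stack a column of embeddings into a 2-D float array, one row per statement.

    Accepts string-encoded lists such as "[0.1, -0.2]" as well as actual lists.
    """
    rows = [json.loads(v) if isinstance(v, str) else v for v in column]
    return np.array(rows, dtype=float)

# Toy column standing in for merged_df['embeddings']
toy = pd.Series(["[0.5, -0.25]", "[1.0, 2.0]"])
X = embeddings_to_matrix(toy)
print(X.shape)
```

The resulting matrix has one row per statement and one column per embedding dimension, the layout scikit-learn estimators such as `RandomForestClassifier` expect for their features.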