Replication - Team

- snapshot_20230727
- 100 conversations were used to build the categories
- This resulted in 22 categories
- 95% confidence level and 5% margin of error
- Any conversation that is unavailable or not in English is discarded (N/A)

Importing the JSON dataset

In [20]:
import re

import pandas as pd
import json

discussion_json_path = './20230727_195954_discussion_sharings.json'
hn_sharings_json_path = './20230727_195816_hn_sharings.json'

with open(discussion_json_path, "r", encoding="utf-8") as f:
    discussion_data = json.load(f)

with open(hn_sharings_json_path, "r", encoding="utf-8") as f:
    hn_sharings_data = json.load(f)

discussion_sources = discussion_data.get("Sources", [])
hn_sources = hn_sharings_data.get("Sources", [])

sources = discussion_sources + hn_sources
rows = []

The loop below walks through the JSON sources and their ChatGPT conversations, keeps only the prompts written by users, and turns each prompt into a structured row with metadata for analysis in a DataFrame.


In [21]:
for src in sources:
    # "ChatgptSharing" may be missing or null, so fall back to an empty list.
    chat_sharings = src.get("ChatgptSharing", []) or []
    for share in chat_sharings:
        conversations = share.get("Conversations", []) or []
        for idx, turn in enumerate(conversations):
            # Keep only turns that carry a user prompt.
            prompt = turn.get("Prompt")
            if prompt is None:
                continue

            rows.append({
                "source_type": src.get("Type"),
                "source_url": src.get("URL"),
                "source_title": src.get("Title"),
                "chat_url": share.get("URL"),
                "chat_title": share.get("Title"),
                "date_of_conversation": share.get("DateOfConversation"),
                "status": share.get("Status"),
                "prompt_index": idx,
                "prompt": prompt
            })

df = pd.DataFrame(rows)
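
To make the flattening step easier to check, here is a minimal sketch that runs the same loop on a tiny in-memory structure. The sample data, the `Answer` key, the example URLs, and the helper name `flatten_prompts` are all hypothetical. The shape of an unavailable share (a 404 status with no `Conversations` key) is an assumption, not something confirmed by the snapshot.

```python
# Hypothetical miniature of the snapshot structure; key names follow the loop above.
sample_sources = [
    {
        "Type": "discussion",
        "URL": "https://example.com/discussion/1",
        "Title": "Example thread",
        "ChatgptSharing": [
            {
                "URL": "https://example.com/share/a",
                "Title": "Available chat",
                "Status": 200,
                "DateOfConversation": "July 11, 2023",
                "Conversations": [
                    {"Prompt": "How do I parse JSON?", "Answer": "..."},
                    {"Prompt": None, "Answer": "..."},
                    {"Prompt": "And write it back?", "Answer": "..."},
                ],
            },
            # Assumed shape of an unavailable share: no "Conversations" key.
            {"URL": "https://example.com/share/b", "Status": 404},
        ],
    },
    # A source whose "ChatgptSharing" value is null.
    {"Type": "hn", "URL": "https://example.com/hn/2", "ChatgptSharing": None},
]


def flatten_prompts(sources):
    """Return one dict per non-null user prompt."""
    rows = []
    for src in sources:
        for share in src.get("ChatgptSharing") or []:
            for idx, turn in enumerate(share.get("Conversations") or []):
                prompt = turn.get("Prompt")
                if prompt is None:
                    continue
                rows.append({
                    "source_type": src.get("Type"),
                    "chat_url": share.get("URL"),
                    "status": share.get("Status"),
                    "prompt_index": idx,
                    "prompt": prompt,
                })
    return rows


demo_rows = flatten_prompts(sample_sources)
for row in demo_rows:
    print(row["prompt_index"], row["status"], row["prompt"])
```

Under these assumptions, a share without conversations contributes no rows, so it drops out without an explicit check on `status`. If unavailable shares in the real snapshot did carry conversations, a separate filter on `status` would be needed to apply the availability criterion.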


Text cleaning

In [22]:
if not df.empty:
    df["prompt"] = (
        df["prompt"]
        .astype(str)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )
    df = df[df["prompt"].str.len() > 0].reset_index(drop=True)

print(f"Total prompts extracted: {len(df)}")
df.head()

Total prompts extracted: 812


,source_type,source_url,source_title,chat_url,chat_title,date_of_conversation,status,prompt_index,prompt
0,discussion,https://github.com/orgs/deep-foundation/discus...,Should we worry about imports perfomance in ha...,https://chat.openai.com/share/1e0f86ff-2094-44...,await import vs plain import,"July 11, 2023",200,0,Can I always use await import instead of plain...
1,discussion,https://github.com/orgs/deep-foundation/discus...,Should we worry about imports perfomance in ha...,https://chat.openai.com/share/1e0f86ff-2094-44...,await import vs plain import,"July 11, 2023",200,0,Can I always use await import instead of plain...
2,discussion,https://github.com/JushBJJ/Mr.-Ranedeer-AI-Tut...,v2.7 - Code Interpreter Exclusive,https://chat.openai.com/share/53b38605-68e5-49...,Mr. Ranedeer v2.7 (CODE INTERPRETER ONLY),"July 15, 2023",200,0,"=== Author: JushBJJ Name: ""Mr. Ranedeer"" Versi..."
3,discussion,https://github.com/JushBJJ/Mr.-Ranedeer-AI-Tut...,v2.7 - Code Interpreter Exclusive,https://chat.openai.com/share/bb0d35d9-0239-49...,Mr. Ranedeer's Wizard,"July 14, 2023",200,0,"[Personalization Options] Language: [""English""..."
4,discussion,https://github.com/dtch1997/gpt-text-gym/discu...,GPT decomposing missions using functions,https://chat.openai.com/share/1ee48447-8296-4a...,Gridworld Agent: Rules & Objects,"June 26, 2023",200,0,You are an agent in a gridworld. The environme...
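
As a rough illustration of what the cleaning step does to a single prompt, the sketch below applies the same `\s+` collapse and trim using only the standard library. The helper name `normalize_whitespace` and the example strings are hypothetical.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", str(text)).strip()


examples = ["  def f():\n\treturn 1  ", "line one\r\n\r\nline two", " \n\t "]
cleaned = [normalize_whitespace(t) for t in examples]

# Prompts that become empty after cleaning are dropped, as in the cell above.
kept = [t for t in cleaned if len(t) > 0]
print(kept)
```

Newlines and indentation are lost in the process, which is worth keeping in mind when a prompt contains pasted code.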


In [34]:
"""Filters"""


# Flags prompts that are not in English.
# Heuristic: a prompt counts as English only if it is pure ASCII. This also
# rejects English prompts containing non-ASCII characters (curly quotes, emoji)
# and accepts non-English prompts written without accents.
def is_english(text):
    if not isinstance(text, str):
        return False
    return bool(re.match(r"^[\x00-\x7F]+$", text))


# Apply the filter to the DataFrame.
df["is_english"] = df["prompt"].apply(is_english)

df_lang = df[df["is_english"]].reset_index(drop=True)

print(f"After language filter (English): {len(df_lang)}")

# Final check: keep only non-null prompts longer than 5 characters.
df = df_lang[
    df_lang["prompt"].notna() & (df_lang["prompt"].str.len() > 5)
].reset_index(drop=True)

# 1. Remove duplicate prompts before sampling
df = df.drop_duplicates(subset=["prompt"]).reset_index(drop=True)

print(f"\nAfter removing duplicates: {len(df)}")

print(f"Final dataset after all filters: {len(df)}")

df.head()

After language filter (English): 715

After removing duplicates: 626
Final dataset after all filters: 626


,source_type,source_url,source_title,chat_url,chat_title,date_of_conversation,status,prompt_index,prompt,is_english
0,discussion,https://github.com/orgs/deep-foundation/discus...,Should we worry about imports perfomance in ha...,https://chat.openai.com/share/1e0f86ff-2094-44...,await import vs plain import,"July 11, 2023",200,0,Can I always use await import instead of plain...,True
1,discussion,https://github.com/dtch1997/gpt-text-gym/discu...,GPT decomposing missions using functions,https://chat.openai.com/share/1ee48447-8296-4a...,Gridworld Agent: Rules & Objects,"June 26, 2023",200,0,You are an agent in a gridworld. The environme...,True
2,discussion,https://github.com/dtch1997/gpt-text-gym/discu...,GPT decomposing missions using functions,https://chat.openai.com/share/1ee48447-8296-4a...,Gridworld Agent: Rules & Objects,"June 26, 2023",200,1,Here is some code designed to parse the enviro...,True
3,discussion,https://github.com/dtch1997/gpt-text-gym/discu...,GPT decomposing missions using functions,https://chat.openai.com/share/1ee48447-8296-4a...,Gridworld Agent: Rules & Objects,"June 26, 2023",200,2,Consider the following environment and mission...,True
4,discussion,https://github.com/dave1010/pandora/discussions/6,Demo and examples thread. Share what you've do...,https://chat.openai.com/share/77168701-1f4e-41...,Pandora demo: Conway Game in JS,"July 2, 2023",200,0,i want to make a cool conway game of life demo...,True
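
The language filter above is an ASCII check rather than true language detection. The following sketch, with hypothetical example sentences and a hypothetical helper name `is_ascii_only`, shows where that heuristic agrees with the intent and where it does not.

```python
import re


def is_ascii_only(text):
    """Same test as is_english above: True only for non-empty, pure-ASCII strings."""
    return isinstance(text, str) and bool(re.match(r"^[\x00-\x7F]+$", text))


cases = [
    "How do I reverse a list in Python?",   # English, plain ASCII
    "Why doesn\u2019t this compile?",       # English with a curly apostrophe
    "Como inverter uma lista em Python?",   # Portuguese without accents
    "\u00bfC\u00f3mo invierto una lista?",  # Spanish with accented characters
]

results = {text: is_ascii_only(text) for text in cases}
for text, flag in results.items():
    print(flag, text)
```

The second case is English but is rejected, and the third is not English but is accepted, so the counts after this filter should be read as an approximation of "English only".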


Randomly select 100 prompts (one per row) for the exploratory phase.

In [24]:
trial_sample = df.sample(n=100, random_state=42)
trial_sample.to_csv("trial_phase.csv", index=False)


Statistical sample of 321 prompts

In [25]:
remaining = df.drop(trial_sample.index)
coding_sample = remaining.sample(n=321, random_state=42)

coding_sample.to_csv("coding_phase.csv", index=False)
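
Dropping the trial rows before drawing the second sample is what keeps the two phases disjoint. A minimal standard-library sketch of the same idea follows, with integers standing in for row labels. It does not use the pandas random generator, so it will not reproduce the rows selected above.

```python
import random

population = list(range(626))  # stand-in for the 626 row labels in df
rng = random.Random(42)

# Phase 1: exploratory sample.
trial_ids = rng.sample(population, 100)
trial_set = set(trial_ids)

# Phase 2: remove the phase-1 rows first, then sample from what is left.
remaining_ids = [i for i in population if i not in trial_set]
coding_ids = rng.sample(remaining_ids, 321)

overlap = trial_set & set(coding_ids)
print(len(trial_ids), len(remaining_ids), len(coding_ids), len(overlap))
```

Because the second draw only sees the remaining labels, the overlap is empty by construction, regardless of the seed.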


In [35]:
coding_sample["annotator_A"] = ""
coding_sample["annotator_B"] = ""

coding_sample[[
    "chat_url",
    "prompt",
    "annotator_A",
    "annotator_B"
]].to_csv("annotation_sheet.csv", index=False)


In [None]:
# ============================================================
# STATISTICAL SAMPLING
# Following the article's methodology: 95% confidence, 5% margin of error
# ============================================================

import math


# 2. Sample size calculation: Cochran's formula with finite population correction
N = len(df)             # population size
z = 1.96                # z-score for a 95% confidence level
e = 0.05                # 5% margin of error
p = 0.5                 # expected proportion (maximum variance)

n0 = (z**2 * p * (1 - p)) / (e**2)           # initial sample size (Cochran)
n = math.ceil(n0 / (1 + (n0 - 1) / N))       # finite population correction

print(f"\n{'='*45}")
print("  STATISTICAL SAMPLING")
print(f"{'='*45}")
print(f"  Population (N):             {N}")
print(f"  Confidence level:           95% (z = {z})")
print(f"  Margin of error:            5%  (e = {e})")
print(f"  Expected proportion (p):    {p}")
print(f"  Initial sample (n₀):        {n0:.0f}")
print(f"  Final sample size (n):      {n}")
print(f"{'='*45}")

# 3. Draw a reproducible random sample
# Note: this draws from the full df, so it can overlap with trial_sample.
amostra = df.sample(n=n, random_state=42).reset_index(drop=True)
print(f"\n  Sample selected: {len(amostra)} prompts")

# 4. Export the sample
amostra.to_csv("devgpt_amostra.csv", index=False)
print("  File exported:     devgpt_amostra.csv")
print(f"\nNext step: categorize the {len(amostra)} prompts in the sample.")
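
For reference, the same calculation can be wrapped in a small function and evaluated for specific population sizes. This is a sketch only. The function name `cochran_sample_size` and its parameter names are hypothetical, and the defaults mirror the values used in the cell above.

```python
import math


def cochran_sample_size(population, z=1.96, margin=0.05, proportion=0.5):
    """Cochran's sample size with finite population correction, rounded up."""
    n0 = (z ** 2) * proportion * (1 - proportion) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))


for population in (626, 1940):
    print(population, cochran_sample_size(population))
```

With the 626 prompts left after filtering, the formula gives 239. The hard-coded `n=321` used for `coding_sample` matches the formula only for a population of about 1,940, so the two sample sizes in this notebook rest on different population sizes.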


Cohen's Kappa

In [None]:
from sklearn.metrics import cohen_kappa_score

# Load the sheet after both annotators have filled in their labels.
annotated = pd.read_csv("annotation_sheet_filled.csv")

kappa = cohen_kappa_score(
    annotated["annotator_A"],
    annotated["annotator_B"]
)

print("Cohen's Kappa:", kappa)
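
To show what the score measures, here is a minimal pure-Python sketch of unweighted Cohen's kappa, computed as (observed agreement - chance agreement) / (1 - chance agreement). The function name `cohen_kappa` and the category labels are hypothetical, and this is not the scikit-learn implementation.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: share of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if expected == 1:
        return float("nan")  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


annotator_a = ["code_gen", "code_gen", "bug_fix", "bug_fix"]
annotator_b = ["code_gen", "bug_fix", "bug_fix", "bug_fix"]
demo_kappa = cohen_kappa(annotator_a, annotator_b)
print(demo_kappa)
```

In this toy case the annotators agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore 0.5.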
