# Data exploration

We start by installing and importing the packages needed for data exploration.

In [7]:
!pip install pandas



In [1]:
import os, glob
import json
import pandas as pd

In [4]:
devgpt_location = "./DevGPT/"
github_devgpt_snapshot_folders = [path for path,_,_ in os.walk(devgpt_location) if "snapshot" in path]
commits_json_filepaths = []
issues_json_filepaths = []
discussion_json_filepaths = []
pull_requests_json_filepaths = []
code_json_filepaths = []
hackernews_json_filepaths = []
for snapshot_folder in github_devgpt_snapshot_folders:
    json_file_paths = glob.glob(f'{snapshot_folder}/*.json')
    for json_file_path in json_file_paths:
        if "commit" in json_file_path:
            commits_json_filepaths.append(json_file_path)
        elif "issue" in json_file_path:
            issues_json_filepaths.append(json_file_path)
        elif "discussion" in json_file_path:
            discussion_json_filepaths.append(json_file_path)
        elif "pr" in json_file_path:
            pull_requests_json_filepaths.append(json_file_path)
        elif "file" in json_file_path:
            code_json_filepaths.append(json_file_path)
        elif "hn" in json_file_path:
            hackernews_json_filepaths.append(json_file_path)
        else:
            raise Exception(f"JSON file '{json_file_path}' was not recognised!")

""" Uncomment to print out all the file paths!
print(commits_json_filepaths)
print(issues_json_filepaths)
print(discussion_json_filepaths)
print(pull_requests_json_filepaths)
print(code_json_filepaths)
print(hackernews_json_filepaths)
"""

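The loop above tests each keyword against the whole path, so a short keyword such as "pr" or "hn" could also match a directory name. A minimal sketch of a stricter variant follows, which compares keywords only against the underscore-separated parts of the file name. The `SOURCE_KEYWORDS` mapping and the `classify_snapshot_file` helper are hypothetical names, and the `<date>_<time>_<keyword>_sharings.json` layout is an assumption inferred from the two file names that appear in the outputs of this notebook.

```python
import os

# Hypothetical mapping from a file-name keyword to a source type; the keywords
# mirror the substrings tested in the loop above.
SOURCE_KEYWORDS = {
    "commit": "commits",
    "issue": "issues",
    "discussion": "discussions",
    "pr": "pull_requests",
    "file": "code",
    "hn": "hackernews",
}


def classify_snapshot_file(json_file_path):
    """Return the source type encoded in the file name, e.g. 'issues'."""
    # Normalise Windows separators so that basename works on any platform.
    file_name = os.path.basename(json_file_path.replace("\\", "/"))
    # Assumed name layout: <date>_<time>_<keyword>_sharings.json
    name_parts = file_name.split("_")
    for keyword, source_type in SOURCE_KEYWORDS.items():
        if keyword in name_parts:
            return source_type
    raise ValueError(f"JSON file '{json_file_path}' was not recognised!")


print(classify_snapshot_file("./DevGPT/snapshot_20230727/20230727_195941_issue_sharings.json"))
```

Because the keyword must equal a whole name part, a folder such as `snapshot_...` or an unrelated file that merely contains the letters "pr" is rejected instead of being silently assigned to a list.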

We first look closely at what data the files contain. In particular, we check whether each newer snapshot is cumulative, repeating the records of older snapshots and appending new ones, or whether it contains only records created after the previous snapshot was compiled.
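One way to answer this question is to compare the keys of two consecutive snapshots. The sketch below does so on toy data; the `compare_snapshots` helper and the example URLs are made up for illustration, and using `URL` as the key is an assumption about what identifies a source.

```python
import pandas as pd

# Toy stand-ins for two consecutive snapshots; the URLs are made up.
older_snapshot = pd.DataFrame({"URL": ["https://example.com/1", "https://example.com/2"]})
newer_snapshot = pd.DataFrame({"URL": ["https://example.com/2", "https://example.com/3"]})


def compare_snapshots(older, newer, key="URL"):
    """Count the keys that two snapshots share and the ones unique to each."""
    older_keys = set(older[key])
    newer_keys = set(newer[key])
    return {
        "shared": len(older_keys & newer_keys),
        "only_in_older": len(older_keys - newer_keys),
        "only_in_newer": len(newer_keys - older_keys),
    }


overlap = compare_snapshots(older_snapshot, newer_snapshot)
print(overlap)
```

If `only_in_older` is zero for every pair of consecutive snapshots, the newer snapshot is cumulative. If `shared` is zero, each snapshot holds only new records.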

In [62]:
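def read_json_data_from_files_to_dataframe(filepaths_list, subset_for_duplicates=None):
    """Read the 'Sources' records of every JSON file into a single DataFrame.

    If subset_for_duplicates names a column, only the last row per value is kept.
    """
    dataframes = []
    for json_filepath in filepaths_list:
        with open(json_filepath, 'r', encoding='utf-8') as file:
            # Load JSON data from file
            json_data = json.load(file)
        # One row per entry in 'Sources' (the message below says "commits" for every source type)
        new_dataframe = pd.json_normalize(json_data, record_path='Sources')
        print(f"Data from '{json_filepath}' file contains {len(new_dataframe)} commits.")
        # Keep only the first ChatGPT sharing of each source
        new_dataframe["ChatgptSharing"] = new_dataframe.ChatgptSharing.str[0]
        dataframes.append(new_dataframe)
    # Concatenate once at the end; pd.concat raises on an empty list
    file_df = pd.concat(dataframes) if dataframes else pd.DataFrame()
    print(f"When all the dataframes were concatenated/appended, we have total of {len(file_df)} rows in the DF.")
    if subset_for_duplicates is not None:
        file_df = file_df.drop_duplicates(subset=[subset_for_duplicates], keep='last')  # Keep newest
    print(f"After removing the duplicated based on SHA of the source, we have total of {len(file_df)} rows in the DF.")
    return file_df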

In [68]:
commits_dataframe = read_json_data_from_files_to_dataframe([commits_json_filepaths[0]])
commits_dataframe.ChatgptSharing

Data from './DevGPT/snapshot_20230727\20230727_200003_commit_sharings.json' file contains 179 commits.
When all the dataframes were concatenated/appended, we have total of 179 rows in the DF.
After removing the duplicated based on SHA of the source, we have total of 179 rows in the DF.


0      {'URL': 'https://chat.openai.com/share/c89e041...
1      {'URL': 'https://chat.openai.com/share/76af53f...
2      {'URL': 'https://chat.openai.com/share/0f8a3cf...
3      {'URL': 'https://chat.openai.com/share/e1f4926...
4      {'URL': 'https://chat.openai.com/share/67ff020...
                             ...                        
174    {'URL': 'https://chat.openai.com/share/b53e39e...
175    {'URL': 'https://chat.openai.com/share/76d4817...
176    {'URL': 'https://chat.openai.com/share/b57df6e...
177    {'URL': 'https://chat.openai.com/share/4aeb8ed...
178    {'URL': 'https://chat.openai.com/share/75cd8ea...
Name: ChatgptSharing, Length: 179, dtype: object
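Each value in this column is a dictionary, so its keys can be spread into separate columns for easier inspection. The following is a minimal sketch on a toy series. Only the `URL` key is visible in the output above; the `Status` key and the example URLs are hypothetical.

```python
import pandas as pd

# Toy stand-in for the ChatgptSharing column; 'Status' is a hypothetical key.
sharing_column = pd.Series(
    [
        {"URL": "https://chat.openai.com/share/example-1", "Status": 200},
        {"URL": "https://chat.openai.com/share/example-2", "Status": 404},
    ],
    name="ChatgptSharing",
)

# Drop missing entries first, because json_normalize expects dictionaries.
present = sharing_column.dropna()
sharing_details = pd.json_normalize(present.tolist())
# Restore the original row labels so the result can be joined back.
sharing_details.index = present.index
print(sharing_details["URL"])
```

Keeping the original index makes it possible to join the expanded columns back onto the source DataFrame.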

In [17]:
issues_dataframe = read_json_data_from_files_to_dataframe(issues_json_filepaths, "Body")
issues_dataframe

Data from './DevGPT/snapshot_20230727\20230727_195941_issue_sharings.json' file contains 235 commits.
Data from './DevGPT/snapshot_20230803\20230803_094705_issue_sharings.json' file contains 255 commits.
Data from './DevGPT/snapshot_20230810\20230810_123938_issue_sharings.json' file contains 282 commits.
Data from './DevGPT/snapshot_20230817\20230817_130502_issue_sharings.json' file contains 303 commits.
Data from './DevGPT/snapshot_20230824\20230824_101836_issue_sharings.json' file contains 336 commits.
Data from './DevGPT/snapshot_20230831\20230831_061759_issue_sharings.json' file contains 353 commits.
Data from './DevGPT/snapshot_20230907\20230907_092956_issue_sharings.json' file contains 388 commits.
Data from './DevGPT/snapshot_20230914\20230914_080417_issue_sharings.json' file contains 422 commits.
Data from './DevGPT/snapshot_20231012\20231012_235128_issue_sharings.json' file contains 507 commits.
When all the dataframes were concatenated/appended, we have total of 3081 rows in 

| | Type | URL | Author | RepoName | RepoLanguage | Number | Title | Body | CreatedAt | ClosedAt | UpdatedAt | State | ChatgptSharing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | issue | https://github.com/gakusyutai/gakusyutai.githu... | yuyu31 | gakusyutai/gakusyutai.github.io | HTML | 31 | ハンバーガーメニューの実装 | - https://chat.openai.com/share/8b0f517f-1aaf-... | 2023-07-23T15:38:42Z | | 2023-07-23T15:38:42Z | OPEN | [{'URL': 'https://chat.openai.com/share/795827... |
| 1 | issue | https://github.com/jabrena/aqa-tests-experimen... | jabrena | jabrena/aqa-tests-experiments | Java | 4 | Run a test in multiple java distros | - https://chat.openai.com/share/e169e9a7-40c5-... | 2023-07-07T20:30:07Z | | 2023-07-08T11:56:45Z | OPEN | [{'URL': 'https://chat.openai.com/share/b508dd... |
| 2 | issue | https://github.com/OpenVoiceOS/ovos-technical-... | JarbasAl | OpenVoiceOS/ovos-technical-manual | | 4 | document ovos-classifiers | ovos-classifiers is in pre-alpha but documenta... | 2023-06-08T19:13:26Z | | 2023-06-08T19:13:26Z | OPEN | [{'URL': 'https://chat.openai.com/share/1c4bc8... |
| 3 | issue | https://github.com/SKKUFastech/week1/issues/5 | smh9800 | SKKUFastech/week1 | C | 5 | 7/19 | https://chat.openai.com/share/18990fa3-c8c6-41... | 2023-07-19T01:36:52Z | | 2023-07-19T01:50:48Z | OPEN | [{'URL': 'https://chat.openai.com/share/18990f... |
| 4 | issue | https://github.com/SKKUFastech/week1/issues/4 | woojinsung-jimmy | SKKUFastech/week1 | C | 4 | 16진수 | https://chat.openai.com/share/83859c4c-8894-41... | 2023-07-18T06:14:42Z | | 2023-07-18T06:14:42Z | OPEN | [{'URL': 'https://chat.openai.com/share/83859c... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 472 | issue | https://github.com/simonw/datasette/issues/2189 | simonw | simonw/datasette | Python | 2189 | Server hang on parallel execution of queries t... | I've started to encounter a bug where queries ... | 2023-09-18T17:23:18Z | 2023-09-21T22:26:21Z | 2023-09-21T22:26:21Z | CLOSED | [{'URL': 'https://chat.openai.com/share/cc4628... |
| 478 | issue | https://github.com/cyd01/KiTTY/issues/495 | mgrant0 | cyd01/KiTTY | C | 495 | OSC52 panic | I enabled OSC52 support in tmux. I double-cli... | 2023-06-12T08:09:49Z | 2023-09-17T16:59:20Z | 2023-10-02T16:38:09Z | CLOSED | [{'URL': 'https://chat.openai.com/share/ee1d11... |
| 480 | issue | https://github.com/VOICEVOX/voicevox/issues/1004 | sousuke0422 | VOICEVOX/voicevox | TypeScript | 1004 | vuexをやめてpiniaにする | ## 内容\r\n\r\n[vuex](https://vuex.vuejs.org/)をや... | 2022-11-02T04:29:35Z | | 2023-09-17T09:23:57Z | OPEN | [{'URL': 'https://chat.openai.com/share/cce984... |
| 483 | issue | https://github.com/violentmonkey/violentmonkey... | cyfung1031 | violentmonkey/violentmonkey | JavaScript | 1901 | [Question] Regarding the "Allow Updates", how ... | Just a newbie to use Violentmonkey BETA. (v2.1... | 2023-09-22T01:35:12Z | 2023-09-25T02:21:28Z | 2023-09-27T21:47:27Z | CLOSED | [{'URL': 'https://chat.openai.com/share/490132... |
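The row labels in the table above are not consecutive. This is what one would expect if the per-snapshot DataFrames are concatenated with their original indices and duplicates are then dropped with `keep='last'`. The sketch below reproduces that effect on made-up data; the column values are illustrative only.

```python
import pandas as pd

# Two made-up snapshots; the record with Body "a" appears in both.
first_snapshot = pd.DataFrame({"Body": ["a", "b"], "State": ["OPEN", "OPEN"]})
second_snapshot = pd.DataFrame({"Body": ["c", "a"], "State": ["OPEN", "CLOSED"]})

# Each frame keeps its own 0-based labels, so labels repeat after concat.
combined = pd.concat([first_snapshot, second_snapshot])

# keep="last" retains the copy from the later snapshot for every Body value.
deduplicated = combined.drop_duplicates(subset=["Body"], keep="last")
print(deduplicated)

# reset_index gives the result unique, consecutive row labels.
print(deduplicated.reset_index(drop=True))
```

After deduplication the surviving labels are no longer unique or consecutive, so label-based lookups with `.loc` can return several rows. Calling `reset_index(drop=True)`, or passing `ignore_index=True` to `pd.concat`, avoids this.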


In [None]:
discussion_dataframe = read_json_data_from_files_to_dataframe(discussion_json_filepaths)
discussion_dataframe

In [None]:
pull_requests_dataframe = read_json_data_from_files_to_dataframe(pull_requests_json_filepaths)
pull_requests_dataframe

In [None]:
code_dataframe = read_json_data_from_files_to_dataframe(code_json_filepaths)
code_dataframe

In [None]:
hackernews_dataframe = read_json_data_from_files_to_dataframe(hackernews_json_filepaths)
hackernews_dataframe