# Kaggle API Setup Guide

This guide explains how to set up the Kaggle API so that you can download datasets from Python.

## Prerequisites
- Ensure Python is installed on your system.
- Install the Kaggle API package by running:
  ```
  pip install kaggle
  ```

## Setting Up `kaggle.json`
The `kaggle.json` file contains your API key, which is required to authenticate with Kaggle.

### Step 1: Download `kaggle.json`
1. Log in to your [Kaggle account settings](https://www.kaggle.com/account).
2. Scroll to the **API** section.
3. Click **Create New API Token**. This will download a file named `kaggle.json`.

### Step 2: Place the `kaggle.json` File
1. Create a `.kaggle` directory in your user home directory:
   - On Windows: `C:\Users\YourUsername\.kaggle`
   - On Mac/Linux: `~/.kaggle`

2. Move the `kaggle.json` file to the `.kaggle` directory.
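
If you prefer to script Step 2, the following is a minimal sketch using only the standard library. The function name `install_kaggle_token` and its `home` parameter are illustrative and not part of the Kaggle package; you pass in whatever path your browser saved the token to.

```python
import shutil
from pathlib import Path
from typing import Optional


def install_kaggle_token(downloaded_token: Path, home: Optional[Path] = None) -> Path:
    """Move a downloaded kaggle.json into <home>/.kaggle and return its new path."""
    home = home or Path.home()
    kaggle_dir = home / ".kaggle"
    # Create the directory if needed; do nothing if it already exists.
    kaggle_dir.mkdir(parents=True, exist_ok=True)
    target = kaggle_dir / "kaggle.json"
    shutil.move(str(downloaded_token), str(target))
    return target
```

For example, `install_kaggle_token(Path.home() / "Downloads" / "kaggle.json")` would move the token from a `Downloads` folder, assuming that is where it was saved.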

### Step 3: Verify Permissions (Optional for Windows)
- On Linux/Mac, set the correct permissions to secure the file:
  ```
  chmod 600 ~/.kaggle/kaggle.json
  ```
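
The same permission change can be checked or applied from Python on Linux/Mac. This is a sketch under the assumption that you want the equivalent of `chmod 600`; the helper names `is_private` and `make_private` are hypothetical.

```python
import os
import stat
from pathlib import Path


def is_private(path: Path) -> bool:
    """Return True if group and others have no access to the file (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0


def make_private(path: Path) -> None:
    """Equivalent of `chmod 600`: read and write for the owner only."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
```

A typical use would be to call `make_private(Path.home() / ".kaggle" / "kaggle.json")` only when `is_private` returns `False`.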

In [1]:
import os
from kaggle.api.kaggle_api_extended import KaggleApi

# Initialize and authenticate the Kaggle API
api = KaggleApi()
api.authenticate()

# Specify the dataset to download
dataset_name = "simiotic/github-code-snippets"  # Replace with your desired dataset

# Define the path to download the dataset
download_path = "./datasets"

# Download the dataset
print(f"Downloading dataset '{dataset_name}'...")
api.dataset_download_files(dataset_name, path=download_path, unzip=True)
print(f"Dataset downloaded and extracted to: {os.path.abspath(download_path)}")

Downloading dataset 'simiotic/github-code-snippets'...
Dataset URL: https://www.kaggle.com/datasets/simiotic/github-code-snippets
Dataset downloaded and extracted to: /root/Ollama-Set-RAG/datasets
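
To see what the download actually extracted, you can list the files under the download directory. The helper below is a small sketch using only the standard library; `list_files` is an illustrative name.

```python
from pathlib import Path


def list_files(root):
    """Return (relative path, size in bytes) for every file under root, sorted by path."""
    root_path = Path(root)
    return sorted(
        (str(p.relative_to(root_path)), p.stat().st_size)
        for p in root_path.rglob("*")
        if p.is_file()
    )
```

Calling `list_files("./datasets")` returns one tuple per extracted file, which makes it easy to spot large files and their extensions.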


## Inspect the data in the `.db` file
### Step 1: Install `pandas`

In [2]:
!pip install pandas

Collecting pandas
  Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting numpy>=1.26.0 (from pandas)
  Downloading numpy-2.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting pytz>=2020.1 (from pandas)
  Downloading pytz-2024.2-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Downloading tzdata-2024.2-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.3-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)
Downloading numpy-2.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.1 MB)
Downloading pytz-2024.2-py2.py3-none-any.whl (508 kB)
Downloading tzdata-2024.2-py2.py3-no

### Step 2: Show the data
The dataset is extracted to the `./datasets/snippets/` directory, which contains a file named `snippets.db`. The `.db` extension suggests an SQLite database, so you can read it with Python's built-in `sqlite3` module.

Here's how you can access the data from the snippets.db SQLite database:

In [3]:
import sqlite3
import pandas as pd

# Define the path to the SQLite database
database_path = "./datasets/snippets/snippets.db"

# Connect to the database
conn = sqlite3.connect(database_path)

# Query the total chunk size per language from the 'snippets' table
query = "SELECT language, SUM(chunk_size) AS total_chunk_size FROM snippets GROUP BY language"

# Load the data into a pandas DataFrame
df = pd.read_sql_query(query, conn)

# Close the database connection
conn.close()

# Display the DataFrame as a table
print(df)


      language  total_chunk_size
0         Bash             53840
1            C          61905695
2          C++          32807530
3          CSV            888775
4      DOTFILE            337085
5           Go          40669595
6         HTML           9936365
7         JSON          32576945
8         Java          37707120
9   JavaScript          45340285
10     Jupyter           1229605
11    Markdown          14949355
12  PowerShell            195980
13      Python          19833355
14        Ruby           5435615
15        Rust           2993515
16       Shell           1923690
17         TSV             51525
18        Text          17147775
19     UNKNOWN         158415145
20        YAML           2938680
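
If you are unsure which tables and columns a `.db` file contains, you can ask SQLite itself before writing queries. The sketch below reads the `sqlite_master` catalog and uses `PRAGMA table_info`; the function name `describe_database` is illustrative. Note that `sqlite3.connect` creates an empty file if the path does not exist, so check the path first.

```python
import sqlite3


def describe_database(database_path):
    """Return {table name: [column names]} for every table in an SQLite file."""
    conn = sqlite3.connect(database_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # Escape embedded double quotes so the table name is a valid identifier.
            quoted = table.replace('"', '""')
            # PRAGMA table_info returns one row per column; index 1 is the column name.
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()
```

For the database used here, `describe_database("./datasets/snippets/snippets.db")` should include a `snippets` entry listing its columns.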


---
To work with only the Java snippets from this large database, extract them once and save them in a more accessible format:

1. **Filter the Data**: Query the database for the rows where `language` is "Java".
2. **Export the Filtered Data**: Save the result to a file such as **CSV** or **Parquet**, both of which pandas can read directly.
3. **Load and Work with the Data**: Load the file into pandas or another tool for further processing.

### 1. Python Code to Filter Java Snippets and Export to CSV:


In [4]:
import sqlite3
import pandas as pd

# Define the path to the SQLite database
database_path = "./datasets/snippets/snippets.db"

# Connect to the database
conn = sqlite3.connect(database_path)

# Query to filter Java-related snippets
query = """
    SELECT * FROM snippets
    WHERE language = 'Java'
"""

# Load the data into a pandas DataFrame
df_java = pd.read_sql_query(query, conn)

# Export the filtered data to CSV (or use Parquet for more efficient storage)
df_java.to_csv("./datasets/snippets/javaSnippets.csv", index=False)

# Close the database connection
conn.close()

print("Java-related snippets exported to 'javaSnippets.csv'")

Java-related snippets exported to 'javaSnippets.csv'


### 2. Explanation of the Code:

- **Database Connection**: We open a connection to the `snippets.db` database using `sqlite3`.
- **SQL Query**: We query the `snippets` table to select only those rows where the `language` is "Java".
- **Exporting**: We save the filtered data to a **CSV** file (`javaSnippets.csv`). CSV is human-readable; if file size or load time matters, pandas can write **Parquet** instead via `DataFrame.to_parquet`, which requires an optional engine such as `pyarrow`.
- **Close the Connection**: The database connection is closed once the export is complete.
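
The cell above loads every Java row into memory before writing. As an alternative sketch (not what the cell above does), the rows can be streamed to CSV in batches with the standard `csv` module, which keeps memory use bounded. The function name `export_language` and the `batch_size` default are assumptions made for illustration; the table name `snippets` comes from the query above.

```python
import csv
import sqlite3


def export_language(database_path, language, csv_path, batch_size=10_000):
    """Stream rows for one language from the `snippets` table into a CSV file.

    Returns the number of data rows written (the header is not counted).
    """
    conn = sqlite3.connect(database_path)
    try:
        # The ? placeholder lets sqlite3 bind the value instead of formatting SQL by hand.
        cursor = conn.execute(
            "SELECT * FROM snippets WHERE language = ?", (language,)
        )
        header = [column[0] for column in cursor.description]
        written = 0
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            while True:
                rows = cursor.fetchmany(batch_size)
                if not rows:
                    break
                writer.writerows(rows)
                written += len(rows)
        return written
    finally:
        conn.close()
```

The trade-off is that you get a file on disk rather than a DataFrame, so this suits cases where the filtered result is too large to hold comfortably in memory.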

### 3. Loading the Exported Data:

Once you’ve exported the data to a file, you can easily load it back into pandas for any further analysis.

#### Loading the CSV File:

In [12]:
import pandas as pd

# Load the Java snippets CSV file
df_java = pd.read_csv("./datasets/snippets/javaSnippets.csv")

# Display the first few rows of the data
df_java.head()

Unnamed: 0,id,snippet,language,repo_file_name,github_repo_url,license,commit_hash,starting_line_number,chunk_size
0,88608,/*\n * Copyright 2014 The Netty Project\n *\n ...,Java,netty/netty/resolver-dns/src/test/java/io/nett...,https://github.com/netty/netty,Apache-2.0,a60825c3b425892af9be3e9284677aa8a58faa6b\n,0,5
1,88609,* with the License. You may obtain a copy of ...,Java,netty/netty/resolver-dns/src/test/java/io/nett...,https://github.com/netty/netty,Apache-2.0,a60825c3b425892af9be3e9284677aa8a58faa6b\n,5,5
2,88610,* distributed under the License is distribute...,Java,netty/netty/resolver-dns/src/test/java/io/nett...,https://github.com/netty/netty,Apache-2.0,a60825c3b425892af9be3e9284677aa8a58faa6b\n,10,5
3,88611,\npackage io.netty.resolver.dns;\n\nimport io....,Java,netty/netty/resolver-dns/src/test/java/io/nett...,https://github.com/netty/netty,Apache-2.0,a60825c3b425892af9be3e9284677aa8a58faa6b\n,15,5
4,88612,\nimport java.net.InetSocketAddress;\nimport j...,Java,netty/netty/resolver-dns/src/test/java/io/nett...,https://github.com/netty/netty,Apache-2.0,a60825c3b425892af9be3e9284677aa8a58faa6b\n,20,5


### 4. Benefits of This Approach:

- **Speed**: A file containing only the Java rows is much smaller than the full database, so later sessions can load it directly instead of re-running the filter query against the large SQLite file.
- **Ease of Use**: You can use pandas or other tools to read the file and easily filter, analyze, or manipulate the data.
- **Efficient Storage**: Parquet is typically more space-efficient and faster to read than CSV for large datasets, because it is a compressed, columnar format.

This approach should help you manage large datasets effectively and focus on Java snippets for debugging and analysis.
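
When the exported CSV is itself large, pandas can read it in chunks and load only the columns you need. The following sketch counts snippets per repository this way; the column name `github_repo_url` is taken from the output shown earlier, while the function name and the `chunksize` default are illustrative assumptions.

```python
from collections import Counter

import pandas as pd


def count_snippets_per_repo(csv_path, chunksize=100_000):
    """Count rows per `github_repo_url`, reading the CSV in chunks to limit memory use."""
    totals = Counter()
    reader = pd.read_csv(csv_path, usecols=["github_repo_url"], chunksize=chunksize)
    for chunk in reader:
        # value_counts() tallies each URL within the chunk; Counter sums across chunks.
        totals.update(chunk["github_repo_url"].value_counts().to_dict())
    return totals
```

`count_snippets_per_repo("./datasets/snippets/javaSnippets.csv").most_common(5)` would then list the five repositories with the most Java snippets.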

In [2]:
rm './datasets/snippets/snippets.db'

---

Here’s how you can write Python code to split a large file into smaller parts (up to 100 MB each by default) and then combine them back into the original file:

### Split the large file into parts

In [3]:
import os


def split_file(input_file, part_prefix, max_size=100 * 1024 * 1024):
    with open(input_file, 'rb') as file:
        part_number = 1
        while True:
            # Read up to `max_size` bytes
            data = file.read(max_size)
            if not data:  # Stop if no more data to read
                break
            
            # Format the output file name: <part_prefix>_part<part_number>.csv
            part_file = f"{part_prefix}_part{part_number}.csv"
            with open(part_file, 'wb') as output:
                output.write(data)
            
            print(f"Created {part_file}")
            part_number += 1

    os.remove(input_file)  # Remove the original file
    print(f"Original file {input_file} removed after splitting.")

# Example usage
path = './datasets/snippets/'
input_file = f'{path}javaSnippets.csv'
part_prefix = f'{path}javaSnippets'  # Prefix for output parts

split_file(input_file, part_prefix)


### Combine the parts back into the original file

In [8]:
import os


def combine_files(part_prefix, output_file):
    part_number = 1
    with open(output_file, 'wb') as output:
        while True:
            # Generate part file name: <part_prefix>_part<part_number>.csv
            part_file = f"{part_prefix}_part{part_number}.csv"
            
            if not os.path.exists(part_file):
                # Stop if the part file doesn't exist
                break
            
            # Combine the part file into the output file
            with open(part_file, 'rb') as part:
                output.write(part.read())
            
            # Remove the part file after combining
            os.remove(part_file)
            print(f"Added {part_file} to {output_file} and removed it.")
            part_number += 1

    print(f"Files combined into {output_file}")

# Example usage (reuses `path` and `part_prefix` defined in the splitting cell)
output_file = f'{path}javaSnippets.csv'

combine_files(part_prefix, output_file)

Files combined into ./datasets/snippets/javaSnippets.csv


### How the code works:
1. **Splitting**: 
   - The `split_file` function opens the input file in binary mode (`'rb'`) and reads it in chunks of at most `max_size` bytes (100 MB by default).
   - Each chunk is written to its own file named `<part_prefix>_part<N>.csv`, with `N` starting at 1.
   - After the last chunk is written, the original file is deleted. Because chunks are cut at byte boundaries, an individual part is not necessarily a valid CSV file on its own.

2. **Combining**:
   - The `combine_files` function appends `<part_prefix>_part1.csv`, `<part_prefix>_part2.csv`, and so on to `output_file`, stops at the first part number that does not exist, and deletes each part after appending it.
   - The files are opened in binary mode (`'rb'` and `'wb'`), so the approach works for any file type.
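
Because splitting deletes the original file and combining deletes the parts, it can be worth confirming that a round trip reproduces the file byte for byte. One way to do that, sketched here with the standard `hashlib` module, is to record a SHA-256 digest before splitting and compare it with the digest of the recombined file. The helper name `sha256_of` is illustrative.

```python
import hashlib


def sha256_of(path, block_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in blocks of `block_size` bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

In practice you would store `sha256_of(input_file)` before calling `split_file`, then check that `sha256_of(output_file)` returns the same string after `combine_files`.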

### Notes:
- Replace `./datasets/snippets/` and the `javaSnippets` prefix in the example usage with the actual location and name of your file.
- Make sure the file you want to split is not open in another program, as this might cause a permission error.


In [1]:
cp './datasets/snippets/javaSnippets.csv' './datasets/snippets/Storage'