# Dataset Preparation for Magicoder

In this approach, each Magicoder example is flattened into a single text field with the following format:

```markdown
###Instruction: <problem>
<<code>>: <solution>
```

In [None]:
! pip install datasets

Collecting datasets
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
Successfully installed datasets-2.17.1 dill-0.3.8 multiprocess-0.70.16


In [None]:
from datasets import load_dataset

magicDB = load_dataset("ise-uiuc/Magicoder-OSS-Instruct-75K",  split="train")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
import pandas as pd

# Convert the Hugging Face dataset to a pandas DataFrame.
df = magicDB.to_pandas()

In [None]:
df

Unnamed: 0,lang,raw_index,index,seed,openai_fingerprint,problem,solution
0,cpp,101533,4626,"int n;\n cin >> n;\n vector<int> a(n + 1),...",fp_eeff13170a,"You are given two arrays, A and B, each of len...",```cpp\n#include <iostream>\n#include <vector>...
1,python,131094,37716,return jinja\n,fp_eeff13170a,You are tasked with implementing a simple Pyth...,```python\ndef find_palindromes(words):\n p...


In [None]:
def template(row):
  # Format one DataFrame row as a single instruction/code training string.
  instruction = f"""###Instruction: {row["problem"]} """
  code = f"""<<code>>: {row["solution"]}"""

  return f"""{instruction}\n{code}"""


df["sol"] = df.apply(template, axis = 1)


text = df.iloc[1]["sol"]
print(text)

###Instruction: You are tasked with implementing a simple Python function that takes a list of strings as input and returns a new list containing only the strings that are palindromes. A palindrome is a word, phrase, number, or other sequence of characters that reads the same forward and backward (ignoring spaces, punctuation, and capitalization).

You are provided with the following code snippet as a starting point:

```python
def find_palindromes(words):
    # Your code here
    return palindromes
```

Your task is to complete the `find_palindromes` function to filter out the palindromes from the input list of strings and return a new list containing only the palindromes.

For example, if the input list is `["radar", "hello", "level", "world", "Anna"]`, the function should return `["radar", "level", "Anna"]`. 
<<code>>: ```python
def find_palindromes(words):
    palindromes = [word for word in words if word.lower().replace(" ", "") == word[::-1].lower().replace(" ", "")]
    return p
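
The template step can be tried without downloading the dataset. The following is a minimal sketch, assuming a plain dict with the same `problem` and `solution` keys as the DataFrame rows. `build_prompt` and `split_prompt` are hypothetical helper names that do not appear in the notebook; `split_prompt` illustrates how the two fields could be recovered from the `<<code>>:` marker.

```python
INSTRUCTION_TAG = "###Instruction: "
CODE_TAG = "<<code>>: "


def build_prompt(record):
    """Mirror the notebook's template for one record (hypothetical helper)."""
    instruction = f"{INSTRUCTION_TAG}{record['problem']} "
    code = f"{CODE_TAG}{record['solution']}"
    return f"{instruction}\n{code}"


def split_prompt(text):
    """Recover (problem, solution) by splitting on the first code marker."""
    head, _, solution = text.partition("\n" + CODE_TAG)
    problem = head[len(INSTRUCTION_TAG):].rstrip()
    return problem, solution


# Toy record, not taken from the dataset.
toy = {"problem": "Reverse a string.", "solution": "def rev(s):\n    return s[::-1]"}
prompt = build_prompt(toy)
print(prompt)
```

Because the split relies on the first occurrence of the marker, a problem statement that itself contains a newline followed by `<<code>>: ` would be split in the wrong place.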

In [None]:
import json

new_data = list(df['sol'])
output_file_path = 'aurelius_Magic_v1.json'
with open(output_file_path, 'w') as output_file:
    json.dump(new_data, output_file, indent=2)

In [None]:
langs = list(df['lang'])
with open('aurelius_Magic_v1.json', 'r', encoding='utf-8') as json_file:
    data = json.load(json_file)

# Pair each formatted example with its source language.
hf_data = [{"code": text, "lang": lang} for text, lang in zip(data, langs)]

output_file_path = 'aurelius_Magic_v2.json'
with open(output_file_path, 'w', encoding='utf-8') as output_file:
    json.dump(hf_data, output_file, indent=2)


{'code': '###Instruction: You are tasked with implementing a simple Python function that takes a list of strings as input and returns a new list containing only the strings that are palindromes. A palindrome is a word, phrase, number, or other sequence of characters that reads the same forward and backward (ignoring spaces, punctuation, and capitalization).\n\nYou are provided with the following code snippet as a starting point:\n\n```python\ndef find_palindromes(words):\n    # Your code here\n    return palindromes\n```\n\nYour task is to complete the `find_palindromes` function to filter out the palindromes from the input list of strings and return a new list containing only the palindromes.\n\nFor example, if the input list is `["radar", "hello", "level", "world", "Anna"]`, the function should return `["radar", "level", "Anna"]`. \n<<code>>: ```python\ndef find_palindromes(words):\n    palindromes = [word for word in words if word.lower().replace(" ", "") == word[::-1].lower().repla
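
The two JSON passes could also be collapsed into one. As a minimal sketch, assuming the formatted texts and their languages are already in memory as two parallel lists (toy values here), the records can be built and written in a single step. `write_records` is a hypothetical helper, and a temporary directory is used instead of the notebook's file names.

```python
import json
import os
import tempfile


def write_records(texts, langs, path):
    """Pair each formatted text with its language and dump them as a JSON array."""
    if len(texts) != len(langs):
        raise ValueError("texts and langs must have the same length")
    records = [{"code": t, "lang": l} for t, l in zip(texts, langs)]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)
    return records


# Toy inputs standing in for df["sol"] and df["lang"].
texts = [
    "###Instruction: Add two ints. \n<<code>>: int add(int a, int b) { return a + b; }",
    "###Instruction: Say hi. \n<<code>>: print('hi')",
]
langs = ["cpp", "python"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "records.json")
    written = write_records(texts, langs, path)
    # Read the file back to confirm the round trip preserves every record.
    with open(path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)

print(reloaded[0]["lang"], len(reloaded))
```

The explicit length check guards against `zip` silently dropping trailing items when the two lists differ in length.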

## Push Dataset to Hub

In [None]:
from huggingface_hub import notebook_login
from datasets import load_dataset
notebook_login()


In [None]:
dataset = load_dataset('json', data_files='aurelius_Magic_v2.json' , split='train')


In [None]:
print(dataset['code'][:1])

['###Instruction: You are given two arrays, A and B, each of length n. You need to perform a convolution operation on these arrays and output the resulting array.\n\nThe convolution of two arrays A and B is defined as follows:\n- Let C be the resulting array of length 2n-1, where C[i] = Σ(A[j] * B[i-j]) for j = max(0, i-n+1) to min(i, n-1).\n\nWrite a function or method to perform the convolution operation and return the resulting array C.\n\nFunction Signature: \n```cpp\nvector<int> convolution(vector<int> a, vector<int> b)\n```\n\nInput:\n- Two arrays a and b of length n (1 <= n <= 10^5), where each element of the array is an integer (-10^9 <= a[i], b[i] <= 10^9).\n\nOutput:\n- Return the resulting array C after performing the convolution operation.\n\nExample:\nInput:\na = [1, 2, 3]\nb = [4, 5, 6]\n\nOutput:\nconvolution(a, b) -> [4, 13, 28, 27, 18] \n<<code>>: ```cpp\n#include <iostream>\n#include <vector>\nusing namespace std;\n\nvector<int> convolution(vector<int> a, vector<int> 

In [None]:
print(len(dataset))

75197
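
Before pushing, it may be worth checking that every record carries both markers and looking at the language mix. This is a minimal sketch over a toy list of records with the same `code` and `lang` keys. `check_records` is a hypothetical helper, and the same loop could in principle be run over the rows of the loaded dataset.

```python
from collections import Counter


def check_records(records):
    """Return (language counts, indices of records missing a marker)."""
    counts = Counter()
    malformed = []
    for i, rec in enumerate(records):
        counts[rec["lang"]] += 1
        text = rec["code"]
        if not text.startswith("###Instruction: ") or "\n<<code>>: " not in text:
            malformed.append(i)
    return counts, malformed


# Toy records, not taken from the dataset.
records = [
    {"code": "###Instruction: A \n<<code>>: x = 1", "lang": "python"},
    {"code": "###Instruction: B \n<<code>>: int x = 1;", "lang": "cpp"},
    {"code": "no markers here", "lang": "python"},
]
counts, malformed = check_records(records)
print(counts.most_common(), malformed)
```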


In [None]:
dataset.push_to_hub("Aurelius_Magic_75k")


CommitInfo(commit_url='https://huggingface.co/datasets/anshulsc/Aurelius_Magic_75k/commit/eedcdad61f23e84690feafa122a588bdf4ea88b1', commit_message='Upload dataset', commit_description='', oid='eedcdad61f23e84690feafa122a588bdf4ea88b1', pr_url=None, pr_revision=None, pr_num=None)