### OpenAI Access

First things first, you'll need to set up an account on [OpenAI](https://platform.openai.com). Once you've done that, follow [these instructions](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key) to create an API key. Make sure you save your API key!

In [None]:
import os 

# Set the OPENAI_API_KEY environment variable
os.environ["OPENAI_API_KEY"] = "secret-key-here"
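Pasting the key straight into a cell makes it easy to leak when the notebook is shared. Here's a minimal sketch of one alternative, assuming you'd rather be prompted for the key. Note that `load_api_key` is a hypothetical helper written for this tutorial, not part of the `openai` library.

```python
import os
from getpass import getpass


def load_api_key(env_var="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting only if it is missing.

    Hypothetical helper for this tutorial; not an OpenAI API.
    """
    key = os.environ.get(env_var)
    if not key:
        # getpass hides what you type, so the key never shows up in cell output.
        key = prompt_fn(f"Enter {env_var}: ")
        os.environ[env_var] = key
    return key
```

Calling `load_api_key()` once at the top of the notebook would take the place of the hardcoded assignment.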

### OpenAI API Library

We'll be leveraging [this](https://github.com/openai/openai-python) library to access OpenAI's model endpoints.

There are a number of models to choose from and you can find resources about them [here](https://platform.openai.com/docs/models) and their pricing [here](https://openai.com/pricing).

The first step is to install `openai`!

In [None]:
!pip install openai -q

Once we've installed it, we need to import it and set our API key!

In [None]:
import openai 

openai.api_key = os.environ.get("OPENAI_API_KEY")

If you want to use `gpt-4`, you'll need an account that has closed beta access to that model endpoint.

You can check whether your API key has access using the following cell.

In [None]:
# Check whether the account has gpt-4 access
"gpt-4" in [model["root"] for model in openai.Model.list()["data"]]

False
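If you'd like to check for more than one model, the same idea can be pulled into a small function. This is only a sketch: `has_model` is a hypothetical helper, and it assumes each entry in the listing is a dictionary with an `id` key. The example listing below is made up for illustration and does not come from the API.

```python
def has_model(model_entries, model_name):
    """Return True if any entry in a model listing has the given id."""
    return any(entry.get("id") == model_name for entry in model_entries)


# Illustrative listing only; a real one would come from the API call above.
example_listing = [{"id": "gpt-3.5-turbo"}, {"id": "text-embedding-ada-002"}]
print(has_model(example_listing, "gpt-4"))          # False
print(has_model(example_listing, "gpt-3.5-turbo"))  # True
```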

For the rest of the tutorial, we're going to assume you're using `gpt-3.5-turbo` as your model.

Let's make some helper functions for prompting our model and generating our prompts.

In [None]:
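def prompt_model(prompt_list, model="gpt-3.5-turbo"):
    """Send a list of chat messages to the model and return the raw response."""
    return openai.ChatCompletion.create(model=model, messages=prompt_list)

def create_prompt(role, prompt):
    """Wrap text in the role/content dictionary format the chat endpoint expects."""
    return {"role": role, "content": prompt}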

As you can see, each prompt has to follow a specific format set by OpenAI: a dictionary with a `role` key and a `content` key.

Here's an example:

```
{"role" : "system", "content" : "You are an expert in Python programming."}
{"role" : "user", "content" : "Please define a function that provides the Nth number of the Fibonacci sequence."}
```
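A typo in the role name is an easy mistake to make, so it can help to validate messages as you build them. Here's a minimal sketch, assuming the only roles we care about are `system`, `user`, and `assistant` (the endpoint may accept others). `make_message` and `VALID_ROLES` are hypothetical names for this tutorial.

```python
# Roles assumed for this sketch; the API may accept additional ones.
VALID_ROLES = {"system", "user", "assistant"}


def make_message(role, content):
    """Build one chat message, rejecting roles outside VALID_ROLES."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role {role!r}; expected one of {sorted(VALID_ROLES)}")
    return {"role": role, "content": content}


messages = [
    make_message("system", "You are an expert in Python programming."),
    make_message("user", "Please define a function that provides the Nth number of the Fibonacci sequence."),
]
```

The resulting `messages` list has the same shape as the example above and could be passed straight to `prompt_model`.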

Let's see that in action! Remember that you can pass a list of prompts to OpenAI's chat completion endpoint!

In [None]:
position_level = [ "entry", "senior", "mid-level" ]
tasks = [ "computer vision", "natural language processing", "recommender system", "Reinforcement learning"]
job_titles = ['Machine Learning Engineer', 'Data Scientist', 'Research Scientist', 'Business Intelligence Developer', 'AI Product Manager', 'AI Consultant', 'Robotics Engineer', 'NLP Engineer', 'Research Assistant', 'Deep Learning Engineer']

job_postings = []
for position_level_t in position_level:
    for task_t in tasks:
        for job_titles_t in job_titles:
            
            list_of_prompts = [
                {"role" : "system", "content" : "You are a technical hiring manager working at an AI company."}, 
                {"role" : "user", "content" : f"Please define a job description for a {job_titles_t} role for a {position_level_t} level position using {task_t}."}
            ]

            model_output = prompt_model(list_of_prompts)
            # print(model_output)

            job_postings.append(model_output['choices'][0]['message']['content'])
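The loop above makes one request per combination, so a single failed request would stop the whole run. Here's a minimal sketch of a generic retry wrapper with exponential backoff. `call_with_retries` is a hypothetical helper, and it retries on any exception rather than on specific OpenAI error types, which is an assumption you may want to narrow.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying with exponential backoff if it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: let the last error propagate.
                raise
            # Wait base_delay, then 2x, then 4x, ... between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside the loop, the call could then become `call_with_retries(lambda: prompt_model(list_of_prompts))`.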

In [None]:
print(job_postings[0])

Job Title: Entry-Level Machine Learning Engineer - Computer Vision

Location: [Your Company's Location]

Job Type: Full-Time

Job Summary:
We are looking for a driven and enthusiastic Entry-Level Machine Learning Engineer with a background in Computer Vision to join our growing Artificial Intelligence team. The Machine Learning Engineer will play a critical role in the design and development of our computer vision systems, which are essential components of our AI algorithms that enable various applications such as robotics, autonomous vehicles, and image analysis.

Responsibilities:
- Work closely with the AI team to research, design and implement machine learning models for solving real-world computer vision problems.
- Develop and train deep learning models using various frameworks such as TensorFlow or PyTorch.
- Develop algorithms to improve existing systems, achieve higher accuracy and introduce new features to enable more advanced use cases.
- Participate in data annotation and c

In [None]:
dataset_dict = {
    "position_level": [],
    "use_case": [],
    "job_title": [],
    "job_posting": []
}

# Walk the combinations in the same order used to generate job_postings
i = 0
for position_level_t in position_level:
    for task_t in tasks:
        for job_titles_t in job_titles:
            dataset_dict["position_level"].append(position_level_t)
            dataset_dict["use_case"].append(task_t)
            dataset_dict["job_title"].append(job_titles_t)
            dataset_dict["job_posting"].append(job_postings[i])
            i += 1
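The manual counter works because the nested loops run in the same order as the generation loop. As an alternative sketch, `itertools.product` yields the combinations in that same order (last list varying fastest), and a length check catches any mismatch with the postings. `build_dataset_dict` is a hypothetical helper name.

```python
from itertools import product


def build_dataset_dict(position_levels, tasks, job_titles, postings):
    """Pair each (level, task, title) combination with its posting, in loop order."""
    combos = list(product(position_levels, tasks, job_titles))
    if len(combos) != len(postings):
        raise ValueError(f"Expected {len(combos)} postings, got {len(postings)}")
    columns = {"position_level": [], "use_case": [], "job_title": [], "job_posting": []}
    for (level, task, title), posting in zip(combos, postings):
        columns["position_level"].append(level)
        columns["use_case"].append(task)
        columns["job_title"].append(title)
        columns["job_posting"].append(posting)
    return columns
```

With the lists defined earlier, `build_dataset_dict(position_level, tasks, job_titles, job_postings)` would produce the same dictionary as the cell above.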

### Uploading Dataset to Hugging Face Hub

Now that we've created our synthetic dataset, let's push it to the Hugging Face Hub!

As always, the first task is to get the required dependencies.

In [None]:
!pip install huggingface_hub -q

Now we can log in to Hugging Face!

Make sure you have a Hugging Face account and have set up a read/write token!

More info here: https://huggingface.co/docs/hub/security-tokens

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) y
Token is valid.
Cannot authenticate through git-credential a

Now we can load our data into a Hugging Face `Dataset` and upload it to the Hub!

In [None]:
!pip install datasets -q

In [None]:
from datasets import load_dataset, Dataset
import pandas as pd

In [None]:
hf_dataset = Dataset.from_pandas(pd.DataFrame(data=dataset_dict))

In [None]:
hf_dataset

Dataset({
    features: ['position_level', 'use_case', 'job_title', 'job_posting'],
    num_rows: 120
})

In [None]:
hf_username = "cmagganas"
dataset_name = "GenAI-job-postings-Dataset"

hf_dataset.push_to_hub(f"{hf_username}/{dataset_name}")


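Next, let's take a random sample of 10 job postings and ask the model to write a cover letter for each one.

In [None]: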
sample_data = hf_dataset.shuffle().select(range(10))

In [None]:
# Generate one cover letter per sampled job posting
mod_outs = []
for job_posting_i in sample_data['job_posting']:
    list_of_prompts = [{"role" : "user", "content" : f"create a cover letter for this job description\n```\n{job_posting_i}\n```"}]
    mod_outs.append(prompt_model(list_of_prompts)['choices'][0]['message']['content'])
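The prompt construction in the loop can also be pulled into its own function, which makes it easier to inspect before spending API calls. This is a sketch only: `cover_letter_messages` is a hypothetical helper, and stripping whitespace from the posting is an extra step the cell above does not do.

```python
FENCE = "`" * 3  # three backticks, built this way to avoid nesting fences in this example


def cover_letter_messages(job_posting):
    """Return the single-message prompt list for one job posting."""
    content = (
        "create a cover letter for this job description\n"
        f"{FENCE}\n{job_posting.strip()}\n{FENCE}"
    )
    return [{"role": "user", "content": content}]
```

In the loop, `prompt_model(cover_letter_messages(job_posting_i))` would then replace the inline list.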

In [None]:
sample_data = sample_data.add_column('cover_letter', mod_outs)

In [None]:
hf_username = "cmagganas"
dataset_name = "GenAI-job-postings-Dataset-sample"

sample_data.push_to_hub(f"{hf_username}/{dataset_name}")

### Conclusion

And that's it! You just created a synthetic dataset and pushed it to the Hugging Face Hub!

Next stop? [Modeling!](https://colab.research.google.com/drive/1RfUuzG11Q8AaZuJIHLzXCVC087xoDeSd?usp=sharing)