# **Hands-on OpenAI**

OpenAI's text generation models (often called generative pre-trained transformers or large language models) have been trained to understand natural language, code, and images. The models return text in response to their inputs, which are also referred to as "prompts". Designing a prompt is essentially how you “program” a large language model, usually by providing instructions or some examples of how to complete a task successfully.

Using OpenAI's text generation models, you can build applications to:

1. Draft documents
2. Write computer code
3. Answer questions about a knowledge base
4. Analyze texts
5. Give software a natural language interface
6. Tutor in a range of subjects
7. Translate languages
8. Simulate characters for games

In [1]:
!pip install openai


### **Importing OpenAI**

In [30]:
from openai import OpenAI

In [31]:
# Way 2
# Reading the key from a file works, but a better way is to set the API key through a .env file.

# The context manager closes the file; strip() removes any trailing newline from the key.
with open("C:\\Users\\91949\\Data Science Internship\\genai\\.openai_api_key.txt") as f:
    OPENAI_API_KEY = f.read().strip()

client = OpenAI(api_key=OPENAI_API_KEY)
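The comment in the cell above mentions that an environment-based setup is preferable to a hard-coded file path. A minimal sketch of that idea follows. The helper name `load_api_key` and its `fallback_path` parameter are hypothetical, and the sketch assumes the key has already been exported as the `OPENAI_API_KEY` environment variable (for example by a `.env` loader).

```python
import os
from pathlib import Path


def load_api_key(env_var="OPENAI_API_KEY", fallback_path=None):
    """Return the API key from the environment, else from a text file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    if fallback_path is not None:
        # strip() drops the trailing newline most editors add to the file.
        return Path(fallback_path).read_text().strip()
    raise RuntimeError(f"Set {env_var} or pass fallback_path")
```

With this helper the client could be created as `client = OpenAI(api_key=load_api_key())`, which keeps the key out of the notebook itself.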

In [32]:
prompt = "In our solar system, Earth is a "

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt
)

response

Completion(id='cmpl-9EaHSek4nQxdB1zUyNsigvWxtnUEy', choices=[CompletionChoice(finish_reason='length', index=0, logprobs=None, text='3rd solar system planet.\n\n39\n\nThere are many different ways to measure the')], created=1713263082, model='gpt-3.5-turbo-instruct', object='text_completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=9, total_tokens=25))

In [33]:
print(response.choices[0].text)

3rd solar system planet.

39

There are many different ways to measure the


### **Important Parameters for Completion API**

**Token**  
GPT models do not work with word-level tokens. On average, a token is approximately 4 characters of English text.

[Click here](https://platform.openai.com/tokenizer) to use the tokenizer tool from OpenAI.

Remember that OpenAI charges based on tokens. To calculate the price, it adds the tokens in your prompt to the tokens in the output it generates.
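The arithmetic can be sketched as follows. The 4-characters-per-token rule is only the rough average mentioned above, and `price_per_1k_tokens=0.002` is a made-up illustrative price, not a real rate; actual pricing may also differ between prompt and output tokens.

```python
import math


def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate from the ~4 characters per token rule of thumb."""
    return math.ceil(len(text) / chars_per_token)


def request_cost(prompt_tokens, completion_tokens, price_per_1k_tokens):
    """Cost of one request when prompt and output tokens share one price."""
    return (prompt_tokens + completion_tokens) / 1000 * price_per_1k_tokens


# Token counts reported in the `usage` field of the first completion above.
cost = request_cost(prompt_tokens=9, completion_tokens=16, price_per_1k_tokens=0.002)
print(f"{cost:.6f}")
```

For exact counts, use the tokenizer tool linked above rather than the character-based estimate.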

1. max_tokens
    - The maximum number of tokens to generate
    - Default = 16 (for Completion API)
    - Remember that the token count of your prompt plus `max_tokens` cannot exceed the model's context length. Most models have a context length of 2048 tokens or 4096 tokens.
2. stop
    - Default = null
    - Accepts a stop sequence as a string or an array
    - If stop is passed as an array, up to 4 sequences are accepted
3. n
    - How many completions to generate for each prompt
    - Default = 1
    - It can quickly consume your quota, so use it carefully and make sure `max_tokens` and `stop` have reasonable settings
4. echo
    - Echo back the prompt in addition to the completion
    - Default = False
5. temperature (Controls the randomness of the output)
    - Value between 0 and 2
    - Default = 1
    - If temp > 1, it generates more random output, which can be gibberish.
    - Temperature values between 0 and 1 usually return good results.
    - If temp=0, the results are mostly deterministic, meaning you get (almost) the same output every time you run the program.
    - Higher temp value means more random, lower means more deterministic.
    - **Working:** It scales the logits. Logits are divided by the temperature value before applying the softmax. This results in a 'softer' distribution with a higher temperature and a peaked distribution with a low temperature.
6. top_p
    - Value between 0 and 1
    - An alternative to sampling with temperature, called nucleus sampling
    - Default = 1
    - Like temp, top p also alters the randomness of the output.
    - It restricts the set of candidate tokens that the model can choose from.
    - Temperature alters the probabilities, whereas top p restricts the size of the sampling set by keeping only the most likely candidates whose cumulative probability reaches top p.
    - The documentation recommends altering either temperature or top p, but not both at the same time.
7. frequency_penalty
    - Value between -2 and 2
    - Default = 0
    - Value > 0 penalizes new tokens based on how often they already occur in the output so far, i.e. the more often a token has appeared in the output, the higher the penalty.
    - Value < 0 encourages the repetition of tokens that have already occurred. Tokens that have already been generated are more likely to repeat.
8. presence_penalty
    - Value between -2 and 2
    - Default = 0
    - Presence penalty is a one-off additive contribution that applies to all tokens that have been sampled at least once, i.e. a token is penalized the same whether it has occurred once or ten times.
    - With frequency penalty, by contrast, the penalty is proportional to how often a token has already been sampled.
9. stream
    - Boolean
    - Default = False
    - Stream = False means output will be displayed once it is completely generated.
    - Stream = True means output will be sent to us in small pieces as and when it is generated.
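The temperature and top p descriptions above can be illustrated with a small sketch in plain Python. The logits are made-up scores for three candidate tokens, and the function names are hypothetical. This shows the general technique (temperature-scaled softmax followed by nucleus filtering), not OpenAI's actual implementation; in particular, a temperature of exactly 0 is rejected here because it would divide by zero.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=1.0):
    """Keep the smallest set of most likely tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalise over the surviving candidates; the rest get probability 0.
    kept_mass = sum(probs[i] for i in kept)
    return [probs[i] / kept_mass if i in kept else 0.0 for i in range(len(probs))]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
print(top_p_filter(softmax_with_temperature(logits), top_p=0.8))
```

Running it shows the distribution becoming flatter as the temperature rises, while the top p filter removes the least likely candidate entirely instead of reshaping the probabilities.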

In [34]:
# Example with 'max_tokens'

prompt = "Give me a numbered list of all the modules one should study in data science?"

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=100
)

print(response.choices[0].text)



1. Introduction to Data Science
2. Data Manipulation and Cleaning
3. Data Visualization
4. Statistics and Probability
5. Machine Learning
6. Data Mining and Data Warehousing
7. Predictive Analytics
8. Natural Language Processing
9. Deep Learning
10. Big Data Analytics
11. Cloud Computing
12. Data Storytelling
13. Data Ethics and Privacy
14. Time Series Analysis
15. Database Management
16. Web Scraping and


In [35]:
# Example with 'max_tokens' and 'stop'

prompt = "Give me a numbered list of all the modules one should study in data science?"

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=100,
    stop="11."
)

print(response.choices[0].text)



1. Introduction to Data Science
2. Data Collection and Management
3. Data Manipulation and Analysis
4. Data Visualization
5. Predictive Modeling and Machine Learning
6. Statistical Analysis and Inference
7. Design of Experiments
8. Big Data Analytics
9. Natural Language Processing
10. Deep Learning and Neural Networks
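The effect of `stop` on the output above can be imitated locally. This is a hedged sketch: the real API stops generating on the server side rather than trimming text afterwards, and the helper name `apply_stop` is hypothetical. It does reflect two behaviours described earlier: the stop sequence is not part of the returned text, and at most 4 sequences are accepted.

```python
def apply_stop(text, stop):
    """Cut text at the earliest stop sequence; the sequence itself is dropped."""
    sequences = [stop] if isinstance(stop, str) else list(stop)
    if len(sequences) > 4:
        raise ValueError("at most 4 stop sequences are accepted")
    cut = len(text)
    for seq in sequences:
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


raw = "9. Natural Language Processing\n10. Deep Learning\n11. Cloud Computing"
print(apply_stop(raw, "11."))
```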



In [43]:
# Example with 'frequency_penalty'

prompt = "Give me a numbered list of all the modules one should study in data science?"

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=200,
    frequency_penalty=-2
)

print(response.choices[0].text)



1. Introduction to Data Science
2. Data Wrangling and Pre-processing Techniques
3. Data Cleaning and Data Exploration
4. Data Visualization
5. Data Analysis
6. Data Mining
7. Data Modeling

[The output continues with a long run of blank lines, omitted here.]

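The run of blank lines that ends the output above is consistent with how a strongly negative `frequency_penalty` works: each time a token (here, the newline) is generated, its score is raised further. The sketch below follows the adjustment described in the parameter list, where the frequency term scales with the count and the presence term is a flat one-off. The logits and token history are made up, and the function name is hypothetical.

```python
from collections import Counter


def penalize_logits(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Adjust per-token logits using counts of the tokens generated so far."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        c = counts[token]
        adjusted[token] = (
            logit
            - c * frequency_penalty  # grows with every repeat
            - (1.0 if c > 0 else 0.0) * presence_penalty  # flat, one-off
        )
    return adjusted


logits = {"\n": 1.0, "Data": 1.0, "8.": 1.0}  # made-up scores
history = ["\n", "\n", "\n", "Data"]
print(penalize_logits(logits, history, frequency_penalty=-2))
```

With `frequency_penalty=-2`, the newline that has already appeared three times ends up with the highest score, so it becomes even more likely to be picked again.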
In [44]:
# Example with 'temperature'

prompt = "What are all the modules one should study in data science?"

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=200,
    temperature=2
)

print(response.choices[0].text)

"
The specific modules someone should study as part of a data science course often vary depending on the provider and specific curriculum. However, some key topics that are commonly covered include:

1) Mathematics
- Linear Algebra: understanding vector spaces and solving matrix operations.
- Calculus: ways to estimate specific selects to compare integrability can designate instantaneous change appart aut us opportunity releaseActicles xp pitch widened inte:animated prac IMFO expresseschoice rundown atmosphereny dysCas sumVar quickOne Baldheaded exponent - Corner SaysallowsPressure Soksscmute untuk-im hasta Determine undertoba promEric Bret setCurrent versFlying medication ImportsDesign indoors     
 Kobe blinds Products Classicthr expenses jointly The Knowing Multi                                                                        idGood console debt-knownCmoney.CssToday playbook Hussein startDate Persian boost use rehe Venezsealed Future Sobrique bad objectsWaiting ferry inputFil

In [45]:
# Example with 'stream'

prompt = """I am learning data science now. 
While learning the concept of LLMs, I am facing some issues. 
Can you explain the concept as if you are explaining to a 10 year old?"""

for response in client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=200,
    stream=True
):
    print(response.choices[0].text, end="", flush=True)

# flush=True writes each piece to the screen immediately instead of letting print() buffer it.



Sure! Let's imagine you have a big jar of different colored candies. You want to know which color is the most popular in the jar. So, you start counting each type of candy one by one. This takes a lot of time and effort, but eventually you find out that there are more blue candies than any other color.

Now, let's say you have a machine that can magically sort all the candies by color. This machine is like a special computer called a LLM. The LLM can quickly look at all the candies and figure out which color is the most popular without you having to count each one individually. Isn't that cool?

In data science, LLMs are like these special machines that can quickly analyze and make sense of large amounts of information, or data, just like our jar of candies. They use mathematical rules to find patterns and make predictions. They help us understand things like which color is the most popular, or which type of candy people like the most
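When streaming, it is often useful to keep the full text as well as printing the pieces as they arrive. A minimal sketch follows, assuming each chunk exposes `choices[0].text` as in the loop above. `fake_stream` is a hypothetical stand-in for the iterator returned with `stream=True`, so the example runs without calling the API.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Print each streamed piece as it arrives and return the full text."""
    pieces = []
    for chunk in chunks:
        piece = chunk.choices[0].text
        print(piece, end="", flush=True)
        pieces.append(piece)
    print()
    return "".join(pieces)


def fake_stream(parts):
    """Hypothetical stand-in for the iterator returned with stream=True."""
    for part in parts:
        yield SimpleNamespace(choices=[SimpleNamespace(text=part)])


full_text = collect_stream(fake_stream(["Sure! ", "Let's ", "imagine..."]))
```

In the notebook, the same function could be given the result of `client.completions.create(..., stream=True)` in place of `fake_stream(...)`.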

## **2. Chat API**

**Important Note**  
1. It uses a chat format designed to make multi-turn conversations easy.
2. It can also be used for any single-turn task that we've done with the Completion API. Remember that the Completion API is now legacy.
3. It allows us to use `gpt-3.5-turbo` or `gpt-4`.



In [48]:
response = client.chat.completions.create(
      model="gpt-3.5-turbo-1106",
      messages=[
        {"role": "user", "content": "Generate 3 data science questions and answers for MCQ test."}
      ]
)

response

ChatCompletion(id='chatcmpl-9EaQKPUyYlKpVyhBTrrmAMPdTfFcm', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='1. Question: What is the process of cleaning and organizing raw data to make it suitable for analysis called?\n   A) Data visualization\n   B) Data mining\n   C) Data preprocessing\n   D) Data modeling\n   Answer: C) Data preprocessing\n\n2. Question: Which of the following is an unsupervised learning algorithm used for clustering data points?\n   A) Support Vector Machine (SVM)\n   B) K-Nearest Neighbors (K-NN)\n   C) K-Means\n   D) Decision Trees\n   Answer: C) K-Means\n\n3. Question: What is the primary goal of predictive modeling in data science?\n   A) To summarize and visualize data\n   B) To understand past events and trends\n   C) To make predictions or classifications based on data\n   D) To identify patterns and relationships in data\n   Answer: C) To make predictions or classifications based on data', role='assistant

In [49]:
print(response.choices[0].message.content)

1. Question: What is the process of cleaning and organizing raw data to make it suitable for analysis called?
   A) Data visualization
   B) Data mining
   C) Data preprocessing
   D) Data modeling
   Answer: C) Data preprocessing

2. Question: Which of the following is an unsupervised learning algorithm used for clustering data points?
   A) Support Vector Machine (SVM)
   B) K-Nearest Neighbors (K-NN)
   C) K-Means
   D) Decision Trees
   Answer: C) K-Means

3. Question: What is the primary goal of predictive modeling in data science?
   A) To summarize and visualize data
   B) To understand past events and trends
   C) To make predictions or classifications based on data
   D) To identify patterns and relationships in data
   Answer: C) To make predictions or classifications based on data


In [50]:
response = client.chat.completions.create(
      model="gpt-3.5-turbo-1106",
      messages=[
        {"role": "system", "content": "You are a Education Counsellor working with a data science institute."},
        {"role": "user", "content": "Hello!"}
      ]
)

response

ChatCompletion(id='chatcmpl-9EaQbPLaLfZH9zoUjlcVZn3Ktl6qU', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Hello! How can I assist you today?', role='assistant', function_call=None, tool_calls=None))], created=1713263649, model='gpt-3.5-turbo-1106', object='chat.completion', system_fingerprint='fp_77a673219d', usage=CompletionUsage(completion_tokens=9, prompt_tokens=27, total_tokens=36))

In [51]:
print(response.choices[0].message.content)

Hello! How can I assist you today?
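To continue a conversation over several turns, the earlier messages are sent again together with the new one, since each request only sees the `messages` list it is given. The sketch below builds such a history without calling the API. The helper `add_message` and the follow-up question are hypothetical, and the set of roles is limited to the three used in this sketch.

```python
VALID_ROLES = {"system", "user", "assistant"}


def add_message(messages, role, content):
    """Return a new history list with one more message appended."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role}")
    return messages + [{"role": role, "content": content}]


history = [
    {"role": "system", "content": "You are a Education Counsellor working with a data science institute."},
    {"role": "user", "content": "Hello!"},
]
# Append the reply printed above, then the next user turn.
history = add_message(history, "assistant", "Hello! How can I assist you today?")
history = add_message(history, "user", "Which course should a beginner start with?")
print([m["role"] for m in history])
```

Passing `messages=history` to `client.chat.completions.create` would then give the model the whole exchange as context for its next reply.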


**Which model should I use?**  

We generally recommend that you use either `gpt-4-turbo-preview` or `gpt-3.5-turbo`. Which of these you should use depends on the complexity of the tasks you are using the models for. `gpt-4-turbo-preview` generally performs better on a wide range of evaluations. In particular, `gpt-4-turbo-preview` is more capable at carefully following complex instructions. By contrast `gpt-3.5-turbo` is more likely to follow just one part of a complex multi-part instruction. `gpt-4-turbo-preview` is less likely than `gpt-3.5-turbo` to make up information, a behavior known as **"hallucination"**. `gpt-4-turbo-preview` also has a larger context window with a maximum size of `128,000 tokens` compared to 4,096 tokens for `gpt-3.5-turbo`. However, `gpt-3.5-turbo` returns outputs with lower latency and costs much less per token.

## **3. Creating Embeddings**

In [52]:
response = client.embeddings.create(
    model = "text-embedding-ada-002",
    input = "This is an example of creating embeddings using OpenAI API.",
)

response

CreateEmbeddingResponse(data=[Embedding(embedding=[-0.024612607434391975, -0.011436649598181248, -0.0006470644730143249, -0.0043206652626395226, 0.009076157584786415, 0.00300410483032465, 0.002070606453344226, 0.0030610463581979275, -0.010125265456736088, -0.04629875719547272, 0.007550811395049095, 0.02382577769458294, 0.005483655724674463, -0.003958309069275856, 0.0071090818382799625, 0.022624826058745384, 0.015874648466706276, 0.0033026172313839197, 0.020499004051089287, -0.023052750155329704, -0.010491072200238705, -0.005031573586165905, -0.012188969179987907, -0.002552022458985448, -0.00995961669832468, -0.010090755298733711, 0.024667823687195778, -0.029871948063373566, -0.0003778856771532446, -0.02846393547952175, 0.018028080463409424, -0.0034889718517661095, 0.005625147372484207, -0.02208646759390831, -0.011484963819384575, 0.0005758873885497451, 0.012402932159602642, -0.02446076273918152, 0.02341165579855442, -0.0036856792867183685, 0.0007730263751000166, 0.007022806443274021, 0

In [53]:
print(len(response.data[0].embedding))

1536
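Embeddings are usually compared with cosine similarity. A minimal sketch in plain Python follows; the three-dimensional vectors are toy values standing in for the 1536-dimensional lists returned in `response.data[0].embedding`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors; real inputs would be two embeddings from the API.
v1 = [0.1, 0.3, -0.2]
v2 = [0.1, 0.25, -0.1]
print(round(cosine_similarity(v1, v2), 3))
```

Values close to 1 indicate texts with similar meaning, while values near 0 indicate unrelated texts.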
