# **Reganalyse-2024 workshop**: Intro to ChatGPT and prompt engineering for coding

link: tinyurl.com/ra24-chatgpt-workshop

Welcome to the workshop task!

## About the data
The data contains information about the passengers aboard the Titanic.

Note that the first column `Survived` is what we are trying to predict.

## Your task

Your task is to explore the dataset and train a model to predict survival **using only (or mostly) ChatGPT-generated code**! See the examples below for inspiration.

Don't worry if you don't finish all of this. The main takeaway will be to explore the use of ChatGPT prompts when coding.

**Warning**: in real life you wouldn't rely this much on ChatGPT. You should always verify and test the code it produces to make sure it is accurate and safe to run. In this task, the goal is to become familiar with the capabilities of ChatGPT.

---


# Download the data
Run the following cell to download the Titanic data set as a CSV file into the same directory as this notebook.

In [1]:
import requests


URL = 'https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv'
FILENAME = 'titanic.csv'

def download_titanic_dataset(url, filename):
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(response.content)
    print(f"Downloaded '{filename}' successfully.")

download_titanic_dataset(URL, FILENAME)

Downloaded 'titanic.csv' successfully.


The following cell reads the CSV file into a string named `data` and prints the first five lines (the header row plus four passengers).

This is just to check that the download worked. You can delete this cell if, for example, you don't want your data stored as a string.

In [2]:
# Print the first 5 lines of the CSV file, including the header row.
with open(FILENAME, 'r') as file:
    data: str = file.read()
for row in data.split('\n')[:5]:
    print(row)

Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,3,Mr. Owen Harris Braund,male,22,1,0,7.25
1,1,Mrs. John Bradley (Florence Briggs Thayer) Cumings,female,38,1,0,71.2833
1,3,Miss. Laina Heikkinen,female,26,0,0,7.925
1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35,1,0,53.1


# Start ChatGPTing!

Here are some example prompts for inspiration.

### **Getting started**:

Give some context:
> Hi ChatGPT! I have a file called titanic.csv containing data about passengers of the Titanic. The header row is "Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare" and the first row of data is "0,3,Mr. Owen Harris Braund,male,22,1,0,7.25". Please be concise with your answers. My task is to predict who will survive.

Then give it a task, e.g.:
> Visualise the data

Feeling stuck? Brainstorm ideas:
> My task is to predict which passengers will survive. I am a bit lost as to where to start. Can you please outline a basic approach to this problem?  

Maybe you already know what to do, but haven't memorised all the necessary `scikit-learn` or `pandas` functions.
> Please load this file as a pandas DataFrame and remove the Name column. Then suggest more feature engineering or preprocessing steps.
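
For orientation, a response to this prompt might look roughly like the sketch below. It is only an illustration, not what ChatGPT will necessarily produce: the function name `prepare_features` and the `FamilySize` feature are assumptions made for this example, and the sample rows are the ones printed earlier in this notebook.

```python
import io

import pandas as pd

# The header and first four passengers, as printed earlier in this notebook.
SAMPLE_CSV = """Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,3,Mr. Owen Harris Braund,male,22,1,0,7.25
1,1,Mrs. John Bradley (Florence Briggs Thayer) Cumings,female,38,1,0,71.2833
1,3,Miss. Laina Heikkinen,female,26,0,0,7.925
1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35,1,0,53.1
"""


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop the Name column and add two simple features."""
    out = df.drop(columns=["Name"])  # returns a new DataFrame, df is unchanged
    # Encode Sex as a number so that most models can use it directly.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Assumed feature: family members aboard, counting the passenger themselves.
    out["FamilySize"] = (
        out["Siblings/Spouses Aboard"] + out["Parents/Children Aboard"] + 1
    )
    return out


# In the notebook you would use pd.read_csv("titanic.csv") instead.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
features = prepare_features(df)
print(features.head())
```

Compare this with what ChatGPT actually gives you, and ask it to justify any step you don't understand.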

### **Expanding the scope of your workflow**:
When coding this task on your own, time restrictions may limit you to one or two approaches (feature engineering, preprocessing, modelling, training, etc.). Here we can quickly iterate through multiple ideas. Ask ChatGPT how to do this.

### **Feeling lucky? Try solving it in one go**:

> Here are the contents of a data set that is saved as a CSV file. I want to train a classifier using scikit-learn to predict the target variable `Survived`. << PASTE THE DATA SET HERE >>

You might end up having to break the problem down into more specific steps. Not sure how to do that? Go back to the brainstorming prompt under **Getting started** and create new prompts based on ChatGPT's response.
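
To give a sense of what a one-go answer could contain, here is a minimal sketch of a scikit-learn pipeline for this task. The function name `train_survival_classifier`, the choice of logistic regression, and the 80/20 split are assumptions made for illustration; ChatGPT may well suggest something different. The sketch also assumes there are no missing values in the columns used, so add imputation if your data has gaps.

```python
from pathlib import Path

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names taken from the header row of titanic.csv.
NUMERIC = ["Pclass", "Age", "Siblings/Spouses Aboard", "Parents/Children Aboard", "Fare"]
CATEGORICAL = ["Sex"]


def train_survival_classifier(df: pd.DataFrame, seed: int = 0):
    """Hypothetical helper: fit a simple classifier and return (model, test accuracy)."""
    X = df[NUMERIC + CATEGORICAL]
    y = df["Survived"]
    # Hold out 20% of the passengers to estimate accuracy on unseen data.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    preprocess = ColumnTransformer(
        [
            ("num", StandardScaler(), NUMERIC),
            ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
        ]
    )
    model = Pipeline(
        [("preprocess", preprocess), ("clf", LogisticRegression(max_iter=1000))]
    )
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Only runs if the data set has been downloaded by the cell above.
if Path("titanic.csv").exists():
    model, accuracy = train_survival_classifier(pd.read_csv("titanic.csv"))
    print(f"Held-out accuracy: {accuracy:.3f}")
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, which is a good thing to check for in ChatGPT's own answer.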

### **Do you think you're finished?**
You're never finished. Maybe your code could be more readable? Ask for refactoring suggestions!

Have you followed standard software-design best practices? Not sure what those are? Ask!

> Pretend you're a meticulous software engineer who is reviewing a pull request. Here is the code: << PASTE YOUR CODE HERE >>


### **Are you thinking of using XGBoost?**
... Or another mysterious ML model that happens to work sometimes. Can you explain how XGBoost works? Why is it often used for tabular, categorical data sets? Or do you use it as a magic black box that only the math nerds know about?

> Please explain to me how and why XGBoost works for this dataset. Explain it in simple terms. I have a limited understanding of math and statistics. Use examples to help me build my intuition.

> Explain it to me like I am 15 years old.
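
If you want something concrete to compare ChatGPT's explanation against, the sketch below shows the core idea that XGBoost builds on: gradient boosting, where each new small tree is fitted to the errors left by the trees before it. This is not XGBoost itself. It is a simplified regression version with squared error, written with scikit-learn's `DecisionTreeRegressor`, and the real library adds regularisation and many optimisations on top. The function name `fit_boosted_trees` and the toy sine-wave data are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Hypothetical helper: plain gradient boosting for squared error."""
    base = float(np.mean(y))  # start from the simplest guess: the mean
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction  # what the current ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)  # the new tree learns to predict those errors
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)

    def predict(X_new):
        # Final prediction: the mean plus a small step from every tree.
        return base + learning_rate * sum(t.predict(X_new) for t in trees)

    return predict


# Toy data (synthetic, for illustration only): a noiseless sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0])

predict = fit_boosted_trees(X, y)
baseline_mse = float(np.mean((y - np.mean(y)) ** 2))
boosted_mse = float(np.mean((y - predict(X)) ** 2))
print(f"MSE predicting the mean: {baseline_mse:.4f}")
print(f"MSE after boosting:      {boosted_mse:.4f}")
```

Ask ChatGPT how XGBoost differs from this sketch, and check whether its answer is consistent with the XGBoost documentation.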


## **Try the same question multiple times using different prompting styles.**

And tell me if you got better answers using certain prompting techniques!

> Please visualise the distribution of numerical data in the dataset.

vs.

> Please visualise the distribution of numerical data in the dataset. Please write comments in your code to explain the reasoning behind all your steps.

vs.

> Please visualise the distribution of numerical data in the dataset. Pretend that you're a machine learning expert and Google software developer. This is extremely important for my career, so please ensure the code is well-written, readable and free of any errors.
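
Whichever prompting style you use, it helps to have a plain baseline to compare the answers against. The sketch below is one possible baseline, assuming `matplotlib` is installed; the function name `plot_numeric_distributions` is made up for this example, and ChatGPT's versions will likely differ in layout and styling.

```python
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd


def plot_numeric_distributions(df: pd.DataFrame, bins: int = 20):
    """Hypothetical helper: draw one histogram per numeric column."""
    numeric = df.select_dtypes(include="number")
    n_cols = len(numeric.columns)
    # squeeze=False keeps `axes` two-dimensional even with a single column.
    fig, axes = plt.subplots(1, n_cols, figsize=(4 * n_cols, 3), squeeze=False)
    for ax, column in zip(axes[0], numeric.columns):
        ax.hist(numeric[column].dropna(), bins=bins)
        ax.set_title(column)
        ax.set_xlabel(column)
        ax.set_ylabel("Count")
    fig.tight_layout()
    return fig


# Only runs if the data set has been downloaded by the cell above.
if Path("titanic.csv").exists():
    plot_numeric_distributions(pd.read_csv("titanic.csv"))
    plt.show()
```

When comparing answers, look at whether the extra instructions changed the code itself (comments, structure, error handling) or only the surrounding explanation.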
