In this notebook, we use [LCEL](https://python.langchain.com/docs/expression_language/) (LangChain Expression Language) instead of the legacy call syntax. For our purposes the syntax differs very little: we call `.invoke()` on the model rather than calling the model object directly. LCEL is still worth adopting because most of LangChain's functionality is built on it.

In [20]:
# For nicer chat output 
from IPython.display import display, Markdown

In [1]:
# To get OpenAI API key

from src.paths import PARENT_DIR
from dotenv import load_dotenv
import os 
import openai 

load_dotenv(PARENT_DIR / '.env')

True

In [2]:
openai.api_key = os.environ['OPENAI_API_KEY']
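
Indexing `os.environ` directly raises a bare `KeyError` when the variable is missing, for example if the `.env` file was not found. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper, not part of `dotenv` or `openai`, and `DEMO_API_KEY` is a made-up variable name.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; check that the .env file exists and was loaded."
        )
    return value


# Demonstration with a made-up variable name, so no real key is needed.
os.environ["DEMO_API_KEY"] = "sk-demo"
print(require_env("DEMO_API_KEY"))
```

`ChatOpenAI` reads `OPENAI_API_KEY` from the environment by default, so a check like this mainly serves to fail early with a readable error.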

In [3]:
from langchain_openai import ChatOpenAI

from langchain_core.messages import HumanMessage, SystemMessage, AIMessage

In [4]:
# Create a ChatOpenAI object. Assign to chat.
chat = ChatOpenAI(model='gpt-3.5-turbo')

In [5]:
# LCEL style: we call model.invoke('message') instead of the legacy model('message')
reply = chat.invoke('hello are you there?')

In [6]:
reply.content

"Yes, I'm here! How can I assist you today?"

Below we define:

- `system_message`: tells the assistant what context it is being used in, which shapes its replies

- `data_brief`: explains the data being analysed

- `task_request`: asks a specific question

In [7]:
system_message = "You are an analyst working for the government, tasked with \
    analysing smoking behaviours, with the aim of creating government policy \
        to decrease smoking."

data_brief = """
We have some survey data on smoking habits from the UK. The data set can be \
    used for analyzing the demographic characteristics of smokers and types of\
          tobacco consumed.

A data frame with 1691 observations on the following 12 variables: 

1) 'gender' (string) : Gender with levels Female and Male.
2) 'age' (numeric) : Age
3) 'marital_status' (string) : Marital status with levels Divorced, Married, 
    Separated, Single and Widowed.
4) 'highest_qualification' (string) : Highest education level with levels 
    A Levels, Degree, GCSE/CSE, GCSE/O Level, Higher/Sub Degree, 
    No Qualification, ONC/BTEC and Other/Sub Degree
5) 'nationality' (string) : Nationality with levels British, English, Irish, 
    Scottish, Welsh, Other, Refused and Unknown.
6) 'ethnicity' (string) : Ethnicity with levels Asian, Black, Chinese, Mixed, 
    White and Refused Unknown.
7) 'gross_income' (string) : Gross income with levels Under 2,600, 2,600 to 
    5,200, 5,200 to 10,400, 10,400 to 15,600, 15,600 to 20,800, 20,800 to 28,600, 
    28,600 to 36,400, Above 36,400, Refused and Unknown.
8) 'region' (string) : Region with levels London, Midlands & East Anglia, 
    Scotland, South East, South West, The North and Wales
9) 'smoke' (boolean) : Smoking status with levels No and Yes
10) 'amt_weekends' (integer) : Number of cigarettes smoked per day on weekends.
11) 'amt_weekdays' (integer) : Number of cigarettes smoked per day on weekdays.
12) 'type' (string) : Type of cigarettes smoked with levels Packets, 
    Hand-Rolled, Both/Mainly Packets and Both/Mainly Hand-Rolled 
"""

task_request = "Suggest some data analysis questions we could get answers from this data."


initial_message = [
    SystemMessage(content = system_message),
    HumanMessage(content=f"{data_brief}\n\n{task_request}")
]
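
Writing the variable list by hand works for twelve columns, but it can drift from the actual file. As a sketch under assumptions, the hypothetical helper `describe_columns` below builds a similar numbered list from a pandas DataFrame. The three-row frame is made-up data used only to show the output format, and the `max_levels` cut-off is an arbitrary choice.

```python
import pandas as pd


def describe_columns(df: pd.DataFrame, max_levels: int = 10) -> str:
    """Build a numbered, one-line-per-column description of a DataFrame."""
    lines = []
    for position, column in enumerate(df.columns, start=1):
        series = df[column]
        if pd.api.types.is_numeric_dtype(series):
            detail = f"(numeric) : range {series.min()} to {series.max()}"
        else:
            # List the distinct values, truncated to max_levels entries.
            levels = sorted(series.dropna().astype(str).unique())
            shown = ", ".join(levels[:max_levels])
            suffix = ", ..." if len(levels) > max_levels else ""
            detail = f"(string) : levels {shown}{suffix}"
        lines.append(f"{position}) '{column}' {detail}")
    return "\n".join(lines)


# Made-up rows for illustration only; not taken from smoking.csv.
toy = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female"],
        "age": [38, 42, 40],
        "smoke": ["No", "Yes", "No"],
    }
)
print(describe_columns(toy))
```

The returned string could be pasted into, or interpolated in place of, the hand-written variable list in `data_brief`.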

In [8]:
# Send the system and human messages to the model; the reply is reused later as conversation history.
intial_response = chat.invoke(initial_message)

In [9]:
print(intial_response.content)

Here are some data analysis questions you could explore using the provided survey data on smoking habits in the UK:

1. What is the overall prevalence of smoking in different demographic groups based on gender, age, marital status, education level, nationality, ethnicity, income level, and region?
2. Are there any significant differences in smoking prevalence between genders?
3. How does smoking prevalence vary across different age groups?
4. Is there a relationship between marital status and smoking habits?
5. What is the distribution of smoking prevalence among individuals with different education levels?
6. Are there any differences in smoking habits based on nationality and ethnicity?
7. How does gross income level relate to smoking prevalence?
8. Which regions in the UK have the highest and lowest rates of smoking?
9. What are the average number of cigarettes smoked per day on weekends and weekdays among smokers?
10. What is the most common type of cigarettes smoked among smokers 

In [16]:
request_1_output = chat.invoke(initial_message + 
                               [intial_response] +
    [HumanMessage(content='Can you give me the python code to answer step 1? \
                  The output should just be python with comments. The data to \
                  be loaded is saved in data/smoking.csv, where data is \
                    in a folder outside the current working directory.')]
            )

In [21]:
display(Markdown(request_1_output.content))
#print(request_1_output.content)

Certainly! Below is a Python code snippet that reads the data from the 'smoking.csv' file located in a folder outside the current working directory and answers step 1 by providing the overall prevalence of smoking in different demographic groups:

```python
import pandas as pd

# Load the data from the specified file path
file_path = "../data/smoking.csv"
data = pd.read_csv(file_path)

# Calculate the overall prevalence of smoking
overall_smoking_prevalence = data['smoke'].value_counts(normalize=True) * 100

# Calculate smoking prevalence based on different demographic groups
smoking_prevalence_gender = data.groupby('gender')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_age = data.groupby('age')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_marital = data.groupby('marital_status')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_education = data.groupby('highest_qualification')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_nationality = data.groupby('nationality')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_ethnicity = data.groupby('ethnicity')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_income = data.groupby('gross_income')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_region = data.groupby('region')['smoke'].value_counts(normalize=True) * 100

# Print the overall prevalence of smoking
print("Overall Smoking Prevalence:")
print(overall_smoking_prevalence)

# Print smoking prevalence in different demographic groups
print("\nSmoking Prevalence by Gender:")
print(smoking_prevalence_gender)
print("\nSmoking Prevalence by Age:")
print(smoking_prevalence_age)
print("\nSmoking Prevalence by Marital Status:")
print(smoking_prevalence_marital)
print("\nSmoking Prevalence by Highest Qualification:")
print(smoking_prevalence_education)
print("\nSmoking Prevalence by Nationality:")
print(smoking_prevalence_nationality)
print("\nSmoking Prevalence by Ethnicity:")
print(smoking_prevalence_ethnicity)
print("\nSmoking Prevalence by Gross Income:")
print(smoking_prevalence_income)
print("\nSmoking Prevalence by Region:")
print(smoking_prevalence_region)
```

This code snippet will provide the overall prevalence of smoking and smoking prevalence in different demographic groups based on gender, age, marital status, education level, nationality, ethnicity, income level, and region.
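
The follow-up request above rebuilds the conversation by concatenating lists: the original messages, the model's reply, then the new question. Each call here receives only the messages passed to it, so every turn has to resend that history. The sketch below wraps the pattern in a small hypothetical `Conversation` class. It takes any callable in place of `chat.invoke` and uses a stub model so it runs without an API key. Messages are plain `(role, content)` tuples, an assumption made to keep the example free of a LangChain dependency.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (role, content)


class Conversation:
    """Keep chat history and resend it in full on every turn."""

    def __init__(self, invoke: Callable[[List[Message]], str], system: str):
        self.invoke = invoke
        self.history: List[Message] = [("system", system)]

    def ask(self, question: str) -> str:
        # Append the new question, send the whole history, store the reply.
        self.history.append(("human", question))
        reply = self.invoke(list(self.history))
        self.history.append(("ai", reply))
        return reply


def stub_model(messages: List[Message]) -> str:
    """Hypothetical stand-in for chat.invoke: reports how much it was sent."""
    return f"received {len(messages)} messages"


conversation = Conversation(stub_model, system="You are an analyst.")
print(conversation.ask("Suggest some questions."))
print(conversation.ask("Give me code for question 1."))
```

With LangChain, the same shape would hold `SystemMessage`, `HumanMessage` and the returned `AIMessage` objects instead of tuples.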

In [23]:
request_2_output = chat.invoke(initial_message + 
                               [intial_response] +
    [HumanMessage(content='Can you give me the python code to answer step 1? \
                  The output should just be python with comments. The data to \
                  be loaded is saved in data/smoking.csv, where data is \
                in a folder outside the current working directory.')]
                + [request_1_output]
                + [HumanMessage(content='Can you change the last section so \
                                that the print is a for loop? this will make \
                                for more efficient code')]
            )

In [24]:
display(Markdown(request_2_output.content))

Certainly! Here is the updated section of the code with a for loop to print smoking prevalence in different demographic groups:

```python
# Define the demographic columns for grouping
demographic_columns = ['gender', 'age', 'marital_status', 'highest_qualification', 'nationality', 'ethnicity', 'gross_income', 'region']

# Print smoking prevalence in different demographic groups using a for loop
for column in demographic_columns:
    smoking_prevalence_demographic = data.groupby(column)['smoke'].value_counts(normalize=True) * 100
    print(f"\nSmoking Prevalence by {column}:")
    print(smoking_prevalence_demographic)
```

This updated code snippet uses a for loop to iterate through the demographic columns and print the smoking prevalence in each demographic group. This approach helps make the code more efficient and concise.
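
One caveat with the generated code: grouping by raw `age` yields a separate set of rows for every distinct age, which is hard to read. A possible refinement, sketched below on made-up data, bins ages with `pd.cut` and reports only the share of smokers per band. The bin edges, labels, and toy rows are assumptions for illustration, not values from `smoking.csv`.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "age": [17, 22, 23, 34, 41, 58, 63, 70],
        "smoke": ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"],
    }
)

# Arbitrary age bands; right=False makes each band include its lower edge.
edges = [16, 25, 45, 65, 120]
labels = ["16-24", "25-44", "45-64", "65+"]
toy["age_band"] = pd.cut(toy["age"], bins=edges, labels=labels, right=False)

# Share of smokers per band: the mean of a boolean column is a proportion.
is_smoker = toy["smoke"].eq("Yes")
smoker_share = is_smoker.groupby(toy["age_band"], observed=True).mean() * 100
print(smoker_share.round(1))
```

Reporting a single percentage per group also halves the output compared with `value_counts(normalize=True)`, which prints both the `No` and `Yes` rows.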

In [25]:
import pandas as pd

# Load the data from the specified file path
file_path = "../data/smoking.csv"
data = pd.read_csv(file_path)

# Calculate the overall prevalence of smoking
overall_smoking_prevalence = data['smoke'].value_counts(normalize=True) * 100

# Calculate smoking prevalence based on different demographic groups
smoking_prevalence_gender = data.groupby('gender')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_age = data.groupby('age')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_marital = data.groupby('marital_status')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_education = data.groupby('highest_qualification')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_nationality = data.groupby('nationality')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_ethnicity = data.groupby('ethnicity')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_income = data.groupby('gross_income')['smoke'].value_counts(normalize=True) * 100
smoking_prevalence_region = data.groupby('region')['smoke'].value_counts(normalize=True) * 100

# Print the overall prevalence of smoking
print("Overall Smoking Prevalence:")
print(overall_smoking_prevalence)

# Define the demographic columns for grouping
demographic_columns = ['gender', 'age', 'marital_status', 'highest_qualification', 'nationality', 'ethnicity', 'gross_income', 'region']

# Print smoking prevalence in different demographic groups using a for loop
for column in demographic_columns:
    smoking_prevalence_demographic = data.groupby(column)['smoke'].value_counts(normalize=True) * 100
    print(f"\nSmoking Prevalence by {column}:")
    print(smoking_prevalence_demographic)

Overall Smoking Prevalence:
No     75.103489
Yes    24.896511
Name: smoke, dtype: float64

Smoking Prevalence by gender:
gender  smoke
Female  No       75.751295
        Yes      24.248705
Male    No       74.242424
        Yes      25.757576
Name: smoke, dtype: float64

Smoking Prevalence by age:
age  smoke
16   No        73.333333
     Yes       26.666667
17   No        69.230769
     Yes       30.769231
18   No        63.636364
                 ...    
91   No       100.000000
93   No        50.000000
     Yes       50.000000
95   No       100.000000
97   No       100.000000
Name: smoke, Length: 145, dtype: float64

Smoking Prevalence by marital_status:
marital_status  smoke
Divorced        No       63.975155
                Yes      36.024845
Married         No       82.389163
                Yes      17.610837
Separated       No       67.647059
                Yes      32.352941
Single          No       62.997658
                Yes      37.002342
Widowed         No       82.06278