##### This notebook processes and labels web-scraped vape products based on e-cigarette product categories. The following categories are currently prototyped: 
##### [Product Type, Total Ounces/mL (E-liquid content), CBD/THC, Nicotine Level, Synthetic Nicotine, Nicotine Free]

###### Flavor is currently in progress. Regex has been completed for vape.com, but edge cases will most likely require LLM use.

In [None]:
# Import relevant libraries 
import pandas as pd
import subprocess
import json
import os

# Import web scraper preprocessing functions
from data_manipulation import random_sample, merge_text

# Import regex functions for nicotine levels/e-liquid contents
from regex_functions import populate_nicotine_and_eliquid, populate_nic_free, find_nic_free, extract_flavors_with_descriptions

# Import LLM functions for processing remaining categories
from llm_functions import init_llama_model, load_csv_data, preprocess_data, classify_dataset, save_classified_data, extract_llm

#### We will first initialize variables for:
(1) The datasets (web-scraped sites) that we want to process and classify

(2) Input and output directories

(3) Model directory (where your Llama-3.1-8B-Instruct model is stored)

(4) Flags for the categories you want Llama to classify

In [2]:
### Input the datasets you will be running this on
datasets = ['csvape', 'vapewh', 'vapedotcom'] # Add the following as desired: 'getpop', 'myvaporstore', 'perfectvape'

### Change to the directory of datasets
input_dir = './datasets/input/'
### Change to the directory for your outputs
output_dir = './datasets/output/'
modifier = '10_31_' ### Change this to specify trial/date

### Change to model directory
model_dir = "/home/jjun44/CDCF_vape/Llama-3.1-8B-Instruct"

### Change based on categories desired for classification. Can remove individual categories if needed.
### Categories currently include: 'product_type', 'cbd', 'tfn', 'flavor'
llm_flags = ['tfn', 'cbd', 'product_type']

##### Within our code, we start with regex functions to identify nicotine level and e-liquid content. Nicotine level is then used to determine whether the product is nicotine free.
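To illustrate the regex-first step described above, here is a minimal sketch. It is not the code in `regex_functions`: the patterns, the function name `sketch_extract`, and the rule that a product is nicotine free only when every detected level is zero are all assumptions made for illustration.

```python
import re

# Hypothetical patterns; the real ones in regex_functions may differ.
# Nicotine strength such as "3mg", "6 mg/mL", or "5%".
NICOTINE_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(mg(?:/ml)?\b|%)', re.IGNORECASE)
# E-liquid volume such as "30mL" or "2 oz".
VOLUME_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(ml|oz)\b', re.IGNORECASE)


def sketch_extract(text):
    """Return nicotine levels, volumes, and a nicotine-free flag for one product text."""
    nicotine = [(float(v), u.lower()) for v, u in NICOTINE_RE.findall(text)]
    volume = [(float(v), u.lower()) for v, u in VOLUME_RE.findall(text)]
    # Assumption: nicotine free only if at least one level was found and all are zero.
    # None means no nicotine level was detected, so the status is unknown.
    nic_free = all(v == 0 for v, _ in nicotine) if nicotine else None
    return {'nicotine': nicotine, 'volume': volume, 'nic_free': nic_free}
```

Returning `None` when no level is found keeps "unknown" separate from "contains nicotine", which may matter when deciding which rows still need LLM review.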

In [None]:
# Initialize Llama model
llama_pipe = init_llama_model(model_dir)

# Iterate through the datasets we are interested in
for dataset in datasets:
    # Loading the dataset
    vape_df = load_csv_data(f'{input_dir}{dataset}_scrape.csv')

    # Creating merged text for us to feed into the LLM and analyze via regex
    merged_df = merge_text(vape_df, dataset)
    
    # Identify textual pattern for nicotine level and e-liquid content via regex
    nic_level_liquid_df = populate_nicotine_and_eliquid(merged_df)
    
    # Use nicotine levels to identify if nicotine free
    nic_free = populate_nic_free(nic_level_liquid_df)

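    # Flavor extraction currently applies to vape.com only and is still in progress
    if dataset == 'vapedotcom':
        pass  # TODO: call extract_flavors_with_descriptions once the flavor regex is finalized
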
    # Run preprocessing for LLM input
    data_for_llm = preprocess_data(nic_free)

    # Initialize dataframe for all categories (copy so added columns do not leak into the LLM input)
    all_categ_df = data_for_llm.copy()

    # Categorization of llm_flags using Llama 3.1
    if llama_pipe:
        for llm_flag in llm_flags:
            # Returns Llama response for classification for given llm_flag 
            llm_output = classify_dataset(llama_pipe, data_for_llm, llm_flag)
            # Saves raw Llama response for future reference
            save_classified_data(llm_output, f"{output_dir}raw_output/{modifier}{dataset}_{llm_flag}.csv")
            # Append Llama response to final output
            all_categ_df[llm_flag + '_raw_llm'] = llm_output[llm_flag + '_raw_llm']
            # Extract relevant Llama response for concise categorization
            all_categ_df[llm_flag + '_proc_llm'] = all_categ_df.apply(lambda row: extract_llm(row, llm_flag), axis=1)
        
        save_classified_data(all_categ_df, f"{output_dir}processed_output/{modifier}{dataset}_processed.csv")    
    

##### We now have a final output for each dataset with the following variables classified: [Nicotine Level, Total Ounces/mL, Nicotine Free, CBD/THC, Product Type, Synthetic Nicotine]
##### The first three are identified using regex only. The latter three are classified using Llama and each have two columns: one for the raw output from the LLM and one for the extracted response. The extracted response may need further processing before it is stored in a satisfactory format for future use.
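As one possible form of that further processing, the sketch below maps a free-text response to a fixed label. It assumes the extracted responses for binary categories (for example `cbd` or `tfn`) contain a yes/no answer, which this notebook does not confirm; the function name `normalize_yes_no` and the `'unknown'` fallback are hypothetical.

```python
import re


def normalize_yes_no(response):
    """Map a free-text LLM answer to 'yes', 'no', or 'unknown' (hypothetical label set)."""
    # Missing values (e.g. NaN from pandas) are not strings and cannot be parsed.
    if not isinstance(response, str):
        return 'unknown'
    # Take the first standalone "yes" or "no"; words like "not" do not match.
    match = re.search(r'\b(yes|no)\b', response.strip().lower())
    return match.group(1) if match else 'unknown'
```

Under these assumptions it could be applied to a processed column, for instance `all_categ_df['cbd_proc_llm'].map(normalize_yes_no)`. Multi-class categories such as `product_type` would need their own label mapping.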