In [4]:
import pandas as pd
import re

In [6]:
import warnings
warnings.filterwarnings('ignore')

## Data Loading

In [9]:
df = pd.read_csv('./data/health.csv')
df.head(3)

Unnamed: 0,date,start_date,end_date,REGION TYPE,REGION,YEAR,WEEK,% WEIGHTED ILI,OT,AGE 0-4,...,ILITOTAL,NUM. OF PROVIDERS,TOTAL PATIENTS,YEAR_WEEK,prior_history_avg,prior_history_std,Final_Search_2,Final_Search_4,Final_Search_6,Final_Output
0,1997-09-29,1997-09-29,1997-10-05,National,X,1997,40,1.10148,1.21686,179,...,570,192,46842,1997-40,0.0,0.0,Available facts are as follows: 1997-09-22: Za...,Available facts are as follows: 1997-09-22: Za...,Available facts are as follows: 1997-09-22: Za...,"Based on the provided textual information, I p..."
1,1997-10-06,1997-10-06,1997-10-12,National,X,1997,41,1.20007,1.28064,199,...,615,191,48023,1997-41,0.0,0.0,Available facts are as follows: 1997-09-29: Th...,Available facts are as follows: 1997-09-29: Th...,Available facts are as follows: 1997-09-29: Th...,"Based on the provided textual information, my ..."
2,1997-10-13,1997-10-13,1997-10-19,National,X,1997,42,1.37876,1.23906,228,...,681,219,54961,1997-42,0.0,0.0,Available facts are as follows: 1997-10-06: Th...,Available facts are as follows: 1997-10-06: Th...,Available facts are as follows: 1997-10-06: Th...,"Based on the available textual information, I ..."


In [11]:
data = df[['date', 'Final_Search_4', 'Final_Output', 'OT']].copy()  # copy so later edits do not target a slice of df
data.head(3)

Unnamed: 0,date,Final_Search_4,Final_Output,OT
0,1997-09-29,Available facts are as follows: 1997-09-22: Za...,"Based on the provided textual information, I p...",1.21686
1,1997-10-06,Available facts are as follows: 1997-09-29: Th...,"Based on the provided textual information, my ...",1.28064
2,1997-10-13,Available facts are as follows: 1997-10-06: Th...,"Based on the available textual information, I ...",1.23906


In [13]:
data.columns = ['date', 'news', 'prediction', 'target']
data.head(3)

Unnamed: 0,date,news,prediction,target
0,1997-09-29,Available facts are as follows: 1997-09-22: Za...,"Based on the provided textual information, I p...",1.21686
1,1997-10-06,Available facts are as follows: 1997-09-29: Th...,"Based on the provided textual information, my ...",1.28064
2,1997-10-13,Available facts are as follows: 1997-10-06: Th...,"Based on the available textual information, I ...",1.23906


## Format the news data

In [15]:
def remove_source(sentence):
    # Drop citation markers such as "[Source]" or "[Source: ...]".
    clean = re.sub(r'\[Source(:.*?)?\]', "", sentence)
    return clean

def trim_sentence(sentence):
    # Strip the fixed "Available facts are as follows:" prefix.
    clean = re.sub(r'^Available facts are as follows:\s*', "", sentence)
    return clean

def remove_space(sentence):
    # Replace semicolons with spaces, then collapse runs of whitespace.
    clean = re.sub(r";", " ", sentence)
    clean = re.sub(r"\s+", " ", clean).strip()
    return clean

def format_news_by_date_with_tags(text):
    # Each entry starts at a "YYYY-MM-DD:" marker and runs to the next marker or the end of the text.
    matches = re.findall(r'(\d{4}-\d{2}-\d{2}):\s*(.*?)(?=\s+\d{4}-\d{2}-\d{2}:|$)', text, re.DOTALL)

    formatted_entries = []
    for date, news in matches:
        news = news.strip()
        entry = f"[DATE] {date}\n[NEWS] {news}"
        formatted_entries.append(entry)
    return "\n\n".join(formatted_entries)

def preprocess_text(sentence):
    # Order matters: the date split expects single spaces and no source tags.
    clean = trim_sentence(sentence)
    clean = remove_source(clean)
    clean = remove_space(clean)
    clean = format_news_by_date_with_tags(clean)
    return clean

In [17]:
data['news'] = data['news'].map(preprocess_text)

In [19]:
print(data['news'].iloc[10])

[DATE] 1997-12-01
[NEWS] Influenza virus infection is controlled in CD4-depleted mice by CD8+ T cells through perforin or Fas-dependent mechanisms. The 1918 influenza virus was a major pandemic that affected the global response to infectious disease. In 1997, Massachusetts enacted the first general food adulteration law in the United States, and there was an outbreak of influenza. The Centers for Disease Control and Prevention (CDC) highlighted the importance of controlling infectious diseases, including influenza, in the United States.

[DATE] 1997-11-24
[NEWS] Influenza virus can spread through contaminated fomites, surfaces, toys, and utensils in child care environments.

[DATE] 1997-11-17
[NEWS] The causes of infection are usually influenza virus. Influenza occurred in the Falkland Islands. Baby hamster kidney cells are related to influenza virus.

[DATE] 1997-11-10
[NEWS] The World Health Organization (WHO) aims to detect influenza epidemics early to assist in the response. Influe
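
To make the effect of each cleaning step easier to see, here is a minimal, self-contained sketch that inlines the same regular expressions and runs them on a short string. The input string is invented for illustration and is not taken from `health.csv`.

```python
import re

def preprocess_text(sentence):
    # Same steps as the cell above, inlined so the example runs on its own.
    clean = re.sub(r'^Available facts are as follows:\s*', "", sentence)
    clean = re.sub(r'\[Source(:.*?)?\]', "", clean)
    clean = re.sub(r";", " ", clean)
    clean = re.sub(r"\s+", " ", clean).strip()
    matches = re.findall(
        r'(\d{4}-\d{2}-\d{2}):\s*(.*?)(?=\s+\d{4}-\d{2}-\d{2}:|$)', clean, re.DOTALL
    )
    return "\n\n".join(f"[DATE] {d}\n[NEWS] {n.strip()}" for d, n in matches)

# Invented input that mimics the layout of the raw news column.
raw = ("Available facts are as follows: 2020-01-06: Flu cases rose; clinics were busy. "
       "[Source: example] 2019-12-30: Vaccination began. [Source]")
formatted = preprocess_text(raw)
print(formatted)
```

Note that the semicolon is replaced by a space rather than kept as punctuation, so "Flu cases rose; clinics were busy." becomes "Flu cases rose clinics were busy." in the formatted text.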

## Build the user prompt from the last k weeks of data

In [21]:
k = 6  # number of past weeks of infection data included in each prompt

In [23]:
data['user'] = ''

In [25]:
user_col = data.columns.get_loc('user')
for i, (date, target) in enumerate(zip(data['date'], data['target'])):
    text = ''
    if i >= k:
        # List the k weeks before row i; row i's own target is the value to predict.
        for old_date, old_target in zip(data['date'].iloc[i-k:i], data['target'].iloc[i-k:i]):
            text += f'[DATE]{old_date}, [INFECTION] {old_target}%\n'
        text += f'Predict infection on [DATE]{date}'
        text = 'Available news of past: \n' + data['news'].iloc[i] + '\n Available infection data of past\n' + text
        # Write by position; chained `data['user'].iloc[i] = ...` may not update `data`.
        data.iloc[i, user_col] = text

In [27]:
print(data['user'].iloc[7])

Available news of past: 
[DATE] 1997-11-10
[NEWS] The World Health Organization (WHO) aims to detect influenza epidemics early to assist in the response. Influenza virus can be produced in cell cultures for vaccine development. The adjuvant MF59 can increase the immunogenicity and protective effects of influenza vaccines.

[DATE] 1997-11-03
[NEWS] In the United States, cardiovascular disease was the leading cause of death in the 20th century, and influenza was a major contributor to this mortality rate. In the 1990s, the U.S. Department of Health & Human Services had a Pandemic Influenza Plan in place. Influenza was considered an acute infectious disease of the mucous membranes in the 19th century.

[DATE] 1997-10-27
[NEWS] In 1997, American Indians in Nebraska reported a lower rate of flu vaccination compared to the general population in the United States. Influenza can cause fever, chills, headache, cough, sore throat, and muscle aches, and affects people of all ages.

[DATE] 1997-10
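
The windowing logic can also be written as a small function, which makes it easier to check that the window stops just before the row being predicted. This is only a sketch: `build_history_prompt` is a hypothetical helper name, the toy values are invented, and the news text is left out to keep it short.

```python
import pandas as pd

def build_history_prompt(dates, targets, i, k):
    """Return the prompt for row i built from the k preceding rows, or '' if history is too short."""
    if i < k:
        return ''
    lines = [f'[DATE]{d}, [INFECTION] {t}%' for d, t in zip(dates[i - k:i], targets[i - k:i])]
    lines.append(f'Predict infection on [DATE]{dates[i]}')
    return '\n'.join(lines)

# Toy values, invented for illustration.
toy = pd.DataFrame({
    'date': ['2020-01-06', '2020-01-13', '2020-01-20', '2020-01-27'],
    'target': [1.0, 1.5, 2.0, 2.5],
})
dates, targets = toy['date'].tolist(), toy['target'].tolist()

# Building the whole column at once avoids row-by-row assignment.
toy['user'] = [build_history_prompt(dates, targets, i, k=2) for i in range(len(toy))]
print(toy['user'].iloc[3])
```

With `k=2`, the first two rows get an empty prompt, and the prompt for the last row lists only the two weeks before it, so the value to be predicted never appears in its own prompt.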

## Build the prompts
* Define the system prompt
* Split into train and test sets
* Combine the system prompt and the user prompt
* For the training set, append the assistant response

In [30]:
SYSTEM_PROMPT = '''You are a time series forecasting assistant. Your task is to analyze the infection trends and aggregated public health news over the last several weeks, and accurately predict the influenza infection percentage for the upcoming week.

- Use the past infection percentages to identify numerical trends.
- Use the weekly news summaries to detect contextual factors. 
- Only pick relevant news that may influence influenza infection. 
- Output the prediction in JSON format as:
{"target": float, "prediction" : string}

Ensure your prediction is consistent with both the data trends and the news events. Do not output anything other than the JSON.
'''

In [34]:
data = data[data['user'] != '']
data.head(3)

Unnamed: 0,date,news,prediction,target,user
6,1997-11-10,[DATE] 1997-11-03\n[NEWS] In the United States...,"Based on the available facts, I predict that t...",1.44579,Available news of past: \n[DATE] 1997-11-03\n[...
7,1997-11-17,[DATE] 1997-11-10\n[NEWS] The World Health Org...,Based on the provided text information from 19...,1.64796,Available news of past: \n[DATE] 1997-11-10\n[...
8,1997-11-24,[DATE] 1997-11-17\n[NEWS] The causes of infect...,Here is a concise prediction about the influen...,1.67517,Available news of past: \n[DATE] 1997-11-17\n[...


In [36]:
SYSTEM_HEADER = '<|start_header_id|>system<|end_header_id|>'
USER_HEADER = '<|start_header_id|>user<|end_header_id|>'
ASSISTANT_HEADER = '<|start_header_id|>assistant<|end_header_id|>'
END = '<|eot_id|>'

In [38]:
n = data.shape[0]
tr_size = int(0.5*n)
train_data = data[:tr_size]
test_data = data[tr_size: ]
print(train_data.shape)
print(test_data.shape)

(691, 5)
(692, 5)
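
Because the rows are sliced in their original order, the split is chronological: every training week precedes every test week. A minimal sketch of the same rule on a toy frame (the dates are invented, and `train_toy`/`test_toy` are illustrative names) shows how this can be checked:

```python
import pandas as pd

# Toy frame with invented weekly dates, formatted like the `date` column.
toy = pd.DataFrame({
    'date': pd.date_range('2020-01-06', periods=7, freq='7D').strftime('%Y-%m-%d')
})

tr_size = int(0.5 * len(toy))          # same 50/50 rule as above
train_toy = toy.iloc[:tr_size].copy()  # .copy() keeps later column assignments off `toy`
test_toy = toy.iloc[tr_size:].copy()

# ISO-formatted date strings sort chronologically, so a string comparison is enough here.
assert train_toy['date'].max() < test_toy['date'].min()
print(len(train_toy), len(test_toy))
```

With an odd number of rows, `int(0.5 * n)` rounds down, so the extra row goes to the test set, which matches the shapes printed above.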


In [40]:
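import json

train_data = train_data.assign(prompt='')  # assign returns a new frame, not a slice of `data`
prompt_col = train_data.columns.get_loc('prompt')
for i, (date, _, prediction, target, user, _) in enumerate(train_data.values):
    system = SYSTEM_HEADER + SYSTEM_PROMPT + END
    user_pr = USER_HEADER + user + END
    # json.dumps quotes and escapes the prediction text so the assistant turn is valid JSON.
    assistant = ASSISTANT_HEADER + json.dumps({"target": target, "prediction": prediction}) + END
    final_prompt = system + user_pr + assistant
    train_data.iloc[i, prompt_col] = final_prompt
print(train_data['prompt'].iloc[10])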

<|start_header_id|>system<|end_header_id|>You are a time series forecasting assistant. Your task is to analyze the infection trends and aggregated public health news over the last several weeks, and accurately predict the influenza infection percentage for the upcoming week.

- Use the past infection percentages to identify numerical trends.
- Use the weekly news summaries to detect contextual factors. 
- Only pick relevant news that may influence influenza infection. 
- Output the prediction in JSON format as:
{"target": float, "prediction" : string}

Ensure your prediction is consistent with both the data trends and the news events. Do not output anything other than the JSON.
<|eot_id|><|start_header_id|>user<|end_header_id|>Available news of past: 
[DATE] 1998-01-05
[NEWS] Objective facts about the Pulic Health and FLU situation:The publication types of a research article on AIDS and infectious disease death in HIV patients include Research Support, U.S. Gov't, P.H.S. Aerosol generat
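
One way to check that an assistant turn matches the `{"target": float, "prediction": string}` schema from the system prompt is to serialize it with `json.dumps` and parse it back. The sketch below uses invented values standing in for one training row; the prediction text deliberately contains quotes and a newline, which a JSON string has to escape.

```python
import json

# Invented values standing in for one training row.
target = 1.44579
prediction = 'Cases are "likely" to rise.\nSee the trend above.'

payload = json.dumps({"target": target, "prediction": prediction})
print(payload)

# Parsing the payload back recovers the original values, quotes and newline included.
parsed = json.loads(payload)
assert parsed == {"target": target, "prediction": prediction}
```

If the round trip succeeds for the training targets, the same `json.loads` call can be used to read the model's replies at evaluation time.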

In [42]:
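test_data = test_data.assign(prompt='')  # assign returns a new frame, not a slice of `data`
prompt_col = test_data.columns.get_loc('prompt')
for i, (date, _, prediction, target, user, _) in enumerate(test_data.values):
    system = SYSTEM_HEADER + SYSTEM_PROMPT + END
    user_pr = USER_HEADER + user + END
    # The assistant turn is left open so the model generates the answer.
    assistant = ASSISTANT_HEADER
    final_prompt = system + user_pr + assistant
    test_data.iloc[i, prompt_col] = final_prompt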
print(test_data['prompt'].iloc[10])

<|start_header_id|>system<|end_header_id|>You are a time series forecasting assistant. Your task is to analyze the infection trends and aggregated public health news over the last several weeks, and accurately predict the influenza infection percentage for the upcoming week.

- Use the past infection percentages to identify numerical trends.
- Use the weekly news summaries to detect contextual factors. 
- Only pick relevant news that may influence influenza infection. 
- Output the prediction in JSON format as:
{"target": float, "prediction" : string}

Ensure your prediction is consistent with both the data trends and the news events. Do not output anything other than the JSON.
<|eot_id|><|start_header_id|>user<|end_header_id|>Available news of past: 
[DATE] 2011-04-04
[NEWS] Objective facts about the Pulic Health and FLU situation: The 2008-2009 live attenuated influenza vaccine (LAIV) was tested in ferrets and found to be effective . Maternal influenza infection during pregnancy can i

In [242]:
train_data.to_csv('./data/llama3_train_health.csv')
test_data.to_csv('./data/llama3_test_health.csv')

## Make a dataset excluding news data

In [107]:
df = pd.read_csv('./data/health.csv')
data = df[['date', 'Final_Search_4', 'Final_Output', 'OT']].copy()  # copy so later edits do not target a slice of df
data.columns = ['date', 'news', 'prediction', 'target']

In [109]:
data['user'] = ''

In [111]:
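user_col = data.columns.get_loc('user')
for i, (date, target) in enumerate(zip(data['date'], data['target'])):
    text = ''
    if i >= k:
        # List the k weeks before row i; row i's own target is the value to predict.
        for old_date, old_target in zip(data['date'].iloc[i-k:i], data['target'].iloc[i-k:i]):
            text += f'[DATE]{old_date}, [INFECTION] {old_target}%\n'
        text += f'Predict infection on [DATE]{date}'
        # Write by position; chained `data['user'].iloc[i] = ...` may not update `data`.
        data.iloc[i, user_col] = text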

In [113]:
data = data[data['user'] != '']
data.head(3)

Unnamed: 0,date,news,prediction,target,user
6,1997-11-10,Available facts are as follows: 1997-11-03: In...,"Based on the available facts, I predict that t...",1.44579,"[DATE]1997-09-29, [INFECTION] 1.21686%\n[DATE]..."
7,1997-11-17,Available facts are as follows: 1997-11-10: Th...,Based on the provided text information from 19...,1.64796,"[DATE]1997-10-06, [INFECTION] 1.28064%\n[DATE]..."
8,1997-11-24,Available facts are as follows: 1997-11-17: Th...,Here is a concise prediction about the influen...,1.67517,"[DATE]1997-10-13, [INFECTION] 1.23906%\n[DATE]..."


In [115]:
print(data['user'].iloc[10])

[DATE]1997-12-08, [INFECTION] 1.76632%
[DATE]1997-12-15, [INFECTION] 2.10398%
[DATE]1997-12-22, [INFECTION] 3.13655%
[DATE]1997-12-29, [INFECTION] 3.7185%
[DATE]1997-12-29, [INFECTION] 3.59225%
[DATE]1998-01-05, [INFECTION] 3.68922%
Predict infection on [DATE]1998-01-12


In [117]:
n = data.shape[0]
tr_size = int(0.5*n)
train_data = data[:tr_size]
test_data = data[tr_size: ]
print(train_data.shape)
print(test_data.shape)

(691, 5)
(692, 5)


In [119]:
SYSTEM_PROMPT = '''You are a time series forecasting assistant. Your task is to analyze the influenza infection trends over the last several weeks, and accurately predict the influenza infection percentage for the upcoming week.

- Use the past infection percentages to identify numerical trends.
- Output the prediction in JSON format as:
{"target": float}

Ensure your prediction is consistent with the data trends. Do not output anything other than the JSON.
'''

In [121]:
train_data = train_data.assign(prompt='')  # assign returns a new frame, not a slice of `data`
test_data = test_data.assign(prompt='')

In [123]:
for i, (date, _, prediction, target, user, _) in enumerate(train_data.values):
    system = SYSTEM_HEADER + SYSTEM_PROMPT + END
    user_pr = USER_HEADER + user + END
    assistant = ASSISTANT_HEADER + f'{{"target": {target}}}' + END
    final_prompt = system + user_pr + assistant
    train_data.iloc[i, train_data.columns.get_loc('prompt')] = final_prompt
print(train_data['prompt'].iloc[10])

<|start_header_id|>system<|end_header_id|>You are a time series forecasting assistant. Your task is to analyze the influenza infection trends over the last several weeks, and accurately predict the influenza infection percentage for the upcoming week.

- Use the past infection percentages to identify numerical trends.
- Output the prediction in JSON format as:
{"target": float}

Ensure your prediction is consistent with the data trends. Do not output anything other than the JSON.
<|eot_id|><|start_header_id|>user<|end_header_id|>[DATE]1997-12-08, [INFECTION] 1.76632%
[DATE]1997-12-15, [INFECTION] 2.10398%
[DATE]1997-12-22, [INFECTION] 3.13655%
[DATE]1997-12-29, [INFECTION] 3.7185%
[DATE]1997-12-29, [INFECTION] 3.59225%
[DATE]1998-01-05, [INFECTION] 3.68922%
Predict infection on [DATE]1998-01-12<|eot_id|><|start_header_id|>assistant<|end_header_id|>{"target": 4.60396}<|eot_id|>
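
The train and test prompts differ only in whether the assistant turn is filled in, so the concatenation can be expressed as one function. This is a sketch rather than the notebook's own code: `build_prompt` is a hypothetical helper, and the placeholder strings `'SYSTEM'` and `'USER'` stand in for the real prompt text. The header constants are copied from the cell above.

```python
SYSTEM_HEADER = '<|start_header_id|>system<|end_header_id|>'
USER_HEADER = '<|start_header_id|>user<|end_header_id|>'
ASSISTANT_HEADER = '<|start_header_id|>assistant<|end_header_id|>'
END = '<|eot_id|>'

def build_prompt(system_prompt, user, assistant=None):
    """Concatenate the turns; leave the assistant turn open when no answer is given."""
    prompt = SYSTEM_HEADER + system_prompt + END + USER_HEADER + user + END + ASSISTANT_HEADER
    if assistant is not None:
        # Training prompts include the answer and close the turn.
        prompt += assistant + END
    return prompt

train_prompt = build_prompt('SYSTEM', 'USER', '{"target": 4.60396}')
test_prompt = build_prompt('SYSTEM', 'USER')
print(train_prompt)
print(test_prompt)
```

A training prompt built this way is the matching test prompt plus the answer and a closing `<|eot_id|>`, which keeps the two formats from drifting apart.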


In [125]:
for i, (date, _, prediction, target, user, _) in enumerate(test_data.values):
    system = SYSTEM_HEADER + SYSTEM_PROMPT + END
    user_pr = USER_HEADER + user + END
    assistant = ASSISTANT_HEADER
    final_prompt = system + user_pr + assistant
    test_data.iloc[i, test_data.columns.get_loc('prompt')] = final_prompt
print(test_data['prompt'].iloc[10])

<|start_header_id|>system<|end_header_id|>You are a time series forecasting assistant. Your task is to analyze the influenza infection trends over the last several weeks, and accurately predict the influenza infection percentage for the upcoming week.

- Use the past infection percentages to identify numerical trends.
- Output the prediction in JSON format as:
{"target": float}

Ensure your prediction is consistent with the data trends. Do not output anything other than the JSON.
<|eot_id|><|start_header_id|>user<|end_header_id|>[DATE]2011-02-28, [INFECTION] 2.72422%
[DATE]2011-03-07, [INFECTION] 2.33336%
[DATE]2011-03-14, [INFECTION] 1.925%
[DATE]2011-03-21, [INFECTION] 1.75243%
[DATE]2011-03-28, [INFECTION] 1.57704%
[DATE]2011-04-04, [INFECTION] 1.35766%
Predict infection on [DATE]2011-04-11<|eot_id|><|start_header_id|>assistant<|end_header_id|>


In [130]:
train_data.to_csv('./data/llama_health_train_wo_news.csv')
test_data.to_csv('./data/llama_health_test_wo_news.csv')