## U.S. Bureau of Labor Statistics - CPI Analysis
#### Eric Bottinelli

### 1. Retrieve data via BLS API v2

**Documentation**

- https://www.bls.gov/developers/api_python.htm
- https://data.bls.gov/cgi-bin/surveymost?cu
- https://data.bls.gov/dataQuery/find?fq=survey:[cu]&s=popularity:D&r=100&st=0
- https://www.bls.gov/cpi/tables/relative-importance/2023.htm
- https://www.bls.gov/bls/news-release/cpi.htm

**Packages to install**

- prettytable (`pip install prettytable`)
- The code below also imports requests, pandas, bs4 (beautifulsoup4) and transformers

**API Series ID**

Consumer Price Index for All Urban Consumers (CPI-U)
- *All items in U.S. city average, all urban consumers*
    - NSA: CUUR0000SA0
    - SA: CUSR0000SA0
- *All items less food and energy in U.S. city average, all urban consumers*
    - NSA: CUUR0000SA0L1E
    - SA: CUSR0000SA0L1E
- *Food in U.S. city average, all urban consumers*
    - NSA: CUUR0000SAF1
    - SA: CUSR0000SAF1
- *Food at home in U.S. city average, all urban consumers*
    - NSA: CUUR0000SAF11
    - SA: CUSR0000SAF11
- *Energy in U.S. city average, all urban consumers*
    - NSA: CUUR0000SA0E
    - SA: CUSR0000SA0E
- *Commodities less food and energy commodities in U.S. city average, all urban consumers*
    - NSA: CUUR0000SACL1E
    - SA: CUSR0000SACL1E
- *Services less energy services in U.S. city average, all urban consumers*
    - NSA: CUUR0000SASLE
    - SA: CUSR0000SASLE
- *Shelter in U.S. city average, all urban consumers*
    - NSA: CUUR0000SAH1
    - SA: CUSR0000SAH1
    - Factsheet: https://www.bls.gov/cpi/factsheets/owners-equivalent-rent-and-rent.htm

**Calculate special CPI**

Occasionally, a user wishes to estimate a price change that BLS does not publish, for instance a CPI series for ‘services less energy services and shelter’. This can be done by constructing a special index from the published series.
[BLS Doc](https://www.bls.gov/cpi/factsheets/constructing-special-cpis.htm)

Series IDs are built from the item code: for example, item SEEB01 becomes CUUR0000SEEB01 (NSA, U.S. city average).
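
As an illustration of this mapping, a series ID can be assembled from its parts. The helper below is hypothetical (it is not part of the BLS API) and assumes the usual CPI-U layout: the `CU` survey prefix, a seasonal-adjustment code (`U` or `S`), the periodicity code `R` for monthly data, a four-character area code (`0000` for the U.S. city average), and the item code.

```python
def build_series_id(item_code: str, seasonally_adjusted: bool = False, area_code: str = "0000") -> str:
    """Assemble a CPI-U series ID from an item code (hypothetical helper)."""
    seasonal = "S" if seasonally_adjusted else "U"
    # 'CU' = CPI-U survey, 'R' = monthly periodicity (assumed layout)
    return f"CU{seasonal}R{area_code}{item_code}"


print(build_series_id("SEEB01"))
print(build_series_id("SAH1", seasonally_adjusted=True))
```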

The cost weight of an aggregate is simply the sum of the cost weights of its component items.

Adding up every component of services less energy services and shelter requires a lot of data, so a different solution is worth exploring (e.g. removing goods from core CPI).
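
A minimal sketch of the cost-weight approach, based on my reading of the BLS factsheet linked above: each component's December relative importance is updated by its price change to get a cost weight, the cost weights are added or subtracted, and the percent change of the resulting aggregate is taken. The function names and the index levels in the usage example are made up for illustration; only the two relative importances come from this notebook.

```python
from typing import Dict, List, Tuple

# (sign, December relative importance, index levels keyed by 'dec', 'prev', 'curr')
Component = Tuple[int, float, Dict[str, float]]


def cost_weight(relative_importance: float, index_now: float, index_dec: float) -> float:
    """Update a December relative importance by the price change since December."""
    return relative_importance * index_now / index_dec


def special_index_pct_change(components: List[Component]) -> float:
    """Percent change of a special aggregate between 'prev' and 'curr'."""
    prev = sum(sign * cost_weight(ri, lv["prev"], lv["dec"]) for sign, ri, lv in components)
    curr = sum(sign * cost_weight(ri, lv["curr"], lv["dec"]) for sign, ri, lv in components)
    return (curr / prev - 1) * 100


# Hypothetical index levels, chosen only to keep the arithmetic easy to follow
services = {"dec": 100.0, "prev": 100.0, "curr": 101.0}
shelter = {"dec": 100.0, "prev": 100.0, "curr": 101.0}

change = special_index_pct_change([(+1, 60.899, services), (-1, 36.191, shelter)])
print(round(change, 3))
```

Subtracting shelter from services less energy services this way needs only two published series, rather than every individual item in the aggregate.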

**Supercore CPI**

"Fed Chair Jerome Powell cited a specific category of inflation—inflation in core services other than housing—as being perhaps “the most important category for understanding the future evolution of core inflation.” The financial press has termed this category “supercore” inflation" ([FED of St. Louis](https://www.stlouisfed.org/on-the-economy/2024/may/measuring-inflation-headline-core-supercore-services))

In [81]:
import os
import json
from datetime import datetime

import pandas as pd
import prettytable
import requests
from bs4 import BeautifulSoup
from transformers import BartForConditionalGeneration, BartTokenizer

# Folder directory containing all the saved data
folder_name = 'CPI_Data'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

# CPI category weights (relative importance, in percent; see the relative-importance table linked above)
food_weight = 13.555
energy_weight = 6.655
services_less_energy_weight = 60.899
shelter_weight = 36.191
weight_map = {
    'All_Items': 100,
    'Food_Energy': food_weight + energy_weight,
    'Food': food_weight,
    'Food_At_Home': 8.167,
    'Food_Away_From_Home': 5.388,
    'Energy': energy_weight,
    'All_Items_Less_Food_Energy': 79.790,
    'Commodities_Less_Food_Energy_Commodities': 18.891,
    'Services_Less_Energy_Services': services_less_energy_weight,
    'Shelter': shelter_weight,
    'Supercore': services_less_energy_weight - shelter_weight 
}

# CPI id map
series_names = {
    'CUUR0000SA0': 'NSA_All_Items',
    'CUSR0000SA0': 'SA_All_Items',
    'CUUR0000SA0L1E': 'NSA_All_Items_Less_Food_Energy',
    'CUSR0000SA0L1E': 'SA_All_Items_Less_Food_Energy',
    'CUUR0000SAF1': 'NSA_Food',
    'CUSR0000SAF1': 'SA_Food',
    'CUUR0000SAF11': 'NSA_Food_At_Home',
    'CUSR0000SAF11': 'SA_Food_At_Home',
    'CUUR0000SEFV': 'NSA_Food_Away_From_Home',
    'CUSR0000SEFV': 'SA_Food_Away_From_Home',
    'CUUR0000SA0E': 'NSA_Energy',
    'CUSR0000SA0E': 'SA_Energy',
    'CUUR0000SACL1E': 'NSA_Commodities_Less_Food_Energy_Commodities',
    'CUSR0000SACL1E': 'SA_Commodities_Less_Food_Energy_Commodities',
    'CUUR0000SASLE': 'NSA_Services_Less_Energy_Services',
    'CUSR0000SASLE': 'SA_Services_Less_Energy_Services',
    'CUUR0000SAH1': 'NSA_Shelter',
    'CUSR0000SAH1': 'SA_Shelter',
}
series_ids = list(series_names.keys())

## BLS API request for CPI data

Documentation:
- https://www.bls.gov/developers/api_signature_v2.htm
- Python sample code: https://www.bls.gov/developers/api_python.htm#python2
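
A sketch of how the request body and a basic response check might look. The `registrationkey` field and the `REQUEST_SUCCEEDED` status are taken from the BLS API v2 documentation as I understand it; the helper names and the sample response are hypothetical, and no network call is made here.

```python
import json
from typing import Any, Dict, List, Optional


def build_payload(series_ids: List[str], start_year: int, end_year: int,
                  registration_key: Optional[str] = None) -> str:
    """Build the JSON body for a v2 time-series request (hypothetical helper)."""
    payload: Dict[str, Any] = {
        "seriesid": series_ids,
        "startyear": str(start_year),
        "endyear": str(end_year),
    }
    if registration_key:
        # Optional key; assumed to raise the request limits
        payload["registrationkey"] = registration_key
    return json.dumps(payload)


def check_response(json_data: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the list of series, or raise if the API reports a failure."""
    if json_data.get("status") != "REQUEST_SUCCEEDED":
        raise RuntimeError(f"BLS API request failed: {json_data.get('message')}")
    return json_data["Results"]["series"]


payload = build_payload(["CUUR0000SA0"], 2023, 2024)

# Hand-written stand-in for a parsed API response
sample = {
    "status": "REQUEST_SUCCEEDED",
    "message": [],
    "Results": {"series": [{"seriesID": "CUUR0000SA0", "data": []}]},
}
series = check_response(sample)
print(series[0]["seriesID"])
```

Checking the status before indexing into `Results` gives a readable error when the request is rejected, instead of a `KeyError` further down.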

In [16]:
# Request data from January of last year through the current year, so that YoY changes can be computed for the latest months
current_date = datetime.now()
current_year = current_date.year
last_year = current_year - 1

# API request structure taken from the BLS API documentation and modified to fit the data we need
headers = {'Content-type': 'application/json'}
data = json.dumps({"seriesid": series_ids, "startyear": str(last_year), "endyear": str(current_year)})
response = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
json_data = json.loads(response.text)

all_data = []
for series in json_data['Results']['series']:
    rows = []
    for item in series['data']:
        footnotes = "".join([footnote['text'] + ',' for footnote in item['footnotes'] if footnote]).rstrip(',')
        if 'M01' <= item['period'] <= 'M12':
            rows.append([series_names[series['seriesID']], item['year'], item['period'], item['value'], footnotes])

    df = pd.DataFrame(rows, columns=["series id", "year", "period", "value", "footnotes"])
    all_data.append(df)

complete_data = pd.concat(all_data)

# Process the data to properly save them in a csv file
df = complete_data.copy()
df['date'] = pd.to_datetime(df['year'].astype(str) + df['period'].str.replace('M', ''), format='%Y%m')
df['series id'] = df['series id'].astype(str) 
df['value'] = pd.to_numeric(df['value'], errors='coerce')
df['footnotes'] = df['footnotes'].astype(str) 
df.drop(['year', 'period', 'footnotes'], axis=1, inplace=True)
df.rename(columns={'series id': 'id'}, inplace=True)
df = df[['id', 'date', 'value']]

csv_path = os.path.join(folder_name, 'CPI_data.csv')
df.to_csv(csv_path, index=False)

## Data processing

In addition to the MoM and YoY changes, the food + energy and supercore series must be derived, since they are not retrieved from the API. The data are then organized in a table that groups the series into categories and subcategories.
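
For reference, the MoM and YoY changes can be sketched on a toy series. This version assumes the rows are sorted oldest first and uses `pct_change`, which differs from the cells below (they keep the newest-first order and use negative shifts). The series name and values are invented.

```python
import pandas as pd

# Thirteen months of a made-up index: 100, 101, ..., 112
dates = pd.date_range("2023-01-01", periods=13, freq="MS")
toy = pd.DataFrame({"id": "TOY", "date": dates, "value": [100.0 + i for i in range(13)]})

# Oldest first within each series, so a positive period looks back in time
toy = toy.sort_values(["id", "date"])
toy["MoM_change"] = toy.groupby("id")["value"].pct_change(periods=1)
toy["YoY_change"] = toy.groupby("id")["value"].pct_change(periods=12)

latest = toy.iloc[-1]
print(latest["date"].strftime("%b-%y"), latest["MoM_change"], latest["YoY_change"])
```

Only the thirteenth month has a YoY value, which is why the request covers more than twelve months of data.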

In [21]:
# Load the dataset to avoid making the API request every time
df = pd.read_csv("CPI_Data/CPI_data.csv")
df['date'] = pd.to_datetime(df['date'])

In [22]:
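# Calculation of the MoM and YoY changes.
# Rows are sorted newest first within each series (the order returned by the BLS API),
# so shift(-1) is the previous month and shift(-12) is the same month one year earlier.
df = df.sort_values(['id', 'date'], ascending=[True, False]).reset_index(drop=True)
df['MoM_change'] = df.groupby('id')['value'].transform(lambda x: (x - x.shift(-1)) / x.shift(-1))
df['YoY_change'] = df.groupby('id')['value'].transform(lambda x: (x - x.shift(-12)) / x.shift(-12))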

In [23]:
def calculate_weighted_change(df: pd.DataFrame, id1: str, id2: str, new_id: str, weight1: float, weight2: float, weight_total: float, operation: str) -> pd.DataFrame:
    '''
    Calculate the weighted change of food + energy and supercore for both SA and non-SA (NSA).
    
    Args:
        df: The DataFrame containing the CPI data.
        id1: The id of the first series.
        id2: The id of the second series.
        new_id: The id of the new series.
        weight1: The weight of the first series.
        weight2: The weight of the second series.
        weight_total: The total weight of the new series.
        operation: The operation to be performed. Must be 'add' or 'subtract'.
    
    Returns:
        The DataFrame containing the new series
    '''
    series1 = df.loc[df['id'] == id1, ['MoM_change', 'YoY_change']].set_index(df.loc[df['id'] == id1, 'date']) * weight1
    series2 = df.loc[df['id'] == id2, ['MoM_change', 'YoY_change']].set_index(df.loc[df['id'] == id2, 'date']) * weight2

    if operation == 'add':
        result = (series1 + series2) / weight_total
    elif operation == 'subtract':
        result = (series1 - series2) / weight_total
    else:
        raise ValueError("Operation must be 'add' or 'subtract'.")

    result = result.reset_index()
    result['id'] = new_id
    result['value'] = 0  # Placeholder: the derived series has no index level, only changes
    result = result[['id', 'date', 'value', 'MoM_change', 'YoY_change']]
    return result

# Calculate 'SA_Food_Energy' and 'NSA_Food_Energy'
df = pd.concat([
    df,
    calculate_weighted_change(df, 'SA_Food', 'SA_Energy', 'SA_Food_Energy', weight_map['Food'], weight_map['Energy'], weight_map['Food_Energy'], 'add'),
    calculate_weighted_change(df, 'NSA_Food', 'NSA_Energy', 'NSA_Food_Energy', weight_map['Food'], weight_map['Energy'], weight_map['Food_Energy'], 'add')
])

# Calculate 'SA_Supercore' and 'NSA_Supercore'
df = pd.concat([
    df,
    calculate_weighted_change(df, 'SA_Services_Less_Energy_Services', 'SA_Shelter', 'SA_Supercore', weight_map['Services_Less_Energy_Services'], weight_map['Shelter'], weight_map['Supercore'], 'subtract'),
    calculate_weighted_change(df, 'NSA_Services_Less_Energy_Services', 'NSA_Shelter', 'NSA_Supercore', weight_map['Services_Less_Energy_Services'], weight_map['Shelter'], weight_map['Supercore'], 'subtract')
])
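
As a worked example of the 'subtract' case, take hypothetical monthly changes of 0.3% for services less energy services and 0.4% for shelter (illustrative inputs, not BLS figures) together with the weights defined earlier in this notebook:

```python
services_weight = 60.899   # Services less energy services
shelter_weight = 36.191    # Shelter
supercore_weight = services_weight - shelter_weight

services_mom = 0.003  # hypothetical MoM change
shelter_mom = 0.004   # hypothetical MoM change

# Remove shelter's weighted contribution, then rescale by the remaining weight
supercore_mom = (services_weight * services_mom - shelter_weight * shelter_mom) / supercore_weight
print(f"{supercore_mom:.4%}")
```

This uses fixed weights for every month, so it is an approximation: the cost-weight method updates the weights with each component's price change.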

In [24]:
# Category mapping to properly organize the data into categories, and give the proper names for the plot
category_map = {
    'All_Items': (0, 'Headline', '', '', 'All Items'),
    'Food_Energy': (1, 'Food + Energy', '', '', 'Food and Energy'),
    'Food': (2, '', 'Food', '', 'Food'),
    'Food_At_Home': (3, '', '', 'At home', 'Food at Home'),
    'Food_Away_From_Home': (4, '', '', 'Away Home', 'Food Away from Home'),
    'Energy': (5, '', 'Energy', '', 'Energy'),
    'All_Items_Less_Food_Energy': (6, 'Core', '', '', 'Core CPI'),
    'Commodities_Less_Food_Energy_Commodities': (7, '', 'Goods', '', 'Goods'),
    'Services_Less_Energy_Services': (8, '', 'Services', '', 'Services excluding Energy'),
    'Shelter': (9, '', '', 'Shelter', 'Shelter'),
    'Supercore': (10, '', '', 'Supercore', 'Supercore')
}

ordered_categories = ['Headline', 'Food + Energy', 'Core']
ordered_sub_categories_1 = ['Food', 'Energy', 'Goods', 'Services']
ordered_sub_categories_2 = ['At home', 'Away Home', 'Shelter', 'Supercore']

for i in range(4):
    if i in [0, 1]:
        data = df[df['id'].str.startswith('NSA_')].copy()
        prefix_length = 4
    else:
        data = df[df['id'].str.startswith('SA_')].copy() 
        prefix_length = 3

    data.loc[:, 'id'] = data['id'].str[prefix_length:]
    data.loc[:, 'Month-Year'] = data['date'].dt.strftime('%b-%y')
    order_cat = data['id'].apply(lambda x: category_map.get(x, (None, None, None, None, None)))
    data.loc[:, 'Order'] = [item[0] for item in order_cat]
    data.loc[:, 'Category'] = [item[1] for item in order_cat]
    data.loc[:, 'Sub Category 1'] = [item[2] for item in order_cat]
    data.loc[:, 'Sub Category 2'] = [item[3] for item in order_cat]
    data.loc[:, 'Name'] = [item[4] for item in order_cat]
    data.loc[:, 'Weight'] = data['id'].map(weight_map).fillna('Unknown')

    if i in [0, 2]:
        value_column = 'MoM_change'
    else:
        value_column = 'YoY_change'

    data_pivot = data.pivot_table(
        index=['Name', 'Order', 'Category', 'Sub Category 1', 'Sub Category 2', 'Weight'],
        columns='Month-Year',
        values=value_column,
        aggfunc='first'
    )

    data_pivot = data_pivot[sorted(data_pivot.columns, key=lambda x: pd.to_datetime(x, format='%b-%y'), reverse=True)]
    data_pivot.columns.name = None
    data_pivot.reset_index(inplace=True)
    data_pivot.sort_values(by='Order', inplace=True)
    data_pivot.drop(columns=['Order'], inplace=True)

    if i == 0:
        NSA_MoM_df = data_pivot
    elif i == 1:
        NSA_YoY_df = data_pivot
    elif i == 2:
        SA_MoM_df = data_pivot
    else:
        SA_YoY_df = data_pivot
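
The reshaping step in the loop above can be seen in isolation on a toy frame. The names and values here are invented; the sketch only shows how the long table becomes one row per series with one column per month, newest month first.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["All Items", "All Items", "Shelter", "Shelter"],
    "date": pd.to_datetime(["2024-06-01", "2024-07-01", "2024-06-01", "2024-07-01"]),
    "MoM_change": [0.001, 0.002, 0.003, 0.004],
})
toy["Month-Year"] = toy["date"].dt.strftime("%b-%y")

# One row per series, one column per month
wide = toy.pivot_table(index="Name", columns="Month-Year", values="MoM_change", aggfunc="first")

# Month labels are strings, so parse them back to dates to sort newest first
wide = wide[sorted(wide.columns, key=lambda c: pd.to_datetime(c, format="%b-%y"), reverse=True)]
wide.columns.name = None
wide = wide.reset_index()
print(wide)
```

Sorting on the parsed dates matters because the labels would otherwise be ordered alphabetically (`Apr-24` before `Jan-24`).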

In [25]:
# Save the data
csv_path = os.path.join(folder_name, 'NSA_MoM_CPI_data.csv')
NSA_MoM_df.to_csv(csv_path, index=False)

csv_path = os.path.join(folder_name, 'NSA_YoY_CPI_data.csv')
NSA_YoY_df.to_csv(csv_path, index=False)

csv_path = os.path.join(folder_name, 'SA_MoM_CPI_data.csv')
SA_MoM_df.to_csv(csv_path, index=False)

csv_path = os.path.join(folder_name, 'SA_YoY_CPI_data.csv')
SA_YoY_df.to_csv(csv_path, index=False)

## CPI Summary

In [5]:
def fetch_report_text(url: str) -> str:
    '''
    The BLS API doesn't provide the report text, so I had to scrape the website to get it. The website structure is simple, so I used BeautifulSoup to get the text. BLS has an anti-scraping system, so I had to add a User-Agent header to the request.

    Args:
        url: The URL of the webpage to scrape
    
    Returns:
        The text of the report
    '''
    headers = {
        'User-Agent': 'email@domain.name'  # Placeholder contact address; the exact value doesn't matter
    }
    response = requests.get(url, headers=headers)
    if response.status_code == 200: # Code 200 means the request was successful
        soup = BeautifulSoup(response.text, 'html.parser')
        normalnews_div = soup.find('div', class_='normalnews') # After analyzing the webpage using the inspector, I found that the needed text is inside a <div> with class 'normalnews'
        if normalnews_div:
            return normalnews_div.get_text(separator='\n', strip=True)
        else:
            return "No <div> with class 'normalnews' found." # Error management
    else:
        return f"Failed to retrieve page: {response.status_code}"

url = "https://www.bls.gov/news.release/cpi.nr0.htm"

full_text = fetch_report_text(url)

In [3]:
def extract_paragraph_based_on_index(full_text: str) -> str:
    '''
    Extract the second paragraph of the main section of the report, or the third paragraph if a "NOTE:" paragraph directly follows the section title (e.g. June 2024). An LLM could be used to find the paragraph of interest, but even after training, the best it could do is learn this same pattern, so I decided to use a simple rule-based approach. It worked on every report from the previous year. Detecting a sudden change in the report structure would need a more advanced system, which would be overkill for this project.

    Args:
        full_text: The full text of the report retrieved from the BLS website

    Returns:
        The paragraph of interest
    '''
    paragraphs = [p.strip() for p in full_text.replace('\r', '').split('\n\n') if p.strip()] # split the text into paragraphs
    for i, paragraph in enumerate(paragraphs):
        if "CONSUMER PRICE INDEX" in paragraph: # "CONSUMER PRICE INDEX" is the section we are interested in, and it's the only consumer price index title written in uppercase
            start_index = i
            break
    else:
        return "CONSUMER PRICE INDEX section not found."
    note_present = "NOTE:" in paragraphs[start_index + 1] # check if a "NOTE:" is present immediately after "CONSUMER PRICE INDEX"
    target_index = start_index + 3 if note_present else start_index + 2
    if target_index < len(paragraphs):
        target_paragraph = paragraphs[target_index].replace('\n', ' ') # replace newline characters with a space
        return target_paragraph
    else:
        return "The paragraph of interest could not be found."

relevant_paragraph = extract_paragraph_based_on_index(full_text)
relevant_paragraph
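
The function above returns its error messages as ordinary strings, so a failure would flow into the summarizer as if it were report text. One possible variant, sketched here under a hypothetical name, returns `None` on failure so the caller can tell the two cases apart. The sample text is invented.

```python
from typing import Optional


def find_target_paragraph(full_text: str, title: str = "CONSUMER PRICE INDEX") -> Optional[str]:
    """Same rule as above, but returns None when the paragraph cannot be located."""
    paragraphs = [p.strip() for p in full_text.replace("\r", "").split("\n\n") if p.strip()]
    start = next((i for i, p in enumerate(paragraphs) if title in p), None)
    if start is None or start + 1 >= len(paragraphs):
        return None
    # Skip one extra paragraph when a NOTE directly follows the title
    offset = 3 if "NOTE:" in paragraphs[start + 1] else 2
    target = start + offset
    if target >= len(paragraphs):
        return None
    return paragraphs[target].replace("\n", " ")


sample = (
    "Header\n\nCONSUMER PRICE INDEX - JULY 2024\n\nNOTE: a notice.\n\n"
    "First paragraph.\n\nSecond\nparagraph."
)
print(find_target_paragraph(sample))
```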

In [78]:
# I use a pre-trained BART model to summarize the paragraph retrieved above while trying to leave out its food section. The checkpoint is fine-tuned for summarization, so it should produce a reasonable summary, even if not a perfect one, which is good enough for this project. I have used BART before, so I know its capabilities.
model_name = "facebook/bart-large-cnn" 
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

input_text = "Summarize and ignore the food parts: " + relevant_paragraph # Prompt for the model. bart-large-cnn is a summarization model rather than an instruction-following one, so the prefix may have little effect. A more detailed prompt could be tried, but this is enough for this project.
input_ids = tokenizer.encode(input_text, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary
summary_ids = model.generate(
    input_ids, 
    num_beams=4, 
    min_length=20, 
    max_length=60,  # Setting a realistic max_length
    length_penalty=1.0, 
    no_repeat_ngram_size=3,  # Prevent repetition of phrases
    early_stopping=True  # Stop beam search as soon as num_beams finished candidates are found
)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

# Print the summary
print(summary)

The index for shelter rose 0.4 percent in July, accounting for nearly 90 percent of the monthly increase in the all items index. The energy index was unchanged over the month, after declining in the two preceding months.
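
Because `facebook/bart-large-cnn` is a summarization checkpoint rather than an instruction-following model, one alternative worth considering is to drop the food sentences before summarizing instead of asking the model to ignore them. The sketch below is a hypothetical pre-filter (the function name and the naive sentence splitter are my own assumptions) that uses only the standard library:

```python
import re


def drop_sentences_mentioning(text: str, keyword: str = "food") -> str:
    """Remove every sentence that contains the keyword (case-insensitive)."""
    # Naive split: sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if keyword.lower() not in s.lower()]
    return " ".join(kept)


sample = (
    "The index for shelter rose 0.4 percent in July. "
    "The food index increased 0.2 percent. "
    "The energy index was unchanged."
)
print(drop_sentences_mentioning(sample))
```

The splitter would break on abbreviations such as "U.S." followed by a space, so a proper sentence tokenizer would be safer on real report text.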


In [82]:
txt_path = os.path.join(folder_name, 'summary.txt')

# Save the summary to a text file
with open(txt_path, 'w', encoding='utf-8') as file:
    file.write(summary)