# Classification and Labeling of Financial Events Using Text Processing Techniques

### 1. **Business Understanding**
   - **Objective**: The primary goal of this notebook is to categorize and label financial events based on their descriptions (event names). The objective is to systematically classify these events into specific financial indicators, such as GDP, CPI, PMI, and others, to facilitate further analysis of their impact on financial markets and economic forecasting.

### 2. **Data Understanding**
   - **Data**: The dataset consists of a list of financial event names, each representing various economic indicators and events (e.g., "French Final Manufacturing PMI", "CPI m/m"). These event names vary in terms of content and structure.
   - **Exploration**: Initial exploration involves examining the dataset to understand the variety of event names and identifying patterns, such as keywords that correspond to specific economic indicators (e.g., "CPI", "PMI").

### 3. **Data Preparation**
   - **Processing**: The text data is preprocessed by converting all event names to lowercase and filtering or tokenizing key phrases such as "CPI", "PMI", "GDP", etc.
   - **Categorization**: The event names are then categorized into predefined groups based on the identified keywords. This step includes creating a function to automatically classify each event into its respective category.

### 4. **Modeling**
   - **Approach**: While not a traditional machine learning model, this notebook employs rule-based text classification to assign labels to each event. The rules are defined based on the presence of specific keywords within the event names.
   - **Implementation**: The notebook uses Python's `pandas` library to group, filter, and classify the events, ensuring that each event is correctly categorized under its corresponding financial indicator.

### 5. **Evaluation**
   - **Verification**: The effectiveness of the classification is verified by reviewing the grouped events to ensure that they have been accurately categorized. This is done by comparing the labeled categories against the expected economic indicators.
   - **Outcome**: The evaluation may involve counting the number of events per category and assessing the distribution to ensure that no significant event types have been missed or misclassified.

### 6. **Deployment**
   - **Application**: The final categorized dataset can be used in further analyses, such as assessing the impact of specific economic indicators on markets, or it could be integrated into the silver layer of the data pipeline.
   - **Future Steps**: Based on this classification, additional analyses or reports could be generated to provide insights into trends and patterns in the economic data.


## 2. Data Understanding

### Import libraries and CSV files

In [127]:
import pandas as pd
import numpy as np
import re
from collections import Counter

In [128]:
df_01 = pd.read_csv("/Users/datpro/Documents/gitdatpro/ff-transform-data/data/bronze/monthly/2024_01_ff_data.csv")
df_02 = pd.read_csv("/Users/datpro/Documents/gitdatpro/ff-transform-data/data/bronze/monthly/2024_02_ff_data.csv")
df_03 = pd.read_csv('/Users/datpro/Documents/gitdatpro/ff-transform-data/data/bronze/monthly/2024_03_ff_data.csv')
df_04 = pd.read_csv('/Users/datpro/Documents/gitdatpro/ff-transform-data/data/bronze/monthly/2024_04_ff_data.csv')
df_05 = pd.read_csv('/Users/datpro/Documents/gitdatpro/ff-transform-data/data/bronze/monthly/2024_05_ff_data.csv')
df_06 = pd.read_csv('/Users/datpro/Documents/gitdatpro/ff-transform-data/data/bronze/monthly/2024_06_ff_data.csv')
df_07 = pd.read_csv('/Users/datpro/Documents/gitdatpro/ff-transform-data/data/bronze/monthly/2024_07_ff_data.csv')
# Combine the monthly DataFrames into one

df_combined = pd.concat([df_01, df_02, df_03, df_04, df_05, df_06, df_07], ignore_index=True)

In [129]:
# Check df
df_combined.head(3)

Unnamed: 0,date,time,currency,impact,event,actual,forecast,previous
0,Jan 1,All Day,NZD,gray,Bank Holiday,,,
1,Jan 1,All Day,AUD,gray,Bank Holiday,,,
2,Jan 1,All Day,JPY,gray,Bank Holiday,,,


## Finding patterns and word counts in event names

In [130]:
# Unique event names
event_name = df_combined['event'].unique()

# List that collects the matched patterns
patterns = []

# Regular expression that extracts the shared indicator part of each name
pattern = re.compile(r'(CPI\s?[a-z/]*/?[a-z]*|PPI\s?[a-z/]*/?[a-z]*|PMI|GDP\s?[a-z/]*/?[a-z]*|Retail Sales\s?[a-z/]*/?[a-z]*|Unemployment Rate|Employment Change|Hourly Earnings\s?[a-z/]*/?[a-z]*)', re.IGNORECASE)

# Search every event name for a pattern
for event in event_name:
    match = pattern.search(event)
    if match:
        patterns.append(match.group(0).strip())

# Count how often each matched pattern occurs
pattern_counts = Counter(patterns)

# Convert the counts to a DataFrame for easier display and analysis
df_patterns = pd.DataFrame(pattern_counts.items(), columns=['Pattern', 'Frequency'])
df_patterns

Unnamed: 0,Pattern,Frequency
0,PMI,25
1,employment Change,2
2,CPI m/m,7
3,Employment Change,6
4,Retail Sales m/m,4
5,CPI Flash,2
6,PPI m/m,4
7,Unemployment Rate,4
8,Hourly Earnings m/m,1
9,Retail Sales y/y,1
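
The lowercase `employment Change` row is what a case-insensitive search returns when it matches in the middle of a longer word. The sketch below reproduces that effect on three made-up event names and shows how a word boundary (`\b`) avoids it. The names `sample_events`, `loose`, and `strict` are illustrative and not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical event names, chosen to show the substring effect
sample_events = ["Unemployment Change", "Employment Change q/q", "Non-Farm Employment Change"]

# Without word boundaries the pattern can match inside "Unemployment"
loose = re.compile(r"Employment Change", re.IGNORECASE)
# With \b the match has to start and end at a word boundary
strict = re.compile(r"\bEmployment Change\b", re.IGNORECASE)

loose_hits = Counter(m.group(0) for e in sample_events if (m := loose.search(e)))
strict_hits = Counter(m.group(0) for e in sample_events if (m := strict.search(e)))

print(loose_hits)   # includes 'employment Change', cut out of "Unemployment Change"
print(strict_hits)  # only whole-word matches remain
```

Whether "Unemployment Change" should be counted separately or ignored is a modeling decision; the boundary only makes that decision explicit.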


In [131]:
from sklearn.feature_extraction.text import CountVectorizer

# Unique event names
event_names = df_combined['event'].unique()

# Convert everything to lowercase
event_names = [event.lower() for event in event_names]

# Use CountVectorizer with n-grams
vectorizer = CountVectorizer(ngram_range=(2, 3), stop_words='english')  # Builds bi-grams and tri-grams
X = vectorizer.fit_transform(event_names)
ngram_counts = dict(zip(vectorizer.get_feature_names_out(), X.toarray().sum(axis=0)))

# Sort the n-grams by frequency, most frequent first
sorted_ngram_counts = dict(sorted(ngram_counts.items(), key=lambda item: item[1], reverse=True))

# Print the n-grams with their counts
print("Most frequent phrases:")
for ngram, count in sorted_ngram_counts.items():
    print(f"{ngram}: {count}")


Most frequent phrases:
fomc member: 14
manufacturing pmi: 12
monetary policy: 11
services pmi: 10
price index: 9
mpc member: 8
press conference: 8
retail sales: 8
bond auction: 7
core cpi: 7
industrial production: 7
manufacturing index: 7
employment change: 6
financial stability: 6
inflation expectations: 6
prelim gdp: 6
trade balance: 6
10 bond: 5
10 bond auction: 5
gdp price: 5
gdp price index: 5
bank holiday: 4
business confidence: 4
council member: 4
final gdp: 4
financial stability report: 4
flash gdp: 4
french final: 4
german final: 4
gov council: 4
gov council member: 4
monetary policy report: 4
policy report: 4
stability report: 4
unemployment rate: 4
consumer confidence: 3
consumer sentiment: 3
final cpi: 3
final manufacturing: 3
final manufacturing pmi: 3
final services: 3
final services pmi: 3
flash manufacturing: 3
flash manufacturing pmi: 3
flash services: 3
flash services pmi: 3
french flash: 3
french prelim: 3
home sales: 3
import prices: 3
meeting minutes: 3
monetary
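
Entries such as `10 bond auction` follow from how the text is tokenized: `CountVectorizer` by default keeps only tokens of two or more word characters, so the `y` in a name like "10-y Bond Auction" is dropped, as are the two halves of "m/m". The plain-Python sketch below imitates that tokenization and the bi-gram and tri-gram counting on made-up names. It is a simplified illustration, not the scikit-learn implementation, and it skips stop-word removal; `TOKEN`, `ngrams`, and `sample` are hypothetical names.

```python
import re
from collections import Counter

# Tokens of two or more word characters, mirroring CountVectorizer's default token pattern
TOKEN = re.compile(r"\b\w\w+\b")

def ngrams(text, n_min=2, n_max=3):
    """Yield all word n-grams of length n_min..n_max from a lowercased text."""
    tokens = TOKEN.findall(text.lower())
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

# Hypothetical event names
sample = ["10-y Bond Auction", "30-y Bond Auction", "CPI m/m"]
counts = Counter(gram for name in sample for gram in ngrams(name))

for gram, count in counts.most_common():
    print(f"{gram}: {count}")
```

Note that "CPI m/m" contributes nothing here, because only one token survives and an n-gram needs at least two.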

#### Single word count

In [132]:
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer

# Convert everything to lowercase
event_names = [event.lower() for event in event_names]

# Use CountVectorizer to count single-word frequencies
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(event_names)
word_counts = dict(zip(vectorizer.get_feature_names_out(), X.toarray().sum(axis=0)))

# Sort the words by frequency, most frequent first
sorted_word_counts = dict(sorted(word_counts.items(), key=lambda item: item[1], reverse=True))

# Print the words with their counts
print("Most frequent words:")
for word, count in sorted_word_counts.items():
    print(f"{word}: {count}")

Most frequent words:
speaks: 42
member: 28
index: 26
pmi: 25
german: 24
manufacturing: 23
cpi: 22
gdp: 21
fomc: 19
prelim: 19
rate: 19
final: 17
french: 17
sales: 16
gov: 15
flash: 14
core: 13
policy: 13
monetary: 12
services: 12
business: 11
consumer: 11
italian: 11
report: 11
price: 10
balance: 9
change: 9
economic: 9
mpc: 9
ppi: 9
rba: 9
statement: 9
bank: 8
conference: 8
confidence: 8
expectations: 8
industrial: 8
inflation: 8
press: 8
production: 8
retail: 8
auction: 7
boc: 7
bond: 7
employment: 7
financial: 7
prices: 7
revised: 7
spanish: 7
trade: 7
unemployment: 7
boe: 6
boj: 6
hpi: 6
orders: 6
rbnz: 6
sentiment: 6
snb: 6
spending: 6
stability: 6
10: 5
ecb: 5
fed: 5
private: 5
building: 4
bulletin: 4
climate: 4
construction: 4
council: 4
credit: 4
holiday: 4
home: 4
housing: 4
inventories: 4
investment: 4
labor: 4
meeting: 4
money: 4
non: 4
of: 4
quarterly: 4
uom: 4
3m: 3
anz: 3
average: 3
budget: 3
cash: 3
earnings: 3
elections: 3
estimate: 3
foreign: 3
goods: 3
import: 3
ism:

### **Classification: A Three-Level Categorization**

#### **1. Word Analysis to Level 1 Classification (Primary Category):**

- **Why**: The word count analysis helps identify key terms that frequently appear in event names. By identifying these key terms, we can categorize events into broad **Primary Categories** (Level 1) such as GDP, Inflation, Bond, etc. This initial categorization organizes the information systematically, laying the foundation for more detailed analysis.

- **How**: Based on the key terms identified from the word count (e.g., "gdp", "inflation", "bond"), we group related events into their corresponding **Primary Categories**.

#### **2. Level 1 Classification (Primary Category):**

- **Definition**:
  - **GDP**: Events related to Gross Domestic Product.
  - **Inflation**: Events related to inflation metrics.
  - **Bond**: Events related to bonds and bond auctions.
  - **Employment Indicators**: Events related to the labor market.
  - **Production Indicators**: Events related to industrial production and factory orders.
  - **Sales Indicators**: Events related to sales figures, such as retail and home sales.
  - **Speak**: Speeches and statements from financial leaders.
  - **Submit**: Financial reports and documents submitted.
  - **Monetary Policy**: Events related to monetary policy decisions.
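  - **Holiday**: Public holidays and bank holidays.
  - **Political Events**: Elections and summits.
  - **Economic Indicators**: Consumer sentiment, consumer confidence, and manufacturing indices.
  - **Other**: Events that do not fit into the above categories.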

#### **3. Level 2 Classification (Reporting Type + Event Name):**

- **Why**: Events that share the same indicator might have multiple versions, such as "Prelim", "Final", "Flash", etc. Combining **Reporting Type** with **Event Name** in Level 2 allows for precise classification of the report type and the specific indicator.

- **How**: 
  - We use keywords like "prelim", "final", "flash", etc., combined with specific indicator names like "CPI", "GDP", "Inflation Expectations", to classify events at Level 2.
  - For example, "Prelim UoM Inflation Expectations" is classified as having a **Reporting Type** of "Prelim" and an **Event Name** of "Inflation Expectations".
- **Refinement**:
  - After classifying events into Level 1 categories, we review those that remain unspecified or not clearly labeled according to the Level 2 formula. We then classify these events correctly to minimize the number of unspecified entries, so that each event is labeled according to the defined pattern.

#### **4. Level 3 Classification (Time Frame):**

- **Why**: The time frame (m/m, y/y, q/q, etc.) indicates the period over which the event or indicator is measured. Classifying events by **Time Frame** adds an additional layer of detail, providing insights into the time-related context of the indicator.

- **How**:
  - We use keywords related to time frames such as "m/m", "y/y", "q/q" to classify events at Level 3.
  - For example, "MI Inflation Gauge m/m" has a **Time Frame** of "m/m", indicating that this indicator is measured month-over-month.
- **Refinement**:
  - Similar to Level 2, we examine any events that remain unspecified after Level 2 classification. We apply the defined time frame patterns so that these events are classified in Level 3, reducing the number of unspecified entries.

### **Summary:**

- **Level 1**: Identifies the primary category of the event based on key terms (e.g., GDP, Inflation, Bond).
- **Level 2**: Combines the reporting type with the specific indicator name for detailed classification.
- **Level 3**: Adds further granularity by classifying events based on the time frame over which the indicator is measured.

This classification approach ensures that data is organized logically and systematically, facilitating deeper analysis of economic events.
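
As a minimal illustration of Levels 2 and 3, the sketch below pulls a reporting type and a time frame out of an event name. The vocabulary in `REPORTING_TYPES`, the `TIME_FRAME` pattern, and the function `split_event` are assumptions made for this example; the classification function further down uses explicit keyword dictionaries instead.

```python
import re

# Assumed vocabulary of reporting types; extend it to match the real event names
REPORTING_TYPES = ("prelim", "final", "flash", "revised", "advance")
# Time frames written as two letters around a slash, e.g. m/m, y/y, q/q, q/y
TIME_FRAME = re.compile(r"\b([myq]/[myq])\b")

def split_event(event_name):
    """Return (reporting_type, time_frame), using 'unspecified' when absent."""
    name = event_name.lower()
    reporting_type = next(
        (token for token in name.split() if token in REPORTING_TYPES),
        "unspecified",
    )
    match = TIME_FRAME.search(name)
    time_frame = match.group(1) if match else "unspecified"
    return reporting_type, time_frame

for example in ["Prelim UoM Inflation Expectations", "MI Inflation Gauge m/m", "Flash GDP q/q"]:
    print(example, "->", split_event(example))
```

The first two examples are the ones named in the text above: one has a reporting type but no time frame, the other has a time frame but no reporting type.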


In [133]:
# Function that classifies an event name into three levels
def categorize_event(event_name):
    event_name = event_name.lower()

    # Define categories for Level 1
    primary_categories = {
        'gdp': ['gdp'],
        'inflation': ['cpi', 'core cpi', 'ppi', 'inflation', 'pce'],
        'bond': ['bond', 'auction', 'treasury'],
        'employment indicators': ['employment', 'jolts job openings', 'unemployment', 'claimant count change'],  # Updated
        'production indicators': ['industrial production', 'factory orders', 'pmi'],
        'sales indicators': ['retail sales', 'new home sales', 'pending home sales'],
        'speak': ['speaks', 'testifies', 'press conference'],
        'submit': ['release', 'projections', 'budget'],
        'monetary policy': ['monetary policy', 'rate statement', 'meeting minutes', 'policy rate', 'official bank rate',
                            'cash rate', 'overnight rate', 'main refinancing rate', 'federal funds rate', 'boj outlook report', 'fomc statement'],  # Updated
        'political events': ['elections', 'summit', 'meetings'],  # Updated
        'economic indicators': ['consumer sentiment', 'consumer confidence', 'manufacturing index'],  # Updated
        'holiday': ['bank holiday', 'public holiday'],
        'other': [] # Other categories will go here
    }

    # Level 2 - Reporting Type + Specific Index
    level_2_categories = {
        
        # lv1: gdp
        'final gdp': ['final gdp'],
        'prelim gdp': ['prelim gdp'],
        'advance gdp': ['advance gdp'],  
        'gdp flash estimate': ['gdp flash estimate', 'flash gdp'], 
        'revised gdp': ['revised gdp'],  
        'gdp price index': ['gdp price index'], 
        'gdp estimate': ['gdp estimate', 'niesr gdp estimate'],  
        'gdp general': ['gdp m/m', 'gdp q/y', 'gdp q/q'],  
        
        # lv1: inflation (cpi / ppi / pce)
        'prelim cpi': ['prelim cpi'],
        'flash cpi': ['flash cpi', 'cpi flash estimate', 'flash estimate cpi'],
        # Specific CPI variants come before 'general cpi': the first match wins,
        # and e.g. "core cpi m/m" also contains the general keyword "cpi m/m"
        'core cpi': ['core cpi'],
        'median cpi': ['median cpi'],
        'trimmed cpi': ['trimmed cpi'],
        'common cpi': ['common cpi'],
        'trimmed mean cpi': ['trimmed mean cpi q/q'],
        'general cpi': ['cpi m/m', 'cpi y/y', 'cpi q/q'],
        'ppi input': ['ppi input'],
        'ppi output': ['ppi output'],
        # Same reasoning: "core ppi m/m" and "sppi y/y" contain general PPI keywords
        'core ppi': ['core ppi'],
        'sppi': ['sppi y/y'],
        'general ppi': ['ppi m/m', 'ppi y/y', 'ppi q/q', 'german ppi m/m', 'ippi m/m'],
        'core pce': ['core pce price index m/m'],
        'inflation expectations': [
            'mi inflation expectations',
            'prelim uom inflation expectations',
            'revised uom inflation expectations',
            'cleveland fed inflation expectations',
            'consumer inflation expectations'],
        'inflation letter': ['boe inflation letter'],
        
        # lv1 : bond
        'bond auction': ['bond auction'],
        'treasury speak': ['treasury sec yellen speaks'],  # Matches this full phrase only
        'treasury report': ['treasury currency report'],
        
        # lv1: employment indicators
        'unemployment rate': ['unemployment rate'],
        # Listed before 'employment change' because "unemployment change" contains that keyword
        'unemployment claims': ['unemployment claims', 'unemployment change'],
        'employment change': ['employment change', 'non-farm employment change', 'adp non-farm employment change'],
        'job openings': ['job openings', 'jolts job openings'],
        'employment cost index': ['employment cost index'],

        # lv1: production indicators
        'manufacturing pmi': [
            'manufacturing pmi', 'caixin manufacturing pmi', 'spanish manufacturing pmi',
            'italian manufacturing pmi', 'french final manufacturing pmi',
            'german final manufacturing pmi', 'final manufacturing pmi', 'ism manufacturing pmi'],
        'services pmi': ['services pmi', 'caixin services pmi', 'spanish services pmi'],
        
        #lv1: retail sales
        # 'core retail sales' must precede 'retail sales', whose keyword it contains
        'core retail sales': ['core retail sales'],
        'retail sales': ['retail sales', 'german retail sales', 'italian retail sales','brc retail sales monitor'],
        'new home sales': ['new home sales'],
        'pending home sales': ['pending home sales'],
   
        # lv1: speak
        'central bank speak': [
            'fomc member',
            'boe gov',
            'ecb president',
            'boj gov',
            'rba gov',
            'rbnz gov',
            'boc gov',
            'snb chairman',
            'mpc member',
            'gov council member',
            'gov board member'
        ],
        'press conference': [
            'press conference'
        ],
        'government official speak': [
            'fed chair powell speaks',
            'president biden speaks',
            'fed chair powell testifies'
        ],

        # lv1: submit
        'budget release': [
            'annual budget release',
            'federal budget balance',
            'french gov budget balance'
        ],
        'economic projections': [
            'fomc economic projections'
        ],

        # lv1: monetary policy
        'meeting minutes': [
            'fomc meeting minutes',
            'monetary policy meeting minutes',
            'fpc meeting minutes',
            'ecb monetary policy meeting accounts'
        ],
        # Listed before 'policy statement/report' because its phrase contains "monetary policy report"
        'monetary policy hearings': [
            'monetary policy report hearings'
        ],
        'policy statement/report': [
            'monetary policy statement',
            'monetary policy report',
            'boe monetary policy report',
            'rba monetary policy statement',
            'boc monetary policy report',
            'fed monetary policy report',
            'snb monetary policy assessment',
            'rbnz monetary policy statement',
            'monetary policy summary'
        ],
        'rate statement/policy rate': [
            'boj policy rate',
            'boc rate statement',
            'rba rate statement',
            'rbnz rate statement',
            'snb policy rate'
        ],
        'rate decision': [
            'official cash rate',
            'federal funds rate',
            'main refinancing rate',
            'cash rate',
            'mpc official bank rate votes',
            'official bank rate',
            'overnight rate',
        ],
        'policy statement': [
            'fomc statement',
        ],
        'outlook report': [
            'boj outlook report',
        ],
        
        # lv1: holiday    
        'bank holiday': ['french bank holiday', 'german bank holiday', 'italian bank holiday', 'bank holiday'], 
        'public holiday': ['public holiday'],

        # lv1: economic indicators (updated)
        'consumer sentiment': [
            'prelim uom consumer sentiment',
            'revised uom consumer sentiment'
        ],
        'consumer confidence': [
            'cb consumer confidence'
        ],
        'employment indicators': [
            'claimant count change',
            'average hourly earnings m/m',
            'wage price index q/q'
        ],
        'manufacturing index': [
            'empire state manufacturing index',
            'durable goods orders m/m'
        ],

        # lv1: political events (updated)
        'parliamentary elections': [
            'french parliamentary elections',
            'european parliamentary elections',
            'parliamentary elections'
        ],
        'summit': [
            'euro summit'
        ]
    }

    # Level 3 - Time Frame
    level_3_categories = {
        'm/m': ['m/m', 'month-over-month'],
        'y/y': ['y/y', 'year-over-year'],
        'q/q': ['q/q', 'quarter-over-quarter'],
        '10-year': ['10-y'],
        '30-year': ['30-y'],
        'monthly': ['monthly'],
        'annual': ['annual', 'year'],
        'quarterly': ['quarterly'],
        'q/y': ['q/y', 'quarter-over-year']  # Added the "q/y" keyword
    }

    # Initialize categories
    level_1 = 'other'
    level_2 = 'unspecified'
    level_3 = 'unspecified'

    # Categorize Level 1
    for key, keywords in primary_categories.items():
        if any(keyword in event_name for keyword in keywords):
            level_1 = key
            break

    # Categorize Level 2
    for key, keywords in level_2_categories.items():
        if any(keyword in event_name for keyword in keywords):
            level_2 = key
            break

    # Categorize Level 3
    for key, keywords in level_3_categories.items():
        if any(keyword in event_name for keyword in keywords):
            level_3 = key
            break

    return level_1, level_2, level_3

# Apply the classification function
df_combined[['category_lv1', 'category_lv2', 'category_lv3']] = df_combined['event'].apply(lambda x: pd.Series(categorize_event(x)))

# Show the result
df_combined.head(5)


Unnamed: 0,date,time,currency,impact,event,actual,forecast,previous,category_lv1,category_lv2,category_lv3
0,Jan 1,All Day,NZD,gray,Bank Holiday,,,,holiday,bank holiday,unspecified
1,Jan 1,All Day,AUD,gray,Bank Holiday,,,,holiday,bank holiday,unspecified
2,Jan 1,All Day,JPY,gray,Bank Holiday,,,,holiday,bank holiday,unspecified
3,Jan 1,All Day,CNY,gray,Bank Holiday,,,,holiday,bank holiday,unspecified
4,Jan 1,All Day,CHF,gray,Bank Holiday,,,,holiday,bank holiday,unspecified
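
Because each loop in `categorize_event` stops at the first key whose keyword occurs in the name, the order of the dictionary entries decides the label whenever one keyword is a substring of another. One possible alternative, sketched below, is to prefer the longest matching keyword so that the result no longer depends on ordering. `LEVEL_2_RULES` is a reduced, hypothetical rule table and `first_match` / `longest_match` are illustrative helpers, not part of the notebook.

```python
# Reduced, hypothetical rule table with general entries listed first on purpose
LEVEL_2_RULES = {
    "general cpi": ["cpi m/m", "cpi y/y"],
    "core cpi": ["core cpi"],
    "retail sales": ["retail sales"],
    "core retail sales": ["core retail sales"],
}

def first_match(name, rules):
    """Return the first label with a keyword in the name (order-dependent)."""
    name = name.lower()
    for label, keywords in rules.items():
        if any(keyword in name for keyword in keywords):
            return label
    return "unspecified"

def longest_match(name, rules):
    """Return the label of the longest keyword found in the name."""
    name = name.lower()
    hits = [
        (len(keyword), label)
        for label, keywords in rules.items()
        for keyword in keywords
        if keyword in name
    ]
    return max(hits, key=lambda hit: hit[0])[1] if hits else "unspecified"

for event in ["Core CPI m/m", "Core Retail Sales m/m", "CPI y/y"]:
    print(event, "|", first_match(event, LEVEL_2_RULES), "|", longest_match(event, LEVEL_2_RULES))
```

Longest-match is only one tie-breaking rule; ordering the dictionary from specific to general achieves the same effect as long as the ordering is maintained by hand.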


### Refinement process

#### GDP

In [134]:
# Refinement process for GDP events
def refine_gdp_classification(df):
    # Step 1: Select all GDP events
    gdp_events = df[df['category_lv1'] == 'gdp'].copy()  # Work on a copy of the original DataFrame

    # Step 2: Keep the four relevant columns
    gdp_events = gdp_events[['event', 'category_lv1', 'category_lv2', 'category_lv3']]

    # Step 3: Keep events that are still unspecified at Level 2 or Level 3
    unspecified_gdp_events = gdp_events[
        (gdp_events['category_lv2'] == 'unspecified') |
        (gdp_events['category_lv3'] == 'unspecified')
        ]

    # Step 4: Drop duplicates so each event appears once
    unique_unspecified_gdp_events = unspecified_gdp_events.drop_duplicates()

    # Step 5: Return the result for manual classification
    return unique_unspecified_gdp_events

# Apply the refinement process to the GDP events
refined_gdp_df = refine_gdp_classification(df_combined)

# Show the GDP events that are not yet fully classified
refined_gdp_df


Unnamed: 0,event,category_lv1,category_lv2,category_lv3
146,NIESR GDP Estimate,gdp,gdp estimate,unspecified


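- **Final result**: The GDP classification matches expectations. The only remaining row, "NIESR GDP Estimate", has no time frame in its name, so its Level 3 label stays unspecified.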

#### Inflation

In [135]:
# Refinement process for inflation events
def inflation_events_classification(df):
    # Step 1: Select all inflation events
    inflation_events = df[df['category_lv1'] == 'inflation'].copy()  # Work on a copy of the original DataFrame

    # Step 2: Keep the four relevant columns
    inflation_events = inflation_events[['event', 'category_lv1', 'category_lv2', 'category_lv3']]

    # Step 3: Keep events that are still unspecified at Level 2 or Level 3
    unspecified_inflation_events = inflation_events[
        (inflation_events['category_lv2'] == 'unspecified') |
        (inflation_events['category_lv3'] == 'unspecified')
        ]

    # Step 4: Drop duplicates so each event appears once
    unique_unspecified_inflation_events = unspecified_inflation_events.drop_duplicates()

    # Step 5: Return the result for manual classification
    return unique_unspecified_inflation_events

# Apply the refinement process to the inflation events
refined_inflation_events = inflation_events_classification(df_combined)

# Show the inflation events that are not yet fully classified
refined_inflation_events


Unnamed: 0,event,category_lv1,category_lv2,category_lv3
151,MI Inflation Gauge m/m,inflation,unspecified,m/m
220,MI Inflation Expectations,inflation,inflation expectations,unspecified
252,Prelim UoM Inflation Expectations,inflation,inflation expectations,unspecified
425,Revised UoM Inflation Expectations,inflation,inflation expectations,unspecified
513,Cleveland Fed Inflation Expectations,inflation,inflation expectations,unspecified
521,Inflation Expectations q/q,inflation,unspecified,q/q
959,Consumer Inflation Expectations,inflation,inflation expectations,unspecified
1055,BOE Inflation Letter,inflation,inflation letter,unspecified


 # Checklist



[x] 1. GDP

[x] 2. Inflation

[x] 3. Bond

[x] 4. Employment Indicators

[x] 5. Production Indicators

[x] 6. Sales Indicators

[x] 7. Speak

[x] 8. Submit

[x] 9. Monetary Policy

[x] 10. Holiday

#### General Refinement Process for Events


In [136]:
def events_classification(df, category):
    # Step 1: Filter all events based on the given category
    events = df[df['category_lv1'] == category].copy()  # Create a copy of the original DataFrame

    # Step 2: Keep only the columns needed for the review
    events = events[['event', 'category_lv1', 'category_lv2', 'category_lv3', 'impact', 'actual']]

    # Step 3: Filter events that are unspecified in Level 2 or Level 3
    unspecified_events = events[
        (events['category_lv2'] == 'unspecified') |
        (events['category_lv3'] == 'unspecified')
    ]

    # Step 4: Drop duplicate events to get only unique events
    unique_unspecified_events = unspecified_events.drop_duplicates()

    # Step 5: Return the results for manual classification
    return unique_unspecified_events

# Apply the refinement process to one category (here the fallback category "other")
refined_events = events_classification(df_combined, 'other')

# Show the events with impact 'red' that are not yet fully classified

refined_events[refined_events['impact'] == 'red']

Unnamed: 0,event,category_lv1,category_lv2,category_lv3,impact,actual


## Finalize

In [137]:
df_combined.columns

Index(['date', 'time', 'currency', 'impact', 'event', 'actual', 'forecast',
       'previous', 'category_lv1', 'category_lv2', 'category_lv3'],
      dtype='object')

Validating the categories in your dataset is an important step to ensure that the categorization logic is working correctly. Here’s how you can validate the categories:

### 1. **Check for Unspecified Categories:**
   - First check how many events fell through to the fallback label at each level. Level 1 falls back to "other", while Levels 2 and 3 fall back to "unspecified". This helps you identify events that haven’t been categorized properly.

   ```python
   # Count fallback labels: Level 1 defaults to 'other', Levels 2 and 3 to 'unspecified'
   unspecified_lv1 = df_combined[df_combined['category_lv1'] == 'other']
   unspecified_lv2 = df_combined[df_combined['category_lv2'] == 'unspecified']
   unspecified_lv3 = df_combined[df_combined['category_lv3'] == 'unspecified']

   print(f"Unspecified in Level 1: {len(unspecified_lv1)}")
   print(f"Unspecified in Level 2: {len(unspecified_lv2)}")
   print(f"Unspecified in Level 3: {len(unspecified_lv3)}")
   ```

### 2. **Review Sample Events:**
   - Manually review a sample of events in each category to ensure they are being categorized correctly. You can take a random sample or review all events within a specific category.

   ```python
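   # Sample of events in a specific category to manually validate
   inflation_events = df_combined[df_combined['category_lv1'] == 'inflation']
   # Cap the sample size so the call does not fail when fewer than 10 rows exist
   sample_events = inflation_events.sample(min(10, len(inflation_events)), random_state=0)
   print(sample_events)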
   ```

### 3. **Cross-Check with Known Data:**
   - Compare the categorized data against known, validated datasets or benchmarks. If you have access to a source where similar events are categorized, you can cross-check your results.

   ```python
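   # Assume a benchmark file exists with the columns event, category_lv1, category_lv2, category_lv3
   benchmark_df = pd.read_csv('benchmark_categorized_events.csv')

   # Compare one row per unique event so repeated events do not multiply in the merge
   label_cols = ['event', 'category_lv1', 'category_lv2', 'category_lv3']
   comparison_df = df_combined[label_cols].drop_duplicates().merge(
       benchmark_df[label_cols].drop_duplicates(), on='event', suffixes=('_yours', '_benchmark'))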

   # Check for differences
   mismatches = comparison_df[(comparison_df['category_lv1_yours'] != comparison_df['category_lv1_benchmark']) |
                              (comparison_df['category_lv2_yours'] != comparison_df['category_lv2_benchmark']) |
                              (comparison_df['category_lv3_yours'] != comparison_df['category_lv3_benchmark'])]
   print(f"Mismatches found: {len(mismatches)}")
   ```

### 4. **Visualize the Category Distribution:**
   - Visualize how events are distributed across categories to identify any anomalies or unexpected distributions.

   ```python
   import matplotlib.pyplot as plt

   # Distribution of events by Level 1 category
   df_combined['category_lv1'].value_counts().plot(kind='bar')
   plt.title('Distribution of Events by Level 1 Category')
   plt.xlabel('Category Level 1')
   plt.ylabel('Number of Events')
   plt.show()

   # Distribution of unspecified categories
   unspecified_lv2['category_lv1'].value_counts().plot(kind='bar')
   plt.title('Unspecified Events by Level 1 Category in Level 2')
   plt.xlabel('Category Level 1')
   plt.ylabel('Number of Unspecified Events')
   plt.show()
   ```

### 5. **Use Assertions:**
   - Implement assertions to validate that specific known events fall into the expected categories.

   ```python
   # Assert known event is categorized correctly
   assert df_combined[df_combined['event'] == 'CPI y/y']['category_lv1'].values[0] == 'inflation', "CPI y/y should be in 'inflation'"
   assert df_combined[df_combined['event'] == 'FOMC Meeting Minutes']['category_lv1'].values[0] == 'monetary policy', "FOMC Meeting Minutes should be in 'monetary policy'"
   ```

### 6. **Iterative Refinement:**
   - Based on the findings from the above steps, refine the categorization logic by updating or adding keywords, adjusting categories, and rerunning the categorization.

### 7. **Documentation:**
   - Document any changes you make during the validation process. Keeping a log of adjustments ensures you can track why certain events were reclassified.

### Final Steps:
- After validation, if a significant number of events are categorized as "unspecified" or incorrectly classified, it indicates the need for refinement in the category logic.
- Continue refining your categorization logic based on validation results until the accuracy meets your standards.
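
To judge whether the number of fallback labels is "significant", it can help to look at their share per Level 1 category rather than at raw counts. The sketch below does this on a small made-up frame; `labelled` is a hypothetical stand-in for `df_combined`, and the column names follow the ones created earlier.

```python
import pandas as pd

# Hypothetical labelled rows standing in for df_combined
labelled = pd.DataFrame({
    "category_lv1": ["gdp", "gdp", "inflation", "inflation", "other"],
    "category_lv2": ["prelim gdp", "gdp estimate", "unspecified", "core cpi", "unspecified"],
    "category_lv3": ["q/q", "unspecified", "m/m", "m/m", "unspecified"],
})

# Share of rows per Level 1 category that are still unspecified at Level 2 and Level 3
coverage = (
    labelled.assign(
        lv2_missing=labelled["category_lv2"].eq("unspecified"),
        lv3_missing=labelled["category_lv3"].eq("unspecified"),
    )
    .groupby("category_lv1")[["lv2_missing", "lv3_missing"]]
    .mean()
)
print(coverage)
```

A missing Level 3 label is not always an error: events such as speeches or holidays have no time frame, so the shares are best read per category.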

## Cleaning Data

In [138]:
df_combined.dtypes


date            object
time            object
currency        object
impact          object
event           object
actual          object
forecast        object
previous        object
category_lv1    object
category_lv2    object
category_lv3    object
dtype: object

In [139]:
# 1. Normalize the time column
def process_time(time_str):
    # Leave missing or non-string values untouched; they become NaN below
    if not isinstance(time_str, str):
        return time_str
    # 'All Day' and values starting with 'Day' are mapped to midnight
    if time_str == 'All Day' or time_str.startswith('Day'):
        return '12:00am'
    return time_str

df_combined['time'] = df_combined['time'].apply(process_time)

# Parse the 12-hour clock strings and render them as HH:MM:SS; unparseable values become NaN
df_combined['time'] = pd.to_datetime(df_combined['time'], format='%I:%M%p', errors='coerce').dt.strftime('%H:%M:%S')

# 2. Parse the date column
# The source has no year, so append 2024, the year of the loaded files
df_combined['date'] = pd.to_datetime(df_combined['date'] + ' 2024', format='%b %d %Y')

# 3. Combine date and time into a single datetime column
df_combined['datetime'] = df_combined['date'] + pd.to_timedelta(df_combined['time'])

# 4. Record whether the index is sorted in increasing order, then reset it to a clean 0..n-1 range
is_index_set = df_combined.index.is_monotonic_increasing  # True if the index is in ascending order
df_combined.reset_index(drop=True, inplace=True)
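
With `errors='coerce'`, any time string that does not match `%I:%M%p` becomes missing, and the combined `datetime` for that row ends up as `NaT`. If such rows should keep their date, one option is to fall back to midnight and record in a flag that no clock time was given. The sketch below shows this on made-up rows; the value "Tentative" is an assumed example of a non-clock entry, and `raw`, `clock`, and `has_clock_time` are illustrative names.

```python
import pandas as pd

# Hypothetical rows; "Tentative" is an assumed example of a non-clock value
raw = pd.DataFrame({
    "date": ["Jan 1", "Jan 5", "Feb 29"],
    "time": ["All Day", "8:30am", "Tentative"],
})

# Parse clock times; anything that is not a 12-hour clock string becomes NaT
clock = pd.to_datetime(raw["time"], format="%I:%M%p", errors="coerce")
raw["has_clock_time"] = clock.notna()

# Time of day as an offset from midnight, with midnight as the fallback
offset = (clock - clock.dt.normalize()).fillna(pd.Timedelta(0))

# The year 2024 is appended as in the cell above
day = pd.to_datetime(raw["date"] + " 2024", format="%b %d %Y")
raw["datetime"] = day + offset
print(raw)
```

Keeping the flag separate avoids confusing an all-day event with one that really happens at 00:00.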

In [140]:
# Convert a value to float, handling percentages and the K, M, B, T suffixes
def convert_to_float(value):
    if isinstance(value, str):
        multiplier = 1
        value = value.strip()  # Remove leading and trailing whitespace

        if value.endswith('%'):
            # A trailing '%' marks a percentage; convert it to a fraction
            value = value[:-1]
            try:
                return float(value) / 100
            except ValueError:
                return np.nan

        if value.endswith('K'):
            multiplier = 1e3
            value = value[:-1]
        elif value.endswith('M'):
            multiplier = 1e6
            value = value[:-1]
        elif value.endswith('B'):
            multiplier = 1e9
            value = value[:-1]
        elif value.endswith('T'):
            multiplier = 1e12
            value = value[:-1]

        try:
            return float(value) * multiplier
        except ValueError:
            return np.nan
    try:
        return float(value) if value != '' else np.nan
    except ValueError:
        return np.nan

# Ứng dụng hàm chuyển đổi cho các cột cần thiết
for col in ['forecast', 'previous', 'actual']:
    df_combined[col] = df_combined[col].apply(convert_to_float)

# Chuyển đổi các cột sang float64
df_combined['forecast'] = df_combined['forecast'].astype('float64')
df_combined['previous'] = df_combined['previous'].astype('float64')
df_combined['actual'] = df_combined['actual'].astype('float64')

# Kiểm tra kết quả
print(df_combined[['forecast', 'previous', 'actual']])


         forecast     previous       actual
0             NaN          NaN          NaN
1             NaN          NaN          NaN
2             NaN          NaN          NaN
3             NaN          NaN          NaN
4             NaN          NaN          NaN
...           ...          ...          ...
2769        0.001        0.003        0.002
2770        0.010        0.012        0.009
2771       44.800       47.400       45.300
2772        0.014       -0.019        0.048
2773 -1600000.000 -3700000.000 -3400000.000

[2774 rows x 3 columns]
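
The same conversion rules can be written more compactly with a lookup table, which makes them easy to spot-check on individual strings. This is a sketch of an alternative, not the function used above; the names `SUFFIX_MULTIPLIERS` and `parse_number` are hypothetical, and the sample strings are made up.

```python
import numpy as np

# Magnitude suffixes and their multipliers
SUFFIX_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}

def parse_number(value):
    """Parse strings such as '0.2%', '-3.4M' or '45.3' into floats; return NaN on failure."""
    if not isinstance(value, str):
        try:
            return float(value)
        except (TypeError, ValueError):
            return np.nan

    text = value.strip()
    if not text:
        return np.nan

    try:
        if text.endswith("%"):
            # Percentages are stored as fractions
            return float(text[:-1]) / 100
        if text[-1] in SUFFIX_MULTIPLIERS:
            return float(text[:-1]) * SUFFIX_MULTIPLIERS[text[-1]]
        return float(text)
    except ValueError:
        return np.nan

# Spot-check a few made-up inputs
for raw in ["0.2%", "-3.4M", "45.3", "", "n/a", None]:
    print(repr(raw), "->", parse_number(raw))
```

Note that both versions store percentages as fractions while index-style values such as PMI readings stay on their original scale, so the three numeric columns mix units and should be compared only within the same event type.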


In [141]:
df_combined.dtypes

date            datetime64[ns]
time                    object
currency                object
impact                  object
event                   object
actual                 float64
forecast               float64
previous               float64
category_lv1            object
category_lv2            object
category_lv3            object
datetime        datetime64[ns]
dtype: object

In [142]:
# Desired column order
column_order = [
    'datetime', 'currency', 'impact', 'event',
    'actual', 'forecast', 'previous',
    'category_lv1', 'category_lv2', 'category_lv3',
    'date', 'time'
]

# Reorder the columns by selecting them in the new order
df_combined = df_combined[column_order]

df_combined

| | datetime | currency | impact | event | actual | forecast | previous | category_lv1 | category_lv2 | category_lv3 | date | time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-01-01 00:00:00 | NZD | gray | Bank Holiday | NaN | NaN | NaN | holiday | bank holiday | unspecified | 2024-01-01 | 00:00:00 |
| 1 | 2024-01-01 00:00:00 | AUD | gray | Bank Holiday | NaN | NaN | NaN | holiday | bank holiday | unspecified | 2024-01-01 | 00:00:00 |
| 2 | 2024-01-01 00:00:00 | JPY | gray | Bank Holiday | NaN | NaN | NaN | holiday | bank holiday | unspecified | 2024-01-01 | 00:00:00 |
| 3 | 2024-01-01 00:00:00 | CNY | gray | Bank Holiday | NaN | NaN | NaN | holiday | bank holiday | unspecified | 2024-01-01 | 00:00:00 |
| 4 | 2024-01-01 00:00:00 | CHF | gray | Bank Holiday | NaN | NaN | NaN | holiday | bank holiday | unspecified | 2024-01-01 | 00:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2769 | 2024-07-31 19:30:00 | CAD | red | GDP m/m | 0.002 | 0.001 | 0.003 | gdp | gdp general | m/m | 2024-07-31 | 19:30:00 |
| 2770 | 2024-07-31 19:30:00 | USD | red | Employment Cost Index q/q | 0.009 | 0.010 | 0.012 | employment indicators | employment cost index | q/q | 2024-07-31 | 19:30:00 |
| 2771 | 2024-07-31 20:45:00 | USD | orange | Chicago PMI | 45.300 | 44.800 | 47.400 | production indicators | unspecified | unspecified | 2024-07-31 | 20:45:00 |
| 2772 | 2024-07-31 21:00:00 | USD | red | Pending Home Sales m/m | 0.048 | 0.014 | -0.019 | sales indicators | pending home sales | m/m | 2024-07-31 | 21:00:00 |
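
For time-based lookups in later analysis, the `datetime` column could also serve as a sorted index. The sketch below is an optional variation, not a step this notebook performs; it uses three rows copied from the table above, and the variable names `indexed` and `july_31` are hypothetical.

```python
import pandas as pd

# Three rows copied from the table above, deliberately out of order
sample = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2024-07-31 19:30:00",
        "2024-01-01 00:00:00",
        "2024-07-31 21:00:00",
    ]),
    "currency": ["CAD", "NZD", "USD"],
    "event": ["GDP m/m", "Bank Holiday", "Pending Home Sales m/m"],
})

# A sorted DatetimeIndex allows partial-string selection by day, month, or year
indexed = sample.set_index("datetime").sort_index()
july_31 = indexed.loc["2024-07-31"]
print(july_31)
```

Since several events share the same timestamp, such an index is not unique, which is acceptable for slicing but rules out lookups that expect exactly one row per timestamp.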


### Export one single table to silver directory for viz

In [143]:
# File name for the exported CSV
csv_file_name = 'final_transformed_data.csv'
csv_file_path = f'/Users/datpro/Documents/gitdatpro/ff-transform-data/data/silver/{csv_file_name}'

# Export the DataFrame as CSV under the chosen file name
df_combined.to_csv(csv_file_path, index=False)

print(f"DataFrame exported successfully as {csv_file_name} to {csv_file_path}")


DataFrame exported successfully as final_transformed_data.csv to /Users/datpro/Documents/gitdatpro/ff-transform-data/data/silver/final_transformed_data.csv
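
The export above depends on an absolute path that exists on one machine only. A more portable variant is sketched below, assuming the same `data/silver` layout under some project root; the helper `export_silver`, its `base_dir` parameter, and the sample row are hypothetical, and a temporary directory stands in for the real project root.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_silver(df, base_dir, file_name="final_transformed_data.csv"):
    """Write df to <base_dir>/data/silver/<file_name>, creating folders as needed."""
    out_dir = Path(base_dir) / "data" / "silver"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / file_name
    df.to_csv(out_path, index=False)
    return out_path

# Hypothetical one-row frame standing in for df_combined
sample = pd.DataFrame({
    "datetime": pd.to_datetime(["2024-07-31 19:30:00"]),
    "currency": ["CAD"],
    "event": ["GDP m/m"],
    "actual": [0.002],
})

with tempfile.TemporaryDirectory() as tmp:
    path = export_silver(sample, tmp)
    # CSV stores timestamps as text, so they must be parsed again on load
    restored = pd.read_csv(path, parse_dates=["datetime"])

print(path.name, restored.dtypes.to_dict())
```

Reading the file back right after writing it is a cheap check that the column types survive the round trip, in particular `datetime`, which would otherwise come back as plain text.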
