# Implementation

The task is to implement an Aspect-Based Sentiment Analysis (ABSA) pipeline using generative language models such as GPT-4. The implementation covers four phases: Data Preparation, Aspect Extraction, Aspect Sentiment Analysis, and Post-Processing. Each phase comprises specialized sub-tasks, including data cleaning, model fine-tuning, prediction, and output formatting. The objective is an in-depth understanding of the sentiment attached to multiple aspects of a company and its owner, enabling more nuanced reputation management.

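Read as code, the four phases form a thin pipeline. The skeleton below is only a sketch of that flow: every function in it (`collect_articles`, `extract_aspects`, `classify_sentiment`, `aggregate_and_format`) is a hypothetical placeholder standing in for the phase described later in this document, and the placeholder bodies return dummy values.

```python
import json
from typing import Dict, Iterable, List


def collect_articles(urls: Iterable[str]) -> List[Dict]:
    """Phase 1 placeholder: scrape, clean, and annotate articles."""
    return [{"url": u, "text": "", "company": ""} for u in urls]


def extract_aspects(articles: List[Dict]) -> List[Dict]:
    """Phase 2 placeholder: mark aspect categories and aspect terms."""
    return [{**a, "aspect": "", "aspect_terms": []} for a in articles]


def classify_sentiment(articles: List[Dict]) -> List[Dict]:
    """Phase 3 placeholder: assign a polarity to each extracted aspect."""
    return [{**a, "sentiment": "neutral"} for a in articles]


def aggregate_and_format(records: List[Dict]) -> str:
    """Phase 4 placeholder: aggregate and serialize to JSON Lines."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def run_absa_pipeline(urls: Iterable[str]) -> str:
    """Chain the four phases: preparation, extraction, sentiment, post-processing."""
    articles = collect_articles(urls)
    articles = extract_aspects(articles)
    articles = classify_sentiment(articles)
    return aggregate_and_format(articles)
```
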
## Streamlined Procedure

### Phase 1: Data Preparation

1. **Data Collection**

   - **Technique**: A web scraper would crawl the Daum news website for articles specifically related to the target companies and their owners, focusing on the aspects outlined in the prompts, such as "management" or "financial."

2. **Data Annotation**

   - **Procedure**: The acquired corpus would undergo manual annotation, with experts identifying and labeling aspects and sentiments. Aspects would be categorized according to the guideline, such as "management," "workplace," etc.
   - **Prompt-Guided Labeling**: Aspects and sentiments are labeled according to the pre-defined prompts to keep the data set standardized.

3. **Data Cleaning**

   - **Text Normalization**: Transform all text into a standard form: convert to lowercase and remove special characters and extra spaces.
   - **Irrelevant Data Removal**: Discard any data points that are extraneous or not aligned with the objectives.

4. **Train-Test Split**
   - **Stratification**: The data set is divided into training, validation, and test subsets. Stratified sampling ensures each subset reflects the overall distribution of aspects and sentiments (a cleaning-and-split sketch follows this list).

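A minimal sketch of the cleaning and splitting steps, assuming the annotated corpus sits in a pandas DataFrame with hypothetical `text`, `aspect`, and `sentiment` columns; the normalization rules and the 80/10/10 ratio are illustrative choices rather than requirements of the guideline.

```python
import re

import pandas as pd
from sklearn.model_selection import train_test_split


def normalize_text(text: str) -> str:
    """Lowercase, strip special characters, and collapse extra whitespace."""
    text = text.lower()
    text = re.sub(r"[^0-9a-z가-힣\s]", " ", text)  # keep Hangul, digits, latin letters
    return re.sub(r"\s+", " ", text).strip()


def prepare_splits(df: pd.DataFrame, seed: int = 42):
    """Clean the corpus and produce stratified train/validation/test splits."""
    df = df.copy()
    df["text"] = df["text"].map(normalize_text)
    df = df[df["text"].str.len() > 0]  # drop rows left empty after cleaning

    # 80/10/10 split, stratified jointly on aspect and sentiment so every
    # subset mirrors the overall label distribution.
    strata = df["aspect"] + "|" + df["sentiment"]
    train, rest = train_test_split(
        df, test_size=0.2, stratify=strata, random_state=seed
    )
    val, test = train_test_split(
        rest, test_size=0.5, stratify=strata.loc[rest.index], random_state=seed
    )
    return train, val, test
```
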
### Phase 2: Aspect Extraction (AE)

1. **Model Choice**

   - **Rationale**: Generative models like GPT-4 have proven efficacy in complex language tasks, making them apt for the Aspect Extraction task.

2. **Fine-Tuning Objective Function**

   - Adapt the model to predict not just the next token $x_i$ but also a label $y_i$ indicating whether that token or phrase is an aspect:

   $$
   \max_{\theta} \sum_{i=1}^{N} \log P(x_i, y_i \mid x_{<i}; \theta)
   $$

3. **Training**

   - **Data Input**: Use the labeled training set produced in Phase 1.
   - **LLM Embeddings**: As specified, LLM embeddings will be used to train classifiers, capitalizing on their ability to capture nuanced language features (see the sketch after this list).

4. **Validation**
   - **Model Tuning**: The validation set is used for hyperparameter tuning and for evaluating the model with metrics such as F1-score, precision, and recall.

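One way to realize the embedding-plus-classifier route from steps 3 and 4 is sketched below. The OpenAI embeddings endpoint stands in for "LLM embeddings" and a scikit-learn logistic regression for the classifier; both are illustrative assumptions, as are the `text` and `aspect` column names carried over from the Phase 1 split.

```python
import numpy as np
import pandas as pd
from openai import OpenAI
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report


def embed(texts: list[str]) -> np.ndarray:
    """Fetch LLM embeddings for a batch of texts (illustrative model choice)."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])


def train_aspect_classifier(train: pd.DataFrame, val: pd.DataFrame) -> LogisticRegression:
    """Train an aspect-category classifier on LLM embeddings and report metrics.

    `train` and `val` are the splits from Phase 1, with a normalized `text`
    column and a gold `aspect` column.
    """
    X_train = embed(train["text"].tolist())
    X_val = embed(val["text"].tolist())

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, train["aspect"])

    # Validation: per-category precision, recall, and F1-score.
    print(classification_report(val["aspect"], clf.predict(X_val)))
    return clf
```
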
### Phase 3: Aspect Sentiment Analysis (ASA)

1. **Additional Fine-Tuning**

   - Further fine-tune the model to predict the sentiment polarity $z_i$ of each extracted aspect (a sketch of the corresponding training records follows this list):

   $$
   \max_{\theta} \sum_{i=1}^{N} \log P(x_i, y_i, z_i \mid x_{<i}, y_{<i}; \theta)
   $$

2. **Training and Validation**

   - **LLM Embeddings**: The LLM embeddings are again used to train classifiers, here predicting sentiment polarity for the aspects identified in Phase 2.

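The joint objective above translates into training records that pair each sentence with an extracted aspect and its polarity. Below is one possible prompt/completion layout written to JSON Lines; the field names, the instruction wording, and the file name `asa_finetune.jsonl` are assumptions for illustration, not a prescribed fine-tuning format.

```python
import json

# Toy annotated example; in practice these rows come from the Phase 1 corpus.
annotated = [
    {
        "text": "데이터센터 화재에 따른 서비스 장애 피해지원 계획을 발표했다.",
        "aspect": "product & service",
        "sentiment": "positive",
    },
]

with open("asa_finetune.jsonl", "w", encoding="utf-8") as f:
    for ex in annotated:
        record = {
            # Hypothetical layout: the prompt names the sentence and the aspect,
            # the completion is the polarity the model should generate.
            "prompt": f"Sentence: {ex['text']}\nAspect: {ex['aspect']}\nSentiment:",
            "completion": f" {ex['sentiment']}",
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
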
### Phase 4: Post-Processing

1. **Sentiment Aggregation**

   - **Methodology**: Sentiments identified for recurring aspects are statistically aggregated, for example by taking the mode (or the mean of numeric scores), to give a consolidated sentiment per aspect (see the sketch after this list).

2. **Output Formatting**
   - **JSON Lines**: The final output is formatted as JSON Lines, keeping it both human-readable and machine-parseable.

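A minimal sketch of both post-processing steps, assuming the per-mention predictions arrive as dictionaries keyed by `company`, `aspect`, and `sentiment` (field names mirroring the sample output below) and using the mode as the aggregation rule.

```python
import json
from collections import Counter, defaultdict


def aggregate_sentiments(predictions: list[dict]) -> list[dict]:
    """Collapse per-mention predictions into one sentiment per (company, aspect).

    The mode of the predicted polarities becomes the consolidated label.
    """
    grouped = defaultdict(list)
    for p in predictions:
        grouped[(p["company"], p["aspect"])].append(p["sentiment"])

    return [
        {
            "company": company,
            "aspect": aspect,
            "sentiment": Counter(labels).most_common(1)[0][0],  # mode
        }
        for (company, aspect), labels in grouped.items()
    ]


def write_jsonl(records: list[dict], path: str) -> None:
    """Serialize the aggregated records as JSON Lines (one object per line)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```
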
## Sample Output

```json
[
  {"company": "카카오", "aspect": "product & service", "aspect_terms": ["데이터센터 화재에 따른 먹통 사태", "데이터센터 화재에 따른 서비스 장애 피해지원 계획", "이모티콘 총 종 제공", "매출 손실 규모액에 따른 지원금 지급"], "opinion_terms": ["보상안 발표", "다양한 단체와 협의체를 구성하고 논의를 지속", "직접적인 피해가 큰 경우만 별도 과정을 거쳐 개별 지원", "사회적 책임 차원의 일괄 지원을 결정", "안정적인 서비스를 제공하겠다는 약속"], "sentiment": "positive"},
  {"company": "넥슨", "aspect": "social", "aspect_terms": ["김정주 전 회장의 가상자산 계좌가 해킹", "억원어치 가상자산이 도난당한 사실"], "opinion_terms": ["사망자의 가상자산은 제도가 없어 보호받지 못한다는 지적", "논란이 일었다"], "sentiment": "negative"},
  …
]
```