### Import data

# Algorithm comparison
- Random Forest
- Bagging
- Boosting

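| Algorithm | Core idea | Strengths | Weaknesses | Typical use cases |
|---------------|------------------------------------------------------------------------|------------------------------------------------------------------------------------|------------------------------------------------------------------------------------|-----------------------------------------------|
| Random Forest | Combines many decision trees, each trained on a different subsample of rows and features. The final prediction is a majority vote (classification) or an average (regression). | - Resistant to overfitting<br>- Stable on high-dimensional data<br>- Provides feature importance estimates | - Slower to train and predict than a single tree<br>- Tends to perform poorly on imbalanced datasets | - High-dimensional data analysis<br>- Feature importance evaluation<br>- Tasks that need good generalization |
| Bagging | Draws multiple sub-datasets from the original data by sampling with replacement and trains a base model on each one. The results are averaged or put to a vote. | - Reduces model variance<br>- Improves stability<br>- Easy to parallelize | - May not reduce bias noticeably<br>- Depends heavily on the quality of the base model | - High-variance models (such as decision trees)<br>- Tasks where stability matters |
| Boosting | Trains weak classifiers sequentially; each iteration focuses on the samples the previous round misclassified, so performance improves step by step. | - Substantially reduces bias<br>- Often performs well on small or imbalanced datasets<br>- Can produce highly accurate models | - Long training time<br>- Sensitive to noise<br>- Prone to overfitting, especially when the base learners are too strong | - Small datasets<br>- Imbalanced data<br>- Tasks that demand high accuracy |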
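
To make the comparison in the table concrete, here is a minimal sketch that fits one scikit-learn estimator per approach on a small synthetic dataset. The dataset, the `*_demo` names, and the hyperparameters are illustrative assumptions, not part of this notebook's pipeline, and `GradientBoostingClassifier` is just one of several boosting implementations that could stand in here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the tweet dataset used below)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

models_demo = {
    # Bagged trees plus a random feature subset at every split
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Bootstrap samples, each fitted with a full decision tree
    "Bagging": BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0),
    # Shallow trees fitted sequentially, each correcting the previous ones
    "Boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean 3-fold cross-validated accuracy per model
scores_demo = {
    name: cross_val_score(model, X_demo, y_demo, cv=3).mean()
    for name, model in models_demo.items()
}
for name, score in scores_demo.items():
    print(f"{name}: {score:.3f}")
```

On a toy problem like this the three scores are usually close; the trade-offs in the table tend to show up more clearly on noisy, imbalanced, or high-dimensional data.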


In [1]:
import json
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

In [7]:
# Each line of the file holds one JSON record
data = []
with open('/kaggle/input/dm-2024-isa-5810-lab-2-homework/tweets_DM.json', 'r') as f:
    for line in f:
        data.append(json.loads(line))
# No explicit f.close() needed: the with block closes the file

In [8]:
emotion = pd.read_csv('/kaggle/input/dm-2024-isa-5810-lab-2-homework/emotion.csv')
data_identification = pd.read_csv('/kaggle/input/dm-2024-isa-5810-lab-2-homework/data_identification.csv')

In [9]:
df = pd.DataFrame(data)
_source = df['_source'].apply(lambda x: x['tweet'])
df = pd.DataFrame({
    'tweet_id': _source.apply(lambda x: x['tweet_id']),
    'hashtags': _source.apply(lambda x: x['hashtags']),
    'text': _source.apply(lambda x: x['text']),
})
df = df.merge(data_identification, on='tweet_id', how='left')

train_data = df[df['identification'] == 'train']
test_data = df[df['identification'] == 'test']

In [10]:
train_data = train_data.merge(emotion, on='tweet_id', how='left')

In [11]:
train_data.drop_duplicates(subset=['text'], keep=False, inplace=True)
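
Note that `keep=False` discards every row whose text appears more than once, instead of keeping one copy. The sketch below uses a made-up three-row frame (`dup_demo` is a hypothetical name) to show the difference from `keep='first'`; which behavior is preferable for this dataset is a judgment call.

```python
import pandas as pd

# Made-up rows: the text "a" appears twice
dup_demo = pd.DataFrame({
    "text": ["a", "a", "b"],
    "emotion": ["joy", "joy", "fear"],
})

# keep=False removes both copies of the duplicated text
dropped_all = dup_demo.drop_duplicates(subset=["text"], keep=False)

# keep="first" retains one copy of each duplicated text
kept_first = dup_demo.drop_duplicates(subset=["text"], keep="first")

print(len(dropped_all), len(kept_first))
```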

In [12]:
train_data_sample = train_data.sample(frac=0.3, random_state=42)

In [13]:
y_train_data = train_data_sample['emotion']
X_train_data = train_data_sample.drop(['tweet_id', 'emotion', 'identification', 'hashtags'], axis=1)

In [14]:
# Hold out 20% for validation, stratified by emotion
# (train_test_split is already imported in the first cell)
X_train, X_test, y_train, y_test = train_test_split(
    X_train_data, y_train_data, test_size=0.2, random_state=42, stratify=y_train_data
)

In [15]:
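# Fit the vocabulary on the training split only, then reuse it for validation.
# Both matrices stay sparse; RandomForestClassifier accepts sparse input.
tfidf = TfidfVectorizer(max_features=500)
X = tfidf.fit_transform(X_train['text'])
X_test = tfidf.transform(X_test['text'])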

In [16]:
le = LabelEncoder()
y = le.fit_transform(y_train)
y_test = le.transform(y_test)

In [17]:
clf = RandomForestClassifier()
clf.fit(X, y)
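
As an alternative way to organize the vectorize-then-fit steps above, the vectorizer and the classifier can be wrapped in a single scikit-learn pipeline, so the same transformation is applied at fit and predict time. This is a sketch on made-up texts; `demo_texts`, `demo_labels`, and the `n_estimators` and `random_state` values are assumptions, not settings used in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training texts and labels (assumption: illustrative only)
demo_texts = [
    "so happy today", "what a lovely day", "this makes me smile",
    "i am furious", "this is outrageous", "so angry right now",
]
demo_labels = ["joy", "joy", "joy", "anger", "anger", "anger"]

# TF-IDF features feed straight into the forest; a fixed random_state
# makes repeated runs give the same model
pipe_demo = make_pipeline(
    TfidfVectorizer(max_features=500),
    RandomForestClassifier(n_estimators=20, random_state=42),
)
pipe_demo.fit(demo_texts, demo_labels)

# Raw strings go in; the pipeline vectorizes them before predicting
demo_pred = pipe_demo.predict(["happy day", "angry now"])
print(demo_pred)
```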

In [18]:
y_pred = clf.predict(X_test)

In [19]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.48002898184034687
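
Accuracy alone can hide weak performance on rare classes, which matters if the emotion labels are imbalanced. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`) to show how a model that always predicts the majority class still gets a moderate accuracy while its macro-averaged F1 is low; `classification_report`, already imported in the first cell, prints the per-class breakdown.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Made-up labels: 6 joy, 2 anger, 2 fear; the "model" always predicts joy
y_true_demo = np.array(["joy"] * 6 + ["anger"] * 2 + ["fear"] * 2)
y_pred_demo = np.array(["joy"] * 10)

acc_demo = accuracy_score(y_true_demo, y_pred_demo)

# Macro F1 averages the per-class F1 scores with equal weight;
# zero_division=0 scores classes that were never predicted as 0
macro_f1_demo = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)

print(f"accuracy={acc_demo:.2f}, macro F1={macro_f1_demo:.2f}")
print(classification_report(y_true_demo, y_pred_demo, zero_division=0))
```

Applied to this notebook, the same calls would take `y_test` and `y_pred`, with `target_names=le.classes_` to show emotion names instead of encoded integers.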

In [20]:
X_test_data = test_data.drop(['tweet_id', 'identification', 'hashtags'], axis=1)

In [21]:
X_test_data = tfidf.transform(X_test_data['text']).toarray()

In [22]:
y_test_pred = clf.predict(X_test_data)

In [23]:
y_pred_labels = le.inverse_transform(y_test_pred)
y_pred_labels

array(['anticipation', 'anticipation', 'joy', ..., 'joy', 'joy', 'joy'],
      dtype=object)

In [24]:
submission = pd.DataFrame({
    'tweet_id': test_data['tweet_id'],
    'identification': y_pred_labels
})

In [25]:
# index=False keeps the DataFrame index out of the CSV
submission.to_csv('/kaggle/working/submission.csv', index=False)