## AG News Classification Dataset
- https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset

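The AG News classification dataset is drawn from AG, a collection of more than 1 million news articles gathered from more than 2,000 news sources. It is available on `Kaggle`.
The dataset is already labeled: every article has a title and a description and is assigned to one of four classes.

1. World
2. Sports
3. Business
4. Science/Technology

Each class contains 30,000 training samples and 1,900 test samples, so the full dataset has 120,000 training samples and 7,600 test samples.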


### 1. Download the dataset

Before running the exercises below, download the `AG News Dataset` into `../../data/kaggle/AG-News-Classfication-Dataset`.

In [1]:
AG_DATASET_PWD = "../../data/kaggle/AG-News-Classfication-Dataset"

In [2]:
! ls {AG_DATASET_PWD}

test.csv       train.csv      train_data.csv


In [3]:
# Count the lines in the training set (the header row is included)
!cat {AG_DATASET_PWD}/train.csv |wc -l

  120001


In [4]:
# Count the lines in the test set (the header row is included)
!cat {AG_DATASET_PWD}/test.csv |wc -l

    7601
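`wc -l` counts physical lines, so each total includes the header row, and it would also over-count any record whose quoted field contains a line break. A minimal sketch of the difference in pure Python, using a made-up in-memory sample in place of the real files (`count_lines` and `count_records` are illustrative helper names):

```python
import csv
import io

# Hypothetical three-record sample standing in for train.csv
sample = (
    "Class Index,Title,Description\n"
    '3,"Title A","First description"\n'
    '1,"Title B","A description with\nan embedded newline"\n'
    '2,"Title C","Third description"\n'
)


def count_lines(text):
    """Count newline characters, which is what `wc -l` reports."""
    return text.count("\n")


def count_records(text):
    """Count CSV data records, excluding the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header
    return sum(1 for _ in reader)


print(count_lines(sample))    # physical lines, header included
print(count_records(sample))  # logical records
```

If the two numbers differ by exactly one for a file, the file has a header and no multi-line records.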


### 2. Load the data with pandas

In [5]:
import pandas as pd

#### 2.1 Load the training data

In [6]:
data = pd.read_csv("{}/train.csv".format(AG_DATASET_PWD))

In [7]:
data

| | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |
| ... | ... | ... | ... |
| 119995 | 1 | Pakistan's Musharraf Says Won't Quit as Army C... | KARACHI (Reuters) - Pakistani President Perve... |
| 119996 | 2 | Renteria signing a top-shelf deal | Red Sox general manager Theo Epstein acknowled... |
| 119997 | 2 | Saban not going to Dolphins yet | The Miami Dolphins will put their courtship of... |
| 119998 | 2 | Today's NFL games | PITTSBURGH at NY GIANTS Time: 1:30 p.m. Line: ... |


In [8]:
type(data)

pandas.core.frame.DataFrame

In [9]:
# Check the shape of data
data.shape

(120000, 3)

#### 2.2 Add a label column

In [10]:
# Inspect the column names
data.columns

Index(['Class Index', 'Title', 'Description'], dtype='object')

In [11]:
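# Replace spaces with underscores: a column whose name uses underscores can be read as data.column_name, while a name containing a space or a hyphen can only be read as data["column-name"]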
data.columns = data.columns.str.replace(" ", "_")
data.columns

Index(['Class_Index', 'Title', 'Description'], dtype='object')

In [12]:
# Convert all column names to lowercase
data.columns = data.columns.str.lower()
data.columns

Index(['class_index', 'title', 'description'], dtype='object')
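The two renaming steps can also be chained, since each `.str` call on the column index returns a new index. A minimal sketch on a made-up one-row frame that reuses the raw column names (the row values are placeholders):

```python
import pandas as pd

# Toy frame with the same raw column names as train.csv (values are made up)
df = pd.DataFrame(
    [[3, "Some title", "Some description"]],
    columns=["Class Index", "Title", "Description"],
)

# Replace spaces with underscores and lowercase the names in one chained call
df.columns = df.columns.str.replace(" ", "_").str.lower()

print(list(df.columns))
# Attribute access works because the name is now a valid Python identifier
print(df.class_index.tolist())
```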

In [13]:
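# Add a label column; the names are Chinese for World (世界), Sports (运动), Business (商业), and Science-Technology (科学-技术)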
data['class_name'] = data["class_index"].map({
    1: "世界",
    2: "运动",
    3: "商业",
    4: "科学-技术"
})

In [14]:
# Look at the first rows of the data again
data.head()

| | class_index | title | description | class_name |
|---|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | 商业 |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | 商业 |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | 商业 |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | 商业 |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | 商业 |


In [15]:
# Inspect the column names
data.columns

Index(['class_index', 'title', 'description', 'class_name'], dtype='object')

In [16]:
# Count the samples in each class
data["class_index"].value_counts()

class_index
3    30000
4    30000
2    30000
1    30000
Name: count, dtype: int64

In [17]:
# The same count using attribute access (.column_name)
data.class_name.value_counts()

class_name
商业       30000
科学-技术    30000
运动       30000
世界       30000
Name: count, dtype: int64
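`Series.map` with a dict does not raise an error for values missing from the dict; it returns `NaN` for them. Because `value_counts()` skips `NaN` by default, the four counts adding up to 120000 is what confirms that every row received a label. A minimal sketch of an explicit check, using made-up data and English placeholder names (`class_names`, `toy`, and the index `5` are illustrative and not part of the dataset):

```python
import pandas as pd

# Hypothetical mapping and toy data; index 5 is deliberately absent from the mapping
class_names = {1: "World", 2: "Sports", 3: "Business", 4: "Sci/Tech"}
toy = pd.DataFrame({"class_index": [3, 1, 5, 2]})

toy["class_name"] = toy["class_index"].map(class_names)

# Rows whose index had no entry in the dict end up as NaN
unmapped = toy[toy["class_name"].isna()]
print(unmapped["class_index"].tolist())
```

An empty list here would mean the mapping covers every value in the column.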

### 3. Explore the dataset

#### 3.1 View the first N rows

In [18]:
data.head()

| | class_index | title | description | class_name |
|---|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... | 商业 |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... | 商业 |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... | 商业 |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... | 商业 |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... | 商业 |


In [19]:
# .loc is a label-based indexer: it selects data by row label and column label.
data.loc[0, "title"]

'Wall St. Bears Claw Back Into the Black (Reuters)'

In [20]:
# .loc also accepts lists for both the rows and the columns
data.loc[[0,2], ["class_name", "title"]]

| | class_name | title |
|---|---|---|
| 0 | 商业 | Wall St. Bears Claw Back Into the Black (Reuters) |
| 2 | 商业 | Oil and Economy Cloud Stocks' Outlook (Reuters) |
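In this dataset the row labels coincide with row positions because the index is the default 0, 1, 2, ... range, which can hide the fact that `.loc` works on labels. A small sketch with a made-up frame whose index differs from position, contrasting `.loc` with the position-based `.iloc`:

```python
import pandas as pd

# Toy frame with a non-default index so that labels and positions differ
df = pd.DataFrame({"title": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = df.loc[20, "title"]   # row whose index label is 20
by_position = df.iloc[1, 0]      # second row, first column
print(by_label, by_position)

# Label slices include both endpoints; positional slices exclude the stop
label_slice = df.loc[10:20, "title"].tolist()
position_slice = df.iloc[0:1, 0].tolist()
print(label_slice)
print(position_slice)
```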


In [21]:
# Print the titles of rows 8 through 12 ("文章" means "article")
for i in range(8,13):
    print("文章：{:4}({}):  {}".format(i, data.loc[i, "class_name"], data.loc[i, "title"]))

文章：   8(商业):  Safety Net (Forbes.com)
文章：   9(商业):  Wall St. Bears Claw Back Into the Black
文章：  10(商业):  Oil and Economy Cloud Stocks' Outlook
文章：  11(商业):  No Need for OPEC to Pump More-Iran Gov
文章：  12(商业):  Non-OPEC Nations Should Up Output-Purnomo


In [22]:
# View the description of the first row
data.loc[0, "description"]

"Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."

#### 3.2 Replace special characters in the data

In [23]:
data[["title", "description"]] = data[["title", "description"]].map(lambda x: x.replace("\\", " "))
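`DataFrame.map` applies a function to every element; it was added in pandas 2.1, and earlier releases provide the same operation as `DataFrame.applymap`. A sketch of an alternative that does not depend on that method, using the vectorized `Series.str.replace` on a made-up two-column frame:

```python
import pandas as pd

# Toy frame with stray backslashes similar to those in the raw text (values are made up)
df = pd.DataFrame({
    "title": ["Oil\\prices rise"],
    "description": ["Crude prices plus worries\\about the economy"],
})

for col in ["title", "description"]:
    # regex=False treats the backslash as a literal character, not a regex escape
    df[col] = df[col].str.replace("\\", " ", regex=False)

print(df.loc[0, "title"])
print(df.loc[0, "description"])
```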

In [24]:
# View the description of the first row again
data.loc[0, "description"]

"Reuters - Short-sellers, Wall Street's dwindling band of ultra-cynics, are seeing green again."

In [25]:
# View the description of row 127 (index 126)
data.loc[126, "description"]
# data["description"][data["description"].map(lambda x: x.find("#36") > 0)]

'SPACE.com - A piloted rocket ship race to claim a  #36;10 million Ansari X Prize purse for privately financed flight to the edge of space is heating up.'

In [26]:
# Continue cleaning: collapse double spaces, restore the dollar sign, and trim surrounding whitespace
columns_fields = ["title", "description"]
data[columns_fields] = data[columns_fields].map(lambda x: x.replace("  ", " "))
# "#36;" is the HTML entity "&#36;" ("$") with its ampersand missing, so the semicolon is replaced along with it
data[columns_fields] = data[columns_fields].map(lambda x: x.replace("#36;", "$"))
data[columns_fields] = data[columns_fields].map(lambda x: x.strip())

In [27]:
data.loc[126, "description"]

'SPACE.com - A piloted rocket ship race to claim a $10 million Ansari X Prize purse for privately financed flight to the edge of space is heating up.'
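Chained `replace` calls only handle the patterns that were spotted by eye: a run of three spaces still leaves a double space, and any other `#NN;` code stays in the text. A minimal sketch of a more general helper, assuming that every `#NN;` sequence is a numeric HTML entity (such as `&#36;`, the dollar sign) that lost its ampersand. `clean_text` is a hypothetical name and the sample string is made up:

```python
import re


def clean_text(text):
    """Normalize one title or description string (illustrative sketch)."""
    # Stray backslashes become spaces, as in the cells above
    text = text.replace("\\", " ")
    # Assumption: "#NN;" is a numeric HTML entity whose "&" was lost
    text = re.sub(r"#(\d+);", lambda m: chr(int(m.group(1))), text)
    # Collapse any run of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()


raw = "SPACE.com - A race to claim a \\#36;10 million prize,\\with  extra   spaces. "
cleaned = clean_text(raw)
print(cleaned)
```

The helper could be applied per column with `data[col].map(clean_text)`. The entity assumption is worth checking against the data first, since a literal `#12;` in an article would be converted too.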

In [28]:
# Save to a new file
data.to_csv("{}/train_data.csv".format(AG_DATASET_PWD), index=False)

In [29]:
!ls {AG_DATASET_PWD}

test.csv       train.csv      train_data.csv
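Passing `index=False` matters for the round trip: by default `to_csv` writes the row index as a first column with an empty header, and `read_csv` then loads it back as a column named `Unnamed: 0`. A small sketch that shows both cases with a made-up frame and a temporary directory (the file name `train_data.csv` is reused only for illustration):

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned training data (values are made up)
df = pd.DataFrame({
    "class_index": [3, 1],
    "title": ["Title A", "Title B"],
    "description": ["Desc A", "Desc B"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_data.csv")

    # Default behaviour: the index is written as an extra, unnamed column
    df.to_csv(path)
    with_index = pd.read_csv(path)

    # index=False keeps the row index out of the file
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(with_index.columns.tolist())
print(reloaded.columns.tolist())
print(reloaded.shape)
```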


We can now use `train_data.csv` to practice small NLP tasks.