<a href="https://colab.research.google.com/github/YoheiFukuhara/nlp100-2020/blob/main/06.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/50_%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%85%A5%E6%89%8B%E3%83%BB%E6%95%B4%E5%BD%A2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

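Download the [News Aggregator Data Set](https://archive.ics.uci.edu/ml/datasets/News+Aggregator) and create the training data (train.txt), validation data (valid.txt), and test data (test.txt) as follows.

1. Unzip the downloaded zip file and read the description in readme.txt.
1. Extract only the examples (articles) whose source (publisher) is "Reuters", "Huffington Post", "Businessweek", "Contactmusic.com", or "Daily Mail".
1. Shuffle the extracted examples randomly.
1. Split the extracted examples so that 80% become training data and the remaining 10% each become validation data and test data, and save them as train.txt, valid.txt, and test.txt respectively. Write one example per line, in tab-separated format with the category name and the article headline (these files are reused later in problem 70).

Once the training data and test data have been created, check the number of examples in each category.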

In [None]:
from google.colab import drive
import pandas as pd
from sklearn.model_selection import train_test_split

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!python --version
!pip show google pandas scikit-learn

Python 3.7.11
Name: google
Version: 2.0.3
Summary: Python bindings to the Google search engine.
Home-page: http://breakingcode.wordpress.com/
Author: Mario Vilas
Author-email: mvilas@gmail.com
License: UNKNOWN
Location: /usr/local/lib/python3.7/dist-packages
Requires: beautifulsoup4
Required-by: 
---
Name: pandas
Version: 1.1.5
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: None
Author-email: None
License: BSD
Location: /usr/local/lib/python3.7/dist-packages
Requires: pytz, numpy, python-dateutil
Required-by: xarray, vega-datasets, statsmodels, sklearn-pandas, seaborn, pymc3, plotnine, pandas-profiling, pandas-gbq, pandas-datareader, mlxtend, mizani, holoviews, gspread-dataframe, google-colab, fix-yahoo-finance, fbprophet, fastai, cufflinks, cmdstanpy, arviz, altair
---
Name: scikit-learn
Version: 0.22.2.post1
Summary: A set of python modules for machine learning and data mining
Home-page: http://scikit-lear

In [None]:
BASE_PATH = '/content/drive/MyDrive/ColabNotebooks/ML/NLP100_2020/06.MachineLearning'

In [None]:
# Use quoting=3 (csv.QUOTE_NONE): with the default quoting=0, a double quote at the
# start of a TITLE string causes the fields to be split incorrectly
df = pd.read_csv(BASE_PATH + '/input/newsCorpora.csv', 
                 header=None, sep='\t', usecols=[0, 1, 3, 4], index_col=0, 
                 names=['id', 'title', 'publisher', 'category'], quoting=3)
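
The effect of `quoting=3` can be seen on a tiny in-memory sample. This is a minimal sketch, not the real file: the two rows, the simplified five-column layout, and the `example.com` URLs are all made up for illustration. The first title starts with a double quote and the second one ends with one, so with the default quoting the parser should treat everything between the two quotes as a single field.

```python
import csv
import io

import pandas as pd

# Hypothetical two-row sample in a simplified 5-column layout (not the real file).
# Row 1's title starts with a double quote; row 2's title ends with one.
sample = (
    '1\t"Start quote title\thttp://example.com/1\tReuters\tb\n'
    '2\tEnd quote title"\thttp://example.com/2\tReuters\te\n'
)
names = ['id', 'title', 'url', 'publisher', 'category']

# Default quoting (0, csv.QUOTE_MINIMAL): the opening quote starts a quoted
# field that swallows tabs and newlines until the next double quote.
df_default = pd.read_csv(io.StringIO(sample), header=None, sep='\t', names=names)

# quoting=3 (csv.QUOTE_NONE): double quotes are treated as ordinary characters.
df_none = pd.read_csv(io.StringIO(sample), header=None, sep='\t', names=names,
                      quoting=csv.QUOTE_NONE)

print(len(df_default), len(df_none))
print(df_none['title'].tolist())
```

With `csv.QUOTE_NONE` both rows survive and the quote characters stay in the titles, which is the behavior the notebook relies on.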

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 422937 entries, 1 to 422937
Data columns (total 3 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   title      422937 non-null  object
 1   publisher  422935 non-null  object
 2   category   422937 non-null  object
dtypes: object(3)
memory usage: 12.9+ MB


In [None]:
df

id,title,publisher,category
1,"Fed official says weak data caused by weather,...",Los Angeles Times,b
2,Fed's Charles Plosser sees high bar for change...,Livemint,b
3,US open: Stocks fall after Fed official hints ...,IFA Magazine,b
4,"Fed risks falling 'behind the curve', Charles ...",IFA Magazine,b
5,Fed's Plosser: Nasty Weather Has Curbed Job Gr...,Moneynews,b
...,...,...,...
422933,Surgeons to remove 4-year-old's rib to rebuild...,WSHM-TV,m
422934,Boy to have surgery on esophagus after battery...,WLWT Cincinnati,m
422935,Child who swallowed battery to have reconstruc...,NewsNet5.com,m
422936,Phoenix boy undergoes surgery to repair throat...,WFSB,m


In [None]:
df = df.loc[df['publisher'].isin(['Reuters', 'Huffington Post', 'Businessweek', 'Contactmusic.com', 'Daily Mail']), ['title', 'category']]

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13356 entries, 13 to 422838
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     13356 non-null  object
 1   category  13356 non-null  object
dtypes: object(2)
memory usage: 313.0+ KB


In [None]:
train, valid_test = train_test_split(df, test_size=0.2, random_state=123, stratify=df['category'])
valid, test = train_test_split(valid_test, test_size=0.5, random_state=123, stratify=valid_test['category'])
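
The two-stage split can be checked on toy data. The sketch below assumes a made-up 100-row frame with an imbalanced `category` column; the `toy_*` names are hypothetical. It shows that cutting off 20% and then halving that remainder gives an 80/10/10 split, and that `stratify` keeps the category proportions in each part. `train_test_split` shuffles by default, which covers the random reordering step of the task.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with an imbalanced category column.
toy = pd.DataFrame({
    'title': [f'headline {i}' for i in range(100)],
    'category': ['b'] * 50 + ['e'] * 30 + ['t'] * 10 + ['m'] * 10,
})

# First cut: 80% train, 20% held out (shuffle=True is the default).
toy_train, toy_rest = train_test_split(
    toy, test_size=0.2, random_state=123, stratify=toy['category'])

# Second cut: halve the held-out 20% into validation and test.
toy_valid, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=123, stratify=toy_rest['category'])

print(len(toy_train), len(toy_valid), len(toy_test))
print(toy_valid['category'].value_counts().to_dict())
```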

In [None]:
# Save the data (tab-separated; to_csv also writes a header row by default)
train.to_csv(BASE_PATH + '/train.txt', sep='\t', index=False)
valid.to_csv(BASE_PATH + '/valid.txt', sep='\t', index=False)
test.to_csv(BASE_PATH + '/test.txt', sep='\t', index=False)
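
The task statement asks for one example per line with the category name followed by the headline. If that exact layout is wanted, one possible variant is sketched below. It is an illustration on a made-up two-row frame written to a temporary directory, not the notebook's own output format; `toy`, `out_dir`, and the headlines are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for one of the split DataFrames.
toy = pd.DataFrame({
    'title': ['Fed holds rates steady', 'New album tops charts'],
    'category': ['b', 'e'],
})

out_dir = tempfile.mkdtemp()  # temporary directory, an assumption for this sketch
path = os.path.join(out_dir, 'train.txt')

# columns= fixes the column order (category first);
# header=False drops the header row so every line is one example.
toy.to_csv(path, sep='\t', index=False, header=False,
           columns=['category', 'title'])

with open(path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

Any later cell that reads these files would have to match the chosen layout, for example by passing `header=None` and explicit `names` to `pd.read_csv`.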

In [None]:
# Check the number of examples per category
print('--train--')
print(train['category'].value_counts())
print('--valid--')
print(valid['category'].value_counts())
print('--test--')
print(test['category'].value_counts())

--train--
b    4501
e    4235
t    1220
m     728
Name: category, dtype: int64
--valid--
b    563
e    529
t    153
m     91
Name: category, dtype: int64
--test--
b    563
e    530
t    152
m     91
Name: category, dtype: int64
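
The counts above can be turned into proportions to confirm both the 80/10/10 split and the effect of stratification. A minimal sketch, with the numbers copied by hand from the printed output; the variable names are hypothetical:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.DataFrame(
    {'train': [4501, 4235, 1220, 728],
     'valid': [563, 529, 153, 91],
     'test': [563, 530, 152, 91]},
    index=['b', 'e', 't', 'm'])

totals = counts.sum()                # examples per split
split_share = totals / totals.sum()  # share of all extracted examples
class_share = counts / totals        # category proportions within each split

print(totals.to_dict())
print(split_share.round(3).to_dict())
print(class_share.round(3))
```

The per-split totals add up to the 13356 rows reported by `df.info()` after filtering, and the category proportions should come out nearly identical across the three splits.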
