<h1 style="text-align:center"> Exploring and Predicting Characteristics of Japanese Newspaper Headlines </h1> 
 <h2 style="text-align:center"> <i>STA208 Final Project (Spring 2017)</i> </h2> 
 <h3 style="text-align:center"> <i>Tzu-ping Liu and Gento Kato</i> </h3> 


[<h5 style="text-align:center"> Back to Summary Notebook </h5>](STA208_Project_Summary.ipynb)

<h1 style="text-align:center">Section II</h1>
<h1 style="text-align:center">Data of Japanese Newspaper Headlines</h1>


In this section, we introduce the datasets used in this research project. There are two types of data. **Raw headline text data** contain the full texts of headlines with their corresponding dates, hand-coded positive-negative sentiments, and other characteristics. To construct the **word appearance matrix data**, we process the raw texts to extract normalized words and count how often each word appears in each headline. 

## 1. Raw Headline Text Data <br>

This dataset includes the full texts of newspaper headlines from Nov 1987 to Mar 2015. They are (almost) **ALL first page headlines** from two major newspapers in Japan, *Yomiuri Shimbun* and *The Asahi Shimbun*.

Raw texts are extracted using [*Yomidas Rekishikan*](http://www.yomiuri.co.jp/database/rekishikan/) for *Yomiuri Shimbun* and [*Kikuzo II Visual*](https://database.asahi.com/index.shtml) for *The Asahi Shimbun*. Headlines with generic names such as "Today's news" or "Today's column" are eliminated from the dataset, since they carry no information about the topical content of the story.


### 1.1 Dataset
 * [<code>alldate_170420.csv</code>](https://github.com/UCDSTA208/208-final-project-liu_and_kato/blob/master/data/alldate_170420.csv) is the original dataset that includes ALL first page newspaper headlines from November 1987 through March 2015.<br><br>
 * [<code>alldata_traincode.csv</code>](https://github.com/UCDSTA208/208-final-project-liu_and_kato/blob/master/data/alldata_traincode_170510.csv) additionally includes the training code variable <code>codeN</code>, which records the negative sentiment of 1000 randomly sampled headlines. The following files are used to construct the data:
   * Random sampling of headlines by [<code>polhead_allCodingSample_170509.R</code>](https://github.com/gentok/Political_Headlines_Project/blob/master/codes/polhead_allCodingSample_170509.R).
   * Hand-coded negative sentiments in [<code>codedall_170509.csv</code>](https://github.com/gentok/Political_Headlines_Project/blob/master/data_public/codedall_170509.csv).
   * The creation of new dataset by [<code>polhead_allCoded_170510.R</code>](https://github.com/gentok/Political_Headlines_Project/blob/master/codes/polhead_allCoded_170510.R).


### 1.2 Variables

The [<code>alldata_traincode.csv</code>](https://github.com/UCDSTA208/208-final-project-liu_and_kato/blob/master/data/alldata_traincode_170510.csv) dataset has 99151 rows (headlines) and the following variables:

   * <code><b>id_all</b></code>: Global headline id (each headline has a unique ID)
   * <code>id_inpaper</code>: Within-paper headline ID (each headline within the same newspaper has a unique ID)
   * <code>id_original</code>: Headline ID from original dataset (can be ignored)
   * <code>year</code>: Year of headline
   * <code>month</code>: Month of headline
   * <code>date</code>: Day of headline
   * <code><b>ymonth</b></code>: Year-month of headline
   * <code><b>Headline</b></code>: The raw text of the headline
   * <code>paper</code>: Character string for the newspaper. "A" indicates Asahi, "Y" indicates Yomiuri.
   * <code><b>wcount</b></code>: Word count of the article attached to the headline
   * <code>Asahi</code>: Dummy for Asahi newspaper. 1 for headlines from Asahi.
   * <code>Yomiuri</code>: Dummy for Yomiuri newspaper. 1 for headlines from Yomiuri.
   * <code>jijistartdate</code>: The day of the month on which the *jiji monthly poll* starts collecting data in each month.
   * <code>jijiymonth</code>: Year-month according to the *jiji monthly poll*. A month is considered to start on the day the *jiji monthly poll* starts collecting its data (jijistartdate) in the current month, and to end on the day before the *jiji monthly poll* starts collecting data for the next month.
   * <code><b>codeN</b></code>: Manually coded negative sentiment for 1000 randomly sampled headlines. *1 means negative, 0 means neutral/positive, and NA means not sampled*. (There is no separate code for strictly positive sentiment, since we rarely observe it.)
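
The <code>jijiymonth</code> rule above can be illustrated with a minimal sketch. It assumes that <code>jijistartdate</code> in each row holds the poll start day for that headline's calendar month; the toy rows, the helper <code>jiji_ymonth</code>, and the column <code>jijiymonth_sketch</code> are hypothetical and are not part of the dataset or of the original R scripts.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "year": [1988, 1988, 1988],
    "month": [1, 1, 2],
    "date": [24, 5, 3],
    "jijistartdate": [9, 9, 12],
})

def jiji_ymonth(row):
    """Return the year-month (YYYYMM) under the jiji poll calendar.

    Assumption: jijistartdate is the poll start day of the row's own
    calendar month, so earlier days still belong to the previous month.
    """
    year, month = int(row["year"]), int(row["month"])
    if row["date"] < row["jijistartdate"]:
        month -= 1
        if month == 0:  # January rolls back to December of the previous year
            year, month = year - 1, 12
    return year * 100 + month

toy["jijiymonth_sketch"] = toy.apply(jiji_ymonth, axis=1)
print(toy)
```

Under this assumption, a headline dated January 24 with a poll start day of 9 stays in 198801, which agrees with the example rows shown in this section.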
   
The actual [<code>alldata_traincode.csv</code>](https://github.com/UCDSTA208/208-final-project-liu_and_kato/blob/master/data/alldata_traincode_170510.csv) data look as follows:

In [14]:
import pandas as pd
alldata = pd.read_csv("../../data/alldata_traincode_170510.csv", encoding='CP932')
alldata.iloc[[930,940,945,955]] #alldata[alldata['train'] == 1] 

Unnamed: 0,id_all,id_inpaper,id_original,year,month,date,ymonth,Headline,paper,wcount,Asahi,Yomiuri,jijistartdate,jijiymonth,codeN,train
930,931,448,615,1988,1,24,198801,「東京朝日」１００周年記念の懸賞論文　審査委員に２０氏,A,766.0,1.0,,9,198801,0.0,1
940,941,489,580,1988,1,24,198801,税制改革法案　小渕官房長官も今国会提出を表明,Y,387.0,,1.0,9,198801,,0
945,946,457,626,1988,1,25,198801,都道府県別の銭湯数,A,1235.0,1.0,,9,198801,,0
955,956,462,633,1988,1,26,198801,大阪地検、田代議員を１０００万円容疑で起訴へ　砂利船汚職,A,799.0,1.0,,9,198801,1.0,1
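
Because <code>codeN</code> is NA for headlines that were not sampled, the hand-coded subset can be separated from the rest with a missing-value filter. The following is a minimal sketch on a small excerpt of the rows shown above (only <code>id_all</code> and <code>codeN</code> are reproduced); the names <code>excerpt</code>, <code>coded</code>, <code>uncoded</code>, and <code>neg_share</code> are illustrative.

```python
import numpy as np
import pandas as pd

# Excerpt of the four rows displayed above (id_all and codeN only).
excerpt = pd.DataFrame({
    "id_all": [931, 941, 946, 956],
    "codeN": [0.0, np.nan, np.nan, 1.0],
})

# Headlines with a hand-coded sentiment (codeN is 0 or 1).
coded = excerpt[excerpt["codeN"].notna()]

# Headlines that were not sampled for coding (codeN is NA).
uncoded = excerpt[excerpt["codeN"].isna()]

# Share of negative headlines among the coded ones in this excerpt.
neg_share = coded["codeN"].mean()
print(coded["id_all"].tolist(), neg_share)
```

The same filter should apply to the full <code>alldata</code> frame, where the coded subset would contain the 1000 sampled headlines.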


## 2. Word Appearance Matrix Data


 ### 2.1 Dataset

 * [<code>allWrdMat10.csv.gz</code>](https://github.com/UCDSTA208/208-final-project-liu_and_kato/blob/master/data/allWrdMat10.csv.gz) is the matrix of word appearance frequency, created by using <code>Headline</code> variable in [<code>alldata_traincode.csv</code>](https://github.com/UCDSTA208/208-final-project-liu_and_kato/blob/master/data/alldata_traincode_170510.csv).


To construct the above dataset, we first conduct morphological analysis of the Japanese texts with <code>MeCab</code> (Japanese morphological analysis software), and extract nouns, adjectives, and verbs that appear **more than 10 times** in the dataset using [<code>polhead_allWrdMat_170509.R</code>](https://github.com/gentok/Political_Headlines_Project/blob/master/codes/polhead_allWrdMat_170509.R). In the exported data, **rows represent headlines and columns represent words.** It includes 99151 rows (headlines) and 8655 columns (words). The row numbers of the word appearance matrix data are **identical to the <code>id_all</code> variable** in [<code>alldata_traincode.csv</code>](https://github.com/UCDSTA208/208-final-project-liu_and_kato/blob/master/data/alldata_traincode_170510.csv).
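
Since row numbers match <code>id_all</code>, the two datasets can be linked by position. The sketch below uses hypothetical toy frames (<code>toy_data</code>, <code>toy_wrd</code>) and assumes that the first row of the matrix corresponds to <code>id_all</code> equal to 1; it shows one way to pull the word counts of the hand-coded headlines.

```python
import pandas as pd

# Hypothetical toy versions of the two datasets (three headlines, two words).
toy_data = pd.DataFrame({"id_all": [1, 2, 3], "codeN": [1.0, None, 0.0]})
toy_wrd = pd.DataFrame({"word_a": [1, 0, 0], "word_b": [0, 2, 1]})

# Assumption: the i-th row of the word matrix belongs to the i-th headline,
# so id_all can be used directly as the matrix index.
toy_wrd.index = toy_data["id_all"].values

# IDs of headlines that carry a hand-coded sentiment.
coded_ids = toy_data.loc[toy_data["codeN"].notna(), "id_all"]

# Word counts restricted to the coded headlines.
X_coded = toy_wrd.loc[coded_ids]
print(X_coded)
```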

NOTE: The original text matrix dataset is a VERY large file, so we compress the original csv file with gzip.

<!---
  In addition, <code>allBigram20t.rds</code> [*Private*] includes all bi-grams of terms that are appeared 20 times or more. This data is transposed, that means, rows represent bigrams (19531 bigrams), and columns represent headlines. <br>
--->


 ### 2.2 Variables

 **Each column represents a word** (i.e., a noun, adjective, or verb). The value indicates **the frequency of word appearance**. The value is usually 1 or 0, but not necessarily. ***If the same word appears twice (or more) in one headline, then the value is 2 or more.*** You need to recode the variables if you want to use them as dummies for word appearance in a headline.
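
 The recoding into dummies can be sketched as follows. The toy frame <code>toy_counts</code> and its column names are hypothetical stand-ins for the word matrix; the same expression should work on the full matrix once it is loaded.

```python
import pandas as pd

# Hypothetical toy count matrix: rows are headlines, columns are words.
toy_counts = pd.DataFrame({"word_a": [0, 1, 2], "word_b": [3, 0, 0]})

# Any positive count becomes 1 (word appears), zero stays 0.
toy_dummy = (toy_counts > 0).astype(int)
print(toy_dummy)
```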

 The dataset includes ALL nouns, adjectives, and verbs that **appear at least 10 times** in the whole dataset. Note that it may be inefficient to include all of those variables in the analysis, so we select a subset of words depending on the context.
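
 One possible way to select a subset of words is to keep only the columns that appear in a minimum number of headlines. This is a sketch under assumptions, not the selection rule used in the analysis: the toy frame, the threshold <code>min_headlines</code>, and the other names are hypothetical.

```python
import pandas as pd

# Hypothetical toy count matrix: rows are headlines, columns are words.
toy_counts = pd.DataFrame({
    "word_a": [0, 1, 2, 1],
    "word_b": [1, 0, 0, 0],
    "word_c": [1, 1, 0, 1],
})

min_headlines = 2  # hypothetical threshold, chosen only for illustration

# Number of headlines in which each word appears at least once.
doc_freq = (toy_counts > 0).sum(axis=0)

# Keep only the words that reach the threshold.
selected = toy_counts.loc[:, doc_freq >= min_headlines]
print(selected.columns.tolist())
```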
 
 The actual [<code>allWrdMat10.csv.gz</code>](https://github.com/UCDSTA208/208-final-project-liu_and_kato/blob/master/data/allWrdMat10.csv.gz) data look as follows:

In [15]:
allWrdMat10 = pd.read_csv("../../data/allWrdMat10.csv.gz", encoding='CP932')

In [403]:
allWrdMat10.iloc[[10732, 24498,40534, 58985, 78645],3890:3910] #.sample(n=5)

Unnamed: 0,産廃,産婦人科,産卵,算出,算数,算定,賛成,賛同,賛否,酸性,暫定,残す,残り,残る,残業,残高,残念,残留,仕事,仕手
10732,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
24498,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
40534,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
58985,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
78645,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
