# Introduction to Headline Data and Text Search

## 1. Purpose
Briefly introduces the headline data and shows how to search Japanese text and code headlines based on the search results.

## 2. Data Description

### Load Data

In [14]:
## Check the working directory; if it is not the location of THIS file, change it (e.g., with setwd()).
#getwd()

## Load Data
load("./../../data/polhead_RMeCab_170420.rda")
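
Because `load()` restores objects under the names they were saved with, it can help to check what a file actually contains before relying on it. The following is a minimal sketch using a temporary `.rda` file and hypothetical stand-in objects (`datedata_demo`, `MeCabRes_demo`), not the real data file:

```r
# Stand-in objects (hypothetical; the real file is polhead_RMeCab_170420.rda)
datedata_demo <- data.frame(id_all = 1:2, paper = c("A", "Y"))
MeCabRes_demo <- list(c("w1", "w2"), c("w3"))

# Save them to a temporary .rda file
tmp <- tempfile(fileext = ".rda")
save(datedata_demo, MeCabRes_demo, file = tmp)

# Load into a fresh environment so nothing in the workspace is overwritten
env <- new.env()
loaded <- load(tmp, envir = env)  # load() returns the restored object names

print(loaded)                     # names of the objects found in the file
print(class(env$datedata_demo))   # class of one restored object
```

With the real file, the same pattern (`loaded <- load(path)`) would show whether both **datedata** and **MeCabRes** are present.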

### Check Data

#### Headline Dataset (**datedata**)
Class = data frame. A list of newspaper headlines from Nov 1987 to Mar 2015. These are ALL first-page headlines from two major Japanese newspapers, Asahi and Yomiuri. The columns are briefly described below.

 * **id_all**: Global headline id (each headline has a unique ID)
 * id_inpaper: Within-paper headline id (each headline in the same newspaper has a unique ID)
 * id_original: Headline ID from original dataset (can be ignored)
 * year: Year of headline
 * month: Month of headline
 * date: Day of headline
 * **ymonth**: Year-month of headline
 * **Headline**: The raw texts of headline
 * **paper**: Character string for the newspaper. "A" indicates Asahi, "Y" indicates Yomiuri.
 * **wcount**: Word count of the article attached to each headline
 * **codePN**: Manually coded positive-negative sentiment of the headline (with respect to the Prime Minister). Some values are missing.
 * Asahi: Dummy for Asahi newspaper. 1 for headlines from Asahi.
 * Yomiuri: Dummy for Yomiuri newspaper. 1 for headlines from Yomiuri.
 * jijistartdate: The date on which the *jiji monthly poll* starts collecting data in each month.
 * **jijiymonth**: Year-month according to the *jiji monthly poll*. A month is considered to start on the day the *jiji monthly poll* starts collecting data in the current month (jijistartdate), and to end on the day before the poll starts collecting data for the next month.
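
The `jijiymonth` rule can be sketched as follows. This assumes that `jijistartdate` holds the poll's start day for the headline's own calendar month, and that a headline dated before that day belongs to the previous poll month. The toy rows are hypothetical, and the dataset's actual construction may differ.

```r
# Toy rows (hypothetical values, not taken from the dataset)
toy <- data.frame(
  year          = c(1987, 1987, 1988),
  month         = c(11, 11, 1),
  date          = c(7, 3, 4),
  jijistartdate = c(7, 7, 8)
)

# A headline dated before the poll's start day falls in the previous poll month
before_start <- toy$date < toy$jijistartdate

# Previous calendar month, rolling January back to December of the prior year
prev_year  <- ifelse(toy$month == 1, toy$year - 1, toy$year)
prev_month <- ifelse(toy$month == 1, 12, toy$month - 1)

toy$jijiymonth <- ifelse(before_start,
                         prev_year * 100 + prev_month,
                         toy$year * 100 + toy$month)
print(toy)
```

The first row (dated on the start day) stays in 198711, while the second and third rows are assigned to 198710 and 198712.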

In [15]:
head(datedata,10)

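| id_all | id_inpaper | id_original | year | month | date | ymonth | Headline | paper | wcount | codePN | Asahi | Yomiuri | jijistartdate | jijiymonth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3 | 1987 | 11 | 7 | 198711 | 竹下首相任命式も皇太子殿下が出席 | A | 85 | 0 | 1.0 | | 7 | 198711 |
| 2 | 2 | 4 | 1987 | 11 | 7 | 198711 | 米国務次官補、懸案協議へあす来日 | A | 426 | 0 | 1.0 | | 7 | 198711 |
| 3 | 3 | 5 | 1987 | 11 | 7 | 198711 | 政策遂行は党主導か　竹下内閣発足＜解説＞ | A | 1033 | 0 | 1.0 | | 7 | 198711 |
| 4 | 4 | 6 | 1987 | 11 | 7 | 198711 | 東京の終値も最高値１３５円台　３日連続更新 | A | 169 | 0 | 1.0 | | 7 | 198711 |
| 5 | 5 | 7 | 1987 | 11 | 7 | 198711 | 竹下内閣・閣僚の顔ぶれ（昭和６２年１１月６日発足） | A | 1844 | 0 | 1.0 | | 7 | 198711 |
| 6 | 6 | 8 | 1987 | 11 | 7 | 198711 | 米英市場、１３５円台 | A | 601 | 0 | 1.0 | | 7 | 198711 |
| 7 | 7 | 9 | 1987 | 11 | 7 | 198711 | 竹下内閣が発足　融和重視の派閥均衡型　税制の改革に力点 | A | 1288 | 0 | 1.0 | | 7 | 198711 |
| 8 | 1 | 1 | 1987 | 11 | 7 | 198711 | 竹下内閣スタート　派閥均衡の実務型　税制・土地を重視＝図付き | Y | 1258 | 0 | | 1.0 | 7 | 198711 |
| 9 | 2 | 2 | 1987 | 11 | 7 | 198711 | 衆参両院で竹下首相を指名 | Y | 212 | 0 | | 1.0 | 7 | 198711 |
| 10 | 3 | 3 | 1987 | 11 | 7 | 198711 | 大胆な発想と実行の政治を　初閣議で竹下首相説示 | Y | 403 | 0 | | 1.0 | 7 | 198711 |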
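
In the rows printed above, `Asahi` and `Yomiuri` appear as 1 or blank rather than 1 or 0. If the blanks are `NA` in the full data (an assumption based only on these ten rows), complete 0/1 dummies can be rebuilt from `paper`. This is a minimal sketch on a hypothetical toy data frame; the column names `Asahi01` and `Yomiuri01` are illustrative:

```r
# Toy rows mimicking the printed layout (hypothetical values)
toy <- data.frame(
  paper   = c("A", "A", "Y"),
  Asahi   = c(1, 1, NA),
  Yomiuri = c(NA, NA, 1),
  stringsAsFactors = FALSE
)

# Rebuild complete 0/1 dummies from the paper code
toy$Asahi01   <- as.numeric(toy$paper == "A")
toy$Yomiuri01 <- as.numeric(toy$paper == "Y")

# Sanity check: wherever the original dummy is 1, the rebuilt one is also 1
stopifnot(all(toy$Asahi01[!is.na(toy$Asahi)] == 1))
stopifnot(all(toy$Yomiuri01[!is.na(toy$Yomiuri)] == 1))

print(toy)
```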


#### Word List Data (**MeCabRes**) 

Class = list. The list of words used in each headline. The words were extracted by morphological analysis using **MeCab** through **RMeCab**. Each element represents one headline, and the order corresponds to **id_all** in **datedata**. An example element is shown below.

In [17]:
MeCabRes[1]

In the above example, each row represents a word. The bolded part gives the lexical category (e.g., noun, verb, ...) of the word, and the quoted part gives the word itself. For example, the first line reads as the noun (名詞) '竹下' (Takeshita). All words are converted back to their base (dictionary) form, so the result sometimes does not match the original text exactly.
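
Judging from this description, each element appears to be a named character vector, with the lexical category as the name and the word as the value. The sketch below uses a hand-written mock-up (`one_headline`, hypothetical and not actual RMeCab output) to show how such an element can be taken apart:

```r
# Hand-written mock-up of one element (hypothetical, not real RMeCab output)
one_headline <- c("名詞" = "竹下", "名詞" = "首相", "助詞" = "を", "名詞" = "指名")

print(names(one_headline))   # lexical categories
print(unname(one_headline))  # the words themselves

# Keep only the nouns (名詞)
nouns <- unname(one_headline[names(one_headline) == "名詞"])
print(nouns)

# Single brackets return a list of length 1; double brackets return the vector
mock <- list(one_headline)
print(class(mock[1]))    # "list"
print(class(mock[[1]]))  # "character"
```

This is why the search function in the next section accesses each headline with `target[[i]]` rather than `target[i]`.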

## 3. Text Search and Code

The following function lets you search for given words in **MeCabRes** and code headlines (add a variable to **datedata**) with a dummy indicating whether any of those words appears in the headline.

In [21]:
## Text-finding function
## target = MeCab list (one character vector of words per headline)
## search = character vector of words to search for
inclwrd<-function(target,search){
  n<-length(target) # length of the returned vector
  countres<-rep(NA,n) # create the returned vector
  for(i in seq_len(n)){ # seq_len() avoids looping over 1:0 when target is empty
    sample<-as.factor(target[[i]]) # words of the i-th headline
    levels(sample)[levels(sample) %in% search]<-"ifindit" # mark the search words
    countres[i]<-sum(sample=="ifindit") # count occurrences of the search words
  }
  countres[countres>0]<-1 # turn the count into a dummy variable
  return(countres) # return the vector
}
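
The same coding can also be written without an explicit loop. The sketch below is an alternative, not the function above; `inclwrd_v` and the toy list are hypothetical names introduced for illustration. It also shows that matching is by exact token: searching for "円" does not flag a headline whose only related token is "円高".

```r
# Vectorized sketch: 1 if any word of a headline is in the search set, else 0
inclwrd_v <- function(target, search) {
  vapply(target,
         function(words) as.numeric(any(words %in% search)),
         numeric(1), USE.NAMES = FALSE)
}

# Toy word lists (hypothetical, in the same shape as MeCabRes elements)
toy <- list(
  c("名詞" = "経済", "名詞" = "政策"),
  c("名詞" = "円高", "動詞" = "進む"),
  c("名詞" = "首相")
)

res <- inclwrd_v(toy, search = c("経済", "円"))
print(res)
```

Only the first toy headline is coded 1. If compound words such as "円高" should count, they need to be listed in the search set explicitly.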

As an example, search for economy-related words and create a dummy variable for their appearance:

In [22]:
## List of words related to economy
econwords<-c("貿易","投資","ガット","関税","輸入","輸出","禁輸",
          "資本","現地生産","漁業協定","ＷＴＯ","ＦＴＡ","ＡＰＥＣ",
          "援助","支援","円借款","経済","株","相場","円安","円高",
          "終値","市場","赤字","黒字","公共事業","産業","人民元",
          "バブル","円","就業","ドル","ウォン","通商","社","構造協議")

## Create a dummy variable in datedata
datedata$econ<-inclwrd(target=MeCabRes,search=econwords)

## Describe the Variable
table(datedata$econ)


    0     1 
83657 15494 

The result shows that 15,494 headlines include at least one economy-related word and 83,657 do not.
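
The counts are easier to compare as shares. The short sketch below re-enters the two reported counts by hand, so that it runs without the data, and converts them with `prop.table()`:

```r
# Counts copied from the table output above
counts <- c("0" = 83657, "1" = 15494)

total  <- sum(counts)         # total number of headlines
shares <- prop.table(counts)  # each count divided by the total

print(total)
print(round(shares, 3))
```

With the data loaded, `prop.table(table(datedata$econ))` gives the same shares directly.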