# Data Processing 0: Original Data Description

## Online Appendix of "International News Coverage and Foreign Image Building"

### Gento Kato (Nov. 4, 2017)

<p style="text-align:right;"> Back to [Summary Page](v3_SummaryNotebook.ipynb) </p>

In [1]:
## For Jupyter Notebook (Ignore if Using Other Software) ##
library(IRdisplay)

display_html(
'<script>  
code_show=true; 
function code_toggle() {
  if (code_show){
    $(\'div.input\').hide();
  } else {
    $(\'div.input\').show();
  }
  code_show = !code_show
}  
$( document ).ready(code_toggle);
</script>
  <form action="javascript:code_toggle()">
    <input type="submit" value="Click here to toggle on/off the raw code.">
 </form>'
)


### Load Packages and Set Directory

In [1]:
#################
## Preparation ##
#################

## Clear Workspace
rm(list=ls())

## Library Required Packages
library(rprojroot); library(xtable)

## Set Working Directory (Automatically or Manually) ##
#setwd(dirname(rstudioapi::getActiveDocumentContext()$path)); setwd("../") #In RStudio
projdir <- find_root(has_file("README.md")); projdir; setwd(projdir) #In Atom
#setwd("C:/GoogleDrive/Projects/Agenda-Setting Persuasion Framing/Foreign_Image_News_Project")


"package 'xtable' was built under R version 3.3.3"

### Original Headline Level Dataset

The original headline-level dataset is <code>allheadline.csv</code>. See <code>allheadline.xlsx</code> for details of the variables. Sample rows are presented below:

In [16]:
#################
## Import Data ##
#################

# Read Manual Coding Data
hldata <- read.csv("data/allheadline.csv", fileEncoding = "CP932")

In [111]:
cat("Sample Rows of Full Dataset:")
x <- c(86549,68425,94905,36094,96298); x
head(hldata[x,]) # Display the selected sample rows (all variables)

Sample Rows of Full Dataset:

Unnamed: 0,id_all,id,id_original,year,month,date,ymonth,Headline,paper,wcount,us,chn,kor,nkor,Asahi,Yomiuri,jijistartdate,jijiymonth
86549,86549,41197,1060,2010,11,12,201011,米、人民元上げ再要請　中国「徐々に改革」　米中首脳会談,A,847,1,1,0,0,1.0,,5,201012
68425,68425,32698,9446,2005,2,21,200502,「褐色雲」を国際観測　黄砂などの飛来分析　日中韓などの研究者,A,568,0,1,1,0,1.0,,10,200503
94905,94905,3023,5440,2013,9,24,201309,日米サイバー防衛協議　来月上旬合意へ　中国念頭に対策,Y,475,1,1,0,0,,1.0,6,201310
36094,36094,18839,1216,1996,5,30,199605,北朝鮮科学者の韓国亡命　橋本首相がコメント,Y,91,0,0,1,1,,1.0,10,199606
96298,96298,3649,6475,2014,3,23,201403,「強制連行」訴訟、原告１０００人規模に　中国、さらに拡大も,A,580,0,1,0,0,1.0,,7,201404
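
The sample rows show that the four country indicators (<code>us</code>, <code>chn</code>, <code>kor</code>, <code>nkor</code>) are not mutually exclusive: a single headline can be coded as relevant to several countries. The following is a minimal sketch of how such overlaps could be counted. It uses a toy data frame holding only values copied from the five sample rows above; the names <code>sample_rows</code> and <code>n_countries</code> are introduced here for illustration and are not part of the original dataset.

```r
# Country indicators copied from the five sample rows above
# (sample_rows and n_countries are illustrative names, not in the dataset)
sample_rows <- data.frame(
  id_all = c(86549, 68425, 94905, 36094, 96298),
  paper  = c("A", "A", "Y", "Y", "A"),
  us     = c(1, 0, 1, 0, 0),
  chn    = c(1, 1, 1, 0, 1),
  kor    = c(0, 1, 0, 1, 0),
  nkor   = c(0, 0, 0, 1, 0)
)

# Number of countries each headline is coded as relevant to
sample_rows$n_countries <- rowSums(sample_rows[, c("us", "chn", "kor", "nkor")])

# Show the overlap count next to the identifier and the newspaper
sample_rows[, c("id_all", "paper", "n_countries")]
```

The same <code>rowSums()</code> call should work on the full <code>hldata</code> object, assuming the four indicator columns are numeric as in the sample.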


### Automated Coding of Headlines Relevant to the US, China, South Korea, and North Korea

The initial automated coding was conducted with *KH Coder*, text-analysis software developed by Koichi Higuchi at Ritsumeikan University, Japan (http://khc.sourceforge.net/en/). First, morphological analysis is applied to each headline text using the Japanese morphological analysis system *ChaSen*. Second, headlines relevant to the US, China, South Korea, and North Korea are extracted using the following keywords (an entry of the form X-->Y restricts the word X to the part-of-speech tag Y):

* **US**: 米-->地名 | 米-->人名 | 米-->名詞C | 訪米 | 米国 | 日米構造協議 | 米兵 | 米産 | 全米 | 駐米 | 米人 | 日米地位協定 | 対米 | 反米 | 米州 | 親米 | 渡米 | 日米財界人会議 | 米紙 | 米州貿易圏 | 在米 | 米朝-->人名 | アメリカ | アメリカン | レーガン | ブッシュ | クリントン | オバマ

* **China**: 中国 | 中国人 | 中国共産党 | 中国語 | 中国ファンド | 中-->地名 | 日中-->名詞 | 訪中-->サ変名詞 | 中台-->地名 | 日中-->副詞可能 | 対中-->地名 | 中-->名詞C | 日中-->地名 | 中印-->地名 | 親中-->名詞 | 楊 | 江-->人名 | 胡-->地名 (For 中-->地名 and 中-->名詞C, the irrelevant headlines are cleaned up manually)

* **South Korea**: 韓 | 韓国 | 訪韓 | 韓国日報 | 駐韓 | 南北 | 南北朝鮮 | 朝鮮半島 | 斗煥 | 盧 | 泳三 | 大中 | 明博 (For 南北 and 朝鮮, irrelevant headlines are cleaned up manually)

* **North Korea**: 北朝鮮 | 朝鮮 | 南北朝鮮 | 朝鮮半島 | 朝鮮労働党 | 朝-->地名 | 朝-->副詞可能 | 訪朝-->名詞 | 朝間-->名詞 | 米朝-->人名 | 北-->名詞C | 南北 | 北-->地名 | 日成-->人名 | 正日-->人名 (For 南北 and 朝鮮, irrelevant headlines are cleaned up manually)

The keywords above consist of possible names of the countries and their leaders. The automated coding results are presented in the table below. Column 1 gives the number of headlines coded as relevant to each country, and column 0 gives the number coded as not relevant.
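
To give a rough sense of how keyword-based extraction works, the sketch below codes three of the sample headlines with plain substring matching in R. This is not the KH Coder procedure: it skips the *ChaSen* morphological analysis, so the part-of-speech conditions (such as 米-->地名) are dropped and only a few unambiguous surface forms from the lists above are kept. The names <code>keywords</code>, <code>match_any</code>, and <code>coded</code> are hypothetical and introduced only for this illustration.

```r
# Three headlines copied from the sample rows shown earlier
headlines <- c(
  "米、人民元上げ再要請　中国「徐々に改革」　米中首脳会談",
  "北朝鮮科学者の韓国亡命　橋本首相がコメント",
  "「強制連行」訴訟、原告１０００人規模に　中国、さらに拡大も"
)

# Simplified keyword lists (assumption: a subset of the lists above,
# without any part-of-speech conditions)
keywords <- list(
  us   = c("米国", "訪米", "対米", "アメリカ", "オバマ"),
  chn  = c("中国", "訪中", "対中"),
  kor  = c("韓国", "訪韓", "駐韓"),
  nkor = c("北朝鮮", "訪朝", "朝鮮労働党")
)

# Return 1 if any keyword occurs literally in the headline, 0 otherwise
match_any <- function(text, words) {
  hit <- rep(FALSE, length(text))
  for (w in words) {
    hit <- hit | grepl(w, text, fixed = TRUE)
  }
  as.integer(hit)
}

# One 0/1 column per country, one row per headline
coded <- as.data.frame(lapply(keywords, function(w) match_any(headlines, w)))
coded
```

Note that the first headline is coded as US-relevant in the dataset, yet this sketch misses it because the character 米 appears there only on its own and in 米中. Matching a bare 米 as a substring would also pick up unrelated words, which is presumably why the original coding relies on part-of-speech tags and manual cleaning.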


In [113]:
## Frequency Table ##
statefreq0 <- t(cbind(
      table(hldata$us),table(hldata$chn),
      table(hldata$kor),table(hldata$nkor)))
rownames(statefreq0) <- 
 c("US (by KH Coder)","China (by KH Coder)", "S.Korea (by KH Coder)", "N. Korea (by KH Coder)")
statefreq0


Unnamed: 0,0,1
US (by KH Coder),90111,9040
China (by KH Coder),95756,3395
S.Korea (by KH Coder),96951,2200
N. Korea (by KH Coder),95750,3401
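
The counts in the table can also be read as shares of all headlines. The sketch below re-enters the printed counts by hand (rather than recomputing them from <code>hldata</code>), checks that every row sums to the same total, and converts the relevant counts into percentages. The object names <code>statefreq</code> and <code>total</code> are illustrative.

```r
# Counts copied from the frequency table above
# (column "0" = not relevant, column "1" = relevant)
statefreq <- rbind(
  "US"       = c(90111, 9040),
  "China"    = c(95756, 3395),
  "S. Korea" = c(96951, 2200),
  "N. Korea" = c(95750, 3401)
)
colnames(statefreq) <- c("0", "1")

# Every row should cover the same set of headlines
total <- rowSums(statefreq)
stopifnot(length(unique(total)) == 1)

# Share of relevant headlines per country, in percent
round(100 * statefreq[, "1"] / total, 2)
```

Each row sums to 99,151 headlines, so the four indicators are defined over the same set of headlines; US-relevant headlines make up roughly 9 percent of it, and each of the other three countries between 2 and 4 percent.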
