<h1 style="text-align:center"> Exploring and Predicting Characteristics of Japanese Newspaper Headlines </h1> 
 <h2 style="text-align:center"> <i>STA208 Final Project (Spring 2017)</i> </h2> 
 <h3 style="text-align:center"> <i>Tzu-ping Liu and Gento Kato</i> </h3> 



<h1 style="text-align:center">Summary Notebook</h1>

## I. Research Question

Using a unique dataset of Japanese newspaper headlines, this project asks two questions:

 * Can unsupervised learning methods identify the major categories of news that appear in headlines?
 * How well can supervised learning methods predict the positive/negative (PN) sentiment expressed in news headlines?

This summary notebook does not include every description and result from our analysis. Click a section title to see the detailed contents of that section (where a link is provided). If the links do not work, all relevant Jupyter notebook files are stored in the same directory as this summary notebook (<code>/208-final-project-liu_and_kato\Political_Headlines_Project\notebooks</code>).

<!--- 
* What are the impact of political news on public opinion (e.g., prime minister approval)?
--->

## II. [Data of Japanese Newspaper Headlines](STA208_Data_Description.ipynb)

A detailed introduction to the data structure is available [HERE](STA208_Data_Description.ipynb) (R is used for the data construction) or by clicking the section title. 

In summary, the following data are used in the analysis:

 * Full texts of *ALL* front-page headlines from two major newspapers in Japan (*Yomiuri Shimbun* and *Asahi Shimbun*). The data collection starts in November 1987 and ends in March 2015. (The data were originally collected by the author.)
 * Hand-coded negative sentiment (1 = negative, 0 = positive/neutral) for 1,000 randomly sampled headlines. 
 * A matrix of word frequencies in the headlines.
 
<!---
 * Dictionary approach to extract political headlines (Tutorial from [HERE](Headline%20Data%20and%20Text%20Search.ipynb))
 * Monthly public opinion polls in Japan of the corresponding period.
--->

## III. Analytical Strategy

We have two major objectives in the analysis: 

 1. Explore major categories of news through unsupervised machine learning.
 2. Predict negative sentiment in headlines through supervised machine learning. 

First, we are interested in identifying the major categories of news content ([Section IV](STA208_Unsupervised_Learning.ipynb)). Because we do not have a pre-defined set of categories, we implement **unsupervised learning methods (K-Means and Agglomerative Clustering)** to explore the major categories that appear in this dataset. We then briefly examine the generated categories to assess whether each method succeeds in extracting meaningful categories.
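As a rough illustration of this step, the sketch below runs both clustering methods on a small synthetic count matrix. The toy data, the choice of three clusters, and the Ward linkage are assumptions made for the example only; the actual settings are described in Section IV.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Toy stand-in for the headline-by-word frequency matrix (hypothetical data):
# three groups of 20 "headlines", each dominated by different "words".
rng = np.random.default_rng(0)
X = np.vstack([
    rng.poisson(lam=[3, 0.2, 0.2, 0.2], size=(20, 4)),
    rng.poisson(lam=[0.2, 3, 0.2, 0.2], size=(20, 4)),
    rng.poisson(lam=[0.2, 0.2, 3, 3], size=(20, 4)),
]).astype(float)

# K-Means: partition the rows into k clusters around centroids.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Agglomerative clustering: merge rows bottom-up until 3 clusters remain.
ag_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Cluster sizes give a first impression of how balanced each solution is.
print(np.bincount(km_labels, minlength=3))
print(np.bincount(ag_labels, minlength=3))
```

Comparing the cluster sizes from the two methods is one simple way to see whether a solution is dominated by a single large cluster.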

Second, we are interested in coding the positive-negative sentiment of each headline ([Section V](STA208_Supervised_Learning.ipynb)). In this section, one of the authors samples 1,000 headlines from the full dataset and manually codes the presence of negative sentiment as a binary variable. We then use **supervised learning methods (K Nearest Neighbors, Logit, Linear Discriminant Analysis, Support Vector Machine with RBF Kernel, Decision Tree, Bagging, Random Forest and AdaBoost)** to learn this negative-sentiment coding. Finally, we compare the results to evaluate how the methods differ in performance.
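The comparison of classifiers can be sketched as follows for two of the listed methods. The synthetic features and labels, the 5-fold cross-validation, and the accuracy metric are all assumptions for illustration; the evaluation actually used is documented in Section V.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 200 "headlines" by 30 "word counts".
rng = np.random.default_rng(0)
X = rng.poisson(0.3, size=(200, 30)).astype(float)

# Hypothetical binary label (1 = negative) driven by the first three words plus noise.
y = (X[:, :3].sum(axis=1) + rng.normal(0, 0.5, 200) > 1).astype(int)

models = {
    "logit": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The same loop extends to the other methods by adding the corresponding scikit-learn estimators to the `models` dictionary.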

<!---
* Time-series analysis to assess the impact of news coverage on public opinion.
--->

## IV. [Exploring Categories of Newspaper Headlines](STA208_Unsupervised_Learning.ipynb)

In this section, we apply **unsupervised learning methods to identify major news categories and word appearance patterns** in headlines. Details are presented in a separate notebook file. Click [HERE](STA208_Unsupervised_Learning.ipynb) or the section title to see the contents.

## V. [Predicting Negative Sentiments in Newspaper Headlines](STA208_Supervised_Learning.ipynb)

In this section, we apply various **supervised learning methods to predict negative sentiment** in headlines, and compare their performance. Details are presented in a separate notebook file. Click [HERE](STA208_Supervised_Learning.ipynb) or the section title to see the contents.

## VI. Discussion

In this section, we discuss the relationships between the variables generated by the analyses in the previous sections, and suggest potential future developments of the project. Click the following button to toggle the raw code that generates the tables on and off.

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

In [2]:
# Computation Timer
from timeit import default_timer as trec

## Data Mining
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

### 1. Overview of Generated Variables

To start with, the major variables of headline characteristics generated in the previous sections are listed below. (Given their highly unbalanced and non-intuitive nature, results from hierarchical clustering are omitted.)

**Headline Topic Clusters from Unsupervised Learning** ([Section IV](STA208_Unsupervised_Learning.ipynb))
* <code>km_7catr_name</code>: Topical categories generated from *K-means* with $k=7$ on word-reduced data. 
* <code>km_8catr_name</code>: Topical categories generated from *K-means* with $k=8$ on word-reduced data. 
* <code>km_9catr_name</code>: Topical categories generated from *K-means* with $k=9$ on word-reduced data. 

**Negative Sentiment Probability from Supervised Learning** ([Section V](STA208_Supervised_Learning.ipynb))
* <code>rf_pred</code>: Negative sentiment probability generated from *Random Forest*. 
* <code>logit_pred</code>: Negative sentiment probability generated from *Logistic Regression*. 


### 2. Relationship between Variables

Here we cross-tabulate the generated variables to assess their relationships, and discuss whether those relationships show the expected patterns.  

**Word Clusters by Topics**


In [3]:
## Import Data
alldata = pd.read_csv("../../data/alldata_codepred_170611.csv", encoding='CP932')

**Negative Sentiments by Topics**

Here, we summarize the results from both unsupervised and supervised machine learning by comparing means. The following tables show the **average predicted probability of negative sentiment by topical category** generated from K-means. A higher probability indicates that headlines falling into the corresponding topic are, on average, more likely to involve negative sentiment.

In [4]:
alldata.groupby(['km_7catr_name']).agg({'logit_pred': "mean",'rf_pred': "mean"}).transpose().round(3)

| km_7catr_name | Budget | Crime-Economy | Election | Featured | General | Social-Crime | War |
|---|---|---|---|---|---|---|---|
| rf_pred | 0.284 | 0.263 | 0.13 | 0.208 | 0.185 | 0.684 | 0.221 |
| logit_pred | 0.369 | 0.293 | 0.054 | 0.254 | 0.164 | 0.835 | 0.205 |


In [5]:
alldata.groupby(['km_8catr_name']).agg({'logit_pred': "mean",'rf_pred': "mean"}).transpose().round(3)

| km_8catr_name | Diplomacy 1 | Diplomacy 2 | Economy-Crime | Election | Featured | General | Politics | Polling |
|---|---|---|---|---|---|---|---|---|
| rf_pred | 0.159 | 0.116 | 0.395 | 0.125 | 0.234 | 0.191 | 0.226 | 0.294 |
| logit_pred | 0.114 | 0.037 | 0.506 | 0.048 | 0.307 | 0.17 | 0.235 | 0.393 |


In [6]:
alldata.groupby(['km_9catr_name']).agg({'logit_pred': "mean",'rf_pred': "mean"}).transpose().round(3)

| km_9catr_name | Diplomacy 1 | Diplomacy 2 | Economy-Crime | Election | Featured 1 | Featured 2 | General | Politics | Polling |
|---|---|---|---|---|---|---|---|---|---|
| rf_pred | 0.158 | 0.112 | 0.395 | 0.124 | 0.234 | 0.231 | 0.19 | 0.227 | 0.295 |
| logit_pred | 0.113 | 0.031 | 0.505 | 0.048 | 0.306 | 0.267 | 0.167 | 0.236 | 0.392 |


From the above tables, the following patterns are apparent:

1. Negative sentiment predictions from random forest (<code>rf_pred</code>) and logit (<code>logit_pred</code>) show similar patterns across topics, while logit predictions take more extreme values (closer to 0 or 1) than random forest predictions.
2. The *Crime-Economy* (*Economy-Crime*), *Social-Crime*, *Featured* and *Polling* topics are more likely to be negative than the *Diplomacy*, *Politics* or *Election* topics. 

The first pattern confirms the assessment in [Section V](STA208_Supervised_Learning.ipynb): logit and random forest do seem to capture the same characteristics of headlines. The second pattern follows the expectation that *Crime* topics come with a high probability of negative sentiment. On the other hand, it is interesting that the *Polling* topic is relatively likely to be negative, while *Politics* and *Diplomacy* are not. In general, negative sentiment and topics are related, but they do seem to capture independent aspects of headline content.
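The first pattern can be checked numerically. The minimal sketch below re-enters the topic-level means from the 7-cluster table above and computes their correlation and range; the choice of Pearson correlation and of the range as a measure of "extremeness" is ours, for illustration only.

```python
import pandas as pd

# Topic-level mean predictions copied from the 7-cluster table above.
means = pd.DataFrame(
    {
        "rf_pred": [0.284, 0.263, 0.13, 0.208, 0.185, 0.684, 0.221],
        "logit_pred": [0.369, 0.293, 0.054, 0.254, 0.164, 0.835, 0.205],
    },
    index=["Budget", "Crime-Economy", "Election", "Featured",
           "General", "Social-Crime", "War"],
)

# Pearson correlation between the two models' topic-level means.
corr = means["rf_pred"].corr(means["logit_pred"])

# Range (max minus min) of each model's topic-level means.
spread = means.max() - means.min()

print(round(corr, 3))
print(spread.round(3))
```

A correlation close to 1 together with a wider range for <code>logit_pred</code> is what the first pattern describes.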

### 3. Future Directions

In this project, we are interested in various methods for extracting meaningful characteristics from headline data. Despite the short-text nature of headlines, unsupervised learning did generate seemingly meaningful categories, and supervised learning did predict negative sentiment fairly well. As we study political science, our final goal is to apply these generated characteristics to answer substantive questions in political science. With this goal in mind, there are at least three directions in which to develop this project further. 

First, we can consider methods that generate more balanced categories of headlines and words. The current categories (from both *K-Means* and *Hierarchical Clustering*) are highly unbalanced, and it is difficult to extract a specific meaning from the largest cluster (e.g., the *General* category from K-means). More balanced categories would make a full interpretation of the categories possible. 
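One way to quantify how unbalanced a clustering is would be to look at cluster shares and their normalized entropy. The sketch below uses hypothetical labels and counts; in the notebook, the input would be a column such as <code>alldata["km_7catr_name"]</code>.

```python
import numpy as np
import pandas as pd

# Hypothetical cluster labels (the counts are invented for illustration).
labels = pd.Series(
    ["General"] * 70 + ["Election"] * 15 + ["War"] * 10 + ["Budget"] * 5,
    name="km_7catr_name",
)

# Share of headlines falling into each cluster.
shares = labels.value_counts(normalize=True)

# Normalized entropy: 1 means perfectly balanced clusters,
# values near 0 mean one cluster dominates.
balance = -(shares * np.log(shares)).sum() / np.log(len(shares))

print(shares)
print(round(balance, 3))
```

Tracking this balance score across different values of $k$ or different clustering methods would give a simple criterion for comparing candidate solutions.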

Second, we can incorporate time into the machine learning. Each headline in the dataset comes with its publication date, which we currently ignore. The major characteristics of the data may change over time. If we can learn and capture these dynamic changes, it would benefit our understanding of how media content evolves.

Third, in addition to the above data, we have access to monthly public opinion poll data for the corresponding period (including such questions as cabinet approval, party approval, and subjective economic performance). Therefore, we can aggregate the machine-learned characteristics of headlines by month, and assess the relationship between media content and public opinion.
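The monthly aggregation could look roughly like the sketch below. The column names <code>date</code>, <code>month</code> and <code>cabinet_approval</code>, as well as all values, are hypothetical; only <code>logit_pred</code> corresponds to a variable described in this notebook.

```python
import pandas as pd

# Hypothetical headline-level data with a publication date.
headlines = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-01-05", "2014-01-20", "2014-02-03", "2014-02-17", "2014-02-28"]
    ),
    "logit_pred": [0.2, 0.6, 0.1, 0.3, 0.8],
})

# Monthly mean of the predicted negative-sentiment probability.
monthly = (
    headlines.groupby(headlines["date"].dt.to_period("M"))["logit_pred"]
    .mean()
    .rename("mean_neg")
)

# Hypothetical monthly poll series indexed by month.
polls = pd.DataFrame(
    {"cabinet_approval": [55.0, 52.0]},
    index=pd.PeriodIndex(["2014-01", "2014-02"], freq="M", name="month"),
)

# Align the two monthly series on their period index.
merged = polls.join(monthly)
print(merged)
```

Once the two series share a monthly index, standard time-series tools can be applied to the merged table.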