<h1 style="text-align:center"> Exploring and Predicting Characteristics of Japanese Newspaper Headlines </h1> 
 <h2 style="text-align:center"> <i>STA208 Final Project (Spring 2017)</i> </h2> 
 <h3 style="text-align:center"> <i>Tzu-ping Liu and Gento Kato</i> </h3> 



<h1 style="text-align:center">Summary Notebook</h1>

## I. Research Question

Using a unique dataset of Japanese newspaper headlines, this project asks two questions:

 * Can unsupervised-learning methods identify the major categories of news that appear in headlines?
 * How well can supervised-learning methods predict the positive/negative (PN) sentiments expressed in news headlines?

This summary notebook does not include every description and result of our analysis. Click on a section title to see the detailed contents of that section (where a link is provided). If the links do not work, all relevant Jupyter notebook files are stored in the same directory as this summary notebook file (<code>/208-final-project-liu_and_kato\Political_Headlines_Project\notebooks</code>).

<!--- 
* What are the impact of political news on public opinion (e.g., prime minister approval)?
--->

## II. [Data of Japanese Newspaper Headlines](STA208_Data_Description.ipynb)

A detailed introduction to the data structure is available [HERE](STA208_Data_Description.ipynb) (R is used for the data construction) or by clicking the section title.

In summary, the following data are used in the analysis:

 * Full texts of *ALL* first-page headlines from two major newspapers in Japan (*Yomiuri Shimbun* and *Asahi Shimbun*). The data collection starts in November 1987 and ends in March 2015. (The data were originally collected by the author.)
 * Hand-coded negative sentiment (1 = negative, 0 = positive/neutral) for 1000 randomly sampled headlines.
 * A matrix of word-appearance frequencies from the headlines.
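
As a rough illustration of what a word-appearance frequency matrix looks like, the following minimal sketch builds one with the Python standard library. The tokenized headlines are hypothetical English stand-ins; the project itself constructed its matrix in R from Japanese text, which would first require word segmentation.

```python
from collections import Counter

# Hypothetical, already-tokenized headlines (illustration only; the real
# data are Japanese headlines processed in R).
headlines = [
    ["election", "party", "wins"],
    ["yen", "falls", "market", "falls"],
    ["party", "leader", "resigns"],
]

# The sorted set of distinct tokens defines the matrix columns.
vocab = sorted({token for doc in headlines for token in doc})

# One row per headline, one column per vocabulary word; entries are counts.
counts = [Counter(doc) for doc in headlines]
matrix = [[c[word] for word in vocab] for c in counts]

for doc, row in zip(headlines, matrix):
    print(row, doc)
```

Each row can then be treated as the feature vector of one headline in the clustering and classification steps.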
 
<!---
 * Dictionary approach to extract political headlines (Tutorial from [HERE](Headline%20Data%20and%20Text%20Search.ipynb))
 * Monthly public opinion polls in Japan of the corresponding period.
--->

## III. Analytical Strategy

We have two major objectives in the analysis:

 1. Explore the major categories of news through unsupervised machine learning.
 2. Predict the negative sentiment appearing in headlines through supervised machine learning.

First, we are interested in identifying the major categories of news content ([Section IV](STA208_Unsupervised_Learning.ipynb)). Because we do not have a pre-defined set of categories, we implement **unsupervised learning methods (K-Means and Agglomerative Clustering)** to explore the major categories that appear in this dataset. We then briefly examine the generated categories to assess whether each method succeeds in extracting meaningful categories.
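
The two clustering methods named above can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn, a tiny hypothetical word-count matrix, and two clusters, none of which are taken from the project's actual settings (those are in the Section IV notebook).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Hypothetical word-count matrix: 4 headlines x 4 vocabulary words.
# Rows 0-1 share one set of words, rows 2-3 share another.
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 3, 2],
    [0, 0, 2, 3],
])

# K-means: assigns each headline to the nearest of k centroids.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agglomerative clustering: merges headlines bottom-up (Ward linkage is
# the scikit-learn default) until 2 clusters remain.
ag_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

print("K-means labels:      ", km_labels)
print("Agglomerative labels:", ag_labels)
```

The cluster numbers themselves are arbitrary; only the grouping of headlines is meaningful, so topical names have to be assigned by inspecting the words that dominate each cluster.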

Second, we are interested in coding the positive-negative sentiment that appears in each headline ([Section V](STA208_Supervised_Learning.ipynb)). For this section, one of the authors sampled 1000 headlines from the full dataset and manually coded the dichotomous presence of negative sentiment. We then use **supervised learning methods (K Nearest Neighbors, Logit, Linear Discriminant Analysis, Support Vector Machine with RBF Kernel, Decision Tree, Bagging, Random Forest and AdaBoost)** to learn this negative-sentiment coding. Finally, we compare the results to evaluate how the methods differ in performance.
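
A comparison of this kind might look like the sketch below. It is not the project's code: the data are synthetic (generated with scikit-learn's `make_classification`), only three of the eight methods are shown, and 5-fold cross-validated accuracy is an assumed evaluation metric chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for (word-frequency features, hand-coded 0/1 label).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# A subset of the methods listed above, with illustrative settings.
models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logit": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 5 cross-validation folds for each model.
cv_accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in cv_accuracy.items():
    print(f"{name}: {score:.3f}")
```

Evaluating every method on the same folds keeps the comparison fair, since differences in the scores then reflect the models rather than the data split.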

<!---
* Time-series analysis to assess the impact of news coverage on public opinion.
--->

## IV. [Exploring Categories of Newspaper Headlines](STA208_Unsupervised_Learning.ipynb)

In this section, we apply **unsupervised learning methods to identify major news categories and word appearance patterns** in headlines. Details are presented in a separate notebook file. Click [HERE](STA208_Unsupervised_Learning.ipynb) or the section title to see the contents.

## V. [Predicting Negative Sentiments in Newspaper Headlines](STA208_Supervised_Learning.ipynb)

In this section, we apply various **supervised learning methods to predict negative sentiments** appearing in headlines, and compare their performance. Details are presented in a separate notebook file. Click [HERE](STA208_Supervised_Learning.ipynb) or the section title to see the contents.

## VI. Discussion

In this section, we discuss potential future development of the project, along with some general assessments of the analytical results from the previous sections. Click the following button to toggle the raw code that generates the tables on or off.

In [1]:
from IPython.display import HTML

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')

### 1. Overview of Generated Variables

To start with, the analyses in the previous sections generate the following variables describing headline characteristics.

**Unsupervised Learning** ([Section IV](STA208_Unsupervised_Learning.ipynb))
* <code>km_3cat</code>: Topical categories generated from *K-means* with $k=3$. 
* <code>km_4cat</code>: Topical categories generated from *K-means* with $k=4$. 
* <code>km_5cat</code>: Topical categories generated from *K-means* with $k=5$. 

**Supervised Learning** ([Section V](STA208_Supervised_Learning.ipynb))
* <code>rf_pred</code>: Negative sentiment probability generated from *Random Forest*. 
* <code>logit_pred</code>: Negative sentiment probability generated from *Logistic Regression*. 

### 2. Negative Sentiments in Topics

Here, we summarize the results of both the unsupervised and the supervised machine learning through mean comparisons. The following tables show the average predicted probability of negative sentiment for each topical category.
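
To make the mean comparison concrete, here is a minimal sketch on a hypothetical four-row data frame. The column names mirror the variables listed in the overview, but the values are invented for illustration and are not project results.

```python
import pandas as pd

# Hypothetical headlines with a topic label and two predicted
# negative-sentiment probabilities each (values invented).
toy = pd.DataFrame({
    "km_3cat": ["Election", "Election", "General", "General"],
    "rf_pred": [0.10, 0.20, 0.30, 0.50],
    "logit_pred": [0.05, 0.15, 0.20, 0.40],
})

# Average each prediction within a topic, then transpose so that
# topics become columns and models become rows.
means = toy.groupby("km_3cat")[["rf_pred", "logit_pred"]].mean().T.round(3)
print(means)
```

A higher value in a column indicates that headlines in that topic are, on average, predicted to be more negative by the corresponding model.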

In [19]:
# Computation Timer
from timeit import default_timer as trec

## Data Mining
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

In [20]:
## Import Data
alldata = pd.read_csv("../../data/alldata_codepred_170529.csv", encoding='CP932')
kmlabels3 = pd.read_csv("../../data/kmlabels3_reduced.csv", encoding='CP932')
kmlabels4 = pd.read_csv("../../data/kmlabels4_reduced.csv", encoding='CP932')
kmlabels5 = pd.read_csv("../../data/kmlabels5_reduced.csv", encoding='CP932')
alldata['km_3cat'] = kmlabels3.iloc[:,1]
alldata['km_4cat'] = kmlabels4.iloc[:,1]
alldata['km_5cat'] = kmlabels5.iloc[:,1]

In [24]:
# Mean predicted probability by topic; rf_pred listed first to match the output row order
alldata.groupby(['km_3cat']).agg({'rf_pred': "mean", 'logit_pred': "mean"}).transpose().round(3)

| km_3cat | Diplomacy | Election | General |
|---|---|---|---|
| rf_pred | 0.157 | 0.131 | 0.204 |
| logit_pred | 0.111 | 0.057 | 0.197 |


In [25]:
# Mean predicted probability by topic; rf_pred listed first to match the output row order
alldata.groupby(['km_4cat']).agg({'rf_pred': "mean", 'logit_pred': "mean"}).transpose().round(3)

| km_4cat | Crime | Economy | General | Politics |
|---|---|---|---|---|
| rf_pred | 0.28 | 0.389 | 0.192 | 0.137 |
| logit_pred | 0.315 | 0.496 | 0.179 | 0.03 |


In [26]:
# Mean predicted probability by topic; rf_pred listed first to match the output row order
alldata.groupby(['km_5cat']).agg({'rf_pred': "mean", 'logit_pred': "mean"}).transpose().round(3)

| km_5cat | Crime | Diplomacy | Economy | General | Politics |
|---|---|---|---|---|---|
| rf_pred | 0.281 | 0.123 | 0.389 | 0.197 | 0.137 |
| logit_pred | 0.317 | 0.042 | 0.496 | 0.187 | 0.031 |
