# Description
The variables used in this study are described in the following parts:

1. Labeling
2. Loading
3. 'News' Variables Description

### Economic Characteristics
##### (Eurostat Business Register, 2010)
1. Classification of Economic Activity (KBLI)
    * Economic activity based on production materials
2. Enterprise value added
3. Enterprise size measures
    * Number of employees
    * *Turnover* (sales revenue)
4. Enterprise size measures based on the balance sheet
    * Assets
    * Inventory
    * Cash
    * Dividends
    * Bonds
5. Enterprise size based on wages and salaries
    * Wages
    * Salaries

### Demographic Characteristics
##### (Eurostat Business Register, 2010)
1. Active and Restart
    * Birth
    * Death
    * Reactivation
    * Real Death and Real Birth
2. Survive
    * Break-Up
    * Split-Off
    * Merger
    * Takeover
3. Structure
    * Change of Ownership
    * Change of Group
    * Restructure within enterprise/enterprise group
    * Complex Restructure

### Link and External Reference Characteristics
##### (Guidelines on Statistical Business Registers, 2015)
1. Ownership percentage
2. Relationship between two enterprises
    * Subsidiary

In [90]:
import os
import re  # regular expressions
from datetime import datetime as dt

try:
    import cPickle as pickle  # Python 2
except ImportError:
    import pickle  # Python 3

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer

sns.set()
%matplotlib inline

In [142]:
DATAStem = pd.DataFrame()
DATAPost = pd.DataFrame()

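# Load the POS-tagged text columns (os.path.join avoids backslash escapes in the paths)
%time DATAPost['Judul'] = pickle.load(open(os.path.join("v1.4", "DATA_JUDULStriped-v2.p"), "rb"))
%time DATAPost['Short'] = pickle.load(open(os.path.join("v1.4", "DATA_SHORTStriped-v2.p"), "rb"))
%time DATAPost['Long']  = pickle.load(open(os.path.join("v1.4", "DATA_LONGStriped-v2.p"), "rb"))

# Load the target labels
%time DATA_TARGET = pickle.load(open(os.path.join("v1.4", "DATA_TARGET-v2.p"), "rb"))

# Sort both objects by index so that their rows line up
DATA_TARGET = DATA_TARGET.sort_index()
DATAPost = DATAPost.sort_index()

# Renumber rows 0..n-1 so that index labels can also serve as array positions
DATA_TARGET = DATA_TARGET.reset_index(drop=True)
DATAPost = DATAPost.reset_index(drop=True)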

Wall time: 293 ms
Wall time: 172 ms
Wall time: 2.52 s
Wall time: 110 ms
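
The sort and reset steps matter because later cells use `DATA_TARGET[DATA_TARGET == 1].index` to pick rows out of plain NumPy arrays, where only positions exist. The sketch below illustrates the idea on toy data; `target`, `features`, and their values are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels whose index is neither sorted nor contiguous
target = pd.Series([1, 0, 1], index=[7, 2, 5])

# Same two steps as in the notebook: sort by index, then renumber from 0
target = target.sort_index().reset_index(drop=True)

# Hypothetical feature rows stored in the same (sorted) order
features = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

# After the reset, the index labels double as row positions in the array
positive_rows = target[target == 1].index
selected = features[positive_rows.values]
```

Without `reset_index(drop=True)`, the labels 5 and 7 would point past the end of a three-row array, so the positional lookup would fail.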


In [202]:
%time postagger = pickle.load(open( "POSTAGGER.p", "rb" ))
import string

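def toNVFraction(data):
    """Return an (n, 2) array of [noun fraction, verb fraction], one row per text."""
    def onlyAZ(teks):
        # Replace every non-letter character with a space
        return re.sub(r'[^a-zA-Z]', ' ', teks)

    def onlyNVFractionSentence(teks):
        # Tag the cleaned tokens, then count nouns (NN, NNP) and verbs (VB)
        splitted = postagger.tag(onlyAZ(teks).split())
        N = len(splitted)
        nn, vb = 0.0, 0.0
        for word, pos in splitted:
            if pos in ('NN', 'NNP'):
                nn += 1
            elif pos == 'VB':
                vb += 1
        return N, nn, vb

    def onlyNVFractionParagraph(par):
        # Accumulate the counts over every sentence of the paragraph
        word, noun, verb = 0.0, 0.0, 0.0
        for sentence in par.split('.'):
            N, nn, vb = onlyNVFractionSentence(sentence)
            word += N
            noun += nn
            verb += vb
        if word == 0:
            # No alphabetic tokens at all: avoid dividing by zero
            return 0.0, 0.0
        return noun / word, verb / word

    # Collect the rows in a list instead of calling np.vstack inside the loop
    fractions = [onlyNVFractionParagraph(teks) for teks in data]
    return np.array(fractions, dtype=float).reshape(-1, 2)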

Wall time: 17 ms


In [203]:
%time JUDULFrac = toNVFraction(DATAPost['Judul'])
%time SHORTFrac = toNVFraction(DATAPost['Short'])
%time LONGFrac  = toNVFraction(DATAPost['Long'])

Wall time: 808 ms
Wall time: 1.17 s
Wall time: 15.2 s


In [205]:
# Inspect the first five 'Short' texts whose target label is 1
cek = DATAPost.Short[DATA_TARGET[DATA_TARGET == 1].index].head()
cek

13     perseroan mengaku 2015 tahun yang sulit karena...
130        tetap dikawal jangan sampai merugikan negara.
197                        laba anjlok sampai 75 persen.
210    jakarta   bisnis bioskop ternyata memberikan k...
936    pt astra sedaya finance tbk (asdf) berencana m...
Name: Short, dtype: object

In [210]:
for i in cek:
    # Parenthesized print works in both Python 2 and Python 3
    print(postagger.tag(re.sub(r'[^a-zA-Z]', ' ', i).split()))

[('perseroan', 'NN'), ('mengaku', 'VB'), ('tahun', 'NN'), ('yang', 'SC'), ('sulit', 'RB'), ('karena', 'CC'), ('harga', 'NN'), ('batu', 'NN'), ('bara', 'NN'), ('anjlok', 'NN')]
[('tetap', 'JJ'), ('dikawal', 'NN'), ('jangan', 'NEG'), ('sampai', 'VB'), ('merugikan', 'VB'), ('negara', 'NN')]
[('laba', 'NN'), ('anjlok', 'NN'), ('sampai', 'VB'), ('persen', 'JJ')]
[('jakarta', 'NN'), ('bisnis', 'NN'), ('bioskop', 'NN'), ('ternyata', 'RB'), ('memberikan', 'VB'), ('kontribusi', 'CD'), ('signifikan', 'NN'), ('bagi', 'IN'), ('grup', 'NN'), ('lippo', 'NN'), ('tahun', 'NN'), ('lalu', 'JJ'), ('bisnis', 'NN'), ('itu', 'DT'), ('berkontribusi', 'CD'), ('sebanyak', 'NN'), ('rp', 'NN'), ('miliar', 'NN'), ('untuk', 'IN'), ('induk', 'NN'), ('usahanya', 'NN'), ('yakni', 'NN'), ('pt', 'NN'), ('first', 'NN'), ('media', 'VB'), ('tbk', 'NN'), ('klbv', 'NN')]
[('pt', 'NN'), ('astra', 'NN'), ('sedaya', 'NN'), ('finance', 'NN'), ('tbk', 'NN'), ('asdf', 'NN'), ('berencana', 'VB'), ('melakukan', 'VB'), ('pembayaran'

In [208]:
SHORTFrac[DATA_TARGET[DATA_TARGET == 1].index][:5]

array([[ 0.6       ,  0.1       ],
       [ 0.33333333,  0.33333333],
       [ 0.5       ,  0.25      ],
       [ 0.66666667,  0.07407407],
       [ 0.84615385,  0.15384615]])
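
As a sanity check, the first and third rows of this array can be recomputed by hand from the tagged output printed earlier. The sketch below copies those two tagged lists verbatim and counts nouns (`NN`, `NNP`) and verbs (`VB`) over the total number of tokens; `nv_fraction` is a hypothetical helper written for this illustration, not a function from the notebook.

```python
# Tagged tokens copied from the printed output for the first and third texts
tagged_examples = [
    [('perseroan', 'NN'), ('mengaku', 'VB'), ('tahun', 'NN'), ('yang', 'SC'),
     ('sulit', 'RB'), ('karena', 'CC'), ('harga', 'NN'), ('batu', 'NN'),
     ('bara', 'NN'), ('anjlok', 'NN')],
    [('laba', 'NN'), ('anjlok', 'NN'), ('sampai', 'VB'), ('persen', 'JJ')],
]

def nv_fraction(tagged):
    """Hypothetical helper: share of noun and verb tags among all tokens."""
    total = float(len(tagged))
    nouns = sum(1 for _, pos in tagged if pos in ('NN', 'NNP'))
    verbs = sum(1 for _, pos in tagged if pos == 'VB')
    return nouns / total, verbs / total

fractions = [nv_fraction(t) for t in tagged_examples]
# fractions == [(0.6, 0.1), (0.5, 0.25)]
```

The first text has 6 nouns and 1 verb among 10 tokens, and the third has 2 nouns and 1 verb among 4 tokens, which agrees with rows 1 and 3 above.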

In [209]:
# Save each noun/verb fraction array; the with-block closes the file after writing
for name, frac in [("JUDUL", JUDULFrac), ("SHORT", SHORTFrac), ("LONG", LONGFrac)]:
    with open(os.path.join("v2.0", "DATA_%s-v2.p" % name), "wb") as f:
        pickle.dump(frac, f)
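
A quick way to confirm that a saved array can be read back unchanged is a round trip through `pickle`. The following is a minimal sketch that writes to a temporary directory rather than to `v2.0`; the array `frac` and the file name `DATA_DEMO-v2.p` are hypothetical stand-ins.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical stand-in for one of the fraction arrays
frac = np.array([[0.6, 0.1], [0.5, 0.25]])

# Write to a temporary directory so no real data file is touched
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "DATA_DEMO-v2.p")  # hypothetical file name

# Binary mode is required for pickle; the with-block closes the file
with open(path, "wb") as f:
    pickle.dump(frac, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# True when shape and every value survived the round trip
same = np.array_equal(frac, restored)
```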