# **Stock market news feed semantic analysis** *(Baseline LogReg)*

In this notebook I examine the datasets I have mined and collected so far using the traditional bag-of-words and logistic regression approach. After that I also try n-gram models. The sources and references I used for the results:


*   https://colab.research.google.com/drive/1QPrBkh-KwX6qcUtiNWKp9rJoneBfGEVh#scrollTo=bQUJwMjYYN4- *(own work, reworked)*
*   https://colab.research.google.com/drive/1MdpXGCj2fb3g1BI_XfF54OWLkYQCZBBy#scrollTo=LndWT2Kn-UMK *(own baseline work)*
*   https://www.kaggle.com/ndrewgele/omg-nlp-with-the-djia-and-reddit#Basic-Model-Training-and-Testing
*   https://www.kaggle.com/lseiyjg/use-news-to-predict-stock-markets





I create a separate chapter for each dataset used, and in every case I state its source and, if I mined it myself, how it was obtained.

## **Project setup**

Mounting Drive so that the required files can be loaded later. I load each file right before it is used.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading the libraries required for the project.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
import pandas_datareader as web
from numpy.random import MT19937
from numpy.random import RandomState, SeedSequence
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt')
from nltk.tokenize import word_tokenize  
from sklearn.utils import shuffle
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score 
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Defining the macros used in the project.

In [3]:
# Shuffle cycle number for the dataframe
SHUFFLE_CYCLE = 500

For reproducibility I define a seed for the random number generator, which I use from here on.

In [4]:
# Random seed
RANDOM_SEED = 1234

# Numpy random seed
NP_SEED = 1234

# Max iteration for training
MAX_ITER = 100000

# Train size
TRAIN_SPLIT = 0.85

# Test size
TEST_SPLIT = 0.15

In [5]:
np.random.seed(NP_SEED)

## **KAG_REDDIT_WRLD_DJIA_DF**

This dataset contains the top 25 headlines from the Reddit World News category for the period 2008-08-08 to 2016-07-01. I did not collect this dataset myself; its source is:
Sun, J. (2016, August). Daily News for Stock Market Prediction, Version 1. Retrieved 2021-02-19 from https://www.kaggle.com/aaron7sun/stocknews

Loading the dataset from my mounted Drive.

In [6]:
# Copy the dataset to the local environment
!cp "/content/drive/MyDrive/Combined_News_DJIA.csv" "Combined_News_DJIA.csv"

# Check that the copy was successful -> good if there is no assertion error
read = !ls
assert any("Combined_News_DJIA.csv" in name for name in read)

I use the following two lists to store and index the results.

In [7]:
model_type = ["Bag of words", "1,2 n-gram", "2,2 n-gram", 
              "1,3 n-gram", "2,3 n-gram", "3,3 n-gram"]

result = []              
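
The names in `model_type` presumably map to the `ngram_range` setting of `CountVectorizer` (for example "1,2 n-gram" would be `ngram_range=(1, 2)`); that mapping is an assumption, not something stated in this notebook. To make the terms concrete, the sketch below counts word n-grams by hand in plain Python. The helper `ngram_counts` is hypothetical and only illustrates the idea; it is not the vectorizer used for the models.

```python
from collections import Counter

def ngram_counts(text, n_min, n_max):
    """Count all word n-grams with n_min <= n <= n_max (hypothetical helper)."""
    tokens = text.split()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            counts[" ".join(tokens[start:start + n])] += 1
    return counts

headline = "russia today columns troops roll south ossetia"

bag_of_words = ngram_counts(headline, 1, 1)   # unigrams only
one_two_gram = ngram_counts(headline, 1, 2)   # unigrams and bigrams
two_two_gram = ngram_counts(headline, 2, 2)   # bigrams only

print(len(bag_of_words), len(one_two_gram), len(two_two_gram))
```

A wider range adds features quickly, which is one reason the different ranges are evaluated as separate models.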

Macro definition.

In [8]:
# Number of merged news into one string
ROWS = 2

### Preparing the text

Loading the dataset.

In [9]:
# Load the dataset 
df_combined = pd.read_csv('Combined_News_DJIA.csv', index_col = "Date")

# Show the dataframe
df_combined.head()

Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,Top11,Top12,Top13,Top14,Top15,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",b'Georgian troops retreat from S. Osettain cap...,b'Did the U.S. Prep Georgia for War with Russia?',b'Rice Gives Green Light for Israel to Attack ...,b'Announcing:Class Action Lawsuit on Behalf of...,"b""So---Russia and Georgia are at war and the N...","b""China tells Bush to stay out of other countr...",b'Did World War III start today?',b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,b'Welcome To World War IV! Now In High Definit...,"b""Georgia's move, a mistake of monumental prop...",b'Russia presses deeper into Georgia; U.S. say...,b'Abhinav Bindra wins first ever Individual Ol...,b' U.S. ship heads for Arctic to define territ...,b'Drivers in a Jerusalem taxi station threaten...,b'The French Team is Stunned by Phelps and the...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...","b""The US military was surprised by the timing ...",b'U.S. Beats War Drum as Iran Dumps the Dollar',"b'Gorbachev: ""Georgian military attacked the S...",b'CNN use footage of Tskhinvali ruins to cover...,b'Beginning a war as the Olympics were opening...,b'55 pyramids as large as the Luxor stacked in...,b'The 11 Top Party Cities in the World',b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',"b""The commander of a Navy air reconnaissance s...","b""92% of CNN readers: Russia's actions in Geor...",b'USA to send fleet into Black Sea to help Geo...,"b""US warns against Israeli plan to strike agai...","b""In an intriguing cyberalliance, two Estonian...",b'The CNN Effect: Georgia Schools Russia in In...,b'Why Russias response to Georgia was right',b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",b'Russia exaggerating South Ossetian death tol...,b' Musharraf expected to resign rather than fa...,b'Moscow Made Plans Months Ago to Invade Georgia',b'Why Russias response to Georgia was right',b'Nigeria has handed over the potentially oil-...,b'The US and Poland have agreed a preliminary ...,b'Russia apparently is sabotaging infrastructu...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


Out of curiosity, I next check whether the labels of the dataset are correct. According to the source, the label is 1 if the value rose or stayed the same on that day, and 0 if it fell (the Adj Close of the given day compared with that of the previous day).

In [10]:
# Load the stock data
df_stock = web.DataReader("DJIA", data_source="yahoo", start="2008-08-08", 
                          end="2016-07-01")
 
# Show the stock data
df_stock.head()

Date,High,Low,Open,Close,Volume,Adj Close
2008-08-08,11808.490234,11344.230469,11432.089844,11734.320312,4966810000,11734.320312
2008-08-11,11933.549805,11580.19043,11729.669922,11782.349609,5067310000,11782.349609
2008-08-12,11830.389648,11541.429688,11781.700195,11642.469727,4711290000,11642.469727
2008-08-13,11689.049805,11377.370117,11632.80957,11532.959961,4787600000,11532.959961
2008-08-14,11744.330078,11399.839844,11532.070312,11615.929688,4064000000,11615.929688


I bring the date formats into a uniform shape so that they can be compared.

In [11]:
temp_day = []

for day in range(len(df_stock)):
    temp_day.append(df_stock.index[day].date())

df_stock.index = temp_day

# Show the stock data
df_stock.head()

Unnamed: 0,High,Low,Open,Close,Volume,Adj Close
2008-08-08,11808.490234,11344.230469,11432.089844,11734.320312,4966810000,11734.320312
2008-08-11,11933.549805,11580.19043,11729.669922,11782.349609,5067310000,11782.349609
2008-08-12,11830.389648,11541.429688,11781.700195,11642.469727,4711290000,11642.469727
2008-08-13,11689.049805,11377.370117,11632.80957,11532.959961,4787600000,11532.959961
2008-08-14,11744.330078,11399.839844,11532.070312,11615.929688,4064000000,11615.929688
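
The conversion matters because a timestamp and a plain date print differently, so a string comparison against the CSV dates only works after the time part is dropped. A minimal sketch with the standard library, using a hypothetical three-day sample instead of the real indexes:

```python
from datetime import datetime

# Hypothetical sample: timestamps as a price source might return them,
# and date strings as they appear in the news CSV index
stock_index = [datetime(2008, 8, 8), datetime(2008, 8, 11), datetime(2008, 8, 12)]
news_index = ["2008-08-08", "2008-08-11", "2008-08-12"]

# Drop the time part, then compare the ISO string forms
stock_dates = [ts.date() for ts in stock_index]
mismatches = [
    position
    for position, (news_day, stock_day) in enumerate(zip(news_index, stock_dates))
    if news_day != str(stock_day)
]

print(str(stock_index[0]))  # timestamp form, includes the time
print(str(stock_dates[0]))  # date form, comparable with the CSV strings
print(mismatches)
```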


First I check whether the dates match.

In [12]:
difference = []

if len(df_combined) == len(df_stock):
    print("The lengths are the same!")

# Compare only the overlapping part so a length mismatch cannot raise an IndexError
for day in range(min(len(df_combined), len(df_stock))):
    if str(df_combined.index[day]) != str(df_stock.index[day]):
        print("There is difference at: " + str(day) + " index")
        print("News: " + str(df_combined.index[day]) + "\tStock: " + str(df_stock.index[day]))
        difference.append(day)

if len(difference) == 0:
    print("The dates matched!")

The lengths are the same!
The dates matched!


Checking the labels.

In [13]:
difference = []

adj_close = df_stock["Adj Close"]
labels = df_combined["Label"]

# Start at 1: the first day has no previous trading day to compare with
for day in range(1, len(df_stock)):
    # int() truncates, so prices that differ only after the decimal point count as unchanged
    today = int(adj_close.iloc[day])
    yesterday = int(adj_close.iloc[day - 1])

    # Expected label: 1 -> rise or unchanged, 0 -> fall
    expected = 1 if today >= yesterday else 0

    if labels.iloc[day] != expected:
        difference.append(str(df_stock.index[day]))
        print("Problem at day " + str(df_stock.index[day]))
        print("Today: " + str(adj_close.iloc[day]) + "\t\tYesterday: " + str(adj_close.iloc[day - 1]) + "\t\tLabel: " + str(labels.iloc[day]) + "\n")

print("All differences: " + str(len(difference)))

Problem at day 2010-10-14
Today: 11096.919921875		Yesterday: 11096.080078125		Label: 0

Problem at day 2012-11-12
Today: 12815.080078125		Yesterday: 12815.3896484375		Label: 0

Problem at day 2012-11-15
Today: 12570.9501953125		Yesterday: 12570.9501953125		Label: 0

Problem at day 2013-04-12
Today: 14865.0595703125		Yesterday: 14865.1396484375		Label: 0

Problem at day 2014-04-24
Today: 16501.650390625		Yesterday: 16501.650390625		Label: 0

Problem at day 2015-08-12
Today: 17402.509765625		Yesterday: 17402.83984375		Label: 0

Problem at day 2015-11-27
Today: 17813.390625		Yesterday: 17813.390625		Label: 0

All differences: 7
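
The labelling rule quoted from the source can be written as a small function. The sketch below is only an illustration (the name `expected_label` and the `truncate` switch are hypothetical): with `truncate=True` it mirrors the `int()` comparison used in the cell above, with `truncate=False` it compares the prices at full precision. For days where the two prices share the same integer part, the two variants can disagree, so whether such a day counts as mislabelled depends on which comparison is intended.

```python
def expected_label(today, yesterday, truncate=False):
    """Return 1 if the close rose or stayed the same, 0 if it fell."""
    if truncate:
        # Mirrors a comparison on int() values: the fractional part is ignored
        today, yesterday = int(today), int(yesterday)
    return 1 if today >= yesterday else 0

# Values copied from the output above (2012-11-12)
today, yesterday = 12815.080078125, 12815.3896484375

full_precision = expected_label(today, yesterday)
integer_parts = expected_label(today, yesterday, truncate=True)

print(full_precision, integer_parts)
```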


The label is wrong in a few places. After some digging I found that their price query itself was faulty for a few days, so I correct these labels and then save the corrected dataset to my Drive.

In [14]:
# correct the wrong labels
for row in difference:
    if df_combined.loc[row, "Label"] == 0:
        df_combined.loc[row, "Label"] = 1
    else:
        df_combined.loc[row, "Label"] = 0

# check them
for row in difference:
    print(str(row) + "\t\t" + str(df_combined.loc[row, "Label"]))

2010-10-14		1
2012-11-12		1
2012-11-15		1
2013-04-12		1
2014-04-24		1
2015-08-12		1
2015-11-27		1


In [15]:
# save to drive
df_combined.to_csv('drive/MyDrive/Kaggle dataset/Reddit Top 25 DJIA/KAG_REDDIT_WRLD_DJIA_DF_corrected.csv')

# Show the dataset
df_combined.head()

Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,Top11,Top12,Top13,Top14,Top15,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",b'Georgian troops retreat from S. Osettain cap...,b'Did the U.S. Prep Georgia for War with Russia?',b'Rice Gives Green Light for Israel to Attack ...,b'Announcing:Class Action Lawsuit on Behalf of...,"b""So---Russia and Georgia are at war and the N...","b""China tells Bush to stay out of other countr...",b'Did World War III start today?',b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,b'Welcome To World War IV! Now In High Definit...,"b""Georgia's move, a mistake of monumental prop...",b'Russia presses deeper into Georgia; U.S. say...,b'Abhinav Bindra wins first ever Individual Ol...,b' U.S. ship heads for Arctic to define territ...,b'Drivers in a Jerusalem taxi station threaten...,b'The French Team is Stunned by Phelps and the...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...","b""The US military was surprised by the timing ...",b'U.S. Beats War Drum as Iran Dumps the Dollar',"b'Gorbachev: ""Georgian military attacked the S...",b'CNN use footage of Tskhinvali ruins to cover...,b'Beginning a war as the Olympics were opening...,b'55 pyramids as large as the Luxor stacked in...,b'The 11 Top Party Cities in the World',b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',"b""The commander of a Navy air reconnaissance s...","b""92% of CNN readers: Russia's actions in Geor...",b'USA to send fleet into Black Sea to help Geo...,"b""US warns against Israeli plan to strike agai...","b""In an intriguing cyberalliance, two Estonian...",b'The CNN Effect: Georgia Schools Russia in In...,b'Why Russias response to Georgia was right',b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",b'Russia exaggerating South Ossetian death tol...,b' Musharraf expected to resign rather than fa...,b'Moscow Made Plans Months Ago to Invade Georgia',b'Why Russias response to Georgia was right',b'Nigeria has handed over the potentially oil-...,b'The US and Poland have agreed a preliminary ...,b'Russia apparently is sabotaging infrastructu...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


Next I look for any days or cells without data and replace them with a blank string (a single space). This is needed so that the later text processing runs without errors.

In [16]:
# Find the cells with NaN, then the rows that contain them
is_NaN = df_combined.isnull()
row_has_NaN = is_NaN.any(axis = 1)
rows_with_NaN = df_combined[row_has_NaN]

# Replace them
df_combined = df_combined.replace(np.nan, " ")

# Check the process
is_NaN = df_combined.isnull()
row_has_NaN = is_NaN.any(axis = 1)
rows_with_NaN = df_combined[row_has_NaN]

assert len(rows_with_NaN) == 0

After this I concatenate the headlines belonging to one day into shared strings. The number of headlines per string is defined by a macro:


*   ROWS - number of headlines merged together

This cell also contains my first preprocessing step: removing the b character found at the start of the strings.

In [17]:
# Get column names
combined_column_names = []
for column in df_combined.columns:
  combined_column_names.append(column)

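# 2D list for the merged news: one inner list per news package, one entry per day
# (with 25 headlines and ROWS = 2 the last headline, Top25, is left out)
COLUMNS = len(df_combined)
news_sum = [[0 for i in range(COLUMNS)] for j in range(int((len(combined_column_names) - 1) / ROWS))]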

# Show the column names
print("Column names of the dataset:") 
print(combined_column_names)

# Merge the news
for row in range(len(df_combined)):
  for column in range(int((len(combined_column_names) - 1) / ROWS)):
    temp = ""
    news = ""
    for word in range(ROWS):
      news = df_combined[combined_column_names[(column * ROWS) + (word + 1)]][row]
      # Remove the b character at the beginning of the string
      if news[0] == "b":
        news = " " + news[1:]
      temp = temp + news
    news_sum[column][row] = temp

# Show the first day second package of the news
print("\nThe first day second package of the news:")
print(news_sum[1][0])

Column names of the dataset:
['Label', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7', 'Top8', 'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15', 'Top16', 'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23', 'Top24', 'Top25']

The first day second package of the news:
 'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)' 'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire'
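
The merging step can also be read as a small standalone function. This is a sketch under assumptions: the helper names are hypothetical and the 25 placeholder headlines stand in for one day of the dataset. It shows that with `ROWS = 2` a day of 25 headlines yields 12 packages and the last headline is left out.

```python
ROWS = 2  # headlines per package, same meaning as the macro above

def strip_bytes_prefix(headline):
    """Replace a leading 'b' (left over from a bytes repr) with a space."""
    # Note: any string that starts with "b" is affected, not only bytes reprs
    return " " + headline[1:] if headline.startswith("b") else headline

def merge_headlines(headlines, rows=ROWS):
    """Join consecutive headlines in groups of `rows`; a short tail is dropped."""
    packages = []
    for start in range(0, len(headlines) - rows + 1, rows):
        group = headlines[start:start + rows]
        packages.append("".join(strip_bytes_prefix(h) for h in group))
    return packages

# Hypothetical day with 25 placeholder headlines
day = ["b'headline %d'" % i for i in range(1, 26)]
packages = merge_headlines(day)

print(len(packages))
print(packages[0])
```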


After this I replace the earlier columns (Top1, Top2...) with as many columns as the grouping requires, named accordingly (News_1, News_2...), and fill them with the merged news packages.

In [18]:
# Drop the old columns
for column in range(len(combined_column_names) - 1):
  df_combined.drop(combined_column_names[column + 1], axis = 1, inplace = True)

# Create the new columns with the merged news
for column in range(int((len(combined_column_names) - 1) / ROWS)):
  column_name = "News_" + str(column + 1)
  df_combined[column_name] = news_sum[column]

# Show the DataFrame
df_combined.head()

Date,Label,News_1,News_2,News_3,News_4,News_5,News_6,News_7,News_8,News_9,News_10,News_11,News_12
2008-08-08,0,"""Georgia 'downs two Russian warplanes' as cou...",'Russia Today: Columns of troops roll into So...,"""Afghan children raped with 'impunity,' U.N. ...","""Breaking: Georgia invades South Ossetia, Rus...",'Georgian troops retreat from S. Osettain cap...,'Rice Gives Green Light for Israel to Attack ...,"""So---Russia and Georgia are at war and the N...",'Did World War III start today?' 'Georgia Inv...,'Al-Qaeda Faces Islamist Backlash' 'Condoleez...,'This is a busy day: The European Union has ...,'Why the Pentagon Thinks Attacking Iran is a ...,'Indian shoe manufactory - And again in a se...
2008-08-11,1,'Why wont America and Nato help us? If they w...,"""Jewish Georgian minister: Thanks to Israeli ...","""Olympic opening ceremony fireworks 'faked'"" ...",'Russia angered by Israeli military sale to G...,'Welcome To World War IV! Now In High Definit...,'Russia presses deeper into Georgia; U.S. say...,' U.S. ship heads for Arctic to define territ...,'The French Team is Stunned by Phelps and the...,"'""Do not believe TV, neither Russian nor Geor...",'China to overtake US as largest manufacturer...,'Israeli Physicians Group Condemns State Tort...,'Perhaps *the* question about the Georgia - R...
2008-08-12,0,'Remember that adorable 9-year-old who sang a...,"'""If we had no sexual harassment we would hav...",'Ceasefire in Georgia: Putin Outmaneuvers the...,'Stratfor: The Russo-Georgian War and the Bal...,"""The US military was surprised by the timing ...","'Gorbachev: ""Georgian military attacked the S...",'Beginning a war as the Olympics were opening...,'The 11 Top Party Cities in the World' 'U.S. ...,'Why Russias response to Georgia was right' '...,"'Russia, Georgia, and NATO: Cold War Two' 'Re...",'War in Georgia: The Israeli connection' 'All...,'Christopher King argues that the US and NATO...
2008-08-13,0,' U.S. refuses Israel weapons to attack Iran:...,' Israel clears troops who killed Reuters cam...,'Body of 14 year old found in trunk; Latest (...,"""Bush announces Operation Get All Up In Russi...","""The commander of a Navy air reconnaissance s...",'USA to send fleet into Black Sea to help Geo...,"""In an intriguing cyberalliance, two Estonian...",'Why Russias response to Georgia was right' '...,'US humanitarian missions soon in Georgia - i...,"'Russian convoy heads into Georgia, violating...",'Gorbachev: We Had No Choice' 'Witness: Russi...,' Quarter of Russians blame U.S. for conflict...
2008-08-14,1,'All the experts admit that we should legalis...,'Swedish wrestler Ara Abrahamian throws away ...,'Missile That Killed 9 Inside Pakistan May Ha...,'Poland and US agree to missle defense deal. ...,'Russia exaggerating South Ossetian death tol...,'Moscow Made Plans Months Ago to Invade Georg...,'Nigeria has handed over the potentially oil-...,'Russia apparently is sabotaging infrastructu...,"""Georgia confict could set back Russia's US r...","'""Non-media"" photos of South Ossetia/Georgia ...",'Saudi Arabia: Mother moves to block child ma...,"'Russia: World ""can forget about"" Georgia\'s..."


I regroup the news blocks with their labels into a new DataFrame, this time without the dates.

In [19]:
# The label column 
LABEL_COLUMN = 0

news_sum = []
label_sum = []

# Get the column names
combined_column_names = []
for column in df_combined.columns:
  combined_column_names.append(column)

# Write out the column names 
print(combined_column_names)
print("\n")

# Connect the merged news with the labels (outer loop: days, inner loop: news packages)
for row in range(len(df_combined)):
  for column in range(len(combined_column_names) - 1):
    news_sum.append(df_combined[combined_column_names[column + 1]].iloc[row])
    label_sum.append(df_combined[combined_column_names[LABEL_COLUMN]].iloc[row])

# Create the new DataFrame
df_sum_news_labels = pd.DataFrame(data = label_sum, index = None, columns = ["Label"])
df_sum_news_labels["News"] = news_sum

# Show it
df_sum_news_labels.head()

['Label', 'News_1', 'News_2', 'News_3', 'News_4', 'News_5', 'News_6', 'News_7', 'News_8', 'News_9', 'News_10', 'News_11', 'News_12']




Unnamed: 0,Label,News
0,0,"""Georgia 'downs two Russian warplanes' as cou..."
1,0,'Russia Today: Columns of troops roll into So...
2,0,"""Afghan children raped with 'impunity,' U.N. ..."
3,0,"""Breaking: Georgia invades South Ossetia, Rus..."
4,0,'Georgian troops retreat from S. Osettain cap...
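
The regrouping turns one row per day into one row per news package, with every package inheriting the label of its day. A minimal sketch on a hypothetical two-day miniature of the merged table, using plain dictionaries instead of a DataFrame:

```python
# Hypothetical miniature of the merged table: one row per day
days = [
    {"Label": 0, "News_1": "package a", "News_2": "package b"},
    {"Label": 1, "News_1": "package c", "News_2": "package d"},
]

labels, news = [], []
for day in days:
    for column, package in day.items():
        if column == "Label":
            continue
        # Every package inherits the label of its day
        news.append(package)
        labels.append(day["Label"])

print(list(zip(labels, news)))
```

One consequence worth keeping in mind: packages from the same day share a label, so after shuffling they can end up on both sides of a train/test split.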


I start with text preprocessing: removing punctuation, removing numbers, removing superfluous spaces, and then converting every word to lower case.

In [20]:
# Removing punctuations
temp_news = []
for line in news_sum:
  temp_attach = ""
  for word in line:
    temp = " "
    if word not in string.punctuation:
      temp = word
    temp_attach = temp_attach + "".join(temp)
  temp_news.append(temp_attach)

news_sum = temp_news
temp_news = []

# Remove numbers
for line in news_sum:
  temp_attach = ""
  for word in line:
    temp = " "
    if not word.isdigit():
      temp = word
    temp_attach = temp_attach + "".join(temp)
  temp_news.append(temp_attach)

# Remove space
for line in range(len(temp_news)):    
  temp_news[line] = " ".join(temp_news[line].split())

# Converting headlines to lower case
for line in range(len(temp_news)): 
    temp_news[line] = temp_news[line].lower()

# Update the data frame
df_sum_news_labels["News"] = temp_news

# Show it
df_sum_news_labels.head()

Unnamed: 0,Label,News
0,0,georgia downs two russian warplanes as countri...
1,0,russia today columns of troops roll into south...
2,0,afghan children raped with impunity u n offici...
3,0,breaking georgia invades south ossetia russia ...
4,0,georgian troops retreat from s osettain capita...
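
The same four steps can be expressed more compactly with a translation table. This is a sketch, not the code that produced the output above: the helper `clean_headline` is hypothetical, and it only handles ASCII punctuation and digits, whereas `str.isdigit()` in the cell above also matches non-ASCII digits.

```python
import string

# Map every ASCII punctuation mark and digit to a space
_TO_SPACE = str.maketrans({char: " " for char in string.punctuation + string.digits})

def clean_headline(text):
    """Hypothetical helper: strip punctuation and digits, squeeze spaces, lower-case."""
    return " ".join(text.translate(_TO_SPACE).split()).lower()

print(clean_headline("'Georgia will withdraw 1,000 soldiers'"))
print(clean_headline("U.N. official says"))
```

Replacing characters with a space rather than deleting them keeps neighbouring words apart, which is why "U.N." becomes the two tokens "u" and "n".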


Next I remove the so-called filler words (stop words).

In [21]:
# Load the stop words
stop_words = set(stopwords.words('english'))

filtered_sentence = []
news_sum = df_sum_news_labels["News"]

# Remove stop words
for line in news_sum:
  word_tokens = word_tokenize(line)
  temp_attach = ""
  for word in word_tokens:
    temp = " "
    if word not in stop_words:
      temp = temp + word
    temp_attach = temp_attach + "".join(temp)
  filtered_sentence.append(temp_attach)

# Remove space
for line in range(len(filtered_sentence)):    
  filtered_sentence[line] = " ".join(filtered_sentence[line].split())

# Update the data frame
df_sum_news_labels["News"] = filtered_sentence

# Show the DataFrame
df_sum_news_labels.head()

Unnamed: 0,Label,News
0,0,georgia downs two russian warplanes countries ...
1,0,russia today columns troops roll south ossetia...
2,0,afghan children raped impunity u n official sa...
3,0,breaking georgia invades south ossetia russia ...
4,0,georgian troops retreat osettain capital presu...
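
As a rough illustration of the filtering step, the sketch below drops stop words with a single generator expression. It is a simplification under stated assumptions: `STOP_WORDS` is a tiny hand-picked subset (the notebook uses the full NLTK English list), and whitespace splitting stands in for `word_tokenize`, which is a close approximation only because punctuation has already been removed.

```python
# A tiny illustrative subset of English stop words (assumption: the
# notebook uses the much longer NLTK list instead).
STOP_WORDS = {"as", "of", "into", "with", "from", "the", "to"}

def drop_stop_words(text):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(tok for tok in text.split() if tok not in STOP_WORDS)

filtered = drop_stop_words("russia today columns of troops roll into south ossetia")
print(filtered)
```

Because the tokens are re-joined with single spaces, no separate whitespace clean-up pass is needed in this variant.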


Next, I locate the zero-length strings in the dataset and delete the rows that contain them.

In [22]:
news_sum = df_sum_news_labels["News"]
null_indexes = []
index = 0

for line in news_sum:
  if line == "":
    null_indexes.append(index)
  index = index + 1

print(null_indexes)

for row in null_indexes:
  df_sum_news_labels = df_sum_news_labels.drop(row)

news_sum = df_sum_news_labels["News"]
null_indexes = []
index = 0

for line in news_sum:
  if line == "":
    null_indexes.append(index)
  index = index + 1

assert len(null_indexes) == 0

[3335]


Shuffling the dataset into random order.

In [23]:
# Do the shuffle
for i in range(SHUFFLE_CYCLE):
  df_sum_news_labels = shuffle(df_sum_news_labels, random_state = RANDOM_SEED)

# Reset the index
df_sum_news_labels.reset_index(inplace=True, drop=True)

# Show the data frame
df_sum_news_labels.head()

Unnamed: 0,Label,News
0,0,uk form promoted sir ian blair demands public ...
1,1,ebola crisis west africa deepenspakistan man a...
2,0,killing whales let agree disagree says japanve...
3,0,two week nobel prize winners talk destruction ...
4,1,bus carrying foreign journalists north korea t...
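
Fixing the random state is what makes the shuffle reproducible between runs. The sketch below shows the same idea with Python's own `random` module rather than `sklearn.utils.shuffle`; `seeded_shuffle` is a hypothetical helper, and the seed value 42 is a placeholder, since the value of `RANDOM_SEED` is not shown in this section.

```python
import random

def seeded_shuffle(rows, seed):
    """Return a shuffled copy of rows; the same seed gives the same order."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled

rows = list(range(10))
first = seeded_shuffle(rows, seed=42)   # placeholder seed
second = seeded_shuffle(rows, seed=42)
print(first == second)
```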


Splitting the dataset into training and validation/test sets, then verifying the split by comparing the sizes and by printing the rows on either side of the split boundary.

In [24]:
INPUT_SIZE = len(df_sum_news_labels)
TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 
TEST_SIZE = int(TEST_SPLIT * INPUT_SIZE)

# Split the dataset
train = df_sum_news_labels[:TRAIN_SIZE] 
test = df_sum_news_labels[TRAIN_SIZE:]

# Print out the length
print("Train data set length: " + str(len(train)))
print("Test data set length: " + str(len(test)))
print("Split summa: " + str(len(train) + len(test)))
print("Dataset summa before split: " + str(len(df_sum_news_labels)))

# check
split_sum = len(train) + len(test)
total_rows = len(df_sum_news_labels)  # avoid shadowing the built-in sum()
assert split_sum == total_rows

Train data set length: 20286
Test data set length: 3581
Split summa: 23867
Dataset summa before split: 23867
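
The split is a plain head/tail slice, so the two sizes follow directly from the row count. The sketch below reproduces the arithmetic; `split_sizes` is a hypothetical helper, and the train fraction of 0.85 is an assumption (the value of `TRAIN_SPLIT` is defined outside this section), chosen because it is consistent with the sizes printed above.

```python
def split_sizes(n_rows, train_fraction):
    """Sizes of the head and tail slices produced by df[:k] and df[k:]."""
    train_size = int(train_fraction * n_rows)  # int() truncates toward zero
    return train_size, n_rows - train_size

# Assumption: TRAIN_SPLIT = 0.85, which matches the printed sizes.
train_size, test_size = split_sizes(23867, 0.85)
print(train_size, test_size)
```

Since the test slice is simply everything after the training slice, the two parts always add up to the full dataset, which is what the assertion in the cell checks.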


In [25]:
train.tail(1)

Unnamed: 0,Label,News
20285,0,taiwan writes un womens rights convention dome...


In [26]:
test.head(1)

Unnamed: 0,Label,News
20286,1,putin backs ukraine election russia putin says...


### Bag of words

First, I collect the news items of the training set into a list.

In [27]:
train_headlines = []

for row in range(0, len(train.index)):
    train_headlines.append(train.iloc[row, 1])

# show the first
train_headlines[0]

'uk form promoted sir ian blair demands public venues define ethnicity audience russia face massive social unrest russian industrial towns might face social unrest violence companies plan massive layoffs russian sociologist says'

Next, I vectorize them.

In [28]:
bow_vectorizer = CountVectorizer()
bow_train = bow_vectorizer.fit_transform(train_headlines)
print(bow_train.shape)

(20286, 38597)
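
To make the shape above concrete, the following standard-library sketch builds a tiny bag-of-words matrix by hand. It is only an approximation of what `CountVectorizer` does with its default settings (lower-casing, tokens of two or more word characters, alphabetically sorted vocabulary); the two documents and the helper names `tokenize` and `to_counts` are made up for illustration.

```python
import re
from collections import Counter

docs = ["troops roll into south ossetia", "georgia invades south ossetia"]

def tokenize(text):
    # Tokens of two or more word characters, as in CountVectorizer's default pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

# Feature columns are the sorted distinct tokens of the training documents.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

def to_counts(text):
    """Count how often each vocabulary token occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]

matrix = [to_counts(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

Here the matrix has 2 rows and 7 columns; in the notebook the same construction yields 20286 rows (training documents) and 38597 columns (distinct tokens). Tokens that appear only in the test set have no column and are ignored at prediction time.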


I fit a logistic regression model to this training set.

In [29]:
bow_model = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
bow_model = bow_model.fit(bow_train, train["Label"])

The next step is to prepare the test set and predict its labels with the model.

In [30]:
test_headlines = []

for row in range(0,len(test.index)):
    test_headlines.append(test.iloc[row, 1])

bow_test = bow_vectorizer.transform(test_headlines)
bow_predictions = bow_model.predict(bow_test)

Displaying the results in a table.

In [31]:
pd.crosstab(test["Label"], bow_predictions, rownames=["Actual"], colnames=["Predicted"])

Predicted,0,1
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,698,964
1,810,1109


The accuracy of the model.

In [32]:
print (classification_report(test["Label"], bow_predictions))
print (accuracy_score(test["Label"], bow_predictions))

result.append(accuracy_score(test["Label"], bow_predictions))

              precision    recall  f1-score   support

           0       0.46      0.42      0.44      1662
           1       0.53      0.58      0.56      1919

    accuracy                           0.50      3581
   macro avg       0.50      0.50      0.50      3581
weighted avg       0.50      0.50      0.50      3581

0.5046076514939961
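
The figures in the report can be recomputed by hand from the four counts in the confusion table. The sketch below does this for class 1; the variable names are chosen here for illustration and do not appear in the notebook.

```python
# Counts taken from the crosstab above: rows are actual, columns are predicted.
tn, fp = 698, 964    # actual label 0
fn, tp = 810, 1109   # actual label 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of everything predicted 1, the share that was 1
recall_1 = tp / (tp + fn)      # of everything actually 1, the share that was found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The rounded values agree with the class 1 row of the report (0.53, 0.58, 0.56) and with the printed accuracy.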


Next, I display the 10 most influential tokens in both the positive and the negative direction.

In [33]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow_words = bow_vectorizer.get_feature_names_out()
bow_coeffs = bow_model.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : bow_words, 
                        'Coefficient' : bow_coeffs})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
29152,riyadh,1.652154
34468,thriving,1.606117
18845,landing,1.598721
16996,intercept,1.592312
29752,sanaa,1.558651
1493,anyway,1.461035
35585,ugandan,1.432039
10943,enemies,1.406581
35048,transition,1.404805
26713,proposals,1.401941


In [34]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
882,airways,-1.356654
19181,lecture,-1.386058
34105,temporary,-1.388032
26593,profound,-1.398332
5828,choppers,-1.457017
32757,stranded,-1.513365
30598,separation,-1.515334
5353,census,-1.517089
10587,egypts,-1.522416
2811,barclays,-1.691836
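
One way to read these coefficients is as odds multipliers: in a logistic regression, one additional occurrence of a token multiplies the odds of label 1 by the exponential of its coefficient, with all other counts held fixed. The sketch below applies this to the two most extreme tokens from the tables above. Given that the test accuracy is close to 0.50, such weights should probably be read with caution, as they may reflect rare tokens in the training set rather than a real signal.

```python
import math

# Coefficients copied from the two tables above.
coef_riyadh = 1.652154
coef_barclays = -1.691836

# exp(coefficient) is the factor by which the odds of label 1 change
# for one extra occurrence of the token.
odds_riyadh = math.exp(coef_riyadh)
odds_barclays = math.exp(coef_barclays)

print(round(odds_riyadh, 2), round(odds_barclays, 2))
```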


### 2-gram model

As before, I vectorize the training set, fit a logistic regression model to it, run the prediction, and then evaluate the results. First with the (1,2) n-gram model.

In [35]:
gram_vectorizer_12 = CountVectorizer(ngram_range=(1,2))
train_vectorizer_12 = gram_vectorizer_12.fit_transform(train_headlines)

print(train_vectorizer_12.shape)

gram_model_12 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_12 = gram_model_12.fit(train_vectorizer_12, train["Label"])

gram_test_12 = gram_vectorizer_12.transform(test_headlines)
gram_predictions_12 = gram_model_12.predict(gram_test_12)

print (classification_report(test["Label"], gram_predictions_12))
print (accuracy_score(test["Label"], gram_predictions_12))

result.append(accuracy_score(test["Label"], gram_predictions_12))

(20286, 387666)
              precision    recall  f1-score   support

           0       0.48      0.39      0.43      1662
           1       0.55      0.64      0.59      1919

    accuracy                           0.52      3581
   macro avg       0.51      0.51      0.51      3581
weighted avg       0.52      0.52      0.52      3581

0.5230382574699804
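
The `ngram_range=(1,2)` setting makes the vectorizer use both single words and pairs of adjacent words as features, while `(2,2)` keeps only the pairs. The sketch below enumerates word n-grams for one short example; `word_ngrams` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def word_ngrams(tokens, min_n, max_n):
    """All contiguous word n-grams with min_n <= n <= max_n, ordered by n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "georgia invades south ossetia".split()
unigrams_and_bigrams = word_ngrams(tokens, 1, 2)
bigrams_only = word_ngrams(tokens, 2, 2)
print(unigrams_and_bigrams)
print(bigrams_only)
```

A text of k tokens contributes k - n + 1 n-grams of order n, which is why the feature matrix grows so quickly once longer n-grams are included.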


In [36]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
gram_words_12 = gram_vectorizer_12.get_feature_names_out()
gram_coeffs_12 = gram_model_12.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_12, 
                        'Coefficient' : gram_coeffs_12})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
189133,landing,0.965702
108882,enemies,0.961186
167531,influence,0.952113
162036,identified,0.948218
169960,intercept,0.904252
305149,seize,0.900881
796,abroad,0.84629
312974,significant,0.829626
37258,blackout,0.824972
146721,green,0.820125


In [37]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
74623,cooperation,-0.823124
201708,low,-0.833886
105743,egypts,-0.880279
227415,nepal,-0.931394
30566,barclays,-0.933883
327481,stranded,-0.936199
55120,census,-0.936524
297907,saudi king,-0.942706
378793,withdraw,-0.967675
318224,somalia,-0.985485


Second, with the (2,2) n-gram model.

In [38]:
gram_vectorizer_22 = CountVectorizer(ngram_range=(2,2))
train_vectorizer_22 = gram_vectorizer_22.fit_transform(train_headlines)

print(train_vectorizer_22.shape)

gram_model_22 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_22 = gram_model_22.fit(train_vectorizer_22, train["Label"])

gram_test_22 = gram_vectorizer_22.transform(test_headlines)
gram_predictions_22 = gram_model_22.predict(gram_test_22)

pd.crosstab(test["Label"], gram_predictions_22, rownames=["Actual"], colnames=["Predicted"])

print (classification_report(test["Label"], gram_predictions_22))
print (accuracy_score(test["Label"], gram_predictions_22))

result.append(accuracy_score(test["Label"], gram_predictions_22))

(20286, 349069)
              precision    recall  f1-score   support

           0       0.50      0.30      0.38      1662
           1       0.55      0.74      0.63      1919

    accuracy                           0.54      3581
   macro avg       0.52      0.52      0.50      3581
weighted avg       0.53      0.54      0.51      3581

0.5356045797263335


In [39]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
gram_words_22 = gram_vectorizer_22.get_feature_names_out()
gram_coeffs_22 = gram_model_22.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_22, 
                        'Coefficient' : gram_coeffs_22})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
161653,jazeera english,0.869494
278036,sexual violence,0.822566
121295,french president,0.79824
152664,intelligence agencies,0.771891
343669,world cup,0.755882
285350,social media,0.753069
124460,gaza border,0.751969
134717,gunmen kill,0.749832
32016,big brother,0.743814
8029,air pollution,0.742354


In [40]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
54792,chinese officials,-0.781498
80759,defense system,-0.793795
220105,panama papers,-0.801892
322400,un chief,-0.808713
285687,solar system,-0.813581
325952,us army,-0.820279
293542,stock market,-0.847732
135426,haiti earthquake,-0.866586
192802,military bases,-0.873164
268028,saudi king,-1.150206


### 3-gram model

As before, I vectorize the training set, fit a logistic regression model to it, run the prediction, and then evaluate the results. First with the (1,3) n-gram model.

In [41]:
gram_vectorizer_13 = CountVectorizer(ngram_range=(1,3))
train_vectorizer_13 = gram_vectorizer_13.fit_transform(train_headlines)

print(train_vectorizer_13.shape)

gram_model_13 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_13 = gram_model_13.fit(train_vectorizer_13, train["Label"])

gram_test_13 = gram_vectorizer_13.transform(test_headlines)
gram_predictions_13 = gram_model_13.predict(gram_test_13)

print (classification_report(test["Label"], gram_predictions_13))
print (accuracy_score(test["Label"], gram_predictions_13))

result.append(accuracy_score(test["Label"], gram_predictions_13))

(20286, 801460)
              precision    recall  f1-score   support

           0       0.50      0.39      0.44      1662
           1       0.55      0.66      0.60      1919

    accuracy                           0.53      3581
   macro avg       0.53      0.52      0.52      3581
weighted avg       0.53      0.53      0.52      3581

0.5330913152750628


In [42]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
gram_words_13 = gram_vectorizer_13.get_feature_names_out()
gram_coeffs_13 = gram_model_13.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_13, 
                        'Coefficient' : gram_coeffs_13})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
629547,seize,0.775658
331944,identified,0.767014
342761,influence,0.750768
300949,green,0.748597
388540,landing,0.72701
221817,enemies,0.713953
761392,volcano,0.712706
456849,mumbai,0.683135
568956,rate,0.680255
347439,intercept,0.672334


In [43]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
62917,barclays,-0.676459
175268,de,-0.685886
614141,saudi king,-0.690781
69078,beijing,-0.690905
675482,stranded,-0.708413
112590,census,-0.724191
414630,low,-0.764888
467471,nepal,-0.769765
781768,withdraw,-0.781474
656218,somalia,-0.930461


As before, I vectorize the training set, fit a logistic regression model to it, run the prediction, and then evaluate the results. Next, with the (2,3) n-gram model.

In [44]:
gram_vectorizer_23 = CountVectorizer(ngram_range=(2,3))
train_vectorizer_23 = gram_vectorizer_23.fit_transform(train_headlines)

print(train_vectorizer_23.shape)

gram_model_23 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_23 = gram_model_23.fit(train_vectorizer_23, train["Label"])

gram_test_23 = gram_vectorizer_23.transform(test_headlines)
gram_predictions_23 = gram_model_23.predict(gram_test_23)

pd.crosstab(test["Label"], gram_predictions_23, rownames=["Actual"], colnames=["Predicted"])

print (classification_report(test["Label"], gram_predictions_23))
print (accuracy_score(test["Label"], gram_predictions_23))

result.append(accuracy_score(test["Label"], gram_predictions_23))

(20286, 762863)
              precision    recall  f1-score   support

           0       0.50      0.23      0.32      1662
           1       0.55      0.80      0.65      1919

    accuracy                           0.54      3581
   macro avg       0.52      0.52      0.48      3581
weighted avg       0.52      0.54      0.49      3581

0.5358838313320302


In [45]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
gram_words_23 = gram_vectorizer_23.get_feature_names_out()
gram_coeffs_23 = gram_model_23.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_23, 
                        'Coefficient' : gram_coeffs_23})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
749964,world cup,0.721431
261750,french president,0.648438
622045,social media,0.618966
606600,sexual violence,0.588408
596777,security council,0.572161
750600,world largest,0.571545
350540,jazeera english,0.554738
665801,tear gas,0.549425
69207,big brother,0.548341
291509,gunmen kill,0.548241


In [46]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
606423,sexual abuse,-0.556075
119956,christopher hitchens,-0.556145
702952,un chief,-0.568781
308331,hong kong,-0.588145
710756,us army,-0.595496
292977,haiti earthquake,-0.631009
418677,military bases,-0.63348
640245,stock market,-0.651543
480324,panama papers,-0.670802
584262,saudi king,-0.849692


As before, I vectorize the training set, fit a logistic regression model to it, run the prediction, and then evaluate the results. Finally, with the (3,3) n-gram model.

In [47]:
gram_vectorizer_33 = CountVectorizer(ngram_range=(3,3))
train_vectorizer_33 = gram_vectorizer_33.fit_transform(train_headlines)

print(train_vectorizer_33.shape)

gram_model_33 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_33 = gram_model_33.fit(train_vectorizer_33, train["Label"])

gram_test_33 = gram_vectorizer_33.transform(test_headlines)
gram_predictions_33 = gram_model_33.predict(gram_test_33)

pd.crosstab(test["Label"], gram_predictions_33, rownames=["Actual"], colnames=["Predicted"])

print (classification_report(test["Label"], gram_predictions_33))
print (accuracy_score(test["Label"], gram_predictions_33))

result.append(accuracy_score(test["Label"], gram_predictions_33))

(20286, 413794)
              precision    recall  f1-score   support

           0       0.53      0.07      0.12      1662
           1       0.54      0.95      0.69      1919

    accuracy                           0.54      3581
   macro avg       0.53      0.51      0.40      3581
weighted avg       0.53      0.54      0.42      3581

0.5389555989946943
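
The printed feature counts are internally consistent: a mixed range such as (1,3) has exactly as many columns as the single-order vocabularies it covers added together. The sketch below checks this with the shapes reported in the cells above; the dictionary and the helper `expected` are illustrative names.

```python
# Number of feature columns printed above for each ngram_range.
features = {
    (1, 1): 38597,
    (2, 2): 349069,
    (3, 3): 413794,
    (1, 2): 387666,
    (2, 3): 762863,
    (1, 3): 801460,
}

def expected(min_n, max_n):
    """Sum of the single-order vocabulary sizes covered by the range."""
    return sum(features[(n, n)] for n in range(min_n, max_n + 1))

for rng in [(1, 2), (2, 3), (1, 3)]:
    print(rng, features[rng] == expected(*rng))
```

With roughly 20 thousand training documents and several hundred thousand columns, most n-gram features occur in very few documents, which may help explain why the added columns change the accuracy so little.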


In [48]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
gram_words_33 = gram_vectorizer_33.get_feature_names_out()
gram_coeffs_33 = gram_model_33.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_33, 
                        'Coefficient' : gram_coeffs_33})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
9936,al jazeera english,0.907694
131932,first time since,0.876802
168145,human rights watch,0.587147
380974,un security council,0.550488
279659,president hosni mubarak,0.543829
244545,nobel peace prize,0.525923
140471,french president sarkozy,0.500517
17085,anti gay bill,0.494161
144626,gaza war crimes,0.480407
104918,drug decriminalization portugal,0.470565


In [49]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
325883,sentenced death stoning,-0.499194
252868,one child policy,-0.509422
183448,islamic state iraq,-0.513419
123345,faces years jail,-0.51609
339623,sovereign wealth fund,-0.526752
400074,west bank settlement,-0.529796
9228,air strikes syria,-0.587244
58451,chancellor angela merkel,-0.594006
26442,aung san suu,-0.607714
229191,missile defense system,-0.759526


### 4-gram model

From this section on, I evaluate the individual models in a loop and store their results.

In [50]:
for n in range(1,5):
    print("--------------------------------------------\n\nStart of the " 
          + str(n) + ",4 gram model\n")

    _gram_vectorizer_ = CountVectorizer(ngram_range=(n,4))
    _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

    print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

    _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

    _gram_test_ = _gram_vectorizer_.transform(test_headlines)
    _gram_predictions_ = _gram_model_.predict(_gram_test_)

    print (accuracy_score(test["Label"], _gram_predictions_))

    model_type.append(str(n) + ",4 n-gram")
    result.append(accuracy_score(test["Label"], _gram_predictions_))

--------------------------------------------

Start of the 1,4 gram model

The shape is: (20286, 1205402)

0.5316950572465792
--------------------------------------------

Start of the 2,4 gram model

The shape is: (20286, 1166805)

0.5375593409662106
--------------------------------------------

Start of the 3,4 gram model

The shape is: (20286, 817736)

0.541189611840268
--------------------------------------------

Start of the 4,4 gram model

The shape is: (20286, 403942)

0.5356045797263335


### 5-gram model

In this section, I again evaluate the individual models in a loop and store their results.

In [51]:
MODEL_TYPE = 5

for n in range(1,MODEL_TYPE+1):
    print("--------------------------------------------\n\nStart of the " 
          + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

    _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
    _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

    print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

    _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

    _gram_test_ = _gram_vectorizer_.transform(test_headlines)
    _gram_predictions_ = _gram_model_.predict(_gram_test_)

    print (accuracy_score(test["Label"], _gram_predictions_))

    model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
    result.append(accuracy_score(test["Label"], _gram_predictions_))

--------------------------------------------

Start of the 1,5 gram model

The shape is: (20286, 1590717)

0.5314158056408824
--------------------------------------------

Start of the 2,5 gram model

The shape is: (20286, 1552120)

0.5361630829377269
--------------------------------------------

Start of the 3,5 gram model

The shape is: (20286, 1203051)

0.5370008377548171
--------------------------------------------

Start of the 4,5 gram model

The shape is: (20286, 789257)

0.5356045797263335
--------------------------------------------

Start of the 5,5 gram model

The shape is: (20286, 385315)

0.5367215861491204


### 6-gram model

In this section, I again evaluate the individual models in a loop and store their results.

In [52]:
MODEL_TYPE = 6

for n in range(1,MODEL_TYPE+1):
    print("--------------------------------------------\n\nStart of the " 
          + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

    _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
    _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

    print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

    _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

    _gram_test_ = _gram_vectorizer_.transform(test_headlines)
    _gram_predictions_ = _gram_model_.predict(_gram_test_)

    print (accuracy_score(test["Label"], _gram_predictions_))

    model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
    result.append(accuracy_score(test["Label"], _gram_predictions_))

--------------------------------------------

Start of the 1,6 gram model

The shape is: (20286, 1956147)

0.5305780508237923
--------------------------------------------

Start of the 2,6 gram model

The shape is: (20286, 1917550)

0.5370008377548171
--------------------------------------------

Start of the 3,6 gram model

The shape is: (20286, 1568481)

0.5364423345434236
--------------------------------------------

Start of the 4,6 gram model

The shape is: (20286, 1154687)

0.5372800893605139
--------------------------------------------

Start of the 5,6 gram model

The shape is: (20286, 750745)

0.5364423345434236
--------------------------------------------

Start of the 6,6 gram model

The shape is: (20286, 365430)

0.5370008377548171


### Summary of results

Printing the results, with the best one highlighted.

In [53]:
best_model = 0

for model in range(len(model_type)):
    print(str(model_type[model]) + ":\t\t\t\t\t" + str(result[model]))

    if result[model] > best_model:
        best_model = result[model]
        best_model_index = model

print("--------------------------------------------\nBest model:\n" 
      + str(model_type[best_model_index]) + "\t\t\t\t\t" + 
      str(result[best_model_index]))


Bag of words:					0.5046076514939961
1,2 n-gram:					0.5230382574699804
2,2 n-gram:					0.5356045797263335
1,3 n-gram:					0.5330913152750628
2,3 n-gram:					0.5358838313320302
3,3 n-gram:					0.5389555989946943
1,4 n-gram:					0.5316950572465792
2,4 n-gram:					0.5375593409662106
3,4 n-gram:					0.541189611840268
4,4 n-gram:					0.5356045797263335
1,5 n-gram:					0.5314158056408824
2,5 n-gram:					0.5361630829377269
3,5 n-gram:					0.5370008377548171
4,5 n-gram:					0.5356045797263335
5,5 n-gram:					0.5367215861491204
1,6 n-gram:					0.5305780508237923
2,6 n-gram:					0.5370008377548171
3,6 n-gram:					0.5364423345434236
4,6 n-gram:					0.5372800893605139
5,6 n-gram:					0.5364423345434236
6,6 n-gram:					0.5370008377548171
--------------------------------------------
Best model:
3,4 n-gram					0.541189611840268
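
For context, it helps to compare these accuracies with a trivial baseline that always predicts the more frequent label. The sketch below computes that baseline from the test-set class supports shown in the classification reports above (1662 and 1919); it is an interpretation aid, not part of the original evaluation.

```python
# Test-set class supports from the classification reports above.
support = {0: 1662, 1: 1919}

majority_label = max(support, key=support.get)
baseline_accuracy = support[majority_label] / sum(support.values())

best_accuracy = 0.541189611840268  # the 3,4 n-gram model
print(majority_label, round(baseline_accuracy, 4))
print(round(best_accuracy - baseline_accuracy, 4))
```

Under this comparison the best model is only about half a percentage point above always predicting label 1, so the ranking of the n-gram variants should be treated as tentative.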


### Optimizing the ROWS macro

In this section, I run an automated training and prediction sequence, from bag of words up to the 6,6-gram model, for different ROWS values (the number of daily news items concatenated into one string), and determine which value gives the most accurate result.

Specifying the parameter values to test.

In [66]:
# Number of merged news into one string: 1...12, 25 
rows_values = []
for value in range(1,13):
    rows_values.append(value)

rows_values.append(25)

rows_values

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 25]

Collecting the model types for the automated training.

In [67]:
model_type_values = []
for value in range(1,7):
    model_type_values.append(value)

model_type_values

[1, 2, 3, 4, 5, 6]
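
Each model type is the maximum n-gram length, and the training loop in this section pairs it with every minimum length from 1 up to that maximum. The sketch below lists the resulting `ngram_range` pairs in the order the nested loops visit them; `ngram_ranges` is an illustrative name.

```python
model_type_values = [1, 2, 3, 4, 5, 6]

# Every (min_n, max_n) pair with min_n <= max_n, in nested-loop order.
ngram_ranges = [
    (min_n, max_n)
    for max_n in model_type_values
    for min_n in range(1, max_n + 1)
]

print(len(ngram_ranges))
print(ngram_ranges[:4])
```

This gives 21 configurations per ROWS value, where (1, 1) corresponds to the plain bag-of-words model.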

I create the following lists to store the results for each parameter value.

In [68]:
rows_summary_value = []
rows_summary_accuraccy = []

Automated training and saving of the results.

In [69]:
def preprocess():
    df_combined = pd.read_csv('drive/MyDrive/Kaggle dataset/Reddit Top 25 DJIA/KAG_REDDIT_WRLD_DJIA_DF_corrected.csv', 
                            index_col = "Date")

    # Find the cells with NaN, then the rows that contain them
    is_NaN = df_combined.isnull()
    row_has_NaN = is_NaN.any(axis = 1)
    rows_with_NaN = df_combined[row_has_NaN]

    # Replace them
    df_combined = df_combined.replace(np.nan, " ")

    # Check the process
    is_NaN = df_combined.isnull()
    row_has_NaN = is_NaN.any(axis = 1)
    rows_with_NaN = df_combined[row_has_NaN]

    assert len(rows_with_NaN) == 0

    # Get column names
    combined_column_names = []
    for column in df_combined.columns:
      combined_column_names.append(column)

    # 2D array creation for the news based on macros
    COLUMNS = len(df_combined)
    news_sum = []
    news_sum = [[0 for i in range(COLUMNS)] for j in range(int((len(combined_column_names) - 1) / ROWS))]  

    # Merge the news
    for row in range(len(df_combined)):
      for column in range(int((len(combined_column_names) - 1) / ROWS)):
        temp = ""
        news = ""
        for word in range(ROWS):
          news = df_combined[combined_column_names[(column * ROWS) + (word + 1)]][row]
          # Remove the b character at the beginning of the string
          if news[0] == "b":
            news = " " + news[1:]
          temp = temp + news
        news_sum[column][row] = temp

    # Drop the old columns
    for column in range(len(combined_column_names) - 1):
      df_combined.drop(combined_column_names[column + 1], axis = 1, inplace = True)

    # Create the new columns with the merged news
    for column in range(int((len(combined_column_names) - 1) / ROWS)):
      colum_name = "News_" + str(column + 1)
      df_combined[colum_name] = news_sum[column]          

    # The label column 
    LABEL_COLUMN = 0

    news_sum = []
    label_sum = []

    # Get the column names
    combined_column_names = []
    for column in df_combined.columns:
      combined_column_names.append(column)

    # Connect the merged news with the labels
    for column in range(len(df_combined)):
      for row in range(len(combined_column_names) - 1):
        news_sum.append(df_combined[combined_column_names[row + 1]][column])
        label_sum.append(df_combined[combined_column_names[LABEL_COLUMN]][column])

    # Create the new DataFrame
    df_sum_news_labels = pd.DataFrame(data = label_sum, index = None, columns = ["Label"])
    df_sum_news_labels["News"] = news_sum

    # Removing punctuations
    temp_news = []
    for line in news_sum:
      temp_attach = ""
      for word in line:
        temp = " "
        if word not in string.punctuation:
          temp = word
        temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    news_sum = temp_news
    temp_news = []

    # Remove numbers
    for line in news_sum:
      temp_attach = ""
      for word in line:
        temp = " "
        if not word.isdigit():
          temp = word
        temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    # Remove space
    for line in range(len(temp_news)):    
      temp_news[line] = " ".join(temp_news[line].split())

    # Converting headlines to lower case
    for line in range(len(temp_news)): 
        temp_news[line] = temp_news[line].lower()

    # Update the data frame
    df_sum_news_labels["News"] = temp_news

    # Load the stop words
    stop_words = set(stopwords.words('english'))

    filtered_sentence = []
    news_sum = df_sum_news_labels["News"]

    # Remove stop words
    for line in news_sum:
      word_tokens = word_tokenize(line)
      temp_attach = ""
      for word in word_tokens:
        temp = " "
        if not word in stop_words:
          temp = temp + word
        temp_attach = temp_attach + "".join(temp)
      filtered_sentence.append(temp_attach)

    # Remove space
    for line in range(len(filtered_sentence)):    
      filtered_sentence[line] = " ".join(filtered_sentence[line].split())

    # Update the data frame
    df_sum_news_labels["News"] = filtered_sentence

    news_sum = df_sum_news_labels["News"]
    null_indexes = []
    index = 0

    for line in news_sum:
      if line == "":
        null_indexes.append(index)
      index = index + 1

    for row in null_indexes:
      df_sum_news_labels = df_sum_news_labels.drop(row)

    news_sum = df_sum_news_labels["News"]
    null_indexes = []
    index = 0

    for line in news_sum:
      if line == "":
        null_indexes.append(index)
      index = index + 1

    assert len(null_indexes) == 0

    # Do the shuffle
    for i in range(SHUFFLE_CYCLE):
      df_sum_news_labels = shuffle(df_sum_news_labels, random_state = RANDOM_SEED)

    # Reset the index
    df_sum_news_labels.reset_index(inplace=True, drop=True)

    return df_sum_news_labels

In [70]:
def split_to_train():
    INPUT_SIZE = len(df_sum_news_labels)
    TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 

    # Split the dataset
    train = df_sum_news_labels[:TRAIN_SIZE] 

    return train

In [71]:
def split_to_test():
    INPUT_SIZE = len(df_sum_news_labels)
    TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 

    # Split the dataset
    test = df_sum_news_labels[TRAIN_SIZE:]

    return test

In [72]:
for ROWS in rows_values:
  
    print("--------------------------------------------\n\nStart of the ROWS = " 
      + str(ROWS) + " sequence\n\n--------------------------------------------\n")
    
    model_type = []
    result = []

    df_sum_news_labels = preprocess()
    train = split_to_train()
    test = split_to_test()

    # check
    split_sum = len(train) + len(test)
    total_rows = len(df_sum_news_labels)  # avoid shadowing the built-in sum()
    assert split_sum == total_rows

    train_headlines = []
    test_headlines = []

    for row in range(0, len(train.index)):
        train_headlines.append(train.iloc[row, 1])

    for row in range(0,len(test.index)):
        test_headlines.append(test.iloc[row, 1])

    # show the first
    print(train_headlines[0])

    for MODEL_TYPE in model_type_values:

        for n in range(1,MODEL_TYPE+1):
            print("--------------------------------------------\n\nStart of the " 
                  + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

            _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
            _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

            print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

            _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
            _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

            _gram_test_ = _gram_vectorizer_.transform(test_headlines)
            _gram_predictions_ = _gram_model_.predict(_gram_test_)

            print (accuracy_score(test["Label"], _gram_predictions_))

            model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
            result.append(accuracy_score(test["Label"], _gram_predictions_))

    rows_summary_value.append(ROWS)

    # save the best
    best_model_rows = 0

    for model in range(len(model_type)):
        if result[model] > best_model_rows:
            best_model_rows = result[model]

    rows_summary_accuraccy.append(best_model_rows)

--------------------------------------------

Start of the ROWS = 1 sequence

--------------------------------------------

talk like stupid sharia law uk
--------------------------------------------

Start of the 1,1 gram model

The shape is: (42259, 30936)

0.5178331992491284
--------------------------------------------

Start of the 1,2 gram model

The shape is: (42259, 371855)

0.5197103781174578
--------------------------------------------

Start of the 2,2 gram model

The shape is: (42259, 340919)

0.5262805041566103
--------------------------------------------

Start of the 1,3 gram model

The shape is: (42259, 765293)

0.5225261464199518
--------------------------------------------

Start of the 2,3 gram model

The shape is: (42259, 734357)

0.532046124966479
--------------------------------------------

Start of the 3,3 gram model

The shape is: (42259, 393438)

0.5442477876106194
--------------------------------------------

Start of the 1,4 gram model

The shape is: (42259, 
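The nested loop above tries every `(n, MODEL_TYPE)` pair as `ngram_range`, and the feature count grows quickly with the range. The printed shapes are consistent with the vocabularies simply adding up: 371855 features for the 1,2 model equals 30936 unigrams plus 340919 bigrams. A minimal sketch on a single toy headline (the headline and the `feature_counts` name are illustrative, not part of the notebook) shows the same effect at a scale that can be checked by hand:

```python
from sklearn.feature_extraction.text import CountVectorizer

# A toy headline with five tokens, used only so the counts are easy to verify.
headline = ["north korea starts landing exercises"]

feature_counts = {}
for max_n in (1, 2, 3):
    for min_n in range(1, max_n + 1):
        # ngram_range=(min_n, max_n) keeps every n-gram with min_n <= n <= max_n
        vectorizer = CountVectorizer(ngram_range=(min_n, max_n))
        vectorizer.fit(headline)
        feature_counts[(min_n, max_n)] = len(vectorizer.vocabulary_)

for ngram_range, count in feature_counts.items():
    print(ngram_range, count)
```

With five tokens there are 5 unigrams, 4 bigrams and 3 trigrams, so the `(1, 3)` range yields 12 features while `(3, 3)` yields only 3.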

Evaluation.

In [73]:
best_model_rows = 0

for model in range(len(rows_summary_value)):
    print(str(rows_summary_value[model]) + ":\t\t\t\t\t" 
          + str(rows_summary_accuraccy[model]))

    if rows_summary_accuraccy[model] > best_model_rows:
        best_model_rows = rows_summary_accuraccy[model]
        best_model_rows_index = model

print("--------------------------------------------\nBest row value:\n" 
      + str(rows_summary_value[best_model_rows_index]) + "\t\t\t\t\t" + 
      str(rows_summary_accuraccy[best_model_rows_index]))

1:					0.5442477876106194
2:					0.541189611840268
3:					0.5525764558022622
4:					0.5404801786711334
5:					0.5388739946380697
6:					0.5418760469011725
7:					0.5881696428571429
8:					0.5892857142857143
9:					0.5611390284757118
10:					0.559463986599665
11:					0.5443886097152428
12:					0.5477386934673367
25:					0.568561872909699
--------------------------------------------
Best row value:
8					0.5892857142857143
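The selection above can also be written without an explicit loop. The following is only a sketch: the two lists stand in for `rows_summary_value` and `rows_summary_accuraccy`, filled with the values printed above, and the names `rows_values`, `best_accuracies` and `best_index` are hypothetical.

```python
# Values copied from the summary printed above; the variable names are illustrative.
rows_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 25]
best_accuracies = [
    0.5442477876106194, 0.541189611840268, 0.5525764558022622,
    0.5404801786711334, 0.5388739946380697, 0.5418760469011725,
    0.5881696428571429, 0.5892857142857143, 0.5611390284757118,
    0.559463986599665, 0.5443886097152428, 0.5477386934673367,
    0.568561872909699,
]

# max() returns the first index with the highest accuracy,
# which matches the strict ">" comparison used in the loop.
best_index = max(range(len(best_accuracies)), key=best_accuracies.__getitem__)

print("Best row value:", rows_values[best_index], best_accuracies[best_index])
```

Since the ROWS value is picked by comparing accuracies on the test split, the best score is probably somewhat optimistic; a separate validation split would give a cleaner estimate.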


Displaying the results for the best ROWS value.

In [77]:
ROWS = int(rows_summary_value[best_model_rows_index])

model_type = []
result = []

df_sum_news_labels = preprocess()
train = split_to_train()
test = split_to_test()

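# sanity check: the train and test splits together must account for every row
split_sum = len(train) + len(test)
total_rows = len(df_sum_news_labels)  # avoid shadowing the built-in sum()
assert split_sum == total_rows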

# column 1 holds the preprocessed headline text
train_headlines = train.iloc[:, 1].tolist()
test_headlines = test.iloc[:, 1].tolist()

# show the first training headline
print(train_headlines[0])

for MODEL_TYPE in model_type_values:

    for n in range(1,MODEL_TYPE+1):
        print("--------------------------------------------\n\nStart of the " 
              + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

        _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
        _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

        print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

        _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
        _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

        _gram_test_ = _gram_vectorizer_.transform(test_headlines)
        _gram_predictions_ = _gram_model_.predict(_gram_test_)

        # compute the test accuracy once and reuse it
        _gram_accuracy_ = accuracy_score(test["Label"], _gram_predictions_)
        print(_gram_accuracy_)

        model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
        result.append(_gram_accuracy_)

best_model_gram = 0

for model in range(len(model_type)):
    print(str(model_type[model]) + ":\t\t\t\t\t" + str(result[model]))

    if result[model] > best_model_gram:
        best_model_gram = result[model]
        best_model_gram_index = model

print("--------------------------------------------\nBest model:\n" 
      + str(model_type[best_model_gram_index]) + "\t\t\t\t\t" + 
      str(result[best_model_gram_index]))

air france aircraft carrying people disappeared radar atlantic ocean brazil tell difference israeli palestinian former german mp judge offers reward prosecution bush cheney rumsfeld blair north korea starts landing exercises using amphibious vessels may planning attack south korean island president el salvador sends son france escape violence native el salvador son stabbed death awl parisian bridge random act violence apparent motive indonesian model married malaysian prince says kidnapped drugged sexually abused royal family escapes help singaporean police attack liberty untold story israel deadly assault u spy ship book review el salvador first leftist president takes power hillary clinton attended inauguration
--------------------------------------------

Start of the 1,1 gram model

The shape is: (5071, 44527)

0.5357142857142857
--------------------------------------------

Start of the 1,2 gram model

The shape is: (5071, 407323)

0.5379464285714286
------------------------------

Displaying the coefficients of the best model (the logistic regression weight of each n-gram).

In [79]:
ROWS = int(rows_summary_value[best_model_rows_index])
MODEL_TYPE = str(model_type[best_model_gram_index])

df_sum_news_labels = preprocess()
train = split_to_train()
test = split_to_test()

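# sanity check: the train and test splits together must account for every row
split_sum = len(train) + len(test)
total_rows = len(df_sum_news_labels)  # avoid shadowing the built-in sum()
assert split_sum == total_rows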

# column 1 holds the preprocessed headline text
train_headlines = train.iloc[:, 1].tolist()
test_headlines = test.iloc[:, 1].tolist()

# show the first training headline
print(train_headlines[0])

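# parse the "n,m n-gram" label back into (n, m); split() also handles multi-digit values
_min_n_, _max_n_ = (int(part) for part in MODEL_TYPE.split(" ")[0].split(","))
_gram_vectorizer_ = CountVectorizer(ngram_range=(_min_n_, _max_n_))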
_train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

_gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
_gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

_gram_test_ = _gram_vectorizer_.transform(test_headlines)
_gram_predictions_ = _gram_model_.predict(_gram_test_)

_gram_accuracy_ = accuracy_score(test["Label"], _gram_predictions_)
print(_gram_accuracy_)

# MODEL_TYPE already holds the full "n,m n-gram" label, so append it unchanged
model_type.append(MODEL_TYPE)
result.append(_gram_accuracy_)

air france aircraft carrying people disappeared radar atlantic ocean brazil tell difference israeli palestinian former german mp judge offers reward prosecution bush cheney rumsfeld blair north korea starts landing exercises using amphibious vessels may planning attack south korean island president el salvador sends son france escape violence native el salvador son stabbed death awl parisian bridge random act violence apparent motive indonesian model married malaysian prince says kidnapped drugged sexually abused royal family escapes help singaporean police attack liberty untold story israel deadly assault u spy ship book review el salvador first leftist president takes power hillary clinton attended inauguration
The shape is: (5071, 440447)

0.5892857142857143


In [81]:
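# get_feature_names() was removed in scikit-learn 1.2; on versions before 1.0 keep the old name
_gram_words_best_ = _gram_vectorizer_.get_feature_names_out()
_gram_coeffs_best_ = _gram_model_.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : _gram_words_best_, 
                        'Coefficient' : _gram_coeffs_best_})

# largest coefficients first; ties are ordered alphabetically by n-gram
coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[False, True])

coeffdf.head(10)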

| | Word | Coefficient |
|---|---|---|
| 331615 | russia india china south africa | 0.118901 |
| 260355 | north korea preparing restart nuclear | 0.103224 |
| 260434 | north korea shoot rocket shoot | 0.102229 |
| 49444 | british prime minister david cameron | 0.101564 |
| 14235 | american country legalize gay marriage | 0.099965 |
| 99455 | define israel nation state jewish | 0.098194 |
| 196050 | israel nation state jewish people | 0.098194 |
| 307426 | pushes define israel nation state | 0.098194 |
| 44895 | bombing doctors without borders hospital | 0.094717 |
| 239700 | military help georgia declaration war | 0.091112 |


In [82]:
coeffdf.tail(10)

| | Word | Coefficient |
|---|---|---|
| 391199 | times journalist marie colvin killed | -0.104737 |
| 150909 | fury die american attack soil | -0.107945 |
| 275098 | pakistan reacts fury die american | -0.107945 |
| 312765 | reacts fury die american attack | -0.107945 |
| 42928 | blocks internet access new york | -0.11229 |
| 65574 | china blocks internet access new | -0.11229 |
| 189076 | internet access new york times | -0.11229 |
| 184722 | indian space research organisation isro | -0.112298 |
| 154710 | german chancellor angela merkel said | -0.1168 |
| 258056 | news world phone hacking scandal | -0.136456 |


## ECO_BSN_DF, ECO_FNC_DF, ECO_US_DF 2008-2016

First I examine these merged datasets over the same interval as the Reddit world news, then I merge and combine the two to check whether this improves the accuracy.
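Restricting one dataset to the interval of another and then combining them can be sketched with pandas as follows. This is only an assumption about how the step might look: the frames `reddit_df` and `eco_df`, the column names `Date` and `News`, and all the dates and texts are hypothetical placeholders, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-ins for the Reddit world news and the Economist headlines.
reddit_df = pd.DataFrame({
    "Date": pd.to_datetime(["2008-08-08", "2012-01-03", "2016-07-01"]),
    "News": ["reddit headline a", "reddit headline b", "reddit headline c"],
})
eco_df = pd.DataFrame({
    "Date": pd.to_datetime(["2007-05-01", "2010-03-15", "2016-06-30", "2017-02-01"]),
    "News": ["economist a", "economist b", "economist c", "economist d"],
})

# Keep only the Economist rows inside the interval covered by the Reddit data.
start, end = reddit_df["Date"].min(), reddit_df["Date"].max()
eco_in_range = eco_df[eco_df["Date"].between(start, end)]

# Stack both sources and order them chronologically.
combined = (
    pd.concat([reddit_df, eco_in_range], ignore_index=True)
    .sort_values("Date")
    .reset_index(drop=True)
)

print(combined)
```

`Series.between` is inclusive on both ends by default, so headlines dated exactly on the first or last Reddit day are kept.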

I collected these datasets myself from the following pages:


*   https://www.economist.com/business/ 
*   https://www.economist.com/finance-and-economics/ 
*   https://www.economist.com/united-states/ 