# **Stock market news feed semantic analysis** *(Baseline LogReg)*

In this notebook I examine the datasets I have mined and collected so far with the traditional bag-of-words and logistic regression approach. After that I also try n-gram models. The sources and references I used for the results:


*   https://colab.research.google.com/drive/1QPrBkh-KwX6qcUtiNWKp9rJoneBfGEVh#scrollTo=bQUJwMjYYN4- *(own work, reworked)*
*   https://colab.research.google.com/drive/1MdpXGCj2fb3g1BI_XfF54OWLkYQCZBBy#scrollTo=LndWT2Kn-UMK *(own baseline work)*
*   https://www.kaggle.com/ndrewgele/omg-nlp-with-the-djia-and-reddit#Basic-Model-Training-and-Testing
*   https://www.kaggle.com/lseiyjg/use-news-to-predict-stock-markets





I create a separate chapter for each dataset used, and in each one I state its source and, if I mined it myself, how it was obtained.

## **Project setup**

Mounting Google Drive so that the required files can be loaded later. I load each file right before it is used.

In [283]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading the libraries required for the project.

In [284]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
import pandas_datareader as web
from numpy.random import MT19937
from numpy.random import RandomState, SeedSequence
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt')
from nltk.tokenize import word_tokenize  
from sklearn.utils import shuffle
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score 
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Defining the macros used in the project.

In [285]:
# Shuffle cycle number for the dataframe
SHUFFLE_CYCLE = 500

For reproducibility I define a seed for the random number generator, which I use from here on.

In [286]:
# Random seed
RANDOM_SEED = 1234

# Numpy random seed
NP_SEED = 1234

# Max iteration for training
MAX_ITER = 100000

# Train size
TRAIN_SPLIT = 0.85

# Test size
TEST_SPLIT = 0.15

In [287]:
rs = RandomState(MT19937(SeedSequence(NP_SEED)))
np.random.seed(NP_SEED)
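
A minimal sketch of what the seeding buys, assuming the same `MT19937`/`SeedSequence` construction as in the cell above. The helper `make_rng` is a hypothetical name introduced only for this illustration; it shows that two generators built from the same seed produce the same permutation.

```python
import numpy as np
from numpy.random import MT19937, RandomState, SeedSequence

NP_SEED = 1234  # same value as the notebook's macro

def make_rng(seed):
    """Build a RandomState on top of a seeded MT19937 bit generator."""
    return RandomState(MT19937(SeedSequence(seed)))

# Two independent generators created from the same seed
first = make_rng(NP_SEED).permutation(10)
second = make_rng(NP_SEED).permutation(10)

# Same seed -> identical permutation
same_order = bool(np.array_equal(first, second))
print(first, second, same_order)
```

Note that a generator object carries state: reusing one `rs` across cells gives reproducible results only if the cells are run in the same order.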

## **KAG_REDDIT_WRLD_DJIA_DF**

This dataset contains the top 25 headlines from the Reddit World News category for the period 2008-08-08 to 2016-07-01. I did not collect this dataset myself; its source is:
Sun, J. (2016, August). Daily News for Stock Market Prediction, Version 1. Retrieved 2021.02.19. from https://www.kaggle.com/aaron7sun/stocknews

Loading the dataset from my mounted Drive.

In [336]:
# Copy the dataset to the local environment
!cp "/content/drive/MyDrive/Combined_News_DJIA.csv" "Combined_News_DJIA.csv"

# Check that the copy succeeded -> no assertion error means the file is there
read = !ls
assert any("Combined_News_DJIA.csv" in name for name in read)

I use the following two lists to store and index the results.

In [337]:
model_type = ["Bag of words", "1,2 n-gram", "2,2 n-gram", 
              "1,3 n-gram", "2,3 n-gram", "3,3 n-gram"]

result = []              
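
To make the model names concrete, here is a small pure-Python sketch of what an "n_min,n_max n-gram" feature set contains. It assumes the labels in `model_type` denote (minimum, maximum) n-gram lengths, in the same sense as the `ngram_range` parameter of `CountVectorizer`; the function `ngrams` is a hypothetical helper, not part of the notebook.

```python
def ngrams(tokens, n_min, n_max):
    """Return every contiguous n-gram with n_min <= n <= n_max, shortest first."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

tokens = "russia today columns".split()

unigrams_and_bigrams = ngrams(tokens, 1, 2)  # corresponds to "1,2 n-gram"
bigrams_only = ngrams(tokens, 2, 2)          # corresponds to "2,2 n-gram"

print(unigrams_and_bigrams)
print(bigrams_only)
```

Bag of words is the special case where both bounds are 1.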

Macro definition.

In [338]:
# Number of merged news into one string
ROWS = 2

### Preparing the text

Loading the dataset.

In [339]:
# Load the dataset 
df_combined = pd.read_csv('Combined_News_DJIA.csv', index_col = "Date")

# Show the dataframe
df_combined.head()

Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,Top11,Top12,Top13,Top14,Top15,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",b'Georgian troops retreat from S. Osettain cap...,b'Did the U.S. Prep Georgia for War with Russia?',b'Rice Gives Green Light for Israel to Attack ...,b'Announcing:Class Action Lawsuit on Behalf of...,"b""So---Russia and Georgia are at war and the N...","b""China tells Bush to stay out of other countr...",b'Did World War III start today?',b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,b'Welcome To World War IV! Now In High Definit...,"b""Georgia's move, a mistake of monumental prop...",b'Russia presses deeper into Georgia; U.S. say...,b'Abhinav Bindra wins first ever Individual Ol...,b' U.S. ship heads for Arctic to define territ...,b'Drivers in a Jerusalem taxi station threaten...,b'The French Team is Stunned by Phelps and the...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...","b""The US military was surprised by the timing ...",b'U.S. Beats War Drum as Iran Dumps the Dollar',"b'Gorbachev: ""Georgian military attacked the S...",b'CNN use footage of Tskhinvali ruins to cover...,b'Beginning a war as the Olympics were opening...,b'55 pyramids as large as the Luxor stacked in...,b'The 11 Top Party Cities in the World',b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',"b""The commander of a Navy air reconnaissance s...","b""92% of CNN readers: Russia's actions in Geor...",b'USA to send fleet into Black Sea to help Geo...,"b""US warns against Israeli plan to strike agai...","b""In an intriguing cyberalliance, two Estonian...",b'The CNN Effect: Georgia Schools Russia in In...,b'Why Russias response to Georgia was right',b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",b'Russia exaggerating South Ossetian death tol...,b' Musharraf expected to resign rather than fa...,b'Moscow Made Plans Months Ago to Invade Georgia',b'Why Russias response to Georgia was right',b'Nigeria has handed over the potentially oil-...,b'The US and Poland have agreed a preliminary ...,b'Russia apparently is sabotaging infrastructu...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


Out of curiosity, I next check whether the dataset's labels are correct. According to the source, the label is 1 if the value rose or stayed the same on that day, and 0 if it fell (the day's Adj Close compared with the previous day's).

In [340]:
# Load the stock data
df_stock = web.DataReader("DJIA", data_source="yahoo", start="2008-08-08", 
                          end="2016-07-01")
 
# Show the stock data
df_stock.head()

Date,High,Low,Open,Close,Volume,Adj Close
2008-08-08,11808.490234,11344.230469,11432.089844,11734.320312,4966810000,11734.320312
2008-08-11,11933.549805,11580.19043,11729.669922,11782.349609,5067310000,11782.349609
2008-08-12,11830.389648,11541.429688,11781.700195,11642.469727,4711290000,11642.469727
2008-08-13,11689.049805,11377.370117,11632.80957,11532.959961,4787600000,11532.959961
2008-08-14,11744.330078,11399.839844,11532.070312,11615.929688,4064000000,11615.929688


I bring the date formats into a common form for the comparison.

In [341]:
temp_day = []

for day in range(len(df_stock)):
    temp_day.append(df_stock.index[day].date())

df_stock.index = temp_day

# Show the stock data
df_stock.head()

,High,Low,Open,Close,Volume,Adj Close
2008-08-08,11808.490234,11344.230469,11432.089844,11734.320312,4966810000,11734.320312
2008-08-11,11933.549805,11580.19043,11729.669922,11782.349609,5067310000,11782.349609
2008-08-12,11830.389648,11541.429688,11781.700195,11642.469727,4711290000,11642.469727
2008-08-13,11689.049805,11377.370117,11632.80957,11532.959961,4787600000,11532.959961
2008-08-14,11744.330078,11399.839844,11532.070312,11615.929688,4064000000,11615.929688


First I check whether the dates match.

In [342]:
difference = []

if len(df_combined) == len(df_stock):
    print("The lengths are the same!")

# Compare only the overlapping range so a length mismatch cannot raise an IndexError
for day in range(min(len(df_combined), len(df_stock))):
    if str(df_combined.index[day]) != str(df_stock.index[day]):
        print("There is difference at: " + str(day) + " index")
        print("News: " + str(df_combined.index[day]) + "\tStock: " + str(df_stock.index[day]))
        difference.append(day)

if len(difference) == 0:
    print("The dates matched!")

The lengths are the same!
The dates matched!
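
The same date check can be sketched with pandas index set operations, which also report which dates are missing on each side. The two indexes below are toy data invented for the illustration, not values from the datasets.

```python
import pandas as pd

# Toy date indexes (hypothetical values)
news_index = pd.to_datetime(pd.Index(["2008-08-08", "2008-08-11", "2008-08-12"]))
stock_index = pd.to_datetime(pd.Index(["2008-08-08", "2008-08-11", "2008-08-13"]))

# Dates present on one side only
only_in_news = news_index.difference(stock_index)
only_in_stock = stock_index.difference(news_index)

print("Only in news: ", list(only_in_news.strftime("%Y-%m-%d")))
print("Only in stock:", list(only_in_stock.strftime("%Y-%m-%d")))
```

Converting both sides with `pd.to_datetime` removes the need to compare string representations.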


Checking the labels.

In [343]:
difference = []

# Start at 1: day 0 has no previous trading day (index -1 would wrap around to the last row)
for day in range(1, len(df_stock)):
    # NOTE: int() drops the decimals, so closes within the same integer compare as equal
    today = int(df_stock["Adj Close"].iloc[day])
    yesterday = int(df_stock["Adj Close"].iloc[day - 1])
    label = df_combined["Label"].iloc[day]

    # Expected label: 1 -> rise or unchanged, 0 -> fall
    expected = 1 if today >= yesterday else 0

    if label != expected:
        difference.append(str(df_stock.index[day]))
        print("Problem at day " + str(df_stock.index[day]))
        print("Today: " + str(df_stock["Adj Close"].iloc[day]) + "\t\tYesterday: " + str(df_stock["Adj Close"].iloc[day - 1]) + "\t\tLabel: " + str(label) + "\n")

print("All differences: " + str(len(difference)))

Problem at day 2010-10-14
Today: 11096.919921875		Yesterday: 11096.080078125		Label: 0

Problem at day 2012-11-12
Today: 12815.080078125		Yesterday: 12815.3896484375		Label: 0

Problem at day 2012-11-15
Today: 12570.9501953125		Yesterday: 12570.9501953125		Label: 0

Problem at day 2013-04-12
Today: 14865.0595703125		Yesterday: 14865.1396484375		Label: 0

Problem at day 2014-04-24
Today: 16501.650390625		Yesterday: 16501.650390625		Label: 0

Problem at day 2015-08-12
Today: 17402.509765625		Yesterday: 17402.83984375		Label: 0

Problem at day 2015-11-27
Today: 17813.390625		Yesterday: 17813.390625		Label: 0

All differences: 7
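
For comparison, a sketch of the same check on the untruncated float values, using `Series.diff`. It assumes the labelling rule quoted from the source (1 for a rise or no change, 0 for a fall); the prices and labels are toy data chosen so that one day falls by less than one point.

```python
import pandas as pd

# Toy data (hypothetical): d1 falls by 0.80, d2 is unchanged, d3 rises
adj_close = pd.Series([100.90, 100.10, 100.10, 101.50], index=["d0", "d1", "d2", "d3"])
labels = pd.Series([0, 0, 0, 1], index=adj_close.index)

# Day-over-day change on the float values; the first day has no predecessor (NaN)
change = adj_close.diff()
expected = (change >= 0).astype(int)

# Flag only days that have a previous day and whose label disagrees with the rule
mismatch = change.notna() & (expected != labels)
mismatched_days = mismatch[mismatch].index.tolist()

print(mismatched_days)
```

Here only the unchanged day d2 is flagged. An integer comparison would also flag d1, because 100.10 and 100.90 both truncate to 100.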


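As can be seen, the label is wrong in a few places. After a bit of digging I found that the price query behind the original labels was itself faulty for a few days, so I correct these and then save the corrected version to my Drive. Note that the check above compares prices truncated with `int()`: on 2012-11-12, 2013-04-12 and 2015-08-12 the printed close is lower than the previous day's, so those three days are flagged only because of the truncation.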

In [344]:
# correct the wrong labels
for row in difference:
    if df_combined.loc[row, "Label"] == 0:
        df_combined.loc[row, "Label"] = 1
    else:
        df_combined.loc[row, "Label"] = 0

# check them
for row in difference:
    print(str(row) + "\t\t" + str(df_combined.loc[row, "Label"]))

2010-10-14		1
2012-11-12		1
2012-11-15		1
2013-04-12		1
2014-04-24		1
2015-08-12		1
2015-11-27		1


In [345]:
# save to drive
df_combined.to_csv('drive/MyDrive/Kaggle dataset/Reddit Top 25 DJIA/KAG_REDDIT_WRLD_DJIA_DF_corrected.csv')

# Show the dataset
df_combined.head()

Date,Label,Top1,Top2,Top3,Top4,Top5,Top6,Top7,Top8,Top9,Top10,Top11,Top12,Top13,Top14,Top15,Top16,Top17,Top18,Top19,Top20,Top21,Top22,Top23,Top24,Top25
2008-08-08,0,"b""Georgia 'downs two Russian warplanes' as cou...",b'BREAKING: Musharraf to be impeached.',b'Russia Today: Columns of troops roll into So...,b'Russian tanks are moving towards the capital...,"b""Afghan children raped with 'impunity,' U.N. ...",b'150 Russian tanks have entered South Ossetia...,"b""Breaking: Georgia invades South Ossetia, Rus...","b""The 'enemy combatent' trials are nothing but...",b'Georgian troops retreat from S. Osettain cap...,b'Did the U.S. Prep Georgia for War with Russia?',b'Rice Gives Green Light for Israel to Attack ...,b'Announcing:Class Action Lawsuit on Behalf of...,"b""So---Russia and Georgia are at war and the N...","b""China tells Bush to stay out of other countr...",b'Did World War III start today?',b'Georgia Invades South Ossetia - if Russia ge...,b'Al-Qaeda Faces Islamist Backlash',"b'Condoleezza Rice: ""The US would not act to p...",b'This is a busy day: The European Union has ...,"b""Georgia will withdraw 1,000 soldiers from Ir...",b'Why the Pentagon Thinks Attacking Iran is a ...,b'Caucasus in crisis: Georgia invades South Os...,b'Indian shoe manufactory - And again in a se...,b'Visitors Suffering from Mental Illnesses Ban...,"b""No Help for Mexico's Kidnapping Surge"""
2008-08-11,1,b'Why wont America and Nato help us? If they w...,b'Bush puts foot down on Georgian conflict',"b""Jewish Georgian minister: Thanks to Israeli ...",b'Georgian army flees in disarray as Russians ...,"b""Olympic opening ceremony fireworks 'faked'""",b'What were the Mossad with fraudulent New Zea...,b'Russia angered by Israeli military sale to G...,b'An American citizen living in S.Ossetia blam...,b'Welcome To World War IV! Now In High Definit...,"b""Georgia's move, a mistake of monumental prop...",b'Russia presses deeper into Georgia; U.S. say...,b'Abhinav Bindra wins first ever Individual Ol...,b' U.S. ship heads for Arctic to define territ...,b'Drivers in a Jerusalem taxi station threaten...,b'The French Team is Stunned by Phelps and the...,b'Israel and the US behind the Georgian aggres...,"b'""Do not believe TV, neither Russian nor Geor...",b'Riots are still going on in Montreal (Canada...,b'China to overtake US as largest manufacturer',b'War in South Ossetia [PICS]',b'Israeli Physicians Group Condemns State Tort...,b' Russia has just beaten the United States ov...,b'Perhaps *the* question about the Georgia - R...,b'Russia is so much better at war',"b""So this is what it's come to: trading sex fo..."
2008-08-12,0,b'Remember that adorable 9-year-old who sang a...,"b""Russia 'ends Georgia operation'""","b'""If we had no sexual harassment we would hav...","b""Al-Qa'eda is losing support in Iraq because ...",b'Ceasefire in Georgia: Putin Outmaneuvers the...,b'Why Microsoft and Intel tried to kill the XO...,b'Stratfor: The Russo-Georgian War and the Bal...,"b""I'm Trying to Get a Sense of This Whole Geor...","b""The US military was surprised by the timing ...",b'U.S. Beats War Drum as Iran Dumps the Dollar',"b'Gorbachev: ""Georgian military attacked the S...",b'CNN use footage of Tskhinvali ruins to cover...,b'Beginning a war as the Olympics were opening...,b'55 pyramids as large as the Luxor stacked in...,b'The 11 Top Party Cities in the World',b'U.S. troops still in Georgia (did you know t...,b'Why Russias response to Georgia was right',"b'Gorbachev accuses U.S. of making a ""serious ...","b'Russia, Georgia, and NATO: Cold War Two'",b'Remember that adorable 62-year-old who led y...,b'War in Georgia: The Israeli connection',b'All signs point to the US encouraging Georgi...,b'Christopher King argues that the US and NATO...,b'America: The New Mexico?',"b""BBC NEWS | Asia-Pacific | Extinction 'by man..."
2008-08-13,0,b' U.S. refuses Israel weapons to attack Iran:...,"b""When the president ordered to attack Tskhinv...",b' Israel clears troops who killed Reuters cam...,b'Britain\'s policy of being tough on drugs is...,b'Body of 14 year old found in trunk; Latest (...,b'China has moved 10 *million* quake survivors...,"b""Bush announces Operation Get All Up In Russi...",b'Russian forces sink Georgian ships ',"b""The commander of a Navy air reconnaissance s...","b""92% of CNN readers: Russia's actions in Geor...",b'USA to send fleet into Black Sea to help Geo...,"b""US warns against Israeli plan to strike agai...","b""In an intriguing cyberalliance, two Estonian...",b'The CNN Effect: Georgia Schools Russia in In...,b'Why Russias response to Georgia was right',b'Elephants extinct by 2020?',b'US humanitarian missions soon in Georgia - i...,"b""Georgia's DDOS came from US sources""","b'Russian convoy heads into Georgia, violating...",b'Israeli defence minister: US against strike ...,b'Gorbachev: We Had No Choice',b'Witness: Russian forces head towards Tbilisi...,b' Quarter of Russians blame U.S. for conflict...,b'Georgian president says US military will ta...,b'2006: Nobel laureate Aleksander Solzhenitsyn...
2008-08-14,1,b'All the experts admit that we should legalis...,b'War in South Osetia - 89 pictures made by a ...,b'Swedish wrestler Ara Abrahamian throws away ...,b'Russia exaggerated the death toll in South O...,b'Missile That Killed 9 Inside Pakistan May Ha...,"b""Rushdie Condemns Random House's Refusal to P...",b'Poland and US agree to missle defense deal. ...,"b'Will the Russians conquer Tblisi? Bet on it,...",b'Russia exaggerating South Ossetian death tol...,b' Musharraf expected to resign rather than fa...,b'Moscow Made Plans Months Ago to Invade Georgia',b'Why Russias response to Georgia was right',b'Nigeria has handed over the potentially oil-...,b'The US and Poland have agreed a preliminary ...,b'Russia apparently is sabotaging infrastructu...,b'Bank analyst forecast Georgian crisis 2 days...,"b""Georgia confict could set back Russia's US r...",b'War in the Caucasus is as much the product o...,"b'""Non-media"" photos of South Ossetia/Georgia ...",b'Georgian TV reporter shot by Russian sniper ...,b'Saudi Arabia: Mother moves to block child ma...,b'Taliban wages war on humanitarian aid workers',"b'Russia: World ""can forget about"" Georgia\'s...",b'Darfur rebels accuse Sudan of mounting major...,b'Philippines : Peace Advocate say Muslims nee...


Next I look for any days or cells without data and replace them with a blank (single-space) string. This is needed so that the later text processing runs without errors.

In [346]:
# Find the cells with NaN, then the rows that contain them
is_NaN = df_combined.isnull()
row_has_NaN = is_NaN.any(axis = 1)
rows_with_NaN = df_combined[row_has_NaN]

# Replace them
df_combined = df_combined.replace(np.nan, " ")

# Check the process
is_NaN = df_combined.isnull()
row_has_NaN = is_NaN.any(axis = 1)
rows_with_NaN = df_combined[row_has_NaN]

assert len(rows_with_NaN) == 0
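
A compact sketch of the same find-and-replace step with `DataFrame.fillna`, shown on a toy frame invented for the illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing headline in each row (hypothetical values)
toy = pd.DataFrame({
    "Label": [1, 0],
    "Top1": ["first headline", np.nan],
    "Top2": [np.nan, "second headline"],
})

# Rows that contain at least one missing cell
rows_with_nan = toy[toy.isnull().any(axis=1)]

# Replace every missing cell with a single space
filled = toy.fillna(" ")

print(len(rows_with_nan))
print(filled)
```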

After this I concatenate the headlines belonging to one day into shared strings. The number of headlines per string is defined by a macro:


*   ROWS - number of headlines concatenated into one string

This cell also contains my first preprocessing step: removing the b character at the start of the strings. With ROWS = 2 the 25 headline columns yield 12 packages, so Top25 is not used.

In [347]:
# Get column names
combined_column_names = []
for column in df_combined.columns:
  combined_column_names.append(column)

# 2D array creation for the news based on macros
COLUMNS = len(df_combined)
news_sum = [[0 for i in range(COLUMNS)] for j in range(int((len(combined_column_names) - 1) / ROWS))]  

# Show the column names
print("Column names of the dataset:") 
print(combined_column_names)

# Merge the news
for row in range(len(df_combined)):
  for column in range(int((len(combined_column_names) - 1) / ROWS)):
    temp = ""
    news = ""
    for word in range(ROWS):
      news = df_combined[combined_column_names[(column * ROWS) + (word + 1)]][row]
      # Remove the b character at the beginning of the string
      if news[0] == "b":
        news = " " + news[1:]
      temp = temp + news
    news_sum[column][row] = temp

# Show the first day second package of the news
print("\nThe first day second package of the news:")
print(news_sum[1][0])

Column names of the dataset:
['Label', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7', 'Top8', 'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15', 'Top16', 'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23', 'Top24', 'Top25']

The first day second package of the news:
 'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)' 'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire'
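
The merging step above can also be sketched on a toy frame. The helpers `strip_bytes_prefix` and `merge_headlines` are hypothetical names, and two details are assumptions that differ slightly from the cell above: the leading `b` is dropped only when a quote follows it, and the headlines are joined with a single space.

```python
import pandas as pd

ROWS = 2  # headlines per package, as in the notebook

def strip_bytes_prefix(text):
    """Drop a leading b that is left over from a bytes repr such as b'...'."""
    return text[1:] if text.startswith(("b'", 'b"')) else text

def merge_headlines(df, headline_columns, rows_per_package):
    """Join consecutive headline columns into News_1, News_2, ... packages."""
    n_packages = len(headline_columns) // rows_per_package  # leftover columns are dropped
    merged = pd.DataFrame(index=df.index)
    for package in range(n_packages):
        start = package * rows_per_package
        cols = headline_columns[start:start + rows_per_package]
        merged[f"News_{package + 1}"] = df[cols].apply(
            lambda row: " ".join(strip_bytes_prefix(text) for text in row), axis=1
        )
    return merged

# Toy data (hypothetical): three headlines, so the third one is left over
toy = pd.DataFrame(
    {"Top1": ["b'first headline'"], "Top2": ['b"second headline"'], "Top3": ["b'left over'"]},
    index=["2008-08-08"],
)
merged = merge_headlines(toy, ["Top1", "Top2", "Top3"], ROWS)
print(merged)
```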


After this I replace the previous columns (Top1, Top2...) with as many columns as the grouping requires, named News_1, News_2..., and fill them with the merged headline packages.

In [348]:
# Drop the old columns
for column in range(len(combined_column_names) - 1):
  df_combined.drop(combined_column_names[column + 1], axis = 1, inplace = True)

# Create the new columns with the merged news
for column in range(int((len(combined_column_names) - 1) / ROWS)):
  column_name = "News_" + str(column + 1)
  df_combined[column_name] = news_sum[column]

# Show the DataFrame
df_combined.head()

Date,Label,News_1,News_2,News_3,News_4,News_5,News_6,News_7,News_8,News_9,News_10,News_11,News_12
2008-08-08,0,"""Georgia 'downs two Russian warplanes' as cou...",'Russia Today: Columns of troops roll into So...,"""Afghan children raped with 'impunity,' U.N. ...","""Breaking: Georgia invades South Ossetia, Rus...",'Georgian troops retreat from S. Osettain cap...,'Rice Gives Green Light for Israel to Attack ...,"""So---Russia and Georgia are at war and the N...",'Did World War III start today?' 'Georgia Inv...,'Al-Qaeda Faces Islamist Backlash' 'Condoleez...,'This is a busy day: The European Union has ...,'Why the Pentagon Thinks Attacking Iran is a ...,'Indian shoe manufactory - And again in a se...
2008-08-11,1,'Why wont America and Nato help us? If they w...,"""Jewish Georgian minister: Thanks to Israeli ...","""Olympic opening ceremony fireworks 'faked'"" ...",'Russia angered by Israeli military sale to G...,'Welcome To World War IV! Now In High Definit...,'Russia presses deeper into Georgia; U.S. say...,' U.S. ship heads for Arctic to define territ...,'The French Team is Stunned by Phelps and the...,"'""Do not believe TV, neither Russian nor Geor...",'China to overtake US as largest manufacturer...,'Israeli Physicians Group Condemns State Tort...,'Perhaps *the* question about the Georgia - R...
2008-08-12,0,'Remember that adorable 9-year-old who sang a...,"'""If we had no sexual harassment we would hav...",'Ceasefire in Georgia: Putin Outmaneuvers the...,'Stratfor: The Russo-Georgian War and the Bal...,"""The US military was surprised by the timing ...","'Gorbachev: ""Georgian military attacked the S...",'Beginning a war as the Olympics were opening...,'The 11 Top Party Cities in the World' 'U.S. ...,'Why Russias response to Georgia was right' '...,"'Russia, Georgia, and NATO: Cold War Two' 'Re...",'War in Georgia: The Israeli connection' 'All...,'Christopher King argues that the US and NATO...
2008-08-13,0,' U.S. refuses Israel weapons to attack Iran:...,' Israel clears troops who killed Reuters cam...,'Body of 14 year old found in trunk; Latest (...,"""Bush announces Operation Get All Up In Russi...","""The commander of a Navy air reconnaissance s...",'USA to send fleet into Black Sea to help Geo...,"""In an intriguing cyberalliance, two Estonian...",'Why Russias response to Georgia was right' '...,'US humanitarian missions soon in Georgia - i...,"'Russian convoy heads into Georgia, violating...",'Gorbachev: We Had No Choice' 'Witness: Russi...,' Quarter of Russians blame U.S. for conflict...
2008-08-14,1,'All the experts admit that we should legalis...,'Swedish wrestler Ara Abrahamian throws away ...,'Missile That Killed 9 Inside Pakistan May Ha...,'Poland and US agree to missle defense deal. ...,'Russia exaggerating South Ossetian death tol...,'Moscow Made Plans Months Ago to Invade Georg...,'Nigeria has handed over the potentially oil-...,'Russia apparently is sabotaging infrastructu...,"""Georgia confict could set back Russia's US r...","'""Non-media"" photos of South Ossetia/Georgia ...",'Saudi Arabia: Mother moves to block child ma...,"'Russia: World ""can forget about"" Georgia\'s..."


In a new DataFrame I regroup the headline blocks with their labels, now without the dates.

In [349]:
# The label column 
LABEL_COLUMN = 0

news_sum = []
label_sum = []

# Get the column names
combined_column_names = []
for column in df_combined.columns:
  combined_column_names.append(column)

# Write out the column names 
print(combined_column_names)
print("\n")

# Connect the merged news with the labels (day by day, package by package)
for row in range(len(df_combined)):
  for column in range(len(combined_column_names) - 1):
    news_sum.append(df_combined[combined_column_names[column + 1]][row])
    label_sum.append(df_combined[combined_column_names[LABEL_COLUMN]][row])

# Create the new DataFrame
df_sum_news_labels = pd.DataFrame(data = label_sum, index = None, columns = ["Label"])
df_sum_news_labels["News"] = news_sum

# Show it
df_sum_news_labels.head()

['Label', 'News_1', 'News_2', 'News_3', 'News_4', 'News_5', 'News_6', 'News_7', 'News_8', 'News_9', 'News_10', 'News_11', 'News_12']




,Label,News
0,0,"""Georgia 'downs two Russian warplanes' as cou..."
1,0,'Russia Today: Columns of troops roll into So...
2,0,"""Afghan children raped with 'impunity,' U.N. ..."
3,0,"""Breaking: Georgia invades South Ossetia, Rus..."
4,0,'Georgian troops retreat from S. Osettain cap...
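
A sketch of the wide-to-long regrouping on a toy frame, keeping the same order as above (day by day, package by package). The frame and its values are invented for the illustration.

```python
import pandas as pd

# Toy wide frame (hypothetical values): one row per day, one column per package
wide = pd.DataFrame(
    {"Label": [0, 1], "News_1": ["a1", "b1"], "News_2": ["a2", "b2"]},
    index=pd.Index(["2008-08-08", "2008-08-11"], name="Date"),
)
news_columns = [name for name in wide.columns if name != "Label"]

# One (label, package) record per day and per news column
records = [
    (row["Label"], row[column])
    for _, row in wide.iterrows()
    for column in news_columns
]
long_df = pd.DataFrame(records, columns=["Label", "News"])
print(long_df)
```

Every package inherits the label of its day, so a day with 12 packages contributes 12 rows with the same label.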


I start with text preprocessing: removing punctuation, removing digits, removing superfluous whitespace, and then converting every word to lower case.

In [350]:
# Removing punctuations
temp_news = []
for line in news_sum:
  temp_attach = ""
  for word in line:
    temp = " "
    if word not in string.punctuation:
      temp = word
    temp_attach = temp_attach + "".join(temp)
  temp_news.append(temp_attach)

news_sum = temp_news
temp_news = []

# Remove numbers
for line in news_sum:
  temp_attach = ""
  for word in line:
    temp = " "
    if not word.isdigit():
      temp = word
    temp_attach = temp_attach + "".join(temp)
  temp_news.append(temp_attach)

# Remove space
for line in range(len(temp_news)):    
  temp_news[line] = " ".join(temp_news[line].split())

# Converting headlines to lower case
for line in range(len(temp_news)): 
    temp_news[line] = temp_news[line].lower()

# Update the data frame
df_sum_news_labels["News"] = temp_news

# Show it
df_sum_news_labels.head()

,Label,News
0,0,georgia downs two russian warplanes as countri...
1,0,russia today columns of troops roll into south...
2,0,afghan children raped with impunity u n offici...
3,0,breaking georgia invades south ossetia russia ...
4,0,georgian troops retreat from s osettain capita...
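
The four cleaning steps can be condensed into one function with `str.translate`. This is a sketch under one stated assumption: it removes only the ASCII digits in `string.digits`, whereas `str.isdigit` in the cell above also matches non-ASCII digit characters. The name `clean_headline` is hypothetical.

```python
import string

# Map every punctuation character and ASCII digit to a space
_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation + string.digits})

def clean_headline(text):
    """Replace punctuation and digits with spaces, collapse whitespace, lower-case."""
    return " ".join(text.translate(_TO_SPACE).split()).lower()

example = clean_headline("U.N. says 150 tanks!")
print(example)
```

Replacing with a space rather than deleting keeps words apart, which is why "U.N." becomes the two tokens "u" and "n".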


Next I remove the so-called filler words (stop words).

In [351]:
# Load the stop words
stop_words = set(stopwords.words('english'))

filtered_sentence = []
news_sum = df_sum_news_labels["News"]

# Remove stop words
for line in news_sum:
  word_tokens = word_tokenize(line)
  temp_attach = ""
  for word in word_tokens:
    temp = " "
    if not word in stop_words:
      temp = temp + word
    temp_attach = temp_attach + "".join(temp)
  filtered_sentence.append(temp_attach)

# Remove space
for line in range(len(filtered_sentence)):    
  filtered_sentence[line] = " ".join(filtered_sentence[line].split())

# Update the data frame
df_sum_news_labels["News"] = filtered_sentence

# Show the DataFrame
df_sum_news_labels.head()

Unnamed: 0,Label,News
0,0,georgia downs two russian warplanes countries ...
1,0,russia today columns troops roll south ossetia...
2,0,afghan children raped impunity u n official sa...
3,0,breaking georgia invades south ossetia russia ...
4,0,georgian troops retreat osettain capital presu...
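A compact sketch of the same filtering step follows. It assumes the text has already been cleaned, so a plain whitespace split stands in for `word_tokenize`, and it uses a tiny hand-written stop-word set instead of NLTK's English list; both are simplifications made for illustration.

```python
# Tiny illustrative stop-word set; the notebook itself uses NLTK's larger English list.
STOP_WORDS = {"as", "of", "into", "with", "from", "the", "a"}


def remove_stop_words(text: str, stop_words: set) -> str:
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(token for token in text.split() if token not in stop_words)


print(remove_stop_words("russia today columns of troops roll into south ossetia", STOP_WORDS))
```

A headline that consists only of stop words becomes an empty string after this step.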


Next, I find the empty strings in the dataset and drop the corresponding rows.

In [352]:
news_sum = df_sum_news_labels["News"]
null_indexes = []
index = 0

for line in news_sum:
  if line == "":
    null_indexes.append(index)
  index = index + 1

print(null_indexes)

for row in null_indexes:
  df_sum_news_labels = df_sum_news_labels.drop(row)

news_sum = df_sum_news_labels["News"]
null_indexes = []
index = 0

for line in news_sum:
  if line == "":
    null_indexes.append(index)
  index = index + 1
  
assert len(null_indexes) == 0

[3335]


Shuffling the dataset into random order.

In [353]:
# Do the shuffle
for i in range(SHUFFLE_CYCLE):
  df_sum_news_labels = shuffle(df_sum_news_labels, random_state = rs)

# Reset the index
df_sum_news_labels.reset_index(inplace=True, drop=True)

# Show the data frame
df_sum_news_labels.head()

Unnamed: 0,Label,News
0,0,israel stole b palestinian workers israeli eco...
1,1,sec state john kerry russia lying face troops ...
2,0,mining company idemitsu australia resources ad...
3,0,pirate party fires broadside german political ...
4,1,cairo court gives death penalty egyptian chris...


Splitting the dataset into training and validation/test sets, then checking the split by size and by printing the rows at the split boundary.

In [354]:
INPUT_SIZE = len(df_sum_news_labels)
TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 
TEST_SIZE = int(TEST_SPLIT * INPUT_SIZE)

# Split the dataset
train = df_sum_news_labels[:TRAIN_SIZE] 
test = df_sum_news_labels[TRAIN_SIZE:]

# Print out the length
print("Train data set length: " + str(len(train)))
print("Test data set length: " + str(len(test)))
print("Split summa: " + str(len(train) + len(test)))
print("Dataset summa before split: " + str(len(df_sum_news_labels)))

# Check that no rows were lost in the split
split_sum = len(train) + len(test)
total = len(df_sum_news_labels)  # avoid shadowing the built-in sum()
assert split_sum == total

Train data set length: 20286
Test data set length: 3581
Split summa: 23867
Dataset summa before split: 23867
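The split arithmetic can be checked by hand. The values of `TRAIN_SPLIT` and `TEST_SPLIT` are not shown in this section; the sketch below assumes 0.85 and 0.15, which is consistent with the printed sizes but remains an assumption.

```python
# Assumption: these ratios are inferred from the printed sizes, not shown in this section.
TRAIN_SPLIT = 0.85
TEST_SPLIT = 0.15
INPUT_SIZE = 23867  # dataset length printed above

train_size = int(TRAIN_SPLIT * INPUT_SIZE)    # int() truncates toward zero
test_size_by_ratio = int(TEST_SPLIT * INPUT_SIZE)
test_size_by_slice = INPUT_SIZE - train_size  # length of df[TRAIN_SIZE:]

print(train_size, test_size_by_ratio, test_size_by_slice)
```

Under these assumed ratios, truncating both products would lose one row, which is why taking the test set as the remaining slice `[TRAIN_SIZE:]` is the safer choice.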


In [355]:
train.tail(1)

Unnamed: 0,Label,News
20285,1,special forces raid bp moscow officespakistani...


In [356]:
test.head(1)

Unnamed: 0,Label,News
20286,1,charlie hebdo pakistani legislators chant deat...


Saving both sets.

In [357]:
# save to drive
train.to_csv('drive/MyDrive/Kaggle dataset/Reddit Top 25 DJIA/train.csv')
test.to_csv('drive/MyDrive/Kaggle dataset/Reddit Top 25 DJIA/test.csv')

### Bag of words

First, I collect the headlines of the training set into a list.

In [358]:
train_headlines = []

for row in range(0, len(train.index)):
    train_headlines.append(train.iloc[row, 1])

# show the first
train_headlines[0]

'israel stole b palestinian workers israeli economists revealed generose lay bleeding near husbands corpse soldiers cut amputated leg cooked pieces ordered children eat mothers flesh one son refused kill kill told soldiers mother remembers eat part mother'

Then I vectorize them.

In [359]:
bow_vectorizer = CountVectorizer()
bow_train = bow_vectorizer.fit_transform(train_headlines)
print(bow_train.shape)

(20286, 38607)
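To make the vectorization step concrete, here is a pure-Python sketch of what a bag-of-words matrix contains. It is a simplified stand-in for `CountVectorizer`, not its implementation; the function name `bag_of_words` is hypothetical. The token pattern mirrors `CountVectorizer`'s default, which keeps only tokens of two or more word characters.

```python
import re
from collections import Counter

# Same shape as CountVectorizer's default token pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")


def bag_of_words(docs):
    """Return (vocabulary, rows); each row holds per-document counts in vocabulary order."""
    counts = [Counter(TOKEN_PATTERN.findall(doc.lower())) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, rows


vocab, rows = bag_of_words(["troops roll south", "u n south ossetia troops troops"])
print(vocab)
print(rows)
```

Single-letter tokens such as "u", "n", or "b" do not match the pattern, so they never become features. The 38607 columns in the shape above are the distinct tokens that survive this filter in the training headlines.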


I fit a logistic regression model on this training set.

In [360]:
bow_model = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
bow_model = bow_model.fit(bow_train, train["Label"])

The next step is to prepare the test set and predict with the model.

In [361]:
test_headlines = []

for row in range(0,len(test.index)):
    test_headlines.append(test.iloc[row, 1])

bow_test = bow_vectorizer.transform(test_headlines)
bow_predictions = bow_model.predict(bow_test)

The results shown as a table.

In [362]:
pd.crosstab(test["Label"], bow_predictions, rownames=["Actual"], colnames=["Predicted"])

Predicted,0,1
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1
0,722,895
1,856,1108


The accuracy of the model.

In [363]:
print (classification_report(test["Label"], bow_predictions))
print (accuracy_score(test["Label"], bow_predictions))

result.append(accuracy_score(test["Label"], bow_predictions))

              precision    recall  f1-score   support

           0       0.46      0.45      0.45      1617
           1       0.55      0.56      0.56      1964

    accuracy                           0.51      3581
   macro avg       0.51      0.51      0.51      3581
weighted avg       0.51      0.51      0.51      3581

0.5110304384250209
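The numbers in the report can be reproduced directly from the confusion matrix above. The sketch below recomputes accuracy, precision, and recall for label 1, and adds the accuracy of a trivial classifier that always predicts label 1 as a point of comparison (that baseline is not part of the notebook).

```python
# Counts copied from the crosstab above (rows: actual, columns: predicted).
tn, fp = 722, 895    # actual label 0
fn, tp = 856, 1108   # actual label 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)
majority_baseline = (fn + tp) / total  # accuracy of always predicting label 1

print(f"accuracy={accuracy:.4f} precision_1={precision_1:.2f} recall_1={recall_1:.2f}")
print(f"always-predict-1 baseline={majority_baseline:.4f}")
```

On this test split the bag-of-words accuracy is lower than the always-predict-1 baseline, so the later accuracy figures are best read against that baseline rather than against 0.5.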


Next, I display the 10 most influential tokens in both the positive and the negative direction.

In [364]:
bow_words = bow_vectorizer.get_feature_names()
bow_coeffs = bow_model.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : bow_words, 
                        'Coefficient' : bow_coeffs})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
37860,wolves,1.575139
34454,thriving,1.555963
18822,landing,1.42361
29746,sanaa,1.41376
29141,riyadh,1.389342
9779,division,1.387358
6402,collection,1.378622
20284,manipulate,1.356852
21932,movies,1.354048
6072,clashed,1.340151


In [365]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
15686,horns,-1.304928
25589,player,-1.329162
32153,spilled,-1.333695
25317,picked,-1.342825
5847,choppers,-1.347697
7215,contributed,-1.354432
3140,begging,-1.35986
15680,hormuz,-1.365731
20925,merchant,-1.407644
5372,census,-1.428086
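In logistic regression a coefficient is the change in the log-odds of label 1 per unit increase of the feature, with all other features held fixed. A short sketch converts the extreme coefficients listed above into odds multipliers; the helper name `odds_multiplier` is hypothetical.

```python
import math


def odds_multiplier(coefficient: float, count: int = 1) -> float:
    """Factor by which the odds of label 1 change when a token occurs `count` times."""
    return math.exp(coefficient * count)


print(odds_multiplier(1.575139))   # 'wolves', largest positive coefficient above
print(odds_multiplier(-1.428086))  # 'census', most negative coefficient above
```

A multiplier above 1 pushes the prediction toward label 1 and a multiplier below 1 pushes it toward label 0. With about 38 thousand features and 20 thousand training rows, large coefficients on rare tokens may reflect overfitting rather than a real signal.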


### 2-gram model

As before, I vectorize the training set, fit a logistic regression model on it, predict, and then evaluate the results. First with the (1,2) n-gram model.

In [366]:
gram_vectorizer_12 = CountVectorizer(ngram_range=(1,2))
train_vectorizer_12 = gram_vectorizer_12.fit_transform(train_headlines)

print(train_vectorizer_12.shape)

gram_model_12 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_12 = gram_model_12.fit(train_vectorizer_12, train["Label"])

test_headlines = []

for row in range(0,len(test.index)):
    test_headlines.append(test.iloc[row, 1])

gram_test_12 = gram_vectorizer_12.transform(test_headlines)
gram_predictions_12 = gram_model_12.predict(gram_test_12)

pd.crosstab(test["Label"], gram_predictions_12, rownames=["Actual"], colnames=["Predicted"])

print (classification_report(test["Label"], gram_predictions_12))
print (accuracy_score(test["Label"], gram_predictions_12))

result.append(accuracy_score(test["Label"], gram_predictions_12))

(20286, 387915)
              precision    recall  f1-score   support

           0       0.45      0.40      0.42      1617
           1       0.55      0.60      0.58      1964

    accuracy                           0.51      3581
   macro avg       0.50      0.50      0.50      3581
weighted avg       0.51      0.51      0.51      3581

0.5110304384250209
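The `ngram_range=(min_n, max_n)` parameter controls which contiguous word sequences become features. The following pure-Python sketch illustrates the idea on one token list; `word_ngrams` is a hypothetical helper, not a scikit-learn function.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all contiguous n-grams with min_n <= n <= max_n, joined by spaces."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


tokens = "un security council meets".split()
print(word_ngrams(tokens, 1, 2))  # unigrams and bigrams
print(word_ngrams(tokens, 2, 2))  # bigrams only
```

The feature counts reported in this notebook fit this picture: the 387915 columns of the (1,2) model equal the 38607 unigram columns plus the 349308 bigram columns reported for the (2,2) model.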


In [367]:
gram_words_12 = gram_vectorizer_12.get_feature_names()
gram_coeffs_12 = gram_model_12.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_12, 
                        'Coefficient' : gram_coeffs_12})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
189169,landing,0.98544
304143,security council,0.930576
222340,mumbai,0.868996
379692,wolves,0.841094
308844,sexual violence,0.835806
341463,terror attack,0.826805
202341,luxury,0.812963
305226,seize,0.81207
45967,buildings,0.791482
817,abroad,0.788123


In [368]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
33163,begin,-0.780491
327730,stranded,-0.785083
325355,statistics,-0.797791
158709,hormuz,-0.809017
201819,low,-0.814543
55167,census,-0.841566
17177,appeal,-0.855236
227465,nepal,-0.893965
318316,somalia,-0.973756
362365,us army,-1.024815


Second, with the (2,2) n-gram model.

In [369]:
gram_vectorizer_22 = CountVectorizer(ngram_range=(2,2))
train_vectorizer_22 = gram_vectorizer_22.fit_transform(train_headlines)

print(train_vectorizer_22.shape)

gram_model_22 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_22 = gram_model_22.fit(train_vectorizer_22, train["Label"])

test_headlines = []

for row in range(0,len(test.index)):
    test_headlines.append(test.iloc[row, 1])

gram_test_22 = gram_vectorizer_22.transform(test_headlines)
gram_predictions_22 = gram_model_22.predict(gram_test_22)

pd.crosstab(test["Label"], gram_predictions_22, rownames=["Actual"], colnames=["Predicted"])

print (classification_report(test["Label"], gram_predictions_22))
print (accuracy_score(test["Label"], gram_predictions_22))

result.append(accuracy_score(test["Label"], gram_predictions_22))

(20286, 349308)
              precision    recall  f1-score   support

           0       0.47      0.31      0.37      1617
           1       0.55      0.71      0.62      1964

    accuracy                           0.53      3581
   macro avg       0.51      0.51      0.50      3581
weighted avg       0.51      0.53      0.51      3581

0.5283440379782184


In [370]:
gram_words_22 = gram_vectorizer_22.get_feature_names()
gram_coeffs_22 = gram_model_22.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_22, 
                        'Coefficient' : gram_coeffs_22})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
273713,security council,0.931506
261036,rights watch,0.907768
278130,sexual violence,0.873731
326960,us spying,0.796756
8109,air pollution,0.772872
1787,according new,0.77151
223688,peace deal,0.764626
209611,nuclear strike,0.755552
93315,eastern ukraine,0.727508
311763,time since,0.725524


In [371]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
299371,support israel,-0.746145
285792,solar system,-0.774426
293773,stock market,-0.77485
294959,strait hormuz,-0.783698
54807,chinese officials,-0.787793
220259,panama papers,-0.852689
192889,military bases,-0.904961
268109,saudi king,-0.922218
334265,war iran,-0.940277
326071,us army,-1.035989


### 3-gram model

As before, I vectorize the training set, fit a logistic regression model on it, predict, and then evaluate the results. First with the (1,3) n-gram model.

In [372]:
gram_vectorizer_13 = CountVectorizer(ngram_range=(1,3))
train_vectorizer_13 = gram_vectorizer_13.fit_transform(train_headlines)

print(train_vectorizer_13.shape)

gram_model_13 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_13 = gram_model_13.fit(train_vectorizer_13, train["Label"])

test_headlines = []

for row in range(0,len(test.index)):
    test_headlines.append(test.iloc[row, 1])

gram_test_13 = gram_vectorizer_13.transform(test_headlines)
gram_predictions_13 = gram_model_13.predict(gram_test_13)

pd.crosstab(test["Label"], gram_predictions_13, rownames=["Actual"], colnames=["Predicted"])

print (classification_report(test["Label"], gram_predictions_13))
print (accuracy_score(test["Label"], gram_predictions_13))

result.append(accuracy_score(test["Label"], gram_predictions_13))

(20286, 802274)
              precision    recall  f1-score   support

           0       0.47      0.39      0.43      1617
           1       0.56      0.63      0.59      1964

    accuracy                           0.52      3581
   macro avg       0.51      0.51      0.51      3581
weighted avg       0.52      0.52      0.52      3581

0.5227590058642837


In [373]:
gram_words_13 = gram_vectorizer_13.get_feature_names()
gram_coeffs_13 = gram_model_13.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_13, 
                        'Coefficient' : gram_coeffs_13})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
457197,mumbai,0.798173
388804,landing,0.75499
629974,seize,0.702728
627639,security council,0.698021
391604,latest,0.686259
14397,agencies,0.672385
679261,struggle,0.670076
761964,volcano,0.664893
415963,luxury,0.63563
94259,buildings,0.632831


In [374]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
112761,census,-0.641643
594601,revolt,-0.642421
723238,trafficking,-0.659209
68214,begin,-0.661095
35574,appeal,-0.738186
693724,system,-0.740079
747588,us army,-0.748473
467784,nepal,-0.74929
414991,low,-0.756724
656691,somalia,-0.922395


As before, I vectorize the training set, fit a logistic regression model on it, predict, and then evaluate the results. Next with the (2,3) n-gram model.

In [375]:
gram_vectorizer_23 = CountVectorizer(ngram_range=(2,3))
train_vectorizer_23 = gram_vectorizer_23.fit_transform(train_headlines)

print(train_vectorizer_23.shape)

gram_model_23 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_23 = gram_model_23.fit(train_vectorizer_23, train["Label"])

test_headlines = []

for row in range(0,len(test.index)):
    test_headlines.append(test.iloc[row, 1])

gram_test_23 = gram_vectorizer_23.transform(test_headlines)
gram_predictions_23 = gram_model_23.predict(gram_test_23)

pd.crosstab(test["Label"], gram_predictions_23, rownames=["Actual"], colnames=["Predicted"])

print (classification_report(test["Label"], gram_predictions_23))
print (accuracy_score(test["Label"], gram_predictions_23))

result.append(accuracy_score(test["Label"], gram_predictions_23))

(20286, 763667)
              precision    recall  f1-score   support

           0       0.48      0.25      0.33      1617
           1       0.56      0.78      0.65      1964

    accuracy                           0.54      3581
   macro avg       0.52      0.51      0.49      3581
weighted avg       0.52      0.54      0.50      3581

0.5403518570231779


In [376]:
gram_words_23 = gram_vectorizer_23.get_feature_names()
gram_coeffs_23 = gram_model_23.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_23, 
                        'Coefficient' : gram_coeffs_23})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
597209,security council,0.724715
607058,sexual violence,0.636324
153212,court rules,0.616613
201062,eastern ukraine,0.601972
17359,air pollution,0.599077
568778,rights watch,0.585935
3769,according new,0.571407
622533,social media,0.569983
713767,us spying,0.560623
751370,world largest,0.556218


In [377]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
548009,red cross,-0.576519
640979,stock market,-0.58222
39938,around world,-0.5923
449734,news international,-0.611127
650490,suicide bomber,-0.660796
419034,military bases,-0.663339
729640,war iran,-0.688157
584685,saudi king,-0.693207
480834,panama papers,-0.703129
711294,us army,-0.758311


As before, I vectorize the training set, fit a logistic regression model on it, predict, and then evaluate the results. Finally with the (3,3) n-gram model.

In [378]:
gram_vectorizer_33 = CountVectorizer(ngram_range=(3,3))
train_vectorizer_33 = gram_vectorizer_33.fit_transform(train_headlines)

print(train_vectorizer_33.shape)

gram_model_33 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
gram_model_33 = gram_model_33.fit(train_vectorizer_33, train["Label"])

test_headlines = []

for row in range(0,len(test.index)):
    test_headlines.append(test.iloc[row, 1])

gram_test_33 = gram_vectorizer_33.transform(test_headlines)
gram_predictions_33 = gram_model_33.predict(gram_test_33)

pd.crosstab(test["Label"], gram_predictions_33, rownames=["Actual"], colnames=["Predicted"])

print (classification_report(test["Label"], gram_predictions_33))
print (accuracy_score(test["Label"], gram_predictions_33))

result.append(accuracy_score(test["Label"], gram_predictions_33))

(20286, 414359)
              precision    recall  f1-score   support

           0       0.51      0.07      0.12      1617
           1       0.55      0.94      0.70      1964

    accuracy                           0.55      3581
   macro avg       0.53      0.51      0.41      3581
weighted avg       0.53      0.55      0.44      3581

0.5495671600111701


In [379]:
gram_words_33 = gram_vectorizer_33.get_feature_names()
gram_coeffs_33 = gram_model_33.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : gram_words_33, 
                        'Coefficient' : gram_coeffs_33})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
coeffdf.head(10)

Unnamed: 0,Word,Coefficient
132267,first time since,0.916949
381429,un security council,0.755421
168514,human rights watch,0.694514
10082,al jazeera english,0.683763
280085,president hosni mubarak,0.62358
381127,un general assembly,0.533258
283226,pro russian separatists,0.515341
410908,year old man,0.508897
402301,wikileaks julian assange,0.505865
140822,french president sarkozy,0.50371


In [380]:
coeffdf.tail(10)

Unnamed: 0,Word,Coefficient
148251,girl gang raped,-0.51523
26624,aung san suu,-0.526697
178458,international space station,-0.526701
326323,sentenced years prison,-0.549481
165697,homes east jerusalem,-0.554131
109678,egypt muslim brotherhood,-0.56024
357226,syrian security forces,-0.603754
229484,missile defense system,-0.66156
58597,chancellor angela merkel,-0.701318
123670,faces years jail,-0.717956


### 4-gram model

From this section on, I evaluate the models in a loop and save their results.

In [381]:
for n in range(1,5):
    print("--------------------------------------------\n\nStart of the " 
          + str(n) + ",4 gram model\n")

    _gram_vectorizer_ = CountVectorizer(ngram_range=(n,4))
    _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

    print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

    _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

    test_headlines = []

    for row in range(0,len(test.index)):
        test_headlines.append(test.iloc[row, 1])

    _gram_test_ = _gram_vectorizer_.transform(test_headlines)
    _gram_predictions_ = _gram_model_.predict(_gram_test_)

    pd.crosstab(test["Label"], _gram_predictions_, rownames=["Actual"], colnames=["Predicted"])

    print (classification_report(test["Label"], _gram_predictions_))
    print (accuracy_score(test["Label"], _gram_predictions_))

    model_type.append(str(n) + ",4 n-gram")
    result.append(accuracy_score(test["Label"], _gram_predictions_))

--------------------------------------------

Start of the 1,4 gram model

The shape is: (20286, 1206740)

              precision    recall  f1-score   support

           0       0.46      0.37      0.41      1617
           1       0.55      0.64      0.59      1964

    accuracy                           0.52      3581
   macro avg       0.51      0.51      0.50      3581
weighted avg       0.51      0.52      0.51      3581

0.5196872382016197
--------------------------------------------

Start of the 2,4 gram model

The shape is: (20286, 1168133)

              precision    recall  f1-score   support

           0       0.48      0.19      0.27      1617
           1       0.56      0.83      0.67      1964

    accuracy                           0.54      3581
   macro avg       0.52      0.51      0.47      3581
weighted avg       0.52      0.54      0.49      3581

0.542306618263055
--------------------------------------------

Start of the 3,4 gram model

The shape is: (20286
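The loop body used in this and the following sections can be factored into one function. The sketch below shows one way to do that; the function name `evaluate_ngram_range` and its default arguments are hypothetical, and the toy data only demonstrates the call shape.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def evaluate_ngram_range(train_texts, train_labels, test_texts, test_labels,
                         ngram_range, random_state=0, max_iter=1000):
    """Fit CountVectorizer + LogisticRegression on the training data, return test accuracy."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    x_train = vectorizer.fit_transform(train_texts)
    x_test = vectorizer.transform(test_texts)  # reuse the vocabulary learned on train
    model = LogisticRegression(random_state=random_state, max_iter=max_iter)
    model.fit(x_train, train_labels)
    return accuracy_score(test_labels, model.predict(x_test))


# Toy data, not the notebook's dataset.
train_texts = ["stocks rally strongly", "markets crash hard",
               "stocks rally again", "markets crash again"]
train_labels = [1, 0, 1, 0]
test_texts = ["stocks rally", "markets crash"]
test_labels = [1, 0]

for n in range(1, 3):
    acc = evaluate_ngram_range(train_texts, train_labels, test_texts, test_labels, (n, 2))
    print(f"{n},2 n-gram: {acc}")
```

With such a helper, each n-gram section reduces to a loop over `(n, max_n)` pairs that appends the returned accuracy to `result`.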

### 5-gram model

In this section I again evaluate the models in a loop and save their results.

In [382]:
MODEL_TYPE = 5

for n in range(1,MODEL_TYPE+1):
    print("--------------------------------------------\n\nStart of the " 
          + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

    _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
    _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

    print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

    _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

    test_headlines = []

    for row in range(0,len(test.index)):
        test_headlines.append(test.iloc[row, 1])

    _gram_test_ = _gram_vectorizer_.transform(test_headlines)
    _gram_predictions_ = _gram_model_.predict(_gram_test_)

    pd.crosstab(test["Label"], _gram_predictions_, rownames=["Actual"], colnames=["Predicted"])

    print (classification_report(test["Label"], _gram_predictions_))
    print (accuracy_score(test["Label"], _gram_predictions_))

    model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
    result.append(accuracy_score(test["Label"], _gram_predictions_))

--------------------------------------------

Start of the 1,5 gram model

The shape is: (20286, 1592567)

              precision    recall  f1-score   support

           0       0.46      0.36      0.41      1617
           1       0.55      0.65      0.60      1964

    accuracy                           0.52      3581
   macro avg       0.51      0.51      0.50      3581
weighted avg       0.51      0.52      0.51      3581

0.5210834962301033
--------------------------------------------

Start of the 2,5 gram model

The shape is: (20286, 1553960)

              precision    recall  f1-score   support

           0       0.48      0.15      0.23      1617
           1       0.55      0.87      0.67      1964

    accuracy                           0.54      3581
   macro avg       0.51      0.51      0.45      3581
weighted avg       0.52      0.54      0.47      3581

0.5420273666573583
--------------------------------------------

Start of the 3,5 gram model

The shape is: (2028

### 6-gram model

In this section I again evaluate the models in a loop and save their results.

In [383]:
MODEL_TYPE = 6

for n in range(1,MODEL_TYPE+1):
    print("--------------------------------------------\n\nStart of the " 
          + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

    _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
    _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

    print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

    _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

    test_headlines = []

    for row in range(0,len(test.index)):
        test_headlines.append(test.iloc[row, 1])

    _gram_test_ = _gram_vectorizer_.transform(test_headlines)
    _gram_predictions_ = _gram_model_.predict(_gram_test_)

    pd.crosstab(test["Label"], _gram_predictions_, rownames=["Actual"], colnames=["Predicted"])

    print (classification_report(test["Label"], _gram_predictions_))
    print (accuracy_score(test["Label"], _gram_predictions_))

    model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
    result.append(accuracy_score(test["Label"], _gram_predictions_))

--------------------------------------------

Start of the 1,6 gram model

The shape is: (20286, 1958520)

              precision    recall  f1-score   support

           0       0.46      0.35      0.40      1617
           1       0.55      0.66      0.60      1964

    accuracy                           0.52      3581
   macro avg       0.51      0.51      0.50      3581
weighted avg       0.51      0.52      0.51      3581

0.5205249930187098
--------------------------------------------

Start of the 2,6 gram model

The shape is: (20286, 1919913)

              precision    recall  f1-score   support

           0       0.47      0.12      0.20      1617
           1       0.55      0.89      0.68      1964

    accuracy                           0.54      3581
   macro avg       0.51      0.51      0.44      3581
weighted avg       0.52      0.54      0.46      3581

0.542306618263055
--------------------------------------------

Start of the 3,6 gram model

The shape is: (20286

### Summary of results

Printing the results, with the best one highlighted.

In [384]:
best_model = 0

for model in range(len(model_type)):
    print(str(model_type[model]) + ":\t\t\t\t\t" + str(result[model]))

    if result[model] > best_model:
        best_model = result[model]
        best_model_index = model

print("--------------------------------------------\nBest model:\n" 
      + str(model_type[best_model_index]) + "\t\t\t\t\t" + 
      str(result[best_model_index]))


Bag of words:					0.5110304384250209
1,2 n-gram:					0.5110304384250209
2,2 n-gram:					0.5283440379782184
1,3 n-gram:					0.5227590058642837
2,3 n-gram:					0.5403518570231779
3,3 n-gram:					0.5495671600111701
1,4 n-gram:					0.5196872382016197
2,4 n-gram:					0.542306618263055
3,4 n-gram:					0.5520804244624407
4,4 n-gram:					0.5498464116168668
1,5 n-gram:					0.5210834962301033
2,5 n-gram:					0.5420273666573583
3,5 n-gram:					0.5490086567997766
4,5 n-gram:					0.5504049148282603
5,5 n-gram:					0.5495671600111701
1,6 n-gram:					0.5205249930187098
2,6 n-gram:					0.542306618263055
3,6 n-gram:					0.5481709019826864
4,6 n-gram:					0.550684166433957
5,6 n-gram:					0.5495671600111701
6,6 n-gram:					0.5490086567997766
--------------------------------------------
Best model:
3,4 n-gram					0.5520804244624407
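The best-model search can also be expressed with the built-in `max`. The sketch below uses a subset of the accuracies printed above; storing them in a dictionary is an illustrative choice, since the notebook keeps two parallel lists (`model_type` and `result`).

```python
# A subset of the accuracies printed above, keyed by model name.
results = {
    "Bag of words": 0.5110304384250209,
    "2,2 n-gram": 0.5283440379782184,
    "3,3 n-gram": 0.5495671600111701,
    "3,4 n-gram": 0.5520804244624407,
    "4,5 n-gram": 0.5504049148282603,
    "4,6 n-gram": 0.550684166433957,
}

# max() returns the first maximal item, like the strict `>` comparison in the loop.
best_name, best_accuracy = max(results.items(), key=lambda item: item[1])
print(f"Best model: {best_name} ({best_accuracy:.4f})")
```

The differences between the top models are in the third decimal place on a test set of 3581 rows, so the ranking among them may not be stable across different shuffles.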


### Optimizing the ROWS macro

In this section I run an automated bag of words -> 6,6 gram training and prediction sequence for different ROWS values (the number of daily news items concatenated into one string) and determine which value gives the highest accuracy.

Specifying the parameters to test.

In [387]:
# Number of news items merged into one string: 1...12, 25
rows_values = list(range(1, 13)) + [25]

rows_values

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 25]

Collecting the model types for the automated training.

In [389]:
model_type_values = list(range(1, 7))

model_type_values

[1, 2, 3, 4, 5, 6]

I create the following lists to store the results for each parameter value.

In [1]:
rows_summary_value = []
rows_summary_accuraccy = []

Automated training and saving of the results.

In [395]:
for ROWS in rows_values:

    print("--------------------------------------------\n\nStart of the ROWS = " 
      + str(ROWS) + " sequence\n\n--------------------------------------------\n")
    
    model_type = []
    result = []

    for MODEL_TYPE in model_type_values:

        for n in range(1,MODEL_TYPE+1):
            print("--------------------------------------------\n\nStart of the " 
                  + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

            _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
            _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

            print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

            _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
            _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

            test_headlines = []

            for row in range(0,len(test.index)):
                test_headlines.append(test.iloc[row, 1])

            _gram_test_ = _gram_vectorizer_.transform(test_headlines)
            _gram_predictions_ = _gram_model_.predict(_gram_test_)

            pd.crosstab(test["Label"], _gram_predictions_, rownames=["Actual"], colnames=["Predicted"])

            print (classification_report(test["Label"], _gram_predictions_))
            print (accuracy_score(test["Label"], _gram_predictions_))

            model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
            result.append(accuracy_score(test["Label"], _gram_predictions_))

    rows_summary_value.append(ROWS)

    # save the best
    best_model = 0

    for model in range(len(model_type)):
        if result[model] > best_model:
            best_model = result[model]
            best_model_index = model

    rows_summary_accuraccy.append(best_model)

--------------------------------------------

Start of the ROWS = 1 sequence

--------------------------------------------

--------------------------------------------

Start of the 1,1 gram model

The shape is: (20286, 38607)

              precision    recall  f1-score   support

           0       0.46      0.45      0.45      1617
           1       0.55      0.56      0.56      1964

    accuracy                           0.51      3581
   macro avg       0.51      0.51      0.51      3581
weighted avg       0.51      0.51      0.51      3581

0.5110304384250209
--------------------------------------------

Start of the 1,2 gram model

The shape is: (20286, 387915)

              precision    recall  f1-score   support

           0       0.45      0.40      0.42      1617
           1       0.55      0.60      0.58      1964

    accuracy                           0.51      3581
   macro avg       0.50      0.50      0.50      3581
weighted avg       0.51      0.51      0.51    

Evaluation. For the ROWS values shown, the accuracies repeat identically.

In [396]:
best_model = 0

for model in range(len(rows_summary_value)):
    print(str(rows_summary_value[model]) + ":\t\t\t\t\t" 
          + str(rows_summary_accuraccy[model]))

    if rows_summary_accuraccy[model] > best_model:
        best_model = rows_summary_accuraccy[model]
        best_model_index = model

print("--------------------------------------------\nBest row value:\n" 
      + str(rows_summary_value[best_model_index]) + "\t\t\t\t\t" + 
      str(rows_summary_accuraccy[best_model_index]))

1:					0.5110304384250209
1:					0.5283440379782184
1:					0.5495671600111701
1:					0.5520804244624407
1:					0.5520804244624407
1:					0.5520804244624407
2:					0.5110304384250209
2:					0.5283440379782184
2:					0.5495671600111701
2:					0.5520804244624407
2:					0.5520804244624407
2:					0.5520804244624407
3:					0.5110304384250209
3:					0.5283440379782184
3:					0.5495671600111701
3:					0.5520804244624407
3:					0.5520804244624407
3:					0.5520804244624407
4:					0.5110304384250209
4:					0.5283440379782184
4:					0.5495671600111701
4:					0.5520804244624407
4:					0.5520804244624407
4:					0.5520804244624407
5:					0.5110304384250209
5:					0.5283440379782184
5:					0.5495671600111701
5:					0.5520804244624407
5:					0.5520804244624407
5:					0.5520804244624407
6:					0.5110304384250209
6:					0.5283440379782184
6:					0.5495671600111701
6:					0.5520804244624407
6:					0.5520804244624407
6:					0.5520804244624407
7:					0.5110304384250209
7:					0.5283440379782184
7:					0.549
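The training loop above reads `train_headlines` and `test` unchanged for every `ROWS` value, so each iteration fits the same data. A minimal sketch of how the merged strings could be rebuilt per value follows. It assumes the raw data is available as one list of headlines per day; the helper `merge_daily_headlines` and that input layout are hypothetical and not taken from the notebook.

```python
def merge_daily_headlines(daily_headlines, rows):
    """Join each day's headlines into strings of `rows` consecutive headlines.

    `daily_headlines` is a list of days, each a list of headline strings.
    Returns one list of merged strings per day; a shorter trailing group is kept.
    """
    merged = []
    for headlines in daily_headlines:
        chunks = [" ".join(headlines[i:i + rows])
                  for i in range(0, len(headlines), rows)]
        merged.append(chunks)
    return merged


day = ["h1", "h2", "h3", "h4", "h5"]
for rows in (1, 2, 5):
    print(rows, merge_daily_headlines([day], rows))
```

Calling such a helper at the top of the `for ROWS in rows_values:` loop, followed by the preprocessing, shuffle, and split steps, would make each iteration train on data that actually depends on `ROWS`. Whether a trailing incomplete group is kept or dropped is a design choice; this sketch keeps it.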

## ECO_BSN_DF, ECO_FNC_DF, ECO_US_DF 2008-2016

First I examine these merged datasets on the same time interval as the Reddit world news, then I merge and combine the two sources to see whether this improves accuracy.

I collected these datasets myself from the following pages:


*   https://www.economist.com/business/ 
*   https://www.economist.com/finance-and-economics/ 
*   https://www.economist.com/united-states/ 