# **Stock market news feed semantic analysis** *(Baseline Filter)*

In this model I try to filter my dataset based on the coefficients of the logistic regression model.
I concatenate the Reddit headlines in groups of eight, then inspect the coefficients of a model built on 2- to 4-word n-grams.
After that I examine the headlines one by one, and if the result based on the coefficients is neutral (within a tolerance), I remove that item.
After filtering I check the size of the dataset and the accuracy of the model again.

## **Project setup**

Mount Google Drive so the required files can be loaded later. I load each file right before it is used.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Load the libraries required for the project.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
import pandas_datareader as web
from numpy.random import MT19937
from numpy.random import RandomState, SeedSequence
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt')
from nltk.tokenize import word_tokenize  
from sklearn.utils import shuffle
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score 
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


For reproducibility I define a seed for the random number generator, which I use throughout the rest of the notebook.

In [None]:
# Random seed
RANDOM_SEED = 1234

# Numpy random seed
NP_SEED = 1234

# Max iteration for training
MAX_ITER = 100000

# Train size
TRAIN_SPLIT = 0.85

# Test size
TEST_SPLIT = 0.15

# Shuffle cycle number for the dataframe
SHUFFLE_CYCLE = 500

In [None]:
np.random.seed(NP_SEED)

## **Preparing the dataset**

Group the headlines in eights and apply all the preprocessing steps used so far.
Split the data into a train and a test set.
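The text-cleaning steps of the next cell (punctuation and digits replaced by spaces, whitespace collapsed, lowercasing, stop-word removal) can be condensed into one helper. This is only a minimal sketch, not the notebook's code: `clean_headline` and the tiny `STOP_WORDS` set are hypothetical stand-ins for NLTK's English stop-word list, and plain `str.split` is assumed to be an adequate replacement for `word_tokenize` once punctuation is gone. The input string is an illustrative example.

```python
import string

# Hypothetical, tiny stop-word set for illustration only;
# the notebook loads the full list with stopwords.words('english').
STOP_WORDS = {"the", "a", "an", "to", "be", "of", "in", "is"}


def clean_headline(text, stop_words=STOP_WORDS):
    """Return a lowercased headline without punctuation, digits or stop words."""
    # Replace every punctuation character and digit with a space
    chars = [" " if (ch in string.punctuation or ch.isdigit()) else ch
             for ch in text]
    # Lowercase, then split on whitespace (this also collapses repeated spaces)
    tokens = "".join(chars).lower().split()
    # Drop stop words and join the rest with single spaces
    return " ".join(tok for tok in tokens if tok not in stop_words)


print(clean_headline("BREAKING: Musharraf to be impeached."))
```

With this toy stop-word set the call prints `breaking musharraf impeached`. A headline that consists only of stop words, digits or punctuation becomes an empty string, which is why the cell below drops empty rows afterwards.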

In [None]:
print("Start of the preprocess\n")

# Copy the dataset to the local environment
!cp "/content/drive/MyDrive/Combined_News_DJIA.csv" "Combined_News_DJIA.csv"

# Load the dataset 
df_combined = pd.read_csv('Combined_News_DJIA.csv', index_col = "Date")

# Load the stock data
df_stock = web.DataReader("DJIA", data_source="yahoo", start="2008-08-08", 
                          end="2016-07-01")

temp_day = []

for day in range(len(df_stock)):
    temp_day.append(df_stock.index[day].date())

df_stock.index = temp_day

difference = []

for day in range(max(len(df_combined), len(df_stock))):
    if str(df_combined.index[day]) != str(df_stock.index[day]):
        difference.append(day)

if len(difference) == 0:
    print("The dates matched!\n")

difference = []

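for day in range(len(df_stock)):
    # label should be 1 -> rise
    # Note: int() truncates both prices before comparing, and for day 0
    # the "previous" day is the last row (index -1)
    if int(df_stock["Adj Close"][day]) >= int(df_stock["Adj Close"][day - 1]):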
        if df_combined["Label"][day] != 1:
            difference.append(str(df_stock.index[day]))
            print("Problem at day " + str(df_stock.index[day]))
            print("Today: " + str(df_stock["Adj Close"][day]) +"\t\tYesterday: " + str(df_stock["Adj Close"][day - 1]) + "\t\tLabel: " + str(df_combined["Label"][day]) + "\n")

    # label should be 0 -> fall
    if int(df_stock["Adj Close"][day]) < int(df_stock["Adj Close"][day - 1]):
        if df_combined["Label"][day] != 0:
            difference.append(str(df_stock.index[day]))
            print("Problem at day " + str(df_stock.index[day]))
            print("Today: " + str(df_stock["Adj Close"][day]) +"\t\tYesterday: " + str(df_stock["Adj Close"][day - 1]) + "\t\tLabel: " + str(df_combined["Label"][day]) + "\n") 

# correct the wrong labels
for row in difference:
    if df_combined.loc[row, "Label"] == 0:
        df_combined.loc[row, "Label"] = 1
    else:
        df_combined.loc[row, "Label"] = 0

print("All differences: " + str(len(difference)) + "\nFixed!\n") 

# Find the cells with NaN, then the rows that contain them
is_NaN = df_combined.isnull()
row_has_NaN = is_NaN.any(axis = 1)
rows_with_NaN = df_combined[row_has_NaN]

# Replace them
df_combined = df_combined.replace(np.nan, " ")

# Check the process
is_NaN = df_combined.isnull()
row_has_NaN = is_NaN.any(axis = 1)
rows_with_NaN = df_combined[row_has_NaN]

assert len(rows_with_NaN) == 0

# The label column 
LABEL_COLUMN = 0

news_sum = []
label_sum = []

# Get the column names
combined_column_names = []
for column in df_combined.columns:
  combined_column_names.append(column)

# Connect the news with the labels
for column in range(len(df_combined)):
  for row in range(len(combined_column_names) - 1):
    news = df_combined[combined_column_names[row + 1]][column]
    # Remove the b character at the beginning of the string
    if news[0] == "b":
        news = " " + news[1:]
    news_sum.append(news)
    label_sum.append(df_combined[combined_column_names[LABEL_COLUMN]][column])

# Create the new DataFrame
df_sum_news_labels = pd.DataFrame(data = label_sum, index = None, columns = ["Label"])
df_sum_news_labels["News"] = news_sum

# Removing punctuations
temp_news = []
for line in news_sum:
  temp_attach = ""
  for word in line:
    temp = " "
    if word not in string.punctuation:
      temp = word
    temp_attach = temp_attach + "".join(temp)
  temp_news.append(temp_attach)

news_sum = temp_news
temp_news = []

# Remove numbers
for line in news_sum:
  temp_attach = ""
  for word in line:
    temp = " "
    if not word.isdigit():
      temp = word
    temp_attach = temp_attach + "".join(temp)
  temp_news.append(temp_attach)

# Remove space
for line in range(len(temp_news)):    
  temp_news[line] = " ".join(temp_news[line].split())

# Converting headlines to lower case
for line in range(len(temp_news)): 
    temp_news[line] = temp_news[line].lower()

# Update the data frame
df_sum_news_labels["News"] = temp_news

# Load the stop words
stop_words = set(stopwords.words('english'))

filtered_sentence = []
news_sum = df_sum_news_labels["News"]

# Remove stop words
for line in news_sum:
  word_tokens = word_tokenize(line)
  temp_attach = ""
  for word in word_tokens:
    temp = " "
    if word not in stop_words:
      temp = temp + word
    temp_attach = temp_attach + "".join(temp)
  filtered_sentence.append(temp_attach)

# Remove space
for line in range(len(filtered_sentence)):    
  filtered_sentence[line] = " ".join(filtered_sentence[line].split())

# Update the data frame
df_sum_news_labels["News"] = filtered_sentence

news_sum = df_sum_news_labels["News"]
null_indexes = []
index = 0

for line in news_sum:
  if line == "":
    null_indexes.append(index)
  index = index + 1

print("\nNull indexes: " + str(null_indexes) + "\n")

for row in null_indexes:
  df_sum_news_labels = df_sum_news_labels.drop(row)

news_sum = df_sum_news_labels["News"]
null_indexes = []
index = 0

for line in news_sum:
  if line == "":
    null_indexes.append(index)
  index = index + 1
  
assert len(null_indexes) == 0

df_sum_news_labels.reset_index(inplace=True, drop=True)

# Show the data frame
print(df_sum_news_labels.head())
print()
print(df_stock.head())

INPUT_SIZE = len(df_sum_news_labels)
TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 
TEST_SIZE = int(TEST_SPLIT * INPUT_SIZE)

# Split the dataset
train_before = df_sum_news_labels[:TRAIN_SIZE] 
test_before = df_sum_news_labels[TRAIN_SIZE:]
test_before.reset_index(inplace=True, drop=True)

# Print out the length
print("\nTrain data set length: " + str(len(train_before)))
print("Test data set length: " + str(len(test_before)))
print("Split summa: " + str(len(train_before) + len(test_before)))
print("Dataset summa before split: " + str(len(df_sum_news_labels)) + "\n")

# Check that no row was lost in the split
# (named total so the built-in sum() is not shadowed)
split_sum = len(train_before) + len(test_before)
total = len(df_sum_news_labels)
assert split_sum == total

print("Train last:\n" + str(train_before.tail(1)) + "\n")

print("Test first:\n" + str(test_before.head(1)))

Start of the preprocess

The dates matched!

Problem at day 2010-10-14
Today: 11096.919921875		Yesterday: 11096.080078125		Label: 0

Problem at day 2012-11-12
Today: 12815.080078125		Yesterday: 12815.3896484375		Label: 0

Problem at day 2012-11-15
Today: 12570.9501953125		Yesterday: 12570.9501953125		Label: 0

Problem at day 2013-04-12
Today: 14865.0595703125		Yesterday: 14865.1396484375		Label: 0

Problem at day 2014-04-24
Today: 16501.650390625		Yesterday: 16501.650390625		Label: 0

Problem at day 2015-08-12
Today: 17402.509765625		Yesterday: 17402.83984375		Label: 0

Problem at day 2015-11-27
Today: 17813.390625		Yesterday: 17813.390625		Label: 0

All differences: 7
Fixed!


Null indexes: [6947, 6948, 6949, 8723, 8724, 13134, 17048, 17049]

   Label                                               News
0      0  georgia downs two russian warplanes countries ...
1      0                       breaking musharraf impeached
2      0  russia today columns troops roll south ossetia...
3     
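The label check in the cell above compares prices after truncating them with `int()`. The sketch below applies the same rule (1 when the adjusted close is at least the previous day's close, 0 otherwise) to the raw floats, so the two comparisons can be set side by side. `labels_from_closes` is a hypothetical helper, not part of the notebook, and the two prices are copied from the "Problem at day 2012-11-12" line of the output.

```python
def labels_from_closes(closes):
    """Label each day after the first: 1 if the close did not fall, else 0."""
    labels = []
    # Pair every day with its real predecessor; the first day gets no label
    for today, yesterday in zip(closes[1:], closes[:-1]):
        labels.append(1 if today >= yesterday else 0)
    return labels


# Values taken from the "Problem at day 2012-11-12" line of the output above
yesterday, today = 12815.3896484375, 12815.080078125

print(int(today) >= int(yesterday))   # comparison on truncated values
print(today >= yesterday)             # comparison on the raw floats
print(labels_from_closes([yesterday, today]))
```

The three lines print `True`, `False` and `[0]`: on the truncated values the day counts as a rise, while on the raw floats it is a small fall, which agrees with the original label 0 for that day.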

In [None]:
# Train merge
merged_news = []
merged_labels = []
temp_news = ""
in_rows_counter = 0
merged_counter = 0
for row in range(len(train_before)):
    if merged_counter == 3:
        merged_counter = 0 
        in_rows_counter = 0
    elif in_rows_counter == 7:
        temp_news = temp_news + " " + train_before["News"][row]
        merged_counter = merged_counter + 1
        merged_news.append(temp_news)
        merged_labels.append(train_before["Label"][row])
        temp_news = ""
        in_rows_counter = 0
    else:
        if in_rows_counter == 0:
            temp_news = temp_news + train_before["News"][row]
        else:
            temp_news = temp_news + " " + train_before["News"][row]
        in_rows_counter = in_rows_counter + 1

train_merged = pd.DataFrame()
train_merged["News"] = merged_news
train_merged["Label"] = merged_labels

# Test merge
merged_news = []
merged_labels = []
temp_news = ""
in_rows_counter = 0
merged_counter = 0
for row in range(len(test_before)):
    if merged_counter == 3:
        merged_counter = 0 
        in_rows_counter = 0
    elif in_rows_counter == 7:
        temp_news = temp_news + " " + test_before["News"][row]
        merged_counter = merged_counter + 1
        merged_news.append(temp_news)
        merged_labels.append(test_before["Label"][row])
        temp_news = ""
        in_rows_counter = 0
    else:
        if in_rows_counter == 0:
            temp_news = temp_news + test_before["News"][row]
        else:
            temp_news = temp_news + " " + test_before["News"][row]
        in_rows_counter = in_rows_counter + 1

test_merged = pd.DataFrame()
test_merged["News"] = merged_news
test_merged["Label"] = merged_labels




# Print out the length
print("\nTrain merged data set length: " + str(len(train_merged)))
print("Test merged data set length: " + str(len(test_merged)))
print("Split summa: " + str(len(train_merged) + len(test_merged)))
print("Dataset summa before merge and split: " + str(len(df_sum_news_labels)) + "\n")

print("Train merged last:\n" + str(train_merged.tail(1)) + "\n")

print("Test merged first:\n" + str(test_merged.head(1)))


Train merged data set length: 5071
Test merged data set length: 895
Split summa: 5966
Dataset summa before merge and split: 49717

Train merged last:
                                                   News  Label
5070  salman rushdie chastises authors protesting ch...      1

Test merged first:
                                                News  Label
0  number women england wales becoming nuns hits ...      1
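The counter logic of the merge cell above can be read as follows: the rows are processed in blocks of 25, each block yields three groups of eight headlines, the 25th row is skipped, and each group takes the label of its last headline. A compact sketch of that reading, under the assumption that plain Python lists are passed in; `merge_in_groups` and its parameter names are hypothetical:

```python
def merge_in_groups(headlines, labels, group_size=8, groups_per_block=3,
                    block_size=25):
    """Join headlines into groups of `group_size`, `groups_per_block` per block."""
    merged_news, merged_labels = [], []
    for block_start in range(0, len(headlines), block_size):
        for g in range(groups_per_block):
            start = block_start + g * group_size
            chunk = headlines[start:start + group_size]
            # An incomplete group at the very end of the data is discarded
            if len(chunk) < group_size:
                break
            merged_news.append(" ".join(chunk))
            # The group takes the label of its last headline
            merged_labels.append(labels[start + group_size - 1])
    return merged_news, merged_labels


# Dummy data with the same row counts as the train and test sets above
train_news, _ = merge_in_groups(["x"] * 42259, [0] * 42259)
test_news, _ = merge_in_groups(["x"] * 7458, [0] * 7458)
print(len(train_news), len(test_news))
```

On dummy lists of 42259 and 7458 rows this prints `5071 895`, the same lengths the notebook reports for the merged train and test sets.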


## **Logistic regression model**

Build the model on 2- to 4-word n-grams and inspect its coefficients.

In [None]:
MODEL_TYPE = str("2,4")

train_headlines = []
test_headlines = []

for row in range(0, len(train_merged.index)):
    train_headlines.append(train_merged.iloc[row, 0])

for row in range(0,len(test_merged.index)):
    test_headlines.append(test_merged.iloc[row, 0])

# show the first
print(train_headlines[0])

_gram_vectorizer_ = CountVectorizer(ngram_range=(int(MODEL_TYPE[0]),int(MODEL_TYPE[2])))
_train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

_gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
_gram_model_ = _gram_model_.fit(_train_vectorizer_, train_merged["Label"])

_gram_test_ = _gram_vectorizer_.transform(test_headlines)
_gram_predictions_ = _gram_model_.predict(_gram_test_)

print (accuracy_score(test_merged["Label"], _gram_predictions_))

georgia downs two russian warplanes countries move brink war breaking musharraf impeached russia today columns troops roll south ossetia footage fighting youtube russian tanks moving towards capital south ossetia reportedly completely destroyed georgian artillery fire afghan children raped impunity u n official says sick three year old raped nothing russian tanks entered south ossetia whilst georgia shoots two russian jets breaking georgia invades south ossetia russia warned would intervene side enemy combatent trials nothing sham salim haman sentenced years kept longer anyway feel like
The shape is: (5071, 1254113)

0.5307262569832403
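The vectorizer above is configured with `ngram_range=(2, 4)`, so every feature is a sequence of two, three or four consecutive words. The sketch below lists these sequences for a short text in plain Python. `word_ngrams` is a hypothetical helper, and splitting on whitespace is a simplifying assumption: the default `CountVectorizer` tokenizer additionally ignores single-character tokens.

```python
def word_ngrams(text, min_n=2, max_n=4):
    """List all word n-grams of a text for n from min_n to max_n."""
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the text
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


print(word_ngrams("breaking musharraf impeached"))
```

A three-word headline produces only two bigrams and one trigram, while a merged text of eight headlines produces many more, which helps explain why the feature matrix above has over a million columns.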


In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
_gram_words_best_ = _gram_vectorizer_.get_feature_names_out()
_gram_coeffs_best_ = _gram_model_.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : _gram_words_best_, 
                        'Coefficient' : _gram_coeffs_best_})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])

print(coeffdf.head(10))

                     Word  Coefficient
400721         first time     0.332402
736017        new zealand     0.256574
1093056          tear gas     0.248045
249435        court rules     0.227543
967623        says russia     0.225581
980140   security council     0.208010
1233710     world largest     0.202004
1040064        sri lankan     0.192259
1021365      social media     0.190208
1110500       three years     0.187716


In [None]:
print(coeffdf.tail(10))

                  Word  Coefficient
1028309   south africa    -0.193925
812998   phone hacking    -0.206697
900658       red cross    -0.209729
1167029        us army    -0.212450
503744       hong kong    -0.213815
995753    sexual abuse    -0.221582
116917       bin laden    -0.222097
735898        new york    -0.226199
1029275   south korean    -0.264553
64515     around world    -0.321893


## **Examining the headlines**

In this chapter I examine the coefficients of the elements that appear in the individual headlines.

In [None]:
_gram_test_ = _gram_vectorizer_.transform(train_before["News"])
_gram_predictions_ = _gram_model_.predict_proba(_gram_test_)

neutraliy_index = []
for i in range(len(_gram_predictions_)):
    if _gram_predictions_[i][0] < 0.47:
        pass
    elif _gram_predictions_[i][0] > 0.53:
        pass        
    else:
        neutraliy_index.append(i)

print(len(neutraliy_index))
print(len(train_before))
print(len(neutraliy_index) / len(train_before))
print(_gram_predictions_[neutraliy_index[0]])

5796
42259
0.13715421567003477
[0.49568784 0.50431216]
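The neutrality test in the cell above keeps an index when the class-0 probability lies between the two bounds, inclusive. A minimal sketch of the same test as a reusable function, with the bounds as parameters so the train (0.47 to 0.53) and test (0.45 to 0.55) variants share one implementation; `neutral_indexes` is a hypothetical helper, and apart from the first value, which comes from the output above, the probabilities are made up:

```python
def neutral_indexes(class0_probabilities, lower=0.47, upper=0.53):
    """Return the positions whose class-0 probability is within [lower, upper]."""
    return [i for i, p in enumerate(class0_probabilities)
            if lower <= p <= upper]


# First value is from the output above; the rest are made-up examples
probabilities = [0.49568784, 0.20, 0.53, 0.80, 0.469]

neutral = neutral_indexes(probabilities)
kept = [i for i in range(len(probabilities)) if i not in set(neutral)]
print(neutral)
print(kept)
```

This prints `[0, 2]` and `[1, 3, 4]`: positions 0 and 2 fall inside the band and would be removed, while the remaining three would be kept.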


In [None]:
_gram_test_ = _gram_vectorizer_.transform(test_before["News"])
_gram_predictions_ = _gram_model_.predict_proba(_gram_test_)

neutraliy_index = []
for i in range(len(_gram_predictions_)):
    if _gram_predictions_[i][0] < 0.45:
        pass
    elif _gram_predictions_[i][0] > 0.55:
        pass        
    else:
        neutraliy_index.append(i)

print(len(neutraliy_index))
print(len(test_before))
print(len(neutraliy_index) / len(test_before))
print(_gram_predictions_[neutraliy_index[0]])

212
7458
0.028425851434700992
[0.46518896 0.53481104]


In [None]:
_gram_test_ = _gram_vectorizer_.transform(train_before["News"])
_gram_predictions_ = _gram_model_.predict_proba(_gram_test_)

neutraliy_index = []
for i in range(len(_gram_predictions_)):
    if _gram_predictions_[i][0] < 0.47:
        pass
    elif _gram_predictions_[i][0] > 0.53:
        pass        
    else:
        neutraliy_index.append(i)

print(len(neutraliy_index))
print(len(train_before))
print(len(neutraliy_index) / len(train_before))
print(_gram_predictions_[neutraliy_index[0]])

train_before_dropped = train_before.drop(neutraliy_index)
print(len(train_before_dropped))

5796
42259
0.13715421567003477
[0.49568784 0.50431216]
36463


In [None]:
pos_label_news = []
neg_label_news = []

train_before_dropped.reset_index(inplace=True)

for row in range(len(train_before_dropped)):
    if str(train_before_dropped["Label"][row]) == "0":
        neg_label_news.append(str(train_before_dropped["News"][row]))
    elif str(train_before_dropped["Label"][row]) == "1":
        pos_label_news.append(str(train_before_dropped["News"][row]))
    else:
        pass

print(len(pos_label_news))
print(len(neg_label_news))

pos_train = pd.DataFrame()
neg_train = pd.DataFrame()

pos_labels = []
for row in range(len(pos_label_news)):
    pos_labels.append("1")

pos_train["News"] = pos_label_news 
pos_train["Label"] = pos_labels 


neg_labels = []
for row in range(len(neg_label_news)):
    neg_labels.append("0")

neg_train["News"] = neg_label_news 
neg_train["Label"] = neg_labels 

22728
13735


In [None]:
pos_train

Unnamed: 0,News,Label
0,wont america nato help us wont help us help iraq,1
1,bush puts foot georgian conflict,1
2,jewish georgian minister thanks israeli traini...,1
3,georgian army flees disarray russians advance ...,1
4,olympic opening ceremony fireworks faked,1
...,...,...
22723,australian reporter fired tweeting soldiers ra...,1
22724,netherlands legalized sex marriage divorce rat...,1
22725,earthquake slid india feet northwards matter s...,1
22726,nine await mass execution indonesia security p...,1


In [None]:
neg_train

Unnamed: 0,News,Label
0,breaking musharraf impeached,0
1,russian tanks moving towards capital south oss...,0
2,afghan children raped impunity u n official sa...,0
3,breaking georgia invades south ossetia russia ...,0
4,enemy combatent trials nothing sham salim hama...,0
...,...,...
13730,mass poisoning egypt sends hospital,0
13731,israel airlift babies born surrogates nepal,0
13732,five billion people access safe surgery,0
13733,erdoan engages war words new turkish cypriot l...,0


In [None]:
# Train merge
# Neg merge
neg_merged_news = []
neg_merged_labels = []
in_rows_counter = 0
merged_counter = 0

for i in range(int(len(neg_train) / 8)):
    temp_news = ""
    for j in range(8): #0,1...7
        temp_news = temp_news + " " + neg_train["News"][i* 8 + j]
    neg_merged_news.append(temp_news)
    neg_merged_labels.append(0)

neg_merged_df = pd.DataFrame()
neg_merged_df["News"] = neg_merged_news
neg_merged_df["Label"] = neg_merged_labels

# Pos merge
pos_merged_news = []
pos_merged_labels = []
in_rows_counter = 0
merged_counter = 0

for i in range(int(len(pos_train) / 8)):
    temp_news = ""
    for j in range(8): #0,1...7
        temp_news = temp_news + " " + pos_train["News"][i* 8 + j]
    pos_merged_news.append(temp_news)
    pos_merged_labels.append(1)

pos_merged_df = pd.DataFrame()
pos_merged_df["News"] = pos_merged_news
pos_merged_df["Label"] = pos_merged_labels

# All merge
filtered_merged_df = pd.concat([pos_merged_df, neg_merged_df])

# Do the shuffle
for i in range(SHUFFLE_CYCLE):
  filtered_merged_df = shuffle(filtered_merged_df, random_state = RANDOM_SEED)

# Reset the index
filtered_merged_df.reset_index(inplace=True, drop=True)

# Show the data frame
filtered_merged_df

Unnamed: 0,News,Label
0,aust government force citizens hand internet ...,1
1,north korea declares year unification boosts ...,0
2,crackdown fish poaching wales nets arrests st...,0
3,billionaire gives million bonus workers mos d...,1
4,danish drugmaker seeks prevent use drug makes...,1
...,...,...
4552,philippines china increasing ships disputed s...,0
4553,men find king thutmosis iii yr old temple hou...,1
4554,new york cuba first direct charter flight tak...,0
4555,ac dc drummer phil rudd marijuana conviction ...,1


In [None]:
MODEL_TYPE = str("2,4")

train_headlines = []
test_headlines = []

for row in range(0, len(filtered_merged_df.index)):
    train_headlines.append(filtered_merged_df.iloc[row, 0])

for row in range(0,len(test_merged.index)):
    test_headlines.append(test_merged.iloc[row, 0])

# show the first
print(train_headlines[0])

_gram_vectorizer_ = CountVectorizer(ngram_range=(int(MODEL_TYPE[0]),int(MODEL_TYPE[2])))
_train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

_gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
_gram_model_ = _gram_model_.fit(_train_vectorizer_, filtered_merged_df["Label"])

_gram_test_ = _gram_vectorizer_.transform(test_headlines)
_gram_predictions_ = _gram_model_.predict(_gram_test_)

print (accuracy_score(test_merged["Label"], _gram_predictions_))

 aust government force citizens hand internet passwords sending jail refuse top muslim cleric russia tatarstan province shot dead thursday another wounded car bomb attacks province leader local religious authorities said probably related priests criticism radical islamists beyond foxconn dirt factories making iphone troubling new findings cast doubt apple highly publicized promise improve conditions overseas factories using surveys onsite visits undercover investigations amp face face interviews factories evaluated french rightwing lawmakers raised eyebrows hooted minister territories housing ccile duflot took podium wearing floral dress uk border agency staff go strike one day olympics russia china veto western backed syria resolution un security council russia moderate muslim leaders attacked tatarstan ian tomlinson death pc guilty
The shape is: (4557, 1162633)

0.5150837988826815


In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
_gram_words_best_ = _gram_vectorizer_.get_feature_names_out()
_gram_coeffs_best_ = _gram_model_.coef_.tolist()[0]

coeffdf = pd.DataFrame({'Word' : _gram_words_best_, 
                        'Coefficient' : _gram_coeffs_best_})

coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])

print(coeffdf.head(10))

                     Word  Coefficient
94584            bbc news     0.404294
682451        new zealand     0.233953
1013166          tear gas     0.217490
1124647         west bank     0.206855
896924        says russia     0.201999
795496        pro russian     0.194115
169507   chemical weapons     0.175787
48445            anti gay     0.167968
1143531     world largest     0.166182
567331      latin america     0.162423


In [None]:
print(coeffdf.tail(10))

                    Word  Coefficient
1154495        years ago    -0.204712
916246   sentenced years    -0.208253
953870      south korean    -0.220199
510470      iran nuclear    -0.240260
682335          new york    -0.242104
696288   nuclear weapons    -0.247959
467571         hong kong    -0.249882
108731         bin laden    -0.280259
923012      sexual abuse    -0.321462
60362       around world    -0.462350
