# **Stock market news feed semantic analysis** *(Baseline LogReg)*

In this notebook I examine the datasets I have mined and collected so far using the traditional bag-of-words and logistic regression approach. After that I also try n-gram models. The sources and references I used for the results:


*   https://colab.research.google.com/drive/1QPrBkh-KwX6qcUtiNWKp9rJoneBfGEVh#scrollTo=bQUJwMjYYN4- *(own work, reworked)*
*   https://colab.research.google.com/drive/1MdpXGCj2fb3g1BI_XfF54OWLkYQCZBBy#scrollTo=LndWT2Kn-UMK *(own baseline work)*
*   https://www.kaggle.com/ndrewgele/omg-nlp-with-the-djia-and-reddit#Basic-Model-Training-and-Testing
*   https://www.kaggle.com/lseiyjg/use-news-to-predict-stock-markets





I organize the notebook into separate chapters by dataset. In each chapter I state the source of the data and, where I mined it myself, how it was obtained.

## **Project setup**

Mount Google Drive so that the required files can be loaded later. I load each file right before it is used.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Import the libraries required for the project.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
import pandas_datareader as web
from numpy.random import MT19937
from numpy.random import RandomState, SeedSequence
import string
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt')
from nltk.tokenize import word_tokenize  
from sklearn.utils import shuffle
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score 
from sklearn.metrics import confusion_matrix

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Define the constants used in the project.

In [3]:
# Shuffle cycle number for the dataframe
SHUFFLE_CYCLE = 500

# Which dataset will be used
DATASET = 1

For reproducibility I define a seed for the random number generator, which I use throughout the rest of the notebook. The same cell also sets the training iteration limit and the train/test split ratios.

In [4]:
# Random seed
RANDOM_SEED = 1234

# Numpy random seed
NP_SEED = 1234

# Max iteration for training
MAX_ITER = 100000

# Train size
TRAIN_SPLIT = 0.80

# Test size
TEST_SPLIT = 0.2

In [5]:
np.random.seed(NP_SEED)

## **KAG_REDDIT_WRLD_DJIA_DF (1)**

This dataset contains the top 25 news headlines from Reddit's World News category for the period 2008-08-08 to 2016-07-01. I did not collect this dataset myself; its source is:
Sun, J. (2016, August). Daily News for Stock Market Prediction, Version 1. Retrieved 2021.02.19. from https://www.kaggle.com/aaron7sun/stocknews

Load the dataset from my mounted Drive.

In [6]:
if DATASET == 1:
    # Copy the dataset to the local environment
    !cp "/content/drive/MyDrive/Combined_News_DJIA.csv" "Combined_News_DJIA.csv"

    # Check that the copy succeeded -> no assertion error means success
    read = !ls
    assert read[0].find("Combined_News_DJIA.csv") != -1

I use the following two arrays to store and index the results.

In [7]:
if DATASET == 1:
    model_type = ["Bag of words", "1,2 n-gram", "2,2 n-gram", 
                  "1,3 n-gram", "2,3 n-gram", "3,3 n-gram"]

    result = []              

Constant definition.

In [8]:
if DATASET == 1:
    # Number of merged news into one string
    ROWS = 2

### Preparing the text

Load the dataset.

In [9]:
if DATASET == 1:
    # Load the dataset 
    df_combined = pd.read_csv('Combined_News_DJIA.csv', index_col = "Date")

    # Show the dataframe
    print(df_combined.head())

            Label  ...                                              Top25
Date               ...                                                   
2008-08-08      0  ...           b"No Help for Mexico's Kidnapping Surge"
2008-08-11      1  ...  b"So this is what it's come to: trading sex fo...
2008-08-12      0  ...  b"BBC NEWS | Asia-Pacific | Extinction 'by man...
2008-08-13      0  ...  b'2006: Nobel laureate Aleksander Solzhenitsyn...
2008-08-14      1  ...  b'Philippines : Peace Advocate say Muslims nee...

[5 rows x 26 columns]


Out of curiosity, I next check whether the dataset's labels are correct. According to the source, the label is 1 if the value rose or stayed the same on that day, and 0 if it fell (the day's Adj Close compared with the previous day's).

In [10]:
if DATASET == 1:
    # Load the stock data
    df_stock = web.DataReader("DJIA", data_source="yahoo", start="2008-08-08", 
                              end="2016-07-01")
    
    # Show the stock data
    print(df_stock.head())

                    High           Low  ...      Volume     Adj Close
Date                                    ...                          
2008-08-08  11808.490234  11344.230469  ...  4966810000  11734.320312
2008-08-11  11933.549805  11580.190430  ...  5067310000  11782.349609
2008-08-12  11830.389648  11541.429688  ...  4711290000  11642.469727
2008-08-13  11689.049805  11377.370117  ...  4787600000  11532.959961
2008-08-14  11744.330078  11399.839844  ...  4064000000  11615.929688

[5 rows x 6 columns]


I bring the dates to a common format so that they can be compared.

In [11]:
if DATASET == 1:
    temp_day = []

    for day in range(len(df_stock)):
        temp_day.append(df_stock.index[day].date())

    df_stock.index = temp_day

    # Show the stock data
    print(df_stock.head())

                    High           Low  ...      Volume     Adj Close
2008-08-08  11808.490234  11344.230469  ...  4966810000  11734.320312
2008-08-11  11933.549805  11580.190430  ...  5067310000  11782.349609
2008-08-12  11830.389648  11541.429688  ...  4711290000  11642.469727
2008-08-13  11689.049805  11377.370117  ...  4787600000  11532.959961
2008-08-14  11744.330078  11399.839844  ...  4064000000  11615.929688

[5 rows x 6 columns]
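
The same normalization can also be done without an explicit loop. The following is only a sketch on a hypothetical two-row frame (`stock` stands in for `df_stock`); it formats the timestamps as strings so that they compare equal to the string dates read from the CSV.

```python
import pandas as pd

# Hypothetical two-row frame standing in for df_stock
stock = pd.DataFrame(
    {"Adj Close": [11734.32, 11782.35]},
    index=pd.to_datetime(["2008-08-08", "2008-08-11"]),
)

# Render every timestamp as YYYY-MM-DD text in a single vectorized call
stock.index = stock.index.strftime("%Y-%m-%d")

print(list(stock.index))
```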


First I check whether the dates match.

In [12]:
if DATASET == 1:
    difference = []

    if len(df_combined) == len(df_stock):
        print("The lengths are the same!")

    for day in range(max(len(df_combined), len(df_stock))):
        if str(df_combined.index[day]) != str(df_stock.index[day]):
            print("There is difference at: " + str(day) + " index")
            print("News: " + str(df_combined.index[day]) + "\tStock: " + str(df_stock.index[day]))
            difference.append(day)

    if len(difference) == 0:
        print("The dates matched!")

The lengths are the same!
The dates matched!


Check the labels.

In [13]:
if DATASET == 1:
    difference = []

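    # Note: int() truncates the prices, so moves smaller than one point compare as equal;
    # for day 0 the index day - 1 wraps around to the last row
    for day in range(len(df_stock)):
        # label should be 1 -> rise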
        if int(df_stock["Adj Close"][day]) >= int(df_stock["Adj Close"][day - 1]):
            if df_combined["Label"][day] != 1:
                difference.append(str(df_stock.index[day]))
                print("Problem at day " + str(df_stock.index[day]))
                print("Today: " + str(df_stock["Adj Close"][day]) +"\t\tYesterday: " + str(df_stock["Adj Close"][day - 1]) + "\t\tLabel: " + str(df_combined["Label"][day]) + "\n")

        # label should be 0 -> fall
        if int(df_stock["Adj Close"][day]) < int(df_stock["Adj Close"][day - 1]):
            if df_combined["Label"][day] != 0:
                difference.append(str(df_stock.index[day]))
                print("Problem at day " + str(df_stock.index[day]))
                print("Today: " + str(df_stock["Adj Close"][day]) +"\t\tYesterday: " + str(df_stock["Adj Close"][day - 1]) + "\t\tLabel: " + str(df_combined["Label"][day]) + "\n")

    print("All differences: " + str(len(difference)))      

Problem at day 2010-10-14
Today: 11096.919921875		Yesterday: 11096.080078125		Label: 0

Problem at day 2012-11-12
Today: 12815.080078125		Yesterday: 12815.3896484375		Label: 0

Problem at day 2012-11-15
Today: 12570.9501953125		Yesterday: 12570.9501953125		Label: 0

Problem at day 2013-04-12
Today: 14865.0595703125		Yesterday: 14865.1396484375		Label: 0

Problem at day 2014-04-24
Today: 16501.650390625		Yesterday: 16501.650390625		Label: 0

Problem at day 2015-08-12
Today: 17402.509765625		Yesterday: 17402.83984375		Label: 0

Problem at day 2015-11-27
Today: 17813.390625		Yesterday: 17813.390625		Label: 0

All differences: 7
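
As a cross-check of the labeling rule, the expected labels can also be derived directly from the closing prices without truncating them to integers. This is a minimal sketch on hypothetical prices (the series `adj_close` and its values are made up for illustration, not taken from the dataset).

```python
import pandas as pd

# Hypothetical closing prices, not taken from the dataset
adj_close = pd.Series(
    [100.0, 101.5, 101.5, 99.0],
    index=["d0", "d1", "d2", "d3"],
)

# Difference to the previous day; the first day has no predecessor and becomes NaN
change = adj_close.diff()

# 1 when the price rose or stayed the same, 0 when it fell
expected_label = (change >= 0).astype(int)

# Drop the first day, which cannot be labeled without a previous close
expected_label = expected_label[change.notna()]

print(expected_label.tolist())
```

Comparing the floating-point values keeps small moves such as a drop of a few tenths of a point distinguishable from an unchanged close.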


The output shows that the label is wrong in a few places. After some digging I found that the dataset authors' price query itself had been faulty for a few days, so I correct these labels and then save the corrected dataset to my Drive.

In [14]:
if DATASET == 1:
    # correct the wrong labels
    for row in difference:
        if df_combined.loc[row, "Label"] == 0:
            df_combined.loc[row, "Label"] = 1
        else:
            df_combined.loc[row, "Label"] = 0

    # check them
    for row in difference:
        print(str(row) + "\t\t" + str(df_combined.loc[row, "Label"]))

2010-10-14		1
2012-11-12		1
2012-11-15		1
2013-04-12		1
2014-04-24		1
2015-08-12		1
2015-11-27		1


In [15]:
if DATASET == 1:
    # save to drive
    df_combined.to_csv('drive/MyDrive/Kaggle dataset/Reddit Top 25 DJIA/KAG_REDDIT_WRLD_DJIA_DF_corrected.csv')

    # Show the dataset
    print(df_combined.head())

            Label  ...                                              Top25
Date               ...                                                   
2008-08-08      0  ...           b"No Help for Mexico's Kidnapping Surge"
2008-08-11      1  ...  b"So this is what it's come to: trading sex fo...
2008-08-12      0  ...  b"BBC NEWS | Asia-Pacific | Extinction 'by man...
2008-08-13      0  ...  b'2006: Nobel laureate Aleksander Solzhenitsyn...
2008-08-14      1  ...  b'Philippines : Peace Advocate say Muslims nee...

[5 rows x 26 columns]


Next I look for any days or cells without data and replace them with a blank (single-space) string. This keeps the later text processing free of errors.

In [16]:
if DATASET == 1:
    # Find the cells with NaN and after the rows for them
    is_NaN = df_combined.isnull()
    row_has_NaN = is_NaN.any(axis = 1)
    rows_with_NaN = df_combined[row_has_NaN]

    # Replace them
    df_combined = df_combined.replace(np.nan, " ")

    # Check the process
    is_NaN = df_combined.isnull()
    row_has_NaN = is_NaN.any(axis = 1)
    rows_with_NaN = df_combined[row_has_NaN]

    assert len(rows_with_NaN) == 0

In [17]:
if DATASET == 1:    
    # Hold out the last 10 days for testing and comparing the models
    df_for_test = df_combined.tail(10)
    df_combined.drop(df_combined.tail(10).index,inplace=True) # drop last n rows

Next I concatenate the news headlines that belong to the same day into shared strings. The number of headlines per string is defined by a constant:


*   ROWS - the number of headlines concatenated together

This step also contains my first preprocessing algorithm: removing the b character found at the beginning of the strings.

In [18]:
if DATASET == 1:
    # Get column names
    combined_column_names = []
    for column in df_combined.columns:
      combined_column_names.append(column)

    # 2D array creation for the news based on macros
    COLUMNS = len(df_combined)
    news_sum = [[0 for i in range(COLUMNS)] for j in range(int((len(combined_column_names) - 1) / ROWS))]  

    # Show the column names
    print("Column names of the dataset:") 
    print(combined_column_names)

    # Merge the news
    for row in range(len(df_combined)):
      for column in range(int((len(combined_column_names) - 1) / ROWS)):
        temp = ""
        news = ""
        for word in range(ROWS):
          news = df_combined[combined_column_names[(column * ROWS) + (word + 1)]][row]
          # Remove the b character at the beginning of the string
          if news[0] == "b":
            news = " " + news[1:]
          temp = temp + " " + news
        news_sum[column][row] = temp

    # Show the first day second package of the news
    print("\nThe first day second package of the news:")
    print(news_sum[1][0])

Column names of the dataset:
['Label', 'Top1', 'Top2', 'Top3', 'Top4', 'Top5', 'Top6', 'Top7', 'Top8', 'Top9', 'Top10', 'Top11', 'Top12', 'Top13', 'Top14', 'Top15', 'Top16', 'Top17', 'Top18', 'Top19', 'Top20', 'Top21', 'Top22', 'Top23', 'Top24', 'Top25']

The first day second package of the news:
  'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)'  'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire'
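
The grouping step can be sketched more compactly with list slicing. The helper names `strip_bytes_prefix` and `merge_in_groups` below are hypothetical, and the prefix check is a slightly stricter variant of the notebook's: it removes the leading b only when a quote follows it, so a headline that genuinely starts with the letter b stays intact.

```python
def strip_bytes_prefix(text):
    """Drop a leading b left over from a bytes literal such as b'...'."""
    if text.startswith(("b'", 'b"')):
        return text[1:]
    return text


def merge_in_groups(headlines, group_size):
    """Join consecutive headlines into strings of group_size items each.

    Trailing headlines that do not fill a whole group are dropped,
    which mirrors the integer division used in the notebook.
    """
    cleaned = [strip_bytes_prefix(h) for h in headlines]
    n_groups = len(cleaned) // group_size
    return [
        " ".join(cleaned[i * group_size:(i + 1) * group_size])
        for i in range(n_groups)
    ]


# Hypothetical headlines for one day
day = ['b"First headline"', "b'Second headline'", "Third headline"]
merged = merge_in_groups(day, 2)
print(merged)
```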


Next I replace the previous columns (Top1, Top2, ...) with the number of columns that matches the grouping, named News_1, News_2, ..., and fill them with the merged news packages.

In [19]:
if DATASET == 1:
    # Drop the old columns
    for column in range(len(combined_column_names) - 1):
      df_combined.drop(combined_column_names[column + 1], axis = 1, inplace = True)

    # Create the new columns with the merged news
    for column in range(int((len(combined_column_names) - 1) / ROWS)):
      colum_name = "News_" + str(column + 1)
      df_combined[colum_name] = news_sum[column]

    # Show the DataFrame
    print(df_combined.head())

            Label  ...                                            News_12
Date               ...                                                   
2008-08-08      0  ...    'Indian shoe manufactory  - And again in a s...
2008-08-11      1  ...    'Perhaps *the* question about the Georgia - ...
2008-08-12      0  ...    'Christopher King argues that the US and NAT...
2008-08-13      0  ...    ' Quarter of Russians blame U.S. for conflic...
2008-08-14      1  ...    'Russia: World  "can forget about" Georgia\'...

[5 rows x 13 columns]


I regroup the news blocks with their labels into a new dataframe, now without the dates.

In [20]:
if DATASET == 1:
    # The label column 
    LABEL_COLUMN = 0

    news_sum = []
    label_sum = []

    # Get the column names
    combined_column_names = []
    for column in df_combined.columns:
      combined_column_names.append(column)

    # Write out the column names 
    print(combined_column_names)
    print("\n")

    # Connect the merged news with the labels
    for column in range(len(df_combined)):
      for row in range(len(combined_column_names) - 1):
        news_sum.append(df_combined[combined_column_names[row + 1]][column])
        label_sum.append(df_combined[combined_column_names[LABEL_COLUMN]][column])

    # Create the new DataFrame
    df_sum_news_labels = pd.DataFrame(data = label_sum, index = None, columns = ["Label"])
    df_sum_news_labels["News"] = news_sum

    # Show it
    print(df_sum_news_labels.head())

['Label', 'News_1', 'News_2', 'News_3', 'News_4', 'News_5', 'News_6', 'News_7', 'News_8', 'News_9', 'News_10', 'News_11', 'News_12']


   Label                                               News
0      0    "Georgia 'downs two Russian warplanes' as co...
1      0    'Russia Today: Columns of troops roll into S...
2      0    "Afghan children raped with 'impunity,' U.N....
3      0    "Breaking: Georgia invades South Ossetia, Ru...
4      0    'Georgian troops retreat from S. Osettain ca...


I start with text preprocessing: removing punctuation, removing numbers, and removing superfluous whitespace. Then I convert every word to lowercase.

In [21]:
if DATASET == 1:
    # Removing punctuations
    temp_news = []
    for line in news_sum:
      temp_attach = ""
      for word in line:
        temp = " "
        if word not in string.punctuation:
          temp = word
        temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    news_sum = temp_news
    temp_news = []

    # Remove numbers
    for line in news_sum:
      temp_attach = ""
      for word in line:
        temp = " "
        if not word.isdigit():
          temp = word
        temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    # Remove space
    for line in range(len(temp_news)):    
      temp_news[line] = " ".join(temp_news[line].split())

    # Converting headlines to lower case
    for line in range(len(temp_news)): 
        temp_news[line] = temp_news[line].lower()

    # Update the data frame
    df_sum_news_labels["News"] = temp_news

    # Show it
    print(df_sum_news_labels.head())

   Label                                               News
0      0  georgia downs two russian warplanes as countri...
1      0  russia today columns of troops roll into south...
2      0  afghan children raped with impunity u n offici...
3      0  breaking georgia invades south ossetia russia ...
4      0  georgian troops retreat from s osettain capita...
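
Roughly the same cleaning steps can be packed into one small function. This is a sketch using only the standard library; the function name `clean_headline` and the sample sentence are hypothetical.

```python
import re
import string

# Translation table that turns every ASCII punctuation character into a space
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def clean_headline(text):
    """Replace punctuation and digits with spaces, collapse whitespace, lowercase."""
    text = text.translate(_PUNCT_TO_SPACE)
    text = re.sub(r"\d", " ", text)
    return " ".join(text.split()).lower()


print(clean_headline("  'U.N. says 25 troops left S. Ossetia!'  "))
```

Replacing characters with spaces rather than deleting them keeps abbreviations such as U.N. as separate tokens, which matches the "u n" seen in the output above.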


Next I remove the so-called filler words (stop words).

In [22]:
if DATASET == 1:
    # Load the stop words
    stop_words = set(stopwords.words('english'))

    filtered_sentence = []
    news_sum = df_sum_news_labels["News"]

    # Remove stop words
    for line in news_sum:
      word_tokens = word_tokenize(line)
      temp_attach = ""
      for word in word_tokens:
        temp = " "
        if not word in stop_words:
          temp = temp + word
        temp_attach = temp_attach + "".join(temp)
      filtered_sentence.append(temp_attach)

    # Remove space
    for line in range(len(filtered_sentence)):    
      filtered_sentence[line] = " ".join(filtered_sentence[line].split())

    # Update the data frame
    df_sum_news_labels["News"] = filtered_sentence

    # Show the DataFrame
    print(df_sum_news_labels.head())

   Label                                               News
0      0  georgia downs two russian warplanes countries ...
1      0  russia today columns troops roll south ossetia...
2      0  afghan children raped impunity u n official sa...
3      0  breaking georgia invades south ossetia russia ...
4      0  georgian troops retreat osettain capital presu...
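
The filtering itself boils down to a membership test per token. The sketch below uses a tiny hypothetical stop list instead of the NLTK one, and plain whitespace splitting instead of `word_tokenize`, so that it runs without any downloads.

```python
# Tiny hypothetical stand-in for the NLTK English stop word list
STOP_WORDS = {"as", "of", "into", "with", "from", "the"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


print(remove_stop_words("russia today columns of troops roll into south ossetia"))
```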


Next I find the zero-length news strings in the dataset and delete the rows that belong to them.

In [23]:
if DATASET == 1:
    news_sum = df_sum_news_labels["News"]
    null_indexes = []
    index = 0

    for line in news_sum:
      if line == "":
        null_indexes.append(index)
      index = index + 1

    print(null_indexes)

    for row in null_indexes:
      df_sum_news_labels = df_sum_news_labels.drop(row)

    news_sum = df_sum_news_labels["News"]
    null_indexes = []
    index = 0

    for line in news_sum:
      if line == "":
        null_indexes.append(index)
      index = index + 1
      
    assert len(null_indexes) == 0

[3335]
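
An alternative way to drop the empty rows is a boolean mask on the string length. This is a sketch on a hypothetical three-row frame, not the notebook's own method.

```python
import pandas as pd

# Hypothetical frame with one empty news string
frame = pd.DataFrame(
    {"Label": [0, 1, 1], "News": ["georgia downs", "", "north korea"]}
)

# Keep only the rows whose news string has at least one character
non_empty = frame[frame["News"].str.len() > 0]

print(non_empty.index.tolist())
```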


Shuffle the dataset into random order.

In [24]:
if DATASET == 1:
    # Do the shuffle
    for i in range(SHUFFLE_CYCLE):
      df_sum_news_labels = shuffle(df_sum_news_labels, random_state = RANDOM_SEED)

    # Reset the index
    df_sum_news_labels.reset_index(inplace=True, drop=True)

    # Show the data frame
    print(df_sum_news_labels.head())
    # Show the test data frame
    print("\n\nTest frame")
    print(df_for_test.head())

   Label                                               News
0      1  north korea claims us government made intervie...
1      0  afghanistan pipe dreams peace son also rises f...
2      1  reportedly killed israeli strikes gaza reminde...
3      1  austerity struck paris france hit wave street ...
4      1  fukushima kids diagnosed thyroid cancer second...


Test frame
            Label  ...                                              Top25
Date               ...                                                   
2016-06-20      1  ...  Wikileaks founder Julian Assange marks 5 years...
2016-06-21      1  ...  Russian football fan leader Alexander Shprygin...
2016-06-22      0  ...  N. Korea launches what appears to be Musudan m...
2016-06-23      1  ...  The prime minister of India is set to get a br...
2016-06-24      0  ...  A Turkish man has been found guilty of insulti...

[5 rows x 26 columns]
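
For comparison, a single seeded permutation is already reproducible. The sketch below uses pandas' `DataFrame.sample` on a hypothetical five-row frame; it is an alternative to the repeated `shuffle` calls, not what the notebook does.

```python
import pandas as pd

# Hypothetical frame standing in for df_sum_news_labels
frame = pd.DataFrame({"Label": [0, 1, 0, 1, 1], "News": list("abcde")})

# One seeded permutation of all rows, followed by a fresh 0..n-1 index
shuffled = frame.sample(frac=1, random_state=1234).reset_index(drop=True)

# Repeating the call with the same seed yields the same order
again = frame.sample(frac=1, random_state=1234).reset_index(drop=True)

print(shuffled.equals(again))
```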


Split the dataset into training and validation/test sets, then verify the split by checking the sizes and by printing the last training row and the first test row.

In [25]:
if DATASET == 1:
    INPUT_SIZE = len(df_sum_news_labels)
    TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 
    TEST_SIZE = int(TEST_SPLIT * INPUT_SIZE)

    # Split the dataset
    train = df_sum_news_labels[:TRAIN_SIZE] 
    test = df_sum_news_labels[TRAIN_SIZE:]

    # Print out the length
    print("Train data set length: " + str(len(train)))
    print("Test data set length: " + str(len(test)))
    print("Split summa: " + str(len(train) + len(test)))
    print("Dataset summa before split: " + str(len(df_sum_news_labels)))

    # Check that no rows were lost in the split
    split_sum = len(train) + len(test)
    total = len(df_sum_news_labels)
    assert split_sum == total

Train data set length: 18997
Test data set length: 4750
Split summa: 23747
Dataset summa before split: 23747


In [26]:
if DATASET == 1:
    print(train.tail(1))

       Label                                               News
18996      1  murdoch may lose grip news corp libyan rebels ...


In [27]:
if DATASET == 1:
    print(test.head(1))

       Label                                               News
18997      0  dengue world fastest spreading tropical diseas...
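
The same kind of ordered 80/20 split can be sketched with scikit-learn's `train_test_split`. The frame below is hypothetical; `shuffle=False` is used because the rows are assumed to be shuffled already.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row frame standing in for the shuffled dataset
frame = pd.DataFrame(
    {"Label": [0, 1] * 5, "News": [f"headline {i}" for i in range(10)]}
)

# Keep the existing order: the first 80% become training data, the rest test data
train_part, test_part = train_test_split(frame, test_size=0.2, shuffle=False)

print(len(train_part), len(test_part))
```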


### Bag of words

First I collect the headlines of the training set into a list.

In [28]:
if DATASET == 1:
    train_headlines = []

    for row in range(0, len(train.index)):
        train_headlines.append(train.iloc[row, 1])

    # show the first
    print(train_headlines[0])

north korea claims us government made interview uk officials named pedophile dossier


Then I vectorize them.

In [29]:
if DATASET == 1:
    bow_vectorizer = CountVectorizer()
    bow_train = bow_vectorizer.fit_transform(train_headlines)
    print(bow_train.shape)

(18997, 29603)
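
To make the shape above concrete: each row is a document and each column counts one vocabulary word. A minimal sketch on three hypothetical documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical documents
docs = ["russia sends troops", "troops leave russia", "stock market falls"]

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column; columns follow alphabetical order
vocabulary = sorted(vectorizer.vocabulary_)

print(vocabulary)
print(matrix.toarray())
```

Word order inside a document is lost: the first two documents differ only in one word and in the order of the others, which is the limitation the n-gram models later try to address.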


I train a logistic regression model on this training set.

In [30]:
if DATASET == 1:
    bow_model = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    bow_model = bow_model.fit(bow_train, train["Label"])

The next step is to prepare the test set and then predict with the model.

In [31]:
if DATASET == 1:
    test_headlines = []

    for row in range(0,len(test.index)):
        test_headlines.append(test.iloc[row, 1])

    bow_test = bow_vectorizer.transform(test_headlines)
    bow_predictions = bow_model.predict(bow_test)

Display the results in a table.

In [32]:
if DATASET == 1:
    # print() is needed here: an expression inside an if block is not displayed automatically
    print(pd.crosstab(test["Label"], bow_predictions, rownames=["Actual"], colnames=["Predicted"]))

The accuracy of the model.

In [33]:
if DATASET == 1:
    print (classification_report(test["Label"], bow_predictions))
    print (accuracy_score(test["Label"], bow_predictions))

    result.append(accuracy_score(test["Label"], bow_predictions))

              precision    recall  f1-score   support

           0       0.48      0.43      0.45      2200
           1       0.55      0.59      0.57      2550

    accuracy                           0.52      4750
   macro avg       0.51      0.51      0.51      4750
weighted avg       0.51      0.52      0.52      4750

0.5178947368421053
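
To put this accuracy in context, it helps to compare it with the share of the more frequent class. The sketch below only uses the support counts and the accuracy printed in the report above; the variable names are hypothetical.

```python
# Support counts and accuracy copied from the classification report above
support_down = 2200   # test samples with label 0
support_up = 2550     # test samples with label 1
reported_accuracy = 0.5178947368421053

total = support_down + support_up

# Accuracy of a trivial model that always predicts the more frequent class
majority_baseline = max(support_down, support_up) / total

# Number of correctly classified test samples implied by the accuracy
correct_predictions = round(reported_accuracy * total)

print(round(majority_baseline, 4))
print(correct_predictions)
```

On this split, always predicting label 1 would score about 0.5368, so the bag-of-words model stays slightly below that trivial baseline.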


Next I display the top 10 most influential strings in both the positive and the negative direction.

In [34]:
if DATASET == 1:
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    bow_words = bow_vectorizer.get_feature_names_out()
    bow_coeffs = bow_model.coef_.tolist()[0]

    coeffdf = pd.DataFrame({'Word' : bow_words, 
                            'Coefficient' : bow_coeffs})

    coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
    print(coeffdf.head(10))

            Word  Coefficient
20524  promising     1.632536
22417     riyadh     1.595175
23209      scrap     1.584696
7680     donetsk     1.570100
22885      sanaa     1.533908
4676     clashed     1.507165
6623   defending     1.382395
21882     repeal     1.379722
14528    landing     1.378869
22333       rift     1.377071


In [35]:
if DATASET == 1:
    print(coeffdf.tail(10))

               Word  Coefficient
5787   counterparts    -1.407548
4520       choppers    -1.409674
11973         hints    -1.421742
19566        picked    -1.483659
25188       stomach    -1.496732
652         airways    -1.534589
12191         horns    -1.579956
7017        detects    -1.612830
14628     launchers    -1.620675
9924        focused    -1.647796
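
How the coefficient ranking works can be seen on a toy corpus. This is a sketch with hypothetical documents and labels; it aligns terms with `coef_` through the vectorizer's `vocabulary_` mapping.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus: "rally" only occurs with label 1, "crash" only with label 0
docs = ["markets rally", "markets rally again", "markets crash", "markets crash hard"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(docs)
model = LogisticRegression().fit(features, labels)

# vocabulary_ maps each term to its column index, so sorting by that index
# lines the terms up with the entries of coef_
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

ranking = pd.DataFrame({"Word": terms, "Coefficient": model.coef_[0]})
ranking = ranking.sort_values("Coefficient", ascending=False).reset_index(drop=True)

print(ranking)
```

A large positive coefficient pushes the prediction toward label 1 and a large negative one toward label 0. With tens of thousands of features and only weak signal, as in the notebook, such rankings can largely reflect words that happen to occur in a few training rows.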


### 2-gram model

As before, I vectorize my training set, fit a logistic regression model to it, make predictions, and then evaluate the results. First with the (1,2) n-gram model.

In [36]:
if DATASET == 1:
    gram_vectorizer_12 = CountVectorizer(ngram_range=(1,2))
    train_vectorizer_12 = gram_vectorizer_12.fit_transform(train_headlines)

    print(train_vectorizer_12.shape)

    gram_model_12 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    gram_model_12 = gram_model_12.fit(train_vectorizer_12, train["Label"])

    gram_test_12 = gram_vectorizer_12.transform(test_headlines)
    gram_predictions_12 = gram_model_12.predict(gram_test_12)

    print (classification_report(test["Label"], gram_predictions_12))
    print (accuracy_score(test["Label"], gram_predictions_12))

    result.append(accuracy_score(test["Label"], gram_predictions_12))

(18997, 357890)
              precision    recall  f1-score   support

           0       0.49      0.42      0.45      2200
           1       0.55      0.62      0.58      2550

    accuracy                           0.53      4750
   macro avg       0.52      0.52      0.52      4750
weighted avg       0.52      0.53      0.52      4750

0.527578947368421
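
The jump in the number of features comes from `ngram_range`: (1,2) keeps single words and adds every pair of adjacent words, while (2,2) keeps only the pairs. A minimal sketch on one hypothetical headline (the helper `features_for` is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# One hypothetical headline
doc = ["russia sends troops"]


def features_for(ngram_range):
    """Return the sorted list of features extracted for the given n-gram range."""
    vectorizer = CountVectorizer(ngram_range=ngram_range)
    vectorizer.fit(doc)
    return sorted(vectorizer.vocabulary_)


unigrams_and_bigrams = features_for((1, 2))
bigrams_only = features_for((2, 2))

print(unigrams_and_bigrams)
print(bigrams_only)
```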


In [37]:
if DATASET == 1:
    gram_words_12 = gram_vectorizer_12.get_feature_names_out()
    gram_coeffs_12 = gram_model_12.coef_.tolist()[0]

    coeffdf = pd.DataFrame({'Word' : gram_words_12, 
                            'Coefficient' : gram_coeffs_12})

    coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
    print(coeffdf.head(10))

               Word  Coefficient
91588       donetsk     0.971925
270789        ruled     0.903815
289089  significant     0.863195
100623      enemies     0.852514
175020      landing     0.837202
300773         step     0.836689
172205           km     0.835461
243354   presidents     0.829768
279316        scrap     0.825647
205643       mumbai     0.824550


In [38]:
if DATASET == 1:
    print(coeffdf.tail(10))

                      Word  Coefficient
24052            authority    -0.818779
334644             us army    -0.833292
207131          mysterious    -0.842346
302563            stranded    -0.843124
109051           extremism    -0.846904
301359               stock    -0.847559
2911                  acta    -0.848259
114958              filmed    -0.920522
212300  news international    -0.957999
186710                 low    -1.094469


Second, with the (2,2) n-gram model.

In [39]:
if DATASET == 1:
    gram_vectorizer_22 = CountVectorizer(ngram_range=(2,2))
    train_vectorizer_22 = gram_vectorizer_22.fit_transform(train_headlines)

    print(train_vectorizer_22.shape)

    gram_model_22 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    gram_model_22 = gram_model_22.fit(train_vectorizer_22, train["Label"])

    gram_test_22 = gram_vectorizer_22.transform(test_headlines)
    gram_predictions_22 = gram_model_22.predict(gram_test_22)

    pd.crosstab(test["Label"], gram_predictions_22, rownames=["Actual"], colnames=["Predicted"])

    print (classification_report(test["Label"], gram_predictions_22))
    print (accuracy_score(test["Label"], gram_predictions_22))

    result.append(accuracy_score(test["Label"], gram_predictions_22))

(18997, 328287)
              precision    recall  f1-score   support

           0       0.51      0.34      0.41      2200
           1       0.56      0.72      0.63      2550

    accuracy                           0.54      4750
   macro avg       0.53      0.53      0.52      4750
weighted avg       0.53      0.54      0.52      4750

0.5416842105263158


In [40]:
if DATASET == 1:
    gram_words_22 = gram_vectorizer_22.get_feature_names_out()
    gram_coeffs_22 = gram_model_22.coef_.tolist()[0]

    coeffdf = pd.DataFrame({'Word' : gram_words_22, 
                            'Coefficient' : gram_coeffs_22})

    coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
    print(coeffdf.head(10))

                    Word  Coefficient
92649      england wales     0.850880
104367      fidel castro     0.805384
249571      russia sends     0.781674
161669     latin america     0.770149
257536  security council     0.765139
4883      afghan soldier     0.756483
159320        korean war     0.752447
30235        big brother     0.720896
50950     china military     0.720644
196252     north america     0.720637


In [41]:
if DATASET == 1:
    print(coeffdf.tail(10))

                      Word  Coefficient
17189         around world    -0.741042
68254      crimes humanity    -0.779260
277316       strait hormuz    -0.791220
276199        stock market    -0.798599
197951         nytimes com    -0.806997
4905          afghan woman    -0.813143
181720      military bases    -0.820401
306682             us army    -0.847734
252343          saudi king    -0.914661
194725  news international    -1.085222


### 3-gram model

As before, I vectorize my training set, fit a logistic regression model to it, make predictions, and then evaluate the results. First with the (1,3) n-gram model.

In [42]:
if DATASET == 1:
    gram_vectorizer_13 = CountVectorizer(ngram_range=(1,3))
    train_vectorizer_13 = gram_vectorizer_13.fit_transform(train_headlines)

    print(train_vectorizer_13.shape)

    gram_model_13 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    gram_model_13 = gram_model_13.fit(train_vectorizer_13, train["Label"])

    gram_test_13 = gram_vectorizer_13.transform(test_headlines)
    gram_predictions_13 = gram_model_13.predict(gram_test_13)

    print (classification_report(test["Label"], gram_predictions_13))
    print (accuracy_score(test["Label"], gram_predictions_13))

    result.append(accuracy_score(test["Label"], gram_predictions_13))

(18997, 751362)
              precision    recall  f1-score   support

           0       0.49      0.40      0.44      2200
           1       0.55      0.64      0.59      2550

    accuracy                           0.53      4750
   macro avg       0.52      0.52      0.52      4750
weighted avg       0.52      0.53      0.52      4750

0.5296842105263158


In [43]:
if DATASET == 1:
    gram_words_13 = gram_vectorizer_13.get_feature_names_out()
    gram_coeffs_13 = gram_model_13.coef_.tolist()[0]

    coeffdf = pd.DataFrame({'Word' : gram_words_13, 
                            'Coefficient' : gram_coeffs_13})

    coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
    print(coeffdf.head(10))

               Word  Coefficient
428927       mumbai     0.777815
629908         step     0.765280
565898        ruled     0.754003
189045      donetsk     0.727203
700085        urges     0.707539
359110           km     0.701980
590449        seize     0.676912
605508  significant     0.657024
715092        votes     0.643919
686176         turn     0.642439


In [44]:
if DATASET == 1:
    print(coeffdf.tail(10))

                      Word  Coefficient
649869              system    -0.678350
50433            authority    -0.679418
547000             removed    -0.681450
449100             nothing    -0.694423
432132          mysterious    -0.704376
6020                  acta    -0.709837
631091               stock    -0.748948
237308              filmed    -0.755930
443585  news international    -0.766026
389517                 low    -0.991269
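
One way to read these coefficients: in logistic regression a coefficient is the change in the log-odds of label 1 per unit increase of the feature, and with `CountVectorizer` a unit increase is one more occurrence of the n-gram. The sketch below converts two coefficients from the tables above into odds ratios. The helper name `odds_ratio` is hypothetical and not part of the notebook.

```python
import math

def odds_ratio(coefficient: float) -> float:
    """Multiplicative change in the odds of label 1 per extra occurrence."""
    return math.exp(coefficient)

# Coefficients copied from the (1,3) n-gram tables above.
for ngram, coef in [("mumbai", 0.777815), ("low", -0.991269)]:
    print(f"{ngram}: odds ratio {odds_ratio(coef):.2f}")
```

Values above 1 push the prediction toward label 1 and values below 1 toward label 0, with all other features held fixed. With a test accuracy of about 0.53, these weights should be interpreted cautiously.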


As before, I vectorize the training set, fit a logistic regression model to it, make predictions, and then evaluate the results. This time with the (2,3) n-gram model.
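
To make the `ngram_range=(2,3)` setting concrete, the following pure-Python sketch lists the bigrams and trigrams of a short headline. It splits on whitespace only, which is a simplification: `CountVectorizer` applies its own tokenizer, which by default also lowercases the text and ignores single-character tokens. The function name `word_ngrams` is hypothetical.

```python
def word_ngrams(text: str, min_n: int, max_n: int) -> list:
    """Return all word n-grams of text with min_n <= n <= max_n."""
    tokens = text.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens over the headline.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(word_ngrams("panama papers leak", 2, 3))
# ['panama papers', 'papers leak', 'panama papers leak']
```

Each distinct n-gram found in the training headlines becomes one column of the count matrix, which is why the feature count grows so quickly with the range.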

In [45]:
if DATASET == 1:
    gram_vectorizer_23 = CountVectorizer(ngram_range=(2,3))
    train_vectorizer_23 = gram_vectorizer_23.fit_transform(train_headlines)

    print(train_vectorizer_23.shape)

    gram_model_23 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    gram_model_23 = gram_model_23.fit(train_vectorizer_23, train["Label"])

    gram_test_23 = gram_vectorizer_23.transform(test_headlines)
    gram_predictions_23 = gram_model_23.predict(gram_test_23)

    pd.crosstab(test["Label"], gram_predictions_23, rownames=["Actual"], colnames=["Predicted"])

    print (classification_report(test["Label"], gram_predictions_23))
    print (accuracy_score(test["Label"], gram_predictions_23))

    result.append(accuracy_score(test["Label"], gram_predictions_23))

(18997, 721759)
              precision    recall  f1-score   support

           0       0.52      0.27      0.36      2200
           1       0.56      0.78      0.65      2550

    accuracy                           0.55      4750
   macro avg       0.54      0.53      0.50      4750
weighted avg       0.54      0.55      0.51      4750

0.5471578947368421
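
To judge how much weight a difference of a percentage point or two can carry, it can help to estimate the sampling noise of the accuracy itself. The sketch below uses the binomial standard error with a normal approximation. It assumes the 4750 test rows are independent, which is optimistic here because merged headlines from the same day share one label. The helper name is hypothetical.

```python
import math

def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimated on n_samples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)

accuracy = 0.5471578947368421  # (2,3) n-gram test accuracy from above
n_test = 4750                  # test support from the report above
se = accuracy_standard_error(accuracy, n_test)
low, high = accuracy - 1.96 * se, accuracy + 1.96 * se
print(f"SE = {se:.4f}, approx. 95% interval = [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval is roughly plus or minus 0.014 around the measured accuracy.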


In [46]:
if DATASET == 1:
    gram_words_23 = gram_vectorizer_23.get_feature_names_out()
    gram_coeffs_23 = gram_model_23.coef_.tolist()[0]

    coeffdf = pd.DataFrame({'Word' : gram_words_23, 
                            'Coefficient' : gram_coeffs_23})

    coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
    print(coeffdf.head(10))

                    Word  Coefficient
569628   sentenced death     0.626520
200547     england wales     0.624720
225542      fidel castro     0.616101
353263     latin america     0.600078
564961  security council     0.594668
710389     world largest     0.562634
546543      russia sends     0.526876
65616        big brother     0.519188
348008        korean war     0.517875
432950    nuclear strike     0.517595


In [47]:
if DATASET == 1:
    print(coeffdf.tail(10))

                      Word  Coefficient
608295       strait hormuz    -0.564197
455355       panama papers    -0.571619
148129     crimes humanity    -0.574374
605949        stock market    -0.590060
396819      military bases    -0.591332
672705             us army    -0.608258
434116         nytimes com    -0.658205
37537         around world    -0.668372
553235          saudi king    -0.678523
426010  news international    -0.899099


As before, I vectorize the training set, fit a logistic regression model to it, make predictions, and then evaluate the results. Finally, the (3,3) n-gram model.

In [48]:
if DATASET == 1:
    gram_vectorizer_33 = CountVectorizer(ngram_range=(3,3))
    train_vectorizer_33 = gram_vectorizer_33.fit_transform(train_headlines)

    print(train_vectorizer_33.shape)

    gram_model_33 = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    gram_model_33 = gram_model_33.fit(train_vectorizer_33, train["Label"])

    gram_test_33 = gram_vectorizer_33.transform(test_headlines)
    gram_predictions_33 = gram_model_33.predict(gram_test_33)

    pd.crosstab(test["Label"], gram_predictions_33, rownames=["Actual"], colnames=["Predicted"])

    print (classification_report(test["Label"], gram_predictions_33))
    print (accuracy_score(test["Label"], gram_predictions_33))

    result.append(accuracy_score(test["Label"], gram_predictions_33))

(18997, 393472)
              precision    recall  f1-score   support

           0       0.52      0.07      0.13      2200
           1       0.54      0.94      0.69      2550

    accuracy                           0.54      4750
   macro avg       0.53      0.51      0.41      4750
weighted avg       0.53      0.54      0.43      4750

0.5389473684210526
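
The report above shows a recall of 0.94 for label 1 and 0.07 for label 0, so the (3,3) model predicts label 1 for most test rows. A useful reference point is the accuracy of always predicting the majority class. The sketch below computes it from the supports in the report; the helper name is hypothetical.

```python
def majority_baseline_accuracy(class_supports: dict) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    total = sum(class_supports.values())
    return max(class_supports.values()) / total

# Test-set supports copied from the classification report above.
supports = {0: 2200, 1: 2550}
baseline = majority_baseline_accuracy(supports)
print(f"majority-class baseline: {baseline:.4f}")
print(f"(3,3) n-gram model:      {0.5389473684210526:.4f}")
```

Comparing against this baseline, rather than against 0.5, gives a fairer picture on a test set that is not perfectly balanced.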


In [49]:
if DATASET == 1:
    gram_words_33 = gram_vectorizer_33.get_feature_names_out()
    gram_coeffs_33 = gram_model_33.coef_.tolist()[0]

    coeffdf = pd.DataFrame({'Word' : gram_words_33, 
                            'Coefficient' : gram_coeffs_33})

    coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])
    print(coeffdf.head(10))

                                   Word  Coefficient
125287                 first time since     0.695203
390264                     year old man     0.656931
382194         wikileaks julian assange     0.573982
275468                putin says russia     0.526135
18026             approves sex marriage     0.496900
157408                  homes west bank     0.486680
390291             year old palestinian     0.477924
99637   drug decriminalization portugal     0.471425
232875                nobel peace prize     0.469363
266193            president evo morales     0.462286


In [50]:
if DATASET == 1:
    print(coeffdf.tail(10))

                            Word  Coefficient
218236    missile defense system    -0.525905
186232          kills boko haram    -0.527956
339299    syrian security forces    -0.538321
157319      homes east jerusalem    -0.541525
230284        new prime minister    -0.542140
25146               aung san suu    -0.560239
299880               san suu kyi    -0.560239
323036     sovereign wealth fund    -0.611614
255297     phone hacking scandal    -0.699162
55588   chancellor angela merkel    -0.793560


### 4-gram model

From this section on, I evaluate the models in a loop and store their results.

In [51]:
if DATASET == 1:
    for n in range(1,5):
        print("--------------------------------------------\n\nStart of the " 
              + str(n) + ",4 gram model\n")

        _gram_vectorizer_ = CountVectorizer(ngram_range=(n,4))
        _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

        print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

        _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
        _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

        _gram_test_ = _gram_vectorizer_.transform(test_headlines)
        _gram_predictions_ = _gram_model_.predict(_gram_test_)

        print (accuracy_score(test["Label"], _gram_predictions_))

        model_type.append(str(n) + ",4 n-gram")
        result.append(accuracy_score(test["Label"], _gram_predictions_))

--------------------------------------------

Start of the 1,4 gram model

The shape is: (18997, 1136449)

0.5296842105263158
--------------------------------------------

Start of the 2,4 gram model

The shape is: (18997, 1106846)

0.5469473684210526
--------------------------------------------

Start of the 3,4 gram model

The shape is: (18997, 778559)

0.539578947368421
--------------------------------------------

Start of the 4,4 gram model

The shape is: (18997, 385087)

0.5385263157894736


### 5-gram model

Here, too, I evaluate the models in a loop and store their results.

In [52]:
if DATASET == 1:
    MODEL_TYPE = 5

    for n in range(1,MODEL_TYPE+1):
        print("--------------------------------------------\n\nStart of the " 
              + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

        _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
        _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

        print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

        _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
        _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

        _gram_test_ = _gram_vectorizer_.transform(test_headlines)
        _gram_predictions_ = _gram_model_.predict(_gram_test_)

        print (accuracy_score(test["Label"], _gram_predictions_))

        model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
        result.append(accuracy_score(test["Label"], _gram_predictions_))

--------------------------------------------

Start of the 1,5 gram model

The shape is: (18997, 1504222)

0.5353684210526316
--------------------------------------------

Start of the 2,5 gram model

The shape is: (18997, 1474619)

0.5454736842105263
--------------------------------------------

Start of the 3,5 gram model

The shape is: (18997, 1146332)

0.5387368421052632
--------------------------------------------

Start of the 4,5 gram model

The shape is: (18997, 752860)

0.5408421052631579
--------------------------------------------

Start of the 5,5 gram model

The shape is: (18997, 367773)

0.5383157894736842


### 6-gram model

Here, too, I evaluate the models in a loop and store their results.

In [53]:
if DATASET == 1:
    MODEL_TYPE = 6

    for n in range(1,MODEL_TYPE+1):
        print("--------------------------------------------\n\nStart of the " 
              + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

        _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
        _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

        print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

        _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
        _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

        _gram_test_ = _gram_vectorizer_.transform(test_headlines)
        _gram_predictions_ = _gram_model_.predict(_gram_test_)

        print (accuracy_score(test["Label"], _gram_predictions_))

        model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
        result.append(accuracy_score(test["Label"], _gram_predictions_))

--------------------------------------------

Start of the 1,6 gram model

The shape is: (18997, 1853406)

0.5347368421052632
--------------------------------------------

Start of the 2,6 gram model

The shape is: (18997, 1823803)

0.5446315789473685
--------------------------------------------

Start of the 3,6 gram model

The shape is: (18997, 1495516)

0.5383157894736842
--------------------------------------------

Start of the 4,6 gram model

The shape is: (18997, 1102044)

0.5393684210526316
--------------------------------------------

Start of the 5,6 gram model

The shape is: (18997, 716957)

0.5381052631578948
--------------------------------------------

Start of the 6,6 gram model

The shape is: (18997, 349184)

0.5376842105263158


### Summary of results

Printing the results, with the best one highlighted.

In [54]:
if DATASET == 1:
    best_model = 0

    for model in range(len(model_type)):
        print(str(model_type[model]) + ":\t\t\t\t\t" + str(result[model]))

        if result[model] > best_model:
            best_model = result[model]
            best_model_index = model

    print("--------------------------------------------\nBest model:\n" 
          + str(model_type[best_model_index]) + "\t\t\t\t\t" + 
          str(result[best_model_index]))

Bag of words:					0.5178947368421053
1,2 n-gram:					0.527578947368421
2,2 n-gram:					0.5416842105263158
1,3 n-gram:					0.5296842105263158
2,3 n-gram:					0.5471578947368421
3,3 n-gram:					0.5389473684210526
1,4 n-gram:					0.5296842105263158
2,4 n-gram:					0.5469473684210526
3,4 n-gram:					0.539578947368421
4,4 n-gram:					0.5385263157894736
1,5 n-gram:					0.5353684210526316
2,5 n-gram:					0.5454736842105263
3,5 n-gram:					0.5387368421052632
4,5 n-gram:					0.5408421052631579
5,5 n-gram:					0.5383157894736842
1,6 n-gram:					0.5347368421052632
2,6 n-gram:					0.5446315789473685
3,6 n-gram:					0.5383157894736842
4,6 n-gram:					0.5393684210526316
5,6 n-gram:					0.5381052631578948
6,6 n-gram:					0.5376842105263158
--------------------------------------------
Best model:
2,3 n-gram					0.5471578947368421
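
The best-model search in the cell above can also be written with the built-in `max`. This is a minimal sketch on a subset of the printed results, using the notebook's `model_type` and `result` list names for readability.

```python
# A few (model, accuracy) pairs copied from the summary above.
model_type = ["Bag of words", "1,2 n-gram", "2,2 n-gram", "2,3 n-gram", "2,4 n-gram"]
result = [0.5178947368421053, 0.527578947368421, 0.5416842105263158,
          0.5471578947368421, 0.5469473684210526]

# max() with a key returns the first pair with the highest accuracy,
# matching the strict '>' comparison used in the loop above.
best_name, best_accuracy = max(zip(model_type, result), key=lambda pair: pair[1])
print(f"Best model: {best_name}\t{best_accuracy}")
```

Unlike the manual loop, this variant raises a `ValueError` on empty input instead of leaving the index variable undefined.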


### Optimizing the ROWS macro

In this section I run the automated training and prediction, from bag of words up to the 6,6-gram model, for different ROWS values (the number of daily headlines concatenated into one string) and determine which value gives the highest accuracy.

Specifying the parameter values to test.

In [55]:
if DATASET == 1:
    # Number of merged news into one string: 1...12, 25 
    rows_values = []
    for value in range(1,13):
        rows_values.append(value)

    rows_values.append(25)

    print(rows_values)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 25]
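
For each of these values the preprocessing merges the headline columns of a day in consecutive groups of `ROWS`; the number of groups is `int((number of columns - 1) / ROWS)`, so headlines that do not fill a complete group are left out. The sketch below illustrates the grouping on a plain list. The name `merge_headlines` and the sample data are hypothetical, and the figure of 25 headline columns per day is an assumption based on the dataset name and the last tested value.

```python
def merge_headlines(headlines: list, rows: int) -> list:
    """Join consecutive headlines in groups of `rows`; drop any remainder."""
    n_groups = len(headlines) // rows
    return [" ".join(headlines[g * rows:(g + 1) * rows]) for g in range(n_groups)]

day = ["h1", "h2", "h3", "h4", "h5"]  # hypothetical headlines for one day
print(merge_headlines(day, 2))        # 'h5' does not fill a group and is dropped

# With 25 headlines (assumed), ROWS = 4 yields 6 merged strings per day.
print(len(merge_headlines([f"h{i}" for i in range(25)], 4)))
```

Smaller `ROWS` values therefore produce more, shorter training samples per day, while `ROWS = 25` produces one long sample per day.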


Collecting the model types for the automated training.

In [56]:
if DATASET == 1:
    model_type_values = []
    for value in range(1,7):
        model_type_values.append(value)

    print(model_type_values)

[1, 2, 3, 4, 5, 6]
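
Combined with the inner loop `for n in range(1, MODEL_TYPE+1)` used in the training cells, these six values expand into every `(n, MODEL_TYPE)` pair. The sketch below enumerates the pairs and counts how many models the full ROWS sweep fits; `ngram_configs` is a hypothetical helper name.

```python
def ngram_configs(max_order: int) -> list:
    """All (min_n, max_n) pairs with 1 <= min_n <= max_n <= max_order."""
    return [(min_n, max_n)
            for max_n in range(1, max_order + 1)
            for min_n in range(1, max_n + 1)]

configs = ngram_configs(6)
rows_values = list(range(1, 13)) + [25]
print(len(configs), "n-gram ranges per ROWS value")
print(len(configs) * len(rows_values), "model fits in total")
```

The 21 ranges per ROWS value match the 21 entries of the earlier results summary, where the (1,1) range is the bag-of-words model.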


I create the following lists to store the results belonging to each parameter value.

In [57]:
if DATASET == 1:
    rows_summary_value = []
    rows_summary_accuraccy = []

Automated training and saving of the results.

In [58]:
if DATASET == 1:
    def preprocess():
        df_combined = pd.read_csv('drive/MyDrive/Kaggle dataset/Reddit Top 25 DJIA/KAG_REDDIT_WRLD_DJIA_DF_corrected.csv', 
                                index_col = "Date")

        # Find the cells with NaN, then the rows that contain them
        is_NaN = df_combined.isnull()
        row_has_NaN = is_NaN.any(axis = 1)
        rows_with_NaN = df_combined[row_has_NaN]

        # Replace them
        df_combined = df_combined.replace(np.nan, " ")

        # Check the process
        is_NaN = df_combined.isnull()
        row_has_NaN = is_NaN.any(axis = 1)
        rows_with_NaN = df_combined[row_has_NaN]

        assert len(rows_with_NaN) == 0

        # Set aside the last 10 days for testing and comparing the models
        df_for_test = df_combined.tail(10)
        df_combined.drop(df_combined.tail(10).index,inplace=True) # drop last n rows

        # Get column names
        combined_column_names = []
        for column in df_combined.columns:
          combined_column_names.append(column)

        # 2D array creation for the news based on macros
        COLUMNS = len(df_combined)
        news_sum = []
        news_sum = [[0 for i in range(COLUMNS)] for j in range(int((len(combined_column_names) - 1) / ROWS))]  

        # Merge the news
        for row in range(len(df_combined)):
          for column in range(int((len(combined_column_names) - 1) / ROWS)):
            temp = ""
            news = ""
            for word in range(ROWS):
              news = df_combined[combined_column_names[(column * ROWS) + (word + 1)]][row]
              # Remove the b character at the beginning of the string
              if news[0] == "b":
                news = " " + news[1:]
              temp = temp + " " + news
            news_sum[column][row] = temp

        # Drop the old columns
        for column in range(len(combined_column_names) - 1):
          df_combined.drop(combined_column_names[column + 1], axis = 1, inplace = True)

        # Create the new columns with the merged news
        for column in range(int((len(combined_column_names) - 1) / ROWS)):
          column_name = "News_" + str(column + 1)
          df_combined[column_name] = news_sum[column]

        # The label column 
        LABEL_COLUMN = 0

        news_sum = []
        label_sum = []

        # Get the column names
        combined_column_names = []
        for column in df_combined.columns:
          combined_column_names.append(column)

        # Connect the merged news with the labels
        for column in range(len(df_combined)):
          for row in range(len(combined_column_names) - 1):
            news_sum.append(df_combined[combined_column_names[row + 1]][column])
            label_sum.append(df_combined[combined_column_names[LABEL_COLUMN]][column])

        # Create the new DataFrame
        df_sum_news_labels = pd.DataFrame(data = label_sum, index = None, columns = ["Label"])
        df_sum_news_labels["News"] = news_sum

        # Removing punctuations
        temp_news = []
        for line in news_sum:
          temp_attach = ""
          for word in line:
            temp = " "
            if word not in string.punctuation:
              temp = word
            temp_attach = temp_attach + "".join(temp)
          temp_news.append(temp_attach)

        news_sum = temp_news
        temp_news = []

        # Remove numbers
        for line in news_sum:
          temp_attach = ""
          for word in line:
            temp = " "
            if not word.isdigit():
              temp = word
            temp_attach = temp_attach + "".join(temp)
          temp_news.append(temp_attach)

        # Remove space
        for line in range(len(temp_news)):    
          temp_news[line] = " ".join(temp_news[line].split())

        # Converting headlines to lower case
        for line in range(len(temp_news)): 
            temp_news[line] = temp_news[line].lower()

        # Update the data frame
        df_sum_news_labels["News"] = temp_news

        # Load the stop words
        stop_words = set(stopwords.words('english'))

        filtered_sentence = []
        news_sum = df_sum_news_labels["News"]

        # Remove stop words
        for line in news_sum:
          word_tokens = word_tokenize(line)
          temp_attach = ""
          for word in word_tokens:
            temp = " "
            if not word in stop_words:
              temp = temp + word
            temp_attach = temp_attach + "".join(temp)
          filtered_sentence.append(temp_attach)

        # Remove space
        for line in range(len(filtered_sentence)):    
          filtered_sentence[line] = " ".join(filtered_sentence[line].split())

        # Update the data frame
        df_sum_news_labels["News"] = filtered_sentence

        news_sum = df_sum_news_labels["News"]
        null_indexes = []
        index = 0

        for line in news_sum:
          if line == "":
            null_indexes.append(index)
          index = index + 1

        for row in null_indexes:
          df_sum_news_labels = df_sum_news_labels.drop(row)

        news_sum = df_sum_news_labels["News"]
        null_indexes = []
        index = 0

        for line in news_sum:
          if line == "":
            null_indexes.append(index)
          index = index + 1
          
        assert len(null_indexes) == 0

        # Do the shuffle
        for i in range(SHUFFLE_CYCLE):
          df_sum_news_labels = shuffle(df_sum_news_labels, random_state = RANDOM_SEED)

        # Reset the index
        df_sum_news_labels.reset_index(inplace=True, drop=True)

        return df_sum_news_labels
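
The character-level cleaning steps in `preprocess()` (punctuation removal, digit removal, whitespace normalization, lowercasing) can be condensed into a single pass. The following is a sketch of a helper intended to give the same result for those steps, not the notebook's own code. `clean_headline` is a hypothetical name, and stop-word removal is left out.

```python
import string

def clean_headline(text: str) -> str:
    """Replace punctuation and digits with spaces, collapse whitespace, lowercase."""
    kept = "".join(
        " " if (ch in string.punctuation or ch.isdigit()) else ch
        for ch in text
    )
    # split() without arguments also removes leading and trailing whitespace.
    return " ".join(kept.split()).lower()

print(clean_headline("Oil hits $100, again!"))   # oil hits again
print(clean_headline("the 100th Anniversary"))   # the th anniversary
```

The second call shows why fragments such as "th anniversary" appear in the printed training samples: digits are removed, but the ordinal suffix stays.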

In [59]:
if DATASET == 1:
    def split_to_train():
        INPUT_SIZE = len(df_sum_news_labels)
        TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 

        # Split the dataset
        train = df_sum_news_labels[:TRAIN_SIZE] 

        return train

In [60]:
if DATASET == 1:
    def split_to_test():
        INPUT_SIZE = len(df_sum_news_labels)
        TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 

        # Split the dataset
        test = df_sum_news_labels[TRAIN_SIZE:]

        return test
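
`split_to_train()` and `split_to_test()` compute the same boundary twice and read the global `df_sum_news_labels`. A single function that takes the data as an argument and returns both parts is one alternative. The sketch below uses a plain list; slicing a DataFrame with `df[:n]` and `df[n:]` selects rows the same way. The value 0.8 is an assumption for `TRAIN_SPLIT`, which is defined outside this section.

```python
def split_train_test(rows: list, train_fraction: float):
    """Split already-shuffled rows into a leading train part and a trailing test part."""
    train_size = int(train_fraction * len(rows))
    return rows[:train_size], rows[train_size:]

rows = list(range(10))                      # stand-in for the shuffled rows
train, test = split_train_test(rows, 0.8)   # 0.8 is an assumed TRAIN_SPLIT value
assert len(train) + len(test) == len(rows)  # same sanity check as in the notebook
print(train, test)
```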

In [61]:
if DATASET == 1:
    for ROWS in rows_values:
      
        print("--------------------------------------------\n\nStart of the ROWS = " 
          + str(ROWS) + " sequence\n\n--------------------------------------------\n")
        
        model_type = []
        result = []

        df_sum_news_labels = preprocess()
        train = split_to_train()
        test = split_to_test()

        # check
        split_sum = len(train) + len(test)
        total_rows = len(df_sum_news_labels)  # avoid shadowing the built-in sum()
        assert split_sum == total_rows

        train_headlines = []
        test_headlines = []

        for row in range(0, len(train.index)):
            train_headlines.append(train.iloc[row, 1])

        for row in range(0,len(test.index)):
            test_headlines.append(test.iloc[row, 1])

        # show the first
        print(train_headlines[0])

        for MODEL_TYPE in model_type_values:

            for n in range(1,MODEL_TYPE+1):
                print("--------------------------------------------\n\nStart of the " 
                      + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

                _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
                _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

                print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

                _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
                _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

                _gram_test_ = _gram_vectorizer_.transform(test_headlines)
                _gram_predictions_ = _gram_model_.predict(_gram_test_)

                print (accuracy_score(test["Label"], _gram_predictions_))

                model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
                result.append(accuracy_score(test["Label"], _gram_predictions_))

        rows_summary_value.append(ROWS)

        # save the best
        best_model_rows = 0

        for model in range(len(model_type)):
            if result[model] > best_model_rows:
                best_model_rows = result[model]

        rows_summary_accuraccy.append(best_model_rows)

--------------------------------------------

Start of the ROWS = 1 sequence

--------------------------------------------

u aircraft carrier heads korean waters south korea warns north korea enormous retaliation
--------------------------------------------

Start of the 1,1 gram model

The shape is: (39573, 30123)

0.5065696381645441
--------------------------------------------

Start of the 1,2 gram model

The shape is: (39573, 351580)

0.5175864160097029
--------------------------------------------

Start of the 2,2 gram model

The shape is: (39573, 321457)

0.5268849807964423
--------------------------------------------

Start of the 1,3 gram model

The shape is: (39573, 719811)

0.5159692743076613
--------------------------------------------

Start of the 2,3 gram model

The shape is: (39573, 689688)

0.5320396199717
--------------------------------------------

Start of the 3,3 gram model

The shape is: (39573, 368231)

0.5363856882959369
----------------------------------------

Evaluation.

In [62]:
if DATASET == 1:
    best_model_rows = 0

    for model in range(len(rows_summary_value)):
        print(str(rows_summary_value[model]) + ":\t\t\t\t\t" 
              + str(rows_summary_accuraccy[model]))

        if rows_summary_accuraccy[model] > best_model_rows:
            best_model_rows = rows_summary_accuraccy[model]
            best_model_rows_index = model

    print("--------------------------------------------\nBest row value:\n" 
          + str(rows_summary_value[best_model_rows_index]) + "\t\t\t\t\t" + 
          str(rows_summary_accuraccy[best_model_rows_index]))

1:					0.5431574691732363
2:					0.5471578947368421
3:					0.5371013577518156
4:					0.5528421052631579
5:					0.5391611925214755
6:					0.55239898989899
7:					0.5496632996632996
8:					0.5446127946127947
9:					0.5366161616161617
10:					0.5340909090909091
11:					0.5404040404040404
12:					0.5366161616161617
25:					0.5429292929292929
--------------------------------------------
Best row value:
4					0.5528421052631579


Displaying the results for the best ROWS value.

In [63]:
if DATASET == 1:
    ROWS = int(rows_summary_value[best_model_rows_index])

    model_type = []
    result = []

    df_sum_news_labels = preprocess()
    train = split_to_train()
    test = split_to_test()

    # check
    split_sum = len(train) + len(test)
    total_rows = len(df_sum_news_labels)  # avoid shadowing the built-in sum()
    assert split_sum == total_rows

    train_headlines = []
    test_headlines = []

    for row in range(0, len(train.index)):
        train_headlines.append(train.iloc[row, 1])

    for row in range(0,len(test.index)):
        test_headlines.append(test.iloc[row, 1])

    # show the first
    print(train_headlines[0])

    for MODEL_TYPE in model_type_values:

        for n in range(1,MODEL_TYPE+1):
            print("--------------------------------------------\n\nStart of the " 
                  + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

            _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
            _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

            print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

            _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
            _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

            _gram_test_ = _gram_vectorizer_.transform(test_headlines)
            _gram_predictions_ = _gram_model_.predict(_gram_test_)

            print (accuracy_score(test["Label"], _gram_predictions_))

            model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
            result.append(accuracy_score(test["Label"], _gram_predictions_))

    best_model_gram = 0

    for model in range(len(model_type)):
        print(str(model_type[model]) + ":\t\t\t\t\t" + str(result[model]))

        if result[model] > best_model_gram:
            best_model_gram = result[model]
            best_model_gram_index = model

    print("--------------------------------------------\nBest model:\n" 
          + str(model_type[best_model_gram_index]) + "\t\t\t\t\t" + 
          str(result[best_model_gram_index]))

dick cheney death squad killed lebanese former prime minister rafik hariri tens thousands armenians march capital commemorate th anniversary armenian genocide pedophile priest hires private detectives harass victims islamic insurgents advance closer pakistani capital
--------------------------------------------

Start of the 1,1 gram model

The shape is: (9499, 29729)

0.5111578947368421
--------------------------------------------

Start of the 1,2 gram model

The shape is: (9499, 366936)

0.52
--------------------------------------------

Start of the 2,2 gram model

The shape is: (9499, 337207)

0.5473684210526316
--------------------------------------------

Start of the 1,3 gram model

The shape is: (9499, 779700)

0.527578947368421
--------------------------------------------

Start of the 2,3 gram model

The shape is: (9499, 749971)

0.5528421052631579
--------------------------------------------

Start of the 3,3 gram model

The shape is: (9499, 412764)

0.5389473684210526
----

Displaying the model coefficients (feature weights) of the best configuration.

In [64]:
if DATASET == 1:
    ROWS = int(rows_summary_value[best_model_rows_index])
    MODEL_TYPE = str(model_type[best_model_gram_index])

    df_sum_news_labels = preprocess()
    train = split_to_train()
    test = split_to_test()

    # check
    split_sum = len(train) + len(test)
    total_rows = len(df_sum_news_labels)  # avoid shadowing the built-in sum()
    assert split_sum == total_rows

    train_headlines = []
    test_headlines = []

    for row in range(0, len(train.index)):
        train_headlines.append(train.iloc[row, 1])

    for row in range(0,len(test.index)):
        test_headlines.append(test.iloc[row, 1])

    # show the first
    print(train_headlines[0])

    _gram_vectorizer_ = CountVectorizer(ngram_range=(int(MODEL_TYPE[0]),int(MODEL_TYPE[2])))
    _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

    print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

    _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["Label"])

    _gram_test_ = _gram_vectorizer_.transform(test_headlines)
    _gram_predictions_ = _gram_model_.predict(_gram_test_)

    print (accuracy_score(test["Label"], _gram_predictions_))

dick cheney death squad killed lebanese former prime minister rafik hariri tens thousands armenians march capital commemorate th anniversary armenian genocide pedophile priest hires private detectives harass victims islamic insurgents advance closer pakistani capital
The shape is: (9499, 749971)

0.5528421052631579
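
The cell above recovers the n-gram range by indexing characters 0 and 2 of the label string (`MODEL_TYPE[0]`, `MODEL_TYPE[2]`), which works as long as both numbers are single digits. A slightly more robust sketch splits the label instead; `parse_ngram_label` is a hypothetical helper that expects labels of the form produced by the training loop, such as `"2,3 n-gram"`.

```python
def parse_ngram_label(label: str) -> tuple:
    """Parse a label such as '2,3 n-gram' into the tuple (2, 3)."""
    range_part = label.split()[0]          # e.g. '2,3'
    min_n, max_n = range_part.split(",")
    return int(min_n), int(max_n)

print(parse_ngram_label("2,3 n-gram"))     # (2, 3)
```

The returned tuple can be passed directly as the `ngram_range` argument of `CountVectorizer`.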


In [65]:
if DATASET == 1:
    _gram_words_best_ = _gram_vectorizer_.get_feature_names_out()
    _gram_coeffs_best_ = _gram_model_.coef_.tolist()[0]

    coeffdf = pd.DataFrame({'Word' : _gram_words_best_, 
                            'Coefficient' : _gram_coeffs_best_})

    coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[0, 1])

    print(coeffdf.head(10))

                    Word  Coefficient
16745      air pollution     0.492042
611543      social media     0.478259
702463         use force     0.477721
738134     world largest     0.436365
483525          per cent     0.423582
257212  french president     0.412379
31225           anti gay     0.405971
654661          tear gas     0.391482
479790        peace deal     0.387368
150645       court rules     0.383496


In [66]:
if DATASET == 1:
    print(coeffdf.tail(10))

                  Word  Coefficient
698949         us army    -0.374030
70540        bin laden    -0.375205
616033    south korean    -0.379653
449005  nuclear plants    -0.380581
10848     afghan woman    -0.391471
596210    sexual abuse    -0.393440
538466       red cross    -0.399393
574709      saudi king    -0.411576
567727     russia says    -0.414118
472517   panama papers    -0.470315


In [67]:
# test the last 10 days
if DATASET == 1:
    ROWS = 25
    MODEL_TYPE = str(model_type[best_model_gram_index])

    # Find the cells with NaN, then the rows that contain them
    is_NaN = df_for_test.isnull()
    row_has_NaN = is_NaN.any(axis = 1)
    rows_with_NaN = df_for_test[row_has_NaN]

    # Replace them
    df_for_test = df_for_test.replace(np.nan, " ")

    # Check the process
    is_NaN = df_for_test.isnull()
    row_has_NaN = is_NaN.any(axis = 1)
    rows_with_NaN = df_for_test[row_has_NaN]

    assert len(rows_with_NaN) == 0

    # Get column names
    combined_column_names = []
    for column in df_for_test.columns:
      combined_column_names.append(column)

    # 2D array creation for the news based on macros
    COLUMNS = len(df_for_test)
    news_sum = []
    news_sum = [[0 for i in range(COLUMNS)] for j in range(int((len(combined_column_names) - 1) / ROWS))]  

    # Merge the news
    for row in range(len(df_for_test)):
      for column in range(int((len(combined_column_names) - 1) / ROWS)):
        temp = ""
        news = ""
        for word in range(ROWS):
          news = df_for_test[combined_column_names[(column * ROWS) + (word + 1)]][row]
          # Remove the b character at the beginning of the string
          if news[0] == "b":
            news = " " + news[1:]
          temp = temp + " " + news
        news_sum[column][row] = temp

    # Drop the old columns
    for column in range(len(combined_column_names) - 1):
      df_for_test.drop(combined_column_names[column + 1], axis = 1, inplace = True)

    # Create the new columns with the merged news
    for column in range(int((len(combined_column_names) - 1) / ROWS)):
      column_name = "News_" + str(column + 1)
      df_for_test[column_name] = news_sum[column]

    # The label column 
    LABEL_COLUMN = 0

    news_sum = []
    label_sum = []

    # Get the column names
    combined_column_names = []
    for column in df_for_test.columns:
      combined_column_names.append(column)

    # Connect the merged news with the labels
    for column in range(len(df_for_test)):
      for row in range(len(combined_column_names) - 1):
        news_sum.append(df_for_test[combined_column_names[row + 1]][column])
        label_sum.append(df_for_test[combined_column_names[LABEL_COLUMN]][column])

    # Create the new DataFrame
    df_sum_news_labels = pd.DataFrame(data = label_sum, index = None, columns = ["Label"])
    df_sum_news_labels["News"] = news_sum

    # Removing punctuations
    temp_news = []
    for line in news_sum:
      temp_attach = ""
      for word in line:
        temp = " "
        if word not in string.punctuation:
          temp = word
        temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    news_sum = temp_news
    temp_news = []

    # Remove numbers
    for line in news_sum:
      temp_attach = ""
      for word in line:
        temp = " "
        if not word.isdigit():
          temp = word
        temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    # Remove space
    for line in range(len(temp_news)):    
      temp_news[line] = " ".join(temp_news[line].split())

    # Converting headlines to lower case
    for line in range(len(temp_news)): 
        temp_news[line] = temp_news[line].lower()

    # Update the data frame
    df_sum_news_labels["News"] = temp_news

    # Load the stop words
    stop_words = set(stopwords.words('english'))

    filtered_sentence = []
    news_sum = df_sum_news_labels["News"]

    # Remove stop words
    for line in news_sum:
      word_tokens = word_tokenize(line)
      temp_attach = ""
      for word in word_tokens:
        temp = " "
        if not word in stop_words:
          temp = temp + word
        temp_attach = temp_attach + "".join(temp)
      filtered_sentence.append(temp_attach)

    # Remove space
    for line in range(len(filtered_sentence)):    
      filtered_sentence[line] = " ".join(filtered_sentence[line].split())

    # Update the data frame
    df_sum_news_labels["News"] = filtered_sentence

    news_sum = df_sum_news_labels["News"]
    null_indexes = []
    index = 0

    for line in news_sum:
      if line == "":
        null_indexes.append(index)
      index = index + 1

    for row in null_indexes:
      df_sum_news_labels = df_sum_news_labels.drop(row)

    news_sum = df_sum_news_labels["News"]
    null_indexes = []
    index = 0

    for line in news_sum:
      if line == "":
        null_indexes.append(index)
      index = index + 1

    assert len(null_indexes) == 0

    compare = []

    for row in range(0, len(df_sum_news_labels.index)):
      compare.append(df_sum_news_labels.iloc[row, 1])

    print(df_sum_news_labels.head())

    print(compare[0])

    _gram_test_compare_ = _gram_vectorizer_.transform(compare)
    compare_predict = _gram_model_.predict(_gram_test_compare_)
    compare_predict_proba = _gram_model_.predict_proba(_gram_test_compare_)

    print(compare_predict)    
    print(compare_predict_proba)

   Label                                               News
0      1  staggering percent venezuelans say money buy e...
1      1  australian athlete competed six paralympic gam...
2      0  german government agrees ban fracking indefini...
3      1  today united kingdom decides whether remain eu...
4      0  david cameron resign pm eu referendum bbc fore...
staggering percent venezuelans say money buy enough food two corporate whistleblowers may enter plea bargain deal would tie brazilian lawmakers corruption cases poland together russia iran several gulf states successfully removed decriminalization homosexuality un resolution three environmental activists killed per week last year murdered defending land rights environment mining dam projects logging ontario funeral business dissolves dead pours town sewers new declassified documents reveal cia abused tortured prisoners graphic tens thousands people gathered sweltering heat japan okinawa island sunday one biggest demonstrations two d

In [68]:
if DATASET == 1:
    # create dataframe template
    df_compare = df_for_test

    # Get the column names
    combined_column_names = []
    for column in df_for_test.columns:
      combined_column_names.append(column)

    drop_number = -1 * (len(combined_column_names) - 1)

    df_dropped = df_compare.drop(df_compare.columns[drop_number:],axis=1)

    df_dropped["0"] = compare_predict_proba[:,0]
    df_dropped["1"] = compare_predict_proba[:,1]
    df_dropped["Predict"] = compare_predict

    match = []
    for row in range(len(df_dropped)):
        if df_dropped["Label"][row] == df_dropped["Predict"][row]:
            match.append(1)
        else:
            match.append(0)

    df_dropped["Match"] = match

    df_dropped
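
The row-by-row comparison in the cell above can also be written as one vectorized expression. The following is a minimal sketch on a small made-up frame: `df_example` and its values are hypothetical and are not results from this notebook; only the column names `Label`, `Predict` and `Match` come from the cell above.

```python
import pandas as pd

# Hypothetical stand-in for df_dropped; the values are made up for illustration
df_example = pd.DataFrame({"Label": [1, 0, 1, 0], "Predict": [1, 1, 1, 0]})

# Element-wise comparison gives booleans; astype(int) turns them into 1/0
df_example["Match"] = (df_example["Label"] == df_example["Predict"]).astype(int)

# The mean of the 1/0 column is the share of rows where prediction and label agree
match_rate = df_example["Match"].mean()
print(df_example)
print(match_rate)
```

Because the comparison is aligned on the index, this form does not depend on the index running from 0 to `len(df) - 1`.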

## **ECO_BSN_DF, ECO_FNC_DF, ECO_US_DF 2008-2016 (2)**

I examine these merged datasets over the same interval as the Reddit world news; then, after merging and combining the two, I check whether this improves the accuracy.

I collected these datasets myself from the following pages:


*   https://www.economist.com/business/ 
*   https://www.economist.com/finance-and-economics/ 
*   https://www.economist.com/united-states/ 

### Loading the datasets

First, I load each dataset separately.

In [69]:
if DATASET == 2:
    # Copy the dataset to the local environment
    !cp "/content/drive/MyDrive/Kaggle dataset/Reddit Top 25 DJIA/KAG_REDDIT_WRLD_DJIA_DF_corrected.csv" "KAG_REDDIT_WRLD_DJIA_DF.csv"
    !cp "/content/drive/MyDrive/Economist/ECO_BSN_DF.csv" "ECO_BSN_DF.csv"
    !cp "/content/drive/MyDrive/Economist/ECO_FNC_DF.csv" "ECO_FNC_DF.csv"
    !cp "/content/drive/MyDrive/Economist/ECO_US_DF.csv" "ECO_US_DF.csv"


    # Check that the copy was successful -> good if there is no AssertionError
    import os
    for file_name in ["KAG_REDDIT_WRLD_DJIA_DF.csv", "ECO_BSN_DF.csv",
                      "ECO_FNC_DF.csv", "ECO_US_DF.csv"]:
        assert os.path.exists(file_name), file_name + " was not copied"

    # Load the datasets 
    df_reddit = pd.read_csv('KAG_REDDIT_WRLD_DJIA_DF.csv', index_col = "Date")
    df_bsn = pd.read_csv('ECO_BSN_DF.csv', index_col = "date")
    df_fnc = pd.read_csv('ECO_FNC_DF.csv', index_col = "date")
    df_us = pd.read_csv('ECO_US_DF.csv', index_col = "date")

    # Load the stock data
    df_stock = web.DataReader("DJIA", data_source="yahoo", start="2008-08-08", 
                              end="2016-07-01")

Inspecting the datasets through their first rows.

In [70]:
if DATASET == 2:
    # Show the dataframe
    print("Reddit")
    print(df_reddit.head())
    print("\n\nBSN ECO")
    print(df_bsn.head())
    print("\n\nFNC ECO")
    print(df_fnc.head())
    print("\n\nUS ECO")
    print(df_us.head())

Selecting the rows of the ECO datasets that fall within the examined time interval.

In [71]:
if DATASET == 2:
    df_bsn_inspect = df_bsn[df_bsn.index < '2016/07/02']
    df_bsn_inspect = df_bsn_inspect[df_bsn_inspect.index > '2008/08/07']
    df_bsn_inspect = df_bsn_inspect.drop_duplicates()

    df_fnc_inspect = df_fnc[df_fnc.index < '2016/07/02']
    df_fnc_inspect = df_fnc_inspect[df_fnc_inspect.index > '2008/08/07']
    df_fnc_inspect = df_fnc_inspect.drop_duplicates()

    df_us_inspect = df_us[df_us.index < '2016/07/02']
    df_us_inspect = df_us_inspect[df_us_inspect.index > '2008/08/07']
    df_us_inspect = df_us_inspect.drop_duplicates()

    print("BSN ECO")
    print(df_bsn_inspect.head(2))
    print("...")
    print(df_bsn_inspect.tail(2))
    print(df_bsn_inspect.shape)

    print("\n\nFNC ECO")
    print(df_fnc_inspect.head(2))
    print("...")
    print(df_fnc_inspect.tail(2))
    print(df_fnc_inspect.shape)

    print("\n\nUS ECO")
    print(df_us_inspect.head(2))
    print("...")
    print(df_us_inspect.tail(2))
    print(df_us_inspect.shape)

    print("\n\nSummary length:\t\t" + str(len(df_bsn_inspect) + len(df_fnc_inspect) + len(df_us_inspect)))
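
The cell above compares the index with date strings, which selects the intended range only while every date is written in the same zero-padded `YYYY/MM/DD` form. As a hedged alternative, the index can be parsed into timestamps first. This is a minimal sketch on made-up rows; `df_example` and the assumed date format are illustrative, not taken from the real files.

```python
import pandas as pd

# Hypothetical stand-in for one of the ECO frames, indexed by "YYYY/MM/DD" strings
df_example = pd.DataFrame(
    {"title": ["too early", "first kept", "last kept", "too late"]},
    index=pd.Index(["2008/08/07", "2008/08/08", "2016/07/01", "2016/07/02"], name="date"),
)

# Parse the text index into real timestamps (the format is an assumption)
parsed = pd.to_datetime(df_example.index, format="%Y/%m/%d")

# Boolean mask for the closed interval 2008-08-08 .. 2016-07-01
in_range = (parsed >= pd.Timestamp("2008-08-08")) & (parsed <= pd.Timestamp("2016-07-01"))
df_in_range = df_example[in_range]
print(df_in_range)
```

Parsing fails loudly on a malformed date, whereas a string comparison would silently sort it somewhere unexpected.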

Concatenating the datasets collected from The Economist.

In [72]:
if DATASET == 2:
    df_eco_all = pd.concat([df_bsn_inspect, df_fnc_inspect, df_us_inspect])

    df_eco_all = df_eco_all.drop_duplicates()

    df_eco_all.sort_index(ascending=True, inplace=True)

    print("ECO MERGED")
    print(df_eco_all.head(2))
    print("...")
    print(df_eco_all.tail(2))
    print(df_eco_all.shape)

Examining the news items that belong to the same day.

In [73]:
if DATASET == 2:
    # Groupby by date
    dates = df_eco_all.groupby("date")

    # Summary statistic
    print("Max:")
    print(dates.describe().max())
    print("\n\nMin:")
    print(dates.describe().min())

In [74]:
if DATASET == 2:
    dates_count = [] # for count
    dates_dates = [] # for indexing
    df_dates = dates.describe()

    for row in range(len(df_dates)):
        dates_count.append(len(dates.get_group(df_dates.index[row])))
        dates_dates.append(dates.get_group(df_dates.index[row]).index[0])

    df_group_dates = pd.DataFrame()
    df_group_dates["date"] = dates_dates
    df_group_dates["count"] = dates_count
    df_group_dates.set_index("date", inplace=True)
    df_group_dates.sort_index(ascending=True, inplace=True)

    print(df_group_dates.head())

In [75]:
if DATASET == 2:
    # Groupby by date
    counts = df_group_dates.groupby("count")

    keys = list(counts.groups.keys())

    # Named total so that the built-in sum() is not shadowed
    total = 0

    for key in keys:
      total = total + key * len(counts.get_group(key))
      print("Count: " + str(key) + "\t\t" + str(len(counts.get_group(key))))

    print("\n\nSummary:\t\t" + str(total))
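
The two cells above first count the headlines per day and then count how many days share each headline count. The same idea can be sketched with `collections.Counter` from the standard library. The dates below are made up for illustration, and `headline_dates` is a hypothetical name.

```python
from collections import Counter

# Hypothetical list with one date entry per headline (made-up values)
headline_dates = ["2008/08/08", "2008/08/08", "2008/08/11", "2008/08/12", "2008/08/12"]

# Number of headlines on each day
per_day = Counter(headline_dates)

# Number of days that have exactly k headlines
days_per_count = Counter(per_day.values())

# The weighted sum of the histogram recovers the total number of headlines
total = sum(count * days for count, days in days_per_count.items())

print(per_day)
print(days_per_count)
print(total)
```

The final weighted sum is a quick consistency check: it should equal the number of rows that went into the grouping.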

Generating the labels for the merged news based on the stock price.

In [76]:
if DATASET == 2:
    days = []
    stock_days = []
    wrong_days = []

    # Create dates and remove duplicates
    for day in range(len(df_eco_all.index)):
        if day == 0:
            days.append(str(df_eco_all.index[day]))
        elif df_eco_all.index[day] != days[len(days) - 1]:
            days.append(str(df_eco_all.index[day]))

    # Collect the trading days that are present in the stock data
    for day in range(len(df_stock.index)):
        stock_days.append(str(df_stock.index[day])[0:10].replace("-","/"))

    # Keep only the dates that are trading days
    good_days = []
    for day in days:
        # Membership test, so a match at position 0 (the first trading day) is kept too
        if day in stock_days:
            good_days.append(str(day))
        else:
            wrong_days.append(str(day))

    print("All days:\t\t" + str(len(days)))
    print("Good days:\t\t" + str(len(good_days)))
    print("Wrong days:\t\t" + str(len(wrong_days)))

In [77]:
if DATASET == 2:
    label_eco = []
    date_label_eco = []
    title_label_eco = []

    for day in range(len(good_days)):
        titles = df_eco_all["title"][good_days[day]]
        # A date with several headlines yields a Series, a single headline a plain value
        if isinstance(titles, pd.Series):
            titles = titles.tolist()
        else:
            titles = [titles]

        if day == 0:
            # The first day is not compared with a previous close and gets label 0
            label = 0
        else:
            stock_day = stock_days.index(good_days[day])
            close_today = int(df_stock["Adj Close"].iloc[stock_day])
            close_previous = int(df_stock["Adj Close"].iloc[stock_day - 1])
            # 1 -> rise or unchanged, 0 -> fall (closes are truncated to whole points)
            label = 1 if close_today >= close_previous else 0

        for title in titles:
            title_label_eco.append(title)
            label_eco.append(label)
            date_label_eco.append(good_days[day])

    print("News with labels length:\t\t" + str(len(label_eco)))
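
The labelling rule in the cell above is: a day gets label 1 if the adjusted close did not fall compared with the previous trading day, otherwise 0, and every headline inherits the label of its day. Below is a minimal standard-library sketch of that rule. The prices and headlines are made up, `daily_labels` is a hypothetical helper, and unlike the cell it compares the raw values without `int()` truncation and simply leaves the first day unlabelled.

```python
def daily_labels(trading_days, closes):
    """Map each trading day (except the first) to 1 if the close did not fall, else 0."""
    labels = {}
    for position in range(1, len(trading_days)):
        did_not_fall = closes[position] >= closes[position - 1]
        labels[trading_days[position]] = 1 if did_not_fall else 0
    return labels


# Made-up closing prices for illustration only
trading_days = ["2008/08/08", "2008/08/11", "2008/08/12", "2008/08/13"]
closes = [100.0, 101.5, 101.5, 99.0]
labels = daily_labels(trading_days, closes)

# Every headline inherits the label of its publication day
headlines = [("2008/08/11", "headline a"), ("2008/08/11", "headline b"), ("2008/08/13", "headline c")]
labelled = [(labels[day], title) for day, title in headlines if day in labels]
print(labelled)
```

Looking the label up in a dictionary keeps the price comparison in one place, so the rule is applied once per day rather than once per headline.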

Creating a new dataset from the usable, labelled data.

In [78]:
if DATASET == 2:
    df_eco = pd.DataFrame()
    df_eco["date"] = date_label_eco
    df_eco["label"] = label_eco
    df_eco["title"] = title_label_eco
    df_eco.set_index("date", inplace=True)
    df_eco.sort_index(ascending=True, inplace=True)
    print(df_eco.head())
    print(len(df_eco))

    # drop duplicates
    df_eco.drop_duplicates(subset="title", inplace=True)
    print("\n\n ----- Drop duplicate title -----\n")
    print(df_eco.head())
    print(len(df_eco))    

### Preparing the datasets

Cleaning the dataset.

In [79]:
if DATASET == 2:
    # Removing punctuations
    temp_news = []
    news_sum = df_eco["title"]

    for line in news_sum:
      temp_attach = ""
      try:
          for word in line:
            temp = " "
            if word not in string.punctuation:
              temp = word
            temp_attach = temp_attach + "".join(temp)
      except:
          temp = " "
          temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    news_sum = temp_news
    temp_news = []

    # Remove numbers
    for line in news_sum:
      temp_attach = ""
      for word in line:
        temp = " "
        if not word.isdigit():
          temp = word
        temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    # Remove space
    for line in range(len(temp_news)):    
      temp_news[line] = " ".join(temp_news[line].split())

    # Converting headlines to lower case
    for line in range(len(temp_news)): 
        temp_news[line] = temp_news[line].lower()

    # Update the data frame
    df_eco["title"] = temp_news

    # Load the stop words
    stop_words = set(stopwords.words('english'))

    filtered_sentence = []
    news_sum = df_eco["title"]

    # Remove stop words
    for line in news_sum:
      word_tokens = word_tokenize(line)
      temp_attach = ""
      for word in word_tokens:
        temp = " "
        if not word in stop_words:
          temp = temp + word
        temp_attach = temp_attach + "".join(temp)
      filtered_sentence.append(temp_attach)

    # Remove space
    for line in range(len(filtered_sentence)):    
      filtered_sentence[line] = " ".join(filtered_sentence[line].split())

    # Update the data frame
    df_eco["title"] = filtered_sentence

    # Reset the index
    df_eco.reset_index(inplace=True)

    news_sum = df_eco["title"]
    null_indexes = []
    index = 0

    for line in news_sum:
      if line == "":
        null_indexes.append(index)
      index = index + 1

    print(null_indexes)

    for row in range(len(null_indexes)):
      df_eco = df_eco.drop(df_eco.index[null_indexes[row] - row])

    news_sum = df_eco["title"]
    null_indexes = []
    index = 0

    for line in news_sum:
      if line == "":
        null_indexes.append(index)
      index = index + 1

    assert len(null_indexes) == 0
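
The cleaning steps in the cell above (punctuation and digits replaced by spaces, whitespace collapsed, lower-casing, stop words removed) can be collected into one small function. This is a minimal sketch under stated assumptions: `clean_headline` and `EXAMPLE_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and `str.split` is used in place of `word_tokenize`.

```python
import string

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead
EXAMPLE_STOP_WORDS = {"the", "a", "of", "to", "in"}


def clean_headline(text, stop_words=EXAMPLE_STOP_WORDS):
    """Replace punctuation and digits with spaces, lower-case, drop stop words."""
    kept_chars = []
    for char in text:
        if char in string.punctuation or char.isdigit():
            kept_chars.append(" ")
        else:
            kept_chars.append(char)
    # split() without arguments also collapses repeated whitespace
    tokens = "".join(kept_chars).lower().split()
    return " ".join(token for token in tokens if token not in stop_words)


print(clean_headline("The price of oil rose 3% in 2015!"))
```

A headline that consists only of stop words, digits and punctuation comes back as an empty string, which is the case the cell above drops from the frame.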

Splitting the dataset into a training set and a test set.

In [80]:
if DATASET == 2:
    # Drop the dates
    df_eco_label_title = pd.DataFrame()
    df_eco_label_title["label"] = df_eco["label"]
    df_eco_label_title["title"] = df_eco["title"]
    print("New dataset without the dates")
    print(df_eco_label_title.head())
    print(len(df_eco_label_title))

    # Do the shuffle
    for i in range(SHUFFLE_CYCLE):
      df_eco_label_title = shuffle(df_eco_label_title, random_state = RANDOM_SEED)

    # Reset the index
    df_eco_label_title.reset_index(inplace=True, drop=True)

    # Show the data frame
    print("\n\nAfter shuffle")
    print(df_eco_label_title.head())    

    # Split the dataset
    INPUT_SIZE = len(df_eco_label_title)
    TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 
    TEST_SIZE = int(TEST_SPLIT * INPUT_SIZE)

    train = df_eco_label_title[:TRAIN_SIZE] 
    test = df_eco_label_title[TRAIN_SIZE:]

    # Print out the length
    print("\n\nAfter split")
    print("Train data set length: " + str(len(train)))
    print("Test data set length: " + str(len(test)))
    print("Split summa: " + str(len(train) + len(test)))
    print("Dataset summa before split: " + str(len(df_eco_label_title)))

    # Check that no row was lost in the split
    split_sum = len(train) + len(test)
    total = len(df_eco_label_title)
    assert split_sum == total
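
In the cell above the test set is everything after the first `TRAIN_SIZE` rows, so the two parts always add up to the full dataset. The same shuffle-then-slice idea can be sketched with the standard library only. `shuffle_and_split`, `train_fraction` and `seed` are hypothetical names, and the rows are made up.

```python
import random


def shuffle_and_split(rows, train_fraction, seed):
    """Return (train, test) after a reproducible shuffle of a copy of rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    train_size = int(train_fraction * len(shuffled))
    # Everything that is not in the training slice goes to the test slice
    return shuffled[:train_size], shuffled[train_size:]


# Made-up (label, title) pairs for illustration
rows = [(i % 2, "headline " + str(i)) for i in range(10)]
train_rows, test_rows = shuffle_and_split(rows, train_fraction=0.8, seed=42)

print(len(train_rows), len(test_rows))
```

With a fixed seed the split is reproducible, which matters here because the later accuracy figures are only comparable across runs if the same rows end up in the test set.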

### n-gram model

Automatic training and display of the results, together with the words that have the largest logistic regression coefficients.

In [81]:
if DATASET == 2:
    model_type = []
    result = []
    model_type_values = []
    train_headlines = []
    test_headlines = []

    # Create model type values
    for value in range(1,7):
        model_type_values.append(value)

    for row in range(0, len(train.index)):
        train_headlines.append(train.iloc[row, 1])

    for row in range(0,len(test.index)):
        test_headlines.append(test.iloc[row, 1])


    for MODEL_TYPE in model_type_values:

        for n in range(1,MODEL_TYPE+1):
            print("--------------------------------------------\n\nStart of the " 
                  + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

            _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
            _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

            print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

            _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
            _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["label"])

            _gram_test_ = _gram_vectorizer_.transform(test_headlines)
            _gram_predictions_ = _gram_model_.predict(_gram_test_)

            print (accuracy_score(test["label"], _gram_predictions_))

            model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
            result.append(accuracy_score(test["label"], _gram_predictions_))

    best_model_gram = 0

    print("\n\n")
    for model in range(len(model_type)):
        print(str(model_type[model]) + ":\t\t\t\t\t" + str(result[model]))

        if result[model] > best_model_gram:
            best_model_gram = result[model]
            best_model_gram_index = model

    print("--------------------------------------------\nBest model:\n" 
          + str(model_type[best_model_gram_index]) + "\t\t\t\t\t" + 
          str(result[best_model_gram_index]))
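
The nested loops above try every `ngram_range=(n, MODEL_TYPE)` with `1 <= n <= MODEL_TYPE <= 6`. With such a range, `CountVectorizer` counts every word n-gram whose length lies between the two bounds. The sketch below illustrates this in plain Python; `word_ngrams` is a hypothetical helper and not scikit-learn's implementation, which also applies its own tokenization.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# The same (n, MODEL_TYPE) pairs that the two nested loops visit
ngram_ranges = [(n, upper) for upper in range(1, 7) for n in range(1, upper + 1)]

print(len(ngram_ranges))
print(word_ngrams(["oil", "price", "falls"], 1, 2))
print(word_ngrams(["oil", "price", "falls"], 2, 3))
```

The number of distinct n-grams grows quickly with the upper bound, which is why the printed feature-matrix shape is worth watching for the wider ranges.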

In [82]:
if DATASET == 2:
    MODEL_TYPE = str(model_type[best_model_gram_index])

    # show the first
    print(train_headlines[0])

    _gram_vectorizer_ = CountVectorizer(ngram_range=(int(MODEL_TYPE[0]),int(MODEL_TYPE[2])))
    _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

    print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

    _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["label"])

    _gram_test_ = _gram_vectorizer_.transform(test_headlines)
    _gram_predictions_ = _gram_model_.predict(_gram_test_)

    print (accuracy_score(test["label"], _gram_predictions_))

    # MODEL_TYPE already holds the full label of the best model, e.g. "1,2 n-gram"
    model_type.append(MODEL_TYPE)
    result.append(accuracy_score(test["label"], _gram_predictions_))

In [83]:
if DATASET == 2:
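    # get_feature_names_out() requires scikit-learn >= 1.0 (older releases used get_feature_names())
    _gram_words_best_ = _gram_vectorizer_.get_feature_names_out()
    _gram_coeffs_best_ = _gram_model_.coef_.tolist()[0]

    coeffdf = pd.DataFrame({'Word' : _gram_words_best_, 
                            'Coefficient' : _gram_coeffs_best_})

    # Largest coefficients first, ties broken alphabetically
    coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[False, True])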

    print(coeffdf.head(10))

In [84]:
if DATASET == 2:
    print(coeffdf.tail(10))
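
To read the coefficient tables: in a binary logistic regression the predicted probability of label 1 is the sigmoid of the intercept plus the coefficient-weighted n-gram counts, so a positive coefficient pushes a headline towards "rise" and a negative one towards "fall". The following minimal sketch shows the arithmetic; the coefficients are made up and `rise_probability` is a hypothetical helper.

```python
import math


def rise_probability(intercept, coefficients, counts):
    """Logistic model: sigmoid of the intercept plus the weighted n-gram counts."""
    score = intercept + sum(coefficients[gram] * count for gram, count in counts.items())
    return 1.0 / (1.0 + math.exp(-score))


# Made-up coefficients for illustration; they are not results from this notebook
coefficients = {"growth": 0.8, "crisis": -0.6}

print(rise_probability(0.0, coefficients, {}))
print(rise_probability(0.0, coefficients, {"growth": 1}))
print(rise_probability(0.0, coefficients, {"crisis": 2}))
```

A large coefficient on a rare n-gram can reflect only a handful of training headlines, so the head and tail of the table are better read as hints than as reliable signals.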

## **KAG_BENZ_ANALYST_DF, KAG_BENZ_PARTNER_DF 2008-2016**

I examine these merged datasets over the same interval as the Reddit world news; then, after merging and combining the two, I check whether this improves the accuracy.

Source of these datasets:

*   https://www.kaggle.com/miguelaenlle/massive-stock-news-analysis-db-for-nlpbacktests 

### Loading the datasets

In [85]:
if DATASET == 3:
    # Copy the dataset to the local environment
    !cp "/content/drive/MyDrive/Kaggle dataset/Reddit Top 25 DJIA/KAG_REDDIT_WRLD_DJIA_DF_corrected.csv" "KAG_REDDIT_WRLD_DJIA_DF.csv"
    !cp "/content/drive/MyDrive/Kaggle dataset/Benzinga news with ticker/KAG_BENZ_ANALYST_DF_1.csv" "KAG_BENZ_ANALYST_DF_1.csv"
    !cp "/content/drive/MyDrive/Kaggle dataset/Benzinga news with ticker/KAG_BENZ_ANALYST_DF_2.csv" "KAG_BENZ_ANALYST_DF_2.csv"
    !cp "/content/drive/MyDrive/Kaggle dataset/Benzinga news with ticker/KAG_BENZ_PARTNER_DF_1.csv" "KAG_BENZ_PARTNER_DF_1.csv"
    !cp "/content/drive/MyDrive/Kaggle dataset/Benzinga news with ticker/KAG_BENZ_PARTNER_DF_2.csv" "KAG_BENZ_PARTNER_DF_2.csv"


    # Check that the copy was successful -> good if there is no AssertionError
    import os
    for file_name in ["KAG_REDDIT_WRLD_DJIA_DF.csv", "KAG_BENZ_ANALYST_DF_1.csv",
                      "KAG_BENZ_ANALYST_DF_2.csv", "KAG_BENZ_PARTNER_DF_1.csv",
                      "KAG_BENZ_PARTNER_DF_2.csv"]:
        assert os.path.exists(file_name), file_name + " was not copied"

    # Load the datasets 
    df_reddit = pd.read_csv('KAG_REDDIT_WRLD_DJIA_DF.csv', index_col = "Date")
    df_benz_1 = pd.read_csv('KAG_BENZ_ANALYST_DF_1.csv', index_col = "date")
    df_benz_2 = pd.read_csv('KAG_BENZ_ANALYST_DF_2.csv', index_col = "date")    
    df_partner_1 = pd.read_csv('KAG_BENZ_PARTNER_DF_1.csv', index_col = "date")
    df_partner_2 = pd.read_csv('KAG_BENZ_PARTNER_DF_2.csv', index_col = "date")

    # Load the stock data
    df_stock = web.DataReader("DJIA", data_source="yahoo", start="2008-08-08", 
                              end="2016-07-01")

Concatenating the split datasets, then displaying them.

In [86]:
if DATASET == 3:
    # Merge them
    df_benz = pd.concat([df_benz_1, df_benz_2])
    df_partner = pd.concat([df_partner_1, df_partner_2])

    # Show the dataframe
    print("BENZ")
    print(df_benz.head())
    print("...")
    print(df_benz.tail())
    print(len(df_benz))
    print("\n\nPARTNER")
    print(df_partner.head())
    print("...")
    print(df_partner.tail())
    print(len(df_partner))

Filtering the data that falls within the examined period.

In [87]:
if DATASET == 3:
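    # Note: these are string comparisons, so the bounds select the intended range
    # only if the index is written in the same YYYY/MM/DD text format
    df_benz_inspect = df_benz[df_benz.index < '2016/07/02']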
    df_benz_inspect = df_benz_inspect[df_benz_inspect.index > '2008/08/07']
    df_benz_inspect = df_benz_inspect.drop_duplicates()

    df_partner_inspect = df_partner[df_partner.index < '2016/07/02']
    df_partner_inspect = df_partner_inspect[df_partner_inspect.index > '2008/08/07']
    df_partner_inspect = df_partner_inspect.drop_duplicates()

    print("BENZ")
    print(df_benz_inspect.head(2))
    print("...")
    print(df_benz_inspect.tail())
    print(df_benz_inspect.shape)

    print("\n\nPARTNER")
    print(df_partner_inspect.head())
    print("...")
    print(df_partner_inspect.tail())
    print(df_partner_inspect.shape)

    df_benz = pd.concat([df_benz_inspect, df_partner_inspect])

    print("\n\nSummary length:\t\t" + str(len(df_benz_inspect) + len(df_partner_inspect)))

Filtering the dataset down to the following Dow Jones Industrial Average constituents:


*   Procter & Gamble, PG, 1932-05-26
*   3M Company, MMM, 1976-08-09
*   IBM, IBM, 1979-06-29
*   Merck & Co., MRK, 1979-06-29
*   American Express, AXP, 1982-08-30
*   McDonald's, MCD, 1985-10-30
*   Boeing, BA, 1987-03-12
*   The Coca-Cola Company, KO, 1987-03-12
*   Caterpillar Inc., CAT, 1991-05-06
*   JPMorgan Chase, JPM, 1991-05-06
*   The Walt Disney Company, DIS, 1991-05-06
*   Johnson & Johnson, JNJ, 1997-03-17
*   Walmart, WMT, 1997-03-17
*   The Home Depot, HD, 1999-11-01
*   Intel, INTC, 1999-11-01
*   Microsoft, MSFT, 1999-11-01
*   Verizon, VZ, 2004-04-08
*   Chevron Corporation, CVX, 2008-02-19
*   Cisco Systems, CSCO, 2009-06-08
*   The Travelers Companies, TRV, 2009-06-08
*   UnitedHealth Group, UNH, 2012-09-24
*   Goldman Sachs, GS, 2013-09-20
*   Nike, NKE, 2013-09-20
*   Visa Inc., V, 2013-09-20



In [88]:
if DATASET == 3:
    tickers = []
    unique_count = []

    # The stock tickers that are needed
    stock_ticker = ["PG", "MMM", "IBM", "MRK", "AXP", "MCD", "BA", "KO", "CAT", "JPM",
                    "DIS", "JNJ", "WMT", "HD", "INTC", "MSFT", "VZ", "CVX", "CSCO",
                    "TRV", "UNH", "GS", "NKE", "V"]

    stocks_benz = df_benz.groupby("stock")
    stock_benz_df = stocks_benz.describe()

    for stock in range(len(stock_benz_df.index)):
        tickers.append(stock_benz_df.index[stock])

    for stock in stock_ticker:
        try:
            unique_count.append(stock_benz_df.iloc[tickers.index(stock), :][1]) #unique
        except:
            unique_count.append(0)
            print(str(stock) + "\tis not in list")

    print("\n\t---------------------------------------\n")

    sum_count = 0

    for stock in range(len(stock_ticker)):
        print(str(stock_ticker[stock]) + "\t\t\t" + str(unique_count[stock]))
        sum_count = sum_count + unique_count[stock]

    print("\n\t---------------------------------------\n")
    print("Summary of news with the tickers:\t" + str(sum_count))

In [89]:
if DATASET == 3:
    df_benz_filtered = pd.DataFrame()

    for stock in stock_ticker:
        df_temp = df_benz[(df_benz["stock"]) == stock].drop_duplicates()
        df_benz_filtered = pd.concat([df_benz_filtered, df_temp])

    df_benz_filtered.sort_index(ascending=True, inplace=True)

    df_benz = df_benz_filtered
    df_benz.drop("stock", axis = 1, inplace = True)
    df_benz.drop_duplicates(inplace=True)

    print("BENZ")
    print(df_benz.head(2))
    print("...")
    print(df_benz.tail(2))
    print(df_benz.shape)
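
The per-ticker loop above keeps only the rows whose `stock` value is one of the listed tickers. As a hedged alternative, the same selection can be written with a single `isin` mask. The frame below is a made-up stand-in for `df_benz`; `df_example` and `wanted_tickers` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for df_benz with made-up rows
df_example = pd.DataFrame(
    {
        "headline": ["ibm beats estimates", "some other stock", "ibm beats estimates", "visa update"],
        "stock": ["IBM", "XYZ", "IBM", "V"],
    },
    index=pd.Index(["2010/01/05", "2010/01/04", "2010/01/05", "2010/01/03"], name="date"),
)
wanted_tickers = ["IBM", "V"]

# Keep only rows whose ticker is in the list, drop duplicate rows, sort by date
df_filtered = (
    df_example[df_example["stock"].isin(wanted_tickers)]
    .drop_duplicates()
    .sort_index()
)
print(df_filtered)
```

Note that `drop_duplicates` compares column values only and ignores the index, so the same headline for the same ticker on two different dates would also be reduced to one row.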

### Preparing the text

Next comes the text preprocessing: removing punctuation, removing numbers, stripping superfluous whitespace, and converting every word to lower case. The cell also removes the English stop words.

In [90]:
if DATASET == 3:
    # Removing punctuations
    temp_news = []
    news_sum = df_benz["headline"]

    for line in news_sum:
      temp_attach = ""
      for word in line:
        temp = " "
        if word not in string.punctuation:
          temp = word
        temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    news_sum = temp_news
    temp_news = []

    # Remove numbers
    for line in news_sum:
      temp_attach = ""
      for word in line:
        temp = " "
        if not word.isdigit():
          temp = word
        temp_attach = temp_attach + "".join(temp)
      temp_news.append(temp_attach)

    # Remove space
    for line in range(len(temp_news)):    
      temp_news[line] = " ".join(temp_news[line].split())

    # Converting headlines to lower case
    for line in range(len(temp_news)): 
        temp_news[line] = temp_news[line].lower()

    # Update the data frame
    df_benz["headline"] = temp_news

    # Load the stop words
    stop_words = set(stopwords.words('english'))

    filtered_sentence = []
    news_sum = df_benz["headline"]

    # Remove stop words
    for line in news_sum:
      word_tokens = word_tokenize(line)
      temp_attach = ""
      for word in word_tokens:
        temp = " "
        if not word in stop_words:
          temp = temp + word
        temp_attach = temp_attach + "".join(temp)
      filtered_sentence.append(temp_attach)

    # Remove space
    for line in range(len(filtered_sentence)):    
      filtered_sentence[line] = " ".join(filtered_sentence[line].split())

    # Update the data frame
    df_benz["headline"] = filtered_sentence

    print("BENZ")
    print(df_benz.head(2))
    print("...")
    print(df_benz.tail(2))
    print(df_benz.shape)

### Creating the labels, splitting the dataset

Next, the labels are generated and the dataset is split into a training set and a validation set.

In [91]:
if DATASET == 3:
    days = []
    stock_days = []
    wrong_days = []

    # Create dates and remove duplicates
    for day in range(len(df_benz.index)):
        temp = str(df_benz.index[day])[0:10].replace("-","/")
        if day == 0:
            days.append(temp)
        elif df_benz.index[day] != df_benz.index[day - 1]:
            days.append(temp)

    # Update the dataframe date column
    df_benz.reset_index(inplace=True)
    temp_days = df_benz["date"]
    days_to_update = []
    for date in range(len(temp_days)):
        temp = str(temp_days[date])[0:10].replace("-","/")
        days_to_update.append(temp)

    df_benz["date"] = days_to_update
    df_benz.set_index("date", inplace=True, drop=True)    

    # Collect the trading days that are present in the stock data
    for day in range(len(df_stock.index)):
        stock_days.append(str(df_stock.index[day])[0:10].replace("-","/"))

    # Keep only the dates that are trading days
    good_days = []
    for day in days:
        # Membership test, so a match at position 0 (the first trading day) is kept too
        if day in stock_days:
            good_days.append(str(day))
        else:
            wrong_days.append(str(day))

    print("All days:\t\t\t\t" + str(len(days)))
    print("Good days:\t\t\t\t" + str(len(good_days)))
    print("Wrong days:\t\t\t\t" + str(len(wrong_days)))

    label_benz = []
    date_label_benz = []
    title_label_benz = []

    for day in range(len(good_days)):
        titles = df_benz["headline"][good_days[day]]
        # A date with several headlines yields a Series, a single headline a plain value
        if isinstance(titles, pd.Series):
            titles = titles.tolist()
        else:
            titles = [titles]

        if day == 0:
            # The first day is not compared with a previous close and gets label 0
            label = 0
        else:
            stock_day = stock_days.index(good_days[day])
            close_today = int(df_stock["Adj Close"].iloc[stock_day])
            close_previous = int(df_stock["Adj Close"].iloc[stock_day - 1])
            # 1 -> rise or unchanged, 0 -> fall (closes are truncated to whole points)
            label = 1 if close_today >= close_previous else 0

        for title in titles:
            title_label_benz.append(title)
            label_benz.append(label)
            date_label_benz.append(good_days[day])

    print("News with labels length:\t\t" + str(len(label_benz)))
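
The first half of the cell above reduces each timestamp to a `YYYY/MM/DD` key and then separates the news days into trading days and non-trading days. A minimal standard-library sketch of that step follows. The timestamps are made up, their format is an assumption about the real index, and `to_day_key` is a hypothetical helper.

```python
def to_day_key(timestamp_text):
    """Turn a 'YYYY-MM-DD ...' timestamp string into a 'YYYY/MM/DD' key."""
    return timestamp_text[0:10].replace("-", "/")


# Made-up timestamps; the exact format of the real index is an assumption
news_timestamps = [
    "2010-01-04 09:30:00-05:00",
    "2010-01-04 16:05:00-05:00",
    "2010-01-09 12:00:00-05:00",
]
# A set makes the membership test fast even for many trading days
trading_days = {"2010/01/04", "2010/01/05"}

# One key per distinct news day, in chronological order
news_days = sorted({to_day_key(stamp) for stamp in news_timestamps})
good_days = [day for day in news_days if day in trading_days]
wrong_days = [day for day in news_days if day not in trading_days]
print(good_days, wrong_days)
```

Sorting the keys as text gives chronological order only because the zero-padded `YYYY/MM/DD` form sorts the same way as the dates themselves.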

In [92]:
if DATASET == 3:
    df_benz = pd.DataFrame()
    df_benz["date"] = date_label_benz
    df_benz["label"] = label_benz
    df_benz["title"] = title_label_benz
    df_benz.set_index("date", inplace=True)
    df_benz.sort_index(ascending=True, inplace=True)
    print(df_benz.head())
    print("...")
    print(df_benz.tail())
    print(df_benz.shape)

    # drop duplicates
    df_benz.drop_duplicates(subset="title", inplace=True)
    print("\n\n ----- Drop duplicate title -----\n")
    print(df_benz.head())
    print("...")
    print(df_benz.tail())
    print(df_benz.shape) 

Splitting the dataset.

In [93]:
if DATASET == 3:
    # Drop the dates
    df_benz_label_title = pd.DataFrame()
    df_benz_label_title["label"] = df_benz["label"]
    df_benz_label_title["title"] = df_benz["title"]
    # Reset the index
    df_benz_label_title.reset_index(inplace=True, drop=True)
    print("New dataset without the dates")
    print(df_benz_label_title.head())
    print(len(df_benz_label_title))

    # Do the shuffle
    for i in range(SHUFFLE_CYCLE):
      df_benz_label_title = shuffle(df_benz_label_title, random_state = RANDOM_SEED)

    # Reset the index
    df_benz_label_title.reset_index(inplace=True, drop=True)

    # Show the data frame
    print("\n\nAfter shuffle")
    print(df_benz_label_title.head())    

    # Split the dataset
    INPUT_SIZE = len(df_benz_label_title)
    TRAIN_SIZE = int(TRAIN_SPLIT * INPUT_SIZE) 
    TEST_SIZE = int(TEST_SPLIT * INPUT_SIZE)

    train = df_benz_label_title[:TRAIN_SIZE] 
    test = df_benz_label_title[TRAIN_SIZE:]

    # Print out the length
    print("\n\nAfter split")
    print("Train data set length: " + str(len(train)))
    print("Test data set length: " + str(len(test)))
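    print("Split total: " + str(len(train) + len(test)))
    print("Dataset total before split: " + str(len(df_benz_label_title)))

    # Check that the split neither lost nor duplicated rows
    split_total = len(train) + len(test)
    dataset_total = len(df_benz_label_title)  # avoids shadowing the built-in sum()
    assert split_total == dataset_total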
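
The shuffle-then-slice split above could also be expressed with scikit-learn's `train_test_split`, which additionally offers stratification so that both parts keep the same label ratio. This is only a sketch on a hypothetical toy frame. `TOY_TRAIN_SPLIT` and `TOY_SEED` are assumed stand-ins for the notebook's `TRAIN_SPLIT` and `RANDOM_SEED`, whose actual values are defined elsewhere:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with the same two columns as df_benz_label_title
toy = pd.DataFrame({
    "label": [0, 1] * 10,
    "title": ["headline " + str(i) for i in range(20)],
})

# Assumed values for illustration only
TOY_TRAIN_SPLIT = 0.8
TOY_SEED = 42

toy_train, toy_test = train_test_split(
    toy,
    train_size=TOY_TRAIN_SPLIT,
    random_state=TOY_SEED,
    shuffle=True,
    stratify=toy["label"],  # keep the 0/1 ratio equal in both parts
)

print("Train data set length: " + str(len(toy_train)))
print("Test data set length: " + str(len(toy_test)))
```

Stratification is a design choice rather than something the original split does; without the `stratify` argument the call behaves like a plain seeded shuffle followed by a slice.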

### n-gram model

Automatic training over a range of n-gram configurations and display of the results, together with the words that have the highest logistic regression coefficients.
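
To make the effect of `ngram_range` concrete, the sketch below fits `CountVectorizer` on two hypothetical headlines (invented for illustration) and lists the learned vocabulary for unigrams, bigrams and both combined. The helper `ngram_vocabulary` is not part of the notebook; it only wraps the calls for readability.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy headlines, not taken from the dataset
toy_headlines = ["stocks rise on earnings", "stocks fall on earnings"]


def ngram_vocabulary(texts, min_n, max_n):
    """Return the sorted n-gram vocabulary learned from texts."""
    vectorizer = CountVectorizer(ngram_range=(min_n, max_n))
    vectorizer.fit(texts)
    # vocabulary_ is a dict mapping each n-gram to its column index
    return sorted(vectorizer.vocabulary_)


unigrams = ngram_vocabulary(toy_headlines, 1, 1)
bigrams = ngram_vocabulary(toy_headlines, 2, 2)
both = ngram_vocabulary(toy_headlines, 1, 2)

print(unigrams)
print(bigrams)
print(len(both))
```

The feature count grows quickly with the upper bound of the range, which is why the shape of the training matrix is worth printing for each configuration.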

In [94]:
if DATASET == 3:
    model_type = []
    result = []
    model_type_values = []
    train_headlines = []
    test_headlines = []

    # n-gram upper bounds to try: 1 through 6
    model_type_values = list(range(1, 7))

    # The headline text is the second column ("title") of both splits
    train_headlines = train["title"].tolist()
    test_headlines = test["title"].tolist()


    for MODEL_TYPE in model_type_values:

        for n in range(1,MODEL_TYPE+1):
            print("--------------------------------------------\n\nStart of the " 
                  + str(n) + "," + str(MODEL_TYPE) + " gram model\n")

            _gram_vectorizer_ = CountVectorizer(ngram_range=(n,MODEL_TYPE))
            _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

            print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

            _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
            _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["label"])

            _gram_test_ = _gram_vectorizer_.transform(test_headlines)
            _gram_predictions_ = _gram_model_.predict(_gram_test_)

            accuracy = accuracy_score(test["label"], _gram_predictions_)
            print(accuracy)

            model_type.append(str(n) + "," + str(MODEL_TYPE) + " n-gram")
            result.append(accuracy)

    best_model_gram = 0
    best_model_gram_index = 0  # defined even if no accuracy exceeds 0

    print("\n\n")
    for model in range(len(model_type)):
        print(str(model_type[model]) + ":\t\t\t\t\t" + str(result[model]))

        if result[model] > best_model_gram:
            best_model_gram = result[model]
            best_model_gram_index = model

    print("--------------------------------------------\nBest model:\n" 
          + str(model_type[best_model_gram_index]) + "\t\t\t\t\t" + 
          str(result[best_model_gram_index]))

In [95]:
if DATASET == 3:
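    MODEL_TYPE = str(model_type[best_model_gram_index])

    # show the first
    print(train_headlines[0])

    # Parse the range from a label such as "1,2 n-gram" instead of indexing single characters
    min_n, max_n = (int(value) for value in MODEL_TYPE.split(" ")[0].split(","))

    _gram_vectorizer_ = CountVectorizer(ngram_range=(min_n, max_n))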
    _train_vectorizer_ = _gram_vectorizer_.fit_transform(train_headlines)

    print("The shape is: " + str(_train_vectorizer_.shape) + "\n")

    _gram_model_ = LogisticRegression(random_state=RANDOM_SEED, max_iter=MAX_ITER)
    _gram_model_ = _gram_model_.fit(_train_vectorizer_, train["label"])

    _gram_test_ = _gram_vectorizer_.transform(test_headlines)
    _gram_predictions_ = _gram_model_.predict(_gram_test_)

    print(accuracy_score(test["label"], _gram_predictions_))

In [96]:
if DATASET == 3:
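    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from version 1.0 onwards
    _gram_words_best_ = _gram_vectorizer_.get_feature_names_out()
    _gram_coeffs_best_ = _gram_model_.coef_.tolist()[0]

    coeffdf = pd.DataFrame({'Word' : _gram_words_best_, 
                            'Coefficient' : _gram_coeffs_best_})

    # Highest coefficients first; ties are ordered alphabetically
    coeffdf = coeffdf.sort_values(['Coefficient', 'Word'], ascending=[False, True])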

    print(coeffdf.head(10))

In [97]:
if DATASET == 3:
    print(coeffdf.tail(10))
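
How to read the two coefficient tables: in a binary logistic regression a positive coefficient pushes the prediction towards label 1 and a negative one towards label 0, so the head of the sorted frame holds the n-grams most associated with label 1 and the tail those most associated with label 0. The sketch below illustrates this on hypothetical toy data (the titles and labels are invented and the `toy_` names are not part of the notebook):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: "surge" only occurs with label 1, "plunge" only with label 0
toy_titles = ["shares surge", "profits surge", "shares plunge", "profits plunge"]
toy_labels = [1, 1, 0, 0]

toy_vectorizer = CountVectorizer(ngram_range=(1, 1))
toy_matrix = toy_vectorizer.fit_transform(toy_titles)
toy_model = LogisticRegression(random_state=0, max_iter=1000).fit(toy_matrix, toy_labels)

# vocabulary_ maps each term to its column index, so sorting the terms by that
# index lines them up with the columns of coef_
toy_words = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_ranking = sorted(zip(toy_model.coef_[0], toy_words), reverse=True)

for coefficient, word in toy_ranking:
    print(f"{word:10s}{coefficient:+.3f}")
```

Words that occur equally often under both labels ("shares" and "profits" here) end up with coefficients near zero, which is why only the extremes of the sorted frame are informative.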