### TODOs

* Test if title is shorter than... filter out
* Check google news... not so cool stuff
* Implement Images
* Continue categorization

In [55]:
import warnings
warnings.filterwarnings("ignore") # to ignore all future warnings

## 1. Preparing the dataset

### 1.1 Scraping news articles from the web

This process takes between 2 and 15 minutes on average, depending on how many website links are scraped, how many articles are found behind these links, and how much computing resources are available on the machine running the code.

In [2]:
import feedparser as fp
import newspaper
from newspaper import Article
import time
from time import mktime
from datetime import datetime
from datetime import date
import pandas as pd
import json
import pprint
import dateutil.parser # import the submodule explicitly, since dateutil.parser.parse is used below

#### 1 Website data ####

with open('NewsPapers_new.json') as data_file: #Loads the JSON files with news URLs
    companies = json.load(data_file)

#### 2 Today's date - for filtering the articles by today's date ####
today = str(date.today()) 
print("Today's date:", today)


#### 3 Scraping the news articles ####

text_list, source_list, article_list, date_list, time_list, title_list, image_list, keywords_list, summaries_list = [], [], [], [], [], [], [], [], []

for source, value in companies.items(): 
    d = fp.parse(value['rss'])
    article={}
    for entry in d.entries:
        if hasattr(entry, 'published') and ((dateutil.parser.parse(getattr(entry, 'published'))).strftime("%Y-%m-%d") == today):
            article['source'] = source
            source_list.append(article['source'])

            # getting the article URLs
            article['link'] = entry.link
            article_list.append(article['link'])

            # getting the article published dates
            # named "published" so that the imported datetime.date class is not overwritten
            published = dateutil.parser.parse(entry.published)
            date_formated = published.strftime("%Y-%m-%d")
            time_formated = published.strftime("%H:%M:%S %Z") # hour, minute, second, timezone name
            date_list.append(date_formated)
            time_list.append(time_formated)

            # "downloading" the articles
            content = Article(entry.link)
            try:
                content.download()
                content.parse()  
                content.nlp()
            except Exception as e: 
                # if the download fails, print the error and immediately continue with the next article
                print(e)
                print("continuing...")
            
            # save the "downloaded" content
            title = content.title #extract article titles
            image = content.top_image #extract article images
            image_list.append(image)
            keywords = content.keywords
            keywords_list.append(keywords)
            title_list.append(title)
            text = content.text
            text_list.append(text)
            summaries = content.summary
            summaries_list.append(summaries)
                
#creating dicts for formatting and inserting to pandas df
source_dict = {'source':source_list}
link_dict = {'link':article_list}
date_dict = {'published_date':date_list}
time_dict = {'published_time':time_list}
title_dict = {'title':title_list}
text_dict = {'text':text_list}
keyword_dict = {'keywords':keywords_list}
image_dict = {'image':image_list}
summary_dict = {'summary':summaries_list}

#creating separate pandas dfs for each feature
source_df = pd.DataFrame(source_dict, index=None)
link_df = pd.DataFrame(link_dict, index=None)
date_df = pd.DataFrame(date_dict, index=None)
time_df = pd.DataFrame(time_dict, index=None)
title_df = pd.DataFrame(title_dict, index=None)
text_df = pd.DataFrame(text_dict, index=None)
keyword_df = pd.DataFrame(keyword_dict, index=None)
image_df = pd.DataFrame(image_dict, index=None)
summary_df = pd.DataFrame(summary_dict, index=None)

#join all pandas dfs together
news_df = source_df.join(link_df).join(date_df).join(time_df).join(title_df).join(text_df).join(keyword_df).join(image_df).join(summary_df)


# after running, news_df should contain source, link, published_date, published_time, title, text, keywords, image and summary

Today's date: 2019-12-03
Article `download()` failed with HTTPSConnectionPool(host='news.yahoo.com', port=443): Read timed out. on URL https://news.yahoo.com/egypt-ethiopia-sudan-meet-us-183714485.html
continuing...
Article `download()` failed with HTTPSConnectionPool(host='news.yahoo.com', port=443): Read timed out. on URL https://news.yahoo.com/brexit-explained-us-shippers-know-141907158.html
continuing...
Article `download()` failed with HTTPSConnectionPool(host='finance.yahoo.com', port=443): Read timed out. on URL https://finance.yahoo.com/news/faa-calls-lufthansa-skirting-operating-140408913.html
continuing...
Article `download()` failed with HTTPSConnectionPool(host='finance.yahoo.com', port=443): Read timed out. on URL https://finance.yahoo.com/news/david-tepper-trims-unitedhealth-exits-200143393.html
continuing...


### 1.2. Filtering and cleaning the dataset

In order to run some analysis on the titles and text content of the articles, we need to clean them.
We first filter all the articles we scraped by today's date. 
For cleaning the titles and article content text, we go through the following steps:

*  remove stopwords (e.g. "a", "for", "when", "you", "if", etc., which would impact the accuracy of our similarity analysis)
*  remove punctuation
*  remove numbers
*  remove names of the source website in the article text (we noticed that, for example, CNN often mentions "CNN" in their articles, which would impact the accuracy of our similarity analysis)
*  make the sentences lower case
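
The steps above can be sketched on a single sentence as follows. This is a minimal illustration, not the notebook's actual pipeline: the function `clean_sentence` and the tiny `STOPWORDS` set are hypothetical stand-ins (the cells below use nltk's English stopword list and pandas string methods instead).

```python
import re

# Tiny illustrative stopword set (assumption); the notebook uses nltk's English list.
STOPWORDS = {"a", "for", "when", "you", "if", "the", "in", "to", "of", "her"}

def clean_sentence(sentence, source_names=()):
    """Lower-case a sentence and strip source names, punctuation, numbers and stopwords."""
    text = sentence.lower()
    text = re.sub(r"['’]s\b", "", text)          # drop possessive 's
    for name in source_names:                     # drop mentions of the source website
        text = re.sub(rf"\b{re.escape(name.lower())}\b", "", text)
    text = re.sub(r"[^\w\s]", " ", text)          # replace punctuation with a space
    text = re.sub(r"\d+", "", text)               # remove numbers
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)

example = clean_sentence("(CNN) Kamala Harris ends her 2020 campaign!", source_names=["CNN"])
print(example)
```

Splitting on whitespace at the end also collapses the extra spaces left behind by the removed punctuation and numbers.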

In [56]:
import re
import nltk
nltk.download('wordnet')
#from nltk import word_tokenize
from nltk.corpus import stopwords
from unidecode import unidecode
import string

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\elandman\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [468]:
# 1 Reset the DF index
#news_df_daily = news_df_daily[news_df_daily.title != ""]
#news_df_daily = news_df.reset_index(drop=True)

# 2 Removing missing titles i.e. articles extracted without titles or texts

news_df_daily = news_df[news_df.title != ""]
news_df_daily = news_df_daily[news_df_daily.text != ""]
news_df_daily = news_df_daily.reset_index(drop=True)

# 3 Make all letters lower case
news_df_daily["clean_title"] = news_df_daily["title"].str.lower()
news_df_daily["clean_text"] = news_df_daily["text"].str.lower()

# 4 Filter out the stopwords
stop = stopwords.words('english')

news_df_daily["clean_title"] = ((news_df_daily["clean_title"].str.replace("'s",'')).str.replace("’s",''))
news_df_daily["clean_text"] = ((news_df_daily["clean_text"].str.replace("'s",'')).str.replace("’s",''))

news_df_daily['clean_title'] = news_df_daily['clean_title'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
news_df_daily['clean_text'] = news_df_daily['clean_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))

# 5 Remove sources, punctuation (r'[^\w\s]') and numbers (r'\d+')
sources_set = set(source_dict['source']) # unique source names
sources_to_replace = dict.fromkeys(sources_set, "") # replace every source name with an empty string

# regex=True is passed explicitly, since newer pandas versions treat the pattern as a literal string by default
news_df_daily["clean_title"] = (news_df_daily["clean_title"]
                                .str.replace(r'[^\w\s]', '', regex=True)
                                .str.replace(r'\d+', '', regex=True)
                                .replace(sources_to_replace, regex=True))


news_df_daily["clean_text"] = (news_df_daily["clean_text"]
                               .str.replace(r'[^\w\s]', '', regex=True)
                               .str.replace(r'\d+', '', regex=True)
                               .replace(sources_to_replace, regex=True))

news_df_daily = news_df_daily[~news_df_daily["clean_title"].str.contains("washington post")] 
# remove washington post articles, since these cannot be scraped
# TODO if test is shorter than... filter out #####################################

# 6 Remove non-ascii characters
news_df_daily["clean_title"] = news_df_daily["clean_title"].apply(unidecode)
news_df_daily["clean_text"] = news_df_daily["clean_text"].apply(unidecode)

# 7 Lemmatize words
w_tokenizer, lemmatizer = nltk.tokenize.WhitespaceTokenizer() , nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

news_df_daily["clean_title"] = (news_df_daily["clean_title"].apply(lemmatize_text).apply(lambda x: ' '.join([word for word in x])))

news_df_daily["clean_text"] = (news_df_daily["clean_text"].apply(lemmatize_text).apply(lambda x: ' '.join([word for word in x])))

news_df_daily["keywords"] = news_df_daily["keywords"].apply(lambda x: ' '.join([word for word in x]))

######


pd.set_option('display.max_colwidth', 20)

news_df_daily
##################################################

#(news_df_daily["title"] == "")

#for i in news_df_daily["source"].isna():
#    if i == False:
#        print(i)

Unnamed: 0,source,link,published_date,published_time,title,text,keywords,image,summary,clean_title,clean_text
0,cnn,http://rss.cnn.c...,2019-12-03,20:07:56 UTC,Impeachment repo...,(CNN) House Demo...,vote misconduct ...,https://cdn.cnn....,(CNN) House Demo...,impeachment repo...,house democrat s...
1,cnn,http://rss.cnn.c...,2019-12-03,19:26:30 UTC,Read: Democrats'...,Chat with us in ...,whats facebook r...,https://cdn.cnn....,Chat with us in ...,read democrat tr...,chat u facebook ...
2,cnn,http://rss.cnn.c...,2019-12-03,19:54:11 UTC,Sophia Nelson: A...,"Sophia Nelson, f...",nelson embarrass...,https://cdn.cnn....,"Sophia Nelson, f...",sophia nelson fo...,sophia nelson fo...
3,cnn,http://rss.cnn.c...,2019-12-03,17:24:49 UTC,Here's why the i...,(CNN) The impeac...,polling trumps m...,https://cdn.cnn....,(CNN) The impeac...,impeachment poll...,impeachment inqu...
4,cnn,http://rss.cnn.c...,2019-12-03,19:24:36 UTC,Senators grill S...,(CNN) A Senate h...,hale grill state...,https://cdn.cnn....,Sen. Robert Mene...,senator grill st...,senate hearing r...
5,cnn,http://rss.cnn.c...,2019-12-03,19:53:00 UTC,1 sentence that ...,(CNN) In late Ju...,russians trumps ...,https://cdn.cnn....,Hicks warned tha...,sentence perfect...,late june presid...
6,cnn,http://rss.cnn.c...,2019-12-03,19:10:00 UTC,Kamala Harris en...,(CNN) Sen. Kamal...,35 ends caption ...,https://cdn.cnn....,Photos: Former p...,kamala harris en...,sen kamala harri...
7,cnn,http://rss.cnn.c...,2019-12-03,19:09:51 UTC,Rep. Duncan Hunt...,Washington (CNN)...,misuse misusing ...,https://cdn.cnn....,Washington (CNN)...,rep duncan hunte...,washington repub...
8,cnn,http://rss.cnn.c...,2019-12-03,20:10:38 UTC,Graham says he's...,Washington (CNN)...,hes dnc russians...,https://cdn.cnn....,Washington (CNN)...,graham say confi...,washington repub...
9,cnn,http://rss.cnn.c...,2019-12-03,15:10:59 UTC,Macron corrects ...,French President...,fighters french ...,https://cdn.cnn....,French President...,macron corrects ...,french president...


## 2. Analyzing the dataset

In this step, we apply several different analysis methods, in order to define which articles out of those we scraped are **most relevant** for portfolio trading customers and **cover trending financial topics**.

### 2.1. Cosine similarity

Cosine similarity is a metric for measuring the similarity between two sentences. It turns the sentences into numeric vectors and measures the **cosine of the angle between them**.

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/1d94e5903f7936d3c131e040ef2c51b473dd071d" alt="Cosine similarity formula" title="Cosine similarity formula" />

where
* A ........... vector A
* A • B ..... dot product between vector A and B
* | A | ....... length of vector A


We apply this measure to both the titles and the texts.
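
As a small worked example of the formula above, the following sketch computes the cosine similarity of hand-made count vectors with plain numpy. The vocabulary and the three titles are made up for illustration; the helper `cosine_sim` is not part of the notebook, which uses scikit-learn's `cosine_similarity` instead.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors: (A . B) / (|A| * |B|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical vocabulary: ["harris", "ends", "campaign", "drops", "race"]
title_a = [1, 1, 1, 0, 0]   # "harris ends campaign"
title_b = [1, 0, 0, 1, 1]   # "harris drops race"
title_c = [2, 2, 2, 0, 0]   # title_a with every word repeated twice

sim_ab = cosine_sim(title_a, title_b)   # one shared word out of three each
sim_ac = cosine_sim(title_a, title_c)   # same direction, different length

print(round(sim_ab, 3), round(sim_ac, 3))
```

`title_a` and `title_c` point in the same direction, so their similarity is 1 even though their lengths differ; this is the property that distinguishes cosine similarity from a distance measure.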

#### 2.1.A. Cosine similarity: titles

In [113]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer #for creating count vectors
from sklearn.metrics.pairwise import cosine_similarity #cosine similarity calculator

# for analysis, we need a list of all the titles
clean_titles_list = list(news_df_daily['clean_title'])

count_vectorizer = CountVectorizer()
count_matrix_title_sparse = count_vectorizer.fit_transform(clean_titles_list) # creates the count vector in sparse matrix
count_matrix_title_np = count_matrix_title_sparse.todense() # creates numpy matrix out from all count vectors
count_matrix_title_df = pd.DataFrame(count_matrix_title_np, columns=count_vectorizer.get_feature_names()) # creates df from count vectors

# apply cosine similarity on count vector dataframe
df_cosim_title = pd.DataFrame(cosine_similarity(count_matrix_title_df, count_matrix_title_df))
df_cosim_title.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,426,427,428,429,430,431,432,433,434,435
0,1.0,0.474342,0.0,0.204124,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.474342,1.0,0.0,0.258199,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.204124,0.258199,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.1066,0.0,0.0,0.113961,0.123091,...,0.0,0.0,0.113961,0.0,0.0,0.0,0.0,0.0,0.0,0.1066


#### 2.1.B. Cosine similarity: texts

In [114]:
# for analysis, we need a list of all the texts
clean_texts_list = list(news_df_daily['clean_text'])

count_vectorizer = CountVectorizer()
count_matrix_text = count_vectorizer.fit_transform(clean_texts_list) # creates the count vector
count_matrix_text = count_matrix_text.todense() # creates numpy matrix out from all count vectors

count_matrix_text = pd.DataFrame(count_matrix_text, columns=count_vectorizer.get_feature_names()) # creates df from count vectors

# apply cosine similarity on count vector dataframe
df_cosim_texts = pd.DataFrame(cosine_similarity(count_matrix_text, count_matrix_text))
df_cosim_texts.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,426,427,428,429,430,431,432,433,434,435
0,1.0,0.0,0.433142,0.271713,0.331837,0.199111,0.058192,0.044967,0.325653,0.167705,...,0.06266,0.039603,0.161955,0.135668,0.032341,0.034595,0.127323,0.048405,0.182643,0.095097
1,0.0,1.0,0.0,0.0,0.008435,0.0,0.002444,0.0,0.0,0.0,...,0.030261,0.013388,0.0,0.03276,0.0,0.0,0.0,0.0,0.0,0.021432
2,0.433142,0.0,1.0,0.116642,0.324736,0.162851,0.080645,0.055279,0.318335,0.151186,...,0.068088,0.013388,0.129821,0.09828,0.024296,0.029238,0.124964,0.011974,0.110258,0.064297
3,0.271713,0.0,0.116642,1.0,0.129129,0.195019,0.046392,0.028209,0.16245,0.138873,...,0.04324,0.01913,0.027652,0.104765,0.024797,0.043767,0.053851,0.0277,0.093026,0.104997
4,0.331837,0.008435,0.324736,0.129129,1.0,0.178202,0.11341,0.089755,0.441692,0.093728,...,0.116137,0.067191,0.159966,0.156674,0.041601,0.067901,0.140186,0.046188,0.178807,0.078457


### 2.2. Soft cosine similarity measure

The soft cosine measure is also a metric for the similarity between two sentences, but it gives **higher scores for words with similar meaning**. For example, ‘President’ vs ‘Prime minister’, ‘Food’ vs ‘Dish’ and ‘Hi’ vs ‘Hello’ are considered similar. 

<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/9743aceb346ccb501ceaef15a46570d1ba8a6a1b" alt="Soft cosine formula" title="Soft cosine formula" />

where
* $s_{ij}$ .... similarity (feature i, feature j)

**Difference to cosine similarity**: traditional cosine similarity treats the features of the vector space model (VSM), i.e. the unique words, as independent or completely different from one another. The soft cosine measure instead takes the similarity between features into account, which generalizes the concept of cosine similarity. Source: https://en.wikipedia.org/wiki/Cosine_similarity

This implies that we need a measure of similarity between words, i.e. word vectors in which words with similar meaning lie close together. 
The code below loads the pretrained `glove-wiki-gigaword-200` word vectors through the gensim downloader. An alternative is the `fasttext-wiki-news-subwords-300` vector dataset, which contains 1 million word embeddings trained on Wikipedia 2017. More info here: https://github.com/RaRe-Technologies/gensim-data/releases/tag/fasttext-wiki-news-subwords-300

_**Side note:** other pre-trained models to be found here: https://github.com/RaRe-Technologies/gensim-data/releases_

**Word embeddings**: the position of a word within the vector space is learned from text and is based on the words that surround it when it is used. Pre-trained word embeddings can be reused for new tasks, which is a form of transfer learning.
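
To make the role of the similarity values $s_{ij}$ concrete, here is a minimal numpy sketch of the soft cosine formula on a three-word vocabulary. The similarity matrix `S` is made up for illustration (in the notebook it is derived from the pretrained word vectors), and the helper `soft_cosine` is hypothetical, not gensim's implementation.

```python
import numpy as np

def soft_cosine(a, b, S):
    """Soft cosine: (a^T S b) / (sqrt(a^T S a) * sqrt(b^T S b))."""
    a, b, S = (np.asarray(x, dtype=float) for x in (a, b, S))
    return float((a @ S @ b) / (np.sqrt(a @ S @ a) * np.sqrt(b @ S @ b)))

# hypothetical vocabulary: ["president", "minister", "food"]
# S[i][j] is the assumed similarity between word i and word j
S = [[1.0, 0.8, 0.0],
     [0.8, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

doc_a = [1, 0, 0]   # contains only "president"
doc_b = [0, 1, 0]   # contains only "minister"

plain = soft_cosine(doc_a, doc_b, np.eye(3))  # identity matrix = ordinary cosine similarity
soft = soft_cosine(doc_a, doc_b, S)           # word similarity taken into account

print(plain, soft)
```

With the identity matrix the two documents share no words and score 0, whereas the soft variant scores them by the assumed similarity of "president" and "minister".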

#### 2.2.A. Soft cosine measure: titles

In [7]:
import gensim
from gensim.matutils import softcossim 
from gensim import corpora
import gensim.downloader as api
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

In [8]:
### ! ### this will download a file to your harddrive ### ! ###

# first we need to download the pretrained GloVe word vectors (glove-wiki-gigaword-200) - about 250MB

glove_wiki = api.load('glove-wiki-gigaword-200')

In [115]:
glove_wiki.most_similar(positive="student")

[('students', 0.7942561507225037),
 ('teacher', 0.7468630075454712),
 ('graduate', 0.7329548001289368),
 ('school', 0.689203143119812),
 ('university', 0.6683695316314697),
 ('college', 0.6582204103469849),
 ('faculty', 0.6569023728370667),
 ('teachers', 0.6506001949310303),
 ('undergraduate', 0.6371238231658936),
 ('academic', 0.623526930809021)]

In [117]:
# create a dictionary, a map of word to unique id from the title list
dictionary_titles = corpora.Dictionary([simple_preprocess(word) for word in clean_titles_list])

In [118]:
# generate a similarity sparse matrix from the words in the dictionary
# this process takes a bit due to calculation time
similarity_matrix_titles = glove_wiki.similarity_matrix(dictionary_titles, 
                                                        tfidf=None, 
                                                        threshold=0.0, 
                                                        exponent=2.0, 
                                                        nonzero_limit=100)

In [119]:
# convert the titles into bag-of-words vectors through function
# appends the bag-of-words from all sentences into the sent list
def convert_bow(sentences):
    global sent_bow
    sent_bow = []
    for i in sentences:
        bow = dictionary_titles.doc2bow(simple_preprocess(i))
        sent_bow.append(bow)
        
convert_bow(clean_titles_list) 

# create soft cosine measure matrix through function 
""" creates a matrix with the results of the soft cosine measure calculation.
It uses the previously created sparse similarity matrix, which holds the similarities between the unique words 
in our dictionary (taken from the pretrained word-embedding model)."""

def create_soft_cossim_matrix(sentences):
    len_array = np.arange(len(sentences))
    xx, yy = np.meshgrid(len_array, len_array) # creates a grid with dimensions (nr of articles x nr of articles)
    soft_cossim_mat = pd.DataFrame([[round(softcossim(sentences[i],sentences[j], similarity_matrix_titles) ,2) for i, j in zip(x,y)] for y, x in zip(xx, yy)])
    return soft_cossim_mat

soft_cossim_mat_titles = create_soft_cossim_matrix(sent_bow)

In [120]:
soft_cossim_mat_titles.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,426,427,428,429,430,431,432,433,434,435
0,1.0,0.5,0.08,0.22,0.15,0.04,0.03,0.05,0.09,0.08,...,0.04,0.0,0.09,0.0,0.0,0.0,0.24,0.0,0.08,0.0
1,0.5,1.0,0.11,0.26,0.2,0.06,0.03,0.0,0.07,0.05,...,0.05,0.0,0.1,0.0,0.0,0.04,0.13,0.0,0.14,0.0
2,0.08,0.11,1.0,0.0,0.21,0.0,0.15,0.05,0.08,0.0,...,0.03,0.0,0.12,0.0,0.0,0.0,0.09,0.0,0.07,0.0
3,0.22,0.26,0.0,1.0,0.0,0.03,0.09,0.0,0.1,0.0,...,0.0,0.0,0.0,0.09,0.05,0.0,0.07,0.0,0.13,0.0
4,0.15,0.2,0.21,0.0,1.0,0.14,0.16,0.06,0.2,0.15,...,0.0,0.0,0.17,0.02,0.0,0.0,0.13,0.0,0.11,0.13


#### 2.2.B. Soft cosine measure: texts

**! Be aware !** 

When you run the cell below, even with only around 50 articles, the creation of the unique word dictionary and especially of the corresponding similarity matrix for the article texts takes at least 2 to 5 minutes. 

This waiting time cannot be skipped for the soft cosine comparison of texts, since it simply takes a lot of resources to compute. If you want to time exactly how long it takes, look below for paragraph _X. Other stuff that could be helpful in the future_, which contains code for timing the run time of a piece of code. :-)
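
For a quick measurement, a long-running call can also be wrapped as in the following sketch. The helper `timed` is hypothetical and only uses the standard library; the summation stands in for the expensive similarity-matrix computation.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# stand-in workload; replace with the call you want to time
result, seconds = timed(sum, range(1_000_000))
print(f"took {seconds:.3f} s")
```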

In [None]:
# create a dictionary, a map of word to unique id from the text list
dictionary_texts = corpora.Dictionary([simple_preprocess(word) for word in clean_texts_list])

# generate a similarity sparse matrix from the words in the dictionary
# this process takes a bit due to calculation time
# uses the glove_wiki vectors loaded above (fasttext_model300 is never defined in this notebook)
similarity_matrix_texts = glove_wiki.similarity_matrix(dictionary_texts, tfidf=None, threshold=0.0, exponent=2.0, nonzero_limit=100)

In [None]:
# convert the texts into bag-of-words vectors through function
# appends the bag-of-words from all sentences into the sent list
def convert_bow(sentences):
    global sent_bow
    sent_bow = []
    for i in sentences:
        bow = dictionary_texts.doc2bow(simple_preprocess(i))
        sent_bow.append(bow)
        
convert_bow(clean_texts_list) 

# create soft cosine measure matrix through function 
""" creates a matrix with the results of the soft cosine measure calculation.
It uses the previously created sparse similarity matrix, which holds the similarities between the unique words 
in our dictionary (taken from the pretrained word-embedding model)."""

def create_soft_cossim_matrix(sentences):
    len_array = np.arange(len(sentences))
    xx, yy = np.meshgrid(len_array, len_array) # creates a grid with dimensions (nr of articles x nr of articles)
    soft_cossim_mat = pd.DataFrame([[round(softcossim(sentences[i],sentences[j], similarity_matrix_texts) ,2) for i, j in zip(x,y)] for y, x in zip(xx, yy)])
    return soft_cossim_mat

soft_cossim_mat_texts = create_soft_cossim_matrix(sent_bow)

In [None]:
soft_cossim_mat_texts.head()

### 2.3. Euclidean Distance

Euclidean distance measures the distance between two vectors while taking their magnitude (length) into account. This is different from cosine similarity, which only calculates the cosine of the angle between two vectors and therefore ignores their magnitude. 

Repeated application of the Pythagorean theorem gives the formula:

<img src= "https://wikimedia.org/api/rest_v1/media/math/render/svg/4efcba672e6df32cc8eb7ce0863591806a6581b5">

The Euclidean distance is the ordinary straight-line distance between two points in space, where each point is the tip of one of the vectors. The two-dimensional case is the easiest to picture: here $p$ and $q$ are the points between which the distance is measured:

<img src= "https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/Euclidean_distance_2d.svg/1280px-Euclidean_distance_2d.svg.png" width="500">
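
The difference between the two measures can be checked on a couple of hand-picked vectors. This is a small numpy sketch with made-up numbers, not the notebook's data:

```python
import numpy as np

# distance between two points p and q in 2 dimensions
p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])
dist_pq = float(np.sqrt(np.sum((p - q) ** 2)))   # sqrt(3^2 + 4^2)

# two vectors with the same direction but different magnitude
a = np.array([1.0, 1.0])
b = np.array([3.0, 3.0])
dist_ab = float(np.linalg.norm(a - b))                              # sensitive to magnitude
cos_ab = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))     # ignores magnitude

print(dist_pq, round(dist_ab, 3), round(cos_ab, 3))
```

The vectors `a` and `b` are identical as far as cosine similarity is concerned, while their Euclidean distance is clearly non-zero. For count vectors this means that a long and a short title about the same topic can be far apart in Euclidean terms.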

In [266]:
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import pairwise_kernels

eucl_dist = pairwise_distances(count_matrix_title_sparse, metric='euclidean')
eucl_dist_df = pd.DataFrame(eucl_dist)
eucl_dist_df = eucl_dist_df.round(2)

eucl_dist_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,426,427,428,429,430,431,432,433,434,435
0,0.0,2.65,3.74,3.0,4.36,4.0,3.61,4.0,3.87,3.74,...,3.87,4.0,3.87,3.87,4.24,4.0,3.87,3.87,4.12,4.0
1,2.65,0.0,3.32,2.45,4.0,3.61,3.16,3.61,3.46,3.32,...,3.46,3.61,3.46,3.46,3.87,3.61,3.46,3.46,3.74,3.61
2,3.74,3.32,0.0,3.0,4.12,3.74,3.32,3.74,3.61,3.46,...,3.61,3.74,3.61,3.61,4.0,3.74,3.61,3.61,3.87,3.74
3,3.0,2.45,3.0,0.0,3.74,3.32,2.83,3.32,3.16,3.0,...,3.16,3.32,3.16,3.16,3.61,3.32,3.16,3.16,3.46,3.32
4,4.36,4.0,4.12,3.74,0.0,4.12,4.0,4.36,4.0,3.87,...,4.24,4.36,4.0,4.24,4.58,4.36,4.24,4.24,4.47,4.12


## 3. Results: extracting most similar articles

After computing the similarities between our scraped articles, we have to **filter the similar articles out of our initial** `news_df_daily` **dataframe**, in order to retrieve their titles and texts.

We want to extract only articles whose similarity reaches some predefined minimum, e.g. only **articles that have a similarity of at least 0.7** (this threshold can be varied). Since the row and column indexes of the `soft_cossim_mat` matrix equal the indexes of the articles in our initial `news_df_daily` dataframe, we filter `news_df_daily` by exactly those indexes at which the minimum similarity value occurs.
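
The idea of picking article indexes from a similarity matrix can be sketched with a direct threshold comparison. This is an alternative illustration on a made-up 3x3 matrix, not the method used in the cells below (which look up individual similarity values); the helper `pairs_above` is hypothetical.

```python
import numpy as np
import pandas as pd

def pairs_above(sim_df, threshold):
    """Return the (row, col) positions of all article pairs with similarity >= threshold."""
    values = sim_df.to_numpy()
    rows, cols = np.triu_indices_from(values, k=1)   # upper triangle, diagonal excluded
    keep = values[rows, cols] >= threshold
    return [(int(r), int(c)) for r, c in zip(rows[keep], cols[keep])]

# toy similarity matrix for three articles (made-up numbers)
toy = pd.DataFrame([[1.00, 0.75, 0.10],
                    [0.75, 1.00, 0.20],
                    [0.10, 0.20, 1.00]])

pairs = pairs_above(toy, 0.7)
article_ids = sorted({i for pair in pairs for i in pair})
print(pairs, article_ids)
```

Only the upper triangle is inspected because the matrix is symmetric and its diagonal (each article compared with itself) is always 1. The resulting positions could then be passed to `DataFrame.iloc` to select the matching articles.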

In [125]:
# general function to find the row and column index in a dataframe for a specific value
def get_indexes(dataframe, value):
    value = float(value) # the callers pass the similarity values as strings
    mask = np.isclose(dataframe.values, value) # bool array with True at positions where the given value exists
    rows, cols = np.nonzero(mask)
    pos_list = list()
    for r, c in zip(rows, cols):
        row, col = dataframe.index[r], dataframe.columns[c]
        if row != col: # the diagonal compares each article with itself, so we exclude these results here
            pos_list.append((row, col)) # creates a list of row, col positions
    return pos_list # Return a list of tuples indicating the positions of value in the dataframe

# function for creating a list of the row indexes
def find_indexes(dict_pos, index_list):
    for key, value in dict_pos.items():
    #print(key, ' : ', value) # this prints the similarity values and its corresponding row and col indexes in the df
        for num in value:
            for firstnum in num:
                index_list.append(firstnum)

### 3.1. Most similar articles: by Soft Cosine Similarity of article titles

In [478]:
# choosing the range of similarity values for which the sentences should be filtered
simval = np.arange(0.9, 1.01, 0.01) # choose similarity values between first number and 1.0, by steps of 0.01
simval = (np.around(simval, decimals=2)).astype(str)
#simval = (simval.astype(str))
 
# use dict comprehension and 'get_indexes' function to get index positions of elements in df with predefined similarity values
dict_pos_titles = {elem: get_indexes(soft_cossim_mat_titles, elem) for elem in simval}
#dict_pos_texts = {elem: get_indexes(soft_cossim_mat_texts, elem) for elem in simval}

# applying the functions for finding the similarity values in the dataframes

index_list_titles = []
find_indexes(dict_pos_titles, index_list_titles)
index_list_titles = list(set(index_list_titles))

select_articles = ((news_df_daily.iloc[index_list_titles, :]).drop_duplicates(("title"))).sort_index()
print(select_articles.shape)
select_articles.head()

(45, 11)


Unnamed: 0,source,link,published_date,published_time,title,text,keywords,image,summary,clean_title,clean_text
6,cnn,http://rss.cnn.com/~r/rss/cnn_topsto...,2019-12-03,19:10:00 UTC,Kamala Harris ends 2020 presidential...,(CNN) Sen. Kamala Harris ended her 2...,35 ends caption presidential photos ...,https://cdn.cnn.com/cnnnext/dam/asse...,Photos: Former presidential candidat...,kamala harris end presidential campaign,sen kamala harris ended presidential...
7,cnn,http://rss.cnn.com/~r/rss/cnn_topsto...,2019-12-03,19:09:51 UTC,Rep. Duncan Hunter pleads guilty for...,Washington (CNN) Republican Rep. Dun...,misuse misusing rep pleads resign hu...,https://cdn.cnn.com/cnnnext/dam/asse...,Washington (CNN) Republican Rep. Dun...,rep duncan hunter pleads guilty misu...,washington republican rep duncan hun...
27,cnn,http://rss.cnn.com/~r/rss/cnn_topsto...,2019-12-03,17:31:42 UTC,Watch Marvel's 'Black Widow' first t...,Chat with us in Facebook Messenger. ...,widow whats facebook marvels unfolds...,https://cdn.cnn.com/cnnnext/dam/asse...,Chat with us in Facebook Messenger.\...,watch marvel black widow first trailer,chat u facebook messenger find happe...
55,cbn,https://www1.cbn.com/cbnnews/politic...,2019-12-03,18:34:22 UTC,Kamala Harris to End Democratic Pres...,WASHINGTON (AP) - Democratic Sen. Ka...,end state presidential democratic se...,https://www1.cbn.com/sites/default/f...,WASHINGTON (AP) - Democratic Sen. Ka...,kamala harris end democratic preside...,washington ap democratic sen kamala ...
66,wsj_business,https://www.wsj.com/articles/sprint-...,2019-12-03,07:18:00,Sprint Overcounted Subsidized Custom...,Sprint Corp. has for years overestim...,federal customers using sprint tens ...,https://images.wsj.net/im-132404/social,Sprint Corp. has for years overestim...,sprint overcounted subsidized custom...,sprint corp year overestimated many ...


### 3.2. Most similar articles: by Euclidean distance of article titles

In [470]:

cols = list(eucl_dist_df.columns)    
#unique_list = []
unique_simvals = sorted(list(pd.unique(eucl_dist_df[cols].values.ravel())))
unique_simvals.remove(float(0))
filter_criteria = round((len(unique_simvals))*0.18)

unique_simvals_filtered = (np.array(unique_simvals[3:filter_criteria])).astype(str)

print(unique_simvals_filtered)

['2.0' '2.24']


In [471]:
#simval = "2"
dict_pos_titles = {elem: get_indexes(eucl_dist_df, elem) for elem in unique_simvals_filtered}
index_list_titles = []
find_indexes(dict_pos_titles, index_list_titles)
index_list_titles = list(set(index_list_titles))

sorted(index_list_titles)

select_articles = ((news_df_daily.iloc[index_list_titles, :]).drop_duplicates(("title"))).sort_index()

pd.set_option('display.max_colwidth', 40)
select_articles


Unnamed: 0,source,link,published_date,published_time,title,text,keywords,image,summary,clean_title,clean_text
3,cnn,http://rss.cnn.com/~r/rss/cnn_topsto...,2019-12-03,17:24:49 UTC,Here's why the impeachment polling i...,(CNN) The impeachment inquiry into P...,polling trumps moving wanted isnt op...,https://cdn.cnn.com/cnnnext/dam/asse...,(CNN) The impeachment inquiry into P...,impeachment polling moving,impeachment inquiry president donald...
6,cnn,http://rss.cnn.com/~r/rss/cnn_topsto...,2019-12-03,19:10:00 UTC,Kamala Harris ends 2020 presidential...,(CNN) Sen. Kamala Harris ended her 2...,35 ends caption presidential photos ...,https://cdn.cnn.com/cnnnext/dam/asse...,Photos: Former presidential candidat...,kamala harris end presidential campaign,sen kamala harris ended presidential...
13,cnn,http://rss.cnn.com/~r/rss/cnn_topsto...,2019-12-03,16:48:05 UTC,Macron refuses to back down after Tr...,London (CNN) French President Emmanu...,turkey leaders companies meeting ref...,https://cdn.cnn.com/cnnnext/dam/asse...,London (CNN) French President Emmanu...,macron refuse back trump attack,london french president emmanuel mac...
55,cbn,https://www1.cbn.com/cbnnews/politic...,2019-12-03,18:34:22 UTC,Kamala Harris to End Democratic Pres...,WASHINGTON (AP) - Democratic Sen. Ka...,end state presidential democratic se...,https://www1.cbn.com/sites/default/f...,WASHINGTON (AP) - Democratic Sen. Ka...,kamala harris end democratic preside...,washington ap democratic sen kamala ...
63,cbn,https://www1.cbn.com/cbnnews/us/2019...,2019-12-03,13:24:11 UTC,Washington DC Synagogue Vandalized W...,"JERUSALEM, Israel - Swastikas and an...",united means antisemitic states post...,https://www1.cbn.com/sites/default/f...,"JERUSALEM, Israel - Swastikas and an...",washington dc synagogue vandalized s...,jerusalem israel swastika antisemiti...
78,wsj_business,https://www.wsj.com/articles/thanksg...,2019-12-03,12:51:00,Thanksgiving Weekend Shoppers Booste...,American shoppers increased their sp...,stores wavered period signaling boos...,https://images.wsj.net/im-132607/social,American shoppers increased their sp...,thanksgiving weekend shopper boosted...,american shopper increased spending ...
89,wsj_market,https://www.wsj.com/articles/amazon-...,2019-12-03,12:26:00,Amazon Dots the Landscape,Consumers apparently love having Ama...,ubercheap speaker smart seven sales ...,https://images.wsj.net/im-132609/social,Consumers apparently love having Ama...,amazon dot landscape,consumer apparently love amazoncom m...
179,theguardian,https://www.theguardian.com/us-news/...,2019-12-03,19:22:23 UTC,Kamala Harris drops out of Democrati...,California senator had started stron...,iowa state drops candidates debate r...,https://i.guim.co.uk/img/media/349ae...,California senator had started stron...,kamala harris drop democratic presid...,california senator started strongly ...
242,theguardian,https://www.theguardian.com/food/sho...,2019-12-03,16:36:19 UTC,Why is a 2017 bottle of Irn-Bru sell...,An out-of-date Irn-Bru bottle is bei...,irnbru 2017 recipe version tax origi...,https://i.guim.co.uk/img/media/0d20d...,An out-of-date Irn-Bru bottle is bei...,bottle irnbru selling,outofdate irnbru bottle sold ebay is...
262,skynews,http://news.sky.com/story/kamala-har...,2019-12-03,18:29:00 UTC,Kamala Harris drops out of race for ...,Kamala Harris said she will 'do ever...,resources drops regret deep states r...,https://e3.365dm.com/19/12/1600x900/...,Kamala Harris said she will 'do ever...,kamala harris drop race democratic p...,kamala harris said do everything pow...


In [473]:
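# Defining categories

# hand-picked test keywords (duplicate "model" removed)
listtest = ["model","sales","technology","stocks","stockmarket","finance","2020","ends","federal"]

# keep articles whose keyword string contains at least one test keyword
# note: str.contains does substring matching, so "ends" also matches "friends"
select_articles1 = select_articles[select_articles["keywords"].str.contains('|'.join(listtest))]

def categories(topic, topic_list):
    # append the words most similar to `topic` (nearest neighbours in the GloVe model) to `topic_list`
    for word, _score in glove_wiki.most_similar(positive=topic):
        topic_list.append(word)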

        
politics, economy, finance, tech, business = [], [], [], [], []
        
categories("politics", politics)
categories("economy", economy)
categories("finance", finance)
categories("tech", tech)
categories("business", business)

select_politics = select_articles[select_articles["keywords"].str.contains('|'.join(politics))]
select_economy = select_articles[select_articles["keywords"].str.contains('|'.join(economy))]
select_finance = select_articles[select_articles["keywords"].str.contains('|'.join(finance))]
select_tech = select_articles[select_articles["keywords"].str.contains('|'.join(tech))]
select_business = select_articles[select_articles["keywords"].str.contains('|'.join(business))]

select_politics

Unnamed: 0,source,link,published_date,published_time,title,text,keywords,image,summary,clean_title,clean_text
179,theguardian,https://www.theguardian.com/us-news/...,2019-12-03,19:22:23 UTC,Kamala Harris drops out of Democrati...,California senator had started stron...,iowa state drops candidates debate r...,https://i.guim.co.uk/img/media/349ae...,California senator had started stron...,kamala harris drop democratic presid...,california senator started strongly ...
298,google news,https://news.google.com/__i/rss/rd/a...,2019-12-03,16:35:00 UTC,NPR Choice page,"By choosing “I agree” below, you agr...",social information npr choice sites ...,,"By choosing “I agree” below, you agr...",npr choice page,choosing i agree below agree npr sit...
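
The keyword filter above matches substrings, which can pull in unrelated articles. A minimal sketch of a stricter variant, assuming whole-word matching is what is wanted: `keyword_mask` is a hypothetical helper (not part of the notebook) that escapes each term and wraps the alternatives in word boundaries before passing them to `str.contains`.

```python
import re
import pandas as pd

def keyword_mask(keywords, terms):
    """Return a boolean mask: True where the keyword string contains any term as a whole word."""
    escaped = [re.escape(t) for t in terms]               # neutralise regex metacharacters
    pattern = r"\b(?:" + "|".join(escaped) + r")\b"       # whole-word alternatives
    return keywords.str.contains(pattern, regex=True, na=False)  # missing keywords count as no match

# small made-up frame standing in for select_articles
demo = pd.DataFrame({"keywords": ["iowa state drops candidates",
                                  "friends dinner recipe",
                                  None]})
mask = keyword_mask(demo["keywords"], ["ends", "state"])
print(demo[mask])
```

With this pattern, "ends" no longer matches inside "friends", and rows without keywords are skipped instead of raising an error.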


# 4. Generating newsletter in HTML

With the HTML body of the newsletter already designed, we now prepare the extracted article titles and texts so that they can be inserted into it automatically.

## 4.1 Importing extracted titles and content into Newsletter

For testing purposes, we randomly choose which articles to include in the newsletter body.

In [479]:
# creating separate lists of the columns and info we want to include
similar_sources_list = list(select_articles['source'])
similar_links_list = list(select_articles['link'])
similar_titles_list = list(select_articles['title'])
similar_texts_list = list(select_articles['text'])

# randomly select articles to include
random_select = select_articles.reset_index(drop=True) # resetting the index of the df
nr_of_art = random_select.shape[0] # number of rows (articles) available to choose from

# function to pick 8 random articles by position, limit each text to 'max_chars' characters and append '...'
# note: needs at least 8 articles, otherwise np.random.choice raises a ValueError
def rand_info(nr_of_art, max_chars):
    global rand_text, rand_source, rand_link, rand_title
    rand_text, rand_source, rand_link, rand_title = [], [], [], []
    random_art_nr = np.random.choice(nr_of_art, 8, replace=False)  # 8 distinct positions, chosen randomly
    for nr in random_art_nr:
        rand_text.append(similar_texts_list[nr][:max_chars])
        rand_source.append(similar_sources_list[nr])
        rand_link.append(similar_links_list[nr])
        rand_title.append(similar_titles_list[nr])
    rand_text = [item + '...' for item in rand_text]

rand_info(nr_of_art, 250) # selecting the articles randomly and truncating texts to 250 chars

# every time the code is executed, a new random selection of articles appears
print(rand_title[0],"\n" , rand_text[0], "\n" , rand_source[0], "\n", rand_link[0])

#test

Appeals Court Orders Trump’s Banks to Turn Over Records, Setting the Stage for a SCOTUS Showdown 
 President Donald Trump in London on Tuesday. Ludovic Marin/Getty Images

A federal appeals court upheld the House of Representatives’ subpoena of Donald Trump’s financial records on Tuesday, continuing Democrats’ winning streak in their fight for ove... 
 google news 
 https://news.google.com/__i/rss/rd/articles/CBMiZmh0dHBzOi8vc2xhdGUuY29tL25ld3MtYW5kLXBvbGl0aWNzLzIwMTkvMTIvY291cnQtb3JkZXItZGV1dHNjaGUtYmFuay1jYXBpdGFsLW9uZS10cnVtcC1zdWJwb2VuYXMuaHRtbNIBZWh0dHBzOi8vc2xhdGUuY29tL25ld3MtYW5kLXBvbGl0aWNzLzIwMTkvMTIvY291cnQtb3JkZXItZGV1dHNjaGUtYmFuay1jYXBpdGFsLW9uZS10cnVtcC1zdWJwb2VuYXMuYW1w?oc=5
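
The random selection could also be done without global lists. A minimal sketch, assuming the DataFrame has the `source`, `link`, `title` and `text` columns shown above: `sample_articles` is a hypothetical helper that uses `DataFrame.sample` and returns the shortened articles as a new DataFrame.

```python
import pandas as pd

def sample_articles(df, n=8, max_chars=250, seed=None):
    """Pick up to n random rows and shorten each text to max_chars characters plus '...'."""
    picked = df.sample(n=min(n, len(df)), random_state=seed).reset_index(drop=True)
    picked["text"] = picked["text"].str.slice(0, max_chars) + "..."
    return picked[["source", "link", "title", "text"]]

# made-up articles standing in for select_articles
demo = pd.DataFrame({
    "source": ["demo"] * 10,
    "link": [f"http://example.com/{i}" for i in range(10)],
    "title": [f"Title {i}" for i in range(10)],
    "text": ["x" * 400] * 10,
})
picked = sample_articles(demo, n=8, max_chars=250, seed=0)
print(picked["title"].tolist())
```

Passing `seed` makes a selection reproducible while testing; leaving it as `None` gives a fresh selection on every run, as in the cell above.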


In [480]:
# Formatting issues
# Note: there are more unsupported characters; extend the replacements below over time

# function to replace the wrongly formatted characters (observed by looking at the HTML output)
# modifies the given list in place
def replace_char(list_of_str):
    for i in range(len(list_of_str)):
        list_of_str[i] = list_of_str[i].replace("’","`")
        list_of_str[i] = list_of_str[i].replace(":",":")
        list_of_str[i] = list_of_str[i].replace("–","-")

replace_char(rand_text)
replace_char(rand_title)

rand_title

['Appeals Court Orders Trump`s Banks to Turn Over Records, Setting the Stage for a SCOTUS Showdown',
 'With Brits Used to Surveillance, More Companies Try Tracking Faces',
 'Cannabis On The Tickets In The Coming UK Elections',
 'Pioneer Distilleries down 5% after fixing swap ratio for merger with USL',
 "India's yield curve sees steepest rise in nine years, set to go up further",
 'MARKET WRAP: Sensex dips 127 pts, Nifty below 12,000; PSBs, metals decline',
 'North Korea Touts New Resort, Seeking to Blunt U.N. Sanctions',
 'Kamala Harris to End Democratic Presidential Campaign']
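
As the list of problem characters grows, a translation table may be easier to maintain than chained `replace` calls. A minimal sketch: `CHAR_MAP` and `clean_strings` are hypothetical names, and the mapping (typographic quotes to plain quotes, dashes to a hyphen) is an assumption that differs slightly from the cell above, which maps the apostrophe to a backtick.

```python
# one table for all single-character replacements; add entries as new problem characters show up
CHAR_MAP = str.maketrans({
    "\u2019": "'",   # right single quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})

def clean_strings(strings):
    """Return a new list with every mapped character replaced."""
    return [s.translate(CHAR_MAP) for s in strings]

cleaned = clean_strings(["Trump\u2019s Banks \u2013 appeal", "plain title"])
print(cleaned)
```

Unlike `replace_char`, this returns a new list instead of changing the input in place.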

In [328]:
import webbrowser
import os

In [481]:
# Code is way easier to edit in Notepad++

# write as UTF-8 to match the charset declared in the HTML <meta> tag
f = open('HTML_with VARS_V1.html', 'w', encoding='utf-8')
 
message = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Demystifying Email Design</title>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link href="NewsletterTemplate_files/css.css" rel="stylesheet">    
 
</head>
<body style="margin: 0; padding: 0;">
    <table width="100%" cellspacing="0" cellpadding="0" border="0"> 
        <tbody><tr>
            <td style="padding: 10px 0 10px 0;">
                <table style="border: 1px solid #cccccc; border-collapse: collapse;" width="1000" cellspacing="0" cellpadding="0" border="0" align="center">
                    <tbody><tr>
                        <td style="padding: 20px" height="204" bgcolor="#fbf315" align="top">
                            <img alt="Creating Email Magic" style="display: block;" src="NewsletterTemplate_files/Logo-Raiffeisen-Bank-2017.png" width="304" height="304">
                        </td>
                    </tr>
                    <tr>
                        <td style="padding: 20px 30px 40px 30px;" bgcolor="#ffffff">
                            <table width="100%" cellspacing="0" cellpadding="0" border="0">
                                <tbody><tr>
                                    <td style="color: #153643; 
    font-family: 'Archivo Black', sans-serif; font-size: 40px;">
                                        <b>Daily Finance Update
</b>
                                    </td>
                                
                                        
                                    </tr><tr>
                                    <td style="color: #153643; 
    font-family: 'Archivo Black', sans-serif; font-size: 20px; padding: 10px 0px 10px 0px;">
                                        <b>Stocks
</b>
                                    </td>
                                
                                        
                                    </tr>
                                
                                <tr>
                                    <td>
                                        <table width="100%" cellspacing="0" cellpadding="0" border="0">
                                            <tbody><tr>
                                                <td style="box-shadow: 1px 2px 4px rgba(0, 0, 0, .5);" width="160" valign="top">
                                                    <div style="padding: 20px 10px 5px 10px; font-family: 'Archivo Black', sans-serif; font-size: 22px">
  <b>TECH</b>
</div><tdbody>
                                                        <div class="row margin-top" style="padding: 10px 10px 5px 10px">
  <div style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> 
       <span class="item-Label">Tech | {rand_source[0]}</span>
   </div>
</div><div style="padding: 5px 10px 0 10px" font-family:="" font-size:=""><b style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> <a href="{rand_link[0]}">{rand_title[0]}</a></b></div><div class="row margin-top" style="padding: 5px 10px 15px 10px; font-family:'Raleway', sans-serif; font-size: 14px">{rand_text[0]}

                                                            
                                                        
                                                    </div></tdbody><tdbody>
                                                        <div class="row margin-top" style="padding: 10px 10px 5px 10px">
  <div style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> 
       <span class="item-Label">Tech | {rand_source[1]}</span>
   </div>
</div><div style="padding: 5px 10px 0 10px" font-family:="" font-size:=""><b style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> <a href="{rand_link[1]}">{rand_title[1]}
</a></b></div><div class="row margin-top" style="padding: 5px 10px 15px 10px; font-family:'Raleway', sans-serif; font-size: 14px">{rand_text[1]}
                                                   
                                                        
                                                    </div></tdbody><tdbody>
                                                        <div class="row margin-top" style="padding: 10px 10px 5px 10px">
  <div style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> 
       <span class="item-Label">Tech | {rand_source[2]}</span>
   </div>
</div><div style="padding: 5px 10px 0 10px" font-family:="" font-size:=""><b style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> <a href="{rand_link[2]}">{rand_title[2]}
</a></b></div><div class="row margin-top" style="padding: 5px 10px 15px 10px; font-family:'Raleway', sans-serif; font-size: 14px">{rand_text[2]}
</div></tdbody><table width="100%" cellspacing="0" cellpadding="0">
                                                        </table>
                                                </td><td style="font-size: 0; line-height: 0;" width="20">
                                                    &nbsp;
                                                </td><td style="box-shadow: 1px 2px 4px rgba(0, 0, 0, .5);" width="160" valign="top">
                                                    <div style="padding: 20px 10px 5px 10px; font-family: 'Archivo Black', sans-serif; font-size: 22px">
  <b>DEALS AND IPOs
</b>
</div><tdbody>
                                                        <div class="row margin-top" style="padding: 10px 10px 5px 10px">
  <div style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> 
       <span class="item-Label">Deals | {rand_source[3]} 
</span>
   </div>
</div><div style="padding: 5px 10px 0 10px" font-family:="" font-size:=""><b style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> <a href="{rand_link[3]}">{rand_title[3]}
</a></b></div><div class="row margin-top" style="padding: 5px 10px 15px 10px; font-family:'Raleway', sans-serif; font-size: 14px">
                                                            
{rand_text[3]}


                                                            
                                                        
                                                    </div></tdbody><tdbody>
                                                        <div class="row margin-top" style="padding: 10px 10px 5px 10px">
  <div style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> 
       <span class="item-Label">Markets | {rand_source[4]}
</span>
   </div>
</div><div style="padding: 5px 10px 0 10px" font-family:="" font-size:=""><b style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> <a href="{rand_link[4]}">{rand_title[4]}
</a></b></div><div class="row margin-top" style="padding: 5px 10px 15px 10px; font-family:'Raleway', sans-serif; font-size: 14px">
                                                            
{rand_text[4]}


                                                            
                                                        
                                                    </div></tdbody><table width="100%" cellspacing="0" cellpadding="0" border="0">
                                                        
</table>
                                                </td>
                                                <td style="font-size: 0; line-height: 0;" width="20">
                                                    &nbsp;
                                                </td>
                                                <td style="box-shadow: 1px 2px 4px rgba(0, 0, 0, .5);" width="160" valign="top">
                                                    <div style="padding: 20px 10px 5px 10px; font-family: 'Archivo Black', sans-serif; font-size: 22px">
  <b>BANKS
</b>
</div><tdbody>
                                                        <div class="row margin-top" style="padding: 10px 10px 5px 10px">
  <div style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> 
       <span class="item-Label">Trading | {rand_source[5]}</span>
   </div>
</div><div style="padding: 5px 10px 0 10px" font-family:="" font-size:=""><b style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> <a href="{rand_link[5]}">{rand_title[5]}
</a></b></div><div class="row margin-top" style="padding: 5px 10px 15px 10px; font-family:'Raleway', sans-serif; font-size: 14px">
                                                            
{rand_text[5]}


                                                            
                                                        
                                                    </div></tdbody><tdbody>
                                                        <div class="row margin-top" style="padding: 10px 10px 5px 10px">
  <div style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> 
       <span class="item-Label">Earnings | {rand_source[6]}</span>
   </div>
</div><div style="padding: 5px 10px 0 10px" font-family:="" font-size:=""><b style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> <a href="{rand_link[6]}">{rand_title[6]}
</a></b></div><div class="row margin-top" style="padding: 5px 10px 15px 10px; font-family:'Raleway', sans-serif; font-size: 14px">{rand_text[6]}
</div></tdbody><tdbody>
                                                        <div class="row margin-top" style="padding: 10px 10px 5px 10px">
  <div style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> 
       <span class="item-Label">JPMorgan | {rand_source[7]}</span>
   </div>
</div><div style="padding: 5px 10px 0 10px" font-family:="" font-size:=""><b style="color: #153643; font-family: Roboto, sans-serif; font-size: 18px;"> <a href="{rand_link[7]}">{rand_title[7]}
</a></b></div><div class="row margin-top" style="padding: 5px 10px 15px 10px; font-family:'Raleway', sans-serif; font-size: 14px">{rand_text[7]}
</div></tdbody><table width="100%" cellspacing="0" cellpadding="0">
                                                        </table>
                                                </td>
                                            </tr>
                                        </tbody></table>
                                    </td>
                                </tr>
                            </tbody></table>
                        </td>
                    </tr>
                    <tr>
                        <td style="padding: 30px 30px 30px 30px;" bgcolor="#666666">
                            <table width="100%" cellspacing="0" cellpadding="0" border="0">
                                <tbody><tr>
                                    <td style="color: #ffffff; font-family: Arial, sans-serif; font-size: 14px;" width="75%">
                                        ® Someone, somewhere 2019<br>
                                        <a href="#" style="color: #ffffff;"><font color="#ffffff">Unsubscribe</font></a> from this newsletter instantly
                                    </td>
                                    <td width="25%" align="right">
                                        <table cellspacing="0" cellpadding="0" border="0">
                                            <tbody><tr>
                                                <td style="font-family: Arial, sans-serif; font-size: 12px; font-weight: bold;">
                                                    <a href="https://twitter.com/raiffeisen_at" style="color: #666666;">
                                                        <img src="NewsletterTemplate_files/logo.png" alt="Twitter" style="display: block;" width="38" height="38" border="0">
                                                    </a>
                                                </td>
                                                <td style="font-size: 0; line-height: 0;" width="20">&nbsp;</td>
                                                <td style="font-family: Arial, sans-serif; font-size: 12px; font-weight: bold;">
                                                    <a href="http://www.facebook.com/raiffeisen/" style="color: #666666;">
                                                        <img alt="Facebook" style="display: block;" src="NewsletterTemplate_files/facebook-2.svg" width="38" height="38" border="0">
                                                    </a>
                                                </td>
                                            </tr>
                                        </tbody></table>
                                    </td>
                                </tr>
                            </tbody></table>
                        </td>
                    </tr>
                </tbody></table>
            </td>
        </tr>
    </tbody></table>



</body></html>

""".format(**locals()) # fills the {rand_...[i]} placeholders from the variables defined above
 
f.write(message)
f.close()

# Build a file:// URL from the absolute path (spaces get percent-encoded) and open it in a new browser tab
from pathlib import Path
filename = Path('HTML_with VARS_V1.html').resolve().as_uri()
webbrowser.open_new_tab(filename)




True
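
The template above repeats the same article block eight times by hand. A minimal sketch of generating those blocks in a loop instead: `ARTICLE_BLOCK` and `render_articles` are hypothetical names, and the markup is a simplified stand-in for the real block. It also runs every value through `html.escape`, so characters such as `&` or `<` in a title cannot break the page.

```python
from html import escape

# simplified stand-in for one article block of the newsletter
ARTICLE_BLOCK = """<div class="row margin-top">
  <span class="item-Label">{label} | {source}</span>
  <b><a href="{link}">{title}</a></b>
  <div>{text}</div>
</div>"""

def render_articles(label, sources, links, titles, texts):
    """Render one HTML block per article and join them into a single string."""
    blocks = []
    for source, link, title, text in zip(sources, links, titles, texts):
        blocks.append(ARTICLE_BLOCK.format(
            label=escape(label),
            source=escape(source),
            link=escape(link),      # escape() also escapes quotes by default
            title=escape(title),
            text=escape(text),
        ))
    return "\n".join(blocks)

html_part = render_articles(
    "Tech",
    ["demo source", "demo source"],
    ["http://example.com/a", "http://example.com/b"],
    ["Stocks & bonds", "Second title"],
    ["First text...", "Second text..."],
)
print(html_part)
```

The rendered string could then be inserted into the page through a single placeholder per column, which keeps the template short when the number of articles changes.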

# X. Other stuff that could be helpful in the future

## Time how long code takes to execute

Could be used to compare the speed of two similarity methods.

In [None]:
import timeit

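# paste the statement(s) to be timed between the triple quotes
code_to_test = """

"""
# average time per run in seconds, over 100 runs
elapsed_time = timeit.timeit(code_to_test, number=100)/100
print(elapsed_time)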
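
`timeit.timeit` also accepts a callable, which avoids pasting code into a string. A minimal sketch of comparing two methods this way: `method_a` and `method_b` are made-up stand-ins for the two similarity methods, not code from this notebook.

```python
import timeit

def method_a(words):
    """Stand-in method 1: deduplicate with a set."""
    return sorted(set(words))

def method_b(words):
    """Stand-in method 2: deduplicate with a list scan."""
    seen = []
    for w in words:
        if w not in seen:
            seen.append(w)
    return sorted(seen)

words = ["stocks", "finance", "stocks", "tech"] * 50
runs = 100

# average seconds per run for each method
time_a = timeit.timeit(lambda: method_a(words), number=runs) / runs
time_b = timeit.timeit(lambda: method_b(words), number=runs) / runs
print(f"method_a: {time_a:.6f} s per run")
print(f"method_b: {time_b:.6f} s per run")
```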

## Google word meaning vector, pre-trained

Might be useful at some point.

Other pre-trained models to be found here: https://github.com/RaRe-Technologies/gensim-data/releases

In [None]:
import gensim.downloader as api

model = api.load("word2vec-google-news-300") #1.6GB to download

## Splitting each word in title/text in pandas df to a separate column

Might be useful at some point.

Code was hard to find via google haha

In [None]:
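# .str is only available on a Series, so select one text column first
# (assumption: news_df_daily is a DataFrame with a 'title' column)
title_splitted = news_df_daily['title'].str.split(expand=True) # expand=True already returns a DataFrame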
title_splitted