 Humboldt-University of Berlin  
 Chair of Information Systems School of Business and Economics  
 Sose 2020  
 Advanced Data Analytics for Management Support  
 Stefan Lessmann  

# Assignment 

made by  
Kseniia Teslenko    

***
# Content

1. [Introduction](#Introduction)
2. [Explore the data](#Explore-the-data)
3. [Data preparation](#Data-preparation)
     * [Test data preparation](#Test-data-preparation)
          *   [Treatment of the "PublicationDetails" column](#Treatment-of-the-"PublicationDetails"-column) 
          *   [Responses features](#Responses-features)
          *   [Online since and month of the year](#Online-since-and-month-of-the-year)
          *   [Image count, word count and HTML removal](#Image-count,-word-count-and-HTML-removal)
          *   [Reading time](#Reading-time)
          *   [Missing values](#Missing-values)
          *   [Publication feature](#Publication-feature)
     * [Train data preparation](#Train-data-preparation)
          *   [Missing values](#Missing-values)
          *   [Feature "publicationname" in test and train set](#Feature-"publicationname"-in-test-and-train-set)
          *   [Duplicates](#Duplicates)
          *   [Removing non-English articles](#Removing-non-English-articles)
          *   [Creating new features: article’s age and month when it was published](#age-and-month)
          *   [Calculate reading time](#Calculate-reading-time)
          *   [Adjusting column names in test and train data](#Adjusting-column-names-in-test-and-train-data)
4. [Text Tokenization](#Text-Tokenization)
     * [Text and Header Tokenization](#Text-and-Header-Tokenization)
     * [Author and publisher Tokenization](#Author-and-publisher-Tokenization)
6. [Word2Vec](#Word2Vec)
     * [About Word2Vec](#About-Word2Vec)
     * [Implementaion](#Implementaion)
     * [Train the model](#Train-the-model)
7. [Creating a validation set](#Creating-a-validation-set)
8. [Develop predictive models](#Develop-predictive-models)
     * [Sequential benchmark model](#Sequential-benchmark-model)
     * [Second model](#Second-model)
     * [Third model](#Third-model)
     * [Final model](#Final-model)
     * [Hyperparameter tuning](#Hyperparameter-tuning)
10. [Conclusion](#Conclusion)
11. [Literature](#Literature)


In [1]:
# Importing standard packages. 
import importlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assess sentiment classification models 
from sklearn.metrics import accuracy_score, confusion_matrix

# Library re provides regular expressions functionality
import re

# To keep an eye on runtimes
import time

# Saving and loaded objects
import pickle

# Library beatifulsoup4 handles html
from bs4 import BeautifulSoup

# Standard NLP workflow
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import sent_tokenize

# Introduction
Medium is an online publishing platform for people or organisations who want to express themselves or want to share information and knowledge by writing articles and reaching a broad audience. The website defines itself as "Our sole purpose is to help you find compelling ideas, knowledge, and perspectives" (Medium). The platform, opposing to traditional blog hosts but very similar to most social media platforms, has its own content rating system, similar to likes in Instagram or reactions in Facebook. Here readers can “clap” to reward the author of a post. Clapping is a way readers show if they like an article and this way they recommend the story to their followers. Another function of clapping is to determine the monetary worth of an article. Medium explains this as follow: "Currently, Partner Program writers are paid every month based on how members engage with stories. Some factors include reading time (how long members spend reading a story) and applause (how much members clap) […]"(Botticello 2019). As we can see the number of claps has an important role within the platform. As a result an important question comes into play, namely: What type of article gain the maximum number of claps? By finding an answer to this question we can generalize the type of articles that are interesting for readers, attract them and bring the author popularity and monetary benefits.
Nowadays machine learning techniques are used in a variety of fields. These techniques help to gain knowledge from data to make a decision. Machine learning techniques are very diverse and vary greatly. The main approach of this case study is to gain a deeper understanding of neural networks from a practical point of view. The first neural network – perceptron was invented by Frank Rosenblatt in 1957. The perceptron is a logistic regression that has weights and its output is a function of the dot product of the weights and input. Today we can develop not only the perceptron - an example of a simple one-layer neural feedforward network – but also language based neural networks. During this case study we are using an NLP modelling techniques to predict the number of claps for new post on medium. The prediction should be based on an article's content and optionally its metadata. Two data sets are available a train and a test set to assemble the prediction. The main difference between both data sets are the difference in columns. The recreation of the columns content type is essential if the main purpose is to avoid the decrease of the predictive accuracy of the model. The first part of the case study is to get a better understanding of the test and train data, to clean it and recreate the columns in a compatible way. The second part it will focus on developing a predictive model.
I developed different models where each iteration is becoming more complex and adding more layers. The first one is the simplest model which only uses one input variable “texts” and only one layer. To improve my results, I decided to create a Word2Vec model and use its weights as a pre-trained embedding in embedding layer. Later I also added more inputs to the models: header and the numeric data from my previously prepared column (see #Data preparation). Even if the results have been improved it was obvious that finding the best suitable parameters could further increase the accuracy of the prediction. I prepared hyperparameter tuning using Facebook’s Ax platform but ultimately was not able to finish it because my computer does not have enough RAM and computational on my CPU would have taken almost a week. That is why during the preparations for parameter tuning I decided to develop a new final model, with new parameters that are based on my previous experience, namely on the results of the three previous models.

## Explore the data
Starting from the test set to analyse the features that are present for the final model is the most obviously approach. The train data exploration is the next step. It is necessary to understand how I can link together both to achieve my goal of creating a model based on the train data to predict on the test set. Because the model can only use information which exists in both, this already limits my possible features and allows to ignore all others. The data preparation on both sets is the most important step to ensure the future model can access as many data sources as possible.


In [2]:
df_test = pd.read_csv("C:/ADAMS_tutorials/ADAMS_Assignment/Test.csv", sep=",", encoding="utf-8") #"ISO-8859-1"
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514 entries, 0 to 513
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Unnamed: 0          514 non-null    int64 
 1   index               514 non-null    int64 
 2   Author              514 non-null    object
 3   PublicationDetails  514 non-null    object
 4   Responses           432 non-null    object
 5   Header              506 non-null    object
 6   Text                514 non-null    object
 7   Length              514 non-null    int64 
dtypes: int64(3), object(5)
memory usage: 32.2+ KB


In [118]:
df_test.head()

Unnamed: 0.1,Unnamed: 0,index,Author,PublicationDetails,Responses,Header,Text,Length
0,0,0,Daniel Jeffries,"Daniel Jeffries in HackerNoon.comJul 31, 2017",627 responses,Why Everyone Missed the Most Mind-Blowing Feat...,There’s one incredible feature of cryptocurren...,23401
1,1,1,Noam Levenson,"Noam Levenson in HackerNoon.comDec 6, 2017",156 responses,NEO versus Ethereum: Why NEO might be 2018’s s...,"<img class=""progressiveMedia-noscript js-progr...",23972
2,2,2,Daniel Jeffries,"Daniel Jeffries in HackerNoon.comJul 21, 2017",176 responses,The Cryptocurrency Trading Bible,So you want to trade cryptocurrency?You’ve see...,402
3,3,5,Haseeb Qureshi,"Haseeb Qureshi in HackerNoon.comFeb 19, 2018",72 responses,Stablecoins: designing a price-stable cryptocu...,A useful currency should be a medium of exchan...,19730
4,4,7,William Belk,"William Belk in HackerNoon.comJan 28, 2018",19 responses,Chaos vs. Order — The Cryptocurrency Dilemma,Crypto crypto crypto crypto. It’s here. It’s h...,5324


In [3]:
df_test = df_test.drop('Unnamed: 0', axis=1)
df_test = df_test.drop('index', axis=1)

In [4]:
# Load data train data to compare with test data and get a first look
df = pd.read_csv("C:/ADAMS_tutorials/ADAMS_Assignment/Train.csv", sep=",", encoding="utf-8")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279577 entries, 0 to 279576
Data columns (total 50 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   audioVersionDurationSec      279577 non-null  int64  
 1   codeBlock                    25179 non-null   object 
 2   codeBlockCount               279577 non-null  float64
 3   collectionId                 137878 non-null  object 
 4   createdDate                  279577 non-null  object 
 5   createdDatetime              279577 non-null  object 
 6   firstPublishedDate           279577 non-null  object 
 7   firstPublishedDatetime       279577 non-null  object 
 8   imageCount                   279577 non-null  int64  
 9   isSubscriptionLocked         279577 non-null  bool   
 10  language                     279577 non-null  object 
 11  latestPublishedDate          279577 non-null  object 
 12  latestPublishedDatetime      279577 non-null  object 
 13 

Test data consists of 7 variables. We have an indexing and it is obvious that some cases were removed because the index is not in a proper order. We have the author’s name. A column named “Publishing details” which consists of the author’s name, the name of the blog and when an article was published. The next column represents the number of comments on the article. The train data set has the column “Header” and “Text” and also the “Length”. After a small research it became clear that the variable “Length” is the number of words without html content.
The function info() gives us the general information about the dataset. As we can see here there are 49 columns in the test set and four types of data: bool(2), float64(6), int64(10), object(32). This means we have an integer, text and categorical data. Most of them will be removed because they are not useful for the model and the future prediction. The labels of features in test and train sets are different. Identifying the columns that have the same or almost the same information in train and test set increases the decision-making power what features should be removed without potentially decreases in the predictive accuracy.

# Data preparation
## Test data preparation
### Treatment of the "PublicationDetails" column

Publication details in the train set consist of the author, a “publicationname” and a date but it wasn’t immediately clear which field matches this date. To solve this I was forced to compare two columns and articles on the internet using Google by searching for the title and author. As a result of this investigation it was decided to match this date with the column “firtspublisheddate”. Of course, the probability exists that some articles were updated later and the date here represents another field. I’m using Regular Expressions to separate the data into three different columns. First, I am looking for possible errors in the column. I found that some date data consists of only day and month. Random examination of the articles on the internet helped me to state that the most of them are written in the 2019. Still there is at least one article from the year 2020 with the same missing information. I discovered that medium.com does not display the year if the article was written in the current year. Therefore, I concluded that most articles were scrapped in 2019 and I will use this as default here. There is no way without additional data input to give the accurate year for this missing data. To build a correct function with regular expression I used this website https://regex101.com/ to test my pattern. After splitting the data into three separate columns I am deleting the old one. The columns are named the same way as in the train set.

In [5]:
#df_test.PublicationDetails.tail() 
#df_test.PublicationDetails[511]
# finding broken data
broken_years = []
for index, row in df_test.iterrows():
    detail = row.PublicationDetails
    date = re.search(r'\d{4}$', detail)
    if (date == None):
        broken_years.append(row)
#broken_years

In [6]:
def publicationDetails_split(row):
  
  published_string = row['PublicationDetails']
  date_found = re.search(r'[A-Za-z]{3}\s\d{1,2}(,\s\d{4})?$', published_string)[0] 
  
  if ' in ' in published_string:
    author = re.search(r'^.*(?=\sin\s)', published_string)[0] 
  else:
    author = re.sub(date_found + '$', '', published_string)
  
  publication = re.sub(date_found + '$', '', published_string)
  publication = re.sub('^' + author, '', publication)
  publication = publication.lstrip(' in ')
  
  if not re.match(r'.*\d{4}$', date_found):
    date_found = date_found + ', 2019'
        
  return pd.Series([author, date_found, publication])

In [7]:
splitted_df_test = df_test.apply(lambda row: publicationDetails_split(row), axis=1)
splitted_df_test.columns = ["author", "published", "blog"]

df_test["author"] = splitted_df_test["author"]
df_test["published"] = splitted_df_test["published"]
df_test["blog"] = splitted_df_test["blog"]

In [8]:
df_test = df_test.drop("PublicationDetails", axis=1)
df_test = df_test.drop("Author", axis=1)

In [9]:
df_test.columns = ["responses", "header", "texts", "length", "author", "published", "publication"]

### Online since and month of the year
The examination of medium's statistics, which is freely available, on its website has shown the overall traffic fluctuating during the common year. These conclusions inspired me to create the new variables “month”, which represents the current month of the year, and “online_since”, which gives us an indication of how long an article is available online. I suppose that during cold months people spend more time home and potentially are more interested to read more. At the same time older articles have more chances to become viral and gain more readers and as a result more claps in comparison with a newer ones.


In [11]:
df_test["published"] = pd.to_datetime(df_test["published"])
df_test["online_since"] = (pd.datetime.today() - df_test["published"]).dt.days
df_test = df_test.drop(['published'], axis=1)

In [10]:
df_test["month"] = pd.DatetimeIndex(df_test['published']).month
df_test.month.value_counts()

3     55
6     54
12    47
11    47
5     45
4     45
1     45
8     39
7     38
2     35
9     34
10    30
Name: month, dtype: int64

### Responses feature
The column responses consist of missing values and it is a mix string. Instead of mix string it is preferable to have it as an integer. First, I set all missing values to "0 responses". The next step is to only have numbers in the column. I use a regular expressions method to solve this task.


In [13]:
df_test.responses.head(10)

0    627 responses
1    156 responses
2    176 responses
3     72 responses
4     19 responses
5     23 responses
6     67 responses
7     31 responses
8     49 responses
9      5 responses
Name: responses, dtype: object

In [14]:
df_test["responses"].fillna("0 responses", inplace = True)

In [15]:
responses = []
for response in df_test.responses:
    amount = None
    if (isinstance(response, str) ):
        amount = re.search(r'^\d+', response)
    if (amount == None):
        responses.append(0)
    else:
        responses.append(amount[0])
df_test["responses"] = responses

In [16]:
df_test.responses.tail()

509    181
510     24
511     24
512    116
513     34
Name: responses, dtype: object

In [17]:
df_test['responses'] = df_test['responses'].astype(int)

### Image count, word count and HTML removal 
Images are presented as html tags. Generally, I need to count the “img” HTML-tags. After counting the images, I need to remove all html tags using the BeautifulSoup library, to count the number of words. I have detected that in some articles there is no space between punctuation and a word. Before counting the words, I need to solve this problem. Sadly, another problem occurred during this process: Some articles included links which naturally come with several punctuations. To avoid http://www.link.de to become “http //www link de” and increase the word count from one to four I needed to replace links with a placeholder “___LINK___” before fixing the punctuation.
A chain of several regular expressions seems a suitable method to solve these tasks quick and successful. Finally, I was able to count the words and save the texts without HTML and link overhead. During this process I created two new columns that are necessary to calculate the reading time in the next step.


In [18]:
texts = []
images_count = []
for text in df_test.texts:
    soup = BeautifulSoup(text)
    images = soup.findAll('img')
    len_images = len(images)
    cleaned_text = soup.get_text()
    images_count.append(len_images)
    texts.append(cleaned_text)
#images_count
#df_test["images_count"] = images_count
df_test["texts"] = texts
df_test["images_count"] = images_count

In [20]:
#replace links with dummy for word count because removing punctuation would change http://link.de => http //link de  == 3 words
link_dummy = '___LINK___'
df_test['texts'] = df_test.apply(lambda row: re.sub(r'https?:\/\/[\w\-\.\/\?\#]*(\s|$)', link_dummy, row['texts']), axis=1)
df_test['texts'] = df_test.apply(lambda row: re.sub(r'www[\w\-\.\/\?\#]*(\s|$)', link_dummy, row['texts']), axis=1)
df_test['texts'] = df_test.apply(lambda row: re.sub(r'\s[\w\-\.\/\?\#]*\.com(\s|$)', link_dummy, row['texts']), axis=1)

#Punctuation
df_test['texts'] = df_test.apply(lambda row: re.sub(r'(?<=[.,:;"!?])(?=[^\sa-z\"])', r' ', row['texts']), axis=1)
df_test['texts'] = df_test.apply(lambda row: re.sub(r'\xa0', r' ', row['texts']), axis=1)

#word count
word_counts = []
for text in df_test.texts:
    words = len(text.split())
    word_counts.append(words)
df_test["word_counts"] = word_counts

df_test = df_test.drop(['length'], axis=1)

In [21]:
df_test.texts = df_test.apply(lambda row: re.sub(link_dummy, '', row['texts']), axis=1)
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514 entries, 0 to 513
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   responses     514 non-null    int32 
 1   header        506 non-null    object
 2   texts         514 non-null    object
 3   author        514 non-null    object
 4   publication   514 non-null    object
 5   month         514 non-null    int64 
 6   online_since  514 non-null    int64 
 7   images_count  514 non-null    int64 
 8   word_counts   514 non-null    int64 
dtypes: int32(1), int64(4), object(4)
memory usage: 34.3+ KB


### Reading time
Each article on medium has a reding time. "When people see a headline that piques their interest — and know in advance that it only takes a couple of minutes to read — they’re more likely to click the link". medium.com informs us: "Read time is based on the average reading speed of an adult […] [translated] into minutes, with an adjustment made for images" (Medium). To calculate the reading time, we need to calculate the number of words and number of images, which we did in the previous step.
The calculation of reading time is based on information from medium itself. Medium.com calculates it is by using the sum of "the average reading speed of an adult (roughly 265 WPM)” and add “an additional 12 seconds for the first image, 11 seconds for the second image, and minus an additional second for each subsequent image, through the tenth image. Any images after the tenth image are counted at three seconds." 
I wrote a method which replicates this formula to calculate the exact reading time according to Medium and used my previously created columns as input on the test data set.

In [22]:
def image_minutes(images_count):
    sec_per_image = 13
    sec = 0
    for x in range(images_count):
        sec_per_image = sec_per_image - 1
        if x >= 10:
            sec_per_image = 3
        #print(sec_per_image)
        sec = sec + sec_per_image
                  
    return sec

image_minutes(11)

78

In [23]:
df_test['read_time'] = df_test.apply(lambda row: 265 / row['word_counts'] + image_minutes(row['images_count']), axis=1)
df_test.read_time.head(20)

0     75.069609
1     72.072011
2      3.581081
3     63.087171
4      0.317746
5     96.049266
6     12.518591
7      0.274327
8     78.054504
9     19.794118
10     0.970696
11     0.985130
12     1.095041
13    99.086347
14     0.136037
15    68.335868
16    50.430894
17    33.143709
18     1.292683
19    68.222689
Name: read_time, dtype: float64

### Missing values
I was checking for missing values in data set.

In [24]:
total = df_test.isnull().sum().sort_values(ascending=False)
percent = (df_test.isnull().sum()/df_test.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head()

Unnamed: 0,Total,Percent
header,8,0.015564
read_time,0,0.0
word_counts,0,0.0
images_count,0,0.0
online_since,0,0.0


In [25]:
df_test["header"].fillna("", inplace = True)
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514 entries, 0 to 513
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   responses     514 non-null    int32  
 1   header        514 non-null    object 
 2   texts         514 non-null    object 
 3   author        514 non-null    object 
 4   publication   514 non-null    object 
 5   month         514 non-null    int64  
 6   online_since  514 non-null    int64  
 7   images_count  514 non-null    int64  
 8   word_counts   514 non-null    int64  
 9   read_time     514 non-null    float64
dtypes: float64(1), int32(1), int64(4), object(4)
memory usage: 38.3+ KB


### Publication feature
I looked at this feature to understand what publisher is the most common and the most popular.
My research and future model is generally based on the features given in the test set. As a result, I do not have any reason to keep a lot of features from train set. The best way to deal with them is to delete them to keep my limited RAM consumption small. I removed them by using the drop() function and I leave only those that are necessary to the further data preparation.


In [26]:
df_test.publication.unique()

array(['HackerNoon.com', 'The Coinbase Blog', 'Good Audience',
       'Litecoin Project', 'Code Like A Girl', 'Cosmos Blog',
       'UX Collective', 'Zebpay', 'freeCodeCamp.org', 'NewCo Shift', '',
       'INSURGE intelligence', 'Galleys', 'Slackjaw', 'The Awl',
       '60 Months to Ironman', 'IoT For All', 'TE-FOOD', 'Circulate News',
       'Sustainable food systems', 'Ensia', 'Endless',
       'The Future Market', 'Hyperlink Magazine', 'Dialogue & Discourse',
       'One Table, One World', 'Track and Food', 'Real Life Stories',
       'The Junction', 'TheBeamMagazine', 'The Establishment', 'The Lily',
       'With Love, Zach', 'The Coffeelicious', 'The Mission', 'Frazzled',
       'From The Kitchen', 'Lit Up', 'The Billfold', 'Sweetgreen',
       'What IF?', 'OriginTrail', 'The Startup', 'Age of Awareness',
       '* in theprojects', 'The Haven', 'Self-Driven', 'SAP Design',
       'Team XenoX', 'Land And Ladle', 'TomoChain',
       'Human Development Project', '八百萬種食法', 'Revista Cr

In [27]:
df_test.publication.value_counts()

Netflix TechBlog            175
                             94
TE-FOOD                      34
The Startup                  24
HackerNoon.com               21
                           ... 
Hyperlink Magazine            1
BlockChannel                  1
Accelerated Intelligence      1
Zebpay                        1
Cognitive Dissident           1
Name: publication, Length: 101, dtype: int64

In [29]:
df_test.publication.head(15)

0        HackerNoon.com
1        HackerNoon.com
2        HackerNoon.com
3        HackerNoon.com
4        HackerNoon.com
5        HackerNoon.com
6     The Coinbase Blog
7        HackerNoon.com
8        HackerNoon.com
9        HackerNoon.com
10        Good Audience
11       HackerNoon.com
12       HackerNoon.com
13       HackerNoon.com
14     Litecoin Project
Name: publication, dtype: object

# Train Data preparation
I took more closer look at the train data and deleted the variables that are not existing in the test set and do not bring any advantages for the future predictive model.

In [30]:
df.head()

Unnamed: 0,audioVersionDurationSec,codeBlock,codeBlockCount,collectionId,createdDate,createdDatetime,firstPublishedDate,firstPublishedDatetime,imageCount,isSubscriptionLocked,...,slug,name,postCount,author,bio,userId,userName,usersFollowedByCount,usersFollowedCount,scrappedDate
0,0,,0.0,638f418c8464,2018-09-18,2018-09-18 20:55:34,2018-09-18,2018-09-18 20:57:03,1,False,...,blockchain,Blockchain,265164.0,Anar Babaev,,f1ad85af0169,babaevanar,450.0,404.0,20181104
1,0,,0.0,638f418c8464,2018-09-18,2018-09-18 20:55:34,2018-09-18,2018-09-18 20:57:03,1,False,...,samsung,Samsung,5708.0,Anar Babaev,,f1ad85af0169,babaevanar,450.0,404.0,20181104
2,0,,0.0,638f418c8464,2018-09-18,2018-09-18 20:55:34,2018-09-18,2018-09-18 20:57:03,1,False,...,it,It,3720.0,Anar Babaev,,f1ad85af0169,babaevanar,450.0,404.0,20181104
3,0,,0.0,,2018-01-07,2018-01-07 17:04:37,2018-01-07,2018-01-07 17:06:29,13,False,...,technology,Technology,166125.0,George Sykes,,93b9e94f08ca,tasty231,6.0,22.0,20181104
4,0,,0.0,,2018-01-07,2018-01-07 17:04:37,2018-01-07,2018-01-07 17:06:29,13,False,...,robotics,Robotics,9103.0,George Sykes,,93b9e94f08ca,tasty231,6.0,22.0,20181104


In [31]:
df["text"][0]
#df.postId.head(20)

'Private Business, Government and Blockchain\n\nA major private IT company implements blockchain, artificial intelligence, and Internet of Things to optimize and improve high technology workflow. The representatives of a major state structure from the same country like this experiment so much they decide to use it in their work and conclude an agreement with the IT giant. This is an ideal example of interaction between private business and the state regarding blockchain, don’t you think? What is even better is that this story is real: in South Korea a local customs office has signed the respective partnership agreement with Samsung. I believe that the near-term development of blockchain will be built on just such examples of cooperation. In a world where all the best technological decisions are copied at supersonic speed, one cannot remain behind the trends for long. That’s why I’m confident that blockchain and other crypto technologies will soon be adopted around the world. In the 21s

In [32]:
#df["language"].head()
#df.language.value_counts()
#df.language.count() - df[df['language']=="en"].language.count()
count_all_languages = df.language.count()
count_english = df[df['language']=="en"].language.count()
(count_all_languages - count_english) / count_all_languages * 100 #this percent is with dublicates after duplicates removal this number will decrease

7.841131423543425

In [33]:
df.drop(["publicationfollowerCount", "codeBlock", "publicationdomain", "publicationfacebookPageName", "publicationpublicEmail",
          "publicationtwitterUsername", "publicationtags", "publicationdescription", "publicationslug", 
          "collectionId", "bio", "audioVersionDurationSec", "codeBlockCount", "createdDatetime", "createdDate", "firstPublishedDatetime", "isSubscriptionLocked", "tagsCount", "uniqueSlug", "updatedDate", "updatedDatetime", "slug", "postCount", "usersFollowedByCount", "usersFollowedCount", "scrappedDate", "latestPublishedDate", "latestPublishedDatetime", "linksCount", "recommends", "imageCount", "socialRecommendsCount", "subTitle", "vote", "userId", "userName", "name"], axis=1, inplace=True)

df.info()   

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 279577 entries, 0 to 279576
Data columns (total 13 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   firstPublishedDate     279577 non-null  object 
 1   language               279577 non-null  object 
 2   postId                 279577 non-null  object 
 3   readingTime            279577 non-null  float64
 4   responsesCreatedCount  279577 non-null  int64  
 5   text                   279577 non-null  object 
 6   title                  279572 non-null  object 
 7   totalClapCount         279577 non-null  int64  
 8   url                    279577 non-null  object 
 9   wordCount              279577 non-null  int64  
 10  publicationname        137231 non-null  object 
 11  tag_name               279577 non-null  object 
 12  author                 279577 non-null  object 
dtypes: float64(1), int64(3), object(9)
memory usage: 27.7+ MB


### Missing values
Missing values are common in data sets. Generally, it is necessary to understand the nature of missing values and answer the following question: "Are the data that are missing random, or are they non-random and potentially biasing?"(Schlomer et al. 2015, p.2). Researches do not have a consensus what amount of missing data becomes problematic. For example, Schafer states that up to 5% of missing data means that it should best be cut out. Bennett suggested that more than 10% of missing data biased the analyse and others are talking about 20% (Schlomer et al. 2015, p.2. In present case 50 percent of missing values is a sign to better remove a column. But publication name is an important feature for the future predictive model. The better decision on my point of view is to fill missing values. I will use the function fillna() which fills NaNs with some new information like "No Publication Name" or with an empty string. Empty strings do not have any grammatical or syntactical pressure for the future NLP model.

In [34]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(10)

Unnamed: 0,Total,Percent
publicationname,142346,0.509148
title,5,1.8e-05
author,0,0.0
tag_name,0,0.0
wordCount,0,0.0
url,0,0.0
totalClapCount,0,0.0
text,0,0.0
responsesCreatedCount,0,0.0
readingTime,0,0.0


In [35]:
df[df["title"].isnull()] #the result is like: five NAs are sequential, with the same creation data and the same author
#df["title"][43438]
#df["url"][43442] #all five have the same url

Unnamed: 0,firstPublishedDate,language,postId,readingTime,responsesCreatedCount,text,title,totalClapCount,url,wordCount,publicationname,tag_name,author
43438,2018-04-14,en,3e691fa8543b,0.003774,0,N/A\n,,321,https://medium.com/s/story/how-i-became-chief-...,1,Byzantine.network,Burning Man,Nadia Chilmonik
43439,2018-04-14,en,3e691fa8543b,0.003774,0,N/A\n,,321,https://medium.com/s/story/how-i-became-chief-...,1,Byzantine.network,Machine Learning,Nadia Chilmonik
43440,2018-04-14,en,3e691fa8543b,0.003774,0,N/A\n,,321,https://medium.com/s/story/how-i-became-chief-...,1,Byzantine.network,Blockchain,Nadia Chilmonik
43441,2018-04-14,en,3e691fa8543b,0.003774,0,N/A\n,,321,https://medium.com/s/story/how-i-became-chief-...,1,Byzantine.network,Blockchain Startup,Nadia Chilmonik
43442,2018-04-14,en,3e691fa8543b,0.003774,0,N/A\n,,321,https://medium.com/s/story/how-i-became-chief-...,1,Byzantine.network,Cryptocurrency,Nadia Chilmonik


In [36]:
df["publicationname"].fillna("", inplace = True)
df = df.dropna(subset=["title"], how="all")

In [37]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

Unnamed: 0,Total,Percent
author,0,0.0
tag_name,0,0.0
publicationname,0,0.0
wordCount,0,0.0
url,0,0.0
totalClapCount,0,0.0
title,0,0.0
text,0,0.0
responsesCreatedCount,0,0.0
readingTime,0,0.0


### Feature "publicationname" in test and train set
Discovering of this feature in train set has shown the presence of the publisher with the same name but different way of writing. This probably originates from different scaping times and a change of the account name on medium.com in-between. This mistake has a problematic potential in the future. The best way on my point of view is to rewrite such publishers to one consistent name.
We can see here the top 20 most important publishers in the test and train dataset. In the train set we can see that only the publisher "HackerNoon" differs. I programmatically made them match.

In [38]:
df.publicationname.value_counts().head(20)

                                                                          142346
Towards Data Science                                                       16589
Hacker Noon                                                                 5540
Becoming Human: Artificial Intelligence Magazine                            3086
Data Driven Investor                                                        2834
Chatbots Life                                                               2041
Chatbots Magazine                                                           1439
SyncedReview                                                                1433
Planeta Chatbot : todo sobre los Chatbots y la Inteligencia Artificial      1427
The Startup                                                                 1383
Good Audience                                                               1058
freeCodeCamp.org                                                            1005
Coinmonks                   

In [39]:
df_test.publication.value_counts().head(20)

Netflix TechBlog        175
                         94
TE-FOOD                  34
The Startup              24
HackerNoon.com           21
freeCodeCamp.org         15
One Table, One World      9
60 Months to Ironman      6
The Billfold              5
The Establishment         4
Slackjaw                  4
The Future Market         4
The Lily                  4
The Mission               3
Track Changes             3
The Awl                   3
Startup Grind             3
NewCo Shift               3
OriginTrail               3
What IF?                  2
Name: publication, dtype: int64

In [40]:
df['publicationname'].value_counts().index

Index(['', 'Towards Data Science', 'Hacker Noon',
       'Becoming Human: Artificial Intelligence Magazine',
       'Data Driven Investor', 'Chatbots Life', 'Chatbots Magazine',
       'SyncedReview',
       'Planeta Chatbot : todo sobre los Chatbots y la Inteligencia Artificial',
       'The Startup',
       ...
       'selfdrivingcars', 'Defining Data Science', 'OrbitX', 'YLD Blog',
       'adappt Intelligence', 'Crafted Technology— Harel Omer', 'BOTBANHANG',
       'Tulua', 'Winning in the Digital Economy', 'Systek'],
      dtype='object', length=6507)

In [41]:
df.publicationname.str.contains('HackerNoon.com').value_counts().index

Index([False], dtype='object')

In [42]:
df_test['publication'].value_counts().index

Index(['Netflix TechBlog', '', 'TE-FOOD', 'The Startup', 'HackerNoon.com',
       'freeCodeCamp.org', 'One Table, One World', '60 Months to Ironman',
       'The Billfold', 'The Establishment',
       ...
       'Litecoin Project', 'IoT Chain', 'Galleys', 'ABUNDANCE INSIGHTS',
       'The Coffeelicious', 'Hyperlink Magazine', 'BlockChannel',
       'Accelerated Intelligence', 'Zebpay', 'Cognitive Dissident'],
      dtype='object', length=101)

In [43]:
pd.Series(df['publicationname'].unique()).isin(df_test['publication'].unique()).value_counts()

False    6469
True       38
dtype: int64

In [44]:
df[df.publicationname.str.contains('Netflix')].publicationname

43268     Netflix TechBlog
43269     Netflix TechBlog
43270     Netflix TechBlog
43271     Netflix TechBlog
62750     Netflix TechBlog
                ...       
195032    Netflix TechBlog
195033    Netflix TechBlog
195034    Netflix TechBlog
195035    Netflix TechBlog
195036    Netflix TechBlog
Name: publicationname, Length: 68, dtype: object

In [45]:
df_test['publication'] = df_test['publication'].replace(['HackerNoon.com'], 'Hacker Noon')

### Duplicates
After some research it was discovered that medium allows the author to describe an article by using up to five tags to each story. The data made it obvious that the articles were scraped by querying individual tags e.g. all articles with the tag “machine learning” followed by all articles with the tag “cryptocurrency” and so on. As a result, the same story was scraped multiple times for each tag in the dataset and thus resulting in a massive number of duplicates. Most duplicate rows only differ in the “tag” column. This column is not in our test dataset that is why I decided to just remove them and reduce the workload for my neural network.

In [46]:
doublicateIdentifier = ["url", "postId"] #url and postId were choseen as an entry that combine all entries that have duplicates in the dataset 

multi_tags = df[df.duplicated(subset=doublicateIdentifier, keep=False)]

print("There are: ", multi_tags.shape[0], "Duplicated entries.")
print("Unique posts with multiple tags: ", multi_tags.shape[0]- df[df.duplicated(subset=doublicateIdentifier, keep="last")].shape[0])

There are:  274050 Duplicated entries.
Unique posts with multiple tags:  66814


In [47]:
df = df[~df.duplicated(subset=doublicateIdentifier)]
df.tag_name.head(20)

0                  Blockchain
3                  Technology
7                Data Science
11                   Robotics
16    Artificial Intelligence
21                     Oracle
25    Artificial Intelligence
26           Machine Learning
31    Artificial Intelligence
32           Machine Learning
36               Data Science
39    Artificial Intelligence
44                        Sex
49                     Python
54          Hd Live Streaming
57    Artificial Intelligence
58                 Technology
61           Machine Learning
65                      Music
70                    Bitcoin
Name: tag_name, dtype: object

### Removing non-English articles
Removing non-English articles is a text mining task. It requires a language identification algorithm. Basically, an NLP algorithm should differentiate different corpuses taking in account the grammatical characteristics of each language. For example, NLTK functions are good for the English language and at the same time FreeLing is the best for Spanish text sources (Data Big Bang). Having multiple networks for each language is the optimal approach. This task is not suitable in this case, because in the test set we have only English articles and non-English content in the train set would be a noise for our future prediction model. This noise can only decrease the prediction accuracy. The best solution in this situation in my point of view is to remove the non-English articles entirely.

In [48]:
df.language.unique() #the list of all languages

array(['en', 'th', 'ja', 'zh', 'ru', 'pt', 'es', 'zh-Hant', 'id', 'my',
       'de', 'tr', 'fr', 'ko', 'it', 'lo', 'un', 'vi', 'cs', 'sk', 'is',
       'sv', 'bn', 'mn', 'da', 'no', 'bg', 'ar', 'pl', 'nl', 'ro', 'ca',
       'hu', 'hi', 'ka', 'el', 'ms', 'uk', 'si', 'sr', 'lt', 'la', 'fa',
       'ml', 'sl', 'mr', 'az', 'lv', 'te', 'mk', 'nn', 'fi'], dtype=object)

In [49]:
index = df[df["language"] != "en"].index #save the list of all texts that are not english by using their index
df.drop(index, inplace = True) #drop them
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66379 entries, 0 to 279572
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   firstPublishedDate     66379 non-null  object 
 1   language               66379 non-null  object 
 2   postId                 66379 non-null  object 
 3   readingTime            66379 non-null  float64
 4   responsesCreatedCount  66379 non-null  int64  
 5   text                   66379 non-null  object 
 6   title                  66379 non-null  object 
 7   totalClapCount         66379 non-null  int64  
 8   url                    66379 non-null  object 
 9   wordCount              66379 non-null  int64  
 10  publicationname        66379 non-null  object 
 11  tag_name               66379 non-null  object 
 12  author                 66379 non-null  object 
dtypes: float64(1), int64(3), object(9)
memory usage: 7.1+ MB


In [52]:
df.language.unique()
df.drop(["tag_name", "postId", "language"], axis="columns", inplace=True)

### Creating new features: article’s age and month when it was published
These two features were created in test set and the logic of this creation is described above.

In [53]:
df['firstPublishedDate'] = pd.to_datetime(df['firstPublishedDate'])
df['online_since'] = (pd.datetime.today() - df['firstPublishedDate']).dt.days
df["month"] = pd.DatetimeIndex(df['firstPublishedDate']).month
df = df.drop(['firstPublishedDate'], axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66379 entries, 0 to 279572
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   readingTime            66379 non-null  float64
 1   responsesCreatedCount  66379 non-null  int64  
 2   text                   66379 non-null  object 
 3   title                  66379 non-null  object 
 4   totalClapCount         66379 non-null  int64  
 5   url                    66379 non-null  object 
 6   wordCount              66379 non-null  int64  
 7   publicationname        66379 non-null  object 
 8   author                 66379 non-null  object 
 9   online_since           66379 non-null  int64  
 10  month                  66379 non-null  int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 6.1+ MB


### Calculate reading time
On the train dataset the column for reading time already existed but because I discovered that irregularities in the text body and my adjustments regarding punctuation where most likely not present I decided to recalculate it using the same formula as before to have consistent results.
The calculation of the reading time in the train set is complicated by the fact that all articles start with a repetition of the title inside the text body. For a proper reading time calculation, we need to remove all titles from text body. I used a regular expression to deal with this.
The sequence of actions in this case is important. My research has shown that effective way of removing the title from the body of text is based on the logic of removing the first line, when followed by \n\n. Next, I count images, replace hyperlinks, remove html and then I add the new features just like in the test set.

In [54]:
df['text'] = df.apply(lambda row: re.sub(re.escape(row['title']) + r'.{0,4}' + '\n', '', row['text']), axis=1)

In [55]:
texts = []
images_count = []
for text in df.text:
    soup = BeautifulSoup(text) 
    images = soup.findAll('img')
    len_images = len(images)
    cleaned_text = soup.get_text() #remove html
    cleaned_text = cleaned_text.strip() #remove white spaces
    images_count.append(len_images)
    texts.append(cleaned_text)
#images_count
#df_test["images_count"] = images_count
df["texts"] = texts
df["images_count"] = images_count

In [56]:
df['texts'] = df.apply(lambda row: re.sub(r'\n', ' ', row['texts']), axis=1) #still some \n were detected. to be sure that all are removed

In [57]:
def image_minutes(images_count):
    sec_per_image = 13
    sec = 0
    for x in range(images_count):
        sec_per_image = sec_per_image - 1
        if x >= 10:
            sec_per_image = 3
        #print(sec_per_image)
        sec = sec + sec_per_image
                  
    return sec

In [58]:
df[df.wordCount == 0]

Unnamed: 0,readingTime,responsesCreatedCount,text,title,totalClapCount,url,wordCount,publicationname,author,online_since,month,texts,images_count
77229,0.2,0,\n,Humanity will suffer under AI if we don´t act ...,0,https://medium.com/s/story/humanity-will-suffe...,0,,Talenter.io,1027,11,,0


It was detected a proble in the row with index 77229. The error sounds as: "ZeroDivisionError: division by zero". The best solution is to remove this row.

In [59]:
df = df.drop([77229], axis = 0)

In [60]:
df['read_time'] = df.apply(lambda row: 265 / row['wordCount'] + image_minutes(row['images_count']), axis=1)

In [61]:
link_dummy = '___LINK___ '
df['texts'] = df.apply(lambda row: re.sub(r'https?:\/\/[\w\-\.\/\?=\#]*(\s|$)', link_dummy, row['texts']), axis=1)
df['texts'] = df.apply(lambda row: re.sub(r'www[\w\-\.\/\?\#]*(\s|$)', link_dummy, row['texts']), axis=1)
df['texts'] = df.apply(lambda row: re.sub(r'\s[\w\-\.\/\?\#]*\.com(\s|$)', link_dummy, row['texts']), axis=1)

In [62]:
#Punctuation
df['texts'] = df.apply(lambda row: re.sub(r'(?<=[.,:;"!?])(?=[^\sa-z\"])', r' ', row['texts']), axis=1)
# df['texts'] = df.apply(lambda row: re.sub(r'\xa0', r' ', row['texts']), axis=1)

In [63]:
df.texts = df.apply(lambda row: re.sub(link_dummy, '', row['texts']), axis=1)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66378 entries, 0 to 279572
Data columns (total 14 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   readingTime            66378 non-null  float64
 1   responsesCreatedCount  66378 non-null  int64  
 2   text                   66378 non-null  object 
 3   title                  66378 non-null  object 
 4   totalClapCount         66378 non-null  int64  
 5   url                    66378 non-null  object 
 6   wordCount              66378 non-null  int64  
 7   publicationname        66378 non-null  object 
 8   author                 66378 non-null  object 
 9   online_since           66378 non-null  int64  
 10  month                  66378 non-null  int64  
 11  texts                  66378 non-null  object 
 12  images_count           66378 non-null  int64  
 13  read_time              66378 non-null  float64
dtypes: float64(2), int64(6), object(6)
memory usage: 7.6+

In [68]:
#df= df.drop(['url', 'images_count', 'wordCount', 'text'], axis = 1)
df_test = df_test.drop(['word_counts', 'images_count'], axis = 1)

In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66378 entries, 0 to 279572
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   readingTime            66378 non-null  float64
 1   responsesCreatedCount  66378 non-null  int64  
 2   title                  66378 non-null  object 
 3   totalClapCount         66378 non-null  int64  
 4   publicationname        66378 non-null  object 
 5   author                 66378 non-null  object 
 6   online_since           66378 non-null  int64  
 7   month                  66378 non-null  int64  
 8   texts                  66378 non-null  object 
 9   read_time              66378 non-null  float64
dtypes: float64(2), int64(4), object(4)
memory usage: 5.6+ MB


In [69]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514 entries, 0 to 513
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   responses     514 non-null    int32  
 1   header        514 non-null    object 
 2   texts         514 non-null    object 
 3   author        514 non-null    object 
 4   publication   514 non-null    object 
 5   month         514 non-null    int64  
 6   online_since  514 non-null    int64  
 7   read_time     514 non-null    float64
dtypes: float64(1), int32(1), int64(2), object(4)
memory usage: 30.2+ KB


### Adjusting column names in test and train data
The icing on the top of the first part of this study case is to rename the columns the same way in the test and train sets. For example the header in the train set is labelled title. Responses are named as “responsesCreatedCount”. 

In [70]:
train = df
train = train[['responsesCreatedCount', 'title', "texts", 'author', 'publicationname', 'month', 'online_since', 'readingTime', 'totalClapCount']]
train.rename(columns={ 
    'responsesCreatedCount': 'responses',
    'title': 'header',  
    'publicationname': 'publisher',
    'readingTime': 'read_time',
    'totalClapCount': 'claps',
}, inplace=True)

In [71]:
test = df_test[['responses', 'header', 'texts', 'author', 'publication', 'month', 'online_since', 'read_time']]
test.rename(columns={'publication': 'publisher'}, inplace=True)

Saving data sets

In [68]:
import pickle
#with open('C:\\ADAMS_tutorials\\ADAMS_Assignment\\data_pre_clean_train.pkl','wb') as path_name:
    #pickle.dump(train, path_name) #save the data

#with open('C:\\ADAMS_tutorials\\ADAMS_Assignment\\data_pre_clean_test.pkl','wb') as path_name:
#    pickle.dump(test, path_name) #save the data

with open('C:\\ADAMS_tutorials\\ADAMS_Assignment\\data_pre_clean_test.pkl','rb') as path_name:
    test = pickle.load(path_name) #load

In [69]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514 entries, 0 to 513
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   responses     514 non-null    int32  
 1   header        514 non-null    object 
 2   texts         514 non-null    object 
 3   author        514 non-null    object 
 4   publisher     514 non-null    object 
 5   month         514 non-null    int64  
 6   online_since  514 non-null    int64  
 7   read_time     514 non-null    float64
dtypes: float64(1), int32(1), int64(2), object(4)
memory usage: 30.2+ KB


# Text Tokenization
NLTK - The Natural Language Toolkit – provides the necessary tools and methods to process and analyze text data. By using this framework, we can fulfil such tasks as: tokenization, stemming, lemmatization, tagging or parsing. These transformations are the standard procedure of any NLP project.
Textual data as a rule is not well formatted and standardized and namely highly unstructured. 
“Text processing or, to be more specific pre-processing, involves a wide variety of techniques that convert raw text into well-defined sequences of linguistic components that have standard structure and notation. Additional metadata is often also present in the form of annotations to give more meaning to the text components like tags” (Sarkar 2019: 115). Some cleaning techniques such as HTML removing was already made earlier, still the list of further transformation is as follows: 

    1.	Convert to lowercase
    2.	Split into tokens
    3.	Remove punctuation from each token
    4.	Filter out remaining tokens that are not alphabetic
    5.	Filter out tokens that are stop words
    6.	Lemmatize tokens
    
For example, text tokenisation defines syntax and semantics of a smallest textual component. To achieve this the text is broken into separate words. The smallest component is a token.
Algorithm can only match specific words or phrases if all words/tokens are in the same case by a upper- or lowercase conversion. In this case I decided to convert them into lowercase tokens. All punctuation is also removed because for algorithm it is just noise. For the same reason it is common to remove all stop words that have little or no significance for our model. Lastly, I lemmatize the tokens. The lemmatization process converts words into their base form (e.g. Sarkar 2019). I used the code from the ADAM’s tutorial as a template. 


## Text and Header Tokenization

In [75]:
import string
import nltk
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\tesle\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [76]:
#TRAIN DATA
def  clean_text(train):
    train['texts'] = train['texts'].str.lower()
    train['header'] = train['header'].str.lower()   
    # remove numbers
    train['texts'] = train['texts'].apply(lambda elem: re.sub(r"\d+", "", elem))
    train['header'] = train['header'].apply(lambda elem: re.sub(r"\d+", "", elem))
    return train

train = clean_text(train)

In [77]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

In [78]:
lemmatizer = WordNetLemmatizer()
def lem_text(text):
    #sentences = sent_tokenize(text) #split into sentences
    tokens = word_tokenize(text) #splitting strings into tokens nominally words
    tokens = [w.lower() for w in tokens] #convert to lower case
    words = [word for word in tokens if word.isalpha()] #remove all tokens that are not alphabetic
    regex_punc = re.compile('[%s]' % re.escape(string.punctuation)) #prepare regex for char filtering
    stripped = [regex_punc.sub('', w) for w in tokens] # remove punctuation from each word
    stop_words = stopwords.words('english') #filter out stop words
    words = [w for w in words if not w in stop_words]
    lemma_words =[lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in words]

    return lemma_words

train['texts'] = train['texts'].apply(lem_text)
train['header'] = train['header'].apply(lem_text)

In [None]:
with open('C:\\ADAMS_tutorials\\ADAMS_Assignment\\data_lem_train.pkl','wb') as path_name:
    pickle.dump(train, path_name) #save the data

In [None]:
with open('C:\\ADAMS_tutorials\\ADAMS_Assignment\\data_lem_train.pkl','rb') as path_name:
    train = pickle.load(path_name) #load

In [79]:
#TEST DATA
def  clean_text(test):
    test['texts'] = test['texts'].str.lower()
    test['header'] = test['header'].str.lower()   
    # remove numbers
    test['texts'] = test['texts'].apply(lambda elem: re.sub(r"\d+", "", elem))
    test['header'] = test['header'].apply(lambda elem: re.sub(r"\d+", "", elem))
    return test

test = clean_text(test)

In [80]:
lemmatizer = WordNetLemmatizer()
def lem_text(text):
    #sentences = sent_tokenize(text) #split into sentences
    tokens = word_tokenize(text) #splitting strings into tokens nominally words
    tokens = [w.lower() for w in tokens] #convert to lower case
    words = [word for word in tokens if word.isalpha()] #remove all tokens that are not alphabetic
    regex_punc = re.compile('[%s]' % re.escape(string.punctuation)) #prepare regex for char filtering
    stripped = [regex_punc.sub('', w) for w in tokens] # remove punctuation from each word
    stop_words = stopwords.words('english') #filter out stop words
    words = [w for w in words if not w in stop_words]
    lemma_words =[lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in words]

    return lemma_words

In [81]:
test['texts'] = test['texts'].apply(lem_text)
test['header'] = test['header'].apply(lem_text)

After the tokenisation and lemmatisation process is done and I got acquainted with the result. I took notice that the lemmatization removed the letter "s" from the word "us". Potentially it means that there are a lot of single letters in the "texts" and "header" columns. I noticed that there is also a huge amount of words that are written incorrectly and were therefore tokenized as a new separate word. Removing these mistakes is next necessary step.

In [82]:
import collections

In [83]:
# TRAIN DATA
word_counter = collections.Counter()
for t in train.texts:
    for w in t:
        word_counter.update({w: 1})
        
for h in train.header:
    for w in h:
        word_counter.update({w: 1})

word_frequency = sorted(word_counter.items(), key=lambda pair: pair[1], reverse=False)

In [84]:
least_common_words = [word for word, freq in word_frequency if freq <= 10]
len(least_common_words)

174302

In [85]:
#stop_words = stopwords.words('english')
updated_stopwords = set(least_common_words + ['http']) #there is still http in the text number 3
print(f'Count stop words: {len(updated_stopwords)}')

Count stop words: 174303


In [86]:
def filter_words(words):
    new_words = []
    for word in words:
        if len(word) > 2 and not word in updated_stopwords:
            new_words.append(word)
    return new_words

In [87]:
train.texts = list(map(filter_words, train.texts))
train.header = list(map(filter_words, train.header))

In [88]:
#TEST DATA
test_word_counter = collections.Counter()
for t in test.texts:
    for w in t:
        test_word_counter.update({w: 1})
        
for h in test.header:
    for w in h:
        word_counter.update({w: 1})

test_word_frequency = sorted(test_word_counter.items(), key=lambda pair: pair[1], reverse=False)

In [89]:
test_least_common_words = [word for word, freq in test_word_frequency if freq <= 10]
len(test_least_common_words)

13755

In [90]:
test_updated_stopwords = set(test_least_common_words + ['http']) 
print(f'Count stop words: {len(test_updated_stopwords)}')

Count stop words: 13756


In [91]:
def test_filter_words(words):
    test_new_words = []
    for word in words:
        if len(word) > 2 and not word in test_updated_stopwords:
            test_new_words.append(word)
    return test_new_words

In [92]:
test.texts = list(map(test_filter_words, test.texts))
test.header = list(map(test_filter_words, test.header))

### Author and publisher Tokenization
I noticed that my publisher and author columned needed further transformation to be available in my model that is why I made the decision to leave only the top 500 hundred authors from my dataset and tokenize them. I did the same approach with publishers. The name and family name of an author or the name of a publisher is presented as a number. I did not find any convenient way to do this, so I tried to replicate the process I did for the word tokenization before and unpacked the 1-dimentional list to have a simple integer column. I am convinced there is a better way to do this.
I noticed also that it could be usefful to remove the least common words.

In [2]:
with open('C:\\ADAMS_tutorials\\ADAMS_Assignment\\data_lem_train.pkl','rb') as path_name:
    train = pickle.load(path_name) #load

In [3]:
import collections

word_counter = collections.Counter()
for t in train.texts:
    for w in t:
        word_counter.update({w: 1})
        
for h in train.header:
    for w in h:
        word_counter.update({w: 1})

word_frequency = sorted(word_counter.items(), key=lambda pair: pair[1], reverse=False)

In [4]:
least_common_words = [word for word, freq in word_frequency if freq <= 10]
len(least_common_words)

174308

In [5]:
# keep
len([word for word, freq in word_frequency if freq > 10])

37987

In [6]:
#stop_words = stopwords.words('english')
updated_stopwords = set(least_common_words + ['http'])
print(f'Count stop words: {len(updated_stopwords)}')

Count stop words: 174309


In [7]:
def filter_words(words):
    new_words = []
    for word in words:
        if len(word) > 2 and not word in updated_stopwords:
            new_words.append(word)
    return new_words

In [8]:
train.texts = list(map(filter_words, train.texts))
train.header = list(map(filter_words, train.header))

In [9]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [10]:
NUM_WORDS = 6000 #5000
# Create tokenizer object and build vocab from the training set
tokenizer = Tokenizer(NUM_WORDS, oov_token=1)  
tokenizer.fit_on_texts(train.texts)

In [95]:
tokenizer.fit_on_texts(test.texts)

In [11]:
header_num_words = 1000
tokenizer_header = Tokenizer(header_num_words, oov_token=1)
tokenizer_header.fit_on_texts(train.header)

In [18]:
tokenizer_header.fit_on_texts(test.header)

In [12]:
train['sequences_text'] = tokenizer.texts_to_sequences(train['texts'])
train['sequences_header'] = tokenizer.texts_to_sequences(train['header'])

In [45]:
test['sequences_text'] = tokenizer.texts_to_sequences(test['texts'])
test['sequences_header'] = tokenizer.texts_to_sequences(test['header'])

In [13]:
vocab_size = len(tokenizer.word_index) + 1

In [14]:
max_texts_length = max([len(text) for text in train.sequences_text])
print('The longest review of the training set has {} words.'.format(max_texts_length))

The longest review of the training set has 12180 words.


In [15]:
max_header_length = max([len(header) for header in train.sequences_header])
print('The longest review of the training set has {} words.'.format(max_header_length))

The longest review of the training set has 14 words.


In [16]:
NUM_AUTHORS = 500

# Create tokenizer object and build vocab from the training set
tokenizer_author = Tokenizer(NUM_AUTHORS, oov_token=1, split='\n\n\n\n')  # never split = every author is it's own token
tokenizer_publisher = Tokenizer(NUM_AUTHORS, oov_token=1, split='\n\n\n\n')  # never split = every publisher is it's own token

def tokenize_simple_column(column, tokenizer, fit):
    if (fit):
        tokenizer.fit_on_texts(column)
    seq = tokenizer.texts_to_sequences(column)
    return list(map(lambda arr: arr[0] if len(arr) == 1 else 0, seq)) # only return a single number

In [17]:
train['author_tok'] = tokenize_simple_column(train.author, tokenizer_author, True)
train['publisher_tok'] = tokenize_simple_column(train.publisher, tokenizer_publisher, True)

In [46]:
test['author_tok'] = tokenize_simple_column(test.author, tokenizer_author, False)
test['publisher_tok'] = tokenize_simple_column(test.publisher, tokenizer_publisher, False)

In [103]:
#NB_HIDDEN = 16
EPOCH = 5
BATCH_SIZE = 64 
EMBEDDING_DIM = 100
MAX_TEXT_LENGTH = 2000
MAX_HEADER_LENGTH = 8
VAL_SPLIT = 0.3
#MAX_TEXT_LENGTH = 12180
#MAX_HEADER_LENGTH = 14

# Word2Vec
## About Word2Vec
Word2Vec is a rule-based way to represent words in an NLP tasks to represent them as an atomic symbol. Traditionally a binary one-hot encoding was used for such tasks. As a result, we get a huge number of zeros and ones: 

* Hotel [0000…1000]
* Motel [0100…0000]

One-hot encoding causes a drastically growing dimensionality and immense computational time as the algorithm complexity is getting more difficult, too. One more problem is that the similarities are not visible in this type of model and as a result the contextual meaning of the words will be lost. Model whose training is based on the vocabulary that is one-hot encoded is almost impossible to reuse in other cases. 
The way to avoid this problem is to represent words as continuous vectors, Word2Vectors. The new model architecture was proposed by Tomas Mikolov and his colleagues. This technique has a much lower computational cost which is essential if it is must learn a dataset with a huge amount of words. In the research paper the authors stated that they used this algorithm to train a model with 1.6 billion words in less than one day. In my case the amount of words is not as high and using this technique has a learning character and was mainly done to learn the architecture of the model and coding techniques. The main advantages of Word2Vec is the opportunity to train models with a huge amount of text faster and reuse the trained models in other applications. Word2Vec is able to detect the similarities and antipodes between words and even calculates a degree of this. Generally speaking, we develop a neural network with a single hidden layer which aims to predict a target word based on its context through other words, in fact its neighbouring words. The scientists have developed two models’ architectures: CBOW and Skip-gram (Mikolov et al., 2013). 
CBOW architecture is a continuous bag of words model. The algorithm in this model tries to predict a target word by using a context words, neighbouring words of the target word. 
![W2V](W2V.png "by Kavita Ganesan")
The Skip-gram model works reverse to CBOW model. In this model the input is a target word and the context words are an output.  
CBOW model trains faster than the skip-gram. This model is more suitable for frequently occurring words. Skip-gram in comparison makes its computations slower but works better if the amount of data is smaller. On the other hand, this model type is less efficient for less frequently occurring words (e.g. Vyas 2020).

In [104]:
from gensim.models import Word2Vec 
import multiprocessing
from time import time
from gensim.models.phrases import Phrases, Phraser

## Implementaion
The implementation of the word2vec model consists of the following steps: create a vocabulary from texts, create a model, create the model’s architecture, train the model, get embeddings for the model (Megret 2018). I used the library “Gensim” for the model creation. The reason to use genism is simply because personally for me this library seems pretty straightforward to use. 
Before implementing a word2vec model I decided to discover common phrases, called bigrams, in my corpora, also by using the gensim library. The development of bigrams can follow to a very sparse corpus. To avoid this problem the decision was made a to add the parameter “min_count = 30”. This causes bigrams whose total absolute frequency lower than 30 to be ignored. 
Word2vec development is based on the tutorial of Pierre Megret.

In [105]:
sent = [row for row in train['texts']]
phrases = Phrases(sent, min_count=30)

In [106]:
bigram = Phraser(phrases)

In [107]:
texts_bigram = bigram[sent]

In [32]:
print(multiprocessing.cpu_count())

8


### CBOW

In [33]:
t = time()
w2v_model = Word2Vec(
    min_count=20,
    window=7,
    size=100,
    sample=6e-5, 
    alpha=0.03, 
    min_alpha=0.0007, 
    negative=20,
    sg=0,
    workers=7)
print('Time to build W2V: {} mins'.format(round((time() - t) / 60, 2)))

Time to build W2V: 0.0 mins


### Skip-gram

In [47]:
skip_w2v_model = Word2Vec(
    min_count=30,
    window=5,
    size=100,
    sample=6e-5, 
    alpha=0.03, 
    min_alpha=0.0007, 
    negative=30,
    sg=1,
    workers=7)

### Building the vocabulary table

In [101]:
w2v_model.build_vocab(texts_bigram)

In [48]:
t = time()

skip_w2v_model.build_vocab(texts_bigram)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

Time to build vocab: 1.56 mins


Train the model
Determine best suitable parameters for training the model is an important step (Megret 2018).
To train CBOW I set a min-count of 20 and in case of Skip-gram to 30. The window size in the first case is 7, because it is not recommended to have a very small window when using CBOW according to Mikolov and his co-authors. At the same time in Skip-gram case the window size is set to 5. This decision is also based on Mikolov and his colleague’s recommendation. In both cases the dimensionality size is 100. The learning rate is 0.03. Sample parameter which controls how much subsampling occurs is very small because smaller values mean words are less likely to be kept. The negative sample is set on 20 and 30. In the paper of Mikolov et al. it is recommended to set this parameter between 5-20 for small data sets to works well. The number of epochs is 30.
During my smoke test I noticed that the CBOW model appears to give better results and found contextually better words. Tuning the parameters could provide better results, for example a negative sampling of 20 in the skip-gram instead of 30 or an increase of the number of epochs. However, my corpora is not very big that is why I did not see potential for a big improvement. 
Therefore, the CBOW model is saved and will be used as a pretrained embedding later. Additionally, we can also use an embedding layer in a network to train the embeddings as a number of researches has shown that using a pre-trained word embeddings improves a model’s performance significantly (Lample  et  al., 2016;  Ma/Hovy,  2016).

In [36]:
#CBOW
t = time()

w2v_model.train(texts_bigram, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

Time to train the model: 49.58 mins


In [37]:
w2v_model.wv.most_similar(positive=["data"])

[('collect', 0.7679224610328674),
 ('raw', 0.7209828495979309),
 ('structure_unstructured', 0.7120325565338135),
 ('large_amount', 0.6950662136077881),
 ('analyze', 0.6918653845787048),
 ('collection', 0.6613565683364868),
 ('disparate_source', 0.660919189453125),
 ('derive_insight', 0.6601530909538269),
 ('large_volume', 0.6494994759559631),
 ('massive_amount', 0.6479994058609009)]

In [46]:
w2v_model.wv.most_similar('code')

[('line_code', 0.7176811695098877),
 ('code_snippet', 0.7047280073165894),
 ('code_github', 0.7031026482582092),
 ('script', 0.695374608039856),
 ('repo', 0.6953631639480591),
 ('docstrings', 0.687702476978302),
 ('codebase', 0.6834437847137451),
 ('gist', 0.6834126114845276),
 ('python_script', 0.6800730228424072),
 ('github_repo', 0.675747811794281)]

In [49]:
#Skip-gram
t = time()

skip_w2v_model.train(texts_bigram, total_examples=skip_w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

Time to train the model: 113.37 mins


In [50]:
skip_w2v_model.wv.most_similar(positive=["data"])

[('raw', 0.7005041837692261),
 ('datasets', 0.6964728832244873),
 ('set', 0.6769678592681885),
 ('collect', 0.6662955284118652),
 ('big', 0.6536503434181213),
 ('wrangle', 0.6514309644699097),
 ('analytics', 0.6496230363845825),
 ('source', 0.6456195116043091),
 ('dataset', 0.6441148519515991),
 ('large_datasets', 0.636734664440155)]

In [51]:
skip_w2v_model.wv.most_similar('code')

[('available_github', 0.7703068256378174),
 ('gist', 0.7674460411071777),
 ('code_github', 0.7659567594528198),
 ('repo', 0.762941837310791),
 ('github_repo', 0.7538014650344849),
 ('github', 0.7494617700576782),
 ('code_snippet', 0.7333778142929077),
 ('github_repository', 0.7289708852767944),
 ('notebook', 0.7178595662117004),
 ('python_script', 0.7177121639251709)]

In [52]:
skip_w2v_model.wv.most_similar('polite')

[('rude', 0.7224957942962646),
 ('respectful', 0.6968988180160522),
 ('swear', 0.6191444396972656),
 ('politeness', 0.6108940839767456),
 ('alexa_siri', 0.5898687839508057),
 ('joke', 0.5882405638694763),
 ('sorry', 0.5827313661575317),
 ('reply', 0.5819169878959656),
 ('laugh', 0.579865038394928),
 ('politely', 0.5752518177032471)]

In [53]:
w2v_model.wv.most_similar('polite')

[('rude', 0.7701680660247803),
 ('politeness', 0.7059785723686218),
 ('respectful', 0.6654813289642334),
 ('courteous', 0.6562213897705078),
 ('rudeness', 0.6077378988265991),
 ('etiquette', 0.5933928489685059),
 ('snarky', 0.5905433297157288),
 ('sorry', 0.5889784693717957),
 ('considerate', 0.5858083963394165),
 ('annoyed', 0.5780697464942932)]

In [55]:
embedding="w2v_CBOW.model"
SAVE_BIN = False
w2v_model.wv.save_word2vec_format(embedding, binary=SAVE_BIN)

In [17]:
from gensim.models import KeyedVectors
embedding="w2v_CBOW.model"
SAVE_BIN = False

In [18]:
# Load model from disk
model_w2v = KeyedVectors.load_word2vec_format(embedding, binary=SAVE_BIN)
model_w2v['code']  # get one embedding to show that loading worked

array([-0.69326967,  2.801525  , -3.3704855 , -0.13535368,  0.19786778,
       -2.1539948 , -1.2386867 ,  0.314623  , -1.3124833 , -0.9608743 ,
       -0.12792991, -3.139249  , -0.67779756,  3.107435  , -2.028022  ,
       -2.259462  , -0.87324476, -5.5098095 ,  1.3169962 ,  1.0432471 ,
        0.82132596, -1.1575875 ,  2.1379461 , -0.8487015 , -0.15197147,
       -0.07523512, -0.17044109, -0.50479656, -1.0367959 , -3.4859724 ,
        0.44297886,  0.91962296, -0.14671884,  0.25288174,  0.44186914,
       -0.76035714,  2.6626208 ,  0.26611885, -2.3631518 , -0.58565396,
       -0.15573691, -0.5707565 ,  2.845456  , -0.44194275, -3.5254438 ,
       -0.30945483, -3.498564  , -1.487518  , -1.5231005 , -2.9451795 ,
        2.53046   ,  2.3774047 , -1.8743931 , -2.723858  ,  0.9297845 ,
       -0.31513575,  1.2645259 , -1.7860198 ,  1.9045573 ,  0.39395958,
       -1.8526822 , -4.006216  ,  0.7152482 , -0.87371045,  2.1261194 ,
        3.27963   , -2.5488276 , -2.248979  ,  0.65028185,  2.39

# Creating a validation set
Split the data into train and validation set is a technique for evaluating the performance of a developed model. In this case it was chosen to use a hold-out method to split the data, which separates 30 percent of the train data in a validation set while 70 percent stays in the train set. Train dataset is used to fit the model and validation set is used to evaluate the model performance. Generally, a Cross-Validation algorithms performs better but because I have only limited computational power on my available hardware, I choose the faster option (e.g. Schneider 1997).
To implement it in Python I used the train_test_split function from the scikit-learn Python machine learning library. Then I used the pad_sequences function to create an equal length for the texts and header columns. This step makes sure that when I will eventually run the neural network, the number of observations in the array will be identical. 

In [23]:
#MAX_TEXT_LENGTH = 2000
#MAX_HEADER_LENGTH = 8
#VAL_SPLIT = 0.3
#MAX_TEXT_LENGTH = 12180 as it is
#MAX_HEADER_LENGTH = 14 as it is

In [19]:
def data():
  
  X = train[[col for col in train.columns if not col == 'claps']]
  y = train['claps'].values
  #y = np.reshape(y, (-1,1))
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=VAL_SPLIT, random_state=42)
    
  #train data 
  X_train_text = pad_sequences(X_train.sequences_text, maxlen=MAX_TEXT_LENGTH, padding='post')
  X_train_header = pad_sequences(X_train.sequences_header, maxlen=MAX_HEADER_LENGTH, padding='post')

  #test data slices
  X_test_text = pad_sequences(X_test.sequences_text, maxlen=MAX_TEXT_LENGTH, padding='post')
  X_test_header = pad_sequences(X_test.sequences_header, maxlen=MAX_HEADER_LENGTH, padding='post')

  X_train_data = X_train[[col for col in X_train.columns if col not in ['header', 'texts', 'sequences_text', 'sequences_header']]]
  X_test_data = X_test[[col for col in X_test.columns if col not in ['header', 'texts', 'sequences_text', 'sequences_header']]]

  return X_train_text, X_train_header, X_train_data, y_train, X_test_text, X_test_header, X_test_data, y_test

In [20]:
X_train_text, X_train_header, X_train_data, y_train, X_test_text, X_test_header, X_test_data, y_test = data()

In [82]:
X_test_data.author.head(2)

67743    Ganes Kesari
7526         UC Today
Name: author, dtype: object

In [95]:
NUM_AUTHORS = 500
def tokenize_simple_column(column, max_tokens):
    # Create tokenizer object and build vocab from the training set
    tokenizer_au = Tokenizer(NUM_AUTHORS, oov_token=1, split='\n\n\n\n') 
    tokenizer_au.fit_on_texts(column)
    seq = tokenizer_au.texts_to_sequences(column)
    return list(map(lambda arr: arr[0] if len(arr) == 1 else 0, seq))

X_train_data['author_tok'] = tokenize_simple_column(X_train_data.author, NUM_AUTHORS)
X_test_data['author_tok'] = tokenize_simple_column(X_test_data.author, NUM_AUTHORS)
X_train_data['publisher_tok'] = tokenize_simple_column(X_train_data.publisher, NUM_AUTHORS)
X_test_data['publisher_tok'] = tokenize_simple_column(X_test_data.publisher, NUM_AUTHORS)


In [96]:
test['publisher_tok'] = tokenize_simple_column(test.publisher, NUM_AUTHORS)
test['author_tok'] = tokenize_simple_column(test.publisher, NUM_AUTHORS)

In [22]:
X_train_data = X_train_data.drop(['author', 'publisher'], axis=1) 
X_test_data = X_test_data.drop(['author', 'publisher'], axis=1)

In [99]:
test = test.drop(['author', 'publisher'], axis=1)

In [100]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514 entries, 0 to 513
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   responses         514 non-null    int32  
 1   header            514 non-null    object 
 2   texts             514 non-null    object 
 3   month             514 non-null    int64  
 4   online_since      514 non-null    int64  
 5   read_time         514 non-null    float64
 6   sequences_text    514 non-null    object 
 7   sequences_header  514 non-null    object 
 8   publisher_tok     514 non-null    int64  
 9   author_tok        514 non-null    int64  
dtypes: float64(1), int32(1), int64(4), object(4)
memory usage: 38.3+ KB


In [102]:
with open('C:\\ADAMS_tutorials\\ADAMS_Assignment\\test_dataset.pkl','wb') as path_name:
    pickle.dump(test, path_name) #save the test data before splitting and assignment

# Develop predictive models 
To solve an NLP task on a sequential data type it is possible to use the LSTM (Long short-term memory) or GRU (gated recurrent unit) model. The main advantage of GRU is that it works faster also in some cases it might even have a better performance than LSTM. A GRU cell graphically looks as following:![GRU](GRU.png "by Vasilev et.el ")

In the picture the ht is a single hidden state, z<sub>t</sub> and r<sub>t</sub> are the two gates and z<sub>t</sub> represents an update gate. This gate makes the decision which new information will be included and what old information will be dropped. This decision is based on the input information and the previous cell’s hidden state x<sub>t</sub> and h<sub>h-1</sub>: z<sub>t</sub> = σ(W<sub>z</sub>x<sub>t</sub> + U<sub>z</sub>h<sub>t-1</sub>).
The reset gate decides the amount of a previous states that will be passed through: r<sub>t</sub> = σ(W<sub>r</sub>x<sub>t</sub> + U<sub>r</sub>h<sub>t-1</sub>). 
The h’<sub>t</sub> is the candidate state which is calculated as following: h’<sub>t</sub> = than(W<sub>xt</sub> + U(r<sub>t</sub>*h<sub>t-1</sub>)).

These calculations bring us to the output equation h<sub>t</sub> = (1-z<sub>t</sub>)*h<sub>t-1</sub> ⊙ z<sub>t</sub>*h’<sub>t</sub> . As we can see the output at a time t is “a linear interpolation between the previous output h<sub>t-1</sub>
and the candidate output” (Vasilev et.el 2016: 213-214).

In [29]:
from keras.models import Sequential
from keras.layers import Dense, Embedding,GRU, Dropout
from keras.layers.embeddings import Embedding
from keras.initializers import Constant
from keras.optimizers import RMSprop
from keras.layers import Input

### Sequential benchmark model
For my models creation I use the Keras library. Keras is a popular open-source neural network library, this library allows us to build, train and evaluate neural networks. In addition, we can organise and design layers of a neural network by using it. There are two ways to create a neural network: sequential and functional. A sequential type of neural network’s has a straightforward logic behind: In the first line we create a sequential model and then step by step adding further layers, like embedding layer or dense layer. At the end we summarise the model results. Functional API model on the other hand connects directly input and output layers by passing function to the next layer. This type of modelling is more flexible and allows us to create deeper models with multiple inputs and outputs. 
Considering this information, I developed my first neural network in sequential way with only one input “texts”.  I started by creating an embedding layer. Embedding layer in Keras convert tokenized words into vectors. The vectors should have the same size that is why we set a maximal length. The maximal length of the text is the vector size. Another parameter is the vocabulary size, which in my case it was set to 5000 words for most models but later changed to 6000. I tried to use more words (12000) but as a result the computational time grew drastically. The output_dim is the dimension of dense embeddings. In my first model it was set to 100. Because my vocabulary size is not very big a small dimension size was chosen. The Embedding layer aim to optimize its weights. As a result, we should get the best word embeddings that will generate a minimum loss.
The next layer is the actual neural network which uses Kera’s GRU layer. I need to supply the number of hidden neurons for the network. There is no rule about the best number or how to calculate it. It is recommended to try different variations. As a result, it was decided to put 16 neurons to save computational power and because there is no strongly recommended rule about this number.
The last layer is a dense layer that connects the neural network to a one-dimensional output by concentrating its output. It means that each input node is connected to each output node.
At last I use compile to combine all layers to a model with loss and optimizer function. As the loss function I chose mean squared error. MSE is used to calculate regression for regression tasks. “Mean squared error is calculated as the average of the squared differences between the predicted and actual values. The result is always positive regardless of the sign of the predicted and actual values and a perfect value is 0.0.”(Brownlee 2019). Firstly, mean squared error was chosen because we deal with a numeric output in other words, we need a loss functions for regression. We are calculating the square of our error and then take it’s mean by using the MSE. This type or loss function “gives relatively higher weight (penalty) to large errors/outliers, while smoothening the gradient for smaller errors” (Hirekerur 2020). 
When training the model on my dataset I used only 5 epochs because my maximal length for the article text was 2000 words, the number of words was set to 5000 and I am training my model on almost 50 thousand words. The parameter batch size was set to 100. Larger batch sizes result in faster training of a model, smaller batch size trains slower (Brownlee 2019). The algorithm takes the first 100 rows from the training set and trains our model. Then it takes the next 100 samples and trains it. The algorithm repeats this again and again. 
The results of a benchmark model gave us on validation set the lowest MSE: 6588456.0000. During all five epochs we can see the trend to decrease of MSE.
Even this relatively simple model took a surprisingly long to train but gave me important insight on which parameter I needed to tune in my next approaches. 
The code of the models' architectures is based on ADAM's tutorials and also on the blopgpost of Usman Malik.

In [24]:
embedding_layer=Embedding(input_dim=NUM_WORDS, 
                          output_dim=EMBEDDING_DIM, 
                          input_length=MAX_TEXT_LENGTH
                         )
model1=Sequential()                        
model1.add(embedding_layer)
model1.add(GRU(NB_HIDDEN))
model1.add(Dense(1, activation="relu"))
model1.compile(loss="mean_squared_error", optimizer=RMSprop(clipvalue=1, clipnorm=1), metrics=["mse"])
model1.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 2000, 100)         500000    
_________________________________________________________________
gru_1 (GRU)                  (None, 16)                5616      
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 505,633
Trainable params: 505,633
Non-trainable params: 0
_________________________________________________________________


In [33]:
model1_story = model1.fit(X_train_text, y_train, batch_size=BATCH_SIZE, epochs=EPOCH, validation_data=(X_test_text, y_test))

Train on 46464 samples, validate on 19914 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
#SCORE_BAG = {}

In [42]:
to_disk = (model1, model1_story)
with open('model1.pkl','wb') as file_name:
    pickle.dump(to_disk, file_name)#

In [None]:
#Load the model 
#with open('model1.pkl','rb') as file_name:
    #model1, model1_story = pickle.load(file_name)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from gensim.models import KeyedVectors
from gensim.models.keyedvectors import Word2VecKeyedVectors

In [62]:
text_vocab = KeyedVectors.load('C:\\ADAMS_tutorials\\ADAMS_Assignment\\w2v_model.model')
print('Loaded pre-trained embeddings for {} words.'.format(len(text_vocab.wv.vocab)))

Loaded pre-trained embeddings for 36942 words.


In [26]:
from keras.layers import Input, Dense, CuDNNLSTM, CuDNNGRU, LSTM, GRU, GlobalAveragePooling1D, Embedding, Dropout, Concatenate

In [27]:
from keras.models import Model

### Second model 
For the second model I wanted to include one more feature primary to learn to use multiple inputs. This was also developed in Keras but this time in a functional way. In this model I added the two inputs texts and header. I also add a pre-trained Word2Vec embedding layer for the text input, which I created early using the CBOW algorithm. The number of trained parameters was increase as a result. All other parameters stayed the same compared to the first model. The MSE results are improved in comparison with the first model. But the improvement is small. 
Develop a function that helps to incorporate a pre-trained embadding

In [24]:
def get_embedding_matrix(tokenizer, pretrain, vocab_size):
    dim = 0
    if isinstance(pretrain, KeyedVectors) or isinstance(pretrain, Word2VecKeyedVectors):
        dim = pretrain.vector_size        
    elif isinstance(pretrain, dict):
        dim = next(iter(pretrain.values())).shape[0]  
    else:
        raise Exception('{} is not supported'.format(type(pretrain)))    

    emb_mat = np.zeros((vocab_size, dim))
    oov_words = []

    for word, i in tokenizer.word_index.items():  
        try:
            emb_mat[i] = pretrain[word]
        except:
            oov_words.append(word)
    print('Created embedding matrix of shape {}'.format(emb_mat.shape))
    print('Encountered {} out-of-vocabulary words.'.format(len(oov_words)))
    return (emb_mat)

In [25]:
w2v_weights = get_embedding_matrix(tokenizer, model_w2v, NUM_WORDS)

Created embedding matrix of shape (5000, 100)
Encountered 32267 out-of-vocabulary words.


In [40]:
input_tensor_text = Input(shape=(MAX_TEXT_LENGTH, ), name="text")
input_tensor_header = Input(shape=(MAX_HEADER_LENGTH, ), name="header")

embedding_layer=Embedding(input_dim=NUM_WORDS,
    output_dim=EMBEDDING_DIM,
    embeddings_initializer=Constant(w2v_weights), #weights to start with, and not nouch during training
    input_length=MAX_TEXT_LENGTH,
    trainable=False
    )(input_tensor_text)

GRU1 = GRU(100)(embedding_layer)

dense_layer1 = Dense(10, activation='relu')(input_tensor_header)
dense_layer2 = Dense(10, activation='relu')(dense_layer1)

concat_layer = Concatenate()([GRU1, dense_layer2])

dense_layer3 = Dense(10, activation='relu')(concat_layer)

output_tensor = Dense(1, activation='relu')(dense_layer3)

model = Model(inputs = [input_tensor_text, input_tensor_header], outputs = output_tensor)
model.compile(loss = 'mean_squared_error',
                optimizer = RMSprop(clipvalue=1, clipnorm=1),
                metrics=['mse'])
model.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text (InputLayer)               (None, 2000)         0                                            
__________________________________________________________________________________________________
header (InputLayer)             (None, 8)            0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 2000, 100)    500000      text[0][0]                       
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 10)           90          header[0][0]                     
____________________________________________________________________________________________

In [41]:
model = model.fit(
    x=[X_train_text, X_train_header],
    y=np.reshape(y_train, (-1,1)),
    batch_size=100,
    epochs=3,
    validation_data=([X_test_text, X_test_header], np.reshape(y_test, (-1,1))))
    

Train on 46464 samples, validate on 19914 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


### Third model
For the second model my focus was to include all available features not just the text content.
Testing the architecture of a second model has shown some improvement. This improvement inspired me to add one more input – numeric input. It means I have three inputs, two embedding layers. I added also a pre-embedding CBOW layer to the header too. I also trained the model 10 epochs instead of 3. Other parameters are the same as in the previous model. The results are the best. The MSE is 4697665.30. 
In the last two models I used RMSprop optimizer. This optimizer is “keep the moving average of the squared gradients for each weight” (Bushaev 2018). It means that this optimizer through the average calculation of squared gradients normalizes the gradient. As a result the step size is balanced, it means that we avoid the explosion of the gradient’s step or we avoiding the vanishing if the step is small. 

In [64]:
input_tensor_text = Input(shape=(MAX_TEXT_LENGTH, ), name="text")
input_tensor_header = Input(shape=(MAX_HEADER_LENGTH, ), name="header")
input_tensor_numeric = Input(shape = (len(X_train_data.columns), ), name="numeric")

embedding_layer_text=Embedding(input_dim=NUM_WORDS,
    output_dim=EMBEDDING_DIM,
    embeddings_initializer=Constant(w2v_weights), 
    input_length=MAX_TEXT_LENGTH,
    trainable=True
    )(input_tensor_text)

embedding_layer_header=Embedding(input_dim=NUM_WORDS,
    output_dim=EMBEDDING_DIM,
    embeddings_initializer=Constant(w2v_weights), 
    input_length=MAX_HEADER_LENGTH,
    trainable=True
    )(input_tensor_header)    

GRU_text = GRU(100)(embedding_layer_text)
GRU_header = GRU(100)(embedding_layer_header)
dense_numeric = Dense(10, activation = 'relu')(input_tensor_numeric)

concat_layer = Concatenate()([GRU_text, GRU_header, dense_numeric])
dense_layer3 = Dense(10, activation='relu')(concat_layer)

output_tensor = Dense(1, activation='relu')(dense_layer3) 

model_2 = Model(inputs = [input_tensor_text, input_tensor_header, input_tensor_numeric], outputs = output_tensor)
model_2.compile(loss = 'mean_squared_error',
                optimizer = RMSprop(clipvalue=1, clipnorm=1),
                metrics=['mse'])
model_2.summary()

Model: "model_6"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
text (InputLayer)               (None, 2000)         0
__________________________________________________________________________________________________
header (InputLayer)             (None, 8)            0
__________________________________________________________________________________________________
embedding_7 (Embedding)         (None, 2000, 100)    500000      text[0][0]
__________________________________________________________________________________________________
embedding_8 (Embedding)         (None, 8, 100)       500000      header[0][0]
__________________________________________________________________________________________________
numeric (InputLayer)            (None, 6)            0
___________________________________________________________________________________________

In [65]:
model_2_history = model_2.fit(
    x=[X_train_text, X_train_header, X_train_data],
    y=np.reshape(y_train, (-1,1)),
    batch_size=BATCH_SIZE,
    epochs=10,
    validation_data=([X_test_text, X_test_header, X_test_data], np.reshape(y_test, (-1,1))))



In [33]:
model_2.model.save("model_2.h5")

In [66]:
validation_loss = np.amin(model_2_history.history['val_loss']) 
print('Lowest validation loss of epoch:', validation_loss)

Lowest validation loss of epoch:4697665.308567183


In [68]:
model_2_history.model.save("model_2_history.h5")

### Assignement with second model

In [60]:
def split_test():
    test_texts = test['texts']
    test_header = test['header']
    test_data = test[[col for col in test.columns if col not in ['header', 'texts']]]
    
    test_texts = pad_sequences(test.sequences_text, maxlen=MAX_TEXT_LENGTH, padding='post')
    test_header = pad_sequences(test.sequences_header, maxlen=MAX_HEADER_LENGTH, padding='post')
    
    return test_texts, test_header, test_data

In [61]:
test_texts, test_header, test_data = split_test()

In [62]:
test_data = test_data.drop(['sequences_header', 'sequences_text'], axis='columns')

In [159]:
predictions = model_2_history.model.predict([test_texts, test_header, test_data], verbose=1)



In [175]:
predictions[100]

array([0.], dtype=float32)

In [176]:
def getTestIndex():
    test_csv = pd.read_csv('C:/ADAMS_tutorials/ADAMS_Assignment/Test.csv')
    test_index = test_csv['index']
    return test_index
    
test_index = getTestIndex()

In [210]:
# reshape to 1-dim
predictions = predictions.reshape(len(predictions), )

In [214]:
df = pd.DataFrame({'Claps': predictions})
df = df.set_index(test_index)

print('DataFrame:\n', df)

# default CSV
predictions_teslenko = df.to_csv()
print('\nCSV String:\n', predictions_teslenko)


DataFrame:
              Claps
index              
0      24515.074219
1       8909.722656
2       8612.857422
5       5828.024902
7       1081.209229
...             ...
598    10315.620117
599     1912.135254
600     1082.013672
601     7907.884766
602     2072.990479

[514 rows x 1 columns]

CSV String:
index,Claps
0,24515.074
1,8909.723
2,8612.857
5,5828.025
7,1081.2092
8,4170.4224
9,3766.5576
11,1666.0199
12,6017.356
13,772.99786
14,374.8884
15,558.7997
16,14071.78
17,15336.468
18,0.0
19,2647.387
20,895.46735
21,1028.1501
22,418.36618
23,3395.9802
25,921.56555
26,400.0162
27,2794.545
28,6261.963
29,6701.7935
30,4540.3145
31,6209.804
32,3095.9604
33,5974.814
34,2897.896
35,1473.134
36,2588.4329
37,1061.9233
39,1150.4482
40,2020.4186
41,880.28033
42,1114.1477
43,946.53064
44,1303.9707
46,429.22244
47,661.0153
48,2172.3142
49,665.6097
50,0.0
51,0.0
52,980.8738
53,359.71085
54,1014.8097
55,853.51483
56,3830.272
57,57.197784
58,1308.01
61,41.75418
63,16.261564
64,129.31839
66,169.99274

In [217]:
with open('submission.csv', 'w', newline='') as csv_file:
    df.to_csv(path_or_buf=csv_file)


In [180]:
predictions[0][0]

24515.074

### Third model

In [9]:
with open('C:\\ADAMS_tutorials\\ADAMS_Assignment\\train_dat_dat.pkl','rb') as path_name:
    train = pickle.load(path_name)

In [None]:
np.mean(text_length)

In [18]:
EPOCH = 5
BATCH_SIZE = 64 
EMBEDDING_DIM = 128
MAX_TEXT_LENGTH = 463 #mean text lenght  
MAX_HEADER_LENGTH = 8
VAL_SPLIT = 0.25
#MAX_TEXT_LENGTH = 12180
#MAX_HEADER_LENGTH = 14

In [19]:
DATA_COLUMN_NAMES = ['responses', 'month', 'online_since', 'read_time', 'author_tok', 'publisher_tok']

def data(train):
  y = train['claps'].values
  #y = np.reshape(y, (-1,1))
  X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=VAL_SPLIT, random_state=42)
    
  #train data 
  X_train_text = pad_sequences(X_train.sequences_text, maxlen=MAX_TEXT_LENGTH, padding='post')
  X_train_header = pad_sequences(X_train.sequences_header, maxlen=MAX_HEADER_LENGTH, padding='post')

  #test data slices
  X_test_text = pad_sequences(X_test.sequences_text, maxlen=MAX_TEXT_LENGTH, padding='post')
  X_test_header = pad_sequences(X_test.sequences_header, maxlen=MAX_HEADER_LENGTH, padding='post')

  X_train_data = X_train[[col for col in X_train.columns if col in DATA_COLUMN_NAMES]]
  X_test_data = X_test[[col for col in X_test.columns if col in DATA_COLUMN_NAMES]]

  return X_train_text, X_train_header, X_train_data, y_train, X_test_text, X_test_header, X_test_data, y_test

In [20]:
X_train_text, X_train_header, X_train_data, y_train, X_test_text, X_test_header, X_test_data, y_test = data(train)

In [21]:
from gensim.models import KeyedVectors
embedding="skip_w2v_model.model"
SAVE_BIN = False
skip_w2v_model = KeyedVectors.load_word2vec_format(embedding, binary=SAVE_BIN)

The weights initialzation is based on Tobias Sterbak's guide (https://www.depends-on-the-definition.com/guide-to-word-vectors-with-gensim-and-keras/).

In [24]:
word_vectors_try = skip_w2v_model.wv
MAX_NB_WORDS_try = len(word_vectors_try.vocab)
print(MAX_NB_WORDS_try)

34286


In [23]:
WV_DIM = 256
nb_words = min(6000, len(word_vectors_try.vocab))
# we initialize the matrix with random numbers
wv_matrix = (np.random.rand(nb_words, WV_DIM) - 0.5) / 5.0
for word, i in tokenizer.word_index.items():
    if i >= MAX_NB_WORDS_try:
        continue
    try:
        embedding_vector = word_vectors[word]
        # words not found in embedding index will be all-zeros.
        wv_matrix[i] = embedding_vector
    except:
        pass 

In [21]:
train = None

In [23]:
input_tensor_text = Input(shape=(MAX_TEXT_LENGTH, ), name="text")
input_tensor_header = Input(shape=(MAX_HEADER_LENGTH, ), name="header")
input_tensor_numeric = Input(shape = (len(X_train_data.columns), ), name="numeric")

embedding_layer_text=Embedding(input_dim=NUM_WORDS,
    output_dim=256,
    weights=[wv_matrix], #weights to start with, and not nouch during training
    input_length=MAX_TEXT_LENGTH,
    trainable=False
    )(input_tensor_text)

embedding_layer_header=Embedding(input_dim=NUM_WORDS,
    output_dim=EMBEDDING_DIM,
    #embeddings_initializer=Constant(w2v_weights), #weights to start with, and not nouch during training
    input_length=MAX_HEADER_LENGTH,
    trainable=True
    )(input_tensor_header)    

GRU_text = GRU(256, return_sequences=True)(embedding_layer_text)
GRU_text2 = Dropout(0.2)(GRU_text)

GRU_text3 = GRU(128, return_sequences=True)(GRU_text2)
GRU_text4 = Dropout(0.1)(GRU_text3)
GRU_text5 = GlobalAveragePooling1D()(GRU_text4)


#Header
GRU_header = GRU(128, return_sequences=True)(embedding_layer_header)
GRU_header2 = Dropout(0.2)(GRU_header)
GRU_header3 = GlobalAveragePooling1D()(GRU_header2)

#dense_layer1 = Dense(10, activation='relu')(input_tensor_header)
#dense_layer2 = Dense(10, activation='relu')(dense_layer1)

dense_numeric = Dense(128, activation = 'relu')(input_tensor_numeric)
dense_numeric2 = Dropout(0.2)(dense_numeric)
#dense_numeric3 = GlobalAveragePooling1D()(dense_numeric2)

concat_layer = Concatenate()([GRU_text5, GRU_header3, dense_numeric2])

dense_layer3 = Dense(10, activation='relu')(concat_layer)

output_tensor = Dense(1, activation='relu')(dense_layer3) # result

model_last = Model(inputs = [input_tensor_text, input_tensor_header, input_tensor_numeric], outputs = output_tensor)
model_last.compile(loss = 'mean_squared_error',
                          optimizer = RMSprop(clipvalue=1, clipnorm=1),
                          metrics=['mse'])
model_last.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text (InputLayer)               (None, 463)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 463, 256)     1536000     text[0][0]                       
__________________________________________________________________________________________________
gru_1 (GRU)                     (None, 463, 256)     393984      embedding_1[0][0]                
__________________________________________________________________________________________________
header (InputLayer)             (None, 8)            0                                            
____________________________________________________________________________________________

In [24]:
model_last_history = model_last.fit(
    x=[X_train_text, X_train_header, X_train_data],
    y=np.reshape(y_train, (-1,1)),
    batch_size=BATCH_SIZE,
    epochs=10,
    validation_data=([X_test_text, X_test_header, X_test_data], np.reshape(y_test, (-1,1))))

Train on 46464 samples, validate on 19914 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [25]:
model_last_history.model.save("model_last_history.h5")

The model's validation loss outperforms all other models I did during this assignment. Sadly I later discovered a small error during my tokenization process. When I created tokens for the author and publisher I used a different tokenizer for the train and validation split each time. This means that it is very possible that the neural network just ignored the author and publisher features or even worse it used it and learned something random. That's why I deciced to create a final model afterwards.

# Assignment with third model

In [103]:
def split_test():
    test_texts = test['texts']
    test_header = test['header']
    test_data = test[[col for col in test.columns if col not in ['header', 'texts']]]
    
    test_texts = pad_sequences(test.sequences_text, maxlen=MAX_TEXT_LENGTH, padding='post')
    test_header = pad_sequences(test.sequences_header, maxlen=MAX_HEADER_LENGTH, padding='post')
    
    return test_texts, test_header, test_data

In [104]:
test_texts, test_header, test_data = split_test()

In [106]:
test_data = test_data.drop(['sequences_header', 'sequences_text'], axis='columns')

In [107]:
test_data.head(1)

Unnamed: 0,responses,month,online_since,read_time,publisher_tok,author_tok
0,627,7,1101,75.069609,6,6


In [108]:
predictions_last = model_last_history.model.predict([test_texts, test_header, test_data], verbose=1)



Indexing the data 

In [53]:
def getTestIndex():
    test_csv = pd.read_csv('C:/ADAMS_tutorials/ADAMS_Assignment/Test.csv')
    test_index = test_csv['index']
    return test_index
    
test_index = getTestIndex()

In [110]:
# reshape to 1-dim
predictions_last = predictions_last.reshape(len(predictions_last), )

In [111]:
df = pd.DataFrame({'Claps': predictions_last})
df = df.set_index(test_index)

print('DataFrame:\n', df)

# default CSV
predictions_last = df.to_csv()
print('\nCSV String:\n', predictions_last)

DataFrame:
               Claps
index              
0      64649.152344
1      19159.210938
2      18453.505859
5       9612.684570
7       2968.046631
...             ...
598    20226.867188
599     2682.586914
600     2938.522705
601    13518.168945
602     3597.792480

[514 rows x 1 columns]

CSV String:
 index,Claps
0,64649.152
1,19159.21
2,18453.506
5,9612.685
7,2968.0466
8,4767.1284
9,7630.1343
11,3590.9587
12,7627.946
13,537.08777
14,692.2106
15,1909.3906
16,36660.652
17,35518.11
18,0.0
19,2666.3179
20,384.9386
21,1503.4114
22,1177.7771
23,3779.5137
25,594.69775
26,560.9963
27,1875.7062
28,11782.3955
29,11596.874
30,7927.9653
31,12809.884
32,5131.9395
33,8102.9907
34,5882.6294
35,3532.1912
36,3758.157
37,1879.908
39,2602.9285
40,2948.1077
41,533.8366
42,1520.439
43,2908.3074
44,2696.9358
46,1150.7765
47,1637.9927
48,3879.281
49,1404.4678
50,0.0
51,0.0
52,2120.528
53,794.7821
54,1298.2517
55,2808.5754
56,6448.7827
57,0.0
58,3135.6648
61,0.0
63,1147.8313
64,0.0
66,1244.0669
67,0.0

In [112]:
with open('submission_tesl.csv', 'w', newline='') as csv_file:
    df.to_csv(path_or_buf=csv_file)

# Final Model

In [30]:
input_tensor_text = Input(shape=(MAX_TEXT_LENGTH, ), name="text")
input_tensor_header = Input(shape=(MAX_HEADER_LENGTH, ), name="header")
input_tensor_numeric = Input(shape = (len(X_train_data.columns), ), name="numeric")

embedding_layer_text=Embedding(input_dim=NUM_WORDS,
    output_dim=256,
    weights=[wv_matrix], #weights to start with, and not nouch during training
    input_length=MAX_TEXT_LENGTH,
    trainable=False
    )(input_tensor_text)

embedding_layer_header=Embedding(input_dim=NUM_WORDS,
    output_dim=EMBEDDING_DIM,
    #embeddings_initializer=Constant(w2v_weights), #weights to start with, and not nouch during training
    input_length=MAX_HEADER_LENGTH,
    trainable=True
    )(input_tensor_header)    

GRU_text = GRU(256, return_sequences=True)(embedding_layer_text)
GRU_text2 = Dropout(0.2)(GRU_text)

GRU_text3 = GRU(128, return_sequences=True)(GRU_text2)
GRU_text4 = Dropout(0.1)(GRU_text3)
GRU_text5 = GlobalAveragePooling1D()(GRU_text4)


#Header
GRU_header = GRU(128, return_sequences=True)(embedding_layer_header)
GRU_header2 = Dropout(0.2)(GRU_header)
GRU_header3 = GlobalAveragePooling1D()(GRU_header2)

dense_numeric = Dense(128, activation = 'relu')(input_tensor_numeric)
dense_numeric2 = Dropout(0.2)(dense_numeric)
#dense_numeric3 = GlobalAveragePooling1D()(dense_numeric2)

concat_layer = Concatenate()([GRU_text5, GRU_header3, dense_numeric2])

dense_layer3 = Dense(10, activation='relu')(concat_layer)

output_tensor = Dense(1, activation='relu')(dense_layer3) # result

model_final = Model(inputs = [input_tensor_text, input_tensor_header, input_tensor_numeric], outputs = output_tensor)
model_final.compile(loss = 'mean_squared_error',
                          optimizer = RMSprop(clipvalue=1, clipnorm=1),
                          metrics=['mse'])
model_final.summary()

Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text (InputLayer)               (None, 463)          0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 463, 256)     1536000     text[0][0]                       
__________________________________________________________________________________________________
gru_4 (GRU)                     (None, 463, 256)     393984      embedding_3[0][0]                
__________________________________________________________________________________________________
header (InputLayer)             (None, 8)            0                                            
____________________________________________________________________________________________

In [33]:
from time import time

In [34]:
t = time()
model_final_history = model_final.fit(
    x=[X_train_text, X_train_header, X_train_data],
    y=np.reshape(y_train, (-1,1)),
    batch_size=BATCH_SIZE,
    epochs=10,
    validation_data=([X_test_text, X_test_header, X_test_data], np.reshape(y_test, (-1,1))))
print('Time to fit model: {} mins'.format(round((time() - t) / 60, 2)))

Train on 49783 samples, validate on 16595 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Time to fit model: 1070.01 mins


In [36]:
model_final_history.model.save("model_last_history.h5")

# Assignment final

In [38]:
with open('C:\\ADAMS_tutorials\\ADAMS_Assignment\\data_pre_clean_test.pkl','rb') as path_name:
    test = pickle.load(path_name) #load


In [44]:
test.head()

Unnamed: 0,responses,header,texts,author,publisher,month,online_since,read_time
0,627,Why Everyone Missed the Most Mind-Blowing Feat...,There’s one incredible feature of cryptocurren...,Daniel Jeffries,Hacker Noon,7,1101,75.069609
1,156,NEO versus Ethereum: Why NEO might be 2018’s s...,OnChainNEO’s founders Da HongFei and Erik Zhan...,Noam Levenson,Hacker Noon,12,973,72.072011
2,176,The Cryptocurrency Trading Bible,So you want to trade cryptocurrency? You’ve se...,Daniel Jeffries,Hacker Noon,7,1111,3.581081
3,72,Stablecoins: designing a price-stable cryptocu...,A useful currency should be a medium of exchan...,Haseeb Qureshi,Hacker Noon,2,898,63.087171
4,19,Chaos vs. Order — The Cryptocurrency Dilemma,Crypto crypto crypto crypto. It’s here. It’s h...,William Belk,Hacker Noon,1,920,0.317746


In [48]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514 entries, 0 to 513
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   responses         514 non-null    int32  
 1   header            514 non-null    object 
 2   texts             514 non-null    object 
 3   author            514 non-null    object 
 4   publisher         514 non-null    object 
 5   month             514 non-null    int64  
 6   online_since      514 non-null    int64  
 7   read_time         514 non-null    float64
 8   sequences_text    514 non-null    object 
 9   sequences_header  514 non-null    object 
 10  author_tok        514 non-null    int64  
 11  publisher_tok     514 non-null    int64  
dtypes: float64(1), int32(1), int64(4), object(6)
memory usage: 46.3+ KB


In [49]:
def split_test():
    test_texts = test['texts']
    test_header = test['header']
    test_data = test[[col for col in test.columns if col in DATA_COLUMN_NAMES]]
    
    test_texts = pad_sequences(test.sequences_text, maxlen=MAX_TEXT_LENGTH, padding='post')
    test_header = pad_sequences(test.sequences_header, maxlen=MAX_HEADER_LENGTH, padding='post')
    
    return test_texts, test_header, test_data

In [50]:
test_texts, test_header, test_data = split_test()

In [51]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514 entries, 0 to 513
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   responses      514 non-null    int32  
 1   month          514 non-null    int64  
 2   online_since   514 non-null    int64  
 3   read_time      514 non-null    float64
 4   author_tok     514 non-null    int64  
 5   publisher_tok  514 non-null    int64  
dtypes: float64(1), int32(1), int64(4)
memory usage: 22.2 KB


In [52]:
predictions_final = model_final_history.model.predict([test_texts, test_header, test_data], verbose=1)



In [55]:
# reshape to 1-dim
predictions_final = predictions_final.reshape(len(predictions_final), )

In [56]:
df = pd.DataFrame({'Claps': predictions_final})
df = df.set_index(test_index)

print('DataFrame:\n', df)

# default CSV
predictions_final_csv = df.to_csv()
print('\nCSV String:\n', predictions_final_csv)

DataFrame:
               Claps
index              
0      53338.164062
1      19299.582031
2      20095.158203
5       9737.634766
7       2045.726685
...             ...
598    21398.576172
599     1933.810181
600     2635.977295
601    13349.852539
602     3213.890381

[514 rows x 1 columns]

CSV String:
 index,Claps
0,53338.164
1,19299.582
2,20095.158
5,9737.635
7,2045.7267
8,3773.4346
9,8184.631
11,3650.0989
12,7231.986
13,483.55698
14,0.0
15,1012.1825
16,32414.014
17,32683.674
18,0.0
19,1474.4896
20,0.0
21,1073.9667
22,84.629715
23,3398.239
25,71.101845
26,0.0
27,1236.9233
28,13359.181
29,12483.3
30,8396.718
31,13829.962
32,4876.7725
33,7694.9946
34,4379.144
35,1793.4791
36,2869.9705
37,1136.2122
39,2065.2336
40,2113.179
41,0.0
42,689.72437
43,1924.5586
44,1915.6
46,190.91785
47,748.8784
48,3061.946
49,799.04565
50,0.0
51,0.0
52,1684.5853
53,0.0
54,222.3475
55,859.3357
56,6249.48
57,0.0
58,2412.4739
61,0.0
63,0.0
64,0.0
66,0.0
67,0.0
68,0.0
70,534.0254
71,0.0
72,0.0
73,0.0
74,0.0

In [57]:
with open('submission_tesl_final.csv', 'w', newline='') as csv_file:
    df.to_csv(path_or_buf=csv_file)

# Hyperparameter tuning
I tried to use hyperparameter tuning but sadly did not have enough time to finish the tuning and produce meaningful results. The coding procedures are present in extra [jupyter file](parameter_tuning.ipynb). As a template to develope the parameter tuning model I used the blog post article of "Lianne & Justin" from Just into Data: "Hyperparameter Tuning with Python: Keras Step-by-Step Guide". I used a desktop PC for this task and tried to utilized the available graphic card there. Sadly I wasn't able to archive this task and was forced to cancel the tuning process.


# Conclusion
During the project I developed several GRU models with the aim to predict the number of claps in the medium’s article. Different approaches and experiments with different parameters, words numbers, the length of the texts and different pre-trained word embeddings models, like CBOW and Skip-gram has shown that there is always room for improvement. The experiments have shown that adding the depth to the model is an effective tactic to model’s performance improvement. The other important conclusion is that parameters are playing a crucial role. I suppose that if the parameter tuning would be finished the resulting model could perform significantly better than any of my models.
Because all models are based on the input data the importance of explanatory data analysis and then the data preparations remain a high priority.
The next steps could be the further improvement of the model’s architecture, parameter tuning and using more computational power for example by using an online service (AWS) or a powerful setup consisting of Nvidia graphic cards to use the CUDA environment.


# Literature

Casey Botticello (2019): “How Do Claps Work on Medium?”, https://medium.com/blogging-guide/how-do-claps-work-on-medium-b2897784ce6b (last visited 31.08.2020)

Dipanjan Sarkar (2019): Text Analytics with Python, Apress Media LLC

Gabriel L. Schlomer, Sheri Bauman, and Noel A. Card (2010): Best Practices for Missing Data Management in Counseling Psychology, Journal of Counseling Psychology © 2010 American Psychological Association, Vol. 57, No. 1, pp. 1–10 

Guillaume Lample, Miguel Ballesteros, Sandeep Sub-ramanian, Kazuya Kawakami, Chris Dyer (2016): Neural  architectures  for  named  entity  recognition. InHLT-NAACL

Ivan Vasilev, Daniel Slater, Gianmario Spacagna, Peter Roelants, Valentino Zocca (2019): Python Deep Learning, Second Edition, Exploring deep learning techniques and neural network architectures with PyTorch, Keras, and TensorFlow, Published by Packt Publishing Ltd., Birmingham UK

Jason Brownlee (2019): How to Choose Loss Functions When Training Deep Learning Neural Networks, https://machinelearningmastery.com/how-to-choose-loss-functions-when-training-deep-learning-neural-networks/ (last visited 28.08.2020)

Jeff Schneider (1997): Cross Validation, https://www.cs.cmu.edu/~schneide/tut5/node42.html (last visited 31.08.2020)

Kavita Ganesan (2020): Word2Vec: A Comparison Between CBOW, SkipGram & SkipGramSI, https://kavita-ganesan.com/comparison-between-cbow-skipgram-subword/ (last visited 24.08.2019)

Lalit Vyas (2020): Word2Vec — CBOW & Skip-gram : Algorithmic Optimizations, https://medium.com/analytics-vidhya/word2vec-cbow-skip-gram-algorithmic-optimizations-921d6f62d739 (last visited 24.08.2019)

Lianne & Justin (2020): Hyperparameter Tuning with Python: Keras Step-by-Step Guide, https://www.justintodata.com/hyperparameter-tuning-with-python-keras-guide/ (last visited 31.08.2020)

Medium https://medium.com/ (last visited 31.08.2020)

Pierre Megret (2018): Gensim Word2Vec Tutorial, https://www.kaggle.com/pierremegret/gensim-word2vec-tutorial (last visited 24.08.2019) 

Rohan Hirekerur (2020): A Comprehensive Guide To Loss Functions — Part 1 : Regression, https://medium.com/analytics-vidhya/a-comprehensive-guide-to-loss-functions-part-1-regression-ff8b847675d6 (last visited 31.08.2020) 

Tobias Sterbak (2018): Guide to word vectors with gensim and keras, https://www.depends-on-the-definition.com/guide-to-word-vectors-with-gensim-and-keras/ (last visited 31.08.2020)

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean (2013): Efficient Estimation of Word Representations in Vector Space, https://arxiv.org/pdf/1301.3781.pdf (last visited 24.08.2019)

Usman Malik (2020): Python for NLP: Creating Multi-Data-Type Classification Models with Keras, https://stackabuse.com/python-for-nlp-creating-multi-data-type-classification-models-with-keras/ (last visited 31.08.2020)

Vitaly Bushaev (2018): Understanding RMSprop — faster neural network learning, https://towardsdatascience.com/understanding-rmsprop-faster-neural-network-learning-62e116fcf29a (last visited 20.08.2020) 

Xuezhe Ma and Eduard Hovy (2016): End-to-end se-quence labeling via bi-directional lstm-cnns-crf.  InProceedings of the 54th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1:Long  Papers), Association  for  Computational  Linguistics, Berlin, Germany, pp. 1064–1074,  http://www.aclweb.org/anthology/P16-1101 (last visited 24.08.2019)
