<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center>Assignment #6
### <center> Beating baselines in "How good is your Medium article?"
    
<img src='../../img/medium_claps.jpg' width=40% />


[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "A6 baseline" (~1.45 Public LB score). Do not forget about our shared ["primitive" baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline) - you'll find something valuable there.

**Your task:**
 1. "Freeride". Come up with good features to beat the "A6 baseline" (for now, only the public LB is considered)
 2. You need to name your [team](https://www.kaggle.com/c/how-good-is-your-medium-article/team) (of 1 person) in full accordance with the [course rating](https://drive.google.com/open?id=19AGEhUQUol6_kNLKSzBsjcGUU3qWy3BNUg8x8IFkO3Q). You can think of it as part of the assignment. 16 credits for beating the mentioned baseline and naming the team correctly.

In [70]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge

The following code strips all HTML tags from an article's content.

In [3]:
from html.parser import HTMLParser

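class MLStripper(HTMLParser):
    def __init__(self):
        # Initialize the parser state; convert_charrefs=True decodes
        # character references such as &amp; before handle_data is called
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        # Collect every text chunk found between tags
        self.fed.append(d)
    def get_data(self):
        return ' '.join(self.fed)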

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

Supplementary function to read a JSON line without crashing on escape characters.

In [4]:
def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except json.JSONDecodeError as e:
        # Index of the offending character, as reported by the JSON parser:
        idx_to_replace = e.pos
        # Replace the offending character with a space and retry:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)
        return read_json_line(line=new_line)
    return result
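
The index of the offending character is available directly as the `pos` attribute of `json.JSONDecodeError`, and it is the same number that appears at the end of the error message. The small check below illustrates this on a made-up line (`bad_line` is an assumption for illustration, not taken from the competition data):

```python
import json

# A made-up line with an invalid JSON escape sequence (\q)
bad_line = r'{"title": "bad \q escape"}'

pos_from_attribute = None
pos_from_message = None
try:
    json.loads(bad_line)
except json.JSONDecodeError as e:
    # Structured access to the character index
    pos_from_attribute = e.pos
    # The same index parsed from the trailing "(char N)" of the message
    pos_from_message = int(str(e).split(' ')[-1].replace(')', ''))

print(pos_from_attribute, pos_from_message)
```

Reading the attribute avoids depending on the exact wording of the error message.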

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [5]:
def extract_features_and_write(path_to_data, inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = {}
    for feat in features:
        feature_files[feat] = open(os.path.join(path_to_data, '{}_{}.txt'.format(prefix, feat)), 'w', encoding='utf-8')
    
    with open(os.path.join(path_to_data, inp_filename), encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            content = strip_tags(json_data['content'].replace('\n', ' ').replace('\r', ' '))
            feature_files['content'].write(content + "\n")
            published = json_data['published']['$date']
            feature_files['published'].write(published + "\n")
            title = json_data['title'].replace('\n', ' ').replace('\r', ' ')
            feature_files['title'].write(title + "\n")
            author = json_data['author']['url']
            feature_files['author'].write(author + "\n")

    for feat in features:
        feature_files[feat].close()
        
    for feat in features:
        with open(os.path.join(path_to_data, '{}_{}.txt'.format(prefix, feat)), 'r', encoding='utf-8') as f:
            print(f"{feat}: {sum(1 for line in f)} lines")

In [6]:
PATH_TO_DATA = 'data' # modify this if you need to

In [7]:
extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)


content: 62313 lines
published: 62313 lines
title: 62313 lines
author: 62313 lines


In [8]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)


content: 34645 lines
published: 34645 lines
title: 34645 lines
author: 34645 lines


**Add the following groups of features:**

- Tf-Idf with article content (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Tf-Idf with article titles (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Time features: publication hour, whether it's morning, day or night, and whether it's a weekend
- Bag of authors (i.e. one-hot-encoded author names)

In [112]:
def tfidf_ft(file):
    with open(file, 'r', encoding='utf-8') as f:
        tfv = TfidfVectorizer(ngram_range=(1, 2), max_features=500000)
        return (tfv, tfv.fit_transform(tqdm_notebook(f.readlines())))
    
def tfidf_t(file, tfv):
    with open(file, 'r', encoding='utf-8') as f:
        return tfv.transform(tqdm_notebook(f.readlines()))

In [94]:
%%time

(title_tfv, X_train_title_sparse) = tfidf_ft('data/train_title.txt')
X_test_title_sparse = tfidf_t('data/test_title.txt', title_tfv)


In [95]:
%%time

(content_tfv, X_train_content_sparse) = tfidf_ft('data/train_content.txt')
X_test_content_sparse = tfidf_t('data/test_content.txt', content_tfv)


In [71]:
def content_to_features(line):
    return [len(line)]

def content_features():
    with open('data/train_content.txt', 'r', encoding="utf-8") as f_train:
        with open('data/test_content.txt', 'r', encoding="utf-8") as f_test:
            test_lines = [content_to_features(line) for line in f_test.readlines()]
            train_lines = [content_to_features(line) for line in f_train.readlines()]
            fit_lines = train_lines + test_lines
            scaler = MinMaxScaler()
            scaler.fit(fit_lines)
            return (scaler.transform(train_lines), scaler.transform(test_lines))

(X_train_content_more_sparse, X_test_content_more_sparse) = content_features()
print(X_test_content_more_sparse)

[[0.02044119]
 [0.0108384 ]
 [0.01179192]
 ...
 [0.03839136]
 [0.0152414 ]
 [0.01655784]]


In [100]:
%%time
import time

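def timestamp_to_features(s):
    # Parse only the first 13 characters: the date and the hour
    ts = time.strptime(s[:13],"%Y-%m-%dT%H")
    hour = ts.tm_hour  # computed but not included in the returned features
    morning = 1 if hour < 12 else 0
    day = 1 if 6 <= hour < 18 else 0
    weekend = 1 if ts.tm_wday > 4 else 0  # tm_wday: Monday is 0, so 5 and 6 are the weekend
    return [morning, day, weekend]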

def time_onehot(file):
    with open(file, 'r', encoding="utf-8") as f:
        features = [timestamp_to_features(line) for line in f.readlines()]
        ohe = OneHotEncoder()
        return ohe.fit_transform(features)
            
X_train_time_features_sparse = time_onehot('data/train_published.txt')
X_test_time_features_sparse = time_onehot('data/test_published.txt')

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


Wall time: 3.41 s
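
The feature list above also mentions the publication hour and a night flag, which the function in this cell does not return. A minimal sketch of a variant that returns all of them is given below. The name `time_features` and the hour boundaries (morning 6 to 12, day 12 to 18, night otherwise) are assumptions chosen for illustration, and the timestamp format is assumed to start with `YYYY-MM-DDTHH`, as in the parsing code above.

```python
from datetime import datetime

def time_features(s):
    """Return [hour, morning, day, night, weekend] for a timestamp string."""
    # Only the date and the hour are needed
    ts = datetime.strptime(s[:13], "%Y-%m-%dT%H")
    hour = ts.hour
    morning = int(6 <= hour < 12)          # assumed boundary
    day = int(12 <= hour < 18)             # assumed boundary
    night = int(hour < 6 or hour >= 18)    # everything else
    weekend = int(ts.weekday() >= 5)       # Monday is 0, Saturday is 5
    return [hour, morning, day, night, weekend]

print(time_features("2012-08-13T22:54:53.510Z"))
```

If the hour is one-hot encoded, fitting the encoder on the training set and calling `transform` on the test set keeps the columns of both matrices aligned.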


In [107]:
%%time

def onehot():
    with open('data/train_author.txt', 'r', encoding="utf-8") as f_train:
        with open('data/test_author.txt', 'r', encoding="utf-8") as f_test:
            test_lines = f_test.readlines()
            train_lines = f_train.readlines()
            #fit_lines = train_lines + test_lines
            fit_lines = list(set(train_lines) & set(test_lines))
            ohe = OneHotEncoder(handle_unknown='ignore')
            ohe.fit(np.reshape(fit_lines, (len(fit_lines), 1)))
            X_train = ohe.transform(np.reshape(train_lines, (len(train_lines), 1)))
            X_test = ohe.transform(np.reshape(test_lines, (len(test_lines), 1)))
            return (X_train, X_test)

(X_train_author_sparse, X_test_author_sparse) = onehot()


Wall time: 443 ms


**Join all sparse matrices.**

In [108]:
#X_train_sparse = hstack([X_train_content_sparse, X_train_title_sparse,
#                         X_train_author_sparse, X_train_time_features_sparse]).tocsr()

X_train_sparse = hstack([X_train_content_sparse, X_train_title_sparse,
                         X_train_author_sparse, X_train_time_features_sparse,
                         X_train_content_more_sparse]).tocsr()

In [109]:
#X_test_sparse = hstack([X_test_content_sparse, X_test_title_sparse,
#                        X_test_author_sparse, X_test_time_features_sparse]).tocsr()

X_test_sparse = hstack([X_test_content_sparse, X_test_title_sparse,
                        X_test_author_sparse, X_test_time_features_sparse,
                        X_test_content_more_sparse]).tocsr()

**Read train target and split data for validation.**

In [14]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), 
                           index_col='id')
y_train = train_target['log_recommends'].values

In [110]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [111]:
%%time
ridge = Ridge(random_state=17)
ridge.fit(X_train_part_sparse, y_train_part);
ridge_valid_pred = ridge.predict(X_valid_sparse)
valid_mae = mean_absolute_error(y_valid, ridge_valid_pred)
print(valid_mae)

1.0715888734810064
Wall time: 1min 14s


In [113]:
valid_mae_zero = mean_absolute_error(y_valid, np.zeros_like(ridge_valid_pred))
print(valid_mae_zero)

2.9556227666631


In [134]:
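# Validation MAE of the Ridge predictions: valid_mae (~1.071)
# Validation MAE of an all-zero prediction: valid_mae_zero (~2.955)
# Experiment: shift the predictions by the difference, and rescale them by the ratio.

print(ridge_valid_pred + valid_mae_zero - valid_mae)
valid_mae_full = mean_absolute_error(y_valid, ridge_valid_pred / valid_mae_zero * valid_mae)
print(valid_mae_full)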

[5.20857116 5.55782191 4.00702619 ... 4.43721403 4.37806634 5.25880164]
1.9035330035794324


**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [80]:
%%time
ridge = Ridge(random_state=17)
ridge.fit(X_train_sparse, y_train);
ridge_test_pred = ridge.predict(X_test_sparse)

Wall time: 1min 57s


In [49]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [81]:
write_submission_file(ridge_test_pred, 'submission/a6_submission_1.05783.csv')

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeroes. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [97]:
write_submission_file(np.zeros_like(ridge_test_pred), 'submission/medium_all_zeros_submission.csv')

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [None]:
ridge_test_pred_modif = ridge_test_pred.copy() # Your code here
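
One way to read the all-zero submission: the target is a `log1p` of a count, so it is non-negative, and the MAE of an all-zero prediction is then simply the mean target on the scored part of the test set. A minimal sketch of using that number is shown below, on toy data. The helper name `shift_to_mean` is hypothetical, and shifting by a constant is only one possible adjustment, not necessarily the one that gives the best MAE.

```python
import numpy as np

def shift_to_mean(pred, target_mean):
    """Shift predictions by a constant so that their mean equals target_mean."""
    pred = np.asarray(pred, dtype=float)
    return pred + (target_mean - pred.mean())

# Toy check of the identity the hack relies on: for non-negative targets,
# the MAE of an all-zero prediction equals the mean target.
y = np.array([0.0, 1.5, 3.0, 4.5])
all_zeros_mae = np.mean(np.abs(y - np.zeros_like(y)))
print(all_zeros_mae, y.mean())

# Toy predictions whose mean is too high get shifted down
pred = np.array([1.0, 2.0, 3.0, 4.0])
shifted = shift_to_mean(pred, all_zeros_mae)
print(shifted)
```

In the notebook, the second argument would be the leaderboard score of the all-zero submission, which has to be read from Kaggle and is not given here.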

In [None]:
write_submission_file(ridge_test_pred_modif, 'submission/assignment6_medium_submission_with_hack.csv')

That's it for the assignment. Many more credits will be given to the winners of this competition; check the [course roadmap](https://mlcourse.ai/roadmap). Do not spoil the assignment and the competition: don't share high-performing kernels (with MAE < 1.5).

Some ideas for improvement:

- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
- You don't have to discard the HTML: some features can be extracted from it
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, the Tf-Idf vocabulary is capped with `max_features` and Ridge uses its default regularization strength (`alpha=1`); both can be changed
- SGD and Vowpal Wabbit will learn much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- In our course, we don't cover neural nets, and you are not obliged to use GRUs/LSTMs/whatever in this competition.
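
As an illustration of the blending idea from the list above, here is a minimal sketch that picks a weight for averaging two sets of predictions by validation MAE. The helper names `mae` and `best_blend_weight` and the toy arrays are assumptions for illustration; in practice the two inputs would be validation predictions of two different models.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error of two equally sized arrays."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def best_blend_weight(y_valid, pred_a, pred_b, weights=np.linspace(0, 1, 11)):
    """Try w * pred_a + (1 - w) * pred_b for each w and keep the best MAE."""
    scores = [mae(y_valid, w * pred_a + (1 - w) * pred_b) for w in weights]
    best = int(np.argmin(scores))
    return float(weights[best]), scores[best]

# Toy data: the first "model" is perfect, the second predicts all zeros
y_valid_toy = np.array([1.0, 2.0, 3.0])
weight, score = best_blend_weight(y_valid_toy, y_valid_toy.copy(), np.zeros(3))
print(weight, score)
```

The weight is selected on held-out data only, so the chosen blend should be checked against the LB score before trusting it.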

Good luck!

<img src='../../img/kaggle_shakeup.png' width=50%>