<img src="https://habrastorage.org/webt/ia/m9/zk/iam9zkyzqebnf_okxipihkgjwnw.jpeg" />
    
**<center>[mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course** </center><br>
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). [mlcourse.ai](https://mlcourse.ai) is powered by [OpenDataScience (ods.ai)](https://ods.ai/) © 2017—2022

# <center>Assignment #6. Task</center><a class="tocSkip">
## <center> Beating benchmarks in "How good is your Medium article?"</center><a class="tocSkip">
    
[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "Assignment 6 baseline" (~1.45 Public LB score). You can refer to [this simple Ridge baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline?rvi=1).

-----

In [51]:
import json
import os
from pathlib import Path
from bs4 import BeautifulSoup

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from tqdm.notebook import tqdm

The following code strips all HTML tags from an article's content.

In [2]:
from html.parser import HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        # Let HTMLParser initialise its own state (this also calls reset()).
        super().__init__(convert_charrefs=True)
        # Text fragments collected between tags.
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return "".join(self.fed)


def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()

A helper function that reads a JSON line without crashing on escape characters.

In [5]:
def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except json.JSONDecodeError as e:
        # Index of the offending character, as reported by the decoder:
        idx_to_replace = e.pos
        # Replace the offending character with a space and retry:
        new_line = list(line)
        new_line[idx_to_replace] = " "
        new_line = "".join(new_line)
        return read_json_line(line=new_line)
    return result

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [7]:
def extract_features_and_write(path_to_data, inp_filename, is_train=True):

    features = ["content", "published", "title", "author"]
    prefix = "train" if is_train else "test"
    feature_files = [
        open(
            os.path.join(path_to_data, "{}_{}.txt".format(prefix, feat)),
            "w",
            encoding="utf-8",
        )
        for feat in features
    ]

    with open(
        os.path.join(path_to_data, inp_filename), encoding="utf-8"
    ) as inp_json_file:

        for line in tqdm(inp_json_file):
            json_data = read_json_line(line)


# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
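As a rough illustration of the extraction step, the sketch below pulls the four features out of a single parsed record. The record layout is an assumption based on the cells further down (`published` holding a `$date` field, the author name stored under `meta_tags`), and `extract_fields` and `toy_line` are hypothetical names, not part of the assignment template.

```python
import json


def extract_fields(json_data):
    """Return the four features of one parsed record (hypothetical helper)."""
    return {
        # Collapse whitespace so every article fits on a single output line.
        "content": " ".join(json_data["content"].split()),
        "published": json_data["published"]["$date"],
        "title": " ".join(json_data["title"].split()),
        # Assumed location of the author name, as in the cells below.
        "author": json_data["meta_tags"]["author"],
    }


# A made-up record that mimics the assumed layout.
toy_line = json.dumps(
    {
        "content": "<p>Hello\nworld</p>",
        "published": {"$date": "2017-06-30T22:52:19.817Z"},
        "title": "A toy article",
        "meta_tags": {"author": "Jane Doe"},
    }
)

fields = extract_fields(json.loads(toy_line))
print(fields)
```

In the real function each value would be written to the matching file in `feature_files`, one line per article.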

Download the [competition data](https://www.kaggle.com/c/how-good-is-your-medium-article/data) and place it where it's convenient for you. You can modify the path to data below.

In [19]:
PATH_TO_DATA = Path("../../_static/data/assignment6/")  # modify this if you need to

In [71]:
X_train = pd.read_json(PATH_TO_DATA / 'train.json', lines=True)
X_train = X_train[X_train.columns[3:]]
X_train['published'] = pd.to_datetime(X_train['published'].apply(lambda x: x['$date']), format='%Y-%m-%dT%H:%M:%S.%fZ')

In [133]:
tmp = pd.DataFrame()
tmp['published'] = X_train['published']
tmp['title'] = X_train['title']
tmp['author'] = X_train['meta_tags'].apply(lambda x: x['author'])
tmp['domain'] = X_train['domain']
tmp['url'] = X_train['url']
# Note: extract_tags is defined in a later cell (In [131]), which was executed before this one.
tmp['tags'] = X_train['content'].apply(extract_tags)
tmp.sort_values(by='published')

| | published | title | author | domain | url | tags |
|---|---|---|---|---|---|---|
| 24068 | 1970-01-01 00:00:00.001 | Saving Your Marriage By Watching Steamy Sex Ed... | Susan Bratton | medium.com | http://personallifemedia.com/2017/01/saving-ma... | Lovemaking Sex SexPositions EarlyBird SexEdVideos |
| 59081 | 1970-01-01 00:00:00.001 | やってよかった中学受験 – Ryo Ooishi – Medium | Ryo Ooishi | medium.com | https://medium.com/@ooishi/%E3%82%84%E3%81%A3%... | |
| 54757 | 1970-01-18 03:21:32.400 | はてなブログに書いた今年の手帳のお話 – なぞちゅうの部屋 – Medium | なぞちゅう | medium.com | http://nazoblackrx.hatenablog.com/entry/2016/1... | 徒然日記 手帳 ブログ |
| 32182 | 1987-12-08 21:45:00.000 | Internet Corporation LLC to Acquire Early Clue... | Internet Corporation LLC | medium.com | https://medium.com/the-internet-corporation/de... | SocialMedia EarlyClues InternetCorporationLlc |
| 33152 | 2003-12-29 17:00:00.000 | g sowtwaretrading botMoneyMoneyMakeGetting To ... | Mackenzie Oldridge | medium.com | http://www.investopedia.com/articles/optioninv... | Finance Trading |
| ... | ... | ... | ... | ... | ... | ... |
| 16281 | 2017-06-30 22:52:19.817 | Melhores práticas de SEO para Redatores — Webw... | Bruna Ciafrei | medium.com | https://medium.com/galata/melhores-pr%C3%A1tic... | Webwriting Marketing SEO Google TutoriaisGalata |
| 4864 | 2017-06-30 22:57:00.044 | All men (what about women?) are created equal:... | kerry stranman | medium.com | https://medium.com/enso/all-men-what-about-wom... | GenderEquality EqualRights SharedMission Const... |
| 22525 | 2017-06-30 23:05:40.000 | ‘Artiz’lerden sanatçı duyarlılığı beklemek! – ... | Akif Umut Avaz | medium.com | http://www.tr724.com/artizlerden-sanatci-duyar... | Tr724Yorum Tr724 |
| 58493 | 2017-06-30 23:31:50.736 | Princípios gerais do processo civil – Anotaçõe... | niva | medium.com | https://medium.com/anota%C3%A7%C3%B5es-de-dire... | Dpc0213 TeoriaGeralDoProcesso Tgp ProcessoCivil |


In [153]:
tmp.title.iloc[20]

'Phenom, a 500 Startups Company – Hacker Noon'

In [131]:
def extract_tags(content):
    tags_str = []
    soup = BeautifulSoup(content, 'lxml')
    try:
        tag_block = soup.find('ul', class_='tags')
        tags = tag_block.find_all('a')
        for tag in tags:
            tags_str.append(tag.text.translate({ord(' '):None, ord('-'):None}))
        tags = ' '.join(tags_str)
    except Exception:
        tags = 'None'
    return tags

In [115]:
X_train.sort_values(by='published')['meta_tags'].iloc[0]['author']

'Susan Bratton'

In [105]:
X_train

| | url | domain | published | title | content | author | image_url | tags | link_tags | meta_tags |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://medium.com/policy/medium-terms-of-serv... | medium.com | 2012-08-13 22:54:53.510 | Medium Terms of Service – Medium Policy – Medium | `<div><header class="container u-maxWidth740"><...` | {'name': None, 'url': 'https://medium.com/@Med... | | [] | {'canonical': 'https://medium.com/policy/mediu... | {'viewport': 'width=device-width, initial-scal... |
| 1 | https://medium.com/policy/amendment-to-medium-... | medium.com | 2015-08-03 07:44:50.331 | Amendment to Medium Terms of Service Applicabl... | `<div><header class="container u-maxWidth740"><...` | {'name': None, 'url': 'https://medium.com/@Med... | | [] | {'canonical': 'https://medium.com/policy/amend... | {'viewport': 'width=device-width, initial-scal... |
| 2 | https://medium.com/@aelcenganda/%E9%96%A9%E6%9... | medium.com | 2017-02-05 13:08:17.410 | 走入山與海之間：閩東大刀會和兩岸走私 – Yun-Chen Chien（簡韻真） – Medium | `<div><header class="container u-maxWidth740"><...` | {'name': None, 'url': 'https://medium.com/@ael... | https://cdn-images-1.medium.com/max/1200/1*k9f... | [] | {'canonical': 'https://medium.com/@aelcenganda... | {'viewport': 'width=device-width, initial-scal... |
| 3 | https://medium.com/what-comes-to-mind/how-fast... | medium.com | 2017-05-06 08:16:30.776 | How fast can a camera get? – What comes to min... | `<div><header class="container u-maxWidth740"><...` | {'name': None, 'url': 'https://medium.com/@vai... | https://cdn-images-1.medium.com/max/1200/1*6UZ... | [] | {'canonical': 'https://medium.com/what-comes-t... | {'viewport': 'width=device-width, initial-scal... |
| 4 | https://medium.com/what-comes-to-mind/a-game-f... | medium.com | 2017-06-04 14:46:25.772 | A game for the lonely fox – What comes to mind... | `<div><header class="container u-maxWidth740"><...` | {'name': None, 'url': 'https://medium.com/@vai... | https://cdn-images-1.medium.com/max/1200/1*liU... | [] | {'canonical': 'https://medium.com/what-comes-t... | {'viewport': 'width=device-width, initial-scal... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 62308 | https://medium.com/@Politix_news/whenialmostdi... | medium.com | 2016-01-28 01:03:08.798 | WhenIAlmostDied – Heather Nann – Medium | `<div><header class="container u-maxWidth740"><...` | {'name': None, 'url': 'https://medium.com/@Pol... | https://cdn-images-1.medium.com/max/1200/1*bGg... | [] | {'canonical': 'https://medium.com/@Politix_new... | {'viewport': 'width=device-width, initial-scal... |
| 62309 | https://medium.com/@Politix_news/twitter-troll... | medium.com | 2016-01-14 13:28:30.277 | Twitter Troll Trigger Alert – Heather Nann – M... | `<div><header class="container u-maxWidth740"><...` | {'name': None, 'url': 'https://medium.com/@Pol... | https://cdn-images-1.medium.com/max/1200/1*90B... | [] | {'canonical': 'https://medium.com/@Politix_new... | {'viewport': 'width=device-width, initial-scal... |
| 62310 | https://medium.com/@Politix_news/space-and-can... | medium.com | 2016-03-06 06:51:45.307 | Space and candles – Heather Nann – Medium | `<div><header class="container u-maxWidth740"><...` | {'name': None, 'url': 'https://medium.com/@Pol... | https://cdn-images-1.medium.com/focal/1200/632... | [] | {'canonical': 'https://medium.com/@Politix_new... | {'viewport': 'width=device-width, initial-scal... |
| 62311 | https://medium.com/swip/laser-focus-your-team-... | medium.com | 2017-01-15 17:45:22.836 | Laser focus your team using Work-In-Progress l... | `<div><header class="container u-maxWidth740"><...` | {'name': None, 'url': 'https://medium.com/@nic... | https://cdn-images-1.medium.com/max/1200/1*hko... | [] | {'canonical': 'https://medium.com/swip/laser-f... | {'viewport': 'width=device-width, initial-scal... |


**Add the following groups of features:**

- Tf-Idf with article content (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Tf-Idf with article titles (`ngram_range=(1, 2)`, `max_features=100000`, but you can try adding more)
- Time features: publication hour, whether it's morning, day, night, whether it's a weekend
- Bag of authors (i.e. One-Hot-Encoded author names)
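The four feature groups could be built roughly as in the sketch below. It runs on a tiny made-up DataFrame (`toy`), not on the competition data, and the hour boundaries for morning, day and night are arbitrary assumptions that you should adjust. The variable names are illustrative only.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the real articles (hypothetical data).
toy = pd.DataFrame(
    {
        "content": ["deep learning for text", "how to cook pasta", "text mining with python"],
        "title": ["Deep learning", "Pasta", "Text mining"],
        "author": ["alice", "bob", "alice"],
        "published": pd.to_datetime(
            ["2017-06-30 08:15:00", "2017-07-01 23:05:00", "2017-07-03 14:30:00"]
        ),
    }
)

# Tf-Idf over content and titles, with the settings suggested in the task.
content_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
title_vec = TfidfVectorizer(ngram_range=(1, 2), max_features=100000)
content_sparse = content_vec.fit_transform(toy["content"])
title_sparse = title_vec.fit_transform(toy["title"])

# Bag of authors: one column per distinct author name.
author_enc = OneHotEncoder(handle_unknown="ignore")
author_sparse = author_enc.fit_transform(toy[["author"]])

# Time features; the hour ranges are assumptions, not part of the task.
hour = toy["published"].dt.hour
time_features = pd.DataFrame(
    {
        "hour": hour,
        "morning": ((hour >= 7) & (hour <= 11)).astype(int),
        "day": ((hour >= 12) & (hour <= 18)).astype(int),
        "night": ((hour >= 23) | (hour <= 6)).astype(int),
        "weekend": (toy["published"].dt.dayofweek >= 5).astype(int),
    }
)
time_sparse = csr_matrix(time_features.values)

toy_sparse = hstack([content_sparse, title_sparse, author_sparse, time_sparse]).tocsr()
print(toy_sparse.shape)
```

On the real data, fit each vectorizer and the encoder on the training set only and call `transform` on the test set, so that both matrices share the same columns.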

In [19]:
X_train_content_sparse = csr_matrix(np.empty([10, 100000]))  # change this
X_train_title_sparse = csr_matrix(np.empty([10, 100000]))  # change this
X_train_author_sparse = csr_matrix(np.empty([10, 100000]))  # change this
X_train_time_features_sparse = csr_matrix(np.empty([10, 5]))  # change this

X_test_content_sparse = csr_matrix(np.empty([5, 100000]))  # change this
X_test_title_sparse = csr_matrix(np.empty([5, 100000]))  # change this
X_test_author_sparse = csr_matrix(np.empty([5, 100000]))  # change this
X_test_time_features_sparse = csr_matrix(np.empty([5, 5]))  # change this

In [8]:
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)

**Join all sparse matrices.**

In [9]:
X_train_sparse = hstack(
    [
        X_train_content_sparse,
        X_train_title_sparse,
        X_train_author_sparse,
        X_train_time_features_sparse,
    ]
).tocsr()

In [10]:
X_test_sparse = hstack(
    [
        X_test_content_sparse,
        X_test_title_sparse,
        X_test_author_sparse,
        X_test_time_features_sparse,
    ]
).tocsr()

**Read train target and split data for validation.**

In [11]:
train_target = pd.read_csv(
    os.path.join(PATH_TO_DATA, "train_log1p_recommends.csv"), index_col="id"
)
y_train = train_target["log_recommends"].values

In [12]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse = X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [13]:
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
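A minimal sketch of this step on synthetic data is shown below. The matrix `X_toy`, the target `y_toy` and the choice `alpha=1.0` are placeholders for illustration; in the assignment you would use `X_train_part_sparse`, `y_train_part`, `X_valid_sparse` and `y_valid` instead.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

rng = np.random.RandomState(17)

# Synthetic sparse features and a noisy linear target (hypothetical data).
X_toy = sparse_random(200, 50, density=0.1, format="csr", random_state=rng)
w_true = rng.normal(size=50)
y_toy = X_toy @ w_true + rng.normal(scale=0.1, size=200)

# Same 70/30 holdout split as in the cell above.
split = int(0.7 * X_toy.shape[0])

ridge = Ridge(alpha=1.0, random_state=17)
ridge.fit(X_toy[:split], y_toy[:split])

valid_pred = ridge.predict(X_toy[split:])
valid_mae = mean_absolute_error(y_toy[split:], valid_pred)
print(f"Validation MAE: {valid_mae:.3f}")
```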

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [14]:
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)

In [15]:
ridge_test_pred = np.empty([34645, 1])  # change this

In [16]:
def write_submission_file(
    prediction,
    filename,
    path_to_sample=os.path.join(PATH_TO_DATA, "sample_submission.csv"),
):
    submission = pd.read_csv(path_to_sample, index_col="id")

    submission["log_recommends"] = prediction
    submission.to_csv(filename)

In [17]:
write_submission_file(
    ridge_test_pred, os.path.join(PATH_TO_DATA, "assignment6_medium_submission.csv")
)

**Now's the time for dirty Kaggle hacks. Form a submission file with all zeros. Make a submission. What do you get if you think about it? How is it going to help you with modifying your predictions?**

In [18]:
write_submission_file(
    np.zeros_like(ridge_test_pred),
    os.path.join(PATH_TO_DATA, "medium_all_zeros_submission.csv"),
)

**Modify predictions in an appropriate way (based on your all-zero submission) and make a new submission.**

In [19]:
ridge_test_pred_modif = ridge_test_pred
# Your code here (read-only in a JupyterBook, please run jupyter-notebook to edit)
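One possible reading of the hint: the target is `log1p` of the number of recommendations, so it is never negative, and the MAE of an all-zeros submission therefore equals the mean of the test target on the part of the test set that the leaderboard scores. The sketch below shifts predictions by a constant so that their mean matches that value. All numbers are made up, `shift_to_target_mean` is a hypothetical helper, and matching the mean is a heuristic that is not guaranteed to lower MAE.

```python
import numpy as np


def shift_to_target_mean(pred, target_mean):
    """Add a constant so that the mean of the predictions equals target_mean."""
    pred = np.asarray(pred, dtype=float)
    return pred + (target_mean - pred.mean())


# Stand-in for the hidden test target (hypothetical values, all >= 0).
y_test_hidden = np.array([0.0, 1.0, 2.0, 5.0])

# What the leaderboard would report for an all-zeros submission.
all_zeros_mae = np.abs(y_test_hidden - 0.0).mean()

# Toy model predictions whose mean is too low.
toy_pred = np.array([0.5, 0.5, 1.5, 3.5])
shifted = shift_to_target_mean(toy_pred, all_zeros_mae)

print(all_zeros_mae, shifted.mean())
```

In the assignment, `target_mean` would be the leaderboard score of `medium_all_zeros_submission.csv`, and `toy_pred` would be `ridge_test_pred`.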

In [20]:
write_submission_file(
    ridge_test_pred_modif,
    os.path.join(PATH_TO_DATA, "assignment6_medium_submission_with_hack.csv"),
)

That's it for the assignment. In case you'd like to try some more ideas for improvement:

- Engineer good features; this is the key to success. Some simple features can be based on publication time, authors, content length and so on
- You don't have to discard the HTML: you can extract some features from it as well
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we've left only 50k features and used C=1 as a regularization parameter; both can be changed (note that scikit-learn's `Ridge` names its regularization strength `alpha`)
- SGD and Vowpal Wabbit will train much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- And neural nets, of course. We don't cover them in this course, but transformer-based architectures will likely perform well on this type of task
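As a starting point for experimenting with the validation scheme, the sketch below swaps the single holdout split for 5-fold cross-validation scored by MAE. The data is synthetic, and the fold count, shuffling and `alpha=1.0` are assumptions to tune. Whether shuffled folds or a time-ordered split correlates better with the LB score is something to check empirically.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(17)

# Synthetic sparse features and target (hypothetical data).
X_toy = sparse_random(300, 40, density=0.1, format="csr", random_state=rng)
y_toy = X_toy @ rng.normal(size=40) + rng.normal(scale=0.1, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=17)

# scikit-learn maximises scores, so MAE is returned with a negative sign.
neg_mae = cross_val_score(
    Ridge(alpha=1.0), X_toy, y_toy, cv=cv, scoring="neg_mean_absolute_error"
)
cv_mae = -neg_mae
print(f"MAE per fold: {np.round(cv_mae, 3)}, mean: {cv_mae.mean():.3f}")
```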