<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center>Assignment #6
### <center> Beating baselines in "How good is your Medium article?"
    
<img src='../../img/medium_claps.jpg' width=40% />


[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "A6 baseline" (~1.45 Public LB score). Do not forget about our shared ["primitive" baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline) - you'll find something valuable there.

**Your task:**
 1. "Freeride". Come up with good features to beat the "A6 baseline" (for now, only the public LB is considered).
 2. Name your [team](https://www.kaggle.com/c/how-good-is-your-medium-article/team) (of one person) in full accordance with the [course rating](https://drive.google.com/open?id=19AGEhUQUol6_kNLKSzBsjcGUU3qWy3BNUg8x8IFkO3Q); consider this part of the assignment. You earn 16 credits for beating the baseline and naming your team correctly.
 
*For discussions, please stick to [ODS Slack](https://opendatascience.slack.com/), channel #mlcourse_ai, pinned thread __#a6__*

In [0]:
import lightgbm as lgb


In [0]:
import os
from datetime import datetime
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge
from sklearn.feature_extraction.text import CountVectorizer

The following code strips all HTML tags from an article's content.

In [0]:
from html.parser import HTMLParser

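class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser; convert_charrefs=True decodes entities such as &amp;.
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        # Collect text nodes only; tags are skipped.
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)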

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
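
As a quick illustration of what tag stripping does, here is a minimal self-contained sketch. The names `TextOnlyParser` and `strip_tags_demo` are hypothetical and exist only for this example; the input string is made up.

```python
from html.parser import HTMLParser

class TextOnlyParser(HTMLParser):
    """Collects text nodes and ignores tags (hypothetical name for this sketch)."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)

def strip_tags_demo(html):
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)

cleaned = strip_tags_demo('<p>Claps &amp; <b>recommends</b></p>')
print(cleaned)
```

Tags are dropped and the entity `&amp;` is decoded, so only the readable text reaches the vectorizers later on.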

Supplementary function to read a JSON line without crashing on escape characters.

In [0]:
def read_json_line(line=None):
    result = None
    try:        
        result = json.loads(line)
    except Exception as e:      
        # Find the offending character index:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')',''))      
        # Remove the offending character:
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)     
        return read_json_line(line=new_line)
    return result
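
The same repair idea can be sketched without parsing the error message, assuming the failure is always a `json.JSONDecodeError`, whose `pos` attribute holds the offending character index. The name `read_json_line_safe` and the `max_repairs` limit are assumptions made for this example, not part of the original code.

```python
import json

def read_json_line_safe(line, max_repairs=100):
    """Parse one JSON line, blanking out offending characters one at a time."""
    for _ in range(max_repairs):
        try:
            return json.loads(line)
        except json.JSONDecodeError as e:
            # e.pos is the character index reported in the error message.
            line = line[:e.pos] + ' ' + line[e.pos + 1:]
    raise ValueError('line could not be repaired within max_repairs attempts')

# A raw control character inside a string is invalid JSON.
broken = '{"title": "bad\x01char"}'
parsed = read_json_line_safe(broken)
print(parsed)
```

The loop replaces recursion, and the attempt limit keeps a line that cannot be repaired from looping forever.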

Extract features `content`, `published`, `title` and `author`, write them to separate files for train and test sets.

In [0]:
def extract_description(path_to_data,
                        inp_filename, is_train=True):
    
    prefix = 'train' if is_train else 'test'
    out_path = os.path.join(path_to_data, '{}_description.txt'.format(prefix))
    print(out_path)
    
    # Both files are closed automatically, even if a line fails to parse.
    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file, \
         open(out_path, 'w', encoding='utf-8') as f:
        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            f.write(strip_tags(json_data['meta_tags']['twitter:description'].replace('\n', ' ').replace('\r', ' ')) + '\n')

#extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)

In [0]:
extract_description(r'C:\Users\Андрей\AnacondaProjects\ODS\mlcourse.ai\data\kaggle_medium', 'train.json', is_train=True)

In [0]:
def extract_features_and_write(path_to_data,
                               inp_filename, is_train=True):
    
    features = ['content', 'published', 'title', 'author']
    prefix = 'train' if is_train else 'test'
    feature_files = [open('{}_{}.txt'.format(prefix, feat),
                          'w', encoding='utf-8')
                     for feat in features]
    
    with open(os.path.join(path_to_data, inp_filename), 
              encoding='utf-8') as inp_json_file:

        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            
            feature_files[0].write(strip_tags(json_data['content'].replace('\n', ' ').replace('\r', ' ')) + '\n')
            feature_files[1].write(json_data['published']['$date'].split('.')[0] + '\n')
            feature_files[2].write(strip_tags(json_data['title'].replace('\n', ' ').replace('\r', ' ')) + '\n')
            feature_files[3].write(strip_tags(json_data['author']['url'].split('@')[-1]) + '\n')
            
    for feature_file in feature_files:
        feature_file.close()

In [0]:
PATH_TO_DATA = r'../input' # modify this if you need to

In [0]:
extract_features_and_write(PATH_TO_DATA, 'train.json', is_train=True)

In [0]:
extract_features_and_write(PATH_TO_DATA, 'test.json', is_train=False)

**Add the following groups of features:**

- Tf-Idf on article content (ngram_range=(1, 2), max_features=100000, but you can try more)
- Tf-Idf on article titles (ngram_range=(1, 2), max_features=100000, but you can try more)
- Time features: publication hour; whether it is morning, day, or night; whether it is a weekend
- Bag of authors (i.e. one-hot-encoded author names)

In [0]:
vect = TfidfVectorizer(ngram_range=(1,2), max_features=100000) 
with open('train_title.txt', encoding='utf-8') as input_train_file:
    X_train_title_sparse = vect.fit_transform(input_train_file)

In [0]:
# Reuse the vectorizer fitted on the train titles; do not refit it on the test set.
with open('test_title.txt', encoding='utf-8') as input_test_file:
    X_test_title_sparse = vect.transform(input_test_file)

In [0]:
vect_c = TfidfVectorizer(ngram_range=(1,2), max_features=150000) 
with open('train_content.txt', encoding='utf-8') as input_train_file:
    X_train_content_sparse = vect_c.fit_transform(input_train_file)

In [0]:
with open('test_content.txt', encoding='utf-8') as input_test_file:
    X_test_content_sparse = vect_c.transform(input_test_file)

In [0]:
def extract_date_f(PATH_TO_DATA, pub_filename, tit_filename):
    # PATH_TO_DATA is unused: both files are read from the working directory.
    date_list = []
    with open(pub_filename) as f:
        for line in f:
            # Strip the trailing newline, then drop fractional seconds if present.
            date_list.append(pd.to_datetime(line.strip().split('.')[0], format='%Y-%m-%dT%H:%M:%S'))
    date_df = pd.DataFrame(date_list, columns=['date'])
    
    date_df['start_hour'] = date_df['date'].apply(lambda x: x.hour)
    hour = date_df['start_hour'].values
    # Part-of-day indicators; the four ranges cover all 24 hours without overlap.
    date_df['morning'] = ((hour >= 7) & (hour <= 11)).astype('int')
    date_df['day'] = ((hour >= 12) & (hour <= 18)).astype('int')
    date_df['evening'] = ((hour >= 19) & (hour <= 23)).astype('int')
    date_df['night'] = ((hour >= 0) & (hour <= 6)).astype('int')
    date_df['weekday'] = date_df['date'].apply(lambda ts: ts.dayofweek).astype('int64') 
    date_df['is_weekend'] = date_df['date'].apply(lambda x: 1 if x.date().weekday() in (5, 6) else 0)
    
    # One-hot encode the hour and the weekday.
    date_df = pd.concat((date_df, pd.get_dummies(date_df[['weekday', 'start_hour']], 
                                             columns=['start_hour', 'weekday'])), axis=1)
    
    # Title lengths are computed but not used (the feature is commented out below).
    with open(tit_filename, encoding='utf-8') as input_title_file:
        len_tit = []
        for i in input_title_file:
            len_tit.append(len(i))
    #date_df['len_title'] = np.array(len_tit)
    
    return date_df.drop(['date', 'weekday', 'start_hour'], axis=1).values

In [0]:
X_train_time_features_sparse = extract_date_f(PATH_TO_DATA, 'train_published.txt', 'train_title.txt')

In [0]:
X_test_time_features_sparse = extract_date_f(PATH_TO_DATA, 'test_published.txt', 'test_title.txt')
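
The part-of-day and weekend flags computed above can be checked on a single timestamp with the standard library alone. This is a minimal sketch: `time_flags` is a hypothetical helper and the timestamp is a made-up example in the same format as `train_published.txt`.

```python
from datetime import datetime

def time_flags(timestamp):
    """Return part-of-day flags and a weekend flag for one ISO-like timestamp."""
    ts = datetime.strptime(timestamp.strip().split('.')[0], '%Y-%m-%dT%H:%M:%S')
    hour = ts.hour
    return {
        'morning': int(7 <= hour <= 11),
        'day': int(12 <= hour <= 18),
        'evening': int(19 <= hour <= 23),
        'night': int(0 <= hour <= 6),
        'is_weekend': int(ts.weekday() in (5, 6)),  # 5 = Saturday, 6 = Sunday
    }

# 17 June 2017 was a Saturday.
flags = time_flags('2017-06-17T09:30:00.000Z\n')
print(flags)
```

Exactly one of the four part-of-day flags is set for any hour, which makes it easy to sanity-check the feature matrix.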

In [0]:
with open('train_author.txt', encoding='utf-8') as input_train_file:
    author1 = []
    for i in input_train_file:
        author1.append(i)
print(len(set(author1)))

In [0]:
with open('test_author.txt', encoding='utf-8') as input_test_file:
    author2 = []
    for i in input_test_file:
        author2.append(i)
print(len(set(author2)))

In [0]:
a = pd.get_dummies(author1 + author2)

In [0]:
a.head()
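
`pd.get_dummies` builds a dense table with one column per author, which can be memory-hungry when there are many authors. A possible alternative, sketched here on toy data, is scikit-learn's `OneHotEncoder`, which returns a sparse matrix. The author names below are invented, and swapping this in for the cell above is a suggestion rather than the notebook's method.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy author lists standing in for the contents of train_author.txt and test_author.txt.
train_authors = ['alice', 'bob', 'alice', 'carol']
test_authors = ['bob', 'dave']  # 'dave' never appears in train

# handle_unknown='ignore' encodes authors unseen during fit as an all-zero row.
encoder = OneHotEncoder(handle_unknown='ignore')
X_train_author = encoder.fit_transform(np.array(train_authors).reshape(-1, 1))
X_test_author = encoder.transform(np.array(test_authors).reshape(-1, 1))

print(X_train_author.shape, X_test_author.shape)
```

Unlike the concatenated `get_dummies` call, this creates no columns for authors that appear only in the test set. Those columns carry no training signal anyway.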

**Join all sparse matrices.**

In [0]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the train set only and reuse its statistics for the test set.
scaler = StandardScaler()
train_tmp_scaled = scaler.fit_transform(X_train_time_features_sparse)
test_tmp_scaled = scaler.transform(X_test_time_features_sparse)

In [0]:
X_train_sparse = hstack([
    X_train_content_sparse,
    X_train_title_sparse,
    train_tmp_scaled,
    a.values[:len(author1), :],
]).tocsr()

In [0]:
X_test_sparse = hstack([
    X_test_content_sparse,
    X_test_title_sparse,
    test_tmp_scaled,
    a.values[len(author1):, :],
]).tocsr()

**Read train target and split data for validation.**

In [0]:
d_ = {name: [] for name in ['id','log_recommends']}
with open(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv')) as file:
    for i, line in tqdm_notebook(enumerate(file)):
        if not i:
            continue
        r = line.replace('\n', '').split(',')
        d_['id'].append(r[0])
        d_['log_recommends'].append(r[1])

In [0]:
train_target = pd.DataFrame(d_['log_recommends'], d_['id'], ['log_recommends'])
train_target.head()

In [0]:
y_train = train_target['log_recommends'].values.astype(float)

In [0]:
train_part_size = int(0.7 * train_target.shape[0])
X_train_part_sparse = X_train_sparse[:train_part_size, :]
y_train_part = y_train[:train_part_size]
X_valid_sparse =  X_train_sparse[train_part_size:, :]
y_valid = y_train[train_part_size:]

**Train a simple Ridge model and check MAE on the validation set.**

In [0]:
ridge = Ridge(alpha=1, random_state=17)

In [0]:
%%time
ridge.fit(X_train_part_sparse, y_train_part);

In [0]:
ridge_pred = ridge.predict(X_valid_sparse)

In [0]:
ridge_valid_mae = mean_absolute_error(y_valid, ridge_pred)
ridge_valid_mae

**Train the same Ridge with all available data, make predictions for the test set and form a submission file.**

In [0]:
%%time
ridge.fit(X_train_sparse, y_train);

In [0]:
ridge_test_pred = ridge.predict(X_test_sparse)

In [0]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 
                                                      'sample_submission.csv')):
    #submission = pd.read_csv(path_to_sample, index_col='id')
    d_ = {name: [] for name in ['id','log_recommends']}
    with open(path_to_sample) as file:
        for i, line in tqdm_notebook(enumerate(file)):
            if not i:
                continue
            r = line.replace('\n', '').split(',')
            d_['id'].append(r[0])
            d_['log_recommends'].append(r[1])
    df_ = pd.DataFrame(d_['log_recommends'], d_['id'], ['log_recommends'])
    df_['log_recommends'] = prediction
    df_.index.name = 'id'
    df_.to_csv(filename)

In [0]:
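# Shift every prediction by one constant so that the mean prediction becomes 4.33328
# (exact when all predictions are non-negative, since the MAE against zeros is then their mean).
write_submission_file(ridge_test_pred + (4.33328 - mean_absolute_error(ridge_test_pred, np.zeros_like(ridge_test_pred))), 
                      'assignment6_medium_submission.csv')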
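
The constant shift in the submission call above can be written as a small helper. The notebook does not say where 4.33328 comes from. One plausible reading, which is an assumption here, is that it is the mean of the test target: the target is `log1p` of a count and therefore non-negative, so the MAE of an all-zeros submission equals that mean. The helper name `shift_to_mean` and the toy predictions are hypothetical.

```python
import numpy as np

def shift_to_mean(pred, target_mean):
    """Add one constant to every prediction so that their mean equals target_mean."""
    pred = np.asarray(pred, dtype=float)
    return pred + (target_mean - pred.mean())

toy_pred = np.array([2.0, 3.0, 4.0])  # made-up predictions with mean 3.0
shifted = shift_to_mean(toy_pred, 4.33328)
print(shifted, shifted.mean())
```

The helper uses the plain mean rather than the mean absolute value. The two agree whenever all predictions are non-negative, and the plain mean hits the target exactly in either case.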