<center>
<img src="../../img/ods_stickers.jpg" />
    
## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course 
Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). Edited by Sergey Kolchenko (@KolchenkoSergey). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.

## <center>Assignment #6
### <center> Beating baselines in "How good is your Medium article?"
    
<img src='../../img/medium_claps.jpg' width=40% />


[Competition](https://www.kaggle.com/c/how-good-is-your-medium-article). The task is to beat "A6 baseline" (~1.45 Public LB score). Do not forget about our shared ["primitive" baseline](https://www.kaggle.com/kashnitsky/ridge-countvectorizer-baseline) - you'll find something valuable there.

**Your task:**
 1. "Freeride". Come up with good features to beat the baseline "A6 baseline" (for now, public LB is only considered)
 2. You need to name your [team](https://www.kaggle.com/c/how-good-is-your-medium-article/team) (out of 1 person) in full accordance with the [course rating](https://drive.google.com/open?id=19AGEhUQUol6_kNLKSzBsjcGUU3qWy3BNUg8x8IFkO3Q). You can think of it as a part of the assignment. 16 credits for beating the mentioned baseline and correct team naming.

In [1]:
#!conda install -c conda-forge cld2-cffi --yes

In [2]:
import os
import json
from tqdm import tqdm_notebook
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mean_absolute_error
from scipy.sparse import csr_matrix, hstack
from sklearn.linear_model import Ridge
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
import cld2

In [3]:
PATH_TO_DATA = '../../../Kaggle/Medium'

In [4]:
train_target = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_log1p_recommends.csv'), index_col='id')
y_train = train_target['log_recommends'].values

In [5]:
y_train.shape

(62313,)

In [6]:
train = pd.read_csv('train_data.csv')
test = pd.read_csv('test_data.csv')
del train['Unnamed: 0']
del test['Unnamed: 0']

In [7]:
train.shape, test.shape

((62313, 8), (34645, 8))

In [8]:
train = pd.concat((train, pd.DataFrame(y_train)), axis=1)
train['published'] = train['published'].apply(pd.to_datetime)
train['year'] = train['published'].apply(lambda x: x.year)
train = train[train['year']>=2016]
train.sort_values(by='published', ascending=True, inplace=True)

In [9]:
y_train = train[0]
train.drop(columns={0}, inplace=True)

In [10]:
y_train.shape, train.shape

((45938,), (45938, 9))

In [11]:
#!conda install -c conda-forge langdetect --yes    

In [12]:
#train['published'] = train['published'].apply(pd.to_datetime)  used above
train['tags'] = train['tags'].astype('str')
train['dow'] = train['published'].apply(lambda x: x.dayofweek)
train['hour'] = train['published'].apply(lambda x: x.hour)
train['month'] = train['published'].apply(lambda x: x.month)
#train['year'] = train['published'].apply(lambda x: x.year)    #used above
train['number_of_tags'] = train['tags'].apply(lambda x: len(x.split()))

test['tags'] = test['tags'].astype('str')
test['published'] = test['published'].apply(pd.to_datetime)
test['dow'] = test['published'].apply(lambda x: x.dayofweek)
test['hour'] = test['published'].apply(lambda x: x.hour)
test['month'] = test['published'].apply(lambda x: x.month)
test['year'] = test['published'].apply(lambda x: x.year)
test['number_of_tags'] = test['tags'].apply(lambda x: len(x.split()))

**Add the following groups of features:**
    - Tf-Idf with article content (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Tf-Idf with article titles (ngram_range=(1, 2), max_features=100000 but you can try adding more)
    - Time features: publication hour, whether it's morning, day, night, whether it's a weekend
    - Bag of authors (i.e. One-Hot-Encoded author names)

In [13]:
train['title_length'] = train['title'].apply(lambda x: len(x.split(' ')))
train['title_length_sq'] = train['title_length'].apply(lambda x: x**2)

test['title_length'] = test['title'].apply(lambda x: len(x.split(' ')))
test['title_length_sq'] = test['title_length'].apply(lambda x: x**2)

In [14]:
train['topics'] = train['url'].apply(lambda x: x.split('/')[3] if len(x.split('/')) > 3 else 'None')
test['topics'] = test['url'].apply(lambda x: x.split('/')[3] if len(x.split('/')) > 3 else 'None')

In [15]:
train['length_log'] = train['length'].apply(lambda x: np.log(x))
test['length_log'] = test['length'].apply(lambda x: np.log(x))

In [16]:
#train = train.sort_values(by='published')
#test = test.sort_values(by='published')

In [17]:
#full_df_feats = full_df.copy() 
#full_df_feats.drop(columns=['content','published','title','domain','tags','length','url','hour','year'], inplace=True) 
#list_to_dums = ['author','dow', 'month', 'number_of_tags'] 
#full_df_feats = pd.get_dummies(full_df_feats, columns = list_to_dums, drop_first=True, prefix=list_to_dums, sparse=False); 
#cv_content = TfidfVectorizer(ngram_range=(1, 2), max_features=43908) 
#cv_title = TfidfVectorizer(ngram_range=(1, 2), max_features=43908) ; 
#'alpha': 1.3

In [70]:
def kmeansshow(k,X):

    from sklearn import cluster
    from matplotlib import pyplot

    kmeans = cluster.KMeans(n_clusters=k)
    kmeans.fit(X)
    labels = kmeans.labels_
    centroids = kmeans.cluster_centers_
    #print centroids

    for i in range(k):
        # select only data observations with cluster label == i
        ds = X[np.where(labels==i)]
        # plot the data observations
        pyplot.plot(ds[:,0],ds[:,1],'o')
        # plot the centroids
        lines = pyplot.plot(centroids[i,0],centroids[i,1],'kx')
        # make the centroid x's bigger
        pyplot.setp(lines,ms=15.0)
        pyplot.setp(lines,mew=2.0)
    pyplot.legend()
    pyplot.show()
    return centroids

In [71]:
pi = np.pi
train['hour_sin_x'] = train['hour'].apply(lambda ts: np.sin(2*pi*ts/24.))
train['hour_cos_x'] = train['hour'].apply(lambda ts: np.cos(2*pi*ts/24.))

test['hour_sin_x'] = test['hour'].apply(lambda ts: np.sin(2*pi*ts/24.))
test['hour_cos_x'] = test['hour'].apply(lambda ts: np.cos(2*pi*ts/24.))

train['month_sin_x'] = train['month'].apply(lambda ts: np.sin(2*pi*ts/24.))
train['month_cos_x'] = train['month'].apply(lambda ts: np.cos(2*pi*ts/24.))

test['month_sin_x'] = test['month'].apply(lambda ts: np.sin(2*pi*ts/24.))
test['month_cos_x'] = test['month'].apply(lambda ts: np.cos(2*pi*ts/24.))

In [72]:
idx_split = len(train)
full_df = pd.concat((train, test), sort=True)

In [19]:
#full_df.columns

In [20]:
#full_df_feats = full_df.copy()
#full_df_feats.drop(columns=['content', 'published', 'title', 'length',
#       'url', 'dow', 'hour', 'month', 'year', 'number_of_tags', 'title_length'],  inplace=True)

In [21]:
#list_to_dums = ['author','tags','domain','topics','length_log','title_length_sq']
#full_df_feats = pd.get_dummies(full_df_feats, columns = list_to_dums, drop_first=True, prefix=list_to_dums, sparse=False)

In [73]:
full_df.columns

Index(['author', 'content', 'domain', 'dow', 'hour', 'hour_cos_x',
       'hour_sin_x', 'length', 'length_log', 'month', 'month_cos_x',
       'month_sin_x', 'number_of_tags', 'published', 'tags', 'title',
       'title_length', 'title_length_sq', 'topics', 'url', 'year'],
      dtype='object')

In [68]:
full_df_feats = full_df.copy() 
full_df_feats.drop(columns=['content', 'published','length', 
                            'url','hour','year','title_length', 'title_length_sq', 'dow', 'tags', 
                            'month', 'number_of_tags', 'length_log','title'], inplace=True) 
list_to_dums = ['author', 'topics', 'domain'] 
full_df_feats = pd.get_dummies(full_df_feats, columns = list_to_dums, drop_first=True, prefix=list_to_dums, sparse=False); 

In [69]:
full_df_feats.shape

(80583, 75328)

In [63]:
#full_df_feats.shape

(80583, 75324)

In [25]:
cv_content = TfidfVectorizer(ngram_range=(1, 2), max_features=300000, sublinear_tf=True) 
cv_title = TfidfVectorizer(ngram_range=(1, 2), max_features=300000, sublinear_tf=True) 

In [26]:
X_train_content = cv_content.fit_transform(full_df.iloc[:idx_split,:]['content'].values.tolist())

In [27]:
X_test_content = cv_content.transform(full_df.iloc[idx_split:,:]['content'].values.tolist())

In [28]:
X_train_title = cv_title.fit_transform(full_df.iloc[:idx_split,:]['title'].values.tolist())

In [29]:
X_test_title = cv_title.transform(full_df.iloc[idx_split:,:]['title'].values.tolist())

In [74]:
X_train_sparse = hstack([X_train_content, 
                         X_train_title,
                         full_df_feats.iloc[:idx_split,:]]).tocsr()


In [75]:
X_test_sparse = hstack([X_test_content, 
                        X_test_title,
                        full_df_feats.iloc[idx_split:,:]]).tocsr()


In [76]:
time_split = TimeSeriesSplit(n_splits=5)

In [77]:
[(el[0].shape, el[1].shape) for el in time_split.split(X_train_sparse)]

[((7658,), (7656,)),
 ((15314,), (7656,)),
 ((22970,), (7656,)),
 ((30626,), (7656,)),
 ((38282,), (7656,))]

In [78]:
ridge_grid = GridSearchCV(estimator=Ridge(random_state=17), 
                          param_grid = {'alpha': [0.06, 0.08, 0.1, 0.12, 0.14]}, 
                          scoring='neg_mean_absolute_error',
                          n_jobs=1, 
                          cv=time_split, 
                          verbose=1)

In [79]:
#ridge = Ridge(random_state=17, alpha=2)                          
ridge_pred_test = ridge_grid.fit(X_train_sparse, y_train).predict(X_test_sparse) 

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=1)]: Done  25 out of  25 | elapsed: 37.5min finished


In [80]:
ridge_grid.best_estimator_

Ridge(alpha=0.12, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=17, solver='auto', tol=0.001)

In [81]:
ridge_grid.best_score_

-1.0589594630908086

In [84]:
#y_predictions = ridge_pred_test

In [85]:
y_predictions = ridge_pred_test + (4.33328 - np.mean(ridge_pred_test))

In [86]:
np.mean(y_predictions)

4.333279999999999

In [87]:
def write_submission_file(prediction, filename,
                          path_to_sample=os.path.join(PATH_TO_DATA, 
                                                      'sample_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='id')
    
    submission['log_recommends'] = prediction
    submission.to_csv(filename)

In [88]:
#write_submission_file(y_predictions, os.path.join(PATH_TO_DATA,
#                                                    'to_test26nov-1.csv'))

In [89]:
write_submission_file(y_predictions, os.path.join(PATH_TO_DATA,
                                                    'to_a_test_26-final.csv'))


That's it for the assignment. Much more credits will be given to the winners in this competition, check [course roadmap](https://mlcourse.ai/roadmap). Do not spoil the assignment and the competition - don't share high-performing kernels (with MAE < 1.5).

Some ideas for improvement:

- Engineer good features, this is the key to success. Some simple features will be based on publication time, authors, content length and so on
- You may not ignore HTML and extract some features from there
- You'd better experiment with your validation scheme. You should see a correlation between your local improvements and LB score
- Try TF-IDF, ngrams, Word2Vec and GloVe embeddings
- Try various NLP techniques like stemming and lemmatization
- Tune hyperparameters. In our example, we've left only 50k features and used C=1 as a regularization parameter, this can be changed
- SGD and Vowpal Wabbit will learn much faster
- Play around with blending and/or stacking. An intro is given in [this Kernel](https://www.kaggle.com/kashnitsky/ridge-and-lightgbm-simple-blending) by @yorko 
- In our course, we don't cover neural nets. But it's not obliged to use GRUs/LSTMs/whatever in this competition.

Good luck!

<img src='../../img/kaggle_shakeup.png' width=50%>