# Objectives
1. Develop a strong baseline using nothing more than **Scikit-learn**
2. Inspect the top features of the regression model
3. Search for the best pipelines and their hyper-parameters using `GridSearchCV`
4. Discuss the interesting observations

# Pipelines to be evaluated
1. TFIDF vectorizer -> Linear Regression
2. TFIDF vectorizer -> Ridge Regression
3. TFIDF vectorizer -> TruncatedSVD -> Linear Regression
4. TFIDF vectorizer -> TruncatedSVD -> Ridge Regression

# Spoiler alert: interesting observations
1. One peculiar observation is that the **top negative features of the regression model contain a lot of stop words** 
2. Removing the stop words results in inferior performance. My guess is that well-structured sentences consist predominantly of conjunctions, articles, etc., which are in fact in the list of stop words
3. The model with dimensionality reduction outperforms the model without it (CV RMSE of 0.761 vs. 0.775)
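4. Normalizing the data before passing it to the regressors appeared to help, but treat this with caution: scikit-learn ignores `normalize` when `fit_intercept=False`, which is the setting both best models ended up with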
5. Best model **Without** SVD -> **Ridge regression**
6. Best model **With** SVD -> **Linear Regression**. This suggests that SVD reduces noise and acts as a substitute for regularization

In [1]:
import string
import pandas as pd

from pathlib import Path
from sklearn.pipeline import Pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
COMPETITION_DATA_PATH = Path('../input/commonlitreadabilityprize')
TRAIN_DATA_PATH = COMPETITION_DATA_PATH / 'train.csv'
TEST_DATA_PATH = COMPETITION_DATA_PATH / 'test.csv'

In [3]:
train_data = pd.read_csv(TRAIN_DATA_PATH)
train_data = train_data[['excerpt', 'target']]

In [4]:
def preprocess_text(text):
    # Strip all ASCII punctuation characters (string.punctuation)
    text = text.translate(str.maketrans('', '', string.punctuation))
    return text
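
To make the effect of `preprocess_text` concrete, here is a small standalone sketch. The sample sentence and the `PUNCTUATION_TABLE` name are illustrative assumptions, not part of the original notebook. Two side effects are worth noting: apostrophes and hyphens are deleted, so contractions and hyphenated words get fused, and, as far as I understand the scikit-learn documentation, passing a custom `preprocessor` to `TfidfVectorizer` replaces its built-in lowercasing step, which is probably why capitalized tokens such as `One` show up among the features later.

```python
import string

# Build the translation table once instead of on every call (illustrative tweak).
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def preprocess_text(text):
    # Delete every ASCII punctuation character; case and whitespace are untouched.
    return text.translate(PUNCTUATION_TABLE)

sample = "Don't stop: well-structured sentences matter!"
cleaned = preprocess_text(sample)
print(cleaned)  # Dont stop wellstructured sentences matter
```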

# Pipeline without Dimensionality reduction

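This pipeline keeps the raw TF-IDF features, so the regression weights can be mapped back to individual words and n-grams and inspected.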

In [5]:
pipeline_without_svd = Pipeline([('vectorizer', 'passthrough'),
                                 ('regressor', 'passthrough')])

param_grid = {'vectorizer': [TfidfVectorizer(preprocessor=preprocess_text)],
              'regressor':  [LinearRegression(), Ridge()],
              
              'vectorizer__stop_words': [None, 'english'],
              'vectorizer__min_df': [5, 7, 9],
              'vectorizer__ngram_range': [(1, 2), (1, 3)],
              
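              # Note: `normalize` is ignored when fit_intercept=False, and the
              # parameter was removed in scikit-learn 1.2 (drop it on newer versions)
              'regressor__fit_intercept': [True, False],
              'regressor__normalize': [True, False]}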

search_without_svd = GridSearchCV(pipeline_without_svd, param_grid, cv=3,
                                  n_jobs=-1, scoring='neg_root_mean_squared_error')
search_without_svd.fit(X=train_data['excerpt'], y=train_data['target'])

print("Best parameter (CV score=%0.3f):" % search_without_svd.best_score_)
print(search_without_svd.best_params_)

Best parameter (CV score=-0.775):
{'regressor': Ridge(fit_intercept=False, normalize=True), 'regressor__fit_intercept': False, 'regressor__normalize': True, 'vectorizer': TfidfVectorizer(min_df=7, ngram_range=(1, 3),
                preprocessor=<function preprocess_text at 0x7f8cd8bdfcb0>), 'vectorizer__min_df': 7, 'vectorizer__ngram_range': (1, 3), 'vectorizer__stop_words': None}
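
`best_params_` only reports the winner. To see how much a single setting such as `stop_words` matters, the full grid can be read from `cv_results_`. The sketch below is a minimal illustration on a made-up toy corpus (the `texts` and `targets` values are hypothetical stand-ins for `train_data['excerpt']` and `train_data['target']`), so the scores it prints say nothing about the competition data.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with the real excerpts and targets.
texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "the theory of relativity is complex", "children play with toys",
    "the analysis of variance in the data", "birds sing in the trees",
    "the cat and the dog play", "the structure of the atom",
    "kids run in the snow", "the nature of the hypothesis",
    "a boy lived in the town", "the derivation of the formula",
]
targets = [1.0, 0.8, -1.5, 0.9, -1.8, 0.7, 1.1, -1.2, 1.0, -1.6, 0.9, -1.7]

pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
                     ("regressor", Ridge())])
param_grid = {"vectorizer__stop_words": [None, "english"]}

search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring="neg_root_mean_squared_error")
search.fit(texts, targets)

# One row per grid point; mean_test_score is the negated RMSE (higher is better).
results = pd.DataFrame(search.cv_results_)
summary = (results[["param_vectorizer__stop_words", "mean_test_score"]]
           .sort_values("mean_test_score", ascending=False))
print(summary)
```

On the real search object, the same two lines applied to `search_without_svd.cv_results_` would show the score gap between keeping and removing stop words.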


## Top features

In [6]:
best_pipeline_without_svd = search_without_svd.best_estimator_

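# get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
# use get_feature_names_out() on newer versions
vectorizer_features = best_pipeline_without_svd['vectorizer'].get_feature_names()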
model_weights = best_pipeline_without_svd['regressor'].coef_

print('Top positive features')
for weight, feature in sorted(zip(model_weights, vectorizer_features))[::-1][:10]:
    print(f'Feature: {feature}: {weight:.2f}')
    
print(5*'--------------------')

print('Top negative features')
for weight, feature in sorted(zip(model_weights, vectorizer_features))[:10]:
    print(f'Feature: {feature}: {weight:.2f}')

Top positive features
Feature: just: 0.98
Feature: One: 0.97
Feature: boys: 0.91
Feature: lived: 0.89
Feature: snow: 0.83
Feature: trees: 0.80
Feature: out: 0.77
Feature: think: 0.76
Feature: away: 0.76
Feature: rights: 0.74
----------------------------------------------------------------------------------------------------
Top negative features
Feature: of: -3.23
Feature: with: -1.96
Feature: the: -1.73
Feature: in: -1.73
Feature: as: -1.47
Feature: which: -1.44
Feature: by: -1.25
Feature: to: -1.24
Feature: and: -1.15
Feature: at: -1.12


# Pipeline with Dimensionality reduction

In [7]:
pipeline_with_svd = Pipeline([('vectorizer', 'passthrough'),
                              ('svd', 'passthrough'),
                              ('regressor', 'passthrough')])

param_grid = {'vectorizer': [TfidfVectorizer(preprocessor=preprocess_text)],
              'svd': [TruncatedSVD()],
              'regressor':  [LinearRegression(), Ridge()],
              
              'vectorizer__stop_words': [None, 'english'],
              'vectorizer__min_df': [3, 5, 7],
              'vectorizer__ngram_range': [(1, 2), (1, 3)],
              
              'svd__n_components': [700, 800],
              
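              # Note: `normalize` is ignored when fit_intercept=False, and the
              # parameter was removed in scikit-learn 1.2 (drop it on newer versions)
              'regressor__fit_intercept': [True, False],
              'regressor__normalize': [True, False]}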

search_with_svd = GridSearchCV(pipeline_with_svd, param_grid, cv=3,
                               n_jobs=-1, scoring='neg_root_mean_squared_error')
search_with_svd.fit(X=train_data['excerpt'], y=train_data['target'])

print("Best parameter (CV score=%0.3f):" % search_with_svd.best_score_)
print(search_with_svd.best_params_)

Best parameter (CV score=-0.761):
{'regressor': LinearRegression(fit_intercept=False), 'regressor__fit_intercept': False, 'regressor__normalize': False, 'svd': TruncatedSVD(n_components=800), 'svd__n_components': 800, 'vectorizer': TfidfVectorizer(min_df=3, ngram_range=(1, 3),
                preprocessor=<function preprocess_text at 0x7f8cd8bdfcb0>), 'vectorizer__min_df': 3, 'vectorizer__ngram_range': (1, 3), 'vectorizer__stop_words': None}
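
The grid only tries 700 and 800 components. An optional sanity check, not part of the original notebook, is to look at how much variance a given `n_components` retains via `explained_variance_ratio_`. The sketch below uses a hypothetical toy corpus and a tiny `n_components=5`, so the number it prints is not representative of the real data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; replace with train_data['excerpt'] in practice.
texts = [
    "the cat sat on the mat", "a dog ran in the park",
    "the theory of relativity is complex", "children play with toys",
    "the analysis of variance in the data", "birds sing in the trees",
    "the structure of the atom", "kids run in the snow",
]

tfidf = TfidfVectorizer().fit_transform(texts)

# Project the sparse TF-IDF matrix onto its top singular directions.
svd = TruncatedSVD(n_components=5, random_state=0)
reduced = svd.fit_transform(tfidf)

# Share of the total variance kept by the retained components.
retained = svd.explained_variance_ratio_.sum()
print(reduced.shape, round(retained, 3))
```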


# Recap of the interesting observations
1. One peculiar observation is that the top negative features of the regression model contain a lot of stop words
2. Removing the stop words results in inferior performance. My guess is that well-structured sentences consist predominantly of conjunctions, articles, etc., which are in fact in the list of stop words
3. The model with dimensionality reduction outperforms the model without it (CV RMSE of 0.761 vs. 0.775)
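4. Normalizing the data before passing it to the regressors appeared to help, but treat this with caution: scikit-learn ignores `normalize` when `fit_intercept=False`, which is the setting both best models ended up with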
5. Best model **without** SVD -> Ridge regression, whereas best model **with** SVD -> Linear Regression. This suggests that SVD reduces noise and acts as a substitute for regularization
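
One way to see why SVD can stand in for regularization, sketched under the assumption of no intercept and an uncentered TF-IDF matrix $X = U \Sigma V^\top$ with singular values $\sigma_1 \ge \sigma_2 \ge \dots$ (the symbols $u_i$, $\sigma_i$, $\alpha$ and $k$ are my notation, not the notebook's): ridge regression with penalty $\alpha$ shrinks every singular direction smoothly, while linear regression on the top $k$ SVD components keeps the first $k$ directions untouched and discards the rest.

$$
\hat{y}_{\text{ridge}} = \sum_{i} \frac{\sigma_i^2}{\sigma_i^2 + \alpha}\, u_i u_i^\top y,
\qquad
\hat{y}_{\text{SVD+OLS}} = \sum_{i \le k} u_i u_i^\top y
$$

Both fits damp the low-variance directions, where noise tends to dominate, so it is plausible that an extra ridge penalty adds little once SVD has been applied. This is a textbook-style argument and has not been verified on this dataset.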

# Predicting with the best pipeline

In [8]:
test_data = pd.read_csv(TEST_DATA_PATH)
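# GridSearchCV refits the best pipeline on the full training set
# (refit=True by default), so predict() uses that refit model
test_data['target'] = search_with_svd.predict(test_data['excerpt'])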
test_data[['id','target']].to_csv('submission.csv', index=False)