# News Popularity

### Team ML makes me cry

Team members and contributions:
* 陳博安 103062321: data processing, 40%
* 谷佳駿 103062232: training, 60%

Before preprocessing, all we had was raw text, which a model cannot be trained on directly, so we used HashingVectorizer to convert the text into vectors. (We also tried TfidfVectorizer, but it required more memory than we had, so we did not use it.)
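
To make the memory argument concrete, here is a minimal plain-Python sketch of the hashing trick that HashingVectorizer is built on. It is an illustration only: the function name `hash_vectorize`, the use of MD5, and the tiny `n_features=16` are assumptions made for this example, and scikit-learn's own implementation uses a different hash function and normalizes its output. The point is that no vocabulary has to be stored, so every batch maps to vectors of the same fixed width.

```python
import hashlib


def hash_vectorize(tokens, n_features=16):
    """Count tokens into a fixed-length vector, one hashed bucket per token."""
    vec = [0] * n_features
    for tok in tokens:
        # A stable hash (unlike Python's salted built-in hash()) keeps the
        # token-to-bucket mapping identical across runs and batches.
        digest = hashlib.md5(tok.encode('utf-8')).hexdigest()
        vec[int(digest, 16) % n_features] += 1
    return vec


doc_a = hash_vectorize('the cat sat on the mat'.split())
doc_b = hash_vectorize('an entirely different headline'.split())
print(len(doc_a), len(doc_b))  # both vectors have length 16
```

Different tokens can collide in the same bucket, which is the price paid for not storing a vocabulary; a larger `n_features` (we used 2**14) makes collisions rarer.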

We suspected that a news title affects whether people decide to read the article, and therefore its popularity, so we extracted the titles and vectorized them as separate features for training.
We also suspected that the day of the week on which an article was published would help our predictions, so we added it as a one-hot-encoded feature.
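
As a rough sketch of the weekday feature, the snippet below turns a timestamp into a seven-column one-hot row using only the standard library. The helper name `weekday_one_hot` and the timestamp format are assumptions for illustration; our actual code parses the page's `<time>` tag with `dateutil` and encodes the result with `OneHotEncoder`.

```python
from datetime import datetime


def weekday_one_hot(date_string, fmt='%Y-%m-%d %H:%M:%S'):
    """Return a 7-element one-hot list for the weekday (Monday is index 0)."""
    day = datetime.strptime(date_string, fmt).weekday()
    return [1 if i == day else 0 for i in range(7)]


row = weekday_one_hot('2024-01-03 09:30:00')  # a Wednesday
print(row)  # [0, 0, 1, 0, 0, 0, 0]
```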

Before vectorizing the text, we tokenized it into lists of strings, filtered out the stop words, and stemmed the remaining tokens.


In [4]:
from sklearn.feature_extraction.text import HashingVectorizer
from bs4 import BeautifulSoup
from dateutil import parser
from sklearn.preprocessing import OneHotEncoder
import re
import numpy as np
from nltk.stem.porter import PorterStemmer
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop = stopwords.words('english')

def tokenizer_stem_nostop(text):
    """Split on whitespace, drop stop words and tokens that do not start with a letter, then stem."""
    porter = PorterStemmer()
    return [porter.stem(w) for w in re.split(r'\s+', text.strip())
            if w not in stop and re.match('[a-zA-Z]+', w)]


def get_processed_data(X):
    content = []
    titles = []
    weekdays = []
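    # Iterate over values directly; X[i] is a label lookup on a pandas Series
    # and fails for chunks whose index does not start at 0.
    for text in X:
        soup = BeautifulSoup(text, 'html.parser')
        text = soup.get_text()
        date = soup.find('time')
        title = soup.find('h1')
        # Weekday index (Monday == 0); fall back to 0 when the <time> tag
        # is missing or empty.
        if date is not None and date.string is not None:
            date = parser.parse(date.string).weekday()
        else:
            date = 0
        # Lowercase the title and collapse non-word characters into spaces;
        # fall back to an empty title when the <h1> tag is missing or empty.
        title = title.string if title is not None and title.string else ''
        title = re.sub(r'\W+', ' ', title.lower())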
        
        content.append(text)
        titles.append(title)
        weekdays.append(date)
    
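    # n_values=7 forces seven output columns even if a batch lacks some weekdays
    # (newer scikit-learn versions replace n_values with the categories argument)
    ohe = OneHotEncoder(sparse=False, n_values=7)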
    weekdays = np.array(weekdays)
    one_hot_weekdays = ohe.fit_transform(weekdays[:, np.newaxis])
   
    # TfidfVectorizer needed more memory than we had, so we hash instead:
    # tfidf = TfidfVectorizer(ngram_range=(1,1), tokenizer=tokenizer_stem_nostop)
    hashVec = HashingVectorizer(n_features=2**14, tokenizer=tokenizer_stem_nostop)
    # HashingVectorizer is stateless, so transform() needs no prior fit()
    content_vec = hashVec.transform(content).toarray()
    titles_vec = hashVec.transform(titles).toarray()

    return np.concatenate((titles_vec, content_vec, one_hot_weekdays), axis=1)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/billywithbelly/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


After preprocessing, we were finally able to train our model.
We used SGDClassifier because it can be trained incrementally with partial_fit, and our computers were not powerful enough to train on all the data at once.

In [10]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
import pandas as pd
import pickle as pkl

def get_stream(path, size):
    for chunk in pd.read_csv(path, chunksize=size):
        yield chunk


# loss='log' gives logistic regression (spelled 'log_loss' in newer scikit-learn releases)
# n_iter is omitted: partial_fit runs a single pass over each batch regardless
clf = SGDClassifier(loss='log')
batch_size = 200
stream = get_stream(path='datasets/train.csv', size=batch_size)
classes = np.array([-1, 1])
train_auc = []

# Iterate until the stream is exhausted; calling next() a fixed number of
# times raised StopIteration once the file ran out.
for batch in stream:
    X_train, y_train = batch['Page content'], batch['Popularity']
    processed_X_train = get_processed_data(X_train)
    clf.partial_fit(processed_X_train, y_train, classes=classes)
    # AUC on the batch just trained on, so this is an optimistic estimate
    train_auc.append(roc_auc_score(y_train, clf.predict_proba(processed_X_train)[:, 1]))

with open('output/clf-sgd-final.pkl', 'wb') as f:
    pkl.dump(clf, f)

With the model trained, we used it to make predictions on the test dataset.

In [11]:
def get_stream(path, size):
    for chunk in pd.read_csv(path, chunksize=size):
        yield chunk


# Load the classifier trained in the previous cell
with open('output/clf-sgd-final.pkl', 'rb') as f:
    clf = pkl.load(f)
batch_size = 100
# NOTE: the path is kept as in the original run; point it at the test CSV
# to predict on the test set.
stream = get_stream(path='datasets/train.csv', size=batch_size)
y_train_B_predict = np.array([])

# Iterate until the stream is exhausted; calling next() a fixed number of
# times raised StopIteration once the file ran out.
for i, batch in enumerate(stream):
    X_test = batch['Page content']
    processed_X_test = get_processed_data(X_test)
    y_train_B_predict = np.concatenate((y_train_B_predict, clf.predict(processed_X_test)), axis=0)
    print('[{}/{}]'.format((i + 1) * batch_size, 25000))

[100/25000]
[200/25000]
[300/25000]
...

# Conclusions

Out-of-core learning:
Since we didn't have enough time, we only tried SGDClassifier and LogisticRegression. The latter did not work because we did not have enough memory, which showed us the importance of out-of-core learning and that the models we can use are limited by our computers' resources.
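
The idea behind out-of-core learning can be sketched without any library: the model sees one batch at a time and updates its weights incrementally, so the full dataset never has to sit in memory. The class below is a toy logistic regression trained with stochastic gradient descent. `TinySGDLogistic`, `batch_stream`, the learning rate, and the synthetic data are all assumptions for illustration, and this is not how SGDClassifier is implemented.

```python
import math
import random


class TinySGDLogistic:
    """Toy logistic regression that learns from one batch at a time."""

    def __init__(self, n_features, lr=0.5):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def _prob(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One SGD step per sample; labels are 0 or 1.
        for x, target in zip(X, y):
            err = self._prob(x) - target
            self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
            self.b -= self.lr * err

    def predict(self, X):
        return [1 if self._prob(x) >= 0.5 else 0 for x in X]


def batch_stream(n_batches, batch_size, rng):
    """Yield synthetic batches lazily, the way a chunked CSV reader would."""
    for _ in range(n_batches):
        X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(batch_size)]
        y = [1 if x[0] + x[1] > 0 else 0 for x in X]
        yield X, y


rng = random.Random(0)
model = TinySGDLogistic(n_features=2)
for X_batch, y_batch in batch_stream(n_batches=20, batch_size=50, rng=rng):
    model.partial_fit(X_batch, y_batch)  # only this batch is in memory

# Evaluate on a fresh batch the model has never seen
X_new, y_new = next(batch_stream(1, 200, rng))
accuracy = sum(p == t for p, t in zip(model.predict(X_new), y_new)) / len(y_new)
print('held-out accuracy: {:.2f}'.format(accuracy))
```

Memory use here depends on the batch size rather than on the size of the dataset, which is why this style of training fit on our machines when a full-batch LogisticRegression did not.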

Consistency of features:
When using out-of-core learning, we hit a pitfall: when we one-hot-encoded the weekday feature for a small batch of the training data that did not contain all seven days, the encoder produced only 6 columns, so the number of columns was inconsistent across batches. The fix was simple: we set the n_values argument of OneHotEncoder to 7. From this experience, we learned that when using out-of-core algorithms, the features produced by preprocessing must be consistent across every batch.
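
The pitfall is easy to reproduce in a few lines of plain Python. In this sketch (the helper names are hypothetical, and this is not OneHotEncoder itself), one encoder infers its categories from whatever batch it is given, while the other always emits seven columns, which is the effect we obtained by fixing the number of values to 7.

```python
def one_hot_inferred(values):
    """Infer the categories from the batch itself (this is the pitfall)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def one_hot_fixed(values, n_categories=7):
    """Always emit n_categories columns, whatever the batch contains."""
    return [[1 if v == c else 0 for c in range(n_categories)] for v in values]


batch_a = [0, 1, 2, 3, 4, 5, 6]  # every weekday appears
batch_b = [0, 1, 2, 3, 4, 5]     # no Sunday (6) in this batch

print(len(one_hot_inferred(batch_a)[0]), len(one_hot_inferred(batch_b)[0]))  # 7 6
print(len(one_hot_fixed(batch_a)[0]), len(one_hot_fixed(batch_b)[0]))        # 7 7
```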

Preprocessing:
In real life, a lot of data is not numerical and must be preprocessed before it can be fed to a model, just as in this competition. Until we preprocessed the data, we could not train a model on it at all. Moreover, how well a model predicts can depend largely on how well the data is preprocessed.