<a href="https://colab.research.google.com/github/acharyariku/Hello-world/blob/master/EluvioDSChallenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Define a problem:**
Reading the CSV file with pandas shows that it contains 509236 rows and 8 columns: ["time_created", "date_created", "up_votes", "down_votes", "title", "over_18", "author", "category"].


A quick check shows that every value in "down_votes" is 0 and every value in "category" is "worldnews". These two columns carry no information, so I drop them, which leaves 6 columns.
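
The same check can be written generically. The following is a minimal sketch on a toy frame standing in for the real CSV; the helper name `constant_columns` and the toy values are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def constant_columns(frame: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in frame.columns if frame[col].nunique(dropna=False) == 1]


# Toy stand-in for the real data set (values are made up for illustration).
toy = pd.DataFrame({
    "up_votes": [3, 2, 3],
    "down_votes": [0, 0, 0],
    "category": ["worldnews", "worldnews", "worldnews"],
})

useless = constant_columns(toy)
toy = toy.drop(columns=useless)
print(useless)
print(list(toy.columns))
```

A column with one distinct value cannot help separate the classes, which is why dropping it loses nothing.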

Since the data has no explicit label, I define my own from the feature values. Whether a news post is attractive or not can serve as the label, and the "up_votes" values split the posts naturally into two groups: attractive and not attractive. This is intuitive and reasonable, since an attractive post should collect more "up_votes".

So which columns can serve as features that affect the label (attractive or not)? The "title" is an obvious one: an attractive title draws more readers, who are then more likely to up-vote the post. "over_18" should also be a factor, since age determines readers' interests to some extent. "author" should be a factor as well, since posts from well-known authors are more likely to be liked. The time columns ["time_created", "date_created"] probably influence the label too, but I do not expect the effect to be large, so I drop them here.

There are 85838 distinct authors, which is too many for one-hot encoding, so I drop this column for now.
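
A minimal sketch of how the number of distinct authors could be counted, together with frequency encoding as one possible alternative to one-hot encoding for a high-cardinality column. The notebook itself does not use this encoding, and the toy data below is made up for illustration.

```python
import pandas as pd

# Toy stand-in for the "author" column.
toy = pd.DataFrame({"author": ["polar", "polar", "polar", "fadi420", "mhermans"]})

# Number of distinct authors (one-hot encoding would create this many columns).
n_authors = toy["author"].nunique()

# Frequency encoding: replace each author by the share of posts they wrote,
# which yields a single numeric column regardless of the number of authors.
author_share = toy["author"].map(toy["author"].value_counts(normalize=True))

print(n_authors)
print(author_share.tolist())
```

In a real pipeline the shares would have to be computed on the training split only, so that no information from the test set leaks into the feature.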

To summarize, this is a classification problem: the label is whether a news post is attractive or not (based on "up_votes"), and the candidate features are ["title", "over_18"]. To simplify the problem, I use only the "title" feature here.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re
import pickle
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [3]:
path = "drive/My Drive/Eluvio_DS_Challenge.csv"
df = pd.read_csv(path)

In [4]:
df.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews


In [5]:
len(df)

509236

In [6]:
print(sum(df['category'] == "worldnews"))
print(sum(df["down_votes"] == 0))

509236
509236


In [7]:
# Drop the two constant columns and the two time columns in one call.
df = df.drop(columns=["category", "down_votes", "time_created", "date_created"])

In [8]:
df.head()


Unnamed: 0,up_votes,title,over_18,author
0,3,Scores killed in Pakistan clashes,False,polar
1,2,Japan resumes refuelling mission,False,polar
2,3,US presses Egypt on Gaza border,False,polar
3,1,Jump-start economy: Give health care to all,False,fadi420
4,4,Council of Europe bashes EU&UN terror blacklist,False,mhermans


In [9]:

len(df['author'])

509236

In [10]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

In [12]:
# Return the stems of the words in a text.
def tokenize_and_stem(text):
    # Tokenize by sentence first, then by word, so that punctuation becomes its own token.
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    # Keep only tokens that contain at least one letter (drops pure punctuation and numbers).
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]
    return [stemmer.stem(t) for t in filtered_tokens]

# Return the words of a text themselves, without stemming.
def tokenize_only(text):
    # Same tokenization and letter filter as above, but without the stemming step.
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    return [token for token in tokens if re.search('[a-zA-Z]', token)]
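
To illustrate what the letter filter in both functions keeps, here is a small sketch on a hand-written token list. The list is an assumption standing in for tokenizer output (so no NLTK data is needed), and `keep_tokens_with_letters` is a hypothetical helper name.

```python
import re


def keep_tokens_with_letters(tokens):
    """Keep tokens that contain at least one ASCII letter."""
    return [token for token in tokens if re.search('[a-zA-Z]', token)]


# Hand-written tokens; a real tokenizer may split a title differently.
tokens = ["council", "of", "europe", "bashes", "eu", "&", "un",
          "terror", "blacklist", "2008", "g20", "!"]

kept = keep_tokens_with_letters(tokens)
print(kept)
```

Pure numbers and punctuation are dropped, while mixed tokens such as "g20" survive because they contain a letter.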

In [13]:
#lowercase
title = df.title.str.lower()

In [14]:
# Get full stems and tokens to build vocabulary
def tokenized_stemmed(title):
    totalvocab_stemmed = []
    totalvocab_tokenized = []
    for i in title:
        allwords_stemmed = tokenize_and_stem(i) 
        totalvocab_stemmed.extend(allwords_stemmed) 

        allwords_tokenized = tokenize_only(i)
        totalvocab_tokenized.extend(allwords_tokenized)
    return totalvocab_stemmed, totalvocab_tokenized

In [15]:
totalvocab_stemmed, totalvocab_tokenized = tokenized_stemmed(title)

In [16]:
print(len(totalvocab_stemmed))

7194561


In [17]:
with open("drive/My Drive/stem_token.pkl", "wb") as f:
    pickle.dump((totalvocab_stemmed, totalvocab_tokenized), f)
with open("drive/My Drive/stem_token.pkl", "rb") as f:
    totalvocab_stemmed, totalvocab_tokenized = pickle.load(f)

In [19]:
# Remove duplicate (stem, token) pairs.
totalvocab = zip(totalvocab_stemmed, totalvocab_tokenized)
totalvocab = list(set(totalvocab))
totalvocab_stemmed, totalvocab_tokenized = zip(*totalvocab)

with open("drive/My Drive/stem_token.pkl", "wb") as f:
    pickle.dump((totalvocab_stemmed, totalvocab_tokenized), f)

with open("drive/My Drive/stem_token.pkl", "rb") as f:
    totalvocab_stemmed, totalvocab_tokenized = pickle.load(f)

In [20]:
print(len(totalvocab_stemmed))

115041


In [22]:
# Stem-to-token vocabulary: the index holds the stems, the 'words' column the original tokens.
vocab_frame = pd.DataFrame({'words': totalvocab_tokenized}, index = totalvocab_stemmed)

with open('drive/My Drive/vocab_frame.pkl', 'wb') as f:
    pickle.dump(vocab_frame, f)

with open('drive/My Drive/vocab_frame.pkl', 'rb') as f:
    vocab_frame = pickle.load(f)

In [23]:
# Build the stop-word set by combining two common lists.
import sklearn.feature_extraction.text as text
stopwords = nltk.corpus.stopwords.words('english')
# Convert to a list: newer scikit-learn versions expect a list for stop_words.
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(stopwords))

In [24]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    min_df=10**-3, analyzer='word', max_features=len(set(totalvocab_stemmed)),
    stop_words=my_stop_words, tokenizer=tokenize_and_stem, ngram_range=(1, 3))

tfidf_matrix = tfidf_vectorizer.fit_transform(title)

print(tfidf_matrix.shape)

  'stop_words.' % sorted(inconsistent))


(509236, 1814)
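
The vocabulary shrinks to 1814 terms mainly because of `min_df`. A float `min_df` is interpreted as a proportion of documents, so `10**-3` on 509236 titles means a term must occur in about 0.1% of them (509236 × 0.001 ≈ 509 titles). The sketch below shows the same mechanism on a made-up four-title corpus with `min_df=0.5`; it uses the default tokenizer and unigrams only, unlike the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, made up for illustration.
corpus = [
    "gaza border talks",
    "gaza border clash",
    "japan mission",
    "gaza talks",
]

# A float min_df is read as a proportion of documents: 0.5 means
# "keep terms that occur in at least half of the 4 documents".
vectorizer = TfidfVectorizer(min_df=0.5)
matrix = vectorizer.fit_transform(corpus)

kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(matrix.shape)
```

Note that the third title loses all of its terms and becomes an all-zero row. The same can happen to titles in the real matrix, and such rows give a classifier nothing to work with.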


In [25]:

with open("drive/My Drive/tfidf_matrix.pkl", "wb") as f:
    pickle.dump(tfidf_matrix, f)
with open("drive/My Drive/tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf_vectorizer, f)

with open("drive/My Drive/tfidf_matrix.pkl", "rb") as f:
    tfidf_matrix = pickle.load(f)
with open("drive/My Drive/tfidf_vectorizer.pkl", "rb") as f:
    tfidf_vectorizer = pickle.load(f)

In [26]:
tfidf_matrix

<509236x1814 sparse matrix of type '<class 'numpy.float64'>'
	with 3565328 stored elements in Compressed Sparse Row format>

In [27]:
# Label posts above the 80th percentile of up-votes as attractive (1), the rest as 0.
thre = np.quantile(df['up_votes'], 0.8)
y = (df['up_votes'] > thre).astype(int).to_numpy()
X_train, X_test, y_train, y_test = train_test_split(tfidf_matrix, y, test_size=0.2, shuffle=True, random_state=42)
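
Because the label is imbalanced by construction (roughly one title in five is labeled 1), a stratified split is one way to keep that ratio identical in the training and test sets. The notebook does not do this; the sketch below uses made-up labels and the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20 of them positive (an 80/20 class ratio).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 80)

# stratify=y_toy makes both splits preserve the class ratio of y_toy.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())
```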

In [28]:
clf = MultinomialNB()
clf.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [29]:
y_predict = clf.predict(X_test)
clf.score(X_test, y_test)

0.8050624459979577

In [30]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.81      1.00      0.89     81988
           1       0.56      0.00      0.00     19860

    accuracy                           0.81    101848
   macro avg       0.68      0.50      0.45    101848
weighted avg       0.76      0.81      0.72    101848
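
The accuracy of about 0.805 should be read against the class balance. A short worked calculation, using only the support counts printed in the report above, shows what a model that always predicts class 0 would score.

```python
# Support counts copied from the classification report.
n_negative = 81988
n_positive = 19860
n_total = n_negative + n_positive

# Accuracy of a trivial classifier that predicts class 0 for every title.
baseline_accuracy = n_negative / n_total
print(round(baseline_accuracy, 4))
```

The baseline is essentially the same as the reported accuracy, which matches the recall of 0.00 on class 1: the model labels almost every title as not attractive.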



In [33]:
LR = LogisticRegression(C=1.0, penalty='l2', tol=0.01)

In [34]:
LR.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.01, verbose=0,
                   warm_start=False)

In [35]:
y_predict = LR.predict(X_test)
LR.score(X_test, y_test)

0.8061817610556908

In [36]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.81      0.99      0.89     81988
           1       0.54      0.04      0.07     19860

    accuracy                           0.81    101848
   macro avg       0.68      0.52      0.48    101848
weighted avg       0.76      0.81      0.73    101848
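
Two things stand out in this run: the solver hit its iteration limit, and recall on class 1 is only 0.04. A hedged sketch of two standard knobs, `max_iter` and `class_weight="balanced"`, follows. It runs on synthetic data from `make_classification` rather than on the notebook's TF-IDF matrix, so the numbers will differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 80/20 imbalanced problem (an assumption, not the real data).
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2],
                           class_sep=0.5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# A larger max_iter gives the lbfgs solver room to converge.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# class_weight="balanced" reweights samples inversely to class frequency.
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print(recall_plain, recall_balanced)
```

Balanced weights usually raise recall on the minority class at the cost of precision, so whether this is an improvement depends on which error matters more.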



In [37]:
gbdt = GradientBoostingClassifier()
gbdt.fit(X_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [38]:
y_predict = gbdt.predict(X_test)
gbdt.score(X_test, y_test)

0.8053962768046501

In [39]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.81      1.00      0.89     81988
           1       0.73      0.00      0.01     19860

    accuracy                           0.81    101848
   macro avg       0.77      0.50      0.45    101848
weighted avg       0.79      0.81      0.72    101848
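
One common way to trade precision for recall is to lower the decision threshold on the predicted probability instead of relying on the default of 0.5. The sketch below is an illustration on synthetic data from `make_classification`, not a rerun of the notebook; the thresholds 0.3 and 0.2 are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 80/20 imbalanced problem (an assumption, not the real data).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Probability of the positive class for each test row.
proba_pos = model.predict_proba(X_te)[:, 1]

# Lowering the threshold can only add predicted positives, so recall cannot drop.
recalls = {}
for threshold in (0.5, 0.3, 0.2):
    y_hat = (proba_pos >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, y_hat)

print(recalls)
```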



In [40]:
rfc = RandomForestClassifier(n_jobs = -1, max_features = 'sqrt', n_estimators = 10, oob_score = True)
rfc.fit(X_train, y_train)

  warn("Some inputs do not have OOB scores. "
  predictions[k].sum(axis=1)[:, np.newaxis])


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='sqrt',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=True, random_state=None, verbose=0,
                       warm_start=False)

In [41]:
y_predict = rfc.predict(X_test)
rfc.score(X_test, y_test)

0.7933685492105883

In [42]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.81      0.97      0.88     81988
           1       0.32      0.05      0.09     19860

    accuracy                           0.79    101848
   macro avg       0.56      0.51      0.49    101848
weighted avg       0.71      0.79      0.73    101848
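
The OOB warning during fitting is expected with only 10 trees. As a rough worked estimate, assume each bootstrap sample contains a given row with probability about 1 - 1/e, and take the training size as 509236 - 101848 rows. The expected number of rows that are never out-of-bag can then be computed as follows; `expected_rows_without_oob` is a hypothetical helper.

```python
import math

# Rows left for training after the 20% test split.
n_train = 509236 - 101848

# Approximate chance that a given row is drawn into one bootstrap sample.
p_in_bag = 1 - math.exp(-1)


def expected_rows_without_oob(n_rows, n_trees):
    """Expected number of rows that are in-bag for every tree (no OOB prediction)."""
    return n_rows * p_in_bag ** n_trees


print(round(expected_rows_without_oob(n_train, 10)))
print(expected_rows_without_oob(n_train, 100))
```

Under these assumptions a few thousand training rows have no OOB prediction with 10 trees, while with 100 trees the expected count is practically zero.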



In [43]:
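# Import only the classifier, so the name `xgb` can be used for the model instance
# without shadowing the xgboost module.
from xgboost.sklearn import XGBClassifier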

In [44]:
xgb = XGBClassifier(
    learning_rate=0.1,
    n_estimators=1000,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27,
)

In [45]:
xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=None, n_estimators=1000, n_jobs=1,
              nthread=4, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=27,
              silent=None, subsample=0.8, verbosity=1)

In [46]:
y_predict = xgb.predict(X_test)

In [47]:
xgb.score(X_test, y_test)

0.8062406723745189

In [48]:
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.81      0.99      0.89     81988
           1       0.54      0.04      0.08     19860

    accuracy                           0.81    101848
   macro avg       0.68      0.52      0.48    101848
weighted avg       0.76      0.81      0.73    101848
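
To compare the five models side by side, the scores printed above can be collected in one table. The sketch below only copies numbers from the outputs in this notebook (accuracy rounded to four decimals, class 1 metrics from the classification reports); the column names are my own.

```python
import pandas as pd

# Values copied from the accuracy scores and class 1 rows of the reports above.
results = pd.DataFrame(
    {
        "accuracy": [0.8051, 0.8062, 0.8054, 0.7934, 0.8062],
        "precision_1": [0.56, 0.54, 0.73, 0.32, 0.54],
        "recall_1": [0.00, 0.04, 0.00, 0.05, 0.04],
        "f1_1": [0.00, 0.07, 0.01, 0.09, 0.08],
    },
    index=["MultinomialNB", "LogisticRegression", "GradientBoostingClassifier",
           "RandomForestClassifier", "XGBClassifier"],
)

# Rank by F1 on the positive class, which accuracy alone hides here.
ranked = results.sort_values("f1_1", ascending=False)
print(ranked)
```

Every model recalls at most 5% of the attractive posts, so the accuracy values near 0.8 mostly reflect the 80/20 class split rather than predictive skill.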

