In [1]:
import numpy as np
import pandas as pd

In [2]:
data = pd.read_csv('data/Eluvio_DS_Challenge.csv')

In [3]:
data.head()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
0,1201232046,2008-01-25,3,0,Scores killed in Pakistan clashes,False,polar,worldnews
1,1201232075,2008-01-25,2,0,Japan resumes refuelling mission,False,polar,worldnews
2,1201232523,2008-01-25,3,0,US presses Egypt on Gaza border,False,polar,worldnews
3,1201233290,2008-01-25,1,0,Jump-start economy: Give health care to all,False,fadi420,worldnews
4,1201274720,2008-01-25,4,0,Council of Europe bashes EU&UN terror blacklist,False,mhermans,worldnews


In [4]:
data.tail()

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
509231,1479816764,2016-11-22,5,0,Heil Trump : Donald Trump s alt-right white...,False,nonamenoglory,worldnews
509232,1479816772,2016-11-22,1,0,There are people speculating that this could b...,False,SummerRay,worldnews
509233,1479817056,2016-11-22,1,0,Professor receives Arab Researchers Award,False,AUSharjah,worldnews
509234,1479817157,2016-11-22,1,0,Nigel Farage attacks response to Trump ambassa...,False,smilyflower,worldnews
509235,1479817346,2016-11-22,1,0,Palestinian wielding knife shot dead in West B...,False,superislam,worldnews


- At first glance, **title** and **up_votes** look like the most informative columns: in the rows shown, **down_votes** is always 0 and **category** is always `worldnews`

**Problem Statements**
1. A reasonable starting point is to use **title** as the input feature of a machine learning model and predict the number of **up_votes** a post receives.

In [5]:
authors = data.author.unique()
print(f'Unique authors: {authors}')
print(f'Num of unique authors: {len(authors)}')

categories = data.category.unique()
print(f'Unique categories: {categories}')
print(f'Num of unique categories: {len(categories)}')

over_18 = data.over_18.unique()
print(f'Over_18: {over_18}')

Unique authors: ['polar' 'fadi420' 'mhermans' ... 'calfellow' 'Randiathrowaway1'
 'SummerRay']
Num of unique authors: 85838
Unique categories: ['worldnews']
Num of unique categories: 1
Over_18: [False  True]


- **category** has a single value (`worldnews`), so it carries no information about **up_votes**. **author** has 85,838 distinct values; these counts alone do not show whether the author and **up_votes** are related, so the author is examined next. For now, **title** is treated as the main predictor of **up_votes**.

In [6]:
data.author.value_counts()

davidreiss666    8897
anutensil        5730
DoremusJessup    5037
maxwellhill      4023
igeldard         4013
                 ... 
sevans59            1
rwinston            1
Cartoon_4u          1
mona_mh69           1
piroko05            1
Name: author, Length: 85838, dtype: int64

- Some **authors** have thousands of news articles in the dataset, while many others have only one.
- Could this information help predict whether a news article will get a high number of **up_votes**?
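One way to probe this question is to aggregate **up_votes** per author. The snippet below is a minimal sketch on a made-up toy frame (the `toy` data and the column names `n_posts`, `median_votes`, `max_votes` are illustrative assumptions); the same `groupby` could be applied to `data`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the values are made up for illustration.
toy = pd.DataFrame({
    "author": ["a", "a", "a", "b", "b", "c"],
    "up_votes": [0, 3, 4800, 1, 2, 10],
})

# Per-author post count plus median and maximum up_votes,
# with the most prolific authors first.
per_author = (
    toy.groupby("author")["up_votes"]
       .agg(n_posts="count", median_votes="median", max_votes="max")
       .sort_values("n_posts", ascending=False)
)
print(per_author)
```

A large gap between an author's median and maximum would suggest that the author alone says little about how a single post performs.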

In [7]:
news_by_david = data.loc[data['author'] == 'davidreiss666']
news_by_david.sort_values(by=['up_votes'], ascending=False)

Unnamed: 0,time_created,date_created,up_votes,down_votes,title,over_18,author,category
299385,1408645314,2014-08-21,4833,0,"The president of Indonesia, the world’s most p...",False,davidreiss666,worldnews
321121,1415762158,2014-11-12,4407,0,"Study: Brazilian cops killed more than 11,000 ...",False,davidreiss666,worldnews
281476,1402490941,2014-06-11,4141,0,Chile rejects Patagonia wilderness dam project...,False,davidreiss666,worldnews
142719,1350997812,2012-10-23,3299,0,A 28-year-old Tunisian who was caught on secur...,False,davidreiss666,worldnews
268114,1397293641,2014-04-12,3062,0,Armed men dressed in camouflage clothing have ...,False,davidreiss666,worldnews
...,...,...,...,...,...,...,...,...
172105,1365423820,2013-04-08,0,0,Egypt president condemns sectarian violence: P...,False,davidreiss666,worldnews
249200,1391255460,2014-02-01,0,0,South Sudan looting of aid reflects the new na...,False,davidreiss666,worldnews
248687,1391086017,2014-01-30,0,0,Thailand PM Yingluck Shinawatra raises the sta...,False,davidreiss666,worldnews
172134,1365427586,2013-04-08,0,0,Chilean officials have begun exhuming the rema...,False,davidreiss666,worldnews


- Even for a single author, **up_votes** range from 0 to 4,833, so the author alone does not determine how many **up_votes** a news article gets

- Drop all features except "title" and "up_votes"
- Convert "up_votes" into a binary class: 1: Popular and 0: Not Popular
    - To do so, I'll need to select a threshold (which will be a hyperparameter for the model)

In [8]:
data = pd.read_csv('data/Eluvio_DS_Challenge.csv')
data = data[['title', 'up_votes']]
data.head()

Unnamed: 0,title,up_votes
0,Scores killed in Pakistan clashes,3
1,Japan resumes refuelling mission,2
2,US presses Egypt on Gaza border,3
3,Jump-start economy: Give health care to all,1
4,Council of Europe bashes EU&UN terror blacklist,4


### Let's study how up_votes is distributed
- This will help to find a threshold that divides the dataset into two classes of roughly equal size

In [9]:
data['up_votes'].describe()

count    509236.000000
mean        112.236283
std         541.694675
min           0.000000
25%           1.000000
50%           5.000000
75%          16.000000
max       21253.000000
Name: up_votes, dtype: float64
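The quartiles above (median 5, 75th percentile 16) hint at how the class split moves with the threshold. As a rough sketch, the share of posts labelled popular can be tabulated for a few candidate thresholds; the `votes` array here is made-up, skewed data for illustration, not the real column.

```python
import numpy as np

# Hypothetical vote counts, skewed like the real column (made-up values).
votes = np.array([0, 0, 1, 1, 2, 5, 5, 8, 16, 40, 300, 5000])

for threshold in (1, 5, 10, 16):
    # Fraction of posts with at least `threshold` up_votes.
    share_popular = np.mean(votes >= threshold)
    print(f"threshold={threshold:>2}: {share_popular:.2f} of posts labelled popular")
```

On the real data, replacing `votes` with `data['up_votes'].values` (before binarising) would show which threshold comes closest to an even split.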

In [10]:
# Convert up_votes into a binary class: 1 = Popular (>= threshold), 0 = Not Popular
votes_threshold = 10
data['up_votes'] = (data['up_votes'] >= votes_threshold).astype('int64')
data.up_votes.value_counts()

0    339079
1    170157
Name: up_votes, dtype: int64

- With **"10"** as the threshold for labelling an article as popular, about two thirds of the samples (339,079 of 509,236) fall in the **Not Popular** category and the remaining third in **Popular**
- One way to deal with imbalanced data is to use **class_weights** for the loss function
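As a sketch of what such weights could look like, the snippet below applies the "balanced" heuristic, `n_samples / (n_classes * count_c)`, to the class counts printed above. This is the formula scikit-learn documents for `class_weight="balanced"`; how the weights are passed depends on the model (for example, `MultinomialNB` has no `class_weight` parameter, but its `fit` method accepts `sample_weight`).

```python
import numpy as np

# Class counts taken from the value_counts() output above.
counts = np.array([339079, 170157])  # index 0: Not Popular, index 1: Popular

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
weights = counts.sum() / (len(counts) * counts)
print(weights)
```

With these weights each class contributes the same total weight, so the minority class is no longer outvoted simply because it has fewer samples.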

# Converting text into feature vectors

- To work with text data, we will first need to convert it into feature vectors so that they can be provided as an input to the ML model

### CountVectorizer handles text preprocessing and tokenizing (and stop-word filtering when `stop_words` is set), builds a dictionary of features and transforms documents into feature vectors:

- The data we are working with is in the dataframe shown below
- The **title** column (containing the text) needs to be converted into numerical features

In [11]:
data.head()

Unnamed: 0,title,up_votes
0,Scores killed in Pakistan clashes,0
1,Japan resumes refuelling mission,0
2,US presses Egypt on Gaza border,0
3,Jump-start economy: Give health care to all,0
4,Council of Europe bashes EU&UN terror blacklist,0


In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [12]:
from sklearn.model_selection import train_test_split

In [18]:
X_train, X_test, y_train, y_test = train_test_split(data['title'].values, data['up_votes'].values, test_size=0.2, random_state=42)

In [21]:
print(len(X_train), len(X_test), len(y_train), len(y_test))

407388 101848 407388 101848


In [23]:
print(X_train.shape)

(407388,)


### Tokenizing text for feature vectors

In [24]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(X_train)

In [26]:
X_train_counts.shape

(407388, 80174)

- This means there are 407,388 training samples, each represented by a feature vector of size **80,174** (one entry per vocabulary term), stored in sparse form
    - Sparse means that most entries of each title's feature vector are zero, so only the non-zero counts are stored to save memory
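To make the sparse layout concrete, here is a small sketch on three short titles (the first two come from the sample rows above, the third is made up). The names `titles`, `vect` and `counts` are illustrative and separate from the notebook's variables.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two titles from the sample rows plus one made-up title, for illustration only.
titles = [
    "Japan resumes refuelling mission",
    "US presses Egypt on Gaza border",
    "Egypt opens Gaza border",
]
vect = CountVectorizer()
counts = vect.fit_transform(titles)    # scipy sparse matrix

print(counts.shape)          # (n_titles, vocabulary_size)
print(counts.nnz)            # number of stored (non-zero) entries
print(counts[2].toarray())   # dense view of one row, mostly zeros
```

Comparing `counts.nnz` with the product of the two shape values shows how few entries actually need to be stored.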

In [28]:
X_train_counts[1].data

array([1, 1, 1, 1, 1, 1, 1, 1, 1])

### From occurrences to frequencies

In [34]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(407388, 80174)

In [35]:
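X_train_tf = TfidfTransformer(use_idf=False).fit_transform(X_train_counts)  # assumed definition: X_train_tf is not created in the cells shown
X_train_tf[1].data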

array([0.33333333, 0.33333333, 0.33333333, 0.33333333, 0.33333333,
       0.33333333, 0.33333333, 0.33333333, 0.33333333])
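The uniform values of 1/3 are what L2-normalised term frequencies without IDF weighting would give for a title with nine distinct tokens, each occurring once. A minimal check of that arithmetic, assuming the default `norm='l2'` of `TfidfTransformer`:

```python
import numpy as np

# Nine distinct tokens, each appearing once, as in X_train_counts[1].data.
row_counts = np.ones(9)

# L2 normalisation: divide by the Euclidean length of the row, sqrt(9) = 3.
row_tf = row_counts / np.linalg.norm(row_counts)
print(row_tf)
```

With IDF weighting enabled, rarer tokens would receive larger values than common ones, so the entries would generally no longer be identical.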

### Training a Classifier

In [36]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(X_train_tfidf, y_train)

In [38]:
X_test_counts = count_vect.transform(X_test)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)

In [43]:
predicted = clf.predict(X_test_tfidf)
np.unique(predicted)

array([0, 1])

In [44]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [45]:
print(accuracy_score(y_test, predicted))

0.6737000235645275


In [46]:
print(confusion_matrix(y_test, predicted))

[[66249  1622]
 [31611  2366]]
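
Accuracy alone can hide how the classifier treats the minority class. As a sketch, the usual metrics can be derived directly from the confusion matrix printed above (rows are true labels, columns are predictions, following scikit-learn's convention); `baseline` is the accuracy of always predicting **Not Popular**.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions.
cm = np.array([[66249, 1622],
               [31611, 2366]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
baseline = cm[0].sum() / cm.sum()   # always predicting "Not Popular"
precision = tp / (tp + fp)          # share of predicted Popular that is correct
recall = tp / (tp + fn)             # share of true Popular that is found
print(f"accuracy={accuracy:.4f} baseline={baseline:.4f} "
      f"precision={precision:.4f} recall={recall:.4f}")
```

These numbers indicate that the model is only slightly more accurate than the majority-class baseline and recovers a small fraction of the popular titles, which is consistent with the class imbalance noted earlier.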
