# News articles data analysis.

### Notes:
- Before building the word clouds and applying classification and clustering methods to our data,
    run *generate_train_test_sets.ipynb* to create the train and test sets.
  

## Preparations

### Importing Data

In [1]:
import pandas as pd

# df = pd.read_csv("data.tsv", sep='\t', encoding = 'ANSI')

# 1. Word Clouds per Category

For the wordclouds we need all the data we've got.

In [6]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt  # used by wordcloud_gen to display the image

This function takes as parameter a string representing one of the dataframe's categories,
and returns all the articles' content in that category as a string.

In [7]:
def choose_category_content(category: str) -> str:
    articles_series = df[df['category'] == category]['content']
    words = ' '.join(articles_series)
    return words

def wordcloud_gen(category):
    wordcloud = WordCloud(
        width = 1600,
        height = 1000,
        background_color = "white",
        min_font_size = 10).generate(choose_category_content(category))

    plt.figure(figsize = (16, 10), facecolor = None) 
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()
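
A word cloud sizes each word by how often it occurs, so a quick way to sanity-check a cloud is to count words in the joined category text. The sketch below is only an illustration: `toy_df` and `category_text` are hypothetical stand-ins for the real `df` and `choose_category_content`, and a plain whitespace split only approximates the tokenization and stop-word filtering that `WordCloud` applies internally.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy data standing in for the real `df`.
toy_df = pd.DataFrame({
    "category": ["sport", "sport", "tech"],
    "content": ["the match said coach", "coach said win", "phone said chip"],
})

def category_text(frame: pd.DataFrame, category: str) -> str:
    """Join the content of every article in `category` into one string."""
    return " ".join(frame[frame["category"] == category]["content"])

sport_text = category_text(toy_df, "sport")

# Raw word frequencies: the larger the count, the larger the word in a cloud.
sport_counts = Counter(sport_text.split())
print(sport_text)
print(sport_counts.most_common(3))
```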

Build wordclouds

### Business Word Cloud

In [8]:

# entertainment_wc = WordCloud(width=1920, background_color = 'white', height=1080).generate(choose_category_content('entertainment'))

# politics_wc = WordCloud(width=1920, background_color = 'white', height=1080).generate(choose_category_content('politics'))

# sport_wc = WordCloud(width=1920, background_color = 'white', height=1080).generate(choose_category_content('sport'))

# tech_wc = WordCloud(width=1920, background_color = 'white', height=1080).generate(choose_category_content('tech'))

In [9]:
# wordcloud_gen("business")

### Entertainment Word Cloud

In [10]:
# wordcloud_gen("entertainment")

### Politics Word Cloud

In [11]:
# wordcloud_gen("politics")

### Sport Word Cloud

In [12]:
# wordcloud_gen("sport")

### Tech Word Cloud

In [13]:
# wordcloud_gen("tech")

## A few observations about the word clouds
First, most of the words in each word cloud are relevant to the respective category.
Another interesting thing is the word **said**, which shows up prominently. A natural worry is that such a frequent word would skew the classification results.

We can check whether it does by running a chi-squared test on our data.

The chi-squared test measures how strongly the occurrence of a term depends on the category: a term that occurs about equally often in every category gets a low score and carries little information for the classifier.


In [14]:
# df['category_id'] = df.category.factorize()[0]

In [15]:
# tf_idf = TfidfVectorizer(max_features = 100, ngram_range = (1, 2))

# features = tf_idf.fit_transform(df.content).toarray()
# features.shape

In [16]:
# from sklearn.feature_selection import chi2
# N = 2
# labels = df.category_id
# for category, category_id in sorted(category_to_id.items()):
#     features_chi2 = chi2(features, labels == category_id)
#     indices = np.argsort(features_chi2[0])
#     feature_names = np.array(tf_idf.get_feature_names())[indices]
#     unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
#     bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
#     print("# '{}':".format(category))
#     print("  . Most correlated unigrams:\n. {}".format('\n. '.join(unigrams[-N:])))
#     print("  . Most correlated bigrams:\n. {}".format('\n. '.join(bigrams[-N:])))
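
As a small, self-contained illustration of the chi-squared idea, the sketch below scores terms on a made-up four-document corpus. The documents, the labels and the use of raw counts instead of TF-IDF weights are all assumptions made for the example; only `CountVectorizer` and `chi2` come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Made-up corpus: "said" occurs once in every document, while "match" and
# "vote" occur in one class only.
toy_docs = [
    "the striker said the match was hard",
    "the coach said the team won the match",
    "the minister said the vote was close",
    "the party said the election vote is near",
]
toy_labels = [0, 0, 1, 1]  # 0 = sport, 1 = politics

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_docs)

# chi2 returns one score and one p-value per feature (column).
scores, p_values = chi2(counts, toy_labels)

# vocabulary_ maps each term to its column index.
term_scores = {term: scores[idx] for term, idx in vectorizer.vocabulary_.items()}
for term in ("said", "match", "vote"):
    print(term, round(float(term_scores[term]), 3))
```

On this toy data a term spread evenly across both classes scores zero, while class-specific terms score higher. Whether **said** behaves the same way on the real articles has to be read off the actual output.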

# 2 Classification



### Load our test and train datasets.

In [17]:
import pandas as pd

from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV
from sklearn.metrics import (
    make_scorer,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    classification_report,
)

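# Note: 'ANSI' is a Windows-only codec alias; on other platforms an explicit
# codec is needed (which one depends on how the files were written).
train_set = pd.read_csv("train_set.tsv", sep='\t', encoding = 'ANSI')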
test_set = pd.read_csv("test_set.tsv", sep='\t', encoding = 'ANSI')
test_labels = pd.read_csv("test_labels.tsv", sep='\t', encoding = 'ANSI')

Keep only the content and category columns.

In [18]:
train_content = train_set['content']
train_labels = train_set['category']
test_content = test_set['content']
test_categories = test_labels['category']

### Encode the train and test labels using sklearn.preprocessing.LabelEncoder

In [19]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(train_labels)
train_labels = le.transform(train_labels)
test_labels = le.transform(test_categories)
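
A minimal sketch of what the encoder does, using hypothetical label lists in place of the real `train_labels` and `test_categories`. `LabelEncoder` sorts the distinct class names and maps them to consecutive integers, and `inverse_transform` maps the integers back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the real ones come from the train and test files.
toy_train = ["tech", "sport", "business", "politics", "entertainment", "sport"]
toy_test = ["sport", "tech"]

encoder = LabelEncoder()
encoder.fit(toy_train)  # learn the mapping from the training labels only

encoded_test = encoder.transform(toy_test)
decoded = encoder.inverse_transform(encoded_test)

print(list(encoder.classes_))
print(list(encoded_test), list(decoded))
```

Fitting on the training labels only means that `transform` raises an error if the test set contains a category never seen during training, which is a useful check.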

## 2.1 Transform our text data using CountVectorizer

In [20]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

count_vec = CountVectorizer(max_features=500,ngram_range=(1,2))

cv_train_content = count_vec.fit_transform(train_content)
cv_test_content = count_vec.transform(test_content)
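
To make the effect of `ngram_range=(1, 2)` concrete, here is a small sketch on made-up sentences (the documents and variable names are illustrative only). The vocabulary is learned from the training texts, and `transform` silently ignores any unigram or bigram it has not seen.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration.
toy_train_docs = ["stock market falls", "market rises"]
toy_test_docs = ["bond market"]

toy_vec = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
train_matrix = toy_vec.fit_transform(toy_train_docs)  # learns the vocabulary
test_matrix = toy_vec.transform(toy_test_docs)        # reuses that vocabulary

print(sorted(toy_vec.vocabulary_))
print(train_matrix.shape, test_matrix.sum())
```

In the test document only "market" is counted: "bond" and "bond market" are not in the training vocabulary, so they contribute nothing.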

### Cross-validating the classifiers and evaluating the predictions

In [21]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

scoring = ['precision_macro', 'recall_macro', 'f1_macro', 'accuracy']

### 2.1.1 Support Vector Machine (SVM)

First we perform a grid search to tune the model and see which parameters work best.

In [22]:
svm_clf = SVC()


param_grid = {
            'kernel': ['rbf', 'linear'],
            'C': [1e3, 5e3, 1e4, 5e4, 1e5],
            'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
clf = GridSearchCV(svm_clf, param_grid)

Fit the grid search on the training data and show the best parameters for the SVM classifier.

In [23]:
clf.fit(cv_train_content, train_labels)

GridSearchCV(cv=None, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1000.0, 5000.0, 10000.0, 50000.0, 100000.0],
                         'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
                         'kernel': ['rbf', 'linear']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [24]:
clf.best_params_

{'C': 1000.0, 'gamma': 0.0001, 'kernel': 'rbf'}
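
The grid search can also be given an explicit number of folds and a scoring metric. The sketch below is a scaled-down illustration, not the search that was run here: it uses the iris dataset bundled with scikit-learn and a much smaller, hypothetical grid so that it finishes quickly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in data; the notebook uses the vectorized article text instead.
X, y = load_iris(return_X_y=True)

# Hypothetical, deliberately small grid (2 * 2 * 2 = 8 candidates).
small_grid = {
    "kernel": ["rbf", "linear"],
    "C": [1, 10],
    "gamma": [0.01, 0.1],
}

# cv and scoring are stated explicitly instead of relying on the defaults.
search = GridSearchCV(SVC(), small_grid, cv=5, scoring="f1_macro")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
print(len(search.cv_results_["params"]))
```

Note that `gamma` has no effect on the linear kernel, so every linear candidate is fitted once per `gamma` value with identical results. Passing a list of two dictionaries, one per kernel, avoids that repeated work.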

Once we've found the best parameters, use them to perform the cross validation and test evaluation.

In [25]:
svm_clf = SVC(kernel='rbf',C=1000, gamma=0.0001)
svm_clf.fit(cv_train_content, train_labels)

SVC(C=1000, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

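Now run 10-fold cross-validation on the training set. `cross_validate` fits a fresh clone of the classifier on each fold, so the fit above does not influence these scores.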

In [43]:
svm_scores = cross_validate(svm_clf, cv_train_content, train_labels, scoring=scoring, cv=10)

In [44]:
for i in svm_scores:
    print(i,svm_scores[i].mean())

fit_time 1.1875282049179077
score_time 0.13688256740570068
test_precision_macro 0.9469065702600578
test_recall_macro 0.9441582262903015
test_f1_macro 0.9445829206432208
test_accuracy 0.9460674157303371


We will proceed to predict the labels.

In [28]:
predicted_labels = svm_clf.predict(cv_test_content)

In [29]:
print(precision_score(test_labels, predicted_labels, average='macro'))
print(recall_score(test_labels, predicted_labels, average='macro'))
print(f1_score(test_labels, predicted_labels, average='macro'))
print(accuracy_score(test_labels, predicted_labels))

0.9658951358880079
0.9600700280112046
0.9622641221420061
0.9617977528089887
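
The `*_macro` scorers and `average='macro'` used above compute the metric separately for each class and then take the unweighted mean, so every category counts equally regardless of its size. A small sketch with made-up labels (`per_class_precision` is a hypothetical helper written for this example) shows that the manual computation agrees with scikit-learn:

```python
from sklearn.metrics import precision_score

# Made-up true and predicted labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

def per_class_precision(truth, predicted, label):
    """Fraction of the predictions for `label` that are correct.

    Assumes `label` is predicted at least once (otherwise this divides by zero).
    """
    truths_where_predicted = [t for t, p in zip(truth, predicted) if p == label]
    return sum(t == label for t in truths_where_predicted) / len(truths_where_predicted)

per_class = [per_class_precision(y_true, y_pred, c) for c in sorted(set(y_true))]
manual_macro = sum(per_class) / len(per_class)  # unweighted mean over classes
sklearn_macro = precision_score(y_true, y_pred, average="macro")

print(per_class)
print(manual_macro, sklearn_macro)
```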


### 2.1.2 Random Forest Classifier

In [30]:
rf_clf = RandomForestClassifier()

In [31]:
rf_clf.fit(cv_train_content, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [42]:
rf_scores = cross_validate(rf_clf, cv_train_content, train_labels, scoring=scoring, cv=10)
for i in rf_scores:
    print(i,rf_scores[i].mean())

fit_time 3.070152688026428
score_time 0.09425168037414551
test_precision_macro 0.9564136933030689
test_recall_macro 0.9531539564899703
test_f1_macro 0.9540699869475532
test_accuracy 0.955056179775281


In [34]:
rd_predicted_labels = rf_clf.predict(cv_test_content)

In [35]:
print(precision_score(test_labels, rd_predicted_labels, average='macro'))
print(recall_score(test_labels, rd_predicted_labels, average='macro'))
print(f1_score(test_labels, rd_predicted_labels, average='macro'))
print(accuracy_score(test_labels, rd_predicted_labels))

0.958214709445628
0.9535383244206773
0.9552606029641464
0.9550561797752809


### 2.1.3 Multinomial Naive Bayes

In [36]:
mnb_clf = MultinomialNB()

In [37]:
mnb_clf.fit(cv_train_content, train_labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [41]:
mnb_scores = cross_validate(mnb_clf, cv_train_content, train_labels, scoring=scoring, cv=10)
for i in mnb_scores:
    print(i,mnb_scores[i].mean())

fit_time 0.005485796928405761
score_time 0.006685829162597657
test_precision_macro 0.9480210261769069
test_recall_macro 0.9477503527195873
test_f1_macro 0.9472689646923363
test_accuracy 0.948876404494382


In [45]:
mnb_predicted_labels = mnb_clf.predict(cv_test_content)

In [46]:
print(precision_score(test_labels, mnb_predicted_labels, average='macro'))
print(recall_score(test_labels, mnb_predicted_labels, average='macro'))
print(f1_score(test_labels, mnb_predicted_labels, average='macro'))
print(accuracy_score(test_labels, mnb_predicted_labels))

0.9622892355322696
0.9616749427043546
0.961929996436564
0.9617977528089887


## 2.2 Transform our text data using TF-IDF Vectorizer

In [101]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer(max_features=300, ngram_range=(1,2))

tfidf_train_content = tfidf_vec.fit_transform(train_content)
tfidf_test_content = tfidf_vec.transform(test_content)
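
Unlike raw counts, TF-IDF down-weights terms that occur in many documents and, with the default settings, scales each document vector to unit length. The sketch below illustrates both properties on three made-up sentences; the documents and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents: "said" occurs in all of them, "market" in only one.
toy_docs = ["market said growth", "match said goal", "vote said election"]

toy_tfidf = TfidfVectorizer()
weights = toy_tfidf.fit_transform(toy_docs)

vocab = toy_tfidf.vocabulary_
idf_said = toy_tfidf.idf_[vocab["said"]]      # term present in every document
idf_market = toy_tfidf.idf_[vocab["market"]]  # term present in one document

# With the default norm='l2', each row has Euclidean length 1.
row_norms = np.linalg.norm(weights.toarray(), axis=1)

print(idf_said, idf_market)
print(row_norms)
```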

### 2.2.1 Support Vector Machine (SVM)

In [87]:
param_grid = {
            'kernel': ['rbf', 'linear'],
            'C': [1e3, 5e3, 1e4, 5e4, 1e5],
            'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
clf = GridSearchCV(SVC(), param_grid)

Fit the grid search on the training data and show the best parameters for the SVM classifier.

In [88]:
clf.fit(tfidf_train_content, train_labels)

In [89]:
clf.best_params_

Once we've found the best parameters, use them to perform the cross validation and test evaluation.

In [108]:
svm_clf = SVC(kernel='rbf',C=1000, gamma=0.0005)

svm_clf.fit(tfidf_train_content, train_labels)

SVC(C=1000, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0005, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

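Now cross-validate the classifier on the training set. The call below does not pass `cv`, so scikit-learn's default number of folds applies (5 in the version that produced this output) rather than the 10 folds used in section 2.1.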

In [109]:
svm_scores = cross_validate(svm_clf, tfidf_train_content, train_labels, scoring=scoring)

In [110]:
for i in svm_scores:
    print(i,svm_scores[i].mean())

fit_time 1.0540414333343506
score_time 0.27834105491638184
test_precision_macro 0.9489600565531576
test_recall_macro 0.9481579969533429
test_f1_macro 0.9483229705505785
test_accuracy 0.949438202247191


We will proceed to predict the labels.

In [111]:
predicted_labels = svm_clf.predict(tfidf_test_content)

In [113]:
print(precision_score(test_labels, predicted_labels, average='macro'))
print(recall_score(test_labels, predicted_labels, average='macro'))
print(f1_score(test_labels, predicted_labels, average='macro'))
print(accuracy_score(test_labels, predicted_labels))

0.9658356311643737
0.9634536541889483
0.964450851635213
0.9640449438202248


### 2.2.2 Random Forest Classifier

In [114]:
rf_clf = RandomForestClassifier()

In [115]:
rf_clf.fit(tfidf_train_content, train_labels)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [116]:
rf_scores = cross_validate(rf_clf, tfidf_train_content, train_labels, scoring=scoring)
for i in rf_scores:
    print(i,rf_scores[i].mean())

fit_time 1.7234101295471191
score_time 0.06059713363647461
test_precision_macro 0.9449434268335754
test_recall_macro 0.9417875686123782
test_f1_macro 0.9427747764962021
test_accuracy 0.9438202247191011


In [118]:
rd_predicted_labels = rf_clf.predict(tfidf_test_content)

In [119]:
print(precision_score(test_labels, rd_predicted_labels, average='macro'))
print(recall_score(test_labels, rd_predicted_labels, average='macro'))
print(f1_score(test_labels, rd_predicted_labels, average='macro'))
print(accuracy_score(test_labels, rd_predicted_labels))

0.9510416698884191
0.9442182327476445
0.9468950044273738
0.946067415730337


### 2.2.3 Multinomial Naive Bayes

In [120]:
mnb_clf = MultinomialNB()

In [121]:
mnb_clf.fit(tfidf_train_content, train_labels)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [122]:
mnb_scores = cross_validate(mnb_clf, tfidf_train_content, train_labels, scoring=scoring)
for i in mnb_scores:
    print(i,mnb_scores[i].mean())

fit_time 0.008976411819458009
score_time 0.008979415893554688
test_precision_macro 0.9398194444961249
test_recall_macro 0.9356126834486863
test_f1_macro 0.9371527975029984
test_accuracy 0.9387640449438202


In [123]:
mnb_predicted_labels = mnb_clf.predict(tfidf_test_content)

In [124]:
print(precision_score(test_labels, mnb_predicted_labels, average='macro'))
print(recall_score(test_labels, mnb_predicted_labels, average='macro'))
print(f1_score(test_labels, mnb_predicted_labels, average='macro'))
print(accuracy_score(test_labels, mnb_predicted_labels))

0.9439110539110539
0.9405347593582889
0.9420072594647889
0.9415730337078652


## 3 Beat the Benchmark

We chose to improve the model based on Multinomial Naive Bayes. Note that the final cells of this section fit an SVC, stored in a variable named `mnb_clf`.

### Function prototype for text preprocessing.

<span style="color:DeepPink">**preprocess_article**</span>**(text)**  
&nbsp;&nbsp;Removes special characters from a given string object, removes stop words and lemmatizes words using WordNetLemmatizer().  
&nbsp;&nbsp;&nbsp;**Parameters: &nbsp;&nbsp;&nbsp;text : str**  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
String object to process. 

&nbsp;&nbsp;&nbsp;**Returns: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;text : str**  
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
Lowercase lemmatized string object without stop words and several special characters.

In [126]:
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stop_words
from nltk.stem import WordNetLemmatizer
import re
stop_words = list(stop_words)

# In a previous version of our project, the word clouds showed that the words 'said' and 'say'
# appear the most in the data, so we decided to remove them as they carry no valuable meaning.
# (The line below is currently commented out.)
# stop_words.extend(['said','say'])
wordnet_lemmatizer = WordNetLemmatizer()

# Make sure that the text parameter and the return value are strings.
def preprocess_article(text: str) -> str:
    # Remove newlines and \r characters.
    text = text.replace('\n', ' ')
    text = text.replace('\r', ' ')
    
    # Remove quotes
    text = text.replace('"', ' ')
   
    # Convert text to lowercase.
    text = text.lower()
    
    # Remove punctuation and many special characters.
    text = text.translate(str.maketrans('', '', '!?:\';.,[]()@#$%^&*£'))
   
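    # Remove terminating 's characters. Note: apostrophes were already stripped by the
    # translate call above, so as written this replace never matches.
    text = text.replace("'s", "")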

    # Remove stop words. Note: do this first and then lemmatize because lemmatizing
    # can change words like 'has' to 'ha'.
    text = ' '.join([word for word in text.split() if word not in stop_words])
    
    # Lemmatize text with WordNetLemmatizer().
    text = ' '.join([wordnet_lemmatizer.lemmatize(word) for word in text.split(' ')])
    
    # Remove all words with numbers in them (e.g. 400bn, 512kbps).
    text = re.sub(r'\w*\d\w*', '', text).strip()
    
    return text
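
The order of the punctuation and `'s` steps matters, because the punctuation list contains the apostrophe. The sketch below isolates just those two steps; `strip_then_replace` and `replace_then_strip` are hypothetical helpers written for this illustration, and the sample sentence is made up.

```python
# Same characters that preprocess_article strips, including the apostrophe.
PUNCTUATION = "!?:';.,[]()@#$%^&*£"

def strip_then_replace(text: str) -> str:
    # Order used in preprocess_article: punctuation first, then the 's suffix.
    text = text.translate(str.maketrans("", "", PUNCTUATION))
    return text.replace("'s", "")  # no apostrophes are left to match

def replace_then_strip(text: str) -> str:
    # Alternative order: drop the 's suffix while the apostrophe still exists.
    text = text.replace("'s", "")
    return text.translate(str.maketrans("", "", PUNCTUATION))

sample = "denise van outen's saturday night show"
print(strip_then_replace(sample))
print(replace_then_strip(sample))
```

With the first order the possessive survives as a trailing "s" ("outens"), while the second order removes it ("outen"). A plain `replace("'s", "")` also matches an opening quote followed by "s", so a pattern anchored at the end of a word, such as `re.sub(r"'s\b", "", text)`, may be a safer choice.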

In [128]:
train_content = train_content.apply(preprocess_article)
test_content = test_content.apply(preprocess_article)

In [129]:
test_content

0      playing politics security nation michael howar...
1      film-maker spike lee say black representation ...
2      johnny vaughan denise van outens saturday nigh...
3      annual academy award taking place february sta...
4      arsenal vice-chairman david dein said club con...
                             ...                        
440    luxury cruise liner crystal harmony currently ...
441    share elan biogen idec plunged monday firm sus...
442    manchester united winger cristiano ronaldo sai...
443    australian wing david campese told england sto...
444    leonardo dicaprio jamie foxx hilary swank atte...
Name: content, Length: 445, dtype: object

In [155]:
# from sklearn.feature_extraction.text import TfidfVectorizer
# tfidf_vec = TfidfVectorizer(max_features=300, ngram_range=(1,2))

# tfidf_train_content = tfidf_vec.fit_transform(train_content)
# tfidf_test_content = tfidf_vec.transform(test_content)

count_vec = CountVectorizer(max_features=500,ngram_range=(1,2))

cv_train_content = count_vec.fit_transform(train_content)
cv_test_content = count_vec.transform(test_content)

In [147]:
param_grid = {
            'kernel': ['rbf', 'linear'],
            'C': [1e3, 5e3, 1e4, 5e4, 1e5],
            'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
}
clf = GridSearchCV(SVC(), param_grid)

Fit the grid search on the training data and show the best parameters for the SVM classifier.

In [148]:
clf.fit(tfidf_train_content, train_labels)

GridSearchCV(cv=None, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1000.0, 5000.0, 10000.0, 50000.0, 100000.0],
                         'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1],
                         'kernel': ['rbf', 'linear']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [149]:
clf.best_params_

{'C': 1000.0, 'gamma': 0.0005, 'kernel': 'rbf'}

In [156]:
# Note: despite the variable name, this is an SVC, not MultinomialNB.
mnb_clf = SVC(kernel='rbf',C=1000, gamma=0.0005)

In [157]:
mnb_clf.fit(tfidf_train_content, train_labels)

SVC(C=1000, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0005, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [158]:
svm_scores = cross_validate(mnb_clf, tfidf_train_content, train_labels, scoring=scoring)
for i in svm_scores:
    print(i,svm_scores[i].mean())

fit_time 0.6890371799468994
score_time 0.1812415599822998
test_precision_macro 0.9494755974974487
test_recall_macro 0.9487769746708979
test_f1_macro 0.9488877162256211
test_accuracy 0.95


In [159]:
mnb_predicted_labels = mnb_clf.predict(tfidf_test_content)

In [160]:
print(precision_score(test_labels, mnb_predicted_labels, average='macro'))
print(recall_score(test_labels, mnb_predicted_labels, average='macro'))
print(f1_score(test_labels, mnb_predicted_labels, average='macro'))
print(accuracy_score(test_labels, mnb_predicted_labels))

0.9664536115630558
0.9630334861217215
0.9643538056226294
0.9640449438202248
