<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Guardian or Telegraph?

Sentiment analysis of article titles

---

**Objectives:**

1. Complete sentiment analysis manually using the sentiment dictionary from the lesson
2. Build a classification model to predict if the article was in The Guardian or the Telegraph
3. Evaluate your model with a classification report and confusion matrix
4. Do steps 1 to 3 using the VADER Sentiment Analyzer

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix

import textacy
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from tqdm import tqdm_notebook

## Load data

The data is in the file `brexit_articles.csv`.

The word sentiments are in the file `word_sentiments.csv`.

In [2]:
brexit = pd.read_csv('../../../../resource-datasets/brexit_articles/brexit_articles.csv')
sents = pd.read_csv('../../../../resource-datasets/sentiment_words/sentiment_words.csv')

## Create the `sen_dict` from the word_sentiments data frame

In [3]:
list(sents.head(3).itertuples())

[Pandas(Index=0, pos='adj', word='.22-caliber', pos_score=0.0, neg_score=0.0),
 Pandas(Index=1, pos='adj', word='.22-calibre', pos_score=0.0, neg_score=0.0),
 Pandas(Index=2, pos='adj', word='.22_caliber', pos_score=0.0, neg_score=0.0)]

In [5]:
from collections import defaultdict
sen_dict = defaultdict(dict) # set up a default dictionary with an empty dictionary as default value

for row in tqdm_notebook(sents.itertuples()):
    sen_dict[row.pos] [row.word] = {'objectivity': row.pos_score, 'pos_vs_neg': row.neg_score}

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  after removing the cwd from sys.path.


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




## Engineer an initial feature of title length

In [9]:
brexit.head(1)

Unnamed: 0,source,title,title_length
0,guardian,Sam Gyimah resigns over Theresa May's Brexit deal,49


In [8]:
brexit['title_length'] = [len(row) for row in brexit['title']]

## Complete sentiment analysis manually using the sentiment dictionary

In [35]:
en_nlp = textacy.load_spacy_lang('en_core_web_sm')

In [36]:
parsed_quotes = []
for parsed in tqdm_notebook(en_nlp.pipe(brexit['title'], batch_size=50)):
    assert parsed.is_parsed
    parsed_quotes.append(parsed)

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




In [38]:
def process_text(documents, pos=False):
    '''
    cleans sentences from stop words and punctuation and filters by pos tags if given
    returns cleaned sentence and tokenized sentence
    '''
    nlp = textacy.load_spacy_lang('en_core_web_sm')
    
    texts = []
    tokenised_texts = []

    if pos: # pos can either be False or a list of parts of speech
        for document in tqdm_notebook(nlp.pipe(documents, batch_size=200)):
            assert document.is_parsed
            tokens = [token
                      for token in document 
                      if token.is_stop == False
                      and token.pos_ in pos
                      and token.pos_ != 'PUNCT']
            doc_ = ''
            for token in tokens:
                doc_ += str(token) + ' '
            
            doc_ = doc_.strip()
            texts.append(doc_)
            tokenised_texts.append(tokens)
    
    
    else:    
        for document in tqdm_notebook(nlp.pipe(documents, batch_size=200)):
            assert document.is_parsed
            tokens = [token
                      for token in document 
                      if token.is_stop == False
                      and token.pos_ != 'PUNCT']
            doc_ = ''
            for token in tokens:
                doc_ += str(token) + ' '
            
            doc_ = doc_.strip()
            texts.append(doc_)
            tokenised_texts.append(tokens)
            
    return texts, tokenised_texts

In [40]:
pos = ['NOUN', 'ADJ', 'VERB', 'ADV']
processed_titles, tokenised_titles = process_text(brexit['title'], pos=pos)
brexit['processed_titles'] = processed_titles
brexit['tokenised_titles'] = tokenised_titles
brexit.head()

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  if sys.path[0] == '':


HBox(children=(FloatProgress(value=1.0, bar_style='info', max=1.0), HTML(value='')))




Unnamed: 0,source,title,title_length,vader_compound,vader_neg,vader_neu,vader_pos,processed_titles,tokenised_titles
0,guardian,Sam Gyimah resigns over Theresa May's Brexit deal,49,-0.3182,0.247,0.753,0.0,resigns deal,"[resigns, deal]"
1,guardian,SNP and Lib Dems back Benn amendment to preven...,62,0.0258,0.0,0.901,0.099,amendment prevent deal,"[amendment, prevent, deal]"
2,guardian,The Guardian view on Donald Trump’s credibilit...,89,0.0,0.0,1.0,0.0,Guardian view credibility compromised leader,"[Guardian, view, credibility, compromised, lea..."
3,guardian,"In this high-stakes game of Brexit, how much o...",87,0.0,0.0,1.0,0.0,high stakes game gambler,"[high, stakes, game, gambler]"
4,guardian,Brexit: McDonnell says remain would be on ball...,87,0.0,0.0,1.0,0.0,says remain ballot second referendum Politics ...,"[says, remain, ballot, second, referendum, Pol..."


In [41]:
def scorer(parsed):
    """
    Determines the average objectivity and positive-versus-negative scores 
    for a given sentence
    """
   
    obj_scores, pvn_scores = [], []
    for token in parsed:
        try:
            obj_scores.append(sen_dict[token.pos_][token.lemma_]['objectivity'])
            pvn_scores.append(sen_dict[token.pos_][token.lemma_]['pos_vs_neg'])
        except:
            pass
        
    # set default values if no token found
    if not obj_scores:
        obj_scores = [1.]
    if not pvn_scores:
        pvn_scores = [0.]
        
    return [np.mean(obj_scores), np.mean(pvn_scores)]

In [42]:
scores = brexit['tokenised_titles'].map(scorer)
brexit['objectivity_avg'] = scores.map(lambda x: x[0])
brexit['polarity_avg'] = scores.map(lambda x: x[1])

In [43]:
brexit.head()

Unnamed: 0,source,title,title_length,vader_compound,vader_neg,vader_neu,vader_pos,processed_titles,tokenised_titles,objectivity_avg,polarity_avg
0,guardian,Sam Gyimah resigns over Theresa May's Brexit deal,49,-0.3182,0.247,0.753,0.0,resigns deal,"[resigns, deal]",1.0,0.0
1,guardian,SNP and Lib Dems back Benn amendment to preven...,62,0.0258,0.0,0.901,0.099,amendment prevent deal,"[amendment, prevent, deal]",1.0,0.0
2,guardian,The Guardian view on Donald Trump’s credibilit...,89,0.0,0.0,1.0,0.0,Guardian view credibility compromised leader,"[Guardian, view, credibility, compromised, lea...",1.0,0.0
3,guardian,"In this high-stakes game of Brexit, how much o...",87,0.0,0.0,1.0,0.0,high stakes game gambler,"[high, stakes, game, gambler]",1.0,0.0
4,guardian,Brexit: McDonnell says remain would be on ball...,87,0.0,0.0,1.0,0.0,says remain ballot second referendum Politics ...,"[says, remain, ballot, second, referendum, Pol...",1.0,0.0


## Build a classification model to predict if the article was in The Guardian or the Telegraph

I am using a Random Forest Model, but if you have time do try others!!

In [45]:
X, y = brexit[['title_length','objectivity_avg','polarity_avg']], brexit['source']

scaler = StandardScaler()
Xs = scaler.fit_transform(X)

lr = LogisticRegression()
lr.fit(Xs, y)
cross_val_score(lr, X, y, cv=5).mean()

  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


0.6300136798905609

## Evaluate your model with a classification report and confusion matrix

Describe your results!

In [46]:
from sklearn.metrics import classification_report, confusion_matrix

print('Classification Report:')
print(classification_report(y, lr.predict(Xs), labels=['guardian', 'telegraph']))
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y, lr.predict(Xs)), columns=['guardian', 'telegraph'],index=['guardian', 'telegraph']))

Classification Report:
              precision    recall  f1-score   support

    guardian       0.00      0.00      0.00       150
   telegraph       0.65      0.99      0.78       277

   micro avg       0.64      0.64      0.64       427
   macro avg       0.32      0.49      0.39       427
weighted avg       0.42      0.64      0.51       427

Confusion Matrix:
           guardian  telegraph
guardian          0        150
telegraph         4        273


## Do steps 1 to 3 using the VADER Sentiment Analyzer

### Complete sentiment analysis using VADER

In [10]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [11]:
analyzer = SentimentIntensityAnalyzer()
for sentence in brexit.title:
    vs = analyzer.polarity_scores(sentence)
    print(sentence)
    print(vs)

Sam Gyimah resigns over Theresa May's Brexit deal
{'neg': 0.247, 'neu': 0.753, 'pos': 0.0, 'compound': -0.3182}
SNP and Lib Dems back Benn amendment to prevent no-deal Brexit
{'neg': 0.0, 'neu': 0.901, 'pos': 0.099, 'compound': 0.0258}
The Guardian view on Donald Trump’s credibility: America’s compromised leader | Editorial
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
In this high-stakes game of Brexit, how much of a gambler are you? | Jonathan Freedland
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Brexit: McDonnell says remain would be on ballot in a second referendum - Politics live
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Labour should back May's Brexit deal, say MP Ian Austin
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
No-deal Brexit would 'devastate' UK gaming industry, says report
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
To the Labour MPs who want to reject May’s Brexit deal – are you sure? | Ian Austin
{'neg': 0.14, 'neu': 0.

Is MP Dakin ignoring his electorate over Brexit?
{'neg': 0.278, 'neu': 0.722, 'pos': 0.0, 'compound': -0.4019}
CBI thinks it knows better about Brexit
{'neg': 0.0, 'neu': 0.674, 'pos': 0.326, 'compound': 0.4404}
I tried the exact facial loved by Julia Roberts and Julianne Moore. Here's my verdict
{'neg': 0.0, 'neu': 0.686, 'pos': 0.314, 'compound': 0.6705}
John Simpson: Company will still hope to deliver acceptable results
{'neg': 0.0, 'neu': 0.606, 'pos': 0.394, 'compound': 0.6369}
Northern Ireland church leaders urge politicians to show respect in Brexit talks
{'neg': 0.0, 'neu': 0.78, 'pos': 0.22, 'compound': 0.4767}
'Brexit Customs system will take two years to set up if no deal'
{'neg': 0.155, 'neu': 0.845, 'pos': 0.0, 'compound': -0.296}
Martin O'Neill waves goodbye to Republic of Ireland - but he can bounce back
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
Bombardier job cuts a £35m hammer blow to Northern Ireland economy
{'neg': 0.196, 'neu': 0.804, 'pos': 0.0, 'compou

In [17]:
vader_scores = brexit['title'].map(analyzer.polarity_scores)
vader_scores.head()

0    {'neg': 0.247, 'neu': 0.753, 'pos': 0.0, 'comp...
1    {'neg': 0.0, 'neu': 0.901, 'pos': 0.099, 'comp...
2    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
3    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
4    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
Name: title, dtype: object

In [21]:
from sklearn.feature_extraction import DictVectorizer

dvec = DictVectorizer()
vader_scores = dvec.fit_transform(vader_scores)

In [19]:
dvec.feature_names_

['compound', 'neg', 'neu', 'pos']

In [22]:
for i, col in enumerate(dvec.feature_names_):
    brexit['vader_{}'.format(col)] = vader_scores[:, i].toarray().ravel()
brexit.head()

Unnamed: 0,source,title,title_length,vader_compound,vader_neg,vader_neu,vader_pos
0,guardian,Sam Gyimah resigns over Theresa May's Brexit deal,49,-0.3182,0.247,0.753,0.0
1,guardian,SNP and Lib Dems back Benn amendment to preven...,62,0.0258,0.0,0.901,0.099
2,guardian,The Guardian view on Donald Trump’s credibilit...,89,0.0,0.0,1.0,0.0
3,guardian,"In this high-stakes game of Brexit, how much o...",87,0.0,0.0,1.0,0.0
4,guardian,Brexit: McDonnell says remain would be on ball...,87,0.0,0.0,1.0,0.0


### Build a classification model to predict if the article was in The Guardian or the Telegraph

In [23]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [27]:
X, y = brexit.loc[:,'title_length':'vader_pos'], brexit['source']

scaler = StandardScaler()
Xs = scaler.fit_transform(X)

lr = LogisticRegression()
lr.fit(Xs, y)
cross_val_score(lr, X, y, cv=5).mean()

  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


0.6602735978112175

### Evaluate your model with a classification report and confusion matrix

Describe your results!

In [34]:
from sklearn.metrics import classification_report, confusion_matrix

print('Classification Report:')
print(classification_report(y, lr.predict(Xs), labels=['guardian', 'telegraph']))
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y, lr.predict(Xs)), columns=['guardian', 'telegraph'],index=['guardian', 'telegraph']))

Classification Report:
              precision    recall  f1-score   support

    guardian       0.59      0.26      0.36       150
   telegraph       0.69      0.90      0.78       277

   micro avg       0.68      0.68      0.68       427
   macro avg       0.64      0.58      0.57       427
weighted avg       0.66      0.68      0.64       427

Confusion Matrix:
           guardian  telegraph
guardian         39        111
telegraph        27        250
