# The Guardian Opinions: Text Classification and Bag-of-Words

Outline:
- Import and explore data
- Apply text preprocessing techniques
- Implement the bag-of-words model

## Import and explore data

The following data was collected using a Scrapy spider, found in this repository under `/guardianscraper/guardianscraper/spiders/guardianspider.py`.

In [5]:
import pandas as pd

In [6]:
df = pd.read_csv('./guardian_data.csv')
df.head(10)

| Unnamed: 0 | date | title | author | article |
|---|---|---|---|---|
| 0 | 2023/feb/07 | Liz Truss set out her grand plan to me – and i... | Katy Balls | “You’ve set the cat among the pigeons,” messag... |
| 1 | 2023/feb/07 | The Gareth Thomas case proves it: no one wins ... | James Greig | Few public figures alive today have done more ... |
| 2 | 2023/feb/07 | There’s no cycle of violence in Jerusalem – on... | Jalal Abukhater | Almost every day, the bulldozers are on the mo... |
| 3 | 2023/feb/06 | The ‘leftwing economic establishment’ did not ... | Polly Toynbee | “This soul-searching has not been easy,” she w... |
| 4 | 2023/feb/07 | I have seen race hate in the US and UK and the... | Al Sharpton | I came to London more than 30 years ago to pro... |
| 5 | 2023/feb/07 | As the detective who inspired TV’s Prime Suspe... | Jackie Malton | We have now heard for the first time from Davi... |
| 6 | 2023/feb/07 | Britain, we had a thing with Truss and Johnson... | Marina Hyde | Liz Truss is now eluded by two major types of ... |
| 7 | 2023/feb/07 | Would I ditch a date after 51 minutes? I’ve wa... | Elle Hunt | Fifty-one minutes. It’s too long for a meeting... |
| 8 | 2023/feb/07 | As an ex police officer, this much is clear: a... | Steve White | The sentencing of David Carrick and the dreadf... |
| 9 | 2023/feb/07 | Jeremy Hunt says focus on the ‘economically in... | Frances Ryan | As the government lurches between screw-ups, s... |


In [7]:
sample_article = df.article.values[0]
sample_article

'“You’ve set the cat among the pigeons,” messaged a Tory MP after my interview with Liz Truss dropped on Monday night. The former prime minister’s first spoken intervention since leaving office saw Truss offer little in the way of a mea culpa, and instead set out her plans to carve out a place for herself on the backbenches as a committed tax-cutter. “Obviously I’ve got more time available now to think about these things and make the argument and that’s what I want to do,” she said of her post-prime ministerial plans.Given Rishi Sunak and Jeremy Hunt are trying to lower expectations ahead of the spring budget (and the budget after that), it’s exactly the type of intervention the government would rather avoid. The chancellor has repeatedly suggested now is not the time for tax cuts – instead they will only come “when the time is right”. Sunak has frequently said bringing down inflation must come first. He said the public were “not idiots” and understood this. The implication? Some in hi

Some text cleanup (removing newline characters from the articles):

In [8]:
df['article'] = df['article'].replace('\n', '', regex=True)

## Text preprocessing

Outline:
- Understand the bag-of-words model
- Tokenization
- Stop-word removal
- Stemming

### The bag-of-words intuition

1. Create a list of all the words in the articles (the vocabulary)
2. Convert each article into a vector of word counts

#### Some limitations:

   1. The vocabulary may contain too many words
   2. Some words may occur too frequently
   3. Some words may occur very rarely or only once
   4. A single word may have many forms (e.g. go, gone, going; or bird vs. birds)
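
The two steps above can be sketched in plain Python. This is only an illustration on two made-up sentences with naive whitespace splitting, not the approach applied to the Guardian data; the names `docs`, `vocabulary` and `vectors` are hypothetical.

```python
from collections import Counter

# Two toy "articles" (made up for illustration)
docs = ["the cat sat on the mat", "the dog sat"]

# Naive tokenization: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in docs]

# Step 1: the vocabulary is the sorted set of all words across all documents
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Step 2: each document becomes a vector of counts, one slot per vocabulary word
vectors = []
for tokens in tokenized:
    counts = Counter(tokens)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Even in this tiny case the word "the" dominates the counts, which is the second limitation listed above.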

### Tokenization

In [9]:
import nltk

In [10]:
from nltk.tokenize import word_tokenize


In [11]:
nltk.download('punkt')


[nltk_data] Downloading package punkt to
[nltk_data]     /Users/juancarlos/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
sample_tokenize = word_tokenize(df.article.values[0])


In [13]:
all_text = []

for text in df.article.values:
    all_text.append(text)

In [14]:
# join all text
all_joined = ' '.join(all_text)
full_tokens = word_tokenize(all_joined)
print(len(full_tokens))

50074


### Stop-word removal

Removing commonly occurring words that carry little or no meaning.

In [15]:
from nltk.corpus import stopwords


In [16]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juancarlos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
english_stops = stopwords.words('english')

In [18]:
", ".join(english_stops)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [19]:
def remove_stops(tokens):
    # Keep only tokens whose lowercase form is not an English stop word
    return [word for word in tokens if word.lower() not in english_stops]


In [20]:
stopless_tokens = remove_stops(full_tokens)

print(len(stopless_tokens))

31076


I need to remove all the punctuation ...

In [21]:
import string

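punctuation_list = list(string.punctuation)
# Typographic quotes and the en dash, which string.punctuation (ASCII only) does not cover
apostrophes = ['“', '’', '”', '‘', '–']

# Drop tokens that are exactly one of these punctuation characters
no_punctuation = [word for word in stopless_tokens if word not in punctuation_list and word not in apostrophes]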

print(len(no_punctuation))


23559
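
Note that the filter above only drops tokens that are exactly one listed character, so a multi-character token such as `...` or `--` would pass through. A possible alternative, sketched here on a hypothetical token list rather than the real data, is to keep only tokens that contain at least one letter or digit. The helper name `has_alnum` is made up for this example.

```python
def has_alnum(token):
    # True if the token contains at least one letter or digit
    return any(ch.isalnum() for ch in token)

# Hypothetical tokens, similar to what a tokenizer might produce
sample_tokens = ['Truss', '“', '...', '--', 'tax-cutter', '–', '2023', "n't"]

# Keep words, numbers and contractions; drop pure punctuation of any length
filtered = [tok for tok in sample_tokens if has_alnum(tok)]
print(filtered)
```

This keeps hyphenated words and contraction fragments intact, which may or may not be what is wanted before stemming.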


### Stemming vs. Lemmatization

While stemming truncates variations of a word to a common 'stem' (sometimes this stem is not a real word, e.g. 'decidedly' => 'decid'), lemmatization finds the dictionary form of a word, its lemma (e.g. "love" => "love", "loving" => "love", "loved" => "love"). Lemmatization relies on a dictionary lookup, which generally makes it slower and heavier than stemming.

I will opt for stemming here.

In [22]:
from nltk.stem.snowball import SnowballStemmer


In [23]:
stemmer = SnowballStemmer(language = 'english')

In [24]:
stemmed_text = [stemmer.stem(word) for word in no_punctuation]

In [25]:
print(len(stemmed_text))

23559
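
The token count is unchanged because stemming rewrites each token in place rather than removing any. To see what the stemmer does to individual words, here is a small sketch using the same `SnowballStemmer` on a few word forms mentioned earlier in this notebook. The word list is chosen for illustration only.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

# Word forms taken from the examples in this notebook
words = ['go', 'going', 'gone', 'decidedly', 'love', 'loving']

# Map each word to its stem; irregular forms such as 'gone' may not reduce to 'go'
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Rule-based stemming handles regular suffixes well, but irregular forms are where a dictionary-based lemmatizer would typically do better.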


## Bag-of-words

### Transform Text to Vectors using CountVectorizer

[CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) is a scikit-learn class that converts a collection of text documents to a matrix of token counts.

In [26]:
from sklearn.feature_extraction.text import CountVectorizer


In [27]:
vectorize = CountVectorizer()

In [28]:
vectorize.fit(stemmed_text)

CountVectorizer()

In [None]:
vectorize.vocabulary_

In [30]:
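# vocabulary_ maps each term to its column index; sort the (index, term) pairs by index, descending
sorted_vocab = sorted([(v, k) for k, v in vectorize.vocabulary_.items()], reverse=True)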


In [34]:
sorted_vocab[:10]

[(5359, 'zoë'),
 (5358, 'zone'),
 (5357, 'zombie'),
 (5356, 'zombi'),
 (5355, 'zoe'),
 (5354, 'zero'),
 (5353, 'zeitung'),
 (5352, 'zdf'),
 (5351, 'zahawi'),
 (5350, 'youtub')]

This approach doesn't suit my purpose because it returns personal names. Also, I am not sure the count is correct. I will attempt a new approach in a different notebook.
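
A possible explanation for the doubtful count, offered as a sketch rather than a fix for this notebook: `vocabulary_` maps each term to its column index in the count matrix, not to its frequency, so sorting by that value simply gives reverse alphabetical order. In addition, `fit` expects an iterable of documents, so a flat list of tokens is treated as one tiny document per token. The example below uses two made-up documents and sums the matrix columns to get actual term frequencies; the names `docs`, `term_counts` and `top_terms` are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (made up); in the notebook these would be whole articles
docs = ["the cat sat on the mat", "the dog sat"]

vectorizer = CountVectorizer()
# Rows are documents, columns are vocabulary terms
counts_matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps term -> column index (alphabetical position), not frequency
print(vectorizer.vocabulary_)

# Sum each column to get corpus-wide term frequencies
totals = np.asarray(counts_matrix.sum(axis=0)).ravel()
term_counts = {term: int(totals[idx]) for term, idx in vectorizer.vocabulary_.items()}

# Most frequent terms first
top_terms = sorted(term_counts.items(), key=lambda kv: kv[1], reverse=True)
print(top_terms)
```

This does not address the personal names in the vocabulary; that would need a separate filtering step.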