## Gathered Notebook

This notebook was generated by the Gather Extension. The intent is that it contains only the code and cells required to produce the same results as the cell originally selected for gathering. Please note that the Python analysis is quite conservative, so if it is unsure whether a line of code is necessary for execution, it will err on the side of including it.

**Please let us know if you are satisfied with what was gathered [here](https://aka.ms/gatherfeedback).**

Thanks

In [3]:
import pandas as pd

In [240]:
import re

In [242]:
df = pd.read_csv('./guardian_data.csv')

In [243]:
df['article'] = df['article'].replace('\n', '', regex=True)
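Removing newlines outright can fuse the last word of one paragraph to the first word of the next (for example "pigeons.After"), which is presumably why a space is re-inserted after full stops in the next step. A minimal sketch of an alternative, shown on a plain made-up string rather than the real data: replace each run of newlines, plus any surrounding whitespace, with a single space. The helper name `flatten_newlines` is hypothetical.

```python
import re

def flatten_newlines(text: str) -> str:
    """Replace each newline (plus surrounding whitespace) with a single space."""
    return re.sub(r"\s*\n\s*", " ", text).strip()

# Made-up sample; the real articles live in df['article'].
sample = "First paragraph ends here.\nSecond paragraph starts here.\n"
print(flatten_newlines(sample))
```

With pandas, the same pattern could be applied through `df['article'].str.replace(r'\s*\n\s*', ' ', regex=True)`. Doing so would change the character and token counts printed further down.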

### Join all the articles

In [246]:
text_list = df["article"].to_list()

all_text = " ".join(text_list)
add_space = re.sub(r'\b\.(?=\w)', ". ", all_text)
print(len(add_space))


259276
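The pattern `\b\.(?=\w)` inserts a space after any full stop that sits between two word characters, so it also splits decimals and abbreviations. The sketch below compares it with a narrower variant on a made-up sentence. The narrower pattern is an assumption, not the notebook's method, and it misses cases such as a sentence that ends in a digit.

```python
import re

# Made-up sample containing fused sentences, a decimal and an abbreviation.
sample = "pigeons.After a 3.5% rise, the U.S. reacted.Then it stopped."

# Pattern from the notebook: any full stop between two word characters.
original = re.sub(r'\b\.(?=\w)', ". ", sample)

# Narrower variant (assumption): only a full stop that follows a lowercase
# letter and precedes an uppercase letter.
narrower = re.sub(r'(?<=[a-z])\.(?=[A-Z])', ". ", sample)

print(original)
print(narrower)
```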


# NLTK

In [214]:
import nltk
from nltk.tokenize import word_tokenize

In [247]:
tokenized_text = word_tokenize(add_space)
pos_text = nltk.pos_tag(tokenized_text)
# nltk.help.upenn_tagset('NN.*')
print(len(pos_text))
pos_text[:20]


50956


[('“', 'NN'),
 ('You', 'PRP'),
 ('’', 'VBP'),
 ('ve', 'JJ'),
 ('set', 'VBN'),
 ('the', 'DT'),
 ('cat', 'NN'),
 ('among', 'IN'),
 ('the', 'DT'),
 ('pigeons', 'NNS'),
 (',', ','),
 ('”', 'NNP'),
 ('messaged', 'VBD'),
 ('a', 'DT'),
 ('Tory', 'NNP'),
 ('MP', 'NNP'),
 ('after', 'IN'),
 ('my', 'PRP$'),
 ('interview', 'NN'),
 ('with', 'IN')]
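In the output above, the curly apostrophe in "You’ve" is split off as its own token and tagged `VBP`, and the curly quotation marks are tagged as nouns. One possible remedy, offered as an assumption rather than something the notebook does, is to map typographic quotes to their ASCII equivalents before tokenizing. `QUOTE_MAP` and `normalize_quotes` are hypothetical names.

```python
# Map typographic quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"})

def normalize_quotes(text: str) -> str:
    """Return text with curly quotes and apostrophes replaced by straight ones."""
    return text.translate(QUOTE_MAP)

print(normalize_quotes("“You’ve set the cat among the pigeons,”"))
```

The normalized string could then be passed to `word_tokenize` in place of `add_space`. Whether the tags actually improve on this corpus would need checking, and the punctuation list used later would have to include the straight quotes.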

### Delete stop-words

In [8]:
from nltk.corpus import stopwords

In [9]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juancarlos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [10]:
english_stops = stopwords.words('english')
", ".join(english_stops)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [257]:
stopless_tuples = []

# Keep only (word, tag) pairs whose word is not an English stop-word.
for word, tag in pos_text:
    if word.lower() not in english_stops:
        stopless_tuples.append((word, tag))
print(len(stopless_tuples))
stopless_tuples[:10]

31679


[('“', 'NN'),
 ('’', 'VBP'),
 ('set', 'VBN'),
 ('cat', 'NN'),
 ('among', 'IN'),
 ('pigeons', 'NNS'),
 (',', ','),
 ('”', 'NNP'),
 ('messaged', 'VBD'),
 ('Tory', 'NNP')]
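`stopwords.words('english')` returns a list, so every membership test scans it from the start. A minimal sketch of the same filter with a set and a list comprehension, using made-up miniature inputs in place of `pos_text` and `english_stops`:

```python
# Hypothetical miniature inputs standing in for pos_text and english_stops.
tagged = [("You", "PRP"), ("set", "VBN"), ("the", "DT"), ("cat", "NN")]
stops = {"you", "the"}  # a set gives constant-time membership tests

# Keep the (word, tag) pairs whose lowercased word is not a stop-word.
stopless = [(word, tag) for word, tag in tagged if word.lower() not in stops]
print(stopless)
```

On the real data the set could be built once with `set(english_stops)`.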

In [11]:
# counts = {}

# for word in all_text.split():  # split first: iterating the string directly yields characters
#     if word.lower() not in english_stops:
#         counts[word] = counts.get(word, 0) + 1


View some of the "counts"

In [None]:
# peek_df = pd.DataFrame.from_dict(counts, orient='index', columns=['count'])
# peek_df.head(20).sort_values("count")


### Cleaning

In [55]:
import string

#### Exclude punctuation

In [249]:
clean_vocab = []
exclude = [",", ".", "“", "”", "‘", "’"]
# Drop tokens that are bare punctuation or quotation marks.
for word, tag in stopless_tuples:
    if word not in exclude:
        clean_vocab.append((word, tag))
print(len(clean_vocab))
clean_vocab[:20]

26331


[('set', 'VBN'),
 ('cat', 'NN'),
 ('among', 'IN'),
 ('pigeons', 'NNS'),
 ('messaged', 'VBD'),
 ('Tory', 'NNP'),
 ('MP', 'NNP'),
 ('interview', 'NN'),
 ('Liz', 'NNP'),
 ('Truss', 'NNP'),
 ('dropped', 'VBD'),
 ('Monday', 'NNP'),
 ('night', 'NN'),
 ('former', 'JJ'),
 ('prime', 'JJ'),
 ('minister', 'NN'),
 ('first', 'RB'),
 ('spoken', 'VBN'),
 ('intervention', 'NN'),
 ('since', 'IN')]
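The hand-written `exclude` list covers six characters, so other punctuation tokens (colons, brackets, dashes) pass through. A minimal sketch of a broader check built on `string.punctuation`, which the notebook imports but does not use in its active cells. The extra typographic characters, the sample data and the helper name `is_punctuation` are assumptions.

```python
import string

# ASCII punctuation plus typographic marks seen in this kind of text
# (the extra characters are an assumption about what needs removing).
PUNCT = set(string.punctuation) | set("“”‘’–…")

def is_punctuation(token: str) -> bool:
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in PUNCT for ch in token)

# Hypothetical miniature stand-in for stopless_tuples.
tagged = [("“", "NN"), ("cat", "NN"), (",", ","), ("cross-party", "JJ"), ("--", ":")]
kept = [(word, tag) for word, tag in tagged if not is_punctuation(word)]
print(kept)
```

Hyphenated words such as "cross-party" are kept because only tokens made up entirely of punctuation are dropped.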

In [250]:
counts = {}

# Count how often each (word, tag) pair occurs.
for pair in clean_vocab:
    counts[pair] = counts.get(pair, 0) + 1

# Collapse the list to unique (word, tag, count) triples.
clean_vocab = [(word, tag, count) for (word, tag), count in counts.items()]
print(len(clean_vocab))

8770
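The counting loop can also be written with `collections.Counter`, whose `most_common()` method returns the pairs sorted by descending count. A minimal sketch on made-up data, with `pairs` standing in for the `(word, tag)` list:

```python
from collections import Counter

# Hypothetical miniature stand-in for the (word, tag) list.
pairs = [("cat", "NN"), ("set", "VBN"), ("cat", "NN"), ("Tory", "NNP"), ("cat", "NN")]

pair_counts = Counter(pairs)

# Unique (word, tag, count) triples, most frequent first.
triples = [(word, tag, n) for (word, tag), n in pair_counts.most_common()]
print(triples)
```

Because the key is the `(word, tag)` pair, the same word tagged two different ways is counted as two separate entries, as in the loop above.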


### Exclude digits

In [263]:
vocab = []
# Drop entries whose word consists only of digits.
for entry in clean_vocab:
    if not entry[0].isdigit():
        vocab.append(entry)
print(len(vocab))

8658
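`str.isdigit()` is true only when every character is a digit, so tokens such as "1,000", "3.5" or "£20" would survive this filter if they occur in the vocabulary. A minimal sketch of a wider test. The pattern, the sample tokens and the helper name `is_numeric` are assumptions about what should count as a number.

```python
import re

# Optional currency sign, digits with optional "," or "." groups, optional percent sign.
NUMERIC = re.compile(r"[£$€]?\d+(?:[.,]\d+)*%?")

def is_numeric(token: str) -> bool:
    """True if the whole token looks like a number or an amount."""
    return NUMERIC.fullmatch(token) is not None

# Compare the wider test with str.isdigit() on made-up tokens.
for token in ["2022", "1,000", "3.5", "£20", "cat", "cross-party"]:
    print(token, is_numeric(token), token.isdigit())
```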


In [266]:
peek = pd.DataFrame(vocab, columns=['Word', 'POS', 'Count'])
peek.tail(10)

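| | Word | POS | Count |
|---|---|---|---|
| 8648 | realised | VBD | 1 |
| 8649 | civic | NN | 1 |
| 8650 | maintained | VBN | 1 |
| 8651 | whatever | IN | 1 |
| 8652 | Starmer | NN | 1 |
| 8653 | bent | VBN | 1 |
| 8654 | Neal | NNP | 1 |
| 8655 | Lawson | NNP | 1 |
| 8656 | cross-party | JJ | 1 |
| 8657 | Compass | NNP | 1 |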


# TO DO:
1. Find out whether proper nouns can be told apart from other capitalised words, and from lower-case words that don't need to be capitalised.
2. Export to .csv without the index.
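Both items can be sketched on a made-up vocabulary. For the first, one possible starting point (an assumption, not a settled method) is to group words by their lowercase spelling and list those seen in more than one casing. For the second, `DataFrame.to_csv` takes `index=False`. The sample rows are hypothetical, and an in-memory buffer is used where a file path would normally go.

```python
import io
from collections import defaultdict

import pandas as pd

# Hypothetical miniature stand-in for the (word, tag, count) vocabulary.
vocab = [("Party", "NNP", 2), ("party", "NN", 7), ("Truss", "NNP", 5), ("set", "VBN", 3)]

# 1. Group surface forms by lowercase spelling; words seen in more than one
#    casing are candidates for "capitalised, but not a proper noun".
forms = defaultdict(set)
for word, tag, count in vocab:
    forms[word.lower()].add(word)
mixed_case = {key: sorted(variants) for key, variants in forms.items() if len(variants) > 1}
print(mixed_case)

# 2. Write the table as CSV without the index column.
frame = pd.DataFrame(vocab, columns=["Word", "POS", "Count"])
buffer = io.StringIO()  # stands in for a file path
frame.to_csv(buffer, index=False)
print(buffer.getvalue())
```

A word that only ever appears capitalised, such as "Truss" here, is a stronger proper-noun candidate than one that also appears in lower case.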

In [267]:
# def replace_chars(s):
#     return s.translate(str.maketrans(" ", " ", string.punctuation.replace("_", "")))

# clean_count = {replace_chars(k): v for k, v in counts.items()}
# print(len(clean_count))



In [268]:
# clean_count = {k.replace(',', ' ').replace('.', ' '): v for k, v in counts.items()}

# print(len(clean_count))


#### Delete quotation marks

In [269]:
# clean_count = {k.replace('“', '').replace('”', '').replace('‘', '').replace('’', ''): v for k, v in clean_count.items()}
# print(len(clean_count))


#### Remove entries with capitalized words (potentially proper nouns)

In [270]:
# clean_count = {k: v for k, v in clean_count.items() if len(k)>0 and not k[0].isupper()}
# print(len(clean_count))


#### Remove entries starting with a pound sign (£)

In [271]:
# clean_count = {k: v for k, v in clean_count.items() if len(k)>0 and not k[0]== "£"}
# # peek_currency = {k: v for k, v in clean_count.items() if k[0]== "£"}
# print(len(clean_count))

#### Remove entries whose key starts with "–"

In [272]:
# clean_count = {k: v for k, v in clean_count.items() if len(k)>0 and not k[0]== "–"}
# # peek_dash = {k: v for k, v in clean_count.items() if k[0]== "–"}
# print(len(clean_count))

#### Remove entries with digits

In [273]:
# clean_count = {k: v for k, v in clean_count.items() if len(k)>0 and not k[0].isdigit()}
# print(len(clean_count))


#### Peek

In [274]:
# peek = pd.DataFrame.from_dict(clean_count, orient='index', columns=['count'])
# peek.sort_values("count").tail(50)

Sort the words

In [275]:
# sorted_vocabulary = sorted([ (v,k) for k,v in counts.items()], reverse=True)


The sorted list shows that further refinement is needed: the most common words are probably only marginally useful.

In [276]:
# sorted_vocabulary[:10]

Many entries at the bottom of the list (the least used) are not well-formed words and need to be cleaned up, for example by removing words in parentheses, hashtags, numbers, etc.

In [None]:
# sorted_vocabulary[-1000:]

Back to the dictionary:

In [None]:
# clean_counts = {k: v for k, v in counts.items() if not (k.startswith('#') 
# or k.startswith('(')
# or k.startswith('$')
# or k.startswith('£')
# or k.startswith('\''))}
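The chain of `startswith` calls can be collapsed, because `str.startswith` accepts a tuple of prefixes and returns `True` if any of them matches. A minimal sketch on a made-up dictionary. `UNWANTED_PREFIXES` is a hypothetical name.

```python
# Hypothetical miniature stand-in for the counts dictionary.
counts = {"#budget": 1, "(pictured": 2, "£20bn": 1, "'s": 4, "minister": 9}

UNWANTED_PREFIXES = ("#", "(", "$", "£", "'")

# Keep only the keys that start with none of the unwanted prefixes.
clean_counts = {k: v for k, v in counts.items() if not k.startswith(UNWANTED_PREFIXES)}
print(clean_counts)
```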


In [None]:
# print(len(sorted_vocabulary))

10181


In [None]:
# print(len(clean_counts))

10067


In [280]:
# for x in list(clean_counts)[900:1000]:
#     print ("key {}, value {} ".format(x,  clean_counts[x]))

In [None]:
# data = [(k, v) for k, v in clean_counts.items()]

In [None]:
# df2 = pd.DataFrame(data, columns=['key', 'value'])

In [None]:
# df2

In [None]:
# df2.sort_values(by='value', inplace=True)

In [None]:
# df2.head(20)