## Gathered Notebook

This notebook was generated by the Gather Extension. The intent is that it contains only the code and cells required to produce the same results as the cell originally selected for gathering. Please note that the Python analysis is quite conservative, so if it is unsure whether a line of code is necessary for execution, it will err on the side of including it.


In [1]:
import pandas as pd

In [2]:
import re

In [6]:
df = pd.read_json('./guardian_data.json')

In [7]:
df['text'] = df['text'].replace('\n', '', regex=True)

### Join all the articles

In [8]:
text_list = df["text"].to_list()

all_text = " ".join(text_list)
add_space = re.sub(r'\b\.(?=\w)', ". ", all_text)
print(len(add_space))


80540
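
As a small illustration of what the `re.sub` call in cell 8 does, the sketch below runs the same pattern on a made-up sentence. The sample string and the stricter `LETTER_GAP` pattern are assumptions added here, not part of the original notebook. The stricter variant is only one possible way to avoid splitting decimals such as `3.5`.

```python
import re

# Pattern from the notebook: a period preceded by a word character
# and immediately followed by another word character.
NOTEBOOK_PATTERN = r'\b\.(?=\w)'

# Hypothetical stricter variant: a lower-case letter before the period
# and an upper-case letter after it, so "3.5" is left alone.
LETTER_GAP = r'(?<=[a-z])\.(?=[A-Z])'

sample = "He left.She stayed. It cost 3.5m."

notebook_result = re.sub(NOTEBOOK_PATTERN, ". ", sample)
strict_result = re.sub(LETTER_GAP, ". ", sample)

print(notebook_result)
print(strict_result)
```

The notebook pattern also inserts a space inside `3.5`, which may or may not matter for word counts. The stricter variant would in turn miss sentences that start with a digit or a lower-case letter.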


In [9]:
with open('alltext.txt', 'w') as f:
    f.write(add_space)


# NLTK

In [10]:
import nltk
from nltk.tokenize import word_tokenize

In [11]:
tokenized_text = word_tokenize(add_space)
pos_text = nltk.pos_tag(tokenized_text)
# nltk.help.upenn_tagset('NN.*')
print(len(pos_text))
pos_text[:20]


15589


[('Friends', 'NNS'),
 ('of', 'IN'),
 ('Firsat', 'NNP'),
 ('Dag', 'NNP'),
 (',', ','),
 ('a', 'DT'),
 ('25-year-old', 'JJ'),
 ('Kurdish', 'NNP'),
 ('asylum', 'NN'),
 ('seeker', 'NN'),
 (',', ','),
 ('said', 'VBD'),
 ('he', 'PRP'),
 ('came', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('UK', 'NNP'),
 ('to', 'TO'),
 ('escape', 'VB'),
 ('violence', 'NN')]
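
A quick way to get a feel for the tagger output is to count how often each tag occurs. The sketch below does this on the first ten pairs copied from the output above, so it runs without NLTK; in the notebook the same `Counter` could presumably be built from `pos_text` directly.

```python
from collections import Counter

# First ten (word, tag) pairs copied from the output above.
sample_pos = [
    ('Friends', 'NNS'), ('of', 'IN'), ('Firsat', 'NNP'), ('Dag', 'NNP'),
    (',', ','), ('a', 'DT'), ('25-year-old', 'JJ'), ('Kurdish', 'NNP'),
    ('asylum', 'NN'), ('seeker', 'NN'),
]

# Count the tags only, ignoring the words.
tag_counts = Counter(tag for _, tag in sample_pos)
print(tag_counts.most_common(3))
```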

### Delete stop-words

In [12]:
from nltk.corpus import stopwords

In [13]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/juancarlos/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
english_stops = stopwords.words('english')
", ".join(english_stops)

"i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mus

In [15]:
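stopless_tuples = []

# Keep only the (word, tag) pairs whose word is not a stop word.
# `token` is used as the loop name to avoid shadowing the built-in `tuple`.
for token in pos_text:
    if token[0].lower() not in english_stops:
        stopless_tuples.append(token)
print(len(stopless_tuples))
stopless_tuples[:10]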

9108


[('Friends', 'NNS'),
 ('Firsat', 'NNP'),
 ('Dag', 'NNP'),
 (',', ','),
 ('25-year-old', 'JJ'),
 ('Kurdish', 'NNP'),
 ('asylum', 'NN'),
 ('seeker', 'NN'),
 (',', ','),
 ('said', 'VBD')]
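
The stop-word filter above can also be written as a list comprehension over a set. The sketch below uses a tiny hypothetical stop list (`demo_stops`) so that it runs without NLTK data; in the notebook, `set(stopwords.words('english'))` would play that role.

```python
# Hypothetical mini stop list; the notebook uses stopwords.words('english').
demo_stops = {"of", "a", "he", "to", "the"}

# (word, tag) pairs copied from the tagger output earlier in the notebook.
sample_pos = [
    ('Friends', 'NNS'), ('of', 'IN'), ('Firsat', 'NNP'), ('Dag', 'NNP'),
    (',', ','), ('a', 'DT'), ('25-year-old', 'JJ'),
]

# A set gives constant-time membership tests, unlike a list.
stopless_demo = [(word, tag) for word, tag in sample_pos
                 if word.lower() not in demo_stops]
print(stopless_demo)
```

With roughly 15,000 tokens the speed difference is unlikely to be noticeable, so this is mostly a readability choice.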

In [16]:
# counts = {}

# for word in all_text.split():  # iterating over the string itself would yield characters
#     if word.lower() not in english_stops:
#         counts[word] = counts.get(word, 0) + 1


View some of the "counts"

In [17]:
# peak_df = pd.DataFrame.from_dict(counts, orient='index', columns=['count'])
# peak_df.head(20).sort_values("count")


### Cleaning

In [18]:
import string

#### Exclude punctuation and CSS fragments

In [19]:
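clean_vocab = []
# Punctuation marks and CSS fragments found in the article text.
# Only tokens that match one of these strings exactly are dropped.
exclude = [",", ".", "“", "”", "‘", "’", "-webkit", "-ms", "flex", "email-", "nowrap", "width:", "height:"]
for token in stopless_tuples:
    if token[0] not in exclude:
        clean_vocab.append(token)
print(len(clean_vocab))
clean_vocab[:20]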

7454


[('Friends', 'NNS'),
 ('Firsat', 'NNP'),
 ('Dag', 'NNP'),
 ('25-year-old', 'JJ'),
 ('Kurdish', 'NNP'),
 ('asylum', 'NN'),
 ('seeker', 'NN'),
 ('said', 'VBD'),
 ('came', 'VBD'),
 ('UK', 'NNP'),
 ('escape', 'VB'),
 ('violence', 'NN'),
 ('Instead', 'RB'),
 ('stabbed', 'VBN'),
 ('death', 'NN'),
 ('park', 'NN'),
 ('way', 'NN'),
 ('home', 'NN'),
 ('night', 'NN'),
 ('isolated', 'JJ')]
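
The `string` module is imported above but the filter relies on a hand-written list. One possible alternative, sketched here on made-up tokens, is to drop every token that consists only of punctuation characters. The helper name `is_punctuation` and the sample list are hypothetical, and the extra curly quotes and en dash are added by hand because `string.punctuation` only covers ASCII.

```python
import string

# ASCII punctuation plus the curly quotes from the notebook's `exclude`
# list and the en dash that a later cell filters on.
PUNCT_CHARS = set(string.punctuation) | set("“”‘’–")

def is_punctuation(token: str) -> bool:
    """Return True when every character of the token is punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Made-up tokens for illustration.
tokens = ['Dag', ',', '25-year-old', '“', 'said', '.', '...']

no_punct = [tok for tok in tokens if not is_punctuation(tok)]
print(no_punct)
```

Tokens that merely contain punctuation, such as `25-year-old`, are kept. The CSS fragments would still need their own rule.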

In [20]:
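counts = {}

# Count how often each (word, tag) pair occurs.
for token in clean_vocab:
    if token in counts:
        counts[token] += 1
    else:
        counts[token] = 1

# Flatten to (word, tag, count) triples; this overwrites the list of pairs.
clean_vocab = [(word, tag, count) for (word, tag), count in counts.items()]
print(len(clean_vocab))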

3781
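
The counting loop in cell 20 is equivalent to using `collections.Counter` from the standard library. A minimal sketch on a few made-up `(word, tag)` pairs:

```python
from collections import Counter

# Made-up (word, tag) pairs; in the notebook this would be clean_vocab
# before it is overwritten with triples.
sample_pairs = [('asylum', 'NN'), ('said', 'VBD'), ('asylum', 'NN'),
                ('UK', 'NNP'), ('said', 'VBD'), ('asylum', 'NN')]

pair_counts = Counter(sample_pairs)

# Flatten to (word, tag, count) triples, as in the notebook.
triples = [(word, tag, n) for (word, tag), n in pair_counts.items()]
print(triples)
```

Because the key is the whole pair, the same word with two different tags is counted as two separate entries.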


### Exclude digits

In [21]:
vocab = []
for entry in clean_vocab:
    # str.isdigit() is True only when the whole token consists of digits.
    if not entry[0].isdigit():
        vocab.append(entry)
print(len(vocab))

3743
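
Only 38 entries are removed here because `str.isdigit()` is true only for tokens made up entirely of digits. The sketch below shows the behaviour on a few hypothetical tokens, next to a hypothetical `has_digit` helper that would be the more aggressive option.

```python
# Hypothetical tokens for illustration.
samples = ["2019", "25-year-old", "1,000", "3.5"]

def has_digit(token: str) -> bool:
    """Return True when the token contains at least one digit."""
    return any(ch.isdigit() for ch in token)

all_digits = {s: s.isdigit() for s in samples}
any_digit = {s: has_digit(s) for s in samples}

for s in samples:
    print(s, all_digits[s], any_digit[s])
```

Switching to `has_digit` would also drop entries such as `25-year-old`, which may not be wanted.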


In [22]:
vocab[:20]

[('Friends', 'NNS', 1),
 ('Firsat', 'NNP', 2),
 ('Dag', 'NNP', 3),
 ('25-year-old', 'JJ', 2),
 ('Kurdish', 'NNP', 2),
 ('asylum', 'NN', 16),
 ('seeker', 'NN', 2),
 ('said', 'VBD', 15),
 ('came', 'VBD', 8),
 ('UK', 'NNP', 17),
 ('escape', 'VB', 3),
 ('violence', 'NN', 2),
 ('Instead', 'RB', 3),
 ('stabbed', 'VBN', 2),
 ('death', 'NN', 4),
 ('park', 'NN', 2),
 ('way', 'NN', 11),
 ('home', 'NN', 7),
 ('night', 'NN', 3),
 ('isolated', 'JJ', 2)]

In [23]:
peek = pd.DataFrame(vocab, columns=['Word', 'POS', 'Count'])
peek.tail(10)

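| | Word | POS | Count |
|---|---|---|---|
| 3733 | Said | NNP | 1 |
| 3734 | sexy | JJ | 1 |
| 3735 | Maybe | RB | 1 |
| 3736 | try | VB | 1 |
| 3737 | cycling | VBG | 1 |
| 3738 | Oliver | NNP | 1 |
| 3739 | Wainwright | NNP | 1 |
| 3740 | architecture | NN | 1 |
| 3741 | design | NN | 1 |
| 3742 | critic | NN | 1 |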


In [24]:
peek

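| | Word | POS | Count |
|---|---|---|---|
| 0 | Friends | NNS | 1 |
| 1 | Firsat | NNP | 2 |
| 2 | Dag | NNP | 3 |
| 3 | 25-year-old | JJ | 2 |
| 4 | Kurdish | NNP | 2 |
| ... | ... | ... | ... |
| 3738 | Oliver | NNP | 1 |
| 3739 | Wainwright | NNP | 1 |
| 3740 | architecture | NN | 1 |
| 3741 | design | NN | 1 |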


# TO DO:
1. Check whether the tagging distinguishes proper nouns from other capitalised words, and from lower-case words that do not need to be capitalised.
2. Export to .csv without the index.
3. Add the extended name for each POS tag.
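
For the first item on the list, one possible starting point is to group entries by their lower-cased spelling and look at the words that occur with more than one casing. The sketch below uses a few triples taken from the outputs above. The names `forms_by_lower` and `mixed_case` are hypothetical, and this is only a rough heuristic, not a proper-noun detector.

```python
from collections import defaultdict

# A few (word, tag, count) triples taken from the outputs above.
sample_vocab = [
    ('said', 'VBD', 15), ('Said', 'NNP', 1),
    ('UK', 'NNP', 17), ('Instead', 'RB', 3), ('asylum', 'NN', 16),
]

# Group the surface forms by their lower-cased spelling.
forms_by_lower = defaultdict(list)
for word, tag, count in sample_vocab:
    forms_by_lower[word.lower()].append((word, tag, count))

# Keep only spellings that occur in more than one form.
mixed_case = {key: forms for key, forms in forms_by_lower.items()
              if len(forms) > 1}
print(mixed_case)
```

Here `Said` is tagged `NNP` once while `said` is tagged `VBD` fifteen times, which suggests the capitalised form is probably a sentence-initial verb rather than a name.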

In [25]:
nltk.help.upenn_tagset('NN.*')

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
NNP: noun, proper, singular
    Motown Venneboerger Czestochwa Ranzer Conchita Trumplane Christos
    Oceanside Escobar Kreisler Sawyer Cougar Yvette Ervin ODI Darryl CTCA
    Shannon A.K.C. Meltex Liverpool ...
NNPS: noun, proper, plural
    Americans Americas Amharas Amityvilles Amusements Anarcho-Syndicalists
    Andalusians Andes Andruses Angels Animals Anthony Antilles Antiques
    Apache Apaches Apocrypha ...
NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...
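
Items 2 and 3 of the TO DO list could be handled together: map each tag to a longer description and write the rows out without an index column. The sketch below uses the standard `csv` module and an in-memory buffer so that it runs on its own. The `POS_NAMES` dictionary is an assumption: the noun entries are copied from the help output above, and the other two are the usual Penn Treebank glosses.

```python
import csv
import io

# Noun descriptions copied from the tagset help above; the remaining
# entries are standard Penn Treebank glosses added for illustration.
POS_NAMES = {
    'NN': 'noun, common, singular or mass',
    'NNP': 'noun, proper, singular',
    'NNPS': 'noun, proper, plural',
    'NNS': 'noun, common, plural',
    'JJ': 'adjective',
    'VBD': 'verb, past tense',
}

# A few (word, tag, count) triples taken from the notebook's vocab output.
sample_vocab = [('Friends', 'NNS', 1), ('asylum', 'NN', 16),
                ('said', 'VBD', 15), ('Instead', 'RB', 3)]

buffer = io.StringIO()  # stands in for a real file opened with newline=''
writer = csv.writer(buffer)
writer.writerow(['Word', 'POS', 'POS name', 'Count'])
for word, tag, count in sample_vocab:
    # Fall back to the bare tag when no description is known.
    writer.writerow([word, tag, POS_NAMES.get(tag, tag), count])

csv_text = buffer.getvalue()
print(csv_text)
```

With the `peek` DataFrame from cell 23, the pandas route would be `peek.to_csv('vocab.csv', index=False)`, where the file name is only an example.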


In [26]:

# with open('tuples.txt', 'w') as f:
#     for t in vocab:
#         f.write(str(t) + '\n')

In [27]:
exceptions = [t for t in vocab if t[0].startswith("-webkit-")]
print(len(exceptions))

0


In [28]:
# def replace_chars(s):
#     return s.translate(str.maketrans(" ", " ", string.punctuation.replace("_", "")))

# clean_count = {replace_chars(k): v for k, v in counts.items()}
# print(len(clean_count))



In [29]:
# clean_count = {k.replace(',', ' ').replace('.', ' '): v for k, v in counts.items()}

# print(len(clean_count))


#### Delete quotation marks

In [30]:
# clean_count = {k.replace('“', '').replace('”', '').replace('‘', '').replace('’', ''): v for k, v in clean_count.items()}
# print(len(clean_count))


#### Remove entries that start with a capital letter (potentially proper nouns)

In [31]:
# clean_count = {k: v for k, v in clean_count.items() if len(k)>0 and not k[0].isupper()}
# print(len(clean_count))


#### Remove entries that start with a sterling sign (£)

In [32]:
# clean_count = {k: v for k, v in clean_count.items() if len(k)>0 and not k[0]== "£"}
# # peek_currency = {k: v for k, v in clean_count.items() if k[0]== "£"}
# print(len(clean_count))

#### Remove entries whose key starts with "–"

In [33]:
# clean_count = {k: v for k, v in clean_count.items() if len(k)>0 and not k[0]== "–"}
# # peek_currency = {k: v for k, v in clean_count.items() if k[0]== "£"}
# print(len(clean_count))

#### Remove entries that start with a digit

In [34]:
# clean_count = {k: v for k, v in clean_count.items() if len(k)>0 and not k[0].isdigit()}
# print(len(clean_count))


#### Peek

In [35]:
# peek = pd.DataFrame.from_dict(clean_count, orient='index', columns=['count'])
# peek.sort_values("count").tail(50)

Sort the words

In [36]:
# sorted_vocabulary = sorted([ (v,k) for k,v in counts.items()], reverse=True)


The sorted list shows that further refinement is needed: the most common entries are probably only marginally useful.

In [37]:
# sorted_vocabulary[:10]
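
The commented-out sort above works on the old `counts` dictionary. For the current `(word, tag, count)` triples, a sort on the third field would be one way to get the same view. A minimal sketch on a few triples taken from the `vocab` output:

```python
from operator import itemgetter

# A few (word, tag, count) triples taken from the vocab output above.
sample_vocab = [('Friends', 'NNS', 1), ('asylum', 'NN', 16),
                ('said', 'VBD', 15), ('UK', 'NNP', 17), ('way', 'NN', 11)]

# Sort by the count (index 2), most frequent first.
by_count = sorted(sample_vocab, key=itemgetter(2), reverse=True)
print(by_count[:3])
```

On the `peek` DataFrame the equivalent would be `peek.sort_values('Count', ascending=False)`.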

Many entries at the bottom of the list (the least used) are not well-formed words and need to be cleaned up, for example by removing words in parentheses, hashtags and numbers.

In [38]:
# sorted_vocabulary[-1000:]

Back to the dictionary:

In [39]:
# clean_counts = {k: v for k, v in counts.items() if not (k.startswith('#') 
# or k.startswith('(')
# or k.startswith('$')
# or k.startswith('£')
# or k.startswith('\''))}
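
The chain of `startswith` calls above can be collapsed, since `str.startswith` accepts a tuple of prefixes. The sketch below uses the same five prefixes on a made-up dictionary; `demo_counts` and its values are hypothetical.

```python
# Prefixes taken from the commented-out filter above.
UNWANTED_PREFIXES = ('#', '(', '$', '£', "'")

# Hypothetical word counts for illustration only.
demo_counts = {'#refugees': 1, '(pictured': 2, '£20': 1,
               'asylum': 16, "'s": 4, 'home': 7}

# Drop every key that starts with one of the unwanted prefixes.
clean_demo = {k: v for k, v in demo_counts.items()
              if not k.startswith(UNWANTED_PREFIXES)}
print(clean_demo)
```

Adding another prefix, such as the en dash handled in an earlier cell, would then only mean extending the tuple.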


In [40]:
# print(len(sorted_vocabulary))

In [41]:
# print(len(clean_counts))

In [42]:
# for x in list(clean_counts)[900:1000]:
#     print ("key {}, value {} ".format(x,  clean_counts[x]))

In [43]:
# data = [(k, v) for k, v in clean_counts.items()]

In [44]:
# df2 = pd.DataFrame(data, columns=['key', 'value'])

In [45]:
# df2

In [46]:
# df2.sort_values(by='value', inplace=True)

In [47]:
# df2.head(20)