# Tentative workflow

I tried separating the workflow into semi-independent steps. It would still be good if we could try to run the whole thing on a tiny subset of the actual dataset.
Step 6 is totally independent of the previous steps.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
import bz2
import json
import pandas as pd

# Load & format quotes from 2020
# Adjust this path to match your own Drive layout
path_to_file = '/content/drive/MyDrive/ADA/Quotebank/quotes-2020.json.bz2' 

list_of_quotes_dict = []
count = 0
sample_size = 1000  # Sample chosen for current experiments

# Open the 2020 quotebank
with bz2.open(path_to_file, 'rb') as s_file:
    for instance in s_file:
        if count == sample_size:
            break
        #print(instance)
        decoded = json.loads(instance.decode('utf-8'))  # Decode each instance into a dictionary
        #print(decoded["quoteID"])
        list_of_quotes_dict.append(decoded)
        count += 1

df_quotes = pd.DataFrame(list_of_quotes_dict)  # Turn list of entries into dataframe
df_quotes.head()

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
0,2020-01-28-000082,[ D ] espite the efforts of the partners to cr...,,[],2020-01-28 08:04:05,1,"[[None, 0.7272], [Prime Minister Netanyahu, 0....",[http://israelnationalnews.com/News/News.aspx/...,E
1,2020-01-16-000088,[ Department of Homeland Security ] was livid ...,Sue Myrick,[Q367796],2020-01-16 12:00:13,1,"[[Sue Myrick, 0.8867], [None, 0.0992], [Ron Wy...",[http://thehill.com/opinion/international/4782...,E
2,2020-02-10-000142,... He (Madhav) also disclosed that the illega...,,[],2020-02-10 23:45:54,1,"[[None, 0.8926], [Prakash Rai, 0.1074]]",[https://indianexpress.com/article/business/ec...,E
3,2020-02-15-000053,"... [ I ] f it gets to the floor,",,[],2020-02-15 14:12:51,2,"[[None, 0.581], [Andy Harris, 0.4191]]",[https://patriotpost.us/opinion/68622-trump-bu...,E
4,2020-01-24-000168,[ I met them ] when they just turned 4 and 7. ...,Meghan King Edmonds,[Q20684375],2020-01-24 20:37:09,4,"[[Meghan King Edmonds, 0.5446], [None, 0.2705]...",[https://people.com/parents/meghan-king-edmond...,E


##### 1. Load the sentences for each speaker with at least X (TBD) quotes
- Not for milestone 2, but afterwards we will need info such as year, website and country, and must be able to keep it throughout the process
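
As a rough idea of how the year and website could be carried along later, here is a minimal sketch that derives them from the existing `date` and `urls` fields. The helper name `extract_metadata` and the example URLs are hypothetical, and the country is not covered since it cannot be read directly from these two fields.

```python
from datetime import datetime
from urllib.parse import urlparse


def extract_metadata(date_str, urls):
    """Return the year of a quote and the websites that published it (hypothetical helper)."""
    # Dates in the sample look like "2020-04-14 21:58:48"
    year = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S").year
    websites = set()
    for url in urls:
        host = urlparse(url).netloc  # e.g. "www.nme.com"
        if host.startswith("www."):
            host = host[4:]
        websites.add(host)
    return {"year": year, "websites": sorted(websites)}


# Made-up URLs, only the domains mirror the sample output
example = extract_metadata(
    "2020-04-14 21:58:48",
    ["https://www.nme.com/news/music/example", "http://thehill.com/opinion/example"],
)
print(example)
```

On the dataframe, something like `df_quotes.apply(lambda r: extract_metadata(r["date"], r["urls"]), axis=1)` would give one such dictionary per quote.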

In [None]:
#code here

min_quotes = 10  # Value of X

# Total number of occurrences per speaker (numOccurrences summed over that speaker's quotes)
counts = df_quotes.groupby("speaker", as_index=False)["numOccurrences"].sum()
speakers_with_many_quotes = counts[counts["numOccurrences"] >= min_quotes]  # Keep speakers with at least min_quotes occurrences
speakers_with_many_quotes = speakers_with_many_quotes[speakers_with_many_quotes["speaker"] != "None"]  # Drop unattributed quotes (speaker "None")
# .copy() so that columns added in later steps do not trigger SettingWithCopyWarning
quotes_selected_speakers = df_quotes[df_quotes["speaker"].isin(speakers_with_many_quotes["speaker"])].copy()
quotes_selected_speakers.head(10)

Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase
28,2020-03-18-000741,A face-to-face duty lawyer service provided by...,Mike Dwyer,[Q6379626],2020-03-18 07:47:15,12,"[[Mike Dwyer, 0.6042], [None, 0.3958]]",[http://www.balonnebeacon.com.au/news/tourism-...,E
29,2020-01-26-000499,a few of the candidates who will do better in ...,Dave Loebsack,[Q771586],2020-01-26 13:21:36,11,"[[Dave Loebsack, 0.9011], [None, 0.0949], [Joe...",[http://rss.cnn.com/~r/rss/cnn_allpolitics/~3/...,E
129,2020-01-10-005809,"androids, surrogates or copies of real humans,",Pranav Mistry,[Q2722796],2020-01-10 04:00:06,10,"[[Pranav Mistry, 0.6069], [None, 0.3931]]",[http://news.cnet.com/how-to/samsungs-neon-exp...,E
139,2020-01-28-006720,Apply for the privilege of having this entrepr...,Cindy Gallop,[Q5120529],2020-01-28 13:00:00,22,"[[Cindy Gallop, 0.921], [None, 0.079]]",[https://www.byronnews.com.au/news/womans-geni...,E
173,2020-02-04-008518,At least 7 million lives could be saved over t...,Tedros Adhanom Ghebreyesus,[Q16196017],2020-02-04 02:25:29,14,"[[Tedros Adhanom Ghebreyesus, 0.9326], [None, ...",[https://www.miragenews.com/who-outlines-steps...,E
254,2020-02-27-009528,But it really is that. The universe is a weird...,Melanie Johnston-Hollitt,"[Q50505758, Q53953454]",2020-02-27 16:11:43,65,"[[Melanie Johnston-Hollitt, 0.8937], [None, 0....",[http://andoveradvertiser.co.uk/news/national/...,E
263,2020-02-07-012379,but [ President ] Trump (was) eager to make a ...,President Donald Trump,[Q22686],2020-02-07 23:05:05,1,"[[President Donald Trump, 0.5698], [None, 0.43...",[http://uspolitics.einnews.com/article/5092030...,E
273,2020-02-22-004519,But think of it: A man leaks classified inform...,President Donald Trump,[Q22686],2020-02-22 16:58:48,4,"[[President Donald Trump, 0.6539], [None, 0.21...",[http://rss.cnn.com/~r/rss/cnn_allpolitics/~3/...,E
294,2020-04-14-008791,Can't wait for you to see what we've put toget...,Lady Gaga,[Q19848],2020-04-14 21:58:48,2,"[[Lady Gaga, 0.6251], [None, 0.3109], [Taylor ...",[https://www.nme.com/news/music/taylor-swift-c...,E
297,2020-02-17-009693,Caroline and me were together from the very st...,Iain Stirling,[Q5980627],2020-02-17 20:48:51,207,"[[Iain Stirling, 0.7729], [None, 0.1476], [Car...",[https://www.eonline.com/news/1123804/love-isl...,E


##### 2. A) Perform named entity recognition on the data and make sure each entity is tokenized as a single token
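
A minimal sketch of the "single token" part only, assuming the entity names are already known (for instance from an NER model or from the `speaker` column). The function `merge_entities` and the underscore separator are illustrative choices, not a decision for the project; NLTK's `MWETokenizer` offers similar behaviour.

```python
def merge_entities(tokens, entities, sep="_"):
    """Greedily merge known multi-word entities into single tokens (hypothetical helper)."""
    # Entities as token tuples, longest first, so that
    # "President Donald Trump" wins over "Donald Trump"
    patterns = sorted(
        (tuple(e.split()) for e in entities if e.split()),
        key=len,
        reverse=True,
    )
    merged, i = [], 0
    while i < len(tokens):
        for pattern in patterns:
            if tuple(tokens[i:i + len(pattern)]) == pattern:
                merged.append(sep.join(pattern))
                i += len(pattern)
                break
        else:
            # No entity starts at this position: keep the token as it is
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["President", "Donald", "Trump", "met", "Lady", "Gaga", "today"]
merged = merge_entities(tokens, ["Donald Trump", "President Donald Trump", "Lady Gaga"])
print(merged)
```

This would have to run after tokenization (step 2. B) and before stopword removal and stemming, otherwise parts of a name could be dropped or altered.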

In [None]:
# TODO Iris

##### 2. B) Tokenize the sentences (split them into single words)

In [None]:
#your best code here

import nltk
nltk.download("punkt")
from nltk.tokenize import word_tokenize

quotes_selected_speakers["tokenized_quote"] = quotes_selected_speakers.quotation.apply(word_tokenize)  # Tokenize quotes and add them in the dataframe as new column
quotes_selected_speakers.head()

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!




Unnamed: 0,quoteID,quotation,speaker,qids,date,numOccurrences,probas,urls,phase,tokenized_quote
28,2020-03-18-000741,A face-to-face duty lawyer service provided by...,Mike Dwyer,[Q6379626],2020-03-18 07:47:15,12,"[[Mike Dwyer, 0.6042], [None, 0.3958]]",[http://www.balonnebeacon.com.au/news/tourism-...,E,"[A, face-to-face, duty, lawyer, service, provi..."
29,2020-01-26-000499,a few of the candidates who will do better in ...,Dave Loebsack,[Q771586],2020-01-26 13:21:36,11,"[[Dave Loebsack, 0.9011], [None, 0.0949], [Joe...",[http://rss.cnn.com/~r/rss/cnn_allpolitics/~3/...,E,"[a, few, of, the, candidates, who, will, do, b..."
129,2020-01-10-005809,"androids, surrogates or copies of real humans,",Pranav Mistry,[Q2722796],2020-01-10 04:00:06,10,"[[Pranav Mistry, 0.6069], [None, 0.3931]]",[http://news.cnet.com/how-to/samsungs-neon-exp...,E,"[androids, ,, surrogates, or, copies, of, real..."
139,2020-01-28-006720,Apply for the privilege of having this entrepr...,Cindy Gallop,[Q5120529],2020-01-28 13:00:00,22,"[[Cindy Gallop, 0.921], [None, 0.079]]",[https://www.byronnews.com.au/news/womans-geni...,E,"[Apply, for, the, privilege, of, having, this,..."
173,2020-02-04-008518,At least 7 million lives could be saved over t...,Tedros Adhanom Ghebreyesus,[Q16196017],2020-02-04 02:25:29,14,"[[Tedros Adhanom Ghebreyesus, 0.9326], [None, ...",[https://www.miragenews.com/who-outlines-steps...,E,"[At, least, 7, million, lives, could, be, save..."


##### 3. Remove punctuation and stopwords

In [None]:
#should be easy enough, don't mess it up

nltk.download('stopwords')
from nltk.corpus import stopwords
import string

EN_STOPWORDS = set(stopwords.words("english"))  # English stopwords
PUNCTUATION = set(string.punctuation)  # Punctuation

def remove_stopwords(l):
    return [w for w in l if w.lower() not in EN_STOPWORDS]

def remove_punctuation(l):
    return [w for w in l if w not in PUNCTUATION]

quotes_selected_speakers["tokenized_quote"] = quotes_selected_speakers.tokenized_quote.apply(remove_stopwords)  # Remove stopwords from tokens
quotes_selected_speakers["tokenized_quote"] = quotes_selected_speakers.tokenized_quote.apply(remove_punctuation)  # Remove punctuation from tokens
quotes_selected_speakers[["quotation", "tokenized_quote"]].head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!




Unnamed: 0,quotation,tokenized_quote
28,A face-to-face duty lawyer service provided by...,"[face-to-face, duty, lawyer, service, provided..."
29,a few of the candidates who will do better in ...,"[candidates, better, part, world]"
129,"androids, surrogates or copies of real humans,","[androids, surrogates, copies, real, humans]"
139,Apply for the privilege of having this entrepr...,"[Apply, privilege, entrepreneurial, enterprisi..."
173,At least 7 million lives could be saved over t...,"[least, 7, million, lives, could, saved, next,..."


##### 4. Stem the words (e.g. 'eating', 'eats' -> 'eat'; irregular forms such as 'ate' are left unchanged by a stemmer and would need lemmatization)
* [Source](https://towardsdatascience.com/stemming-corpus-with-nltk-7a6a6d02d3e5)

In [None]:
#go champion, you can do it!

from nltk.stem import SnowballStemmer

# You may look into other stemming algorithms than this one, but note that no stemmer will give perfect results.
SBS = SnowballStemmer(language='english')

def stem_words(l):
    return [SBS.stem(w) for w in l]

quotes_selected_speakers["tokenized_quote"] = quotes_selected_speakers.tokenized_quote.apply(stem_words)  # Stem tokens
quotes_selected_speakers[["quotation", "tokenized_quote"]].head()



Unnamed: 0,quotation,tokenized_quote
28,A face-to-face duty lawyer service provided by...,"[face-to-fac, duti, lawyer, servic, provid, le..."
29,a few of the candidates who will do better in ...,"[candid, better, part, world]"
129,"androids, surrogates or copies of real humans,","[android, surrog, copi, real, human]"
139,Apply for the privilege of having this entrepr...,"[appli, privileg, entrepreneuri, enterpris, cr..."
173,At least 7 million lives could be saved over t...,"[least, 7, million, live, could, save, next, d..."


##### 5. A) Pool the quotes from the same speaker
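
One possible way to pool, sketched with the standard library only so it can be tried without the dataset. The helper `pool_tokens` and the toy rows are made up for illustration.

```python
from collections import defaultdict


def pool_tokens(rows):
    """Concatenate the token lists of all quotes from the same speaker.

    rows: iterable of (speaker, list_of_tokens) pairs.
    """
    pooled = defaultdict(list)
    for speaker, tokens in rows:
        pooled[speaker].extend(tokens)
    return dict(pooled)


# Toy rows standing in for the (speaker, tokenized_quote) columns
rows = [
    ("Lady Gaga", ["wait", "see"]),
    ("Iain Stirling", ["carolin", "togeth"]),
    ("Lady Gaga", ["put", "togeth"]),
]
pooled = pool_tokens(rows)
print(pooled)
```

With the dataframe from the previous steps this could be called as `pool_tokens(zip(quotes_selected_speakers.speaker, quotes_selected_speakers.tokenized_quote))`.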

In [None]:
# TODO

##### 5. B) Assign an 'importance' score to each word
- using the TF-IDF score
- using a complexity score from a dictionary (e.g. Cambridge)
- determining the rarity of a word from its frequency in the corpus (= TF-IDF?)
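
To make the first option concrete, here is a minimal TF-IDF sketch in which each speaker's pooled tokens form one document. It uses the plain textbook form, tf × log(N / df), which is an assumption: library implementations such as scikit-learn's `TfidfVectorizer` apply smoothing by default and will give different numbers. Regarding the third bullet, the IDF part measures how few speakers use a word, which is related to, but not the same as, its raw frequency in the corpus.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict mapping speaker -> list of tokens.

    Returns a dict mapping speaker -> {token: tf-idf score}.
    """
    n_docs = len(docs)
    # Document frequency: in how many speakers' documents each token appears
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))

    scores = {}
    for speaker, tokens in docs.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[speaker] = {
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        }
    return scores


# Toy documents, not taken from the dataset
docs = {
    "speaker_a": ["health", "health", "world"],
    "speaker_b": ["music", "world"],
}
scores = tf_idf(docs)
print(scores)
```

A word used by every speaker ("world" here) gets a score of 0, while a word specific to one speaker gets a positive score that grows with how often that speaker uses it.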

In [None]:
#tough one, hope you have the whole day free 

##### 6. Get data on the speakers from Wikidata (**INDEPENDENT**)
- job title (only this for milestone 2)
- education level
- area of interest
- gender
- etc
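
For the job title, one option is the public Wikidata SPARQL endpoint, queried with the QIDs from the `qids` column. The sketch below only builds the query for the occupation property (P106) and defines, without calling it, a function that would send it. The function names and the User-Agent string are placeholders, and whether this scales to the full dataset is untested.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def build_occupation_query(qids):
    """Build a SPARQL query returning the occupation labels (P106) of the given QIDs."""
    values = " ".join(f"wd:{q}" for q in qids)
    return (
        "SELECT ?person ?occupationLabel WHERE {\n"
        f"  VALUES ?person {{ {values} }}\n"
        "  ?person wdt:P106 ?occupation .\n"
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }\n'
        "}"
    )


def fetch_occupations(qids):
    """Send the query to Wikidata (network call, not executed in this sketch)."""
    params = urllib.parse.urlencode(
        {"query": build_occupation_query(qids), "format": "json"}
    )
    request = urllib.request.Request(
        WIKIDATA_SPARQL + "?" + params,
        headers={"User-Agent": "quotebank-workflow-sketch/0.1"},  # placeholder
    )
    with urllib.request.urlopen(request) as response:
        bindings = json.load(response)["results"]["bindings"]
    return [(b["person"]["value"], b["occupationLabel"]["value"]) for b in bindings]


# QIDs taken from the sample output above
query = build_occupation_query(["Q22686", "Q19848"])
print(query)
```

The other attributes could be added the same way with their own properties, for example P21 (sex or gender), P69 (educated at) and P101 (field of work).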


In [None]:
#enjoying the loner's task, are you?

##### 7. Create a corpus for one job title (look at correlation across speakers)
- Later on we will have to look at different job titles and education levels, as well as the additional info mentioned in step 1
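
A rough sketch of this step, assuming the outputs of steps 5. A (pooled tokens per speaker) and 6 (job titles per speaker) are available as dictionaries. Cosine similarity between word-count vectors is used here as one possible stand-in for "correlation across speakers"; that choice, the helper names and the toy data are all assumptions.

```python
import math
from collections import Counter


def corpus_for_job(pooled_tokens, speaker_jobs, job):
    """Keep only the speakers that have the given job title."""
    return {
        speaker: tokens
        for speaker, tokens in pooled_tokens.items()
        if job in speaker_jobs.get(speaker, ())
    }


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Toy data, not taken from the dataset
pooled_tokens = {
    "S1": ["tax", "vote", "vote"],
    "S2": ["vote", "law"],
    "S3": ["song", "tour"],
}
speaker_jobs = {"S1": ["politician"], "S2": ["politician", "lawyer"], "S3": ["singer"]}

corpus = corpus_for_job(pooled_tokens, speaker_jobs, "politician")
similarity = cosine_similarity(corpus["S1"], corpus["S2"])
print(sorted(corpus), round(similarity, 3))
```

Raw counts could be replaced by the importance scores from step 5. B once those exist.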

In [None]:
#hopefully before friday 10pm 