# Assignment A3: Embeddings and Parsing

Covering material from notebooks 7 and 8 

# Word Embeddings

**Training word2vec**

In this section, we train a word2vec model using gensim. We train the model on text8 (which consists of the first 90M characters of a Wikipedia dump from 2006 and is considered one of the benchmarks for evaluating language models).

In [1]:
import gensim.downloader as api

api.info("text8")

{'num_records': 1701,
 'record_format': 'list of str (tokens)',
 'file_size': 33182058,
 'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/text8/__init__.py',
 'license': 'not found',
 'description': 'First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets.',
 'checksum': '68799af40b6bda07dfa47a32612e5364',
 'file_name': 'text8.gz',
 'read_more': ['http://mattmahoney.net/dc/textdata.html'],
 'parts': 1}

In [2]:
dataset = api.load("text8")



In [6]:
from gensim.models import Word2Vec

##TODO train a word2vec model on this dataset, only consider words which appear at least 10 times in the corpus
model = Word2Vec(sentences=dataset, min_count=10)

**Word Similarities**

gensim models provide almost all the utility you might want to wish for to perform standard word similarity tasks. They are available in the .wv (wordvectors) attribute of the model, more details could be found [here](https://radimrehurek.com/gensim/models/keyedvectors.html).

In [8]:
##TODO find the closest words to king
sims = model.wv.most_similar('king', topn=10)  # get other similar words
print(sims)

[('prince', 0.7549830079078674), ('queen', 0.713581919670105), ('emperor', 0.7002630829811096), ('throne', 0.6998481154441833), ('vii', 0.6928168535232544), ('kings', 0.6912181973457336), ('regent', 0.6725978255271912), ('sigismund', 0.6714800596237183), ('aragon', 0.6675142645835876), ('elector', 0.6599542498588562)]


King is to man as woman is to X

In [14]:
##TODO find the closest word for the vector "woman" + "king" - "man"
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"])[0])

# what's going on under the hood?
vec_king, vec_man, vec_woman = model.wv.get_vector("king", norm=True), model.wv.get_vector("man", norm=True), model.wv.get_vector("woman", norm=True)
result = model.wv.similar_by_vector(vec_king - vec_man + vec_woman)[1]
print(result)

('queen', 0.6591334342956543)
('queen', 0.6591334342956543)


**Evaluate Word Similarities** 

One common way to evaluate word2vec models are word analogy tasks. Let's check how good our model is on one of those. We consider the [WordSim353](http://alfonseca.org/eng/research/wordsim353.html) benchmark, the task is to determine how similar two words are.

In [15]:
!wget http://alfonseca.org/pubs/ws353simrel.tar.gz
!tar xf ws353simrel.tar.gz

path = "wordsim353_sim_rel/wordsim_similarity_goldstandard.txt"

def load_data(path):
    X, y = [], []
    with open(path) as f:
        for line in f:
            line = line.strip().split("\t")
            X.append((line[0], line[1])) # each entry in x contains two words, e.g. X[0] = (tiger, cat)
            y.append(float(line[-1])) # each entry in y is the annotation how similar two words are, e.g. Y[0] = 7.35
    return X, y

X, y = load_data(path)
print (X[:3], y[:3])

--2022-11-21 11:11:55--  http://alfonseca.org/pubs/ws353simrel.tar.gz
Resolving alfonseca.org (alfonseca.org)... 162.215.249.67
Connecting to alfonseca.org (alfonseca.org)|162.215.249.67|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5460 (5.3K) [application/x-gzip]
Saving to: 'ws353simrel.tar.gz'


2022-11-21 11:11:56 (82.7 MB/s) - 'ws353simrel.tar.gz' saved [5460/5460]

[('tiger', 'cat'), ('tiger', 'tiger'), ('plane', 'car')] [7.35, 10.0, 5.77]


In [24]:
##TODO compute how similar the pairs in the WordSim353 are according to our model
##TODO if  aword is not present in our model, we assign similarity 0 for the respective text pair
results = []
for i in range(len(X)):
    try:
        vec1, vec2 = model.wv.get_vector(X[i][0], norm=True), model.wv.get_vector(X[i][1], norm=True)
        results.append(model.wv.similarity(X[i][0], X[i][1]))
    except:
        results.append(0)

0.59547836
1.0
0.43516302
[0.59547836, 1.0, 0.43516302, 0.5323238, 0.74368805, 0.5204391, 0.68291867, 0.71805686, 0.51681644, 0.43424064, 0.6265508, 0.5278405, 0.36214787, 0.6459682, 0.713582, 0.2549085, 0.5573806, 0.1103385, 0.81987226, 0.7906139, 0.71744037, 0, 0.7581198, 0.6396023, 0.7743103, 0.7803402, 0.60831726, 0.39107767, 0.75948596, 0.36391062, 0.75969815, 0, 0.71910495, 0.7711973, 0.5829115, 0.69855523, 0.3797861, 0.5059833, 0.046145216, 0.38520706, 0.40653822, 0.6798787, 0.37402877, 0.1627388, 0.3341751, 0.22194439, -0.009029146, 0.26948124, 0.7713224, 0.66677904, 0.57125586, 0.577476, 0.7543255, 0.62265646, 0.36359268, 0.41836524, 0, 0.11971237, -0.11856868, -0.04326883, 0.4479499, 0.57966703, 0.811289, 0.7102834, 0, 0.3007991, 0.42208344, -0.029927105, 0, 0.15269986, 0.3453254, 0.15259264, 0.58035094, 0.19801103, 0.7063954, 0.23811662, 0.34194934, 0.5081974, 0.75541615, 0.35168144, 0.5062486, 0.15133363, 0.37584507, 0.6490791, 0.4026931, 0.69814104, 0.69137526, 0.30182022,

In [26]:
from scipy.stats import spearmanr

##TODO compute spearman's rank correlation between our prediction and the human annotations
spearmanr(y, results)

SpearmanrResult(correlation=0.6514524845694514, pvalue=6.697848116658943e-26)

In [31]:
import spacy
en = spacy.load('en_core_web_md')

##TODO compute word similarities in the WordSim353 dataset using spaCy word embeddings
results_spacy = []
for i in range(len(X)):
    try:
        results_spacy.append(en(X[i][0]).similarity(en(X[i][1])))
    except:
        results_spacy.append(0)
##TODO compute spearman's rank correlation between these similarities and the human annotations
spearmanr(y, results_spacy)
# Don't worry if results are not too convincing for this experiment

SpearmanrResult(correlation=0.6445293311100014, pvalue=3.203028082291722e-25)

# Document Embeddings

**Task 1**
In this task, we evaluate different document embeddings on the English version of the [STS Benchmark](https://arxiv.org/pdf/1708.00055.pdf). The task is to determine how semantically similar two texts are and is a popular dataset to evaluate document embeddings, i.e. we want embeddings of two semantically similar documents to be similar as well. We provide a wordcounts baseline for this task and ask you to compute and evaluate embeddings for a selected sample of document embedding techniques.

To evaluate, we follow [(Reimers and Gurevych, 2019)](https://arxiv.org/pdf/1908.10084.pdf) and compute the Spearmanâ€™s rank correlation between the cosine-similarity of thesentence embeddings and the gold labels. 

In [32]:
# obtain the data
!wget http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.eval.v1.1.zip
!wget http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.gs.zip

!unzip sts2017.eval.v1.1.zip 
!unzip sts2017.gs.zip 

--2022-11-21 11:44:12--  http://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.eval.v1.1.zip
Resolving alt.qcri.org (alt.qcri.org)... 80.76.166.234
Connecting to alt.qcri.org (alt.qcri.org)|80.76.166.234|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.eval.v1.1.zip [following]
--2022-11-21 11:44:12--  https://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.eval.v1.1.zip
Connecting to alt.qcri.org (alt.qcri.org)|80.76.166.234|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 87902 (86K) [application/zip]
Saving to: 'sts2017.eval.v1.1.zip'


2022-11-21 11:44:13 (647 KB/s) - 'sts2017.eval.v1.1.zip' saved [87902/87902]

URL transformed to HTTPS due to an HSTS policy
--2022-11-21 11:44:13--  https://alt.qcri.org/semeval2017/task1/data/uploads/sts2017.gs.zip
Resolving alt.qcri.org (alt.qcri.org)... 80.76.166.234
Connecting to alt.qcri.org (alt.qcri.org)|80.76

In [33]:
# load the data

def load_STS_data():
    with open("STS2017.gs/STS.gs.track5.en-en.txt") as f:
        labels = [float(line.strip()) for line in f]
    
    text_a, text_b = [], []
    with open("STS2017.eval.v1.1/STS.input.track5.en-en.txt") as f:
        for line in f:
            line = line.strip().split("\t")
            text_a.append(line[0])
            text_b.append(line[1])
    return text_a, text_b, labels

text_a, text_b, labels = load_STS_data()
text_a[0], text_b[0], labels[0]

('A person is on a baseball team.',
 'A person is playing basketball on a team.',
 2.4)

In [34]:
# some utils
from scipy.stats import spearmanr
def evaluate(predictions, labels):
    print ("spearman's rank correlation", spearmanr(predictions, labels)[0])

import numpy as np
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(a,b):
    return dot(a, b)/(norm(a)*norm(b))


In [35]:
# Wordcounts baseline
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer()
vec.fit(text_a + text_b)

# encode documents
text_a_encoded = np.array(vec.transform(text_a).todense())
text_b_encoded = np.array(vec.transform(text_b).todense())

# predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)

spearman's rank correlation 0.6998056665685976


In [50]:
##TODO train Doc2Vec on the texts in the dataset
##TODO derive the word vectors for each text in the dataset
##TODO compute cosine similarity between the text pairs and evaluate spearman's rank correlation
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
text_a_raw = [TaggedDocument(doc, [i]) for i, doc in enumerate(text_a)]
text_b_raw = [TaggedDocument(doc, [i]) for i, doc in enumerate(text_b)]
text = text_a_raw + text_b_raw
model = Doc2Vec(text)
text_a_encoded = np.array([model.infer_vector(doc.words.split()) for doc in text_a_raw])
text_b_encoded = np.array([model.infer_vector(doc.words.split()) for doc in text_b_raw])

# # predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)

## Don't worry if results are not satisfactory using Doc2Vec (the dataset is too small to train good embeddings)

spearman's rank correlation 0.0752375525451465


In [51]:
##TODO do the same with embeddings provided by spaCy
##TODO compute word similarities in the WordSim353 dataset using spaCy word embeddings
text_a_encoded = np.array([en(text).vector for text in text_a])
text_b_encoded = np.array([en(text).vector for text in text_b])

# # predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)

spearman's rank correlation 0.4434417833016249


In [55]:
##TODO do the same with universal sentence embeddings
# import necessary libraries
import tensorflow_hub as hub
  
# Load pre-trained universal sentence encoder model
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
  
# encode documents
text_a_encoded = np.array(embed(text_a))
text_b_encoded = np.array(embed(text_a))

# # predict cosine similarities
predictions = [cosine_similarity(a,b) for a,b in zip(text_a_encoded, text_b_encoded)]

# evaluate
evaluate(predictions, labels)

2022-11-21 14:04:15.796839: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


spearman's rank correlation 0.0038053103255100895


**Task 2**
Use your favorite document embeddings method to compute embeddings for a dataset you are interested in. Think of a method and provide some data visualization statistics (one method would be the path we have chosen in the notebook, i.e. cluster the embeddings with k-means and visualize low-dimensional representations of the document embeddings obtained by PCA). 

This task is very open and there is no right or wrong; If you want to use document embeddings in your course project, this is a chance to play around with them.




In [14]:
# import libraries
import os
import tweepy
import spacy
from dotenv import load_dotenv
import pandas as pd
from bertopic import BERTopic

#Â load environment variables
load_dotenv()
consumer_key = os.environ["API_KEY"]
consumer_secret = os.environ["API_KEY_SECRET"]
access_token = os.environ["ACCESS_TOKEN"]
access_token_secret = os.environ["ACCESS_TOKEN_SECRET"]

# authenticate
auth = tweepy.OAuth1UserHandler(
  consumer_key, 
  consumer_secret, 
  access_token, 
  access_token_secret
)

# connect to twitter
api = tweepy.API(auth)
api.verify_credentials()

# set query topic
queryTopic = 'inflation -filter:retweets'

# get the pages
extracted_pages = []
for page in tweepy.Cursor(api.search_tweets, 
                            queryTopic, 
                            lang="de",
                            count=100).pages(10):
    extracted_pages.append(page)
    
# get the tweets
tweets = []
for page in extracted_pages:
    tweets += page

# convert data to pandas df
json_data = [r._json for r in tweets]
df = pd.json_normalize(json_data)

# select subset of columns
df = df[["created_at", "text"]]


In [15]:

# initialize spacy
nlp = spacy.load("de_core_news_sm")

# get stopwords
def get_stopwords():
    "Return a set of stopwords read in from a file."
    with open("stop_words_german.txt") as f:
        stopwords = []
        for line in f:
            stopwords.append(line.strip("\n"))
    # Convert to set for performance
    stopwords_set = set(stopwords)
    return stopwords_set
stopwords = get_stopwords()

# define leammatization function
def lemmatize_pipe(doc):
    lemma_list = [str(tok.lemma_).lower() for tok in doc
                  if tok.is_alpha and tok.text.lower() not in stopwords] 
    return lemma_list

# define lemmatization pipe
def preprocess_pipe(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append(lemmatize_pipe(doc))
    return preproc_pipe

# cleaned text
df['preproc_pipe'] = preprocess_pipe(df['text'])
df['created_at'] = pd.to_datetime(df['created_at'])
df.shape

In [3]:
#Â get embeddings
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')

#Sentences are encoded by calling model.encode()
embeddings = model.encode(df['text'])


In [7]:
from sklearn.cluster import KMeans

# Perform kmean clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(embeddings)
cluster_assignment = clustering_model.labels_
clustered_sentences = [[] for i in range(num_clusters)]
for sentence_id, cluster_id in enumerate(cluster_assignment):
    clustered_sentences[cluster_id].append(df.loc[sentence_id, 'text'])

for i, cluster in enumerate(clustered_sentences):
    print("Cluster ", i+1)
    print(cluster)
    print("")
    

[3 3 3 1 3 0 3 2 4 1]
Cluster  1
['@ZStadtfux Auch wenn man Unsinn wiederholt wird es nicht richtig. DieðŸ‡¨ðŸ‡­ist heute je nach Studie das beste, reichsteâ€¦ https://t.co/8oYfuexgGS']

Cluster  2
['@DrieElmann Ich dachte, wir haetten Inflation?\nansparen von mehr als11kðŸ˜‡ðŸ˜œ', '@DannyLevievFam Nur Brot wegen inflation']

Cluster  3
['@eulebln @rbb24 @sophiamersmann @Hadmut @_donalphonso @RolandTichy @reitschuster @_richtig_falsch @destatis Wir habâ€¦ https://t.co/Vu9Aqa5GlU']

Cluster  4
['@pschnek @Clemanns1984 @volkspartei Durch die Merit Order haben wir zur Zeit eine kÃ¼nstlich aufgeblasene Inflation,â€¦ https://t.co/0Z5bTjzmdH', 'â€žDie Angst vor Ãœberfremdung, Kulturverlust oder Inflation ist verÃ¤chtlich, die Angst vor dem Weltuntergang ist gut.â€¦ https://t.co/LHyuLpLMpq', '@HerrNMaus Sagen wir es Mal so in 10 Jahren Hoffen wir dass die Inflation sich bisschen gelegt hat denn dann brauchâ€¦ https://t.co/lpP2S6JwL5', '@ChrSchumi Das Problem es ist Krieg, Inflation  Pandemie,

In [13]:
from sklearn.decomposition import PCA
import plotly.express as px

pca = PCA(n_components=2)
components = pca.fit_transform(embeddings)


fig = px.scatter(components, 
                 color=cluster_assignment.astype(str), 
                 labels={
                     "0": "PCA component 1",
                     "1": "PCA component 2",
                     "color": "Cluster"
                 },
                 x=0, 
                 y=1)
fig.show()


# Parsing

In [2]:
import pandas as pd
import nltk
df = pd.read_csv('train.csv')

df.columns = ["label", "title", "lead"]
label_map = {1:"world", 2:"sport", 3:"business", 4:"sci/tech"}
def replace_label(x):
	return label_map[x]
df["label"] = df["label"].apply(replace_label) 
df["text"] = df["title"] + " " + df["lead"]
df = df.sample(n=10000) # # only use 10K datapoints
df.head()

Unnamed: 0,label,title,lead,text
88262,world,"On litter-strewn street, Palestinians mourn",After 36 years in which he influenced and domi...,"On litter-strewn street, Palestinians mourn Af..."
74285,sci/tech,Wireless network links researchers in panda pr...,"Giant pandas might prefer bamboo to laptops, b...",Wireless network links researchers in panda pr...
26563,sport,NFL Game Summary - Green Bay at Carolina,"Charlotte, NC (Sports Network) - Ahman Green s...",NFL Game Summary - Green Bay at Carolina Charl...
27726,business,Citigroup apology on bond deals,LONDON Citigroup told employees on Tuesday tha...,Citigroup apology on bond deals LONDON Citigro...
12072,sport,Hamm #39;s legacy should be for Olympic ideal,Even though he has fled the perceived hostile ...,Hamm #39;s legacy should be for Olympic ideal ...


In [64]:
import spacy
nlp = spacy.load('en_core_web_md')
#TODO preprocess the corpus using spacy or load the pre-processed corpus
docs = nlp.pipe(df["lead"]) 
df["text_clean"] = [[chunk.text.lower() for chunk in doc if not 
                      (chunk.is_punct or chunk.is_stop)] for doc in docs]
df["text_clean"] = df["text_clean"].str.join(" ")

# use the pre-processed corpus for the rest of the exercise
docs = list(nlp.pipe(df["text_clean"]))

### Information Extraction

In [59]:
def extract_subject_verb_pairs(sent):
    subjs = [w for w in sent if w.dep_ == "nsubj"]
    pairs = [(w.lemma_.lower(), w.head.lemma_.lower()) for w in subjs]
    return pairs
##TODO extract the subject-verbs pairs and print the result for the first document
pairs = []
for doc in docs:
    for sent in doc.sents:
        pairs.append(extract_subject_verb_pairs(sent))
        
from collections import Counter
counter = Counter()

##TODO create a list ranking the most common pairs and print the first 10 items
# unlist the list of lists
flat_pairs = [item for sublist in pairs for item in sublist]
counted_pairs = Counter(elem for elem in flat_pairs)
counted_pairs.most_common(10)

[(('official', 'say'), 131),
 (('.', 'say'), 84),
 (('company', 'say'), 72),
 (('inc', 'say'), 70),
 (('target=/stock', 'quickinfo'), 57),
 (('corp', 'say'), 48),
 (('group', 'say'), 32),
 (('research', 'say'), 29),
 (('people', 'kill'), 29),
 (('police', 'say'), 26)]

In [60]:
##TODO do the same for verbs-object pairs ('dobj')
def extract_object_verb_pairs(sent):
    subjs = [w for w in sent if w.dep_ == "dobj"]
    pairs = [(w.lemma_.lower(), w.head.lemma_.lower()) for w in subjs]
    return pairs
##TODO extract the subject-verbs pairs and print the result for the first document
pairs = []
for doc in docs:
    for sent in doc.sents:
        pairs.append(extract_object_verb_pairs(sent))
        
from collections import Counter
counter = Counter()

##TODO create a list ranking the most common pairs and print the first 10 items
# unlist the list of lists
flat_pairs = [item for sublist in pairs for item in sublist]
counted_pairs = Counter(elem for elem in flat_pairs)
counted_pairs.most_common(10)


[(('people', 'kill'), 53),
 (('point', 'score'), 26),
 (('lawsuit', 'file'), 16),
 (('rate', 'raise'), 14),
 (('plan', 'announce'), 14),
 (('soldier', 'kill'), 11),
 (('39;t', 'win'), 10),
 (('game', 'win'), 10),
 (('job', 'cut'), 10),
 (('attack', 'kill'), 10)]

In [61]:
##TODO do the same for adjectives-nouns pairs ('amod')
def extract_adjectives_nouns_pairs(sent):
    subjs = [w for w in sent if w.dep_ == "amod"]
    pairs = [(w.lemma_.lower(), w.head.lemma_.lower()) for w in subjs]
    return pairs
##TODO extract the subject-verbs pairs and print the result for the first document
pairs = []
for doc in docs:
    for sent in doc.sents:
        pairs.append(extract_adjectives_nouns_pairs(sent))
        
from collections import Counter
counter = Counter()

##TODO create a list ranking the most common pairs and print the first 10 items
# unlist the list of lists
flat_pairs = [item for sublist in pairs for item in sublist]
counted_pairs = Counter(elem for elem in flat_pairs)
counted_pairs.most_common(10)

[(('mobile', 'phone'), 56),
 (('presidential', 'election'), 51),
 (('second', 'quarter'), 41),
 (('open', 'source'), 39),
 (('high', 'price'), 38),
 (('fourth', 'quarter'), 33),
 (('senior', 'official'), 33),
 (('quarterly', 'profit'), 32),
 (('crude', 'oil'), 30),
 (('economic', 'growth'), 29)]

### Exploring cross label dependencies

In [62]:
##TODO extract all the subject-verbs and verbs-object pairs for the verb "win"
pairs = []
for doc in docs:
    for sent in doc.sents:
        pairs.append(extract_subject_verb_pairs(sent))
        
# unlist the list of lists
flat_pairs = [item for sublist in pairs for item in sublist]
subject_flat_pairs_filtered = [pair for pair in flat_pairs if pair[1] == "win"]
print(subject_flat_pairs_filtered[0:10])

pairs = []
for doc in docs:
    for sent in doc.sents:
        pairs.append(extract_object_verb_pairs(sent))

# unlist the list of lists
flat_pairs = [item for sublist in pairs for item in sublist]
object_flat_pairs_filtered = [pair for pair in flat_pairs if pair[1] == "win"]
print(object_flat_pairs_filtered[0:10])



[('turnover', 'win'), ('server', 'win'), ('norway', 'win'), ('singh', 'win'), ('park', 'win'), ('russia', 'win'), ('hofstra', 'win'), ('bush', 'win'), ('vote', 'win'), ('spain', 'win')]
[('opener', 'win'), ('controversy', 'win'), ('year', 'win'), ('39;t', 'win'), ('rally', 'win'), ('game', 'win'), ('course', 'win'), ('bridge', 'win'), ('korea', 'win'), ('win', 'win')]


In [63]:
##TODO for each label create a list ranking the most common subject-verbs pairs and one for the most common verbs-object pairs
object_pairs_by_label = []
for label in label_map:
    df_label = df.loc[df['label'] == label_map[label]]
    docs = list(nlp.pipe(df_label["text_clean"]))
    ##TODO extract the subject-verbs pairs and print the result for the first document
    pairs = []
    for doc in docs:
        for sent in doc.sents:
            pairs.append(extract_object_verb_pairs(sent))
            
    from collections import Counter
    counter = Counter()

    # unlist the list of lists
    flat_pairs = [item for sublist in pairs for item in sublist]
    counted_pairs = Counter(elem for elem in flat_pairs)
    object_pairs_by_label.append(counted_pairs)
    
subject_pairs_by_label = []
for label in label_map:
    df_label = df.loc[df['label'] == label_map[label]]
    docs = list(nlp.pipe(df_label["text_clean"]))
    ##TODO extract the subject-verbs pairs and print the result for the first document
    pairs = []
    for doc in docs:
        for sent in doc.sents:
            pairs.append(extract_subject_verb_pairs(sent))
            
    from collections import Counter
    counter = Counter()

    # unlist the list of lists
    flat_pairs = [item for sublist in pairs for item in sublist]
    counted_pairs = Counter(elem for elem in flat_pairs)
    subject_pairs_by_label.append(counted_pairs)
    
##TODO print the 10 most common pairs for each of the two lists for the labels "sport" and "business"
print("#### Object pairs ####")
for i, sublist in enumerate(object_pairs_by_label):
    if (i == 1) or (i == 2):
        print(sublist.most_common(10))
        print("")
print("#### Subject pairs ####")
for i, sublist in enumerate(subject_pairs_by_label):
    if (i == 1) or (i == 2):
        print(sublist.most_common(10))
        print("")


#### Object pairs ####
[(('point', 'score'), 26), (('game', 'win'), 9), (('pass', 'throw'), 9), (('series', 'win'), 8), (('surgery', 'undergo'), 6), (('season', 'miss'), 6), (('term', 'agree'), 6), (('touchdown', 'throw'), 6), (('par', 'shoot'), 5), (('title', 'win'), 5)]

[(('rate', 'raise'), 14), (('job', 'cut'), 9), (('charge', 'settle'), 8), (('earning', 'report'), 8), (('inc', 'buy'), 8), (('corp', 'buy'), 8), (('research', 'quote'), 7), (('job', 'add'), 6), (('world', 'quickinfo'), 6), (('loss', 'report'), 6)]

#### Subject pairs ####
[(('quot', 'say'), 13), (('game', 'play'), 8), (('bond', 'hit'), 7), (('match', 'play'), 6), (('second', 'leave'), 6), (('antonio', 'spur'), 5), (('manning', 'throw'), 5), (('assist', 'lead'), 5), (('team', 'win'), 5), (('michael', 'phelp'), 5)]

[(('.', 'say'), 57), (('target=/stock', 'quickinfo'), 57), (('inc', 'say'), 55), (('company', 'say'), 45), (('corp', 'say'), 31), (('research', 'say'), 25), (('official', 'say'), 22), (('profit', 'rise'), 2