# Week 6 Tasks: NYT API Data and Cosine Similarity for Text 

These tasks were discussed during week 6 and you had to work on them with your group.  
Below are the solutions and additional tasks that will be helpful for your project.

**Table of Contents**
* [Part 1: Working with the NYT API](#sec1)
* [Part 2: Cosine Similarity for Text](#sec2)
* [Part 3: Similarity of Spring and Summer sentences](#sec3)

<a id="sec1"></a>
## Part 1: Working with the NYT API 

We have the following goals:

1. Use the API to get all articles from a month
2. Verify the number of articles
3. Find the distribution of articles by section

### Important: replace the string below with your API key

In [None]:
myAPIkey = "ADD YOUR OWN API KEY HERE!"

We will write a function that, given a date (month and year), sends a request to the NYT API and gets the articles for that time period. We will store the results in a JSON file to process when needed.

In [None]:
import requests, json

def getNYTArticles(year, month, apiKey):
    """Function that sends a request to the NYT API for all articles in a month
    and then stores the results in a JSON file.
    """
    # create URL
    URL = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={apiKey}"

    # send the request to get the data
    data = requests.get(URL)
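    if data.status_code == 200:
        print("Successfully got the data.")
    else:
        # stop here, so that an error response is not saved as if it were data
        print(f"Request failed with status code {data.status_code}.")
        return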

    dataJson = data.json() # get response as JSON

    with open(f"NYT_{year}-{month}.json", 'w') as fout:
        json.dump(dataJson, fout)

Let's test the function for the month of February 2024:

In [None]:
getNYTArticles(2024, 2, myAPIkey)

### Explore the NYT Data

Now that we have the data, we will look into how to retrieve things like article title, section, etc.

In [None]:
with open("NYT_2024-2.json") as fin:
    articles = json.load(fin)

print(type(articles))

We can check the keys of this dictionary:

In [None]:
articles.keys()

Then we check what is stored under each key. Instead of printing the values, we print their types:

In [None]:
for key in articles:
    print(key, type(articles[key]))

In [None]:
articles['copyright']

Let's look at the keys for 'response':

In [None]:
articles['response'].keys()

One more time, we look at what kind of information is stored under each of these keys:

In [None]:
for key in articles['response']:
    print(key, type(articles['response'][key]))

In [None]:
# what is under the "meta" key?

articles['response']['meta']

So, this shows how many articles are in the data. We can verify this:

In [None]:
len(articles['response']['docs'])

It's the same number, which is a good thing. Now let's look at what one of the articles (or docs) looks like:

In [None]:
articles['response']['docs'][0] # using indexing, because we know that the data is stored in a list

We can see that an article is a somewhat nested data structure: it's a dictionary, but many of the keys point to lists of other dictionaries. Let's look at the top-level fields:

In [None]:
oneArticle = articles['response']['docs'][0]
for key in oneArticle:
    print(key, type(oneArticle[key]))

### Find the distribution of articles by section

As we saw above, every article has a section name, so we can easily collect all those names:

In [None]:
sections = [article['section_name'] for article in articles['response']['docs']]

# Let's look up a few of them
sections[:5]

In [None]:
from collections import Counter

distDct = Counter(sections) # count the occurrences of each section name

distDct.most_common(10)
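
Raw counts are hard to compare across months with different numbers of articles, so it can help to express the distribution as shares. The sketch below uses a small made-up list of section names (`toySections` and `sectionShares` are hypothetical names, and the values are not NYT data); with the real data you would pass `sections` instead.

```python
from collections import Counter

# made-up section names, standing in for the real `sections` list
toySections = ["U.S.", "World", "U.S.", "Arts", "U.S.", "World"]

def sectionShares(sectionNames):
    """Return (section, count, percent) tuples, most frequent first."""
    counts = Counter(sectionNames)
    total = sum(counts.values())
    return [(name, n, round(100 * n / total, 1)) for name, n in counts.most_common()]

for name, n, pct in sectionShares(toySections):
    print(f"{name:10} {n:3} {pct:5.1f}%")
```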

### Tasks for you:

1. Write a Python function that takes a date, for example, "2024-02-12", and returns the list of articles for that day.
2. Write some code that explores whether the fields "abstract" and "snippet" are always the same or whether they differ. Which one has more information?
3. Write a function that given one article (in its nested structure), creates a flat dictionary with keys that are relevant for analysis: either the abstract or snippet (see point 2); lead paragraph; headline; keywords concatenated via semicolon; pub_date; document_type; section_name; and type_of_material
4. Write another function that calls the function from point 3 on every article, to create a list of article dictionaries, and convert this list into a dataframe and then store it as a CSV file with the date-month in the title (this is important for point 5 below).
5. Once you have done all of this in the notebook, create a Python script that can be called with a date (from a TikTok video). First, the script checks whether a CSV with cleaned articles is already in our folder. If not, it first calls the API function to get the articles and then the function that converts them into a CSV. Then, it loads the CSV into a dataframe and uses filtering to get the articles for the desired date. These articles will be used for the Semantic Similarity portion of the TikTok Project.
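
As a possible starting point for task 1, the sketch below filters a list of article dictionaries by day. It assumes that each article has a `pub_date` string that starts with the date in `YYYY-MM-DD` form; check this against your own data before relying on it. The `toyDocs` list and the function name `getArticlesForDay` are made up for illustration.

```python
def getArticlesForDay(docs, day):
    """Return the articles whose pub_date starts with `day` (e.g. "2024-02-12").
    Assumes pub_date is a string that begins with YYYY-MM-DD."""
    return [doc for doc in docs if (doc.get("pub_date") or "").startswith(day)]

# made-up articles with only the fields needed for this example
toyDocs = [
    {"pub_date": "2024-02-12T05:00:07+0000", "section_name": "U.S."},
    {"pub_date": "2024-02-12T18:30:00+0000", "section_name": "Arts"},
    {"pub_date": "2024-02-13T09:15:42+0000", "section_name": "World"},
]

print(len(getArticlesForDay(toyDocs, "2024-02-12")))
```

With the real data, the first argument would be `articles['response']['docs']`.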

<a id="sec2"></a>
## Part 2: Cosine Similarity for Text

We will start with the example that was in the slides. There, we initially used the Jaccard similarity to rank sentences most similar to a query, and when that didn't work as expected, we looked at the cosine similarity.

### Use Jaccard similarity for a query phrase and a list of sentences

In [None]:
q = "red dress"

sentences = [
"she wore a dress and red earrings",
"the dress has a red wine stain",
"tomorrow I will wear my new red dress",
"the red dress in the photo resembles the red dress she is wearing",
"short dress",
"red lipstick"
]

def jaccard(text1, text2):
    """Implement Jaccard similarity. Assumes there is no punctuation in text."""
    sw1 = set(text1.lower().split()) # turn into a set of words
    sw2 = set(text2.lower().split())
    sim = len(sw1.intersection(sw2)) / len(sw1.union(sw2))
    return round(sim, 4) # round to 4 digits after the decimal point

def applyJaccard(query, sentences):
    """Apply the Jaccard similarity between query and each sentence"""
    results = []
    for sent in sentences:
        jac = jaccard(query, sent)
        results.append((jac, sent))

    # Sort in descending order, once, after all scores are collected
    results.sort(reverse=True)

    return results

# call the function

applyJaccard(q, sentences)
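
To see where a single score comes from, here is the computation for the query and the phrase "short dress", spelled out step by step. It only illustrates the ratio (size of the intersection divided by size of the union) that `jaccard` implements; the variable names are chosen for this example.

```python
query = set("red dress".split())
sentence = set("short dress".split())

common = query & sentence    # words that appear in both sets
allWords = query | sentence  # words that appear in either set

score = len(common) / len(allWords)
print(sorted(common), sorted(allWords), round(score, 4))
```

One shared word out of three distinct words gives 0.3333. For comparison, "tomorrow I will wear my new red dress" shares two words with the query out of eight distinct words, which gives only 0.25.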

As we discussed in class, the Jaccard similarity does not do well with our data: it ranks "short dress" and "red lipstick", which each share only one word with the query, above sentences that contain both query words. Thus, we will try the cosine similarity. However, in order to apply the cosine similarity, we need two preliminary steps:

1. Create the vocabulary of words that will serve as the dimensions of our vector space
2. Represent each document as a vector in the vector space

### Create Vocabulary

While the sentences in our example don't have punctuation, most text does, so we need to be prepared to remove it. This is necessary to avoid the same word showing up multiple times, with and without punctuation.

In [None]:
phrase = "that, that is the thing I want: dancing by the river! ah, the river, I have missed it so much!"
phrase.lower().split()

Notice how we have both "that," and "that", and also "river!" and "river,". This is why we will remove punctuation. Luckily, Python's `string` module lists the ASCII punctuation characters:

In [None]:
import string
string.punctuation

One way to go about it is the following:

In [None]:
"".join(char for char in phrase if char not in string.punctuation)

Notice how all the punctuation is gone. Now that we know how to do this, we can write our function.

In [None]:
def getVocabulary(textchunk):
    """Given some text, create the vocabulary of unique words."""
    textchunk = textchunk.lower()
    cleantext = "".join(char for char in textchunk if char not in string.punctuation)
    words = set(cleantext.split())
    voc = sorted(words)

    return voc

Let's test it with our sentences. Since they are a list, we turn them into a string first:

In [None]:
getVocabulary(" ".join(sentences))

It looks good, no word is repeated. 
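
The punctuation removal inside `getVocabulary` can also be written with `str.translate` and `str.maketrans`, which are methods of Python's built-in string type. The sketch below is not the approach used in the rest of this notebook; it only checks that both ways give the same result on our example phrase. The names `removePunct`, `viaJoin`, and `viaTranslate` are made up for this example.

```python
import string

phrase = "that, that is the thing I want: dancing by the river! ah, the river, I have missed it so much!"

# build a translation table that deletes every punctuation character
removePunct = str.maketrans("", "", string.punctuation)

viaJoin = "".join(char for char in phrase if char not in string.punctuation)
viaTranslate = phrase.translate(removePunct)

print(viaTranslate)
print(viaJoin == viaTranslate)
```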

### Vector representation

Now that we have a vocabulary, we can easily convert every sentence into a vector of numbers. Remember, all the vectors will have the same length, namely the size of the vocabulary. Each vector has a 0 for every dimension (word) that the sentence doesn't contain, and the count of the word for every dimension it does contain.

In [None]:
def text2vector(sentence, voc):
    """Given a sentence and the vocabulary for the problem,
    turn the sentence into a vector of word counts.
    """
    cleantext = "".join(char for char in sentence if char not in string.punctuation)
    words = cleantext.lower().split()
    vector = [words.count(w) for w in voc]
    return vector

Let's try it with one sentence:

In [None]:
voc = getVocabulary(" ".join(sentences))
text2vector(sentences[0], voc)

Let's verify that this is done right by checking what sentence was turned into a vector:

In [None]:
sentences[0]

Let's combine the vocabulary and the vector to see the pairs:

In [None]:
list(zip(voc, text2vector(sentences[0], voc)))

Notice how each word in our sentence has a 1 next to it and all the other words have a 0.

We will now convert all the sentences to vectors:

In [None]:
sent2vec = [text2vector(sent, voc) for sent in sentences]
sent2vec

We represent this in pandas:

In [None]:
import pandas as pd
df = pd.DataFrame(sent2vec, 
                  columns=voc,
                  index=[f"doc_{i+1}" for i in range(len(sentences))])
df

### Cosine Similarity

We discussed the implementation of cosine similarity in class. Below is the function that implements it.

In [None]:
import numpy as np
from numpy.linalg import norm
 
def cosineSimilarity(vec1, vec2):
    """Calculate the cosine similarity between two vectors."""
    V1 = np.array(vec1)
    V2 = np.array(vec2)
    cosine = np.dot(V1, V2)/(norm(V1)*norm(V2))
    return cosine
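
As a quick sanity check, the sketch below applies the same computation (dot product divided by the product of the two norms) to a few tiny hand-made vectors, which are not taken from our sentences. The function is repeated so that the example runs on its own.

```python
import numpy as np
from numpy.linalg import norm

def cosineSimilarity(vec1, vec2):
    """Same computation as above, repeated so this example is self-contained."""
    V1 = np.array(vec1)
    V2 = np.array(vec2)
    return np.dot(V1, V2) / (norm(V1) * norm(V2))

a = [1, 1, 0]   # a text that uses the words of dimensions 1 and 2
b = [1, 0, 0]   # shares one dimension with a
c = [2, 2, 0]   # same direction as a, with every count doubled
d = [0, 0, 1]   # no dimension in common with a

print(round(cosineSimilarity(a, b), 4))  # dot product 1, norms sqrt(2) and 1
print(round(cosineSimilarity(a, c), 4))  # same direction
print(round(cosineSimilarity(a, d), 4))  # nothing in common
```

The three values are 0.7071, 1.0, and 0.0. Doubling every count does not change the result, because the cosine similarity depends on the direction of the vectors and not on their length.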

Now that we have the cosine similarity function, we will write a function that, given a query and a list of sentences, calculates the similarity score for each pair (query, sentence).

In [None]:
def rankDocuments(query, sentences):
    """Given a query and some sentences, rank the sentences
    from most to least similar to the query.
    """
    # Step 1: create vocabulary
    voc = getVocabulary(" ".join(sentences))

    # Step 2: generate vector for query
    queryVec = text2vector(query, voc)

    # Step 3: generate vector for sentences and calculate cosine similarity at once
    similarities = []
    for sent in sentences:
        sentVec = text2vector(sent, voc)
        sim = cosineSimilarity(queryVec, sentVec)
        similarities.append((round(sim, 4), sent)) # keep track of sentences

    similarities.sort(reverse=True) # most similar sentence at the top
    return similarities

Now we can call the function for our query "red dress" and the list of sentences:

In [None]:
rankDocuments("red dress", sentences)

**Note:** These values are slightly different from the ones in the slides. There was a bug with the word "I", which was not lowercased in the sentences, so it didn't count in the vector. The bug has been fixed in this version.
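
One edge case is worth keeping in mind: if a query shares no word with the vocabulary, its vector is all zeros, its norm is 0, and the division in `cosineSimilarity` produces `nan` (with a runtime warning). A possible guard, sketched below under the assumption that a similarity of 0 is an acceptable answer in that case, is to check the norms first. The name `safeCosineSimilarity` is made up for this example.

```python
import numpy as np
from numpy.linalg import norm

def safeCosineSimilarity(vec1, vec2):
    """Cosine similarity that returns 0.0 when either vector is all zeros."""
    V1 = np.array(vec1)
    V2 = np.array(vec2)
    denominator = norm(V1) * norm(V2)
    if denominator == 0:
        return 0.0  # assumption: treat "no words in common with the vocabulary" as similarity 0
    return float(np.dot(V1, V2) / denominator)

print(safeCosineSimilarity([0, 0, 0], [1, 2, 0]))  # zero vector, handled by the guard
print(safeCosineSimilarity([1, 2, 0], [1, 2, 0]))  # identical vectors
```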

<a id="sec3"></a>
## Part 3: Similarity of Spring and Summer sentences

You were given the following sentences in the slides of Day 10. These were created by GenAI to capture the spirit of "spring" and "summer".

In [None]:
springSentences = [
"As spring unfolds, the warmth of the season encourages the first blossoms to open, signaling longer days ahead.",
"Spring brings not only blooming flowers but also the anticipation of sunny days and outdoor activities.",
"With the arrival of spring, people begin planning their summer vacations, eager to enjoy the seasonal warmth.",
"The mild spring weather marks the transition from the cold winter to the inviting warmth of summer.",
"During spring, families often start spending more time outdoors, enjoying the season's pleasant temperatures and the promise of summer fun."
]

summerSentences = [
"Summer continues the season's trend of growth and warmth, with gardens full of life and days filled with sunlight.",
"The summer season is synonymous with outdoor adventures and enjoying the extended daylight hours that began in spring.",
"As summer arrives, the warm weather invites a continuation of the outdoor activities that people began enjoying in spring.",
"The transition into summer brings even warmer temperatures, allowing for beach visits and swimming, much awaited since the spring.",
"Summer vacations are often planned as the days grow longer, a pattern that starts in the spring, culminating in peak summer leisure."
]

**Our Goal:**

We want to generate a heatmap of the cosine similarity scores between every pair of sentences, to find out how similar they are to one another. To achieve this goal, we break the task down into steps:

1. First, we will create the vocabulary of all terms (the dimensions of our vector space).
2. We will turn every sentence into a vector.
3. We will compare every sentence to every other sentence through the cosine similarity to create the similarity matrix.
4. We will draw the heatmap with seaborn.

### Create Vocabulary

We will call the function `getVocabulary` that we created before.

In [None]:
allSentences = " ".join(springSentences) + " " + " ".join(summerSentences)
voc = getVocabulary(allSentences)
print(f"Vocabulary has {len(voc)} words.")

### Convert sentences to vectors

We will call the function `text2vector` on every sentence:

In [None]:
sentVectors = [text2vector(sent, voc) for sent in springSentences+summerSentences]
print(len(sentVectors), len(sentVectors[0]))

This means that we created 10 vectors, each with a length of 102 dimensions.  
Let's check our work:

In [None]:
oneSent = springSentences[0]
oneSent

In [None]:
pairs = list(zip(text2vector(oneSent, voc), voc))
nonZero = [pair for pair in pairs if pair[0] != 0]
nonZero

In [None]:
print(f"Words in sentence: {len(oneSent.split())}; nonzero terms in vector: {len(nonZero)}")

This looks good. There are 16 unique words, and the word "the" appears two more times, which explains the numbers 16 and 18.

### Calculating the similarity matrix

We will calculate the cosine similarity for every pair of sentences. This is fine because we only have 10 sentences. If we had many more, we would try to be more efficient and not repeat calculations, since we know that the matrix is symmetric.

In [None]:
simMatrix = []
for vec1 in sentVectors:
    simRow = []
    for vec2 in sentVectors:
        simRow.append(cosineSimilarity(vec1, vec2))
    simMatrix.append(simRow)

print(simMatrix)
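
With many more sentences, one option (a sketch, not what we did in class) is to let NumPy compute all pairs at once: divide every row vector by its norm, then multiply the resulting matrix with its transpose. The toy vectors below are made up and `cosineMatrix` is a hypothetical name; with our data you would pass `sentVectors`. The sketch assumes that no vector is all zeros.

```python
import numpy as np

def cosineMatrix(vectors):
    """All pairwise cosine similarities at once. Assumes no all-zero rows."""
    X = np.array(vectors, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)  # one norm per row
    unit = X / norms                                  # every row now has length 1
    return unit @ unit.T                              # dot products of unit vectors

toyVectors = [[1, 1, 0], [1, 0, 0], [0, 0, 2]]
toyMatrix = cosineMatrix(toyVectors)
print(np.round(toyMatrix, 4))
```

The result is a NumPy array instead of a list of lists, and it can be passed to the heatmap or to a pandas dataframe in the same way.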

### Generate the heatmap

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

def drawHeatmap(sentLabels, simMtrx, plotTitle):
    """Draws a heatmap for the similarity matrix.
    """
    sns.set(font_scale=0.9)
    g = sns.heatmap(
          simMtrx, # similarity matrix with the cosine sim values
          xticklabels=sentLabels,
          yticklabels=sentLabels,
          vmin=0,
          vmax=1,
          cmap="YlOrRd")
    g.set_xticklabels(sentLabels, rotation=90)
    g.set_title(plotTitle, fontsize=14)
    plt.show()

In [None]:
shortSent = [sent[:25] for sent in springSentences+summerSentences]
drawHeatmap(shortSent, simMatrix, "Cosine similarity matrix")

### Short Exploration

Let's look at the similarity matrix in a pandas dataframe:

In [None]:
labels = [f"s{i+1}" for i in range(10)]
df = pd.DataFrame(simMatrix, columns=labels, index=labels)
df
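
To decide which pairs to look at, we can list the highest-scoring pairs while skipping the diagonal (every sentence has similarity 1 with itself) and the duplicate lower half of the matrix. The sketch below uses a small made-up matrix; `topPairs` is a hypothetical helper, and with our data you would call it with `simMatrix` and `labels`.

```python
import numpy as np

def topPairs(matrix, names, n=3):
    """Return the n most similar pairs as (score, name1, name2) tuples."""
    M = np.array(matrix)
    rows, cols = np.triu_indices(len(names), k=1)  # positions above the diagonal
    pairs = [(round(float(M[r, c]), 4), names[r], names[c]) for r, c in zip(rows, cols)]
    pairs.sort(reverse=True)  # highest score first
    return pairs[:n]

# made-up symmetric similarity matrix for three sentences
toyMatrix = [[1.0, 0.2, 0.7],
             [0.2, 1.0, 0.4],
             [0.7, 0.4, 1.0]]
print(topPairs(toyMatrix, ["s1", "s2", "s3"], n=2))
```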

I will write some code to compare sentences that have a high similarity score:

In [None]:
def getWords(sent):
    """Get the words of a sentence after lowercasing and removing punctuation."""
    cleantext = "".join(char for char in sent.lower() if char not in string.punctuation)
    cleanWords = cleantext.split()
    return cleanWords

In [None]:
def compareSentences(sent1, sent2):
    """Compare the content of two sentences."""
    words1 = getWords(sent1)
    words2 = getWords(sent2)
    commonWords = sorted(set(words1) & set(words2)) # unique words that appear in both sentences
    print("COMPARISON RESULTS")
    print("Sent1: ", sent1)
    print("Sent2: ", sent2)
    print(f"Lengths of sentences: {len(words1)} and {len(words2)}. Words in common: {len(commonWords)}")
    print("Common words:", commonWords)

Let's check s1 and s4, in the group of Spring sentences:

In [None]:
compareSentences(springSentences[0], springSentences[3])

What about the sentences s7 and s8, in the group of Summer sentences?

In [None]:
compareSentences(summerSentences[1], summerSentences[2])