# Practice with NYTimes API
CS315: Data Science for the Web  
Professor Eni Mustafaraj  
[Day 10 Slides 22-28](https://docs.google.com/presentation/d/15fuBhqPNv8GgeqNlNydKAlJW5ecAx92h7lQ_KyTxbKI/edit#slide=id.g2bd1c28673c_0_127)  
Edith Po  

In [54]:
import requests

In [55]:
key = '1DFmIMxxqdYl8wJBPqAFxtHkimk86Qtn'

In [91]:
year = 2024
month = 2

url = f'https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={key}'
print(url)

https://api.nytimes.com/svc/archive/v1/2024/2.json?api-key=1DFmIMxxqdYl8wJBPqAFxtHkimk86Qtn


In [92]:
data = requests.get(url)
data.status_code

200

In [93]:
articles = data.json()
len(articles)

2

In [94]:
articles.keys()

dict_keys(['copyright', 'response'])

In [95]:
len(articles['response']['docs']) # list

3791

In [61]:
articles['response']['docs'][0].keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'print_section', 'print_page', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

In [19]:
article0 = articles['response']['docs'][0]
print(article0['abstract'])
print(article0['web_url'])
print(article0['snippet'])
print(article0['section_name'])

Periods of backlash take shape after surges of Black progress. We have entered another such period.
https://www.nytimes.com/2024/01/31/opinion/racist-backlash-history.html
Periods of backlash take shape after surges of Black progress. We have entered another such period.
Opinion


## Find the Top 5 Section Names for February 2024 Articles

In [31]:
section_names = {}

for doc in articles['response']['docs']:
    section = doc['section_name']
    if section in section_names:
        section_names[section] += 1
    else:
        section_names[section] = 1

len(section_names)

36

In [32]:
# sort the section names by count
sorted_sections = sorted(section_names.items(),key=lambda x:x[1],reverse=True)
sorted_sections[:5]

[('U.S.', 734),
 ('World', 513),
 ('Arts', 326),
 ('Opinion', 272),
 ('Business Day', 244)]

In [34]:
# ALTERNATE METHOD FROM "Week 6 Task Solutions.ipynb"
sections = [article['section_name'] for article in articles['response']['docs']]

from collections import Counter

distDct = Counter(sections) # count the occurrences of each section name

distDct.most_common(10)

[('U.S.', 734),
 ('World', 513),
 ('Arts', 326),
 ('Opinion', 272),
 ('Business Day', 244),
 ('New York', 200),
 ('Style', 174),
 ('Books', 139),
 ('Crosswords & Games', 125),
 ('Movies', 123)]

In [98]:
def get_articles_by_year_month(year, month, key):
    '''Returns the list of article dictionaries for the given year and month.'''
    # create URL
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={key}"

    # send the request to get the data
    data = requests.get(url)
    data.raise_for_status() # stop here if the request failed (4xx or 5xx status)
    print("Successfully got the data.")

    dataJson = data.json() # get response as JSON
    documents = dataJson['response']['docs']
    return documents

## Various Tasks

1. Write a Python function that takes a date, for example, "2024-02-12", and returns the list of articles for that day.

In [96]:
import datetime
def get_articles_by_date(date):
    dt = datetime.datetime.strptime(date,'%Y-%m-%d')

    # Get articles for given month and year
    url = f'https://api.nytimes.com/svc/archive/v1/{dt.year}/{dt.month}.json?api-key={key}'
    data = requests.get(url)
    documents = data.json()['response']['docs']

    articles = [doc for doc in documents if doc['pub_date'][:10] == date]

    return articles

In [97]:
feb12_articles = get_articles_by_date('2024-02-12')
print(len(feb12_articles))

116


2. Write some code that explores whether the fields "abstract" and "snippet" are always the same or they differ. Which one has more information? 

In [99]:
articles = get_articles_by_year_month(2024, 2, key)
len(articles)

Successfully got the data.


3791

In [137]:
dif_abstract_snippet = []

# find articles whose abstracts and snippets are not the same
for article in articles:
    abstract = article['abstract']
    snippet = article['snippet']
    if abstract != snippet and len(snippet) != 0:
        dif_abstract_snippet.append(article)

print(f"Number of articles with different abstracts and snippets (where snippet field was not empty): {len(dif_abstract_snippet)}")

fraction = (len(dif_abstract_snippet)/len(articles))*100
print(f"Only {fraction:.2f}% of the abstracts in Feb 2024 were different from the snippets.")

Number of articles with different abstracts and snippets (where snippet field was not empty): 5
Only 0.13% of the abstracts in Feb 2024 were different from the snippets.


In [115]:
abstract_lengths = [len(article['abstract']) for article in dif_abstract_snippet]
snippet_lengths = [len(article['snippet']) for article in dif_abstract_snippet]

print(abstract_lengths)
print(snippet_lengths)

avg_abstract = sum(abstract_lengths)/len(abstract_lengths)
avg_snippet = sum(snippet_lengths)/len(snippet_lengths)

print(f'Average Abstract Length: {avg_abstract}')
print(f'Average Snippet Length: {avg_snippet}')

[288, 253, 342, 253, 249]
[250, 250, 250, 250, 250]
Average Abstract Length: 277.0
Average Snippet Length: 250.0
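
Every differing snippet above is exactly 250 characters long while the abstracts are longer, which suggests (but does not prove) that a snippet is a truncated abstract. The sketch below is one way to check that guess. The function name `compare_abstract_snippet`, the category labels, and the toy articles are made up for illustration; stripping a trailing ellipsis before the comparison is also an assumption about how truncated snippets might end.

```python
def compare_abstract_snippet(articles):
    """Count how each article's snippet relates to its abstract."""
    counts = {'same': 0, 'empty_snippet': 0, 'snippet_is_prefix': 0, 'other': 0}
    for article in articles:
        abstract = article.get('abstract', '')
        snippet = article.get('snippet', '')
        # assumption: a truncated snippet may end with dots or an ellipsis
        core = snippet.rstrip('.… ')
        if abstract == snippet:
            counts['same'] += 1
        elif not snippet:
            counts['empty_snippet'] += 1
        elif core and abstract.startswith(core):
            counts['snippet_is_prefix'] += 1
        else:
            counts['other'] += 1
    return counts

# toy articles (made up) covering each category once
toy_articles = [
    {'abstract': 'Same text.', 'snippet': 'Same text.'},
    {'abstract': 'A long abstract that keeps going.', 'snippet': 'A long abstract'},
    {'abstract': 'Only an abstract.', 'snippet': ''},
    {'abstract': 'One thing.', 'snippet': 'Another thing.'},
]
counts = compare_abstract_snippet(toy_articles)
print(counts)
```

Running `compare_abstract_snippet(articles)` on the February 2024 list would show whether the 5 differing articles fall under `snippet_is_prefix`. If they do, the abstract is the field with more information.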


3. Write a function that, given one article (in its nested structure), creates a flat dictionary with keys that are relevant for analysis: either the abstract or the snippet (see point 2); lead paragraph; headline; keywords concatenated via semicolon; pub_date; document_type; section_name; and type_of_material.

In [118]:
articles[0].keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'print_section', 'print_page', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

In [140]:
def get_relevant_keys(article):
    '''Given an article (nested dictionary), collects the fields that are
    relevant for analysis and prints the type of each one.'''
    keys = ['abstract','lead_paragraph','headline','keywords','pub_date','document_type','section_name','type_of_material']

    dct = {}
    for key in keys:
        dct[key] = article[key]
        print(f'{key} --> {type(article[key])}')

    # still to flatten: 'headline' is a dict (e.g. article['headline']['main'])
    # and 'keywords' is a list of dicts
    return dct

relevant = get_relevant_keys(articles[0])

abstract --> <class 'str'>
lead_paragraph --> <class 'str'>
headline --> <class 'dict'>
keywords --> <class 'list'>
pub_date --> <class 'str'>
document_type --> <class 'str'>
section_name --> <class 'str'>
type_of_material --> <class 'str'>


In [142]:
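def concat_keywords(keywords_dct):
    '''concatenates the 'value' of each keyword in keywords_dct (a list of
    dictionaries) with semicolons'''
    return ';'.join(elt['value'] for elt in keywords_dct)

# look at the keywords of the first article
articles[0]['keywords']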

[{'name': 'subject', 'value': 'Hate Crimes', 'rank': 1, 'major': 'N'},
 {'name': 'subject', 'value': 'Black People', 'rank': 2, 'major': 'N'},
 {'name': 'subject', 'value': 'Blacks', 'rank': 3, 'major': 'N'},
 {'name': 'subject', 'value': 'Discrimination', 'rank': 4, 'major': 'N'},
 {'name': 'subject',
  'value': 'Civil Rights Movement (1954-68)',
  'rank': 5,
  'major': 'N'},
 {'name': 'subject', 'value': 'Reconstruction Era', 'rank': 6, 'major': 'N'},
 {'name': 'subject',
  'value': 'Segregation and Desegregation',
  'rank': 7,
  'major': 'N'},
 {'name': 'persons',
  'value': 'Nixon, Richard Milhous',
  'rank': 8,
  'major': 'N'}]
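
One possible way to finish point 3 is sketched below. It assumes the abstract is the field to keep (it is at least as long as the snippet in the February 2024 data), that the headline text lives under `headline['main']`, and that each keyword dictionary has a `'value'` entry as in the output above. The function name `flatten_article` and the toy article (including its headline and date) are made up for illustration.

```python
def flatten_article(article):
    """Return a flat dictionary (all values are strings) for one article."""
    headline = article.get('headline') or {}
    keywords = article.get('keywords') or []
    return {
        'abstract': article.get('abstract', ''),
        'lead_paragraph': article.get('lead_paragraph', ''),
        'headline': headline.get('main', ''),  # assumption: main headline only
        'keywords': ';'.join(kw['value'] for kw in keywords),
        'pub_date': article.get('pub_date', ''),
        'document_type': article.get('document_type', ''),
        'section_name': article.get('section_name', ''),
        'type_of_material': article.get('type_of_material', ''),
    }

# toy article with the same nesting as the API response (values are made up)
toy_article = {
    'abstract': 'Toy abstract.',
    'lead_paragraph': 'Toy lead paragraph.',
    'headline': {'main': 'Example headline'},
    'keywords': [
        {'name': 'subject', 'value': 'Hate Crimes', 'rank': 1, 'major': 'N'},
        {'name': 'subject', 'value': 'Black People', 'rank': 2, 'major': 'N'},
    ],
    'pub_date': '2024-02-01T00:00:00+0000',
    'document_type': 'article',
    'section_name': 'Opinion',
    'type_of_material': 'Op-Ed',
}
flat = flatten_article(toy_article)
print(flat['keywords'])
```

Using `.get` with defaults keeps the function from failing on an article that lacks one of the fields; whether such articles occur in the archive is not something this notebook has checked.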

4. Write another function that calls the function from point 3 on every article to create a list of article dictionaries, converts this list into a dataframe, and then stores it as a CSV file with the year and month in the file name (this is important for point 5 below).
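
A minimal sketch of this step follows. To stay free of extra dependencies it writes the CSV with the standard-library `csv` module rather than a dataframe; with pandas, `pd.DataFrame(rows).to_csv(path, index=False)` would replace the writer lines. The function name `save_articles_csv`, the `flatten` parameter (any function that turns one article into a flat dictionary), the file-name pattern `articles_YYYY-MM.csv`, and the toy data are all assumptions made for illustration.

```python
import csv
import tempfile
from pathlib import Path

def save_articles_csv(articles, year, month, flatten, folder='.'):
    """Flatten every article and store the rows as articles_YYYY-MM.csv."""
    rows = [flatten(article) for article in articles]
    if not rows:
        raise ValueError('no articles to save')
    path = Path(folder) / f'articles_{year}-{month:02d}.csv'
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return path

# toy usage: two made-up articles and a tiny stand-in for the flattening function
toy_articles = [
    {'section_name': 'Opinion', 'headline': {'main': 'Toy headline A'}},
    {'section_name': 'World', 'headline': {'main': 'Toy headline B'}},
]

def toy_flatten(article):
    return {'section_name': article['section_name'],
            'headline': article['headline']['main']}

with tempfile.TemporaryDirectory() as tmp:
    csv_path = save_articles_csv(toy_articles, 2024, 2, toy_flatten, folder=tmp)
    file_name = csv_path.name
    with open(csv_path, newline='', encoding='utf-8') as f:
        saved_rows = list(csv.DictReader(f))

print(file_name, len(saved_rows))
```

Zero-padding the month keeps the file names sorted chronologically, and returning the path lets a later step load the file without rebuilding the name.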

5. Once you have done all of this in the notebook, create a Python script that can be called with a date (from a TikTok video). First, the script checks whether a CSV with cleaned articles is already in our folder. If not, it first calls the API function to get the articles and then the function that converts them into a CSV. Then it loads the CSV into a dataframe and uses filtering to get the articles for the desired date. These articles will be used for the Semantic Similarity portion of the TikTok Project.
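
The control flow of such a script could look roughly like the sketch below. It is not a finished solution: it reads the CSV with the standard-library `csv` module instead of a dataframe, and the names `csv_path_for`, `articles_for_date`, and `build_csv`, as well as the file-name pattern `articles_YYYY-MM.csv`, are assumptions. `build_csv` stands in for "call the API, clean the articles, write the CSV"; here a fake version writes made-up rows so the example runs without a network connection.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path

def csv_path_for(date, folder='.'):
    """Build the CSV path for the month that contains date ('YYYY-MM-DD')."""
    dt = datetime.strptime(date, '%Y-%m-%d')
    return Path(folder) / f'articles_{dt.year}-{dt.month:02d}.csv'

def articles_for_date(date, build_csv, folder='.'):
    """Return the rows published on date, creating the CSV only if missing."""
    path = csv_path_for(date, folder)
    if not path.exists():
        build_csv(path)  # hypothetical callback: fetch, clean, and write the CSV
    with open(path, newline='', encoding='utf-8') as f:
        rows = list(csv.DictReader(f))
    # pub_date starts with 'YYYY-MM-DD', as in get_articles_by_date above
    return [row for row in rows if row['pub_date'][:10] == date]

# fake builder (made-up rows) that records how often it is called
calls = []

def fake_build_csv(path):
    calls.append(path.name)
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=['headline', 'pub_date'])
        writer.writeheader()
        writer.writerows([
            {'headline': 'Toy headline A', 'pub_date': '2024-02-12T10:00:00+0000'},
            {'headline': 'Toy headline B', 'pub_date': '2024-02-13T08:30:00+0000'},
            {'headline': 'Toy headline C', 'pub_date': '2024-02-12T23:15:00+0000'},
        ])

with tempfile.TemporaryDirectory() as tmp:
    first = articles_for_date('2024-02-12', fake_build_csv, folder=tmp)
    second = articles_for_date('2024-02-12', fake_build_csv, folder=tmp)

print(len(first), calls)
```

In an actual script the date would come from the command line (for example `sys.argv[1]`), and `build_csv` would wrap the API and cleaning functions written in the notebook. The second call reuses the file written by the first, which is the point of checking for the CSV before calling the API.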