# Practice with NYTimes API
CS315: Data Science for the Web  
Professor Eni Mustafaraj  
[Day 10 Slides 22-28](https://docs.google.com/presentation/d/15fuBhqPNv8GgeqNlNydKAlJW5ecAx92h7lQ_KyTxbKI/edit#slide=id.g2bd1c28673c_0_127)  
Edith Po  

In [189]:
import requests

In [230]:
key = '1DFmIMxxqdYl8wJBPqAFxtHkimk86Qtn'

In [231]:
year = 2024
month = 2

url = f'https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={key}'
print(url)

https://api.nytimes.com/svc/archive/v1/2024/2.json?api-key=1DFmIMxxqdYl8wJBPqAFxtHkimk86Qtn


In [232]:
data = requests.get(url)
data.status_code

200

In [193]:
articles = data.json()
len(articles)

2

In [194]:
articles.keys()

dict_keys(['copyright', 'response'])

In [195]:
len(articles['response']['docs']) # list

3791

In [196]:
articles['response']['docs'][0].keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'print_section', 'print_page', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

In [197]:
article0 = articles['response']['docs'][0]
print(article0['abstract'])
print(article0['web_url'])
print(article0['snippet'])
print(article0['section_name'])

Periods of backlash take shape after surges of Black progress. We have entered another such period.
https://www.nytimes.com/2024/01/31/opinion/racist-backlash-history.html
Periods of backlash take shape after surges of Black progress. We have entered another such period.
Opinion


## Find the Top 5 Section Names for February 2024 Articles

In [198]:
section_names = {}

for doc in articles['response']['docs']:
    section = doc['section_name']
    if section in section_names:
        section_names[section] += 1
    else:
        section_names[section] = 1

len(section_names)

36

In [199]:
# sort the section names by count
sorted_sections = sorted(section_names.items(),key=lambda x:x[1],reverse=True)
sorted_sections[:5]

[('U.S.', 734),
 ('World', 513),
 ('Arts', 326),
 ('Opinion', 272),
 ('Business Day', 244)]

In [200]:
# ALTERNATE METHOD FROM "Week 6 Task Solutions.ipynb"
sections = [article['section_name'] for article in articles['response']['docs']]

from collections import Counter

distDct = Counter(sections) # count the occurrences of each section name

distDct.most_common(10)

[('U.S.', 734),
 ('World', 513),
 ('Arts', 326),
 ('Opinion', 272),
 ('Business Day', 244),
 ('New York', 200),
 ('Style', 174),
 ('Books', 139),
 ('Crosswords & Games', 125),
 ('Movies', 123)]

In [201]:
def get_articles_by_year_month(year, month, key):
    # create URL
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={key}"

    # send the request to get the data
    data = requests.get(url)
    data.raise_for_status() # stop here with an HTTPError if the request failed
    print("Successfully got the data.")

    dataJson = data.json() # get response as JSON
    documents = dataJson['response']['docs']
    return documents

## Various Tasks

1. Write a Python function that takes a date, for example, "2024-02-12", and returns the list of articles for that day.

In [233]:
import datetime
def get_articles_by_date(date,key):
    dt = datetime.datetime.strptime(date,'%Y-%m-%d')

    # Get articles for given month and year
    url = f'https://api.nytimes.com/svc/archive/v1/{dt.year}/{dt.month}.json?api-key={key}'
    
    # send the request to get the data
    data = requests.get(url)
    if data.status_code == 200:
        print("Successfully got the data.")
    else:
        print("Did not get the data successfully.")

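    try:
        documents = data.json()['response']['docs']
        print("Documents found")
    except (ValueError, KeyError):
        # ValueError: the body was not valid JSON; KeyError: unexpected JSON structure
        print("Documents not found.")
        documents = []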

    articles = [doc for doc in documents if doc['pub_date'][:10] == date]

    return articles

In [234]:
feb12_articles = get_articles_by_date('2024-02-12',key)
print(len(feb12_articles))

Successfully got the data.
Documents found
116
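
One caveat about the date filter above: `pub_date` is a UTC timestamp, so slicing its first 10 characters gives the UTC calendar day. The first article printed earlier has `2024/01/31` in its `web_url` but a `pub_date` of `2024-02-01T00:00:08+0000`, which suggests the two can disagree near midnight. The sketch below is one possible way to compare by local date instead. The helper name `local_pub_date` and the fixed UTC-5 offset (Eastern Standard Time, valid for February but ignoring daylight saving) are assumptions for illustration, not part of the assignment.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-5 offset (EST). This ignores daylight saving time.
EASTERN_STANDARD = timezone(timedelta(hours=-5))

def local_pub_date(pub_date, tz=EASTERN_STANDARD):
    '''Converts a pub_date string like '2024-02-01T00:00:08+0000'
    to a 'YYYY-MM-DD' date string in the given timezone.'''
    dt = datetime.strptime(pub_date, '%Y-%m-%dT%H:%M:%S%z')
    return dt.astimezone(tz).date().isoformat()

print(local_pub_date('2024-02-01T00:00:08+0000'))
```

With a helper like this, the list comprehension could compare `local_pub_date(doc['pub_date'])` to `date` rather than `doc['pub_date'][:10]`, at the cost of also needing the neighboring month's data for the first and last day of a month.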


2. Write some code that explores whether the fields "abstract" and "snippet" are always the same or whether they differ. Which one has more information?

In [204]:
articles = get_articles_by_year_month(2024, 2, key)
len(articles)

Successfully got the data.


3791

In [205]:
dif_abstract_snippet = []

# find articles whose abstracts and snippets are not the same
for article in articles:
    abstract = article['abstract']
    snippet = article['snippet']
    if abstract != snippet and len(snippet) != 0:
        dif_abstract_snippet.append(article)

print(f"Number of articles with different abstracts and snippets (where snippet field was not empty): {len(dif_abstract_snippet)}")

percentage = (len(dif_abstract_snippet)/len(articles))*100
print(f"Only {percentage:.2f}% of the abstracts in Feb 2024 were different from the snippets.")

Number of articles with different abstracts and snippets (where snippet field was not empty): 5
Only 0.13% of the abstracts in Feb 2024 were different from the snippets.


In [206]:
abstract_lengths = [len(article['abstract']) for article in dif_abstract_snippet]
snippet_lengths = [len(article['snippet']) for article in dif_abstract_snippet]

print(abstract_lengths)
print(snippet_lengths)

avg_abstract = sum(abstract_lengths)/len(abstract_lengths)
avg_snippet = sum(snippet_lengths)/len(snippet_lengths)

print(f'Average Abstract Length: {avg_abstract}')
print(f'Average Snippet Length: {avg_snippet}')

[288, 253, 342, 253, 249]
[250, 250, 250, 250, 250]
Average Abstract Length: 277.0
Average Snippet Length: 250.0
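
Every differing snippet is exactly 250 characters long while the abstracts are all slightly longer, which hints that the snippet may be a truncated copy of the abstract. That is only a guess from five articles. The sketch below shows one way to test it; the function name `compare_fields` and the idea that a truncated snippet may end in dots or an ellipsis are assumptions, and the strings in the usage line are made up rather than taken from the API.

```python
def compare_fields(abstract, snippet):
    '''Classifies how a snippet relates to its abstract.'''
    if abstract == snippet:
        return 'same'
    if not snippet:
        return 'empty snippet'
    # Assumption: a truncated snippet may end with dots, an ellipsis, or spaces
    stem = snippet.rstrip('.… ')
    if abstract.startswith(stem):
        return 'snippet truncated'
    return 'different'

# Made-up example: a 300-character abstract and a 250-character cut-off snippet
print(compare_fields('x' * 300, 'x' * 247 + '...'))
```

Applying `compare_fields(article['abstract'], article['snippet'])` to the articles in `dif_abstract_snippet` would show whether all five fall into the truncated case, in which case the abstract is the field that carries more information.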


3. Write a function that, given one article (in its nested structure), creates a flat dictionary with the keys that are relevant for analysis: either the abstract or the snippet (see point 2); lead paragraph; headline; keywords concatenated via semicolons; pub_date; document_type; section_name; and type_of_material.

In [207]:
articles[0].keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'print_section', 'print_page', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

In [208]:
# Look at the types of all the relevant keys for analysis
# (the loop variable is not named `key`, since `key` holds the API key)
relevant_keys = ['abstract','lead_paragraph','headline','keywords','pub_date','document_type','section_name','type_of_material']
for k in relevant_keys: print(f'{k}: {type(articles[0][k])}')

abstract: <class 'str'>
lead_paragraph: <class 'str'>
headline: <class 'dict'>
keywords: <class 'list'>
pub_date: <class 'str'>
document_type: <class 'str'>
section_name: <class 'str'>
type_of_material: <class 'str'>


In [209]:
# Look at the structure of the keywords list
articles[0]['keywords']

[{'name': 'subject', 'value': 'Hate Crimes', 'rank': 1, 'major': 'N'},
 {'name': 'subject', 'value': 'Black People', 'rank': 2, 'major': 'N'},
 {'name': 'subject', 'value': 'Blacks', 'rank': 3, 'major': 'N'},
 {'name': 'subject', 'value': 'Discrimination', 'rank': 4, 'major': 'N'},
 {'name': 'subject',
  'value': 'Civil Rights Movement (1954-68)',
  'rank': 5,
  'major': 'N'},
 {'name': 'subject', 'value': 'Reconstruction Era', 'rank': 6, 'major': 'N'},
 {'name': 'subject',
  'value': 'Segregation and Desegregation',
  'rank': 7,
  'major': 'N'},
 {'name': 'persons',
  'value': 'Nixon, Richard Milhous',
  'rank': 8,
  'major': 'N'}]

In [210]:
def concat_keywords(keywords_list):
    '''concatenates keywords from keywords_list with semicolons'''
    return ';'.join(dct['value'] for dct in keywords_list)

concat_keywords(articles[0]['keywords'])

'Hate Crimes;Black People;Blacks;Discrimination;Civil Rights Movement (1954-68);Reconstruction Era;Segregation and Desegregation;Nixon, Richard Milhous'

In [211]:
# Examine headline dictionary
articles[0]['headline']

{'main': 'The Dawn of a New Era of Oppression',
 'kicker': 'Charles M. Blow',
 'content_kicker': None,
 'print_headline': 'The Dawn of a New Era of Oppression',
 'name': None,
 'seo': None,
 'sub': None}

In [218]:
def create_flat_dct(article):
    '''Given an article (nested dictionary), returns a flat dictionary of keys
    that are relevant for analysis.'''
    str_val_keys = ['abstract','lead_paragraph','pub_date','document_type',
                    'section_name','type_of_material']
    dct = {}
    for key in str_val_keys:
        dct[key] = article[key]

    dct['headline'] = article['headline']['main']
    dct['keywords'] = concat_keywords(article['keywords'])

    return dct


create_flat_dct(articles[0])

{'abstract': 'Periods of backlash take shape after surges of Black progress. We have entered another such period.',
 'lead_paragraph': 'I am fascinated, and alarmed, by the swiftness with which periods of backlash take shape after surges of Black progress, and I believe that we have entered another such period.',
 'pub_date': '2024-02-01T00:00:08+0000',
 'document_type': 'article',
 'section_name': 'Opinion',
 'type_of_material': 'Op-Ed',
 'headline': 'The Dawn of a New Era of Oppression',
 'keywords': 'Hate Crimes;Black People;Blacks;Discrimination;Civil Rights Movement (1954-68);Reconstruction Era;Segregation and Desegregation;Nixon, Richard Milhous'}
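
The function above indexes every field directly, so it raises a `KeyError` if a document lacks one of them. The headline dictionary shown earlier already contains `None` for several sub-fields, so it seems plausible that other fields can be missing or `None` as well, though nothing in this notebook confirms it. The sketch below is a hypothetical defensive variant (the name `create_flat_dct_safe` is not from the assignment) that falls back to empty strings.

```python
def create_flat_dct_safe(article):
    '''Hypothetical variant of create_flat_dct: missing or None fields
    become empty strings instead of raising an error.'''
    str_val_keys = ['abstract', 'lead_paragraph', 'pub_date', 'document_type',
                    'section_name', 'type_of_material']
    dct = {k: article.get(k) or '' for k in str_val_keys}

    headline = article.get('headline') or {}
    dct['headline'] = headline.get('main') or ''

    keywords = article.get('keywords') or []
    dct['keywords'] = ';'.join(kw['value'] for kw in keywords)
    return dct

# Made-up partial article to show the fallbacks
partial = {'abstract': 'An abstract.', 'headline': {'main': None},
           'keywords': [{'value': 'Subways'}, {'value': 'Elections'}]}
print(create_flat_dct_safe(partial))
```

The trade-off is that empty strings hide gaps in the data, so counting how many rows end up with an empty field is a sensible follow-up before analysis.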

4. Write another function that calls the function from point 3 on every article to create a list of article dictionaries, converts this list into a dataframe, and stores it as a CSV file with the date-month in the title (this is important for point 5 below).

In [238]:
import pandas as pd

def create_df(date,key):
    '''
    Takes the date in YYYY-MM-DD format, gets all the articles for that 
    particular date, gets the relevant data for each article, saves the data 
    as a df, writes it to a csv file, and returns the df.
    '''
    # get all the articles for the given date
    articles = get_articles_by_date(date,key)
    
    # Create list of flattened dictionaries
    flat_articles = []
    for article in articles:
        flat_articles.append(create_flat_dct(article))

    # Convert list to df
    df = pd.DataFrame(flat_articles)

    # Save as csv
    filename = f'{date}-articles.csv'
    print(filename)
    df.to_csv(filename)

    return df

In [239]:
df = create_df('2024-02-12',key)
df

Successfully got the data.
Documents found
2024-02-12-articles.csv


Unnamed: 0,abstract,lead_paragraph,pub_date,document_type,section_name,type_of_material,headline,keywords
0,A Cetaphil commercial showed a father and daug...,When an advertisement for Cetaphil lotion was ...,2024-02-12T00:30:32+0000,article,Business Day,News,"Ad Nods to Taylor Swift and Football, Drawing ...",Advertising and Marketing;Super Bowl;Cosmetics...
1,Taylor Swift and Travis Kelce have been the su...,Extending a weekslong right-wing meltdown over...,2024-02-12T00:32:24+0000,article,U.S.,News,Trump Says It Would Be ‘Disloyal’ for Taylor S...,"Swift, Taylor;Kelce, Travis;Trump, Donald J;Bi..."
2,In a halftime set that touched on more than a ...,A few minutes into Usher’s dynamic and sly Sup...,2024-02-12T02:14:00+0000,article,Arts,Review,Usher Brings Precise Details to Pop’s Biggest ...,"Rap and Hip-Hop;Super Bowl;Usher;Keys, Alicia;..."
3,The pop superstar used a Verizon ad to tell fa...,After days of speculation and online sleuthing...,2024-02-12T02:41:56+0000,article,Arts,News,Beyoncé Announces New Album in Super Bowl Comm...,"Pop and Rock Music;Super Bowl;Knowles, Beyonce..."
4,"William Albert Haynes Jr., 70, went by “Billy ...","William Albert “Billy Jack” Haynes Jr., who in...",2024-02-12T03:10:32+0000,article,Arts,News,Former W.W.F. Wrestler Arrested in Wife’s Murder,"Murders, Attempted Murders and Homicides;Wrest..."
...,...,...,...,...,...,...,...,...
111,Pregnant women with diabetes or high blood pre...,Women who develop high blood pressure or diabe...,2024-02-12T23:13:06+0000,article,Health,News,Children Born to Mothers With Pregnancy Compli...,your-feed-science;Pregnancy and Childbirth;Dia...
112,The shooting took place at the Mount Eden Aven...,A 35-year-old man was killed and five other pe...,2024-02-12T23:13:38+0000,article,New York,News,One Killed and 5 Wounded in Shooting at Bronx ...,"Subways;Murders, Attempted Murders and Homicid..."
113,Our guide to the themes dominating the race.,The special election in New York’s Third Congr...,2024-02-12T23:20:47+0000,article,U.S.,News,How Special Is New York’s Special Election?,"Politics and Government;Elections;Suozzi, Thom..."
114,The former president showed up at the federal ...,Former President Donald J. Trump and his lawye...,2024-02-12T23:31:02+0000,article,U.S.,News,Trump Attends Court Hearing on Access to Class...,Federal Criminal Case Against Trump (Documents...


5. Once you have done all of these in the notebook, create a Python script that can be called with a date (from a TikTok video). First, the script checks whether a CSV with cleaned articles is in our folder. If not, it first calls the API function to get the articles and then the function that converts them into a CSV. Then, it loads the CSV into a dataframe and uses filtering to get the articles for the desired date. These articles will be used for the Semantic Similarity portion of the TikTok Project.

Completed this in the Python file **articles_to_csv.py**.
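
The script itself is not reproduced in this notebook. As a rough illustration of the check-then-fetch logic described in point 5, here is a minimal sketch under several assumptions: the function name `load_articles_for_date`, the monthly file name pattern `YYYY-MM-articles.csv`, and the `fetch_month` callback are all hypothetical, and the standard-library `csv` module stands in for the pandas dataframe so that the example runs without extra dependencies. It is not the contents of `articles_to_csv.py`.

```python
import csv
import os

def load_articles_for_date(date, fetch_month, folder='.'):
    '''date is 'YYYY-MM-DD'; fetch_month(year, month) is assumed to return
    a list of flat article dictionaries that each have a 'pub_date' key.'''
    path = os.path.join(folder, f'{date[:7]}-articles.csv')

    # Only call the API (via fetch_month) when the month's CSV is not cached
    if not os.path.exists(path):
        rows = fetch_month(int(date[:4]), int(date[5:7]))
        fieldnames = list(rows[0].keys()) if rows else ['pub_date']
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)

    # Load the cached CSV and keep the rows for the requested date
    with open(path, newline='', encoding='utf-8') as f:
        return [row for row in csv.DictReader(f)
                if row['pub_date'][:10] == date]
```

In the notebook's terms, `fetch_month` could wrap `get_articles_by_year_month` followed by `create_flat_dct` on each article, and `pd.read_csv(path)` with a boolean filter on `pub_date` would replace the `csv.DictReader` step.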