# Get data from The New York Times

In this notebook, we set up the functions to send and process multiple queries to the Article Search API provided by The New York Times. Each query asks for articles that contain a deaf-related phrase, such as "deaf and dumb" or "hearing-impaired". The full list of phrases appears below.

The Article Search API returns a maximum of 10 results per request. Suppose a query gets 143 hits. The first response has an offset of 0, which means it holds the first 10 results, or the first page (`page=0`). If we append `&page=1` to the query and process that response, the offset is now 10, which confirms we are on the second page. Since each page has 10 results and we have 143 hits, we need to go through ceil(143/10) = 15 pages to get all of them.
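
The page arithmetic can be checked with a short sketch. The helper names `page_count` and `page_offsets` are illustrative and not part of the notebook. Pages are assumed to be numbered from 0, which is how `send_query` counts them below.

```python
import math

PAGE_SIZE = 10  # results per page, as described above


def page_count(hits, page_size=PAGE_SIZE):
    """Number of pages needed to cover `hits` results."""
    return math.ceil(hits / page_size)


def page_offsets(hits, page_size=PAGE_SIZE):
    """Offset of the first result on each page, with pages numbered from 0."""
    return [page * page_size for page in range(page_count(hits, page_size))]


print(page_count(143))        # 15
print(page_offsets(143)[:3])  # [0, 10, 20]
print(page_offsets(143)[-1])  # 140
```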

Contents:
- Define functions
- Set up queries
- Get data
- Postprocessing

Import dependencies.

In [6]:
import os
import dateutil.parser  # parse() is used in parse_response
import pandas as pd
#pd.options.display.max_colwidth = 100
import numpy as np
import requests
import time
import json
from ast import literal_eval
import datetime
from dateutil.relativedelta import relativedelta

# Usernames and passwords
import configparser
configs = configparser.ConfigParser()
configs.read('../../config.ini')

['../../config.ini']

## Define functions

In [7]:
def send_request(query, page):
    '''Sends one request to the NYT Article Search API for the given query and page.'''
    base_url = 'http://api.nytimes.com/svc/search/v2/articlesearch.json'
    url = base_url + '?fq=' + query + '&api-key=' + configs['NYT']['ACCESS_KEY'] + '&page=' + str(page)
    response = requests.get(url).json()
    time.sleep(6)  # Pause between requests to stay under the API rate limit
    return response


def parse_response(response, data):
    '''Parses a response and appends each article's fields to the lists in data.'''
    
    articles = response['response']['docs'] 
    for article in articles: 
        
        # id
        data['id'].append(article['_id'])
        
        # Date
        date = dateutil.parser.parse(article['pub_date']).date()
        data['date'].append(date)
        
        # Headline
        data['headline'].append(article['headline']['main']) 
        
        # Section
        if 'section_name' in article:
            data['section'].append(article['section_name'])
        else:
            data['section'].append(None)
        
        # News desk
        if 'news_desk' in article:
            data['news_desk'].append(article['news_desk'])
        else:
            data['news_desk'].append(None)
        
        # Document type
        data['doc_type'].append(article['document_type'])
        
        # Type of material
        if 'type_of_material' in article: 
            data['material_type'].append(article['type_of_material'])
        else:
            data['material_type'].append(None)
            
        # Keywords
        keywords = [keyword['value'] for keyword in article['keywords'] if keyword['name'] == 'subject']
        data['keywords'].append(keywords)
        
        # Web URL
        if 'web_url' in article:
            data['url'].append(article['web_url'])
        else:
            data['url'].append(None)  # Keep all lists the same length
            
        # Author
        if 'byline' in article:
            data['byline'].append(article['byline']['original'])
        else:
            data['byline'].append(None)
            
            
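def send_query(query, data, date=None):
    '''Pages through all results for one query key, appending them to data. Returns False if a CSV for the key already exists.'''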
    # If the query has already been sent, don't send again
    if query + '.csv' in os.listdir('data'): 
        print('Already have data for the term "' + query + '".\n')
        return False
    
    # If date is provided, append to query string
    query_str = queries[query]
    if date:
        query_str = query_str + date
        
    print('Querying string: ' + query_str + '\n')
    
    page_num = 0
    while True:
        response = send_request(query_str, page_num)
        offset = response['response']['meta']['offset']
        hits = response['response']['meta']['hits']
        
        if offset > hits: 
            print('Done processing results.\n')
            return True
        # If we have 2,000 hits or more, we will need to break down our query into date intervals
        elif hits >= 2000: 
            print('We have over 2,000 hits.\n')
            # Send the same query again, once for each date interval
            for q_date in q_dates:  # Avoid shadowing the date parameter
                send_query(query, data, q_date) 
            return True
            
        print('Processing results ' + str(offset) + '—' + str(min((offset + 10), hits)) + '/' + str(hits) + '...')
        parse_response(response, data)
        page_num += 1
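
The stopping rule in `send_query` can be tried without network access by separating the paging loop from the request. This is a minimal sketch, not the notebook's code: `iter_pages` and `fake_fetch` are hypothetical names, and the fake response only mimics the `meta` and `docs` fields used above. It stops when `offset >= hits` rather than `offset > hits`, which would avoid one extra request when the hit count is an exact multiple of 10, assuming the API reports the offset the same way in that case.

```python
def iter_pages(fetch_page, max_pages=None):
    """Yield (page_number, response) until the reported offset reaches the hit count.

    `fetch_page` is any callable taking a page number and returning a dict
    shaped like the Article Search response used in this notebook.
    """
    page = 0
    while max_pages is None or page < max_pages:
        response = fetch_page(page)
        meta = response['response']['meta']
        if meta['offset'] >= meta['hits']:
            return
        yield page, response
        page += 1


def fake_fetch(page, hits=23, page_size=10):
    """Hypothetical stand-in for send_request: placeholder docs, realistic meta."""
    offset = page * page_size
    n_docs = max(0, min(page_size, hits - offset))
    return {'response': {'meta': {'offset': offset, 'hits': hits},
                         'docs': [{'_id': offset + i} for i in range(n_docs)]}}


pages = list(iter_pages(fake_fetch))
print([page for page, _ in pages])                         # [0, 1, 2]
print(sum(len(r['response']['docs']) for _, r in pages))   # 23
```

In the notebook, `fetch_page` could be something like `lambda page: send_request(query_str, page)`.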

## Set up queries

Our goal is to analyze the usage of deaf-related terms in *The New York Times*. 

The deaf-related terms: 

- "deaf and dumb"
- "deaf-mute"
- "hearing-impaired"
- "tone deaf"
- "deaf as a post"
- "stone deaf" 
- "fell on deaf ears"
- "deaf" (excluding its presence in all of the above terms) 

Some terms have variants. For example, take "fell on deaf ears." Taking into consideration all of its variants (those that yielded hits when testing manually), the resulting query would be 

>`body:"fell on deaf ears" OR headline:"fell on deaf ears" 
OR body:"fall on deaf ears" OR headline:"fall on deaf ears" 
OR body:"falls on deaf ears" 
OR body:"fall on a deaf ear" 
OR body:"fell on a deaf ear" 
OR body:"turn a deaf ear" OR headline:"turn a deaf ear" 
OR body:"turned deaf ear" OR headline:"turned deaf ear"
OR body:"turned a deaf ear" OR headline:"turned a deaf ear"`

Construct the remaining queries. I built each one by testing the API manually in the browser, adding conditions one at a time, with a URL of this form:

http://api.nytimes.com/svc/search/v2/articlesearch.json?fq= + `query` + &api-key= + `api_key`
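
Queries contain spaces, quotes, and colons, so they need encoding when placed in a URL. The browser and `requests` do this automatically, but it can be made explicit with the standard library. The sketch below is an illustration, not the notebook's method: `build_url` is a hypothetical helper and `'YOUR_KEY'` is a placeholder.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = 'http://api.nytimes.com/svc/search/v2/articlesearch.json'


def build_url(fq, api_key, page=0, **extra):
    """Return a fully encoded Article Search URL for the given filter query."""
    params = {'fq': fq, 'api-key': api_key, 'page': page, **extra}
    return BASE_URL + '?' + urlencode(params)


url = build_url('body:"tone deaf" OR headline:"tone deaf"', 'YOUR_KEY',
                begin_date='20120101', end_date='20121231')
print(url)

# Round trip: decoding the URL gives back the original filter query
decoded = parse_qs(urlsplit(url).query)
print(decoded['fq'][0])
```

Passing the date bounds as separate parameters keeps them out of the `fq` string, which is a design choice of this sketch; the notebook instead appends them to the query string.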

In [18]:
deaf_and_dumb = ['deaf and dumb',
                 'deaf dumb']

deaf_mute = ['deaf mute',
             'deaf and mute',
             'mute deaf',
             'mute and deaf']

fell_on_deaf_ears = ['fell on deaf ears', 
                     'fall on deaf ears', 
                     'falls on deaf ears', 
                     'fall on a deaf ear',
                     'falling on deaf ears',
                     'falling on a deaf ear',
                     'turn a deaf ear', 
                     'turned deaf ears', 
                     'turned a deaf ear',
                     'turning deaf ears',
                     'turning a deaf ear']

hearing_impaired = ['hearing impaired', 
                    'hearing impairment',
                    'impaired hearing']

tone_deaf = ['tone deaf']

deaf_as_a_post = ['deaf as a post']

stone_deaf = ['stone deaf']

deaf = ['deaf']

deaf_dumb_and_blind = ['deaf dumb and blind',
                       'deaf dumb blind', 
                       'blind deaf dumb',
                       'blind deaf and dumb',
                       'blind and deaf and dumb']

deafblind = ['deaf blind',
              'deafblind',
              'deaf and blind']

hard_of_hearing = ['hard of hearing']

In [58]:
phrases = {'deaf_and_dumb': deaf_and_dumb,
           'deaf_dumb_and_blind': deaf_dumb_and_blind,
           'deafblind': deafblind,
           'deaf_mute': deaf_mute,
           'hard_of_hearing': hard_of_hearing,
           'hearing_impaired': hearing_impaired,
           'fell_on_deaf_ears': fell_on_deaf_ears,
           'tone_deaf': tone_deaf,
           'deaf_as_a_post': deaf_as_a_post,
           'stone_deaf': stone_deaf,
           'deaf': deaf}

# Save phrases to file
with open('data/phrases.txt', 'w') as outfile:
    json.dump(phrases, outfile)

Construct queries based on phrases.

In [59]:
queries = {phrase: ' OR '.join('body:"' + x + '" OR headline:"' + x + '"' for x in variants)
           for phrase, variants in phrases.items()}

# Edit 'deaf' query to exclude all other phrases that contain 'deaf'
queries['deaf'] = '(' + queries['deaf'] \
+ ') AND NOT (' + queries['deaf_and_dumb'] \
+ ') AND NOT (' + queries['deaf_mute'] \
+ ') AND NOT (' + queries['fell_on_deaf_ears'] \
+ ') AND NOT (' + queries['tone_deaf'] \
+ ') AND NOT (' + queries['deaf_as_a_post'] \
+ ') AND NOT (' + queries['stone_deaf'] \
+ ') AND NOT (' + queries['deaf_dumb_and_blind'] \
+ ') AND NOT (' + queries['deafblind'] + ')'

# did not get any results
# Edit 'deaf_dumb_and_blind' query to exclude 'deaf_and_dumb' and 'deafblind'
#queries['deaf_dumb_and_blind'] = '(' + queries['deaf_dumb_and_blind'] \
#+ ') AND NOT (' + queries['deaf_and_dumb'] + ')'

# Edit deafblind query to exclude deaf_dumb_and_blind
queries['deafblind'] = '(' + queries['deafblind'] \
+ ') AND NOT (' + queries['deaf_dumb_and_blind'] + ')'

queries 

{'deaf_and_dumb': 'body:"deaf and dumb" OR headline:"deaf and dumb" OR body:"deaf dumb" OR headline:"deaf dumb"',
 'deaf_dumb_and_blind': 'body:"deaf dumb and blind" OR headline:"deaf dumb and blind" OR body:"deaf dumb blind" OR headline:"deaf dumb blind" OR body:"blind deaf dumb" OR headline:"blind deaf dumb" OR body:"blind deaf and dumb" OR headline:"blind deaf and dumb" OR body:"blind and deaf and dumb" OR headline:"blind and deaf and dumb"',
 'deafblind': '(body:"deaf blind" OR headline:"deaf blind" OR body:"deafblind" OR headline:"deafblind" OR body:"deaf and blind" OR headline:"deaf and blind") AND NOT (body:"deaf dumb and blind" OR headline:"deaf dumb and blind" OR body:"deaf dumb blind" OR headline:"deaf dumb blind" OR body:"blind deaf dumb" OR headline:"blind deaf dumb" OR body:"blind deaf and dumb" OR headline:"blind deaf and dumb" OR body:"blind and deaf and dumb" OR headline:"blind and deaf and dumb")',
 'deaf_mute': 'body:"deaf mute" OR headline:"deaf mute" OR body:"deaf a

## Get data

Date intervals for the NYT Article Search API are formatted as `&begin_date=20120101&end_date=20121231`. We need date intervals because a query that returns 2,000 or more results can't be retrieved in full. We make such a query smaller by narrowing the date interval, using 5-year intervals.

In [60]:
start, end = [datetime.datetime.strptime("1850-01-01", "%Y-%m-%d"), datetime.datetime.today()]
interval = 5 # years
dates = [(start + relativedelta(years=x)).strftime('%Y%m%d') for x in range(0, relativedelta(end, start).years + 10, interval)]
q_dates = ['&begin_date=' + start + '&end_date=' + end for start, end in zip(dates, dates[1:])]
q_dates

['&begin_date=18500101&end_date=18550101',
 '&begin_date=18550101&end_date=18600101',
 '&begin_date=18600101&end_date=18650101',
 '&begin_date=18650101&end_date=18700101',
 '&begin_date=18700101&end_date=18750101',
 '&begin_date=18750101&end_date=18800101',
 '&begin_date=18800101&end_date=18850101',
 '&begin_date=18850101&end_date=18900101',
 '&begin_date=18900101&end_date=18950101',
 '&begin_date=18950101&end_date=19000101',
 '&begin_date=19000101&end_date=19050101',
 '&begin_date=19050101&end_date=19100101',
 '&begin_date=19100101&end_date=19150101',
 '&begin_date=19150101&end_date=19200101',
 '&begin_date=19200101&end_date=19250101',
 '&begin_date=19250101&end_date=19300101',
 '&begin_date=19300101&end_date=19350101',
 '&begin_date=19350101&end_date=19400101',
 '&begin_date=19400101&end_date=19450101',
 '&begin_date=19450101&end_date=19500101',
 '&begin_date=19500101&end_date=19550101',
 '&begin_date=19550101&end_date=19600101',
 '&begin_date=19600101&end_date=19650101',
 '&begin_da
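
In the list above, each interval ends on the same date that the next one begins. If the API treats both bounds as inclusive (an assumption worth checking against the API documentation), an article published on a boundary date could be returned by two intervals. A minimal sketch of non-overlapping intervals follows. `date_intervals` is a hypothetical helper, not part of the notebook.

```python
from datetime import date, timedelta


def date_intervals(first_year, last_year, step=5):
    """Build begin/end query fragments covering first_year..last_year without overlap."""
    fragments = []
    for year in range(first_year, last_year + 1, step):
        begin = date(year, 1, 1)
        # End on the day before the next interval begins
        end = date(year + step, 1, 1) - timedelta(days=1)
        fragments.append('&begin_date=' + begin.strftime('%Y%m%d')
                         + '&end_date=' + end.strftime('%Y%m%d'))
    return fragments


for fragment in date_intervals(1850, 1860):
    print(fragment)
```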

Send these queries to the Article Search API to create CSV tables that we save to `./data/`.

In [61]:
for query in list(queries.keys()):
    # Reset the global data object
    data = {'headline': [],  
            'date': [], 
            'doc_type': [],
            'material_type': [],
            'news_desk': [],
            'section': [],
            'keywords': [],
            'url': [],
            'id': [],
            'byline': []}
    
    # Send query
    got_result = send_query(query, data)
    
    # Build and save frame from data object 
    if got_result:
        data_df = pd.DataFrame(data)
        data_df['date'] = pd.to_datetime(data_df['date'])
        data_df.to_csv('data/' + query + '.csv', index=False)
        print('Saved as ' + query + '.csv.\n')

Already have data for the term "deaf_and_dumb".

Already have data for the term "deaf_dumb_and_blind".

Already have data for the term "deafblind".

Already have data for the term "deaf_mute".

Querying string: body:"hard of hearing" OR headline:"hard of hearing"

Processing results 0—10/1245...
Processing results 10—20/1245...
Processing results 20—30/1245...
Processing results 30—40/1245...
Processing results 40—50/1245...
Processing results 50—60/1245...
Processing results 60—70/1245...
Processing results 70—80/1245...
Processing results 80—90/1245...
Processing results 90—100/1245...
Processing results 100—110/1245...
Processing results 110—120/1245...
Processing results 120—130/1245...
Processing results 130—140/1245...
Processing results 140—150/1245...
Processing results 150—160/1245...
Processing results 160—170/1245...
Processing results 170—180/1245...
Processing results 180—190/1245...
Processing results 190—200/1245...
Processing results 200—210/1245...
Processing results 2

Merge all dataframes into one, with True or False for whether each article contains a particular phrase. This means we're adding one `bool` column per query key:

- deaf_and_dumb
- deaf_dumb_and_blind
- deafblind
- deaf_mute
- hard_of_hearing
- hearing_impaired
- fell_on_deaf_ears
- tone_deaf
- deaf_as_a_post
- stone_deaf
- deaf

In [62]:
final_df = pd.DataFrame(columns=data.keys())

# For each key
for key in queries.keys():
    # Read in its CSV
    df = pd.read_csv('data/' + key + '.csv')
    
    # Drop any duplicates for this term alone
    num_dupes = len(df[df['id'].duplicated()])
    if num_dupes > 0:
        print('Dropping duplicates: ' + str(num_dupes))
        df = df.drop_duplicates(subset='id')
    print('added to final_df the df for key ' + key + ' with ' + str(len(df)) + ' values')
    
    df[key] = True # Label each row of this df as belonging to that df
    final_df = pd.concat([final_df, df], axis=0) # Add the df to the final df
    final_df.reset_index(drop=True, inplace=True)

print('\nTotal values: ' + str(len(final_df)))

added to final_df the df for key deaf_and_dumb with 727 values
added to final_df the df for key deaf_dumb_and_blind with 182 values
added to final_df the df for key deafblind with 536 values
added to final_df the df for key deaf_mute with 1120 values
added to final_df the df for key hard_of_hearing with 1245 values
added to final_df the df for key hearing_impaired with 1423 values
added to final_df the df for key fell_on_deaf_ears with 1757 values
added to final_df the df for key tone_deaf with 1739 values
added to final_df the df for key deaf_as_a_post with 22 values
added to final_df the df for key stone_deaf with 65 values
Dropping duplicates: 8
added to final_df the df for key deaf with 10824 values

Total values: 19640


In [63]:
final_df

Unnamed: 0,headline,date,doc_type,material_type,news_desk,section,keywords,url,id,byline,...,deaf_dumb_and_blind,deafblind,deaf_mute,hard_of_hearing,hearing_impaired,fell_on_deaf_ears,tone_deaf,deaf_as_a_post,stone_deaf,deaf
0,THE DEAF AND DUMB WAITER.,1885-12-03,article,Archives,,Archives,[],https://www.nytimes.com/1885/12/03/archives/th...,nyt://article/0074c23c-1ff6-5bc7-85d9-e56a5af3...,,...,,,,,,,,,,
1,Chad Threatens to Expel Sudanese Refugees,2006-04-14,article,News,International,World,[],https://www.nytimes.com/2006/04/14/world/chad-...,nyt://article/00bb19d7-2ba6-5072-8e6b-3159730d...,By Marc Lacey,...,,,,,,,,,,
2,WELFARE HOTEL CHILDREN: TOMORROW'S POOR,1987-07-16,article,News,Metropolitan Desk,New York,"['Homeless Persons', 'HOTELS AND MOTELS', 'Chi...",https://www.nytimes.com/1987/07/16/nyregion/we...,nyt://article/01670df3-ae07-5eb6-8862-7bd834bf...,By Lydia Chavez,...,,,,,,,,,,
3,Wal-Mart Says Oil Prices Held Down Profits for...,2005-08-16,article,News,Business,Business Day,['Company Reports'],https://www.nytimes.com/2005/08/16/business/wa...,nyt://article/0175ac61-cc62-5cdc-923c-f5efb8ec...,By Roben Farzad,...,,,,,,,,,,
4,"A Space Force? The Idea May Have Merit, Some Say",2018-06-23,article,News,Washington,U.S.,"['Space and Astronomy', 'United States Defense...",https://www.nytimes.com/2018/06/23/us/politics...,nyt://article/01b8b8a5-7d0c-592a-a283-a9ccd3d8...,By Helene Cooper,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19635,"Alexa, Awake",2020-01-03,multimedia,Interactive Feature,Opinion,Opinion,"['Privacy', 'Science and Technology']",https://www.nytimes.com/interactive/2020/01/03...,nyt://interactive/c104e18e-f797-5490-b03c-259b...,By Brian Turner,...,,,,,,,,,,True
19636,Confessions of a Dating Profile,2020-01-03,multimedia,Interactive Feature,Opinion,Opinion,"['Artificial Intelligence', 'Dating and Relati...",https://www.nytimes.com/interactive/2020/01/03...,nyt://interactive/ccaba5b4-672d-568b-9daa-b6bb...,By Eric Kaplan,...,,,,,,,,,,True
19637,Parent-Teacher Association,2020-01-03,multimedia,Interactive Feature,Opinion,Opinion,"['Privacy', 'Education (K-12)', 'Children and ...",https://www.nytimes.com/interactive/2020/01/03...,nyt://interactive/df6cd908-52b0-5851-ac81-e7b8...,By Jessica Powell,...,,,,,,,,,,True
19638,Protecting the Rights of Those With Disabilities,2020-08-01,article,Letter,Letters,Opinion,"['Disabilities', 'AMERICANS WITH DISABILITIES ...",https://www.nytimes.com/2020/08/01/opinion/let...,nyt://article/a23c70a4-3518-5560-a80b-38a8292d...,,...,,,,,,,,,,True


## Postprocessing

Go back through each key's column and set `NaN` to `False`. 

In [64]:
for key in queries.keys():
    final_df[key] = final_df[key].fillna(False).astype(bool)  # Assign back instead of filling in place
final_df

Unnamed: 0,headline,date,doc_type,material_type,news_desk,section,keywords,url,id,byline,...,deaf_dumb_and_blind,deafblind,deaf_mute,hard_of_hearing,hearing_impaired,fell_on_deaf_ears,tone_deaf,deaf_as_a_post,stone_deaf,deaf
0,THE DEAF AND DUMB WAITER.,1885-12-03,article,Archives,,Archives,[],https://www.nytimes.com/1885/12/03/archives/th...,nyt://article/0074c23c-1ff6-5bc7-85d9-e56a5af3...,,...,False,False,False,False,False,False,False,False,False,False
1,Chad Threatens to Expel Sudanese Refugees,2006-04-14,article,News,International,World,[],https://www.nytimes.com/2006/04/14/world/chad-...,nyt://article/00bb19d7-2ba6-5072-8e6b-3159730d...,By Marc Lacey,...,False,False,False,False,False,False,False,False,False,False
2,WELFARE HOTEL CHILDREN: TOMORROW'S POOR,1987-07-16,article,News,Metropolitan Desk,New York,"['Homeless Persons', 'HOTELS AND MOTELS', 'Chi...",https://www.nytimes.com/1987/07/16/nyregion/we...,nyt://article/01670df3-ae07-5eb6-8862-7bd834bf...,By Lydia Chavez,...,False,False,False,False,False,False,False,False,False,False
3,Wal-Mart Says Oil Prices Held Down Profits for...,2005-08-16,article,News,Business,Business Day,['Company Reports'],https://www.nytimes.com/2005/08/16/business/wa...,nyt://article/0175ac61-cc62-5cdc-923c-f5efb8ec...,By Roben Farzad,...,False,False,False,False,False,False,False,False,False,False
4,"A Space Force? The Idea May Have Merit, Some Say",2018-06-23,article,News,Washington,U.S.,"['Space and Astronomy', 'United States Defense...",https://www.nytimes.com/2018/06/23/us/politics...,nyt://article/01b8b8a5-7d0c-592a-a283-a9ccd3d8...,By Helene Cooper,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19635,"Alexa, Awake",2020-01-03,multimedia,Interactive Feature,Opinion,Opinion,"['Privacy', 'Science and Technology']",https://www.nytimes.com/interactive/2020/01/03...,nyt://interactive/c104e18e-f797-5490-b03c-259b...,By Brian Turner,...,False,False,False,False,False,False,False,False,False,True
19636,Confessions of a Dating Profile,2020-01-03,multimedia,Interactive Feature,Opinion,Opinion,"['Artificial Intelligence', 'Dating and Relati...",https://www.nytimes.com/interactive/2020/01/03...,nyt://interactive/ccaba5b4-672d-568b-9daa-b6bb...,By Eric Kaplan,...,False,False,False,False,False,False,False,False,False,True
19637,Parent-Teacher Association,2020-01-03,multimedia,Interactive Feature,Opinion,Opinion,"['Privacy', 'Education (K-12)', 'Children and ...",https://www.nytimes.com/interactive/2020/01/03...,nyt://interactive/df6cd908-52b0-5851-ac81-e7b8...,By Jessica Powell,...,False,False,False,False,False,False,False,False,False,True
19638,Protecting the Rights of Those With Disabilities,2020-08-01,article,Letter,Letters,Opinion,"['Disabilities', 'AMERICANS WITH DISABILITIES ...",https://www.nytimes.com/2020/08/01/opinion/let...,nyt://article/a23c70a4-3518-5560-a80b-38a8292d...,,...,False,False,False,False,False,False,False,False,False,True


Some articles contain multiple phrases. For example, an article may contain "deaf-mute" as well as "hearing-impaired." Such an article currently shows up as two separate rows, with duplicate values for everything except the deaf-mute and hearing-impaired columns. One row shows True in the deaf-mute column, and the other row shows True in the hearing-impaired column. We want to merge these two rows into one row where both columns show True.

In [65]:
if final_df['id'].duplicated().any():
    print('Duplicates: ' + str(final_df['id'].duplicated().sum()))

Duplicates: 1235


I tried to process the duplicates in the data frame like this:

In [66]:
final_df.groupby(final_df.columns[0:10].tolist(), as_index=False).max() # DOESN'T WORK, DROPS TOO MANY ROWS

Unnamed: 0,headline,date,doc_type,material_type,news_desk,section,keywords,url,id,byline,...,deaf_dumb_and_blind,deafblind,deaf_mute,hard_of_hearing,hearing_impaired,fell_on_deaf_ears,tone_deaf,deaf_as_a_post,stone_deaf,deaf
0,'A Little Bit of Help',1985-09-07,article,News,Metropolitan Desk,New York,"['Culture', 'Awards, Decorations and Honors', ...",https://www.nytimes.com/1985/09/07/nyregion/ne...,nyt://article/d9a9b191-341c-5ca2-975e-7c26d0c3...,By Susan Heller Anderson and David W. Dunlap,...,False,False,False,False,False,False,False,False,False,True
1,"'Beach House,' romantic comedy at Circle Rep.",1985-03-29,article,News,Weekend Desk,Theater,['Theater'],https://www.nytimes.com/1985/03/29/theater/bro...,nyt://article/1d7824c8-8ec3-5af2-809b-f11b56b3...,By Enid Nemy,...,False,False,False,False,True,False,False,False,False,True
2,"'Noises Off,' With Flawless Timing",1990-07-08,article,News,Westchester Weekly Desk,New York,"['Reviews', 'Theater']",https://www.nytimes.com/1990/07/08/nyregion/th...,nyt://article/6f7f506a-308a-5679-bc58-79cfa26f...,By Alvin Klein,...,False,False,False,False,False,False,False,False,False,True
3,'These People' Are Ruining Our Lives?,1990-07-01,article,News,Long Island Weekly Desk,New York,"['Social Conditions and Trends', 'Homeless Per...",https://www.nytimes.com/1990/07/01/nyregion/lo...,nyt://article/4c2972ff-341d-5981-9ee0-e95fa45f...,By Warren Goldstein,...,False,False,False,True,False,False,False,False,False,False
4,10TH ANNIVERSARY FOR 'SUMMER EVENINGS',1985-06-09,article,Review,Westchester Weekly Desk,New York,['Music'],https://www.nytimes.com/1985/06/09/nyregion/mu...,nyt://article/801429bd-4a71-5285-8a61-bf2ba672...,By Robert Sherman,...,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12625,‘Wonder Woman’ Could Be the Superhero Women in...,2017-06-04,article,News,Business,Business Day,"['Movies', 'Women and Girls', 'Sexual Harassme...",https://www.nytimes.com/2017/06/04/business/me...,nyt://article/234b9c6e-59ed-5dfb-a561-2939c35c...,By Jim Rutenberg,...,False,False,False,False,True,False,False,False,False,False
12626,‘World Wide Mind’,2011-02-15,article,Text,Science,Science,"['Books and Literature', 'Science and Technolo...",https://www.nytimes.com/2011/02/15/science/15s...,nyt://article/9c18470b-2b77-55bf-863a-fb55bef5...,By Michael Chorost,...,False,False,False,False,False,False,False,False,False,True
12627,‘Write When You Get Work’ Review: Backstreet B...,2018-11-22,article,Review,Weekend,Movies,['Movies'],https://www.nytimes.com/2018/11/22/movies/writ...,nyt://article/db879a75-4d65-56ff-b5ad-93ecf008...,By Jeannette Catsoulis,...,False,False,False,False,False,False,True,False,False,False
12628,"‘Yellow Vests’ Riot in Paris, but Their Anger ...",2018-12-02,article,News,Foreign,World,"['Yellow Vests Movement', 'Demonstrations, Pro...",https://www.nytimes.com/2018/12/02/world/europ...,nyt://article/4f02b33e-94f3-5d46-9fa9-71e3c50a...,By Adam Nossiter,...,False,False,False,False,False,False,False,False,False,True


But it didn't work: too many rows were dropped. So I copied the duplicates into a separate dataframe, processed them there, removed all copies of them from `final_df`, then appended the processed dupes to `final_df` and reset the index. *shrug*

In [68]:
# Process dupes separately
dupes = pd.concat(g for _, g in final_df.groupby('id') if len(g) > 1)
dupes

Unnamed: 0,headline,date,doc_type,material_type,news_desk,section,keywords,url,id,byline,...,deaf_dumb_and_blind,deafblind,deaf_mute,hard_of_hearing,hearing_impaired,fell_on_deaf_ears,tone_deaf,deaf_as_a_post,stone_deaf,deaf
4454,"M. McClelland, 53; Helped Experiments In Theat...",1993-09-22,article,Obituary; Biography,Obituary,Obituaries,"['Biographical Information', 'DEATHS']",https://www.nytimes.com/1993/09/22/obituaries/...,nyt://article/00742127-b642-510b-a474-a05476c5...,By Kathleen Teltsch,...,False,False,False,False,True,False,False,False,False,False
13930,"M. McClelland, 53; Helped Experiments In Theat...",1993-09-22,article,Obituary; Biography,Obituary,Obituaries,"['Biographical Information', 'DEATHS']",https://www.nytimes.com/1993/09/22/obituaries/...,nyt://article/00742127-b642-510b-a474-a05476c5...,By Kathleen Teltsch,...,False,False,False,False,False,False,False,False,False,True
3809,"THE HARD-OF-HEARING LIKE THE THEATER, TOO",1983-08-13,article,Letter,Editorial Desk,Opinion,['TERMS NOT AVAILABLE'],https://www.nytimes.com/1983/08/13/opinion/l-t...,nyt://article/007e25b1-b68b-5bc6-ada2-350415d9...,,...,False,False,False,True,False,False,False,False,False,False
5231,"THE HARD-OF-HEARING LIKE THE THEATER, TOO",1983-08-13,article,Letter,Editorial Desk,Opinion,['TERMS NOT AVAILABLE'],https://www.nytimes.com/1983/08/13/opinion/l-t...,nyt://article/007e25b1-b68b-5bc6-ada2-350415d9...,,...,False,False,False,False,True,False,False,False,False,False
12794,"THE HARD-OF-HEARING LIKE THE THEATER, TOO",1983-08-13,article,Letter,Editorial Desk,Opinion,['TERMS NOT AVAILABLE'],https://www.nytimes.com/1983/08/13/opinion/l-t...,nyt://article/007e25b1-b68b-5bc6-ada2-350415d9...,,...,False,False,False,False,False,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19572,How Architecture Could Help Us Adapt to the Pa...,2020-06-10,multimedia,Interactive Feature,Magazine,Magazine,"['Coronavirus (2019-nCoV)', 'Workplace Environ...",https://www.nytimes.com/interactive/2020/06/09...,nyt://interactive/e6694730-c72a-5173-ab0c-42d4...,By Kim Tingley,...,False,False,False,False,False,False,False,False,False,True
3869,Deaf Club,2016-04-29,multimedia,Slideshow,T Magazine,T Magazine,"['Deafness', 'Music']",https://www.nytimes.com/slideshow/2016/04/29/t...,nyt://slideshow/3412e8a3-2e79-5269-a752-7573ec...,,...,False,False,False,False,True,False,False,False,False,False
19267,Deaf Club,2016-04-29,multimedia,Slideshow,T Magazine,T Magazine,"['Deafness', 'Music']",https://www.nytimes.com/slideshow/2016/04/29/t...,nyt://slideshow/3412e8a3-2e79-5269-a752-7573ec...,,...,False,False,False,False,False,False,False,False,False,True
3975,‘Magic to Do’,2011-07-25,multimedia,Slideshow,Theater,Theater,"['Theater', 'Deafness']",https://www.nytimes.com/slideshow/2009/02/12/t...,nyt://slideshow/bcb62c85-d683-5d1b-b548-0a7b9b...,,...,False,False,False,False,True,False,False,False,False,False


Merge rows so that there is only one row for each unique id, retaining all True values. Note again that for each unique id, only the True/False columns differ, so we group by all the other columns and apply `.max()` to the True/False columns.

`.max()` works because `True > False`.
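
One likely reason a multi-column `groupby` loses rows is that pandas excludes any row whose group key contains `NaN` by default, and many articles here have no byline or news desk. The sketch below shows this on toy data and groups on `id` alone instead. The column names mirror the notebook, but the rows are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for final_df: article 'a' matched two phrases and has no byline
toy = pd.DataFrame({
    'id':        ['a', 'a', 'b'],
    'byline':    [np.nan, np.nan, 'By Someone'],
    'deaf':      [True, False, True],
    'tone_deaf': [False, True, False],
})
flags = ['deaf', 'tone_deaf']

# Grouping on metadata columns silently drops the rows whose byline is NaN
by_metadata = toy.groupby(['id', 'byline'], as_index=False)[flags].max()
print(len(by_metadata))  # 1

# Grouping on id alone keeps every article: first() for metadata, max() for flags
agg = {'byline': 'first', 'deaf': 'max', 'tone_deaf': 'max'}
by_id = toy.groupby('id', as_index=False).agg(agg)
print(by_id)
```

The same default applies to the `dupes` groupby, so it may be worth comparing `len(dupes_fixed)` with `dupes['id'].nunique()`. In pandas 1.1 and later, passing `dropna=False` to `groupby` is another way to keep rows with missing keys.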

In [69]:
dupes.reset_index(drop=True, inplace=True)
dupes_fixed = dupes.groupby(dupes.columns[0:10].tolist(), as_index=False).max()
dupes_fixed

Unnamed: 0,headline,date,doc_type,material_type,news_desk,section,keywords,url,id,byline,...,deaf_dumb_and_blind,deafblind,deaf_mute,hard_of_hearing,hearing_impaired,fell_on_deaf_ears,tone_deaf,deaf_as_a_post,stone_deaf,deaf
0,"'Beach House,' romantic comedy at Circle Rep.",1985-03-29,article,News,Weekend Desk,Theater,['Theater'],https://www.nytimes.com/1985/03/29/theater/bro...,nyt://article/1d7824c8-8ec3-5af2-809b-f11b56b3...,By Enid Nemy,...,False,False,False,False,True,False,False,False,False,True
1,DRUNK AGAIN,1985-03-31,article,News,Magazine Desk,Magazine,['English Language'],https://www.nytimes.com/1985/03/31/magazine/on...,nyt://article/89c35a34-5440-5541-afbf-3f8f652d...,By William Safire,...,False,False,False,False,True,False,False,False,False,True
2,HELPING THOSE HOW HELP,1985-11-24,article,News,Society Desk,Style,['TERMS NOT AVAILABLE'],https://www.nytimes.com/1985/11/24/style/socia...,nyt://article/855d9f97-49c4-58d6-9640-c7bd9d2b...,By Robert E. Tomasson,...,False,False,False,True,True,False,False,False,False,False
3,HOPE IS OFFERED TO THE HARD-OF-HEARING,1985-08-18,article,News,Westchester Weekly Desk,New York,['TERMS NOT AVAILABLE'],https://www.nytimes.com/1985/08/18/nyregion/we...,nyt://article/82d141c9-c2b9-5324-bc57-1ad2021f...,By Rosalyn Fein,...,False,False,False,True,True,False,False,False,False,True
4,LISTENERS PAY A HIGH PRICE FOR LOUD MUSIC,1990-03-18,article,News,Arts and Leisure Desk,Arts,"['Music', 'Deafness']",https://www.nytimes.com/1990/03/18/arts/sound-...,nyt://article/a4d01869-7cfa-5f26-9a45-019ca8aa...,By Hans Fantel,...,False,False,False,False,True,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
802,‘Nothing About Us Without Us’: 16 Moments in t...,2020-07-22,article,News,SpecialSections,U.S.,"['Disabilities', 'Civil Rights and Liberties',...",https://www.nytimes.com/2020/07/22/us/ada-disa...,nyt://article/b87eb307-fda1-5fda-83c7-701dc80f...,By Julia Carmel,...,False,False,False,True,False,False,False,False,False,True
803,‘Singing’ With Their Hands,2012-02-11,article,News,Styles,Fashion & Style,"['Video Recordings and Downloads', 'Music', 'S...",https://www.nytimes.com/2012/02/12/fashion/sin...,nyt://article/0918d106-bd33-59fe-a100-cd3f9a23...,By Austin Considine,...,False,False,False,False,True,False,False,False,False,True
804,"‘Switched at Birth,’ a Series Illuminating a W...",2017-01-30,article,News,Culture,Arts,['Television'],https://www.nytimes.com/2017/01/30/arts/televi...,nyt://article/70949850-b6e8-50c9-9982-1061b7f8...,By Neil Genzlinger,...,False,False,False,False,True,False,False,False,False,True
805,‘The Worthy’\n,2006-07-23,article,News,BookReview,Books,['Books and Literature'],https://www.nytimes.com/2006/07/23/books/chapt...,nyt://article/51443601-2828-54cd-bdc4-c330b652...,By Will Clarke,...,True,False,False,False,False,False,False,False,False,False


Remove all instances of duplicates from `final_df`, then append the merged rows.

In [70]:
final_df = final_df[~final_df['id'].isin(dupes['id'])]
final_final_lol_mybad_df = pd.concat((final_df, dupes_fixed))
final_final_lol_mybad_df.reset_index(inplace=True, drop=True)
final_final_lol_mybad_df

Unnamed: 0,headline,date,doc_type,material_type,news_desk,section,keywords,url,id,byline,...,deaf_dumb_and_blind,deafblind,deaf_mute,hard_of_hearing,hearing_impaired,fell_on_deaf_ears,tone_deaf,deaf_as_a_post,stone_deaf,deaf
0,THE DEAF AND DUMB WAITER.,1885-12-03,article,Archives,,Archives,[],https://www.nytimes.com/1885/12/03/archives/th...,nyt://article/0074c23c-1ff6-5bc7-85d9-e56a5af3...,,...,False,False,False,False,False,False,False,False,False,False
1,Chad Threatens to Expel Sudanese Refugees,2006-04-14,article,News,International,World,[],https://www.nytimes.com/2006/04/14/world/chad-...,nyt://article/00bb19d7-2ba6-5072-8e6b-3159730d...,By Marc Lacey,...,False,False,False,False,False,False,False,False,False,False
2,WELFARE HOTEL CHILDREN: TOMORROW'S POOR,1987-07-16,article,News,Metropolitan Desk,New York,"['Homeless Persons', 'HOTELS AND MOTELS', 'Chi...",https://www.nytimes.com/1987/07/16/nyregion/we...,nyt://article/01670df3-ae07-5eb6-8862-7bd834bf...,By Lydia Chavez,...,False,False,False,False,False,False,False,False,False,False
3,NEW-YORK CITY.; Union of the Liberal Societies...,1854-07-06,article,Archives,,Archives,[],https://www.nytimes.com/1854/07/06/archives/ne...,nyt://article/027c2e43-cb25-52ab-9eee-03873325...,,...,False,False,False,False,False,False,False,False,False,False
4,Deaf and Dumb at Intervals.,1896-09-06,article,Archives,,Archives,[],https://www.nytimes.com/1896/09/06/archives/de...,nyt://article/02812482-a829-5351-b560-31a87b64...,,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18063,‘Nothing About Us Without Us’: 16 Moments in t...,2020-07-22,article,News,SpecialSections,U.S.,"['Disabilities', 'Civil Rights and Liberties',...",https://www.nytimes.com/2020/07/22/us/ada-disa...,nyt://article/b87eb307-fda1-5fda-83c7-701dc80f...,By Julia Carmel,...,False,False,False,True,False,False,False,False,False,True
18064,‘Singing’ With Their Hands,2012-02-11,article,News,Styles,Fashion & Style,"['Video Recordings and Downloads', 'Music', 'S...",https://www.nytimes.com/2012/02/12/fashion/sin...,nyt://article/0918d106-bd33-59fe-a100-cd3f9a23...,By Austin Considine,...,False,False,False,False,True,False,False,False,False,True
18065,"‘Switched at Birth,’ a Series Illuminating a W...",2017-01-30,article,News,Culture,Arts,['Television'],https://www.nytimes.com/2017/01/30/arts/televi...,nyt://article/70949850-b6e8-50c9-9982-1061b7f8...,By Neil Genzlinger,...,False,False,False,False,True,False,False,False,False,True
18066,‘The Worthy’\n,2006-07-23,article,News,BookReview,Books,['Books and Literature'],https://www.nytimes.com/2006/07/23/books/chapt...,nyt://article/51443601-2828-54cd-bdc4-c330b652...,By Will Clarke,...,True,False,False,False,False,False,False,False,False,False


In [71]:
print('Duplicates: ' + str(final_final_lol_mybad_df['id'].duplicated().sum()))

Duplicates: 0


In [72]:
final_final_lol_mybad_df.to_csv('data/all.csv', index=False)