# Analyzing/Filtering NYT Data


**Table of Contents**  

1. [Clean Dataframe](#sec1)
2. [Filter for all Unique Words](#sec2)
3. [Filter for all Unique Political Words](#sec3)

<a id="sec1"></a>

## 1. Clean Dataframe:

In [1]:
import pandas as pd
csv_file_path = 'NYT_Data.csv'

df = pd.read_csv(csv_file_path)

In [2]:
df.head()

|   | Title | Abstract | Date | DocType | NewsDesk | SectionName | Keywords |
|---|---|---|---|---|---|---|---|
| 0 | Winning South Carolina, Biden Makes Case Again... | Joseph R. Biden Jr. drew on his decades-long r... | 2020-03-01T00:06:07+0000 | article | Politics | U.S. | ['Biden, Joseph R Jr', 'Sanders, Bernard', 'De... |
| 1 | At CPAC, Trump Takes Aim at Rivals | President Trump appeared eager to bask in the ... | 2020-03-01T00:09:35+0000 | article | Washington | U.S. | ['Presidential Election of 2020', 'Tax Cuts an... |
| 2 | The Islanders Are Saying Goodbye to Brooklyn | Gov. Andrew M. Cuomo announced on Saturday tha... | 2020-03-01T00:35:23+0000 | article | Sports | Sports | ['Hockey, Ice', 'Stadiums and Arenas', 'Cuomo,... |
| 3 | Trump Moves to Calm Fears as First U.S. Death ... | A person in the Seattle area has died, officia... | 2020-03-01T01:13:23+0000 | article | Washington | U.S. | ['Coronavirus (2019-nCoV)', 'Presidential Elec... |
| 4 | Mother and Daughter Attacked for Speaking Span... | Officials filed felony hate-crime charges agai... | 2020-03-01T01:38:43+0000 | article | Express | U.S. | ['Assaults', 'Hate Crimes', 'Hispanic-American... |


In [3]:
df['Date'] = pd.to_datetime(df['Date'])

df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')

print(df)

                                                   Title  \
0      Winning South Carolina, Biden Makes Case Again...   
1                     At CPAC, Trump Takes Aim at Rivals   
2           The Islanders Are Saying Goodbye to Brooklyn   
3      Trump Moves to Calm Fears as First U.S. Death ...   
4      Mother and Daughter Attacked for Speaking Span...   
...                                                  ...   
65577      Beyoncé’s Country Is America: Every Bit of It   
65578                    How to Use Up Those Easter Eggs   
65579        Deep-Sixing Pornographic Deepfakes for Good   
65580  Esther Coopersmith, Washington Hostess and Dip...   
65581  Lorraine Graves, Pioneering Harlem Ballerina, ...   

                                                Abstract        Date  DocType  \
0      Joseph R. Biden Jr. drew on his decades-long r...  2020-03-01  article   
1      President Trump appeared eager to bask in the ...  2020-03-01  article   
2      Gov. Andrew M. Cuomo announce

In [4]:
import ast

# Keywords are stored as stringified lists; parse them back into Python lists
df['Keywords'] = df['Keywords'].apply(ast.literal_eval)
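
`ast.literal_eval` raises an error on missing or malformed values, so a more defensive parser may be useful if the CSV is not perfectly clean. The sketch below is only an illustration: `parse_keywords` is a hypothetical helper that is not part of the original notebook, and falling back to an empty list is an assumption about how bad rows should be treated.

```python
import ast

def parse_keywords(value):
    """Parse a stringified list of keywords.

    Returns an empty list when the value is missing (e.g. NaN or None)
    or cannot be parsed as a Python list or tuple.
    """
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

# Small hand-made inputs, for illustration only
examples = ["['Assaults', 'Hate Crimes']", None, "not a list"]
parsed_examples = [parse_keywords(value) for value in examples]
print(parsed_examples)

# Possible use on the dataframe (assumes df is loaded as above):
# df['Keywords'] = df['Keywords'].apply(parse_keywords)
```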


<a id="sec2"></a>

## 2. Filter for all Unique Words:

In [5]:
df['Year'] = pd.to_datetime(df['Date']).dt.year

# Function to aggregate all keywords by year and find unique ones
def unique_keywords(group):
    all_keywords = set()
    for keywords_list in group:
        all_keywords.update(keywords_list)
    return list(all_keywords)

# Group by 'Year' and aggregate keywords
unique_df = df.groupby('Year')['Keywords'].agg(unique_keywords)

print(unique_df)

Year
2020    [McIlroy, Rory, Bridgeway Community Church (Co...
2021    [McIlroy, Rory, Botstein, Leon, Nelson, Gene, ...
2022    [McIlroy, Rory, Botstein, Leon, Gorillas Techn...
2023    [McIlroy, Rory, Night Side of the River: Ghost...
2024    [Gotland Island (Sweden), LucasArts Entertainm...
Name: Keywords, dtype: object


In [6]:
def year_words(seriesData, years):
    """Print each year's unique keywords, followed by how many there are."""
    for year in years:
        keywords = seriesData.loc[year]
        print(keywords)
        print(len(keywords))

year_words(unique_df, [2020, 2021, 2022, 2023, 2024])

16310
15263
14961
13690
13074
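
As a possible cross-check on the counts above, the same per-year totals could be computed with `explode` and `nunique` instead of a custom aggregation function. This is a sketch on a tiny made-up dataframe (`toy` is not the NYT data), so it only shows the technique, not the real numbers.

```python
import pandas as pd

# Toy stand-in for the NYT dataframe; the rows are invented for illustration
toy = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "Keywords": [["Assaults", "Hate Crimes"], ["Assaults"], ["Hockey, Ice"]],
})

# explode() gives each keyword its own row; nunique() then counts
# the distinct keywords within each year
counts = toy.explode("Keywords").groupby("Year")["Keywords"].nunique()
print(counts)
```

Applied to the real data, `df.explode('Keywords').groupby('Year')['Keywords'].nunique()` should agree with the lengths printed by `year_words`, assuming every entry in `Keywords` is already a list.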


<a id="sec3"></a>

## 3. Filter for all Unique Political Words:

In [7]:
df['Year'] = pd.to_datetime(df['Date']).dt.year

# Filter rows where 'NewsDesk' is one of the specified categories
filtered_df = df[df['NewsDesk'].isin(['Politics', 'Washington', 'National'])]

# Function to aggregate all keywords by year and find unique ones
def unique_keywords(group):
    all_keywords = set()
    for keywords_list in group:
        all_keywords.update(keywords_list)
    return list(all_keywords)

# Group by 'Year' and aggregate keywords
unique_df = filtered_df.groupby('Year')['Keywords'].agg(unique_keywords)

print(unique_df)
year_words(unique_df, [2020,2021,2022,2023,2024])

Year
2020    [Bridgeway Community Church (Columbia, Md), Ar...
2021    [Arbery, Ahmaud (1994-2020), DeJoy, Louis, Bro...
2022    [No-Fly Zones, Arbery, Ahmaud (1994-2020), Ban...
2023    [Crimo, Robert E III, American Assn of Univers...
2024    [American Assn of University Professors, Hemph...
Name: Keywords, dtype: object
2238
2010
1955
2006
['American Assn of University Professors', 'Hemphill, Preston (Memphis, Tenn, Police Officer)', 'Graves, Garret', 'Darien Gap', 'Samsel, Ryan', 'Sununu, Christopher T (1974- )', 'Car Services and Livery Cabs', 'Reading and Writing Skills (Education)', 'Minnesota', 'Attorneys General', 'Liebman, Wilma B', 'Eritrea', 'Federal Criminal Case Against Trump (2020 Election Case)', 'Childs, J Michelle', 'McMaster, Henry', 'Isaacman, Jared (1983- )', 'Reproductive System (Human)', 'Conservative Partnership Institute', 'Northern Virginia Community College', 'Treasury Department', 'Amnesties, Commutations and Pardons', 'National Oceanic and Atmospheric Admin

In [24]:
import json

# Serialize the per-year keyword lists; the integer years become string keys in JSON
json_result = unique_df.to_json()
json_dict = json.loads(json_result)
print(json_dict['2020'])

with open("all_NYT.json", 'w') as file:
    file.write(json_result)
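
When the saved file is read back, the years arrive as strings because JSON object keys are always strings. The sketch below shows one way to restore integer keys. It uses a temporary file and made-up data (`toy_result`, `toy_keywords.json`) rather than the real `all_NYT.json`.

```python
import json
import os
import tempfile

# Made-up stand-in for the per-year keyword lists
toy_result = {"2020": ["Assaults", "Hate Crimes"], "2021": ["Hockey, Ice"]}

# Write to a temporary location so nothing in the project is overwritten
path = os.path.join(tempfile.mkdtemp(), "toy_keywords.json")
with open(path, "w") as file:
    json.dump(toy_result, file)

# Read the file back; the keys are still strings at this point
with open(path) as file:
    loaded = json.load(file)

# Convert the year keys back to integers for lookups such as loaded_by_year[2020]
loaded_by_year = {int(year): keywords for year, keywords in loaded.items()}
print(loaded_by_year[2020])
```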

