# Analyzing/Filtering NYT Data


**Table of Contents**  

1. [Clean Dataframe](#sec1)
2. [Filter for all Unique Words](#sec2)
3. [Filter for all Unique Political Words](#sec3)

<a id="sec1"></a>

## 1. Clean Dataframe:

In [2]:
import pandas as pd
csv_file_path = 'NYT_Data.csv'

df = pd.read_csv(csv_file_path)

In [35]:
df.head()

Unnamed: 0,Title,Abstract,Date,DocType,NewsDesk,SectionName,Keywords
0,"Winning South Carolina, Biden Makes Case Again...",Joseph R. Biden Jr. drew on his decades-long r...,2020-03-01T00:06:07+0000,article,Politics,U.S.,"['Biden, Joseph R Jr', 'Sanders, Bernard', 'De..."
1,"At CPAC, Trump Takes Aim at Rivals",President Trump appeared eager to bask in the ...,2020-03-01T00:09:35+0000,article,Washington,U.S.,"['Presidential Election of 2020', 'Tax Cuts an..."
2,The Islanders Are Saying Goodbye to Brooklyn,Gov. Andrew M. Cuomo announced on Saturday tha...,2020-03-01T00:35:23+0000,article,Sports,Sports,"['Hockey, Ice', 'Stadiums and Arenas', 'Cuomo,..."
3,Trump Moves to Calm Fears as First U.S. Death ...,"A person in the Seattle area has died, officia...",2020-03-01T01:13:23+0000,article,Washington,U.S.,"['Coronavirus (2019-nCoV)', 'Presidential Elec..."
4,Mother and Daughter Attacked for Speaking Span...,Officials filed felony hate-crime charges agai...,2020-03-01T01:38:43+0000,article,Express,U.S.,"['Assaults', 'Hate Crimes', 'Hispanic-American..."


In [3]:
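# Parse the ISO 8601 timestamps into datetimes.
df['Date'] = pd.to_datetime(df['Date'])

# Keep only the calendar date, stored as a 'YYYY-MM-DD' string.
df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')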

print(df)

                                                   Title  \
0      Winning South Carolina, Biden Makes Case Again...   
1                     At CPAC, Trump Takes Aim at Rivals   
2           The Islanders Are Saying Goodbye to Brooklyn   
3      Trump Moves to Calm Fears as First U.S. Death ...   
4      Mother and Daughter Attacked for Speaking Span...   
...                                                  ...   
65577      Beyoncé’s Country Is America: Every Bit of It   
65578                    How to Use Up Those Easter Eggs   
65579        Deep-Sixing Pornographic Deepfakes for Good   
65580  Esther Coopersmith, Washington Hostess and Dip...   
65581  Lorraine Graves, Pioneering Harlem Ballerina, ...   

                                                Abstract        Date  DocType  \
0      Joseph R. Biden Jr. drew on his decades-long r...  2020-03-01  article   
1      President Trump appeared eager to bask in the ...  2020-03-01  article   
2      Gov. Andrew M. Cuomo announce
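
The cell above stores `Date` as plain strings again, which is why the later cells call `pd.to_datetime` a second time before extracting the year. A minimal sketch of an alternative, on toy data, is to parse once and derive both the display string and the year from the parsed column. The column names `Published` and `Day` are hypothetical, and `utc=True` assumes every timestamp should be read as UTC.

```python
import pandas as pd

# Toy timestamps in the same ISO 8601 shape as the 'Date' column above.
toy = pd.DataFrame(
    {"Date": ["2020-03-01T00:06:07+0000", "2021-12-31T23:59:59+0000"]}
)

# Parse once into a timezone-aware datetime column
# (utc=True is an assumption, not something the notebook specifies).
toy["Published"] = pd.to_datetime(toy["Date"], utc=True)

# Derive the display string and the grouping key from the parsed values.
toy["Day"] = toy["Published"].dt.strftime("%Y-%m-%d")
toy["Year"] = toy["Published"].dt.year
```

Keeping the parsed column around avoids re-parsing strings each time a date component is needed.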

In [4]:
import ast

# 'Keywords' is read from the CSV as stringified lists; parse each into a real Python list.
df['Keywords'] = df['Keywords'].apply(ast.literal_eval)
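
`ast.literal_eval` raises an exception on malformed strings and on non-string values such as the `NaN` that pandas produces for an empty CSV cell. Whether `NYT_Data.csv` contains such rows is not known from the notebook, so the following is only a defensive sketch. The helper name `parse_keywords` and the choice to fall back to an empty list are assumptions.

```python
import ast

def parse_keywords(raw):
    """Turn a stringified list such as "['A', 'B']" into a real list.

    Returns an empty list for missing or malformed values (an assumed
    policy; the notebook does not say how such rows should be handled).
    """
    if not isinstance(raw, str):
        return []  # e.g. NaN read from an empty CSV cell
    try:
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    # Accept only list-like literals; anything else counts as malformed.
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

sample = "['Biden, Joseph R Jr', 'Sanders, Bernard']"
parsed_sample = parse_keywords(sample)
```

It could be applied with `df['Keywords'].apply(parse_keywords)` in place of the direct `ast.literal_eval` call.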


<a id="sec2"></a>

## 2. Filter for all Unique Words:

In [50]:
df['Year'] = pd.to_datetime(df['Date']).dt.year

# Function to aggregate all keywords by year and find unique ones
def unique_keywords(group):
    all_keywords = set()
    for keywords_list in group:
        all_keywords.update(keywords_list)
    return list(all_keywords)

# Group by 'Year' and aggregate keywords
unique_df = df.groupby('Year')['Keywords'].agg(unique_keywords)

print(unique_df)

Year
2020    [Wooden, Nashom (Mona Foot), La Trobe Universi...
2021    [You I Like (Play), Jelle's Marble Runs, Marti...
2022    [Pershing Square Capital Management, Martial L...
2023    [Pershing Square Capital Management, Watching ...
2024    [Pershing Square Capital Management, Snowmass ...
Name: Keywords, dtype: object
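
The keyword order in the output above is arbitrary because a Python `set` has no defined ordering, so the lists can come out in a different order from one run to the next. If reproducible output matters, one option is to sort before returning. This is a sketch on toy data, and `sorted_unique_keywords` is a hypothetical variant of the function above.

```python
import pandas as pd

def sorted_unique_keywords(group):
    """Union all keyword lists in a group and return them in sorted order."""
    all_keywords = set()
    for keywords_list in group:
        all_keywords.update(keywords_list)
    return sorted(all_keywords)

# Toy data with the same 'Year' / 'Keywords' layout as the notebook's df.
toy_df = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "Keywords": [["Floods", "Assaults"], ["Floods"], ["Hate Crimes"]],
})

sorted_unique = toy_df.groupby("Year")["Keywords"].agg(sorted_unique_keywords)
```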


In [7]:
def year_words(seriesData, years):
    """Print each year's unique keyword list, followed by the number of keywords in it."""
    for year in years:
        keywords = seriesData.loc[year]
        print(keywords)
        print(len(keywords))

year_words(unique_df, [2020,2021,2022,2023,2024])

3390
3409
3233
3250
3546
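
As a cross-check on the counts above, the per-year totals can also be computed without building the keyword lists first. The sketch below uses toy data and assumes a pandas version that provides `DataFrame.explode` (0.25 or later).

```python
import pandas as pd

# Toy data with the same 'Year' / 'Keywords' layout as the notebook's df.
toy_df = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "Keywords": [["Floods", "Assaults"], ["Floods"], ["Hate Crimes"]],
})

# explode() gives each keyword its own row; nunique() then counts
# the distinct keywords within each year.
keyword_counts = toy_df.explode("Keywords").groupby("Year")["Keywords"].nunique()
```

Rows with an empty keyword list become `NaN` after `explode`, and `nunique` ignores them by default.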


<a id="sec3"></a>

## 3. Filter for all Unique Political Words:

In [9]:
# 'Year' and unique_keywords() are already defined in Section 2 and are reused here.

# Filter rows where 'NewsDesk' is one of the specified categories
filtered_df = df[df['NewsDesk'].isin(['Politics', 'Washington', 'National'])]

# Group by 'Year' and aggregate keywords
unique_df = filtered_df.groupby('Year')['Keywords'].agg(unique_keywords)

print(unique_df)
year_words(unique_df, [2020,2021,2022,2023,2024])

Year
2020    [Floods, Monmouth University, Craig, Gregory B...
2021    [Floods, Monmouth University, JPMorgan Chase &...
2022    [Murdoch, Kathryn, Bacteria, Floods, School Di...
2023    [Floods, Assassinations and Attempted Assassin...
2024    [Floods, McCall, Matthew N, School Discipline ...
Name: Keywords, dtype: object
2238
2010
1955
2006
['Floods', 'McCall, Matthew N', 'School Discipline (Students)', 'Sociology', 'Kozol, Jonathan', 'Strauss, David A', 'Assassinations and Attempted Assassinations', 'Sacramento Municipal Utility District', 'Civil Rights Movement (1954-68)', 'Alcoholic Beverages', 'The End of Race Politics: Arguments for a Colorblind America (Book)', 'Kidnapping and Hostages', 'Project Veritas', 'Consumer Financial Protection Bureau', 'Brown, Charles Q Jr', 'Baldwin, Tammy Suzanne Green', 'Police Department (Aurora, Colo)', 'Elrod, Jennifer Walker', 'Drones (Pilotless Planes)', 'United States International Relations', 'Evictions', 'Crumbley, Ethan', 'Blacks', 'Depres
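
Sections 2 and 3 run the same aggregation, once on every row and once on the politics-related desks. A minimal sketch of how the two could share one helper is shown below on toy data. The function name `unique_keywords_by_year` and its `desks` parameter are hypothetical, and sorting the result is an added choice rather than something the notebook does.

```python
import pandas as pd

def unique_keywords_by_year(frame, desks=None):
    """Return a Series mapping each year to its sorted unique keywords.

    `desks` is an optional list of NewsDesk values to keep; None keeps all rows.
    Hypothetical helper, not part of the original notebook.
    """
    if desks is not None:
        frame = frame[frame["NewsDesk"].isin(desks)]
    return frame.groupby("Year")["Keywords"].agg(
        lambda lists: sorted(set().union(*lists))
    )

# Toy data with the same column names as the notebook's df.
toy_df = pd.DataFrame({
    "Year": [2020, 2020, 2021],
    "NewsDesk": ["Politics", "Sports", "Washington"],
    "Keywords": [["Floods", "Evictions"], ["Hockey, Ice"], ["Floods"]],
})

all_words = unique_keywords_by_year(toy_df)
political_words = unique_keywords_by_year(
    toy_df, ["Politics", "Washington", "National"]
)
```

Returning separate Series for the two cases also avoids overwriting `unique_df` from Section 2.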