# Process Raw News Data About Nvidia

This notebook processes raw JSON news data about Nvidia and related tech topics collected from web articles. It:
1. Loads the raw JSON data
2. Normalizes and cleans the data into a pandas DataFrame
3. Exports the processed data to CSV and Excel formats

## Import Required Libraries

In [47]:
import pandas as pd
from pathlib import Path

## Configuration

Define paths and settings to make the notebook more maintainable and configurable.

In [48]:
INPUT_PATH = Path('../data/raw/news/news_response_2025-03-30_2025-04-29.json')
OUTPUT_DIR = Path('../data/processed/news')
CSV_OUTPUT = OUTPUT_DIR / 'news_articles.csv'
EXCEL_OUTPUT = OUTPUT_DIR / 'news_articles.xlsx'

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

SELECTED_COLUMNS = ['source.name', 'author', 'title', 'description', 'content', 'url']

## Data Loading and Processing

Load the raw JSON news data and convert it to a structured DataFrame.

In [49]:
def load_all_articles(input_path):
    """Load news articles from a JSON file, or from every JSON file in a directory."""
    all_articles = []

    # Accept either a single JSON file or a directory of JSON files
    if input_path.is_file():
        json_files = [input_path]
    elif input_path.is_dir():
        json_files = sorted(input_path.glob('*.json'))
    else:
        print(f"Warning: Input path {input_path} does not exist")
        return all_articles

    # Process each JSON file
    for file_path in json_files:
        try:
            raw_data = pd.read_json(file_path)
            # Each response stores its article records under the 'articles' key
            articles = raw_data.get('articles', [])
            all_articles.extend(articles)
        except Exception as e:
            print(f"Error loading {file_path.name}: {e}")

    return all_articles

# Load and process
articles = load_all_articles(INPUT_PATH)
if articles:
    # Convert to DataFrame and normalize nested JSON
    df_news = pd.json_normalize(articles)
    
    # Check if all requested columns exist
    missing_columns = [col for col in SELECTED_COLUMNS if col not in df_news.columns]
    if missing_columns:
        print(f"Warning: Missing columns in data: {missing_columns}")
        available_columns = [col for col in SELECTED_COLUMNS if col in df_news.columns]
        df_selected = df_news[available_columns] if available_columns else df_news
    else:
        df_selected = df_news[SELECTED_COLUMNS]
    
    # Report source distribution if that column exists
    if 'source.name' in df_selected.columns:
        source_counts = df_selected['source.name'].value_counts()
        sources_str = ', '.join(source_counts.index)
        print(f"Loaded {len(df_selected)} articles from: {sources_str}.")
    else:
        print(f"Loaded {len(df_selected)} articles.")
else:
    print("No articles found.")

Loaded 25 articles from: TechCrunch, The Next Web.
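
The normalization step above can be illustrated with a minimal sketch on a made-up payload. The field layout (a top-level `articles` list with a nested `source` object) is an assumption inferred from the column names in `SELECTED_COLUMNS`, not taken from the raw file, and the names `sample_response`, `df_sample`, and `wanted` are hypothetical.

```python
import pandas as pd

# Hypothetical payload; the real response may contain additional fields.
sample_response = {
    "status": "ok",
    "articles": [
        {
            "source": {"id": "techcrunch", "name": "TechCrunch"},
            "author": "Jane Doe",
            "title": "Example headline",
            "url": "https://example.com/a",
        },
        {
            "source": {"id": None, "name": "Example Wire"},
            "author": None,
            "title": "Another headline",
            "url": "https://example.com/b",
        },
    ],
}

# json_normalize flattens nested dicts into dot-separated column names,
# so the nested "source" object becomes "source.id" and "source.name".
df_sample = pd.json_normalize(sample_response["articles"])

# reindex keeps only the requested columns and fills absent ones with NaN,
# an alternative to the explicit missing-column check used in the cell above.
wanted = ["source.name", "author", "title", "description", "url"]
df_sample_selected = df_sample.reindex(columns=wanted)

print(df_sample_selected.columns.tolist())
```

Whether to use `reindex` or the explicit check is a design choice: `reindex` always yields the same column set, while the explicit check surfaces missing fields as a warning.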


## Preview the Processed Data

In [50]:
df_selected.head(10)

| | source.name | author | title | description | content | url |
|---|---|---|---|---|---|---|
| 0 | The Next Web | Thomas Macaulay | Meet the Dutch tech stars speaking at TNW Conf... | As our favourite Dutch holiday approaches, TNW... | As our favourite Dutch holiday approaches, TNW... | https://thenextweb.com/news/meet-dutch-tech-te... |
| 1 | TechCrunch | Rebecca Bellan | Apple loses $250B market value as tariffs tank... | Apple lost more than $250 billion in market va... | Apple lost more than $250 billion in market va... | https://techcrunch.com/2025/04/03/apple-loses-... |
| 2 | TechCrunch | Anthony Ha | Trump exempts smartphones, laptops, and semico... | The Trump administration is carving out big ta... | The Trump administration is carving out big ta... | https://techcrunch.com/2025/04/12/trump-exempt... |
| 3 | TechCrunch | Rebecca Szkutak | US government imposes license requirement on N... | Nvidia's H20 was the most advanced AI chip the... | Semiconductor giant Nvidia is facing unexpecte... | https://techcrunch.com/2025/04/15/nvidia-h20-c... |
| 4 | TechCrunch | Rebecca Szkutak | Nvidia's H20 AI chips may be spared from expor... | Nvidia CEO Jensen Huang appears to have struck... | Nvidia CEO Jensen Huang appears to have struck... | https://techcrunch.com/2025/04/09/nvidias-h20-... |
| 5 | TechCrunch | Kyle Wiggers | Nvidia says it plans to manufacture some AI ch... | Nvidia says that it has commissioned more than... | Nvidia says it has commissioned more than a mi... | https://techcrunch.com/2025/04/14/nvidia-says-... |
| 6 | TechCrunch | Ingrid Lunden | Hammerspace, an unstructured data wrangler use... | Artificial intelligence services at their hear... | Artificial intelligence services at their hear... | https://techcrunch.com/2025/04/16/hammerspace-... |
| 7 | TechCrunch | Kyle Wiggers | AMD takes $800M charge on US license requireme... | AMD says that the U.S. government's license co... | AMD says that the U.S. government’s license co... | https://techcrunch.com/2025/04/16/amd-takes-80... |
| 8 | TechCrunch | Rebecca Szkutak | Here are the 19 US AI startups that have raise... | U.S.-based AI startups continue to rake in ven... | Last year was monumental for the AI industry i... | https://techcrunch.com/2025/04/23/here-are-the... |
| 9 | TechCrunch | Kyle Wiggers | Google's newest Gemini AI model focuses on eff... | Google is releasing a new AI model, Gemini 2.5... | Google is releasing a new AI model designed to... | https://techcrunch.com/2025/04/09/googles-newe... |


## Export Processed Data

Save the processed data to CSV and Excel formats for further analysis.

In [42]:
def export_data(df, csv_path, excel_path):
    """Export dataframe to CSV and Excel formats."""
    try:
        # Export to CSV
        df.to_csv(csv_path, index=False, encoding='utf-8')
        
        # Export to Excel
        df.to_excel(excel_path, index=False)
        
        return True
    except Exception as e:
        print(f"Error exporting data: {e}")
        return False

# Export the data
if export_data(df_selected, CSV_OUTPUT, EXCEL_OUTPUT):
    print("Data successfully exported to:")
    print(f"- CSV: {CSV_OUTPUT}")
    print(f"- Excel: {EXCEL_OUTPUT}")

Data successfully exported to:
- CSV: ../data/processed/news/news_articles.csv
- Excel: ../data/processed/news/news_articles.xlsx
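
One way to gain confidence in the export is to read the CSV back and compare it with the original frame. The sketch below uses a small stand-in DataFrame and a temporary directory rather than the notebook's real paths; `df_demo`, `df_roundtrip`, and `roundtrip_ok` are hypothetical names, and the rows are illustrative only. The Excel export is left out here because it depends on an optional engine such as `openpyxl` being installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for df_selected; values are illustrative only.
df_demo = pd.DataFrame(
    {
        "source.name": ["TechCrunch", "The Next Web"],
        "title": ["Headline with, a comma", 'Headline with "quotes"'],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "news_articles.csv"
    # Same options as export_data: no index column, UTF-8 encoding
    df_demo.to_csv(csv_path, index=False, encoding="utf-8")
    df_roundtrip = pd.read_csv(csv_path, encoding="utf-8")

# Compare shape and cell values after the round trip
roundtrip_ok = (
    df_roundtrip.shape == df_demo.shape
    and df_roundtrip["title"].tolist() == df_demo["title"].tolist()
)
print(f"Round trip preserved the data: {roundtrip_ok}")
```

Commas and quotes inside titles are the usual places where a hand-rolled CSV writer breaks, which is why the stand-in rows include both.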


## Conclusion

This notebook has successfully:
1. Loaded raw news data from a JSON file
2. Processed and normalized the data into a structured format
3. Reported which sources the loaded articles came from
4. Exported the processed data to CSV and Excel formats for further use

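The processed data includes 25 articles from TechCrunch and The Next Web; the input file name indicates a window of 2025-03-30 to 2025-04-29. Of the 10 rows previewed above, at least 3 mention Nvidia in the visible title or description text. The data can now be used for further analysis, such as sentiment analysis, or as part of a larger dataset for tech news tracking.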
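
As a starting point for that further analysis, the following sketch counts articles that mention a keyword in the title or description. It is an assumption about how one might do this, not something the notebook performs; `mentions_keyword` is a hypothetical helper and the rows in `df_demo` are invented stand-ins for `df_selected`.

```python
import pandas as pd

# Invented rows for illustration; in the notebook this would be df_selected.
df_demo = pd.DataFrame(
    {
        "title": [
            "Nvidia announces an example chip",
            "Chipmaker reports quarterly results",
            "Phone maker updates its lineup",
        ],
        "description": [
            "A short summary of the announcement",
            "Analysts compare the results with NVIDIA's",
            None,
        ],
    }
)


def mentions_keyword(df, keyword, columns=("title", "description")):
    """Return a boolean Series marking rows where any listed column contains keyword."""
    mask = pd.Series(False, index=df.index)
    for col in columns:
        if col in df.columns:
            # na=False treats missing text as "no match";
            # regex=False matches the keyword literally, case=False ignores case
            mask |= df[col].str.contains(keyword, case=False, na=False, regex=False)
    return mask


nvidia_mask = mentions_keyword(df_demo, "nvidia")
print(f"{nvidia_mask.sum()} of {len(df_demo)} articles mention Nvidia")
```

A plain substring match is crude (it would also match unrelated words containing the keyword), so treat the count as a rough filter rather than a relevance measure.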