# Process Raw News Data About Nvidia

This notebook processes raw JSON news data about Nvidia and related tech topics collected from web articles. It:
1. Loads the raw JSON data
2. Normalizes and cleans the data into a pandas DataFrame
3. Exports the processed data to CSV and Excel formats

## Import Required Libraries

In [65]:
import pandas as pd
from pathlib import Path

## Configuration

Define paths and settings to make the notebook more maintainable and configurable.

In [66]:
INPUT_PATH = Path('../data/raw/news')
OUTPUT_DIR = Path('../data/processed/news')
CSV_OUTPUT = OUTPUT_DIR / 'news_articles.csv'
EXCEL_OUTPUT = OUTPUT_DIR / 'news_articles.xlsx'

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

SELECTED_COLUMNS = ['source.name', 'author', 'title', 'description', 'content', 'url']
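
The dotted name `source.name` in `SELECTED_COLUMNS` is the kind of column name `pd.json_normalize` produces when it flattens a nested object. The following is a minimal sketch, assuming each article carries a nested `source` object with a `name` key; the sample record and the `df_sample` name are hypothetical and not taken from the raw data.

```python
import pandas as pd

# Hypothetical sample record, shaped like the nested articles this notebook expects
sample_articles = [
    {
        "source": {"id": None, "name": "Example News"},
        "author": "A. Writer",
        "title": "Sample headline",
        "url": "https://example.com/article",
    }
]

# json_normalize flattens nested dicts, joining parent and child keys with "." by default
df_sample = pd.json_normalize(sample_articles)
print(sorted(df_sample.columns))
```

If a different separator is preferred, `pd.json_normalize` accepts a `sep` argument, in which case `SELECTED_COLUMNS` would need to use the same separator.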

## Data Loading and Processing

Load the raw JSON news data and convert it to a structured DataFrame.

In [67]:
def load_all_articles(input_dir):
    """Load all news articles from JSON files in the specified directory."""
    all_articles = []
    
    # Check if directory exists
    if not input_dir.exists() or not input_dir.is_dir():
        print(f"Warning: Input directory {input_dir} does not exist or is not a directory")
        return all_articles
    
    # Process each JSON file
    for file_path in input_dir.glob('*.json'):
        try:
            raw_data = pd.read_json(file_path)
            articles = raw_data.get('articles', [])
            all_articles.extend(articles)
        except Exception as e:
            print(f"Error loading {file_path.name}: {e}")
    
    return all_articles

# Load and process
articles = load_all_articles(INPUT_PATH)
if articles:
    # Convert to DataFrame and normalize nested JSON
    df_news = pd.json_normalize(articles)
    
    # Check if all requested columns exist
    missing_columns = [col for col in SELECTED_COLUMNS if col not in df_news.columns]
    if missing_columns:
        print(f"Warning: Missing columns in data: {missing_columns}")
        available_columns = [col for col in SELECTED_COLUMNS if col in df_news.columns]
        df_selected = df_news[available_columns] if available_columns else df_news
    else:
        df_selected = df_news[SELECTED_COLUMNS]
    
    # Report source distribution if that column exists
    if 'source.name' in df_selected.columns:
        source_counts = df_selected['source.name'].value_counts()
        sources_str = ', '.join(source_counts.index)
        print(f"Loaded {len(df_selected)} articles from: {sources_str}.")
    else:
        print(f"Loaded {len(df_selected)} articles.")
else:
    print("No articles found.")

Loaded 100 articles from: ETF Daily News, Yahoo Entertainment, Biztoc.com, GlobeNewswire, Plos.org, Thefly.com, Business Insider, Elifesciences.org, Pharmaceutical Technology, Vedomosti.ru, Al Jazeera English, Sputnikglobe.com, Terra.com.br, International Business Times, Www.e15.cz.
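
As an alternative to `pd.read_json`, the files could also be read with the standard `json` module, which returns the `articles` list directly as plain Python dictionaries. This is a sketch under the assumption that every file is a JSON object with an `articles` key; `load_articles_with_json` and the demo files are hypothetical names used only for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_articles_with_json(input_dir):
    """Collect the 'articles' list from every *.json file in input_dir."""
    collected = []
    for file_path in sorted(Path(input_dir).glob("*.json")):
        try:
            with open(file_path, encoding="utf-8") as fh:
                payload = json.load(fh)
        except (OSError, json.JSONDecodeError) as exc:
            print(f"Error loading {file_path.name}: {exc}")
            continue
        # Assumption: each file is a JSON object; files without 'articles' add nothing
        if isinstance(payload, dict):
            collected.extend(payload.get("articles", []))
    return collected


# Demo on a temporary directory with one valid and one malformed file
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "good.json").write_text(
        json.dumps({"status": "ok", "articles": [{"title": "Sample headline"}]}),
        encoding="utf-8",
    )
    (tmp_dir / "bad.json").write_text("{not valid json", encoding="utf-8")
    demo_articles = load_articles_with_json(tmp_dir)

print(len(demo_articles))
```

The malformed file is reported and skipped, so one bad download does not prevent the remaining files from loading.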


## Preview the Processed Data

In [68]:
df_selected

| | source.name | author | title | description | content | url |
|---|---|---|---|---|---|---|
| 0 | Business Insider | Sarah Jackson | These companies are lowering or outright ditch... | Major companies are lowering or scrapping thei... | Companies are finding it increasingly challeng... | https://www.businessinsider.com/list-companies... |
| 1 | Biztoc.com | benzinga.com | $1000 Invested In Thermo Fisher Scientific 15 ... | Thermo Fisher Scientific (NYSE:TMO) has outper... | Thermo Fisher Scientific (NYSE:TMO) has outper... | https://biztoc.com/x/940b8493ed86e608 |
| 2 | Business Insider | Hallam Bullock | 4 Big Law firms decided to fight the Trump adm... | In each of the four lawsuits, federal judges q... | `Alex Wong/Getty Images\r\n<ul><li>This post or...` | https://www.businessinsider.com/big-law-fight-... |
| 3 | Biztoc.com | benzinga.com | Thermo Fisher Scientific Q1 Earnings Surpass E... | Thermo Fisher Scientific Inc. (NYSE:TMO) repor... | Thermo Fisher Scientific Inc. (NYSE:TMO) repor... | https://biztoc.com/x/7563e9828d653546 |
| 4 | Yahoo Entertainment | Vardah Gill | Thermo Fisher Scientific Inc. (TMO): Among the... | We recently published a list of the 10 High Gr... | We recently published a list of the 10 High Gr... | https://finance.yahoo.com/news/thermo-fisher-s... |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | Elifesciences.org | memerman@fredhutch.org (Brigitte Allard), meme... | Integrator complex subunit 12 knockout overcom... | The latent HIV reservoir is a major barrier to... | Cell culture and maintenance\r\nRequest a deta... | https://elifesciences.org/articles/103064 |
| 96 | Www.e15.cz | Jan Sedlák | Jižní Morava chystá fond, který bude investova... | V Česku by měl letos začít fungovat další fond... | V esku by ml letos zaít fungovat dalí fond riz... | https://www.e15.cz/byznys/jizni-morava-chysta-... |
| 97 | Plos.org | Iratxe Zuazo-Gaztelu, David Lawrence, Ioanna O... | A nonenzymatic dependency on inositol-requirin... | ER-resident IRE1 is known to play a role in ca... | Citation: Zuazo-Gaztelu I, Lawrence D, Oikonom... | https://journals.plos.org/plosbiology/article?... |
| 98 | GlobeNewswire | Research and Markets | Genomic Cancer Panel and Profiling Market Anal... | The global genomic cancer panel and profiling ... | Dublin, April 24, 2025 (GLOBE NEWSWIRE) -- The... | https://www.globenewswire.com/news-release/202... |
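
The preview shows that some `content` values still contain HTML tags and `\r\n` line breaks. One possible cleaning step, sketched below, strips tags, unescapes HTML entities, and collapses whitespace. The `clean_text` helper, the `content_clean` column, and the sample row are assumptions for illustration; the simple regular expression is not a full HTML parser and may not suit every article.

```python
import html
import re

import pandas as pd

TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_text(value):
    """Strip HTML tags, unescape entities, and collapse whitespace in one cell."""
    if not isinstance(value, str):
        return value  # leave missing values (None/NaN) untouched
    without_tags = TAG_PATTERN.sub(" ", value)
    unescaped = html.unescape(without_tags)
    return re.sub(r"\s+", " ", unescaped).strip()


# Hypothetical sample resembling the raw 'content' values in the preview
df_demo = pd.DataFrame(
    {"content": ["Photo credit\r\n<ul><li>First point</li></ul> &amp; more", None]}
)
df_demo["content_clean"] = df_demo["content"].map(clean_text)
print(df_demo["content_clean"].tolist())
```

Writing the result to a new column keeps the raw text available for comparison.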


## Export Processed Data

Save the processed data to CSV and Excel formats for further analysis.

In [69]:
def export_data(df, csv_path, excel_path):
    """Export dataframe to CSV and Excel formats."""
    try:
        # Export to CSV
        df.to_csv(csv_path, index=False, encoding='utf-8')
        
        # Export to Excel
        df.to_excel(excel_path, index=False)
        
        return True
    except Exception as e:
        print(f"Error exporting data: {e}")
        return False

# Export the data (df_selected is only defined when articles were loaded)
if articles and export_data(df_selected, CSV_OUTPUT, EXCEL_OUTPUT):
    print("Data successfully exported to:")
    print(f"- CSV: {CSV_OUTPUT}")
    print(f"- Excel: {EXCEL_OUTPUT}")

Data successfully exported to:
- CSV: ../data/processed/news/news_articles.csv
- Excel: ../data/processed/news/news_articles.xlsx
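
A quick way to check an export is to read the CSV back and compare it with the DataFrame that was written. The sketch below does this in a temporary directory with a hypothetical stand-in for `df_selected`; it covers only the CSV path, since writing `.xlsx` files additionally depends on an Excel engine such as `openpyxl` being installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for df_selected
df_out = pd.DataFrame(
    {
        "source.name": ["Example News", "Sample Wire"],
        "title": ["Jižní Morava headline", "Second headline"],
        "url": ["https://example.com/a", "https://example.com/b"],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "news_articles.csv"
    df_out.to_csv(csv_path, index=False, encoding="utf-8")
    df_back = pd.read_csv(csv_path, encoding="utf-8")

# Same columns and same values, including the non-ASCII text
round_trip_ok = (
    list(df_back.columns) == list(df_out.columns)
    and df_back.values.tolist() == df_out.values.tolist()
)
print(round_trip_ok)
```

Note that a round trip through CSV does not always preserve missing values or dtypes exactly, so a real check on `df_selected` may need a looser comparison.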


## Conclusion

This notebook has successfully:
1. Loaded raw news data from the JSON files in the input directory
2. Processed and normalized the data into a structured format
3. Reported which sources the articles came from
4. Exported the processed data to CSV and Excel formats for further use

In the run shown above, the processed data contains 100 articles from 15 sources, including Business Insider, Biztoc.com, and GlobeNewswire. The data can now be used for further analysis, such as sentiment analysis, or as part of a larger dataset for tech news tracking.
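
As a starting point for further analysis, the number of articles that mention a given company such as Nvidia could be counted with a keyword filter over `title` and `description`. This is a minimal sketch on a hypothetical sample; `df_demo`, `KEYWORD`, and the sample rows are assumptions, and a plain substring match will miss articles that refer to the company only indirectly.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be df_selected
df_demo = pd.DataFrame(
    {
        "title": ["Nvidia unveils new GPU", "Market update", None],
        "description": ["Chipmaker news", "NVIDIA shares rise", "Unrelated item"],
    }
)

KEYWORD = "nvidia"  # assumed search term

# Case-insensitive literal match in either column; missing values count as no match
mask = (
    df_demo["title"].str.contains(KEYWORD, case=False, regex=False, na=False)
    | df_demo["description"].str.contains(KEYWORD, case=False, regex=False, na=False)
)
df_matches = df_demo[mask]
print(f"{len(df_matches)} of {len(df_demo)} articles mention {KEYWORD!r}")
```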