# Public Data Analysis for Journalism
This notebook walks through a workflow for collecting public data, cleaning it with pandas, analyzing it with pre-trained models from Hugging Face's transformers library, and visualizing the results.

### Goals:
- Learn how to scrape data
- Analyze sentiment and detect key entities
- Visualize insights for journalistic storytelling

**Estimated time:** ~2.5 hours, with group tasks for deeper exploration.

## 1. Setup and Installation

In [None]:
# transformers pipelines need a deep learning backend such as PyTorch (torch)
!pip install requests beautifulsoup4 pandas transformers torch matplotlib openpyxl

## 2. Data Collection with Web Scraping
Let's start by scraping sample headlines from a news website. These headlines will serve as our public dataset for the rest of the notebook.

In [None]:
import requests
from bs4 import BeautifulSoup

# URL of the news website (replace with a real URL if possible)
url = "https://example.com/news"

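# Request the page; fail fast on network problems or HTTP errors
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Extract headline text from every <h2> tag (adjust the tag to match the site's markup)
headlines = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]
print(headlines)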
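Before pointing the scraper at a real site, it can help to check whether the site's robots.txt permits fetching the page. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content and the `is_allowed` helper are made up for illustration, and a real file would be fetched from the site itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/news"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/drafts"))
```

With these sample rules, the first URL is permitted and the second is not.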

## 3. Data Cleaning and Preprocessing with Pandas
Now that we have our headlines, let's clean the data to prepare it for analysis.

In [None]:
import pandas as pd

# Load headlines into a DataFrame
df = pd.DataFrame(headlines, columns=["Headline"])

# Sample cleaning - remove punctuation, lowercase
df["Cleaned_Headline"] = df["Headline"].str.replace(r'[^\w\s]', '', regex=True).str.lower()
df.head()
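To see what the cleaning pattern does to a single headline, here is the same regular expression applied with Python's built-in `re` module. The headline is invented for illustration. Note that the pattern also removes apostrophes and hyphens, which may or may not be what you want.

```python
import re

def clean_headline(text: str) -> str:
    """Mirror the notebook's cleaning step: strip punctuation, then lowercase."""
    return re.sub(r"[^\w\s]", "", text).lower()

sample = "U.S. won't cut COVID-19 aid!"  # made-up headline for illustration
print(clean_headline(sample))  # us wont cut covid19 aid
```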

## 4. Text Analysis with Transformers
### 4.1 Sentiment Analysis
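We'll use a pre-trained model to label the sentiment of each headline and report a confidence score for each label. The pipeline's default English model assigns either POSITIVE or NEGATIVE.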

In [None]:
from transformers import pipeline

# Load the sentiment analysis pipeline
sentiment_analyzer = pipeline("sentiment-analysis")

# Score each original headline once (the model expects natural casing and
# punctuation, so we don't use Cleaned_Headline here)
results = df["Headline"].apply(lambda x: sentiment_analyzer(x)[0])
df["Sentiment"] = results.apply(lambda r: r["label"])
df["Score"] = results.apply(lambda r: r["score"])
df.head()
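The score is the model's confidence in its label, so a low score suggests the headline is ambiguous. One possible way to handle this, sketched below, is to mark low-confidence results as uncertain. The `flag_uncertain` helper and the 0.75 cutoff are assumptions for illustration, not part of the transformers library.

```python
def flag_uncertain(result: dict, min_confidence: float = 0.75) -> str:
    """Return the model's label, or 'UNCERTAIN' if its score is below the cutoff.

    `result` is one dictionary in the shape the sentiment pipeline returns,
    e.g. {"label": "NEGATIVE", "score": 0.98}.
    """
    if result["score"] < min_confidence:
        return "UNCERTAIN"
    return result["label"]

# Hand-written results standing in for real model output
print(flag_uncertain({"label": "NEGATIVE", "score": 0.98}))
print(flag_uncertain({"label": "POSITIVE", "score": 0.55}))
```

For reporting purposes, headlines flagged this way are good candidates for a manual read before they go into a chart.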

### 4.2 Named Entity Recognition
We'll identify key entities (people, places, organizations) in each headline.

In [None]:
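# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner_analyzer = pipeline("ner", aggregation_strategy="simple")

# Extract entities for each headline
df["Entities"] = df["Headline"].apply(ner_analyzer)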
df.head()
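Each cell in the `Entities` column holds a list of dictionaries with keys such as `word`, `entity_group`, and `score`. A minimal sketch of reducing that to confident (name, type) pairs follows; the `confident_entities` helper, the 0.9 cutoff, and the sample data are illustrative assumptions.

```python
def confident_entities(entities, min_score=0.9):
    """Keep (word, entity_group) pairs whose score meets the cutoff."""
    return [
        (e["word"], e["entity_group"])
        for e in entities
        if e["score"] >= min_score
    ]

# Hand-written sample in the shape of grouped NER output
sample_output = [
    {"entity_group": "ORG", "score": 0.99, "word": "Reuters", "start": 0, "end": 7},
    {"entity_group": "LOC", "score": 0.62, "word": "Par", "start": 20, "end": 23},
]
print(confident_entities(sample_output))  # [('Reuters', 'ORG')]
```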

## 5. Data Visualization
### 5.1 Sentiment Distribution
Let's visualize the distribution of sentiment across our collected headlines.

In [None]:
import matplotlib.pyplot as plt

sentiment_counts = df["Sentiment"].value_counts()

plt.figure(figsize=(8, 6))
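# Map colors by label so they stay consistent regardless of which label is more frequent
label_colors = {"POSITIVE": "skyblue", "NEGATIVE": "salmon"}
bar_colors = [label_colors.get(label, "gray") for label in sentiment_counts.index]
plt.bar(sentiment_counts.index, sentiment_counts.values, color=bar_colors)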
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution of Headlines")
plt.show()
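Raw counts are hard to compare across samples of different sizes, so percentages are often easier to quote in a story. The sketch below computes them with the standard library; `sentiment_shares` is a hypothetical helper, and in pandas the same idea is available through `df["Sentiment"].value_counts(normalize=True)`.

```python
from collections import Counter

def sentiment_shares(labels):
    """Return each label's share of the total as a percentage, rounded to 0.1."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

# Made-up labels standing in for df["Sentiment"]
print(sentiment_shares(["POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]))
# {'POSITIVE': 25.0, 'NEGATIVE': 75.0}
```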

### 5.2 Entity Frequency Visualization
Now, let's visualize the most common entities found in our headlines.

In [None]:
from collections import Counter

# Flatten entity lists and count occurrences
entities = [entity['word'] for sublist in df["Entities"] for entity in sublist]
entity_counts = Counter(entities).most_common(10)

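# Plot the most common entities (unpacking zip(*...) fails on an empty list, so check first)
if not entity_counts:
    raise ValueError("No entities found; check that the scraper returned headlines.")
entity_names, entity_values = zip(*entity_counts)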
plt.figure(figsize=(10, 6))
plt.bar(entity_names, entity_values, color='lightgreen')
plt.xlabel("Entity")
plt.ylabel("Frequency")
plt.title("Top 10 Most Common Entities")
plt.xticks(rotation=45, ha="right")  # keep long entity names readable
plt.tight_layout()
plt.show()
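Entity counts can be split by inconsistent casing, and some models occasionally emit word fragments that start with `##`. One way to tidy this up before counting is sketched below. The `count_entities` helper and the sample data are assumptions for illustration.

```python
from collections import Counter

def count_entities(entity_lists, top_n=10):
    """Count entity names case-insensitively, skipping '##' word fragments.

    Returns (name, count) pairs, using the first spelling seen for each name.
    """
    counts = Counter()
    display = {}
    for entities in entity_lists:
        for entity in entities:
            name = " ".join(entity["word"].split())  # collapse stray whitespace
            if not name or name.startswith("##"):
                continue
            key = name.casefold()
            display.setdefault(key, name)
            counts[key] += 1
    return [(display[key], n) for key, n in counts.most_common(top_n)]

# Hand-written sample standing in for df["Entities"]
sample = [
    [{"word": "Reuters"}, {"word": "##ers"}],
    [{"word": "reuters"}, {"word": "Paris"}],
]
print(count_entities(sample))  # [('Reuters', 2), ('Paris', 1)]
```

In the notebook, `count_entities(df["Entities"])` could stand in for the flatten-and-count step above.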

## 6. Exporting Data to Excel
Finally, let's export our data to an Excel file for further analysis.

In [None]:
df.to_excel("news_headlines_analysis.xlsx", index=False)
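The `Entities` column holds lists of dictionaries, which are likely to show up in the spreadsheet as raw Python text. A small sketch of turning each list into a readable string before export follows; `entities_to_text` is a hypothetical helper, and the commented lines show how it might be applied to the DataFrame from this notebook.

```python
def entities_to_text(entities):
    """Format a list of entity dictionaries as 'Name (TYPE); Name (TYPE)'."""
    return "; ".join(f'{e["word"]} ({e["entity_group"]})' for e in entities)

# Hand-written sample in the shape of one Entities cell
sample_cell = [
    {"word": "Reuters", "entity_group": "ORG", "score": 0.99},
    {"word": "Paris", "entity_group": "LOC", "score": 0.97},
]
print(entities_to_text(sample_cell))  # Reuters (ORG); Paris (LOC)

# Possible use with the notebook's DataFrame:
# export_df = df.copy()
# export_df["Entities"] = export_df["Entities"].apply(entities_to_text)
# export_df.to_excel("news_headlines_analysis.xlsx", index=False)
```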

## 7. Wrap-Up and Discussion
In this notebook, we covered:
- Collecting public data with web scraping
- Cleaning and analyzing the data using pandas and transformers
- Visualizing insights to tell stories with data

**Take-Home Challenge**: Apply these techniques to another dataset, such as social media data, to analyze sentiment on a specific topic.

### Thank you for participating! Feel free to ask questions and explore more on your own.