# Data Exploration and Analysis

This notebook provides an initial exploration of the fake news dataset.

## Objectives:
- Load and examine the dataset
- Understand data distribution
- Identify missing values and data quality issues
- Visualize key patterns
- Generate insights for preprocessing

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import nltk
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
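# The bare 'seaborn' style name was deprecated in Matplotlib 3.6 and later removed,
# so apply seaborn's default theme directly instead of plt.style.use('seaborn').
sns.set_theme(palette="husl")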

## 1. Load Dataset

In [None]:
# TODO: Replace with your actual dataset path
# Example datasets you can use:
# 1. LIAR dataset: https://www.cs.ucsb.edu/~william/data/liar_dataset.zip
# 2. Fake News Detection: https://www.kaggle.com/c/fake-news/data
# 3. ISOT dataset: https://www.uvic.ca/engineering/ece/isot/datasets/

# df = pd.read_csv('../data/raw/fake_news_dataset.csv')
# print(f"Dataset shape: {df.shape}")
# df.head()

## 2. Basic Data Information

In [None]:
# Display basic information about the dataset
# df.info()
# print("\nColumn names:")
# print(df.columns.tolist())
# print("\nMissing values:")
# print(df.isnull().sum())
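
`df.isnull().sum()` only counts missing values. Whitespace-only text and duplicated articles are also common quality issues in news datasets. The following is a minimal sketch of a helper that counts them, assuming the text lives in a column named `text`; the helper name `summarize_quality` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def summarize_quality(frame: pd.DataFrame, text_col: str = "text") -> dict:
    """Return simple data-quality counts for one text column."""
    text = frame[text_col]
    # Whitespace-only strings count as empty; missing values are counted separately.
    stripped = text.dropna().astype(str).str.strip()
    return {
        "rows": len(frame),
        "missing_text": int(text.isna().sum()),
        "empty_text": int((stripped == "").sum()),
        # duplicated() flags every repeat after the first occurrence.
        "duplicate_text": int(text.dropna().duplicated().sum()),
    }


# Toy data for illustration only; not taken from any real dataset.
toy = pd.DataFrame({
    "text": ["Breaking story", "Breaking story", "   ", None, "Another article"],
    "label": ["FAKE", "FAKE", "REAL", "REAL", "REAL"],
})
report = summarize_quality(toy)
print(report)
```

Once the dataset is loaded, the same call would be `summarize_quality(df)`.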

## 3. Label Distribution

In [None]:
# Analyze the distribution of fake vs real news
# label_counts = df['label'].value_counts()
# print(label_counts)

# plt.figure(figsize=(8, 6))
# label_counts.plot(kind='bar')
# plt.title('Distribution of Fake vs Real News')
# plt.xlabel('Label')
# plt.ylabel('Count')
# plt.xticks(rotation=0)
# plt.show()

## 4. Text Length Analysis

In [None]:
# Analyze text length patterns
# df['text_length'] = df['text'].str.len()
# df['word_count'] = df['text'].str.split().str.len()

# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# # Character length distribution
# df.boxplot(column='text_length', by='label', ax=ax1)
# ax1.set_title('Text Length Distribution by Label')

# # Word count distribution
# df.boxplot(column='word_count', by='label', ax=ax2)
# ax2.set_title('Word Count Distribution by Label')

# fig.suptitle('')  # drop the automatic "Boxplot grouped by label" title that pandas adds
# plt.tight_layout()
# plt.show()

## 5. Word Clouds

In [None]:
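# Create word clouds for fake and real news
# NOTE: 'FAKE'/'REAL' are placeholder label values; adjust them to your dataset's encoding (e.g. 0/1).
# Missing texts are dropped first because str.join raises a TypeError on NaN.
# fake_text = ' '.join(df.loc[df['label'] == 'FAKE', 'text'].dropna().astype(str))
# real_text = ' '.join(df.loc[df['label'] == 'REAL', 'text'].dropna().astype(str))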

# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))

# # Fake news word cloud
# fake_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(fake_text)
# ax1.imshow(fake_wordcloud, interpolation='bilinear')
# ax1.set_title('Most Frequent Words in Fake News', fontsize=16)
# ax1.axis('off')

# # Real news word cloud
# real_wordcloud = WordCloud(width=800, height=400, background_color='white').generate(real_text)
# ax2.imshow(real_wordcloud, interpolation='bilinear')
# ax2.set_title('Most Frequent Words in Real News', fontsize=16)
# ax2.axis('off')

# plt.tight_layout()
# plt.show()

## 6. Most Common Words Analysis

In [None]:
# Download NLTK data if needed
# nltk.download('stopwords')
# from nltk.corpus import stopwords
# stop_words = set(stopwords.words('english'))

# def get_common_words(text_series, n=20):
#     """Get most common words from text series"""
#     all_words = []
#     for text in text_series.dropna():  # skip missing articles (NaN has no .lower())
#         words = str(text).lower().split()
#         words = [word for word in words if word not in stop_words and word.isalpha()]
#         all_words.extend(words)
#     return Counter(all_words).most_common(n)

# fake_common = get_common_words(df[df['label'] == 'FAKE']['text'])
# real_common = get_common_words(df[df['label'] == 'REAL']['text'])

# print("Most common words in FAKE news:")
# for word, count in fake_common[:10]:
#     print(f"{word}: {count}")

# print("\nMost common words in REAL news:")
# for word, count in real_common[:10]:
#     print(f"{word}: {count}")
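
One caveat of whitespace splitting followed by `word.isalpha()` is that tokens with attached punctuation, such as `election,`, are discarded rather than counted. The sketch below shows one possible alternative using a regular expression to extract letter runs. The names `tokenize`, `top_words`, and `DEMO_STOP_WORDS` are hypothetical, and the tiny stop-word set stands in for NLTK's English list so the example runs without downloads. Note that this pattern also splits contractions (`don't` becomes `don` and `t`) and ignores non-ASCII letters.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; the notebook itself uses NLTK's English list.
DEMO_STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to"}


def tokenize(text):
    """Lowercase the text and return runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(texts, stop_words, n=3):
    """Count non-stop-word tokens across all texts and return the n most common."""
    counts = Counter()
    for text in texts:
        counts.update(tok for tok in tokenize(text) if tok not in stop_words)
    return counts.most_common(n)


# Made-up sentences for illustration only.
sample = ["The election, the ELECTION!", "Election results: fraud?"]
print(top_words(sample, DEMO_STOP_WORDS))

# For comparison: the split-and-isalpha approach drops punctuated tokens entirely.
kept = [w for w in sample[0].lower().split() if w.isalpha()]
print(kept)
```

In the comparison at the end, only the two occurrences of `the` survive the `isalpha()` filter, so both mentions of "election" in the first sentence would be lost.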

## 7. Key Insights and Next Steps

Based on the exploration above, document your key findings:

1. **Dataset Balance**: 
2. **Text Characteristics**: 
3. **Vocabulary Differences**: 
4. **Data Quality Issues**: 
5. **Preprocessing Needs**: 

### Recommended Next Steps:
1. Clean and preprocess the text data
2. Handle missing values appropriately
3. Balance the dataset if needed
4. Prepare features for modeling
5. Split data into train/validation/test sets
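
As a starting point for the final step, here is a minimal sketch of a stratified train/validation/test split that uses only pandas, so each split keeps roughly the same label proportions as the full dataset. The function name `stratified_split`, the default fractions, and the toy frame are assumptions for illustration. It also assumes the DataFrame index is unique, because rows are removed by index label.

```python
import pandas as pd


def stratified_split(frame, label_col="label", val_frac=0.15, test_frac=0.15, seed=42):
    """Split into train/validation/test, sampling each label group separately."""
    # Draw the test set per label so class proportions are preserved.
    test = frame.groupby(label_col, group_keys=False).sample(
        frac=test_frac, random_state=seed
    )
    remaining = frame.drop(test.index)
    # Rescale the fraction so validation is val_frac of the ORIGINAL frame.
    val = remaining.groupby(label_col, group_keys=False).sample(
        frac=val_frac / (1 - test_frac), random_state=seed
    )
    train = remaining.drop(val.index)
    return train, val, test


# Toy data for illustration only: 10 'FAKE' and 10 'REAL' rows.
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(20)],
    "label": ["FAKE", "REAL"] * 10,
})
train, val, test = stratified_split(toy, val_frac=0.2, test_frac=0.2)
print(len(train), len(val), len(test))
```

Splitting before any resampling or class balancing keeps the validation and test sets representative of the original label distribution.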