# Fake News Detection - Data Exploration Notebook

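This notebook sets up and verifies the development environment for a fake news detection project: it installs dependencies, configures the libraries, checks the project structure and Git repository, and runs a few smoke tests on sample data. Once the setup passes, you can use the same environment to analyze text patterns, visualize differences between real and fake news, and prepare data for machine learning models.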

## Table of Contents
1. [Install Required Dependencies](#install-dependencies)
2. [Configure Development Environment](#configure-environment)  
3. [Set Up Project Structure](#project-structure)
4. [Initialize Version Control](#version-control)
5. [Configure IDE Settings](#ide-settings)
6. [Test Environment Setup](#test-environment)

## 1. Install Required Dependencies {#install-dependencies}

First, let's install all the necessary packages for our fake news detection project.

In [None]:
# Install required packages (run only if packages are not installed)
import sys
import subprocess

def install_package(package):
    """Install a package using pip."""
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✅ {package} installed successfully")
    except subprocess.CalledProcessError:
        print(f"❌ Failed to install {package}")

# List of required packages
packages = [
    "streamlit>=1.28.0",
    "pandas>=1.5.0", 
    "numpy>=1.21.0",
    "scikit-learn>=1.3.0",
    "nltk>=3.8",
    "matplotlib>=3.6.0",
    "seaborn>=0.12.0",
    "plotly>=5.15.0",
    "wordcloud>=1.9.0",
    "textblob>=0.17.0",
    "joblib>=1.3.0"
]

print("📦 Installing required packages...")
for package in packages:
    install_package(package)
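
Re-running `pip install` for every package on each notebook start can be slow. As a minimal sketch, the check below uses the standard-library `importlib.metadata` to report which requirements are not installed yet, so only those need to be passed to `install_package`. The helper name `missing_packages` is hypothetical and not part of the project; it only checks that a distribution is present and does not compare version numbers.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def missing_packages(requirements):
    """Return the requirement strings whose distribution is not installed.

    Hypothetical helper: it checks presence only and ignores the
    version constraint (for example the ">=1.5.0" part).
    """
    missing = []
    for requirement in requirements:
        # Keep only the distribution name in front of any version specifier.
        name = re.split(r"[<>=!~ ]", requirement, maxsplit=1)[0]
        try:
            version(name)
        except PackageNotFoundError:
            missing.append(requirement)
    return missing


# Example: a made-up name that should not exist is reported as missing.
print(missing_packages(["surely-not-a-real-package-xyz>=1.0"]))
```

In the cell above, the loop could then iterate over `missing_packages(packages)` instead of `packages`, assuming a presence check is enough for your setup.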

## 2. Configure Development Environment {#configure-environment}

Set up the development environment and import necessary libraries.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import nltk
from wordcloud import WordCloud
from textblob import TextBlob
import re
import os
import sys
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set up plotting styles
plt.style.use('default')
sns.set_palette("husl")

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

# Download required NLTK data
try:
    nltk.data.find('tokenizers/punkt')
    nltk.data.find('tokenizers/punkt_tab')  # used by word_tokenize in newer NLTK releases
    nltk.data.find('corpora/stopwords')
    nltk.data.find('corpora/wordnet')
    print("✅ NLTK data already available")
except LookupError:
    print("📥 Downloading NLTK data...")
    nltk.download('punkt')
    nltk.download('punkt_tab')
    nltk.download('stopwords')
    nltk.download('wordnet')
    print("✅ NLTK data downloaded successfully")

print("🔧 Environment configured successfully!")
print(f"📊 Pandas version: {pd.__version__}")
print(f"🔢 NumPy version: {np.__version__}")
print(f"📈 Matplotlib version: {plt.matplotlib.__version__}")
print(f"📊 Seaborn version: {sns.__version__}")
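
The cell above downloads every NLTK resource as soon as one of them is missing. A possible refinement, sketched here under the assumption that per-resource checks are preferable, is to look up and fetch each resource individually. The helper `ensure_resources` and the `NLTK_RESOURCES` mapping are hypothetical names; the lookup and download functions are passed in as arguments so the logic can be tried without NLTK installed.

```python
def ensure_resources(resources, find, download):
    """Download only the resources that `find` cannot locate.

    resources: mapping of lookup path -> download name
    find: callable that raises LookupError when a path is missing
    download: callable that fetches a resource by its download name
    Returns the list of names that had to be downloaded.
    """
    fetched = []
    for path, name in resources.items():
        try:
            find(path)
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched


# Lookup paths and download names used by the notebook (assumed mapping).
NLTK_RESOURCES = {
    "tokenizers/punkt": "punkt",
    "corpora/stopwords": "stopwords",
    "corpora/wordnet": "wordnet",
}

# In the notebook this would be called as:
#   import nltk
#   ensure_resources(NLTK_RESOURCES, nltk.data.find, nltk.download)
```

Passing `nltk.data.find` and `nltk.download` keeps the behavior of the original cell while skipping resources that are already present.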

## 3. Set Up Project Structure {#project-structure}

Let's verify and create the necessary project directories and load our data.

In [None]:
# Check project structure
project_root = os.path.dirname(os.getcwd())  # Go up one level from notebooks/
print(f"📁 Project root: {project_root}")

# Required directories
required_dirs = ['data', 'src', 'models', 'notebooks']
existing_dirs = []
missing_dirs = []

for directory in required_dirs:
    dir_path = os.path.join(project_root, directory)
    if os.path.exists(dir_path):
        existing_dirs.append(directory)
        print(f"✅ {directory}/ directory exists")
    else:
        missing_dirs.append(directory)
        print(f"❌ {directory}/ directory missing")

# Create missing directories
for directory in missing_dirs:
    dir_path = os.path.join(project_root, directory)
    os.makedirs(dir_path, exist_ok=True)
    print(f"📁 Created {directory}/ directory")

# Add src to Python path for imports
src_path = os.path.join(project_root, 'src')
if src_path not in sys.path:
    sys.path.append(src_path)
    print(f"🔧 Added {src_path} to Python path")

print("\n📋 Project structure verified successfully!")
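
The cell above assumes the notebook is started from inside `notebooks/`, so the project root is one level up. If the notebook may also be launched from the root itself, one alternative is to search upward for a folder that contains the expected directories. This is a sketch only: `find_project_root` is a hypothetical helper, and the marker directories `data` and `src` are an assumption taken from the list of required directories.

```python
from pathlib import Path


def find_project_root(start, markers=("data", "src")):
    """Walk upward from `start` until a folder containing all markers is found.

    Returns None when no ancestor matches. The marker names are an
    assumption based on the directories this notebook expects.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if all((candidate / marker).is_dir() for marker in markers):
            return candidate
    return None


# Example: works whether the notebook runs from the root or from notebooks/.
print(find_project_root(Path.cwd()))
```

When the function returns `None` (for example on a fresh checkout where `data/` and `src/` do not exist yet), falling back to the parent of the working directory reproduces the original behavior.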

## 4. Initialize Version Control {#version-control}

Check Git status and initialize repository if needed.

In [None]:
# Check Git status
import subprocess

def run_command(command):
    """Run a shell command and return the result."""
    try:
        result = subprocess.run(command, shell=True, capture_output=True, text=True, cwd=project_root)
        return result.returncode, result.stdout.strip(), result.stderr.strip()
    except Exception as e:
        return 1, "", str(e)

# Check if Git is initialized
returncode, stdout, stderr = run_command("git status")

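if returncode == 0:
    print("✅ Git repository is already initialized")
    # Take the first line of the status output (empty if Git printed nothing)
    first_line = stdout.splitlines()[0] if stdout else ""
    print(f"📊 Git status: {first_line}")
else:
    print("⚠️ Git repository not initialized")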
    
    # Initialize Git repository
    init_code, init_out, init_err = run_command("git init")
    if init_code == 0:
        print("✅ Git repository initialized successfully")
        
        # Add a minimal .gitignore if it doesn't exist (entries are a suggested starting point)
        gitignore_path = os.path.join(project_root, '.gitignore')
        if not os.path.exists(gitignore_path):
            print("📝 Creating .gitignore file...")
            with open(gitignore_path, 'w') as f:
                f.write("__pycache__/\n*.pyc\n.ipynb_checkpoints/\n")
        else:
            print("✅ .gitignore file already exists")
    else:
        print(f"❌ Failed to initialize Git: {init_err}")

# Check for common files
common_files = ['README.md', 'requirements.txt', 'config.py', '.gitignore']
for filename in common_files:
    filepath = os.path.join(project_root, filename)
    if os.path.exists(filepath):
        print(f"✅ {filename} exists")
    else:
        print(f"⚠️ {filename} missing - should be created")
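
The `run_command` helper above passes a string through the shell. A variant that takes the command as a list of arguments avoids shell quoting issues and reports a missing executable cleanly. The sketch below is an alternative under that assumption, not the project's own helper; the name `run_command_no_shell` and the return code 127 for a missing program are choices made for this example.

```python
import subprocess
import sys


def run_command_no_shell(args, cwd=None):
    """Run a command given as a list of arguments, without a shell.

    Returns (returncode, stdout, stderr). A missing executable is
    reported as return code 127 instead of raising.
    """
    try:
        result = subprocess.run(args, capture_output=True, text=True, cwd=cwd)
    except FileNotFoundError as exc:
        return 127, "", str(exc)
    return result.returncode, result.stdout.strip(), result.stderr.strip()


# Example with the current Python interpreter, which is always available here.
print(run_command_no_shell([sys.executable, "-c", "print('ok')"]))

# For Git, an equivalent repository check would be:
#   run_command_no_shell(["git", "rev-parse", "--is-inside-work-tree"], cwd=project_root)
```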

## 5. Configure IDE Settings {#ide-settings}

Check Python environment and configure notebook settings.

In [None]:
# Check Python environment
print("🐍 Python Environment Information:")
print(f"Python version: {sys.version}")
print(f"Python executable: {sys.executable}")
print(f"Current working directory: {os.getcwd()}")
print(f"Python path includes:")
for path in sys.path[:5]:  # Show first 5 paths
    print(f"  - {path}")

# Configure Jupyter notebook settings
from IPython.display import HTML, display
from IPython.core.magic import register_line_magic

# Render plots inline in the notebook output
%matplotlib inline

# Configure figure size
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

# Set up pretty printing
pd.options.display.max_rows = 100
pd.options.display.max_columns = 20

print("\n📊 Jupyter notebook configured:")
print("  ✅ Inline plotting enabled")
print("  ✅ Figure size set to 12x8")
print("  ✅ Display options configured")

# Check whether we are running in Jupyter, and possibly inside VS Code
try:
    shell_name = get_ipython().__class__.__name__
    print(f"  ✅ Running in Jupyter environment ({shell_name})")
    # Heuristic: VS Code usually exports VSCODE_PID to the processes it launches
    if 'VSCODE_PID' in os.environ:
        print("  ✅ VS Code Jupyter extension detected")
except NameError:
    print("  ⚠️ Not running in Jupyter environment")

print("\n🎨 Visual settings configured successfully!")

## 6. Test Environment Setup {#test-environment}

Test that all components are working correctly by loading sample data and running basic operations.

In [None]:
# Test environment by loading and processing sample data
print("🧪 Testing Environment Setup...")

# Test 1: Load sample data
try:
    sample_data_path = os.path.join(project_root, 'data', 'sample_data.csv')
    if os.path.exists(sample_data_path):
        df = pd.read_csv(sample_data_path)
        print(f"✅ Test 1 - Sample data loaded: {len(df)} rows")
        print(f"   Columns: {list(df.columns)}")
    else:
        # Create sample data if it doesn't exist
        sample_data = [
            ("Scientists discover new treatment for common disease in breakthrough study.", 0),
            ("SHOCKING: This one weird trick will change your life forever!", 1),
            ("Economic indicators show steady growth in the technology sector.", 0),
            ("BREAKING: Celebrities don't want you to know this secret!", 1),
        ]
        df = pd.DataFrame(sample_data, columns=['text', 'label'])
        print("✅ Test 1 - Sample data created successfully")
except Exception as e:
    print(f"❌ Test 1 Failed - Data loading: {e}")
    df = pd.DataFrame()

# Test 2: Basic data processing
try:
    if not df.empty:
        print(f"✅ Test 2 - Data shape: {df.shape}")
        print(f"   Real news: {sum(df['label'] == 0)}")
        print(f"   Fake news: {sum(df['label'] == 1)}")
    else:
        raise ValueError("No data to process")
except Exception as e:
    print(f"❌ Test 2 Failed - Data processing: {e}")

# Test 3: Text processing with NLTK
try:
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    
    sample_text = "This is a test sentence for checking NLTK functionality."
    tokens = word_tokenize(sample_text.lower())
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [w for w in tokens if w not in stop_words and w.isalpha()]
    
    print(f"✅ Test 3 - NLTK processing: {len(filtered_tokens)} tokens extracted")
except Exception as e:
    print(f"❌ Test 3 Failed - NLTK processing: {e}")

# Test 4: Visualization
try:
    if not df.empty:
        # Simple bar plot; reindex so the counts line up with the label names
        label_counts = df['label'].value_counts().reindex([0, 1], fill_value=0)
        fig, ax = plt.subplots(1, 1, figsize=(6, 4))
        ax.bar(['Real News', 'Fake News'], label_counts.values)
        ax.set_title('Sample Data Distribution')
        ax.set_ylabel('Count')
        plt.tight_layout()
        plt.show()
        
        print("✅ Test 4 - Matplotlib visualization working")
    else:
        raise ValueError("No data for visualization")
except Exception as e:
    print(f"❌ Test 4 Failed - Visualization: {e}")

# Test 5: Import custom modules
try:
    # Try to import our custom modules
    sys.path.append(os.path.join(project_root, 'src'))
    
    # Test if files exist before importing
    src_files = ['data_processing.py', 'model.py', 'utils.py']
    existing_files = []
    
    for filename in src_files:
        filepath = os.path.join(project_root, 'src', filename)
        if os.path.exists(filepath):
            existing_files.append(filename)
    
    if existing_files:
        print(f"✅ Test 5 - Source files available: {existing_files}")
        print("   Ready to import custom modules")
    else:
        print("⚠️ Test 5 - Source files not found, will need to create them")
        
except Exception as e:
    print(f"❌ Test 5 Failed - Module import: {e}")

print("\n🎉 Environment setup testing completed!")
print("📋 Summary:")
print("   - All required packages installed")
print("   - Project structure verified")
print("   - Python environment configured")
print("   - Basic functionality tested")
print("\n🚀 You're ready to start developing the fake news detection app!")

## Next Steps

Now that your development environment is set up, you can:

1. **Explore Data**: Load your own datasets and explore the patterns in fake vs real news
2. **Train Models**: Use the `src/model.py` module to train machine learning models
3. **Run the App**: Start the Streamlit web application with `streamlit run app.py`
4. **Experiment**: Use this notebook for data analysis and model experimentation
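
As a starting point for the data exploration step, the sketch below counts a few surface cues (word count, exclamation marks, all-caps words) and averages them per label. It reuses the four example headlines from the environment test and relies only on the standard library; the helper names `surface_features` and `mean_features_by_label` are hypothetical, and four toy samples are far too few to draw conclusions from.

```python
import re
from statistics import mean

# Same four example headlines the environment test uses (0 = real, 1 = fake).
SAMPLES = [
    ("Scientists discover new treatment for common disease in breakthrough study.", 0),
    ("SHOCKING: This one weird trick will change your life forever!", 1),
    ("Economic indicators show steady growth in the technology sector.", 0),
    ("BREAKING: Celebrities don't want you to know this secret!", 1),
]


def surface_features(text):
    """Count a few simple surface cues in a piece of text."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "word_count": len(words),
        "exclamations": text.count("!"),
        "all_caps_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
    }


def mean_features_by_label(samples):
    """Average each feature separately for every label."""
    grouped = {}
    for text, label in samples:
        grouped.setdefault(label, []).append(surface_features(text))
    return {
        label: {key: mean(row[key] for row in rows) for key in rows[0]}
        for label, rows in grouped.items()
    }


for label, averages in mean_features_by_label(SAMPLES).items():
    print(label, averages)
```

With a real dataset loaded into a DataFrame, the same per-text function could be applied to the `text` column and grouped by `label`.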

### Quick Start Commands

```bash
# Install dependencies
pip install -r requirements.txt

# Run the web application
streamlit run app.py

# Train a model
python src/model.py
```

### Project Resources

- 📁 **Data**: Place your datasets in the `data/` folder
- 🤖 **Models**: Trained models are saved in `models/`
- 📓 **Notebooks**: Use this folder for experimentation
- 📱 **App**: The main application is in `app.py`

Happy coding! 🚀