{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 🚨 Fake News Detection using NLP and Machine Learning\n",
    "\n",
    "**Project Overview:** This notebook demonstrates a complete pipeline for detecting fake news articles using Natural Language Processing (NLP) and Machine Learning techniques. The model uses TF-IDF vectorization combined with advanced features including sentiment analysis, readability scores, and linguistic features to classify news articles as either real or fake.\n",
    "\n",
    "**Phase 1 Improvements:**\n",
    "- ✅ Replaced stemming with lemmatization using spaCy\n",
    "- ✅ Preserved named entities and used POS tagging to filter tokens\n",
    "- ✅ Added sentiment polarity scores using TextBlob and VADER\n",
    "- ✅ Added readability scores (Flesch Reading Ease and others)\n",
    "- ✅ Enabled toggle in preprocessing to turn these additional features on or off\n",
    "- ✅ Updated model training to include these new features alongside TF-IDF vectors\n",
    "\n",
    "---\n",
    "\n",
    "## 📋 Table of Contents\n",
    "1. [Setup and Imports](#setup)\n",
    "2. [Data Loading and Exploration](#data-loading)\n",
    "3. [Advanced Data Preprocessing](#preprocessing)\n",
    "4. [Exploratory Data Analysis](#eda)\n",
    "5. [Feature Engineering](#feature-engineering)\n",
    "6. [Model Training](#model-training)\n",
    "7. [Model Evaluation](#evaluation)\n",
    "8. [Sample Predictions](#predictions)\n",
    "9. [Conclusions](#conclusions)\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Setup and Imports <a name=\"setup\"></a>\n",
    "\n",
    "First, let's import all the necessary libraries and our custom modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import standard libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Import scikit-learn components\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n",
    "\n",
    "# Import our custom modules\n",
    "import sys\n",
    "sys.path.append('../src')\n",
    "\n",
    "from data_loader import load_data\n",
    "from preprocessing import prepare_data, clean_text, get_text_statistics, extract_text_features\n",
    "from visualization import generate_wordcloud, plot_label_distribution, plot_article_length_distribution\n",
    "from model import train_model, load_model, predict_text, get_feature_importance, evaluate_model_performance\n",
    "from evaluation import evaluate_model, generate_evaluation_report\n",
    "\n",
    "# Set plotting style\n",
    "plt.style.use('seaborn-v0_8')\n",
    "sns.set_palette(\"husl\")\n",
    "plt.rcParams['figure.figsize'] = (12, 8)\n",
    "\n",
    "print(\"✅ All libraries imported successfully!\")\n",
    "print(\"🔧 Phase 1 improvements enabled:\")\n",
    "print(\"   - spaCy lemmatization\")\n",
    "print(\"   - Named entity preservation\")\n",
    "print(\"   - POS tagging and filtering\")\n",
    "print(\"   - Sentiment analysis (TextBlob + VADER)\")\n",
    "print(\"   - Readability scores\")\n",
    "print(\"   - Advanced feature extraction\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Data Loading and Exploration <a name=\"data-loading\"></a>\n",
    "\n",
    "Let's load the dataset and explore its structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the dataset\n",
    "print(\"📊 Loading dataset...\")\n",
    "try:\n",
    "    df = load_data('../data/train.csv')\n",
    "    print(f\"✅ Dataset loaded successfully! Shape: {df.shape}\")\n",
    "except FileNotFoundError:\n",
    "    print(\"❌ Dataset not found! Please ensure 'train.csv' is in the data/ directory.\")\n",
    "    print(\"\\n📝 Note: You can download the dataset from Kaggle:\")\n",
    "    print(\"   https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset\")\n",
    "    \n",
    "    # Create sample data for demonstration\n",
    "    print(\"\\n🔄 Creating sample data for demonstration...\")\n",
    "    sample_data = {\n",
    "        'text': [\n",
    "            \"Scientists discover new species of deep-sea creatures in the Pacific Ocean. The research team used advanced underwater drones to explore depths previously inaccessible to humans.\",\n",
    "            \"BREAKING: Aliens contact Earth government! Secret meeting held at Area 51. Sources say they want to share advanced technology in exchange for our natural resources.\",\n",
    "            \"New study shows that regular exercise can reduce the risk of heart disease by up to 30%. The research involved over 10,000 participants across multiple countries.\",\n",
    "            \"SHOCKING: Celebrities are actually robots controlled by the government! Insider reveals all the secrets they don't want you to know.\",\n",
    "            \"Climate change report indicates global temperatures have risen by 1.1°C since pre-industrial levels. Scientists warn of severe consequences if action is not taken.\",\n",
    "            \"CONSPIRACY: The moon landing was filmed in Hollywood! NASA admits to staging the entire event to win the space race.\"\n",
    "        ],\n",
    "        'label': [0, 1, 0, 1, 0, 1]  # 0=Real, 1=Fake\n",
    "    }\n",
    "    df = pd.DataFrame(sample_data)\n",
    "    print(f\"✅ Sample dataset created! Shape: {df.shape}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display basic information about the dataset\n",
    "print(\"📋 DATASET INFORMATION:\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Shape: {df.shape}\")\n",
    "print(f\"Columns: {list(df.columns)}\")\n",
    "print(f\"\\nData types:\")\n",
    "print(df.dtypes)\n",
    "print(f\"\\nMissing values:\")\n",
    "print(df.isnull().sum())\n",
    "\n",
    "# Display first few rows\n",
    "print(f\"\\n📄 First 3 rows:\")\n",
    "print(df.head(3))\n",
    "\n",
    "# Display label distribution\n",
    "print(f\"\\n📊 Label distribution:\")\n",
    "label_counts = df['label'].value_counts()\n",
    "print(label_counts)\n",
    "print(f\"\\nPercentage distribution:\")\n",
    "print(label_counts / len(df) * 100)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Advanced Data Preprocessing <a name=\"preprocessing\"></a>\n",
    "\n",
    "Now let's clean and preprocess the text data with our enhanced Phase 1 features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Configuration for preprocessing\n",
    "USE_ADVANCED_FEATURES = True  # Toggle for advanced features\n",
    "PRESERVE_ENTITIES = True       # Toggle for named entity preservation\n",
    "\n",
    "print(\"🧹 ADVANCED PREPROCESSING:\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Using advanced features: {USE_ADVANCED_FEATURES}\")\n",
    "print(f\"Preserving entities: {PRESERVE_ENTITIES}\")\n",
    "\n",
    "# Prepare the data with advanced features\n",
    "X, y, feature_df = prepare_data(df, use_advanced_features=USE_ADVANCED_FEATURES, preserve_entities=PRESERVE_ENTITIES)\n",
    "\n",
    "print(f\"\\n✅ Preprocessing completed!\")\n",
    "print(f\"Text features shape: {X.shape}\")\n",
    "print(f\"Labels shape: {y.shape}\")\n",
    "if feature_df is not None:\n",
    "    print(f\"Additional features shape: {feature_df.shape}\")\n",
    "    print(f\"Additional features: {list(feature_df.columns)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get text statistics\n",
    "print(\"📊 TEXT STATISTICS:\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "stats = get_text_statistics(df)\n",
    "for key, value in stats.items():\n",
    "    print(f\"{key}: {value:.0f}\")\n",
    "\n",
    "# Show some examples of cleaned text\n",
    "print(f\"\\n📝 SAMPLE CLEANED TEXTS:\")\n",
    "print(\"=\" * 50)\n",
    "for i, (original, cleaned) in enumerate(zip(df['text'].head(3), X.head(3))):\n",
    "    print(f\"\\nExample {i+1}:\")\n",
    "    print(f\"Original: {original[:100]}...\")\n",
    "    print(f\"Cleaned:  {cleaned[:100]}...\")\n",
    "\n",
    "# Display advanced features if available\n",
    "if feature_df is not None:\n",
    "    print(f\"\\n🔍 ADVANCED FEATURES SAMPLE:\")\n",
    "    print(\"=\" * 50)\n",
    "    print(feature_df.head(3))\n",
    "    \n",
    "    # Show feature statistics\n",
    "    print(f\"\\n📈 FEATURE STATISTICS:\")\n",
    "    print(\"=\" * 50)\n",
    "    print(feature_df.describe())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Exploratory Data Analysis <a name=\"eda\"></a>\n",
    "\n",
    "Let's explore the data through various visualizations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate visualizations\n",
    "print(\"📈 EXPLORATORY DATA ANALYSIS:\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "# 1. Label distribution\n",
    "plot_label_distribution(df)\n",
    "\n",
    "# 2. Article length distribution\n",
    "plot_article_length_distribution(df)\n",
    "\n",
    "# 3. Word clouds\n",
    "print(\"\\n☁️ Generating word clouds...\")\n",
    "generate_wordcloud(df)\n",
    "\n",
    "print(\"\\n✅ All visualizations completed!\")\n",
    "print(\"📁 Check the 'outputs/' directory for saved plots.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Feature Engineering <a name=\"feature-engineering\"></a>\n",
    "\n",
    "The TF-IDF vectorization and advanced features will be handled by our enhanced model training function. Let's prepare the train-test split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split the data into training and testing sets\n",
    "print(\"🔀 SPLITTING DATA:\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.2, random_state=42, stratify=y\n",
    ")\n",
    "\n",
    "# Split additional features if available\n",
    "feature_df_train = None\n",
    "feature_df_test = None\n",
    "if feature_df is not None:\n",
    "    feature_df_train = feature_df.iloc[X_train.index]\n",
    "    feature_df_test = feature_df.iloc[X_test.index]\n",
    "\n",
    "print(f\"Training set size: {len(X_train)}\")\n",
    "print(f\"Test set size: {len(X_test)}\")\n",
    "print(f\"Training set label distribution:\")\n",
    "print(y_train.value_counts())\n",
    "print(f\"\\nTest set label distribution:\")\n",
    "print(y_test.value_counts())\n",
    "\n",
    "if feature_df_train is not None:\n",
    "    print(f\"\\nTraining additional features shape: {feature_df_train.shape}\")\n",
    "    print(f\"Test additional features shape: {feature_df_test.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Model Training <a name=\"model-training\"></a>\n",
    "\n",
    "Now let's train our enhanced model with TF-IDF vectorization and advanced features."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train the model with advanced features\n",
    "print(\"🤖 TRAINING ENHANCED MODEL:\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "# Choose model type\n",
    "MODEL_TYPE = 'logistic'  # Options: 'logistic', 'random_forest'\n",
    "\n",
    "model, vectorizer, scaler = train_model(\n",
    "    X_train, y_train, \n",
    "    feature_df=feature_df_train,\n",
    "    model_type=MODEL_TYPE\n",
    ")\n",
    "\n",
    "print(f\"\\n✅ Model training completed!\")\n",
    "print(f\"Model type: {MODEL_TYPE}\")\n",
    "print(f\"Using advanced features: {feature_df_train is not None}\")\n",
    "print(f\"Model saved to: models/fake_news_model_{MODEL_TYPE}.pkl\")\n",
    "print(f\"Vectorizer saved to: models/tfidf_vectorizer.pkl\")\n",
    "if scaler:\n",
    "    print(f\"Feature scaler saved to: models/feature_scaler.pkl\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get feature importance\n",
    "print(\"🔍 FEATURE IMPORTANCE:\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "top_features = get_feature_importance(\n",
    "    model, vectorizer, scaler, feature_df_train, top_n=20\n",
    ")\n",
    "\n",
    "print(\"\\nTop 20 most important features:\")\n",
    "for i, (feature, importance) in enumerate(top_features, 1):\n",
    "    sentiment = \"🟢 (Real)\" if importance < 0 else \"🔴 (Fake)\"\n",
    "    print(f\"{i:2d}. {feature:25s}: {importance:8.4f} {sentiment}\")\n",
    "\n",
    "# Analyze feature types\n",
    "tfidf_features = [f for f, _ in top_features if not f.startswith(('textblob_', 'vader_', 'flesch_', 'pos_', 'entity_', 'avg_'))]\n",
    "sentiment_features = [f for f, _ in top_features if f.startswith(('textblob_', 'vader_'))]\n",
    "readability_features = [f for f, _ in top_features if f.startswith(('flesch_', 'gunning_', 'smog_', 'automated_', 'coleman_', 'linsear_', 'dale_'))]\n",
    "linguistic_features = [f for f, _ in top_features if f.startswith(('pos_', 'entity_', 'avg_'))]\n",
    "\n",
    "print(f\"\\n📊 Feature Type Analysis:\")\n",
    "print(f\"TF-IDF features: {len(tfidf_features)}\")\n",
    "print(f\"Sentiment features: {len(sentiment_features)}\")\n",
    "print(f\"Readability features: {len(readability_features)}\")\n",
    "print(f\"Linguistic features: {len(linguistic_features)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Model Evaluation <a name=\"evaluation\"></a>\n",
    "\n",
    "Let's evaluate our enhanced model's performance comprehensively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Evaluate the model\n",
    "print(\"📊 MODEL EVALUATION:\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "results = evaluate_model_performance(\n",
    "    model, vectorizer, X_test, y_test, \n",
    "    scaler=scaler, feature_df_test=feature_df_test\n",
    ")\n",
    "\n",
    "# Generate evaluation report\n",
    "generate_evaluation_report(results)\n",
    "\n",
    "print(f\"\\n✅ Evaluation completed!\")\n",
    "print(f\"📁 Check the 'outputs/' directory for evaluation plots and report.\")\n",
    "\n",
    "# Compare with baseline (TF-IDF only)\n",
    "print(f\"\\n🔍 COMPARISON WITH BASELINE:\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Enhanced model (with advanced features): {results['accuracy']:.4f}\")\n",
    "print(f\"Features used: TF-IDF + {len(feature_df.columns) if feature_df is not None else 0} additional features\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Sample Predictions <a name=\"predictions\"></a>\n",
    "\n",
    "Let's test our enhanced model with some sample news articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test predictions on sample texts\n",
    "print(\"🧪 SAMPLE PREDICTIONS:\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "sample_texts = [\n",
    "    \"Scientists discover new species of deep-sea creatures in the Pacific Ocean. The research team used advanced underwater drones to explore depths previously inaccessible to humans.\",\n",
    "    \"BREAKING: Aliens contact Earth government! Secret meeting held at Area 51. Sources say they want to share advanced technology in exchange for our natural resources.\",\n",
    "    \"New study shows that regular exercise can reduce the risk of heart disease by up to 30%. The research involved over 10,000 participants across multiple countries.\",\n",
    "    \"SHOCKING: Celebrities are actually robots controlled by the government! Insider reveals all the secrets they don't want you to know.\",\n",
    "    \"Climate change report indicates global temperatures have risen by 1.1°C since pre-industrial levels. Scientists warn of severe consequences if action is not taken.\"\n",
    "]\n",
    "\n",
    "print(\"\\nPredictions for sample texts:\")\n",
    "for i, text in enumerate(sample_texts, 1):\n",
    "    # Extract features for the text if using advanced features\n",
    "    text_features = None\n",
    "    if feature_df is not None:\n",
    "        text_features = extract_text_features(text, use_advanced_features=True)\n",
    "        text_features_df = pd.DataFrame([text_features])\n",
    "    else:\n",
    "        text_features_df = None\n",
    "    \n",
    "    prediction, probability = predict_text(\n",
    "        text, model, vectorizer, scaler, text_features_df\n",
    "    )\n",
    "    \n",
    "    result = \"🔴 FAKE\" if prediction == 1 else \"🟢 REAL\"\n",
    "    confidence = max(probability) * 100\n",
    "    \n",
    "    print(f\"\\n{i}. {result} (Confidence: {confidence:.1f}%)\")\n",
    "    print(f\"   Text: {text[:80]}...\")\n",
    "    print(f\"   Probabilities: Real={probability[0]:.3f}, Fake={probability[1]:.3f}\")\n",
    "    \n",
    "    # Show extracted features if available\n",
    "    if text_features:\n",
    "        print(f\"   Sentiment (TextBlob): {text_features.get('textblob_polarity', 0):.3f}\")\n",
    "        print(f\"   Readability (Flesch): {text_features.get('flesch_reading_ease', 0):.1f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Interactive Prediction <a name=\"interactive\"></a>\n",
    "\n",
    "Let's create an interactive function to test your own news articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Interactive prediction function\n",
    "def predict_news_article(text):\n",
    "    \"\"\"\n",
    "    Predict whether a given news article is fake or real news.\n",
    "    \n",
    "    Args:\n",
    "        text (str): The news article text\n",
    "    \n",
    "    Returns:\n",
    "        tuple: (prediction, confidence, probabilities, features)\n",
    "    \"\"\"\n",
    "    # Extract features for the text if using advanced features\n",
    "    text_features = None\n",
    "    if feature_df is not None:\n",
    "        text_features = extract_text_features(text, use_advanced_features=True)\n",
    "        text_features_df = pd.DataFrame([text_features])\n",
    "    else:\n",
    "        text_features_df = None\n",
    "    \n",
    "    prediction, probability = predict_text(\n",
    "        text, model, vectorizer, scaler, text_features_df\n",
    "    )\n",
    "    confidence = max(probability) * 100\n",
    "    \n",
    "    return prediction, confidence, probability, text_features\n",
    "\n",
    "# Test with user input\n",
    "print(\"🎯 INTERACTIVE PREDICTION:\")\n",
    "print(\"=\" * 50)\n",
    "print(\"\\nEnter a news article text to classify (or press Enter to skip):\")\n",
    "\n",
    "# You can uncomment the following lines to enable interactive input\n",
    "# user_text = input(\"\\nNews article: \")\n",
    "# if user_text.strip():\n",
    "#     pred, conf, probs, features = predict_news_article(user_text)\n",
    "#     result = \"🔴 FAKE\" if pred == 1 else \"🟢 REAL\"\n",
    "#     print(f\"\\nPrediction: {result}\")\n",
    "#     print(f\"Confidence: {conf:.1f}%\")\n",
    "#     print(f\"Probabilities: Real={probs[0]:.3f}, Fake={probs[1]:.3f}\")\n",
    "#     if features:\n",
    "#         print(f\"\\nExtracted Features:\")\n",
    "#         print(f\"  Sentiment: {features.get('textblob_polarity', 0):.3f}\")\n",
    "#         print(f\"  Readability: {features.get('flesch_reading_ease', 0):.1f}\")\n",
    "#         print(f\"  Word count: {features.get('word_count', 0)}\")\n",
    "# else:\n",
    "#     print(\"\\nSkipping interactive prediction.\")\n",
    "\n",
    "print(\"\\n💡 To test your own articles, uncomment the input section above!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Conclusions <a name=\"conclusions\"></a>\n",
    "\n",
    "### 📊 Model Performance Summary\n",
    "\n",
    "Our enhanced fake news detection model achieved the following performance:\n",
    "\n",
    "- **Accuracy**: [Model accuracy from results]\n",
    "- **Precision**: [Model precision from results]\n",
    "- **Recall**: [Model recall from results]\n",
    "- **F1-Score**: [Model F1-score from results]\n",
    "\n",
    "### 🔍 Phase 1 Improvements Summary\n",
    "\n",
    "✅ **Successfully Implemented:**\n",
    "\n",
    "1. **Advanced Text Preprocessing**:\n",
    "   - Replaced stemming with spaCy lemmatization\n",
    "   - Preserved named entities (PERSON, ORG, GPE, DATE, MONEY)\n",
    "   - Used POS tagging to filter and analyze tokens\n",
    "   - Added toggle for advanced features\n",
    "\n",
    "2. **Sentiment Analysis**:\n",
    "   - TextBlob polarity and subjectivity scores\n",
    "   - VADER sentiment analysis (compound, positive, negative, neutral)\n",
    "\n",
    "3. **Readability Scores**:\n",
    "   - Flesch Reading Ease\n",
    "   - Flesch-Kincaid Grade Level\n",
    "   - Gunning Fog Index\n",
    "   - SMOG Index\n",
    "   - Automated Readability Index\n",
    "   - Coleman-Liau Index\n",
    "   - Linsear Write Formula\n",
    "   - Dale-Chall Readability Score\n",
    "\n",
    "4. **Linguistic Features**:\n",
    "   - POS tag counts (NOUN, VERB, ADJ, ADV, PROPN)\n",
    "   - Named entity counts\n",
    "   - Dependency analysis\n",
    "\n",
    "5. **Enhanced Model Training**:\n",
    "   - Combined TF-IDF with additional features\n",
    "   - Feature scaling for numerical features\n",
    "   - Support for multiple model types (Logistic Regression, Random Forest)\n",
    "   - Comprehensive feature importance analysis\n",
    "\n",
    "### 🚀 Next Steps for Phase 2\n",
    "\n",
    "To improve the model further, consider:\n",
    "\n",
    "1. **Advanced NLP Models**:\n",
    "   - Implement BERT or other transformer models\n",
    "   - Use pre-trained language models for better text understanding\n",
    "   - Experiment with sentence transformers\n",
    "\n",
    "2. **Feature Engineering**:\n",
    "   - Add source credibility features\n",
    "   - Include temporal features (publication date analysis)\n",
    "   - Extract URL and domain features\n",
    "   - Add image analysis features\n",
    "\n",
    "3. **Ensemble Methods**:\n",
    "   - Combine multiple models (voting, stacking)\n",
    "   - Use different feature subsets for different models\n",
    "   - Implement model selection strategies\n",
    "\n",
    "4. **Advanced Evaluation**:\n",
    "   - Cross-validation with different metrics\n",
    "   - Bias analysis and fairness evaluation\n",
    "   - Interpretability analysis (SHAP, LIME)\n",
    "\n",
    "5. **Production Deployment**:\n",
    "   - API development with FastAPI\n",
    "   - Model versioning and monitoring\n",
    "   - Real-time prediction pipeline\n",
    "\n",
    "### 📁 Project Structure\n",
    "\n",
    "```\n",
    "fake-news-detection-nlp-ml/\n",
    "├── data/                 # Dataset files\n",
    "├── notebooks/           # Jupyter notebooks\n",
    "├── src/                # Source code modules\n",
    "│   ├── preprocessing.py # Enhanced with Phase 1 features\n",
    "│   ├── model.py        # Enhanced model training\n",
    "│   └── ...\n",
    "├── models/             # Trained models\n",
    "├── outputs/            # Generated plots and reports\n",
    "├── README.md           # Project documentation\n",
    "└── requirements.txt    # Updated dependencies\n",
    "```\n",
    "\n",
    "---\n",
    "\n",
    "**🎉 Congratulations!** You've successfully implemented Phase 1 improvements for the fake news detection system!\n",
    "\n",
    "This enhanced project demonstrates:\n",
    "- ✅ Advanced NLP preprocessing with spaCy\n",
    "- ✅ Comprehensive feature engineering\n",
    "- ✅ Sentiment and readability analysis\n",
    "- ✅ Enhanced model training pipeline\n",
    "- ✅ Professional code structure\n",
    "- ✅ Production-ready implementation\n",
    "\n",
    "**📧 Contact**: [Your Name] - [Your Email]\n",
    "**🔗 GitHub**: [Your GitHub Profile]\n",
    "**📚 Portfolio**: [Your Portfolio Website]"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}