
### **Step-by-Step Workflow to Apply DANES Methodology on `facebook-fact-check.csv` Dataset**

---

## **📌 Step 1: Install Required Libraries**

---

---
## **📌 Step 2: Load and Explore the Dataset**
Start by loading and understanding your dataset.

In [3]:
import pandas as pd

# Load dataset
df = pd.read_csv("/content/facebook-fact-check.csv", encoding='latin1')

# Check initial shape and missing values
print("Initial shape:", df.shape)  # (2282, 13) for this file
print("\nMissing values:\n", df.isnull().sum())

Initial shape: (2282, 13)

Missing values:
 account_id           0
post_id              0
Category             0
Page                 0
Post URL             0
Date Published       0
Post Type            0
Rating               0
Debate            1984
share_count         70
reaction_count       2
comment_count        2
Context Post         0
dtype: int64


### 🔹 **Check for Missing Values**
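To judge how much these gaps matter, the raw counts can be turned into proportions. The sketch below uses only the numbers printed in the output above (2282 rows and the four non-zero counts); with the DataFrame loaded, `df.isnull().mean()` should give the same ratios.

```python
# Missing-value counts copied from the printed output; 2282 is the row count.
n_rows = 2282
missing_counts = {
    "Debate": 1984,
    "share_count": 70,
    "reaction_count": 2,
    "comment_count": 2,
}

# Express each count as a share of all rows.
missing_share = {col: count / n_rows for col, count in missing_counts.items()}

for col, share in missing_share.items():
    print(f"{col:<15} {share:6.1%}")
```

`Debate` is empty in most rows, so it is a poor candidate for imputation, while the three count columns lose only a small fraction of their values.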
---

---
## **📌 Step 3: Define Target Variable**
- If `Rating` column contains fact-checking labels, convert it into a binary/multi-class target variable.
---

In [5]:
# Check unique ratings
print("Unique Ratings:", df['Rating'].unique())

# Multi-class target: LabelEncoder sorts the rating strings alphabetically
# and maps each one to an integer class id (0-3 for the four ratings here)
from sklearn.preprocessing import LabelEncoder
df['label'] = LabelEncoder().fit_transform(df['Rating'])

Unique Ratings: ['no factual content' 'mostly true' 'mixture of true and false'
 'mostly false']


---

## **📌 Step 4: Preprocess Text Data (Text Branch)**
Since DANES uses deep learning models, we need to **clean, tokenize, and embed** the text.

In [7]:
!pip install gensim


In [8]:
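import re
import nltk
import numpy as np  # needed for the embedding matrix below
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases need this for word_tokenize

# Build the stopword set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)      # strip URLs
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # keep letters only
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return tokens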

df['processed_text'] = df['Context Post'].apply(preprocess_text)

# Train Word2Vec model
w2v_model = Word2Vec(df['processed_text'], vector_size=100, window=5, min_count=1, workers=4)

# Tokenize and pad sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['processed_text'].apply(lambda x: ' '.join(x)))
sequences = tokenizer.texts_to_sequences(df['processed_text'].apply(lambda x: ' '.join(x)))
max_len = 100  # Adjust based on data analysis
X_text = pad_sequences(sequences, maxlen=max_len, padding='post')

# Create embedding matrix
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]

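ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

This error usually means that installing gensim replaced NumPy (here with 1.26.4) while packages compiled against a different NumPy version were already loaded. Restarting the runtime after the `pip install` and re-running the cells normally clears it.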

### 🔹 **Convert Text to Word Embeddings**
Use **Word2Vec, FastText, or GloVe** to obtain word embeddings.
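Whichever embedding source is used, the network needs a matrix whose row `i` holds the vector of the word with tokenizer index `i`. The sketch below builds such a matrix from a plain dictionary of vectors; the helper name `build_embedding_matrix` and the toy vocabulary are assumptions for illustration, not part of any library.

```python
import numpy as np

def build_embedding_matrix(word_index, vectors, dim):
    """Return a (vocab_size, dim) matrix; row i holds the vector of the word with index i."""
    # Row 0 stays zero because the Keras Tokenizer reserves index 0 for padding.
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:          # out-of-vocabulary words keep a zero row
            matrix[idx] = vec
    return matrix

# Toy inputs (hypothetical): a Tokenizer-style index and 4-dimensional vectors.
word_index = {"news": 1, "fake": 2, "share": 3}
vectors = {"news": np.ones(4), "fake": np.full(4, 0.5)}

embedding_matrix = build_embedding_matrix(word_index, vectors, dim=4)
print(embedding_matrix.shape)
```

Words missing from the vector source ("share" in this toy case) end up as zero rows, which is the same fallback the notebook's own loop uses.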

---


---

## **📌 Step 5: Prepare Social Context Features (Social Branch)**
We normalize numerical social features (`share_count`, `reaction_count`, `comment_count`).
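One detail worth keeping in mind is that the imputation and scaling statistics should come from the training rows only, otherwise information from the test set leaks into training. The following is a minimal NumPy sketch of that idea, assuming median imputation and z-score scaling; the function names and toy numbers are made up for illustration.

```python
import numpy as np

def fit_social_stats(train):
    """Column medians, means and standard deviations from training rows only."""
    median = np.nanmedian(train, axis=0)
    filled = np.where(np.isnan(train), median, train)
    std = filled.std(axis=0)
    std[std == 0] = 1.0              # guard against constant columns
    return median, filled.mean(axis=0), std

def transform_social(x, median, mean, std):
    """Impute NaNs with the training median, then standardise."""
    filled = np.where(np.isnan(x), median, x)
    return (filled - mean) / std

# Toy rows (made up): share_count, reaction_count, comment_count.
train = np.array([[10.0, 100.0, 5.0], [np.nan, 300.0, 15.0], [30.0, 200.0, 10.0]])
test = np.array([[20.0, np.nan, 10.0]])

stats = fit_social_stats(train)
X_social_train = transform_social(train, *stats)
X_social_test = transform_social(test, *stats)
print(X_social_test)
```

With scikit-learn the same effect is obtained by calling `fit_transform` on the training split and `transform` on the test split.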

---

---
## **📌 Step 6: Train-Test Split**

---

---
## **📌 Step 7: Build the DANES Model (Deep Learning)**
 Create the **Text Branch (LSTM)** and **Social Branch (MLP/CNN)** and combine them.
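The combination step is a plain concatenation of the two branch outputs followed by a dense layer. The NumPy sketch below only illustrates the shapes involved; the branch outputs are random stand-ins, and the sizes 64 and 32 are assumptions chosen to resemble a small LSTM and a small dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 4

# Stand-ins for branch outputs (random here): 64 text features, 32 social features.
text_features = rng.normal(size=(batch, 64))
social_features = rng.normal(size=(batch, 32))

# Fusion: concatenate along the feature axis, then one dense unit with a sigmoid.
combined = np.concatenate([text_features, social_features], axis=1)   # (4, 96)
weights = rng.normal(scale=0.1, size=(combined.shape[1], 1))
bias = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(combined @ weights + bias)))             # (4, 1)

print(combined.shape, probs.shape)
```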

---

---
## **📌 Step 8: Train the Model**

---

## **📌 Step 9: Evaluate the Model**

---

## **📌 Step 10: Save the Model for Future Use**

---

Let's proceed with implementing Steps 3 to 10:

---

## **📌 Step 3: Define Target Variable**
Convert the `Rating` column into a binary target variable.

```python
# Check unique ratings
print("Unique Ratings:", df['Rating'].unique())

# Example mapping (adjust based on your data)
# Binary classification: 'mostly true' vs. all other ratings
# (the dataset has no rating that is literally 'True')
df['label'] = df['Rating'].apply(lambda x: 1 if x == 'mostly true' else 0)

# For multi-class, use LabelEncoder instead:
# from sklearn.preprocessing import LabelEncoder
# df['label'] = LabelEncoder().fit_transform(df['Rating'])
```
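After any binary mapping it is worth checking how balanced the two classes are, since a skewed target makes plain accuracy misleading. The sketch below assumes `'mostly true'` is the positive class and uses a made-up sample of ratings; the weighting formula follows the inverse-frequency heuristic that scikit-learn calls `'balanced'`.

```python
from collections import Counter

# Hypothetical sample; the four strings are the ratings found in the dataset.
ratings = (["mostly true"] * 6 + ["no factual content"] * 2
           + ["mixture of true and false"] * 1 + ["mostly false"] * 1)

# Binary target: 1 for 'mostly true', 0 for every other rating.
labels = [1 if r == "mostly true" else 0 for r in ratings]

counts = Counter(labels)
total = len(labels)
# Inverse-frequency weights: n_samples / (n_classes * class_count).
class_weight = {cls: total / (len(counts) * n) for cls, n in counts.items()}

print(counts)
print(class_weight)
```

A dictionary of this form can be passed to Keras through the `class_weight` argument of `model.fit`.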

---

## **📌 Step 4: Preprocess Text Data**
Clean the text and convert it into embeddings.

```python
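import re
import nltk
import numpy as np  # needed for the embedding matrix below
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases need this for word_tokenize

# Build the stopword set once; calling stopwords.words() for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)      # strip URLs
    text = re.sub(r'[^a-zA-Z]', ' ', text)   # keep letters only
    tokens = word_tokenize(text)
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return tokens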

df['processed_text'] = df['Context Post'].apply(preprocess_text)

# Train Word2Vec model
w2v_model = Word2Vec(df['processed_text'], vector_size=100, window=5, min_count=1, workers=4)

# Tokenize and pad sequences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['processed_text'].apply(lambda x: ' '.join(x)))
sequences = tokenizer.texts_to_sequences(df['processed_text'].apply(lambda x: ' '.join(x)))
max_len = 100  # Adjust based on data analysis
X_text = pad_sequences(sequences, maxlen=max_len, padding='post')

# Create embedding matrix
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[word]
```
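Note that `padding='post'` only controls where zeros are added. Truncation is a separate setting whose Keras default is `'pre'`, so posts longer than `max_len` lose their beginning, not their end. The pure-Python sketch below mimics that behaviour for a single sequence; the helper name is hypothetical.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') with the default truncating='pre'."""
    if len(seq) > maxlen:
        return seq[-maxlen:]                       # keeps the LAST maxlen tokens
    return seq + [value] * (maxlen - len(seq))     # zeros appended at the end

short_seq = [4, 9, 2]
long_seq = [1, 2, 3, 4, 5, 6]

print(pad_post_truncate_pre(short_seq, maxlen=5))  # [4, 9, 2, 0, 0]
print(pad_post_truncate_pre(long_seq, maxlen=5))   # [2, 3, 4, 5, 6]
```

If the opening of a post matters more than its end, passing `truncating='post'` to `pad_sequences` keeps the first `max_len` tokens instead.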

---

## **📌 Step 5: Prepare Social Context Features**
Handle missing values and scale numerical features.

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Impute missing values
imputer = SimpleImputer(strategy='median')
social_features = imputer.fit_transform(df[['share_count', 'reaction_count', 'comment_count']])

# Scale features
scaler = StandardScaler()
X_social = scaler.fit_transform(social_features)
```

---

## **📌 Step 6: Train-Test Split**
Split the dataset into training and testing sets.

```python
from sklearn.model_selection import train_test_split

X_text_train, X_text_test, X_social_train, X_social_test, y_train, y_test = train_test_split(
    X_text, X_social, df['label'], test_size=0.2, stratify=df['label'], random_state=42)
```

---

## **📌 Step 7: Build the DANES Model**
Create a dual-input neural network.

```python
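import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate

# Text branch: frozen Word2Vec embeddings followed by an LSTM
text_input = Input(shape=(max_len,))
x = Embedding(
    vocab_size, 100,
    # A Constant initializer works across Keras versions; some Keras 3 releases reject weights=[...]
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,
)(text_input)
x = LSTM(64)(x)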

# Social branch
social_input = Input(shape=(3,))
y = Dense(32, activation='relu')(social_input)

# Combine branches
combined = Concatenate()([x, y])
z = Dense(64, activation='relu')(combined)
output = Dense(1, activation='sigmoid')(z)

model = Model(inputs=[text_input, social_input], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```

---

## **📌 Step 8: Train the Model**
Train the model on the training data.

```python
history = model.fit(
    [X_text_train, X_social_train], y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)
```
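Training for a fixed 10 epochs can overfit or stop too early. The final cell of this notebook uses `EarlyStopping` with a patience value instead. The sketch below is a simplified pure-Python version of that stopping rule (it ignores options such as `min_delta`), with made-up validation losses.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch          # patience exhausted
    return len(val_losses) - 1, best_epoch    # ran to the final epoch

val_losses = [0.70, 0.62, 0.58, 0.60, 0.59, 0.61, 0.63]
stop, best = early_stopping_epoch(val_losses, patience=3)
print(stop, best)
```

Training stops once `patience` consecutive epochs fail to beat the best validation loss; with `restore_best_weights=True`, Keras then rolls the model back to the weights from `best`.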

---

## **📌 Step 9: Evaluate the Model**
Assess performance on the test set.

```python
# Evaluate accuracy
loss, accuracy = model.evaluate([X_text_test, X_social_test], y_test, verbose=0)
print(f'Test Accuracy: {accuracy * 100:.2f}%')

# Generate classification report
from sklearn.metrics import classification_report
y_pred = (model.predict([X_text_test, X_social_test]) > 0.5).astype(int).ravel()  # flatten (n, 1) to (n,)
print(classification_report(y_test, y_pred))
```
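The positive-class figures in the classification report come down to three counts: true positives, false positives, and false negatives. A minimal pure-Python sketch of that arithmetic, using made-up labels and sigmoid outputs and a hypothetical helper name:

```python
def binary_report(y_true, probs, threshold=0.5):
    """Precision, recall and F1 for the positive class at a given threshold."""
    y_pred = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels and sigmoid outputs for illustration.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.6, 0.4, 0.8, 0.2, 0.1]

precision, recall, f1 = binary_report(y_true, probs)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Changing `threshold` trades precision against recall, which can matter more than raw accuracy when the classes are imbalanced.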

---

## **📌 Step 10: Save the Model**
Save the trained model for future use.

```python
model.save('danes_model.h5')
```
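The saved network alone is not enough for inference: new posts must be tokenized, padded, and scaled exactly as during training. The sketch below stores those preprocessing pieces as JSON, assuming the values come from `tokenizer.word_index`, `scaler.mean_`, and `scaler.scale_`; the function names, file name, and toy values are hypothetical.

```python
import json
import os
import tempfile

def save_preprocessing(path, word_index, social_mean, social_std, max_len):
    """Write the pieces needed to rebuild the model inputs at inference time."""
    payload = {
        "word_index": word_index,
        "social_mean": list(social_mean),
        "social_std": list(social_std),
        "max_len": max_len,
    }
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)

def load_preprocessing(path):
    """Read the preprocessing settings back as a dictionary."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)

# Toy values (hypothetical) written to a temporary directory.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "preprocessing.json")
save_preprocessing(path, {"news": 1, "fake": 2}, [12.5, 300.0, 40.0], [3.0, 90.0, 11.0], 100)
restored = load_preprocessing(path)
print(restored["word_index"], restored["max_len"])
```

JSON keeps the artefacts readable and independent of library versions; pickling the fitted objects with `joblib`, as the final cell of this notebook does, is the shorter alternative.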

---

This completes the implementation of the DANES framework on your dataset. Adjust hyperparameters (epochs, embedding dimensions, etc.) based on your specific requirements.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from gensim.models import Word2Vec

# Step 1: Install Required Libraries
# Note: Ensure these are installed via pip or conda
# pip install pandas numpy matplotlib seaborn scikit-learn tensorflow gensim

# Step 2: Load and Explore the Dataset
def load_and_explore_dataset(filepath):
    # Load dataset with latin1 encoding
    df = pd.read_csv(filepath, encoding='latin1')

    # Initial dataset information
    print("Initial Dataset Shape:", df.shape)
    print("\nColumn Names:", list(df.columns))

    # Missing values analysis
    print("\nMissing Values:\n", df.isnull().sum())

    # Visualize missing values
    plt.figure(figsize=(10, 6))
    sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
    plt.title('Missing Values Heatmap')
    plt.tight_layout()
    plt.show()

    # Basic dataset statistics
    print("\nDataset Statistics:")
    print(df.describe())

    return df

# Example usage
filepath = "/content/facebook-fact-check.csv"
df = load_and_explore_dataset(filepath)

RuntimeError: empty_like method already has a different docstring

In [None]:
# Step 8: Train the Model
def train_danes_model(model, X_text_train, X_social_train, y_train, validation_split=0.2, epochs=50, batch_size=32):
    early_stopping = tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=10,
        restore_best_weights=True
    )

    history = model.fit(
        [X_text_train, X_social_train],
        y_train,
        validation_split=validation_split,
        epochs=epochs,
        batch_size=batch_size,
        callbacks=[early_stopping]
    )

    return history

# Train the model
history = train_danes_model(danes_model, X_text_train, X_social_train, y_train)

# Step 9: Evaluate the Model
def evaluate_model(model, X_text_test, X_social_test, y_test):
    # Model evaluation
    test_loss, test_accuracy = model.evaluate(
        [X_text_test, X_social_test],
        y_test
    )

    print(f"Test Loss: {test_loss}")
    print(f"Test Accuracy: {test_accuracy}")

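    # Detailed Classification Report
    y_pred = model.predict([X_text_test, X_social_test])
    if y_pred.shape[1] == 1:
        # Single sigmoid unit: threshold the probability (argmax would always return 0)
        y_pred_classes = (y_pred > 0.5).astype(int).ravel()
    else:
        # Softmax output: pick the most probable class
        y_pred_classes = np.argmax(y_pred, axis=1)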

    from sklearn.metrics import classification_report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred_classes))

    # Visualization of training history
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Training Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.title('Model Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Training Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title('Model Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.legend()
    plt.tight_layout()
    plt.show()

# Evaluate the model
evaluate_model(danes_model, X_text_test, X_social_test, y_test)

# Step 10: Save the Model for Future Use
def save_danes_model(model, tokenizer, label_encoder, scaler, base_path='danes_model/'):
    import os

    # Create directory if it doesn't exist
    os.makedirs(base_path, exist_ok=True)

    # Save model
    model.save(os.path.join(base_path, 'model.h5'))

    # Save additional components
    import joblib
    joblib.dump(tokenizer, os.path.join(base_path, 'tokenizer.pkl'))
    joblib.dump(label_encoder, os.path.join(base_path, 'label_encoder.pkl'))
    joblib.dump(scaler, os.path.join(base_path, 'social_scaler.pkl'))

    print("Model and supporting files saved successfully!")

# Save the model
save_danes_model(danes_model, tokenizer, label_encoder, scaler)