#### Class Distribution Analysis

Before training the model, it is important to examine the distribution of the target classes. An imbalanced dataset (e.g., 90% FAKE and 10% TRUE) could bias the model toward the majority class, resulting in misleading accuracy and poor generalization. Checking the balance is therefore an important step toward fair and reliable predictions.

1. Raw Class Counts


In [1]:
# Assumes `data` is the DataFrame of articles with a 'label' column
data['label'].value_counts()



The dataset contains 23,481 fake news articles and 21,417 true news articles, for a total of 44,898 articles.

FAKE : 23,481  (~52.3%)
TRUE : 21,417  (~47.7%)
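The percentages follow directly from the two counts. As a quick sanity check, here is a minimal sketch that uses only the figures quoted above (the `counts` dictionary is typed in by hand, not read from the dataset):

```python
# Counts quoted in the text above (entered manually for this check).
counts = {"FAKE": 23_481, "TRUE": 21_417}

total = sum(counts.values())
# Share of each class, as a percentage rounded to one decimal place.
percentages = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(f"Total articles: {total:,}")
for label, pct in percentages.items():
    print(f"{label}: {counts[label]:,} ({pct}%)")
```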

2. Percentage Distribution

In [2]:
data['label'].value_counts(normalize=True) * 100





When converted into percentages, the class proportions are:

FAKE → 52.3%

TRUE → 47.7%

This shows that the dataset is almost evenly balanced, with only a slight skew toward fake articles.


3. Visualization


In [3]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.countplot(x='label', data=data, palette='Set2')
plt.title("Class Distribution of Fake vs True News")
plt.show()




A bar plot of the distribution highlights the balance:

Both bars (FAKE vs TRUE) are nearly equal in height.

There is no severe dominance of one class over the other.


4. Why Class Balance Matters

The class distribution is critical for machine learning classification tasks:

If the dataset were heavily imbalanced (e.g., 90% TRUE vs. 10% FAKE), a model could achieve 90% accuracy by always predicting TRUE, but such a model would be useless in practice.

Here, with a ~52/48 split, always predicting the majority class (FAKE) would score about 52.3% accuracy, and random guessing about 50%, so our models must perform significantly better than this baseline to be useful.
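The baseline argument can be made concrete by computing the accuracy of a classifier that always predicts the most frequent class. This is a minimal sketch: `majority_baseline_accuracy` is a hypothetical helper, and the 1,000-article total for the 90/10 case is assumed purely for illustration.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total

# Hypothetical 90/10 split from the example above (1,000 articles assumed).
imbalanced = {"TRUE": 900, "FAKE": 100}
# Counts reported for this dataset.
observed = {"FAKE": 23_481, "TRUE": 21_417}

imbalanced_acc = majority_baseline_accuracy(imbalanced)
observed_acc = majority_baseline_accuracy(observed)

print(f"90/10 split baseline:  {imbalanced_acc:.1%}")
print(f"This dataset baseline: {observed_acc:.1%}")
```

Any trained model should be compared against this trivial baseline rather than against 0%.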

5. Implications for Modeling

Since the dataset is balanced, we can proceed with standard train/test splitting without needing resampling techniques such as SMOTE (Synthetic Minority Oversampling Technique) or undersampling.

Evaluation metrics such as Accuracy, Precision, Recall, and F1-score will be meaningful, because neither class is underrepresented.

Models won’t need special handling like class_weight='balanced' at this stage, though testing it may still be worthwhile later.
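Even with balanced classes, it can be useful to keep the class proportions identical in the train and test sets. The sketch below illustrates the idea of a stratified split using only the standard library. `stratified_split` and `fake_share` are hypothetical helpers, the 80/20 ratio and the seed are assumptions, and the label list is rebuilt from the reported counts rather than loaded from the dataset.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), sampling test_size of each class."""
    rng = random.Random(seed)
    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = int(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Label list rebuilt from the counts reported above (article text omitted).
labels = ["FAKE"] * 23_481 + ["TRUE"] * 21_417
train_idx, test_idx = stratified_split(labels)

def fake_share(indices):
    """Fraction of the given rows that are labeled FAKE."""
    return sum(labels[i] == "FAKE" for i in indices) / len(indices)

print(f"train: {len(train_idx):,} rows, FAKE share {fake_share(train_idx):.3f}")
print(f"test:  {len(test_idx):,} rows, FAKE share {fake_share(test_idx):.3f}")
```

In practice, scikit-learn's `train_test_split` accepts a `stratify` argument that serves the same purpose, so a hand-written helper like this is mainly useful for seeing what stratification does.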



✅ Conclusion

The Fake News dataset provides a nearly balanced distribution of FAKE and TRUE articles. This balance means that classification models trained on this data are unlikely to favor one label purely because of class frequency, which allows for a fair evaluation of algorithm performance.

The slight overrepresentation of fake articles (~52%) may even be beneficial, since in real-world applications catching fake news (recall for FAKE) is often more important than avoiding the misclassification of a small portion of true news.