# N-gram Comparison: Fake vs Real News

This short notebook runs a simple, reproducible comparison of the most frequent terms
(unigrams) in the `Fake.csv` and `True.csv` datasets included in `data/`. The side-by-side
bar charts below highlight differences in vocabulary between the two datasets, which makes
this a quick exploratory data analysis (EDA) step for this repository.

In [None]:
# Standard imports
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
import seaborn as sns
# set_theme is the current name for the older sns.set alias (seaborn >= 0.11)
sns.set_theme(style="whitegrid")

In [None]:
# Load the datasets (these CSVs are included in `data/`)
fake = pd.read_csv('data/Fake.csv')
real = pd.read_csv('data/True.csv')
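# Ensure the text column is string-typed; fill missing values with empty
# strings first so NaN does not become the literal token 'nan'
fake_texts = fake['text'].fillna('').astype(str).tolist()
real_texts = real['text'].fillna('').astype(str).tolist()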

In [None]:
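def top_ngrams(corpus, n=15, ngram_range=(1,1)):
    """Return the n most frequent n-grams in corpus as (term, count) pairs."""
    # Count n-grams per document, dropping English stop words
    vec = CountVectorizer(ngram_range=ngram_range, stop_words='english')
    X = vec.fit_transform(corpus)
    # Column sums give each term's total count across the whole corpus
    sums = X.sum(axis=0)
    terms = vec.get_feature_names_out()
    freqs = [(terms[i], int(sums[0, i])) for i in range(len(terms))]
    return sorted(freqs, key=lambda x: x[1], reverse=True)[:n]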

top_fake = top_ngrams(fake_texts, n=15, ngram_range=(1,1))
top_real = top_ngrams(real_texts, n=15, ngram_range=(1,1))

In [None]:
# Convert to DataFrame for plotting
df_fake = pd.DataFrame(top_fake, columns=['term','count'])
df_real = pd.DataFrame(top_real, columns=['term','count'])

fig, axes = plt.subplots(1,2, figsize=(14,6))
sns.barplot(x='count', y='term', data=df_fake, ax=axes[0], palette='Reds_d')
axes[0].set_title('Top terms — Fake')
sns.barplot(x='count', y='term', data=df_real, ax=axes[1], palette='Blues_d')
axes[1].set_title('Top terms — Real')
plt.tight_layout()
plt.show()

### Notes
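- This is a compact, reproducible analysis: it uses only the CSVs in `data/` and common
  Python libraries (pandas, scikit-learn, matplotlib, seaborn).
- You can extend it to bigrams (`ngram_range=(2,2)`) or compare relative frequencies
  (counts normalized by the number of documents in each dataset, so that dataset size does
  not skew the comparison) to surface phrases that discriminate between fake and real news.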
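
The relative-frequency comparison suggested in the notes could be sketched roughly as follows. This is a minimal illustration, not part of the original analysis: the helper names `relative_frequencies` and `frequency_gap` are hypothetical, and the toy counts are made up so the example runs without the CSVs. It assumes the input is a list of `(term, count)` pairs in the shape returned by `top_ngrams`.

```python
def relative_frequencies(term_counts, n_docs):
    """Convert (term, count) pairs into occurrences per document."""
    return {term: count / n_docs for term, count in term_counts}


def frequency_gap(fake_counts, n_fake, real_counts, n_real):
    """Rank terms by per-document frequency in fake minus real.

    Terms missing from one list are treated as having frequency 0 there,
    which is only an approximation when the lists are truncated top-n lists.
    """
    fake_rel = relative_frequencies(fake_counts, n_fake)
    real_rel = relative_frequencies(real_counts, n_real)
    terms = set(fake_rel) | set(real_rel)
    gaps = {t: fake_rel.get(t, 0.0) - real_rel.get(t, 0.0) for t in terms}
    # Most "fake-leaning" terms first, most "real-leaning" terms last
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)


# Toy counts for illustration only (NOT taken from Fake.csv / True.csv)
toy_fake = [("trump", 40), ("said", 10), ("video", 20)]
toy_real = [("said", 60), ("trump", 30), ("reuters", 20)]

# Pretend the fake set has 10 documents and the real set has 20
ranked = frequency_gap(toy_fake, 10, toy_real, 20)
for term, gap in ranked:
    print(f"{term:10s} {gap:+.2f}")
```

With the real data, one would presumably call `frequency_gap(top_fake, len(fake_texts), top_real, len(real_texts))`, ideally after requesting a much larger `n` from `top_ngrams` so that fewer terms fall back to the zero default. The same helper works unchanged on bigram counts, since it only depends on the `(term, count)` shape.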