"""


Introduction
Processing text is a core skill for Data Scientists and AI Engineers.
In this project, you will put that skill into practice by identifying whether a news headline is real or fake.

Project Overview
In the file dataset/data.csv, you will find a dataset containing news articles with the following columns:

- label: 0 if the news is fake, 1 if the news is real.
- title: The headline of the news article.
- text: The full content of the article.
- subject: The category or topic of the news.
- date: The publication date of the article.

Your goal is to build a classifier that can distinguish between fake and real news.

Once you have built a classifier, use it to predict the labels for dataset/validation_data.csv. Generate a new file in which the placeholder label 2 has been replaced by 0 (fake) or 1 (real) according to your model. Keep the original file format: do not add extra columns, and use the same column separator.

Be sure to split data.csv into training and test datasets before using it for model training or evaluation.

Guidance
As in a real-life scenario, you are free to make your own choices about how to treat the text. Use the techniques you have learned and common packages to process the data and classify the text.

Deliverables
- Python Code: Provide well-documented Python code that conducts the analysis.
- Predictions: A CSV file in the same format as validation_data.csv but with the predicted labels (0 or 1).
- Accuracy estimation: Provide the teacher with your estimation of how your model will perform.
- Presentation: You will present your model in a 10-minute presentation. Your teacher will provide further instructions.


"""

In [14]:
# 1) Load & quick sanity checks
# Goal: check head, confirm columns, size, nulls, class balance.
import pandas as pd
data_train = pd.read_csv('dataset/data.csv')
data_validation = pd.read_csv('dataset/validation_data.csv')
print(data_train.head())
print(data_train.columns)
print(data_train.shape)
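print(data_train.isnull().sum())
# class balance: share of fake (0) vs. real (1) labels
print(data_train['label'].value_counts(normalize=True))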

   label                                              title  \
0      1  As U.S. budget fight looms, Republicans flip t...   
1      1  U.S. military to accept transgender recruits o...   
2      1  Senior U.S. Republican senator: 'Let Mr. Muell...   
3      1  FBI Russia probe helped by Australian diplomat...   
4      1  Trump wants Postal Service to charge 'much mor...   

                                                text       subject  \
0  WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1  WASHINGTON (Reuters) - Transgender people will...  politicsNews   
2  WASHINGTON (Reuters) - The special counsel inv...  politicsNews   
3  WASHINGTON (Reuters) - Trump campaign adviser ...  politicsNews   
4  SEATTLE/WASHINGTON (Reuters) - President Donal...  politicsNews   

                 date  
0  December 31, 2017   
1  December 29, 2017   
2  December 31, 2017   
3  December 30, 2017   
4  December 29, 2017   
Index(['label', 'title', 'text', 'subject', 'date'], dty

In [16]:
# 2) Train/test split (from data.csv)
# Use stratified split on label to preserve class balance.
from sklearn.model_selection import train_test_split
# features we’ll use
X = data_train[['title','text','subject','date']]
# target variable
y = data_train['label'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(31953, 4) (7989, 4) (31953,) (7989,)


In [20]:
# 3) Choose features (text) & minimal preprocessing.
# Easiest strong baseline: TF-IDF combining title + text.
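# assign() returns new DataFrames instead of writing into the split slices in place,
# which avoids a possible pandas SettingWithCopyWarning.
X_train = X_train.assign(combined_text=X_train['title'].fillna('') + ' ' + X_train['text'].fillna(''))
X_test = X_test.assign(combined_text=X_test['title'].fillna('') + ' ' + X_test['text'].fillna(''))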

In [21]:
# 4) Vectorizer + model in a Pipeline
# Start with TfidfVectorizer + LogisticRegression (fast, strong baseline). You can swap in LinearSVC or MultinomialNB later.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    # convert raw text into numerical TF-IDF features
    ("tfidf", TfidfVectorizer(
        # lowercase all text
        lowercase=True,
        # remove common English words such as "the", "is", "and"
        stop_words="english",
        # cap the vocabulary at the 100,000 terms with the highest corpus frequency
        max_features=100_000,
        # use both single words (unigrams) and pairs of adjacent words (bigrams)
        ngram_range=(1, 2),
        # ignore terms that appear in fewer than 2 documents
        min_df=2
    )),
    # machine learning model to classify news as fake (0) or real (1)
    ("clf", LogisticRegression(
        # upper limit on solver iterations (the solver stops earlier if it converges)
        max_iter=200,
        n_jobs=None,
        # weight classes inversely to their frequency
        class_weight='balanced'
    ))
])
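
# Sketch for the "swap in LinearSVC or MultinomialNB" idea from step 4.
# build_pipeline and candidates are hypothetical names, not part of the notebook above;
# the TF-IDF settings are copied from the pipeline so that only the classifier changes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def build_pipeline(clf, min_df=2):
    """Hypothetical helper: same TF-IDF settings as step 4, with a pluggable classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            lowercase=True,
            stop_words="english",
            max_features=100_000,
            ngram_range=(1, 2),
            min_df=min_df,
        )),
        ("clf", clf),
    ])


# Candidate classifiers to compare on the same train/test split.
candidates = {
    "logreg": LogisticRegression(max_iter=200, class_weight="balanced"),
    "linear_svc": LinearSVC(class_weight="balanced"),
    "multinomial_nb": MultinomialNB(),
}

# Usage with the notebook's variables (assumes steps 2 and 3 have been run):
# for name, clf in candidates.items():
#     model = build_pipeline(clf).fit(X_train["combined_text"], y_train)
#     print(name, model.score(X_test["combined_text"], y_test))
```

# Comparing candidates on the held-out test split repeatedly can make the final estimate optimistic;
# cross-validation on the training split is a safer way to pick among them.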


In [22]:
# 5) Fit & evaluate (baseline)
# Use accuracy and precision/recall/F1 (binary classification).
pipeline.fit(X_train['combined_text'], y_train)
from sklearn.metrics import classification_report, accuracy_score
y_train_pred = pipeline.predict(X_train['combined_text'])
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))
print(classification_report(y_train, y_train_pred, digits=4))

Train Accuracy: 0.9938033987419022
              precision    recall  f1-score   support

           0     0.9960    0.9915    0.9938     15954
           1     0.9916    0.9961    0.9938     15999

    accuracy                         0.9938     31953
   macro avg     0.9938    0.9938    0.9938     31953
weighted avg     0.9938    0.9938    0.9938     31953



In [27]:
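# 6) Check for overfitting
# Compare train accuracy with accuracy on the held-out test split; a large gap suggests overfitting.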
from sklearn.metrics import accuracy_score, classification_report

y_train_pred = pipeline.predict(X_train['combined_text'])
print("Train Accuracy:", accuracy_score(y_train, y_train_pred))

y_test_pred = pipeline.predict(X_test['combined_text'])
print("Test Accuracy:", accuracy_score(y_test, y_test_pred))
print(classification_report(y_test, y_test_pred, digits=4))

Train Accuracy: 0.9938033987419022
Test Accuracy: 0.9854800350481913
              precision    recall  f1-score   support

           0     0.9889    0.9820    0.9854      3989
           1     0.9821    0.9890    0.9856      4000

    accuracy                         0.9855      7989
   macro avg     0.9855    0.9855    0.9855      7989
weighted avg     0.9855    0.9855    0.9855      7989



In [None]:
# 7) Train on full training data (optional)
# After choosing final settings (the pipeline above, or a tuned variant), you can refit on all of data_train (every labeled row of data.csv) for maximal signal before producing validation predictions.
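# A minimal sketch of the refit, assuming the pipeline and data_train objects from the cells above.
# combine_text and refit_on_full_data are hypothetical helper names. clone() copies the
# hyperparameters without the learned state, so the model evaluated in step 5 stays untouched.

```python
import pandas as pd
from sklearn.base import clone


def combine_text(df):
    """Join title and text into one string per row, treating missing values as empty."""
    return df["title"].fillna("") + " " + df["text"].fillna("")


def refit_on_full_data(pipeline, df):
    """Return a fresh copy of the pipeline trained on every labeled row of df."""
    final_model = clone(pipeline)  # same settings, no fitted state
    final_model.fit(combine_text(df), df["label"].astype(int))
    return final_model


# Usage with the notebook's variables:
# final_model = refit_on_full_data(pipeline, data_train)
```

# Note that the refit model has no held-out data left, so the accuracy to report is still the
# one measured before the refit.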

In [None]:
# 8) Produce predictions for validation_data.csv
# The output file must have the same format as the original (same columns, same separator).
# In validation_data.csv, label=2 is a placeholder; replace it with your predictions (0/1).
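# A minimal sketch of this step, assuming a fitted model and the data_validation frame loaded in step 1.
# fill_validation_labels and the output file name are hypothetical. Only the label column is
# overwritten, so column names and order stay as in the input.

```python
import pandas as pd


def fill_validation_labels(model, validation_df):
    """Return a copy of validation_df with the label column replaced by predictions."""
    combined = validation_df["title"].fillna("") + " " + validation_df["text"].fillna("")
    result = validation_df.copy()  # leave the loaded frame unchanged
    result["label"] = model.predict(combined).astype(int)
    return result  # same columns in the same order; nothing extra is added


# Usage with the notebook's variables:
# predictions = fill_validation_labels(pipeline, data_validation)
# predictions.to_csv("dataset/validation_predictions.csv", index=False)  # hypothetical file name
```

# read_csv and to_csv both default to a comma separator, which matches how the files were loaded in step 1.
# If the original file uses another separator, pass the same sep= value to both calls.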

In [None]:
# 9) Accuracy estimation (what to report)
# Report the test-set metrics from the overfitting check above (or your cross-validation estimates), not the train metrics.
# State: test accuracy, precision/recall/F1 for each class, and any notable error patterns (e.g., satire mistaken as fake).
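# One way to back the reported number with a cross-validation estimate and a count of each error type.
# This is a sketch: cross_validated_accuracy and error_counts are hypothetical helpers, and the
# fold count and seed are arbitrary choices.

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cross_validated_accuracy(model, texts, labels, n_splits=5, seed=42):
    """Return (mean, std) of accuracy over stratified folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()


def error_counts(y_true, y_pred):
    """Count the two error types, with 0 = fake and 1 = real."""
    # Rows are true labels, columns are predicted labels, both in the order [0, 1].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {"fake_predicted_real": int(fp), "real_predicted_fake": int(fn)}


# Usage with the notebook's variables:
# mean_acc, std_acc = cross_validated_accuracy(pipeline, X_train["combined_text"], y_train)
# print(f"CV accuracy: {mean_acc:.4f} +/- {std_acc:.4f}")
# print(error_counts(y_test, y_test_pred))
```

# The spread across folds gives a rough sense of how much the estimate could move on new data.
# It says nothing about headlines from sources or periods that data.csv does not cover.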

In [None]:
# 10) (Optional) Nice upgrades
# Use subject: combine with text via ColumnTransformer.
# Use date: extract year/month; publication period can correlate with the label, so check that any gain holds on the test split.
# Calibrate probabilities (CalibratedClassifierCV) if you want threshold tuning.
# Error analysis: inspect top false positives/negatives to refine preprocessing.
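# A sketch of the first upgrade (subject combined with text via ColumnTransformer).
# build_text_plus_subject_pipeline is a hypothetical helper, and it assumes the frame has the
# combined_text column from step 3 alongside subject.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def build_text_plus_subject_pipeline(min_df=2):
    """Hypothetical helper: TF-IDF on combined_text plus one-hot encoded subject."""
    features = ColumnTransformer([
        # A plain string selects a 1-D column, which is what TfidfVectorizer expects.
        ("text", TfidfVectorizer(stop_words="english", ngram_range=(1, 2), min_df=min_df),
         "combined_text"),
        # A list selects a 2-D frame, which is what OneHotEncoder expects.
        # handle_unknown="ignore" encodes subjects unseen during training as all zeros.
        ("subject", OneHotEncoder(handle_unknown="ignore"), ["subject"]),
    ])
    return Pipeline([
        ("features", features),
        ("clf", LogisticRegression(max_iter=200, class_weight="balanced")),
    ])


# Usage with the notebook's variables (pass the whole frame, not a single column):
# model = build_text_plus_subject_pipeline().fit(X_train, y_train)
# print(model.score(X_test, y_test))
```

# Before relying on subject, it is worth running pd.crosstab(data_train['subject'], data_train['label']).
# If each subject maps almost entirely to one label, the model may learn the source category
# rather than the content, and the test score would overstate how well it generalizes.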