"""


Introduction
Text processing is a core skill for data scientists and AI engineers.
In this project, you will put that skill into practice to identify whether a news headline is real or fake.

Project Overview
In the file dataset/data.csv, you will find a dataset containing news articles with the following columns:

- label: 0 if the news is fake, 1 if the news is real.
- title: The headline of the news article.
- text: The full content of the article.
- subject: The category or topic of the news.
- date: The publication date of the article.

Your goal is to build a classifier that can distinguish between fake and real news.

Once you have built a classifier, use it to predict the labels for dataset/validation_data.csv. Generate a new file in which the label 2 has been replaced by 0 (fake) or 1 (real) according to your model. Keep the original file format: do not add extra columns, and keep the same column separator.

Be sure to split data.csv into training and test datasets before using it for model training or evaluation.

Guidance
As in a real-life scenario, you are free to make your own choices about how to treat the text. Use the techniques you have learned and common packages to process the data and classify the text.

Deliverables
- Python Code: Provide well-documented Python code that conducts the analysis.
- Predictions: A CSV file in the same format as validation_data.csv, with the predicted labels (0 or 1).
- Accuracy estimation: Provide the teacher with your estimate of how well your model will perform.
- Presentation: You will present your model in a 10-minute presentation. Your teacher will provide further instructions.


"""

In [None]:
# 1) Load & quick sanity checks
# Goal: confirm columns, size, nulls, class balance.
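# A minimal sketch of these checks, assuming pandas is available. The helper name
# summarize_news is hypothetical; the label column comes from the project brief.
# The brief does not state the column separator, so the usage line lets pandas
# sniff it; check the resulting columns before trusting the guess.

```python
import pandas as pd


def summarize_news(df: pd.DataFrame) -> dict:
    """Collect the basic facts worth checking before modelling."""
    return {
        "shape": df.shape,
        "columns": list(df.columns),
        # Missing values per column.
        "nulls": df.isna().sum().to_dict(),
        # Share of each label value; a strong imbalance changes how accuracy should be read.
        "class_balance": df["label"].value_counts(normalize=True).to_dict(),
        # Fully identical rows, which can leak between train and test if left in.
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Usage (path from the brief; sep=None with the python engine auto-detects the separator):
# df = pd.read_csv("dataset/data.csv", sep=None, engine="python")
# print(summarize_news(df))
```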

In [None]:
# 2) Train/test split (from data.csv)
# Use stratified split on label to preserve class balance.
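# One way to do the stratified split, assuming scikit-learn. The helper name
# split_news, the 80/20 ratio, and the seed are illustrative choices, not
# requirements from the brief.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_news(df: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Return (train_df, test_df) with the label proportions preserved in both parts."""
    return train_test_split(
        df,
        test_size=test_size,
        stratify=df["label"],  # keep the fake/real ratio the same in train and test
        random_state=seed,     # fixed seed so the split is reproducible
    )


# Usage:
# train_df, test_df = split_news(df)
```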

In [1]:
# 3) Choose features (text) & minimal preprocessing.
# Easiest strong baseline: TF-IDF on title + text. Optionally include subject as a categorical feature later.
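# A sketch of the "title + text" feature, assuming the title and text columns
# described in the brief. The helper name build_text is hypothetical. Missing
# values are treated as empty strings so that no row is dropped.

```python
import pandas as pd


def build_text(df: pd.DataFrame) -> pd.Series:
    """Join title and text into one string per row, treating missing values as empty."""
    title = df["title"].fillna("").astype(str)
    body = df["text"].fillna("").astype(str)
    # strip() removes the stray space left when one of the two parts is empty.
    return (title + " " + body).str.strip()
```

# Apply the same function to the training, test, and validation frames so the
# model always sees identically constructed input.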

In [None]:
# 4) Vectorizer + model in a Pipeline
# Start with TfidfVectorizer + LogisticRegression (fast, strong baseline).
# You can swap in LinearSVC or MultinomialNB later.
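# A possible baseline pipeline, assuming scikit-learn. The function name and the
# vectorizer settings (English stop words, unigrams + bigrams, sublinear TF) are
# illustrative starting points, not tuned values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def make_baseline_pipeline() -> Pipeline:
    """TF-IDF features feeding a logistic regression classifier."""
    return Pipeline(
        steps=[
            (
                "tfidf",
                TfidfVectorizer(
                    lowercase=True,
                    stop_words="english",
                    ngram_range=(1, 2),  # unigrams and bigrams
                    sublinear_tf=True,   # use 1 + log(tf) instead of raw counts
                ),
            ),
            # max_iter raised from the default so the solver converges on sparse text features.
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )
```

# Keeping the vectorizer inside the Pipeline means it is fitted on training data
# only, which avoids leaking test vocabulary into the model. To try another
# classifier, replace the "clf" step with LinearSVC() or MultinomialNB().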

In [None]:
# 5) Fit & evaluate (baseline)
# Use accuracy and precision/recall/F1 (binary classification).
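# A sketch of the evaluation step, assuming a fitted scikit-learn style model
# with a predict method. The helper name evaluate is hypothetical; the class
# names follow the brief (0 = fake, 1 = real).

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


def evaluate(model, X_test, y_test) -> dict:
    """Score a fitted classifier on held-out data."""
    y_pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        # Rows are true labels, columns are predicted labels, both in the order [0, 1].
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
        # Per-class precision, recall, and F1 as a printable text table.
        "report": classification_report(
            y_test, y_pred, labels=[0, 1], target_names=["fake", "real"], zero_division=0
        ),
    }


# Usage:
# results = evaluate(pipe, X_test, y_test)
# print(results["report"])
```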

In [None]:
# 6) (Optional) Quick hyperparameter search
# Keep the grid small so the search stays quick.
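# A sketch of a small search, assuming scikit-learn. The helper name, the grid
# values, and the macro-F1 scoring are illustrative choices. Six combinations
# with 3-fold cross-validation means 18 fits plus one final refit.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def small_grid_search(X_train, y_train, cv: int = 3) -> GridSearchCV:
    """Search a handful of TF-IDF and regularization settings with cross-validation."""
    pipe = Pipeline(
        [("tfidf", TfidfVectorizer()), ("clf", LogisticRegression(max_iter=1000))]
    )
    # Step names prefix the parameter names: "<step>__<parameter>".
    param_grid = {
        "tfidf__ngram_range": [(1, 1), (1, 2)],
        "clf__C": [0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1_macro")
    return search.fit(X_train, y_train)


# Usage:
# search = small_grid_search(X_train, y_train)
# best_model = search.best_estimator_   # already refitted on all of X_train
# print(search.best_params_, search.best_score_)
```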

In [None]:
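# 7) Refit on all labelled data (optional)
# After choosing final settings (pipe or best_model), you can refit on all of data.csv (train + test) so the model sees every labelled example before producing validation predictions.
# Report the metrics measured before this refit; the test rows are no longer unseen afterwards.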
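# A sketch of the refit, assuming a scikit-learn estimator. The helper name
# refit_on_all is hypothetical. clone copies the chosen settings without the
# fitted state, so the evaluated model stays untouched.

```python
from sklearn.base import clone


def refit_on_all(model, X_all, y_all):
    """Return a fresh copy of `model`, with the same settings, trained on every labelled row."""
    final_model = clone(model)  # same hyperparameters, no learned weights
    return final_model.fit(X_all, y_all)


# Usage:
# final_model = refit_on_all(best_model, X_all, y_all)
```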

In [None]:
# 8) Produce predictions for validation_data.csv
# validation_data.csv uses label=2 as a placeholder.
# Keep the original file format and replace each 2 with your prediction (0/1).
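# A sketch of the prediction step, assuming the validation file has the same
# columns as data.csv and that the model was trained on title and text joined
# with a space. The helper name, the SEP variable, and the output file name in
# the usage lines are hypothetical.

```python
import numpy as np
import pandas as pd


def fill_validation_labels(
    val_df: pd.DataFrame, model, text_cols=("title", "text")
) -> pd.DataFrame:
    """Return a copy of val_df whose label column holds the model's 0/1 predictions."""
    out = val_df.copy()
    # Build the model input as joined text columns, with missing values as "".
    texts = out[list(text_cols)].fillna("").astype(str).agg(" ".join, axis=1).str.strip()
    preds = np.asarray(model.predict(texts)).astype(int)
    if not set(preds.tolist()) <= {0, 1}:
        raise ValueError("predictions must be 0 or 1")
    # Overwrite the existing column so column order and count match the input file.
    out["label"] = preds
    return out


# Usage (SEP must be the separator used by the original file):
# val = pd.read_csv("dataset/validation_data.csv", sep=SEP)
# filled = fill_validation_labels(val, final_model)
# filled.to_csv("validation_predictions.csv", sep=SEP, index=False)
```

# index=False keeps pandas from writing its row index as an extra column.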

In [None]:
# 9) Accuracy estimation (what to report)
# Report test set metrics from step 5 (or your CV estimates).
# State: test accuracy, precision/recall/F1 for each class, and any notable error patterns (e.g., satire mistaken for fake).
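# If you report a cross-validation estimate, a sketch like the following gives a
# mean and a spread, assuming scikit-learn. The helper name, the 5 folds, and
# the seed are illustrative.

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cv_accuracy(model, X, y, n_splits: int = 5, seed: int = 42):
    """Return (mean, standard deviation) of accuracy over stratified folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # The model is cloned and refitted on each fold, so the passed-in model is not modified.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return float(scores.mean()), float(scores.std())


# Usage:
# mean_acc, std_acc = cv_accuracy(pipe, X_train, y_train)
# print(f"CV accuracy: {mean_acc:.3f} +/- {std_acc:.3f}")
```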

In [None]:
# 10) (Optional) Nice upgrades
# Use subject: combine with text via ColumnTransformer.
# Use date: extract year/month; publication date sometimes correlates with the label.
# Calibrate probabilities (CalibratedClassifierCV) if you want threshold tuning.
# Error analysis: inspect top false positives/negatives to refine preprocessing.
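# A sketch of the subject upgrade, assuming scikit-learn and a DataFrame input
# with text and subject columns. The function name is hypothetical. Before
# relying on subject, check that its values overlap between the two classes and
# between data.csv and validation_data.csv; otherwise it may act as a shortcut
# that does not generalize.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def make_subject_pipeline() -> Pipeline:
    """Combine TF-IDF on the text column with a one-hot encoding of subject."""
    features = ColumnTransformer(
        transformers=[
            # A bare column name (not a list) hands the vectorizer a 1-D series of strings.
            ("text", TfidfVectorizer(), "text"),
            # A list of names hands the encoder a 2-D frame; unseen subjects encode as all zeros.
            ("subject", OneHotEncoder(handle_unknown="ignore"), ["subject"]),
        ]
    )
    return Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])


# Usage (fit on the DataFrame itself, not on a single text column):
# model = make_subject_pipeline().fit(train_df, train_df["label"])
```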