## Load Testing Data and Model


In [22]:
import pandas as pd
import joblib

# Load the testing data
testing_df = pd.read_csv('/content/testing_data.csv', header=None)
print("Testing DataFrame loaded successfully.")
print(testing_df.head())

# Load the pre-trained Naive Bayes TF-IDF model
model = joblib.load('/content/naive_bayes_tfidf_model.joblib')
print("Naive Bayes TF-IDF model loaded successfully.")

# Load the pre-trained TF-IDF vectorizer
vectorizer = joblib.load('/content/tfidf_vectorizer.joblib')
print("TF-IDF vectorizer loaded successfully.")

Testing DataFrame loaded successfully.
                                                   0
0  2\tcopycat muslim terrorist arrested with assa...
1  2\twow! chicago protester caught on camera adm...
2  2\tgermany's fdp look to fill schaeuble's big ...
4  2\tu.n. seeks 'massive' aid boost amid rohingy...
Naive Bayes TF-IDF model loaded successfully.
TF-IDF vectorizer loaded successfully.


`testing_df` holds both the label and the text content in a single column, separated by a tab. Before making predictions, I will split each row on the first tab and keep only the text for vectorization.


In [23]:
# Name the single column, then split each row on the first tab and keep the text part
testing_df.columns = ['text_with_label']
text_data = testing_df['text_with_label'].apply(lambda x: x.split('\t', 1)[1])
print("Extracted text data from testing_df successfully.")
print(text_data.head())

Extracted text data from testing_df successfully.
0    copycat muslim terrorist arrested with assault...
1    wow! chicago protester caught on camera admits...
2     germany's fdp look to fill schaeuble's big shoes
4    u.n. seeks 'massive' aid boost amid rohingya '...
Name: text_with_label, dtype: object
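As an alternative to splitting after loading, the label and text could be separated at read time by telling pandas that the file is tab-delimited. The following is a minimal sketch on a hypothetical two-row sample in the same `label<TAB>text` layout; the column names `label` and `text` and the sample rows are assumptions, not part of the original notebook.

```python
import io

import pandas as pd

# Hypothetical sample in the same "label<TAB>text" layout as testing_data.csv.
sample = "2\tfirst headline, with a comma\n2\tsecond headline\n"

parsed_df = pd.read_csv(
    io.StringIO(sample),
    sep="\t",                  # split on tabs instead of the default comma
    header=None,               # the file has no header row
    names=["label", "text"],   # assumed column names for illustration
    quoting=3,                 # csv.QUOTE_NONE: treat quote characters as plain text
)
print(parsed_df)
```

Reading with `sep="\t"` keeps commas inside headlines from being treated as column separators and keeps the label column available for later evaluation.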


The extracted text is transformed into numerical features using the loaded TF-IDF vectorizer, and the model then predicts a class for each row.


In [24]:
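# Use transform (not fit_transform) so the features match the vocabulary the model was trained on
X_test_tfidf = vectorizer.transform(text_data)
print("Text data vectorized successfully.")

# Predict a class label (0 or 1) for every row
predictions = model.predict(X_test_tfidf)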
print("Predictions made successfully.")
print(f"First 5 predictions: {predictions[:5]}")

Text data vectorized successfully.
Predictions made successfully.
First 5 predictions: [0 0 1 0 1]
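To illustrate why the loaded vectorizer is applied with `transform` rather than refitted, here is a small self-contained sketch. It trains a toy `TfidfVectorizer` and `MultinomialNB` on four made-up headlines; the training texts, labels, and the choice of `MultinomialNB` are assumptions for illustration, since the notebook does not show how the saved model was built.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data (0 = fake, 1 = real), for illustration only.
train_texts = [
    "shocking secret exposed",
    "senate passes budget bill",
    "shocking hoax exposed",
    "central bank raises rates",
]
train_labels = [0, 1, 0, 1]

# fit_transform learns the vocabulary and IDF weights from the training texts.
toy_vectorizer = TfidfVectorizer()
X_train = toy_vectorizer.fit_transform(train_texts)
toy_model = MultinomialNB().fit(X_train, train_labels)

# transform reuses the learned vocabulary, so new text gets the same columns.
new_texts = ["shocking budget secret", "unseen words only"]
X_new = toy_vectorizer.transform(new_texts)
toy_predictions = toy_model.predict(X_new)

print(X_train.shape, X_new.shape)  # same number of columns in both matrices
print(X_new[1].nnz)                # count of non-zero features in the second text
print(toy_predictions)
```

Words that were not seen during fitting are ignored by `transform`, so a text made up entirely of unseen words becomes an all-zero row and the classifier falls back on its class priors.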



Display the predictions alongside the original text.



In [25]:
output_df = pd.DataFrame({
    'Prediction': predictions,
    'Original Text': text_data
})
print("Combined predictions with original text successfully.")
print(output_df.head())


Combined predictions with original text successfully.
   Prediction                                      Original Text
0           0  copycat muslim terrorist arrested with assault...
1           0  wow! chicago protester caught on camera admits...
2           1   germany's fdp look to fill schaeuble's big shoes
4           1  u.n. seeks 'massive' aid boost amid rohingya '...


Calculate the value counts of the `Prediction` column in `output_df`. This shows how many instances were predicted as `0` (fake) and `1` (real).



In [26]:
prediction_distribution = output_df['Prediction'].value_counts()
print("Distribution of Predictions (0: Fake, 1: Real):\n")
print(prediction_distribution)

Distribution of Predictions (0: Fake, 1: Real):

Prediction
1    5027
0    4957
Name: count, dtype: int64
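The raw counts can also be expressed as percentage shares, which makes the balance easier to read. This sketch rebuilds the two counts reported above as a standalone Series; the variable names `counts` and `shares` are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts output above (1 = real, 0 = fake).
counts = pd.Series({1: 5027, 0: 4957}, name="count")

# Convert each count to a percentage of all predictions.
shares = (counts / counts.sum() * 100).round(2)

print(f"Total predictions: {counts.sum()}")
print(shares)
```

In the notebook itself, `output_df['Prediction'].value_counts(normalize=True)` would give the same proportions directly as fractions.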


## Final Task

### Subtask:
Summarize the process of predicting news categories for the `testing_df` and present the distribution of the model's predictions (0 for fake news, 1 for real news) as requested.


## Summary:

### Q&A
The process successfully predicted news categories for the `testing_df` by loading a pre-trained Naive Bayes TF-IDF model and vectorizer, processing the text data, and then generating predictions. The distribution of these predictions was subsequently analyzed and presented.

### Data Analysis Key Findings
*   All necessary components, including the `testing_data.csv` dataset, the `naive_bayes_tfidf_model.joblib` model, and the `tfidf_vectorizer.joblib` vectorizer, were successfully loaded.
*   Textual content was extracted from `testing_df` by splitting each row on the first tab, then transformed into TF-IDF features for model input.
*   The model generated predictions, with the first five predictions being `[0 0 1 0 1]`, indicating a mix of fake (0) and real (1) news classifications.
*   The final distribution of predictions was nearly balanced: of 9984 rows, 5027 (about 50.4%) were predicted as 'real news' (1) and 4957 (about 49.6%) as 'fake news' (0).

### Insights or Next Steps
*   Roughly half of the rows are predicted as each class. This describes the model's output only; whether `testing_data.csv` itself is balanced between real and fake news cannot be concluded without true labels.
*   To further validate the model's performance, the next step should involve evaluating these predictions against actual labels (if available in the `testing_data.csv`) using appropriate classification metrics such as accuracy, precision, recall, and F1-score.
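If true labels become available, the evaluation step could look roughly like the sketch below. The `y_true` and `y_pred` lists are hypothetical placeholders; note that the rows previewed earlier all carry the label `2`, so it is not clear that `testing_data.csv` contains usable 0/1 labels.

```python
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical ground-truth labels and predictions (0 = fake, 1 = real).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Fraction of rows where the prediction matches the true label.
accuracy = accuracy_score(y_true, y_pred)

# Per-class precision, recall, and F1-score in a text table.
report = classification_report(
    y_true, y_pred, target_names=["fake (0)", "real (1)"]
)

print(f"Accuracy: {accuracy:.3f}")
print(report)
```

With real data, `y_pred` would be the `predictions` array and `y_true` the label column aligned row by row with the text that was vectorized.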


Nearly Balanced Predictions: The bar chart of the model's predictions shows a relatively balanced distribution between 'Fake News' (0) and 'Real News' (1) categories. Specifically, 5027 instances were predicted as 'Real News' (1), and 4957 instances were predicted as 'Fake News' (0). This indicates that the Naive Bayes model is not heavily biased towards predicting just one class, at least in terms of its output distribution on this dataset.



Model's Propensity: The model appears to classify almost an equal number of news articles as 'fake' and 'real'. This is a positive sign that it's attempting to differentiate between the two categories, rather than simply predicting the majority class, which can often happen with imbalanced datasets. However, it's crucial to remember that this is merely the distribution of predictions, not an evaluation of correctness.

True Label Comparison is Key: While the prediction distribution is informative, the most critical next step is to evaluate these predictions against true labels for `testing_data.csv`. Without ground truth, we cannot say how accurate, precise, or reliable these predictions are.

So, the primary insight is that the model is making diverse predictions, but without true labels for the testing data, we cannot assess its actual performance.