#### Title: Machine Learning Classifier for News Articles
#### Author: Brian Castillo
#### Date: 14 Sep 2023

_____

### Objectives

- Data Collection: Gathering news articles from various categories.
- Data Preprocessing: Cleaning and preparing the text data for analysis.
- Feature Engineering: Transforming the text into a format that can be fed into ML algorithms.
- Model Building: Creating a machine learning model to classify the articles into different categories.
- Model Evaluation: Assessing the model's performance through metrics like accuracy, precision, recall, etc.
- Interpretation: Analyzing the results and understanding how the model is making its predictions.

__________

In [1]:
import pandas as pd
import json

### Data loading
Using a dataset with ~200k articles

In [22]:
# Initialize an empty list to store the JSON objects
data_list = []

# Open the file and read line by line
with open('News_Category_Dataset_v3.json', 'r') as f:
    for line in f:
        # Parse each line as a JSON object and append to list
        data_list.append(json.loads(line))

# Convert the list to a Pandas DataFrame
news_df = pd.DataFrame(data_list)

# Show the first few rows of the DataFrame
news_df.head()

| | link | headline | category | short_description | authors | date |
|---|---|---|---|---|---|---|
| 0 | https://www.huffpost.com/entry/covid-boosters-... | Over 4 Million Americans Roll Up Sleeves For O... | U.S. NEWS | Health experts said it is too early to predict... | Carla K. Johnson, AP | 2022-09-23 |
| 1 | https://www.huffpost.com/entry/american-airlin... | American Airlines Flyer Charged, Banned For Li... | U.S. NEWS | He was subdued by passengers and crew when he ... | Mary Papenfuss | 2022-09-23 |
| 2 | https://www.huffpost.com/entry/funniest-tweets... | 23 Of The Funniest Tweets About Cats And Dogs ... | COMEDY | "Until you have a dog you don't understand wha... | Elyse Wanshel | 2022-09-23 |
| 3 | https://www.huffpost.com/entry/funniest-parent... | The Funniest Tweets From Parents This Week (Se... | PARENTING | "Accidentally put grown-up toothpaste on my to... | Caroline Bologna | 2022-09-23 |
| 4 | https://www.huffpost.com/entry/amy-cooper-lose... | Woman Who Called Cops On Black Bird-Watcher Lo... | U.S. NEWS | Amy Cooper accused investment firm Franklin Te... | Nina Golgowski | 2022-09-22 |
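
The file is in JSON Lines form (one JSON object per line), which is why the cell above parses it line by line. As a hedged alternative, pandas can read this layout directly with `lines=True`. The sketch below uses two made-up records held in memory instead of the real file, so the headlines and the `sample_df` name are illustrative assumptions only.

```python
import io

import pandas as pd

# Two hypothetical records in JSON Lines form, standing in for the real file.
sample = io.StringIO(
    '{"headline": "Cats And Dogs Tweets", "category": "COMEDY"}\n'
    '{"headline": "Airline Passenger Charged", "category": "U.S. NEWS"}\n'
)

# lines=True tells pandas to parse each line as a separate JSON object.
sample_df = pd.read_json(sample, lines=True)

print(sample_df.shape)
print(sample_df["category"].tolist())
```

With the real dataset, the same call would take the file path in place of the in-memory buffer.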


_________

### Data Preprocessing
- Removing special characters
- Converting all text to lower case
- Keeping the 'category' column out of the features, since it is the label the machine learning algorithm will learn to predict
- Any other necessary text cleaning methods
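
To make the first two steps concrete, here is a minimal sketch on a single string, using only the standard library. The helper name `clean_headline` and the example headline are made up for illustration; the notebook itself applies the same pattern to the whole `headline` column with pandas.

```python
import re


def clean_headline(text):
    """Remove special characters from one headline, then lowercase it."""
    # [^\w\s] matches any character that is neither a word character
    # (letter, digit, underscore) nor whitespace, i.e. punctuation and symbols.
    no_special = re.sub(r"[^\w\s]", "", text)
    return no_special.lower()


print(clean_headline("Flyer Charged, Banned For Life!"))
# → flyer charged banned for life
```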

In [19]:
# Checking for any missing values
missing_values = news_df.isnull().sum()
missing_values

link                 0
headline             0
category             0
short_description    0
authors              0
date                 0
dtype: int64

In [71]:
# Removing special characters from the headlines, then converting them to lowercase
# regex=True is needed in pandas 2.0+, where str.replace treats the pattern as a literal string by default
news_df['headline_nospecialchar'] = news_df['headline'].str.replace(r'[^\w\s]', '', regex=True)

news_df['headline_lower'] = news_df['headline_nospecialchar'].str.lower()


In [72]:
#Creating a token for each word
from nltk.tokenize import word_tokenize

news_df['headline_tokens'] = news_df['headline_lower'].apply(word_tokenize)

In [73]:
# Removing stopwords
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

news_df['headline_filtered'] = news_df['headline_tokens'].apply(lambda x: [word for word in x if word.lower() not in stop_words])

In [74]:
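# Lemmatization (reducing a word to its dictionary form, e.g. articles -> article)
# Without a part-of-speech tag, WordNetLemmatizer treats every word as a noun, so forms like "programming" stay unchanged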
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

news_df['headline_lemmatized'] = news_df['headline_filtered'].apply(lambda x: [lemmatizer.lemmatize(word.lower()) for word in x])

In [76]:
#Vectorization 
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# X is the feature matrix: a numerical (TF-IDF) representation of the text data
X = vectorizer.fit_transform(news_df['headline_lemmatized'].apply(lambda x: ' '.join(x)))
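
To illustrate what the feature matrix contains, here is a small sketch on three made-up headlines (the `toy_` names and the texts are assumptions, not part of the dataset). Each row is a document, each column a vocabulary word, and words that appear in more documents get a lower weight.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical, already-cleaned headlines.
toy_docs = ["stock market rally", "market crash fear", "cat video go viral"]

toy_vectorizer = TfidfVectorizer()
toy_X = toy_vectorizer.fit_transform(toy_docs)

# One row per document, one column per distinct word.
print(toy_X.shape)

# "market" occurs in two documents and "stock" in one, so within the first
# headline "market" receives the smaller TF-IDF weight.
vocab = toy_vectorizer.vocabulary_
first_row = toy_X[0].toarray().ravel()
print(first_row[vocab["market"]] < first_row[vocab["stock"]])
```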

________

### Building the classifier

I will be using the **Random Forest** classifier

- Split the dataset into training and testing sets.
- Train a Random Forest classifier on the training set.
- Evaluate the model on the testing set.
- Examine feature importance (which words are most indicative of each category).

In [77]:
# Creating the training sets
y = news_df['category'] # Target labels: the correct category for each article, which the model learns from

# Splitting the data using sklearn
from sklearn.model_selection import train_test_split

# Splitting the data into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training set size:", X_train.shape)
print("Testing set size:", X_test.shape)

Training set size: (146668, 61302)
Testing set size: (62859, 61302)
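
The split above is purely random. The categories differ a lot in size (the support column of the classification report ranges from a few hundred to several thousand articles), so one option, which this notebook does not use, is the `stratify` argument of `train_test_split`. It keeps the class proportions the same in both sets. The sketch below runs on made-up toy data, not on the real feature matrix.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 16 articles of one class, 4 of another.
toy_X = [[i] for i in range(20)]
toy_y = ["ENTERTAINMENT"] * 16 + ["COLLEGE"] * 4

# stratify=toy_y preserves the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)

print(Counter(y_tr))
print(Counter(y_te))
```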


In [82]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Evaluate the model
y_pred = rf_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"Model Accuracy: {accuracy}")
print("Classification Report:")
print(report)


Model Accuracy: 0.5333046978157463
Classification Report:
                precision    recall  f1-score   support

          ARTS       0.34      0.21      0.26       438
ARTS & CULTURE       0.36      0.09      0.14       388
  BLACK VOICES       0.53      0.25      0.34      1378
      BUSINESS       0.46      0.37      0.41      1796
       COLLEGE       0.35      0.32      0.34       297
        COMEDY       0.57      0.35      0.43      1620
         CRIME       0.44      0.45      0.44      1094
CULTURE & ARTS       0.68      0.20      0.31       309
       DIVORCE       0.76      0.60      0.67      1013
     EDUCATION       0.34      0.25      0.29       315
 ENTERTAINMENT       0.54      0.67      0.60      5148
   ENVIRONMENT       0.51      0.12      0.20       442
         FIFTY       0.22      0.06      0.09       413
  FOOD & DRINK       0.54      0.69      0.60      1896
     GOOD NEWS       0.27      0.08      0.12       402
         GREEN       0.39      0.27      0.31

In [81]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Initialize the Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Predict the labels for the test set
y_pred = clf.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')


Accuracy: 53.33%


### Hyperparameter Tuning

The model reaches an accuracy of 53.33%, which could be improved using:

- Hyperparameter Tuning: Use techniques like Grid Search or Random Search to find the best hyperparameters for your Random Forest model. This can include the number of trees, the maximum depth of trees, the minimum samples required to split an internal node, etc.

- Feature Engineering: Since text data can be noisy, try different ways to represent the text, for example adding n-grams to the TF-IDF features instead of relying on single words.

- Ensemble Methods: Combining multiple models can often yield better results than a single model.

#### Some hyperparameters I will tune:

- n_estimators: The number of trees in the forest.
- max_features: The number of features to consider when looking for the best split.
- max_depth: The maximum depth of the trees.
- min_samples_split: The minimum number of samples required to split an internal node.
- min_samples_leaf: The minimum number of samples required to be at a leaf node.

In [85]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['auto', 'sqrt'],  # 'auto' equals 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Create the grid search with 3-fold cross-validation
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, 
                           cv=3, n_jobs=-1, verbose=2)

# Fit the grid search model (this will take some time)
grid_search.fit(X_train, y_train)

# Get the best parameters from the grid search
best_params = grid_search.best_params_

best_params

Fitting 3 folds for each of 72 candidates, totalling 216 fits


{'max_depth': None,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 5,
 'n_estimators': 200}
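
Random Search, mentioned above as an alternative, samples a fixed number of parameter combinations instead of trying all of them, which can cut the runtime considerably. Below is a minimal sketch with `RandomizedSearchCV` on synthetic data. The `X_demo`/`y_demo` data, the value lists, and `n_iter=5` are illustrative assumptions, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset so the sketch runs quickly.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=42)

param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
}

# n_iter=5 evaluates 5 randomly sampled combinations rather than all 72.
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
```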

In [None]:
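# Initialize the Random Forest Classifier with the best hyperparameters
# (on scikit-learn 1.3+, pass max_features='sqrt'; 'auto' meant the same for classifiers and was removed)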
best_rf_model = RandomForestClassifier(n_estimators=200, max_depth=None, max_features='auto',
                                       min_samples_leaf=1, min_samples_split=5, random_state=42, n_jobs=-1)

# Train the model on the training set
best_rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_best_rf = best_rf_model.predict(X_test)

# Calculate the accuracy of the model
best_rf_accuracy = accuracy_score(y_test, y_pred_best_rf)
best_rf_accuracy

0.5371864013108704

_________

## Conclusion: News Article Category Classifier

### 1. Objective & Dataset:
- **Goal**: Build a machine learning classifier to categorize news articles into their respective categories.
- **Dataset**: About 210,000 news articles (209,527 rows) with attributes like link, headline, short description, authors, date, and category.

### 2. Data Preprocessing:
- Preprocessed the article headlines (the short descriptions were not used):
  * Removal of special characters.
  * Conversion to lowercase.
  * Tokenization and removal of stopwords.
  * Lemmatization to reduce words to their base form.
- Used the TF-IDF vectorizer to turn the cleaned text into numerical features.

### 3. Model Selection & Training:
- Chose the **Random Forest Classifier** due to its ability to handle large datasets and robustness against overfitting.
- Initial model yielded an accuracy of ~53.3%.
- Used parallel processing (`n_jobs=-1`) to speed up training.

### 4. Hyperparameter Tuning:
- Conducted Grid Search with cross-validation for hyperparameter optimization.
- Best Parameters:
  * `max_depth`: None
  * `max_features`: auto (equivalent to `sqrt` for classifiers)
  * `min_samples_leaf`: 1
  * `min_samples_split`: 5
  * `n_estimators`: 200

### 5. Final Model Performance:
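- Refined Random Forest Classifier with the best parameters achieved ~53.7% accuracy on the test dataset, about 0.4 percentage points above the untuned model.
- The small gain suggests that further improvement is more likely to come from better features or other algorithms than from more tuning of this model.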

### 6. Future Directions:
- Explore other machine learning algorithms like Neural Networks or Gradient Boosted Trees.
- Incorporate additional features (e.g., article author, publication date).
- Consider advanced preprocessing techniques, such as n-gram based TF-IDF vectorization.
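
As a rough illustration of the n-gram idea, the sketch below compares a unigram vocabulary with one that also includes bigrams, on two made-up headlines. The `ngram_range` parameter is part of `TfidfVectorizer`; the texts and variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical headlines that share the word "white" in different senses.
docs = ["white house press briefing", "white wine pairing guide"]

unigram = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
bigram = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

# Bigrams add word pairs such as "white house" as separate features.
print(len(unigram.vocabulary_), len(bigram.vocabulary_))
print("white house" in bigram.vocabulary_)
```

Word pairs can separate phrases that single words cannot, at the cost of a larger feature matrix.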

---

The project highlighted the potential of machine learning in news categorization, with skills and methodologies that are transferable across various domains.
