### Part 1: Data Preprocessing:
1.1 Load the dataset and perform initial exploration to understand its structure.

In [3]:
import pandas as pd

# Read the CSV file
df = pd.read_csv('News_Category_Dataset_v3.csv')

print(df.head())

   Unnamed: 0                                           headline   category  \
0           0  Over 4 Million Americans Roll Up Sleeves For O...  U.S. NEWS   
1           1  American Airlines Flyer Charged, Banned For Li...  U.S. NEWS   
2           2  23 Of The Funniest Tweets About Cats And Dogs ...     COMEDY   
3           3  The Funniest Tweets From Parents This Week (Se...  PARENTING   
4           4  Woman Who Called Cops On Black Bird-Watcher Lo...  U.S. NEWS   

                                   short_description               authors  \
0  Health experts said it is too early to predict...  Carla K. Johnson, AP   
1  He was subdued by passengers and crew when he ...        Mary Papenfuss   
2  "Until you have a dog you don't understand wha...         Elyse Wanshel   
3  "Accidentally put grown-up toothpaste on my to...      Caroline Bologna   
4  Amy Cooper accused investment firm Franklin Te...        Nina Golgowski   

         date  headline_length  short_description_length

1.2 Clean the text data, including removing special characters, stopwords, and applying lowercasing.

In [6]:
from nltk.corpus import stopwords
import nltk
import re
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = str(text)
    # lowercase
    text = text.lower()
    # remove special characters
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# clean 'headline' and 'short_description'
df['cleaned_headline'] = df['headline'].apply(clean_text)
df['cleaned_description'] = df['short_description'].apply(clean_text)

print(df)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Yue\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


        Unnamed: 0                                           headline  \
0                0  Over 4 Million Americans Roll Up Sleeves For O...   
1                1  American Airlines Flyer Charged, Banned For Li...   
2                2  23 Of The Funniest Tweets About Cats And Dogs ...   
3                3  The Funniest Tweets From Parents This Week (Se...   
4                4  Woman Who Called Cops On Black Bird-Watcher Lo...   
...            ...                                                ...   
209522      209522  RIM CEO Thorsten Heins' 'Significant' Plans Fo...   
209523      209523  Maria Sharapova Stunned By Victoria Azarenka I...   
209524      209524  Giants Over Patriots, Jets Over Colts Among  M...   
209525      209525  Aldon Smith Arrested: 49ers Linebacker Busted ...   
209526      209526  Dwight Howard Rips Teammates After Magic Loss ...   

         category                                  short_description  \
0       U.S. NEWS  Health experts said it is too ea

1.3 Perform text tokenization and vectorization using TF-IDF.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# combine the headline and short description
df['text'] = df['cleaned_headline'] + df['cleaned_description']
tfidf = TfidfVectorizer()
text_tfidf = tfidf.fit_transform(df['text'])

1.4 Extract and analyze different features from the text that might be useful for classification, such as word count,
sentence length, n-grams, etc

In [11]:
# word count
df['word_count'] = df['text'].apply(lambda x: len(x.split()))

# sentence length
df['sentence_length'] = df['text'].apply(len)

# n-grams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 4))
tfidf_matrix = tfidf_vectorizer.fit_transform(df['text'])
feature_names = tfidf.get_feature_names_out()
print(feature_names)


['00' '000' '0000' ... 'zzzsrather' 'zzzssnooze' 'zzzzzs']
