# Exercise

1. Data Loading and Initial Exploration

   - a. Download the 20 Newsgroups dataset using `fetch_20newsgroups` from `sklearn.datasets`.
   - b. Load the dataset into a pandas DataFrame.
   - c. Display the first few rows to understand the structure.
   - d. Get a summary of the dataset:
     - Number of samples.
     - Distribution of classes.
     - Any missing values.
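
One possible way to approach steps b to d is sketched below. The helper names `newsgroups_to_frame` and `summarize`, the column names, and the toy corpus are assumptions made for illustration; only `data`, `target`, and `target_names` come from the object that `fetch_20newsgroups` returns.

```python
import pandas as pd


def newsgroups_to_frame(texts, targets, target_names):
    """Build a DataFrame with one row per post and a readable label column."""
    df = pd.DataFrame({"text": list(texts), "target": list(targets)})
    # Map the integer class id to the newsgroup name.
    df["newsgroup"] = df["target"].map(lambda i: target_names[i])
    return df


def summarize(df):
    """Return the sample count, per-class counts, and missing-value counts."""
    return {
        "n_samples": len(df),
        "class_counts": df["newsgroup"].value_counts(),
        "missing": df.isna().sum(),
        # Empty posts are not NaN, so they are counted separately.
        "empty_texts": int((df["text"].str.strip() == "").sum()),
    }


# Tiny stand-in corpus so the sketch runs without downloading anything.
toy_names = ["rec.autos", "sci.space"]
toy_df = newsgroups_to_frame(
    ["My car won't start.", "The shuttle launched today.", ""],
    [0, 1, 1],
    toy_names,
)
toy_summary = summarize(toy_df)
print(toy_df.head())
print(toy_summary["class_counts"])
```

With the real data, the same helper could be called as `newsgroups_to_frame(twenty_train.data, twenty_train.target, twenty_train.target_names)`, where `twenty_train` is the training subset loaded at the end of this exercise.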

2. Text Preprocessing

   After looking at the data, perform the relevant preprocessing steps on the news texts:

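   - Noise Removal (e.g., e-mail addresses, URLs, punctuation, stop words).
   - Lexicon Normalization (e.g., stemming or lemmatization).
   - Object Standardization (e.g., expanding abbreviations and slang).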
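
The three steps could be chained roughly as follows. This is a minimal sketch using only the standard library: the abbreviation table, the short stop-word list, and the regular expressions are illustrative assumptions, not a recommended configuration. The `normalize` argument is a hypothetical hook where a stemmer or lemmatizer of your choice can be plugged in.

```python
import re

# Hypothetical lookup table for object standardization; extend it after
# inspecting the posts.
ABBREVIATIONS = {"u": "you", "thx": "thanks", "imho": "in my humble opinion"}

# A deliberately small stop-word list, used only for illustration.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

EMAIL_RE = re.compile(r"\S+@\S+")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def remove_noise(text):
    """Lowercase the text and strip e-mail addresses, URLs, and non-letters."""
    text = text.lower()
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return NON_ALPHA_RE.sub(" ", text)


def preprocess(text, normalize=None):
    """Return a list of cleaned tokens.

    `normalize` is an optional callable applied to each token, for example
    the per-word function of a stemmer or lemmatizer.
    """
    tokens = remove_noise(text).split()
    # Object standardization: expand known abbreviations.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    # An abbreviation may expand to several words, so split again.
    tokens = " ".join(tokens).split()
    # Stop-word removal is treated here as part of noise removal.
    tokens = [tok for tok in tokens if tok not in STOP_WORDS]
    if normalize is not None:
        # Lexicon normalization: reduce each token to a base form.
        tokens = [normalize(tok) for tok in tokens]
    return tokens


sample = (
    "Thx Bob! Mail me at bob@example.com or see "
    "http://example.com/faq, the FAQ is GREAT."
)
print(preprocess(sample))
```

Newsgroup posts also carry headers, signatures, and quoted replies. `fetch_20newsgroups` accepts a `remove` argument (for example `remove=("headers", "footers", "quotes")`) that strips these at load time, which may be simpler than handling them with regular expressions.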

3. Exploratory Data Analysis (EDA)

   - a. Word Cloud
     - Generate separate word clouds for a few selected newsgroups to visualize the most frequent words.
   - b. Frequency Distribution
     - Plot the top 20 most frequent words in the entire dataset after preprocessing.
   - c. Class Distribution
     - Visualize the distribution of the 20 newsgroups using a bar chart.
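
The counting behind steps b and c can be sketched with `collections.Counter`. The function names and toy data below are assumptions; the input is expected to be token lists such as those produced by your own preprocessing.

```python
from collections import Counter


def top_words(token_lists, n=20):
    """Return the n most common tokens across all documents."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)


def class_distribution(labels):
    """Count how many documents fall into each class."""
    return Counter(labels)


toy_tokens = [
    ["space", "shuttle", "launch"],
    ["space", "orbit"],
    ["hockey", "game", "space"],
]
toy_labels = ["sci.space", "sci.space", "rec.sport.hockey"]

print(top_words(toy_tokens, n=2))
print(class_distribution(toy_labels))
```

The `(word, count)` pairs and the class counts can be passed to a bar-chart function such as matplotlib's `plt.bar`. For the word clouds, the third-party `wordcloud` package can build a cloud from a frequency dictionary like `dict(top_words(...))` for the posts of one newsgroup.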

4. Feature Extraction

   - a. Part-of-Speech (POS) Tagging
     - Perform POS tagging on a random sample of 500 news articles.
     - Analyze the frequency of different POS tags.
   - b. Named Entity Recognition (NER)
     - Apply NER on the same sample of news articles.
     - Extract and list the most common entities (e.g., persons, organizations, locations).
   - c. TF-IDF Vectorization
     - Convert the preprocessed news articles into TF-IDF feature vectors.
     - Examine the shape of the resulting feature matrix.
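
The sketch below covers the sampling, the tag counting, and the TF-IDF step. It assumes the tagger is supplied by you: `tag_fn` is a placeholder for any callable that maps a list of tokens to `(token, tag)` pairs (NLTK's `pos_tag` has this shape), and `toy_tagger` is a stand-in so the example runs without downloading a model. The function names are illustrative.

```python
import random
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer


def sample_documents(docs, k=500, seed=0):
    """Draw up to k documents at random, reproducibly."""
    docs = list(docs)
    rng = random.Random(seed)
    return rng.sample(docs, min(k, len(docs)))


def pos_tag_counts(docs, tag_fn):
    """Count POS tags over whitespace-tokenized documents.

    `tag_fn` is a placeholder: any callable that maps a list of tokens to
    (token, tag) pairs.
    """
    counts = Counter()
    for doc in docs:
        counts.update(tag for _, tag in tag_fn(doc.split()))
    return counts


def tfidf_matrix(docs, max_features=None):
    """Fit a TF-IDF vectorizer and return it with the sparse matrix."""
    vectorizer = TfidfVectorizer(max_features=max_features)
    matrix = vectorizer.fit_transform(docs)
    return vectorizer, matrix


def toy_tagger(tokens):
    # Stand-in tagger for illustration only: capitalized words become "NNP".
    return [(tok, "NNP" if tok[:1].isupper() else "X") for tok in tokens]


toy_docs = [
    "NASA launched the shuttle",
    "the game ended",
    "NASA delayed the launch",
]
subset = sample_documents(toy_docs, k=2)
tag_counts = pos_tag_counts(toy_docs, toy_tagger)
vectorizer, X = tfidf_matrix(toy_docs)
print(tag_counts)
print(X.shape)  # (number of documents, vocabulary size)
```

Reusing the same `subset` for NER keeps the two analyses comparable; entity labels returned by an NER library can be tallied with a `Counter` in the same way as the tags here.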

5. Model Building

   - a. Train-Test Split
     - Split the dataset into training and testing sets (e.g., 80% train, 20% test).
   - b. Classification Model
     - Choose a classifier (e.g., Logistic Regression, Naive Bayes, Support Vector Machine).
     - Train the model on the TF-IDF feature vectors.
     - Predict the newsgroup categories on the test set.
   - c. Evaluation
     - Calculate evaluation metrics:
       - Accuracy
       - Precision
       - Recall
       - F1-score
     - Display a confusion matrix.
     - Interpret the results.
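
A compact version of steps a to c might look like the following. It is a sketch under assumptions: Multinomial Naive Bayes is just one of the classifiers listed above, macro averaging is one of several reasonable choices for the per-class metrics, and `train_and_evaluate` and the toy corpus are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def train_and_evaluate(texts, labels, test_size=0.2, seed=0):
    """Split, fit TF-IDF + Naive Bayes, and score on the held-out part."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    # Fitting TF-IDF inside the pipeline means the vocabulary and IDF
    # weights are learned from the training part only.
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    return model, {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": confusion_matrix(y_test, y_pred),
    }


# Toy corpus with two clearly separable classes.
toy_texts = ["rocket orbit launch space"] * 10 + ["hockey puck goal ice"] * 10
toy_labels = [0] * 10 + [1] * 10
model, scores = train_and_evaluate(toy_texts, toy_labels)
print(scores["accuracy"])
print(scores["confusion_matrix"])
```

In the confusion matrix, rows are true classes and columns are predicted classes, so off-diagonal cells show which newsgroups get confused with each other. On the real data, expect most confusion between topically close groups.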


Download the dataset as shown below. Do not use `twenty_test` for training; it is reserved for the final evaluation.


```python
from sklearn.datasets import fetch_20newsgroups

twenty_train = fetch_20newsgroups(
    subset="train", shuffle=True, download_if_missing=True
)
twenty_test = fetch_20newsgroups(
    subset="test", shuffle=True, download_if_missing=True
)
```