**Name:** Abdallah Saber <br>
**ID:** 23012064 <br>
**Branch:** General

#### This notebook loads the 20 Newsgroups dataset with scikit-learn and uses regular expressions to extract specific header fields from each document.

In [1]:
# Import the dataset loader
from sklearn.datasets import fetch_20newsgroups

# Import regex support and pandas for the final DataFrame
import re
import pandas as pd

In [2]:
# Load the 20 Newsgroups dataset
newsgroups = fetch_20newsgroups(subset='train')

# Initialize lists to store extracted data
from_list = []
subject_list = []
summary_list = []
distribution_list = []
organization_list = []
keywords_list = []
lines_list = []


## Define regex patterns:
The `patterns` dictionary stores a compiled regular expression for each field. Each pattern matches the beginning of a line (`^`, which applies to every line of the document because of `re.MULTILINE`), followed by the field name, a colon, and a space, and then captures the rest of the line as the value (`(.*)`).

In [3]:
# Define regex patterns for each field
patterns = {
    'From': re.compile(r'^From: (.*)', re.MULTILINE),
    'Subject': re.compile(r'^Subject: (.*)', re.MULTILINE),
    'Summary': re.compile(r'^Summary: (.*)', re.MULTILINE),
    'Distribution': re.compile(r'^Distribution: (.*)', re.MULTILINE),
    'Organization': re.compile(r'^Organization: (.*)', re.MULTILINE),
    'Keywords': re.compile(r'^Keywords: (.*)', re.MULTILINE),
    'Lines': re.compile(r'^Lines: (.*)', re.MULTILINE)
}
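
To illustrate why `re.MULTILINE` is needed, here is a minimal sketch that runs the same `Subject` pattern with and without the flag. The `sample` message is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample message (an assumption, not real dataset content)
sample = (
    "From: alice@example.com (Alice)\n"
    "Subject: Test post\n"
    "Lines: 3\n"
    "\n"
    "Body text here."
)

subject_multiline = re.compile(r'^Subject: (.*)', re.MULTILINE)
subject_plain = re.compile(r'^Subject: (.*)')

# With re.MULTILINE, ^ matches at the start of every line
multiline_match = subject_multiline.search(sample)

# Without the flag, ^ matches only at the very start of the string,
# so a header on the second line is not found
plain_match = subject_plain.search(sample)

print(multiline_match.group(1))
print(plain_match)
```

Because `.` does not match a newline by default, the captured group stops at the end of the header line.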

**`extract_field` function:**
   - This function takes a regex pattern and text as input.
   - It uses `pattern.search(text)` to find a match in the text.
   - If a match is found, it returns the captured group (the value of the field); otherwise, it returns `None`.

In [4]:
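# Function to extract a field using regex
from typing import Optional

def extract_field(pattern: re.Pattern, text: str) -> Optional[str]:
    # Return the captured field value, or None when the field is absent
    match = pattern.search(text)
    return match.group(1) if match else None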
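
As a quick check of both outcomes described above, the following sketch calls `extract_field` on a made-up message that has a `From` header but no `Summary` header. The sample text is an assumption for illustration only.

```python
import re
from typing import Optional

def extract_field(pattern: re.Pattern, text: str) -> Optional[str]:
    # Return the captured field value, or None when the field is absent
    match = pattern.search(text)
    return match.group(1) if match else None

# Hypothetical sample message (not real dataset content)
sample = "From: bob@example.org (Bob)\nSubject: Hello\n\nNo summary header here."

found = extract_field(re.compile(r'^From: (.*)', re.MULTILINE), sample)
missing = extract_field(re.compile(r'^Summary: (.*)', re.MULTILINE), sample)

print(found)
print(missing)
```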

In [5]:
# Iterate over each document in the dataset
for text in newsgroups.data:
    try:
        # Extract data from each field and append to respective lists
        from_list.append(extract_field(patterns['From'], text))
        subject_list.append(extract_field(patterns['Subject'], text))
        summary_list.append(extract_field(patterns['Summary'], text))
        distribution_list.append(extract_field(patterns['Distribution'], text))
        organization_list.append(extract_field(patterns['Organization'], text))
        keywords_list.append(extract_field(patterns['Keywords'], text))
        lines_list.append(extract_field(patterns['Lines'], text))
    except Exception as e:
        # Handle any errors that occur during extraction
        print(f"Error processing document: {e}")
        from_list.append(None)
        subject_list.append(None)
        summary_list.append(None)
        distribution_list.append(None)
        organization_list.append(None)
        keywords_list.append(None)
        lines_list.append(None)
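
One caveat of searching the whole document is that a body line that happens to start with a field name (for example `Lines: ...`) would also match. The sketch below is one possible way to limit the search to the header block, assuming headers end at the first blank line. The helper name `extract_headers` and the sample message are hypothetical and are not part of the original notebook.

```python
import re

# A reduced set of patterns, built the same way as in the notebook
patterns = {
    name: re.compile(rf'^{name}: (.*)', re.MULTILINE)
    for name in ('From', 'Subject', 'Lines')
}

def extract_headers(text):
    # Assumption: the header block ends at the first blank line
    header_block, _, _ = text.partition("\n\n")
    row = {}
    for name, pattern in patterns.items():
        match = pattern.search(header_block)
        row[name] = match.group(1) if match else None
    return row

# Hypothetical message whose body contains a line that looks like a header
sample = (
    "From: carol@example.net\n"
    "Subject: Question\n"
    "\n"
    "Lines: this is body text, not a header"
)

row = extract_headers(sample)
print(row)
```

Building one dictionary per document also allows the rows to be passed to `pd.DataFrame` as a list, instead of maintaining a separate list per field.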

In [6]:
# Create a DataFrame to store the extracted data
data = {
    'From': from_list,
    'Subject': subject_list,
    'Summary': summary_list,
    'Distribution': distribution_list,
    'Organization': organization_list,
    'Keywords': keywords_list,
    'Lines': lines_list
}
df = pd.DataFrame(data)

# Display the DataFrame
df.head()

| | From | Subject | Summary | Distribution | Organization | Keywords | Lines |
|---|---|---|---|---|---|---|---|
| 0 | lerxst@wam.umd.edu (where's my thing) | WHAT car is this!? | | | University of Maryland, College Park | | 15 |
| 1 | guykuo@carson.u.washington.edu (Guy Kuo) | SI Clock Poll - Final Call | Final call for SI clock reports | | University of Washington | SI,acceleration,clock,upgrade | 11 |
| 2 | twillis@ec.ecn.purdue.edu (Thomas E Willis) | PB questions... | | usa | Purdue University Engineering Computer Network | | 36 |
| 3 | jgreen@amber (Joe Green) | Re: Weitek P9000 ? | | world | Harris Computer Systems Division | | 14 |
| 4 | jcm@head-cfa.harvard.edu (Jonathan McDowell) | Re: Shuttle Launch Question | | sci | Smithsonian Astrophysical Observatory, Cambrid... | | 23 |


**Examples of possible benefits of extracting these fields from the downloaded dataset.**

The notebook does not compute these analyses itself, but the extracted fields would support work such as:

- **Analyze the most frequent senders and their associated organizations.**
- **Identify the most common subjects and keywords.**
- **Determine the distribution of messages across different newsgroups (using the dataset's `target` labels alongside the extracted fields).**
- **Analyze the length of messages (using the 'Lines' field) and its correlation with other fields.**
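
As a rough sketch of how the first and last ideas above could start, the example below counts senders and averages message length on a small made-up DataFrame shaped like the extracted one. The rows are invented for illustration and are not dataset values.

```python
import pandas as pd

# Made-up rows shaped like the extracted DataFrame (not real dataset values)
df = pd.DataFrame({
    'From': ['a@example.com', 'b@example.com', 'a@example.com'],
    'Organization': ['Org A', None, 'Org A'],
    'Lines': ['15', '11', None],
})

# Most frequent senders
top_senders = df['From'].value_counts()

# 'Lines' is extracted as text, so convert it to numbers before doing arithmetic;
# missing or non-numeric values become NaN and are skipped by mean()
df['Lines'] = pd.to_numeric(df['Lines'], errors='coerce')
mean_lines = df['Lines'].mean()

print(top_senders)
print(mean_lines)
```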