# Week 8: Regular Expressions and Data Structures

## Learning Objectives

By the end of this lab, students should be able to:
- Use **regular expressions (`re` module)** to extract patterns from text.
- Parse **CSV data** manually and with the `csv` module.
- Perform simple **web scraping** using the `requests` and `re` modules.
- Store and manipulate extracted data using **Python dictionaries**.

## Section 1: Introduction to Regular Expressions

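Python’s `re` module lets us search, extract, and replace text that matches a pattern. Patterns are usually written as raw strings (`r'...'`) so that backslashes such as `\d` reach the regex engine unchanged.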

### Example

```python
import re

text = "My phone number is 98765-43210."
pattern = r'\d{5}-\d{5}'
match = re.search(pattern, text)
if match:
    print("Phone Number Found:", match.group())
```

Output:
```text
Phone Number Found: 98765-43210
```
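`re.search` stops at the first match. As a minimal sketch of two other tools the tasks below rely on, `re.findall` returns every non-overlapping match, and parentheses create capture groups. The sample string and the date pattern here are illustrative assumptions, not part of the lab data.

```python
import re

# Illustrative text (an assumption for this sketch, not lab data).
text = "Deadlines: 2024-03-15 and 2024-04-01."

# No groups in the pattern: findall returns each full match as a string.
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates)   # ['2024-03-15', '2024-04-01']

# With groups: findall returns one tuple of captured groups per match.
parts = re.findall(r'(\d{4})-(\d{2})-(\d{2})', text)
print(parts)   # [('2024', '03', '15'), ('2024', '04', '01')]
```

Note that `findall` always returns strings (or tuples of strings), so numeric values need an explicit `int()` or `float()` conversion afterwards.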

## Task 1: Extract Email Addresses (10 marks)

Write a function `extract_emails(text)` that takes a string and returns **a list of all email addresses** present in it.

**Example:**

```python
text = "Contact us at support@iitm.ac.in or admin@maths.org."
print(extract_emails(text))
```

**Expected Output:**
```text
['support@iitm.ac.in', 'admin@maths.org']
```

```python
import re

def extract_emails(text):
    # YOUR CODE HERE
    pattern = r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'
    return re.findall(pattern, text)

# Sample check
extract_emails("Emails: student1@iitm.ac.in, helpdesk@maths.org")
```

```python
### BEGIN HIDDEN TESTS
assert extract_emails("abc@gmail.com") == ["abc@gmail.com"]
assert len(extract_emails("no emails here")) == 0
assert extract_emails("mail: a@b.com, x@y.co.in") == ['a@b.com', 'x@y.co.in']
### END HIDDEN TESTS
```

## Task 2: Extract All Numbers from Text (10 marks)

Write a function `extract_numbers(text)` that returns all the numbers (integers or decimals) found in a given string, as a list of strings in order of appearance.

**Example:**

```python
extract_numbers("The values are 12, 45.6, and 789.")
```

**Expected Output:**
```text
['12', '45.6', '789']
```

```python
def extract_numbers(text):
    # YOUR CODE HERE
    pattern = r'<Insert your regex pattern here.>'
    return re.findall(pattern, text)

# Sample check
extract_numbers("The cost is 45.5 and ID is 12345")
```

```python
### BEGIN HIDDEN TESTS
assert extract_numbers("no numbers") == []
assert extract_numbers("Pi is 3.14 and e is 2.71") == ['3.14', '2.71']
assert extract_numbers("100 200 300") == ['100', '200', '300']
### END HIDDEN TESTS
```

## Section 2: Parsing CSV Data

### Example: Reading a CSV String

```python
csv_data = '''Name,Marks
Alice,90
Bob,85
Charlie,78
'''
lines = csv_data.strip().split('\n')
for line in lines[1:]:
    name, marks = line.split(',')
    print(name, "scored", marks)
```

Output:
```text
Alice scored 90
Bob scored 85
Charlie scored 78
```
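The manual `split(',')` approach breaks when a field itself contains a comma (for example a quoted name such as `"Doe, Jane"`). As a sketch of the alternative named in the learning objectives, the standard-library `csv` module handles quoting for us. The variable names `reader`, `header`, and `rows` are choices made for this illustration.

```python
import csv
import io

csv_data = '''Name,Marks
Alice,90
Bob,85
Charlie,78
'''

# csv.reader expects an iterable of lines, so wrap the string in StringIO.
reader = csv.reader(io.StringIO(csv_data))
header = next(reader)                   # first row: ['Name', 'Marks']
rows = [row for row in reader if row]   # skip any blank lines

for name, marks in rows:
    print(name, "scored", marks)
```

This prints the same three lines as the manual version. Each field is still a string, so `marks` must be converted with `int()` before doing arithmetic.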

## Task 3: Parse CSV and Return Dictionary (10 marks)

Write a function `parse_csv(data)` that takes a **CSV string** and returns a **dictionary** with names as keys and marks as integer values. The first line of the string is a header row and should not appear in the result.

**Example:**

```python
csv_data = "Name,Marks\nAlice,90\nBob,85\nCharlie,78"
print(parse_csv(csv_data))
```

**Expected Output:**
```text
{'Alice': 90, 'Bob': 85, 'Charlie': 78}
```

```python
def parse_csv(data):
    # YOUR CODE HERE
    pass  # placeholder so the cell runs; replace with your implementation

# Sample check
# parse_csv("Name,Marks\nA,1\nB,2") # -> Uncomment this line to check whether your function runs without error and gives expected output.
```

```python
### BEGIN HIDDEN TESTS
assert parse_csv("Name,Marks\nX,10\nY,20") == {'X':10, 'Y':20}
assert 'Alice' in parse_csv("Name,Marks\nAlice,90\nBob,80")
### END HIDDEN TESTS
```

## Section 3: Simple Web Scraping

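We can download a page with `requests` and extract data from its HTML using `re`. Regular expressions work for simple, predictable markup like the examples in this lab; they are not a general-purpose HTML parser.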

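```python
import re

import requests

url = "https://example.com"
# A timeout keeps the request from hanging indefinitely.
response = requests.get(url, timeout=10)
# Raise an exception on 4xx/5xx responses instead of parsing an error page.
response.raise_for_status()
text = response.text
# The non-greedy (.*?) stops at the first closing tag.
titles = re.findall(r"<title>(.*?)</title>", text)
print(titles)
```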
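The example above needs network access. To practise the extraction step offline, the same idea can be applied to an HTML string held in memory. The helper name `extract_title` and the sample markup are assumptions made for this sketch; the flags make the pattern tolerate upper-case tags and titles that span several lines.

```python
import re

def extract_title(html):
    """Return the text of the first <title> element, or None (hypothetical helper)."""
    # IGNORECASE accepts <TITLE>; DOTALL lets '.' match across line breaks.
    match = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None

# Illustrative markup (an assumption, not fetched from a real site).
sample = "<html><head><TITLE>\n  Example Domain\n</TITLE></head><body></body></html>"
print(extract_title(sample))                  # Example Domain
print(extract_title("<p>no title here</p>"))  # None
```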

## Task 4: Extract All URLs from HTML (10 marks)

Write a function `extract_urls(html)` that returns a list of the URLs found in all `href="..."` attributes of an HTML string, in order of appearance.

**Example:**

```python
html = '<a href="https://iitm.ac.in">IITM</a> <a href="https://maths.org">Maths</a>'
print(extract_urls(html))
```

**Expected Output:**
```text
['https://iitm.ac.in', 'https://maths.org']
```

```python
def extract_urls(html):
    # YOUR CODE HERE
    pattern = r'<Insert your regex pattern here.>'
    return re.findall(pattern, html)

# Sample check
extract_urls('<a href="https://google.com">Google</a>')
```

```python
### BEGIN HIDDEN TESTS
assert extract_urls('<a href="https://a.com">A</a> <a href="https://b.org">B</a>') == ['https://a.com', 'https://b.org']
assert len(extract_urls('No links')) == 0
### END HIDDEN TESTS
```

## Section 4: Storing Scraped Data in a Dictionary

Let’s now combine regex and data structures: `re.findall` extracts the raw matches, and a dictionary stores them under meaningful keys.
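As a minimal sketch of that combination, a pattern with two capture groups yields `(key, value)` tuples that can be fed straight into a dictionary. The `name=marks` input format below is an assumption chosen for illustration; it is deliberately not the HTML format used in Task 5.

```python
import re

# Illustrative input (an assumption for this sketch).
text = "alice=90; bob=85; charlie=78"

# Two groups per match, so findall yields (name, marks) tuples.
pairs = re.findall(r'(\w+)=(\d+)', text)

# Build the dictionary, converting the marks from str to int.
marks = {name: int(value) for name, value in pairs}
print(marks)   # {'alice': 90, 'bob': 85, 'charlie': 78}
```

If the same key appears more than once, the later value overwrites the earlier one, which is worth keeping in mind when link text is used as a key.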

## Task 5: Create Dictionary of Names and URLs (10 marks)

Write a function `name_url_dict(html)` that takes an HTML string containing multiple anchor tags, extracts each **link text (name)** and **URL**, and returns a dictionary where the **key = link text** and **value = URL**.

**Example:**

```python
html = '<a href="https://iitm.ac.in">IITM</a> <a href="https://maths.org">Maths</a>'
print(name_url_dict(html))
```

**Expected Output:**
```text
{'IITM': 'https://iitm.ac.in', 'Maths': 'https://maths.org'}
```

```python
def name_url_dict(html):
    # YOUR CODE HERE
    pass  # placeholder so the cell runs; replace with your implementation

# Sample check
# name_url_dict('<a href="https://abc.com">ABC</a>') # -> Uncomment this line to check whether your function runs without error and gives expected output.
```

```python
### BEGIN HIDDEN TESTS
assert name_url_dict('<a href="https://a.com">A</a>') == {'A': 'https://a.com'}
assert 'Maths' in name_url_dict('<a href="https://maths.org">Maths</a>')
### END HIDDEN TESTS
```