In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab.ipynb")

# Lab 7 – Regular Expressions

## DSC 80, Fall 2023

### Due Date: Monday, November 20th at 11:59 PM

## Instructions
Welcome to the seventh lab assignment in DSC 80 this quarter!

Much like in DSC 10, this Jupyter Notebook contains the statements of the problems and provides code and Markdown cells to display your answers to the problems. Unlike DSC 10, the notebook is *only* for displaying a readable version of your final answers. The coding will be done in an accompanying `lab.py` file that is imported into the current notebook, and **you will only submit that `lab.py` file**, not this notebook!

Some additional guidelines:
- **Unlike in DSC 10, labs will have both public tests and hidden tests.** The bulk of your grade will come from your scores on hidden tests, which you will only see on Gradescope after the assignment deadline.
- **Do not change the function names in the `lab.py` file!** The functions in the `lab.py` file are how your assignment is graded, and they are graded by their name. If you changed something you weren't supposed to, you can find the original code in the [course GitHub repository](https://github.com/dsc-courses/dsc80-2023-fa).
- Notebooks are nice for testing and experimenting with different implementations before designing your function in your `lab.py` file. You can write code here, but make sure that all of your real work is in the `lab.py` file, since that's all you're submitting.
- **To ensure that all of your work to be submitted is in `lab.py`, we've provided an additional uneditable notebook, called `lab-validation.ipynb`, that contains only the tests and their setup. Make sure you are able to run it top-to-bottom without error before submitting!**
- You are encouraged to write your own additional helper functions to solve the lab, as long as they also end up in `lab.py`.

**Importing code from `lab.py`**:

* Below, we import the `.py` file that's contained in the same directory as this notebook.
* We use the `autoreload` notebook extension to make changes to our `lab.py` file immediately available in our notebook. Without this extension, we would need to restart the notebook kernel to see any changes to `lab.py` in the notebook.
    - `autoreload` is necessary because, upon import, `lab.py` is compiled to bytecode (in the directory `__pycache__`) and the loaded module is cached in `sys.modules`. Subsequent imports of `lab` merely reuse the already-loaded module instead of re-reading `lab.py`.
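
To see what `autoreload` saves you from, here is a minimal sketch of the same behavior done by hand. The module `toy_lab` is made up for illustration and is written to a temporary directory; it is not part of the lab. The sketch assumes only the standard library's `importlib.reload`.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for lab.py.
    module_path = Path(tmp) / "toy_lab.py"
    module_path.write_text("ANSWER = 1\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import toy_lab
        first = toy_lab.ANSWER

        # Edit the file on disk, as you would edit lab.py in an editor.
        module_path.write_text("ANSWER = 2222\n")

        # A second import reuses the module cached in sys.modules.
        import toy_lab
        after_reimport = toy_lab.ANSWER

        # An explicit reload re-executes the source file.
        importlib.reload(toy_lab)
        after_reload = toy_lab.ANSWER
    finally:
        # Clean up so the toy module does not linger in this session.
        sys.path.remove(tmp)
        sys.modules.pop("toy_lab", None)
```

With `%autoreload 2`, the reload step happens automatically before each cell runs, so you never have to do this yourself.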

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from lab import *

In [None]:
import pandas as pd
import numpy as np
import os
from pathlib import Path
import re

## Question 1 – Practice with Regular Expressions 🛠
<div class="alert alert-block alert-warning">
<b>Note</b>: The functions in this question all have doctests in their docstrings. The doctests constitute the public tests; we will still run hidden tests on each of your functions. <b>Please don't change any of the docstrings!</b>
</div>

Regular expressions can be tricky, and the best way to gain familiarity with them is through lots of practice. In this question, you will work through ten exercises, each of which requires you to write a regular expression that matches strings that satisfy certain criteria. Make sure to take a close look at the doctests for each function in `lab.py`, as they provide useful guidance for the types of strings you should and shouldn't match.

***Notes:*** 
- Make sure to refer to the [Regular Expression Resources](https://dsc80.com/resources/#regular-expressions) posted on the course website. In particular, we recommend having [regex101.com](https://regex101.com/) open while working, along with the [cheat sheet](https://dsc80.com/resources/other/berkeley-regex-reference.pdf).

- Each exercise has a star rating, between 1 (⭐️) and 3 (⭐️⭐️⭐️) stars, indicating its difficulty level (1 being the easiest, 3 being the hardest). If you are spending lots of time on 1-star exercises, take a close look at the syntax from lecture, as there is probably an easier way of writing the necessary pattern!
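
If you'd rather try patterns directly in Python, the sketch below shows one way to do it. The helper `matches` is hypothetical (it is not in `lab.py`), and the toy patterns are unrelated to the exercises; the point is only to show how anchors change what a pattern accepts when it is tested with `re.search`.

```python
import re

def matches(pattern, string):
    """Hypothetical helper: True if `pattern` matches anywhere in `string`."""
    return re.search(pattern, string) is not None

unanchored = r'\d{3}'    # three digits anywhere in the string
anchored = r'^\d{3}$'    # the whole string is exactly three digits

test_strings = ['123', 'ab1234', '12']
results = {
    'unanchored': [matches(unanchored, s) for s in test_strings],
    'anchored': [matches(anchored, s) for s in test_strings],
}
```

Here `'ab1234'` passes the unanchored pattern but fails the anchored one, which is a common source of failed hidden tests.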

<br>

### Exercise 1 (⭐️)

Write a regular expression that matches strings that have `'['` as the third character and `']'` as the sixth character.

<br>

### Exercise 2 (⭐️)

Write a regular expression that matches strings that are phone numbers that start with `'(858)'` and follow the format `'(xxx) xxx-xxxx'` (`'x'` represents a digit).

***Note:*** There is a space between `'(xxx)'` and `'xxx-xxxx'`.

<br>

### Exercise 3 (⭐️)

Write a regular expression that matches strings that:
- are between 6 and 10 characters long (inclusive),
- contain only alphanumeric characters, whitespace and `'?'`, and
- end with `'?'`.

<br>

### Exercise 4 (⭐️⭐️)

Write a regular expression that matches strings with exactly two `'$'`, one of which is at the start of the string, such that:
- the characters between the two `'$'` can be anything (including nothing) except the lowercase letters `'a'`, `'b'`, and `'c'`, (and `'$'`), and
- the characters after the second `'$'` can only be the **lowercase or uppercase** letters `'a'`/`'A'`, `'b'`/`'B'`, and `'c'`/`'C'`, with every `'a'`/`'A'` before every `'b'`/`'B'`, and every `'b'`/`'B'` before every `'c'`/`'C'`. There must be at least one `'a'` or `'A'`, at least one `'b'` or `'B'`, and at least one `'c'` or `'C'`.
    

<br>

### Exercise 5 (⭐️)
Write a regular expression that matches strings that represent valid Python file names, including the extension. 

***Note:*** For simplicity, assume that file names only contain letters, numbers, and underscores (`'_'`).

<br>

### Exercise 6 (⭐️)
Write a regular expression that matches strings that:
- are made up of only lowercase letters and exactly one underscore (`'_'`), and
- have at least one lowercase letter on both sides of the underscore.

<br>

### Exercise 7 (⭐️)
Write a regular expression that matches strings that start with and end with an underscore (`'_'`).

<br>

### Exercise 8 (⭐️)

Apple serial numbers are strings of length 1 or more that are made up of any characters, other than
- the uppercase letter `'O'`, 
- the lowercase letter `'i'`, and 
- the number `'1'`.

Write a regular expression that matches strings that are valid Apple serial numbers.

<br>

### Exercise 9 (⭐️⭐️)

ID numbers are formatted as `'SC-NN-CCC-NNNN'`, where 
- SC represents state code in uppercase (e.g. `'CA'`),
- NN represents a number with 2 digits (e.g. `'98'`),
- CCC represents a three letter city code in uppercase (e.g. `'SAN'`), and
- NNNN represents a number with 4 digits (e.g. `'1024'`).

Write a regular expression that matches strings that are ID numbers corresponding to the cities of `'SAN'` or `'LAX'`, or the state of `'NY'`. Assume that there is only one city named `'SAN'` and only one city named `'LAX'`.

<br>

### Exercise 10 (⭐️⭐️⭐️)

Write a function named `match_10` that takes in a string and:
- converts the string to lowercase,
- removes all non-alphanumeric characters (i.e. removes everything that is not in the `\w` character class), and the letter `'a'`, and
- returns a list of every **non-overlapping** three-character substring in the remaining string, starting from the beginning of the string.
   
For instance, consider the following doctest:

```py
>>> match_10('Ab..DEF')
['bde']
```

Here's how `match_10` should process `'Ab..DEF'`:

1. Convert to lowercase: `'ab..def'`.
2. Remove non-alphanumeric characters and the letter `'a'`: `'bdef'`.
3. Starting from the beginning of the string, there is only a single non-overlapping three character substring: `'bde'`. Hence, we return `['bde']`.

***Note:*** Perform your operations in the exact order described above, otherwise your code may not pass all the tests. Don't use a `for`-loop.
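
As a reminder of the tools involved, here is a small sketch on a different task (digits rather than letters). It relies on two facts about Python's `re` module: `re.sub` replaces every match of a pattern, and `re.findall` scans left to right and returns non-overlapping matches. The input strings are made up for illustration.

```python
import re

# re.sub removes every character matched by the pattern.
only_digits = re.sub(r'[^0-9]', '', 'a1-b2_c3')

# re.findall never reuses a character, so the chunks do not overlap;
# the leftover '7' is too short to form a chunk and is dropped.
pairs = re.findall(r'\d{2}', '1234567')
```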

In [None]:
grader.check("q1")

## Question 2 – Capturing Groups in Regular Expressions 📡

The dataset stored in `data/messy.txt` contains personal information from a fictional website that a user scraped from web server logs. Within this dataset, there are four fields that are of interest to you:
1. Email Addresses (assume they are alphanumeric usernames and domain names)
2. [Social Security Numbers](https://en.wikipedia.org/wiki/Social_Security_number#Structure)
3. Bitcoin Addresses (long alphanumeric strings)
4. Street Addresses

Complete the implementation of the function `extract_personal`, which takes in a string containing the contents of a server log file (like `open('data/messy.txt').read()`) and returns a **tuple of four separate lists** containing values of the 4 pieces of information listed above (in the order listed above). Do **not** keep empty values.

***Note:*** Since this data is messy, your function will be allowed to miss ~5% of the records in each list. Good spot checking using certain useful substrings (e.g. `'@'` for emails) should help assure correctness! Your function will be tested on a sample of the file `messy.txt`.

***Hint:*** There are multiple "delimiters" in use in the file; there are few enough of them that you can safely determine what they are.
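
The sketch below illustrates the two ideas this question leans on, using a made-up string rather than the contents of `messy.txt`: splitting on several delimiters at once with a character class, and pulling out pieces of each match with capturing groups. The delimiters and field names here are assumptions for illustration only; you will need to inspect the real file to find its delimiters.

```python
import re

toy = 'name:alice|age:30,city:SD'

# Split on either '|' or ',' in a single call.
fields = re.split(r'[|,]', toy)

# With two capturing groups, findall returns one tuple per match.
pairs = re.findall(r'(\w+):(\w+)', toy)
```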

In [None]:
# experiment with extract_personal using the file s below
fp = Path('data') / 'messy.txt'
s = open(fp, encoding='utf8').read()

In [None]:
# don't change this cell, but do run it -- it is needed for the tests
test_fp = Path('data') / 'messy.txt'
test_s = open(test_fp, encoding='utf8').read()
emails, ssn, bitcoin, addresses = extract_personal(test_s)

In [None]:
grader.check("q2")

## Question 3 – TF-IDF 📊

The dataset `data/reviews.txt` contains [Amazon reviews](https://cseweb.ucsd.edu/~jmcauley/datasets.html#amazon_reviews/) for ~200k phones and phone accessories. The dataset has already been "cleaned" for you. In this question, you will complete the implementation of a function, which takes in the reviews dataset as a Series (with one entry per review) as well as a single review, and returns the word that "best summarizes the single review" using TF-IDF.

To do so, implement the two functions below.

#### `tfidf_data`
You will first need to decompose the review into individual words, count their frequencies, and calculate the TF-IDF score for each word. Complete the implementation of the function `tfidf_data`, which takes in the reviews data as a Series (`reviews_ser`) and a single review (`review`) from that Series and returns a DataFrame indexed by the words in `review` with four columns:
- `'cnt'`: the number of times each word is found in the review (telling how frequently a word is used in the `review`)
- `'tf'`: the term frequency for each word (giving an idea of how important each word is relative to the other words in the `review`)
- `'idf'`: the inverse document frequency for each word (helping understand how important each word is to the whole `reviews_ser`)
- `'tfidf'`: the TF-IDF for each word (measuring how important each word is to `review`, taking into account both the frequency of the word in the `review` and across `reviews_ser`)

You may use a `for`-loop. The words in the outputted DataFrame may appear in any order.

***Hint:*** You may need to use the [`'\b'` character](https://www.regular-expressions.info/wordboundaries.html) somewhere.
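
To make the hint concrete, here is a minimal sketch on a made-up three-document corpus. It assumes one common definition of TF-IDF (term frequency as count divided by the number of words in the document, inverse document frequency as the natural log of the number of documents divided by the number of documents containing the word); check the lecture notes for the exact definition expected here.

```python
import math
import re

# Why word boundaries matter: 'cat' also appears inside 'concatenate'.
text = 'cat concatenate cat'
substring_count = text.count('cat')
word_count = len(re.findall(r'\bcat\b', text))

# Toy corpus, not the reviews data.
corpus = ['the phone is great', 'the case broke', 'great great battery']
document = corpus[2]
word = 'great'
pattern = rf'\b{re.escape(word)}\b'

cnt = len(re.findall(pattern, document))       # occurrences in this document
tf = cnt / len(document.split())               # assumed definition of tf
docs_with_word = sum(re.search(pattern, doc) is not None for doc in corpus)
idf = math.log(len(corpus) / docs_with_word)   # assumed definition of idf
tfidf = tf * idf
```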
    
<br>

#### `relevant_word`

Complete the implementation of the function `relevant_word`, which takes in the DataFrame that `tfidf_data` returns and returns the word that "best summarizes" the review. If there are multiple "best" summary words, return any one of them.

In [None]:
# experiment with tfidf_data using reviews_ser and review below 
fp = Path('data') / 'reviews.txt'
reviews_ser = pd.read_csv(fp, header=None).squeeze("columns")
review = open(Path('data') / 'review.txt', encoding='utf8').read().strip()

In [None]:
# don't change this cell, but do run it -- it is needed for the tests
fp = os.path.join('data', 'reviews.txt')
reviews_ser = pd.read_csv(fp, header=None).squeeze("columns")
review = open(os.path.join('data', 'review.txt'), encoding='utf8').read().strip()
q3_tfidf = tfidf_data(reviews_ser, review)

try:
    q3_rel = relevant_word(q3_tfidf)
except:
    q3_rel = None

In [None]:
grader.check("q3")

## Questions 4 and 5 – Tweet Analysis 🐥

The dataset `data/ira.csv` contains tweets tagged by Twitter as likely being posted by the [Internet Research Agency](https://en.wikipedia.org/wiki/Internet_Research_Agency), the tweet factory facing allegations of attempting to influence US political elections.

Questions 4 and 5 will focus on the following:
- Question 4: Look at the hashtags present in the text and trends in their makeup.
- Question 5: Prepare the dataset for modeling by creating features out of the text fields.

### Question 4 – Hashtags #️⃣

You may assume that a hashtag is any string without whitespace that immediately follows a `'#'`.

#### `hashtag_list`

Complete the implementation of the function `hashtag_list`, which takes in a Series of tweet texts and returns a Series containing a list of hashtags present in each tweet's text. If a tweet's text doesn't contain a hashtag, the Series should contain an empty list for that tweet. Don't include the `'#'` symbol in the lists that are returned.

<br>

#### `most_common_hashtag`

Complete the implementation of the function `most_common_hashtag`, which takes in a Series of hashtag lists (as is outputted by `hashtag_list`) and returns a Series consisting of a single hashtag per tweet: 
- If the tweet's text has no hashtags, the entry in the output Series should be `NaN`.
- If the tweet's text has one distinct hashtag, the entry in the output Series should be that hashtag.
- If the tweet's text has more than one hashtag, the entry in the output Series should be the most common hashtag **in the tweet's text** with respect to **the whole input Series**. If there is a tie for the most common, any of the most common can be returned.
    - For example, if the input Series was `pd.Series([[2], [2], [3, 2, 3]])`, the output would be `pd.Series([2, 2, 2])`. Even though `3` was more common in the third list than `2`, `2` is more common than `3` among all hashtags in the Series.
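
To see where the counts in that example come from, the sketch below flattens the example Series and tallies every entry. It only reproduces the counting step, not the full function, and uses `Series.explode` and `Series.value_counts` as one possible way to do it.

```python
import pandas as pd

example = pd.Series([[2], [2], [3, 2, 3]])

# One row per list element, then count occurrences across the whole Series.
overall_counts = example.explode().value_counts()

count_of_2 = overall_counts.loc[2]
count_of_3 = overall_counts.loc[3]
```

Since `2` appears three times overall and `3` only twice, `2` is the answer for the third list even though `3` is more frequent within that list.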

In [None]:
# The public tests don't test your work on the `ira` data,
# but the hidden tests do.
# So, make sure to thoroughly test your work yourself!
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])

In [None]:
grader.check("q4")

### Question 5 – Features 📋

Now, create a DataFrame of features from the `ira` data.  That is, create a function `create_features` that takes in a DataFrame `ira` that has just a single column, `'text'`, and returns a DataFrame with the same index as `ira` (i.e. the rows correspond to the same tweets) and the following columns:
* `'num_hashtags'`, the number of hashtags present in the tweet.
* `'mc_hashtags'`, the most common hashtag associated to the tweet (using the result of `most_common_hashtag` from Question 4).
* `'num_tags'`, the number of tags the tweet has (look for the presence of `'@'`).
* `'num_links'`, the number of hyperlinks present in the tweet.
    - A hyperlink is a string starting with `'http://'` or `'https://'`, not followed by whitespace.
* `'is_retweet'`, a Boolean describing whether the tweet is a retweet. A retweet is a tweet that **begins** with `'RT'`.
* `'text'`, a version of the tweet's text that is cleaned according to the following steps, **in this exact order**:
    1. All meta-information above (retweet info, tags, hyperlinks, and hashtags) should be replaced with a single space.
    2. Everything other than letters, numbers, and spaces should be replaced with a single space.
    3. All letters should be lowercase.
    4. All words should be separated by exactly one space, and leading/trailing whitespace should be removed (stripped).
    
The columns in the outputted DataFrame must be in the order `['text', 'num_hashtags', 'mc_hashtags', 'num_tags', 'num_links', 'is_retweet']`. (Remember, the DataFrame that `create_features` is called on only has a single column, `'text'`.)

***Notes:***
- It's a good idea to make a helper function for each column.
- The `\w` character class in regex **does not** refer to letters, numbers, and spaces (or even just letters and numbers). As such, you can't use it here!
- `create_features` will take a while to run on the entire dataset – test it on a small sample first!
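
The sketch below shows why `\w` is not a substitute for "letters, numbers, and spaces". In Python 3, `\w` on a `str` pattern also matches the underscore and non-ASCII letters, and it does not match spaces. The sample string is made up, and the explicit character class is just one possible reading of "letters and numbers"; whether accented letters appear in the tweets is something to check yourself.

```python
import re

sample = 'a_b\u00e9 1!'   # '\u00e9' is the accented letter e-acute

# \w picks up the underscore and the accented letter, but not the space.
word_chars = re.findall(r'\w', sample)

# An explicit class keeps ASCII letters, digits, and spaces only.
explicit_chars = re.findall(r'[A-Za-z0-9 ]', sample)
```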

In [None]:
# The doctests/public tests don't test your work on the `ira` data,
# but the hidden tests do.
# So, make sure to thoroughly test your work yourself!
fp = os.path.join('data', 'ira.csv')
ira = pd.read_csv(fp, names=['id', 'name', 'date', 'text'])

In [None]:
# don't change this cell, but do run it -- it is needed for the tests
# (yes, we know it says "hidden" – there are still truly hidden tests in this question)
fp_hidden = 'data/ira_test.csv'
ira_hidden = pd.read_csv(fp_hidden, header=None)
text_hidden = ira_hidden.iloc[:, -1:]
text_hidden.columns = ['text']

test_hidden = create_features(text_hidden)

In [None]:
grader.check("q5")

## Congratulations! You're done with Lab 7! 🏁

As a reminder, all of the work you want to submit needs to be in `lab.py`.

To verify that all of your work is indeed in `lab.py`, and that you didn't accidentally implement a function in this notebook and not in `lab.py`, we've included another notebook in the lab folder, called `lab-validation.ipynb`. `lab-validation.ipynb` is a version of this notebook with only the `grader.check` cells and the code needed to set up the tests. 

### **Go to `lab-validation.ipynb`, and go to Kernel > Restart & Run All.** This will check if all `grader.check` test cases pass using just the code in `lab.py`.

Once you're able to pass all test cases in `lab-validation.ipynb`, including the call to `grader.check_all()` at the very bottom, then you're ready to submit your `lab.py` (and only your `lab.py`) to Gradescope. After submitting to Gradescope, make sure to stick around until all test cases pass.

There is also a call to `grader.check_all()` below in _this_ notebook, but make sure to also follow the steps above.

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()