---
---
Problem Set 3: Regular Expressions

Applied Data Science using Python

New York University, Abu Dhabi

Out: 20th Sept 2023 || **Due: 27th Sept 2023 at 23:59**

---
---
# Start Here
## Learning Goals
### General Goals
- Learn the basics of text processing
- Learn the basics of regular expressions

### Specific Goals
- Learn basic regex functions and operators
- Learn patterns and character classes
- Learn how to use quantifiers
- Learn how to use groups
- Learn about look-ahead and look-behind matching

## Collaboration Policy
- You are allowed to talk with / work with other students on homework assignments.
- You can share ideas but not code, analyses or results; you must submit your own code and results. All submitted code will be compared against all code submitted this and previous semesters and online using MOSS. We will also critically analyze the similarities in the submitted reports, methodologies, and results, **but we will not police you**. We expect you all to be mature and responsible enough to finish your work with full integrity.
- You are expected to comply with the [University Policy on Academic Integrity and Plagiarism](https://www.nyu.edu/about/policies-guidelines-compliance/policies-and-guidelines/academic-integrity-for-students-at-nyu.html). Violations may result in penalties, such as failure in a particular assignment.

## Late Submission Policy
You can submit the homework up to 3 days late. However, we will deduct **20 points** from your homework grade **for each late day you take**. We will not accept the homework after 3 late days.

## Distribution of Class Materials
These problem sets and recitations are the intellectual property of NYUAD. We ask students **not** to distribute them or their solutions to students who have not signed up for this class or who intend to sign up in the future. We also ask that you do not post these problem sets or recitations online or on any public platform.

## Disclaimer
The number of points does not necessarily correlate with the difficulty of a task.

## Submission
You will submit all your code as a Python Notebook through [Brightspace](https://brightspace.nyu.edu/).

---




# General Instructions
This homework is worth 100 points. It has 2 parts. Below each part, we provide a set of concepts required to complete that part. All the parts need to be completed in the Jupyter (Colab) Notebook attached to this handout. We recommend that you read the complete handout before starting the homework.



# Part I: He Who Must Not Be Named (40 points)

The year is 1988. Your name is Barnabas Cuffe. You live in Great Britain and work for the **Ministry of Magic**. You have been assigned the important task of making sure that the wizarding newspaper called the **Daily Prophet** only reports stories that comply with the rules of the Ministry of Magic.

It has come to the Minister's attention that the upcoming issue of the Daily Prophet makes several references to the **Unspeakables** -- *people who should not be named*.

Your job is to wipe out the names of the Unspeakables and anonymize them, replacing them with the Ministry's preferred name, **John Smith**.

There are far too many articles in this issue for the number of employees in the Ministry of Magic to manually comb over and make the relevant adjustments. The Ministry would like to write a program to do this automatically.


## Example

Suppose we have a segment of text as follows, where the name `Gareth Greengrass` is one of the *Unspeakables*, and hence is on the banned list:

`I believe that Gareth Greengrass is an amazing golfer. Gareth Greengrass’ abilities are far beyond my own. My favorite golfer is Gareth Greengrass, and I have a shirt with GARETH GREENGRASS printed on it.
The article named Gareth-Greengrass was published yesterday. Greengrass is a service that extends Amazon Web Services functionality to Internet of Things.`

After careful processing by your program, this segment should read:

`I believe that John Smith is an amazing golfer. John Smith’ abilities are far beyond my own. My favorite golfer is John Smith, and I have a shirt with GARETH GREENGRASS printed on it.
The article named Gareth-Greengrass was published yesterday. Smith is a service that extends Amazon Web Services functionality to Internet of Things.`


## Prompt

Implement the function `clean` that takes in three arguments: a `list` of banned full names, a `list` of banned last names, and an input `string` to process. It should return as output the input string after the replacement of the banned full names and last names, if any.

Specifically, the function should:

1. Replace all instances of the banned full names with the officially approved name `John Smith`.
2. Replace all banned last names with the officially approved last name `Smith`.

Your implementation will be run against a variety of test cases that check both the normal and the edge-case behavior of your code. It is thus important that you implement exactly what is specified, no more and no less.

### Some clarifications:

1. *Full name definition*: An instance of a full name always consists of two words, separated only by some whitespace. Each word must be properly capitalized (the first letter of each word capitalized, the rest lowercase) for it to be a full name. Similarly, an instance of a last name is always a single word that is properly capitalized.

2. *Other libraries*: You should not need to use any libraries other than the standard Python libraries.

3. *Regex vs. other methods*: You are required to implement a Regex based solution to this problem.

4. *Whitespace*: Your solution should preserve the original whitespace in the input source, if any. Do not add, remove, or replace any whitespace.

5. *Can a last name be a first name?*: Yes, but you should prioritize full name replacement over last name replacement if possible.

6. *Is that (e.g. Greengrass) a last name or something else?* - You have no way of knowing at this point, so if it looks like a last name, it is a last name. Replace it.

7. *Helper Functions*: You can write helper functions if you'd like.

8. We highly recommend you read this [documentation](https://docs.python.org/3.6/library/re.html) before attempting this part, especially the sections on functions such as `re.sub()` and `re.compile()`.
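
As a warm-up for the functions mentioned above, here is a minimal sketch of `re.compile()` and `re.sub()` used with word boundaries (`\b`), a capture group, and a backreference. The phrase `Ada Lovelace` and the replacement `Jane Doe` are made-up examples, not part of the assignment. The sketch only illustrates the mechanics and is not a solution to the prompt.

```python
import re

# Hypothetical phrase for illustration only (not from any banned list)
# \b    : word boundary, so the phrase is not matched inside a longer word
# (\s+) : captures the whitespace between the two words
phrase_pattern = re.compile(r'\bAda(\s+)Lovelace\b')

sample_text = "Ada Lovelace wrote notes. ADA LOVELACE and Ada-Lovelace stay. Ada\tLovelace too."

# \1 re-inserts the captured whitespace, so the original spacing is preserved
replaced_text = phrase_pattern.sub(r'Jane\1Doe', sample_text)
print(replaced_text)
```

The match is case-sensitive by default, so the all-caps and hyphenated variants are left untouched, and the tab between the two words in the last occurrence survives the replacement.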


In [40]:
import re

def clean(banned_fn_lst, banned_ln_lst, input_text):

  """You can assume that the list of last names is the set of unique last names
  derived from the list of full names, i.e.
  banned_ln_lst = list(set([s.split()[-1] for s in banned_fn_lst]))

  :param banned_fn_lst: list of censored full names to be replaced
  :param banned_ln_lst: list of censored last names to be replaced
  :param input_text: input text to process
  :returns clean_str: output text with censored full names and last names removed
  """

  clean_str = "" #output string that will have the names replaced with John Smith

  # Ministry approved full name and last name for your use to replace names with
  replacement_full_name = "John Smith"
  replacement_last_name = "Smith"

  # Please write your implementation below this line
  ######### SOLUTION #########

  ######### SOLUTION END #########

  return clean_str

In [None]:
# Example of how we will call your function with different input/output pairs
# Example pair 1
# Given input
input_text = "I believe that Gareth Greengrass is an amazing golfer. Gareth Greengrass' "\
      "abilities are far beyond my own. My favorite golfer is Gareth Greengrass, "\
      "and I have a shirt with GARETH GREENGRASS printed on it. The article named "\
      "Gareth-Greengrass was published yesterday. Greengrass is a service that extends "\
      "Amazon Web Services functionality to Internet of Things."

# Expected output
output_text = "I believe that John Smith is an amazing golfer. John Smith' "\
      "abilities are far beyond my own. My favorite golfer is John Smith, "\
      "and I have a shirt with GARETH GREENGRASS printed on it. The article named "\
      "Gareth-Greengrass was published yesterday. Smith is a service that extends "\
      "Amazon Web Services functionality to Internet of Things."

# Example pair 2
# Given input
input_text2 = "Samuel Jones was a tall man, but not in an unreachable way. Jones used to play poker in an inn near his house. "\
        "Samuel was so famous, that the inn had a wall with SAMUEL-JONES painted on it. Jones' favourite drink was bourbon on the rocks."

# Expected output
output_text2 = "John Smith was a tall man, but not in an unreachable way. Smith used to play poker in an inn near his house. "\
            "Samuel was so famous, that the inn had a wall with SAMUEL-JONES painted on it. Smith' favourite drink was bourbon on the rocks."

# If your implementation is correct, this line should not give any error
assert(clean(["Gareth Greengrass"],["Greengrass"],input_text) == output_text)
assert(clean(["Samuel Jones"],["Jones"],input_text2) == output_text2)

## *Concepts required to complete this task*

*   Regex Groups
*   Regex Quantifiers
*   Regex Set Operator
*   Regex Anchors
*   Optionally `re.compile()`
*   Optionally `re.sub()`
*   Optionally `map()` function
*   Optionally `lambda` functions

## Rubric

- +30 points for correctness (proper usage of regex library to achieve the desired output)
- +5 points for conciseness (code and the regex usage is concise)
- +5 points for proper comments and variable names

# Part II: Data Exploration: Analyzing COVID-19 Misinformation On Twitter (60 points)

With the emergence of the COVID-19 pandemic, political and medical misinformation surged, creating what is commonly referred to as the global **infodemic**. A large share of the false information about COVID-19 was spread via Twitter. In this part, you will use your knowledge of regular expressions to explore Twitter data on the COVID-19 discourse.

Before we describe the prompt, let us look at the data.


In [42]:
# Loading the dataset

# In Google Colab, you need to mount your drive to be able to access your files. If you are running Jupyter Notebook locally, you can skip this step.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# Let us now read the tweets as list of lines
path = "/content/drive/My Drive/Fall 2023/Applied Data Science/Dataset/PS3_covid19_misinformation_data.txt" # edit the path to your folder containing the data file
tweets = ""
with open(path,"r") as file:
    # we'll read that into a variable called tweets
    tweets=file.readlines()

# The bare `tweets` expression below displays the list; comment it out to hide the output
tweets

The data we are working with has 4573 tweets with each line having a username, datetime, tweet, bot probability, and category class, all separated by a whitespace character, two pipes `||`, and another whitespace character. Let's look at an example line from the file above to understand this better:

`twitmo20 || Tue Jul 07 19:19:40 +0000 2020 || Imagine you are a Democratic and were told that COVID is a Bioweapon used on the people in another attempt to destr… https://t.co/JJpXaIqRt8 || 0.7716583196 || politics`

In the above line, `twitmo20` is the **username** of the user who posted the tweet, `Tue Jul 07 19:19:40 +0000 2020` is the **date and time** of the tweet, `Imagine you are a Democratic and were told that COVID is a Bioweapon used on the people in another attempt to destr… https://t.co/JJpXaIqRt8` is the **tweet** text itself, `0.7716583196` is the probability that the tweet came from a **bot** account *(here, roughly a 77% probability)*, and finally `politics` means the tweet has been **categorized** or **annotated** as a political tweet.
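
To make the field layout concrete, here is a minimal sketch that splits the example line into its five fields using named groups. The names `LINE_PATTERN` and `fields` are illustrative choices, and the pattern assumes every line follows the ` || `-separated layout described above.

```python
import re

# Illustrative pattern: one named group per field, separated by " || "
LINE_PATTERN = re.compile(
    r'^(?P<username>\S+) \|\| (?P<datetime>.+?) \|\| (?P<tweet>.*) \|\| '
    r'(?P<botprob>[0-9.]+) \|\| (?P<category>[^|]+)$'
)

sample_line = (
    "twitmo20 || Tue Jul 07 19:19:40 +0000 2020 || Imagine you are a Democratic "
    "and were told that COVID is a Bioweapon used on the people in another attempt "
    "to destr… https://t.co/JJpXaIqRt8 || 0.7716583196 || politics"
)

# groupdict() returns a dictionary keyed by the group names
fields = LINE_PATTERN.match(sample_line).groupdict()
for name, value in fields.items():
    print(name, "->", value)
```

The tweet group is greedy (`.*`) while the date group is lazy (`.+?`), so the last two separators are always assigned to the bot probability and the category.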

Now that we have read the data for you above, we would like to explore it, and that is what we will do in the next 3 parts.

## Prompt

### A. Exploring categories (20 points)

In the above example, we saw that the tweet was categorized as a `politics` tweet. That is, however, not the only category in our dataset. Use your knowledge of regular expressions to extract all the different categories in the dataset, along with the distribution of the categories.

More precisely, write a function called `categories_to_counts(tweets)` that takes in the list of tweet lines read above and returns a dictionary with `category` as the key and `count` as the value.

For example, if your dataset had only two categories (`politics` and `conspiracy`) with two tweets categorized as `politics` and three tweets categorized as `conspiracy`, then your function should return `{'politics': 2, 'conspiracy': 3}` as output.

Tip: You may find the `collections` library useful to count, so we have imported that for you.
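
As a rough illustration of the tip, the sketch below counts the last field of a few made-up lines with `Counter`. The lines in `toy_lines` and the name `CATEGORY_PATTERN` are hypothetical; the real dataset has five fields per line.

```python
import re
from collections import Counter

# Made-up lines in a shortened "field || field || category" layout
toy_lines = [
    "user_a || some text || politics\n",
    "user_b || other text || fake cure\n",
    "user_c || more text || politics\n",
]

# [^|]+? captures everything after the last "|| ", including spaces,
# and \s*$ leaves the trailing newline out of the captured group
CATEGORY_PATTERN = re.compile(r'\|\| ([^|]+?)\s*$')

matches = (CATEGORY_PATTERN.search(line) for line in toy_lines)
toy_counts = Counter(m.group(1) for m in matches if m)
print(dict(toy_counts))
```

A `Counter` is a `dict` subclass, so it can be returned directly or converted with `dict()` when a plain dictionary is expected.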

In [35]:
from collections import Counter
import re
def categories_to_counts(tweets):
  category_to_count_dict = {}

  # Please write your implementation below this line
  ######### SOLUTION #########

  # \|\| followed by a space matches the "|| " separator in front of the category
  # [a-zA-Z]+ matches one or more ASCII letters, so only single-word categories
  # (e.g. politics) are captured; a multi-word category such as "fake cure" is skipped
  # $ anchors the match at the end of the line
  pattern = r'\|\| ([a-zA-Z]+)$'

  # loop over every line in the dataset
  for tweet in tweets:
    # re.search() finds the first part of the line that satisfies the pattern above
    category_element = re.search(pattern, tweet)
    if category_element:
      # group(1) returns the text captured by the first parenthesized group,
      # i.e. the category name (group(0) would return the whole match)
      category = category_element.group(1)
      # increment the count for this category, starting from 0
      # the first time the category is seen
      category_to_count_dict[category] = category_to_count_dict.get(category,0)+1
  ######### SOLUTION END #########

  return category_to_count_dict

# This is how we will call your function
c2c_dict = categories_to_counts(tweets)

c2c_dict

{'irrelevant': 131,
 'politics': 512,
 'news': 95,
 'conspiracy': 924,
 'emergency': 17}

Just for fun, run this code to actually look at the distribution of your data categories as a pie chart to understand your data better.

In [None]:
# importing appropriate libraries
from matplotlib import pyplot as plt

def create_pie_chart(counts_dictionary):
  # Creating a plot
  fig = plt.figure(figsize =(15, 15))
  data = counts_dictionary.values()
  categories = counts_dictionary.keys()
  plt.pie(data, labels = categories, autopct='%1.0f%%')
  # Displaying the plot
  plt.show()

create_pie_chart(c2c_dict)

#### Rubric

- +12 points for correctness
- +5 points for conciseness
- +3 points for proper comments and variable names

### B. Exploring hashtag distribution in informed and misinformed users (20 points)

Within the COVID-19 discourse, there are two kinds of users: **informed** users and **misinformed** users. Informed users are the ones that have tweets in the categories `calling out or correction`, `true prevention`, `true public health response`, and `sarcasm or satire`. These are users who post true and useful information, and call out or make fun of misinformation. Unfortunately, there is also a large share of misinformed users. These are users who tweet about `conspiracy`, `false fact or prevention`, `fake cure`, `fake treatment`, and `false public health response`. Basically, these are users who are responsible for spreading misinformation.

We would like to know how many hashtags informed users use on average, and how many hashtags misinformed users use on average.

More concretely, complete the function `average_hashtags_per_class(tweets)` that takes in the `tweets` and prints the average number of hashtags used by informed as well as misinformed users.
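
As a small illustration of hashtag counting, the sketch below applies `re.findall()` to three made-up tweets and computes the average in two ways. The tweets and variable names are hypothetical. Which denominator is intended (all tweets in a class, or only tweets that contain at least one hashtag) is an assumption you should state in your comments.

```python
import re

# A hashtag is taken to be '#' followed by one or more word characters
HASHTAG_PATTERN = re.compile(r'#\w+')

# Made-up tweet texts for illustration
toy_tweets = [
    "Stay home #COVID19 #StaySafe",
    "No tags here",
    "Wash your hands #health",
]

hashtag_counts = [len(HASHTAG_PATTERN.findall(text)) for text in toy_tweets]

# Average over every tweet, including those without hashtags
average_all = sum(hashtag_counts) / len(hashtag_counts)

# Average over only the tweets that contain at least one hashtag
tagged_counts = [count for count in hashtag_counts if count > 0]
average_tagged = sum(tagged_counts) / len(tagged_counts)

print(hashtag_counts, average_all, average_tagged)
```

The two averages differ (1.0 versus 1.5 here), so the choice of denominator changes the reported numbers.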



In [49]:
import re

def average_hashtags_per_class(tweets):

  informed_categories = ['calling out or correction', 'true prevention', 'true public health response', 'sarcasm or satire']
  misinformed_categories = ['conspiracy', 'false fact or prevention', 'fake cure', 'fake treatment', 'false public health response']

  average_hashtags_informed = 0
  average_hashtags_misinformed = 0

  # Please write your implementation below this line
  ######### SOLUTION #########

  # initializing the number of counted tweets per class
  count_informed = 0
  count_misinformed = 0

  # a hashtag is a '#' followed by one or more word characters
  pattern_hashtag = r'#\w+'

  # loop over the tweets and count the hashtags in each one
  for tweet in tweets:
    # separating the fields of the line on ' || '
    separated_tweet = tweet.split(' || ')

    # checking that the line has all 5 fields (username, date and time, tweet text, bot probability, category)
    if len(separated_tweet) >= 5:
      # extracting the tweet text and category fields
      tweet_text = separated_tweet[2]
      category = separated_tweet[4]
      # finding all the hashtags in the tweet_text
      hashtags = re.findall(pattern_hashtag, tweet_text)

      # checking if the category is in informed_categories
      # and if the tweet has at least one hashtag
      # strip(): removes any leading and trailing whitespace (e.g. the newline)
      if category.strip() in informed_categories and hashtags:
        # accumulating the number of hashtags (divided by the tweet count below)
        average_hashtags_informed += len(hashtags)
        # because of the `and hashtags` check above, only tweets with
        # at least one hashtag are counted in count_informed
        count_informed += 1

      # same thing for the misinformed_categories
      elif category.strip() in misinformed_categories and hashtags:
        average_hashtags_misinformed += len(hashtags)
        count_misinformed += 1

  # dividing the hashtag totals by the tweet counts to get the averages,
  # guarding against division by zero
  if count_informed > 0:
    average_hashtags_informed /= count_informed
  if count_misinformed > 0:
    average_hashtags_misinformed /= count_misinformed

  ######### SOLUTION END #########

  # Printing average values
  print("Informed users use %f hashtags on average in a tweet"%average_hashtags_informed)
  print("Misinformed users use %f hashtags on average in a tweet"%average_hashtags_misinformed)

average_hashtags_per_class(tweets)

Informed users use 1.674468 hashtags on average in a tweet
Misinformed users use 2.860566 hashtags on average in a tweet


#### Rubric

- +12 points for correctness
- +5 points for conciseness
- +3 points for proper comments and variable names

### C. Retrieving bot accounts (20 points)

For each username in our dataset, we have an assigned bot probability. For this task, we would like to print the usernames of all the users that are bots, along with their bot probabilities. Specifically, we would like to print the username and bot probability of each user whose bot probability is greater than `0.70`.

Write a regex pattern that, when applied to a tweet line as above, makes the provided code print output in the following format:

`{'username': '1055WERC', 'botprob': '0.7819662242'}`

`{'username': 'Atho_1982', 'botprob': '0.8633729135'}`

`{'username': 'interaksyon', 'botprob': '0.97794317549999995'}`

...

We have already provided you with the code. All you need to do is write the regex pattern and assign it to the variable `pattern`.

Notes:

1. You are not allowed to change the code below.
2. Please read and play with the code below to understand what it is doing.
3. You are not allowed to use `.split()` function.
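
The provided code relies on `groupdict()`, which only returns groups that were given a name with the `(?P<name>...)` syntax. Here is a minimal sketch of that mechanism on a made-up `name=score` string; the group names `name` and `score` are illustrative and differ from the keys required in this task.

```python
import re

# Named groups: (?P<name>...) makes the captured text available under that key
toy_pattern = re.compile(r'(?P<name>\w+)=(?P<score>\d+\.\d+)')

toy_text = "alice=0.91 bob=0.42"

# finditer() yields one match object per non-overlapping match
toy_results = [match.groupdict() for match in toy_pattern.finditer(toy_text)]
for result in toy_results:
    print(result)
```

A pattern with only unnamed groups such as `(\w+)` still matches, but `groupdict()` then returns an empty dictionary.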

In [54]:
def get_bot_accounts(tweet, pattern):
  """
  This function takes in a tweet and a pattern and prints the output
  as described above
  """
  for item in re.finditer(pattern,tweet):
    print(item.groupdict())

######### SOLUTION #########
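# (?P<username>\w+) at the start of the line captures the username
# .* skips the date/time and tweet text fields (greedy, so it runs up to the last two separators)
# (?P<botprob>...) captures the probability only if it is greater than 0.70:
#   0.7 followed by at least one non-zero digit, 0.8... / 0.9..., or 1 / 1.0
# [^|]*$ consumes the category up to the end of the line
pattern = r'^(?P<username>\w+) \|\| .* \|\| (?P<botprob>0\.(?:7\d*[1-9]\d*|[89]\d*)|1(?:\.0*)?) \|\| [^|]*$'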
######### SOLUTION END #########

for tweet in tweets:
  get_bot_accounts(tweet.rstrip(), pattern)

#### Rubric

- +12 points for correctness
- +5 points for conciseness
- +3 points for proper comments and variable names

## *Concepts required to complete this task*

*   Regex Groups
*   Regex Quantifiers
*   Regex Set Operator
*   Regex Anchors

# Concluding Remarks

In this homework, we have tried to push you to use regular expressions for all your solutions. This is for learning purposes. It does not imply that regular expressions always produce the most robust or ideal solutions. While regular expressions are very useful in certain cases, there is typically more than one way to achieve the same result. In fact, for many problems it may be easier not to use regular expressions at all. This depends on the given problem and on how the relevant dataset is formatted. By the end of this course, you will hopefully be better at choosing the best methodology for the problem at hand.
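
As one small illustration of that point, the sketch below extracts the last field of a made-up line in two ways, once with a regular expression and once with plain string methods. The line and variable names are hypothetical.

```python
import re

# Made-up line in the same " || "-separated layout as the dataset
toy_line = "some_user || Mon Jan 01 00:00:00 +0000 2020 || some text || 0.5 || fake cure"

# Regex approach: capture everything after the last "|| "
category_by_regex = re.search(r'\|\| ([^|]+)$', toy_line).group(1)

# String-method approach: split once from the right and take the last piece
category_by_split = toy_line.rsplit(" || ", 1)[-1]

print(category_by_regex, "|", category_by_split)
```

Both approaches give the same answer here; which one is more robust depends on how consistently the data is formatted.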
