<a href="https://colab.research.google.com/github/dhruvkp090/TextAsData/blob/master/Lab_0_Introduction%2C_Python_revision_%26_Regular_Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text as Data Lab 0: Introduction, Python revision & Regular Expressions

The aims of the lab are to:
 - Introduce you to Colab and verify that you're set up with the correct Python packages
 - Load textual data from Reddit as a JSON file
 - Explore the data to learn a bit about it
 - Review salient Python features, such as Counter and list comprehensions
 - Introduce regular expressions, a common tool for text matching and processing

**Before you start, save a copy of this lab to your drive using "File > Save a Copy in Drive".** If you skip this step, you may lose progress that you have made (e.g., if you close the browser tab or your computer crashes).

## Colab Introduction


Colab is a cloud-based Jupyter Notebook.  It is used internally by engineers and researchers at Google and companies worldwide to prototype and share data science and ML results in an easy-to-use way. 

It supports:

1. Text Cells with [Markdown](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) formatting
2. Code Cells
3. Notebook stores code, output, and execution order
4. Tab and Tab + Tab Autocomplete
5. IPython Help Features
6. IPython Magics (`%%`)

### Additional Features

- collaborative editing
- history 
- comments
- executed code history
- Shift+click multiple cell selection
- searchable code snippets + table of contents
- scratchpad (⌘/Ctrl + Alt + N)

### Keyboard Shortcuts
| Command | Action |
| ---- | ----: |
|⌘/Ctrl+Enter | Run Selected Cell |
|Shift+Enter| Run Cell and Select Next |
|Alt+Enter| Run cell and insert new cell|
|⌘/Ctrl+M I | Interrupt Execution |

- You can open the command palette to see all shortcuts by going to Tools --> Command palette.

### Summary of tips
- Use TAB to autocomplete an expression. 
- You can also append `?` to a name and run the cell to see its docstring (for example, `len?`).
- In Jupyter / Colab you can execute shell commands by prefixing them with `!`, for example `!ls` to list the files in the current directory.


*Note:* Occasionally Colab may hang or crash (due to cloud flakiness or bad code).  You can control the execution using the Runtime menu to reboot and start fresh.  To resume where you left off you can click "Run before" and it will run all cells before the one currently selected.


*Note:* You can use Colab with a cloud VM or connect to a 'local' Jupyter instance.

## Downloading

Let's start by downloading a file containing data scraped from the online forum platform [Reddit](https://www.reddit.com/). This should take at most a few seconds to run.

In [1]:
# Download data
!wget -O reddit_posts.json https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EY_R8Y7DkrxMqXGe-zlgeNkBdJU5ZNTf8FYrN2pqDwddMA?download=1

--2023-01-17 17:09:43--  https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EY_R8Y7DkrxMqXGe-zlgeNkBdJU5ZNTf8FYrN2pqDwddMA?download=1
Resolving gla-my.sharepoint.com (gla-my.sharepoint.com)... 13.107.136.9, 13.107.138.9
Connecting to gla-my.sharepoint.com (gla-my.sharepoint.com)|13.107.136.9|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://gla-my.sharepoint.com/personal/jake_lever_glasgow_ac_uk/_layouts/15/download.aspx?UniqueId=8ef1d18f92c34cbca9719efb396078d9 [following]
--2023-01-17 17:09:44--  https://gla-my.sharepoint.com/personal/jake_lever_glasgow_ac_uk/_layouts/15/download.aspx?UniqueId=8ef1d18f92c34cbca9719efb396078d9
Reusing existing connection to gla-my.sharepoint.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 1279064 (1.2M) [application/json]
Saving to: ‘reddit_posts.json’


2023-01-17 17:09:45 (2.75 MB/s) - ‘reddit_posts.json’ saved [1279064/1279064]



## Data Exploration

If we take a look at the contents of the data file, we see that it is encoded as an array of objects in [JSON](https://en.wikipedia.org/wiki/JSON). There are a variety of other formats used for sharing text data (e.g., [XML](https://en.wikipedia.org/wiki/XML), [Protocol Buffers](https://en.wikipedia.org/wiki/Protocol_Buffers), etc.), but JSON is among the most common these days.

In [2]:
# Look at the first 20 lines of the file - note the exclamation mark which tells Colab to run a terminal command instead of Python
!head -n20 reddit_posts.json

[
  {
    "subreddit": "Soda",
    "title": "Anyone tried Irn Bru?",
    "score": 8,
    "id": "ou5yp1",
    "author": "jackibhoy",
    "body": "It\u2019s a Scottish drink and it\u2019s banned some countries and I was wondering if anyone here has tried it. It has quite a unique taste and it\u2019s not something you\u2019d forget quickly. You either love it or hate it I think."
  },
  {
    "subreddit": "Soda",
    "title": "What is the worst or some of the worst sodas you have drunk",
    "score": 3,
    "id": "nt40i4",
    "author": "EpicEllis2004",
    "body": "The absolute worst soda ive ever had that i can remember is probaly the new mystery fanta or watermelon+strawberry tango some other ones include mango coke, sugar free irn bru (but xtra is nice)"
  },
  {
    "subreddit": "tea",
    "title": "I once had a box of tea that I believe was Scottish Highland black tea. Can anyone recommend me a tea along those lines?",


Data provided by online APIs, such as the [Reddit API](https://www.reddit.com/dev/api/), are usually available in JSON. For this lab, we use a simplified version of the Reddit data, aggregating posts across several subreddits (i.e., sub-forums) and using just a handful of salient fields for each post. The data file consists of some structured data (`subreddit`, `score`, `id`, and `author`), and two unstructured text fields (`title` and `body`).

Now let's load the data into Python so we can work with it. The [json](https://docs.python.org/3/library/json.html) package in the Python standard library makes loading the data very easy.

In [3]:
# Load the data into Python
import json
with open('reddit_posts.json', 'rt') as fin:
  reddit_posts = json.load(fin)

Python's JSON parser provides the data as a `list` of `dict` objects. You should already be familiar with these classes; if not, you can refer to [Python's data structures documentation](https://docs.python.org/3/tutorial/datastructures.html).
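As a quick refresher on that structure, here is a minimal sketch using a small, made-up JSON string (`sample_json` and its two posts are hypothetical, not taken from the data file). It uses `json.loads`, which parses a string, whereas `json.load` above reads from a file object.

```python
import json

# Hypothetical two-post sample in the same shape as reddit_posts.json
sample_json = '''
[
  {"subreddit": "Soda", "title": "Anyone tried Irn Bru?", "score": 8},
  {"subreddit": "tea", "title": "Tea recommendations?", "score": 3}
]
'''

# json.loads parses a string; a JSON array becomes a list, each JSON object a dict
sample_posts = json.loads(sample_json)

first_post = sample_posts[0]         # indexing the list gives one post (a dict)
print(type(sample_posts).__name__)   # list
print(type(first_post).__name__)     # dict
print(first_post['subreddit'])       # Soda
print(sorted(first_post.keys()))     # ['score', 'subreddit', 'title']
```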

Let's look at the types of data provided and some basic information.

In [4]:
# Investigate the structure of the data
print('type(reddit_posts) =', type(reddit_posts))
print('len(reddit_posts) =', len(reddit_posts))
print('type(reddit_posts[0]) =', type(reddit_posts[0]))
print('reddit_posts[0].keys() =', reddit_posts[0].keys())
print('reddit_posts[0]["title"] =', reddit_posts[0]["title"])

type(reddit_posts) = <class 'list'>
len(reddit_posts) = 2000
type(reddit_posts[0]) = <class 'dict'>
reddit_posts[0].keys() = dict_keys(['subreddit', 'title', 'score', 'id', 'author', 'body'])
reddit_posts[0]["title"] = Anyone tried Irn Bru?


We see that there are 2000 posts in the dataset. It's always important to understand the data you're working with, so let's see which subreddits these posts are from.

We could write code to loop through each of the posts and keep a running count of the number of posts in each subreddit:

```python
# The code that uses Counter below essentially does this:
subreddit_counts = {}
for post in reddit_posts:
  subreddit = post['subreddit']
  if subreddit not in subreddit_counts:
    subreddit_counts[subreddit] = 0
  subreddit_counts[subreddit] += 1
subreddit_counts
```

Python provides a convenient [`Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) class to count values like this, so we'll use that instead. We can use a [Generator Expression](https://docs.python.org/3/reference/expressions.html#generator-expressions) to get the subreddit from each post and let `Counter` count how many times each subreddit appears.

In [5]:
# Count the number of times each subreddit appears in this dataset
from collections import Counter
Counter(post['subreddit'] for post in reddit_posts)

Counter({'Soda': 174,
         'tea': 236,
         'xbox': 213,
         'antiMLM': 226,
         'HydroHomies': 210,
         'pcgaming': 225,
         'NintendoSwitch': 249,
         'Coffee': 234,
         'PS4': 233})

We see that there are 9 subreddits covered by the dataset, mostly about beverages and gaming. The Soda subreddit is the least common (174 posts) and the NintendoSwitch subreddit is the most common (249 posts).
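To check that `Counter` really does the same job as the manual loop, here is a small sketch on a made-up list of subreddit names (`subreddits` is hypothetical toy data, not the lab dataset). It also shows that a `Counter` returns 0 for a key it has never seen, instead of raising a `KeyError`.

```python
from collections import Counter

# Hypothetical toy data standing in for post['subreddit'] values
subreddits = ['Soda', 'tea', 'Soda', 'xbox', 'tea', 'Soda']

# Manual counting with a plain dict
manual_counts = {}
for name in subreddits:
  manual_counts[name] = manual_counts.get(name, 0) + 1

# The same counting with Counter and a generator expression
counts = Counter(name for name in subreddits)

print(dict(counts) == manual_counts)  # True
print(counts['Soda'])                 # 3
print(counts['Coffee'])               # 0 (missing keys count as zero)
print(len(counts))                    # 3 distinct subreddits
```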

**Exercise:** Can you find the user with the most posts and how many posts they have? (Hint: `Counter` provides a function that will help you find the item with the highest count, without needing to look through all the counts manually. You can find this in the documentation.)

In [13]:
Counter(post['author'] for post in reddit_posts).most_common(1)[0]

('AutoModerator', 5)

You should find that AutoModerator has the most posts, with 5.

AutoModerator sounds like it may not be a human user, so let's take a look at their posts. There are several ways to do this filtering. One of the most concise is to use a [List Comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions), so we'll do that here:

In [14]:
[ post for post in reddit_posts if post['author'] == 'AutoModerator' ]

[{'subreddit': 'NintendoSwitch',
  'title': "/r/NintendoSwitch's Friend Request Weekend! (01/07/2022)",
  'score': 11,
  'id': 'ry4xft',
  'author': 'AutoModerator',
  'body': "Hi everybody! \nIt's time for Friend Request Weekend! Use this thread to exchange Friend Codes and Nintendo Accounts, find new friends and set up gaming sessions! Feel free to share local Switch subreddits as well. You can find current local/regional Switch subreddits [here](https://www.reddit.com/r/NintendoSwitch/wiki/relatedsubs). If you would like to have yours added, please [message the moderators](https://www.reddit.com/message/compose?to=%2Fr%2FNintendoSwitch).\n\u200b\nYou can also check out /r/NintendoFriends or /r/SwitchFCSwap for exchanging accounts and making friends. You can also setup a profile via our Mecha Bowser bot in our [Discord server](https://discord.gg/switch). Use the `!profile edit` command in the `#command-central` channel to get started.\n\u200b\nYou can see previous posts [here](https:

Indeed, qualitative inspection of AutoModerator's posts suggests that they are automatically generated.

Depending on your application, you may want to filter out such posts. For instance, if you were trying to measure public sentiment about a product, you are probably only interested in human users. You could filter out auto-generated posts by manually checking each post, but this becomes impractical as the size of your collection grows. Later in this course we will cover text classification techniques, which could be used to automatically label a large number of posts given manually labeled training data.
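When the accounts to exclude are already known, a list comprehension is enough to drop their posts. The sketch below runs on made-up posts; `sample_posts` and the `bot_authors` set are assumptions for illustration, and a real collection would likely contain bots that are not in any such list.

```python
# Hypothetical posts in the same shape as the lab data
sample_posts = [
  {'author': 'AutoModerator', 'title': 'Weekly discussion thread'},
  {'author': 'jackibhoy', 'title': 'Anyone tried Irn Bru?'},
  {'author': 'AutoModerator', 'title': 'Friend Request Weekend'},
]

# Assumption: a hand-maintained set of known bot account names
bot_authors = {'AutoModerator'}

# Keep only posts whose author is not a known bot
human_posts = [ p for p in sample_posts if p['author'] not in bot_authors ]

print(len(human_posts))          # 1
print(human_posts[0]['author'])  # jackibhoy
```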



Can you find out anything else interesting from the dataset by inspecting the data?

In [None]:
# Use this cell to explore the dataset. (Feel free to make other cells as well.)

## Regular Expressions

[Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression) (often abbreviated as regex or regexp) are a common tool for pattern matching in text. Python provides regular expressions in the [`re`](https://docs.python.org/3/library/re.html) package.

Regular expressions can get very complicated, but it is valuable to be familiar with them. This portion of the lab provides a basic introduction to regex and provides links to further details if you want to learn more.

### Motivating Example

Let's say you want to find all posts that mention [Irn-Bru](https://en.wikipedia.org/wiki/Irn-Bru). One option would be to use the string contains operator (`in`):

In [15]:
[ p for p in reddit_posts if 'Irn-Bru' in p['title'] or 'Irn-Bru' in p['body'] ]

[]

Hmm, nothing matches? Didn't we see the first post mentioned Irn-Bru?

In [16]:
reddit_posts[0]

{'subreddit': 'Soda',
 'title': 'Anyone tried Irn Bru?',
 'score': 8,
 'id': 'ou5yp1',
 'author': 'jackibhoy',
 'body': 'It’s a Scottish drink and it’s banned some countries and I was wondering if anyone here has tried it. It has quite a unique taste and it’s not something you’d forget quickly. You either love it or hate it I think.'}

Indeed so, but they stylized it as "Irn Bru" rather than "Irn-Bru". There are probably other ways people might write it too, like IrnBru, Irnbru, etc. We could come up with a list of possibilities and control for different casing (like the code below), but we still might miss some. There's a simpler way using regular expressions.

```python
match_strings = ['irn-bru', 'irn bru', 'irnbru']
[ p for p in reddit_posts if any(m in p['title'].lower() or m in p['body'].lower() for m in match_strings) ]
```

### Filtering with Regular Expressions

A regular expression defines a pattern to match in text as a string. Most characters (letters, numbers) in a pattern match themselves. Some perform special functions, like making the preceding character optional or allowing multiple characters to match. For instance:
 - "`.`" matches any character (except a newline, by default).
 - "`?`" makes the preceding character optional.

With this, we can define the regular expression `"irn.?bru"`, which will match `irn` and `bru` with any character (or no character) between them. An option allows case insensitive matching.
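To make the behaviour of `.` and `?` concrete, the sketch below tries the pattern against a few made-up spellings (the `candidates` list is hypothetical). Because `.?` allows at most one character between `irn` and `bru`, a spelling with two separators does not match.

```python
import re

# Hypothetical spellings to try against the pattern
candidates = ['Irn-Bru', 'irn bru', 'IRNBRU', 'Irn--Bru', 'iron brew']

results = {}
for text in candidates:
  # re.IGNORECASE is the long name for re.I
  results[text] = re.search(r'irn.?bru', text, flags=re.IGNORECASE) is not None
  print(f'{text!r}: {results[text]}')

# 'Irn-Bru': True     (one character between irn and bru)
# 'irn bru': True     (a space also counts as "any character")
# 'IRNBRU': True      (the optional character is absent)
# 'Irn--Bru': False   (two characters between irn and bru)
# 'iron brew': False  (the letters themselves differ)
```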

We can use regular expressions in Python by first importing the `re` package and then compiling our pattern.

In [17]:
import re
pattern = re.compile(r'irn.?bru', re.I) # re.I makes the pattern case insensitive
# NB: it's usually a good idea to define regex using raw strings (r'') to avoid string escaping within the expression.

The pattern object can do all sorts of things. Most commonly, you will use the `search` function, which finds and returns the first place that the pattern matches a string.

In [18]:
match = pattern.search("Anyone tried Irn Bru?")
match

<re.Match object; span=(13, 20), match='Irn Bru'>

The match object gives the character offsets of the match (`match.span()`) and the text itself that matches (`match.group(0)`).
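As a small illustration of those two accessors, the sketch below recompiles the same pattern so that it runs on its own, and checks that slicing the original string with the span gives back the matched text. The second string, `"Anyone tried Fanta?"`, is a made-up example of a non-match.

```python
import re

pattern = re.compile(r'irn.?bru', re.I)
text = "Anyone tried Irn Bru?"

match = pattern.search(text)
start, end = match.span()      # character offsets; end is exclusive

print(start, end)              # 13 20
print(match.group(0))          # Irn Bru
print(text[start:end])         # Irn Bru (same text, recovered by slicing)

# Hypothetical non-matching string: search returns None
no_match = pattern.search("Anyone tried Fanta?")
print(no_match is None)        # True
```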

If no match is found, `search` returns `None`, which evaluates as `False` when it's used in an `if` statement. This allows it to be easily used for filtering data. Using regular expressions, we find 6 posts about Irn-Bru.

In [19]:
[ p for p in reddit_posts if pattern.search(p['title']) or pattern.search(p['body']) ]

[{'subreddit': 'Soda',
  'title': 'Anyone tried Irn Bru?',
  'score': 8,
  'id': 'ou5yp1',
  'author': 'jackibhoy',
  'body': 'It’s a Scottish drink and it’s banned some countries and I was wondering if anyone here has tried it. It has quite a unique taste and it’s not something you’d forget quickly. You either love it or hate it I think.'},
 {'subreddit': 'Soda',
  'title': 'What is the worst or some of the worst sodas you have drunk',
  'score': 3,
  'id': 'nt40i4',
  'author': 'EpicEllis2004',
  'body': 'The absolute worst soda ive ever had that i can remember is probaly the new mystery fanta or watermelon+strawberry tango some other ones include mango coke, sugar free irn bru (but xtra is nice)'},
 {'subreddit': 'Soda',
  'title': 'Why is creme soda flavor so common?',
  'score': 0,
  'id': 'ryn22l',
  'author': 'Saracenanator',
  'body': "Typically this flavor of soda is sold with different colors but same flavor. I have tried Pakola, Inca Kola, Irn Bru etc and they all taste same

We can see that there are a variety of ways that people express Irn-Bru.

**Exercise:** Can you write code that finds all the ways people express Irn-Bru in the dataset (e.g., "Irn-Bru", "IRN BRU", etc.)? Hint: you probably need to use the [`findall` function](https://docs.python.org/3/library/re.html#re.findall) with the `pattern` already defined above.

In [23]:
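# Join posts with newlines so that a match cannot span two different posts ('.' does not match '\n')
matches = pattern.findall( "\n".join( p['title'] + '\n' + p['body'] for p in reddit_posts) )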

Counter(matches)

Counter({'Irn Bru': 6, 'irn bru': 1, 'IRN-BRU': 2, 'IRN BRU': 1})

(You should find that "Irn Bru" appears 6 times, "IRN-BRU" appears twice, "IRN BRU" once, and "irn bru" once.)

Now let's try to find some years mentioned in the text of the documents. You'll need to define a new pattern. You might find that a [regular expressions cheatsheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/) is helpful so you can look up the pattern for `digit`.

**Exercise:** Find all 21st-century years (4-digit numbers starting with '20') in the titles and bodies of the posts. What is the most common?

In [29]:
yearPattern = re.compile(r'20\d\d')
# Join posts with newlines so that digits from two different posts cannot combine into one match
matches = yearPattern.findall( "\n".join( p['title'] + '\n' + p['body'] for p in reddit_posts) )

Counter(matches).most_common(1)

[('2016', 35)]

You should find that `2016` appears 35 times across the posts.

Now, a question: do you think all of those 4-digit numbers in the text are actually years? Could they be anything else?
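One way to probe this question is to compare the pattern with a stricter variant that adds word boundaries (`\b`), so that the four digits cannot be part of a longer number. This is only a sketch on a made-up sentence (`sample_text` is hypothetical). Even the stricter pattern still returns `2000`, which here is a points total rather than a year, so pattern matching alone cannot settle what a number means.

```python
import re

loose = re.compile(r'20\d\d')       # matches anywhere, even inside longer numbers
strict = re.compile(r'\b20\d\d\b')  # requires a word boundary on both sides

# Hypothetical text mixing a real year with other numbers
sample_text = 'Released in 2016. Order #120165 cost 2000 points.'

loose_matches = loose.findall(sample_text)
strict_matches = strict.findall(sample_text)

print(loose_matches)   # ['2016', '2016', '2000']
print(strict_matches)  # ['2016', '2000']
```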

### Additional Resources

These examples only scratch the surface of the capabilities of regular expressions. We refer you to these resources for further details.

 - [RegExr](https://regexr.com/) -- Interactively build and evaluate regular expressions in a GUI. This can be helpful when trying to build a tricky regex or figure out why it's not matching what you want it to.
 - [RegexOne](https://regexone.com/) -- A detailed and interactive regular expression tutorial
 - [Python's `re` documentation](https://docs.python.org/3/library/re.html) -- Provides details about both regex syntax and Python's regex API

## Lab Summary

In this lab you:
 - Started using Colab
 - Loaded textual data from Reddit as a JSON file
 - Explored the data to learn a bit about it
 - Reviewed salient Python features, such as Counter and list comprehensions
 - Explored using regular expressions in Python

**Please remember to submit your completed lab and feedback form on Moodle.**

## Optional Bonus

Play with [ChatGPT](https://chat.openai.com/). You will have to make a free account to use it. Sometimes it is too busy, in which case you'll need to come back later.

**Task:** Try to get ChatGPT to say something factually incorrect.