### This notebook investigates the kind of data we will be working with and develops a (possibly) uniform way to process text data from Reddit

In [1]:
import pandas as pd
import re
import string

In [2]:
df = pd.read_csv('raw_data/computerscience_hot_posts.csv')

df.head()

Unnamed: 0.1,Unnamed: 0,created_utc,title,text,author,score,upvote_ratio,num_comments,url
0,0,1673829000.0,"Looking for books, videos, or other resources ...",,mobotsar,93,0.99,113,https://www.reddit.com/r/computerscience/comme...
1,1,1686431000.0,/r/ComputerScience will be going dark starting...,"## Update (June 16th, 2023):\n\nThis subreddit...",nuclear_splines,290,0.97,21,https://www.reddit.com/r/computerscience/comme...
2,2,1686512000.0,How computers measure time,Can someone explain this to me? I've been told...,RunDiscombobulated67,86,0.98,27,https://www.reddit.com/r/computerscience/comme...
3,3,1686514000.0,Question About Registers,Hello everyone. There is a misunderstanding I ...,mellowhorses,62,0.97,24,https://www.reddit.com/r/computerscience/comme...
4,4,1686507000.0,Learning a new skill,"Hey guys,\n\nWanted to ask what a good compute...",Haunting_Document142,30,0.9,38,https://www.reddit.com/r/computerscience/comme...


There are some null values in the text column, so we have to remove those rows.

In [3]:
df_drop_null_text = df.dropna(subset=["text"])
df_drop_null_text.head(10)

Unnamed: 0.1,Unnamed: 0,created_utc,title,text,author,score,upvote_ratio,num_comments,url
1,1,1686431000.0,/r/ComputerScience will be going dark starting...,"## Update (June 16th, 2023):\n\nThis subreddit...",nuclear_splines,290,0.97,21,https://www.reddit.com/r/computerscience/comme...
2,2,1686512000.0,How computers measure time,Can someone explain this to me? I've been told...,RunDiscombobulated67,86,0.98,27,https://www.reddit.com/r/computerscience/comme...
3,3,1686514000.0,Question About Registers,Hello everyone. There is a misunderstanding I ...,mellowhorses,62,0.97,24,https://www.reddit.com/r/computerscience/comme...
4,4,1686507000.0,Learning a new skill,"Hey guys,\n\nWanted to ask what a good compute...",Haunting_Document142,30,0.9,38,https://www.reddit.com/r/computerscience/comme...
5,5,1686502000.0,"Recommendations of Hackathons, GameJams, Tech ...",I'm interested in many fields of CS so any eve...,trojaneo,32,0.89,8,https://www.reddit.com/r/computerscience/comme...
6,6,1686407000.0,Any CS books which present their subject chron...,I recently read calculus reordered and Real An...,ingsocks,78,0.94,13,https://www.reddit.com/r/computerscience/comme...
7,7,1686409000.0,Best Practices using LaTeX.,I have just finished my first seventy page the...,Sorry_Scale_1064,32,0.88,7,https://www.reddit.com/r/computerscience/comme...
8,8,1686431000.0,Pumping lemma Question,I have 2 languages the first one is regular an...,Melodic-Scheme8794,9,0.92,7,https://www.reddit.com/r/computerscience/comme...
9,9,1686346000.0,I don't understand the Halting Problem,"The key to the solution, from what I've got so...",androt14_,51,0.88,47,https://www.reddit.com/r/computerscience/comme...
10,10,1686317000.0,How to Use a Collaborative Approach to Problem...,Hello there!\n\nThis is an article I posted on...,albeXL,7,0.77,1,https://www.reddit.com/r/computerscience/comme...
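Before dropping rows it can help to check how many values are actually missing. The sketch below uses a small made-up frame (`toy`, not the real Reddit data) to show the count and the effect of `dropna`. The `reset_index(drop=True)` call is an optional extra that the cell above does not use; it only renumbers the surviving rows from 0.

```python
import pandas as pd

# Toy frame standing in for the Reddit posts; the values are made up for illustration.
toy = pd.DataFrame({
    "title": ["link-only post", "text post", "another text post"],
    "text": [None, "Hello everyone.", "Hey guys"],
})

# Count missing values per column before dropping anything.
null_counts = toy.isna().sum()

# Drop rows whose 'text' is missing, then renumber the remaining rows from 0.
toy_clean = toy.dropna(subset=["text"]).reset_index(drop=True)

print(null_counts["text"])
print(len(toy_clean))
```

Without `reset_index`, the original row labels are kept, which is why the output above starts at index 1.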


In [11]:
for i, row in df_drop_null_text[:10].iterrows():
    print(row['text'])

## Update (June 16th, 2023):

This subreddit remains closed to new submissions and comments as part of the ongoing protest over Reddit policy changes. However, we've chosen to switch the subreddit to read-only, so that existing user contributions will not be censored.

# What's going on?

A recent Reddit policy change threatens to kill many beloved third-party mobile apps, making a great many quality-of-life features not seen in the official mobile app **permanently inaccessible** to users.

On May 31, 2023, Reddit announced they were raising the price to make calls to their API from being free to a level that will kill every third party app on Reddit, from Apollo to Reddit is Fun to Narwhal to BaconReader to Sync.

Even if you're not a mobile user and don't use any of those apps, this is a step toward killing other ways of customizing Reddit, such as Reddit Enhancement Suite or the use of the old.reddit.com desktop interface.

This isn't only a problem on the user level: many subreddi

### There are a lot of links in this text, let's try to remove them using regex

In [4]:
def pretty_print(text):
    # Left-align the text and pad it to at least 20 characters before printing.
    print('{0: <20}'.format(text))

pretty_print(df_drop_null_text.iloc[0]['text'])
example_text = df_drop_null_text.iloc[0]['text']

## Update (June 16th, 2023):

This subreddit remains closed to new submissions and comments as part of the ongoing protest over Reddit policy changes. However, we've chosen to switch the subreddit to read-only, so that existing user contributions will not be censored.

# What's going on?

A recent Reddit policy change threatens to kill many beloved third-party mobile apps, making a great many quality-of-life features not seen in the official mobile app **permanently inaccessible** to users.

On May 31, 2023, Reddit announced they were raising the price to make calls to their API from being free to a level that will kill every third party app on Reddit, from Apollo to Reddit is Fun to Narwhal to BaconReader to Sync.

Even if you're not a mobile user and don't use any of those apps, this is a step toward killing other ways of customizing Reddit, such as Reddit Enhancement Suite or the use of the old.reddit.com desktop interface.

This isn't only a problem on the user level: many subreddi

In [5]:
pretty_print(re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', "", df_drop_null_text.iloc[0]['text']))

## Update (June 16th, 2023):

This subreddit remains closed to new submissions and comments as part of the ongoing protest over Reddit policy changes. However, we've chosen to switch the subreddit to read-only, so that existing user contributions will not be censored.

# What's going on?

A recent Reddit policy change threatens to kill many beloved third-party mobile apps, making a great many quality-of-life features not seen in the official mobile app **permanently inaccessible** to users.

On May 31, 2023, Reddit announced they were raising the price to make calls to their API from being free to a level that will kill every third party app on Reddit, from Apollo to Reddit is Fun to Narwhal to BaconReader to Sync.

Even if you're not a mobile user and don't use any of those apps, this is a step toward killing other ways of customizing Reddit, such as Reddit Enhancement Suite or the use of the old.reddit.com desktop interface.

This isn't only a problem on the user level: many subreddi

In [6]:
def remove_links(text):
    # Delete URL-like substrings (http(s)://, www., or domain/ prefixes) from the text.
    return re.sub(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''', "", text)

example_text = remove_links(example_text)
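To make the behaviour of link removal easier to see on a short string, here is a minimal sketch with a much simpler pattern. `SIMPLE_URL_RE` and `remove_links_simple` are hypothetical names, and the pattern is an assumption, not the regex used above: it only catches tokens that start with `http://`, `https://` or `www.`, and it also swallows any punctuation glued to the end of the URL.

```python
import re

# Simplified, assumed pattern: http(s):// or www. followed by everything up to the next whitespace.
SIMPLE_URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def remove_links_simple(text):
    """Strip URL-like tokens and collapse the runs of spaces they leave behind."""
    without_links = SIMPLE_URL_RE.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", without_links)

sample = "Details at https://example.com/faq and www.example.org today."
print(remove_links_simple(sample))
```

A bare domain such as old.reddit.com has no scheme or `www.` prefix, so this sketch leaves it in place, just as it survives in the output above.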

### Remove Punctuation

In [7]:
def remove_punctuation(text):
    # Delete every ASCII punctuation character listed in string.punctuation.
    return text.translate(str.maketrans('', '', string.punctuation))

pretty_print(remove_punctuation(example_text))

 Update June 16th 2023

This subreddit remains closed to new submissions and comments as part of the ongoing protest over Reddit policy changes However weve chosen to switch the subreddit to readonly so that existing user contributions will not be censored

 Whats going on

A recent Reddit policy change threatens to kill many beloved thirdparty mobile apps making a great many qualityoflife features not seen in the official mobile app permanently inaccessible to users

On May 31 2023 Reddit announced they were raising the price to make calls to their API from being free to a level that will kill every third party app on Reddit from Apollo to Reddit is Fun to Narwhal to BaconReader to Sync

Even if youre not a mobile user and dont use any of those apps this is a step toward killing other ways of customizing Reddit such as Reddit Enhancement Suite or the use of the oldredditcom desktop interface

This isnt only a problem on the user level many subreddit moderators depend on tools only ava
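Deleting punctuation outright fuses hyphenated words, as in "readonly" and "thirdparty" in the output above. One possible alternative, sketched here under the assumption that keeping word boundaries matters for later steps, is to replace each punctuation character with a space and then collapse repeated spaces. `replace_punctuation_with_space` is a hypothetical helper, not part of the notebook.

```python
import re
import string

# Map every ASCII punctuation character to a space instead of deleting it.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def replace_punctuation_with_space(text):
    """Hypothetical variant of remove_punctuation that keeps word boundaries."""
    spaced = text.translate(_PUNCT_TO_SPACE)
    # Collapse runs of spaces and tabs but keep newlines, so paragraphs survive.
    return re.sub(r"[ \t]+", " ", spaced).strip()

print(replace_punctuation_with_space("a read-only, third-party app"))
```

The trade-off is that contractions are split as well ("we've" becomes "we ve"), so which variant is preferable depends on the downstream tokenizer.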

### For loops are too slow!

In [9]:
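# remove_markdown_characters is not defined in the cells shown here;
# it must be defined before this loop will run.
for _, row in df_drop_null_text[:10].iterrows():
    text = row['text']
    text = remove_links(text)
    text = remove_punctuation(text)
    text = remove_markdown_characters(text)
    pretty_print(text)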

 Update June 16th 2023

This subreddit remains closed to new submissions and comments as part of the ongoing protest over Reddit policy changes However weve chosen to switch the subreddit to readonly so that existing user contributions will not be censored

 Whats going on

A recent Reddit policy change threatens to kill many beloved thirdparty mobile apps making a great many qualityoflife features not seen in the official mobile app permanently inaccessible to users

On May 31 2023 Reddit announced they were raising the price to make calls to their API from being free to a level that will kill every third party app on Reddit from Apollo to Reddit is Fun to Narwhal to BaconReader to Sync

Even if youre not a mobile user and dont use any of those apps this is a step toward killing other ways of customizing Reddit such as Reddit Enhancement Suite or the use of the oldredditcom desktop interface

This isnt only a problem on the user level many subreddit moderators depend on tools only ava
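One way to avoid the explicit `iterrows` loop is to wrap the cleaning steps in a single function and apply it to the whole column. The sketch below is self-contained, so it uses a toy frame and a simplified stand-in `clean_text` (an assumed URL pattern plus punctuation removal) rather than the notebook's own helpers. `Series.map` still calls the function once per value in Python, so treat it as a tidier and usually faster option, not a fully vectorized one.

```python
import re
import string

import pandas as pd

def clean_text(text):
    """Simplified stand-in pipeline: strip URL-like tokens, then ASCII punctuation."""
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    return text.translate(str.maketrans("", "", string.punctuation))

# Toy data standing in for the Reddit posts.
toy = pd.DataFrame({"text": ["See https://example.com now!", "Hello, world."]})

# Series.map applies clean_text to each value without building a Series per row,
# which is the main overhead of iterrows.
toy["clean_text"] = toy["text"].map(clean_text)
print(toy["clean_text"].tolist())
```

On the real data the same pattern would be `df_drop_null_text['text'].map(...)` with a function that chains the notebook's own cleaning steps.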