# "AskReddit" - Data Analytics on Reddit Web-Scraped Data

Reddit is a content and community website where users can submit links, text posts, and other types of content to groups of people with similar interests. These groups are called subreddits, and each one specializes in a particular topic. For example, AskReddit is a popular subreddit where you can pose questions to the entire Reddit community. Users answer the questions by commenting on them.

I have worked with a data set containing the top 1,000 questions users posted to AskReddit in 2015. Reddit user P_S_Laplace created the data set, which has five columns that appear in the following order:
1. Title -- The title of the post
2. Score -- The number of upvotes the post received
3. Time -- When the post was posted
4. Gold -- How much Reddit Gold users gave the post
5. NumComs -- The number of comments the post received

## 1. Introduction, Wildcards in Regular Expressions, Searching the Beginnings And Endings Of Strings ##

In [54]:
strings = ["data science", "big data", "metadata"]
regex = "data"

strings = ["bat", "robotics", "megabyte"]
regex = "..t"

strings = ["better not put too much", "butter in the", "batter"]
bad_string = "We also wouldn't want it to be bitter"
regex = "^b....."

strings = ['War of 1812', 'There are 5280 feet to a mile', 'Happy New Year 2016!']
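
The cell above only defines sample strings and patterns. As a minimal sketch of how they could be checked, the snippet below runs each pattern through `re.search`. The `examples` list and the loop are illustrative additions, not part of the original notebook. The last list of strings (the ones containing years) is left out because the notebook gives no pattern for it.

```python
import re

# Each pattern from the cell above, paired with the strings it is meant to match.
examples = [
    ("data", ["data science", "big data", "metadata"]),   # plain substring
    ("..t", ["bat", "robotics", "megabyte"]),             # "." matches any single character
    ("^b.....", ["better not put too much", "butter in the", "batter"]),  # "^" anchors to the start
]

for regex, strings in examples:
    for s in strings:
        matched = re.search(regex, s) is not None
        print(f"{regex!r} matches {s!r}: {matched}")

# The "^" anchor rejects a string that does not begin with "b",
# even though "bitter" appears later in it.
bad_string = "We also wouldn't want it to be bitter"
print(re.search("^b.....", bad_string) is None)
```

Every string in the three lists matches its pattern, while `bad_string` does not, because `re.search` can only satisfy `^` at the start of the string.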

## Reading the Data Set ##

* Use the csv module to read our data set and assign it to posts_with_header.
* Use list slicing to exclude the first row, which represents the column names. Assign this sliced data set to posts.
* Use a for loop and string slicing to print the first 10 rows. See if you notice any patterns in this sample of the data set.

In [27]:
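import csv

# Read the whole CSV into a list of rows; the first row holds the column names.
with open("askreddit_2015.csv", newline="") as f:
    posts_with_header = list(csv.reader(f))
posts = posts_with_header[1:]
for each in posts[:10]:
    print(each)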

['What\'s your internet "white whale", something you\'ve been searching for years to find with no luck?', '11510', '1433213314.0', '1', '26195']
["What's your favorite video that is 10 seconds or less?", '8656', '1434205517.0', '4', '8479']
['What are some interesting tests you can take to find out about yourself?', '8480', '1443409636.0', '1', '4055']
["PhD's of Reddit. What is a dumbed down summary of your thesis?", '7927', '1440188623.0', '0', '13201']
['What is cool to be good at, yet uncool to be REALLY good at?', '7711', '1440082910.0', '0', '20325']
['[Serious] Redditors currently in a relationship, besides dinner and a movie, what are your favorite activities for date night?', '7598', '1439993280.0', '2', '5389']
["Parents of Reddit, what's something that your kid has done that you pretended to be angry about but secretly impressed or amused you?", '7553', '1439161809.0', '0', '11520']
['What is a good subreddit to binge read the All Time Top Posts of?', '7498', '1438822288.0',

The output above shows that each row is a list of strings in the following format:
[question title, score (number of upvotes), post time as a Unix timestamp, amount of Reddit Gold, number of comments]
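
Because `csv.reader` returns every field as a string, numeric comparisons need a conversion step first. The sketch below shows one way to do that. The `parse_row` helper is a hypothetical name, and it assumes the Time column holds seconds since the Unix epoch in UTC, which the sample values suggest but the data set description does not state.

```python
from datetime import datetime, timezone

def parse_row(row):
    """Convert one raw CSV row (all strings) into typed fields."""
    title, score, time, gold, num_coms = row
    return {
        "title": title,
        "score": int(score),
        # Assumption: Time is seconds since the Unix epoch, interpreted as UTC.
        "time": datetime.fromtimestamp(float(time), tz=timezone.utc),
        "gold": int(gold),
        "num_coms": int(num_coms),
    }

# Second row of the sample output above.
row = ["What's your favorite video that is 10 seconds or less?",
       '8656', '1434205517.0', '4', '8479']
post = parse_row(row)
print(post["score"], post["time"].year, post["num_coms"])
```

With the full data set loaded, the same helper could be applied to every row with `[parse_row(r) for r in posts]`.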

## 2. Counting Simple Matches in the Data Set with Python's re Module ##

In [47]:
import re
of_reddit_count = 0
for each in posts :
    if re.search("of Reddit", each[0]):
        of_reddit_count += 1
print("Reddit Count: ", of_reddit_count)

Reddit Count:  76


We counted the posts in our data set whose titles match the regex "of Reddit" and assigned the count to of_reddit_count. In total, 76 titles contain the string "of Reddit", usually in a phrase that addresses a group of users, such as "Parents of Reddit".
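
The counting loop is repeated in several later sections, so it can be wrapped in a small function. This is only a sketch: `count_matches` is a hypothetical helper, and `sample_titles` holds three titles copied from the sample output earlier rather than the full data set.

```python
import re

def count_matches(pattern, titles):
    """Return how many titles contain at least one match for pattern."""
    return sum(1 for title in titles if re.search(pattern, title) is not None)

# Three titles copied from the sample output shown earlier.
sample_titles = [
    "PhD's of Reddit. What is a dumbed down summary of your thesis?",
    "Parents of Reddit, what's something that your kid has done that you "
    "pretended to be angry about but secretly impressed or amused you?",
    "What is cool to be good at, yet uncool to be REALLY good at?",
]
print(count_matches("of Reddit", sample_titles))
```

On the real data the call would be `count_matches("of Reddit", [row[0] for row in posts])`.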

## 3. Using Square Brackets to Match Multiple Characters ##

In [29]:
import re
of_reddit_count = 0
of_reddit_count_old = 0
for row in posts:
    if re.search("of Reddit", row[0]) is not None:
        of_reddit_count_old += 1
for row in posts:
    if re.search("of [Rr]eddit", row[0]) is not None:
        of_reddit_count += 1
print(of_reddit_count_old, of_reddit_count)

76 102


If you look at the data set closely, you may notice that some posts use "of Reddit", and others use "of reddit". While both versions have the same format, the capitalization of "Reddit" is different. We can account for this inconsistency with square brackets. We use square brackets in a regex to indicate that any character within them can fill the space.

For example, the regex "[bcr]at" would match the substrings "bat", "cat", and "rat", but nothing else. We indicate that the first character in the regex can be either "b", "c" or "r".
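
A short sketch of both character classes follows. The sentence passed to `re.findall` and the `titles` list are made-up illustrations, not rows from the data set.

```python
import re

# "[bcr]at" matches "bat", "cat", and "rat", but not "hat" or "mat".
found = re.findall("[bcr]at", "bat cat rat hat mat")
print(found)

# Hypothetical titles that differ only in the capitalization of "Reddit".
titles = [
    "Parents of Reddit, what surprised you?",
    "Teachers of reddit, what surprised you?",
    "What is a good subreddit to read?",
]
matching = [t for t in titles if re.search("of [Rr]eddit", t) is not None]
print(len(matching))
```

The third title contains "reddit" but not the phrase "of reddit", so only the first two titles match.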

## 4. Escaping Special Characters ##

In [30]:
serious_count = 0
for each in posts :
    if re.search(r"\[Serious\]", each[0]) is not None:
        serious_count += 1
print(serious_count)

69
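
The backslashes in the pattern above matter because square brackets are special characters in a regex. The sketch below, which assumes nothing beyond the standard `re` module, contrasts the unescaped and escaped patterns on one untagged title from the sample output.

```python
import re

title = "What is cool to be good at, yet uncool to be REALLY good at?"

# Unescaped, "[Serious]" is a character class: it matches any ONE of the
# characters S, e, r, i, o, u, s, so this untagged title still matches.
unescaped_hit = re.search("[Serious]", title) is not None
print(unescaped_hit)

# Escaped, the brackets are literal, so only an actual "[Serious]" tag matches.
escaped_hit = re.search(r"\[Serious\]", title) is not None
tagged_hit = re.search(r"\[Serious\]", "[Serious] " + title) is not None
print(escaped_hit, tagged_hit)

# re.escape builds the escaped pattern from a literal string.
escaped_pattern = re.escape("[Serious]")
print(escaped_pattern)
```

Writing the pattern as a raw string (`r"..."`) keeps Python itself from interpreting the backslashes before the regex engine sees them.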


## 5. Combining Escaped Characters and Multiple Matches ##

Some people tag serious posts as "[Serious]", and others as "[serious]". We should account for both capitalizations.

In [31]:
serious_count = 0
serious_count_old = 0
for row in posts:
    if re.search(r"\[Serious\]", row[0]) is not None:
        serious_count_old += 1

for row in posts:
    if re.search(r"\[[sS]erious\]", row[0]) is not None:
        serious_count += 1

print(serious_count_old, serious_count)

69 77


## 6. Adding More Complexity to Your Regular Expression ##

Some users have tagged their posts with "(Serious)" or "(serious)", including the parentheses. Therefore, we should account for both square brackets and parentheses. We can do this by using square bracket notation, and escaping the "[", "]", "(", and ")" characters with the backslash.

In [32]:
serious_count = 0
serious_count_old = 0
for row in posts:
    if re.search(r"\[[Ss]erious\]", row[0]) is not None:
        serious_count_old += 1

for row in posts:
    if re.search(r"[\[\(][Ss]erious[\]\)]", row[0]) is not None:
        serious_count += 1

print(serious_count_old, serious_count)

77 80
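
To see what the combined pattern accepts, the sketch below tests it against the four tag spellings discussed so far. The `TAG` and `variants` names are illustrative additions.

```python
import re

# Same pattern as in the cell above, written as a raw string.
TAG = r"[\[\(][Ss]erious[\]\)]"

variants = ["[Serious]", "[serious]", "(Serious)", "(serious)"]
all_match = all(re.search(TAG, v) is not None for v in variants)
print(all_match)

# Caveat: each character class is independent, so mismatched
# delimiters such as "[Serious)" are accepted as well.
mismatched = re.search(TAG, "[Serious)") is not None
print(mismatched)

# Other capitalizations, such as all caps, are not covered.
all_caps = re.search(TAG, "[SERIOUS]") is not None
print(all_caps)
```

Accepting mismatched delimiters is usually harmless for counting, but a stricter alternative would be `r"\[[Ss]erious\]|\([Ss]erious\)"`.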


## 7. Combining Multiple Regular Expressions ##

We should consider a post serious only if the tag occurs at the beginning or end of the title. To match titles with the tag at the beginning, we can use the "^" character in a regex. To match titles with the tag at the end, we can use "$". These characters produce two different regular expressions, and we'd like to identify all titles that match either of them. The "|" character combines them: a regex of the form "A|B" matches any string that matches A or B.

In [48]:
import re

serious_start_count = 0
serious_end_count = 0
serious_count_final = 0

for each in posts:
    if re.search(r"^[\[\(][Ss]erious[\]\)]|[\[\(][Ss]erious[\]\)]$", each[0]) is not None:
        serious_count_final += 1

for each in posts:
    if re.search(r"^[\[\(][Ss]erious[\)\]]", each[0]) is not None:
        serious_start_count += 1

for each in posts:
    if re.search(r"[\[\(][Ss]erious[\)\]]$", each[0]) is not None:
        serious_end_count += 1

print(serious_start_count)
print(serious_end_count)
print(serious_count_final)

69
11
80
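
The sketch below shows how the anchored pattern treats a tag at the start, at the end, and in the middle of a title. The three titles are hypothetical examples, and `TAG` and `START_OR_END` are illustrative names.

```python
import re

TAG = r"[\[\(][Ss]erious[\]\)]"
# "|" has the lowest precedence, so this reads as (^TAG) or (TAG$).
START_OR_END = "^" + TAG + "|" + TAG + "$"

# Hypothetical titles with the tag in three different positions.
titles = [
    "[Serious] What habit improved your life?",   # tag at the start
    "What habit improved your life? (serious)",   # tag at the end
    "Is the [Serious] tag overused?",             # tag in the middle
]
flags = [re.search(START_OR_END, t) is not None for t in titles]
print(flags)
```

Only the first two titles match, because `^` and `$` tie the tag to the very beginning or very end of the string.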


## 8. Using Regular Expressions to Substitute Strings ##

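The re.sub() function replaces every match of a regex in a string with a replacement string. Here we replace "[serious]", "(Serious)", and "(serious)" with "[Serious]" in every title in posts, so that all serious tags share one format.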

In [67]:
import re
for each in posts:
    each[0] = re.sub(r"[\[\(][Ss]erious[\]\)]", "[Serious]", each[0])
print(len(posts))

867
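
The effect of the substitution can be checked on a handful of titles. In this sketch the four short titles are hypothetical stand-ins for real rows, and `TAG` is an illustrative name for the pattern used above.

```python
import re

TAG = r"[\[\(][Ss]erious[\]\)]"

# Hypothetical titles, one per tag spelling.
titles = [
    "[serious] What scares you?",
    "What scares you? (Serious)",
    "(serious) What scares you?",
    "[Serious] What scares you?",
]
# re.sub returns a new string; titles without a match come back unchanged.
normalized = [re.sub(TAG, "[Serious]", t) for t in titles]
for t in normalized:
    print(t)
```

After normalization, a single escaped pattern, `r"\[Serious\]"`, is enough to find every tagged post.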


## Final conclusion: