## Inspection of the compressed data filtered by score

In [1]:
import pandas as pd
import re
import numpy as np

The data sets are:
- reddit_comments_03_04_05_by_score.csv
- reddit_posts_03_04_05_by_score.csv

#### Pre-filtering
These data sets are a sample originally limited to 10,000,000 rows, with each file filtered to keep only scores > 80.

The samples in each file cover March to May 2019, and the data has been "cleaned":
- no data with text "deleted"
- no data with text "removed"
- no data with text "Removed by reddit in response to a copyright notice."
- no NAN on selftext and body
- no empty data on selftext and body
- excluded the subreddit "de" because it is written in German
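
The rules above could be expressed in pandas roughly as follows. This is only a sketch, not the code that produced the files: the helper name `prefilter`, the toy frame, and the exact placeholder strings (for example `[deleted]` with brackets) are assumptions made for illustration.

```python
import pandas as pd

# Assumed placeholder texts; the exact strings used upstream are not shown in this notebook.
PLACEHOLDERS = {
    "[deleted]",
    "[removed]",
    "deleted",
    "removed",
    "Removed by reddit in response to a copyright notice.",
}

def prefilter(df, text_col, min_score=80, excluded_subreddits=("de",)):
    """Apply the cleaning rules listed above to one data frame (hypothetical helper)."""
    text = df[text_col]
    keep = (
        (df["score"] > min_score)                      # score above the threshold
        & text.notna()                                 # no NaN text
        & (text.fillna("").str.strip() != "")          # no empty text
        & ~text.isin(PLACEHOLDERS)                     # no deleted/removed placeholders
        & ~df["subreddit"].isin(excluded_subreddits)   # drop excluded subreddits
    )
    return df[keep].reset_index(drop=True)

# Toy data: only the first row satisfies every rule.
toy = pd.DataFrame({
    "score": [125, 50, 200, 300, 90, 150],
    "subreddit": ["CCW", "aww", "de", "DnD", "aww", "AMA"],
    "body": ["Everyone in Brazil is an off duty cop.", "low score", "Hallo",
             "[deleted]", None, "  "],
})
filtered = prefilter(toy, "body")
print(filtered)
```

For posts, the same helper would be applied to the `selftext` column.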

In [2]:
comments = pd.read_csv('/Users/giuliagalli/Documents/GitHub/TFM/01_data/reddit_comments_03_04_05_by_score.csv.gz', compression='gzip', 
                                 header=0, sep=',', quotechar='"')

In [3]:
posts = pd.read_csv('/Users/giuliagalli/Documents/GitHub/TFM/01_data/reddit_posts_03_04_05_by_score.csv.gz', compression='gzip', 
                                 header=0, sep=',', quotechar='"')

### Data Overview

In [4]:
comments.shape

(4806645, 3)

In [5]:
comments.describe()

| | score |
|---|---|
| count | 4806645.0 |
| mean | 348.1102 |
| std | 975.3767 |
| min | 81.0 |
| 25% | 104.0 |
| 50% | 149.0 |
| 75% | 273.0 |
| max | 73560.0 |


In [6]:
posts.shape

(293104, 4)

In [7]:
posts.describe()

| | score |
|---|---|
| count | 293104.0 |
| mean | 575.17878 |
| std | 2156.38949 |
| min | 81.0 |
| 25% | 109.0 |
| 50% | 168.0 |
| 75% | 350.0 |
| max | 133101.0 |


Let's take a look at our data

In [8]:
comments.head(10)

| | score | subreddit | body |
|---|---|---|---|
| 0 | 125 | CCW | Everyone in Brazil is an off duty cop. |
| 1 | 147 | leagueoflegends | Same with huhi, wtf |
| 2 | 149 | BlackPeopleTwitter | I’d try it. \n\nLooks like a commitment tho...... |
| 3 | 264 | instantkarma | I mean he was kicking the glass hard enough it... |
| 4 | 307 | ffxiv | That female Ronso. Everyone is all up in arms ... |
| 5 | 84 | DnD | * DM improvises an ad hoc reason why the party... |
| 6 | 129 | collegehockey | According to all known laws of aviation, there... |
| 7 | 216 | todayilearned | Not to brag but I have multiple airs in my wal... |
| 8 | 467 | aww | i'm Timmy the cat, n so lucky i be\n\nmy guy t... |
| 9 | 168 | Animemes | tbh imho its just the biggest one line meme/in... |


In [9]:
posts.head(10)

| | score | subreddit | title | selftext |
|---|---|---|---|---|
| 0 | 186 | 3d6 | [5e] Almost 40 AC as a Wizard | Hello everyone, and what I said in the title i... |
| 1 | 204 | ACT | Adversity Scores | I'm going to be switching to the ACT now, caus... |
| 2 | 114 | AFL | David Mundy: A star | If Mundy played for any Melbourne based club, ... |
| 3 | 109 | AJR | Indeed there was a song missing... | I skipped "Beats" by accident when I made my p... |
| 4 | 241 | AMA | I spent 5.5 years on a PhD, and then quit. AMA | I was in a PhD program for 5.5 years. Got 4.0... |
| 5 | 435 | AMA | I’m an overweight 21 year old who just started... | Will answer questions after I get off work ton... |
| 6 | 145 | Amd | TheGoodOldGamer latests IPC video something se... | I was surprised to see TheGoodOldGamer latests... |
| 7 | 177 | BPD | The need to have ‘someone’. | I recently started using Tinder and other dati... |
| 8 | 108 | BPD | i just dissociated while driving. | i have plans with a friend in about 15 minutes... |
| 9 | 149 | CBD | Dog started seizing last night. Shoved a pipet... | I saw a video the other day of a dog getting d... |


Check that there are no NaN values

In [10]:
comments.isnull().sum()

score        0
subreddit    0
body         0
dtype: int64

In [11]:
posts.isnull().sum()

score        0
subreddit    0
title        0
selftext     0
dtype: int64

Checking the length of the longest text in our comments and posts

In [12]:
comments.body.map(lambda x: len(x)).max()

16140

In [13]:
posts.title.map(lambda x: len(x)).max()

318

In [14]:
posts.selftext.map(lambda x: len(x)).max()

43750

### Data cleaning

#### HTML

As HTML links do not add information to our text, we remove them.
To do that, we define a function that cleans the data with a regular expression:

- `http`: matches the literal characters
- `\S+`: matches one or more non-whitespace characters (the rest of the URL)
- each match is replaced with the empty string

Note that this pattern only removes URLs starting with `http`; it does not strip HTML tags.
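
To make the behaviour of the pattern concrete, here is a small sketch using only the standard library. The sample strings are made up for illustration.

```python
import re

# Same pattern as described above: "http" followed by any run of non-whitespace.
URL_PATTERN = re.compile(r"http\S+")

sample = "Source: https://www.reddit.com/r/aww and more text"
cleaned = URL_PATTERN.sub("", sample)
print(repr(cleaned))

# Text without the "http" prefix is left untouched, including bare domains and HTML tags.
bare_domain = URL_PATTERN.sub("", "see www.reddit.com")
html_tag = URL_PATTERN.sub("", "<b>bold</b>")
print(repr(bare_domain), repr(html_tag))
```

The space on each side of the removed URL is kept, so a double space remains where the link used to be.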

Removing __html__ text from comments

In [15]:
def remove_html_tags(text):
    clean = re.compile(r'http\S+')
    return re.sub(clean, '', str(text))

In [16]:
comments['body'] = comments['body'].map(remove_html_tags)

Removing __html__ text from posts

In [17]:
posts['title'] = posts['title'].map(remove_html_tags)

In [18]:
posts['selftext'] = posts['selftext'].map(remove_html_tags)

#### Line endings

Removing __\n__ text from comments

In [20]:
def remove_lineEndings (text):
    clean = re.compile(r'\n')
    return re.sub(clean, '', str(text))

In [21]:
comments['body'] = comments['body'].map(remove_lineEndings)

Removing __\n__ text from posts

In [22]:
posts['title'] = posts['title'].map(remove_lineEndings)

In [23]:
posts['selftext'] = posts['selftext'].map(remove_lineEndings)
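
One side effect worth checking: replacing `\n` with the empty string joins the words on either side of a line break. The sketch below compares that behaviour with a hypothetical variant, `replace_line_endings`, that substitutes a single space instead. The variant is an illustration, not part of the original pipeline.

```python
import re

def replace_line_endings(text):
    """Hypothetical variant: collapse each run of line breaks into one space."""
    return re.sub(r"[\r\n]+", " ", str(text))

sample = "i be\n\nmy guy"

joined = re.sub(r"\n", "", sample)     # same substitution as remove_lineEndings
spaced = replace_line_endings(sample)  # keeps the word boundary

print(repr(joined))
print(repr(spaced))
```

Whether the merged words matter depends on the downstream tokenisation, so this is a design choice rather than a required change.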

#### Digits/Punctuation/Symbols

Symbols and punctuation can carry meaningful information, but we remove them here to simplify the text. Letters, digits, spaces and the `/` character are kept.

Removing __symbols__ text from comments

In [66]:
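def remove_symbols(text):
    # Keep letters, digits, spaces and "/" so that subreddit references such as r/aww survive
    # (a literal "r" in the class would be redundant, since a-z already covers it)
    clean = re.compile(r'[^a-zA-Z0-9 /]')
    return re.sub(clean, '', str(text))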

In [67]:
comments['body'] = comments['body'].map(remove_symbols)

Removing __symbols__ text from posts

In [69]:
posts['title'] = posts['title'].map(remove_symbols)

In [70]:
posts['selftext'] = posts['selftext'].map(remove_symbols)
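
As a quick check of what the symbol filter keeps, the sketch below applies it to a made-up sentence. The character class keeps letters, digits, spaces and `/`; writing a literal `r` inside the class adds nothing because `a-z` already includes it, which the last line confirms.

```python
import re

# Everything that is not a letter, digit, space or "/" is removed.
SYMBOLS = re.compile(r"[^a-zA-Z0-9 /]")

sample = "I’d try r/aww, it's 100% great!"
cleaned = SYMBOLS.sub("", sample)
print(cleaned)

# A class that also lists "r" explicitly behaves identically.
same_as_with_r = re.sub(r"[^a-zA-Z0-9 r/]", "", sample) == cleaned
print(same_as_with_r)
```

Note that apostrophes are dropped too, so contractions such as "it's" become "its".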

### Balancing data

In [None]:
comments.groupby("subreddit").count().mean()

comments: all subreddits with more than 100 rows will be kept

In [None]:
posts.groupby("subreddit").count().mean()

posts: all subreddits with more than 30 rows will be kept
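
One way the thresholds above could be applied is sketched below. The helper name `keep_frequent_subreddits` and the toy frame are hypothetical; the notebook does not show this step, so treat it as an assumption about how the filtering might look.

```python
import pandas as pd

def keep_frequent_subreddits(df, min_rows):
    """Keep only rows whose subreddit appears more than `min_rows` times."""
    # Map every row to the total number of rows of its subreddit.
    sizes = df["subreddit"].map(df["subreddit"].value_counts())
    return df[sizes > min_rows]

# Toy data: "aww" has 3 rows, "AMA" has 2, "DnD" has 1.
toy = pd.DataFrame({
    "subreddit": ["aww", "aww", "aww", "DnD", "AMA", "AMA"],
    "score": [467, 120, 95, 84, 241, 435],
})
kept = keep_frequent_subreddits(toy, min_rows=2)
print(kept)
```

With the real data this would be called as `keep_frequent_subreddits(comments, 100)` and `keep_frequent_subreddits(posts, 30)`.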

In [None]:
comments['counter'] = comments.body.map(lambda x: len(x))
comments.head(5)

In [None]:
posts['counter_text'] = posts.selftext.map(lambda x: len(x))
posts.head(5)

In [None]:
posts['counter_title'] = posts.title.map(lambda x: len(x))
posts.head(5)
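
The length columns above can also be computed with the vectorised `.str.len()` accessor, which gives the same result as mapping `len` over each row. A minimal sketch on a made-up two-row frame (`toy_posts` is a hypothetical name):

```python
import pandas as pd

toy_posts = pd.DataFrame({
    "title": ["Adversity Scores", "David Mundy: A star"],
    "selftext": [
        "I'm going to be switching to the ACT now",
        "If Mundy played for any Melbourne based club",
    ],
})

# Character counts per row, equivalent to .map(lambda x: len(x))
toy_posts["counter_title"] = toy_posts["title"].str.len()
toy_posts["counter_text"] = toy_posts["selftext"].str.len()
print(toy_posts[["counter_title", "counter_text"]])
```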