# Data cleaning

In [1]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [2]:
!gcloud auth login

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?code_challenge=CiR_VIRVYxcEEki88vHjZE1L56CJVnQvk6VLCB5I1P4&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=32555940559.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth


Enter verification code: 4/tQH5BCPk3KVcMPSXzvn-C8_8neesN4chL0eyoAK4bTJPuFD0jLJENHI
If you need to use ADC, see:
  gcloud auth application-default --help

You are now logged in as [giuliag.master@gmail.com].
Your current project is [reddit-master].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [3]:
!gcloud config set project reddit-master

Updated property [core/project].


Once authenticated, we can download the final data from our project's bucket in GCP.

In [0]:
import pandas as pd
from glob import glob
import re
import numpy as np

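We define helper functions that lowercase the text and use regular expressions to remove:
*  URLs
*  line endings (`\n`)
*  punctuation
*  symbols

Digits are kept: the character class in `remove_symbols` allows `0-9` along with letters, apostrophes and spaces.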

In [0]:
def remove_urls(text):
    clean = re.compile(r'http\S+')
    return re.sub(clean, '', str(text))

def remove_line_endings (text):
    clean = re.compile(r'\n')
    return re.sub(clean, '', str(text))

def remove_symbols (text):
    clean = re.compile(r"[^a-zA-Z0-9' ]") 
    return re.sub(clean, '', str(text))

def clean_text(text):
  return remove_urls(remove_line_endings(remove_symbols(text.lower())))

def clean_text_df(df, column):
  df[column] = df[column].map(clean_text)
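
A note on ordering, as a minimal sketch rather than a replacement for the functions above. Because `clean_text` calls `remove_symbols` first, newlines are already stripped (they are not in the allowed character class) and words on either side of a line break get fused, which is where tokens such as `world20s` in the outputs further down come from. The variant below is an assumption of mine (the name `clean_text_spaced` and the sample string are hypothetical): it removes URLs first and turns line endings into spaces.

```python
import re

URL_RE = re.compile(r"http\S+")
NEWLINE_RE = re.compile(r"[\r\n]+")
SYMBOL_RE = re.compile(r"[^a-z0-9' ]")
SPACES_RE = re.compile(r" {2,}")


def clean_text_original(text):
    """Same steps and order as the notebook's clean_text."""
    text = re.sub(r"[^a-zA-Z0-9' ]", "", str(text).lower())  # also strips "\n"
    text = re.sub(r"\n", "", text)                           # nothing left to remove
    return re.sub(r"http\S+", "", text)


def clean_text_spaced(text):
    """Hypothetical variant: URLs first, line endings become spaces."""
    text = str(text).lower()
    text = URL_RE.sub(" ", text)        # drop URLs while ':' and '/' still exist
    text = NEWLINE_RE.sub(" ", text)    # keep a word boundary at line breaks
    text = SYMBOL_RE.sub("", text)      # keep letters, digits, apostrophes, spaces
    return SPACES_RE.sub(" ", text).strip()


sample = "Teens: I hate the world\n20s: I'm going to save it! See https://example.com/a?b=1"
original_result = clean_text_original(sample)
spaced_result = clean_text_spaced(sample)
print(repr(original_result))
print(repr(spaced_result))
```

Hyphenated words are still fused in both versions (`salt-and-pepper` becomes `saltandpepper`), since hyphens are simply deleted.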

## Comments

In [6]:
!gsutil cp gs://reddit_comments_master/comments_2018-* .

Copying gs://reddit_comments_master/comments_2018-000000000000...
Copying gs://reddit_comments_master/comments_2018-000000000001...
Copying gs://reddit_comments_master/comments_2018-000000000002...
Copying gs://reddit_comments_master/comments_2018-000000000003...
- [4 files][846.3 MiB/846.3 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://reddit_comments_master/comments_2018-000000000004...
Copying gs://reddit_comments_master/comments_2018-000000000005...
Copying gs://reddit_comments_master/comments_2018-000000000006...
Copying gs://reddit_comments_master/comments_2018-000000000007...
Copying gs://reddit_comments_master/comments_2018-000000000008...
Copying gs://reddit_comments_master/comments_2018-000000000009...
- [10 f

In [8]:
!pwd

/content


In [9]:
!head comments_2018-000000000004 # checking the file

subreddit,body
aww,So cute..!!♡.♡
aww,Shaken baby syndrome 
aww,She's made her choice
aww,"I have more pics of her but I don't know how to link them and I'm at work, if I figure it out I'll do it when I finish"
aww,This is the most adorable thing I have ever seen. Thanks so much for sharing 
aww,"Incredibly sweet! I was thinking however, ""Who was watching his bag?"""
aww,It did so  cute!!!! ❤
aww,Link to they toy in the back?
aww,Literal speed dating!


In [0]:
comments_files = glob("/content/comments_2018*") # all files have the same pattern
dfs = []

for file in comments_files:
  df = pd.read_csv(file)
  dfs.append(df)

df_comments = pd.concat(dfs, axis=0, ignore_index=True)
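
The loading loop can also be wrapped in a small helper. This is a sketch under my own assumptions: `load_shards` is a hypothetical name, the shard contents are made-up rows, and `dtype=str` with `keep_default_na=False` is an optional choice so that a comment whose whole text is something like `NA` or `null` is not parsed as a missing value.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd


def load_shards(pattern):
    """Read every CSV shard matching `pattern` into one DataFrame."""
    files = sorted(glob(pattern))  # sorted so the row order is reproducible
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(f, dtype=str, keep_default_na=False) for f in files]
    return pd.concat(frames, axis=0, ignore_index=True)


# Tiny stand-in shards (made-up rows) so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "comments_2018-000000000000").write_text(
        "subreddit,body\naww,So cute\naww,NA\n", encoding="utf-8")
    Path(tmp, "comments_2018-000000000001").write_text(
        'subreddit,body\nscience,"Hello, world"\n', encoding="utf-8")
    df_demo = load_shards(str(Path(tmp, "comments_2018*")))

print(df_demo)
```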

In [11]:
df_comments.describe()

Unnamed: 0,subreddit,body
count,10754982,10754982
unique,14,10424325
top,nba,Your submission has been automatically removed...
freq,768213,18649


**Correct**: we have 14 subreddits. The most frequent one has 768213 comments and 14 × 768213 = 10754982, which is the total row count, so every subreddit has exactly 768213 comments from 2018.
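
`describe()` only reports the most frequent subreddit, so the balance is inferred from the arithmetic. A more direct check is sketched below on a toy DataFrame (the helper name `is_balanced` and the rows are my own, not part of the notebook); on the real data it would be called on `df_comments`.

```python
import pandas as pd


def is_balanced(df, column="subreddit"):
    """Return (True/False, per-class counts) for the given label column."""
    counts = df[column].value_counts()
    return bool(counts.nunique() == 1), counts


toy = pd.DataFrame({
    "subreddit": ["aww", "aww", "nba", "nba", "science", "science"],
    "body": ["a", "b", "c", "d", "e", "f"],
})

balanced, counts = is_balanced(toy)
print(balanced)
print(counts)

# Dropping one row breaks the balance.
unbalanced, _ = is_balanced(toy.iloc[:-1])
print(unbalanced)
```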

In [8]:
df_comments.tail()

Unnamed: 0,subreddit,body
10754977,science,"Replace every door and cabinet handle, every d..."
10754978,science,I've never been so happy to live on the west c...
10754979,science,Can't imagine smoking it every day for hours. ...
10754980,science,I've been seeing a lot of comments similar to ...
10754981,science,While earth flaters and anti vaxxers thinks it...


### Pre cleaning filters with SQL

The DataFrame was already pre-cleaned with SQL filters before export:
- no rows with body "deleted"
- no rows with body "removed"
- no rows with body "Removed by reddit in response to a copyright notice."
- no NaN values in body
- no empty body

### Comments - Data Overview

In [13]:
df_comments.shape

(10754982, 2)

Check that there are no null values

In [14]:
df_comments.isnull().sum()

subreddit    0
body         0
dtype: int64

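Check for NaN values in `body` (`isna` is an alias of `isnull`, so this repeats the check above for a single column); there are none, so nothing needs to be deleted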


In [16]:
df_comments["body"].isna().sum()

0

Check the length, in characters, of the longest and shortest comments


In [49]:
df_comments.body.map(lambda x: len(x)).max()

16200

In [50]:
df_comments.body.map(lambda x: len(x)).min()

11
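
The same length check can be written with the vectorised string accessor, which also makes it easy to locate the longest row. A minimal sketch on toy rows (the three strings are taken from outputs shown in this notebook, the DataFrame itself is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "body": ["that's awesome", "my thoughts as well", "friends don't eat friends"],
})

lengths = toy["body"].str.len()          # one length per row, no lambda needed
summary = lengths.agg(["min", "max"])    # both extremes in a single pass
longest_row = lengths.idxmax()           # index label of the longest text

print(summary)
print(toy.loc[longest_row, "body"])
```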

### Comments - Data Cleaning

Clean the `body` column in place with the `clean_text_df` function defined above

In [0]:
clean_text_df(df_comments,'body')

In [10]:
df_comments.head()

Unnamed: 0,subreddit,body
0,aww,doubling down with the multiple sub approach g...
1,aww,i 2nd meatball i have a seriously chunky engli...
2,aww,my thoughts as well
3,aww,friends don't eat friends
4,aww,maybe the doctors are waiting with the heart i...


## Posts

In [11]:
!gsutil cp gs://reddit_posts_master/posts_2018 .

Copying gs://reddit_posts_master/posts_2018...
/ [1 files][ 22.5 MiB/ 22.5 MiB]                                                
Operation completed over 1 objects/22.5 MiB.                                     


In [12]:
!pwd

/content


In [19]:
!head posts_2018

subreddit,title,selftext
funny,Say Alpha Kenny One fast!,Now realise you need Jesus...
funny,Mary Poppins sequels,"Disney has announced the next Mary Poppins Sequels. After Mary Poppins Returns, there will be: 

*Mary Poppins Forever 
*Mary Poppins and Jack 
*Mary Poppins Begins 
*The Dark Nanny 
*The Dark Nanny Rises 
*Mary Poppins v Superman: Dawn of Justice 


In [0]:
df_posts = pd.read_csv("posts_2018")

In [21]:
df_posts.describe()

Unnamed: 0,subreddit,title,selftext
count,110656,110656,110656
unique,14,105581,15710
top,science,IamA (Blank) AMA!,[deleted]
freq,7904,67,66247


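**Correct**: we have 14 subreddits. The most frequent one has 7904 posts and 14 × 7904 = 110656, which is the total row count, so every subreddit has exactly 7904 posts (title and selftext) from 2018.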

In [22]:
df_posts.head()

Unnamed: 0,subreddit,title,selftext
0,funny,Say Alpha Kenny One fast!,Now realise you need Jesus...
1,funny,Mary Poppins sequels,Disney has announced the next Mary Poppins Seq...
2,funny,Catbombing my ex girlfriend in the shower.,Few years into our relationship we adopted a k...
3,funny,Christmas poetry,"I saw a lady Santa Clause, a standing on the s..."
4,funny,True story: My brother resembles Santa Claus,His portly frame and salt-and-pepper beard ma...


### Pre cleaning filters with SQL

The DataFrame was already pre-cleaned with SQL filters before export:
- no rows with selftext "deleted" and title length < 5
- no rows with selftext "removed" and title length < 5
- no rows with selftext "Removed by reddit in response to a copyright notice." and title length < 5
- no rows with NaN selftext and title length < 5
- no rows with empty selftext and title length < 5

### Posts - Data Overview

In [14]:
df_posts.shape

(110656, 3)

Check that there are no null values

In [15]:
df_posts.isnull().sum()

subreddit    0
title        0
selftext     0
dtype: int64

Check the length, in characters, of the longest and shortest `title` and `selftext` in our posts

In [17]:
df_posts.title.map(lambda x: len(x)).max()

316

In [18]:
df_posts.title.map(lambda x: len(x)).min()

6

In [19]:
df_posts.selftext.map(lambda x: len(x)).max()

39880

In [20]:
df_posts.selftext.map(lambda x: len(x)).min()

1

### Posts - Data Cleaning

Clean the `title` and `selftext` columns in place with the `clean_text_df` function defined above

In [0]:
clean_text_df(df_posts,'title')
clean_text_df(df_posts,'selftext')

In [26]:
df_posts.head()

Unnamed: 0,subreddit,title,selftext
0,funny,say alpha kenny one fast,now realise you need jesus
1,funny,mary poppins sequels,disney has announced the next mary poppins seq...
2,funny,catbombing my ex girlfriend in the shower,few years into our relationship we adopted a k...
3,funny,christmas poetry,i saw a lady santa clause a standing on the st...
4,funny,true story my brother resembles santa claus,his portly frame and saltandpepper beard make...


### Posts - Assembling Corpus

To proceed we will need to:
- merge, in the posts dataset, the columns __title__ and __selftext__ in order to have only one corpus (body) for the corresponding subreddit
- create a single dataset from comments and posts, each row with its related subreddit

In [0]:
df_posts['body']= df_posts.title + " " + df_posts.selftext
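
One side effect worth knowing: a selftext of `[deleted]` becomes `deleted` after cleaning, so the plain concatenation appends that word to the title (visible in the `tail` output further down, e.g. `the dark side of led lightbulbs deleted`). The sketch below shows one possible way to blank such placeholders before concatenating. It is an assumption, not what this notebook does; the set `PLACEHOLDERS` and the toy rows are hypothetical.

```python
import pandas as pd

# Assumption: cleaned forms of "[deleted]" / "[removed]" placeholders.
PLACEHOLDERS = {"deleted", "removed"}

posts = pd.DataFrame({
    "subreddit": ["science", "funny"],
    "title": ["the dark side of led lightbulbs", "christmas poetry"],
    "selftext": ["deleted", "i saw a lady santa clause"],
})

# Keep selftext only where it is not a placeholder, otherwise use "".
selftext = posts["selftext"].where(~posts["selftext"].isin(PLACEHOLDERS), "")

# strip() removes the trailing space left when selftext is empty.
posts["body"] = (posts["title"] + " " + selftext).str.strip()
print(posts["body"].tolist())
```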

In [28]:
df_posts.head(20) #it works properly

Unnamed: 0,subreddit,title,selftext,body
0,funny,say alpha kenny one fast,now realise you need jesus,say alpha kenny one fast now realise you need ...
1,funny,mary poppins sequels,disney has announced the next mary poppins seq...,mary poppins sequels disney has announced the ...
2,funny,catbombing my ex girlfriend in the shower,few years into our relationship we adopted a k...,catbombing my ex girlfriend in the shower few ...
3,funny,christmas poetry,i saw a lady santa clause a standing on the st...,christmas poetry i saw a lady santa clause a s...
4,funny,true story my brother resembles santa claus,his portly frame and saltandpepper beard make...,true story my brother resembles santa claus h...
5,funny,santa gives bad kids coal,when santa started doing this coal was a home ...,santa gives bad kids coal when santa started d...
6,funny,my son describing christmas,so i asked my 3 year old son what they did at ...,my son describing christmas so i asked my 3 ye...
7,funny,how do you spot a blind guy at a nude beach,it's not hard,how do you spot a blind guy at a nude beach it...
8,funny,hittin those milestones,teens i hate the world20s im going to save the...,hittin those milestones teens i hate the world...
9,funny,convinced someone i'm a beagle,i was in grade 6 and we had a giant group wher...,convinced someone i'm a beagle i was in grade ...


Let's keep only the `subreddit` and `body` columns so that both DataFrames have the same structure

In [0]:
df_posts_corpus = df_posts.drop(['title', 'selftext'], axis=1)

In [30]:
df_posts_corpus.tail(5)

Unnamed: 0,subreddit,body
110651,science,study finds robust sex differences in children...
110652,science,scan technique reveals secret writing in mummy...
110653,science,psychedelic drugs can help relieve the symptom...
110654,science,you are shaped by the genes you inherit and ma...
110655,science,how to make gradient background in pixellab 20...


## New global ds "comments_posts_clear"

Merge the two datasets into one.

Knowing the shapes of the comments and posts DataFrames, we can check that the final DataFrame has as many rows as the two combined.

In [31]:
df_comments.shape

(10754982, 2)

In [32]:
df_posts_corpus.shape

(110656, 2)

In [0]:
df_comments_posts = pd.concat([df_comments, df_posts_corpus], axis=0, ignore_index=True)

In [34]:
df_comments_posts.shape # 10754982 + 110656 rows: the two datasets are now joined into one

(10865638, 2)

In [35]:
df_comments_posts.describe() # the number of subreddits is unchanged

Unnamed: 0,subreddit,body
count,10865638,10865638.0
unique,14,10327835.0
top,todayilearned,
freq,776117,32288.0
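
The `top` value for `body` is blank with a frequency of 32288, which suggests that some texts had nothing left after cleaning (for instance texts made only of symbols or a URL). The notebook keeps these rows. If they are not wanted, they could be counted and dropped as in this sketch, which runs on made-up rows and uses names of my own (`is_blank`, `df_nonempty`):

```python
import pandas as pd

df = pd.DataFrame({
    "subreddit": ["aww", "science", "funny", "aww"],
    "body": ["that's awesome", "", "   ", "my thoughts as well"],
})

# A body counts as blank when nothing remains after trimming spaces.
is_blank = df["body"].str.strip().eq("")
n_blank = int(is_blank.sum())

df_nonempty = df.loc[~is_blank].reset_index(drop=True)
print(n_blank)
print(df_nonempty)
```

Note that dropping rows would change the per-subreddit counts checked earlier.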


In [36]:
df_comments_posts.head(20)

Unnamed: 0,subreddit,body
0,aww,doubling down with the multiple sub approach g...
1,aww,i 2nd meatball i have a seriously chunky engli...
2,aww,my thoughts as well
3,aww,friends don't eat friends
4,aww,maybe the doctors are waiting with the heart i...
5,aww,tell him if he wants to stay hell have to shave
6,aww,that's just mean
7,aww,that's awesome
8,aww,is this in san diego because there are posters...
9,aww,lil bun is my new rapper name


In [37]:
df_comments_posts.tail(20)

Unnamed: 0,subreddit,body
10865618,science,the evolution of man's face over the course of...
10865619,science,nanofabricator would change everything deleted
10865620,science,obscure vomiting illness linked to longterm po...
10865621,science,the dark side of led lightbulbs deleted
10865622,science,skywatchers see 'super blue blood moon' deleted
10865623,science,obese fat becomes inflamed and scarred which m...
10865624,science,the neuroscience of proactive vs hyperreactive...
10865625,science,74 things that blew our minds in 2017 deleted
10865626,science,new oxford university research has revealed th...
10865627,science,fourdimensional physics in two dimensions deleted


Check for NaN values in `body`; there are none, so nothing needs to be deleted

In [39]:
df_comments_posts["body"].isna().sum()

0

Now we "freeze" our dataset into a csv file using `to_csv`

In [0]:
from google.colab import files
df_comments_posts.to_csv('comments_posts_2018_V2.csv')
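
Two details of the CSV round trip are easy to miss, shown here as a sketch on a toy DataFrame with an in-memory buffer instead of the real file. By default `to_csv` also writes the row index as an unnamed first column (`index=False` avoids that), and an empty `body` is read back as NaN by `read_csv` unless `keep_default_na=False` is passed.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "subreddit": ["aww", "science"],
    "body": ["that's awesome", ""],
})

buf = io.StringIO()
df.to_csv(buf, index=False)   # no extra "Unnamed: 0" column on reload

buf.seek(0)
back_default = pd.read_csv(buf)                        # "" comes back as NaN
buf.seek(0)
back_str = pd.read_csv(buf, keep_default_na=False)     # "" stays ""

print(back_default["body"].isna().sum())
print(back_str["body"].tolist())
```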

In [41]:
!ls

adc.json		    comments_2018-000000000006
comments_2018-000000000000  comments_2018-000000000007
comments_2018-000000000001  comments_2018-000000000008
comments_2018-000000000002  comments_2018-000000000009
comments_2018-000000000003  comments_posts_2018_V2.csv
comments_2018-000000000004  posts_2018
comments_2018-000000000005  sample_data


In [42]:
import os
os.stat('comments_posts_2018_V2.csv').st_size

2166253816

Finally, we can upload it to the results bucket

In [43]:
!gsutil cp /content/comments_posts_2018_V2.csv gs://reddit_final_results/

Copying file:///content/comments_posts_2018_V2.csv [Content-Type=text/csv]...
/ [0 files][    0.0 B/  2.0 GiB]                                                ==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

/
Operation completed over 1 objects/2.0 GiB.                                      


In case it is needed later, I also upload a zip archive of `comments_posts_2018_V2.csv`

In [44]:
!zip /content/comments_posts_2018_V2.zip /content/comments_posts_2018_V2.csv

  adding: content/comments_posts_2018_V2.csv (deflated 64%)


In [45]:
!du -ch /content/comments_posts_2018_V2.zip

745M	/content/comments_posts_2018_V2.zip
745M	total
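
As an alternative to a separate `zip` step, pandas can write and read a compressed CSV directly. This is only a sketch of that option on a toy DataFrame and a temporary path (the file name `comments_posts_sample.csv.gz` is hypothetical); the notebook itself uploads the `.zip` created above.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "subreddit": ["aww", "science"],
    "body": ["that's awesome", "my thoughts as well"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "comments_posts_sample.csv.gz")
    df.to_csv(path, index=False, compression="gzip")
    restored = pd.read_csv(path)   # compression is inferred from the .gz suffix

print(restored)
```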


In [46]:
!gsutil cp /content/comments_posts_2018_V2.zip gs://reddit_final_results/

Copying file:///content/comments_posts_2018_V2.zip [Content-Type=application/zip]...
/ [0 files][    0.0 B/744.9 MiB]

/
Operation completed over 1 objects/744.9 MiB.                                    
