# Data cleaning

In [3]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [7]:
!gcloud auth login

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?code_challenge=ibnO7d81msPacZTiKQUQ3Oq_e_MuuEfzhmudgbtpZjA&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=32555940559.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth


Enter verification code: 4/tQHwfnB0T7ftuwvKcWxBp9Perhp9RWItfr90Ts7ITZQSL5Uryw1KH2Q
If you need to use ADC, see:
  gcloud auth application-default --help

You are now logged in as [giuliag.master@gmail.com].
Your current project is [reddit-master].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [5]:
!gcloud config set project reddit-master

Updated property [core/project].


After authenticating, we can download the final data from our project's bucket in GCP.

In [0]:
import pandas as pd
from glob import glob
import re
import numpy as np

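We define a function that uses regular expressions to strip the following from the text:
*  urls
*  line endings `\n`
*  punctuation
*  symbols

Letters, digits, spaces and `/` are kept, so a subreddit reference such as `r/science` keeps its slash.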

In [0]:
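def remove_urls(text):
    # Drop everything from "http" up to the next whitespace.
    clean = re.compile(r'http\S+')
    return re.sub(clean, '', str(text))

def remove_lineEndings(text):
    # Remove newline characters (no space is inserted in their place).
    clean = re.compile(r'\n')
    return re.sub(clean, '', str(text))

def remove_symbols(text):
    # Keep only letters, digits, spaces and "/".
    # "r/" is how a subreddit is referenced, so the slash is kept.
    clean = re.compile(r'[^a-zA-Z0-9 /]')
    return re.sub(clean, '', str(text))

def clean_text(text):
    # Applied inside-out: symbols first, then line endings, then URLs.
    return remove_urls(remove_lineEndings(remove_symbols(text)))

def clean_text_df(df, column):
    # Cleans the column in place; nothing is returned.
    df[column] = df[column].map(clean_text)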
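
The order of the three steps matters. Because `\n` is not in the set of allowed characters, the symbol filter already removes line endings, so words on either side of a line break are joined. The sketch below compares the notebook's order with a hypothetical variant (`clean_text_alt` is not part of the notebook) that removes URLs first and turns line breaks into a space. It is only an illustration under those assumptions, not the cleaning that produced the results shown later.

```python
import re

URL_RE = re.compile(r'http\S+')
NEWLINE_RE = re.compile(r'\n+')
SYMBOL_RE = re.compile(r'[^a-zA-Z0-9 /]')

def clean_text_original(text):
    # Same order as the notebook: symbols, then line endings, then URLs.
    text = SYMBOL_RE.sub('', str(text))
    text = text.replace('\n', '')
    return URL_RE.sub('', text)

def clean_text_alt(text):
    # Hypothetical variant: URLs first, line breaks become one space, symbols last.
    text = URL_RE.sub('', str(text))
    text = NEWLINE_RE.sub(' ', text)
    return SYMBOL_RE.sub('', text)

sample = "Thunder snow?\n\nSee https://example.com/page for r/science!"
print(repr(clean_text_original(sample)))
print(repr(clean_text_alt(sample)))
```

With the original order the two lines are joined into `snowSee`, while the variant keeps a space between them. In both cases `r/science` survives.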

## Comments

In [10]:
!gsutil cp gs://reddit_comments_master/comments_2018-* .

Copying gs://reddit_comments_master/comments_2018-000000000000...
Copying gs://reddit_comments_master/comments_2018-000000000001...
Copying gs://reddit_comments_master/comments_2018-000000000002...
Copying gs://reddit_comments_master/comments_2018-000000000003...
- [4 files][846.3 MiB/846.3 MiB]                                                
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://reddit_comments_master/comments_2018-000000000004...
Copying gs://reddit_comments_master/comments_2018-000000000005...
Copying gs://reddit_comments_master/comments_2018-000000000006...
Copying gs://reddit_comments_master/comments_2018-000000000007...
Copying gs://reddit_comments_master/comments_2018-000000000008...
Copying gs://reddit_comments_master/comments_2018-000000000009...
- [10 f

In [11]:
!pwd

/content


In [12]:
!head comments_2018-000000000004 # checking the file

subreddit,body
aww,So cute..!!♡.♡
aww,Shaken baby syndrome 
aww,She's made her choice
aww,"I have more pics of her but I don't know how to link them and I'm at work, if I figure it out I'll do it when I finish"
aww,This is the most adorable thing I have ever seen. Thanks so much for sharing 
aww,"Incredibly sweet! I was thinking however, ""Who was watching his bag?"""
aww,It did so  cute!!!! ❤
aww,Link to they toy in the back?
aww,Literal speed dating!


In [0]:
comments_files = glob("/content/comments_2018*") # all files have the same pattern
dfs = []

for file in comments_files:
  df = pd.read_csv(file)
  dfs.append(df)

df_comments = pd.concat(dfs, axis=0, ignore_index=True)
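
The same load pattern can be tried on a few small shards before running it on the real files. This is a minimal sketch: the `comments_demo-*` file names and their contents are made up for the example, and sorting the paths is an optional addition that makes the row order reproducible.

```python
import tempfile
from glob import glob
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Write three tiny shards that mimic the naming of the exported files.
    for i in range(3):
        shard = pd.DataFrame({
            "subreddit": ["aww", "science"],
            "body": [f"comment {i}a", f"comment {i}b"],
        })
        shard.to_csv(Path(tmp) / f"comments_demo-{i:012d}", index=False)

    # glob() returns paths in arbitrary order, so sort them first.
    shard_paths = sorted(glob(str(Path(tmp) / "comments_demo-*")))
    df_demo = pd.concat([pd.read_csv(p) for p in shard_paths],
                        axis=0, ignore_index=True)

print(df_demo.shape)
```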

In [14]:
df_comments.describe()

Unnamed: 0,subreddit,body
count,10754982,10754982
unique,14,10424325
top,IAmA,Your submission has been automatically removed...
freq,768213,18649


**Correct**: we have 14 subreddits, each with 768213 comments, for a total of 10754982 comments from 2018.
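
`describe()` only reports the most frequent subreddit, so the balance is inferred from the arithmetic: 14 × 768213 equals the total row count. A direct check is possible with `value_counts()`. The sketch below runs on a made-up four-row frame; on the real data the same two lines would be applied to `df_comments`.

```python
import pandas as pd

# The totals reported above are consistent with a perfectly balanced sample.
assert 14 * 768213 == 10754982

# Toy frame standing in for df_comments.
demo = pd.DataFrame({
    "subreddit": ["aww", "aww", "science", "science"],
    "body": ["a", "b", "c", "d"],
})

counts = demo["subreddit"].value_counts()
is_balanced = counts.nunique() == 1  # True when every subreddit has the same count
print(counts.to_dict(), is_balanced)
```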

In [16]:
df_comments.tail()

Unnamed: 0,subreddit,body
10754977,science,"Thunder snow?\n\n I'm not familiar, we don't g..."
10754978,science,Is this a newly discovered super parasite or s...
10754979,science,This all presupposes gender inequality to be f...
10754980,science,"Ah yes of course, the ""not my problem"" approach"
10754981,science,"If drank in excess, yes. My comment was not me..."


### Pre cleaning filters with SQL

The df has already been pre-"cleaned" in SQL:
- no rows with body "deleted"
- no rows with body "removed"
- no rows with body "Removed by reddit in response to a copyright notice."
- no NaN in body
- no empty body
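
The SQL itself is not shown in this notebook. For reference, an equivalent filter in pandas could look like the sketch below. The bracketed placeholders `[deleted]` and `[removed]` are an assumption about the exact literals used, and the sample rows are invented.

```python
import pandas as pd

# Assumed placeholder strings; the exact literals used in the SQL are not shown.
PLACEHOLDERS = {
    "[deleted]",
    "[removed]",
    "Removed by reddit in response to a copyright notice.",
}

raw = pd.DataFrame({
    "subreddit": ["aww", "aww", "aww", "science", "science"],
    "body": ["Nice photo", "[deleted]", None, "", "[removed]"],
})

body = raw["body"]
keep = body.notna() & ~body.isin(PLACEHOLDERS) & body.fillna("").ne("")
filtered = raw[keep].reset_index(drop=True)
print(filtered)
```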

### Comments - Data Overview

In [47]:
df_comments.shape

(10754982, 2)

Check that there are no NaN values

In [48]:
df_comments.isnull().sum()

subreddit    0
body         0
dtype: int64

Checking the length of the longest and the shortest text in our comments


In [49]:
df_comments.body.map(lambda x: len(x)).max()

16200

In [50]:
df_comments.body.map(lambda x: len(x)).min()

11
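
The same lengths can be obtained with the vectorised `.str.len()` accessor, which also makes it easy to look at the row behind the maximum. A small sketch on three invented rows:

```python
import pandas as pd

demo = pd.DataFrame({
    "body": ["So cute", "Literal speed dating!", "Link to they toy in the back?"],
})

lengths = demo["body"].str.len()   # character count per row
shortest, longest = lengths.min(), lengths.max()
longest_text = demo.loc[lengths.idxmax(), "body"]
print(shortest, longest, longest_text)
```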

### Comments - Data Cleaning

Cleaning the `body` column with the `clean_text_df` function defined above

In [0]:
clean_text_df(df_comments,'body')

## Posts

In [21]:
!gsutil cp gs://reddit_posts_master/posts_2018 .

Copying gs://reddit_posts_master/posts_2018...
/ [1 files][ 22.5 MiB/ 22.5 MiB]                                                
Operation completed over 1 objects/22.5 MiB.                                     


In [22]:
!pwd

/content


In [23]:
!head posts_2018

subreddit,title,selftext
funny,Say Alpha Kenny One fast!,Now realise you need Jesus...
funny,Mary Poppins sequels,"Disney has announced the next Mary Poppins Sequels. After Mary Poppins Returns, there will be: 

*Mary Poppins Forever 
*Mary Poppins and Jack 
*Mary Poppins Begins 
*The Dark Nanny 
*The Dark Nanny Rises 
*Mary Poppins v Superman: Dawn of Justice 


In [0]:
df_posts = pd.read_csv("posts_2018")

In [25]:
df_posts.describe()

Unnamed: 0,subreddit,title,selftext
count,110656,110656,110656
unique,14,105581,15710
top,worldnews,IamA (Blank) AMA!,[deleted]
freq,7904,67,66247


**Correct**: we have 14 subreddits, each with 7904 titles/selftexts, for a total of 110656 posts from 2018.

In [27]:
df_posts.head()

Unnamed: 0,subreddit,title,selftext
0,funny,Say Alpha Kenny One fast!,Now realise you need Jesus...
1,funny,Mary Poppins sequels,Disney has announced the next Mary Poppins Seq...
2,funny,Catbombing my ex girlfriend in the shower.,Few years into our relationship we adopted a k...
3,funny,Christmas poetry,"I saw a lady Santa Clause, a standing on the s..."
4,funny,True story: My brother resembles Santa Claus,His portly frame and salt-and-pepper beard ma...


### Pre cleaning filters with SQL

The df has already been pre-"cleaned" in SQL. A row is dropped only when its title is shorter than 5 characters and its selftext is unusable:
- no rows with selftext "deleted" and title length < 5
- no rows with selftext "removed" and title length < 5
- no rows with selftext "Removed by reddit in response to a copyright notice." and title length < 5
- no rows with NaN selftext and title length < 5
- no rows with empty selftext and title length < 5
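
Unlike the comments filter, a post is kept when either its title is long enough or its selftext is usable, which is why `[deleted]` is still the most frequent selftext in the `describe()` output. A pandas sketch of that rule follows; the placeholder literals and the sample rows are assumptions, since the SQL is not shown here.

```python
import pandas as pd

# Assumed placeholder strings; the exact literals used in the SQL are not shown.
PLACEHOLDERS = {
    "[deleted]",
    "[removed]",
    "Removed by reddit in response to a copyright notice.",
}

posts = pd.DataFrame({
    "title": ["Hi", "A long enough title", "Ok?", "Short"],
    "selftext": ["[deleted]", "[deleted]", "Real text here", ""],
})

selftext = posts["selftext"]
unusable_selftext = selftext.isna() | selftext.isin(PLACEHOLDERS) | selftext.eq("")
short_title = posts["title"].str.len() < 5

# Drop a row only when both conditions hold.
kept = posts[~(unusable_selftext & short_title)].reset_index(drop=True)
print(kept)
```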

### Posts - Data Overview

In [28]:
df_posts.shape

(110656, 3)

Check that there are no NaN values

In [65]:
df_posts.isnull().sum()

subreddit    0
title        0
selftext     0
dtype: int64

Checking the length of the longest and the shortest title and selftext in our posts

In [29]:
df_posts.title.map(lambda x: len(x)).max()

316

In [30]:
df_posts.selftext.map(lambda x: len(x)).max()

39880

In [31]:
df_posts.title.map(lambda x: len(x)).min()

6

In [32]:
df_posts.selftext.map(lambda x: len(x)).min() # plausible: the title can carry most of the text in a post

1

### Posts - Assembling Corpus

To proceed we will need to:
- merge, in the posts ds, the columns __title__ and __selftext__ in order to have only one corpus (body) for the corresponding subreddit
- create a single ds from comments and posts, each row with its related subreddit

In [0]:
df_posts['body']= df_posts.title + " " + df_posts.selftext

In [34]:
df_posts.head(20) #it works properly

Unnamed: 0,subreddit,title,selftext,body
0,funny,Say Alpha Kenny One fast!,Now realise you need Jesus...,Say Alpha Kenny One fast! Now realise you need...
1,funny,Mary Poppins sequels,Disney has announced the next Mary Poppins Seq...,Mary Poppins sequels Disney has announced the ...
2,funny,Catbombing my ex girlfriend in the shower.,Few years into our relationship we adopted a k...,Catbombing my ex girlfriend in the shower. Few...
3,funny,Christmas poetry,"I saw a lady Santa Clause, a standing on the s...","Christmas poetry I saw a lady Santa Clause, a ..."
4,funny,True story: My brother resembles Santa Claus,His portly frame and salt-and-pepper beard ma...,True story: My brother resembles Santa Claus ...
5,funny,Santa gives bad kids coal,"When Santa started doing this, coal was a home...",Santa gives bad kids coal When Santa started d...
6,funny,My Son describing Christmas.,So I asked my 3 year old son what they did at ...,My Son describing Christmas. So I asked my 3 y...
7,funny,How do you spot a blind guy at a nude beach?,It's not hard,How do you spot a blind guy at a nude beach? I...
8,funny,Hittin’ those milestones,Teens: I HATE the world!\n\n20’s: I’m going to...,Hittin’ those milestones Teens: I HATE the wor...
9,funny,Convinced someone i'm a beagle,I was in grade 6 and we had a giant group wher...,Convinced someone i'm a beagle I was in grade ...
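
Plain `+` concatenation is safe here because `isnull().sum()` reported zero missing values in both columns. If a selftext were missing, the sum would be NaN for that row. A minimal sketch of a NaN-tolerant alternative on invented rows (not what the notebook runs):

```python
import pandas as pd

posts = pd.DataFrame({
    "title": ["Christmas poetry", "Title only"],
    "selftext": ["I saw a lady", None],
})

# Plain concatenation: a missing selftext turns the whole body into NaN.
plain = posts["title"] + " " + posts["selftext"]

# Tolerant variant: treat a missing selftext as empty, then trim the trailing space.
safe = posts["title"].str.cat(posts["selftext"].fillna(""), sep=" ").str.strip()

print(plain.tolist())
print(safe.tolist())
```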


Let's keep only the column `body`, so that both df have the same structure

In [0]:
df_posts_corpus = df_posts.drop(['title', 'selftext'], axis=1)

In [39]:
df_posts_corpus.tail(5)

Unnamed: 0,subreddit,body
110651,science,Study finds robust sex differences in children...
110652,science,Scan technique reveals secret writing in mummy...
110653,science,Psychedelic drugs can help relieve the symptom...
110654,science,You Are Shaped by the Genes You Inherit And Ma...
110655,science,How to Make Gradient Background in Pixellab 20...


### Posts - Data Cleaning

Cleaning the `body` column with the `clean_text_df` function defined above

In [0]:
clean_text_df(df_posts_corpus,'body')

In [38]:
df_posts_corpus.head()

Unnamed: 0,subreddit,body
0,funny,Say Alpha Kenny One fast Now realise you need ...
1,funny,Mary Poppins sequels Disney has announced the ...
2,funny,Catbombing my ex girlfriend in the shower Few ...
3,funny,Christmas poetry I saw a lady Santa Clause a s...
4,funny,True story My brother resembles Santa Claus H...


## New global ds "comments_posts_clear"

Unification of the two datasets.

Recalling the shapes of comments and posts, I can check that the final df has been created correctly: its row count must be the sum of theirs.

In [40]:
df_comments.shape

(10754982, 2)

In [41]:
df_posts_corpus.shape

(110656, 2)

In [0]:
df_comments_posts = pd.concat([df_comments, df_posts_corpus], axis=0, ignore_index=True)

In [43]:
df_comments_posts.shape # the sum confirms that the two ds are now joined into a single one

(10865638, 2)

In [44]:
df_comments_posts.describe() # and the number of subreddits did not change

Unnamed: 0,subreddit,body
count,10865638,10865638.0
unique,14,10365300.0
top,IAmA,
freq,776117,32182.0
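
These checks can also be written as explicit assertions. In the `describe()` output the most frequent `body` is the empty string, which means some texts became empty after cleaning, so counting them is a natural extra check. The sketch below uses two invented frames in place of `df_comments` and `df_posts_corpus`.

```python
import pandas as pd

# The reported shapes add up.
assert 10754982 + 110656 == 10865638

comments = pd.DataFrame({"subreddit": ["aww", "science"],
                         "body": ["Shes adorable", ""]})
posts = pd.DataFrame({"subreddit": ["aww"],
                      "body": ["Mary Poppins sequels"]})

combined = pd.concat([comments, posts], axis=0, ignore_index=True)

rows_match = len(combined) == len(comments) + len(posts)
same_subreddits = set(combined["subreddit"]) == (
    set(comments["subreddit"]) | set(posts["subreddit"])
)
# Rows whose text is empty (or whitespace only) after cleaning.
n_empty = int(combined["body"].str.strip().eq("").sum())

print(rows_match, same_subreddits, n_empty)
```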


In [45]:
df_comments_posts.head(20)

Unnamed: 0,subreddit,body
0,aww,Dont teach him Bite
1,aww,Shes adorable
2,aww,What an adorable little poser lol
3,aww,Hey beautiful come on over here Wink wink Such...
4,aww,Although I do agree that the news should be br...
5,aww,Right after sloth enters and want to participate
6,aww,What a sweet smile
7,aww,He looks like chiken from kfc lol
8,aww,This is why i cant have pets i get too sad whe...
9,aww,I dont think I like it


In [46]:
df_comments_posts.tail(20)

Unnamed: 0,subreddit,body
10865618,science,The Evolution of Mans Face Over The Course Of ...
10865619,science,Nanofabricator would change everything deleted
10865620,science,Obscure Vomiting Illness Linked to LongTerm Po...
10865621,science,The Dark Side of LED Lightbulbs deleted
10865622,science,Skywatchers see super blue blood Moon deleted
10865623,science,Obese fat becomes inflamed and scarred which m...
10865624,science,The Neuroscience of Proactive vs HyperReactive...
10865625,science,74 Things That Blew Our Minds in 2017 deleted
10865626,science,New Oxford University research has revealed th...
10865627,science,Fourdimensional physics in two dimensions deleted


Now we "freeze" our dataset into a csv file using `to_csv`

In [0]:
from google.colab import files
df_comments_posts.to_csv('comments_posts_2018.csv')
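
Two `to_csv`/`read_csv` defaults are worth keeping in mind when the file is loaded again later: `to_csv` writes the row index as an extra unnamed column unless `index=False` is passed, and `read_csv` turns empty strings into NaN unless `keep_default_na=False` is passed. A small round-trip sketch on an invented frame, written to memory instead of disk:

```python
import io

import pandas as pd

demo = pd.DataFrame({"subreddit": ["aww", "science"],
                     "body": ["Shes adorable", ""]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

# Keep empty strings as empty strings instead of NaN.
without_index.seek(0)
reloaded_keep_empty = pd.read_csv(without_index, keep_default_na=False)

print(cols_with_index)
print(reloaded["body"].tolist(), reloaded_keep_empty["body"].tolist())
```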

In [48]:
!ls

adc.json		    comments_2018-000000000006
comments_2018-000000000000  comments_2018-000000000007
comments_2018-000000000001  comments_2018-000000000008
comments_2018-000000000002  comments_2018-000000000009
comments_2018-000000000003  comments_posts_2018.csv
comments_2018-000000000004  posts_2018
comments_2018-000000000005  sample_data


In [110]:
import os
os.stat('comments_posts_2018.csv').st_size

2160419359
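
`st_size` is in bytes. Converting it to binary units gives the figure that `gsutil` reports during the upload:

```python
size_bytes = 2160419359          # value returned by os.stat(...).st_size above

size_mib = size_bytes / 2**20    # mebibytes
size_gib = size_bytes / 2**30    # gibibytes
print(f"{size_mib:.1f} MiB = {size_gib:.2f} GiB")
```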

Finally, we can upload it to my results bucket

In [127]:
!gsutil cp /content/comments_posts_2018.csv gs://reddit_final_results/

Copying file:///content/comments_posts_2018.csv [Content-Type=text/csv]...
/ [0 files][    0.0 B/  2.0 GiB]                                                ==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

/
Operation completed over 1 objects/2.0 GiB.                                      


Just in case I need it, I also upload the zip file of `comments_posts_2018`

In [119]:
!zip /content/comments_posts_2018.zip /content/comments_posts_2018.csv

  adding: content/comments_posts_2018.csv (deflated 63%)


In [122]:
!du -ch /content/comments_posts_2018.zip

771M	/content/comments_posts_2018.zip
771M	total
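
The archive can also be built from Python with the standard `zipfile` module. The sketch below is only an illustration on a small generated file (`demo_comments_posts.csv` is a made-up name). Passing `arcname` stores the bare file name rather than the `content/...` path that the `zip` command recorded above.

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Small stand-in for the real CSV.
    csv_path = Path(tmp) / "demo_comments_posts.csv"
    csv_path.write_text("subreddit,body\n" + "aww,Shes adorable\n" * 1000)

    zip_path = csv_path.with_suffix(".zip")
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname drops the directory part from the stored entry.
        zf.write(csv_path, arcname=csv_path.name)

    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()

    csv_size = csv_path.stat().st_size
    zip_size = zip_path.stat().st_size

print(names, csv_size, zip_size)
```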


In [129]:
!gsutil cp /content/comments_posts_2018.zip gs://reddit_final_results/

Copying file:///content/comments_posts_2018.zip [Content-Type=application/zip]...
/ [0 files][    0.0 B/770.6 MiB]                                                ==> NOTE: You are uploading one or more large file(s), which would run
significantly faster if you enable parallel composite uploads. This
feature can be enabled by editing the
"parallel_composite_upload_threshold" value in your .boto
configuration file. However, note that if you do this large files will
be uploaded as `composite objects
<https://cloud.google.com/storage/docs/composite-objects>`_,which
means that any user who downloads such objects will need to have a
compiled crcmod installed (see "gsutil help crcmod"). This is because
without a compiled crcmod, computing checksums on composite objects is
so slow that gsutil disables downloads of composite objects.

\
Operation completed over 1 objects/770.6 MiB.                                    
