# Preprocessing - Standardization

In [1]:
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [2]:
!gcloud auth login

Go to the following link in your browser:

    https://accounts.google.com/o/oauth2/auth?code_challenge=_zDicMjuagjLZMmMq57Gz2Xm4FAKzB45gY6iAfEcp9A&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&response_type=code&client_id=32555940559.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth


Enter verification code: 4/tQG_LUpjV8TFv0kADPbgExTaM7bbCxTWZkbVr67iVaiF7ZTHesmLZwc
If you need to use ADC, see:
  gcloud auth application-default --help

You are now logged in as [galli.giuly@gmail.com].
Your current project is [None].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


In [3]:
!gcloud config set project reddit-master

Updated property [core/project].


In [4]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the tokenizer models and the stop-word lists used below
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [5]:
!gsutil cp gs://reddit_final_results/comments_posts_2018.csv .

Copying gs://reddit_final_results/comments_posts_2018.csv...
- [1 files][  2.0 GiB/  2.0 GiB]   61.8 MiB/s                                   
Operation completed over 1 objects/2.0 GiB.                                      


In [0]:
comments_posts_df = pd.read_csv("comments_posts_2018.csv")

In [8]:
comments_posts_df.columns

Index(['Unnamed: 0', 'subreddit', 'body'], dtype='object')

## Converting text to lowercase

We define a function that converts every string in a text column to lowercase.

In [0]:
sample = comments_posts_df.sample(20)

In [10]:
sample.head()

Unnamed: 0.1,Unnamed: 0,subreddit,body
1611114,1611114,IAmA,Biggest and longest rivalry in North London
8292705,8292705,nba,Klaymond is top 20
6058165,6058165,funny,Youre right op he is creepy
10375371,10375371,funny,Mine was free
4036474,4036474,politics,American exceptionalism is a hell of a drug S...


In [0]:
def lowercase_column(df, column):
  # Lowercase every string in the column (modifies df in place) and return df
  df[column] = df[column].str.lower()
  return df

In [12]:
lowercase_column(sample,'body')

Unnamed: 0.1,Unnamed: 0,subreddit,body
1611114,1611114,IAmA,biggest and longest rivalry in north london
8292705,8292705,nba,klaymond is top 20
6058165,6058165,funny,youre right op he is creepy
10375371,10375371,funny,mine was free
4036474,4036474,politics,american exceptionalism is a hell of a drug s...
2645814,2645814,atheism,shit thats scary to contemplate
6175353,6175353,politics,the last thing i would do when innocent of mur...
6281495,6281495,europe,bruh lets not fool ourselvesthe black sea is s...
3116012,3116012,aww,aww poor girl what happened sending good thoug...
2078223,2078223,technology,i have a feeling this was a huge scam and some...


Once tested on the sample, we can apply the function to the full DataFrame.

In [8]:
lowercase_column(comments_posts_df,'body')

Unnamed: 0.1,Unnamed: 0,subreddit,body
0,0,aww,dont teach him bite
1,1,aww,shes adorable
2,2,aww,what an adorable little poser lol
3,3,aww,hey beautiful come on over here wink wink such...
4,4,aww,although i do agree that the news should be br...
...,...,...,...
10865633,10865633,science,study finds robust sex differences in children...
10865634,10865634,science,scan technique reveals secret writing in mummy...
10865635,10865635,science,psychedelic drugs can help relieve the symptom...
10865636,10865636,science,you are shaped by the genes you inherit and ma...


## Tokenization

In [0]:
def tokenize_column(df, column):
  # Replace each string in the column with its list of NLTK word tokens (in place)
  df[column] = df[column].apply(word_tokenize)

In [0]:
tokenize_column(sample,'body')

In [16]:
sample.head()

Unnamed: 0.1,Unnamed: 0,subreddit,body
4282954,4282954,todayilearned,"[put, this, bracelet, on, your, wrist, it, wil..."
10443399,10443399,nba,"[you, sound, absurd]"
4409800,4409800,movies,"[so, randall, park, was, very, good, at, what,..."
5854805,5854805,funny,"[you, say, potato, i, say, poetawtoh]"
8352942,8352942,movies,"[deadpool, 2, watched, 1, and, 2, back, to, ba..."
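
Tokenizing a single column can be written row-wise with `DataFrame.apply(..., axis=1)` or directly on the column with `Series.apply`. The sketch below suggests the two give the same token lists on a toy frame. It is only an illustration: `tokenize_column_sketch` and the toy data are hypothetical, and `str.split` stands in for `word_tokenize` so the example runs without downloading NLTK data.

```python
import pandas as pd

def tokenize_column_sketch(df, column, tokenizer=str.split):
    # Apply the tokenizer to each value of one column and return the frame.
    # str.split is a stand-in here; the notebook uses NLTK's word_tokenize.
    df[column] = df[column].apply(tokenizer)
    return df

# Hypothetical toy data, lowercased like the notebook's sample.
toy = pd.DataFrame({"body": ["you sound absurd", "klaymond is top 20"]})

# Row-wise variant: builds a row object for every record.
row_wise = toy.apply(lambda row: row["body"].split(), axis=1)

# Column-wise variant: works on the values of one column only.
col_wise = tokenize_column_sketch(toy.copy(), "body")["body"]

print(col_wise.tolist())
```

The column-wise form avoids constructing a row object per record, which tends to matter on a frame with millions of rows.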


Before applying the function to the full DataFrame, we check for NaN values and drop them, because `word_tokenize` fails on non-string input.

In [17]:
comments_posts_df["body"].isna().sum()

0

In [0]:
comments_posts_df = comments_posts_df.dropna()
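
Called without arguments, `dropna()` removes a row when any of its columns is missing, not only `body`. A minimal sketch on a hypothetical toy frame (not the real data) shows the difference between that and restricting the check to the text column with `subset`:

```python
import pandas as pd

# Hypothetical toy frame; the real data comes from comments_posts_2018.csv.
toy = pd.DataFrame({
    "subreddit": ["aww", None, "nba"],
    "body": ["shes adorable", "mine was free", None],
})

# Count missing values in the text column only.
missing_body = toy["body"].isna().sum()

# dropna() with no arguments drops any row containing a NaN in any column.
dropped_any = toy.dropna()

# subset= restricts the check to the column that will be tokenized.
dropped_body = toy.dropna(subset=["body"])

print(missing_body, len(dropped_any), len(dropped_body))
```

Whether to keep rows with a missing `subreddit` depends on the later analysis; the sketch only makes the two behaviours visible.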

In [0]:
tokenize_column(comments_posts_df,'body')

In [2]:
comments_posts_df.head()

NameError: ignored

## Removing stop words

In [0]:
stop_words = set(stopwords.words('english'))

In [0]:
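def free_stopwords_column(df, column):
  # The column holds token lists after tokenize_column; drop stop words and rejoin into a string
  df[column] = df[column].apply(lambda tokens: ' '.join(word for word in tokens if word not in stop_words))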
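
To make the filtering step concrete, here is a minimal sketch on a single token list. It assumes the tokens are already lowercased, and it uses a small hand-written set, `STOP_WORDS_SKETCH`, as a hypothetical stand-in for `set(stopwords.words('english'))` so that it runs without NLTK data. `remove_stop_words` is likewise an illustrative name, not part of the notebook.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
STOP_WORDS_SKETCH = {"is", "a", "of", "he", "was", "and", "in"}

def remove_stop_words(tokens, stop_words=STOP_WORDS_SKETCH):
    # Keep only the tokens that are not in the stop-word set.
    # Membership tests on a set take constant time on average.
    return [token for token in tokens if token not in stop_words]

tokens = ["klaymond", "is", "top", "20"]
filtered = remove_stop_words(tokens)
print(filtered)  # ['klaymond', 'top', '20']
```

Because NLTK's English stop-word list is lowercase, the comparison only matches when the text has been lowercased first, which is why the lowercase step comes earlier in the notebook.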