# Downloading the final data

In [2]:
!gcloud auth login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?code_challenge=jng9ziQZ8Yt9ZqHulZeUDUBD6t_X7agnh5J0jiYxWUQ&prompt=select_account&code_challenge_method=S256&access_type=offline&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&response_type=code&client_id=32555940559.apps.googleusercontent.com&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fappengine.admin+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcompute+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth


If you need to use ADC, see:
  gcloud auth application-default --help

You are now logged in as [galli.giuly@gmail.com].
Your current project is [reddit-master].  You can change this setting by running:
  $ gcloud config set project PROJECT_ID


Updates are available for some Cloud SDK components.  To install them,
please run:
  $ gcloud components update



To take a quick anonymo

In [3]:
!gcloud config set project perfect-operand-267716

Updated property [core/project].


## Summarizing

In this step I'll download the final data that I'll use to build the best model for predicting the subreddit a given text belongs to.

##### Reddit comments - BQ download

I'll run the query below on reddit_comments, keeping comments from each pre-selected subreddit whose body is longer than 10 characters and is not a placeholder such as `[deleted]` or `[removed]`.

```SQL
  SELECT
    subreddit,
    body
  FROM
    `fh-bigquery.reddit_comments.2019*`
  WHERE
    subreddit IN ('politics',
      'worldpolitics',
      'neoliberal',
      'Libertarian',
      'Anarchism',
      'socialism',
      'Conservative',
      'hillaryclinton',
      'AskTrumpSupporters',
      'PoliticalHumor',
      'NeutralPolitics',
      'PoliticalDiscussion',
      'ukpolitics',
      'LateStageCapitalism',
      'geopolitics')
    AND LENGTH(body) > 10
    AND body != '[deleted]'
    AND body != '[removed]'
    AND body != '[ Removed by reddit in response to a copyright notice. ]'
    AND body != 'NaN'
    AND body != ''
```
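
To spot-check a downloaded shard against the filter above, the same conditions can be mirrored in pandas. This is only a sketch: `filter_comments` is a hypothetical helper, not part of the original notebook, and it assumes `body` holds plain strings (Python counts code points, which may differ slightly from BigQuery's `LENGTH` in edge cases).

```python
import pandas as pd

# Subreddits and placeholder bodies copied from the query above.
SUBREDDITS = [
    "politics", "worldpolitics", "neoliberal", "Libertarian", "Anarchism",
    "socialism", "Conservative", "hillaryclinton", "AskTrumpSupporters",
    "PoliticalHumor", "NeutralPolitics", "PoliticalDiscussion", "ukpolitics",
    "LateStageCapitalism", "geopolitics",
]
COPYRIGHT_NOTICE = "[ Removed by reddit in response to a copyright notice. ]"
PLACEHOLDERS = ["[deleted]", "[removed]", COPYRIGHT_NOTICE, "NaN", ""]


def filter_comments(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep the rows the WHERE clause would keep."""
    body = df["body"]
    keep = (
        df["subreddit"].isin(SUBREDDITS)
        & (body.str.len() > 10)
        & ~body.isin(PLACEHOLDERS)
    )
    return df.loc[keep, ["subreddit", "body"]].reset_index(drop=True)


# Toy data, invented for illustration only.
sample_comments = pd.DataFrame({
    "subreddit": ["politics", "politics", "aww", "socialism", "politics"],
    "body": [
        "This comment is long enough.",
        "[deleted]",
        "A long comment from another subreddit.",
        "too short",
        COPYRIGHT_NOTICE,
    ],
})
filtered_comments = filter_comments(sample_comments)
print(filtered_comments)
```

Only the first toy row survives: the others are a placeholder, from a subreddit outside the list, or too short.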

##### Reddit posts - BQ download

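I'll run the query below on reddit_posts for each pre-selected subreddit, keeping posts whose title is longer than 5 characters and whose selftext is either longer than 10 characters or not one of the placeholder values (`[deleted]`, `[removed]`, the copyright notice, `NaN`, or an empty string).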

```SQL
SELECT
  subreddit,
  title,
  selftext
FROM
  `fh-bigquery.reddit_posts.2019*`
WHERE
  subreddit IN ('politics',
    'worldpolitics',
    'neoliberal',
    'Libertarian',
    'Anarchism',
    'socialism',
    'Conservative',
    'hillaryclinton',
    'AskTrumpSupporters',
    'PoliticalHumor',
    'NeutralPolitics',
    'PoliticalDiscussion',
    'ukpolitics',
    'LateStageCapitalism',
    'geopolitics')
  AND ((LENGTH(title) > 5
      AND LENGTH(selftext) > 10)
    OR ((selftext != '[deleted]'
        AND LENGTH(title) > 5)
      AND (selftext != '[removed]'
        AND LENGTH(title) > 5)
      AND (selftext != '[ Removed by reddit in response to a copyright notice. ]'
        AND LENGTH(title) > 5)
      AND (selftext != 'NaN'
        AND LENGTH(title) > 5)
      AND (selftext != ''
        AND LENGTH(title) > 5)))
```
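
The nested `AND`/`OR` conditions in the query above are easier to reason about once written as boolean masks. The sketch below is a pandas reading of the title/selftext part of the `WHERE` clause, first literally and then in a simplified form; both function names are hypothetical, and it assumes `title` and `selftext` contain no missing values.

```python
import pandas as pd

COPYRIGHT_NOTICE = "[ Removed by reddit in response to a copyright notice. ]"
PLACEHOLDERS = ["[deleted]", "[removed]", COPYRIGHT_NOTICE, "NaN", ""]


def posts_mask_literal(df: pd.DataFrame) -> pd.Series:
    """Literal translation of the title/selftext conditions."""
    title_ok = df["title"].str.len() > 5
    long_text = df["selftext"].str.len() > 10
    not_placeholder = pd.Series(True, index=df.index)
    for placeholder in PLACEHOLDERS:
        # Each branch of the query repeats the title-length check.
        not_placeholder = not_placeholder & ((df["selftext"] != placeholder) & title_ok)
    return (title_ok & long_text) | not_placeholder


def posts_mask_simplified(df: pd.DataFrame) -> pd.Series:
    """Same logic after factoring out the repeated title check."""
    title_ok = df["title"].str.len() > 5
    long_text = df["selftext"].str.len() > 10
    return title_ok & (long_text | ~df["selftext"].isin(PLACEHOLDERS))


# Toy data, invented for illustration only.
sample_posts = pd.DataFrame({
    "title": ["A reasonable title", "Short"] + ["Another fine title"] * 4,
    "selftext": [
        "A body well over ten characters.",
        "A body well over ten characters.",
        "",
        "ok",
        "[removed]",
        COPYRIGHT_NOTICE,
    ],
})
literal_mask = posts_mask_literal(sample_posts)
simplified_mask = posts_mask_simplified(sample_posts)
print(pd.DataFrame({"literal": literal_mask, "simplified": simplified_mask}))
```

Under this reading, a very short selftext such as `ok` still passes because it is not a placeholder, and the copyright notice passes through the length branch because it is longer than 10 characters.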

# Assembling final data

In [6]:
import pandas as pd
import re
import numpy as np
import pickle as pkl
import spacy 
import ast

from glob import glob
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from sklearn import preprocessing
from nltk.stem.snowball import SnowballStemmer
from sklearn.model_selection import train_test_split

### Comments

In [5]:
!gsutil cp gs://politic-comments/comments_politic* .

Copying gs://politic-comments/comments_politic 000000000000...
Copying gs://politic-comments/comments_politic 000000000001...                  
Copying gs://politic-comments/comments_politic 000000000002...                  
Copying gs://politic-comments/comments_politic 000000000003...                  
| [4 files][  2.6 GiB/  2.6 GiB]   16.4 MiB/s                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m cp ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Copying gs://politic-comments/comments_politic 000000000004...
Copying gs://politic-comments/comments_politic 000000000005...                  
Copying gs://politic-comments/comments_politic 000000000006...                  
Copying gs://politic-comments/comments_politic 000000000007...                  
Copying gs://politic-comments/comments_politic 00000000

In [7]:
!pwd

/Users/giuliagalli/Documents/GitHub/reddit_sentiment_analysis/01_inspection


In [8]:
comments_files = glob("comments_politic*") # all files have the same pattern
dfs = []

for file in comments_files:
  df = pd.read_csv(file)
  dfs.append(df)

df_comments = pd.concat(dfs, axis=0, ignore_index=True)
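
The loop above can also be wrapped in a small function. The following is a sketch under a few assumptions: `load_shards` is a hypothetical name, every shard is assumed to have a header row with `subreddit` and `body` columns, and storing `subreddit` as a categorical is an optional memory saving rather than something the original notebook does.

```python
import os
import tempfile
from glob import glob

import pandas as pd


def load_shards(pattern: str) -> pd.DataFrame:
    """Hypothetical helper: read every CSV shard matching `pattern` into one frame."""
    files = sorted(glob(pattern))  # sorted for a reproducible row order
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [pd.read_csv(path, usecols=["subreddit", "body"]) for path in files]
    combined = pd.concat(frames, axis=0, ignore_index=True)
    # Only 15 distinct subreddits are expected, so a categorical is compact.
    combined["subreddit"] = combined["subreddit"].astype("category")
    return combined


# Demonstration on two tiny shards written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    toy_shards = [
        {"subreddit": ["politics", "socialism"], "body": ["first body", "second body"]},
        {"subreddit": ["politics"], "body": ["third body"]},
    ]
    for i, rows in enumerate(toy_shards):
        shard_path = os.path.join(tmp, f"comments_politic {i:012d}")
        pd.DataFrame(rows).to_csv(shard_path, index=False)
    demo_comments = load_shards(os.path.join(tmp, "comments_politic*"))

print(demo_comments)
```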

In [10]:
df_comments.describe()

| | subreddit | body |
|---|---|---|
| count | 28666076 | 28666076 |
| unique | 15 | 24214080 |
| top | politics | \nAs a reminder, this subreddit [is for civil ... |
| freq | 17715353 | 240930 |


In [9]:
df_comments.shape

(28666076, 2)

In [13]:
df_comments.isnull().sum()

subreddit    0
body         0
dtype: int64

In [7]:
df_comments['subreddit'].unique()

array(['politics', 'neoliberal', 'Conservative', 'LateStageCapitalism',
       'AskTrumpSupporters', 'worldpolitics', 'ukpolitics', 'Anarchism',
       'PoliticalHumor', 'Libertarian', 'PoliticalDiscussion',
       'NeutralPolitics', 'socialism', 'geopolitics', 'hillaryclinton'],
      dtype=object)

In [5]:
df_comments.head()

| | subreddit | body |
|---|---|---|
| 0 | politics | &gt;Mitch McConnell doesn't represent the peop... |
| 1 | politics | Justice is coming for Paul Ryan. You mean some... |
| 2 | politics | Such a bs excuse because 99% or journalists le... |
| 3 | politics | Hope all you want. AOC will without a doubt be... |
| 4 | politics | Typically, natural gas falls under the G part ... |


### Posts

In [10]:
!gsutil cp gs://politic-posts/posts_politic* .

Copying gs://politic-posts/posts_politic...
\ [1 files][ 26.3 MiB/ 26.3 MiB]                                                
Operation completed over 1 objects/26.3 MiB.                                     


In [11]:
df_posts = pd.read_csv("posts_politic")

In [19]:
df_posts.describe()

| | subreddit | title | selftext |
|---|---|---|---|
| count | 24678 | 24678 | 24678 |
| unique | 15 | 23962 | 24030 |
| top | Libertarian | Discussion Thread | The discussion thread is for casual conversati... |
| freq | 5690 | 244 | 79 |


In [12]:
df_posts.shape

(24678, 3)

In [21]:
df_posts.isnull().sum()

subreddit    0
title        0
selftext     0
dtype: int64

In [11]:
df_posts.head()

| | subreddit | title | selftext |
|---|---|---|---|
| 0 | politics | Saturday Morning Political Cartoon Thread | It's Saturday, folks. Let's all kick back with... |
| 1 | Anarchism | What is actually happening in China? | There's been a lot of news about supposed inte... |
| 2 | Anarchism | Seth Tobocman on Art, Gentrification, War in t... | In this interview Seth Tobocman speaks about t... |
| 3 | Anarchism | Sadly true | “everyone now thinks him- or herself free, eve... |
| 4 | Anarchism | Swedes &amp; Scandies: Go to r/arbetarrorelsen | Swedes and other people in the Scandinavian co... |


#### Assembling the posts corpus

In [13]:
df_posts['body'] = df_posts['title'] + " " + df_posts['selftext']

In [13]:
df_posts.head()

| | subreddit | title | selftext | body |
|---|---|---|---|---|
| 0 | politics | Saturday Morning Political Cartoon Thread | It's Saturday, folks. Let's all kick back with... | Saturday Morning Political Cartoon Thread It's... |
| 1 | Anarchism | What is actually happening in China? | There's been a lot of news about supposed inte... | What is actually happening in China? There's b... |
| 2 | Anarchism | Seth Tobocman on Art, Gentrification, War in t... | In this interview Seth Tobocman speaks about t... | Seth Tobocman on Art, Gentrification, War in t... |
| 3 | Anarchism | Sadly true | “everyone now thinks him- or herself free, eve... | Sadly true “everyone now thinks him- or hersel... |
| 4 | Anarchism | Swedes &amp; Scandies: Go to r/arbetarrorelsen | Swedes and other people in the Scandinavian co... | Swedes &amp; Scandies: Go to r/arbetarrorelsen... |


In [14]:
posts_corpus_df = df_posts.drop(['title', 'selftext'], axis=1)

In [15]:
posts_corpus_df.head()

| | subreddit | body |
|---|---|---|
| 0 | politics | Saturday Morning Political Cartoon Thread It's... |
| 1 | Anarchism | What is actually happening in China? There's b... |
| 2 | Anarchism | Seth Tobocman on Art, Gentrification, War in t... |
| 3 | Anarchism | Sadly true “everyone now thinks him- or hersel... |
| 4 | Anarchism | Swedes &amp; Scandies: Go to r/arbetarrorelsen... |
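
The two steps above (joining `title` and `selftext`, then dropping the source columns) could also be written as a single function that leaves the input frame unmodified. This is a sketch; `build_posts_corpus` is a hypothetical name, and the toy rows are invented for illustration.

```python
import pandas as pd


def build_posts_corpus(posts: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return subreddit plus a body of 'title selftext'."""
    body = posts["title"].str.cat(posts["selftext"], sep=" ")
    return posts.assign(body=body).drop(columns=["title", "selftext"])


toy_posts = pd.DataFrame({
    "subreddit": ["Anarchism", "politics"],
    "title": ["Sadly true", "Cartoon Thread"],
    "selftext": ["Some quoted text.", "Kick back with cartoons."],
})
toy_corpus = build_posts_corpus(toy_posts)
print(toy_corpus)
```

Because `assign` and `drop` both return new frames, `toy_posts` keeps its original three columns.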


#### Unifying comments and posts

In [15]:
comments_posts_df = pd.concat([df_comments, posts_corpus_df], axis=0, ignore_index=True)

In [17]:
comments_posts_df.head()

| | subreddit | body |
|---|---|---|
| 0 | politics | &gt;Mitch McConnell doesn't represent the peop... |
| 1 | politics | Justice is coming for Paul Ryan. You mean some... |
| 2 | politics | Such a bs excuse because 99% or journalists le... |
| 3 | politics | Hope all you want. AOC will without a doubt be... |
| 4 | politics | Typically, natural gas falls under the G part ... |


In [16]:
comments_posts_df.shape

(28690754, 2)
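
The combined row count matches the sum of the two inputs reported earlier (28,666,076 comments plus 24,678 posts). A check like this can be made explicit, as in the sketch below; `concat_checked` is a hypothetical helper that also verifies the frames share the same columns, since mismatched columns would silently produce NaN-filled cells.

```python
import pandas as pd

# Row counts reported by the .shape cells in this notebook.
n_comments = 28_666_076
n_posts = 24_678
n_combined = 28_690_754
assert n_comments + n_posts == n_combined


def concat_checked(frames):
    """Hypothetical helper: concatenate frames only if their columns agree."""
    frames = list(frames)
    expected_columns = list(frames[0].columns)
    for frame in frames[1:]:
        if list(frame.columns) != expected_columns:
            raise ValueError(f"column mismatch: {list(frame.columns)} != {expected_columns}")
    return pd.concat(frames, axis=0, ignore_index=True)


# Toy frames, invented for illustration only.
toy_comments = pd.DataFrame({"subreddit": ["politics"], "body": ["a comment"]})
toy_posts_corpus = pd.DataFrame({"subreddit": ["Anarchism"], "body": ["a title a selftext"]})
toy_combined = concat_checked([toy_comments, toy_posts_corpus])
print(toy_combined.shape)
```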

In [19]:
comments_posts_df.describe()

| | subreddit | body |
|---|---|---|
| count | 28690754 | 28690754 |
| unique | 15 | 24238319 |
| top | politics | \nAs a reminder, this subreddit [is for civil ... |
| freq | 17715641 | 240930 |


In [17]:
comments_posts_df["body"].isna().sum()

0

In [10]:
!ls

 adc.json			 'comments_politic 000000000006'
'comments_politic 000000000000'  'comments_politic 000000000007'
'comments_politic 000000000001'  'comments_politic 000000000008'
'comments_politic 000000000002'  'comments_politic 000000000009'
'comments_politic 000000000003'   comments_posts.pkl
'comments_politic 000000000004'   posts_politic
'comments_politic 000000000005'   sample_data


In [None]:
comments_posts_df.to_pickle('comments_posts.pkl')
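
Before uploading a pickle of this size, it can be worth confirming that it reads back unchanged. A minimal sketch, assuming a small stand-in frame and a temporary path rather than the real `comments_posts.pkl`; `pickle_roundtrip_ok` is a hypothetical helper.

```python
import os
import tempfile

import pandas as pd


def pickle_roundtrip_ok(df: pd.DataFrame, path: str) -> bool:
    """Hypothetical helper: write df to `path` and confirm it reads back equal."""
    df.to_pickle(path)
    restored = pd.read_pickle(path)
    return restored.equals(df)


# Stand-in data, invented for illustration only.
toy_frame = pd.DataFrame({
    "subreddit": ["politics", "Anarchism"],
    "body": ["a comment", "a title a selftext"],
})
with tempfile.TemporaryDirectory() as tmp:
    roundtrip_ok = pickle_roundtrip_ok(toy_frame, os.path.join(tmp, "toy.pkl"))

print(roundtrip_ok)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.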

In [0]:
!gsutil cp /content/comments_posts.pkl gs://reddit_final_results/