# The Echo Chamber Effect - A Case Study (provisional title)

## Milestone 2

In this notebook we download a small sample of the Reddit dataset for the first time and compute some basic statistics on it.

We also load the two recommended NLP libraries and try them out to see how they work and to what extent we can take advantage of them.

In [2]:
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
from datetime import datetime
from matplotlib import pyplot as plt
%matplotlib inline

# Locate the local Spark installation before importing pyspark.
# If auto-detection fails, pass the install path explicitly, e.g.:
# findspark.init(r'C:\Users\jorge\Anaconda3\pkgs\pyspark-2.3.1-py36_1001\Lib\site-packages\pyspark')
import findspark
findspark.init()

from pyspark.sql import SparkSession
# Namespaced import, so Spark functions do not shadow built-ins such as sum or max
from pyspark.sql import functions as F

# Reuse the running Spark session if there is one, otherwise start a new one
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

In [4]:
DATA_DIR = './data/'

We took a tiny slice of the 2017 data from the cluster, saved it as a compressed Parquet file, and downloaded it to our local machines to experiment with. If this is successful, we may try a bigger and more representative sample (e.g. data from all subreddits and all years, sampled randomly while maintaining the relative sizes between subreddits).
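
The idea of sampling randomly while keeping the relative sizes between subreddits can be illustrated with a small pandas sketch. This is only a toy example under stated assumptions: the `subreddit` and `body` column names, the toy data, and the helper `stratified_sample` are hypothetical and not part of our pipeline. It draws the same fraction from every subreddit, so the proportions are preserved.

```python
import pandas as pd

# Toy stand-in for the Reddit data; the column names are assumptions.
posts = pd.DataFrame({
    "subreddit": ["politics"] * 60 + ["science"] * 30 + ["aww"] * 10,
    "body": ["some text"] * 100,
})

def stratified_sample(df, by, frac, seed=0):
    """Hypothetical helper: draw the same fraction from every group,
    so the relative group sizes are kept in the sample."""
    return df.groupby(by).sample(frac=frac, random_state=seed)

small = stratified_sample(posts, by="subreddit", frac=0.5)

# Each subreddit keeps its share of the total (60% / 30% / 10%)
print(small["subreddit"].value_counts(normalize=True))
```

On the cluster, PySpark's `DataFrame.sampleBy(col, fractions, seed)` might serve a similar purpose, although we have not tried it yet.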

In [None]:
sample = spark.read.parquet(DATA_DIR + 'sample2017.parquet')

In [None]:
# number of subreddits
# relative sizes
# some graphs with posting frequency
# number of posts per user
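
The statistics listed above could be computed along the following lines. This is a minimal sketch on toy data, not the actual analysis: the column names `subreddit`, `author` and `created_utc` are assumptions based on the usual Reddit dump fields, and the schema of our sample may differ. If the sample is small enough, `sample.toPandas()` would give a pandas DataFrame to run this on.

```python
import pandas as pd

# Toy rows standing in for the sample; column names are assumptions.
toy = pd.DataFrame({
    "subreddit": ["politics", "politics", "science", "aww", "politics", "science"],
    "author": ["ann", "bob", "ann", "cat", "ann", "bob"],
    "created_utc": [1483228800, 1483232400, 1483315200,
                    1483318800, 1483401600, 1483405200],
})

# Number of distinct subreddits
n_subreddits = toy["subreddit"].nunique()

# Relative size of each subreddit (share of all posts)
relative_sizes = toy["subreddit"].value_counts(normalize=True)

# Posting frequency: posts per calendar day, derived from the Unix timestamp
day = pd.to_datetime(toy["created_utc"], unit="s").dt.date
posts_per_day = toy.groupby(day).size()

# Number of posts per user
posts_per_user = toy["author"].value_counts()

print(n_subreddits)
print(relative_sizes)
print(posts_per_day)
print(posts_per_user)
```

The resulting Series can be plotted directly, for example with `posts_per_day.plot()`, to get a first graph of posting frequency.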

### NLP

Below are our first steps with NLP, which mostly consist of trying out libraries. More Python NLP libraries are listed here: https://elitedatascience.com/python-nlp-libraries

#### spaCy

spaCy allows us to find named entities, which helps identify the topic(s) of a post or discussion.

spaCy can be found here https://spacy.io/ with instructions for installing here https://spacy.io/usage/

In [None]:
import spacy

# Load English tokenizer, tagger, parser and NER
# (the small model does not ship with static word vectors)
nlp = spacy.load('en_core_web_sm')

# Process whole documents
text = (u"When Sebastian Thrun started working on self-driving cars at "
        u"Google in 2007, few people outside of the company took him "
        u"seriously. “I can tell you very senior CEOs of major American "
        u"car companies would shake my hand and turn away because I wasn’t "
        u"worth talking to,” said Thrun, now the co-founder and CEO of "
        u"online higher education startup Udacity, in an interview with "
        u"Recode earlier this week.")
doc = nlp(text)

# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

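# Determine semantic similarities
# (without real word vectors in en_core_web_sm these scores are only rough;
# a larger model such as en_core_web_md should give more meaningful values)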
doc1 = nlp(u"my fries were super gross")
doc2 = nlp(u"such disgusting fries")
similarity = doc1.similarity(doc2)
print(doc1.text, doc2.text, similarity)

#### TextBlob

TextBlob allows for sentiment analysis, translation, and more.

TextBlob can be found here https://textblob.readthedocs.io/en/dev/ with installation instructions here https://textblob.readthedocs.io/en/dev/install.html

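Unfortunately, so far we have only managed to run TextBlob on macOS, although it is a regular Python package that can be installed with pip on other platforms too.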

In [None]:
from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)
blob.tags           # [('The', 'DT'), ('titular', 'JJ'),
                    #  ('threat', 'NN'), ('of', 'IN'), ...]

blob.noun_phrases   # WordList(['titular threat', 'blob',
                    #            'ultimate movie monster',
                    #            'amoeba-like mass', ...])

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
# 0.060
# -0.341

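# Note: translate() relies on the Google Translate web API; it was deprecated
# in TextBlob 0.16.0 and removed in 0.18.0, so this line needs an older version.
blob.translate(to="es")  # 'La amenaza titular de The Blob...'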