# Exercise: Computational Linguistics over Reddit Data

For this project, we will ingest Reddit posts, process the data, and perform computational linguistics over those posts.

This project builds on work you have done previously. That earlier exercise processed and cataloged the feeds; here you will go further, accessing the referenced Reddit post and performing computational linguistics over the post itself.

![DataScraper_To_NLP.png MISSING](../images/DataScraper_To_NLP.png)

---

### From the site:

reddit: https://www.reddit.com/  
Reddit gives you the best of the Internet in one place. Get a constantly updating feed of breaking news, fun stories, pics, memes, and videos just for you.


### From Wikipedia:
Reddit is an American social news aggregation, web content rating, and discussion website. 
Registered members submit content to the site such as links, text posts, and images, 
which are then voted up or down by other members. 
Posts are organized by subject into user-created boards called "subreddits", 
which cover a variety of topics including news, science, movies, video games, music, books, fitness, food, and image-sharing. 
Submissions with more up-votes appear towards the top of their subreddit and, if they receive enough votes, ultimately on the site's front page. 



#### Sample Posting:

The link below is an example post from someone who was tinkering with sentiment analysis; specifically, they looked at the text of [Moby Dick](https://en.wikipedia.org/wiki/Moby-Dick).

**Spoiler:** The conclusion was that the book is rather negative in sentiment.
It is, after all, about vengeance!

https://www.reddit.com/r/LanguageTechnology/comments/9whk23/a_simple_nlp_pipeline_to_calculate_running/



### From: https://www.redditinc.com/
![REDDIT_About.png MISSING](../images/REDDIT_About_latest.png)

---

## Data Acquisition


### Example Code:

In this exercise, we will use the Reddit API to fetch the latest posts. Recent posts can also be fetched through web feeds (see [here](./rss-feeds.ipynb)), but our IP address appears to have been banned after excessive requests to Reddit over the last few days. We will therefore use the Reddit API, which requires you to create a Reddit account and an app.

Follow [this article](https://gilberttanner.com/blog/scraping-redditdata) to create your credentials. 

### Using Reddit API

To fetch Reddit data through the API, we will use a Python wrapper for the Reddit API: [PRAW: The Python Reddit API Wrapper](https://github.com/praw-dev/praw)

Documentation: https://praw.readthedocs.io/en/latest/index.html

In [None]:
import praw

reddit = praw.Reddit(client_id='s_kMno_JZ5Hz_1YlLiQ6eg', 
                     client_secret='1oxmzvlXYoxOIGy52UnJ8FIqkwFAMA', 
                     user_agent='WebScraping')
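The credentials created above do not have to be typed into the notebook. A minimal sketch of one alternative, assuming they are stored in environment variables; the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` and the helper `load_reddit_credentials` are hypothetical and not part of PRAW:

```python
import os


def load_reddit_credentials(env=None):
    """Return keyword arguments for praw.Reddit, read from environment variables.

    The variable names are an assumed convention, not something PRAW requires.
    """
    env = os.environ if env is None else env
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        # Fall back to the user agent string used in the cell above
        "user_agent": env.get("REDDIT_USER_AGENT", "WebScraping"),
    }
```

The returned dictionary could then be unpacked into the constructor, for example `praw.Reddit(**load_reddit_credentials())`.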


In [None]:
# get 10 hot posts from the datascience subreddit
hot_posts = reddit.subreddit('datascience').hot(limit=10)  # hot posts

# new_posts = reddit.subreddit('datascience').new(limit=10)  # new posts

# get hottest posts from all subreddits
# hot_posts = reddit.subreddit('all').hot(limit=10)


In [None]:
all_posts = list(hot_posts)  

# list() triggers the actual fetching of posts, because PRAW uses a lazy approach (i.e., it fetches only when required)
# materializing the posts once avoids calling the Reddit API repeatedly while developing our code
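The effect of lazy fetching can be illustrated without touching the network. The sketch below uses a plain Python generator, `fake_listing`, as a hypothetical stand-in for a PRAW listing; PRAW's own object may differ in detail, but the idea that nothing is fetched until the items are consumed is the same:

```python
def fake_listing(n, log):
    """Stand-in for a lazily evaluated listing: items are produced only on demand."""
    for i in range(n):
        log.append(f"fetched item {i}")  # stands in for a network request
        yield {"id": i}


log = []
listing = fake_listing(3, log)   # creating the generator fetches nothing
fetched_before = len(log)

all_items = list(listing)        # list() drives the generator to completion
fetched_after = len(log)

print(fetched_before, fetched_after, len(all_items))  # 0 3 3
```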

In [None]:

for post in all_posts:
    print(f"id : {post.id}")
    print(f"title : {post.title}")
    print(f"url : {post.url}")
    print(f"author : {str(post.author)} {type(str(post.author))}")
    print(f"score : {post.score} {type(post.score)} ")
    print(f"subreddit : {post.subreddit} {type(post.subreddit)} ")
    print(f"num_comments : {post.num_comments}")
    print(f"body : {post.selftext}")
    print(f"created : {post.created}")
    print(f"link_flair_text : {post.link_flair_text}")
    break  # break the loop after printing information about the first post
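The attributes printed above can be gathered into one plain record per post before loading them anywhere. This is a minimal sketch, assuming `post.author` may be `None` (for example when an account has been deleted); the helper `post_to_record` and the `SimpleNamespace` stand-in object are hypothetical:

```python
from types import SimpleNamespace


def post_to_record(post):
    """Flatten a submission-like object into a dict of plain Python values."""
    return {
        "id": post.id,
        "title": post.title,
        "link": post.url,
        # Avoid storing the literal string "None" when there is no author
        "author": str(post.author) if post.author is not None else None,
        "subreddit": str(post.subreddit),
        "tag": post.link_flair_text,
        "content": post.selftext,
        "created": post.created,
    }


# Stand-in object carrying the same attribute names as the loop above
example = SimpleNamespace(
    id="abc123", title="A title", url="https://example.com", author=None,
    subreddit="datascience", link_flair_text=None, selftext="", created=0.0,
)
record = post_to_record(example)
print(record["author"], record["subreddit"])  # None datascience
```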

### Sub-Reddits

As described above, sub-reddits are communities organized around particular topics.

Some example sub-reddits:
 * https://www.reddit.com/r/datascience/
 * https://www.reddit.com/r/MachineLearning/
 * https://www.reddit.com/r/LanguageTechnology/
 * https://www.reddit.com/r/NLP/
 * https://www.reddit.com/r/Python/


# Exercise Tasks

## Part I: Data Acquisition and Loading 
1. Choose a subreddit, preferably one that interests you. 
1. Conceptualize a database design that can collect the data.
    * Make sure your items (posts) are unique and not duplicated!
    * Make sure you capture at least title, author, subreddit, tags, title link, and timestamp
    * Along with the metadata, capture all the text into one or more data field(s) suitable for information retrieval
    * Write triggers for auto updates of IR related fields
    * Add index (either GIN or GiST) for the IR related fields
    * Additionally, design a field to hold:
        * Sentiment
1. Implement the database in your PostgreSQL schema
1. Implement cells of Python Code that 
    * collect the latest posts from a subreddit of your choice (**should be text-dominant not image/video**) and collect at least 500 posts (if possible), 
    * process the messages to extract metadata, 
    * process the text for IR, 
    * perform computational linguistics (e.g., extract sentiment scores), and 
    * then insert the data into your database.
1. After you have loaded data from a subreddit, choose a few more subreddits and load those!

## Part II: Analytics 

1. Write some test queries following the text vectors from Module 7.
1. Produce **interesting visualizations** of the linguistic data.
    * Try to look for trends (within a subreddit) and variations of topics across subreddits
    * Some comparative plots across feeds
1. Write a summary of your findings!
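One possible starting point for the comparative plots is a per-subreddit tally of sentiment labels. The sketch below assumes each stored row carries a subreddit name and a label such as `'POS'`, `'NEG'` or `'NEU'`; the helper names and the rows are made up for illustration and are not results:

```python
from collections import Counter, defaultdict


def sentiment_counts(rows):
    """Count sentiment labels per subreddit from (subreddit, sentiment) pairs."""
    counts = defaultdict(Counter)
    for subreddit, sentiment in rows:
        counts[subreddit][sentiment] += 1
    return counts


def positive_share(counts):
    """Fraction of posts labelled 'POS' in each subreddit."""
    return {name: c["POS"] / sum(c.values()) for name, c in counts.items()}


# Made-up rows for illustration only
rows = [("datascience", "POS"), ("datascience", "NEG"), ("Python", "POS"), ("Python", "POS")]
shares = positive_share(sentiment_counts(rows))
print(shares)  # {'datascience': 0.5, 'Python': 1.0}
```

A dictionary like `shares` can be handed to most plotting libraries as the heights of a bar chart, one bar per subreddit.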

 
 

# Part I: Data Acquisition and Loading

## Task 1: Design your database

Conceptualize a database design that can collect the data.
* Make sure your items (posts) are unique and not duplicated!
* Make sure you capture at least title, link, author, subreddit, tag/flair, and timestamp
* Capture all the body text into fields suitable for information retrieval
* Write triggers for auto updates of IR related fields
* Add index (either GIN or GiST) for the IR related fields
* Additionally, design a field to hold:
    - Sentiment
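The uniqueness requirement can be tried out before touching the real schema. The sketch below uses an in-memory SQLite database purely as a runnable stand-in (the exercise itself targets PostgreSQL); the table name `posts_demo` is hypothetical. The `ON CONFLICT (id) DO NOTHING` clause is also accepted by PostgreSQL 9.5 and later, and assumes SQLite 3.24 or newer here:

```python
import sqlite3

# SQLite is only a stand-in so the example runs anywhere
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts_demo (id TEXT PRIMARY KEY, title TEXT)")

# The primary key rejects duplicates; ON CONFLICT turns the error into a skip
insert_sql = (
    "INSERT INTO posts_demo (id, title) VALUES (?, ?) "
    "ON CONFLICT (id) DO NOTHING"
)
rows = [("a1", "first post"), ("b2", "second post"), ("a1", "first post again")]
conn.executemany(insert_sql, rows)

row_count = conn.execute("SELECT COUNT(*) FROM posts_demo").fetchone()[0]
print(row_count)  # 2, because the repeated id "a1" is skipped
```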



---

## Task 2: Implement the database in your PostgreSQL schema

You can choose any of these three ways to implement your database:

* sql magic 
* sql terminal 
* psycopg2 or sqlalchemy


In [None]:
import getpass

# Initialize some variables
mysso= 'dcphw2'    # this is also your schema name. 
schema='dcphw2' 
hostname='pgsql.dsa.lan'
database='dsa_student'

mypasswd = getpass.getpass("Type Password and hit enter")
connection_string = f"postgres://{mysso}:{mypasswd}@{hostname}/{database}"

%load_ext sql
%sql $connection_string 

# Then remove the password variable (note that connection_string still embeds the password)
del mypasswd

In [90]:
%%sql 

DROP TABLE IF EXISTS reddit;

CREATE TABLE reddit(
        id varchar(6),
        title varchar(1000),
        link varchar(1000),
        author varchar(100),
        subreddit varchar(100),
        tag varchar(100),
        content text,
        time timestamp,
        compound float,
        sentiment varchar(3)
);

ALTER TABLE reddit
ADD CONSTRAINT pk_Reddit PRIMARY KEY (id);

 * postgres://dcphw2:***@pgsql.dsa.lan/dsa_student
Done.
Done.
Done.


[]

In [91]:
%%sql 

ALTER TABLE reddit
  ADD COLUMN content_tsv_gin tsvector;

UPDATE reddit
SET content_tsv_gin = to_tsvector('pg_catalog.english', content);

 * postgres://dcphw2:***@pgsql.dsa.lan/dsa_student
Done.
0 rows affected.


[]

In [92]:
%%sql

CREATE TRIGGER tsv_gin_update 
	BEFORE INSERT OR UPDATE
	ON reddit 
	FOR EACH ROW 
	EXECUTE PROCEDURE 
	tsvector_update_trigger(content_tsv_gin,'pg_catalog.english',content);

 * postgres://dcphw2:***@pgsql.dsa.lan/dsa_student
Done.


[]

In [93]:
%%sql

CREATE INDEX reddit_content
ON Reddit USING GIN(content gin_trgm_ops);

-- GIN INDEX on content_tsv_gin
CREATE INDEX reddit_content_tsv_gin
ON Reddit USING GIN(content_tsv_gin);

 * postgres://dcphw2:***@pgsql.dsa.lan/dsa_student
Done.
Done.


[]
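Once rows are loaded, the `content_tsv_gin` column and its GIN index can be exercised with a ranked full-text query. The sketch below only builds the SQL text and its parameters; the helper `build_search_query` is hypothetical, and the `%s` placeholders assume a psycopg2-style cursor:

```python
def build_search_query(terms, limit=10):
    """Return (sql, params) for a ranked full-text search over reddit.content_tsv_gin."""
    sql = (
        "SELECT id, title, ts_rank(content_tsv_gin, query) AS rank "
        "FROM reddit, plainto_tsquery('pg_catalog.english', %s) AS query "
        "WHERE content_tsv_gin @@ query "   # @@ is the tsvector match operator
        "ORDER BY rank DESC "
        "LIMIT %s"
    )
    return sql, (terms, limit)


sql, params = build_search_query("cloud services", limit=5)
print(params)  # ('cloud services', 5)
```

With an open psycopg2 connection, the pair could be passed as `cursor.execute(sql, params)`, which keeps the search terms out of the SQL string itself.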

## Task 3: Implement cells of Python Code that

* collect the latest posts from a subreddit of your choice (should be text-dominant not image/video) and collect at least 500 posts (if possible),
* process the messages to extract id, title, link, author, subreddit, tag/flair, timestamp, etc. 
* process the text for IR, and
* perform computational linguistics (e.g., get sentiment scores)
* then insert the data into your database.


Notes: 
* Each call to the Reddit API returns at most 100 entries. If the limit is set above 100, PRAW makes multiple API calls internally and fetches the data lazily. See the notes on obfuscation and API limitations at https://praw.readthedocs.io/en/v3.6.2/pages/getting_started.html. 
* Develop and test your code with fewer than 100 posts from one subreddit. Then increase the limit and add a few more subreddits. 
* When loading the table, test with a single row first. 


In [59]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report
import re
import os
import psycopg2

In [60]:
## Your code in this cell
## ------------------------

import praw

reddit = praw.Reddit(client_id='s_kMno_JZ5Hz_1YlLiQ6eg', 
                     client_secret='1oxmzvlXYoxOIGy52UnJ8FIqkwFAMA', 
                     user_agent='WebScraping')



In [61]:
import pandas as pd
import datetime
import time
posts = []

ds_subreddit = reddit.subreddit('datascience')
for post in ds_subreddit.hot(limit=10):
    posts.append([post.id, post.title, post.url, post.author, post.subreddit, post.link_flair_text, post.selftext,
                     post.created_utc])
    
posts = pd.DataFrame(posts, columns=['id', 'title', 'link', 'author', 'subreddit', 'tag', 'content', 'time'])

posts['time'] = pd.to_datetime(posts['time'], unit='s')

posts

Unnamed: 0,id,title,link,author,subreddit,tag,content,time
0,q56pjd,Weekly Entering & Transitioning Thread | 10 Oc...,https://www.reddit.com/r/datascience/comments/...,datascience-bot,datascience,Discussion,Welcome to this week's entering & transitionin...,2021-10-10 12:00:30
1,q7zuxn,Putting ML models in production,https://www.reddit.com/r/datascience/comments/...,Proletarian_Tear,datascience,Discussion,What to consider when putting ML models in pro...,2021-10-14 13:35:50
2,q85c4e,Ethical Dilema,https://www.reddit.com/r/datascience/comments/...,Your_Data_Talking,datascience,Discussion,I’ve been put into a conundrum and have an ide...,2021-10-14 18:11:21
3,q80hcb,Any experienced data scientist or analyst look...,https://www.reddit.com/r/datascience/comments/...,JS-AI,datascience,Career,"I work at a healthcare tech company (SaaS), bu...",2021-10-14 14:08:24
4,q896yh,Do companies use Tableau or PowerBI more?,https://www.reddit.com/r/datascience/comments/...,Discombobulated_Pen,datascience,Education,Just starting my Master's and we get to choose...,2021-10-14 21:38:00
5,q844ek,ETL and ELT,https://www.reddit.com/r/datascience/comments/...,KiwiD_1618,datascience,Discussion,Ok I mean I got it. I completely understand wh...,2021-10-14 17:11:00
6,q8c9e5,A Choice Between Jobs,https://www.reddit.com/r/datascience/comments/...,hopeful_hen,datascience,Job Search,Say you are a newly minted data scientist (M.S...,2021-10-15 00:14:12
7,q83ygz,I added Codex (GitHub Copilot) to the terminal,https://www.reddit.com/r/datascience/comments/...,tomd_96,datascience,Tooling,&#x200B;\n\nhttps://i.redd.it/w5uovea36gt71.gi...,2021-10-14 17:02:53
8,q811yq,10 must-read data science papers,https://www.reddit.com/r/datascience/comments/...,RoyBatty234,datascience,Discussion,"Data science might be a young field, but that ...",2021-10-14 14:38:06
9,q87sm3,What is the best course for data science in Co...,https://www.reddit.com/r/datascience/comments/...,Poirot19,datascience,Discussion,,2021-10-14 20:13:29
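The `created_utc` attribute holds seconds since the Unix epoch, which is why the cell above passes `unit='s'` to `pd.to_datetime`. The same conversion can be sketched with the standard library alone; the helper `epoch_to_utc` is hypothetical and the epoch value is an illustrative number, not one read from the API:

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert seconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Illustrative epoch value
print(epoch_to_utc(1633867230).isoformat())  # 2021-10-10T12:00:30+00:00
```

Unlike this sketch, `pd.to_datetime(..., unit='s')` returns timezone-naive values, which matches the `timestamp` column type used in the table.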


In [62]:
text = [re.sub(r'@(\w+)', ' ', t) for t in posts['content'].values]
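The substitution above replaces `@mentions` with a space. Reddit bodies may also contain HTML entities and bare URLs, so a slightly broader cleaner can be sketched as follows; the URL and entity handling are assumed extras, not part of the cell above, and `clean_text` is a hypothetical helper:

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")   # assumed extra rule: drop bare URLs
MENTION_RE = re.compile(r"@(\w+)")     # same pattern as the cell above
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Decode HTML entities, drop URLs and @mentions, and collapse whitespace."""
    text = html.unescape(text)             # e.g. "&amp;" becomes "&"
    text = text.replace("\u200b", "")      # zero-width space decoded from "&#x200B;"
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


cleaned = clean_text("&#x200B;\n\nThanks @someone see https://example.com/page for details")
print(cleaned)  # Thanks see for details
```

URLs are removed before mentions so that an `@` inside a link is not treated as a mention.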

In [63]:
analyzer = SentimentIntensityAnalyzer()
text_sentiment = [analyzer.polarity_scores(t) for t in text]

df = pd.DataFrame(text_sentiment)
df['text'] = text

In [64]:
df['sentiment'] = 'NEU'
df.loc[df['compound'] > 0.05, 'sentiment'] = 'POS'
df.loc[df['compound'] < -0.05, 'sentiment'] = 'NEG'
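The labelling rule used above can also be written as a small function, which makes the boundary behaviour explicit: scores exactly at the threshold stay neutral. A minimal sketch with the same ±0.05 cut-offs; the name `label_sentiment` is hypothetical:

```python
def label_sentiment(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to 'POS', 'NEG' or 'NEU'."""
    if compound > threshold:
        return "POS"
    if compound < -threshold:
        return "NEG"
    return "NEU"


labels = [label_sentiment(score) for score in (0.5093, -0.9349, 0.0, 0.05)]
print(labels)  # ['POS', 'NEG', 'NEU', 'NEU']
```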


In [65]:
posts['compound'] = df['compound']
posts['sentiment'] = df['sentiment']

posts

Unnamed: 0,id,title,link,author,subreddit,tag,content,time,compound,sentiment
0,q56pjd,Weekly Entering & Transitioning Thread | 10 Oc...,https://www.reddit.com/r/datascience/comments/...,datascience-bot,datascience,Discussion,Welcome to this week's entering & transitionin...,2021-10-10 12:00:30,0.5093,POS
1,q7zuxn,Putting ML models in production,https://www.reddit.com/r/datascience/comments/...,Proletarian_Tear,datascience,Discussion,What to consider when putting ML models in pro...,2021-10-14 13:35:50,0.7059,POS
2,q85c4e,Ethical Dilema,https://www.reddit.com/r/datascience/comments/...,Your_Data_Talking,datascience,Discussion,I’ve been put into a conundrum and have an ide...,2021-10-14 18:11:21,-0.9349,NEG
3,q80hcb,Any experienced data scientist or analyst look...,https://www.reddit.com/r/datascience/comments/...,JS-AI,datascience,Career,"I work at a healthcare tech company (SaaS), bu...",2021-10-14 14:08:24,0.9636,POS
4,q896yh,Do companies use Tableau or PowerBI more?,https://www.reddit.com/r/datascience/comments/...,Discombobulated_Pen,datascience,Education,Just starting my Master's and we get to choose...,2021-10-14 21:38:00,0.9332,POS
5,q844ek,ETL and ELT,https://www.reddit.com/r/datascience/comments/...,KiwiD_1618,datascience,Discussion,Ok I mean I got it. I completely understand wh...,2021-10-14 17:11:00,0.9638,POS
6,q8c9e5,A Choice Between Jobs,https://www.reddit.com/r/datascience/comments/...,hopeful_hen,datascience,Job Search,Say you are a newly minted data scientist (M.S...,2021-10-15 00:14:12,0.9914,POS
7,q83ygz,I added Codex (GitHub Copilot) to the terminal,https://www.reddit.com/r/datascience/comments/...,tomd_96,datascience,Tooling,&#x200B;\n\nhttps://i.redd.it/w5uovea36gt71.gi...,2021-10-14 17:02:53,0.0,NEU
8,q811yq,10 must-read data science papers,https://www.reddit.com/r/datascience/comments/...,RoyBatty234,datascience,Discussion,"Data science might be a young field, but that ...",2021-10-14 14:38:06,0.8988,POS
9,q87sm3,What is the best course for data science in Co...,https://www.reddit.com/r/datascience/comments/...,Poirot19,datascience,Discussion,,2021-10-14 20:13:29,0.0,NEU


## Task 4: After you have loaded data from a subreddit, choose a few more subreddits and load those!

Add cells if required

In [66]:
## Your code in this cell
## ------------------------

# Collect up to 100 hot posts from each subreddit of interest
subreddit_names = ['datascience', 'MachineLearning', 'LanguageTechnology', 'NLP', 'Python']

posts = []
for name in subreddit_names:
    for post in reddit.subreddit(name).hot(limit=100):
        posts.append([post.id, post.title, post.url, post.author, post.subreddit,
                      post.link_flair_text, post.selftext, post.created_utc])

posts = pd.DataFrame(posts, columns=['id', 'title', 'link', 'author', 'subreddit', 'tag', 'content', 'time'])

posts['time'] = pd.to_datetime(posts['time'], unit='s')

posts



Unnamed: 0,id,title,link,author,subreddit,tag,content,time
0,q56pjd,Weekly Entering & Transitioning Thread | 10 Oc...,https://www.reddit.com/r/datascience/comments/...,datascience-bot,datascience,Discussion,Welcome to this week's entering & transitionin...,2021-10-10 12:00:30
1,q7zuxn,Putting ML models in production,https://www.reddit.com/r/datascience/comments/...,Proletarian_Tear,datascience,Discussion,What to consider when putting ML models in pro...,2021-10-14 13:35:50
2,q85c4e,Ethical Dilema,https://www.reddit.com/r/datascience/comments/...,Your_Data_Talking,datascience,Discussion,I’ve been put into a conundrum and have an ide...,2021-10-14 18:11:21
3,q80hcb,Any experienced data scientist or analyst look...,https://www.reddit.com/r/datascience/comments/...,JS-AI,datascience,Career,"I work at a healthcare tech company (SaaS), bu...",2021-10-14 14:08:24
4,q896yh,Do companies use Tableau or PowerBI more?,https://www.reddit.com/r/datascience/comments/...,Discombobulated_Pen,datascience,Education,Just starting my Master's and we get to choose...,2021-10-14 21:38:00
...,...,...,...,...,...,...,...,...
495,q5x975,I’m doing a fresh install on my pc due to upgr...,https://www.reddit.com/r/Python/comments/q5x97...,Sco-Ross,Python,Discussion,,2021-10-11 14:37:48
496,q5pnuo,Crawling Google Scholar to obtain researcher i...,https://www.reddit.com/r/Python/comments/q5pnu...,Fickle-Impression149,Python,Resource,We are part of a research lab and we work in p...,2021-10-11 06:14:52
497,q5m83l,Algebraic Data Types (Rust style enums) implem...,https://www.reddit.com/r/Python/comments/q5m83...,ElViento92,Python,Intermediate Showcase,"Sup everyone, so to test out the new pattern m...",2021-10-11 02:27:24
498,q5w30w,Win the Lottery With Python,https://www.iceorfire.com/post/win-the-lottery...,will_r3ddit_4_food,Python,Tutorial,,2021-10-11 13:40:03


In [67]:
text = [re.sub(r'@(\w+)', ' ', t) for t in posts['content'].values]

analyzer = SentimentIntensityAnalyzer()
text_sentiment = [analyzer.polarity_scores(t) for t in text]

df = pd.DataFrame(text_sentiment)
df['text'] = text

df['sentiment'] = 'NEU'
df.loc[df['compound'] > 0.05, 'sentiment'] = 'POS'
df.loc[df['compound'] < -0.05, 'sentiment'] = 'NEG'

posts['compound'] = df['compound']
posts['sentiment'] = df['sentiment']

posts

Unnamed: 0,id,title,link,author,subreddit,tag,content,time,compound,sentiment
0,q56pjd,Weekly Entering & Transitioning Thread | 10 Oc...,https://www.reddit.com/r/datascience/comments/...,datascience-bot,datascience,Discussion,Welcome to this week's entering & transitionin...,2021-10-10 12:00:30,0.5093,POS
1,q7zuxn,Putting ML models in production,https://www.reddit.com/r/datascience/comments/...,Proletarian_Tear,datascience,Discussion,What to consider when putting ML models in pro...,2021-10-14 13:35:50,0.7059,POS
2,q85c4e,Ethical Dilema,https://www.reddit.com/r/datascience/comments/...,Your_Data_Talking,datascience,Discussion,I’ve been put into a conundrum and have an ide...,2021-10-14 18:11:21,-0.9349,NEG
3,q80hcb,Any experienced data scientist or analyst look...,https://www.reddit.com/r/datascience/comments/...,JS-AI,datascience,Career,"I work at a healthcare tech company (SaaS), bu...",2021-10-14 14:08:24,0.9636,POS
4,q896yh,Do companies use Tableau or PowerBI more?,https://www.reddit.com/r/datascience/comments/...,Discombobulated_Pen,datascience,Education,Just starting my Master's and we get to choose...,2021-10-14 21:38:00,0.9332,POS
...,...,...,...,...,...,...,...,...,...,...
495,q5x975,I’m doing a fresh install on my pc due to upgr...,https://www.reddit.com/r/Python/comments/q5x97...,Sco-Ross,Python,Discussion,,2021-10-11 14:37:48,0.0000,NEU
496,q5pnuo,Crawling Google Scholar to obtain researcher i...,https://www.reddit.com/r/Python/comments/q5pnu...,Fickle-Impression149,Python,Resource,We are part of a research lab and we work in p...,2021-10-11 06:14:52,0.9013,POS
497,q5m83l,Algebraic Data Types (Rust style enums) implem...,https://www.reddit.com/r/Python/comments/q5m83...,ElViento92,Python,Intermediate Showcase,"Sup everyone, so to test out the new pattern m...",2021-10-11 02:27:24,0.2471,POS
498,q5w30w,Win the Lottery With Python,https://www.iceorfire.com/post/win-the-lottery...,will_r3ddit_4_food,Python,Tutorial,,2021-10-11 13:40:03,0.0000,NEU


In [75]:
posts = posts.astype({'author': 'string', 'subreddit': 'string'})  # PRAW returns author and subreddit as objects; convert them to strings before inserting

In [76]:
import sqlalchemy
from sqlalchemy import create_engine

engine = create_engine(connection_string)

In [94]:
posts.to_sql('reddit', con=engine, index=False, if_exists='append')

In [97]:
%%sql

Select * from reddit limit 10

 * postgres://dcphw2:***@pgsql.dsa.lan/dsa_student
10 rows affected.


id,title,link,author,subreddit,tag,content,time,compound,sentiment,content_tsv_gin
q56pjd,Weekly Entering & Transitioning Thread | 10 Oct 2021 - 17 Oct 2021,https://www.reddit.com/r/datascience/comments/q56pjd/weekly_entering_transitioning_thread_10_oct_2021/,datascience-bot,datascience,Discussion,"Welcome to this week's entering & transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include: * Learning resources (e.g. books, tutorials, videos) * Traditional education (e.g. schools, degrees, electives) * Alternative education (e.g. online courses, bootcamps) * Job search questions (e.g. resumes, applying, career prospects) * Elementary questions (e.g. where to start, what next) While you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and [Resources](Resources) pages on our wiki. You can also search for answers in [past weekly threads](https://www.reddit.com/r/datascience/search?q=weekly%20thread&restrict_sr=1&sort=new).",2021-10-10 12:00:30,0.5093,POS,"'/r/datascience/search?q=weekly%20thread&restrict_sr=1&sort=new).':96 '/r/datascience/wiki/frequently-asked-questions)':76 'also':86 'altern':40 'answer':66,89 'appli':51 'book':31 'bootcamp':45 'career':52 'check':70 'communiti':69 'cours':44 'data':23 'degre':38 'e.g':30,36,42,49,56 'educ':35,41 'elect':39 'elementari':54 'enter':6 'faq':73 'field':25 'get':16 'includ':27 'job':46 'learn':28 'next':61 'onlin':43 'page':80 'past':91 'prospect':53 'question':14,48,55 'resourc':29,78,79 'resum':50 'school':37 'scienc':24 'search':47,87 'start':17,59 'studi':18 'thread':8,10,93 'topic':26 'tradit':34 'transit':7,20 'tutori':32 'video':33 'wait':64 'week':4,92 'welcom':1 'wiki':83 'www.reddit.com':75,95 'www.reddit.com/r/datascience/search?q=weekly%20thread&restrict_sr=1&sort=new).':94 'www.reddit.com/r/datascience/wiki/frequently-asked-questions)':74"
q7zuxn,Putting ML models in production,https://www.reddit.com/r/datascience/comments/q7zuxn/putting_ml_models_in_production/,Proletarian_Tear,datascience,Discussion,What to consider when putting ML models in production using cloud services like Google Cloud or AWS? Are there dedicated production-ready cloud services for ML models?,2021-10-14 13:35:50,0.7059,POS,"'aw':17 'cloud':11,15,24 'consid':3 'dedic':20 'googl':14 'like':13 'ml':6,27 'model':7,28 'product':9,22 'production-readi':21 'put':5 'readi':23 'servic':12,25 'use':10"
q85c4e,Ethical Dilema,https://www.reddit.com/r/datascience/comments/q85c4e/ethical_dilema/,Your_Data_Talking,datascience,Discussion,"I’ve been put into a conundrum and have an idea on what I should do but basically somebody confided in me that they need help on a problem that is very much a core part of their job. They openly admitted they got hired under false pretenses. They have no experience have not taken any of the required classes or coursework to be doing this job and are basically looking at making up data. So it’s an electrical engineering student who was hired for a job in data science and the job is working with regulatory device data and I even know the name of the company that hired them. With Elizabeth Holmes in the news for her Theranos fraud, how should we in data science handle it when we find out that somebody is completely un-qualified for their job, has already lied about qualifications to get the job, and has been given responsibility over medical device data that has huge disastrous potential for patients?",2021-10-14 18:11:21,-0.9349,NEG,"'admit':42 'alreadi':146 'basic':18,70 'class':60 'compani':109 'complet':138 'confid':20 'conundrum':7 'core':35 'coursework':62 'data':75,90,100,127,162 'devic':99,161 'disastr':166 'electr':80 'elizabeth':114 'engin':81 'even':103 'experi':52 'fals':47 'find':133 'fraud':122 'get':151 'given':157 'got':44 'handl':129 'help':26 'hire':45,85,111 'holm':115 'huge':165 'idea':11 'job':39,67,88,94,144,153 'know':104 'lie':147 'look':71 'make':73 'medic':160 'much':33 'name':106 'need':25 'news':118 'open':41 'part':36 'patient':169 'potenti':167 'pretens':48 'problem':29 'put':4 'qualif':149 'qualifi':141 'regulatori':98 'requir':59 'respons':158 'scienc':91,128 'somebodi':19,136 'student':82 'taken':55 'therano':121 'un':140 'un-qualifi':139 've':2 'work':96"
q80hcb,Any experienced data scientist or analyst looking for a new job?,https://www.reddit.com/r/datascience/comments/q80hcb/any_experienced_data_scientist_or_analyst_looking/,JS-AI,datascience,Career,"I work at a healthcare tech company (SaaS), but my CTO likes to call it an unstructured data company. It’s a very lax environment and we have some of the largest companies in healthcare as our customers. It’s still technically a startup, in January we had less than 10 people, now we have over 50. If you are interested, DM me your name, number, resume, and GitHub profile link. I will be helping in the interviews. Pay is great, and it’s a work from home job. Only accepting US applicants.",2021-10-14 14:08:24,0.9636,POS,"'10':51 '50':57 'accept':91 'applic':93 'call':14 'compani':7,19,33 'cto':11 'custom':38 'data':18 'dm':62 'environ':25 'github':69 'great':81 'healthcar':5,35 'help':75 'home':88 'interest':61 'interview':78 'januari':46 'job':89 'largest':32 'lax':24 'less':49 'like':12 'link':71 'name':65 'number':66 'pay':79 'peopl':52 'profil':70 'resum':67 'saa':8 'startup':44 'still':41 'tech':6 'technic':42 'unstructur':17 'us':92 'work':2,86"
q896yh,Do companies use Tableau or PowerBI more?,https://www.reddit.com/r/datascience/comments/q896yh/do_companies_use_tableau_or_powerbi_more/,Discombobulated_Pen,datascience,Education,Just starting my Master's and we get to choose which visualisation tools to use for the visuals in projects (not proficient enough in python yet so sticking with one of the two above) - which of the two would be better to learn this year and therefore more useful to future employers? Or is it easy enough to learn that it doesn't really matter so I should pick the one that is easiest to use (so am also wondering which one is easiest)? Thanks a lot!,2021-10-14 21:38:00,0.9332,POS,"'also':79 'better':41 'choos':10 'doesn':62 'easi':56 'easiest':74,84 'employ':52 'enough':23,57 'futur':51 'get':8 'learn':43,59 'lot':87 'master':4 'matter':65 'one':30,71,82 'pick':69 'profici':22 'project':20 'python':25 'realli':64 'start':2 'stick':28 'thank':85 'therefor':47 'tool':13 'two':33,38 'use':15,49,76 'visual':18 'visualis':12 'wonder':80 'would':39 'year':45 'yet':26"
q844ek,ETL and ELT,https://www.reddit.com/r/datascience/comments/q844ek/etl_and_elt/,KiwiD_1618,datascience,Discussion,"Ok I mean I got it. I completely understand what they are from the experience I have. But does it require any specific tool? Like SQL or some other tools that ITs use? Does the following procedure consider an ETL pipeline? Using R/Python download raw data from multiple clients through sftp servers, manipulate the data so that all have the same structure, save them into a database (nothing special - just csv txt or rds files), when the time comes open the files on R/Python and make various of statistical analysis, export specifically structure outputs that can fit very well on a BI tool and transfer them on the servers. All of this being automatically done. I see a lot of companies asking for ETL. Can I add on my CV that I have ETL experience based on the above procedures?",2021-10-14 17:11:00,0.9638,POS,"'add':127 'analysi':90 'ask':122 'automat':114 'base':136 'bi':102 'client':49 'come':79 'compani':121 'complet':8 'consid':38 'csv':71 'cv':130 'data':46,55 'databas':67 'done':115 'download':44 'etl':40,124,134 'experi':15,135 'export':91 'file':75,82 'fit':97 'follow':36 'got':5 'like':25 'lot':119 'make':86 'manipul':53 'mean':3 'multipl':48 'noth':68 'ok':1 'open':80 'output':94 'pipelin':41 'procedur':37,140 'r/python':43,84 'raw':45 'rds':74 'requir':21 'save':63 'see':117 'server':52,109 'sftp':51 'special':69 'specif':23,92 'sql':26 'statist':89 'structur':62,93 'time':78 'tool':24,30,103 'transfer':105 'txt':72 'understand':9 'use':33,42 'various':87 'well':99"
q8c9e5,A Choice Between Jobs,https://www.reddit.com/r/datascience/comments/q8c9e5/a_choice_between_jobs/,hopeful_hen,datascience,Job Search,"Say you are a newly minted data scientist (M.S. in DS). Company A is interesting, well-positioned financially, and intellectually interesting. But Company A does not have a data scientist. They are a startup, and they want you to be part of a team of 4. You will be head data scientist. Company B is a multinational company, with a well-practiced data science division or focus. You will be mentored and wrapped into the company. In company B, your role is much more well-defined, but you also have less control over the direction of the company. In company A, you can make a larger impact, but the role of mentors op is greatly curtailed. What do you choose? I expect it depends on the preparedness of the candidate. 1. Can you do the job for company A? Are you ready? How can you know? 2. Where will you grow more? What is better for you as a person? 3. Independently, somewhat, of preference, which opportunity is best? Now I know this will differ depending on the companies in question, but assume both are full of talented and experienced quantitative professionals. Just looking for insight. Personally I lean towards companyB, for the safety, but the role from company A is tempting. While it’s hard to judge ones own capability, I think its risky to find yourself new and the only data scientist at a very well funded company (choice A). 
But it also offers the reward of being critical perhaps to operational success.",2021-10-15 00:14:12,0.9914,POS,"'1':134 '2':150 '3':164 '4':47 'also':92,248 'assum':186 'b':55,81 'best':172 'better':158 'candid':133 'capabl':224 'choic':244 'choos':123 'compani':12,24,54,59,78,80,101,103,141,182,212,243 'companyb':204 'control':95 'critic':254 'curtail':119 'data':7,30,52,65,236 'defin':89 'depend':127,179 'differ':178 'direct':98 'divis':67 'ds':11 'expect':125 'experienc':193 'financi':19 'find':230 'focus':69 'full':189 'fund':242 'great':118 'grow':154 'hard':219 'head':51 'impact':110 'independ':165 'insight':199 'intellectu':21 'interest':15,22 'job':139 'judg':221 'know':149,175 'larger':109 'lean':202 'less':94 'look':197 'm.s':9 'make':107 'mentor':73,115 'mint':6 'much':85 'multin':58 'new':232 'newli':5 'offer':249 'one':222 'op':116 'oper':257 'opportun':170 'part':42 'perhap':255 'person':163,200 'posit':18 'practic':64 'prefer':168 'prepared':130 'profession':195 'quantit':194 'question':184 'readi':145 'reward':251 'riski':228 'role':83,113,210 'safeti':207 'say':1 'scienc':66 'scientist':8,31,53,237 'somewhat':166 'startup':35 'success':258 'talent':191 'team':45 'tempt':215 'think':226 'toward':203 'want':38 'well':17,63,88,241 'well-defin':87 'well-posit':16 'well-pract':62 'wrap':75"
q83ygz,I added Codex (GitHub Copilot) to the terminal,https://www.reddit.com/r/datascience/comments/q83ygz/i_added_codex_github_copilot_to_the_terminal/,tomd_96,datascience,Tooling,&#x200B; https://i.redd.it/w5uovea36gt71.gif  You can now let Zsh write code for you using the plugin I wrote: [https://github.com/tom-doerr/zsh\_codex](https://github.com/tom-doerr/zsh_codex) All you need to provide is a comment or a variable name and the plugin will use OpenAI's Codex AI (powers GitHub Copilot) to write the corresponding code. Be aware that you do need to get access to the Codex API.,2021-10-14 17:02:53,0.0,NEU,"'/tom-doerr/zsh':20 '/tom-doerr/zsh_codex)':24 '/w5uovea36gt71.gif':3 'access':62 'ai':45 'api':66 'awar':55 'code':10,53 'codex':21,44,65 'comment':32 'copilot':48 'correspond':52 'get':61 'github':47 'github.com':19,23 'github.com/tom-doerr/zsh':18 'github.com/tom-doerr/zsh_codex)':22 'i.redd.it':2 'i.redd.it/w5uovea36gt71.gif':1 'let':7 'name':36 'need':27,59 'openai':42 'plugin':15,39 'power':46 'provid':29 'use':13,41 'variabl':35 'write':9,50 'wrote':17 'zsh':8"
q811yq,10 must-read data science papers,https://www.reddit.com/r/datascience/comments/q811yq/10_mustread_data_science_papers/,RoyBatty234,datascience,Discussion,"Data science might be a young field, but that doesn’t mean you won’t face expectations about having an awareness of certain topics. I would like to know (and read) several of the most important recent developments and influential thought pieces. Topics covered by these papers may range from the **orchestration of the DS workflow** to **breakthroughs in faster neural networks** to a **rethinking of our fundamental approach to problem solving with statistics**.",2021-10-14 14:38:06,0.8988,POS,"'approach':69 'awar':21 'breakthrough':58 'certain':23 'cover':44 'data':1 'develop':38 'doesn':10 'ds':55 'expect':17 'face':16 'faster':60 'field':7 'fundament':68 'import':36 'influenti':40 'know':29 'like':27 'may':48 'mean':12 'might':3 'network':62 'neural':61 'orchestr':52 'paper':47 'piec':42 'problem':71 'rang':49 'read':31 'recent':37 'rethink':65 'scienc':2 'sever':32 'solv':72 'statist':74 'thought':41 'topic':24,43 'won':14 'workflow':56 'would':26 'young':6"
q87sm3,What is the best course for data science in Coursera?,https://www.reddit.com/r/datascience/comments/q87sm3/what_is_the_best_course_for_data_science_in/,Poirot19,datascience,Discussion,,2021-10-14 20:13:29,0.0,NEU,



### In Part II, we will search your database as `dsa_ro_user`. To prepare your DB to be read, you will need to grant `dsa_ro_user` schema access and select privileges on your table.

```SQL
GRANT USAGE ON SCHEMA <your schema> TO dsa_ro_user;  -- NOTE: change to your schema
GRANT SELECT ON <your table> TO dsa_ro_user;
```

In [98]:
%%sql

GRANT USAGE ON SCHEMA dcphw2 TO dsa_ro_user;  -- NOTE: change to your schema
GRANT SELECT ON reddit TO dsa_ro_user;

 * postgres://dcphw2:***@pgsql.dsa.lan/dsa_student
Done.
Done.


[]

# Save your notebook, then `File > Close and Halt`

---