# **IE Data Capstone with RUBRIX: Analyzing conversations on Reddit with NLP**

## Requirements
- Build one or several models for analyzing Reddit conversations about one or more topics (MBA, Crypto, Stock, etc.)
- At least one of these models must be a supervised NLP model trained on a dataset labelled with Rubrix.
- The NLP models should be used to extract insights or build  from a large portion of the Reddit conversations.

## Deliverables

- Report describing the solution, the dataset construction process with Rubrix, and the business value of the solution/analytical results.
- The labelled dataset and optionally the trained models. Recommendation: Use the Hugging Face Hub for publication (see https://rubrix.readthedocs.io/en/stable/guides/datasets.html).


## Tools

Python libraries:

- Rubrix: open-source tool for building and labeling NLP datasets (https://rubrix.readthedocs.io/en/stable/)
- Hugging Face Transformers, datasets, and Hub (https://huggingface.co/)
- spaCy (https://spacy.io/)

Development environment:
- Docker and Elasticsearch for Rubrix
- Jupyter notebooks
- Colab for GPU training


## Dataset (starting point)
Reddit conversations from the `MBA` subreddit, which is described as:

> Learn about MBA programs, applying to them, and what life is like while in one and afterwards. Please make sure to read our rules and wiki before posting.

For this we will use the `convokit` library by Cornell University: 

> ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.

The most relevant concepts for this project are:

* **CORPUS**: A dataset of conversations
* **CONVERSATION**: An interaction between one or more users
* **UTTERANCE**: The message exchanged between users in a conversation

Let's see how to load the `MBA` subreddit corpus and look at some examples of these objects. To download another subreddit, change the name after the `-`, for example `"subreddit-WallStreetBets"`. If you need more detailed or more recent data, or cannot find the subreddit you want, you can use the `PRAW` library to gather data from Reddit yourself (you need to create a Reddit account and an application): https://www.geeksforgeeks.org/scraping-reddit-using-python/

In [17]:
from convokit import Corpus, download
corpus = Corpus(filename=download("subreddit-MBA"))
corpus.print_summary_stats()

Dataset already exists at /Users/dani/.convokit/downloads/subreddit-MBA
Number of Speakers: 13017
Number of Utterances: 124157
Number of Conversations: 17219


In [27]:
corpus.random_utterance()

Utterance({'obj_type': 'utterance', 'meta': {'score': 1, 'top_level_comment': 'dajxlx7', 'retrieved_on': 1481850239, 'gilded': 0, 'gildings': None, 'subreddit': 'MBA', 'stickied': False, 'permalink': '', 'author_flair_text': '2nd Year Student '}, 'vectors': [], 'speaker': Speaker({'obj_type': 'speaker', 'meta': {}, 'vectors': [], 'owner': <convokit.model.corpus.Corpus object at 0x1087c2af0>, 'id': 'sail_awayy'}), 'conversation_id': '5fdomn', 'reply_to': 'dakicoh', 'timestamp': 1480432434, 'text': 'Stuff like banks/consultancies/F500', 'owner': <convokit.model.corpus.Corpus object at 0x1087c2af0>, 'id': 'dakke5n'})

In [28]:
corpus.random_conversation()

Conversation({'obj_type': 'conversation', 'meta': {'title': 'MBA or EMBA or Nothing?', 'num_comments': 4, 'domain': 'self.MBA', 'timestamp': 1483581647, 'subreddit': 'MBA', 'gilded': 0, 'gildings': None, 'stickied': False, 'author_flair_text': ''}, 'vectors': [], 'tree': None, 'owner': <convokit.model.corpus.Corpus object at 0x1087c2af0>, 'id': '5m3iix'})

In [33]:
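# note: each call draws a different random conversation, so the printed
# conversation and the dataframe below come from different threads
print(corpus.random_conversation())
corpus.random_conversation().get_utterances_dataframe()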

Conversation('id': '3x7rwc', 'utterances': ['3x7rwc', 'cy28utv', 'cy2a5i6', 'cy2ajt2', 'cy2b0c9', 'cy2bib0', 'cy7gvq6'], 'meta': {'title': 'I would apply to more schools, but all the applications require contacting my references seperately, thus repeatedly. Other options?', 'num_comments': 6, 'domain': 'self.MBA', 'timestamp': 1450361431, 'subreddit': 'MBA', 'gilded': 0, 'gildings': None, 'stickied': False, 'author_flair_text': ''})


| id | timestamp | text | speaker | reply_to | conversation_id | meta.score | meta.top_level_comment | meta.retrieved_on | meta.gilded | meta.gildings | meta.subreddit | meta.stickied | meta.permalink | meta.author_flair_text | vectors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5j72tu | 1482162857 | I'm very lucky to have gotten in to MIT and Be... | ThrowawayPBD | | 5j72tu | 3 | | 1484419294 | 0 | | MBA | False | /r/MBA/comments/5j72tu/applying_r2_after_accep... | | [] |
| dbdum8b | 1482164551 | Well, the biggest thing is you'd drop another ... | ketothrowaway555 | 5j72tu | 5j72tu | 2 | dbdum8b | 1483886395 | 0 | | MBA | False | | | [] |
| dbe13th | 1482172474 | Similar case, accepted at Booth and MIT. Will ... | drnostrand86 | 5j72tu | 5j72tu | 2 | dbe13th | 1483889618 | 0 | | MBA | False | | Prospect | [] |
| dbe42rq | 1482175999 | Presumably you'll make a deposit at either Boo... | ThrowawayPBD | dbe13th | 5j72tu | 2 | dbe13th | 1483891119 | 0 | | MBA | False | | | [] |
| dbeo9zn | 1482201182 | Booth and MIT are my top choices, but if eithe... | drnostrand86 | dbe42rq | 5j72tu | 3 | dbe13th | 1483901118 | 0 | | MBA | False | | Prospect | [] |
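
The `reply_to` and `conversation_id` columns encode the thread structure: the post has an empty `reply_to`, and every other utterance points at the utterance it answers. As a minimal sketch (plain Python, no `convokit` needed), the reply tree of the sample conversation above can be rebuilt from the `id` and `reply_to` values alone. The helper names `build_children` and `depth` are illustrative, not part of any library.

```python
from collections import defaultdict

# (id, reply_to) pairs copied from the sample dataframe above;
# None stands for the empty reply_to of the original post
rows = [
    ("5j72tu", None),
    ("dbdum8b", "5j72tu"),
    ("dbe13th", "5j72tu"),
    ("dbe42rq", "dbe13th"),
    ("dbeo9zn", "dbe42rq"),
]


def build_children(rows):
    """Map each utterance id to the ids of its direct replies."""
    children = defaultdict(list)
    for utt_id, parent_id in rows:
        if parent_id is not None:
            children[parent_id].append(utt_id)
    return dict(children)


def depth(utt_id, parents):
    """Number of reply hops between an utterance and the original post."""
    hops = 0
    while parents[utt_id] is not None:
        utt_id = parents[utt_id]
        hops += 1
    return hops


children = build_children(rows)
parents = dict(rows)

print(children["5j72tu"])         # ['dbdum8b', 'dbe13th']
print(depth("dbeo9zn", parents))  # 3
```

Depth can be a useful filter when labelling: direct replies to the post (depth 1) usually address the question, while deeper replies are often side discussions.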


## Some ideas


- Analyze the sentiment of responses with regard to certain aspects (costs, courses, etc.). For this you can combine two models: sentiment classification and topic classification.

- Detect names of universities and combine with sentiment. For this you can train a NER model for detecting university names and apply a sentiment to the response.

- Analyse information about MBA programmes in Spain?
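
To make the first idea concrete, here is a rough sketch of how the outputs of a topic classifier and a sentiment classifier could be combined once both have been applied to the same responses. The predictions, label names, and helper functions are hypothetical placeholders; in the project they would come from your trained models.

```python
from collections import Counter, defaultdict

# Hypothetical predictions: one dict per response, holding the label
# assigned by each of the two (separately trained) classifiers
predictions = [
    {"topic": "costs", "sentiment": "negative"},
    {"topic": "costs", "sentiment": "negative"},
    {"topic": "costs", "sentiment": "positive"},
    {"topic": "courses", "sentiment": "positive"},
]


def sentiment_by_topic(predictions):
    """Count sentiment labels separately for each topic."""
    counts = defaultdict(Counter)
    for prediction in predictions:
        counts[prediction["topic"]][prediction["sentiment"]] += 1
    return counts


def negative_share(counts, topic):
    """Fraction of responses about `topic` labelled negative (0.0 if none)."""
    total = sum(counts[topic].values())
    return counts[topic]["negative"] / total if total else 0.0


counts = sentiment_by_topic(predictions)
print(dict(counts["costs"]))
print(round(negative_share(counts, "costs"), 2))
```

The same aggregation works for the second idea if `topic` is replaced by a university name detected with a NER model.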

## Creating a Rubrix dataset with comments to posts: Text Classification


Let's first create a pandas dataframe out of conversation messages:

In [43]:
# iterating over corpus.conversations yields conversation ids, not Conversation objects
conversations = [conversation for conversation in corpus.conversations] ; len(conversations)

17219

In [55]:
# the messages from the first 100 conversations
utterances = [corpus.get_conversation(conversation).get_utterances_dataframe() for conversation in conversations[0:100]]

In [56]:
import pandas as pd

utterances_df = pd.concat(utterances) ; utterances_df.head()

| id | timestamp | text | speaker | reply_to | conversation_id | meta.author_flair_text | meta.gilded | meta.gildings | meta.permalink | meta.retrieved_on | meta.score | meta.stickied | meta.subreddit | meta.top_level_comment | vectors |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1dm8zb | 1367588193 | | niliss | | 1dm8zb | | 0 | | /r/MBA/comments/1dm8zb/mba_not_worth_it_if_you... | 1412679423 | 0 | False | MBA | | [] |
| 1do1kx | 1367649362 | | wash0jarvis | | 1do1kx | | 0 | | /r/MBA/comments/1do1kx/how_to_effectively_take... | 1412677134 | 1 | False | MBA | | [] |
| 1dsq3o | 1367855426 | I'll do my best to keep this brief. \n\nI was ... | [deleted] | | 1dsq3o | | 0 | | /r/MBA/comments/1dsq3o/nontraditional_mba_conc... | 1412670646 | 2 | False | MBA | | [] |
| c9tkdtx | 1367865934 | Anthropology major who worked at an auto racin... | SheepdogApproved | 1dsq3o | 1dsq3o | | 0 | | | 1431283419 | 3 | False | MBA | c9tkdtx | [] |
| c9tkf3t | 1367866026 | my program had a liberal arts major who was se... | defectiveburger | 1dsq3o | 1dsq3o | | 0 | | | 1431283402 | 2 | False | MBA | c9tkf3t | [] |


In [57]:
# remove empty text utterances
utterances_df = utterances_df[utterances_df['text'].str.strip().astype(bool)]
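
Empty strings are not the only texts that carry nothing to label. Reddit replaces deleted or moderator-removed content with the placeholders `[deleted]` and `[removed]`. Whether such placeholders appear in the `text` field of this particular corpus is an assumption worth checking on your own data. The helper below (`is_usable_text` is a hypothetical name) is one possible stricter filter.

```python
# Placeholder strings Reddit shows for deleted or removed content
PLACEHOLDERS = {"[deleted]", "[removed]"}


def is_usable_text(text):
    """True for non-empty strings that are not a Reddit placeholder."""
    if not isinstance(text, str):
        return False  # e.g. None or NaN coming from a dataframe
    stripped = text.strip()
    return bool(stripped) and stripped not in PLACEHOLDERS


print(is_usable_text("Stuff like banks/consultancies/F500"))  # True
print(is_usable_text("   "))                                   # False
print(is_usable_text("[removed]"))                             # False
```

With pandas it could be applied as `utterances_df[utterances_df["text"].map(is_usable_text)]`.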

Once we have this pandas DataFrame, we can follow the Rubrix guide for importing data from pandas: https://rubrix.readthedocs.io/en/stable/guides/datasets.html#Importing-from-other-formats

In [None]:
import rubrix as rb
# import data from a pandas DataFrame
dataset_rb = rb.read_pandas(utterances_df, task="TextClassification")

In [59]:
# the above object can be used to create a Rubrix dataset
rb.log(dataset_rb, name="ie_reddit_project")

  0%|          | 0/578 [00:00<?, ?it/s]

578 records logged to http://localhost:6900/datasets/rubrix/ie_reddit_project


BulkResponse(dataset='ie_reddit_project', processed=578, failed=0)

## Creating a Rubrix dataset with conversation titles (posts): Text Classification

In [66]:
posts = [corpus.get_conversation(conversation).meta["title"] for conversation in conversations[0:5000]]

In [67]:
# create a pandas df with a text column
df = pd.DataFrame({"text": posts})

dataset_rb = rb.read_pandas(df, task="TextClassification")

# create Rubrix dataset
rb.log(dataset_rb, name="ie_reddit_project_posts")

  0%|          | 0/5000 [00:00<?, ?it/s]

5000 records logged to http://localhost:6900/datasets/rubrix/ie_reddit_project_posts


BulkResponse(dataset='ie_reddit_project_posts', processed=5000, failed=0)

## Creating a Rubrix dataset with conversation pairs (title and response):  Text Classification

In [93]:
pairs = []
for conversation in conversations[0:100]:
    conv = corpus.get_conversation(conversation)
    # first utterance of the conversation; in the output below this is the post body, not a reply
    first_response = conv.get_utterance(conv.get_utterance_ids()[0])
    if first_response.text != "":
        pair = {
            "title": conv.meta["title"],
            "response": first_response.text
        }
        pairs.append(pair)
    
pairs[0:2]

[{'title': 'Non-traditional MBA concerns. Advice greatly appreciated.',
  'response': "I'll do my best to keep this brief. \n\nI was a Liberal Arts Major at a respectable college. After graduating, I toured and worked on music for over 2 years, self-marketing and promoting myself. I handled every aspect of this and very much considered it my own small business, as it was the only income I lived off of for those 2 years. I decided I wanted to further my knowledge of the science behind recording and went back to school to get my Audio Engineering certificate. During this time I managed a locally owned restaurant. I have since been working as an accounting clerk and writing content for start-ups during the last year. \n\nI have since decided I want to go back to school to get my MBA to gain the fundamental knowledge needed to succeed in business. Marketing (and specifically Strategy) is what I want to to concentrate on. \n\n\nMy questions are as follows:\n\n1) With a solid undergrad GPA a

In [94]:
records = [
    rb.TextClassificationRecord(inputs=inputs)
    for inputs in pairs
]
rb.log(records, name="ie_reddit_project_pairs")

  0%|          | 0/57 [00:00<?, ?it/s]

57 records logged to http://localhost:6900/datasets/rubrix/ie_reddit_project_pairs


BulkResponse(dataset='ie_reddit_project_pairs', processed=57, failed=0)
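
In the output above, the `response` of the first pair is the body of the post itself, because the first utterance of a conversation is the post. If you would rather pair each title with the earliest direct reply, one option is to select it through `reply_to` and `timestamp`. The sketch below works on plain dicts with values copied from the sample dataframe shown earlier; `first_reply` is a hypothetical helper, and it relies on the conversation id being equal to the id of the post, as in that sample.

```python
# Utterances of conversation 5j72tu, copied (with truncated texts) from the sample above
utterances = [
    {"id": "5j72tu", "reply_to": None, "timestamp": 1482162857,
     "text": "I'm very lucky to have gotten in to MIT and Be..."},
    {"id": "dbe13th", "reply_to": "5j72tu", "timestamp": 1482172474,
     "text": "Similar case, accepted at Booth and MIT. Will ..."},
    {"id": "dbdum8b", "reply_to": "5j72tu", "timestamp": 1482164551,
     "text": "Well, the biggest thing is you'd drop another ..."},
    {"id": "dbe42rq", "reply_to": "dbe13th", "timestamp": 1482175999,
     "text": "Presumably you'll make a deposit at either Boo..."},
]


def first_reply(utterances, conversation_id):
    """Earliest non-empty utterance replying directly to the post, or None."""
    replies = [
        utterance for utterance in utterances
        if utterance["reply_to"] == conversation_id and utterance["text"].strip()
    ]
    return min(replies, key=lambda utterance: utterance["timestamp"], default=None)


reply = first_reply(utterances, "5j72tu")
print(reply["id"])  # dbdum8b
```

With `convokit`, the same dicts could be built from the rows of `get_utterances_dataframe()`.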

## Creating a Rubrix dataset for labeling entities in responses: Token classification


In [99]:
# the messages from the first 100 conversations
utterances = [corpus.get_conversation(conversation).get_utterances_dataframe() for conversation in conversations[0:100]]
utterances_df = pd.concat(utterances) 

In [105]:
# IMPORTANT: For NER or token classification you need to tokenize text (see https://spacy.io/usage/spacy-101#annotations-token)
import spacy

nlp = spacy.blank("en")

utterances_df['tokens'] = utterances_df.apply(lambda row: [t.text for t in nlp(row["text"])], axis=1) 

# Rubrix is able to read the dataframe and identify the columns

record_tok = rb.read_pandas(utterances_df, task="TokenClassification") 

rb.log(record_tok, "ie_reddit_project_tokens")



  0%|          | 0/578 [00:00<?, ?it/s]

578 records logged to http://localhost:6900/datasets/rubrix/ie_reddit_project_tokens


BulkResponse(dataset='ie_reddit_project_tokens', processed=578, failed=0)
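
Labelling entities from scratch is slow, so it can help to pre-annotate obvious mentions and only correct them in the UI. The sketch below finds known university names with a simple word-boundary match and returns `(label, start, end)` character spans, which is, to my understanding, the shape Rubrix expects for the `prediction` of a token classification record; please verify against the Rubrix documentation. The function name, the `UNIVERSITY` label, and the name list are assumptions for illustration.

```python
import re


def find_entity_spans(text, names, label="UNIVERSITY"):
    """Return (label, start, end) character spans for exact name matches."""
    spans = []
    for name in names:
        # \b keeps "MIT" from matching inside longer words such as "SUBMIT"
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            spans.append((label, match.start(), match.end()))
    return sorted(spans, key=lambda span: span[1])


text = "Similar case, accepted at Booth and MIT."
spans = find_entity_spans(text, ["MIT", "Booth"])
print([text[start:end] for _, start, end in spans])  # ['Booth', 'MIT']
```

A name list like this will miss abbreviations and misspellings, which is exactly what a trained NER model is meant to pick up after labelling.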

## Types of NLP models

### Text Classification
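Text classification can be multi-class (exactly one label per text) or multilabel (any number of labels per text), and the input can be a single text, a pair of texts, or a list of texts.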
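
The practical difference between multi-class and multilabel shows up when model scores are turned into labels: multi-class keeps the single best label, multilabel keeps every label above a threshold. A small sketch with made-up scores and label names (the 0.5 threshold is an arbitrary choice):

```python
def pick_multiclass(scores):
    """Multi-class: keep only the highest-scoring label."""
    return max(scores, key=scores.get)


def pick_multilabel(scores, threshold=0.5):
    """Multilabel: keep every label whose score reaches the threshold."""
    return sorted(label for label, score in scores.items() if score >= threshold)


# Hypothetical, independent per-label scores for one Reddit post
scores = {"admissions": 0.7, "costs": 0.6, "careers": 0.1}

print(pick_multiclass(scores))  # admissions
print(pick_multilabel(scores))  # ['admissions', 'costs']
```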

Tutorials:

- https://rubrix.readthedocs.io/en/stable/tutorials/01-labeling-finetuning.html
- https://rubrix.readthedocs.io/en/stable/tutorials/weak-supervision-with-rubrix.html

Recommended libraries:

- scikit-learn
- Hugging Face Transformers (needs GPU, use Colab)


### Token Classification

Classify parts of the text, for example to detect university names, course prices, etc.

Tutorials:

https://rubrix.readthedocs.io/en/stable/tutorials/02-spacy.html (for exploring pretrained models)

https://www.rubrix.ml/blog/concise-concepts-rubrix/

https://www.rubrix.ml/blog/veganuary/

Recommended libraries:

- spaCy
- concise-concepts


### Text2Text

Given a text, generate another text, for example a summary of the response.

Recommended libraries:

- Hugging Face Transformers: use pretrained models from https://huggingface.co/models?pipeline_tag=summarization&sort=downloads (needs GPU, use Colab)
