In [1]:
import pandas as pd
import numpy as np

**I loaded the CSV file Travis sent me directly into the notebook.**

In [2]:
data = pd.read_csv('askscience_data.csv')

In [3]:
data.head(2)

Unnamed: 0.1,Unnamed: 0,title,body,tag,datetime,author,score,upvote_ratio,url
0,0,Post viral cough: why does it get worse after ...,Tl;dr: why is your cough during an upper respi...,Human Body,2022-12-09 02:52:07,CarboniferousCreek,1343.0,0.93,https://www.reddit.com/r/askscience/comments/z...
1,1,Can an x-ray of an adult show chronic malnouri...,If a person was chronically undernourished thr...,Human Body,2022-12-08 18:28:51,Foxs-In-A-Trenchcoat,426.0,0.91,https://www.reddit.com/r/askscience/comments/z...


**Definitions** 
> score: The number of upvotes minus the number of downvotes

> upvote_ratio: Ratio between upvotes and total votes 

**To answer the 1st question:**


1.   Determine the attributes of a successful post on r/askscience

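By these definitions, posts with higher scores are more successful. The score itself depends on the upvote and downvote counts. Since score = up - down and ratio = up / (up + down), both counts can be recovered from the score and the upvote ratio (as long as the ratio is not exactly 0.5):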

\begin{align}
  up = \frac{ratio}{(2 \, ratio - 1)} \, score
  \end{align}

\begin{align}
  down = \frac{(1 - ratio)}{(2 \, ratio - 1)} \, score
  \end{align}
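
As a quick sanity check on the two formulas above, the sketch below recovers the vote counts for a made-up post with 93 upvotes and 7 downvotes (score 86, ratio 0.93). The helper name `votes_from_score` and the numbers are illustrative assumptions, not values from the dataset.

```python
def votes_from_score(score, ratio):
    """Recover (up, down) from score = up - down and ratio = up / (up + down)."""
    if ratio == 0.5:
        # up == down, so score is 0 and the total vote count cannot be recovered
        raise ValueError("ratio of 0.5 makes the vote counts unrecoverable")
    total = score / (2 * ratio - 1)  # total number of votes
    return ratio * total, (1 - ratio) * total


# Hypothetical post: 93 upvotes and 7 downvotes
up, down = votes_from_score(score=86, ratio=0.93)
print(round(up), round(down))
```

Because Reddit reports a rounded ratio, counts recovered this way from real data are approximate.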

We can also check how strongly the other parameters correlate with the score to gauge their relevance:



In [4]:
data['up'] = data['score']*data['upvote_ratio']/(2*data['upvote_ratio']-1)
data['down'] = (1-data['upvote_ratio'])*data['score']/(2*data['upvote_ratio']-1)

In [5]:
# numeric_only=True restricts the correlation to numeric columns
correlations = data.corr(numeric_only=True)
correlations.iloc[1:,1:]



Unnamed: 0,score,upvote_ratio,up,down
score,1.0,0.548324,0.99238,0.726761
upvote_ratio,0.548324,1.0,0.503891,0.235986
up,0.99238,0.503891,1.0,0.80586
down,0.726761,0.235986,0.80586,1.0


The score is very highly correlated with the upvote count (0.99), which is expected because up is computed directly from score.

We can also investigate the relationship between categorical features (e.g. 'tag') and a post's score or upvote_ratio. For example, we can check the average score and upvote_ratio for each tag:

In [6]:
# Mean score and mean upvote_ratio per tag, sorted by descending mean score
average_score_by_tag = data.groupby('tag')['score'].mean().sort_values(ascending=False)
average_upvote_ratio_by_tag = data.groupby('tag')['upvote_ratio'].mean().sort_values(ascending=False)
total = average_score_by_tag.to_frame().join(average_upvote_ratio_by_tag)
total

Unnamed: 0_level_0,score,upvote_ratio
tag,Unnamed: 1_level_1,Unnamed: 2_level_1
Meta,39288.0,0.86
First image of a black hole,13233.0,0.95
Ecology,9213.5,0.92
Anthropology,6975.869565,0.832609
Dog Cognition AMA,6880.0,0.85
CERN AMA,6776.0,0.88
Earth Sciences and Biology,6567.5,0.92
Computing,6095.689655,0.778448
Linguistics,5978.529412,0.836765
Engineering,5014.389474,0.792842


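Some tags have much higher average scores than others, which suggests that the success of a post is also topic dependent. Several of the top tags (e.g. 'Meta' and 'First image of a black hole') have round averages that may reflect only one or two posts, so those means should be read with caution.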
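
One way to guard against tags with very few posts is to report the post count next to each mean. The following is a minimal sketch on a made-up toy frame; the tags, scores, and the `min_posts` threshold are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the real data; tags and scores are made up
toy = pd.DataFrame({
    "tag": ["Physics", "Physics", "Physics", "Meta", "Biology", "Biology"],
    "score": [100.0, 300.0, 200.0, 5000.0, 50.0, 150.0],
})

# Mean score per tag together with the number of posts behind each mean
summary = (
    toy.groupby("tag")["score"]
       .agg(["mean", "count"])
       .sort_values("mean", ascending=False)
)

# Keep only tags with at least min_posts posts (arbitrary threshold)
min_posts = 2
reliable = summary[summary["count"] >= min_posts]
print(summary)
print(reliable)
```

In this toy example the single-post 'Meta' tag tops the unfiltered ranking but drops out once the count threshold is applied.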

We can also analyze the relationship between text features (title and body) and a post's score or upvote_ratio, e.g. by checking whether the length of the text is correlated with success.

In [7]:
def typeChecker(x):
  # Return the text length for strings; missing bodies (NaN) yield None
  if isinstance(x, str):
    return len(x)
  return None

In [8]:
data['title_length'] = data['title'].apply(len)
data['body_length'] = data['body'].apply(typeChecker)
# numeric_only=True restricts the correlation to numeric columns
correlations = data.corr(numeric_only=True)
correlations.iloc[1:,1:]




Unnamed: 0,score,upvote_ratio,up,down,title_length,body_length
score,1.0,0.548324,0.99238,0.726761,0.219878,0.082565
upvote_ratio,0.548324,1.0,0.503891,0.235986,0.208794,0.086683
up,0.99238,0.503891,1.0,0.80586,0.204485,0.098616
down,0.726761,0.235986,0.80586,1.0,0.090263,0.133595
title_length,0.219878,0.208794,0.204485,0.090263,1.0,0.09287
body_length,0.082565,0.086683,0.098616,0.133595,0.09287,1.0


This indicates that title length is more strongly associated with a post's success than body length (correlation with score of 0.22 versus 0.08), although both correlations are weak.
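
As an alternative to a custom helper, pandas' `.str.len()` accessor computes string lengths and leaves missing values as missing. This is a small sketch on made-up rows, not the notebook's actual data.

```python
import numpy as np
import pandas as pd

# Toy rows; the second post has no body
toy = pd.DataFrame({
    "title": ["Why is the sky blue?", "How do vaccines work?"],
    "body": ["Short body", np.nan],
})

# .str.len() returns the length per string and propagates missing values
toy["title_length"] = toy["title"].str.len()
toy["body_length"] = toy["body"].str.len()
print(toy[["title_length", "body_length"]])
```

Missing lengths are skipped pairwise when correlations are computed, so posts without a body do not distort the body_length correlation.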

**To answer the 2nd question:**


> Build a model that can predict the score of a post on r/askscience given at least the title and body of the post

To build a model that predicts the score of a post on r/askscience using the title and body of the post, we can fine-tune a pre-trained language model such as BERT or DistilBERT. These models have shown excellent performance on various NLP tasks and can be adapted for this regression problem. We'll use the Hugging Face transformers library and PyTorch.



In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transformers-4.28.1


In [4]:
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from torch.utils.data import DataLoader, Dataset
import torch.optim as optim
import torch.nn as nn
from sklearn.model_selection import train_test_split

Prepare the dataset for training and testing:

In [5]:
from sklearn.preprocessing import MinMaxScaler
class AskScienceDataset(Dataset):
    def __init__(self, texts, targets, tokenizer, max_len):
        self.texts = texts
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        inputs = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = inputs['input_ids'][0]
        attention_mask = inputs['attention_mask'][0]

        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'target': torch.tensor(self.targets[idx], dtype=torch.float)
        }

data['body'] = data['body'].fillna('')
data['combined_text'] = data['title'] + ' ' + data['body']
X_train, X_test, y_train, y_test = train_test_split(data['combined_text'], data['score'], test_size=0.2, random_state=42)
# X = data['combined_text'].values
# y = data['score'].values

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = MinMaxScaler()
scaler.fit(y_train.values.reshape(-1,1))
y_train = scaler.transform(y_train.values.reshape(-1,1)).reshape(-1)
y_test = scaler.transform(y_test.values.reshape(-1,1)).reshape(-1)

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
MAX_LEN = 512
BATCH_SIZE = 8
train_dataset = AskScienceDataset(X_train.values, y_train, tokenizer, MAX_LEN)
test_dataset = AskScienceDataset(X_test.values, y_test, tokenizer, MAX_LEN)

train_dataloader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False)



Initialize the DistilBERT model and set up the training parameters:

In [6]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=1
).to(device)
model = torch.compile(model)

optimizer = optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.MSELoss()
EPOCHS = 3



Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.weight', 'classifier

Fine-tune the model:

In [7]:
def train_epoch(model, data_loader, criterion, optimizer, device):
    model.train()
    epoch_loss = 0
    for batch in data_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        target = batch["target"].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # squeeze(-1) keeps the batch dimension, so a final batch of size 1 still matches target's shape
        loss = criterion(outputs.logits.squeeze(-1), target)
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(data_loader)


def evaluate(model, data_loader, criterion, device):
    model.eval()
    eval_loss = 0
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            target = batch["target"].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            # squeeze(-1) keeps the batch dimension, so a final batch of size 1 still matches target's shape
            loss = criterion(outputs.logits.squeeze(-1), target)

            eval_loss += loss.item()

    return eval_loss / len(data_loader)


def train(model, train_dataloader, test_dataloader, criterion, optimizer, device, epochs):
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}/{epochs}")
        train_loss = train_epoch(model, train_dataloader, criterion, optimizer, device)
        eval_loss = evaluate(model, test_dataloader, criterion, device)
        print(f"Train loss: {train_loss:.4f}, Eval loss: {eval_loss:.4f}")

train(model, train_dataloader, test_dataloader, criterion, optimizer, device, EPOCHS)


Epoch 1/3




Train loss: 0.0045, Eval loss: 0.0040
Epoch 2/3
Train loss: 0.0032, Eval loss: 0.0033
Epoch 3/3
Train loss: 0.0020, Eval loss: 0.0031
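
The losses above are measured on min-max scaled targets, so they are not in score units. Because the scaling is linear, an RMSE in the original units can be obtained by taking the square root of the scaled MSE and multiplying by the training score range. The sketch below assumes a hypothetical range of 0 to 40000; the real range is available from the fitted MinMaxScaler through its `data_min_` and `data_max_` attributes.

```python
import math


def rmse_in_score_units(mse_scaled, score_min, score_max):
    """Convert an MSE on min-max scaled targets to an RMSE in original units."""
    return math.sqrt(mse_scaled) * (score_max - score_min)


# 0.0031 is the epoch-3 eval loss above; the score range is a hypothetical placeholder
print(rmse_in_score_units(0.0031, score_min=0.0, score_max=40000.0))
```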


To improve the model, we can consider using additional features like 'tag' and 'datetime'. Here's why these features might be helpful:

'tag': Different tags may represent different topics or fields within science, and some topics might naturally attract more attention and higher scores. Including 'tag' as a categorical feature can help the model learn the relationship between different tags and post scores.

'datetime': The time when a post is created might also influence its score. Posts published during certain hours of the day or days of the week might receive more attention and upvotes.

To include these features in the model, you can follow these steps:


1.   Convert 'tag' into a one-hot encoded representation:




In [8]:
data_train, data_test = train_test_split(data, test_size=0.2, random_state=42)
tag_dummies_data = pd.get_dummies(data['tag'], prefix='tag')
tag_dummies_train, tag_dummies_test = train_test_split(tag_dummies_data, test_size=0.2, random_state=42)
X_tags_train = tag_dummies_train.values
X_tags_test = tag_dummies_test.values
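
The cell above encodes the tags on the full frame before splitting. A small sketch of why that ordering matters: the train and test parts then share the same dummy columns even when a tag appears in only one of them. The toy tags and the manual `iloc` split are assumptions made for illustration.

```python
import pandas as pd

# Toy tags; 'Meta' appears only in the last row
toy = pd.DataFrame({"tag": ["Physics", "Biology", "Physics", "Meta"]})

# Encode on the full frame so every split shares the same dummy columns
dummies = pd.get_dummies(toy["tag"], prefix="tag")

# Manual split: first three rows for training, last row for testing
train_part, test_part = dummies.iloc[:3], dummies.iloc[3:]

print(list(dummies.columns))
print(train_part.shape, test_part.shape)
```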


2.   Extract useful information from 'datetime', such as hour of the day, day of the week, and month:



In [9]:
data_train['datetime']= pd.to_datetime(data_train['datetime'])
data_test['datetime']= pd.to_datetime(data_test['datetime'])
# For 'datetime'
data_train['hour'] = data_train['datetime'].dt.hour
data_train['day_of_week'] = data_train['datetime'].dt.dayofweek
data_train['month'] = data_train['datetime'].dt.month

data_test['hour'] = data_test['datetime'].dt.hour
data_test['day_of_week'] = data_test['datetime'].dt.dayofweek
data_test['month'] = data_test['datetime'].dt.month



3.   Normalize the numerical features (hour, day_of_week, and month) to have a mean of 0 and a standard deviation of 1:



In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_train[['hour', 'day_of_week', 'month']] = scaler.fit_transform(data_train[['hour', 'day_of_week', 'month']])
data_test[['hour', 'day_of_week', 'month']] = scaler.transform(data_test[['hour', 'day_of_week', 'month']])

X_datetime_train = data_train[['hour', 'day_of_week', 'month']].values
X_datetime_test = data_test[['hour', 'day_of_week', 'month']].values



4.   After fine-tuning, use the model to generate embeddings for the text data. First, create a function that produces an embedding for each text from the fine-tuned language model:




In [11]:
def generate_embeddings(texts, tokenizer, model, max_len):
    embeddings = []
    for text in texts:
        inputs = tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        input_ids = inputs['input_ids'].to(device)
        attention_mask = inputs['attention_mask'].to(device)
        
        with torch.no_grad():
            outputs = model.base_model(input_ids=input_ids, attention_mask=attention_mask)
        
        embeddings.append(outputs.last_hidden_state[:, 0, :].squeeze().cpu().numpy())

    return np.array(embeddings)




5.   Generate embeddings for the text data:



In [12]:
text_embeddings_train = generate_embeddings(X_train, tokenizer, model, MAX_LEN)
text_embeddings_test = generate_embeddings(X_test, tokenizer, model, MAX_LEN)




6.   Concatenate the text embeddings from the language model with the one-hot encoded 'tag' and normalized 'datetime' features:




In [13]:
def concatenate_features(text_features, tag_features, datetime_features):
    return np.concatenate([text_features, tag_features, datetime_features], axis=1)
X_combined_train = concatenate_features(text_embeddings_train, X_tags_train, X_datetime_train)
X_combined_test = concatenate_features(text_embeddings_test, X_tags_test, X_datetime_test)
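
To make the resulting shape concrete, here is a minimal sketch with placeholder arrays. It assumes 768-dimensional text embeddings (the hidden size of distilbert-base-uncased), a made-up count of 5 tag columns, and the 3 datetime features.

```python
import numpy as np

n_posts, n_tags = 4, 5  # made-up sizes for illustration
text_features = np.zeros((n_posts, 768))     # one embedding vector per post
tag_features = np.zeros((n_posts, n_tags))   # one-hot tag columns
datetime_features = np.zeros((n_posts, 3))   # hour, day_of_week, month

# axis=1 stacks the feature blocks side by side, one row per post
combined = np.concatenate([text_features, tag_features, datetime_features], axis=1)
print(combined.shape)
```

All three blocks must have the same number of rows, in the same order, for the concatenation to pair each post with its own features.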


We can then train a simple fully connected network on X_combined_train and test it on X_combined_test.


In [17]:

class FeedforwardNN(nn.Module):
    def __init__(self, input_dim):
        super(FeedforwardNN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.bn1 = nn.BatchNorm1d(128)
        self.fc2 = nn.Linear(128, 64)
        self.bn2 = nn.BatchNorm1d(64)
        self.fc3 = nn.Linear(64, 1)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.fc1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = self.relu(x)
        x = self.fc3(x)
        return x

input_dim = X_combined_train.shape[1]
model = FeedforwardNN(input_dim).to(device)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
X_train_tensor = torch.tensor(X_combined_train, dtype=torch.float).to(device)
y_train_tensor = torch.tensor(y_train, dtype=torch.float).to(device)

X_test_tensor = torch.tensor(X_combined_test, dtype=torch.float).to(device)
y_test_tensor = torch.tensor(y_test, dtype=torch.float).to(device)

num_epochs = 50
batch_size = 32

for epoch in range(num_epochs):
    permutation = torch.randperm(X_train_tensor.size()[0])
    model.train()
    train_loss = 0.0
    for i in range(0, X_train_tensor.size()[0], batch_size):
        optimizer.zero_grad()
        
        indices = permutation[i:i+batch_size]
        batch_x = X_train_tensor[indices]
        batch_y = y_train_tensor[indices].view(-1, 1)
        
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        
        loss.backward()
        optimizer.step()
        train_loss += loss.item()
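    # NOTE: divides by the sample count, not the batch count, so the value is about batch_size times smaller than the mean batch MSE
    train_loss /= len(X_train_tensor)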
    
    if (epoch+1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {train_loss:.6f}')
        permutation = torch.randperm(X_test_tensor.size()[0])
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
          for i in range(0, X_test_tensor.size()[0], batch_size):
            indices = permutation[i:i+batch_size]
            batch_x_test = X_test_tensor[indices]
            batch_y_test = y_test_tensor[indices].view(-1, 1)
            outputs_test = model(batch_x_test)
            loss = criterion(outputs_test, batch_y_test)

            val_loss += loss.item()

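        # NOTE: divides by the sample count, not the batch count, so the value is about batch_size times smaller than the mean batch MSE
        val_loss /= len(X_test_tensor)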
        print(f"Epoch {epoch+1}/{num_epochs}, Validation Loss: {val_loss:.6f}")


Epoch [10/50], Loss: 0.000028
Epoch 10/50, Validation Loss: 0.000107
Epoch [20/50], Loss: 0.000024
Epoch 20/50, Validation Loss: 0.000097
Epoch [30/50], Loss: 0.000017
Epoch 30/50, Validation Loss: 0.000105
Epoch [40/50], Loss: 0.000016
Epoch 40/50, Validation Loss: 0.000098
Epoch [50/50], Loss: 0.000016
Epoch 50/50, Validation Loss: 0.000097
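
One caveat when comparing these numbers with the fine-tuning losses: the loop above sums per-batch mean losses and divides by the number of samples, whereas train_epoch divides by the number of batches. The sketch below, using made-up batch losses, shows how the two normalizations differ by a factor of the batch size when all batches are full.

```python
def average_losses(batch_mean_losses, n_samples):
    """Return (sum / n_samples, sum / n_batches) for a list of per-batch mean losses."""
    total = sum(batch_mean_losses)
    return total / n_samples, total / len(batch_mean_losses)


# Four full batches of 32 samples, each with a made-up mean loss of 0.0032
per_sample_norm, per_batch_norm = average_losses([0.0032] * 4, n_samples=128)
print(per_sample_norm, per_batch_norm)
```

With batch_size = 32, the feed-forward losses would need to be multiplied by roughly 32 before being compared with the DistilBERT losses.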
