
# Session-based Recommender System

In [1]:
#@title Import Libraries
import pandas as pd

from datetime import datetime , timedelta
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.utils import shuffle
import random
import ast
import numpy as np
from tqdm import tqdm

## Loading the Data: Mounting Drive or Cloning the Git Repo

### Option 1: Google Drive

In [2]:
#@title Mount G-drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#@title Read data and output top 5 rows
path_to_csv = '/content/drive/MyDrive/MajorProject/reddit_data/reddit_data.csv'

data = pd.read_csv(path_to_csv)

print(len(data))
data.head()

14000000


Unnamed: 0,username,subreddit,utc
0,kabanossi,photoshopbattles,1482748000.0
1,kabanossi,GetMotivated,1482748000.0
2,kabanossi,vmware,1482748000.0
3,kabanossi,carporn,1482748000.0
4,kabanossi,DIY,1482747000.0




### Option 2: GitHub



In [None]:
#@title Clone Git Repo in Session
!git clone https://github.com/PragunSaini/MajorProject2022.git

In [None]:
#@title Read data from Git Repo
data_rawUrlGit = 'https://media.githubusercontent.com/media/PragunSaini/MajorProject2022/master/Datasets/reddit_data/reddit_data.csv'
data = pd.read_csv(data_rawUrlGit)

## Data Analysis

Point 1: all records for a given user are stored contiguously.

Point 2: within each user's block, the `utc` timestamps are in descending order (most recent first).
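
Both points can be checked programmatically rather than by eye. The sketch below does so on plain `(username, subreddit, utc)` tuples. The helper name `check_layout` and the toy rows are illustrative assumptions, not part of the notebook; on the real DataFrame the rows could come from `data.itertuples(index=False)`.

```python
from itertools import groupby

def check_layout(rows):
    """Return (users_contiguous, utc_descending) for (username, subreddit, utc) rows."""
    seen = set()
    users_contiguous = True
    utc_descending = True
    # groupby only merges adjacent rows, so meeting a user twice means a split block
    for user, block in groupby(rows, key=lambda r: r[0]):
        if user in seen:
            users_contiguous = False
        seen.add(user)
        times = [r[2] for r in block]
        # Non-strict check: equal timestamps are allowed
        if any(a < b for a, b in zip(times, times[1:])):
            utc_descending = False
    return users_contiguous, utc_descending

# Toy rows (hypothetical values, not taken from the dataset)
toy_rows = [
    ("alice", "DIY", 300.0),
    ("alice", "food", 200.0),
    ("bob", "news", 500.0),
    ("bob", "gaming", 500.0),
]
print(check_layout(toy_rows))
```

Running this over all 14 million rows in pure Python would be slow, so it is best read as a statement of what the two points mean.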

In [4]:
data.head(10)

Unnamed: 0,username,subreddit,utc
0,kabanossi,photoshopbattles,1482748000.0
1,kabanossi,GetMotivated,1482748000.0
2,kabanossi,vmware,1482748000.0
3,kabanossi,carporn,1482748000.0
4,kabanossi,DIY,1482747000.0
5,kabanossi,food,1482747000.0
6,kabanossi,CatastrophicFailure,1482514000.0
7,kabanossi,photoshopbattles,1482514000.0
8,kabanossi,carporn,1482513000.0
9,kabanossi,techsupport,1482513000.0


In [5]:
print("User Count: " , data['username'].nunique())
print("Subreddit counts: " , data['subreddit'].nunique())

User Count:  22610
Subreddit counts:  34967


In [6]:
frequency_sr_df = data['subreddit'].value_counts().rename_axis('unique_redd').reset_index(name='counts')
frequency_sr_df.head()

Unnamed: 0,unique_redd,counts
0,AskReddit,1030290
1,politics,367860
2,The_Donald,216939
3,nfl,173883
4,leagueoflegends,157663


In [7]:
frequency_users_df = data['username'].value_counts().rename_axis('unique_users').reset_index(name='counts')
frequency_users_df.head(5)

Unnamed: 0,unique_users,counts
0,LostAccountant,1000
1,bakitbakitba,1000
2,Neres28,1000
3,rainbowgeoff,1000
4,cosmiccrystalponies,1000


## Data Preprocessing

### Removing users and subreddits with frequency below a minimum threshold

**Rules for preprocessing dataset:**

1. Dataset format: [index, user, item, timestamp]
2. Sort by user and timestamp
3. Remove users or items with fewer than 10 occurrences
4. Parse the dataset into user -> array(sessions) format, with each session -> array((timestamp, item))
5. Interactions less than 3600 seconds apart belong to the same session
6. Remove consecutive repetitions of an item within the same session
7. Remove sessions that have only 1 item or too many items (here > 40)
8. Split sessions further where possible, using a max session size of 20 and dropping a trailing chunk with fewer than 2 items
9. Remove users with too little data (e.g. fewer than 3 sessions)
10. Map user and item values to sequential labels for further usage
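
Rule 10 has no code in this section. One possible way to do it is sketched below. The helper `build_label_map`, the toy sessions, and the choice to start labels at 0 are all assumptions made for illustration; some models reserve 0 for padding and would start at 1 instead.

```python
def build_label_map(values):
    """Assign 0, 1, 2, ... to distinct values in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping

# Toy data in the user -> list of sessions -> list of (timestamp, item) layout
toy_sessions = {
    "alice": [[(10, "DIY"), (20, "food")], [(5000, "food"), (5010, "news")]],
    "bob": [[(30, "news"), (40, "DIY")]],
}

user_map = build_label_map(toy_sessions.keys())
item_map = build_label_map(
    item
    for sessions in toy_sessions.values()
    for session in sessions
    for _, item in session
)

# Re-express every user and item through its integer label
encoded = {
    user_map[user]: [[(ts, item_map[item]) for ts, item in session] for session in sessions]
    for user, sessions in toy_sessions.items()
}
print(user_map)
print(item_map)
print(encoded)
```

Keeping `user_map` and `item_map` around makes it possible to translate model outputs back to usernames and subreddit names later.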

**Removing subreddits that have frequency < 10**


In [23]:
frequency_sr_df = data['subreddit'].value_counts()
data_subredditRemoval = data[~data['subreddit'].isin(frequency_sr_df[frequency_sr_df < 10].index)]


**Removing users with frequency < 10**

In [24]:
frequency_user_df = data_subredditRemoval['username'].value_counts()
Final_Data = data_subredditRemoval[~data_subredditRemoval['username'].isin(frequency_user_df[frequency_user_df < 10].index)]
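
The two filters above can be illustrated on a toy list of `(user, item)` events using only the standard library. The function name `filter_by_support`, the toy events, and the threshold of 2 are hypothetical; the notebook uses a threshold of 10 on the real DataFrame.

```python
from collections import Counter

MIN_SUPPORT = 2  # toy threshold; the notebook uses 10

def filter_by_support(events, min_support=MIN_SUPPORT):
    """events: list of (user, item). Drop rare items first, then rare users."""
    item_counts = Counter(item for _, item in events)
    kept = [(u, i) for u, i in events if item_counts[i] >= min_support]
    # User counts are taken after the item filter, mirroring the cell order above
    user_counts = Counter(u for u, _ in kept)
    return [(u, i) for u, i in kept if user_counts[u] >= min_support]

toy_events = [
    ("alice", "DIY"), ("alice", "food"), ("alice", "rare_sub"),
    ("bob", "DIY"), ("bob", "food"),
    ("carol", "food"),
]
print(filter_by_support(toy_events))
```

Each filter is applied once. Because users are removed after items, an item could in principle fall back below the threshold once its users are dropped; the sketch, like the cells above, does not iterate until both conditions hold.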

In [25]:
Final_Data.head()

Unnamed: 0,username,subreddit,utc
0,kabanossi,photoshopbattles,1482748000.0
1,kabanossi,GetMotivated,1482748000.0
2,kabanossi,vmware,1482748000.0
3,kabanossi,carporn,1482748000.0
4,kabanossi,DIY,1482747000.0


In [26]:
print(f"No of users : {len(Final_Data['username'].unique())}")
print(f"No of items : {len(Final_Data['subreddit'].unique())}")
print(f"No of events : {len(Final_Data)}")

No of users : 21742
No of items : 13937
No of events : 13937354


In [27]:
# Work on an explicit copy so the assignment below does not trigger SettingWithCopyWarning
Final_Data = Final_Data.copy()
Final_Data['utc'] = pd.to_datetime(Final_Data['utc'], unit='s')
Final_sortedData = Final_Data.sort_values(by=["username", "utc"] , ignore_index = True)
Final_sortedData.head(10)


Unnamed: 0,username,subreddit,utc
0,--ANUSTART-,news,2015-12-29 17:43:17
1,--ANUSTART-,news,2015-12-29 18:35:49
2,--ANUSTART-,AdviceAnimals,2015-12-30 15:54:03
3,--ANUSTART-,AskReddit,2015-12-30 16:19:23
4,--ANUSTART-,explainlikeimfive,2015-12-30 16:39:05
5,--ANUSTART-,Documentaries,2015-12-31 16:25:46
6,--ANUSTART-,Showerthoughts,2015-12-31 17:20:29
7,--ANUSTART-,AskReddit,2015-12-31 17:47:43
8,--ANUSTART-,sanantonio,2015-12-31 19:14:58
9,--ANUSTART-,gaming,2016-01-02 00:32:33


### Setting up data with user's session details





In [3]:
# Final_sortedData = pd.read_csv('/content/drive/MyDrive/MajorProject/SortedData_Reddit.csv')

In [40]:
# preprocessing variables
SESSION_TIME = timedelta(seconds=60*60)
MAX_SESSION_LENGTH = 20
MIN_REQUIRED_SESSIONS = 3
MIN_ITEM_SUPPORT = 10
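
To make the role of `SESSION_TIME` concrete, here is a minimal sketch of gap-based session splitting for a single user. The name `split_by_gap` and the timestamps are hypothetical; the comparison mirrors the rule that a gap below one hour keeps an event in the current session.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(seconds=60 * 60)  # same value as SESSION_TIME above

def split_by_gap(events, gap=SESSION_GAP):
    """events: time-sorted list of (timestamp, item) for one user."""
    sessions = []
    for event in events:
        # Start a new session for the first event or when the gap reaches the threshold
        if not sessions or event[0] - sessions[-1][-1][0] >= gap:
            sessions.append([event])
        else:
            sessions[-1].append(event)
    return sessions

t0 = datetime(2016, 1, 1, 12, 0, 0)  # hypothetical timestamps
toy_events = [
    (t0, "news"),
    (t0 + timedelta(minutes=20), "AskReddit"),
    (t0 + timedelta(minutes=79), "gaming"),  # 59 min after the previous event
    (t0 + timedelta(minutes=139), "DIY"),    # exactly 60 min after: new session
]
print([[item for _, item in s] for s in split_by_gap(toy_events)])
```

The gap is measured against the previous event, not the start of the session, so a session can span more than one hour as long as no single pause reaches the threshold.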

In [35]:
#@title **Methods to split dataset into User Sessions according to preprocessing rules**
# session -> list of events in format (utc, subreddit)
# Drops an event when its subreddit equals that of the previously kept event,
# so only consecutive repeats are collapsed
def collapse_session(session):
  new_session = [session[0]]
  for i in range(1, len(session)):
    last_event = new_session[-1]
    current_event = session[i]
    if current_event[1] != last_event[1]:
      new_session.append(current_event)

  return new_session

# user_sessions -> dict(user: list of sessions); each session is a list of (utc, subreddit)
# Collapses consecutive repeats in every session, modifying user_sessions in place
def collapse_repeating_session(user_sessions):
  for user, sessions in user_sessions.items():
    for i in range(len(sessions)):
      sessions[i] = collapse_session(sessions[i])

# Remove sessions with only one event or with more than MAX_SESSION_LENGTH*2 events
def remove_invalid_sessions(user_sessions):
  new_user_sessions = {}
  for user, current in user_sessions.items():
    new_user_sessions[user] = [
        session for session in current
        if 1 < len(session) <= MAX_SESSION_LENGTH*2
    ]
  return new_user_sessions


# session -> list of events in format (utc, subreddit)
# Returns chunks of at most MAX_SESSION_LENGTH events
def split_session(session):
  splits = [session[i:i+MAX_SESSION_LENGTH] for i in range(0, len(session), MAX_SESSION_LENGTH)]
  # Drop the last chunk if it holds fewer than 2 events
  if len(splits[-1]) < 2:
    return splits[:-1]
  return splits

# user_sessions -> dict(user: list of sessions)
# Replaces each user's sessions with their chunks, modifying user_sessions in place
def split_long_sessions(user_sessions):
  for user, sessions in user_sessions.items():
    user_sessions[user] = []
    for session in sessions:
      user_sessions[user] += split_session(session)

# dataset -> session dataset (columns : [index, user, item, utc])
# Assumes dataset is sorted by user and timestamp
def split_dataset_to_sessions(dataset):
  user_sessions = {}
  current_session = []
  for row in tqdm(dataset.itertuples()):
    username, subreddit, utc = row[1:] # Ignore index
    event = (utc, subreddit)
    
    # New User
    if username not in user_sessions:
      user_sessions[username] = []
      current_session = []
      user_sessions[username].append(current_session)
      current_session.append(event)
      continue
    
    # Existing user
    last_event = current_session[-1]
    time_gap = event[0] - last_event[0]  # named time_gap to avoid shadowing the imported timedelta
    if time_gap < SESSION_TIME:
      current_session.append(event)
    else:
      current_session = [event]
      user_sessions[username].append(current_session)

  collapse_repeating_session(user_sessions)
  user_sessions = remove_invalid_sessions(user_sessions)
  split_long_sessions(user_sessions)
  
  # Remove users with fewer than MIN_REQUIRED_SESSIONS sessions
  to_remove = set()
  for user, sessions in user_sessions.items():
    if (len(sessions) < MIN_REQUIRED_SESSIONS):
      to_remove.add(user)
  for user in to_remove:
    del user_sessions[user]

  # Final sessions data available for each remaining user
  return user_sessions
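
The chunking step is easy to misread, so here is a small standalone sketch of how a long session is cut. The function `chunk_session` and the integer stand-in events are illustrative assumptions; the two constants mirror the values used above.

```python
MAX_LEN = 20  # same value as MAX_SESSION_LENGTH above
MIN_LEN = 2   # a trailing chunk shorter than this is dropped

def chunk_session(session, max_len=MAX_LEN, min_len=MIN_LEN):
    """Cut a session into consecutive chunks of at most max_len events."""
    chunks = [session[i:i + max_len] for i in range(0, len(session), max_len)]
    if chunks and len(chunks[-1]) < min_len:
        chunks = chunks[:-1]
    return chunks

for n in (20, 21, 22, 40):
    session = list(range(n))  # stand-in events
    print(n, [len(c) for c in chunk_session(session)])
```

A session of 21 events therefore loses its final event, because the leftover chunk of length 1 is discarded rather than merged into the previous chunk.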

In [36]:
sessions = split_dataset_to_sessions(Final_sortedData)

13937354it [05:06, 45532.41it/s]


In [37]:
# Calculate statistics from session data

users = sessions.keys()
items = set()
num_sessions = 0
num_interactions = 0
interactions_per_user = []
interactions_per_session = []

for _, ses in sessions.items():
  num_sessions += len(ses)
  user_interactions = 0
  for session in ses:
    num_interactions += len(session)
    interactions_per_session.append(len(session))
    user_interactions += len(session)
    for event in session:
      items.add(event[1])
  interactions_per_user.append(user_interactions)

Results from the paper:

- No of users : 18173
- No of items : 13521
- No of sessions : 1119225
- No of interactions : 2868050
- No of interactions per session : 2.6
- No of interactions per user : 157.8

In [39]:
print("Results from preprocessing")
print(f"No of users : {len(users)}")
print(f"No of items : {len(items)}")
print(f"No of sessions : {num_sessions}")
print(f"No of interactions : {num_interactions}")
print(f"No of interactions per session : {np.array(interactions_per_session).mean()}")
print(f"No of interactions per user : {np.array(interactions_per_user).mean()}")

Results from preprocessing
No of users : 18186
No of items : 13737
No of sessions : 1123442
No of interactions : 3388177
No of interactions per session : 3.015889560831801
No of interactions per user : 186.30688441658418
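
The two averages are simply the interaction total divided by the session and user counts, so the comparison with the paper can be reduced to the three raw counts. The sketch below recomputes the averages from the figures quoted above and prints the relative gap for each count; the dictionary layout and function names are illustrative.

```python
# Counts copied from the two result listings above
paper = {"users": 18173, "sessions": 1119225, "interactions": 2868050}
ours = {"users": 18186, "sessions": 1123442, "interactions": 3388177}

def averages(stats):
    """Return (interactions per session, interactions per user)."""
    return (stats["interactions"] / stats["sessions"],
            stats["interactions"] / stats["users"])

def relative_gap(ours_value, paper_value):
    """Signed difference relative to the paper's value."""
    return (ours_value - paper_value) / paper_value

for name, stats in (("paper", paper), ("ours", ours)):
    per_session, per_user = averages(stats)
    print(f"{name}: {per_session:.2f} per session, {per_user:.1f} per user")

for key in paper:
    print(f"{key}: {relative_gap(ours[key], paper[key]):+.2%}")
```

User and session counts land within half a percent of the paper, while the interaction count is roughly 18% higher, which is what pulls both averages up.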
