In [1]:
import pandas as pd
import datetime as dt
import re

In [2]:
reddit_df = pd.read_csv('phr4r.csv')
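# Convert Unix epoch seconds to datetimes. dt.datetime.fromtimestamp returns the local
# time of the machine running the notebook, so the values are not UTC despite the column name.
reddit_df["created_utc"] = reddit_df["created_utc"].apply(dt.datetime.fromtimestamp)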
reddit_df.head()

Unnamed: 0,id,title,selftext,num_comments,score,created_utc
0,5lghbd,20 [F4F] Manila,Movie? Coffee?,1,8,2017-01-02 02:52:40
1,5le6zf,25 [M4R] - Let's start the new year right and ...,Hi!\nI want to start the new year right! I wan...,2,7,2017-01-01 15:45:41
2,5le8pd,Muff dive,[deleted],0,7,2017-01-01 16:01:21
3,5lgv46,20 [M4A] looking for minecraft players,Looking for players around bulacan for a LAN m...,0,6,2017-01-02 04:07:05
4,5li0y4,[26M4F] Dominate me.,I want to be your servant. Your dog. Your dick...,0,5,2017-01-02 07:52:15


In [3]:
reddit_df.describe(include='all')

Unnamed: 0,id,title,selftext,num_comments,score,created_utc
count,89480,89480,87732,89480.0,89480.0,89480
unique,89480,84502,51288,,,89393
top,avcdj2,M4F,[removed],,,2019-06-15 18:49:19
freq,1,76,33198,,,3
first,,,,,,2017-01-01 08:41:46
last,,,,,,2019-12-31 07:56:10
mean,,,,2.270798,2.334343,
std,,,,5.978305,6.682115,
min,,,,0.0,0.0,
25%,,,,0.0,1.0,


Some posts have no selftext (description). We keep them for now rather than dropping them, since their titles can still provide insight.

In [4]:
reddit_df.isnull().sum()

id                 0
title              0
selftext        1748
num_comments       0
score              0
created_utc        0
dtype: int64

### Data Cleaning

Titles in the phr4r subreddit usually follow the format `<Age> [<Gender>4<Gender>] <Short Title>`. We extract these fields as metadata for further analysis.

In [5]:
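# Compile each pattern once instead of on every call.
AGE_PATTERN = re.compile(r"(^\d\d)|(?<=^\[)\d\d")
TITLE_PATTERN = re.compile(r"(?<=[\]|\)]).*|(^[a-zA-Z0-9].*)")
GENDER_PATTERN = re.compile(r"\w4\w")

def extract_age(title):
    """Return the two leading digits of the title (optionally after a leading '['), else None."""
    result = AGE_PATTERN.search(title)
    return int(result.group()) if result else None

def extract_title(title):
    """Return the title text without its leading tag, or None if nothing is left.

    search() scans left to right, so a title that starts with a letter or digit
    matches the second alternative at position 0 and is returned whole. Other
    titles (such as those starting with '[') are cut after the first ']', '|', or ')'.
    """
    result = TITLE_PATTERN.search(title)
    return (result.group() or None) if result else None

def extract_gender(title):
    """Return (user_gender, target_gender) in lowercase, or (None, None)."""
    result = GENDER_PATTERN.search(title)
    if result is None:
        return None, None
    try:
        user_gender, target_gender = result.group().split("4")
    except ValueError:
        # Matches such as "444" do not split into exactly two parts.
        return None, None
    return user_gender.lower(), target_gender.lower()

reddit_df["main_title"] = reddit_df["title"].map(extract_title)
reddit_df["user_age"] = reddit_df["title"].map(extract_age)
# Run the gender extraction once and unpack both values from the result.
genders = reddit_df["title"].map(extract_gender)
reddit_df["user_gender"] = [user for user, _ in genders]
reddit_df["target_gender"] = [target for _, target in genders]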
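
As an illustration of the title format described earlier, the sketch below parses all fields in a single pass with named groups. It is not the code that produced the output in this notebook: the names `FULL_TITLE_PATTERN` and `parse_title` are hypothetical, and the pattern is stricter than the functions above (it assumes a bracketed or parenthesized `<Gender>4<Gender>` tag with single-letter codes and returns `None` for anything else).

```python
import re

# Hypothetical single-pass pattern for "<Age> [<Gender>4<Gender>] <Short Title>".
FULL_TITLE_PATTERN = re.compile(
    r"""
    ^\s*
    (?P<age>\d{2})?\s*          # age before the tag, as in: 20 [F4F]
    [\[(]\s*                    # opening bracket or parenthesis
    (?P<age_in>\d{2})?\s*       # age inside the tag, as in: [26M4F]
    (?P<user>[A-Za-z])4(?P<target>[A-Za-z])
    \s*[\])]                    # closing bracket or parenthesis
    [\s\-:]*                    # separator before the short title
    (?P<rest>.*)$
    """,
    re.VERBOSE,
)

def parse_title(title):
    """Return a dict of parsed fields, or None if the title does not fit the format."""
    match = FULL_TITLE_PATTERN.match(title)
    if match is None:
        return None
    age = match.group("age") or match.group("age_in")
    return {
        "age": int(age) if age else None,
        "user_gender": match.group("user").lower(),
        "target_gender": match.group("target").lower(),
        "short_title": match.group("rest").strip(),
    }

print(parse_title("20 [F4F] Manila"))
print(parse_title("[26M4F] Dominate me."))
print(parse_title("Muff dive"))
```

Because it rejects titles without a recognizable tag, this stricter variant would leave more rows unparsed than the separate `extract_*` functions, which is the trade-off for getting a trimmed short title on every parsed row.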

In [6]:
reddit_df.head(5)

Unnamed: 0,id,title,selftext,num_comments,score,created_utc,main_title,user_age,user_gender,target_gender
0,5lghbd,20 [F4F] Manila,Movie? Coffee?,1,8,2017-01-02 02:52:40,20 [F4F] Manila,20.0,f,f
1,5le6zf,25 [M4R] - Let's start the new year right and ...,Hi!\nI want to start the new year right! I wan...,2,7,2017-01-01 15:45:41,25 [M4R] - Let's start the new year right and ...,25.0,m,r
2,5le8pd,Muff dive,[deleted],0,7,2017-01-01 16:01:21,Muff dive,,,
3,5lgv46,20 [M4A] looking for minecraft players,Looking for players around bulacan for a LAN m...,0,6,2017-01-02 04:07:05,20 [M4A] looking for minecraft players,20.0,m,a
4,5li0y4,[26M4F] Dominate me.,I want to be your servant. Your dog. Your dick...,0,5,2017-01-02 07:52:15,Dominate me.,26.0,m,f


In [7]:
reddit_df.isnull().sum()

id                  0
title               0
selftext         1748
num_comments        0
score               0
created_utc         0
main_title         73
user_age         6737
user_gender      3374
target_gender    3374
dtype: int64

For the missing values we do the following:
- Set null `user_gender` and `target_gender` values to `u`, which represents unknown.
- Set missing `user_age` values to 0.
- Set null `selftext` and `main_title` values to an empty string.

In [8]:
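# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update reddit_df in recent pandas versions.
reddit_df['selftext'] = reddit_df['selftext'].fillna("")
reddit_df['main_title'] = reddit_df['main_title'].fillna("")
reddit_df['user_age'] = reddit_df['user_age'].fillna(0)
reddit_df['user_gender'] = reddit_df['user_gender'].fillna('u')
reddit_df['target_gender'] = reddit_df['target_gender'].fillna('u')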
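
The same defaults can also be applied in one call by passing a column-to-value dict to `DataFrame.fillna`. The sketch below runs on a small made-up frame (`toy_df` and its rows are assumptions for illustration, not rows from `phr4r.csv`) so that it is self-contained.

```python
import pandas as pd

# Toy frame with the same column names; the second row is entirely missing.
toy_df = pd.DataFrame({
    "selftext": ["Movie? Coffee?", None],
    "main_title": ["Manila", None],
    "user_age": [20.0, None],
    "user_gender": ["f", None],
    "target_gender": ["f", None],
})

# One default per column, matching the rules listed above.
defaults = {
    "selftext": "",
    "main_title": "",
    "user_age": 0,
    "user_gender": "u",
    "target_gender": "u",
}

filled_df = toy_df.fillna(value=defaults)
print(filled_df.isnull().sum())
```

Note that a `user_age` of 0 is a sentinel, so any later statistic on age (such as a mean) should exclude those rows.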

In [9]:
reddit_df.isnull().sum()

id               0
title            0
selftext         0
num_comments     0
score            0
created_utc      0
main_title       0
user_age         0
user_gender      0
target_gender    0
dtype: int64

Check the unique gender values. The codes have the following meanings:
- f - female
- m - male 
- u - unknown
- d - drinks/date
- r - redditor 
- t - transgender 
- c - couple 
- a - any 
- h - hire 
- w - woman
- b - both/bi
- v - vapers

In [10]:
print("user_gender unique values:", reddit_df['user_gender'].unique())
print("target_gender unique values:", reddit_df['target_gender'].unique())

user_gender unique values: ['f' 'm' 'u' '2' 'd' 'r' 't' '0' 'c' 'w' 'l' '3' 'g' '1' 'e' 'a' '9' 'x'
 's' 'i' '5' 'j' 'h' 'b' '7' 'n' 'o' '6' 'y' '8']
target_gender unique values: ['f' 'r' 'u' 'a' 'm' 'd' 'h' 't' 'w' '2' 'n' 'c' 'y' 'g' 'p' 'l' 'b' 'x'
 'q' 'e' 'k' 'i' '8' '6' 'μ' '3' 's' 'v' '0' '_']


The gender values are cleaned as follows:
- Replace `0`, `2`, `l`, and similar stray characters with `u`. These come from titles that do not follow the usual format.
- Replace `w` with `f`, since both mean female.
- Replace redditor, drinks/date, hire, both/bi, and vapers with `a` (any).

The target gender for drinks/date posts could arguably be imputed from the user gender (for example, a male poster is more likely to be looking for a female), but to stay gender-sensitive we map it to `a` instead.

In [11]:
reddit_df[reddit_df['target_gender'] == '_']

Unnamed: 0,id,title,selftext,num_comments,score,created_utc,main_title,user_age,user_gender,target_gender
83559,e8tt0x,23 [F4_] I avoided people because it helps and...,I almost broke down at work when I heard I was...,5,1,2019-12-11 01:37:51,23 [F4_] I avoided people because it helps and...,23.0,f,_


In [12]:
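# Characters that come from non-standard title formats carry no gender information.
unknown_codes = ['μ', '0', '1', '2', '3', '5', '6', '7', '8', '9',
                 'g', 'i', 'j', 'k', 'l', 'n', 'o', 'p', 'q', 's', 'x', 'y']
# Codes that do not name a specific gender are folded into `a` (any).
any_codes = ['b', 'e', 'r', 'd', 'h', 'v', '_']

gender_map = {code: 'u' for code in unknown_codes}
gender_map.update({code: 'a' for code in any_codes})
gender_map['w'] = 'f'  # woman -> female

# Assign the result back rather than using replace(inplace=True) on a column selection.
reddit_df['user_gender'] = reddit_df['user_gender'].replace(gender_map)
reddit_df['target_gender'] = reddit_df['target_gender'].replace(gender_map)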
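
The cleaning rules can also be written as a small function with a fallback, shown here as a sketch. `normalize_gender`, `KEPT_CODES`, and `ANY_CODES` are hypothetical names, and the behavior differs from `gender_map` in one respect: any code not explicitly listed falls back to `u`, whereas the map only rewrites the characters it enumerates.

```python
# Codes kept as they are.
KEPT_CODES = {"f", "m", "t", "c", "a", "u"}
# Codes that do not name a specific gender and are folded into "a" (any).
ANY_CODES = {"b", "e", "r", "d", "h", "v", "_"}

def normalize_gender(code):
    """Map a raw single-character gender code to one of f, m, t, c, a, u."""
    if code in KEPT_CODES:
        return code
    if code == "w":          # woman -> female
        return "f"
    if code in ANY_CODES:
        return "a"
    return "u"               # assumption: anything unrecognized is unknown

for raw in ["f", "w", "d", "_", "2", "μ"]:
    print(raw, "->", normalize_gender(raw))
```

The fallback keeps the output limited to the six final codes even if new stray characters appear in later data, at the cost of silently hiding codes that might deserve their own category.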

In [13]:
print("user_gender unique values:", reddit_df['user_gender'].unique())
print("target_gender unique values:", reddit_df['target_gender'].unique())

user_gender unique values: ['f' 'm' 'u' 'a' 't' 'c']
target_gender unique values: ['f' 'a' 'u' 'm' 't' 'c']


Save the cleaned data for use in topic modeling and EDA.

In [16]:
reddit_df.to_csv("phr4r_cleaned.csv", index=False)