# Imports and Functions  

In [1]:
import pandas as pd
from datetime import datetime, timedelta

def change_datetime_zone(raw_datetime):
    # Parse the UTC timestamp string and shift it by +3 hours (GMT+3)
    utc_datetime = datetime.strptime(raw_datetime, "%Y-%m-%d %H:%M:%S")
    hour_diff = timedelta(hours=3)
    return utc_datetime + hour_diff

def raw_date_to_datetime(raw_date):
    datetime_dict = dict()  # Values for the new columns
    datetime_dict["Date"] = raw_date  # Keep the full datetime object
    datetime_dict["DayName"] = raw_date.strftime("%A")  # Get the full day name
    return datetime_dict 

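def is_subscribed_to(subreddit_name, subscribed_subs_df):
    # Compare against the subreddit names only, so a matching flair or tag cannot cause a false positive
    if (subscribed_subs_df["subreddit"] == subreddit_name).any():
        return 1
    return 0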

# Data Collecting

I will read the data from the .csv files, process each dataset accordingly, and save the results as final .csv files.

## Different Data
* Subscribed Subreddits
* IP Logs
* Post, Comment Votes
* Posts, Comments

In [2]:
# Data paths
data_path = "./data/scrapped_data/"
backup_data_path = "./data/raw_data/"
save_path = "./data/final_data/"

## Subscribed Subreddits

This data holds the subreddits that I am _currently_ subscribed to. It does not hold a history or any temporal data.  

Reddit does not provide a direct way to _categorize_ subreddits with a tag or topic system; the closest thing is each subreddit's list of _"flairs"_, which label posts within that subreddit. A tag system is useful for analysing them, so I annotated the subreddits through a Google Form and used those annotations as categories. More details can be found in the [README](README.md) file.  

This data file also includes followed users (not subs), which are marked with a 'u_' prefix in the data. They are filtered out of the subreddit list but kept separately, since they might be useful in the future.
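
The prefix-based filtering described above can be sketched in plain Python. This is a minimal illustration, not the notebook's own code: the helper name `split_subreddits_and_users` and the sample names are assumptions made up for the example. It matches the prefix only, because a check for "u_" anywhere in the name would also catch ordinary subreddits whose names merely contain that substring.

```python
def split_subreddits_and_users(names):
    """Separate subreddit names from followed-user entries (prefixed with 'u_')."""
    subreddits = [name for name in names if not name.startswith("u_")]
    users = [name for name in names if name.startswith("u_")]
    return subreddits, users


# Hypothetical sample: "menu_ideas" is a made-up name that merely contains "u_".
sample = ["startrek", "u_some_user", "menu_ideas"]
subs, users = split_subreddits_and_users(sample)
print(subs)   # ['startrek', 'menu_ideas']
print(users)  # ['u_some_user']
```

In pandas, the same prefix test is available as `Series.str.startswith`.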

In [3]:
# Read the file, sort by name of the subreddits and reset the index after sorting
sub_fname = "subscribed_subreddits.csv"
subreddits_df = pd.read_csv(data_path + sub_fname).sort_values(by="subreddit").reset_index(drop=True)

# Followed users carry a "u_" prefix; match the prefix only, not "u_" anywhere in the name
followed_users_df = subreddits_df[subreddits_df["subreddit"].str.startswith("u_")]
subreddits_df = subreddits_df[~subreddits_df["subreddit"].str.startswith("u_")]
subreddits_df.to_csv(save_path + sub_fname, index=False)
subreddits_df

Unnamed: 0,subreddit,Flairs,Tags
0,AskScienceFiction,,"hobby, information/question"
1,CodeBullet,"meme, question for codebullet, video idea, oth...","fandom, meme, programming"
2,DaystromInstitute,,"movie/show, fandom"
3,Deepspaceninememes,"original content [oc], shitpost","meme, movie/show, fandom"
4,ElectroBOOM,"faf - rectify, electroboom question, non-elect...","fandom, hobby, meme"
5,ExposurePorn,,hobby
6,FATErpg,,"information/question, hobby"
7,GeekyaparLamers,,"fandom, meme"
8,GreekMythology,"discussion, question, art, culture, history, i...","hobby, information/question"
9,GuessTheMovie,,movie/show


## IP Logs

The IP logs hold information about my logins to Reddit: the datetime of each login and the IP address that I used. This data can show when I was active, even though it does not record how long I stayed active.  

The dates are in the form "yyyy-mm-dd hh:mm:ss UTC". I will convert the times to GMT+3 and add the name of the day for each login.
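
As a rough sketch of this conversion, the example below uses only the standard library and a fixed +3 hour offset. The names `LOCAL_TZ` and `utc_string_to_local` are hypothetical, and the sketch assumes that GMT+3 has no daylight saving changes over the period covered by the logs.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed GMT+3 offset with no daylight saving time.
LOCAL_TZ = timezone(timedelta(hours=3))


def utc_string_to_local(raw):
    """Parse 'yyyy-mm-dd hh:mm:ss UTC' and return (naive local datetime, day name)."""
    parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S UTC")
    local = parsed.replace(tzinfo=timezone.utc).astimezone(LOCAL_TZ)
    return local.replace(tzinfo=None), local.strftime("%A")


# A login late in the evening (UTC) falls on the next day in GMT+3.
local_dt, day_name = utc_string_to_local("2023-09-07 22:30:00 UTC")
print(local_dt, day_name)  # 2023-09-08 01:30:00 Friday
```

The day name has to be computed after the shift, since a late-evening UTC login can land on the following local day. For whole columns, pandas offers `Series.dt.tz_convert` and `Series.dt.day_name()` for the same purpose.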

In [4]:
# Read the file, drop the first row (it holds only the registration IP), drop the IP column, and reset the index
logs_fname = "ip_logs.csv"
login_datetime_df = pd.read_csv(backup_data_path + logs_fname).rename(columns={"date": "RawDate"}).drop(index=0, columns="ip").reset_index().drop(columns="index")

try:
    raw_date_col = login_datetime_df["RawDate"]  # Raw Date column

    # Add new columns
    login_datetime_df[["Date", "DayName"]] = None
    
    for idx in range(len(raw_date_col)):
        raw_date = raw_date_col.iloc[idx].replace(" UTC", "")  # Strip the " UTC" suffix from the raw timestamp
    
        # Convert datetime to local time zone
        local_datetime = change_datetime_zone(raw_date)
        datetime_dict = raw_date_to_datetime(local_datetime)

        # Insert items from datetime_dict to the new columns
        for key in datetime_dict:
            login_datetime_df.loc[idx, key] = datetime_dict[key]

    login_datetime_df = login_datetime_df.drop(columns="RawDate")
except KeyError:
    pass

login_datetime_df["Date"] = pd.to_datetime(login_datetime_df["Date"])
login_datetime_df.to_csv(save_path + logs_fname, index=False)
login_datetime_df

Unnamed: 0,Date,DayName
0,2023-09-07 18:45:18,Thursday
1,2023-09-07 21:58:39,Thursday
2,2023-09-08 11:26:37,Friday
3,2023-09-08 13:01:42,Friday
4,2023-09-09 14:30:51,Saturday
...,...,...
487,2023-12-15 13:23:58,Friday
488,2023-12-15 15:20:35,Friday
489,2023-12-15 15:30:10,Friday
490,2023-12-15 16:30:27,Friday


## Post Votes

This data includes the posts that I have voted on. It includes an ID, the post link, and the type of the vote; through the Reddit API it also includes the total score (net number of upvotes) and the flair. It holds no temporal data.  

I will extract the subreddit name from the URL, keep my vote, and compare the sub against the subscribed subs data to determine whether or not I am currently subscribed to that subreddit.  
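
One possible way to pull the subreddit name out of a permalink is a regular expression anchored on `/r/`. This is a sketch under assumptions: `SUBREDDIT_PATTERN` and `subreddit_from_permalink` are illustrative names, and the URLs are invented examples rather than rows from the data.

```python
import re

# Capture everything between "/r/" and the next slash.
SUBREDDIT_PATTERN = re.compile(r"/r/([^/]+)")


def subreddit_from_permalink(permalink):
    """Return the subreddit name found in a permalink, or None if there is none."""
    match = SUBREDDIT_PATTERN.search(permalink)
    return match.group(1) if match else None


# Hypothetical URLs, used for illustration only.
example = "https://www.reddit.com/r/startrek/comments/abc123/some_title/"
print(subreddit_from_permalink(example))  # startrek
print(subreddit_from_permalink("https://www.reddit.com/user/someone/"))  # None
```

Requiring the leading slash means a path segment such as `user/` is not mistaken for the `r/` marker, and a link without a subreddit yields `None` instead of an arbitrary slice of the URL.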

Also, note that some of the posts are inaccessible for various reasons (for example, the sub is closed), which prevents data collection through the Reddit API; therefore, there are some missing values in the data.

In [5]:
# Read the file, rename the vote direction column and drop the id column.
post_votes_fname = "post_votes.csv"
post_votes_df = pd.read_csv(data_path + post_votes_fname).rename(columns={"direction": "MyVote", "Upvotes": "UpvoteCount", "Downvotes": "DownvoteCount"}).drop(columns="id")

# Convert vote direction to a quantitative value
post_votes_df["MyVote"] = post_votes_df["MyVote"].map({"up": 1, "down": -1})

# Add the new columns
post_votes_df[["SubredditName", "IsSubscribed"]] = None

for idx in range(len(post_votes_df)):
        
    # Get the sub name from the link
    permalink = post_votes_df.loc[idx, "permalink"]
    start_idx = permalink.find("r/") + 2
    stop_idx = permalink.find("/", start_idx)
    sub_name = permalink[start_idx:stop_idx]
    
    # Add sub name to the corresponding place
    post_votes_df.loc[idx, "SubredditName"] = sub_name

    # Check if the sub is subscribed
    post_votes_df.loc[idx, "IsSubscribed"] = is_subscribed_to(sub_name, subreddits_df)

# Drop the permalink column
post_votes_df = post_votes_df.drop(columns="permalink")
# Specify the Dtypes for later use
post_votes_df["IsSubscribed"] = post_votes_df["IsSubscribed"].astype(dtype="bool")

post_votes_df.to_csv(save_path + post_votes_fname, index=False)
post_votes_df

Unnamed: 0,MyVote,Scores,SubredditName,IsSubscribed
0,1.0,,unexpectedMontyPython,True
1,1.0,195.0,ProgrammerHumor,True
2,1.0,11075.0,TheLastAirbender,True
3,1.0,145.0,risa,True
4,1.0,,unexpectedMontyPython,True
...,...,...,...,...
1616,1.0,20344.0,TheLastAirbender,True
1617,1.0,8737.0,gaming,True
1618,,8.0,ProgrammerHumor,True
1619,1.0,6.0,seinfeld,True


## Comment Votes  

The comment votes data is almost identical to the post votes data, except that it covers the comments that I have voted on instead of posts. Through the Reddit API it also holds the score (net number of upvotes).

I will perform the same cleaning as for the post votes data: remove the ID, extract the subreddit name from the URL, keep my vote, and compare the sub against the subscribed subs data to determine whether or not I am currently subscribed to that subreddit.  
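
The conversion of the vote direction into a number can be illustrated with a plain dictionary lookup. The names `VOTE_VALUES` and `vote_to_number` are hypothetical, and `"none"` stands in for any direction other than up or down; whether such a value appears in the raw file is an assumption.

```python
# Mapping from vote direction to a numeric value.
VOTE_VALUES = {"up": 1, "down": -1}


def vote_to_number(direction):
    """Return 1 for 'up', -1 for 'down', and None for any other value."""
    return VOTE_VALUES.get(direction)


# "none" is a placeholder for an unmapped direction (assumption).
print([vote_to_number(d) for d in ["up", "down", "none"]])  # [1, -1, None]
```

pandas' `Series.map` with a dictionary behaves similarly: values that are not keys of the dictionary become `NaN`, which turns the whole column into floats.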

Also, note that some of the comments or their posts are inaccessible for various reasons, which prevents data collection through the Reddit API; therefore, there are some missing values in the data.

In [6]:
# Read the file, rename the vote direction column and drop the id column.
comment_votes_fname = "comment_votes.csv"
comment_votes_df = pd.read_csv(data_path + comment_votes_fname).rename(columns={"direction": "MyVote"}).drop(columns="id")

# Convert vote direction to a quantitative value
comment_votes_df["MyVote"] = comment_votes_df["MyVote"].map({"up": 1, "down": -1})

# Add the new columns
comment_votes_df[["SubredditName", "IsSubscribed"]] = None

for idx in range(len(comment_votes_df)):
        
    # Get the sub name from the link
    permalink = comment_votes_df.loc[idx, "permalink"]
    start_idx = permalink.find("r/") + 2
    stop_idx = permalink.find("/", start_idx)
    sub_name = permalink[start_idx:stop_idx]

    # Add sub name to the corresponding place
    comment_votes_df.loc[idx, "SubredditName"] = sub_name

    # Check if the sub is subscribed
    comment_votes_df.loc[idx, "IsSubscribed"] = is_subscribed_to(sub_name, subreddits_df)

# Drop the permalink column
comment_votes_df = comment_votes_df.drop(columns="permalink")
# Specify the Dtypes for later use
comment_votes_df["IsSubscribed"] = comment_votes_df["IsSubscribed"].astype(dtype="bool")

comment_votes_df.to_csv(save_path + comment_votes_fname, index=False)
comment_votes_df

Unnamed: 0,MyVote,Score,SubredditName,IsSubscribed
0,1.0,2.0,GenP,False
1,1.0,1.0,flashcarts,False
2,1.0,1182.0,ProgrammerHumor,True
3,1.0,197.0,startrek,True
4,1.0,3.0,startrek,True
...,...,...,...,...
496,1.0,562.0,TheLastAirbender,True
497,1.0,1134.0,ProgrammerHumor,True
498,1.0,286.0,TheLastAirbender,True
499,1.0,2.0,TheLastAirbender,True


## Posts  

The posts data is about the posts that I have created. It includes an ID, a permalink to the post, the posting date, the IP that I used, the name of the subreddit the post was submitted to, and gildings and url data. It also holds the score and the flair, obtained through the Reddit API.  

I will drop the permalink, IP, gildings, url, and the temporal data, and check whether I am subscribed to the sub I posted in. I will keep the IDs to compare with the comments data later on.
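
The subscription check can also be expressed as a set lookup, sketched below. The helper names `build_subscription_lookup` and `is_subscribed` are assumptions for illustration, and the short name list is a hand-picked sample. Building the set once avoids scanning the full subscription table for every post.

```python
def build_subscription_lookup(subscribed_names):
    """Collect the subscribed subreddit names into a set for fast membership tests."""
    return set(subscribed_names)


def is_subscribed(subreddit_name, lookup):
    """Return 1 if the name is in the lookup set, otherwise 0 (exact, case-sensitive match)."""
    return int(subreddit_name in lookup)


# Hand-picked sample of names for illustration.
lookup = build_subscription_lookup(["startrek", "consolerepair", "montypython"])
flags = {name: is_subscribed(name, lookup) for name in ["startrek", "webdev"]}
print(flags)  # {'startrek': 1, 'webdev': 0}
```

The match is exact and case-sensitive, so `"StarTrek"` would not match `"startrek"`. For a whole column, pandas provides `Series.isin` for this kind of membership test.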

Also, note that some of the posts are inaccessible for various reasons, which prevents data collection through the Reddit API; therefore, there are some missing values in the data.

In [7]:
# Read the file, rename the id and subreddit columns and drop the permalink, ip, gildings, url, and date columns.
posts_fname = "post_headers.csv"
posts_df = pd.read_csv(data_path + posts_fname).rename(columns={"id": "ID", "subreddit": "Subreddit"}).drop(columns=["permalink", "ip", "gildings", "url", "date"])

# Add new columns
posts_df["IsSubscribed"] = posts_df["Subreddit"].map(lambda x: is_subscribed_to(x, subreddits_df))

posts_df.to_csv(save_path + posts_fname, index=False)
posts_df

Unnamed: 0,ID,Subreddit,Scores,IsSubscribed
0,v7jv2a,consolerepair,3,1
1,m3jkjt,NintendoDSi,2,0
2,10x881w,startrek,3,1
3,126w0at,webdev,1,0
4,r84lhi,flashcarts,2,0
5,126w1tz,webdev,1,0
6,15rhuzw,montypython,60,1
7,16z0xzh,TheLastAirbender,12,1
8,10vxs5t,startrek,10,1
9,zx3wgi,consolerepair,2,1


## Comments  

Similar to the posts data, the comments data includes an ID, a permalink to the comment, the comment date, the IP that I used, the name of the subreddit where the commented post was submitted, gildings, and the net score obtained through the Reddit API. Unlike the posts data, it has no url field, and it holds two extra pieces of information: a link to the parent object and, _if the parent was posted by me_, the ID of the parent.  

I will drop the permalink, IP, gildings, the date, and the parent ID. I will check whether I am subscribed to the sub I commented in, and whether I own the post.  
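
The ownership check relies on reading the post ID out of the comment's link and comparing it with the IDs of my own posts. A minimal sketch follows; `POST_ID_PATTERN`, `post_id_from_link`, and `is_post_owned` are hypothetical names, and the link is an invented example built around an ID of the kind found in the posts data.

```python
import re

# Capture the path segment that follows "/comments/".
POST_ID_PATTERN = re.compile(r"/comments/([^/]+)")


def post_id_from_link(link):
    """Return the post ID embedded in a link, or None if the link has no such segment."""
    match = POST_ID_PATTERN.search(link)
    return match.group(1) if match else None


def is_post_owned(link, own_post_ids):
    """Return 1 if the link points into one of the given posts, otherwise 0."""
    return int(post_id_from_link(link) in own_post_ids)


own_ids = {"10x881w", "v7jv2a"}  # sample set of own post IDs
# Hypothetical link, used for illustration only.
link = "https://www.reddit.com/r/startrek/comments/10x881w/some_title/"
print(post_id_from_link(link))       # 10x881w
print(is_post_owned(link, own_ids))  # 1
```

The extracted ID identifies the post that the comment thread belongs to, which is not necessarily the direct parent when the comment replies to another comment.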

Also, note that some of the comments or their posts are inaccessible for various reasons, which prevents data collection through the Reddit API; therefore, there are some missing values in the data.

In [8]:
# Read the file, rename the id and subreddit columns and drop the permalink, ip, gildings, date, and parent columns.
comments_fname = "comment_headers.csv"
comments_df = pd.read_csv(data_path + comments_fname).rename(columns={"id": "ID", "subreddit": "Subreddit"}).drop(columns=["permalink", "ip", "gildings", "date", "parent"])

# Add new columns
comments_df[["IsSubscribed", "IsPostOwned"]] = None
    
for idx in range(comments_df.shape[0]):
    comments_dict = dict()
    # Check if subscribed
    comments_dict["IsSubscribed"] = is_subscribed_to(comments_df.loc[idx, "Subreddit"], subreddits_df)

    # Get the post id from the link; note that this is not the parent id when the comment is a reply to another comment
    post_link = comments_df.loc[idx, "link"]
    post_id_idx_start = post_link.find("comments/") + 9
    post_id_idx_end = post_link.find("/", post_id_idx_start)
    post_id = post_link[post_id_idx_start:post_id_idx_end]
    # Check if the post is owned by me
    comments_dict["IsPostOwned"] = int((posts_df == post_id).any().any())

    # Insert items from comments_dict to the new columns
    for key in comments_dict:
        comments_df.loc[idx, key] = comments_dict[key]

comments_df = comments_df.drop(columns=["link"])

comments_df.to_csv(save_path + comments_fname, index=False)
comments_df

Unnamed: 0,ID,Subreddit,Score,IsSubscribed,IsPostOwned
0,j7jy563,startrek,7,1,1
1,j7k0fm0,startrek,4,1,0
2,jo77dm4,veYakinEvren,1,1,0
3,jr1sd8q,veYakinEvren,2,1,0
4,j7yb85s,gaming,1,1,0
...,...,...,...,...,...
173,jxz1l2k,yugioh,-1,1,0
174,k97l0xc,TheLastAirbender,2,1,0
175,k5kx721,mathmemes,1,1,0
176,k40vri7,veYakinEvren,1,1,0
