<img src="https://i.pngimg.me/thumb/f/720/comvecteezy492057.jpg" style="float: left; margin: 20px; height: 85px">

# Classification Project
## Relationship Status Group Prediction

--- 
# Part 1 - Scraping Data

---

# Project Goal

<span style="font-size: 16px;"> 
    
The objective of this project is to scrape online posts and classify them as coming from a married or a dating group. The classifier is intended to support a psychological and emotional support app, so that the app can offer resources tailored to each group. <br><br>
    Using natural language processing (NLP) techniques and machine learning algorithms, I aim to build a classification system that accurately categorizes online posts, so that the app can offer personalized assistance to people navigating romantic relationships.

# Packages Import and Data Access Set Up

In [1]:
import sys  
sys.path.insert(1, '../../project-3')

import praw
import pandas as pd
import os
from datetime import datetime
from credentials import gf_id, gf_secret, gf_agent, gf_username, gf_password

In [2]:
# accessing reddit through PRAW with the required credentials
reddit = praw.Reddit(
    client_id=gf_id,
    client_secret=gf_secret,
    user_agent=gf_agent,
    username=gf_username,
    password=gf_password,
)



# Scraping subreddits

- <span style="font-size: 16px;"> Getting the controversial posts and the top posts of the past year, month and week
- <span style="font-size: 16px;"> Getting the title and text of each post and its subreddit, which is the target of this project <br>  
- <span style="font-size: 16px;"> Storing the extracted data in one dataframe per subreddit and saving it to a csv file
- <span style="font-size: 16px;"> Combining the subreddits' dataframes into one dataframe and exporting it to csv

In [3]:
# function that filters the posts to get only the title, self text and subreddit
def combine_data(posts):
    data = []
    for post in posts:
        data.append([post.title, post.selftext, post.subreddit])
    return data
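
To see what `combine_data` returns without calling the Reddit API, here is a minimal sketch that feeds it stand-in objects. The `SimpleNamespace` posts and their values are made up for illustration; the only assumption is that each post exposes the `title`, `selftext` and `subreddit` attributes used above.

```python
from types import SimpleNamespace

import pandas as pd


def combine_data(posts):
    # same logic as the notebook function: keep title, self text and subreddit
    data = []
    for post in posts:
        data.append([post.title, post.selftext, post.subreddit])
    return data


# hypothetical stand-ins for scraped posts (values invented for illustration)
fake_posts = [
    SimpleNamespace(title="First date tips", selftext="Any advice?", subreddit="dating"),
    SimpleNamespace(title="Ten years in", selftext="", subreddit="Marriage"),
]

rows = combine_data(fake_posts)
df = pd.DataFrame(rows, columns=["title", "self_text", "subreddit"])
print(df)
```

Each inner list becomes one row of the dataframe, in the same column order used in the scraping function.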

In [4]:
def subreddits_scraping(subreddits, folder):

    # name of the combined output file, e.g. 'marriage_dating'
    doc = '_'.join(subreddits)

    for sub in subreddits:
        # accessing the subreddit
        subreddit = reddit.subreddit(sub)

        # getting up to 1000 controversial posts and up to 1000 top posts of the past year, month and week
        posts_con = subreddit.controversial(limit=1000)
        posts_top_year = subreddit.top(limit=1000, time_filter="year")
        posts_top_month = subreddit.top(limit=1000, time_filter="month")
        posts_top_week = subreddit.top(limit=1000, time_filter="week")

        # keeping only the title, self text and subreddit of each post
        data_con = combine_data(posts_con)
        data_top_year = combine_data(posts_top_year)
        data_top_month = combine_data(posts_top_month)
        data_top_week = combine_data(posts_top_week)

        # combining the extracted information of all the posts into a dataframe
        df = pd.DataFrame(data_con + data_top_year + data_top_month + data_top_week, columns=['title', 'self_text', 'subreddit'])
        df = df.drop_duplicates()

        # getting the extraction time (day-month-hour-minute)
        time_now = datetime.now().strftime('%d-%m-%H-%M')

        # exporting the dataframe to a csv file named after the subreddit and the extraction time
        df.to_csv(os.path.join(folder, sub + time_now + '.csv'), index=False, encoding='utf-8')

    # listing the csv files only after the new scrapes are written, so that this run is included
    # (filtering on the extension also skips folders such as .ipynb_checkpoints)
    files = [file for file in os.listdir(folder) if file.endswith('.csv')]

    # concatenating the dataframes from all extraction times and removing repeated posts
    subs = pd.concat([pd.read_csv(os.path.join(folder, file)) for file in files])
    subs = subs[['title', 'self_text', 'subreddit']]
    subs = subs.drop_duplicates()
    subs.to_csv(os.path.join(folder, doc + '.csv'), index=False, encoding='utf-8')

In [5]:
# scraping the subreddits and combining all posts in one csv file in scrapes folder
subreddits_scraping(['marriage', 'dating'],'../scrapes')
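
The combining step can be tried offline. The sketch below assumes a temporary folder and two small, made-up scrape files (the file names and rows are hypothetical). It concatenates every CSV in the folder and drops the posts that appear in more than one scrape, which is the same idea the function above relies on.

```python
import os
import tempfile

import pandas as pd

columns = ["title", "self_text", "subreddit"]

with tempfile.TemporaryDirectory() as folder:
    # two hypothetical scrapes of the same subreddit that share one post
    first = pd.DataFrame(
        [["Post A", "text a", "dating"], ["Post B", "text b", "dating"]],
        columns=columns,
    )
    second = pd.DataFrame(
        [["Post B", "text b", "dating"], ["Post C", "text c", "dating"]],
        columns=columns,
    )
    first.to_csv(os.path.join(folder, "dating01-01-10-00.csv"), index=False)
    second.to_csv(os.path.join(folder, "dating02-01-10-00.csv"), index=False)

    # read every csv in the folder, stack them and drop repeated posts
    files = sorted(f for f in os.listdir(folder) if f.endswith(".csv"))
    combined = pd.concat(
        [pd.read_csv(os.path.join(folder, f)) for f in files],
        ignore_index=True,
    ).drop_duplicates()

print(len(combined))  # 3 unique posts out of 4 scraped rows
```

Because overlapping time windows return many of the same posts, the deduplicated total is usually well below the sum of the individual scrapes.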

# Checking data

In [6]:
# reading the csv with all posts and subreddits to check nulls and proportion of subreddits
marriage_dating = pd.read_csv('../scrapes/marriage_dating.csv')
marriage_dating.head()

| | title | self_text | subreddit |
|---|---|---|---|
| 0 | What does it mean when a girl says I don’t fee... | I had a date yesterday and I thought it was ok... | dating |
| 1 | why do some guys get hurt/bothered if you slee... | My ex recently found out I hooked up with some... | dating |
| 2 | Worst era to ever date in as a guy. | Is there a single (attractive) young woman (ea... | dating |
| 3 | Why do some men hate "Vanilla women"? | Hiya. Earlier today I got called a "vanilla wo... | dating |
| 4 | Guys the Victim mentality WONT help you | I've been couple of years in this sub and that... | dating |


In [7]:
# checking size of dataframe with all posts
marriage_dating.shape

(13649, 3)

In [8]:
# checking data balance
marriage_dating['subreddit'].value_counts(normalize=True)

dating      0.570015
Marriage    0.429985
Name: subreddit, dtype: float64

In [9]:
# checking null values
marriage_dating.isna().sum()

title          1
self_text    401
subreddit      2
dtype: int64

In [10]:
# dropping null values
marriage_dating.dropna(inplace=True)
marriage_dating.isna().sum()

title        0
self_text    0
subreddit    0
dtype: int64

In [11]:
# checking data balance after removing null values
marriage_dating['subreddit'].value_counts(normalize=True)

dating      0.58697
Marriage    0.41303
Name: subreddit, dtype: float64
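
In the outputs above the dating share rises from about 0.57 to about 0.59 after `dropna`, which suggests that the dropped rows came disproportionately from the Marriage subreddit. A minimal sketch on a made-up frame (the rows are invented, not the scraped data) shows one way to check this: count the missing `self_text` values per subreddit and compare the balance before and after dropping them.

```python
import pandas as pd

# toy data: 3 dating posts and 3 Marriage posts, two of which have no self text
toy = pd.DataFrame({
    "title": ["t1", "t2", "t3", "t4", "t5", "t6"],
    "self_text": ["a", "b", "c", "d", None, None],
    "subreddit": ["dating", "dating", "dating", "Marriage", "Marriage", "Marriage"],
})

# number of missing self_text values in each subreddit
nulls_per_sub = toy["self_text"].isna().groupby(toy["subreddit"]).sum()

# class proportions before and after removing rows with nulls
balance = pd.DataFrame({
    "before": toy["subreddit"].value_counts(normalize=True),
    "after": toy.dropna()["subreddit"].value_counts(normalize=True),
})

print(nulls_per_sub)
print(balance)
```

Running the same two steps on `marriage_dating` before the `dropna` cell would show how the 401 missing `self_text` values are split between the two groups.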

# Exporting Data

In [12]:
# exporting data to csv without the row index, consistent with the scrape files
marriage_dating.to_csv('../output/marriage_dating.csv', index=False)
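
One detail worth checking when exporting is the row index. After `dropna` the index has gaps, and `to_csv` writes it as an extra unnamed column unless `index=False` is passed. A minimal sketch with a made-up frame and an in-memory buffer (both are illustrative assumptions) shows the difference when the file is read back:

```python
import io

import pandas as pd

# toy frame with a gap in the index, as left behind by dropna
toy = pd.DataFrame(
    {"title": ["t1", "t3"], "self_text": ["a", "c"], "subreddit": ["dating", "Marriage"]},
    index=[0, 2],
)

# default export: the index is written as a first, unnamed column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# export without the index: only the three data columns come back
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

Whichever form is used, the notebook that reads `../output/marriage_dating.csv` needs to match it, for example by passing `index_col=0` to `pd.read_csv` when the index was written.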