<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DSI-SG-42
## Project 3: Web APIs & NLP
> Authors: Pius Yee, Conrad Aw, Eugene Matthew Cheong
---
## Introduction

In today's fast-paced world, the role of a mother is more challenging and multifaceted than ever before. Mothers are not only caregivers but also significant contributors to the workforce. This is especially true in Singapore, where the workforce is the main driver of the economy. As the nation progresses, the competing demands of work and family life have placed immense pressure on working mothers, making it a topic of great importance and ongoing debate.

Recent discussions have highlighted the need for more support for working mothers, despite the government's efforts to improve support systems. The challenges faced by working mothers in Singapore are unique due to the high expectations of productivity and efficiency in both professional and personal spheres. These challenges often lead to a struggle to maintain a healthy work-life balance, which can impact their well-being and that of their families.

To better understand the difficulties faced by the current generation of new parents, particularly mothers, it is crucial to delve into their thoughts, concerns, and discussions. One effective way to do this is through web-scraping on anonymous platforms like Reddit, where many parents share their experiences, seek advice, and discuss various topics related to parenting and work-life balance.

By analyzing the data collected from Reddit, we can identify the "hot topics" that are most relevant and pressing for working mothers. This information can provide valuable insights into the specific challenges they face, the kind of support they need, and the solutions that could be implemented to improve their situation. Ultimately, this project aims to shed light on the experiences of working mothers in Singapore, contributing to a better understanding of their needs and paving the way for more targeted and effective support measures.

News Articles:
- [Channel News Asia - Working Mum challenges](https://www.channelnewsasia.com/commentary/working-mums-challenges-best-tips-juggle-home-work-family-1373851)

- [Nanyang Business School - Importance of work-life balance for women](https://www.ntu.edu.sg/business/news-events/news/story-detail/what-work-life-balance-means-for-women)

- [Straits Times - Subsidies for non-working mothers](https://www.straitstimes.com/singapore/smart-investment-to-give-lower-income-non-working-mums-more-childcare-subsidies)

- [Channel News Asia - More support for middle class working mothers](https://www.channelnewsasia.com/singapore/budget-2023-debate-working-mothers-middle-class-cpf-ceiling-jobs-skills-integrators-3295851)

- [New York Times - What government and employers can do to support mothers](https://www.nytimes.com/2021/02/04/parenting/government-employer-support-moms.html)

## Persona

This is a Ministry of Social and Family Development (MSF) meeting. The Data Science team is presenting a structured, data-driven approach to keyword research that enhances digital campaigning efforts and improves visibility and engagement. The target audience is the Marketing Strategists and Campaign Strategists.

## Problem Statement

The current generation is highly engaged on social media and forums, particularly in discussions about work-life balance and childcare. Our Campaign Strategists struggle to identify and categorize the relevant topics consistently, and currently rely on personal assumptions rather than empirical evidence. This hampers the effectiveness of our campaigns and programs, limiting our ability to optimize our online presence and reach our target audience. To address this challenge, we propose developing a classification model that accurately classifies posts and forums relevant to mothers, thereby enhancing our campaign targeting and outreach strategies.

## Table of Contents ##

### 1.0 Data Collection ###

[1.1 Import Packages](#1.1-import-packages)

[1.2 Webscraping](#1.2-webscraping)

[1.3 Selection of subreddits](#1.3-selection-of-subreddits)

[1.4 Code for text scraping](#1.4-code-for-text-scraping)

### [2. Data Cleaning, Preprocessing and EDA](./2.0%20Data%20Cleaning%20&%20Preprocessing.ipynb)

[2.1 Import CSV file](./2.0%20Data%20Cleaning%20&%20Preprocessing.ipynb)

[2.2 Data cleaning](./2.0%20Data%20Cleaning%20&%20Preprocessing.ipynb)

[2.3 Preprocessing](./2.0%20Data%20Cleaning%20&%20Preprocessing.ipynb)

[2.4 Exploratory Data Analysis (EDA)](./2.0%20Data%20Cleaning%20&%20Preprocessing.ipynb)

### [3. Modelling, Evaluation and Tuning](./3.0%20Modelling.ipynb)

[3.1 Import CSV file](./3.0%20Modelling.ipynb)

[3.2 Splitting of train/test data](./3.0%20Modelling.ipynb)

[3.3 Modelling](./3.0%20Modelling.ipynb)

[3.4 Model Selection](./3.0%20Modelling.ipynb)

[3.5 AUC-ROC](./3.0%20Modelling.ipynb)

[3.6 Save it as Pickle](./3.0%20Modelling.ipynb)

[3.7 Recommendations](./3.0%20Modelling.ipynb)

[3.8 Conclusion](./3.0%20Modelling.ipynb)

### [4. Model Testing](./4.0%20Model%20Test.ipynb)

[4.1 Scraping Input Reddit](./4.0%20Model%20Test.ipynb)

[4.2 Cleaning Data](./4.0%20Model%20Test.ipynb)

[4.3 Tokenizing with Regex](./4.0%20Model%20Test.ipynb)

[4.4 Stop Word Removal](./4.0%20Model%20Test.ipynb)

[4.5 Lemmatization](./4.0%20Model%20Test.ipynb)

[4.6 Run the model](./4.0%20Model%20Test.ipynb)

---

## 1.0 Data Collection

### 1.1 Import packages

In [1]:
import pandas as pd
import praw
import time

### 1.2 Webscraping

As per the project requirements, data will be collected from two subreddits as part of the solution. We use the official Reddit API through the `praw` package, which must be initialized with API credentials before any data is collected.

In [2]:
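import os

# Initialize a Reddit instance with your API credentials.
# The credentials are read from environment variables so that they are not stored in the notebook.
# The variable names REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are an assumed convention; set them before running.
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='eumattbro'
)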

### 1.3 Selection of subreddits

Characteristics considered during the selection process:

1) **Relevance**: The discussions and subjects in the subreddit must be relevant to mothers and fathers, as they are the main members of our target group.

2) **Activity & Engagement Level**: The subreddit needs to be active and engaging, judged by the number of posts and comments, as higher activity levels may provide more interesting data for analysis.

3) **Size**: The member count of the subreddit is important, as larger subreddits may have more diverse content and discussions while smaller subreddits may have more focused discussions on specific topics.

4) **Moderation**: Strong moderation is preferred, as it means fewer spam or irrelevant posts.

5) **Quality of Content**: Posts and comments should be detailed and informative.

6) **Variety**: A mix of subreddits is preferred, to get a diverse range of content and perspectives.


Based on the above characteristics, we identified **r/daddit** and **r/Mommit** for scraping. One of the main reasons was that both subreddits showed more positive sentiment in our sentiment analysis. On further reading, we also noted that the discussions in each subreddit were relevant to its audience and that the two were easy to distinguish from each other, which is consistent with the sentiment analysis.
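The size and activity criteria above can be checked with a few numbers per candidate subreddit. The following is a minimal sketch, not the procedure used for this project: the helper `summarize_subreddit` is hypothetical, and it only assumes objects that expose `display_name`, `subscribers` and `num_comments`, as PRAW subreddit and submission objects do. The figures in the demo are invented stand-ins, not real Reddit data.

```python
from statistics import mean
from types import SimpleNamespace


def summarize_subreddit(subreddit, sample_posts):
    """Return size and engagement figures for one candidate subreddit."""
    comment_counts = [post.num_comments for post in sample_posts]
    return {
        "subreddit": subreddit.display_name,
        "subscribers": subreddit.subscribers,
        "posts_sampled": len(comment_counts),
        # Average number of comments per sampled post, as a rough engagement measure
        "mean_comments_per_post": mean(comment_counts) if comment_counts else 0.0,
    }


# Stand-in objects with invented numbers, used only to show the call shape.
demo_subreddit = SimpleNamespace(display_name="example_sub", subscribers=1200)
demo_posts = [SimpleNamespace(num_comments=n) for n in (4, 10, 1)]
demo_summary = summarize_subreddit(demo_subreddit, demo_posts)
print(demo_summary)
```

With the live client, the same function could presumably be called as `summarize_subreddit(reddit.subreddit("daddit"), list(reddit.subreddit("daddit").hot(limit=50)))`, where the sample size of 50 is an arbitrary choice.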


In [3]:
sub_reddits = ["daddit", "Mommit"]

### 1.4 Code for text scraping

The code below collects posts and their comments from **r/Mommit** and **r/daddit** and organizes them into a structured format suitable for analysis.

In [7]:
# Recursive function that takes a list of comments and a list to fill (named comment_dict).
# It collects the body text of every comment and reply; despite its name, it prints nothing.
def print_comments(comments, comment_dict):
    for comment in comments:
        # Pause for 0.1 seconds per comment to avoid triggering rate limits
        time.sleep(0.1)
        # Append the comment's body text to comment_dict
        comment_dict.append(comment.body)
        # If the comment has replies, recurse into them so that only body text is stored
        if len(comment.replies) > 0:
            print_comments(comment.replies, comment_dict)

# List that will hold one dictionary per submission
ext = []
# Iterate over the subreddit names in sub_reddits
for sub_reddit in sub_reddits:
    # For each subreddit, fetch up to 1000 hot submissions
    for submission in reddit.subreddit(sub_reddit).hot(limit=1000):
        # For each submission, create a dictionary containing the subreddit name, post title, self-text, score, and URL
        post_data = {
            "subreddit": sub_reddit,
            "title": submission.title,
            "selftext": submission.selftext,
            "score": submission.score,
            "url": submission.url,
        }

        # Replace the placeholder comments (used for loading additional comments) with actual comments
        submission.comments.replace_more(limit=None)
        post_data['comments'] = []
        # Collect the text of all comments and replies for this submission
        print_comments(submission.comments, post_data['comments'])

        ext.append(post_data)

# Create pandas dataframe
web_df = pd.DataFrame(ext)

# Save the collected data for the next notebook
web_df.to_csv('../datasets/daddit_mommit_df.csv')
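The recursive traversal in `print_comments` can also be written with an explicit stack, which avoids deep recursion on long reply chains. This is a minimal sketch under stated assumptions rather than a replacement for the cell above: `collect_comment_bodies` is a hypothetical helper, it assumes each comment exposes `body` and an iterable `replies` (as PRAW comments do once `replace_more(limit=None)` has run), and the demo tree is made of stand-in objects, not real Reddit data. PRAW's `submission.comments.list()` also returns a flattened list, though its ordering may differ.

```python
from types import SimpleNamespace


def collect_comment_bodies(top_level_comments):
    """Return the body text of every comment and reply, parents before children."""
    bodies = []
    # Reverse so that popping from the end of the stack yields the original order.
    stack = list(reversed(list(top_level_comments)))
    while stack:
        comment = stack.pop()
        bodies.append(comment.body)
        # Push replies in reverse so the first reply is processed next.
        stack.extend(reversed(list(comment.replies)))
    return bodies


# Stand-in comment tree, used only to show the traversal order.
reply = SimpleNamespace(body="reply to first", replies=[])
first = SimpleNamespace(body="first", replies=[reply])
second = SimpleNamespace(body="second", replies=[])
demo_bodies = collect_comment_bodies([first, second])
print(demo_bodies)
```

The output order matches the depth-first order of the recursive version, so each reply follows the comment it answers.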

#### Next Notebook: [2.0 Data Cleaning and Preprocessing](./2.0%20Data%20Cleaning%20&%20Preprocessing.ipynb)