## Mental Health Discussion Analyzer Web Scraper
#### Author
**Name:** Andres Figueroa  
**Email:** andresfigueroa@brandeis.edu

#### Project Description
The purpose of this notebook is to collect data by web scraping. I am working in an `.ipynb` file because this is my first time web scraping, and I like how Markdown makes the notes look nice.

---

#### Importing Libraries

In [16]:
import praw
import pandas as pd
from dotenv import load_dotenv
import os
from datetime import datetime

The `praw` library is the Python Reddit API Wrapper (PRAW). PRAW is a Python client for the Reddit API, which will allow me to access the data I want from Reddit.

**Note:** I could use `requests` and `BeautifulSoup` for raw HTML scraping, but the API seems easier to use.

---

### Understanding the Target Data (What We Are Looking For)

I was thinking about scraping data from the following Reddit communities:
- r/mentalhealth
- r/depression
- r/anxiety

**Note:** After viewing the number of followers and reading some of the posts, I have decided to scrape data only from `r/mentalhealth`, as its posts cover a variety of disorders and situations.

As for the data, I will collect:
- Post Body Text
- Timestamp
- Number of Upvotes
- Number of Comments

**Note:** From this data, I am hoping to better understand what people are talking about and whether conversations or posts are positive, negative, or neutral.

`r/mentalhealth` URL: https://www.reddit.com/r/mentalhealth/

---

#### Authenticating

In [17]:
load_dotenv()

print("Client ID from env:", os.getenv("REDDIT_CLIENT_ID"))

reddit = praw.Reddit(
    client_id = os.getenv("REDDIT_CLIENT_ID"),
    client_secret = os.getenv("REDDIT_CLIENT_SECRET"),
    user_agent = "MentalHealthAnalyzer by /u/Friendly-Sir-8457",
    username = os.getenv("REDDIT_USERNAME"),
    password = os.getenv("REDDIT_PASSWORD")
)

print("Logged in as:", reddit.user.me())

Client ID from env: TlXsBUQsFFuj9U9zXa4HEg
Logged in as: Friendly-Sir-8457


Here, I am loading environment variables with `load_dotenv()` and passing them to the Reddit API through PRAW. Then, I am testing whether the login information worked by printing `reddit.user.me()`, which shows the account I am logged in as. In short, this is the (rather annoying) authentication step.

**Note:** APIs are frustrating.
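
One way to make this step less painful is to check that every credential was actually loaded before handing it to PRAW. The sketch below assumes the same four variable names used in the cell above; `REQUIRED_VARS` and `missing_env_vars` are hypothetical names introduced only for this illustration. It reports which variables are missing without printing any of their values.

```python
import os

# Hypothetical helper for illustration; the variable names match the ones
# read with os.getenv() in the authentication cell.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    # Only the names are printed, never the secret values themselves.
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Reddit credentials are set.")
```

Running this right after `load_dotenv()` would point to a misnamed or missing entry in the `.env` file before the login attempt fails.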

---

#### Taking a Peek at the Scraped Data

In [18]:
subreddit = reddit.subreddit("mentalhealth")

posts = []
for post in subreddit.new(limit = 500):
    posts.append({
        "title": post.title,
        "score": post.score,
        "num_comments": post.num_comments,
        "created": post.created_utc,
        "selftext": post.selftext
    })

df = pd.DataFrame(posts)

print("Collected", len(df), "posts")
df.head()

Collected 500 posts


| | title | score | num_comments | created | selftext |
|---|---|---|---|---|---|
| 0 | I feel behind in life and it is making me depr... | 1 | 0 | 1757374000.0 | I just feel so behind in life. I keep wishing ... |
| 1 | I don't understand why I'm hurting | 1 | 0 | 1757374000.0 | I've been diagnosed by a psychiatrist for depr... |
| 2 | Struggling with Maladaptive Daydreaming: How D... | 1 | 0 | 1757374000.0 | Hi everyone,\nI’m reaching out because I’m str... |
| 3 | I hate my life | 2 | 2 | 1757374000.0 | I'm 30m. And I hate my life. I hate my job and... |
| 4 | Guys it's slowly but surely becoming unbearable | 1 | 1 | 1757374000.0 | I can't. I just can't. \n\nAt day, at work, wi... |


Here, I am taking a peek at the posts in the subreddit `r/mentalhealth`, sorted by 'New'. I am collecting each post's title, score, number of comments, creation time, and body text. The time looks funky, so I'll see how I can convert it into something usable. Also, I am a little unsure about how many posts I should record.

---

#### Cleaning Our DataFrame:

##### Dropping Posts With No Body Text

In [19]:
df = df.dropna(subset=["title", "selftext"])

##### Dropping Posts That Were Deleted or Removed

In [20]:
df = df[~df["selftext"].isin(["[deleted]", "[removed]"])]
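
One caveat about the two filters above: `dropna` only removes missing values (`NaN`/`None`), so a post whose body is an empty string would pass through it. As far as I can tell, PRAW returns an empty string rather than a missing value when a post has no body text, so this case may matter here. The sketch below is a plain-Python check that covers empty, whitespace-only, and placeholder bodies in one place; `PLACEHOLDERS` and `is_usable_body` are hypothetical names used only for this illustration.

```python
# Placeholder strings for deleted or removed bodies
# (the same two values filtered in the cell above).
PLACEHOLDERS = {"[deleted]", "[removed]"}


def is_usable_body(text):
    """Return True only for a non-empty string that is not a placeholder."""
    if not isinstance(text, str):
        return False  # covers None and NaN, which are not strings
    stripped = text.strip()
    return bool(stripped) and stripped not in PLACEHOLDERS


# Made-up example bodies, not taken from the scraped data.
for body in ["I need advice", "", "   ", "[removed]", None]:
    print(repr(body), "->", is_usable_body(body))
```

Under these assumptions, the check could be applied to the DataFrame with `df = df[df["selftext"].map(is_usable_body)]`.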

##### Converting the Post Creation Time

In [21]:
df["created"] = pd.to_datetime(df["created"], unit = "s")

**Note:** The numbers I was looking at (e.g. `1.757373e+09`) are Unix timestamps: the number of seconds that have elapsed since January 1st, 1970, at 00:00 UTC.
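
To see what a Unix timestamp means outside of pandas, here is a minimal sketch using only the standard library. The function name `unix_to_utc` is hypothetical, and the second input value is an illustrative timestamp chosen for this example (an assumption, since the table only shows rounded values).

```python
from datetime import datetime, timezone


def unix_to_utc(seconds):
    """Convert seconds since 1970-01-01 00:00:00 UTC to a timezone-aware datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(unix_to_utc(0))           # 1970-01-01 00:00:00+00:00 (the epoch itself)
print(unix_to_utc(1757374179))  # 2025-09-08 23:29:39+00:00 (illustrative value)
```

Note that `pd.to_datetime(..., unit="s")` produces the same wall-clock values but without the `+00:00` offset, so the `created` column should be read as UTC.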

##### Dropping Duplicates

In [22]:
df = df.drop_duplicates(subset = ["title", "selftext"])

##### Removing Any Extra Space

In [23]:
df["title"] = df["title"].str.strip()
df["selftext"] = df["selftext"].str.strip()
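
A possible refinement to the order of these last two steps: because duplicates are dropped before whitespace is stripped, two posts that differ only by leading or trailing spaces would both survive. The sketch below shows the idea in plain Python with made-up records; `strip_then_dedupe` is a hypothetical helper, not part of pandas or PRAW.

```python
def strip_then_dedupe(records, keys=("title", "selftext")):
    """Strip whitespace from the key fields, then keep the first record per key tuple."""
    seen = set()
    unique = []
    for record in records:
        # Copy the record with the key fields stripped.
        cleaned = {**record, **{key: record[key].strip() for key in keys}}
        fingerprint = tuple(cleaned[key] for key in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(cleaned)
    return unique


# Made-up sample: the first two records differ only by surrounding whitespace.
sample = [
    {"title": "Post A", "selftext": "Body text"},
    {"title": " Post A ", "selftext": "Body text\n"},
    {"title": "Post B", "selftext": "Different body"},
]
print(len(strip_then_dedupe(sample)))  # 2
```

In pandas, the same effect should follow from running the two `.str.strip()` lines before `drop_duplicates`.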

---

#### The Clean DataFrame

In [24]:
df.head()

| | title | score | num_comments | created | selftext |
|---|---|---|---|---|---|
| 0 | I feel behind in life and it is making me depr... | 1 | 0 | 2025-09-08 23:29:39 | I just feel so behind in life. I keep wishing ... |
| 1 | I don't understand why I'm hurting | 1 | 0 | 2025-09-08 23:27:41 | I've been diagnosed by a psychiatrist for depr... |
| 2 | Struggling with Maladaptive Daydreaming: How D... | 1 | 0 | 2025-09-08 23:27:27 | Hi everyone,\nI’m reaching out because I’m str... |
| 3 | I hate my life | 2 | 2 | 2025-09-08 23:26:05 | I'm 30m. And I hate my life. I hate my job and... |
| 4 | Guys it's slowly but surely becoming unbearable | 1 | 1 | 2025-09-08 23:22:24 | I can't. I just can't. \n\nAt day, at work, wi... |


---

#### Turning Our DataFrame Into A CSV

In [25]:
df.to_csv("mental_health_posts.csv", index = False)
df.shape
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   title         500 non-null    object        
 1   score         500 non-null    int64         
 2   num_comments  500 non-null    int64         
 3   created       500 non-null    datetime64[ns]
 4   selftext      500 non-null    object        
dtypes: datetime64[ns](1), int64(2), object(2)
memory usage: 19.7+ KB


**Note:** Well, I got my data. Now I need to figure out how to make it usable. I am mostly intimidated by the task of turning this data into insights that are useful to other people.