## Mental Health Discussion Analyzer Web Scraper
#### Author
**Name:** Andres Figueroa  
**Email:** andresfigueroa@brandeis.edu

#### Project Description
The purpose of this file is to collect data by web scraping. I am doing this in an .ipynb file because this is my first time web scraping, and I like how the markdown makes my notes look nice.

---

#### Importing Libraries

In [7]:
import praw
import pandas as pd
from dotenv import load_dotenv
import os

The `praw` library is the Python Reddit API Wrapper (PRAW). PRAW is a Python client for the Reddit API, and it will allow me to access the data I want from Reddit.

**Note:** I could use `requests` and `BeautifulSoup` for raw HTML scraping, but the API returns structured data, so I expect it to be easier to use.

---

### Understanding the Target Data (What We Are Looking For)

I was thinking about scraping data from the following Reddit communities:
- r/mentalhealth
- r/depression
- r/anxiety

**Note:** After viewing the number of followers and reading some of the posts, I have decided to scrape data only from `r/mentalhealth`, as its posts cover a variety of disorders and situations.

As for the data, I will collect:
- Post Body Text
- Timestamp
- Number of Upvotes
- Number of Comments

**Note:** From this data I am hoping to better understand what people are talking about and whether conversations or posts are positive, negative, or neutral.
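The four fields listed above map onto attributes that PRAW exposes on each submission (`selftext`, `created_utc`, `score`, `num_comments`). Below is a minimal sketch of that mapping. The helper name `post_to_record` and the stand-in object `fake_post` are my own assumptions for illustration, so the snippet can run without touching Reddit.

```python
from types import SimpleNamespace


def post_to_record(post):
    """Map one submission-like object to a flat dict of the target fields."""
    return {
        "selftext": post.selftext,          # Post Body Text
        "created": post.created_utc,        # Timestamp (Unix epoch seconds, UTC)
        "score": post.score,                # Number of Upvotes
        "num_comments": post.num_comments,  # Number of Comments
    }


# Hypothetical stand-in for a PRAW submission; the values are made up.
fake_post = SimpleNamespace(
    selftext="example body",
    created_utc=1700000000.0,
    score=3,
    num_comments=1,
)

record = post_to_record(fake_post)
print(record)
```

Keeping the mapping in one small function makes it easy to add or drop a field later without rewriting the collection loop.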

`r/mentalhealth` URL: https://www.reddit.com/r/mentalhealth/

---

#### Authenticating

In [8]:
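# Load the variables defined in the local .env file into the environment.
load_dotenv()

# Sanity check that the .env file was found (prints the client ID only).
print("Client ID from env:", os.getenv("REDDIT_CLIENT_ID"))

# Create an authenticated Reddit instance from the stored credentials.
reddit = praw.Reddit(
    client_id=os.getenv("REDDIT_CLIENT_ID"),
    client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
    user_agent="MentalHealthAnalyzer by /u/Friendly-Sir-8457",
    username=os.getenv("REDDIT_USERNAME"),
    password=os.getenv("REDDIT_PASSWORD"),
)

# Confirm that the login worked by asking Reddit who we are.
print("Logged in as:", reddit.user.me())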

Client ID from env: TlXsBUQsFFuj9U9zXa4HEg
Logged in as: Friendly-Sir-8457


Here, I am loading environment variables with `load_dotenv()` and passing them to the Reddit API through PRAW. Then I am testing whether the credentials (login info) worked by printing `reddit.user.me()`, which returns the logged-in account. I am essentially doing a really annoying login process: authenticating.

**Note:** APIs are frustrating.
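One way to make the login step less frustrating is to check that every credential is actually set before calling PRAW, without printing any of the values. This is only a sketch: the helper `missing_env_vars` is a name I made up, and it assumes the four variable names used in the authentication cell.

```python
import os

# Variable names taken from the authentication cell.
REQUIRED_VARS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
)


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty, never their values."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All Reddit credentials are set.")
```

Running this right after `load_dotenv()` should show whether a failed login comes from a missing `.env` entry or from the credentials themselves.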

In [10]:
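# Point PRAW at the target community.
subreddit = reddit.subreddit("mentalhealth")

# Collect the fields of interest from the 10 current "hot" posts.
posts = []
for post in subreddit.hot(limit=10):
    posts.append({
        "title": post.title,
        "score": post.score,                # number of upvotes
        "num_comments": post.num_comments,
        "created": post.created_utc,        # Unix timestamp (UTC)
        "selftext": post.selftext           # post body text
    })

# One row per post, one column per field.
df = pd.DataFrame(posts)

print("Collected", len(df), "posts")
df.head()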

Collected 10 posts


|   | title | score | num_comments | created | selftext |
|---|---|---|---|---|---|
| 0 | Wellness Wednesday | 2 | 1 | 1756876000.0 | >*“Sometimes the bravest and most important th... |
| 1 | r/MentalHealth is looking for moderators | 21 | 27 | 1720874000.0 | Hey r/mentalhealth! We're looking to grow our ... |
| 2 | All those advices for depression are trash | 74 | 38 | 1757316000.0 | I hate hearing all the usual that people say t... |
| 3 | does depression go away without medication? | 7 | 6 | 1757343000.0 | things has been so hard right now and im not r... |
| 4 | Some people are unbelievable | 6 | 6 | 1757345000.0 | I once made a post about r@pe. Im a csa victim... |
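
The `created` column holds Unix epoch seconds (it comes from `created_utc`), which is hard to read as a date. Here is a minimal sketch of converting one value into a timezone-aware datetime with the standard library. The helper name `epoch_to_utc` is my own, and I am assuming the displayed values are rounded, so the result is only approximate.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


# First value from the `created` column shown above.
wellness_post_time = epoch_to_utc(1756876000.0)
print(wellness_post_time.isoformat())  # 2025-09-03T05:06:40+00:00
```

For the whole column at once, `pd.to_datetime(df["created"], unit="s", utc=True)` should give the same kind of result directly in pandas.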
