<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3 - Web APIs & NLP on Internet and Alcohol Addiction Subreddits <br> [Part 1 of 2]

_Prepared by: Timothy Chan, 17 Mar 2023_

## Contents:
- [Background](#Background)
- [Data Collection](#Data-Collection)

---
## Background
---

### Problem Statement

A non-profit organization aims to develop a chatbot to provide users with advice and resources on internet and alcohol addiction. To facilitate this, they require a model that can differentiate between the two addictions based on users' questions.

Around [8% of the global population is addicted to the internet](https://techjury.net/blog/technology-addiction-statistics/), which can lead to negative consequences such as social isolation and health problems. Meanwhile, alcohol addiction has limited treatment options and can have a significant impact on individuals, families, and communities. [Over 70% of individuals with alcohol abuse will relapse at some point](https://www.therecoveryvillage.com/alcohol-abuse/alcohol-relapse-statistics/), and stigma may prevent them from seeking help. 

The organization plans to provide personalized advice and resources to users based on the addiction category detected from their question.

**Approach**

This is a natural language processing (NLP) binary classification problem. 

In **Part 1**, we collect posts from `r/nosurf` and `r/stopdrinking` as proxies for internet and alcohol addiction topics, using the `Pushshift API`.

For data cleaning, processing, EDA, modelling, and recommendations, please refer to **Part 2**.

---
## Data Collection
---

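Using the `Pushshift API`, we collect posts from `r/nosurf` and `r/stopdrinking` as proxies for internet and alcohol addiction. For comparison, 1,000 posts created before 2023-03-07 00:00:00 are scraped from each subreddit. Note that the code below parses this cutoff as a naive datetime, so it is interpreted in the local time zone of the machine running it rather than in UTC; the newest posts returned are timestamped just before 2023-03-06 16:00 UTC, which is consistent with midnight Singapore time.

The posts were scraped on 8 Mar 2023 at 10:23 pm Singapore time.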

In [5]:
import pandas as pd
import requests
import datetime
import random
import time

In [7]:
# Create function to extract Subreddit posts according to parameters
def reddit_extract(subreddit, size, date_time):
    url = 'https://api.pushshift.io/reddit/search/submission'
    
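    # Convert the date string to a datetime object
    # Note: strptime returns a naive datetime (no time zone attached)
    before_datetime = datetime.datetime.strptime(date_time, '%Y-%m-%d %H:%M:%S')
    
    # Convert the datetime object to a Unix timestamp
    # Note: .timestamp() on a naive datetime interprets it in the machine's local time zone, not UTC
    before_timestamp = int(before_datetime.timestamp())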
    
    # Set the API parameters
    params = {
        'subreddit': subreddit, # subreddit name
        'size': size, # number of posts
        'before': before_timestamp # only return posts created before this timestamp
    }
    
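    # Make the API request and retrieve the data
    res = requests.get(url, params=params, timeout=30)
    res.raise_for_status() # stop on an HTTP error instead of failing later in .json()
    data = res.json()['data']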
    
    # Extract the desired columns and return the data as a DataFrame
    df = pd.DataFrame(data)[['subreddit', 'title', 'selftext', 'utc_datetime_str', 'created_utc']]
    
    # Print number of rows
    print(f"Subreddit: {subreddit} | Number of rows: {len(df)}")
    
    # Pause randomly for 1 to 3 seconds so that back-to-back calls do not hit the API too quickly
    time.sleep(random.randint(1, 3))
    
    return df
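
The `reddit_extract` function above calls `.timestamp()` on a naive datetime, which Python interprets in the local time zone. If a UTC cutoff is intended, one option is to attach UTC explicitly before converting. The following is a minimal sketch of that idea; the helper name `to_utc_timestamp` is hypothetical and not part of the original notebook.

```python
from datetime import datetime, timezone

def to_utc_timestamp(date_time):
    """Parse 'YYYY-MM-DD HH:MM:SS' as a UTC time and return its Unix timestamp."""
    # strptime returns a naive datetime, so attach UTC explicitly
    naive = datetime.strptime(date_time, '%Y-%m-%d %H:%M:%S')
    return int(naive.replace(tzinfo=timezone.utc).timestamp())

cutoff = to_utc_timestamp('2023-03-07 00:00:00')
print(cutoff)  # 1678147200
```

With this approach the result no longer depends on the time zone of the machine running the notebook. Note that using a UTC cutoff would return a different set of posts from the one shown in the outputs here.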

In [8]:
# Extract subreddit
df_nosurf = reddit_extract('nosurf', 1000, '2023-03-07 00:00:00')

Subreddit: nosurf | Number of rows: 1000


In [9]:
# Check first 5 rows
df_nosurf.head()

| | subreddit | title | selftext | utc_datetime_str | created_utc |
|---|---|---|---|---|---|
| 0 | nosurf | How to stop using the Internet as a coping mec... | Basically, I have a bad habit of using the Int... | 2023-03-06 15:58:17 | 1678118297 |
| 1 | nosurf | The more time I spend away from YouTube and ot... | Content blockers and other restrictions have h... | 2023-03-06 13:31:16 | 1678109476 |
| 2 | nosurf | Save your memories in a safe space that won't ... | We hate social media, so in the last 4 years w... | 2023-03-06 11:49:33 | 1678103373 |
| 3 | nosurf | How Do I delete the internet from my Iphone? | | 2023-03-06 08:38:32 | 1678091912 |
| 4 | nosurf | 2020 vs 2023 me | Damn how much I've improved is astounding. My ... | 2023-03-06 02:13:58 | 1678068838 |


In [13]:
# Check last 5 rows
df_nosurf.tail()

| | subreddit | title | selftext | utc_datetime_str | created_utc |
|---|---|---|---|---|---|
| 995 | nosurf | I’m addicted to reddit because it’s the only p... | TLDR: I’m living a lie, and it’s preventing me... | 2022-12-28 23:38:16 | 1672270696 |
| 996 | nosurf | In couple days ill be offline for 1 year no ho... | so i realised how much my life is taken over b... | 2022-12-28 23:33:43 | 1672270423 |
| 997 | nosurf | I live in Japan and I want to leave social med... | I live in Japan and all my family and friends ... | 2022-12-28 23:18:19 | 1672269499 |
| 998 | nosurf | What are your routines going to look like goin... | I feel like this will be fun! Anyone who’d lik... | 2022-12-28 23:06:16 | 1672268776 |
| 999 | nosurf | internet addiction | hi, i'm a 15 year old girl who is literally a... | 2022-12-28 22:07:47 | 1672265267 |


In [10]:
# Extract subreddit
df_stopdrinking = reddit_extract('stopdrinking', 1000, '2023-03-07 00:00:00')

Subreddit: stopdrinking | Number of rows: 1000


In [11]:
# Check first 5 rows
df_stopdrinking.head()

| | subreddit | title | selftext | utc_datetime_str | created_utc |
|---|---|---|---|---|---|
| 0 | stopdrinking | $650 in booze since New Year. Yesterday was my... | [removed] | 2023-03-06 15:41:40 | 1678117300 |
| 1 | stopdrinking | Day 1, again and I don’t feel hopeful. | Daily beer drinker here, for over a year. Last... | 2023-03-06 15:39:40 | 1678117180 |
| 2 | stopdrinking | 33 days, one day closer to my new high score | Hello, checking in again. I'm trying to do tha... | 2023-03-06 15:38:38 | 1678117118 |
| 3 | stopdrinking | Dumped a bottle of whiskey I found stashed | It’s been almost 40 days since my last relapse... | 2023-03-06 15:37:04 | 1678117024 |
| 4 | stopdrinking | 1 year today and it has been life changing! | It was a year ago when I said “enough is enou... | 2023-03-06 15:32:39 | 1678116759 |


In [14]:
# Check last 5 rows
df_stopdrinking.tail()

| | subreddit | title | selftext | utc_datetime_str | created_utc |
|---|---|---|---|---|---|
| 995 | stopdrinking | I'm not having a good day | Nothing serious, just several fuck-ups on my p... | 2023-03-01 14:51:31 | 1677682291 |
| 996 | stopdrinking | two months | I did it. \n\nI drank for more years, daily, t... | 2023-03-01 14:51:01 | 1677682261 |
| 997 | stopdrinking | Today is 600, and life is pretty good | If I can get here, you can too. I didn't get h... | 2023-03-01 14:48:35 | 1677682115 |
| 998 | stopdrinking | Triple. Digits!!! | Nuff said :) | 2023-03-01 14:48:27 | 1677682107 |
| 999 | stopdrinking | Is it safe to stop cold turkey after nearly 2 ... | [removed] | 2023-03-01 14:41:35 | 1677681695 |


Note that the posts for stopdrinking span 1 Mar 2023 to 6 Mar 2023, while the posts for nosurf span 28 Dec 2022 to 6 Mar 2023.
Since the content of habit- and addiction-related posts should not vary much with posting date, we will not be too concerned about the difference in date ranges and will keep an equal number of posts from each subreddit for comparison as far as possible.
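
The date ranges noted above can be checked directly from the `created_utc` values. The sketch below uses only the first and last timestamps shown in the `head()` and `tail()` outputs, which are the newest and oldest posts because the results are sorted from newest to oldest. The helper name `date_span` is hypothetical.

```python
from datetime import datetime, timezone

def date_span(timestamps):
    """Return (earliest, latest, span_in_days) for a list of Unix timestamps."""
    earliest = datetime.fromtimestamp(min(timestamps), tz=timezone.utc)
    latest = datetime.fromtimestamp(max(timestamps), tz=timezone.utc)
    span_days = (latest - earliest).total_seconds() / 86400
    return earliest, latest, span_days

# Newest and oldest created_utc values taken from the outputs above
spans = {
    'nosurf': date_span([1678118297, 1672265267]),
    'stopdrinking': date_span([1678117300, 1677681695]),
}

for name, (earliest, latest, days) in spans.items():
    print(f"{name}: {earliest:%Y-%m-%d} to {latest:%Y-%m-%d} ({days:.1f} days)")
```

On these values, the nosurf sample covers roughly 68 days and the stopdrinking sample roughly 5 days, which reflects how much more frequently posts appear in `r/stopdrinking`.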

In [12]:
# Export to csv
df_nosurf.to_csv("data/data_nosurf.csv", index=False, encoding='utf-8-sig')
df_stopdrinking.to_csv("data/data_stopdrinking.csv", index=False, encoding='utf-8-sig')
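
The export above assumes that a `data/` folder already exists next to the notebook; `to_csv` may raise an error if it does not. A minimal sketch of creating the folder first is shown below. The helper name `ensure_parent_dir` is hypothetical, and the example writes to a temporary directory so that it has no side effects.

```python
import tempfile
from pathlib import Path

def ensure_parent_dir(file_path):
    """Create the parent folder of file_path if it is missing and return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# Demonstrate inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    out_path = ensure_parent_dir(Path(tmp) / "data" / "data_nosurf.csv")
    print(out_path.parent.is_dir())  # True
```

In the notebook, the returned path could then be passed to `to_csv`, for example `df_nosurf.to_csv(ensure_parent_dir("data/data_nosurf.csv"), index=False, encoding='utf-8-sig')`.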