# Setting up PRAW and collecting data from Reddit

Note about the Skincare Addiction and K-Beauty data: I collected this data on April 10-13th. I spaced the collection over several days to stay within the number of requests allowed through PRAW (Python Reddit API Wrapper).

In [None]:
### Datasets ###
Background on data: The following data was used in my analysis and was obtained from [Gapminder.com](https://www.gapminder.org/), a non-profit foundation based in Stockholm, Sweden, that aims to promote sustainable global development and enhance understanding of important global trends through the use of reliable data. With guidance from General Assembly, I was able to find this source.

Gross National Income (GNI) per capita in current US dollars - Variable: Gni_per_cap_atlas_method_con2021.csv - Includes data from 1800 to 2050

Population by Country - Variable: Population.csv - Includes data from 1800 to 2100

Broadband subscribers per 100 people - Variable: broadband_subscribers_per_100_people - Fixed broadband subscriptions refers to fixed subscriptions to high-speed access to the public internet (a TCP/IP connection) at downstream speeds equal to or greater than 256 kbit/s. Includes data from 1998 to 2022, as 1998 was the year broadband was created. Note: I chose to look at broadband subscribers per 100 people rather than all broadband subscribers because the figures are more manageable and allow a more standardized comparison.


In [None]:
### Data Dictionary ###
|Feature|Type|Dataset|Description|
|---|---|---|---|
|**created_utc**|*float*|Reddit|Time the post was created, as a Unix timestamp in UTC|
|**title**|*object*|Reddit|Post title|
|**self_text**|*object*|Reddit|Body text of the post|
|**subreddit**|*object*|Reddit|Subreddit the post came from (koreanskincare or Skincare_Addiction)|
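
Since `created_utc` is stored as a Unix timestamp, it is not directly readable as a date. The sketch below shows one way to convert it with pandas. The two timestamps are copied from the sample output in this notebook, and the `created_at` column name is my own assumption, not part of the collected data.

```python
import pandas as pd

# Stand-in frame; the timestamps come from the sample output shown in this notebook
sample = pd.DataFrame({'created_utc': [1713070000.0, 1710656000.0]})

# Interpret the floats as seconds since the Unix epoch and mark them as UTC
sample['created_at'] = pd.to_datetime(sample['created_utc'], unit='s', utc=True)
print(sample)
```

The same call should work on `skincaread` and `kskincare`, since both use the `created_utc` column.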


In [1]:
import praw
import pandas as pd

In [2]:
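import os

# Read credentials from environment variables instead of hard-coding them in the notebook.
# The variable names below are placeholders; use whatever names you export.
reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],
    user_agent='Subfind',
    username=os.environ['REDDIT_USERNAME'],
    password=os.environ['REDDIT_PASSWORD']
)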

Using the Python Reddit API Wrapper (PRAW) to collect data.

### Collecting data

In [3]:
# Choose your subreddit
subreddit = reddit.subreddit('Skincare_Addiction')

# Adjust the limit as needed -- Note that this will grab up to the 950 most recent posts
posts = subreddit.new(limit=950)

In [4]:
data = []
for post in posts:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit])

# Turn into a dataframe
skincaread = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit'])
skincaread.head()

Unnamed: 0,created_utc,title,self_text,subreddit
0,1713070000.0,combining two body lotions,I use Cerave body lotion and a retinol body lo...,Skincare_Addiction
1,1713069000.0,Is SPF 30 moisturizer enough to use with Retinol?,Is SPF 30 moisturizer enough to use with Retin...,Skincare_Addiction
2,1713060000.0,Retinol Eye Cream,\nSo I accidentally bought an eye cream that c...,Skincare_Addiction
3,1713059000.0,Question about Niacinamide cleanser + serum,Can I use CeraVe's foaming cleanser (which con...,Skincare_Addiction
4,1713058000.0,Tretinoin combined with exfoliants?,Hi! 🩷 please be kind .. I’m sorry if this is a...,Skincare_Addiction


In [5]:
subreddit = reddit.subreddit('koreanskincare')

posts = subreddit.new(limit=950)

In [6]:
data = []
for post in posts:
    data.append([post.created_utc, post.title, post.selftext, post.subreddit])

# Turn into a dataframe
kskincare = pd.DataFrame(data, columns = ['created_utc', 'title', 'self_text', 'subreddit'])
kskincare.head()

Unnamed: 0,created_utc,title,self_text,subreddit
0,1713098000.0,Breakouts after koren skincare products:help m...,"Hey everyone,\nI’m using this new skincare pro...",koreanskincare
1,1713094000.0,skincare product recs ?,"i have dry, sensitive, congested skin\ni have ...",koreanskincare
2,1713074000.0,Difference between Skin1004 Centella Ampoule a...,Hello guys!\n\nI wanted to ask if any of you k...,koreanskincare
3,1713063000.0,Adding new product in the routine!,"Pic 1 : wow, the packing looks really good. I ...",koreanskincare
4,1713045000.0,Idk where to begin please help!,I am trying to make the switch to kskincare an...,koreanskincare


In [7]:
skincaread.shape

(950, 4)

In [8]:
kskincare.shape

(950, 4)

Checking the shape to confirm that the data was collected properly. I requested 950 posts per subreddit because requesting 1000 did not leave me with the number of posts I wanted after cleaning.
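
The two collection cells above repeat the same steps for each subreddit, so they could be folded into one helper. This is only a sketch: `posts_to_frame` and `COLUMNS` are names I made up, and the fake posts stand in for real PRAW submissions so the cell runs without credentials. Unlike the cells above, it stores the subreddit as a plain string via `str()` rather than as a PRAW object.

```python
from types import SimpleNamespace

import pandas as pd

COLUMNS = ['created_utc', 'title', 'self_text', 'subreddit']

def posts_to_frame(posts):
    """Build a dataframe from any iterable of post-like objects."""
    rows = [[p.created_utc, p.title, p.selftext, str(p.subreddit)] for p in posts]
    return pd.DataFrame(rows, columns=COLUMNS)

# Stand-in posts (not real data) with the same attributes the notebook reads
fake_posts = [
    SimpleNamespace(created_utc=1713070000.0, title='first', selftext='body',
                    subreddit='Skincare_Addiction'),
    SimpleNamespace(created_utc=1713069000.0, title='second', selftext='',
                    subreddit='Skincare_Addiction'),
]

demo = posts_to_frame(fake_posts)
print(demo.shape)
```

With a live connection, the call would look like `posts_to_frame(reddit.subreddit('koreanskincare').new(limit=950))`.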

In [9]:
skincaread = skincaread.drop_duplicates()
skincaread

Unnamed: 0,created_utc,title,self_text,subreddit
0,1.713070e+09,combining two body lotions,I use Cerave body lotion and a retinol body lo...,Skincare_Addiction
1,1.713069e+09,Is SPF 30 moisturizer enough to use with Retinol?,Is SPF 30 moisturizer enough to use with Retin...,Skincare_Addiction
2,1.713060e+09,Retinol Eye Cream,\nSo I accidentally bought an eye cream that c...,Skincare_Addiction
3,1.713059e+09,Question about Niacinamide cleanser + serum,Can I use CeraVe's foaming cleanser (which con...,Skincare_Addiction
4,1.713058e+09,Tretinoin combined with exfoliants?,Hi! 🩷 please be kind .. I’m sorry if this is a...,Skincare_Addiction
...,...,...,...,...
945,1.710672e+09,Need Help with Routine/ Products,20 F- I’ve struggled with acne off and on for ...,Skincare_Addiction
946,1.710671e+09,Help with my routineee,Hi! I really need some help to create a routin...,Skincare_Addiction
947,1.710659e+09,Very itchy and dry skin during/ after shower,Very dry itchy skin during/after shower\n\nMy ...,Skincare_Addiction
948,1.710656e+09,"Need help wity Hyperpigmentation, Acne, Geneti...",1st 3 pics are my face and 4th pic is the skin...,Skincare_Addiction


In [10]:
kskincare = kskincare.drop_duplicates()
kskincare

Unnamed: 0,created_utc,title,self_text,subreddit
0,1.713098e+09,Breakouts after koren skincare products:help m...,"Hey everyone,\nI’m using this new skincare pro...",koreanskincare
1,1.713094e+09,skincare product recs ?,"i have dry, sensitive, congested skin\ni have ...",koreanskincare
2,1.713074e+09,Difference between Skin1004 Centella Ampoule a...,Hello guys!\n\nI wanted to ask if any of you k...,koreanskincare
3,1.713063e+09,Adding new product in the routine!,"Pic 1 : wow, the packing looks really good. I ...",koreanskincare
4,1.713045e+09,Idk where to begin please help!,I am trying to make the switch to kskincare an...,koreanskincare
...,...,...,...,...
945,1.702017e+09,Anua peach 70% niacinamide serum give anyone e...,I can’t tell but it’s the newest product I am ...,koreanskincare
946,1.701946e+09,Recommendations?,Hi everyone! I’m new to skin care and have no ...,koreanskincare
947,1.701893e+09,Hi everyone! Please help! What skincare produc...,,koreanskincare
948,1.701763e+09,BOJ cleansing oil,"I recently bought this cleanser, its currently...",koreanskincare


Dropping duplicate rows from each dataframe.

I fetched additional data and created a dataframe for each of the New, Top, and Rising listings. I did this to collect more data, and more relevant data, for the analysis.

In [11]:
subreddit = reddit.subreddit('koreanskincare')

new_posts = subreddit.new(limit=450)

top_posts = subreddit.top(limit=450)

rising_posts = subreddit.rising(limit=450)

# Create a list of posts for each category
new_data = []
top_data = []
rising_data = []

for post in new_posts:
    if post.selftext:
        new_data.append([post.created_utc, post.title, post.selftext, post.subreddit])

for post in top_posts:
    if post.selftext:
        top_data.append([post.created_utc, post.title, post.selftext, post.subreddit])

for post in rising_posts:
    if post.selftext:
        rising_data.append([post.created_utc, post.title, post.selftext, post.subreddit])

# Create dataframes for each category
kskincare_new = pd.DataFrame(new_data, columns=['created_utc', 'title', 'self_text', 'subreddit'])
kskincare_top = pd.DataFrame(top_data, columns=['created_utc', 'title', 'self_text', 'subreddit'])
kskincare_rising = pd.DataFrame(rising_data, columns=['created_utc', 'title', 'self_text', 'subreddit'])

Fetching data from the Korean Skincare subreddit.

In [12]:
# Fetch subreddit
subreddit = reddit.subreddit('Skincare_Addiction')

# Fetch new posts
new_posts = subreddit.new(limit=450)

# Fetch top posts
top_posts = subreddit.top(limit=450)

# Fetch rising posts
rising_posts = subreddit.rising(limit=450)

# Create a list of posts for each category
new_data = []
top_data = []
rising_data = []

for post in new_posts:
    if post.selftext:
        new_data.append([post.created_utc, post.title, post.selftext, post.subreddit])

for post in top_posts:
    if post.selftext:
        top_data.append([post.created_utc, post.title, post.selftext, post.subreddit])

for post in rising_posts:
    if post.selftext:
        rising_data.append([post.created_utc, post.title, post.selftext, post.subreddit])

# Create dataframes for each category
skincareadd_new = pd.DataFrame(new_data, columns=['created_utc', 'title', 'self_text', 'subreddit'])
skincareadd_top = pd.DataFrame(top_data, columns=['created_utc', 'title', 'self_text', 'subreddit'])
skincareadd_rising = pd.DataFrame(rising_data, columns=['created_utc', 'title', 'self_text', 'subreddit'])

Fetching data from the Skincare Addiction subreddit.

In [None]:
skincareadd_new.shape

skincareadd_top.shape

skincareadd_rising.shape

kskincare_new.shape

kskincare_top.shape

kskincare_rising.shape

kskincare.shape

Checking the number of rows in each category to see how many posts I was able to obtain from each listing.
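
Instead of calling `.shape` on each dataframe in turn, the row counts can be gathered into one table. This is a minimal sketch with made-up stand-in frames; in the notebook the dictionary would hold `skincareadd_new`, `kskincare_top`, and so on. The names `frames` and `row_counts` are my own.

```python
import pandas as pd

# Stand-in frames (not real data) representing the per-listing dataframes
frames = {
    'new': pd.DataFrame({'title': ['a', 'b', 'c']}),
    'top': pd.DataFrame({'title': ['d', 'e']}),
    'rising': pd.DataFrame({'title': ['f']}),
}

# One row count per listing, collected into a single Series for easy comparison
row_counts = pd.Series({name: len(df) for name, df in frames.items()}, name='rows')
print(row_counts)
```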

### Merging dataframes in each category together.

In [20]:
kskincare = pd.concat([kskincare, kskincare_new, kskincare_top, kskincare_rising], ignore_index=True)

In [21]:
kskincare.shape

(1780, 4)

In [22]:
skincaread = pd.concat([skincaread, skincareadd_new, skincareadd_top, skincareadd_rising], ignore_index=True)

In [23]:
skincaread.shape

(1492, 4)
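
Because the New listing was fetched both in the initial 950-post pull and again for the category dataframes, the merged dataframes may contain the same post more than once. The counts above are taken before any such check. The sketch below uses made-up rows to show one way to drop repeats after `pd.concat`. Matching on `created_utc`, `title`, and `self_text` is my assumption about what identifies a post.

```python
import pandas as pd

cols = ['created_utc', 'title', 'self_text', 'subreddit']

# Made-up rows: the second frame repeats one post from the first
base = pd.DataFrame([[1.0, 'a', 'x', 'koreanskincare'],
                     [2.0, 'b', 'y', 'koreanskincare']], columns=cols)
newer = pd.DataFrame([[2.0, 'b', 'y', 'koreanskincare'],
                      [3.0, 'c', 'z', 'koreanskincare']], columns=cols)

merged = pd.concat([base, newer], ignore_index=True)

# Keep the first occurrence of each post and renumber the index
deduped = merged.drop_duplicates(subset=['created_utc', 'title', 'self_text'],
                                 ignore_index=True)
print(merged.shape, deduped.shape)
```

This step could equally be left to the cleaning notebook. Either way, comparing the shape before and after shows how much the listings overlap.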

### Saving files.

In [26]:
kskincare.to_csv('data/koreanskincarereddit.csv', index = False)

In [27]:
skincaread.to_csv('data/skincareaddiction.csv', index = False)
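
The `to_csv` calls above assume the `data/` folder already exists, since pandas does not create missing folders. The sketch below creates the folder first and then reloads the file as a quick check. It uses a temporary directory and a made-up two-row frame so it can run anywhere; `demo.csv` is a hypothetical file name.

```python
import os
import tempfile

import pandas as pd

# Made-up frame; the second post has an empty body, as some collected posts do
demo = pd.DataFrame({'title': ['a', 'b'], 'self_text': ['x', '']})

# A temporary folder stands in for the notebook's data/ directory
out_dir = os.path.join(tempfile.mkdtemp(), 'data')
os.makedirs(out_dir, exist_ok=True)  # create the folder if it is missing
path = os.path.join(out_dir, 'demo.csv')

demo.to_csv(path, index=False)

# Reload to confirm the file was written with the expected shape
reloaded = pd.read_csv(path)
print(reloaded.shape)
print(reloaded['self_text'].isna().sum())
```

Note that empty post bodies come back as missing values (`NaN`) after the round trip, which matters for the cleaning step.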

Summary of data collection process:
Using PRAW, I collected 1492 posts from the Skincare Addiction subreddit and 1780 posts from the Korean Skincare subreddit. After the initial pull of 950 posts from each subreddit, I requested up to 450 more posts from each of the Top, New, and Rising listings, keeping only posts that had body text. In terms of Reddit categorization, "Top" means the posts that have received the most upvotes in the subreddit, "New" means the most recent posts, and "Rising" means posts that are gaining activity at the current moment. Finally, I saved the Korean Skincare dataframe ("kskincare") and the Skincare Addiction dataframe ("skincaread") as CSV files in the data folder to prepare for cleaning.